Libraries

How to use dystrio/Qwen2.5-7B-Instruct-sculpt-default with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="dystrio/Qwen2.5-7B-Instruct-sculpt-default")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dystrio/Qwen2.5-7B-Instruct-sculpt-default")
model = AutoModelForCausalLM.from_pretrained("dystrio/Qwen2.5-7B-Instruct-sculpt-default")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize it in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate a reply and decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use dystrio/Qwen2.5-7B-Instruct-sculpt-default with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "dystrio/Qwen2.5-7B-Instruct-sculpt-default"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dystrio/Qwen2.5-7B-Instruct-sculpt-default",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/dystrio/Qwen2.5-7B-Instruct-sculpt-default

SGLang

How to use dystrio/Qwen2.5-7B-Instruct-sculpt-default with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dystrio/Qwen2.5-7B-Instruct-sculpt-default" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dystrio/Qwen2.5-7B-Instruct-sculpt-default",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "dystrio/Qwen2.5-7B-Instruct-sculpt-default" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dystrio/Qwen2.5-7B-Instruct-sculpt-default",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use dystrio/Qwen2.5-7B-Instruct-sculpt-default with Docker Model Runner:
```
docker model run hf.co/dystrio/Qwen2.5-7B-Instruct-sculpt-default
```

dystrio/Qwen2.5-7B-Instruct-sculpt-default

9% smaller, WikiText-103 perplexity slightly lower than baseline (0.9896x PPL), drop-in replacement. No custom kernels. No runtime changes.

Dystrio Sculpt structurally compresses transformer models, producing dense models that load with standard transformers — no custom code, no new ops, no deployment friction.

This is the Default tier of Qwen 2.5 7B Instruct.

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("dystrio/Qwen2.5-7B-Instruct-sculpt-default", torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("dystrio/Qwen2.5-7B-Instruct-sculpt-default")

inputs = tokenizer("The future of AI inference is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Benchmark Results

All tiers were compiled from Qwen 2.5 7B Instruct and benchmarked on an A100 80GB in bf16:

Model	PPL	PPL Ratio	Weights (GB)	Chat Prefill TPS	RAG TTFT p95 (ms)	Decode TPS
Baseline	12.4633	1.0	14.185191	11510.6	117.869	71.1
sculpt-default	12.334	0.9896	12.964976	12352.7	110.714	72.7
sculpt-production	21.9239	1.7591	10.596324	14700.3	95.291	73.5
sculpt-throughput	23.2366	1.8644	9.950328	15386.6	91.914	73.3

Key Metrics (this model)

Metric	Value
Weights memory	12.964976 GB (9% smaller)
PPL ratio	0.9896
Chat prefill TPS	12352.7 (+7%)
RAG TTFT p95	110.714 ms (-6%)
Decode TPS	72.7 (+2%, roughly flat)
Parameters	6.96B
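
The percentages above can be re-derived from the benchmark table. The following is a minimal sketch, not part of the original card: the dictionary keys and helper names (`ratio`, `pct_change`) are illustrative, and the numbers are copied from the Baseline and sculpt-default rows.

```python
# Figures copied from the Baseline and sculpt-default rows of the benchmark table.
baseline = {
    "ppl": 12.4633,
    "weights": 14.185191,
    "prefill_tps": 11510.6,
    "ttft_p95_ms": 117.869,
    "decode_tps": 71.1,
}
sculpt_default = {
    "ppl": 12.334,
    "weights": 12.964976,
    "prefill_tps": 12352.7,
    "ttft_p95_ms": 110.714,
    "decode_tps": 72.7,
}


def ratio(metric: str) -> float:
    """Value for this model divided by the baseline value."""
    return sculpt_default[metric] / baseline[metric]


def pct_change(metric: str) -> float:
    """Relative change versus baseline, in percent."""
    return (ratio(metric) - 1.0) * 100.0


for metric in baseline:
    print(f"{metric:12s} ratio={ratio(metric):.4f} change={pct_change(metric):+.1f}%")
```

Rounded to whole percentages, this gives the -9% weights, +7% prefill, and -6% TTFT figures quoted in the table, and a decode change of about +2%.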

All Sculpt Tiers

Tier	HuggingFace	Size	PPL Ratio	Use Case
default	dystrio/Qwen2.5-7B-Instruct-sculpt-default 👈 this model	12.964976 GB	0.9896	Zero-regret: quality preserved, smaller footprint
production	dystrio/Qwen2.5-7B-Instruct-sculpt-production	10.596324 GB	1.7591	Practical savings with modest quality tradeoff
throughput	dystrio/Qwen2.5-7B-Instruct-sculpt-throughput	9.950328 GB	1.8644	Maximum usable compression for speed/edge

What is Dystrio Sculpt?

Dystrio Sculpt compiles transformer models into smaller, faster variants. Output models:

Are dense (not sparse) — standard architecture, fewer parameters
Load with standard HuggingFace Transformers — no custom code needed
Require no custom kernels and no runtime changes
Work as a one-step compile before deployment
Stack with quantization (AWQ, GPTQ, GGUF) for compound savings

Compatibility

✅ HuggingFace Transformers
✅ vLLM
✅ TGI (Text Generation Inference)
✅ llama.cpp / GGUF conversion
✅ AWQ / GPTQ quantization
✅ Any framework that loads standard safetensors

Benchmark Environment

GPU: NVIDIA A100-SXM4-80GB
dtype: bf16
Torch: 2.10.0+cu128
Transformers: 5.3.0
Deterministic: True
Single-GPU, standard HuggingFace Transformers, no custom kernels.

Metric Definitions

PPL ratio: This model's WikiText-103 perplexity divided by the baseline's. Values below 1.0 mean lower (better) perplexity than the baseline.
Prefill TPS: Tokens per second during prompt encoding (higher = faster).
TTFT p95: Time to first token at 95th percentile (lower = faster).
Decode TPS: Tokens per second during generation (higher = faster).
Weights (GB): Model parameter memory (deterministic, runtime-independent).
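
As a rough cross-check of the weights figure, parameter memory in bf16 is simply the parameter count times two bytes. The sketch below is an illustration rather than the card's own method. It assumes the reported sizes are binary gigabytes (2^30 bytes); that is an inference from the numbers lining up, not something the card states. The helper names are hypothetical.

```python
GIB = 2**30  # assumption: the reported "GB" values are binary gigabytes


def weights_gib(n_params: float, bits_per_param: int = 16) -> float:
    """Memory needed to store n_params weights, in GiB (bf16 = 16 bits)."""
    return n_params * bits_per_param / 8 / GIB


def params_from_gib(size_gib: float, bits_per_param: int = 16) -> float:
    """Inverse of weights_gib: parameter count implied by a weights size."""
    return size_gib * GIB * 8 / bits_per_param


# 6.96B parameters in bf16
print(f"{weights_gib(6.96e9):.2f} GiB")          # prints "12.96 GiB"
# Parameter count implied by the reported 12.964976 weights figure
print(f"{params_from_gib(12.964976) / 1e9:.2f}B")  # prints "6.96B"
```

Under that assumption the reported 12.964976 figure and the 6.96B parameter count agree with each other.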

Citation

@misc{dystrio_sculpt_2026,
  title={Dystrio Sculpt: Structural Compilation for Transformer LLMs},
  author={Dystrio},
  year={2026},
  url={https://huggingface.co/dystrio}
}

Downstream Benchmarks (lm-eval)

Evaluated with lm-eval-harness on A100-80GB, bf16, zero-shot. Unlike perplexity, scores on all four tasks are below the baseline, by 0.055 to 0.102.

Benchmark	Baseline	This Model	Delta
ARC-Challenge	0.5282	0.4676	-0.0606
HellaSwag	0.6204	0.5650	-0.0554
MMLU	0.7176	0.6506	-0.0670
TruthfulQA MC2	0.6475	0.5457	-0.1018

Safetensors

Model size: 7B params
Tensor type: BF16

Model tree for dystrio/Qwen2.5-7B-Instruct-sculpt-default

Base model: Qwen/Qwen2.5-7B
Finetuned: Qwen/Qwen2.5-7B-Instruct
Finetuned: this model

Evaluation results (self-reported)

Perplexity on WikiText-103 (validation): 12.334
PPL ratio on WikiText-103 (validation): 0.990