dystrio/gemma-2-2b-it-sculpt-production

13% smaller, quality improved (0.8693x PPL), drop-in replacement. No custom kernels. No runtime changes.

Dystrio Sculpt structurally compresses transformer models, producing dense models that load with standard transformers — no custom code, no new ops, no deployment friction.

This is the Production tier of gemma 2 2b it.

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("dystrio/gemma-2-2b-it-sculpt-production", torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("dystrio/gemma-2-2b-it-sculpt-production")

inputs = tokenizer("The future of AI inference is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Benchmark Results

All tiers compiled from gemma 2 2b it on A100 80GB, bf16:

Model	PPL	PPL Ratio	Weights (GB)	Chat Prefill TPS	RAG TTFT p95 (ms)	Decode TPS
Baseline	25.7807	1.0	4.869591	21611.9	70.251	59.6
sculpt-default	20.5854	0.7985	4.441124	23065.3	69.007	60.0
sculpt-production	22.4118	0.8693	4.226891	23404.0	66.554	60.7
sculpt-throughput	29.8372	1.1573	3.969811	24330.1	64.529	59.3
sculpt-experimental	48.9699	1.8995	3.412804	26496.2	60.97	59.5

Key Metrics (this model)

Metric	Value
Weights memory	4.226891 GB (13% smaller)
PPL ratio	0.8693
Chat prefill TPS	23404.0 (+8%)
RAG TTFT p95	66.554 ms (-5%)
Decode TPS	60.7 (flat)
Parameters	2.27B

All Sculpt Tiers

Tier	HuggingFace	Size	PPL Ratio	Use Case
default	dystrio/gemma-2-2b-it-sculpt-default	4.441124 GB	0.7985	Zero-regret: quality preserved, smaller footprint
production	dystrio/gemma-2-2b-it-sculpt-production 👈 this model	4.226891 GB	0.8693	Practical savings with modest quality tradeoff
throughput	dystrio/gemma-2-2b-it-sculpt-throughput	3.969811 GB	1.1573	Maximum usable compression for speed/edge
experimental	dystrio/gemma-2-2b-it-sculpt-experimental	3.412804 GB	1.8995	Boundary exploration, maximum structural compression

What is Dystrio Sculpt?

Dystrio Sculpt compiles transformer models into smaller, faster variants. Output models:

Are dense (not sparse) — standard architecture, fewer parameters
Load with standard HuggingFace Transformers — no custom code needed
Require no custom kernels and no runtime changes
Work as a one-step compile before deployment
Stack with quantization (AWQ, GPTQ, GGUF) for compound savings

Compatibility

✅ HuggingFace Transformers
✅ vLLM
✅ TGI (Text Generation Inference)
✅ llama.cpp / GGUF conversion
✅ AWQ / GPTQ quantization
✅ Any framework that loads standard safetensors

Benchmark Environment

GPU: NVIDIA A100-SXM4-80GB
dtype: bf16
Torch: 2.10.0+cu128
Transformers: 5.3.0
Deterministic: True
Single-GPU, standard HuggingFace Transformers, no custom kernels.

Metric Definitions

PPL ratio: WikiText-103 perplexity relative to baseline. <1.0 = quality improved.
Prefill TPS: Tokens per second during prompt encoding (higher = faster).
TTFT p95: Time to first token at 95th percentile (lower = faster).
Decode TPS: Tokens per second during generation (higher = faster).
Weights (GB): Model parameter memory (deterministic, runtime-independent).

Citation

@misc{dystrio_sculpt_2026,
  title={Dystrio Sculpt: Structural Compilation for Transformer LLMs},
  author={Dystrio},
  year={2026},
  url={https://huggingface.co/dystrio}
}

Downstream Benchmarks (lm-eval)

Evaluated with lm-eval-harness on A100-80GB, bf16, zero-shot.

Benchmark	Baseline	This Model	Delta
ARC-Challenge	0.5094	0.3959	-0.1135
HellaSwag	0.5375	0.4762	-0.0613
MMLU	0.5691	0.4287	-0.1404
TruthfulQA MC2	0.5322	0.5055	-0.0267

Downloads last month: 50

Safetensors

Model size

2B params

Tensor type

BF16

Model tree for dystrio/gemma-2-2b-it-sculpt-production

Base model

google/gemma-2-2b

Finetuned

google/gemma-2-2b-it

Finetuned

(866)

this model

Dataset used to train dystrio/gemma-2-2b-it-sculpt-production

Evaluation results

perplexity on WikiText-103 (validation)
self-reported

22.412
ppl_ratio on WikiText-103 (validation)
self-reported

0.869