## What is Dystrio Sculpt?
Dystrio Sculpt produces dense compiled variants of existing models that:
- reduce memory footprint
- improve prefill throughput
- remain runtime-agnostic
- require no custom kernels
- load with standard HuggingFace Transformers
## Key Results
Compared to the mistralai/Mistral-7B-v0.1 baseline on an A100 80GB:
- Weights memory: -11% (Conservative) / -23% (Balanced)
- RAG latency (TTFT p95): -7% / -14%
- Decode throughput: ~flat
- No runtime changes: no custom kernels, no new ops, standard `transformers` loading
Notes: TTFT includes prefill + first decode step. “Weights memory” is computed from parameter sizes (GiB) and is workload-independent.
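To make the TTFT definition concrete, here is a minimal request-level timing sketch. The `prefill_fn` and `decode_step_fn` arguments are hypothetical stand-ins for the model's prefill forward and single-token decode calls, not the benchmark's actual harness:

```python
import time

def measure_ttft(prompt, prefill_fn, decode_step_fn):
    """Request-level TTFT: wall time from prompt submission until the
    first generated token (prefill forward + first decode step)."""
    start = time.perf_counter()
    state = prefill_fn(prompt)           # full-prompt prefill forward
    first_token = decode_step_fn(state)  # first decode step only
    ttft_ms = (time.perf_counter() - start) * 1000.0
    return first_token, ttft_ms

# Toy stand-ins so the sketch runs without a model
tok, ttft = measure_ttft("hello world", lambda p: p.split(), lambda s: s[0])
```

In the real benchmark, the two callables would wrap the model's forward passes; the timing boundary (submission to first token) is what matters.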
## Benchmark Results
| Model | PPL | PPL Ratio | RAG TTFT p95 (ms) | Chat Decode p95 (ms/tok) | Prefill TPS | Decode TPS | Weights (GiB) | Post-load (GiB) | End-of-bench (GiB) | Peak (GiB) |
|---|---|---|---|---|---|---|---|---|---|---|
| mistral-7b-v0.1 (baseline) | 11.0557 | 1.0 | 158.357 | 33.096 | 7661.1 | 30.9 | 13.488777 | 13.488778 | 13.5 | 14.15 |
| sculpt-conservative | 12.4484 | 1.126 | 147.31 | 34.169 | 8296.3 | 30.2 | 11.988777 | 11.996713 | 12.0 | 12.63 |
| sculpt-balanced | 19.5153 | 1.7652 | 135.959 | 33.302 | 9175.1 | 30.7 | 10.395027 | 10.402963 | 10.4 | 11.02 |
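The headline percentages in Key Results follow directly from these rows; a quick sanity-check script (numbers copied from the table above):

```python
# Relative deltas vs. the baseline row of the benchmark table
baseline = {"ttft_ms": 158.357, "weights_gib": 13.488777}
variants = {
    "sculpt-conservative": {"ttft_ms": 147.310, "weights_gib": 11.988777},
    "sculpt-balanced":     {"ttft_ms": 135.959, "weights_gib": 10.395027},
}

def pct_delta(new: float, old: float) -> float:
    """Signed percent change relative to the baseline value."""
    return 100.0 * (new - old) / old

for name, v in variants.items():
    print(name,
          f"TTFT {pct_delta(v['ttft_ms'], baseline['ttft_ms']):+.1f}%",
          f"weights {pct_delta(v['weights_gib'], baseline['weights_gib']):+.1f}%")
```

This reproduces the -7%/-14% TTFT and -11%/-23% weights-memory figures (to the stated rounding).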
## Benchmark Environment
- GPU: NVIDIA A100-SXM4-80GB
- dtype: bf16
- Torch: 2.10.0+cu128
- Transformers: 5.2.0
- Deterministic: False
- Seed: 0
- Single-GPU, Hugging Face Transformers, no custom kernels.
## Metric Definitions
- TTFT incl. prefill: Wall time from prompt submission to first generated token (prefill forward + first decode step). Per-prompt request-level measurement.
- First decode step: Wall time of the first decode forward call only (post-prefill). Per-prompt request-level measurement.
- Prefill/Decode TPS: Throughput from batched microbenchmark iterations (not request-level; used for throughput comparison only).
- Weights (GiB): Model parameter memory only (sum of `numel() * element_size()` over all parameters). Deterministic and runtime-independent.
- Post-load (GiB): `torch.cuda.memory_allocated()` immediately after `model.eval()` + `torch.cuda.empty_cache()`. Captures weights + framework overhead before any inference.
- End-of-bench (GiB): `torch.cuda.memory_allocated()` at end of benchmark workload. Includes KV-cache and activations still held.
- Peak (GiB): `torch.cuda.max_memory_allocated()` during benchmark. High-water mark for planning GPU headroom.
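The Weights (GiB) metric can be reproduced without a GPU. The sketch below mirrors the `numel * element_size` sum using plain arithmetic; the parameter count used is the commonly cited Mistral-7B total and is an assumption here, not a value read from the checkpoint:

```python
# Weights (GiB) as defined above: total parameter bytes / 2**30.
# In torch this is sum(p.numel() * p.element_size() for p in model.parameters()).
def weights_gib(num_params: int, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 2**30

MISTRAL_7B_PARAMS = 7_241_732_096  # assumed parameter count for Mistral-7B
print(f"{weights_gib(MISTRAL_7B_PARAMS, 2):.2f} GiB")  # bf16: 2 bytes/param -> ~13.49 GiB
```

The result matches the baseline's 13.49 GiB Weights column, which is why the metric is described as deterministic and runtime-independent.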
## Evaluation results
- Perplexity on WikiText-103 (validation): 11.056 (self-reported)