What is Dystrio Sculpt?

Dystrio Sculpt produces dense compiled variants of existing models that:

  • reduce memory footprint
  • improve prefill throughput
  • remain runtime-agnostic
  • require no custom kernels
  • load with standard HuggingFace Transformers
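Because Sculpt variants are plain Transformers checkpoints, loading looks the same as for any other Hugging Face model. A minimal sketch — the bf16 dtype and `device_map="auto"` placement mirror the benchmark environment below but are not requirements:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "dystrio/Mistral-7B-v0.1-sculpt-balanced"

def load_sculpt(model_id: str = MODEL_ID):
    """Load a Sculpt variant like any other checkpoint:
    no custom kernels, no new ops, standard Transformers loading."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # dtype used in the benchmarks below
        device_map="auto",
    )
    return tokenizer, model
```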

Key Results

Compared to the mistralai/Mistral-7B-v0.1 baseline on an NVIDIA A100 80GB:

  • Weights memory: -11% (Conservative) / -23% (Balanced)
  • RAG latency (TTFT p95): -7% / -14%
  • Decode throughput: ~flat
  • No runtime changes: no custom kernels, no new ops, standard transformers loading

Notes: TTFT includes prefill + first decode step. “Weights memory” is computed from parameter sizes (GiB) and is workload-independent.

Benchmark Results

| Model | PPL | PPL Ratio | RAG TTFT p95 (ms) | Chat Decode p95 (ms/tok) | Prefill TPS | Decode TPS | Weights (GiB) | Post-load (GiB) | End-of-bench (GiB) | Peak (GiB) |
|---|---|---|---|---|---|---|---|---|---|---|
| mistral-7b-v0.1 (baseline) | 11.0557 | 1.0 | 158.357 | 33.096 | 7661.1 | 30.9 | 13.488777 | 13.488778 | 13.5 | 14.15 |
| sculpt-conservative | 12.4484 | 1.126 | 147.31 | 34.169 | 8296.3 | 30.2 | 11.988777 | 11.996713 | 12.0 | 12.63 |
| sculpt-balanced | 19.5153 | 1.7652 | 135.959 | 33.302 | 9175.1 | 30.7 | 10.395027 | 10.402963 | 10.4 | 11.02 |
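
The headline percentages under Key Results follow directly from the rows above; a quick sanity check using only arithmetic on the table values:

```python
def pct_delta(variant: float, baseline: float) -> float:
    """Relative change of a variant metric vs. the baseline, in percent."""
    return (variant / baseline - 1.0) * 100.0

# Weights memory vs. baseline (13.488777 GiB):
print(round(pct_delta(11.988777, 13.488777), 1))  # conservative -> -11.1
print(round(pct_delta(10.395027, 13.488777), 1))  # balanced     -> -22.9

# RAG TTFT p95 vs. baseline (158.357 ms):
print(round(pct_delta(147.310, 158.357), 1))      # conservative -> -7.0
print(round(pct_delta(135.959, 158.357), 1))      # balanced     -> -14.1

# The PPL Ratio column is variant PPL / baseline PPL:
print(round(12.4484 / 11.0557, 3))                # -> 1.126
```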

Benchmark Environment

  • GPU: NVIDIA A100-SXM4-80GB
  • dtype: bf16
  • Torch: 2.10.0+cu128
  • Transformers: 5.2.0
  • Deterministic: False
  • Seed: 0
  • Single-GPU, Hugging Face Transformers, no custom kernels.
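
When reproducing the numbers, it helps to record the same environment fields at runtime. A sketch that assumes only `torch` and `transformers` are installed:

```python
import torch
import transformers

def env_summary() -> dict:
    """Collect the fields reported under 'Benchmark Environment'."""
    return {
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
        "torch": torch.__version__,
        "transformers": transformers.__version__,
        "deterministic": torch.are_deterministic_algorithms_enabled(),
    }
```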

Metric Definitions

  • TTFT incl. prefill: Wall time from prompt submission to first generated token (prefill forward + first decode step). Per-prompt request-level measurement.
  • First decode step: Wall time of the first decode forward call only (post-prefill). Per-prompt request-level measurement.
  • Prefill/Decode TPS: Throughput from batched microbenchmark iterations (not request-level; used for throughput comparison only).
  • Weights (GiB): Model parameter memory only (sum of numel * element_size for all parameters). Deterministic and runtime-independent.
  • Post-load (GiB): torch.cuda.memory_allocated() immediately after model.eval() + torch.cuda.empty_cache(). Captures weights + framework overhead before any inference.
  • End-of-bench (GiB): torch.cuda.memory_allocated() at end of benchmark workload. Includes KV-cache and activations still held.
  • Peak (GiB): torch.cuda.max_memory_allocated() during benchmark. High-water mark for planning GPU headroom.
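
The Weights (GiB) metric is deterministic and needs no GPU: it is just the `numel * element_size` sum over parameters. A minimal sketch — the parameter count for Mistral-7B-v0.1 is an approximation, commonly cited as ~7.24B:

```python
def weights_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Weights memory in GiB: numel * element_size summed over parameters.
    bytes_per_param=2 corresponds to bf16."""
    return num_params * bytes_per_param / 2**30

# With roughly 7,241,732,096 parameters in bf16, this reproduces the
# ~13.49 GiB baseline figure in the table.
print(round(weights_gib(7_241_732_096), 2))  # -> 13.49
```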