What is Dystrio Sculpt?

Dystrio Sculpt produces dense compiled variants of existing models that:

  • reduce memory footprint
  • improve prefill throughput
  • remain runtime-agnostic
  • require no custom kernels
  • load with standard HuggingFace Transformers
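Because Sculpt variants are plain Transformers checkpoints, loading looks the same as for any other Hugging Face model. A minimal sketch — the bf16 dtype and `device_map="auto"` placement mirror the benchmark environment below but are not requirements:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "dystrio/Mistral-7B-v0.1-sculpt-balanced"

def load_sculpt(model_id: str = MODEL_ID):
    """Load a Sculpt variant like any other checkpoint:
    no custom kernels, no new ops, standard Transformers loading."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # dtype used in the benchmarks below
        device_map="auto",
    )
    return tokenizer, model
```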

Key Results

Compared to the mistralai/Mistral-7B-v0.1 baseline on an NVIDIA A100 80GB:

  • Weights memory: -11% (Conservative) / -23% (Balanced)
  • RAG latency (TTFT p95): -7% / -14%
  • Decode throughput: ~flat
  • No runtime changes: no custom kernels, no new ops, standard transformers loading

Notes: TTFT includes prefill + first decode step. “Weights memory” is computed from parameter sizes (GiB) and is workload-independent.

Benchmark Results

| Model | PPL | PPL Ratio | RAG TTFT p95 (ms) | Chat Decode p95 (ms/tok) | Prefill TPS | Decode TPS | Weights (GiB) | Post-load (GiB) | End-of-bench (GiB) | Peak (GiB) |
|---|---|---|---|---|---|---|---|---|---|---|
| mistral-7b-v0.1 (baseline) | 11.0557 | 1.0 | 158.357 | 33.096 | 7661.1 | 30.9 | 13.488777 | 13.488778 | 13.5 | 14.15 |
| sculpt-conservative | 12.4484 | 1.126 | 147.31 | 34.169 | 8296.3 | 30.2 | 11.988777 | 11.996713 | 12.0 | 12.63 |
| sculpt-balanced | 19.5153 | 1.7652 | 135.959 | 33.302 | 9175.1 | 30.7 | 10.395027 | 10.402963 | 10.4 | 11.02 |
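
The headline percentages under Key Results follow directly from the rows above; a quick sanity check using only arithmetic on the table values:

```python
def pct_delta(variant: float, baseline: float) -> float:
    """Relative change of a variant metric vs. the baseline, in percent."""
    return (variant / baseline - 1.0) * 100.0

# Weights memory vs. baseline (13.488777 GiB):
print(round(pct_delta(11.988777, 13.488777), 1))  # conservative -> -11.1
print(round(pct_delta(10.395027, 13.488777), 1))  # balanced     -> -22.9

# RAG TTFT p95 vs. baseline (158.357 ms):
print(round(pct_delta(147.310, 158.357), 1))      # conservative -> -7.0
print(round(pct_delta(135.959, 158.357), 1))      # balanced     -> -14.1

# The PPL Ratio column is variant PPL / baseline PPL:
print(round(12.4484 / 11.0557, 3))                # -> 1.126
```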

Benchmark Environment

  • GPU: NVIDIA A100-SXM4-80GB
  • dtype: bf16
  • Torch: 2.10.0+cu128
  • Transformers: 5.2.0
  • Deterministic: False
  • Seed: 0
  • Single-GPU, Hugging Face Transformers, no custom kernels.
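
When reproducing the numbers, it helps to record the same environment fields at runtime. A sketch that assumes only `torch` and `transformers` are installed:

```python
import torch
import transformers

def env_summary() -> dict:
    """Collect the fields reported under 'Benchmark Environment'."""
    return {
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
        "torch": torch.__version__,
        "transformers": transformers.__version__,
        "deterministic": torch.are_deterministic_algorithms_enabled(),
    }
```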

Metric Definitions

  • TTFT incl. prefill: Wall time from prompt submission to first generated token (prefill forward + first decode step). Per-prompt request-level measurement.
  • First decode step: Wall time of the first decode forward call only (post-prefill). Per-prompt request-level measurement.
  • Prefill/Decode TPS: Throughput from batched microbenchmark iterations (not request-level; used for throughput comparison only).
  • Weights (GiB): Model parameter memory only (sum of numel * element_size for all parameters). Deterministic and runtime-independent.
  • Post-load (GiB): torch.cuda.memory_allocated() immediately after model.eval() + torch.cuda.empty_cache(). Captures weights + framework overhead before any inference.
  • End-of-bench (GiB): torch.cuda.memory_allocated() at end of benchmark workload. Includes KV-cache and activations still held.
  • Peak (GiB): torch.cuda.max_memory_allocated() during benchmark. High-water mark for planning GPU headroom.
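
The Weights (GiB) metric is deterministic and needs no GPU: it is just the `numel * element_size` sum over parameters. A minimal sketch — the parameter count for Mistral-7B-v0.1 is an approximation, commonly cited as ~7.24B:

```python
def weights_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Weights memory in GiB: numel * element_size summed over parameters.
    bytes_per_param=2 corresponds to bf16."""
    return num_params * bytes_per_param / 2**30

# With roughly 7,241,732,096 parameters in bf16, this reproduces the
# ~13.49 GiB baseline figure in the table.
print(round(weights_gib(7_241_732_096), 2))  # -> 13.49
```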