---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
tags:
- dystrio
- sculpt
- dense
- runtime-agnostic
- no-custom-kernels
- hf-drop-in
model-index:
- name: Dystrio Sculpt (Mistral-7B Conservative)
  results:
  - task:
      type: text-generation
    dataset:
      name: WikiText-103 (validation)
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 12.4484
---

## What is Dystrio Sculpt?

Dystrio Sculpt produces dense compiled variants of existing models that:

- reduce memory footprint
- improve prefill throughput
- remain runtime-agnostic
- require no custom kernels
- load with standard Hugging Face Transformers (see the usage sketch below)

## Key Results

Compared to the **mistralai/Mistral-7B-v0.1** baseline on an **A100 80GB**:

- **Weights memory:** **-11% (Conservative)** / **-23% (Balanced)**
- **RAG latency (TTFT p95):** **-7% / -14%**
- **Decode throughput:** ~flat
- **No runtime changes:** no custom kernels, no new ops, standard `transformers` loading

> Notes: TTFT includes prefill plus the first decode step. “Weights memory” is computed from parameter sizes (GiB) and is workload-independent.

## Benchmark Results

| Model | PPL | PPL Ratio | RAG TTFT p95 (ms) | Chat Decode p95 (ms/tok) | Prefill TPS | Decode TPS | Weights (GiB) | Post-load (GiB) | End-of-bench (GiB) | Peak (GiB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mistral-7b-v0.1 (baseline) | 11.0557 | 1.0 | 158.357 | 33.096 | 7661.1 | 30.9 | 13.488777 | 13.488778 | 13.5 | 14.15 |
| sculpt-conservative | 12.4484 | 1.126 | 147.31 | 34.169 | 8296.3 | 30.2 | 11.988777 | 11.996713 | 12.0 | 12.63 |
| sculpt-balanced | 19.5153 | 1.7652 | 135.959 | 33.302 | 9175.1 | 30.7 | 10.395027 | 10.402963 | 10.4 | 11.02 |

### Benchmark Environment

- **GPU**: NVIDIA A100-SXM4-80GB
- **dtype**: bf16
- **Torch**: 2.10.0+cu128
- **Transformers**: 5.2.0
- **Deterministic**: False
- **Seed**: 0
- Single GPU, Hugging Face Transformers, no custom kernels.

### Metric Definitions

- **TTFT incl. prefill**: Wall time from prompt submission to the first generated token (prefill forward + first decode step). Per-prompt, request-level measurement.
- **First decode step**: Wall time of the first decode forward call only (post-prefill). Per-prompt, request-level measurement.
- **Prefill/Decode TPS**: Throughput from batched microbenchmark iterations (not request-level; used for throughput comparison only).
- **Weights (GiB)**: Model parameter memory only (sum of `numel * element_size` over all parameters). Deterministic and runtime-independent.
- **Post-load (GiB)**: `torch.cuda.memory_allocated()` immediately after `model.eval()` + `torch.cuda.empty_cache()`. Captures weights plus framework overhead before any inference.
- **End-of-bench (GiB)**: `torch.cuda.memory_allocated()` at the end of the benchmark workload. Includes KV cache and activations still held.
- **Peak (GiB)**: `torch.cuda.max_memory_allocated()` during the benchmark. High-water mark for planning GPU headroom.
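## Usage

Sculpt variants are plain dense checkpoints, so they load like any other `transformers` model; no custom code, ops, or kernels are required. A minimal loading sketch follows. The repo id is a hypothetical placeholder (this card does not state the published repository name), so substitute the checkpoint you actually use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- replace with the actual Sculpt checkpoint.
repo_id = "dystrio/sculpt-mistral-7b-conservative"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    dtype=torch.bfloat16,  # benchmarks above used bf16; use torch_dtype= on transformers 4.x
    device_map="auto",
)
model.eval()

inputs = tokenizer("Dystrio Sculpt is", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```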
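## Measurement Sketches

The snippets in this section are illustrative reimplementations of the metric definitions above, not the exact benchmark harness. First, the memory columns: the CUDA counters are standard PyTorch calls, and the weights figure is pure arithmetic over the parameters.

```python
import torch

GiB = 1024 ** 3

def weights_gib(model: torch.nn.Module) -> float:
    # Sum of numel * element_size over all parameters:
    # deterministic and runtime-independent.
    return sum(p.numel() * p.element_size() for p in model.parameters()) / GiB

def post_load_gib() -> float:
    # Allocator state right after model.eval() + empty_cache(),
    # before any inference: weights plus framework overhead.
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated() / GiB

def peak_gib() -> float:
    # High-water mark during the benchmark; the number to use
    # when planning GPU headroom.
    return torch.cuda.max_memory_allocated() / GiB
```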
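TTFT as defined above is a single wall-clock span covering the prefill forward and the first decode step. A request-level sketch, assuming the `model` and `tokenizer` from the usage example; the manual KV-cache step here stands in for whatever the harness does inside `generate()`:

```python
import time
import torch

def ttft_ms(model, tokenizer, prompt: str) -> float:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.no_grad():
        # Prefill: one forward over the full prompt, keeping the KV cache.
        out = model(**inputs, use_cache=True)
        next_tok = out.logits[:, -1:].argmax(dim=-1)
        # First decode step: a single-token forward against the cached keys/values.
        model(input_ids=next_tok, past_key_values=out.past_key_values, use_cache=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1000.0
```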
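Prefill TPS, by contrast, comes from batched microbenchmark iterations rather than per-request timing. A sketch of that shape, with arbitrary batch size, sequence length, and iteration count (the card does not state the values the harness used):

```python
import time
import torch

def prefill_tps(model, batch_size: int = 8, seq_len: int = 512, iters: int = 10) -> float:
    # Random token ids are fine for throughput: compute cost does not
    # depend on token identity.
    ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len), device=model.device)
    with torch.no_grad():
        model(input_ids=ids)  # warmup
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            model(input_ids=ids)
        torch.cuda.synchronize()
    return (batch_size * seq_len * iters) / (time.perf_counter() - t0)
```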