---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
tags:
- dystrio
- sculpt
- dense
- runtime-agnostic
- no-custom-kernels
- hf-drop-in
model-index:
- name: Dystrio Sculpt (Mistral-7B Conservative)
results:
- task:
type: text-generation
dataset:
name: WikiText-103 (validation)
type: wikitext
metrics:
- name: perplexity
type: perplexity
value: 12.4484
---
## What is Dystrio Sculpt?
Dystrio Sculpt produces dense compiled variants of existing models that:
- reduce memory footprint
- improve prefill throughput
- remain runtime-agnostic
- require no custom kernels
- load with standard Hugging Face Transformers
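Because no custom kernels or ops are registered, loading follows the stock Transformers path. A minimal sketch (the helper name is illustrative, not a shipped API, and the keyword argument for dtype may differ across Transformers versions):

```python
def load_sculpt(repo_id: str):
    """Load a Sculpt checkpoint exactly like any dense HF model.

    Illustrative helper only. torch and transformers are imported
    lazily so the sketch stands alone.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,  # bf16, matching the benchmark setup
        device_map="auto",
    )
    return tokenizer, model.eval()
```

Since the checkpoint is a plain dense model, no `trust_remote_code` or kernel packages are involved.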
## Key Results
Compared to the **mistralai/Mistral-7B-v0.1** baseline on an **A100 80GB**:
- **Weights memory:** **-11% (Conservative)** / **-23% (Balanced)**
- **RAG latency (TTFT p95):** **-7% / -14%**
- **Decode throughput:** ~flat
- **No runtime changes:** no custom kernels, no new ops, standard `transformers` loading
> Notes: TTFT includes prefill + first decode step. “Weights memory” is computed from parameter sizes (GiB) and is workload-independent.
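The weights figure can be sanity-checked with pure arithmetic. Assuming the publicly reported parameter count of Mistral-7B-v0.1 (7,241,732,096) and 2 bytes per bf16 element, the parameter-size computation described in the note reproduces the baseline's 13.488777 GiB (a back-of-envelope sketch, not part of the benchmark code):

```python
# Sanity check of "Weights memory": sum(numel * element_size) in GiB.
# The parameter count below is the publicly reported figure for
# Mistral-7B-v0.1; 2 bytes per element corresponds to bf16.
MISTRAL_7B_V01_PARAMS = 7_241_732_096
BF16_BYTES = 2

def weights_gib(num_params: int, bytes_per_param: int = BF16_BYTES) -> float:
    return num_params * bytes_per_param / 2**30  # one GiB is 2**30 bytes

baseline_gib = weights_gib(MISTRAL_7B_V01_PARAMS)  # ≈ 13.488777 GiB
```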
## Benchmark Results
| Model | PPL | PPL Ratio | RAG TTFT p95 (ms) | Chat Decode p95 (ms/tok) | Prefill TPS | Decode TPS | Weights (GiB) | Post-load (GiB) | End-of-bench (GiB) | Peak (GiB) |
| ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| mistral-7b-v0.1 (baseline) | 11.0557 | 1.000 | 158.357 | 33.096 | 7661.1 | 30.9 | 13.488777 | 13.488778 | 13.50 | 14.15 |
| sculpt-conservative | 12.4484 | 1.126 | 147.310 | 34.169 | 8296.3 | 30.2 | 11.988777 | 11.996713 | 12.00 | 12.63 |
| sculpt-balanced | 19.5153 | 1.765 | 135.959 | 33.302 | 9175.1 | 30.7 | 10.395027 | 10.402963 | 10.40 | 11.02 |
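The headline deltas in the Key Results section follow directly from the table rows; a quick arithmetic check (values copied from the table above):

```python
def pct_delta(variant: float, baseline: float) -> float:
    """Relative change vs. baseline, in percent (negative = reduction)."""
    return (variant / baseline - 1.0) * 100.0

# Headline numbers reproduced from the benchmark table:
weights_conservative = pct_delta(11.988777, 13.488777)  # ≈ -11%
weights_balanced     = pct_delta(10.395027, 13.488777)  # ≈ -23%
ttft_conservative    = pct_delta(147.310, 158.357)      # ≈ -7%
ttft_balanced        = pct_delta(135.959, 158.357)      # ≈ -14%
```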
### Benchmark Environment
- **GPU**: NVIDIA A100-SXM4-80GB
- **dtype**: bf16
- **Torch**: 2.10.0+cu128
- **Transformers**: 5.2.0
- **Deterministic**: False
- **Seed**: 0
- Single-GPU, Hugging Face Transformers, no custom kernels.
### Metric Definitions
- **TTFT incl. prefill**: Wall time from prompt submission to first generated token (prefill forward + first decode step). Per-prompt request-level measurement.
- **First decode step**: Wall time of the first decode forward call only (post-prefill). Per-prompt request-level measurement.
- **Prefill/Decode TPS**: Throughput from batched microbenchmark iterations (not request-level; used for throughput comparison only).
- **Weights (GiB)**: Model parameter memory only (sum of numel * element_size for all parameters). Deterministic and runtime-independent.
- **Post-load (GiB)**: `torch.cuda.memory_allocated()` immediately after `model.eval()` + `torch.cuda.empty_cache()`. Captures weights + framework overhead before any inference.
- **End-of-bench (GiB)**: `torch.cuda.memory_allocated()` at end of benchmark workload. Includes KV-cache and activations still held.
- **Peak (GiB)**: `torch.cuda.max_memory_allocated()` during benchmark. High-water mark for planning GPU headroom.
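As an illustration of the request-level aggregation, a minimal sketch assuming a nearest-rank p95 (the actual harness is not shown here and may interpolate differently):

```python
import math
import time

def p95(samples):
    """Nearest-rank 95th percentile over per-request measurements."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

def ttft_ms(prefill_step, first_decode_step):
    """TTFT incl. prefill: wall time of prefill forward + first decode step, in ms."""
    start = time.perf_counter()
    prefill_step()        # prompt forward pass
    first_decode_step()   # first generated token
    return (time.perf_counter() - start) * 1000.0
```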