---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
tags:
- dystrio
- sculpt
- dense
- runtime-agnostic
- no-custom-kernels
- hf-drop-in

model-index:
- name: Dystrio Sculpt (Mistral-7B Conservative)
  results:
  - task:
      type: text-generation
    dataset:
      name: WikiText-103 (validation)
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 11.0557
---

## What is Dystrio Sculpt?

Dystrio Sculpt produces dense, compiled variants of existing models that:

- reduce memory footprint
- improve prefill throughput
- remain runtime-agnostic
- require no custom kernels
- load with standard Hugging Face Transformers (see the loading sketch below)
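
A minimal loading sketch, assuming the Sculpt variant is published as a standard Transformers checkpoint. The repo id below is a placeholder, not a confirmed model id:

```python
# Minimal loading sketch -- the repo id is a placeholder, not a confirmed model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/dystrio-sculpt-mistral-7b-conservative"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,   # matches the bf16 benchmark setting below
    device_map="auto",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

No custom ops or kernels are required for this path; it is the same `from_pretrained` flow used for the baseline model.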

## Key Results

Compared to the **mistralai/Mistral-7B-v0.1** baseline on an **A100 80GB**:

- **Weights memory:** **-11% (Conservative)** / **-23% (Balanced)**
- **RAG latency (TTFT p95):** **-7% / -14%**
- **Decode throughput:** ~flat
- **No runtime changes:** no custom kernels, no new ops, standard `transformers` loading

> Notes: TTFT includes prefill + first decode step. “Weights memory” is computed from parameter sizes (GiB) and is workload-independent.


## Benchmark Results

| Model | PPL | PPL Ratio | RAG TTFT p95 (ms) | Chat Decode p95 (ms/tok) | Prefill TPS | Decode TPS | Weights (GiB) | Post-load (GiB) | End-of-bench (GiB) | Peak (GiB) |
| ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| mistral-7b-v0.1 (baseline) | 11.0557 | 1.0 | 158.357 | 33.096 | 7661.1 | 30.9 | 13.488777 | 13.488778 | 13.5 | 14.15 |
| sculpt-conservative | 12.4484 | 1.126 | 147.31 | 34.169 | 8296.3 | 30.2 | 11.988777 | 11.996713 | 12.0 | 12.63 |
| sculpt-balanced | 19.5153 | 1.7652 | 135.959 | 33.302 | 9175.1 | 30.7 | 10.395027 | 10.402963 | 10.4 | 11.02 |

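A hedged sketch of how the perplexity column could be reproduced on the WikiText-103 validation split. The dataset config name, window length, and stride below are illustrative assumptions, not the exact benchmark settings:

```python
# Sliding-window perplexity sketch; max_length/stride and the dataset config name
# are illustrative choices, not necessarily the settings used for the table above.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/dystrio-sculpt-mistral-7b-conservative"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")["text"])
encodings = tokenizer(text, return_tensors="pt")
seq_len = encodings.input_ids.size(1)

max_length, stride = 4096, 2048   # illustrative window settings
nlls, n_tokens, prev_end = [], 0, 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end            # only score tokens not scored before
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100     # mask the overlapping context from the loss
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss.float() * trg_len)
    n_tokens += trg_len
    prev_end = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / n_tokens).item())
```
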

### Benchmark Environment

- **GPU**: NVIDIA A100-SXM4-80GB
- **dtype**: bf16
- **Torch**: 2.10.0+cu128
- **Transformers**: 5.2.0
- **Deterministic**: False
- **Seed**: 0
- Single-GPU, Hugging Face Transformers, no custom kernels.
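
A small sketch for logging the same environment fields when re-running the benchmark on your own hardware (reported values will differ):

```python
# Prints the environment fields listed above for the current machine.
import torch
import transformers

print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
print("Torch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("Deterministic:", torch.are_deterministic_algorithms_enabled())
torch.manual_seed(0)  # seed used for the reported runs
```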

### Metric Definitions

- **TTFT incl. prefill**: Wall time from prompt submission to first generated token (prefill forward + first decode step). Per-prompt request-level measurement.
- **First decode step**: Wall time of the first decode forward call only (post-prefill). Per-prompt request-level measurement.
- **Prefill/Decode TPS**: Throughput from batched microbenchmark iterations (not request-level; used for throughput comparison only).
- **Weights (GiB)**: Model parameter memory only (sum of `numel * element_size` over all parameters). Deterministic and runtime-independent.
- **Post-load (GiB)**: `torch.cuda.memory_allocated()` immediately after `model.eval()` + `torch.cuda.empty_cache()`. Captures weights + framework overhead before any inference.
- **End-of-bench (GiB)**: `torch.cuda.memory_allocated()` at end of benchmark workload. Includes KV-cache and activations still held.
- **Peak (GiB)**: `torch.cuda.max_memory_allocated()` during benchmark. High-water mark for planning GPU headroom.
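
A minimal sketch of how the memory columns could be measured, following the definitions above; the helper names are illustrative, not part of the benchmark harness:

```python
# Memory-metric helpers following the definitions above; names are illustrative.
import torch

GIB = 1024 ** 3

def weights_gib(model):
    # "Weights (GiB)": parameter memory only, deterministic and runtime-independent.
    return sum(p.numel() * p.element_size() for p in model.parameters()) / GIB

def post_load_gib():
    # "Post-load (GiB)": allocated memory right after model.eval() + empty_cache().
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated() / GIB

def peak_gib():
    # "Peak (GiB)": high-water mark during the benchmark workload.
    return torch.cuda.max_memory_allocated() / GIB

# Typical measurement sequence (sketch):
#   model = AutoModelForCausalLM.from_pretrained(...).to("cuda").eval()
#   print(weights_gib(model), post_load_gib())
#   torch.cuda.reset_peak_memory_stats()
#   ... run benchmark workload ...
#   print(torch.cuda.memory_allocated() / GIB)  # "End-of-bench (GiB)"
#   print(peak_gib())                           # "Peak (GiB)"
```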