---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
tags:
  - dystrio
  - sculpt
  - dense
  - runtime-agnostic
  - no-custom-kernels
  - hf-drop-in
model-index:
  - name: Dystrio Sculpt (Mistral-7B Conservative)
    results:
      - task:
          type: text-generation
        dataset:
          name: WikiText-103 (validation)
          type: wikitext
        metrics:
          - name: perplexity
            type: perplexity
            value: 11.0557
---

What is Dystrio Sculpt?

Dystrio Sculpt produces dense compiled variants of existing models that:

  • reduce memory footprint
  • improve prefill throughput
  • remain runtime-agnostic
  • require no custom kernels
  • load with standard HuggingFace Transformers

Key Results

Compared to mistralai/Mistral-7B-v0.1 baseline on an A100 80GB:

  • Weights memory: -11% (Conservative) / -23% (Balanced)
  • RAG latency (TTFT p95): -7% / -14%
  • Decode throughput: ~flat
  • No runtime changes: no custom kernels, no new ops, standard transformers loading

Notes: TTFT includes prefill + first decode step. “Weights memory” is computed from parameter sizes (GiB) and is workload-independent.
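Because the weights figure is just parameter count times bytes per parameter, it can be reproduced without loading the model. A minimal sketch, assuming the commonly cited Mistral-7B-v0.1 parameter count (~7.24B) and bf16 storage (2 bytes per parameter):

```python
# Workload-independent weights memory: parameter count * element size.
# The parameter count below is an assumption (published Mistral-7B-v0.1
# figure), not taken from this card; bf16 stores 2 bytes per parameter.
N_PARAMS = 7_241_732_096
BYTES_PER_PARAM = 2  # bf16

weights_gib = N_PARAMS * BYTES_PER_PARAM / 2**30
print(f"{weights_gib:.6f} GiB")  # ≈ 13.488777, the baseline Weights (GiB) figure
```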

Benchmark Results

| Model | PPL | PPL Ratio | RAG TTFT p95 (ms) | Chat Decode p95 (ms/tok) | Prefill TPS | Decode TPS | Weights (GiB) | Post-load (GiB) | End-of-bench (GiB) | Peak (GiB) |
|---|---|---|---|---|---|---|---|---|---|---|
| mistral-7b-v0.1 (baseline) | 11.0557 | 1.0 | 158.357 | 33.096 | 7661.1 | 30.9 | 13.488777 | 13.488778 | 13.5 | 14.15 |
| sculpt-conservative | 12.4484 | 1.126 | 147.31 | 34.169 | 8296.3 | 30.2 | 11.988777 | 11.996713 | 12.0 | 12.63 |
| sculpt-balanced | 19.5153 | 1.7652 | 135.959 | 33.302 | 9175.1 | 30.7 | 10.395027 | 10.402963 | 10.4 | 11.02 |
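The headline percentages in Key Results follow directly from this table. A quick check, with the values copied from the rows above:

```python
# Recompute the Key Results deltas from the benchmark table values.
baseline = {"ppl": 11.0557, "ttft_ms": 158.357, "weights_gib": 13.488777}
conservative = {"ppl": 12.4484, "ttft_ms": 147.310, "weights_gib": 11.988777}
balanced = {"ppl": 19.5153, "ttft_ms": 135.959, "weights_gib": 10.395027}

def delta_pct(variant, key):
    """Relative change vs. baseline, in percent (negative = reduction)."""
    return 100 * (variant[key] - baseline[key]) / baseline[key]

print(round(conservative["ppl"] / baseline["ppl"], 3))  # 1.126 (PPL ratio)
print(round(delta_pct(conservative, "weights_gib")))    # -11 (% weights memory)
print(round(delta_pct(balanced, "weights_gib")))        # -23
print(round(delta_pct(conservative, "ttft_ms")))        # -7  (% TTFT p95)
print(round(delta_pct(balanced, "ttft_ms")))            # -14
```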

Benchmark Environment

  • GPU: NVIDIA A100-SXM4-80GB
  • dtype: bf16
  • Torch: 2.10.0+cu128
  • Transformers: 5.2.0
  • Deterministic: False
  • Seed: 0
  • Single-GPU, Hugging Face Transformers, no custom kernels.

Metric Definitions

  • TTFT incl. prefill: Wall time from prompt submission to first generated token (prefill forward + first decode step). Per-prompt request-level measurement.
  • First decode step: Wall time of the first decode forward call only (post-prefill). Per-prompt request-level measurement.
  • Prefill/Decode TPS: Throughput from batched microbenchmark iterations (not request-level; used for throughput comparison only).
  • Weights (GiB): Model parameter memory only (sum of numel * element_size for all parameters). Deterministic and runtime-independent.
  • Post-load (GiB): torch.cuda.memory_allocated() immediately after model.eval() + torch.cuda.empty_cache(). Captures weights + framework overhead before any inference.
  • End-of-bench (GiB): torch.cuda.memory_allocated() at end of benchmark workload. Includes KV-cache and activations still held.
  • Peak (GiB): torch.cuda.max_memory_allocated() during benchmark. High-water mark for planning GPU headroom.
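The split between "TTFT incl. prefill" and "First decode step" can be illustrated with a stub request. Note that `prefill` and `decode_step` below are hypothetical stand-ins (sleep placeholders), not the actual benchmark harness:

```python
import time

def prefill(prompt):
    """Stand-in for the prefill forward pass over the full prompt."""
    time.sleep(0.01)  # placeholder for real model work

def decode_step():
    """Stand-in for a single decode forward call producing one token."""
    time.sleep(0.002)  # placeholder for real model work

# Request-level measurement: TTFT spans prefill forward + first decode step,
# while "first decode step" times only the post-prefill decode call.
t0 = time.perf_counter()
prefill("example prompt")
t1 = time.perf_counter()
decode_step()
t2 = time.perf_counter()

ttft_ms = (t2 - t0) * 1000          # "TTFT incl. prefill"
first_decode_ms = (t2 - t1) * 1000  # "First decode step" only
print(f"TTFT: {ttft_ms:.1f} ms, first decode step: {first_decode_ms:.1f} ms")
```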