---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
tags:
  - dystrio
  - sculpt
  - dense
  - runtime-agnostic
  - no-custom-kernels
  - hf-drop-in
model-index:
  - name: Dystrio Sculpt (Mistral-7B Conservative)
    results:
      - task:
          type: text-generation
        dataset:
          name: WikiText-103 (validation)
          type: wikitext
        metrics:
          - name: perplexity
            type: perplexity
            value: 11.0557
---

What is Dystrio Sculpt?

Dystrio Sculpt produces dense compiled variants of existing models that:

  • reduce memory footprint
  • improve prefill throughput
  • remain runtime-agnostic
  • require no custom kernels
  • load with standard HuggingFace Transformers

Key Results

Compared to mistralai/Mistral-7B-v0.1 baseline on an A100 80GB:

  • Weights memory: -11% (Conservative) / -23% (Balanced)
  • RAG latency (TTFT p95): -7% / -14%
  • Decode throughput: ~flat
  • No runtime changes: no custom kernels, no new ops, standard transformers loading

Notes: TTFT includes prefill + first decode step. “Weights memory” is computed from parameter sizes (GiB) and is workload-independent.
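Because the weights figure is just parameter count times bytes per parameter, it can be reproduced without loading the model. A minimal sketch, assuming the commonly cited Mistral-7B-v0.1 parameter count (~7.24B) and bf16 storage (2 bytes per parameter):

```python
# Workload-independent weights memory: parameter count * element size.
# The parameter count below is an assumption (published Mistral-7B-v0.1
# figure), not taken from this card; bf16 stores 2 bytes per parameter.
N_PARAMS = 7_241_732_096
BYTES_PER_PARAM = 2  # bf16

weights_gib = N_PARAMS * BYTES_PER_PARAM / 2**30
print(f"{weights_gib:.6f} GiB")  # ≈ 13.488777, the baseline Weights (GiB) figure
```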

Benchmark Results

| Model | PPL | PPL Ratio | RAG TTFT p95 (ms) | Chat Decode p95 (ms/tok) | Prefill TPS | Decode TPS | Weights (GiB) | Post-load (GiB) | End-of-bench (GiB) | Peak (GiB) |
|---|---|---|---|---|---|---|---|---|---|---|
| mistral-7b-v0.1 (baseline) | 11.0557 | 1.0 | 158.357 | 33.096 | 7661.1 | 30.9 | 13.488777 | 13.488778 | 13.5 | 14.15 |
| sculpt-conservative | 12.4484 | 1.126 | 147.31 | 34.169 | 8296.3 | 30.2 | 11.988777 | 11.996713 | 12.0 | 12.63 |
| sculpt-balanced | 19.5153 | 1.7652 | 135.959 | 33.302 | 9175.1 | 30.7 | 10.395027 | 10.402963 | 10.4 | 11.02 |
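The headline percentages in Key Results follow directly from this table. A quick check, with the values copied from the rows above:

```python
# Recompute the Key Results deltas from the benchmark table values.
baseline = {"ppl": 11.0557, "ttft_ms": 158.357, "weights_gib": 13.488777}
conservative = {"ppl": 12.4484, "ttft_ms": 147.310, "weights_gib": 11.988777}
balanced = {"ppl": 19.5153, "ttft_ms": 135.959, "weights_gib": 10.395027}

def delta_pct(variant, key):
    """Relative change vs. baseline, in percent (negative = reduction)."""
    return 100 * (variant[key] - baseline[key]) / baseline[key]

print(round(conservative["ppl"] / baseline["ppl"], 3))  # 1.126 (PPL ratio)
print(round(delta_pct(conservative, "weights_gib")))    # -11 (% weights memory)
print(round(delta_pct(balanced, "weights_gib")))        # -23
print(round(delta_pct(conservative, "ttft_ms")))        # -7  (% TTFT p95)
print(round(delta_pct(balanced, "ttft_ms")))            # -14
```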

Benchmark Environment

  • GPU: NVIDIA A100-SXM4-80GB
  • dtype: bf16
  • Torch: 2.10.0+cu128
  • Transformers: 5.2.0
  • Deterministic: False
  • Seed: 0
  • Single-GPU, Hugging Face Transformers, no custom kernels.

Metric Definitions

  • TTFT incl. prefill: Wall time from prompt submission to first generated token (prefill forward + first decode step). Per-prompt request-level measurement.
  • First decode step: Wall time of the first decode forward call only (post-prefill). Per-prompt request-level measurement.
  • Prefill/Decode TPS: Throughput from batched microbenchmark iterations (not request-level; used for throughput comparison only).
  • Weights (GiB): Model parameter memory only (sum of numel * element_size for all parameters). Deterministic and runtime-independent.
  • Post-load (GiB): torch.cuda.memory_allocated() immediately after model.eval() + torch.cuda.empty_cache(). Captures weights + framework overhead before any inference.
  • End-of-bench (GiB): torch.cuda.memory_allocated() at end of benchmark workload. Includes KV-cache and activations still held.
  • Peak (GiB): torch.cuda.max_memory_allocated() during benchmark. High-water mark for planning GPU headroom.
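The split between "TTFT incl. prefill" and "First decode step" can be illustrated with a stub request. Note that `prefill` and `decode_step` below are hypothetical stand-ins (sleep placeholders), not the actual benchmark harness:

```python
import time

def prefill(prompt):
    """Stand-in for the prefill forward pass over the full prompt."""
    time.sleep(0.01)  # placeholder for real model work

def decode_step():
    """Stand-in for a single decode forward call producing one token."""
    time.sleep(0.002)  # placeholder for real model work

# Request-level measurement: TTFT spans prefill forward + first decode step,
# while "first decode step" times only the post-prefill decode call.
t0 = time.perf_counter()
prefill("example prompt")
t1 = time.perf_counter()
decode_step()
t2 = time.perf_counter()

ttft_ms = (t2 - t0) * 1000          # "TTFT incl. prefill"
first_decode_ms = (t2 - t1) * 1000  # "First decode step" only
print(f"TTFT: {ttft_ms:.1f} ms, first decode step: {first_decode_ms:.1f} ms")
```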