rdtand committed
Commit d6829ee · verified · 1 Parent(s): b245dda

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +11 -0
README.md CHANGED
@@ -40,6 +40,17 @@ Measured with vLLM 0.19.1 + FlashInfer 0.6.7, CUTLASS W4A4 backend, no MTP:
 
  The speedup comes from eliminating ~5 GB of BF16 weight loads per token for the DeltaNet layers, replacing them with ~1.4 GB of FP4 loads.
 
+ ## Quality benchmarks (0-shot, 200-sample subsets)
+
+ | Benchmark | Metric | This checkpoint | BF16 typical | Recovery |
+ |---|---|---|---|---|
+ | **ARC-Challenge** | acc_norm | **63.5%** | ~66% | ~96% |
+ | **HellaSwag** | acc_norm | **74.0%** | ~78% | ~95% |
+ | **TruthfulQA MC2** | acc | **54.2%** | ~55% | ~99% |
+ | **Winogrande** | acc | **51.5%** | ~52% | ~99% |
+
+ 95-99% quality recovery across knowledge and reasoning benchmarks. Quantizing the DeltaNet linear attention layers to FP4 is near-lossless.
+
  ## Quantization details
 
  - **Method**: llm-compressor `oneshot` with calibrated NVFP4 (W4A4)
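
As a sanity check, the "~5 GB BF16 → ~1.4 GB FP4" bandwidth figures in the diff can be reproduced with back-of-envelope arithmetic. This sketch assumes (these details are not stated in the commit) BF16 at 2 bytes per parameter and an NVFP4 layout of 4-bit values plus one 1-byte FP8 scale per 16-element block:

```python
# Back-of-envelope check of the "~5 GB BF16 -> ~1.4 GB FP4" claim.
# Assumptions (not from this commit): BF16 = 2 bytes/param; NVFP4 stores
# 4-bit values plus one 1-byte (FP8) scale per 16-element block.
bf16_bytes = 5e9                     # ~5 GB of DeltaNet weights read per token
params = bf16_bytes / 2              # BF16 is 2 bytes/param -> 2.5e9 params
fp4_bytes = params * (0.5 + 1 / 16)  # 0.5 B/value + amortized block scale
print(f"{fp4_bytes / 1e9:.2f} GB")   # ~1.41 GB, consistent with ~1.4 GB
```

The amortized scale overhead (1/16 byte per value) is what pushes the total from a naive 1.25 GB up to roughly 1.4 GB.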
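
For intuition about what the FP4 weight format does to individual values, here is a minimal sketch of per-block E2M1 quantization in the style of NVFP4. The block size, max-abs scaling, and round-to-nearest policy are illustrative assumptions, not a description of llm-compressor's internals:

```python
# Illustrative per-block FP4 (E2M1) quantize/dequantize round trip.
# Assumptions: max-abs block scaling, round-to-nearest; real NVFP4 uses
# 16-element blocks with FP8 scales and a second per-tensor scale.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 magnitudes

def quantize_block(block, grid=E2M1):
    """Scale the block so its max |w| maps to 6.0, round each value to the
    nearest signed grid point, and return the dequantized block."""
    scale = max(abs(w) for w in block) / grid[-1] or 1.0  # avoid div-by-zero
    out = []
    for w in block:
        mag = min(grid, key=lambda g: abs(abs(w) / scale - g))
        out.append((mag if w >= 0 else -mag) * scale)
    return out

weights = [0.11, -0.32, 0.05, 0.6, -0.48, 0.0, 0.24, -0.07]
print(quantize_block(weights))
```

Each dequantized value lands on one of only 15 signed grid points per block, which is why calibration matters: the per-block scale determines how much rounding error the 4-bit grid introduces.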