Measured with vLLM 0.19.1 + FlashInfer 0.6.7, CUTLASS W4A4 backend, no MTP:

The speedup comes from eliminating ~5 GB of BF16 weight loads per token for the DeltaNet layers, replacing them with ~1.4 GB of FP4 loads.
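As a back-of-envelope check of that claim (a minimal sketch using only the ~5 GB and ~1.4 GB figures quoted above, under the usual assumption that single-token decode is memory-bandwidth-bound for these layers):

```python
# Assumption: decode latency for the DeltaNet layers scales with the bytes
# of weights streamed from HBM per token, so the speedup ceiling is the
# ratio of BF16 to FP4 weight traffic.
bf16_bytes_per_token = 5.0e9   # ~5 GB of BF16 DeltaNet weights read per token
fp4_bytes_per_token = 1.4e9    # ~1.4 GB after NVFP4 (4-bit weights plus scales)

reduction = bf16_bytes_per_token / fp4_bytes_per_token
print(f"DeltaNet weight traffic shrinks ~{reduction:.1f}x per token")
```

The ratio comes out below the naive 4x (BF16 is 2 bytes/weight, FP4 is 0.5) because NVFP4 also stores per-block scale factors.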
## Quality benchmarks (0-shot, 200-sample subsets)

| Benchmark | Metric | This checkpoint | BF16 typical | Recovery |
|---|---|---|---|---|
| **ARC-Challenge** | acc_norm | **63.5%** | ~66% | ~96% |
| **HellaSwag** | acc_norm | **74.0%** | ~78% | ~95% |
| **TruthfulQA MC2** | acc | **54.2%** | ~55% | ~99% |
| **Winogrande** | acc | **51.5%** | ~52% | ~99% |

95-99% quality recovery across knowledge and reasoning benchmarks. Quantizing the DeltaNet linear attention layers to FP4 is near-lossless.
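Scores of this shape are typically produced with EleutherAI's lm-evaluation-harness; the sketch below shows one plausible way to rerun the same 0-shot, 200-sample configuration. It is an assumption, not the author's script: the checkpoint path is a placeholder, and the `hf` backend and task names are the harness's defaults rather than anything stated in this card.

```python
# Hedged reproduction sketch (requires lm-eval and a GPU with the checkpoint
# downloaded; "<this-checkpoint>" is a placeholder, not a real path).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<this-checkpoint>,dtype=auto",
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2", "winogrande"],
    num_fewshot=0,   # 0-shot, matching the table above
    limit=200,       # 200-sample subsets, matching the table above
)
print(results["results"])
```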
## Quantization details

- **Method**: llm-compressor `oneshot` with calibrated NVFP4 (W4A4)