Upload README.md with huggingface_hub
README.md CHANGED
@@ -65,7 +65,7 @@ Both the format and the prune set are priced in the same knapsack via REAP-style

$$S_j = \frac{1}{T_{\text{cal}}} \sum_t g_j(t) \cdot \lVert f_j(t) \rVert_2^2$$

-This is the **dropout-loss estimate** from the
+This is the **dropout-loss estimate** from the REAP family of MoE expert-importance scores: how much the layer's output norm drops in expectation when expert `j` is removed, weighted by the gradient signal flowing through that expert. Sum across experts and you get a per-(router, expert) score in Δloss units, directly comparable to the quantization Δloss.

Per-layer prune candidates emit `floor(R · num_experts)` lowest-S experts at each ratio R; the DP picks (R, format) jointly. After the Pareto sweep, prismaquant produces a **uniform-kept** prune manifest so vLLM's MoE kernel sees a single `num_local_experts` per layer (this artifact: 176 of 256 kept everywhere).

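For readers following along, here is a minimal PyTorch sketch of the saliency score the hunk above describes. The tensor layout and the function name are assumptions for illustration, not prismaquant's actual API:

```python
# Sketch of S_j = (1/T_cal) * sum_t g_j(t) * ||f_j(t)||_2^2.
# Shapes and names are illustrative, not prismaquant's real interface.
import torch

def expert_saliency(f: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """f: [T_cal, num_experts, d_model] expert outputs f_j(t);
    g: [T_cal, num_experts] gradient signal g_j(t).
    Returns one score per expert, in Δloss units."""
    t_cal = f.shape[0]
    sq_norms = f.pow(2).sum(dim=-1)            # ||f_j(t)||_2^2 -> [T_cal, E]
    return (g * sq_norms).sum(dim=0) / t_cal   # S_j -> [num_experts]
```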
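And a sketch of the candidate-emission step from the same hunk, assuming the scores above. The ratio list is hypothetical (0.3125 · 256 = 80 dropped, i.e. the 176-of-256 kept in this artifact), and `prune_candidates` is not a real prismaquant function:

```python
# Per-layer candidate emission: at each ratio R, drop the
# floor(R * num_experts) lowest-S experts; ratio values are hypothetical.
import math

def prune_candidates(scores, ratios=(0.125, 0.25, 0.3125)):
    num_experts = len(scores)
    by_saliency = sorted(range(num_experts), key=lambda j: scores[j])
    candidates = {}
    for r in ratios:
        n_drop = math.floor(r * num_experts)
        candidates[r] = sorted(by_saliency[n_drop:])  # kept expert ids
    return candidates
```

The DP then prices each (R, format) pair jointly; because every layer ends up keeping the same count, the resulting manifest exposes a single `num_local_experts` to vLLM.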
@@ -215,7 +215,7 @@ Full source + reproduction notes: <https://github.com/RobTand/prismaquant>

- [MiniMaxAI](https://huggingface.co/MiniMaxAI) — source model.
- [vLLM](https://github.com/vllm-project/vllm) — compressed-tensors serving stack with native NVFP4 + FP8 MoE kernels.
-
+- REAP-style per-expert dropout-loss saliency.
- HAQ / HAWQ-V1/V2/V3 (Wang, Dong, Yao, et al.) — mixed-precision allocation foundations.
- GPTQ (Frantar et al. 2022), AutoRound — per-Linear quantizer building blocks.
