Upload README.md with huggingface_hub
README.md CHANGED
@@ -65,7 +65,7 @@ Both the format and the prune set are priced in the same knapsack via REAP-style

$$S_j = \frac{1}{T_{\text{cal}}} \sum_t g_j(t) \cdot \lVert f_j(t) \rVert_2^2$$

-This is the **dropout-loss estimate** from the
+This is the **dropout-loss estimate** from the REAP family of MoE expert-importance scores: how much the layer's output norm drops in expectation when expert `j` is removed, weighted by the gradient signal flowing through that expert. Sum across experts and you get a per-(router, expert) score in Δloss units, directly comparable to the quantization Δloss.

Per-layer prune candidates emit `floor(R · num_experts)` lowest-S experts at each ratio R; the DP picks (R, format) jointly. After the Pareto sweep, prismaquant produces a **uniform-kept** prune manifest so vLLM's MoE kernel sees a single `num_local_experts` per layer (this artifact: 176 of 256 kept everywhere).

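For readers following along, here is a minimal PyTorch sketch of the saliency score the hunk above describes. The tensor layout and the function name are assumptions for illustration, not prismaquant's actual API:

```python
# Sketch of S_j = (1/T_cal) * sum_t g_j(t) * ||f_j(t)||_2^2.
# Shapes and names are illustrative, not prismaquant's real interface.
import torch

def expert_saliency(f: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """f: [T_cal, num_experts, d_model] expert outputs f_j(t);
    g: [T_cal, num_experts] gradient signal g_j(t).
    Returns one score per expert, in Δloss units."""
    t_cal = f.shape[0]
    sq_norms = f.pow(2).sum(dim=-1)            # ||f_j(t)||_2^2 -> [T_cal, E]
    return (g * sq_norms).sum(dim=0) / t_cal   # S_j -> [num_experts]
```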
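And a sketch of the candidate-emission step from the same hunk, assuming the scores above. The ratio list is hypothetical (0.3125 · 256 = 80 dropped, i.e. the 176-of-256 kept in this artifact), and `prune_candidates` is not a real prismaquant function:

```python
# Per-layer candidate emission: at each ratio R, drop the
# floor(R * num_experts) lowest-S experts; ratio values are hypothetical.
import math

def prune_candidates(scores, ratios=(0.125, 0.25, 0.3125)):
    num_experts = len(scores)
    by_saliency = sorted(range(num_experts), key=lambda j: scores[j])
    candidates = {}
    for r in ratios:
        n_drop = math.floor(r * num_experts)
        candidates[r] = sorted(by_saliency[n_drop:])  # kept expert ids
    return candidates
```

The DP then prices each (R, format) pair jointly; because every layer ends up keeping the same count, the resulting manifest exposes a single `num_local_experts` to vLLM.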
@@ -215,7 +215,7 @@ Full source + reproduction notes: <https://github.com/RobTand/prismaquant>

- [MiniMaxAI](https://huggingface.co/MiniMaxAI) — source model.
- [vLLM](https://github.com/vllm-project/vllm) — compressed-tensors serving stack with native NVFP4 + FP8 MoE kernels.
-
+- REAP-style per-expert dropout-loss saliency.
- HAQ / HAWQ-V1/V2/V3 (Wang, Dong, Yao, et al.) — mixed-precision allocation foundations.
- GPTQ (Frantar et al. 2022), AutoRound — per-Linear quantizer building blocks.
