How to reproduce Q4_X
Hi, if it's not a secret, could you please describe how to reproduce your "full quality" Q4_X quant?
I would like to try to do it myself, and then apply the same method to produce a Q4_X of YuanLabAI/Yuan3.0-Ultra-int4. Nobody seems to be interested in doing that :(
Hi, sure. The first thing you need is a small patch to ggml/src/ggml-quants.c, around line 90:
- const float d = max / -8;
+ const float d = max / -7;
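To see what that divisor change does, here is a toy Python sketch of the Q4_0 block quantization, simplified from ggml's reference implementation (this is an illustration, not the exact ggml code): the scale d is derived from the signed value of largest magnitude in each 32-element block. With the upstream -8 divisor, only that max-magnitude value lands exactly on the quantization grid, and the opposite extreme gets clipped to 7 levels; with -7, the grid is symmetric and both extremes round-trip exactly, at the cost of a slightly coarser step.

```python
import numpy as np

def q4_0_roundtrip(block, divisor):
    """Quantize a block of 32 floats to 4-bit levels and back.

    Simplified sketch of ggml's Q4_0 scheme: the scale is the
    signed max-magnitude value divided by `divisor`
    (-8 upstream, -7 with the patch above).
    """
    # signed value with the largest magnitude in the block
    amax_idx = np.argmax(np.abs(block))
    max_val = block[amax_idx]
    d = max_val / divisor
    if d == 0:
        return np.zeros_like(block)
    # representable multiples of d are -8..7 (stored as 0..15 with offset 8)
    q = np.clip(np.round(block / d), -8, 7)
    return q * d

rng = np.random.default_rng(0)
block = rng.standard_normal(32).astype(np.float32)

for divisor in (-8, -7):
    err = np.abs(q4_0_roundtrip(block, divisor) - block).max()
    print(f"divisor {divisor}: max abs round-trip error {err:.4f}")
```

With divisor -8 a block containing both +max and -max reconstructs -max as -0.875*max (clipped to level 7), while with -7 both extremes come back exactly; jukofyork's linked comment reports that this trades a small amount of grid resolution for a measurable quality win on these models.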
That change is based on @jukofyork's findings, as documented here: https://github.com/ggml-org/llama.cpp/pull/17064#issuecomment-3520544778
Once you rebuild llama.cpp with that patch, run convert_hf_to_gguf.py to produce the 2.1TB BF16 GGUF.
Then it's a matter of running:

llama-quantize --tensor-type ffn_up_exps=Q4_0 --tensor-type ffn_gate_exps=Q4_0 --tensor-type ffn_down_exps=Q4_0 Kimi-K2.5-BF16.gguf Kimi-K2.5-Q4_X.gguf Q8_0

which quantizes the routed expert tensors to Q4_0 and keeps the rest of the model at Q8_0. An imatrix shouldn't be needed to make the Q4_X quant.