How to reproduce Q4_X

#8 opened by johnsmithxx

Hi, if it's not a secret, could you please describe how to reproduce your "full quality" Q4_X quant?

I would like to try doing it myself, and then apply the same method to produce a Q4_X of YuanLabAI/Yuan3.0-Ultra-int4. Nobody seems to be interested in doing that :(

Hi, sure. The first thing you need is a small patch to ggml/src/ggml-quants.c, around line 90:

-        const float d  = max / -8;
+        const float d  = max / -7;

That change is based on @jukofyork's findings, as documented here: https://github.com/ggml-org/llama.cpp/pull/17064#issuecomment-3520544778
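To see what that does numerically (this is just my own illustration, not the real ggml code): Q4_0 stores each block of 32 weights as 4-bit codes q in [0, 15] plus an fp16 scale d, dequantized as d * (q - 8), and the scale is derived from the block's signed largest-magnitude weight ("max" above). A quick round-trip of that one weight shows the difference between dividing by -8 and by -7:

#include <math.h>
#include <stdio.h>

int main(void) {
    const float max = -1.0f;                 // example block max (signed largest-magnitude weight)

    const float d8 = max / -8;               // upstream scale
    const float d7 = max / -7;               // patched scale

    // roughly mirroring the reference quantizer's rounding: q = (int)min(15, x/d + 8.5)
    const int q8 = (int)fminf(15.0f, max / d8 + 8.5f);   // -> code 0
    const int q7 = (int)fminf(15.0f, max / d7 + 8.5f);   // -> code 1

    printf("d = max/-8: q = %d, recon = %f\n", q8, d8 * (q8 - 8));   // -1.000000
    printf("d = max/-7: q = %d, recon = %f\n", q7, d7 * (q7 - 8));   // -1.000000
    return 0;
}

Both scales reconstruct the largest weight exactly; the difference is that with max / -8 it lands on code 0 (value -8), so the grid extends one extra step on that side, while with max / -7 the codes actually used form a symmetric ±7 grid. See the linked comment for the reasoning behind the change.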

Once you compile llama.cpp with that patch, you can run convert_hf_to_gguf.py to produce the 2.1 TB BF16 GGUF.
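For reference, the conversion step would look something like this (the model directory path is just a placeholder for wherever you downloaded the original weights):

python convert_hf_to_gguf.py /path/to/Kimi-K2.5 --outtype bf16 --outfile Kimi-K2.5-BF16.gguf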

Then it's a matter of running llama-quantize --tensor-type ffn_up_exps=Q4_0 --tensor-type ffn_gate_exps=Q4_0 --tensor-type ffn_down_exps=Q4_0 Kimi-K2.5-BF16.gguf Kimi-K2.5-Q4_X.gguf Q8_0, which quantizes the routed experts to Q4_0 and keeps the rest of the model at Q8_0. An imatrix shouldn't be needed to make the Q4_X quant.
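Each --tensor-type takes a name pattern, so ffn_up_exps / ffn_gate_exps / ffn_down_exps should match the merged routed-expert tensors in every layer, while everything else falls through to the default Q8_0. The same command split across lines:

llama-quantize \
    --tensor-type ffn_up_exps=Q4_0 \
    --tensor-type ffn_gate_exps=Q4_0 \
    --tensor-type ffn_down_exps=Q4_0 \
    Kimi-K2.5-BF16.gguf Kimi-K2.5-Q4_X.gguf Q8_0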
