Qwen3.5-27B-GGUF-4.151bpw
This is a 4.151 BPW quantized model for the GPU-poor with 16 GiB of VRAM. It uses the state-of-the-art IQK quants and therefore runs only in ik_llama.cpp.
From local testing with llama-perplexity, it holds up nicely against the quants tested in https://www.reddit.com/r/LocalLLaMA/comments/1rk5qmr/qwen3527b_q4_quantization_comparison/, while being significantly smaller.
There are two variants: one without an imatrix, and one with the imatrix from mradermacher.
With 16 GiB of VRAM, a context size of 72000 fits with a quantized KV cache:

```
-c 72000 -ctk q8_0 -ctv q8_0 -khad
```

or a context size of 100000 with a more heavily quantized KV cache:

```
-c 100000 -ctk q6_0 -ctv q5_0 -khad
```
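As a rough sanity check on those context sizes, the KV cache footprint can be estimated per token. The sketch below is a back-of-the-envelope estimate under assumed values: the attention-layer count, KV-head count, and head dimension are placeholders (as a hybrid model, only some layers carry a KV cache), and the effective bits per element for q8_0/q6_0/q5_0 are approximations that include block scales:

```python
# Back-of-the-envelope KV cache size. All model dimensions below are
# PLACEHOLDERS for illustration, not the real Qwen3.5-27B config;
# read the actual values from the GGUF metadata.
def kv_cache_gib(n_ctx, n_attn_layers=16, n_kv_heads=8, head_dim=128,
                 bpw_k=8.5, bpw_v=8.5):
    """Approximate KV cache size in GiB.

    q8_0 stores 8-bit values plus a scale per 32-element block
    (~8.5 bits/element); q6_0 and q5_0 are roughly 6.5 and 5.5.
    """
    bits_per_token = n_attn_layers * n_kv_heads * head_dim * (bpw_k + bpw_v)
    return n_ctx * bits_per_token / 8 / 1024**3

print(f"{kv_cache_gib(72000):.2f} GiB")                         # q8_0/q8_0
print(f"{kv_cache_gib(100000, bpw_k=6.5, bpw_v=5.5):.2f} GiB")  # q6_0/q5_0
```

With these placeholder dimensions, both caches land in the low-gibibyte range, leaving room for the ~13 GiB model in 16 GiB of VRAM; the real footprint depends on the actual attention configuration.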
Size
Size from llama-server output:

```
llm_load_print_meta: model size = 12.999 GiB (4.151 BPW)
llm_load_print_meta: repeating layers = 11.667 GiB (4.115 BPW, 24.353 B parameters)
...
llm_load_tensors: CUDA_Host buffer size = 682.03 MiB
llm_load_tensors: CUDA0 buffer size = 12628.54 MiB
```
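As a quick consistency check, the reported size and BPW back out to roughly the expected parameter count:

```python
# 12.999 GiB at 4.151 bits per weight should correspond to ~27B weights.
size_bits = 12.999 * 1024**3 * 8   # reported model size in bits
params = size_bits / 4.151         # total quantized weights
print(f"{params / 1e9:.1f}B parameters")  # matches a ~27B model
```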
Quality
Recipe
```
blk\..*\.attn_q\.weight=iq4_k
blk\..*\.attn_k\.weight=iq5_ks
blk\..*\.attn_v\.weight=iq5_ks
blk\..*\.attn_output\.weight=iq5_ks
blk\..*\.attn_gate\.weight=iq4_k
blk\..*\.attn_qkv\.weight=iq4_k
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=iq5_ks
blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|18|21|24|27|30|33|36|39|42|45|48|51|54|57|60|63)\.ffn_(down|gate|up)\.weight=iq4_ks
blk\..*\.ffn_(down|gate|up)\.weight=iq3_k
token_embd\.weight=iq4_k
output\.weight=iq4_k
```
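These recipe lines are ordered regex→type rules matched against tensor names, so the per-layer ffn line overrides the generic iq3_k fallback for the listed layers. A minimal sketch of this first-match-wins resolution (the rule subset and tensor names are illustrative, not how ik_llama.cpp parses the recipe internally):

```python
import re

# A small subset of the recipe above, in order; the first full match wins.
RULES = [
    (r"blk\.(0|3|15|18)\.ffn_(down|gate|up)\.weight", "iq4_ks"),  # listed layers
    (r"blk\..*\.ffn_(down|gate|up)\.weight", "iq3_k"),            # fallback
    (r"blk\..*\.attn_q\.weight", "iq4_k"),
]

def quant_for(tensor_name):
    """Return the quant type assigned by the first matching rule."""
    for pattern, qtype in RULES:
        if re.fullmatch(pattern, tensor_name):
            return qtype
    return "unmatched"

print(quant_for("blk.3.ffn_down.weight"))   # listed layer  -> iq4_ks
print(quant_for("blk.20.ffn_down.weight"))  # unlisted layer -> iq3_k fallback
```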
PPL/KLD/RMS results with wikitext2_test.txt (no imatrix):

```
Mean PPL(Q) : 7.094736 ± 0.049854
Mean PPL(base) : 6.799430 ± 0.046581
Cor(ln(PPL(Q)), ln(PPL(base))): 97.09%
...
Mean KLD: 0.108381 ± 0.002473
...
RMS Δp : 7.279 ± 0.082 %
Same top p: 91.501 ± 0.073 %
```
PPL/KLD/RMS results with wikitext2_test.txt (with imatrix from mradermacher):

```
Mean PPL(Q) : 6.603296 ± 0.044188
Mean PPL(base) : 6.799430 ± 0.046581
Cor(ln(PPL(Q)), ln(PPL(base))): 97.50%
...
Mean KLD: 0.083600 ± 0.002149
...
RMS Δp : 6.657 ± 0.083 %
Same top p: 92.501 ± 0.069 %
```
In general, llama-perplexity results are better with the imatrix, but there is a chance that the imatrix causes an unexpected token to be chosen in actual tasks (see https://huggingface.co/ubergarm/GLM-4.5-GGUF/discussions/3).
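For intuition on the KLD figures above: mean KLD averages, over the evaluated tokens, the KL divergence between the base model's and the quant's next-token distributions, so smaller means the quant's predictions track the base model more closely. A toy illustration with made-up distributions (not taken from the model):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a tiny 4-token vocabulary;
# the quant slightly redistributes probability mass, giving a small KLD.
base  = [0.70, 0.20, 0.07, 0.03]
quant = [0.65, 0.24, 0.08, 0.03]

print(f"{kl_divergence(base, quant):.4f}")
```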
Base model: Qwen/Qwen3.5-27B