Qwen3.5-27B-GGUF-4.151bpw
This is a 4.151 BPW quantized model for the GPU-poor with 16 GiB of VRAM. It uses the state-of-the-art IQK quants and therefore runs only in ik_llama.cpp.
From local testing with llama-perplexity, it holds up nicely against the quants tested in https://www.reddit.com/r/LocalLLaMA/comments/1rk5qmr/qwen3527b_q4_quantization_comparison/, while being significantly smaller.
There are two variants: one without an imatrix, and one with the imatrix from mradermacher.
With 16 GiB of VRAM, a context size of 72000 fits with a quantized KV cache:

```
-c 72000 -ctk q8_0 -ctv q8_0 -khad
```

or a context size of 100000 with a more heavily quantized KV cache:

```
-c 100000 -ctk q6_0 -ctv q5_0 -khad
```
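As a rough sanity check on those context sizes, the KV cache footprint can be estimated per token. The sketch below is a back-of-the-envelope estimate under assumed values: the attention-layer count, KV-head count, and head dimension are placeholders (as a hybrid model, only some layers carry a KV cache), and the effective bits per element for q8_0/q6_0/q5_0 are approximations that include block scales:

```python
# Back-of-the-envelope KV cache size. All model dimensions below are
# PLACEHOLDERS for illustration, not the real Qwen3.5-27B config;
# read the actual values from the GGUF metadata.
def kv_cache_gib(n_ctx, n_attn_layers=16, n_kv_heads=8, head_dim=128,
                 bpw_k=8.5, bpw_v=8.5):
    """Approximate KV cache size in GiB.

    q8_0 stores 8-bit values plus a scale per 32-element block
    (~8.5 bits/element); q6_0 and q5_0 are roughly 6.5 and 5.5.
    """
    bits_per_token = n_attn_layers * n_kv_heads * head_dim * (bpw_k + bpw_v)
    return n_ctx * bits_per_token / 8 / 1024**3

print(f"{kv_cache_gib(72000):.2f} GiB")                         # q8_0/q8_0
print(f"{kv_cache_gib(100000, bpw_k=6.5, bpw_v=5.5):.2f} GiB")  # q6_0/q5_0
```

With these placeholder dimensions, both caches land in the low-gibibyte range, leaving room for the ~13 GiB model in 16 GiB of VRAM; the real footprint depends on the actual attention configuration.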
Size
Size from llama-server output:

```
llm_load_print_meta: model size = 12.999 GiB (4.151 BPW)
llm_load_print_meta: repeating layers = 11.667 GiB (4.115 BPW, 24.353 B parameters)
...
llm_load_tensors: CUDA_Host buffer size = 682.03 MiB
llm_load_tensors: CUDA0 buffer size = 12628.54 MiB
```
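As a quick consistency check, the reported size and BPW back out to roughly the expected parameter count:

```python
# 12.999 GiB at 4.151 bits per weight should correspond to ~27B weights.
size_bits = 12.999 * 1024**3 * 8   # reported model size in bits
params = size_bits / 4.151         # total quantized weights
print(f"{params / 1e9:.1f}B parameters")  # matches a ~27B model
```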
Quality
Recipe
```
blk\..*\.attn_q\.weight=iq4_k
blk\..*\.attn_k\.weight=iq5_ks
blk\..*\.attn_v\.weight=iq5_ks
blk\..*\.attn_output\.weight=iq5_ks
blk\..*\.attn_gate\.weight=iq4_k
blk\..*\.attn_qkv\.weight=iq4_k
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=iq5_ks
blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|18|21|24|27|30|33|36|39|42|45|48|51|54|57|60|63)\.ffn_(down|gate|up)\.weight=iq4_ks
blk\..*\.ffn_(down|gate|up)\.weight=iq3_k
token_embd\.weight=iq4_k
output\.weight=iq4_k
```
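These recipe lines are ordered regex→type rules matched against tensor names, so the per-layer ffn line overrides the generic iq3_k fallback for the listed layers. A minimal sketch of this first-match-wins resolution (the rule subset and tensor names are illustrative, not how ik_llama.cpp parses the recipe internally):

```python
import re

# A small subset of the recipe above, in order; the first full match wins.
RULES = [
    (r"blk\.(0|3|15|18)\.ffn_(down|gate|up)\.weight", "iq4_ks"),  # listed layers
    (r"blk\..*\.ffn_(down|gate|up)\.weight", "iq3_k"),            # fallback
    (r"blk\..*\.attn_q\.weight", "iq4_k"),
]

def quant_for(tensor_name):
    """Return the quant type assigned by the first matching rule."""
    for pattern, qtype in RULES:
        if re.fullmatch(pattern, tensor_name):
            return qtype
    return "unmatched"

print(quant_for("blk.3.ffn_down.weight"))   # listed layer  -> iq4_ks
print(quant_for("blk.20.ffn_down.weight"))  # unlisted layer -> iq3_k fallback
```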
PPL/KLD/RMS results with wikitext2_test.txt (no imatrix):

```
Mean PPL(Q) : 7.094736 ± 0.049854
Mean PPL(base) : 6.799430 ± 0.046581
Cor(ln(PPL(Q)), ln(PPL(base))): 97.09%
...
Mean KLD: 0.108381 ± 0.002473
...
RMS Δp : 7.279 ± 0.082 %
Same top p: 91.501 ± 0.073 %
```
PPL/KLD/RMS results with wikitext2_test.txt (with imatrix from mradermacher):

```
Mean PPL(Q) : 6.603296 ± 0.044188
Mean PPL(base) : 6.799430 ± 0.046581
Cor(ln(PPL(Q)), ln(PPL(base))): 97.50%
...
Mean KLD: 0.083600 ± 0.002149
...
RMS Δp : 6.657 ± 0.083 %
Same top p: 92.501 ± 0.069 %
```
In general, llama-perplexity results are better with the imatrix, but there is a chance that the imatrix causes an unexpected token to be chosen in actual tasks (see https://huggingface.co/ubergarm/GLM-4.5-GGUF/discussions/3).
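For intuition on the KLD figures above: mean KLD averages, over the evaluated tokens, the KL divergence between the base model's and the quant's next-token distributions, so smaller means the quant's predictions track the base model more closely. A toy illustration with made-up distributions (not taken from the model):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a tiny 4-token vocabulary;
# the quant slightly redistributes probability mass, giving a small KLD.
base  = [0.70, 0.20, 0.07, 0.03]
quant = [0.65, 0.24, 0.08, 0.03]

print(f"{kl_divergence(base, quant):.4f}")
```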
Base model: Qwen/Qwen3.5-27B