Mixed Precision GGUF layer quantization of gemma-4-E4B-it by Google

Original model: https://huggingface.co/google/gemma-4-E4B-it

The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. The quants employed are all K quants, to avoid slow processing of IQ quants on CPUs or older GPUs. For this file the layer quants are as follows:

Q6_K_S : Q6_K
Q6_K_M : attn_v = q8_0, ffn_d = q8_0
Q6_K_L : attn_v = q8_0, attn_o = q8_0, ffn_d = q8_0

   LAYER_TYPES='[
   [0 ,"Q6_K_M"],[1 ,"Q6_K_S"],[2 ,"Q6_K_S"],[3 ,"Q6_K_S"],[4 ,"Q6_K_S"],[5 ,"Q6_K_S"],[6 ,"Q6_K_S"],[7 ,"Q6_K_S"],
   [8 ,"Q6_K_S"],[9 ,"Q6_K_S"],[10,"Q6_K_S"],[11,"Q6_K_S"],[12,"Q6_K_S"],[13,"Q6_K_S"],[14,"Q6_K_S"],[15,"Q6_K_S"],
   [16,"Q6_K_S"],[17,"Q6_K_S"],[18,"Q6_K_S"],[19,"Q6_K_S"],[20,"Q6_K_S"],[21,"Q6_K_S"],[22,"Q6_K_S"],[23,"Q6_K_S"],
   [24,"Q6_K_S"],[25,"Q6_K_S"],[26,"Q6_K_S"],[27,"Q6_K_S"],[28,"Q6_K_S"],[29,"Q6_K_S"],[30,"Q6_K_S"],[31,"Q6_K_S"],
   [32,"Q6_K_S"],[33,"Q6_K_S"],[34,"Q6_K_S"],[35,"Q6_K_S"],[36,"Q6_K_S"],[37,"Q6_K_S"],[38,"Q6_K_S"],[39,"Q6_K_S"],
   [40,"Q6_K_L"],[41,"Q8_0"]
   ]'
   FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high"
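
As a sketch, the map can be applied at quantization time roughly as follows. This assumes a downstream llama-quantize build that understands the LAYER_TYPES variable and the --layer-types-high flag (neither is in mainline llama.cpp); the BF16 source filename is a placeholder:

```shell
# Hypothetical quantization run using the per-layer type map defined above.
# LAYER_TYPES and --layer-types-high are downstream extensions, not mainline flags.
llama-quantize $FLAGS \
    gemma-4-E4B-it.BF16.gguf \
    gemma-4-E4B-it.Q6_K_H.gguf \
    Q6_K
```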

This quant uses a minimum of Q6_K_S across layers, Q6_K embeddings, and a strong Q8_0 final layer. The quant comes close to 100% accuracy using nothink across a set of reasoning eval prompts. It was specifically optimized for numerical stability (consistent answers) with and without speculation. It is slightly larger than Q6_K.

Comparison:

Quant     Size    PPL    Comment
Q6_K      6.2e9   37.7   modified PPL, see discussion below
Q6_K_H    6.3e9   37.5   modified PPL, ~ Q6_K size

Usage:

gemma-4-E4B-it is a vision- and audio-capable dense RL-trained model. Used together with its multimedia projector layers, it can process image, audio, and text inputs and generate text outputs. The mmproj file is available in this repository.
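
For example, an image can be run through the model together with the projector using llama.cpp's multimodal CLI (a sketch; the image path is a placeholder):

```shell
# Vision inference with the quant plus its multimedia projector (mmproj).
llama-mtmd-cli \
    -m gemma-4-E4B-it.Q6_K_H.gguf \
    --mmproj gemma-4-E4B-it.mmproj.gguf \
    --image input.png \
    -p "Describe this image."
```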

Thinking:

By default the model will not generate an RL reasoning block. To make it use reasoning, specify a system prompt with:

<|think|>

as the first token. This is a special token in the model vocab and must be tokenized as such to work. No text other than the think token is needed in the system prompt to get the model to fill in the RL block, though other text can be added if desired. When this is done the model will output a formatted think block prior to its final answer:

<|channel>thought
...
<channel|>

The model was found to be highly capable on reasoning tasks even when skipping the think block.
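
A minimal sketch of requesting the think block with llama-cli, assuming the inference stack parses special tokens in the system prompt (the question is just an example):

```shell
# Sketch: enable the RL reasoning block by passing the think token
# alone as the system prompt. <|think|> must tokenize as one special token.
llama-cli \
    -m gemma-4-E4B-it.Q6_K_H.gguf \
    -sys "<|think|>" \
    -p "Is 391 prime?"
```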

Running:

The model is compatible with speculation. A recommended low-overhead speculator is gemma-3-270m-it-256k. To use this speculator the inference platform must support dynamic vocab translation between the draft and target models. Approximate performance with a 9900K CPU and a 4070 GPU, using a custom downstream speculator with fixed draft block size ND:

CONFIG (no vision tower)          QKV    NKV    gen tps   pp tps (batch 128)
ND=0                              F16    128k   86        ~2538
ND=2                              F16    128k   105
no draft, SWA off (swa-full)      F16    128k   86
ND=2, SWA off (swa-full)          Q8_0   128k   101
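
A speculative run along these lines can be sketched with mainline llama.cpp draft-model flags (the draft filename is a placeholder). Note that mainline lacks the dynamic vocab translation mentioned above, so there the draft and target must share a vocab:

```shell
# Sketch: speculative decoding with the gemma-3-270m-it-256k draft model.
# --draft-max bounds the draft block size (ND in the table above).
llama-server \
    -m gemma-4-E4B-it.Q6_K_H.gguf \
    -md gemma-3-270m-it-256k.Q8_0.gguf \
    --draft-max 2 --draft-min 1 \
    -ngl 99
```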

The Q6_K_H quant passed two long-context tests running fully offloaded on a 4070, showing fast prompt processing for 100k+ token prompts. It correctly handles the 108k-token prompt https://thireus.com/REDDIT/Qwen3_Runescape_Massive_Prompt.txt and an 85k-token needle-in-a-haystack prompt.

The sliding window attention (SWA) used by the model can be turned off with --swa-full in llama.cpp to enable advanced inference modes that need full context memory beyond the sliding window, at the tradeoff of using much more VRAM.

Vision:

The model was tested in vision mode on a couple of fairly difficult bird ID images and found to exhibit poor performance in both think and nothink modes. The model did a great job on text-based image prompts though.

Code:

The model can generate working code on simple prompts. The selected quant was not robust at generating syntax-error-free code on more complicated code prompts. Other mixed-precision quants tested performed better on code but were found to be inconsistent when run with or without speculation, so the party-trick code tests were deprioritized for the final Q6_K_H quant in favor of better generation consistency with reasoning. The model's performance was found to be quite sensitive to quant choice, so it was triaged for consistent generation with and without speculation.

Llama.cpp inference/issues:

Llama.cpp is evolving rapidly to support the wide range of features in gemma-4, so the latest version available should be used. Audio worked mostly correctly (with some extraneous chat-template garbage in the output) up to b8775, but at b8775 something more major broke with audio: the beginning of the audio is now chopped off. Use of audio with the model is therefore not recommended with llama.cpp until/if the audio inference issues are corrected.

The model cannot compute a valid perplexity due to the generation-formatting tokens used in the instruction tune, such as

<|channel>

etc., which skew the token probability distributions for text sequences that do not conform to the chat template. These formatting-token biases seem to have originated with GPT-OSS and have now spread into gemma 4.

To work around this problem, a modified perplexity is computed by overwriting the beginning of each perplexity chunk with an output format token as follows:

      # chunk is a string of text to eval perplexity on
      injects='<channel|>'
      chunk="${injects}${chunk:${#injects}}"

This is just an empirical hack to give a more meaningful number for relative comparisons across quants of the model.
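
The splice preserves chunk length: exactly the first ${#injects} characters are overwritten by the token rather than prepended. A self-contained bash illustration on a sample string:

```shell
#!/bin/bash
# Overwrite the first ${#injects} characters of a sample chunk with the
# format token; total length is unchanged since the token replaces, not prepends.
injects='<channel|>'
chunk='The quick brown fox jumps over the lazy dog'
patched="${injects}${chunk:${#injects}}"
echo "$patched"   # -> <channel|>brown fox jumps over the lazy dog
```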

Logprobs over the beginning part of the perplexity prompt are skipped using a modified downstream llama.cpp server to compute perplexity. Discussion for gemma 4 at: https://github.com/ggml-org/llama.cpp/issues/21388

Benchmarks:

A full set of math, vision, and audio benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm

Download the files from below:

Link                          Type      Size/e9 B   Notes
gemma-4-E4B-it.Q6_K_H.gguf    Q6_K_H    6.26        ~Q6_K size
gemma-4-E4B-it.mmproj.gguf    F16       0.99        multimedia projector

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
