mlx-community/gemma-4-e4b-it-qat-OptiQ-4bit

Built with mlx-optiq, the MLX-native toolkit for quantizing, fine-tuning, and serving LLMs locally on Apple Silicon, with no PyTorch and no cloud.

A 4-bit mixed-precision MLX quant produced by mlx-optiq, built on Google's quantization-aware-trained (QAT) Gemma-4 base. OptIQ's sensitivity-guided per-layer bit allocation is applied on top of weights that were trained to survive low-bit quantization, and it still beats a uniform 4-bit quant of the same QAT base by +1.19 Capability Score points.

This is a quant of google/gemma-4-E4B-it-qat-q4_0-unquantized. Per-layer bit-widths come from a KL-divergence sensitivity pass on a six-domain calibration mix (prose, reasoning, code, agent, tool-call, constraint-bearing instructions). Sensitive layers go to 8-bit, robust ones stay at 4-bit.
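
The allocation idea described above can be illustrated with a toy sketch. This is not mlx-optiq's implementation: the helper names (`kl_divergence`, `allocate_bits`), the greedy promotion rule, the layer names, the logits, and the weight counts are all hypothetical, chosen only to show how a KL-based sensitivity score could decide which components move from 4-bit to 8-bit under a bits-per-weight budget.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def allocate_bits(sensitivity, sizes, target_bpw, low=4, high=8):
    """Hypothetical greedy rule: start every component at `low` bits, then
    promote components to `high` bits in order of decreasing sensitivity
    while the average bits-per-weight stays within `target_bpw`."""
    total = sum(sizes.values())
    bits = {name: low for name in sizes}
    used = low * total
    budget = target_bpw * total
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        extra = (high - low) * sizes[name]
        if used + extra <= budget:
            bits[name] = high
            used += extra
    return bits, used / total

# Toy data (assumed): reference output logits, and the logits observed when
# only the named component is quantized to the low bit-width.
reference = softmax([2.0, 1.0, 0.1])
quantized_logits = {
    "layers.0.attn": [2.0, 1.0, 0.1],  # output unchanged: robust
    "layers.0.mlp": [1.0, 2.0, 0.1],   # top token flips: sensitive
    "layers.1.attn": [1.9, 1.0, 0.2],  # small drift
}
sizes = {"layers.0.attn": 200, "layers.0.mlp": 100, "layers.1.attn": 100}

sensitivity = {
    name: kl_divergence(reference, softmax(logits))
    for name, logits in quantized_logits.items()
}
bits, achieved_bpw = allocate_bits(sensitivity, sizes, target_bpw=5.0)
print(bits, achieved_bpw)
```

In this toy run only the component whose quantization shifts the output distribution most is promoted to 8-bit; the nominal bit count ignores any per-group scale overhead.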

Quantization details

| Property | Value |
|---|---|
| Base | google/gemma-4-E4B-it-qat-q4_0-unquantized (QAT) |
| Predominant precision | 4-bit |
| Components at 8-bit (sensitive) | 221 |
| Components at 4-bit (robust) | 122 |
| Total quantized components | 343 |
| Achieved bits-per-weight | 5.17 |
| Group size | 64 |
| Reference for sensitivity | uniform-4-bit baseline (bf16 does not fit in RAM at this size) |
| Calibration mix | six-domain mix |
| Vision | bf16 sidecar (optiq_vision.safetensors), image+text via optiq |
| Speculative drafter | google/gemma-4-E4B-it-qat-q4_0-unquantized-assistant via optiq serve --drafter |

Capability Score

Six-metric mean (MMLU, GSM8K, IFEval, BFCL, HumanEval, HashHop), scored against the published uniform 4-bit quant of the same QAT base (mlx-community/gemma-4-E4B-it-qat-4bit). That comparison isolates what the mixed-precision allocation adds, holding the base fixed.

| Benchmark | Uniform-4 (QAT base) | This model (OptIQ, QAT base) | Delta |
|---|---|---|---|
| MMLU (5-shot, 1000) | 57.4% | 57.7% | +0.3 |
| GSM8K (1000) | 79.5% | 80.0% | +0.5 |
| IFEval (full, strict) | 67.8% | 69.1% | +1.3 |
| BFCL-V3 simple (200) | 70.0% | 70.0% | +0.0 |
| HumanEval (pass@1, 164) | 78.7% | 81.7% | +3.0 |
| HashHop (long-context) | 34.0% | 36.0% | +2.0 |
| Capability Score (mean) | 64.56 | 65.75 | +1.19 |

The mixed-precision allocation adds +1.19 points over uniform 4-bit on the QAT base, with the largest gains on HumanEval and the long-context HashHop task. The mixed quant is 5.17 bits-per-weight (about 7.0 GB on disk) versus 4.0 bits-per-weight (about 6.3 GB) for uniform 4-bit: the gain comes from spending the extra budget on the layers that need it.
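
As a quick check, the Capability Score can be recomputed as the plain mean of the six per-benchmark figures in the table above. The sketch below assumes the score is an unweighted mean. The table entries are printed to one decimal place, so the recomputed values may differ from the reported 64.56 and +1.19 in the last digit.

```python
from statistics import mean

# (uniform 4-bit, OptIQ) scores in percent, copied from the table above.
scores = {
    "MMLU": (57.4, 57.7),
    "GSM8K": (79.5, 80.0),
    "IFEval": (67.8, 69.1),
    "BFCL-V3 simple": (70.0, 70.0),
    "HumanEval": (78.7, 81.7),
    "HashHop": (34.0, 36.0),
}

# Unweighted mean over the six benchmarks (an assumption about the metric).
uniform_mean = mean(u for u, _ in scores.values())
optiq_mean = mean(o for _, o in scores.values())
delta = optiq_mean - uniform_mean

print(f"uniform={uniform_mean:.2f} optiq={optiq_mean:.2f} delta={delta:+.2f}")
```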

Usage

mlx-lm loads it directly for text:

from mlx_lm import load, generate
model, tokenizer = load("mlx-community/gemma-4-e4b-it-qat-OptiQ-4bit")
print(generate(model, tokenizer, "Explain mixed-precision quantization.", max_tokens=256))

Image+text input and the speculative drafter run through mlx-optiq:

pip install mlx-optiq
optiq serve --model mlx-community/gemma-4-e4b-it-qat-OptiQ-4bit \
            --drafter google/gemma-4-E4B-it-qat-q4_0-unquantized-assistant

The same repo loads text-only under stock mlx-lm and image+text under optiq. The bf16 vision tower rides in optiq_vision.safetensors, which mlx-lm ignores (it globs model*.safetensors), so both paths work from one artifact.
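
The filename filtering can be illustrated with Python's standard `fnmatch` module. This is a sketch of the pattern match only, not mlx-lm's loading code; the shard filenames and `config.json` entry are hypothetical examples, while `optiq_vision.safetensors` and the `model*.safetensors` pattern come from the text above.

```python
from fnmatch import fnmatchcase

WEIGHT_PATTERN = "model*.safetensors"  # pattern quoted in the text above

repo_files = [
    "model.safetensors",                 # hypothetical single-file name
    "model-00001-of-00002.safetensors",  # hypothetical shard name
    "optiq_vision.safetensors",          # vision sidecar named in this card
    "config.json",
]

# Files a loader using this pattern would pick up as weights.
picked_up = [f for f in repo_files if fnmatchcase(f, WEIGHT_PATTERN)]
# Everything else is skipped by the pattern, including the vision sidecar.
ignored = [f for f in repo_files if f not in picked_up]

print("picked up:", picked_up)
print("ignored:", ignored)
```

Because the sidecar's name does not start with `model`, a loader that selects weights with this pattern never sees it.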

Quantize your own

This quant was produced by mlx-optiq. Point it at any Hugging Face model to get the same sensitivity-aware mixed precision:

pip install mlx-optiq
optiq convert <hf-model-id> --target-bpw 5.0 --candidate-bits 4,8
optiq lab   # or open the full local workbench: chat, compare, quantize, fine-tune
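
For a rough sense of what `--target-bpw 5.0 --candidate-bits 4,8` implies, the arithmetic below estimates the share of weights that can sit at the higher bit-width. It assumes the target counts nominal bit-widths only (no per-group scale overhead); how mlx-optiq actually counts bits is not stated here, and `high_bit_fraction` is a hypothetical helper.

```python
def high_bit_fraction(target_bpw, low_bits, high_bits):
    """Fraction f of weights at high_bits such that
    f * high_bits + (1 - f) * low_bits == target_bpw."""
    if not low_bits <= target_bpw <= high_bits:
        raise ValueError("target_bpw must lie between the two candidate bit-widths")
    return (target_bpw - low_bits) / (high_bits - low_bits)

# Target from the command above: 5.0 bpw with candidates 4 and 8.
fraction_for_target = high_bit_fraction(5.0, 4, 8)

# Same arithmetic for the 5.17 bpw reported for this model.
fraction_for_this_model = high_bit_fraction(5.17, 4, 8)

print(fraction_for_target, round(fraction_for_this_model, 4))
```

Under that assumption, a 5.0 bpw target leaves room for a quarter of the weights at 8-bit, with the rest at 4-bit.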

License

Gemma Terms of Use. Built on google/gemma-4-E4B-it-qat-q4_0-unquantized.
