mlx-community/Qwen3.5-2B-OptiQ-4bit

Built with mlx-optiq, the MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon, with no PyTorch and no cloud.

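A 4-bit mixed-precision MLX quant produced by mlx-optiq, the sensitivity-aware quantization toolkit for Apple Silicon. It beats stock uniform 4-bit on four of the six benchmarks in the Capability Score (MMLU, IFEval, BFCL, HumanEval), ties on HashHop, trails on GSM8K by 0.8 points, and scores 2.12 points higher on the six-metric mean.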

A 4-bit mixed-precision MLX quant of Qwen/Qwen3.5-2B. Per-layer bit-widths come from a KL-divergence sensitivity pass on a six-domain calibration mix (prose · reasoning · code · agent · tool-call · constraint-bearing instructions). Sensitive layers go to 8-bit; robust ones stay at 4-bit. The on-disk size is within ~5% of a stock uniform 4-bit MLX quant.
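To make the idea of a KL-divergence sensitivity pass concrete, here is a minimal pure-Python sketch. It is not mlx-optiq's implementation: the function names, the toy scores, and the simple "promote the top N layers" rule are all assumptions made for illustration. It measures how far a quantized model's next-token distribution drifts from the reference, then gives 8-bit to the layers with the largest drift.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single next-token distribution."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def allocate_bits(sensitivity, n_high, low_bits=4, high_bits=8):
    """Assign high_bits to the n_high most sensitive layers, low_bits to the rest."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    promoted = set(ranked[:n_high])
    return {name: (high_bits if name in promoted else low_bits) for name in sensitivity}

# Hypothetical scores: mean KL against the reference when only that layer is quantized.
toy_sensitivity = {
    "layers.0.q_proj": 0.031,
    "layers.0.mlp.up": 0.004,
    "layers.1.q_proj": 0.012,
}
plan = allocate_bits(toy_sensitivity, n_high=1)
print(plan)
```

A real pass would average the KL over many calibration tokens and would likely budget by bits rather than by a fixed layer count; the sketch only shows the ranking step.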

Quantization details

| Property | Value |
|---|---|
| Predominant precision | 4-bit |
| Layers at 8-bit (sensitive) | 56 |
| Layers at 4-bit (robust) | 130 |
| Total quantized layers | 186 |
| Group size | 64 |
| Calibration mix | six-domain mix (40 samples × 6 domains) |
| Reference for sensitivity | bf16 (auto-resolved; falls back to uniform-4-bit if bf16 doesn't fit) |
| Bundled MTP head | mtp.safetensors (4-bit projections, BF16 norms); enables 1.4× decode via optiq serve --mtp |

We follow the same naming convention llama.cpp uses for Q4_K_M and similar mixed-precision quants: the "4-bit" label refers to the predominant precision, not the weighted average. The mixed allocation is what lets this build outscore stock uniform 4-bit on the Capability Score below (ahead on four of six benchmarks, behind only on GSM8K) at a comparable disk size.
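The gap between the "4-bit" label and the actual average can be estimated with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated in this card: each quantization group stores a 16-bit scale and a 16-bit bias (32 bits of overhead per group), and every layer holds the same number of weights. Real layers differ in size, so the true average will not match this figure exactly.

```python
def effective_bpw(bits, group_size, overhead_bits_per_group=32):
    """Bits per weight including per-group metadata.

    Assumption: one 16-bit scale plus one 16-bit bias per group.
    """
    return bits + overhead_bits_per_group / group_size

def mixed_bpw(layer_counts, group_size):
    """Layer-count-weighted average bits per weight.

    Assumption: every layer has the same number of weights.
    """
    total_layers = sum(layer_counts.values())
    weighted = sum(effective_bpw(bits, group_size) * n for bits, n in layer_counts.items())
    return weighted / total_layers

# Layer counts from the quantization details: 56 layers at 8-bit, 130 at 4-bit.
counts = {8: 56, 4: 130}
print(round(mixed_bpw(counts, group_size=64), 2))
```

Under these assumptions the mix averages roughly 5.7 bits per weight, against 4.5 for a uniform 4-bit quant at group size 64.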

Usage

Load it with mlx-lm and use it as usual:

pip install mlx-lm

from mlx_lm import load, generate

# Downloads the weights from the Hugging Face Hub on first use.
model, tokenizer = load("mlx-community/Qwen3.5-2B-OptiQ-4bit")
response = generate(
    model, tokenizer,
    prompt="Explain quantum computing in simple terms.",
    max_tokens=200,
)
print(response)

For more (mixed-precision KV-cache serving, sensitivity-aware LoRA fine-tuning, an OpenAI- and Anthropic-compatible inference server, hot-swappable mounted adapters, sandboxed Python execution for agent workflows), install mlx-optiq:

pip install mlx-optiq

Speculative decoding (MTP)

This quant ships with a bundled Multi-Token Prediction (MTP) head as mtp.safetensors. Enable it for ~1.4× faster decode:

optiq serve --model mlx-community/Qwen3.5-2B-OptiQ-4bit --mtp

Acceptance rate stays ~70% at depth 2 (the empirical sweet spot for Qwen3.5).
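For intuition about how acceptance rate and depth relate to throughput, the standard speculative-decoding estimate can be sketched as follows. It assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection; whether the ~70% figure above is measured per token in this way is an assumption, and the function name is illustrative only.

```python
def expected_tokens_per_step(acceptance, depth):
    """Expected tokens emitted per verification pass.

    Assumes independent per-token acceptance with probability `acceptance`
    and a stop at the first rejected draft token. The verifier always
    contributes one token, which is the i = 0 term of the sum.
    """
    return sum(acceptance ** i for i in range(depth + 1))

tokens = expected_tokens_per_step(0.70, depth=2)
print(f"{tokens:.2f} tokens per verification pass")
```

With 70% acceptance at depth 2 this gives about 2.19 tokens per pass. That is an upper bound on the speedup when drafting is free, so a measured ~1.4× is plausible once the cost of running the MTP head and verifying the drafts is included.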

See the Qwen3.5 family guide on mlx-optiq.com for sampling defaults, training recipes, and family-specific caveats.

Benchmarks

Six-metric Capability Score (mean of MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop). Apples-to-apples comparison against stock uniform 4-bit:

| Metric | OptIQ | Uniform 4-bit | Δ (OptIQ - Uniform) |
|---|---|---|---|
| MMLU (5-shot, 1000 samples) | 58.9% | 58.6% | +0.3 |
| GSM8K (1000 samples, 3-shot CoT) | 55.6% | 56.4% | -0.8 |
| IFEval (full set, strict) | 59.7% | 58.6% | +1.1 |
| BFCL-V3 simple (200 calls) | 60.5% | 60.0% | +0.5 |
| HumanEval (164 problems, pass@1) | 51.2% | 39.6% | +11.6 |
| HashHop (long-context retrieval) | 0.0% | 0.0% | +0.0 |
| Capability Score (mean of 6) | 47.66 | 45.54 | +2.12 |
| KL vs bf16 reference (mean / p95) | 0.2162 / 0.9484 | - | - |
| On-disk size | 1.4 GB | 1.6 GB | -0.2 GB |

Every metric gets one equal vote. Disk size is reported next to the score as an honest second axis instead of being folded into the score. See the eval-framework writeup for the full methodology.
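The equal-vote rule is just an unweighted mean, which can be checked against the published table. The sketch below uses the rounded percentages from the table; the dictionary layout and function name are illustrative and not part of the eval framework.

```python
# (OptIQ, uniform 4-bit) percentages as printed in the benchmark table.
scores = {
    "MMLU": (58.9, 58.6),
    "GSM8K": (55.6, 56.4),
    "IFEval": (59.7, 58.6),
    "BFCL-V3 simple": (60.5, 60.0),
    "HumanEval": (51.2, 39.6),
    "HashHop": (0.0, 0.0),
}

def capability_score(values):
    """Unweighted mean: every metric gets one equal vote."""
    values = list(values)
    return sum(values) / len(values)

optiq = capability_score(a for a, _ in scores.values())
uniform = capability_score(b for _, b in scores.values())
print(f"{optiq:.2f} {uniform:.2f} {optiq - uniform:+.2f}")
```

From the rounded inputs this yields 47.65 and 45.53, each 0.01 below the reported 47.66 and 45.54. The likely explanation is that the reported scores were averaged before the per-metric values were rounded, though that is an inference and not something this card states. The difference of +2.12 matches.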

Quantize your own

This quant was produced by mlx-optiq. Point it at any Hugging Face model to get the same sensitivity-aware mixed precision:

pip install mlx-optiq
optiq convert <hf-model-id> --target-bpw 5.0 --candidate-bits 4,8
optiq lab   # full local workbench: chat, compare, quantize, fine-tune

License

Apache 2.0 (inherited from the base model).
