# Qwen3.6-35B-A3B-TQ-apex
A mixed-precision native TurboQuant checkpoint of Qwen/Qwen3.6-35B-A3B. Routed MoE experts are quantized at 3 bits (TQ3, where the HIGGS-scalar grid is near-optimal on low-κ, Gaussian-like weight distributions); attention projections, `down_proj`, and shared experts are kept at 4 bits (TQ4, which preserves softmax-routing precision and handles heavy-tailed post-GLU weights).
Follows PAT-0349 "Kurtosis-Aware Automatic Bit-Width Selection" as a tensor-family fast-path approximation — we don't measure κ per-tensor; instead we apply the empirical family assignments from prior work on Gemma 4 27B (where routed experts averaged κ≈3.4, attention κ≈5.7, shared experts κ≈13).
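As a rough illustration of the rule this checkpoint applies (and of the per-tensor κ measurement it short-cuts), a minimal sketch follows; the pattern list mirrors the recipe below, while the function names are illustrative and not part of the turboquant-vllm API.

```python
# Illustrative sketch only -- not the turboquant-vllm API. The pattern list matches
# the recipe below; the kurtosis helper shows what a full per-tensor PAT-0349-style
# measurement would compute (plain kurtosis: 3.0 for a Gaussian, larger for heavy tails).
import numpy as np

SENSITIVE_PATTERNS = ("o_proj", "q_proj", "k_proj", "v_proj", "down_proj", "shared_expert")

def bits_for(tensor_name: str, base_bits: int = 3, sensitive_bits: int = 4) -> int:
    """Family fast-path: routed-expert gate/up projections stay at the base width;
    attention, down_proj and shared experts get the higher width."""
    return sensitive_bits if any(p in tensor_name for p in SENSITIVE_PATTERNS) else base_bits

def kurtosis(w: np.ndarray) -> float:
    """Per-tensor (non-excess) kurtosis: kappa = m4 / m2**2 over the flattened weights."""
    x = np.asarray(w, dtype=np.float64).ravel()
    x = x - x.mean()
    return float((x ** 4).mean() / (x ** 2).mean() ** 2)
```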
## Why this variant exists
The earlier varjosoft/Qwen3.6-35B-A3B-TQ3-native at uniform 3-bit scores 82% on gsm8k 5-shot CoT (200 Q, 1024-token budget), 11 ppt behind mlx-community/Qwen3.6-35B-A3B-4bit at 93% (raw scores; the table below reports scoring-normalized numbers). The gap isn't architectural; it comes from a uniform bit-width choice that undershoots attention's softmax sensitivity. This variant applies mixed bits where κ actually demands the higher precision.
| Variant | Disk | gsm8k-200 @ 1024 tok | Decode tok/s bs=1 | Notes |
|---|---|---|---|---|
| uniform TQ3 | 16 GB | 85.5 % | 35 tok/s | All layers 3-bit |
| this checkpoint (TQ-apex) | 17 GB | 93.0 % | 36 tok/s | 3-bit routed experts, 4-bit attention + down_proj + shared experts |
| mlx-community/*-4bit | 19 GB | 94.5 % | 79 tok/s | Uniform 4-bit (MLX reference) |
Methodology: 200 gsm8k test questions, 5-shot CoT, max_tokens=1024, same harness (mac-gsm8k-mlx.py). Numbers scoring-normalized (trailing .00 collapsed across all three runs for fair comparison). M4 Pro 48 GB MacBook, turboquant-vllm feat/mixed-bits-mlx-loader branch.
Result: TQ-apex closes 7.5 of the 9 ppt gap between uniform TQ3 and the reference 4-bit, landing within 1.5 ppt of the reference while 2 GB smaller on disk and at the same decode speed as uniform TQ3. The remaining quality gap is likely due to shared-expert κ still sitting above the TQ4 threshold (≈13 on Gemma 4 27B); a follow-up TQ-apex-plus with TQ5 on shared experts is the next lever.
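For reference, the "trailing .00 collapsed" normalization noted in the methodology amounts to scoring integer-valued answers and their `.00` renderings as equal. A minimal sketch, with an illustrative function name not taken from mac-gsm8k-mlx.py:

```python
import re

def normalize_numeric_answer(ans: str) -> str:
    """Collapse '18.00' / '18.0' / '18' to the same canonical string before comparison."""
    ans = ans.strip().replace(",", "")
    return re.sub(r"\.0+$", "", ans)

assert normalize_numeric_answer("18.00") == normalize_numeric_answer("18")
```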
## How to use
Same as the uniform TQ3 variant; the loader auto-detects mixed bits via `tq_config.json`:
```bash
pip install git+https://github.com/varjoranta/turboquant-vllm.git
huggingface-cli download varjosoft/Qwen3.6-35B-A3B-TQ-apex \
  --local-dir ~/models/qwen3.6-35b-a3b-tq-apex
```
```python
from turboquant_vllm.mlx_loader import load_tq3
import mlx.core as mx
from mlx_lm.generate import generate

model, tok = load_tq3("~/models/qwen3.6-35b-a3b-tq-apex")
mx.eval(model.parameters())
print(generate(model, tok, prompt="The capital of Finland is", max_tokens=64))
```
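The exact contents of `tq_config.json` are defined by the turboquant-vllm loader; purely as an illustration of the kind of bit-plan it has to carry, a hypothetical example follows (field names are assumptions, not the real schema):

```python
# Hypothetical illustration of a mixed-bit plan -- field names are assumptions,
# not the actual tq_config.json schema shipped by turboquant-vllm.
import json

tq_config = {
    "base_bits": 3,
    "sensitive_bits": 4,
    "group_size": 128,
    "sensitive_patterns": ["o_proj", "q_proj", "k_proj", "v_proj",
                           "down_proj", "shared_expert"],
    "rotation": {"type": "walsh-hadamard", "sign_seed": 42},
}
print(json.dumps(tq_config, indent=2))
```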
## Compression recipe
- Base bits: 3, applied to routed expert `gate_proj`/`up_proj` (bulk of parameter mass, near-Gaussian distributions)
- Sensitive bits: 4, applied to layers matching any of:
  - `o_proj`, `q_proj`, `k_proj`, `v_proj` (attention, softmax-routing precision)
  - `down_proj` (post-GLU, heavy-tailed per empirical observations)
  - `shared_expert` (full-token processing, high κ per Gemma 4 27B measurements)
- Group size: 128
- Rotation: Walsh-Hadamard + random sign vectors (seed=42)
- Codebook: Lloyd-Max 8-entry (3-bit) or 16-entry (4-bit) on unit Gaussian
- Shape-gain norm correction applied per group
Underlying technique is the scalar case of HIGGS (Malinovskii et al., NAACL 2025). The "apex" suffix indicates the mixed-precision profile.
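To make the recipe concrete, here is a minimal NumPy sketch of the per-group encode/decode path it describes (orthonormal Walsh-Hadamard rotation with a random sign vector, Lloyd-Max codebook fitted to a unit Gaussian, per-group shape-gain correction). It is an illustration under those assumptions, not the turboquant_vllm implementation; it uses `scipy.linalg.hadamard` and sample-based Lloyd-Max fitting for brevity.

```python
# Illustrative sketch of the recipe above -- not the turboquant_vllm implementation.
import numpy as np
from scipy.linalg import hadamard

def lloyd_max_codebook(bits: int, n_samples: int = 1 << 18, iters: int = 50) -> np.ndarray:
    """Lloyd-Max scalar codebook (2**bits entries) fitted to unit-Gaussian samples."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal(n_samples)
    cb = np.quantile(x, np.linspace(0.0, 1.0, 2 ** bits + 2)[1:-1])   # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - cb[None, :]).argmin(axis=1)         # nearest-entry assignment
        for k in range(cb.size):
            if np.any(idx == k):
                cb[k] = x[idx == k].mean()                            # centroid update
    return np.sort(cb)

def encode_group(w: np.ndarray, codebook: np.ndarray, signs: np.ndarray):
    """Quantize one group: rotate, split into gain and unit-RMS shape, snap to codebook."""
    n = w.size
    H = hadamard(n) / np.sqrt(n)              # orthonormal Walsh-Hadamard rotation
    x = H @ (signs * w)                       # random signs + rotation -> ~Gaussian coordinates
    gain = np.linalg.norm(x) / np.sqrt(n)     # per-group gain (shape-gain norm correction)
    idx = np.abs((x / gain)[:, None] - codebook[None, :]).argmin(axis=1)
    return idx, gain

def decode_group(idx: np.ndarray, gain: float, codebook: np.ndarray, signs: np.ndarray):
    """Invert the rotation and sign flips to reconstruct the group."""
    n = idx.size
    H = hadamard(n) / np.sqrt(n)
    return signs * (H.T @ (gain * codebook[idx]))

# Toy round trip on a single 128-weight group at 3 bits.
rng = np.random.default_rng(42)
signs = rng.choice([-1.0, 1.0], size=128)     # random sign vector (seed=42 in the recipe)
w = rng.standard_normal(128) * 0.02
cb3 = lloyd_max_codebook(3)
idx, gain = encode_group(w, cb3, signs)
w_hat = decode_group(idx, gain, cb3, signs)
print(f"relative L2 error: {np.linalg.norm(w - w_hat) / np.linalg.norm(w):.3f}")
```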
Lineage: mixed-precision quantization for LLMs is now a recognized research direction — see the MXPLM survey (Mixed-Precision Quantization for Language Models, arXiv, Oct 2025) for a comprehensive overview of inter- and intra-layer bit allocation. Closest relatives to this checkpoint's recipe:
- OWQ — outlier-aware INT3/4 + FP16 at tensor granularity (our per-family split is a coarse version of the same idea)
- SpQR — sparse + high-precision outliers (we don't separate outliers; we raise the whole tensor)
- CMPQ — channel-wise adaptive bits (finer-grained; we're at tensor-family granularity)
- SqLLM — Hessian-guided clustering (needs calibration data; we're zero-calibration via the kurtosis rule-of-thumb)
The value of our variant is zero-calibration deployability: the bit-plan is derived from architectural role alone, so a fresh base model can be re-compressed in minutes without gathering a dataset.
## Measured quality and throughput

The headline comparison against uniform TQ3 and mlx-4bit (gsm8k-200 accuracy, decode throughput, disk size) is in the table under "Why this variant exists" above; additional benchmarks will be added here once run.
## Reproduce
```bash
git clone https://github.com/varjoranta/turboquant-vllm
cd turboquant-vllm
uv venv --python 3.12 && uv pip install -e . accelerate "transformers>=5.5" torch

# Expanded sensitive-pattern list for this variant
python3 - <<'PY'
import turboquant_vllm.weight_quant as wq
import turboquant_vllm.checkpoint as cp

wq._SENSITIVE_PATTERNS = (
    "o_proj", "q_proj", "k_proj", "v_proj",
    "down_proj", "shared_expert",
)
cp._SENSITIVE_PATTERNS = wq._SENSITIVE_PATTERNS

cp.save_tq3_checkpoint(
    model_id="Qwen/Qwen3.6-35B-A3B",
    output_dir="./qwen3.6-apex",
    bits=3,
    sensitive_bits=4,
    group_size=128,
)
PY

python3 scripts/publish_model.py upload ./qwen3.6-apex <your-repo>
```
Needs ≥ 100 GB CPU RAM (full bf16 model during compression). Inference only needs ~20 GB resident.
## Citations
```bibtex
@article{qwen2026qwen36,
  title={Qwen3.6-35B-A3B},
  author={{Qwen Team, Alibaba}},
  year={2026},
  url={https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}

@inproceedings{malinovskii2025higgs,
  title={HIGGS: Pushing the Limits of Large Language Model Quantization via Hadamard Rotations and MSE-Optimal Grids},
  author={Malinovskii, Vladimir and Mazur, Andrei and Ilin, Ivan and Kuznedelev, Denis and Burlachenko, Konstantin and Yi, Kai and Alistarh, Dan and Richtarik, Peter},
  booktitle={NAACL},
  year={2025},
  url={https://aclanthology.org/2025.naacl-long.543/}
}
```
## License
Inherits Apache-2.0 from the base model.
## Links

- Compression / loader code: varjoranta/turboquant-vllm
- Uniform-TQ3 baseline variant: varjosoft/Qwen3.6-35B-A3B-TQ3-native
- vLLM upstream PR: vllm-project/vllm#39970