# Qwen3.6-35B-A3B-TQ-apex2
Data-driven mixed-precision native TurboQuant checkpoint of Qwen/Qwen3.6-35B-A3B. Same recipe as varjosoft/Qwen3.6-35B-A3B-TQ-apex, but additionally skips compression on the small, critical tensors our per-tensor kurtosis scan flagged as extremely heavy-tailed: the MoE router gates (`mlp.gate.weight`, κ mean 17.5, max 25.6) and the GatedDeltaNet linear-attention projections (`in_proj_b`, `out_proj`, `linear_fc`). Those tensors, roughly 40-200 MB in total, stay at fp16/bf16.

This is PAT-0349 (Kurtosis-Aware Automatic Bit-Width Selection) applied one step further than -TQ-apex: not just raising the bit width for moderate-κ tensors, but leaving the highest-κ tensors uncompressed where the scalar grid fundamentally can't represent them.
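For orientation, a minimal sketch of the kind of per-tensor kurtosis scan described above; it uses plain (Pearson) kurtosis, under which a Gaussian tensor scores κ ≈ 3 (matching the near-Gaussian κ<4 bucket below), and is an illustration rather than the repository's scan script:

```python
# Illustrative per-tensor kurtosis scan (not the repo's actual script).
# Pearson kurtosis: kappa = E[(x - mu)^4] / sigma^4, Gaussian ~= 3.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B", dtype=torch.bfloat16, device_map="cpu"
)

for name, p in model.named_parameters():
    x = p.detach().float().flatten()
    sigma = x.std()
    kappa = (((x - x.mean()) / sigma) ** 4).mean().item()
    if kappa >= 15:  # FP16 bucket per the tables below
        print(f"{name}\tkappa={kappa:.1f}")
```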
## Why this variant exists
From the per-tensor κ profile we ran on Qwen3.6-35B-A3B:
| Family | Tensors | Mean κ | Max κ | PAT-0349 bucket | apex2 does |
|---|---|---|---|---|---|
| Routed `gate_proj` / `up_proj` | 82 | 3.4 | 5.6 | TQ3 | TQ3 ✓ |
| Routed `down_proj` | 41 | 3.4 | 6.9 | TQ3 | TQ3 ✓ |
| Attention Q/K/V/O | 44 | 5-8 | 10-26 | TQ4 | TQ4 ✓ |
| Shared expert `gate_proj` | 41 | 9.0 | 23 | mixed TQ4/FP16 | TQ4 |
| Shared expert `up_proj` | 41 | 5.6 | 12 | TQ4 | TQ4 ✓ |
| Shared expert `down_proj` | 41 | 16.3 | 67 | FP16 (all 41) | TQ4 ← under-shot |
| Router `mlp.gate.weight` | 41 | 17.5 | 25.6 | FP16 | FP16 ✓ (skipped) |
| GatedDeltaNet `in_proj_b` / `out_proj` / `linear_fc` | ~60 | 8-500 | >500 | mixed FP16/TQ4 | FP16 ✓ (skipped) |
### Bit-plan bucket distribution (from `scripts/kappa-to-bit-plan.py`)
| Bucket | Tensors | Est. total size (κ-optimal) |
|---|---|---|
| TQ3 (κ < 4) | 159 | ~12.5 GB |
| TQ4 (4 ≤ κ < 8) | 244 | ~1.4 GB |
| TQ5 (8 ≤ κ < 15) → promoted to FP16 (no TQ5 kernel yet) | 90 | ~0.3 GB at fp16 |
| FP16 (κ ≥ 15) | 148 | ~0.6 GB |
| κ-optimal total | 641 | ~14.4 GB |
| uniform-TQ3 baseline | — | ~13.6 GB |
A fully κ-optimal checkpoint would be ~14.4 GB on disk (0.8 GB more than uniform TQ3) while matching bf16 quality on the highest-κ tensors. apex2 captures most of that ceiling cheaply by skipping only the tensors that clearly fall in the FP16 bucket.
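The bucketing rule behind the table is just a threshold ladder; a minimal sketch (thresholds taken from the table above, the function name is ours, and this is not `scripts/kappa-to-bit-plan.py` itself):

```python
# Threshold ladder from the bucket table above (illustrative sketch).
def bucket_for_kappa(kappa: float) -> str:
    if kappa < 4:
        return "TQ3"
    if kappa < 8:
        return "TQ4"
    if kappa < 15:
        return "FP16"  # would be TQ5, but no TQ5 kernel yet -> promoted
    return "FP16"      # kappa >= 15: scalar grid can't represent the tails

assert bucket_for_kappa(3.4) == "TQ3"    # routed experts
assert bucket_for_kappa(17.5) == "FP16"  # router gates
```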
-TQ-apex (the predecessor) forces all 41 router gates through the scalar grid at TQ3, damaging softmax routing precision, which was the main residual quality gap vs the MLX-4bit reference. Skipping them costs ~40 MB on disk (gate tensors are small: hidden × num_experts ≈ 1 MB each, × 41 layers). The other skip patterns add a similar amount.
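As a sanity check on that estimate, a worked version of the arithmetic; the hidden size (4096) and expert count (128) are illustrative assumptions chosen for the order of magnitude, not published Qwen3.6 dimensions:

```python
# Router gate size check. hidden=4096 and num_experts=128 are ASSUMED
# illustrative dimensions; only the order of magnitude matters here.
hidden, num_experts, layers, bytes_per_bf16 = 4096, 128, 41, 2

per_gate = hidden * num_experts * bytes_per_bf16  # 1 MiB per gate tensor
total = per_gate * layers                         # ~41 MiB across layers
print(f"{per_gate / 2**20:.1f} MiB per gate, {total / 2**20:.0f} MiB total")
```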
## Measured quality and throughput
Benchmarked on an M4 Pro (48 GB); gsm8k-200, 5-shot CoT, max_tokens=1024:
| Variant | Disk | gsm8k-200 | Aggregate tok/s |
|---|---|---|---|
| uniform TQ3 | 16 GB | 85.5 % | 35 |
| -TQ-apex (TQ3 + TQ4 mixed) | 17 GB | 93.0 % | 36 |
| this (-TQ-apex2) | 18 GB | 96.0 % | 17.5 |
| mlx-community/*-4bit (reference) | 19 GB | 94.5 % | 79 |
apex2 beats the MLX-4bit reference on accuracy (96.0 % vs 94.5 %) while being 1 GB smaller on disk, though at a steep throughput cost: 17.5 tok/s vs 36 for -TQ-apex and 79 for the MLX reference. For reasoning tasks of this kind, the quality gap vs the base model is effectively closed.
## How to use
Same loader as -TQ-apex: the fp16-kept tensors automatically pass through mlx_lm's standard weight-load path (they carry no `.tq_packed` sidecars, so the TurboQuant loader ignores them):
```bash
pip install git+https://github.com/varjoranta/turboquant-vllm.git@feat/mixed-bits-mlx-loader
huggingface-cli download varjosoft/Qwen3.6-35B-A3B-TQ-apex2 \
  --local-dir ~/models/qwen3.6-35b-a3b-tq-apex2
```
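Once downloaded, generation should work through the usual mlx_lm entry points. A minimal sketch, assuming the loader branch hooks into the standard `load`/`generate` API (the prompt and token budget are arbitrary):

```python
# Minimal generation sketch. ASSUMES the feat/mixed-bits-mlx-loader branch
# plugs into the standard mlx_lm load/generate path, per the note above.
import os
from mlx_lm import load, generate

model, tokenizer = load(os.path.expanduser("~/models/qwen3.6-35b-a3b-tq-apex2"))
print(generate(model, tokenizer, prompt="What is 17 * 24?", max_tokens=256))
```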
## Compression recipe (by tensor family)
- Skip (fp16): `lm_head`, `embed`, `norm`, `bias`, `.gate.weight` (router), `in_proj_b`, `out_proj`, `linear_fc` (GatedDeltaNet)
- TQ4 (sensitive): `o_proj`, `q_proj`, `k_proj`, `v_proj`, `down_proj`, `shared_expert`
- TQ3 (default): everything else that matches `.weight` and has `shape[-1] >= 128` (the bulk of the params: routed expert `gate_proj` / `up_proj`)
- Group size: 128; rotation: Walsh-Hadamard + seed=42 random signs; codebook: Lloyd-Max scalar (sketched below)
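To make the last bullet concrete, here is a self-contained sketch of the per-group pipeline as the recipe describes it: random sign flips, a Walsh-Hadamard rotation, then a Lloyd-Max scalar codebook. It illustrates the technique only; it is not the repository's kernel, and the helper names are ours.

```python
# Illustrative sketch of the per-group pipeline: sign flips -> Walsh-Hadamard
# rotation -> Lloyd-Max scalar codebook. Not the repo's actual kernel.
import numpy as np

rng = np.random.default_rng(42)  # seed=42 random signs, per the recipe

def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Walsh-Hadamard matrix, n a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so H.T inverts H

def lloyd_max(x: np.ndarray, bits: int, iters: int = 25) -> np.ndarray:
    """1-D Lloyd-Max: alternate nearest-level assignment / level update."""
    levels = np.quantile(x, np.linspace(0, 1, 2**bits))  # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()
    return levels[idx]  # dequantized values

group = rng.standard_normal(128)                # one group of 128 weights
signs = rng.choice([-1.0, 1.0], size=128)
H = hadamard(128)

rotated = H @ (signs * group)   # rotation Gaussianizes heavy tails
deq = lloyd_max(rotated, bits=3)  # TQ3: 8 scalar levels
restored = signs * (H.T @ deq)  # invert rotation, then undo sign flips

print("group RMSE:", np.sqrt(np.mean((restored - group) ** 2)))
```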
## Reproduce
```bash
git clone https://github.com/varjoranta/turboquant-vllm
cd turboquant-vllm
uv venv --python 3.12 && uv pip install -e . accelerate "transformers>=5.5" torch
python3 - <<'PY'
import turboquant_vllm.weight_quant as wq
import turboquant_vllm.checkpoint as cp

# TQ4 bucket: attention projections, expert down_proj, shared experts.
wq._SENSITIVE_PATTERNS = (
    "o_proj", "q_proj", "k_proj", "v_proj",
    "down_proj", "shared_expert",
)
cp._SENSITIVE_PATTERNS = wq._SENSITIVE_PATTERNS

# FP16 bucket: extend the default skip list with the highest-kappa families.
wq._SKIP_PATTERNS = tuple(list(wq._SKIP_PATTERNS) + [
    ".gate.weight",  # MoE router: kappa mean 17.5
    "in_proj_b",     # GatedDeltaNet
    "out_proj",      # GatedDeltaNet
    "linear_fc",     # GatedDeltaNet
])
cp._SKIP_PATTERNS = wq._SKIP_PATTERNS

cp.save_tq3_checkpoint(
    model_id="Qwen/Qwen3.6-35B-A3B",
    output_dir="./qwen3.6-apex2",
    bits=3, sensitive_bits=4, group_size=128,
)
PY
```
Needs ≥ 100 GB of CPU RAM (the full bf16 model is loaded during compression). Inference: ~20 GB resident.
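To verify which tensors were actually skipped, the saved shards can be inspected directly. A sketch that parses the raw safetensors headers, assuming packed weights are stored under keys with a `.tq_packed` suffix as the loader note in "How to use" implies:

```python
# Sketch: list tensors kept at fp16/bf16 in the output checkpoint.
# ASSUMPTION: quantized weights live under "<name>.tq_packed" keys,
# as implied by the loader note above. Header layout per the
# safetensors spec: 8-byte little-endian length, then a JSON header.
import glob, json, struct

packed, plain = set(), set()
for shard in glob.glob("./qwen3.6-apex2/*.safetensors"):
    with open(shard, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    for name in header:
        if name == "__metadata__":
            continue
        (packed if name.endswith(".tq_packed") else plain).add(name)

bases = {p[: -len(".tq_packed")] for p in packed}
for name in sorted(plain - bases):
    print(name)  # skipped (uncompressed) tensors
```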
## Citations
```bibtex
@article{qwen2026qwen36,
  title={Qwen3.6-35B-A3B},
  author={Qwen Team, Alibaba},
  year={2026},
  url={https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}

@inproceedings{malinovskii2025higgs,
  title={HIGGS: Pushing the Limits of Large Language Model Quantization via
         Hadamard Rotations and MSE-Optimal Grids},
  author={Malinovskii, Vladimir and Mazur, Andrei and Ilin, Ivan and Kuznedelev,
          Denis and Burlachenko, Konstantin and Yi, Kai and Alistarh, Dan and
          Richtarik, Peter},
  booktitle={NAACL},
  year={2025},
  url={https://aclanthology.org/2025.naacl-long.543/}
}
```
Mixed-precision bit-width selection draws on PAT-0349 (Kurtosis-Aware Automatic Bit-Width Selection) and the MXPLM survey (Mixed-Precision Quantization for Language Models, arXiv, Oct 2025).
## License
Inherits Apache-2.0 from the base model.
## Links
- Compression / loader: `varjoranta/turboquant-vllm` (branch `feat/mixed-bits-mlx-loader`)
- Predecessors: `varjosoft/Qwen3.6-35B-A3B-TQ-apex`, `varjosoft/Qwen3.6-35B-A3B-TQ3-native`