Qwen3.6-35B-A3B-TQ-apex2

Data-driven, mixed-precision, native TurboQuant checkpoint of Qwen/Qwen3.6-35B-A3B. Same recipe as varjosoft/Qwen3.6-35B-A3B-TQ-apex, but it additionally skips compression on the small, critical tensors our per-tensor kurtosis scan flagged as extremely heavy-tailed: the MoE router gates (mlp.gate.weight, κ mean 17.5, max 25.6) and the GatedDeltaNet linear-attention projections (in_proj_b, out_proj, linear_fc). Those tensors (~40-200 MB in total) stay at fp16/bf16.

This is PAT-0349 Kurtosis-Aware Automatic Bit-Width Selection applied one step further than -TQ-apex: not just raising bit width for moderate-κ tensors, but leaving the highest-κ tensors uncompressed where the scalar grid fundamentally can't represent them.

Why this variant exists

Per the per-tensor κ profile we ran on Qwen3.6-35B-A3B:

| Family | Tensors | Mean κ | Max κ | PAT-0349 bucket | apex2 does |
|---|---|---|---|---|---|
| Routed gate_proj / up_proj | 82 | 3.4 | 5.6 | TQ3 | TQ3 ✓ |
| Routed down_proj | 41 | 3.4 | 6.9 | TQ3 | TQ3 ✓ |
| Attention Q/K/V/O | 44 | 5-8 | 10-26 | TQ4 | TQ4 ✓ |
| Shared expert gate_proj | 41 | 9.0 | 23 | mixed TQ4/FP16 | TQ4 |
| Shared expert up_proj | 41 | 5.6 | 12 | TQ4 | TQ4 ✓ |
| Shared expert down_proj | 41 | 16.3 | 67 | FP16 (all 41) | TQ4 ← under-shot |
| Router mlp.gate.weight | 41 | 17.5 | 25.6 | FP16 | FP16 ✓ (skipped) |
| GatedDeltaNet in_proj_b / out_proj / linear_fc | ~60 | 8-500 | >500 | mixed FP16/TQ4 | FP16 ✓ (skipped) |
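
The scan script itself isn't shown in this card. As a sketch of what the κ column measures, assuming plain (non-excess) kurtosis so a Gaussian-shaped tensor scores κ ≈ 3 (consistent with the routed-expert rows above):

import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM

def kurtosis(w: torch.Tensor) -> float:
    # plain (non-excess) kurtosis E[(x - mu)^4] / sigma^4; a Gaussian tensor scores ~3
    x = w.detach().float().flatten()
    x = x - x.mean()
    return (x.pow(4).mean() / x.pow(2).mean() ** 2).item()

# full bf16 load on CPU -- this is where the >= 100 GB RAM requirement below comes from
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.6-35B-A3B", dtype=torch.bfloat16)

families = defaultdict(list)
for name, p in model.named_parameters():
    if name.endswith(".weight") and p.ndim == 2:  # 2-D linear weights only
        families[name.split(".")[-2]].append(kurtosis(p))

for fam, ks in sorted(families.items()):
    print(f"{fam:20s} n={len(ks):3d}  mean κ={sum(ks)/len(ks):7.1f}  max κ={max(ks):7.1f}")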

Bit-plan bucket distribution (from scripts/kappa-to-bit-plan.py)

| Bucket | Tensors | Est. total size (κ-optimal) |
|---|---|---|
| TQ3 (κ < 4) | 159 | ~12.5 GB |
| TQ4 (4 ≤ κ < 8) | 244 | ~1.4 GB |
| TQ5 (8 ≤ κ < 15) → promoted to FP16 (no TQ5 kernel yet) | 90 | ~0.3 GB at fp16 |
| FP16 (κ ≥ 15) | 148 | ~0.6 GB |
| κ-optimal total | 641 | ~14.4 GB |
| uniform-TQ3 baseline | | ~13.6 GB |
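
scripts/kappa-to-bit-plan.py ships with the repo; its mapping reduces to the cutoffs in the table above. A minimal sketch, with the one-fp16-scale-per-group overhead as an assumption (the real packed format isn't documented here):

def bucket(kappa: float) -> str:
    # cutoffs from the table above; TQ5 would cover 8 <= κ < 15, but with no TQ5
    # kernel yet that bucket is promoted straight to FP16
    if kappa < 4:
        return "TQ3"
    if kappa < 8:
        return "TQ4"
    return "FP16"

def est_bytes(numel: int, b: str, group_size: int = 128) -> float:
    # payload bits per weight, plus an assumed fp16 scale per group of 128
    bits = {"TQ3": 3, "TQ4": 4, "FP16": 16}[b]
    overhead = 0.0 if b == "FP16" else 16 / group_size
    return numel * (bits + overhead) / 8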

A fully κ-optimal checkpoint would be ~14.4 GB on disk (0.8 GB more than uniform TQ3) while matching bf16 quality on the highest-κ tensors. apex2 captures most of that ceiling cheaply by skipping only the tensors that clearly fall in the FP16 bucket.

-TQ-apex (the predecessor) forces all 41 router gates through the scalar grid at TQ3, damaging softmax routing precision; that was the main residual quality gap vs the MLX-4bit reference. Skipping them costs ~40 MB on disk (router gate tensors are small: hidden × num_experts ≈ 1 MB each, × 41 layers). The other skip patterns add a similar amount.

Measured quality and throughput

Benched on M4 Pro 48 GB, gsm8k-200 5-shot CoT @ max_tokens=1024:

| Variant | Disk | gsm8k-200 | Aggregate tok/s |
|---|---|---|---|
| uniform TQ3 | 16 GB | 85.5 % | 35 |
| -TQ-apex (TQ3 + TQ4 mixed) | 17 GB | 93.0 % | 36 |
| this (-TQ-apex2) | 18 GB | 96.0 % | 17.5 |
| mlx-community/*-4bit (reference) | 19 GB | 94.5 % | 79 |

apex2 beats the MLX-4bit reference on accuracy (96.0 vs 94.5) while being 1 GB smaller on disk. The quality gap vs base model is effectively closed for reasoning tasks of this kind.

How to use

Same loader as -TQ-apex. The fp16-kept tensors automatically pass through mlx_lm's standard weight-load path (they carry no .tq_packed sidecars, so the TurboQuant loader ignores them):

pip install git+https://github.com/varjoranta/turboquant-vllm.git@feat/mixed-bits-mlx-loader
huggingface-cli download varjosoft/Qwen3.6-35B-A3B-TQ-apex2 \
    --local-dir ~/models/qwen3.6-35b-a3b-tq-apex2
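
From there, generation goes through mlx_lm's normal API. A sketch, assuming that importing the fork registers its loader hooks; the turboquant_vllm.mlx_loader module name is illustrative, not confirmed:

from pathlib import Path

import turboquant_vllm.mlx_loader  # hypothetical module name; the import is assumed
                                   # to register the .tq_packed weight hooks
from mlx_lm import load, generate

model, tokenizer = load(str(Path("~/models/qwen3.6-35b-a3b-tq-apex2").expanduser()))
print(generate(model, tokenizer, prompt="What is 17 * 23?", max_tokens=256))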

Compression recipe (tensor-family)

  • Skip (fp16): lm_head, embed, norm, bias, .gate.weight (router), in_proj_b, out_proj, linear_fc (GatedDeltaNet)
  • TQ4 (sensitive): o_proj, q_proj, k_proj, v_proj, down_proj, shared_expert
  • TQ3 (default): everything else that matches .weight and has shape[-1] >= 128 (bulk of params — routed expert gate_proj/up_proj)
  • Group size: 128; rotation: Walsh-Hadamard + seed=42 random signs; codebook: Lloyd-Max scalar (see the sketch below)
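
A minimal NumPy sketch of what one 128-wide group goes through, assuming the textbook form of each step (the shipped kernels are fused and packed; this only illustrates the seed-42 sign flip, the Walsh-Hadamard rotation, and a Lloyd-Max codebook):

import numpy as np

G = 128  # group size; a power of two, so a Walsh-Hadamard transform of that width exists

def fwht(x: np.ndarray) -> np.ndarray:
    # orthonormal fast Walsh-Hadamard transform along the last axis (self-inverse)
    y, h, n = x.astype(np.float64), 1, x.shape[-1]
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = y[..., i:i + h].copy(), y[..., i + h:i + 2 * h].copy()
            y[..., i:i + h], y[..., i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(n)

def lloyd_max(x: np.ndarray, bits: int, iters: int = 25):
    # 1-D k-means: an MSE-optimal scalar codebook for the rotated weights
    levels = np.quantile(x, np.linspace(0, 1, 2 ** bits))
    for _ in range(iters):
        idx = np.digitize(x, (levels[:-1] + levels[1:]) / 2)
        for k in range(levels.size):
            sel = x[idx == k]
            if sel.size:
                levels[k] = sel.mean()
    return levels, np.digitize(x, (levels[:-1] + levels[1:]) / 2)

rng = np.random.default_rng(42)                     # seed=42 random signs
signs = rng.choice([-1.0, 1.0], size=G)

w = rng.standard_normal((4, G))                     # stand-in for one weight tile, grouped by 128
rotated = fwht(w * signs)                           # sign flip, then Hadamard rotation
levels, codes = lloyd_max(rotated.ravel(), bits=3)  # TQ3: an 8-level codebook
dequant = fwht(levels[codes].reshape(w.shape)) * signs  # the transform is self-inverse
print(np.abs(w - dequant).max())                    # reconstruction error of the group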

Reproduce

git clone https://github.com/varjoranta/turboquant-vllm
cd turboquant-vllm
uv venv --python 3.12 && uv pip install -e . accelerate "transformers>=5.5" torch

python3 - <<'PY'
import turboquant_vllm.weight_quant as wq
import turboquant_vllm.checkpoint as cp

# send these tensor-name substrings to TQ4 instead of the TQ3 default
wq._SENSITIVE_PATTERNS = (
    "o_proj", "q_proj", "k_proj", "v_proj",
    "down_proj", "shared_expert",
)
cp._SENSITIVE_PATTERNS = wq._SENSITIVE_PATTERNS

# leave these fully uncompressed (fp16), on top of the library's default skips
wq._SKIP_PATTERNS = tuple(list(wq._SKIP_PATTERNS) + [
    ".gate.weight",   # MoE router — κ mean 17.5
    "in_proj_b",      # GatedDeltaNet
    "out_proj",       # GatedDeltaNet
    "linear_fc",      # GatedDeltaNet
])
cp._SKIP_PATTERNS = wq._SKIP_PATTERNS

cp.save_tq3_checkpoint(
    model_id="Qwen/Qwen3.6-35B-A3B",
    output_dir="./qwen3.6-apex2",
    bits=3, sensitive_bits=4, group_size=128,  # TQ3 default, TQ4 for sensitive families
)
PY

Compression needs ≥ 100 GB of CPU RAM (the full bf16 model is loaded). Inference needs ~20 GB resident.
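
To sanity-check which tensors actually stayed fp16, you can scan the output shards for the .tq_packed sidecars mentioned above (assumes safetensors shards under the output_dir used in the script):

import glob
from safetensors import safe_open

packed, plain = set(), set()
for shard in sorted(glob.glob("./qwen3.6-apex2/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for key in f.keys():
            if key.endswith(".tq_packed"):
                packed.add(key.removesuffix(".tq_packed"))
            elif key.endswith(".weight"):
                plain.add(key)

for name in sorted(plain - packed):  # router gates and GatedDeltaNet should land here
    print("fp16:", name)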

Citations

@article{qwen2026qwen36,
  title={Qwen3.6-35B-A3B}, author={Qwen Team, Alibaba}, year={2026},
  url={https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}
@inproceedings{malinovskii2025higgs,
  title={HIGGS: Pushing the Limits of Large Language Model Quantization via
         Hadamard Rotations and MSE-Optimal Grids},
  author={Malinovskii, Vladimir and Mazur, Andrei and Ilin, Ivan and Kuznedelev,
          Denis and Burlachenko, Konstantin and Yi, Kai and Alistarh, Dan and
          Richtarik, Peter},
  booktitle={NAACL}, year={2025},
  url={https://aclanthology.org/2025.naacl-long.543/}
}

Mixed-precision bit-width selection draws on PAT-0349 (Kurtosis-Aware Automatic Bit-Width Selection) and the MXPLM survey (Mixed-Precision Quantization for Language Models, arXiv, Oct 2025).

License

Inherits Apache-2.0 from the base model.
