# Qwen3.6-35B-A3B-TQ-apex2
Data-driven mixed-precision native TurboQuant checkpoint of Qwen/Qwen3.6-35B-A3B. Same recipe as varjosoft/Qwen3.6-35B-A3B-TQ-apex, but additionally skips compression on the small, critical tensors our per-tensor kurtosis scan flagged as extremely heavy-tailed: the MoE router gates (`mlp.gate.weight`, κ mean 17.5, max 25.6) and the GatedDeltaNet linear-attention projections (`in_proj_b`, `out_proj`, `linear_fc`). Those tensors, roughly 40-200 MB in total, stay at fp16/bf16.

This is PAT-0349 (Kurtosis-Aware Automatic Bit-Width Selection) applied one step further than -TQ-apex: not just raising the bit width for moderate-κ tensors, but leaving the highest-κ tensors uncompressed where the scalar grid fundamentally can't represent them.
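For orientation, a minimal sketch of the kind of per-tensor kurtosis scan described above; it uses plain (Pearson) kurtosis, under which a Gaussian tensor scores κ ≈ 3 (matching the near-Gaussian κ<4 bucket below), and is an illustration rather than the repository's scan script:

```python
# Illustrative per-tensor kurtosis scan (not the repo's actual script).
# Pearson kurtosis: kappa = E[(x - mu)^4] / sigma^4, Gaussian ~= 3.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B", dtype=torch.bfloat16, device_map="cpu"
)

for name, p in model.named_parameters():
    x = p.detach().float().flatten()
    sigma = x.std()
    kappa = (((x - x.mean()) / sigma) ** 4).mean().item()
    if kappa >= 15:  # FP16 bucket per the tables below
        print(f"{name}\tkappa={kappa:.1f}")
```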
## Why this variant exists
From the per-tensor κ profile we ran on Qwen3.6-35B-A3B:
| Family | Tensors | Mean κ | Max κ | PAT-0349 bucket | apex2 does |
|---|---|---|---|---|---|
| Routed `gate_proj` / `up_proj` | 82 | 3.4 | 5.6 | TQ3 | TQ3 ✓ |
| Routed `down_proj` | 41 | 3.4 | 6.9 | TQ3 | TQ3 ✓ |
| Attention Q/K/V/O | 44 | 5-8 | 10-26 | TQ4 | TQ4 ✓ |
| Shared expert `gate_proj` | 41 | 9.0 | 23 | mixed TQ4/FP16 | TQ4 |
| Shared expert `up_proj` | 41 | 5.6 | 12 | TQ4 | TQ4 ✓ |
| Shared expert `down_proj` | 41 | 16.3 | 67 | FP16 (all 41) | TQ4 ← under-shot |
| Router `mlp.gate.weight` | 41 | 17.5 | 25.6 | FP16 | FP16 ✓ (skipped) |
| GatedDeltaNet `in_proj_b` / `out_proj` / `linear_fc` | ~60 | 8-500 | >500 | mixed FP16/TQ4 | FP16 ✓ (skipped) |
### Bit-plan bucket distribution (from `scripts/kappa-to-bit-plan.py`)
| Bucket | Tensors | Est. total size (κ-optimal) |
|---|---|---|
| TQ3 (κ < 4) | 159 | ~12.5 GB |
| TQ4 (4 ≤ κ < 8) | 244 | ~1.4 GB |
| TQ5 (8 ≤ κ < 15) → promoted to FP16 (no TQ5 kernel yet) | 90 | ~0.3 GB at fp16 |
| FP16 (κ ≥ 15) | 148 | ~0.6 GB |
| κ-optimal total | 641 | ~14.4 GB |
| uniform-TQ3 baseline | — | ~13.6 GB |
A fully κ-optimal checkpoint would be ~14.4 GB on disk (0.8 GB more than uniform TQ3) while matching bf16 quality on the highest-κ tensors. apex2 captures most of that ceiling cheaply by skipping only the tensors that clearly fall in the FP16 bucket.
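The bucketing rule behind the table is just a threshold ladder; a minimal sketch (thresholds taken from the table above, the function name is ours, and this is not `scripts/kappa-to-bit-plan.py` itself):

```python
# Threshold ladder from the bucket table above (illustrative sketch).
def bucket_for_kappa(kappa: float) -> str:
    if kappa < 4:
        return "TQ3"
    if kappa < 8:
        return "TQ4"
    if kappa < 15:
        return "FP16"  # would be TQ5, but no TQ5 kernel yet -> promoted
    return "FP16"      # kappa >= 15: scalar grid can't represent the tails

assert bucket_for_kappa(3.4) == "TQ3"    # routed experts
assert bucket_for_kappa(17.5) == "FP16"  # router gates
```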
-TQ-apex (the predecessor) forces all 41 router gates through the scalar grid at TQ3, damaging softmax routing precision, which was the main residual quality gap vs the MLX-4bit reference. Skipping them costs ~40 MB on disk (gate tensors are small: hidden × num_experts ≈ 1 MB each, × 41 layers). The other skip patterns add a similar amount.
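As a sanity check on that estimate, a worked version of the arithmetic; the hidden size (4096) and expert count (128) are illustrative assumptions chosen for the order of magnitude, not published Qwen3.6 dimensions:

```python
# Router gate size check. hidden=4096 and num_experts=128 are ASSUMED
# illustrative dimensions; only the order of magnitude matters here.
hidden, num_experts, layers, bytes_per_bf16 = 4096, 128, 41, 2

per_gate = hidden * num_experts * bytes_per_bf16  # 1 MiB per gate tensor
total = per_gate * layers                         # ~41 MiB across layers
print(f"{per_gate / 2**20:.1f} MiB per gate, {total / 2**20:.0f} MiB total")
```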
## Measured quality and throughput
Benchmarked on an M4 Pro (48 GB); gsm8k-200, 5-shot CoT, max_tokens=1024:
| Variant | Disk | gsm8k-200 | Aggregate tok/s |
|---|---|---|---|
| uniform TQ3 | 16 GB | 85.5 % | 35 |
| -TQ-apex (TQ3 + TQ4 mixed) | 17 GB | 93.0 % | 36 |
| this (-TQ-apex2) | 18 GB | 96.0 % | 17.5 |
| mlx-community/*-4bit (reference) | 19 GB | 94.5 % | 79 |
apex2 beats the MLX-4bit reference on accuracy (96.0 % vs 94.5 %) while being 1 GB smaller on disk, though at a steep throughput cost: 17.5 tok/s vs 36 for -TQ-apex and 79 for the MLX reference. For reasoning tasks of this kind, the quality gap vs the base model is effectively closed.
## How to use
Same loader as -TQ-apex: the fp16-kept tensors automatically pass through mlx_lm's standard weight-load path (they carry no `.tq_packed` sidecars, so the TurboQuant loader ignores them):
```bash
pip install git+https://github.com/varjoranta/turboquant-vllm.git@feat/mixed-bits-mlx-loader
huggingface-cli download varjosoft/Qwen3.6-35B-A3B-TQ-apex2 \
  --local-dir ~/models/qwen3.6-35b-a3b-tq-apex2
```
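Once downloaded, generation should work through the usual mlx_lm entry points. A minimal sketch, assuming the loader branch hooks into the standard `load`/`generate` API (the prompt and token budget are arbitrary):

```python
# Minimal generation sketch. ASSUMES the feat/mixed-bits-mlx-loader branch
# plugs into the standard mlx_lm load/generate path, per the note above.
import os
from mlx_lm import load, generate

model, tokenizer = load(os.path.expanduser("~/models/qwen3.6-35b-a3b-tq-apex2"))
print(generate(model, tokenizer, prompt="What is 17 * 24?", max_tokens=256))
```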
## Compression recipe (by tensor family)
- Skip (fp16): `lm_head`, `embed`, `norm`, `bias`, `.gate.weight` (router), `in_proj_b`, `out_proj`, `linear_fc` (GatedDeltaNet)
- TQ4 (sensitive): `o_proj`, `q_proj`, `k_proj`, `v_proj`, `down_proj`, `shared_expert`
- TQ3 (default): everything else that matches `.weight` and has `shape[-1] >= 128` (the bulk of the params: routed expert `gate_proj` / `up_proj`)
- Group size: 128; rotation: Walsh-Hadamard + seed=42 random signs; codebook: Lloyd-Max scalar (sketched below)
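To make the last bullet concrete, here is a self-contained sketch of the per-group pipeline as the recipe describes it: random sign flips, a Walsh-Hadamard rotation, then a Lloyd-Max scalar codebook. It illustrates the technique only; it is not the repository's kernel, and the helper names are ours.

```python
# Illustrative sketch of the per-group pipeline: sign flips -> Walsh-Hadamard
# rotation -> Lloyd-Max scalar codebook. Not the repo's actual kernel.
import numpy as np

rng = np.random.default_rng(42)  # seed=42 random signs, per the recipe

def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Walsh-Hadamard matrix, n a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so H.T inverts H

def lloyd_max(x: np.ndarray, bits: int, iters: int = 25) -> np.ndarray:
    """1-D Lloyd-Max: alternate nearest-level assignment / level update."""
    levels = np.quantile(x, np.linspace(0, 1, 2**bits))  # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()
    return levels[idx]  # dequantized values

group = rng.standard_normal(128)                # one group of 128 weights
signs = rng.choice([-1.0, 1.0], size=128)
H = hadamard(128)

rotated = H @ (signs * group)   # rotation Gaussianizes heavy tails
deq = lloyd_max(rotated, bits=3)  # TQ3: 8 scalar levels
restored = signs * (H.T @ deq)  # invert rotation, then undo sign flips

print("group RMSE:", np.sqrt(np.mean((restored - group) ** 2)))
```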
## Reproduce
```bash
git clone https://github.com/varjoranta/turboquant-vllm
cd turboquant-vllm
uv venv --python 3.12 && uv pip install -e . accelerate "transformers>=5.5" torch
python3 - <<'PY'
import turboquant_vllm.weight_quant as wq
import turboquant_vllm.checkpoint as cp

# TQ4 bucket: attention projections, expert down_proj, shared experts.
wq._SENSITIVE_PATTERNS = (
    "o_proj", "q_proj", "k_proj", "v_proj",
    "down_proj", "shared_expert",
)
cp._SENSITIVE_PATTERNS = wq._SENSITIVE_PATTERNS

# FP16 bucket: extend the default skip list with the highest-kappa families.
wq._SKIP_PATTERNS = tuple(list(wq._SKIP_PATTERNS) + [
    ".gate.weight",  # MoE router: kappa mean 17.5
    "in_proj_b",     # GatedDeltaNet
    "out_proj",      # GatedDeltaNet
    "linear_fc",     # GatedDeltaNet
])
cp._SKIP_PATTERNS = wq._SKIP_PATTERNS

cp.save_tq3_checkpoint(
    model_id="Qwen/Qwen3.6-35B-A3B",
    output_dir="./qwen3.6-apex2",
    bits=3, sensitive_bits=4, group_size=128,
)
PY
```
Needs ≥ 100 GB of CPU RAM (the full bf16 model is loaded during compression). Inference: ~20 GB resident.
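To verify which tensors were actually skipped, the saved shards can be inspected directly. A sketch that parses the raw safetensors headers, assuming packed weights are stored under keys with a `.tq_packed` suffix as the loader note in "How to use" implies:

```python
# Sketch: list tensors kept at fp16/bf16 in the output checkpoint.
# ASSUMPTION: quantized weights live under "<name>.tq_packed" keys,
# as implied by the loader note above. Header layout per the
# safetensors spec: 8-byte little-endian length, then a JSON header.
import glob, json, struct

packed, plain = set(), set()
for shard in glob.glob("./qwen3.6-apex2/*.safetensors"):
    with open(shard, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    for name in header:
        if name == "__metadata__":
            continue
        (packed if name.endswith(".tq_packed") else plain).add(name)

bases = {p[: -len(".tq_packed")] for p in packed}
for name in sorted(plain - bases):
    print(name)  # skipped (uncompressed) tensors
```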
## Citations
```bibtex
@article{qwen2026qwen36,
  title={Qwen3.6-35B-A3B},
  author={Qwen Team, Alibaba},
  year={2026},
  url={https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}

@inproceedings{malinovskii2025higgs,
  title={HIGGS: Pushing the Limits of Large Language Model Quantization via
         Hadamard Rotations and MSE-Optimal Grids},
  author={Malinovskii, Vladimir and Mazur, Andrei and Ilin, Ivan and Kuznedelev,
          Denis and Burlachenko, Konstantin and Yi, Kai and Alistarh, Dan and
          Richtarik, Peter},
  booktitle={NAACL},
  year={2025},
  url={https://aclanthology.org/2025.naacl-long.543/}
}
```
Mixed-precision bit-width selection draws on PAT-0349 (Kurtosis-Aware Automatic Bit-Width Selection) and the MXPLM survey (Mixed-Precision Quantization for Language Models, arXiv, Oct 2025).
## License
Inherits Apache-2.0 from the base model.
## Links
- Compression / loader: `varjoranta/turboquant-vllm` (branch `feat/mixed-bits-mlx-loader`)
- Predecessors: `varjosoft/Qwen3.6-35B-A3B-TQ-apex`, `varjosoft/Qwen3.6-35B-A3B-TQ3-native`