# Qwen3.6-35B-A3B-TQ-apex
A mixed-precision native TurboQuant checkpoint of Qwen/Qwen3.6-35B-A3B. Routed MoE experts are quantized at 3 bits (TQ3, where the HIGGS-scalar grid is near-optimal on low-κ, Gaussian-like weight distributions); attention projections, `down_proj`, and shared experts are kept at 4 bits (TQ4, which preserves softmax-routing precision and handles heavy-tailed post-GLU weights).
Follows PAT-0349 "Kurtosis-Aware Automatic Bit-Width Selection" as a tensor-family fast-path approximation — we don't measure κ per-tensor; instead we apply the empirical family assignments from prior work on Gemma 4 27B (where routed experts averaged κ≈3.4, attention κ≈5.7, shared experts κ≈13).
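As a rough illustration of the rule this checkpoint applies (and of the per-tensor κ measurement it short-cuts), a minimal sketch follows; the pattern list mirrors the recipe below, while the function names are illustrative and not part of the turboquant-vllm API.

```python
# Illustrative sketch only -- not the turboquant-vllm API. The pattern list matches
# the recipe below; the kurtosis helper shows what a full per-tensor PAT-0349-style
# measurement would compute (plain kurtosis: 3.0 for a Gaussian, larger for heavy tails).
import numpy as np

SENSITIVE_PATTERNS = ("o_proj", "q_proj", "k_proj", "v_proj", "down_proj", "shared_expert")

def bits_for(tensor_name: str, base_bits: int = 3, sensitive_bits: int = 4) -> int:
    """Family fast-path: routed-expert gate/up projections stay at the base width;
    attention, down_proj and shared experts get the higher width."""
    return sensitive_bits if any(p in tensor_name for p in SENSITIVE_PATTERNS) else base_bits

def kurtosis(w: np.ndarray) -> float:
    """Per-tensor (non-excess) kurtosis: kappa = m4 / m2**2 over the flattened weights."""
    x = np.asarray(w, dtype=np.float64).ravel()
    x = x - x.mean()
    return float((x ** 4).mean() / (x ** 2).mean() ** 2)
```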
## Why this variant exists
The earlier varjosoft/Qwen3.6-35B-A3B-TQ3-native at uniform 3-bit scores 82% on gsm8k 5-shot CoT (200 Q, 1024-token budget), 11 ppt behind mlx-community/Qwen3.6-35B-A3B-4bit at 93% (raw scores; the table below reports scoring-normalized numbers). The gap isn't architectural; it comes from a uniform bit-width choice that undershoots attention's softmax sensitivity. This variant applies mixed bits where κ actually demands the higher precision.
| Variant | Disk | gsm8k-200 @ 1024 tok | Decode tok/s bs=1 | Notes |
|---|---|---|---|---|
| uniform TQ3 | 16 GB | 85.5 % | 35 tok/s | All layers 3-bit |
| this checkpoint (TQ-apex) | 17 GB | 93.0 % | 36 tok/s | 3-bit routed experts, 4-bit attention + down_proj + shared experts |
| mlx-community/*-4bit | 19 GB | 94.5 % | 79 tok/s | Uniform 4-bit (MLX reference) |
Methodology: 200 gsm8k test questions, 5-shot CoT, max_tokens=1024, same harness (mac-gsm8k-mlx.py). Numbers scoring-normalized (trailing .00 collapsed across all three runs for fair comparison). M4 Pro 48 GB MacBook, turboquant-vllm feat/mixed-bits-mlx-loader branch.
Result: TQ-apex closes 7.5 of the 9 ppt gap between uniform TQ3 and the reference 4-bit, landing within 1.5 ppt of the reference while 2 GB smaller on disk and at the same decode speed as uniform TQ3. The remaining quality gap is likely due to shared-expert κ still sitting above the TQ4 threshold (≈13 on Gemma 4 27B); a follow-up TQ-apex-plus with TQ5 on shared experts is the next lever.
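For reference, the "trailing .00 collapsed" normalization noted in the methodology amounts to scoring integer-valued answers and their `.00` renderings as equal. A minimal sketch, with an illustrative function name not taken from mac-gsm8k-mlx.py:

```python
import re

def normalize_numeric_answer(ans: str) -> str:
    """Collapse '18.00' / '18.0' / '18' to the same canonical string before comparison."""
    ans = ans.strip().replace(",", "")
    return re.sub(r"\.0+$", "", ans)

assert normalize_numeric_answer("18.00") == normalize_numeric_answer("18")
```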
## How to use
Same as the uniform TQ3 variant; the loader auto-detects mixed bits via `tq_config.json`:
```bash
pip install git+https://github.com/varjoranta/turboquant-vllm.git
huggingface-cli download varjosoft/Qwen3.6-35B-A3B-TQ-apex \
  --local-dir ~/models/qwen3.6-35b-a3b-tq-apex
```
```python
from turboquant_vllm.mlx_loader import load_tq3
import mlx.core as mx
from mlx_lm.generate import generate

model, tok = load_tq3("~/models/qwen3.6-35b-a3b-tq-apex")
mx.eval(model.parameters())
print(generate(model, tok, prompt="The capital of Finland is", max_tokens=64))
```
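The exact contents of `tq_config.json` are defined by the turboquant-vllm loader; purely as an illustration of the kind of bit-plan it has to carry, a hypothetical example follows (field names are assumptions, not the real schema):

```python
# Hypothetical illustration of a mixed-bit plan -- field names are assumptions,
# not the actual tq_config.json schema shipped by turboquant-vllm.
import json

tq_config = {
    "base_bits": 3,
    "sensitive_bits": 4,
    "group_size": 128,
    "sensitive_patterns": ["o_proj", "q_proj", "k_proj", "v_proj",
                           "down_proj", "shared_expert"],
    "rotation": {"type": "walsh-hadamard", "sign_seed": 42},
}
print(json.dumps(tq_config, indent=2))
```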
## Compression recipe
- Base bits: 3, applied to routed expert `gate_proj`/`up_proj` (bulk of parameter mass, near-Gaussian distributions)
- Sensitive bits: 4, applied to layers matching any of:
  - `o_proj`, `q_proj`, `k_proj`, `v_proj` (attention, softmax-routing precision)
  - `down_proj` (post-GLU, heavy-tailed per empirical observations)
  - `shared_expert` (full-token processing, high κ per Gemma 4 27B measurements)
- Group size: 128
- Rotation: Walsh-Hadamard + random sign vectors (seed=42)
- Codebook: Lloyd-Max 8-entry (3-bit) or 16-entry (4-bit) on unit Gaussian
- Shape-gain norm correction applied per group
Underlying technique is the scalar case of HIGGS (Malinovskii et al., NAACL 2025). The "apex" suffix indicates the mixed-precision profile.
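To make the recipe concrete, here is a minimal NumPy sketch of the per-group encode/decode path it describes (orthonormal Walsh-Hadamard rotation with a random sign vector, Lloyd-Max codebook fitted to a unit Gaussian, per-group shape-gain correction). It is an illustration under those assumptions, not the turboquant_vllm implementation; it uses `scipy.linalg.hadamard` and sample-based Lloyd-Max fitting for brevity.

```python
# Illustrative sketch of the recipe above -- not the turboquant_vllm implementation.
import numpy as np
from scipy.linalg import hadamard

def lloyd_max_codebook(bits: int, n_samples: int = 1 << 18, iters: int = 50) -> np.ndarray:
    """Lloyd-Max scalar codebook (2**bits entries) fitted to unit-Gaussian samples."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal(n_samples)
    cb = np.quantile(x, np.linspace(0.0, 1.0, 2 ** bits + 2)[1:-1])   # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - cb[None, :]).argmin(axis=1)         # nearest-entry assignment
        for k in range(cb.size):
            if np.any(idx == k):
                cb[k] = x[idx == k].mean()                            # centroid update
    return np.sort(cb)

def encode_group(w: np.ndarray, codebook: np.ndarray, signs: np.ndarray):
    """Quantize one group: rotate, split into gain and unit-RMS shape, snap to codebook."""
    n = w.size
    H = hadamard(n) / np.sqrt(n)              # orthonormal Walsh-Hadamard rotation
    x = H @ (signs * w)                       # random signs + rotation -> ~Gaussian coordinates
    gain = np.linalg.norm(x) / np.sqrt(n)     # per-group gain (shape-gain norm correction)
    idx = np.abs((x / gain)[:, None] - codebook[None, :]).argmin(axis=1)
    return idx, gain

def decode_group(idx: np.ndarray, gain: float, codebook: np.ndarray, signs: np.ndarray):
    """Invert the rotation and sign flips to reconstruct the group."""
    n = idx.size
    H = hadamard(n) / np.sqrt(n)
    return signs * (H.T @ (gain * codebook[idx]))

# Toy round trip on a single 128-weight group at 3 bits.
rng = np.random.default_rng(42)
signs = rng.choice([-1.0, 1.0], size=128)     # random sign vector (seed=42 in the recipe)
w = rng.standard_normal(128) * 0.02
cb3 = lloyd_max_codebook(3)
idx, gain = encode_group(w, cb3, signs)
w_hat = decode_group(idx, gain, cb3, signs)
print(f"relative L2 error: {np.linalg.norm(w - w_hat) / np.linalg.norm(w):.3f}")
```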
Lineage: mixed-precision quantization for LLMs is now a recognized research direction — see the MXPLM survey (Mixed-Precision Quantization for Language Models, arXiv, Oct 2025) for a comprehensive overview of inter- and intra-layer bit allocation. Closest relatives to this checkpoint's recipe:
- OWQ — outlier-aware INT3/4 + FP16 at tensor granularity (our per-family split is a coarse version of the same idea)
- SpQR — sparse + high-precision outliers (we don't separate outliers; we raise the whole tensor)
- CMPQ — channel-wise adaptive bits (finer-grained; we're at tensor-family granularity)
- SqLLM — Hessian-guided clustering (needs calibration data; we're zero-calibration via the kurtosis rule-of-thumb)
The value of our variant is zero-calibration deployability: the bit-plan is derived from architectural role alone, so a fresh base model can be re-compressed in minutes without gathering a dataset.
## Measured quality and throughput

The headline comparison against uniform TQ3 and mlx-4bit (gsm8k-200 accuracy, decode throughput, disk size) is in the table under "Why this variant exists" above; additional benchmarks will be added here once run.
## Reproduce
```bash
git clone https://github.com/varjoranta/turboquant-vllm
cd turboquant-vllm
uv venv --python 3.12 && uv pip install -e . accelerate "transformers>=5.5" torch

# Expanded sensitive-pattern list for this variant
python3 - <<'PY'
import turboquant_vllm.weight_quant as wq
import turboquant_vllm.checkpoint as cp

wq._SENSITIVE_PATTERNS = (
    "o_proj", "q_proj", "k_proj", "v_proj",
    "down_proj", "shared_expert",
)
cp._SENSITIVE_PATTERNS = wq._SENSITIVE_PATTERNS

cp.save_tq3_checkpoint(
    model_id="Qwen/Qwen3.6-35B-A3B",
    output_dir="./qwen3.6-apex",
    bits=3,
    sensitive_bits=4,
    group_size=128,
)
PY

python3 scripts/publish_model.py upload ./qwen3.6-apex <your-repo>
```
Needs ≥ 100 GB CPU RAM (full bf16 model during compression). Inference only needs ~20 GB resident.
## Citations
```bibtex
@article{qwen2026qwen36,
  title={Qwen3.6-35B-A3B},
  author={{Qwen Team, Alibaba}},
  year={2026},
  url={https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}

@inproceedings{malinovskii2025higgs,
  title={HIGGS: Pushing the Limits of Large Language Model Quantization via Hadamard Rotations and MSE-Optimal Grids},
  author={Malinovskii, Vladimir and Mazur, Andrei and Ilin, Ivan and Kuznedelev, Denis and Burlachenko, Konstantin and Yi, Kai and Alistarh, Dan and Richtarik, Peter},
  booktitle={NAACL},
  year={2025},
  url={https://aclanthology.org/2025.naacl-long.543/}
}
```
## License
Inherits Apache-2.0 from the base model.
## Links

- Compression / loader code: varjoranta/turboquant-vllm
- Uniform-TQ3 baseline variant: varjosoft/Qwen3.6-35B-A3B-TQ3-native
- vLLM upstream PR: vllm-project/vllm#39970