Qwen3.6-35B-A3B MTPLX Optimized Balance

Balanced local 35B-A3B inference for Apple Silicon, packaged for MTPLX native Multi-Token-Prediction speculative decoding.

This checkpoint is the balanced 35B release: a 6-bit MLX body paired with calibrated INT4 MTP heads. It is tuned for strong reasoning-on generation speed while keeping prompt processing and memory use practical across normal coding contexts.

Run It

brew install youssofal/mtplx/mtplx
mtplx start
mtplx run "hello" --model Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance

For an OpenAI-compatible local server:

mtplx serve --model Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance --profile sustained --max --port 8000 --no-stats-footer

Why This Exists

MTPLX uses the model's own MTP heads to generate draft tokens, then verifies them with the main model. When the draft heads are well-matched, you get higher throughput without running a separate drafter model.

Optimized Balance is built for that path. MTPLX reads mtplx_runtime.json and selects the measured D2 defaults automatically.

Recommended Runtime Defaults

Setting Value
Backend qwen3-next-mtp
Default depth D2
Target sampler temp=0.60, top_p=0.95, top_k=20
Draft sampler temp=0.60, top_p=0.95, top_k=20
Profile sustained
Benchmark fan mode max

Performance

Measured in MTPLX Sustained Max on Apple Silicon with reasoning enabled.

Generation

Mode TPS Verify time Acceptance
AR baseline 86.30 - -
D1 comparison 123.00 7.24s 0.8329
D2 promoted default 126.43 6.62s 0.8134, 0.5048
D3 comparison 112.43 7.16s 0.7802, 0.4709, 0.2514

D2 is the promoted default because it gives the best balance of throughput, acceptance, and verify cost. A three-run D2 repeat averaged 123.44 tok/s, with every run above 122 tok/s.

Prompt Processing

Context Prompt TPS
512 1,756.8
1k 3,339.2
2k 4,109.8
4k 4,048.7
8k 3,872.0
16k 3,162.3
32k 2,761.3
64k 1,834.2

Average prompt processing across the 512-to-64k ladder was 3,110.5 tok/s.

Model Build

Component Format
Main body 6-bit MLX affine, group size 64
Router and gate tensors 8-bit where recorded by config
MTP numbered-expert weights INT4 MLX affine, group size 64
Norms, scales, biases, plain tensors BF16

This is not a full-precision checkpoint. It is built for fast local use on Apple Silicon through MTPLX.

Files

  • model-*.safetensors: MLX 6-bit body shards
  • mtp.safetensors: calibrated INT4 MTP sidecar
  • mtplx_runtime.json: MTPLX runtime contract and measured defaults
  • MTPLX_PUBLISH_MANIFEST.json: file sizes and benchmark summary
  • tokenizer and config files for local loading
Downloads last month
479
Safetensors
Model size
8B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support