
Libraries

How to use Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance

Run Hermes

hermes

MLX LM

How to use Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
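
The same request can be made from Python with only the standard library. This is a minimal sketch, assuming the server listens on port 8080 (the `mlx_lm.server` default; change `BASE_URL` if you passed `--port`) and returns the usual OpenAI-style `choices[0].message.content` shape. The helper names `build_chat_request` and `chat` are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumption: default mlx_lm.server port; adjust if you started it with --port.
BASE_URL = "http://localhost:8080/v1"
MODEL = "Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance"


def build_chat_request(user_text, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text, **kwargs)) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Building the request separately from sending it makes the payload easy to inspect without a live server.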

Qwen3.6-35B-A3B MTPLX Optimized Balance

Balanced local 35B-A3B inference for Apple Silicon, packaged for MTPLX native Multi-Token-Prediction speculative decoding.

This checkpoint is the balanced 35B release: a 6-bit MLX body paired with calibrated INT4 MTP heads. It is tuned for fast generation with reasoning enabled, while keeping prompt processing and memory use practical for typical coding contexts.

Run It

brew install youssofal/mtplx/mtplx
mtplx start
mtplx run "hello" --model Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance

For an OpenAI-compatible local server:

mtplx serve --model Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Balance --profile sustained --max --port 8000 --no-stats-footer

Why This Exists

MTPLX uses the model's own MTP heads to generate draft tokens, then verifies them with the main model. When the draft heads are well-matched, you get higher throughput without running a separate drafter model.
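
The draft-then-verify loop described above can be illustrated with a toy example. This is a simplified sketch, not MTPLX's implementation: it uses greedy acceptance (a draft token is kept only if it matches the target's choice), whereas a sampling runtime uses a probabilistic acceptance rule, and it calls the target once per position where a real runtime verifies all drafted positions in one batched forward pass. The `draft_next` and `target_next` callables are hypothetical stand-ins for the MTP heads and the main model.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # toy greedy "model": context -> next token


def speculative_step(context, draft_next, target_next, depth):
    """One draft-then-verify step; commits between 1 and depth + 1 tokens."""
    # 1. Draft `depth` tokens cheaply.
    drafted = []
    for _ in range(depth):
        drafted.append(draft_next(context + drafted))

    # 2. Verify drafted tokens in order against the target model.
    committed = []
    for tok in drafted:
        expected = target_next(context + committed)
        if tok != expected:
            committed.append(expected)  # first mismatch: take the target's token and stop
            return committed
        committed.append(tok)

    # 3. Every draft was accepted, so the target adds one bonus token.
    committed.append(target_next(context + committed))
    return committed


def generate(context, draft_next, target_next, depth, max_new):
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(context)
    while len(out) - len(context) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, depth))
    return out[len(context):len(context) + max_new]


# Toy models: the target counts upward mod 10; the draft is wrong after a 4.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(generate([0], draft_next, target_next, depth=2, max_new=8))
```

With greedy acceptance the output is identical to what the target model alone would produce; the draft only changes how many tokens are committed per verification step, which is where the throughput gain comes from.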

Optimized Balance is built for that path. MTPLX reads mtplx_runtime.json and automatically selects the measured D2 (draft depth 2) defaults.

Recommended Runtime Defaults

Setting	Value
Backend	`qwen3-next-mtp`
Default depth	`D2`
Target sampler	`temp=0.60`, `top_p=0.95`, `top_k=20`
Draft sampler	`temp=0.60`, `top_p=0.95`, `top_k=20`
Profile	`sustained`
Benchmark fan mode	`max`

Performance

Measured with the MTPLX `sustained` profile and `max` fan mode on Apple Silicon, with reasoning enabled.

Generation

Mode	TPS (tok/s)	Verify time	Acceptance (by draft position)
AR baseline	86.30	-	-
D1 comparison	123.00	7.24s	0.8329
D2 promoted default	126.43	6.62s	0.8134, 0.5048
D3 comparison	112.43	7.16s	0.7802, 0.4709, 0.2514

D2 is the promoted default because it gives the best balance of throughput, acceptance, and verify cost. A three-run D2 repeat averaged 123.44 tok/s, with every run above 122 tok/s.
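
To put the generation table in relative terms, the short calculation below divides each mode's throughput by the AR baseline. The numbers are copied from the table; nothing else is assumed.

```python
# Generation throughput (tok/s) from the table above.
AR_BASELINE_TPS = 86.30
modes = {"D1": 123.00, "D2": 126.43, "D3": 112.43}

# Speedup of each speculative depth relative to plain autoregressive decoding.
speedups = {name: tps / AR_BASELINE_TPS for name, tps in modes.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x over AR baseline")
```

This works out to roughly 1.43x for D1, 1.47x for D2, and 1.30x for D3, which matches D2 being the promoted default.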

Prompt Processing

Context (tokens)	Prompt TPS (tok/s)
512	1,756.8
1k	3,339.2
2k	4,109.8
4k	4,048.7
8k	3,872.0
16k	3,162.3
32k	2,761.3
64k	1,834.2

Average prompt processing across the 512-to-64k ladder was 3,110.5 tok/s.
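
The ladder average can be reproduced from the table as a plain arithmetic mean over the eight context sizes. The values below are copied from the table.

```python
from statistics import mean

# Prompt-processing throughput (tok/s) by context size, from the table above.
prompt_tps = {
    "512": 1756.8,
    "1k": 3339.2,
    "2k": 4109.8,
    "4k": 4048.7,
    "8k": 3872.0,
    "16k": 3162.3,
    "32k": 2761.3,
    "64k": 1834.2,
}

average = mean(prompt_tps.values())            # unweighted mean over the ladder
peak_context = max(prompt_tps, key=prompt_tps.get)  # context size with the highest rate

print(f"average: {average:.1f} tok/s, peak at {peak_context}")
```

The mean is unweighted, so each rung counts equally regardless of how many tokens it processes; throughput peaks at the 2k context and falls off on both sides.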

Model Build

Component	Format
Main body	6-bit MLX affine, group size 64
Router and gate tensors	8-bit where recorded by config
MTP numbered-expert weights	INT4 MLX affine, group size 64
Norms, scales, biases, plain tensors	BF16
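
For a rough sense of what these formats cost in memory, the sketch below estimates effective bits per weight for affine group quantization. It assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias (consistent with the BF16 row above), takes the 35B parameter count from the model name, and treats every parameter as part of the 6-bit body. It ignores the 8-bit router tensors, the BF16 tensors, and the MTP sidecar, so treat the result as a ballpark figure rather than a measured file size.

```python
def effective_bits(bits, group_size, scale_bits=16, bias_bits=16):
    """Bits per weight, including the per-group scale and bias overhead."""
    return bits + (scale_bits + bias_bits) / group_size


def weight_gigabytes(n_params, bits, group_size):
    """Approximate storage in GB (1e9 bytes) for n_params quantized weights."""
    return n_params * effective_bits(bits, group_size) / 8 / 1e9


body_bits = effective_bits(6, 64)   # 6-bit main body
mtp_bits = effective_bits(4, 64)    # INT4 MTP expert weights
body_gb = weight_gigabytes(35e9, 6, 64)  # assumption: all 35B params at 6-bit

print(f"body: {body_bits} bits/weight, MTP: {mtp_bits} bits/weight")
print(f"approx. body size: {body_gb:.1f} GB")
```

Under these assumptions the per-group overhead adds half a bit per weight at group size 64, giving 6.5 and 4.5 effective bits for the two formats.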

This is not a full-precision checkpoint. It is built for fast local use on Apple Silicon through MTPLX.

Files

model-*.safetensors: MLX 6-bit body shards
mtp.safetensors: calibrated INT4 MTP sidecar
mtplx_runtime.json: MTPLX runtime contract and measured defaults
MTPLX_PUBLISH_MANIFEST.json: file sizes and benchmark summary
tokenizer and config files for local loading
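
After downloading the repository, a quick check that the files listed above are present can save a confusing load failure later. This is a small sketch using only the file names from the list; the helper name `missing_files` and the local directory path are hypothetical, and tokenizer and config files are not checked because their exact names are not listed here.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = [
    "mtp.safetensors",
    "mtplx_runtime.json",
    "MTPLX_PUBLISH_MANIFEST.json",
]
SHARD_PATTERN = "model-*.safetensors"


def missing_files(model_dir):
    """Return the expected files (or shard pattern) not found in model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # At least one body shard must match the pattern.
    if not any(root.glob(SHARD_PATTERN)):
        missing.append(SHARD_PATTERN)
    return missing
```

For example, `missing_files("path/to/local/checkpoint")` returns an empty list when everything is in place.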
