Instructions to use JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP") config = load_config("JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- Pi
How to use JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP
Run Hermes
hermes

Qwen3.6-27B-JANG_4M-MTP
Qwen3.6-27B (dense) quantized with the JANG_4M importance-weighted mixed-precision profile for Apple Silicon, with the vision tower and the native Multi-Token-Prediction head preserved and enabled.
| Source | Qwen/Qwen3.6-27B |
| License | Apache-2.0, inherited from upstream |
| Format | JANG v2 — JANG_4M profile (mx.quantize, asymmetric, block_size=64) |
| Architecture | qwen3_5 dense — hybrid GatedDeltaNet + full attention, has vision |
| Modality | image + video + text |
| Bundle size | 16.6 GB |
| Effective bits | 4.45 avg (4-bit floor, 8-bit on important tensors) |
| MTP | native head preserved, enabled (num_nextn_predict_layers=1) |
Why JANG_4M
JANG_4M is JANG's standard importance-weighted profile. Instead of a flat
bit width, it scores each tensor by weight magnitude and spends 8 bits
where it matters and a 4-bit floor elsewhere — MSE-calibrated, asymmetric
affine via MLX-native mx.quantize. The result here is 4.45 effective
bits: sharper than a flat MXFP4 bundle, materially smaller than flat MXFP8.
Norms and control tensors stay in fp16 passthrough.
Multi-Token Prediction
This bundle keeps Qwen3.6's native MTP module and runs it as a self-speculative draft head: the MTP head proposes tokens that the main model verifies in a single pass, so decoded output stays bit-identical to plain autoregressive decoding — only faster.
Recorded on an M5 Max (vMLX runtime, 96-token deterministic prompt, output verified equal to baseline at every depth):
| Draft depth | tok/s | Speedup |
|---|---|---|
| Baseline (MTP off) | 24.2 | 1.00× |
| D1 | 37.6 | 1.55× |
| D2 | 43.3 | 1.79× |
| D3 (default) | 44.1 | 1.82× |
Absolute tok/s depends on free memory and system load. The speedup ratio — baseline vs. MTP measured back-to-back under identical conditions — is the stable figure.
Vision, MTP and caching together
This bundle runs image/video input, native MTP speculative decode and prefix/KV caching in the same session — a combination not every MTP-enabled Qwen build exposes.
Loading
Loads via stock mlx-lm / mlx-vlm on Apple Silicon — JANG_4M weights are
native mx.quantize affine, no custom JANG runtime required for the core
model.
from mlx_vlm import load, generate
model, processor = load("JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP")
The MTP draft path is exercised by an MTP-aware runtime (vMLX); other runtimes load and decode the main model normally and ignore the MTP head.
Related bundles
Flat-precision MXFP siblings of this model are published on OsaurusAI:
| Variant | Format | Size | Best MTP speedup |
|---|---|---|---|
Qwen3.6-27B-MXFP4-MTP |
flat mxfp4 | 14.4 GB | 1.85× (D2) |
Qwen3.6-27B-JANG_4M-MTP (this) |
JANG_4M mixed | 16.6 GB | 1.82× (D3) |
Qwen3.6-27B-MXFP8-MTP |
flat mxfp8 | 27.1 GB | 1.83× (D3) |
Credits
- Quantization toolchain: JANG by Jinho Jang <eric@jangq.ai>
- Base model: Qwen3.6-27B by Qwen
- Downloads last month
- 428
Quantized
Model tree for JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP
Base model
Qwen/Qwen3.6-27B