# Qwen3.5-397B-A17B-TurboQuant-MLX-8bit

8-bit MLX weight-quantized build of Qwen/Qwen3.5-397B-A17B — a 397B total / 17B active Sparse MoE multimodal model — prepared with TurboQuant (randomized Hadamard rotations applied pre-quantization to flatten weight outliers). Optimized for Apple Silicon via MLX.

At 8-bit, this variant retains near-full-precision quality and is the recommended choice when disk budget and unified memory allow.
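The pre-rotation idea can be illustrated in a few lines of NumPy: rotate the weights with a randomized Hadamard transform (random sign flips followed by a Hadamard matrix), quantize in the rotated basis where outliers have been spread out, then rotate back. This is an illustrative sketch of the general technique only, not TurboQuant's actual implementation; the dimensions and the symmetric per-tensor 8-bit scheme are assumptions.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])  # orthonormal

rng = np.random.default_rng(0)
n = 256
# Randomized Hadamard rotation: random sign flips, then Hadamard transform
D = np.diag(rng.choice([-1.0, 1.0], size=n))
R = hadamard(n) @ D  # orthogonal

# A weight row with one large outlier
w = rng.normal(0.0, 0.02, size=n)
w[7] = 1.5
w_rot = R @ w  # the outlier's energy is now spread across all coordinates

def quantize_8bit(x):
    # Symmetric 8-bit quantization, returning dequantized values
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).clip(-127, 127) * scale

# Quantize in the rotated basis, then rotate back
w_hat = R.T @ quantize_8bit(w_rot)
err_rot = np.sqrt(np.mean((w_hat - w) ** 2))
err_plain = np.sqrt(np.mean((quantize_8bit(w) - w) ** 2))
print(f"RMS error with rotation {err_rot:.5f} vs without {err_plain:.5f}")
```

Because the rotation is orthogonal, quantization error is preserved when rotating back, but the smaller dynamic range in the rotated basis yields a much finer quantization step.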

## Quickstart

```python
from mlx_lm import load, generate

model, tokenizer = load("majentik/Qwen3.5-397B-A17B-TurboQuant-MLX-8bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain sparse Mixture-of-Experts in one paragraph."}],
    add_generation_prompt=True,
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```

For multimodal (image + text) usage via `mlx-vlm`:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("majentik/Qwen3.5-397B-A17B-TurboQuant-MLX-8bit")
prompt = apply_chat_template(processor, config=model.config,
                             prompt="What's in this image?", num_images=1)
out = generate(model, processor, prompt, image=["./image.jpg"], max_tokens=512)
print(out)
```

## Model Specs

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-397B-A17B |
| Architecture | Sparse Mixture-of-Experts (MoE) |
| Total parameters | 397B |
| Active per token | 17B |
| Modalities | Image + text → text |
| Context window | 256K tokens |
| Weight quantization | 8-bit MLX (TurboQuant pre-rotation) |
| Approx. disk footprint | ~397 GB |
| License | Apache 2.0 |
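As a sanity check on the footprint row: at 8 bits the dominant term is one byte per parameter, and in an MoE model every expert's weights are stored on disk even though only 17B are active per token. A back-of-envelope calculation (ignoring per-group quantization scales and metadata, which add a small overhead):

```python
# One byte per weight at 8-bit; group scales and metadata not counted.
params = 397e9          # total parameters (MoE: all experts are stored)
bytes_per_param = 1     # 8 bits
disk_gb = params * bytes_per_param / 1e9
print(f"~{disk_gb:.0f} GB")  # prints ~397 GB, matching the table
```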

## TurboQuant vs RotorQuant

| Aspect | TurboQuant (this repo) | RotorQuant |
|---|---|---|
| Rotation | Randomized Hadamard (static) | Learned orthogonal rotors (data-calibrated) |
| Calibration | Zero-shot (none required) | ~512-sample calibration pass |
| Accuracy @ 8-bit | ~99.9% of FP16 baseline | ~99.95% of FP16 baseline |
| Best for | Fastest turnaround; no calibration data | Maximum fidelity in long-reasoning regimes |

## Memory Estimates (8-bit MLX)

| Context | Active memory (approx.) |
|---|---|
| 8K | ~405 GB |
| 32K | ~415 GB |
| 128K | ~445 GB |
| 256K | ~475 GB |
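If the growth in the table is treated as roughly linear in context length (a simplifying assumption; KV-cache growth dominates the increase), intermediate context sizes can be estimated by interpolating between the endpoints:

```python
# Linear interpolation between the table's 8K and 256K endpoints.
# Figures are approximate; real usage also depends on batch size and runtime overhead.
def est_memory_gb(context_tokens):
    x0, y0 = 8_000, 405.0
    x1, y1 = 256_000, 475.0
    return y0 + (y1 - y0) * (context_tokens - x0) / (x1 - x0)

print(round(est_memory_gb(128_000)))  # prints 439
```

The 128K estimate (~439 GB) comes in slightly below the table's ~445 GB, so treat the linear interpolation as a lower bound.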

## Hardware Requirements

- Minimum: Apple Silicon workstation with 512 GB unified memory (Mac Studio M-series configurations)
- Recommended: 512 GB+ for long-context workloads
- Does not fit on 96 GB / 128 GB / 192 GB / 256 GB Macs; use the 4-bit or 2-bit variants instead
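The fit decision above can be expressed as a trivial helper. The thresholds are copied from the memory table; the 20 GB headroom figure is an assumption for OS and other processes, not a measured value, and contexts above 256K are out of scope:

```python
# Approximate figures from "Memory Estimates (8-bit MLX)" above.
MEM_BY_CONTEXT_GB = {8_000: 405, 32_000: 415, 128_000: 445, 256_000: 475}

def fits(unified_memory_gb, context_tokens, headroom_gb=20):
    # Smallest table entry that covers the requested context
    need = next(v for k, v in sorted(MEM_BY_CONTEXT_GB.items())
                if context_tokens <= k)
    return unified_memory_gb >= need + headroom_gb

print(fits(512, 32_000))  # True
print(fits(192, 8_000))   # False: use a 4-bit or 2-bit variant
```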
