Qwen3.5-397B-A17B-RotorQuant-MLX-8bit

8-bit MLX weight-quantized build of Qwen/Qwen3.5-397B-A17B — a 397B total / 17B active Sparse MoE multimodal model — prepared with RotorQuant (learned orthogonal rotors, calibrated on ~512 samples before quantization). Optimized for Apple Silicon via MLX.

At 8 bits, RotorQuant is effectively indistinguishable from FP16 on standard benchmarks while roughly halving the on-disk size (about 2× compression relative to FP16 weights).
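
The 2× figure follows from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate only: it assumes every weight is stored at the stated bit width and ignores quantization scales, any layers kept at higher precision, and file-format overhead, so real checkpoints will differ somewhat. The helper name `weight_size_gb` is made up for illustration.

```python
# Rough weight-storage estimate; not a measurement of the actual checkpoint.
TOTAL_PARAMS = 397e9  # total parameter count stated in this model card


def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


fp16_gb = weight_size_gb(TOTAL_PARAMS, 16)
int8_gb = weight_size_gb(TOTAL_PARAMS, 8)
ratio = fp16_gb / int8_gb

print(f"FP16: ~{fp16_gb:.0f} GB, 8-bit: ~{int8_gb:.0f} GB, ratio: {ratio:.1f}x")
```

The 8-bit result lines up with the approximate disk footprint listed under Model Specs.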

Quickstart

from mlx_lm import load, generate

model, tokenizer = load("majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    add_generation_prompt=True,
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)

Multimodal via mlx-vlm:

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit")
prompt = apply_chat_template(processor, config=model.config,
                             prompt="Describe this diagram.", num_images=1)
out = generate(model, processor, prompt, image=["./diagram.png"], max_tokens=512)
print(out)

Model Specs

Property	Value
Base model	Qwen/Qwen3.5-397B-A17B
Architecture	Sparse Mixture-of-Experts (MoE)
Total parameters	397B
Active per token	17B
Modalities	Image + Text → Text (`image-text-to-text`)
Context window	256K tokens
Weight quantization	8-bit MLX (RotorQuant learned rotors)
Approx. disk footprint	~397 GB
License	Apache 2.0

RotorQuant vs TurboQuant

Aspect	RotorQuant (this repo)	TurboQuant
Rotation	Learned orthogonal rotors (data-calibrated)	Randomized Hadamard (static)
Calibration	~512 sample calibration pass	Zero-shot
Accuracy @ 8-bit	~99.95% of FP16 baseline	~99.9% of FP16 baseline
Best for	Maximum fidelity in long-reasoning regimes	Fastest turnaround, no calibration data
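
To illustrate why rotating weights before quantization helps at all, here is a minimal pure-Python sketch of the static approach described in the TurboQuant column: a randomized Hadamard rotation (random sign flips followed by an orthonormal Walsh-Hadamard transform) applied before symmetric 8-bit absmax quantization. It is not the RotorQuant or TurboQuant implementation; the function names, the group size of 256, and the synthetic weights with a single outlier are all assumptions chosen for the demonstration. A learned-rotor scheme would, per the table above, replace the fixed rotation with an orthogonal transform fitted on calibration data.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; it is its own inverse."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_dequantize(vec, bits=8):
    """Symmetric absmax quantization of one group, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in vec)
    if peak == 0.0:
        return list(vec)
    step = peak / qmax
    return [round(x / step) * step for x in vec]


def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


rng = random.Random(0)
n = 256  # illustrative group size (must be a power of two for fwht)
weights = [rng.gauss(0.0, 0.02) for _ in range(n)]
weights[7] = 1.5  # a single outlier stretches the absmax quantization range

# Baseline: quantize the raw weights directly.
direct = quantize_dequantize(weights)

# Rotation path: y = H(D w), quantize y, then undo with w' = D H(y_q).
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
rotated = fwht([s * w for s, w in zip(signs, weights)])
restored = [s * x for s, x in zip(signs, fwht(quantize_dequantize(rotated)))]

err_direct = rms_error(weights, direct)
err_rotated = rms_error(weights, restored)
print(f"RMS error without rotation: {err_direct:.6f}")
print(f"RMS error with rotation:    {err_rotated:.6f}")
```

The rotation spreads the outlier's energy across the whole group, so the quantization step shrinks and the reconstruction error drops. Because the transform is orthogonal, undoing it does not amplify the quantization error.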

Memory Estimates (8-bit MLX)

Context	Active memory (approx.)
8K	~405 GB
32K	~415 GB
128K	~445 GB
256K	~475 GB
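
For context lengths between the listed rows, a rough estimate can be obtained by interpolating the table linearly. The sketch below is only a planning aid under stated assumptions: it treats "K" as 1,024 tokens, assumes memory grows piecewise-linearly between the published points, and uses hypothetical helper names (`estimate_memory_gb`, `fits`). Actual usage depends on batch size, cache settings, and runtime overhead.

```python
from bisect import bisect_left

# (context tokens, approx. active memory in GB), copied from the table above.
# Assumption: "K" means 1,024 tokens.
MEMORY_TABLE = [
    (8 * 1024, 405.0),
    (32 * 1024, 415.0),
    (128 * 1024, 445.0),
    (256 * 1024, 475.0),
]


def estimate_memory_gb(context_tokens: int) -> float:
    """Piecewise-linear interpolation of the published memory estimates."""
    xs = [tokens for tokens, _ in MEMORY_TABLE]
    if context_tokens > xs[-1]:
        raise ValueError("context exceeds the largest row in the table")
    if context_tokens <= xs[0]:
        # Below the first row, fall back to the 8K figure as an upper bound.
        return MEMORY_TABLE[0][1]
    i = bisect_left(xs, context_tokens)  # xs[i - 1] < context_tokens <= xs[i]
    (x0, y0), (x1, y1) = MEMORY_TABLE[i - 1], MEMORY_TABLE[i]
    return y0 + (y1 - y0) * (context_tokens - x0) / (x1 - x0)


def fits(context_tokens: int, memory_budget_gb: float) -> bool:
    """True if the interpolated estimate is within the given memory budget."""
    return estimate_memory_gb(context_tokens) <= memory_budget_gb


for ctx in (8 * 1024, 64 * 1024, 256 * 1024):
    print(f"{ctx:>7} tokens: ~{estimate_memory_gb(ctx):.0f} GB, "
          f"fits in 512 GB: {fits(ctx, 512.0)}")
```

Note that `memory_budget_gb` should reflect the memory actually available to the process, which is typically less than the machine's total unified memory.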

Hardware Requirements

Minimum: Apple Silicon workstation with 512 GB unified memory
Recommended: 512 GB+ for long-context workloads
Does not fit on Macs with 96 GB, 128 GB, 192 GB, or 256 GB of unified memory; use the 4-bit or 2-bit variants instead
