Qwen3-REAP-15B-A3B-W4A16 (Custom Calibration)

A compressed variant of Qwen/Qwen3-30B-A3B using REAP expert pruning followed by AutoRound W4A16 quantization, calibrated on a custom domain-specific dataset.

For the version calibrated on the default NeelNanda/pile-10k dataset, see atbender/Qwen3-REAP-15B-A3B-W4A16.

| Property | Value |
|---|---|
| Architecture | Qwen3MoE (Mixture-of-Experts) |
| Original experts | 128 per layer |
| Pruned experts | 64 per layer (50% pruning via REAP) |
| Active experts per token | 8 |
| Layers | 48 |
| Hidden size | 2048 |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Quantization method | AutoRound v0.10.2 (signed gradient descent) |
| Group size | 128 |
| Calibration data | Custom multi-domain (1000 samples, see below) |
| Calibration samples used | 128 (seqlen=512) |
| Model size | ~8.7 GB |
| Original model size | ~19 GB (BF16 pruned) / ~34 GB (BF16 original) |
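The ~8.7 GB figure is consistent with a back-of-envelope estimate. The sketch below assumes ~15e9 quantized weights at 4 bits plus one FP16 scale and one 4-bit zero-point amortized per group of 128; the FP16 embeddings and router weights (not counted here) account for most of the remaining gap.

```python
# Rough size estimate for the pruned, quantized model (assumptions noted above;
# these are illustrative numbers, not a byte-exact accounting).
params = 15e9                          # ~15B weights after 50% expert pruning
bits_per_weight = 4 + (16 + 4) / 128   # 4-bit payload + amortized scale/zero ≈ 4.16 bits
quantized_gb = params * bits_per_weight / 8 / 1e9
print(f"{quantized_gb:.1f} GB")        # ~7.8 GB; FP16 parts bring it near ~8.7 GB on disk
```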

Compression Pipeline

Stage 1: REAP Expert Pruning (128 → 64 experts/layer)

REAP (Router-weighted Expert Activation Pruning) by Cerebras Research prunes MoE experts using router-weighted activation norms. Applied here at a 50% pruning ratio, it reduces each layer from 128 to 64 experts while keeping all 8 active experts per token.
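The scoring idea can be illustrated with a toy sketch (this is not the Cerebras implementation; the shapes and the saliency formula below are simplified assumptions): each expert is scored by the mean router-weighted L2 norm of its output over calibration tokens, and the lowest-scoring half is pruned.

```python
import numpy as np

# Toy REAP-style saliency: router probability times expert-output norm,
# averaged over calibration tokens; keep the top 50% of experts.
rng = np.random.default_rng(0)
n_tokens, n_experts, d = 256, 8, 16
gate_probs = rng.dirichlet(np.ones(n_experts), size=n_tokens)  # router weights per token
expert_out = rng.normal(size=(n_tokens, n_experts, d))         # per-expert outputs

saliency = (gate_probs * np.linalg.norm(expert_out, axis=-1)).mean(axis=0)
keep = np.argsort(saliency)[-n_experts // 2:]                  # 50% pruning ratio
print(sorted(keep.tolist()))
```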

Stage 2: AutoRound W4A16 Quantization

AutoRound (Intel) applies signed gradient descent to learn optimal weight-rounding offsets, which outperforms simple round-to-nearest (RTN) on MoE models. All linear layers are quantized to 4-bit integers with group size 128, except the MoE router weights (mlp.gate), which are kept at FP16 across all 48 layers to maintain routing precision.
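The core idea can be sketched in a few lines (a toy, not Intel's implementation): a learnable rounding offset v, clipped to [-0.5, 0.5], is updated by the sign of a straight-through gradient of the layer reconstruction error, so the learned rounding never does worse than RTN on the calibration activations.

```python
import numpy as np

# Toy signed-gradient rounding for one linear layer (assumed shapes/scales).
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))      # calibration activations
w = rng.normal(size=(8, 4))       # layer weights
s = np.abs(w).max() / 7           # single scale onto a 4-bit symmetric grid

def quantize(w, v):
    return s * np.clip(np.round(w / s + v), -8, 7)

v = np.zeros_like(w)              # v = 0 is exactly RTN
best_v, best_loss = v.copy(), np.inf
for _ in range(200):
    err = x @ quantize(w, v) - x @ w
    loss = np.mean(err ** 2)
    if loss < best_loss:
        best_v, best_loss = v.copy(), loss
    grad = (x.T @ err) * s        # straight-through: treat round() as identity
    v = np.clip(v - 0.01 * np.sign(grad), -0.5, 0.5)

rtn_loss = np.mean((x @ quantize(w, np.zeros_like(w)) - x @ w) ** 2)
print(best_loss <= rtn_loss)      # → True (the loop starts from the RTN solution)
```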

Calibration Dataset

This version uses a custom multi-domain calibration dataset (1000 samples) designed to better represent the model's target use cases:

| Source | Proportion | Description |
|---|---|---|
| CoderForge agentic trajectories | 40% | Multi-turn agentic coding conversations |
| code_search_net (Python) | 30% | Python source code |
| C4 (English) | 10% | Web-crawled English text |
| NeelNanda/pile-10k | 20% | General-purpose text |

Of the 1000 samples, 526 passed the seqlen≥512 filter; AutoRound sampled 128 for calibration.
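The construction pipeline amounts to mix-by-proportion, length-filter, then subsample. A hypothetical sketch (source names and token counts below are stand-ins, not the real data):

```python
import random

# Hypothetical calibration-set construction: 1000 samples mixed by proportion,
# filtered to seqlen >= 512 tokens, then up to 128 drawn for the quantizer.
random.seed(0)
mix = {"coderforge": 400, "code_search_net": 300, "c4": 100, "pile-10k": 200}

# stand-in corpus: (source, token_count) pairs instead of real documents
corpus = [(src, random.randint(100, 2000)) for src, n in mix.items() for _ in range(n)]
assert len(corpus) == 1000

eligible = [s for s in corpus if s[1] >= 512]          # seqlen filter (526 in practice)
calib = random.sample(eligible, k=min(128, len(eligible)))
print(len(calib))
```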

Usage

With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B-W4A16-custom-calib",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B-W4A16-custom-calib",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain mixture-of-experts models."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With vLLM

```bash
vllm serve atbender/Qwen3-REAP-15B-A3B-W4A16-custom-calib \
    --trust-remote-code \
    --quantization auto_round \
    --max-model-len 4096
```

Technical Notes

Monkey-patches for AutoRound + Qwen3 MoE

Two patches are required before importing AutoRound to avoid compatibility issues:

  1. Conv1D shim — transformers.pytorch_utils.Conv1D was removed in transformers 5.x but AutoRound references it. Shimmed with torch.nn.Linear.
  2. MLLM detection override — Qwen models can be misdetected as multimodal by auto_round.utils.is_mllm_model, causing the wrong calibration path. Overridden to always return False.
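The two patches above might look like the following guarded sketch (the patch targets are taken from this card; the helper name and the import guards are illustrative, and the function is a no-op when the libraries are absent):

```python
import importlib.util

def apply_autoround_patches():
    """Sketch of the two pre-import patches described above; call before
    AutoRound is used elsewhere in the quantization script."""
    applied = []
    if (importlib.util.find_spec("transformers") is not None
            and importlib.util.find_spec("torch") is not None):
        import torch
        import transformers.pytorch_utils as tpu
        if not hasattr(tpu, "Conv1D"):           # removed in transformers 5.x
            tpu.Conv1D = torch.nn.Linear         # shim so AutoRound's reference resolves
            applied.append("conv1d_shim")
    if importlib.util.find_spec("auto_round") is not None:
        import auto_round.utils as ar_utils
        ar_utils.is_mllm_model = lambda *a, **kw: False  # force text-only calibration path
        applied.append("mllm_override")
    return applied

patches = apply_autoround_patches()
```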

Router weights preserved at FP16

All 48 model.layers.*.mlp.gate modules are kept at full precision (16-bit float) to ensure expert routing decisions remain accurate after quantization.
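In AutoRound-style per-layer overrides this exclusion could be expressed as a config dict like the one below (the exact schema may differ between AutoRound versions; the `{"bits": 16}` entry is an assumption based on its per-layer override convention):

```python
# Hypothetical per-layer override keeping the 48 router gates out of 4-bit
# quantization; the module names match this model's layout.
n_layers = 48
layer_config = {
    f"model.layers.{i}.mlp.gate": {"bits": 16} for i in range(n_layers)
}
print(len(layer_config))  # → 48
```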

Hardware & Runtime

  • Hardware: 2× NVIDIA RTX A6000 (48 GB each)
  • Quantization time: ~50 minutes
  • Software: AutoRound 0.10.2, transformers 4.55.0, PyTorch 2.7