# Qwen3-REAP-15B-A3B-W4A16 (Custom Calibration)
A compressed variant of `Qwen/Qwen3-30B-A3B` using REAP expert pruning followed by AutoRound W4A16 quantization, calibrated on a custom domain-specific dataset.

For the version calibrated on the default NeelNanda/pile-10k dataset, see `atbender/Qwen3-REAP-15B-A3B-W4A16`.
| Property | Value |
|---|---|
| Architecture | Qwen3MoE (Mixture-of-Experts) |
| Original experts | 128 per layer |
| Pruned experts | 64 per layer (50% pruning via REAP) |
| Active experts per token | 8 |
| Layers | 48 |
| Hidden size | 2048 |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Quantization method | AutoRound v0.10.2 (signed gradient descent) |
| Group size | 128 |
| Calibration data | Custom multi-domain (1000 samples, see below) |
| Calibration samples used | 128 (seqlen=512) |
| Model size | ~8.7 GB |
| Original model size | ~19 GB (BF16 pruned) / ~34 GB (BF16 original) |
## Compression Pipeline

### Stage 1: REAP Expert Pruning (128 → 64 experts/layer)
REAP (Router-weighted Expert Activation Pruning), from Cerebras Research, scores MoE experts by the router-gate-weighted norm of their activations and removes the lowest-scoring ones without any retraining. Applied here at a 50% pruning ratio, it reduces each layer from 128 to 64 experts; the router still activates 8 experts per token.
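The scoring idea can be sketched as follows. This is a toy reimplementation of the router-weighted criterion described above, not Cerebras's code: each expert's saliency is the average, over calibration tokens, of its router gate value times its output norm, and the lowest-scoring half is dropped.

```python
def reap_saliency(gate_weights, expert_output_norms):
    """Per-expert saliency: mean over tokens of gate weight * output norm.

    gate_weights[t][e]        -- router gate value for token t, expert e
    expert_output_norms[t][e] -- norm of expert e's output on token t
    (Toy stand-in for the router-weighted activation-norm criterion.)
    """
    n_tokens = len(gate_weights)
    n_experts = len(gate_weights[0])
    scores = [0.0] * n_experts
    for t in range(n_tokens):
        for e in range(n_experts):
            scores[e] += gate_weights[t][e] * expert_output_norms[t][e]
    return [s / n_tokens for s in scores]

def prune_experts(scores, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of experts by saliency."""
    k = int(len(scores) * keep_ratio)
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:k])

# Tiny example: 4 experts over 3 tokens; expert 2 has large outputs
# but is almost never routed to, so its saliency stays low.
gates = [[0.5, 0.3, 0.0, 0.2],
         [0.4, 0.4, 0.1, 0.1],
         [0.6, 0.2, 0.0, 0.2]]
norms = [[1.0, 2.0, 5.0, 1.5]] * 3
kept = prune_experts(reap_saliency(gates, norms), keep_ratio=0.5)
```

Router weighting is the key design choice: a rarely-selected expert contributes little to the layer output no matter how large its raw activations are, so it is safe to prune.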
### Stage 2: AutoRound W4A16 Quantization
AutoRound (Intel) uses signed gradient descent to learn optimal weight-rounding offsets, typically outperforming simple round-to-nearest (RTN) on MoE models. All linear layers are quantized to 4-bit integers with group size 128, except the MoE router weights (`mlp.gate`), which are kept at FP16 across all 48 layers to preserve routing precision.
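For contrast, the plain RTN baseline that AutoRound improves upon can be sketched in a few lines: a weight row is split into groups of 128, each group gets its own FP16 scale, and values round to signed 4-bit integers. AutoRound replaces the fixed `round()` with a learned per-weight rounding offset tuned by signed gradient descent (not shown here); this sketch only illustrates the W4A16 group layout.

```python
def rtn_quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def quantize_w4a16(row, group_size=128):
    """Quantize one weight row group-by-group, one scale per group."""
    out = []
    for i in range(0, len(row), group_size):
        q, scale = rtn_quantize_group(row[i:i + group_size])
        out.append((q, scale))
    return out

def dequantize(groups):
    """Reconstruct FP values from (int4 codes, scale) pairs."""
    return [qi * scale for q, scale in groups for qi in q]

row = [0.01 * (i % 17 - 8) for i in range(256)]    # toy weight row
groups = quantize_w4a16(row)
recon = dequantize(groups)
err = max(abs(a - b) for a, b in zip(row, recon))  # bounded by scale / 2
```

Smaller groups mean more scales (more overhead) but tighter per-group ranges; group size 128 is the common middle ground this model uses.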
## Calibration Dataset
This version uses a custom multi-domain calibration dataset (1000 samples) designed to better represent the model's target use cases:
| Source | Proportion | Description |
|---|---|---|
| CoderForge agentic trajectories | 40% | Multi-turn agentic coding conversations |
| code_search_net (Python) | 30% | Python source code |
| C4 (English) | 10% | Web-crawled English text |
| NeelNanda/pile-10k | 20% | General-purpose text |
Of the 1000 samples, 526 passed the seqlen≥512 filter; AutoRound sampled 128 for calibration.
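The mixing and filtering logic can be sketched as follows. This is a hypothetical helper, not the actual build script; corpus names and sizes are stand-ins, and only the proportions and the seqlen ≥ 512 filter mirror the table above.

```python
import random

def build_calibration_set(sources, total=1000, min_len=512, seed=0):
    """Draw samples from each source by proportion, then keep only those
    with at least `min_len` tokens (mirrors the seqlen >= 512 filter)."""
    rng = random.Random(seed)
    mixed = []
    for name, proportion, samples in sources:
        n = int(total * proportion)
        mixed.extend(rng.sample(samples, n) if len(samples) >= n else samples)
    return [s for s in mixed if len(s) >= min_len]

# Toy stand-in corpora: lists of pre-tokenized samples (token-id lists).
corpora = [
    ("agentic_trajectories", 0.4, [[0] * 600] * 500),
    ("code_search_net",      0.3, [[0] * 400] * 500),  # too short: filtered
    ("c4_en",                0.1, [[0] * 800] * 500),
    ("pile_10k",             0.2, [[0] * 512] * 500),
]
calib = build_calibration_set(corpora, total=1000)
```

As in the real run, filtering happens after mixing, so short-sample sources end up under-represented in the final calibration pool.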
## Usage

### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B-W4A16-custom-calib",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B-W4A16-custom-calib",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain mixture-of-experts models."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With vLLM

```bash
vllm serve atbender/Qwen3-REAP-15B-A3B-W4A16-custom-calib \
  --trust-remote-code \
  --quantization auto_round \
  --max-model-len 4096
```
## Technical Notes

### Monkey-patches for AutoRound + Qwen3 MoE
Two patches are required before importing AutoRound to avoid compatibility issues:
- **Conv1D shim**: `transformers.pytorch_utils.Conv1D` was removed in transformers 5.x, but AutoRound still references it. Shimmed with `torch.nn.Linear`.
- **MLLM detection override**: Qwen models can be misdetected as multimodal by `auto_round.utils.is_mllm_model`, which sends calibration down the wrong path. Overridden to always return `False`.
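The two patches amount to only a few lines. A sketch of what was applied (imports are deferred inside the function so it can be defined without torch, transformers, or auto-round installed; exact attribute names follow the description above):

```python
def apply_autoround_qwen3_patches():
    """Apply both compatibility shims; call before constructing the
    AutoRound object. Assumes torch, transformers, and auto-round
    are installed in the environment."""
    import torch
    import transformers.pytorch_utils as tpu

    # Conv1D shim: transformers 5.x dropped Conv1D, but AutoRound still
    # references it; torch.nn.Linear serves as a drop-in stand-in here.
    if not hasattr(tpu, "Conv1D"):
        tpu.Conv1D = torch.nn.Linear

    # MLLM override: stop is_mllm_model from misdetecting Qwen as
    # multimodal and routing calibration down the wrong path.
    import auto_round.utils as aru
    aru.is_mllm_model = lambda *args, **kwargs: False
```

Both patches must run before AutoRound inspects the model, i.e. before the quantizer object is created.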
### Router weights preserved at FP16
All 48 `model.layers.*.mlp.gate` modules are kept at full precision (FP16) so that expert-routing decisions remain accurate after quantization.
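Selecting the modules to exempt is a simple name match against the pattern above; a minimal sketch (the toy module list is illustrative, only the `model.layers.*.mlp.gate` pattern comes from this card):

```python
from fnmatch import fnmatch

def fp16_exempt_modules(module_names, pattern="model.layers.*.mlp.gate"):
    """Return the module names to exclude from 4-bit quantization."""
    return [name for name in module_names if fnmatch(name, pattern)]

# Toy module list covering a 2-layer slice of the model:
names = [
    "model.layers.0.mlp.gate",                   # router: kept at FP16
    "model.layers.0.mlp.experts.0.gate_proj",    # expert FFN: quantized
    "model.layers.1.mlp.gate",                   # router: kept at FP16
    "model.layers.1.self_attn.q_proj",           # attention: quantized
]
exempt = fp16_exempt_modules(names)
```

Note that the experts' `gate_proj` projections do not match the pattern; only the per-layer router (`mlp.gate`) is exempted.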
## Hardware & Runtime
- Hardware: 2× NVIDIA RTX A6000 (48 GB each)
- Quantization time: ~50 minutes
- Software: AutoRound 0.10.2, transformers 4.55.0, PyTorch 2.7