# Qwen3-REAP-15B-A3B-W4A16 (Custom Calibration)
A compressed variant of `Qwen/Qwen3-30B-A3B` using REAP expert pruning followed by AutoRound W4A16 quantization, calibrated on a custom domain-specific dataset.

For the version calibrated on the default NeelNanda/pile-10k dataset, see `atbender/Qwen3-REAP-15B-A3B-W4A16`.
| Property | Value |
|---|---|
| Architecture | Qwen3MoE (Mixture-of-Experts) |
| Original experts | 128 per layer |
| Pruned experts | 64 per layer (50% pruning via REAP) |
| Active experts per token | 8 |
| Layers | 48 |
| Hidden size | 2048 |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Quantization method | AutoRound v0.10.2 (signed gradient descent) |
| Group size | 128 |
| Calibration data | Custom multi-domain (1000 samples, see below) |
| Calibration samples used | 128 (seqlen=512) |
| Model size | ~8.7 GB |
| Original model size | ~19 GB (BF16 pruned) / ~34 GB (BF16 original) |
## Compression Pipeline

### Stage 1: REAP Expert Pruning (128 → 64 experts/layer)
REAP (Router-weighted Expert Activation Pruning), from Cerebras Research, scores MoE experts by the router-gate-weighted norm of their activations and removes the lowest-scoring ones without any retraining. Applied here at a 50% pruning ratio, it reduces each layer from 128 to 64 experts; the router still activates 8 experts per token.
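The scoring idea can be sketched as follows. This is a toy reimplementation of the router-weighted criterion described above, not Cerebras's code: each expert's saliency is the average, over calibration tokens, of its router gate value times its output norm, and the lowest-scoring half is dropped.

```python
def reap_saliency(gate_weights, expert_output_norms):
    """Per-expert saliency: mean over tokens of gate weight * output norm.

    gate_weights[t][e]        -- router gate value for token t, expert e
    expert_output_norms[t][e] -- norm of expert e's output on token t
    (Toy stand-in for the router-weighted activation-norm criterion.)
    """
    n_tokens = len(gate_weights)
    n_experts = len(gate_weights[0])
    scores = [0.0] * n_experts
    for t in range(n_tokens):
        for e in range(n_experts):
            scores[e] += gate_weights[t][e] * expert_output_norms[t][e]
    return [s / n_tokens for s in scores]

def prune_experts(scores, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of experts by saliency."""
    k = int(len(scores) * keep_ratio)
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:k])

# Tiny example: 4 experts over 3 tokens; expert 2 has large outputs
# but is almost never routed to, so its saliency stays low.
gates = [[0.5, 0.3, 0.0, 0.2],
         [0.4, 0.4, 0.1, 0.1],
         [0.6, 0.2, 0.0, 0.2]]
norms = [[1.0, 2.0, 5.0, 1.5]] * 3
kept = prune_experts(reap_saliency(gates, norms), keep_ratio=0.5)
```

Router weighting is the key design choice: a rarely-selected expert contributes little to the layer output no matter how large its raw activations are, so it is safe to prune.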
### Stage 2: AutoRound W4A16 Quantization
AutoRound (Intel) uses signed gradient descent to learn optimal weight-rounding offsets, typically outperforming simple round-to-nearest (RTN) on MoE models. All linear layers are quantized to 4-bit integers with group size 128, except the MoE router weights (`mlp.gate`), which are kept at FP16 across all 48 layers to preserve routing precision.
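For contrast, the plain RTN baseline that AutoRound improves upon can be sketched in a few lines: a weight row is split into groups of 128, each group gets its own FP16 scale, and values round to signed 4-bit integers. AutoRound replaces the fixed `round()` with a learned per-weight rounding offset tuned by signed gradient descent (not shown here); this sketch only illustrates the W4A16 group layout.

```python
def rtn_quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def quantize_w4a16(row, group_size=128):
    """Quantize one weight row group-by-group, one scale per group."""
    out = []
    for i in range(0, len(row), group_size):
        q, scale = rtn_quantize_group(row[i:i + group_size])
        out.append((q, scale))
    return out

def dequantize(groups):
    """Reconstruct FP values from (int4 codes, scale) pairs."""
    return [qi * scale for q, scale in groups for qi in q]

row = [0.01 * (i % 17 - 8) for i in range(256)]    # toy weight row
groups = quantize_w4a16(row)
recon = dequantize(groups)
err = max(abs(a - b) for a, b in zip(row, recon))  # bounded by scale / 2
```

Smaller groups mean more scales (more overhead) but tighter per-group ranges; group size 128 is the common middle ground this model uses.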
## Calibration Dataset
This version uses a custom multi-domain calibration dataset (1000 samples) designed to better represent the model's target use cases:
| Source | Proportion | Description |
|---|---|---|
| CoderForge agentic trajectories | 40% | Multi-turn agentic coding conversations |
| code_search_net (Python) | 30% | Python source code |
| C4 (English) | 10% | Web-crawled English text |
| NeelNanda/pile-10k | 20% | General-purpose text |
Of the 1000 samples, 526 passed the seqlen≥512 filter; AutoRound sampled 128 for calibration.
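The mixing and filtering logic can be sketched as follows. This is a hypothetical helper, not the actual build script; corpus names and sizes are stand-ins, and only the proportions and the seqlen ≥ 512 filter mirror the table above.

```python
import random

def build_calibration_set(sources, total=1000, min_len=512, seed=0):
    """Draw samples from each source by proportion, then keep only those
    with at least `min_len` tokens (mirrors the seqlen >= 512 filter)."""
    rng = random.Random(seed)
    mixed = []
    for name, proportion, samples in sources:
        n = int(total * proportion)
        mixed.extend(rng.sample(samples, n) if len(samples) >= n else samples)
    return [s for s in mixed if len(s) >= min_len]

# Toy stand-in corpora: lists of pre-tokenized samples (token-id lists).
corpora = [
    ("agentic_trajectories", 0.4, [[0] * 600] * 500),
    ("code_search_net",      0.3, [[0] * 400] * 500),  # too short: filtered
    ("c4_en",                0.1, [[0] * 800] * 500),
    ("pile_10k",             0.2, [[0] * 512] * 500),
]
calib = build_calibration_set(corpora, total=1000)
```

As in the real run, filtering happens after mixing, so short-sample sources end up under-represented in the final calibration pool.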
## Usage

### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B-W4A16-custom-calib",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B-W4A16-custom-calib",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain mixture-of-experts models."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With vLLM

```bash
vllm serve atbender/Qwen3-REAP-15B-A3B-W4A16-custom-calib \
  --trust-remote-code \
  --quantization auto_round \
  --max-model-len 4096
```
## Technical Notes

### Monkey-patches for AutoRound + Qwen3 MoE
Two patches are required before importing AutoRound to avoid compatibility issues:
- **Conv1D shim**: `transformers.pytorch_utils.Conv1D` was removed in transformers 5.x, but AutoRound still references it. Shimmed with `torch.nn.Linear`.
- **MLLM detection override**: Qwen models can be misdetected as multimodal by `auto_round.utils.is_mllm_model`, which sends calibration down the wrong path. Overridden to always return `False`.
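The two patches amount to only a few lines. A sketch of what was applied (imports are deferred inside the function so it can be defined without torch, transformers, or auto-round installed; exact attribute names follow the description above):

```python
def apply_autoround_qwen3_patches():
    """Apply both compatibility shims; call before constructing the
    AutoRound object. Assumes torch, transformers, and auto-round
    are installed in the environment."""
    import torch
    import transformers.pytorch_utils as tpu

    # Conv1D shim: transformers 5.x dropped Conv1D, but AutoRound still
    # references it; torch.nn.Linear serves as a drop-in stand-in here.
    if not hasattr(tpu, "Conv1D"):
        tpu.Conv1D = torch.nn.Linear

    # MLLM override: stop is_mllm_model from misdetecting Qwen as
    # multimodal and routing calibration down the wrong path.
    import auto_round.utils as aru
    aru.is_mllm_model = lambda *args, **kwargs: False
```

Both patches must run before AutoRound inspects the model, i.e. before the quantizer object is created.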
### Router weights preserved at FP16
All 48 `model.layers.*.mlp.gate` modules are kept at full precision (FP16) so that expert-routing decisions remain accurate after quantization.
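Selecting the modules to exempt is a simple name match against the pattern above; a minimal sketch (the toy module list is illustrative, only the `model.layers.*.mlp.gate` pattern comes from this card):

```python
from fnmatch import fnmatch

def fp16_exempt_modules(module_names, pattern="model.layers.*.mlp.gate"):
    """Return the module names to exclude from 4-bit quantization."""
    return [name for name in module_names if fnmatch(name, pattern)]

# Toy module list covering a 2-layer slice of the model:
names = [
    "model.layers.0.mlp.gate",                   # router: kept at FP16
    "model.layers.0.mlp.experts.0.gate_proj",    # expert FFN: quantized
    "model.layers.1.mlp.gate",                   # router: kept at FP16
    "model.layers.1.self_attn.q_proj",           # attention: quantized
]
exempt = fp16_exempt_modules(names)
```

Note that the experts' `gate_proj` projections do not match the pattern; only the per-layer router (`mlp.gate`) is exempted.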
## Hardware & Runtime
- Hardware: 2× NVIDIA RTX A6000 (48 GB each)
- Quantization time: ~50 minutes
- Software: AutoRound 0.10.2, transformers 4.55.0, PyTorch 2.7