HuggingFaceH4/ultrachat_200k
Viewer • Updated • 515k • 65.5k • 729
# Model Card: Qwen3-30B-A3B-AWQ-Sym
This model is a quantized 4-bit version of **Qwen3-30B-A3B**, optimized for efficient inference with frameworks like **vLLM**. The quantization was performed using `llm-compressor` with the **AWQ** (Activation-aware Weight Quantization) algorithm.
### 📊 Model Overview
* **Base Model:** [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)
* **Architecture:** Mixture-of-Experts (MoE)
* **Total Parameters:** ~30.5B
* **Active Parameters per Token:** ~3.3B
* **Experts:** 128 (8 active per token)
* **Quantization:** AWQ (4-bit Weights, 16-bit Activations)
* **Configuration:** Symmetric (`W4A16`), optimized for vLLM compatibility.
### 🛠 Quantization Details
The compression was executed with the following setup:
* **Framework:** `llm-compressor` (v6.2.0)
* **Calibration Dataset:** `HuggingFaceH4/ultrachat_200k` (SFT split)
* **Samples:** 256 samples with a maximum sequence length of 512 tokens.
* **Modifiers:**
* `AWQModifier` using the `W4A16` scheme.
* **Ignored Layers:** `lm_head` and specific MoE gate layers (`mlp.gate`, `mlp.shared_expert_gate`) were excluded to maintain expert routing accuracy.
---
### 🚀 Inference & Deployment
#### Using vLLM (Recommended)
This model is highly optimized for [vLLM](https://github.com/vllm-project/vllm). The 4-bit quantization significantly reduces the VRAM footprint required for this MoE architecture.
**Installation:**
```bash
pip install vllm>=0.11.0
Python Example (Offline Inference):
from vllm import LLM, SamplingParams
model_id = "dtometzki/Qwen3-30B-A3B-awq-sym"
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=512,
repetition_penalty=1.1
)
# vLLM automatically detects the AWQ quantization
llm = LLM(model=model_id, trust_remote_code=True)
prompts = ["Explain the benefits of Mixture-of-Experts models."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(f"Prompt: {output.prompt}")
print(f"Response: {output.outputs[0].text}")
Start as API Server:
python -m vllm.entrypoints.openai.api_server \
--model dtometzki/Qwen3-30B-A3B-awq-sym \
--quantization awq \
--dtype half \
--tensor-parallel-size 1
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "dtometzki/Qwen3-30B-A3B-awq-sym"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
torch_dtype="auto"
)
--reasoning-parser qwen3 in vLLM to handle <think> blocks correctly.This model uses symmetric AWQ as required by current vLLM support for asymmetric MoE architectures.