---
license: apache-2.0
base_model: Qwen/Qwen3-30B-A3B
datasets:
- HuggingFaceH4/ultrachat_200k
language:
- en
- zh
pipeline_tag: text-generation
library_name: llm-compressor
tags:
- awq
- quantization
- moe
- llm-compressor
- qwen
---

```

# Model Card: Qwen3-30B-A3B-AWQ-Sym

This model is a quantized 4-bit version of **Qwen3-30B-A3B**, optimized for efficient inference with frameworks like **vLLM**. The quantization was performed using `llm-compressor` with the **AWQ** (Activation-aware Weight Quantization) algorithm.

### 📊 Model Overview

* **Base Model:** [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)
* **Architecture:** Mixture-of-Experts (MoE)
* **Total Parameters:** ~30.5B
* **Active Parameters per Token:** ~3.3B
* **Experts:** 128 (8 active per token)


* **Quantization:** AWQ (4-bit Weights, 16-bit Activations)
* **Configuration:** Symmetric (`W4A16`), optimized for vLLM compatibility.

### 🛠 Quantization Details

The compression was executed with the following setup:

* **Framework:** `llm-compressor` (v6.2.0)
* **Calibration Dataset:** `HuggingFaceH4/ultrachat_200k` (SFT split)
* **Samples:** 256 samples with a maximum sequence length of 512 tokens.
* **Modifiers:**
* `AWQModifier` using the `W4A16` scheme.
* **Ignored Layers:** `lm_head` and specific MoE gate layers (`mlp.gate`, `mlp.shared_expert_gate`) were excluded to maintain expert routing accuracy.

---

### 🚀 Inference & Deployment

#### Using vLLM (Recommended)

This model is highly optimized for [vLLM](https://github.com/vllm-project/vllm). The 4-bit quantization significantly reduces the VRAM footprint required for this MoE architecture.

**Installation:**

```bash
pip install vllm>=0.11.0

```

**Python Example (Offline Inference):**

```python
from vllm import LLM, SamplingParams

model_id = "dtometzki/Qwen3-30B-A3B-awq-sym"

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    repetition_penalty=1.1
)

# vLLM automatically detects the AWQ quantization
llm = LLM(model=model_id, trust_remote_code=True)

prompts = ["Explain the benefits of Mixture-of-Experts models."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Response: {output.outputs[0].text}")

```

**Start as API Server:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model dtometzki/Qwen3-30B-A3B-awq-sym \
    --quantization awq \
    --dtype half \
    --tensor-parallel-size 1

```

#### Using Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dtometzki/Qwen3-30B-A3B-awq-sym"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    device_map="auto", 
    torch_dtype="auto"
)

```

---

### 💡 Tips for Qwen3-30B

* **VRAM Efficiency:** AWQ reduces the VRAM requirement from ~60 GB (FP16) to approximately **18-20 GB**, allowing the model to run on a single 24GB consumer GPU (e.g., RTX 3090/4090).
* **Routing Accuracy:** By ignoring the MoE gates during quantization, the precision of the routing mechanism between the 128 experts is preserved.
* **Thinking Mode:** If you are using the "Thinking" variant of Qwen3, it is recommended to enable `--reasoning-parser qwen3` in vLLM to handle `<think>` blocks correctly.

> [!IMPORTANT]
> This model uses **symmetric** AWQ as required by current vLLM support for asymmetric MoE architectures.

---