# Model Card: Qwen3-30B-A3B-AWQ-Sym

This model is a quantized 4-bit version of **Qwen3-30B-A3B**, optimized for efficient inference with frameworks like **vLLM**. The quantization was performed using `llm-compressor` with the **AWQ** (Activation-aware Weight Quantization) algorithm.

### 📊 Model Overview

* **Base Model:** [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)
* **Architecture:** Mixture-of-Experts (MoE)
* **Total Parameters:** ~30.5B
* **Active Parameters per Token:** ~3.3B
* **Experts:** 128 (8 active per token)


* **Quantization:** AWQ (4-bit Weights, 16-bit Activations)
* **Configuration:** Symmetric (`W4A16`), optimized for vLLM compatibility.

### 🛠 Quantization Details

The compression was executed with the following setup:

* **Framework:** `llm-compressor` (v6.2.0)
* **Calibration Dataset:** `HuggingFaceH4/ultrachat_200k` (SFT split)
* **Samples:** 256 samples with a maximum sequence length of 512 tokens.
* **Modifiers:**
* `AWQModifier` using the `W4A16` scheme.
* **Ignored Layers:** `lm_head` and specific MoE gate layers (`mlp.gate`, `mlp.shared_expert_gate`) were excluded to maintain expert routing accuracy.

---

### 🚀 Inference & Deployment

#### Using vLLM (Recommended)

This model is highly optimized for [vLLM](https://github.com/vllm-project/vllm). The 4-bit quantization significantly reduces the VRAM footprint required for this MoE architecture.

**Installation:**

```bash
pip install vllm>=0.11.0

Python Example (Offline Inference):

from vllm import LLM, SamplingParams

model_id = "dtometzki/Qwen3-30B-A3B-awq-sym"

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    repetition_penalty=1.1
)

# vLLM automatically detects the AWQ quantization
llm = LLM(model=model_id, trust_remote_code=True)

prompts = ["Explain the benefits of Mixture-of-Experts models."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Response: {output.outputs[0].text}")

Start as API Server:

python -m vllm.entrypoints.openai.api_server \
    --model dtometzki/Qwen3-30B-A3B-awq-sym \
    --quantization awq \
    --dtype half \
    --tensor-parallel-size 1

Using Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dtometzki/Qwen3-30B-A3B-awq-sym"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    device_map="auto", 
    torch_dtype="auto"
)

💡 Tips for Qwen3-30B

  • VRAM Efficiency: AWQ reduces the VRAM requirement from ~60 GB (FP16) to approximately 18-20 GB, allowing the model to run on a single 24GB consumer GPU (e.g., RTX 3090/4090).
  • Routing Accuracy: By ignoring the MoE gates during quantization, the precision of the routing mechanism between the 128 experts is preserved.
  • Thinking Mode: If you are using the "Thinking" variant of Qwen3, it is recommended to enable --reasoning-parser qwen3 in vLLM to handle <think> blocks correctly.

This model uses symmetric AWQ as required by current vLLM support for asymmetric MoE architectures.


Downloads last month
47
Safetensors
Model size
5B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for dtometzki/Qwen3-30B-A3B-awq-sym

Quantized
(118)
this model

Dataset used to train dtometzki/Qwen3-30B-A3B-awq-sym