---
license: apache-2.0
base_model: Qwen/Qwen3-30B-A3B
datasets:
- HuggingFaceH4/ultrachat_200k
language:
- en
- zh
pipeline_tag: text-generation
library_name: llm-compressor
tags:
- awq
- quantization
- moe
- llm-compressor
- qwen
---

# Model Card: Qwen3-30B-A3B-AWQ-Sym

This model is a quantized 4-bit version of **Qwen3-30B-A3B**, optimized for efficient inference with frameworks like **vLLM**. The quantization was performed using `llm-compressor` with the **AWQ** (Activation-aware Weight Quantization) algorithm.

### 📊 Model Overview

* **Base Model:** [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)
* **Architecture:** Mixture-of-Experts (MoE)
* **Total Parameters:** ~30.5B
* **Active Parameters per Token:** ~3.3B
* **Experts:** 128 (8 active per token)
* **Quantization:** AWQ (4-bit Weights, 16-bit Activations)
* **Configuration:** Symmetric (`W4A16`), optimized for vLLM compatibility.

### 🛠 Quantization Details

The compression was executed with the following setup:

* **Framework:** `llm-compressor` (v6.2.0)
* **Calibration Dataset:** `HuggingFaceH4/ultrachat_200k` (SFT split)
* **Samples:** 256 samples with a maximum sequence length of 512 tokens.
* **Modifiers:**
    * `AWQModifier` using the `W4A16` scheme.
    * **Ignored Layers:** `lm_head` and specific MoE gate layers (`mlp.gate`, `mlp.shared_expert_gate`) were excluded to maintain expert routing accuracy.

---

### 🚀 Inference & Deployment

#### Using vLLM (Recommended)

This model is highly optimized for [vLLM](https://github.com/vllm-project/vllm). The 4-bit quantization significantly reduces the VRAM footprint required for this MoE architecture.
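
As a rough illustration of that footprint reduction, weight storage can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch, not a measurement: the group size of 128 with one 16-bit scale per group is an assumption (the card does not state the group size), every parameter is treated as quantized even though `lm_head` and the gate layers were excluded, and KV cache and activations are left out. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Estimate weight storage in GB (1 GB = 1e9 bytes)."""
    total_bits = num_params * bits_per_weight
    if group_size is not None:
        # Symmetric group quantization: one scale per group, no zero points.
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8 / 1e9


TOTAL_PARAMS = 30.5e9  # total parameter count listed in the model overview

fp16_gb = weight_memory_gb(TOTAL_PARAMS, bits_per_weight=16)
# group_size=128 is an assumption, not a value taken from this card.
w4_gb = weight_memory_gb(TOTAL_PARAMS, bits_per_weight=4, group_size=128)

print(f"FP16 weights:  ~{fp16_gb:.1f} GB")
print(f"W4A16 weights: ~{w4_gb:.1f} GB (weights only)")
```

The estimate covers weights only, so real usage in vLLM will be higher once the unquantized layers, the KV cache, and runtime overhead are added.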

**Installation:**

```bash
# Quote the requirement so the shell does not treat ">=" as a redirect
pip install "vllm>=0.11.0"
```

**Python Example (Offline Inference):**

```python
from vllm import LLM, SamplingParams

model_id = "dtometzki/Qwen3-30B-A3B-awq-sym"

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    repetition_penalty=1.1,
)

# vLLM reads the quantization scheme from the checkpoint config
llm = LLM(model=model_id, trust_remote_code=True)

prompts = ["Explain the benefits of Mixture-of-Experts models."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Response: {output.outputs[0].text}")
```

**Start as API Server:**

```bash
# The quantization scheme is read from the checkpoint config,
# so no --quantization flag is needed.
python -m vllm.entrypoints.openai.api_server \
    --model dtometzki/Qwen3-30B-A3B-awq-sym \
    --dtype half \
    --tensor-parallel-size 1
```

#### Using Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dtometzki/Qwen3-30B-A3B-awq-sym"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
```

---

### 💡 Tips for Qwen3-30B

* **VRAM Efficiency:** AWQ reduces the VRAM requirement from ~60 GB (FP16) to approximately **18-20 GB**, allowing the model to run on a single 24 GB consumer GPU (e.g., RTX 3090/4090).
* **Routing Accuracy:** Because the MoE gates were ignored during quantization, the routing mechanism between the 128 experts keeps its original precision.
* **Thinking Mode:** If you use Qwen3's thinking mode, it is recommended to enable `--reasoning-parser qwen3` in vLLM so that `<think>` blocks are handled correctly.

> [!IMPORTANT]
> This model uses **symmetric** AWQ because current vLLM support for MoE architectures does not cover asymmetric quantization.

---
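
The difference between symmetric and asymmetric schemes mentioned in the note above can be illustrated with plain round-to-nearest quantization of a single weight group. This is a minimal sketch under stated assumptions: it is not the AWQ algorithm itself (AWQ additionally searches for activation-aware scaling before rounding), and it is not `llm-compressor`'s implementation. The function names and the sample weight group are hypothetical.

```python
def quantize_symmetric(weights, bits=4):
    """Symmetric round-to-nearest: signed codes, zero point fixed at 0."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -8..7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def quantize_asymmetric(weights, bits=4):
    """Asymmetric round-to-nearest: unsigned codes plus a stored zero point."""
    qmax = 2 ** bits - 1  # codes span 0..15 for 4-bit
    # Extend the range to contain 0 so that 0.0 maps to an exact code.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point=0):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


group = [-0.8, -0.1, 0.0, 0.3, 0.5]  # hypothetical weight group

sym_codes, sym_scale = quantize_symmetric(group)
asym_codes, asym_scale, asym_zp = quantize_asymmetric(group)

print("symmetric :", sym_codes, dequantize(sym_codes, sym_scale))
print("asymmetric:", asym_codes, dequantize(asym_codes, asym_scale, asym_zp))
```

In the symmetric case a runtime only needs one scale per group to reconstruct the weights, while the asymmetric case also has to store and subtract a zero point, which is the extra piece an inference kernel must support.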