Qwen3-30B-A3B-AWQ-W4-g128-c4-512x512
Post-training quantized checkpoint of Qwen/Qwen3-30B-A3B
produced by the pex/baselines pipeline as part of the PEX paper baselines.
Quantization
| Knob | Value |
|---|---|
| Method | AWQ |
| Scheme | W4 |
| Group size | 128 |
| Producer tool | autoawq |
| Format | awq |
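To make the "W4" and "Group size 128" knobs concrete, here is a minimal sketch of group-wise 4-bit weight quantization in plain Python. It uses simple asymmetric round-to-nearest per group and does not include AWQ's activation-aware scaling search, so it illustrates the storage scheme only, not the autoawq implementation. The function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = (1 << bits) - 1                 # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0        # fall back to 1.0 for constant groups
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups; each group gets its own scale and offset."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]
```

Each group of 128 weights stores 4-bit codes plus one scale and one offset, so the reconstruction error per weight is bounded by half a quantization step of that group.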
Calibration
- Corpus: datablations/c4-subsets (100m/c4_100m.jsonl)
- Samples: 512 × 512 tokens
- Seed: 0
- Recipe fingerprint: d141fefd74a6f339
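As a rough illustration of what a "512 × 512 tokens, seed 0" calibration set can look like, the sketch below shuffles pre-tokenized documents with a fixed seed, concatenates them, and cuts fixed-length windows. This is one plausible procedure under stated assumptions, not the pipeline's actual sampling code; the function name and the input format (a list of token-id lists) are hypothetical.

```python
import random


def build_calibration_set(docs, n_samples=512, seq_len=512, seed=0):
    """Return up to n_samples windows of seq_len token ids.

    docs: list of token-id lists (tokenization is assumed to happen elsewhere).
    """
    rng = random.Random(seed)              # fixed seed makes the order reproducible
    order = list(range(len(docs)))
    rng.shuffle(order)

    stream, samples = [], []
    for idx in order:
        stream.extend(docs[idx])
        # Cut as many full windows as the current stream allows.
        while len(stream) >= seq_len and len(samples) < n_samples:
            samples.append(stream[:seq_len])
            stream = stream[seq_len:]
        if len(samples) == n_samples:
            break
    return samples
```

With the defaults, a full set holds 512 × 512 = 262,144 calibration tokens.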
Skipped modules
For Qwen3 MoE, the following modules are skipped (left unquantized): lm_head, the router (mlp.gate), and the shared-expert gate. All MLP up/down/gate projections inside each expert are quantized.
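The skip rule above can be sketched as a name filter. The module paths used here (for example `model.layers.0.mlp.gate` and `mlp.shared_expert_gate`) are assumptions based on common Hugging Face naming for Qwen MoE models, and `should_skip` is a hypothetical helper, not part of the pex/baselines pipeline.

```python
def should_skip(module_name: str) -> bool:
    """Return True for modules that stay unquantized under the rule above."""
    if module_name == "lm_head":
        return True
    # Router: the name must end in "mlp.gate" exactly, so that
    # "mlp.experts.N.gate_proj" (an expert projection) is NOT matched.
    if module_name.endswith(".mlp.gate"):
        return True
    # Shared-expert gate (assumed name).
    if module_name.endswith(".mlp.shared_expert_gate"):
        return True
    return False
```

Matching on the end of the name rather than on the substring `gate` is the important design choice: a substring match would wrongly exclude every expert's `gate_proj`.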
Serving with vLLM
vllm serve morriszjm/Qwen3-30B-A3B-AWQ-W4-g128-c4-512x512 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768
vLLM auto-detects the quantization format from config.json.
For AWQ, you may pass --quantization awq_marlin for the fastest kernel.
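Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library; it assumes the server listens on vLLM's default `http://localhost:8000`, and the helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

MODEL = "morriszjm/Qwen3-30B-A3B-AWQ-W4-g128-c4-512x512"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000"):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `ask("What is the capital of France?")` returns the model's reply as a string, provided the server is reachable at the assumed address.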
Reproducing
The exact producer recipe (including the calibration hash above) is in
meta.json next to the weights.
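To confirm that a downloaded copy matches the recipe listed under Calibration, one option is to look for the fingerprint in meta.json. The schema of meta.json is not documented here, so this sketch makes no assumption about key names and simply searches every value in the file; `contains_value` and `check_fingerprint` are hypothetical helpers.

```python
import json


def contains_value(obj, needle):
    """Recursively search dicts and lists for a value equal to needle."""
    if isinstance(obj, dict):
        return any(contains_value(v, needle) for v in obj.values())
    if isinstance(obj, list):
        return any(contains_value(v, needle) for v in obj)
    return obj == needle


def check_fingerprint(meta_path, fingerprint="d141fefd74a6f339"):
    """Return True if the fingerprint appears as a value anywhere in meta.json."""
    with open(meta_path, encoding="utf-8") as f:
        return contains_value(json.load(f), fingerprint)
```

If the fingerprint is stored as part of a longer string rather than as a standalone value, the exact-equality check would need to be relaxed to a substring test.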
Reference
This checkpoint is one of three quantization baselines (RTN / GPTQ / AWQ) used to anchor the Pareto plots in the PEX paper. It is not a SOTA release: it is an out-of-the-box reference produced with each method's paper-default recipe to enable a fair method-vs-method comparison.