# Qwen3-4B-Instruct-2507 — AWQ W4A16

AWQ 4-bit quantization of Qwen/Qwen3-4B-Instruct-2507, optimized for efficient inference with vLLM.
## Quantization Details

| Property | Value |
|---|---|
| Method | AWQ (Activation-Aware Weight Quantization) |
| Scheme | W4A16 asymmetric, group size 128 |
| Tool | llm-compressor |
| Calibration dataset | Proprietary (multilingual IT/EN, chat templates, code, JSON output, tool calling) |
| Calibration samples | 256 |
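To make the "W4A16 asymmetric, group size 128" scheme concrete, here is a schematic NumPy sketch of the storage format: each group of 128 weights shares one fp16 scale and one 4-bit zero-point, while activations stay in 16-bit. This illustrates only the quantization arithmetic, not the AWQ algorithm itself, which additionally searches for per-channel scales that protect salient weights; all function names here are illustrative.

```python
import numpy as np

def quantize_w4a16(weights: np.ndarray, group_size: int = 128):
    """Asymmetric 4-bit group-wise quantization (storage format sketch).

    Each group of `group_size` weights gets its own scale and zero-point,
    mapping the group's [min, max] range onto the 16 levels 0..15.
    """
    out_features, in_features = weights.shape
    w = weights.reshape(out_features, in_features // group_size, group_size)
    w_min = w.min(axis=-1, keepdims=True)
    w_max = w.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0                 # 4 bits -> 16 levels
    zero = np.round(-w_min / scale)                # asymmetric zero-point
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale.astype(np.float16), zero.astype(np.uint8)

def dequantize_w4a16(q, scale, zero):
    """Reconstruct fp16 weights; activations stay 16-bit (the 'A16' part)."""
    w = (q.astype(np.float32) - zero.astype(np.float32)) * scale.astype(np.float32)
    return w.reshape(q.shape[0], -1).astype(np.float16)

# Round-trip a random fp32 matrix through the 4-bit format.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 256)).astype(np.float32)
q, s, z = quantize_w4a16(w)
w_hat = dequantize_w4a16(q, s, z)
print("max reconstruction error:", float(np.abs(w - w_hat).max()))
```

The maximum round-trip error per weight is bounded by about half the group's scale, which is why narrow per-group ranges (group size 128 rather than per-tensor) keep 4-bit quantization accurate.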
## Usage

### vLLM (recommended)

```shell
vllm serve Sophia-AI/Qwen3-4B-Instruct-2507-AWQ-W4A16
```
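Once the server is up, vLLM exposes an OpenAI-compatible API (by default at `http://localhost:8000/v1`). A minimal request built with the standard library only is sketched below; the host, port, and endpoint path are vLLM defaults and are assumptions if you changed the server flags.

```python
import json
import urllib.request

# OpenAI-style chat completions payload; the model name must match
# the path passed to `vllm serve`.
payload = {
    "model": "Sophia-AI/Qwen3-4B-Instruct-2507-AWQ-W4A16",
    "messages": [{"role": "user", "content": "Ciao, come stai?"}],
    "max_tokens": 256,
}
body = json.dumps(payload).encode("utf-8")

# Uncomment once the vLLM server is running locally:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (for example the `openai` Python package pointed at the local base URL) works the same way.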
### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sophia-AI/Qwen3-4B-Instruct-2507-AWQ-W4A16"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Ciao, come stai?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Notes

- The calibration dataset used for quantization is proprietary and not publicly available. It was curated specifically for multilingual (Italian/English) chat, code generation, structured JSON output, and tool-calling use cases.
- Quantization was performed with the AWQ algorithm via llm-compressor, the official successor to AutoAWQ, maintained by the vLLM project.
- For the original unquantized model, see Qwen/Qwen3-4B-Instruct-2507.
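As a rough back-of-the-envelope check of what W4A16 buys in weight storage, the arithmetic below assumes about 4.0B quantized parameters; real checkpoints typically keep some tensors (e.g. embeddings) in higher precision, so the actual file size will differ somewhat.

```python
params = 4.0e9        # assumed parameter count, all 4-bit quantized (approximation)
group_size = 128      # from the quantization scheme above

weight_bytes = params * 4 / 8                # 4-bit packed weights
scale_bytes = (params / group_size) * 2      # one fp16 scale per group
zero_bytes = (params / group_size) * 0.5     # one packed 4-bit zero-point per group

total_gib = (weight_bytes + scale_bytes + zero_bytes) / 2**30
print(f"~{total_gib:.2f} GiB")  # versus roughly 7.5 GiB for the same weights in fp16
```

The per-group scales and zero-points add only a few percent of overhead on top of the packed 4-bit weights, so the end-to-end weight footprint is close to a quarter of fp16.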