# Qwen3-4B-Instruct-2507 — AWQ W4A16

AWQ 4-bit quantization of Qwen/Qwen3-4B-Instruct-2507, optimized for efficient inference with vLLM.
## Quantization Details

| Property | Value |
|---|---|
| Method | AWQ (Activation-Aware Weight Quantization) |
| Scheme | W4A16 asymmetric, group size 128 |
| Tool | llm-compressor |
| Calibration dataset | Proprietary (multilingual IT/EN, chat templates, code, JSON output, tool calling) |
| Calibration samples | 256 |
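To make the "W4A16 asymmetric, group size 128" scheme concrete, here is a schematic NumPy sketch of the storage format: each group of 128 weights shares one fp16 scale and one 4-bit zero-point, while activations stay in 16-bit. This illustrates only the quantization arithmetic, not the AWQ algorithm itself, which additionally searches for per-channel scales that protect salient weights; all function names here are illustrative.

```python
import numpy as np

def quantize_w4a16(weights: np.ndarray, group_size: int = 128):
    """Asymmetric 4-bit group-wise quantization (storage format sketch).

    Each group of `group_size` weights gets its own scale and zero-point,
    mapping the group's [min, max] range onto the 16 levels 0..15.
    """
    out_features, in_features = weights.shape
    w = weights.reshape(out_features, in_features // group_size, group_size)
    w_min = w.min(axis=-1, keepdims=True)
    w_max = w.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0                 # 4 bits -> 16 levels
    zero = np.round(-w_min / scale)                # asymmetric zero-point
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale.astype(np.float16), zero.astype(np.uint8)

def dequantize_w4a16(q, scale, zero):
    """Reconstruct fp16 weights; activations stay 16-bit (the 'A16' part)."""
    w = (q.astype(np.float32) - zero.astype(np.float32)) * scale.astype(np.float32)
    return w.reshape(q.shape[0], -1).astype(np.float16)

# Round-trip a random fp32 matrix through the 4-bit format.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 256)).astype(np.float32)
q, s, z = quantize_w4a16(w)
w_hat = dequantize_w4a16(q, s, z)
print("max reconstruction error:", float(np.abs(w - w_hat).max()))
```

The maximum round-trip error per weight is bounded by about half the group's scale, which is why narrow per-group ranges (group size 128 rather than per-tensor) keep 4-bit quantization accurate.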
## Usage

### vLLM (recommended)

```shell
vllm serve Sophia-AI/Qwen3-4B-Instruct-2507-AWQ-W4A16
```
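Once the server is up, vLLM exposes an OpenAI-compatible API (by default at `http://localhost:8000/v1`). A minimal request built with the standard library only is sketched below; the host, port, and endpoint path are vLLM defaults and are assumptions if you changed the server flags.

```python
import json
import urllib.request

# OpenAI-style chat completions payload; the model name must match
# the path passed to `vllm serve`.
payload = {
    "model": "Sophia-AI/Qwen3-4B-Instruct-2507-AWQ-W4A16",
    "messages": [{"role": "user", "content": "Ciao, come stai?"}],
    "max_tokens": 256,
}
body = json.dumps(payload).encode("utf-8")

# Uncomment once the vLLM server is running locally:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (for example the `openai` Python package pointed at the local base URL) works the same way.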
### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sophia-AI/Qwen3-4B-Instruct-2507-AWQ-W4A16"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Ciao, come stai?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Notes

- The calibration dataset used for quantization is proprietary and not publicly available. It was curated specifically for multilingual (Italian/English) chat, code generation, structured JSON output, and tool-calling use cases.
- Quantization was performed with the AWQ algorithm via llm-compressor, the official successor to AutoAWQ, maintained by the vLLM project.
- For the original unquantized model, see Qwen/Qwen3-4B-Instruct-2507.
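As a rough back-of-the-envelope check of what W4A16 buys in weight storage, the arithmetic below assumes about 4.0B quantized parameters; real checkpoints typically keep some tensors (e.g. embeddings) in higher precision, so the actual file size will differ somewhat.

```python
params = 4.0e9        # assumed parameter count, all 4-bit quantized (approximation)
group_size = 128      # from the quantization scheme above

weight_bytes = params * 4 / 8                # 4-bit packed weights
scale_bytes = (params / group_size) * 2      # one fp16 scale per group
zero_bytes = (params / group_size) * 0.5     # one packed 4-bit zero-point per group

total_gib = (weight_bytes + scale_bytes + zero_bytes) / 2**30
print(f"~{total_gib:.2f} GiB")  # versus roughly 7.5 GiB for the same weights in fp16
```

The per-group scales and zero-points add only a few percent of overhead on top of the packed 4-bit weights, so the end-to-end weight footprint is close to a quarter of fp16.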