# Qwen3-4B-Instruct-2507 — AWQ W4A16

AWQ 4-bit quantization of Qwen/Qwen3-4B-Instruct-2507, optimized for efficient inference with vLLM.

## Quantization Details

| Property | Value |
|---|---|
| Method | AWQ (Activation-Aware Weight Quantization) |
| Scheme | W4A16 asymmetric, group size 128 |
| Tool | llm-compressor |
| Calibration dataset | Proprietary (multilingual IT/EN, chat templates, code, JSON output, tool calling) |
| Calibration samples | 256 |
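For intuition on what "W4A16 asymmetric, group size 128" means: each group of 128 consecutive weights shares one FP16 scale and integer zero-point, and each weight is stored as a 4-bit integer in [0, 15]. The sketch below (not llm-compressor's actual implementation; group size shortened to 4 for readability) illustrates the round-trip:

```python
# Illustrative asymmetric 4-bit group quantization. The real model uses
# group size 128; a group of 4 is shown here so the numbers stay readable.
def quantize_group(weights):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1e-8          # 4-bit levels: 0..15
    zero = round(-lo / scale)               # asymmetric zero-point
    q = [max(0, min(15, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    return [scale * (qi - zero) for qi in q]

group = [-0.5, 0.1, 0.3, 0.7]
q, s, z = quantize_group(group)
deq = dequantize_group(q, s, z)
```

The reconstruction error per weight is bounded by half the group's scale, which is why smaller group sizes (at the cost of more metadata) give higher fidelity.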

## Usage

### vLLM (recommended)

```bash
vllm serve Sophia-AI/Qwen3-4B-Instruct-2507-AWQ-W4A16
```
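Once the server is running, it exposes an OpenAI-compatible API (by default on port 8000). A minimal sketch of a chat-completions request, using only the standard library (the `localhost:8000` address assumes the default serve settings):

```python
# Sketch of a request to vLLM's OpenAI-compatible endpoint; assumes the
# `vllm serve` command above is running with default host/port.
import json
from urllib import request

payload = {
    "model": "Sophia-AI/Qwen3-4B-Instruct-2507-AWQ-W4A16",
    "messages": [{"role": "user", "content": "Ciao, come stai?"}],  # Italian: "Hi, how are you?"
    "max_tokens": 256,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)  # uncomment with the server running
# print(json.load(resp)["choices"][0]["message"]["content"])
```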

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sophia-AI/Qwen3-4B-Instruct-2507-AWQ-W4A16"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Ciao, come stai?"}]  # Italian: "Hi, how are you?"
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Notes

- The calibration dataset used for quantization is proprietary and not publicly available. It was curated specifically for multilingual (Italian/English) chat, code generation, structured JSON output, and tool-calling use cases.
- Quantization was performed with the AWQ algorithm via llm-compressor, the official successor to AutoAWQ, maintained by the vLLM project.
- For the original unquantized model, refer to Qwen/Qwen3-4B-Instruct-2507.
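Since the calibration set covered tool calling, the quantized model is intended to be usable with OpenAI-style tool schemas through vLLM's chat-completions API (`tools` field; note that vLLM may need to be launched with its tool-calling options enabled). A hedged sketch with a hypothetical `get_weather` tool:

```python
# Hypothetical tool definition in the OpenAI-style function schema accepted
# by vLLM's chat-completions endpoint via the "tools" field.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "Sophia-AI/Qwen3-4B-Instruct-2507-AWQ-W4A16",
    "messages": [{"role": "user", "content": "Che tempo fa a Roma?"}],  # Italian: "What's the weather in Rome?"
    "tools": tools,
}
body = json.dumps(payload)  # ready to POST to /v1/chat/completions
```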

## Quantized by

Sophia-AI
