W8A16-INT Qwen/Qwen3-8B model

  • Developed by: namgyu-youn
  • License: apache-2.0
  • Quantized from Model: Qwen/Qwen3-8B
  • Quantization Method: W8A16-INT
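
For readers unfamiliar with the label, W8A16 conventionally means that weights are stored as 8-bit integers while activations stay in 16-bit floating point, which roughly halves weight storage relative to 16-bit weights. The sketch below illustrates one common variant, symmetric per-row weight quantization, in plain Python. This card does not state which scheme, granularity, or tool produced the checkpoint, so the function names (`quantize_row`, `dequantize_row`) and the per-row choice are illustrative assumptions, not a description of how this model was built.

```python
from typing import List, Tuple

INT8_MAX = 127  # symmetric range [-127, 127]; an assumption for this sketch


def quantize_row(row: List[float]) -> Tuple[List[int], float]:
    """Quantize one weight row to int8 values sharing a single scale."""
    max_abs = max((abs(w) for w in row), default=0.0)
    # An all-zero row gets scale 1.0 to avoid dividing by zero.
    scale = max_abs / INT8_MAX if max_abs > 0 else 1.0
    q = [max(-INT8_MAX, min(INT8_MAX, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights; at inference this feeds 16-bit math."""
    return [v * scale for v in q]


# Hypothetical toy weights, not taken from the model.
weights = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.04]]
for row in weights:
    q, scale = quantize_row(row)
    restored = dequantize_row(q, scale)
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(q, scale, worst)
```

With round-to-nearest, the reconstruction error of each weight is bounded by about half of that row's scale, which is why rows with small magnitudes benefit from having their own scale.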

Model Performance

A. Accuracy (lm-eval, MMLU)

Original Model

# MMLU accuracy command (the mmlu task reports accuracy, not perplexity)
lm_eval --model hf --model_args pretrained=Qwen/Qwen3-8B --tasks mmlu --device cuda:0 --batch_size 8 --limit 100

Quantized Model

# MMLU accuracy command (the mmlu task reports accuracy, not perplexity)
lm_eval --model hf --model_args pretrained=namgyu-youn/Qwen3-8B-W8A16-INT --tasks mmlu --device cuda:0 --batch_size 8 --limit 100

Summary

Benchmark | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A16-INT
mmlu | - | -
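
Once both runs have produced an accuracy, the two numbers can be compared with a small helper such as the sketch below. It reports the quantized accuracy as a percentage of the baseline and applies a rough binomial noise check, which matters because `--limit` evaluates only a subset of the questions. The function names are hypothetical, the two runs are treated as independent samples (a simplification), and the values in the example call are placeholders, not measurements from this card.

```python
import math


def accuracy_recovery(base_acc: float, quant_acc: float) -> float:
    """Quantized accuracy expressed as a percentage of the baseline accuracy."""
    if base_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * quant_acc / base_acc


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


def difference_is_within_noise(base_acc: float, quant_acc: float,
                               n: int, z: float = 1.96) -> bool:
    """True if the accuracy gap is smaller than z combined standard errors."""
    se = math.sqrt(binomial_stderr(base_acc, n) ** 2
                   + binomial_stderr(quant_acc, n) ** 2)
    return abs(base_acc - quant_acc) <= z * se


# Placeholder accuracies and sample size for illustration only.
print(accuracy_recovery(0.50, 0.49))
print(difference_is_within_noise(0.50, 0.49, n=100))
```

If the gap falls within noise at the chosen sample size, rerunning without `--limit` (or with a larger limit) gives a more trustworthy comparison.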

B. Throughput (vLLM)

Original Model

vllm bench throughput --model Qwen/Qwen3-8B --input-len 256 --output-len 256 --num-prompts 100

Quantized Model

vllm bench throughput --model namgyu-youn/Qwen3-8B-W8A16-INT --input-len 256 --output-len 256 --num-prompts 100

Summary

Benchmark | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A16-INT
Throughput (tok/s) | - | -

C. Latency (vLLM)

Original Model

vllm bench latency --model Qwen/Qwen3-8B --input-len 256 --output-len 256 --batch-size 1

Quantized Model

vllm bench latency --model namgyu-youn/Qwen3-8B-W8A16-INT --input-len 256 --output-len 256 --batch-size 1

Summary

Benchmark | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A16-INT
Latency (ms) | - | -
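
To fill in the throughput and latency summaries consistently, the measured values can be reduced to a speedup ratio and a latency reduction, as in the sketch below. The helper names are hypothetical. The seconds-to-milliseconds conversion assumes the latency tool reports seconds, which should be checked against the output of the installed vLLM version. The example values are placeholders, not results.

```python
def speedup(baseline_tok_s: float, quantized_tok_s: float) -> float:
    """Throughput ratio; values above 1.0 mean the quantized model is faster."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return quantized_tok_s / baseline_tok_s


def seconds_to_ms(seconds: float) -> float:
    """Convert a latency in seconds to the milliseconds used in the table."""
    return seconds * 1000.0


def latency_reduction_pct(baseline_ms: float, quantized_ms: float) -> float:
    """Percentage by which latency dropped; negative means it got slower."""
    if baseline_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return 100.0 * (baseline_ms - quantized_ms) / baseline_ms


# Placeholder numbers for illustration only.
print(speedup(100.0, 150.0))
print(latency_reduction_pct(seconds_to_ms(0.2), seconds_to_ms(0.15)))
```

Both benchmarks should be run on the same GPU, with the same vLLM version and the same flags, so that the ratios reflect the quantization rather than the environment.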

Resources
