Libraries

How to use JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4")
model = AutoModelForCausalLM.from_pretrained("JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
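
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and the default base URL assumes the vLLM server started above is listening on port 8000.

```python
import json
import urllib.request

MODEL_ID = "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses carry the reply in choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Once the server is up, `ask("What is the capital of France?")` should return the reply as a string. Passing `base_url="http://localhost:30000"` targets a server on a different port.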

Use Docker

docker model run hf.co/JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4

SGLang

How to use JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Qwen2.5-7B-Instruct-MXFP4-W4A4

Model Description

This is an MXFP4 (Microscaling FP4) quantized version of Qwen/Qwen2.5-7B-Instruct using the compressed-tensors quantization method.

Base Model: Qwen/Qwen2.5-7B-Instruct
Quantization Method: compressed-tensors
Quantization Type: MXFP4 W4A4 (4-bit weights and activations)
Format: mxfp4-pack-quantized (MX Microscaling FP4)
Model Size: ~5.3 GB (compared to ~15 GB for BF16)
Compression Ratio: ~2.8x

Quantization Configuration

This model applies MXFP4 (Microscaling FP4) block-scaled quantization with a group size of 32 to both weights and activations. Each group of 32 elements shares a single E8M0 (8-bit, exponent-only) scale, following the OCP MX specification.

Weights

Precision: FP4 E2M1 (4-bit floating point)
Scale Format: E8M0 (uint8 exponent)
Strategy: Group (block-scaled)
Group Size: 32
Symmetric: Yes
Dynamic: No (static quantization with calibration)

Activations

Precision: FP4 E2M1 (4-bit floating point)
Scale Format: E8M0 (uint8 exponent)
Strategy: Group (block-scaled)
Group Size: 32
Symmetric: Yes
Dynamic: Yes (dynamic quantization at inference time)
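
To make the block-scaled scheme concrete, here is a minimal pure-Python sketch of quantizing one group of 32 values to FP4 E2M1 with a shared E8M0 scale. It follows the shared-exponent rule from the OCP MX specification, but it is an illustration only and not the kernel or the compressed-tensors code used to produce this checkpoint. The function names are hypothetical, and ties are rounded toward the smaller magnitude for simplicity.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2      # exponent of the largest E2M1 binade (4.0 = 2**2)
E8M0_BIAS = 127    # E8M0 stores a biased power-of-two exponent
BLOCK = 32         # group size used by MXFP4


def quantize_block(values):
    """Return (E8M0 scale byte, list of E2M1 values) for one block of 32 floats."""
    assert len(values) == BLOCK
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return E8M0_BIAS, [0.0] * BLOCK
    # floor(log2(amax)) computed exactly via frexp, then shifted by the
    # element format's largest exponent (OCP MX shared-exponent rule).
    shared_exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        quantized.append(math.copysign(nearest, v))
    return shared_exp + E8M0_BIAS, quantized


def dequantize_block(scale_byte, quantized):
    """Reconstruct approximate floats from the scale byte and E2M1 values."""
    scale = 2.0 ** (scale_byte - E8M0_BIAS)
    return [scale * q for q in quantized]
```

For weights, this step would run once at quantization time and the scale bytes are stored with the checkpoint. For activations, the same computation is performed on the fly for every block during inference, which is what "Dynamic: Yes" refers to.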

Other Details

KV Cache: Not quantized (remains in BF16)
Ignored Layers: lm_head
Target Layers: All Linear layers
Calibration: 512 samples from CNN/DailyMail, max_seq_length=2048
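
Because the KV cache stays in BF16, its memory cost is unchanged from the base model. The sketch below estimates that cost from the figures in the Model Architecture section. It assumes a head dimension of 128 (hidden size divided by the number of attention heads) and 2 bytes per BF16 value; `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(tokens, layers=28, kv_heads=4, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in bytes for a single sequence of `tokens` tokens."""
    # The factor of 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


per_token = kv_cache_bytes(1)         # 57344 bytes (56 KiB) per token
full_context = kv_cache_bytes(32768)  # 1879048192 bytes (1.75 GiB)
```

Under these assumptions, one sequence at the full 32768-token context needs about 1.75 GiB of cache on top of the quantized weights.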

Hardware Requirements

MXFP4 inference requires NVIDIA Blackwell (SM120+) GPUs with CUDA 12.8+ for native CUTLASS MXFP4 GEMM support.
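
A quick way to check a machine against this requirement is to compare the GPU's compute capability with the stated minimum. This is a rough sketch: the `(12, 0)` threshold is taken directly from the "SM120+" figure above, and the helper names are hypothetical. Data-center Blackwell parts report compute capability 10.x, so the threshold may need adjusting depending on which kernels your inference build ships.

```python
REQUIRED_CAPABILITY = (12, 0)  # "SM120+" as stated in this card


def meets_requirement(capability, required=REQUIRED_CAPABILITY):
    """Compare a (major, minor) compute capability against the requirement."""
    return tuple(capability) >= tuple(required)


def current_gpu_capability():
    """Return (major, minor) for GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

A typical use is `cap = current_gpu_capability()` followed by `cap is not None and meets_requirement(cap)`.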

Usage with vLLM

from vllm import LLM, SamplingParams

model_id = "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4"

llm = LLM(model=model_id, max_model_len=4096, enforce_eager=True)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=64, temperature=0)
)

for output in outputs:
    print(output.outputs[0].text)

Usage with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

messages = [
    {"role": "user", "content": "What is machine learning?"}
]

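# return_dict=True also returns the attention mask alongside input_ids
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(response)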

Model Architecture

Architecture: Qwen2ForCausalLM
Hidden Size: 3584
Intermediate Size: 18944
Number of Layers: 28
Number of Attention Heads: 28
Number of KV Heads: 4 (GQA)
Vocabulary Size: 152064
Max Position Embeddings: 32768
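
The dimensions above are enough for a back-of-the-envelope estimate of the checkpoint size. The sketch below counts only the Linear weights (quantized at 4 bits plus one 8-bit scale per 32 elements) and the embedding and `lm_head` matrices (kept in BF16). It assumes untied input and output embeddings and ignores biases and norm weights, so treat the result as approximate. The function names are hypothetical.

```python
def linear_weight_count(hidden, intermediate, layers, heads, kv_heads):
    """Count weights in the attention and MLP Linear layers of all blocks."""
    head_dim = hidden // heads
    kv_dim = head_dim * kv_heads
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q, o and k, v
    mlp = 3 * hidden * intermediate                        # gate, up, down
    return layers * (attention + mlp)


def estimate_bytes(hidden=3584, intermediate=18944, layers=28, heads=28,
                   kv_heads=4, vocab=152064, group_size=32):
    """Return (bf16_bytes, mxfp4_bytes) for the whole model."""
    linear = linear_weight_count(hidden, intermediate, layers, heads, kv_heads)
    embed_and_head = 2 * vocab * hidden        # assumption: untied embeddings
    bf16_total = 2 * (linear + embed_and_head)
    bits_per_weight = 4 + 8 / group_size       # FP4 value + shared E8M0 scale
    mxfp4_total = linear * bits_per_weight / 8 + 2 * embed_and_head
    return bf16_total, mxfp4_total


bf16_bytes, mxfp4_bytes = estimate_bytes()
print(f"BF16:  {bf16_bytes / 1e9:.2f} GB")
print(f"MXFP4: {mxfp4_bytes / 1e9:.2f} GB ({mxfp4_bytes / 2**30:.2f} GiB)")
print(f"Ratio: {bf16_bytes / mxfp4_bytes:.2f}x")
```

Under these assumptions the estimate comes to roughly 15.2 GB for BF16 and 5.6 GB (about 5.3 GiB) for the quantized model, which is in line with the sizes quoted in the Model Description.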

Differences from NVFP4

| Feature | MXFP4 | NVFP4 |
|---|---|---|
| Scale Format | E8M0 (uint8 exponent) | E4M3 + FP32 global scale |
| Group Size | 32 | 16 |
| Standard | OCP MX Specification | NVIDIA proprietary |
| Hardware | SM120+ (Blackwell) | SM89+ (Ada/Hopper/Blackwell) |
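
One practical consequence of the different group sizes is the storage overhead of the block scales. The short sketch below works out the effective bits per element for each format, assuming an 8-bit scale per group in both cases and ignoring NVFP4's single FP32 global scale per tensor, which is negligible for large tensors. `bits_per_element` is a hypothetical helper.

```python
def bits_per_element(element_bits, scale_bits, group_size):
    """Effective storage per element once the shared block scale is amortized."""
    return element_bits + scale_bits / group_size


mxfp4_bits = bits_per_element(4, 8, 32)  # 4.25
nvfp4_bits = bits_per_element(4, 8, 16)  # 4.5
```

MXFP4 is therefore slightly more compact, while NVFP4 spends the extra bits on finer-grained scales that are not restricted to powers of two.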

Intended Use

This quantized model is intended for efficient inference with a significantly reduced memory footprint. It is suitable for:

Deployment on NVIDIA Blackwell GPUs
Memory-constrained serving environments
High-throughput inference scenarios

Limitations

Requires NVIDIA Blackwell (SM120+) GPUs for native MXFP4 GEMM support
FP4 quantization may result in some accuracy degradation compared to FP8 or BF16
KV cache remains in BF16 (not quantized)

License

Same as the base model: Apache 2.0
