---
language:
- en
- zh
license: apache-2.0
library_name: transformers
tags:
- quantization
- fp4
- mxfp4
- compressed-tensors
- qwen2
- text-generation
- 4bit
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
model-index:
- name: Qwen2.5-7B-Instruct-MXFP4-W4A4
  results: []
quantization:
  quant_method: compressed-tensors
  bits: 4
  type: float
  format: mxfp4-pack-quantized
  strategy: group
  group_size: 32
  symmetric: true
---

# Qwen2.5-7B-Instruct-MXFP4-W4A4

## Model Description

This is an **MXFP4 (Microscaling FP4)** quantized version of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) using the compressed-tensors quantization method.

- **Base Model**: Qwen/Qwen2.5-7B-Instruct
- **Quantization Method**: compressed-tensors
- **Quantization Type**: MXFP4 W4A4 (4-bit weights and activations)
- **Format**: mxfp4-pack-quantized (MX Microscaling FP4)
- **Model Size**: ~5.3 GB (compared to ~15 GB for BF16)
- **Compression Ratio**: ~2.8x

## Quantization Configuration

This model uses **MXFP4 (Microscaling FP4) quantization** with block-scaled quantization (group size 32) for both weights and activations. MXFP4 uses E8M0 (8-bit, exponent-only) block scales shared across groups of 32 elements, following the [OCP MX specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).
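
To make the block-scaling concrete, here is a minimal sketch of how one group of 32 values maps to MXFP4: an E8M0 shared power-of-two scale plus FP4 E2M1 elements. This is illustrative reference code written for this card (the function name and the nearest-value rounding are simplifications), not the kernel or library implementation used to produce the checkpoint.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_group(values: np.ndarray):
    """Quantize one group of 32 floats to an MXFP4 block and dequantize it back.

    Returns (e8m0_byte, dequantized), where e8m0_byte encodes the shared
    power-of-two scale as a biased uint8 exponent.
    """
    assert values.size == 32, "MXFP4 shares one scale per group of 32 elements"
    max_abs = float(np.abs(values).max())
    # Shared exponent: chosen so the largest element lands near the top of
    # E2M1's range (its largest exponent is 2, i.e. magnitudes up to 6.0).
    shared_exp = 0 if max_abs == 0.0 else int(np.floor(np.log2(max_abs))) - 2
    shared_exp = int(np.clip(shared_exp, -127, 127))
    scale = 2.0 ** shared_exp
    e8m0_byte = np.uint8(shared_exp + 127)  # E8M0: 8-bit exponent, bias 127

    # Round each scaled element to the nearest signed E2M1 value (saturates at +/-6).
    scaled = values / scale
    signs = np.where(scaled < 0, -1.0, 1.0)
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    dequantized = signs * E2M1_GRID[idx] * scale
    return e8m0_byte, dequantized

# Example: quantize one random group and look at the reconstruction error.
rng = np.random.default_rng(0)
group = rng.normal(size=32).astype(np.float32)
byte, deq = quantize_mxfp4_group(group)
print("E8M0 scale byte:", int(byte), "max abs error:", float(np.abs(group - deq).max()))
```
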

### Weights
- **Precision**: FP4 E2M1 (4-bit floating point)
- **Scale Format**: E8M0 (uint8 exponent)
- **Strategy**: Group (block-scaled)
- **Group Size**: 32
- **Symmetric**: Yes
- **Dynamic**: No (static quantization with calibration)

### Activations
- **Precision**: FP4 E2M1 (4-bit floating point)
- **Scale Format**: E8M0 (uint8 exponent)
- **Strategy**: Group (block-scaled)
- **Group Size**: 32
- **Symmetric**: Yes
- **Dynamic**: Yes (dynamic quantization at inference time)

### Other Details
- **KV Cache**: Not quantized (remains in BF16)
- **Ignored Layers**: lm_head
- **Target Layers**: All Linear layers
- **Calibration**: 512 samples from CNN/DailyMail, max_seq_length=2048 (see the recipe sketch below)
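
For reference, a configuration like the one above is typically expressed as an llm-compressor one-shot recipe. The sketch below is an illustration under assumptions, not the exact script used to produce this checkpoint; in particular, the `MXFP4` scheme name and the `cnn_dailymail` dataset alias are assumed and may differ across llm-compressor / compressed-tensors versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# MXFP4 W4A4: 4-bit float weights (static, calibrated) and activations (dynamic),
# group size 32, with lm_head kept in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="MXFP4",           # assumed preset name; check your llm-compressor version
    ignore=["lm_head"],
)

oneshot(
    model=model,
    dataset="cnn_dailymail",  # assumed dataset alias; 512 calibration samples
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Qwen2.5-7B-Instruct-MXFP4-W4A4", save_compressed=True)
tokenizer.save_pretrained("Qwen2.5-7B-Instruct-MXFP4-W4A4")
```
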

## Hardware Requirements

MXFP4 inference requires **NVIDIA Blackwell (SM120+)** GPUs with CUDA 12.8+ for native CUTLASS MXFP4 GEMM support.
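
A quick way to check whether the current GPU meets this requirement (a small sketch; SM120 corresponds to CUDA compute capability 12.0):

```python
import torch

# Compute capability of the current CUDA device, e.g. (12, 0) for SM120.
major, minor = torch.cuda.get_device_capability()
if (major, minor) < (12, 0):
    print(f"Compute capability {major}.{minor} detected; "
          "native MXFP4 GEMM needs SM120+ (Blackwell).")
```
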

## Usage with vLLM

```python
from vllm import LLM, SamplingParams

model_id = "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4"

# enforce_eager=True disables CUDA graph capture; remove it for higher throughput
# once you have verified the model runs on your setup.
llm = LLM(model=model_id, max_model_len=4096, enforce_eager=True)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=64, temperature=0),
)

for output in outputs:
    print(output.outputs[0].text)
```
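
The checkpoint can also be served with vLLM's OpenAI-compatible server (for example `vllm serve JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4 --max-model-len 4096`) and queried from any OpenAI client. A minimal client-side sketch, assuming the server is running locally on the default port 8000:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (no real API key needed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
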

## Usage with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "What is machine learning?"}
]

# Build the chat-formatted prompt and move it to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens (everything after the prompt).
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

## Model Architecture

- **Architecture**: Qwen2ForCausalLM
- **Hidden Size**: 3584
- **Intermediate Size**: 18944
- **Number of Layers**: 28
- **Number of Attention Heads**: 28
- **Number of KV Heads**: 4 (GQA)
- **Vocabulary Size**: 152064
- **Max Position Embeddings**: 32768

## Differences from NVFP4

| Feature | MXFP4 | NVFP4 |
|---------|-------|-------|
| Scale Format | E8M0 (uint8 exponent) | E4M3 + FP32 global scale |
| Group Size | 32 | 16 |
| Standard | OCP MX Specification | NVIDIA proprietary |
| Hardware | SM120+ (Blackwell) | SM89+ (Ada/Hopper/Blackwell) |
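
One practical consequence of the different group sizes is storage overhead: each group carries one 8-bit scale, so MXFP4 spends slightly fewer bits per quantized element than NVFP4 (which additionally keeps a single FP32 per-tensor scale). A quick back-of-the-envelope comparison:

```python
# Effective storage bits per element = element bits + amortized scale bits per group.
mxfp4_bits = 4 + 8 / 32   # 4.25 bits/element (E8M0 scale shared by 32 elements)
nvfp4_bits = 4 + 8 / 16   # 4.50 bits/element (E4M3 scale shared by 16 elements)
print(mxfp4_bits, nvfp4_bits)
```
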

## Intended Use

This quantized model is intended for efficient inference with a significantly reduced memory footprint. It is suitable for:

- Deployment on NVIDIA Blackwell GPUs
- Memory-constrained serving environments
- High-throughput inference scenarios

## Limitations

- Requires NVIDIA Blackwell (SM120+) GPUs for native MXFP4 GEMM support
- FP4 quantization may result in some accuracy degradation compared to FP8 or BF16
- KV cache remains in BF16 (not quantized)

## License

Same as the base model: [Apache 2.0](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)