---
language:
- en
- zh
license: apache-2.0
library_name: transformers
tags:
- quantization
- fp4
- mxfp4
- compressed-tensors
- qwen2
- text-generation
- 4bit
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
model-index:
- name: Qwen2.5-7B-Instruct-MXFP4-W4A4
  results: []
quantization:
  quant_method: compressed-tensors
  bits: 4
  type: float
  format: mxfp4-pack-quantized
  strategy: group
  group_size: 32
  symmetric: true
---

# Qwen2.5-7B-Instruct-MXFP4-W4A4

## Model Description

This is an **MXFP4 (Microscaling FP4)** quantized version of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), produced with the compressed-tensors quantization method.

- **Base Model**: Qwen/Qwen2.5-7B-Instruct
- **Quantization Method**: compressed-tensors
- **Quantization Type**: MXFP4 W4A4 (4-bit weights and activations)
- **Format**: mxfp4-pack-quantized (MX Microscaling FP4)
- **Model Size**: ~5.3 GB (vs. ~15 GB for BF16)
- **Compression Ratio**: ~2.8x

## Quantization Configuration

This model uses **MXFP4 (Microscaling FP4)**, a block-scaled quantization scheme, for both weights and activations. Each group of 32 elements shares a single E8M0 (8-bit, exponent-only) block scale, following the [OCP MX specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).

### Weights

- **Precision**: FP4 E2M1 (4-bit floating point)
- **Scale Format**: E8M0 (uint8 exponent)
- **Strategy**: Group (block-scaled)
- **Group Size**: 32
- **Symmetric**: Yes
- **Dynamic**: No (static quantization with calibration)

### Activations

- **Precision**: FP4 E2M1 (4-bit floating point)
- **Scale Format**: E8M0 (uint8 exponent)
- **Strategy**: Group (block-scaled)
- **Group Size**: 32
- **Symmetric**: Yes
- **Dynamic**: Yes (quantized dynamically at inference time)

### Other Details

- **KV Cache**: Not quantized (remains in BF16)
- **Ignored Layers**: lm_head
- **Target Layers**: All Linear layers
- **Calibration**: 512 samples from CNN/DailyMail, max_seq_length=2048

## Hardware Requirements

MXFP4 inference requires **NVIDIA Blackwell (SM120+)** GPUs with CUDA 12.8+ for native CUTLASS MXFP4 GEMM support.
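To make the block-scaled scheme from the Quantization Configuration section concrete, here is a minimal numerical sketch of how one 32-element group is quantized: a power-of-two E8M0 scale is derived from the group's largest magnitude, and each scaled element is snapped to an FP4 E2M1 value. This is an illustration only, not the compressed-tensors implementation; the function name is hypothetical, and the nearest-grid-point rounding is a simplification of the spec's round-to-nearest-even.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group_mxfp4(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one group of 32 values to E2M1 with a shared E8M0 scale."""
    assert x.size == 32
    # E8M0 scale: a pure power of two chosen so the largest element lands
    # near E2M1's maximum magnitude (6 = 1.5 * 2^2, so emax = 2).
    max_abs = np.abs(x).max()
    exp = int(np.floor(np.log2(max_abs))) - 2 if max_abs > 0 else 0
    scale = 2.0 ** exp  # stored as the uint8 value exp + 127 in E8M0
    # Saturate to the representable range, then snap each element to the
    # nearest E2M1 grid point (sign handled separately).
    scaled = np.clip(x / scale, -6.0, 6.0)
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, scale  # dequantization is simply q * scale

rng = np.random.default_rng(0)
group = rng.normal(size=32).astype(np.float32)
q, scale = quantize_group_mxfp4(group)
print("scale:", scale, "max abs error:", np.abs(group - q * scale).max())
```

These two pieces are what the packed checkpoint serializes per group: the E8M0 scale as a single uint8 exponent, and the E2M1 element codes packed at four bits each.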
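Before running the usage examples below, it may be worth checking that the visible GPU meets this requirement. A minimal sketch, assuming PyTorch is installed (SM120 corresponds to CUDA compute capability `(12, 0)`):

```python
import torch

if torch.cuda.is_available():
    cc = torch.cuda.get_device_capability()  # e.g. (12, 0) for SM120
    name = torch.cuda.get_device_name()
    if cc >= (12, 0):
        print(f"{name} (SM{cc[0]}{cc[1]}): native MXFP4 GEMM should be available.")
    else:
        print(f"{name} (SM{cc[0]}{cc[1]}): below SM120; native MXFP4 kernels are not available.")
else:
    print("No CUDA device detected.")
```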
## Usage with vLLM

```python
from vllm import LLM, SamplingParams

model_id = "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4"

# enforce_eager disables CUDA graph capture; max_model_len caps the context window.
llm = LLM(model=model_id, max_model_len=4096, enforce_eager=True)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=64, temperature=0),
)
for output in outputs:
    print(output.outputs[0].text)
```

## Usage with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Qwen2.5-7B-Instruct-MXFP4-W4A4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "What is machine learning?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

## Model Architecture

- **Architecture**: Qwen2ForCausalLM
- **Hidden Size**: 3584
- **Intermediate Size**: 18944
- **Number of Layers**: 28
- **Number of Attention Heads**: 28
- **Number of KV Heads**: 4 (GQA)
- **Vocabulary Size**: 152064
- **Max Position Embeddings**: 32768

## Differences from NVFP4

| Feature | MXFP4 | NVFP4 |
|---------|-------|-------|
| Scale Format | E8M0 (uint8 exponent) | E4M3 + FP32 global scale |
| Group Size | 32 | 16 |
| Standard | OCP MX specification | NVIDIA proprietary |
| Hardware | SM120+ (Blackwell) | SM89+ (Ada/Hopper/Blackwell) |

## Intended Use

This quantized model is intended for efficient inference with a significantly reduced memory footprint. It is suitable for:

- Deployment on NVIDIA Blackwell GPUs
- Memory-constrained serving environments
- High-throughput inference scenarios

## Limitations

- Requires NVIDIA Blackwell (SM120+) GPUs for native MXFP4 GEMM support
- FP4 quantization may cause some accuracy degradation relative to FP8 or BF16
- The KV cache remains in BF16 (not quantized)

## License

Same as the base model: [Apache 2.0](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)