Model Overview

  • Model Architecture: gpt-oss-120b
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI350/MI355
  • ROCm: 7.2.0
  • PyTorch: 2.9.0
  • Operating System(s): Linux
  • Inference Engine: vLLM
  • Model Optimizer: AMD-Quark (v0.11)
    • moe
      • Weight quantization: OCP MXFP4, Static
      • Activation quantization: OCP FP8, Dynamic
    • attention (linear)
      • Weight quantization: OCP FP8 per_channel, Static
      • Activation quantization: OCP FP8 per_token, Dynamic
    • KV cache quantization: OCP FP8, Static
    • FP8 attention: OCP FP8, Static
  • Calibration Dataset: Pile

This model was built with gpt-oss-120b model by applying AMD-Quark (v0.11) for mixed MXFP4-FP8 quantization.

Model Quantization

The model was quantized from openai/gpt-oss-120b using AMD-Quark (v0.11). The weights are quantized MXFP4 and activations were quantized to FP8.

Quantization scripts:

cd Quark/examples/torch/language_modeling/llm_ptq/
exclude_layers="*lm_head *router*"

python3 quantize_quark.py \
    --model_dir openai/gpt-oss-120b \
    --quant_scheme mxfp4_fp8 \
    --layer_quant_scheme *q_proj ptpc_fp8 \
    --layer_quant_scheme *k_proj ptpc_fp8 \
    --layer_quant_scheme *v_proj ptpc_fp8 \
    --layer_quant_scheme *o_proj ptpc_fp8 \
    --kv_cache_dtype fp8 \
    --attention_dtype fp8 \
    --exclude_layers $exclude_layers \
    --num_calib_data 512 \
    --output_dir amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --model_export hf_format \
    --multi_gpu

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.

Evaluation

The model was evaluated on AIME25 and GPQA Diamond benchmarks with medium reasoning effort.

Accuracy

Benchmark gpt-oss-120b gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn(this model) Recovery
GPQA 71.21 71.16 99.93%
AIME25 78.61 77.08 98.06%

Reproduction

The results for GPQA Diamond and AIME25 were obtained using vLLM Docker image rocm/vllm-private:mxfp4_fp8_gpt_oss_native_20251226, together with gpt_oss.evals configured at the medium-effort setting. Please use the built vLLM and AITER included in the Docker image to ensure reproducibility and to avoid potential compatibility issues.

Launching server

export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_AITER_FUSED_MOE_A16W4=0
export USE_Q_SCALE=1

vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
  --tensor_parallel_size 2 \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching \
  --max-num-batched-tokens 1024 \
  --kv_cache_dtype='fp8'

Evaluating model in a new terminal

export OPENAI_API_KEY="EMPTY"

python -m gpt_oss.evals --model amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn --eval gpqa,aime25 --reasoning-effort medium --n-threads 128

License

Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for jiaxwang/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn

Finetuned
(98)
this model