Model Overview

Model Architecture: gpt-oss-120b
- Input: Text
- Output: Text
Supported Hardware Microarchitecture: AMD MI350/MI355
ROCm: 7.2.0
PyTorch: 2.9.0
Operating System(s): Linux
Inference Engine: vLLM
Model Optimizer: AMD-Quark (v0.11)
- moe
  - Weight quantization: OCP MXFP4, Static
  - Activation quantization: OCP MXFP6, Dynamic
- attention (linear)
  - Weight quantization: OCP MXFP4, Static + GPTQ
  - Activation quantization: OCP MXFP6, Dynamic + GPTQ
- KV cache quantization: OCP FP8, Static
- FP8 attention: OCP FP8, Static
Calibration Dataset: WikiText-2

This model was built with gpt-oss-120b model by applying AMD-Quark (v0.11) for mixed MXFP4-MXFP6 quantization.

Model Quantization

The model was quantized from openai/gpt-oss-120b using AMD-Quark (v0.11). The weights are quantized MXFP4 and activations were quantized to MXFP6.

Quantization scripts:

cd Quark/examples/torch/language_modeling/llm_ptq/

cat > gptq_config.json << 'EOF'
{
    "name": "gptq",
    "inside_layer_modules": [
        "self_attn.k_proj",
        "self_attn.v_proj",
        "self_attn.q_proj",
        "self_attn.o_proj"
    ],
    "model_decoder_layers": "model.layers"
}
EOF

exclude_layers="*lm_head *router*"

python3 quantize_quark.py \
    --model_dir openai/gpt-oss-120b \
    --quant_scheme mxfp4_mxfp6_e2m3 \
    --kv_cache_dtype fp8 \
    --attention_dtype fp8 \
    --quant_algo gptq \
    --quant_algo_config_file gptq gptq_config.json \
    --exclude_layers $exclude_layers \
    --num_calib_data 128 \
    --seq_len 2048 \
    --dataset wikitext_for_gptq_benchmark \
    --output_dir amd/gpt-oss-120b-w-mxfp4-a-mxfp6-kv-fp8-fp8attn-gptq \
    --model_export hf_format \
    --multi_gpu

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.

Evaluation

The model was evaluated on AIME25 and GPQA Diamond benchmarks with medium reasoning effort.

Accuracy

Benchmark	gpt-oss-120b	gpt-oss-120b-w-mxfp4-a-mxfp6-kv-fp8-fp8attn-gptq(this model)	Recovery
GPQA	71.82	71.59	99.68%
AIME25	79.31	76.94	97.02%

Reproduction

The results for GPQA Diamond and AIME25 were obtained using vLLM Docker image rocm/vllm-private:vllm_dev_mxfp4_mxfp6_gpt_oss_emulation_20251225, together with gpt_oss.evals configured at the medium-effort setting. Please use the built vLLM and AITER included in the Docker image to ensure reproducibility and to avoid potential compatibility issues.

Launching server

vllm serve amd/gpt-oss-120b-w-mxfp4-a-mxfp6-kv-fp8-fp8attn-gptq \
  --tensor_parallel_size 2 \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching \
  --max-num-batched-tokens 1024 \
  --kv_cache_dtype='fp8'

Evaluating model in a new terminal

export OPENAI_API_KEY="EMPTY"

python -m gpt_oss.evals --model amd/gpt-oss-120b-w-mxfp4-a-mxfp6-kv-fp8-fp8attn-gptq --eval gpqa,aime25 --reasoning-effort medium --n-threads 128

License

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for jiaxwang/gpt-oss-120b-w-mxfp4-a-mxfp6-kv-fp8-fp8attn-gptq

Base model

openai/gpt-oss-120b

Finetuned

(98)

this model