Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A4 (experts + o_proj, MSE)

ModelOpt NVFP4 W4A4 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct.

Weights and activations are quantized to FP4 for the MoE expert MLP path and the attention output projection (o_proj). Attention QKV stays BF16; embeddings, norms, MoE router, multimodal encoders, talker, and code2wav all remain BF16.

Hardware requirement NVIDIA Blackwell (sm_100+) for native FP4 GEMM
Exported safetensors NaN bytes 0 (calibrated with the ModelOpt-side fix; see Mitigations below)

Recommendation: for production W4A4 with the best B200 throughput, prefer YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip. This experts+o_proj variant is the intermediate W4A4 scope (between experts-only and full-thinker) and is useful as a comparison point.

Accuracy

Benchmarked on RTX PRO 6000 Blackwell WS via the canonical vllm-omni in-tree harness.

Benchmark BF16 baseline This checkpoint Δ
Daily-Omni overall (n=50) 0.72 0.72 0.00

Daily-Omni breakdown

Task type n This ckpt BF16
Reasoning 11 1.000 1.000
Inference 4 1.000 1.000
AV Event Alignment 3 1.000 1.000
Comparative 9 0.667 0.667
Context understanding 10 0.600 0.500
Event Sequence 13 0.462 0.538

Matches BF16 exactly on the headline metric and on reasoning/inference/AV-alignment categories; slight reshuffling between Context understanding (+10pp) and Event Sequence (-7.6pp).

Quantization scope

Layer State
thinker.model.*.mlp.experts.* (gate_up_proj, down_proj) NVFP4 W4A4
thinker.model.*.self_attn.o_proj NVFP4 W4A4
thinker.model.*.self_attn.{q,k,v}_proj (fused qkv_proj at runtime) BF16
thinker.model.* embeddings, norms BF16
thinker.model.*.mlp.gate (MoE router) BF16
thinker.audio_tower.*, thinker.visual.*, thinker.lm_head BF16
talker, code2wav BF16

Calibration recipe

  • Base model: Qwen/Qwen3-Omni-30B-A3B-Instruct in bfloat16
  • ModelOpt: nvidia-modelopt==0.44.0 with the ModelOpt-side calibration fix applied (see Mitigations below). Exported safetensors contain 0 NaN bytes.
  • Config: mtq.NVFP4_DEFAULT_CFG + algorithm mse (with fp8_scale_sweep=True)
  • Samples: 512 from HuggingFaceH4/ultrachat_200k train_sft (chat-templated, 512 tokens each)
  • Excluded patterns: *audio_tower*, *visual*, *talker*, *code2wav*, *lm_head*, *mlp.gate*, *q_proj*, *k_proj*, *v_proj* (plus per-layer entries patched post-export so vllm-omni's weight loader honors the BF16 routing of mlp.gate and qkv_proj)
  • Calibration time: ~60 min on a single RTX PRO 6000 Blackwell WS

Inference

from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-oproj-mse")

OpenAI-compatible server:

vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-oproj-mse \
    --omni --port 8000

Do not pass --enforce-eager for benchmarks. CUDA graphs amortize kernel launch overhead and unlock the FP4 throughput wins; with --enforce-eager set, W4A4 TPOT degrades ~10x relative to the CUDA-graph configuration.

Compute requirement: sm_100+ (Blackwell, e.g. B100/B200, RTX 5090, RTX Pro 6000) for native FP4 tensor cores.

ModelOpt 0.44 NaN regression — two mitigation paths

ModelOpt 0.44's float32 -> torch.float8_e4m3fn cast of per-block weight_scale occasionally emits literal NaN bytes (E4M3 encoding 0x7F / 0xFF) when the pre-cast scale rounds above the FP8 max of 448 after the global-scale division. A single NaN byte in any weight_scale propagates through the FlashInfer FP4 GEMM into the residual stream and collapses the served model output to !!!!. Two complementary fixes:

  1. Calibration-time (ModelOpt-side): clamp the pre-cast values to torch.finfo(torch.float8_e4m3fn).max before every .to(torch.float8_e4m3fn) at the two cast sites in modelopt/torch/quantization/qtensor/nvfp4_tensor.py and modelopt/torch/export/quant_utils.py. This checkpoint was calibrated with that ModelOpt 0.44 patch applied — exported safetensors contain 0 NaN bytes. An upstream PR to NVIDIA/TensorRT-Model-Optimizer is in progress.

  2. Load-time (vllm-omni-side): vllm-project/vllm-omni#4025 installs a defensive override of ModelOptNvFp4LinearMethod.process_weights_after_loading that scans weight_scale for NaN bytes and clamps them to FP8 E4M3 max at worker init. Because this checkpoint is already clean, the override is a no-op safety net here; it primarily protects other in-the-wild W4A4 NVFP4 checkpoints that were exported with vanilla ModelOpt 0.44 (including the preview siblings of this checkpoint family) and currently serve as !!!!. Self-extinguishes once vllm-omni's vllm pin includes the corresponding upstream vLLM fix; can be disabled with VLLM_OMNI_SKIP_NVFP4_NAN_CLAMP=1 for diagnostics.

Sample output

"The sky appears blue during the day because molecules in the Earth's atmosphere scatter shorter wavelengths of light, such as blue and violet, more effectively than longer wavelengths like red and yellow. Although violet light is scattered even more than blue, our eyes are more sensitive to blue light and less sensitive to violet, making the sky appear blue."

Related

License

Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model).

Downloads last month
30
Safetensors
Model size
21B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-oproj-mse

Quantized
(24)
this model