Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A4 (experts + o_proj, MSE)
ModelOpt NVFP4 W4A4 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct.
Weights and activations are quantized to FP4 for the MoE expert MLP path and the attention output projection (o_proj). Attention QKV stays BF16; embeddings, norms, MoE router, multimodal encoders, talker, and code2wav all remain BF16.
| Hardware requirement | NVIDIA Blackwell (sm_100+) for native FP4 GEMM |
| Exported safetensors NaN bytes | 0 (calibrated with the ModelOpt-side fix; see Mitigations below) |
Recommendation: for production W4A4 with the best B200 throughput, prefer
YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip. This experts+o_proj variant is the intermediate W4A4 scope (between experts-only and full-thinker) and is useful as a comparison point.
Accuracy
Benchmarked on RTX PRO 6000 Blackwell WS via the canonical vllm-omni in-tree harness.
| Benchmark | BF16 baseline | This checkpoint | Δ |
|---|---|---|---|
| Daily-Omni overall (n=50) | 0.72 | 0.72 | 0.00 |
Daily-Omni breakdown
| Task type | n | This ckpt | BF16 |
|---|---|---|---|
| Reasoning | 11 | 1.000 | 1.000 |
| Inference | 4 | 1.000 | 1.000 |
| AV Event Alignment | 3 | 1.000 | 1.000 |
| Comparative | 9 | 0.667 | 0.667 |
| Context understanding | 10 | 0.600 | 0.500 |
| Event Sequence | 13 | 0.462 | 0.538 |
Matches BF16 exactly on the headline metric and on reasoning/inference/AV-alignment categories; slight reshuffling between Context understanding (+10pp) and Event Sequence (-7.6pp).
Quantization scope
| Layer | State |
|---|---|
thinker.model.*.mlp.experts.* (gate_up_proj, down_proj) |
NVFP4 W4A4 |
thinker.model.*.self_attn.o_proj |
NVFP4 W4A4 |
thinker.model.*.self_attn.{q,k,v}_proj (fused qkv_proj at runtime) |
BF16 |
thinker.model.* embeddings, norms |
BF16 |
thinker.model.*.mlp.gate (MoE router) |
BF16 |
thinker.audio_tower.*, thinker.visual.*, thinker.lm_head |
BF16 |
talker, code2wav |
BF16 |
Calibration recipe
- Base model:
Qwen/Qwen3-Omni-30B-A3B-Instructinbfloat16 - ModelOpt:
nvidia-modelopt==0.44.0with the ModelOpt-side calibration fix applied (see Mitigations below). Exported safetensors contain 0 NaN bytes. - Config:
mtq.NVFP4_DEFAULT_CFG+ algorithmmse(withfp8_scale_sweep=True) - Samples: 512 from
HuggingFaceH4/ultrachat_200ktrain_sft (chat-templated, 512 tokens each) - Excluded patterns:
*audio_tower*,*visual*,*talker*,*code2wav*,*lm_head*,*mlp.gate*,*q_proj*,*k_proj*,*v_proj*(plus per-layer entries patched post-export so vllm-omni's weight loader honors the BF16 routing ofmlp.gateandqkv_proj) - Calibration time: ~60 min on a single RTX PRO 6000 Blackwell WS
Inference
from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-oproj-mse")
OpenAI-compatible server:
vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-oproj-mse \
--omni --port 8000
Do not pass
--enforce-eagerfor benchmarks. CUDA graphs amortize kernel launch overhead and unlock the FP4 throughput wins; with--enforce-eagerset, W4A4 TPOT degrades ~10x relative to the CUDA-graph configuration.
Compute requirement: sm_100+ (Blackwell, e.g. B100/B200, RTX 5090, RTX Pro 6000) for native FP4 tensor cores.
ModelOpt 0.44 NaN regression — two mitigation paths
ModelOpt 0.44's float32 -> torch.float8_e4m3fn cast of per-block weight_scale occasionally emits literal NaN bytes (E4M3 encoding 0x7F / 0xFF) when the pre-cast scale rounds above the FP8 max of 448 after the global-scale division. A single NaN byte in any weight_scale propagates through the FlashInfer FP4 GEMM into the residual stream and collapses the served model output to !!!!. Two complementary fixes:
Calibration-time (ModelOpt-side): clamp the pre-cast values to
torch.finfo(torch.float8_e4m3fn).maxbefore every.to(torch.float8_e4m3fn)at the two cast sites inmodelopt/torch/quantization/qtensor/nvfp4_tensor.pyandmodelopt/torch/export/quant_utils.py. This checkpoint was calibrated with that ModelOpt 0.44 patch applied — exported safetensors contain 0 NaN bytes. An upstream PR toNVIDIA/TensorRT-Model-Optimizeris in progress.Load-time (vllm-omni-side): vllm-project/vllm-omni#4025 installs a defensive override of
ModelOptNvFp4LinearMethod.process_weights_after_loadingthat scansweight_scalefor NaN bytes and clamps them to FP8 E4M3 max at worker init. Because this checkpoint is already clean, the override is a no-op safety net here; it primarily protects other in-the-wild W4A4 NVFP4 checkpoints that were exported with vanilla ModelOpt 0.44 (including the preview siblings of this checkpoint family) and currently serve as!!!!. Self-extinguishes once vllm-omni's vllm pin includes the corresponding upstream vLLM fix; can be disabled withVLLM_OMNI_SKIP_NVFP4_NAN_CLAMP=1for diagnostics.
Sample output
"The sky appears blue during the day because molecules in the Earth's atmosphere scatter shorter wavelengths of light, such as blue and violet, more effectively than longer wavelengths like red and yellow. Although violet light is scattered even more than blue, our eyes are more sensitive to blue light and less sensitive to violet, making the sky appear blue."
Related
- W4A4 production sibling:
YihongJin/...-NVFP4-W4A4-full-thinker-awqclip— wider scope (full thinker), AWQ-clip, B200 perf numbers - W4A16 sibling:
YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq— for pre-Blackwell platforms - W4A4 experts-only sibling:
YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-mse— narrowest scope
License
Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model).
- Downloads last month
- 30
Model tree for YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-experts-oproj-mse
Base model
Qwen/Qwen3-Omni-30B-A3B-Instruct