Question about q_scale / KV cache scale fallback in vLLM for Gemma-4-31B-IT-NVFP4: expected accuracy impact?
Hi NVIDIA team,
I am testing nvidia/Gemma-4-31B-IT-NVFP4 with vLLM on an RTX PRO 6000 and would like to ask whether the warnings below indicate a known compatibility or metadata issue, and how much accuracy degradation to expect if vLLM takes this fallback path.
Environment
GPU: RTX PRO 6000
Quantization: modelopt_fp4
Serving backend: vLLM
Command:

```shell
HF_HUB_OFFLINE=1 OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0 \
vllm serve /root/autodl-tmp/models/nvidia__Gemma-4-31B-IT-NVFP4 \
  --host 0.0.0.0 \
  --port 6006 \
  --served-model-name nvfp4-models \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 131072 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 8 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --quantization modelopt_fp4 \
  --default-chat-template-kwargs '{"enable_thinking": false}'
```
Relevant startup logs
```
WARNING [kv_cache.py:94] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
WARNING [kv_cache.py:108] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
```
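For context, I also checked whether the checkpoint shards actually contain any scale tensors. This is just the sketch I used (it parses only the safetensors JSON header, never the tensor data; the glob path in the comment is my local model directory, and the key suffixes are my guess at the naming convention):

```python
import json
import struct

def scale_keys(tensor_names):
    """Filter tensor names down to attention / KV-cache scale entries.

    The suffix list is an assumption about how scales are named in the
    checkpoint, not a documented contract.
    """
    patterns = ("k_scale", "v_scale", "q_scale", "kv_scale")
    return sorted(n for n in tensor_names if n.endswith(patterns))

def safetensors_keys(path):
    """Read the tensor names from a .safetensors file.

    The format is: 8-byte little-endian header length, then a JSON header
    mapping tensor names to metadata, so no tensor data needs to be loaded.
    """
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    return [k for k in header if k != "__metadata__"]

# Usage against the model directory from the serve command above:
# from pathlib import Path
# model_dir = Path("/root/autodl-tmp/models/nvidia__Gemma-4-31B-IT-NVFP4")
# for shard in sorted(model_dir.glob("*.safetensors")):
#     print(shard.name, scale_keys(safetensors_keys(shard)))
```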
Other parts of startup appear normal:

- the model loads successfully,
- the torch.compile cache is used,
- CUDA graph capture succeeds,
- the API server starts successfully, and
- runtime throughput looks normal.
So this does not look like a hard failure. However, I am concerned that vLLM may be entering a degraded fallback path for attention and KV cache scaling.
My questions
1. Is this expected for this model when served through vLLM + modelopt_fp4, or does it indicate that some quantization metadata is missing or not being loaded correctly?
2. Does the checkpoint intentionally omit q_scale and/or the KV cache k/v_scale, with vLLM expected to fall back, or should these scales normally be present in the NVFP4 release?
3. If vLLM falls back to `q_scale := k_scale` and `KV cache scale := 1.0`, approximately how much accuracy loss should we expect in practice? Is the impact usually:
   - negligible for short-context chat,
   - small but noticeable on long-context tasks,
   - or potentially significant for retrieval, long-context reasoning, and tool use?
4. Do you recommend any validation benchmark or A/B test to quantify the quality drop on this fallback path?
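To make the A/B question concrete: absent a recommended harness, my current plan is to serve the fallback configuration and a scale-correct configuration side by side, run the same fixed long-context QA prompt set against both, and score exact-match agreement, roughly along these lines (the function and argument names are mine, not from any existing tool):

```python
def agreement_rate(baseline_answers, candidate_answers):
    """Fraction of prompts on which two serving configs give the same answer.

    baseline_answers / candidate_answers: lists of model outputs for the
    same ordered prompt set, one list per serving configuration.
    """
    if len(baseline_answers) != len(candidate_answers):
        raise ValueError("answer lists must cover the same prompt set")
    matches = sum(a == b for a, b in zip(baseline_answers, candidate_answers))
    return matches / len(baseline_answers)
```

Exact match is obviously a crude proxy for free-form generations, so I would pair it with a task-level metric, but a drop in agreement as context length grows would at least localize the effect to this fallback.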
What I want to confirm
I am mainly trying to determine whether this is:

- a benign warning with little real-world impact, or
- a real degradation path that could cause measurable quality loss, especially at long context lengths.

If this is a known issue, I would also appreciate guidance on whether the preferred fix is:

- using a newer model export,
- using a different vLLM version, or
- checking specific quantization metadata inside the checkpoint.
Thanks a lot.