NVFP4 checkpoint cannot be loaded in SGLang
[2026-03-13 14:02:59 TP0] Load weight begin. avail mem=94.13 GB
[2026-03-13 14:02:59 TP1] Load weight begin. avail mem=94.13 GB
[2026-03-13 14:02:59 TP0] Using ModelOptModelLoader due to ModelOpt quantization config.
[2026-03-13 14:02:59 TP0] ModelOptModelLoader: Loading base model...
[2026-03-13 14:02:59 TP0] Model is already quantized, loading directly...
[2026-03-13 14:02:59 TP1] Using ModelOptModelLoader due to ModelOpt quantization config.
[2026-03-13 14:02:59 TP1] ModelOptModelLoader: Loading base model...
[2026-03-13 14:02:59 TP1] Model is already quantized, loading directly...
[2026-03-13 14:02:59 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3130, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 368, in __init__
self.init_model_worker()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 564, in init_model_worker
self.init_tp_model_worker()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 522, in init_tp_model_worker
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 247, in __init__
self._init_model_runner()
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 330, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 413, in __init__
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 493, in initialize
self.load_model()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 980, in load_model
self.model = self.loader.load_model(
^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 2635, in load_model
return super().load_model(
^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 668, in load_model
quant_config = _get_quantization_config(model_config, self.load_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 223, in _get_quantization_config
quant_config = get_quant_config(
^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/weight_utils.py", line 204, in get_quant_config
return quant_cls.from_config(hf_quant_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py", line 461, in from_config
raise ValueError(
ValueError: ModelOptFp8Config only supports static FP8 quantization in SGLang. For FP4 quantization, use ModelOptFp4Config. Check the quantization config for your model's configuration.
I'm not sure why SGLang tries to use ModelOptFp8Config in the first place.
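For what it's worth, ModelOpt-quantized checkpoints declare their scheme in an hf_quant_config.json next to the weights, and the quant_algo field there is what the loader normally keys on. A quick sanity check is to read that field directly (a sketch; the JSON below is an illustrative example of a typical ModelOpt export, not the actual file from this checkpoint):

```python
import json

# Illustrative hf_quant_config.json content, as typically exported by
# ModelOpt. The exact fields in a given checkpoint may differ.
hf_quant_config = json.loads("""
{
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8"
  }
}
""")

quant_algo = hf_quant_config["quantization"]["quant_algo"]

# A loader keying on this string should route "NVFP4" to the FP4 config;
# if it ends up on the FP8 path for some other reason, you get the
# ValueError shown in the traceback above.
if quant_algo == "NVFP4":
    print("expect ModelOptFp4Config")
elif quant_algo == "FP8":
    print("expect ModelOptFp8Config")
else:
    print(f"unhandled quant_algo: {quant_algo}")
```

If the field really says NVFP4 but the FP8 config still gets instantiated, the mismatch is in the dispatch, not the checkpoint.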
Using this docker image: https://hub.docker.com/layers/lmsysorg/sglang/latest/images/sha256-f337dfb36971becc98d12768f356af3c6c12ba57c9aebede0e6948da5ad37da7
Digest: sha256:e216b7dc4ac1938b599b982233ccf7eb2b11dd1f07fc2e00a7b9841052c553be
Date: 2026-02-23 (it's tagged latest, but the image is 18 days old, so it may not include recent fixes).
Are you sure about that issue number? It's for an unrelated error (KeyError: 'model.layers.65.mixer.experts.w2_weight_scale'). Mine is: ValueError: ModelOptFp8Config only supports static FP8 quantization in SGLang. For FP4 quantization, use ModelOptFp4Config. Check the quantization config for your model's configuration.
Last I checked, SGLang doesn't support TMEM-based NVFP4 instructions, which breaks non-datacenter Blackwell.
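For context, the split here is between datacenter Blackwell (compute capability 10.x, e.g. B200) and consumer Blackwell (compute capability 12.x, e.g. RTX 50 series), whose tensor-core instruction sets differ. A trivial capability check illustrates the distinction (the helper below is my own illustration based on NVIDIA's published compute capabilities, not an SGLang API):

```python
# Distinguish datacenter Blackwell (SM 10.x) from consumer Blackwell
# (SM 12.x), since NVFP4 kernel support differs between the two.
# The mapping is an assumption for illustration, not an SGLang API.

def blackwell_flavor(cc: tuple[int, int]) -> str:
    major, _minor = cc
    if major == 10:
        return "datacenter"   # B100 / B200 / GB200
    if major == 12:
        return "consumer"     # RTX 50 series
    return "not Blackwell"

# With torch installed you would pass torch.cuda.get_device_capability();
# here we use known published values:
print(blackwell_flavor((10, 0)))  # datacenter
print(blackwell_flavor((12, 0)))  # consumer
```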
It seems to me that the issue is that the NVFP4 checkpoint is mixed-precision NVFP4 + FP8, and historically SGLang hasn't supported mixed-precision checkpoints from either ModelOpt or LLMCompressor:
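To make the mixed-precision point concrete, here is a hypothetical sketch of how a loader that maps a single quant_algo string to one config class falls over on a checkpoint where some layers are NVFP4 and others FP8 (all names below are made up for illustration; this is not SGLang's actual code):

```python
# Hypothetical dispatch: one quant_algo string -> one config class.
# Names are illustrative, not SGLang's real implementation.

CONFIG_BY_ALGO = {
    "NVFP4": "ModelOptFp4Config",
    "FP8": "ModelOptFp8Config",
}

def pick_config(per_layer_algos: dict[str, str]) -> str:
    algos = set(per_layer_algos.values())
    if len(algos) > 1:
        # A mixed NVFP4 + FP8 checkpoint matches no single class, so a
        # loader may fall back to the wrong one (e.g. FP8) and then
        # reject the FP4 layers with the ValueError seen above.
        raise ValueError(f"mixed precision not supported: {sorted(algos)}")
    return CONFIG_BY_ALGO[algos.pop()]

# A homogeneous checkpoint dispatches cleanly:
print(pick_config({"layer.0": "NVFP4", "layer.1": "NVFP4"}))
# A mixed checkpoint (say, FP8 attention + NVFP4 MoE) raises:
try:
    pick_config({"attn": "FP8", "moe": "NVFP4"})
except ValueError as e:
    print(e)
```

Proper support would presumably need per-layer config resolution rather than a single checkpoint-wide class.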