
NVFP4 cannot be loaded in SGLang

#12 · opened by mratsim
[2026-03-13 14:02:59 TP0] Load weight begin. avail mem=94.13 GB
[2026-03-13 14:02:59 TP1] Load weight begin. avail mem=94.13 GB
[2026-03-13 14:02:59 TP0] Using ModelOptModelLoader due to ModelOpt quantization config.
[2026-03-13 14:02:59 TP0] ModelOptModelLoader: Loading base model...
[2026-03-13 14:02:59 TP0] Model is already quantized, loading directly...
[2026-03-13 14:02:59 TP1] Using ModelOptModelLoader due to ModelOpt quantization config.
[2026-03-13 14:02:59 TP1] ModelOptModelLoader: Loading base model...
[2026-03-13 14:02:59 TP1] Model is already quantized, loading directly...
[2026-03-13 14:02:59 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3130, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 368, in __init__
    self.init_model_worker()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 564, in init_model_worker
    self.init_tp_model_worker()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 522, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 247, in __init__
    self._init_model_runner()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 330, in _init_model_runner
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 413, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 493, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 980, in load_model
    self.model = self.loader.load_model(
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 2635, in load_model
    return super().load_model(
           ^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 668, in load_model
    quant_config = _get_quantization_config(model_config, self.load_config)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 223, in _get_quantization_config
    quant_config = get_quant_config(
                   ^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/weight_utils.py", line 204, in get_quant_config
    return quant_cls.from_config(hf_quant_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py", line 461, in from_config
    raise ValueError(
ValueError: ModelOptFp8Config only supports static FP8 quantization in SGLang. For FP4 quantization, use ModelOptFp4Config. Check the quantization config for your model's configuration.

Not sure why this tries to use ModelOptFp8Config in the first place.
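For context on how that dispatch could go wrong: ModelOpt checkpoints ship a hf_quant_config.json whose quant_algo string is what the loader uses to pick a config class (the traceback's quant_cls.from_config(hf_quant_config) call). Below is a minimal sketch of that kind of string-keyed dispatch; the field names are the usual ModelOpt layout, but the selection logic and fallback behaviour here are assumptions for illustration, not SGLang's actual code:

```python
# Hypothetical sketch: pick a quantization config class from the
# quant_algo string in hf_quant_config.json. The dict layout follows
# ModelOpt's usual export format; the dispatch itself is illustrative.

def select_quant_config(hf_quant_config: dict) -> str:
    algo = hf_quant_config.get("quantization", {}).get("quant_algo", "")
    if algo == "NVFP4":
        return "ModelOptFp4Config"
    if algo == "FP8":
        return "ModelOptFp8Config"
    # A recipe string the dispatcher doesn't recognize (e.g. a
    # mixed-precision variant) has no correct branch to land in.
    raise ValueError(f"Unsupported quant_algo: {algo!r}")

print(select_quant_config({"quantization": {"quant_algo": "NVFP4"}}))
```

If the checkpoint's recipe string doesn't match what a given config class expects, the loader can end up in the FP8 branch and raise exactly the ValueError shown above.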

Using this docker image: https://hub.docker.com/layers/lmsysorg/sglang/latest/images/sha256-f337dfb36971becc98d12768f356af3c6c12ba57c9aebede0e6948da5ad37da7

Digest: sha256:e216b7dc4ac1938b599b982233ccf7eb2b11dd1f07fc2e00a7b9841052c553be

Date: 2026-02-23 (it's tagged latest, but it's 18 days old, so ...)

NVIDIA org

Are you sure about the issue number? That one is for an unrelated error, KeyError: 'model.layers.65.mixer.experts.w2_weight_scale'. Mine is ValueError: ModelOptFp8Config only supports static FP8 quantization in SGLang. For FP4 quantization, use ModelOptFp4Config.

Last I checked, SGLang doesn't support TMEM-based NVFP4 instructions, which breaks NVFP4 on non-datacenter Blackwell GPUs.

It seems to me that the issue is that this NVFP4 checkpoint is mixed-precision NVFP4 + FP8, and historically SGLang hasn't supported mixed-precision checkpoints from either ModelOpt or LLMCompressor.
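One quick way to see whether a checkpoint declares a mixed recipe is to inspect its hf_quant_config.json. The sketch below assumes the common ModelOpt layout where the weight algo sits in quant_algo and the KV-cache algo in kv_cache_quant_algo; exports can differ (e.g. per-layer exclusions instead), so treat the key names as assumptions:

```python
import json

# Sketch: flag a mixed-precision ModelOpt export by comparing the
# weight quant_algo against the kv-cache algo. Key names follow the
# usual hf_quant_config.json layout but may differ between exports.

def is_mixed_precision(cfg: dict) -> bool:
    q = cfg.get("quantization", {})
    weight_algo = q.get("quant_algo")
    kv_algo = q.get("kv_cache_quant_algo")
    return kv_algo is not None and kv_algo != weight_algo

cfg = json.loads("""
{
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8"
  }
}
""")
print(is_mixed_precision(cfg))  # NVFP4 weights + FP8 KV cache -> True
```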
