Does not work on 3090 GPUs
I found that none of the PrismaSCOUT or PrismaQuant models here work on 3090 GPUs, but the PrismaSCOUT and PrismaQuant models from cyburn do work. What is the difference?
I have tried these three models from cyburn without any problems:
- cyburn/Qwopus3.6-35B-A3B-v1-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm-4.75bits
- cyburn/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-PrismaQuant-4.75bit-vllm
- cyburn/Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm
And then tried yours, but got these errors:
- rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm
- rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm
- rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
worker = WorkerProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 619, in __init__
self.worker.load_model()
File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4793, in load_model
self.model = model_loader.load_model(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 80, in load_model
process_weights_after_loading(model, model_config, target_device)
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
quant_method.process_weights_after_loading(module)
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 913, in process_weights_after_loading
layer.scheme.process_weights_after_loading(layer)
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_nvfp4.py", line 124, in process_weights_after_loading
self.kernel.process_weights_after_loading(layer)
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/kernels/linear/nvfp4/marlin.py", line 40, in process_weights_after_loading
prepare_fp4_layer_for_marlin(layer)
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py", line 234, in prepare_fp4_layer_for_marlin
marlin_qweight = ops.gptq_marlin_repack(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/_custom_ops.py", line 1279, in gptq_marlin_repack
return torch.ops._C.gptq_marlin_repack(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: size_n = 2152 is not divisible by tile_n_size = 64
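For reference, 2152 / 64 = 33.625, so the repack tile check fails. Here is a rough sketch for scanning the downloaded shards for output dims that would not be divisible by 64 (the local path, the choice of dimension, and the TP=2 split are all my assumptions, not anything from your repo):

```python
# Sketch: scan local safetensors shards and flag tensors whose output dim is not
# divisible by Marlin's tile_n_size (64), with and without a TP=2 split.
# Assumptions: shards are already downloaded to MODEL_DIR, and size_n corresponds
# to the first dimension of each weight tensor.
import glob
from safetensors import safe_open

MODEL_DIR = "./Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm"  # hypothetical local path
TILE_N = 64
TP = 2  # the log shows Worker_TP0/Worker_TP1, so tensor parallel size 2

for shard in sorted(glob.glob(f"{MODEL_DIR}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            shape = f.get_slice(name).get_shape()
            if not shape:
                continue  # skip scalars
            n = shape[0]
            if n % TILE_N or (n // TP) % TILE_N:
                print(f"{name}: shape={shape}, n={n}, n/TP={n // TP}")
```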
I'm surprised, tbh, because I didn't think 3090s supported NVFP4.
Assuming they do, though, definitely don't use Marlin; it should be CUTLASS all the way.
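If you want to double-check what the card actually reports, here's a quick sketch (treating compute capability 10.0, i.e. Blackwell, as the cutoff for native NVFP4 is my assumption; a 3090 reports 8.6):

```python
# Sketch: report the GPU's compute capability and whether NVFP4 would have to be
# emulated. The (10, 0) cutoff for native FP4 tensor cores (Blackwell) is an
# assumption on my part, not something taken from vLLM.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}")
if (major, minor) >= (10, 0):
    print("Native NVFP4 tensor cores expected (Blackwell or newer).")
else:
    print("No native NVFP4 here; an NVFP4 checkpoint would run via Marlin or emulation.")
```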
It gets emulated, like you wrote in your repository:
https://github.com/RobTand/prismaquant
For both your "Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm" and cyburn's "Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm", all the preloading works and looks the same. It's only when the machine loads checkpoint shard 3 that it crashes with the stack trace above. This is the log right before the crash:
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [gpu_model_runner.py:4777] Starting to load model rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm...
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [__init__.py:683] Using MarlinNvFp4LinearKernel for NVFP4 GEMM
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [__init__.py:560] Using MarlinMxfp8LinearKernel for MXFP8 GEMM
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [gdn_linear_attn.py:153] Using Triton/FLA GDN prefill kernel
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [nvfp4.py:280] Using 'MARLIN' NvFp4 MoE backend out of potential backends: ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTEDSL_BATCHED', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN', 'EMULATION'].
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(Worker_TP0 pid=193) INFO 05-08 20:59:34 [weight_utils.py:904] Filesystem type for checkpoints: FUSE.SHFS. Checkpoint size: 21.31 GiB. Available RAM: 49.27 GiB.
(Worker_TP0 pid=193) INFO 05-08 20:59:34 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (FUSE.SHFS) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
(Worker_TP0 pid=193)
Loading safetensors checkpoint shards: 0% Completed | 0/6 [00:00<?, ?it/s]
(Worker_TP0 pid=193)
Loading safetensors checkpoint shards: 17% Completed | 1/6 [00:04<00:23, 4.75s/it]
(Worker_TP0 pid=193)
Loading safetensors checkpoint shards: 33% Completed | 2/6 [00:12<00:25, 6.31s/it]
(Worker_TP1 pid=198) ERROR 05-08 20:59:52 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker_TP1 pid=198) ERROR 05-08 20:59:52 [multiproc_executor.py:870] Traceback (most recent call last):
Ok, if it's emulating NVFP4 and MXFP8 on the 3090, then using Marlin is reasonable. Honestly, I have no idea why it wouldn't work. Cyburn's models are based on my code. Maybe let Claude Code or Codex take a crack at it?
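If someone does point an agent at it, a reasonable starting point might be diffing the two repos' config.json, since a size_n like 2152 has to come from the model or quantization geometry somewhere. Just a sketch, with the repo ids copied from above and everything else a guess at where the difference might show up:

```python
# Sketch: download and diff config.json of the working and failing repos, to spot
# differences in model or quantization geometry (group sizes, ignored layers,
# intermediate sizes) that could change size_n for the Marlin repack.
import json
from huggingface_hub import hf_hub_download

REPOS = [
    "cyburn/Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm",   # reported working on the 3090
    "rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm",  # crashes in gptq_marlin_repack
]

configs = {}
for repo in REPOS:
    path = hf_hub_download(repo_id=repo, filename="config.json")
    with open(path) as f:
        configs[repo] = json.load(f)

a, b = (configs[r] for r in REPOS)
for key in sorted(set(a) | set(b)):
    if a.get(key) != b.get(key):
        print(f"{key}:\n  {REPOS[0]}: {a.get(key)}\n  {REPOS[1]}: {b.get(key)}")
```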