Does not work on 3090 GPUs
I found that none of the PrismaSCOUT or PrismaQuant models here work on 3090 GPUs, but the PrismaSCOUT and PrismaQuant models from cyburn do work. What is the difference?
I have tried these three models from cyburn without any problems:
- cyburn/Qwopus3.6-35B-A3B-v1-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm-4.75bits
- cyburn/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-PrismaQuant-4.75bit-vllm
- cyburn/Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm
And then tried yours, but got these errors:
- rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm
- rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm
- rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
worker = WorkerProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 619, in __init__
self.worker.load_model()
File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4793, in load_model
self.model = model_loader.load_model(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 80, in load_model
process_weights_after_loading(model, model_config, target_device)
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
quant_method.process_weights_after_loading(module)
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 913, in process_weights_after_loading
layer.scheme.process_weights_after_loading(layer)
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_nvfp4.py", line 124, in process_weights_after_loading
self.kernel.process_weights_after_loading(layer)
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/kernels/linear/nvfp4/marlin.py", line 40, in process_weights_after_loading
prepare_fp4_layer_for_marlin(layer)
File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py", line 234, in prepare_fp4_layer_for_marlin
marlin_qweight = ops.gptq_marlin_repack(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/_custom_ops.py", line 1279, in gptq_marlin_repack
return torch.ops._C.gptq_marlin_repack(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: size_n = 2152 is not divisible by tile_n_size = 64
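For reference, 2152 / 64 = 33.625, so the repack tile check fails. Here is a rough sketch for scanning the downloaded shards for output dims that would not be divisible by 64 (the local path, the choice of dimension, and the TP=2 split are all my assumptions, not anything from your repo):

```python
# Sketch: scan local safetensors shards and flag tensors whose output dim is not
# divisible by Marlin's tile_n_size (64), with and without a TP=2 split.
# Assumptions: shards are already downloaded to MODEL_DIR, and size_n corresponds
# to the first dimension of each weight tensor.
import glob
from safetensors import safe_open

MODEL_DIR = "./Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm"  # hypothetical local path
TILE_N = 64
TP = 2  # the log shows Worker_TP0/Worker_TP1, so tensor parallel size 2

for shard in sorted(glob.glob(f"{MODEL_DIR}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            shape = f.get_slice(name).get_shape()
            if not shape:
                continue  # skip scalars
            n = shape[0]
            if n % TILE_N or (n // TP) % TILE_N:
                print(f"{name}: shape={shape}, n={n}, n/TP={n // TP}")
```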
I'm surprised, tbh, because I didn't think 3090s supported NVFP4.
Assuming they do, though, definitely don't use Marlin; it should be CUTLASS all the way.
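If you want to double-check what the card actually reports, here's a quick sketch (treating compute capability 10.0, i.e. Blackwell, as the cutoff for native NVFP4 is my assumption; a 3090 reports 8.6):

```python
# Sketch: report the GPU's compute capability and whether NVFP4 would have to be
# emulated. The (10, 0) cutoff for native FP4 tensor cores (Blackwell) is an
# assumption on my part, not something taken from vLLM.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}")
if (major, minor) >= (10, 0):
    print("Native NVFP4 tensor cores expected (Blackwell or newer).")
else:
    print("No native NVFP4 here; an NVFP4 checkpoint would run via Marlin or emulation.")
```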
It gets emulated, like you wrote in your repository:
https://github.com/RobTand/prismaquant
For both your "Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm" and cyburn's "Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm", all the preloading works and looks the same. It's only when the machine loads checkpoint shard 3 that it crashes with the stack trace above. This is the log right before the crash:
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [gpu_model_runner.py:4777] Starting to load model rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm...
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [__init__.py:683] Using MarlinNvFp4LinearKernel for NVFP4 GEMM
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [__init__.py:560] Using MarlinMxfp8LinearKernel for MXFP8 GEMM
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [gdn_linear_attn.py:153] Using Triton/FLA GDN prefill kernel
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [nvfp4.py:280] Using 'MARLIN' NvFp4 MoE backend out of potential backends: ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTEDSL_BATCHED', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN', 'EMULATION'].
(Worker_TP0 pid=193) INFO 05-08 20:59:32 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(Worker_TP0 pid=193) INFO 05-08 20:59:34 [weight_utils.py:904] Filesystem type for checkpoints: FUSE.SHFS. Checkpoint size: 21.31 GiB. Available RAM: 49.27 GiB.
(Worker_TP0 pid=193) INFO 05-08 20:59:34 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (FUSE.SHFS) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
(Worker_TP0 pid=193)
Loading safetensors checkpoint shards: 0% Completed | 0/6 [00:00<?, ?it/s]
(Worker_TP0 pid=193)
Loading safetensors checkpoint shards: 17% Completed | 1/6 [00:04<00:23, 4.75s/it]
(Worker_TP0 pid=193)
Loading safetensors checkpoint shards: 33% Completed | 2/6 [00:12<00:25, 6.31s/it]
(Worker_TP1 pid=198) ERROR 05-08 20:59:52 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker_TP1 pid=198) ERROR 05-08 20:59:52 [multiproc_executor.py:870] Traceback (most recent call last):
Ok, if it's emulating NVFP4 and MXFP8 on the 3090, then using Marlin is reasonable. Honestly, I have no idea why it wouldn't work. Cyburn's models are based on my code. Maybe let Claude Code or Codex take a crack at it?
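If someone does point an agent at it, a reasonable starting point might be diffing the two repos' config.json, since a size_n like 2152 has to come from the model or quantization geometry somewhere. Just a sketch, with the repo ids copied from above and everything else a guess at where the difference might show up:

```python
# Sketch: download and diff config.json of the working and failing repos, to spot
# differences in model or quantization geometry (group sizes, ignored layers,
# intermediate sizes) that could change size_n for the Marlin repack.
import json
from huggingface_hub import hf_hub_download

REPOS = [
    "cyburn/Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm",   # reported working on the 3090
    "rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm",  # crashes in gptq_marlin_repack
]

configs = {}
for repo in REPOS:
    path = hf_hub_download(repo_id=repo, filename="config.json")
    with open(path) as f:
        configs[repo] = json.load(f)

a, b = (configs[r] for r in REPOS)
for key in sorted(set(a) | set(b)):
    if a.get(key) != b.get(key):
        print(f"{key}:\n  {REPOS[0]}: {a.get(key)}\n  {REPOS[1]}: {b.get(key)}")
```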