This version is still too large for a single 5090

#4
by iwaitu - opened

The NVFP4 version is 32 GB, which is still too large for a single 5090. Could more of the network be converted to NVFP4 so it can run inference on a single 5090?

Same question )

Same here. It's nearly as large as fp8, so what's the point of this NVFP4 quantization?
Indeed, this version is about the same size as the fp8 quantized version. What is NVIDIA doing? Is it only meant for the RTX Pro 6000?

I thought I could defy the odds, but I failed after all.

DGX Spark fails too

It loads successfully with --cpu-offload-gb 10, but the speed... [sad]
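For anyone trying the same workaround, here is a minimal sketch of serving with partial CPU offload. The model name and offload size follow the posts in this thread; adjust `--cpu-offload-gb` to however much VRAM the GPU is short by:

```shell
# Offload ~10 GB of weights to system RAM so the remainder fits in 32 GB of VRAM.
# Offloaded weights are streamed over PCIe during each forward pass, which is
# why throughput drops sharply compared to a fully GPU-resident model.
vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
    --quantization modelopt \
    --cpu-offload-gb 10 \
    --port 8000
```

This trades speed for the ability to load at all; the larger the offloaded slice, the slower each token.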

It didn't work on the Spark?

The model size was intentionally set so that it wouldn't work well with 32GB GPUs. Perhaps they want to sell more RTX Pro 6000s.

There's a problem with this quant; the quantization accomplishes basically nothing.

I ran it on DGX Spark, and it works.

My system boots with init 3, so it is in headless mode. I run it with Docker:

```shell
docker run --runtime nvidia --gpus all -it --rm -d --env "HF_HUB_OFFLINE=1" \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:gemma4-cu130 --model nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4
```

The model takes all the memory, though.

I don't have quantitative results, but the outputs on images of complex procedural graphs or scientific figures work, and the details (e.g. numbers) read from the images are accurate.

Currently, the included tool-call parser (docker image 9afe08ebfa30) is not correct and may format tool arguments incorrectly.

The vllm-openai:gemma4-cu130 image has a problem: deploying on 2 x H100, the GPU driver crashes when I try batched API calls. Deploying qwen3.5-122b-a10b or qwen3.5-27b with vllm-openai:latest-cu130 passes the same tests without issue.
