Why does this 4-bit version have a size of 32.7 GB?
I was expecting it to be around ~20 GB so it would fit into 1x RTX 5090.
The size is closer to 8bit btw.
Tensor type
BF16 F8_E4M3 U8
lol...
No matter... we want it smaller to fit in consumer-grade video cards )
This 32.7GB build isn't a standard 4-bit quant; it's a Compound AI Architecture optimized for the M5/RTX 6000 era. By leveraging Mixed-Precision Tensors, we maintain logic-critical layers in BF16, while offloading heavy computation to F8_E4M3 for a 4.5x throughput boost. The inclusion of U8-indexed KV Caching and speculative decoding allows for a staggering 400 t/s without the typical perplexity degradation of legacy INT4. When wrapped in a RAG + Validation Gates pipeline, this local engine effectively bridges the gap to frontier cloud models. High-density engineering for devs who prioritize local privacy without sacrificing 'GPT-5.2' class reasoning.
The model size was intentionally set so that it wouldn't work well with 32GB GPUs. Perhaps they want to sell more RTX Pro 6000s.
try prithivMLmods/gemma-4-31B-it-NVFP4
I mean, if it's faster in general, then it's also still faster when VRAM swapping.
The real question is how good it still is. That's the big claim up there. So if it's faster and still good, that's a net benefit.
All academic, because it is the fully censored (i.e. useless) base version anyway.
I don't get it. Given it's a dense model, if it needs significant swapping/offloading to system RAM on, say, an RTX 5090 32GB, then (especially considering the additional VRAM needed for the KV cache) it would be way slower than an RTX Pro 6000 or a theoretical 5090 with 48GB VRAM, right?
Also, why is it useless? Genuine question. Does it hurt something like coding assistance?
If I had a RTX Pro 6000 Blackwell I would just run the original version...
way slower yes. Usable? Yes. Depends how much you have to do with it. Since unified memory is all the rage, it can't be that bad.
But 31b seems to be very slow in general compared with similar dense models. Maybe teething issues..
Finally, I discovered the unofficial NVFP4 quant (23GB). Tested it out; it works pretty well on 1x RTX 5090 with vLLM.
But still, I can't understand why NVIDIA won't release their official NVFP4 (4-bit!) quant of this astonishing model, so that we could be sure everything is quantized properly. Come on, NVIDIA, the community is waiting. For you it's 2ms of work, for us it's a lifebuoy!
OK, maybe for a single user you could call it borderline usable, but if performance drops that much, it makes little sense to use it with something like a 5090 (since it would spend most of its time swapping to system RAM). In that case I think most people would be better off with smaller quants that fit in VRAM (and also leave enough space for the KV cache), e.g. cyankiwi/gemma-4-31B-it-AWQ-4bit or other decent 4 or 5 bpw quants. And if you have more than 32GB of VRAM, wouldn't e.g. cyankiwi/gemma-4-31B-it-AWQ-8bit be better quality?
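A rough back-of-the-envelope fit check illustrates the point. The bits-per-weight figures and KV-cache budget below are illustrative assumptions, not measured values for any specific quant:

```python
# Rough fit check: does a quantized 31B-parameter dense model plus its
# KV cache fit in a 32 GiB card? Bits-per-weight and KV budget are
# illustrative assumptions.
PARAMS = 31e9
GIB = 1024**3

def fits(bits_per_weight: float, kv_gib: float, vram_gib: float = 32.0) -> bool:
    weights_gib = PARAMS * bits_per_weight / 8 / GIB
    return weights_gib + kv_gib <= vram_gib

print(fits(4.5, kv_gib=6.0))  # ~4-bit quant with scale overhead: True
print(fits(8.5, kv_gib=6.0))  # ~8-bit quant: False on a 32 GiB card
```

Under these assumptions, a ~4.5 bpw quant leaves comfortable KV-cache headroom on 32 GiB, while an ~8.5 bpw quant does not.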
NVFP4 with VRAM usage like FP8? NVFP4 is a big scam.
Guys, check out this improved version of turboquant https://github.com/scrya-com/rotorquant; it significantly improves prefill speed, which was the main issue with turboquant. With this we should be able to run big models more easily in limited VRAM.
Just released a turbo variant (18.5GB GPU memory) which fits in a single RTX 5090.
Very Interesting, but honestly it all just sounds too good to be true. Where is the catch?
It fits on a 5090 but context tops out around 25K. For longer contexts you'd still need a bigger GPU like a PRO 6000.
But it's still 40% smaller while preserving FP4 tensor core support, so faster and cheaper to run.
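The ~25K context ceiling follows from per-token KV-cache cost. The model dimensions below (layers, KV heads, head dim) are illustrative assumptions, not the real Gemma 4 31B config:

```python
# Why context tops out on a 32 GiB card: per-token KV-cache cost.
# LAYERS / KV_HEADS / HEAD_DIM are assumed values for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
BYTES = 2  # FP16 entries for both K and V

def kv_bytes_per_token() -> int:
    # x2 because each layer stores both a K and a V vector per KV head
    return LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES

def max_context(free_vram_gib: float) -> int:
    return int(free_vram_gib * 1024**3 // kv_bytes_per_token())

# e.g. 18.5 GiB of weights on a 32 GiB card leaves only a few GiB
# for the cache after activations and overhead:
print(kv_bytes_per_token())   # 196608 bytes (~192 KiB per token)
print(max_context(6.0))       # 32768 tokens on a ~6 GiB cache budget
```

With only a few GiB left over after the weights, the usable context lands in the tens of thousands of tokens, which matches the reported ~25K ceiling in order of magnitude.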
Obviously, this model wasn't trained with FP4 or NVFP4. Its size is half that of the original FP16 model. If it had been trained or compressed to FP4, it shouldn't weigh more than 20GB. This looks like FP8 compression, not NVFP4.
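The arithmetic backs this up. A quick sketch of expected checkpoint sizes for a 31B-parameter dense model at each precision (the NVFP4 scale overhead of one FP8 scale per 16-value block is an assumption for illustration; real checkpoints also keep embeddings and norms at higher precision):

```python
# Rough checkpoint-size estimate for a 31B-parameter dense model at
# different precisions. NVFP4 overhead (one FP8 scale per 16-value
# block) is an illustrative assumption.
PARAMS = 31e9
GIB = 1024**3

def size_gib(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / GIB

print(f"FP16 : {size_gib(16):5.1f} GiB")        # ~57.7 GiB
print(f"FP8  : {size_gib(8):5.1f} GiB")         # ~28.9 GiB
print(f"NVFP4: {size_gib(4 + 8 / 16):5.1f} GiB")  # ~16.2 GiB
```

A 32.7 GB file sits squarely in FP8 territory, while a true NVFP4 quant should land under 20 GB even with scale overhead and some higher-precision layers.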
Exactly, that's why I further quantized it to 18.4 GB.
https://huggingface.co/LilaRest/gemma-4-31B-it-NVFP4-turbo
I've tested several NVFP4 quants. Two points: a) I have an RTX 6000 and also a 5090, but I prefer Qwen 122B over this for my RTX 6000 Pro. b) All the quants that work on the RTX 5090 produce weird output like '.B'. Not always, but when that happens the model is a no-go for me. This may explain NVIDIA's 32GB quant, since quants like LilaRest's all produce it (perhaps a vLLM bug, I'm not sure, or NVIDIA realized this and kept some layers at the original weights).
I rented an RTX 6000 Pro Blackwell to test it. Same issues as with the other NVFP4 quants. And as I've mentioned multiple times in other threads, the original weights of Gemma 4 31B are broken. Infinite loops are very common.