Why does this 4-bit version have a size of 32.7 GB?
I was expecting it to be around ~20 GB so it would fit into 1x RTX 5090.
The size is closer to 8bit btw.
Tensor type
BF16 F8_E4M3 U8
lol...
No matter... we want it smaller to fit in consumer-grade video cards )
This 32.7GB build isn't a standard 4-bit quant; it's a Compound AI Architecture optimized for the M5/RTX 6000 era. By leveraging Mixed-Precision Tensors, we maintain logic-critical layers in BF16, while offloading heavy computation to F8_E4M3 for a 4.5x throughput boost. The inclusion of U8-indexed KV Caching and speculative decoding allows for a staggering 400 t/s without the typical perplexity degradation of legacy INT4. When wrapped in a RAG + Validation Gates pipeline, this local engine effectively bridges the gap to frontier cloud models. High-density engineering for devs who prioritize local privacy without sacrificing 'GPT-5.2' class reasoning.
The model size was intentionally set so that it wouldn't work well with 32GB GPUs. Perhaps they want to sell more RTX Pro 6000s.
try prithivMLmods/gemma-4-31B-it-NVFP4
I mean, if it's faster in general, then it's also still faster when VRAM swapping.
The real question is how good it still is. That's the big claim up there. So if it's faster and still good, that's a net benefit.
All academic, because it is the fully censored (i.e. useless) base version anyway.
I don't get it. Given it's a dense model, if it needs significant swapping/offloading to system RAM on, say, an RTX 5090 32GB, then (especially considering the additional VRAM needed for the KV cache) it would be way slower than an RTX Pro 6000 or a theoretical 5090 with 48GB VRAM, right?
Also, why is it useless? Genuine question. Does it hurt something like coding assistance?
If I had a RTX Pro 6000 Blackwell I would just run the original version...
way slower yes. Usable? Yes. Depends how much you have to do with it. Since unified memory is all the rage, it can't be that bad.
But 31b seems to be very slow in general compared with similar dense models. Maybe teething issues..
Finally, I discovered the unofficial NVFP4 quant (23GB). Tested it out; it works pretty well on 1x RTX 5090 with vLLM.
But still, I can't understand why NVIDIA won't release their official NVFP4 (4-bit!) quant of this astonishing model, so that we could be sure everything is quantized properly. Come on, NVIDIA, the community is waiting. For you it's 2ms of work, for us it's a lifebuoy!
OK, maybe for a single user you could call it borderline usable, but if performance drops that much, it makes little sense to use it with something like a 5090 (since it would spend most of its time swapping to system RAM). In that case I think most people would be better off with smaller quants that fit in VRAM (and also leave enough space for the KV cache), e.g. cyankiwi/gemma-4-31B-it-AWQ-4bit or other decent 4 or 5 bpw quants. And if you have more than 32GB of VRAM, wouldn't e.g. cyankiwi/gemma-4-31B-it-AWQ-8bit be better quality?
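A rough back-of-the-envelope fit check illustrates the point. The bits-per-weight figures and KV-cache budget below are illustrative assumptions, not measured values for any specific quant:

```python
# Rough fit check: does a quantized 31B-parameter dense model plus its
# KV cache fit in a 32 GiB card? Bits-per-weight and KV budget are
# illustrative assumptions.
PARAMS = 31e9
GIB = 1024**3

def fits(bits_per_weight: float, kv_gib: float, vram_gib: float = 32.0) -> bool:
    weights_gib = PARAMS * bits_per_weight / 8 / GIB
    return weights_gib + kv_gib <= vram_gib

print(fits(4.5, kv_gib=6.0))  # ~4-bit quant with scale overhead: True
print(fits(8.5, kv_gib=6.0))  # ~8-bit quant: False on a 32 GiB card
```

Under these assumptions, a ~4.5 bpw quant leaves comfortable KV-cache headroom on 32 GiB, while an ~8.5 bpw quant does not.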
NVFP4 with VRAM usage like FP8? NVFP4 is a big scam.
Guys, check out this improved version of turboquant https://github.com/scrya-com/rotorquant; it significantly improves prefill speed, which was the main issue with turboquant. With this we should be able to run big models more easily in limited VRAM.
Just released a turbo variant (18.5GB GPU memory) which fits in a single RTX 5090.
Very Interesting, but honestly it all just sounds too good to be true. Where is the catch?
It fits on a 5090 but context tops out around 25K. For longer contexts you'd still need a bigger GPU like a PRO 6000.
But it's still 40% smaller while preserving FP4 tensor core support, so faster and cheaper to run.
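The ~25K context ceiling follows from per-token KV-cache cost. The model dimensions below (layers, KV heads, head dim) are illustrative assumptions, not the real Gemma 4 31B config:

```python
# Why context tops out on a 32 GiB card: per-token KV-cache cost.
# LAYERS / KV_HEADS / HEAD_DIM are assumed values for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
BYTES = 2  # FP16 entries for both K and V

def kv_bytes_per_token() -> int:
    # x2 because each layer stores both a K and a V vector per KV head
    return LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES

def max_context(free_vram_gib: float) -> int:
    return int(free_vram_gib * 1024**3 // kv_bytes_per_token())

# e.g. 18.5 GiB of weights on a 32 GiB card leaves only a few GiB
# for the cache after activations and overhead:
print(kv_bytes_per_token())   # 196608 bytes (~192 KiB per token)
print(max_context(6.0))       # 32768 tokens on a ~6 GiB cache budget
```

With only a few GiB left over after the weights, the usable context lands in the tens of thousands of tokens, which matches the reported ~25K ceiling in order of magnitude.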
Obviously, this model wasn't trained with FP4 or NVFP4. Its size is half that of the original FP16 model. If it had been trained or compressed to FP4, it shouldn't weigh more than 20GB. This looks like FP8 compression, not NVFP4.
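The arithmetic backs this up. A quick sketch of expected checkpoint sizes for a 31B-parameter dense model at each precision (the NVFP4 scale overhead of one FP8 scale per 16-value block is an assumption for illustration; real checkpoints also keep embeddings and norms at higher precision):

```python
# Rough checkpoint-size estimate for a 31B-parameter dense model at
# different precisions. NVFP4 overhead (one FP8 scale per 16-value
# block) is an illustrative assumption.
PARAMS = 31e9
GIB = 1024**3

def size_gib(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / GIB

print(f"FP16 : {size_gib(16):5.1f} GiB")        # ~57.7 GiB
print(f"FP8  : {size_gib(8):5.1f} GiB")         # ~28.9 GiB
print(f"NVFP4: {size_gib(4 + 8 / 16):5.1f} GiB")  # ~16.2 GiB
```

A 32.7 GB file sits squarely in FP8 territory, while a true NVFP4 quant should land under 20 GB even with scale overhead and some higher-precision layers.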
Exactly, that's why I further quantized it to 18.4 GB.
https://huggingface.co/LilaRest/gemma-4-31B-it-NVFP4-turbo
I've tested several NVFP4 quants. Two points: a) I have an RTX 6000 and also a 5090, but I prefer Qwen 122B over this for my RTX 6000 Pro. b) All the quants that work on the RTX 5090 produce weird output like '.B'. Not always, but when that happens the model is a no-go for me. This may explain NVIDIA's 32GB quant, since quants like LilaRest's all produce it (perhaps a vLLM bug, I'm not sure, or NVIDIA realized this and kept some layers at the original weights).
I rented an RTX 6000 Pro Blackwell to test it. Same issues as with the other NVFP4 quants. And as I've mentioned multiple times in other threads, the original weights of Gemma 4 31B are broken. Infinite loops are very common.