Latest Revision Won't Load on 2x3090

#6
by poita66 - opened

Hi, thanks for maintaining this quant! I'm a big fan!

I had trouble today loading the model until I realised vLLM had pulled a new revision.

Claude Code said this of the change:

> The recent update (c6ad728) changed the vision encoder from quantized to bf16 and switched linear_attn layers from bf16 to int8. This makes the model ~784 MiB larger during profiling, which causes OOM on 2x RTX 3090 (24GB) with vLLM 0.18 + -O3. The previous revision (6e1e121) works fine.

I really like the previous revision, it's been working well for me!
Maybe the new revision would be better as a separate repo (like the BF16-4bit one)?

Cheers!

cyankiwi org

Thank you for letting me know. Claude Code is partially correct: the recent update moved the linear attention layers from int4 to int8, but both versions keep the vision layers at bf16. And yes, the new update should have been a separate int8-int4 repo.

Regarding vLLM, you can pin the previous revision with the `--revision` flag:

```
vllm serve cyankiwi/Qwen3.5-27B-AWQ-4bit --revision 6e1e121f40def5d8c9940bdd0b47684db115fe0b
```

or download the previous revision directly:

```
hf download cyankiwi/Qwen3.5-27B-AWQ-4bit --revision 6e1e121f40def5d8c9940bdd0b47684db115fe0b --local-dir Qwen3.5-27B-AWQ-4bit
```

:)

Thanks for the reply. I managed to get the previous revision up and running again.

Do you believe the int4 linear attention layers are problematic enough to warrant dropping my max context length to fit the int8 version?

cyankiwi org

Thank you for the suggestion of a separate repo. I have reuploaded the int8 linear attention model as cyankiwi/Qwen3.5-27B-AWQ-INT8-INT4, and reverted this repo to the int4 linear attention model.

You're right. Moving the linear attention layers from int4 to int8 reduces KLD by 10-20%, but it also increases the model size by roughly that same amount.
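As a rough back-of-the-envelope sketch of why doubling the bits on only the linear-attention layers grows the whole model by a modest percentage (all parameter splits below are hypothetical, not the actual layer breakdown of this repo):

```python
def model_size_gib(params_other_b, params_linear_attn_b, linear_attn_bits):
    """Approximate weight size in GiB.

    Regular layers are assumed fixed at 4-bit (0.5 bytes/param);
    only the linear-attention layers change precision.
    """
    GIB = 2**30
    other_bytes = params_other_b * 1e9 * 0.5
    linear_bytes = params_linear_attn_b * 1e9 * (linear_attn_bits / 8)
    return (other_bytes + linear_bytes) / GIB

# Hypothetical split: 23B params in regular layers, 4B in linear attention
int4 = model_size_gib(23, 4, 4)
int8 = model_size_gib(23, 4, 8)
print(f"int4: {int4:.1f} GiB, int8: {int8:.1f} GiB, growth: {int8/int4 - 1:.0%}")
# → int4: 12.6 GiB, int8: 14.4 GiB, growth: 15%
```

Even though the linear-attention weights themselves double in size, they are only a fraction of the total, so the overall model grows by a figure in the 10-20% range quoted above.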
