Is it possible to apply --kv-cache-dtype fp8?

#6
by magga77 - opened

Running on a DGX Spark, set up as per AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4.

I think you would have to forgo the dflash sliding attention to do this and use MTP instead. For that, you would want one of the NVFP4-MTP tagged models from this collection.
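For reference, here is a minimal sketch of how the flag from the question could be attached to a launch command. It assumes the standard `vllm serve` entry point; the helper name `build_serve_command` and the model ID are hypothetical placeholders, and any MTP-specific settings are left out because they should come from the chosen model's card.

```python
import shlex


def build_serve_command(model: str, kv_cache_dtype: str = "fp8") -> list[str]:
    """Build an argv list for `vllm serve` with an explicit KV cache dtype.

    `model` is whatever NVFP4-MTP model ID you picked; nothing here checks
    that the model or the attention backend actually supports the dtype.
    """
    return ["vllm", "serve", model, "--kv-cache-dtype", kv_cache_dtype]


# Hypothetical placeholder ID, not a real repository name.
cmd = build_serve_command("your-org/your-NVFP4-MTP-model")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd)` directly. Whether the server accepts the dtype still depends on the model and backend, as noted above.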

Noted. Thanks.

If you want to go further and use an NVFP4 KV cache for even faster speed and a 3x capacity gain, you can use this model: https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS
Combine it with my custom bake of vLLM for the DGX Spark, which enables a lot of future features that I have patched all the bugs out of. It is located here: https://github.com/aeon-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4/pkgs/container/aeon-vllm-ultimate
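As a rough sanity check on the capacity figure, here is a back-of-the-envelope sketch of storage cost per cached element. It assumes the usual NVFP4 layout of 4-bit values with one 8-bit scale shared by each block of 16 elements, and it ignores per-tensor scales and any allocator overhead. How this particular vLLM build actually lays out its KV cache is an assumption, not something confirmed in this thread.

```python
# Approximate storage cost per cached element, in bits.
BITS_PER_ELEMENT = {
    "bf16": 16.0,
    "fp8": 8.0,
    # Assumed layout: 4-bit value plus an 8-bit scale shared by 16 elements.
    "nvfp4": 4.0 + 8.0 / 16.0,
}


def capacity_gain(baseline: str, target: str) -> float:
    """How many times more tokens fit in the same memory with `target`."""
    return BITS_PER_ELEMENT[baseline] / BITS_PER_ELEMENT[target]


for baseline in ("bf16", "fp8"):
    print(f"nvfp4 vs {baseline}: {capacity_gain(baseline, 'nvfp4'):.2f}x")
```

Under these assumptions the ratio is about 3.56x against a 16-bit cache and about 1.78x against fp8, so a gain of roughly 3x is plausible when the baseline is a 16-bit cache.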

AEON-7 changed discussion status to closed
