AEON-7 committed on
Commit c2ebb91 · verified · 1 Parent(s): fd7c7da

docs: correct DFlash KV cache compat — needs BF16 in vLLM 0.22.1 (regression vs v3)

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -63,7 +63,7 @@ tags:
  >
  > Quick start + full DFlash recipe: [container README](https://github.com/AEON-7/vllm-ultimate-dgx-spark).
  >
- > ⚠️ **NVFP4 KV + DFlash is not yet compatible on sm_121a in vLLM 0.22.1**: the DFlash drafter uses non-causal attention, which currently has no NVFP4-KV backend on Spark (FLASHINFER requires SM100; TRITON_ATTN is causal-only). For DFlash use `--kv-cache-dtype fp8_e4m3`. NVFP4 KV pairs cleanly with MTP / Eagle / ngram speculators.
+ > ⚠️ **DFlash in vLLM 0.22.1 needs `--kv-cache-dtype auto` (BF16) on Spark.** The DFlash drafter uses non-causal attention. In PR #44389's refactor, neither FLASHINFER nor TRITON_ATTN has a non-causal kernel for FP8 or NVFP4 KV anymore; only FLASH_ATTN works, and it's BF16-only. This is a regression vs the v3 production image (vLLM 0.20.0 had a FLASHINFER non-causal+FP8 path). For FP8-KV-with-DFlash today, **stay on v3**. For NVFP4 KV's ~3× capacity gain, use a causal speculator (MTP / Eagle / ngram) with the MTP-XS body.
 
  ## Variants
 
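
The compatibility rules in the updated warning can be read as a small decision table. The sketch below is one way to encode them, for example in a launch script. It is not part of vLLM or of the container: `pick_kv_cache_dtype` and `CAUSAL_SPECULATORS` are hypothetical names, the `"nvfp4"` string is a placeholder for whatever value the image actually accepts, and pairing `fp8_e4m3` with vLLM 0.20.0 is an assumption taken from the previous wording of the warning.

```python
# Hypothetical helper encoding the README's KV-cache compatibility notes.
# Nothing here is a vLLM API; it only returns a string for --kv-cache-dtype.

CAUSAL_SPECULATORS = {"mtp", "eagle", "ngram"}


def pick_kv_cache_dtype(speculator: str, vllm_version: str) -> str:
    """Suggest a --kv-cache-dtype value for a speculator / vLLM version pair."""
    spec = speculator.lower()

    if spec == "dflash":
        if vllm_version == "0.22.1":
            # Only FLASH_ATTN has a non-causal kernel here, and it is BF16-only.
            return "auto"
        if vllm_version == "0.20.0":
            # v3 image: FLASHINFER non-causal + FP8 path is still present.
            # Assumption: fp8_e4m3 is the FP8 variant to use, per the old warning.
            return "fp8_e4m3"
        raise ValueError(f"no DFlash guidance for vLLM {vllm_version}")

    if spec in CAUSAL_SPECULATORS:
        # Placeholder value; check the container README for the exact flag string.
        return "nvfp4"

    raise ValueError(f"unknown speculator: {speculator}")


if __name__ == "__main__":
    for spec, ver in [("dflash", "0.22.1"), ("dflash", "0.20.0"), ("MTP", "0.22.1")]:
        print(f"{spec} on vLLM {ver}: --kv-cache-dtype {pick_kv_cache_dtype(spec, ver)}")
```

Raising on unknown combinations, rather than falling back to a default, keeps a silent mismatch between speculator and KV cache dtype from reaching the server launch.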