# DeepSeek-V4-Flash native FP4 / FP8 GGUF
Native, 1:1 conversion of deepseek-ai/DeepSeek-V4-Flash from the original
safetensors into a single GGUF file that preserves the model's native
low-precision weights:
- Dense weights: FP8 E4M3 (`F8_E4M3_B128`, 128-element blocks with one E8M0 scale)
- MoE expert weights: MXFP4 (`MXFP4`)
This file is not derived from a higher-precision intermediate; the FP4 and FP8 codes from the upstream checkpoint are written directly into the GGUF.
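For a concrete sense of what these block formats encode, the sketch below decodes both element types and the shared scale in plain Python. It is only an illustration of the arithmetic; the actual on-disk struct layout and kernels in the reference branch are not shown here, and the "raw codes plus one trailing scale byte" framing is an assumption.

```python
# Illustrative decode of the two block formats (struct layout assumed, not the
# real ggml layout): each block stores raw element codes plus one shared E8M0 scale.

FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 values

def e8m0_scale(code: int) -> float:
    """E8M0: 8 exponent bits, no mantissa -> power-of-two scale (code 255 is NaN)."""
    return 2.0 ** (code - 127)

def fp8_e4m3(code: int) -> float:
    """FP8 E4M3(FN): 1 sign, 4 exponent, 3 mantissa bits, bias 7 (0x7F/0xFF are NaN)."""
    sign = -1.0 if code & 0x80 else 1.0
    exp, man = (code >> 3) & 0xF, code & 0x7
    if exp == 0:                                   # subnormal range
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def dequant_f8_block(codes: bytes, scale_code: int) -> list[float]:
    """128 FP8 E4M3 codes sharing one E8M0 scale (F8_E4M3_B128-style block)."""
    s = e8m0_scale(scale_code)
    return [fp8_e4m3(c) * s for c in codes]

def dequant_mxfp4_block(nibbles: list[int], scale_code: int) -> list[float]:
    """32 FP4 E2M1 codes sharing one E8M0 scale (MXFP4 block)."""
    s = e8m0_scale(scale_code)
    return [(-1.0 if n & 0x8 else 1.0) * FP4_E2M1[n & 0x7] * s for n in nibbles]
```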
## File
| File | Size | Quant |
|---|---|---|
| DeepSeek-V4-Flash-FP4-FP8-native.gguf | ~146 GB | F8_E4M3 + MXFP4 |
## Loading
This GGUF requires a llama.cpp build with native F8_E4M3_B128 and MXFP4
support and the DeepSeek V4 Flash architecture. Stock upstream llama.cpp
cannot load this file.
Reference (WIP) build that can both produce and run this GGUF:
https://github.com/nisparks/llama.cpp/tree/wip/deepseek-v4-support
That branch adds:
- `GGML_TYPE_F8_E4M3_B128` (ggml type 42)
- `LLAMA_FTYPE_MOSTLY_F8_E4M3_MXFP4` (ftype 41, exposed as `F8_E4M3_MXFP4` / `moe-f8-e4m3-mxfp4`)
- CUDA dequant / MMVQ kernels for `F8_E4M3_B128`
- Loader / converter / `gguf-py` support
- Custom DeepSeek V4 Flash model graph
The branch is an active WIP; expect rough edges.
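Before running inference, you can sanity-check the file's metadata and tensor types with gguf-py. The snippet below is a sketch using gguf-py's `GGUFReader`; run it with the gguf-py from the reference branch, since stock gguf-py does not know the custom `F8_E4M3_B128` tensor type.

```python
# Sketch: inspect per-tensor quant types in the GGUF with gguf-py.
# Use the gguf-py shipped in the wip/deepseek-v4-support branch.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("DeepSeek-V4-Flash-FP4-FP8-native.gguf")

# Dense weights should report F8_E4M3_B128, MoE expert weights MXFP4,
# with a handful of small tensors (norms, embeddings) in higher precision.
types = Counter(t.tensor_type.name for t in reader.tensors)
for name, count in types.most_common():
    print(f"{name:>16}: {count} tensors")
```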
## Notes
- DeepSeek V4 Flash is a custom architecture (MoE + sliding-window attention + compressor + indexer). The runtime in the reference branch implements that graph as a custom model path.
- To match the HF activation behavior, the runtime also applies the reference blockwise FP8 / FP4 fake activation quantization to the attention KV and the indexer Q/KV after the Hadamard rotation (see the sketch below).
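To make the second note concrete, here is a minimal PyTorch sketch of blockwise FP8 fake activation quantization: values are rounded to the FP8 E4M3 grid per block and immediately dequantized, so downstream ops see the quantization error while staying in high precision. This is illustrative only; it is not the HF modeling code, the 128-element block size and the 448 clamp bound are assumptions, and the FP4 path (analogous, with an E2M1 grid) is not shown.

```python
# Sketch of blockwise FP8 "fake" activation quantization (quantize-dequantize).
# Assumptions: 128-element blocks, per-block amax scaling to the E4M3 max of 448.
import torch

FP8_MAX = 448.0  # largest finite value in float8_e4m3fn

def fake_quant_fp8_blockwise(x: torch.Tensor, block: int = 128) -> torch.Tensor:
    orig_shape = x.shape
    xb = x.reshape(-1, block)
    # One scale per block, chosen so the block's abs-max maps to FP8_MAX.
    scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (xb / scale).to(torch.float8_e4m3fn)       # round to the FP8 grid
    return (q.to(x.dtype) * scale).reshape(orig_shape)

# Example: a [seq, head_dim] activation after the Hadamard rotation.
act = torch.randn(4, 128, dtype=torch.bfloat16)
print((act - fake_quant_fp8_blockwise(act)).abs().max())
```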
## Provenance
- Upstream model: deepseek-ai/DeepSeek-V4-Flash
- Conversion command (run from https://github.com/nisparks/llama.cpp/tree/wip/deepseek-v4-support):

  ```bash
  python3 convert_hf_to_gguf.py /mnt/models/hf/DeepSeek-V4-Flash \
      --outtype moe-f8-e4m3-mxfp4 \
      --torch-threads 96 \
      --outfile DeepSeek-V4-Flash-FP4-FP8-native.gguf
  ```

- License: Inherits the upstream DeepSeek V4 Flash license.