DeepSeek-V4-Flash native FP4 / FP8 GGUF

Native, 1:1 conversion of deepseek-ai/DeepSeek-V4-Flash from the original safetensors into a single GGUF file that preserves the model's native low-precision weights:

  • Dense weights: FP8 E4M3 (F8_E4M3_B128, 128-element blocks with one E8M0 scale)
  • MoE expert weights: MXFP4 (E2M1 codes in 32-element blocks with one shared E8M0 scale)

This file is not derived from a higher-precision intermediate; the FP4 and FP8 codes from the upstream checkpoint are written directly into the GGUF.
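For reference, the arithmetic behind the two element formats is small enough to sketch in full. The snippet below only illustrates the decode math implied by the bullets above; the actual on-disk packing used by the WIP branch (nibble order, scale placement) is not documented here, and the block layouts shown are assumptions.

```python
import numpy as np

def decode_e8m0(byte: int) -> float:
    """E8M0 shared scale: a pure power of two, exponent bias 127 (no sign, no mantissa)."""
    return 2.0 ** (int(byte) - 127)

def decode_e4m3(byte: int) -> float:
    """FP8 E4M3 (FN variant): 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp  = (byte >> 3) & 0x0F
    man  =  byte       & 0x07
    if exp == 0x0F and man == 0x07:          # the only NaN encoding in E4M3FN
        return float("nan")
    if exp == 0:                             # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def decode_e2m1(nibble: int) -> float:
    """FP4 E2M1 (the MXFP4 element type): 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    sign = -1.0 if nibble & 0x8 else 1.0
    exp  = (nibble >> 1) & 0x3
    man  =  nibble       & 0x1
    if exp == 0:                             # subnormal: 0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + man * 0.5) * 2.0 ** (exp - 1)

def dequant_f8_e4m3_b128(codes: bytes, scale_byte: int) -> np.ndarray:
    """One dense block as described above: 128 E4M3 codes sharing a single E8M0 scale."""
    s = decode_e8m0(scale_byte)
    return np.array([decode_e4m3(c) * s for c in codes], dtype=np.float32)

def dequant_mxfp4_block(codes: list[int], scale_byte: int) -> np.ndarray:
    """One MXFP4 block: 32 E2M1 codes sharing a single E8M0 scale."""
    s = decode_e8m0(scale_byte)
    return np.array([decode_e2m1(c) * s for c in codes], dtype=np.float32)
```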

File

File                                    Size      Quant
DeepSeek-V4-Flash-FP4-FP8-native.gguf   ~146 GB   F8_E4M3 + MXFP4
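As a back-of-envelope sanity check on that size (treating ~146 GB as decimal gigabytes, which is an assumption), the stored bits per parameter land near the MXFP4 rate, consistent with the MoE experts dominating the 284B-parameter count (see Provenance):

```python
params     = 284e9                  # total parameter count for the checkpoint
file_bytes = 146e9                  # ~146 GB, read as decimal gigabytes (assumption)

print(f"{file_bytes * 8 / params:.2f} bits/param")   # ~4.11

# Storage cost per weight of the two formats, including the shared E8M0 scales:
mxfp4_bits   = 4 + 8 / 32           # 4.25 bits/weight for MoE expert tensors
f8_b128_bits = 8 + 8 / 128          # ~8.06 bits/weight for dense tensors
```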

Loading

This GGUF requires a llama.cpp build with native F8_E4M3_B128 and MXFP4 support and the DeepSeek V4 Flash architecture. Stock upstream llama.cpp cannot load this file.

Reference (WIP) build that can both produce and run this GGUF:

https://github.com/nisparks/llama.cpp/tree/wip/deepseek-v4-support

That branch adds:

  • GGML_TYPE_F8_E4M3_B128 (ggml type 42)
  • LLAMA_FTYPE_MOSTLY_F8_E4M3_MXFP4 (ftype 41, exposed as F8_E4M3_MXFP4 / moe-f8-e4m3-mxfp4)
  • CUDA dequant / MMVQ kernels for F8_E4M3_B128
  • Loader / converter / gguf-py support
  • Custom DeepSeek V4 Flash model graph

The branch is an active WIP; expect rough edges.
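As a quick sanity check that a given build actually recognizes the custom tensor types, the file's type mix can be inspected with the gguf-py shipped in that branch. This is a sketch, not part of the branch itself; the enum spelling F8_E4M3_B128 mirrors the GGML type name above but the exact gguf-py name is an assumption.

```python
from collections import Counter
from gguf import GGUFReader   # must be the gguf-py from the WIP branch, not upstream

reader = GGUFReader("DeepSeek-V4-Flash-FP4-FP8-native.gguf")

# Tally the ggml tensor types stored in the file; dense weights should report
# F8_E4M3_B128 and the MoE expert weights MXFP4.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for type_name, n in counts.most_common():
    print(f"{type_name:>16}: {n} tensors")
```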

Notes

  • DeepSeek V4 Flash is a custom architecture (MoE + sliding-window attention + compressor + indexer). The runtime in the reference branch implements that graph as a custom model path.
  • To match the reference activation behavior, the runtime also applies HF's blockwise FP8 / FP4 fake activation quantization to the attention KV and the indexer Q/KV after the Hadamard rotation (sketched below).
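The exact recipe the HF reference uses (block shape, rounding mode, where the Hadamard rotation sits) is not reproduced here, but the general shape of a blockwise E4M3 fake-activation-quant step looks like the following sketch. The block size of 128 and the amax-to-448 scaling are assumptions, and the Hadamard rotation itself is not shown.

```python
import torch

def blockwise_fp8_fake_quant(x: torch.Tensor, block: int = 128) -> torch.Tensor:
    """Quantize-then-dequantize x along its last dim in `block`-sized groups.
    Assumes the last dimension is divisible by `block`."""
    shape = x.shape
    xb = x.reshape(-1, block).float()
    amax  = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / 448.0                        # 448 = largest normal E4M3 value
    xq = (xb / scale).to(torch.float8_e4m3fn)   # round onto the E4M3 grid
    return (xq.float() * scale).reshape(shape).to(x.dtype)
```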

Provenance

Source checkpoint: deepseek-ai/DeepSeek-V4-Flash (original safetensors release, 284B parameters). GGUF architecture string: deepseek4.
