DeepSeek-V4-Flash native FP4 / FP8 GGUF

Native, 1:1 conversion of deepseek-ai/DeepSeek-V4-Flash from the original safetensors into a single GGUF file that preserves the model's native low-precision weights:

  • Dense weights: FP8 E4M3 (F8_E4M3_B128, 128-element blocks with one E8M0 scale)
  • MoE expert weights: MXFP4 (E2M1 codes in 32-element blocks with one shared E8M0 scale)

This file is not derived from a higher-precision intermediate; the FP4 and FP8 codes from the upstream checkpoint are written directly into the GGUF.
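For reference, the arithmetic behind the two element formats is small enough to sketch in full. The snippet below only illustrates the decode math implied by the bullets above; the actual on-disk packing used by the WIP branch (nibble order, scale placement) is not documented here, and the block layouts shown are assumptions.

```python
import numpy as np

def decode_e8m0(byte: int) -> float:
    """E8M0 shared scale: a pure power of two, exponent bias 127 (no sign, no mantissa)."""
    return 2.0 ** (int(byte) - 127)

def decode_e4m3(byte: int) -> float:
    """FP8 E4M3 (FN variant): 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp  = (byte >> 3) & 0x0F
    man  =  byte       & 0x07
    if exp == 0x0F and man == 0x07:          # the only NaN encoding in E4M3FN
        return float("nan")
    if exp == 0:                             # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def decode_e2m1(nibble: int) -> float:
    """FP4 E2M1 (the MXFP4 element type): 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    sign = -1.0 if nibble & 0x8 else 1.0
    exp  = (nibble >> 1) & 0x3
    man  =  nibble       & 0x1
    if exp == 0:                             # subnormal: 0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + man * 0.5) * 2.0 ** (exp - 1)

def dequant_f8_e4m3_b128(codes: bytes, scale_byte: int) -> np.ndarray:
    """One dense block as described above: 128 E4M3 codes sharing a single E8M0 scale."""
    s = decode_e8m0(scale_byte)
    return np.array([decode_e4m3(c) * s for c in codes], dtype=np.float32)

def dequant_mxfp4_block(codes: list[int], scale_byte: int) -> np.ndarray:
    """One MXFP4 block: 32 E2M1 codes sharing a single E8M0 scale."""
    s = decode_e8m0(scale_byte)
    return np.array([decode_e2m1(c) * s for c in codes], dtype=np.float32)
```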

File

File                                    Size      Quant
DeepSeek-V4-Flash-FP4-FP8-native.gguf   ~146 GB   F8_E4M3 + MXFP4
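As a back-of-envelope sanity check on that size (treating ~146 GB as decimal gigabytes, which is an assumption), the stored bits per parameter land near the MXFP4 rate, consistent with the MoE experts dominating the 284B-parameter count (see Provenance):

```python
params     = 284e9                  # total parameter count for the checkpoint
file_bytes = 146e9                  # ~146 GB, read as decimal gigabytes (assumption)

print(f"{file_bytes * 8 / params:.2f} bits/param")   # ~4.11

# Storage cost per weight of the two formats, including the shared E8M0 scales:
mxfp4_bits   = 4 + 8 / 32           # 4.25 bits/weight for MoE expert tensors
f8_b128_bits = 8 + 8 / 128          # ~8.06 bits/weight for dense tensors
```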

Loading

This GGUF requires a llama.cpp build with native F8_E4M3_B128 and MXFP4 support and the DeepSeek V4 Flash architecture. Stock upstream llama.cpp cannot load this file.

Reference (WIP) build that can both produce and run this GGUF:

https://github.com/nisparks/llama.cpp/tree/wip/deepseek-v4-support

That branch adds:

  • GGML_TYPE_F8_E4M3_B128 (ggml type 42)
  • LLAMA_FTYPE_MOSTLY_F8_E4M3_MXFP4 (ftype 41, exposed as F8_E4M3_MXFP4 / moe-f8-e4m3-mxfp4)
  • CUDA dequant / MMVQ kernels for F8_E4M3_B128
  • Loader / converter / gguf-py support
  • Custom DeepSeek V4 Flash model graph

The branch is an active WIP; expect rough edges.
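As a quick sanity check that a given build actually recognizes the custom tensor types, the file's type mix can be inspected with the gguf-py shipped in that branch. This is a sketch, not part of the branch itself; the enum spelling F8_E4M3_B128 mirrors the GGML type name above but the exact gguf-py name is an assumption.

```python
from collections import Counter
from gguf import GGUFReader   # must be the gguf-py from the WIP branch, not upstream

reader = GGUFReader("DeepSeek-V4-Flash-FP4-FP8-native.gguf")

# Tally the ggml tensor types stored in the file; dense weights should report
# F8_E4M3_B128 and the MoE expert weights MXFP4.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for type_name, n in counts.most_common():
    print(f"{type_name:>16}: {n} tensors")
```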

Notes

  • DeepSeek V4 Flash is a custom architecture (MoE + sliding-window attention + compressor + indexer). The runtime in the reference branch implements that graph as a custom model path.
  • To match the reference activation behavior, the runtime also applies HF's blockwise FP8 / FP4 fake activation quantization to the attention KV and the indexer Q/KV after the Hadamard rotation (sketched below).
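The exact recipe the HF reference uses (block shape, rounding mode, where the Hadamard rotation sits) is not reproduced here, but the general shape of a blockwise E4M3 fake-activation-quant step looks like the following sketch. The block size of 128 and the amax-to-448 scaling are assumptions, and the Hadamard rotation itself is not shown.

```python
import torch

def blockwise_fp8_fake_quant(x: torch.Tensor, block: int = 128) -> torch.Tensor:
    """Quantize-then-dequantize x along its last dim in `block`-sized groups.
    Assumes the last dimension is divisible by `block`."""
    shape = x.shape
    xb = x.reshape(-1, block).float()
    amax  = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / 448.0                        # 448 = largest normal E4M3 value
    xq = (xb / scale).to(torch.float8_e4m3fn)   # round onto the E4M3 grid
    return (xq.float() * scale).reshape(shape).to(x.dtype)
```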

Provenance

Source checkpoint: deepseek-ai/DeepSeek-V4-Flash (original safetensors release, 284B parameters). GGUF architecture string: deepseek4.
