---
license: apache-2.0
language:
  - en
  - zh
  - vi
base_model:
  - deepseek-ai/DeepSeek-V4-Flash
tags:
  - deepseek
  - deepseek4
  - deepseekpro
  - llm
  - quantization
  - gguf
  - llama.cpp
  - inference-optimization
---
  - inference-optimization

# DeepSeek V4 Flash Quantization Repository


This repository provides scripts and guidelines for quantizing the DeepSeek V4 Flash model, enabling reduced model size and optimized inference performance.



## 🚀 Purpose

- Reduce model size (BF16 → Q3/Q4/Q5/Q8, etc.)
- Improve inference speed
- Enable deployment on limited GPU/CPU resources
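As a rough illustration of the size reduction, the sketch below estimates on-disk size from parameter count and bits-per-weight. The bits-per-weight figures are approximations for typical GGUF types, and the 16B parameter count is a hypothetical example, not the actual size of DeepSeek V4 Flash.

```python
# Approximate bits-per-weight for common GGUF quantization types.
# These are rough averages (k-quants mix block sizes), not exact values.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}

def est_size_gb(n_params: float, quant: str) -> float:
    """Approximate on-disk model size in GB (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# Example: a hypothetical 16B-parameter model
for q in ("BF16", "Q8_0", "Q4_K_M"):
    print(f"{q}: ~{est_size_gb(16e9, q):.1f} GB")
```

For a hypothetical 16B model this works out to roughly 32 GB at BF16 versus under 10 GB at Q4_K_M, which is why quantization makes consumer-GPU deployment feasible.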

๐ŸŒ Languages

  • English (en)
  • Vietnamese (vi)

## 🧠 Base Model

- deepseek-ai/DeepSeek-V4-Flash

## 📦 Contents

- Model conversion and quantization scripts
- Usage examples for llama.cpp / GGUF workflows
- Common quantization configurations

๐Ÿ› ๏ธ Requirements

  • Python >= 3.12
  • Latest version of llama.cpp (with GGUF support)
  • HuggingFace Transformers (if converting from HF format)
  • Sufficient RAM/VRAM depending on model size
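A minimal setup sketch for the requirements above, assuming a Unix-like environment with `git`, `cmake`, and `pip` available. Paths and flags follow the upstream llama.cpp build instructions at the time of writing and may change.

```shell
# Build llama.cpp from source (provides llama-quantize, llama-cli, etc.)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Python dependencies for the HF -> GGUF conversion script
pip install -r requirements.txt
```

After the build, the tools used below live under `build/bin/`.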

โš™๏ธ Example Usage

python convert_hf_to_gguf.py   --model deepseek-ai/DeepSeek-V4-Flash   --outfile models/DeepSeekV4Flash.gguf

./llama-quantize models/DeepSeekV4Flash.gguf Q4_K_M
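A quick smoke test of the quantized model with `llama-cli`, assuming a CMake build of llama.cpp (binaries under `build/bin/`) and that the quantized file was written to `models/DeepSeekV4Flash-Q4_K_M.gguf`; adjust the path to match your actual output file.

```shell
# Generate a short completion to confirm the quantized model loads and runs
./build/bin/llama-cli \
  -m models/DeepSeekV4Flash-Q4_K_M.gguf \
  -p "Hello" \
  -n 32
```

If the model loads but output is garbled, try a higher-precision quant type (e.g. Q5_K_M or Q8_0) before assuming a conversion problem.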

## 📌 Notes

- Quantization may require significant system memory depending on model size
- Some quantization formats may not be compatible with all runtimes or versions
- Always validate output quality after quantization
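One common way to validate output quality is to compare perplexity before and after quantization with llama.cpp's `llama-perplexity` tool. This sketch assumes a CMake build (binaries under `build/bin/`) and uses `test.txt` as a placeholder for a held-out evaluation text; a small quality loss is expected, but a large jump in perplexity usually indicates the quant type is too aggressive.

```shell
# Perplexity of the unquantized GGUF on a held-out text file
./build/bin/llama-perplexity -m models/DeepSeekV4Flash.gguf -f test.txt

# Perplexity of the quantized model on the same file, for comparison
./build/bin/llama-perplexity -m models/DeepSeekV4Flash-Q4_K_M.gguf -f test.txt
```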

## 👤 Author

## 📄 License

This repository follows the original DeepSeek model license.

- Base model: Apache 2.0 (DeepSeek)
- This repository contains only conversion scripts; no model weights are modified or redistributed.