license: apache-2.0
language:
  - en
  - zh
  - vi
base_model:
  - deepseek-ai/DeepSeek-V4-Flash
tags:
  - deepseek
  - deepseek4
  - deepseekpro
  - llm
  - quantization
  - gguf
  - llama.cpp
  - inference-optimization

DeepSeekV4Flash Quantization Repository

This repository provides scripts and guidelines for quantizing the DeepSeek V4 Flash model to reduce its size and speed up inference.

🚀 Purpose

Reduce model size (BF16 → Q3/Q4/Q5/Q8, etc.)
Improve inference speed
Enable deployment on limited GPU/CPU resources
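
To make the size reduction concrete, here is a rough sketch of the arithmetic for the simplest ggml block formats, where each block stores 32 weights plus one 16-bit scale. The parameter count in the example is a hypothetical placeholder, not the real figure for DeepSeek-V4-Flash, and K-quant types such as Q4_K_M mix several block types, so their averages will differ from these numbers.

```python
# Bits per weight implied by the basic ggml block layouts: each block holds
# 32 weights plus one 16-bit scale (Q5_0 packs 5 bits per weight in total).
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": (32 * 8 + 16) / 32,  # 8.5
    "Q5_0": (32 * 5 + 16) / 32,  # 5.5
    "Q4_0": (32 * 4 + 16) / 32,  # 4.5
}


def estimated_size_gb(n_params: float, quant: str) -> float:
    """Rough weight-only size in GB (10**9 bytes) for n_params parameters."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    n_params = 10e9  # hypothetical parameter count, NOT the real model's
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimated_size_gb(n_params, quant):.1f} GB")
```

The estimate ignores file metadata and any tensors kept at higher precision, so treat it as a lower bound when planning disk and memory.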

🌍 Languages

English (en)
Vietnamese (vi)

🧠 Base Model

DeepSeek-V4-Flash (https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)

📦 Contents

Model conversion and quantization scripts
Usage examples for llama.cpp / GGUF workflows
Common quantization configurations

🛠️ Requirements

Python >= 3.12
A recent build of llama.cpp (with GGUF support)
Hugging Face Transformers (when converting from Hugging Face format)
Sufficient RAM/VRAM depending on model size
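
A small preflight check can catch missing requirements before a long conversion starts. The sketch below checks only the Python version and whether `llama-quantize` is on `PATH`; the `PATH` lookup is an assumption, since the usage example in this README runs the binary from the current directory instead.

```python
import shutil
import sys

MIN_PYTHON = (3, 12)  # minimum version from the requirements list


def python_ok(version=None, minimum=MIN_PYTHON) -> bool:
    """True if the given (or current) interpreter version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


def find_tool(name: str = "llama-quantize"):
    """Return the tool's full path if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("llama-quantize:", find_tool() or "not found on PATH")
```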

⚙️ Example Usage

python convert_hf_to_gguf.py \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --outfile models/DeepSeekV4Flash.gguf

./llama-quantize models/DeepSeekV4Flash.gguf models/DeepSeekV4Flash-Q4_K_M.gguf Q4_K_M
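
The two steps can also be chained from Python. This is a minimal sketch, not the repository's own tooling: the function names and the quantized output filename are hypothetical, and the `--model` flag is copied from the command above rather than verified against any particular version of the conversion script.

```python
import subprocess
from pathlib import Path


def build_commands(model_id: str, out_dir: str, quant: str = "Q4_K_M"):
    """Return the convert and quantize commands as argument lists."""
    out = Path(out_dir)
    full = out / "DeepSeekV4Flash.gguf"
    quantized = out / f"DeepSeekV4Flash-{quant}.gguf"  # hypothetical name
    convert = [
        "python", "convert_hf_to_gguf.py",
        "--model", model_id,  # flag as written in this README
        "--outfile", str(full),
    ]
    # llama-quantize takes: input file, optional output file, then the type.
    quantize = ["./llama-quantize", str(full), str(quantized), quant]
    return convert, quantize


def run_pipeline(model_id: str, out_dir: str, quant: str = "Q4_K_M") -> None:
    """Run both steps in order, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(model_id, out_dir, quant):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("deepseek-ai/DeepSeek-V4-Flash", "models")` would run the conversion and then the quantization. Passing argument lists to `subprocess.run` avoids shell quoting problems, and `check=True` prevents quantizing a file that failed to convert.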

📌 Notes

Quantization may require significant system memory depending on model size
Some quantization formats may not be compatible with all runtimes or versions
Always validate output quality after quantization
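
Before evaluating quality, it can help to confirm that the output is a structurally valid GGUF file. The sketch below reads only the fixed-size header and assumes a little-endian file in GGUF version 2 or later, where both counts are 64-bit. It is a structural check only and says nothing about output quality.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header: magic, version, and two counts."""
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count.
    version, n_tensors, n_kv = struct.unpack("<IQQ", head[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

Comparing the tensor count of the quantized file with that of the unquantized one is a quick way to spot a truncated write. Actual quality validation still requires running the model, for example by comparing perplexity on a held-out text with llama.cpp's `llama-perplexity` tool.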

👤 Author

Email: tecaprovn@gmail.com
Telegram: https://t.me/tamndx

📄 License

This repository follows the original DeepSeek model license.

Base model: Apache 2.0 (DeepSeek)
This repository contains conversion scripts only; the model weights are not modified.