---
license: apache-2.0
language:
- en
- zh
- vi
base_model:
- deepseek-ai/DeepSeek-V4-Flash
tags:
- deepseek
- deepseek4
- deepseekpro
- llm
- quantization
- gguf
- llama.cpp
- inference-optimization
---

# DeepSeekV4Flash Quantization Repository

![v4-benchmark-2](https://cdn-uploads.huggingface.co/production/uploads/671ab90d28ec35263e09152f/WXhyPJ5E8r3B2p0TO8us6.png)

This repository provides scripts and guidelines for quantizing the **DeepSeek V4 Flash** model, reducing model size and improving inference performance.

![v4-efficiency](https://cdn-uploads.huggingface.co/production/uploads/671ab90d28ec35263e09152f/m4HSN3MmYyW2SHZytAbFE.png)

---

## 🚀 Purpose

- Reduce model size (BF16 → Q3/Q4/Q5/Q8, etc.)
- Improve inference speed
- Enable deployment on limited GPU/CPU resources

---

## 🌍 Languages

- English (en)
- Chinese (zh)
- Vietnamese (vi)

---

## 🧠 Base Model

- [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)

---

## 📦 Contents

- Model conversion and quantization scripts
- Usage examples for llama.cpp / GGUF workflows
- Common quantization configurations

---

## 🛠️ Requirements

- Python >= 3.12
- A recent build of llama.cpp (with GGUF support)
- Hugging Face Transformers (if converting from HF format)
- Sufficient RAM/VRAM for the model size you are converting

---

## ⚙️ Example Usage

```bash
# Convert a local Hugging Face checkout of the model to GGUF
# (convert_hf_to_gguf.py takes the model directory as a positional argument)
python convert_hf_to_gguf.py ./DeepSeek-V4-Flash --outfile models/DeepSeekV4Flash-BF16.gguf

# Quantize the GGUF to Q4_K_M
./llama-quantize models/DeepSeekV4Flash-BF16.gguf models/DeepSeekV4Flash-Q4_K_M.gguf Q4_K_M
```

---

## 📌 Notes

- Quantization may require significant system memory depending on model size
- Some quantization formats may not be compatible with all runtimes or versions
- Always validate output quality after quantization (see the sanity-check and perplexity examples at the end of this README)

---

## 👤 Author

- Email: tecaprovn@gmail.com
- Telegram: https://t.me/tamndx

---

## 📄 License

This repository follows the original DeepSeek model license.

- Base model: Apache 2.0 (DeepSeek)
- Only conversion scripts are included; no model weights are modified or redistributed
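
---

## 🔍 Quick Sanity Check

After quantizing, a short generation with the `llama-cli` binary from llama.cpp is a quick way to confirm the model loads and produces coherent text. This is a minimal sketch; the model path, prompt, and GPU layer count are illustrative:

```bash
# Generate a short completion from the quantized model.
# -m:   quantized GGUF from the example above (file name is illustrative)
# -n:   number of tokens to generate
# -ngl: layers to offload to the GPU (only meaningful with a GPU-enabled build)
./llama-cli -m models/DeepSeekV4Flash-Q4_K_M.gguf \
  -p "Explain quantization in one sentence." \
  -n 128 -ngl 99
```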
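
---

## 📊 Validating Quantization Quality

To put a number on quality loss, llama.cpp ships a `llama-perplexity` tool: comparing the perplexity of the BF16 and quantized models on the same text gives a rough measure of degradation. A sketch, assuming a local evaluation file (the wikitext-2 raw test set is a common choice; the path below is illustrative):

```bash
# Lower perplexity is better; a small gap between the two runs indicates
# the quantization preserved most of the model's quality.
./llama-perplexity -m models/DeepSeekV4Flash-BF16.gguf   -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m models/DeepSeekV4Flash-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw
```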