---
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
- gguf
- glm
- llama-cpp
- quantized
- moe
---

# GLM-4.7-Flash Q8_0 GGUF

Q8_0 quantization of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) for use with llama.cpp.

## Model Details

| Property | Value |
|----------|-------|
| Base model | zai-org/GLM-4.7-Flash |
| Architecture | 30B-A3B MoE (DeepSeek v2) |
| Quantization | Q8_0 |
| Size | ~30 GB |
| Context length | 128K tokens |

## Hardware Requirements

- **Minimum VRAM:** 32 GB (single GPU)
- **Recommended:** 56 GB (dual GPU, e.g., RTX 5090 + RTX 4090)

## Usage

### Basic usage with llama.cpp

```bash
llama-server -m GLM-4.7-Flash-Q8_0.gguf -ngl 99 -c 65536
```

### Full 128K context on dual GPU

```bash
llama-server -m GLM-4.7-Flash-Q8_0.gguf \
  -ngl 99 \
  -c 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --split-mode layer \
  --tensor-split 32,24 \
  --host 0.0.0.0 \
  --port 8080
```

### OpenAI-compatible API

After starting the server, the API is available at:

- `http://localhost:8080/v1/chat/completions`
- `http://localhost:8080/v1/completions`

## Quantization Notes

This model was quantized from the F16 version of [ngxson/GLM-4.7-Flash-GGUF](https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF). The missing `deepseek2.rope.scaling.yarn_log_multiplier` metadata key was added to enable quantization with llama.cpp.

## Quality Comparison

| Quantization | Size | Perplexity Impact |
|--------------|------|-------------------|
| F16 | 56 GB | Baseline |
| **Q8_0** | **30 GB** | **~0.1%** |
| Q4_K_M | 18 GB | ~2-4% |

Q8_0 provides near-lossless quality while reducing model size by ~47%.
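The ~47% figure follows directly from the Quality Comparison table, and it agrees with what the format itself predicts, since llama.cpp's Q8_0 stores blocks of 32 int8 weights plus one fp16 scale (34 bytes per 32 weights, i.e. 8.5 bits per weight):

```python
# Size reduction implied by the table above:
f16_gb, q8_gb = 56, 30
print(f"{1 - q8_gb / f16_gb:.1%}")  # -> 46.4%, i.e. ~47%

# Cross-check from the format: 8.5 bits/weight (Q8_0) vs. 16 bits/weight (F16).
print(f"{1 - 8.5 / 16:.1%}")  # -> 46.9%
```

The small gap between the two numbers is expected: file size also includes metadata and tensors that are not quantized.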
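For reference, the quantization step described in the Quantization Notes can be reproduced with llama.cpp's `llama-quantize` tool. This is a sketch, not the exact command used: file names are illustrative, and it assumes the F16 GGUF has already been downloaded and its metadata patched.

```bash
# Illustrative: quantize the patched F16 GGUF to Q8_0 using 8 threads.
llama-quantize GLM-4.7-Flash-F16.gguf GLM-4.7-Flash-Q8_0.gguf Q8_0 8
```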
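As a quick smoke test of the OpenAI-compatible endpoint listed above, the chat route can be exercised with Python's standard library alone. The helper below (`build_chat_request` is a name chosen here for illustration) assembles the POST request; llama-server serves a single model, so the `model` field is effectively arbitrary but required by the OpenAI schema:

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request for llama-server's OpenAI-compatible chat endpoint."""
    payload = {
        # llama-server ignores the model name (it serves one model),
        # but the field is required by the OpenAI request schema.
        "model": "GLM-4.7-Flash-Q8_0",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request assumes the server from the Usage section is running:
# with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```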