---
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
  - gguf
  - glm
  - llama-cpp
  - quantized
  - moe
---

# GLM-4.7-Flash Q8_0 GGUF

Q8_0 quantization of zai-org/GLM-4.7-Flash for use with llama.cpp.

## Model Details

| Property | Value |
|---|---|
| Base model | zai-org/GLM-4.7-Flash |
| Architecture | 30B-A3B MoE (DeepSeek v2) |
| Quantization | Q8_0 |
| Size | ~30 GB |
| Context length | 128K tokens |

## Hardware Requirements

- **Minimum VRAM:** 32 GB (single GPU)
- **Recommended:** 56 GB (dual GPU, e.g., RTX 5090 + RTX 4090)
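The VRAM figures above come from the ~30 GB of weights plus the KV cache, which grows linearly with context length. A back-of-envelope sketch of that arithmetic is below; the layer/head dimensions are made up for illustration (not GLM-4.7-Flash's real config), and the formula assumes a conventional multi-head/GQA cache layout, whereas DeepSeek-v2-style MLA caches are more compact:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token,
    # hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical dimensions for illustration only.
# q8_0 cache entries cost ~8.5 bits (32-value blocks of 34 bytes) = 1.0625 bytes.
gb = kv_cache_bytes(n_layers=47, n_kv_heads=4, head_dim=128,
                    ctx_len=131072, bytes_per_elem=1.0625)
print(f"{gb / 2**30:.1f} GiB")  # → 6.2 GiB
```

The same formula explains why quantizing the cache with `--cache-type-k q8_0 --cache-type-v q8_0` roughly halves its footprint versus the default f16 cache.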

## Usage

### Basic usage with llama.cpp

```bash
llama-server -m GLM-4.7-Flash-Q8_0.gguf -ngl 99 -c 65536
```

### Full 128K context on dual GPU

```bash
llama-server -m GLM-4.7-Flash-Q8_0.gguf \
    -ngl 99 \
    -c 131072 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --split-mode layer \
    --tensor-split 32,24 \
    --host 0.0.0.0 \
    --port 8080
```

### OpenAI-compatible API

After starting the server, the API is available at:

- `http://localhost:8080/v1/chat/completions`
- `http://localhost:8080/v1/completions`
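Any OpenAI-compatible client can talk to these endpoints. A minimal stdlib-only sketch is below; the `model` and `temperature` values are illustrative (llama-server serves whatever model it loaded, so the `model` field is typically ignored):

```python
import json
import urllib.request

def chat_request(prompt, base_url="http://localhost:8080"):
    """Build an OpenAI-style chat completion request for llama-server."""
    payload = {
        "model": "GLM-4.7-Flash-Q8_0",   # illustrative; usually ignored by llama-server
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Sending the request (requires a running server):
# with urllib.request.urlopen(chat_request("Hello")) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```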

## Quantization Notes

This model was quantized from the F16 version of ngxson/GLM-4.7-Flash-GGUF.

The missing `deepseek2.rope.scaling.yarn_log_multiplier` metadata key was added to enable quantization with llama.cpp.

## Quality Comparison

| Quantization | Size | Perplexity Impact |
|---|---|---|
| F16 | 56 GB | Baseline |
| Q8_0 | 30 GB | ~0.1% |
| Q4_K_M | 18 GB | ~2-4% |

Q8_0 provides near-lossless quality while reducing model size by roughly 46% relative to F16 (56 GB → 30 GB).