GLM-4.7-Flash Q8_0 GGUF

Q8_0 quantization of zai-org/GLM-4.7-Flash for use with llama.cpp.

Model Details

| Property | Value |
|---|---|
| Base model | zai-org/GLM-4.7-Flash |
| Architecture | 30B-A3B MoE (DeepSeek v2) |
| Quantization | Q8_0 |
| Size | ~30 GB |
| Context length | 128K tokens |

Hardware Requirements

  • Minimum VRAM: 32 GB (single GPU)
  • Recommended: 56 GB (dual GPU, e.g., RTX 5090 + RTX 4090)

Usage

Basic usage with llama.cpp

llama-server -m GLM-4.7-Flash-Q8_0.gguf -ngl 99 -c 65536

Full 128K context on dual GPU

llama-server -m GLM-4.7-Flash-Q8_0.gguf \
    -ngl 99 \
    -c 131072 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --split-mode layer \
    --tensor-split 32,24 \
    --host 0.0.0.0 \
    --port 8080
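
The --tensor-split values are proportions, not gigabytes: 32,24 simply mirrors the VRAM ratio of a 32 GB + 24 GB GPU pair, and llama.cpp assigns layers to each device in that ratio. A rough sketch of how the split falls out (the 48-layer count is a placeholder for illustration, not this model's real value, and llama.cpp's own rounding may differ):

```python
# Sketch: how proportional --tensor-split values map to per-GPU layer counts.
# n_layers=48 is an illustrative assumption, not GLM-4.7-Flash's real depth.
def split_layers(n_layers, proportions):
    total = sum(proportions)
    # Each GPU receives a share of layers proportional to its entry.
    return [round(n_layers * p / total) for p in proportions]

print(split_layers(48, [32, 24]))  # [27, 21] for a 32:24 split
```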

OpenAI-compatible API

After starting the server, the API is available at:

  • http://localhost:8080/v1/chat/completions
  • http://localhost:8080/v1/completions
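
A minimal client-side sketch of a chat request against the endpoint above, using only the standard library. The model name is illustrative; llama-server runs a single model, so the field is typically only kept for client compatibility.

```python
import json

# Minimal OpenAI-compatible chat request body for the llama-server endpoint.
# The "model" value is illustrative; llama-server serves one model regardless.
payload = {
    "model": "GLM-4.7-Flash-Q8_0",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}
body = json.dumps(payload).encode()

# To actually send it (requires the server started as shown above):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=body, headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```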

Quantization Notes

This model was quantized from the F16 version of ngxson/GLM-4.7-Flash-GGUF.

The missing deepseek2.rope.scaling.yarn_log_multiplier metadata key was added to enable quantization with llama.cpp.
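For reference, a hedged sketch of the steps involved. The script name comes from llama.cpp's gguf-py package; the filenames and the metadata value shown are illustrative, not necessarily the ones actually used here.

```shell
# Add the missing RoPE metadata key (value shown is illustrative).
# gguf_set_metadata.py ships with llama.cpp under gguf-py/scripts.
python gguf-py/scripts/gguf_set_metadata.py GLM-4.7-Flash-F16.gguf \
    deepseek2.rope.scaling.yarn_log_multiplier 0.1

# Quantize F16 -> Q8_0 with llama.cpp's llama-quantize tool.
llama-quantize GLM-4.7-Flash-F16.gguf GLM-4.7-Flash-Q8_0.gguf Q8_0
```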

Quality Comparison

| Quantization | Size | Perplexity impact |
|---|---|---|
| F16 | 56 GB | baseline |
| Q8_0 | 30 GB | ~0.1% |
| Q4_K_M | 18 GB | ~2-4% |

Q8_0 provides near-lossless quality while reducing model size by ~47%.
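
The ~47% figure follows from Q8_0's storage layout: each 32-weight block stores 32 one-byte values plus a 2-byte fp16 scale, i.e. 8.5 bits per weight versus 16 for F16. A quick check (the on-disk reduction is approximate, since not every tensor is quantized):

```python
# Q8_0 block: 32 int8 weights + one fp16 scale = 34 bytes per 32 weights.
block_bytes = 32 * 1 + 2
bits_per_weight = block_bytes * 8 / 32   # 8.5 bits
reduction = 1 - bits_per_weight / 16     # vs. 16-bit F16
print(f"{bits_per_weight} bits/weight, {reduction:.1%} smaller")
# -> 8.5 bits/weight, 46.9% smaller
```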
