---
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
  - gguf
  - glm
  - llama-cpp
  - quantized
  - moe
---

# GLM-4.7-Flash Q8_0 GGUF

Q8_0 quantization of zai-org/GLM-4.7-Flash for use with llama.cpp.

## Model Details

| Property | Value |
|---|---|
| Base model | zai-org/GLM-4.7-Flash |
| Architecture | 30B-A3B MoE (DeepSeek v2) |
| Quantization | Q8_0 |
| Size | ~30 GB |
| Context length | 128K tokens |

## Hardware Requirements

- **Minimum VRAM:** 32 GB (single GPU)
- **Recommended:** 56 GB (dual GPU, e.g., RTX 5090 + RTX 4090)
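The VRAM figures above come from the ~30 GB of weights plus the KV cache, which grows linearly with context length. A back-of-envelope sketch of that arithmetic is below; the layer/head dimensions are made up for illustration (not GLM-4.7-Flash's real config), and the formula assumes a conventional multi-head/GQA cache layout, whereas DeepSeek-v2-style MLA caches are more compact:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token,
    # hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical dimensions for illustration only.
# q8_0 cache entries cost ~8.5 bits (32-value blocks of 34 bytes) = 1.0625 bytes.
gb = kv_cache_bytes(n_layers=47, n_kv_heads=4, head_dim=128,
                    ctx_len=131072, bytes_per_elem=1.0625)
print(f"{gb / 2**30:.1f} GiB")  # → 6.2 GiB
```

The same formula explains why quantizing the cache with `--cache-type-k q8_0 --cache-type-v q8_0` roughly halves its footprint versus the default f16 cache.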

## Usage

### Basic usage with llama.cpp

```bash
llama-server -m GLM-4.7-Flash-Q8_0.gguf -ngl 99 -c 65536
```

### Full 128K context on dual GPU

```bash
llama-server -m GLM-4.7-Flash-Q8_0.gguf \
    -ngl 99 \
    -c 131072 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --split-mode layer \
    --tensor-split 32,24 \
    --host 0.0.0.0 \
    --port 8080
```

### OpenAI-compatible API

After starting the server, the API is available at:

- `http://localhost:8080/v1/chat/completions`
- `http://localhost:8080/v1/completions`
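Any OpenAI-compatible client can talk to these endpoints. A minimal stdlib-only sketch is below; the `model` and `temperature` values are illustrative (llama-server serves whatever model it loaded, so the `model` field is typically ignored):

```python
import json
import urllib.request

def chat_request(prompt, base_url="http://localhost:8080"):
    """Build an OpenAI-style chat completion request for llama-server."""
    payload = {
        "model": "GLM-4.7-Flash-Q8_0",   # illustrative; usually ignored by llama-server
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Sending the request (requires a running server):
# with urllib.request.urlopen(chat_request("Hello")) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```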

## Quantization Notes

This model was quantized from the F16 version of ngxson/GLM-4.7-Flash-GGUF.

The missing `deepseek2.rope.scaling.yarn_log_multiplier` metadata key was added to enable quantization with llama.cpp.

## Quality Comparison

| Quantization | Size | Perplexity Impact |
|---|---|---|
| F16 | 56 GB | Baseline |
| Q8_0 | 30 GB | ~0.1% |
| Q4_K_M | 18 GB | ~2-4% |

Q8_0 provides near-lossless quality while reducing model size by roughly 46% relative to F16 (56 GB → 30 GB).