---
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
- gguf
- glm
- llama-cpp
- quantized
- moe
---

# GLM-4.7-Flash Q8_0 GGUF

Q8_0 quantization of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) for use with llama.cpp.

## Model Details

| Property | Value |
|----------|-------|
| Base model | zai-org/GLM-4.7-Flash |
| Architecture | 30B-A3B MoE (DeepSeek v2) |
| Quantization | Q8_0 |
| Size | ~30 GB |
| Context length | 128K tokens |

## Hardware Requirements

- **Minimum VRAM:** 32 GB (single GPU)
- **Recommended:** 56 GB (dual GPU, e.g., RTX 5090 + RTX 4090)

## Usage

### Basic usage with llama.cpp

```bash
llama-server -m GLM-4.7-Flash-Q8_0.gguf -ngl 99 -c 65536
```

### Full 128K context on dual GPU

```bash
llama-server -m GLM-4.7-Flash-Q8_0.gguf \
  -ngl 99 \
  -c 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --split-mode layer \
  --tensor-split 32,24 \
  --host 0.0.0.0 \
  --port 8080
```

### OpenAI-compatible API

After starting the server, the API is available at:

- `http://localhost:8080/v1/chat/completions`
- `http://localhost:8080/v1/completions`

## Quantization Notes

This model was quantized from the F16 version of [ngxson/GLM-4.7-Flash-GGUF](https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF). The missing `deepseek2.rope.scaling.yarn_log_multiplier` metadata key was added to enable quantization with llama.cpp.

## Quality Comparison

| Quantization | Size | Perplexity Impact |
|--------------|------|-------------------|
| F16 | 56 GB | Baseline |
| **Q8_0** | **30 GB** | **~0.1%** |
| Q4_K_M | 18 GB | ~2-4% |

Q8_0 provides near-lossless quality while reducing model size by ~47%.
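The ~47% figure follows directly from the Quality Comparison table, and it agrees with what the format itself predicts, since llama.cpp's Q8_0 stores blocks of 32 int8 weights plus one fp16 scale (34 bytes per 32 weights, i.e. 8.5 bits per weight):

```python
# Size reduction implied by the table above:
f16_gb, q8_gb = 56, 30
print(f"{1 - q8_gb / f16_gb:.1%}")  # -> 46.4%, i.e. ~47%

# Cross-check from the format: 8.5 bits/weight (Q8_0) vs. 16 bits/weight (F16).
print(f"{1 - 8.5 / 16:.1%}")  # -> 46.9%
```

The small gap between the two numbers is expected: file size also includes metadata and tensors that are not quantized.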
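For reference, the quantization step described in the Quantization Notes can be reproduced with llama.cpp's `llama-quantize` tool. This is a sketch, not the exact command used: file names are illustrative, and it assumes the F16 GGUF has already been downloaded and its metadata patched.

```bash
# Illustrative: quantize the patched F16 GGUF to Q8_0 using 8 threads.
llama-quantize GLM-4.7-Flash-F16.gguf GLM-4.7-Flash-Q8_0.gguf Q8_0 8
```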
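As a quick smoke test of the OpenAI-compatible endpoint listed above, the chat route can be exercised with Python's standard library alone. The helper below (`build_chat_request` is a name chosen here for illustration) assembles the POST request; llama-server serves a single model, so the `model` field is effectively arbitrary but required by the OpenAI schema:

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request for llama-server's OpenAI-compatible chat endpoint."""
    payload = {
        # llama-server ignores the model name (it serves one model),
        # but the field is required by the OpenAI request schema.
        "model": "GLM-4.7-Flash-Q8_0",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request assumes the server from the Usage section is running:
# with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```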