---
license: other
license_name: glm-5
license_link: https://huggingface.co/zai-org/GLM-5.1/blob/main/LICENSE
base_model: 0xSero/GLM-5.1-555B-A14B-REAP
tags:
  - reap
  - pruning
  - moe
  - expert-pruning
  - glm
  - gptq
  - w4a16
  - autoround
  - vllm
library_name: transformers
pipeline_tag: text-generation
quantization_config:
  quant_method: gptq
  bits: 4
  group_size: 128
  sym: true
  desc_act: false
  checkpoint_format: gptq
---

# GLM-5.1 — 25% Expert Pruned (REAP) — W4A16

This is a **GPTQ 4-bit weight-quantized** variant of [`zai-org/GLM-5.1`](https://huggingface.co/zai-org/GLM-5.1) with 25% of routed experts pruned via [REAP](https://github.com/CerebrasResearch/reap) (Router-weighted Expert Activation Pruning), quantized with [AutoRound](https://github.com/intel/auto-round) learned rounding.

| Property | Value |
|----------|-------|
| Base model | `zai-org/GLM-5.1` (744B MoE, 256 experts/layer) |
| Architecture | `GlmMoeDsaForCausalLM` (MoE + Dynamic Sparse Attention) |
| Routed experts | 256 → 192 per layer (64 pruned, 25%) |
| Active params/token | ~14B (top-8 routing preserved) |
| Quantization | GPTQ W4A16 (int4 symmetric, group_size=128) |
| Quantizer | auto-round 0.12.2 (200 iterations, SignSGD) |
| Quantized size | **277 GB** (56 safetensor shards) |
| BF16 source | [`0xSero/GLM-5.1-555B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP) |
| GGUF variant | [`0xSero/GLM-5.1-555B-A14B-REAP-GGUF`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP-GGUF) (325 GB, Q4_K_M) |

## Benchmark Results (GGUF Q4_K_M, inference mode, temp=0.8)

The GPTQ W4A16 checkpoint is quantized from the same pruned BF16 source as the GGUF Q4_K_M. The scores below were measured on the GGUF variant:

| Suite | Metric | Result | Repetition Loops |
|-------|--------|--------|-----------------|
| Terminal-Bench (50) | Proxy Pass | 44/50 (88%) | 0/50 |
| SWE-bench Pro (50) | Proxy Pass | 33/50 (66%) | 0/50 |
| GSM8K (50) | Correct | 30/50 (60%) | 0/50 |
| HLE (50) | Correct | 9/50 (18%) | 0/50 |

**Zero repetition loops across all 200 benchmark probes.** The 25% prune retains 192/256 experts per layer, providing enough expert diversity for stable generation at all sequence lengths.

## How to Use

### vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16",
    tensor_parallel_size=4,    # e.g. 4× B200; set to 8 for 8× A100
    max_model_len=8192,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.8, max_tokens=4096)
outputs = llm.generate(["Hello, world!"], params)
```

### SGLang

```bash
python -m sglang.launch_server \
  --model-path 0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 \
  --tp 4 \
  --trust-remote-code
```

### Requires

- ~70-80 GiB VRAM per GPU across 4 GPUs (B200), or ~280 GiB total
- CUDA 12.8+ (sm_100a / Blackwell)
- vLLM >= 0.19.0 with `deep_gemm` installed (for DSA sparse attention)
- `trust_remote_code=True`
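
As a back-of-envelope check on the VRAM figures above (a sketch only; real usage adds KV cache and activation buffers that scale with `max_model_len` and batch size, and the 8 GiB overhead figure here is an assumption):

```python
# Rough per-GPU VRAM for tensor-parallel serving of this checkpoint.
# The overhead term is a placeholder for KV cache, activations, and the
# CUDA context; tune it for your workload.
def per_gpu_vram_gib(checkpoint_gb: float, tp: int, overhead_gib: float = 8.0) -> float:
    weights_gib = checkpoint_gb * 1e9 / 2**30   # decimal GB -> binary GiB
    return weights_gib / tp + overhead_gib

print(round(per_gpu_vram_gib(277, tp=4), 1))    # ~72.5 GiB on each of 4 GPUs
```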

## Quantization Details

**Method:** AutoRound W4A16 — learned rounding via SignSGD (200 iterations per layer), calibrated on 128 samples from NeelNanda/pile-10k at 2048 sequence length.
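
The learned-rounding idea can be illustrated with a toy sketch (my own simplification, not the AutoRound implementation): a per-weight rounding offset `V` in `[-0.5, 0.5]` is tuned with SignSGD, through a straight-through gradient, so the rounded int4 weights better reproduce the layer's output on calibration data.

```python
import numpy as np

# Toy AutoRound-style learned rounding with SignSGD (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal(64)            # one group of weights
X = rng.standard_normal((64, 256))     # calibration activations
y = W @ X                              # full-precision reference output

s = np.abs(W).max() / 7.0              # symmetric int4 scale (codes in [-8, 7])

def dequant(V):
    return s * np.clip(np.round(W / s + V), -8, 7)

def loss(V):
    return float(((dequant(V) @ X - y) ** 2).mean())

V = np.zeros(64)                       # V = 0 is plain round-to-nearest (RTN)
rtn_loss = loss(V)
best = rtn_loss
for _ in range(200):                   # SignSGD steps on the rounding offsets
    err = dequant(V) @ X - y
    grad = 2.0 * (X @ err) / X.shape[1] * s   # straight-through estimate
    V = np.clip(V - 0.01 * np.sign(grad), -0.5, 0.5)
    best = min(best, loss(V))

print(best <= rtn_loss)  # True: the search starts from RTN and keeps the best V
```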

**Protected (kept at full precision):**
- Dense MLP layers 0-2 (`gate_proj`, `up_proj`, `down_proj`)
- DSA indexer (`weights_proj`)
- `lm_head`

**Quantized to int4 (43,971/44,059 linear layers):**
- All attention projections (`q_a_proj`, `q_b_proj`, `kv_a_proj`, `kv_b_proj`, `o_proj`)
- All routed MoE expert projections (192 experts × gate/up/down × 75 MoE layers)
- Shared expert projections

**GPTQ config:** `bits=4, group_size=128, sym=true, desc_act=false`
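
The group-wise scheme in that config can be sketched in a few lines (plain symmetric round-to-nearest per group, for illustration; GPTQ's actual solver additionally compensates rounding error across columns):

```python
import numpy as np

# Symmetric 4-bit group quantization as configured above
# (bits=4, group_size=128, sym=true): one scale per group of 128
# weights, integer codes in [-8, 7], no zero-point.
def quant_dequant_group(w: np.ndarray) -> np.ndarray:
    scale = np.abs(w).max() / 7.0               # map the largest weight to +/-7
    q = np.clip(np.round(w / scale), -8, 7)     # int4 codes
    return q * scale                            # dequantized weights

rng = np.random.default_rng(0)
w = rng.standard_normal(128).astype(np.float32)
w_hat = quant_dequant_group(w)
# Rounding error is bounded by half a quantization step (scale / 2).
print(float(np.abs(w - w_hat).max()) <= np.abs(w).max() / 14 + 1e-6)  # True
```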

## Why GPTQ over GGUF Q4_K_M?

| | GPTQ W4A16 (this) | GGUF Q4_K_M |
|---|---|---|
| Size | 277 GB | 325 GB |
| Serving | vLLM, SGLang, TGI (GPU) | llama.cpp (CPU/GPU hybrid) |
| Quant method | Learned rounding (SignSGD) | Block-wise k-quants |
| Throughput | Higher (GPU-native kernels) | Lower |
| Best for | Production GPU serving | Local inference, edge |

GPTQ stores 4-bit weights with a single scale per 128-weight group and, under symmetric quantization, no zero-points, giving a lower effective bit-rate than Q4_K_M's per-block scales and mins and therefore a smaller checkpoint at the same nominal 4-bit width.
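
The arithmetic behind that size gap (a rough estimate that ignores packing metadata and the layers kept at full precision):

```python
# Effective bits per weight for W4A16 with one fp16 scale per
# 128-weight group; symmetric quantization stores no zero-points.
GROUP_SIZE = 128
gptq_bpw = 4 + 16 / GROUP_SIZE
print(gptq_bpw)          # 4.125

Q4_K_M_BPW = 4.85        # commonly quoted approximate figure for llama.cpp Q4_K_M
print(gptq_bpw < Q4_K_M_BPW)  # True
```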

## Related Models

| Model | Prune % | Experts | Format | Size | Status |
|-------|---------|---------|--------|------|--------|
| [`0xSero/GLM-5.1-555B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP) | 25% | 192/256 | BF16 | 1.1T | Source checkpoint |
| [`0xSero/GLM-5.1-555B-A14B-REAP-GGUF`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP-GGUF) | 25% | 192/256 | GGUF Q4_K_M | 325G | llama.cpp serving |
| **This model** | **25%** | **192/256** | **GPTQ W4A16** | **277G** | **vLLM/SGLang serving** |
| [`0xSero/GLM-5.1-444B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP) | 40% | 154/256 | BF16 | 910G | Has repetition issues — use 25% |

## Support This Work

If you find these models useful, please consider supporting continued open-source model compression research:

**[donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)**

## Citation

If you use this model, please cite the [REAP paper](https://github.com/CerebrasResearch/reap) and [AutoRound](https://github.com/intel/auto-round).

## Sponsors

Thank you to the kind sponsors; this work wouldn't be possible without them:

- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle