---
license: other
license_name: glm-5
license_link: https://huggingface.co/zai-org/GLM-5.1/blob/main/LICENSE
base_model: 0xSero/GLM-5.1-555B-A14B-REAP
tags:
  - reap
  - pruning
  - moe
  - expert-pruning
  - glm
  - gptq
  - w4a16
  - autoround
  - vllm
library_name: transformers
pipeline_tag: text-generation
quantization_config:
  quant_method: gptq
  bits: 4
  group_size: 128
  sym: true
  desc_act: false
  checkpoint_format: gptq
---

# GLM-5.1 — 25% Expert Pruned (REAP) — W4A16

This is a **GPTQ 4-bit weight-quantized** variant of [`zai-org/GLM-5.1`](https://huggingface.co/zai-org/GLM-5.1) with 25% of routed experts pruned via [REAP](https://github.com/CerebrasResearch/reap) (Router-weighted Expert Activation Pruning), quantized with [AutoRound](https://github.com/intel/auto-round) learned rounding.

| Property | Value |
|----------|-------|
| Base model | `zai-org/GLM-5.1` (744B MoE, 256 experts/layer) |
| Architecture | `GlmMoeDsaForCausalLM` (MoE + Dynamic Sparse Attention) |
| Routed experts | 256 → 192 per layer (64 pruned, 25%) |
| Active params/token | ~14B (top-8 routing preserved) |
| Quantization | GPTQ W4A16 (int4 symmetric, group_size=128) |
| Quantizer | auto-round 0.12.2 (200 iterations, SignSGD) |
| Quantized size | **277 GB** (56 safetensor shards) |
| BF16 source | [`0xSero/GLM-5.1-555B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP) |
| GGUF variant | [`0xSero/GLM-5.1-555B-A14B-REAP-GGUF`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP-GGUF) (325 GB, Q4_K_M) |

## Benchmark Results (GGUF Q4_K_M, inference mode, temp=0.8)

The GPTQ W4A16 checkpoint is quantized from the same pruned BF16 source as the GGUF Q4_K_M. The scores below were measured on the GGUF variant:

| Suite | Metric | Result | Repetition Loops |
|-------|--------|--------|-----------------|
| Terminal-Bench (50) | Proxy Pass | 44/50 (88%) | 0/50 |
| SWE-bench Pro (50) | Proxy Pass | 33/50 (66%) | 0/50 |
| GSM8K (50) | Correct | 30/50 (60%) | 0/50 |
| HLE (50) | Correct | 9/50 (18%) | 0/50 |

**Zero repetition loops across all 200 benchmark probes.** The 25% prune retains 192/256 experts per layer, providing enough expert diversity for stable generation at all sequence lengths.

## How to Use

### vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16",
    tensor_parallel_size=4,    # e.g. 4× B200; set to 8 for 8× A100
    max_model_len=8192,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.8, max_tokens=4096)
outputs = llm.generate(["Hello, world!"], params)
```

### SGLang

```bash
python -m sglang.launch_server \
  --model-path 0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 \
  --tp 4 \
  --trust-remote-code
```

### Requires

- ~70-80 GiB VRAM per GPU across 4 GPUs (B200), or ~280 GiB total
- CUDA 12.8+ (sm_100a / Blackwell)
- vLLM >= 0.19.0 with `deep_gemm` installed (for DSA sparse attention)
- `trust_remote_code=True`
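
As a back-of-envelope check on the VRAM figures above (a sketch only; real usage adds KV cache and activation buffers that scale with `max_model_len` and batch size, and the 8 GiB overhead figure here is an assumption):

```python
# Rough per-GPU VRAM for tensor-parallel serving of this checkpoint.
# The overhead term is a placeholder for KV cache, activations, and the
# CUDA context; tune it for your workload.
def per_gpu_vram_gib(checkpoint_gb: float, tp: int, overhead_gib: float = 8.0) -> float:
    weights_gib = checkpoint_gb * 1e9 / 2**30   # decimal GB -> binary GiB
    return weights_gib / tp + overhead_gib

print(round(per_gpu_vram_gib(277, tp=4), 1))    # ~72.5 GiB on each of 4 GPUs
```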

## Quantization Details

**Method:** AutoRound W4A16 — learned rounding via SignSGD (200 iterations per layer), calibrated on 128 samples from NeelNanda/pile-10k at 2048 sequence length.
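
The learned-rounding idea can be illustrated with a toy sketch (my own simplification, not the AutoRound implementation): a per-weight rounding offset `V` in `[-0.5, 0.5]` is tuned with SignSGD, through a straight-through gradient, so the rounded int4 weights better reproduce the layer's output on calibration data.

```python
import numpy as np

# Toy AutoRound-style learned rounding with SignSGD (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal(64)            # one group of weights
X = rng.standard_normal((64, 256))     # calibration activations
y = W @ X                              # full-precision reference output

s = np.abs(W).max() / 7.0              # symmetric int4 scale (codes in [-8, 7])

def dequant(V):
    return s * np.clip(np.round(W / s + V), -8, 7)

def loss(V):
    return float(((dequant(V) @ X - y) ** 2).mean())

V = np.zeros(64)                       # V = 0 is plain round-to-nearest (RTN)
rtn_loss = loss(V)
best = rtn_loss
for _ in range(200):                   # SignSGD steps on the rounding offsets
    err = dequant(V) @ X - y
    grad = 2.0 * (X @ err) / X.shape[1] * s   # straight-through estimate
    V = np.clip(V - 0.01 * np.sign(grad), -0.5, 0.5)
    best = min(best, loss(V))

print(best <= rtn_loss)  # True: the search starts from RTN and keeps the best V
```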

**Protected (kept at full precision):**
- Dense MLP layers 0-2 (`gate_proj`, `up_proj`, `down_proj`)
- DSA indexer (`weights_proj`)
- `lm_head`

**Quantized to int4 (43,971/44,059 linear layers):**
- All attention projections (`q_a_proj`, `q_b_proj`, `kv_a_proj`, `kv_b_proj`, `o_proj`)
- All routed MoE expert projections (192 experts × gate/up/down × 75 MoE layers)
- Shared expert projections

**GPTQ config:** `bits=4, group_size=128, sym=true, desc_act=false`
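
The group-wise scheme in that config can be sketched in a few lines (plain symmetric round-to-nearest per group, for illustration; GPTQ's actual solver additionally compensates rounding error across columns):

```python
import numpy as np

# Symmetric 4-bit group quantization as configured above
# (bits=4, group_size=128, sym=true): one scale per group of 128
# weights, integer codes in [-8, 7], no zero-point.
def quant_dequant_group(w: np.ndarray) -> np.ndarray:
    scale = np.abs(w).max() / 7.0               # map the largest weight to +/-7
    q = np.clip(np.round(w / scale), -8, 7)     # int4 codes
    return q * scale                            # dequantized weights

rng = np.random.default_rng(0)
w = rng.standard_normal(128).astype(np.float32)
w_hat = quant_dequant_group(w)
# Rounding error is bounded by half a quantization step (scale / 2).
print(float(np.abs(w - w_hat).max()) <= np.abs(w).max() / 14 + 1e-6)  # True
```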

## Why GPTQ over GGUF Q4_K_M?

| | GPTQ W4A16 (this) | GGUF Q4_K_M |
|---|---|---|
| Size | 277 GB | 325 GB |
| Serving | vLLM, SGLang, TGI (GPU) | llama.cpp (CPU/GPU hybrid) |
| Quant method | Learned rounding (SignSGD) | Block-wise k-quants |
| Throughput | Higher (GPU-native kernels) | Lower |
| Best for | Production GPU serving | Local inference, edge |

GPTQ stores 4-bit weights with a single scale per 128-weight group and, under symmetric quantization, no zero-points, giving a lower effective bit-rate than Q4_K_M's per-block scales and mins and therefore a smaller checkpoint at the same nominal 4-bit width.
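
The arithmetic behind that size gap (a rough estimate that ignores packing metadata and the layers kept at full precision):

```python
# Effective bits per weight for W4A16 with one fp16 scale per
# 128-weight group; symmetric quantization stores no zero-points.
GROUP_SIZE = 128
gptq_bpw = 4 + 16 / GROUP_SIZE
print(gptq_bpw)          # 4.125

Q4_K_M_BPW = 4.85        # commonly quoted approximate figure for llama.cpp Q4_K_M
print(gptq_bpw < Q4_K_M_BPW)  # True
```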

## Related Models

| Model | Prune % | Experts | Format | Size | Status |
|-------|---------|---------|--------|------|--------|
| [`0xSero/GLM-5.1-555B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP) | 25% | 192/256 | BF16 | 1.1T | Source checkpoint |
| [`0xSero/GLM-5.1-555B-A14B-REAP-GGUF`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP-GGUF) | 25% | 192/256 | GGUF Q4_K_M | 325G | llama.cpp serving |
| **This model** | **25%** | **192/256** | **GPTQ W4A16** | **277G** | **vLLM/SGLang serving** |
| [`0xSero/GLM-5.1-444B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP) | 40% | 154/256 | BF16 | 910G | Has repetition issues — use 25% |

## Support This Work

If you find these models useful, please consider supporting continued open-source model compression research:

**[donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)**

## Citation

If you use this model, please cite the [REAP paper](https://github.com/CerebrasResearch/reap) and [AutoRound](https://github.com/intel/auto-round).

## Sponsors

Thank you to the kind sponsors; this work wouldn't be possible without them:

- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle