---
license: other
license_name: glm-5
license_link: https://huggingface.co/zai-org/GLM-5.1/blob/main/LICENSE
base_model: 0xSero/GLM-5.1-555B-A14B-REAP
tags:
- reap
- pruning
- moe
- expert-pruning
- glm
- gptq
- w4a16
- autoround
- vllm
library_name: transformers
pipeline_tag: text-generation
quantization_config:
quant_method: gptq
bits: 4
group_size: 128
sym: true
desc_act: false
checkpoint_format: gptq
---
# GLM-5.1 — 25% Expert Pruned (REAP) — W4A16
This is a **GPTQ 4-bit weight-quantized (W4A16)** variant of [`zai-org/GLM-5.1`](https://huggingface.co/zai-org/GLM-5.1) with 25% of its routed experts pruned via [REAP](https://github.com/CerebrasResearch/reap) (Router-weighted Expert Activation Pruning), quantized with [AutoRound](https://github.com/intel/auto-round)'s learned rounding optimization.
| Property | Value |
|----------|-------|
| Base model | `zai-org/GLM-5.1` (744B MoE, 256 experts/layer) |
| Architecture | `GlmMoeDsaForCausalLM` (MoE + Dynamic Sparse Attention) |
| Routed experts | 256 → 192 (25% removed, 64 per layer) |
| Active params/token | ~14B (top-8 routing preserved) |
| Quantization | GPTQ W4A16 (int4 symmetric, group_size=128) |
| Quantizer | auto-round 0.12.2 (200 iterations, SignSGD) |
| Quantized size | **277 GB** (56 safetensor shards) |
| BF16 source | [`0xSero/GLM-5.1-555B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP) |
| GGUF variant | [`0xSero/GLM-5.1-555B-A14B-REAP-GGUF`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP-GGUF) (325 GB, Q4_K_M) |
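The parameter counts above can be sanity-checked with one line of arithmetic. A sketch using only numbers from this card, under the assumption that routed experts hold nearly the entire parameter budget of the MoE:

```python
# Dropping 25% of routed experts takes the 744B base to the ~555B
# pruned checkpoint named in the repo; the removed fraction of *all*
# parameters lands very close to 25%, consistent with routed experts
# dominating the parameter count.
total_b = 744    # base GLM-5.1 parameters, billions (from the table)
pruned_b = 555   # pruned checkpoint, billions (from the model name)
removed_frac = 1 - pruned_b / total_b
print(round(removed_frac, 3))  # ~0.254
```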
## Benchmark Results (GGUF Q4_K_M, inference mode, temp=0.8)
The benchmarks below were run on the GGUF Q4_K_M variant of the same 25%-pruned checkpoint; since this GPTQ W4A16 shares that base, the scores should be broadly indicative (zero repetition loops observed):
| Suite | Metric | Result | Repetition Loops |
|-------|--------|--------|-----------------|
| Terminal-Bench (50) | Proxy Pass | 44/50 (88%) | 0/50 |
| SWE-bench Pro (50) | Proxy Pass | 33/50 (66%) | 0/50 |
| GSM8K (50) | Correct | 30/50 (60%) | 0/50 |
| HLE (50) | Correct | 9/50 (18%) | 0/50 |
**Zero repetition loops across 220 benchmark probes.** The 25% prune retains 192/256 experts, providing enough expert diversity for stable generation at all sequence lengths.
## How to Use
### vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16",
    tensor_parallel_size=4,  # e.g. 4× B200; use 8 for 8× A100
    max_model_len=8192,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.8, max_tokens=4096)
outputs = llm.generate(["Hello, world!"], params)
print(outputs[0].outputs[0].text)
```
### SGLang
```bash
python -m sglang.launch_server \
--model-path 0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 \
--tp 4 \
--trust-remote-code
```
### Requires
- ~70-80 GiB VRAM per GPU across 4 GPUs (B200), or ~280 GiB total
- CUDA 12.8+ (sm_100a / Blackwell)
- vLLM >= 0.19.0 with `deep_gemm` installed (for DSA sparse attention)
- `trust_remote_code=True`
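The per-GPU VRAM figure follows directly from the checkpoint size. A back-of-envelope sketch covering weights only (KV cache and activations add overhead on top):

```python
checkpoint_gb = 277  # quantized checkpoint size from this card
gpus = 4             # tensor_parallel_size=4
weights_per_gpu = checkpoint_gb / gpus
print(round(weights_per_gpu, 1))  # ~69.2 GB of weights per GPU
```

KV cache and activation buffers push this toward the ~70-80 GiB per-GPU figure quoted above.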
## Quantization Details
**Method:** AutoRound W4A16 — learned rounding via SignSGD (200 iterations per layer), calibrated on 128 samples from NeelNanda/pile-10k at 2048 sequence length.
**Protected (kept at full precision):**
- Dense MLP layers 0-2 (`gate_proj`, `up_proj`, `down_proj`)
- DSA indexer (`weights_proj`)
- `lm_head`
**Quantized to int4 (43,971/44,059 linear layers):**
- All attention projections (`q_a_proj`, `q_b_proj`, `kv_a_proj`, `kv_b_proj`, `o_proj`)
- All routed MoE expert projections (192 experts × gate/up/down × 75 MoE layers)
- Shared expert projections
**GPTQ config:** `bits=4, group_size=128, sym=true, desc_act=false`
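The routed-expert projections account for almost all of the quantized linears. A quick check of the arithmetic, using only the counts stated above:

```python
experts, projections, moe_layers = 192, 3, 75  # gate/up/down per expert
routed_expert_linears = experts * projections * moe_layers
print(routed_expert_linears)  # 43,200 of the 43,971 quantized linears
```

The remaining ~770 quantized linears are the attention and shared-expert projections.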
## Why GPTQ over GGUF Q4_K_M?
| | GPTQ W4A16 (this) | GGUF Q4_K_M |
|---|---|---|
| Size | 277 GB | 325 GB |
| Serving | vLLM, SGLang, TGI (GPU) | llama.cpp (CPU/GPU hybrid) |
| Quant method | Learned rounding (SignSGD) | K-means clustering |
| Throughput | Higher (GPU-native kernels) | Lower |
| Best for | Production GPU serving | Local inference, edge |
GPTQ packs 4-bit weights more efficiently with `group_size=128` symmetric quantization, resulting in a smaller checkpoint than GGUF Q4_K_M at the same bit-width.
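The size gap is roughly what effective bits-per-weight predict. A sketch, where the ~4.8 bpw figure for Q4_K_M is an approximate community estimate (an assumption, not from this card):

```python
# GPTQ int4, group_size=128, symmetric: each weight costs 4 bits plus
# its share of one fp16 scale per group of 128 weights.
gptq_bpw = 4 + 16 / 128        # 4.125 bits per weight
q4km_bpw = 4.8                 # approximate average for GGUF Q4_K_M
print(gptq_bpw, round(q4km_bpw / gptq_bpw, 2))  # 4.125, ~1.16x
```

That ~1.16× ratio lines up with the 325 GB vs 277 GB checkpoint sizes (~1.17×).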
## Related Models
| Model | Prune % | Experts | Format | Size | Status |
|-------|---------|---------|--------|------|--------|
| [`0xSero/GLM-5.1-555B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP) | 25% | 192/256 | BF16 | 1.1T | Source checkpoint |
| [`0xSero/GLM-5.1-555B-A14B-REAP-GGUF`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP-GGUF) | 25% | 192/256 | GGUF Q4_K_M | 325G | llama.cpp serving |
| **This model** | **25%** | **192/256** | **GPTQ W4A16** | **277G** | **vLLM/SGLang serving** |
| [`0xSero/GLM-5.1-444B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP) | 40% | 154/256 | BF16 | 910G | Has repetition issues — use 25% |
## Support This Work
If you find these models useful, please consider supporting continued open-source model compression research:
**[donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)**
## Citation
If you use this model, please cite the [REAP paper](https://github.com/CerebrasResearch/reap) and [AutoRound](https://github.com/intel/auto-round).
## Sponsors
Thank you to the kind sponsors; this work wouldn't be possible without them:
- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle