---
license: other
license_name: glm-5
license_link: https://huggingface.co/zai-org/GLM-5.1/blob/main/LICENSE
base_model: 0xSero/GLM-5.1-555B-A14B-REAP
tags:
  - reap
  - pruning
  - moe
  - expert-pruning
  - glm
  - gptq
  - w4a16
  - autoround
  - vllm
library_name: transformers
pipeline_tag: text-generation
quantization_config:
  quant_method: gptq
  bits: 4
  group_size: 128
  sym: true
  desc_act: false
  checkpoint_format: gptq
---

# GLM-5.1 — 25% Expert Pruned (REAP) — W4A16

This is a **GPTQ 4-bit weight-quantized** variant of the 25% expert-pruned [`zai-org/GLM-5.1`](https://huggingface.co/zai-org/GLM-5.1). Experts were pruned with [REAP](https://github.com/CerebrasResearch/reap) (Router-weighted Expert Activation Pruning), and the weights were quantized with [AutoRound](https://github.com/intel/auto-round) for learned rounding optimization.

| Property | Value |
|----------|-------|
| Base model | `zai-org/GLM-5.1` (744B MoE, 256 experts/layer) |
| Architecture | `GlmMoeDsaForCausalLM` (MoE + Dynamic Sparse Attention) |
| Routed experts | 256 → 192 (25% removed, 64 removed per layer) |
| Active params/token | ~14B (top-8 routing preserved) |
| Quantization | GPTQ W4A16 (int4 symmetric, group_size=128) |
| Quantizer | auto-round 0.12.2 (200 iterations, SignSGD) |
| Quantized size | **277 GB** (56 safetensors shards) |
| BF16 source | [`0xSero/GLM-5.1-555B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP) |
| GGUF variant | [`0xSero/GLM-5.1-555B-A14B-REAP-GGUF`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP-GGUF) (325 GB, Q4_K_M) |

## Benchmark Results (GGUF Q4_K_M, inference mode, temp=0.8)

The GPTQ W4A16 checkpoint uses the same learned rounding method (AutoRound) as the GGUF Q4_K_M.
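The core idea behind AutoRound's learned rounding can be illustrated on a toy layer: instead of rounding each weight to the nearest int4 level, a small additive rounding offset in [-0.5, 0.5] is optimized with signed gradient steps (SignSGD) to minimize the layer's output error on calibration data. The sketch below is a simplified illustration of that idea, not AutoRound's actual implementation — the shapes, learning rate, per-row scales, and best-candidate tracking are arbitrary choices made here for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear layer plus calibration batch (stand-ins for one layer + pile-10k samples).
A = rng.normal(size=(32, 32))
X = rng.normal(size=(64, 32)) @ A        # correlated features, like real activations
W = rng.normal(size=(16, 32))
ref = X @ W.T                            # the float layer's calibration outputs

# Symmetric int4: one scale per output row (the real run uses group_size=128).
scale = np.abs(W).max(axis=1, keepdims=True) / 7.0

def dequant(offset):
    """Fake-quantize W to int4 with a learnable rounding offset in [-0.5, 0.5]."""
    return np.clip(np.round(W / scale + offset), -8, 7) * scale

def loss(offset):
    return np.mean((X @ dequant(offset).T - ref) ** 2)

# SignSGD on the rounding offset, using a straight-through gradient through round().
offset = np.zeros_like(W)
best_offset, best_loss = offset, loss(offset)   # start = plain round-to-nearest
lr = 0.005
for _ in range(200):
    err = X @ dequant(offset).T - ref
    grad = 2.0 * err.T @ X / len(X)              # dL/dWq; its sign is unchanged by the
    offset = np.clip(offset - lr * np.sign(grad), -0.5, 0.5)  # positive scale factor
    if loss(offset) < best_loss:
        best_offset, best_loss = offset, loss(offset)

rtn_loss = loss(np.zeros_like(W))
print(best_loss <= rtn_loss)   # True: learned rounding is never worse than RTN here
```

Because the best candidate seen (including the round-to-nearest starting point) is kept, the learned result cannot be worse than plain rounding on the calibration batch; on correlated activations it is typically strictly better.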
Benchmark scores from the GGUF variant (zero repetition loops):

| Suite | Metric | Result | Repetition Loops |
|-------|--------|--------|------------------|
| Terminal-Bench (50) | Proxy Pass | 44/50 (88%) | 0/50 |
| SWE-bench Pro (50) | Proxy Pass | 33/50 (66%) | 0/50 |
| GSM8K (50) | Correct | 30/50 (60%) | 0/50 |
| HLE (50) | Correct | 9/50 (18%) | 0/50 |

**Zero repetition loops across all 200 benchmark probes.** The 25% prune retains 192/256 experts, providing enough expert diversity for stable generation at all sequence lengths.

## How to Use

### vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16",
    tensor_parallel_size=4,  # 4× B200 or 8× A100
    max_model_len=8192,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.8, max_tokens=4096)
outputs = llm.generate(["Hello, world!"], params)
```

### SGLang

```bash
python -m sglang.launch_server \
  --model-path 0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 \
  --tp 4 \
  --trust-remote-code
```

### Requirements

- ~70-80 GiB VRAM per GPU across 4 GPUs (B200), or ~280 GiB total
- CUDA 12.8+ (sm_100a / Blackwell)
- vLLM >= 0.19.0 with `deep_gemm` installed (for DSA sparse attention)
- `trust_remote_code=True`

## Quantization Details

**Method:** AutoRound W4A16 — learned rounding via SignSGD (200 iterations per layer), calibrated on 128 samples from NeelNanda/pile-10k at sequence length 2048.

**Protected (kept at full precision):**

- Dense MLP layers 0-2 (`gate_proj`, `up_proj`, `down_proj`)
- DSA indexer (`weights_proj`)
- `lm_head`

**Quantized to int4 (43,971/44,059 linear layers):**

- All attention projections (`q_a_proj`, `q_b_proj`, `kv_a_proj`, `kv_b_proj`, `o_proj`)
- All routed MoE expert projections (192 experts × gate/up/down × 75 MoE layers)
- Shared expert projections

**GPTQ config:** `bits=4, group_size=128, sym=true, desc_act=false`

## Why GPTQ over GGUF Q4_K_M?
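A back-of-the-envelope on the size gap: with symmetric int4 and `group_size=128`, GPTQ stores roughly one fp16 scale per 128 weights, i.e. about 4.125 bits per weight before metadata. Treating the two checkpoint sizes stated in this card as proportional to effective bits per weight is a crude assumption (it ignores the unquantized protected layers and container overhead), but it gives a rough sense of where the 48 GB difference comes from:

```python
# GPTQ W4A16, sym, group_size=128: 4-bit weight + one fp16 scale per 128 weights.
gptq_bpw = 4 + 16 / 128
print(gptq_bpw)  # 4.125

# Implied effective bits/weight of Q4_K_M, scaling by the checkpoint sizes
# stated in this card (277 GB GPTQ vs 325 GB GGUF) -- a rough proportion only.
gguf_bytes, gptq_bytes = 325e9, 277e9
q4km_bpw = gguf_bytes / gptq_bytes * gptq_bpw
print(round(q4km_bpw, 2))  # ~4.84
```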
| | GPTQ W4A16 (this) | GGUF Q4_K_M |
|---|---|---|
| Size | 277 GB | 325 GB |
| Serving | vLLM, SGLang, TGI (GPU) | llama.cpp (CPU/GPU hybrid) |
| Quant method | Learned rounding (SignSGD) | K-means clustering |
| Throughput | Higher (GPU-native kernels) | Lower |
| Best for | Production GPU serving | Local inference, edge |

GPTQ packs 4-bit weights more efficiently with `group_size=128` symmetric quantization, resulting in a smaller checkpoint than GGUF Q4_K_M at the same nominal bit-width.

## Related Models

| Model | Prune % | Experts | Format | Size | Status |
|-------|---------|---------|--------|------|--------|
| [`0xSero/GLM-5.1-555B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP) | 25% | 192/256 | BF16 | 1.1T | Source checkpoint |
| [`0xSero/GLM-5.1-555B-A14B-REAP-GGUF`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP-GGUF) | 25% | 192/256 | GGUF Q4_K_M | 325G | llama.cpp serving |
| **This model** | **25%** | **192/256** | **GPTQ W4A16** | **277G** | **vLLM/SGLang serving** |
| [`0xSero/GLM-5.1-444B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP) | 40% | 154/256 | BF16 | 910G | Has repetition issues; use the 25% prune instead |

## Support This Work

If you find these models useful, please consider supporting continued open-source model compression research: **[donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)**

## Citation

If you use this model, please cite the [REAP paper](https://github.com/CerebrasResearch/reap) and [AutoRound](https://github.com/intel/auto-round).

## Sponsors

This work would not have been possible without our kind sponsors:

- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle