GLM-5.1 — 25% Expert Pruned (REAP) — W4A16

This is a GPTQ 4-bit weight-quantized variant of the 25% expert-pruned zai-org/GLM-5.1 using REAP (Relative Expert Activation Pruning), produced with AutoRound for learned rounding optimization.

Property	Value
Base model	`zai-org/GLM-5.1` (744B MoE, 256 experts/layer)
Architecture	`GlmMoeDsaForCausalLM` (MoE + Dynamic Sparse Attention)
Routed experts	256 → 192 (25% removed, 64 per layer)
Active params/token	~14B (top-8 routing preserved)
Quantization	GPTQ W4A16 (int4 symmetric, group_size=128)
Quantizer	auto-round 0.12.2 (200 iterations, SignSGD)
Quantized size	277 GB (56 safetensor shards)
BF16 source	`0xSero/GLM-5.1-555B-A14B-REAP`
GGUF variant	`0xSero/GLM-5.1-555B-A14B-REAP-GGUF` (325 GB, Q4_K_M)

Benchmark Results (GGUF Q4_K_M, inference mode, temp=0.8)

The GPTQ W4A16 uses the same learned rounding method (AutoRound) as the GGUF Q4_K_M. Benchmark scores from the GGUF variant (zero repetition loops):

Suite	Metric	Result	Repetition Loops
Terminal-Bench (50)	Proxy Pass	44/50 (88%)	0/50
SWE-bench Pro (50)	Proxy Pass	33/50 (66%)	0/50
GSM8K (50)	Correct	30/50 (60%)	0/50
HLE (50)	Correct	9/50 (18%)	0/50

Zero repetition loops across 220 benchmark probes. The 25% prune retains 192/256 experts, providing enough expert diversity for stable generation at all sequence lengths.

How to Use

vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16",
    tensor_parallel_size=4,    # 4× B200 or 8× A100
    max_model_len=8192,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.8, max_tokens=4096)
outputs = llm.generate(["Hello, world!"], params)

SGLang

python -m sglang.launch_server \
  --model-path 0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 \
  --tp 4 \
  --trust-remote-code

Requires

~70-80 GiB VRAM per GPU across 4 GPUs (B200), or ~280 GiB total
CUDA 12.8+ (sm_100a / Blackwell)
vLLM >= 0.19.0 with deep_gemm installed (for DSA sparse attention)
trust_remote_code=True

Quantization Details

Method: AutoRound W4A16 — learned rounding via SignSGD (200 iterations per layer), calibrated on 128 samples from NeelNanda/pile-10k at 2048 sequence length.

Protected (kept at full precision):

Dense MLP layers 0-2 (gate_proj, up_proj, down_proj)
DSA indexer (weights_proj)
lm_head

Quantized to int4 (43,971/44,059 linear layers):

All attention projections (q_a_proj, q_b_proj, kv_a_proj, kv_b_proj, o_proj)
All routed MoE expert projections (192 experts × gate/up/down × 75 MoE layers)
Shared expert projections

GPTQ config: bits=4, group_size=128, sym=true, desc_act=false

Why GPTQ over GGUF Q4_K_M?

	GPTQ W4A16 (this)	GGUF Q4_K_M
Size	277 GB	325 GB
Serving	vLLM, SGLang, TGI (GPU)	llama.cpp (CPU/GPU hybrid)
Quant method	Learned rounding (SignSGD)	K-means clustering
Throughput	Higher (GPU-native kernels)	Lower
Best for	Production GPU serving	Local inference, edge

GPTQ packs 4-bit weights more efficiently with group_size=128 symmetric quantization, resulting in a smaller checkpoint than GGUF Q4_K_M at the same bit-width.

Related Models

Model	Prune %	Experts	Format	Size	Status
`0xSero/GLM-5.1-555B-A14B-REAP`	25%	192/256	BF16	1.1T	Source checkpoint
`0xSero/GLM-5.1-555B-A14B-REAP-GGUF`	25%	192/256	GGUF Q4_K_M	325G	llama.cpp serving
This model	25%	192/256	GPTQ W4A16	277G	vLLM/SGLang serving
`0xSero/GLM-5.1-444B-A14B-REAP`	40%	154/256	BF16	910G	Has repetition issues — use 25%

Support This Work

If you find these models useful, please consider supporting continued open-source model compression research:

donate.sybilsolutions.ai

Citation

If you use this model, please cite the REAP paper and AutoRound.

Downloads last month: 28

Safetensors

Model size

78B params

Tensor type

I32

BF16

F32

Model tree for 0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16

Base model

zai-org/GLM-5.1

Finetuned

0xSero/GLM-5.1-555B-A14B-REAP

Quantized

(1)

this model