# PrunedHub GPT-OSS-20B-27x-Zerobias

**Experimental** – Near-lossless MoE pruning with proprietary router optimization.

15.6% of experts removed with only a 1pp quality loss on MMLU, achieved through GOBA-AI-Labs' Zerobias router optimization technique.
## Model Details
| Property | Value |
|---|---|
| Base Model | openai/gpt-oss-20b |
| Total Parameters | ~16.5B |
| Active Parameters | 3.6B per token |
| Experts per Layer | 27 (pruned from 32; uniform across layers) |
| MoE Layers | 24 |
| Routing | Top-4, sigmoid activation, bias-optimized |
| Context Length | 128K tokens |
| Quantization | Q4_K_M |
| License | Apache 2.0 |
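The routing scheme in the table above (top-4 selection over sigmoid-activated router scores plus a learned per-expert bias) can be sketched as follows. All names, shapes, and the normalization step are illustrative assumptions, not the model's actual implementation:

```python
import numpy as np

def route_top4(hidden, w_router, router_bias, k=4):
    """Pick k experts per token from sigmoid gate scores (illustrative sketch)."""
    logits = hidden @ w_router + router_bias          # (tokens, n_experts)
    scores = 1.0 / (1.0 + np.exp(-logits))            # sigmoid activation
    topk = np.argsort(-scores, axis=-1)[:, :k]        # indices of the k best experts
    gates = np.take_along_axis(scores, topk, axis=-1) # their gate scores
    gates = gates / gates.sum(axis=-1, keepdims=True) # normalize to mixing weights
    return topk, gates

rng = np.random.default_rng(0)
idx, gates = route_top4(rng.normal(size=(2, 8)),      # 2 tokens, hidden dim 8
                        rng.normal(size=(8, 27)),     # router for 27 experts
                        np.zeros(27))
```

With 27 experts per layer, each token still activates only its top 4, which is why active parameters stay at 3.6B per token.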
## Benchmark Results
| Benchmark | Original (32 experts) | 28x (standard) | 27x Zerobias |
|---|---|---|---|
| MMLU (0-shot, 100Q) | 78% | 78% | 77% (-1pp) |
| GSM8K (0-shot, 50Q) | – | 92% | 84% |
## What is Zerobias?
Standard MoE pruning at this compression level causes a sharp quality cliff (-10pp). GOBA-AI-Labs' Zerobias technique recovers most of the lost quality, turning a -10pp cliff into only -1pp.
This is an experimental release demonstrating the technique. For production use, we recommend the 28x model which achieves lossless compression.
## Methodology
- Calibration-based importance scoring: Expert importance is measured through actual inference behavior, producing more accurate rankings than static weight analysis
- Layer-adaptive expert allocation: Each layer retains a dynamically determined number of experts based on its measured contribution to model quality
- Zerobias router optimization: After expert pruning, the MoE router still carries learned biases calibrated for the original expert count. Zerobias neutralizes these stale routing biases, allowing the router to redistribute load optimally among the remaining experts. This zero-cost post-processing step recovers quality at the pruning cliff without any retraining
- Cliff recovery: MoE models exhibit sharp quality cliffs at specific pruning thresholds. Zerobias specifically targets this phenomenon, extending the lossless pruning frontier by one additional expert per layer
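Since Zerobias itself is proprietary, the sketch below shows only one plausible reading of the bias-neutralization step described above: drop the pruned experts' router columns and reset the surviving per-expert bias terms to zero. The function name, shapes, and selection of kept experts are all hypothetical:

```python
import numpy as np

def prune_experts_zerobias(w_router, router_bias, keep_idx):
    """Drop pruned experts' router columns and neutralize stale routing biases.

    Hypothetical sketch: the learned biases were calibrated for the original
    expert count, so they are zeroed to let the router redistribute load
    among the remaining experts without retraining.
    """
    w_pruned = w_router[:, keep_idx]                                  # keep surviving experts
    bias_pruned = np.zeros(len(keep_idx), dtype=router_bias.dtype)    # neutralized biases
    return w_pruned, bias_pruned

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 32))            # router weights for 32 experts
b = rng.normal(size=32)                 # learned per-expert biases
keep = list(range(27))                  # placeholder; the real selection uses
                                        # calibration-based importance scores
w27, b27 = prune_experts_zerobias(w, b, keep)
```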
## Size Comparison
| Metric | Original | 28x | 27x Zerobias |
|---|---|---|---|
| File Size | 11.67 GB | 10.40 GB | ~9.4 GB |
| Experts/Layer | 32 | 28 | 27 |
| Size Reduction | – | -10.9% | -19.5% |
## Usage

### llama.cpp (recommended)
This model uses uniform expert counts and is fully compatible with llama.cpp:
```shell
llama-server -m PrunedHub-GPT-OSS-20B-27x-Zerobias-Q4_K_M.gguf --port 8090 -ngl 99 -c 4096
```
### moe-stream
Also supported by moe-stream, which offers GPU-resident inference and OpenAI-compatible HTTP API:
```shell
# CLI inference
moe-stream PrunedHub-GPT-OSS-20B-27x-Zerobias-Q4_K_M.gguf 512 \
  --prompt "Explain quantum computing" --stream

# OpenAI-compatible HTTP server
moe-stream-server --model PrunedHub-GPT-OSS-20B-27x-Zerobias-Q4_K_M.gguf --port 11434
```
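Assuming the moe-stream server exposes the conventional OpenAI `/v1/chat/completions` route (an assumption based on its "OpenAI-compatible HTTP API" description, not confirmed here), a minimal Python client could build its request like this:

```python
import json

# Assumed endpoint; moe-stream-server above listens on port 11434.
SERVER_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(prompt,
                       model="PrunedHub-GPT-OSS-20B-27x-Zerobias-Q4_K_M.gguf",
                       max_tokens=256):
    """Build an OpenAI-style chat-completion payload for the local server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Explain quantum computing in one paragraph.")
body = json.dumps(payload).encode("utf-8")

# To actually send (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(SERVER_URL, data=body,
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```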
## Citation
```bibtex
@misc{goba-ai-labs-prunedhub-gptoss-27x-zerobias,
  title={PrunedHub GPT-OSS-20B-27x-Zerobias: Near-Lossless MoE Pruning with Router Optimization},
  author={GOBA-AI-Labs},
  year={2026},
  url={https://huggingface.co/GOBA-AI-Labs/PrunedHub-GPT-OSS-20B-27x-Zerobias}
}
```
## Links
- GOBA AI Labs – project website
- moe-stream – inference engine
- GOBA-AI-Labs on HuggingFace
- PrunedHub GPT-OSS-20B-28x (Lossless)
- Base Model: GPT-OSS-20B
- Support GOBA-AI-Labs on Ko-fi