PrunedHub GPT-OSS-20B-27x-Zerobias

Experimental: Near-lossless MoE pruning with proprietary router optimization.

15.6% of experts removed (5 of 32 per layer) with only a 1pp MMLU drop, achieved through GOBA-AI-Labs' Zerobias router-optimization technique.

Model Details

| Property | Value |
|---|---|
| Base Model | openai/gpt-oss-20b |
| Total Parameters | ~16.5B |
| Active Parameters | 3.6B per token |
| Experts per Layer | 27 (from 32, uniform) |
| MoE Layers | 24 |
| Routing | Top-4, sigmoid activation, bias-optimized |
| Context Length | 128K tokens |
| Quantization | Q4_K_M |
| License | Apache 2.0 |

Benchmark Results

| Benchmark | Original (32 experts) | 28x (standard) | 27x Zerobias |
|---|---|---|---|
| MMLU (0-shot, 100Q) | 78% | 78% | 77% (-1pp) |
| GSM8K (0-shot, 50Q) | — | 92% | 84% |

What is Zerobias?

Standard MoE pruning at this compression level causes a sharp quality cliff (-10pp). GOBA-AI-Labs' Zerobias technique recovers most of the lost quality, turning a -10pp cliff into only -1pp.

This is an experimental release demonstrating the technique. For production use, we recommend the 28x model which achieves lossless compression.

Methodology

  • Calibration-based importance scoring: Expert importance is measured through actual inference behavior, producing more accurate rankings than static weight analysis
  • Layer-adaptive expert allocation: Each layer retains a dynamically determined number of experts based on its measured contribution to model quality
  • Zerobias router optimization: After expert pruning, the MoE router still carries learned biases calibrated for the original expert count. Zerobias neutralizes these stale routing biases, allowing the router to redistribute load optimally among the remaining experts. This zero-cost post-processing step recovers quality at the pruning cliff without any retraining
  • Cliff recovery: MoE models exhibit sharp quality cliffs at specific pruning thresholds. Zerobias specifically targets this phenomenon, extending the lossless pruning frontier by one additional expert per layer
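The Zerobias step itself is proprietary, so the sketch below only illustrates the routing mechanics it operates on: a toy linear router with a learned per-expert bias, where pruning keeps 27 of 32 expert rows and zeroes the stale bias before top-4 sigmoid routing. All names, shapes, and the random weights here are assumptions, not the actual implementation.

```python
# Toy sketch of post-pruning router cleanup in the spirit of the Zerobias
# step described above. Illustrative only: the real technique is proprietary.
import math
import random

def prune_router(w_rows, biases, keep_ids):
    """Keep routing rows for retained experts; zero the stale per-expert
    bias that was calibrated for the original 32-expert count."""
    w = [w_rows[i] for i in keep_ids]
    b = [0.0] * len(keep_ids)  # "zero bias": drop stale routing offsets
    return w, b

def route(x, w, b, top_k=4):
    """Top-4 sigmoid routing: select the 4 highest-logit experts, gate with
    sigmoid, and renormalize the gates over the survivors."""
    logits = [sum(wi * xi for wi, xi in zip(row, x)) + bi
              for row, bi in zip(w, b)]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = [1.0 / (1.0 + math.exp(-logits[i])) for i in top]  # sigmoid activation
    total = sum(gates)
    gates = [g / total for g in gates]  # redistribute load among survivors
    return top, gates

random.seed(0)
d_model, n_orig, n_keep = 16, 32, 27
W = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_orig)]
B = [random.gauss(0, 1) for _ in range(n_orig)]  # stale learned bias
w, b = prune_router(W, B, list(range(n_keep)))
experts, gates = route([random.gauss(0, 1) for _ in range(d_model)], w, b)
print(experts, [round(g, 3) for g in gates])
```

Zeroing the bias rather than keeping the original 32-expert calibration is the "neutralize stale routing biases" idea from the bullet above; the sigmoid gates are then renormalized over the four selected survivors.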

Size Comparison

| Metric | Original | 28x | 27x Zerobias |
|---|---|---|---|
| File Size | 11.67 GB | 10.40 GB | ~9.4 GB |
| Experts/Layer | 32 | 28 | 27 |
| Size Reduction | — | -10.9% | -19.5% |
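The reduction column follows directly from the file sizes; a quick check (sizes in GB taken from the table, with the 27x figure approximate in the source):

```python
# Sanity check of the size-reduction figures in the table above.
original, x28, x27 = 11.67, 10.40, 9.4

def reduction_pct(size, base=original):
    """Percent smaller than the original file."""
    return round(100 * (1 - size / base), 1)

print(reduction_pct(x28), reduction_pct(x27))  # 10.9 and 19.5
```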

Usage

llama.cpp (recommended)

This model uses uniform expert counts and is fully compatible with llama.cpp:

```shell
llama-server -m PrunedHub-GPT-OSS-20B-27x-Zerobias-Q4_K_M.gguf --port 8090 -ngl 99 -c 4096
```
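For programmatic access, llama-server also exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the port given above. A minimal client sketch using only the standard library; the model-name value is a placeholder (llama.cpp serves whichever model was loaded):

```python
# Build an OpenAI-style chat request against the llama-server started above.
import json
import urllib.request

def build_chat_request(prompt, url="http://localhost:8090/v1/chat/completions"):
    payload = {
        "model": "PrunedHub-GPT-OSS-20B-27x-Zerobias",  # placeholder name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Explain quantum computing in one paragraph.")
# Uncomment once llama-server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```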

moe-stream

Also supported by moe-stream, which offers GPU-resident inference and an OpenAI-compatible HTTP API:

```shell
# CLI inference
moe-stream PrunedHub-GPT-OSS-20B-27x-Zerobias-Q4_K_M.gguf 512 \
  --prompt "Explain quantum computing" --stream

# OpenAI-compatible HTTP server
moe-stream-server --model PrunedHub-GPT-OSS-20B-27x-Zerobias-Q4_K_M.gguf --port 11434
```

Citation

```bibtex
@misc{goba-ai-labs-prunedhub-gptoss-27x-zerobias,
  title={PrunedHub GPT-OSS-20B-27x-Zerobias: Near-Lossless MoE Pruning with Router Optimization},
  author={GOBA-AI-Labs},
  year={2026},
  url={https://huggingface.co/GOBA-AI-Labs/PrunedHub-GPT-OSS-20B-27x-Zerobias}
}
```
