PrunedHub GPT-OSS-20B-27x-Zerobias

Experimental: Near-lossless MoE pruning with proprietary router optimization.

15.6% of experts removed (5 of 32 per layer) with only a 1pp MMLU drop, achieved through GOBA-AI-Labs' Zerobias router-optimization technique.

Model Details

| Property | Value |
|---|---|
| Base Model | openai/gpt-oss-20b |
| Total Parameters | ~16.5B |
| Active Parameters | 3.6B per token |
| Experts per Layer | 27 (from 32, uniform) |
| MoE Layers | 24 |
| Routing | Top-4, sigmoid activation, bias-optimized |
| Context Length | 128K tokens |
| Quantization | Q4_K_M |
| License | Apache 2.0 |

Benchmark Results

| Benchmark | Original (32 experts) | 28x (standard) | 27x Zerobias |
|---|---|---|---|
| MMLU (0-shot, 100Q) | 78% | 78% | 77% (-1pp) |
| GSM8K (0-shot, 50Q) | — | 92% | 84% |

What is Zerobias?

Standard MoE pruning at this compression level causes a sharp quality cliff (-10pp). GOBA-AI-Labs' Zerobias technique recovers most of the lost quality, turning a -10pp cliff into only -1pp.

This is an experimental release demonstrating the technique. For production use, we recommend the 28x model which achieves lossless compression.

Methodology

  • Calibration-based importance scoring: Expert importance is measured through actual inference behavior, producing more accurate rankings than static weight analysis
  • Layer-adaptive expert allocation: Each layer retains a dynamically determined number of experts based on its measured contribution to model quality
  • Zerobias router optimization: After expert pruning, the MoE router still carries learned biases calibrated for the original expert count. Zerobias neutralizes these stale routing biases, allowing the router to redistribute load optimally among the remaining experts. This zero-cost post-processing step recovers quality at the pruning cliff without any retraining
  • Cliff recovery: MoE models exhibit sharp quality cliffs at specific pruning thresholds. Zerobias specifically targets this phenomenon, extending the lossless pruning frontier by one additional expert per layer
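The Zerobias step itself is proprietary, so the sketch below only illustrates the routing mechanics it operates on: a toy linear router with a learned per-expert bias, where pruning keeps 27 of 32 expert rows and zeroes the stale bias before top-4 sigmoid routing. All names, shapes, and the random weights here are assumptions, not the actual implementation.

```python
# Toy sketch of post-pruning router cleanup in the spirit of the Zerobias
# step described above. Illustrative only: the real technique is proprietary.
import math
import random

def prune_router(w_rows, biases, keep_ids):
    """Keep routing rows for retained experts; zero the stale per-expert
    bias that was calibrated for the original 32-expert count."""
    w = [w_rows[i] for i in keep_ids]
    b = [0.0] * len(keep_ids)  # "zero bias": drop stale routing offsets
    return w, b

def route(x, w, b, top_k=4):
    """Top-4 sigmoid routing: select the 4 highest-logit experts, gate with
    sigmoid, and renormalize the gates over the survivors."""
    logits = [sum(wi * xi for wi, xi in zip(row, x)) + bi
              for row, bi in zip(w, b)]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = [1.0 / (1.0 + math.exp(-logits[i])) for i in top]  # sigmoid activation
    total = sum(gates)
    gates = [g / total for g in gates]  # redistribute load among survivors
    return top, gates

random.seed(0)
d_model, n_orig, n_keep = 16, 32, 27
W = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_orig)]
B = [random.gauss(0, 1) for _ in range(n_orig)]  # stale learned bias
w, b = prune_router(W, B, list(range(n_keep)))
experts, gates = route([random.gauss(0, 1) for _ in range(d_model)], w, b)
print(experts, [round(g, 3) for g in gates])
```

Zeroing the bias rather than keeping the original 32-expert calibration is the "neutralize stale routing biases" idea from the bullet above; the sigmoid gates are then renormalized over the four selected survivors.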

Size Comparison

| Metric | Original | 28x | 27x Zerobias |
|---|---|---|---|
| File Size | 11.67 GB | 10.40 GB | ~9.4 GB |
| Experts/Layer | 32 | 28 | 27 |
| Size Reduction | — | -10.9% | -19.5% |
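The reduction column follows directly from the file sizes; a quick check (sizes in GB taken from the table, with the 27x figure approximate in the source):

```python
# Sanity check of the size-reduction figures in the table above.
original, x28, x27 = 11.67, 10.40, 9.4

def reduction_pct(size, base=original):
    """Percent smaller than the original file."""
    return round(100 * (1 - size / base), 1)

print(reduction_pct(x28), reduction_pct(x27))  # 10.9 and 19.5
```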

Usage

llama.cpp (recommended)

This model uses uniform expert counts and is fully compatible with llama.cpp:

```shell
llama-server -m PrunedHub-GPT-OSS-20B-27x-Zerobias-Q4_K_M.gguf --port 8090 -ngl 99 -c 4096
```
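For programmatic access, llama-server also exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the port given above. A minimal client sketch using only the standard library; the model-name value is a placeholder (llama.cpp serves whichever model was loaded):

```python
# Build an OpenAI-style chat request against the llama-server started above.
import json
import urllib.request

def build_chat_request(prompt, url="http://localhost:8090/v1/chat/completions"):
    payload = {
        "model": "PrunedHub-GPT-OSS-20B-27x-Zerobias",  # placeholder name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Explain quantum computing in one paragraph.")
# Uncomment once llama-server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```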

moe-stream

Also supported by moe-stream, which offers GPU-resident inference and an OpenAI-compatible HTTP API:

```shell
# CLI inference
moe-stream PrunedHub-GPT-OSS-20B-27x-Zerobias-Q4_K_M.gguf 512 \
  --prompt "Explain quantum computing" --stream

# OpenAI-compatible HTTP server
moe-stream-server --model PrunedHub-GPT-OSS-20B-27x-Zerobias-Q4_K_M.gguf --port 11434
```

Citation

```bibtex
@misc{goba-ai-labs-prunedhub-gptoss-27x-zerobias,
  title={PrunedHub GPT-OSS-20B-27x-Zerobias: Near-Lossless MoE Pruning with Router Optimization},
  author={GOBA-AI-Labs},
  year={2026},
  url={https://huggingface.co/GOBA-AI-Labs/PrunedHub-GPT-OSS-20B-27x-Zerobias}
}
```
