# PrunedHub Qwen3-30B-A3B-JP-80pct (MoE-Stream Edition)

An MoE-compressed model that preserves Japanese-language quality: GOBA-AI-Labs' in-house language-aware expert optimization removes 20% of the parameters while protecting Japanese performance.
The model is compressed from Qwen/Qwen3-30B-A3B with GOBA-AI-Labs' language-specialized MoE optimization pipeline.

**Inference engine:** because this model uses layer-adaptive pruning (a different expert count in each layer), inference requires moe-stream. llama.cpp does not currently support the `experts_per_layer` metadata format.
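To make the constraint concrete, here is a small illustrative sketch (not actual moe-stream or llama.cpp code; the function name is hypothetical): a uniform expert count per layer is what standard GGUF consumers expect, while layer-adaptive counts such as those stored in `experts_per_layer` need an engine that reads them.

```python
# Hypothetical illustration: decide which engine can serve a pruned MoE
# checkpoint, given its per-layer expert counts.
def required_engine(experts_per_layer: list[int]) -> str:
    if len(set(experts_per_layer)) == 1:
        # Uniform count: representable in standard GGUF metadata.
        return "llama.cpp or moe-stream"
    # Layer-adaptive counts: only moe-stream reads this layout.
    return "moe-stream"

print(required_engine([128] * 48))            # llama.cpp or moe-stream
print(required_engine([96, 104, 110, 128]))   # moe-stream
```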
## Model specifications

| Item | Value |
|---|---|
| Base model | Qwen/Qwen3-30B-A3B |
| Experts per layer | Layer-adaptive (avg ~102, originally 128) |
| MoE layers | 48 |
| Routing | Top-8 |
| Context length | 32K tokens |
| Quantization | Q4_K_M |
| Inference engine | moe-stream (required) |
| License | Apache 2.0 |
## Benchmark results

### Thinking OFF (no-think mode)

| Benchmark | Original (128 experts) | JP-80pct (~102 experts) | Delta |
|---|---|---|---|
| MMLU (0-shot, 100Q) | 77% | 74% | -3pp |
| GSM8K (0-shot, 50Q) | n/a | 92% | n/a |
| Japanese quality (20Q) | 90% | 85% | -5pp |
### Thinking ON (reasoning mode)

| Benchmark | JP-80pct (Thinking ON) |
|---|---|
| MMLU (0-shot, 100Q) | 79% (+5pp vs. no-think) |
| GSM8K (0-shot, 50Q) | 84% |
| Japanese quality (20Q) | 90% (target met) |
## Size comparison

| Metric | Original | JP-80pct | Reduction |
|---|---|---|---|
| File size | 17.3 GB | 14.0 GB | -19.1% |
| Experts per layer | 128 | ~102 (avg) | -20.3% |
| Experts removed (total) | n/a | 1,248 | n/a |
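The rows above are mutually consistent; a quick back-of-the-envelope check using only numbers from the tables:

```python
# Cross-check the size-comparison figures: 48 MoE layers with 128 experts
# each originally, and 1,248 experts removed in total.
layers = 48
original_per_layer = 128
removed_total = 1248

total_original = layers * original_per_layer   # 6144 experts
remaining = total_original - removed_total     # 4896 experts
avg_per_layer = remaining / layers             # the "~102" in the table
reduction_pct = 100 * removed_total / total_original

print(avg_per_layer)             # 102.0
print(round(reduction_pct, 1))   # 20.3
```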
## Features

- Japanese quality protection: GOBA-AI-Labs' language-aware optimization preserves Japanese reasoning ability
- Best performance with Thinking ON: 79% MMLU and 90% Japanese quality
- GSM8K 92%: mathematical reasoning is fully retained
- 14 GB: runs on machines with 16 GB of RAM
- ~55 tok/s: fast inference on Apple Silicon (Metal GPU)
## Usage

Because this model uses layer-adaptive pruning, it must be run with moe-stream.

### Installation

```bash
# Build moe-stream (Rust + Metal GPU)
git clone https://github.com/GOBA-AI-Labs/moe-stream
cd moe-stream
cargo build --release --features metal,accelerate

# Download the model
huggingface-cli download goba-ai-labs/PrunedHub-Qwen3-30B-A3B-JP-80pct \
  --local-dir models/
```
### CLI inference

```bash
# Text generation (the prompt asks about Japan's four seasons)
./target/release/moe-stream models/PrunedHub-Qwen3-30B-A3B-JP-80pct-Q4_K_M.gguf 512 \
  --prompt "日本の四季について教えてください" --stream

# Thinking ON mode (recommended: higher accuracy)
./target/release/moe-stream models/PrunedHub-Qwen3-30B-A3B-JP-80pct-Q4_K_M.gguf 1024 \
  --think --stream
```
### OpenAI-compatible HTTP server

```bash
# Start the server
./target/release/moe-stream-server \
  --model models/PrunedHub-Qwen3-30B-A3B-JP-80pct-Q4_K_M.gguf --port 11434

# Test with curl (the prompt asks for the capital of Japan)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"日本の首都はどこですか?"}],"stream":true}'
```
### From Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
response = client.chat.completions.create(
    model="local",
    # The prompt asks for an explanation of quantum computers
    messages=[{"role": "user", "content": "量子コンピュータについて説明してください"}],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
## Method

The model was pruned with GOBA-AI-Labs' in-house language-aware expert optimization pipeline:

- Calibration-based importance scoring: multilingual text (Japanese, English, code, math) is actually run through the model, and each expert's importance is measured from its activation patterns. This yields substantially more accurate importance rankings than static weight analysis.
- Automatic detection and protection of language-specialized experts: a cross-language differential analysis of the MoE routing patterns identifies the experts that contribute to Japanese quality and shields them from pruning.
- Layer-adaptive expert allocation: the optimal expert count is decided per layer, based on each layer's contribution to quality. This retains far more quality than uniform pruning.
- Thinking-mode support: evaluated with Thinking both ON and OFF; Thinking ON gains +5pp on MMLU and reaches 90% Japanese quality.
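A minimal sketch of what calibration-based importance scoring with top-8 routing could look like (shapes, the random calibration data, and the function names are illustrative assumptions, not GOBA-AI-Labs' actual pipeline): route calibration tokens, count how often each expert is selected, then keep a layer-specific number of the most-used experts.

```python
import numpy as np

def expert_selection_counts(router_logits: np.ndarray, top_k: int = 8) -> np.ndarray:
    """router_logits: (num_tokens, num_experts) gate scores for one MoE layer.
    Returns how often each expert lands in the top-k over the calibration set."""
    num_experts = router_logits.shape[1]
    topk = np.argsort(router_logits, axis=1)[:, -top_k:]  # top-8 routing
    return np.bincount(topk.ravel(), minlength=num_experts)

def keep_experts(counts: np.ndarray, keep: int) -> np.ndarray:
    """Indices of the `keep` most frequently routed experts; in a
    layer-adaptive scheme `keep` would differ from layer to layer."""
    return np.sort(np.argsort(counts)[-keep:])

rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(1000, 128))   # stand-in for real activations
counts = expert_selection_counts(fake_logits)
kept = keep_experts(counts, keep=102)
print(len(kept), counts.sum())  # 102 8000
```

The language-protection step described above would then force-include the experts flagged as Japanese-specialized into `kept` regardless of their raw counts.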
## Inference engine: moe-stream

moe-stream is a Rust MoE inference engine developed by GOBA-AI-Labs.

| Feature | Details |
|---|---|
| Inference modes | GPU Resident / GPU Hybrid / SSD Streaming (auto-selected) |
| GPU support | Apple Metal / NVIDIA CUDA |
| Quantization | Q2K-Q8K, MXFP4, F16, F32 (13 formats supported) |
| API | OpenAI-compatible HTTP / JSONL / MCP |
| Special features | Q4 quantized MatMul (+79% speedup), Dynamic K |
## Citation

```bibtex
@misc{goba-ai-labs-prunedhub-qwen3-30b-jp,
  title={PrunedHub Qwen3-30B-A3B-JP-80pct: Japanese-Quality-Preserving MoE Compression},
  author={GOBA-AI-Labs},
  year={2026},
  url={https://huggingface.co/goba-ai-labs/PrunedHub-Qwen3-30B-A3B-JP-80pct}
}
```