
PrunedHub Qwen3-30B-A3B-JP-80pct (MoE-Stream Edition)

A compressed MoE model that preserves Japanese-language quality: GOBA-AI-Labs' proprietary language-aware expert optimization removes 20% of the parameters while protecting Japanese performance.

It is built on Qwen3-30B-A3B and compressed with the GOBA-AI-Labs language-specialized MoE optimization pipeline.

Inference engine: this model uses layer-adaptive pruning (a different number of experts per layer), so moe-stream is required for inference. llama.cpp does not currently support the experts_per_layer metadata format.

Model specifications

Item                Value
Base model          Qwen/Qwen3-30B-A3B
Experts per layer   Layer-adaptive (average ~102; originally 128)
MoE layers          48
Routing             Top-8
Context length      32K tokens
Quantization        Q4_K_M
Inference engine    moe-stream (required)
License             Apache 2.0
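Top-8 routing works here as in the base model: the router's scores for a token are softmaxed, the eight highest-scoring experts are kept, and their weights are renormalized. A minimal pure-Python sketch of that selection step (the expert count and logits are made up for illustration; the real router runs inside the engine):

```python
import math
import random

def top_k_route(logits, k=8):
    """Softmax the router logits, keep the top-k experts,
    and renormalize their weights so they sum to 1."""
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]  # numerically stable softmax
    total = sum(probs)
    probs = [p / total for p in probs]
    # Indices of the k highest-probability experts
    top = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in top)
    return [(i, probs[i] / z) for i in top]

random.seed(0)
# With layer-adaptive pruning, one layer might expose e.g. 102 experts.
logits = [random.gauss(0, 1) for _ in range(102)]
routed = top_k_route(logits, k=8)
print(len(routed))                           # 8
print(round(sum(w for _, w in routed), 6))   # 1.0
```

With fewer experts per layer the same Top-8 selection simply runs over a smaller candidate set, which is why routing semantics are unchanged after pruning.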

Benchmark results

Thinking OFF (no-think mode)

Benchmark                 Original (128 experts)   JP-80pct (~102 experts)   Delta
MMLU (0-shot, 100Q)       77%                      74%                       -3pp
GSM8K (0-shot, 50Q)       n/a                      92%                       n/a
Japanese quality (20Q)    90%                      85%                       -5pp

Thinking ON (reasoning mode)

Benchmark                 JP-80pct (Thinking ON)
MMLU (0-shot, 100Q)       79% (+5pp vs no-think)
GSM8K (0-shot, 50Q)       84%
Japanese quality (20Q)    90% (target met)

Size comparison

Metric              Original   JP-80pct         Reduction
File size           17.3 GB    14.0 GB          -19.1%
Experts per layer   128        ~102 (average)   -20.3%
Experts removed     n/a        1,248            n/a
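The expert counts in the specification and size tables are mutually consistent; a quick sanity check of the arithmetic:

```python
layers = 48
original_per_layer = 128
removed = 1248

total_original = layers * original_per_layer   # 6144 experts in total
remaining = total_original - removed           # 4896 experts kept
avg_per_layer = remaining / layers             # 102.0 -> the "~102" average
expert_reduction = removed / total_original    # 0.203125 -> -20.3%
size_reduction = 1 - 14.0 / 17.3               # ~0.191  -> -19.1%

print(avg_per_layer, round(expert_reduction * 100, 1), round(size_reduction * 100, 1))
# 102.0 20.3 19.1
```

The file-size reduction (-19.1%) is slightly smaller than the expert reduction (-20.3%) because non-expert tensors (attention, embeddings, router) are untouched by pruning.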

Features

  • Japanese quality protection: GOBA-AI-Labs' proprietary language-aware optimization preserves Japanese reasoning ability
  • Peak performance with Thinking ON: reaches 79% on MMLU and 90% on Japanese quality
  • GSM8K 92%: mathematical reasoning ability is fully retained
  • 14 GB: runs in a 16 GB RAM environment
  • ~55 tok/s: fast inference on Apple Silicon (Metal GPU)

Usage

ใ“ใฎใƒขใƒ‡ใƒซใฏใƒฌใ‚คใƒคใƒผ้ฉๅฟœๅž‹ pruning ใ‚’ไฝฟ็”จใ—ใฆใ„ใ‚‹ใŸใ‚ใ€moe-stream ใงใฎๆŽจ่ซ–ใŒๅฟ…้ ˆใงใ™ใ€‚

Installation

# Build moe-stream (Rust + Metal GPU)
git clone https://github.com/GOBA-AI-Labs/moe-stream
cd moe-stream
cargo build --release --features metal,accelerate

# Download the model
huggingface-cli download goba-ai-labs/PrunedHub-Qwen3-30B-A3B-JP-80pct \
  --local-dir models/

CLI inference

# Text generation
./target/release/moe-stream models/PrunedHub-Qwen3-30B-A3B-JP-80pct-Q4_K_M.gguf 512 \
  --prompt "日本の四季について教えてください" --stream

# Thinking ON mode (recommended; higher accuracy)
./target/release/moe-stream models/PrunedHub-Qwen3-30B-A3B-JP-80pct-Q4_K_M.gguf 1024 \
  --think --stream

OpenAI-compatible HTTP server

# Start the server
./target/release/moe-stream-server \
  --model models/PrunedHub-Qwen3-30B-A3B-JP-80pct-Q4_K_M.gguf --port 11434

# Test with curl
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"日本の首都はどこですか？"}],"stream":true}'
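With "stream":true the server replies as server-sent events: one "data: " line per chunk, content deltas under choices[0].delta.content, and a "[DONE]" sentinel at the end. A small self-contained parser sketch over a hand-written sample payload (the sample is illustrative, not captured server output):

```python
import json

def extract_deltas(sse_text):
    """Collect the content deltas from an OpenAI-style SSE stream."""
    out = []
    for line in sse_text.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("content"):
            out.append(delta["content"])
    return "".join(out)

sample = (
    'data: {"choices":[{"delta":{"role":"assistant"}}]}\n'
    'data: {"choices":[{"delta":{"content":"Tokyo"}}]}\n'
    'data: {"choices":[{"delta":{"content":" is the capital."}}]}\n'
    "data: [DONE]\n"
)
print(extract_deltas(sample))  # Tokyo is the capital.
```

In practice the openai client shown below does this parsing for you; the sketch only makes the wire format explicit.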

Using from Python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "量子コンピュータについて説明してください"}],
    stream=True
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

About the method

The model was produced with the GOBA-AI-Labs proprietary language-aware expert optimization pipeline.

  • Calibration-based importance scoring: multilingual text (Japanese, English, code, math) is actually run through the model, and each expert's importance is measured from its activation patterns. This yields substantially more accurate importance rankings than static weight analysis
  • Automatic detection and protection of language-specialized experts: cross-language differential analysis of MoE routing patterns automatically identifies the experts that contribute to Japanese quality and protects them from pruning
  • Layer-adaptive expert allocation: the optimal number of experts per layer is decided dynamically from each layer's contribution to quality, retaining substantially more quality than uniform pruning
  • Thinking-mode support: evaluated with Thinking both ON and OFF. With Thinking ON, MMLU improves by +5pp and Japanese quality reaches 90%
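The pipeline itself is not public, but the first two bullets can be sketched in the abstract: count how often each expert is activated on each calibration set, rank experts by total activation, and exempt experts whose activations are heavily skewed toward the Japanese set. Everything below (the routing traces, the skew threshold, the helper names) is synthetic illustration, not the actual GOBA-AI-Labs pipeline:

```python
from collections import Counter

def score_experts(routing_traces):
    """routing_traces: {language: [expert_id, ...]} -- which expert each
    routed token activated during calibration, per language."""
    per_lang = {lang: Counter(ids) for lang, ids in routing_traces.items()}
    total = Counter()
    for counts in per_lang.values():
        total.update(counts)
    return per_lang, total

def prune_set(per_lang, total, protect_lang="ja", skew=0.6, keep=4):
    """Drop the least-activated experts, but protect any expert that gets
    more than `skew` of its activations from the protected language."""
    protected = {e for e in total if per_lang[protect_lang][e] / total[e] > skew}
    ranked = sorted(total, key=total.get)  # least important first
    victims = []
    for e in ranked:
        if len(total) - len(victims) <= keep:
            break  # reached the target expert count
        if e not in protected:
            victims.append(e)
    return set(victims), protected

# Synthetic traces: expert 3 is used almost only by Japanese text.
traces = {
    "ja": [0, 3, 3, 3, 1, 3],
    "en": [0, 0, 1, 2, 1, 0],
}
per_lang, total = score_experts(traces)
pruned, protected = prune_set(per_lang, total, keep=3)
print(sorted(protected), sorted(pruned))  # [3] [2]
```

Expert 3 survives despite ties in total activation because its usage is almost entirely Japanese; expert 2, the least-activated unprotected expert, is pruned first. Making the keep budget per layer a function of measured quality contribution, rather than a constant, gives the layer-adaptive allocation described above.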

Inference engine: moe-stream

moe-stream is a Rust MoE inference engine developed by GOBA-AI-Labs.

Feature            Details
Inference modes    GPU Resident / GPU Hybrid / SSD Streaming (auto-selected)
GPU support        Apple Metal / NVIDIA CUDA
Quantization       Q2K-Q8K, MXFP4, F16, F32 (13 formats supported)
API                OpenAI-compatible HTTP / JSONL / MCP
Special features   Q4 Quantized MatMul (+79% speedup), Dynamic K

Citation

@misc{goba-ai-labs-prunedhub-qwen3-30b-jp,
  title={PrunedHub Qwen3-30B-A3B-JP-80pct: Japanese-Quality-Preserving MoE Compression},
  author={GOBA-AI-Labs},
  year={2026},
  url={https://huggingface.co/goba-ai-labs/PrunedHub-Qwen3-30B-A3B-JP-80pct}
}
