SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Paper • 2506.24119 • Published • 51
How to use maxbittker/nemotron3-nano-30b-a3b-spiral-step130 with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM
base_model = AutoModelForCausalLM.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16")
model = PeftModel.from_pretrained(base_model, "maxbittker/nemotron3-nano-30b-a3b-spiral-step130")LoRA adapter trained with the SPIRAL self-play RL framework on top of
nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.
nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 (30B-total / 3B-active MoE, reasoning-capable)nemotron3 (thinking enabled)TicTacToe-v0, KuhnPoker-v1, SimpleNegotiation-v1 (self-play, role-conditioned advantage estimation / RAE)target_modules=all-linear, alpha 32)| Benchmark | Base | Step-130 | Δ |
|---|---|---|---|
| AIME24 | 36.7% | 36.7% | 0.0 |
| AMC23 | 67.1% | 74.4% | +7.3 |
| MATH500 | 89.0% | 90.8% | +1.8 |
| Minerva | 29.4% | 30.1% | +0.7 |
| Olympiad-Bench | 50.1% | 53.2% | +3.1 |
| Average | 54.5% | 57.0% | +2.5 |
All evals done with nemotron3 renderer (thinking enabled), max_tokens 8192, full test sets, unified \boxed{} answer extraction.
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
model = AutoPeftModelForCausalLM.from_pretrained("maxbittker/nemotron3-nano-30b-a3b-spiral-step130",
device_map="auto",
torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16")
Or merge and save as a full model:
merged = model.merge_and_unload()
merged.save_pretrained("./nemotron3-spiral-step130-merged")
Training is ongoing — further checkpoints will land at step200, step300, step400.
Base model
nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16