ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning Paper • 2602.21534 • Published Feb 25, 2026
OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis Paper • 2603.20278 • Published Mar 17, 2026
Budget-aware Test-time Scaling via Discriminative Verification Paper • 2510.14913 • Published Oct 16, 2025
Predicting Task Performance with Context-aware Scaling Laws Paper • 2510.14919 • Published Oct 16, 2025
JudgeBench: A Benchmark for Evaluating LLM-based Judges Paper • 2410.12784 • Published Oct 16, 2024