multilingual-reward-bench

community

AI & ML interests

None defined yet.

Recent Activity

seungone authored a paper 5 days ago

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

seungone authored a paper 5 days ago

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

DKYoon authored a paper 8 days ago

A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls

seungone

authored 2 papers 5 days ago

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

Paper • 2605.26457 • Published 15 days ago • 6

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Paper • 2606.02404 • Published 9 days ago • 55

DKYoon

authored 4 papers 8 days ago

A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls

Paper • 2412.01340 • Published Dec 2, 2024

K-EXAONE Technical Report

Paper • 2601.01739 • Published Jan 5 • 95

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Paper • 2605.20668 • Published 21 days ago • 12

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Paper • 2606.02404 • Published 9 days ago • 55

seungone

submitted a paper to Daily Papers 8 days ago

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Paper • 2606.02404 • Published 9 days ago • 55

amphora

authored 13 papers 13 days ago

Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?

Paper • 2402.11597 • Published Feb 18, 2024

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Paper • 2406.05761 • Published Jun 9, 2024 • 3

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

Paper • 2506.00482 • Published May 31, 2025 • 8

From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation

Paper • 2507.08924 • Published Jul 11, 2025 • 18

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Paper • 2510.24081 • Published Oct 28, 2025 • 24

Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

Paper • 2602.06291 • Published Feb 6 • 24

KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

Paper • 2604.13058 • Published Mar 18 • 2

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Paper • 2605.09063 • Published May 9 • 80

Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback

Paper • 2605.17448 • Published 24 days ago • 19

Team members 15
