Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv:2306.05685 (Jun 9, 2023)
JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv:2310.17631 (Oct 26, 2023)
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models. arXiv:2310.08491 (Oct 12, 2023)
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment. arXiv:2402.14016 (Feb 21, 2024)
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. arXiv:2406.12624 (Jun 18, 2024)
Benchmarking Cognitive Biases in Large Language Models as Evaluators. arXiv:2309.17012 (Sep 29, 2023)
Evaluating Large Language Models: A Comprehensive Survey. arXiv:2310.19736 (Oct 30, 2023)
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models. arXiv:2305.13711 (May 23, 2023)
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634 (Mar 29, 2023)
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers. arXiv:2403.02839 (Mar 5, 2024)
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. arXiv:2405.01535 (May 2, 2024)