arxiv:2504.00050

JudgeLRM: Large Reasoning Models as a Judge

Published on Mar 31
Submitted by zhiyuanhucs on Apr 2
#2 Paper of the day

Abstract

The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling at judge tasks requiring deep reasoning.
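The abstract describes the reward only at a high level. As a rough illustration, here is a minimal sketch of a judge-wise, outcome-driven reward, assuming the judge ends its output with a pair of scores and the reward simply checks agreement with the human preference; the tag format, function name, and scoring scheme below are assumptions, not the paper's exact design:

```python
import re

def outcome_reward(judge_output: str, human_preference: int) -> float:
    """Illustrative judge-wise, outcome-driven reward (a sketch, not the
    paper's exact formulation). Assumes the judge ends its output with
    two scalar scores, e.g. "<answer>8 3</answer>"; the reward is 1 when
    the implied preference matches the human label (1 or 2), else 0."""
    match = re.search(r"<answer>\s*(\d+)\s+(\d+)\s*</answer>", judge_output)
    if match is None:
        return 0.0  # malformed output earns no reward
    s1, s2 = int(match.group(1)), int(match.group(2))
    predicted = 1 if s1 > s2 else 2 if s2 > s1 else 0  # 0 marks a tie
    return 1.0 if predicted == human_preference else 0.0

# Example: the judge scores answer 1 higher, and the human agrees.
print(outcome_reward("...reasoning...<answer>8 3</answer>", human_preference=1))  # 1.0
```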

Community

Paper author · Paper submitter

Large Reasoning Models as Judges

Paper author

Welcome to try JudgeLRM! Compare any Hugging Face language models by asking your own questions, and explore JudgeLRM's reasoning and detailed comparisons!
Demo: https://huggingface.co/spaces/nuojohnchen/JudgeLRMDemo
Model: https://huggingface.co/nuojohnchen/JudgeLRM-7B
Code: https://github.com/NuoJohnChen/JudgeLRM
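
Assuming JudgeLRM-7B exposes the standard transformers causal-LM interface, a minimal way to try it locally might look like this (the judging prompt is a guess; see the repo above for the exact template):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nuojohnchen/JudgeLRM-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Illustrative judging prompt; the actual template may differ.
prompt = (
    "Question: What is the capital of France?\n"
    "Answer 1: Paris.\n"
    "Answer 2: Lyon is the capital of France.\n"
    "Compare the two answers, reason step by step, then give each a score."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```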


How about trying a conditional length reward instead of an absolute length reward? That is, reward longer reasoning for pairs with lower |s1 - s2| (closer, harder calls) and shorter reasoning otherwise; see the sketch below.
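
A minimal sketch of that idea (the target-length and scale knobs are hypothetical, not anything from the paper):

```python
def conditional_length_reward(num_reasoning_tokens: int, s1: float, s2: float,
                              base_target: int = 256, scale: int = 512) -> float:
    """Sketch of a conditional length reward: the smaller the score gap
    |s1 - s2| (a harder comparison), the longer the reasoning we reward.
    base_target and scale are hypothetical hyperparameters."""
    gap = abs(s1 - s2)
    target = base_target + scale / (1.0 + gap)  # target length shrinks as the gap grows
    # Reward peaks at the target length and decays linearly away from it.
    return max(0.0, 1.0 - abs(num_reasoning_tokens - target) / target)

# A close call (gap 1) rewards longer reasoning than an easy one (gap 7).
print(conditional_length_reward(500, s1=7, s2=6))  # near-peak reward
print(conditional_length_reward(500, s1=9, s2=2))  # penalized: too long for an easy pair
```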

Paper author

Thanks for the suggestion and your interest! We’ll further explore reward design.

Amazing work! How about testing your model on recent benchmarks like JudgeBench, RM-Bench, or RewardBench?
I believe that would bring more insights.

