arxiv:2504.00050

JudgeLRM: Large Reasoning Models as a Judge

Published on Mar 31
Submitted by zhiyuanhucs on Apr 2
#2 Paper of the day

Abstract

The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling at judge tasks requiring deep reasoning.
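The abstract describes the reward only at a high level. As a rough illustration, here is a minimal sketch of a judge-wise, outcome-driven reward, assuming the judge ends its output with a pair of scores and the reward simply checks agreement with the human preference; the tag format, function name, and scoring scheme below are assumptions, not the paper's exact design:

```python
import re

def outcome_reward(judge_output: str, human_preference: int) -> float:
    """Illustrative judge-wise, outcome-driven reward (a sketch, not the
    paper's exact formulation). Assumes the judge ends its output with
    two scalar scores, e.g. "<answer>8 3</answer>"; the reward is 1 when
    the implied preference matches the human label (1 or 2), else 0."""
    match = re.search(r"<answer>\s*(\d+)\s+(\d+)\s*</answer>", judge_output)
    if match is None:
        return 0.0  # malformed output earns no reward
    s1, s2 = int(match.group(1)), int(match.group(2))
    predicted = 1 if s1 > s2 else 2 if s2 > s1 else 0  # 0 marks a tie
    return 1.0 if predicted == human_preference else 0.0

# Example: the judge scores answer 1 higher, and the human agrees.
print(outcome_reward("...reasoning...<answer>8 3</answer>", human_preference=1))  # 1.0
```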

Community

Paper author · Paper submitter

Large Reasoning Models as Judges

Paper author

Welcome to try JudgeLRM! Compare any Hugging Face language models by asking your own questions, and explore JudgeLRM's reasoning and detailed comparisons!
Demo: https://huggingface.co/spaces/nuojohnchen/JudgeLRMDemo
Model: https://huggingface.co/nuojohnchen/JudgeLRM-7B
Code: https://github.com/NuoJohnChen/JudgeLRM
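
Assuming JudgeLRM-7B exposes the standard transformers causal-LM interface, a minimal way to try it locally might look like this (the judging prompt is a guess; see the repo above for the exact template):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nuojohnchen/JudgeLRM-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Illustrative judging prompt; the actual template may differ.
prompt = (
    "Question: What is the capital of France?\n"
    "Answer 1: Paris.\n"
    "Answer 2: Lyon is the capital of France.\n"
    "Compare the two answers, reason step by step, then give each a score."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```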


How about trying a conditional length reward instead of an absolute length reward? That is, reward longer reasoning for pairs with lower |s1 - s2| (closer, harder calls) and shorter reasoning otherwise; see the sketch below.
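
A minimal sketch of that idea (the target-length and scale knobs are hypothetical, not anything from the paper):

```python
def conditional_length_reward(num_reasoning_tokens: int, s1: float, s2: float,
                              base_target: int = 256, scale: int = 512) -> float:
    """Sketch of a conditional length reward: the smaller the score gap
    |s1 - s2| (a harder comparison), the longer the reasoning we reward.
    base_target and scale are hypothetical hyperparameters."""
    gap = abs(s1 - s2)
    target = base_target + scale / (1.0 + gap)  # target length shrinks as the gap grows
    # Reward peaks at the target length and decays linearly away from it.
    return max(0.0, 1.0 - abs(num_reasoning_tokens - target) / target)

# A close call (gap 1) rewards longer reasoning than an easy one (gap 7).
print(conditional_length_reward(500, s1=7, s2=6))  # near-peak reward
print(conditional_length_reward(500, s1=9, s2=2))  # penalized: too long for an easy pair
```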

Paper author

Thanks for the suggestion and your interest! We’ll further explore reward design.

Amazing work! How about testing your model on recent benchmarks like JudgeBench, RM-Bench, or RewardBench?
I believe that would bring more insights.

