Abstract
The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.
Community
Large Reasoning Model for Judge
You're welcome to try JudgeLRM! Compare any Hugging Face language models by asking your own questions, and explore JudgeLRM's reasoning and detailed comparisons!
Demo: https://huggingface.co/spaces/nuojohnchen/JudgeLRMDemo
Model: https://huggingface.co/nuojohnchen/JudgeLRM-7B
Code: https://github.com/NuoJohnChen/JudgeLRM
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Improve LLM-as-a-Judge Ability as a General Ability (2025)
- Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL (2025)
- Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning (2025)
- Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance (2025)
- ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning (2025)
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't (2025)
- Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning (2025)
How about trying a conditional length reward instead of an absolute length reward? Encourage increased reasoning length for samples with lower |s1 - s2|, and the opposite otherwise.
Thanks for the suggestion and your interest! We’ll further explore reward design.
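To make the suggestion above concrete, here is a minimal sketch of one possible conditional length reward: longer reasoning is rewarded only when the two scores are close (a hard comparison) and penalized otherwise. The function name, the margin, and the token budget are all illustrative assumptions, not part of the paper's reward design.

```python
def conditional_length_reward(s1: float, s2: float,
                              reasoning_tokens: int,
                              margin: float = 1.0,
                              target_tokens: int = 256) -> float:
    """Length-based reward term conditioned on the score gap |s1 - s2|.

    Hypothetical sketch: margin and target_tokens are illustrative choices.
    """
    gap = abs(s1 - s2)
    # Normalized reasoning length in [0, 1], capped at the token budget.
    length = min(reasoning_tokens / target_tokens, 1.0)
    if gap < margin:
        # Hard pair (scores are close): encourage longer reasoning.
        return length
    # Easy pair (clear winner): discourage unnecessary padding.
    return 1.0 - length
```

This term would be added to the outcome-driven reward with some weight; the gap-dependent switch is what makes it "conditional" rather than a flat per-token bonus.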
Amazing work! How about testing your model on recent benchmarks like JudgeBench, RM-Bench, or RewardBench? I believe it would bring more insights.