arxiv:2410.05193

RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

Published on Oct 7 · Submitted by DonJoey on Oct 9
Abstract

With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective alternative to human evaluation for assessing text generation quality across a wide range of tasks. However, a reliability gap still remains between LLM-as-a-Judge and human evaluation. One important reason is the lack of guided oracles in the evaluation process. Motivated by the role of references, pervasively used in classic text evaluation, we introduce RevisEval, a novel text generation evaluation paradigm via response-adapted references. RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response being evaluated. Specifically, RevisEval leverages the text revision capabilities of large language models (LLMs) to adaptively revise the response, and then treats the revised text as the reference (response-adapted reference) for the subsequent evaluation. Extensive experiments demonstrate that RevisEval outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, our response-adapted references can further boost classical text metrics, e.g., BLEU and BERTScore, compared to traditional references, and can even rival LLM-as-a-Judge. A detailed analysis also confirms RevisEval's effectiveness in bias reduction, examines the impact of inference cost, and studies reference relevance.
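To make the paradigm concrete, here is a minimal sketch of the two-step pipeline described above: revise the response, then judge it against its own revision. This is not the authors' released code; the OpenAI client, model name, and prompt wording are illustrative assumptions.

```python
# Minimal sketch of the RevisEval idea (not the authors' code): revise the
# response with an LLM, then use the revision as the reference for judging.
# The `chat` helper, model choice, and prompts are assumptions for illustration.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Single-turn chat completion used for both revision and judging."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def response_adapted_reference(instruction: str, response: str) -> str:
    """Step 1: revise the response into a higher-quality text that stays
    relevant to it; the revision serves as the evaluation reference."""
    prompt = (
        f"Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
        "Minimally revise the response so that it fully and correctly "
        "answers the instruction. Return only the revised text."
    )
    return chat(prompt)

def judge_with_reference(instruction: str, response: str, reference: str) -> str:
    """Step 2: reference-based LLM-as-a-Judge scoring on a 1-10 scale."""
    prompt = (
        f"Instruction:\n{instruction}\n\nReference answer:\n{reference}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        "Rate the response from 1 to 10 against the reference. "
        "Reply with the number only."
    )
    return chat(prompt)

if __name__ == "__main__":
    instruction = "Explain why the sky is blue in two sentences."
    response = "The sky looks blue because of how sunlight scatters in the air."
    reference = response_adapted_reference(instruction, response)
    print(judge_with_reference(instruction, response, reference))
```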

Community

Paper author · Paper submitter

''RevisEval: Improving LLM-as-a-Judge via Response-Adapted References''. Evaluation has long been a cornerstone of progress in text generation. Given the limitations of traditional metrics, LLM-as-a-Judge has become a viable method for assessing generative abilities on open-ended tasks, though it still faces a significant reliability gap compared to human evaluation. By harnessing the revision capabilities of LLMs, we unlock the potential of references in traditional evaluation: we generate response-adapted references that significantly enhance general evaluation methods across various tasks. This approach not only boosts the accuracy of LLM-as-a-Judge but also revives traditional metrics such as BLEU, enabling them to evaluate open-ended benchmarks such as MT-Bench and AlpacaFarm with results comparable to LLM-as-a-Judge. It also works well with weak LLMs as evaluators and mitigates positional bias.
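As a rough illustration of how an adapted reference can revive overlap- and embedding-based metrics, the sketch below scores the same response against a generic static reference and against a response-adapted one, using the sacrebleu and bert-score libraries. The example strings, and the assumption that the adapted reference was produced by an LLM reviser, are made up for illustration only.

```python
# Sketch: compare classical metrics computed against a static reference vs. a
# response-adapted reference. Not the authors' code; example texts are invented.

import sacrebleu
from bert_score import score as bertscore

response = "The sky looks blue because of how sunlight scatters in the air."
static_reference = (
    "Rayleigh scattering makes short blue wavelengths dominate the sky's color."
)
adapted_reference = (  # assumed to be produced by an LLM reviser from `response`
    "The sky looks blue because sunlight scatters off air molecules, and "
    "shorter blue wavelengths scatter the most (Rayleigh scattering)."
)

for name, ref in [("static", static_reference), ("adapted", adapted_reference)]:
    bleu = sacrebleu.sentence_bleu(response, [ref]).score
    _, _, f1 = bertscore([response], [ref], lang="en")
    print(f"{name:8s} BLEU={bleu:5.1f}  BERTScore-F1={f1.item():.3f}")
```

The intuition is that the adapted reference stays lexically and semantically close to the response while correcting it, so n-gram overlap and embedding similarity become informative signals again.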

Paper author · Paper submitter

Most interestingly, we found that recent efforts to train a strong judge by supervised fine-tuning (SFT) of a weak LLM, such as Llama-2 7B, encounter significant challenges, particularly bias issues. Surprisingly, our approach suggests that instead of fine-tuning a weak LLM into a judge, it may be more effective to spend the same resources training it as a reviser: by generating response-adapted references and combining them with traditional metrics, better results can be achieved. This offers a feasible alternative approach to using LLMs as judges.

