A new paper, "Long-Form Factuality in Large Language Models", proposes evaluating the long-form factuality of large language models with an AI agent! The authors introduce SAFE (Search-Augmented Factuality Evaluator), which uses an LLM to break a response down into individual facts, send search queries to Google for each fact, and reason over multiple steps about whether the search results support it.
Key points:
* SAFE (Search-Augmented Factuality Evaluator) is an automated method using an LLM agent to evaluate factuality
* The paper also introduces LongFact, a set of 2,280 prompts spanning 38 topics to test open-domain factual knowledge
* SAFE agrees with human annotators 72% of the time while being more than 20x cheaper. In a small-scale experiment on 100 disagreement cases, where a more thorough human procedure (researchers + full internet search) was used as the reference, SAFE was right in 76% of them.
* Larger models like GPT-4, Claude Opus and Gemini Ultra tend to exhibit better long-form factuality.
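To make the SAFE loop concrete, here is a minimal Python sketch of the three steps described above: split a response into facts, search for evidence, and rate each fact. This is not the authors' implementation. Every name in it (`evaluate_response`, `FactRating`, and the `toy_*` helpers) is hypothetical, the LLM and Google calls are replaced by toy stand-ins passed in as functions, and it issues a single search per fact rather than the multi-step reasoning the real system uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FactRating:
    fact: str
    supported: bool


def evaluate_response(
    response: str,
    split_into_facts: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    is_supported: Callable[[str, List[str]], bool],
) -> List[FactRating]:
    """Rate every individual fact in a long-form response.

    The three callables are placeholders for what would be an LLM call,
    a search-engine call, and another LLM call in a real pipeline.
    """
    ratings = []
    for fact in split_into_facts(response):
        evidence = search(fact)  # one query per fact (a simplification)
        ratings.append(FactRating(fact, is_supported(fact, evidence)))
    return ratings


# Toy stand-ins (assumptions for illustration only, no network or LLM involved).
def toy_split(text: str) -> List[str]:
    # Treat each sentence as one "fact".
    return [s.strip() for s in text.split(".") if s.strip()]


TOY_INDEX = {
    "Paris is the capital of France": ["Paris is the capital of France"],
}


def toy_search(query: str) -> List[str]:
    # Look the query up in a tiny in-memory "search index".
    return TOY_INDEX.get(query, [])


def toy_is_supported(fact: str, evidence: List[str]) -> bool:
    # A fact counts as supported only if the evidence contains it verbatim.
    return fact in evidence


ratings = evaluate_response(
    "Paris is the capital of France. The Moon is made of cheese.",
    toy_split,
    toy_search,
    toy_is_supported,
)
for r in ratings:
    print(f"{'supported' if r.supported else 'not supported'}: {r.fact}")
```

Passing the three steps in as functions keeps the sketch runnable offline; swapping in real LLM and search calls would not change the loop itself.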
Paper: Long-form factuality in large language models (2403.18802)
Code and data: https://github.com/google-deepmind/long-form-factuality
Congrats to the authors for their work!