Post
553
Having some fun with long context benchmarks (watch the video!!)
NoLiMA: NoLiMa: Long-Context Evaluation Beyond Literal Matching (2502.05167)
Fiction LiveBench: https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87
Michalenglo: https://deepmind.google/research/publications/117639/
LongGenBench: Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models (2409.02076)
NeedleBench: NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? (2407.11963)
RULER: RULER: What's the Real Context Size of Your Long-Context Language Models? (2404.06654)
For more: https://www.reddit.com/r/rajistics/comments/1jxwk29/long_context_llm_benchmarks_video/
let me know if you like these posts
NoLiMA: NoLiMa: Long-Context Evaluation Beyond Literal Matching (2502.05167)
Fiction LiveBench: https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87
Michalenglo: https://deepmind.google/research/publications/117639/
LongGenBench: Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models (2409.02076)
NeedleBench: NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? (2407.11963)
RULER: RULER: What's the Real Context Size of Your Long-Context Language Models? (2404.06654)
For more: https://www.reddit.com/r/rajistics/comments/1jxwk29/long_context_llm_benchmarks_video/
let me know if you like these posts