admarcosai's Collections
Benchmarks
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Paper • 2311.12022 • Published • 25
GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 185
Updated • 133 • 64
Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
Paper • 2312.04724 • Published • 20
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper • 2401.03065 • Published • 11
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
Paper • 2401.04695 • Published • 11
Updated • 1.38k • 76
Viewer • Updated • 100 • 239 • 8
reasoning-machines/gsm-hard
Viewer • Updated • 1.32k • 560 • 39
TravelPlanner: A Benchmark for Real-World Planning with Language Agents
Paper • 2402.01622 • Published • 33
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios
Paper • 2401.17167 • Published • 1
Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning
Paper • 2312.05230 • Published
LongAlign: A Recipe for Long Context Alignment of Large Language Models
Paper • 2401.18058 • Published • 20
Premise Order Matters in Reasoning with Large Language Models
Paper • 2402.08939 • Published • 27
In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss
Paper • 2402.10790 • Published • 41
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Paper • 2412.15204 • Published • 31
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
Paper • 2412.08972 • Published • 9