ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
Abstract
Enhancing large language models (LLMs) with real-time APIs can help generate more accurate and up-to-date responses. However, evaluating the function calling abilities of LLMs in real-world scenarios remains under-explored due to the complexity of data collection and evaluation. In this work, we introduce ComplexFuncBench, a benchmark for complex function calling across five real-world scenarios. Compared to existing benchmarks, ComplexFuncBench encompasses multi-step and constrained function calling, which requires long-parameter filling, parameter value reasoning, and 128k long context. Additionally, we propose an automatic framework, ComplexEval, for quantitatively evaluating complex function calling tasks. Through comprehensive experiments, we demonstrate the deficiencies of state-of-the-art LLMs in function calling and suggest future directions for optimizing these capabilities. The data and code are available at https://github.com/THUDM/ComplexFuncBench.
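To make "multi-step and constrained function calling" concrete, here is a minimal Python sketch. It is an illustration only and is not taken from ComplexFuncBench or ComplexEval: the tool names `search_destination` and `search_hotels`, their parameters, and the toy data are all hypothetical. The point it shows is that the second call cannot be filled in until the first call returns, and that one of its parameters comes from a constraint in the user request.

```python
from typing import Any, Dict, List

# Hypothetical tools for illustration only; they are NOT the benchmark's APIs.
def search_destination(query: str) -> Dict[str, Any]:
    """Pretend API: resolve a city name to an internal destination id."""
    table = {"Paris": "dest-001", "Tokyo": "dest-002"}
    return {"dest_id": table[query]}

def search_hotels(dest_id: str, max_price: int) -> List[Dict[str, Any]]:
    """Pretend API: list hotels at a destination, filtered by a price cap."""
    hotels = {
        "dest-001": [
            {"name": "Hotel A", "price": 120},
            {"name": "Hotel B", "price": 300},
        ],
        "dest-002": [{"name": "Hotel C", "price": 90}],
    }
    return [h for h in hotels.get(dest_id, []) if h["price"] <= max_price]

def run_multi_step(city: str, max_price: int) -> List[Dict[str, Any]]:
    """Chain two calls: the second depends on the result of the first."""
    # Step 1: the user names a city, but the hotel API needs an id.
    step1 = search_destination(query=city)
    # Step 2: dest_id is taken from step 1 (parameter value reasoning);
    # max_price is the constraint stated in the user request.
    return search_hotels(dest_id=step1["dest_id"], max_price=max_price)

if __name__ == "__main__":
    print(run_multi_step("Paris", 150))
```

In a real evaluation the model, not a hard-coded function, would have to decide the call order and fill each parameter; this sketch only fixes the shape of such a call chain.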