arxiv:2409.15523

SEAL: Suite for Evaluating API-use of LLMs

Published on Sep 23, 2024

Abstract

Large language models (LLMs) have limitations in handling tasks that require real-time access to external APIs. While several benchmarks like ToolBench and APIGen have been developed to assess LLMs' API-use capabilities, they often suffer from issues such as lack of generalizability, limited multi-step reasoning coverage, and instability due to real-time API fluctuations. In this paper, we introduce SEAL, an end-to-end testbed designed to evaluate LLMs in real-world API usage. SEAL standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs by introducing a GPT-4-powered API simulator with caching for deterministic evaluations. Our testbed provides a comprehensive evaluation pipeline that covers API retrieval, API calls, and final responses, offering a reliable framework for structured performance comparison in diverse real-world scenarios. SEAL is publicly available, with ongoing updates for new benchmarks.
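
To illustrate the caching idea behind the GPT-4-powered API simulator described above, here is a minimal sketch (not the authors' code): an LLM role-plays the API on the first request for a given (endpoint, arguments) pair, and the response is cached so repeated evaluation runs see identical, deterministic outputs. All names here (CachedAPISimulator, the llm_client interface, the cache layout) are hypothetical assumptions for illustration.

    import hashlib
    import json
    from pathlib import Path

    class CachedAPISimulator:
        """Hypothetical sketch of an LLM-backed API simulator with a response cache."""

        def __init__(self, llm_client, cache_dir="api_cache"):
            self.llm = llm_client                  # assumed client exposing .complete(prompt) -> str
            self.cache_dir = Path(cache_dir)
            self.cache_dir.mkdir(exist_ok=True)

        def _cache_key(self, endpoint, arguments):
            # Canonicalize the request so identical calls map to the same key.
            payload = json.dumps({"endpoint": endpoint, "args": arguments}, sort_keys=True)
            return hashlib.sha256(payload.encode()).hexdigest()

        def call(self, endpoint, arguments):
            key = self._cache_key(endpoint, arguments)
            cache_file = self.cache_dir / f"{key}.json"
            if cache_file.exists():
                # Deterministic replay: the same request always returns the cached response.
                return json.loads(cache_file.read_text())

            # Cache miss: ask the LLM to act as the API and return a JSON response.
            prompt = (
                "You are simulating a REST API. Given the endpoint and arguments, "
                "return only a JSON response body.\n"
                f"Endpoint: {endpoint}\nArguments: {json.dumps(arguments)}"
            )
            response = json.loads(self.llm.complete(prompt))
            cache_file.write_text(json.dumps(response))
            return response

Caching the simulated responses is what makes evaluation stable: live APIs can change or fail between runs, whereas a cached simulator gives every model under comparison the same view of the "API" environment.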
