Papers
arxiv:2503.01763

Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

Published on Mar 3
· Submitted by vansin on Mar 6
Abstract

Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents that solve practical tasks. Because tool-using LLMs have limited context length, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models on tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.
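To make the retrieval step concrete: the setup the abstract describes is a query (a task description) scored against every tool's natural-language description, with the top-k tools passed into the LLM's context. The sketch below is purely illustrative and is not the paper's method or benchmark code; the tool names, descriptions, and the bag-of-words cosine scorer are all hypothetical stand-ins for a real (e.g. dense) retriever over a 43k-tool corpus.

```python
import math
from collections import Counter

# Hypothetical tool corpus: tool name -> natural-language description.
# A real corpus (like ToolRet's) would hold tens of thousands of entries.
TOOLS = {
    "weather_lookup": "get the current weather forecast for a city",
    "currency_convert": "convert an amount between two currencies",
    "flight_search": "search for flights between two airports on a date",
}

def embed(text: str) -> Counter:
    """Bag-of-words term counts; a toy stand-in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k tool names whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(
        TOOLS,
        key=lambda name: cosine(q, embed(TOOLS[name])),
        reverse=True,
    )
    return ranked[:k]

print(retrieve("what is the weather in Paris"))
```

The paper's finding is that even strong IR models filling the role of `retrieve` here rank the truly useful tools poorly, which in turn lowers the downstream task pass rate of the tool-using LLM.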

