Language Models can Self-Improve at State-Value Estimation for Better Search
Abstract
Collecting ground truth task completion rewards or human demonstrations for multi-step reasoning tasks is often cost-prohibitive and time-consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead, a self-supervised method that leverages state-transition dynamics to train a value model capable of effectively guiding language model-controlled search. We find that moderately sized (8 billion parameters) open-weight value models improved with self-taught lookahead can match the performance of using a frontier LLM such as gpt-4o as the value model. Furthermore, we find that self-taught lookahead improves performance by 20% while reducing costs 37x compared to previous LLM-based tree search, without relying on ground truth rewards.
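For intuition, the kind of self-supervised target that lookahead over state-transition dynamics provides can be written as a one-step Bellman backup. The notation below (states s, actions a, deterministic transition T, discount γ, value model V_θ) is generic and not taken verbatim from the paper:

```latex
% One-step lookahead (Bellman-style) target for a value model V_theta:
% the value of state s is bootstrapped from its best successor T(s, a),
% so no ground-truth task reward is needed at intermediate states.
\[
  V_{\mathrm{target}}(s) \;=\; \max_{a \in \mathcal{A}(s)}
    \bigl[\, r(s,a) \;+\; \gamma \, V_{\theta}\bigl(T(s,a)\bigr) \,\bigr]
\]
```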
Community
TLDR:
- Improving language models for search on multi-step reasoning tasks (e.g., web agent tasks) conventionally requires human demonstrations or ground-truth rewards, which are expensive to collect
- We propose self-taught lookahead (STL), a method that self-improves the value model using only state transitions, by capturing the Bellman update in natural language (a minimal sketch follows this list)
- Specifically, we train the value model used to guide search with our self-supervised approach
- We find that this leads to a 39% improvement in performance compared to using the base value model, matching the performance of using a GPT-4o value model
- Search with STL value models is also 37x cheaper than previous search methods and 10x cheaper than using closed-source models
- We also find STL is possible with very small value models (~3B parameters), which approach the performance of GPT-4o
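As a rough illustration of how such self-supervised value training could be set up, the sketch below builds one-step lookahead targets from state transitions. It is a minimal sketch, not the paper's implementation: the callables `propose_actions`, `simulate`, and `value_fn` are hypothetical stand-ins for the policy LLM, the environment's transition dynamics, and the current value model, and the sparse-reward assumption (no intermediate reward) is an assumption made here for simplicity.

```python
from typing import Callable, Iterable

GAMMA = 0.99  # discount factor; illustrative choice

def lookahead_target(
    state: str,
    propose_actions: Callable[[str], list[str]],   # candidate actions from a policy LLM (hypothetical)
    simulate: Callable[[str, str], str],            # observed state transition (hypothetical)
    value_fn: Callable[[str], tuple[str, float]],   # current value model: (rationale, score) (hypothetical)
) -> dict:
    """One-step Bellman-style target: bootstrap a state's value from its best successor."""
    best_action, best_rationale, best_value = None, None, float("-inf")
    for action in propose_actions(state):
        successor = simulate(state, action)
        rationale, value = value_fn(successor)
        if value > best_value:
            best_action, best_rationale, best_value = action, rationale, value
    # Sparse-reward assumption: intermediate reward is ~0, so the target is the
    # discounted value of the best successor, paired with its verbal rationale.
    return {
        "state": state,
        "best_action": best_action,
        "target_value": GAMMA * best_value,
        "target_rationale": best_rationale,
    }

def build_stl_dataset(
    states: Iterable[str],
    propose_actions: Callable[[str], list[str]],
    simulate: Callable[[str, str], str],
    value_fn: Callable[[str], tuple[str, float]],
) -> list[dict]:
    """Self-supervised (state -> lookahead target) pairs for fine-tuning the value model."""
    return [lookahead_target(s, propose_actions, simulate, value_fn) for s in states]
```

Only the data-construction side is shown here; in the self-improvement loop, the value model would then be fine-tuned on these pairs so that it learns to produce the rationale and score directly, without performing the lookahead at inference time.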
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (2025)
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning (2025)
- VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data (2025)
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking (2025)
- Stepwise Informativeness Search for Improving LLM Reasoning (2025)
- RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision (2025)
- Multi-LLM Collaborative Search for Complex Problem Solving (2025)