Language Models can Self-Improve at State-Value Estimation for Better Search
Abstract
Collecting ground truth task completion rewards or human demonstrations for multi-step reasoning tasks is often cost-prohibitive and time-consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead, a self-supervised method that leverages state-transition dynamics to train a value model capable of effectively guiding language model-controlled search. We find that moderately sized (8 billion parameters) open-weight value models improved with self-taught lookahead can match the performance of using a frontier LLM such as gpt-4o as the value model. Furthermore, we find that self-taught lookahead improves performance by 20% while reducing costs 37x compared to previous LLM-based tree search, without relying on ground truth rewards.
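For intuition, the kind of self-supervised target that lookahead over state-transition dynamics provides can be written as a one-step Bellman backup. The notation below (states s, actions a, deterministic transition T, discount γ, value model V_θ) is generic and not taken verbatim from the paper:

```latex
% One-step lookahead (Bellman-style) target for a value model V_theta:
% the value of state s is bootstrapped from its best successor T(s, a),
% so no ground-truth task reward is needed at intermediate states.
\[
  V_{\mathrm{target}}(s) \;=\; \max_{a \in \mathcal{A}(s)}
    \bigl[\, r(s,a) \;+\; \gamma \, V_{\theta}\bigl(T(s,a)\bigr) \,\bigr]
\]
```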
Community
TLDR:
- Improving language models for search on multi-step reasoning tasks (e.g., web agent tasks) conventionally requires human demonstrations or ground-truth rewards, which are expensive to collect
- We propose self-taught lookahead (STL), a method that self-improves the value model using only state transitions, by capturing the Bellman update in natural language (a minimal sketch follows this list)
- Specifically, we train the value model used to guide search with our self-supervised approach
- We find that this leads to a 39% improvement in performance compared to using the base value model, matching the performance of using a GPT-4o value model
- Search with STL value models is also 37x cheaper than previous search methods and 10x cheaper than using closed-source models
- We also find STL is possible with very small value models (~3B parameters), which approach the performance of GPT-4o
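As a rough illustration of how such self-supervised value training could be set up, the sketch below builds one-step lookahead targets from state transitions. It is a minimal sketch, not the paper's implementation: the callables `propose_actions`, `simulate`, and `value_fn` are hypothetical stand-ins for the policy LLM, the environment's transition dynamics, and the current value model, and the sparse-reward assumption (no intermediate reward) is an assumption made here for simplicity.

```python
from typing import Callable, Iterable

GAMMA = 0.99  # discount factor; illustrative choice

def lookahead_target(
    state: str,
    propose_actions: Callable[[str], list[str]],   # candidate actions from a policy LLM (hypothetical)
    simulate: Callable[[str, str], str],            # observed state transition (hypothetical)
    value_fn: Callable[[str], tuple[str, float]],   # current value model: (rationale, score) (hypothetical)
) -> dict:
    """One-step Bellman-style target: bootstrap a state's value from its best successor."""
    best_action, best_rationale, best_value = None, None, float("-inf")
    for action in propose_actions(state):
        successor = simulate(state, action)
        rationale, value = value_fn(successor)
        if value > best_value:
            best_action, best_rationale, best_value = action, rationale, value
    # Sparse-reward assumption: intermediate reward is ~0, so the target is the
    # discounted value of the best successor, paired with its verbal rationale.
    return {
        "state": state,
        "best_action": best_action,
        "target_value": GAMMA * best_value,
        "target_rationale": best_rationale,
    }

def build_stl_dataset(
    states: Iterable[str],
    propose_actions: Callable[[str], list[str]],
    simulate: Callable[[str, str], str],
    value_fn: Callable[[str], tuple[str, float]],
) -> list[dict]:
    """Self-supervised (state -> lookahead target) pairs for fine-tuning the value model."""
    return [lookahead_target(s, propose_actions, simulate, value_fn) for s in states]
```

Only the data-construction side is shown here; in the self-improvement loop, the value model would then be fine-tuned on these pairs so that it learns to produce the rationale and score directly, without performing the lookahead at inference time.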
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (2025)
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning (2025)
- VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data (2025)
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking (2025)
- Stepwise Informativeness Search for Improving LLM Reasoning (2025)
- RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision (2025)
- Multi-LLM Collaborative Search for Complex Problem Solving (2025)