Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
Abstract
Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Despite this, current models are constrained by limitations in handling long texts and by the efficiency of reinforcement learning (RL) training. To address these issues, we propose a simple yet effective test-time scaling approach, Multi-round Thinking, which iteratively refines the model's reasoning by using each round's answer as a prompt for the next. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements on benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. For instance, the accuracy of QwQ-32B improved from 80.3% (Round 1) to 82.1% (Round 2) on AIME 2024, while DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. These results confirm that Multi-round Thinking is a broadly applicable, straightforward approach for achieving stable gains in model performance, underscoring its potential for future developments in test-time scaling techniques. The key prompt used between rounds is:

> {Original question prompt} The assistant's previous answer is: <answer> {last round answer} </answer>, and please re-answer.
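For readers who want to try the method, here is a minimal sketch of the round loop, assuming an OpenAI-compatible chat-completions client; the client setup and model name are placeholders, and only the prompt template comes from the abstract above.

```python
# Minimal sketch of Multi-round Thinking, assuming an OpenAI-compatible
# chat client. MODEL and the client configuration are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment
MODEL = "your-reasoning-model"  # hypothetical model name


def multi_round_thinking(question: str, rounds: int = 2) -> str:
    """Ask the question repeatedly, feeding the previous round's answer
    back into the prompt template from the abstract."""
    answer = None
    for _ in range(rounds):
        if answer is None:
            # Round 1: just the original question.
            prompt = question
        else:
            # Later rounds: wrap the last answer in the paper's template.
            prompt = (
                f"{question}\n"
                f"The assistant's previous answer is: <answer> {answer} </answer>, "
                f"and please re-answer."
            )
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content
    return answer


# Example usage:
# print(multi_round_thinking("Solve for x: 3x + 5 = 20.", rounds=2))
```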
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models (2025)
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? (2025)
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't (2025)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models (2025)
- Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation (2025)
- Towards Widening The Distillation Bottleneck for Reasoning Models (2025)
- Iterative Deepening Sampling for Large Language Models (2025)
Hi! I appreciate the clarity and thoroughness of your experiments. I do have two questions:
What exactly is the AM-32B model? Could you describe how it was trained, specifically in terms of the data composition?
Have you run tests on general questions, i.e., those not related to math or coding problems? If so, what kind of benefits does the "Think Twice" approach bring there?
Hi! Thank you very much for your interest in our work!
For the training details of the AM-32B model, please refer to our previous work (https://arxiv.org/abs/2503.19633), where we provide a detailed description of the data processing methods. Due to time constraints, we have not yet tested the effectiveness of our approach on other datasets; in the meantime, if you're interested, you're welcome to run evaluations following the methods described in the paper.
All of our related work will be updated at https://github.com/a-m-team/a-m-models, and we welcome discussion!
The "Think Twice" approach, in which the language model performs multiple rounds of reasoning internally, can be seen as a process where the model engages in self-dialogue to deeply analyze a problem. This seems similar to the interactions between agents in a multi-agent system.