SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
Abstract
SpatialClaw is a training-free framework that uses code as the action interface to enable flexible, stateful spatial reasoning in vision-language models, reaching 59.9% average accuracy across 20 static and dynamic 3D/4D spatial reasoning benchmarks.
Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, and lets a VLM-backed agent write one executable cell per step, conditioned on all prior outputs. This enables the agent to flexibly compose and manipulate perception results and to adapt its analysis to intermediate textual and visual observations as well as to the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming a recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
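The one-cell-per-step loop described above can be illustrated with a minimal sketch. This is not SpatialClaw's code or API: `StatefulKernel`, `run_agent`, `scripted_policy`, and the `boxes` data are all hypothetical names made up for illustration. A scripted policy stands in for the VLM, and a plain dictionary namespace stands in for a real kernel pre-loaded with frames and perception primitives. The sketch only shows how state persists across cells and how each cell's output becomes an observation for the next step.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel: one namespace shared by all cells."""

    def __init__(self, preloaded=None):
        # Names placed here are visible to every cell (e.g. frames, helper functions).
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return what it printed, or the traceback on failure."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(source, self.namespace)
            except Exception:
                # Surface the error as an observation so the next step can react to it.
                print(traceback.format_exc())
        return buffer.getvalue()


def run_agent(policy, kernel, max_steps=5):
    """Alternate between the policy writing one cell and the kernel executing it."""
    history = []  # list of (cell_source, observed_output)
    for _ in range(max_steps):
        cell = policy(history)
        if cell is None:  # the policy signals that it is done
            break
        history.append((cell, kernel.run_cell(cell)))
    return history


def scripted_policy(history):
    """Hypothetical policy standing in for the VLM; it replays fixed cells in order."""
    cells = [
        "centers = {name: (b[0] + b[2]) / 2 for name, b in boxes.items()}\nprint(centers)",
        "left = min(centers, key=centers.get)\nprint(left)",
    ]
    return cells[len(history)] if len(history) < len(cells) else None


# Made-up 2D boxes (x_min, y_min, x_max, y_max) standing in for perception output.
kernel = StatefulKernel({"boxes": {"chair": (10, 40, 50, 90), "lamp": (70, 10, 90, 80)}})
history = run_agent(scripted_policy, kernel)
```

The second cell reuses `centers` from the first, which is the property a single-pass script or a stateless tool call would not provide. A real policy would read `history` before choosing its next cell instead of replaying a fixed list.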
Community
"Code is the right action interface for spatial reasoning!!"
SpatialClaw lets a VLM-backed agent write Python in a persistent kernel, composing perception modules, inspecting intermediate results, and revising its strategy across steps.
It is training-free, with no benchmark- or model-specific adaptation, yet it beats a recent prior agent by +11.2 points on 20 benchmarks and improves consistently across six VLM backbones.
the most interesting detail is treating the code interface as the action surface, letting the agent run one executable cell per step in a persistent python kernel and revise its plan as new evidence comes in. that kind of iterative loop seems essential for open-ended 3d/4d reasoning, since you can compose perception ops on the fly instead of a fixed tool chain. i’d love to see an ablation where you disable revision of earlier cells to see how much of the gain comes from iteration versus raw tool coverage. btw, the arxivlens breakdown helped me parse the method more clearly, nice to have a walkthrough here: https://arxivlens.com/PaperView/Details/spatialclaw-rethinking-action-interface-for-agentic-spatial-reasoning-4861-a91f0b49
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning (2026)
- Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning (2026)
- AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models (2026)
- SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning (2026)
- Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning (2026)
- Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction (2026)
- ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models (2026)
Get this paper in your agent:
hf papers read 2606.13673
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash