Abstract
GUI Process Automation (GPA) offers robust, deterministic, and privacy-preserving vision-based robotic process automation with faster execution than current vision-language model approaches.
GUI Process Automation (GPA) is a lightweight yet general vision-based Robotic Process Automation (RPA) framework that enables fast and stable process replay from only a single demonstration. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision-language-model-based GUI agents, GPA introduces three core benefits: (1) robustness, via Sequential Monte Carlo-based localization that handles rescaling and detection uncertainty; (2) determinism and reliability, safeguarded by readiness calibration; and (3) privacy, through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. GPA can also be exposed as an MCP/CLI tool to other agents with coding capabilities, so that the agent only reasons and orchestrates while GPA handles GUI execution. In a pilot experiment comparing GPA with Gemini 3 Pro (with CUA tools), GPA achieved a higher success rate with 10 times faster execution on long-horizon GUI tasks.
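To make the first benefit concrete: the abstract does not publish the implementation, but a Sequential Monte Carlo (particle filter) localizer over a UI element's position and scale could look like the minimal sketch below. The `similarity` callable, particle count, and noise scales are all illustrative assumptions, not GPA's actual code.

```python
# Minimal sketch of Sequential Monte Carlo (particle filter) localization of a
# UI element under rescaling and detection uncertainty. The `similarity`
# callable, particle count, and noise scales are illustrative assumptions,
# not GPA's actual implementation.
import random

def smc_localize(similarity, width, height, n_particles=500, n_iters=5):
    """Estimate (x, y, scale) of a recorded element on the current screen.

    similarity(x, y, s) -> score in [0, 1]: how well the recorded element
    matches the screenshot at position (x, y) and scale s.
    """
    # Initialize particles uniformly over position and a plausible scale range.
    particles = [(random.uniform(0, width),
                  random.uniform(0, height),
                  random.uniform(0.5, 2.0)) for _ in range(n_particles)]
    for _ in range(n_iters):
        # Weight each particle by the local match score.
        weights = [max(similarity(x, y, s), 1e-9) for (x, y, s) in particles]
        # Resample in proportion to weight, then jitter to model rescaling
        # noise and detector uncertainty.
        particles = [(x + random.gauss(0, 5.0),
                      y + random.gauss(0, 5.0),
                      max(0.1, s + random.gauss(0, 0.05)))
                     for (x, y, s) in random.choices(particles,
                                                     weights=weights,
                                                     k=n_particles)]
    # Return the highest-scoring particle as the localization estimate.
    return max(particles, key=lambda p: similarity(*p))
```

Resampling concentrates particles around high-similarity regions, which is what makes this style of localization robust to rescaling and detector noise.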
Community
GUI Process Automation (GPA) - from Salesforce AI Research
For product or enterprise use cases, please contact [email protected] or [email protected].
What is GPA?
GPA is a demo-based RPA (Robotic Process Automation) framework for automating desktop GUI tasks on macOS.
The core idea: record a workflow once, replay it reliably — even when the UI changes slightly. Unlike traditional RPA tools that rely on pixel coordinates or brittle selectors, GPA uses lightweight local models to understand UI structure and locate elements robustly at replay time.
Key capabilities:
- Record: Capture a GUI workflow as a sequence of user actions and screenshots
- Build: LLM-powered analysis (done once at build time) converts the recording into a parameterized workflow template with named variables
- Run: Action execution uses small-scale, locally running visual detectors and feature extractors; no large vision-language model is required at runtime. UI elements are matched via efficient embedding-based retrieval, keeping replay fast and privacy-preserving (a sketch of this matching step appears after this list)
- Variables: Workflows accept runtime variable overrides (e.g. filenames, text content, search terms)
- Loops: Run partial step ranges to support batched or iterative workflows (see the template-and-replay sketch below)
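To make the Run step concrete: embedding-based retrieval here means comparing the feature vector of the element recorded at demo time against the vectors of elements detected in the current screenshot. A minimal sketch, assuming the detector output and feature extractor are passed in as inputs (GPA's real interfaces are not published):

```python
# Hypothetical sketch of embedding-based UI element retrieval at replay time.
# `candidates` (crops from a local visual detector) and `embed` (a local
# feature extractor) are assumed inputs, not GPA's actual API.
import numpy as np

def match_element(target_embedding, candidates, embed):
    """Return (index, score) of the candidate whose embedding has the highest
    cosine similarity to the element recorded during the demo."""
    embs = np.stack([embed(c) for c in candidates])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    target = target_embedding / np.linalg.norm(target_embedding)
    scores = embs @ target
    best = int(np.argmax(scores))
    return best, float(scores[best])
```

Because matching reduces to one matrix-vector product over a handful of candidates, replay stays fast and runs entirely locally.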
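Similarly, the Build, Variables, and Loops capabilities can be pictured with a hypothetical template and replay call. The `template` shape, `run_workflow`, and its parameters are illustrative stand-ins, not GPA's published API:

```python
# Hypothetical workflow template produced at build time: each step stores the
# action, a reference to the target element's embedding, and named variables.
template = {
    "name": "rename_and_export",
    "steps": [
        {"id": 1, "action": "click", "target": "file_menu.emb"},
        {"id": 2, "action": "type",  "text": "{filename}"},      # named variable
        {"id": 3, "action": "click", "target": "export_btn.emb"},
    ],
    "variables": {"filename": "report.pdf"},  # defaults, overridable at run time
}

def run_workflow(template, variables=None, step_range=None):
    """Replay a template with runtime variable overrides and an optional
    (start, end) step range -- the basis for batched/loop execution."""
    values = {**template["variables"], **(variables or {})}
    start, end = step_range or (1, len(template["steps"]))
    for step in template["steps"]:
        if start <= step["id"] <= end:
            text = step.get("text", "").format(**values)
            print(f"step {step['id']}: {step['action']} {step.get('target', text)}")

# Loop over a batch of files by re-running only the typing/export steps.
for name in ["q1.pdf", "q2.pdf"]:
    run_workflow(template, variables={"filename": name}, step_range=(2, 3))
```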
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory (2026)
- ANCHOR: Branch-Point Data Generation for GUI Agents (2026)
- Next-Gen CAPTCHAs: Leveraging the Cognitive Gap for Scalable and Diverse GUI-Agent Defense (2026)
- AutoWebWorld: Synthesizing Infinite Verifiable Web Environments via Finite State Machines (2026)
- Do Multi-Agents Dream of Electric Screens? Achieving Perfect Accuracy on AndroidWorld Through Task Decomposition (2026)
- GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents (2026)
- CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training (2026)