OS-Genesis: new research paper proposes a novel training data generation method for Claude-Computer-Use-like agents, with impressive results! 🔥
The main bottleneck in building GUI agents is finding training data.
GUI agent trajectories are not easy to come by. Crowdsourcing trajectories and then manually annotating them is an option, but it's hard to do at scale.
You could use synthetic data generation (ask thousands of small existing GUI agents to solve tasks, keep only the successful runs). But then it's hard to come up with many high-level tasks.
➡️ Well, a novel technique was just published that creates a promising new paradigm for synthetic data generation: Shanghai AI Lab researchers propose OS-Genesis, a novel way to create training data for GUI agents that flips the traditional approach on its head. Instead of starting with predefined tasks and having humans or machines execute them, OS-Genesis first explores the interface naturally, then derives meaningful tasks from those interactions.
🔍 Exploration-driven vs task-driven approach:
‣ Instead of starting with tasks, OS-Genesis first explores GUIs by clicking and interacting
‣ It then reverse-engineers high-level tasks from successful interaction patterns
‣ This leads to more natural and diverse training data than predefined tasks
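The explore-then-reverse-engineer idea above can be sketched in a few lines. Everything here is illustrative (the toy GUI, the function names, the action summary) and not the paper's actual implementation; in OS-Genesis the task-derivation step is done by a model, not a string join:

```python
import random

class ToyGUI:
    """Minimal stand-in for a real GUI environment (purely hypothetical)."""
    def __init__(self):
        self.screen = "home"
        self.edges = {"home": ["open_settings", "open_mail"],
                      "settings": ["toggle_wifi", "go_home"],
                      "mail": ["compose", "go_home"]}
        self.result = {"open_settings": "settings", "open_mail": "mail",
                       "toggle_wifi": "settings", "go_home": "home",
                       "compose": "mail"}
    def available_actions(self):
        return self.edges[self.screen]
    def perform(self, action):
        self.screen = self.result[action]

def explore(ui, steps=4, seed=0):
    """Task-free exploration: click around, log (state, action, new state)."""
    rng = random.Random(seed)
    traj = []
    for _ in range(steps):
        action = rng.choice(ui.available_actions())
        before = ui.screen
        ui.perform(action)
        traj.append((before, action, ui.screen))
    return traj

def reverse_synthesize(traj):
    """Derive a high-level instruction from the observed interactions
    (OS-Genesis uses an LLM for this; we just summarize the actions)."""
    return "Perform: " + " -> ".join(action for _, action, _ in traj)
```

The key inversion: the trajectory exists before the task does, so every derived task is grounded in something the interface actually allows.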
🎯 Novel reward model for trajectory quality:
‣ Rather than discarding incomplete trajectories, OS-Genesis scores them based on coherence and completion
‣ This preserves valuable partial successes that would otherwise be wasted
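The graded-filtering idea can be contrasted with binary keep/discard in a small sketch. The scoring function below is a placeholder heuristic, not the paper's reward model (OS-Genesis uses a learned/model-based scorer):

```python
def trajectory_reward(coherence, completion):
    """Toy stand-in for a trajectory reward: combine coherence and
    completion scores (each in [0, 1]) into a single score."""
    return 0.5 * coherence + 0.5 * completion

def weight_trajectories(trajectories, threshold=0.2):
    """Instead of keeping only fully successful runs, keep any trajectory
    above a low threshold and attach its score as a training weight,
    so partial successes still contribute."""
    kept = []
    for t in trajectories:
        reward = trajectory_reward(t["coherence"], t["completion"])
        if reward >= threshold:
            kept.append({**t, "weight": reward})
    return kept
```

With binary filtering, a trajectory that coherently completes 40% of a task contributes nothing; with graded rewards it still enters training, just down-weighted.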
📈 Superior results across environments:
‣ Nearly doubles performance on AndroidWorld (9.8% → 17.4%)
By the way, this field of GUI agents is still in its infancy, so you can still make a difference with "low-cost" setups: their paper gets SOTA results with only 8×A100 GPUs!
Read the paper here 👉 OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (2412.19723)