
CGIAR (Enterprise, non-profit, Verified)
AI & ML interests: None defined yet.
Recent Activity
CGIAR's activity
Post
800
Our new Agentic leaderboard is now live!
If you ever asked which LLM is best for powering agents, we've just made a leaderboard that ranks them all! Built with @albertvillanova, this ranks LLMs powering a smolagents CodeAgent on subsets of various benchmarks.
GPT-4.5 comes out on top, even beating reasoning models like DeepSeek-R1 or o1. And Claude-3.7-Sonnet is a close second!
The leaderboard also lets you show the scores of vanilla LLMs (without any agentic setup) on the same benchmarks: this highlights the huge improvements brought by agentic setups.
(Note that results will be added manually, so the leaderboard might not always have the latest LLMs)
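For context, setting up the kind of smolagents CodeAgent the leaderboard evaluates only takes a few lines. A minimal sketch, assuming the smolagents package and an Inference-API-served model (the model id is illustrative):

```python
from smolagents import CodeAgent, HfApiModel

# Illustrative model id: any chat LLM served through the HF Inference API works here.
model = HfApiModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(tools=[], model=model, add_base_tools=True)

# The agent writes and executes Python code step by step to solve the task.
print(agent.run("How many seconds would it take a leopard at full speed to run through Pont des Arts?"))
```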
feedcomposer updated a dataset 15 days ago
sleeperscio updated a dataset 15 days ago

apsourg updated a dataset 15 days ago
Post
4731
We now have a Deep Research for academia: SurveyX automatically writes academic surveys nearly indistinguishable from human-written ones
Researchers from Beijing and Shanghai just published the first application of a deep research system to academia: their algorithm, given a question, can give you a survey of all papers on the subject.
To make a research survey, you generally follow two steps: preparation (collect and organize papers) and writing (outline creation, writing, polishing). The researchers followed the same two steps and automated them.
For the preparation part, a key challenge is finding all the important references on the given subject.
Researchers first cast a wide net of all relevant papers. But then finding the really important ones is like distilling knowledge from a haystack of information. To solve this challenge, they built an "AttributeTree" object that structures key information from citations. Ablating these AttributeTrees significantly decreased structure and synthesis scores, so they were really useful!
For the writing part, the key was to get a synthesis that's both concise and accurate. This is not easy to get with LLMs! So they used methods like LLM-based deduplication to shorten the overly verbose listings made by LLMs, and RAG to grab original quotes instead of made-up ones.
As a result, their system outperforms previous approaches by far!
As assessed by LLM judges, the quality score of SurveyX even approaches that of human experts, at 4.59/5 vs 4.75/5.
I advise you to read the paper; it's a great overview of the kind of assistants we'll get in the near future: SurveyX: Academic Survey Automation via Large Language Models (2502.14776)
Their website shows examples of generated surveys: http://www.surveyx.cn/
Post
3029
Less is More for Reasoning (LIMO): a 32B model fine-tuned with 817 examples can beat o1-preview on math reasoning!
Do we really need o1's huge RL procedure to see reasoning emerge? It seems not.
Researchers from Shanghai Jiao Tong University just demonstrated that carefully selected examples can boost math performance in large language models using SFT: no huge datasets or RL procedures needed.
Their procedure allows Qwen2.5-32B-Instruct to jump from 6.5% to 57% on AIME and from 59% to 95% on MATH, while using only 1% of the data in previous approaches.
The Less-is-More Reasoning Hypothesis:
- Minimal but precise examples that showcase optimal reasoning patterns matter more than sheer quantity
- Pre-training knowledge plus sufficient computational resources at inference time is enough to level up math skills
Core techniques:
- High-quality reasoning chains with self-verification steps
- 817 handpicked problems that encourage deeper reasoning
- Enough inference-time computation to allow extended reasoning
Efficiency gains:
- Only 817 examples instead of 100k+
- 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data
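To make the recipe concrete, here is a minimal SFT sketch with TRL. The dataset id GAIR/LIMO and the hyperparameters are my assumptions for illustration, not the authors' exact setup:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumption: the 817-example curated set is available as "GAIR/LIMO" on the Hub.
# Depending on its column names, you may need to map it to a text/conversational format first.
dataset = load_dataset("GAIR/LIMO", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",   # base model used in the paper
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="limo-sft",
        num_train_epochs=3,              # illustrative hyperparameters
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```

The point is the scale: a few hundred carefully curated reasoning traces, not a six-figure dataset or an RL pipeline.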
This really challenges the notion that SFT leads to memorization rather than generalization! And opens up reasoning to GPU-poor researchers.
Read the full paper here: LIMO: Less is More for Reasoning (2502.03387)
Post
2906
Great feature alert: you can now share agents to the Hub!
And any agent pushed to the Hub gets a cool Space interface to directly chat with it.
This was a real technical challenge: for instance, serializing tools to export them meant that you needed to get all the source code for a tool, verify that it was standalone (not relying on external variables), and gather all the packages required to make it run.
Go try it out: https://github.com/huggingface/smolagents
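A minimal sketch of what sharing looks like, assuming the smolagents push_to_hub/from_hub API and an illustrative repo id:

```python
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel())

# Push the agent (tool source code and requirements included) to the Hub.
# "your-username/my-agent" is an illustrative repo id.
agent.push_to_hub("your-username/my-agent")

# Anyone can then pull it back and run it (or chat with it in the generated Space).
remote_agent = CodeAgent.from_hub("your-username/my-agent", trust_remote_code=True)
remote_agent.run("What's the current weather in Paris?")
```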
Post
2518
For those who haven't come across it yet, here's a handy trick to discuss an entire GitHub repo with an LLM:
=> Just replace "github" with "gitingest" in the URL, and you get the whole repo as a single string that you can then paste into your LLM
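For example, a tiny helper that does the swap (gitingest.com then renders the repo as one big text digest you can copy):

```python
def to_gitingest(github_url: str) -> str:
    """Convert a GitHub repo URL to its gitingest equivalent."""
    return github_url.replace("github.com", "gitingest.com", 1)

# Open the printed URL in a browser, then copy the repo-as-one-string output into your LLM.
print(to_gitingest("https://github.com/huggingface/smolagents"))
```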

AnKyr updated a dataset 28 days ago
Post
4823
"๐ฎ๐ฌ๐ฎ๐ฑ ๐๐ถ๐น๐น ๐ฏ๐ฒ ๐๐ต๐ฒ ๐๐ฒ๐ฎ๐ฟ ๐ผ๐ณ ๐๐ ๐ฎ๐ด๐ฒ๐ป๐๐": this statement has often been made, here are numbers to support it.
I've plotted the progress of AI agents on GAIA test set, and it seems they're headed to catch up with the human baseline in early 2026.
And that progress is still driven mostly by the improvement of base LLMs: progress would be even faster with fine-tuned agentic models.
Post
3737
Adyen's new Data Agents Benchmark shows that DeepSeek-R1 struggles on data science tasks!
How well do reasoning models perform on agentic tasks? Until now, all indicators seemed to show that they worked really well. On our recent reproduction of Deep Research, OpenAI's o1 was by far the best model to power an agentic system.
So when our partner Adyen built a huge benchmark of 450 data science tasks, and built data agents with smolagents to test different models, I expected reasoning models like o1 or DeepSeek-R1 to destroy the tasks at hand.
But they really missed the mark: DeepSeek-R1 only got 1 or 2 out of 10 questions correct. Similarly, o1 was only at ~13% correct answers.
These results really surprised us. We thoroughly checked them; we even thought our APIs for DeepSeek were broken, and colleagues Leandro and Anton helped me start custom instances of R1 on our own H100s to make sure everything worked well.
But there seemed to be no mistake. Reasoning LLMs actually did not seem that smart. Often, these models made basic mistakes, like forgetting the content of a folder that they had just explored, misspelling file names, or hallucinating data. Even though they do great at exploring webpages through several steps, the same level of multi-step planning seemed much harder to achieve when reasoning over files and data.
It seems like there's still lots of work to do in the Agents x Data space. Congrats to Adyen for this great benchmark; looking forward to seeing people propose better agents!
Read more in the blog post: https://huggingface.co/blog/dabstep
Post
9775
Introducing open Deep-Research by Hugging Face!
OpenAI's latest agentic app Deep Research seems really good... But it's closed, as usual.
So with a team of cracked colleagues, we set ourselves a 24-hour deadline to replicate and open-source Deep Research!
We built open-Deep-Research, an entirely open agent that can: navigate the web autonomously, scroll and search through pages, download and manipulate files, run calculations on data...
We aimed for the best performance: are the agent's answers really rigorous?
On the GAIA benchmark, Deep Research had 67% accuracy on the validation set.
Open Deep Research is at 55% (powered by o1), which makes it:
- the best pass@1 solution submitted
- the best open solution
And it's only getting started! Please jump in, drop PRs, and let's bring it to the top!
Read the blog post: https://huggingface.co/blog/open-deep-research
Post
3128
Now you can launch a code agent directly from your terminal!
smolagent "Your task" directly launches a CodeAgent.
This also works with web agents (replace smolagent with webagent), thanks to @merve!
Another treat from smolagents release 1.7.0:
Now agents have a memory mechanism, enabling many possibilities like replaying the last run with agent.replay(). Thank you @clefourrier!
Check the release notes here: https://github.com/huggingface/smolagents/releases/tag/v1.7.0
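As a rough sketch of the same flow from Python rather than the CLI (the model choice is illustrative; replay() is the memory feature mentioned above):

```python
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel(), add_base_tools=True)

# Equivalent in spirit to `smolagent "Your task"` on the command line.
agent.run("Plot the population of Paris over the last 20 years and save it as paris.png")

# New in 1.7.0: agents keep a memory of their steps, so the last run can be replayed.
agent.replay()
```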
Post
4078
The Hub welcomes external inference providers!
Hosting our own inference was not enough: the Hub now has 4 new inference providers: fal, Replicate, SambaNova Systems, & Together AI.
Check model cards on the Hub: you can now, in 1 click, use inference from various providers (see the video demo).
Their inference can also be used through our Inference API client. There, you can use either your custom provider key or your HF token; in the latter case, billing is handled directly on your HF account, as a way to centralize all expenses.
Also, PRO users get $2 of inference credits per month!
Read more in the announcement: https://huggingface.co/blog/inference-providers
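A minimal sketch of the client-side usage, assuming a recent huggingface_hub with provider routing (the provider name and model id are illustrative):

```python
from huggingface_hub import InferenceClient

# Authenticating with your HF token routes billing through your HF account;
# pass a provider key instead to be billed by the provider directly.
client = InferenceClient(provider="together", api_key="hf_xxx")  # placeholder token

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # illustrative model id
    messages=[{"role": "user", "content": "Say hello from an external inference provider."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```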
Post
3309
Today we make the biggest release in smolagents so far: we enable vision models, which allows you to build powerful web-browsing agents!
Our agents can now casually open up a web browser, and navigate on it by scrolling, clicking elements on the webpage, going back, just like a user would.
The demo below shows Claude-3.5-Sonnet browsing GitHub for the task: "Find how many commits the author of the current top trending repo did over last year."
Hi @mlabonne!
Go try it out, it's the most cracked agentic stuff I've seen in a while (well, along with OpenAI's Operator, which beat us by one day).
For more detail, read our announcement blog: https://huggingface.co/blog/smolagents-can-see
The code for the web browser example is here: https://github.com/huggingface/smolagents/blob/main/examples/vlm_web_browser.py
Post
1376
MiniMax's new MoE LLM reaches Claude-Sonnet level with 4M tokens of context length!
This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid approach.
Key insights:
MoE with novel hybrid attention:
- Mixture of Experts with 456B total parameters (45.9B activated per token)
- Combines Lightning attention (linear complexity) for most layers and traditional softmax attention every 8 layers (see the sketch after this list)
Outperforms leading models across benchmarks while offering vastly longer context:
- Competitive with GPT-4/Claude-3.5-Sonnet on most tasks
- Can efficiently handle 4M-token contexts (vs 256K for most other LLMs)
Technical innovations enable efficient scaling:
- Novel expert parallel and tensor parallel strategies cut communication overhead in half
- Improved linear attention sequence parallelism, multi-level padding, and other optimizations achieve 75% GPU utilization (that's really high; utilization is generally around 50%)
Thorough training strategy:
- Careful data curation and quality control by using a smaller preliminary version of their LLM as a judge!
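To illustrate the hybrid attention schedule mentioned above, here is a purely illustrative sketch; the 1-in-8 pattern comes from the description, everything else (names, layer count) is an assumption rather than the released code:

```python
def attention_kind(layer_idx: int, softmax_every: int = 8) -> str:
    """One full-softmax layer for every (softmax_every - 1) linear 'lightning' layers."""
    return "softmax" if (layer_idx + 1) % softmax_every == 0 else "lightning_linear"

# For an 80-layer stack, layers 7, 15, 23, ... (0-indexed) use full softmax attention.
schedule = [attention_kind(i) for i in range(80)]
print(schedule[:9])
```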
Overall, not only is the model impressive, but the technical paper is also really interesting!
It has lots of insights including a great comparison showing how a 2B MoE (24B total) far outperforms a 7B model for the same amount of FLOPs.
Read it in full here: MiniMax-01: Scaling Foundation Models with Lightning Attention (2501.08313)
Model here (commercial use allowed under 100M monthly users): MiniMaxAI/MiniMax-Text-01
Post
2549
We've just released smolagents v1.3.0, and it comes with a major feature: you can now log agent runs using OpenTelemetry to inspect them afterwards!
This interactive format is IMO much easier for inspecting big multi-step runs than endless console logs.
The setup is very easy: just a few lines of code.
Find a tutorial here: https://huggingface.co/docs/smolagents/tutorials/inspect_runs
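A minimal sketch of what the setup looks like, assuming the OpenInference smolagents instrumentation package and a local trace collector (e.g. Phoenix) listening on the endpoint below:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.smolagents import SmolagentsInstrumentor

# Send spans to a local collector; the endpoint is an assumption (a common Phoenix default).
trace_provider = TracerProvider()
trace_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter("http://0.0.0.0:6006/v1/traces"))
)
SmolagentsInstrumentor().instrument(tracer_provider=trace_provider)

# From here on, every agent run is logged as a trace you can browse in the collector UI.
```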
Post
675
OS-Genesis: a new research paper proposes a novel training data generation method for Claude-Computer-Use-like agents, with impressive results!
The main bottleneck in building GUI agents is finding training data.
GUI agent trajectories are not easy to come by. Crowdsourcing trajectories, then manually annotating them, could be an option, but at scale it's hard to do.
You could use synthetic data generation (ask thousands of small existing GUI agents to solve tasks, keep only the successful runs). But then it's hard to come up with many high-level tasks.
Well, a novel technique was just published that creates a promising new paradigm for synthetic data generation: Shanghai AI Lab researchers propose OS-Genesis, a novel way to create training data for GUI agents that flips the traditional approach on its head. Instead of starting with predefined tasks and having humans or machines execute them, OS-Genesis first explores the interface naturally, then derives meaningful tasks from those interactions.
Exploration-driven vs task-driven approach:
- Instead of starting with tasks, OS-Genesis first explores GUIs by clicking and interacting
- It then reverse-engineers high-level tasks from successful interaction patterns
- This leads to more natural and diverse training data than predefined tasks
Novel reward model for trajectory quality:
- Rather than discarding incomplete trajectories, OS-Genesis scores them based on coherence and completion
- This preserves valuable partial successes that would otherwise be wasted
Superior results across environments:
- Nearly doubles performance on AndroidWorld (9.8% → 17.4%)
By the way, this field of GUI agents is still in its infancy, so you can still make a difference with "low-cost" setups: their paper gets SOTA results with only 8xA100!
Read the paper here: OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (2412.19723)
Post
5153
Since I published it on GitHub a few days ago, Hugging Face's new agentic library smolagents has gathered nearly 4k stars!
But we are just getting started on agents, so we are hiring an ML Engineer to join me and double down on this effort!
The plan is to build GUI agents: agents that can act on your computer with mouse & keyboard, like Claude Computer Use.
We will make it work better, and fully open.
Sound like something you'd like to do? Apply here: https://apply.workable.com/huggingface/j/AF1D4E3FEB/
Post
2385
After 6 years, BERT, the workhorse of encoder models, finally gets a replacement: welcome ModernBERT!
We talk a lot about Generative AI, meaning the decoder version of the Transformer architecture, but this is only one of the ways to build LLMs: encoder models, which turn a sentence into a vector, are maybe even more widely used in industry than generative models.
The workhorse for this category has been BERT since its release in 2018 (that's prehistory for LLMs).
It's not a fancy 100B-parameter supermodel (just a few hundred million), but it's an excellent workhorse, kind of the Honda Civic of LLMs.
Many applications use BERT-family models; the top models in this category accumulate millions of downloads on the Hub.
Now a collaboration between Answer.AI and LightOn just introduced BERT's replacement: ModernBERT.
TL;DR:
Architecture changes:
First, standard modernizations:
- Rotary positional embeddings (RoPE)
- Replace GeLU with GeGLU
- Use Flash Attention 2
The team also introduced innovative techniques like alternating attention instead of full attention, and sequence packing to get rid of padding overhead.
As a result, the model tops the encoder-model game:
It beats the previous standard, DeBERTaV3, with 1/5th the memory footprint, and runs 4x faster!
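For reference, using it is a drop-in swap in transformers. A minimal sketch, assuming the released base checkpoint is answerdotai/ModernBERT-base and a recent enough transformers version:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"  # assumption: released base checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely token for the masked position.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```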
Read the blog post: https://huggingface.co/blog/modernbert