- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions
My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.
This significantly reduces computation costs while expanding reasoning dataset domain coverage.
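To make the filter-first idea concrete, here is a minimal sketch of the two-stage workflow. Everything in it is an assumption for illustration: `cheap_reasoning_score` is a keyword heuristic standing in for a small classifier trained on the 0-4 scores, and `annotate_fn` stands in for whatever LLM call you would make on the surviving texts.

```python
from typing import Callable, Dict, Iterable, List


def cheap_reasoning_score(text: str) -> int:
    """Hypothetical stand-in for a small model: returns a 0-4 reasoning score."""
    cues = ("because", "therefore", "why", "compare", "evaluate", "trade-off")
    hits = sum(cue in text.lower() for cue in cues)
    return min(hits, 4)


def filter_then_annotate(
    docs: Iterable[str],
    score_fn: Callable[[str], int],
    annotate_fn: Callable[[str], str],
    min_score: int = 3,
) -> List[Dict[str, object]]:
    """Score every doc with the cheap model; call the expensive one only on keepers."""
    results = []
    for doc in docs:
        score = score_fn(doc)
        if score >= min_score:
            # The expensive LLM call only happens for high-scoring texts.
            results.append({"text": doc, "score": score, "annotation": annotate_fn(doc)})
    return results
```

The savings come from how rarely `annotate_fn` runs: the cheap scorer sees every document, the LLM only sees the ones above `min_score`.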
I read the 456-page AI Index report so you don't have to (kidding). The wild part? While AI gets ridiculously more accessible, the power gap is actually widening:
1. The democratization of AI capabilities is accelerating rapidly:
- The gap between open and closed models is basically closed: the difference on benchmarks like MMLU and HumanEval shrank to just 1.7% in 2024
- The cost to run GPT-3.5-level performance dropped 280x in 2 years
- Model size is shrinking while maintaining performance: Phi-3-mini hits 60%+ MMLU at a fraction of the parameters of early models like PaLM
2. But we're seeing concerning divides deepening:
- Geographic: US private investment ($109B) dwarfs everyone else, at 12x China's $9.3B
- Research concentration: the US and China dominate highly cited papers (50 and 34 respectively in 2023), while the next closest has only 7
- Gender: major gaps in AI skill penetration rates; the US shows 2.39 for men vs 1.71 for women
The tech is getting more accessible but the benefits aren't being distributed evenly. Worth thinking about as these tools become more central to the economy.
AI agents are transforming how we interact with technology, but how sustainable are they?
Design choices, like model size and structure, can massively impact energy use and cost. The key takeaway: smaller, task-specific models can be far more efficient than large, general-purpose ones.
Open-source models offer greater transparency, allowing us to track energy consumption and make more informed decisions on deployment. Open-source = more efficient, eco-friendly, and accountable AI.
See that purple banner on the Llama 4 models? It's Xet storage, and this is actually huge for anyone building with AI models. Let's geek out a little bit.
Current problem: AI models are massive files stored with Git LFS. But with models getting bigger and downloads exploding, we needed something better. Xet lets you version large files like code, with compression and deduplication, all Git-compatible. That means less bandwidth, faster sharing, and smoother collaboration.
Real numbers: ~25% deduplication on Llama 4 models, hitting ~40% for finetunes.
Scale matters here: the Hub served 2B model downloads in 30 days, with Llama models alone accounting for 60M. The upcoming Llama 4 Behemoth has 2T parameters! Xet's chunk-based system was built exactly for this.
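For intuition on why chunk-based storage helps with finetunes, here is a toy sketch of chunk-level deduplication. It is not Xet's actual algorithm: the rolling value, the `mask` and `min_size` parameters, and the in-memory `store` dict are all assumptions made up for illustration. The only idea it shares with the post is splitting files into chunks and storing each unique chunk once.

```python
import hashlib
from typing import Dict, List


def split_chunks(data: bytes, mask: int = 0x3F, min_size: int = 16) -> List[bytes]:
    """Split bytes at content-defined boundaries using a toy rolling value."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        # Cut when the low bits of the rolling value hit the mask pattern,
        # so boundaries depend on nearby content rather than absolute offsets.
        if i - start + 1 >= min_size and (rolling & mask) == mask:
            chunks.append(data[start : i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def add_file(store: Dict[str, bytes], data: bytes) -> int:
    """Add a file's chunks to the store; return how many new bytes were written."""
    new_bytes = 0
    for chunk in split_chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:  # chunks already in the store cost nothing
            store[key] = chunk
            new_bytes += len(chunk)
    return new_bytes
```

Upload a base model, then a finetune that shares most of its bytes, and `add_file` returns far fewer new bytes for the second upload than its full size. That difference is the deduplication saving.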
This is the kind of engineering that makes the next wave of large models actually usable. Kudos to the team!
"Am I going to be replaced by AI?" - Crucial question, but maybe we're asking the wrong one.
There's a statistic from my reads this week that stays with me: Tomer Cohen, LinkedIn's CPO, tells Jeremy Kahn that 70% of skills used in most jobs will change by 2030. Not jobs disappearing, but transforming. And he calls out bad leadership: "If in one year's time, you are disappointed that your workforce is not 'AI native,' it is your fault."
Apparently, the Great Recalibration has begun. According to a piece in Fast Company, we're heading into an era where AI fundamentally redefines the nature of work itself by forcing a complete reassessment of human value in the workplace. But it might be driven more by "the need for humans to change the way they work" than by AI.
The Washington Post draws a crucial parallel: we're facing an "AI shock" similar to manufacturing's "China shock", but hitting knowledge workers. Entry-level white-collar work in particular could get automated. The key difference? "Winning the AI tech competition with other countries won't be enough. It's equally vital to win the battle to re-skill workers."
Did we just drop personalized AI evaluation?! This tool auto-generates custom benchmarks on your docs to test which models are the best.
Most benchmarks test general capabilities, but what matters is how models handle your data and tasks. YourBench helps answer critical questions like:
- Do you really need a hundreds-of-billions-parameter model sledgehammer to crack a nut?
- Could a smaller, fine-tuned model work better?
- How well do different models understand your domain?
Some cool features:
- Generates custom benchmarks from your own documents (PDFs, Word, HTML)
- Tests models on real tasks, not just general capabilities
- Supports multiple models for different pipeline stages
- Generates both single-hop and multi-hop questions
- Evaluates top models and deploys leaderboards instantly
- Full cost analysis to optimize for your budget
- Fully configurable via a single YAML file
26 SOTA models tested for question generation. Interesting finding: Qwen2.5 32B leads in question diversity, while smaller Qwen models and Gemini 2.0 Flash offer great value for cost.
You can also run it locally on any models you want.
DeepSeek R1 moment has come for GUI agents: Rule-based Reinforcement Learning gives better results than SFT with 500x smaller datasets!
Traditionally (by which I mean "in the last few months"), GUI agents have been trained with supervised fine-tuning (SFT). This meant collecting huge datasets of screen captures from people using computers, and using these to fine-tune your model.
But last week, a new paper introduced UI-R1, applying DeepSeek's R1-style rule-based reinforcement learning (RL) specifically to GUI action prediction tasks. This is big news: with RL, maybe we could build good agents without the need for huge datasets.
UI-R1 uses a unified reward function that evaluates multiple responses from models, optimizing via policy algorithms like Group Relative Policy Optimization (GRPO).
Specifically, the reward function assesses:
- Action type accuracy: Does the predicted action match the ground truth?
- Coordinate accuracy (specifically for clicks): Is the predicted click within the correct bounding box?
- Output format: Does the model clearly articulate both its reasoning and final action?
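Here is a rough sketch of what a rule-based reward along those three criteria could look like, plus the group-relative normalization that gives GRPO its name. The details are my assumptions, not the paper's: the `<think>`/`<answer>` tag format, the equal 0/1 weighting of the three terms, and the dict layout of `pred` and `truth` are all hypothetical.

```python
import re
from statistics import mean, pstdev
from typing import List

# Assumed output format: reasoning in <think>, final action in <answer>.
FORMAT_RE = re.compile(r"^<think>.*</think>\s*<answer>.*</answer>$", re.DOTALL)


def ui_reward(response: str, pred: dict, truth: dict) -> float:
    """Sum of three rule-based 0/1 terms: action type, click location, format."""
    type_ok = pred.get("action") == truth["action"]
    r_type = 1.0 if type_ok else 0.0

    # The coordinate term only applies to correctly predicted clicks.
    r_coord = 0.0
    point = pred.get("point")
    if type_ok and truth["action"] == "click" and point is not None:
        x1, y1, x2, y2 = truth["bbox"]
        if x1 <= point[0] <= x2 and y1 <= point[1] <= y2:
            r_coord = 1.0

    r_format = 1.0 if FORMAT_RE.match(response.strip()) else 0.0
    return r_type + r_coord + r_format


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize each reward against the group sampled for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]
```

Because every term is a simple check against the ground truth, no learned reward model is needed. The policy samples several responses per task, scores them with `ui_reward`, and is pushed toward the ones that beat their group average.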
Using just 136 carefully selected mobile tasks (compared to 76,000 tasks for larger models like OS-Atlas), UI-R1 shows significant efficiency and improved performance:
- Boosted action prediction accuracy from 76% to 89% on AndroidControl.
- Outperformed larger, SFT-trained models (e.g., OS-Atlas-7B), demonstrating superior results with vastly fewer data points (136 tasks vs. 76K).
- Enhanced adaptability and generalization, excelling even in out-of-domain scenarios.
The paper tests this RL-based method only in low-level GUI tasks. Could it generalize to more complex interactions?