All HF Hub posts

m-ric posted an update 1 day ago
Introducing open Deep-Research by Hugging Face! 💥

OpenAI's latest agentic app Deep Research seems really good... But it's closed, as usual.

⏱️ So with a team of cracked colleagues, we set ourselves a 24-hour deadline to replicate and open-source Deep Research! ⏱️

➡️ We built open-Deep-Research, an entirely open agent that can navigate the web autonomously, scroll and search through pages, download and manipulate files, run calculations on data...
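For anyone who wants a feel for the building blocks, here is a minimal sketch of a web-searching agent using smolagents, the library the project is built on. This is not the full open-Deep-Research stack; the tool and model choices below are illustrative assumptions.

```python
# Minimal sketch of a web-search agent with smolagents -- NOT the full
# open-Deep-Research pipeline; tool and model choices are illustrative.
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],  # web search tool bundled with smolagents
    model=HfApiModel(),              # defaults to a hosted inference model
)

print(agent.run("Find the latest GAIA validation leaderboard results."))
```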

We also aimed for the best performance: just how rigorous are the agent's answers?

On the GAIA benchmark, Deep Research scored 67% accuracy on the validation set.
➡️ open Deep Research reaches 55% (powered by o1), which makes it:
- the best pass@1 solution submitted
- the best open solution 💪💪

And it's only getting started! Please jump in, drop PRs, and let's bring it to the top!

Read the blog post 👉 https://huggingface.co/blog/open-deep-research
ahmed-masry posted an update 1 day ago
Happy to announce AlignVLM – a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs) 🌍📄🖼️

🔗 Read the paper: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding (2502.01341)

🧐 What's the challenge?
Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance. ❌

🎯 Our Solution: ALIGN Connector
We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space that the LLM can effectively interpret. ✅
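For intuition, here is a minimal sketch (not the authors' released code) of that idea: project each vision feature to a distribution over the LLM vocabulary, then take the corresponding convex combination of the LLM's text-embedding matrix, so the output always lies inside the text embedding space. The layer shapes and single linear projection are assumptions on my part.

```python
import torch
import torch.nn as nn

class AlignStyleConnector(nn.Module):
    """Sketch of the ALIGN idea: map vision features to a weighted average
    of the LLM's text embeddings (a convex combination, so outputs stay in
    the span of the embedding table). Layer choices are guesses, not the
    paper's exact architecture."""

    def __init__(self, vision_dim: int, llm_embed: torch.Tensor):
        super().__init__()
        vocab_size, llm_dim = llm_embed.shape
        self.proj = nn.Linear(vision_dim, llm_dim)      # vision -> LLM dim
        self.to_vocab = nn.Linear(llm_dim, vocab_size)  # logits over vocab
        # Frozen copy of the LLM input-embedding matrix (vocab_size x llm_dim)
        self.register_buffer("llm_embed", llm_embed)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        h = self.proj(vision_feats)
        weights = torch.softmax(self.to_vocab(h), dim=-1)  # (B, N, vocab)
        return weights @ self.llm_embed                     # (B, N, llm_dim)
```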

🔬 How does it perform?
We compared ALIGN against common connectors like MLPs, the Perceiver Resampler, and Ovis, all trained under similar configurations. The results? ALIGN outperforms them all 🏆 on diverse document understanding tasks 📄.

📊 Meet the AlignVLM Model Family!
We trained Llama 3.1 (1B, 3B, 8B) using our connector and benchmarked them against various models. The results:
✅ AlignVLM surpasses all Base VLMs trained under similar configurations.
✅ Our models also perform competitively against Instruct VLMs such as Qwen2-VL and InternVL-2.5 🚀.

🤔 What about robustness to noise?
We injected Gaussian noise (μ=0, σ=3) into the vision encoder's outputs before feeding them to the connector:
✅ ALIGN Connector: Minimal drop (↓1.67%) – proving its high robustness!
❌ MLP Connector: Severe degradation (↓25.54%) – struggling with noisy inputs.
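A rough sketch of that robustness probe, assuming a generic PyTorch pipeline; `connector`, `vision_feats`, and `evaluate` are placeholders, not APIs from the paper.

```python
import torch

def add_noise(vision_feats: torch.Tensor, sigma: float = 3.0) -> torch.Tensor:
    """Inject Gaussian noise (mu=0, sigma=3) into vision-encoder outputs,
    mirroring the robustness test described above."""
    return vision_feats + torch.randn_like(vision_feats) * sigma

# Hypothetical comparison (placeholders, not real APIs):
# clean_score = evaluate(connector(vision_feats))
# noisy_score = evaluate(connector(add_noise(vision_feats)))
```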

Code & model weights coming soon! Stay tuned! 🔥
victor posted an update 1 day ago
Hey everyone, we've given the https://hf.co/spaces page a fresh update!

Smart Search: Now just type what you want to do, like "make a viral meme" or "generate music", and our search gets it.

New Categories: Check out the cool new filter bar with icons to help you pick a category fast.

Redesigned Space Cards: Reworked a bit to really show off the app descriptions, so you know what each Space does at a glance.

Random Prompt: Need ideas? Hit the dice button for a burst of inspiration.

We'd love to hear what you think, so please drop us some feedback!
Jaward posted an update 2 days ago
ByteDance drops OmniHuman 🔥
This is peak SOTA performance - flawless natural gestures with perfect lip sync and facial expressions. It's the second time they've released SOTA-level talking heads, only this time with hands and body motion.
Project: https://omnihuman-lab.github.io/
hexgrad posted an update about 14 hours ago
I wrote an article about G2P: https://hf.co/blog/hexgrad/g2p

G2P is an underrated piece of small TTS models, like offensive linemen who do a bunch of work and get no credit.

Instead of relying on explicit G2P, larger speech models implicitly learn this task by eating many thousands of hours of audio data. They often use a 500M+ parameter LLM at the front to predict latent audio tokens over a learned codebook, then decode these tokens into audio.

Kokoro instead relies on G2P preprocessing, is only 82M parameters, and thus needs less audio to learn. Because of this, we can cherry-pick high-fidelity audio for training data and deliver solid speech for those voices. In turn, this excellent audio quality and lack of background noise helps explain why Kokoro is very competitive in single-voice TTS arenas.
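To make the G2P step concrete, here is a minimal sketch using the phonemizer package with the espeak backend; this illustrates explicit G2P preprocessing in general, not necessarily the exact pipeline Kokoro ships.

```python
# Illustration of explicit G2P preprocessing for a small TTS model.
# Uses the `phonemizer` package with the espeak backend as an example;
# this is not necessarily Kokoro's actual G2P stack.
from phonemizer import phonemize

text = "Grapheme-to-phoneme conversion does a lot of quiet work."
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)  # phoneme string the acoustic model would consume
```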
lin-tan posted an update about 18 hours ago
🚀 Excited to share that our paper, "SELP: Generating Safe and Efficient Task Plans for Robot Agents with Large Language Models", has been accepted to #ICRA2025! 🔗 Preprint: https://arxiv.org/pdf/2409.19471

We introduce SELP (Safe Efficient LLM Planner), a novel approach for generating plans that adhere to user-specified constraints while optimizing for time-efficient execution. By leveraging linear temporal logic (LTL) to interpret natural language commands, SELP effectively handles complex commands and long-horizon tasks. 🤖

💡 SELP presents three key insights:
1️⃣ Equivalence Voting: Ensures robust translation from natural language instructions into LTL specifications (see the sketch after this list).
2️⃣ Constrained Decoding: Uses the generated LTL formula to guide the autoregressive inference of plans, ensuring the generated plans conform to the LTL.
3️⃣ Domain-Specific Fine-Tuning: Customizes LLMs for specific robotic tasks, boosting both safety and efficiency.
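Here is a rough sketch of the equivalence-voting idea, under my own assumptions: sample several candidate LTL translations from the LLM, group them into equivalence classes, and keep a representative of the largest class. `ltl_equivalent` is a hypothetical helper (e.g., backed by an LTL model checker), not an API from the paper.

```python
from typing import Callable, List

def equivalence_vote(candidates: List[str],
                     ltl_equivalent: Callable[[str, str], bool]) -> str:
    """Group candidate LTL formulas into equivalence classes and return a
    representative of the largest class. Sampling the candidates from the
    LLM is omitted; `ltl_equivalent` is a hypothetical placeholder."""
    classes: List[List[str]] = []
    for formula in candidates:
        for cls in classes:
            if ltl_equivalent(formula, cls[0]):
                cls.append(formula)
                break
        else:
            classes.append([formula])
    return max(classes, key=len)[0]
```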

📊 Experiments: Our experiments demonstrate SELP's effectiveness and generalizability across diverse tasks. In drone navigation, SELP outperforms state-of-the-art LLM planners by 10.8% in safety rate and by 19.8% in plan efficiency. For robot manipulation, SELP achieves a 20.4% improvement in safety rate.

@yiwu @jiang719

#ICRA2025 #LLM #Robotics #Agent #LLMPlanner
davidberenstein1957 posted an update 1 day ago
albertvillanova posted an update 1 day ago
🚀 Introducing @huggingface Open Deep-Research 💥

In just 24 hours, we built an open-source agent that:
✅ Autonomously browses the web
✅ Searches, scrolls & extracts info
✅ Downloads & manipulates files
✅ Runs calculations on data

55% on the GAIA validation set! Help us improve it! 💡
https://huggingface.co/blog/open-deep-research
rubenroy posted an update 2 days ago
🔥🚀 Hey everyone! I'm excited to share my latest LLM release: Gilgamesh 72B, a model built on Qwen 2.5-72B Instruct. Gilgamesh was trained on a couple of my GammaCorpus datasets, specifically:

- rubenroy/GammaCorpus-CoT-Math-170k
- rubenroy/GammaCorpus-v2-5m
- rubenroy/GammaCorpus-Fact-QA-450k

I've submitted GGM 72B to the Open LLM Leaderboard for benchmarking, and I'll send an update post once the results are in!

You can try it out and share your feedback; check out the model page and see what it can do:
👉 rubenroy/Gilgamesh-72B
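If you want to poke at it from Python, a rough sketch with transformers is below. It assumes the Qwen 2.5 chat template is preserved, and a 72B model needs multiple GPUs or quantization, so the loading settings are only illustrative.

```python
# Rough sketch of trying Gilgamesh 72B with transformers. A 72B model needs
# multiple GPUs or quantization; settings here are illustrative, and the
# chat template is assumed to follow Qwen 2.5-72B Instruct.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rubenroy/Gilgamesh-72B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Give me a two-sentence summary of the Epic of Gilgamesh."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```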

Would love to hear your thoughts!
nyuuzyou posted an update about 19 hours ago
📱 UI Navigation Corpus - teleren/ui-navigation-corpus

A comprehensive collection of mobile and web UI elements created by @teleren, a new member of the Hugging Face community. I'm glad I was able to help a little, together with @its5Q, to get this dataset published.

This dataset contains:
- Screenshots and recordings of mobile (iOS/Android) and web interfaces
- UI navigation annotations and metadata
- Screen categorization tags and text extractions
- Navigation paths and screen relationships
- Version control for UI imagery

Perfect for training UI navigation agents and understanding interface patterns. The dataset provides detailed annotations linking screens, sections, and navigation flows together.
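If you want to explore it, a minimal sketch with the datasets library is below; split names, configs, and column names are assumptions, so check the dataset card for the actual layout.

```python
# Minimal sketch of browsing the corpus with the `datasets` library.
# Split/config and column names are assumptions -- see the dataset card
# at teleren/ui-navigation-corpus for the real layout.
from datasets import load_dataset

ds = load_dataset("teleren/ui-navigation-corpus", split="train")
print(ds)       # inspect the available columns
print(ds[0])    # first screenshot / annotation record
```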