Quentin Lhoest PRO

lhoestq

AI & ML interests

Maintainer of πŸ€—Datasets: NLP, Multimodal data processing and sharing

Recent Activity

updated a dataset about 3 hours ago
infinite-dataset-hub/CapedContinuumCreators
updated a dataset about 4 hours ago
infinite-dataset-hub/TrendsetterTitans
updated a dataset about 4 hours ago
infinite-dataset-hub/ModernMythosMakers
View all activity

Articles

Organizations

Hugging Face's profile picture WMT: Workshop on Statistical Machine Translation's profile picture BigScience Workshop's profile picture Neuropark's profile picture Hugging Face Internal Testing Organization's profile picture Training Transformers Together's profile picture BigScience Catalogue Data's profile picture OpenSLR's profile picture BigScience Data's profile picture Evaluation on the Hub's profile picture Datasets Maintainers's profile picture 2023 Jan Offsite hackathon's profile picture Whisper Distillation's profile picture Open LLM Leaderboard's profile picture huggingPartyParis's profile picture CommonCanvas's profile picture ZeroGPU Explorers's profile picture Datasets examples's profile picture Pixel Parsing's profile picture HuggingFaceFW-Dev's profile picture Infinite Dataset Hub's profile picture Hugging Face FineVideo's profile picture Dataset ReWriter's profile picture Dataset Tools's profile picture Rainforest Connection's profile picture

lhoestq's activity

posted an update 12 days ago
view post
Post
1604
Made a HF Dataset editor a la gg sheets here: lhoestq/dataset-spreadsheets

With Dataset Spreadsheets:
✏️ Edit datasets in the UI
πŸ”— Share link with collaborators
🐍 Use locally in DuckDB or Python

Available for the 100,000+ parquet datasets on HF :)
reacted to christopher's post with πŸ‘ 16 days ago
reacted to christopher's post with πŸ”₯ 16 days ago
view post
Post
1562
The folks at Foursquare released a dataset of 104.5 million places of interest ( foursquare/fsq-os-places) and here's all of them on a plot
Β·
reacted to dvilasuero's post with ❀️πŸ”₯ 18 days ago
view post
Post
2257
🌐 Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior TΓ©cnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

🏷️ +200 contributors used Argilla MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. πŸ—½ Culturally Agnostic: no specific regional, cultural knowledge is required.
2. βš–οΈ Culturally Sensitive: requires dialect, cultural knowledge or geographic knowledge to answer correctly.

Moreover, we provide high quality translations of 25 out of 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.

Dataset: CohereForAI/Global-MMLU
reacted to davidberenstein1957's post with πŸš€ 20 days ago
view post
Post
3409
The Data Is Better Together community is set to release the first Apache 2 licensed image preference dataset!

Great work and let's give this a final push :)

@aashish1904 congrats on your month of HF pro. There is more to win during this sprint!

@aashish1904 @AnyaDesdein @davidberenstein1957 @Malalatiana @beta3 @fffiloni @munish0838 @Reza2kn @bbunzeck @Creazycreator @andrei-saceleanu @jafhaponiuk @rca-etl @kf120 @burtenshaw @mmhamdy @grib0ed0v @Doopus @AnyaDes @ttkap @Xceron @Lewox @davanstrien @Azazelle @adirik @Ashish08 @AntonVic @kenantang @sdiazlor @g-ronimo @dennis-rall @prithivMLmods @girtss3 @flozi00 @WaveCut @Taylor658 @Wildminder @Sara9999 @phaelishall @sararob @dvilasuero @pgabrys @plaguss @CDS899 @timajwilliams @rudzinskimaciej @pavel-ai @aggr8 @ignacioct @MouseAI @Leeps @MaksKul @NicolasDmln @Muinez @kusht55 @caiolang @Jakub-Brand24 @loamy @Demijan @eliab96 @Viewegger @JosephCatrambone @p1atdev @mrshu @o639 @Targezed @Aviv-anthonnyolime @thliang01 @Ahmed-Amine @glards @pranaykoppula @nataliaElv @MaPirlet @alvarobartt @gabrielmbmb @zlicastro @Jaydip @Chouettecheveche @lilcheaty @ruyrdiaz @robintema @fdaudens @ggcristian @a-r-r-o-w @pates @joheras @stopsatgreen @bezo97 @chachi902 @iamyann @liamcripwell @dmb23 @korbih @anonymous7743 @akbdx18 @OVAWARE @severo @akontra @lichorosario @lhoestq @SebastianBodza @Vishnou @ameerazam08 @appoose @Mukei @mearco @joaquincabezas @Fizzarolli @thomastraum @igortopolski @OxxoCodes @patrickfleith @asoria @bn22 @sitammeur @Krodolf @bergr7f @Sbxxn @wietsevenema @sugatoray @Iamladi @MikeTrizna @feveromo @mokady @Bolero @prath @Dowwie @kfahn @decodingchris @alili2050 @RahulRaman @yzimmermann @Ameeeee @ecyht2 @MattMC001 @hemanthkumarak @Thegorgibus @akos2 @LawRun @ramithuh @SuperMuel @sjans @peterizsak @mosama @Eyel @mtr3 @cfahlgren1 @legentil @clem @Citaman @Aurelien-Morgan @AntoineBourgois @TotoB12 @Stanmey @osanseviero @multimodalart @maxiw @ariG23498 @ngk89 @femboysLover @dvs @tacohiddink @blanchon @DavidJimenez
  • 1 reply
Β·
reacted to rwightman's post with πŸ‘ about 1 month ago
view post
Post
1276
I'm currently on a push to expand the scope of image based datasets on the Hub. There's certainly a lot already, but for anyone who's looked closely, there's not a whole lot of standardization. I am to fix that, datasets under the https://huggingface.co/timm and https://huggingface.co/pixparse orgs will serve as canonical examples for various task / modality combinations and be useable without fuss in libraries like timm, OpenCLIP, and hopefully more.

I just uploaded the first multi-label dataset that I'll support with timm scripts soon: timm/plant-pathology-2021

Next up object detection & segmentation! I've got an annotation spec sorted out, a lot of datasets ready to rip, and yeah that means timm support for object detection, eventually segmentation, is finally under development :O
reacted to merve's post with πŸ”₯ about 1 month ago
view post
Post
4994
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
πŸ’¨ a new vision language model with 9x less image tokens, super efficient
πŸ“– aligned with DPO for reducing hallucinations
⚑️ Apache 2.0 license πŸ”₯

Demo hf.co/spaces/NexaAIDev/omnivlm-dpo-demo
Model https://huggingface.co/NexaAIDev/omnivision-968M
  • 4 replies
Β·
reacted to jsulz's post with πŸš€ 3 months ago
view post
Post
2042
In August, the XetHub team joined Hugging Face
- https://huggingface.co/blog/xethub-joins-hf - and we’ve been rolling up our sleeves to bring the best of both worlds together. We started with a deep dive into the current state of files stored with Git LFS on the Hub.

Getting this information was no small feat. We had to:
* Analyze a complete database dump of all repositories and files stored in Git LFS across Hugging Face.
* Parse through metadata on file sizes and types to accurately map the storage breakdown across Spaces, Models, and Datasets.

You can read more about the findings (with some jaw-dropping stats + charts) here https://www.linkedin.com/feed/update/urn:li:activity:7244486280351285248
reacted to asoria's post with πŸ‘ 3 months ago
view post
Post
2458
πŸ“ I wrote a tutorial on how to get started with the fine-tuning process using Hugging Face tools, providing an end-to-end workflow.

The tutorial covers creating a new dataset using the new SQL Console πŸ›’ and fine-tuning a model with SFT, guided by the Notebook Creator App πŸ“™.

πŸ‘‰ You can read the full article here:
https://huggingface.co/blog/asoria/easy-fine-tuning-with-hf
asoria/auto-notebook-creator
reacted to clem's post with ❀️ 4 months ago
view post
Post
3678
This isn’t a goal of ours because we have plenty of money in the bank but quite excited to see that @huggingfaceis profitable these days, with 220 team members and most of our platform being free (like model hosting) and open-source for the community!

Especially noteworthy at a time when most AI startups wouldn’t survive a year or two without VC money. Yay!
Β·
replied to their post 5 months ago
replied to their post 5 months ago
posted an update 5 months ago
view post
Post
4045
Hey ! I'm working on a 100% synthetic Dataset Hub here (you can search for any kind of datasets an the app invents them). The link is here: infinite-dataset-hub/infinite-dataset-hub

Question for the Community:

Which models should I use to generate images and audio samples for those datasets ? πŸ€—
  • 4 replies
Β·
reacted to severo's post with β€οΈπŸš€ 5 months ago
view post
Post
3483
[New tool] Follow interesting ML persons πŸ‘©β€πŸŽ¨ πŸ‘¨β€πŸŽ€ πŸ‘©β€πŸ« with Followgraph

severo/followgraph

Please try it and tell me if it helped you discover high-quality content πŸ‘ πŸ‘Ž

I repurposed "Followgraph for Mastodon" (https://followgraph.vercel.app/).

My new follows: @TheBloke @mlabonne @teknium @KnutJaegersberg @SkalskiP @AmelieSchreiber @lbourdois @ceyda @andrewyng @Pclanglais @karpathy

And you?
Β·
reacted to Wauplin's post with πŸ”₯ 5 months ago
view post
Post
1984
πŸš€ Just released version 0.24.0 of the πš‘πšžπšπšπš’πš—πšπšπšŠπšŒπšŽ_πš‘πšžπš‹ Python library!

Exciting updates include:
⚑ InferenceClient is now a drop-in replacement for OpenAI's chat completion!

✨ Support for response_format, adapter_id , truncate, and more in InferenceClient

πŸ’Ύ Serialization module with a save_torch_model helper that handles shared layers, sharding, naming convention, and safe serialization. Basically a condensed version of logic scattered across safetensors, transformers , accelerate

πŸ“ Optimized HfFileSystem to avoid getting rate limited when browsing HuggingFaceFW/fineweb

πŸ”¨ HfApi & CLI improvements: prevent empty commits, create repo inside resource group, webhooks API, more options in the Search API, etc.

Check out the full release notes for more details:
Wauplin/huggingface_hub#7
πŸ‘€
Β·
reacted to dvilasuero's post with πŸš€πŸ”₯ 6 months ago
view post
Post
8071
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!

We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we’ve been collaborating with Hugging Face on countless projects: launching partner of Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr’s learnings, the Data is Better Together initiative with hundreds of community contributors, or releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.

To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and AmΓ©lie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!
Β·
reacted to albertvillanova's post with πŸ‘ 7 months ago
view post
Post
2699
Easily convert your script-based datasets to Parquet and explore them in the dataset viewer. 🌟

πŸ› οΈ Use @huggingface Datasets CLI:
$ 𝚍𝚊𝚝𝚊𝚜𝚎𝚝𝚜-πšŒπš•πš’ πšŒπš˜πš—πšŸπšŽπš›πš_𝚝𝚘_πš™πšŠπš›πššπšžπšŽπš πš„πš‚π™΄πšπ™½π™°π™Όπ™΄/π™³π™°πšƒπ™°πš‚π™΄πšƒ_𝙽𝙰𝙼𝙴

Learn more: https://huggingface.co/docs/datasets/main/en/cli#convert-to-parquet
#Data #AI