AI & ML interests

None defined yet.

Recent Activity

fr-gouv-coordination-ia's activity

nataliaElv posted an update 8 days ago
If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines or how Argilla works, this is your video!

I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out!

https://www.youtube.com/watch?v=_-ORB4WAVGU
nataliaElv posted an update 14 days ago
How do your annotations for FineWeb2 compare to your teammates'?

I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.

I did some analysis and I wasn't surprised to see that I'm being a bit harsher on my evaluations than my mates 😂


Do you want to see how your annotations compare to others?
👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
✍️ Enter the dataset you've contributed to and your Hugging Face username. (A rough local version of this comparison is sketched in the code after this post.)

How were your results?
- Contribute some annotations: data-is-better-together/fineweb-c
- Join your language channel in Rocket chat: HuggingFaceFW/discussion
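For readers who prefer a local check, here is a rough sketch of the same comparison using the datasets library. The language config name and the column names below are assumptions, so adjust them to what the dataset card actually lists.

```python
from collections import Counter
from datasets import load_dataset

# Assumptions (check the dataset card): the config name and the column names
# "annotator_ids" / "educational_value_labels" may differ from what is shown here.
LANG_CONFIG = "fra_Latn"  # hypothetical config name -- pick yours from the dataset card

ds = load_dataset("data-is-better-together/fineweb-c", LANG_CONFIG, split="train")

per_annotator = {}
for row in ds:
    # Assumed layout: parallel lists of annotator ids and their labels per document.
    for annotator, label in zip(row["annotator_ids"], row["educational_value_labels"]):
        per_annotator.setdefault(annotator, Counter())[label] += 1

for annotator, counts in per_annotator.items():
    total = sum(counts.values())
    print(annotator, {label: round(n / total, 2) for label, n in counts.items()})
```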
nataliaElv posted an update 22 days ago
We're so close to reaching 100 languages! Can you help us cover the remaining 200? Check if we're still looking for language leads for your language: nataliaElv/language-leads-dashboard
frascuchon posted an update 26 days ago
🚀 Argilla v2.5.0 is out! 🎉
We're excited to announce the latest version of Argilla, packed with features to make your data annotation workflows more powerful and seamless. Here's what's new:

✨ 1. Argilla Webhooks
With Argilla webhooks, you can:
* Trigger custom workflows
* Seamlessly integrate with external tools
* Build custom event-driven pipelines

๐Ÿ 2. Support for Python 3.13 and Pydantic v2
Argilla v2.5.0 now runs on:
* Python 3.13 for enhanced compatibility and speed
* Pydantic v2 for improved performance and type validation

🎨 3. Redesigned Home Page
Argilla's home page has been redesigned to provide a better user experience: a new dataset card view gives a clearer overview of your datasets and annotation progress.

📖 Read the full release notes 👉 https://github.com/argilla-io/argilla/releases/tag/v2.5.0
⬇️ Update now 👉 https://pypi.org/project/argilla
or use the live demo 👉 argilla/argilla-template-space
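For reference, a minimal upgrade-and-connect sketch with the v2 Python SDK; the URL, API key, and dataset name are placeholders for your own deployment, and the webhook API itself is covered in the release notes linked above.

```python
# Upgrade the SDK first:  pip install -U argilla
import argilla as rg

# Connect to your Argilla server (placeholders: use your own Space URL and API key).
client = rg.Argilla(
    api_url="https://<your-argilla-space>.hf.space",
    api_key="<your-api-key>",
)

# Retrieve one of your datasets by name (placeholder name) to check the connection.
# The new webhook API itself is documented in the v2.5.0 release notes linked above.
dataset = client.datasets(name="my_annotation_dataset")
print(dataset)
```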
nataliaElv posted an update 28 days ago
Would you like to get a high-quality dataset to pre-train LLMs in your language? 🌍

At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.

Follow the link below, check if your language is listed and sign up to be a Language Lead!

https://forms.gle/s9nGajBh6Pb9G72J6
nataliaElv posted an update 30 days ago
You can now add your Bluesky handle to your Hugging Face profile! 🦋
Have you noticed?
clefourrier posted an update 8 months ago
In a basic chatbot, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm
clefourrier posted an update 8 months ago
Contamination-free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means that you can average model scores only over problems released after a model's training data was collected. In other words: contamination-free code evals! 🚀

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!
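The date-filtering idea can be sketched in a few lines: keep only the problems released after a chosen cutoff and average over those. The data layout below is purely illustrative, not LiveCodeBench's actual schema.

```python
# Illustrative sketch of date-windowed scoring (not LiveCodeBench's actual schema).
from datetime import date

# Each entry: (problem release date, 1.0 if the model solved it else 0.0)
results = [
    (date(2023, 9, 10), 1.0),
    (date(2023, 12, 2), 0.0),
    (date(2024, 2, 20), 1.0),
    (date(2024, 3, 15), 0.0),
]

def score_after(cutoff: date, results) -> float:
    """Average accuracy restricted to problems released after `cutoff`."""
    kept = [solved for released, solved in results if released > cutoff]
    return sum(kept) / len(kept) if kept else float("nan")

# Only problems released after the (hypothetical) training cutoff count,
# which is what makes the evaluation contamination-free.
print(score_after(date(2024, 1, 1), results))  # -> 0.5
```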
clefourrier posted an update 8 months ago
🆕 Evaluate your RL agents - who's best at Atari? 🏆

The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations 🚶 and more)!

When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨

Kudos to @qgallouedec for creating and maintaining the leaderboard!
Let's find out which agent is the best at games! ๐Ÿš€

open-rl-leaderboard/leaderboard
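For intuition, here is a minimal sketch of the rollout-style evaluation such a leaderboard performs, using a random policy on a Gymnasium environment (CartPole here to avoid extra Atari dependencies); the leaderboard itself runs submitted agents, not a random policy.

```python
# Minimal sketch: evaluate a (here random) policy by its average episode return.
# The leaderboard evaluates real submitted agents; this only shows the loop shape.
import gymnasium as gym

env = gym.make("CartPole-v1")  # Atari envs would additionally need ale-py
n_episodes = 10
returns = []

for _ in range(n_episodes):
    obs, info = env.reset()
    done, episode_return = False, 0.0
    while not done:
        action = env.action_space.sample()  # a trained agent would pick the action here
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
        done = terminated or truncated
    returns.append(episode_return)

env.close()
print(f"mean return over {n_episodes} episodes: {sum(returns) / n_episodes:.1f}")
```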
clefourrier posted an update 9 months ago
Fun fact about evaluation, part 2!

How much do scores change depending on prompt format choice?

Using different prompt formats (all present in the literature, ranging from the bare question to "Question: <question>\nChoices: <enumeration of all choices>\nAnswer:"), we get a score range of...

10 points for a single model!
Keep in mind that we only changed the prompt, not the evaluation subsets, etc.
Again, this confirms that evaluation results reported without their details are basically bullshit.

Prompt format is on the x axis of the attached figure; all these evals look at the logprob of either "choice A/choice B..." or "A/B...".

Incidentally, it also changes model rankings - so a "best" model might only be best on one type of prompt...
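To make the variable concrete, here is a small illustrative sketch of what "only changing the prompt format" means for one and the same sample; the formats are paraphrases of the ones mentioned above, not the exact evaluation code.

```python
# Illustrative sketch: the same sample rendered under different prompt formats.
# Scores are then computed from the logprob of each choice under each format.
sample = {
    "question": "What is the capital of France?",
    "choices": ["Berlin", "Madrid", "Paris", "Rome"],
}

def format_bare(s):
    return f"{s['question']}\n"

def format_question_prefix(s):
    return f"Question: {s['question']}\nAnswer: "

def format_with_choices(s):
    letters = "ABCD"
    choices = "\n".join(f"{l}. {c}" for l, c in zip(letters, s["choices"]))
    return f"Question: {s['question']}\nChoices:\n{choices}\nAnswer: "

for fmt in (format_bare, format_question_prefix, format_with_choices):
    print(f"--- {fmt.__name__} ---")
    print(fmt(sample))
```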
clefourrier posted an update 9 months ago
Fun fact about evaluation!

Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing
♻️ the order in which the few-shot examples are added to the prompt ♻️
you get a difference of up to 3 points in evaluation score?

I did a small experiment using some MMLU subsets on the best performing 7B and lower pretrained models from the leaderboard.

I tried 8 different prompting methods (containing more or less information, such as just the question, or Question: question, or Question: question Choices: ..., see the x axis) that are commonly used in evaluation.

I then compared the results for all these methods, in 5-shot, across 2 runs. The *only difference* between the first and second run was that the few-shot samples were not introduced in the same order.
For example, run 1 would be "A B C D E Current sample", vs. "D C E A B Current sample" in run 2.
All the other experiment parameters stayed exactly the same.

As you can see in the attached picture, you get a difference of up to 3 points between the two few-shot orderings.

So, when just changing *the order of the few shot samples* can change your results by several points, what is the impact of all other "minimal" and unreported prompting changes?

-> Any kind of model score, provided without an evaluation script for reproducibility, is basically bullshit (or comms).
-> This is why we need reproducible evaluation in a fair, strictly identical setup, using evaluation suites such as lm_eval from the Harness, lighteval from HF, or the Open LLM Leaderboard.
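The ordering effect is easy to picture at the prompt level: the two prompts below contain exactly the same five few-shot examples and the same current sample, only their order differs. This is a schematic sketch, not the actual lm_eval/lighteval code.

```python
# Schematic sketch: identical 5-shot examples, two different orderings.
import random

few_shot = [
    ("Q: 2 + 2 = ?\nA:", " 4"),
    ("Q: The capital of Italy is?\nA:", " Rome"),
    ("Q: Water freezes at?\nA:", " 0 degrees Celsius"),
    ("Q: 3 * 3 = ?\nA:", " 9"),
    ("Q: The largest planet is?\nA:", " Jupiter"),
]
current = "Q: The capital of France is?\nA:"

def build_prompt(examples, current_sample):
    shots = "\n\n".join(q + a for q, a in examples)
    return shots + "\n\n" + current_sample

run_1 = build_prompt(few_shot, current)   # original order
shuffled = few_shot.copy()
random.Random(0).shuffle(shuffled)        # same examples, different order
run_2 = build_prompt(shuffled, current)

# Same information, different order -- and reported scores can move by ~3 points.
print(run_1 == run_2)  # False
```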
clefourrier posted an update 9 months ago
Are you looking for the perfect leaderboard/arena for your use case? 👀

There's a new tool for this!
https://huggingface.co/spaces/leaderboards/LeaderboardFinder

Select your modality, language, task... then search! 🔍
Some categories of interest:
- does the leaderboard accept submissions?
- is the test set private or public?
- is it using an automatic metric, human evaluators, or llm as a judge?

The Spaces list is built from Space metadata and reloaded every hour.

Enjoy!
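A rough sketch of the underlying idea with huggingface_hub, querying Spaces from Hub metadata; the search term is only an example, and LeaderboardFinder itself relies on richer Space metadata refreshed hourly.

```python
# Rough sketch: discover leaderboard Spaces from Hub metadata.
# LeaderboardFinder itself uses richer Space metadata (tags), refreshed hourly.
from huggingface_hub import HfApi

api = HfApi()
spaces = api.list_spaces(search="leaderboard", limit=20)  # example query only

for space in spaces:
    print(space.id)
```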
clefourrier posted an update 9 months ago
How talkative is your chatbot about your internal data? 😬

As more chatbots get deployed in production, with access to internal databases, we need to make sure they don't leak private information to anyone interacting with them.

The Lighthouz AI team therefore introduced the Chatbot Guardrails Arena to stress test models and see how well guarded your private information is.
Anyone can try to make models reveal information they should not share 😈
(which is quite fun to do for the strongest models)!

The votes will then be gathered to create an Elo ranking of the safest models with respect to PII.

In the future, with the support of the community, this arena could inform the safety choices companies make when selecting models and guardrails, based on their resistance to adversarial attacks.
It's also a good way to easily demonstrate the limitations of current systems!

Check out the arena: lighthouzai/guardrails-arena
Learn more in the blog: https://huggingface.co/blog/arena-lighthouz
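For reference, turning pairwise votes into a ranking is typically a standard Elo update; below is a minimal sketch, with K=32 and a 1500 starting rating as conventional defaults rather than the arena's exact settings.

```python
# Minimal Elo update sketch: turn pairwise "which model leaked less" votes
# into ratings. K and the 1500 start value are conventional defaults,
# not necessarily the arena's exact settings.
def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    score_a = 1.0 if a_won else 0.0
    e_a = expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

ratings = {"model_a": 1500.0, "model_b": 1500.0}
# One vote: model_a kept the private info safe, model_b leaked it.
ratings["model_a"], ratings["model_b"] = update(ratings["model_a"], ratings["model_b"], a_won=True)
print(ratings)  # model_a moves above model_b
```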
clefourrier posted an update 10 months ago
🔥 New multimodal leaderboard on the hub: ConTextual!

Many situations require models to parse images containing text: maps, web pages, real world pictures, memes, ... 🖼️
So how do you evaluate performance on this task?

The ConTextual team introduced a brand new dataset of instructions and images to test the reasoning capabilities of LMMs (large multimodal models), along with an associated leaderboard (with a private test set).

This is super exciting imo because it has the potential to be a good benchmark both for multimodal models and for assistants' vision capabilities, thanks to the instructions in the dataset.

Congrats to @rohan598 , @hbXNov , @kaiweichang and @violetpeng !!

Learn more in the blog: https://huggingface.co/blog/leaderboard-contextual
Leaderboard: ucla-contextual/contextual_leaderboard
clefourrier posted an update 10 months ago
First big community contribution to our evaluation suite, lighteval ⛅️

@Ali-C137 added 3 evaluation tasks in Arabic:
- ACVA, a benchmark about Arabic culture
- MMLU, translated
- Exams, translated
(datasets provided/translated by the AceGPT team)

Congrats to them!
https://github.com/huggingface/lighteval/pull/44
clefourrier posted an update 10 months ago
🔥 New LLM leaderboard blog: Open Ko LLM!

One of the oldest leaderboards on the hub, it has already evaluated more than 1000 models! It uses Korean translations of MMLU, ARC, HellaSwag, TruthfulQA, and a new dataset, Korean CommonGen, focused on Korean-specific common-sense alignment.

upstage/open-ko-llm-leaderboard

What's interesting about this leaderboard is how it drove LLM development in Korea, with an average of about 4 model submissions per day since it started!
Really looking forward to seeing similar initiatives in other languages, to help quality models emerge beyond "just English" (for the other two-thirds of the world).

Read more about how the leaderboard works in the intro blog: https://huggingface.co/blog/leaderboards-on-the-hub-upstage
Congrats to @Chanjun , @hunkim and the Upstage team!