Kenneth C. Enevoldsen

KennethEnevoldsen

AI & ML interests

NLP, multimodal learning, Scandinavian NLP, Theory of Mind, Medical NLP, Psychiatry

Organizations

Spaces-explorers, Dansk Data Science Community, Center for Humanities Computing Aarhus, Massive Text Embedding Benchmark, Automatic Abstractive Summarisation in Danish, KCE-ORG, Danish Foundation Models, Merge Crew, Social Post Explorers

KennethEnevoldsen's activity

New activity in mteb/leaderboard 6 days ago

Clean up duplicates

2
#151 opened 9 days ago by
pylotlight
New activity in mteb/climate-fever 6 days ago

Not actually Climate FEVER

5
#1 opened 4 months ago by
sasha
reacted to davanstrien's post with 🤗🔥 21 days ago
3051
Introducing scandi-fine-web-cleaner davanstrien/scandi-fine-web-cleaner, the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.

Today, I'm happy to share the first classifier trained on this data.

πŸ” What we've built:

- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

🌍 Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.
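As a rough illustration of the filtering step described above, here is a minimal sketch in Python. It is not the code from the blog post: `filter_documents` and `toy_score` are hypothetical names, and `toy_score` is a crude stand-in heuristic where a real setup would call the trained classifier. The 0.3 threshold is likewise an arbitrary assumption.

```python
from typing import Callable, Iterable, Iterator


def filter_documents(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only the documents whose 'problematic' score is below the threshold."""
    for doc in docs:
        if score_fn(doc) < threshold:
            yield doc


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a trained quality classifier.

    Returns the fraction of characters that are neither letters nor
    whitespace. A real pipeline would return the classifier's predicted
    probability that the document is low quality instead.
    """
    if not text:
        return 1.0  # treat empty documents as low quality
    noisy = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    return noisy / len(text)


docs = [
    "Dette er en helt almindelig dansk tekst om vejret",
    "$$$ 1234 !!! #### 5678 ???",
    "",
]

# Keep documents scoring below the (assumed) threshold of 0.3.
kept = list(filter_documents(docs, toy_score, threshold=0.3))
print(kept)
```

Because the filter is a generator that takes any scoring function, swapping in a real classifier only means replacing `toy_score`, and documents can be streamed rather than loaded into memory at once.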

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
  • 1 reply
Β·
New activity in BAAI/bge-small-en-v1.5 25 days ago
New activity in PleIAs/Danish-PD 27 days ago
New activity in danish-foundation-models/danish-dynaword 27 days ago

add Danish-PD

#38 opened 27 days ago by
KennethEnevoldsen