Omar Kamali
PRO
omarkamali
57 followers · 24 following
https://omarkama.li
omar-kamali
AI & ML interests
NLP & LLMs for low resource languages.
Recent Activity
Updated a dataset 3 days ago: omarkamali/wikipedia-monthly
Posted an update 5 days ago:
You're probably training on outdated Wikipedia data right now and don't know it.

In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We added so many articles nowhere to be found on HuggingFace."

He was right. I was running a 2023 snapshot. In 2025.

The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.

• For English, that's 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was literally no text corpus at all until recently.

I could've shrugged and moved on. Instead I spent the next months building a monthly automated pipeline for 340+ languages, on my personal laptop, nearly killing it several times in the process (100% disk, frozen screen, the works).

Nous Research trained Hermes 4 on it. INRIA cited it. It's now three years ahead of what most people are training on.

Here's the full story of how I built Wikipedia Monthly: https://omarkamali.com/blog/wikipedia-monthly-pipeline
Updated a model 8 days ago: wikilangs/hu
omarkamali's Spaces (2)
LLM Scope
Explore the inner workings of your favorite LLMs
Harmony Inspector
Parse and inspect how the GPT-OSS Harmony format works