- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  Paper • 2305.18290 • Published • 56
- The Prompt Report: A Systematic Survey of Prompting Techniques
  Paper • 2406.06608 • Published • 63
- Emu3: Next-Token Prediction is All You Need
  Paper • 2409.18869 • Published • 95
- Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
  Paper • 2504.01990 • Published • 246
Jakhongir Saydaliev
Jakh0103
AI & ML interests
None yet
Recent Activity
reacted to kargaranamir's post 6 days ago
Introducing GlotCC: a new 2TB corpus based on an early 2024 CommonCrawl snapshot with data for 1000+ languages.
corpus v1: https://huggingface.co/datasets/cis-lmu/GlotCC-V1
pipeline v3: https://github.com/cisnlp/GlotCC
More details? Stay tuned for our upcoming paper.
More data? In the next version, we plan to include additional snapshots of CommonCrawl.
Limitation: Because low-resource languages appear far less often than others, some very low-resource languages have only a few sentences available. However, English accounts for 750GB in this version, and the top 200 languages still have a strong presence in our data (see the attached plot; the index marks every 20th language, so the 10th index corresponds to the 200th language).
reacted to kargaranamir's post 6 days ago
A Text Language Identification Model with Support for +2000 Labels:
space: https://huggingface.co/spaces/cis-lmu/glotlid-space
model: https://huggingface.co/cis-lmu/glotlid
github: https://github.com/cisnlp/GlotLID
paper: https://huggingface.co/papers/2310.16248
updated a collection 13 days ago
Papers