Alan Tseng

agentlans

AI & ML interests

Small data, boring AI

Recent Activity

updated a dataset about 11 hours ago
agentlans/combined-roleplay
updated a dataset about 13 hours ago
agentlans/Conversational-Reasoning-Topical-Chat
published a dataset about 13 hours ago
agentlans/Conversational-Reasoning-Topical-Chat

Organizations

None yet

agentlans's activity

New activity in agentlans/multilingual-sentences 3 days ago

Miss numbers

1
#2 opened 4 days ago by neurlang
reacted to tomaarsen's post with ❤️ 3 days ago
An assembly of 18 European companies, labs, and universities have banded together to launch 🇪🇺 EuroBERT! It's a state-of-the-art multilingual encoder for 15 European languages, designed to be finetuned for retrieval, classification, etc.

🇪🇺 15 Languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
3️⃣ 3 model sizes: 210M, 610M, and 2.1B parameters - very very useful sizes in my opinion
➡️ Sequence length of 8192 tokens! Nice to see these higher sequence lengths for encoders becoming more common.
⚙️ Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
🔥 A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
📊 Evaluated against mDeBERTa, mGTE, XLM-RoBERTa for Retrieval, Classification, and Regression (after finetuning for each task separately): EuroBERT punches way above its weight.
📝 Detailed paper with all details, incl. data: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.

Check out the release blogpost here: https://huggingface.co/blog/EuroBERT/release
* EuroBERT/EuroBERT-210m
* EuroBERT/EuroBERT-610m
* EuroBERT/EuroBERT-2.1B

The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!
replied to tomaarsen's post 3 days ago

I was about to finetune my own English-French and English-Chinese BERTs. But this is way better!