Leshem Choshen

borgr

56 27 23

https://ktilana.wixsite.com/leshem-choshen

AI & ML interests

Future of Humane AI, technology that matters to all of us. CoLab PI - doing Science together

Recent Activity

published an article about 17 hours ago

Featuring Every Eval Ever Results on Hugging Face Model Pages

updated a model 2 days ago

The-CoLab/llama3-7b-en-anchored-ar-aya

updated a model 2 days ago

The-CoLab/llama3-7b-en-ru-aya

Organizations

upvoted a collection 17 days ago

Multilingual-Transfer

Collection

Pretraining models to find what allows multilingual transfer • 26 items • Updated 2 days ago • 2

upvoted a paper 28 days ago

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Paper • 2605.28556 • Published May 27 • 73

upvoted an article about 2 months ago

Article

AI evals are becoming the new compute bottleneck

evaleval • Apr 29 • 30

upvoted 2 papers 2 months ago

Why Fine-Tuning Encourages Hallucinations and How to Fix It

Paper • 2604.15574 • Published Apr 16 • 25

ZipNN: Lossless Compression for AI Models

Paper • 2411.05239 • Published Nov 7, 2024 • 3

upvoted a paper 4 months ago

General Agent Evaluation

Paper • 2602.22953 • Published Feb 26 • 12

upvoted a collection 9 months ago

BabyBabelLM

Collection

A multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. • 45 items • Updated Oct 29, 2025 • 10

upvoted a paper 11 months ago

Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy

Paper • 2508.07485 • Published Aug 10, 2025 • 10

upvoted an article 11 months ago

Article

The AI Evaluation Chart Crisis

andrewtran117 • Aug 12, 2025 • 4

upvoted 4 papers about 1 year ago

upvoted an article over 1 year ago

Article

FeeL: Making Multilingual LMs Better, One Feedback Loop at a Time

borgr • Mar 25, 2025 • 12

upvoted 2 papers over 1 year ago

Survey on Evaluation of LLM-based Agents

Paper • 2503.16416 • Published Mar 20, 2025 • 97

Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights

Paper • 2502.09619 • Published Feb 13, 2025 • 36

upvoted a collection over 1 year ago

Dicta-LM 2.0 Collection

Collection

9 items • Updated Apr 27, 2024 • 21

upvoted 3 papers over 1 year ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 20

LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content

Paper • 2410.10783 • Published Oct 14, 2024 • 26

SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification

Paper • 2410.05057 • Published Oct 7, 2024 • 7
