WildEval

non-profit

wild_eval

Recent Activity

yuchenlin updated a Space 13 days ago

WildEval/ZebraLogic

WildEval's activity

yuchenlin

updated a Space 13 days ago

Zebra Logic Bench

🦓

Explore and evaluate Zebra Logic models

yuchenlin

authored a paper 15 days ago

CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

Paper • 2504.00043 • Published 25 days ago • 9

ChengsongHuang

authored a paper 15 days ago

CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

Paper • 2504.00043 • Published 25 days ago • 9

lasha-nlp

authored a paper 28 days ago

Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models

Paper • 2503.12072 • Published Mar 15

ChengsongHuang

authored 4 papers about 2 months ago

On Grounded Planning for Embodied Tasks with Language Models

Paper • 2209.00465 • Published Aug 29, 2022 • 1

Optimizing Language Model's Reasoning Abilities with Weak Supervision

Paper • 2405.04086 • Published May 7, 2024 • 2

Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning

Paper • 2410.10074 • Published Oct 14, 2024 • 1

Efficient Test-Time Scaling via Self-Calibration

Paper • 2503.00031 • Published Feb 25 • 15

faezeb

authored a paper about 2 months ago

Large-Scale Data Selection for Instruction Tuning

Paper • 2503.01807 • Published Mar 3 • 12

yuchenlin

authored a paper 2 months ago

Small Models Struggle to Learn from Strong Reasoners

Paper • 2502.12143 • Published Feb 17 • 35

DongfuJiang

authored a paper 3 months ago

ACECODER: Acing Coder RL via Automated Test-Case Synthesis

Paper • 2502.01718 • Published Feb 3 • 28

yuchenlin

authored a paper 3 months ago

ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

Paper • 2502.01100 • Published Feb 3 • 17

yuchenlin

updated a dataset 3 months ago

WildEval/ZebraLogic

Viewer • Updated Feb 4 • 4.26k • 497 • 5

yuchenlin

published a dataset 3 months ago

WildEval/ZebraLogic

Viewer • Updated Feb 4 • 4.26k • 497 • 5

lasha-nlp

authored 6 papers 3 months ago

Stress Test Evaluation for Natural Language Inference

Paper • 1806.00692 • Published Jun 2, 2018

Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning

Paper • 2305.15065 • Published May 24, 2023 • 1

What's In My Big Data?

Paper • 2310.20707 • Published Oct 31, 2023 • 11

CONDAQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation

Paper • 2211.00295 • Published Nov 1, 2022

The Art of Saying No: Contextual Noncompliance in Language Models

Paper • 2407.12043 • Published Jul 2, 2024 • 4

WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries

Paper • 2407.17468 • Published Jul 24, 2024

Team members 9
