Evals & Monitoring - a sbarman25 Collection

sbarman25 's Collections

Training & Architectures

Models

Safety / Alignment / Policies / SMI

Evals & Monitoring

Spaces

Agentic

Vulnerabilities

CV / Text-to-Image / Image-to-Image / Diffusion

Others

Hardware-aware Models

Tool Usage (w/VLMs)

Vision Language Models

Evals & Monitoring

updated Jul 25, 2024

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Paper • 2303.16634 • Published Mar 29, 2023 • 3
miracl/miracl-corpus

Viewer • Updated Jan 5, 2023 • 77.2M • 5.07k • 43

Note https://github.com/project-miracl/miracl?tab=readme-ov-file MTEB: https://github.com/embeddings-benchmark/mteb
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Paper • 2306.05685 • Published Jun 9, 2023 • 33
How is ChatGPT's behavior changing over time?

Paper • 2307.09009 • Published Jul 18, 2023 • 24
Evaluating Large Language Models: A Comprehensive Survey

Paper • 2310.19736 • Published Oct 30, 2023 • 2
Instruction-Following Evaluation for Large Language Models

Paper • 2311.07911 • Published Nov 14, 2023 • 20
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

Paper • 2303.08896 • Published Mar 15, 2023 • 4

Note https://aclanthology.org/2023.eacl-main.75/
Landmark Attention: Random-Access Infinite Context Length for Transformers

Paper • 2305.16300 • Published May 25, 2023

Note Original "needle-in-haystack-test" for long context input - Passkey Retrieval More: https://arxiv.org/pdf/2402.13753.pdf
INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection

Paper • 2402.03744 • Published Feb 6, 2024 • 4

Note The hyperparameters, including temperature, top-k and top-p, of the LLMs’decoder determine the diversity of the generations. To evaluate the impact of those hyper parameters. We provide a sensitivity analysis in Figure 4. As observed, the performance is greatly influenced by temperature but shows little sensitivity to top-k. The performance of the consistency based methods (EigenScore and Lexical Similarity) drops significantly when the temperature is greater than 1.
Chainpoll: A high efficacy method for LLM hallucination detection

Paper • 2310.18344 • Published Oct 22, 2023 • 1

Note As a substitute for G-Eval
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models

Paper • 2305.13711 • Published May 23, 2023 • 2
WILDS: A Benchmark of in-the-Wild Distribution Shifts

Paper • 2012.07421 • Published Dec 14, 2020 • 1
Extending the WILDS Benchmark for Unsupervised Adaptation

Paper • 2112.05090 • Published Dec 9, 2021 • 1
vectara/hallucination_evaluation_model

Text Classification • Updated Oct 30, 2024 • 189k • 239
MMMU/MMMU

Viewer • Updated Sep 19, 2024 • 11.6k • 7.3k • 220
HuggingFaceH4/mt_bench_prompts

Viewer • Updated Jul 3, 2023 • 80 • 172 • 16
Running on CPU Upgrade

89

🥇

HHEM Leaderboard
Runtime error

6

✅

Transparency Self Assessment (FMTI)
JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Paper • 2310.17631 • Published Oct 26, 2023 • 34
TRUE: Re-evaluating Factual Consistency Evaluation

Paper • 2204.04991 • Published Apr 11, 2022 • 1
Evaluating Very Long-Term Conversational Memory of LLM Agents

Paper • 2402.17753 • Published Feb 27, 2024 • 19
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Paper • 2405.01535 • Published May 2, 2024 • 121
TIGER-Lab/MMLU-Pro

Viewer • Updated Nov 27, 2024 • 12.1k • 39.1k • 311
patched-codes/static-analysis-eval

Viewer • Updated Sep 13, 2024 • 113 • 480 • 17
nvidia/ChatRAG-Bench

Viewer • Updated May 24, 2024 • 34.6k • 1.33k • 107
Running

222

🦁

AI2 WildBench Leaderboard (V2)
allenai/WildBench

Viewer • Updated Nov 4, 2024 • 2.3k • 1.72k • 34