from __future__ import annotations
TITLE = """<h1 align="center" id="space-title">TabArena Leaderboard for Predictive Machine Learning on IID Tabular Data</h1>"""
INTRODUCTION_TEXT = """ | |
TabArena is a living benchmark system for predictive machine learning on tabular data. | |
The goal of TabArena and its leaderboard is to asses the peak performance of | |
model-specific pipelines. | |
""" | |
OVERVIEW_DATASETS = """ | |
The leaderboard is based on a manually curated collection of | |
51 tabular classification and regression datasets for independent and identically distributed | |
(IID) data, spanning the small to medium data regime. The datasets were carefully | |
curated to represent various real-world predictive machine learning use cases. | |
""" | |
OVERVIEW_MODELS = """ | |
The focus of the leaderboard is on model-specific pipelines. Each pipeline | |
is evaluated with default and tuned hyperparameter configuration or as an ensemble of | |
tuned configurations. Each model is implemented in a tested real-world pipeline that was | |
optimized to get the most out of the model by the maintainers of TabArena, and where | |
possible together with the authors of the model. | |
""" | |
OVERVIEW_METRICS = """ | |
The leaderboards are ranked based on Elo. We present several additional | |
metrics. See `More Details` for more information on the metrics. | |
**Note, we impute** the performance for models that cannot run on all datasets due to | |
task or dataset size constraints (e.g. TabPFN, TabICL). In general, imputation | |
negatively represents the model performance, punishing the model for not being able | |
to run on all datasets. We provide leaderboards computed only on the subset of datasets | |
where TabPFN, TabICL, or both can run. We denote these leaderboards by `X-data`. | |
""" | |
OVERVIEW_REF_PIPE = """ | |
The leaderboard includes a reference pipeline, which is applied | |
independently of the tuning protocol and constraints we constructed for models within TabArena. | |
The reference pipeline aims to represent the performance quickly achievable by a | |
practitioner on a dataset. The current reference pipeline is the predictive machine | |
learning system AutoGluon (version 1.3, with the best_quality preset and | |
4 hours for training). AutoGluon represents an ensemble pipeline across various model | |
types and thus provides a reference for model-specific pipelines. | |
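
For orientation, a minimal sketch of roughly how such a reference run looks with AutoGluon's
Python API (the `train.csv` path and `target` column are placeholders; this is not the exact
benchmark harness):
```python
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")  # placeholder path to a training set
predictor = TabularPredictor(label="target").fit(  # "target" is a placeholder column name
    train_data,
    presets="best_quality",
    time_limit=4 * 60 * 60,  # 4 hours, matching the reference-pipeline budget
)
```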
""" | |
ABOUT_TEXT = r"""
### Extended Overview of TabArena (References / Papers)
We introduce TabArena and provide an overview of TabArena-v0.1 in our paper: https://tabarena.ai/paper-tabular-ml-iid-study.
### Using TabArena for Benchmarking
To compare your own methods to the pre-computed results for all models on the leaderboard,
you can use the TabArena framework. For examples of how to use TabArena for benchmarking,
please see https://tabarena.ai/code-examples.
### Contributing to the Leaderboard; Contributing Models
For guidelines on how to contribute your model to TabArena, or your model's results
to the official leaderboard, please see the appendix of our paper: https://tabarena.ai/paper-tabular-ml-iid-study.
### Contributing Data
For anything related to the datasets used in TabArena, please see https://tabarena.ai/data-tabular-ml-iid-study.
---
### Leaderboard Documentation
The leaderboard is ranked by Elo and includes several other metrics. Here is a short
description of these metrics:
#### Elo
We evaluate models using the Elo rating system, following Chatbot Arena. Elo is a
pairwise comparison-based rating system in which each model's rating predicts its expected
win probability against others, with a 400-point Elo gap corresponding to a 10-to-1
(about 91\%) expected win rate. We calibrate an Elo of 1000 to the performance of our default
random forest configuration across all figures and perform 100 rounds of bootstrapping
to obtain 95\% confidence intervals. Elo scores are computed using ROC AUC for binary
classification, log-loss for multiclass classification, and RMSE for regression.
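As a rough illustration (a minimal sketch, not the exact leaderboard implementation), the Elo
model maps a rating gap to an expected win rate as follows:
```python
def elo_expected_win_rate(rating_a: float, rating_b: float) -> float:
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# A 400-point gap corresponds to roughly a 10-to-1 (about 91%) expected win rate.
print(elo_expected_win_rate(1400.0, 1000.0))  # ~0.909
```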
#### Score
Following TabRepo, we compute a normalized score to provide an additional relative
comparison. We linearly rescale the error such that the best method has a normalized
score of 1 and the median method has a normalized score of 0. Scores below zero
are clipped to zero. These scores are then averaged across datasets.
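A minimal sketch of this rescaling on a single dataset (illustrative variable names, not the
leaderboard code):
```python
from statistics import median

def normalized_scores(errors: list[float]) -> list[float]:
    # Rescale errors so the best method scores 1 and the median method scores 0;
    # scores below zero are clipped to zero before averaging across datasets.
    best, med = min(errors), median(errors)
    if med == best:  # degenerate case: the median method already matches the best
        return [1.0 for _ in errors]
    return [max(0.0, (med - e) / (med - best)) for e in errors]
```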
#### Average Rank
Ranks of methods are computed on each dataset (lower is better) and averaged.
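For example, a method's ranks could be derived from per-dataset errors roughly as follows
(a sketch with an assumed data layout; the leaderboard's exact tie handling may differ):
```python
def average_rank(per_dataset_errors: list[list[float]], method_idx: int) -> float:
    # per_dataset_errors[d][m] is the error of method m on dataset d (illustrative layout).
    ranks = []
    for errors in per_dataset_errors:
        # Rank 1 = lowest error; ties are not handled specially in this sketch.
        rank = 1 + sum(1 for e in errors if e < errors[method_idx])
        ranks.append(rank)
    return sum(ranks) / len(ranks)
```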
#### Harmonic Rank
We compute the harmonic mean of ranks across datasets. The harmonic mean of ranks,
N / (1/rank_1 + 1/rank_2 + ... + 1/rank_N), more strongly favors methods having very
low ranks on some datasets. It therefore favors methods that are sometimes very good
and sometimes very bad over methods that are always mediocre, as the former are more
likely to be useful in conjunction with other methods.
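A minimal sketch of this computation (assuming one rank per dataset):
```python
def harmonic_rank(ranks: list[float]) -> float:
    # Harmonic mean of a method's per-dataset ranks; low ranks dominate the result.
    return len(ranks) / sum(1.0 / r for r in ranks)

# A method ranked [1, 1, 20] scores ~1.46, while a consistently mediocre [7, 7, 7] scores 7.0.
```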
#### Improvability
We introduce improvability, a metric that measures by how many percent the error of the
best method is lower than that of the current method on a dataset. This is then averaged
over datasets. Formally, for a single dataset, improvability is (err - best_err)/err * 100\%,
where err is the current method's error and best_err is the best method's error on that dataset.
Improvability is always between 0\% and 100\%.
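A minimal sketch for a single dataset (illustrative variable names):
```python
def improvability(err: float, best_err: float) -> float:
    # How many percent lower the best method's error is than this method's error.
    return (err - best_err) / err * 100.0

# Example: err = 0.25, best_err = 0.20 -> 20% improvability.
```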
---
### Contact
For most inquiries, please open issues in the relevant GitHub repository or here on
HuggingFace.
For any other inquiries related to TabArena, please reach out to: [email protected]
### Core Maintainers
The current core maintainers of TabArena are:
[Nick Erickson](https://github.com/Innixma),
[Lennart Purucker](https://github.com/LennartPurucker/),
[Andrej Tschalzev](https://github.com/atschalz),
[David Holzmüller](https://github.com/dholzmueller)
"""
CITATION_BUTTON_LABEL = (
    "If you use TabArena or the leaderboard in your research, please cite the following:"
)
CITATION_BUTTON_TEXT = r"""@article{erickson2025tabarena,
  title={TabArena: A Living Benchmark for Machine Learning on Tabular Data},
  author={Nick Erickson and Lennart Purucker and Andrej Tschalzev and David Holzmüller and Prateek Mutalik Desai and David Salinas and Frank Hutter},
  year={2025},
  journal={arXiv preprint arXiv:2506.16791},
  url={https://arxiv.org/abs/2506.16791},
}
"""
VERSION_HISTORY_BUTTON_TEXT = """
**Current Version: TabArena-v0.1.1**
The following details updates to the leaderboard (date format is YYYY/MM/DD):
* 2025/06/13: Add data for all subsets and re-runs on GPU; add leaderboards for subsets;
add a new overview; add figures to the leaderboards.
* 2025/05: Initialization of the TabArena-v0.1 leaderboard.
""" | |