---
library_name: transformers
tags:
- MOE
- GPT-2
- tabular
- generative
- causalLM
pipeline_tag: tabular-regression
---
# Tabby Model Card
Tabby is a post-training architecture modification for Transformer-based large language models,
enabling their use for **tabular dataset synthesis**. This specific demo checkpoint is based on [DistilGPT-2](https://huggingface.co/distilbert/distilgpt2)
and fine-tuned on the [UCI Diabetes dataset](https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=diabetes&id=37),
using our novel Plain training method,
as an example of Tabby’s tabular synthesis capabilities.
Tabby enhances transformer-based LLMs by incorporating **Mixture of Experts (MoE) layers**,
allowing for better modeling of structured data.
🐱 **Check out our [blog](https://sprocketlab.github.io/posts/2025/02/tabby/) or [paper](https://arxiv.org/abs/2503.02152) for more details and our [GitHub repo](https://github.com/soCromp/tabby) for code to use the model!**
- **Developed by:** University of Wisconsin-Madison
- **Shared by:** Sonia Cromp et al.
- **Model type:** MoE-enhanced GPT-2-based causal language model for tabular data
- **License:** MIT
- **Finetuned from model:** [`distilgpt2`](https://huggingface.co/distilbert/distilgpt2)
## Uses
### How to Use
[This demo notebook](https://github.com/soCromp/tabby/blob/main/demo.ipynb) loads the model checkpoint provided here and uses it to perform synthesis.
To get started, follow the [environment setup instructions](https://github.com/soCromp/tabby/tree/main) in the GitHub readme.
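Tabby, like other LLM-based tabular synthesizers, generates rows as serialized text that is then parsed back into tabular form. As a minimal, self-contained sketch of that post-processing step (the `col is value` serialization format and the `parse_row` helper below are illustrative assumptions, not the repo's API; see the demo notebook for the actual workflow):

```python
def parse_row(text: str) -> dict:
    """Parse a row serialized as 'col is value, col is value, ...' into a dict.

    The serialization format here is an assumption for illustration;
    numeric values are cast to float, everything else kept as a string.
    """
    row = {}
    for field in text.split(","):
        name, _, value = field.strip().partition(" is ")
        try:
            row[name] = float(value)
        except ValueError:
            row[name] = value
    return row

# Example generated line (feature names loosely follow the UCI Diabetes schema):
sample = "plas is 148, pres is 72, mass is 33.6, class is tested_positive"
print(parse_row(sample))
```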
### Direct Use
This Tabby checkpoint can be used for:
- High-fidelity synthesis of diabetes patients based on the [UCI Diabetes dataset](https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=diabetes&id=37).
- Data augmentation for training machine learning models on the UCI Diabetes dataset.
- Comparison with other tabular synthesis approaches.
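When comparing Tabby against other tabular synthesis approaches, a quick sanity check is to compare marginal statistics of synthetic columns against the real data. A minimal sketch using only the standard library (the column name and values below are made up for illustration, not drawn from the UCI Diabetes dataset):

```python
from statistics import mean

def mean_gap(real: list[float], synth: list[float]) -> float:
    """Absolute difference in column means: a crude marginal-fidelity check."""
    return abs(mean(real) - mean(synth))

# Illustrative values for a numeric column such as 'plas'
real_plas = [148.0, 85.0, 183.0, 89.0]
synth_plas = [150.0, 90.0, 180.0, 85.0]
print(f"mean gap for 'plas': {mean_gap(real_plas, synth_plas):.2f}")
```

More rigorous comparisons (e.g., machine-learning efficacy, as used in the paper) train a downstream model on synthetic data and evaluate it on held-out real data.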
### Downstream Use
- Further fine-tuning on other structured datasets (e.g., financial records, medical records, or survey data).
- Generating synthetic tabular data for privacy-preserving machine learning.
## Bias, Risks, and Limitations
This Tabby checkpoint inherits biases from the GPT-2 architecture and the UCI Diabetes dataset used for fine-tuning.
Considerations include those common to all generative models, such as:
- Bias in synthetic data feature distributions, particularly those that may reflect real-world disparities in the dataset.
- Potential hallucinated values that deviate from real-world distributions.
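One lightweight guard against such hallucinated values is to flag synthetic entries that fall outside the range observed in the real training data. A minimal sketch (the column name and bounds are illustrative assumptions):

```python
def out_of_range(synth: list[float], real_min: float, real_max: float) -> list[float]:
    """Return synthetic values outside the range observed in the real data."""
    return [v for v in synth if not (real_min <= v <= real_max)]

# Illustrative bounds for a BMI-like column observed in [0.0, 67.1]
print(out_of_range([25.3, 70.2, -1.0, 33.6], 0.0, 67.1))
```

This catches only gross outliers; subtler distribution shifts require statistical tests over the full synthetic sample.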
## Citation
If you use Tabby, please cite:
```bibtex
@article{cromp2025tabby,
  title={Tabby: Tabular Data Synthesis with Language Models},
  author={Sonia Cromp and Satya Sai Srinath Namburi GNVV and Mohammed Alkhudhayri and Catherine Cao and Samuel Guo and Nicholas Roberts and Frederic Sala},
  journal={arXiv preprint arXiv:2503.02152},
  year={2025},
  url={https://arxiv.org/abs/2503.02152}
}
```
## Model Card Contact
For questions or collaborations, please reach out to [Sonia Cromp](https://socromp.github.io) at [email protected].