---
library_name: transformers
tags:
- MOE
- GPT-2
- tabular
- generative
- causalLM
pipeline_tag: tabular-regression
---

# Tabby Model Card

Tabby is a post-training architecture modification for Transformer-based large language models that enables **tabular dataset synthesis**. This demo checkpoint is based on [DistilGPT-2](https://huggingface.co/distilbert/distilgpt2) and fine-tuned on the [UCI Diabetes dataset](https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=diabetes&id=37) with our novel Plain training method, serving as an example of Tabby's tabular synthesis capabilities. Tabby enhances transformer-based LLMs by incorporating **Mixture of Experts (MoE) layers**, allowing for better modeling of structured data.

🐱 **Check out our [blog](https://sprocketlab.github.io/posts/2025/02/tabby/) or [paper](https://arxiv.org/abs/2503.02152) for more details, and our [GitHub repo](https://github.com/soCromp/tabby) for code to use the model!**

- **Developed by:** University of Wisconsin-Madison
- **Shared by:** Sonia Cromp et al.
- **Model type:** MoE-enhanced GPT-2-based causal language model for tabular data
- **License:** MIT
- **Finetuned from model:** [`distilgpt2`](https://huggingface.co/distilbert/distilgpt2)

## Uses

### How to Use

[This demo notebook](https://github.com/soCromp/tabby/blob/main/demo.ipynb) loads the model checkpoint provided here and uses it to perform synthesis.
To get started, follow the [environment setup instructions](https://github.com/soCromp/tabby/tree/main) in the GitHub readme.
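
For a quick look at the expected workflow, here is a minimal sketch that assumes the checkpoint can be loaded through the standard `transformers` API. The repo id and generation settings below are placeholders, and Tabby's MoE-specific loading and row-parsing utilities live in the [GitHub repo](https://github.com/soCromp/tabby) and demo notebook, which remain the recommended path.

```python
# Minimal sketch, not the official loading path: Tabby's MoE-modified layers may
# require the custom loading code from https://github.com/soCromp/tabby.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-checkpoint-repo-id>"  # placeholder: substitute this model's Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Sample one synthetic record as text; the output follows the serialization used
# during training, so see the demo notebook for how to parse it back into columns.
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```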

### Direct Use

This Tabby checkpoint can be used for:
- High-fidelity synthesis of diabetes patients based on the [UCI Diabetes dataset](https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=diabetes&id=37).
- Data augmentation for training machine learning models on the UCI Diabetes dataset.
- Comparison with other tabular synthesis approaches.

### Downstream Use

- Further fine-tuning on other structured datasets (e.g., financial records, medical records, or survey data), along the lines of the sketch below.
- Generating synthetic tabular data for privacy-preserving machine learning.
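
As an illustration of the general recipe for adapting the model to a new table, the sketch below serializes rows of a small pandas DataFrame into text strings suitable for causal-LM fine-tuning. The row-to-text format shown here is a hypothetical stand-in; the exact Plain serialization used by Tabby is defined in the GitHub repo.

```python
# Hypothetical serialization sketch: the exact row-to-text format used by Tabby's
# Plain training method is implemented in https://github.com/soCromp/tabby; this
# only illustrates the general idea of fine-tuning a causal LM on table rows.
import pandas as pd

def serialize_row(row: pd.Series) -> str:
    # One training example per table row, e.g. "age is 50, bmi is 31.6, outcome is 1"
    return ", ".join(f"{col} is {val}" for col, val in row.items())

df = pd.DataFrame({"age": [50, 31], "bmi": [31.6, 26.6], "outcome": [1, 0]})
texts = [serialize_row(row) for _, row in df.iterrows()]
print(texts[0])  # -> "age is 50, bmi is 31.6, outcome is 1"
```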

## Bias, Risks, and Limitations

This Tabby checkpoint inherits biases from the GPT-2 architecture and the UCI Diabetes dataset used for fine-tuning.
Considerations common to all generative models apply, including:
- Bias in synthetic data feature distributions, particularly where they reflect real-world disparities in the training dataset.
- Hallucinated records whose feature values do not perfectly match real-world distributions.

## Citation

If you use Tabby, please cite:

```bibtex
@article{cromp2025tabby,
  title={Tabby: Tabular Data Synthesis with Language Models},
  author={Sonia Cromp and Satya Sai Srinath Namburi GNVV and Mohammed Alkhudhayri and Catherine Cao and Samuel Guo and Nicholas Roberts and Frederic Sala},
  journal={arXiv preprint arXiv:2503.02152},
  year={2025},
  url={https://arxiv.org/abs/2503.02152}
}
```

## Model Card Contact

For questions or collaborations, please reach out to [Sonia Cromp](https://socromp.github.io) at [[email protected]].