---
library_name: transformers
tags:
- MOE
- GPT-2
- tabular
- generative
- causalLM
pipeline_tag: tabular-regression
---

# Tabby Model Card

Tabby is a post-training architecture modification for Transformer-based large language models, enabling their use for **tabular dataset synthesis**. This demo checkpoint is based on [DistilGPT-2](https://huggingface.co/distilbert/distilgpt2) and fine-tuned on the [UCI Diabetes dataset](https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=diabetes&id=37) using our novel Plain training method, as an example of Tabby’s tabular synthesis capabilities.

Tabby enhances transformer-based LLMs by incorporating **Mixture of Experts (MoE) layers**, allowing them to better model structured data.

🐱 **Check out our [blog](https://sprocketlab.github.io/posts/2025/02/tabby/) or [paper](https://arxiv.org/abs/2503.02152) for more details and our [GitHub repo](https://github.com/soCromp/tabby) for code to use the model!**

- **Developed by:** University of Wisconsin-Madison
- **Shared by:** Sonia Cromp et al.
- **Model type:** MoE-enhanced GPT-2-based causal language model for tabular data
- **License:** MIT
- **Finetuned from model:** [`distilgpt2`](https://huggingface.co/distilbert/distilgpt2)

## Uses

### How to Use

[This demo notebook](https://github.com/soCromp/tabby/blob/main/demo.ipynb) loads the model checkpoint provided here and uses it to perform synthesis. To get started, follow the [environment setup instructions](https://github.com/soCromp/tabby/tree/main) in the GitHub readme. Illustrative (non-authoritative) code sketches also appear under "Example Sketches" at the end of this card.

### Direct Use

This Tabby checkpoint can be used for:

- High-fidelity synthesis of diabetes patient records based on the [UCI Diabetes dataset](https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=diabetes&id=37).
- Data augmentation for training machine learning models on the UCI Diabetes dataset.
- Comparison with other tabular synthesis approaches.

### Downstream Use

- Further fine-tuning on other structured datasets (e.g., financial records, medical records, or survey data); see the serialization sketch at the end of this card.
- Generating synthetic tabular data for privacy-preserving machine learning.

## Bias, Risks, and Limitations

This Tabby checkpoint inherits biases from the GPT-2 architecture and the UCI Diabetes dataset used for fine-tuning. Considerations include those common to all generative models, such as:

- Biases in synthetic feature distributions, which may reflect real-world disparities present in the dataset.
- Hallucinated records whose values do not perfectly match real-world distributions.

## Citation

If you use Tabby, please cite:

```bibtex
@article{cromp2025tabby,
  title={Tabby: Tabular Data Synthesis with Language Models},
  author={Sonia Cromp and Satya Sai Srinath Namburi GNVV and Mohammed Alkhudhayri and Catherine Cao and Samuel Guo and Nicholas Roberts and Frederic Sala},
  journal={arXiv preprint arXiv:2503.02152},
  year={2025},
  url={https://arxiv.org/abs/2503.02152}
}
```

## Model Card Contact

For questions or collaborations, please reach out to [Sonia Cromp](https://socromp.github.io) at [cromp@wisc.edu](mailto:cromp@wisc.edu).
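
## Example Sketches

The snippets below are illustrative only; the [demo notebook](https://github.com/soCromp/tabby/blob/main/demo.ipynb) and the GitHub repo are the authoritative references.

### Loading and sampling (sketch)

A minimal sketch of drawing synthetic rows from this checkpoint via the Hugging Face `transformers` API. The repository id, decoding settings, and token budget below are assumptions, not taken from this card; because Tabby replaces layers with MoE variants, loading may additionally require the custom code from the GitHub repo rather than a plain `from_pretrained` call.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "soCromp/tabby-distilgpt2-diabetes"  # hypothetical repo id, for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Tabby adds MoE layers post hoc, so the checkpoint may need custom loading
# code from the GitHub repo; trust_remote_code is one common mechanism (assumption).
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

# Start generation from the BOS token and let the model emit one serialized row.
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,   # rough budget for one serialized row (assumption)
    do_sample=True,      # sample rather than greedy-decode, for diverse rows
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The decoded string is one serialized table row; the exact column order and separators are fixed by the training code in the repo, so parse it with the utilities provided there rather than ad-hoc string splitting.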
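
### Serializing rows for downstream fine-tuning (sketch)

If you fine-tune on another structured dataset (see Downstream Use), each table row must first be rendered as text. The toy columns and the "name is value" format below are hypothetical stand-ins; use whatever serialization the Tabby training code actually implements.

```python
import pandas as pd

# Hypothetical toy table standing in for your own structured dataset.
df = pd.DataFrame({
    "age": [50, 31],
    "bmi": [33.6, 26.6],
    "outcome": ["positive", "negative"],
})

def serialize_row(row: pd.Series) -> str:
    """Render one table row as a single training string (illustrative format)."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())

training_texts = [serialize_row(row) for _, row in df.iterrows()]
print(training_texts[0])  # -> "age is 50, bmi is 33.6, outcome is positive"
```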