---
library_name: transformers
tags:
- MOE
- GPT-2
- tabular
- generative
- causalLM
pipeline_tag: tabular-regression
---

# Tabby Model Card

Tabby is a post-training architecture modification for Transformer-based large language models that enables them to perform **tabular dataset synthesis**. 
This demo checkpoint is based on [DistilGPT-2](https://huggingface.co/distilbert/distilgpt2) 
and fine-tuned on the [UCI Diabetes dataset](https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=diabetes&id=37) 
with our novel Plain training method, as an example of Tabby’s tabular synthesis capabilities. 
Tabby works by incorporating **Mixture of Experts (MoE) layers** into the base LLM, allowing it to better model structured data. 
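
For intuition, the sketch below shows one way an MoE block can replace a Transformer feed-forward layer so that each table column is handled by its own expert. It is an illustrative assumption only (the class name, sizes, and per-column routing rule are made up for exposition); see the [paper](https://arxiv.org/abs/2503.02152) and [GitHub repo](https://github.com/soCromp/tabby) for the actual design.

```python
# Illustrative sketch only: names, sizes, and the per-column routing rule are
# assumptions for exposition, not the released Tabby implementation.
import torch
import torch.nn as nn

class ColumnMoEMLP(nn.Module):
    """Feed-forward MoE block with one expert MLP per table column."""

    def __init__(self, hidden_size: int, num_columns: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_columns)
        )

    def forward(self, hidden_states: torch.Tensor, column_ids: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden); column_ids: (batch, seq), mapping
        # each token to the table column it belongs to.
        out = torch.zeros_like(hidden_states)
        for idx, expert in enumerate(self.experts):
            mask = column_ids == idx
            if mask.any():
                out[mask] = expert(hidden_states[mask])
        return out
```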

🐱 **Check out our [blog](https://sprocketlab.github.io/posts/2025/02/tabby/) or [paper](https://arxiv.org/abs/2503.02152) for more details and our [GitHub repo](https://github.com/soCromp/tabby) for code to use the model!**


- **Developed by:** University of Wisconsin-Madison
- **Shared by:** Sonia Cromp et al.
- **Model type:** MoE-enhanced GPT-2-based causal language model for tabular data
- **License:** MIT
- **Finetuned from model:** [`distilgpt2`](https://huggingface.co/distilbert/distilgpt2)

## Uses

### How to Use
[This demo notebook](https://github.com/soCromp/tabby/blob/main/demo.ipynb) loads the model checkpoint provided here and uses it to perform synthesis. 
To get started, follow the [environment setup instructions](https://github.com/soCromp/tabby/tree/main) in the GitHub readme.
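
Below is a minimal sketch of sampling synthetic rows with 🤗 Transformers, assuming the checkpoint can be loaded with `AutoModelForCausalLM`. Because Tabby modifies the architecture with MoE layers, the custom model class from the GitHub repo (as used in the demo notebook) may be required instead; the model id below is a placeholder.

```python
# Minimal sketch, assuming a plain causal-LM load works for this checkpoint;
# the demo notebook and the Tabby repo remain the authoritative reference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/this-checkpoint"  # placeholder: this repo's Hub id or a local path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Sample one synthetic row by generating from the BOS token.
input_ids = torch.tensor([[tokenizer.bos_token_id]])
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```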

### Direct Use

This Tabby checkpoint can be used for:
- High-fidelity synthesis of diabetes patients based on the [UCI Diabetes dataset](https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=diabetes&id=37).
- Data augmentation for training machine learning models on the UCI Diabetes dataset.
- Comparison with other tabular synthesis approaches.

### Downstream Use

- Further fine-tuning on other structured datasets (e.g., financial records, medical records, or survey data); a generic fine-tuning sketch follows this list.
- Generating synthetic tabular data for privacy-preserving machine learning.
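
As a starting point, the sketch below serializes each row of a new table to text and fine-tunes a causal LM with the ordinary language-modeling objective. This is a generic recipe, not the Plain training method from the paper; the toy table, column names, and hyperparameters are illustrative assumptions, and the GitHub repo contains the actual training code.

```python
# Generic sketch: serialize rows to text and fine-tune with a causal-LM objective.
# Not Tabby's Plain recipe; table contents and hyperparameters are illustrative.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

df = pd.DataFrame({"age": [63, 41], "income": [52000, 38000]})  # toy table
texts = [", ".join(f"{col} is {val}" for col, val in row.items())
         for _, row in df.iterrows()]

base = "distilgpt2"  # or this Tabby checkpoint, loaded via the repo's custom class
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tabby-finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```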

## Bias, Risks, and Limitations

This Tabby checkpoint inherits biases from the DistilGPT-2 base model and from the UCI Diabetes dataset used for fine-tuning. 
Considerations include those common to all generative models, such as:
- Bias in synthetic feature distributions, particularly where the training data reflects real-world disparities.
- Hallucinated records or values that do not match real-world distributions.

## Citation

If you use Tabby, please cite:

```bibtex
@article{cromp2025tabby,
  title={Tabby: Tabular Data Synthesis with Language Models},
  author={Sonia Cromp and Satya Sai Srinath Namburi GNVV and Mohammed Alkhudhayri and Catherine Cao and Samuel Guo and Nicholas Roberts and Frederic Sala},
  journal={arXiv preprint arXiv:2503.02152},
  year={2025},
  url={https://arxiv.org/abs/2503.02152}
}
```

## Model Card Contact

For questions or collaborations, please reach out to [Sonia Cromp](https://socromp.github.io) at [[email protected]].