---
library_name: transformers
tags:
- MOE
- GPT-2
- tabular
- generative
- causalLM
pipeline_tag: tabular-regression
---

# Tabby Model Card

Tabby is a post-training architecture modification for Transformer-based large language models that enables **tabular dataset synthesis**. This demo checkpoint is based on [DistilGPT-2](https://huggingface.co/distilbert/distilgpt2) and fine-tuned on the [UCI Diabetes dataset](https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=diabetes&id=37) with our novel Plain training method, serving as an example of Tabby's tabular synthesis capabilities. Tabby enhances transformer-based LLMs by incorporating **Mixture of Experts (MoE) layers**, allowing for better modeling of structured data.

🐱 **Check out our [blog](https://sprocketlab.github.io/posts/2025/02/tabby/) or [paper](https://arxiv.org/abs/2503.02152) for more details, and our [GitHub repo](https://github.com/soCromp/tabby) for code to use the model!**

- **Developed by:** University of Wisconsin-Madison
- **Shared by:** Sonia Cromp et al.
- **Model type:** MoE-enhanced GPT-2-based causal language model for tabular data
- **License:** MIT
- **Finetuned from model:** [`distilgpt2`](https://huggingface.co/distilbert/distilgpt2)

## Uses

### How to Use

[This demo notebook](https://github.com/soCromp/tabby/blob/main/demo.ipynb) loads the model checkpoint provided here and uses it to perform synthesis.
To get started, follow the [environment setup instructions](https://github.com/soCromp/tabby/tree/main) in the GitHub readme.
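
For a quick look at the expected workflow, here is a minimal sketch that assumes the checkpoint can be loaded through the standard `transformers` API. The repo id and generation settings below are placeholders, and Tabby's MoE-specific loading and row-parsing utilities live in the [GitHub repo](https://github.com/soCromp/tabby) and demo notebook, which remain the recommended path.

```python
# Minimal sketch, not the official loading path: Tabby's MoE-modified layers may
# require the custom loading code from https://github.com/soCromp/tabby.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-checkpoint-repo-id>"  # placeholder: substitute this model's Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Sample one synthetic record as text; the output follows the serialization used
# during training, so see the demo notebook for how to parse it back into columns.
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```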

### Direct Use

This Tabby checkpoint can be used for:
- High-fidelity synthesis of diabetes patients based on the [UCI Diabetes dataset](https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=diabetes&id=37).
- Data augmentation for training machine learning models on the UCI Diabetes dataset.
- Comparison with other tabular synthesis approaches.

### Downstream Use

- Further fine-tuning on other structured datasets (e.g., financial records, medical records, or survey data), along the lines of the sketch below.
- Generating synthetic tabular data for privacy-preserving machine learning.
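
As an illustration of the general recipe for adapting the model to a new table, the sketch below serializes rows of a small pandas DataFrame into text strings suitable for causal-LM fine-tuning. The row-to-text format shown here is a hypothetical stand-in; the exact Plain serialization used by Tabby is defined in the GitHub repo.

```python
# Hypothetical serialization sketch: the exact row-to-text format used by Tabby's
# Plain training method is implemented in https://github.com/soCromp/tabby; this
# only illustrates the general idea of fine-tuning a causal LM on table rows.
import pandas as pd

def serialize_row(row: pd.Series) -> str:
    # One training example per table row, e.g. "age is 50, bmi is 31.6, outcome is 1"
    return ", ".join(f"{col} is {val}" for col, val in row.items())

df = pd.DataFrame({"age": [50, 31], "bmi": [31.6, 26.6], "outcome": [1, 0]})
texts = [serialize_row(row) for _, row in df.iterrows()]
print(texts[0])  # -> "age is 50, bmi is 31.6, outcome is 1"
```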

## Bias, Risks, and Limitations

This Tabby checkpoint inherits biases from the GPT-2 architecture and the UCI Diabetes dataset used for fine-tuning.
Considerations common to all generative models apply, including:
- Bias in synthetic data feature distributions, particularly where they reflect real-world disparities in the training dataset.
- Hallucinated records whose feature values do not perfectly match real-world distributions.

## Citation

If you use Tabby, please cite:

```bibtex
@article{cromp2025tabby,
  title={Tabby: Tabular Data Synthesis with Language Models},
  author={Sonia Cromp and Satya Sai Srinath Namburi GNVV and Mohammed Alkhudhayri and Catherine Cao and Samuel Guo and Nicholas Roberts and Frederic Sala},
  journal={arXiv preprint arXiv:2503.02152},
  year={2025},
  url={https://arxiv.org/abs/2503.02152}
}
```

## Model Card Contact

For questions or collaborations, please reach out to [Sonia Cromp](https://socromp.github.io) at [[email protected]].