Tabby: Tabular Data Synthesis with Language Models
Abstract
While advances in large language models (LLMs) have greatly improved the quality of synthetic text data in recent years, synthesizing tabular data has received relatively less attention. We address this disparity with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture, enabling its use for tabular dataset synthesis. Tabby enables the representation of differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby results in data quality near or equal to that of real data. By pairing our novel LLM table training technique, Plain, with Tabby, we observe up to a 44% improvement in quality over previous methods. We also show that Tabby extends beyond tables to more general structured data, reaching parity with real data on a nested JSON dataset as well.
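The Gated Mixture-of-Experts mechanism with column-specific parameters can be illustrated with a short sketch. The code below is a hypothetical, minimal illustration rather than the authors' implementation: the module name ColumnGatedMoE, the per-column feed-forward experts, and the soft gating are assumptions; the paper and repository specify which Transformer blocks Tabby actually replaces and how the gate is computed.

```python
import torch
import torch.nn as nn

class ColumnGatedMoE(nn.Module):
    """Hypothetical sketch: a Gated Mixture-of-Experts feed-forward block
    with one expert (its own set of parameters) per table column."""

    def __init__(self, hidden_dim: int, num_columns: int, ffn_dim: int = None):
        super().__init__()
        ffn_dim = ffn_dim or 4 * hidden_dim
        # One MLP "expert" per column, mirroring a Transformer feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_columns)
        )
        # Gate produces per-token mixture weights over the column experts.
        self.gate = nn.Linear(hidden_dim, num_columns)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        weights = torch.softmax(self.gate(hidden_states), dim=-1)       # (B, T, C)
        expert_out = torch.stack(
            [expert(hidden_states) for expert in self.experts], dim=-2  # (B, T, C, H)
        )
        # Weighted sum of the column experts' outputs.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)
```

In a post-training modification of this kind, such a block would be swapped into a pre-trained LLM (e.g. DistilGPT-2) and fine-tuned on serialized rows, so that each column's tokens are shaped by that column's own parameters.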
Community
Tabby is an architecture modification to pre-trained LLMs, enabling their use for tabular dataset synthesis. Our evaluations indicate that Tabby reaches state-of-the-art synthesis quality and even reaches parity with real, non-synthetic data on 3 of 6 datasets. Please enjoy!
Blog: https://sprocketlab.github.io/posts/2025/02/tabby/
Paper: https://arxiv.org/abs/2503.02152
HuggingFace checkpoint: https://huggingface.co/sonicc/tabby-distilgpt2-diabetes
GitHub (with demo notebook): https://github.com/soCromp/tabby
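For convenience, here is a hedged sketch of one way to try the released checkpoint with the transformers library. Whether the checkpoint loads through this standard path, and which row-serialization prompt it expects, are assumptions; the demo notebook in the GitHub repo above is the authoritative reference.

```python
# Hypothetical usage sketch -- the checkpoint may instead require the custom
# classes from https://github.com/soCromp/tabby; see the demo notebook there.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sonicc/tabby-distilgpt2-diabetes"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Sample synthetic rows from a BOS-only prompt; the serialization format the
# model was trained on (e.g. the paper's Plain scheme) is assumed here.
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt")
samples = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(samples[0], skip_special_tokens=True))
```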
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- TabGen-ICL: Residual-Aware In-Context Example Selection for Tabular Data Generation (2025)
- TabularARGN: A Flexible and Efficient Auto-Regressive Framework for Generating High-Fidelity Synthetic Data (2025)
- Synthetic Tabular Data Detection In the Wild (2025)
- Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes (2025)
- LLM Embeddings for Deep Learning on Tabular Data (2025)
- TabGLM: Tabular Graph Language Model for Learning Transferable Representations Through Multi-Modal Consistency Minimization (2025)
- Structural Deep Encoding for Table Question Answering (2025)