Commit 012c18c by vitaliykinakh (verified) · 1 parent: 2de08f8

Update README.md

Files changed (1): README.md +53 -3
@@ -1,3 +1,53 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ datasets:
+ - demo-org/diabetes
+ - scikit-learn/adult-census-income
+ - leostelon/california-housing
+ - vitaliykinakh/heloc
+ - vitaliykinakh/sick
+ - vitaliykinakh/travel
+ metrics:
+ - accuracy
+ ---
+
+ This repository contains the official models from the paper "[Tabular Data Generation using Binary Diffusion](https://arxiv.org/abs/2409.13882)",
+ accepted to the [3rd Table Representation Learning Workshop @ NeurIPS 2024](https://table-representation-learning.github.io/).
+
+ # Abstract
+
+ Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive.
+ Traditional generative models often face challenges due to the unique characteristics of tabular data, such as mixed
+ data types and varied distributions, and require complex preprocessing or large pretrained models. In this paper, we
+ introduce a novel, lossless binary transformation method that converts any tabular data into fixed-size binary
+ representations, and a corresponding new generative model called Binary Diffusion, specifically designed for binary
+ data. Binary Diffusion leverages the simplicity of XOR operations for noise addition and removal and employs binary
+ cross-entropy loss for training. Our approach eliminates the need for extensive preprocessing, complex noise parameter
+ tuning, and pretraining on large datasets. We evaluate our model on several popular tabular benchmark datasets,
+ demonstrating that Binary Diffusion outperforms existing state-of-the-art models on Travel, Adult Income, and Diabetes
+ datasets while being significantly smaller in size.
+
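The XOR-based noising described in the abstract can be illustrated with a short sketch. This is not the authors' implementation and is not part of the commit: the helper names `to_bits`, `from_bits`, and `xor_noise`, the fixed 8-bit width, the toy column, and the flip probability are all assumptions chosen for illustration. It only shows that XOR-ing a binary representation with a random bit mask adds noise, and that XOR-ing with the same mask again restores the original bits exactly.

```python
import numpy as np

def to_bits(values):
    """Encode integers in [0, 255] as rows of 8 bits, most significant bit first."""
    column = np.asarray(values, dtype=np.uint8)[:, None]
    return np.unpackbits(column, axis=1)

def from_bits(bits):
    """Invert to_bits: pack each row of 8 bits back into one integer."""
    return np.packbits(bits, axis=1)[:, 0]

def xor_noise(bits, flip_prob, rng):
    """Sample a random bit mask and XOR it onto the data; returns (noisy bits, mask)."""
    mask = (rng.random(bits.shape) < flip_prob).astype(np.uint8)
    return np.bitwise_xor(bits, mask), mask

rng = np.random.default_rng(0)
ages = [23, 51, 37]                      # toy column, made up for this example
x0 = to_bits(ages)                       # fixed-size binary representation, shape (3, 8)
xt, mask = xor_noise(x0, flip_prob=0.3, rng=rng)
recovered = np.bitwise_xor(xt, mask)     # XOR with the same mask undoes the noise
restored_ages = from_bits(recovered).tolist()
```

The sketch covers noise addition and removal with a known mask only. According to the abstract, the actual model is trained with binary cross-entropy loss; that training step is not shown here.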
+ # Results
+
+ The table below presents the **Binary Diffusion** results across various datasets and models. Performance metrics are shown as **mean ± standard deviation**.
+
+ | **Dataset** | **LR (Binary Diffusion)** | **DT (Binary Diffusion)** | **RF (Binary Diffusion)** | **Params** |
+ |-------------------------|---------------------------|---------------------------|---------------------------|------------|
+ | **Travel** | **83.79 ± 0.08** | **88.90 ± 0.57** | **89.95 ± 0.44** | **1.1M** |
+ | **Sick** | 96.14 ± 0.63 | **97.07 ± 0.24** | 96.59 ± 0.55 | **1.4M** |
+ | **HELOC** | 71.76 ± 0.30 | 70.25 ± 0.43 | 70.47 ± 0.32 | **2.6M** |
+ | **Adult Income** | **85.45 ± 0.11** | **85.27 ± 0.11** | **85.74 ± 0.11** | **1.4M** |
+ | **Diabetes** | **57.75 ± 0.04** | **57.13 ± 0.15** | 57.52 ± 0.12 | **1.8M** |
+ | **California Housing** | *0.55 ± 0.00* | 0.45 ± 0.00 | 0.39 ± 0.00 | **1.5M** |
+
+ ---
+
+ # Citation
+ ```
+ @article{kinakh2024tabular,
+ title={Tabular Data Generation using Binary Diffusion},
+ author={Kinakh, Vitaliy and Voloshynovskiy, Slava},
+ journal={arXiv preprint arXiv:2409.13882},
+ year={2024}
+ }
+ ```