CTAB-GAN: Effective Table Data Synthesizing

Figure 1: Synthetic Tabular Data Generation via CTAB-GAN

1. Introduction

2. Motivation

Figure 2: Challenges of modeling industrial datasets using existing GAN-based table generators: (a) mixed type, (b) long tail distribution, and (c) skewed data
  • Tabular data contains mixed variables, which have both a continuous and a discrete component. Similarly, missing values embedded in a continuous variable can be regarded as the categorical component of a mixed variable.
  • Continuous variables often exhibit heavy long-tailed distributions, which are difficult to model and reproduce authentically.
  • Continuous variables can contain multiple modes with skewed frequencies, which makes modelling even harder.
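
To make the idea of a mixed variable concrete, here is a minimal sketch in Python. It is only an illustration and not CTAB-GAN's actual encoder: the function name `split_mixed_column`, the `special_values` parameter, and the sample numbers are all assumptions made for this example. It separates a column into a categorical label per entry (special value, missing, or continuous) and the remaining continuous values.

```python
import numpy as np

def split_mixed_column(values, special_values=(0.0,)):
    """Split a mixed column into categorical labels and its continuous part.

    Hypothetical helper for illustration only, not the CTAB-GAN encoder.
    """
    values = np.asarray(values, dtype=float)
    # Start by assuming every entry belongs to the continuous component.
    labels = np.full(values.shape, "continuous", dtype=object)
    # Missing values form their own category.
    labels[np.isnan(values)] = "missing"
    # Each special value (e.g. an exact 0) also forms its own category.
    for v in special_values:
        labels[values == v] = f"special:{v}"
    # Whatever is left is modelled as a continuous variable.
    continuous_part = values[labels == "continuous"]
    return labels, continuous_part

# Made-up "Mortgage"-like column: many exact zeros, one missing entry.
mortgage = [0.0, 0.0, 155.0, float("nan"), 98.5, 0.0]
labels, continuous_part = split_mixed_column(mortgage)
print(list(labels))
print(continuous_part)
```

A generator that sees the zeros and the missing entries as categories can emit them exactly, instead of approximating them with small continuous values.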

3. Contribution

  • A novel conditional adversarial network that adds a classifier, providing additional supervision to improve the utility of the synthetic data for ML applications.
  • Efficient modelling of continuous, categorical, and mixed variables via a novel data encoding and conditional vector.
  • Light-weight data pre-processing that mitigates the impact of long-tailed distributions in continuous variables using a simple log transform.
  • An effective data synthesizer for the relevant stakeholders.
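
The log-transform pre-processing can be sketched roughly as follows. This is a minimal sketch under stated assumptions rather than the exact transform used by CTAB-GAN: the offset `EPS`, the function names `log_encode` and `log_decode`, and the sample values are hypothetical. The idea is to compress the long tail before training and to invert the transform on the generated data.

```python
import numpy as np

EPS = 1.0  # hypothetical offset so the log argument stays positive

def log_encode(x, lower):
    """Compress a long-tailed column; `lower` is the column's minimum value."""
    x = np.asarray(x, dtype=float)
    if lower > 0:
        return np.log(x)
    # Shift the column so its smallest value maps to log(EPS).
    return np.log(x - lower + EPS)

def log_decode(z, lower):
    """Invert log_encode on real or generated values."""
    z = np.asarray(z, dtype=float)
    if lower > 0:
        return np.exp(z)
    return np.exp(z) + lower - EPS

# Made-up long-tailed "Amount"-like column.
amount = np.array([0.0, 1.5, 20.0, 350.0, 25000.0])
lower = amount.min()
encoded = log_encode(amount, lower)
restored = log_decode(encoded, lower)
print(encoded)
print(restored)
```

After encoding, the largest value sits only about ten units from the smallest instead of 25000, so the rare tail values no longer dominate the range the model has to cover.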

4. Results

Figure 3: Result of modeling industrial dataset using CTAB-GAN: (a) mixed type, (b) long tail distribution, and (c) skewed data
  • Mixed variables: Figure 3(a) compares the real and CTAB-GAN-generated data for the variable “Mortgage” in the Loan dataset. CTAB-GAN encodes this variable as a mixed type. We can see that CTAB-GAN generates clear 0 values, unlike existing state-of-the-art techniques.
  • Long tail continuous variables: Figure 3(b) compares the cumulative frequency graph for the “Amount” variable in the Credit dataset. This variable has a typical long tail distribution. One can see that CTAB-GAN closely recovers the real distribution. Thanks to the log-transform pre-processing, CTAB-GAN learns this structure significantly better than the state-of-the-art methods.
  • Skewed multi-mode continuous variables: Figure 3(c) compares the frequency distribution for the continuous variable “Hours-per-week” from the Adult dataset. Apart from the dominant peak at 40, there are many side peaks, which make synthesizing this column extremely difficult. However, CTAB-GAN recovers the skewed multi-modal distribution better than existing methods, owing to its novel construction of the conditional vector, which is designed to make the generation process more robust to such distributions.
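
One way a conditional vector can make generation more robust to skewed modes is to pick the conditioning mode with probability proportional to the log of its frequency, so that rare modes are shown to the generator more often. The sketch below illustrates that idea only. It assumes log-frequency sampling, and the function names and mode counts are hypothetical, so it should not be read as CTAB-GAN's exact construction.

```python
import numpy as np

def log_frequency_probs(counts):
    """Turn raw mode counts into sampling probabilities based on log-frequency."""
    counts = np.asarray(counts, dtype=float)
    log_counts = np.log(counts + 1.0)  # +1 keeps empty modes well defined
    return log_counts / log_counts.sum()

def sample_condition(counts, rng):
    """Sample one mode and return it as a one-hot conditional vector."""
    probs = log_frequency_probs(counts)
    idx = rng.choice(len(probs), p=probs)
    one_hot = np.zeros(len(probs))
    one_hot[idx] = 1.0
    return one_hot

# Made-up counts: one dominant mode (like the peak at 40) and three side peaks.
mode_counts = [9000, 600, 300, 100]
raw_probs = np.array(mode_counts) / sum(mode_counts)
log_probs = log_frequency_probs(mode_counts)
rng = np.random.default_rng(0)
cond = sample_condition(mode_counts, rng)
print(raw_probs)
print(log_probs)
print(cond)
```

With raw frequencies the smallest mode would be chosen 1% of the time, while log-frequency sampling gives it a far larger share, so the generator gets many more chances to learn the side peaks.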

5. Conclusion

I am a researcher at Generatrix, an AI-based privacy-preserving data synthesizing platform. I have an avid passion for new and emerging technologies in AI & ML.

Aditya Kunar