Review of GANs for tabular data

Original article can be found here (source): Deep Learning on Medium

We well know GANs for success in the realistic image generation. However, they can be applied in tabular data generation. We will review and examine some recent papers about tabular GANs in action. For final results and source code, you can go to the Github repository.

Photo by Nate Grant on Unsplash

What is GAN

“GAN composes of two deep networks: the generator and the discriminator” [1]. Both of them simultaneously trained. Generally, the model structure and training process presented this way:

GAN training pipeline. By Jonathan Hui — What is Generative Adversarial Networks GAN? [1]

The task for the generator is to generate samples, which won’t be distinguished from real samples by the discriminator. I won’t give much detail here, but if you would like to dive into them, you can read the medium post and the original paper by Ian J. Goodfellow.

Recent architectures such as StyleGAN 2 can produce outstanding photo-realistic images.

Hand-picked examples of human faces generated by StyleGAN 2, Source arXiv:1912.04958v2 [7]

Problems

While face generation seems to be not a problem anymore, there are plenty of issues we need to resolve:

  • Training speed. For training StyleGAN 2 you need 1 week and DGX-1 (8x NVIDIA Tesla V100).
  • Image quality in specific domains. The state-of-the-art network still fails on other tasks.
Hand-picked examples of cars and cats generated by StyleGAN 2, Source arXiv:1912.04958v2 [7]

Tabular GANs

Even cats and dogs generation seem heavy tasks for GANs because of not trivial data distribution and high object type variety. Besides such domains, the image background becomes important, which GANs usually fail to generate.
Therefore, I’ve been wondering what GANs can achieve in tabular data. Unfortunately, there aren’t many articles. The next two articles appear to be the most promising.

First, they raise several problems, why generating tabular data has own challenges:

  • the various data types (int, decimals, categories, time, text)
  • different shapes of distribution ( multi-modal, long tail, Non-Gaussian…)
  • sparse one-hot-encoded vectors and highly imbalanced categorical columns.

Task formalizing

Let us say table T contains n_c continuous variables and n_d discrete(categorical) variables, and each row is C vector. These variables have an unknown joint distribution P. Each row is independently sampled from P. The object is to train a generative model M. M should generate new a synthetic table T_synth with the distribution similar to P. A machine learning model learned on T_synth should achieve a similar accuracy on a real test table T_test, as would a model trained on T.

Preprocessing numerical variables. “Neural networks can effectively generate values with a distribution centered over (−1, 1) using tanh [3]. However, they show that nets fail to generate suitable data with multi-modal data. Thus they cluster a numerical variable by using and training a Gaussian Mixture Model (GMM) with m (m=5) components for each of C.

Normalizing using GMM using mean and standard deviation. Source arXiv:1811.11264v1 [3]

Finally, GMM is used to normalize C to get V. Besides, they compute the probability of C coming from each of the m Gaussian distribution as a vector U.

Preprocessing categorical variables. Due to usually low cardinality, they found the probability distribution can be generated directly using softmax. But it necessary to convert categorical variables to one-hot-encoding representation with noise to binary variables

After prepossessing, they convert T with n_c + n_d columns to V, U, D vectors. This vector is the output of the generator and the input for the discriminator in GAN. “GAN does not have access to GMM parameters” [3].

Generator

They generate a numerical variable in 2 steps. First, generate the value scalar V, then generate the cluster vector U eventually applying tanh. Categorical features generated as a probability distribution over all possible labels with softmax. To generate the desired row LSTM with attention mechanism is used. Input for LSTM in each step is random variable z, weighted context vector with previous hidden and embedding vector.

Discriminator

Multi-Layer Perceptron (MLP) with LeakyReLU and BatchNorm is used. The first layer used concatenated vectors (V, U, D) among with mini-batch diversity with feature vector from LSTM. The loss function is the KL divergence term of input variables with the sum ordinal log loss function.

Example of using TGAN to generate a simple census table. The generator generates T features one be one. The discriminator concatenates all features together. Then it uses Multi-Layer Perceptron (MLP) with LeakyReLU to distinguish real and fake data. Source arXiv:1811.11264v1 [3]

Results

Accuracy of machine learning models trained on the real and synthetic training set. (BN — Bayesian networks, Gaussian Copula). Source arXiv:1811.11264v1 [3]

They evaluate the model on two datasets KDD99 and covertype. For some reason, they used weak models without boosting (xgboost, etc). Anyway, TGAN performs reasonably well and robust, outperforming bayesian networks. The average performance gap between real data and synthetic data is 5.7%.

The key improvements over previous TGAN are applying the mode-specific normalization to overcome the non-Gaussian and multimodal distribution. Then a conditional generator and training-by-sampling to deal with the imbalanced discrete columns.

Task formalizing

The initial data remains the same as it was in TGAN. However, they solve different problems.

  • Likelihood of fitness. Do columns in T_syn follow the same joint distribution as T_train
  • Machine learning efficacy. When training model to predict one column using other columns as features, can such model learned from T_syn achieve similar performance on T_test, as a model learned on T_train

Preprocessing

Preprocessing for discrete columns keeps the same.

For continuous variables, a variational Gaussian mixture model (VGM) is used. It first estimates the number of modes m and then fits a Gaussian mixture. After we normalize initial vector C almost the same as it was in TGAN, but the value is normalized within each mode. Mode is represented as one-hot vector betta ([0, 0, .., 1, 0]). Alpha is the normalized value of C.

An example of mode-specific normalization. Source arXiv:1907.00503v2 [4]

As a result, we get our initial row represented as the concatenation of one-hot’ ed discrete columns with representation discussed above of continues variables:

Preprocessed row. Source arXiv:1907.00503v2 [4]

Training

“The final solution consists of three key elements, namely: the conditional vector, the generator loss, and the training-by-sampling method” [4].

CTGAN model. The conditional generator can generate synthetic rows conditioned on one of the discrete columns. With training-by-sampling, the cond and training data are sampled according to the log-frequency of each category, thus CTGAN can evenly explore all possible discrete values. Source arXiv:1907.00503v2 [4]

Conditional vector

Represents concatenated one-hot vectors of all discrete columns but with the specification of only one category, which was selected. “For instance, for two discrete columns, D1 = {1, 2, 3} and D2 = {1, 2}, the condition (D2 = 1) is expressed by the mask vectors m1 = [0, 0, 0] and m2 = [1, 0]; so cond = [0, 0, 0, 1, 0]” [4].

Generator loss

“During training, the conditional generator is free to produce any set of one-hot discrete vectors” [4]. But they enforce the conditional generator to produce d_i (generated discrete one-hot column)= m_i (mask vector) is to penalize its loss by adding the cross-entropy between them, averaged over all the instances of the batch.

Training-by-sampling

“Specifically, the goal is to resample efficiently in a way that all the categories from discrete attributes are sampled evenly during the training process, as a result, to get real data distribution during the test” [4].

In another word, the output produced by the conditional generator must be assessed by the critic, which estimates the distance between the learned conditional distribution P_G(row|cond) and the conditional distribution on real data P(row|cond). “The sampling of real training data and the construction of cond vector should comply to help critics estimate the distance” [4]. Properly sample the cond vector and training data can help the model evenly explore all possible values in discrete columns.

The model structure is given below, as opposite to TGAN, there is no LSTM layer. Trained with WGAN loss with gradient penalty.

Generator. Source arXiv:1907.00503v2 [4]
Discriminator. Source arXiv:1907.00503v2 [4]

Also, they propose a model based on Variational autoencoder (VAE), but it out of the scope of this article.

Results

Proposed network CTGAN and TVAE outperform other methods. As they say, TVAE outperforms CTGAN in several cases, but GANs do have several favorable attributes. The generator in GANs does not have access to real data during the entire training process, unlike TVAE.

Benchmark results over three sets of experiments, namely Gaussian mixture simulated data (GM Sim.), Bayesian network simulated data (BN Sim.), and real data. They report the average of each metric. For real datasets (f1, etc). Source arXiv:1907.00503v2 [4]

Besides, they published source code on GitHub, which with slight modification will be used further in the article.

Applying CTGAN to generating data for increasing train (semi-supervised)

This is a kind of vanilla dream for me to be examined. After brief familiarization with recent developments in GAN, I’ve been thinking about how to apply it to something that I solve on the work daily. So here is my idea.

Task formalization

Let say we have T_train and T_test (train and test set respectively). We need to train the model on T_train and make predictions on T_test. However, we will increase the train by generating new data by GAN, somehow similar to T_test, without using ground truth labels of it.

Experiment design

Let say we have T_train and T_test (train and test set respectively). The size of T_train is smaller and might have different data distribution. First of all, we train CTGAN on T_train with ground truth labels (step 1), then generate additional data T_synth (step 2). Secondly, we train boosting in an adversarial way on concatenated T_train and T_synth (target set to 0) with T_test (target set to 1) (steps 3 & 4). The goal is to apply newly trained adversarial boosting to obtain rows more like T_test. Note — original ground truth labels aren’t used for adversarial training. As a result, we take top rows from T_train and T_synth sorted by correspondence to T_test (steps 5 & 6). Finally, rain new boosting on them and check results on T_test.

Experiment design and workflow

Of course for the benchmark purposes we will test ordinal training without these tricks and another original pipeline but without CTGAN (in step 3 we won’t use T_sync).

Code

Experiment code and results released as Github repo here. Pipeline and data preparation was based on Benchmarking Categorical Encoders’ article and its repo. We will follow almost the same pipeline, but for speed, only Single validation and Catboost encoder was chosen. Due to the lack of GPU memory, some of the datasets were skipped.

Datasets

All datasets came from different domains. They have a different number of observations, several categorical and numerical features. The aim of all datasets is a binary classification. Preprocessing of datasets was simple: removed all time-based columns from datasets. The remaining columns were either categorical or numerical. In addition, while training results were sampled T_train — 5%, 10%, 25%, 50%, 75%