Original article can be found here (source): Deep Learning on Medium

GANs are well known for their success in realistic image generation. However, they can also be applied to tabular data generation. We will review and examine some recent papers about tabular GANs in action. For the final results and source code, you can go to the Github repository.

# What is GAN

“GAN composes of two deep networks: the **generator** and the **discriminator**” [1]. Both of them are trained simultaneously. Generally, the model structure and training process are presented this way:

The task for the **generator** is to generate samples that the **discriminator** cannot distinguish from real samples. I won’t give much detail here, but if you would like to dive deeper, you can read the Medium post and the original paper by Ian J. Goodfellow.

Recent architectures such as StyleGAN 2 can produce outstanding photo-realistic images.

## Problems

While face generation seems to be not a problem anymore, there are plenty of issues we need to resolve:

- **Training speed**. Training StyleGAN 2 takes one week on a DGX-1 (8x NVIDIA Tesla V100).
- **Image quality in specific domains**. The state-of-the-art network still fails on other tasks.

# Tabular GANs

Even generating cats and dogs seems a heavy task for GANs because of the non-trivial data distribution and the high variety of object types. In such domains, the image background also becomes important, and GANs usually fail to generate it.

Therefore, I’ve been wondering what GANs can achieve on tabular data. Unfortunately, there aren’t many articles on the topic. The following two papers appear to be the most promising.

First, they raise several problems that make generating tabular data challenging:

- the various data types (int, decimals, categories, time, text)
- different shapes of distribution (multi-modal, long-tail, non-Gaussian, …)
- sparse one-hot-encoded vectors and highly imbalanced categorical columns.

**Task formalizing**

Let us say a table **T** contains **n_c** continuous variables and **n_d** discrete (categorical) variables, and each row is a vector **C**. These variables have an unknown joint distribution **P**, and each row is sampled independently from **P**. The objective is to train a generative model **M**. **M** should generate a new synthetic table **T_synth** with a distribution similar to **P**. A machine learning model trained on **T_synth** should achieve accuracy on a real test table **T_test** similar to that of a model trained on **T**.

**Preprocessing numerical variables.** “Neural networks can effectively generate values with a distribution centered over (−1, 1) using *tanh*” [3]. However, they show that networks fail to generate suitable data for multi-modal distributions. Thus they cluster each numerical variable by fitting a Gaussian Mixture Model (**GMM**) with **m** (m = 5) components for each column of **C**.

Finally, the GMM is used to normalize **C** to get **V**. Besides, they compute the probability of **C** coming from each of the **m** Gaussian distributions as a vector **U**.
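This GMM step can be sketched with scikit-learn. Note that `gmm_normalize` is a hypothetical helper name, and the 2σ normalization with clipping to [−1, 1] follows the paper’s description, not the authors’ exact code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_normalize(c, m=5, seed=0):
    """Cluster one numeric column with a GMM and normalize it (TGAN-style sketch).

    Returns:
      v -- value normalized within its most probable mode, clipped to [-1, 1]
      u -- probability of each sample under each of the m Gaussian components
    """
    c = np.asarray(c, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=m, random_state=seed).fit(c)
    u = gmm.predict_proba(c)                    # shape (n, m)
    means = gmm.means_.ravel()                  # mode centers
    stds = np.sqrt(gmm.covariances_).ravel()    # mode spreads
    k = u.argmax(axis=1)                        # most probable mode per sample
    v = (c.ravel() - means[k]) / (2 * stds[k])  # normalize within the mode
    return np.clip(v, -1.0, 1.0), u
```

Both outputs are then fed to the network instead of the raw column.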

**Preprocessing categorical variables.** Due to the usually low cardinality, they found that the probability distribution can be generated directly using softmax. But it is necessary to convert categorical variables to a one-hot-encoded representation and add noise to the binary variables.
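The one-hot-plus-noise trick can be sketched in a few lines of numpy; `one_hot_with_noise` and its `noise` level are illustrative assumptions, not the paper’s exact values:

```python
import numpy as np

def one_hot_with_noise(labels, n_categories, noise=0.2, seed=0):
    """One-hot encode a categorical column, add uniform noise, and
    renormalize each row so it remains a probability distribution."""
    rng = np.random.default_rng(seed)
    d = np.eye(n_categories)[labels]          # plain one-hot, shape (n, k)
    d += rng.uniform(0, noise, size=d.shape)  # noise makes the targets non-binary
    return d / d.sum(axis=1, keepdims=True)   # rows still sum to 1
```

The added noise keeps the discriminator from trivially separating real (exactly binary) rows from generated (soft) ones.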

After preprocessing, they convert **T** with **n_c + n_d** columns to the vectors **V, U, D**. This representation is the output of the generator and the input of the discriminator in the GAN. “GAN does not have access to GMM parameters” [3].

**Generator**

They generate a numerical variable in two steps. First, generate the value scalar **V**, then generate the cluster vector **U**, eventually applying **tanh**. Categorical features are generated as a probability distribution over all possible labels with **softmax**. To generate the desired row, an LSTM with an attention mechanism is used. The input for the LSTM at each step is the random variable **z**, the **weighted context vector**, the **previous hidden state**, and the **embedding vector**.

**Discriminator**

A Multi-Layer Perceptron (MLP) with LeakyReLU and BatchNorm is used. The first layer takes the concatenated vectors **(V, U, D)** along with a mini-batch diversity feature vector from the LSTM. The loss function is the KL divergence of the input variables summed with an ordinal log-loss term.

**Results**

They evaluate the model on two datasets, **KDD99** and **covertype**. For some reason, they used weak models without boosting (XGBoost, etc.). Anyway, TGAN performs reasonably well and robustly, outperforming Bayesian networks. The average performance gap between real data and synthetic data is 5.7%.

The key improvements over the previous TGAN are mode-specific normalization to overcome non-Gaussian and multimodal distributions, and a conditional generator with training-by-sampling to deal with imbalanced discrete columns.

**Task formalizing**

The initial data remains the same as it was in TGAN. However, they solve different problems.

- **Likelihood of fitness**. Do columns in **T_syn** follow the same joint distribution as **T_train**?
- **Machine learning efficacy**. When training a model to predict one column using the other columns as features, can such a model learned from **T_syn** achieve performance on **T_test** similar to a model learned on **T_train**?
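Machine learning efficacy can be sketched as follows; `ml_efficacy` is an illustrative helper with a single stand-in classifier, while the papers evaluate a whole suite of models:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def ml_efficacy(train_real, train_synth, test, target="y"):
    """Fit the same model once on real and once on synthetic data,
    then compare accuracy on the same real test set."""
    scores = {}
    for name, df in [("real", train_real), ("synth", train_synth)]:
        X, y = df.drop(columns=[target]), df[target]
        clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
        pred = clf.predict(test.drop(columns=[target]))
        scores[name] = accuracy_score(test[target], pred)
    return scores
```

A small gap between the two scores means the synthetic table preserved the predictive signal of the real one.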

**Preprocessing**

Preprocessing for **discrete** columns stays the same.

For **continuous** variables, a variational Gaussian mixture model (**VGM**) is used. It first estimates the number of modes **m** and then fits a Gaussian mixture. Then we normalize the initial vector **C** almost the same way as in TGAN, but the value is normalized within each mode. The mode is represented as a one-hot vector beta ([0, 0, …, 1, 0]), and alpha is the normalized value of **C**.

As a result, we get the initial row represented as the concatenation of the one-hot encoded discrete columns with the representation of the continuous variables discussed above:
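The mode-specific normalization can be sketched with scikit-learn’s variational GMM; `vgm_encode` is a hypothetical helper, and the 4σ scaling with clipping follows the CTGAN paper’s description rather than the released code:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def vgm_encode(c, max_modes=10, seed=0):
    """Mode-specific normalization sketch: fit a variational GMM, pick the
    most probable mode per value, and return (alpha, beta).

    alpha -- value normalized within its mode, clipped to [-1, 1]
    beta  -- one-hot indicator of the selected mode
    """
    c = np.asarray(c, dtype=float).reshape(-1, 1)
    vgm = BayesianGaussianMixture(
        n_components=max_modes,
        weight_concentration_prior=1e-3,  # encourages pruning of unused modes
        random_state=seed,
    ).fit(c)
    probs = vgm.predict_proba(c)
    k = probs.argmax(axis=1)
    means = vgm.means_.ravel()
    stds = np.sqrt(vgm.covariances_).ravel()
    alpha = np.clip((c.ravel() - means[k]) / (4 * stds[k]), -1.0, 1.0)
    beta = np.eye(max_modes)[k]
    return alpha, beta
```

Each continuous cell thus becomes the pair (alpha, beta) instead of a single raw number.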

**Training**

“The final solution consists of three key elements, namely: the conditional vector, the generator loss, and the training-by-sampling method” [4].

**Conditional vector**

It represents the concatenated one-hot vectors of all discrete columns, but with only one selected category specified. “For instance, for two discrete columns, D1 = {1, 2, 3} and D2 = {1, 2}, the condition (D2 = 1) is expressed by the mask vectors m1 = [0, 0, 0] and m2 = [1, 0]; so cond = [0, 0, 0, 1, 0]” [4].
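The paper’s example can be reproduced in a few lines of numpy; `make_cond_vector` is an illustrative helper, not the authors’ API:

```python
import numpy as np

def make_cond_vector(col_categories, chosen_col, chosen_cat):
    """Concatenate one-hot masks of all discrete columns, with a single
    selected category set to 1 (the cond vector from the CTGAN paper)."""
    masks = []
    for i, n_cat in enumerate(col_categories):
        m = np.zeros(n_cat)
        if i == chosen_col:
            m[chosen_cat] = 1  # only the chosen column/category is marked
        masks.append(m)
    return np.concatenate(masks)
```

For the paper’s example, `make_cond_vector([3, 2], chosen_col=1, chosen_cat=0)` yields `[0, 0, 0, 1, 0]`.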

**Generator loss**

“During training, the conditional generator is free to produce any set of one-hot discrete vectors” [4]. But they enforce the conditional generator to produce **d_i** (the generated discrete one-hot column) **= m_i** (the mask vector) by penalizing its loss with the cross-entropy between them, averaged over all instances of the batch.
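This extra loss term can be sketched in numpy; `generator_cond_loss` is a hypothetical name, and a real implementation would compute it inside the training graph:

```python
import numpy as np

def generator_cond_loss(d_probs, mask):
    """Cross-entropy between the generated discrete column (softmax
    probabilities) and the conditional mask, averaged over the batch."""
    eps = 1e-12  # guards against log(0)
    # mask rows are one-hot, so this picks -log p(selected category) per row
    return float(-(mask * np.log(d_probs + eps)).sum(axis=1).mean())
```

The penalty is zero only when the generator puts all probability mass on the category the cond vector asked for.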

**Training-by-sampling**

“Specifically, the goal is to resample efficiently in a way that all the categories from discrete attributes are sampled evenly during the training process, as a result, to get real data distribution during the test” [4].

In other words, the output produced by the conditional generator must be assessed by the critic, which estimates the distance between the learned conditional distribution **P_G(row|cond)** and the conditional distribution on real data **P(row|cond)**. “The sampling of real training data and the construction of **cond** vector should comply to help critics estimate the distance” [4]. Properly sampling the **cond** vector and training data helps the model evenly explore all possible values in the discrete columns.
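For a single discrete column, the sampling scheme can be sketched as follows; `sample_cond_and_row` is an illustrative helper, and the log-frequency weighting is my reading of how the paper flattens category imbalance:

```python
import numpy as np

def sample_cond_and_row(column, rng):
    """Pick a category with probability proportional to the log of its
    frequency, then pick a real row having that category."""
    cats, counts = np.unique(column, return_counts=True)
    logfreq = np.log(counts + 1)        # flattens the imbalance between categories
    p = logfreq / logfreq.sum()
    cat = rng.choice(cats, p=p)
    row_idx = rng.choice(np.flatnonzero(column == cat))
    return cat, row_idx
```

Rare categories are therefore sampled far more often than their raw frequency would suggest, so the generator sees them during training.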

The model structure is given below; unlike TGAN, there is no LSTM layer. The model is trained with the WGAN loss with gradient penalty.

Also, they propose a model based on a variational autoencoder (VAE), but it is out of the scope of this article.

**Results**

The proposed networks, CTGAN and TVAE, outperform other methods. As they say, TVAE outperforms CTGAN in several cases, but GANs have several favorable properties: the generator in a GAN does not have access to real data during the entire training process, unlike TVAE.

Besides, they published the source code on *GitHub*, which, with slight modifications, will be used further in this article.

# Applying CTGAN to generating data for increasing train (semi-supervised)

This is a kind of vanilla dream of mine to examine. After a brief familiarization with recent developments in GANs, I’ve been thinking about how to apply them to something I solve at work daily. So here is my idea.

**Task formalization**

Let’s say we have **T_train** and **T_test** (train and test set respectively). We need to train a model on **T_train** and make predictions on **T_test**. However, we will enlarge the train set by using a GAN to generate new data somehow similar to **T_test**, without using its ground truth labels.

**Experiment design**

Let’s say we have **T_train** and **T_test** (train and test set respectively). **T_train** is smaller and might have a different data distribution. First of all, we train CTGAN on **T_train** with ground truth labels (*step 1*), then generate additional data **T_synth** (*step 2*). Secondly, we train boosting in an adversarial way on the concatenation of **T_train** and **T_synth** (target set to 0) with **T_test** (target set to 1) (*steps 3 & 4*). The goal is to use the newly trained adversarial boosting to select rows that resemble **T_test**. Note that the original ground truth labels aren’t used for the adversarial training. As a result, we take the top rows from **T_train** and **T_synth** sorted by their correspondence to **T_test** (*steps 5 & 6*). Finally, we train new boosting on them and check the results on **T_test**.

Of course, for benchmarking purposes, we will test ordinary training without these tricks, as well as the original pipeline but without CTGAN (in step 3 we won’t use **T_synth**).
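Steps 3 to 6 can be sketched with scikit-learn’s gradient boosting as a stand-in for the boosting library used in the experiment; `adversarial_filter` and its `top_frac` parameter are hypothetical names, and the actual repo uses its own pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def adversarial_filter(train_plus_synth, test, top_frac=0.5, seed=0):
    """Adversarial-validation sketch: train a classifier to separate
    (T_train + T_synth) from T_test, then keep the rows that look most
    like T_test according to the classifier."""
    X = pd.concat([train_plus_synth, test], ignore_index=True)
    y = np.r_[np.zeros(len(train_plus_synth)), np.ones(len(test))]
    clf = GradientBoostingClassifier(random_state=seed).fit(X, y)
    # probability of resembling T_test, scored on the train+synth part only
    scores = clf.predict_proba(train_plus_synth)[:, 1]
    n_keep = int(len(train_plus_synth) * top_frac)
    keep_idx = np.argsort(-scores)[:n_keep]
    return train_plus_synth.iloc[keep_idx]
```

The filtered rows, together with their original labels, then form the new training set for the final model.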

**Code**

The experiment code and results are released as a Github repo here. The pipeline and data preparation are based on the “Benchmarking Categorical Encoders” article and its repo. We will follow almost the same pipeline, but for speed only single validation and the CatBoost encoder were chosen. Due to a lack of GPU memory, some of the datasets were skipped.

**Datasets**

All datasets come from different domains. They have different numbers of observations and various categorical and numerical features. The aim in all datasets is binary classification. Preprocessing of the datasets was simple: all time-based columns were removed; the remaining columns are either categorical or numerical. In addition, for training, **T_train** was subsampled at **5%, 10%, 25%, 50%, and 75%**.