Tabular GANs for uneven distribution

GANs are well known for their success in realistic image generation, but they can be applied to tabular data generation as well. We review and examine some recent papers on tabular GANs. We then use a GAN to generate data that brings the train distribution closer to the test distribution, and compare the performance of a model trained on the initial train dataset, a model trained on the train dataset extended with GAN-generated data, and a model trained on a subsample of the train dataset selected by adversarial training. We show that using a GAN might be an option when the data distribution differs between train and test data.

The task of the generator is to generate samples that the discriminator cannot distinguish from real samples. We won't give much detail here; you can read the medium post and the original paper [2]. Recent architectures such as StyleGAN 2 can produce outstanding photo-realistic images; examples are shown in Figure 2 [6].

Problems
While face generation no longer seems to be a problem, there are plenty of issues we need to resolve:

Tabular GANs
Even generating cats and dogs seems to be a heavy task for GANs because of the non-trivial data distribution and the high variety of object types. Besides, in such domains the image background becomes important, and GANs usually fail to generate it. Therefore, we've been wondering what GANs can achieve on tabular data. Unfortunately, there aren't many articles on the topic. The following two appear to be the most promising.
TGAN: Synthesizing Tabular Data using Generative Adversarial Networks [13]
The authors raise several reasons why generating tabular data has its own challenges:
-the various data types (int, decimals, categories, time, text)
-different shapes of distribution (multi-modal, long tail, non-Gaussian, ...)
-sparse one-hot-encoded vectors and highly imbalanced categorical columns.
Preprocessing numerical variables
Neural networks can effectively generate values with a distribution centered over (-1, 1) using tanh. However, the authors show that networks fail to generate suitable data when a variable is multi-modal. They therefore cluster each numerical column C by training a Gaussian Mixture Model (GMM) [9] with m (m=5) components on it. The GMM is then used to normalize C to get V. In addition, they compute the probability of C coming from each of the m Gaussian distributions as a vector U.
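The numerical preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of [13]: it assumes the mixture parameters (weights, means, standard deviations) have already been fitted, uses a toy mixture with 2 components instead of m=5, and assumes that V is obtained by centering on the most likely component, dividing by two standard deviations, and clipping to (-1, 1). The function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def encode_numeric(c, weights, means, stds, clip=0.99):
    """Encode one value c of a numerical column as (v, u).

    u: probability of c coming from each mixture component.
    v: c normalized by the most likely component (assumed scaling:
       two standard deviations), clipped to stay inside (-1, 1).
    """
    joint = [w * gaussian_pdf(c, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)  # would be 0.0 for values extremely far from every mode
    u = [p / total for p in joint]
    k = max(range(len(u)), key=u.__getitem__)  # index of the most likely component
    v = (c - means[k]) / (2.0 * stds[k])
    v = max(-clip, min(clip, v))
    return v, u

def decode_numeric(v, u, means, stds):
    """Invert encode_numeric (exact only when v was not clipped)."""
    k = max(range(len(u)), key=u.__getitem__)
    return v * 2.0 * stds[k] + means[k]

# Toy mixture with two modes around 0 and 10.
weights, means, stds = [0.5, 0.5], [0.0, 10.0], [1.0, 2.0]
v, u = encode_numeric(11.0, weights, means, stds)
print(v, u, decode_numeric(v, u, means, stds))
```

Normalizing within the most likely component keeps V in a range that tanh can produce, while U tells the decoder which component to use when mapping V back to the original scale.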
Preprocessing categorical variables
Because categorical variables usually have low cardinality, the authors found that their probability distribution can be generated directly using softmax. However, it is necessary to convert categorical variables to a one-hot-encoded representation and add noise to the binary variables. After preprocessing, they convert T with n c + n d columns to V, U, D vectors. This vector is the output of the generator and the input for the discriminator in the GAN. The GAN does not have access to the GMM parameters.
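A minimal sketch of the noisy one-hot encoding follows. The noise scheme is an assumption of this sketch: uniform noise from [0, gamma] with gamma=0.2 is added to every dimension and the vector is renormalized to sum to 1. The function names are hypothetical.

```python
import random

def noisy_one_hot(label, categories, gamma=0.2, rng=random):
    """One-hot encode label, add uniform noise, renormalize to sum to 1."""
    one_hot = [1.0 if cat == label else 0.0 for cat in categories]
    noisy = [x + rng.uniform(0.0, gamma) for x in one_hot]
    total = sum(noisy)
    return [x / total for x in noisy]

def decode_category(probs, categories):
    """Map a probability vector back to the most likely category."""
    return categories[max(range(len(probs)), key=probs.__getitem__)]

categories = ["a", "b", "c"]
d = noisy_one_hot("b", categories, rng=random.Random(0))
print(d, decode_category(d, categories))
```

Because gamma is smaller than 1, the largest entry still marks the original category, so decoding with argmax recovers the label while the target is no longer a hard 0/1 vector.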
Generator
They generate a numerical variable in 2 steps. First, the value scalar V is generated, then the cluster vector U, eventually applying tanh.
Categorical features are generated as a probability distribution over all possible labels with softmax. To generate the desired row, an LSTM [3] with an attention mechanism is used. The input for the LSTM at each step is the random variable z and the weighted context vector, together with the previous hidden and embedding vectors.
Discriminator
A Multi-Layer Perceptron (MLP) with LeakyReLU [10] and Batch-Norm [5] is used. The first layer takes the concatenated vectors (V, U, D) along with the mini-batch diversity and the feature vector from the LSTM. The loss function is the sum of the ordinary log loss and a KL divergence term on the input variables.
The detailed model structure is shown in Figure 4 [13].
Results
They evaluate the model on two datasets, KDD99 and covertype. For some reason, they used weak models without boosting (xgboost, etc.). Nevertheless, TGAN performs reasonably well and robustly, outperforming Bayesian networks. The average performance gap between real data and synthetic data is 5.7% [13].
Modeling Tabular Data using Conditional GAN (CTGAN) [12]
The key improvements over the previous TGAN are mode-specific normalization, which overcomes non-Gaussian and multimodal distributions, and a conditional generator with training-by-sampling, which deals with imbalanced discrete columns.
Task formalizing
The initial data remains the same as it was in TGAN. However, they solve different problems:
-Likelihood of fitness. Do columns in T syn follow the same joint distribution as T train?
-Machine learning efficacy. When training a model to predict one column using the other columns as features, can a model learned from T syn achieve similar performance on T test as a model learned on T train?
Preprocessing
Preprocessing for discrete columns stays the same. For continuous variables, a variational Gaussian mixture model (VGM) is used. It first estimates the number of modes m and then fits a Gaussian mixture. The initial vector C is then normalized in almost the same way as in TGAN, but the value is normalized within each mode. The mode is represented as a one-hot vector beta ([0, 0, .., 1, 0]). Alpha is the normalized value of C. An example of mode-specific normalization is shown in Figure 5 [12].
As a result, each initial row is represented as the concatenation of the (alpha, beta) representation discussed above for every continuous variable with the one-hot vectors of the discrete columns.
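The row representation can be illustrated with a short Python sketch. It rests on several assumptions that are not stated in the text above: the mixture parameters of each continuous column are already fitted, the mode is sampled from the posterior probabilities of the components, and alpha is scaled by four standard deviations of the chosen mode. All function names are hypothetical.

```python
import math
import random

def mode_posteriors(c, weights, means, stds):
    """Posterior probability of each mode for value c.

    The 1/sqrt(2*pi) constant is omitted because it cancels out.
    """
    dens = [w * math.exp(-0.5 * ((c - m) / s) ** 2) / s
            for w, m, s in zip(weights, means, stds)]
    total = sum(dens)
    return [d / total for d in dens]

def encode_continuous(c, weights, means, stds, rng=random):
    """Return (alpha, beta): value normalized within a mode, one-hot mode."""
    probs = mode_posteriors(c, weights, means, stds)
    k = rng.choices(range(len(probs)), weights=probs)[0]  # sample the mode
    alpha = (c - means[k]) / (4.0 * stds[k])              # assumed scaling
    beta = [1.0 if i == k else 0.0 for i in range(len(probs))]
    return alpha, beta

def encode_row(continuous, discrete, mixtures, categories, rng=random):
    """Concatenate alpha and beta of every continuous column with the
    one-hot vectors of every discrete column."""
    row = []
    for c, (weights, means, stds) in zip(continuous, mixtures):
        alpha, beta = encode_continuous(c, weights, means, stds, rng)
        row.append(alpha)
        row.extend(beta)
    for label, cats in zip(discrete, categories):
        row.extend(1.0 if cat == label else 0.0 for cat in cats)
    return row

# One continuous column with two modes and one discrete column.
mixtures = [([0.5, 0.5], [0.0, 10.0], [1.0, 2.0])]
categories = [["a", "b"]]
print(encode_row([11.0], ["b"], mixtures, categories, rng=random.Random(0)))
```

Here the value 11.0 almost certainly falls into the second mode, so the row consists of alpha, a two-element beta, and a two-element one-hot vector for the discrete column.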

Training
The final solution consists of three key elements: the conditional vector, the generator loss, and the training-by-sampling method, as shown in Figure 6 [12].

Generator loss
During training, the conditional generator is free to produce any set of one-hot discrete vectors.
To enforce that the conditional generator produces d i (the generated discrete one-hot column) = m i (the mask vector), they penalize its loss by adding the cross-entropy between them, averaged over all instances of the batch.
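The penalty term can be written down directly. The sketch below assumes that each generated vector d i is a probability distribution over the categories (for example a softmax output) and that each mask m i is one-hot; the function name and the small eps constant that guards the logarithm are choices of this sketch.

```python
import math

def conditional_penalty(masks, generated, eps=1e-12):
    """Cross-entropy between mask vectors and generated discrete vectors,
    averaged over all instances of the batch."""
    total = 0.0
    for m, d in zip(masks, generated):
        total += -sum(mi * math.log(di + eps) for mi, di in zip(m, d))
    return total / len(masks)

masks = [[0.0, 1.0, 0.0]]
print(conditional_penalty(masks, [[0.1, 0.8, 0.1]]))  # small: matches the mask
print(conditional_penalty(masks, [[0.8, 0.1, 0.1]]))  # large: ignores the mask
```

The term is close to zero when the generator puts all probability mass on the category selected by the mask and grows as the mass moves to other categories.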

Training-by-sampling
Specifically, the goal is to resample efficiently so that all categories of the discrete attributes are sampled evenly during training, while the real data distribution is still recovered at test time.
In other words, the output produced by the conditional generator must be assessed by the critic, which estimates the distance between the learned conditional distribution P G(row|cond) and the conditional distribution on real data P(row|cond).
The sampling of real training data and the construction of the cond vector should comply to help the critic estimate the distance.
Properly sampling the cond vector and training data can help the model evenly explore all possible values in discrete columns. The model structure is given below; as opposed to TGAN, there is no LSTM layer. The model is trained with the WGAN loss with gradient penalty.
They also propose a model based on a variational autoencoder (VAE), but it is out of the scope of this article.
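The training-by-sampling procedure described above can be sketched as follows. The weighting is an assumption of this sketch: a discrete column is chosen uniformly, and a category is then drawn with probability proportional to the logarithm of its frequency, which flattens the distribution compared to the raw frequencies (uniform weights would be another option). Rows are plain dicts and the function name is hypothetical.

```python
import math
import random
from collections import Counter

def sample_condition(rows, discrete_cols, rng=random):
    """Pick a (column, category) condition and a real row that satisfies it."""
    col = rng.choice(discrete_cols)                    # column chosen uniformly
    counts = Counter(row[col] for row in rows)
    cats = sorted(counts)
    # log(count + 1) keeps a nonzero mass for categories seen only once
    weights = [math.log(counts[cat] + 1) for cat in cats]
    cat = rng.choices(cats, weights=weights)[0]
    matching = [row for row in rows if row[col] == cat]
    return col, cat, rng.choice(matching)

# Highly imbalanced toy column: 95% "a", 5% "b".
rows = [{"x": "a"}] * 95 + [{"x": "b"}] * 5
rng = random.Random(0)
draws = [sample_condition(rows, ["x"], rng) for _ in range(2000)]
print(sum(cat == "b" for _, cat, _ in draws) / len(draws))
```

With this weighting the rare category is selected far more often than its 5% share, and the returned real row always matches the sampled condition, so the critic compares generated and real rows under the same cond vector.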

Results
The proposed networks CTGAN and TVAE outperform other methods. As the authors say, TVAE outperforms CTGAN in several cases, but GANs do have several favorable attributes: unlike TVAE, the generator in a GAN does not have access to real data during the entire training process. Detailed results are shown in Figure 7. They report the average of each metric (f1, etc. for real datasets) [12]. Besides, they published the source code on GitHub [11], which will be used with slight modifications further in the article.

Task formalization
Let's say we have T train and T test (train and test sets respectively). We need to train a model on T train and make predictions on T test. However, we will extend the train set with new data generated by a GAN that is in some way similar to T test, without using the ground truth labels of T test.

Experiment design
The experiment design is shown in Figure 8. Let's say we have T train and T test (train and test sets respectively). T train is smaller in size and might have a different data distribution. First of all, we train CTGAN on T train with ground truth labels (step 1), then generate additional data T synth (step 2). Secondly, we train boosting in an adversarial way on the concatenation of T train and T synth (target set to 0) with T test (target set to 1) (steps 3 & 4). The goal is to apply the newly trained adversarial boosting to obtain rows that are more like T test. Note that the original ground truth labels aren't used for adversarial training. As a result, we take the top rows from T train and T synth sorted by correspondence to T test (steps 5 & 6). Finally, we train a new boosting model on them and check the results on T test.
Of course, for benchmark purposes we will also test ordinary training without these tricks, as well as the same pipeline without CTGAN (in step 3 we won't use T synth).
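Steps 3 to 6 of this design can be sketched in plain Python. To keep the example self-contained, a tiny logistic regression trained by gradient descent stands in for the boosting model used in the experiments; this is a swapped-in technique, not the released code [1]. Rows are assumed to be lists of numeric features, and all function names are hypothetical.

```python
import math

def predict_proba(x, w, b):
    """Probability that row x belongs to T test."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Logistic regression by batch gradient descent (stand-in for boosting)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def select_test_like(train_rows, synth_rows, test_rows, top_k):
    """Rank train and synthetic rows by similarity to test, keep the top ones."""
    candidates = train_rows + synth_rows
    X = candidates + test_rows
    y = [0] * len(candidates) + [1] * len(test_rows)  # adversarial target
    w, b = fit_logistic(X, y)
    ranked = sorted(candidates, key=lambda r: predict_proba(r, w, b), reverse=True)
    return ranked[:top_k]

train = [[0.0], [0.1], [0.2], [0.9]]
synth = [[1.0], [0.05]]
test = [[0.9], [1.0], [1.1], [0.95]]
print(select_test_like(train, synth, test, top_k=2))
```

Note that only the features enter the adversarial model; the original labels are needed later, when the final model is trained on the selected rows.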
Code
Experiment code and results are released as a GitHub repo [1]. The pipeline and data preparation are based on the Benchmarking Categorical Encoders article and its repositories. We follow almost the same pipeline, but for speed, only Single validation and the Catboost encoder were chosen.
Datasets
All datasets come from different domains. They have different numbers of observations and of categorical and numerical features. The objective for all datasets is binary classification. Preprocessing of the datasets was simple: all time-based columns were removed. The remaining columns were either categorical or numerical. In addition, during training T train was sampled at 5%, 10%, 25%, 50%, 75%. Some dataset characteristics such as the number of points, features, and categorical features are shown in Table 1.
At first sight, in terms of metric and stability (std), GAN shows the worst results. However, by sampling the initial train and then applying adversarial training we could obtain the best metric results and stability (sample original). To determine the best sampling strategy, the ROC AUC scores of each dataset were scaled (min-max scale) and then averaged among datasets.
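The scaling and averaging step can be sketched as follows. The function name, the nested-dict layout, and the numbers are illustrative assumptions made for this sketch; the scores are toy values, not results from the experiments.

```python
def average_scaled_scores(scores):
    """scores: {dataset: {strategy: roc_auc}}.

    Min-max scale the scores within each dataset, then average the
    scaled values of each strategy across datasets.
    """
    scaled_by_strategy = {}
    for per_strategy in scores.values():
        lo, hi = min(per_strategy.values()), max(per_strategy.values())
        span = hi - lo
        for strategy, value in per_strategy.items():
            scaled = (value - lo) / span if span > 0 else 0.0
            scaled_by_strategy.setdefault(strategy, []).append(scaled)
    return {s: sum(v) / len(v) for s, v in scaled_by_strategy.items()}

# Toy numbers for illustration only.
toy_scores = {
    "dataset_1": {"None": 0.70, "gan": 0.60, "sample_original": 0.80},
    "dataset_2": {"None": 0.90, "gan": 0.85, "sample_original": 0.95},
}
print(average_scaled_scores(toy_scores))
```

Scaling within each dataset puts the best strategy at 1 and the worst at 0, so datasets with a naturally high or low ROC AUC contribute equally to the average.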

Results
To determine the best validation strategy, we compared the top score of each dataset for each type of validation. We can see that GAN outperformed other sampling types in 2 datasets, whereas sampling from original outperformed other methods in 3 of 7 datasets. Of course, there isn't much difference, but these types of sampling might be an option. Detailed results are in Table 3.
Table 3. Different sampling results; higher is better for the mean (ROC AUC), lower is better for std (100% - maximum per dataset ROC AUC).
Table 4. Same target % is equal to 1 when the target rates for train and test differ by no more than 5%. Higher is better.
Let's define same target % to be equal to 1 when the target rates for train and test differ by no more than 5%. When we have almost the same target rate in train and test, None and sample original are better. However, gan starts performing noticeably better when the target distribution changes, as you can see in Table 4.
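The definition can be made concrete with a small helper. It assumes binary 0/1 labels and reads "no more than 5%" as an absolute difference of 0.05 between the two target rates; both the reading and the function names are assumptions of this sketch.

```python
def target_rate(labels):
    """Share of positive (1) labels."""
    return sum(labels) / len(labels)

def same_target(train_labels, test_labels, tolerance=0.05):
    """Return 1 if the target rates differ by no more than the tolerance."""
    diff = abs(target_rate(train_labels) - target_rate(test_labels))
    return int(diff <= tolerance)

train_labels = [1] * 25 + [0] * 75  # target rate 0.25
print(same_target(train_labels, [1] * 23 + [0] * 77))  # rates 0.25 vs 0.23
print(same_target(train_labels, [1] * 50 + [0] * 50))  # rates 0.25 vs 0.50
```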

Acknowledgments
The author would like to thank the Open Data Science community [7] for many valuable discussions and educational help in the growing field of machine and deep learning. Also, special thanks to PAO Sberbank [8] for the opportunity to solve such tasks and for providing computational resources.