Don’t Stop at Ensembles — Unconventional Deep Learning Techniques for Tabular Data


In recent years, Deep Learning has made huge strides in the fields of Computer Vision and Natural Language Processing. As a result, deep learning techniques have often been confined to image data or sequential (text) data. What about tabular data? Tables remain the traditional method of information storage and retrieval in many organizations, and they are arguably the most important data format when it comes to business use cases. Yet in the Deep Learning arena, tabular data usually gets nothing more than simple multi-layer feedforward networks. Although one can argue that Recurrent Neural Networks (RNNs) are often used on tabular time series data, the applications of these methodologies to data without a time series component are very limited. In this blog post, we’ll look at the application of some deep learning techniques, usually reserved for image or text data, to non-time-series tabular data, in decreasing order of conventionality.

Autoencoders for Dimensionality Reduction

Conventionally, autoencoders have been used for non-linear dimensionality reduction. Say we have a dataset where the number of features is far larger than we’d like: we can use autoencoders to compress the feature set to the desired size through complex non-linear functions that we don’t have to specify ourselves! This can be more effective than linear dimensionality reduction methods like PCA (Principal Component Analysis) or other conventional non-linear techniques like LLE (Locally Linear Embedding).
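For reference, the linear baseline being compared against is only a few lines in scikit-learn (a minimal sketch; X_train and the target dimension of 16 are assumptions that carry through the rest of this post):

from sklearn.decomposition import PCA

# Linear baseline: project the features onto the top 16 principal components
pca = PCA(n_components=16)
X_train_reduced = pca.fit_transform(X_train)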

Autoencoder structure (Source)

Autoencoders are trained on the training feature set without any labels, i.e., they try to reproduce their input at the output. This would be a trivial task if every hidden layer were wide enough to carry all of the input information. The defining requirement for a neural network to be an autoencoder is therefore to have at least one layer, the bottleneck layer, of lower dimension than the input and output. This bottleneck is the embedding layer, i.e., the reduced feature set we want to use. The training loss can be the usual mean squared error or mean absolute error. If the original data is x and the reconstructed output generated by the autoencoder is x_hat, we try to minimize

L(x, x_hat) = ||x − x_hat||²

After training the encoder and decoder parts of the network together, we use only the encoder part of the model as our dimensionality reduction/feature extraction algorithm. The sample code in Keras is as follows:

from keras.layers import Dense, Dropout
from keras.models import Sequential, Model
from keras import metrics, Input

METRICS = [
    metrics.RootMeanSquaredError(name='rms'),
    metrics.MeanAbsoluteError(name='mae')
]
ENCODING_DIM = 16  # Desired dimension of the bottleneck
BATCH_SIZE = 64
EPOCHS = 100

def make_and_train_autoencoder(X_train, metrics=METRICS):
    len_input_output = X_train.shape[-1]
    input_ = Input(shape=(len_input_output,))
    encoded = Dense(units=ENCODING_DIM*2, activation="relu")(input_)
    bottleneck = Dense(units=ENCODING_DIM, activation="relu")(encoded)
    decoded = Dense(units=ENCODING_DIM*2, activation="relu")(bottleneck)
    output = Dense(units=len_input_output, activation="linear")(decoded)
    # Training is performed on the entire autoencoder
    autoencoder = Model(inputs=input_, outputs=output)
    autoencoder.compile(optimizer='adam', loss='mean_squared_error',
                        metrics=metrics)
    autoencoder.fit(X_train, X_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS)
    # Use only the encoder part for dimensionality reduction
    encoder = Model(inputs=input_, outputs=bottleneck)
    return autoencoder, encoder
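After training, only the encoder is kept as the feature extractor. A usage sketch (X_test, a held-out feature set of the same width as X_train, is an assumption):

autoencoder, encoder = make_and_train_autoencoder(X_train)
# Map both splits into the 16-dimensional bottleneck space
X_train_reduced = encoder.predict(X_train)
X_test_reduced = encoder.predict(X_test)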

Denoising Autoencoders for Noise Reduction

Denoising autoencoder example on MNIST (Source)

The inspiration for Denoising Autoencoders comes from the field of computer vision. As you can see above, they can be used to remove noise from the input data. Denoising Autoencoders (DAEs) can be used in the same way on tabular data, since most data collection processes inherently introduce some noise. This technique has proven to be key in several Kaggle competitions’ winning solutions (e.g., Porto Seguro’s Safe Driver Prediction). Unlike plain autoencoders, DAEs are not required to have distinct encoding and decoding parts. In other words, there is no bottleneck; a DAE is simply a neural network trained to remove noise. The question then is: how do we train the network?

Working of a denoising autoencoder (Source)

As explained in the image above, we first corrupt our input data x. The corrupted data 𝑥̃ is usually obtained by adding Gaussian noise or by setting some of the input feature values to zero. This is our way of trying to mimic the noise in our datasets. We then pass 𝑥̃ through the DAE we have designed to get the reconstructed output x_hat with the same dimensions as the input x. The loss function is similar to that of a usual autoencoder. A DAE tries to minimize the difference between the output x_hat and the original data x, thereby giving it the ability to eliminate the influence of noise and extract features from the corrupted data. The sample code is as follows:

import numpy as np

# Change the mean and scale of the noise according to your data
noise = np.random.normal(loc=0, scale=0.5, size=X_train.shape)
X_train_noisy = X_train + noise
len_input_output = X_train.shape[-1]

def make_dae(metrics=METRICS):
    dae = Sequential([
        Dense(units=len_input_output*2,
              activation="relu", input_shape=(len_input_output,)),
        Dropout(0.5),  # Add dropout layers if required
        Dense(units=len_input_output*2, activation="relu"),
        Dense(units=len_input_output*2, activation="relu"),
        Dense(units=len_input_output, activation="linear"),
    ])
    dae.compile(
        optimizer='adam',
        loss='mean_squared_error',
        metrics=metrics
    )
    return dae

dae = make_dae()
history = dae.fit(
    X_train_noisy,
    X_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS
)

We usually train the denoising autoencoder only on the training set. Once the model is trained, we can pass the original data x, as well as the test set, say x′, through the DAE to get denoised versions of the datasets.
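In code, that denoising pass is just a forward pass through the trained network (a sketch; X_test is an assumption):

# Pass the (un-corrupted) data through the trained DAE to strip noise
X_train_denoised = dae.predict(X_train)
X_test_denoised = dae.predict(X_test)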

Synthetic Data Generation Using Language Models

Many of us might have come across the famous character-level language model for generating Shakespeare-like text. What if we trained a language model on our own training data (in CSV/txt format)? The primary question is: why would we want to do that? The simple answer is data imbalance. Most real-world datasets have a huge difference in the number of training examples for each class/label. Take fraud detection, for example, where only about 0.05% of all transactions are fraudulent. We might therefore want to generate more training examples of the minority class to tackle the imbalance problem.

By training a character-level language model on our minority-class data (feature values only), we can generate records similar to the minority class, just like we generate text by training the model on a set of poems. I won’t dive deep into the specifics of training and sampling from a character-level language model, but I encourage you to go through Andrej Karpathy’s blog (or Week 1 of the Sequence Models course on Coursera).
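To make the idea concrete, here is a minimal sketch of such a character-level model in Keras. It is illustrative rather than a production recipe; X_min (a dataframe containing only the minority-class rows), the sequence length, and the layer sizes are all assumptions:

import numpy as np
from keras.layers import LSTM, Dense, Embedding
from keras.models import Sequential

SEQ_LEN = 40  # Characters of context used to predict the next character

# Serialize minority-class rows into one corpus, one comma-separated record per line
lines = X_min.astype(str).agg(",".join, axis=1).tolist()
text = "\n".join(lines)
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

# Build (context, next-character) training pairs
X, y = [], []
for i in range(len(text) - SEQ_LEN):
    X.append([char_to_idx[c] for c in text[i:i + SEQ_LEN]])
    y.append(char_to_idx[text[i + SEQ_LEN]])
X, y = np.array(X), np.array(y)

char_lm = Sequential([
    Embedding(input_dim=len(chars), output_dim=32),
    LSTM(128),
    Dense(units=len(chars), activation="softmax"),
])
char_lm.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
char_lm.fit(X, y, batch_size=64, epochs=20)

# Sample one character at a time; newline-delimited chunks of the
# generated string are candidate synthetic minority-class records
seed = [char_to_idx[c] for c in text[:SEQ_LEN]]
generated = []
for _ in range(1000):
    probs = char_lm.predict(np.array([seed[-SEQ_LEN:]]), verbose=0)[0]
    probs = probs / probs.sum()  # Guard against floating-point drift
    nxt = int(np.random.choice(len(chars), p=probs))
    generated.append(idx_to_char[nxt])
    seed.append(nxt)
synthetic_records = "".join(generated).split("\n")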

Thankfully, there are libraries like gretel-synthetics that do the above-described job for us! Gretel’s library also has other options like enabling differential privacy in the synthetic datasets produced. Their blog posts are a great way to understand and learn about their library. The following sample code can be used to generate synthetic dataframes:

!pip install gretel-synthetics --upgrade

from pathlib import Path
from gretel_synthetics.batch import DataFrameBatch
import pandas as pd

source_df = pd.read_csv("diabetic_data.csv")  # File path

config_template = {
    "max_lines": 20000,  # maximum lines of training data; set to 0 to train on the entire dataframe
    "max_line_len": 2048,  # the max line length for input training data
    "epochs": 15,  # Gretel recommends 15-50 epochs with GPU for best performance
    "vocab_size": 20000,  # tokenizer model vocabulary size
    "gen_lines": 100,  # the number of generated text lines
    "dp": True,  # train with differential privacy enabled (privacy assurances, but reduced accuracy)
    "field_delimiter": ",",  # must be specified
    "overwrite": True,
    "checkpoint_dir": str(Path.cwd() / "checkpoints")
}

batcher = DataFrameBatch(df=source_df, config=config_template)
batcher.create_training_data()
batcher.train_all_batches()
status = batcher.generate_all_batch_lines()
synthetic_df = batcher.batches_to_df()
synthetic_df.head(10)
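A quick sanity check on the output (a sketch, not part of Gretel’s API) is to compare per-column summary statistics of the synthetic and source dataframes:

# The distributions should roughly match if the synthesizer has learned the data
print(source_df.describe())
print(synthetic_df.describe())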

All of the deep learning techniques discussed above fall into the category of self-supervised or unsupervised learning. They are often used as preprocessing steps before training the actual model for a classification or regression task. The effectiveness of these steps depends on the data at hand and on choices such as how long you train the language model or how much compression you apply to your feature set.