Source: Deep Learning on Medium
Deep learning has proved groundbreaking in many domains, such as computer vision, natural language processing, and signal processing. However, when it comes to more structured, tabular data (consisting of categorical or numerical variables), traditional machine learning approaches (such as Random Forests) are believed to perform better. Neural nets have since caught up, and in many instances have been shown to perform equally well or even better.
The easiest way to perform deep learning with tabular data is through the fastai library, which gives really good results, but it might be a little too abstracted for someone who’s trying to understand what is really going on behind the scenes. Hence, in this article, I cover how to build a simple deep learning model for tabular data in PyTorch on a multiclass classification problem.
A little background on PyTorch
PyTorch is a popular open-source machine learning library. It is as simple to use and learn as Python. A few other advantages of using PyTorch are its multi-GPU support and custom data loaders. If you’re unfamiliar with the basics or need a refresher, here’s a good place to start:
If you wanna follow along with the code, here’s my Jupyter notebook:
I’ve used the Shelter Animal Outcomes Kaggle competition data:
It’s a tabular dataset consisting of about 26k rows and 10 columns in the training set. All columns except DateTime are categorical.
Given certain features about a shelter animal (like age, sex, color, breed), predict its outcome.
There are 5 possible outcomes: Return_to_owner, Euthanasia, Adoption, Transfer, Died. We are expected to find the probability of an animal’s outcome belonging to each of the 5 categories.
Although this step depends largely on the particular data and problem, there are two necessary steps that need to be followed:
Getting rid of Nan values:
Nan (not a number) indicates a missing value in the dataset. The model doesn’t accept Nan values, hence they must be either deleted or replaced.
For numerical columns, a popular way of dealing with these values is to impute them with 0, mean, median, mode or some other function of the remaining data. Missing values might sometimes indicate an underlying feature in your dataset, so people often create a new binary column corresponding to the column with missing values to record whether the data was missing or not.
For categorical columns, Nan values can be considered as their own category!
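Here is a minimal sketch of both strategies using pandas. The frame and its column names (`age`, `color`) are made up for illustration and are not taken from the shelter data:

```python
import numpy as np
import pandas as pd

# Toy frame; the columns are illustrative only.
df = pd.DataFrame({
    "age": [2.0, np.nan, 5.0, np.nan],
    "color": ["black", None, "white", "black"],
})

# Numerical column: first record which rows were missing,
# then impute the gaps with the median of the observed values.
df["age_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: treat a missing value as its own category.
df["color"] = df["color"].fillna("missing")
```

The indicator column has to be created before imputing, otherwise the information about which rows were missing is already gone.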
Label encoding all categorical columns:
Since our model can only take numerical inputs, we convert all our categorical elements to numbers. This means instead of using strings to represent categories, we use numbers. The number chosen to represent any category in the columns doesn’t really matter because we’re later going to use categorical embeddings to further encode these categories. Here’s a simple example of label encoding:
I’ve used the LabelEncoder class from the scikit-learn library to encode the categorical columns. You could define a custom class to do this and keep track of the category labels because you’d need them to encode test data too.
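As a rough sketch of that idea, the snippet below fits one LabelEncoder per column on the training data and keeps the fitted encoders around so the test data can be encoded the same way. The toy frames and the `encoders` dictionary are my own illustration, not code from the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the real dataset has more columns and categories.
train = pd.DataFrame({"color": ["black", "white", "brown", "black"]})
test = pd.DataFrame({"color": ["white", "brown"]})

# Fit one encoder per categorical column and keep it for later.
encoders = {}
for col in ["color"]:
    le = LabelEncoder()
    train[col] = le.fit_transform(train[col])
    encoders[col] = le

# Reuse the fitted encoders so test categories get the same integers.
for col, le in encoders.items():
    test[col] = le.transform(test[col])
```

Note that `transform` raises an error for a category that was not present when the encoder was fitted, so unseen test categories need separate handling.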
Label encoding the target:
We also need to label encode the target if it has string entries. Also, make sure you maintain a dictionary mapping the encodings to original values because you’ll need it to figure out the final output of your model.
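One possible way to keep that mapping is to build it from the fitted encoder’s `classes_` attribute. The sample list and the name `index_to_label` below are illustrative assumptions:

```python
from sklearn.preprocessing import LabelEncoder

# A handful of example target values using the five outcome names.
outcomes = ["Adoption", "Transfer", "Died", "Adoption", "Euthanasia", "Return_to_owner"]

target_encoder = LabelEncoder()
y = target_encoder.fit_transform(outcomes)

# Dictionary from encoded integer back to the original outcome string.
index_to_label = {i: label for i, label in enumerate(target_encoder.classes_)}
```

LabelEncoder assigns integers in sorted order of the labels, so the dictionary (rather than the order of appearance in the data) is what tells you which output column corresponds to which outcome.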
Data Processing particular to the Shelter Outcome problem:
Along with the above-mentioned steps, I did a little more processing for the example problem.
- Removed the AnimalID column because it’s unique and won’t help in training.
- Removed the OutcomeSubtype column because it’s a part of the target but we’re not asked to predict it.
- Removed the DateTime column because the exact timestamp of when the record was entered didn’t seem like an important feature. In fact, I first tried to split it into separate month and year columns but later realized that removing the column altogether gave me a better result!
- Removed the Name column because it had too many Nan values (more than 10k missing). Also, it did not seem like a very important feature in determining an animal’s outcome.
Note: In my notebook, I stacked the train and test columns and then did the preprocessing to avoid having to do label encoding based on the train set labels on the test set (because it would involve maintaining a dictionary of encoded labels to actual values). It was okay to do the stacking and processing here because there are no numerical columns (hence no imputing done) and the number of categories per column was fixed. In practice, we must never do this because it may leak some data from the test/validation sets to the training data and lead to an inaccurate evaluation of the model. For example, if you had missing values in a numerical column like age and decided to impute it with the average value, the average value should be calculated only on the train set (not stacked train-test-valid set) and this value should be used to impute missing values in validation and test sets too.
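The leak-free version of that imputation can be sketched as follows, with a made-up `age` column and toy train/validation frames:

```python
import numpy as np
import pandas as pd

# Toy splits; the values are illustrative only.
train = pd.DataFrame({"age": [1.0, 3.0, np.nan, 5.0]})
valid = pd.DataFrame({"age": [np.nan, 10.0]})

# Compute the statistic on the training set only (NaN is skipped by default).
train_mean = train["age"].mean()

# Apply the same training-set value to every split.
train["age"] = train["age"].fillna(train_mean)
valid["age"] = valid["age"].fillna(train_mean)
```

The validation value of 10.0 never influences the imputed number, which is the whole point.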
Categorical embeddings are very similar to word embeddings, which are commonly used in NLP. The basic idea is to have a fixed-length vector representation of each category in the column. This differs from one-hot encoding in that, instead of a sparse vector, each category gets a dense vector, with similar categories having values close to each other in the embedding space. Hence, this process not only saves memory (the one-hot encoding of a column with many categories can really blow up the input matrix, and that matrix is very sparse) but also reveals intrinsic properties of the categorical variables.
For example, if we had a column of colors and learned embeddings for it, we could expect pink to sit closer in the embedding space to a similar color than to a dissimilar one.
Categorical embedding layers are equivalent to extra layers on top of each one-hot encoded input:
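That equivalence can be checked directly in PyTorch: looking up rows in an `nn.Embedding` gives the same result as multiplying the one-hot encoded input by the embedding weight matrix. The sizes below (5 categories, 3 dimensions) are arbitrary choices for the demonstration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# 5 categories, each mapped to a 3-dimensional dense vector.
emb = nn.Embedding(num_embeddings=5, embedding_dim=3)
idx = torch.tensor([0, 3, 3, 1])  # label-encoded categories

# Path 1: direct table lookup.
via_lookup = emb(idx)

# Path 2: one-hot encode, then apply a bias-free linear map with the same weights.
one_hot = F.one_hot(idx, num_classes=5).float()
via_matmul = one_hot @ emb.weight

same = torch.allclose(via_lookup, via_matmul)
```

The lookup is simply the cheaper way to compute it, since it never materializes the sparse one-hot matrix.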
For our shelter outcome problem, we have only categorical columns, but I’ll be treating columns with fewer than 3 distinct values as continuous. To decide the length of each column’s embedding vector, I’ve taken a simple function from the fastai library:
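A rule of thumb that older fastai releases used is half the number of categories, capped at 50. The sketch below applies it; whether this is exactly the function used in the notebook is an assumption on my part, and the category counts are made up:

```python
# Hypothetical number of distinct categories per column (not the real counts).
n_categories = {"AnimalType": 2, "SexuponOutcome": 6, "Breed": 1678, "Color": 411}

# Columns with more than two categories get an embedding;
# the rest are treated as continuous inputs.
embedded_cols = {name: n for name, n in n_categories.items() if n > 2}

def embedding_size(n):
    """Half the category count (rounded up), capped at 50 dimensions."""
    return min(50, (n + 1) // 2)

# One (number of categories, embedding dimension) pair per embedded column.
embedding_sizes = [(n, embedding_size(n)) for n in embedded_cols.values()]
```

The cap keeps high-cardinality columns from dominating the input width of the network.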
PyTorch Dataset and DataLoader
We extend the Dataset (abstract) class provided by PyTorch for easier access to our dataset while training and for effectively using the DataLoader module to manage batches. This involves overriding the __len__ and __getitem__ methods as per our particular dataset.
Since we only need to embed categorical columns, we split our input into two parts: numerical and categorical.
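A Dataset subclass along those lines might look like the sketch below. The class name `TabularDataset`, its constructor arguments, and the toy frame are hypothetical:

```python
import numpy as np
import pandas as pd
from torch.utils.data import Dataset

class TabularDataset(Dataset):
    """Serves (categorical part, numerical part, target) for each row."""

    def __init__(self, X, y, embedded_col_names):
        # Columns that will go through embedding layers (integer codes).
        self.X_cat = X[embedded_col_names].to_numpy(dtype=np.int64)
        # Remaining columns are fed to the network as floats.
        self.X_num = X.drop(columns=embedded_col_names).to_numpy(dtype=np.float32)
        self.y = np.asarray(y, dtype=np.int64)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X_cat[idx], self.X_num[idx], self.y[idx]

# Toy, already label-encoded data.
X = pd.DataFrame({"Breed": [0, 2, 1], "Color": [1, 0, 1], "AnimalType": [0, 1, 1]})
y = [0, 3, 1]
ds = TabularDataset(X, y, ["Breed", "Color"])
x_cat, x_num, label = ds[1]
```

Keeping the integer codes as int64 matters because embedding layers expect integer indices, not floats.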
We then choose our batch size and feed it along with the dataset to the DataLoader. Deep learning is generally done in batches. DataLoader helps us in effectively managing these batches and shuffling the data before training.
To do a sanity check, you can iterate through the created DataLoaders to look at each batch:
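A sanity check of this kind could look like the following. To keep the snippet self-contained it uses random tensors wrapped in a `TensorDataset` as a stand-in for the real dataset, and the batch size of 4 is arbitrary:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 10 rows, 2 categorical columns, 1 numerical column, 5 classes.
x_cat = torch.randint(0, 5, (10, 2))
x_num = torch.rand(10, 1)
y = torch.randint(0, 5, (10,))
train_ds = TensorDataset(x_cat, x_num, y)

batch_size = 4
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)

# Walk through the batches and inspect their shapes.
batch_shapes = []
for xb_cat, xb_num, yb in train_dl:
    batch_shapes.append((tuple(xb_cat.shape), tuple(xb_num.shape), tuple(yb.shape)))
    print(xb_cat.shape, xb_num.shape, yb.shape)
```

With 10 rows and a batch size of 4, the last batch holds only the 2 leftover rows; passing `drop_last=True` to the DataLoader would discard it instead.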
Our data is split into continuous and categorical parts. We first convert the categorical parts into embedding vectors based on the previously determined sizes and concatenate them with the continuous parts to feed to the rest of the network. This picture demonstrates the model I’ve used:
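In code, an architecture of this shape might be sketched as below. The class name `ShelterModel`, the single hidden layer of width 64, and the absence of dropout or batch normalization are my simplifications, not necessarily what the notebook uses:

```python
import torch
import torch.nn as nn

class ShelterModel(nn.Module):
    def __init__(self, embedding_sizes, n_cont, n_classes=5):
        super().__init__()
        # One embedding table per categorical column.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(n_cat, dim) for n_cat, dim in embedding_sizes]
        )
        n_emb = sum(dim for _, dim in embedding_sizes)
        # Fully connected layers on top of [embeddings | continuous inputs].
        self.layers = nn.Sequential(
            nn.Linear(n_emb + n_cont, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x_cat, x_cont):
        # Embed each categorical column, then concatenate with continuous inputs.
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat(embedded + [x_cont], dim=1)
        return self.layers(x)  # raw scores (logits), one per class

model = ShelterModel(embedding_sizes=[(6, 3), (10, 5)], n_cont=1)
x_cat = torch.tensor([[0, 9], [5, 2]])
x_cont = torch.tensor([[1.0], [0.0]])
logits = model(x_cat, x_cont)
```

The model returns unnormalized scores because PyTorch’s cross-entropy loss applies the log-softmax internally.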
Now we train the model on the training set. I’ve used the Adam optimizer to minimize the cross-entropy loss. The training is pretty straightforward: iterate through each batch, do a forward pass, compute gradients, take an optimizer step, and repeat this process for as many epochs as needed. You can look at my notebook to understand the code.
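The loop itself can be sketched in a few lines. To stay self-contained, this version trains a small stand-in network on random data; the learning rate, number of epochs, and layer sizes are arbitrary assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 64 rows, 4 features, 5 classes.
X = torch.rand(64, 4)
y = torch.randint(0, 5, (64,))
train_dl = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 5))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(5):
    model.train()
    total, count = 0.0, 0
    for xb, yb in train_dl:
        optimizer.zero_grad()               # clear gradients from the previous step
        logits = model(xb)                  # forward pass
        loss = F.cross_entropy(logits, yb)  # expects raw logits and integer targets
        loss.backward()                     # compute gradients
        optimizer.step()                    # update the weights
        total += loss.item() * len(yb)
        count += len(yb)
    epoch_losses.append(total / count)      # mean loss over the epoch
```

Tracking the per-epoch mean loss on both the training and validation sets is a simple way to decide when to stop.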
Since we’re interested in finding the probabilities for each class for our test inputs, we apply a Softmax function over our model output. I also made a Kaggle submission to see how well this model performs:
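Turning raw model outputs into class probabilities can be sketched as follows. The logits here are hand-written placeholders rather than outputs of the trained model:

```python
import torch
import torch.nn.functional as F

# Placeholder logits for two animals over the 5 outcome classes.
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0, 0.0],
                       [0.0, 0.0, 0.0, 0.0, 0.0]])

# No gradients are needed at prediction time.
with torch.no_grad():
    probs = F.softmax(logits, dim=1)  # normalize across the class dimension
```

Each row of `probs` sums to 1, and the column order follows the integer encoding of the target, which is why the label mapping has to be kept.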
We’ve done very little feature engineering and data exploration and used a very basic deep learning architecture, yet our model has done better than about 50% of the solutions. This shows that this approach of modeling tabular data using neural networks is pretty powerful!