Using Deep Learning for Structured Data with Entity Embeddings

Showing that deep learning can handle structured data and how to do it.

The idea of embeddings comes from learning them on words in NLP (word2vec), image taken from Aylien

In this blog post we will touch on two recurring questions in machine learning. The first: deep learning performs well on images and text, but how can we use it on our tabular data? The second is a question you should always ask yourself when building a machine learning model: how am I going to deal with the categorical variables in this data set? Surprisingly, both questions have the same answer: entity embeddings.

Deep learning has outperformed other machine learning methods on many fronts recently: image recognition, audio classification and natural language processing are just some of the many examples. These research areas all use what is known as ‘unstructured data’, which is data without a predefined structure. Generally speaking, this data can be organized as a sequence (of pixels, user behavior, text), and deep learning has become the standard when dealing with it. Recently, the question has arisen whether deep learning can also perform best on structured data. Structured data is data organized in a tabular format, where the columns represent different features and the rows represent different data samples, similar to how data is represented in an Excel sheet. Currently, the gold standard for structured data sets is gradient boosted tree models (Chen & Guestrin, 2016). They consistently perform best on Kaggle competitions as well as in the academic literature. Recently, deep learning has shown that it can match the performance of these boosted tree models on structured data, and entity embeddings play an important role in this.

Structured vs. unstructured data

Entity Embeddings

Entity embeddings have been shown to work successfully when fitting neural networks on structured data. For example, the winning solution in the Kaggle competition on predicting the destination of taxi rides used entity embeddings to deal with the categorical metadata of each ride (de Brébisson et al., 2015). Similarly, the third place solution on the task of predicting store sales for Rossmann drug stores used a much less complicated approach than the first and second place solutions. The team achieved this by using a simple feed-forward neural network with entity embeddings for the categorical variables, including variables with over 1,000 categories, like the store id (Guo & Berkhahn, 2016).

If this is your first time reading about embeddings, I suggest you first read this post. In short, embeddings refer to the representation of categories by vectors. Let’s show how this works on a short sentence:

‘Deep learning is deep’

We can represent each word with a vector, so the word ‘deep’ becomes something like [0.20, 0.82, 0.45, 0.67]. In practice, one would replace the words by integers like 1 2 3 1 and use a look-up table to find the vector linked to each integer. This practice is very common in natural language processing and has also been used on data that consists of a behavioral sequence, like the journey of an online user. Entity embeddings refer to using this principle on categorical variables, where each category of a categorical variable gets represented by a vector.
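To make the look-up table idea concrete, here is a tiny sketch in NumPy. The integer ids follow the 1 2 3 1 example above, while the actual vector values are made up for illustration:

import numpy as np

# Vocabulary: deep -> 1, learning -> 2, is -> 3 (index 0 is often reserved for padding/unknown)
sentence = [1, 2, 3, 1]                       # 'Deep learning is deep' as integers

# The look-up table holds one vector per word; the values here are made up
lookup_table = np.array([
    [0.00, 0.00, 0.00, 0.00],                 # 0: padding / unknown
    [0.20, 0.82, 0.45, 0.67],                 # 1: 'deep'
    [0.31, 0.05, 0.91, 0.12],                 # 2: 'learning'
    [0.74, 0.33, 0.28, 0.56],                 # 3: 'is'
])

vectors = lookup_table[sentence]              # shape (4, 4): one vector per word
print(vectors[0])                             # the vector for 'deep'

Let’s quickly review the two common methods for handling categorical variables in machine learning.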

  • One-hot encoding: Creates binary sub-features like word_deep, word_learning, word_is. These are 1 for the category belonging to that data point and 0 for the others. So, for the word ‘deep’ the feature word_deep will be 1, and word_learning, word_is etc. will be 0.
  • Label encoding: Assigns integers like we did in the example before, so deep becomes 1, learning becomes 2, etc. This approach is suitable for tree-based models, but not for linear models, because it implies an order in the assigned values (both encodings are sketched in code right below).
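Assuming pandas (my choice of library, not something the original examples depend on), that sketch looks like this:

import pandas as pd

words = pd.Series(['deep', 'learning', 'is', 'deep'])

# Label encoding: each category becomes an integer (alphabetical here: deep -> 0, is -> 1, learning -> 2)
label_encoded = words.astype('category').cat.codes
print(label_encoded.tolist())                 # [0, 2, 1, 0]

# One-hot encoding: one binary column per category (word_deep, word_is, word_learning)
one_hot = pd.get_dummies(words, prefix='word').astype(int)
print(one_hot)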

Entity embeddings basically take the label encoding approach to the next level by assigning each category not just an integer but a whole vector. This vector can be of any size, and its size has to be specified by the researcher. You might be wondering what the advantages of these entity embeddings are.

  1. Entity embeddings solve the disadvantages of one-hot encoding. One-hot encoding variables with many categories results in very sparse vectors, which are computationally inefficient and make optimization harder. Label encoding also solves this problem, but can only be used by tree-based models.
  2. Embeddings provide information about the distance between different categories. The beauty of using embeddings is that the vectors assigned to each category are also trained during the training of the neural network. Therefore, at the end of the training process we end up with a vector that represents each category. These trained embeddings can then be visualized to provide insights into each category. In the Rossmann sales prediction task, the visualized embeddings of the German states showed clusters similar to the states’ geographical locations, even though none of this geographical information was available to the model.
  3. The trained embeddings can be saved and used in non-deep-learning models. For example, one could train the embeddings for the categorical features each month and save them. These embeddings can then be used to train a random forest or a gradient boosted trees model by loading the learned embeddings for the categorical features; a short code sketch of this follows below.
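Here it is, assuming Keras (my choice of library; the post does not prescribe one). The five numeric columns, the layer sizes and the commented-out training call are illustrative placeholders, while the 1,115 store ids and the embedding size of 10 echo the Rossmann numbers discussed later in this post:

import numpy as np
from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from tensorflow.keras.models import Model

n_stores = 1115          # cardinality of the categorical feature (store id)
embedding_size = 10      # length of the vector assigned to each category

store_in = Input(shape=(1,), name='store_id')             # label-encoded category
numeric_in = Input(shape=(5,), name='numeric_features')   # e.g. five numeric columns

# The embedding layer is a trainable look-up table: category index -> vector
store_vec = Embedding(n_stores, embedding_size, name='store_embedding')(store_in)
store_vec = Flatten()(store_vec)

x = Concatenate()([store_vec, numeric_in])
x = Dense(64, activation='relu')(x)
out = Dense(1)(x)                                          # e.g. predicted sales

model = Model(inputs=[store_in, numeric_in], outputs=out)
model.compile(optimizer='adam', loss='mse')
# model.fit([store_ids, numeric_data], sales)              # train on your own data

# After training, the learned vectors can be extracted and reused elsewhere,
# for example as features for a random forest or gradient boosted trees model
store_vectors = model.get_layer('store_embedding').get_weights()[0]   # shape (1115, 10)
features_for_store_42 = store_vectors[42]                              # the 10-dim vector of store id 42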

Choosing the embedding size

The embedding size refers to the length of the vector representing each category and can be set for each categorical feature. Similar to the tuning of hyperparameters in a neural network, there are no hard rules for choosing the embedding size. In the taxi destination prediction task the researchers used an embedding size of 10 for every feature, even though these features had very different cardinalities, ranging from 7 (day of the week) to 57,106 (client id). Choosing the same embedding size for every feature is an easy and transparent approach, but probably not the optimal one.

For the Rossmann store sales prediction task, the researchers chose a value between 1 and M-1 (where M is the number of categories), with a maximum embedding size of 10. For example, day of the week (7 values) gets an embedding size of 6, while store id (1,115 values) gets an embedding size of 10. However, the authors give no clear rule for choosing the size between 1 and M-1.

Jeremy Howard rebuilt the solution for the Rossmann competition and came up with the following rule of thumb for choosing embedding sizes:

# c is the number of categories (cardinality) of the feature
embedding_size = (c + 1) // 2
if embedding_size > 50:
    embedding_size = 50
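Applied to the cardinalities mentioned above (a hypothetical snippet that just reuses the numbers from the two competitions, with the same rule written as a one-liner), this gives:

# Cardinalities from the examples above: day of week, Rossmann store id, taxi client id
cardinalities = {'day_of_week': 7, 'store_id': 1115, 'client_id': 57106}

embedding_sizes = {name: min(50, (c + 1) // 2) for name, c in cardinalities.items()}
print(embedding_sizes)   # {'day_of_week': 4, 'store_id': 50, 'client_id': 50}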

Visualizing Embeddings

An advantage of embeddings is that the learned embeddings can be visualized to show which categories are similar to each other. The most popular method for this is t-SNE, a dimensionality reduction technique that works particularly well for visualizing high-dimensional data sets. Let’s finish this post with two quick examples of visualized embeddings. Below are the visualized embeddings of Home Depot products and the categories they belong to. Similar products like oven, refrigerator and microwave are very close to each other. The same goes for products like charger, battery and drill.

Learned embeddings of Home Depot products.

Another example is the learned embeddings of the German states in the Rossmann sales prediction task mentioned earlier in this post. The proximity between the states in the embedding space mirrors their geographical proximity.

Example of the learned state embeddings for Germany
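If you want to make a plot like this yourself, here is a minimal sketch assuming scikit-learn and matplotlib, with a random placeholder matrix standing in for the trained embeddings:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data: in practice this would be the trained embedding matrix pulled
# out of the network (e.g. store_vectors from the earlier sketch) plus category names
embeddings = np.random.rand(16, 10)           # 16 categories, embedding size 10
labels = ['state_{}'.format(i) for i in range(16)]

# Project the embedding vectors down to two dimensions for plotting
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), label in zip(coords, labels):
    plt.annotate(label, (x, y))
plt.show()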

I hope I was able to get you enthusiastic about embeddings and deep learning. If you liked this post, be sure to recommend it so others can see it. You can also follow this profile to keep up with my progress in deep learning. See you there!

Be sure to check out the rest of my deep learning series:

  1. Setting up AWS & Image Recognition
  2. Convolutional Neural Networks
  3. More on CNNs & Handling Overfitting
  4. Why You Need to Start Using Embedding Layers

References

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). ACM.

De Brébisson, A., Simon, É., Auvolat, A., Vincent, P., & Bengio, Y. (2015). Artificial neural networks applied to taxi destination prediction. arXiv preprint arXiv:1508.00021.

Guo, C., & Berkhahn, F. (2016). Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737.
