An Introduction to Using Categorical Embeddings

In this blog we will use categorical embeddings in our model for the Store Item Demand Forecasting Challenge on Kaggle. This blog is written at an introductory level and no knowledge of categorical embeddings is expected. However, you should have a basic understanding of Pandas, Keras, and ANNs.

Outline of blog:

  1. Introducing the Challenge
  2. Explaining Embeddings
  3. Setting Up Our Data
  4. Building Our Model
  5. Analyzing Our Results
  6. Moving Forwards

Introducing the Challenge:

We are given the sales data over a 5-year period (Jan 1, 2013 to Dec 31, 2017) for 50 different items at 10 different stores. The objective is to predict the sales for the next 90 days for every item at every store.

We won’t be doing much EDA in this blog, but for a brief snapshot of the kind of data we are dealing with, I created a plot of the mean sales per item per day.

I encourage you to explore the data more on your own, especially looking at trends across stores. Here is how to recreate my plot:

import pandas as pd

X = pd.read_csv("train.csv", parse_dates=True, index_col=0)
X.groupby('date').mean().resample("w").mean().plot()

Embeddings:

Why do we need them?

Traditionally, the most common way of dealing with categorical data has been one hot encoding: a method where the categorical variable is broken into as many features as it has unique categories, and for every row a 1 is assigned to the feature representing that row's category while the rest of the features are marked 0.

There are a lot of issues with this method. For categories with many unique values we end up with very sparse data. Also, each one hot vector is equidistant from every other, so we lose any notion of relationships between categories.
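To make the sparsity concrete, here is a minimal sketch of one hot encoding a hypothetical day-of-week column with pandas (the column name and values are just for illustration):

import pandas as pd

# a toy day-of-week feature, declared with all 7 possible categories
dow = pd.Categorical(['Mon', 'Tue', 'Sat', 'Sun'],
                     categories=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

# one column explodes into 7 mostly-zero indicator columns,
# and every resulting row vector is equidistant from every other one
print(pd.get_dummies(dow))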

Embeddings are a solution to dealing with categorical variables while avoiding a lot of the pitfalls of one hot encoding.

How do they work?

Formally, an embedding is a mapping of a categorical variable into an n-dimensional vector.

This provides us with two advantages. First, we limit the number of columns we need per category. Second, embeddings intrinsically group similar categories together.

Let’s go through an example!

Suppose we want to use the day of the week as a feature in our neural net. We create a 7×4 matrix (I’ll explain why 4 later) mapping each day of the week to a row and initialize it with random values. We then replace a specific day of the week with its corresponding vector.

This matrix now allows us to discover non-linear relationships among variables. As opposed to one hot encoding, where each day of the week is just an isolated indicator, embeddings transform the day of the week into a 4-dimensional concept. After training our model, we may discover that this table carries semantic meaning. For example, Saturday and Sunday could be more closely related than, say, Saturday and Wednesday.
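As a quick sketch of what that lookup looks like in Keras (the layer and model names here are made up for the example), a 7×4 embedding table can be built and queried like this:

import numpy as np
from keras.layers import Embedding, Input
from keras.models import Model

# 7 days of the week, each mapped to a 4-dimensional vector
day_in = Input((1,), name='dow_example')
day_vec = Embedding(input_dim=7, output_dim=4, input_length=1)(day_in)
toy = Model(day_in, day_vec)

# look up the (randomly initialized) vectors for Saturday (5) and Sunday (6);
# after training, vectors that end up close together indicate similar days
print(toy.predict(np.array([[5], [6]])))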

Setting Up Our Data:

We are now ready to apply this concept to our model. The first step in this process is determining which variables are continuous and which are categorical.

It makes sense that both store and item should be categorical. Despite them both being numbers, their ordering carries no semantic relationship (meaning store 1 is no more likely to be closely related to store 2 than it is to store 50).

The only other column we have is the date, which with some very basic feature engineering we can split into year, month, day, and dow (day of week). It might make sense to treat these features as continuous since there is an ordering relationship between them. However, we often get better results treating continuous variables as categorical. It takes a bit of trial and error to determine what each variable should be, but for this tutorial we will only treat year as continuous.

Why?

Using month as an example, there might be non-linear trends between months, such as summer months being more likely to have increased sales. Using an embedding matrix allows us to capture these potentially complex relationships.

So let’s see how to do this in code!

First we need to create the appropriate columns in our dataframe.

sales = X['sales']  # save the target before dropping it from the features
X.drop(columns='sales', inplace=True)
X['y'] = X.index.year - X.index.year.min()
X['m'] = X.index.month
X['d'] = X.index.day
X['dow'] = X.index.dayofweek

Now we create a list for the categorical variables and for the continuous variables.

cat_vars = list(X.columns)
cat_vars.remove('y')   # year is the only continuous variable
cont_vars = ['y']
# cat_vars is now ['store', 'item', 'm', 'd', 'dow']

Next, we create our validation set. We will use 10% of our training data.

from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(X, sales, test_size=0.1, random_state=0, shuffle=True)

We now have to copy our dataframe into a list of numpy arrays so that it will work as input to Keras.

# the order of these arrays must match the order of the model's input layers
# (categorical inputs first, then the continuous input)
X_train = []
X_val = []
for cat in cat_vars:
    X_train.append(x_train[cat].values)
    X_val.append(x_val[cat].values)
X_train.append(x_train[cont_vars].astype('float32').values)
X_val.append(x_val[cont_vars].astype('float32').values)

We have to repeat the above process for our test data, but I will leave that as an exercise.
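If you want to check your work, here is one possible solution, assuming test.csv contains id, date, store, and item columns as in the competition files:

# read the test file, keeping the id index so predictions line up with the submission
test_data = pd.read_csv("test.csv", parse_dates=['date'], index_col='id')
test_data['y'] = test_data['date'].dt.year - X.index.year.min()  # normalize against the training years
test_data['m'] = test_data['date'].dt.month
test_data['d'] = test_data['date'].dt.day
test_data['dow'] = test_data['date'].dt.dayofweek

# same ordering as X_train: categorical inputs first, continuous input last
X_test = []
for cat in cat_vars:
    X_test.append(test_data[cat].values)
X_test.append(test_data[cont_vars].astype('float32').values)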

Lastly, we need to determine the size of each embedding. There is no hard and fast rule for this, but a good heuristic given by Jeremy Howard of fast.ai is to take half the number of unique values and add one, capped at 50.

cat_sizes = {}
cat_embsizes = {}
for cat in cat_vars:
    cat_sizes[cat] = X[cat].nunique()
    cat_embsizes[cat] = min(50, cat_sizes[cat]//2 + 1)
# e.g. dow has 7 unique values, so its embedding size is 7//2 + 1 = 4 (the 7×4 matrix from earlier)

Building Our Model:

Now that the data setup is complete, we can begin the fun part: building our model.

Let’s start off with some imports.

from keras.layers import Dense, Dropout, Embedding, Input, Reshape, Concatenate
from keras.models import Model

As an aside, this competition uses the SMAPE evaluation metric, which Keras does not have built in, so we will create it here. Don’t worry too much about this part for now.

import keras.backend as K

def custom_smape(x, x_):
    # symmetric mean absolute percentage error, expressed as a fraction rather than a percentage
    return K.mean(2*K.abs(x - x_)/(K.abs(x) + K.abs(x_)))

We will use the Keras Functional API to build our model. At a high level, the architecture has 6 input layers; five of those feed into an embedding layer, the model then merges everything in a concatenation layer, and that is followed by 2 dense layers.

We create a list to hold our input layers (since we need to pass them to the Model constructor) and a list for the layers we will concatenate.

ins = []
concat = []

Now we iterate over our categorical variables and create an input layer → embedding layer → reshape layer.

for cat in cat_vars:
    x = Input((1,), name=cat)
    ins.append(x)
    # +1 because store, item, month and day are 1-indexed, so the largest index equals the number of categories
    x = Embedding(cat_sizes[cat]+1, cat_embsizes[cat], input_length=1)(x)
    # flatten the (1, emb_size) output to (emb_size,) so it can be concatenated
    x = Reshape((cat_embsizes[cat],))(x)
    concat.append(x)

I would pause here and make sure you understand what we just wrote or read the documentation if you don’t.

Now that we’ve dealt with our categorical layers we still have to create an Input Layer for the year.

y = Input((len(cont_vars),), name='cont_vars')
ins.append(y)
concat.append(y)

Lastly, we concatenate our 5 reshape layers and our 1 input layer. We finish the model by connecting the concatenation layer to a 100-unit Dense layer, which is connected to our output: a single-unit Dense layer.

y = Concatenate()(concat)
y = Dense(100, activation= 'relu')(y)
y = Dense(1)(y)
model = Model(ins, y)
model.compile('adam', custom_smape)

Our final model architecture:
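If you want to inspect the architecture yourself, Keras can print a layer-by-layer summary or draw the graph (plot_model needs pydot and graphviz installed):

from keras.utils import plot_model

# text summary of every layer, its output shape, and its parameter count
model.summary()

# graphical version of the same information, saved to disk
plot_model(model, to_file='model.png', show_shapes=True)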

Analyzing Our Results:

model.fit(X_train, y_train, batch_size=64, epochs=3, validation_data=(X_val, y_val))
Epoch 1: loss: 0.1484 - val_loss: 0.1264
Epoch 2: loss: 0.1270 - val_loss: 0.1269
Epoch 3: loss: 0.1266 - val_loss: 0.1263

We see that our model has essentially converged after just 2 epochs. Now let’s run it on our test set (of course, we should first refit the model on the training data including our validation set).

test_preds = model.predict(X_test)
sample_data = pd.read_csv("sample_submission.csv", index_col=0)
sample_data['sales'] = test_preds
sample_data.to_csv('preds.csv')

Running the model on the test set gives us a score of 14.44, which would place us 307th out of 462 teams. This is not great, but for something this simple, with very little feature engineering, we are off to a good start.

Moving Forwards:

In order to really understand a concept, it is best to apply it on your own. So here are a few recommendations to reinforce what you just learned:

  1. Add some additional features to this dataset. Perhaps holidays or the weather has a big impact on sales.
  2. Watch lesson 4 of Jeremy Howard’s Deep Learning course.
  3. Try out this method on another competition on Kaggle.