Source: Deep Learning on Medium
In this blog I am going to take you through the steps involved in creating an embedding for categorical variables using a deep learning network built on top of Keras. The concept was originally introduced by Jeremy Howard in his fastai course. Please see the link for more details.
Across most of the data sources we work with, we will come across two main types of variables:
- Continuous variables: These are usually integer or decimal numbers and have an infinite number of possible values, e.g. computer memory units such as 1GB, 2GB etc.
- Categorical variables: These are discrete variables which are used to split the data based on certain characteristics, e.g. types of computer memory such as RAM, internal hard disk, external hard disk etc.
When we build an ML model, more often than not we need to transform the categorical variables before we can use them in the algorithm. The transformation applied has a big impact on the performance of the model, especially if the data has a large number of categorical features with high cardinality. Examples of some of the usual transformations include:
One-hot encoding: Here we convert each category value into a new column and assign a 1 or 0 (True/False) value to that column.
Binary encoding: This creates fewer features than one-hot encoding while preserving some uniqueness of the values in the column. It can work well with higher-dimensional ordinal data.
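As a quick sketch of the two encodings above (using pandas and an illustrative memory_type column, not the actual bike data):

```python
import pandas as pd

df = pd.DataFrame({"memory_type": ["RAM", "HDD", "SSD", "RAM"]})

# One-hot encoding: one new 0/1 column per category value
one_hot = pd.get_dummies(df["memory_type"], prefix="mem")

# Binary encoding: write each category's integer code in base 2,
# needing only ceil(log2(n_categories)) columns instead of n_categories
codes = df["memory_type"].astype("category").cat.codes.to_numpy()
binary = pd.DataFrame({"bit_1": (codes >> 1) & 1, "bit_0": codes & 1})

print(one_hot.shape[1], binary.shape[1])  # 3 one-hot columns vs 2 binary columns
```

With only 3 categories the saving is small (3 columns vs 2), but the gap widens quickly as cardinality grows.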
These usual transformations, however, do not capture the relationships between the categorical values. Please see the link below for more information on the different types of encoding methods.
To demonstrate the application of deep embeddings, let's take the bike sharing dataset from Kaggle as an example.
As we can see, there are a number of columns in the dataset. To demonstrate the concept we shall use just the date_dt, cnt and mnth columns.
Traditional one-hot encoding would result in 12 columns, one for each month. However, in this type of encoding equal importance is given to every month, and no relationship between the months is captured.
We can see the seasonal pattern of each month in the graph below. Months 4 to 9 are the peak months, while months 0, 1, 10 and 11 are months of low demand for bike hire.
Additionally, when we plot the daily usage for each month, represented by a different colour, we can see some weekly patterns within each month.
Ideally we would expect such relationships to be captured by the embeddings. In the next section we will examine how to generate these embeddings using a deep network built on top of Keras.
The code is shown below. We will build a simple network with an embedding layer followed by dense layers and a 'relu' activation function.
The input to the network, i.e. the 'x' variable, is the month number. This is a numeric representation of each month of the year and ranges from 0 to 11, hence input_dim is set to 12.
The output of the network, i.e. 'y', is the scaled 'cnt' column. 'y' could, however, be extended to include other continuous variables. Since we are using a single continuous variable here, we set the size of the final dense layer to 1. We will train the model for 50 iterations, or epochs.
from keras import models
from keras.layers import Embedding, Flatten, Dense
embedding_size = 3
model = models.Sequential()
model.add(Embedding(input_dim = 12, output_dim = embedding_size, input_length = 1, name="embedding"))
model.add(Flatten())
model.add(Dense(50, activation = "relu"))
model.add(Dense(1))
model.compile(loss = "mse", optimizer = "adam")
model.fit(x = data_small_df['mnth'].values, y = data_small_df['cnt_Scaled'].values, epochs = 50, batch_size = 4)
Embedding layer: Here we specify the embedding size for our categorical variable. I have used 3 in this case; increasing it would capture more detail about the relationships between the categories. Jeremy Howard suggests the following rule of thumb for choosing embedding sizes:
# m is the number of categories in the feature
embedding_size = min(50, (m + 1) // 2)
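As a minimal sketch, this rule of thumb can be expressed as a helper function (the cardinalities below are just illustrative):

```python
def embedding_size(m):
    """Suggested embedding size for a categorical feature with m categories."""
    return min(50, (m + 1) // 2)

print(embedding_size(12))    # 12 months -> 6
print(embedding_size(7))     # 7 days of the week -> 4
print(embedding_size(1000))  # high-cardinality feature -> capped at 50
```

Note that the rule suggests 6 for the 12 months; the smaller size of 3 used here is simply a compact choice for this demonstration.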
We are using the "adam" optimiser with a mean squared error loss function. Adam is preferred to SGD (stochastic gradient descent) as it converges faster due to its adaptive learning rate. You can find more details on the different types of optimisers here.
The final resulting embeddings for each of the months are as follows. Here '0' is January and '11' is December.
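Since the embedding layer in the model was given name="embedding", its learned weight table can be pulled out with model.get_layer("embedding").get_weights()[0]. The sketch below uses a random 12 x 3 matrix as a stand-in for those trained weights, just to show how the lookup works:

```python
import numpy as np

# Stand-in for the trained weights; in the real model they would come from:
#   weights = model.get_layer("embedding").get_weights()[0]
rng = np.random.default_rng(0)
weights = rng.random((12, 3))

# Each month number (0 = January ... 11 = December) indexes one row
january_vector = weights[0]
december_vector = weights[11]
print(weights.shape)  # (12, 3): one 3-dimensional vector per month
```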
When we visualise these using a 3D plot, we can see a clear relationship between the months. Months with similar 'cnt' are grouped closer together, e.g. months 4 to 9 are very similar to each other.
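A 3D scatter plot of the month vectors like the one described can be sketched with matplotlib (again using a random stand-in matrix rather than the actual trained weights):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Stand-in for the trained 12 x 3 embedding matrix
embeddings = np.random.default_rng(0).random((12, 3))

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(embeddings[:, 0], embeddings[:, 1], embeddings[:, 2])
for month, (x, y, z) in enumerate(embeddings):
    ax.text(x, y, z, str(month))  # 0 = January ... 11 = December
fig.savefig("month_embeddings.png")
```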
In conclusion, we have seen that by using Cat2Vec (categorical variables to vectors) we can represent a high-cardinality categorical variable with a low-dimensional embedding while preserving the relationships between the categories.
In the next few blogs we will explore how we can use these embeddings to build supervised and unsupervised machine learning models with better performance.
If you have enjoyed this post please do clap.