Helping Machines Understand Complex Questions


For more computational power, I used Google Colab to work on my project, so I added the data to my Google Drive. I loaded the data, selected only what was relevant for my project, and split the data right away to avoid any leakage.
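A minimal sketch of that setup; the Drive path, file name, and split parameters here are my own placeholders, not the exact values from the project:

```python
import pandas as pd
from google.colab import drive
from sklearn.model_selection import train_test_split

# Mount Google Drive inside the Colab runtime
drive.mount('/content/drive')

# Placeholder path: substitute wherever your copy of the data lives
df = pd.read_csv('/content/drive/My Drive/data/train.csv')

# Split right away so no preprocessing step sees the test data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```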

Also, the original targets on this dataset are not binary, so I did a quick transformation to facilitate our classification.
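A hedged sketch of that transformation, assuming the targets are continuous scores that can be thresholded at 0.5; which columns hold the eight targets is an assumption here:

```python
# Assumption: the eight target columns are the last eight in the frame
target_cols = train_df.columns[-8:]

# Threshold the continuous scores at 0.5 to get binary labels
y_train = (train_df[target_cols] >= 0.5).astype(int)
y_test = (test_df[target_cols] >= 0.5).astype(int)
```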

Now the fun part starts! First, we create a little helper function to clean our text data of punctuation, numbers, extra spaces, etc.
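One possible version of such a helper, using regular expressions:

```python
import re
import string

def clean_text(text):
    """Lowercase the text and strip punctuation, digits, and extra spaces."""
    text = text.lower()
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)  # punctuation
    text = re.sub(r'\d+', ' ', text)   # numbers
    text = re.sub(r'\s+', ' ', text)   # extra whitespace
    return text.strip()
```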

We apply the function to all our text data using the apply method and a lambda.
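For example, with placeholder names for the two text columns:

```python
# Placeholder column names; substitute the actual title and body columns
text_cols = ['question_title', 'question_body']

for col in text_cols:
    train_df[col] = train_df[col].apply(lambda x: clean_text(x))
    test_df[col] = test_df[col].apply(lambda x: clean_text(x))
```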

We will load a pickled version of the Common Crawl GloVe model with 840B tokens, a 2.2M vocabulary, cased, and 300d vectors. Loading the pickled version is much faster than loading the entire 2.03 GB GloVe model. Again, I saved that file on my Google Drive, so that’s where we are getting it from, but you can find all GloVe embeddings here. We then define an embedding matrix that our model will use as part of an embedding layer further on.
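A sketch of that step, assuming the pickle holds a plain word-to-vector dictionary; the matrix itself is built from the tokenizer’s vocabulary, which we create in the next step:

```python
import pickle
import numpy as np

# Placeholder path to the pickled GloVe dict (word -> 300-d vector)
with open('/content/drive/My Drive/data/glove.840B.300d.pkl', 'rb') as f:
    glove_embeddings = pickle.load(f)

def build_embedding_matrix(word_index, embeddings, dim=300):
    """Map each word in the tokenizer's vocabulary to its GloVe vector;
    out-of-vocabulary words keep a zero row."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix
```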

Now we need to tokenize and pad our text data into the right shape for our 300-dimensional GloVe embeddings. I chose to treat my text columns as separate features instead of concatenating all the text data.
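A sketch with assumed sequence lengths and the placeholder column names from before:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN_TITLE, MAX_LEN_BODY = 30, 300  # assumed maximum sequence lengths

# One shared tokenizer fitted on both text columns
tokenizer = Tokenizer()
tokenizer.fit_on_texts(
    train_df['question_title'].tolist() + train_df['question_body'].tolist())

def to_padded(series, max_len):
    return pad_sequences(tokenizer.texts_to_sequences(series), maxlen=max_len)

X_title_train = to_padded(train_df['question_title'], MAX_LEN_TITLE)
X_body_train = to_padded(train_df['question_body'], MAX_LEN_BODY)
X_title_test = to_padded(test_df['question_title'], MAX_LEN_TITLE)
X_body_test = to_padded(test_df['question_body'], MAX_LEN_BODY)

# With a vocabulary in hand, build the GloVe embedding matrix defined above
embedding_matrix = build_embedding_matrix(tokenizer.word_index, glove_embeddings)
```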

Next, we also need to prepare our categorical column for our neural network. I used LabelEncoder to do that.
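For example, with 'category' as a placeholder for the categorical column’s name:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Reshape to a column vector so it can feed a Dense input layer later
X_cat_train = le.fit_transform(train_df['category']).reshape(-1, 1)
X_cat_test = le.transform(test_df['category']).reshape(-1, 1)
```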

Finally, we will build our neural network model with Keras. The function below was heavily inspired by and adapted from this Kaggle Notebook, and uses three separate inputs: one for the categorical feature and two for the text features, the question’s title and the question’s body.

Let’s walk through what is going on there: we pass our categorical feature through a Dense layer with a sigmoid activation. For the text inputs, an embedding layer encodes each one with GloVe embeddings, and we concatenate the two embedded sequences. These then go through a Long Short-Term Memory (LSTM) layer, commonly used in natural language processing models, wrapped in a bidirectional layer. We follow the LSTM layer by concatenating a Global Max Pooling and a Global Average Pooling of its output, then add two hidden layers with ReLU and sigmoid activations. Finally, we add the text and categorical branch outputs together and pass them through one final Dense layer with 8 units and a sigmoid activation, which suits our multi-label target.
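Here is a hedged sketch of such a model builder; the layer sizes are my assumptions, not the exact values from the original notebook:

```python
from tensorflow.keras.layers import (Input, Embedding, Dense, Bidirectional,
                                     LSTM, GlobalMaxPooling1D,
                                     GlobalAveragePooling1D, Concatenate, Add)
from tensorflow.keras.models import Model

def build_model(embedding_matrix, max_len_title=30, max_len_body=300,
                n_targets=8):
    # Categorical branch: a small Dense layer with a sigmoid activation
    cat_input = Input(shape=(1,), name='category')
    cat_dense = Dense(64, activation='sigmoid')(cat_input)

    # Two text inputs: question title and question body
    title_input = Input(shape=(max_len_title,), name='title')
    body_input = Input(shape=(max_len_body,), name='body')

    # Shared GloVe embedding layer with frozen weights
    embedding = Embedding(embedding_matrix.shape[0], 300,
                          weights=[embedding_matrix], trainable=False)
    title_emb = embedding(title_input)
    body_emb = embedding(body_input)

    # Concatenate the two embedded sequences along the time axis
    text = Concatenate(axis=1)([title_emb, body_emb])

    # Bidirectional LSTM over the combined sequence
    lstm = Bidirectional(LSTM(64, return_sequences=True))(text)

    # Global max + average pooling of the LSTM output, concatenated
    pooled = Concatenate()([GlobalMaxPooling1D()(lstm),
                            GlobalAveragePooling1D()(lstm)])

    # Two hidden layers with ReLU and sigmoid activations
    hidden = Dense(128, activation='relu')(pooled)
    hidden = Dense(64, activation='sigmoid')(hidden)

    # Add the text and categorical branches, then the multi-label output
    merged = Add()([hidden, cat_dense])
    output = Dense(n_targets, activation='sigmoid')(merged)

    return Model(inputs=[cat_input, title_input, body_input], outputs=output)
```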

We compile our model, calling the function above with the GloVe embedding matrix we constructed previously. We then fit it to our data and evaluate the results. First, I use the evaluate method to get the model’s accuracy on the train and test data.
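Roughly, with assumed epoch and batch-size values:

```python
model = build_model(embedding_matrix)
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Epochs and batch size are assumptions for illustration
model.fit([X_cat_train, X_title_train, X_body_train], y_train.values,
          validation_split=0.1, epochs=5, batch_size=32)

_, train_acc = model.evaluate([X_cat_train, X_title_train, X_body_train],
                              y_train.values)
_, test_acc = model.evaluate([X_cat_test, X_title_test, X_body_test],
                             y_test.values)
print(f'Train Accuracy: {train_acc}')
print(f'Test Accuracy: {test_acc}')
```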

```
Train Accuracy: 0.8713242853592384
Test Accuracy: 0.839124177631579
```

However, because we have a strong imbalance in our data, we want to look at other metrics that will be more informative than accuracy. Keras does not provide a classification report method, but we can easily work around that by transforming the predictions from Keras into binary results and then passing them through Scikit-learn’s classification report.
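A sketch of that workaround, thresholding the sigmoid outputs at 0.5:

```python
from sklearn.metrics import classification_report

# Threshold the sigmoid outputs at 0.5 to get binary predictions
y_pred = (model.predict([X_cat_test, X_title_test, X_body_test]) >= 0.5).astype(int)

# Multi-label report: one row of precision/recall/F1 per target
print(classification_report(y_test.values, y_pred,
                            target_names=list(target_cols)))
```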

What we can see from our results is that even though our neural network with GloVe embeddings achieves an overall accuracy of close to 84%, precision, recall, and F1 scores vary considerably from one target to another.

Our data suffers from some class imbalance, where the more complex targets are in fact also the ones with fewer samples. These targets had the worst performance by far, not only because they are complex and thus harder for the machine to interpret, but also because of the small number of samples. More data, especially for the underrepresented targets, could help improve the model further.

In any case, working on this project was a great way to play with Natural Language Processing tools and with Keras, and I hope it can be useful to you as well. Happy coding!