Original article was published by Darius Fuller on Deep Learning on Medium
The project and task I will be referring to during this post are not something that needs to be completed through the use of a neural network. There are multiple, well-established options available to anyone looking to perform text classification. For example, Naive Bayes classifiers and Support Vector Machines (SVM) are commonly used to classify texts. This blog post by Kamran Kowsari goes over text classification and commonly used algorithms in greater detail if you’re curious.
Knowing my options, I decided to go with what I thought was the most novel. Having just done a deep dive into how deep learning and artificial intelligence (AI) work with respect to NLP, I was eager to give it a go. My use of deep learning employs an artificial neural network, or ANN, to look at input text(s) and learn from underlying “hidden” features within to produce a desired output. Basically, the ANN will engineer its own features depending on the input it receives, allowing for (in theory) greater understanding. I would recommend looking over Jason Brownlee’s blog post for a deeper look at the topic.
The data set I used for my project is from Crowdflower and can be found on data.world. There was not much information that I could find on exactly where and when the tweets come from, so I am only inferring from what the tweets themselves say. I can say for sure that it consists of over 9,000 tweets taken from those attending one or more tech-related events at the South by Southwest (SXSW) festival in Austin, TX sometime around 2013.
For brevity, I will include all the imports and a brief explanation on their purpose in one convenient GitHub gist:
This is what the data set looks like after being converted into a DataFrame in a Jupyter Notebook:
Each row includes:
- A sentiment label**
- A brand or product the sentiment is directed at**
- The raw tweet text
**determined by human evaluators
Out of the three features available, the “emotion_in_tweet_is_directed_at” column contains all but one of the missing values in the entire data set (roughly 2/3 of its values are missing). In addition, this information would not be useful in training a model to learn from the text, as it is itself an interpretation drawn from the text by a human. For these reasons, I removed this column, along with the single row without a “tweet_text” value, from the DataFrame.
At this point in the project, I used a custom function to clean all of the tweets. This was not in preparation for the neural network; it was an effort to explore the data by creating a couple of visualizations: a frequency distribution and a word cloud.
A frequency distribution is a plot displaying how many occurrences of each token there are in a given corpus. It comes from Natural Language Toolkit (NLTK) and is relatively easy to implement after a text has been tokenized. Here’s how I did it:
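As a rough sketch of the idea (not the project's exact code), the counting step can be reproduced with the standard library. NLTK's FreqDist is a subclass of Python's Counter, so the `most_common()` call below works the same way on a FreqDist. The sample tweets and the whitespace tokenization are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical, already-cleaned tweets standing in for the real data set.
tweets = [
    "apple opening pop-up store for ipad launch",
    "google launching new social network called circles",
    "new ipad launch at the apple pop-up store",
]

# Splitting on whitespace is a simplification of a proper tweet tokenizer.
tokens = [token for tweet in tweets for token in tweet.split()]

# Count how often each token occurs across the whole corpus.
freq = Counter(tokens)

# The most frequent tokens are what a FreqDist plot puts on its left edge.
top_tokens = freq.most_common(3)
print(top_tokens)
```

With NLTK installed, swapping `Counter(tokens)` for `nltk.FreqDist(tokens)` should additionally give access to its `.plot()` method.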
Just by looking at this FreqDist plot, it is easy to see where my inferences about the data set come from. There is a strong showing from tech-related tokens such as “google”, “apple”, and “ipad”, indicating some tie between these tweets and technology. It even seems possible to piece together a sentence describing the event: I suspected that there was a “pop-up” to promote the “launch” of a “new” “ipad”.
As a partner to this plot, I coded in a word cloud, which represents the same idea stretched over a custom image. The necessary class, WordCloud, comes from the wordcloud package. Here’s how I made mine:
The word cloud appears to confirm the results of the FreqDist plot, with some minor differences. For example, there are bigrams included such as “apple store”, “social network”, or “called circle”. This helped provide more clues as to what was inspiring some of these tweets: the launch of a social network called Circles (confirmed by a Google search).
Now with a better understanding of the type of tweets that are in the data set, let’s get into how I prepared the data set for the neural network.
Finding the target
First I changed the text labels into a numerical representation that the network can understand easily:
- “Negative emotion” → 0
- “No emotion toward brand or product” → 1
- “I can’t tell” → 1
- “Positive emotion” → 2
I chose to do it this way as I felt it best represented a negative, neutral, positive structure for the target variable given the text labels that came with the data. Following this, I needed to convert the labels into a one-hot encoded representation, using the to_categorical() function from Keras. The last thing I did before cleaning the actual text was a train-test split, so that I would have held-out data to evaluate my network’s performance on after training.
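To make the two encoding steps concrete, here is a minimal sketch in plain Python. The `one_hot` helper is a hand-rolled stand-in that only illustrates what to_categorical() returns; it is not the Keras function itself. The three sample labels are made up, and the exact spelling of the label strings in the data set is an assumption.

```python
# Map each text label to an integer class (same mapping as the list above).
LABEL_TO_CLASS = {
    "Negative emotion": 0,
    "No emotion toward brand or product": 1,
    "I can't tell": 1,
    "Positive emotion": 2,
}


def one_hot(class_ids, num_classes):
    """Return one row per label with a 1.0 in the column of its class."""
    return [
        [1.0 if col == class_id else 0.0 for col in range(num_classes)]
        for class_id in class_ids
    ]


# Hypothetical labels for three tweets.
raw_labels = ["Positive emotion", "I can't tell", "Negative emotion"]

class_ids = [LABEL_TO_CLASS[label] for label in raw_labels]
targets = one_hot(class_ids, num_classes=3)
print(class_ids)
print(targets)
```

Each row of `targets` has exactly one non-zero entry, which is the shape a three-neuron softmax output layer expects to be compared against.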
Cleaning the text data is, in my opinion, made a lot easier through the use of regular expressions (regex). Although there is a bit of a learning curve when attempting non-basic tasks, regex lets one modify text documents on a character-by-character basis. I recommend playing around with it on regexr.com first, just to see how it works in a hands-on manner.
In my project I ended up using a custom function that takes in multiple regex patterns, text, and replacement strings, and returns a version of the input text cleaned according to those patterns. Here is the core function doing the cleaning and some of its work:
“Best thing I’ve heard in a long while actually! "I gave iPad 2 money to #Japan relief." #sxsw @mention @mention @mention”
“Best thing I’ve heard in a long while actually! I gave iPad 2 money to #Japan relief. #sxsw”
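A minimal sketch of a function in this spirit is shown here. The function name and the three patterns are my own illustration, chosen so that they reproduce the before/after pair above; the patterns used in the actual project may well differ.

```python
import re


def clean_text(text, substitutions):
    """Apply each (pattern, replacement) pair in order, then trim whitespace."""
    for pattern, replacement in substitutions:
        text = re.sub(pattern, replacement, text)
    return text.strip()


# Illustrative patterns only; order matters because whitespace is collapsed last.
SUBSTITUTIONS = [
    (r"@\w+", ""),   # drop @mention placeholders
    (r'"', ""),      # drop double quotation marks
    (r"\s+", " "),   # collapse runs of whitespace into a single space
]

raw = 'Best thing I\'ve heard in a long while actually! "I gave iPad 2 money to #Japan relief." #sxsw @mention @mention @mention'
cleaned = clean_text(raw, SUBSTITUTIONS)
print(cleaned)
```

Passing the patterns in as data rather than hard-coding them makes it easy to try a different cleaning recipe without touching the function.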
Now that some of the nonsense has been removed, the next step I took was to tokenize the text. Keras makes this really easy via the Tokenizer() class found in the “text” module. After this, the tweets, now represented as lists of tokens, need to be converted into padded sequences. A padded sequence is an ordered numerical representation of a sentence (or text), padded with zeroes to a desired length. Here’s how I did it (trust me, it’s important):
The target class distribution for this data set was highly imbalanced, which is a problem for any type of machine learning. Essentially, the fewer examples of a given class a model has to learn from, the less likely it is to predict that class. The distribution was:
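To show what a padded sequence actually looks like, here is a hand-rolled sketch in plain Python. It is not the Keras API: the two helper functions are hypothetical names that only imitate the behaviour I rely on, namely that the Keras Tokenizer numbers words by frequency and reserves 0 for padding, and that pad_sequences() pads and truncates at the front by default. The sample texts and the length of 5 are arbitrary.

```python
from collections import Counter


def build_word_index(texts):
    """Give each word an integer id, most frequent first; 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {
        word: idx
        for idx, (word, _) in enumerate(counts.most_common(), start=1)
    }


def to_padded_sequences(texts, word_index, max_len):
    """Turn texts into equal-length lists of ids, zero-padded on the left."""
    padded = []
    for text in texts:
        # Words missing from the index are simply skipped in this sketch.
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-max_len:]  # if too long, keep only the last max_len ids
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded


texts = ["new ipad launch", "ipad launch at the apple store"]
word_index = build_word_index(texts)
sequences = to_padded_sequences(texts, word_index, max_len=5)
print(word_index)
print(sequences)
```

Every row now has the same length, which is what allows a batch of tweets to be fed to the network as a single rectangular array.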
- Negative sentiment: 6.26%
- Neutral sentiment: 60.97%
- Positive sentiment: 32.75%
In addition, my predicament required a different tool than I was used to for addressing class imbalance. Because the data is sequenced, using the SMOTE (Synthetic Minority Over-sampling Technique) class from imbalanced-learn (imblearn) was not an option.
SMOTE creates synthetic samples by interpolating between existing minority-class samples, so the sequences it generates would not translate back into coherent English: the interpolated token indices do not correspond to any real tweet. Thus, I decided to use imblearn’s RandomOverSampler() class to randomly copy tweets in the minority classes, ensuring the “readability” of the inputs remains intact for the learning process.
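The idea behind random oversampling can be sketched in a few lines of plain Python. This is an illustration of the technique, not imblearn's implementation; the function name, the seed, and the toy sequences are all assumptions.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Copy randomly chosen minority samples until every class matches the largest."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Extra rows are exact copies of existing ones, so they stay readable.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy padded sequences: one negative (0), three neutral (1), two positive (2).
X = [[0, 3, 1], [0, 2, 4], [5, 2, 4], [0, 0, 6], [7, 1, 2], [0, 7, 8]]
y = [0, 1, 1, 1, 2, 2]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Because every added row is a duplicate of a real tweet, nothing synthetic enters the training data; the trade-off is that the model sees the same minority examples many times.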
Now that I have a data set with an even distribution among classes, I can begin to put together the neural network that will eventually attempt to learn how to classify tweets by their sentiment!
When building an ANN with Keras, the first step is to instantiate the model. As with any other package, this is done by calling the class and storing the result in a variable. From here, one only needs to use the .add() method to stack on as many layers as desired before finalizing the build using the .compile() method. Here is how I did it:
I’ll do my best to explain a bit on what is going on above. In line 8, I begin by adding the embedding layer that will serve as the “space” that each input sequence will live in before moving into the next layer. Choosing the values for the two parameters will depend on the task and input data, but sticking with memory-friendly numbers (64, 128, etc.) for the embedding size in my experience helps.
Lines 10–13 detail the addition of a Long Short-Term Memory (LSTM) layer, which in theory helps the model analyze each sequence as a whole rather than part-by-part, thus increasing its understanding. I needed to apply a GlobalAveragePooling1D() layer to transform the data appropriately (more on this concept) for use in the next layers. The last line applies dropout regularization, which promotes generalization of the model by randomly zeroing out 30% of the values passed on to the next layer during training.
Lines 16 and 17 show the addition of a densely connected layer (Dense()) and another application of dropout regularization. Line 20 is the final dense layer, which is the output layer; the activation function and neurons are dependent on the number of classes one is attempting to predict (in my case ‘softmax’ and 3 neurons).
As mentioned before, in order to finalize the ANN’s architecture, one needs to apply the .compile() method (lines 23–25). There are three main parameters:
- Optimizer: String name or class of optimization algorithm (default: “rmsprop”)
- Loss: The function the network minimizes during training; it measures how far the predictions are from the true labels
- Metrics: The metric(s) the network reports to judge its performance each epoch
Line 28 is just a demonstration of how to use the .summary() method to receive a confirmation of the architecture:
In my experience, visually confirming the neural network architecture prior to training is a great way to potentially catch any missteps and strategize on what parameters to tweak when tuning. Regardless, I now have a compiled ANN that can start training on data!
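To tie the pieces together, here is a rough sketch of a network with the same sequence of layers described above. It is not the project's exact code, so the line numbers mentioned earlier do not apply to it. The vocabulary size, sequence length, layer widths, the “adam” optimizer, and the “relu” activation are all placeholder assumptions; only the 30% dropout, the layer order, and the 3-neuron softmax output come from the description. It assumes TensorFlow is installed.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10000  # assumed number of distinct tokens kept by the tokenizer
MAX_LEN = 50        # assumed length of each padded sequence

model = keras.Sequential()
model.add(keras.Input(shape=(MAX_LEN,)))

# Embedding: maps each token id to a dense 128-dimensional vector.
model.add(layers.Embedding(input_dim=VOCAB_SIZE, output_dim=128))

# LSTM returning one output per time step, averaged by the pooling layer.
model.add(layers.LSTM(64, return_sequences=True))
model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dropout(0.3))

# Densely connected layer followed by a second round of dropout.
model.add(layers.Dense(32, activation="relu"))
model.add(layers.Dropout(0.3))

# Output layer: one neuron per sentiment class.
model.add(layers.Dense(3, activation="softmax"))

# categorical_crossentropy pairs with one-hot encoded targets.
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```

Declaring the input shape up front means .summary() can report output shapes and parameter counts before any data has passed through the model.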
Training an ANN with Keras is very similar to how one would do so with a package like scikit-learn (sklearn): using the .fit() method. However, this is where the similarities cease, as Keras’ method has a different set of parameters specific to training ANNs. Here are the ones I made use of:
- batch_size: The number of samples processed before each update of the network’s weights
- epochs: The number of complete passes over the training data
- callbacks: A list of callback objects, each of which performs a specific task at given points during training
- validation_split: The fraction of the training data (between 0 and 1) to hold out as validation data. The model is not trained on this data; it is only used to evaluate the loss and metrics at the end of each epoch.
- verbose: Accepted values are: 0, 1, or 2. These values determine the level of detail in the display during ANN training. A value of 0 will produce nothing, 1 will display a progress bar for each epoch, and 2 displays a line for each epoch including chosen evaluation metrics and how long the training took.
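The way batch_size, epochs, and validation_split interact is easiest to see with some arithmetic. The sketch below is a back-of-the-envelope estimate that mirrors how I understand .fit() to carve up the data (Keras takes the validation samples from the end of the training data, before shuffling); the function name and the numbers are made up for illustration.

```python
import math


def fit_schedule(n_samples, validation_split, batch_size, epochs):
    """Estimate how many samples and weight updates a training run involves."""
    n_train = math.floor(n_samples * (1.0 - validation_split))
    n_val = n_samples - n_train
    # One weight update per batch; the last batch of an epoch may be smaller.
    steps_per_epoch = math.ceil(n_train / batch_size)
    return {
        "train_samples": n_train,
        "validation_samples": n_val,
        "steps_per_epoch": steps_per_epoch,
        "total_weight_updates": steps_per_epoch * epochs,
    }


schedule = fit_schedule(
    n_samples=1000, validation_split=0.2, batch_size=32, epochs=10
)
print(schedule)
```

A smaller batch_size therefore means more weight updates per epoch, while validation_split reduces the amount of data the network actually learns from.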
The .fit() method will produce what the Keras documentation calls a history object. This object has the attribute .history, which will be instrumental in evaluating the model’s performance. The documentation describes it as:
“…a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).”
Using the information stored in this attribute, I will be able to create a graph that is commonly used to evaluate how well the training of an ANN is going.
Fitting the data:
Display during training with verbose=2:
With the training finished, the model is now ready to make some predictions! A key point is that in order to make predictions, the test data, or any new data for that matter, must be in the same format as the data used for training. Luckily this was not a concern for me, since my test data comes from the same data set, but I would keep a copy of the training data around for reference if necessary.
Getting the model to create predictions is mandatory if we are to see how well it can classify tweets based upon their sentiment. Once again, similarly to sklearn, this can be done by putting the desired data into the model’s .predict() method.
Once the predictions have been generated, it is fairly simple to produce a classification report and confusion matrix using functions found in sklearn’s metrics module. In this post, I will not go into great detail about the “why”, but I wrote in fair detail about both topics in a previous post if you’d like more information.
Side Note: I needed to use the .argmax() method on my predictions prior to creating the following plots. This was in order to change my vector of probabilities for each of the three classes respectively into a vector of the most probable class as the prediction; without this step, I would be unable to generate these plots.
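A minimal illustration of that step in plain Python follows; with NumPy arrays the equivalent is the .argmax() method with axis=1. The helper name and the probability values are hypothetical.

```python
def argmax_rows(rows):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Hypothetical softmax outputs for three tweets
# (columns: negative, neutral, positive).
probs = [
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
]
predicted = argmax_rows(probs)

# One-hot encoded test labels need the same treatment before comparing.
y_true = argmax_rows([[0, 1, 0], [0, 0, 1], [0, 1, 0]])
print(predicted, y_true)
```

After this step both the predictions and the true labels are plain class indices, which is the form the classification report and confusion matrix work with.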
A classification report serves as a quick snapshot of commonly used metrics for classification tasks.
Creating the classification report:
A confusion matrix is a plot that helps illustrate how well a model predicts each of the classes respectively. Generally, it is used to analyze the relationship between the predicted labels and true labels. In order to display properly, I needed to make use of a custom function. I will include code for both.
Using the function to generate a confusion matrix:
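For readers who want to see what these tools compute, here is a hand-rolled sketch in plain Python. It is not the custom plotting function from the project and not sklearn's implementation, although it follows the same convention of true labels on the rows and predicted labels on the columns. The labels are made up.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count predictions; rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix


def precision_recall(matrix, cls):
    """Precision and recall for one class, read off the confusion matrix."""
    true_pos = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix)  # column total
    actually_cls = sum(matrix[cls])                     # row total
    precision = true_pos / predicted_as_cls if predicted_as_cls else 0.0
    recall = true_pos / actually_cls if actually_cls else 0.0
    return precision, recall


# Hypothetical labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 1, 1, 1, 2, 2, 1, 0]
y_pred = [1, 1, 1, 2, 2, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
neutral_precision, neutral_recall = precision_recall(cm, cls=1)
print(cm)
print(neutral_precision, neutral_recall)
```

The diagonal of the matrix holds the correct predictions, so a model that favors the majority class shows up as a heavy column for that class rather than a heavy diagonal.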
Just like that, I now have access to all the information necessary to begin tuning my model for better performance. Tuning generally refers to tweaking the hyperparameters and/or the architecture of an ANN before compiling and training it again. The “Optimization in Neural Networks” section in Matthew Stewart’s blog post does a great job explaining how this process works.
Alternatively, I can just leave it be and conclude with the results I have since I have completed the task of creating an ANN that can classify tweets by their sentiment with relative accuracy.
Trying It Out
As a personal preference, I like to functionize processes or otherwise unwieldy blocks of code so that I can consistently execute them with minimal effort. In this next section I will show how I generalized the process above into a callable function, as well as the results of the ANN created in this post (I did not discuss all of the functionality I added in this post).
Through the use of this function, I was able to train and evaluate models efficiently, leaving time and space left over for strategizing on my next tuning adjustment.
Although not the end-all be-all of classification metrics, I felt getting 64% accuracy on my first go classifying text using deep learning isn’t half bad!
Holding It Down
Honestly, I do believe that I was able to squeeze out most of the predictive capabilities from the input data. This, however, is not to say that I believe the performance could not be further improved. I think with more time preprocessing the data and/or playing with the architecture, my model can detect sentiment with higher accuracy.
The entire project is viewable on GitHub if you would like to see how I actually did everything discussed in this post.