Original article was published on Deep Learning on Medium
We also recognized the value of the classification learner, being able to predict a review’s class through an independent similarity comparison with each class. The classification approach would also provide insight into the probability of each data point belonging to each class. This information would be helpful in understanding where we may need to adjust our model to account for cases of uncertainty (multiple labels with similar probability outputs). As a result, we decided to train a separate classification version over another copy of a pre-trained BERT-Base model, visualized below:
We discovered that for the classification approach, the independence assumption is not necessarily upheld. The model is more likely to misclassify a 5-star review as a 4-star rather than a 1-star because the similarity of features present between a 4-star and 5-star review is naturally greater than that of a 1-star and 5-star pair. The essence of trying to capture a review’s sentiment in making a proper classification inherently models the nature of star rating relationships.
Regression — Classification Combination
Running into the issue of uncertainty with regression allowed us to explore the avenue of creating a regression-classification combined approach. Instituting a tunable threshold of uncertainty as mentioned above, we devised a scheme that leveraged the trained classification model for these uncertain cases. The procedure for this first ensemble is as follows: in the event of an uncertain regression prediction, the same datapoint would be fed into the classification model for prediction. We tried both utilizing the classification output as the predicted label and as a rounding guider (round regression value to label nearest to classification output).
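The fallback procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the definition of "uncertain" (distance from the nearest star label greater than a threshold), the default threshold of 0.25, and the function names are hypothetical and not taken from our actual implementation.

```python
import math


def round_to_label(value, low=1, high=5):
    """Round a regression output to the nearest star label, clipped to [low, high]."""
    nearest = math.floor(value + 0.5)
    return max(low, min(high, nearest))


def combined_prediction(reg_value, class_probs, threshold=0.25, mode="replace"):
    """Combine a regression output with classifier probabilities for stars 1..5.

    Assumption: a regression prediction counts as uncertain when it lies
    farther than `threshold` from its nearest star label.
    mode="replace": use the classifier's label for uncertain cases.
    mode="guide":   round the regression value to whichever neighboring
                    label is closer to the classifier's label.
    """
    rounded = round_to_label(reg_value)
    if abs(reg_value - rounded) <= threshold:
        return rounded  # confident regression prediction, keep it

    # Classifier's most probable star rating (index 0 corresponds to 1 star).
    class_label = 1 + max(range(len(class_probs)), key=class_probs.__getitem__)
    if mode == "replace":
        return class_label

    # "Rounding guider": pick the neighboring label nearest the classifier output.
    lower = max(1, min(5, math.floor(reg_value)))
    upper = max(1, min(5, math.ceil(reg_value)))
    if abs(lower - class_label) <= abs(upper - class_label):
        return lower
    return upper
```

The threshold is the tunable part: a smaller value sends more reviews to the classifier, a larger value trusts the regression output more often.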
We found that the first version of this ensemble (classification prediction) performed better than the second version (rounding guider), and additionally outperformed the best regression rounding rule by about 1%. This small improvement implied that using a classification model for uncertain reviews was helpful, but it also exposed the limitation of a regression base for this task. While the notion of capturing spatial relations between labels seemed ideal, the regression model effectively tries to imitate an average metric for determining star ratings from the reviews seen during training. This pulls prediction values toward the central region between labels and induces higher uncertainty on average per prediction.
On the other hand, we found that a pure classification method outperformed the best ensemble model from above by 2%. We speculate that this result can be explained by how learned word associations are related to the ratings. Unlike a regression approach that attempts to model a singular average metric for all classes, a classification approach works on detecting key characteristics that are paramount to defining a particular class. This model featurizes a review and outputs the probability of the featurized review belonging to a class, creating less uncertainty in the prediction. In the case of Yelp reviews, people share different views in relating a sentiment to a star rating. For example, one person can write a strong and positive review, but one small blemish may cause the user to report 4 stars. While the regression model may use the minor negative sentiment to decrease the output score from 5 to 4.8, a classification model can better capture the significance of the subtle negativity through a similarity comparison and assign a higher probability to the 4-star category. Classification takes advantage of the discrete nature of the labels, avoiding the need to develop a fully comprehensive metric across all classes. Especially with the skewed nature of Yelp reviews, this difference is key in reducing the effect such variance and noise have on model performance.
With classification models performing better, we sought to explore the performance of an ensemble of different trained classification models. Having fine-tuned several of the pre-trained models mentioned above with a classification approach to the rating prediction task, we selected the top 3 performing models (determined through validation accuracy) and created an ensemble weighted according to their accuracies. Because each individual model exhibits high variance, the ensemble aims to reduce the variance in prediction along with the generalization error involved. We tried two different prediction approaches: weighted average and maximum weighted probability sum.
Both variants showed slight increases in model performance, approximately 0.7% improvement, attributed mainly to the decrease in variance of predictions. We discovered that the weighted probability sum gives a better prediction estimate, most likely because this method factors in all class probabilities prior to its final classification instead of just the maximum ones.
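To make the difference between the two prediction approaches concrete, here is a minimal sketch of one plausible reading of them. The exact formulas are assumptions: "weighted average" is taken to mean an accuracy-weighted average of each model's top label, and "maximum weighted probability sum" is taken to mean summing accuracy-weighted probabilities per class and picking the largest. All function names are hypothetical.

```python
import math


def normalize(weights):
    """Scale weights (here, validation accuracies) so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def top_label(scores):
    """Star rating (1..5) with the highest score; index 0 corresponds to 1 star."""
    return 1 + max(range(len(scores)), key=scores.__getitem__)


def weighted_average_label(model_probs, accuracies):
    """Assumed variant 1: weighted average of each model's top label, rounded."""
    weights = normalize(accuracies)
    avg = sum(w * top_label(p) for w, p in zip(weights, model_probs))
    return max(1, min(5, math.floor(avg + 0.5)))


def max_weighted_probability_sum(model_probs, accuracies):
    """Assumed variant 2: sum weighted probabilities per class, then take the max."""
    weights = normalize(accuracies)
    n_classes = len(model_probs[0])
    sums = [
        sum(w * probs[c] for w, probs in zip(weights, model_probs))
        for c in range(n_classes)
    ]
    return top_label(sums)
```

Under this reading, the first variant only sees each model's single most likely label, while the second uses the full probability vectors, which is consistent with the explanation above. For instance, two models that narrowly prefer 5 stars can be overruled by a third that is very confident in 4 stars.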
Other Deep Neural Network Designs
While transformers show the greatest promise in theory and application, we also explored the avenue of Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, to compare their effect on performance for this task. RNNs are well-known in their applications to language processing for their ability to use the information from all prior hidden states in the computation of the current layer’s output. This recurrent structure is adept in language processing because capturing the semantics of language requires connections to be understood between different parts of a sequence of words.
LSTMs are great at processing hidden state information, learning to forget unhelpful features while remembering important ones. We thought that this architecture would be helpful in encoding a proper sentiment for the review, while filtering out unnecessary features that would otherwise contribute to noise outputs. The diagram of an LSTM is shown below:
GRUs have a similar architecture to LSTMs but differ in memory cost and in replacing the “forget” and “input” gates of an LSTM with a single “update” gate. The diagram of a GRU is depicted below:
Both of these RNNs were implemented in Keras, and made use of a preprocessing pipeline to obtain cleaner reviews for input. This pipeline consisted of removing punctuation, tokenizing text, removing stop words, and lemmatizing the remaining tokens (replace vocabulary with unconjugated forms). This cleaned tokenization was encoded through a static, 300-dimensional GloVe embedding model before being passed into the RNNs.
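The four cleaning steps can be illustrated with a small, dependency-free sketch. It is not our actual pipeline: the tiny stop-word set and the lemma lookup table are hypothetical stand-ins for what a library such as NLTK or spaCy would provide, and lowercasing is an added assumption.

```python
import string

# Hypothetical, deliberately tiny stand-ins for a real stop-word list
# and a real lemmatizer.
STOP_WORDS = {"the", "a", "an", "was", "were", "is", "and", "but", "we", "to", "of"}
LEMMAS = {
    "tacos": "taco",
    "hated": "hate",
    "waiting": "wait",
    "minutes": "minute",
}


def clean_review(text):
    """Remove punctuation, tokenize, drop stop words, and lemmatize."""
    # 1. Remove punctuation.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenize on whitespace (lowercasing is an assumption, not stated above).
    tokens = no_punct.lower().split()
    # 3. Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Lemmatize via the lookup table, leaving unknown tokens unchanged.
    return [LEMMAS.get(t, t) for t in tokens]


print(clean_review("The tacos were great, but we hated waiting 40 minutes!"))
```

Each resulting token would then be mapped to its pre-trained embedding vector before being fed to the recurrent layers.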
These models performed worse than the transformer bases, as expected, yet set a nice baseline for the comparison of RNN performance to transformers for large text classification.
Traditional Machine Learning Models (Not Deep Learning)
Although the focus of this project was to construct a deep learning model for classifying Yelp reviews, we wanted to try out several more traditional machine learning models to use as a performance baseline. We understand that these algorithms have a smaller model capacity and are not as suitable for larger comprehension tasks. Regardless, we chose to test our constructions of a Random Forest, a Linear Support Vector Classifier (SVC), a Naive Bayes classifier, and Logistic Regression, all tuned through a grid-search methodology. As expected, these models performed significantly worse than the deep learning ones, but they were certainly helpful in validating our approach of using neural networks for this task.
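As a rough illustration of what grid-search tuning of one such baseline might look like, here is a sketch using scikit-learn. Everything specific in it is an assumption: the TF-IDF features, the choice of Logistic Regression, the parameter grid, and the toy reviews are made up for the example and do not reflect our actual setup or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy reviews with star labels, purely for illustration.
texts = [
    "amazing food and friendly staff",
    "best tacos in town, loved it",
    "great service and great prices",
    "wonderful place, will come back",
    "terrible food and rude staff",
    "worst meal ever, never again",
    "awful service and long wait",
    "horrible place, would not return",
]
labels = [5, 5, 5, 5, 1, 1, 1, 1]

# Assumed feature extraction (TF-IDF) followed by a linear classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the regularization strength C.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# 4-fold cross validation over every grid point.
search = GridSearchCV(pipeline, param_grid, cv=4, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
```

The same pattern applies to the other baselines by swapping the final pipeline step and its parameter grid.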
Deep Learning Model Performance
Here are the results from the different models (validated on 20% of the data unless mentioned otherwise in the chart):
The table displays a direct comparison between the validation accuracy and MAE of various deep learning models. While these models may represent different training procedures (slightly different preprocessing steps or validation splits), the results indicate the relatively close performance of the models, with a few notable observations. Firstly, one can see that all transformer models (except ALBERT) performed better than the RNN models, attributable to the long-range dependencies and attention mechanism that transformers maintain. ALBERT is a lite version of BERT, so the smaller model capacity can serve as a reasonable explanation for the performance reduction. Furthermore, the original BERT models performed better than their variants, which may be a consequence of the lack of data needed to adequately fine-tune the larger variants in addition to the nature of the task. As mentioned above, the classification models outperformed their regression counterparts. Lastly, increasing the maximum sequence length on BERT helped increase the model performance, even though it took much longer to train. While the average review totals a little over 90 tokens after data processing, a non-negligible percentage of reviews contain more than 300 tokens in total, with the maximum soaring over 800 tokens. Since transformers require a predetermined sequence length, increasing this value allowed more tokens to be processed by the model, providing the classifier with a better semantic encoding of the review’s sentiment.
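For readers less familiar with the two metrics, the sketch below shows how accuracy and MAE could be computed from star labels. The helper names and the example labels are made up for illustration and are not results from our models.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true star rating exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average distance, in stars, between predicted and true ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: both prediction sets get half of the reviews right,
# but one misses by a single star and the other by four.
y_true = [5, 5, 5, 5]
near_misses = [5, 5, 4, 4]
far_misses = [5, 5, 1, 1]

print(accuracy(y_true, near_misses), mean_absolute_error(y_true, near_misses))
print(accuracy(y_true, far_misses), mean_absolute_error(y_true, far_misses))
```

Accuracy treats every mistake equally, whereas MAE rewards a model whose errors land on neighboring star ratings, which is why the two are reported side by side.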
Traditional ML Model Performance
These models were trained and tested through k-fold cross validation (k=4). The results prove to be mostly worse than those of deep learning models, as expected.
We wanted to visualize the results in order to better understand model flaws and adjust our approaches to make our model more robust to these errors. We created confusion matrices for several of the models, detailing the count of each predicted-true label pair. These matrices were constructed on 50K reviews to illustrate the general trends in prediction patterns.
We also wanted to visualize the percentage of predictions assigned to each star rating by our models and compare it to the true label distribution. As an example, the pie charts for the BERT Length 256 model are shown below:
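A confusion matrix of this kind can be built with a few lines of plain Python. The sketch below is illustrative only: the function name, the row/column convention (rows are true labels, columns are predicted labels), and the example labels are assumptions.

```python
def confusion_matrix(y_true, y_pred, n_classes=5):
    """Count (true, predicted) label pairs for star ratings 1..n_classes.

    Rows index the true rating and columns index the predicted rating,
    so matrix[4][3] counts true 5-star reviews predicted as 4 stars.
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label - 1][predicted_label - 1] += 1
    return matrix


# Made-up labels for illustration.
y_true = [5, 5, 4, 1]
y_pred = [5, 4, 4, 1]

for row in confusion_matrix(y_true, y_pred):
    print(row)
```

Counts on the diagonal are correct predictions, and counts just off the diagonal correspond to the adjacent-star confusions discussed earlier.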