Source: Deep Learning on Medium
1. The main idea
The main idea behind this approach is that negative and positive words are usually surrounded by similar words. This means that in a dataset of movie reviews, the word ‘boring’ would be surrounded by the same words as the word ‘tedious’, and such words would usually appear close to words such as ‘didn’t’ (like), which would also make the word ‘didn’t’ similar to them. On the other hand, it would be unlikely for the word ‘tedious’ to have surroundings more similar to those of ‘exciting’ than to those of ‘boring’. Under this assumption, words can form clusters based on the similarity of their surroundings: negative words with similar surroundings, positive words with similar surroundings, and some neutral words that end up between them (such as ‘movie’). It might not seem quite convincing at first, and I might not be the perfect explainer, but it actually turns out to be true.
The perfect tool for such a problem (finding words that have similar surroundings) is the one and only word2vec! If you haven’t heard of it before, there is a wonderful article about the word2vec algorithm by Chris McCormick, and a perfect tutorial by Pierre Megret, which I used in this article to train my own word embeddings.
2. The Data
The first, and arguably the most important, step in every Data Science/Machine Learning project is data preparation. Without good data quality, it is always possible to end up with a biased model that either performs poorly on the chosen metric (e.g. F-score on the test set) or, which is harder to diagnose at the beginning, has learned biased relations that don’t reflect its ability to e.g. distinguish positive and negative emotions, but merely allow it to perform well on the given data set.
I show only some of the basic text preparation steps that I chose to use here, as everything is included in my repository and I don’t want to make the article less readable. Frankly speaking, I didn’t spend a lot of time on it, and there is still plenty of room for your own preparation steps, especially if you try to implement this for languages like English, which have wonderful libraries for text normalization. For the Polish language it could be really important to use tools like Morfologik to reduce words to their base form, as we have a lot of different word suffixes that change the word for the model but actually mean exactly the same thing (e.g. ‘beznadziejny’ and ‘beznadziejna’ both mean hopeless, but the first one is the masculine form and the other the feminine form).
All the steps that I’ve chosen to include:
- dropping rows with missing (NaN) values,
- dropping duplicated rows,
- removing rows with rate equal to 0, as they contained some error, probably from the data gathering phase,
- replacing Polish letters with their ASCII counterparts using the unidecode package,
- replacing all non-alphanumeric signs, punctuation signs, and duplicated white spaces with a single white space,
- retaining only rows whose sentences are at least 2 words long.
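The steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the (text, rate) row layout and the function names are hypothetical, a small translation table stands in for the unidecode package, and lowercasing is an extra step added here for simplicity, not something the list above mentions.

```python
import re

# Stand-in for unidecode: maps Polish diacritics to plain ASCII letters.
POLISH_TO_ASCII = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def clean_text(text):
    """Strip Polish diacritics, lowercase (an assumption), and collapse
    every run of non-alphanumeric characters into a single space."""
    text = text.translate(POLISH_TO_ASCII).lower()
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def prepare_reviews(rows):
    """rows: iterable of (text, rate) pairs (hypothetical layout)."""
    seen, cleaned = set(), []
    for text, rate in rows:
        # Drop rows with missing values and rows with the erroneous rate 0.
        if text is None or rate is None or rate == 0:
            continue
        # Drop duplicated rows.
        if (text, rate) in seen:
            continue
        seen.add((text, rate))
        sentence = clean_text(text)
        # Keep only sentences that are at least 2 words long.
        if len(sentence.split()) >= 2:
            cleaned.append((sentence, rate))
    return cleaned
```

For example, `clean_text("Beznadziejna, słaba gra!!!")` gives `"beznadziejna slaba gra"`.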
Another idea could be to implement a spell checker, in order to avoid training too many embeddings for words that actually mean exactly the same thing. There is a wonderful article about a spell checker that uses Word2Vec and Levenshtein distance to detect the semantically most similar words.
After cleaning the words, several other steps were taken to prepare the data for the word2vec model, all of which are included in my GitHub repo. The main one was detecting the most frequent bigrams and replacing them with single tokens using gensim’s Phrases module. All these steps and most of the hyperparameters I used in the Word2Vec model were based on the wonderful Word2Vec tutorial from Kaggle that I mentioned before.
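To give a feeling for what bigram replacement does, here is a deliberately simplified sketch. It joins a pair of neighbouring words with an underscore whenever the pair occurs at least `min_count` times; gensim’s Phrases module decides with a scoring function and a threshold instead, so treat this frequency-only rule, the function name, and the parameter as assumptions made for illustration.

```python
from collections import Counter


def merge_frequent_bigrams(tokenized, min_count=2):
    """tokenized: list of sentences, each a list of words.
    Joins frequent neighbouring word pairs into single tokens."""
    counts = Counter(pair for sent in tokenized for pair in zip(sent, sent[1:]))
    merged = []
    for sent in tokenized:
        out, i = [], 0
        while i < len(sent):
            # Merge the current word with the next one if the pair is frequent.
            if i + 1 < len(sent) and counts[(sent[i], sent[i + 1])] >= min_count:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged
```

With this rule, a pair such as ‘miod’ followed by ‘malina’ that appears often enough becomes the single token ‘miod_malina’, which then gets its own embedding.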
3. Word2Vec model
In this exercise, I used gensim’s implementation of the word2vec algorithm with the CBOW architecture. I trained 300-dimensional embeddings with a context window of 4 words, negative sampling of 20 words, sub-sampling of 1e-5, and a learning rate decaying from 0.03 to 0.0007.
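w2v_model = Word2Vec(min_count=3, window=4, vector_size=300, sample=1e-5,  # vector_size is called `size` in gensim < 4.0
                     alpha=0.03, min_alpha=0.0007, negative=20, sg=0)  # sg=0 selects CBOW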
4. K-Means clustering
K-means clustering is a basic technique for data clustering, and it seemed the most suitable for the given problem, as it takes the desired number of clusters as input and outputs the coordinates of the calculated cluster centroids (the central points of the discovered clusters). It is an iterative algorithm. In the first step, n random data points are chosen as the coordinates of the cluster centroids (where n is the number of clusters sought). Then, in every step, all points are assigned to their closest centroid, based on euclidean distance, and the new coordinates of every centroid are calculated as the mean of the coordinates of all data points assigned to it. The iterations are repeated until the sum of squared distances between the points and their assigned centroids stops decreasing (which simply means that the coordinates of the centroids stop changing), or until the number of iterations reaches a given limit.
In the given problem I used sklearn’s implementation of the K-means algorithm with 50 repeated starting points, to reduce the risk of the algorithm choosing bad starting centroid coordinates that would make it converge to suboptimal clusters, and with up to 1000 iterations of reassigning points to clusters.
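The iteration described above fits in a few lines of plain Python. This is a minimal sketch of the basic algorithm with random starting points, written for illustration only; it is not sklearn’s implementation, and the function name, the `seed` parameter, and the rule for empty clusters are assumptions.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iter=1000, seed=0):
    """Basic K-means: returns the coordinates of k centroids."""
    rng = random.Random(seed)
    # Step 1: choose k random data points as the starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            [sum(coord) / len(members) for coord in zip(*members)]
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        # Stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

Running it on two well-separated groups of points, e.g. `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)`, returns one centroid near each group.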
After running it on the trained word vectors, I got 2 centroids, whose coordinates are stored in the fitted model’s cluster_centers_ attribute.
Next, to check which cluster is relatively positive and which is negative, I used gensim’s similar_by_vector method to see which word vectors are most similar, in terms of cosine similarity, to the coordinates of the first cluster:
word_vectors.similar_by_vector(model.cluster_centers_[0], topn=10, restrict_vocab=None)
The 10 closest words to cluster no. 0 in terms of cosine distance turned out to be the ones with positive sentiment (you can check it yourself if you know Polish, which I encourage you to learn if you want to have some superpowers to show off with). Some words assigned to cluster 0 are even contextually positive, e.g. the collocation ‘miod_malina’, which consists of words that literally mean ‘honey’ and ‘raspberry’, means that something is amazing and perfect, and it got a sentiment score (the inverse of the distance from the cluster it was assigned to, see the code in the repository for details) of +1.363374.
The negative cluster is harder to describe, as not all of the words that end up closest to its centroid are directly negative, but when you check whether words like ‘hopeless’, ‘poor’ or ‘broken’ are assigned to it, you get quite good results, as all of them end up where they should.
temp[temp.words.isin(['beznadziejna', 'slaba', 'zepsuty'])]
It might seem odd that I use cosine distance to determine the sentiment of each cluster and then euclidean distance to assign each word to a cluster, but there is no deeper motivation behind it: I just used the methods available in both libraries, and it worked.
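The two measures are also more closely related than they look. For vectors scaled to unit length, the squared euclidean distance equals 2 minus twice the cosine similarity, so both produce the same nearest-neighbour ranking. The sketch below checks that identity on two made-up vectors; note that raw word2vec vectors are generally not unit length, so the equivalence only holds after normalization, which is an assumption here and not something the code in this article is known to do.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def unit(v):
    """Scale a vector to unit length."""
    n = norm(v)
    return [x / n for x in v]


def cosine_similarity(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two arbitrary example vectors, normalized to unit length.
a = unit([3.0, 4.0, 0.0])
b = unit([1.0, 2.0, 2.0])

squared_distance = euclidean(a, b) ** 2
from_cosine = 2.0 - 2.0 * cosine_similarity(a, b)
```

Here `squared_distance` and `from_cosine` agree up to floating point error.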
5. Assigning clusters
The next step, partially mentioned in the previous chapter, was to assign each word a sentiment score: a negative or positive value (-1 or 1) based on the cluster to which it belongs. To weigh this score, I took into account how close each word was to its cluster (as a measure of how strongly positive or negative it is). As the score that the K-means algorithm outputs is the distance from both clusters, I weighted the sentiment score by the inverse of the distance to the assigned cluster (that is, I divided the sentiment score by that distance).
With these steps complete, a full dictionary had been created (in the form of a pandas DataFrame), in which each word had its own weighted sentiment score. To assess how accurate these weighted sentiment coefficients were, I randomly sampled the dataframe with the obtained coefficients. In that sample (which most of you would probably need Google Translate to check), words mostly ended up in the correct cluster, though I must admit that many words didn’t look so promising. Probably the best way to correct this would be to normalize the data properly or to create a 3rd, neutral cluster for words that shouldn’t have any sentiment assigned to them at all, but in order not to make this project too big, I didn’t improve them, and it still worked pretty well, as you will see later.
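The weighting can be sketched as follows, assuming the two centroids have already been labeled as positive and negative. The function name and the toy 2-dimensional vectors are made up for illustration; the actual code lives in the repository.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_sentiment(vector, positive_centroid, negative_centroid):
    """Return +1/distance for words in the positive cluster and
    -1/distance for words in the negative cluster.
    Assumes the vector does not coincide with a centroid (distance > 0)."""
    d_pos = euclidean(vector, positive_centroid)
    d_neg = euclidean(vector, negative_centroid)
    if d_pos <= d_neg:
        return 1.0 / d_pos
    return -1.0 / d_neg


# Toy centroids and word vectors.
positive_centroid = (1.0, 0.0)
negative_centroid = (-1.0, 0.0)
close_to_positive = weighted_sentiment((0.5, 0.0), positive_centroid, negative_centroid)
close_to_negative = weighted_sentiment((-0.75, 0.0), positive_centroid, negative_centroid)
```

The closer a word lies to its centroid, the larger the absolute value of its score, so words in the middle between the clusters contribute less.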
6. Tfidf weighting and sentiment prediction
The next step was to calculate the tfidf score of each word in each sentence with sklearn’s TfidfVectorizer. This step was conducted to take into account how unique every word was for every sentence, and to increase the positive/negative signal associated with words that are highly specific to a given sentence in comparison to the whole corpus.
Finally, the words in every sentence were replaced on the one hand with their tfidf scores, and on the other with their corresponding weighted sentiment scores, which gave 2 vectors for each sentence.
The dot product of these 2 sentence vectors indicated whether the overall sentiment was positive or negative (if the dot product was positive, the sentiment was positive, and in the opposite case negative).
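The prediction rule can be illustrated with a small self-contained sketch. It uses a simplified tf-idf (raw term count times a smoothed logarithmic idf, with no length normalization), which may differ in detail from what TfidfVectorizer produces, and the sentiment scores in the usage example are made-up numbers. All names are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(sentences):
    """Return one {word: tfidf} dict per sentence (simplified formula)."""
    n = len(sentences)
    # Document frequency: in how many sentences each word appears.
    df = Counter(word for s in sentences for word in set(s.split()))
    weights = []
    for s in sentences:
        tf = Counter(s.split())
        weights.append(
            {w: tf[w] * (math.log((1 + n) / (1 + df[w])) + 1.0) for w in tf}
        )
    return weights


def predict_sentiment(sentence, word_weights, sentiment_scores):
    """Replace each word by its tfidf and sentiment score and take the
    dot product; 1 means positive, 0 means negative.
    Words without a sentiment score count as 0."""
    words = sentence.split()
    tfidf_vec = [word_weights.get(w, 0.0) for w in words]
    sentiment_vec = [sentiment_scores.get(w, 0.0) for w in words]
    dot = sum(t * s for t, s in zip(tfidf_vec, sentiment_vec))
    return 1 if dot > 0 else 0


sentences = ["film nudny i slaby", "film swietny", "film nudny"]
sentiment_scores = {"nudny": -2.0, "slaby": -1.5, "swietny": 3.0, "film": 0.1}
weights = tfidf_weights(sentences)
predictions = [
    predict_sentiment(s, w, sentiment_scores) for s, w in zip(sentences, weights)
]
```

In this toy corpus the word ‘film’ appears in every sentence, so its tfidf weight is the lowest and its small sentiment score barely affects the result.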
7. Model scores
The metrics chosen for evaluating the model’s performance were precision, recall, and F-score, mainly because the classes in the dataset were highly imbalanced; in fact, the imbalance was so strong that I probably should have come up with a metric that penalizes it even more. With the positive class (1) treated as the target, the model achieved 99.9% precision, which means that almost every review it labeled as positive really was positive. Put differently, only 306 of the 9,829 negative observations slipped into the positive cluster, so the model was really good at picking out negative sentiment. One could argue that this is to be expected, as there were very few negative observations and they probably differed the most from the others, and that is partially true. But the model also achieved 80% recall, which suggests that it learned quite a lot and didn’t just split the data in half, with the negative observations happening to end up in the correct cluster.
Confusion matrix (rows: actual class, columns: predicted class)

            predicted 0   predicted 1
actual 0           9523           306
actual 1         127125        508277

Scores (positive class = 1)

accuracy    0.802503
precision   0.999398
recall      0.799930
F1          0.888608
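The scores can be recomputed directly from the confusion matrix. The sketch below reads the matrix with rows as actual classes and columns as predicted classes, and treats class 1 as the positive class, which is the reading under which the reported numbers come out; the function name is made up for illustration.

```python
def binary_scores(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Cells of the confusion matrix reported above.
scores = binary_scores(tn=9523, fp=306, fn=127125, tp=508277)
```

Rounded to six decimal places, `scores` matches the reported accuracy, precision, recall, and F1.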
To sum up, the unsupervised approach achieved quite good results (in my opinion): without any pretrained models, and without any prior information about what is positive or negative in a given text, it achieved quite high metrics, significantly higher than random prediction would. One could argue that this might hold only for the analyzed dataset, as it might contain easily distinguishable words, but I also used this approach on different datasets, and with the same set of steps it still gave quite good results (around 0.75 F1-score). Frankly speaking, I’m quite interested in hearing from you how it worked for your datasets!
8. Further discussion
This article was written mainly to present an idea for unsupervised language processing, not to create the best possible solution based on it, so there is plenty of room to improve it. Improvements that come to my mind, other than the ones I already mentioned, include:
- Running K-Means clustering based on cosine, not euclidean, distances
- Including a third, neutral cluster, or assigning a sentiment score equal to zero to words that end up somewhere between the positive and negative clusters
- Tuning the hyperparameters of the Word2Vec algorithm, based on e.g. the F1-score achieved on the dataset (though it would require splitting the dataset into train and test datasets, as the training would become supervised)
- Not considering bi-grams of words
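As an illustration of the neutral-zone idea from the list above, here is one possible sketch: a word gets a score of zero when its distances to both centroids are nearly equal. The relative `margin` of 10% is an arbitrary assumption that would need tuning, and the function name is hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sentiment_with_neutral_zone(vector, pos_centroid, neg_centroid, margin=0.1):
    """Like the inverse-distance score, but returns 0.0 for words whose
    distances to the two centroids differ by less than `margin`
    (relative to the larger distance).
    Assumes the vector does not coincide with a centroid."""
    d_pos = euclidean(vector, pos_centroid)
    d_neg = euclidean(vector, neg_centroid)
    if abs(d_pos - d_neg) <= margin * max(d_pos, d_neg):
        return 0.0
    return 1.0 / d_pos if d_pos < d_neg else -1.0 / d_neg


# Toy centroids: a word exactly between them is treated as neutral.
neutral = sentiment_with_neutral_zone((0.0, 1.0), (1.0, 0.0), (-1.0, 0.0))
positive = sentiment_with_neutral_zone((0.5, 0.0), (1.0, 0.0), (-1.0, 0.0))
```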
Here we arrive at the end of this short article. I really hope you enjoyed it, and I look forward to hearing from you about any improvements that you come up with. I also hope that it was informative for you, and thank you for reading it!
All the best, and may the high F1-score be with you!