Unsupervised outlier detection in text corpus using Deep Learning



Auto-encoder based approach to finding the most unique movie plot in Wikipedia movie database.


What do we mean by uniqueness? Something is ‘unique’ if it is different from everything else; in Machine Learning terms, it is nothing but an ‘outlier’. The question, then, is how we can detect outliers without any prior knowledge, i.e., in an unsupervised manner. In this article, we will discuss an unsupervised, deep-learning-based technique for outlier detection in text data.


Data exploration & problem formulation

We will use the Kaggle “Wikipedia Movie Plots” dataset. It contains the title, director, genre and other information, along with the plot. Our work will be limited to the ‘Plot’ column only.

The goal is to analyse the ‘Plot’ of each movie and find the most unique movies, i.e., the ‘outliers’ in Machine Learning terms.

We will use Python with libraries such as pandas, scikit-learn, Gensim and matplotlib.

Let’s first explore the dataset and see what it looks like.

import pandas as pd
wikipedia_movie_df = pd.read_csv('../data/wiki_movie_plots_deduped.csv', 
index_col=False, header=0)
wikipedia_movie_df.head()
Figure 1

As we will be working with ‘Plot’ only, we will pick the ‘Plot’ column and keep ‘Title’ as an identifier.

title_plot_df = wikipedia_movie_df[['Title','Plot']].copy()
title_plot_df.head()
Figure 2

There are a total of 34886 records.

Before going any further, let’s see what a full movie plot looks like

title_plot_df.iloc[0,1]
Figure 3

We need to clean the text before doing any further data exploration and feeding it into any algorithm.

Text Cleaning Steps

Cleaning the text involves the following 7 steps:

  1. Removal of tags (like <html>, <p>)
  2. Removal of punctuation (like ‘,’,’.’,’!’ etc)
  3. Removal of multiple whitespaces in between texts
  4. Removal of numerics
  5. Removal of stop words (like ‘at’, ‘to’, ‘the’, ‘and’ etc)
  6. Removal of very short words
  7. Stemming, i.e., converting words to their root form (e.g., ‘playing’ and ‘played’ both reduce to the single root token ‘play’)

Let’s create a function which does all of the above.
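A minimal sketch of such a function, assuming it simply wraps Gensim’s preprocess_string, whose default filter chain (lower-casing, strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, stem_text) covers exactly the seven steps above; the original implementation may differ in details.

from gensim.parsing.preprocessing import preprocess_string

def clean_text(text):
    # apply Gensim's default preprocessing filters to get cleaned tokens
    tokens = preprocess_string(text)
    # join the tokens back into a single space-separated string
    return ' '.join(tokens)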

Now, apply the function to the movie plot shown above and see the result

clean_text(title_plot_df.iloc[0,1])
Figure 4

Though the ‘cleaned’ text is no longer grammatically correct, it still holds the context, which is essential for further processing.

We need to do this for all of the movie plots

title_plot_df['Plot'] = title_plot_df['Plot'].map(lambda x: clean_text(x))

For data exploration, let’s find the 50 most frequent words/tokens present in the plots.
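A possible sketch of this step, assuming the token counts are aggregated over all cleaned plots with collections.Counter; the variables common_words, common_words_counts and common_word_counter defined here are reused later in the article.

from collections import Counter

# aggregate token counts over the whole corpus of cleaned plots
aggregate_counter = Counter()
for plot in title_plot_df['Plot']:
    aggregate_counter.update(plot.split())

# the 50 most frequent tokens and their counts, for plotting
common_words = [word for word, count in aggregate_counter.most_common(50)]
common_words_counts = [count for word, count in aggregate_counter.most_common(50)]

# Counter restricted to the 1000 most frequent tokens, used later in the validation step
common_word_counter = Counter(dict(aggregate_counter.most_common(1000)))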

We need to write a generic function for graphical representation
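Here is a minimal sketch of such a helper, assuming matplotlib; the keyword arguments match the calls used below.

import matplotlib.pyplot as plt

def barplot(words, words_counts, title):
    # horizontal bar chart of token frequencies
    fig, ax = plt.subplots(figsize=(12, 10))
    ax.barh(words, words_counts)
    ax.invert_yaxis()  # highest bar at the top
    ax.set_xlabel('Count')
    ax.set_title(title)
    plt.show()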

We can now use the above function for plotting the data

barplot(words=common_words, words_counts=common_words_counts, title='Most Frequent Words used in movie plots')
Figure 5

From the above bar chart, we can easily see that ‘kill’ is the most frequent word across all movie plots.

Next, let’s do the reverse, i.e., find the 50 least frequent words/tokens across all movie plots and plot the graph.
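Again a sketch under the same assumptions, reusing aggregate_counter; the list all_word_counts (all tokens sorted by ascending frequency) is also reused later in the validation step.

# all (word, count) pairs, sorted by ascending frequency
all_word_counts = sorted(aggregate_counter.items(), key=lambda item: item[1])

# the 50 least frequent tokens and their counts
uncommon_words = [word for word, count in all_word_counts[:50]]
uncommon_word_counts = [count for word, count in all_word_counts[:50]]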

barplot(words=uncommon_words, words_counts=uncommon_word_counts, title='Least Frequent Words used in movie plots')
Figure 6

We can see that many words/tokens occur only once (frequency 1).

Intuitively, we can expect that our target result, i.e., the ‘most unique movies’, will contain few of the most frequent words and may contain some of the least frequent ones. This is a very high-level view; in practice, we need to look at the contextual meaning of each movie plot and find the rarest ones.


Solution Approach & processing steps

We are already done with the cleaning step, and each movie plot is now a string of space-separated tokens. For any text mining problem, vector space modelling is an essential part: it converts the text into numerical features which can then be fed into any Machine Learning algorithm. We will do the same for our use case. After that, we will train an ‘auto-encoder’ neural network on these vectors and use it as our model. We will explain each of these steps one by one and see the results.

Step 1 — Conversion of texts to Vector Space Model

There are many ‘Vector Space Models’ for text analysis like ‘Tf-Idf’, ‘CBOW’, ‘Word2Vec’, ‘Doc2Vec’ etc.

‘Tf-Idf’ is a simple term-frequency/inverse-document-frequency model and is very useful for small, domain-specific texts. It is very good at handling texts which contain heavy domain terms and are not very well written or grammatically correct. But it fails to capture contextual meaning, especially for well-written and grammatically correct texts, where the same mood can be expressed using different words.

‘Doc2Vec’ & ‘Word2Vec’ are the clear winners here. ‘Doc2Vec’ samples context words from the text and internally trains a shallow neural network which produces a numerical vector representation of each document. In our case, as movie plots are well-written texts, we will use ‘Doc2Vec’ as our ‘Vector Space Model’.

Let’s write a transformer which converts text corpus into document vectors.
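Below is a minimal sketch of such a transformer, assuming a scikit-learn style fit/transform wrapper around gensim’s Doc2Vec; hyper-parameters such as epochs are illustrative and may differ from the original.

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.base import BaseEstimator

class Doc2VecTransformer(BaseEstimator):

    def __init__(self, vector_size=100, epochs=20):
        self.vector_size = vector_size
        self.epochs = epochs
        self._model = None

    def fit(self, df_x, df_y=None):
        # each plot becomes a TaggedDocument of its tokens
        tagged_docs = [TaggedDocument(words=plot.split(), tags=[i])
                       for i, plot in enumerate(df_x['Plot'])]
        self._model = Doc2Vec(tagged_docs,
                              vector_size=self.vector_size,
                              epochs=self.epochs)
        return self

    def transform(self, df_x):
        # infer a fixed-length vector for every plot
        return np.asarray([self._model.infer_vector(plot.split())
                           for plot in df_x['Plot']])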

Now, apply it to all of the movie plots and see the result

doc2vec_tr = Doc2VecTransformer(vector_size=300)
doc2vec_tr.fit(title_plot_df)
doc2vec_vectors = doc2vec_tr.transform(title_plot_df)
Figure 7
print(doc2vec_vectors)
Figure 8

We have kept our ‘vector_size’ at 300; generally, it is preferable to keep it between 100 and 300. Each movie plot is now decomposed into a vector of size 300, so the overall dimension of ‘doc2vec_vectors’ becomes 34886×300. In other words, each movie plot now has 300 numerical features.

Step 2 — Training an ‘Auto-Encoder’ neural network

As our process is completely unsupervised and we don’t have labelled data (outlier/non-outlier), we will train a 5-layer ‘auto-encoder’ neural network as our model. An auto-encoder is a special type of neural network which learns to copy its input data to its output; this process is known as ‘reconstruction’. The hidden layers of the network do the feature extraction (encoding) and decoding work. Some reconstruction loss is inevitably generated at the end of the process, and a data point which is dissimilar from the others incurs more loss.

Its layer structure will look like

300 -> 600 -> 150 -> 600 -> 300

i.e., Layer 1 (input layer): 300 features, Layer 2: 600 features, Layer 3: 150 features, Layer 4: 600 features, Layer 5 (output layer): 300 features. Layer 1, as usual, takes all the ‘Doc2Vec’ generated features from Step 1. Layers 2, 3 & 4 are hidden layers doing the actual data massaging (expansion & shrinking) and information extraction; Layer 2 is known as the ‘encoding layer’ and Layer 4 as the ‘decoding layer’. The output ultimately comes out of Layer 5. Auto-encoders are trained with the same data as both input and output, so the Layer 5 output is nothing but a reconstructed version of the input, with some loss. Normal data points pass through the layers smoothly with minimal loss, but the loss for ‘outliers’ will be higher, as they don’t follow the hidden data pattern.

This is the graphical representation of our 5-layer ‘Auto-encoder’

Figure 9

Now, let’s see how this works in practice on the ‘Doc2Vec’ data.
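Since the later cells refer to score() and loss_curve_, the auto-encoder was presumably built with scikit-learn’s MLPRegressor; here is a sketch under that assumption, trained with the Doc2Vec vectors as both input and target.

from sklearn.neural_network import MLPRegressor

# hidden layers 600 -> 150 -> 600; input and output layers have 300 units each
auto_encoder = MLPRegressor(hidden_layer_sizes=(600, 150, 600),
                            activation='relu',
                            max_iter=200)

# an auto-encoder is trained to reconstruct its own input
auto_encoder.fit(doc2vec_vectors, doc2vec_vectors)
predicted_vectors = auto_encoder.predict(doc2vec_vectors)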

print(predicted_vectors)
Figure 10

‘predicted_vectors’ are output vectors.

In effect, this ‘Auto-Encoder’ is nothing but a deep non-linear regressor, so we can also check its accuracy.

auto_encoder.score(doc2vec_vectors, doc2vec_vectors)
Figure 11

The score is excellent at 0.95, i.e., 95% of the variance is retained by the ‘Auto-Encoder’ in regression-accuracy (R²) terms. This also means that 95% of the information in the input variables is successfully reconstructed.

We can also look at the loss curve of the entire network

pd.DataFrame(auto_encoder.loss_curve_).plot()
Figure 12

Here the X-axis denotes the number of iterations. We can see that the loss reaches its minimum at around the 12th iteration.

Step 3 — Similarity measure between input & output vectors, and finding the outliers

Once we have the output vectors, we can measure the loss. The ‘loss’ is ideally given by the ‘outlier factor’ expression

[OF]ᶦ = ∑ₖ (xᵏ − oᵏ)²

where

[OF]ᶦ = outlier factor for iᵗʰ data instance

d = total number of features (in our case, d=300)

xᵏ = kᵗʰ feature value of the input (1 ≤ k ≤ d)

oᵏ = kᵗʰ feature value of output (reconstructed input actually) (refer Figure 9)

Outliers tend not to follow the learned pattern and give a higher reconstruction error. So,

a higher value of [OF]ᶦ denotes a possible ‘outlier’.

As we are dealing with numeric vectors derived from text, a squared or Euclidean distance like the one above may not give the best result. Hence, we will use ‘cosine similarity’ as the measure, and [OF]ᶦ becomes simply the cosine similarity between the input and output vectors of the iᵗʰ data point.

But with one difference:

a lower value of [OF]ᶦ now indicates a possible outlier, instead of a higher value.

So the steps of doing this will be:

  1. Calculate the ‘cosine similarity’ between the input and output vectors of the iᵗʰ data point.
  2. Sort all cosine similarity values in increasing order.
  3. Choose the top ‘k’ values (‘k’ is a user-given input); the corresponding data points are the possible outliers.

Let’s put this logic into two functions, as sketched below.
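A sketch of the two helpers, assuming scikit-learn’s cosine_similarity: get_computed_similarities returns (index, similarity) pairs sorted in increasing order, and display_top_n prints the titles of the first n of them.

from sklearn.metrics.pairwise import cosine_similarity

def get_computed_similarities(vectors, predicted_vectors):
    # cosine similarity between each input vector and its reconstruction
    cosine_similarities = []
    for i in range(len(vectors)):
        similarity = cosine_similarity([vectors[i]], [predicted_vectors[i]])[0][0]
        cosine_similarities.append((i, similarity))
    # lower similarity = higher reconstruction loss = more likely an outlier
    return sorted(cosine_similarities, key=lambda item: item[1])

def display_top_n(sorted_cosine_similarities, n=5):
    for index, similarity in sorted_cosine_similarities[:n]:
        print('Movie:', title_plot_df.iloc[index, 0], '| Cosine similarity:', similarity)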

We can call the above functions to get the top ‘k’ outliers

print('Top 5 unique movies')
sorted_cosine_similarities = get_computed_similarities(vectors=doc2vec_vectors, predicted_vectors=predicted_vectors)
display_top_n(sorted_cosine_similarities=sorted_cosine_similarities)
Figure 14

So, we got the top 5 unique movies.


Validating the results

For unsupervised outlier detection problems in Machine Learning, validating the output is really challenging because we don’t have labelled data as a benchmark. Also, as we are using ‘Doc2Vec’, checking contextual validity is difficult. So we will take a simpler approach and validate our results against the observations from the data exploration stage.

We will check whether there is any intersection between the processed words of the most unique movie and the most frequent words across all movies.

most_unique_movie_index, cosine_sim_val = sorted_cosine_similarities[0]
most_unique_movie_plot = title_plot_df.iloc[most_unique_movie_index, 1]
most_unique_movie_words_counter = Counter(preprocess_string(most_unique_movie_plot))
intersected_common_word_counter = common_word_counter & most_unique_movie_words_counter
intersected_common_words = [word[0] for word in intersected_common_word_counter.items()]
intersected_common_word_counts = [word[1] for word in intersected_common_word_counter.items()]
intersected_common_word_counter
Figure 15

So, there is no intersection.

What this means: ‘The most unique movie plot does not contain the very frequent words which are common across the other movie plots.’

This is exactly what we would expect from a ‘unique’ movie that is different from the others, right? So, our solution is working well!

Now, let’s look at the opposite scenario: whether the most unique movie contains the least frequent words of the corpus or not.

from itertools import islice

# take up to 300000 of the least frequent (word, count) pairs
uncommon_word_count_items = list(islice(all_word_counts, 300000))
all_movies_uncommon_word_counter = Counter(dict(uncommon_word_count_items))
common_word_counter = Counter(dict(aggregate_counter.most_common(1000)))
intersected_uncommon_word_counter = all_movies_uncommon_word_counter & most_unique_movie_words_counter
intersected_uncommon_words = [word[0] for word in intersected_uncommon_word_counter.items()]
intersected_uncommon_word_counts = [word[1] for word in intersected_uncommon_word_counter.items()]
barplot(words=intersected_uncommon_words, words_counts=intersected_uncommon_word_counts,
title='Few Common words between all words of most unique movie & least frequent words in all movies')
Figure 16

From the plot, it can be said that ‘the most unique movie plot does contain many of the least frequent words, which are very uncommon in other movie plots’,

thus making it very unique.

We have validated our result from both sides. The same test can be carried out for the other movies in the top ‘k’ (there may be a little variation as the cosine similarity values increase; also, the overall result may vary between runs, because gensim’s ‘Doc2Vec’ is stochastic and may produce slightly different vectors each time, which can influence the output).

So, we learned how to use ‘Deep Learning’ to find outliers from text data.

The Jupyter notebook project for this can be found on GitHub. Looking forward to feedback and questions.