Original article was published by Lawrence Alaso Krukrubo on Artificial Intelligence on Medium
Hard-coding the most popular text-embedding algorithm…
Term Frequency-Inverse Document Frequency is a numerical statistic that is intended to reflect how important a word is to a document, in a collection or corpus.
Simply put, TF-IDF shows the relative importance of a word or words to a document, given a collection of documents.
Note that before we can do text-classification, the text must be translated into some form of numerical representation, a process known as text-embedding. The resulting numerical representation, usually in the form of vectors, can then be used as input to a wide range of classification models.
TF-IDF is the most popular approach to embed texts into numerical vectors for modeling, information retrieval and text-mining.
Over 83% of text-based recommender systems in digital libraries use TF-IDF… (link)
So today, we shall look at some basic text-classification processes including text-normalization and feature-extraction which culminates in TF-IDF vectorization.
Then, we shall write very simple python functions to perform TF-IDF.
Here’s the excerpt…
1. Text Preprocessing:
Text preprocessing is the next step after loading the document.
Text normalization is the process of transforming a piece of text into a canonical (official) form.
Preprocessing includes a variety of activities, often informally and collectively referred to as text-normalization. These include:-
- Parts of speech tagging
- Phrase chunking
- Remove Punctuations
- Spell check
- Remove Stopwords
- Expand Contractions
Two further normalization steps, stemming and lemmatization, reduce words to a base form. Stemming usually refers to a crude heuristic process that chops off the ends of words. Lemmatization usually refers to the use of vocabulary and morphological analysis of words to remove inflectional endings only and return the base or dictionary form of a word, which is known as the lemma.
For this exercise, we shall only cover the following preprocessing steps
- Remove Punctuations
- Change to Lowercase
- Remove Stopwords
First, let’s import some much-needed libraries:
Texts include a lot of punctuation, which we need to remove if we want to work only with the actual words. We’ll also remove numbers from the text.
from string import punctuation
# First remove digits
doc1Txt = ''.join(c for c in doc1Txt if not c.isdigit())
# Next we remove Punctuations
doc1Txt = ''.join(c for c in doc1Txt if c not in punctuation)
Change to Lowercase:
doc1Txt = doc1Txt.lower()
A large number of the words in the text are common words like “the” or “and”. These “stopwords” add little in the way of semantic meaning to the text and won’t help us determine the subject matter, so let’s remove them.
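A minimal sketch of stopword removal could look like the following. To stay self-contained it uses a tiny hand-written stopword set, which is an assumption for illustration only; in practice a fuller list, such as NLTK's English stopwords, would likely be used. The function name and the sample sentence are made up as well.

```python
# Tiny illustrative stopword set (an assumption; real lists are much longer)
STOPWORDS = {"the", "and", "to", "of", "a", "in", "we", "is"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop every whitespace-separated word that appears in `stopwords`."""
    kept = [word for word in text.split() if word not in stopwords]
    return " ".join(kept)

sample = "we choose to go to the moon in this decade"
print(remove_stopwords(sample))  # choose go moon this decade
```

The text is assumed to be lowercased already, since the membership check is case-sensitive.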
Tokenization involves splitting the text into individual words and counting the number of times each word occurs. This step is also a crucial pre-requisite for the feature-extraction phase.
import nltk
# Download 'punkt' from nltk
nltk.download('punkt')
# Tokenize the text into individual words
moon_words = nltk.tokenize.word_tokenize(doc1Txt)
Let’s get the frequency distribution count using the FreqDist function
from nltk.probability import FreqDist
fdist = FreqDist(moon_words)
Until now, we’ve simply counted the number of occurrences of each word. This doesn’t take into account the fact that sometimes multiple words may be based on the same common base or stem, and may be semantically equivalent. For example, “fishes”, “fished”, “fishing”, and “fisher” are all derived from the stem “fish”.
So let’s stem the words.
# Get the word stem from PorterStemmer library
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
doc1Txt = [ps.stem(word) for word in moon_words]
And that’s it for stemming and for the basic text preprocessing phase.
Let’s now convert the frequency distribution (fdist) to a DataFrame and plot it as a Pareto chart, with higher-frequency items first, using matplotlib.
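As a rough sketch of this step, the snippet below ranks words by count and draws them as a bar chart, highest first. It is a simplified stand-in rather than the original notebook code: it uses `collections.Counter` instead of a DataFrame, omits the cumulative-percentage line that a full Pareto chart often carries, and the function names and sample tokens are hypothetical.

```python
from collections import Counter

def top_frequencies(tokens, n=10):
    """Return the n most frequent tokens as (word, count) pairs, highest first."""
    return Counter(tokens).most_common(n)

def plot_frequencies(pairs):
    """Draw a bar chart of (word, count) pairs in the order given."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    words = [word for word, _ in pairs]
    counts = [count for _, count in pairs]
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(words, counts)
    ax.set_ylabel("Count")
    ax.tick_params(axis="x", rotation=90)
    plt.show()

tokens = "go moon go space moon go".split()
pairs = top_frequencies(tokens, n=3)
print(pairs)  # [('go', 3), ('moon', 2), ('space', 1)]
# plot_frequencies(pairs)  # uncomment to draw the chart
```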
From the Pareto chart, words like ‘new’, ‘go’, ‘space’ have the highest frequency in the Moon.txt document. No surprises, the entire speech is about moon exploration.
2. Feature Extraction:
Feature-extraction for text data has two main steps
- Define Vocabulary
- Vectorize Documents
Step 1 has to do with identifying words based on their frequency and recording them as a vocabulary, using a distribution, just as we did in tokenization and frequency distribution above.
For step 2, we shall vectorize the text using the TF-IDF algorithm.
As stated above, TF-IDF shows the relative importance of a word or words to a document, given a collection of documents. Therefore, we need to download a few more documents.
This also implies normalizing each downloaded document just as we did with the Moon.txt above.
In programming, the moment you need to repeat code, then it’s time to write a function…
So let’s write a function that performs the preprocessing steps above for a list of documents, and for uniformity’s sake, let’s apply it to all documents.
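One possible shape for such a function is sketched below. It chains the digit, punctuation, lowercase and stopword steps from earlier and leaves stemming out for brevity. The names `normalize` and `normalize_docs`, the tiny stopword set and the sample texts are all assumptions for illustration.

```python
from string import punctuation

# Tiny illustrative stopword set (an assumption; substitute a fuller list)
STOPWORDS = frozenset({"the", "and", "to", "of", "a", "in", "we", "is"})

def normalize(text, stopwords=STOPWORDS):
    """Remove digits and punctuation, lowercase, then drop stopwords."""
    text = "".join(c for c in text if not c.isdigit())
    text = "".join(c for c in text if c not in punctuation)
    words = text.lower().split()
    return " ".join(w for w in words if w not in stopwords)

def normalize_docs(docs, stopwords=STOPWORDS):
    """Apply `normalize` to every document in a list."""
    return [normalize(doc, stopwords) for doc in docs]

raw_docs = ["We choose to go to the Moon in 1962!", "Ask not, and listen."]
print(normalize_docs(raw_docs))
# ['choose go moon', 'ask not listen']
```

Passing the stopword collection as a parameter makes it easy to swap in a larger list without touching the function body.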
Okay, let’s write another simple function to read in the 4 documents we’ll use for TF-IDF
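A reader function along these lines might look like the sketch below. The name `read_docs` is hypothetical, and the demo writes two throwaway files so that it runs anywhere; with real data you would pass the paths of your own text files instead.

```python
import tempfile
from pathlib import Path

def read_docs(paths, encoding="utf-8"):
    """Read each file in `paths` and return a list of its text contents."""
    return [Path(p).read_text(encoding=encoding) for p in paths]

# Demo with two throwaway files (placeholders, not the article's documents)
with tempfile.TemporaryDirectory() as tmp:
    paths = [Path(tmp) / "a.txt", Path(tmp) / "b.txt"]
    paths[0].write_text("first document", encoding="utf-8")
    paths[1].write_text("second document", encoding="utf-8")
    docs = read_docs(paths)

print(docs)  # ['first document', 'second document']
```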
Now let’s call these two functions on our documents list
Let’s see one of the imported and normalized docs. It’s the ‘Inaugural.txt’ file containing excerpts from President JFK’s inaugural speech…
TF-IDF is still a part of text feature-extraction, concerned with vectorizing text documents. But it deserves a space to itself since it’s the topic of this article.
In the previous example, we used basic term frequency to determine each word’s “importance” based on how often it appears in one document.
When dealing with a large corpus of multiple documents, Term Frequency-Inverse Document Frequency (or TF-IDF), is used to score how often a word or term appears in one document compared to its more general frequency across the entire collection of documents.
Using this technique, a high degree of relevance is assumed for words that appear frequently in a particular document, but relatively infrequently across a wide range of other documents.
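In symbols, the scoring described here can be written roughly as follows, using the same definitions this article relies on when coding the functions (a plain ratio for term frequency and an unsmoothed logarithm for inverse document frequency). The symbol names are chosen for illustration; other TF-IDF variants add smoothing or different normalizations.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t, D) = \log\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the total number of words in $d$, $N$ is the number of documents in the collection $D$, and $n_t$ is the number of documents that contain $t$.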
Coding the TF-IDF Algorithm
We shall write four simple functions to compute TF-IDF for a collection of documents.
Before we write our functions, let’s create text blobs for our documents. Text blobs make it easier to work with a collection of text documents.
# Create a collection of documents as textblobs
from textblob import TextBlob as tb
doc1 = tb(doc1Txt)
doc2 = tb(doc2Txt)
doc3 = tb(doc3Txt)
doc4 = tb(doc4Txt)
docs = [doc1, doc2, doc3, doc4]
1. The Term Frequency Function:
This simple function calculates the term frequency of a word in a document. It returns the count of that specific word divided by the total number of words in the document.
If the document contains no words, it returns zero.
2. The Contains Function:
This concise function counts how many documents contain a given word. For example, if the word ‘chosen’ appears in all 4 of our documents, the function returns 4.
3. The Inverse-Document-Frequency Function:
This function returns the inverse-document-frequency score for each word. It calls the _contains() function we defined earlier, which checks the number of documents having the specific word.
The inverse-document-frequency function returns the inverse score by taking the log() of the number of documents divided by the number of documents that contain the specific word.
math.log(num_documents / num_documents_containing_word).
For example, if we have 4 documents and we’re checking for the inverse-document-frequency of the word ‘chosen’ across all 4 documents, see what the method returns:
*If all 4 documents have the word:-
math.log(4/4) = 0.0
*If only 3 documents have it:-
math.log(4/3) = 0.29
*If only 2 documents have it:-
math.log(4/2) = 0.69
*If only 1 document has it:-
math.log(4/1) = 1.39
Clearly, the inverse-document-frequency function penalizes words that appear in more documents by assigning them lower scores, and gives higher scores to words that appear in fewer documents.
4. The TFIDF Function:
Finally, we have the TFIDF function, whose main job is to call the other functions properly and print out the top 5 words unique to each document.
The TFIDF function computes for each word in each document, the product of the term-frequency function and the inverse-document frequency functions.
_tf(word, doc) * _idf(word, docs)
This helps to classify text documents by assigning higher weights to words that occur frequently in one document and rarely or never in the others. It helps to distinguish one text document from another. In the grand scheme of things, TF-IDF vectors can also be used to measure how similar text documents are to one another.
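Putting the pieces together, a sketch of the scoring and ranking step might look like this. The helper names `tfidf` and `top_words`, the token-list representation and the toy documents are assumptions; ties are broken alphabetically so the output is deterministic.

```python
import math

def _tf(word, doc):
    return doc.count(word) / len(doc) if doc else 0

def _idf(word, docs):
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def tfidf(word, doc, docs):
    """TF-IDF score: term frequency times inverse document frequency."""
    return _tf(word, doc) * _idf(word, docs)

def top_words(doc, docs, n=5):
    """Return the n highest-scoring (word, score) pairs for one document."""
    scores = {word: tfidf(word, doc, docs) for word in set(doc)}
    # Sort by descending score, then alphabetically to break ties
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

docs = [
    "the moon the moon rocket".split(),
    "the sea the ship".split(),
    "the city".split(),
]
print([word for word, _ in top_words(docs[0], docs, n=2)])  # ['moon', 'rocket']
```

In this toy collection "the" appears in every document, so its score is zero everywhere, while "moon" is both frequent in the first document and absent from the others.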
In programming, the moment you start writing a bunch of similar functions, then it’s time to write a class…
Therefore, let’s put all these functions in a TFIDF class
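One way such a class could be organized is sketched below. It is an illustrative layout, not the code from the original notebook: the method names `score` and `top_words`, the token-list input and the toy documents are all assumptions.

```python
import math

class TFIDF:
    """TF-IDF scores for a collection of tokenized documents."""

    def __init__(self, docs):
        self.docs = [list(doc) for doc in docs]

    def _tf(self, word, doc):
        return doc.count(word) / len(doc) if doc else 0

    def _contains(self, word):
        return sum(1 for doc in self.docs if word in doc)

    def _idf(self, word):
        n_containing = self._contains(word)
        return math.log(len(self.docs) / n_containing) if n_containing else 0.0

    def score(self, word, doc):
        return self._tf(word, doc) * self._idf(word)

    def top_words(self, n=5):
        """Return, per document, the n highest-scoring (word, score) pairs."""
        result = []
        for doc in self.docs:
            scores = {w: self.score(w, doc) for w in set(doc)}
            ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
            result.append(ranked[:n])
        return result

model = TFIDF([
    "the moon the moon rocket".split(),
    "the sea the ship".split(),
    "the city".split(),
])
for i, words in enumerate(model.top_words(n=2), start=1):
    print(f"Document {i}:", [word for word, _ in words])
```

Keeping the document collection on the instance means the helper methods no longer need the full list passed in on every call.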
So here we have our own TF-IDF class of algorithms for text-classification!
See image below showing the output from calling our TF-IDF class functions. We can see the top 5 words unique to each document from the entire lot.
Feel free to explore how to call the TFIDF class with instance objects and if you wish to see my notebook on Github, here’s the Repo link
Lawrence is a Data Specialist at Tech Layer, passionate about fair and explainable AI and Data Science. I believe that sharing knowledge and experiences is the best way to learn. I hold both the Data Science Professional and Advanced Data Science Professional certifications from IBM and the IBM Data Science Explainability badge. I have conducted several projects using ML and DL libraries, and I love to code up my own functions as much as possible. Finally, I never stop learning and experimenting and yes, I have written several highly recommended articles.