Source: Deep Learning on Medium
Introduction to Term Frequency — Inverse Document Frequency(TF-IDF) in Natural Language Processing (NLP)
Introduction to Term Frequency — Inverse Document Frequency
In most texts and documents, some words repeat again and again. Sometimes the repetition is meaningful, and sometimes it carries no information. Take the word “the”: it appears in almost every English text, yet it says little about what the text is about, so in machine learning such words are usually treated as unimportant.
Data can be extracted from many sources and may be qualitative or quantitative. Quantitative data can be processed directly by machine learning algorithms, but what happens when we want to analyze qualitative data, that is, text? You may have heard of bag of words. In a concise way,
“Bag of words gives the total count of terms that appear in a document.”
However, bag of words gives equal importance to every word, so we cannot tell which words are more important and relevant to the text being analyzed. TF-IDF was introduced to overcome this limitation.
TF-IDF originated in document analysis and information retrieval. Consider very common words such as “like”, “this”, and “if-then”: they get a low score even if they appear many times in a document, because they also appear in most other documents and are therefore considered non-relevant.
However, suppose the word “book” appears many times in one document but rarely in the other documents. It then gets a high score and is considered highly relevant to that document. This is how TF-IDF works.
TF-IDF is a weighting scheme that scores every term in a document and picks out the most essential keywords. It is used widely in industry; for instance, Google is reported to use TF*IDF scores in its search engine. Let’s discuss TF-IDF in detail.
We expect that the more often a word appears in a document, the more relevant it is. This may be correct, but what happens when documents vary in size? How do we determine which word is more relevant across all documents?
Longer documents naturally contain more occurrences of a word than shorter ones. One remedy is to normalize the count by dividing it by the size of the document. This is where term frequency comes in: TF measures how often a word occurs in a document relative to the length of that document.
Term frequency is defined as the number of times a term or keyword appears in a single document, divided by the total number of words in that document.
TF is a function of “t” and “d”, where “t” is the term and “d” is the document. One term may appear in a single document and in no other, while another term appears in every document. Because document length varies, the term frequency of a word depends on both its count and the length of the document it appears in.
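Written as a formula, the definition above reads as follows. The symbol f_{t,d} (the raw count of term t in document d) is notation assumed here for illustration; it does not come from the original article.

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
```

The denominator is simply the total number of words in d, so in a five-word sentence every word that occurs once has TF = 1/5 = 0.2.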
Inverse Document Frequency
So far, we have looked at term frequency, where TF captures how often the term “t” appears in a document “d”.
We can count the occurrences of a term in a document, but how do we know whether that term is actually worth anything for our analysis, and whether the approach leads to the desired result?
Let’s make this clearer with an example. Take two unrelated sentences, “this is a Dog” and “this is a Pen”. The terms “this”, “is”, and “a” appear in both sentences but are less important than “Dog” and “Pen”.
We know that “Dog” and “Pen” are more important, but how does a machine learn that? Term frequency alone cannot tell, because within each sentence the common words occur just as often as the important ones. This is where IDF plays a major role: it signifies how important a term is within the collection of documents.
Inverse Document Frequency gives more weight to terms that are rare across a collection of documents. The IDF of a term is the logarithm of the total number of documents divided by the number of documents that contain the term.
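In symbols, a common way to write this is shown below. The names N (the number of documents in the collection) and df(t) (the number of documents that contain term t) are notation assumed here, and the base of the logarithm is a convention that varies between implementations.

```latex
\mathrm{IDF}(t) = \log \frac{N}{\mathrm{df}(t)}
```

A term that appears in every document has df(t) = N, so its IDF is log 1 = 0. The fewer documents a term appears in, the larger its IDF.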
In summary, term frequency and inverse document frequency (TF-IDF) together capture how often each term occurs and how much weight rare terms deserve. Terms that are rare across the collection tend to be more informative about the documents they appear in, so they receive a higher weight.
In TF-IDF, each term has its own TF and IDF score, and the two are multiplied to give the TF*IDF weight or score. Computed over a corpus, that is, the collection of documents, this weight indicates how essential a term is to a document within that collection. Finally, the larger the document frequency, the lower the IDF, and vice versa.
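The combined weight can be written in one line. This is the plain, unsmoothed form; libraries often apply smoothing or normalization on top of it, so it is best read as a sketch of the idea and not as any particular implementation.

```latex
\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
```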
Code in Python for TF-IDF
Let us take an example to understand TF-IDF in a simpler way. We have two sentences:
Sentence1: Oxygen is taken by humans.
Sentence2: CO2 is taken by Plants.
Now, a table is presented here that gives the TF-IDF score of each word,
Following are the observations from the above table,
- The TF-IDF score is “0” for the common words, i.e. those of no significant use.
- The TF-IDF score of “Oxygen”, “CO2”, “humans”, and “plants” is “0.13” each, which implies that these words are more significant and show high relevance.
TF*IDF increases with the number of occurrences of a term in the document and with the rarity of that term across the collection of documents.
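The reported scores can be checked by hand. The arithmetic below assumes that each sentence is split into five words, that IDF uses the natural logarithm of the number of sentences divided by the number of sentences containing the term, and that the reported 0.13 is the result cut to two decimals. The article does not state these details, so they are assumptions.

```latex
\begin{aligned}
\mathrm{TF}(\text{Oxygen}, \text{Sentence1}) &= \tfrac{1}{5} = 0.2 \\
\mathrm{IDF}(\text{Oxygen}) &= \ln\tfrac{2}{1} \approx 0.693 \\
\mathrm{TF} \times \mathrm{IDF} &\approx 0.2 \times 0.693 \approx 0.139 \\[4pt]
\mathrm{IDF}(\text{is}) &= \ln\tfrac{2}{2} = 0 \quad\Rightarrow\quad \mathrm{TF} \times \mathrm{IDF} = 0
\end{aligned}
```

The same calculation applies to “CO2”, “humans”, and “plants”, each of which occurs once in exactly one sentence.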
Let’s move towards the code part:
In Step 1, the data is supplied as “statement1” and “statement2”, two string values used for the TF-IDF calculation.
Next, to get the words present in each string, we split the statement with the help of the “split” function.
The two statements share several words, so to combine all distinct words into one variable, we use the “union” function.
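The first three steps might look like the sketch below. The names statement1 and statement2 come from the description above; words1, words2, and vocabulary are hypothetical names, and the trailing periods of the example sentences are left out so that they do not stick to the last word.

```python
# Step 1: the two example sentences (trailing periods dropped, an assumption).
statement1 = "Oxygen is taken by humans"
statement2 = "CO2 is taken by Plants"

# Step 2: split each statement into a list of words on whitespace.
words1 = statement1.split()
words2 = statement2.split()

# Step 3: the union of both word sets is the vocabulary without duplicates.
vocabulary = set(words1).union(set(words2))

print(len(words1), len(words2), len(vocabulary))
```

Each sentence has five words, and because “is”, “taken”, and “by” are shared, the vocabulary holds seven distinct words.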
Here, every word is assigned a default value of 0.
A for loop then increases the value in “worddict1” by 1 for each word found in the statement; words that are not present keep the value 0.
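A minimal sketch of the counting step follows. The name worddict1 is taken from the text; worddict2 and the other names are assumed by analogy, and dict.fromkeys is just one possible way to assign the default value of 0.

```python
statement1 = "Oxygen is taken by humans"
statement2 = "CO2 is taken by Plants"
words1 = statement1.split()
words2 = statement2.split()
vocabulary = set(words1).union(set(words2))

# Every vocabulary word starts with a count of 0 in each statement.
worddict1 = dict.fromkeys(vocabulary, 0)
worddict2 = dict.fromkeys(vocabulary, 0)

# Add 1 for each occurrence of a word in the corresponding statement.
for word in words1:
    worddict1[word] += 1
for word in words2:
    worddict2[word] += 1

print(worddict1)
print(worddict2)
```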
In this step, the pandas library is first imported as pd and the data is represented as a data frame: 1 shows that the statement contains the word and 0 shows that it does not.
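One way to build such a data frame is sketched below, assuming pandas is installed. Passing a list of dictionaries to pd.DataFrame produces one row per statement and one column per word; the variable name df is hypothetical.

```python
import pandas as pd

words1 = "Oxygen is taken by humans".split()
words2 = "CO2 is taken by Plants".split()
vocabulary = set(words1).union(set(words2))

# Count the words of each statement, starting from 0 for every vocabulary word.
worddict1 = dict.fromkeys(vocabulary, 0)
worddict2 = dict.fromkeys(vocabulary, 0)
for word in words1:
    worddict1[word] += 1
for word in words2:
    worddict2[word] += 1

# One row per statement, one column per vocabulary word.
df = pd.DataFrame([worddict1, worddict2])
print(df)
```

Because no word repeats within a sentence here, every entry is either 0 or 1.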
After recording the presence of each word in our statements, we calculate the TF value of each word. A simple function is created for this, and the TF value of each word is recorded.
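The TF function could be written along these lines. The name compute_tf and its parameters are hypothetical; the function divides each word count by the number of words in the statement, as in the definition of term frequency.

```python
def compute_tf(word_counts, words):
    """Return each word's count divided by the number of words in the statement."""
    total_words = len(words)
    return {word: count / total_words for word, count in word_counts.items()}


words1 = "Oxygen is taken by humans".split()
words2 = "CO2 is taken by Plants".split()
vocabulary = set(words1).union(set(words2))

# Word counts for the first statement, 0 for words it does not contain.
worddict1 = dict.fromkeys(vocabulary, 0)
for word in words1:
    worddict1[word] += 1

tf1 = compute_tf(worddict1, words1)
print(tf1)
```

Every word of the first statement gets a TF of 0.2, and the words that occur only in the second statement get 0.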
With the TF values in hand, we also need the IDF value of each word in the word dictionary to complete the TF*IDF score. So in this step, we calculate the IDF values for our word dictionary.
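A sketch of the IDF step is given below. The name compute_idf is hypothetical, and the natural logarithm without smoothing is an assumption; other implementations use a different base or add smoothing terms.

```python
import math


def compute_idf(doc_counts):
    """Return log(N / df) per word, where df is the number of documents containing it."""
    n_docs = len(doc_counts)
    idf = {}
    for word in doc_counts[0]:
        # Document frequency: in how many documents the word has a non-zero count.
        doc_freq = sum(1 for counts in doc_counts if counts[word] > 0)
        idf[word] = math.log(n_docs / doc_freq)
    return idf


words1 = "Oxygen is taken by humans".split()
words2 = "CO2 is taken by Plants".split()
vocabulary = set(words1).union(set(words2))

worddict1 = dict.fromkeys(vocabulary, 0)
worddict2 = dict.fromkeys(vocabulary, 0)
for word in words1:
    worddict1[word] += 1
for word in words2:
    worddict2[word] += 1

idf = compute_idf([worddict1, worddict2])
print(idf)
```

Words found in both statements get an IDF of 0, while words found in only one get ln 2, roughly 0.693. Every vocabulary word occurs in at least one statement, so the division never hits zero in this example.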
Now we have everything needed for the TF-IDF score. We simply multiply TF by IDF to get the final output and see which words are significant and which are not.
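Putting the steps together, the whole calculation might look like the sketch below. All function names are hypothetical, and the natural logarithm and whitespace splitting are assumptions, so the exact figures may differ from those of the original code.

```python
import math


def count_words(words, vocabulary):
    """Count how often each vocabulary word occurs in a list of words."""
    counts = dict.fromkeys(vocabulary, 0)
    for word in words:
        counts[word] += 1
    return counts


def compute_tf(word_counts, words):
    """Term frequency: word count divided by the length of the statement."""
    return {word: count / len(words) for word, count in word_counts.items()}


def compute_idf(doc_counts):
    """Inverse document frequency: log(N / number of documents containing the word)."""
    n_docs = len(doc_counts)
    return {
        word: math.log(n_docs / sum(1 for counts in doc_counts if counts[word] > 0))
        for word in doc_counts[0]
    }


def compute_tfidf(tf, idf):
    """Multiply the TF and IDF of each word."""
    return {word: tf[word] * idf[word] for word in tf}


words1 = "Oxygen is taken by humans".split()
words2 = "CO2 is taken by Plants".split()
vocabulary = set(words1).union(set(words2))

worddict1 = count_words(words1, vocabulary)
worddict2 = count_words(words2, vocabulary)
idf = compute_idf([worddict1, worddict2])

tfidf1 = compute_tfidf(compute_tf(worddict1, words1), idf)
tfidf2 = compute_tfidf(compute_tf(worddict2, words2), idf)

for word in sorted(vocabulary):
    print(f"{word:8s} {tfidf1[word]:.3f} {tfidf2[word]:.3f}")
```

Under these assumptions, each distinctive word scores about 0.139 in the statement that contains it and 0 in the other, while the shared words score 0 in both.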
The result is as below:
- The score of the words “CO2”, “Oxygen”, “humans”, and “plants” is found to be 0.13 each.
- The words “by”, “is”, and “taken” have a score of 0, which shows that “CO2”, “Oxygen”, “humans”, and “plants” are more significant than “by”, “is”, and “taken”.
To conclude, TF-IDF is a powerful technique in natural language processing, and its main task is the retrieval of relevant information. Because of this information retrieval property, it is widely adopted by tech giants. For more blogs on analytics and new technologies, do read Analytics Steps.