A simple guide to One-hot Encoding, tf and tf-idf Representation

Source: Deep Learning on Medium


You see one-hot encoding all over machine learning tutorials, right? But what is it?

One-hot Representation

The one-hot representation, as the name suggests, starts with a zero vector and sets to 1 the entry corresponding to the word. In machine learning, a one-hot code is a group of bits in which the only legal combinations of values are those with a single high (1) bit and all the others low (0).

Consider the following two sentences:

“Time flies like an arrow.
Fruit flies like a banana.”

What we get from these sentences is the vocabulary:

{time, fruit, flies, like, an, a, arrow, banana}

Each word is taken without repetition.

The one-hot representation for the phrase “like a banana” will be a 3 × 8 matrix.

Why 3 × 8? The phrase has three words (one row per word), and the vocabulary has eight words (one column per word).

The binary encoding for “like a banana” would be [0, 0, 0, 1, 0, 1, 0, 1]: a 1 in the “like”, “a”, and “banana” columns, and 0 everywhere else.
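
Here is a minimal Python sketch of both representations; the vocabulary order follows the set above, and the helper names (such as one_hot) are illustrative rather than from the original post.

# Vocabulary in the order listed above.
vocab = ["time", "fruit", "flies", "like", "an", "a", "arrow", "banana"]

def one_hot(word, vocab):
    # A zero vector with a single 1 at the word's position.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

phrase = "like a banana".split()

# One-hot representation: a 3 x 8 matrix, one row per word of the phrase.
matrix = [one_hot(word, vocab) for word in phrase]

# Binary (collapsed) encoding: a single 8-dimensional vector with a 1 wherever
# any word of the phrase appears.
binary = [max(column) for column in zip(*matrix)]
print(binary)  # [0, 0, 0, 1, 0, 1, 0, 1]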

tf (Term Frequency)

The term frequency (TF) representation of a phrase, sentence, or document is simply the sum of the one-hot representations of its constituent words.

“Time flies like an arrow.
Fruit flies like a banana.”
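
As a quick check, here is a small Python sketch that sums the word counts over this two-sentence corpus, using the same assumed vocabulary order as before:

from collections import Counter

vocab = ["time", "fruit", "flies", "like", "an", "a", "arrow", "banana"]
corpus = "Time flies like an arrow Fruit flies like a banana".lower().split()

# Counting how often each vocabulary word occurs is the same as summing
# the one-hot vectors of all the words in the corpus.
counts = Counter(corpus)
tf_vector = [counts[word] for word in vocab]
print(tf_vector)  # [1, 1, 2, 2, 1, 1, 1, 1] -- "flies" and "like" occur twice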

tf-idf (Term Frequency - Inverse Document Frequency)

It is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Consider a collection of patent documents. You would expect most of them to contain words like claim, system, method, procedure, and so on, often repeated multiple times. The TF representation weights words proportionally to their frequency. However, common words such as “claim” do not add anything to our understanding of a specific patent.

Conversely, a rare word (such as “tetrafluoroethylene”) occurs far less frequently but is quite likely to be indicative of the nature of the patent document, so we would want to give it a larger weight in our representation. The IDF is a heuristic that does exactly that.

The IDF of a token w is defined as idf(w) = log(N / n_w), where N is the total number of documents in the corpus and n_w is the number of documents that contain w.
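
A small sketch of this definition, assuming a base-10 logarithm (the base used in the worked example below); the two patent snippets are made up purely for illustration:

import math

def idf(word, documents):
    # idf(w) = log10(N / n_w): N documents in total, n_w of them contain w.
    n_w = sum(1 for doc in documents if word in doc.lower().split())
    return math.log10(len(documents) / n_w)

patents = [
    "a method and system wherein the claim covers a coating process",
    "a method and system wherein the claim covers tetrafluoroethylene polymerisation",
]
print(idf("claim", patents))                          # 0.0 -- occurs in every document
print(round(idf("tetrafluoroethylene", patents), 3))  # 0.301 -- occurs in one of two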

Let’s do an example. Consider a corpus of two documents:

Document 1: “this is a a sample” (5 terms)

Document 2: “this is another another example example example” (7 terms)

tf(“this”, d1) = 1/5 = 0.2
tf(“this”, d2) = 1/7 ≈ 0.14

The idf, on the other hand, is constant per corpus, and accounts for the ratio of documents that include the word “this”. In this case, we have a corpus of two documents, and both of them include the word “this”.

idf(“this”, D) = log(2/2) = 0

So, tf-idf is zero for the word “this”, which implies that the word is not very informative, as it appears in all documents.

tf-idf(“this”, d1, D ) = 0.2 * 0 = 0
tf-idf(“this”, d2, D ) = 0.14 * 0 = 0

The word “example” is more interesting: it occurs three times, but only in the second document.

tf(“example”, d1) = 0/5 = 0
tf(“example”, d2) = 3/7 = 0.429

idf(“example”, D) = log(2/1) = 0.301

Finally,
tf-idf(“example”,d1, D) = tf(“example”, d1) * idf(“example”,D) = 0 * 0.301 = 0

tf-idf(“example”,d2,D) = tf(“example”, d2) * idf(“example”, D)

= 0.429 * 0.301 = 0.129
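
To tie it together, here is a short Python sketch that reproduces the numbers above, using the same two documents and a base-10 logarithm (the function names are illustrative):

import math

d1 = "this is a a sample".split()
d2 = "this is another another example example example".split()
corpus = [d1, d2]

def tf(word, doc):
    # Raw count of the word divided by the document length.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log10(N / n_w), with n_w = number of documents containing the word.
    n_w = sum(1 for doc in docs if word in doc)
    return math.log10(len(docs) / n_w)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf("this", d1))                           # 0.2
print(tf_idf("this", d1, corpus))               # 0.0 -- "this" appears in every document
print(round(tf("example", d2), 3))              # 0.429
print(round(idf("example", corpus), 3))         # 0.301
print(round(tf_idf("example", d2, corpus), 3))  # 0.129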