Word2Vec for Google Quest Kaggle challenge data

Source: Deep Learning on Medium

In the previous article, we extracted document features using TF-IDF. TF-IDF has its drawbacks, such as missing the semantic relationships between words. Word embeddings are one concept that tries to capture the context of words, their semantic similarity, and so on.

Word2Vec is one such word-embedding technique, based on a shallow neural network developed by Tomas Mikolov and colleagues in 2013. Word2Vec constructs an embedding layer that represents each word in the document using either the skip-gram or the Continuous Bag of Words (CBOW) method. According to Mikolov, skip-gram works better with a small amount of data, whereas CBOW is faster and has better representations for more frequent words.

Data Preprocessing

  • Removing special characters (., @, $, #, etc.)
  • Removing stop words (but, and, yet, the, etc.)
  • Converting all text to lower case
#Preprocessing - removing unwanted characters, tokenization, stop-word removal
import re
from nltk.corpus import stopwords

def clean_data(txts):
    #Keep only letters and digits, then lower-case and split into words
    x = re.sub("[^a-zA-Z0-9]", " ", txts)
    x = x.lower().split()
    #Drop English stop words
    stops = set(stopwords.words("english"))
    words = [w for w in x if w not in stops]
    return " ".join(words)

Feature Extraction

First, create a corpus of words from the given documents (train and test data).

from nltk.tokenize import word_tokenize

corpus = []
for i in range(len(train_words)):
    corpus.append(word_tokenize(train_words[i]))

Here ‘train_words’ is the list of all sentences, and word_tokenize is an NLTK function that splits a sentence into its words.

Once we have a corpus containing the tokenized sentences of the dataset, we use the gensim library to extract the Word2Vec features.

from gensim.models import Word2Vec

#Creating word embeddings for the words. Embedding dimension = 50, and
#window=2 tells how many neighbouring words are looked at at once
#(note: in gensim >= 4.0 the 'size' parameter is named 'vector_size')
model = Word2Vec(corpus, size=50, window=2)

Now you have a unique representation for each word present in the dataset.

Create the input features(X) for the train and test data as below

import numpy as np

#Creating the input data
#Initializing the X matrix with zeros
X_train = np.zeros((len(train), 50))

for i in range(len(train)):
    #Create a list of word embeddings of the words in each sentence,
    #skipping words that did not make it into the Word2Vec vocabulary
    emb = [model.wv[w] for w in corpus[i] if w in model.wv]
    #Take the mean of the word embeddings of the words in a sentence,
    #because sentence lengths vary and the dimension of the features
    #would grow with the number of words in the sentence
    if emb:
        X_train[i] = np.mean(emb, axis=0)

Once the input features are ready, train the model and evaluate it on the test data as explained in the previous article.
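That final step can be sketched with any regressor, for example scikit-learn's Ridge; here random placeholder features stand in for the averaged embeddings, and a single target column stands in for the challenge's quality labels:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Placeholder stand-ins for the mean-pooled Word2Vec features and targets
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 50))   # 100 sentences, 50-d mean embeddings
y_train = rng.uniform(size=100)        # one illustrative target in [0, 1]
X_test = rng.normal(size=(20, 50))

reg = Ridge(alpha=1.0).fit(X_train, y_train)
preds = reg.predict(X_test)
print(preds.shape)  # one prediction per test sentence
```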