Source: Deep Learning on Medium
Automated Essay Scoring — Kaggle Competition End to End Project Implementation-Part 2
Kindly go through Part 1, Part 2 and Part 3 for a complete understanding, and execute the project using the given GitHub link.
- Training LSTM Model.ipynb for training and saving the model.
Importing the Data
- Gensim, NLTK and Django libraries have been imported.
- Constants have been defined for the DATASET_DIR, GLOVE_DIR and SAVE_DIR paths.
- Loading data using pandas library from training_set_rel3.tsv.
- Removing unnecessary columns like domain_score and raters_domain.
- Defining the minimum and maximum scores, which we will use later when predicting the actual score.
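The loading and cleaning steps above can be sketched as follows. This is a hypothetical illustration: a tiny in-memory sample stands in for training_set_rel3.tsv so the snippet runs on its own, and the column names follow the ASAP-AES dataset layout — verify them against your copy of the file.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for training_set_rel3.tsv (ASAP-AES columns).
sample_tsv = (
    "essay_id\tessay_set\tessay\trater1_domain1\trater2_domain1\tdomain1_score\n"
    "1\t1\tDear local newspaper, I think computers help people.\t4\t4\t8\n"
    "2\t1\tComputers are a big part of everyday life.\t5\t4\t9\n"
)
data = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
# For the real file you would instead read it from DATASET_DIR, e.g.:
# data = pd.read_csv(DATASET_DIR + "training_set_rel3.tsv",
#                    sep="\t", encoding="ISO-8859-1")

# Drop the individual rater columns; keep only what the model needs.
data = data[["essay_id", "essay_set", "essay", "domain1_score"]]

# Minimum and maximum scores, used later to rescale predicted values.
min_score, max_score = data["domain1_score"].min(), data["domain1_score"].max()
print(min_score, max_score)
```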
Preprocessing the Data
We will preprocess all essays and convert them to feature vectors so that they can be fed into the RNN.
These are all helper functions used to clean the essays.
- There are 4 helper functions defined:
- getAvgFeatureVecs: This function accepts 3 parameters (essays, model, num_features). It internally calls the makeFeatureVec function to convert each essay into a feature vector.
- makeFeatureVec: This function accepts 3 parameters (words, model, num_features). Using Word2Vec's index2word and np.divide, it ultimately returns the average feature vector for the passed word list.
- essay_to_sentense: This function accepts 2 parameters (essay_v, remove_stopwords). It internally calls essay_to_wordlist and converts an essay into sentences.
- essay_to_wordlist: This function accepts 2 parameters (essay_v, remove_stopwords). It removes the stopwords and returns the remaining words.
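The four helpers can be sketched roughly as below. This is a simplified, hypothetical version: a plain dict of word-to-vector mappings stands in for the trained gensim Word2Vec model, a tiny hard-coded stopword set replaces NLTK's list, and the sentence splitter uses a simple regex rather than NLTK's tokenizer.

```python
import re
import numpy as np

STOPWORDS = {"a", "an", "the", "is", "to", "and"}  # NLTK's list in the real code

def essay_to_wordlist(essay_v, remove_stopwords=False):
    """Strip non-letters, lowercase, tokenize, optionally drop stopwords."""
    words = re.sub("[^a-zA-Z]", " ", essay_v).lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return words

def essay_to_sentences(essay_v, remove_stopwords=False):
    """Split an essay into sentences, each a list of word tokens."""
    raw = re.split(r"(?<=[.!?])\s+", essay_v.strip())
    return [essay_to_wordlist(s, remove_stopwords) for s in raw if s]

def makeFeatureVec(words, model, num_features):
    """Average the vectors of all words found in the model's vocabulary."""
    vec = np.zeros(num_features, dtype="float32")
    nwords = 0
    for w in words:
        if w in model:
            vec = np.add(vec, model[w])
            nwords += 1
    if nwords:
        vec = np.divide(vec, nwords)
    return vec

def getAvgFeatureVecs(essays, model, num_features):
    """One averaged feature vector per essay."""
    return np.stack([makeFeatureVec(e, model, num_features) for e in essays])

# Tiny stand-in "model": word -> vector
model = {"good": np.array([1.0, 0.0]), "essay": np.array([0.0, 1.0])}
vecs = getAvgFeatureVecs([["good", "essay"], ["good"]], model, 2)
# vecs[0] is the mean of the two vectors: [0.5, 0.5]; vecs[1] is [1.0, 0.0]
```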
- Whenever you work on NLP machine learning or deep learning tasks, the steps above are almost always necessary: machines understand numbers, so computation becomes straightforward once the text is represented as vectors.
- We first convert the essay (or corpus) into sentences, then into words (also called tokens), and then convert those tokens into vectors.
I would strongly suggest going through a few NLP terms and concepts such as tokenization, stemming, lemmatization and stopwords, as well as the different methods for converting words to vectors, such as BOW, TF-IDF and n-grams. These are NLP data pre-processing techniques applied before feeding text to any machine learning or deep learning algorithm.
Defining the model
Here we define a 2-Layer LSTM Model.
Note that instead of using a sigmoid activation in the output layer, we will use ReLU, since we are not normalising the training labels.
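A 2-layer LSTM regressor of this kind can be sketched in Keras as below. The layer sizes, dropout rates and optimizer here are illustrative assumptions, not necessarily the article's exact configuration; the essential points from the text are the two stacked LSTM layers and the ReLU output over unnormalised labels.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout

def get_model(num_features=300):
    """2-layer LSTM regressor; ReLU output because labels are not normalised."""
    model = Sequential()
    # Each essay is one averaged feature vector, reshaped to (1, num_features).
    model.add(Input(shape=(1, num_features)))
    model.add(LSTM(300, dropout=0.4, recurrent_dropout=0.4, return_sequences=True))
    model.add(LSTM(64, recurrent_dropout=0.4))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation="relu"))  # unbounded non-negative score
    model.compile(loss="mean_squared_error", optimizer="rmsprop", metrics=["mae"])
    return model
```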
- /models folder contains 6 different models which you should try and check accuracy.
- To train them, simply replace the model-definition code above with the code from those model files.
Now we train the model on the dataset.
We will use 5-Fold Cross Validation and measure the Quadratic Weighted Kappa for each fold, then calculate the average Kappa across all folds.
- We first train a Word2Vec model (available in the gensim library) on the essays. We then save it to word2vecmodel.bin, which we will use later when predicting the score.
print("Training Word2Vec Model...")
# Note: in gensim >= 4.0 the size parameter was renamed to vector_size.
model = Word2Vec(sentences, workers=num_workers, size=num_features, min_count=min_word_count, window=context, sample=downsampling)
- Now we use the functions defined previously to convert each essay into its vector representation.
- We also feed these vectors into the LSTM model and save the trained model to final_lstm.h5.
- For the result, we compute the Cohen's kappa score with quadratic weights on each of the 5 folds of KFold cross-validation and then take the average. As you can see, the result is:
print("Average Kappa score after a 5-fold cross validation: ",np.around(np.array(results).mean(),decimals=4))
Average Kappa score after a 5-fold cross validation: 0.9615
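The evaluation loop can be sketched as below. To keep the snippet self-contained, random data and a noisy stand-in "predictor" replace the essay vectors and the trained LSTM; in the real code, y_pred would come from np.around(lstm_model.predict(...)) on the test fold's feature vectors.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
X = rng.random((100, 5))               # stand-in for essay feature vectors
y = rng.integers(2, 13, size=100)      # ASAP set-1 scores range from 2 to 12

results = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    # ... train the LSTM on X[train_idx], y[train_idx] (omitted) ...
    noise = rng.integers(-1, 2, size=len(test_idx))
    y_pred = np.clip(y[test_idx] + noise, 2, 12)   # stand-in predictions
    results.append(cohen_kappa_score(y[test_idx], y_pred, weights="quadratic"))

print("Average Kappa score after a 5-fold cross validation: ",
      np.around(np.array(results).mean(), decimals=4))
```

The quadratic weighting penalises predictions in proportion to the squared distance from the true score, which is why QWK is the standard metric for ordinal labels like essay scores.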
The research papers referenced above clearly explain the importance of using Cohen's kappa score, as well as the different models the authors tried and which model gave the best result.
IMPORTANT: In a practical project implementation, it is equally important to try different models to get the maximum accuracy; the best-performing model is then saved and used in production.
Next, in Part 3, we will go through the web application code and see how the saved model actually predicts the score.
If you really like this article series, kindly clap and follow me, and enjoy the extreme power of artificial intelligence, just like the minions below.