Automated Essay Scoring — Kaggle Competition — End to End Project Implementation — Part 1


Kindly go through Part 1, Part 2, and Part 3 for a complete understanding of the project and its execution.

Let’s first understand the meaning of automated essay scoring. In our education system, students write essays as part of their examinations, and teachers grade them based on their essay-writing skills. The question here is: can this be automated, and to what extent?

Definition: As per Wikipedia, it has been explained pretty well as follows:

Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a form of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades — for example, the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification.

Competition: In 2012, the Hewlett Foundation sponsored a competition on Kaggle called the Automated Student Assessment Prize (ASAP). You can go through the ASAP link for further details.

Dataset: For this competition, there are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. Each of the eight data sets has its own unique characteristics. The variability is intended to test the limits of your scoring engine’s capabilities.

The training data is provided in three formats: a tab-separated value (TSV) file, a Microsoft Excel 2010 spreadsheet, and a Microsoft Excel 2003 spreadsheet. The current release of the training data contains essay sets 1–6. Sets 7–8 will be released on February 10, 2012. Each of these files contains 28 columns:

  • essay_id: A unique identifier for each individual student essay
  • essay_set: 1–8, an id for each set of essays
  • essay: The ascii text of a student’s response
  • rater1_domain1: Rater 1’s domain 1 score; all essays have this
  • rater2_domain1: Rater 2’s domain 1 score; all essays have this
  • rater3_domain1: Rater 3’s domain 1 score; only some essays in set 8 have this
  • domain1_score: Resolved score between the raters; all essays have this
  • rater1_domain2: Rater 1’s domain 2 score; only essays in set 2 have this
  • rater2_domain2: Rater 2’s domain 2 score; only essays in set 2 have this
  • domain2_score: Resolved score between the raters; only essays in set 2 have this
  • rater1_trait1 score — rater3_trait6 score: trait scores for sets 7–8

As mentioned in the Kaggle competition, we have used the training_set_rel3.tsv file in our code.

Figure 1: training_set_rel3.tsv
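If you want to inspect the file yourself, a quick pandas sketch follows. Note that the ISO-8859-1 encoding is my assumption (this TSV typically fails to load as UTF-8), so adjust it if your copy differs:

```python
import pandas as pd

# The competition file is tab-separated; the essays contain non-UTF-8
# characters, so a latin-style encoding is usually needed (assumption).
df = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="ISO-8859-1")

print(df.shape)
print(df[["essay_id", "essay_set", "essay", "domain1_score"]].head())
```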

As you can see in the essay column, tokens such as @CAPS1, @CAPS2, and @NUM1 appear. The competition sponsors made an effort to remove personally identifying information from the essays using the Named Entity Recognizer (NER) from the Stanford Natural Language Processing group, along with a variety of other approaches. The relevant entities are identified in the text and then replaced with a string such as “@PERSON1.”

The entities identified by NER are: “PERSON”, “ORGANIZATION”, “LOCATION”, “DATE”, “TIME”, “MONEY”, and “PERCENT”.
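Because these placeholder tags carry no real semantic meaning for an embedding model, one common pre-processing choice is to strip them (or map them to a generic token) before tokenization. The clean_essay helper below is my own illustration of that idea, not part of the competition pipeline; the tags always follow the pattern “@” + entity label + digits, so one regular expression covers them all:

```python
import re

def clean_essay(text: str) -> str:
    """Strip anonymization tags such as @PERSON1 or @CAPS2 from an essay."""
    return re.sub(r"@[A-Z]+\d+", "", text)

print(clean_essay("I met @PERSON1 near @LOCATION2 on @DATE1."))
# -> "I met  near  on ."  (the leftover double spaces can be normalized later)
```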

I have given a basic understanding here of what the Kaggle competition actually requires; for a detailed study of the requirements and the dataset, kindly go through the ASAP link.

EXTREMELY IMPORTANT: Note that this competition happened 8 years ago, and the first-prize winner on the leaderboard had a score of 0.81407. We are going to break this record and achieve a score of 0.96. We can safely say that deep learning had limitations at that time which we don’t have now.

Kindly go through the GitHub link for all the code. You can run it as a web application where the user chooses an essay set from 1 to 8 and then writes an essay for that prompt. By clicking the Grade Me button, students get an on-the-spot score for their written essay; a rough sketch of that flow is shown below.
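As a hypothetical sketch of what happens behind the Grade Me button (the real /mysite/grader/views.py is explained in Part 3), the view loads the saved Keras model and scores the submitted text. The form field name, the final_lstm.h5 filename, and the preprocess stub are all illustrative assumptions:

```python
# grader/views.py — illustrative sketch only, not the repository's exact code
import numpy as np
from django.shortcuts import render
from tensorflow.keras.models import load_model

model = load_model("final_lstm.h5")  # trained LSTM saved earlier (assumed name)

def preprocess(essay: str) -> np.ndarray:
    # Placeholder: the real pipeline tokenizes the essay and turns it into
    # Word2Vec feature vectors shaped for the LSTM (covered in Part 2).
    raise NotImplementedError

def grade_me(request):
    if request.method == "POST":
        essay = request.POST.get("essay", "")
        features = preprocess(essay)      # text -> (1, timesteps, features)
        score = model.predict(features)   # many-to-one LSTM output
        return render(request, "grader/result.html",
                      {"score": int(np.round(score[0][0]))})
    return render(request, "grader/index.html")
```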

The points below give a basic understanding of the project:

  • This project is developed on the Django framework. I have worked on and delivered many web application projects in ASP.NET, MVC Web API, AngularJS, and most recently Flask. To run this web application on a server, you may not need a complete understanding of Django; following the GitHub guidelines, you will be able to host it easily.
  • As part of the data science life cycle, we have already cleared the data ingestion step and a few data pre-processing steps, because the training data is provided with a little pre-processing already done, as mentioned above under anonymization.
  • We are also using the deep learning LSTM algorithm, which has the capability of learning sequences of information. Since we want a single score per essay, a many-to-one LSTM needs to be applied (one such architecture is sketched after this list).
  • Here, we calculate the score using a statistical measure called Cohen’s kappa. Other statistical measures are also available and can be used for further exploration; a few details about Cohen’s kappa from Wikipedia are mentioned below, followed by an example.

1. Cohen’s kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The definition of κ is:

κ = (po − pe) / (1 − pe)

where po is the relative observed agreement among raters (identical to accuracy), and pe is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category.

2. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as given by pe), κ = 0. It is possible for the statistic to be negative, which implies that there is no effective agreement between the two raters or that the agreement is worse than random.
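As a concrete example, scikit-learn implements this measure as cohen_kappa_score. The ASAP competition was judged on the quadratic-weighted variant of kappa, which penalizes large disagreements more heavily; the toy rater scores below are made up purely for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Toy example: scores given by two raters to the same five essays
rater1 = [2, 3, 4, 4, 1]
rater2 = [2, 3, 4, 3, 1]

# Plain (unweighted) Cohen's kappa
print(cohen_kappa_score(rater1, rater2))                       # ≈ 0.74

# Quadratic-weighted kappa: disagreements are weighted by the
# squared distance between the two scores
print(cohen_kappa_score(rater1, rater2, weights="quadratic"))  # ≈ 0.92
```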

  • As part of the NLP work, essays will be split into sentences and then into tokens using the NLTK library, along with regular expressions. We will convert the tokens into vectors to feed the LSTM, using the Word2Vec algorithm and also a pretrained GloVe model (a featurization sketch is given after this list).
  • We will create a neural network architecture using LSTM and feed our feature vectors to this network. We can prepare any number of different architectures and check which one gives the best output (one candidate is sketched after this list).
  • We will also go through 2 research papers related to automated essay scoring and draw a few insights from them. The links to the research papers are mentioned at the end of the GitHub page.
  • Here, we will train the model and save it in the .h5 format used by Keras. Please note that a scikit-learn model would be saved in the pickle file format (.pkl), while a TensorFlow model can be saved in the protobuf file format (.pb).
  • I will also explain a few code snippets: the essay web page, and how, when the Grade Me button is clicked, we load the saved models and calculate the score.
  • I will mainly be going through 2 files:
  1. Training LSTM Model.ipynb for training and saving the model.
  2. /mysite/grader/views.py for getting the context from web page and scoring the essay from saved model.
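To make the featurization step concrete, here is a minimal sketch of turning an essay into a fixed-length vector by averaging the Word2Vec embeddings of its tokens. The toy corpus, the vector size, and the essay2vec helper are illustrative assumptions; Part 2 walks through the actual notebook code:

```python
import re
import numpy as np
import nltk
from gensim.models import Word2Vec

nltk.download("punkt", quiet=True)

def tokenize(essay):
    # Keep letters only, lowercase, then split into word tokens
    cleaned = re.sub(r"[^a-zA-Z]", " ", essay).lower()
    return nltk.word_tokenize(cleaned)

# Tiny toy corpus standing in for the training essays
corpus = [tokenize(e) for e in [
    "Dear local newspaper I think computers help people",
    "Computers let people talk to friends and learn new things",
]]

# Train a small Word2Vec model (the real notebook uses a larger vector size)
w2v = Word2Vec(sentences=corpus, vector_size=50, min_count=1, workers=1)

def essay2vec(essay, model):
    """Average the embeddings of all in-vocabulary tokens in the essay."""
    vecs = [model.wv[t] for t in tokenize(essay) if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(essay2vec("Computers help people learn", w2v).shape)  # (50,)
```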
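And here is one possible many-to-one LSTM architecture in Keras, ending in a single regression output for the score and saved in the .h5 format mentioned above. The layer sizes and hyperparameters are illustrative guesses, not necessarily the exact network from the notebook:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Input: a sequence of feature vectors, e.g. one 300-dim essay vector per timestep
model = Sequential([
    LSTM(128, input_shape=(1, 300), return_sequences=True),
    LSTM(64),                     # final LSTM returns only its last state: many-to-one
    Dropout(0.5),
    Dense(1, activation="relu"),  # single continuous output: the essay score
])
model.compile(loss="mean_squared_error", optimizer="rmsprop", metrics=["mae"])
model.summary()

# After training with model.fit(...), persist the network in Keras's HDF5 format
model.save("final_lstm.h5")
```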