Written Comm Analyzer — Scoring Readability

Before writing anything, it is always a prerequisite to know your audience. The intensity and depth of our writing should match the language maturity of the reader and the context. Imagine a company releasing an annual report full of social media lingo, or someone trying to teach primary school kids about teamwork with excerpts from Harvard Business Review! That doesn’t work well, right? Whatever content needs to be delivered, attention should also go to the reader’s maturity and level of language understanding. It is futile to feed information to your readers if they can’t comprehend it or, at the other extreme, if it lacks the depth a mature reader expects. In both cases, the writing fails its audience.

Readability has long been an area of focus in linguistics. Institutes and scholars have tried to come up with a standardized way to ‘measure’ readability. These solutions often take into account objective, measurable textual properties such as the average number of syllables per word, the number of words per sentence, the ratio of difficult to easy words, etc.

To develop a module to measure readability, we tried two approaches:

  1. Legacy Readability Formulas
  2. Machine Learning based solution

The first approach uses formulas that have been used for decades to standardise readability levels. The second approach extracts features from the text and feeds them to a regression algorithm that outputs a score. The regression model was trained on data with human-evaluated readability scores for 1000+ passages. We shall take a look at these approaches one by one.

Approach 1 — Legacy Readability Formulas

Linguists and language institutes have devised several formulas to standardise the measure of readability. Below is a list of the formulas that we used. More information regarding the exact formula can be found here.

  1. Flesch Reading Ease formula
  2. Flesch-Kincaid Grade Level
  3. Dale-Chall Formula
  4. SMOG Index
  5. Gunning Fog Formula
  6. Automated Readability Index
  7. The Coleman-Liau Index
  8. Linsear Write Formula

Approach 2 — Machine Learning based (Support Vector Regression)

The legacy readability formulas work, but we wanted to explore machine learning for this. For machine learning to work, we required two things:

  1. A tool to extract features from the text. The legacy formulas depend on just a few basic text features, but we wanted to explore more features and check how they correlate with the readability score.
  2. A human-evaluated dataset that contains text passages and their readability grades. We need this dataset to train our machine learning model.

Fortunately, we found TAALES (Tool for the Automatic Analysis of Lexical Sophistication). This tool takes a passage and extracts 300+ features. These features are extracted by an inbuilt model that has been trained on multiple text corpora.

We also found two datasets, Common Core and WeeBit, that contain essays and passages categorized into grade levels. We used these two datasets to create training examples for the SVR (Support Vector Regression) model.

A typical feature set from TAALES

We studied the correlation between the 300+ features and the grade level. To avoid overfitting, we kept only the 15 features most correlated with the grade level. With these 15 features, we trained an SVR model using scikit-learn. We can use this model to categorize any English passage into grade levels ranging from grade 1–2 to college graduate.
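
The selection and training steps described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than our exact pipeline: the feature matrix is synthetic random data standing in for the TAALES output, and the helper `top_k_correlated`, the RBF kernel, the scaling step, and the hyperparameters are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def top_k_correlated(X: np.ndarray, y: np.ndarray, k: int = 15) -> np.ndarray:
    """Indices of the k columns of X with the largest |Pearson r| against y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; treat their correlation as 0.
    r = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    return np.argsort(-np.abs(r))[:k]


# Synthetic stand-in for the real data (assumption): 1000 passages x 300
# features, where only the first 15 features actually drive the grade.
rng = np.random.default_rng(0)
n_passages, n_features = 1000, 300
X = rng.normal(size=(n_passages, n_features))
weights = rng.uniform(0.5, 1.0, size=15)
y = 6.0 + X[:, :15] @ weights + rng.normal(scale=0.5, size=n_passages)

# Hold out the last 200 passages to check generalisation.
X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

# Rank features on the training split only, then keep the top 15.
selected = top_k_correlated(X_train, y_train, k=15)

# SVR is sensitive to feature scale, so standardise before fitting.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X_train[:, selected], y_train)

predicted = model.predict(X_test[:, selected])
r2 = model.score(X_test[:, selected], y_test)
print("selected feature indices:", sorted(selected.tolist()))
print("held-out R^2:", round(r2, 3))
```

Ranking the features on the training split alone keeps the held-out passages from leaking into the selection step, which matters when 300+ candidate features are screened against roughly a thousand examples.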

Source: Deep Learning on Medium