Source: Deep Learning on Medium
Let’s see how we can achieve this using the NeuralCoref library from Hugging Face.
All the code and the Jupyter notebook are available at –
First, install the necessary libraries in the Jupyter notebook:
!pip install neuralcoref
!pip install spacy==2.1.0
!python3 -m spacy download en
Initialize the NeuralCoref library:
import spacy
import neuralcoref
from nltk.tokenize import sent_tokenize
# Load your usual spaCy model (one of spaCy's English models)
nlp = spacy.load('en')
# Add NeuralCoref to spaCy's pipe
neuralcoref.add_to_pipe(nlp)
Initialize with some sample text and resolve coreferences with the library:
text = "Scientists know many things about the Sun. They know how old it is. The Sun is more than 4½ billion years old. It is also a star that is the centre of our solar system. They also know the Sun’s size."
text = str(text)
# Run the pipeline; NeuralCoref attaches its results to doc._
doc = nlp(text)
clusters = doc._.coref_clusters
print("clusters ", clusters)
resolved_coref = doc._.coref_resolved
print("Resolved by NeuralCoref: \n")
print(resolved_coref)
The print output from clusters will be:
[Scientists: [Scientists, They, They], the Sun: [the Sun, The Sun, Sun, It]]
For each entity, the cluster lists every mention that refers to it, including any pronouns.
When you let the library resolve the coreferences (replace pronouns with their nouns), the resolved output is:
Scientists know many things about the Sun. Scientists know how old it is. the Sun the Sun is more than 4½ billion years old. the Sun is also a star that is the centre of our solar system. Scientists also know the Sun’s size.
Because of resolution errors, and perhaps some indexing errors in the library's logic, the output is sometimes odd, e.g. “the Sun the Sun” at the start of the third sentence.
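One plausible way such a duplicate can arise: the cluster above lists both “The Sun” and the nested “Sun” as separate mentions, and if each mention is replaced independently, the replacement text lands twice. The snippet below is a simplified, hypothetical model of span-by-span replacement written in plain Python. It is not NeuralCoref's actual code, and the token list, spans and the `naive_resolve` name are made up for illustration.

```python
# Simplified, hypothetical model of span-by-span replacement.
# This is NOT NeuralCoref's actual implementation.

# Tokens keep their trailing whitespace, similar to spaCy's text_with_ws.
tokens = ["The ", "Sun ", "is ", "more ", "than ", "4½ ", "billion ", "years ", "old."]
main_mention = "the Sun"

# (start, end) token spans with end exclusive.
# The second span ("Sun") is nested inside the first ("The Sun").
mentions = [(0, 2), (1, 2)]


def naive_resolve(tokens, mentions, main_text):
    """Replace every mention span with main_text, one span at a time."""
    resolved = list(tokens)
    for start, end in mentions:
        # Preserve the whitespace that followed the last token of the mention.
        last = tokens[end - 1]
        trailing_ws = last[len(last.rstrip()):]
        resolved[start] = main_text + trailing_ws
        # Blank out the remaining tokens of the mention.
        for i in range(start + 1, end):
            resolved[i] = ""
    return "".join(resolved)


naive_output = naive_resolve(tokens, mentions, main_mention)
print(naive_output)
```

The nested span overwrites the slot that the outer span had just blanked, so the main mention is written twice.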
We will attempt to write our own resolving function and, in the process, generate English pronoun grammar questions.
First, we choose only a subset of pronouns to replace, and write an auxiliary function that returns the sentence containing a word, given that word's index.
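Here is a minimal sketch of what these two pieces could look like. The pronoun list, the function names and the regex sentence splitter are all assumptions made for illustration; the splitter stands in for `sent_tokenize` so the snippet has no dependencies, and word indices here count whitespace-separated words rather than spaCy tokens.

```python
import re

# Hypothetical subset of pronouns to replace; adjust to taste.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_for_word_index(sentences, word_index):
    """Return the sentence containing the word at word_index.

    word_index is 0-based and counts whitespace-separated words
    across the whole text.
    """
    if word_index < 0:
        raise IndexError("word index out of range")
    seen = 0
    for sentence in sentences:
        count = len(sentence.split())
        if word_index < seen + count:
            return sentence
        seen += count
    raise IndexError("word index out of range")


sample = "Scientists know many things about the Sun. They know how old it is."
sample_sentences = split_sentences(sample)
# Word 7 is "They", the first word of the second sentence.
print(sentence_for_word_index(sample_sentences, 7))
```

With real spaCy tokens, the sentence containing a token is also reachable through the token's `sent` attribute, which avoids the manual counting.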
Then we write our custom function, which resolves only those mentions in the cluster list that are pronouns and that differ from the original entity.
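The idea can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not the function from the notebook: clusters are represented as hypothetical `(main_text, [(start, end), ...])` tuples instead of NeuralCoref cluster objects, and `resolve_pronouns` and `PRONOUNS` are made-up names.

```python
# Hypothetical subset of pronouns to replace.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}


def resolve_pronouns(tokens, clusters, pronouns=PRONOUNS):
    """Replace only single-token pronoun mentions with the main mention.

    tokens:   list of token strings that keep their trailing whitespace
    clusters: list of (main_text, [(start, end), ...]) with end exclusive
    """
    resolved = list(tokens)
    for main_text, mentions in clusters:
        for start, end in mentions:
            mention = "".join(tokens[start:end]).strip()
            # Skip multi-token mentions and anything that is not a chosen pronoun.
            if end - start != 1 or mention.lower() not in pronouns:
                continue
            # Skip mentions identical to the entity itself.
            if mention.lower() == main_text.lower():
                continue
            trailing_ws = tokens[start][len(tokens[start].rstrip()):]
            resolved[start] = main_text + trailing_ws
    return "".join(resolved)


example_tokens = ["The ", "Sun ", "is ", "old", ". ", "It ", "is ", "a ", "star", "."]
# "The Sun" and the nested "Sun" are left alone; only "It" is replaced.
example_clusters = [("the Sun", [(0, 2), (1, 2), (5, 6)])]
example_output = resolve_pronouns(example_tokens, example_clusters)
print(example_output)
```

Because noun mentions such as “The Sun” and “Sun” are never rewritten, the nested-mention duplication cannot occur, at the cost of leaving the main mention's capitalization as-is at the start of a sentence.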
With our custom coreference function above, the output for the initial text with coreferences resolved is: