Formatting spaCy custom training data the easier way

Original article was published on Deep Learning on Medium

One of my company projects required building a custom Named Entity Recognition (NER) model. After much research, I concluded that spaCy's NER model would be the right fit for my project. I then started collecting the custom data required to build a data set and stored it in a .csv file. The .csv contained data related to the medical field, such as notes from medical conferences and seminars. After tokenizing the sentences into words, I assigned entities to each of these words. Since the data set consisted of medical terms and concepts, the predefined NER model could only assign the entity labels that spaCy supports out of the box, i.e.

GPE (Geo Political Entity)

LOC (Non-GPE locations such as mountain ranges and bodies of water)

ORG (Organization)

PERSON (Person)

CARDINAL (Numbers, integer values etc.)

DATE and so on.

For example, the model labeled the word “Influenza” as GPE, which is absurd. Similarly, “Genetics” was labeled PERSON. Hence, without a custom training data set, the model's output for this domain would be full of errors.

Data Preprocessing

To create training data for spaCy, the data needs to be formatted in a specific way so that the spaCy NER model can process it. Any training data set has to be in the following format:

Figure 1. spaCy’s train set format

In simple words, the training data is a list of tuples, each carrying four attributes:

  1. Sentence/keyword
  2. Start (character offset) of the word for which the custom entity is defined
  3. End (character offset, exclusive) of the same word
  4. The custom label

Note: If multiple words in a sentence need labels, the entities are given as a list of tuples of the form [(start_of_word1, end_of_word1, LABEL_for_word1), (start_of_word2, end_of_word2, LABEL_for_word2), ...].
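Since Figure 1 is an image, here is a minimal text sketch of that structure. It assumes the spaCy v2-style layout, where each item pairs a text with a dictionary whose "entities" key holds the (start, end, label) tuples. The sentence, the labels DISEASE and FIELD, and the helper `find_span` are made up for illustration and are not taken from my data.

```python
def find_span(text, word, label):
    """Return (start, end, label) for the first occurrence of `word` in `text`.

    Hypothetical helper: offsets are character positions, and `end` is exclusive.
    """
    start = text.find(word)
    if start == -1:
        raise ValueError(f"{word!r} not found in {text!r}")
    return (start, start + len(word), label)


# Illustrative sentence and labels (assumptions, not real project data).
sentence = "Influenza vaccines were discussed in the genetics session."

TRAIN_DATA = [
    # A single keyword with one entity covering the whole text.
    ("Influenza", {"entities": [(0, 9, "DISEASE")]}),
    # A sentence with several labeled words, as described in the note above.
    (sentence, {"entities": [
        find_span(sentence, "Influenza", "DISEASE"),
        find_span(sentence, "genetics", "FIELD"),
    ]}),
]

# Sanity check: slicing the text with each span should give back the word.
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(text[start:end], label)
```

Slicing the text with the offsets is a cheap way to catch off-by-one mistakes before training.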

Since the extracted data was in a .csv file, converting it into the above-mentioned format was a challenge. After searching various websites for answers, I still could not get my data into this format. The solutions I found for converting custom data into spaCy’s training data format were long and tedious, and none of them produced correct output. So I tried a simple, traditional approach in Python to convert my data set into spaCy’s required training data format.

A snapshot of my .csv file and the labels assigned by spaCy’s predefined NER model is shown below:

Figure 2. Initial keywords and their labels

For the above data, the correct association of keywords would be as follows:

Figure 3. Required labels

Hence, to convert this into spaCy’s required training data format, use the simple 9-line code shown below:

Figure 4. CODE
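The code in Figure 4 is an image, so as a rough text version here is a sketch of one way such a conversion can be written. It is not necessarily identical to the code in the figure. It assumes the .csv has one keyword per row with its label, that the entity covers the whole keyword, and that the output uses the spaCy v2-style "entities" dictionary. The column names `keyword` and `label`, the sample rows, and the function name `csv_to_train_data` are all hypothetical.

```python
import csv
import io

# Hypothetical stand-in for the .csv file; the column names are assumptions.
CSV_TEXT = """keyword,label
Influenza,DISEASE
Genetics,FIELD
"""


def csv_to_train_data(handle, text_col="keyword", label_col="label"):
    """Build a list of (text, {"entities": [(start, end, label)]}) tuples."""
    train_data = []
    for row in csv.DictReader(handle):
        text = row[text_col].strip()
        label = row[label_col].strip()
        if not text or not label:
            continue  # skip incomplete rows
        # The entity spans the whole keyword: characters 0 .. len(text).
        train_data.append((text, {"entities": [(0, len(text), label)]}))
    return train_data


TRAIN_DATA = csv_to_train_data(io.StringIO(CSV_TEXT))
print(TRAIN_DATA)
```

With a real file, the same function can be given an open file handle instead of the in-memory string, for example one opened with `newline=""` as the `csv` module recommends.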

Try the above method and leave your feedback in the comments!