Natural Language Processing Made Simpler with 4 Basic Regular Expression Operators!

Original article was published by Bharath K on Artificial Intelligence on Medium


Let us now look at how to use Python's re module in more detail, working through a text sample to see how the module handles the processing and parsing of text data. I made up a sample with a few irregular sentences. You can use the same sample or make up your own and follow along.

The text sample is as shown below:

import re

sentence = "Machine Learning is fun. Deep learning is awesome. Artificial Intelligence: is it the future?"

The four basic regular expression functions that we will use for data pre-processing are:

  1. re.findall()
  2. re.split()
  3. re.sub()
  4. re.search()

With these four functions, many common natural language and text pre-processing tasks can be handled. So, without further ado, let us look at each of these functions and how it can be used.

The re.findall() function returns a list of all non-overlapping matches of a pattern in a string. If no match is found, an empty list is returned.

Let us find all the words that begin with a capital letter. The code below does this:

capital = re.findall(r"[A-Z]\w+", sentence)
print(capital)

This should give us the following output: ['Machine', 'Learning', 'Deep', 'Artificial', 'Intelligence'].
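As a quick check of both behaviours described above (a list of matches, and an empty list when nothing matches), here is a minimal, self-contained sketch. The `digits` pattern is only an illustration of a pattern that finds nothing in this sample.

```python
import re

sentence = "Machine Learning is fun. Deep learning is awesome. Artificial Intelligence: is it the future?"

# A capital letter followed by one or more word characters.
capital = re.findall(r"[A-Z]\w+", sentence)
print(capital)

# The sample contains no digits, so this returns an empty list rather than raising an error.
digits = re.findall(r"\d+", sentence)
print(digits)
```

Note that `\w+` requires at least one character after the capital letter, so a lone capital such as "I" would not be matched by this pattern.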

If you want to find out how many full stops (periods) there are in the text, you can use either of these two commands:

1. len(re.findall(r"[.]", sentence))
2. len(re.findall(r"\.", sentence))

Both commands return 2, since the text contains two periods. The backslash '\' is an escape character: it makes the pattern match a literal period instead of treating '.' as the regex wildcard that matches any character. Inside square brackets, the period is already treated as a literal.
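To see why the escaping matters, the sketch below compares an unescaped period with the escaped forms. The use of `re.escape()` is an addition here, shown as one more way to get a literal pattern; the variable names are only illustrative.

```python
import re

sentence = "Machine Learning is fun. Deep learning is awesome. Artificial Intelligence: is it the future?"

# Unescaped "." is a wildcard: it matches every character except a newline.
unescaped = len(re.findall(r".", sentence))

# These three patterns all match a literal period only.
escaped = len(re.findall(r"\.", sentence))
in_class = len(re.findall(r"[.]", sentence))
via_escape = len(re.findall(re.escape("."), sentence))

print(unescaped == len(sentence))
print(escaped, in_class, via_escape)
```

Since the sample has no newlines, the unescaped pattern matches every single character, which is almost never what you want when counting punctuation.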

The re.split() function splits the text at every match of the pattern and returns the pieces as a list. If no match is found, the list contains the original string as its only element.

Let us perform a split operation to get the sentences that are separated by periods. The command below does this.

re.split(r"\.", sentence)

This operation will return the following list of sentences.

['Machine Learning is fun',
' Deep learning is awesome',
' Artificial Intelligence: is it the future?']

If you want to split on both periods and question marks, use the command below.

re.split("[.?]", sentence)
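One detail worth checking yourself: because the sample ends with a question mark, splitting on `[.?]` leaves an empty string at the end of the list, and the pieces keep their leading spaces. A minimal sketch of one way to tidy this up follows; the `cleaned` list comprehension is my own suggestion, not part of the re module.

```python
import re

sentence = "Machine Learning is fun. Deep learning is awesome. Artificial Intelligence: is it the future?"

# Split on periods and question marks; the final '?' produces a trailing empty string.
parts = re.split(r"[.?]", sentence)
print(parts)

# Strip surrounding whitespace and drop empty pieces.
cleaned = [part.strip() for part in parts if part.strip()]
print(cleaned)

# With no match, the result is a one-element list holding the original string.
unsplit = re.split(r";", sentence)
print(unsplit == [sentence])
```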

The re.sub() function replaces every match of a pattern with a replacement string. If no match is found, the original string is returned unchanged.

If you want to replace all the periods and question marks with exclamation marks, you can use the command below:

re.sub("[.?]", '!', sentence)

The first argument is the pattern to match. The second argument is the text to replace each match with. The third argument is the sentence or text data on which the replacement is performed.

The operation returns the following string.

'Machine Learning is fun! Deep learning is awesome! Artificial Intelligence: is it the future!'
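If you only want to replace some of the matches, or want to know how many replacements were made, the optional `count` argument and the related `re.subn()` function may help. This is a small sketch of those two options, plus the no-match case; the variable names are only illustrative.

```python
import re

sentence = "Machine Learning is fun. Deep learning is awesome. Artificial Intelligence: is it the future?"

# Replace only the first match.
first_only = re.sub(r"[.?]", "!", sentence, count=1)
print(first_only)

# re.subn() returns the new string together with the number of replacements made.
replaced, n = re.subn(r"[.?]", "!", sentence)
print(replaced, n)

# No semicolons in the sample, so the string comes back unchanged.
unchanged = re.sub(r";", "!", sentence)
print(unchanged == sentence)
```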

The re.search() function scans the text for the first match of a pattern, such as a word or a punctuation mark, and returns a match object describing it. If no match is found, None is returned.

If I want to find the position of the starting and ending characters of the word “fun.” in the text, then I can run the below command.

x = re.search(r"fun\.", sentence)
print(x.start())
print(x.end())

The above code block prints 20 and 24. This tells us that 'f' is at index 20, while end() gives the index just past the match: the '.' itself is at index 23. There are many more operations you can try out with this function, and I highly recommend exploring them.
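Since re.search() returns None when nothing matches, calling `.start()` directly on the result can fail. The sketch below shows one defensive way to use it, assuming the same sample text; the word "boring" is just an example of something that does not appear in it.

```python
import re

sentence = "Machine Learning is fun. Deep learning is awesome. Artificial Intelligence: is it the future?"

# Escape the period so it matches a literal '.' rather than any character.
match = re.search(r"fun\.", sentence)
if match is not None:
    print(match.group())               # the matched text
    print(match.start(), match.end())  # end() is one past the last matched character
    print(sentence[match.start():match.end()])

# A pattern that does not occur returns None instead of a match object.
missing = re.search(r"boring", sentence)
print(missing)
```

Because end() is exclusive, `sentence[match.start():match.end()]` slices out exactly the matched text.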

With this, we have reached the end of the major operations for regular expressions. Keep experimenting with the re module to learn its more intricate details.