Natural Language Processing is Just Find & Replace with Way More Options

Source: Deep Learning on Medium

If you have used Microsoft Word, you have used “Find”.

Find is a simple search function within a Word document that matches an exact sequence of characters.

When you first hear about Natural Language Processing, it sounds like an esoteric process full of mysticism and complicated math. That is partially true (the complicated math part), but there is nothing mystical or overly complex happening.

While it’s fashionable to freak out about robots becoming sentient and driving the human race into extinction, that’s not really how it works.

The Five Layers of Document Search

1. Find

When you search a Word document, you ask the software, “Hey, Word! Find these characters in this order.” Word is happy to oblige because this is a simple character-match problem. If the document contains the characters you entered, in the sequence you entered them, Word returns a match.

Which is a fancy way of saying it finds the word you search if the word exists in the document.
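In code, Find might look like the sketch below. It uses Python's built-in str.find; the sample sentence and variable names are made up for illustration and have nothing to do with how Word itself is implemented.

```python
# A minimal sketch of "Find": exact, case-sensitive character matching.
# The document text and the query are illustrative only.
document = "The runner passed another runner near the river."
query = "runner"

# str.find returns the index of the first match, or -1 when there is none.
position = document.find(query)
found = position != -1

print(found, position)  # True 4
```

Searching for "Runner" with a capital R would return -1, because the match is on exact characters.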

2. Find and Replace

This goes one step beyond “Find”. You now ask the software, “Hey, Word! Find these characters in this order. Once you find the first match, replace it with these characters in this order.”

Which is a fancy way of saying, “Find ‘runner’ and replace the first instance with ‘running’.”

Again, very straightforward.
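As a rough sketch, replacing only the first match in Python could use str.replace with a count of 1. The sample sentence is made up for illustration.

```python
# A minimal sketch of "Find and Replace": change only the first match.
document = "The runner passed another runner near the river."

# The third argument limits how many matches are replaced.
first_replaced = document.replace("runner", "running", 1)

print(first_replaced)  # The running passed another runner near the river.
```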

3. Replace All

We’re now asking, “Hey, Word! Find these characters in this order. Now that you found all matching sequences, replace them with this sequence of characters.”

Still, nothing crazy. “Find all matches to ‘runner’ and replace each one, everywhere within the document, with ‘running’.”
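A sketch of Replace All in Python is the same call without a limit. Again, the sample sentence is an assumption made for illustration.

```python
# A minimal sketch of "Replace All": change every match in the document.
document = "The runner passed another runner near the river."

# Count the matches first, then replace all of them.
match_count = document.count("runner")
all_replaced = document.replace("runner", "running")

print(match_count)   # 2
print(all_replaced)  # The running passed another running near the river.
```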

4. Regex

This is where things start to get interesting. With Regex, you are not restricted to a single sequence of characters. A series of “operators” let you add dozens of parameters to your search. Operators are punctuation characters you can use to refine a search. If you’ve ever used quotation marks to refine a Google search, you’ve used operators.

Practically speaking, this looks like…

Search: “natural language processing” “n00b”

This finds all web pages with both the phrase “natural language processing” and the word “n00b”.

Search: “natural language processing” -n00b

Conversely, this finds all web pages that contain the phrase “natural language processing” but do NOT contain the word “n00b”.
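The two searches above can be imitated with a small filter. This is only a toy sketch: the pages list and the search function are hypothetical, and a real search engine works very differently under the hood.

```python
# A toy imitation of quoted-phrase and minus-sign search operators.
# The pages and the search() helper are hypothetical examples.
pages = [
    "natural language processing for the n00b",
    "natural language processing in production",
    "image processing for the n00b",
]

def search(pages, required, excluded=()):
    """Return pages containing every required phrase and no excluded one."""
    return [
        page for page in pages
        if all(phrase in page for phrase in required)
        and not any(phrase in page for phrase in excluded)
    ]

# "natural language processing" "n00b"
both = search(pages, ["natural language processing", "n00b"])

# "natural language processing" -n00b
without = search(pages, ["natural language processing"], excluded=["n00b"])

print(both)
print(without)
```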

With Regex, we can perform similar searches using punctuation to further refine our search. This is not an article about Regex, but here is a simple example.

Let’s say we want to find “Hello, world!” within a document.

Hello[,]+\s+\w+.

match: Hello, world!

And here’s why that works.

Hello matches the characters Hello literally (case sensitive).

[,]+ matches a single character present in the list: the character , literally (case sensitive). The + quantifier matches between one and unlimited times, as many times as possible, giving back as needed (greedy).

\s+ matches any whitespace character (equal to [\r\n\t\f\v ]). The + quantifier matches between one and unlimited times, as many times as possible, giving back as needed (greedy).

\w+ matches any word character (equal to [a-zA-Z0-9_]). The + quantifier matches between one and unlimited times, as many times as possible, giving back as needed (greedy).

. matches any character (except for line terminators).

Global pattern flags: the g modifier (global) returns all matches instead of stopping after the first match. The m modifier (multi line) causes ^ and $ to match the begin/end of each line (not only begin/end of string).
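To try the pattern yourself, here is a minimal sketch using Python's standard re module. The sample sentence is made up, and the g and m flags are left out because a single search is enough here.

```python
import re

# The same pattern as above, written as a raw string so the backslashes
# reach the regex engine untouched.
pattern = re.compile(r"Hello[,]+\s+\w+.")

# An illustrative sentence that contains the target phrase.
text = "She typed: Hello, world! and hit enter."

match = pattern.search(text)
matched_text = match.group(0) if match else None

print(matched_text)  # Hello, world!
```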

Of course, we could just as easily do a traditional Find to search for this phrase. However, this simple example demonstrates the fundamental difference between Find and Regex.
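One way to see that difference is to run both kinds of search over a few variations of the greeting. The sketch below assumes three made-up strings; the literal search only finds the exact phrase, while the pattern finds all three.

```python
import re

# Three illustrative variations: different spacing, words, and punctuation.
texts = ["Hello, world!", "Hello,   there?", "Hello,\tNLP."]

# Plain Find: the exact characters, in the exact order.
literal_hits = [t for t in texts if "Hello, world!" in t]

# Regex: "Hello", one or more commas, whitespace, a word, then any character.
regex_hits = [t for t in texts if re.search(r"Hello[,]+\s+\w+.", t)]

print(len(literal_hits))  # 1
print(len(regex_hits))    # 3
```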

There’s An Extension for That

Before we jump into the fifth layer of search, we have to take a step back to create a mental model of what we know so far.

You have an iPhone or Android. Your iPhone or Android can do a lot of things out-of-the-box, but you don’t unlock its true power until you start downloading apps.

Apps take the technology in your phone and extend it.

If you can annotate pictures with text in Apple’s stock tools, you can build an app that uses the same technology to add stickers to photos. Or transform the person in the picture into a baby. Your iPhone can do almost anything; it just needs an app to unlock its potential.

Similarly, you can do a lot with JavaScript. But it has certain restrictions that make it difficult to scale. That’s where frameworks like Ember.js and React Native come in. The “convention over configuration” approach of Ember.js extends what you can do with JavaScript.

This isn’t a new idea in tech. Find and Replace/Replace All extends a simple Find search. Regex extends Find and Replace/Replace All.

And now that we have that out of the way, NLP libraries extend Regex.

Regular Expressions vs. Word Vectors

Thus far, we have established how basic search functions work within a document. We then looked at how Find and Replace further extends basic search. Then we looked at Regex and how its operators further extend Find and Replace/Replace All.

Now let’s talk about how NLP extends Regex.

Rather than a pure computational “character order/match” sequence, NLP takes an entirely different approach. It says…

Let’s convert all of the words and spaces in your documents into “tokens” (individual words, spaces, punctuation, etc). Let’s then place those tokens in a table and translate each token into a word vector, a list of numbers that captures the token’s “shape”. After that, we’ll compare the vectors in our table with the vectors in a pre-trained vocabulary model, much like running a set of extended Regex pattern matches, except that the comparison is numeric similarity rather than character matching. Finally, we’ll find all similar word vectors, and, based on the training data we provide, predict which words match our search patterns.
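The first two steps, tokenizing and building the table, can be sketched in a few lines. Everything here is an assumption for illustration: the tokenize helper is a simplification that drops whitespace instead of keeping it as tokens, and the three-number vectors are hand-picked, whereas a real pre-trained model supplies vectors with hundreds of dimensions.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical word vectors, hand-picked for this example only.
toy_vectors = {
    "runner": [0.9, 0.8, 0.1],
    "running": [0.8, 0.9, 0.2],
    "river": [0.1, 0.2, 0.9],
}

# Tokens missing from the toy vocabulary get an all-zero vector.
UNKNOWN = [0.0, 0.0, 0.0]

tokens = tokenize("The runner is running.")

# The "table": one row per token, pairing it with its vector.
table = [(token, toy_vectors.get(token.lower(), UNKNOWN)) for token in tokens]

for token, vector in table:
    print(token, vector)
```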

Again, we can see how NLP extends Regex. Regex adds operators to let us further refine a search. NLP adds word vectors, which give us a practically unlimited combination of search dimensions to further refine our search. This means we can search far beyond simple character and punctuation sequence matches.
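Here is a rough sketch of what searching by word vector can look like. It ranks words by cosine similarity, a common way to compare vectors. The vectors and the most_similar helper are made up for illustration; they do not come from any real model or library.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-picked for this example.
toy_vectors = {
    "runner": [0.9, 0.8, 0.1],
    "jogger": [0.85, 0.75, 0.15],
    "sprinting": [0.7, 0.9, 0.2],
    "river": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, vocabulary, top_n=2):
    """Rank every other word by similarity to the query word."""
    scores = [
        (word, cosine_similarity(vocabulary[query], vector))
        for word, vector in vocabulary.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

ranked = most_similar("runner", toy_vectors)
print([word for word, _ in ranked])  # ['jogger', 'sprinting']
```

Note that "jogger" shares few characters with "runner", so neither Find nor the Regex pattern above would connect them. The match comes entirely from the numbers.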

With this approach, you typically don’t perform manual searches. You train a neural network (a convolutional neural network, for example) to recognize similar and dissimilar patterns in word vectors, using the tools of a particular NLP library. Over many rapid training passes, the network adjusts itself across the search variables (“dimensions”), learning what is and is not a correct match.
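To make the training idea concrete, here is a deliberately tiny sketch. It is not a convolutional neural network: it swaps in a single perceptron, the simplest trainable unit, so the whole loop fits on one screen. The vectors, labels, and function names are all made up for illustration.

```python
# Hypothetical training data: (word vector, label).
# Label 1 means "related to running", label 0 means "not related".
training_data = [
    ([0.9, 0.8, 0.1], 1),  # "runner"
    ([0.8, 0.9, 0.2], 1),  # "running"
    ([0.1, 0.2, 0.9], 0),  # "river"
    ([0.2, 0.1, 0.8], 0),  # "ocean"
]

def predict(weights, bias, vector):
    """Return 1 if the weighted sum is positive, otherwise 0."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1 if score > 0 else 0

def train(data, epochs=20, learning_rate=0.1):
    """Nudge the weights after every wrong guess (perceptron rule)."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for vector, label in data:
            error = label - predict(weights, bias, vector)
            weights = [
                w + learning_rate * error * x
                for w, x in zip(weights, vector)
            ]
            bias += learning_rate * error
    return weights, bias

weights, bias = train(training_data)

# A vector the model never saw during training (a made-up "jogger").
unseen = [0.85, 0.75, 0.15]
print(predict(weights, bias, unseen))  # 1
```

Each wrong guess shifts the weights a little, which is the trial-and-error loop in miniature. A real network has vastly more weights and needs far more data, but the learn-from-mistakes cycle is the same idea.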

It might take you 5–10 seconds to perform one search within a document. Imagine a world where you can execute a single search hundreds of thousands of times per millisecond.

Think about how quickly you would begin to recognize patterns within a given text if you could cycle through every possible answer that fast.

Finally, picture how fast you could begin to infer facts from the data within the text if you could perform that many computations at once. And how quickly you would improve at getting correct answers if you could cycle through decades of trial and error in an instant.

Congratulations, you now understand Natural Language Processing and Machine Learning!

In the end, it’s just Find & Replace with way more options.

Postscript: My apologies in advance for any concepts I misrepresented or poorly explained above. I'm learning as I go and talking through things as I learn. If I got any of this wrong, please offer corrections in the comments!