Deep Learning Is Blowing Up OCR, and Your Field Could Be Next

Source: Deep Learning on Medium

OCR driven by Deep Learning can read text off tiny elements in an image. Photo credit: Gado Images. Output shown is from Google Vision API.

Imagine a computer that can read your handwriting (even if it’s as bad as mine). Or one that can read a tiny street sign in a grainy picture you snapped on your phone. Or better yet, one that can do this and immediately translate the results into 100 different languages.

In the last few years, all these things suddenly became possible. This is the power of modern, Deep Learning driven Optical Character Recognition (OCR).

OCR is the process of using machine vision, letter recognition and other techniques to automatically extract text from an image. The image can be a scan of a printed page, a photograph, or anything else with textual data that’s not already in a computer-readable format.

Back in the day, building an OCR system was fairly straightforward. You took a bunch of letters and figured out what made them look the way they do. Maybe you detected the edges of each letter, determined how the angles fit together, and then coded this into an OCR program. If the input was dirty, perhaps you threw in a thresholding function to clear things up.
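To make the thresholding step concrete, here is a minimal sketch in Python. It is an illustration, not the code of any particular OCR engine: the `threshold` function, the cutoff of 128, and the tiny 5x5 "scan" are all assumptions made up for this example.

```python
def threshold(image, cutoff=128):
    """Binarize a grayscale image given as rows of 0-255 values.

    Pixels darker than the cutoff are treated as ink (1),
    everything else as paper (0).
    """
    return [[1 if pixel < cutoff else 0 for pixel in row] for row in image]


# A made-up 5x5 "scan" of the letter L on an unevenly lit page.
scan = [
    [200, 40, 210, 220, 230],
    [190, 35, 205, 215, 225],
    [210, 50, 200, 180, 240],
    [220, 45, 195, 230, 235],
    [215, 30, 25, 40, 228],
]

binary = threshold(scan)

# Print the cleaned-up image: '#' for ink, '.' for paper.
for row in binary:
    print("".join("#" if pixel else "." for pixel in row))
```

A fixed cutoff like this only works when the lighting is even across the page. Classic pipelines often chose the cutoff from the image itself (for example, from its histogram) instead of hard-coding it.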

Early open source OCR systems like Tesseract, which grew out of efforts at Hewlett-Packard, used this painstaking, hands-on approach. They worked great, as long as what you were processing was professionally scanned, typewritten pages, ideally converted to grayscale with the levels and lighting properly balanced.

Unsurprisingly, not many actual data sources are that clean. And so until very recently, OCR didn’t work very well, except in some limited cases. It was bolted onto copier and scanner software, used to good effect in platforms like Evernote, and otherwise wrung for as much value as possible in the use cases where it actually worked. But for anything beyond clean text, accuracy dropped off, and fast.

A Better Way

Even early on, there were hints at a better way. While doing my degree in Cognitive Science (a mashup of neuroscience and computer science) at Johns Hopkins, I remember reading a paper — probably written in the 1980s — about using neural networks for OCR. The conclusion was essentially “Hey guys, neat, this might actually work.” It sat on a shelf for three decades.

To the 1980s’ credit, at the time neural networks were more a thought experiment than an actual usable tool. They existed mainly in theoretical academic papers, as computers at the time could only implement the most basic networks. Even in the ’00s at Hopkins, we still focused on perceptrons, backpropagation, and the theories behind NNs, but rarely coded up an actual one, much less put it to some productive use.
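For readers who have never seen one, a perceptron fits in a few lines of Python. The sketch below is a textbook-style illustration, not code from any paper or course mentioned here: the function names, the AND-gate training data, and the integer learning rate are all assumptions chosen for the example.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs plus the bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1):
    """Classic perceptron learning rule: nudge weights by the prediction error."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical training set: the logical AND of two binary inputs.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_gate]
print(predictions)  # [0, 0, 0, 1]
```

A single perceptron can only separate its inputs with a straight line. That is why reading something as messy as handwriting takes many stacked layers, trained with backpropagation, and far more computing power than was available at the time.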

The Deep Learning Revolution

Right around 2013, though, all that changed. No one is quite sure exactly why, but in the early 2010s there was an explosion of interest in Deep Learning, and a corresponding explosion in the availability of cheap computing power to run ever more complex neural networks. It’s been called the Deep Learning Revolution. Gone were the days of simple perceptrons you could introspect and (somewhat) understand. In were the days of complex networks with pre-processing of inputs and 100 hidden layers, running on cloud-based supercomputing clusters packed with GPUs.

As the Deep Learning Revolution swept the tech world, almost overnight OCR became good. Really good. Today’s neural network and machine learning-driven OCR systems, like the Google Vision API, can read text in a grainy photograph, even if it’s tiny, in a weird font, upside down, or partially obscured. They can read typewritten notes off the back of a 60-year-old photograph. They can even read handwriting scrawled in cursive with a fountain pen (yes, there are use cases for this capability).

Modern OCR reads a handwritten, cursive note on an archival photo box. Photo credit: Gado Images. Output shown is from Google Vision API.

How do modern Deep Learning-driven neural networks accomplish this? It could be by learning letter forms and variations on those forms. It could be through probabilistic analyses of which letters are likely to occur where, taking into account the context of the scene. It could be gnome magic.

The truth is that, in many cases, we really don’t know. Classic computer vision techniques for OCR were introspectable — you could understand the edge detection algorithm, perhaps export an intermediary image or two to debug, and generally have a good sense for how everything fit together. Neural networks, on the other hand, are so complex that they’re essentially a black box. You put photos of text in one end (with some processing) and get machine readable text out the other.

To paraphrase Linkin Park, in the end it doesn’t really matter. Most actual users don’t care how these systems work. To the user, the only thing that matters is how the system performs (and, of course, how much it costs). Deep Learning-based OCR excels on both counts.

Deep Learning In Practice

So how are these systems actually used? At my company, Gado Images, we’ve used them to process tens of thousands of scanned historical images, pulling out text from original captions and making whole catalogs of photographs instantly searchable. Banks use them to automatically pull text out of a snapshot of a check from your mobile phone, allowing for the magic of remote deposit. In a more mission-critical setting, they’re used by self-driving cars to detect and read road signs, and avoid swerving into trees. And of course, people still use them to parse good old-fashioned documents!

OCR may seem boring at first. The tech has been around, in some form, for over 100 years. Reading text off a page doesn’t have the same visceral interest (or creepiness factor) as recognizing faces, guiding robots, or many of the other feats computer vision can perform. But good OCR is an incredibly powerful capability, underlying many technologies we take for granted. Do you want your bank’s mobile app depositing your $5,000 consulting check into some random person’s account, or your Tesla reading “Slow: Bridge Out Ahead” as “Low Bridge Overhead”?

Reading street signs for self-driving cars is a mission-critical application of Deep Learning-driven OCR. Photo credit: Gado Images.

Beyond that, OCR is a microcosm of the rapidly evolving world of AI. For most of its 100+ year history, OCR worked in essentially the same way. Techniques got better, computers got faster, but nothing fundamental changed. And accuracy on messy, real-world input never really improved, either. Then, in a few short years, Deep Learning came along and blew everything up. A century of incremental improvements was beaten out by a few years of machine learning.

What Does it Mean For Me?

If you work in any AI field — or coding, medicine or the law, for that matter — Deep Learning has either already arrived or will be there shortly. Don’t be surprised if problems which have plagued your field for decades, especially ones involving repeatable patterns, are suddenly tractable. And if you’re in a job that relies on solving those problems the old-fashioned way, don’t be surprised if a neural network suddenly becomes your strongest competition.

In the end, Deep Learning isn’t something to be feared. OCR and the resulting searchable text make lots of jobs easier, from organizing photo archives to digitizing a handwritten sales pitch. But neural networks can’t preserve a physical photo or write that compelling pitch in the first place. And likewise, neural networks can help diagnose cancer, but they can’t interact with and treat a patient.

Deep Learning is a tool. It’s powerful, and there are some functions it can replace outright, like transcribing text from a document. But it’s even more powerful when it’s combined with, and leveraged by, smart humans. So even if you’re not a programmer, learn about Deep Learning. Delve into the tech, figure out how it’s being used in your field, and maybe even try it out for yourself. You might be surprised by how powerful a tool it is.