Original article was published on Artificial Intelligence on Medium
What is the Difference Between CNN and RNN?
Convolutional Neural Networks and Recurrent Neural Networks
In machine learning, each type of artificial neural network is tailored to certain tasks. This article introduces two types of neural networks: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Using popular YouTube videos and visual aids, we will explain the difference between CNNs and RNNs and how they are used in computer vision and natural language processing.
What is the Difference Between CNN and RNN?
The main difference between a CNN and an RNN is the ability to process temporal information, or data that comes in sequences, such as a sentence. The two networks are also typically used for different purposes, and their structures differ to fit those use cases.
CNNs employ filters within convolutional layers to transform data. RNNs, by contrast, reuse activations computed for other data points in the sequence to generate the next output in a series.
While it is a frequently asked question, once you look at the structure of both neural networks and understand what they are used for, the difference between CNN and RNN will become clear.
To begin, let’s take a look at CNNs and how they are used to interpret images.
What is a Convolutional Neural Network?
Convolutional neural networks are one of the most common types of neural networks used in computer vision to recognize objects and patterns in images. One of their defining traits is the use of filters within convolutional layers.
These convolutional layers are what distinguish CNNs from RNNs and other neural networks.
Within a convolutional layer, the input is transformed before being passed to the next layer. A CNN transforms the data by using filters.
What are Filters in Convolutional Neural Networks?
A filter in a CNN is simply a small matrix of numbers, initialized with random values, like in the diagram below.
The number of rows and columns in the filter can vary and is dependent on the use case and data being processed. Within a convolutional layer, there are a number of filters that move through an image. This process is referred to as convolving. The filter convolves the pixels of the image, changing their values before passing the data on to the next layer in the CNN.
How do Filters Work?
To understand how filters transform data, let’s take a look at how we can train a CNN to recognize handwritten digits. Below is an enlarged version of a 28 x 28 pixel image of the number seven from the MNIST dataset.
Below is the same image converted into its pixel values.
As the filter convolves through the image, the matrix of values in the filter lines up with a patch of pixel values of the same size, and the dot product of the two is taken.
The filter moves or ‘convolves’ through each 3 x 3 patch of pixels until all the pixels have been covered. The result of each dot product is then used as input for the next layer.
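To make the sliding dot product concrete, here is a minimal sketch in plain Python. The tiny 5 x 5 image, the hand-picked vertical-edge filter, and the function name convolve2d are all assumptions made for illustration; a real CNN would learn its filter values and would rely on an optimized library rather than nested loops.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Dot product of the filter with the patch whose top-left corner is (r, c).
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 5 x 5 "image": dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-picked 3 x 3 filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)  # each row is [3.0, 3.0, 0.0]
```

The output is large where the patch contains the edge and zero where the patch is uniformly bright. With no padding, a 3 x 3 filter turns this 5 x 5 input into a 3 x 3 output; by the same arithmetic, a 28 x 28 MNIST image would produce a 26 x 26 output.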
Initially, the values in the filter are randomized, so the output of the first passes isn’t very useful. After each training iteration, the CNN automatically adjusts these values to reduce the error measured by a loss function. As the training progresses, the CNN continuously adjusts the filters. By adjusting these filters, it becomes able to distinguish edges, curves, textures, and other patterns and features of the image.
While this is an amazing feat, in order to compute the loss, a CNN needs to be given examples of correct output in the form of labeled training data.
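The adjustment process can be sketched with a toy example. The code below is an illustration under stated assumptions, not how a production CNN is trained: it uses a single 3 x 3 filter, a random 6 x 6 image, mean squared error as the loss, and plain gradient descent, whereas real networks backpropagate through many layers and usually optimize a classification loss. The names true_kernel, target, and learning_rate are hypothetical, and the target feature map stands in for labeled training data.

```python
import random


def convolve2d(image, kernel):
    """Valid convolution (stride 1, no padding) with a square kernel."""
    k = len(kernel)
    size = len(image) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(size)]
            for r in range(size)]


def mse(pred, target):
    """Mean squared error between two feature maps of the same shape."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row)) / n


random.seed(0)
image = [[random.random() for _ in range(6)] for _ in range(6)]

# Hypothetical "correct" filter; its output plays the role of labeled data.
true_kernel = [[-1.0, 0.0, 1.0] for _ in range(3)]
target = convolve2d(image, true_kernel)

# The filter being trained starts out with random values.
kernel = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
learning_rate = 0.05
losses = []

for step in range(200):
    pred = convolve2d(image, kernel)
    losses.append(mse(pred, target))
    rows, cols = len(pred), len(pred[0])
    n = rows * cols
    for i in range(3):
        for j in range(3):
            # Gradient of the mean squared error with respect to kernel[i][j].
            grad = sum(2 * (pred[r][c] - target[r][c]) * image[r + i][c + j]
                       for r in range(rows) for c in range(cols)) / n
            kernel[i][j] -= learning_rate * grad

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

The loss recorded at the last step is lower than at the first, which is the whole point: the filter values drift away from their random start toward values that reproduce the labeled output.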
When transfer learning can’t be applied, many convolutional neural networks require very large amounts of labeled data.
Still having trouble wrapping your head around CNNs? Below is an excellent, but lengthy, video lecture from Jeremy Howard at fast.ai. The video illustrates how CNNs work in detail:
Where CNNs Fall Short
CNNs are great at interpreting visual data and data that does not come in a sequence. However, they are not so great at interpreting temporal information such as videos (which are essentially a sequence of individual images) and blocks of text.
Entity extraction in text is a great example of how data in different parts of a sequence can affect each other. The words that come before and after an entity in a sentence have a direct effect on how that entity is classified. In order to deal with temporal or sequential data, like sentences, we have to use algorithms that are designed to learn from past data and ‘future data’ in the sequence. Luckily, recurrent neural networks do just that.
What is a Recurrent Neural Network?
Recurrent neural networks are designed to interpret temporal or sequential information. RNNs use other data points in a sequence to make better predictions. They do this by taking in input and reusing the activations computed at earlier or later positions in the sequence to influence the output. As mentioned previously, this is important in tasks like entity extraction. Take, for example, the following text:
President Roosevelt was one of the most influential presidents in American history. However, Roosevelt Street in Manhattan was not named after him.
In the first sentence, Roosevelt should be labeled as a person entity. In the second sentence, however, it should be labeled as a street name or a location. Making these distinctions is not possible without taking into account the word before it in the first sentence, “President,” and the word after it in the second, “Street”.
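The idea of using words on both sides can be sketched as a tiny bidirectional RNN in plain Python. Everything here is an assumption made for illustration: the weights and word vectors are random and untrained, the sizes HIDDEN and EMBED are arbitrary, and the helper names rnn_pass and bidirectional are hypothetical. The sketch only shows that the representation of “roosevelt” changes with its neighbors; it does not classify anything.

```python
import math
import random

random.seed(0)
HIDDEN = 4  # size of the hidden state (arbitrary)
EMBED = 3   # size of each word vector (arbitrary)


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_pass(inputs, w_x, w_h):
    """Run a simple RNN over `inputs` and return the hidden state at every step."""
    h = [0.0] * HIDDEN
    states = []
    for x in inputs:
        from_input = matvec(w_x, x)
        from_memory = matvec(w_h, h)  # activations carried over from earlier steps
        h = [math.tanh(a + b) for a, b in zip(from_input, from_memory)]
        states.append(h)
    return states


def bidirectional(tokens, embeddings, forward_weights, backward_weights):
    """Concatenate a left-to-right pass with a right-to-left pass for each token."""
    inputs = [embeddings[token] for token in tokens]
    forward = rnn_pass(inputs, *forward_weights)
    backward = rnn_pass(inputs[::-1], *backward_weights)[::-1]
    return [f + b for f, b in zip(forward, backward)]


sentence_a = ["president", "roosevelt", "was"]
sentence_b = ["roosevelt", "street", "in"]

# Random, untrained word vectors and weights (hypothetical values).
vocabulary = sorted(set(sentence_a + sentence_b))
embeddings = {word: [random.uniform(-1, 1) for _ in range(EMBED)]
              for word in vocabulary}
forward_weights = (rand_matrix(HIDDEN, EMBED), rand_matrix(HIDDEN, HIDDEN))
backward_weights = (rand_matrix(HIDDEN, EMBED), rand_matrix(HIDDEN, HIDDEN))

state_a = bidirectional(sentence_a, embeddings, forward_weights, backward_weights)[1]
state_b = bidirectional(sentence_b, embeddings, forward_weights, backward_weights)[0]

print("roosevelt after 'president':", [round(v, 3) for v in state_a])
print("roosevelt before 'street':  ", [round(v, 3) for v in state_b])
```

The same word vector goes in both times, yet the two states differ. In the first sentence the forward pass has already seen “president”; in the second, the backward pass has already seen “street”. A trained network would map those different states to a person label and a location label.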
RNNs for Autocorrect
To dive a little deeper into how RNNs work, let’s look at how they could be used for autocorrect. At a basic level, autocorrect systems take the word you’ve typed as input. Using that input, the system makes a prediction as to whether the spelling is correct or incorrect. If the word doesn’t match any words in the database, or doesn’t fit in the context of the sentence, the system then predicts what the correct word might be. Let’s visualize how this process could work with an RNN:
The RNN would take in two sources of input. The first input is the letter you’ve typed. The second input would be the activations corresponding to the previous letters you typed. Let’s say you wanted to type “network,” but typed “networc” by mistake. The system takes in the activations of the previous letters “networ”, and the current letter you’ve inputted “c”. It then outputs “k” as the correct output for the last letter.
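Here is a rough sketch of that two-input step in plain Python. It is an illustration under assumptions, not a working autocorrect: the weights are random and untrained, so the predicted letter is arbitrary, and the names step, one_hot, W_xh, W_hh, and W_hy are hypothetical. What it does show is the structure: each step combines the current letter with the activations left over from the previous letters, then produces a probability for every letter of the alphabet.

```python
import math
import random
import string

random.seed(1)
ALPHABET = string.ascii_lowercase
HIDDEN = 8  # size of the hidden state (arbitrary)


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Untrained, random weights (hypothetical values).
W_xh = rand_matrix(HIDDEN, len(ALPHABET))  # current letter -> hidden state
W_hh = rand_matrix(HIDDEN, HIDDEN)         # previous hidden state -> hidden state
W_hy = rand_matrix(len(ALPHABET), HIDDEN)  # hidden state -> score per letter


def one_hot(letter):
    return [1.0 if ch == letter else 0.0 for ch in ALPHABET]


def step(letter, h_prev):
    """One RNN step: combine the typed letter with the previous activations."""
    x = one_hot(letter)
    h = [math.tanh(sum(w * v for w, v in zip(W_xh[i], x)) +
                   sum(w * v for w, v in zip(W_hh[i], h_prev)))
         for i in range(HIDDEN)]
    scores = [sum(w * v for w, v in zip(row, h)) for row in W_hy]
    # Softmax turns the scores into probabilities over the alphabet.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return h, probs


h = [0.0] * HIDDEN
for letter in "networc":
    h, probs = step(letter, h)

# After the final letter "c", probs is the model's guess for the corrected letter.
best = ALPHABET[probs.index(max(probs))]
print("most likely letter (untrained, so arbitrary):", best)
```

Training would adjust the three weight matrices so that, after seeing “networ” followed by “c”, the highest probability lands on “k”.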
This is just one simplified example of how RNNs could work for spelling correction. Today, data scientists use RNNs for far more ambitious tasks. From generating text and captions for images to creating music and predicting stock market fluctuations, RNNs have a wide range of potential use cases.