NLP with CNNs

Original article was published on AI Magazine

A step-by-step explanation, with a Keras implementation of the architecture.

Convolutional neural networks (CNNs) are among the most widely used deep learning architectures in image processing and image recognition. Given their dominance in the field of vision, it is only natural that they have been applied to other fields of machine learning as well. In this article, I will explain the important terminology regarding CNNs from a natural language processing perspective. A short Keras implementation with code explanations is also provided.

Sliding, or convolving, a pre-determined window over the data is the central idea that gives CNNs their name. This concept is illustrated below.

Image by author

The first thing to notice here is how each word (token) is represented as a 3-dimensional word vector. A 3×3 weight matrix is then slid horizontally across the sentence one step at a time (the step size is known as the stride), capturing three words at a time. This weight matrix is called a filter; the output of each filter is passed through an activation function, similar to those used in feed-forward neural networks. ReLU (rectified linear unit) is the most common choice of activation function in CNNs and deep neural nets, since it is cheap to compute and helps mitigate vanishing gradients. Going back to image classification, the general intuition behind these filters is that each filter can detect different features of an image, and the deeper the layer a filter sits in, the more complex the details it is likely to capture. For example, the very first filters in your convnet will detect simple features such as edges and lines, while the filters in the last layers might be able to detect certain animal types. None of these filters is hardcoded: backpropagation ensures that their weights are learned from the data.

The next important step is to calculate the output (the convolved feature). For the example below, we will consider a 5×5 image and a 3×3 filter (when dealing with CNNs you will mostly work with square matrices). As the filter slides over the image one stride at a time, each pixel in the current window is multiplied by its corresponding weight in the filter, and the products are summed to give one cell of the output layer. The example below illustrates how the first cell in the output layer is calculated; the red numbers in the image represent the weights in the filter.

Image by author

The calculation is as follows: (1∗2)+(1∗1)+(1∗0)+(0∗1)+(2∗0)+(0∗1)+(0∗0)+(2∗1)+(0∗4)=5

The Python code with the activation function would be:

import numpy as np
z_0 = max(np.sum(x * w), 0)  # x: input window, w: filter weights (NumPy arrays of the same shape)
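To make the sliding step concrete, here is a minimal NumPy sketch of a "valid" 2D convolution followed by ReLU. The figure is not reproduced here, so the values are assumptions: the top-left 3×3 patch and the filter weights are read off the calculation above (first factor as pixel, second as weight), and the rest of the 5×5 image is made up for illustration. The function name `convolve2d_valid` is hypothetical.

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    """Slide a square kernel over a square image (no padding) and apply ReLU."""
    n, f = image.shape[0], kernel.shape[0]
    out_size = (n - f) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r, c = i * stride, j * stride
            patch = image[r:r + f, c:c + f]
            # Element-wise multiply, sum, then ReLU.
            out[i, j] = max(np.sum(patch * kernel), 0)
    return out

# Only the top-left 3x3 patch comes from the worked calculation;
# the remaining pixels are hypothetical.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 2, 0, 1, 0],
                  [0, 2, 0, 0, 1],
                  [1, 0, 0, 2, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [0, 1, 4]])

feature_map = convolve2d_valid(image, kernel)
print(feature_map.shape)   # (3, 3)
print(feature_map[0, 0])   # 5.0, matching the calculation above
```

The explicit loops are slow but mirror the description one-to-one; real frameworks use optimized routines for the same arithmetic.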

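For a 2D filter, the size of the output layer along each dimension can be calculated using the following formula:

(N-F)/S + 1

where N = size of the image, F = size of the filter and S = stride (1 in our case). For the 5×5 image and 3×3 filter above, this gives (5-3)/1 + 1 = 3, so the output is 3×3.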
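The formula is easy to wrap in a small helper. This is only a sketch: the name `conv_output_size` is hypothetical, and raising an error when the filter does not fit evenly is a design choice made here for illustration, not something the formula itself prescribes.

```python
def conv_output_size(n, f, s=1):
    """Output length along one dimension for input size n, filter size f, stride s (no padding)."""
    if (n - f) % s != 0:
        raise ValueError("filter does not fit the input evenly with this stride")
    return (n - f) // s + 1

print(conv_output_size(5, 3))     # 3   (the 5x5 image with a 3x3 filter)
print(conv_output_size(400, 3))   # 398 (a 400-token sentence with a 3-token window)
```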

When applied to text, you will be using a filter that spans three tokens and slides horizontally across the sentence, one stride at a time, in one dimension:

Image by author
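The same idea in one dimension can be sketched with NumPy. Everything here is assumed for illustration: a 7-token sentence with 3-dimensional random word vectors, a single random 3×3 filter, and the hypothetical helper name `conv1d_valid`.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = rng.normal(size=(7, 3))   # 7 tokens, each a 3-dimensional vector (made up)
filt = rng.normal(size=(3, 3))       # one filter covering 3 tokens x 3 dimensions

def conv1d_valid(seq, kernel, stride=1):
    """Slide the kernel along the token axis only; apply ReLU to each window sum."""
    width = kernel.shape[0]
    steps = (seq.shape[0] - width) // stride + 1
    return np.array([
        max(np.sum(seq[t * stride:t * stride + width] * kernel), 0.0)
        for t in range(steps)
    ])

features = conv1d_valid(sentence, filt)
print(features.shape)   # (5,): one value per window of three tokens
```

Because the filter is as tall as the word vectors, it can only move along the sentence, which is why this is called a 1D convolution even though the filter itself is a matrix.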


The last two examples resulted in an output that is smaller than the input. It also isn’t hard to imagine cases in which the filter doesn’t fit the matrix exactly for a given stride. There are two ways to handle these complications:

  1. Pad the outer edges with zero vectors (zero-padding)
  2. Ignore the part of the matrix that does not fit the filter (valid padding)
Image by author
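The effect of the two options on the output length can be checked with a short NumPy sketch. The 4-token input is made up, and padding one zero vector on each side is the amount needed for a kernel of size 3 with stride 1 (Keras refers to this option as 'same' padding).

```python
import numpy as np

tokens = np.arange(1, 13, dtype=float).reshape(4, 3)   # 4 tokens, 3 dimensions (made up)
kernel_size = 3

# Valid padding: only windows that fit entirely inside the input are used.
valid_len = tokens.shape[0] - kernel_size + 1           # 2

# Zero-padding: add zero vectors at both ends so the output keeps the input length.
pad = (kernel_size - 1) // 2
padded = np.pad(tokens, ((pad, pad), (0, 0)))            # pads with zeros by default
same_len = padded.shape[0] - kernel_size + 1             # 4

print(valid_len, same_len)   # 2 4
```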


Pooling is the CNN equivalent of dimensionality reduction. The central idea is to divide the output of a convolutional layer into subsections and calculate a single value that best represents each one. This is effective because it helps the algorithm learn higher-order representations of the data while shrinking the representation that later layers must process, which in turn reduces the number of parameters in those layers. Types of pooling:

  1. Sum pooling
  2. Max pooling
  3. Average pooling

Here is an example of max pooling:

Image by author
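A minimal sketch of 1D max pooling in NumPy follows. The feature values and the helper name `max_pool_1d` are made up; leftover values that do not fill a complete window are simply dropped here, which is one possible convention.

```python
import numpy as np

def max_pool_1d(values, pool_size=2):
    """Take the maximum of each non-overlapping window of pool_size values."""
    values = np.asarray(values)
    usable = (len(values) // pool_size) * pool_size   # drop an incomplete last window
    return values[:usable].reshape(-1, pool_size).max(axis=1)

features = [1, 3, 2, 9, 4, 4, 7]
print(max_pool_1d(features))   # [3 9 4]
print(max(features))           # 9, the result of pooling over the whole sequence at once
```

Swapping `.max(axis=1)` for `.sum(axis=1)` or `.mean(axis=1)` gives sum pooling and average pooling respectively.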

Fully connected layer

The fully connected layer at the end receives its input from the preceding convolutional and pooling layers and then performs the classification task. In our example, we will be classifying a 400-token sequence of words as either 1 (positive sentiment) or 0 (negative sentiment). The last neuron in the fully connected layer takes a weighted sum of the outputs of the 250 hidden neurons and passes it through a sigmoid function, which returns a value between 0 and 1.
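What the output neuron computes can be sketched in plain Python. The activations, weights and bias below are arbitrary placeholder values, not learned ones, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def output_neuron(activations, weights, bias=0.0):
    """Weighted sum of the hidden activations, passed through a sigmoid."""
    z = sum(a * w for a, w in zip(activations, weights)) + bias
    return sigmoid(z)

hidden = [0.5] * 250      # placeholder outputs of the 250 hidden neurons
weights = [0.01] * 250    # placeholder weights
p = output_neuron(hidden, weights)   # weighted sum is about 1.25 before the sigmoid
label = 1 if p > 0.5 else 0          # 1 = positive sentiment, 0 = negative
print(round(p, 3), label)
```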

Keras implementation

In this section, we will try to keep the code as general as possible for use cases in NLP. To keep things simple, we will not go into the details of data pre-processing, but the general procedure is to tokenize and vectorize the data. In our example, a word2vec embedding was used, with each token represented as a 300-dimensional word vector. Our data was also padded so that each sentence contained 400 tokens: long sentences were cut off after 400 tokens, and shorter sentences were zero-padded. The resulting dimension for each sentence is 400*300 (tokens by embedding dimensions). We then divide the data into x_train and x_test; we will not hold out a separate validation set for this project, so the test set doubles as validation data during training. Now that we have our data ready, we can define some hyperparameters.
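The padding and truncation step might look like the sketch below. It is an assumption about the pre-processing, which the article does not show: the function name `pad_or_truncate` is hypothetical, zero vectors are appended at the end of short sentences, and each sentence is taken to be a list of token vectors.

```python
def pad_or_truncate(vectors, maxlen=400, embedding_dims=300):
    """Return exactly maxlen token vectors: cut long sentences, zero-pad short ones."""
    clipped = vectors[:maxlen]
    padding = [[0.0] * embedding_dims for _ in range(maxlen - len(clipped))]
    return clipped + padding

# Tiny demonstration with 4-dimensional vectors and a maximum length of 3.
short = [[1.0] * 4 for _ in range(2)]
long = [[1.0] * 4 for _ in range(5)]
print(len(pad_or_truncate(short, maxlen=3, embedding_dims=4)))   # 3
print(len(pad_or_truncate(long, maxlen=3, embedding_dims=4)))    # 3
```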

## hyperparameters
maxlen = 400 # number of tokens per sentence after padding/truncation
batch_size = 32
embedding_dims = 300 # length of the token vectors
filters = 250 # number of filters in your convnet
kernel_size = 3 # a window size of 3 tokens
hidden_dims = 250 # number of neurons in the normal feed-forward NN
epochs = 2

Now we can start building the model using the Keras library.

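from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense, Dropout, Activation
model = Sequential()
model.add(Conv1D(filters, kernel_size, padding='valid', activation='relu', strides=1, input_shape=(maxlen, embedding_dims)))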

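Here we set the padding to 'valid', which means that the input size is not maintained: with 400 tokens, a kernel size of 3 and a stride of 1, each filter produces (400-3)/1 + 1 = 398 values, so the convolved output is of size 398*250. We then add a max pooling layer. Global max pooling keeps only the maximum value that each filter produced over the whole sequence, leaving 250 values; MaxPooling1D(n) would instead take the maximum within each window of n values, with n defaulting to 2.

# GlobalMaxPooling1D() pools over the whole sequence; the alternative MaxPooling1D(n) pools over windows, default n = 2.
model.add(GlobalMaxPooling1D())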

We then add the fully connected layer with a dropout rate of 0.2 (we use this to counter over-fitting). Lastly, the output neuron fires based on the sigmoid activation function. When computing accuracy, Keras counts any output below 0.5 as class 0 and anything above 0.5 as class 1.
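Putting the layers described so far together, the model could be assembled roughly as follows. This is a sketch reconstructed from the description, not the author's exact code: the order Dense, Dropout, ReLU, Dense(1), sigmoid is an assumption, and an explicit `Input` layer is used in place of the `input_shape` argument, which declares the same input shape.

```python
from keras.models import Sequential
from keras.layers import Input, Conv1D, GlobalMaxPooling1D, Dense, Dropout, Activation

maxlen, embedding_dims = 400, 300            # tokens per sentence, vector length
filters, kernel_size, hidden_dims = 250, 3, 250

model = Sequential()
model.add(Input(shape=(maxlen, embedding_dims)))
# 250 filters, each covering 3 tokens; 'valid' padding shortens the sequence to 398.
model.add(Conv1D(filters, kernel_size, padding='valid', activation='relu', strides=1))
# Keep the largest response of each filter over the whole sentence: 250 values.
model.add(GlobalMaxPooling1D())
# Fully connected hidden layer with dropout against over-fitting (assumed ordering).
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
# Single output neuron with a sigmoid: probability of positive sentiment.
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.summary()
```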


The final step is to compile and fit the model.

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test, y_test))

Now you can sit back and watch as your model trains. We were able to achieve 90% accuracy using 60% of Stanford’s training data. You can find more details in the 7th chapter of the book Natural Language Processing in Action.


  1. CNNs can be used for different classification tasks in NLP.
  2. A convolution slides a window over a larger input, focusing on one subset of the input matrix at a time.
  3. Getting your data in the right dimensions is extremely important for any learning algorithm.