Deep Learning experiments on News Classification Dataset with Keras, Tensorflow and Azure ML

Source: Deep Learning on Medium

Headline: The Audacity Of “Nope” Or, Why A Trump Presidency Is No Surprise.

This project was done as a course project for Neural Networks by Adil Yatkin (text processing), Sadiq Eyvazov (multimodal learning) and Valeh Farzaliyev (image processing).


Classification is a challenging task in text mining because it requires preprocessing steps that convert unstructured data into structured information. As the volume of news grows, it becomes harder for users to find the news they are interested in, so news needs to be categorized to be easily accessible. Categorization means grouping articles so that navigating among them is easier. With online news divided into categories, users can reach the stories they care about in real time without wasting time. In this project we classify news according to headlines, short descriptions and headline images.


There are plenty of ready-made datasets available on the internet nowadays. We chose the “News Category Dataset” uploaded to Kaggle by Rishabh Misra. The dataset can be found here.

This dataset contains around 200,000 news headlines from 2012 to 2018 obtained from HuffPost. Each news headline has a corresponding category. There are 41 different categories.

Distribution of news headlines per category.

In general, most text classification examples on the internet use data with two categories, such as spam email filtering or sentiment analysis (IMDB movie reviews, positive or negative). By contrast, the dataset we are using is suited to multi-class text classification.

Here is a sample news item:

Headline: There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV
Authors: Melissa Jeltsen
Short description: She left her husband. He killed their children. Just another day in America.

Project Overview

The aim of this project is to classify news into categories based on headline, short description and cover image using different neural network models. For text classification we used 4 different models, 3 of which use pretrained embeddings: ELMo, BERT and NN-LM. Similarly, for image-based classification we trained one simple AlexNet-like CNN model and one model on pretrained VGG16 features.

Later, we concatenated text and image features to train multimodal networks and compared results.

Because of the large amount of data and the limited computational power of personal laptops, we had difficulties training large models. The most common problem was allocating enough memory to store the weight matrices, which raised memory errors because the number of parameters was on the order of a billion. Therefore, in the local environment we had to shrink the train and test sets (10K news, 5 classes). We did manage to run the mentioned models on the full dataset with Azure Machine Learning Service, which offers one Tesla K80 (12GB) free for 1 month. However, the ELMo and BERT models took an extremely long time even on Azure.
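The shrinking step can be sketched as follows. This is only an illustration: the article does not say how the 5 classes were picked, so keeping the most frequent categories is an assumption, and `subsample` is a hypothetical helper name. The `category` field is the one used in the dataset.

```python
import random
from collections import Counter

def subsample(items, num_classes=5, sample_size=10_000, seed=0):
    """Keep only the most frequent categories, then draw a random sample.

    Assumption: 'most frequent' is our guess at the selection rule.
    """
    counts = Counter(item["category"] for item in items)
    keep = {category for category, _ in counts.most_common(num_classes)}
    pool = [item for item in items if item["category"] in keep]
    random.Random(seed).shuffle(pool)  # seeded so the sample is reproducible
    return pool[:sample_size]

# Tiny stand-in for the real 200K-item dataset.
categories = ["POLITICS"] * 6 + ["SPORTS"] * 4 + ["TRAVEL"] * 1
news = [{"headline": f"h{i}", "category": c} for i, c in enumerate(categories)]
small = subsample(news, num_classes=2, sample_size=5)
```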

Text Processing

Simple Word Embedding

The simplest idea is to train directly on the textual input. However, the length of each news headline may vary, so it is better to use a recurrent layer rather than only dense layers. Besides, we cannot give raw words as input to neural networks. The most widespread solution is to map each word to a number using text preprocessors.

In Keras, there is an Embedding layer which associates a vector with each input index, given the vocabulary length. After it, we add 1 LSTM and 3 Dense layers.

This is what our model looks like:

Then we train for 5 epochs with batch size 128. We have to transform the train labels to one-hot vectors with the to_categorical() method in order to use categorical cross-entropy loss.
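To make the word-to-number mapping concrete, here is a minimal plain-Python sketch. It is not the preprocessor used in the project; the helper names `build_vocab` and `encode`, and the choice of reserving id 0 for padding and id 1 for unknown words, are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(texts, max_words=None):
    """Map each word to an integer id, most frequent words first.

    Ids 0 and 1 are reserved (padding and unknown word), so ids start at 2.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(max_words)]
    return {word: idx for idx, word in enumerate(most_common, start=2)}

def encode(text, vocab, max_len):
    """Turn a headline into a fixed-length id list: truncate, then left-pad with 0."""
    ids = [vocab.get(word, 1) for word in text.lower().split()][:max_len]
    return [0] * (max_len - len(ids)) + ids

headlines = ["Trump wins election", "Team wins cup final"]
vocab = build_vocab(headlines)
encoded = [encode(h, vocab, max_len=5) for h in headlines]
```

Every headline now has the same length, which is what an embedding layer followed by a recurrent layer expects as a batch.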

history = model.fit([X_train], batch_size=128, y=to_categorical(y_train), verbose=1, validation_split=0.25,
                    shuffle=False, epochs=5)
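For readers unfamiliar with one-hot encoding, the sketch below shows in plain Python what kind of output to_categorical() produces for integer labels. The function `one_hot` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
def one_hot(labels, num_classes=None):
    """Turn integer class labels into one-hot rows, e.g. 2 -> [0.0, 0.0, 1.0]."""
    if num_classes is None:
        num_classes = max(labels) + 1  # infer the class count from the labels
    return [[1.0 if j == label else 0.0 for j in range(num_classes)]
            for label in labels]

y_train = [0, 2, 1, 2]
y_onehot = one_hot(y_train)
```

Categorical cross-entropy compares each such row with the softmax output of the network, which is why the labels need this shape.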

The simple model gave us 98% training, 77% validation and 78% test accuracy. As can be seen, the model is overfitted.

When we ran the same model with 200K news and 41 classes on Azure ML, we achieved 99% train, 52% validation and 40% test accuracy. Clearly, the simple word embedding is not very useful on its own. To address the overfitting problem we used regularizers, dropout and hyperparameter tuning. We’ll talk about it later.


NN-LM

NN-LM is a text embedding based on feed-forward Neural-Net Language Models with pre-built OOV. It maps text to 128-dimensional embedding vectors. It is a token-based text embedding trained on the English Google News 200B corpus.

NN-LM is available in tensorflow-hub. The code snippet below shows how to use it.

Keras allows us to use custom embedding methods through Lambda layers. However, there is a dimension mismatch between the output of the Lambda layer and the rest of the network. A Reshape layer in Keras solves this problem.

The overall network is the same as the previous simple model except for the embedding part: 1 LSTM and 3 Dense layers following the NN-LM embedding.
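The shape problem can be illustrated with NumPy alone. We assume here that the mismatch is the usual one: the embedding yields one 128-dimensional vector per headline, shape (batch, 128), while an LSTM expects a 3D input of shape (batch, timesteps, features). The array below is dummy data standing in for real embeddings.

```python
import numpy as np

batch_size, embed_dim = 4, 128

# Stand-in for the embedding output: one 128-d vector per headline.
sentence_vectors = np.zeros((batch_size, embed_dim), dtype=np.float32)

# Insert a timestep axis of length 1 so a recurrent layer can consume it.
lstm_input = sentence_vectors.reshape(batch_size, 1, embed_dim)
```

A Reshape layer with target shape (1, 128) would perform the same operation inside the network, applied per sample.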

LSTM with NN-LM embedding

Here we trained for 10 epochs using batch size 64.

history = model.fit([X_train], batch_size=64, y=to_categorical(y_train), verbose=1, validation_split=0.25,
                    shuffle=False, epochs=10)

This model gave us 78% train, 76% validation and 76% test accuracy. We already have a reasonable model without overfitting, which shows the effect of using pre-trained embeddings.

LSTM with NN-LM model history

Accuracies dropped when we trained on the full dataset: 53% train, 51% validation and 52% test. Note that here we have 41 classes, so this result is acceptable for now.


ELMo

ELMo is a word representation technique proposed in early 2018 by researchers at the Allen Institute for AI, the group behind AllenNLP. Unlike traditional word embedding methods, ELMo is dynamic, meaning that ELMo embeddings change depending on the context even when the word is the same.

ELMo representations are:

  • Contextual: The representation for each word depends on the entire context in which it is used.
  • Deep: The word representations combine all layers of a deep pre-trained neural network.
  • Character based: ELMo representations are purely character based, allowing the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training.


Usually, ELMo is used within the AllenNLP framework, which uses PyTorch instead of Keras. AllenNLP provides small, medium and large (original) pretrained ELMo models. We tested the medium and large models.

AllenNLP is easy to use. First, the input data has to be parsed and transformed into torch tensors.

Then we used the model explained in the tutorials. As a result, train accuracy is 88%, validation accuracy is 82% and test accuracy is 83%.

This model performs better than the previous two models. However, we were planning to train a multimodal network using text and image features, and since we used Keras (TensorFlow backend) for the image part, we realized that mixing frameworks would be a problem. That’s why we rebuilt the model in Keras. Similar to NN-LM, there is an ELMo embedding in tensorflow-hub.

How to use ELMo with tensorflow-hub
LSTM with ELMo embedding

Unsurprisingly, we get almost the same result: 88% train, 82% validation, 81% test accuracy.

It took around 8 hours to train this model on Azure ML service (one K80 12GB GPU). The results are 68% train, 60% validation and 59% test accuracy for 200K news and 41 classes. So far, this model has the highest accuracy.

Overfitting on Simple Word Embedding (our base model)

Although ELMo gives the best result, we thought that we could get close to it with the base model if overfitting were resolved. To do so, we used three techniques: dropout, hyperparameter tuning and regularization.

To save time, each model is trained with 1000 samples. For this set of samples, the simple model gives 98% train, 49% validation and 52% test accuracy. The gap between train and validation accuracy is nearly 50 percentage points.

First, we added a dropout layer to the base model and trained with different dropout rates.

Here is how accuracy and loss depend on the dropout rate:

Accuracy-Dropout and Loss-Dropout graphs

The best dropout rate is 0.2. Now the difference between train and validation accuracy is 33% and test accuracy is 57%. The gap shrank by about one third.

Next, in the fine-tuning stage, we found the best learning rate to be 0.001 using the same approach. Later, the optimizer was changed to RMSprop, which in general is a good choice for recurrent neural networks. This change slightly reduced the overfitting gap to 31%.

Lastly, we tried L1 and L2 regularization. The model didn’t learn at all with L1 (26% accuracy). L2, applied on top of RMSprop, increased the overfitting gap by 8% while increasing test accuracy by ~10%.

Finally, combining these results, we trained the base model again on 10K news with the parameters we found. Remember that there was a 21% difference between train and validation accuracy before. Now, after 4 epochs, the model achieved 82% training and 75% validation accuracy. There is only a 6% difference from ELMo.
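As a reminder of what a dropout rate of 0.2 does during training, here is a small NumPy sketch of the common "inverted dropout" formulation. It is an illustration of the technique, not the code of the Keras layer; the function name and the dummy activations are assumptions.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero out roughly `rate` of the units and rescale the survivors.

    Rescaling by 1 / (1 - rate) keeps the expected activation unchanged,
    so nothing needs to be adjusted at inference time.
    """
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = unit is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((1000, 64))
dropped = dropout(x, rate=0.2, rng=rng)
```

With rate 0.2 about one unit in five is silenced on every training step, which discourages the network from relying on any single feature.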
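The two penalties differ only in how they measure weight size. The plain-Python sketch below shows the terms that are added to the loss; the function names and the example coefficient 0.01 are illustrative assumptions, not the values used in the experiments.

```python
def l1_penalty(weights, lam):
    """L1 regularization term: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

weights = [1.0, -2.0, 3.0]
l1_term = l1_penalty(weights, lam=0.01)
l2_term = l2_penalty(weights, lam=0.01)
```

L1 pushes many weights to exactly zero, which can be too aggressive for a small model. That may be one reason the L1 run failed to learn, although the article does not investigate the cause.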

Simple Word Embedding results after dropout, hypertuning and L2 regularization.

Image Processing

Initially, images were not given in the dataset, but we found a way to grab images from the given links. Obviously, it would take a long time to visit 200K links, parse the HTML content and download the image referenced in a meta tag. This can be done in parallel using multiple threads.

After downloading an image, we add an image property to the original object and later save it. I wish everything was that easy. Besides taking very long and using almost all of the laptop's computational resources, the process exposed problems within the dataset: some links were broken, some were invalid, and some news items had cover images in GIF format, which is not suitable for training alongside the other images. In the end, 56264 files were downloaded out of 200853. Although the size was specified as 600×600 while fetching, there were images with different sizes and formats. After cleaning up unwanted files and resizing, 49K remained. However, due to memory errors we couldn’t save the final JSON which contained the relations between image names and news items, so we had to re-run the code and do everything again. To avoid wasting that much time again, we fetched images only for the 10K sample dataset.
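A minimal sketch of this fetching step, using only the standard library, might look like the following. Two things are assumptions: that the cover image sits in the `og:image` meta tag (the article only says "meta tag"), and that page download is handled by a caller-supplied `fetch_html` function, which keeps the example runnable without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

class OgImageParser(HTMLParser):
    """Remember the content of the first <meta property="og:image"> tag."""
    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image" and self.image_url is None:
            self.image_url = attrs.get("content")

def extract_image_url(html):
    parser = OgImageParser()
    parser.feed(html)
    return parser.image_url

def fetch_all(links, fetch_html, max_workers=16):
    """Resolve image URLs for many links in parallel; failed links map to None."""
    def worker(link):
        try:
            return link, extract_image_url(fetch_html(link))
        except Exception:  # broken or invalid link
            return link, None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(worker, links))

# Fake pages instead of real HTTP requests.
fake_pages = {
    "a": '<html><head><meta property="og:image" content="https://example.com/a.jpg"></head></html>',
    "b": "<html><head></head></html>",
}
result = fetch_all(["a", "b", "missing"], fake_pages.__getitem__)
```

Threads suit this job because the work is dominated by waiting on the network, not by computation.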


AlexNet

AlexNet is one of the simplest common CNN models. Our AlexNet-like model contains 2 CNN layers, 1 MaxPool, 2 Dense layers and Batch Normalization in between the CNN layers.


600×600 is still too big for training a CNN. We had to resize the images to 64×64. Before feeding the input to the model, we normalized it using mean/std.

X_train_norm = (X_train - np.mean(X_train, axis=0)) / np.std(X_train, axis=0)
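A slightly more defensive variant of this normalization is sketched below. It adds a small epsilon, because the division above fails for any pixel position that is constant across the training set, and it reuses the training statistics for the test set. The function name, the epsilon value and the random stand-in images are assumptions for illustration.

```python
import numpy as np

def standardize(train, test, eps=1e-7):
    """Per-pixel standardization using statistics from the training set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # eps avoids division by zero where a pixel never varies
    return (train - mean) / (std + eps), (test - mean) / (std + eps)

rng = np.random.default_rng(0)
# Random stand-ins for 64x64 RGB images.
X_train = rng.integers(0, 256, size=(100, 64, 64, 3)).astype(np.float32)
X_test = rng.integers(0, 256, size=(20, 64, 64, 3)).astype(np.float32)
X_train_norm, X_test_norm = standardize(X_train, X_test)
```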


history = model.fit(X_train_norm, batch_size=64, y=to_categorical(y_train), verbose=1, validation_split=0.10,
                    shuffle=False, epochs=5)

This model gives 53% train, 35% validation and 43% test accuracy. These results are unstable and differ a lot.


VGG16

VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model achieves 92.7% top-5 test accuracy on ImageNet, which is a dataset of over 14 million images belonging to 1000 classes. It was one of the famous models submitted to ILSVRC-2014. It improves on AlexNet by replacing large kernel-sized filters (11 and 5 in the first and second convolutional layer, respectively) with multiple 3×3 kernel-sized filters one after another.

VGG16 model visualization

Instead of training VGG16 on our dataset, we simply used pretrained weights in Keras.

include_top = False allows us to get the features before the dense layers, so the output shape is 7x7x512 (for the default 224×224 input). We use these features to train on our dataset by adding 2 Dense layers and a softmax activation.
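The 7x7x512 figure follows from the architecture: VGG16 has five 2×2 max-pooling stages, each halving height and width, and its last convolutional block has 512 channels. The small helper below (a hypothetical name, written for illustration) reproduces that arithmetic and shows how the output shrinks for smaller inputs.

```python
def vgg16_feature_shape(height, width):
    """Output shape of VGG16 without its dense top for a given input size.

    The 3x3 convolutions keep the spatial size; only the five pooling
    stages shrink it, each by a factor of 2.
    """
    for _ in range(5):
        height //= 2
        width //= 2
    return (height, width, 512)

default_shape = vgg16_feature_shape(224, 224)
small_shape = vgg16_feature_shape(64, 64)
flattened_size = default_shape[0] * default_shape[1] * default_shape[2]
```

Flattening the default output gives 25088 values per image, which is what the added Dense layers receive.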

history = model.fit(X_train_vgg, batch_size=64, y=to_categorical(y_train), validation_split=0.10, epochs=5)  # X_train_vgg: assumed name for the extracted VGG16 features

Results: 57% train, 43% validation and 53% test accuracy.

There are several other pretrained models in Keras that are also trained on ‘imagenet’, such as InceptionResNet and ResNet. We also tried to extract features from them and train the same model as above. However, the output of those networks is so large that it was impossible to use with 10K samples. Sometimes we used 1000 samples or fewer, but then a comparison would not have been meaningful, so we stopped using them.

How to decide news classes from these images?

Finally, from the results of AlexNet and VGG16, we see that images alone are not helpful for classifying news. This is due to the content of the images: when we look at them, we see that there are not enough features that identify the respective category. For example, a model can learn sports news from a majority of green pixels in the image, or political news from the face of Donald Trump and other public figures. However, there are images in other categories which contain only text or logos. If we add context-related features, we can expect accuracy scores to increase.

Multimodal Deep Learning

As we have seen, pretrained embeddings and pretrained CNN models behaved better than our naïve approaches. Therefore, we trained NN-LM + VGG16 and ELMo + VGG16 multimodal architectures.

history = model.fit([train_text, train_image], batch_size=64, y=to_categorical(y_image), validation_split=0.10, epochs=4)

The result for NN-LM + VGG16 is 93% train, 57% validation and 58% test accuracy. Similarly, we get 94% train, 61% validation and 59% test accuracy with ELMo + VGG16. Clearly, adding images improved training accuracy while increasing overfitting. Thus, we cannot say that training on both feature types together helped.
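The fusion of the two inputs can be pictured with NumPy. This is only a sketch under assumptions: we take 128-dimensional text vectors (as with NN-LM) and 7x7x512 VGG16 features, flatten the image features and join both along the feature axis. Where exactly the real network concatenates its branches is not stated in the article, and the arrays are dummy data.

```python
import numpy as np

num_samples = 8
text_features = np.zeros((num_samples, 128), dtype=np.float32)          # e.g. NN-LM vectors
image_features = np.zeros((num_samples, 7, 7, 512), dtype=np.float32)   # e.g. VGG16 features

# Flatten each image's feature map, then concatenate per sample.
image_flat = image_features.reshape(num_samples, -1)
fused = np.concatenate([text_features, image_flat], axis=1)
```

Note the imbalance: 25088 image values sit next to 128 text values per sample, so the image branch contributes far more parameters to the layers that follow.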


In this project we looked at the advantages and disadvantages of multimodal learning. Although we got bad results, we know that this mostly originates from the inconsistent data source. In addition, for lack of computational resources (even on Azure ML) we couldn’t use more advanced models such as BERT (still running 😂) for text processing or ResNet for image processing. Overall, we see that image features worsen the better-performing text models.

We thank Ardi Tampuu for the fruitful discussion regarding the project topic.

Check the GitHub repo for the full code: