Real or Not? NLP with Disaster Tweets (Classification using Google BERT)

Original article was published by gautam iruvanti on Deep Learning on Medium

Table of Contents

  1. Project Overview
  2. Data Description
  3. Exploratory Data Analysis
  4. Feature Engineering
  5. Data Preprocessing
  6. Building the BERT model
  7. Results

Project Overview

Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring tweets, but it’s not always clear whether a person’s words are actually announcing a disaster.

Take the example of the tweet ‘On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE’. The author of the tweet explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid, but it’s less clear to a machine.

Kaggle¹ hosted this challenge on their platform. The dataset was created by the company Figure Eight² and originally shared on their ‘Data For Everyone’ website.

Data Description

What files do I need?

You’ll need train.csv, test.csv and sample_submission.csv.

What should I expect the data format to be?

Each sample in the train and test set has the following information:

  • The text of a tweet
  • A keyword from that tweet (although this may be blank!)
  • The location the tweet was sent from (may also be blank)

What am I predicting?

You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.


  • train.csv — the training set
  • test.csv — the test set
  • sample_submission.csv — a sample submission file in the correct format


  • id — a unique identifier for each tweet
  • text — the text of the tweet
  • location — the location the tweet was sent from (may be blank)
  • keyword — a particular keyword from the tweet (may be blank)
  • target — in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
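As a sketch, these files can be loaded and inspected with pandas. The snippet below uses a tiny inline sample with the same columns instead of the real train.csv, since the Kaggle file paths depend on your setup:

```python
import io
import pandas as pd

# Illustrative stand-in for train.csv (same columns, tiny sample).
sample_csv = """id,keyword,text,location,target
1,ablaze,On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE,,0
2,earthquake,Just felt a strong earthquake here stay safe everyone,California,1
3,,Happy birthday to my best friend!,,0
"""

# With the real files this would be pd.read_csv("train.csv")
train = pd.read_csv(io.StringIO(sample_csv))

print(train.shape)          # number of rows and columns
print(list(train.columns))  # the five columns described above
```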

Exploratory Data Analysis

  1. Null Values : both the keyword and location columns contain missing values, with a similar distribution of missing values in the train and test datasets.
  2. Target Distribution : a few keywords have a very high probability of belonging to a real disaster tweet (class = 1) in the training distribution. If the test dataset is also drawn from the train distribution, then we can use this information to improve our predictions.
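Both EDA steps reduce to a couple of pandas operations. The frame below is a small illustrative stand-in for the real train set:

```python
import pandas as pd

# Tiny illustrative frame standing in for the real train set.
train = pd.DataFrame({
    "keyword":  ["ablaze", "ablaze", "earthquake", "earthquake", None],
    "location": ["NYC", None, None, "California", None],
    "text":     ["t1", "t2", "t3", "t4", "t5"],
    "target":   [0, 1, 1, 1, 0],
})

# 1. Null values per column
null_counts = train.isnull().sum()
print(null_counts)

# 2. Probability of target == 1 for each keyword
keyword_target = train.groupby("keyword")["target"].mean().sort_values(ascending=False)
print(keyword_target)
```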

Feature Engineering

  1. What to do with the location and keyword columns?

Locations are not automatically generated; they are user inputs. As a result, the location column is very noisy and contains too many unique values, so it shouldn’t be used as a feature.

Fortunately, there is signal in keywords because some of those words can only be used in one context. Keywords have very different tweet counts and target means. keyword can be used as a feature by itself or as a word added to the text. Every single keyword in the training set exists in the test set, so if the training and test sets are from the same sample, it is also possible to use target encoding on keyword.
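Since every training keyword appears in the test set, target encoding the keyword column reduces to a groupby-mean lookup. A minimal sketch on toy data:

```python
import pandas as pd

# Toy stand-ins for the real train/test frames
train = pd.DataFrame({
    "keyword": ["ablaze", "ablaze", "earthquake", "earthquake"],
    "target":  [0, 1, 1, 1],
})
test = pd.DataFrame({"keyword": ["earthquake", "ablaze"]})

# Mean target per keyword, learned on the training set only
encoding = train.groupby("keyword")["target"].mean()

train["keyword_enc"] = train["keyword"].map(encoding)
test["keyword_enc"] = test["keyword"].map(encoding)  # safe: every test keyword exists in train
```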

Data Preprocessing

  1. Clean the text feature

The text feature in the train and test sets is noisy. One way to clean it is to remove:

  • URLs
  • emojis
  • HTML content
  • punctuation
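These four cleaning steps can be sketched with the standard re and string modules; the emoji pattern below covers only the most common Unicode blocks, not every emoji:

```python
import re
import string

def remove_url(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def remove_html(text):
    return re.sub(r"<.*?>", "", text)

def remove_emoji(text):
    # Common emoji Unicode blocks (emoticons, symbols, transport, flags)
    emoji_pattern = re.compile(
        "[\U0001F600-\U0001F64F"
        "\U0001F300-\U0001F5FF"
        "\U0001F680-\U0001F6FF"
        "\U0001F1E0-\U0001F1FF]+",
        flags=re.UNICODE,
    )
    return emoji_pattern.sub("", text)

def remove_punct(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def clean_text(text):
    # Apply each cleaning step in turn
    for step in (remove_url, remove_html, remove_emoji, remove_punct):
        text = step(text)
    return text

print(clean_text("Forest fire near La Ronge!! <b>details</b> http://t.co/xyz"))
```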

Building the BERT Model

Bidirectional Encoder Representations from Transformers (BERT) is a technique for NLP pre-training developed by Google. BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task.

BERT is a pre-trained Transformer Encoder stack. It is trained on Wikipedia and the BookCorpus dataset.

BERT introduced contextual word embeddings (one word can have a different meaning based on the words around it). The Transformer uses attention mechanisms to understand the context in which the word is being used. That context is then encoded into a vector representation. In practice, it does a better job with long-term dependencies.

Example of what BERT does:

Working of Google BERT

How BERT is trained :

Masked Language Model (MLM): before feeding word sequences into BERT, a fraction of the tokens is replaced with a [MASK] token, and the model learns to predict the original tokens from the surrounding context.

Next Sentence Prediction (NSP): the model receives pairs of sentences as input and learns to predict whether the second sentence follows the first in the original document.

Benefits of BERT:

  1. Captures both the semantics and the context of the tweet
  2. Gives good results on small datasets since it is pre-trained on the Wikipedia and BookCorpus datasets

Building a BERT Classifier:

Model Summary

Additions made to the standard BERT model to create a classifier:

  1. Lambda layer
  2. Dense layer with a tanh activation to keep the embedding values in the range [-1, +1]
  3. Dense layer with a softmax activation function; the output of this layer is the probability of the input belonging to each class
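A plain-NumPy sketch of this classification head’s forward pass. It assumes the Lambda layer selects the [CLS] token embedding from BERT’s sequence output (the summary above doesn’t spell this out), and random weights stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: batch of 2 tweets, BERT sequence output of shape
# (batch, seq_len, hidden); hidden = 768 for bert-base.
batch, seq_len, hidden, n_classes = 2, 64, 768, 2
sequence_output = rng.normal(size=(batch, seq_len, hidden))

# 1. "Lambda layer": take the embedding of the first ([CLS]) token
cls_embedding = sequence_output[:, 0, :]              # (batch, hidden)

# 2. Dense layer with tanh: keeps values in [-1, +1]
W1 = rng.normal(size=(hidden, hidden)) * 0.02
b1 = np.zeros(hidden)
pooled = np.tanh(cls_embedding @ W1 + b1)

# 3. Dense layer with softmax: per-class probabilities
W2 = rng.normal(size=(hidden, n_classes)) * 0.02
b2 = np.zeros(n_classes)
logits = pooled @ W2 + b2
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(probs.shape)        # one probability vector per tweet
print(probs.sum(axis=1))  # each row sums to 1
```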

Training:

The hyperparameters:

  1. Number of epochs = 15
  2. Batch size = 16
  3. Adam optimizer, learning rate = 1e-5

validation loss across epochs

The validation loss drops consistently up to the 15th epoch, so we train our model for 15 epochs.


training accuracy: 0.95

testing accuracy: 0.81

Confusion Matrix:

0: represents a fake tweet

1: represents a real tweet

Confusion matrix

Accuracy Score : 0.81

Report :

Precision recall f1-score and support for the predictions
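The confusion matrix, accuracy score, and report can all be produced with scikit-learn; the labels below are illustrative toy values, not the article’s actual predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Illustrative labels/predictions standing in for the real test split
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted class
acc = accuracy_score(y_true, y_pred)

print(cm)
print(f"Accuracy: {acc:.2f}")
print(classification_report(y_true, y_pred, target_names=["fake (0)", "real (1)"]))
```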

For the sample submission, we get a public score of 0.83

Link to the code