Generate News Headlines


TABLE OF CONTENTS:

  1. BUSINESS PROBLEM
  2. BUSINESS CONSTRAINTS
  3. USE OF MACHINE LEARNING
  4. SOURCE OF DATA
  5. DATA DESCRIPTION
  6. EXPLORATORY DATA ANALYSIS (EDA)
  7. DATA PREPROCESSING
  8. SPLITTING DATA
  9. MODELLING
  10. COMPARING THE MODELS
  11. FUTURE WORK
  12. REFERENCES
  13. GITHUB REPOSITORY
  14. LINKEDIN PROFILE

1. BUSINESS PROBLEM

Build a machine learning model that can automatically generate the headline of a news article.

2. BUSINESS CONSTRAINTS

  1. No strict latency constraints – given a news article, the model can take 2–3 seconds to generate a headline.
  2. Interpretability of the model is not important – knowing why the model generated a particular headline for a given article is not important to the user.

3. USE OF MACHINE LEARNING

We will use a seq2seq network with Bahdanau attention to generate the headline.

4. SOURCE OF DATA

I have scraped the data from https://inshorts.com/en/read

Code for scraping the data:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

# code for scraping the first page
d = {'headlines': [], 'news': []}
r = requests.get("https://inshorts.com/en/read")
soup = BeautifulSoup(r.content, 'html.parser')
min_news_id = soup.findAll("script", {"type": "text/javascript"})[2].text
min_news_id = min_news_id[25:35]
soup = soup.findAll("div", {"class": "news-card z-depth-1"})
for data in soup:
    d['headlines'].append(data.find(itemprop="headline").getText())
    d['news'].append(data.find(itemprop="articleBody").getText())

# code for scraping more pages
# The site uses JavaScript to load more data from
# https://inshorts.com/en/ajax/more_news using POST requests with the
# parameter 'news_offset', which tells the server which page to send.
# We can make POST requests with this parameter to get new data in JSON format.
for i in tqdm(range(2100)):
    try:
        params = {'news_offset': min_news_id}
        req = requests.post("https://inshorts.com/en/ajax/more_news", data=params)
        # The JSON response has HTML in json_data['html'] and
        # json_data['min_news_id'] for the next page
        json_data = req.json()
        min_news_id = json_data['min_news_id']
        soup = BeautifulSoup(json_data['html'], 'html.parser')
        soup = soup.findAll("div", {"class": "news-card z-depth-1"})
        for data in soup:
            d['headlines'].append(data.find(itemprop="headline").getText())
            d['news'].append(data.find(itemprop="articleBody").getText())
    except:
        pass

# storing the data into a .csv file
df = pd.DataFrame(d)
df.to_csv("inshorts_news.csv", index=False)

5. DATA DESCRIPTION

fig: 1

There are 2 columns, one for the headlines and one for the corresponding news, and there are about 188322 datapoints.

6. EXPLORATORY DATA ANALYSIS (EDA)

Headlines

fig: 2 – Distribution plot for length of headlines

We can see that most headlines have a length between 5 and 15 words.

fig: 3

The minimum headline length is 3 words, the average is around 9, and the maximum is 18, so we take a padding size of 20 for the decoder.

fig: 4 – Box plot for headline length

We can see that 75% of headlines are shorter than 12 words.

News

fig: 5 – Density plot of news length

We can observe that most news articles have a length of around 60 words.

fig: 6

The minimum news length is 38 words, the average is 58, and the maximum is 67, so we will take a padding size of 70 for the encoder.

fig: 7 Box plot of news length

We can see that 75% of news articles are shorter than 60 words.

7. DATA PREPROCESSING

We have to remove stop words from both the news and the headlines. You might ask why we remove stopwords from the headlines: in the model-comparison section you can see that the model trained with stopwords kept in the headlines did not perform better than the one with stopwords removed. A sketch of this step is shown below.
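A minimal sketch of the stopword removal, assuming the NLTK English stopword list (the original post does not show this step, so the exact list and cleaning details are assumptions):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # assumed stopword list

def remove_stopwords(text):
    # keep only the words that are not in the stopword list
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

df['news'] = df['news'].apply(remove_stopwords)
df['headlines'] = df['headlines'].apply(remove_stopwords)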

We have to add “ssttaarrtt” at the beginning and “eenndd” at the end of every headline so that the decoder knows when to start and when to stop decoding a sentence. The following code can be used for this:

df['headlines'] = df['headlines'].apply(lambda x: 'ssttaarrtt ' + x + ' eenndd')

8. SPLITTING DATA

import numpy as np
from sklearn.model_selection import train_test_split

X_1, X_test, y_1, y_test = train_test_split(np.array(df['news']), np.array(df['headlines']), test_size=0.005)
# split the train data set into cross validation train and cross validation test
X_tr, X_cv, y_tr, y_cv = train_test_split(X_1, y_1, test_size=0.15)

Tokenizing news and padding

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

t = Tokenizer()
t.fit_on_texts(X_tr)
vocab_size = len(t.word_index) + 1  # +1 because index zero is reserved for padding
# integer encode the documents
encoded_docs = t.texts_to_sequences(X_tr)
encoded_docs_test = t.texts_to_sequences(X_test)
encoded_docs_cv = t.texts_to_sequences(X_cv)
max_length = 70
padded_docs_train = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
padded_docs_test = pad_sequences(encoded_docs_test, maxlen=max_length, padding='post')
padded_docs_cv = pad_sequences(encoded_docs_cv, maxlen=max_length, padding='post')

Padding helps us batch the data, and tokenizing replaces each word with its rank. In the example above we tokenized the news; the headlines can be tokenized the same way, as sketched below.
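A sketch of the corresponding headline tokenization, padded to the decoder length of 20 (the variable names t_head, padded_head_train, etc. are assumptions, since the original code for this part is not shown):

# tokenizer for the headlines (fit only on the training headlines)
t_head = Tokenizer()
t_head.fit_on_texts(y_tr)
y_vocab_size = len(t_head.word_index) + 1

# integer encode and pad to the decoder length of 20
encoded_head_train = t_head.texts_to_sequences(y_tr)
encoded_head_cv = t_head.texts_to_sequences(y_cv)
max_head_length = 20
padded_head_train = pad_sequences(encoded_head_train, maxlen=max_head_length, padding='post')
padded_head_cv = pad_sequences(encoded_head_cv, maxlen=max_head_length, padding='post')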

9. MODELLING

I have used the “glove.42B.300d” GloVe embeddings for the embedding layer, which represent each word with a 300-dimensional vector. There are nearly 104945 words in the news vocabulary and 44499 words in the headlines vocabulary.
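A sketch of how the GloVe file can be turned into an embedding matrix for the encoder vocabulary (the file name and variable names are assumptions based on the standard GloVe release; the original post does not show this step):

import numpy as np

# load the 300-dimensional GloVe vectors into a dictionary
embeddings_index = {}
with open('glove.42B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')

# build the embedding matrix for the news (encoder) vocabulary
embedding_matrix = np.zeros((vocab_size, 300))
for word, i in t.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector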

I have tried the following architectures :

  1. Encoder(1-LSTM) + attention + Decoder(1-LSTM)
  2. Encoder(1-LSTM) + attention + Decoder(1-LSTM), with stop words kept in the headlines
  3. Encoder(3-LSTM) + attention + Decoder(1-LSTM)
  4. Encoder(Bidirectional LSTM) + attention + Decoder(1-LSTM)
  5. Encoder(Bidirectional LSTM + LSTM + Bidirectional LSTM) + attention + Decoder(1-LSTM)

Among all these models the 4th one performs best, so we will discuss it in detail.

The encoder consists of one Bidirectional LSTM layer on top of an embedding layer that represents every word with a 300-dimensional vector. The encoder output goes to the attention layer, which outputs a context vector, and the encoder hidden state of the last time step goes to the first time step of the decoder. The decoder is also a single LSTM layer; at every time step its output is concatenated with the context vector from the attention layer, the word with the highest probability is selected, and that word is fed to the next time step of the decoder.

Given below is the architecture of the bidirectional model

fig: 8 – Architecture of bidirectional lstm

As we can see in fig 8, the input from the previous layer goes into 2 LSTMs, one forward and one backward, and the outputs from these LSTMs are then concatenated and passed to the attention layer and the decoder.

Given below is the architecture of the model, which shows the shapes of the inputs and outputs.

fig: 9 – Architecture of model

In fig 9 we can see that the input layer contains 70 words per sentence, where every word is represented by a 300-dimensional GloVe vector. This output goes to both LSTM layers simultaneously. The Bidirectional LSTM returns one concatenated output from both LSTMs plus 4 hidden states. Since we use 300 LSTM units, for all 70 time steps we concatenate the outputs of both LSTMs, which gives a 600-dimensional output that is passed to the next layer. We also concatenate state_h and state_c of both LSTM layers, which results in 600-dimensional vectors that are sent to the first time step of the decoder. You can see the code below for reference.

encoder = Bidirectional(LSTM(units=300, input_shape=(70, 300), return_state=True,
                             return_sequences=True, dropout=0.5, recurrent_dropout=0.5))
encoder_out, f_h_out, f_cell_out, b_h_out, b_cell_out = encoder(x1)
state_h = Concatenate()([f_h_out, b_h_out])
state_c = Concatenate()([f_cell_out, b_cell_out])

Here f_h_out and f_cell_out correspond to the forward layer, and b_h_out and b_cell_out correspond to the backward layer.

The encoder output goes to the attention layer; here we use Bahdanau attention. The attention layer takes the encoder output and the decoder output and generates the context vector. You can refer to the following code:

attention_layer = AttentionLayer(name='attention_layer')
attention_out, attention_states = attention_layer([encoder_out, d_lstm_out])

The decoder also uses an embedding layer, but a single forward LSTM of 600 units is used because the encoder hidden output is also 600-dimensional. The decoder output is concatenated with the context vector, and then the most probable word is selected. You can check the code below:

concat = Concatenate(axis=-1, name='concat_layer')([d_lstm_out, attention_out])
# dense layer
decoder_dense = TimeDistributed(Dense(y_vocab_size, activation='softmax'))
decoder_outputs = decoder_dense(concat)
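The snippets above use d_lstm_out without showing where it comes from. Below is a minimal sketch of the decoder embedding and LSTM layers that produce it, plus one way the full training model could be assembled and compiled. The layer names (decoder_input, decoder_embedding_layer, decoder_lstm) match the inference code later in the post, but the decoder embedding matrix, optimizer and loss are assumptions, and in practice these layers are defined before the attention and dense snippets above since d_lstm_out feeds into them.

from tensorflow.keras.layers import Input, Embedding, LSTM
from tensorflow.keras.models import Model

# decoder side: embedding layer + a single 600-unit LSTM
decoder_input = Input(shape=(20,))
decoder_embedding_layer = Embedding(y_vocab_size, 300,
                                    weights=[decoder_embedding_matrix],  # assumed GloVe matrix for the headline vocabulary
                                    trainable=False)
dec_emb = decoder_embedding_layer(decoder_input)
decoder_lstm = LSTM(600, return_sequences=True, return_state=True)
# the encoder's concatenated states initialise the decoder
d_lstm_out, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

# full training model: news sequence in, headline word probabilities out
model = Model([encoder_input, decoder_input], decoder_outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')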

Now let’s examine how these parameters have been calculated.

fig: 10 – Parameters of the model

1. The input layers have 0 parameters because there is nothing to train.

2. The embedding layer for the encoder uses GloVe vectors and a vocabulary size of 104945, so 300 * 104945 = 31483500 parameters.

3. The Bidirectional layer contains 2 LSTMs of 300 units with 300-dimensional input, so according to the LSTM formula 4*((units*units) + (inp_dim*units) + units) a single LSTM has 4*(300*300 + 300*300 + 300) = 721200 parameters, and the 2 LSTMs have 721200 * 2 = 1442400 parameters.

4. The embedding layer for the decoder uses GloVe vectors and a vocabulary size of 44499, so 300 * 44499 = 13349700 parameters.

5. The Concatenate layers do not require any parameters.

6. The decoder has a single LSTM with 600 units and 300-dimensional embedding input, so the number of parameters is 4*(600*600 + 300*600 + 600) = 2162400.

7. For the attention layer we use Bahdanau attention; please look at the equation below.

fig: 11 – Bahdanau attention equation

Here we can see that there are 3 weight matrices. Their dimensions are W_encoder (600×600) and W_decoder (600×600), which act on the hidden states H_encoder (600×1) and H_decoder (600×1), and W_combined (600×1). (The corresponding equations are written out just after point 8 below.)

The hidden-state parameters are already counted in the LSTM layers, so for the attention layer we only count the weight matrices. The number of parameters is therefore (600*600 + 600*600 + 600) = 720600.

After multiplying the hidden states with the weight matrices and combining them, we get a context vector of shape (600×1).

8. As you can see in fig 9, after concatenating the decoder output and the context vector we get a 1200-dimensional vector. This vector is given as input to the TimeDistributed dense layer, and we get an output of dimension 44499, which equals the decoder vocabulary size. The number of parameters used here is (1200 + 1 (bias)) * 44499 = 53443299.
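For readers who cannot see fig 11: the standard Bahdanau (additive) attention equations, written roughly in the notation of the weight matrices above, are

e_{tj} = W_{combined}^{\top} \tanh\left( W_{encoder}\, h_j + W_{decoder}\, s_{t-1} \right), \qquad
\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{70} \exp(e_{tk})}, \qquad
c_t = \sum_{j=1}^{70} \alpha_{tj}\, h_j

where h_j is the 600-dimensional encoder output at time step j, s_{t-1} is the previous 600-dimensional decoder state, and c_t is the context vector.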

Now we can calculate the total number of trainable parameters.

Summing all the parameters gives the total: 31483500 + 1442400 + 13349700 + 2162400 + 720600 + 53443299 = 102,601,899.

Since the embedding layers use fixed GloVe vectors, their parameters are non-trainable: 31483500 + 13349700 = 44,833,200 non-trainable parameters.

Subtracting the non-trainable parameters from the total gives 102,601,899 – 44,833,200 = 57,768,699 trainable parameters; you can see the exact figure in fig 10. A small sketch to double-check these counts is given below.
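As a quick sanity check of the arithmetic above, a small sketch that recomputes these counts (the formulas mirror the ones used in points 2–8):

def lstm_params(units, inp_dim):
    # 4 gates, each with a recurrent matrix, an input matrix and a bias vector
    return 4 * (units * units + inp_dim * units + units)

enc_embedding = 300 * 104945                   # 31,483,500 (non-trainable)
enc_bidirectional = 2 * lstm_params(300, 300)  # 1,442,400
dec_embedding = 300 * 44499                    # 13,349,700 (non-trainable)
dec_lstm = lstm_params(600, 300)               # 2,162,400
attention = 600 * 600 + 600 * 600 + 600        # 720,600
dense = (1200 + 1) * 44499                     # 53,443,299

total = enc_embedding + enc_bidirectional + dec_embedding + dec_lstm + attention + dense
non_trainable = enc_embedding + dec_embedding
print(total, non_trainable, total - non_trainable)  # 102601899 44833200 57768699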

Inference

This is the encoder model:

encoder_model = Model(inputs=encoder_input,outputs=[encoder_out, state_h, state_c])

This is the decoder model:

decoder_input_h = Input(shape=(600,))
decoder_input_c = Input(shape=(600,))
decoder_hidden_state = Input(shape=(70,600))
dec_emb2= decoder_embedding_layer(decoder_input)
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_input_h, decoder_input_c])
attn_out_inf, attn_states_inf = attention_layer([decoder_hidden_state, decoder_outputs2])
decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])
decoder_outputs2 = decoder_dense(decoder_inf_concat)
decoder_model = Model([decoder_input] + [decoder_hidden_state, decoder_input_h, decoder_input_c],
                      [decoder_outputs2] + [state_h2, state_c2])

This is the code for decoding a news sequence into a headline:

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    e_out, e_h, e_c = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    seq = np.zeros((1, 1))

    # Populate the first word of the target sequence with the start word.
    seq[0, 0] = target_headlines_word_index['ssttaarrtt']
    stop_condition = False
    decoded_sentence = ''

    while not stop_condition:
        output, h, c = decoder_model.predict([seq] + [e_out, e_h, e_c])
        token_index = np.argmax(output[0, -1, :])
        try:
            # reverse lookup: index -> word
            token = headlines_word_index[token_index]
            if token != 'eenndd':
                decoded_sentence += ' ' + token
            # Exit condition: either hit max length or find the stop word.
            if token == 'eenndd' or len(decoded_sentence.split()) >= (20 - 1):
                stop_condition = True
        except:
            pass

        # Update the target sequence (of length 1).
        seq = np.zeros((1, 1))
        seq[0, 0] = token_index

        # Update internal states
        e_h, e_c = h, c

    return decoded_sentence

Here we can see that seq[0,0] is initialized with the index of ‘ssttaarrtt’ and acts as the decoder input for the first time step. The call decoder_model.predict([seq] + [e_out, e_h, e_c]) returns 3 values (output, h, c), where output contains 44499 probabilities, and using np.argmax(output[0, -1, :]) we choose the index of the most probable word. If the word corresponding to that index is “eenndd” we stop decoding; otherwise we update seq (seq[0, 0] = token_index) and the hidden states (e_h, e_c = h, c), and these updated values are fed again to decoder_model.predict([seq] + [e_out, e_h, e_c]). This decoder_model.predict() call internally uses the model defined below.

decoder_model = Model([decoder_input] + [decoder_hidden_state, decoder_input_h, decoder_input_c],
                      [decoder_outputs2] + [state_h2, state_c2])
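A minimal usage sketch, assuming padded_docs_test from the tokenization step and the t_head headline tokenizer sketched earlier (the two index dictionaries below are the ones decode_sequence expects, and their construction here is an assumption):

# word -> index and index -> word lookups for the headline tokenizer
target_headlines_word_index = t_head.word_index
headlines_word_index = {index: word for word, index in t_head.word_index.items()}

# generate a headline for the first test article
print(decode_sequence(padded_docs_test[0:1]))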

After decoding we get results like this:

fig: 12 output of generated headlines

We have used the BLEU score as a metric, and the results for this model are:

BLEU-1: 0.416519
BLEU-2: 0.277199
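The post does not show how the BLEU scores were computed; a sketch using NLTK’s corpus_bleu, assuming the test arrays from the earlier steps (the exact preparation of references and hypotheses is an assumption):

from nltk.translate.bleu_score import corpus_bleu

references = []   # one list of reference token lists per test article
hypotheses = []   # generated headline tokens per test article
for news, headline in zip(padded_docs_test, y_test):
    generated = decode_sequence(news.reshape(1, -1))
    # strip the ssttaarrtt/eenndd markers from the reference headline
    reference = [w for w in headline.split() if w not in ('ssttaarrtt', 'eenndd')]
    references.append([reference])
    hypotheses.append(generated.split())

print('BLEU-1: %f' % corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0)))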

10. COMPARING THE MODELS

11. FUTURE WORK

  • We have trained the network on about 1.5 lakh (150,000) datapoints; increasing the number of datapoints can help improve the BLEU score.
  • Better hyperparameter tuning and better architectures.
  • Using beam search decoding.
  • Using a pointer-generator network.
  • Using a pre-trained BERT model.

12. REFERENCES

13. GITHUB REPOSITORY

https://github.com/ankuyadav17/Inshorts_news_headline_generation

14. LINKEDIN PROFILE

https://www.linkedin.com/in/ankit-yadav-809773100/