Preprocessing Sequence Data in Keras (Padding & Masking Techniques)

Original article was published on Deep Learning on Medium

When using the Functional API or Sequential API for sequence modeling in Keras, it can be challenging to understand how to prepare sequence data for the input layer of an RNN or LSTM model. For variable-length sequence prediction problems, the data must be transformed so that every sequence has the same length and the batch forms a 3D tensor. In this blog we look at the padding and masking techniques in TensorFlow Keras that are used to bring input sequences into that format.

Padding

Padding is a method used to ensure that all sequences in a list have the same length. It is the first preprocessing step for sequence input data, because individual samples very commonly have different lengths. Consider an example.
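A ragged input of this kind might look like the sketch below. The variable name `sample` and the integer values are made up for illustration; only the lengths matter here.

```python
# Hypothetical token-ID sequences; the values are invented for illustration.
sample = [
    [711, 632, 71, 12, 48, 9],  # 6 timesteps
    [73, 8, 3215, 55, 927],     # 5 timesteps
    [83, 91, 1],                # 3 timesteps
]

# Each inner list has a different length, so this cannot be a regular tensor yet.
lengths = [len(seq) for seq in sample]
print(lengths)  # [6, 5, 3]
```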

In this example, the data is a 2D list whose individual samples have lengths 6, 5, and 3 respectively. Deep learning models usually require an input tensor of a standard size, (batch, sequence_len, features), for the input layer of an RNN or LSTM model. For example, if the input tensor has shape (batch, 6, features), every input sample must contain exactly 6 timesteps. A sample shorter than 6 timesteps is padded with placeholder values; conversely, a sample longer than 6 timesteps must be truncated.

Keras provides an API to easily truncate and pad sequences to a common length:

tf.keras.preprocessing.sequence.pad_sequences(sample, maxlen=6, padding='post', truncating='post')

For example, take an input tensor of shape (batch, 6, features). The argument maxlen=6 sets the common length: shorter samples are padded up to 6 values and longer samples are truncated down to 6. padding='pre' adds the padding values before the start of the sequence, while padding='post' adds them at the end. The truncating argument accepts the same two options, which decide whether values are removed from the start or from the end of a sequence that is too long.
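The effect of these arguments can be seen in a short sketch. It assumes TensorFlow 2.x is installed and reuses the made-up `sample` list with lengths 6, 5, and 3. Note that `pad_sequences` returns a 2D array of shape (batch, maxlen); a features axis still has to be added separately if the model expects 3D input.

```python
import tensorflow as tf

# Hypothetical token-ID sequences with lengths 6, 5 and 3.
sample = [
    [711, 632, 71, 12, 48, 9],
    [73, 8, 3215, 55, 927],
    [83, 91, 1],
]

pad_sequences = tf.keras.preprocessing.sequence.pad_sequences

# Zeros are appended at the end of the shorter sequences.
padded_post = pad_sequences(sample, maxlen=6, padding="post")
# Zeros are inserted before the start of the shorter sequences.
padded_pre = pad_sequences(sample, maxlen=6, padding="pre")
# With maxlen=4 the longer sequences lose values from the end;
# the 3-value sequence is still padded (default padding is "pre").
truncated = pad_sequences(sample, maxlen=4, truncating="post")

print(padded_post)
# [[ 711  632   71   12   48    9]
#  [  73    8 3215   55  927    0]
#  [  83   91    1    0    0    0]]
print(padded_pre.shape)  # (3, 6)
print(truncated.shape)   # (3, 4)
```

The default padding value is 0, which is why 0 is also the value that the masking step later looks for.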

Masking

This technique helps the model identify and ignore all padded values while processing the data. Now that all samples have a uniform length, the model must be informed that some part of the data is actually padding and should be ignored. This mechanism is masking.

Before applying masking to the input samples, convert the padded NumPy array into a float tensor (for example with tf.convert_to_tensor) of shape (batch, sequence_len, features). Then create a Masking layer with the argument mask_value=0.0. The layer does not add or change any values; it marks every timestep whose features are all equal to 0.0 so that downstream layers can skip it.

There are two ways to introduce input masks in Keras models:

1. Add a keras.layers.Masking layer

2. Configure a keras.layers.Embedding layer with mask_zero=True.

The code below shows how to apply masking with each of these methods.

1. keras.layers.Masking layer method
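A minimal sketch of this method follows, assuming TensorFlow 2.x and a made-up, already post-padded batch. Calling `compute_mask` directly is used here only as one way to inspect the mask; in a real model the layer is simply placed before the RNN or LSTM layer and the mask is passed along automatically.

```python
import numpy as np
import tensorflow as tf

# Hypothetical batch that has already been post-padded with zeros.
padded = np.array([
    [711, 632, 71, 12, 48, 9],
    [73, 8, 3215, 55, 927, 0],
    [83, 91, 1, 0, 0, 0],
])

# Add a features axis and cast to float: (batch, timesteps, features) = (3, 6, 1).
inputs = tf.convert_to_tensor(padded[..., np.newaxis].astype("float32"))

masking_layer = tf.keras.layers.Masking(mask_value=0.0)

# A timestep is masked (False) when all of its features equal mask_value.
mask = masking_layer.compute_mask(inputs)
print(mask.shape)  # (3, 6)
print(mask)
```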

2. Embedding method
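The Embedding method can be sketched as below, again assuming TensorFlow 2.x and the same made-up padded batch. The vocabulary size `input_dim=5000` and the embedding width `output_dim=16` are arbitrary illustrative choices; with `mask_zero=True`, index 0 is reserved for padding and must not be used for a real token.

```python
import numpy as np
import tensorflow as tf

# Hypothetical batch of integer token IDs, post-padded with zeros.
padded = np.array([
    [711, 632, 71, 12, 48, 9],
    [73, 8, 3215, 55, 927, 0],
    [83, 91, 1, 0, 0, 0],
])

# mask_zero=True tells the layer to treat index 0 as padding.
embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)

# The layer turns each token ID into a 16-dimensional vector,
# so the output already has the 3D (batch, timesteps, features) layout.
embedded = embedding(padded)
print(embedded.shape)  # (3, 6, 16)

# The mask is False wherever the input ID is 0.
mask = embedding.compute_mask(padded)
print(mask)
```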

When the mask is printed, it is a 2D boolean tensor with shape (batch_size, sequence_length), where each False entry indicates that the corresponding timestep should be ignored during processing.

Conclusion: With these padding and masking techniques, sequence input data can be preprocessed effectively and made ready for use with an RNN or LSTM model.