TensorFlow 2 — BERT: Movie Review Sentiment Analysis

BERT stands for Bidirectional Encoder Representations from Transformers. A pre-trained BERT model can be fine-tuned to create state-of-the-art models for a wide range of NLP tasks such as question answering, sentiment analysis and named entity recognition. BERT BASE has 110M parameters (L=12, H=768, A=12) and BERT LARGE has 340M parameters (L=24, H=1024, A=16), where L is the number of layers, H the hidden size and A the number of self-attention heads (Devlin et al., 2019).

The architecture of the BERT model is a multi-layer bidirectional Transformer encoder (see Figure 1). The authors of the BERT paper pre-train the model on a 3.3-billion-word corpus with two NLP tasks: Task #1: Masked LM and Task #2: Next Sentence Prediction (NSP).

Figure 1: BERT Architecture — BERT representations are jointly conditioned on both left and right context in all layers (Devlin et al., 2019)

The BERT model has an interesting input representation (see Figure 2): each input token's representation is the sum of its token embedding, segment embedding and position embedding (Devlin et al., 2019).

Figure 2: BERT model input: token, segment and position embeddings
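
As a concrete illustration of Figure 2, a single short review maps to three parallel id sequences. The token strings and id values below are only example values I have made up for clarity (they are not taken from the original post), but the structure is the one BERT expects:

# Illustrative only: how one short review maps to BERT's three inputs
tokens       = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]
token_ids    = [101, 1996, 3185, 2001, 2307, 102]   # WordPiece vocab lookups (example values)
segment_ids  = [0, 0, 0, 0, 0, 0]                   # a single sentence, so all zeros
position_ids = list(range(len(tokens)))             # 0, 1, 2, ... handled inside the model
# Inside BERT, each position's input embedding is the element-wise sum:
# token_embedding[token_ids[i]] + segment_embedding[segment_ids[i]] + position_embedding[i]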

Dataset

The IMDB Dataset from Kaggle has 50K movie reviews for natural language processing. The dataset, in CSV format, has two columns: review and sentiment. Each review is labelled as either positive or negative, so we have a binary classification problem in a supervised learning setting.
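
The post does not show the loading step, so here is a minimal sketch, assuming the Kaggle file is saved locally as IMDB Dataset.csv and using an 80/20 train/test split; the variable names reviews_train, reviews_test, y_train and y_test are reused in the later sketches and code blocks.

# A minimal data-loading sketch (assumed file name and split; not shown in the original post)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("IMDB Dataset.csv")                    # columns: review, sentiment
labels = (df["sentiment"] == "positive").astype(int)    # negative -> 0, positive -> 1
y = np.eye(2)[labels.values]                            # one-hot labels for the 2-unit softmax head

reviews_train, reviews_test, y_train, y_test = train_test_split(
    df["review"].values, y, test_size=0.2, random_state=42)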

Data preprocessing

The important part of data preprocessing is constructing the BERT-specific inputs. The functions in the following code block are used to 1) transform a review into the three input sequences (token ids, input masks and segment ids), and 2) format inputs so the model can consume them in training and testing. We set the maximum sequence length to 500.

# Functions for constructing BERT inputs: input_ids, input_masks and input_segments
import numpy as np
from tqdm import tqdm
import bert  # bert-for-tf2, provides bert_tokenization

MAX_SEQ_LEN = 500  # max sequence length

def get_masks(tokens):
    """Masks: 1 for real tokens and 0 for paddings"""
    return [1] * len(tokens) + [0] * (MAX_SEQ_LEN - len(tokens))

def get_segments(tokens):
    """Segments: 0 for the first sequence, 1 for the second"""
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (MAX_SEQ_LEN - len(tokens))

def get_ids(tokens, tokenizer):
    """Token ids from the tokenizer vocab, padded to MAX_SEQ_LEN"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (MAX_SEQ_LEN - len(token_ids))
    return input_ids

def create_single_input(sentence, tokenizer, max_len):
    """Create the three inputs from a single sentence"""
    stokens = tokenizer.tokenize(sentence)
    stokens = stokens[:max_len]
    stokens = ["[CLS]"] + stokens + ["[SEP]"]

    ids = get_ids(stokens, tokenizer)
    masks = get_masks(stokens)
    segments = get_segments(stokens)
    return ids, masks, segments

def convert_sentences_to_features(sentences, tokenizer):
    """Convert sentences to features: input_ids, input_masks and input_segments"""
    input_ids, input_masks, input_segments = [], [], []

    for sentence in tqdm(sentences, position=0, leave=True):
        # Reserve two positions for the [CLS] and [SEP] tokens
        ids, masks, segments = create_single_input(sentence, tokenizer, MAX_SEQ_LEN - 2)
        assert len(ids) == MAX_SEQ_LEN
        assert len(masks) == MAX_SEQ_LEN
        assert len(segments) == MAX_SEQ_LEN
        input_ids.append(ids)
        input_masks.append(masks)
        input_segments.append(segments)

    return [np.asarray(input_ids, dtype=np.int32),
            np.asarray(input_masks, dtype=np.int32),
            np.asarray(input_segments, dtype=np.int32)]

def create_tokenizer(bert_layer):
    """Instantiate a FullTokenizer with the vocab shipped with the TF Hub layer"""
    vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
    do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
    tokenizer = bert.bert_tokenization.FullTokenizer(vocab_file, do_lower_case)
    return tokenizer
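
Wiring these helpers together might look like the sketch below. It assumes the reviews_train and reviews_test arrays from the earlier data-loading sketch, and builds the tokenizer from the same TF Hub BERT module that the model uses later.

# Example usage of the helpers above (a sketch; variable names follow the earlier sketches)
import tensorflow_hub as hub

bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1")
tokenizer = create_tokenizer(bert_layer)

X_train = convert_sentences_to_features(reviews_train, tokenizer)  # [ids, masks, segments]
X_test = convert_sentences_to_features(reviews_test, tokenizer)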

Modelling

To build a state-of-the-art NLP model for the sentiment analysis problem, we select BERT BASE as the pre-trained model. On top of its pooled output we add one fully connected layer with 768 ReLU units and dropout of 0.1, followed by an output layer of two units with softmax activation, i.e., the same approach as the google-research TensorFlow 1 BERT tutorial, rather than a single sigmoid unit. Accordingly, the loss function is categorical_crossentropy rather than binary_crossentropy. Table 1 shows that the model is indeed a big one, with about 110M trainable parameters!

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model

def nlp_model(callable_object):
    # Load the pre-trained BERT base model from TF Hub
    bert_layer = hub.KerasLayer(handle=callable_object, trainable=True)

    # BERT layer's three inputs: ids, masks and segments
    input_ids = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_ids")
    input_masks = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_masks")
    input_segments = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="segment_ids")

    inputs = [input_ids, input_masks, input_segments]    # BERT inputs
    pooled_output, sequence_output = bert_layer(inputs)  # BERT outputs

    # Add a hidden layer on top of the pooled [CLS] output
    x = Dense(units=768, activation='relu')(pooled_output)
    x = Dropout(0.1)(x)

    # Add the output layer: 2 classes with softmax
    outputs = Dense(2, activation="softmax")(x)

    # Construct the new model
    model = Model(inputs=inputs, outputs=outputs)
    return model

model = nlp_model("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1")
model.summary()
Table 1: NLP model

Model training

We found that a learning rate of 2e-5 works well for the Adam optimizer. Although training takes time, the resulting performance usually outweighs the computational resources demanded by fine-tuning such a big model. Training for one epoch takes approximately 47 minutes in Colab Pro with 1 GPU. Wow! After just one epoch, the nlp_model has already achieved 94% accuracy.

# Train the model
from tensorflow.keras.optimizers import Adam

BATCH_SIZE = 8
EPOCHS = 1

# Use the Adam optimizer and categorical_crossentropy loss
opt = Adam(learning_rate=2e-5)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

# Fit data to the model
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE,
                    verbose=1)
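
As a quick sanity check after training, one could evaluate on the held-out split and score a new review. This is only a sketch, not code from the original post, and it uses the label convention from the earlier data-loading sketch (index 1 = positive).

# Evaluate on the held-out reviews and score a new one (illustrative sketch)
import numpy as np

loss, accuracy = model.evaluate(X_test, y_test, batch_size=BATCH_SIZE)
print(f"Test accuracy: {accuracy:.4f}")

sample = ["The film was a complete waste of time."]
sample_features = convert_sentences_to_features(sample, tokenizer)
probs = model.predict(sample_features)
print("positive" if np.argmax(probs[0]) == 1 else "negative")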

Results

Please note that, due to computational resource constraints, I have not conducted 10-fold cross-validation. Therefore, the 94% accuracy may differ from the average accuracy one would obtain over 10-fold cross-validation.

The notebook is accessible at this link.

Reference

Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K., 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. [online] arXiv.org. Available at: <https://arxiv.org/pdf/1810.04805.pdf> [Accessed 19 May 2020].