Original article was published by Vaibhavb on Deep Learning on Medium
We have done the necessary EDA on our data; now let's build the models to tackle this problem. For this NLP problem, we are going to use LSTM and BERT models.
What is LSTM?
Let me put it in simple words: LSTM (Long Short-Term Memory) is a type of RNN capable of learning order dependence in sequence prediction problems. For more details, you can check this blog.
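To make the recurrence concrete, here is a toy NumPy sketch of a single LSTM step (this is not the article's code; the gate layout and sizes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.
    W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,)
    Gates stacked in order: input, forget, cell candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:hidden])             # input gate
    f = sigmoid(z[hidden:2*hidden])      # forget gate
    g = np.tanh(z[2*hidden:3*hidden])    # candidate cell state
    o = sigmoid(z[3*hidden:4*hidden])    # output gate
    c = f * c_prev + i * g               # new cell state mixes memory and input
    h = o * np.tanh(c)                   # new hidden state
    return h, c

# run a toy sequence of 5 timesteps through the cell
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4*n_hid, n_in)) * 0.1
U = rng.normal(size=(4*n_hid, n_hid)) * 0.1
b = np.zeros(4*n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c, W, U, b)
```

The cell state `c` is what lets the network carry information across many timesteps, which is exactly the "order dependence" mentioned above.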
What is BERT?
BERT stands for Bidirectional Encoder Representations from Transformers. It was published by Google in 2018. Transformers are built on an attention mechanism that assigns weights across the positions of an encoder-decoder, sequence-to-sequence model; this is what gives the transformer the power to capture contextual information in NLP. I know all these terms can make BERT feel intimidating, but for a detailed explanation and understanding, you can see this blog.
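The "weights" that attention computes can be illustrated with a toy scaled dot-product attention in NumPy (a sketch of the core transformer operation, not BERT's actual implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # each query position scores every key position, scaled by sqrt(dim)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # softmax over keys turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # output is a weighted mix of the value vectors
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 8))   # 4 token positions, dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` sums to 1 and says how much that token "looks at" every other token, which is how context flows between positions.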
Preparing data for LSTM models:
For preprocessing, I am cleaning the dataset as follows:
1. I will try to make my dataset as similar to the embedding vocabulary as possible.
Getting your vocabulary close to the pre-trained embeddings means that your preprocessing should aim to produce tokens that are mostly covered by the word vectors.
Using the GloVe vectors and modifying the dataset:
1. check_coverage goes through the given vocabulary and tries to find a word vector for each word in your embedding matrix.
2. build_vocab builds an ordered dictionary of words and their frequency in your text corpus.
3. loadGloveModel loads the GloVe vectors from disk.
import numpy as np

def build_vocab(sentences, verbose=True):
    # :param sentences: list of list of words
    # :return: dictionary of words and their count
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab[word] = vocab.get(word, 0) + 1
    return vocab

def check_coverage(vocab, embeddings_index):
    # checks if the words in the dataset are present in the glove vector
    # :return: dictionary of words which are not present in the glove vector
    covered = sum(n for w, n in vocab.items() if w in embeddings_index)
    print('Found embeddings for {:.2%} of all text'.format(covered / sum(vocab.values())))
    return {w: n for w, n in vocab.items() if w not in embeddings_index}

def loadGloveModel(gloveFile):
    # :param gloveFile: file directory
    # :return: glove vector (word -> numpy array)
    glove = {}
    with open(gloveFile, encoding='utf8') as f:
        for line in f:
            word, *vec = line.rstrip().split(' ')
            glove[word] = np.asarray(vec, dtype='float32')
    return glove
Checking how much data is found in glove vector:
— Found embeddings for 89.63% of all text
Words that are not present in the embedding file.
We have to modify this so that we have more coverage of data.
Seems like ' and other punctuation directly on or in a word is an issue. We could simply delete punctuation to fix those words, but there are better methods. Let's explore the embeddings, in particular the symbols, a bit.
Building a white list of ordinary characters, so we can identify which symbols are present in the dataset but not in the glove vector.
from string import ascii_letters, digits
latin_similar = "’'‘ÆÐƎƏƐƔĲŊŒẞÞǷȜæðǝəɛɣĳŋœĸſßþƿȝĄƁÇĐƊĘĦĮƘŁØƠŞȘŢȚŦŲƯY̨Ƴąɓçđɗęħįƙłøơşșţțŧųưy̨ƴÁÀÂÄǍĂĀÃÅǺĄÆǼǢƁĆĊĈČÇĎḌĐƊÐÉÈĖÊËĚĔĒĘẸƎƏƐĠĜǦĞĢƔáàâäǎăāãåǻąæǽǣɓćċĉčçďḍđɗðéèėêëěĕēęẹǝəɛġĝǧğģɣĤḤĦIÍÌİÎÏǏĬĪĨĮỊĲĴĶƘĹĻŁĽĿʼNŃN̈ŇÑŅŊÓÒÔÖǑŎŌÕŐỌØǾƠŒĥḥħıíìiîïǐĭīĩįịĳĵķƙĸĺļłľŀŉńn̈ňñņŋóòôöǒŏōõőọøǿơœŔŘŖŚŜŠŞȘṢẞŤŢṬŦÞÚÙÛÜǓŬŪŨŰŮŲỤƯẂẀŴẄǷÝỲŶŸȲỸƳŹŻŽẒŕřŗſśŝšşșṣßťţṭŧþúùûüǔŭūũűůųụưẃẁŵẅƿýỳŷÿȳỹƴźżžẓ"
white_list = ascii_letters + digits + latin_similar + ' ' + "'"
Symbols in glove file excluding white_list
Note: So let's have a closer look at what we just did. We printed all symbols that we have an embedding vector for. Interestingly, it's not only a vast amount of punctuation but also emojis and other symbols. Especially when doing sentiment analysis, emojis and other symbols carrying sentiment should not be deleted! What we can delete are symbols we have no embeddings for.
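That check can be sketched as follows (a toy dict stands in for the real glove vectors so the snippet runs standalone; in the article the keys would come from loadGloveModel, and white_list would include latin_similar):

```python
from string import ascii_letters, digits

white_list = ascii_letters + digits + ' ' + "'"

# toy stand-in for the real glove dict: keys are the embedding vocabulary
glove = {"hello": [0.1], ":)": [0.2], "…": [0.3], "word": [0.4]}

# every character that appears in some embedded token...
glove_chars = ''.join(sorted({ch for word in glove for ch in word}))
# ...minus the white-listed ones: symbols we DO have embeddings for
glove_symbols = ''.join(ch for ch in glove_chars if ch not in white_list)
print(glove_symbols)  # ):…
```

The same two lines, run over the comment texts instead of the glove keys, give the symbols present in the dataset.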
So let’s check the characters in our texts and find those to delete:
Symbols in jigsaw dataset excluding white_list
Taking glove symbols and jigsaw dataset symbols and comparing which are common and which are not:
Now deleting the symbols which are not common between glove and the dataset.
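The deletion can be done with str.translate; the toy symbol sets below stand in for the real glove and jigsaw symbol strings built above:

```python
# toy stand-ins for the two symbol sets built earlier
glove_symbols = set("!?:)(…")     # symbols covered by the embeddings
jigsaw_symbols = set("!?¶\x96…")  # symbols seen in the dataset

# delete only dataset symbols that have no embedding vector
symbols_to_delete = ''.join(jigsaw_symbols - glove_symbols)
remove_table = str.maketrans('', '', symbols_to_delete)

text = "Toxic?¶ comment\x96 here!"
clean = text.translate(remove_table)
print(clean)  # Toxic? comment here!
```

Punctuation that does have an embedding ('?', '!') survives, which is exactly why coverage improves without losing sentiment-carrying symbols.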
After doing this, we check the coverage again:
— Found embeddings for 99.58% of all text
Preparing data for BERT models:
# encode_plus tokenizes, truncates/pads to max_length and builds the attention mask
encoding = tokenizer.encode_plus(comment_text, max_length=128,
                                 padding='max_length', truncation=True)
input_ids = encoding["input_ids"]
input_mask = encoding["attention_mask"]
We know this competition uses a modified, subgroup-aware metric, so we have to weight our model's training targets accordingly:
identity_columns = ['asian', 'atheist', 'bisexual', 'black',
                    'buddhist', 'christian', 'female',
                    'heterosexual', 'hindu', 'homosexual_gay_or_lesbian',
                    'jewish', 'latino', 'male', 'muslim',
                    'other_disability', 'physical_disability',
                    'psychiatric_or_mental_illness',
                    'transgender', 'white']

# Overall
weights = np.ones((len(train),)) / 4
# Subgroup: comment mentions at least one identity
weights += (train[identity_columns].fillna(0).values >= 0.5).sum(axis=1).astype(bool).astype(int) / 4
# Background Positive, Subgroup Negative
weights += (((train['target'].values >= 0.5).astype(bool).astype(int) +
             (train[identity_columns].fillna(0).values < 0.5).sum(axis=1).astype(bool).astype(int)) > 1).astype(bool).astype(int) / 4
# Background Negative, Subgroup Positive
weights += (((train['target'].values < 0.5).astype(bool).astype(int) +
             (train[identity_columns].fillna(0).values >= 0.5).sum(axis=1).astype(bool).astype(int)) > 1).astype(bool).astype(int) / 4
loss_weight = 1.0 / weights.mean()

y_train = np.vstack([(train['target'].values >= 0.5).astype(int), weights]).T
y_aux_train = train[['target', 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat']].values
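To sanity-check what the weighting does, here is a toy run with two made-up identity columns (the frame and its values are stand-ins; the weight formula mirrors the article's code):

```python
import numpy as np
import pandas as pd

identity_columns = ['female', 'black']
train = pd.DataFrame({
    'target': [0.8, 0.8, 0.1, 0.1],   # toxic, toxic, benign, benign
    'female': [0.0, 0.9, 0.9, 0.0],   # does the comment mention the subgroup?
    'black':  [0.0, 0.0, 0.0, 0.0],
})

weights = np.ones(len(train)) / 4
# subgroup mentioned
weights += (train[identity_columns].fillna(0).values >= 0.5).sum(axis=1).astype(bool).astype(int) / 4
# background positive, subgroup negative
weights += (((train['target'].values >= 0.5).astype(int) +
             (train[identity_columns].fillna(0).values < 0.5).sum(axis=1).astype(bool).astype(int)) > 1).astype(int) / 4
# background negative, subgroup positive
weights += (((train['target'].values < 0.5).astype(int) +
             (train[identity_columns].fillna(0).values >= 0.5).sum(axis=1).astype(bool).astype(int)) > 1).astype(int) / 4
print(weights)  # [0.5, 0.75, 0.75, 0.25]
```

Benign comments that mention an identity (and toxic ones in general) get up-weighted, pushing the model not to treat identity mentions alone as toxic.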
Now that everything is set, we can start building the architectures of the LSTM and BERT models.
LSTM small architecture:
LSTM large architecture:
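As a stand-in for the architecture diagrams, here is a minimal Keras sketch of a small bidirectional-LSTM classifier; the layer sizes are illustrative guesses, not the article's exact configuration:

```python
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB, EMB_DIM = 128, 20000, 300   # illustrative sizes

inp = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB, EMB_DIM)(inp)          # GloVe weights would be loaded here
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.GlobalMaxPooling1D()(x)                 # pool over the sequence
out = layers.Dense(1, activation='sigmoid')(x)     # toxicity probability
model = Model(inp, out)
model.compile(optimizer='adam', loss='binary_crossentropy')
```

A "large" variant would typically stack a second recurrent layer and widen the hidden size; the overall shape stays the same.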
I built three BERT models; the overall model architecture is similar, only the underlying BERT model differs.
1. TFBertModel (bert-base-cased)
2. TFBertForSequenceClassification (bert-base-cased)
3. TFBertForSequenceClassification (bert-base-uncased)
bert = 'bert-base-cased'   # or 'bert-base-uncased'
config = BertConfig.from_pretrained(bert)
bert_model = TFBertModel.from_pretrained(bert, config=config)
###################### OR ######################
bert_model = TFBertForSequenceClassification.from_pretrained(bert, config=config)
Now, after training each model, I store its predictions and then blend them to achieve a better result.
# BERT weights 4+3+3 plus 4 for the LSTM blend; /14 normalises the total
bert_final = (4*bert_model['prediction'] +
              3*bert_seq_cased['prediction'] +
              3*bert_seq_uncased['prediction'])
lstm_final = (2*lstm_large['prediction'] + 1.5*lstm_small['prediction']) / 3.5
final_pred = (bert_final + lstm_final*4) / 14
submission = pd.read_csv("project_dataset/sample_submission.csv")
submission['prediction'] = final_pred