Jigsaw Unintended Bias in Toxicity Classification — Kaggle Competition

Original article was published by Vaibhavb on Deep Learning on Medium


With the exploratory data analysis (EDA) done, let's build the models to tackle this problem. For this NLP task, we are going to use LSTM and BERT models.

What is LSTM?
Put simply, an LSTM (long short-term memory) network is a type of RNN capable of learning order dependence in sequence prediction problems. For more details, you can check this blog.

What is BERT?
BERT stands for Bidirectional Encoder Representations from Transformers. It was published by Google in 2018. Transformers are built on an attention mechanism, which learns how much weight each token in a sequence should give to every other token. The original Transformer was an encoder-decoder model for sequence-to-sequence tasks, and BERT uses its encoder part. This is what lets the model capture contextual information from both sides of a word. These terms can sound intimidating, but for a detailed explanation you can see this blog.

Preparing data for LSTM models:

Preprocessing
Important remarks

For preprocessing, I clean the dataset as follows:
1. I try to make the dataset's vocabulary as similar to the embedding vocabulary as possible.

Getting your vocabulary close to the pre-trained embeddings means that you should aim for your preprocessing to result in tokens that are mostly covered by word vectors.

Using the GloVe vectors and modifying the dataset:

Helper functions:
1. check_coverage goes through the given vocabulary and tries to find a word vector for each word in the embedding index.
2. build_vocab builds an ordered dictionary of words and their frequency in the text corpus.
3. loadGloveModel loads the GloVe vectors from a file.

def check_coverage(vocab, embeddings_index):
    '''
    Check which words of the dataset are present in the GloVe vectors.
    :param vocab: dictionary of words and their counts
    :param embeddings_index: GloVe vectors keyed by word
    :return: dictionary of words that are not present in the GloVe vectors
    '''
    ...

def build_vocab(sentences, verbose=True):
    '''
    :param sentences: list of lists of words
    :return: dictionary of words and their counts
    '''
    ...

def loadGloveModel(gloveFile):
    '''
    Load the GloVe vectors.
    :param gloveFile: path to the GloVe file
    :return: GloVe vectors
    '''
    ...
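
The article shows only the signatures and docstrings of these helpers. A minimal sketch of what they might look like follows, assuming the GloVe file is plain text with one token per line followed by its vector components, and that sentences are lists of tokens. The function bodies, the snake_case names, the `(text_share, oov)` return value, and the toy data are all assumptions for illustration, not the author's code.

```python
from collections import Counter
import io


def load_glove_model(lines):
    """Parse GloVe-style text lines ("token v1 v2 ...") into {token: vector}."""
    embeddings_index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings_index[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings_index


def build_vocab(sentences):
    """Count how often each token occurs across all tokenized sentences."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence)
    return vocab


def check_coverage(vocab, embeddings_index):
    """Return the share of all text covered by the embeddings and a dict of
    out-of-vocabulary words ordered by descending count."""
    covered_count = 0
    oov = {}
    for word, count in vocab.items():
        if word in embeddings_index:
            covered_count += count
        else:
            oov[word] = count
    total = sum(vocab.values())
    text_share = covered_count / total if total else 0.0
    sorted_oov = dict(sorted(oov.items(), key=lambda kv: kv[1], reverse=True))
    return text_share, sorted_oov


# Toy data standing in for the GloVe file and the comment text (hypothetical).
glove_file = io.StringIO("the 0.1 0.2\ncat 0.3 0.4\nsat 0.5 0.6\n")
embeddings = load_glove_model(glove_file)
vocab = build_vocab([["the", "cat", "sat"], ["the", "cat's", "mat"]])
text_share, oov = check_coverage(vocab, embeddings)
print(f"Found embeddings for {text_share:.2%} of all text")
print(oov)
```

In this toy run, "cat's" is missed only because of the attached apostrophe, which is the kind of gap the following steps try to close.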

Checking how much of the data is covered by the GloVe vectors:
Found embeddings for 89.63% of all text

Top 10 elements

Words that are not present in the embedding file.

We have to modify the text so that the embeddings cover more of the data.

Observations:
It seems that ‘ and other punctuation attached to or inside a word is an issue. We could simply delete punctuation to fix those words, but there are better methods. Let's explore the embeddings, in particular the symbols, a bit.

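First, build a white list of ordinary characters (ASCII letters, digits, Latin-like letters, the space, and the apostrophe). Anything outside it counts as a symbol, which lets us identify the symbols that are present in the dataset but not in the GloVe vectors.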

from string import ascii_letters, digits

latin_similar = "’'‘ÆÐƎƏƐƔIJŊŒẞÞǷȜæðǝəɛɣijŋœĸſßþƿȝĄƁÇĐƊĘĦĮƘŁØƠŞȘŢȚŦŲƯY̨Ƴąɓçđɗęħįƙłøơşșţțŧųưy̨ƴÁÀÂÄǍĂĀÃÅǺĄÆǼǢƁĆĊĈČÇĎḌĐƊÐÉÈĖÊËĚĔĒĘẸƎƏƐĠĜǦĞĢƔáàâäǎăāãåǻąæǽǣɓćċĉčçďḍđɗðéèėêëěĕēęẹǝəɛġĝǧğģɣĤḤĦIÍÌİÎÏǏĬĪĨĮỊIJĴĶƘĹĻŁĽĿʼNŃN̈ŇÑŅŊÓÒÔÖǑŎŌÕŐỌØǾƠŒĥḥħıíìiîïǐĭīĩįịijĵķƙĸĺļłľŀʼnńn̈ňñņŋóòôöǒŏōõőọøǿơœŔŘŖŚŜŠŞȘṢẞŤŢṬŦÞÚÙÛÜǓŬŪŨŰŮŲỤƯẂẀŴẄǷÝỲŶŸȲỸƳŹŻŽẒŕřŗſśŝšşșṣßťţṭŧþúùûüǔŭūũűůųụưẃẁŵẅƿýỳŷÿȳỹƴźżžẓ"
white_list = ascii_letters + digits + latin_similar + ' ' + "'"

Symbols in glove file excluding white_list

Glove symbols

Note: Let's have a closer look at what we just did. We printed all symbols that we have an embedding vector for. Interestingly, it is not only a vast amount of punctuation but also emojis and other symbols. Especially when doing sentiment analysis, emojis and other symbols carrying sentiment should not be deleted! What we can delete are symbols we have no embeddings for.

So let’s check the characters in our texts and find those to delete:

Symbols in jigsaw dataset excluding white_list

Jigsaw symbols

Comparing the GloVe symbols with the Jigsaw dataset symbols to see which are common to both and which are not:

Symbols which are not common
Common symbols

Now we delete the dataset symbols that are not common to GloVe and the dataset, that is, the ones we have no embedding for.
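
The symbol comparison above can be sketched as follows. The tiny `glove_vocab` and `comments` values are made-up stand-ins for the real embedding keys and comment text, the white list is reduced to ASCII for brevity, and `remove_symbols` is a hypothetical helper name. Only the white-list idea itself comes from the article.

```python
from string import ascii_letters, digits

# Reduced white list for this sketch; the article also adds Latin-like letters.
white_list = set(ascii_letters + digits + " " + "'")

# Hypothetical stand-ins for the GloVe keys and the Jigsaw comments.
# "\U0001F600" is a grinning-face emoji, "\u200b" a zero-width space,
# "\u00a7" the section sign.
glove_vocab = ["the", "cat", "!", "?", "\U0001F600", ","]
comments = ["Hello, world! \U0001F600", "What\u200b is this\u00a7?"]

# Single-character GloVe entries outside the white list: symbols WITH an embedding.
glove_symbols = {tok for tok in glove_vocab if len(tok) == 1 and tok not in white_list}

# Every character in the dataset that is outside the white list.
jigsaw_symbols = {ch for text in comments for ch in text if ch not in white_list}

# Keep symbols that have an embedding; delete the ones that do not.
symbols_to_keep = jigsaw_symbols & glove_symbols
symbols_to_delete = jigsaw_symbols - glove_symbols

# str.translate with a {code point: None} table drops those characters.
delete_table = {ord(ch): None for ch in symbols_to_delete}


def remove_symbols(text):
    """Remove characters that have no embedding vector."""
    return text.translate(delete_table)


cleaned = [remove_symbols(text) for text in comments]
print(sorted(symbols_to_delete))
print(cleaned)
```

Here the emoji and the punctuation survive because they have vectors, while the zero-width space and the section sign are removed.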

After doing this, checking the coverage again:
Found embeddings for 99.58% of all text

Preparing data for BERT models:

BERT is a transformer model, so we need a transformer library. I use Hugging Face Transformers, a popular choice for this; you can also use the TensorFlow Hub module.

Preprocessing data:

tokens = tokenizer.tokenize(comment_text)
encoding = tokenizer.encode_plus(tokens, max_length = 128, pad_to_max_length = True)
input_ids = encoding["input_ids"]
input_mask = encoding["attention_mask"]
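
To make the two outputs concrete, here is a library-free illustration of what fixed-length encoding produces: the ids are truncated or padded to `max_length`, and the mask marks real tokens with 1 and padding with 0. The `pad_and_mask` helper and the example ids are hypothetical. The real tokenizer additionally maps tokens to vocabulary ids and adds special tokens such as [CLS] and [SEP].

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Made-up ids; 101 and 102 stand in for the [CLS] and [SEP] markers.
input_ids, input_mask = pad_and_mask([101, 2023, 2003, 102], max_length=8)
print(input_ids)
print(input_mask)
```

The model uses the mask to ignore the padded positions, so comments of different lengths can share one fixed input shape.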

Output:

The competition uses a modified metric that combines the overall AUC with bias AUCs computed on identity subgroups, so we adapt the training targets and sample weights accordingly:

import numpy as np

# Besides identity labels, this list also includes reaction counts ('wow',
# 'sad', 'likes', 'funny', 'disagree'), the 'sexual_explicit' label and the
# two annotator-count columns.
identity_columns = ['asian', 'atheist', 'bisexual', 'black',
                    'buddhist', 'christian', 'female',
                    'heterosexual', 'hindu', 'jewish',
                    'homosexual_gay_or_lesbian', 'muslim',
                    'intellectual_or_learning_disability',
                    'latino', 'male', 'other_disability',
                    'other_gender', 'other_race_or_ethnicity',
                    'other_religion', 'other_sexual_orientation',
                    'physical_disability', 'wow', 'sad', 'likes',
                    'psychiatric_or_mental_illness', 'funny',
                    'transgender', 'white', 'sexual_explicit',
                    'disagree', 'identity_annotator_count',
                    'toxicity_annotator_count']

# Boolean masks reused below (missing identity labels are treated as 0)
identity = train[identity_columns].fillna(0).values
toxic = train['target'].values >= 0.5
non_toxic = train['target'].values < 0.5
any_identity = (identity >= 0.5).any(axis=1)
any_non_identity = (identity < 0.5).any(axis=1)

# Overall
weights = np.ones((len(train),)) / 4
# Subgroup
weights += any_identity.astype(int) / 4
# Background Positive, Subgroup Negative
weights += (toxic & any_non_identity).astype(int) / 4
# Background Negative, Subgroup Positive
weights += (non_toxic & any_identity).astype(int) / 4
loss_weight = 1.0 / weights.mean()

# Column 0: binary target, column 1: sample weight
y_train = np.vstack([toxic.astype(int), weights]).T
y_aux_train = train[['target', 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat']].values
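
To see what the four terms add up to, the same rule can be written for a single row in plain Python. This is a sketch that mirrors the vectorized code above; `sample_weight` is a hypothetical helper, and the target and identity values are made up.

```python
def sample_weight(target, identity_values, threshold=0.5):
    """Weight of one comment: up to four terms of 1/4 each."""
    toxic = target >= threshold
    any_identity = any(v >= threshold for v in identity_values)
    any_non_identity = any(v < threshold for v in identity_values)

    weight = 0.25                       # overall: every comment
    if any_identity:
        weight += 0.25                  # subgroup: at least one column at or above threshold
    if toxic and any_non_identity:
        weight += 0.25                  # toxic, with at least one column below threshold
    if not toxic and any_identity:
        weight += 0.25                  # non-toxic, with at least one column at or above threshold
    return weight


print(sample_weight(0.0, [0.0, 0.0]))   # non-toxic, no identity mentioned
print(sample_weight(0.1, [0.9, 0.0]))   # non-toxic, identity mentioned
print(sample_weight(0.8, [0.0, 0.0]))   # toxic, no identity mentioned
print(sample_weight(0.8, [0.9, 0.0]))   # toxic, identity mentioned
```

Non-toxic comments that mention an identity get a larger weight than plain non-toxic ones, which pushes the model to avoid flagging a comment just because an identity term appears in it.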

Now that everything is set, we can start building the architecture of the LSTM and BERT models.

LSTM small architecture:

LSTM large architecture:

BERT Model:

I built three BERT models. The overall architecture is the same; only the underlying BERT model differs.

1. TFBertModel (bert-base-cased)
2. TFBertForSequenceClassification (bert-base-cased)
3. TFBertForSequenceClassification (bert-base-uncased)

from transformers import BertConfig, TFBertModel, TFBertForSequenceClassification

bert = 'bert-base-cased'  # or 'bert-base-uncased'
config = BertConfig.from_pretrained(bert)
bert_model = TFBertModel.from_pretrained(bert, config=config)
###################### OR ######################
bert_model = TFBertForSequenceClassification.from_pretrained(bert, config=config)

After training each model, I store its predictions and then blend them to achieve a better result.

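import pandas as pd

# Weighted sum of the three BERT models (weights add up to 10)
bert_final = (4 * bert_model['prediction']
              + 3 * bert_seq_uncased['prediction']
              + 3 * bert_seq_cased['prediction'])
# Weighted average of the two LSTM models
lstm_final = (2 * lstm_large['prediction']
              + 1.5 * lstm_small['prediction']) / 3.5
# BERT contributes 10 parts and the LSTM average 4 parts, hence the 14
final_pred = (bert_final + lstm_final * 4) / 14

submission = pd.read_csv("project_dataset/sample_submission.csv")
submission['prediction'] = final_pred
print("Shape:", submission.shape)
submission.head()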
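
As a quick sanity check on the blend, the coefficients can be expanded into the effective weight each model receives in the final prediction. This is plain arithmetic on the numbers in the blending code; the dictionary names are introduced here only for illustration.

```python
from fractions import Fraction

# Coefficients as they appear in the blending code.
bert_coeffs = {
    "bert_model": Fraction(4),
    "bert_seq_uncased": Fraction(3),
    "bert_seq_cased": Fraction(3),
}
lstm_coeffs = {"lstm_large": Fraction(2), "lstm_small": Fraction(3, 2)}
lstm_norm = Fraction(7, 2)   # the 3.5 divisor inside lstm_final
lstm_scale = Fraction(4)     # lstm_final * 4
total = Fraction(14)         # final divisor

# Effective share of each model in final_pred.
effective = {name: c / total for name, c in bert_coeffs.items()}
effective.update(
    {name: c / lstm_norm * lstm_scale / total for name, c in lstm_coeffs.items()}
)

for name, weight in effective.items():
    print(f"{name}: {weight} ({float(weight):.3f})")
print("sum:", sum(effective.values()))
```

The shares add up to exactly 1, so the blended prediction stays on the same scale as the individual model outputs, with the three BERT models carrying 10/14 of the total.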

After submitting the file I got this result: