Fake Job Classification with BERT

Original article can be found here (source): Artificial Intelligence on Medium

Recently, The University of the Aegean published the Employment Scam Aegean Dataset. The data contains about 18K real-life job advertisements. The aim is to provide a clear picture of the Employment Scam problem to the research community. In this post, we will use BERT to classify fake job descriptions in the Employment Scam Aegean Dataset.

Before we get started, let’s briefly review the BERT method.

BERT stands for Bidirectional Encoder Representations from Transformers. The paper describing the BERT algorithm was published by Google and can be found here. BERT works by randomly masking word tokens and representing each masked word with a vector based on its context. The BERT framework has two steps: “pre-training” and “fine-tuning”.

PRE-TRAINING BERT

To pre-train BERT, the researchers trained the model on two unsupervised tasks. The first task is Masked LM: 15% of the word tokens in each sequence are randomly masked, and the model is trained to predict the masked tokens. The second task is Next Sentence Prediction (NSP). This is motivated by tasks such as Question Answering and Natural Language Inference, which require models to accurately capture relationships between sentences. To tackle this, they pre-train for a binarized prediction task that can be trivially generated from any monolingual corpus. The example they give in the paper is as follows: given sentences A and B, 50% of the time B is the sentence that actually follows A and is labelled “isNext”, and the other 50% of the time B is a sentence randomly selected from the corpus and is labelled “notNext”. Pre-training towards this task proves to be beneficial for Question Answering and Natural Language Inference tasks.
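
To make the NSP setup concrete, here is a toy sketch of how such sentence pairs could be generated from a corpus (my own illustration, not code from the BERT paper):

import random

def make_nsp_pairs(documents):
    """Build (sentence_a, sentence_b, label) examples for next-sentence prediction.

    documents: a list of documents, where each document is a list of sentences.
    """
    pairs = []
    all_sentences = [s for doc in documents for s in doc]
    for doc in documents:
        for i in range(len(doc) - 1):
            if random.random() < 0.5:
                # 50% of the time B really is the next sentence: label "isNext"
                pairs.append((doc[i], doc[i + 1], "isNext"))
            else:
                # otherwise B is a random sentence from the corpus: label "notNext"
                # (for simplicity we don't bother excluding the true next sentence)
                pairs.append((doc[i], random.choice(all_sentences), "notNext"))
    return pairs

example_docs = [["He went to the store.", "He bought a gallon of milk.", "Then he went home."]]
print(make_nsp_pairs(example_docs))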

FINE-TUNING BERT

Fine-tuning BERT works by encoding concatenated text pairs with self-attention. Self-attention is the process by which each word in a sequence learns how strongly it relates to the other words in that sequence. An early application of self-attention to machine reading is the Long Short-Term Memory-Networks for Machine Reading paper (Cheng et al., 2016). The nice thing about BERT is that encoding concatenated texts with self-attention effectively captures bidirectional cross-attention between the two sentences.
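
To give a rough sense of the mechanism, here is a toy single-head version of scaled dot-product self-attention (a simplified sketch, not BERT's actual multi-head implementation):

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model) token embeddings; w_q, w_k, w_v: (d_model, d_k) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # how much each token attends to every other token
    weights = F.softmax(scores, dim=-1)       # each row sums to 1
    return weights @ v                        # context-aware representation of each token

x = torch.randn(5, 16)                        # a toy sequence of 5 token embeddings
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape) # torch.Size([5, 8])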


In this article, we will apply BERT to predict whether or not a job posting is fraudulent. This post is inspired by BERT to the Rescue, which uses BERT for sentiment classification of the IMDB data set. The code from BERT to the Rescue can be found here.

Since we are interested in single sentence classification, the relevant architecture is:

[Figure: the single-sentence classification architecture from the BERT paper (source)]

In the figure above, the input to BERT is a sequence of word tokens and the outputs are the encoded word representations (vectors). For single sentence classification, we use the vector representation of the special [CLS] token (the pooled output) as the input to a classification layer.
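
Concretely, here is a small sketch (with a made-up example sentence) showing the shapes of BERT's outputs using the pytorch_pretrained_bert package we import below; the 768-dimensional pooled [CLS] vector is what our classifier will consume:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# A made-up job-posting snippet, tokenized with a leading [CLS] token
tokens = ['[CLS]'] + tokenizer.tokenize("Work from home and earn thousands per week!")
token_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    word_vectors, pooled_output = model(token_ids, output_all_encoded_layers=False)

print(word_vectors.shape)   # (1, number of tokens, 768): one vector per token
print(pooled_output.shape)  # (1, 768): the pooled [CLS] vector used for classification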

Now let’s get started!

1. IMPORT PACKAGES

import pandas as pd 
import numpy as np
import torch.nn as nn
from pytorch_pretrained_bert import BertTokenizer, BertModel
import torch
from keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import classification_report

2. DATA EXPLORATION

First let’s read the data into a data frame and print the first five rows. We can also set the max number of display columns to ‘None’:

pd.set_option('display.max_columns', None)
df = pd.read_csv("fake_job_postings.csv")
print(df.head())

For simplicity, let’s look at the ‘description’ and ‘fraudulent’ columns:

df = df[['description', 'fraudulent']]
print(df.head())

The target for our classification model is the ‘fraudulent’ column. To get an idea of the distribution of values in ‘fraudulent’, we can use ‘Counter’ from the collections module:

from collections import Counter
print(Counter(df['fraudulent'].values))

The ‘0’ value corresponds to a normal job posting and the ‘1’ value corresponds to a fraudulent posting. We see that the data is heavily imbalanced: there are far more normal job postings (roughly 17K) than fraudulent postings (866).

Before proceeding, let’s drop ‘NaN’ values:

df.dropna(inplace = True)

Next, we want to balance our data set so that we have an equal number of ‘fraudulent’ and ‘not fraudulent’ postings. We also randomly shuffle the combined records:

# Down-sample the normal postings to match the number of fraudulent postings
df_fraudulent = df[df['fraudulent'] == 1]
df_normal = df[df['fraudulent'] == 0]
df_normal = df_normal.sample(n=len(df_fraudulent))

# Combine the two classes and shuffle the rows
# (DataFrame.append was removed in pandas 2.0; use pd.concat([df_normal, df_fraudulent]) there)
df = df_normal.append(df_fraudulent)
df = df.sample(frac=1, random_state=24).reset_index(drop=True)

Again, verifying that we get the desired result:

print(Counter(df['fraudulent'].values))

3. DATA PREPARATION

Next, we want to format the data so that it can be used as input to our BERT model. We split our data into training and testing sets:

train_data = df.head(866)
test_data = df.tail(866)

We generate a list of dictionaries with ‘description’ and ‘fraudulent’ keys:

train_data = [{'description': description, 'fraudulent': fraudulent} for description, fraudulent in zip(train_data['description'], train_data['fraudulent'])]
test_data = [{'description': description, 'fraudulent': fraudulent} for description, fraudulent in zip(test_data['description'], test_data['fraudulent'])]

Generate a list of tuples from the list of dictionaries:

train_texts, train_labels = list(zip(*map(lambda d: (d['description'], d['fraudulent']), train_data)))
test_texts, test_labels = list(zip(*map(lambda d: (d['description'], d['fraudulent']), test_data)))

Generate tokens and token ids:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
train_tokens = list(map(lambda t: ['[CLS]'] + tokenizer.tokenize(t)[:511], train_texts))
test_tokens = list(map(lambda t: ['[CLS]'] + tokenizer.tokenize(t)[:511], test_texts))
train_tokens_ids = list(map(tokenizer.convert_tokens_to_ids, train_tokens))
test_tokens_ids = list(map(tokenizer.convert_tokens_to_ids, test_tokens))
train_tokens_ids = pad_sequences(train_tokens_ids, maxlen=512, truncating="post", padding="post", dtype="int")
test_tokens_ids = pad_sequences(test_tokens_ids, maxlen=512, truncating="post", padding="post", dtype="int")

Notice that we truncate each token sequence to 512 tokens (511 WordPiece tokens plus the leading [CLS] token), because 512 is the maximum sequence length BERT can handle.
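
For example, a single made-up description (reusing the tokenizer defined above) is tokenized into WordPiece tokens and truncated like this:

example = "Seeking a motivated work-from-home data entry specialist. No experience necessary."
tokens = ['[CLS]'] + tokenizer.tokenize(example)[:511]  # reserve one position for [CLS]
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(len(tokens), tokens[:6])  # sequence length and the first few WordPiece tokens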

Finally, generate a boolean array based on the value of ‘fraudulent’ for our testing and training sets:

train_y = np.array(train_labels) == 1
test_y = np.array(test_labels) == 1

4. MODEL BUILDING

We create our BERT classifier, which contains an ‘initialization’ method and a ‘forward’ method that returns the probability that a posting is fraudulent:

class BertBinaryClassifier(nn.Module):
    def __init__(self, dropout=0.1):
        super(BertBinaryClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, tokens, masks=None):
        # pooled_output is the 768-dimensional [CLS] representation of the whole sequence
        _, pooled_output = self.bert(tokens, attention_mask=masks, output_all_encoded_layers=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        proba = self.sigmoid(linear_output)
        return proba

Next, we generate attention masks for training and testing (1.0 for real tokens, 0.0 for padding):

train_masks = [[float(i > 0) for i in ii] for ii in train_tokens_ids]
test_masks = [[float(i > 0) for i in ii] for ii in test_tokens_ids]
train_masks_tensor = torch.tensor(train_masks)
test_masks_tensor = torch.tensor(test_masks)

Generate token tensors for training and testing:

train_tokens_tensor = torch.tensor(train_tokens_ids)
train_y_tensor = torch.tensor(train_y.reshape(-1, 1)).float()
test_tokens_tensor = torch.tensor(test_tokens_ids)
test_y_tensor = torch.tensor(test_y.reshape(-1, 1)).float()

and finally, prepare our data loaders:

BATCH_SIZE = 1
train_dataset = torch.utils.data.TensorDataset(train_tokens_tensor, train_masks_tensor, train_y_tensor)
train_sampler = torch.utils.data.RandomSampler(train_dataset)
train_dataloader = torch.utils.data.DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE)
test_dataset = torch.utils.data.TensorDataset(test_tokens_tensor, test_masks_tensor, test_y_tensor)
test_sampler = torch.utils.data.SequentialSampler(test_dataset)
test_dataloader = torch.utils.data.DataLoader(test_dataset, sampler=test_sampler, batch_size=BATCH_SIZE)

5. FINE TUNING

We use the Adam optimizer to minimize the Binary Cross Entropy loss, and we train with a batch size of 1 for a single epoch:

BATCH_SIZE = 1
EPOCHS = 1
bert_clf = BertBinaryClassifier()
optimizer = torch.optim.Adam(bert_clf.parameters(), lr=3e-6)
for epoch_num in range(EPOCHS):
    bert_clf.train()
    train_loss = 0
    for step_num, batch_data in enumerate(train_dataloader):
        token_ids, masks, labels = tuple(t for t in batch_data)
        probas = bert_clf(token_ids, masks)
        loss_func = nn.BCELoss()
        batch_loss = loss_func(probas, labels)
        train_loss += batch_loss.item()
        # reset gradients, backpropagate, and update the weights
        bert_clf.zero_grad()
        batch_loss.backward()
        optimizer.step()
        print('Epoch: ', epoch_num + 1)
        print("\r" + "{0}/{1} loss: {2} ".format(step_num, len(train_data) / BATCH_SIZE, train_loss / (step_num + 1)))

And we evaluate our model:

bert_clf.eval()
bert_predicted = []
all_logits = []
with torch.no_grad():
    for step_num, batch_data in enumerate(test_dataloader):
        token_ids, masks, labels = tuple(t for t in batch_data)
        logits = bert_clf(token_ids, masks)
        loss_func = nn.BCELoss()
        loss = loss_func(logits, labels)
        numpy_logits = logits.cpu().detach().numpy()

        # the model outputs probabilities, so 0.5 is the decision threshold
        bert_predicted += list(numpy_logits[:, 0] > 0.5)
        all_logits += list(numpy_logits[:, 0])

print(classification_report(test_y, bert_predicted))

This model does a decent job of predicting real postings. The performance on fraudulent posts isn’t as good, but it can likely be improved by training for more epochs and doing further feature engineering. I encourage you to play around with hyper-parameter tuning and the training data to see if you can improve classification performance.
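
For instance, these are a few knobs worth experimenting with; the values below are my own suggestions based on the ranges recommended in the BERT paper, not the settings used above:

# Illustrative hyper-parameter choices to experiment with (not the settings used in this post)
EPOCHS = 4              # the BERT paper recommends 2-4 fine-tuning epochs
LEARNING_RATE = 2e-5    # the paper suggests trying 5e-5, 3e-5, and 2e-5
BATCH_SIZE = 16         # larger batches, if GPU memory allows

optimizer = torch.optim.Adam(bert_clf.parameters(), lr=LEARNING_RATE)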

To summarize, we built a BERT classifier to predict whether job postings were real or fraudulent. If you are interested in other applications of BERT, you can read Fake News Classification with BERT and Russian Troll Tweets: Classification with BERT. If you are interested in a thorough walkthrough of the BERT method, I encourage you to read BERT to the Rescue. The code from this post is available on GitHub. Thank you for reading!