End to End Machine Learning: From Data Collection to Deployment


Originally published here. For more posts of this kind, visit my blog

This started out as a challenge: a friend and I wanted to see whether we could build something from scratch and push it to production in three weeks. This is our story.

In this post, we’ll go through the necessary steps to build and deploy a machine learning application. This starts from data collection to deployment and the journey, as you’ll see it, is exciting and fun 😀.

Before we begin, let’s have a look at the app we’ll be building:

As you can see, this web app allows a user to evaluate random brands by writing reviews. While typing, the user sees the sentiment score of their input update in real time, along with a proposed rating from 1 to 5.

The user can then change the rating if the suggested one does not reflect their views, and submit.

You can think of this as a crowdsourcing app for brand reviews, with a sentiment analysis model that suggests ratings the user can tweak and adapt afterwards.

To build this application we’ll follow these steps:

  • Collecting and scraping 🧹 customer reviews data using Selenium and Scrapy
  • Training a deep learning sentiment classifier 🤖 on this data using PyTorch
  • Building an interactive web app using Dash 📲
  • Setting a REST API and a Postgres database 💻
  • Dockerizing the app using Docker Compose 🐳
  • Deploying to AWS 🚀

All the code is available in our GitHub repository and organized in independent directories, so you can check it out, run it and improve it.

Let’s get started! 💻

1 — Scraping the data from Trustpilot with Selenium and Scrapy 🧹

⚠️ Disclaimer: The scripts below are meant for educational purposes only: scrape responsibly.

In order to train a sentiment classifier, we need data. We could certainly download open-source datasets for sentiment analysis, such as Amazon Polarity or IMDB movie reviews, but for the purpose of this tutorial, we’ll build our own dataset. We’ll scrape customer reviews from Trustpilot.

Trustpilot.com is a consumer review website founded in Denmark in 2007. It hosts reviews of businesses worldwide and nearly 1 million new reviews are posted each month.

Trustpilot is an interesting source because each customer review is associated with a number of stars.

By leveraging this data, we are able to map each review to a sentiment class.

In fact, we defined reviews with:

  • 1 and 2 stars as bad reviews
  • 3 stars as average reviews ⚠️
  • 4 and 5 stars as good reviews

In order to scrape customer reviews from Trustpilot, we first have to understand the structure of the website.

Trustpilot is organized by categories of businesses.

Each category is divided into sub-categories.

Each sub-category is divided into companies.

And then each company has its own set of reviews, usually spread over many pages.

As you see, this is a top down tree structure. In order to scrape the reviews out of it, we’ll proceed in two steps.

  • Step 1️⃣: use Selenium to fetch each company page url
  • Step 2️⃣: use Scrapy to extract reviews from each company page

Scrape company urls with Selenium: step 1

All the Selenium code is available and runnable from this notebook 📓

We use Selenium first because the content of the website that renders each company’s URL is dynamic, which means it cannot be accessed directly from the page source: it is rendered on the front end of the website through Ajax calls.

Selenium does a good job of extracting this type of data: it simulates a browser that interprets JavaScript-rendered content. When launched, it clicks on each category, narrows down to each sub-category and goes through all the companies one by one, extracting their URLs. When it’s done, the script saves these URLs to a csv file.

Let’s see how this is done:

We’ll first import Selenium dependencies along with other utility packages.
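A minimal set of imports for this step might look like the following (the exact list in the notebook may differ slightly):

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options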

We start by fetching the sub-category URLs nested inside each category.

If you open up your browser and inspect the source code, you’ll find 22 category blocks located in div elements whose class attribute equals category-object.

Each category has its own set of sub-categories. Those are located in div elements whose class attribute equals child-category. We are interested in the URLs of these sub-categories.

Let’s first loop over the categories and, for each one, collect the URLs of its sub-categories. This can be achieved using BeautifulSoup and requests.
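Here is a sketch of that loop, assuming the class names mentioned above (the categories URL and the markup are assumptions and may have changed since this was written):

base_url = "https://trustpilot.com"

# Fetch the main categories page and parse it with BeautifulSoup.
response = requests.get(base_url + "/categories")
soup = BeautifulSoup(response.content, "html.parser")

# Collect the sub-category URLs nested inside each of the 22 category blocks.
subcategory_urls = []
for category in soup.find_all("div", class_="category-object"):
    for subcategory in category.find_all("div", class_="child-category"):
        link = subcategory.find("a")
        if link is not None:
            subcategory_urls.append(base_url + link["href"])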

Now comes the Selenium part: we’ll need to loop over the companies of each sub-category and fetch their URLs.

Remember, companies are presented inside each sub-category like this:

We first define a function to fetch company urls of a given subcategory:
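Here is a sketch of what that function might look like; the CSS class used as a selector is a placeholder, not Trustpilot's real markup:

def extract_company_urls_from_page(driver):
    # Parse the page currently loaded in the browser and pull out the company links.
    # "business-unit-card" is a placeholder class name: adapt it to the actual markup.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    links = soup.find_all("a", class_="business-unit-card")
    return [base_url + link["href"] for link in links]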

and another function to check if a next page button exists:
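Another hedged sketch: it returns the URL of the next page when the button exists, and None otherwise (the rel="next" attribute is an assumption about the markup):

def go_next_page(driver):
    # Look for a pagination link pointing to the next page of companies.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    next_link = soup.find("a", attrs={"rel": "next"})
    return base_url + next_link["href"] if next_link else None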

Now we initialize Selenium with a headless Chromedriver. This prevents Selenium from opening up a Chrome window thus accelerating the scraping.
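The initialization might look like this (the timeout value is arbitrary, and the Chromedriver binary mentioned in the PS below is assumed to be on your PATH; otherwise pass its path to webdriver.Chrome):

options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")

driver = webdriver.Chrome(options=options)

timeout = 20  # seconds Selenium waits for a page to completely load
driver.set_page_load_timeout(timeout)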

PS: You’ll have to download Chromedriver from this link and choose the version that matches your operating system. It’s a separate binary that Selenium uses to drive the Chrome browser.

The timeout variable is the time (in seconds) Selenium waits for a page to completely load.

Now we launch the scraping. This takes approximately 50 minutes with a good internet connection.
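Put together, the main loop is roughly the following, reusing the two helper functions sketched above:

company_urls = []

for subcategory_url in subcategory_urls:
    driver.get(subcategory_url)
    while True:
        company_urls += extract_company_urls_from_page(driver)
        next_url = go_next_page(driver)
        if next_url is None:
            break
        driver.get(next_url)

driver.quit()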

Once the scraping is over, we save the company URLs to a csv file.
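For instance with pandas (the output path is just an example):

companies = pd.DataFrame(company_urls, columns=["company_url"])
companies.to_csv("exports/company_urls_en.csv", index=False)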

And here’s what the data looks like:

Pretty neat, right? Now we’ll have to go through the reviews listed at each of those URLs.

Scrape customer reviews with Scrapy: step 2

All the scrapy code can be found in this folder 📁

Ok, now we’re ready to use Scrapy.

First, you need to install it using either:

  • conda: conda install -c conda-forge scrapy
  • pip: pip install scrapy

Then, you’ll need to start a project:

cd src/scraping/scrapy 
scrapy startproject trustpilot

This command creates the structure of a Scrapy project. Here’s what it looks like:
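It follows Scrapy’s default layout:

trustpilot/
    scrapy.cfg
    trustpilot/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py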

Using Scrapy for the first time can be overwhelming, so to learn more about it, you can visit the official tutorials.

To build our scraper, we’ll have to create a spider inside the spiders folder. We’ll call it scraper.py and change some parameters in settings.py. We won’t change the other files.

What the scraper will do is the following:

  • It starts from a company url
  • It goes through each customer review and yields a dictionary of data containing the following items
  • comment: the text review
  • rating: the number of stars (1 to 5)
  • url_website: the company url on trustpilot
  • company_name: the company name being reviewed
  • company_website: the website of the company being reviewed
  • company_logo: the URL of the logo of the company being reviewed
  • It moves to the next page if any

Here’s the full script.
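The script in the repo is the reference; below is a simplified sketch of its logic. All CSS selectors are placeholders, and the starting URLs are read from the csv produced in step 1 (using the example path from above):

import pandas as pd
import scrapy


class TrustpilotSpider(scrapy.Spider):
    name = "trustpilot"

    def start_requests(self):
        # Start from the company URLs collected with Selenium in step 1.
        company_urls = pd.read_csv("exports/company_urls_en.csv")["company_url"]
        for url in company_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Placeholder selectors: adapt them to the page's actual markup.
        company_name = response.css("h1.company-name::text").get(default="").strip()
        company_website = response.css("a.company-website::attr(href)").get()
        company_logo = response.css("img.company-logo::attr(src)").get()

        for review in response.css("div.review-card"):
            yield {
                "comment": review.css("p.review-content::text").get(),
                "rating": review.css("div.star-rating::attr(data-rating)").get(),
                "url_website": response.url,
                "company_name": company_name,
                "company_website": company_website,
                "company_logo": company_logo,
            }

        # Move to the next page of reviews if there is one.
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)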

To fully understand it, you should inspect the source code. It’s really easy to get.

In any case, if you have a question don’t hesitate to post it in the comment section ⬇

Before launching the scraper, you have to change a couple of things in settings.py:

Here are the changes we made:
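They roughly boil down to the following lines (in recent Scrapy versions, FEED_FORMAT and FEED_URI have been replaced by the FEEDS setting):

# settings.py (only the modified lines)
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
FEED_FORMAT = "csv"
FEED_URI = "comments_trustpilot_en.csv"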

This tells the scraper to ignore robots.txt, to use 32 concurrent requests and to export the data in csv format under the filename comments_trustpilot_en.csv.

Now time to launch the scraper:

cd src/scraping/scrapy 
scrapy crawl trustpilot

We’ll let it run for a little bit of time.

Note that we can interrupt it at any moment, since it saves the data on the fly; the output folder is src/scraping/scrapy.

2 — Training a sentiment classifier using PyTorch 🤖

The code and the model we’ll be using here are inspired by this GitHub repo, so go check it out for additional information. If you want to stick to this project’s repo, you can look at this link.

Now that the data is collected, we’re ready to train a sentiment classifier to predict the labels we defined earlier.

There is a wide range of possible models to choose from. The one we’ll be training is a character-based convolutional neural network. It’s based on this paper and has proved to be really good on text classification tasks such as binary classification of Amazon review datasets.

The question you might ask up front, though, is the following: how would you use CNNs for text classification? Aren’t these architectures specifically designed for image data?

Well, the truth is, CNNs are far more versatile and their applications extend beyond image classification. In fact, they are also able to capture the sequential information inherent to text data. The only trick is to represent the input text efficiently.

To see how this is done, imagine the following tweet:

Assuming an alphabet of size 70 containing the English letters and special characters, and an arbitrary maximum length of 140, one possible representation of this sentence is a (70, 140) matrix where each column is a one-hot vector indicating the position of a given character in the alphabet, 140 being the maximum tweet length. This process is called quantization.

Note that if a sentence is too long, the representation is truncated to the first 140 characters. If the sentence is too short, zero column vectors are padded until the (70, 140) shape is reached.
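A quantization function along these lines illustrates the idea (the alphabet below approximates the paper's 70-character set but may not match it exactly):

import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
max_length = 140

def quantize(text, alphabet=alphabet, max_length=max_length):
    # One row per alphabet entry, one column per character position.
    char_to_index = {char: i for i, char in enumerate(alphabet)}
    matrix = np.zeros((len(alphabet), max_length))
    # Truncate to max_length; shorter texts keep zero columns as padding.
    for position, char in enumerate(text.lower()[:max_length]):
        if char in char_to_index:
            matrix[char_to_index[char], position] = 1.0
    return matrix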

So what to do now with this representation?

Feed it to a CNN for classification, obviously 😁

There is a small trick, though. Convolutions are usually performed with 2D kernels, because these structures capture the 2D spatial information lying in the pixels. Text, however, is not suited to this type of convolution, because letters follow each other sequentially, in one dimension only, to form meaning. To capture this one-dimensional dependency, we’ll use 1D convolutions.

So how does a 1-D convolution work?

Unlike 2D convolutions, which slide a 2D kernel horizontally and vertically over the pixels, 1D convolutions use 1D kernels that slide horizontally only, over the columns (i.e. the characters), to capture the dependency between characters and their compositions. You can think of a 1D kernel of size 3, for example, as a character 3-gram detector that fires when it detects a composition of three successive letters relevant to the prediction.
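In PyTorch this corresponds to nn.Conv1d. A quick illustration of the shapes involved (the 256 feature maps are just an example):

import torch
import torch.nn as nn

# A batch of 2 quantized sentences: 70 input channels (the alphabet), 140 characters.
x = torch.randn(2, 70, 140)

# A kernel of size 3 slides over the character axis, acting as a character 3-gram detector.
conv = nn.Conv1d(in_channels=70, out_channels=256, kernel_size=3)
print(conv(x).shape)  # torch.Size([2, 256, 138])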

The diagram below shows the architecture we’ll be using:

It has 6 convolutional layers:

and 2 fully connected layers:

On the raw data, i.e. the matrix representation of a sentence, convolutions with a kernel of size 7 are applied. The output of this layer is then fed to a second convolutional layer, also with a kernel of size 7, and so on, until the last convolutional layer, which has a kernel of size 3.

After the last convolution layer, the output is flattened and passed through two successive fully connected layers that act as a classifier.
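Below is a condensed PyTorch sketch of such an architecture. The kernel sizes follow the description above (7, 7, then 3s); the number of filters, the pooling layers and the hidden size are borrowed from the original paper and may differ from the repo's model.py:

import torch
import torch.nn as nn


class CharacterCNN(nn.Module):
    def __init__(self, n_classes=3, alphabet_size=70, max_length=140):
        super().__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv1d(alphabet_size, 256, kernel_size=7), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(256, 256, kernel_size=7), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(256, 256, kernel_size=3), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3), nn.ReLU(), nn.MaxPool1d(3),
        )
        # Infer the flattened size with a dummy pass so the classifier
        # stays consistent with whatever max_length is chosen.
        with torch.no_grad():
            flat_dim = self.conv_layers(torch.zeros(1, alphabet_size, max_length)).view(1, -1).size(1)
        self.fc_layers = nn.Sequential(
            nn.Linear(flat_dim, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, n_classes),
        )

    def forward(self, x):  # x: (batch, alphabet_size, max_length)
        features = self.conv_layers(x)
        return self.fc_layers(features.view(features.size(0), -1))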

To learn more about character level CNN and how they work, you can watch this video

Character-level CNNs are interesting because they have several nice properties 💡

  • They are quite powerful in text classification (see paper’s benchmark) even though they don’t have any notion of semantics
  • You don’t need to apply any text preprocessing (tokenization, lemmatization, stemming …) while using them
  • They handle misspelled words and OOV (out-of-vocabulary) tokens
  • They are faster to train compared to recurrent neural networks
  • They are lightweight since they don’t require storing a large word embedding matrix. Hence, you can deploy them in production easily

That’s all for the theory!

How to train the model using PyTorch 🔥

In order to train a character-level CNN, you’ll find all the files you need under the src/training/ folder.

Here’s the structure of the code inside this folder:

train.py: used for training a model
predict.py: used for testing and inference

src: a folder that contains:

  • model.py: the actual CNN model (model initialization and forward method)
  • dataloader.py: the script responsible for passing the data to the training loop after processing
  • utils.py: a set of utility functions for text preprocessing (url/hashtag/user_mention removal)

To train our classifier, run the following commands:
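The exact command-line arguments are defined in train.py, so check the repo; the invocation is roughly:

cd src/training
python train.py  # pass the path to the scraped csv and any hyperparameters the script expects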

When it’s done, you can find the trained models in src/training/models directory.

Model performance 🎯

On training set

On the training set we report the following metrics for the best model (epoch 5):

Here are the corresponding TensorBoard training logs:

On validation set

On the validation set we report the following metrics for the best model (epoch 5):

Here are the corresponding TensorBoard validation logs: