Object Detection using a Convolutional Neural Network in a real-time Selenium Headless Browser



This is one of our exciting R&D projects that I’ve been fortunate to work on. The idea came to us while we were looking at all the cool things the OpenAI team was doing. After seeing Atari games play themselves and Go champions getting crushed, we thought: okay, but what about something actually useful?

In a nutshell, the way the OpenAI team approached the Atari games was to hand all the controls over to an agent that could play on its own and learn from rewards.

OpenAI Gym

For us, we just wanted to start at the surface, with something achievable in a hackathon. Our goal was to create an interface where June.ai could see a web page the way a human would, and later integrate interactions using a mouse and keyboard.

The problem

We’re using a ton of natural language processing algorithms, but those algorithms are all about text. Humans, as visual creatures, find meaning in the way buttons are shown, the fonts and sizes of titles, and the placement of images.

We wanted to see if we could build a self-driving browser.

We started small by picking a few items like headlines, subheaders, and buttons, and trained a Faster R-CNN model. We just wanted to see if we could connect the two technologies: a Selenium headless browser streaming video into a CNN model that could detect the objects we had trained it on.
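A minimal sketch of what we mean by connecting the two, assuming a headless Chrome driver and a placeholder detect_objects() standing in for the trained model (the URL and capture rate are arbitrary):

# Minimal sketch: capture frames from a headless browser and hand them to a
# detector. detect_objects() is a placeholder for the trained model's inference.
import time
from io import BytesIO

import numpy as np
from PIL import Image
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")

def detect_objects(frame):
    return []  # placeholder: run the trained detector on one RGB frame

for _ in range(100):                      # "streaming" as a screenshot loop
    png = driver.get_screenshot_as_png()  # current rendered page as PNG bytes
    frame = np.array(Image.open(BytesIO(png)).convert("RGB"))
    boxes = detect_objects(frame)
    time.sleep(0.1)                       # throttle the capture rate

driver.quit()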

We already know that Convolutional Neural Networks have driven huge breakthroughs in the field of computer vision. Unlike traditional methods, deep CNNs work by learning relatively small pieces of information in early layers and integrating them deeper in the network, and are thus able to handle a vast amount of variation in images.

Driven by the success of region proposal methods and region-based CNNs, object detection is widely applied to surveillance, vehicle detection, detection of manufactured products, and so on. The method used to build an object detection model is slightly different from a traditional CNN followed by a fully connected layer. Since the number and class of objects is not fixed, the length of the output varies. Thus we need to select different regions within each image and use a CNN model to predict whether an object exists within each region.

Faster R-CNN (proposed by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun) is composed of two modules: a deep fully convolutional network that proposes regions, and a Fast R-CNN detector that uses the proposed regions. Instead of using selective search as the region proposal method, a second network is built to predict the regions. An RoI pooling layer then takes the convolutional features together with the predicted bounding boxes.
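In rough pseudocode, the forward pass looks something like this (function names here are just descriptive, not from any particular library):

# Rough pseudocode of a Faster R-CNN forward pass; names are illustrative.
def faster_rcnn(image):
    features = backbone_cnn(image)                  # shared convolutional features
    proposals = region_proposal_network(features)   # predicted candidate boxes
    roi_features = roi_pooling(features, proposals) # crop features per proposal
    class_scores, box_refinements = detection_head(roi_features)
    return class_scores, box_refinements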

Here is a brief overview of how we trained our own object detection model.

We first collected 1,500+ real-world emails and converted them from HTML into RGB images encoded as JPEG. In order to label the objects in each email, we need a table of bounding boxes with coordinates that define the region of each object’s class.
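Converting an HTML email into a JPEG can be sketched roughly like this, assuming the emails are saved as local .html files (paths and filenames are placeholders):

# Sketch: rasterize HTML emails to JPEG with a headless browser.
import glob
import os
from io import BytesIO

from PIL import Image
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

for path in glob.glob("emails/*.html"):
    driver.get("file://" + os.path.abspath(path))
    png = driver.get_screenshot_as_png()
    rgb = Image.open(BytesIO(png)).convert("RGB")   # drop the alpha channel
    rgb.save(path.replace(".html", ".jpg"), "JPEG")

driver.quit()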

We chose LabelImg, a graphical image annotation tool, so we could manually label all the email images with objects like headline, button, image, discount, etc. The bounding-box coordinates of each object are then saved in an XML file.
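LabelImg writes Pascal VOC-style XML, so flattening one annotation file into a table of rows is straightforward; this sketch assumes that format:

# Sketch: flatten one LabelImg (Pascal VOC style) XML file into
# (filename, class, xmin, ymin, xmax, ymax) rows.
import xml.etree.ElementTree as ET

def parse_annotation(xml_path):
    root = ET.parse(xml_path).getroot()
    filename = root.find("filename").text
    rows = []
    for obj in root.findall("object"):
        name = obj.find("name").text             # e.g. "headline", "button"
        box = obj.find("bndbox")
        rows.append((filename, name,
                     int(box.find("xmin").text), int(box.find("ymin").text),
                     int(box.find("xmax").text), int(box.find("ymax").text)))
    return rows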

After hand-labeling the images, we convert the XML files into a TFRecords file, which serves as the input data for training. With the TensorFlow Object Detection API, we are able to train our own object detection model with customized labels.
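For reference, one record in that TFRecords file looks roughly like this, following the feature keys the TensorFlow Object Detection API expects; the label_map dict and the box-dict format are assumptions carried over from the annotation table above:

# Sketch: build one tf.train.Example in the Object Detection API format.
# Bounding boxes are normalized to [0, 1]; boxes is a list of dicts with
# keys "xmin", "ymin", "xmax", "ymax", "class".
import tensorflow as tf
from object_detection.utils import dataset_util

def create_tf_example(jpeg_path, width, height, boxes, label_map):
    with tf.gfile.GFile(jpeg_path, "rb") as f:
        encoded_jpg = f.read()
    xmins = [float(b["xmin"]) / width for b in boxes]
    xmaxs = [float(b["xmax"]) / width for b in boxes]
    ymins = [float(b["ymin"]) / height for b in boxes]
    ymaxs = [float(b["ymax"]) / height for b in boxes]
    classes_text = [b["class"].encode("utf8") for b in boxes]
    classes = [label_map[b["class"]] for b in boxes]
    feature = {
        "image/height": dataset_util.int64_feature(height),
        "image/width": dataset_util.int64_feature(width),
        "image/filename": dataset_util.bytes_feature(jpeg_path.encode("utf8")),
        "image/source_id": dataset_util.bytes_feature(jpeg_path.encode("utf8")),
        "image/encoded": dataset_util.bytes_feature(encoded_jpg),
        "image/format": dataset_util.bytes_feature(b"jpeg"),
        "image/object/bbox/xmin": dataset_util.float_list_feature(xmins),
        "image/object/bbox/xmax": dataset_util.float_list_feature(xmaxs),
        "image/object/bbox/ymin": dataset_util.float_list_feature(ymins),
        "image/object/bbox/ymax": dataset_util.float_list_feature(ymaxs),
        "image/object/class/text": dataset_util.bytes_list_feature(classes_text),
        "image/object/class/label": dataset_util.int64_list_feature(classes),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))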

Training an object detection model with a CNN can be time-consuming and computationally expensive, so we chose to train it on Google Cloud ML Engine. We set up the config with standard GPUs and 5 workers.

trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
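With that config saved as a YAML file (say cloud.yml), a training job can be submitted roughly like this; the job name, bucket, package paths, and region below are placeholders:

> gcloud ml-engine jobs submit training email_detection_$(date +%s) \
--job-dir gs://YOUR_BUCKET/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-central1 \
--config cloud.yml \
-- \
--train_dir gs://YOUR_BUCKET/train \
--pipeline_config_path gs://YOUR_BUCKET/training/ssd_mobilenet_v1_coco.config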
Google Cloud ML Engine Jobs Console

Here you can see we already achieved low loss after 20,000 steps.

After the model has been trained, we export it to a TensorFlow graph proto. Training produces multiple checkpoints; we choose a candidate checkpoint and export it with the command:

> python object_detection/export_inference_graph.py \
--input_type image_tensor \
--pipeline_config_path gs://elly-text-recognition/training/ssd_mobilenet_v1_coco.config \
--trained_checkpoint_prefix gs://elly-text-recognition/model_10_12/model.ckpt-17224 \
--output_directory model_10_12.pb

Next, we just need to apply the exported graph (model_10_12.pb) to a video. This is done with OpenCV, an open source computer vision library for processing images and videos.
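A sketch of what that looks like, assuming the frozen graph produced by the export script and the tensor names used by the TensorFlow Object Detection API (the video path and the 0.5 score threshold are placeholders):

# Sketch: run the exported detection graph over a video with OpenCV.
import cv2
import numpy as np
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    # export_inference_graph.py writes frozen_inference_graph.pb
    # inside the chosen output directory
    with tf.gfile.GFile("model_10_12.pb/frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

cap = cv2.VideoCapture("browser_capture.mp4")
with tf.Session(graph=graph) as sess:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        boxes, scores, classes = sess.run(
            ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
            feed_dict={"image_tensor:0": np.expand_dims(rgb, 0)})
        h, w = frame.shape[:2]
        for box, score in zip(boxes[0], scores[0]):
            if score < 0.5:
                continue
            ymin, xmin, ymax, xmax = box          # normalized coordinates
            cv2.rectangle(frame, (int(xmin * w), int(ymin * h)),
                          (int(xmax * w), int(ymax * h)), (0, 255, 0), 2)
        cv2.imshow("detections", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()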

And we got this:

Great! It works: we followed a simple tutorial, basic object detection works, and the technologies are able to communicate with each other.

Now, getting back to the problem. One particular example: a lot of promotional emails have images that say 30% off in big huge letters but never say it in the text (or it’s very difficult to filter out because of the noise). So we trained the model to detect anything containing a % sign inside promotional images.

And…. magic!

The object detection model does a pretty impressive job on email images. We tried to use the technique in a production-level application.

Unfortunately, the latency was just not worth it.

First you’d have to render the entire email as a web page and save that giant image somewhere. Then run the image through the model, label it, parse out the text, and save that text to the database.

We did have success clicking unsubscribe buttons for you, though it’s usually much easier to reply to the unsubscribe email. But for the small percentage of emails that don’t include unsubscribe headers, we do click the unsubscribe buttons, because in that case latency isn’t really an issue.

The eventual goal of this project would be to map out different websites that do NOT yet have APIs available and perform automated actions for you. And we think this is very doable: focus on the most repetitive, high-volume tasks our users perform and start there.

The next thing we’ll look at is how we can grab the code of interesting elements in your browser. This would allow us to extract text and start mapping out how certain code is rendered on web pages…
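One speculative way to do that is to map a detected bounding box back to the DOM element under its center and pull out its HTML; this is just an illustration, and it assumes the box coordinates are in viewport pixels:

# Speculative sketch: find the DOM element under a detected box's center
# and return its HTML. box is a dict with xmin/ymin/xmax/ymax in
# viewport pixels; driver is a Selenium WebDriver.
def element_html_at(driver, box):
    cx = (box["xmin"] + box["xmax"]) / 2.0
    cy = (box["ymin"] + box["ymax"]) / 2.0
    return driver.execute_script(
        "var el = document.elementFromPoint(arguments[0], arguments[1]);"
        "return el ? el.outerHTML : null;", cx, cy)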
