Source: Deep Learning on Medium
By Michael McDermott, Senior Data Scientist; Tommy Levi (https://twitter.com/tslevi), Director of Data Science; Jordan Dawe (https://twitter.com/freedryk), Senior Data Science Developer; Yosem Sweet (https://twitter.com/yosemsweet), Senior Director of Technology; Tavis Rudd (https://twitter.com/tavisrudd), Principal Engineer; Macarena Poo, UI Developer
At Unbounce, we have an R&D team that focuses on researching new features to help our customers. The central question that drives us is: how does one make a high-converting landing page? There are many factors that likely affect landing page conversion rate, such as the page copy, the graphic design, or the quality of the leads being driven to the page. We have been using machine learning to examine how the layout, typography, and colour palette of a landing page influence its conversion rate. In the interest of giving back to the community, we wanted to open source an optimized PyTorch Tree-LSTM implementation, share some of our machine learning work at Unbounce, and describe the work involved in doing machine learning in an industry setting.
Our PyTorch Tree-LSTM repository can be found here: https://github.com/unbounce/pytorch-tree-lstm
Landing pages commonly share a standard set of elements: a headline summarizing the page’s offer, a “call to action” detailing the actions the page designer wants visitors to take, among others. (A more complete description of landing page elements can be found here.) In the field of marketing there is quite a bit of standard lore of “best practices” around how design elements should be used on a page. These include rules like “keep your brand consistent”, “have a prominent offer in the headline”, “clearly display the button/form for user actions”, and “the logo should go at the top left”. We have the unique opportunity to investigate these questions from a data-driven perspective. For example: Should you have a prominent offer in the headline? Does it matter at all?
With a dataset of these elements along with the conversion rate of the landing page, we’d be able to analyze landing page design and generate insights for our customers as to how they could improve their pages. Unfortunately, landing pages typically do not have these elements labeled in a machine-readable way. To remedy this problem we want to build a semantic design element classifier, a machine learning system that, given the HTML and CSS of a landing page, would determine which parts of the HTML Document Object Model (DOM) represented which type of design element.
Our R&D team believes in rapid iteration and minimum viable products, so we decided to see if we could train a machine learning model to identify landing page headlines for us. Headlines are easy for humans to identify, so our success or failure in building such a classifier would tell us a lot about how hard it would be to classify other page elements. In this post we wanted to share with the community how we developed this system.
The Data Set
Unbounce hosts roughly 500,000 landing pages on its page servers at any given time. For each of these pages we have two types of data we could potentially use to predict headlines:
- Page screenshots of the rendered page
- HTML DOM, CSS, and image files that are used by the web browser to render the landing page.
In principle, either of these should give us sufficient information to identify a page element (since a trained human could). As a team we discussed using both options. We decided to start with the HTML and CSS data for the following reasons:
- DOM data generally requires fewer computational resources to work with than image data.
- A model that worked on the image data would have to segment the image into elements before classifying the elements, while the DOM data is already segmented into HTML elements.
- The text displayed by an HTML element is likely useful for classifying the element type, and extracting text from the DOM is much simpler than extracting it from an image.
- It’s easy to convert a DOM into an image with a web browser, but it’s hard to do the reverse.
To collect this data we turned to the Puppeteer module for Node.js, which provides an interface to control a headless instance of the Chromium web browser. Using an experimental API in the Chrome DevTools Protocol, we processed 10,000 Unbounce landing pages, capturing for each a DOM snapshot, the computed CSS properties for each DOM node, and a screenshot of the page rendered with a browser viewport 1280 pixels wide.
Labelling the Data Set
Amazon Mechanical Turk allows you to send human labelers to a webpage to perform their tasks, so we built a TypeScript app that could display the HTML DOM tree and screenshot for a web page and allow users to click page components to identify the page headline. With this in place, we ran a week-long contest for Unbouncers; the three top labelers would get $25 gift cards, and the team that labeled the most as a group would get a cake. This worked surprisingly well and at the end of the contest we had over 5,000 pages with labeled headlines. More information about our labeling contest can be found here.
We found that 4,714 pages were labelled more than once. Of these, 41% were labelled consistently. Even more interesting, 2,072 pages were labelled multiple times by the same user and 18% of these labels disagreed. Before we can use this data set for any machine learning, we need to resolve these inconsistencies, because feeding a model the same example with different labels makes it difficult, or even impossible, for training to converge.
With our data gathered we next had to decide on a model to train, and to select an appropriate model we needed to think a bit about the task we want to accomplish. Our task is a supervised classification problem — we want to label page components as headlines, and we have a training set with labeled headlines available. The simplest approach to this task would be to run a classifier on the CSS properties and attributes of every HTML element in the DOM tree. This kind of element-by-element classification achieves test accuracies of 60–90% (Lim et al. 2012) — reasonable performance but we would prefer the classifier to be a bit more accurate, so we started to consider how to incorporate information about the DOM structure into the model.
The semantic meaning of DOM elements is not solely based on the properties of a single HTML element: an element’s design properties are also contained in its relations to elements nested inside and surrounding that element. For example, headlines are often grouped with an image in the DOM structure, or broken up across multiple sibling elements which combine to form the headline. We considered a few models that can learn from the structure of the DOM tree:
- Recursive Neural Network (Socher et al. 2011; not to be confused with Recurrent Neural Networks) models are capable of processing a tree structure, but are formulated to try to extract structure from images or sentences. The model is run on adjacent elements in the image or sentence, and the tree structure is inferred from the network output. This is not quite what we want; we have a tree structure and we want to label nodes, while Recursive Nets have adjacency data and try to infer a tree structure.
- Child-Sum Tree-LSTM models extend LSTM models to predict parent node values using input from variable numbers of child nodes, allowing the model to process any tree structure. As such they are capable of encoding relationships over whole branches of the tree, as well as summarizing the tree data in the output of the root node. This is the model we eventually decided to try.
- Seq2seq Models with attention work by transforming a sequence of features into a sequence of outputs, but instead of processing each node in the sequence in a set order, they use an attention mechanism to examine all the input sequence nodes at once. This is an extremely powerful approach, but as these models are typically applied to sequences it’s not a perfect fit for our data. Most importantly, however, at the time we were doing this work our team was not familiar with these types of models. This would be an extremely interesting approach to try, even if we were unable to figure out a method to encode the tree structure for the model.
Having decided to use a Tree-LSTM, we found Riddhiman Dasgupta’s treelstm PyTorch implementation and decided to use this code as a starting point to construct a model.
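For reference, the Child-Sum Tree-LSTM update for a node j with children C(j), as formulated by Tai et al. (2015), can be written as below. Here x_j is the node's input feature vector, h and c are hidden and cell states, σ is the logistic sigmoid and ⊙ is element-wise multiplication. The notation follows that paper and is not taken from either code base mentioned above.

```latex
\begin{align}
\tilde{h}_j &= \sum_{k \in C(j)} h_k \\
i_j &= \sigma\left(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\right) \\
f_{jk} &= \sigma\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right) \\
o_j &= \sigma\left(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\right) \\
u_j &= \tanh\left(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\right) \\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k \\
h_j &= o_j \odot \tanh(c_j)
\end{align}
```

Because the child hidden states are summed, a node can have any number of children, and each child gets its own forget gate f_jk. For a leaf node the sums are empty, so the child contribution is zero.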
Our data set contains a large number of raw and potential features about each node. Each element or node of the tree contains:
- The CSS properties of the node (e.g. color and font properties, bounding boxes)
- The DOM properties of the node (e.g. parent-child/sibling relationships, node type)
- Whether or not each node is a headline
These features can be separated into two basic categories, numeric and categorical. We have chosen an encoding scheme where the numeric features are normalized (to some maximal value) and the categorical features are one-hot encoded.
For our initial attempt, we wanted to choose a minimal set of features that were both simple to encode and would (hopefully) be sufficient to predict headlines. In particular, we did not include properties such as the full text, image encodings/urls, relative node positioning, and computer vision features (such as SIFT or HOG).
The feature set we used:
Numeric: font size, font weight, color, background color, bounding box coordinates
Categorical: font family, element type
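A minimal sketch of this encoding scheme is shown below, assuming each node arrives as a plain dictionary. The vocabularies, the maximum values, and the names `encode_node` and `one_hot` are hypothetical placeholders for illustration, and only two of the numeric features are shown.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical vocabularies and maxima; the real values are not given in the post.
FONT_FAMILIES = ["arial", "georgia", "helvetica"]
ELEMENT_TYPES = ["div", "h1", "p", "span"]
NUMERIC_MAXIMA = {"font_size": 100.0, "font_weight": 900.0}


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Return a one-hot vector; unknown values map to the all-zero vector."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def encode_node(node: Dict[str, Any]) -> List[float]:
    """Encode one DOM node as a flat list of floats."""
    features: List[float] = []
    for name, maximum in NUMERIC_MAXIMA.items():
        # Normalise to [0, 1], clipping values above the assumed maximum.
        features.append(min(float(node[name]), maximum) / maximum)
    features.extend(one_hot(node["font_family"], FONT_FAMILIES))
    features.extend(one_hot(node["element_type"], ELEMENT_TYPES))
    return features


example = {
    "font_size": 50,
    "font_weight": 700,
    "font_family": "georgia",
    "element_type": "h1",
}
vector = encode_node(example)
```

With these placeholder vocabularies each node becomes a vector of length 9: two normalised numbers followed by a three-way and a four-way one-hot block.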
Making the labels consistent
When we started this project, we had hoped that human experts would tend to agree on what constitutes a semantic element as simple as a headline. As we’ve seen above, this turns out not to be the case. We saw two primary types of disagreement in headline labelling:
- People disagreeing on what is a headline
- People agreeing on what the headline is but disagreeing on how to label it
Let’s talk about each of these. Cases where people disagree on which page element to label as the headline represent true ambiguity in our data set. That is, we have pages with either multiple potential headlines or no true one. These pages are closer to “noise”: if we included them, we’d be asking the machine to learn the same example data but make different predictions. This would likely slow down or prevent convergence in training, and it would be hard to interpret the results. We initially solved this problem by simply removing these pages from the data set. These pages made up less than 10% of the overall set. Note that we removed them from both the training and test sets.
The second case is different. In this instance, people fundamentally agree on what constitutes a headline for a landing page, but they disagree on fine details. For example, some people select the text box as the headline, while others select the parent node that also contains the style and bounding information. In these cases, the labelers are pointing to the same thing when they are asked what a headline is; we’re just losing a bit in translation to the raw DOM elements. We want to include these pages in our data set for machine learning, but we need to resolve the label ambiguity.
To accomplish this we used a deterministic algorithm. It takes the human labels and tries to map different labeling schemes to the same thing. We relabel the data set using two rules:
- If an element is labelled headline, then we relabel all of its children to also be labelled headline
- If all of the children of an element are labelled headline, then we relabel their parent element to also be labelled headline
This relabelling scheme is able to resolve all ambiguity for the examples in our data set.
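The two relabelling rules can be sketched as a single recursive pass over the tree. This is an illustration only: the tree representation (a dictionary from node id to child ids), the node names, and the function `relabel` are assumptions, not the production code.

```python
from typing import Dict, List, Set

Tree = Dict[str, List[str]]  # node id -> ids of its children


def relabel(tree: Tree, root: str, labelled: Set[str]) -> Set[str]:
    """Return the headline set after applying both relabelling rules."""
    result = set(labelled)

    def visit(node: str, inherited: bool) -> bool:
        # Rule 1: a node below a headline element becomes a headline.
        is_headline = inherited or node in result
        children = tree.get(node, [])
        child_flags = [visit(child, is_headline) for child in children]
        # Rule 2: a node whose children are all headlines becomes one too.
        if children and all(child_flags):
            is_headline = True
        if is_headline:
            result.add(node)
        return is_headline

    visit(root, False)
    return result


# Two labelers mark the same headline at different depths of the tree.
tree = {"section": ["wrapper", "image"], "wrapper": ["text_a", "text_b"]}
from_parent = relabel(tree, "section", {"wrapper"})
from_children = relabel(tree, "section", {"text_a", "text_b"})
```

Both labelings end up as the same set of nodes (`wrapper`, `text_a`, `text_b`), while `section` stays unlabelled because its `image` child is not a headline.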
Having completed these preparation steps we were finally ready to train a model to classify the headline nodes. We use Amazon SageMaker to train models, allowing us to bundle code into Docker containers and push training to cloud systems very easily. We configured the model with an Adam optimizer set to default parameters and a binary cross-entropy loss function, and set the Tree-LSTM hidden layer to 40 units. We wrote some interface code to load the landing page data and began training the model.
After a few training runs we had some early indications that the model was learning how to identify headlines: our loss was steadily decreasing, our precision and recall were surprisingly high for not having tuned hyperparameters yet, and our predictions were highly peaked near the true headline locations. However, we were running into a common machine learning problem: things were taking pretty dang long. Each epoch took about 10 minutes to train and a full training run took about three days, which didn’t support fast iteration cycles. Thus, we started looking for optimizations we could perform to speed up the computations.
The simplest way we found to optimize was by reducing the data set size. Headlines are almost always near the top of a landing page, so we removed all the nodes from every landing page that were more than 900 pixels from the top of the page. Landing pages typically contain several vertical pages of content, so this change alone sped up the training by a factor of 2; now epochs were taking 5 minutes.
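A sketch of this kind of pruning is shown below. It assumes each node is a dictionary with a `top` coordinate in pixels and a `children` list, that the cutoff is measured at the top edge of the bounding box, and that removing a node also removes its subtree. None of these details are stated above, so treat them as illustrative choices.

```python
from typing import Any, Dict, Optional

MAX_TOP_PX = 900  # cutoff quoted in the text

Node = Dict[str, Any]


def prune_below_cutoff(node: Node, max_top: float = MAX_TOP_PX) -> Optional[Node]:
    """Return a copy of the subtree without nodes that start below max_top."""
    if node["top"] > max_top:
        return None  # assumption: dropping a node also drops its subtree
    kept = [prune_below_cutoff(child, max_top) for child in node.get("children", [])]
    return {**node, "children": [child for child in kept if child is not None]}


def count_nodes(node: Node) -> int:
    """Count the nodes in a subtree, including the node itself."""
    return 1 + sum(count_nodes(child) for child in node.get("children", []))


page = {
    "tag": "body",
    "top": 0,
    "children": [
        {"tag": "div", "top": 40, "children": [{"tag": "h1", "top": 80}]},
        {"tag": "div", "top": 2400, "children": [{"tag": "p", "top": 2450}]},
    ],
}
pruned = prune_below_cutoff(page)
```

In this toy page the second `div` and its paragraph start well below 900 pixels, so the pruned tree keeps three of the original five nodes.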
We were uncertain if optimizing the Tree-LSTM code would yield sufficient speed improvements to justify the time investment and increase in code complexity. To estimate how much speed increase code optimization would provide us, we decided to implement a linear LSTM model from the PyTorch library and apply it to the Tree-LSTM data. The PyTorch LSTM is implemented in C++ and so should represent an upper bound on the achievable speed increase. We ran the LSTM model on sequences composed of nodes on paths between each leaf and the root node of the DOM tree. This resulted in an epoch taking about 40 seconds, but unfortunately the model wasn’t able to classify headlines well, which is an indication that the tree structure is indeed important to our problem.
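One way to build such sequences is to walk from every leaf up to the root, as in the sketch below. The parent-map representation, the leaf-to-root ordering, and the name `leaf_to_root_paths` are assumptions for illustration; the text above does not say which direction the sequences ran in.

```python
from typing import Dict, List, Optional


def leaf_to_root_paths(parent_of: Dict[str, Optional[str]]) -> List[List[str]]:
    """Return one node sequence per leaf, ordered from the leaf up to the root."""
    has_children = {parent for parent in parent_of.values() if parent is not None}
    paths: List[List[str]] = []
    for node in parent_of:
        if node in has_children:
            continue  # only leaves start a sequence
        path: List[str] = []
        current: Optional[str] = node
        while current is not None:
            path.append(current)
            current = parent_of[current]
        paths.append(path)
    return paths


# Toy DOM: the root has no parent.
parent_of = {"body": None, "hero": "body", "h1": "hero", "img": "hero", "footer": "body"}
paths = leaf_to_root_paths(parent_of)
```

Each path can then be fed to a sequence model as a list of node feature vectors. Note that sibling relationships are lost: `h1` and `img` appear in separate sequences and never see each other, which is consistent with the weaker results described above.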
We decided that the speed increase we observed in the vanilla LSTM was large enough to justify optimising our Tree-LSTM code. Our initial code did a simple recursive walk of the DOM trees and evaluated each node in serial. We rewrote the model evaluation to proceed in steps: at each step, all nodes whose child dependencies were already satisfied were evaluated in a single parallel PyTorch operation. With this optimized code training epochs now took around 10 seconds. We’ve released this optimized PyTorch model under an open-source MIT license here. (After implementing these optimizations we discovered a TensorFlow implementation of the same strategy here; had we found this earlier we would probably have just transitioned to using TensorFlow.)
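The scheduling idea can be illustrated without any tensor code: assign each node the earliest step at which all of its children are done, then group nodes by step. The sketch below is a simplified stand-in written for this explanation and is not the code from the repository. The function name and the dictionary-based tree are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def evaluation_steps(children: Dict[int, List[int]], root: int) -> List[List[int]]:
    """Group node ids into steps; every node's children sit in earlier steps."""
    step_of: Dict[int, int] = {}

    def assign(node: int) -> int:
        kids = children.get(node, [])
        # Leaves are ready at step 0; a parent is ready one step after
        # the last of its children.
        step_of[node] = (1 + max(assign(kid) for kid in kids)) if kids else 0
        return step_of[node]

    assign(root)
    grouped: Dict[int, List[int]] = defaultdict(list)
    for node, step in step_of.items():
        grouped[step].append(node)
    return [sorted(grouped[step]) for step in range(len(grouped))]


# Node 0 is the root; nodes 2, 3 and 5 are leaves.
children = {0: [1, 2], 1: [3, 4], 4: [5]}
steps = evaluation_steps(children, root=0)
```

For this tree the schedule is `[[2, 3, 5], [4], [1], [0]]`: all three leaves are evaluated together, then node 4, then node 1, then the root. The number of sequential operations is therefore the height of the tree plus one rather than the number of nodes, which is where the speed-up comes from when trees are wide and shallow.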
One optimization that didn’t work was running training on GPU. We actually found running on GPU was slower than running on CPU, though we are not sure why. Our guess is that the model is simply not large enough to benefit from the GPU parallelism. Our Tree-LSTM only has two layers of neurons, we only have 125 features per node, and our whole dataset is collectively small enough to fit into GPU RAM. We expect that as we increase the size of our model and feature set, GPU acceleration will become more important.
With these optimizations in place we were able to complete training runs in as little as one hour, and we began hyperparameter tuning, which included trying different optimizer algorithms, changing the hidden layer size, adding extra layers, and altering the learning rate. After a day or two of tuning we were able to hit precision and recall values of 90%.
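As a reminder of what these two metrics measure, the sketch below computes them for per-node headline predictions. It assumes predictions have already been thresholded to true or false. The threshold used and whether the reported figures were computed per node or per page are not stated above.

```python
from typing import Sequence, Tuple


def precision_recall(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for boolean headline predictions."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if a and not p)
    # Precision: what fraction of predicted headlines are real headlines.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: what fraction of real headlines were found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


predicted = [True, True, True, True, False, False, False]
actual = [True, True, True, False, True, True, False]
precision, recall = precision_recall(predicted, actual)
```

In this toy example three of the four predicted headlines are correct (precision 0.75) and three of the five true headlines are found (recall 0.6). Because most DOM nodes are not headlines, these two numbers are more informative than plain accuracy.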
Below we have taken a particular page and plotted the headline classification probabilities as training progresses. By the end of training the headlines have a very high probability associated with them and the non-headlines a very low probability, so the model is doing a good job of learning this step function. Note that even though we end up in a good state as far as learning is concerned, the probabilities are not approaching the step function in anything like a monotonic fashion. This may indicate a very complicated loss surface with many local minima.
Conclusions and Learnings
Thanks for coming along on our journey as we start trying to understand how machine learning can be used to label and extract semantic web content. While we’ve put a lot of effort into this, we realize it’s only a small, initial step. We’ve shown that machine learning can detect headlines in a varied set of landing pages, and this gives us hope that we can eventually extract most or all of the semantic structure of a page. Along the way we’ve learned some things as well:
- The concept of a headline is itself tricky, even for human experts. This factors into the absolute upper bound on our accuracy, and it calls for mitigation strategies when building out a data set. This will likely be even trickier for certain other elements.
- The text itself and the absolute positioning aren’t necessary for properly predicting headlines.
- Doing machine learning on tree structured data is computationally intensive, but has significant potential.
- Tree-LSTMs are capable of learning highly discontinuous functions, over many elements (we have pages of ~400 elements).
- Proper batching and data structures can increase performance of Tree-LSTMs by ~30x.
- CPUs are competitive with GPUs for these models on our data set. Further investigation is required to understand why this is the case, and if there are further potential optimizations.