Source: Deep Learning on Medium

# Introduction

Model reinforcement learning algorithms have achieved astonishing results in many real-world games, such as Alpha Go and OpenAI Five. In this article, we discuss a closer-to-life application of reinforcement learning, known as web navigation problems, in which an agent learns to navigate the web following some instructions. Specifically, we discuss an algorithm, named QWeb proposed by Gur et al. in ICLR 2019, that leverage DQN to solve web navigation problems.

Notice that albeit there is no mathematical reasoning going on here, I still put many mathematical expressions to illuminate the underlying process. So for better readability, you may want to refer to my personal blog.

# Preliminaries

## Simple Introduction to Web Navigation Problems

We first introduce two terminologies in web navigation problems:

**DOM**is a tree representation of the web page, whose elements are represented as a list of named attributes.**Instruction**is a list of fields represented by key-value pairs.

The agent’s objective is to navigate the web page(i.e., modify the DOM tree) such that it meets an instruction. The following figure demonstrates several MiniWoB web tasks:

## Problem Setup

We first formulate the MDP for our problem, *M=<S, 𝜌, A, P, R, G, 𝛾>*, where 𝜌 denotes the initial state distribution, the transition function *P* is deterministic, a goal *G* is specified by the instruction, and 𝛾 is the discount factor. The rest notations are defined as follows:

**State space,***S,*consists an instruction and a DOM tree.**Action space,***A,*contains two composite actions*Click(e)*and*Type(e, y)*, where*e*is a leaf DOM element and*y*is a value of a field in an instruction. We further introduce three atomic actions to express these composite actions:*a^D*picks a DOM element*e*,*a^C*specifies either a click or type action, and*a^T*generates a type sequence. Now we can represent*Click(e)*by*a^D=e*,*a^C=’Click’*and*Type(e, y)*by*a^D=e, a^C=’Type’, a^T=y*.**Reward***R*is a function of the final state of an episode and the final goal state. It’s*+1*if these states are the same and*-1* if they are not. No intermediate reward is given.

# QWeb

QWeb solves the above problem using deep Q network(DQN) to generate Q values for each state and for each atomic action. The training process is almost the same as traditional DQN with the help of reward augmentation and some curriculum learning approaches, which we will discuss later. But for now let’s first focus on the architecture of QWeb, which is essentially the most fruitful part of this algorithm.

## Architecture

We discuss each part in the above graph in details

**Encoding user instructions**: As we’ve seen in the preliminaries, a user instruction consists of a list of fields, i.e.,key-value pairs *<K, V>*. To produce a representation, we first encode words in keys, giving us *f(i, j)* for the *j*-th word in the key of *i*-th pair. Then we represent a key as the average of these embeddings, i.e. *f_K(i)={1\over N}\sum_{j}f_{K}(i, j)*, where *N* is the number of words in the key. We also follow the same process to encode values, having embeddings *f_V(i, j)* and *f_V(i)*. An instruction field is then produced by further encoding the concatenation of the key and value embedding through a fully-connected layer, i.e., *f(i)=FC([f_K(i),f_V(i)])*, where *[.]* denotes vector concatenation. The whole instruction can now be represented as *f=Stack(f(i))*, a matrix whose rows are instruction fields.

**Encoding DOM-Instruction Intersection**: We first encode DOM element *j* by averaging the word embeddings over each sequence and each attribute, which gives us *D(j)*. Then for each instruction filed *f(i)* and each element embedding *D(j)*, we encode them through an encoder *E(f(i), D(j))*. The conditional embedding of *D(j)* can then be expressed as the weighted average of these embedding, i.e. *E_{cond}(j)=\sum_{i}p_{i}E(f(i), D(j))*, where probabilities *p_i=softmax(u*f(i))* with* u* being a trainable vector. One could take *E_C* as a self-attention module, where* Q=u, K=f, *and* V=E(f, D)*. (please refer to this post, if you want to know more about self-attention)

**Encoding DOM Trees**: We concatenate the conditional DOM element embeddings *E_{cond}(j)* with DOM element embeddings* D(j)* to generate a single vector for each DOM element, *E_{conc}(j)=[E_{cond}(j), D(j)]*. Then we run a bidirectional LSTM(biLSTM) network on top of the list of DOM elements to encode the DOM tree. Each output vector of the biLSTM is then transformed through a fully-connected layer with tanh to generate DOM element representations, i.e., *E_{tree}(j)=tanh(FC(biLSTM(E_{conc}(j))))*.

**Generating Q values**: We compute the pairwise similarities between each field and each DOM element to generate a context matrix

*M=fE_{tree}^T*, where rows and columns of

*M* show the posterior values for each field and each DOM element in the current state, respectively — notice that

*M[i][j]*is the dot product of

*f(i)*and

*E_{tree}(j)*). We now use rows of

*M* as the

*Q* values for each instruction field, i.e,

*Q(s, a^T=i)=M[i]*. We compute the

*Q* values for each DOM element by transforming

*M* through a fully-connected layer and summing over the rows, i.e.

*Q(s, a^D)=sum(FC(M^T), 1)*, where

*M^T* is the transpose of

*M*. Finally,

*Q* values for click and type actions on a DOM element are generated by transforming the rows of

*M* into

*2* dimensional vectors using another fully-connected layer, i.e.,

*Q(s,a^C)=FC(M^T)*.

**Incorporating Shallow Encodings**: In scenarios where the reward is sparse and input vocabulary is large, such as in ﬂight-booking environments with hundreds of airports, it is difﬁcult to learn a good semantic similarity using only word embeddings. Gur et al. propose augmenting the deep *Q* network with shallow instruction and DOM tree encodings to alleviate this problem. For shallow encodings, we first define an embedding matrix of instruction field and DOM elements as word-based similarities, e.g., Jaccard similarity, binary indicators such as subset or superset. Let’s take a concrete example to see what this embedding matrix looks like. Assuming the corpus of the instruction field consists of *[“loc”, “date”, “name”]*, and that of DOM elements is comprised of *[“name”, “loc”]*, the corresponding shallow encoding matrix looks like

A shallow input vector for DOM elements or instruction fields is generated by summing over columns or rows of the shallow encoding matrix, respectively. Take the above example; if we have an instruction field *[“loc” “name”]*, then the resulting input vector is *[0, 1] +[1, 0]=[1,1]*. These shallow input vectors are then transformed using a fully-connected layer with tanh and scaled via a trainable variable to generate the corresponding shallow *Q* values. The final values are a combination of deep and shallow *Q* values

## Reward Augmentation

So much for the architecture, now let’s shift our attention to the reward function. The environment reward function defined in the preliminaries is extremely sparse — the agent only gets a reward at the end of the episode, and worse still, the success states are a small fraction of the total state space, which makes the success reward even sparser. The authors, therefore, introduce a potential reward for remedy. Specifically, they define a potential function *Potential(s, g)* that counts the number of matching DOM elements between the current state *s* and the goal state *g*. The potential reward function is then computed as the scaled difference between two potentials for the next state and the current state:

# Curriculum Learning

The authors also propose two curriculum learning strategies to speed up the learning process:

**Warm-Start**: To speed up the learning process, we warm-start an episode by choosing initial states somewhere near the goal state. These initial states can be obtained by randomly selecting a subset of DOM elements and having an oracle agent performs the correct action. The distance between the warm-started initial states and the goal state increases as the training progresses — which can easily be achieved by shrinking the selected subset.**Goal Simulation**: We also relabel states near the initial states as subgoals as done by HER, but here these states are generated by an oracle agent. As before, the distance between the subgoals and initial states increases as the training proceeds.

## Algorithm

The algorithm now should be clear: