Reinforcement Learning — Part 03

Original article was published on Artificial Intelligence on Medium

Reinforcement Learning — Part 03

Optimistic Initial Values, UCB, and Gradient Bandit

Reinforcement Learning — Part 01
Reinforcement Learning — Part 02

In my last article, Part 02, we went through the simple example of multi-armed bandit machine. We followed this by going through a coding contraption I called Insanity. We’re able to use it to reproduce some results from chapter 2 of S&B (Sutton and Barto book). In this article we’ll continue on chapter two and put Insanity to the test, we’ll see if it is easy to start from existing experiment and reproduce another one that is somewhat different. As we go through this exercise, we’ll touch on few more ideas mentioned in this chapter.

As a quick reminder, main components we needed to implement were:

  • ActionValueInitializer: this is the components responsible for settings initial action values.
  • ActionValuesProvider: this is the component responsible for continuous evaluation of action values.
  • ActionSelector: action selectors use the values produced by action value provider to figure out what action the agent should take next.

In addition to these, we also have Agent, Environment, and Experiment. These don’t need to change for experiments included in this article, we only need to provide an initializer-provider-selector mix. Let’s now discuss three more variations for bandit machine learning.

Optimistic Initial Values

In our multi-armed bandit code, we needed to give actions some initial value before we proceed with our estimation process. Natural choice was to give all actions a value of zero (we used a ZeroInitializer for this — defined in

The effect of choosing zero for all action values was that the first time the agent needs an action, it will choose a random one. Second time, agent can still choose the same action if it made a greedy choice. Wouldn’t it be better if we cycle through all actions and try each one first before we go greedy? This is the idea behind optimistic initial value. It promotes more exploration in the beginning until we have some estimates for action values then we can benefit from our greedy choices.

Effect of optimal initial values

The way this work is as follows. We start by assuming all actions are optimum. We do this by assigning all actions an initial value larger than the mean reward we expect to receive after pulling any arm. As a result of this, the first estimate of any action will always be less than this optimum initial value. And, even if we’re greedy, we’ll not visit this action again and rather select another unvisited action. This approach encourages the agent to visit all actions multiple times, resulting in early improvement in estimated action values.


Implementing optimistic initial values is almost trivial. We simply need to either replace the initializer used earlier with one that is optimistic or modify the existing one. I choose to create a new OptimisticInitializer as, even though they are similar, each one represents a separate concept.

Here is how the two initialization methods compare:

Looking at the plot we notice the following:

  • Optimistic initialization performance is not good at the beginning as it explores a lot early.
  • As time goes on, it will perform better as it learned action values earlier and is selecting optimal value more often.
  • S&B poses the question: if we are averaging over 2000 runs, which should remove random fluctuations, what is it about the spike we see in early steps? the answer is that with optimistic initialization, first few steps are not really random. We rather cycle through all actions many times randomly picking our large optimistic value. First round, all will have equal % optimality (1/k, where k is number of actions — note the 10% in the figure above). The next action will have a spike as on average optimal action will have the largest value. This will repeat, with decaying values until the effect of optimal initial values fades away.

Lesson to learn here, is that algorithms are sensitive to the choice of initial action values. We say that the method is biased by its initial estimate. Note also that this bias towards initial values may not be helpful if action values are not stationary and change over time, we need to continue exploring to tackle this.

Upper-Confidence-Bound (UCB)

So far, we’ve been selecting actions randomly. Problem with this randomness is that it doesn’t discriminate between actions. An action that we have a good estimate for its value can be selected while we explore. There is no new knowledge we gain by selecting this action. A better approach will be to select actions that we know little about. Optimistic initial values approach achieved this to some extent in the early learning steps, but we can do better.

UCB follows what is called optimism in the face of uncertainty. This means that if we don’t have enough confidence on a value of an action, assume it is optimum and select it. This way we can continuously improve our estimates of the value of these actions.

How does UCB works

The idea of UCB is simple — even though its selection rule may appear to be intimidating. What we’ve been doing so far is to estimate the value Q(a) of actions and we either select actions with max Q (greedy action) or we randomly explore. UCB doesn’t use Q(a) to figure out what action to select. Instead, it adds some quantity U(a) the depends on how good is our estimate of action a is, or in other words, how confident we are in the value of action a.

We want this quantity U(a) to have the following attributes:

  • It must have a small value if the action was selected frequently enough and have a good estimate. Mathematically, U(a) should be inversely proportional to number of times action was selected. The more we select an action, the smaller U becomes, and our Q will be good enough estimate for its value.
  • It must have large value if we choose actions other than a. This can be represented mathematically by making U directly proportional to age of the agent — or the number of time steps in the episode. The more we choose actions other than a, the larger the number of time steps, hence the larger our uncertainty in the value of a.

This is a qualitative description of U(a), and in fact, there are many ways to mathematically represent U. S&B uses the following form:

As we can see:

  • We no longer select actions that maximize Q(a), we rather select actions to maximize Q(a) + U(a).
  • U(a) will increase the values of actions we want to explore; it is essentially an exploration term.
  • As mentioned above, U(a) is inversely proportional to N(a) — the number of time action a was selected. The more we select an action, the less contribution from U term for this action.
  • Actions never selected will result in value of U(a) equal to infinity. In other words, these are considered optimal and should be selected.
  • U(a) directly proportional with ln(t). The more we choose other actions over a, the less confident we are in a. U will increase as ln(t) will increase while Nt(a) will not. The Choice of ln(t), as opposed to t, makes sure that rate of increase gets smaller over time. We already expect actions with high confidence to be selected less and less as we progress in the learning process.
  • The constant c determines how much we explore as it is the weight of the exploration term U(a).
  • We will always perform greedy selection for actions based on equation above. This is because exploration is already incorporated into U(a).

Implementing UCB

As you might’ve already guessed, we need a new action values provider that knows how to calculate U in addition to Q. We also need a new selector that is always greedy. Code for these is relatively straight forward. You are highly encouraged to check it out on GitHub. Here is how the performance curve for UCB looks like:

Note: S&B doesn’t average rewards received from the environment. In other words, if you studied the code, you’ll see that Experiment class passes the reward it receives from the environment to a RunningAverage utility, the value from this utility is returned to our analysis component. Hence the smooth curves you see in these articles. To get the exact result in S&B (where the average reward curve is noisy and have this characteristic peak), I turned off averaging. You can see a lot confusion online due to this averaging trick. Hopefully this clears things out.

Gradient Bandit

Gradient bandit follows a completely different approach. Rather than estimating action values, agent will learn some arbitrary numerical values H(a) for each action a. These values don’t represent value of actions, but rather a relative preference for choosing an action versus another. We then use soft-max to obtain a probability mass function out of the values of H(a). The math of this approach (which is derived in details in S&B) can be summarized as:

Things you need to know about these equations:

  • Rt is the reward received at time t.
  • R bar is the estimated reward Q. It acts as a baseline with which the received reward R is compared. If R is larger than the baseline, π will increase for this action otherwise it will decrease. Other actions go in the other directions to keep sum of probabilities equal to 1 as expected.
  • The choice R bar is completely arbitrary and we could’ve chosen any value. See proof in S&B for details on this.
  • We run our experiment twice, once with baseline, and another without baseline (R bar is fixed at zero value).
  • Values of H(a) are initialized to zeros for all actions a.
  • First two equations represent the update rules for calculating H(a) at each time step.
  • π(a) represent the probability of selecting action a. We start with all actions have the same probability (1/k, where k is the number of actions).
  • Last equation is our soft-max and this is how we calculate π(a) using the values of H(a).
  • Another thing to note is that in this experiment we use 4 rather than 1 for the actual action value (Q*(a)).

An important point to notice here is how we select actions. We are not selecting randomly. We are not using argmax as we used to do. We are going to sample our actions from the probability mass function π(a).

Implementing Gradient Bandit

No surprise here as well. We’ll use our trustworthy ZeroInitializer, we’ll use a new GrandientActionValuesProvider, and finally we’ll use a new GradientSelector. Again, code is really simple, go ahead and take a look and familiarize yourself with it. Here is how the performance look like as generated by the code:

As you can see, we got the same results on in S&B with very small code change.

This wraps up chapter 2 of Sutton and Barto. Next our adventure will take us into the realm of Markov Decision Process, another exciting subject. Stay tuned….