Multi-Armed Bandit — Solution Methods


Optimistic Initial Value

Can we promote exploration without using Epsilon-Greedy?

It turns out that we can do this using the concept of Optimistic Initial Value.

Without Optimistic Initial Value

Let us understand this with an example of a 10-armed bandit. Initially we do not know the true reward associated with each action, but we will build estimates as we progress.

At t = 0, estimated_rewards = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Let us assume that we select to pull the first lever and get a reward of 0.2. The average reward for this arm then becomes (0 + 0.2)/2 = 0.1 (the initial estimate of 0 is counted as one sample, so we divide by 2).

At t = 1, estimated_rewards = [0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Because we are following greedy action selection (not epsilon-greedy), as the game progresses we will just keep selecting the first lever. Suppose the next pull of the first lever gives a reward of 0.18 (remember that the rewards observed at runtime are sampled around the arm's true reward). The average reward for this arm then becomes (0 + 0.2 + 0.18)/3 ≈ 0.127.

At t = 2, estimated_rewards = [0.127, 0, 0, 0, 0, 0, 0, 0, 0, 0]

So we will keep selecting the first arm. Since we never explore, there is a fair chance that we are missing the optimal arm and therefore the optimal return. Can we do better?
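To make this concrete, here is a minimal Python sketch of greedy selection with estimates initialized to 0. The 10-armed bandit, its Gaussian rewards, and all constants are assumptions for illustration only; the update mirrors the running average used above, where the initial estimate counts as one sample.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = rng.normal(0.0, 1.0, size=10)   # hypothetical true mean rewards

Q = np.zeros(10)   # estimated rewards, all initialized to 0
N = np.ones(10)    # per-arm sample counts; the initial estimate counts as one sample

for t in range(1000):
    a = int(np.argmax(Q))                 # pure greedy: always exploit the current best estimate
    r = rng.normal(true_rewards[a], 1.0)  # noisy reward sampled around the arm's true reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]             # running average, as in the worked example
```

With no exploration, the agent tends to settle on whichever arm happens to look good early and may never discover the truly optimal one.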

With Optimistic Initial Value

Now suppose we initialize the estimated rewards with a value of +5 each. So initially we have :

At t = 0, estimated_rewards = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

Let us again assume that we select to pull the first arm and get a reward of 0.2. So average reward for this arm will be (5 + 0.2)/2 = 2.6.

At t = 1, estimated_rewards = [2.6, 5, 5, 5, 5, 5, 5, 5, 5, 5]

Did you notice that? The estimated average reward of the selected arm is now lower than its initial estimate. Our agent will be a bit disappointed, and this will push it to explore the other arms. The greedy policy now has 9 choices, because the maximum value of 5 is shared by the 9 arms not yet tried. Assuming ties are broken randomly, one of those 9 arms will be chosen. Suppose the 3rd arm is selected this time and, for simplicity, it also gives a reward of 0.2. Its average reward then becomes (5 + 0.2)/2 = 2.6.

At t = 2, estimated_rewards = [2.6, 5, 2.6, 5, 5, 5, 5, 5, 5, 5]

Notice that while calculating the average we do not divide by the time step. The denominator is the number of samples averaged for that particular arm: one for the initial estimate plus one for each time that arm has been selected so far.

It is easy to see that if we continue this way, every arm will be explored at least once, because each pulled arm's estimate drops below the optimistic value of 5 still attached to the untried arms. This is how an Optimistic Initial Value can make a big difference. We can also combine the Optimistic Initial Value technique with epsilon-greedy, as sketched below.
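Below is a minimal sketch of the same idea with optimistic initial values, plus an optional epsilon for combining it with epsilon-greedy. The `run_bandit` helper, the 10-armed Gaussian bandit, and all constants are assumptions for illustration, not the exact code from the notebook.

```python
import numpy as np

def run_bandit(initial_value=5.0, epsilon=0.0, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    true_rewards = rng.normal(0.0, 1.0, size=10)   # hypothetical true mean rewards

    Q = np.full(10, initial_value)   # optimistic starting estimates
    N = np.ones(10)                  # the initial estimate counts as one sample

    rewards = []
    for t in range(steps):
        if rng.random() < epsilon:                     # optional epsilon-greedy exploration
            a = rng.integers(10)
        else:                                          # greedy, with ties broken randomly
            a = rng.choice(np.flatnonzero(Q == Q.max()))
        r = rng.normal(true_rewards[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                      # running average per arm
        rewards.append(r)
    return np.array(rewards)

# Pure greedy with optimistic estimates still tries every arm at least once,
# because each pull drags that arm's estimate below the untouched value of 5.
optimistic_greedy = run_bandit(initial_value=5.0, epsilon=0.0)
```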

Let’s look at some comparison charts:

In the adjacent graph, the blue trajectory corresponds to greedy action-selection with estimated rewards initialized to 5. The red trajectory corresponds to epsilon-greedy with epsilon = 0.1 and reward estimates initialized to 0.

Initially, the optimistic method performs worse because it explores more, but eventually it performs better because its exploration decreases with time¹.

In the adjacent graph, we have epsilon-greedy action selection with reward estimates initialized to different values of 0.0, 5.0, -5.0, and 10.0, and the same epsilon = 0.01 in every case.

You can take the blue trajectory as the baseline, since it corresponds to the familiar epsilon-greedy action selection with epsilon = 0.01 and reward estimates initialized to 0.
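As a rough idea of how such curves could be produced (a sketch, not the author's notebook), one could average the reward over many randomly generated bandit problems using the `run_bandit` helper sketched earlier, once per initial value:

```python
import numpy as np
import matplotlib.pyplot as plt

settings = [          # (label, initial value, epsilon) for the second comparison
    ("Q0 = 0.0",   0.0, 0.01),
    ("Q0 = 5.0",   5.0, 0.01),
    ("Q0 = -5.0", -5.0, 0.01),
    ("Q0 = 10.0", 10.0, 0.01),
]

steps, runs = 1000, 200
plt.figure()
for label, q0, eps in settings:
    total = np.zeros(steps)
    for run in range(runs):                    # average over many bandit problems
        total += run_bandit(initial_value=q0, epsilon=eps, steps=steps, seed=run)
    plt.plot(total / runs, label=label)
plt.xlabel("Step")
plt.ylabel("Average reward")
plt.legend()
plt.show()
```

The first chart can be reproduced the same way with just two settings: greedy with estimates initialized to 5, and epsilon-greedy with epsilon = 0.1 and estimates initialized to 0.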

Run this notebook to explore the Optimistic Initial Value Method.