Cross-Entropy Method Performance Analysis

Original article was published on Artificial Intelligence on Medium

Implementation of the Cross-Entropy Training Loop

In this post, we describe in detail the training loop of the Cross-Entropy method, which we skipped in the previous post, and see how the Agent’s learning improves when we use more complex neural networks. We also present an improved variant of the method that keeps “elite” episodes for several iterations of the training process. Finally, we show the limitations of the Cross-Entropy method to motivate other approaches.

Overview of the Training Loop

Next, we go through the code that makes up the training loop introduced in the previous post.

The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.

Main variables

The code begins by defining the main parameters of the method.

GAMMA = 0.9

Helper classes

We will require a series of helper classes:

from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

Here we will define two helper classes that are named tuples from the collections package in the standard library:

  • EpisodeStep: This represents one single step that our Agent made in the episode. It stores the state observed from the Environment and the action the Agent took. Remember that we will use episode steps from “elite” episodes as training data.
  • Episode: This is a single episode stored with a total discounted Reward and a collection of EpisodeStep.
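As a minimal sketch of how these two named tuples fit together, the snippet below builds one episode by hand. The observations, actions, and reward are made-up values chosen only for illustration.

```python
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

# Hypothetical two-step episode: the integers below are illustrative only.
steps = [
    EpisodeStep(observation=0, action=2),
    EpisodeStep(observation=1, action=1),
]
episode = Episode(reward=1.0, steps=steps)

print(episode.reward)           # total reward stored for the episode
print(len(episode.steps))       # number of steps in the episode
print(episode.steps[0].action)  # fields are accessed by name
```

Because named tuples are immutable, an episode cannot be modified by accident once it has been appended to the batch.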

Initialization of variables

At this point, a set of variables that we will use in the training loop are initialized. We will present each of them as they are required in the loop:

iter_no = 0
reward_mean = 0
full_batch = []
batch = []
episode_steps = []
episode_reward = 0.0
state = env.reset()

The training loop

We learned in the previous post that the training loop of our Agent that implements the Cross-Entropy algorithm repeats 4 main steps until we become satisfied with the result:

1 — Play N number of episodes

2 — Calculate the Return for every episode and decide on a return boundary

3 — Throw away all episodes with a return below the boundary.

4 — Train the neural network using episode steps from the “elite” episodes

We have decided that the Agent must be trained until a certain Reward threshold is reached. Specifically, we set a threshold of 80% (a mean reward of 0.8), stored in the variable REWARD_GOAL:

while reward_mean < REWARD_GOAL:

STEP 1 — Play N number of episodes

The next piece of code is the one that generates the batches with episodes:

action = select_action(state)
next_state, reward, episode_is_done, _ = env.step(action)
episode_steps.append(EpisodeStep(observation=state, action=action))
episode_reward += reward

if episode_is_done:  # Episode finished
    batch.append(Episode(reward=episode_reward, steps=episode_steps))
    next_state = env.reset()
    episode_steps = []
    episode_reward = 0.0

    <STEP 2> <STEP 3> <STEP 4>

state = next_state

The main variables we will use are:

  • batch accumulates the list of Episode instances (BATCH_SIZE=100).
  • episode_steps accumulates the list of steps in the current episode.
  • episode_reward maintains a reward counter for the current episode (in our case we only get a Reward at the end of the episode, but the algorithm is described for the more general situation where Rewards can arrive at any step).

The list of episode steps is extended with an (observation, action) pair. It is important to note that we save the observed state that was used to choose the action (and not the observation next_state returned by the Environment as a result of the action):

episode_steps.append(EpisodeStep(observation=state, action=action))

The reward is added to the current episode’s total reward:

episode_reward += reward

When the current episode is over (the Agent fell into a hole or reached the goal state), we append the finalized episode to the batch, saving the total reward and the steps taken. Then we reset the Environment to start over, and we reset the variables episode_steps and episode_reward to start tracking the next episode:

batch.append(Episode(reward=episode_reward, steps=episode_steps))
next_state = env.reset()
episode_steps = []
episode_reward = 0.0
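To see the episode-collection logic run end to end without the real Environment, here is a minimal sketch. `ToyChainEnv` and `collect_batch` are hypothetical names introduced only for this illustration, and a random policy stands in for the neural network; the bookkeeping inside the loop follows the steps described above.

```python
import random
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


class ToyChainEnv:
    """Hypothetical stand-in Environment: walk right from cell 0 to cell 3."""

    def reset(self):
        self.position = 0
        self.steps_left = 10
        return self.position

    def step(self, action):
        # Action 1 moves one cell to the right; any other action stays in place.
        if action == 1:
            self.position += 1
        self.steps_left -= 1
        reached_goal = self.position == 3
        done = reached_goal or self.steps_left == 0
        reward = 1.0 if reached_goal else 0.0
        return self.position, reward, done, {}


def collect_batch(env, batch_size, rng):
    batch, episode_steps, episode_reward = [], [], 0.0
    state = env.reset()
    while len(batch) < batch_size:
        action = rng.choice([0, 1])  # random policy instead of the network
        next_state, reward, done, _ = env.step(action)
        # Store the state the action was chosen from, not next_state.
        episode_steps.append(EpisodeStep(observation=state, action=action))
        episode_reward += reward
        if done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            next_state = env.reset()
            episode_steps, episode_reward = [], 0.0
        state = next_state
    return batch


batch = collect_batch(ToyChainEnv(), batch_size=5, rng=random.Random(0))
print(len(batch), [len(e.steps) for e in batch])
```

Every stored episode starts with the observation returned by `reset()`, which confirms that the pre-action state is the one being recorded.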

STEP 2 — Calculate the Return for every episode and decide on a return boundary

The next piece of code implements step 2:

if len(batch) == BATCH_SIZE:
    reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
    elite_candidates = batch
    returnG = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), elite_candidates))
    reward_bound = np.percentile(returnG, PERCENTILE)

The training loop executes this step once BATCH_SIZE episodes have been played:

if len(batch) == BATCH_SIZE:

First, the code calculates the Return for all the episodes:

elite_candidates = batch
returnG = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), elite_candidates))
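The effect of this expression is easiest to see with numbers. The sketch below applies the same formula, reward * GAMMA ** len(steps), to three hypothetical episodes; the helper name `discounted_return` and the step counts are assumptions made for illustration.

```python
GAMMA = 0.9


def discounted_return(reward, n_steps, gamma=GAMMA):
    # Same expression as in the training loop: reward * GAMMA ** len(steps)
    return reward * gamma ** n_steps


short = discounted_return(1.0, 6)    # successful episode that took 6 steps
long_ = discounted_return(1.0, 20)   # successful episode that took 20 steps
failed = discounted_return(0.0, 6)   # failed episode, reward 0

print(round(short, 4), round(long_, 4), failed)
```

A successful episode that reaches the goal in fewer steps gets a higher Return than a longer one, and a failed episode gets 0 regardless of its length. This is what lets the percentile filter prefer short successful episodes.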

In this step, from the given batch of episodes and the percentile value, we calculate a boundary reward, which is used to filter the “elite” episodes on which the Agent’s neural network is trained:

reward_bound = np.percentile(returnG, PERCENTILE)

To obtain the boundary reward, we will use NumPy’s percentile function, which, from the list of values and the desired percentile, calculates the percentile’s value. In this code, we will use the top 30% of episodes (indicated by the variable PERCENTILE) to create the “elite” episodes.
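The value assigned to PERCENTILE is not shown in this post, so the following sketch simply compares two candidate values on a made-up list of Returns. `np.percentile(values, q)` returns the value below which q percent of the data fall, so keeping the top 30% of episodes corresponds to q = 70.

```python
import numpy as np

# Hypothetical discounted Returns for a batch of ten episodes.
returns = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.12, 0.28, 0.35, 0.53]

bound_70 = np.percentile(returns, 70)  # boundary that keeps the top 30%
bound_30 = np.percentile(returns, 30)  # boundary that keeps up to the top 70%

elite_70 = [r for r in returns if r > bound_70]
elite_30 = [r for r in returns if r > bound_30]

print(bound_70, elite_70)
print(bound_30, elite_30)
```

With sparse rewards, the lower boundary collapses to 0, and the strict comparison `>` then keeps exactly the episodes with a non-zero Return.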

During this step we compute the reward_mean that is used to decide when to finish the training loop:

reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))

STEP 3 — Throw away all episodes with a return below the boundary

Next, we filter out the episodes below the boundary with the following code:

train_obs = []
train_act = []
elite_batch = []
for example, discounted_reward in zip(elite_candidates, returnG):
    if discounted_reward > reward_bound:
        train_obs.extend(map(lambda step: step.observation, example.steps))
        train_act.extend(map(lambda step: step.action, example.steps))
        elite_batch.append(example)

For every episode in the batch:

for example, discounted_reward in zip(elite_candidates, returnG):

we check whether the episode has a higher discounted return than our boundary:

if discounted_reward > reward_bound:

and if it has, we will populate the list of observed states and actions that we will train on, and keep track of the elite episodes:

train_obs.extend(map(lambda step: step.observation, example.steps))
train_act.extend(map(lambda step: step.action, example.steps))
elite_batch.append(example)
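The filtering step can be tried in isolation. The sketch below wraps the same logic in a hypothetical helper, `filter_elite`, and applies it to three hand-made episodes; the boundary of 0.5 is an arbitrary value chosen for the example.

```python
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


def filter_elite(episodes, returns, bound):
    """Keep episodes whose Return is strictly above the boundary."""
    train_obs, train_act, elite = [], [], []
    for episode, ret in zip(episodes, returns):
        if ret > bound:
            train_obs.extend(step.observation for step in episode.steps)
            train_act.extend(step.action for step in episode.steps)
            elite.append(episode)
    return train_obs, train_act, elite


# Hypothetical batch: two successful episodes and one failed episode.
episodes = [
    Episode(1.0, [EpisodeStep(0, 1), EpisodeStep(4, 2)]),
    Episode(0.0, [EpisodeStep(0, 0)]),
    Episode(1.0, [EpisodeStep(0, 2), EpisodeStep(1, 1), EpisodeStep(5, 1)]),
]
returns = [1.0 * 0.9 ** 2, 0.0, 1.0 * 0.9 ** 3]

obs, acts, elite = filter_elite(episodes, returns, bound=0.5)
print(obs, acts, len(elite))
```

Note that the observations and actions of all elite episodes are flattened into two parallel lists, so each (observation, action) pair becomes one training example.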

Then we update three variables (full_batch, state, and acts) with the “elite” episodes and with the lists of states and actions on which we will train our neural network.

STEP 4 — Train the neural network using episode steps from the “elite” episodes

Every time our loop accumulates enough episodes (BATCH_SIZE), we compute the “elite” episodes and at the same iteration the loop trains the neural network of the Agent with this code:

state_t = torch.FloatTensor(state)
acts_t = torch.LongTensor(acts)
optimizer.zero_grad()
action_scores_t = net(state_t)
loss_t = objective(action_scores_t, acts_t)
loss_t.backward()
optimizer.step()

iter_no += 1
batch = []

This code trains the neural network using episode steps from the “elite” episodes, with the state s as the input and the action a that was taken as the label (desired output). Let’s go through the code line by line.

First, we transform the variables to tensors:

state_t = torch.FloatTensor(state)
acts_t = torch.LongTensor(acts)

We zero the gradients of our neural network:

optimizer.zero_grad()

and pass the observed state to the neural network, obtaining its action scores:

action_scores_t = net(state_t)

These scores are passed to the objective function, which calculates the cross-entropy between the neural network output and the actions that the Agent took:

loss_t = objective(action_scores_t, acts_t)

Remember that we only consider “elite” actions. The idea of this is to reinforce our neural network to carry out those “elite” actions that have led to good rewards.
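To make the loss concrete, the sketch below computes the cross-entropy for a single (scores, action) pair in plain Python. PyTorch’s nn.CrossEntropyLoss applies a log-softmax to the raw scores and, by default, averages this quantity over the batch; the helper `cross_entropy` and the score values are assumptions introduced for illustration.

```python
import math


def cross_entropy(scores, action):
    """Negative log-probability of `action` under softmax(scores)."""
    max_score = max(scores)  # subtract the maximum for numerical stability
    log_sum_exp = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores)
    )
    return log_sum_exp - scores[action]


# Four actions, as in Frozen-Lake (left, down, right, up).
untrained = cross_entropy([0.0, 0.0, 0.0, 0.0], action=2)  # no preference
confident = cross_entropy([0.0, 0.0, 3.0, 0.0], action=2)  # prefers the label
wrong = cross_entropy([0.0, 0.0, 3.0, 0.0], action=0)      # prefers another action

print(round(untrained, 3), round(confident, 3), round(wrong, 3))
```

With four actions and no preference among them, the loss equals ln 4, about 1.386, which is consistent with the loss values of roughly 1.38 reported for the first iterations in the training log of this post.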

Finally, we need to calculate gradients on the loss using the backward method and adjust the parameters of our neural network using the step method of the optimizer:

loss_t.backward()
optimizer.step()

Monitor the progress of the Agent

In order to monitor the progress of the Agent’s learning performance, we included this print in the training loop:

print("%d: loss=%.3f, reward_mean=%.3f" % (iter_no, loss_t.item(), reward_mean))

With it we show the iteration number, the loss and the mean reward of the batch (in the next section we also write the same values to TensorBoard to get a nice chart):

0: loss=1.384, reward_mean=0.020 
1: loss=1.353, reward_mean=0.040
2: loss=1.332, reward_mean=0.010
3: loss=1.362, reward_mean=0.020
4: loss=1.337, reward_mean=0.020
5: loss=1.378, reward_mean=0.020
. . .
639: loss=0.471, reward_mean=0.730
640: loss=0.511, reward_mean=0.730
641: loss=0.472, reward_mean=0.760
642: loss=0.481, reward_mean=0.650
643: loss=0.472, reward_mean=0.750
644: loss=0.492, reward_mean=0.720
645: loss=0.480, reward_mean=0.660
646: loss=0.479, reward_mean=0.740
647: loss=0.474, reward_mean=0.660
648: loss=0.517, reward_mean=0.830

We can see that the last value of reward_mean (0.830) is the one that exceeded REWARD_GOAL and ended the training loop.

Improving the Agent with a better neural network

In a previous post, we introduced TensorBoard, a tool that helps with data visualization. Instead of the print used in the previous section, we can use these two statements to plot the behavior of these two variables:

writer.add_scalar("loss", loss_t.item(), iter_no)
writer.add_scalar("reward_mean", reward_mean, iter_no)

In this case, the output is:

More complex Neural Network

One question that arises is whether we can improve the Agent’s neural network. For instance, what happens if we use a hidden layer with more neurons, let’s say 128:

HIDDEN_SIZE = 128
net = nn.Sequential(
    nn.Linear(obs_size, HIDDEN_SIZE),
    nn.Sigmoid(),
    nn.Linear(HIDDEN_SIZE, n_actions)
)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.001)

The result is shown here (it can also be reproduced by running the GitHub code):

We can see that this network learns faster than the previous one.

ReLU activation function

What happens if we change the activation function, for example using a ReLU instead of a Sigmoid?

Below you can see what happens: the network converges much earlier, and training completes after only about 200 iterations.

Improving the Cross-Entropy algorithm

So far we have shown how to improve the neural network architecture. But we can also improve the algorithm itself: we can keep “elite” episodes for a longer time. The previous version of the algorithm samples episodes from the Environment, trains on the best ones, and throws them away. However, when the number of successful episodes is small, the “elite” episodes can be kept for several iterations so that we continue to train on them. We need to change only one line in the code:

elite_candidates = full_batch + batch
#elite_candidates = batch
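The following sketch illustrates the effect of this change over three iterations. To keep it short, each episode is represented only by its discounted Return; the helper `select_elite`, the Return values, and the choice PERCENTILE = 30 are assumptions made for the example.

```python
import numpy as np

PERCENTILE = 30  # assumed value, used only for this illustration


def select_elite(full_batch, batch, percentile=PERCENTILE):
    # Candidates are the elite kept from earlier iterations plus the new batch.
    candidates = full_batch + batch
    bound = np.percentile(candidates, percentile)
    return [ret for ret in candidates if ret > bound]


# Hypothetical Returns of three consecutive batches of four episodes each.
batches = (
    [0.0, 0.0, 0.0, 0.53],  # one success
    [0.0, 0.0, 0.0, 0.0],   # no success at all
    [0.0, 0.35, 0.0, 0.0],  # one success
)

full_batch = []
for batch in batches:
    full_batch = select_elite(full_batch, batch)
    print(full_batch)
```

In the second iteration the fresh batch contains no successful episode, yet the episode retained from the first iteration is still available for training. Without the change, that iteration would have had no elite episodes at all.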

The result seen through TensorBoard is:

We can see that the number of iterations required is reduced once more.

Limitations of the Cross-Entropy method

So far we have seen that, with the proposed improvements, we can find a good neural network in very few iterations of the training loop. This is because we are dealing with a very simple “non-slippery” Environment. What happens if we have a “slippery” Environment?

slippedy_env = gym.make('FrozenLake-v0', is_slippery=True)

class OneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(OneHotWrapper, self).__init__(env)
        self.observation_space = gym.spaces.Box(0.0, 1.0,
            (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        r = np.copy(self.observation_space.low)
        r[observation] = 1.0
        return r

env = OneHotWrapper(slippedy_env)
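The wrapper turns the integer state returned by the Environment into a one-hot vector that can be fed to the network. As a minimal sketch of that encoding outside of Gym, the hypothetical function `one_hot` below does the same transformation for a single state, assuming the 16 states of the 4x4 Frozen-Lake grid.

```python
import numpy as np


def one_hot(state, n_states):
    """Encode a discrete state index as a float32 vector with a single 1.0."""
    encoded = np.zeros(n_states, dtype=np.float32)
    encoded[state] = 1.0
    return encoded


vec = one_hot(5, 16)  # state 5 of the 16 cells in the 4x4 grid
print(vec)
```

The length of the vector matches `env.observation_space.n`, which is also the input size the network expects.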

Again, TensorBoard is a big help. In the following figure, we see the behavior of the algorithm during the first iterations. The Reward fails to take off:

If we wait for 5,000 more iterations, we see that it improves, but from there it stagnates and is no longer able to surpass a threshold:

And although we waited more than two hours, it did not improve further and never surpassed the threshold of 60%:


Even with an example as simple as Frozen-Lake, we see that the Cross-Entropy method cannot find a solution, that is, it cannot train a neural network that reaches the Reward goal. Later in the series, you will become familiar with other methods that address these limitations. See you in the next post.

Acknowledgments: The code presented in this post was inspired by the code of Maxim Lapan, who has written an excellent practical book on the subject.