Original article can be found here (source): Deep Learning on Medium
2. Creating a Reward Matrix
A reward is an integral part of the environmental feedback. When an agent interacts with the environment, the reward signal reflects how the agent's actions change the state.
The goal, in general, is to solve a given task while achieving the maximum possible reward. That is why many algorithms assign a small negative reward to each action the agent takes, to motivate it to solve the task as quickly as possible.
Hence, we can create a NumPy matrix with all rewards set to -1 by default. Only the viable paths are set to 0, while the path leading directly to the goal is set to the maximum reward of 200. The reward matrix helps the bot take actions that eventually lead it to its goal.
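As a sketch of how such a reward matrix might be built, here is a small NumPy example. The 6-node graph layout, the node links, and the choice of node 5 as the goal are all assumptions made for illustration; only the -1 / 0 / 200 reward scheme comes from the description above.

```python
import numpy as np

# Hypothetical 6-node graph; these links are assumed for illustration.
edges = [(0, 4), (4, 3), (3, 1), (1, 5), (3, 2), (4, 5)]
goal = 5
n_states = 6

# Default reward of -1 marks impossible moves.
R = np.full((n_states, n_states), -1)
for a, b in edges:
    R[a, b] = 0   # viable path
    R[b, a] = 0   # links are bidirectional

# Any viable move that lands on the goal earns the maximum reward.
R[R[:, goal] == 0, goal] = 200
R[goal, goal] = 200   # staying at the goal also pays out
```

With this layout, moves such as 1 → 5 and 4 → 5 get a reward of 200, ordinary links like 0 → 4 get 0, and non-existent links like 0 → 1 stay at -1.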
3. Framing the Q matrix
Next, we’ll add a similar Q matrix to the brain of our agent, representing the memory of what the agent has learned through its experience. The rows of the Q matrix represent the current state of the agent, and the columns represent the possible actions leading to the next state (the links between the nodes).
The Gamma parameter also plays an important role in the agent's overall behavior. If Gamma is set closer to zero, the agent is more inclined towards immediate rewards, whereas if it is set closer to one, the agent weighs future rewards more heavily and is willing to delay gratification.
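A minimal sketch of the Q matrix and a single Q-learning update is shown below. The value of gamma, the placeholder reward entry, and the particular (state, action) pair are all assumptions for demonstration; the update rule itself is the standard one: immediate reward plus the discounted best value reachable from the next state.

```python
import numpy as np

n_states = 6                          # must match the reward matrix size
Q = np.zeros((n_states, n_states))   # the agent's memory starts empty

gamma = 0.8   # assumed value: near 0 -> myopic, near 1 -> far-sighted

# Placeholder reward matrix just for this demo update.
R = np.full((n_states, n_states), -1)
R[0, 4] = 0   # assume a viable link from state 0 to state 4

# One Q-learning update for the hypothetical pair (state=0, action=4):
state, action = 0, 4
Q[state, action] = R[state, action] + gamma * Q[action].max()
```

Since the Q matrix is still all zeros here, the update simply stores the immediate reward; once neighboring entries become non-zero, the `gamma * Q[action].max()` term propagates value backwards from the goal.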
4. Training the model for a number of epochs
In each training session, the agent explores the environment (with rewards given by the R matrix defined earlier) and collects any rewards along the way until it terminates by reaching the goal state. The purpose of training is to enhance the knowledge base of our agent, represented by the Q matrix. Generally speaking, more training = a more optimized Q matrix.
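The training loop described above can be sketched as follows. The graph, goal, gamma, episode count, and pure-random exploration policy are assumptions for illustration (a real agent would typically mix exploration with exploitation); everything is re-declared so the sketch runs on its own.

```python
import numpy as np

# Assumed 6-node graph, goal state, and discount factor (as before).
n_states, goal, gamma = 6, 5, 0.8
R = np.full((n_states, n_states), -1)
for a, b in [(0, 4), (4, 3), (3, 1), (1, 5), (3, 2), (4, 5)]:
    R[a, b] = R[b, a] = 0          # viable, bidirectional links
R[1, 5] = R[4, 5] = R[5, 5] = 200  # moves into the goal pay 200

Q = np.zeros((n_states, n_states))
rng = np.random.default_rng(0)

for episode in range(1000):
    state = rng.integers(n_states)             # start from a random state
    while state != goal:
        # Explore: pick any viable action at random.
        actions = np.flatnonzero(R[state] != -1)
        action = rng.choice(actions)
        # Q-learning update: immediate reward plus the discounted
        # best value achievable from the next state.
        Q[state, action] = R[state, action] + gamma * Q[action].max()
        state = action

# Normalize for readability, as is common in Q-learning tutorials.
Q_norm = (Q / Q.max() * 100).round().astype(int)
```

After enough episodes, following the largest Q value in each row traces the shortest path to the goal, e.g. 0 → 4 → 5 in this assumed graph.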