A summary of the concepts discussed in the second lecture for a reinforcement learning course from the School of AI called Move 37. The focus of this lecture was on one variation of the Bellman Equation. See my previous post here for the introductory lecture notes summary.

**Review of Concepts**

**State**: a numeric representation of what the agent is observing at a particular point in time in the environment.

**Action**: the input the agent provides to the environment, calculated by applying a policy to the current state.

**Reward**: a feedback signal from the environment reflecting how well the agent is performing towards the goals of the game.

*Goal of Reinforcement Learning*

*Given the current state we are in, choose the optimal action which will maximise the long-term expected reward provided by the environment.*

*Dynamic Programming*

Reinforcement learning is based on Dynamic Programming.

Dynamic Programming is a class of algorithms that seeks to simplify complex problems by breaking them up into sub-problems and solving those sub-problems recursively (with a function that calls itself).
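The classic illustration of this idea (not from the lecture itself) is computing Fibonacci numbers: the naive recursion recomputes the same sub-problems exponentially many times, while caching each sub-problem's result makes it run in linear time.

```python
from functools import lru_cache

# Dynamic programming via memoised recursion: each sub-problem
# fib(k) is solved once, cached, and reused by larger problems.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Return the n-th Fibonacci number."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(10))  # 55
```

The same break-into-sub-problems structure is what the Bellman Equation exploits: the value of a state is defined in terms of the values of its successor states.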

**What question does the Bellman Equation answer?**

- Given the state I’m in, assuming I take the best possible action now and at each subsequent step, what long-term reward can I expect?
- What is the VALUE of the STATE?
- Helps us evaluate the expected reward relative to the advantage or disadvantage of each state

*Bellman Equation (for deterministic environments)*

*The **value of a given state** is equal to the **max over actions** (the action which maximises the value) of the **reward for taking that action in the given state** plus the **discount factor** (which diminishes rewards further in the future) multiplied by the **next state's value** from the Bellman Equation:*

*V(s) = maxₐ [ R(s, a) + γ · V(s′) ]*
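The deterministic Bellman Equation can be turned directly into value iteration: repeatedly sweep over the states, updating each one with the max over actions. Here is a minimal sketch on a made-up five-state chain environment (the environment is an assumption for illustration, not from the lecture):

```python
# Value iteration for the deterministic Bellman equation:
#   V(s) = max_a [ R(s, a) + gamma * V(next_state(s, a)) ]
# Toy environment: 5 states in a row; moving into the last
# state gives reward 1.0, everything else gives 0.

GAMMA = 0.9
N_STATES = 5  # states 0..4; state 4 is terminal


def step(state: int, action: int) -> tuple[int, float]:
    """Deterministic transition: action -1 moves left, +1 moves right."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def value_iteration(n_sweeps: int = 50) -> list[float]:
    values = [0.0] * N_STATES
    for _ in range(n_sweeps):
        for s in range(N_STATES - 1):  # terminal state keeps value 0
            values[s] = max(
                r + GAMMA * values[s2]
                for s2, r in (step(s, a) for a in (-1, +1))
            )
    return values


print([round(v, 3) for v in value_iteration()])
```

Notice how the converged values decay geometrically with distance from the rewarding state: each extra step away multiplies the value by γ, exactly as the equation prescribes.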

How to find the best action (*maxₐ*)?

- Brute force
- Deep neural nets
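When the action set is small and discrete, the brute-force option is literally just enumerating every action and keeping the best one. A hedged sketch, where `reward` and `next_value` are stand-in placeholder functions (not from the lecture):

```python
# Brute-force max_a: try every action and keep the one with the
# highest Bellman backup value R(s, a) + gamma * V(s').
GAMMA = 0.9
ACTIONS = ["left", "right", "stay"]


def reward(state, action):
    # Placeholder immediate rewards for illustration only.
    return {"left": 0.0, "right": 1.0, "stay": 0.1}[action]


def next_value(state, action):
    # Placeholder V(s') for every successor state.
    return 0.5


def best_action(state):
    return max(
        ACTIONS,
        key=lambda a: reward(state, a) + GAMMA * next_value(state, a),
    )

print(best_action(0))  # "right": 1.0 + 0.9 * 0.5 is the largest total
```

Deep neural nets come in when the state or action space is too large to enumerate: a network approximates the value (or the best action) instead of computing the max exhaustively.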

*Importance of the discount factor (*γ*):*

- It is important to tune this hyperparameter to get optimum results
- Successful values range between 0.9 and 0.99
- A lower value encourages short-term thinking
- A higher value emphasises long-term rewards
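The effect of γ is easy to see on a fixed stream of rewards: a reward received t steps in the future is worth γᵗ now, so a low γ mostly counts the first few steps while a high γ counts nearly all of them. A small sketch (reward stream is a made-up example):

```python
# Discounted return: reward t steps ahead is weighted by gamma**t.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))


rewards = [1.0] * 10  # one unit of reward per step for 10 steps

print(discounted_return(rewards, 0.5))   # myopic: close to 2
print(discounted_return(rewards, 0.99))  # far-sighted: close to 10
```

With γ = 0.5 almost all of the value comes from the first couple of steps; with γ = 0.99 the agent values the whole ten-step stream nearly at face value.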

Side Note: The equation is recursive, so in practice values are computed backwards, starting from the state with the maximum reward and propagating to earlier states.

Since the lecture video for the course was private, I found a good alternative video explanation of the application of the Bellman Equation highlighted above.

Source: Deep Learning on Medium