Source: Deep Learning on Medium

Reinforcement Learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. A learning agent can take actions that affect the state of the environment and have goals relating to the state of the environment. One of the challenges that arise in Reinforcement Learning, and not in other kinds of learning, is trade-off between exploration and exploitation. Of all the forms of Machine Learning, Reinforcement Learning is the closest to the kind of learning that humans and other animals do.

#### Elements of Reinforcement Learning

Other than the ** agent** and the

**, one can identify four main subelements of RL**

*environment**Policy —**is a mapping from perceived states of the environment to actions to be taken when in those states. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behaviour. It maybe stochastic, specifying probabilities for each action.**Rewards —**On each time step, the environment sends to the reinforcement learning agent a single number called reward. The agent’s sole objective is to maximize the total reward it receives in the long run. The reward signal thus defines what are the good and the bad signals for the agent. It maybe a stochastic function of the state and action.**Value Function —**specifies, roughly, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of the environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states. For example, a state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards, or the reverse could be true.**Model of the environment —**this mimics the behavior of the environment, that allows inferences to be made about how the environment will behave. For example, given a state and an action, the model might predict the resultant next state and next reward. Methods for solving reinforcement learning problems that use models are called model-based methods, as opposed to simpler model-free methods, trial and error learners.*

Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, it is values which we are most concerned when making and evaluating decisions.

#### An Example: Tic-Tac-Toe

Reinforcement Learning Approach to solve Tic-Tac-Toe:

- Set up table of numbers, one for each possible state of the game.
- Each number will be our latest estimate of our probability of winning from that state.
- This estimate is the state’s
and the whole table is the learned value function.*value* - Assuming we always play Xs, then for all states with 3 Xs in a row the probability of winning is 1.0
- And for all states with 3 Os in a row the probability of winning is 0.0
- We set the initial values of all other states to 0.5
*(representing we have a 50% chance of winning.)*

We then play many games against the opponent. To select our moves:

- We examine the states that would result from each of our possible mvoes and look up their current values in the table.
- Most of the time we move greedily, selecting the move that leads to the state with the greatest value.
*(highest probability of winning)* - Occasionally, we select randomly from among the other moves instead.
*(Exploration)*

While playing, we change the values of the states in which we find ourselves:

- After each greedy move, from A to B, we update the value of A to be more closer to the value of B.
- This is achieved using the following formula

where,** V(S_t)** — value of the older state, state before the greedy move

*(A)*

**— value of the new state, state after the greedy move**

*V(S_t+1)**(B)*

**— learning rate**

*alpha*This update rule is an example of ** Temporal-Difference Learning** method, so called because its changes are based on a difference,

`V(S_t+1) — V(S_t)`

, between estimates at two successive times.*Thanks for reading!*