Source: Deep Learning on Medium
Markov Decision Processes (MDPs) for reinforcement learning
We have already learned about the agent, environment, action, state, reward, and episode. Now we will put all of these terms into a formal framework called the Markov Decision Process (MDP).
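The Markov property:
It is a central component of MDPs: what happens next depends only on the current state and action, not on the full history that came before.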
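Written out in the usual notation, the Markov property reads roughly as follows. The symbols S_t, A_t, and R_{t+1} for the state, action, and next reward at time t are assumed here and do not come from the text above.

```latex
\[
p\left(S_{t+1}, R_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0\right)
  = p\left(S_{t+1}, R_{t+1} \mid S_t, A_t\right)
\]
```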
Other conditional distributions:
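As a rough illustration, other conditional distributions are usually obtained by summing the joint distribution p(s', r | s, a) over the variable you do not care about: p(s' | s, a) sums over r, and p(r | s, a) sums over s'. The sketch below assumes a made-up joint distribution for one fixed (s, a) pair. The state names, rewards, and probabilities are hypothetical.

```python
from collections import defaultdict

# Hypothetical joint distribution p(s', r | s, a) for ONE fixed (s, a) pair.
# Keys are (next_state, reward); the probabilities sum to 1.
joint = {
    ("sunny", 1.0): 0.5,
    ("sunny", 0.0): 0.25,
    ("rainy", 0.0): 0.25,
}

def marginal_next_state(joint):
    """p(s' | s, a): sum the joint over all rewards."""
    out = defaultdict(float)
    for (s_next, _reward), prob in joint.items():
        out[s_next] += prob
    return dict(out)

def marginal_reward(joint):
    """p(r | s, a): sum the joint over all next states."""
    out = defaultdict(float)
    for (_s_next, reward), prob in joint.items():
        out[reward] += prob
    return dict(out)

def expected_reward(joint):
    """E[r | s, a]: probability-weighted average reward."""
    return sum(reward * prob for (_s_next, reward), prob in joint.items())

print(marginal_next_state(joint))
print(marginal_reward(joint))
print(expected_reward(joint))
```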
The state is whatever the agent measures at the current time step; it does not need to be the raw data. Any input from the agent's sensors can be used to form the state.
One more piece of the framework is the policy, denoted π. The agent uses the policy to choose actions as it navigates the environment. The optimal policy is defined in terms of the value function.
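To make the idea of a policy concrete, here is a minimal sketch that stores a stochastic policy π(a | s) as a dictionary and samples an action from it. The state names, action names, and probabilities are hypothetical, and a table like this is only one of several possible ways to represent a policy.

```python
import random

# Hypothetical policy pi(a | s): state -> {action: probability}.
policy = {
    "start": {"left": 0.2, "right": 0.8},
    "middle": {"right": 1.0},  # deterministic in this state
}

def sample_action(policy, state, rng=random):
    """Draw one action for `state` according to the policy's probabilities."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "start"))
print(sample_action(policy, "middle"))
```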
The total future reward is the sum of the rewards received after the current state and action, so it depends on the state the agent is in. It does not count the current reward.
The agent's goal is to maximize the total future reward.
A discount factor γ, between 0 and 1, is something that applies to the total reward: rewards further in the future are weighted less.
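A minimal sketch of the total future reward, assuming a discount factor `gamma` where `gamma = 1` gives the plain undiscounted sum. The function name and the reward list are hypothetical.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r1 + gamma*r2 + gamma^2*r3 + ... for a list of FUTURE rewards.

    The current reward is deliberately not part of `rewards`.
    """
    g = 0.0
    # Work backwards so each step is just: g = r + gamma * g
    for r in reversed(rewards):
        g = r + gamma * g
    return g

future_rewards = [1.0, 0.0, 2.0]
print(discounted_return(future_rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```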
The value function is determined by the policy and takes the state s as its parameter. It depends only on future rewards. The value of every terminal state is zero.
We can also rewrite the value function:
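One common rewrite is the recursive (Bellman) form, which expresses the value of a state in terms of the values of the states that follow it. The notation below is assumed rather than taken from the text: G_t is the total future reward from time t, γ is a discount factor, π(a | s) is the policy, and p(s', r | s, a) is the environment's transition distribution.

```latex
\[
\begin{aligned}
V_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] \\
         &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, G_{t+1} \mid S_t = s \right] \\
         &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
            \left[ r + \gamma\, V_\pi(s') \right]
\end{aligned}
\]
```

The same idea gives an action-value function, Q_π(s, a) = E_π[G_t | S_t = s, A_t = a], which also conditions on the first action taken.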
Optimal value function and optimal policy:
Since they are interdependent, we will discuss them together.
The optimal policy is the best policy: the one for which no other policy has a greater value. Optimal policies are not necessarily unique, but the optimal value function is.
The optimal action-value function is defined similarly:
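In the standard notation (assumed here, with V_π and Q_π denoting the state-value and action-value functions under a policy π), the two optimal functions are usually defined as the best achievable value over all policies:

```latex
\[
\begin{aligned}
V_*(s)    &= \max_{\pi} V_\pi(s)    && \text{for all states } s \\
Q_*(s, a) &= \max_{\pi} Q_\pi(s, a) && \text{for all states } s \text{ and actions } a
\end{aligned}
\]
```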
The relation between the state-value and action-value functions:
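A sketch of the standard relations, assuming the same notation as elsewhere (a discount factor γ and transition distribution p(s', r | s, a), neither of which is spelled out in the text): the optimal action value is the expected reward plus the discounted optimal value of the next state, and the optimal state value is the best action value available in that state.

```latex
\[
\begin{aligned}
Q_*(s, a) &= \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, V_*(s') \right] \\
V_*(s)    &= \max_{a} Q_*(s, a)
\end{aligned}
\]
```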
Bellman optimality equation:
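In its usual textbook form, the Bellman optimality equation replaces the average over the policy's actions with a maximum over actions. The symbols γ and p(s', r | s, a) are assumed notation for the discount factor and the transition distribution.

```latex
\[
\begin{aligned}
V_*(s)    &= \max_{a} \sum_{s', r} p(s', r \mid s, a)
             \left[ r + \gamma\, V_*(s') \right] \\
Q_*(s, a) &= \sum_{s', r} p(s', r \mid s, a)
             \left[ r + \gamma \max_{a'} Q_*(s', a') \right]
\end{aligned}
\]
```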
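With V(s) alone, the agent has to look one step ahead and choose the action that leads to the largest expected value of the next state. If we have Q(s, a), there is no need to look ahead: simply choose the argmax over actions. Q(s, a) effectively stores the result of that one-step look-ahead search.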
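The difference between the two ways of choosing an action can be sketched on a tiny example. The code below assumes a hypothetical three-state MDP, a discount factor of 0.9, and value iteration as the method for finding the optimal values; none of these come from the text above. It then picks actions once by looking ahead with V and once by a plain argmax over Q.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical toy MDP: P[s][a] = list of (probability, next_state, reward).
# State "T" is terminal: it has no actions and its value stays 0.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"back": [(1.0, "A", 0.0)], "exit": [(1.0, "T", 1.0)]},
    "T": {},
}

def lookahead(V, s, a):
    """One-step look-ahead: expected reward plus discounted next-state value."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def value_iteration(tol=1e-10):
    """Repeatedly apply the Bellman optimality update until V stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:
                continue  # terminal state keeps value 0
            best = max(lookahead(V, s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration()
# Store the look-ahead results once; afterwards the model P is not needed.
Q = {(s, a): lookahead(V, s, a) for s in P for a in P[s]}

def greedy_from_v(s):
    """Needs the model: searches one step ahead through P using V."""
    return max(P[s], key=lambda a: lookahead(V, s, a))

def greedy_from_q(s):
    """Needs only the Q table: a plain argmax over the stored values."""
    candidates = [a for (s2, a) in Q if s2 == s]
    return max(candidates, key=lambda a: Q[(s, a)])

for state in ("A", "B"):
    print(state, greedy_from_v(state), greedy_from_q(state))
```

Both functions pick the same actions here; the point is that `greedy_from_q` never touches the transition model, because the look-ahead is already baked into the Q table.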