Value Prediction Network (VPN) is a deep reinforcement learning architecture that combines ideas from model-free and model-based methods. Model-based methods generally learn the environment dynamics in order to predict real observations. VPN instead learns a dynamics model over a latent state representation that is optimized to predict values and rewards rather than observations.

*Network Heads*

The encoding module is only applied to the real observation given by the environment and produces a latent state s. The value, outcome, and transition modules are then recursively applied to expand the tree.
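
To make the recursion concrete, here is a minimal sketch of the four modules and the tree expansion. Everything in it is an assumption for illustration: the sizes, the random linear weights standing in for trained networks, the function names, and the simplification that the outcome module returns only a scalar reward.

```python
import math
import random

random.seed(0)
OBS_DIM, LATENT_DIM, NUM_ACTIONS = 6, 4, 3  # hypothetical sizes

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Random linear weights stand in for trained networks (assumption).
W_enc = rand_matrix(LATENT_DIM, OBS_DIM)
w_val = rand_matrix(1, LATENT_DIM)
W_out = rand_matrix(NUM_ACTIONS, LATENT_DIM)
W_trans = [rand_matrix(LATENT_DIM, LATENT_DIM) for _ in range(NUM_ACTIONS)]

def encode(obs):
    """Encoding module: real observation -> latent state s."""
    return [math.tanh(x) for x in matvec(W_enc, obs)]

def value(s):
    """Value module: latent state -> scalar value estimate."""
    return matvec(w_val, s)[0]

def outcome(s, a):
    """Outcome module: (latent state, action) -> predicted reward."""
    return matvec(W_out, s)[a]

def transition(s, a):
    """Transition module: (latent state, action) -> next latent state."""
    return [math.tanh(x) for x in matvec(W_trans[a], s)]

def expand(s, depth):
    """Recursively expand a tree of latent states down to the given depth."""
    node = {"value": value(s), "children": {}}
    if depth > 1:
        for a in range(NUM_ACTIONS):
            node["children"][a] = {
                "reward": outcome(s, a),
                "subtree": expand(transition(s, a), depth - 1),
            }
    return node

obs = [random.gauss(0.0, 1.0) for _ in range(OBS_DIM)]
# encode() is called once, on the real observation; the rest of the
# tree is built purely from latent states.
tree = expand(encode(obs), depth=3)
```

Note that `encode` never appears inside `expand`: below the root, the model only ever sees its own latent predictions.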

*Rollout Backup*

At rollout time, values are backed up as a uniform average over the estimates along the max path, that is, the path that takes the highest-valued action at each depth. This is described in the set of equations and visualized above.
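
One reading of this backup, sketched under assumptions: a depth-d state value mixes the direct value estimate with weight 1/d and the best depth-(d-1) action value with weight (d-1)/d, which unrolls to equal 1/d weights along the max path. The toy integer-state model, the discount `GAMMA`, and the names `q_plan` and `v_plan` are hypothetical and only serve to make the recursion runnable.

```python
GAMMA = 0.5          # hypothetical fixed discount
ACTIONS = (0, 1)     # hypothetical action set

# Toy deterministic stand-ins for the learned modules (assumption):
# latent states are plain integers here.
def value(s):
    return float(s)

def outcome(s, a):
    return 1.0 if a == 1 else 0.0

def transition(s, a):
    return s + 1 if a == 1 else s

def q_plan(s, a, d):
    """Depth-d action value: predicted reward plus discounted backed-up value."""
    return outcome(s, a) + GAMMA * v_plan(transition(s, a), d)

def v_plan(s, d):
    """Depth-d state value: 1/d on the direct estimate, (d-1)/d on the best child."""
    if d == 1:
        return value(s)
    best = max(q_plan(s, a, d - 1) for a in ACTIONS)
    return value(s) / d + (d - 1) / d * best

# Pick the action with the highest depth-2 plan value from state 0.
best_action = max(ACTIONS, key=lambda a: q_plan(0, a, 2))
```

With these toy modules, `q_plan(0, 1, 2)` evaluates to 1.75 and `q_plan(0, 0, 2)` to 0.375, so `best_action` is 1.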

*Training*

Updates are applied asynchronously. At training time, only the path actually taken during the rollout is expanded and updated, so the tree expanded at training time does not necessarily correspond to the max-action tree.
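
A rough sketch of what expanding only the taken path might look like: the model is unrolled along the recorded action sequence with no branching, and its reward and value predictions are compared against targets. The toy modules, the function names, and the squared-error loss are assumptions, and the targets are taken as given; how they are computed is outside this sketch.

```python
# Toy deterministic stand-ins for the learned modules (assumption):
# latent states are plain integers here.
def value(s):
    return 0.5 * s

def outcome(s, a):
    return float(a)

def transition(s, a):
    return s + a

def unroll_taken_path(s0, taken_actions):
    """Expand only along the actions actually taken; no branching, no max."""
    preds = []
    s = s0
    for a in taken_actions:
        r_hat = outcome(s, a)       # predicted reward for the taken action
        s = transition(s, a)        # predicted next latent state
        preds.append((r_hat, value(s)))
    return preds

def path_loss(preds, reward_targets, value_targets):
    """Sum of squared errors over the unrolled path (hypothetical loss)."""
    loss = 0.0
    for (r_hat, v_hat), r, v in zip(preds, reward_targets, value_targets):
        loss += (r_hat - r) ** 2 + (v_hat - v) ** 2
    return loss

taken = [1, 0, 1]                   # actions recorded during the rollout
preds = unroll_taken_path(0, taken)
loss = path_loss(preds, [1.0, 0.0, 0.0], [0.5, 0.5, 0.0])
```

Because `taken` comes from the behaviour at rollout time, which may have included exploratory actions, the unrolled path can differ from the path a max-based planner would have expanded.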

Source: Deep Learning on Medium