Grid: A grid world environment based on openAI-gym

Original article was published on Artificial Intelligence on Medium

Problem statement


The Figure uses a rectangular grid to illustrate value functions for a simple finite MDP. The cells of the grid correspond to the states of the environment. At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of −1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to A′. From state B, all actions yield a reward of +5 and take the agent to B′.
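The dynamics described above can be sketched as a single step function. This is a minimal sketch, not the article's code: the 5×5 layout and the coordinates of A, A′, B, and B′ are assumed from the standard version of this figure, so adjust them to match your own grid.

```python
# Hypothetical sketch of the grid dynamics described above: a 5x5 grid with
# special states A=(0,1) -> A'=(4,1) (reward +10) and B=(0,3) -> B'=(2,3)
# (reward +5). Coordinates are (row, column) with row 0 at the top.
GRID_SIZE = 5
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(state, action):
    """Return (next_state, reward) for one deterministic move."""
    if state == A:
        return A_PRIME, 10.0          # any action from A jumps to A'
    if state == B:
        return B_PRIME, 5.0           # any action from B jumps to B'
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < GRID_SIZE and 0 <= c < GRID_SIZE):
        return state, -1.0            # off-grid move: stay put, reward -1
    return (r, c), 0.0                # ordinary move: reward 0
```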


Firstly, this problem is a perfect example of what we call a finite MDP, or Markov Decision Process. A task can be classified as an MDP when it strictly satisfies the Markov property shown below:

p(s′, r | s, a) = Pr{Sₜ₊₁ = s′, Rₜ₊₁ = r | Sₜ = s, Aₜ = a} ……(1)

Here p is the probability that taking action a (chosen under some policy) in the current state s leads to the successor state s′, with the agent receiving reward r on reaching s′. Please note that by agent we mean a computer program.
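Because the dynamics in this grid are deterministic, p(s′, r | s, a) reduces to a 0/1 indicator: probability 1 for the one outcome the environment produces, 0 for everything else. A tiny hypothetical illustration (the two-state dynamics table below is my own, not from the article):

```python
# (s, a) -> (s', r): moving "east" from state 0 always lands in state 1
# with reward 0. Any other (s', r) pair has probability 0.
dynamics = {(0, "east"): (1, 0.0)}

def p(s_next, r, s, a):
    """Equation (1) for a deterministic environment: a 0/1 indicator."""
    return 1.0 if dynamics.get((s, a)) == (s_next, r) else 0.0
```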

Grid with terminal states

In the figure, the light grey regions of the grid indicate the terminal states. A terminal state is the same as a goal state: the state in which the agent is supposed to end the episode.


To solve the problem, we must deduce a policy that directs the agent towards the terminal states: one at the upper-left corner and the other at the lower-right corner of the grid. By policy we mean π, which is simply a rule that guides the agent in taking the most suitable action given the state it is in. In mathematical terms, π is a mapping from states S to actions A, as shown in the figure below.

Mapping from state to actions
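For a deterministic policy, this mapping from states to actions can be represented as a plain lookup table. The states and actions below are hypothetical, just to make the idea concrete:

```python
# pi: S -> A as a lookup table for a tiny grid whose terminal state
# (not listed) sits at (1, 1).
policy = {
    (0, 0): "east",   # from the top-left corner, head east
    (0, 1): "south",  # then south, towards the terminal state
    (1, 0): "east",
}

def pi(state):
    """Deterministic policy: return the action mapped to this state."""
    return policy[state]
```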


I will try my best to keep it as simple as possible. The implementation goes as follows:

1. Import the packages.

2. Create the grid environment.

3. Implement the step function, which calculates the reward to be returned for a particular action taken by the agent.

4. Construct the grid by loading the file, and fetch observations from the grid environment; based on these observations the agent performs actions and receives rewards.

5. Fetch the agent's states: this is the most important step, in which we assign the agent a start state and a target state, based on which it initiates the task.
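The steps above can be sketched as a single small environment class. This is a standalone sketch following the gym-style reset()/step() interface (gym itself is omitted so the snippet runs without the package); the class name, grid size, and start state are my own illustrative choices, not the author's repository code.

```python
class GridEnv:
    """4x4 grid with terminal states at the upper-left and lower-right corners."""

    MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west

    def __init__(self, size=4):
        # Step 2: create the grid environment
        self.size = size
        self.terminals = {(0, 0), (size - 1, size - 1)}
        self.state = None

    def reset(self, start=(0, 3)):
        # Step 5: assign the agent's start state (hypothetical choice here)
        self.state = start
        return self.state

    def step(self, action):
        # Step 3: compute the next state and reward for the chosen action
        dr, dc = self.MOVES[action]
        r, c = self.state[0] + dr, self.state[1] + dc
        if 0 <= r < self.size and 0 <= c < self.size:
            self.state = (r, c)          # off-grid moves leave the state unchanged
        done = self.state in self.terminals
        reward = 0.0 if done else -1.0   # -1 per step until a terminal state
        return self.state, reward, done, {}

env = GridEnv()
obs = env.reset()
obs, reward, done, info = env.step(3)    # move west: (0, 3) -> (0, 2)
```

With the real package installed, the same class would subclass `gym.Env` and declare `action_space`/`observation_space`, but the control flow is identical.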


In conclusion, we have successfully implemented the grid world problem. I hope the code is not too scary; although it is long, believe me, it is easy once you try it.

I know this is brief and I could have explained the steps in more detail, but as stated previously, this tutorial assumes the reader has a relevant background in RL and, most importantly, in the openAI-gym package.

In case you require the detailed implementation, the link to my repository is given below. Feel free to ask your questions in the comments section, though I may be slow to reply.


Thanks!