How I trained my first AI

Original article was published on Artificial Intelligence on Medium

Science fiction has always fascinated us. Intelligent machines, robots, artificial intelligence, and neural networks all grew out of that fascination, and the integration of machine learning into Unity inspired me to get my foot in the machine learning door.

This article is about how I trained a machine learning model in Unity using ML Agents.


The goal was to develop an AI that learns to clean a room, from detecting trash to dumping it into a trashcan, all by itself using Reinforcement Learning.

Reinforcement Learning

The agents (bots) learn through Reinforcement Learning: each action they take is either rewarded or penalized. Later generations of bots build on these rewards and penalties to develop better and better behaviors.

The typical framing of a Reinforcement Learning scenario: an agent takes actions in an environment, which is interpreted into a reward and a representation of the state, which are fed back into the agent.
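To make that loop concrete, here is a minimal sketch in plain Python. It is not the Unity project from this article: the `ToyEnvironment` corridor, its reward values, and the `always_right` policy are all made up for illustration. It only shows the cycle of action, reward, and new state.

```python
class ToyEnvironment:
    """Hypothetical 1-D corridor: the agent starts at 0, the goal is at position 3."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to [0, goal]
        self.position = max(0, min(self.goal, self.position + action))
        done = self.position == self.goal
        # Illustrative rewards: +1 for reaching the goal, a small penalty otherwise
        reward = 1.0 if done else -0.01
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the sum of rewards the agent received."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent acts on the current state
        state, reward, done = env.step(action)  # the environment feeds back state and reward
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    """Placeholder policy: ignore the state and always move right."""
    return 1


episode_return = run_episode(ToyEnvironment(), always_right)
```

A learning algorithm would replace `always_right` with a policy that is updated from the rewards it collects.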

Designing the Environment

A typical Learning Environment contains three components:

  • Agent: handles generating observations, performing the actions it receives, and assigning a reward (positive or negative) when appropriate.
  • Brain: encapsulates the logic for making decisions for the Agent. In essence, the Brain determines which action the Agent should take at each instance.
  • Academy: orchestrates the observation and decision-making process.

So what points must one consider before designing an environment?

  1. What does the agent observe?
  2. What action can it take?
  3. How can the agent be rewarded?

What does our agent observe? One of the most important rules when creating a learning environment is to start as simple as possible. Therefore, our agent uses only 8 observations: its distance and direction to the trashcan, the direction the bot is facing, and whether the trash has been collected.
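One way those observations could add up to 8 numbers is sketched below. The breakdown is an assumption, not something taken from the project: 1 value for distance, 3 for the normalized direction to the trashcan, 3 for the bot's facing direction, and 1 for the collected flag. The function name `build_observation` is also hypothetical.

```python
import math


def build_observation(bot_pos, bot_forward, can_pos, has_trash):
    """Pack the agent's observations into a flat list of 8 floats (assumed layout)."""
    offset = [c - b for c, b in zip(can_pos, bot_pos)]
    distance = math.sqrt(sum(v * v for v in offset))
    # Normalize the offset so direction and distance are separate signals
    if distance > 0:
        direction = [v / distance for v in offset]
    else:
        direction = [0.0, 0.0, 0.0]
    collected_flag = 1.0 if has_trash else 0.0
    return [distance, *direction, *bot_forward, collected_flag]


# Bot at the origin facing +z, trashcan 3 units right and 4 units ahead
obs = build_observation((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (3.0, 0.0, 4.0), False)
```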

But how on earth does the bot collect these observations, you might ask? For that I used 5 raycasts, which cover about a 120-degree field of vision. Each object in the environment is tagged with its name, so when a ray hits something, the bot knows what is in front of it. Randomness in an environment is always good for training the agent, so I made the trash and the trashcan spawn at random positions within their respective spawn areas.
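As a rough sketch of the ray layout, the snippet below spreads 5 rays evenly across 120 degrees and turns the tag of whatever a ray hits into a one-hot vector. The even spacing, the tag names in `TAGS`, and both function names are assumptions for illustration. In Unity the raycasting itself is done by the engine.

```python
TAGS = ["trash", "trashcan", "wall"]  # hypothetical tag names


def ray_angles(num_rays=5, field_of_view=120.0):
    """Return ray angles in degrees, relative to the bot's facing direction."""
    if num_rays == 1:
        return [0.0]
    step = field_of_view / (num_rays - 1)
    return [-field_of_view / 2 + i * step for i in range(num_rays)]


def encode_hit(tag):
    """One-hot encode the tag a ray hit; all zeros means the ray hit nothing known."""
    return [1.0 if tag == known else 0.0 for known in TAGS]


angles = ray_angles()
front_ray = encode_hit("trashcan")
```

With these defaults the rays point at -60, -30, 0, 30, and 60 degrees, so the middle ray looks straight ahead.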

What action can it take? Again, we want to start as simple as possible. The agent decides whether to move forward, rotate left or right, or do nothing. I made collecting trash automatic (it happens whenever the bot walks over a piece), just for the sake of simplicity.
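These four choices map naturally onto a small discrete action space. The sketch below is illustrative only: the action numbering, the `move_step` and `turn_step` values, and the convention that a heading of 0 degrees points along +z are all assumptions.

```python
import math

DO_NOTHING, MOVE_FORWARD, ROTATE_LEFT, ROTATE_RIGHT = range(4)


def apply_action(x, z, heading_deg, action, move_step=0.5, turn_step=15.0):
    """Return the bot's new (x, z, heading) after one discrete action."""
    if action == MOVE_FORWARD:
        rad = math.radians(heading_deg)
        x += move_step * math.sin(rad)
        z += move_step * math.cos(rad)
    elif action == ROTATE_LEFT:
        heading_deg = (heading_deg - turn_step) % 360.0
    elif action == ROTATE_RIGHT:
        heading_deg = (heading_deg + turn_step) % 360.0
    # DO_NOTHING falls through and leaves the pose unchanged
    return x, z, heading_deg


moved = apply_action(0.0, 0.0, 0.0, MOVE_FORWARD)
turned = apply_action(0.0, 0.0, 0.0, ROTATE_LEFT)
```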

How can the agent be rewarded? We want our agent to collect trash and dump it in the trashcan, so we reward it every time it collects trash and every time it dumps it. To encourage the agent to finish quickly, it also receives a tiny negative reward every step. You really don’t want to over-design your rewards, because that can easily lead to reward exploitation by the agent.


The first iterations were strange but expected: our agent was clueless. It didn’t know what to do or where to go, and it ended up standing still or getting stuck in some weird situations.

So, what was happening here? Our agent did know which actions were right; however, it lacked knowledge of which ones were wrong, since there weren’t any negative rewards or penalties.

To fix this, I added a tiny negative reward for every step it takes and for every time it hits a wall, so that if it finds itself stuck somewhere, it tends to change its action and tries to complete the task in the minimum number of steps.
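Put together, the reward scheme might look something like the sketch below. Only the +1 dumping reward is mentioned in this article. The collecting reward and the sizes of the step and wall penalties are placeholder values I am assuming here, and `step_reward` is a hypothetical helper.

```python
COLLECT_REWARD = 1.0   # assumed value
DUMP_REWARD = 1.0      # the +1 dumping reward mentioned in the article
STEP_PENALTY = -0.001  # assumed "tiny" per-step penalty
WALL_PENALTY = -0.01   # assumed penalty for hitting a wall


def step_reward(collected, dumped, hit_wall):
    """Return the reward for a single step, given what happened during it."""
    reward = STEP_PENALTY  # every step costs a little, so shorter episodes score higher
    if collected:
        reward += COLLECT_REWARD
    if dumped:
        reward += DUMP_REWARD
    if hit_wall:
        reward += WALL_PENALTY
    return reward


idle_step = step_reward(collected=False, dumped=False, hit_wall=False)
dump_step = step_reward(collected=False, dumped=True, hit_wall=False)
```

Keeping the penalties much smaller than the task rewards means they nudge the agent without drowning out the main goal.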

Another challenge arises

Training an agent can take 2–3 hours, and since our environment wasn’t good enough at first, I found myself modifying it multiple times and running the whole training process again and again. I tried tweaking the parameters of our neural network, but that wasn’t helping much.

Thankfully, this is not the real world, and one brain is not limited to one agent. We can have multiple agents use the same brain, meaning all learning progress is shared.

You can train multiple agents simultaneously, either by creating multiple instances of the environment or by cranking up the number of agents in a single environment.
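The benefit is easy to see in a toy sketch: when several agents report to one brain, the pool of experience grows in proportion to the number of agents. The `SharedBrain` and `Agent` classes below are hypothetical stand-ins, not ML-Agents classes, and the placeholder policy does no learning.

```python
class SharedBrain:
    """Hypothetical brain that pools experience from every agent attached to it."""

    def __init__(self):
        self.experience = []

    def decide(self, observation):
        return 0  # placeholder policy: always pick action 0

    def record(self, observation, action, reward):
        self.experience.append((observation, action, reward))


class Agent:
    def __init__(self, brain, agent_id):
        self.brain = brain
        self.agent_id = agent_id

    def step(self, tick):
        observation = (self.agent_id, tick)  # stand-in for real sensor data
        action = self.brain.decide(observation)
        self.brain.record(observation, action, -0.001)


def collect(num_agents, num_ticks):
    """Run num_agents agents for num_ticks ticks and count the pooled samples."""
    brain = SharedBrain()
    agents = [Agent(brain, i) for i in range(num_agents)]
    for tick in range(num_ticks):
        for agent in agents:
            agent.step(tick)
    return len(brain.experience)


samples_single = collect(1, 100)
samples_many = collect(8, 100)
```

In the same 100 ticks, eight agents gather eight times as many samples as one agent does.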


The trained agents worked almost perfectly, or at least as perfectly as I wanted them to. Their behavior is somewhat unpredictable, and it’s really enjoyable to watch them try new things. For example, I noticed them using two main strategies in particular:

  • The first is the way I wanted them to follow: grab all the trash first, then dump it all at the end.
  • The second is the way they ended up following: exploiting the reward system by dumping each piece of trash as soon as they pick it up.

The way the reward system was programmed in the early stages of training allowed them to gain that juicy (+1) trash-dumping reward multiple times, which was a silly mistake on my part.
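A quick back-of-the-envelope comparison shows why the second strategy pays off. This sketch assumes the +1 is granted once per dumping trip, no matter how many pieces are carried. That is my guess at the kind of rule that produces this behavior, not the project's actual code, and `dump_rewards` is a hypothetical helper.

```python
DUMP_REWARD = 1.0  # the +1 dumping reward


def dump_rewards(num_trash, dump_after_each_pickup):
    """Total dumping reward under the assumed rule of +1 per trip to the trashcan."""
    trips = num_trash if dump_after_each_pickup else 1
    return trips * DUMP_REWARD


eager_total = dump_rewards(5, dump_after_each_pickup=True)
batch_total = dump_rewards(5, dump_after_each_pickup=False)
```

Under that rule, five pieces of trash earn five dumping rewards when dumped one at a time but only one when dumped together, so the agent has every reason to make extra trips.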

After all the modifications, tweaking, and hours of training, I ended up with two different brains using two different strategies. Overall, this was a successful first attempt and an overwhelmingly enjoyable experience for me.

Ending Notes

If you have come this far, thanks a lot! I’m really curious about what you have to say. Feel free to leave your suggestions and feedback.

Reach me at —