Original article can be found here (source): Artificial Intelligence on Medium
Should I stay or should I go now?
If I go there will be trouble
And if I stay it will be double
So ya gotta let me know
Should I stay or should I go? — The Clash
Don’t we all feel like that sometimes? Unhappy with your current job, but the new job pays less. Unsure whether to invest in the stock market or in a wealth management platform? Even Spider-Man had to make a choice between saving Mary Jane and saving a cable car full of people.
A dilemma is a struggle that occurs within the mind. It is about having to make a choice between two or more alternatives whose outcomes are equally favourable or undesirable. Making the right choice may bring positive outcomes, while making the wrong move will cost you.
In reinforcement learning, the machine learning agent also faces a dilemma: choosing between exploration and exploitation. During the training process, the agent has to either:
- pick something familiar to maximise the chance of getting rewarded
- pick something new which may (or may not) result in making better decisions in future
Striking a balance between exploring and exploiting
Finding the balance between exploration (of uncharted territory) and exploitation (of current knowledge) is key to training a successful reinforcement learning agent. The unknown needs to be discovered to expand existing knowledge. The known needs to be exploited, to generate rewards.
This means that sometimes you have to deliberately pass on the action you think will be the most rewarding, in order to gain new information — even if it means making some bad decisions in the process of exploration. At the same time, you want to maximise your reward by exploiting what you know works best.
So how do we strike a balance between sufficiently exploring the unknowns and exploiting the optimal action?
- sufficient initial exploration such that the best options may be identified
- exploit the optimal option to maximise the total reward
- continuing to set aside a small probability to experiment with sub-optimal and unexplored options, in case they provide better returns in the future
- if those experimental options turn out to do well, the algorithm must update its estimates and begin selecting them more often
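The steps above can be sketched as a simple epsilon-greedy multi-armed bandit. Everything here — the arm count, the reward probabilities, and the function names — is illustrative, not taken from the article's own code:

```python
import random

def epsilon_greedy_bandit(reward_probs, epsilon=0.1, steps=10000, seed=0):
    """Play a multi-armed bandit with epsilon-greedy action selection.

    Keeps a running-average reward estimate per option; explores a
    random option with probability `epsilon`, otherwise exploits the
    option with the best current estimate.
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    estimates = [0.0] * n_arms   # current value estimate per option
    counts = [0] * n_arms        # how often each option was tried

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)              # explore
        else:
            arm = estimates.index(max(estimates))    # exploit
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental running-average update: if an experimental option
        # keeps paying off, its estimate rises and exploitation shifts
        # toward it — exactly the "update and begin selecting" step.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```

With a 10% exploration rate, the agent discovers that the third option pays best and ends up pulling it for the vast majority of the run.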
In reinforcement learning, we can decide how much time an agent spends exploring. This is done by adjusting the epsilon-greedy parameter, which ranges from 0 to 1.
If we set epsilon to 0.1, the algorithm will explore 10% of the time and exploit the best-known option 90% of the time. In most cases, the epsilon value is set between 5% and 10%.
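The selection rule itself fits in a few lines. This is a minimal sketch — the function name and the Q-value list are made up for illustration:

```python
import random

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: with probability `epsilon` explore a random
    action, otherwise exploit the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

# With epsilon = 0.1, the greedy action is taken on ~90% of steps,
# plus its share of the random exploration picks.
rng = random.Random(42)
q = [0.1, 0.9, 0.4]
picks = [select_action(q, 0.1, rng) for _ in range(10000)]
greedy_share = picks.count(1) / len(picks)
```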
Evaluating different epsilon-greedy values with a tic-tac-toe agent
I’ve built a tic-tac-toe game where agents can learn the game by playing against each other. First, let me introduce our agents: agent X and agent O. Agent X always goes first, which means agent X has the advantage.
Experiment #1. To figure out the most suitable epsilon-greedy value for each agent in this game, I will test different epsilon-greedy values. I will initialise agent X to explore 1% of the time (eps 0.01), have both agents play against each other for 10,000 games, and record the number of times agent X wins. Then I will increase the exploration rate and repeat the test until agent X explores 100% of the time (eps 1.0).
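The article's actual game code isn't shown here, so this is only a sketch of the experiment loop. The `play_games` helper is hypothetical, and `fake_play_games` is a toy stand-in that simulates a "more exploration, fewer wins for X" trend with made-up probabilities, purely so the sweep is runnable:

```python
import random

def sweep_epsilon(play_games, eps_values, eps_o=0.05, games=10000):
    """Experiment #1 skeleton: for each exploration rate for agent X,
    play `games` matches against agent O and record X's win count."""
    return {eps_x: play_games(eps_x=eps_x, eps_o=eps_o, games=games)
            for eps_x in eps_values}

def fake_play_games(eps_x, eps_o, games, seed=1):
    """Toy stand-in for the real self-play loop. The win probability
    below is an invented assumption, not measured data."""
    rng = random.Random(seed)
    win_prob = 0.93 - 0.4 * eps_x
    return sum(1 for _ in range(games) if rng.random() < win_prob)

results = sweep_epsilon(fake_play_games, [0.01, 0.05, 0.25, 0.5, 1.0])
```

In the real experiment, `play_games` would train and pit the two tic-tac-toe agents against each other; only the sweep structure is the point here.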
The results: agent X (exploration from 1% to 100%) vs agent O (eps 0.05). The blue line represents the number of games agent X wins at each exploration rate.
This shows that as the exploration rate increases, agent X's winning rate drops. It peaked at 9,268 wins when agent X explores 5% of the time. Agent O also begins to win more games once agent X explores more than 50% of the time.
Experiment #2. Let's see how agent O fares if we initialise agent X with the optimal 5% epsilon. The blue line represents the number of games agent O wins.
Well, agent O has no chance of winning at any exploration rate; it loses most of its games before it can learn the game.
Experiment #3. Let's set agent X's epsilon to 100% (eps 1.0), which means agent X will play random actions all the time. The blue line represents the number of games agent O wins against a random agent X.
Agent O begins to lose more games once its own exploration rate exceeds 30%.
Try the demo
Explore the online demo and challenge our reinforcement agent in a game of tic-tac-toe. You can tune the parameters to train different agents.
Learn about how the tic-tac-toe agent learns:
If you like the online demo, you might like this too: