My journey of trying Deep Reinforcement Learning to optimize train scheduling: pt. 1

Many things that I tried

At the very beginning I didn’t even start by reading papers on similar problems, because I had a much bigger problem: a variable number of agents!! My goal was to build every algorithm from scratch, so I didn’t use RLlib, even though it offers great tools for both multi-agent and single-agent reinforcement learning settings. One option was to make a separate network for every agent. That works, but there would be no communication between agents, and when they need to come up with a way to share a resource, it simply wouldn’t be possible. Then I considered the approach where the Critic is centralized, receives all agents’ observations, and gives every agent a state value, while the Actors adjust their policies in that direction. Here all agents have separate networks, so the problem of a variable number of agents is out of the way. Yet there is still the problem of decentralized execution: during execution the agents have no idea about each other’s decisions.

So I was stuck thinking about ways to build a communication channel between agents, and came up with a great idea, which, as I later found out, is not mine: some people had already written an amazing paper about it and named it Multi-agent Bidirectionally-Coordinated Nets (BiCNet). The idea is the following: every agent in the environment is represented as a step in a bidirectional RNN sequence and passes its hidden state to the other agents, and since the RNN is bidirectional, every agent knows about every other agent through those hidden states. The best part is that you don’t need to design the communication channel yourself: the agents learn the best communication “language” during training.
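To make that idea concrete, here is a minimal sketch of the BiCNet-style communication pattern in PyTorch. The layer sizes and names are my own assumptions for illustration, not the exact architecture from the paper or from my code:

```python
import torch
import torch.nn as nn

class BiCNetComm(nn.Module):
    """Minimal sketch: treat the agents as a sequence and run a bidirectional
    GRU over their per-agent features, so every agent's output depends on
    every other agent's hidden state."""

    def __init__(self, obs_dim: int, hidden_dim: int, n_actions: int):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        # bidirectional=True is what lets information flow both ways along the agent "sequence"
        self.comm = nn.GRU(hidden_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.policy_head = nn.Linear(2 * hidden_dim, n_actions)

    def forward(self, agent_obs: torch.Tensor) -> torch.Tensor:
        # agent_obs: (batch, n_agents, obs_dim) -- n_agents can vary from call to call
        x = torch.relu(self.encoder(agent_obs))
        x, _ = self.comm(x)                 # (batch, n_agents, 2 * hidden_dim)
        return self.policy_head(x)          # per-agent action logits

# hypothetical dimensions, just to show the shapes work for any number of agents
net = BiCNetComm(obs_dim=32, hidden_dim=64, n_actions=5)
logits = net(torch.randn(1, 3, 32))         # 3 agents -> (1, 3, 5)
```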
Now that we have the algorithm architecture, it’s time to choose the algorithm itself. In the sea of algorithms like DQN, DDQN, A3C, etc., I chose the very amazing PPO from the Actor-Critic family. I calculated the loss for each agent individually and summed them at the end for training. Taking the mean instead of the sum works as well, and I think I will need to revisit this later, because the sum grows as the number of agents increases and I don’t want that to mess with my network. The choice of PPO came down to the fact that it’s from the Actor-Critic family, which works best here, and that it’s a lot more stable than A3C, so here we are. If you would like to learn more about PPO, I would recommend checking out this amazing video.
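As an illustration of the per-agent loss, here is a minimal sketch of the clipped PPO policy objective computed per agent and then summed; the function name and tensor shapes are assumptions for the sake of the example, not my actual code:

```python
import torch

def ppo_policy_loss_per_agent(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped PPO policy loss, computed per agent and then summed.
    new_logp, old_logp, advantages: tensors of shape (n_agents, timesteps)."""
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    per_agent = -torch.min(ratio * advantages, clipped * advantages).mean(dim=1)
    return per_agent.sum()   # swapping .sum() for .mean() gives the alternative mentioned above
```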
I set up the training process so that complexity increases over time/iterations: training starts with 1 agent on a small map, the map gradually gets bigger, and the number of agents is incremented after every few map-size increases. In the future I want to try increasing the complexity based on how well the algorithm is performing rather than just on time.
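A time-based curriculum like this can be written in a few lines; the specific numbers below are made up for illustration:

```python
def curriculum(iteration, base_size=20, grow_every=50, agents_every=3):
    """Hypothetical time-based curriculum: grow the map every `grow_every`
    iterations and add an agent after every `agents_every` map-size increases."""
    steps = iteration // grow_every
    map_size = base_size + 5 * steps
    n_agents = 1 + steps // agents_every
    return map_size, n_agents

# e.g. iteration 0 -> (20, 1), iteration 150 -> (35, 2)
```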
Now it’s time to talk about the network architecture. At first I tried the TreeObservation with a different architecture, but later I switched to the GlobalObservation, so I will talk about the network I used there. The observation has the size h×w×22×number_of_agents, so I used CNN layers feeding into RNN layers. As I said, the map size varies, so I used Spatial Pyramid Pooling to get a fixed-size vector to pass to my RNN component. I also did some tricks to feed the CNN all agents’ observations as one batch and later extract each agent’s observations back out for the RNN. I used PyTorch because it’s awesome: easy to work with, easy to build and test new ideas with, and the code just looks amazing. All the code I am talking about can be found here.
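Here is a rough sketch of that CNN → Spatial Pyramid Pooling → RNN pipeline in PyTorch; the layer sizes, pyramid levels, and channel counts are assumptions for the example, not the exact settings I used:

```python
import torch
import torch.nn as nn

class GlobalObsNet(nn.Module):
    """Sketch of CNN -> Spatial Pyramid Pooling -> bidirectional RNN for
    variable map sizes and a variable number of agents."""

    def __init__(self, in_channels=22, hidden_dim=128, n_actions=5,
                 pyramid_levels=(1, 2, 4)):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # adaptive pooling gives a fixed-length vector regardless of map size
        self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(l) for l in pyramid_levels)
        spp_dim = 64 * sum(l * l for l in pyramid_levels)
        self.rnn = nn.GRU(spp_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden_dim, n_actions)

    def forward(self, obs):
        # obs: (n_agents, channels, h, w) -- all agents pushed through the CNN as one batch
        feats = self.cnn(obs)
        spp = torch.cat([p(feats).flatten(1) for p in self.pools], dim=1)
        # re-pack the agents as a sequence so the RNN can share information between them
        out, _ = self.rnn(spp.unsqueeze(0))   # (1, n_agents, 2 * hidden_dim)
        return self.head(out.squeeze(0))      # per-agent action logits

net = GlobalObsNet()
logits = net(torch.randn(3, 22, 30, 40))      # 3 agents on a 30x40 map -> (3, 5)
```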

Results

I can’t brag about my results YET!! The deal is that my code is not very well optimized for my computational budget, and I hit a memory error before getting any valuable results. What I can tell is that during training there is a very noticeable movement towards learning: I can see agents that stop to let another agent pass, and many little details like that. I am far from done for sure, and I will spend as much of my free time on this as I can, because I have a lot of ideas on how to make it work.

Ideas for future

First I want to optimize the code, so that it at least stops hitting memory errors. I also have some bugs I would like to get rid of (I want to keep training for several iterations on the same map before changing it, but a bug won’t let me). The major change will be to incorporate planning into my algorithm to reduce training time; it also seems intuitive to have planning in a project like this, so that’s a must-have for the future. Later I may try Hierarchical Reinforcement Learning, as I have a couple of ideas for that as well.

My journey is not over yet, and I am pretty confident that I can solve this problem using modern RL techniques. The reason I am so focused on RL is that one algorithm can be applied to many other similar problems: after train scheduling, the next could be planes, trucks, ships, you name it. By using learning algorithms we also ensure that the system keeps getting better over time with little to no changes, and the learning process may give us insights into solving the problem in ways that have never been thought of before.
That’s it for part 1, see you in part 2.