Last week we presented our implementation of Federated Averaging in TensorFlow. We provided an easy-to-use interface, wrapping all of the federated logic into a custom TensorFlow optimizer. This lets the actual computation run on TensorFlow’s highly optimized distributed infrastructure.
While this works great as a demo within a local cluster, we aim to make this kind of technology available to everyone, so that users in different parts of the world can train an algorithm together.
With distributed TensorFlow we would need to know the IPs of every single worker. With Python sockets, however, we can define a central worker and have the rest of the nodes point their connections at it.
We will show how to set up several Raspberry Pis to talk to each other and train a neural network together to recognize the Fashion-MNIST dataset. Let’s get into it!
We are going to build a custom TensorFlow hook, which will take care of all the communication among the nodes. This allows us to define a TensorFlow graph as usual, and add the hook to the training session. Federated Learning made easy!
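To make the shape of this concrete, here is a toy skeleton (not the actual FederatedHook; the class name, the `average_every` parameter, and the method bodies are our illustration) of how a hook can piggyback on the training loop:

```python
import tensorflow.compat.v1 as tf  # TF 1.x-style API, as used at the time


class FederatedHookSketch(tf.train.SessionRunHook):
    """Toy skeleton of the idea: every `average_every` steps, exchange
    weights with the other workers and load the average back in."""

    def __init__(self, average_every=100):
        self._average_every = average_every
        self._step = 0

    def begin(self):
        # Build any extra graph pieces here (placeholders, assign ops),
        # since the graph is finalized once the session starts.
        pass

    def after_run(self, run_context, run_values):
        self._step += 1
        if self._step % self._average_every == 0:
            # Here the real hook would send the local weights over a
            # socket, wait for the average, and assign it to the model.
            pass
```

With `tf.train.MonitoredTrainingSession`, such a hook is simply passed in through the `hooks` argument, so the training loop itself stays unchanged.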
We won’t spend time on the graph definition because it is fairly standard. We will instead focus our effort on the FederatedHook.
This hook will use TensorFlow placeholders to inject data into the graph during training. So we define a placeholder for every variable that is going to be averaged, and set up ops that assign the values fed to these placeholders to the corresponding local variables.
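As a sketch of that wiring (the helper name `build_assign_ops` is ours, and we write it against `tf.compat.v1` so it also runs under TF 2):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()


def build_assign_ops():
    """For every trainable variable, create a placeholder with the same
    shape and dtype, plus an op assigning the fed value to the variable."""
    placeholders, assign_ops = [], []
    for var in tf.trainable_variables():
        ph = tf.placeholder(var.dtype.base_dtype, shape=var.shape)
        placeholders.append(ph)
        assign_ops.append(tf.assign(var, ph))
    return placeholders, assign_ops

# After receiving the averaged weights, load them into the local model:
# sess.run(assign_ops, feed_dict=dict(zip(placeholders, averaged_weights)))
```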
With these in place we only need Python sockets to send NumPy arrays between the workers.
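One common pattern for this (a sketch; the helper names `send_array` and `recv_array` are ours, not from the original code) is to pickle the array and prefix it with its byte length, so the receiver knows exactly how much to read:

```python
import pickle
import struct


def _recv_exact(sock, n):
    """recv() may return fewer bytes than requested; loop until we have n."""
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('socket closed mid-message')
        buf += chunk
    return buf


def send_array(sock, arr):
    """Pickle a NumPy array and send it with a 4-byte big-endian length prefix."""
    payload = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
    sock.sendall(struct.pack('>I', len(payload)) + payload)


def recv_array(sock):
    """Read the length prefix, then the payload, and unpickle the array."""
    (length,) = struct.unpack('>I', _recv_exact(sock, 4))
    return pickle.loads(_recv_exact(sock, length))
```

Note that unpickling data received over a socket is only safe between machines you trust, which is the case for the cluster described here.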
As we mentioned earlier, we need a chief worker to which the rest of the nodes can point their sockets. The public IP of this chief must be known to every other node in the network. We decided to place this chief worker on AWS. The rest of the nodes in our network will be Raspberry Pis.
Sockets are bound to a private IP, but our workers can only see the chief host’s public-facing ports. Therefore the machine acting as chief needs to configure its network so that all data arriving at one of its public-facing ports is redirected to the corresponding private one. This is called port forwarding.
To better understand this point, let’s walk through an example with an AWS instance. The chief worker opens a socket on the private IP xxx.xxx.x.xx on port 3389 (CHIEF_PRIVATE_IP in the code). The other workers, though, cannot open a socket pointing to this private IP because they only have access to public-facing ports. So we tell them to connect to the chief’s public IP (CHIEF_PUBLIC_IP in the code) on port 7777.
So that connection requests reach our chief worker’s socket on port 3389, we first need to open public port 7777 in our network and accept TCP traffic through it. This is done in the security group settings. Both the inbound and outbound settings should look like this:
Make sure to choose random ports, and disable port forwarding once you don’t need it anymore!
After that, we need to forward port 7777 of the public IP to port 3389 of the private one. This is done with the following shell commands:
sudo iptables -t nat -A PREROUTING -p tcp --dport 7777 -j DNAT --to-destination xxx.xxx.x.xx:3389
sudo iptables -A FORWARD -p tcp --dport 7777 -j ACCEPT
If instead of using an AWS instance you would like to run the chief on your own computer, you will need to do this port forwarding in your router. This is usually quite easy to do by accessing your router’s configuration.
You may also try running the code on a LAN using, for example, a couple of Raspberry Pis. In this case you would need to set both IPs in the code to the same address.
If you are using only Raspberry Pis you are unlikely to have any problem with the ports, but if you are using your own computer to run the chief worker you will need to open the port that serves it. For example, for the address xxx.xxx.x.xx:3389, just run the following command:
sudo iptables -A INPUT -p tcp --dport 3389 -j ACCEPT
Try this on the Raspberry Pis as well if they can’t connect properly. We also highly recommend closing your ports once you no longer need them open:
sudo iptables -I INPUT -p tcp --dport 3389 -j REJECT
We ran some performance tests with the Raspberry Pis, and the results are great!
As we can see, training in a traditional distributed fashion performs poorly in this setup, where bandwidth is limited (we are using Wi-Fi). However, Federated Averaging outperforms even single-node training thanks to a 100x reduction in bandwidth requirements, which translates into a ~4x reduction in training time versus distributed SGD.
Federated Averaging outperforms single-node training because a single Raspberry Pi lacks the processing power to efficiently process the whole dataset. By distributing it over six Raspberry Pis we reduce the workload for each one of them and end up with faster training. In this simple example, accuracy with Federated Averaging is comparable to that of distributed SGD.
Our implementation using sockets is somewhat slower than training with either MPI or gRPC (distributed TensorFlow). However, it is the only one that lets us train without knowing the IPs of every worker, and it is still fast!
Let us know what you think and stay tuned for more cool Federated stuff.
Source: Deep Learning on Medium