Federated Learning : Machine Learning That Respects Data Privacy

Source: Deep Learning on Medium

Introducing Federated Learning

Hot on the heels of Facebook admitting to sharing user data with Cambridge Analytica among others, a whistleblower recently alleged that Google has collected the intimate medical data of 50 million people under what is apparently known internally as ‘Project Nightingale’.

With data replacing oil as the world’s most valuable resource, everyone from startups to tech giants is racing to amass as much of it as possible, by any means available, and to build models for every use case imaginable. While the quest for data rages, the rest of us are putting black tape over our webcams and glaring suspiciously at Buzzfeed quizzes, waiting for Alexa to turn into Skynet as Black Mirror warned.

But it isn’t just unsuspecting social media users who are wary of giving apps access to their data. Every company seems to have its own team of engineers building object detection and speech recognition models, even though several accurate APIs are already available. Trust is in short supply when it comes to data: enterprises are just as afraid to use ML APIs as the average Facebook user is of taking another quiz. Given how consistently machine learning has been trending, we seem to be heading towards a privacy crisis reminiscent of the Great Horse Manure Crisis of 1894, and federated learning may well be to data privacy what the motor car was to the manure crisis.

What is federated learning?

Federated learning is a style of machine learning where, instead of building a centralised corpus of data and training a single model on it, we train multiple decentralised models on multiple decentralised datasets and combine their weights, so that every model is updated with the insights from all the training data.

Instead of centralising training data in a data centre and setting up hundreds of GPUs to process it, federated learning leverages the computational power and storage of modern cell phones. It allows mobile phones to collaboratively build a shared model while keeping all the training data safe and secure on the devices. In short, federated learning is machine learning without uploading the data to a server to form a centralised corpus. Each phone downloads the existing model, trains it locally on sensitive user data, and uploads only the learned weights over encrypted communication; the weights are averaged and shared as an update across the models on all the other phones to improve performance.

Google first introduced the federated learning approach in a 2016 paper by Google AI researchers, Communication-Efficient Learning of Deep Networks from Decentralized Data, which used the prediction model behind the Google keyboard as its subject. The process can be summarised in a few simple steps:

  • Your phone downloads the model when you install the Google keyboard.
  • The model fine-tunes itself on the text you type throughout the day.
  • Later, when your phone is idle, it shares what it has learned over encrypted communication.
  • The weights shared by all the phones are averaged and every model is updated, improving the models on thousands of devices without any of your private texts ever leaving your phone.
  • The process repeats as the models encounter new data every day.
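The averaging step at the heart of these rounds (Federated Averaging in the paper) can be sketched in a few lines of PyTorch. The toy linear model below is an illustrative stand-in for the actual keyboard model, and the plain unweighted mean is a simplifying assumption:

```python
import torch
import torch.nn as nn

def federated_average(local_models):
    # Server step of FedAvg: average each parameter across all clients.
    avg_state = {}
    for name in local_models[0].state_dict():
        avg_state[name] = torch.stack(
            [m.state_dict()[name] for m in local_models]).mean(dim=0)
    return avg_state

# Three "phones", each holding its own locally fine-tuned copy of the model
clients = [nn.Linear(8, 4) for _ in range(3)]

# The shared global model is refreshed with the averaged weights
global_model = nn.Linear(8, 4)
global_model.load_state_dict(federated_average(clients))
```

In practice the average is typically weighted by each client’s number of training examples, but a plain mean keeps the sketch simple.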

Why should we care?

If you’re a paranoid user with tape over your webcam, then federated learning is definitely good news for you. There’s no stopping organisations from wanting to harvest your data, but with federated learning they don’t have to take your data and upload it to their servers to train their models. They can learn from your data without violating your privacy.

But federated learning isn’t just about gleaning insights from sensitive user data on mobile phones without violating user privacy. This collaborative approach of learning from shared, encrypted updates could finally unleash the full commercial potential of machine learning, letting companies leverage the might of user data without ethical conundrums or the very real threat of lawsuits.

Do you remember the sales meeting in Silicon Valley where the sales team wanted to drop neural networks from the Pied Piper platform, because not even the best salespeople could convince an organisation to trust the black box of a neural network with its real, sensitive data? With federated learning, that is no longer an issue. Federated learning could provide personalisation at hitherto unseen levels, and it would allow enterprises to use machine learning APIs without worrying about sharing their sensitive data.

Federated Learning allows for smarter models, lower latency, and less power consumption, all while ensuring privacy. It also lets you use the new and improved model immediately without having to wait for an update to be rolled out.

An Example

In its simplest form, you could build a decentralised model in a few short steps. Suppose you have a binary classification dataset of around 2,000 items, say 1,000 images of cats and 1,000 of dogs. In the traditional approach, you would build a CNN and train it on all 2,000 images over a few hours. With federated learning, you instead split the data into 10 datasets, each with 100 images of dogs and 100 of cats. You define your CNN, make 10 copies of it, and train each copy for a few epochs on one of the 10 datasets. Once all 10 models are trained, you simply average their weights to update a single model, and you have a classifier that can perform comparably to the centralised one, with the training effort parallelised across the 10 datasets.

Consider the PyTorch code below as an illustration of updating a model with weights gathered from multiple decentralised classifiers. Note that it blends each model’s parameters into a running estimate with a mixing coefficient beta, rather than computing a plain mean:

# models_ is a list of locally trained copies of the same network;
# dict_params starts from the first model's parameters
dict_params = dict(models_[0].named_parameters())
for i in range(1, num_models):
    for name, param in models_[i].named_parameters():
        if name in dict_params:
            # blend this model's weights into the running estimate
            dict_params[name].data.copy_(
                beta * param.data + (1 - beta) * dict_params[name].data)
updated_model.load_state_dict(dict_params, strict=False)
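Putting local training and averaging together, a full simulated round for the cats-and-dogs example above might look like the sketch below (using a plain mean rather than a beta blend). The tiny linear classifier and random tensors are stand-ins, assumed for brevity, for a real CNN and the ten image shards:

```python
import copy
import torch
import torch.nn as nn

# Stand-ins for the 10 shards of 200 labelled cat/dog images each
shards = [(torch.randn(20, 16), torch.randint(0, 2, (20,))) for _ in range(10)]

global_model = nn.Linear(16, 2)  # toy classifier in place of a CNN

# Each "client" fine-tunes a copy of the global model on its own shard
local_models = []
for x, y in shards:
    local = copy.deepcopy(global_model)
    opt = torch.optim.SGD(local.parameters(), lr=0.1)
    for _ in range(5):  # a few local epochs
        opt.zero_grad()
        nn.functional.cross_entropy(local(x), y).backward()
        opt.step()
    local_models.append(local)

# Server step: average the learned weights back into the global model
avg_state = {name: torch.stack([m.state_dict()[name] for m in local_models]).mean(dim=0)
             for name in global_model.state_dict()}
global_model.load_state_dict(avg_state)
```

Each shard never leaves its “client”; only the trained weights travel back for averaging, mirroring the privacy story above.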

Popular federated learning frameworks

With federated learning being one of the latest developments in machine learning, data science enthusiasts are already hard at work rolling out federated learning frameworks that can be easily incorporated into your favourite machine learning libraries. Two popular recent frameworks are TensorFlow Federated and PySyft.

TensorFlow Federated is an open source framework by Google for experimenting with machine learning and other computations on decentralized data. PySyft is a Python library for secure, private Deep Learning. PySyft decouples private data from model training, using Federated Learning, Differential Privacy, and Multi-Party Computation within PyTorch.