Standard machine learning approaches generally require training data to be stored in one central location. However, with the growing focus on privacy protection in machine learning, a new field of research known as federated learning has sparked global interest. In this blog post we present our first results on privacy-preserving collaborative machine learning, following up on our previous blog post, which introduced three different approaches to tackling the privacy problem in this area.
Before diving deeper into our proposed approach, let’s recap the concept’s main points. The idea of federated learning is to train machine learning models without explicitly sharing data, while also concealing who participated in training. This scenario is relevant across industries as well as at a personal level, and it becomes especially important where malicious clients might want to infer another client’s participation.
As a simple example, consider multiple hospitals and insurers collaborating to train a universal model with their individual patients’ and customers’ data to get a better overview of current diseases, diagnoses and medical costs. Now imagine that one of the insurers joins this collaboration hoping to find out specific details about the patients in a contributing hospital’s data set. If the hospital revealed confidential data during the training process, its patients’ privacy would be violated, and the insurer might use this information to charge certain patients a higher price.
Another situation arises when clients want to unsubscribe from a service whose model they have helped train, without leaving an overly specific data fingerprint on the model. Going back to the example of hospitals and insurers, if one insurer wanted to stop contributing to the training of the model, its withdrawal could reveal confidential customer information that competing insurers in the collaboration could use to their advantage.
In short, to safeguard privacy in the context of machine learning, we must prevent individual contributing clients from being traced back through the model. This becomes particularly crucial when the number of training instances for the model is not very large. Our findings are therefore of special interest to institutions such as hospitals or insurers that want to benefit from generalized prediction models but experience high customer fluctuation and at the same time are bound by strict privacy requirements.
Federated learning — some details
We consider a federated learning setting, where a trusted curator collects parameters optimized in decentralized fashion by multiple clients whose data is typically non-iid, unbalanced and massively distributed. The resulting model is then distributed back to all clients, ultimately converging to a joint representative model without the clients having to explicitly share the data.
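The round structure described above can be sketched in a few lines. This is only a minimal illustration, not the setup used in the study: the function names, the least-squares toy task, the learning rate and the equal weighting of clients are all assumptions made for this example.

```python
import numpy as np

def local_update(weights, data, lr=0.1):
    """Hypothetical client step: one gradient step on a least-squares loss."""
    X, y = data
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(central_weights, client_datasets):
    """One communication round: clients optimize locally, the curator averages.

    Only parameters travel between clients and curator; raw data stays local.
    Assumption: all clients are weighted equally in the average.
    """
    client_weights = [local_update(central_weights, d) for d in client_datasets]
    return np.mean(client_weights, axis=0)

# Toy simulation: 5 clients, each holding 20 private samples of the same task.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(5):
    X = rng.normal(size=(20, 2))
    clients.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(200):
    w = federated_round(w, clients)  # the new central model goes back to all clients
```

In this toy run the central model approaches the shared solution even though no client ever sends its samples to the curator.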
With every new communication round and allocation of a new central model, information about clients’ data leaks. Consequently, leaked information, and thus privacy loss, accumulates over the course of training. Although the leakage in a single round might be very small, a machine learning model is typically trained over many rounds, which means that such privacy leakage could add up significantly.
In this setting, the communication between curator and clients might be limited and/or vulnerable to interception, which is why federated learning aims at determining a model with minimal information overhead between clients and curator. However, despite achieving this minimized overhead, the protocol is still vulnerable to differential attacks, which could originate from any party contributing during the federated learning process. In such an attack, a client’s contribution during training as well as information about their data set can be revealed through the analysis of distributed parameters.
Considering this problem, we propose an algorithm for federated learning that preserves differential privacy at the client level. The aim is to hide clients’ contributions during training, balancing the trade-off between privacy loss and model performance. The results of our first feasibility study suggest that with an increasing number of participating clients, our proposed procedure can further optimize client-level differential privacy.
What makes machine learning algorithms so attractive is that they derive their prediction model by inferring patterns from data without being explicitly programmed. As a result, these algorithms rely heavily on the information encoded in the data, which creates the need to equip them with certain properties in order to safeguard privacy.
This is where the definition of differential privacy comes into play. It can be seen as a sensitivity measure with respect to changes in the data. Specifically, it gives a guarantee about the limits of the effect that the presence or absence of an individual data item may have on the final output of the algorithm. Intuitively, a machine learning approach that is differentially private will not significantly change its predictive behavior if an item is removed from the training set. Referring to the earlier example, this would mean that all contributing insurers and hospitals could still count on the high performance and accuracy of the universal model, even if one of the hospitals withholds or removes information about a certain patient.
In the proposed approach, we seek to take differential privacy to a new level by considering data beyond a single data item, thereby tightening the sensitivity. We aim to ensure that removing a client with all its data items does not significantly affect the outcome of the algorithm. In our example this means that if a hospital with a large number of patients decides to stop contributing to the training of the central model, it won’t harm the work of the other participating institutions.
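For reference, the intuition above corresponds to the standard textbook definition of (ε, δ)-differential privacy. The notation is not taken from this post: M denotes a randomized mechanism, d and d' two adjacent data sets, and S any set of possible outputs.

```latex
\Pr\big[\mathcal{M}(d) \in S\big] \;\le\; e^{\epsilon} \, \Pr\big[\mathcal{M}(d') \in S\big] + \delta
```

In the item-level reading, d and d' differ in a single data item. In the client-level reading pursued here, they differ in all data items belonging to one client. Smaller ε and δ mean that the two output distributions are harder to tell apart.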
Connecting the dots — differential privacy preserving federated learning
To protect the federated learning protocol against possible differential attacks, a so-called privacy accountant keeps track of the incurred privacy loss and stops training once a defined threshold is reached.
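A minimal sketch of such an accountant is shown below. It assumes that every communication round satisfies (eps_round, delta_round)-differential privacy and adds up the losses with basic sequential composition, which is a valid but loose upper bound. Tighter accounting methods exist, and the class name, method names and budget values here are hypothetical.

```python
class BasicPrivacyAccountant:
    """Tracks accumulated privacy loss using basic sequential composition.

    Assumption: k rounds that are each (eps, delta)-DP are together
    (k * eps, k * delta)-DP. This is a simple upper bound, not a tight one.
    """

    def __init__(self, eps_budget, delta_budget):
        self.eps_budget = eps_budget
        self.delta_budget = delta_budget
        self.eps_spent = 0.0
        self.delta_spent = 0.0

    def can_spend(self, eps_round, delta_round):
        """True if one more round stays within the defined threshold."""
        return (self.eps_spent + eps_round <= self.eps_budget
                and self.delta_spent + delta_round <= self.delta_budget)

    def spend(self, eps_round, delta_round):
        """Record the privacy loss incurred by one communication round."""
        self.eps_spent += eps_round
        self.delta_spent += delta_round


accountant = BasicPrivacyAccountant(eps_budget=8.0, delta_budget=1e-3)
rounds = 0
while accountant.can_spend(eps_round=0.5, delta_round=1e-5):
    # ... one communication round of training would run here ...
    accountant.spend(eps_round=0.5, delta_round=1e-5)
    rounds += 1
# Training stops once another round would exceed the budget.
```

With these illustrative numbers the epsilon budget is the binding constraint, so training halts after 16 rounds.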
In this context, we propose to apply a randomized mechanism, which consists of two steps: At the beginning of each communication round, a random subset of clients is chosen to contribute. Only these clients receive the central model and share their updates. Then, a Gaussian mechanism is used to distort the average of updates before allocating the new central model. This is done to hide a single client’s contribution within the aggregation and thus within the entire decentralized learning procedure.
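The two steps can be sketched as follows. This is a simplified illustration rather than the exact procedure of the study. In particular, clipping each update to a norm bound `clip_norm` is an assumption added here because a Gaussian mechanism needs bounded sensitivity, and `local_update`, `toy_local_update`, `sigma` and all numeric values are hypothetical.

```python
import numpy as np

def dp_federated_round(central, local_update, num_clients, m, clip_norm, sigma, rng):
    """One communication round with client sub-sampling and Gaussian noise.

    local_update(k, central) is a hypothetical callable returning client k's
    update vector (its locally optimized model minus the central model).
    """
    # Step 1: random sub-sampling. Only the chosen clients receive the model.
    chosen = rng.choice(num_clients, size=m, replace=False)
    updates = np.stack([local_update(k, central) for k in chosen])

    # Assumption: scale each update down so its L2 norm is at most clip_norm,
    # which bounds the influence of any single client on the sum.
    norms = np.linalg.norm(updates, axis=1, keepdims=True)
    clipped = updates / np.maximum(1.0, norms / clip_norm)

    # Step 2: Gaussian mechanism. Distort the sum of updates, then average.
    noise = rng.normal(0.0, sigma * clip_norm, size=central.shape)
    return central + (clipped.sum(axis=0) + noise) / m


# Toy usage: each client pulls the model towards its own private target.
rng = np.random.default_rng(0)
targets = rng.normal(size=(100, 3))

def toy_local_update(k, central):
    return targets[k] - central

central = np.zeros(3)
for _ in range(10):
    central = dp_federated_round(central, toy_local_update, num_clients=100,
                                 m=30, clip_norm=1.0, sigma=1.0, rng=rng)
```

Because the noise is added once to the sum and then divided by `m`, its effect on the averaged update shrinks as more clients take part in a round.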
Figure 2 illustrates a communication round adopting the proposed approach. In this optimized federated learning setting, a random client stops contributing during the communication round while the other clients continue updating the model. However, the withdrawal of one contributor neither reveals data nor harms the performance of the model.
The experimental setup
We simulate a decentralized setting to test our proposed algorithm. Our choice to train an image classifier allows us to benchmark the protocol against state-of-the-art techniques in centralized learning. The federated, non-iid setup ensures that each client only gets a limited number of samples, and that the samples of each client belong to only a fraction of the overall classes. In such a setup, a single client would never be able to train a model capturing all classes from its individual data alone. We set two requirements for the differentially private federated learning process:
- Enable clients to jointly learn a model that reaches high classification accuracy
- During learning, hide what data an individual client is holding to preserve privacy
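One common way to build the kind of non-iid split described above is to sort the samples by label, cut them into shards, and hand each client a few shards. The sketch below uses synthetic labels, and the function name, the number of clients and the shard count are assumptions for illustration rather than the configuration of the study.

```python
import numpy as np

def partition_non_iid(labels, num_clients, shards_per_client, rng):
    """Return one index array per client, each covering only a few classes."""
    order = np.argsort(labels, kind="stable")       # group samples by class
    num_shards = num_clients * shards_per_client
    shards = np.array_split(order, num_shards)      # contiguous label shards
    shard_ids = rng.permutation(num_shards)         # deal shards out at random
    return [
        np.concatenate([
            shards[s]
            for s in shard_ids[c * shards_per_client:(c + 1) * shards_per_client]
        ])
        for c in range(num_clients)
    ]


rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)  # 1000 synthetic labels, 10 classes
parts = partition_non_iid(labels, num_clients=50, shards_per_client=2, rng=rng)

# Number of distinct classes each client holds.
classes_per_client = [len(np.unique(labels[idx])) for idx in parts]
```

With these numbers every shard contains a single class, so each client sees at most two of the ten classes and could not learn a full classifier on its own.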
Ultimately, our work puts forward two contributions. First, we demonstrate that when a sufficient number of parties is involved, our algorithm achieves high model accuracy comparable to that of a centralized learning setup. At the same time, our proposed model remains differentially private on the client level. Although other studies show similar results, our experimental setup differs due to its distinct integration of element-level privacy measures. Second, we suggest a dynamic adaptation of the differential-privacy-preserving mechanism during the decentralized learning process to further increase the model performance. While this amends recent results on applying differential privacy in centralized settings, we argue that in a federated learning setting gradients display different sensitivities to noise and batch size.
In general, our findings are applicable to diverse industries. Someday, the study’s approach might enable companies to jointly learn prediction models or, as in our example, help multiple hospitals to train diagnostic models. The proposed algorithm would allow these diverse actors to benefit from a universal model learned with data from many peer contributors without the need of centralizing data or taking the risk of exposing private information.
We presented our advances in privacy protection in decentralized learning at the NIPS 2017 workshop Machine Learning on the Phone and other Consumer Devices.
For more details about our work please refer to the original study: https://arxiv.org/abs/1712.07557
Differentially Private Federated Learning: A Client Level Perspective was originally published in SAP Leonardo Machine Learning Research on Medium.