What is Differential Privacy?

Original article was published by Shaistha Fathima on Becoming Human: Artificial Intelligence Magazine

Introduction to DP with Randomized response example.

Summary: Sometimes knowing the basics is important! This beginner-friendly blog post gives a quick and easy introduction to Differential Privacy and is part of the “Differential Privacy Basics Series”. Thank you, Jacob Merryman, for the amazing graphic used in this blog post. For more posts like this on Differential Privacy, follow Shaistha Fathima and OpenMined on Twitter.

Differential Privacy Basics Series

Before we dive deeper into Differential Privacy (DP) and answer the 4 W’s and an H (What, Where, When, Why and How), a few important questions you must ask yourself are: What is privacy? Should we really care about it? How does it matter?

What is Privacy?

As per Wikipedia’s definition,

Privacy is the ability of an individual or group to seclude themselves or information about themselves and thereby express themselves selectively.

To put it simply, privacy is an individual’s right to withhold the data they deem private and to share only what they are comfortable sharing.

Coming to… Should we really care about it? How does it matter?

In this digital age, data privacy has always been a concern for some of us, to the extent that some people are paranoid enough to use made-up names instead of their own! Even if you are not that paranoid, and are comfortable sharing your real name with a stranger as part of your introduction, you might not feel comfortable sharing other details such as your birth date, the places you love to hang out, or your hobbies. This is where privacy comes in: it concerns the data YOU would want to keep PRIVATE!

Thus, Privacy can also be said to be the right to control how information about you is used, processed, stored, or shared.

For a better understanding of why privacy matters, or of privacy vs. security, I would recommend a short read of the blog posts below:

Coming to the main question…

What is Differential Privacy (DP)?

Wikipedia definition:

Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.

Note: Differential Privacy is not an algorithm but a system or framework for better data privacy!

One of the easiest examples for understanding DP in light of the above definition is the one given by Abhishek Bhowmick (Lead, ML Privacy, CORE ML | Apple) in his interview in Udacity’s Secure and Private AI course:

(Note: I will be using this course material as a reference throughout the post)

Suppose we want to know the average amount of money individuals hold in their pockets, to see whether they could buy an online course. Chances are, many would not want to reveal the exact amount! So, what do we do?

This is where DP comes in. Instead of asking for the exact amount, we ask each individual to add a random value (noise) in the range of -100 to +100 to the amount in their pocket and give us only the resulting sum. That is, if ‘X’ has $30 in their pocket and draws the random number -10, they report only the result, 30 + (-10) = $20. Their individual privacy is thus preserved.

To protect the privacy of the data collected from different individuals, we add noise (the random number in the above example) to the data to make it more private and secure! DP works by adding statistical noise to the data (either to the inputs or to the output).

But, this brings us to another question — How is it useful to us if all we get are some random numbers?

The answer to this is the Law of Large Numbers:

The law of large numbers, in probability and statistics, states that as a sample size grows, its mean gets closer to the average of the whole population.

When a sufficiently large number of individuals report their noisy sums and we average the collected values, the noise cancels out and the average obtained is close to the true average (the average of the data without the added random numbers). Thus, we now have data on the “average amount an individual holds in their pocket” while preserving each person’s privacy.
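To see the noise cancelling out, here is a minimal Python sketch of the pocket-money survey. The pocket amounts (uniform between $0 and $60), the seeds, and the function names are assumptions made up for illustration. It only demonstrates the averaging intuition; it is not a calibrated DP mechanism.

```python
import random

def noisy_average(true_amounts, noise_range=100, seed=0):
    """Average of the reported values, where each person first adds
    uniform noise in [-noise_range, +noise_range] to their true amount."""
    rng = random.Random(seed)
    reported = [x + rng.uniform(-noise_range, noise_range) for x in true_amounts]
    return sum(reported) / len(reported)

def simulate(n, seed=0):
    """Return (true average, noisy average) for n hypothetical respondents."""
    rng = random.Random(seed)
    # Hypothetical pocket money between $0 and $60 (an assumption).
    true_amounts = [rng.uniform(0, 60) for _ in range(n)]
    true_avg = sum(true_amounts) / n
    return true_avg, noisy_average(true_amounts, seed=seed + 1)

for n in (10, 1_000, 100_000):
    true_avg, noisy_avg = simulate(n)
    print(f"n={n:>7}  true avg={true_avg:6.2f}  noisy avg={noisy_avg:6.2f}  "
          f"error={abs(noisy_avg - true_avg):5.2f}")
```

With only a handful of respondents the noisy average can be far off, while with a very large number it should land close to the true average, even though no single reported value is trustworthy on its own.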

Key Takeaways:

  • The law of large numbers states that an observed sample average from a large sample will be close to the true population average and that it will get closer, the larger the sample.
  • The law of large numbers does not guarantee that a given sample, especially a small sample, will reflect the true population characteristics or that a sample which does not reflect the true population will be balanced by a subsequent sample.

For better understanding of the Law of Large Numbers, refer to the following:

Another way of looking at DP is the definition given by Cynthia Dwork and Aaron Roth in their book The Algorithmic Foundations of Differential Privacy:

Differential Privacy describes a promise, made by a data holder, or curator, to a data subject (owner), and the promise is like this: “You will not be affected adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, datasets or information sources are available”.

It sounds like a well-thought-out definition, but it may be more of a fantasy, since de-anonymization of datasets can happen!

This may lead to a question: how do we know whether the privacy of a person in the dataset is protected or not? Consider, for example, (1) a database of patients with their cancer status and other information, or (2) a database of people’s coin-flip (heads or tails) responses!

The key question for DP in such cases is: “If we remove a person from the database and the output of the query does not change, then that person’s privacy is fully protected.” To put it simply: when querying a database, if I remove someone from the database, would the output of the query be any different?

How do we check this? By creating a parallel database with one less entry (N-1) compared to the original database entries (N).

Let’s take a simple example of coin flips: if the first coin flip is heads, say Yes (1); if it’s tails, answer as per the second coin flip. So, our database will be made of 0’s and 1’s, i.e., a binary dataset. The easiest query that can be thought of for this binary dataset is the “sum query”, which adds up all the 1’s in the database and returns the result.

Assume D is the original database with N entries and D’ is the parallel database with N-1 entries. If, on running the sum query on each of them, sum(D) != sum(D’), it means the output of the query is conditioned directly on information from individuals in D! The query shows non-zero sensitivity, as the outputs are different.

Sensitivity is the maximum amount by which the output of a query changes when an individual is removed from the database.
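The parallel-database check can be sketched in a few lines of Python. The database D below and the helper names are hypothetical, chosen only for illustration. Note that this measures the change on one particular database, whereas sensitivity in general is the worst case over all possible databases.

```python
def sum_query(db):
    """The sum query: add up all the 1's in a binary database."""
    return sum(db)

def parallel_databases(db):
    """Yield every database D' obtained by removing exactly one entry from D."""
    for i in range(len(db)):
        yield db[:i] + db[i + 1:]

def empirical_sensitivity(query, db):
    """Largest change in the query output when any single entry is removed."""
    full = query(db)
    return max(abs(full - query(d_prime)) for d_prime in parallel_databases(db))

D = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # hypothetical binary database, N = 10

print(sum_query(D))                         # 6
print(empirical_sensitivity(sum_query, D))  # 1
```

For a sum over 0’s and 1’s, removing one person can change the result by at most 1, so the sensitivity of the sum query on a binary database is 1.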

A good practical explanation of the above is quoted below from the blog post “Privacy and machine learning: two unexpected allies?” by Nicolas Papernot and Ian Goodfellow:

Differential privacy is a framework for evaluating the guarantees provided by a mechanism that was designed to protect privacy. Invented by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith [DMNS06], it addresses a lot of the limitations of previous approaches like k-anonymity. The basic idea is to randomize part of the mechanism’s behavior to provide privacy. In our case, the mechanism considered is always a learning algorithm, but the differential privacy framework can be applied to study any algorithm.

The intuition for introducing randomness to a learning algorithm is to make it hard to tell which behavioral aspects of the model defined by the learned parameters came from randomness and which came from the training data. Without randomness, we would be able to ask questions like: “What parameters does the learning algorithm choose when we train it on this specific dataset?” With randomness in the learning algorithm, we instead ask questions like: “What is the probability that the learning algorithm will choose parameters in this set of possible parameters, when we train it on this specific dataset?”

We use a version of differential privacy which requires that the probability of learning any particular set of parameters stays roughly the same if we change a single training example in the training set. This could mean to add a training example, remove a training example, or change the values within one training example. The intuition is that if a single patient (Jane Smith) does not affect the outcome of learning, then that patient’s records cannot be memorized and her privacy is respected. We often refer to this probability as the privacy budget. Smaller privacy budgets correspond to stronger privacy guarantees.

In the above illustration, we achieve differential privacy when the adversary is not able to distinguish the answers produced by the randomized algorithm based on the data of two of the three users from the answers returned by the same algorithm based on the data of all three users.

A few good reads for understanding de-anonymization of datasets are as follows:

Always Remember — Differential privacy is not a property of databases, but a property of queries. The intuition behind differential privacy is that we bound how much the output can change if we change the data of a single individual in the database.

So, what ‘does’ DP promise?

Differential privacy promises to protect individuals from any additional harm that they might face due to their data being in the private database x that they would not have faced had their data not been part of x.

By promising a guarantee of ε-differential privacy, a data analyst can promise an individual that his expected future utility will not be harmed by more than an exp(ε) ≈ (1 + ε) factor. Note that this promise holds independently of the individual’s utility function u_i, and holds simultaneously for multiple individuals who may have completely different utility functions.
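The approximation exp(ε) ≈ 1 + ε is only tight when ε is small. The quick numeric check below shows how the two diverge as ε grows; the ε values and the helper name are arbitrary choices for illustration.

```python
import math

def approximation_gap(eps):
    """Difference between the exact factor exp(eps) and the approximation 1 + eps."""
    return math.exp(eps) - (1 + eps)

for eps in (0.01, 0.1, 0.5, 1.0, 2.0):
    print(f"eps={eps:<4}  exp(eps)={math.exp(eps):.4f}  "
          f"1+eps={1 + eps:.4f}  gap={approximation_gap(eps):.4f}")
```

For ε = 0.1 the two agree to within about 0.005, whereas for ε = 2 the exact factor is more than twice the approximation. This is one reason small values of ε are associated with stronger guarantees.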

What DP ‘does not’ promise?

  • Differential privacy does not guarantee that what one believes to be one’s secrets will remain secret. That is, it promises to keep released results differentially private, BUT it does not by itself protect the data from attackers! Ex: the differencing attack is one of the most common forms of privacy attack.
  • It merely ensures that one’s participation in a survey will not in itself be disclosed, nor will participation lead to disclosure of any specifics that one has contributed to the survey.

Do not confuse Security with Privacy. While Security controls “WHO” can access the data, Privacy is more about “WHEN” and “WHAT” they can access, i.e., “You can’t have privacy without security, but you can have security without privacy.”

Security VS privacy

For example, everyone is familiar with the term “Login Authentication and Authorization”, right? Here, authentication is about “Who” can access the data and thus comes under Security, whereas authorization is all about “What”, “When”, and how much of the data is accessible to that specific user, and thus comes under Privacy.

That’s all folks! In the next post of this series we will look into the Types of Differential Privacy — Local VS Global DP, with some real world examples.

Till then you may also check out my other beginner friendly Series:
