
Differential Privacy

Introduction

In our framework, differential privacy aims to solve the following problem: how can we train a model that performs well on the population while protecting privacy at the individual level?

The basic idea of differential privacy is to add noise during model training, so that the models trained on two datasets that differ by a single record are almost indistinguishable.

../images/dp_orig_algos.png

../images/dp_preserving_algos.png

As shown in the left figure above, the algorithm maps the dataset into the parameter space. Because of randomness in training (such as the random initialization of the neural network), the learned parameters follow a probability distribution. Normally, if two datasets are different, the probability distributions of their learned models may be distinguishable (left figure). This means that we can spot the difference between two datasets through the difference in the learned model parameters. Consider an extreme case where dataset D1 is identical to dataset D2 except for one additional record. Then, by inspecting the difference in the output models, one can determine whether this personal record is in the dataset. After introducing noise via differential privacy, the distributions of the model parameters become indistinguishable (right figure). It is then hard to tell whether this specific record is in the dataset.
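
To make this concrete, here is a minimal, self-contained sketch in plain NumPy (not part of the framework; the dataset values and noise scale are arbitrary illustrative choices). Releasing the exact mean reveals whether the extra record is present, while releasing a sufficiently noisy mean hides it:

# Illustrative sketch only: noise makes the outputs on two
# neighboring datasets hard to distinguish.
import numpy as np

rng = np.random.default_rng(0)

d1 = np.array([4.0, 7.0, 5.0, 6.0])   # dataset D1
d2 = np.append(d1, 100.0)             # D2 = D1 plus one extra record

def release_mean(data, noise_scale=0.0):
    """Release the mean of the data, optionally perturbed with Gaussian noise."""
    return data.mean() + rng.normal(0.0, noise_scale)

# Without noise the two releases differ deterministically,
# so an observer can tell whether the extra record is present.
print(release_mean(d1), release_mean(d2))

# With noise whose scale is comparable to the gap, the two output
# distributions overlap heavily and the single record is hidden.
print(release_mean(d1, noise_scale=20.0), release_mean(d2, noise_scale=20.0))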

Mathematical Definition

We now introduce a mathematical definition to quantify the informal argument above. A common definition is (\epsilon, \delta)-differential privacy [Dwork2014]:

A randomized algorithm \mathcal{M} with domain \mathbb{N}^{|\mathcal{X}|} is (\epsilon, \delta)-differentially private if for all \mathcal{S} \subseteq Range(\mathcal{M}) and for all D_1, D_2 \in \mathbb{N}^{|\mathcal{X}|} such that ||D_1 - D_2||_1 \leq 1:

Pr[\mathcal{M}(D_1) \in \mathcal{S}] \leq \exp(\epsilon) Pr[\mathcal{M}(D_2) \in \mathcal{S}] + \delta

—Dwork and Roth

In the above definition, the domain \mathbb{N}^{|\mathcal{X}|} is the set of all possible datasets, and D_1 and D_2 are two datasets. Range(\mathcal{M}) is the set of all possible algorithm outputs, such as trained neural networks. The constraint ||D_1 - D_2||_1 \leq 1 means that the two datasets differ by at most one record.
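
As a numerical illustration of this inequality (not framework code), the sketch below applies the classic Laplace mechanism to a counting query on two neighboring datasets and estimates both probabilities by Monte Carlo. The query, the output set \mathcal{S}, and the value of \epsilon are arbitrary assumptions, and \delta = 0 here:

# Illustrative Monte Carlo check of the (epsilon, delta) inequality
# for the Laplace mechanism on a sensitivity-1 counting query.
import numpy as np

rng = np.random.default_rng(42)
epsilon = 0.5
n_samples = 1_000_000

# Neighboring datasets: their true counts differ by exactly one record.
count_d1, count_d2 = 10, 11

# Laplace mechanism: add Laplace(sensitivity / epsilon) noise to the count.
out_d1 = count_d1 + rng.laplace(scale=1.0 / epsilon, size=n_samples)
out_d2 = count_d2 + rng.laplace(scale=1.0 / epsilon, size=n_samples)

# Pick an (arbitrary) output set S = [10.5, +inf) and estimate the probabilities.
p1 = np.mean(out_d1 >= 10.5)
p2 = np.mean(out_d2 >= 10.5)

# The definition (with delta = 0) requires both directions of the bound to hold.
print(p1, np.exp(epsilon) * p2, p1 <= np.exp(epsilon) * p2)
print(p2, np.exp(epsilon) * p1, p2 <= np.exp(epsilon) * p1)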

Differential Privacy in Federated Learning

In our framework, we implement the algorithm "Noise before Model aggregation FL" (NbAFL [Wei2020]). In short, we add appropriate noise to both the uplink and downlink channels when performing FedAvg, so as to protect data privacy.
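
The sketch below is a much simplified illustration of this uplink/downlink noising idea in plain NumPy. The function names, clipping bound, and noise scales are illustrative assumptions rather than the framework's actual API, and the calibration of the noise scales to a target (\epsilon, \delta) derived in [Wei2020] is omitted:

# Simplified sketch: clip each client's weights, add Gaussian noise before
# upload (uplink), average on the server, then add noise again before
# broadcast (downlink). All names and constants here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def clip(weights, c):
    """Clip the weight vector to L2 norm at most c (bounds the sensitivity)."""
    norm = np.linalg.norm(weights)
    return weights if norm == 0 else weights * min(1.0, c / norm)

def client_upload(local_weights, c=1.0, sigma_up=0.1):
    """Uplink: clip the locally trained weights and perturb them with noise."""
    clipped = clip(local_weights, c)
    return clipped + rng.normal(0.0, sigma_up, size=clipped.shape)

def server_aggregate_and_broadcast(noisy_updates, sigma_down=0.05):
    """Downlink: FedAvg the noisy uploads, then add noise before broadcasting."""
    averaged = np.mean(noisy_updates, axis=0)
    return averaged + rng.normal(0.0, sigma_down, size=averaged.shape)

# Toy round with three clients, each holding locally trained weights.
local_models = [rng.normal(size=4) for _ in range(3)]
uploads = [client_upload(w) for w in local_models]
global_model = server_aggregate_and_broadcast(uploads)
print(global_model)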

[Dwork2014]
  C. Dwork and A. Roth, The Algorithmic Foundations of Differential Privacy, Foundations and Trends in Theoretical Computer Science, Vol. 9 (2014), 211-407
[Wei2020]
  K. Wei et al., Federated Learning with Differential Privacy: Algorithms and Performance Analysis, IEEE Transactions on Information Forensics and Security, Vol. 15 (2020), 3454-3469