July 22, 2020
We present the first systematic study of the performance of all the main private prediction techniques in realistic machine learning (ML) scenarios. This study is meant to help solve a common challenge in using private data to train ML models. When a model is trained on private data but its predictions are made public, there is a risk that information about the private training data could leak. For example, someone using a hotel recommendation system could, in theory, extract someone else’s hotel reservation information. Several private prediction techniques have been developed over the past decade to address this problem. Until now, the study of these techniques has been largely theoretical, and little is known about their relative performance in practice. Our study sheds light on which methods may be well suited to which learning setting.
Techniques for private prediction are based on a definition of privacy called differential privacy, which states that the output of an algorithm should look nearly the same regardless of whether a specific data point was included in its input. Even if you knew all the input data except one specific data point, you shouldn’t be able to infer much about that data point. Differential privacy is a very strong privacy guarantee, and is widely adopted in academia and industry (for example, in Facebook’s election data release).
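For readers who prefer a formula, the idea above is usually written as follows. This is the standard textbook statement of ε-differential privacy, not necessarily the exact variant used in the study; the symbols (a randomized algorithm M, datasets D and D' that differ in a single data point, a set of possible outputs S, and a privacy parameter ε) are notation introduced here for illustration.

```latex
% Standard epsilon-differential privacy (notation assumed, not taken from the study):
% M is a randomized algorithm, D and D' differ in one data point,
% and S is any set of possible outputs.
\[
  \Pr\bigl[\mathcal{M}(D) \in S\bigr]
  \;\le\;
  e^{\epsilon} \, \Pr\bigl[\mathcal{M}(D') \in S\bigr]
  \qquad \text{for all such } D, D' \text{ and all } S .
\]
```

A smaller ε forces the two output distributions closer together, which means a stronger privacy guarantee and, typically, more randomness in the algorithm.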
Typically, differentially private prediction is achieved by introducing randomness somewhere in the prediction algorithm. For example, one can add random noise to the model parameters or the model predictions, randomly alter the loss function that is used to train the model, add noise to the parameter updates that are performed during training, or train multiple models on disjoint training sets and have the resulting models vote in a noisy way. For any of these approaches, it is possible to prove that the model predictions will be differentially private. But which approach is best suited for a practical learning scenario?
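To make the last of these approaches concrete, here is a minimal sketch of noisy voting over models trained on disjoint subsets of the data. It is an illustration under stated assumptions, not the code released with the study: the nearest-centroid "model", the function names, and the Laplace noise scale of 2/epsilon (a common choice when one training example can change at most one model's vote) are all choices made here for the example.

```python
import numpy as np


def train_centroid_model(X, y, num_classes):
    """Toy stand-in for a real model: one mean vector per class."""
    # Classes absent from this shard keep an infinite centroid,
    # so they are never the nearest one at prediction time.
    centroids = np.full((num_classes, X.shape[1]), np.inf)
    for c in range(num_classes):
        members = X[y == c]
        if len(members) > 0:
            centroids[c] = members.mean(axis=0)
    return centroids


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))


def train_disjoint_models(X, y, num_models, num_classes, rng):
    """Shuffle the data, split it into disjoint shards, train one model per shard."""
    order = rng.permutation(len(X))
    shards = np.array_split(order, num_models)
    return [train_centroid_model(X[idx], y[idx], num_classes) for idx in shards]


def noisy_vote_predict(models, x, num_classes, epsilon, rng):
    """Count each model's vote, add Laplace noise to the counts, return the winner."""
    votes = np.zeros(num_classes)
    for model in models:
        votes[predict(model, x)] += 1
    # Assumed noise scale: 2 / epsilon per released prediction.
    noisy_votes = votes + rng.laplace(loc=0.0, scale=2.0 / epsilon, size=num_classes)
    return int(np.argmax(noisy_votes))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two well-separated synthetic clusters, labeled 0 and 1.
    X = np.vstack([rng.normal(-5.0, 1.0, size=(100, 2)),
                   rng.normal(5.0, 1.0, size=(100, 2))])
    y = np.array([0] * 100 + [1] * 100)
    models = train_disjoint_models(X, y, num_models=10, num_classes=2, rng=rng)
    query = np.array([5.0, 5.0])
    print(noisy_vote_predict(models, query, num_classes=2, epsilon=1.0, rng=rng))
```

Because each training example influences only one of the models, removing it can change at most one vote, which is what makes noise on the vote counts sufficient. The trade-off is that each model sees only a fraction of the data, and every answered query spends some of the privacy budget.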
Our study provides insight into this question. For example, we find that in most practical scenarios it is better to introduce noise in the model parameters than in the model prediction. The study makes visible a variety of trade-offs between accuracy, privacy, choice of private prediction method, and hyperparameters of that method.
We use ML models across our family of products, and it’s important for us to understand the privacy implications of these models. This research is part of a larger research effort at Facebook AI around privacy-preserving and responsible AI.
To make it easier for other researchers to study private prediction techniques, we are open-sourcing code that implements all the private prediction methods that we studied and that can be used to reproduce the results of our experiments.