
Building systems to securely reason over private data

March 15, 2022

What the research is:

People today rely on AI systems such as assistants and chatbots to help with countless tasks, from answering questions about the weather to scheduling a meeting at work. For systems to execute these tasks, users must provide them with relevant information — such as one’s location or work calendar. In some cases, however, people would prefer to keep information private, which means not uploading it to cloud-based AI systems or sharing it with others. Today’s reasoning systems are not built with this in mind. In particular, today’s retrieval-based systems — systems that reason by retrieving information from knowledge bases — aren’t designed to take the privacy of their underlying data into consideration.

To address this limitation and spur research in this and related areas, Meta AI is releasing ConcurrentQA, the first public data set for studying information retrieval and question answering (QA) over data from multiple privacy scopes. Alongside the data set and problem exploration, we have developed Public-Private Autoregressive Information Retrieval (PAIR), a new methodology that serves as a starting point for thinking about privacy in retrieval-based settings.

ConcurrentQA contains questions that might require both public and private information to answer, such as “With my GPA and SAT score, which universities should I apply to in the United States?” PAIR offers a guide for designing systems that can answer these questions without revealing one’s grades or SAT scores to a QA system — a way to reason about building systems that retrieve information from public and private sources without compromising the confidentiality of the private information.

How it works:

QA and reasoning over data with multiple privacy scopes remain underexplored in natural language processing, in large part because of a lack of public domain data sets and benchmarks for studying and comparing approaches. ConcurrentQA contains about 18,000 question-answer pairs. Each question is paired with two supporting passages, drawn from Wikipedia and/or the publicly available and widely used Enron Email Dataset. Each passage contains information that must be understood as one logical step toward answering the question — these steps are called reasoning hops.
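
To make this concrete, the sketch below shows what a single ConcurrentQA example might look like once loaded in Python. The field names and structure are illustrative assumptions made for this post, not the data set’s actual schema; the paper and GitHub repository document the real format.

    # Illustrative sketch only: these field names are assumptions for
    # this post, not ConcurrentQA's actual schema.
    example = {
        "question": (
            "With my GPA and SAT score, which universities "
            "should I apply to in the United States?"
        ),
        "answer": "...",  # the gold answer string
        # Two supporting passages, one per reasoning hop; each comes
        # from the public scope (Wikipedia) or the private scope (the
        # Enron Email Dataset standing in for private data).
        "supporting_passages": [
            {"scope": "private", "source": "email", "text": "..."},
            {"scope": "public", "source": "wikipedia", "text": "..."},
        ],
    }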

For example, to answer the question above about which universities to apply to, an AI system must make one hop to retrieve the asker’s GPA and SAT score and another hop to retrieve information about admissions practices at universities in the United States. This type of multi-hop reasoning has been well studied when the information needing to be retrieved can be found in a single privacy scope, such as a public domain knowledge base (e.g., Wikipedia), but not when some of that information is private, as is the case for a person’s grades and test scores.
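
The snippet below is a minimal sketch of this standard, privacy-unaware setup: a loop that retrieves passages from a single combined corpus and conditions each hop’s query on the previous hop’s results. The toy word-overlap scorer and all names are placeholders standing in for a real dense retriever.

    # A minimal, privacy-unaware multi-hop retriever: every hop queries
    # one combined corpus. All names are illustrative placeholders, and
    # the word-overlap scorer stands in for a real dense retriever.
    def score(query, passage):
        q = set(query.lower().split())
        p = set(passage["text"].lower().split())
        return len(q & p)

    def retrieve(query, corpus, k=2):
        return sorted(corpus, key=lambda p: score(query, p), reverse=True)[:k]

    def multi_hop_retrieve(question, corpus, num_hops=2, k=2):
        query, retrieved = question, []
        for _ in range(num_hops):
            hits = retrieve(query, corpus, k)
            retrieved.extend(hits)
            # Condition the next hop's query on what was just retrieved.
            query = question + " " + " ".join(p["text"] for p in hits)
        return retrieved

Because there is only one corpus, a query built from privately retrieved text goes right back into the same index, which may be public and communal; that is exactly the leak PAIR is designed to prevent.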

As a starting point for studying privacy-aware reasoning with ConcurrentQA, we use the PAIR framework to define the design space of possible multi-hop QA models. AI systems based on PAIR should satisfy the following properties:

  • Data is stored in at least two separate enclaves. In such a system, public and private information are kept apart so that private actors and retrievers can access public data, but public actors can never access private data.

  • Public data retrievals precede private data retrievals. Systems designed with PAIR make their retrievals in sequence, so that data retrieved from private knowledge bases never leaks into queries sent to public retrievers, which may be communal (as sketched in the code below).

This example shows PAIR’s method for multi-hop information retrieval.
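
Below is a minimal code sketch of this ordering constraint, reusing the toy retrieve() from the earlier snippet. It is an illustration under the assumptions already stated, not the system evaluated in the paper; the key invariant is that a query conditioned on privately retrieved text is never sent to the public retriever.

    # A minimal sketch of PAIR's ordering constraint, reusing the toy
    # retrieve() above. Invariant: once a query has been conditioned on
    # privately retrieved text, it never reaches the public retriever.
    def pair_multi_hop_retrieve(question, public_corpus, private_corpus, k=2):
        # Hop 1: query the public enclave first. (We assume here that
        # the question itself is safe to send publicly, a simplification.)
        public_hits = retrieve(question, public_corpus, k)

        # Hop 2: query the private enclave, conditioning on the question
        # plus the public results. The private retriever may see public
        # data; the reverse direction is forbidden.
        private_query = question + " " + " ".join(p["text"] for p in public_hits)
        private_hits = retrieve(private_query, private_corpus, k)

        # Any later hops that depend on private_hits must also stay
        # inside the private enclave.
        return public_hits + private_hits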

PAIR itself provides a starting point that researchers can use to design their own systems that can reason over public and private data with good privacy controls. We hope that researchers will use ConcurrentQA to create new reasoning systems that both extend PAIR and explore other approaches.

Why it matters:

Technology is tightly woven into the daily lives of people around the world — a trend that will only accelerate as we build for the metaverse. It is important that AI systems be able to perform useful tasks for people while also keeping personal information private. Privacy and security are a core part of Meta AI’s approach to responsible AI. As part of this effort, we’ve built and shared tools such as CrypTen, which allows AI researchers to more easily experiment with secure computing techniques, and Opacus, a library for training models with differential privacy.

The AI research community is only beginning to develop effective approaches to privacy-preserving question answering. The release of ConcurrentQA and PAIR as a starting point aims to accelerate this research; there is much more work to do, both for us and for our colleagues outside Meta. We hope that new data sets like ConcurrentQA will help AI researchers and practitioners build systems that better protect people’s privacy. Absent further research in this area, it will be more difficult for the industry to develop AI systems that can retrieve information while also preserving people’s privacy.

We are aware of the biases and limitations of the Enron Email Dataset. ConcurrentQA and PAIR can be improved by using other, more representative data sets and by finding additional ways to mitigate bias. We also note that experts in government and academia took measures to redact personal information from the Enron Email Dataset and to address privacy concerns before the data set was released in its current form for use by AI researchers.

There are many important challenges ahead, but improving performance on the ConcurrentQA benchmark with PAIR or other new privacy frameworks will be critical. We hope the release of our data set will spur interest and innovation in how we model and reason about private information, and we hope that the PAIR framework and our straightforward baselines set a high bar for future work.

The work discussed in this blog post was completed by Simran Arora, Patrick Lewis, Angela Fan, Jacob Kahn, and Chris Ré.

Read the paper
Get it on GitHub

Written By

Jacob Kahn

Research Engineer