Reimagining database querying on unstructured data

August 26, 2021

What the research is:

By organizing information, databases are an essential component of nearly every computer program and online service. But the rigid structure of conventional database systems also constrains how they can be used. These systems require preset schemas and can answer only queries with well-defined semantics written in SQL (Structured Query Language). Queries must be exact to return correct information, and the data must be stored in a way that complies with the schema, which makes it difficult to take advantage of the abundance of available unstructured data.
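To make the schema constraint concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and rows are illustrative (borrowed from the example facts later in the post); the point is that the schema must be declared before any data is stored, and queries must match it exactly:

```python
import sqlite3

# A conventional database needs its schema declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, birth_year INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Sarah", 1982, "Chicago"), ("Teuvo", 1912, None)],
)

# Queries must reference the schema exactly; free-text sentences like
# "Teuvo was born in 1912 in Russia" cannot be stored or queried as-is.
rows = conn.execute(
    "SELECT name FROM people WHERE birth_year < 1980"
).fetchall()
print(rows)  # [('Teuvo',)]
```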

Facebook AI has developed a new approach called neural databases, which enables machines to search unstructured data — which might range from vast collections of text to recordings of songs — similar to how traditional systems can search a typical structured database. With neural databases, it might one day be possible to run a complex query such as “What is the third-longest entry about a Russian-born novelist?” directly on Wikipedia, for example.

Neural databases bridge an important gap between the fields of databases and NLP. Significant progress has been made in using natural language queries on standard structured data. This lets people pose ad hoc queries such as “How many teams won away games by more than three points?” But these existing systems can’t query a collection of information that isn’t organized into a structured database. Conversely, machine learning models can provide powerful predictions for tasks whose semantics are vague and that involve data that does not fit into a predefined schema (for example, “Does this post contain hate speech?”). However, they do not have the benefits of composition that databases possess, so it is difficult or impossible to extend them to closely related but unseen predictions, such as “What percentage of reviews are positive for horror movies released in the 1970s?” or even “How many directors under the age of 30 released positively reviewed horror movies in the 1970s?”
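A toy sketch of what that composition buys: if the sentiment predicate below were a learned model rather than a stub, ordinary filter-and-aggregate operations would extend it to queries like the one in the text without retraining. All data and function names here are hypothetical:

```python
# Hypothetical review records.
reviews = [
    {"movie": "A", "genre": "horror", "year": 1974, "text": "loved it"},
    {"movie": "B", "genre": "horror", "year": 1978, "text": "terrible"},
    {"movie": "C", "genre": "drama",  "year": 1975, "text": "loved it"},
]

def is_positive(text: str) -> bool:
    # Stand-in for a learned sentiment model.
    return "loved" in text

# "What percentage of reviews are positive for horror movies
#  released in the 1970s?" -- the learned predicate composes with
# ordinary filtering and aggregation.
horror_70s = [r for r in reviews
              if r["genre"] == "horror" and 1970 <= r["year"] < 1980]
pct = 100 * sum(is_positive(r["text"]) for r in horror_70s) / len(horror_70s)
print(pct)  # 50.0
```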

Given the incredible amount of data that exists outside of traditional databases — whether in sites like Wikipedia or public posts on social media — teaching machines to perform these sorts of complex data queries could be useful in a very wide range of applications.

Significant additional work is needed to deploy systems with these capabilities. But we hope that by sharing our work on neural databases we will help the AI research community achieve this important goal.

Examples of queries and answers from a neural database. (8 of 500 facts shown)

Facts:

  • Nicholas lives in Washington D.C. with his wife.

  • Sheryl is Nicholas’ wife.

  • Teuvo was born in 1912 in Russia.

  • Sheryl’s mother gave birth to her in 1978.

  • Nicholas is a doctor.

  • Sarah was born in Chicago in 1982.

  • Sarah married John in 2010.

  • Sarah works in a hospital in NY as a doctor.

Queries:

  • List everyone born before 1980.
    (Set) → Sheryl, Teuvo, ...

  • Whose spouse is a doctor?
    (Join) → Sheryl, John, ...

  • Who is the oldest person?
    (Max) → Teuvo
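To pin down the intended semantics, the same queries can be answered by conventional code once the facts have been structured by hand. The point of a neural database is to reach these answers directly from the raw sentences, without this manual structuring step:

```python
# Hand-structured versions of the example facts above.
birth_year = {"Teuvo": 1912, "Sheryl": 1978, "Sarah": 1982}
spouse = {"Sheryl": "Nicholas", "Nicholas": "Sheryl",
          "Sarah": "John", "John": "Sarah"}
doctors = {"Nicholas", "Sarah"}

# "List everyone born before 1980." (Set)
born_before_1980 = {p for p, y in birth_year.items() if y < 1980}
print(born_before_1980)  # {'Sheryl', 'Teuvo'}

# "Whose spouse is a doctor?" (Join over spouse and profession facts)
spouse_is_doctor = {p for p, s in spouse.items() if s in doctors}
print(spouse_is_doctor)  # {'Sheryl', 'John'}

# "Who is the oldest person?" (Max-style aggregation over birth years)
oldest = min(birth_year, key=birth_year.get)
print(oldest)  # Teuvo
```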

How it works:

We began our work on neural databases with the observation that neural models, especially Transformer-based models, have shown impressive performance gains in answering queries from natural language text. However, existing systems are unable to support database-style queries such as “List/count all female athletes who were born in the 20th century,” which require reasoning over sets of relevant facts using operations such as joins, filtering, and aggregation.

It is well known that Transformer models do not scale well to large inputs because of the cost of self-attention. We found that mechanisms that mitigate this scaling issue, such as Fusion-in-Decoder (FiD; Izacard and Grave, 2020) and Longformer (Beltagy et al., 2020), degrade answer quality on these queries. Our approach overcomes these issues by generating intermediate, query-based derivations from small sets of facts in the database and then using conventional computation to aggregate the results.

To address the aforementioned challenges, we proposed an instance of a neural database architecture that operates over textual facts with parallelizable nonblocking operators before aggregating the results. The three core components of the architecture, shown in the figure below, are a support set generator (SSG), which retrieves small sets of relevant facts called support sets; a parallelizable, nonblocking neural select-project-join (NSPJ) operator that generates intermediate answers that can be unioned to produce the final answer; and an optional aggregation stage, which uses conventional computation to perform numerical reasoning. The key insight underlying our architecture is to leverage neural models for what they do best — namely, reasoning over a small set of facts.
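A schematic of this data flow, with hypothetical stubs standing in for the neural components (in the actual system, the SSG and NSPJ stages are Transformer models; here they are simple pattern-matching functions so the pipeline is runnable):

```python
facts = ["Teuvo was born in 1912 in Russia.",
         "Sheryl's mother gave birth to her in 1978.",
         "Sarah was born in Chicago in 1982."]

def support_set_generator(query, facts):
    # SSG: retrieve small sets of facts relevant to the query.
    # Stubbed: one singleton support set per fact mentioning a year.
    return [[f] for f in facts if "19" in f]

def nspj(query, support_set):
    # NSPJ: produce an intermediate answer from each small support set.
    # Stubbed: extract the person's name when the year is before 1980.
    answers = []
    for fact in support_set:
        year = int(next(t for t in fact.replace(".", " ").split() if t.isdigit()))
        if year < 1980:
            answers.append(fact.split()[0].rstrip("'s"))  # "Sheryl's" -> "Sheryl"
    return answers

query = "Count everyone born before 1980."
support_sets = support_set_generator(query, facts)
# Union the per-support-set intermediate answers (nonblocking, parallelizable).
intermediate = [a for s in support_sets for a in nspj(query, s)]
# Optional aggregation via conventional computation.
count = len(set(intermediate))
print(count)  # 2
```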

Overview of the proposed architecture, consisting of a support set generator, NSPJ, and aggregation.

Our results indicate that our method can scale; it can reason over many sets of facts as the number of relevant support sets increases (see image below).

Even when provided with the correct contexts, baseline scores decrease for queries requiring the combination of multiple support sets.

The proposed neural database architecture also retains a higher accuracy in comparison with existing approaches when the size of the database increases to more than 500 facts (see image below).

Our method retains a higher accuracy when scaling to larger databases with a model trained using 25 facts and tested on larger databases.

Why it matters:

Neural databases may one day enable people to query any data that is available online but is not stored in a traditional database. Such data can include text, images, and other modalities. Because so much of the world’s information exists outside of traditional databases, neural database technology could one day be used to access this data for specialized research, everyday tasks, and much more.

Get the data and code repository here
Read the full paper: 'Neural Databases'
Read the full paper: 'Database reasoning over text'

Written By

Marzieh Saeidi

Research Scientist

Majid Yazdani

Applied Scientist
