July 03, 2020
TaBERT is the first model that has been pretrained to learn representations for both natural language sentences and tabular data. These sorts of representations are useful for natural language understanding tasks that involve joint reasoning over natural language sentences and tables. A representative example is semantic parsing over databases, where a natural language question (e.g., “Which country has the highest GDP?”) is mapped to a program executable over database (DB) tables.
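To make the semantic parsing task concrete, here is a minimal sketch of what "a program executable over DB tables" can look like for the example question. It uses Python's standard `sqlite3` module. The table name `countries`, its columns, the fictional country names, the GDP figures, and the hand-written SQL are all invented for illustration. In a real system a parser would predict the program rather than have it hard-coded.

```python
import sqlite3

# Hypothetical toy table: fictional countries and made-up GDP figures.
TOY_ROWS = [("Atlantis", 1.5), ("Borduria", 3.2), ("Freedonia", 2.1)]

QUESTION = "Which country has the highest GDP?"
# The program a semantic parser would be expected to produce for QUESTION.
# Here it is written by hand purely to show the input/output relationship.
PROGRAM = "SELECT name FROM countries ORDER BY gdp DESC LIMIT 1"


def run_query(rows, sql):
    """Load the rows into an in-memory SQLite table and execute the SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE countries (name TEXT, gdp REAL)")
    conn.executemany("INSERT INTO countries VALUES (?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


answer = run_query(TOY_ROWS, PROGRAM)
print(QUESTION, "->", answer)  # -> [('Borduria',)]
```

The hard part, and the part that representations like TaBERT's are meant to help with, is going from `QUESTION` to `PROGRAM` when the table schema and contents vary from one database to the next.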
This is the first pretraining approach across structured and unstructured domains, and it opens new possibilities regarding semantic parsing, where one of the key challenges has been understanding the structure of a DB table and how it aligns with a query.
TaBERT has been trained using a corpus of 26 million tables and their associated English sentences. Previous pretrained language models have typically been trained using only free-form natural language text. While these models are useful for tasks that require reasoning only over free-form natural language, they aren’t suitable for tasks like DB-based question answering, which requires reasoning over both free-form language and DB tables.
Experiments were performed on two widely used benchmarks for DB-based question answering: a classical supervised text-to-SQL task over structured data from the Spider dataset, and a weakly supervised parsing task over semi-structured tables from the WikiTableQuestions dataset. Weakly supervised learning is significantly more challenging than supervised learning because the parser does not have access to the labeled query and must explore a very large search space of queries.
Results show substantial improvements over the current state of the art on weakly supervised tasks and competitive performance on supervised tasks. Tests also demonstrate that pretraining for both table and language data is feasible and effective.
TaBERT is built on top of the BERT natural language processing (NLP) model and takes a combination of natural language queries and tables as input. By doing this, TaBERT can learn contextual representations for sentences as well as the elements of the DB table. These representations can be used downstream in other neural networks to create actual database commands. The training data for that task can then be used to further fine-tune TaBERT’s representations.
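One way to picture "a combination of natural language queries and tables as input" is to flatten the question and a table row into a single token sequence that a BERT-style encoder can read. The sketch below is only an assumption about what such a linearization might look like. The `column | value` layout and the helper name `linearize_row` are hypothetical, and the only detail taken from BERT itself is the use of its `[CLS]` and `[SEP]` special tokens.

```python
def linearize_row(utterance, header, row):
    """Flatten one question and one table row into a single string.

    Assumed layout (not confirmed by the post):
    [CLS] utterance [SEP] col1 | val1 [SEP] col2 | val2 [SEP]
    """
    cells = [f"{name} | {value}" for name, value in zip(header, row)]
    return "[CLS] " + utterance + " [SEP] " + " [SEP] ".join(cells) + " [SEP]"


# Toy example with made-up values.
example = linearize_row(
    "Which country has the highest GDP?",
    ["country", "gdp"],
    ["Borduria", "3.2"],
)
print(example)
```

Encoding the question and the row together in one sequence is what lets the encoder produce representations of table cells that are conditioned on the question, and vice versa.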
In training TaBERT, we used “content snapshots,” wherein the model encodes only the sections of a table that are most relevant to a query. Some database tables contain a large number of rows, which makes encoding them a computationally intense and inefficient process. Content snapshots allow TaBERT to deal with large tables by encoding only the subset of content that’s most relevant to the utterance.
For example, the utterance “In which city did Piotr’s last 1st place finish occur?” (example taken from the WikiTableQuestions dataset) may have an associated table containing data for year, venue, position, and event. A content snapshot will sample a subset of three rows. This subset will not cover all information in the table, but it is sufficient for the model to learn that, say, the venue column contains cities.
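The row selection described above can be sketched as follows. The post does not say how relevance is scored, so this example assumes a simple count of words shared between the utterance and a row's cells. The function names and the table rows are invented for illustration and are not taken from WikiTableQuestions.

```python
import re


def tokenize(text):
    """Lowercase a value and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", str(text).lower()))


def content_snapshot(utterance, rows, k=3):
    """Keep the k rows that share the most tokens with the utterance.

    Word overlap is an assumed relevance score; ties go to earlier rows,
    and the selected rows are returned in their original table order.
    """
    query_tokens = tokenize(utterance)

    def score(row):
        cell_tokens = set()
        for cell in row:
            cell_tokens |= tokenize(cell)
        return len(query_tokens & cell_tokens)

    ranked = sorted(range(len(rows)), key=lambda i: (-score(rows[i]), i))
    return [rows[i] for i in sorted(ranked[:k])]


HEADER = ("year", "venue", "position", "event")
# Made-up rows that only mimic the shape of the table described in the text.
TOY_ROWS = [
    ("2003", "Tampere", "3rd", "Junior Championship"),
    ("2005", "Erfurt", "1st", "U23 Championship"),
    ("2005", "Izmir", "1st", "Universiade"),
    ("2006", "Moscow", "2nd", "Indoor Championship"),
    ("2007", "Bangkok", "1st", "Universiade"),
]

snapshot = content_snapshot(
    "In which city did Piotr's last 1st place finish occur?", TOY_ROWS, k=3
)
for row in snapshot:
    print(dict(zip(HEADER, row)))
```

With this scoring, the three rows whose position is "1st" are kept, so the encoder sees three rows instead of five while still observing typical values for every column.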
To model the structure of tables, TaBERT uses a combination of classical horizontal self-attention, which captures the dependency between cells of individual rows, and vertical self-attention, which models the information flow across cells in different rows. The final outputs from the layers of horizontal and vertical self-attention are distributed representations of utterance tokens and table columns, which can be used in downstream semantic parsers to compute the database query.
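The difference between the two attention directions can be illustrated with a small, dependency-free sketch. It is a toy under several assumptions: there are no learned projection weights or multiple heads, each cell is already a fixed-size vector, utterance tokens are left out, and the final mean pooling over rows is a choice made here for illustration. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors):
    """Scaled dot-product self-attention with no learned weights."""
    dim = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)])
    return out


def horizontal_attention(table):
    """Attend across the cells within each row; table[r][c] is a cell vector."""
    return [attend(row) for row in table]


def vertical_attention(table):
    """Attend across the cells that share a column, over all rows."""
    n_rows, n_cols = len(table), len(table[0])
    cols = [attend([table[r][c] for r in range(n_rows)]) for c in range(n_cols)]
    return [[cols[c][r] for c in range(n_cols)] for r in range(n_rows)]


def column_representations(table):
    """Apply horizontal then vertical attention, then average over rows."""
    mixed = vertical_attention(horizontal_attention(table))
    n_rows, n_cols, dim = len(mixed), len(mixed[0]), len(mixed[0][0])
    return [
        [sum(mixed[r][c][d] for r in range(n_rows)) / n_rows for d in range(dim)]
        for c in range(n_cols)
    ]


# Toy table: 3 rows, 2 columns, 2-dimensional cell vectors.
toy_table = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(3)]
print(column_representations(toy_table))
```

Horizontal attention mixes information among cells of the same row, while vertical attention mixes information among cells of the same column, so each column ends up with one vector that summarizes what was seen in every sampled row.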
Improving NLP allows us to create better, more seamless human-to-machine interactions for tasks ranging from basic web searches to queries with AI assistants. Large-scale pretrained language models have played a major role in recent advancements in machines’ ability to understand and answer free-form natural language text. TaBERT builds upon this by more efficiently bridging the gap between natural language utterances and the structured databases that the resulting queries are executed on. This enables digital assistants, such as Portal from Facebook, to improve their accuracy in answering questions like “What’s the temperature in the afternoon?” and “What’s the population in the Pacific Northwest?” where the answer can be found in different databases or tables.
Someday, TaBERT could also be applied toward fact checking and verification applications. Third parties often check claims by relying on statistical data from existing knowledge bases. In the future, TaBERT could be used to map queries to relevant databases, thus not only verifying whether a claim is true, but also providing an explanation by referring to the relevant database.
In future research, we plan to evaluate TaBERT’s performance on other table-based reasoning tasks. We will also be exploring other strategies for linearizing table data, improving the quality of data used for pretraining, and designing novel pretraining objectives.
TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data