A new open source framework for automatic differentiation with graphs

October 8th, 2020

What the research is:

GTN is an open source framework for automatic differentiation with a powerful, expressive type of graph called weighted finite-state transducers (WFSTs). Just as PyTorch provides a framework for automatic differentiation with tensors, GTN provides such a framework for WFSTs. AI researchers and engineers can use GTN to more effectively train graph-based machine learning models.

The WFST data structure is commonly used to combine different sources of information in such applications as speech recognition, natural language processing, and handwriting recognition. A standard speech recognizer may consist of an acoustic model, which predicts the letters present in a snippet of speech, and a language model, which predicts the likelihood that a given word follows another. These models can be represented as a WFST and are typically trained separately and combined to yield the most likely transcription. Our new GTN library makes it possible to train different types of models together, which provides better results.

Graphs are more structured than tensors, which allows researchers to encode more useful prior knowledge about tasks into a learning algorithm. In speech recognition, for example, if a word has a few possible pronunciations, GTN allows us to encode the pronunciations for that word into a graph and incorporate that graph into the learning algorithm.

Previously, the use of individual graphs at training time was implicit, and developers had to hard-code the graph structure in the software. Now, using this framework, researchers can use WFSTs dynamically at training time, and the whole system can learn and improve from the data more efficiently.

This diagram shows a simple WFST built in GTN, which transduces word piece decompositions of “the” to the word itself.

How it works:

With GTN (short for graph transformer networks), researchers can easily construct WFSTs, visualize them, and perform operations on them. Gradients can be computed with respect to any of the graphs participating in a computation with a simple call to gtn.backward. Here’s an example:


 import gtn
 g1 = gtn.Graph()
 g2 = gtn.Graph()
 # ... add some nodes and arcs to the graph ...
 # Compute a function of the graphs:
 intersection = gtn.intersect(g1, g2) 
 score = gtn.forward_score(intersection)
 # Calculate gradients:
 gtn.backward(score)

GTN’s programming style feels as familiar to use as popular frameworks like PyTorch. The imperative style, autograd API, and autograd implementation are based on similar design principles. The main difference is we replaced tensors with WFSTs and their corresponding operations. As with any framework, GTN is intended to be easy to use without sacrificing performance.

In our research paper paper, we show examples of how to implement algorithms with GTN. One example uses GTN to augment any sequence-level loss function with the ability to marginalize over the decomposition of phrases into word pieces. The model is free to choose how to decompose a word like “the” into word pieces. For instance, the model could choose to use “th” and “e”, or “t”, “h”, and “e”. Word pieces are frequently used in machine translation and speech recognition, but the decompositions are chosen from task-independent models. Our new approach lets the model learn the optimal decomposition of a word or phrase for a given task.

Why it matters:

Building machine learning models with useful graph-based data structures has been difficult because there aren’t many easy-to-use frameworks. By separating graphs (or data) from operations on graphs, researchers have more freedom and opportunity to experiment with a larger design space of structured learning algorithms.

If lessons from tensor-based frameworks hold true, then tools to more easily experiment with algorithms play a crucial role in developing new and better algorithms. With GTN, the graph structure is well suited for encoding useful prior knowledge in a way that’s suggestive but not overly prescriptive. By using these graphs at training time, the whole system can still learn and improve from data. In the long run, the structure of WFSTs combined with learning from data has the potential to make machine learning models more accurate, modular, and lightweight.

We hope that releasing the software will encourage other researchers in the field to help us explore this new design space for better learning algorithms.