December 2, 2021
Meta AI researchers have pushed the future of conversational voice assistants forward with two new works that significantly reduce latency and provide a framework for on-device processing.
The on-device vision for a voice assistant will be an important building block of the metaverse, where people will be able to seamlessly interact with their environment using vocal commands.
Conversational assistants have become ubiquitous on smart speakers, computers, smartphones, and other devices, helping people do everything from keeping track of their calendars to finding out the weather forecast. Such assistants rely on semantic parsing to convert a user’s request into a structured form, consisting of intents and slots to allow for downstream execution. The request usually needs to go off-device in order to access larger models running on the cloud.
Seq2seq modeling is the de facto tool for advanced semantic parsers. However, the latency of autoregressive generation (token by token) makes such models prohibitively slow for on-device use. In two new papers, we propose a model for on-device assistants and we show how we can make larger server-side models less computationally expensive.
These two new works make seq2seq modeling more efficient while retaining scalability. In our first work, we propose non-autoregressive semantic parsing, a new model that decodes all tokens in parallel. Our work overcomes the latency burden of seq2seq modeling through parallel decoding, showcasing significant latency reductions (up to 81 percent, benchmarked on a 2017-era Android smartphone with an eight-core processor) along with accuracy equivalent to autoregressive models.
While non-autoregressive semantic parsing is successful, we find that due to the rigidity of the length prediction task, such parsers have difficulty generalizing, as length prediction requires knowledge of both the user utterance (slot text) and the target ontology (intent/slot labels). To overcome this limitation, we propose span pointer networks, a non-autoregressive scheme that relies on span-based decoding using only the target ontology for length prediction. We show through this span formulation that we can significantly improve quality and generalization while also reducing latency and memory consumption at larger beam sizes.
We are exploring how to use these methods to power conversational assistants in Meta’s products and services. This research will also provide a framework for creating more useful assistants as we build the metaverse.
Our proposed architecture is built upon two of our works: non-autoregressive semantic parsing and span pointer networks for non-autoregressive parsing. Non-autoregressive modeling allows for fully parallel decoding (generating the entire sequence at once), while traditional autoregressive parsing decodes sequentially, token by token. Often, non-autoregressive methods come at the cost of accuracy; however, our proposed methods for span-based non-autoregressive parsing achieve parity with autoregressive methods.
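To make the contrast concrete, here is a minimal sketch that counts sequential decoding steps under each scheme. It is a toy illustration, not the model from the papers: the bracketed parse format, the `TARGET` sequence, and the helper names are all assumptions, and a lookup table stands in for a trained network.

```python
from typing import Callable, List, Tuple

EOS = "</s>"

# Hypothetical target parse; the bracket notation is assumed for illustration.
TARGET = ["[IN:GET_WEATHER", "[SL:LOCATION", "seattle", "]", "]"]


def decode_autoregressive(
    next_token: Callable[[List[str]], str], max_len: int = 50
) -> Tuple[List[str], int]:
    """Sequential decoding: each step must wait for the previous token."""
    output: List[str] = []
    steps = 0
    while len(output) < max_len:
        token = next_token(output)  # depends on everything generated so far
        steps += 1
        if token == EOS:
            break
        output.append(token)
    return output, steps


def decode_parallel(
    token_at: Callable[[int], str], length: int
) -> Tuple[List[str], int]:
    """Parallel decoding: positions are independent of each other.

    The list comprehension stands in for one batched forward pass,
    so it is counted as a single decoding step.
    """
    return [token_at(i) for i in range(length)], 1


def toy_next_token(prefix: List[str]) -> str:
    # Lookup table standing in for a trained autoregressive decoder.
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else EOS


def toy_token_at(position: int) -> str:
    # Lookup table standing in for a trained non-autoregressive decoder.
    return TARGET[position]


ar_output, ar_steps = decode_autoregressive(toy_next_token)
par_output, par_steps = decode_parallel(toy_token_at, len(TARGET))
print(ar_steps, par_steps)
```

In this sketch both decoders produce the same parse, but the autoregressive loop takes one step per output token plus one for the end marker, while the parallel version needs the output length up front and then fills every position in a single step.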
In this work, we propose a fully parallelizable semantic parser by leveraging a CMLM-based non-autoregressive decoding scheme.
In the first model, our semantic parsers are broken down into three components: encoding, length prediction, and decoding. The model is responsible for encoding the utterance, then predicting the length of the output and creating that many mask tokens, and finally decoding each of these tokens in parallel.
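The three stages can be sketched as follows. This is a simplified illustration under stated assumptions rather than the actual implementation: the "encoder" just passes tokens through, and `toy_predict_length` and `toy_predict_token` are hypothetical hand-written rules for a single reminder-style utterance, standing in for learned components.

```python
from typing import Callable, List

MASK = "[MASK]"


def parse_non_autoregressive(
    utterance: List[str],
    predict_length: Callable[[List[str]], int],
    predict_token: Callable[[List[str], int, int], str],
) -> List[str]:
    """Toy pipeline: encode, predict the output length, decode all positions."""
    # 1. Encoding: a real parser would run a neural encoder here.
    encoded = list(utterance)
    # 2. Length prediction: decide how many output tokens to produce.
    length = predict_length(encoded)
    # 3. Create that many mask tokens, then fill each one independently.
    masked = [MASK] * length
    return [predict_token(encoded, i, length) for i, _ in enumerate(masked)]


def _slot_tokens(encoded: List[str]) -> List[str]:
    # Hypothetical rule: everything after "to" is the reminder text.
    return encoded[encoded.index("to") + 1:]


def toy_predict_length(encoded: List[str]) -> int:
    # Two opening labels + two closing brackets + one token per slot word.
    return 4 + len(_slot_tokens(encoded))


def toy_predict_token(encoded: List[str], position: int, length: int) -> str:
    if position == 0:
        return "[IN:CREATE_REMINDER"
    if position == 1:
        return "[SL:TODO"
    if position >= length - 2:
        return "]"
    return _slot_tokens(encoded)[position - 2]


parse = parse_non_autoregressive(
    "remind me to call mom".split(), toy_predict_length, toy_predict_token
)
print(" ".join(parse))
```

Note that in this formulation the predicted length grows with the number of words in the slot text, so the length predictor has to reason about both the labels and the utterance.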
Through extensive experimentation, we show that such non-autoregressive parsers can achieve accuracy parity with autoregressive parsers of similar architectures, while providing significant improvements in decoding speed.
Span-based predictions to improve performance and efficiency
While our work on non-autoregressive parsing has shown great potential for parallelizing models for efficient decoding, the generalization of such a model is fundamentally restricted by the length prediction task. In order for the model to predict the correct length, it must know the number of intent and slot labels, as well as the length of any relevant slot text in the output parse.
To reduce length prediction errors, we formulate a new parser that relies on span-based prediction rather than slot text generation. Switching to span decoding means the length of the output is now decoupled from the user’s utterance text, since it only takes two tokens (start/end index) to represent a slot. The slot text is therefore always represented by exactly two tokens, regardless of what the text is or what language it is in, leading to a significantly more consistent modeling task. A depiction of our model is shown in the figure below.
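A small sketch of the span form shows why the output length stops depending on the utterance. The bracket notation, the label names, and the choice of an inclusive end index are assumptions made for illustration; the source only states that a slot is represented by a start and an end index.

```python
from typing import List, Tuple

Slot = Tuple[str, int, int]  # (label, start index, inclusive end index)


def to_span_parse(intent: str, slots: List[Slot]) -> List[str]:
    """Build a parse where each slot's text is replaced by two index tokens."""
    parse = [f"[IN:{intent}"]
    for label, start, end in slots:
        parse += [f"[SL:{label}", str(start), str(end), "]"]
    parse.append("]")
    return parse


def slot_text(utterance: List[str], start: int, end: int) -> List[str]:
    """Recover the slot words from the utterance using the predicted span."""
    return utterance[start:end + 1]


short = "remind me to call mom".split()
long = "remind me to call my mom tomorrow morning".split()

short_parse = to_span_parse("CREATE_REMINDER", [("TODO", 3, 4)])
long_parse = to_span_parse("CREATE_REMINDER", [("TODO", 3, 7)])

# Both parses have the same number of tokens even though the slot text differs.
print(len(short_parse), len(long_parse))
print(slot_text(long, 3, 7))
```

With this representation the target length is determined by the intent and slot labels alone, which is the property the span formulation relies on.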
This figure depicts our new generation form that relies on span-based prediction.
Our results show significant improvements in quality (+1.2 percent absolute compared with prior non-autoregressive parsers); improvements in cross-domain generalization (an average 15 percent improvement on the reminder domain, compared with prior non-autoregressive parsers); and cross-lingual improvements (13.7 percent improvement over an XLM-R autoregressive baseline). Furthermore, they show a 70 percent reduction in latency and an 83 percent reduction in memory for a beam size of 5.
These improvements might help lay the groundwork for preserving on-device privacy at scale without sacrificing accuracy and speed, and serve as a blueprint for how we can bring larger models onto the server. Compared with prior work in both non-autoregressive modeling in machine translation and efficient modeling in semantic parsing, our research shows we can decode full semantic parses in a single decoding step at state-of-the-art quality, while retaining the generalization properties of autoregressive parsers.
We want interactions with assistants to feel natural and lag-free, while also keeping people’s data on-device as much as possible. For further research, we will be relying on this method to make models faster and more data-efficient.
While the research community makes progress independently on efficient architectures for on-device and scalable architectures for low-resource learning, it is important to investigate strategies for a single model that is both efficient and scalable. Efficient modeling is critical for on-device language understanding, as well as bringing more scalable pretrained models to production environments.
We hope our research provides an effective avenue for researchers and practitioners to improve semantic parsing in conversational assistants.