This work presents a new approach to automatic speech recognition that jointly trains acoustic and language models. These models are typically trained separately and then combined at inference using a beam search decoder. By leveraging a language model at training time, this end-to-end technique, called a differentiable beam search decoder (DBD), simplifies the acoustic model. As a result, DBD makes the entire system more lightweight and the overall inference process more efficient.
Beam search decoding is a technique commonly used at inference time in natural language processing (NLP) and automatic speech recognition (ASR) systems. NLP and ASR systems are often trained to predict letters or subword units, while at inference time, actual words need to be generated. In ASR, the acoustic model takes audio as input and outputs letters. To go from letters to words, a search procedure is used to constrain the outputs to be in a set of allowed words. Because exact search is far too computationally expensive, systems use approximate search algorithms, such as beam search. A language model is often used during the search to select more likely sequences of words. Using a language model is particularly important for speech recognition, where the acoustic information is often not enough to disambiguate words that sound the same.
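To make the search procedure concrete, here is a minimal Python sketch of a lexicon-constrained beam search that adds a language model score at each word boundary. It is an illustration under simplifying assumptions, not the decoder used in this work: each audio frame is assumed to emit exactly one symbol, the `|` word separator, the toy `unigram` language model, and names such as `beam_search` and `lm_weight` are all hypothetical.

```python
import math

def beam_search(emissions, lexicon, lm_logprob, beam_size=3, lm_weight=1.0):
    """Approximate search over letter sequences, restricted to lexicon words.

    emissions: one dict per frame, mapping a symbol (a letter, or '|' for a
        word boundary) to its acoustic log-probability. (Assumed format.)
    lm_logprob: function (previous_words, word) -> log-probability.
    Returns finished hypotheses as (score, words), best first.
    """
    # Every prefix of every allowed word; used to prune impossible spellings.
    prefixes = {w[:i] for w in lexicon for i in range(len(w) + 1)}
    # A hypothesis is (score, completed words, letters of the current word).
    beam = [(0.0, (), "")]
    for frame in emissions:
        candidates = []
        for score, words, partial in beam:
            for symbol, logp in frame.items():
                if symbol == "|":
                    # A word may only end if the letters spell a lexicon word.
                    if partial in lexicon:
                        lm = lm_weight * lm_logprob(words, partial)
                        candidates.append((score + logp + lm, words + (partial,), ""))
                elif partial + symbol in prefixes:
                    candidates.append((score + logp, words, partial + symbol))
        # Keep only the best few hypotheses: this is the approximation.
        beam = sorted(candidates, key=lambda h: h[0], reverse=True)[:beam_size]
    return [(score, words) for score, words, partial in beam if partial == ""]

# Toy example: the acoustics slightly prefer "there", the LM prefers "their".
lexicon = {"there", "their"}
unigram = {"there": 0.1, "their": 0.9}
lm = lambda previous_words, word: math.log(unigram[word])
emissions = [
    {"t": 0.0}, {"h": 0.0}, {"e": 0.0},
    {"r": math.log(0.55), "i": math.log(0.45)},
    {"e": math.log(0.55), "r": math.log(0.45)},
    {"|": 0.0},
]

print(beam_search(emissions, lexicon, lm, lm_weight=1.0)[0][1])  # ('their',)
print(beam_search(emissions, lexicon, lm, lm_weight=0.0)[0][1])  # ('there',)
```

With the language model switched off (`lm_weight=0.0`), the search follows the acoustic scores alone and picks "there"; with it switched on, the language model tips the decision to "their". This is the disambiguation role described above for words that sound alike.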
Using the beam search decoder only at inference time is suboptimal, since the model then behaves differently at inference than during training. This work demonstrates that training through the beam search decoder is possible even with acoustic and language models that operate at different granularities (letters or words). This work also shows that integrating the language model when training the acoustic model leads to lighter-weight acoustic models and more efficient inference. Finally, the DBD can jointly train acoustic and language models. In contrast to previous fully end-to-end approaches, which learn an implicit language model, the DBD learns an explicit language model, which keeps the benefit of fully end-to-end training while making the components more flexible.
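One way to see how a beam can be trained through is to define a loss directly on the scores of the hypotheses that survive the search. The Python sketch below is an assumption for illustration, since this summary does not spell out the training objective: it contrasts the log-sum-exp of the scores of hypotheses that match the reference transcription with the log-sum-exp over the whole beam. The function names are hypothetical, and the sketch assumes at least one correct hypothesis is in the beam.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def beam_loss(scores, is_correct):
    """Loss over beam hypotheses (illustrative objective, an assumption).

    scores: total score of each hypothesis (acoustic + weighted LM).
    is_correct: True where the hypothesis matches the reference transcript.
    The loss is zero only when all score mass sits on correct hypotheses.
    """
    correct = [s for s, ok in zip(scores, is_correct) if ok]
    return logsumexp(scores) - logsumexp(correct)

def beam_loss_grad(scores, is_correct):
    """Analytic gradient of beam_loss with respect to each hypothesis score."""
    log_z_all = logsumexp(scores)
    log_z_correct = logsumexp([s for s, ok in zip(scores, is_correct) if ok])
    grads = []
    for s, ok in zip(scores, is_correct):
        g = math.exp(s - log_z_all)          # softmax over the whole beam
        if ok:
            g -= math.exp(s - log_z_correct)  # softmax over correct hypotheses
        grads.append(g)
    return grads

# Three hypotheses in the beam; only the first matches the reference.
scores = [-1.7, -3.5, -4.0]
flags = [True, False, False]
print(beam_loss(scores, flags))
print(beam_loss_grad(scores, flags))
```

Because each hypothesis score is a sum of acoustic and language model terms, the gradient with respect to a score passes to both models through the chain rule. The gradient is negative for the correct hypothesis and positive for the competing ones, so a descent step raises the score of the correct hypothesis relative to the others in the beam.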
In tests against state-of-the-art speech recognition systems that use only acoustic data and transcriptions, models trained with DBD are significantly simpler while also achieving a better word error rate. This approach could lead to speech models that are not only faster to train but can also run on systems with tight hardware constraints, such as limits on latency and throughput. Although our results were obtained on speech, this is a general approach that we believe can be easily applied to other domains across the field of NLP.