January 13, 2020
We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate. We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. Also important to the efficiency of the recognizer is our highly optimized beam search decoder. To show the impact of our design choices, we analyze throughput, latency and accuracy and also discuss how these metrics can be tuned based on the user requirements.
Written by
Publisher
Research Topics
November 16, 2022
Kushal Tirumala, Aram H. Markosyan, Armen Aghajanyan, Luke Zettlemoyer
November 16, 2022
October 31, 2022
Fabio Petroni, Giuseppe Ottaviano, Michele Bevilacqua, Patrick Lewis, Scott Yih, Sebastian Riedel
October 31, 2022
December 06, 2020
Michael Lewis, Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer, Marjan Ghazvininejad, Sida Wang
December 06, 2020
November 30, 2020
Dhruv Batra, Devi Parikh, Meera Hahn, Jacob Krantz, James Rehg, Peter Anderson, Stefan Lee
November 30, 2020
April 30, 2018
Yedid Hoshen, Lior Wolf
April 30, 2018
November 01, 2018
Yedid Hoshen, Lior Wolf
November 01, 2018
December 02, 2018
Sagie Benaim, Lior Wolf
December 02, 2018
June 30, 2019
Geng Ji, Dehua Cheng, Huazhong Ning, Changhe Yuan, Hanning Zhou, Liang Xiong, Erik B. Sudderth
June 30, 2019
Product experiences
Foundational models
Product experiences
Latest news
Foundational models