Our team advances the state of the art in Speech & Audio. We create spoken language technology to make it faster and easier for people to build community and connect with others around the world. We work on all aspects of speech and audio processing, including speech recognition and synthesis, speaker identification, acoustic event detection, and music analysis and generation.
Our technology is deployed at scale: it powers voice interfaces for Portal and Oculus devices and video understanding for Facebook and Instagram, spanning transcription, captioning, and content understanding. Our video understanding efforts are unique in their scope and scale, processing the billions of videos that Facebook and Instagram receive in dozens of languages.
In this paper, we present a study of an end-to-end learning system for spoken language understanding. With this unified approach, the semantic meaning of an utterance is inferred directly from audio features, without an intermediate text representation.
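To make the contrast with a conventional pipeline concrete, the sketch below shows the shape of such an end-to-end system: frame-level audio features are encoded and pooled over time, then mapped straight to an intent distribution, with no transcription step in between. This is an illustrative NumPy sketch with randomly initialized weights standing in for trained parameters; the dimensions, layer choices, and function names are assumptions, not the architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical dimensions: 40-dim log-mel features, 64 hidden units, 5 intents.
N_MEL, HIDDEN, N_INTENTS = 40, 64, 5

# Random weights stand in for parameters learned end to end from audio to intent.
W_enc = rng.standard_normal((N_MEL, HIDDEN)) * 0.1
W_cls = rng.standard_normal((HIDDEN, N_INTENTS)) * 0.1

def end_to_end_slu(features):
    """Map a (frames, N_MEL) feature matrix directly to intent probabilities.

    Note what is absent: no acoustic-model decoding, no text transcript,
    no separate NLU stage -- audio features go straight to semantics.
    """
    hidden = relu(features @ W_enc)   # frame-level acoustic encoding
    pooled = hidden.mean(axis=0)      # pool over time into one utterance vector
    return softmax(pooled @ W_cls)    # distribution over intent classes

utterance = rng.standard_normal((120, N_MEL))  # 120 frames of synthetic features
probs = end_to_end_slu(utterance)
print(probs.shape)
```

In a trained system the encoder would be a recurrent or convolutional network and the whole stack would be optimized jointly on (audio, intent) pairs, which is what removes the dependence on a separate speech recognizer.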
Dmitriy Serdyuk, Yongqiang Wang, Christian Fuegen, Anuj Kumar, Baiyang Liu, Yoshua Bengio