Facebook AI and other members of the speech and language research community are gathering in Graz, Austria, from Sunday, September 15, through Thursday, September 19, for the 20th Annual Conference of the International Speech Communication Association (Interspeech). With around 2,000 attendees, Interspeech is considered the world’s largest and most comprehensive conference on the science and technology of spoken language processing.
Throughout the week, Facebook AI researchers in speech and natural language processing (NLP) are presenting their latest work in poster sessions and oral presentations, including research on a new deep learning method for singing voice conversion, lexicon-free speech recognition, a new self-supervised method for training speech recognition models, and work on joint grapheme and phoneme embeddings for contextual end-to-end ASR.
The speech team at Facebook has two major directions: human interaction and voice interfaces, and video content understanding. Software Engineering Manager Christian Fuegen encourages PhD students and industry professionals to come by booth #F7 to learn more about the work the Facebook Speech team is doing. “We’re always looking to connect with the speech research community,” he says. “For those unable to attend Interspeech, keep in touch on social media.” Facebook Research has multiple channels of communication, including @facebookai on Twitter and Facebook and @academics on Facebook.
PhD students interested in learning more about Facebook’s speech research can also submit an application for the Facebook Fellowship Award in spoken language processing and audio classification. Successful awardees are invited to an annual summit where they can learn more about research at Facebook. The deadline to apply is October 4 at 11:59 pm PST. To see a schedule of Facebook papers and other activities at Interspeech, including presentation locations and recruiting information, click here.
End-to-end approaches to automatic speech recognition, such as Listen-Attend-Spell (LAS), blend all components of a traditional speech recognizer into a unified model. Although this simplifies training and decoding pipelines, a unified model is hard to adapt when mismatch exists between training and test data, especially if this information is dynamically changing. The Contextual LAS (CLAS) framework tries to solve this problem by encoding contextual entities into fixed-dimensional embeddings and utilizing an attention mechanism to model the probabilities of seeing these entities. In this work, we improve the CLAS approach by proposing several new strategies to extract embeddings for the contextual entities. We compare these embedding extractors based on graphemic and phonetic input and/or output sequences and show that an encoder-decoder model trained jointly toward graphemes and phonemes outperforms other approaches. Leveraging phonetic information obtains better discrimination for similarly written graphemic sequences and also helps the model generalize better to graphemic sequences unseen in training. We show significant improvements over the original CLAS approach and also demonstrate that the proposed method scales much better to a large number of contextual entities across multiple domains.
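As a rough illustration of the attention step described above, the sketch below scores a decoder state against a small set of contextual entity embeddings and normalizes the scores into probabilities. Everything specific here is an assumption made for illustration: the dot-product scoring, the three-dimensional vectors, the example phrases, and the `<no-bias>` entry are hypothetical and are not taken from the paper, whose embeddings come from a trained grapheme and phoneme encoder.

```python
import math

def attention_over_entities(decoder_state, entity_embeddings):
    """Return a probability for each contextual entity.

    Scores are plain dot products (an assumption for this sketch),
    normalized with a numerically stable softmax.
    """
    scores = {
        name: sum(q * e for q, e in zip(decoder_state, emb))
        for name, emb in entity_embeddings.items()
    }
    top = max(scores.values())  # subtract the max before exponentiating
    exp_scores = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(exp_scores.values())
    return {name: v / total for name, v in exp_scores.items()}

# Hypothetical fixed-dimensional embeddings for two contextual phrases,
# plus an all-zero entry standing in for "no contextual entity applies".
entities = {
    "<no-bias>": [0.0, 0.0, 0.0],
    "call mom": [1.0, 0.0, 0.5],
    "play jazz": [0.0, 1.0, 0.2],
}
weights = attention_over_entities([2.0, 0.1, 0.3], entities)
best_entity = max(weights, key=weights.get)
```

Because the entities enter the model only through their embeddings, the list can change at inference time without retraining, which is the property that makes this family of approaches attractive when the context is dynamic.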
We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient than a strong RNN baseline. Key to our approach is a time-depth separable convolution block, which dramatically reduces the number of parameters in the model while keeping the receptive field large. We also give a stable and efficient beam search inference procedure which allows us to effectively integrate a language model. Coupled with a convolutional language model, our time-depth separable convolution architecture improves by more than 22 percent relative WER over the best previously reported sequence-to-sequence results on the noisy LibriSpeech test set.
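To make the parameter savings concrete, here is a back-of-the-envelope count under stated assumptions. The abstract does not spell out the block layout, so this sketch assumes that the input has `width` feature bins and `channels` channels, that the convolution over time mixes channels only, and that it is followed by two pointwise linear layers over the flattened `width * channels` features. Biases and normalization are ignored, and the sizes are hypothetical.

```python
def full_conv1d_params(kernel, width, channels):
    """Weights in a 1D convolution over time that mixes all features."""
    d = width * channels
    return kernel * d * d

def tds_block_params(kernel, width, channels):
    """Weights in a time-depth separable block, as assumed in this sketch."""
    conv_over_time = kernel * channels * channels  # kernel x 1, channels only
    d = width * channels
    pointwise_layers = 2 * d * d                   # two linear layers, no time extent
    return conv_over_time + pointwise_layers

# Hypothetical sizes chosen only to show the scaling.
kernel, width, channels = 21, 80, 10
full = full_conv1d_params(kernel, width, channels)
tds = tds_block_params(kernel, width, channels)
ratio = full / tds
```

In this accounting the cost of a wider kernel grows with `channels` squared rather than with `(width * channels)` squared, so the receptive field can be enlarged cheaply.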
Eliya Nachmani and Lior Wolf
We present a deep learning method for singing voice conversion. The proposed network is not conditioned on the text or on the notes, and it directly converts the audio of one singer to the voice of another. Training is performed without any form of supervision: no lyrics or any kind of phonetic features, no notes, and no matching samples between singers. The proposed network employs a single CNN encoder for all singers, a single WaveNet decoder, and a classifier that enforces the latent representation to be singer-agnostic. Each singer is represented by one embedding vector, which the decoder is conditioned on. In order to deal with relatively small data sets, we propose a new data augmentation scheme, as well as new training losses and protocols that are based on back translation. Our evaluation presents evidence that the conversion produces natural singing voices that are highly recognizable as the target singer.
We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
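A noise contrastive binary classification objective of this kind can be sketched as follows: the model is rewarded for scoring the true future representation highly against a context vector and for scoring sampled distractors low. This is a simplified sketch, not the wav2vec implementation. The plain dot-product score, the two-dimensional vectors, and the function names are assumptions for illustration.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(context, positive, negatives):
    """Binary classification loss: true sample vs. sampled distractors."""
    loss = -log_sigmoid(dot(positive, context))    # true sample should score high
    for distractor in negatives:
        loss -= log_sigmoid(-dot(distractor, context))  # distractors should score low
    return loss

# Toy vectors (hypothetical): the first case separates the true sample
# from the distractor correctly, the second case gets it backwards.
good = contrastive_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]])
bad = contrastive_loss([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]])
```

The loss is small when the true sample and the distractors are well separated and grows when they are confused, which is the signal that lets training proceed without transcriptions.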
Lexicon-free speech recognition naturally deals with the problem of out-of-vocabulary (OOV) words. In this paper, we show that character-based language models (LMs) can perform as well as word-based LMs for speech recognition, in terms of word error rate (WER), even without restricting the decoding to a lexicon. We study character-based LMs and show that convolutional LMs can effectively leverage large (character) contexts, which is key for good speech recognition performance downstream. We specifically show that the lexicon-free decoding performance (WER) on utterances with OOV words using character-based LMs is better than lexicon-based decoding with either character-based or word-based LMs.
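Since the results above are reported as word error rate, a minimal sketch of the standard definition may be useful: the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The function name and the whitespace tokenization are assumptions of this sketch. Evaluation toolkits typically also normalize text before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A relative WER improvement, as quoted in some of the abstracts, is the reduction divided by the baseline WER rather than the absolute difference in percentage points.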
Organizing committee: Robin Algayres, Juan Benjumea, Mathieu Bernard, Laurent Besacier, Alan W. Black, Xuan-Nga Cao, Charlotte Dugrain, Ewan Dunbar, Emmanuel Dupoux, Julien Karadayi, Lucie Miskic, Lucas Ondel, Sakriani Sakti