July 10, 2020
We’re introducing a new method to separate up to five voices speaking simultaneously on a single microphone. Our method surpasses previous state-of-the-art performance on several speech source separation benchmarks, including ones with challenging noise and reverberations. Using the WSJ0-2mix and WSJ0-3mix data sets, along with newly created variations with four and five simultaneous speakers, our model achieved a scale-invariant signal-to-noise ratio (SI-SNR, a common measure of separation quality) improvement of more than 1.5 dB (decibels) over the current state-of-the-art models.
To build our model, we use a novel recurrent neural network architecture that works directly on the raw audio waveform. The best previously available models use a mask and a decoder to sort each speaker’s voice. The performance of these kinds of models rapidly degrades when the number of speakers is high or unknown.
As with standard speech separation systems, our model requires knowledge of the total number of speakers in advance. To handle cases where the number of speakers is unknown, we built a novel system that automatically detects the number of speakers and selects the most relevant model.
The main goal of speech separation models is to estimate the input sources, given an input mixture of speech signals, and generate an output of isolated channels for each speaker.
Our model uses an encoder network that maps the input signal to a latent representation. A voice separation network composed of several blocks then takes this latent representation as input and outputs an estimated signal for each speaker. Previous methods typically use a mask when performing separation, which is problematic when the mask is not well defined, and some signal information may be lost in the process.
We trained the model to directly optimize SI-SNR, using several loss functions under permutation invariant training. We inserted a loss function after every separation block to further improve the optimization process. Finally, to ensure each speaker is consistently mapped to a particular output channel, we added a perceptual loss function using a pretrained speaker recognition model.
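To make the objective concrete, here is a minimal Python sketch of SI-SNR and a permutation invariant score built on it. It is an illustration under stated assumptions, not the training code behind this work: the names `si_snr` and `pit_si_snr`, the `eps` constant, and the brute-force search over permutations are hypothetical choices, and signals are plain lists of floats rather than tensors.

```python
import math
from itertools import permutations


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    n = len(target)
    # Remove the mean so constant offsets do not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [v - mean_e for v in estimate]
    t = [v - mean_t for v in target]
    # Project the estimate onto the target; the projection is the "signal" part.
    dot = sum(a * b for a, b in zip(e, t))
    scale = dot / (sum(b * b for b in t) + eps)
    s_target = [scale * b for b in t]
    # Whatever the projection does not explain counts as noise.
    noise = [a - b for a, b in zip(e, s_target)]
    signal_energy = sum(v * v for v in s_target)
    noise_energy = sum(v * v for v in noise) + eps
    return 10.0 * math.log10(signal_energy / noise_energy + eps)


def pit_si_snr(estimates, targets):
    """Best mean SI-SNR over every assignment of output channels to speakers."""
    best_score, best_perm = -math.inf, None
    for perm in permutations(range(len(targets))):
        # perm[i] is the output channel assigned to reference speaker i.
        total = sum(si_snr(estimates[p], targets[i]) for i, p in enumerate(perm))
        score = total / len(targets)
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm
```

In a training loop, the negative of a score like this would typically serve as the loss. Because the score is taken over the best assignment, the model is not penalized for emitting the speakers in a different order than the references.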
We also built a new system to handle separation of an unknown number of speakers. We did this by training different models for separating two, three, four, and five speakers. We fed the input mixture to the model designed to accommodate up to five simultaneous speakers so that it would detect the number of active (nonsilent) channels present. Then, we repeated the same process with a model trained for the number of active speakers and checked to see whether all output channels were active. We repeated this process until either all channels were active or we reached the model with the lowest number of target speakers.
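The selection loop described above can be sketched in a few lines of Python. This is a hedged illustration rather than the actual system: `separate_unknown`, `count_active`, the mean-power silence test, and its `threshold` value are all hypothetical, and each entry in `models` stands in for a trained separation model that returns one output channel per target speaker.

```python
def count_active(channels, threshold=1e-3):
    """Count channels whose mean power exceeds a silence threshold (assumed test)."""
    def power(channel):
        return sum(v * v for v in channel) / max(len(channel), 1)
    return sum(1 for channel in channels if power(channel) > threshold)


def separate_unknown(mixture, models, threshold=1e-3):
    """Choose a model by repeatedly counting active output channels.

    `models` maps a speaker count to a callable returning that many channels.
    Returns the chosen speaker count and that model's outputs.
    """
    smallest = min(models)
    n = max(models)  # start with the model for the most speakers
    while True:
        outputs = models[n](mixture)
        active = count_active(outputs, threshold)
        # Stop when every channel is active or no smaller model exists.
        if active >= n or n == smallest:
            return n, outputs
        # Otherwise retry with the model trained for the detected count,
        # falling back to the closest smaller model that is available.
        wanted = max(active, smallest)
        n = max(k for k in models if k <= wanted)
```

With models for two through five speakers, a mixture of three voices would first pass through the five-speaker model, which should leave two channels silent, and then through the three-speaker model, where all channels are active and the loop stops.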
The ability to separate a single voice from conversations among many people can improve communication across a wide range of applications that we use in our daily lives, like voice messaging, assistants, and video tools, as well as AR/VR innovations. It can also improve audio quality for people with hearing aids, so it’s easier to hear others clearly in crowded and noisy environments such as parties, restaurants, or large video calls.
Beyond separating different voices, our novel system can also be applied to separate other types of speech signals from a mixture of sounds such as background noise. Our work can also be applied to music recordings, improving our previous work on separating different musical instruments from a single audio file. As a next step, we’ll work on improving the generative properties of the model until it achieves high performance in real-world conditions.