May 15, 2020
Facebook AI has built and deployed a real-time neural text-to-speech system on CPU servers, delivering industry-leading compute efficiency and human-level audio quality.
Previous systems typically rely on GPUs or other specialized hardware to generate high-quality speech in real time. We increased synthesis speed by 160x using model-system co-optimization techniques, enabling us to generate one second of audio in 500 milliseconds on CPU.
It’s deployed in Portal, our video-calling device, and is available for use across a range of other Facebook applications, from reading support for the visually impaired to virtual reality experiences.
Modern text-to-speech (TTS) systems have come a long way in using neural networks to mimic the nuances of human voice. To generate humanlike audio, one second of speech can require a TTS system to output as many as 24,000 samples — sometimes even more. The size and complexity of state-of-the-art models require massive computation, which often needs to run on GPUs or other specialized hardware.
At Facebook, our long-term goal is to deliver high-quality, efficient voices to the billions of people in our community. In order to achieve this, we've built and deployed a neural TTS system with state-of-the-art audio quality. With strong engineering and extensive model optimization, we have attained a 160x speedup over our baseline while retaining state-of-the-art audio quality, which enables the whole service to be hosted in real time using regular CPUs — without any specialized hardware.
The system is highly flexible and will play an important role in creating and scaling new voice applications that sound more human and expressive and are more enjoyable to use. It’s currently powering Portal, our video-calling device, and it’s available as a service for other applications, like reading assistance and virtual reality. Today, we’re sharing details on our approach and how we solved core efficiency challenges to deploy this at scale.
We designed a pipeline that efficiently combines four components, each of which focuses on a different aspect of speech, into a powerful and flexible system:
A linguistic front-end converts input text to a sequence of linguistic features, such as phonemes and sentence type.
A prosody model predicts the rhythm and melody to create the expressive qualities of natural speech.
An acoustic model generates the spectral representation of the speech.
A neural vocoder generates a 24 kHz speech waveform conditioned on prosody and spectral features.
It’s important to build a separate prosody model in the pipeline because it makes speech style easier to control at synthesis time. The prosody model takes in the sequence of linguistic features, along with style, speaker, and language embeddings, to predict the phone-level duration (i.e., speed) and frame-level fundamental frequency (i.e., melody) of the sentence. Its architecture consists of a recurrent neural network with content-based global attention, whose context vector contains semantic information about the entire sentence. This allows the model to generate more realistic and natural prosody.
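To make the difference between phone-level and frame-level predictions concrete, here is a minimal sketch of how per-phone durations could be expanded into a per-frame sequence. The 5 ms frame shift is an assumption (it matches a frame rate of about 200 per second), and `expand_to_frames` is a hypothetical helper, not part of the deployed system.

```python
FRAME_MS = 5  # assumed frame shift, i.e. roughly 200 frames per second


def expand_to_frames(phones, durations_ms, frame_ms=FRAME_MS):
    """Repeat each phone label once for every frame it spans."""
    frames = []
    for phone, duration in zip(phones, durations_ms):
        # Every phone occupies at least one frame.
        n_frames = max(1, round(duration / frame_ms))
        frames.extend([phone] * n_frames)
    return frames


frames = expand_to_frames(["h", "i"], [10, 20])
print(len(frames), frames)
```

A frame-level quantity such as fundamental frequency would then carry one value per entry of this expanded sequence.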
We use style embeddings that allow us to create new voice styles — including assistant, soft, fast, projected, and formal — using only a small amount of additional data with the existing data set. Since we don’t have to create a separate model for each style, we need only 30 to 60 minutes of training data for each voice style.
To achieve higher computational efficiency and high-quality speech, we adopted a conditional neural vocoder architecture that makes its predictions based on spectral inputs, rather than one that generates audio directly from text or linguistic features (e.g., autoregressive models like WaveNet or fairly complex parallel synthesis networks like Parallel WaveNet). We used an acoustic model to transform linguistic and prosodic information into frame-rate spectral features, which serve as the neural vocoder’s inputs. This approach enables the neural vocoder to focus on spectral information packed in a few neighboring frames and allows us to train a lighter and smaller neural vocoder.
The trade-off, however, is that we’re now relying on the acoustic model to generate spectral features. Conventional systems use 80-dimensional high-fidelity MFCC or log-mel features, but predicting realistic high-fidelity acoustic features is a challenging problem in itself. To address this spectral feature prediction problem, we use 13-dimensional MFCC features concatenated with the fundamental frequency and a 5-dimensional periodicity feature, which are much easier for the acoustic model to generate.
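The size of the resulting per-frame feature vector follows directly from those numbers: 13 + 1 + 5 = 19 values, compared with 80 for the conventional features. A small sketch of the concatenation, where `build_vocoder_features` and the ordering of the parts are assumptions made for illustration:

```python
MFCC_DIM = 13         # from the text
PERIODICITY_DIM = 5   # from the text


def build_vocoder_features(mfcc, f0, periodicity):
    """Concatenate MFCCs, fundamental frequency, and periodicity for one frame."""
    if len(mfcc) != MFCC_DIM or len(periodicity) != PERIODICITY_DIM:
        raise ValueError("unexpected feature dimensions")
    # The ordering here is an illustrative assumption.
    return list(mfcc) + [f0] + list(periodicity)


frame = build_vocoder_features([0.0] * 13, 120.0, [0.5] * 5)
print(len(frame))
```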
Our conditional neural vocoder consists of two components:
A convolutional neural network that upsamples (or expands) the input feature vectors from frame rate (around 200 predictions per second) to sample rate (24,000 predictions per second).
A recurrent neural network, similar to WaveRNN, that synthesizes audio samples autoregressively (one sample at a time) at 24,000 samples per second.
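Going from about 200 frames per second to 24,000 samples per second means each frame has to be expanded by a factor of 120. The deployed system learns this upsampling with a convolutional network; the sketch below swaps in plain linear interpolation purely to show the change in rate, and `upsample_linear` is a hypothetical name.

```python
FRAME_RATE = 200       # approximate frame rate from the text
SAMPLE_RATE = 24_000   # sample rate from the text
FACTOR = SAMPLE_RATE // FRAME_RATE  # 120 output vectors per input frame


def upsample_linear(frames, factor):
    """Linearly interpolate between consecutive frame vectors.

    This is a stand-in for the learned convolutional upsampler.
    """
    out = []
    for i, cur in enumerate(frames):
        nxt = frames[min(i + 1, len(frames) - 1)]  # hold the last frame
        for k in range(factor):
            t = k / factor
            out.append([(1 - t) * a + t * b for a, b in zip(cur, nxt)])
    return out


one_second = upsample_linear([[0.0]] * FRAME_RATE, FACTOR)
print(len(one_second))
```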
To reduce the impact of quantization noise, the neural vocoder predicts samples on the delta-modulated mu-law audio.
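The post does not spell out the exact encoding, so the following is only one plausible reading: samples are companded with mu-law, and the model then works with differences between consecutive codes. The 8-bit depth (mu = 255) and the simple first-difference scheme are both assumptions.

```python
import math

MU = 255  # assumed 8-bit mu-law; the bit depth is not stated in the text


def mulaw_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer code in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((y + 1) / 2 * mu))


def mulaw_decode(code, mu=MU):
    """Invert mulaw_encode up to quantization error."""
    y = 2 * code / mu - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def delta_encode(codes):
    """Replace each code with its difference from the previous code."""
    prev, out = 0, []
    for c in codes:
        out.append(c - prev)
        prev = c
    return out


def delta_decode(deltas):
    """Recover the codes by accumulating the differences."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out
```

Mu-law spends more quantization levels on quiet samples, and differencing keeps the predicted values small when neighboring samples are similar.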
The autoregressive nature of the neural vocoder requires generating samples in sequential order, making real-time synthesis a major challenge. When we started our experiments, our baseline implementation ran at a real-time factor (RTF) of around 80 on a single CPU core, meaning it took 80 seconds to generate one second of audio. This synthesis speed is prohibitively slow for real-time systems. For real-time capabilities on systems such as Portal, the RTF has to be brought below 1.
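RTF is simply synthesis time divided by the duration of the audio produced, so the numbers in this post can be checked with a few lines of arithmetic. The function name below is illustrative.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Seconds of compute needed per second of generated audio."""
    return synthesis_seconds / audio_seconds


baseline = real_time_factor(80.0, 1.0)   # 80 s per second of audio
optimized = real_time_factor(0.5, 1.0)   # 500 ms per second of audio
speedup = baseline / optimized

print(baseline, optimized, speedup)
```

An RTF below 1 means audio is produced faster than it is played back, which is the requirement for streaming synthesis.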
We combined and implemented the following optimization techniques in one TTS system that ultimately resulted in a 160x improvement in synthesis speed, achieving an RTF of 0.5:
Tensor-level optimizations and custom operators
We migrated from a training-oriented PyTorch setup to an inference-optimized environment with the help of PyTorch JIT. We obtained additional speedup using compiled operators and various tensor-level optimizations. For example, we designed custom operators by adopting efficient approximations for the activation function and applied operator fusion to reduce the total operator loading overhead.
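The post does not say which approximations were used. As an illustration of the general idea, the sketch below replaces `tanh` with a cheap rational approximation that needs no exponentials; the specific formula is an assumption chosen for this example, not the one used in production.

```python
import math


def fast_tanh(x):
    """Rational (Pade-style) approximation of tanh, clamped to [-1, 1]."""
    if x <= -3.0:
        return -1.0
    if x >= 3.0:
        return 1.0
    x2 = x * x
    return x * (27.0 + x2) / (27.0 + 9.0 * x2)


# Compare against the exact function on a grid of inputs.
worst = max(abs(fast_tanh(i / 100) - math.tanh(i / 100)) for i in range(-600, 601))
print(worst)
```

Whether an error of this size is acceptable depends on the model; in practice an approximation like this would need to be validated against audio quality.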
Unstructured model sparsification
We performed unstructured model sparsification through training to reduce the inference computation complexity. We were able to achieve 96 percent unstructured model sparsity — where 4 percent of the model parameters are nonzero — without degrading audio quality. By using optimized sparse matrix operators on the resulting inference net, we were able to increase speed by 5x.
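To show why sparsity translates into speed, here is a minimal sketch that zeroes all but the largest 4 percent of weights and then multiplies using only the surviving entries. One-shot magnitude pruning is swapped in for the training-time sparsification described above, and the compressed sparse row (CSR) layout is an assumption about the storage format.

```python
def prune_by_magnitude(matrix, sparsity):
    """Zero the smallest-magnitude entries (ties may zero a few extra)."""
    flat = sorted(abs(v) for row in matrix for v in row)
    n_zero = round(len(flat) * sparsity)
    if n_zero == 0:
        return [list(row) for row in matrix]
    threshold = flat[n_zero - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in matrix]


def to_csr(matrix):
    """Store only nonzero values with their column indices and row offsets."""
    values, cols, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                cols.append(j)
        row_ptr.append(len(values))
    return values, cols, row_ptr


def csr_matvec(csr, x):
    """Matrix-vector product whose cost scales with the nonzero count."""
    values, cols, row_ptr = csr
    return [
        sum(values[k] * x[cols[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


dense = [[float(i * 10 + j + 1) for j in range(10)] for i in range(10)]
pruned = prune_by_magnitude(dense, 0.96)
y = csr_matvec(to_csr(pruned), [1.0] * 10)
```

The per-element column lookup in `csr_matvec` is the indirect addressing that unstructured sparsity requires.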
We took the simplification even further by applying blockwise sparsification, where nonzero parameters are restricted to blocks of 16x1 and stored in contiguous memory blocks. This led to a compact parameter data layout in memory and minimal indirect addressing, which significantly improved memory bandwidth utilization and cache usage.
We implemented customized operators for the blockwise sparse structure to achieve efficient matrix storage and compute, so that the compute is proportional to the number of nonzero blocks in the matrix. To optimize for high blockwise sparsity without degrading the audio quality, we trained the sparse model through model distillation using the dense model as a teacher model.
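A rough sketch of a blockwise sparse matrix-vector product follows. It assumes each 16x1 block covers 16 consecutive rows of a single column, which is one possible reading of the layout; the function names are hypothetical, and distillation is not shown.

```python
BLOCK = 16  # block height from the text; orientation is an assumption


def to_block_sparse(matrix, block=BLOCK):
    """Keep only blocks (16 rows x 1 column) containing a nonzero value."""
    n_rows = len(matrix)
    if n_rows % block != 0:
        raise ValueError("row count must be a multiple of the block size")
    blocks = []
    for block_row in range(n_rows // block):
        for col in range(len(matrix[0])):
            vals = [matrix[block_row * block + r][col] for r in range(block)]
            if any(v != 0.0 for v in vals):
                blocks.append((block_row, col, vals))
    return n_rows, blocks


def block_sparse_matvec(n_rows, blocks, x, block=BLOCK):
    """Work done is proportional to the number of nonzero blocks."""
    y = [0.0] * n_rows
    for block_row, col, vals in blocks:
        xc = x[col]  # one index lookup serves all 16 values
        base = block_row * block
        for r in range(block):
            y[base + r] += vals[r] * xc
    return y
```

Compared with an element-wise sparse format, each column index is looked up once per 16 contiguous values rather than once per value.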
Distribution over multiple cores
Finally, we achieved further speedup by distributing heavy operators over multiple cores on the same socket. We did this by constraining the nonzero blocks to be evenly distributed over the parameter matrix during training, and by segmenting and distributing matrix multiplication among several CPU cores during inference.
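The segmentation step can be sketched as splitting the matrix into contiguous row ranges and computing each range independently. This example uses a thread pool only to show the structure; in pure Python the interpreter lock prevents a real speedup, and a production system would rely on native operators. All names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(n_rows, n_workers):
    """Contiguous, near-equal row ranges, one per worker."""
    base, extra = divmod(n_rows, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def partial_matvec(matrix, x, start, stop):
    """Compute the output rows in [start, stop)."""
    return [sum(a * b for a, b in zip(matrix[i], x)) for i in range(start, stop)]


def parallel_matvec(matrix, x, n_workers=4):
    """Compute each row range on its own worker and concatenate in order."""
    ranges = split_rows(len(matrix), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda r: partial_matvec(matrix, x, r[0], r[1]), ranges)
        return [v for part in parts for v in part]
```

Equal row counts only balance the load if every range holds a similar number of nonzero blocks, which is why the even distribution is enforced during training.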
To optimize the way we collect training data, we started from an existing approach that relies on a corpus of hand-generated utterances and modified it to select lines from large, unstructured datasets. The large datasets are filtered based on readability criteria via a language model. This novel modification allowed us to maximize the phonetic and prosodic diversity present in the corpus while still ensuring the language was natural and readable by our voice actor. This led to fewer annotations and studio edits for the recorded audio, as well as improved TTS quality. Automatically identifying script lines from a more diverse corpus also allowed us to scale to new languages rapidly without relying on hand-generated datasets.
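The selection procedure is not described in detail here, so the following is a loose sketch of the general idea: filter candidate lines for readability, then greedily pick the lines that add the most not-yet-covered units. Character bigrams stand in for phonetic units, and the `is_readable` callback stands in for the language-model filter; both are assumptions made for illustration.

```python
def units(line):
    """Character bigrams, used here as a stand-in for phonetic units."""
    text = line.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def select_script(lines, is_readable, budget):
    """Greedily choose readable lines that maximize new unit coverage."""
    candidates = [line for line in lines if is_readable(line)]
    covered, chosen = set(), []
    while candidates and len(chosen) < budget:
        best = max(candidates, key=lambda line: len(units(line) - covered))
        if not units(best) - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= units(best)
        candidates.remove(best)
    return chosen


corpus = ["aaaa", "abcd", "abcd", "xyz", "q" * 32]
script = select_script(corpus, lambda line: len(line) <= 10, budget=2)
print(script)
```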
The combination of our new data collection method and our neural TTS system helped us reduce our voice development cycle (from script generation and data collection to delivering the final voice) from over a year to under six months. Recently, we successfully applied our new approach to create a British-accented voice. This is the first of more accents and languages to come.
We’re excited to provide higher-quality audio with a more scalable data collection method so that we can more efficiently continue to bring voice interactions to everyone in our community.
As voice assistant technology becomes more and more common in our daily lives, we think that interacting with assistants should feel as natural as interacting with people. The more systems sound like people, behave like people, and are personalized to people’s regional dialects, the more seamless future innovations will be. With our new TTS system, we’re laying the foundation to build flexible systems that will make this vision a reality.
As a next step, we are continuing to add more languages, accents, and dialects to our voice portfolio. And we’re focused on making our system lighter and more efficient so it can run on smaller devices.
We’re also exploring features to make our voice respond intelligently with different styles of speaking based on the context. For example, when you’re rushing out the door in the morning and need to know the time, your assistant would match your hurried pace. When you’re in a quiet place and you're speaking softly, your AI assistant would reply to you in a quiet voice. And later, when it gets noisy in the kitchen, your assistant would switch to a projected voice so you can hear the call from your mom.
All these advancements are part of our broader efforts in making systems capable of nuanced, natural speech that fits the content and the situation. When combined with our cutting-edge research in empathy and conversational AI, this work will play an important role in building truly intelligent, human-level AI assistants for everyone.