Textless NLP: Generating expressive speech from raw audio

September 9, 2021

Text-based language models such as BERT, RoBERTa, and GPT-3 have made huge strides in recent years. When given written words as input, they can generate extremely realistic text on virtually any topic. They also provide useful pretrained models that can be fine-tuned for a variety of difficult natural language processing (NLP) applications, including sentiment analysis, translation, information retrieval, inference, and summarization, using only a few labels or examples (e.g., BART and XLM-R).

There is an important limitation, however: These applications are mainly restricted to languages with very large text data sets suitable for training AI models.

We’re introducing Generative Spoken Language Model (GSLM), the first high-performance NLP model that breaks free of this dependence on text. GSLM leverages recent breakthroughs in representation learning, allowing it to work directly from only raw audio signals, without any labels or text. It opens the door to a new era of textless NLP applications for potentially every language spoken on Earth—even those without significant text data sets.

GSLM also enables the development of NLP models that incorporate the full range of expressivity of oral language.

Previously, connecting an NLP application to speech inputs meant that researchers had to first train an automatic speech recognition (ASR) system, a resource-intensive operation that introduced errors, did a poor job of encoding casual linguistic interactions, and was available for just a handful of languages. With textless NLP, our hope is to make ASR obsolete and to work in an end-to-end fashion, from speech input to speech output. We think preschool children’s ability to learn about language solely from raw sensory inputs and audio interactions is an exciting template for the future advances this research may enable.

We are now sharing our baseline GSLM model, which has three components: an encoder that converts speech into discrete units that represent frequently recurring sounds in spoken language; an autoregressive, unit-based language model that’s trained to predict the next discrete unit based on what it’s seen before; and a decoder that converts units into speech.

The wide-ranging benefits of textless NLP

The NLP field has almost always used written text for training models. This works very well for languages like English, which have enormous text data sets suitable for training. But the majority of the world’s languages lack these extensive data sets, which means they have been largely unable to benefit from NLP technology. Upending this dynamic was an exciting challenge that required the work of a multidisciplinary team of Facebook AI researchers with expertise in signal processing, speech processing, NLP, and psycholinguistics.

Our research breaks new ground by training language models on textless inputs, which is foundationally important for several reasons.

First, textless NLP technology should make AI more inclusive and able to model a richer variety of languages than is possible today. This approach opens up the possibility of training models for any spoken language.

Second, by having access to the full expressivity of oral language, models should incorporate nuances and intonations; encode irony, anger, and uncertainty; and use vocalizations like laughter, yawning, and mouth clicks. Because of the rich expressivity of oral language, textless NLP may actually work better than using text for training models, even in text-rich languages like English.

Third, researchers should be able to train models on audio-first experiences, such as podcasts, radio shows, and social audio apps, without annotation or training an ASR. Textless NLP opens up the possibility of a set of applications never before envisioned, such as online expressive translation for multilingual video games, for example, or content search and summarization from archived audio.

And finally, these models may help developmental psychologists and speech and language clinicians predict how infants’ ability to learn to speak and to understand speech is affected by variations in the linguistic input available in different languages.

In addition to helping advance these broader research goals, GSLM offers concrete benefits to those working in NLP today. Researchers will be able to pretrain models with a simple next sound unit prediction task, and fine-tune them for end-to-end tasks without any need for text. For instance, our work has enabled the first audio-only speech-to-speech translation system. Further work will address textless versions of standard NLP tasks, such as sentiment analysis, document retrieval, summarization, and more.

Building and evaluating a baseline model

Our first step was building our baseline model and evaluating it on two simple end-to-end tasks. The first is discrete resynthesis, where an input wave is encoded into a series of discrete units, which we call pseudo-text, and then used to resynthesize the input in the “voice” of the model. The second task is speech generation, where the language model is used to sample a new pseudo-text, either unconditionally or conditioned on an input prompt passed through the encoder.

The architecture of our model. The encoder converts the speech waveform to discrete units (S2u), the decoder does the opposite mapping (u2S), and the unit-based language model models the distribution of unit sequences (pseudo-text).

We tested three state-of-the-art encoders: CPC, wav2vec 2.0, and HuBERT, followed by k-means clustering and deduplication (removing successive identical units). We used a standard causal Transformer for language modeling and Tacotron 2, a standard text-to-speech system, as our decoder.
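The quantization and deduplication steps described above can be illustrated with a minimal sketch. It assumes the k-means centroids have already been learned and that encoder features arrive as one vector per frame. The toy 2-D features, the 3-entry codebook, and the function names `quantize` and `deduplicate` are hypothetical and are not taken from the released code.

```python
from itertools import groupby
from math import dist


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: dist(frame, centroids[k]))
        for frame in frames
    ]


def deduplicate(units):
    """Collapse each run of identical successive units into a single unit."""
    return [unit for unit, _ in groupby(units)]


# Toy 2-D "encoder features" and a 3-entry codebook (illustrative values only).
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1), (0.0, 0.1)]

units = quantize(frames, centroids)   # frame-level units
pseudo_text = deduplicate(units)      # deduplicated pseudo-text
print(units)        # [0, 0, 1, 1, 2, 0]
print(pseudo_text)  # [0, 1, 2, 0]
```

Note that deduplication discards how long each unit lasted, which matters later when prosody is modeled.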

We trained our encoder and unit-based language model (uLM) on 6,000 hours of Libri-Light and Librispeech (a large collection of audiobooks), and the decoder on Librispeech and LJspeech. The entire stack was trained with self-supervision from raw audio, with no text or labels, and the language model and text-to-speech components were trained on pseudo-text derived from that raw audio.

In comparing these different models, we couldn’t analyze the generated pseudo-text, because the units don’t map one-to-one with letters or phonemes. Good models typically use 100 units or more, and they generally encode stretches of speech that are shorter than phonemes. So we used a pretrained ASR to convert the generated audio back to text. This enabled us to measure the intelligibility of the resynthesized audio using phoneme error rate (PER), a comparison of the phonemes of the original input with the phonemes retranscribed by the ASR. It also enabled us to measure the linguistic quality and diversity of the conditional or unconditional generated audio using an area under the curve (AUC) metric. The AUC is obtained by sampling sentences across a range of “temperatures,” which we define as the degree of inventiveness of a language model. The lower the temperature, the more rigid a model is; the higher the temperature, the more variable the model.
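As a rough illustration of the PER metric described above, the sketch below computes an error rate as the edit distance between reference and retranscribed phoneme sequences, divided by the reference length. This is the standard textbook definition and is assumed here, since the post does not spell out the exact formula. The phoneme sequences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phoneme
                curr[j - 1] + 1,         # insertion of a hypothesis phoneme
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the length of the reference sequence."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative phoneme sequences: one substitution and one insertion.
reference = ["DH", "AH", "K", "AE", "T"]
hypothesis = ["DH", "AH", "K", "AH", "T", "S"]

per = phoneme_error_rate(reference, hypothesis)
print(per)  # 0.4
```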

Our two evaluation metrics, AUC and PER.

In performing these measurements, we discovered several things. First, it matters how many discrete units the quantizers use: A higher number yields better outcomes at the acoustic level, though at the cost of higher bit rates. Second, there’s a similar trend at the linguistic level, but in certain instances, using too many units becomes detrimental. Third, different encoders produced very different outcomes, with HuBERT providing the best overall result. Fourth, automatic generation metrics correlate well with human judgments. And last, these metrics were predicted by faster-to-compute zero-shot metrics from the Zero Resource Speech Benchmark, which functioned as good proxies for fast iterations.

Automatic and human metrics (lower is better) for three encoders (wav2vec, CPC, and HuBERT) plus LogMel for comparison, which are quantized using k-means on three dictionary sizes (50, 100, and 200). The x-axis is the resulting bit rate of the units.
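To make the link between dictionary size and bit rate concrete, here is a minimal sketch of one common way to estimate the bit rate of a unit stream: units per second multiplied by the empirical entropy of the units. Whether this matches the exact estimator behind the figure is an assumption, and the unit sequence and duration are invented.

```python
from collections import Counter
from math import log2


def unit_entropy_bits(units):
    """Empirical entropy of the unit distribution, in bits per unit."""
    counts = Counter(units)
    total = len(units)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def bit_rate(units, duration_seconds):
    """Estimated bits per second: unit rate times entropy per unit."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return len(units) / duration_seconds * unit_entropy_bits(units)


# 100 units over 2 seconds, using 4 symbols equally often (2 bits per unit).
units = [0, 1, 2, 3] * 25
rate = bit_rate(units, duration_seconds=2.0)
print(rate)  # 100.0
```

With a dictionary of K units, the entropy is at most log2(K) bits per unit, so a larger dictionary raises the ceiling on the bit rate.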

Here are some unconditionally generated samples from our best models (CPC or HuBERT on 100 units), which were trained on Libri-Light 6k. More samples are available here.

With a low temperature, sentences are repetitive (the transcriptions are made by an ASR):

Generation (temperature: 0.3) THE PROPERTY BY JAMES RESELL RED FOR LIBERATA OR BY JASON DOWNY THE PROPERTY BY JASON DOWNY THE PROPERTY THE PROPERTY THE PROPERTY THE PROPERTY

With a medium temperature, they become locally coherent (for a few words) and more varied:

Generation (temperature: 1.0) BUT IT IS ATTENDANT FROM THE PEOPLE TO DEFEND HIMSELF FROM THIS INFORMATION PRIDE OF THE POTENTIAL IN CRIMINAL ACTIVITY A CURIOSITY AND IMPETUOSITY OF THE WORLD A WAR SOON ACQUIRED

With a high temperature, they are quite varied but become incoherent. Some passages aren’t composed of actual words:

Generation (temperature: 1.5) ATION OF PURE BLUE HE SAID AT ONCE A LICKING STREAMY AT HER WARM SPOT OF HALF PERFORMED NOTE WAS A RAGING OATH LET IT AS BIR OF AMOLE IN MOOD STROLLING ER CRASS
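The effect of temperature in the samples above can be sketched with standard temperature-scaled sampling: the language model's scores (logits) for the next unit are divided by the temperature before the softmax. This is the usual formulation and is assumed here rather than taken from the released code. The logits are made up.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_unit(logits, temperature, rng):
    """Draw the index of the next unit from the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical next-unit logits over a 4-unit vocabulary.
logits = [2.0, 1.0, 0.0, -1.0]
rng = random.Random(0)
for temperature in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    unit = sample_unit(logits, temperature, rng)
    print(temperature, [round(p, 3) for p in probs], unit)
```

At low temperature nearly all probability mass sits on the top-scoring unit, which is consistent with the repetitive output above. At high temperature the distribution flattens and unlikely units are sampled more often.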

Here’s an example of a generated continuation conditioned on the prompt "This reality begins to explain the dark pow[..]" (Introduction to J. Verne’s Twenty Thousand Leagues Under the Sea by P.F Walter) using a medium temperature (HuBERT 100). The model is able to complete an incomplete word (pow[..] → POWER) and continue using words in the same general mood (dark→ BLACKNESS). It also tends to repeat itself (MAGICAL):

Prompt and continuation: THIS REALITY BEGINS TO EXPLAIN THE DARK POWER OF THE MAGICAL BLACKNESS AND IN THE MIDST OF IT IS MAGICAL AS A SINGLE BLACKNESS OF THE PAIN

Encoding and decoding prosody

While the units our encoders discovered are not phonemes, they have many of the same properties: They encode phonetic contrasts (like differentiating between “pa” and “ba”) while ignoring speaker and channel information. Further, like phonemes, they often ignore more global speech properties that are nevertheless expressive, like intonation and rhythm. These properties are known as prosody. So our second step is capturing prosody by improving the encoder and decoder.

To do this, we train a variational autoencoder utilizing vector quantization to acquire a unique latent representation. This so-called VQ-VAE system is fed with pitch (F0) information together with a simplified text-to-speech system that inputs the discrete — non-deduplicated — pseudo-phone units described above; the quantized pitch from the VQ-VAE; and a learned speaker embedding.

In the architecture of our unsupervised disentangling encoder-decoder, pseudo-text units are encoded on the top left, the quantized pitch units in the middle, and the speaker embeddings in the bottom. On the right, the decoder reconstructs the waveform.

We evaluated this architecture on LJspeech (single speaker) and VCTK (multispeaker) and found again that HuBERT-based units provide very good results both for objective metrics and subjective evaluation scores.

Our system’s performance when trained on two data sets (LJ: single speaker, and VCTK: multiple speakers), compared with the original audio (Ground Truth, GT) and three types of discrete units (CPC, HuBERT, VQ-VAE). We evaluated the resynthesized audio along three dimensions (content, F0, and speaker) using automatic techniques, as well as globally with human evaluators (Mean Opinion Score, MOS).

As the speech and prosodic units achieve a high degree of speaker independence, our model is able to perform voice transfer by changing the output speaker embedding while preserving the phonetic units and the prosody of the original input:

Original message | Reference voice | Converted (HuBERT 100)
Original 1 | Ref 1 | Output 1
Original 2 | Ref 2 | Output 2
Original 3 | Ref 3 | Output 3
Original 4 | Ref 4 | Output 4

It can also be used as a speech codec, transmitting only a voice embedding and the discrete codes for units and prosody. Our system compared favorably with current speech codecs while using a much lower bit rate. To be precise, this represents a 20x compression factor when compared with Opus, a standard codec with similar compression quality, and 2x when compared with the latest research speech codec using vector quantized variational autoencoders. However, while our system achieves high compression rates, it’s specialized to speech and cannot encode other forms of audio, such as music.

Subjective resynthesis score (MUSHRA, higher is better) as a function of bit rate (lower is better) for different codecs. Our super-low bit rate unsupervised codec is in green.

Examples of both voice transfer and voice codec use cases are available here.

Jointly modeling content and prosody

Our final step was to incorporate expressive prosody in the LM and jointly model the content aspect of speech and its prosodic aspect. We introduced a multistream causal Transformer, where the input and output layers have multiple heads, one for each of the speech channels we choose to model. Here, we used three channels: pseudo-phone units, duration, and quantized pitch.

Multistream causal transformer, where the discrete pseudo-phone units u are supplemented with their duration d and their quantized log pitch lf.
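To show what the three input channels might look like as data, the sketch below turns frame-level units and frame-level F0 into one (unit, duration, pitch bin) triple per deduplicated unit. The post does not describe the exact quantization scheme, so the rest is assumed for illustration: pitch is averaged over voiced frames, binned uniformly in log space, and bin 0 is reserved for unvoiced segments. The function name `to_multistream` and all values are hypothetical.

```python
from itertools import groupby
from math import log


def to_multistream(frame_units, frame_f0, n_pitch_bins=4, f0_min=50.0, f0_max=400.0):
    """Build (unit, duration_in_frames, pitch_bin) triples from frame-level inputs.

    frame_f0 is in Hz, with 0.0 marking unvoiced frames. Pitch bins run from
    1 to n_pitch_bins; bin 0 means the whole segment is unvoiced.
    """
    if len(frame_units) != len(frame_f0):
        raise ValueError("units and F0 must have the same number of frames")
    lo, hi = log(f0_min), log(f0_max)
    steps, start = [], 0
    for unit, run in groupby(frame_units):
        duration = len(list(run))
        voiced = [f for f in frame_f0[start:start + duration] if f > 0]
        if voiced:
            mean_log_f0 = sum(log(f) for f in voiced) / len(voiced)
            clipped = min(max(mean_log_f0, lo), hi)
            index = int((clipped - lo) / (hi - lo) * n_pitch_bins)
            pitch_bin = 1 + min(index, n_pitch_bins - 1)
        else:
            pitch_bin = 0
        steps.append((unit, duration, pitch_bin))
        start += duration
    return steps


# Six frames: unit 7 voiced at 100 Hz, unit 3 unvoiced, unit 9 voiced at 400 Hz.
frame_units = [7, 7, 7, 3, 3, 9]
frame_f0 = [100.0, 100.0, 100.0, 0.0, 0.0, 400.0]

steps = to_multistream(frame_units, frame_f0)
print(steps)  # [(7, 3, 2), (3, 2, 0), (9, 1, 4)]
```

Each triple corresponds to one time step of the multistream model, with one input and one output head per element.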

As in our baseline model, this prosodic-GSLM is trained from the raw waveform of audiobooks. Adding these extra channels and tasks improves the LM performance in terms of perplexity scores of the units. We also show that the system can now generate multiple realistic prosodic “inpaintings” for the same prompt (where we impose the phonetic content and sample only duration and pitch).

Prosodic “inpainting” task, where we fix the pseudo-phonetic units and let the system generate different prosodies for it (here, the first three seconds of the prosody are also fixed).

This trained model can also jointly generate novel content and prosody congruently with the prompt’s expressive style. Here are continuations of the prompt "When an aristocracy carries on public affairs, its [..]" from a rather formal rendering of Alexis de Tocqueville’s political essay Democracy in America:

Original sentence Continuation 1 Continuation 2 Continuation 3

And here are continuations from the prompt "She was quite shocked when I asked her whether wine was allowed [..]" from an expressive rendering of Jane Austen’s novel Mansfield Park:

Original sentence Continuation 1 Continuation 2 Continuation 3

More examples can be found here: https://speechbot.github.io/pgslm

Where we go next

As our research continues, our next goal is to apply GSLM to data sets of casual and spontaneous speech and dialogue, where text-based methods and ASR struggle most. In addition, we wish to show that GSLM can be an effective method for pretraining models for downstream tasks that have little labeled data available, like spoken summarization, spoken sentiment analysis, and information retrieval. Our goal is to leverage the tremendous advantages in expressivity and subtlety of meaning that oral language offers over written language. We also want to make it possible to train models on any of the world’s languages, which opens up an almost infinite collection of potential data for understanding human thought. We hope to share updates on our work as it progresses.

The work discussed in this blog post reflects the contributions of Yossi Adi, Jade Copet, Emmanuel Dupoux, Wei Ning Hsu, Evgeny Kharitonov, Kushal Lakhotia, Ann Lee, Abdelrahman Mohamed, Tu Anh Nguyen, Adam Polyak, and Morgane Rivière (listed in alphabetical order).

Get our step 1 GSLM paper
Get our step 2 expressive resynthesis paper
Get our step 3 prosody-aware GSLM paper
Get the code + pretrained models