
Text-to-Speech Models: Multilingual Capabilities and Voice Cloning

Text-to-Speech (TTS) technology has revolutionized the way we interact with digital content. By converting written text into spoken words, TTS is useful for marketing, customer service, and many other applications related to audio and human speech.

TTS technology has evolved considerably over the past two years, driven by the trend of training large models on extensive datasets sourced from the public internet. This approach minimizes the need for intensive labeling efforts. As a result, TTS-synthesized voices are becoming increasingly difficult to distinguish from human voices. Notably, Microsoft recently unveiled a new model, VALL-E 2, which they claim has achieved “human parity”.

In this article, we take a look at the best available models. In particular, we use the Coqui AI toolkit to analyze its XTTS v2 multilingual model.

Models timeline

XTTSv2

Similar to a Large Language Model (LLM), TTS models like XTTS v2 are autoregressive models that operate with tokens. The input text is tokenized and fed into the speech synthesis model, which then processes the input in iterations to generate audio chunks.

Regarding its architecture, XTTS v2 uses a VQ-VAE model to convert the audio into discrete audio tokens. A GPT model then predicts these audio tokens from the input prompt text and speaker latents, which are generated by a series of self-attention layers. Finally, the GPT model’s output is fed into a UnivNet vocoder model, which produces the generated audio signal.

The model was pre-trained on around 30k hours of audio split across 16 languages, though mostly English. It can also be fine-tuned on a custom dataset for better performance. In addition, it can generate the voice of a speaker unseen during training from less than 10 seconds of audio of that voice, a capability known as zero-shot cloning.

A nice capability is cross-language voice cloning, in which the model receives a cloning sample in one language and a prompt in another, generating audio with the sample’s voice tone but in the prompt’s language. XTTS stands out from other models by supporting many languages beyond English, including Polish and Czech.
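To give an idea of how this looks in practice with the Coqui toolkit, below is a minimal sketch of zero-shot and cross-language cloning through its Python API. The model identifier follows Coqui's published naming for XTTS v2; the reference clip, output paths and example sentences are placeholders.

```python
# Minimal sketch: zero-shot and cross-language cloning with the Coqui TTS API.
# "reference_en.wav" and the output paths are placeholder file names.
from TTS.api import TTS

# Load the pretrained multilingual XTTS v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Zero-shot cloning: synthesize English text with the voice from a short reference clip
tts.tts_to_file(
    text="This voice was cloned from just a few seconds of audio.",
    speaker_wav="reference_en.wav",  # short sample of the target voice
    language="en",
    file_path="output_en.wav",
)

# Cross-language cloning: same reference voice, but the prompt (and output) is in Polish
tts.tts_to_file(
    text="Ten głos został sklonowany z zaledwie kilku sekund nagrania.",
    speaker_wav="reference_en.wav",
    language="pl",
    file_path="output_pl.wav",
)
```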

The model can also perform streaming inference, producing real-time generated audio with approximately 200ms latency. The time taken for audio generation increases linearly with longer prompt texts. For a prompt text of about 45 words, XTTSv2 requires around 1 second to generate the audio. If we include the model loading time, it takes about 25 seconds. However, this loading cost is only paid once if the model is kept in memory.
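As a sketch of how streaming can be driven from the lower-level XTTS interface, the snippet below follows the pattern in Coqui's documentation: the speaker latents are computed once from a reference clip, and audio chunks are consumed as they are generated. The checkpoint paths and reference clip are placeholders, and method names may differ between library releases.

```python
# Sketch of streaming inference with the low-level Coqui XTTS interface.
# Paths are placeholders; see the Coqui docs for the exact checkpoint layout.
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("/path/to/xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts_v2/", eval=True)
model.cuda()

# Compute the speaker latents once from a short reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# Audio chunks become available as soon as they are generated,
# so playback can start long before the full sentence is synthesized.
chunks = model.inference_stream(
    "Streaming lets playback start before the whole sentence is synthesized.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)
wav = torch.cat([chunk for chunk in chunks], dim=0)  # or play each chunk as it arrives
```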

Cloning a voice with XTTS Zero-shot

Zero-shot cloning is the simplest way to clone a voice. It requires just a short sample of speech of the voice to be cloned, and the model will synthesize any desired text in that particular voice. When doing zero-shot cloning, the model uses the provided sample to learn the tone of each phoneme, and it can then clone them and even extrapolate to phonemes it has not heard.

In the following audio, we create a synthesized voice by performing zero-shot cloning on a custom voice using a 13-second recording of that voice.

Zero-shot cloning can also be used to impart desired characteristics to the synthesized voice, such as emotion. In the next audio, we used it to change the voice’s emotion to sadness, conditioning the model on an 18-second recording with a noticeable emotion.

Going further with XTTS: Fine-Tuning

When zero-shot capabilities are not enough, the alternative is to fine-tune the model on a custom dataset of the target voice.

Zero-shot cloning sometimes cannot produce results of the desired quality. To achieve a higher-quality generated voice, fine-tuning is the next step: by retraining XTTSv2 on a custom dataset containing ample samples of the desired voice, the model improves its emulation of that voice.

Unlike zero-shot cloning, fine-tuning benefits from larger datasets. Only audio files are needed to create a dataset: in the background, XTTSv2 uses Whisper to process the dataset by transcribing the audio, as illustrated in the sketch below.
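The transcription step is handled automatically by the fine-tuning tooling, but the sketch below illustrates the idea: each raw clip is transcribed with Whisper and written to an LJSpeech-style metadata file. The directory layout, file names, and the use of the openai-whisper package are assumptions made for the illustration.

```python
# Illustration only: Coqui's XTTS fine-tuning tooling performs this step for you.
# Each raw clip is transcribed and paired with its text in a metadata.csv file.
import csv
from pathlib import Path

import whisper  # pip install openai-whisper

asr = whisper.load_model("medium")
clips = sorted(Path("my_voice_clips").glob("*.wav"))  # placeholder dataset folder

with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for clip in clips:
        text = asr.transcribe(str(clip))["text"].strip()
        writer.writerow([clip.stem, text])  # one row per clip: file id | transcription
```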

In the following examples, we fine-tune the model with a custom dataset of recordings of the target voice we want to clone. The first audio is the real voice reading the reference paragraph:

If we use a few seconds of those recordings for zero-shot cloning, the generated voice sounds like this:

And the following are the results of generating that same phrase after fine-tuning the model on differently sized datasets of recordings of the target voice.

Fine-tuning the model with 23 hours of recordings:

While using only 45 minutes for fine-tuning:

Although it is hard to clearly distinguish which of the fine-tuned voices is better, both resemble the original voice more closely than the zero-shot generation. Zero-shot cloning is a great resource when only a few seconds of the voice are available, but with access to a large dataset, fine-tuning lets the model better learn how the different phonemes sound in that voice and replicate them more faithfully.

Comparing models

We have seen what voices generated with XTTS v2 sound like. Does every model sound the same? Certainly not, although the differences might be hard to identify. Here we show how several state-of-the-art models synthesize the target voice, using the following sample for zero-shot cloning.

Speaker prompt

XTTSv2 Zero-shot

OpenVoice Zero-shot

VALL-E Zero-shot

The best model might not be the same in every case, as it depends on the characteristics of the voice being generated and, of course, on the speaker prompt used for zero-shot cloning. Furthermore, judgments of which voice is better may differ between listeners: one person could prefer a voice that sounds artificial to others.

To deal with this evaluation subjectivity, researchers often conduct listening tests where participants are asked to listen to a series of audio clips and determine whether each clip was produced by a human or a TTS model. Similar to a Turing Test, but for voices.

Some of the metrics used to measure different aspects of a model’s generated voices are:

  • Naturalness is measured using CMOS (Comparative Mean Opinion Score), which compares two TTS models directly (e.g. VALL-E against XTTSv2, VALL-E against Voicebox, etc.)
  • Similarity is measured with SMOS (Similarity Mean Opinion Score), which compares generated audio to human recordings.
  • Robustness is measured using WER (Word Error Rate), which compares the transcripts of ground-truth and generated audio. Unlike the others, this is not a subjective metric: it is computed by performing Automatic Speech Recognition on the generated audio and calculating the difference with respect to the original prompt. Neural TTS systems sometimes suffer deletion, insertion, and replacement errors, and this metric measures how often these errors happen (a small sketch of the computation follows this list).
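As an illustration of the WER computation, the sketch below transcribes a generated clip with an ASR model and compares the transcript against the original prompt. The use of openai-whisper as the ASR system and of the jiwer package for the error rate, as well as the file name, are assumptions for the example.

```python
# Sketch of a WER computation: ASR transcript of the generated audio vs. the prompt.
import whisper
from jiwer import wer  # pip install jiwer

prompt_text = "The quick brown fox jumps over the lazy dog."

asr = whisper.load_model("base")
hypothesis = asr.transcribe("generated_audio.wav")["text"]  # placeholder file name

# WER = (substitutions + deletions + insertions) / number of reference words
print(f"WER: {wer(prompt_text, hypothesis):.3f}")
```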

Towards human parity

The main goal of TTS is well defined: reaching human parity. It refers to the point at which synthesized speech is indistinguishable from human speech. Achieving human parity means that the model's output is so natural and realistic that listeners cannot reliably tell whether the audio was generated by a machine or spoken by a human.

The current trend in AI is training large models on huge datasets, and TTS is certainly following this trend too. Inspired by the success of Large Language Models in natural language processing, the latest TTS models follow LLM architectures and training recipes.

Just recently, Microsoft marked a milestone by claiming its VALL-E 2 model to be the first ever to reach human parity. They indicate that the robustness, naturalness, and similarity metrics of VALL-E 2 surpass those of the ground-truth samples.

VALL-E 2 used a huge dataset of 50,000 hours of recordings and was trained for days on 16 V100 GPUs. Rented from a cloud provider, such infrastructure would cost around $1,000 per day.
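As a rough sanity check on that figure, assuming on-demand cloud pricing of roughly $2.5 to $3 per V100 GPU-hour: 16 GPUs × ~$2.75/hour × 24 hours ≈ $1,050 per day, before storage and networking costs.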

OpenAI has made similar advances and has been building Voice Engine, a model able to generate natural-sounding speech, since late 2022. This model currently powers ChatGPT’s voice features, but, similarly to Microsoft, OpenAI has chosen not to release its voice cloning capabilities to the public.

The reason holding AI companies back from releasing these powerful models is clear: security. The misuse of human-parity voice cloning carries potential risks of fraudulent calls, attacks on voice authentication systems, and the generation of fake news. With elections taking place in the US this year, it is unlikely that any of these models will be released soon.

Both OpenAI and Microsoft mention safety measures when considering the possibility of releasing the models. These measures would provide a way to detect whether a voice is real or generated, by training detection models and embedding watermarks in generated audio. Furthermore, voice authentication systems could verify that the original speaker has given their consent. But such ideas are currently just concepts, far from being implemented.

It is likely that AI giants will stay away from publicly commercializing voice cloning until these safety measures are in place. In the meantime, the development of large models will remain in the hands of those with the necessary resources, who have little motivation to release them and risk misuse.
