AI has drastically altered the way people go about their daily lives. Voice recognition has simplified activities like taking notes, typing documents, and more; its speed and efficiency are what make it so popular. With the progress made in AI, many voice recognition applications have been created. Google Assistant, Alexa, and Siri are a few examples of virtual assistants that use voice recognition software to communicate with users. Additionally, text-to-speech, speech-to-text, and text-to-text systems have been widely adopted in a variety of applications.

Creating human-level speech is essential for Artificial Intelligence (AI), especially when it comes to chatbots. Recent advances in deep learning have drastically improved the quality of synthesized speech produced by neural Text-to-Speech (TTS) systems. However, most of the data used to train these systems has been limited to recordings from controlled environments, such as reading aloud or performing a script. Human beings, on the other hand, can speak spontaneously with varied prosody that conveys paralinguistic information, such as subtle emotions. This ability is acquired from long exposure to real-world speech.

Here are three new developments that will dramatically improve text-to-speech.

Multi-codebook vector-quantized TTS

Researchers at Carnegie Mellon University have developed an artificial intelligence (AI) system that can be trained to generate text-to-speech with a wide range of voices. To do this, they trained it on real speech taken from YouTube videos and podcasts. By using existing real-world recordings, they avoid the need for controlled studio conditions and can focus on the text-to-speech task itself. They hope this approach will replicate the success of large language models such as GPT-3, which are likewise trained on vast amounts of uncurated, real-world data.

With only a limited amount of resources, these systems can be tailored to particular speaker characteristics or recording conditions. The paper examines the new challenges that arise when training TTS systems on real-world speech, such as the increased prosodic variance and background noise that are not found in speech recorded in controlled environments. The authors show that mel-spectrogram-based autoregressive models cannot maintain accurate text-audio alignment when applied to real-world speech, resulting in distorted output. This failure of alignment at inference time is attributed to errors that accumulate during decoding, since the authors also demonstrate that precise alignments can be learned during the training phase.

The researchers found that replacing the mel-spectrogram with a learned discrete codebook could solve the problem, because discrete representations are more resistant to input noise. However, their experiments showed that a single codebook still produced distorted speech even when its size was increased; spontaneous speech appears to contain too many prosody patterns for a single codebook to capture. Multiple codebooks were therefore used, together with architectures for multi-codebook sampling and monotonic alignment. A pure-silence audio prompt was used at inference time to ensure that the model produced clean speech despite being trained on a noisy corpus.
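To make the multi-codebook idea concrete, here is a minimal PyTorch sketch (an illustration only, not the authors' implementation; the frame count, latent dimension, and codebook sizes are invented). Each latent frame is split into groups, and each group is snapped to the nearest entry of its own codebook, so several small codebooks jointly describe one frame:

```python
import torch

def multi_codebook_quantize(z, codebooks):
    """Quantize each latent frame with several codebooks at once.

    z: (frames, dim) continuous latents; each codebook: (entries, dim // n_codebooks).
    Returns the discrete codes and the reconstructed (quantized) latents.
    """
    chunks = z.chunk(len(codebooks), dim=-1)      # one slice of the latent per codebook
    codes, quantized = [], []
    for chunk, cb in zip(chunks, codebooks):
        dists = torch.cdist(chunk, cb)            # distance to every codebook entry
        idx = dists.argmin(dim=-1)                # index of the nearest entry
        codes.append(idx)
        quantized.append(cb[idx])                 # replace the slice with that entry
    return torch.stack(codes, dim=-1), torch.cat(quantized, dim=-1)

# Toy usage: 100 frames of a 256-dim latent, quantized with 4 codebooks of 160 entries each.
torch.manual_seed(0)
z = torch.randn(100, 256)
codebooks = [torch.randn(160, 64) for _ in range(4)]
codes, z_q = multi_codebook_quantize(z, codebooks)
print(codes.shape, z_q.shape)  # torch.Size([100, 4]) torch.Size([100, 256])
```

Each frame is now described by a few small integers rather than one entry from a single, much larger codebook, which is what lets the model cover more prosody patterns without any one codebook becoming unwieldy.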

In this paper, the authors present their new system, MQTTS (multi-codebook vector-quantized TTS). To gauge its potential for real-world voice synthesis, they compare it against mel-spectrogram-based systems in Section 5 of the paper and carry out an ablation analysis. They then compare MQTTS to non-autoregressive models, finding that it produces better intelligibility and speaker transferability. MQTTS also shows greater prosody diversity and naturalness, though the non-autoregressive models offer faster inference and higher robustness. Furthermore, with a clean, silent prompt, MQTTS can achieve a higher signal-to-noise ratio (SNR). The authors have made their source code available on GitHub for public use.

Hugging Face Transformers has recently gained its first text-to-speech model, SpeechT5.

The highly successful T5 (Text-To-Text Transfer Transformer) was the inspiration for SpeechT5, a unified-modal framework that uses encoder-decoder pre-training for self-supervised learning of speech and text representations. The SpeechT5 model has now been added to Hugging Face Transformers, an open-source library that provides easy access to state-of-the-art machine learning models.

SpeechT5 uses a conventional encoder-decoder design to learn joint contextual representations for both speech and text. It ships with three distinct models: text-to-speech (for synthesizing audio from text), speech-to-text (for automatic speech recognition), and speech-to-speech (for speech enhancement or voice conversion).
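As a quick illustration of the text-to-speech variant, the sketch below follows the example in the Transformers documentation; the checkpoint names and the x-vector speaker-embedding dataset are the ones used in the official SpeechT5 example, and the exact API may vary slightly with your library version:

```python
# pip install transformers datasets soundfile sentencepiece
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

# TTS checkpoint plus the HiFi-GAN vocoder that turns the predicted spectrogram into audio.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Text to speech has come a long way.", return_tensors="pt")

# SpeechT5 picks a voice from a 512-dimensional speaker embedding (an x-vector).
xvectors = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(xvectors[7306]["xvector"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)  # 16 kHz mono output
```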

The core concept of SpeechT5 is to pre-train a single model on a mixture of text-to-speech, speech-to-text, text-to-text, and speech-to-speech data, which encourages the model to learn from speech and written text together. The base of SpeechT5 is a standard Transformer encoder-decoder, which transforms sequences of hidden representations like any other Transformer. Pre-nets and post-nets are added to make the same Transformer suitable for both text and audio: the pre-nets convert text or speech input into the Transformer's hidden representations, while the post-nets convert the Transformer's outputs back into text or speech. To train the model across these tasks, the team feeds it text or speech as input and has it produce the corresponding text or speech as output.
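A toy PyTorch sketch of that idea (purely conceptual, not the actual SpeechT5 code; the vocabulary size, the 80-bin mel features, and the module sizes are invented) shows how one shared encoder-decoder can serve different tasks simply by routing through different pre-nets and post-nets:

```python
import torch
import torch.nn as nn

class SharedSeq2Seq(nn.Module):
    """One Transformer encoder-decoder shared across tasks; only the pre-nets
    and post-nets differ (conceptual sketch, not the real SpeechT5 modules)."""

    def __init__(self, d_model=256, vocab=1000, n_mels=80):
        super().__init__()
        self.backbone = nn.Transformer(d_model=d_model, nhead=4,
                                       num_encoder_layers=2, num_decoder_layers=2,
                                       batch_first=True)
        # Pre-nets map raw text tokens or speech features into the shared hidden space.
        self.text_prenet = nn.Embedding(vocab, d_model)
        self.speech_prenet = nn.Linear(n_mels, d_model)
        # Post-nets map hidden states back out to text logits or mel frames.
        self.text_postnet = nn.Linear(d_model, vocab)
        self.speech_postnet = nn.Linear(d_model, n_mels)

    def forward(self, src, tgt, task="tts"):
        if task == "tts":   # text in, speech out
            h = self.backbone(self.text_prenet(src), self.speech_prenet(tgt))
            return self.speech_postnet(h)
        else:               # "asr": speech in, text out
            h = self.backbone(self.speech_prenet(src), self.text_prenet(tgt))
            return self.text_postnet(h)

model = SharedSeq2Seq()
mel = model(torch.randint(0, 1000, (1, 12)), torch.randn(1, 50, 80), task="tts")
print(mel.shape)  # torch.Size([1, 50, 80]) predicted mel frames
```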

SpeechT5 stands out from other models because it can carry out multiple tasks with a single architecture, simply by swapping the pre-nets and post-nets. The model has been fine-tuned for a variety of tasks, and experiments show that it outperforms all baseline models on a number of spoken language processing tasks. To improve the model further, the researchers plan to pre-train SpeechT5 with a larger model and more unlabeled data. They are also exploring ways to use the framework for spoken language processing tasks in multiple languages.

VALL-E by Microsoft

Microsoft has developed a revolutionary language model for text-to-speech synthesis (TTS) known as VALL-E. The model uses audio codec codes as intermediate representations and can replicate someone's voice from only three seconds of audio input. VALL-E is a neural codec language model that tokenizes speech and then generates waveforms that sound like the speaker, even replicating their unique timbre and emotional tone. As stated in the research paper, VALL-E can produce high-quality personalized speech from just a three-second sample of the speaker's voice, without additional structural engineering, pre-designed acoustic features, or fine-tuning. It also supports in-context learning and prompt-based zero-shot TTS.

Demonstration audio clips accompany the paper: one sample is the three-second prompt that VALL-E must replicate; another is a previously recorded phrase by the same speaker (the "ground truth"); the "baseline" sample is a conventional text-to-speech synthesis example; and the "VALL-E" sample is the output of the VALL-E model.
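VALL-E itself has not been released publicly, but the codec-tokenization step it builds on can be tried with the open EnCodec codec available in Transformers. The sketch below is only a rough illustration under those assumptions (the checkpoint name is the public EnCodec release, and a sine wave stands in for a real three-second enrollment prompt); it shows how a prompt becomes the discrete codes that a VALL-E-style language model would condition on:

```python
import numpy as np
import torch
from transformers import AutoProcessor, EncodecModel

# The neural audio codec: speech in, a small grid of discrete codes out.
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

# Stand-in for a three-second enrollment prompt (a 220 Hz tone at 24 kHz).
sr = processor.sampling_rate
prompt = np.sin(2 * np.pi * 220 * np.arange(3 * sr) / sr).astype(np.float32)

inputs = processor(raw_audio=prompt, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"])

# These integer codes are the "intermediate representation": a VALL-E-style model
# conditions a language model on them (plus the phonemized text) to predict the
# codes of the continuation, which the codec decoder then turns back into audio.
print(encoded.audio_codes.shape)  # (chunks, batch, codebooks, frames)
```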