Computers already play chess, and they have become unbeatable opponents; we let them read our texts, and they started to write. They have also learned to paint and to retouch photographs. Did anyone doubt that artificial intelligence would be able to do the same with speech and music?

Google’s research division has presented AudioLM, a framework for generating high-quality audio that remains consistent over the long term. It starts from a recording only a few seconds long and is able to prolong it in a natural and coherent way. What is remarkable is that it achieves this without being trained on transcriptions or annotations, yet the generated speech is syntactically and semantically correct. Moreover, it preserves the identity and prosody of the speaker to such an extent that the listener cannot discern which part of the audio is original and which has been generated by an artificial intelligence.

The examples produced by this artificial intelligence are striking. Not only can it replicate articulation, pitch, timbre and intensity, it can also reproduce the sound of the speaker’s breathing and form meaningful sentences. If it starts not from studio audio but from a recording with background noise, AudioLM replicates that noise to give the clip continuity. More samples can be heard on the AudioLM website.

Google Brain

AudioLM is an artificial intelligence trained on both semantics and acoustics. How does it do it? Generating audio or music is nothing new, but the way Google’s researchers have devised to tackle the problem is. From each audio clip, semantic tokens are extracted that encode high-level structure (phonemes, lexicon, semantics…), along with acoustic tokens (speaker identity, recording quality, background noise…). With the data processed into a form the artificial intelligence can work with, AudioLM establishes a hierarchy: it first predicts the semantic tokens, which then serve as conditions for predicting the acoustic tokens. The acoustic tokens are finally decoded to convert the bits back into something humans can hear.
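To make that hierarchy concrete, here is a toy Python sketch of the coarse-to-fine pipeline. Every class, function and parameter below is a hypothetical stand-in invented for illustration; the real system uses learned tokenizers (w2v-BERT for semantic tokens, SoundStream for acoustic ones) and Transformer language models, none of which are exposed through a public API.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyStage:
    """Stand-in for one autoregressive stage of the hierarchy."""

    def __init__(self, vocab_size, tokens_per_second):
        self.vocab_size = vocab_size
        self.rate = tokens_per_second

    def tokenize(self, waveform, sample_rate=16000):
        # Map audio to a discrete token sequence (random here;
        # learned by a neural tokenizer in the real system).
        n = int(len(waveform) / sample_rate * self.rate)
        return rng.integers(0, self.vocab_size, size=n)

    def generate(self, prefix, conditioning=None, n_new=100):
        # A real stage samples token-by-token from a Transformer,
        # conditioned on the previous stage; random tokens stand in.
        new = rng.integers(0, self.vocab_size, size=n_new)
        return np.concatenate([prefix, new])

def continue_audio(prompt, semantic_stage, acoustic_stage):
    # Stage 1: predict how the high-level structure (phonemes,
    # lexicon, semantics) continues beyond the prompt.
    semantic = semantic_stage.generate(semantic_stage.tokenize(prompt))
    # Stage 2: predict acoustic tokens (speaker identity, recording
    # conditions, background noise) conditioned on the semantic plan.
    acoustic = acoustic_stage.generate(
        acoustic_stage.tokenize(prompt), conditioning=semantic
    )
    # Stage 3: a neural codec decoder would turn acoustic tokens back
    # into a waveform; silence is returned here as a placeholder.
    return np.zeros(len(acoustic) * 320)

prompt = rng.standard_normal(16000 * 3)  # three seconds of stand-in audio
continuation = continue_audio(prompt, ToyStage(1024, 25), ToyStage(1024, 50))
```

The ordering is the point of the design: deciding *what* to say before deciding *how it sounds* is what lets the model stay coherent over long stretches while keeping the speaker’s voice intact.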

This separation of semantics from acoustics, and the hierarchy between them, is not only a beneficial practice for training language models to generate speech. According to the researchers, it is also more effective for continuing piano compositions, as they show on their website, clearly outperforming models trained on acoustic tokens alone.

The most significant thing about AudioLM is not that it can continue speeches and melodies, but that it can do everything at once. It is therefore a single language model that could be used to convert text to speech (a robot could read entire books) or to let any device communicate with people using a familiar voice. Amazon has already explored this idea, having considered using the voices of loved ones in its Alexa speakers.

Innovation or danger?

Tools such as DALL·E 2 and Stable Diffusion are exceptional: they allow ideas to be sketched out or creative resources to be generated in a few seconds. Audio could be even more consequential. One can imagine an announcer’s voice being used on demand by various companies, or films dubbed with the voices of deceased actors. The reader may be wondering whether this possibility, while exciting, is not dangerous: any audio recording could be manipulated for political, legal or judicial purposes. Google says that, while humans may have difficulty telling what comes from a human and what comes from an artificial intelligence, a computer can detect whether the audio is organic or not. In other words, not only can the machine replace us, but another machine will be essential to assess its work.
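As a rough illustration of what such a detector might look like (this is a generic sketch, not the classifier Google describes; the features and training data below are invented), one could train a binary model on spectral summaries of real versus generated clips:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def spectral_features(waveform, n_fft=512):
    # Crude summary: normalised magnitude per frequency bin.
    spectrum = np.abs(np.fft.rfft(waveform[:n_fft]))
    return spectrum / (spectrum.sum() + 1e-9)

# Stand-in data: white noise plays the "real" recordings, smoothed
# (low-pass) noise plays "generated" audio with a different spectral
# signature that the classifier can latch onto.
real = [rng.standard_normal(16000) for _ in range(100)]
fake = [np.convolve(rng.standard_normal(16000), np.ones(8) / 8, mode="same")
        for _ in range(100)]

X = np.stack([spectral_features(w) for w in real + fake])
y = np.array([0] * len(real) + [1] * len(fake))

detector = LogisticRegression(max_iter=1000).fit(X, y)
print(f"training accuracy on toy data: {detector.score(X, y):.2f}")
```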

For the moment AudioLM is not open to the public; it is only a language model that can be integrated into different projects. But this demonstration, together with OpenAI’s Jukebox music programme, shows how quickly we are entering a new world in which nobody will know, or care, whether a photograph was taken by a person, or whether it is a person or an artificially generated voice on the other end of the phone in real time.