OpenAI has introduced a new automatic speech recognition (ASR) system called Whisper, released as open-source software on GitHub. Whisper can transcribe speech in multiple languages and translate it into English, and OpenAI claims that Whisper’s training makes it better at distinguishing voices in noisy environments and at understanding heavy accents and technical language.

Automatic speech recognition, often called ASR, turns spoken language into text: speech-to-text software that automatically converts your voice into written words.

This technology has many applications, including dictation software and visual voicemail.

Open-source speech-to-text: a look at Whisper, OpenAI’s automatic speech recognition (ASR) tool

OpenAI trained Whisper on 680,000 hours of audio and corresponding transcripts in 98 languages collected from the web. According to OpenAI, this broad collection approach led to “better robustness to accents, background noise and technical language.” Whisper can also detect which language is being spoken and translate the speech into English.
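For readers who want to try this themselves, the openai/whisper repository exposes a small Python API. The sketch below follows the usage documented in the project’s README: it loads a pretrained checkpoint, transcribes an audio file in its original language, and then re-runs the model with the translation task. The file name audio.mp3 is a placeholder.

```python
# Minimal Whisper usage, following the openai/whisper README.
# pip install -U openai-whisper  (ffmpeg must also be installed on the system)
import whisper

# Load a pretrained checkpoint; "base" trades some accuracy for speed.
model = whisper.load_model("base")

# Transcribe in the original spoken language.
result = model.transcribe("audio.mp3")  # placeholder file name
print(result["language"], result["text"])

# Translate the speech into English instead of transcribing it verbatim.
translated = model.transcribe("audio.mp3", task="translate")
print(translated["text"])
```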

OpenAI describes Whisper as an encoder-decoder transformer, a type of neural network that uses context gleaned from the input sequence to learn associations it can then map to the output sequence. OpenAI presents this overview of how Whisper works:

The input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed to an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
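The lower-level API in the openai/whisper package mirrors this pipeline step by step. The sketch below, adapted from the README’s example, pads or trims the audio to the 30-second window, builds the log-Mel spectrogram, identifies the spoken language, and decodes the spectrogram into text; again, audio.mp3 is a placeholder.

```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to exactly 30 seconds, the model's window size.
audio = whisper.load_audio("audio.mp3")  # placeholder file name
audio = whisper.pad_or_trim(audio)

# Convert the waveform into a log-Mel spectrogram on the model's device.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The same model also performs language identification via special tokens.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the spectrogram into text.
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```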

While impressive, OpenAI’s research paper suggests that Whisper only performs really well in about ten languages, a limitation that likely stems from the fact that roughly two-thirds of the training data is in English. OpenAI also admits that Whisper’s raw accuracy is not always on par with specialized models; its advantage is the “robust” nature of its training, which lets it discern and transcribe speech through background noise and accent variations. That same training, however, creates new problems.

“Our studies show that, compared to many existing ASR systems, the models have better robustness to accents, background noise and technical language, as well as translation from multiple languages into English, and that the accuracy of speech recognition and translation is close to the state of the art,” the OpenAI researchers explain on GitHub. “However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include text that is not actually spoken in the audio (i.e., the so-called hallucination phenomenon). We hypothesize that this occurs because, given their general knowledge of language, the models combine trying to predict the next word in the audio with trying to transcribe the audio itself.”
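In practice, the transcribe() helper in the openai/whisper package exposes a few decoding safeguards that can reduce, though not eliminate, hallucinated text. The sketch below makes those parameters explicit; the threshold values are the package’s documented defaults, and the file name is a placeholder.

```python
import whisper

model = whisper.load_model("base")

# These thresholds cause low-confidence decodings to be retried or discarded,
# which filters out some hallucinated segments. The values shown are the
# package defaults, except condition_on_previous_text, disabled here so that
# one hallucinated segment cannot propagate into the next 30-second window.
result = model.transcribe(
    "audio.mp3",                        # placeholder file name
    compression_ratio_threshold=2.4,    # reject gibberish / repetitive output
    logprob_threshold=-1.0,             # reject low average token probability
    no_speech_threshold=0.6,            # skip windows classified as silence
    condition_on_previous_text=False,   # don't feed prior text back as context
)
print(result["text"])
```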

OpenAI is getting a lot of press coverage for GPT-3 and for products such as the DALL-E text-to-image generator, and Whisper provides a glimpse of how the company’s AI research is expanding into other areas. While Whisper itself is open source, neural-network-based speech recognition has clear, proven value for individuals and businesses, so Whisper could become the starting point for a paid OpenAI offering, as researchers have already speculated.

OpenAI anticipates that the transcription capabilities of the Whisper models can be used to improve the accessibility of certain tools. Although the Whisper models cannot be used out of the box for real-time transcription, their speed and size suggest that others may be able to build applications on top of them that enable near-real-time speech recognition and translation.
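As a rough illustration of what such an application might look like, the sketch below simulates a streaming feed by slicing a prerecorded file into 30-second windows and transcribing each one with the small “tiny” checkpoint. A real near-real-time system would buffer microphone input instead, but the windowing logic would be similar; the file name and chunking scheme are assumptions for illustration.

```python
import whisper

# "tiny" is the smallest, fastest checkpoint: a plausible choice when latency
# matters more than accuracy.
model = whisper.load_model("tiny")

SAMPLE_RATE = 16000    # Whisper models expect 16 kHz audio
WINDOW_SECONDS = 30    # the model's native input window

# Simulate a live feed by stepping through a prerecorded file in 30 s windows.
audio = whisper.load_audio("meeting.mp3")  # placeholder file name
window = SAMPLE_RATE * WINDOW_SECONDS

for start in range(0, len(audio), window):
    chunk = whisper.pad_or_trim(audio[start:start + window])
    result = model.transcribe(chunk, fp16=False)  # fp16=False avoids a CPU warning
    print(f"[{start // SAMPLE_RATE:>5d}s] {result['text'].strip()}")
```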