Amazon Researchers Train the Largest Ever Text-to-Speech AI Model to Date

Amazon researchers have unveiled the largest text-to-speech AI model to date, which they claimed shows "emergent" qualities that enhance its ability to speak even complex sentences naturally.

According to TechCrunch, this development could mark an advancement for the technology, potentially bridging the gap known as the uncanny valley.

Amazon Researchers Train the Largest Ever Text-to-Speech AI Model to Date — Amazon researchers have unveiled the largest text-to-speech AI model to date, which they claimed shows "emergent" qualities that enhance its ability to speak even complex sentences naturally. MARK RALSTON/AFP via Getty Images

Amazon AGI's BASE TTS

While the growth and improvement of these models were anticipated, researchers aimed to achieve a notable leap in performance, similar to the advancements observed in language models, as they surpassed a certain size threshold.

The team at Amazon AGI speculated that similar progress could occur with text-to-speech models, and their findings suggest this hypothesis holds true.

The newly introduced model, dubbed Big Adaptive Streamable TTS with Emergent abilities (BASE TTS), leverages a massive dataset comprising 100,000 hours of public domain speech, predominantly in English, with additional German, Dutch, and Spanish segments.

TechCrunch reported that with 980 million parameters, the BASE TTS model stands out as the largest in its category, surpassing previous iterations in scale and capability.

In their research, the team evaluated various model sizes, including 400M- and 150M-parameter versions, to identify the emergence of desired behaviors. Surprisingly, the medium-sized model exhibited the anticipated leap in performance, particularly its ability to handle challenging speech tasks not explicitly trained for.

Despite encountering difficulties inherent to text-to-speech engines, such as mispronunciations or intonation errors, the BASE TTS model showcased remarkable proficiency in tackling complex linguistic constructs, according to the researchers.

The model's adeptness in parsing intricate sentences, emphasizing compound nouns and reproducing emotional or whispered speech underscores its promising capabilities.

Moreover, the BASE TTS models share a common architecture, suggesting that their size and training data significantly influence their ability to navigate linguistic complexities. However, it's important to note that these models remain experimental and require further refinement before widespread deployment.

'Streamable'

TechCrunch reported that a notable feature of the BASE TTS model is its "streamable" nature, allowing it to generate speech in real time at a relatively low bit rate.

Additionally, efforts have been made to encapsulate speech metadata, such as emotionality and prosody, in a separate, bandwidth-efficient stream, enhancing its versatility and adaptability.

While the emergence of advanced text-to-speech models holds immense potential, particularly for accessibility purposes, researchers caution against premature dissemination of model data due to security concerns.

Despite the reluctance to publish the model's source and associated data, the certainty of its eventual accessibility underscores the need for robust safeguards against potential misuse.