Meta Introduces Generative AI Model for Speech Generation 'Voicebox'

Meta, the parent company of Facebook and Instagram, has unveiled its latest development in the field of generative AI for speech with the introduction of Voicebox.

This AI model showcases capabilities in speech generation, such as editing, sampling, and stylizing, even without specific training for these tasks.

Through in-context learning, Voicebox can produce high-quality audio clips while preserving the content and style of the original recording. Notably, this multilingual model can generate speech in six different languages.

A photo of the META logo during the US social network Instagram opening on a tablet screen in Moscow on November 11, 2021. Facebook chief Mark Zuckerberg announced the parent company's name is being changed to "Meta" to represent a future beyond just its troubled social network. (KIRILL KUDRYAVTSEV/AFP via Getty Images)

Meta Voicebox's Various Functionalities

The versatility of Voicebox is demonstrated through its various functionalities:

1. In-context text-to-speech synthesis: With just a two-second audio sample, Voicebox can match the style of the sample and generate text-to-speech output accordingly.

2. Speech editing and noise reduction: Voicebox can resynthesize portions of speech that were interrupted by noise, or replace misspoken words, without the need to re-record the entire speech. This feature allows for seamless audio editing, akin to an eraser for audio.

3. Cross-lingual style transfer: Voicebox can read passages of text in different languages, producing speech in the desired language regardless of the language of the provided sample. This cross-lingual capability offers the potential for natural communication between individuals who speak different languages.

4. Diverse speech sampling: Having been trained on a wide range of data, Voicebox can generate speech that closely resembles how people naturally speak in real-world scenarios across the six supported languages.

Due to potential risks associated with misuse, the model and code are not currently available to the public. However, audio samples and a research paper detailing the model's approach and results have been shared.

Flow Matching Model

Voicebox is built on Flow Matching, a method Meta describes as its latest advance in non-autoregressive generative models. This approach allows Voicebox to learn from varied speech data without the need for extensive labeling, resulting in a broader and more diverse training dataset.

Voicebox was trained on more than 50,000 hours of recorded speech and transcripts from public domain audiobooks. It learns to predict a speech segment from the surrounding audio and the transcript, which enables it to generate speech within existing audio recordings.
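To make the idea of predicting a speech segment from its context more concrete, here is a toy sketch of how one training example could be built with a flow-matching objective. It is an illustration under stated assumptions, not Meta's code, which has not been released. The function names, the one-number-per-frame "audio features," and the `sigma_min` default are all hypothetical. The straight-line interpolation between noise and data follows the common optimal-transport form of flow matching.

```python
import random

def make_training_example(x1, mask, t, sigma_min=1e-5, rng=None):
    """Build one toy flow-matching training example (hypothetical helper).

    x1   : list of floats, one feature value per audio frame (the real speech)
    mask : list of bools, True for frames the model must generate
    t    : float in [0, 1], position along the noise-to-data path
    """
    if rng is None:
        rng = random.Random()
    # Start from Gaussian noise, one sample per frame.
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    # Point on the straight path from noise (t=0) toward data (t=1).
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]
    # Velocity the model should predict at x_t along that path.
    target = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    # Context: the real audio with the masked frames blanked out.
    context = [0.0 if m else d for m, d in zip(mask, x1)]
    return x_t, target, context

def masked_loss(pred, target, mask):
    """Mean squared error over the masked (to-be-generated) frames only."""
    errs = [(p - q) ** 2 for p, q, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs)

# Usage: hide frames 2-4 of a short clip and build a training example.
clip = [0.1, 0.4, -0.2, 0.3, 0.0, 0.5]
hidden = [False, False, True, True, True, False]
x_t, target, context = make_training_example(clip, hidden, t=0.5)
```

In a real system a neural network would take `x_t`, `t`, `context`, and the transcript, and would be trained to make `masked_loss` small. At generation time the learned velocity is integrated from noise to fill in the hidden frames.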

The capabilities of Voicebox, along with its potential impact on the field of generative AI for speech, mark a significant milestone in Meta's research endeavors.

By sharing its approach and results, Meta encourages the research community to build upon this work and contribute to responsible AI development.

"Voicebox is a generative AI model that can help with audio editing, sampling and styling. This type of technology could be used in the future to help creators easily edit audio tracks, allow visually impaired people to hear written messages from friends in their voices, and enable people to speak any foreign language in their own voice," Meta wrote in its announcement post.