Meta unveils Audiobox, an AI that clones voices and generates ambient sounds


Voice cloning is one of the areas rapidly emerging thanks to generative AI. The term refers to replicating a person’s vocal stylings — pitch, timbre, rhythms, mannerisms, and unique pronunciations — through technology.

While startups including ElevenLabs have received tens of millions in funding for dedicating themselves to this pursuit, Meta Platforms — the parent company of Facebook, Instagram, WhatsApp and Oculus VR — has released its own free voice cloning program, Audiobox, with a catch.

Unveiled today on Meta’s website by researchers working at the Facebook AI Research (FAIR) lab, Audiobox is described as a “new foundation research model for audio generation” built atop its earlier work in this area, Voicebox.

“It can generate voices and sound effects using a combination of voice inputs and natural language text prompts — making it easy to create custom audio for a wide range of use cases,” reads the Audiobox webpage.

Simply type in a sentence that you want a cloned voice to say, or a description of a sound you want to generate, and Audiobox will do the rest. Users can also record their own voice and have it cloned by Audiobox.

A ‘family’ of audio-generating AIs

Meta further noted that it actually created a “family of models” — one for speech mimicry, the other for generating more ambient sounds and sound effects such as dogs barking, sirens, or children playing — and that they are all “built upon the shared self-supervised model Audiobox SSL.”

Self-supervised learning (SSL) is a deep learning technique in which algorithms generate their own training labels from unlabeled data, as opposed to supervised learning, where the data is labeled in advance.
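To make the idea concrete, here is a toy masked-prediction sketch in Python — a hypothetical illustration of how training targets can be derived from unlabeled audio itself, not Meta’s actual Audiobox SSL objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend "unlabeled audio": 1 second of a 440 Hz tone sampled at 8 kHz,
# framed into 100 feature frames of 80 samples each.
sr = 8000
wave = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
frames = wave.reshape(100, 80)

def make_ssl_example(frames, mask_len=10):
    """Mask a random span of frames; the hidden originals become the
    training target. No transcript, caption, or human label is needed."""
    start = rng.integers(0, len(frames) - mask_len)
    inputs = frames.copy()
    inputs[start:start + mask_len] = 0.0          # hide a span of audio
    targets = frames[start:start + mask_len]      # model must predict this
    return inputs, targets, start

inputs, targets, start = make_ssl_example(frames)
print(inputs.shape, targets.shape)  # (100, 80) (10, 80)
```

A model trained on many such (masked input, original span) pairs learns the structure of audio without any supervision — the property the FAIR researchers cite as key to scaling on data that lacks transcripts or attribute labels.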

The researchers published a scientific paper explaining some of their methodology and rationale for taking an SSL approach, writing “because labeled data are not always available or of high quality, and data scaling is the key to generalization, our strategy is to train this foundation model using audio without any supervision, such as transcripts, captions, or attribute labels, which can be found in larger quantities.”

Of course, most leading generative AI models are heavily dependent on human-generated data to learn how to create new content, and Audiobox is no exception. The FAIR researchers relied upon “160K hours of speech (primarily English), 20K hours of music and 6K hours of sound samples.”

“The speech portion covers audiobooks, podcasts, read sentences, talks, conversations, and in-the-wild recordings including various acoustic conditions and non-verbal voices. To ensure fairness and a good representation for people from various groups, it includes speakers from over 150 countries speaking over 200 different primary languages.”

The research paper does not specify exactly where this data was sourced from or whether it was in the public domain, but that is an important question, given that various artists, authors, and music publishers are suing a host of AI companies for training on potentially copyrighted material without the creators’ or rights owners’ express consent. We’ve reached out to a Meta spokesperson for clarification and will update when we receive it.

You can try it yourself and clone your own voice now

To showcase the capabilities of Audiobox, Meta has also released a host of interactive demos, including one that lets users record themselves speaking about a sentence’s worth of text and then replicates their voice.

Then, the user can type in text that they want their cloned voice to say and hear it read back to them in their cloned voice.

You can try it for yourself here. In my case, the resulting AI-generated cloned audio was eerily similar to, though not exactly the same as, my own voice (as attested by my wife and child, who heard it without knowing what it was).

Meta also allows users to generate whole new voices from text descriptions of what they should sound like (“deep feminine voice,” “high-pitched masculine speaker from the U.S.,” etc.), as well as restyle voices recorded by the user, or type in a text prompt to generate a whole new sound. I tried the latter with “dogs barking” and received two versions that were, to my ears, indistinguishable from the real thing.

Now for the big catch: Meta includes a disclaimer with its Audiobox interactive demos noting that “this is a research demo and may not be used for any commercial purpose(s),” and furthermore, that it is restricted to those outside of “the States of Illinois or Texas,” which have state laws that apparently prohibit the kind of audio collection Meta is doing for the demos.

Interestingly, like its new Imagine by Meta AI image generation web app unveiled last week, Audiobox is not open source, bucking the commitment to openness Meta evidenced earlier with the release of its Llama 2 family of large language models (LLMs). We also asked our Meta contact whether Audiobox would be made open source at some point and will update when we receive a response.

So, the technology can’t be used for any moneymaking or business purposes — nor can it be used by residents of two of the most populous states in the U.S. — for now. But with AI advancing at a rapid clip, expect this to change, with commercial versions arriving in the near future, if not from Meta, then from others.
