A pair of Facebook AI researchers used TED Talks and other data to build AI that closely mimics music and the voices of famous people, including Bill Gates. MelNet is a generative model trained on spectrograms, which are visual representations of audio, instead of raw waveforms. Working with spectrograms lets the model capture structure that spans multiple seconds of audio, and the researchers used it to build models for end-to-end text-to-speech, unconditional speech, and solo piano music generation. MelNet was also trained for multi-speaker speech generation.
Spectrograms make this possible because they represent several seconds of audio in far fewer timesteps than a waveform does. Well-known voice synthesizers like Google’s WaveNet are instead trained on waveforms.
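To make the difference between the two representations concrete, here is a minimal sketch of turning a one-dimensional waveform into a two-dimensional time-frequency grid. It is an illustration only, not MelNet's preprocessing: it computes a plain linear-frequency magnitude spectrogram with a naive Fourier transform and skips the mel-scale filterbank that gives MelNet its name. The function name, sample rate, frame size, and hop size are all assumed for the example.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Slice a waveform into overlapping frames and return, for each frame,
    the magnitude of every non-negative frequency bin."""
    # Hann window to taper the edges of each frame.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        bins = []
        # Naive discrete Fourier transform; fine for a tiny demo signal.
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# Assumed demo input: a pure 1,000 Hz tone sampled at 8,000 Hz.
SAMPLE_RATE = 8000
TONE_HZ = 1000
waveform = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE)
            for t in range(1024)]

spec = magnitude_spectrogram(waveform)
# Each bin covers SAMPLE_RATE / frame_size = 125 Hz, so the tone lands in bin 8.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(waveform), "samples ->", len(spec), "frames x", len(spec[0]), "bins")
print("strongest bin in first frame:", peak_bin)
```

The waveform is a flat list of 1,024 amplitude values, while the spectrogram is a short sequence of frames that each describe which frequencies are present at that moment.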
“The temporal axis of a spectrogram is orders of magnitude more compact than that of a waveform, meaning dependencies that span tens of thousands of timesteps in waveforms only span hundreds of timesteps in spectrograms,” Facebook AI researchers said in a paper explaining how MelNet was created. “Combining these representational and modelling techniques yields a highly expressive, broadly applicable, and fully end-to-end generative model of audio.”
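The scale of that compression can be checked with some quick arithmetic. The sketch below assumes a 22,050 Hz sample rate and a hop of 256 samples between spectrogram frames; both values are illustrative choices for this example, not figures reported in the article, and the helper names are hypothetical.

```python
# Assumed values for illustration only.
SAMPLE_RATE = 22050   # waveform samples per second
HOP_LENGTH = 256      # waveform samples between consecutive spectrogram frames


def waveform_steps(seconds, sample_rate=SAMPLE_RATE):
    """Number of waveform timesteps needed to cover a span of audio."""
    return int(seconds * sample_rate)


def spectrogram_frames(seconds, sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Approximate number of spectrogram frames covering the same span."""
    return waveform_steps(seconds, sample_rate) // hop_length


span_seconds = 2.0  # e.g. a dependency that stretches across a short phrase
print("waveform timesteps:", waveform_steps(span_seconds))
print("spectrogram frames:", spectrogram_frames(span_seconds))
```

Under these assumptions, a two-second dependency covers 44,100 waveform timesteps but only 172 spectrogram frames, which matches the "tens of thousands" versus "hundreds" contrast in the quote.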
A website with samples of music, voices, and text-to-speech generated by MelNet showcases the model’s performance. It accompanies a paper published earlier this month on arXiv by Facebook AI research scientist Mike Lewis and AI resident Sean Vasquez.
A data set of more than 2,000 TED Talks voice recordings was used to generate AI that sounds like George Takei, Jane Goodall, and luminary AI scholars like Daphne Koller and Dr. Fei-Fei Li. Blizzard 2013, a data set of 140 hours of audiobooks, was used to train MelNet’s single-speaker speech skills. VoxCeleb2, a data set of more than 2,000 hours of speech covering more than 100 nationalities and a variety of accents, ethnicities, and other attributes, helped hone the model’s multi-speaker speech function.
Creating MelNet also meant solving other challenges, such as producing high-fidelity audio and reducing information loss.