Microsoft decides not to launch its new generative voice AI because it has reached ‘human parity’ and is too realistic

At the beginning of 2023, Microsoft presented VALL-E, an artificial intelligence with the ability to clone voices from a three-second clip of the target speaker. The result was not perfect, but it was remarkable for the capacity of the AI to replicate the speaker’s vocal timbre, emotional tone and the acoustic environment discernible in the original recording. A year and a half later, Microsoft has announced that it has completed development of its successor, VALL-E 2. According to the team of researchers responsible, the tool is now able to convincingly clone people’s voices and has achieved ‘human parity’. Given its potential for malicious uses, Microsoft has decided not to release it to the public and to use it solely for ‘research purposes’.

Like its predecessor, VALL-E 2 is a neural codec language model, a category within deep learning that uses neural network techniques to encode and decode linguistic information. However, unlike VALL-E, VALL-E 2 performs zero-shot text-to-speech synthesis, which means it uses text instructions to generate voices it has not been trained on. With VALL-E, the results were noticeably better when the original clip contained a voice similar to those it had been trained on.

VALL-E 2 uses vast training libraries, in this case LibriSpeech and VCTK, to map text inputs to corresponding audio outputs. This mapping accommodates variations in pronunciation, intonation, cadence and more. After ‘listening’ to a short clip of someone’s speech along with the user’s text input, VALL-E 2 incorporates those variations into its response to produce artificial speech that imitates the sampled voice and says what is indicated in the text input.
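Conceptually, the conditioning described above can be sketched in a few lines. This is a toy illustration only, not Microsoft's implementation: the function names and the trivial "acoustic prompt" are hypothetical stand-ins for the neural codec encoder and decoder that VALL-E 2 actually uses.

```python
# Toy sketch of zero-shot voice cloning, assuming a pipeline of:
# (1) encode a short reference clip into an acoustic prompt, and
# (2) generate speech conditioned on both that prompt and the text.
# All names here are illustrative, not the real VALL-E 2 API.

def extract_acoustic_prompt(clip: list[float]) -> dict:
    """Stand-in for the neural codec encoder: summarize a ~3-second
    clip into speaker characteristics (here, trivial statistics)."""
    n = len(clip)
    return {
        "mean_amplitude": sum(clip) / n,
        "energy": sum(x * x for x in clip) / n,
    }

def synthesize(text: str, prompt: dict) -> dict:
    """Stand-in for the decoder: generation is conditioned jointly on
    the text tokens and the acoustic prompt, so the output carries the
    sampled voice's characteristics while saying the requested text."""
    return {"tokens": text.split(), "voice": prompt}

reference_clip = [0.10, -0.20, 0.30, 0.05]   # pretend 3-second sample
prompt = extract_acoustic_prompt(reference_clip)
speech = synthesize("read this in my voice", prompt)
```

The key idea the sketch shows is that no speaker-specific training happens: the same model serves any new voice, because the reference clip is consumed at inference time as part of the input.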

AI speech generators are impressive, but it is very difficult to get them to sound completely natural, and simple sentences are far easier to handle than more elaborate speech. According to researchers at the Natural Language Computing Group in Microsoft Research Asia, VALL-E 2 does this without a hitch. So well, in fact, that the voice generator, according to them, is the first to ‘achieve human parity’, and making it available to the public could cause more harm than good.

‘VALL-E 2 is purely a research project’, the researchers write in their blog post. ‘Currently, we have no plans to incorporate VALL-E 2 into a product or expand public access. It may carry potential risks if misused, such as voice ID spoofing or impersonating a specific speaker.’

The team notes that VALL-E 2 could be useful in education or entertainment, where the model could narrate online courses or audiobooks while maintaining the natural voice of a particular person. Other voice generators, such as Meta’s Voicebox and Amazon’s AI-powered Alexa storytelling tool, have raised controversy over the ethics of allowing AI to mimic a real person’s voice, especially when that person is no longer alive to give consent. Like other forms of generative AI, speech generators also raise questions about their use in place of human workers, something that is of particular concern to voice actors.

With VALL-E 2 under lock and key, Microsoft can test the practical limits of the model without running into trouble. ‘We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis’, the researchers explain. ‘If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a model for detecting synthesized speech.’