Artificial intelligence, like other disruptive technologies such as genetics or nuclear energy, has two sides: a positive and a negative. AI's ability to accelerate scientific research, for example in drug discovery, is beyond doubt. So is its ability to manipulate public opinion. Such is the case with VASA-1, an innovation from the artificial intelligence research team at Microsoft Research Asia: an AI that gives life and voice to a person's portrait.
According to the accompanying study, VASA-1 is capable of converting a still image of a person and an audio track into an animation that accurately portrays the individual speaking or singing the track, with facial expressions almost indistinguishable from those in a real video.
Microsoft just dropped VASA-1.
This AI can make single image sing and talk from audio reference expressively. Similar to EMO from Alibaba
10 wild examples:
1. Mona Lisa rapping Paparazzi pic.twitter.com/LSGF3mMVnD
— Min Choi (@minchoi) April 18, 2024
The research team sought to animate still images so that they appear to talk or sing along with any supplied audio track, while displaying believable facial expressions. They clearly succeeded with VASA-1, an artificial intelligence system that converts static images, whether captured by a camera, drawn, or painted, into what they describe as "exquisitely synchronized" animations.
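Conceptually, a system like this maps one portrait image plus an audio waveform to a sequence of video frames. VASA-1 itself has not been publicly released, so the sketch below is purely illustrative: the function name, array shapes, sample rate, and placeholder logic are all assumptions, not Microsoft's API. Only the 512 × 512 resolution and 45 fps output rate come from the article.

```python
import numpy as np

# Hypothetical sketch of the input/output contract of a portrait-animation
# system: one portrait image + one mono audio track in, a stack of video
# frames out. Names and shapes here are illustrative assumptions; VASA-1
# is not publicly available.

FPS = 45               # output frame rate reported for VASA-1
SAMPLE_RATE = 16_000   # assumed audio sample rate

def animate_portrait(portrait: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """Return one video frame per 1/FPS of audio.

    portrait: (512, 512, 3) RGB image
    audio:    (n_samples,) mono waveform sampled at SAMPLE_RATE
    """
    duration_s = len(audio) / SAMPLE_RATE
    n_frames = int(round(duration_s * FPS))
    # A real model would condition each frame on the audio and the portrait
    # (lip sync, expression, head pose); here we simply tile the portrait
    # as a stand-in to show the shape of the output.
    return np.repeat(portrait[np.newaxis], n_frames, axis=0)

# One second of audio yields 45 frames at the reported frame rate.
portrait = np.zeros((512, 512, 3), dtype=np.uint8)
audio = np.zeros(SAMPLE_RATE, dtype=np.float32)
frames = animate_portrait(portrait, audio)
print(frames.shape)  # (45, 512, 512, 3)
```

The point of the sketch is only the contract: audio duration fixes the number of frames, and every frame shares the identity of the single input portrait.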
The group demonstrated the system's effectiveness by publishing short videos of its test results. In one, a cartoon version of the Mona Lisa performs a rap song; in another, a photograph of a woman is transformed into a singing performance; and in a third, a drawing of a man delivers a speech.
The First AI-Generated Video That Looks Super Real
Microsoft Research announced VASA-1.
It takes a single portrait photo and speech audio and produces a hyper-realistic talking face video with precise lip-audio sync, lifelike facial behavior, and naturalistic head movements… pic.twitter.com/6bxd4mEgFR
— Bindu Reddy (@bindureddy) April 17, 2024
In each of the animations, facial expressions change along with the words in a way that emphasizes what is being said. The researchers also note that, despite the realistic nature of the videos, closer inspection may reveal faults and evidence that they have been artificially generated.
Introducing: VASA-1 by Microsoft Research.
TL;DR: single portrait photo + speech audio = hyper-realistic talking face video with precise lip-audio sync, lifelike facial behavior, and naturalistic head movements, generated in real time.
Tap to see all the videos. pic.twitter.com/pPC6qZOBW2
— Eduardo Borges (@duborges) April 18, 2024
The research team achieved these results by training the system on thousands of images covering a wide variety of facial expressions. They note that the current system produces 512 × 512-pixel video at 45 frames per second, and that generating a video took an average of two minutes on a desktop Nvidia RTX 4090 GPU, in other words, readily available hardware.
The research team suggests that VASA-1 could be used to generate extremely realistic avatars for games or simulations. At the same time, they recognize the potential for abuse and therefore have not made the system available for general use.