The AI that Leonardo would have liked: it makes the Mona Lisa rap

Artificial intelligence, like other disruptive technologies such as genetics or nuclear energy, has two sides: a positive one and a negative one. AI's ability to accelerate scientific research, for example in drug discovery, is beyond doubt. So is its ability to manipulate public opinion. This is the case with VASA-1, an innovation from a team of artificial intelligence researchers at Microsoft Research Asia: an AI that gives voice and life to a person's portrait.

According to a study, VASA-1 is capable of converting a still image of a person and an audio track into an animation that realistically portrays the individual speaking or singing the audio, with facial expressions that are almost indistinguishable from a real video.

The research team set out to animate still images so that the subjects appear to talk or sing along with any supplied audio track, while displaying believable facial expressions. They clearly succeeded with the development of VASA-1, an artificial intelligence system that converts static images, whether captured by a camera, drawn, or painted, into what they describe as "exquisitely synchronized" animations.

The group demonstrated the effectiveness of its system by publishing short videos of its test results. In one, a cartoon version of the Mona Lisa performs a rap song; in another, a photograph of a woman is transformed into a singing performance; and in a third, a drawing of a man delivers a speech.

In each of the animations, facial expressions change along with the words in a way that emphasizes what is being said. The researchers also note that, despite the realistic look of the videos, closer inspection can reveal flaws that betray their artificial origin.

The research team achieved these results by training their system on thousands of images showing a wide variety of facial expressions. They also note that the system currently produces 512 × 512 pixel video at 45 frames per second, and that generating a video took an average of two minutes on a desktop Nvidia RTX 4090 GPU, i.e., readily available hardware.

The research team suggests that VASA-1 could be used to generate extremely realistic avatars for games or simulations. At the same time, they recognize its potential for abuse and have therefore decided not to make the system publicly available.