TL;DR: single portrait photo + speech audio = hyper-realistic talking face video with precise lip-audio sync, lifelike facial behavior, and naturalistic head movements, generated in real time.
Abstract
We introduce VASA, a framework for generating lifelike talking faces of virtual characters with appealing visual affective skills (VAS), given a single static image and a speech audio clip. Our premiere model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio, but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness. The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512x512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.
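To make the pipeline described in the abstract concrete, here is a minimal data-flow sketch. All module names, latent sizes, and tensor shapes below are illustrative assumptions, not the authors' implementation (none is publicly released); the key idea is that a fixed appearance latent extracted from the single portrait is combined with per-frame facial dynamics and head pose produced by a diffusion model operating in the face latent space.

```python
# Hypothetical sketch of a VASA-style generation pipeline.
# Module names, latent dimensions, and shapes are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def encode_portrait(image):
    """Stand-in face encoder: extracts an appearance latent (identity,
    texture) that stays fixed for the whole video."""
    return rng.standard_normal(512)                  # appearance latent

def audio_features(waveform, sr=16000, fps=25):
    """Stand-in audio encoder: one feature vector per output video frame."""
    n_frames = int(len(waveform) / sr * fps)
    return rng.standard_normal((n_frames, 128))

def diffusion_motion_model(audio_feats):
    """Stand-in for the diffusion model that generates holistic facial
    dynamics and head pose *in the face latent space*, frame by frame."""
    T = len(audio_feats)
    face_dynamics = rng.standard_normal((T, 256))    # expressions, lips, gaze
    head_pose = rng.standard_normal((T, 6))          # 3D rotation + translation
    return face_dynamics, head_pose

def render(appearance, face_dynamics, head_pose):
    """Stand-in decoder: combines the fixed appearance latent with per-frame
    motion latents to produce 512x512 frames."""
    T = len(face_dynamics)
    return np.zeros((T, 512, 512, 3), dtype=np.uint8)

# One static portrait + one speech clip -> a talking-face video.
portrait = np.zeros((512, 512, 3), dtype=np.uint8)
speech = rng.standard_normal(16000 * 4)              # 4 s of 16 kHz audio

appearance = encode_portrait(portrait)
feats = audio_features(speech)
dynamics, pose = diffusion_motion_model(feats)
video = render(appearance, dynamics, pose)
print(video.shape)                                   # (100, 512, 512, 3)
```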
(Note: all portrait images on this page are virtual, non-existing identities generated by StyleGAN2 or DALL·E-3 (except for Mona Lisa). We are exploring visual affective skill generation for virtual, interactive characters, NOT impersonating any person in the real world. This is only a research demonstration and there's no product or API release plan. See also the bottom of this page for more of our Responsible AI considerations.)
Realism and liveliness
Our method is capable of not only producing precise lip-audio synchronization, but also generating a large spectrum of expressive facial nuances and natural head motions. It can handle arbitrary-length audio and stably output seamless talking face videos.
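One plausible way to get seamless output on arbitrary-length audio is to generate motion in fixed-size windows and condition each window on the tail of the previous one. The toy sketch below illustrates that idea; the window sizes and the generate_window() stub are assumptions, not the paper's actual windowing scheme.

```python
# Toy sketch of windowed motion generation for arbitrary-length audio.
# WIN, OVERLAP, and generate_window() are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
WIN, OVERLAP = 100, 10   # frames per window / frames carried over as context

def generate_window(audio_chunk, prev_motion_tail):
    # Stand-in for one conditional sampling call; a real model would
    # condition on prev_motion_tail so consecutive windows join smoothly.
    return rng.standard_normal((len(audio_chunk), 256))

def generate_motion(audio_feats):
    motion, tail = [], np.zeros((OVERLAP, 256))
    for start in range(0, len(audio_feats), WIN - OVERLAP):
        chunk = audio_feats[start:start + WIN]
        win = generate_window(chunk, tail)
        # Keep the full first window; afterwards drop the re-generated overlap.
        motion.append(win[OVERLAP:] if motion else win)
        tail = win[-OVERLAP:]
    return np.concatenate(motion)

feats = rng.standard_normal((1000, 128))   # features for ~40 s of audio at 25 fps
print(generate_motion(feats).shape)        # (1000, 256)
```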
Controllability of generation
Our diffusion model accepts optional signals as conditions, such as main eye gaze direction, head distance, and emotion offsets.
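A minimal sketch of how such optional conditions might be passed to the sampler. The Condition fields mirror the controls named above (gaze direction, head distance, emotion offset), but the interface itself is hypothetical.

```python
# Hypothetical conditioning interface; fields mirror the controls listed above.
from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np

@dataclass
class Condition:
    gaze_direction: Optional[Tuple[float, float]] = None  # (yaw, pitch), radians
    head_distance: Optional[float] = None                  # head-to-camera scale
    emotion_offset: Optional[np.ndarray] = None            # shift in emotion space

def sample_motion(audio_feats, cond: Condition):
    """Stand-in conditional sampler: unset fields would fall back to the
    model's unconditional behavior for that signal."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(audio_feats), 256))

feats = np.zeros((100, 128))
motion = sample_motion(feats, Condition(gaze_direction=(0.2, -0.1), head_distance=1.1))
print(motion.shape)   # (100, 256)
```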
Out-of-distribution generalization
Our method can handle photo and audio inputs that fall outside the training distribution. For example, it can handle artistic photos, singing audio, and non-English speech, none of which were present in the training set.
Power of disentanglement
Our latent representation disentangles appearance, 3D head pose, and facial dynamics, which enables separate attribute control and editing of the generated content.
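The practical payoff of disentanglement is that latents can be recombined across sources. Below is an illustrative sketch with hypothetical encoder and decoder stand-ins: keeping one clip's appearance while borrowing another clip's pose and dynamics, or freezing the head pose while leaving facial dynamics intact.

```python
# Illustrative latent recombination enabled by disentanglement.
# encode()/decode() are hypothetical stand-ins, not the released model.
import numpy as np

rng = np.random.default_rng(0)

def encode(video):
    """Split a clip into (appearance, head_pose_seq, face_dynamics_seq)."""
    T = len(video)
    return (rng.standard_normal(512),          # appearance (per clip)
            rng.standard_normal((T, 6)),       # 3D head pose (per frame)
            rng.standard_normal((T, 256)))     # facial dynamics (per frame)

def decode(appearance, head_pose, face_dynamics):
    return np.zeros((len(head_pose), 512, 512, 3), dtype=np.uint8)

video_a = np.zeros((100, 512, 512, 3), dtype=np.uint8)
video_b = np.zeros((100, 512, 512, 3), dtype=np.uint8)

app_a, pose_a, dyn_a = encode(video_a)
app_b, pose_b, dyn_b = encode(video_b)

# A's identity driven by B's motion (motion transfer).
same_face_new_motion = decode(app_a, pose_b, dyn_b)
# Edit a single attribute: freeze the head pose, keep facial dynamics.
frozen_head = decode(app_a, np.tile(pose_a[:1], (100, 1)), dyn_a)
```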
Real-time efficiency
Our method generates video frames of 512x512 size at 45 fps in the offline batch processing mode, and can support up to 40 fps in the online streaming mode with a preceding latency of only 170 ms, evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU.
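A quick back-of-the-envelope check of those numbers (assuming the 170 ms figure is the initial buffering before the first frame is emitted): at 40 fps the steady-state budget is 25 ms per frame, so the preceding latency corresponds to roughly seven frames of lead time.

```python
# Sanity-check arithmetic for the streaming-mode figures quoted above.
online_fps = 40
latency_ms = 170

per_frame_budget_ms = 1000 / online_fps             # 25.0 ms per frame
buffered_frames = latency_ms / per_frame_budget_ms  # ~6.8 frames of lead time

print(f"{per_frame_budget_ms:.1f} ms/frame, ~{buffered_frames:.1f} frames buffered")
```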
Risks and responsible AI considerations
Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications. It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans. We oppose any attempt to create misleading or harmful content depicting real persons, and are interested in applying our technique to advance forgery detection. Currently, the videos generated by this method still contain identifiable artifacts, and numerical analysis shows that there is still a gap before they achieve the authenticity of real videos.
While acknowledging the possibility of misuse, it's imperative to recognize the substantial positive potential of our technique. The benefits – such as enhancing educational equity, improving accessibility for individuals with communication challenges, offering companionship or therapeutic support to those in need, among many others – underscore the importance of our research and other related explorations. We are dedicated to developing AI responsibly, with the goal of advancing human well-being.
Given such context, we have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.
For more technical details, see the paper: https://arxiv.org/abs/2404.10667
Further reading:
From DALL·E to Stable Diffusion: How Do Text-to-Image Generation Models Work?
