How to Give Your Photo Subject Endless Poses: A Look at Google's New Method, Imagic

This week, two new but very different AI image-editing tools, Runway and Imagic [1], made their debut. Both can make effective, fine-grained changes to objects in a photo.

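The first is Google's Imagic [2], developed jointly by Google, the Technion (Israel Institute of Technology), and the Weizmann Institute of Science. Imagic fine-tunes a diffusion model to make fine-grained, text-guided edits to objects.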

The second is Runway, which makes erasing and replacing objects easier. These new machine-learning visual-effects features [3] are part of its 'AI Magic Tools'.

Runway ML’s Erase and Replace feature, already seen in a preview for a text-to-video editing system. Source: https://www.youtube.com/watch?v=41Qb58ZPO60

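Erasing and replacing objects is not especially novel, and implementations differ mainly in quality. We therefore focus on Imagic, which can directly modify the pose and characteristics of the object itself. First, though, a quick look at what Runway can do.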

Runway: Erasing and Replacing Objects

Like Imagic, Runway's Erase and Replace works only on still images, although Runway offers the same capability in an upcoming text-to-video editing tool.

Though anyone can test out the new Erase and Replace on images, the video version is not yet publicly available. Source: https://twitter.com/runwayml/status/1568220303808991232

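As with Imagic, Erase and Replace operates on whole objects: you cannot erase just part of an object, or an 'empty' region such as a bare wall. In the example below, a single item is swapped for another without affecting the objects around it.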

Substituting a domestic table for a ‘table made of ice’ in Runway ML’s Erase and Replace.

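How Erase and Replace isolates its target (erasing one object without disturbing its neighbours) is an interesting question. The likely mechanism is that the image is processed with a model such as CLIP [4] (Contrastive Language-Image Pre-training, introduced by OpenAI under the title "Connecting Text and Images"), with discrete items handled one at a time through object recognition and semantic segmentation.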
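To make the idea of segmentation-driven isolation concrete, here is a minimal sketch. It assumes a segmentation step has already assigned an object id to every pixel; the toy image, the label map, and the `replace_object` helper are all hypothetical and are not Runway's implementation, which has not been published.

```python
from typing import List

Image = List[List[str]]      # toy "image": one symbol per pixel
LabelMap = List[List[int]]   # assumed segmentation output: one object id per pixel


def replace_object(image: Image, labels: LabelMap, target_id: int, fill: str) -> Image:
    """Return a copy of `image` in which only pixels labelled `target_id` are replaced."""
    return [
        [fill if lab == target_id else px for px, lab in zip(img_row, lab_row)]
        for img_row, lab_row in zip(image, labels)
    ]


# Toy scene: wall (id 0), table (id 1), lamp (id 2).
image = [
    ["w", "w", "L", "w"],
    ["w", "T", "T", "w"],
    ["w", "T", "T", "w"],
]
labels = [
    [0, 0, 2, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

# "I" stands in for newly generated "table made of ice" pixels.
edited = replace_object(image, labels, target_id=1, fill="I")
for row in edited:
    print("".join(row))
```

Because the replacement is gated by the label map, the lamp and wall pixels are copied through unchanged, which is the behaviour the paragraph above describes.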

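Erase and Replace looks like a valuable tool for artistic work, but it cannot edit objects already in the image; it can only replace them. Genuinely changing existing image content without affecting its surroundings is arguably a much harder task, one that computer vision research has long worked on, and one that Imagic, discussed next, has begun to tackle.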

Imagic: Editing the Object Itself

The Imagic paper [2] provides many editing examples that successfully modify individual aspects of a photo without affecting the rest of the image.

In Imagic, the amended images do not suffer from the stretching, distortion and ‘occlusion guessing’ characteristic of deepfake puppetry, which utilizes limited priors derived from a single image.

The system has three stages: text embedding optimization, model fine-tuning, and generation of the edited image.

Imagic encodes the target text prompt to obtain the initial text embedding, and then optimizes that embedding so that it reconstructs the input image. After that, the generative model is fine-tuned on the source image, and the final edit is produced by interpolating between the optimized embedding and the target embedding.
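The interpolation mentioned in the caption can be sketched in a few lines. The Imagic paper describes it as a linear blend between the target text embedding and the optimized embedding, controlled by a scalar; the function name, the symbol `eta`, and the three-element toy vectors below are illustrative assumptions, since real text embeddings are much larger.

```python
from typing import List


def interpolate_embeddings(e_tgt: List[float], e_opt: List[float], eta: float) -> List[float]:
    """Linear blend: eta * e_tgt + (1 - eta) * e_opt."""
    if len(e_tgt) != len(e_opt):
        raise ValueError("embeddings must have the same length")
    return [eta * t + (1.0 - eta) * o for t, o in zip(e_tgt, e_opt)]


# Toy vectors standing in for real text embeddings (hypothetical values).
e_tgt = [1.0, 0.0, 2.0]   # embedding of the target prompt
e_opt = [0.0, 1.0, 0.0]   # embedding optimized to reconstruct the input image

for eta in (0.0, 0.5, 1.0):
    print(eta, interpolate_embeddings(e_tgt, e_opt, eta))
```

With `eta = 0` the result is the optimized embedding (closest to the input image), and with `eta = 1` it is the target embedding (closest to the prompt); intermediate values trade fidelity against edit strength.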

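Unsurprisingly, the framework is built on Google's Imagen text-to-image architecture, though the researchers say the system's principles apply broadly to latent diffusion models.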

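Imagen uses a three-tier architecture, rather than the seven-tier one used in the company's more recent text-to-video iteration. Three separate modules are involved: a generative diffusion model operating at 64x64px; a super-resolution model that upscales the output to 256x256px; and an additional super-resolution model that brings the output up to 1024x1024px.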

Imagic intervenes at the earliest stage of this process (64x64px), optimizing the given text embedding with the Adam optimizer at a fixed learning rate of 0.0001.
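As a rough illustration of what optimizing an embedding with Adam at a fixed learning rate looks like, the sketch below implements the standard Adam update in plain Python. The quadratic `toy_loss`, the vector `E_IMAGE`, and the step count of 100 are hypothetical stand-ins: in Imagic the objective is the diffusion model's denoising loss on the input image, which is not reproduced here.

```python
import math
from typing import Callable, List


def adam_optimize(
    params: List[float],
    grad_fn: Callable[[List[float]], List[float]],
    steps: int,
    lr: float = 1e-4,          # fixed learning rate, as quoted in the text
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> List[float]:
    """Run Adam with a constant learning rate and return the updated parameters."""
    x = list(params)            # work on a copy; the caller's list is untouched
    m = [0.0] * len(x)          # first-moment (mean) estimates
    v = [0.0] * len(x)          # second-moment (uncentred variance) estimates
    for t in range(1, steps + 1):
        g = grad_fn(x)
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * gi
            v[i] = beta2 * v[i] + (1.0 - beta2) * gi * gi
            m_hat = m[i] / (1.0 - beta1 ** t)   # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Hypothetical stand-in for "the embedding that best reconstructs the input image".
E_IMAGE = [0.12, -0.02]


def toy_loss(e: List[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(e, E_IMAGE))


def toy_grad(e: List[float]) -> List[float]:
    return [2.0 * (a - b) for a, b in zip(e, E_IMAGE)]


e_tgt = [0.10, -0.05]                                   # initial (target-prompt) embedding
e_opt = adam_optimize(e_tgt, toy_grad, steps=100, lr=1e-4)
print(toy_loss(e_tgt), toy_loss(e_opt))
```

The small learning rate keeps the optimized embedding close to the starting embedding, which is what makes a later blend between the two meaningful.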

A master-class in disentanglement: those end-users that have attempted to change something as simple as the color of a rendered object in a diffusion, GAN or NeRF model will know how significant it is that Imagic can perform such transformations without ‘tearing apart’ the consistency of the rest of the image.

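The Imagen base model is then fine-tuned, conditioned on the optimized embedding, for 1,500 steps per input image. In parallel, the 64px>256px super-resolution stage is also optimized on the input image. The researchers note that a similar optimization of the final 256px>1024px stage has "little to no effect" on the final results, and therefore did not implement it.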
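The cascade and the per-stage fine-tuning decisions described above can be summarized as a small data structure. This is only a bookkeeping sketch: the stage names and the `Stage` class are made up for illustration, and only the resolutions, the 1,500-step figure for the base model, and the decision to skip the last stage come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str          # hypothetical label, not an Imagen identifier
    out_px: int        # side length of the square output, in pixels
    finetuned: bool    # whether Imagic fine-tunes this stage per input image


BASE_FINETUNE_STEPS = 1500  # per input image, for the 64px base model

CASCADE: List[Stage] = [
    Stage("base_64", 64, finetuned=True),
    Stage("sr_64_to_256", 256, finetuned=True),
    Stage("sr_256_to_1024", 1024, finetuned=False),  # reported to make almost no difference
]


def finetuned_stages(cascade: List[Stage]) -> List[str]:
    """Names of the stages that are adapted to the input image."""
    return [s.name for s in cascade if s.finetuned]


def upscale_factors(cascade: List[Stage]) -> List[int]:
    """Resolution multiplier applied by each super-resolution stage."""
    return [b.out_px // a.out_px for a, b in zip(cascade, cascade[1:])]


print(finetuned_stages(CASCADE))
print(upscale_factors(CASCADE))
```

Each super-resolution stage multiplies the side length by four, so most of the per-image compute goes into the two lower-resolution stages.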

The paper states that optimization takes about 8 minutes per image on two TPUv4 chips. The final render is performed in Imagen using DDIM sampling.
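For readers unfamiliar with DDIM sampling, the following is a minimal sketch of one deterministic DDIM update (the eta = 0 case) in the common noise-prediction formulation. It is a generic textbook step, not Imagen's sampler: the function name, the list-based vectors, and the toy values are assumptions, and Imagen's actual parameterization and schedule may differ.

```python
import math
from typing import List


def ddim_step(
    x_t: List[float],
    eps_pred: List[float],
    alpha_bar_t: float,
    alpha_bar_prev: float,
) -> List[float]:
    """One deterministic DDIM update (eta = 0) from step t to the previous step."""
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_t = math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [(x - sqrt_one_minus_t * e) / sqrt_ab_t for x, e in zip(x_t, eps_pred)]
    # Re-noise that estimate to the previous (less noisy) level with the same noise direction.
    sqrt_ab_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_prev = math.sqrt(1.0 - alpha_bar_prev)
    return [sqrt_ab_prev * x0 + sqrt_one_minus_prev * e for x0, e in zip(x0_pred, eps_pred)]


# Sanity check with a known clean sample and known noise (toy values).
x0 = [1.0, -2.0]
noise = [0.5, 0.25]
alpha_bar_t = 0.5
x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * n for a, n in zip(x0, noise)]

# Stepping all the way to alpha_bar = 1 with a perfect noise prediction recovers x0.
x_prev = ddim_step(x_t, noise, alpha_bar_t, alpha_bar_prev=1.0)
print(x_prev)
```

In a real sampler the noise prediction comes from the fine-tuned diffusion model conditioned on the text embedding, and the step is repeated over a decreasing noise schedule.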

As with Google's DreamBooth fine-tuning process, the resulting embeddings can also be used for stylization, as shown below.

Flexible photoreal movement and edits can be elicited via Imagic, while the derived and disentangled codes obtained in the process can as easily be used for stylized output.

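The researchers compared Imagic with two earlier works: SDEdit, a diffusion-based method from 2021 developed jointly by Stanford University and Carnegie Mellon University; and Text2Live, from April 2022, a collaboration between the Weizmann Institute of Science and NVIDIA.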

A visual comparison between Imagic, SDEdit and Text2Live.

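The comparison above makes it clear that the earlier methods struggle with this kind of edit. In the last column, where the edit requires a major change of pose or appearance, only Imagic succeeds convincingly.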

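Imagic's resource requirements and run time are reasonable by academic standards, but it is unlikely to be built into local image-editing applications on personal computers any time soon, and it is unclear how far the fine-tuning process can be scaled down to consumer-level hardware.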

As things stand, Imagic is an impressive piece of work.

References

[1] Runway and Imagic: https://www.unite.ai/ai-assisted-object-editing-with-googles-imagic-and-runways-erase-and-replace/

[2] Imagic: https://arxiv.org/pdf/2210.09276.pdf

[3] Runway new features: https://twitter.com/runwayml/status/1581996497952206849

[4] CLIP: https://openai.com/blog/clip/
