Imagic - 基素基

Imagic

https://gyazo.com/10b67738c95d328f087afefb1604e9ec

Imagic: Text-Based Real Image Editing with Diffusion Models

https://arxiv.org/abs/2210.09276

Submitted on 17 Oct 2022 (v1), last revised 22 Nov 2022 (this version, v2)

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani

2022

Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently either

limited to specific editing types (e.g., object overlay, style transfer),

or apply to synthetically generated images,

or require multiple input images of a common object.

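DeepL: Text-based image editing has drawn a lot of attention in recent years. However, most current methods are either limited to specific kinds of edits (for example, object overlay or style transfer), apply only to synthetically generated images, or need several input images of the same object.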

In this paper we demonstrate, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image.

This paper demonstrates, for the first time, that complex (for example, non-rigid) text-guided semantic edits can be applied to a single real image.

What does "non-rigid" mean in this context? 基素.icon

Maybe something like this?

https://gyazo.com/98b3ba9ff6aead54107e050f848586a3

https://tech.preferred.jp/ja/blog/non-rigid-registration-model-for-pathological-images/

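Rigid registration applies a linear transformation to the image, so the whole image is transformed uniformly. Non-rigid registration, by contrast, uses a nonlinear transformation, which allows a different, non-uniform transformation for each local region of the image.

No??? I still don't get it. 基素.icon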
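One possible reading, sketched in code: a rigid transform moves every point with the same rotation and translation, so distances between points stay the same, while a non-rigid transform moves each point by an amount that depends on where it is, so the shape itself deforms. Under this reading, a pose change such as a standing dog sitting down would count as non-rigid. The function names (`rigid`, `non_rigid`, `pairwise_distances`) and the sine-based warp are illustrative assumptions, not anything taken from the paper or from the quoted blog post.

```python
import math

def rigid(points, angle, tx, ty):
    """Rotate every point by the same angle, then shift by the same offset."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def non_rigid(points, amplitude):
    """Shift each point in x by an amount that depends on its own y (a toy warp)."""
    return [(x + amplitude * math.sin(y), y) for x, y in points]

def pairwise_distances(points):
    """Distances between all pairs of points, used to check whether the shape changed."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

if __name__ == "__main__":
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    print("original :", pairwise_distances(square))
    print("rigid    :", pairwise_distances(rigid(square, math.pi / 6, 2.0, -1.0)))
    print("non-rigid:", pairwise_distances(non_rigid(square, 0.5)))
```

Running it shows that the rigid case keeps the pairwise distances of the square (up to floating-point rounding), while the non-rigid case changes some of them.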

For example, we can change the posture and composition of one or multiple objects inside an image, while preserving its original characteristics.

For example, the posture and composition of one or more objects in an image can be changed while keeping their original characteristics.

Our method can make a standing dog sit down or jump, cause a bird to spread its wings, etc. -- each within its single high-resolution natural image provided by the user.

It can make a standing dog sit down or jump, or make a bird spread its wings, each within the single high-resolution natural image supplied by the user.

Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit).

It operates on real images, and does not require any additional inputs (such as image masks or additional views of the object).

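Unlike previous work, this method needs only one input image and a target text (the edit you want).

It works on real images and needs no extra inputs such as image masks or additional views of the object.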

Our method, which we call "Imagic", leverages a pre-trained Text-to-Image diffusion model for this task.

It produces a text embedding that aligns with both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance.

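The method, called Imagic, uses a pre-trained Text-to-Image diffusion model to carry out this task.

It produces a text embedding that matches both the input image and the target text, while fine-tuning the diffusion model so that it captures the appearance specific to that image.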
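As a rough sketch of what these two sentences mean, here is the objective as I understand it from the paper; the notation may differ slightly from the paper's. Here x is the input image, x_t its noised version at timestep t, f_θ the pre-trained diffusion model, e_tgt the embedding of the target text, and η a mixing weight chosen by the user.

```latex
% Denoising loss for input image x, text embedding e, and model weights \theta
\mathcal{L}(x, e, \theta) =
  \mathbb{E}_{t,\epsilon}\left[ \lVert \epsilon - f_\theta(x_t, t, e) \rVert_2^2 \right],
\qquad
x_t = \sqrt{\alpha_t}\, x + \sqrt{1 - \alpha_t}\, \epsilon

% Step 1: optimize the embedding with the model frozen, initialized at e_{tgt}
e_{opt} = \arg\min_{e} \mathcal{L}(x, e, \theta)

% Step 2: fine-tune the model with the embedding frozen at e_{opt}
\hat{\theta} = \arg\min_{\theta} \mathcal{L}(x, e_{opt}, \theta)

% Step 3: interpolate the two embeddings and generate with the fine-tuned model
\bar{e} = \eta\, e_{tgt} + (1 - \eta)\, e_{opt}
```

With η = 0 the fine-tuned model is conditioned on e_opt and should reproduce the input image; as η grows, the output moves toward the target text.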

We demonstrate the quality and versatility of our method on numerous inputs from various domains, showcasing a plethora of high quality complex semantic image edits, all within a single unified framework.

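The quality and versatility of the method are shown on many inputs from various domains, presenting a wide range of high-quality, complex semantic image edits, all within one unified framework.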

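@jaguring1: Oh, Google researchers have announced yet another image editing technique! It's called "Imagic". It generates a new image from just one image and a text! It can edit a photo of two parrots that are not kissing into "two parrots kissing", or a photo of a real waterfall into "a children's drawing of a waterfall".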

https://t.co/FNdBP2XT2e

https://pbs.twimg.com/media/FfVms9hVUAE3BEY.jpg

/nishio/画像生成AI勉強会(2022年10月ダイジェスト)#636da8e2aff09e00005b3e86

The diagrams are very easy to understand.

Imagicを理解する - ほげほげ ("Understanding Imagic")

Plain image-to-image can't do this, so I got curious and read the paper and the source code.

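Method overview

Tune the Target Embeddings (hereafter Etar) to produce the Optimized Embeddings (hereafter Eopt),

fine-tune the diffusion model itself,

and merge Etar and Eopt at a suitable ratio while searching for a generation result that looks just right.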

https://gyazo.com/3ec75f9b5b6ff167155481c5843d02b0
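The last step of the overview, merging Etar and Eopt at some ratio and looking for a good result, can be sketched as a plain linear interpolation. This is a minimal sketch under assumptions: embeddings are shown as nested Python lists (tokens × channels) instead of tensors, the names `interpolate` and `sweep` are hypothetical, and the actual image generation with the fine-tuned model is left out.

```python
from typing import List, Tuple

Embedding = List[List[float]]  # one row per token, one column per channel

def interpolate(e_tar: Embedding, e_opt: Embedding, eta: float) -> Embedding:
    """Return eta * Etar + (1 - eta) * Eopt, element by element."""
    if len(e_tar) != len(e_opt) or any(len(a) != len(b) for a, b in zip(e_tar, e_opt)):
        raise ValueError("Etar and Eopt must have the same shape")
    return [
        [eta * t + (1.0 - eta) * o for t, o in zip(row_t, row_o)]
        for row_t, row_o in zip(e_tar, e_opt)
    ]

def sweep(e_tar: Embedding, e_opt: Embedding, steps: int = 5) -> List[Tuple[float, Embedding]]:
    """Build merged embeddings for evenly spaced ratios between 0 and 1."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    etas = [i / (steps - 1) for i in range(steps)]
    return [(eta, interpolate(e_tar, e_opt, eta)) for eta in etas]

if __name__ == "__main__":
    e_tar = [[1.0, 2.0]]  # toy stand-in for the target-text embedding
    e_opt = [[3.0, 6.0]]  # toy stand-in for the optimized embedding
    for eta, merged in sweep(e_tar, e_opt):
        # In a real pipeline each merged embedding would condition the
        # fine-tuned diffusion model; here it is only printed.
        print(eta, merged)
```

A ratio of 0 gives back Eopt (close to the original image) and a ratio of 1 gives back Etar (the edit text alone), so the search for a good result happens somewhere in between.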

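Honestly, I can hardly believe this works, but since it does, I can't complain.

The Imagic algorithm is simple enough that anyone who has worked with Stable Diffusion and PyTorch can implement it, and yet it performs text-driven image manipulation at a quality that is hard to believe.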

Imagicで遊ぶ ("Playing with Imagic")

InstructPix2Pix has the same inputs and outputs.