CLIP Interrogator - SFC - CCLab x-visual Survey

CLIP Interrogator

Purpose of this survey: get a rough understanding of how CLIP Interrogator works internally.

No paper has been published, so this is worked out from the source code.

What is CLIP Interrogator?

A tool that, given an arbitrary image, produces a prompt from which a text2img model that uses CLIP could generate that image.

Internally it runs two models: CLIP, and BLIP, a separate vision-language model that handles image captioning.

Method

https://github.com/pharmapsychotic/clip-interrogator/blob/e22b005ba59a9a16174a09665ca42924afc0e4ae/clip_interrogator/clip_interrogator.py#L168

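The final prompt is built by combining two things: a caption generated by an image captioning model, and entries from dictionaries of magic prompts / magic words (Flavors), joined in descending order of similarity in CLIP latent space.

Captions from image captioning models tend to be short and are not optimized for txt2img, so the caption is combined with the magic prompts commonly used in txt2img to add more features to the prompt.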

Test image

https://gyazo.com/a723688fffb6eea0782d8d52c7548699

outputs: a painting of a large group of people, inspired by Moses Soyer, fully clothed in red robes, pale orange colors, eliezer yudkowsky, round-cropped, desperation, in rows, purgatory, necropolis, website, mihaly munkacsy, ivan plusch artwork

Image Captioning

Captioning is done with BLIP.

Inside BLIP, a BERT-based text decoder appears to generate the caption conditioned on image embeddings encoded by a ViT (not yet confirmed).

code: clip_interrogator.py Interrogator.interrogate / l115

 # Generate the caption that opens the prompt
 caption = self.generate_caption(image)

code: clip_interrogator.py Interrogator.generate_caption / l121

 caption = self.blip_model.generate(
     gpu_image,
     sample=False,
     num_beams=self.config.blip_num_beams,
     max_length=self.config.blip_max_length,
     min_length=5
 )

output: "a painting of a large group of people"

Flavors Selection

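At each step, every remaining Flavor is appended to the prompt from step t = n-1, each candidate prompt is compared with the image features encoded by CLIP's ViT, and the Flavor with the highest similarity is appended to the prompt as Flavor n.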
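Written as a formula, the selection rule reads roughly as follows. The notation is introduced here for illustration only and does not come from the source code: x is the input image, F the set of Flavors, p_n the prompt after n Flavors (p_0 is the caption), \oplus means "append with a comma", and sim is the CLIP similarity score between image and text features.

```latex
f_n = \operatorname*{arg\,max}_{f \in F \setminus \{f_1, \dots, f_{n-1}\}}
      \mathrm{sim}\bigl(E_{\mathrm{img}}(x),\; E_{\mathrm{txt}}(p_{n-1} \oplus f)\bigr),
\qquad
p_n = p_{n-1} \oplus f_n
```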

code:clip_interrogator.py Interrogator.interrogate / l105

 self.artists = LabelTable(artists, "artists", self.clip_model, self.tokenize, config)
 self.flavors = LabelTable(_load_list(config.data_path, 'flavors.txt'), "flavors", self.clip_model, self.tokenize, config)
 self.mediums = LabelTable(_load_list(config.data_path, 'mediums.txt'), "mediums", self.clip_model, self.tokenize, config)
 self.movements = LabelTable(_load_list(config.data_path, 'movements.txt'), "movements", self.clip_model, self.tokenize, config)
 self.trendings = LabelTable(trending_list, "trendings", self.clip_model, self.tokenize, config)

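class LabelTable: loads the Flavors stored in data/*.txt, computes their CLIP embeddings, and holds them as a cache table. (Once the Flavors in a *.txt file have been encoded as CLIP embeddings, they are saved as a pickle per LabelTable instance, so the CLIP features do not have to be recomputed on every run. When new Flavors are added, only the corresponding CLIP embeddings are computed and the pickle is updated.)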
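The caching idea can be illustrated with a minimal sketch. This is not the actual LabelTable code: `load_embeds`, `fake_embed`, and the cache layout (a plain dict pickled to one file) are hypothetical stand-ins, and `fake_embed` replaces the real CLIP text encoder with a hash so the example runs without a model.

```python
import hashlib
import os
import pickle
import tempfile
from typing import Callable, Dict, List


def load_embeds(labels: List[str],
                embed: Callable[[str], List[float]],
                cache_path: str) -> Dict[str, List[float]]:
    """Return an embedding per label, computing only labels missing from the pickle."""
    cache: Dict[str, List[float]] = {}
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)

    missing = [label for label in labels if label not in cache]
    for label in missing:
        cache[label] = embed(label)  # the expensive step in the real tool

    if missing:  # rewrite the pickle only when something new was embedded
        with open(cache_path, "wb") as f:
            pickle.dump(cache, f)
    return {label: cache[label] for label in labels}


calls: List[str] = []  # records which labels were actually embedded


def fake_embed(label: str) -> List[float]:
    """Hypothetical stand-in for a CLIP text encoder: 4 floats derived from a hash."""
    calls.append(label)
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return [b / 255 for b in digest[:4]]


cache_file = os.path.join(tempfile.mkdtemp(), "artists.pkl")
first = load_embeds(["A. B. Jackson", "A. J. Casson"], fake_embed, cache_file)
second = load_embeds(["A. B. Jackson", "A. J. Casson", "A.B. Frost"], fake_embed, cache_file)
```

On the second call only "A.B. Frost" is embedded; the other two labels come from the pickle.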

example: data/artists.txt

A. B. Jackson

A. J. Casson

A. R. Middleton Todd

A.B. Frost

A.D.M. Cooper

...

code:clip_interrogator.py Interrogator.interrogate / l172

 flaves = self.flavors.rank(image_features, self.config.flavor_intermediate_count)
 best_medium = self.mediums.rank(image_features, 1)[0]
 best_artist = self.artists.rank(image_features, 1)[0]
 best_trending = self.trendings.rank(image_features, 1)[0]
 best_movement = self.movements.rank(image_features, 1)[0]

output: flaves = ['art of émile eisman-semenowsky', 'crowd of people', 'crowd of longhairs', 'oil painting of an overpopulated', 'crowds of people praying'n delville fertile', 'scene from church', 'group of people' ...

LabelTable.rank() -> computes the similarity between the Flavors held in the instance's self.labels and the CLIP embedding of the input image, and returns the top_count (second argument) most similar labels.
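A minimal sketch of this kind of top-k ranking, assuming cosine similarity and using tiny hand-made 3-dimensional vectors in place of real CLIP embeddings. The function names and the toy data are hypothetical and are not taken from the library.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank(image_features: Sequence[float],
         label_embeds: Dict[str, Sequence[float]],
         top_count: int = 1) -> List[str]:
    """Return the top_count labels whose embeddings are most similar to the image."""
    ordered = sorted(
        label_embeds,
        key=lambda label: cosine(image_features, label_embeds[label]),
        reverse=True,
    )
    return ordered[:top_count]


# Toy embeddings (hypothetical values, not real CLIP output).
toy_embeds = {
    "oil painting": [0.9, 0.1, 0.0],
    "photograph": [0.1, 0.9, 0.0],
    "pencil sketch": [0.6, 0.2, 0.5],
}
toy_image = [1.0, 0.0, 0.1]

top2 = rank(toy_image, toy_embeds, top_count=2)
best = rank(toy_image, toy_embeds, 1)[0]  # same `[0]` pattern as in the excerpt above
```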

code:clip_interrogator.py Interrogator.interrogate / l181~

 # Excerpt from inside Interrogator.interrogate; relies on module-level
 # `from typing import List` and `from tqdm import tqdm`.
 def check(addition: str) -> bool:
     # Keep `addition` only if it raises the similarity to the image
     nonlocal best_prompt, best_sim
     prompt = best_prompt + ", " + addition
     sim = self.similarity(image_features, prompt)
     if sim > best_sim:
         best_sim = sim
         best_prompt = prompt
         return True
     return False

 def check_multi_batch(opts: List[str]):
     nonlocal best_prompt, best_sim
     prompts = []
     # Each bitmask i selects one subset of opts (2**len(opts) prompts in total)
     for i in range(2**len(opts)):
         prompt = best_prompt
         for bit in range(len(opts)):
             if i & (1 << bit):
                 prompt += ", " + opts[bit]
         prompts.append(prompt)

     t = LabelTable(prompts, None, self.clip_model, self.tokenize, self.config)
     best_prompt = t.rank(image_features, 1)[0]
     best_sim = self.similarity(image_features, best_prompt)

 check_multi_batch([best_medium, best_artist, best_trending, best_movement])

 extended_flavors = set(flaves)
 for _ in tqdm(range(max_flavors), desc="Flavor chain", disable=self.config.quiet):
     best = self.rank_top(image_features, [f"{best_prompt}, {f}" for f in extended_flavors])
     # Strip the "<best_prompt>, " prefix to recover the flavor that was appended
     flave = best[len(best_prompt)+2:]
     if not check(flave):
         break
     if _prompt_at_max_len(best_prompt, self.tokenize):
         break
     extended_flavors.remove(flave)

 return best_prompt

I have not fully understood this part yet, sorry!

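check_multi_batch(): rather than simply concatenating the four candidates, it uses a bitmask to enumerate every subset of opts (best_medium, best_artist, best_trending, best_movement) and appends each subset to the current best_prompt. That gives 2^4 = 16 prompts, including the unchanged caption. It then instantiates another LabelTable from those prompts and fetches the one most similar to the image.

The winner is assigned to best_prompt.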

best_prompt after check_multi_batch: a painting of a large group of people, inspired by Moses Soyer
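The subset enumeration in check_multi_batch can be reproduced in isolation. The sketch below only builds the candidate prompts and leaves out the CLIP ranking step. `candidate_prompts` is a hypothetical helper name, and "oil on canvas" is a made-up medium used purely for illustration.

```python
from typing import List


def candidate_prompts(base: str, opts: List[str]) -> List[str]:
    """Append every subset of opts to base, one prompt per bitmask."""
    prompts = []
    for mask in range(2 ** len(opts)):
        prompt = base
        for bit, opt in enumerate(opts):
            if mask & (1 << bit):  # bit set -> this option is part of the subset
                prompt += ", " + opt
        prompts.append(prompt)
    return prompts


base = "a painting of a large group of people"
prompts = candidate_prompts(base, ["oil on canvas", "inspired by Moses Soyer"])
for p in prompts:
    print(p)
```

With two options this yields four prompts: the bare caption, the caption with each single option, and the caption with both. With the four options used in the excerpt it yields sixteen.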

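Each flavor in the flaves set is appended to best_prompt with a list comprehension, and rank_top returns the candidate prompt most similar to the image.

Flavors are joined on one at a time, as a chain. At step n, the prompt built so far (caption, Flavor 0, ..., Flavor n-1) is extended with each not-yet-selected Flavor x, and the extension with the highest similarity to the image features becomes Flavor n. The chain stops when the best extension no longer raises the similarity, when the prompt reaches the maximum token length, or after max_flavors steps.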

Flavor chain: 0%| | 0/32 [00:00<?, ?it/s]

a painting of a large group of people, inspired by Moses Soyer, fully clothed in red robes

Flavor chain: 3%|███▎ | 1/32 [00:02<01:06, 2.14s/it]

a painting of a large group of people, inspired by Moses Soyer, fully clothed in red robes, pale orange colors

Flavor chain: 6%|██████▌ | 2/32 [00:04<01:02, 2.09s/it]

a painting of a large group of people, inspired by Moses Soyer, fully clothed in red robes, pale orange colors, eliezer yudkowsky

...

output: a painting of a large group of people, inspired by Moses Soyer, fully clothed in red robes, pale orange colors, eliezer yudkowsky, round-cropped, desperation, in rows, purgatory, necropolis, website, mihaly munkacsy, ivan plusch artwork
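The flavor chain as a whole can be sketched as a small greedy loop. This is a simplified illustration, not the library's implementation. `toy_similarity` and its weights are invented stand-ins for the CLIP image-text similarity, and the token-length check (`_prompt_at_max_len`) is left out.

```python
from typing import Callable, Iterable

# Hypothetical scores: how much each flavor "helps" the prompt match the image.
TOY_WEIGHTS = {"red robes": 0.3, "in rows": 0.2, "website": -0.1}


def toy_similarity(prompt: str) -> float:
    """Stand-in for CLIP similarity: sum of the toy weights of the prompt's parts."""
    return sum(TOY_WEIGHTS.get(part, 0.0) for part in prompt.split(", "))


def flavor_chain(caption: str,
                 flavors: Iterable[str],
                 similarity: Callable[[str], float],
                 max_flavors: int = 32) -> str:
    """Greedily append the flavor that most improves similarity, until none does."""
    best_prompt = caption
    best_sim = similarity(best_prompt)
    remaining = set(flavors)
    for _ in range(max_flavors):
        if not remaining:
            break
        # Best single extension of the current prompt (sorted for determinism).
        flave = max(sorted(remaining), key=lambda f: similarity(f"{best_prompt}, {f}"))
        sim = similarity(f"{best_prompt}, {flave}")
        if sim <= best_sim:  # no improvement: stop the chain
            break
        best_prompt, best_sim = f"{best_prompt}, {flave}", sim
        remaining.remove(flave)
    return best_prompt


result = flavor_chain("a painting", TOY_WEIGHTS.keys(), toy_similarity)
print(result)
```

In this toy run "red robes" and "in rows" are appended in that order, and the chain stops before "website" because adding it would lower the score.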