VQGAN (2020)

arXiv:2012.09841, Taming Transformers for High-Resolution Image Synthesis

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks.

In contrast to CNNs, they contain no inductive bias that prioritizes local interactions.

This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images.

Note (基素): a plain Transformer on its own cannot synthesize high-resolution images.
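
The cost argument above can be made concrete with a little arithmetic. The sketch below counts the entries of one full self-attention matrix, which grows with the square of the sequence length. The function name `attention_entries`, the 256x256 image size, and the downsampling factor of 16 are illustrative assumptions, not values taken from this note.

```python
def attention_entries(height: int, width: int, downsample: int = 1) -> int:
    """Count entries in one full self-attention matrix over an image grid.

    Each grid cell is one token, so the matrix has seq_len * seq_len entries.
    """
    seq_len = (height // downsample) * (width // downsample)
    return seq_len * seq_len


# One token per pixel of a 256x256 image (assumed size).
pixel_level = attention_entries(256, 256)

# One token per 16x16 patch of the same image (assumed downsampling factor).
code_level = attention_entries(256, 256, downsample=16)

print(pixel_level)                # 4294967296
print(code_level)                 # 65536
print(pixel_level // code_level)  # 65536
```

Under these assumptions, modeling a coarse grid of learned codes instead of raw pixels shrinks the attention matrix by a factor of 65536, which is why a compressed representation makes the transformer tractable.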

We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.

We show how to

(i) use CNNs to learn a context-rich vocabulary of image constituents,

and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images.
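
One way to picture step (i) is as vector quantization: each feature vector from a CNN encoder is replaced by the index of its nearest entry in a learned codebook, and the resulting indices form the token sequence that the transformer models in step (ii). The following is a minimal sketch of that lookup only, in plain Python. The names `quantize` and `squared_distance` and all numeric values are hypothetical, and the codebook here is fixed by hand rather than learned.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


# Toy codebook with three 2-D entries (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# A flattened 2x2 grid of encoder features (hypothetical values).
features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.6, 0.1)]

tokens = quantize(features, codebook)
print(tokens)  # [0, 1, 2, 1]
```

The output is a short sequence of discrete indices, so the transformer sees the image as a sentence over the codebook "vocabulary" rather than as raw pixels.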

Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image.
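
A common way to condition an autoregressive token model is to place the conditioning tokens in front of the image tokens, so every image token is predicted with the condition in its context. The sketch below assumes that scheme and that both the class label and the segmentation map have already been turned into integer tokens. The function names and token values are hypothetical, and a real model would also need to keep condition and image vocabularies distinct.

```python
from typing import List, Tuple


def build_sequence(condition_tokens: List[int], image_tokens: List[int]) -> List[int]:
    """Prepend the conditioning tokens to the image tokens."""
    return condition_tokens + image_tokens


def prediction_pairs(
    condition_tokens: List[int], image_tokens: List[int]
) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs for next-token prediction.

    Only image tokens are targets; the condition is always part of the context.
    """
    sequence = build_sequence(condition_tokens, image_tokens)
    offset = len(condition_tokens)
    return [
        (sequence[: offset + i], target) for i, target in enumerate(image_tokens)
    ]


image_tokens = [0, 1, 2, 1]

# Non-spatial condition: a single class token (hypothetical id 7).
class_pairs = prediction_pairs([7], image_tokens)

# Spatial condition: one token per cell of a 2x2 segmentation grid (hypothetical ids).
segmentation_pairs = prediction_pairs([3, 3, 4, 4], image_tokens)

print(class_pairs[0])         # ([7], 0)
print(class_pairs[-1])        # ([7, 0, 1, 2], 1)
print(segmentation_pairs[0])  # ([3, 3, 4, 4], 0)
```

The same function handles both cases: a class label contributes one context token, while a segmentation map contributes a whole grid of them.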

In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet.

Code and pretrained models can be found at this https URL.

Patrick Esser, Robin Rombach, Björn Ommer

2020