Figure 1 appears to be the summary of the data2vec paper

https://gyazo.com/5b6d0b8bd9d8cd1939f7beb658697af4

Illustration of how data2vec follows the same learning process for different modalities.

The figure shows how the same learning process is applied to different modalities

The model first produces representations of the original input example (teacher mode) which are then regressed by the same model based on a masked version of the input.

The model first produces representations of the original, unmasked input example (teacher mode)

The same model is then given a masked version of the input and regresses (i.e., predicts) those representations (student mode)
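A toy sketch of the data flow described above, to make "regress" concrete. Everything here is an assumption for illustration: `encode` is a hypothetical stand-in for the real network (it only scales each token), the mask is filled with zeros, and the loss is a plain squared error taken only at masked positions, which the quoted caption does not state.

```python
def encode(w, tokens):
    """Toy stand-in for the model: scale every token by one weight w."""
    return [w * x for x in tokens]


def masked_regression_loss(student_w, teacher_w, tokens, mask, mask_value=0.0):
    """Squared error between student predictions and teacher targets.

    Assumption: the loss is averaged over masked positions only,
    and at least one position is masked.
    """
    # Teacher mode: encode the original, unmasked input to get targets.
    targets = encode(teacher_w, tokens)
    # Student mode: encode the masked input.
    masked_tokens = [mask_value if m else x for x, m in zip(tokens, mask)]
    preds = encode(student_w, masked_tokens)
    # Compare only where the input was masked.
    errors = [(p - t) ** 2 for p, t, m in zip(preds, targets, mask) if m]
    return sum(errors) / len(errors)


tokens = [1.0, 2.0, 3.0]
mask = [False, True, False]  # hide the middle token from the student
loss = masked_regression_loss(1.0, 1.0, tokens, mask)
```

This toy encoder has no context, so it cannot actually recover the masked token. A real model would use the surrounding tokens to do so, which is the point of the training signal.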

The teacher parameters are an exponentially moving average of the student weights.

In teacher mode, the parameters are an exponential moving average (EMA) of the student-mode weights
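A minimal sketch of that EMA update, assuming the usual form teacher = tau * teacher + (1 - tau) * student. The name `tau`, the flat list of weights, and the value 0.5 are illustrative choices for this note, not taken from the caption.

```python
def ema_update(teacher, student, tau):
    """Return new teacher weights as an EMA of the student weights."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


teacher = [0.0, 0.0]
student = [1.0, -2.0]  # held fixed here; in training these change every step

# The teacher drifts toward the student without ever receiving gradients.
for _ in range(3):
    teacher = ema_update(teacher, student, tau=0.5)
```

With `tau` close to 1 the teacher changes slowly, so the targets stay relatively stable while the student is being trained.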

The student predicts the average of K network layers of the teacher (shaded in blue).

In student mode, the model predicts the average of K teacher-mode layers (shaded in blue in the figure)
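A small sketch of how such a target could be built, assuming the K layers are the last K and that each layer output is a single vector. The caption only says "K network layers", so which layers are picked is an assumption here, and `average_top_k` is a hypothetical helper name.

```python
def average_top_k(layer_outputs, k):
    """Element-wise average of the last k layer outputs.

    layer_outputs: list of layers, each a list of floats of equal length.
    """
    top = layer_outputs[-k:]
    dim = len(top[0])
    return [sum(layer[i] for layer in top) / k for i in range(dim)]


# Three toy teacher layers, each producing a 2-dimensional representation.
layers = [[1.0, 1.0], [2.0, 4.0], [4.0, 8.0]]
target = average_top_k(layers, k=2)  # averages the last two layers
```

The student's prediction is then compared against `target` rather than against the raw input.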

One model, two modes

teacher

Sees the original, unmasked image

student

Receives the masked input

Trained to predict the average of the outputs (representations, not weights) of K teacher layers

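The teacher's weights are an exponential moving average of the student's weights: this part feels characteristic of self-supervised learning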

(Does this not cause leakage?)

Speech, images, and text are all handled with the teacher / student setup above (the same learning process regardless of modality)