Transformer

A deep learning model published in 2017

Used mainly in natural language processing

As a preview, here's a sped-up version of one of the core scenes, though admittedly it may not make full sense without narration.

https://video.twimg.com/ext_tw_video/1776961855698751489/pu/vid/avc1/1280x720/hM1v3RG6090Vfwr8.mp4

https://www.youtube.com/watch?v=KlZ-QmPteqM

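Positioned as something like the successor to RNNs and LSTMs

Takes an arbitrary number of vectors as input and produces an arbitrary number of vectors as output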

Attention

In the CNN and RNN era, Attention played only a supporting role,

but the Transformer uses Attention alone as its main mechanism

https://gyazo.com/6620a6f16208bde90348f2d844c06437

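The left side is the Encoder

Encodes input such as text into a form the AI can understand (?)

N identical layers (Transformer blocks) are stacked vertically

This is the part labeled Nx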

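What is Positional Encoding?

It encodes the position of each input element

For example, if the input is text, which word position it is

If the input is an image, something like the pixel's coordinates

If it is audio, how many seconds into the clip it is, and so on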
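A minimal sketch of one way to encode positions, for the text case. The note does not say which scheme is meant, so this assumes the fixed sinusoidal encoding from the original paper; the function name and the sizes below are illustrative only.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Return a (num_positions, d_model) matrix; row p encodes position p."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    positions = np.arange(num_positions)[:, None]   # shape (P, 1)
    pair_index = np.arange(d_model // 2)[None, :]   # shape (1, d_model / 2)
    # Each sin/cos pair uses a different wavelength.
    angles = positions / np.power(10000.0, 2 * pair_index / d_model)
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding

# Illustrative sizes: 10 word positions, 8-dimensional embeddings.
pe = sinusoidal_positional_encoding(num_positions=10, d_model=8)
```

Each row of `pe` would be added to the embedding of the word at that position, so that otherwise identical words at different positions become distinguishable.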

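The right side is the Decoder

The part that outputs one word at a time (?)

N identical layers (Transformer blocks) are stacked vertically

This is the part labeled Nx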

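At the bottom right of the figure, an input is also fed into the decoder. What does that mean?

It is doing something autoregressive

e.g. taking 10 words as input and outputting the 11th word

Autoregressive: the outputs generated so far are fed back in as input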
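The autoregressive loop can be sketched roughly as below. `toy_next_token` is a hypothetical stand-in for a real decoder (it just counts upward), and greedy one-token-at-a-time selection is an assumption made to keep the example short.

```python
from typing import Callable, List

def greedy_decode(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Repeatedly feed all tokens so far back in and append one new token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # e.g. 10 tokens in, the 11th out
        tokens.append(token)
        if token == eos_id:         # stop at the end-of-sequence marker
            break
    return tokens

def toy_next_token(tokens: List[int]) -> int:
    # Hypothetical model: returns the last token plus one, wrapping at 5.
    return (tokens[-1] + 1) % 5

result = greedy_decode(toy_next_token, prompt=[1, 2], max_new_tokens=10, eos_id=0)
print(result)  # [1, 2, 3, 4, 0]
```

The key point is that the model is called once per generated token, each time on the sequence extended by its own previous output.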

At the top of the figure, the model outputs the Output Probabilities (a probability for each candidate next word)
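As a rough illustration of how scores become output probabilities, the sketch below applies a softmax to a made-up score vector. The three scores and the tiny vocabulary size are assumptions for the example, not values from any real model.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical scores for a 3-word vocabulary.
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

# Greedy choice: take the most probable word as the next output.
next_token_id = int(np.argmax(probs))
```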

Multi-Head Attention

$ \operatorname{MultiHead}(Q, K, V)=\operatorname{Concat}\left(\operatorname{head}_1, \ldots, \operatorname{head}_h\right) W^O

$ \operatorname{head}_i=\operatorname{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right)

$ \operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{\mathrm{T}}}{\sqrt{d_k}}\right) V
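The three formulas above can be sketched in NumPy as follows. This is a minimal single-sequence version without batching, masking, or dropout; the random weights and the sizes (4 vectors, model dimension 8, 2 heads) are illustrative assumptions.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - np.max(x, axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # one weight row per query
    return weights @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O) -> np.ndarray:
    """Concat(head_1, ..., head_h) W^O, with one projection triple per head."""
    heads = [
        attention(Q @ wq, K @ wk, V @ wv)
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    return np.concatenate(heads, axis=-1) @ W_O

# Illustrative sizes and random weights (not trained values).
rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_k))
W_O = rng.normal(size=(h * d_k, d_model))

# Self-attention: the same vectors serve as queries, keys, and values.
out = multi_head(X, X, X, W_Q, W_K, W_V, W_O)
```

The output has the same shape as the input, one vector per input vector, which is what allows these layers to be stacked N times.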

Masked Multi-Head Attention

Autoregressive
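A minimal sketch of how masking supports autoregressive use: each position may attend only to itself and earlier positions, so a word cannot look at words that have not been generated yet. The lower-triangular mask is the usual choice; the function names and sizes here are illustrative assumptions.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """True where attention is allowed: position i may see positions 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Future positions get -inf so their softmax weight becomes exactly 0.
    scores = np.where(causal_mask(n), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))  # 5 positions, 4 dimensions (illustrative)
out, weights = masked_attention(X, X, X)
```

In `weights`, everything above the diagonal is zero, and the first position can only attend to itself.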

Non-autoregressive

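Feeding the whole sequence back in at every step is slow, so this approach does not do that

Instead, it prepares something like seed vectors for all 10 words and uses those

That way, there is no need to run the model 10 separate times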

Attention Is All You Need

https://deeplearning.hatenablog.com/entry/transformer

Explanations

https://glassboxmedicine.com/2019/08/15/the-transformer-attention-is-all-you-need/

https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

https://huggingface.co/blog/how-to-generate

https://euske.github.io/introdl/transformer/index.html

https://isobe324649.hatenablog.com/entry/2022/08/20/212110

#??

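The key point is that training takes less time than with RNNs and similar models #??

Is AI evaluation mostly about how much training can be shortened?

It can't be that simple, surely

Learning efficiently from each training run presumably matters too

Why is it easy to fine-tune?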