The layers of the original-paper Transformer (its Encoder) and of BERT-style models seem to have different internal structures

from BERTとTransformerの違い (Differences between BERT and Transformer)

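I learned this while writing the notes below: there are two variants, PreNorm and PostNorm

The Transformer was originally post-norm, and BERT adopted it, so BERT is post-norm as well

Later, however, pre-norm was reportedly found to be more stable, or something along those lines

tensor2tensor, the Transformer implementation, also switched to pre-norm later

BERT-style models, however, seem to have mostly stayed with post-norm

The reason is unclear

Derived models appear to keep the same layout so that their accuracy can be compared with BERT's

I have not looked into whether pre-norm and post-norm have been compared on BERT-style models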

https://gyazo.com/38c89faca0fc8966d5bf75c964eb8000

The figure in the original paper looks like this

Each layer contains two sublayers

The input-side part of these two

multi head attention

feed forward

Pattern A (pre-norm)

From top to bottom

norm

sublayer

dropout

residual connection

Looking at the Annotated Transformer

There is a class called SublayerConnection

code:py
 import torch.nn as nn
 # LayerNorm here is the class defined elsewhere in the Annotated Transformer
 class SublayerConnection(nn.Module):
     """
     A residual connection followed by a layer norm.
     Note for code simplicity the norm is first as opposed to last.
     """
     def __init__(self, size, dropout):
         super(SublayerConnection, self).__init__()
         self.norm = LayerNorm(size)
         self.dropout = nn.Dropout(dropout)
     def forward(self, x, sublayer):
         "Apply residual connection to any sublayer with the same size."
         # Pre-norm: norm -> sublayer -> dropout -> residual add
         return x + self.dropout(sublayer(self.norm(x)))

https://qiita.com/halhorn/items/c91497522be27bde17ce

A figure partway through the article

https://gyazo.com/8b3f39c351ff0c0f059ef0aa46d5724c

code:py
 import tensorflow as tf
 # LayerNormalization is assumed to be the custom layer defined elsewhere in the Qiita article
 class ResidualNormalizationWrapper(tf.keras.models.Model):
     def __init__(self, layer: tf.keras.layers.Layer, dropout_rate: float, *args, **kwargs) -> None:
         super().__init__(*args, **kwargs)
         self.layer = layer
         self.layer_normalization = LayerNormalization()
         self.dropout_layer = tf.keras.layers.Dropout(dropout_rate)
     def call(self, input: tf.Tensor, training: bool, *args, **kwargs) -> tf.Tensor:
         # Pre-norm: norm -> layer -> dropout -> residual add
         tensor = self.layer_normalization(input)
         tensor = self.layer(tensor, training=training, *args, **kwargs)
         tensor = self.dropout_layer(tensor, training=training)
         return input + tensor

This is the same as the Annotated Transformer version

I wonder whether this Qiita article is based on the Annotated Transformer above

Pattern B (post-norm)

sublayer

dropout

residual connection

layer norm
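For comparison with SublayerConnection from the Annotated Transformer, pattern B could be sketched in PyTorch roughly as follows. This is only an illustration: the class name PostNormSublayerConnection is made up for this note, and it uses the built-in nn.LayerNorm rather than any particular implementation's own norm.

```python
import torch
import torch.nn as nn


class PostNormSublayerConnection(nn.Module):
    """Pattern B: sublayer -> dropout -> residual add -> layer norm.

    Hypothetical counterpart to SublayerConnection; not taken from any library.
    """

    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # LayerNorm(x + Dropout(Sublayer(x)))
        return self.norm(x + self.dropout(sublayer(x)))


block = PostNormSublayerConnection(size=8, dropout=0.1)
block.eval()  # disable dropout so the output is deterministic
x = torch.randn(2, 4, 8)  # (batch, sequence, feature), arbitrary sizes
out = block(x, nn.Linear(8, 8))  # nn.Linear stands in for attention / feed forward
```

The only difference from pattern A is where the norm is applied: here it wraps the sum, so the block's output is always normalized.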

In its paper, BERT only says that it used the tensor2tensor implementation

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/transformer_layers.py

Looking at this, it does seem to be the latter (pattern B)

No, I misread it

Looking at master, it is pre-norm

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py#L1845-L1846

However, this is apparently the implementation used for Attention Is All You Need

I looked at the initial commit, and it is pattern B

https://github.com/tensorflow/tensor2tensor/commit/3d9c62f2aca9492db5c22676416974005b9dcbae#diff-c4846fd5d84f2239102a6e93498958272ffdd1c06f3742488b1a6e96ed6819d7R59-R61

The TF tutorial

https://www.tensorflow.org/tutorials/text/transformer

Implemented in TF

Original paper

That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself

Residual Dropout: We apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.

So the original paper does describe pattern B
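A small framework-free sketch of what the two orderings do to the output, assuming dropout is off and using a made-up toy sublayer (doubling each element) and a layer norm without learned gain or bias. Pattern B, LayerNorm(x + Sublayer(x)), always returns a normalized vector, whereas pattern A leaves the residual path unnormalized.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (almost) unit variance; no learned gain/bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm(x, sublayer):
    """Pattern A: norm -> sublayer -> (dropout omitted) -> residual add."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def post_norm(x, sublayer):
    """Pattern B: sublayer -> (dropout omitted) -> residual add -> norm."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def toy_sublayer(x):
    # Hypothetical stand-in for attention / feed forward: doubles each element.
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
a = pre_norm(x, toy_sublayer)
b = post_norm(x, toy_sublayer)

mean_a = sum(a) / len(a)
mean_b = sum(b) / len(b)
var_b = sum((v - mean_b) ** 2 for v in b) / len(b)
```

With this input, the pattern A output keeps the mean of x (2.5) because x passes through the residual path untouched, while the pattern B output has mean 0 and variance close to 1.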

Looking at PyTorch's Transformer, it seems to be the same as BERT's

https://pytorch.org/docs/stable/_modules/torch/nn/modules/transformer.html#Transformer
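In recent PyTorch versions (1.10 and later, as far as I know), nn.TransformerEncoderLayer takes a norm_first argument, so both patterns can be tried from the same class. The default, norm_first=False, is pattern B. The sizes below are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

# Pattern B (post-norm): the default
post = nn.TransformerEncoderLayer(
    d_model=16, nhead=2, dim_feedforward=32, dropout=0.0
)
# Pattern A (pre-norm): opt in with norm_first=True
pre = nn.TransformerEncoderLayer(
    d_model=16, nhead=2, dim_feedforward=32, dropout=0.0, norm_first=True
)

post.eval()
pre.eval()

# Shape is (sequence, batch, feature) because batch_first defaults to False
x = torch.randn(5, 3, 16)
with torch.no_grad():
    y_post = post(x)
    y_pre = pre(x)
```

Both layers keep the input shape. Only the post-norm layer ends with a layer norm, so only its output is guaranteed to be normalized along the feature dimension.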

https://github.com/jadore801120/attention-is-all-you-need-pytorch

This looks like an implementation of the original paper, and it was pattern B