On Layer Normalization in the Transformer Architecture

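In the Transformer, applying Layer Normalization immediately before the Multi-Head Attention inside the residual block (Pre-LN) greatly stabilizes training, removes the need for learning-rate warm-up, and allows a larger learning rate. https://t.co/VWMbmf3w3x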
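
The Pre-LN placement described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's implementation: `toy_sublayer` is a hypothetical stand-in for multi-head attention, `layer_norm` omits the learned gain and bias, and `post_ln_block` shows the usual alternative ordering (normalize after the residual addition) purely for contrast.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (near) unit variance.
    Assumption: no learned gain/bias, to keep the sketch small."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize first, apply the sublayer, then add the residual.
    # The skip path carries x through without any normalization.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x, sublayer):
    # Post-LN (for contrast): add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def toy_sublayer(h):
    # Hypothetical stand-in for multi-head attention: a fixed scaling.
    return [0.5 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
pre_out = pre_ln_block(x, toy_sublayer)
post_out = post_ln_block(x, toy_sublayer)
```

The only difference between the two functions is where `layer_norm` sits relative to the residual addition. In the Pre-LN version the input `x` reaches the output unchanged through the skip connection, whereas in the Post-LN version the whole sum is rescaled.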