BERT
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
It builds on the Transformer: "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely." (from the Transformer paper, "Attention Is All You Need")
BERT is an extension of that trend: it provides a model that can be pre-trained and reused.
Simply by adding one output layer to the pre-trained model, they achieved strong results on eleven different language understanding tasks.
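As a rough illustration of "just add an output layer", here is a minimal sketch of a classification head on top of a pre-trained encoder. It assumes the Hugging Face transformers library, which the original note does not mention; the model name and label count are placeholders.
```python
# Minimal sketch: fine-tune BERT for classification by adding one linear output layer.
# Assumes the Hugging Face `transformers` library; model name and num_labels are illustrative.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

class Classifier(nn.Module):
    def __init__(self, encoder, num_labels=2):
        super().__init__()
        self.encoder = encoder
        # The only task-specific part: one linear layer on the [CLS] representation.
        self.head = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]  # vector at the [CLS] position
        return self.head(cls_vector)

model = Classifier(encoder)
batch = tokenizer(["he likes playing"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
```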
Transformer
The same trend that made pre-trained models easy to apply to image-recognition tasks in real applications is likely to happen in natural language processing as well. (But what about minor languages such as Japanese...?) (Well, should the government have professionals create a good pre-trained model and distribute it for free?) (By all means, aiming to get it done in time before the Olympics.)
BERT: Bidirectional Encoder Representations from Transformers.
proposing a new pre-training objective: the “masked language model” (MLM), inspired by the Cloze task (Taylor, 1953).
The Cloze task is the so-called "fill-in-the-blanks" problem.
we also introduce a “next sentence prediction” task that jointly pre-trains text-pair representations.
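To make the next-sentence-prediction setup concrete, here is a small sketch of how training pairs can be built: half the time sentence B really follows A, half the time B is a random sentence. The function and variable names are my own, not from the paper.
```python
import random

def make_nsp_example(sentences, index):
    """Build one next-sentence-prediction pair from a list of sentences.

    50%: B is the true next sentence (label "IsNext").
    50%: B is a random sentence from the corpus (label "NotNext").
    """
    sentence_a = sentences[index]
    if random.random() < 0.5 and index + 1 < len(sentences):
        sentence_b = sentences[index + 1]
        label = "IsNext"
    else:
        sentence_b = random.choice(sentences)
        label = "NotNext"
    # BERT packs the pair as: [CLS] A [SEP] B [SEP]
    text = f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP]"
    return text, label
```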
The Transformer implementation uses the original tensor2tensor library as-is, without modification.
Words are split with WordPiece, e.g. "he likes play ##ing", and position and sentence-number (segment) embeddings are added.
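The input representation is simply the sum of three embeddings per WordPiece token: token, segment (which sentence), and position. A minimal sketch follows; the sizes match BERT-base, but the token ids are made up for illustration.
```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768  # BERT-base sizes (illustrative here)

token_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(2, hidden)         # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)  # learned position embeddings

# "[CLS] he likes play ##ing [SEP]" -> token ids (made-up ids for illustration)
token_ids = torch.tensor([[101, 2002, 7777, 2377, 2075, 102]])
segment_ids = torch.zeros_like(token_ids)                        # all tokens belong to sentence A
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

# The Transformer input is the elementwise sum of the three embeddings.
inputs = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
```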
The task masks 15% of the tokens at random and predicts what was masked.
Not all of the selected tokens become [MASK] tokens.
Of the 15%, 80% become [MASK], 10% are replaced with random words, and 10% are left as they were.
nishio.icon So even seemingly ordinary words may be questioned later.
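A sketch of the 15% / 80-10-10 masking rule described above; the vocabulary and token list here are placeholders, and the function name is my own.
```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """BERT-style masking: pick ~15% of positions, then
    80% -> [MASK], 10% -> random word, 10% -> keep the original word."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() >= mask_rate:
            continue
        labels[i] = token  # the model must predict the original token at this position
        r = random.random()
        if r < 0.8:
            inputs[i] = "[MASK]"
        elif r < 0.9:
            inputs[i] = random.choice(vocab)
        # else: leave the token unchanged (it is still a prediction target)
    return inputs, labels

vocab = ["he", "likes", "play", "##ing", "dog", "cat"]
print(mask_tokens(["he", "likes", "play", "##ing"], vocab))
```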
Learning the masked LM is easier than other tasks.