List of pretrained BERT models for Japanese

from Reference articles for working with BERT in Japanese

Qiita article: "List of pretrained BERT models for Japanese"

https://qiita.com/donuzium/items/27f5ba2d6660db14d860

A summary of BERT resources for Japanese. It is well organized.

table:bert

 Author	Framework	Morphological analyzer / tokenizer	Source	License
 Google	TensorFlow 2	WordPiece?	Japanese Wikipedia?	Apache 2.0
 Kyoto University, Kurohashi-Kawahara Lab	TensorFlow 1.x, PyTorch (transformers)	Juman++	Japanese Wikipedia	Apache 2.0
 Tohoku University, Inui-Suzuki Lab	TensorFlow ?, PyTorch (transformers)	MeCab (IPADic, NEologd) + WordPiece	Japanese Wikipedia	Apache 2.0
 Yohei Kikuta	TensorFlow < 2.0	SentencePiece	Japanese Wikipedia	Apache 2.0
 Hottolink, Inc.	TensorFlow 1.11	SentencePiece	Twitter Japanese reputation analysis dataset	Custom terms

https://github.com/google-research/bert/blob/master/multilingual.md

The multilingual support documentation in the BERT repository.

http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT日本語Pretrainedモデル

The model published by the Kurohashi-Kawahara Lab at Kyoto University.

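"The multilingual pretrained model includes Japanese, so it can be used for Japanese tasks, but we do not consider it appropriate that its basic unit is almost the single character. We therefore ran morphological analysis on the input text, used morphemes split into subwords as the basic unit, and pretrained on Japanese text only (using Wikipedia)."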
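
The two-stage tokenization described in the passage above (morphological analysis first, then subword splitting) can be sketched roughly as follows. This is only an illustration under stated assumptions: the morpheme list is segmented by hand and stands in for Juman++ output, `TOY_VOCAB` is a made-up vocabulary, and the greedy longest-match split is a WordPiece-style stand-in that is not necessarily the subword method the Kyoto model actually uses.

```python
from typing import List, Set


def wordpiece_split(morpheme: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first subword split of a single morpheme.

    Non-initial pieces carry the "##" continuation prefix, as in WordPiece.
    If any part of the morpheme cannot be matched, the whole morpheme
    becomes the unknown token.
    """
    pieces: List[str] = []
    start = 0
    while start < len(morpheme):
        end = len(morpheme)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = morpheme[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical, hand-made segmentation standing in for a morphological analyzer.
morphemes = ["自然", "言語", "処理", "を", "勉強", "する"]

# Hypothetical toy vocabulary; a real one is learned from the pretraining corpus.
TOY_VOCAB = {"自然", "言語", "処", "##理", "を", "勉強", "する"}

tokens = [p for m in morphemes for p in wordpiece_split(m, TOY_VOCAB)]
print(tokens)
```

Because subword splitting never crosses a morpheme boundary, every token stays inside one morpheme, which is the property the passage argues for over near character-level units.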

https://github.com/cl-tohoku/bert-japanese

This is a repository of pretrained Japanese BERT models. The pretrained models are available along with the source code of pretraining.

Update (Dec. 15 2019): Our pretrained models are now included in Transformers by Hugging Face. You can use our models in the same way as other models in Transformers.
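
As a rough sketch of what "in the same way as other models in Transformers" can look like, the function below loads one of the Tohoku checkpoints through the generic `Auto*` classes. It rests on several assumptions: the model id `cl-tohoku/bert-base-japanese-whole-word-masking` is the name I believe the checkpoint has on the Hugging Face Hub, the `torch`, `transformers`, `fugashi`, and `ipadic` packages are installed (the last two for MeCab-based tokenization), and the checkpoint can be downloaded. The helper name `embed_cls` is made up for this note.

```python
MODEL_NAME = "cl-tohoku/bert-base-japanese-whole-word-masking"  # assumed Hub id


def embed_cls(text: str, model_name: str = MODEL_NAME):
    """Return the final-layer [CLS] vector for `text` (hypothetical helper).

    Imports are done lazily so that merely defining this function does not
    require the heavy dependencies or network access.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Tokenize into input ids / attention mask as PyTorch tensors.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, sequence, hidden); take [CLS] of item 0.
    return outputs.last_hidden_state[0, 0]
```

Calling `embed_cls("自然言語処理を勉強する")` should return a one-dimensional tensor whose length is the model's hidden size; the first call downloads the weights.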

yoheikikuta/bert-japanese

https://github.com/hottolink/hottoSNS-bert

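Training BERT on another language requires language-specific handling.
 A model pretrained on English cannot be used for Japanese tasks.
 There are things to take care of at the pretraining stage itself:
  Morphological analysis
  Training corpus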
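
A small illustration of why segmentation needs attention for Japanese: whitespace splitting, which gives usable word units for English, returns a Japanese sentence as one unsplit token, and falling back to single characters loses word boundaries. The two example sentences are made up for this note.

```python
english = "I live in Tokyo"
japanese = "私は東京に住んでいます"  # roughly the same sentence in Japanese

# Whitespace splitting yields word-level units for English...
english_tokens = english.split()

# ...but Japanese has no spaces between words, so the sentence stays whole.
japanese_tokens = japanese.split()

# The naive fallback is one token per character, which ignores word boundaries.
japanese_chars = list(japanese)

print(english_tokens)
print(japanese_tokens)
print(japanese_chars)
```

A morphological analyzer such as Juman++ or MeCab, or a learned segmenter such as SentencePiece, fills the gap between these two extremes; those are the choices listed in the table above.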

2020/3

https://alaginrc.nict.go.jp/nict-bert/index.html

Looks promising in that it comes with a Japanese evaluation set.

https://github.com/akirakubo/bert-japanese-aozora/

https://laboro.ai/column/laboro-bert/

Bandai Namco's DistilBERT

https://github.com/BandaiNamcoResearchInc/DistilBERT-base-jp

https://github.com/himkt/awesome-bert-japanese