Feeding the livedoor news corpus to run_glue.py, the text-classification example in transformers

So, as a first step, I run it on #ライブドアニュースコーパス (the livedoor news corpus), which is a multi-class classification task

The goal: do what Chapter 6, "Text Classification" (『 #BERTによる自然言語処理入門』), does, without writing code

Result: reproduced (I think it is fair to say) 🙌

Accuracy is 3% lower, but I attribute that to a different data split

code:shell

$ pwd

/.../transformers/examples/pytorch/text-classification

$ # Fetch the livedoor news corpus, following 『BERTによる自然言語処理入門』

$ python preprocess.py text livedoor_news_corpus
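preprocess.py itself is not shown in this note, so the following is only a minimal sketch of what such a script could look like, not the actual one. Assumptions: the corpus is laid out as `text/<category>/*.txt` with URL, timestamp and title on the first three lines of each article (as in the book), and the output is JSON lines with a `sentence1` text column and a `label` column, which as far as I know is a shape run_glue.py accepts for custom files. The function names, the 60/20/20 ratios and the seed are all made up for illustration.

```python
import json
import random
from pathlib import Path


def load_articles(corpus_dir):
    """Collect {"sentence1": body, "label": category} from <corpus_dir>/<category>/*.txt."""
    records = []
    for category_dir in sorted(Path(corpus_dir).iterdir()):
        if not category_dir.is_dir():
            continue  # skip top-level files such as README.txt
        for path in sorted(category_dir.glob("*.txt")):
            if path.name == "LICENSE.txt":
                continue  # per-category license file, not an article
            lines = path.read_text(encoding="utf-8").splitlines()
            # Assumed layout: URL, timestamp, title, then the article body.
            body = "\n".join(lines[3:])
            records.append({"sentence1": body, "label": category_dir.name})
    return records


def split_records(records, seed=42, train_ratio=0.6, val_ratio=0.2):
    """Shuffle, then cut into train/val/test; whatever is left goes to test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def write_json_lines(splits, output_dir):
    """Write one JSON object per line to <output_dir>/<split>.json."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / f"{name}.json", "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")


# Hypothetical usage, mirroring `python preprocess.py text livedoor_news_corpus`:
# write_json_lines(split_records(load_articles("text")), "livedoor_news_corpus")
```

With 7367 records these ratios give 4420 / 1473 / 1474, the same sizes as the line counts in this note, though the real preprocess.py may well have shuffled and split differently.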

code:ipynb

!wc -l transformers/examples/pytorch/text-classification/livedoor_news_corpus/*.json

1474 transformers/examples/pytorch/text-classification/livedoor_news_corpus/test.json

4420 transformers/examples/pytorch/text-classification/livedoor_news_corpus/train.json

1473 transformers/examples/pytorch/text-classification/livedoor_news_corpus/val.json

7367 total

code:train.ipynb

!cd transformers/examples/pytorch/text-classification/ && python run_glue.py \
--model_name_or_path cl-tohoku/bert-base-japanese-whole-word-masking \
--train_file livedoor_news_corpus/train.json \
--validation_file livedoor_news_corpus/val.json \
--test_file livedoor_news_corpus/test.json \
--do_train \
--do_eval \
--do_predict \
--data_seed 42 \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 5 \
--output_dir /tmp/livedoor/

learning_rate and num_train_epochs follow Chapter 6, "Text Classification"
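As a rough sanity check on these settings, the number of optimizer updates can be estimated from the training-set size (4420 lines in train.json) and the batch size. This is only a sketch: it assumes a single device and no gradient accumulation, neither of which is stated in this note, and the function name is made up for illustration.

```python
import math


def estimated_update_steps(num_examples, batch_size, num_epochs):
    """Optimizer updates for a run on one device without gradient accumulation."""
    # The last, smaller batch of each epoch still counts as one step.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


print(estimated_update_steps(4420, 32, 5))  # 139 steps per epoch * 5 epochs = 695
```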

After a little under 20 minutes of training:

code:/tmp/livedoor/README.md

It achieves the following results on the evaluation set:

- Loss: 0.4653

- Accuracy: 0.8513
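These figures are for the evaluation set (val.json). Because --do_predict was passed, run_glue.py should also have written test-set predictions into the output directory. As far as I know, the file is named `predict_results_None.txt` when no --task_name is given, and it is tab-separated with an `index<TAB>prediction` header, but both the file name and the layout are assumptions to check against the script version in use. Under those assumptions, test accuracy could be computed with a sketch like this (the function name is hypothetical):

```python
import json


def accuracy_on_test_set(test_file, predictions_file):
    """Compare gold labels in a JSON-lines file with tab-separated predictions."""
    with open(test_file, encoding="utf-8") as f:
        gold = [str(json.loads(line)["label"]) for line in f if line.strip()]

    with open(predictions_file, encoding="utf-8") as f:
        rows = f.read().splitlines()[1:]  # drop the assumed header line
    predicted = [row.split("\t", 1)[1] for row in rows if row]

    if len(gold) != len(predicted):
        raise ValueError(f"{len(gold)} gold labels vs {len(predicted)} predictions")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical usage with the paths from this note:
# accuracy_on_test_set("livedoor_news_corpus/test.json",
#                      "/tmp/livedoor/predict_results_None.txt")
```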