Arguments of run_glue.py, the transformers text classification example

These notes summarize the arguments so that I understand what I passed when running the script.

Excerpted from python run_glue.py -h

--model_name_or_path (the only required argument)

Path to pretrained model or model identifier from huggingface.co/models (default: None)

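The identifier is looked up under huggingface.co, so for the Tohoku University BERT model it is cl-tohoku/bert-base-japanese-whole-word-masking

The identifier must be given in full, starting from the cl-tohoku prefix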

--task_name

The name of the task to train on: cola, mnli, mrpc, qnli, qqp, rte, sst2, stsb, wnli (default: None)

--dataset_name

The name of the dataset to use (via the datasets library). (default: None)

--max_seq_length

The maximum total input sequence length after tokenization.

Sequences longer than this will be truncated, sequences shorter will be padded. (default: 128)
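To make the truncate-or-pad behaviour concrete, here is a minimal sketch in plain Python. It is not the tokenizer's actual code: the helper name pad_or_truncate and the padding id 0 are assumptions for illustration, and a real tokenizer also takes care of special tokens when truncating.

```python
from typing import List


def pad_or_truncate(token_ids: List[int], max_seq_length: int = 128, pad_id: int = 0) -> List[int]:
    """Return token_ids cut or padded to exactly max_seq_length entries."""
    if len(token_ids) >= max_seq_length:
        # Longer sequences lose everything past max_seq_length.
        return token_ids[:max_seq_length]
    # Shorter sequences are filled up with the (assumed) padding id.
    return token_ids + [pad_id] * (max_seq_length - len(token_ids))


short = pad_or_truncate(list(range(1, 6)), max_seq_length=8)
long = pad_or_truncate(list(range(1, 21)), max_seq_length=8)
print(short)  # [1, 2, 3, 4, 5, 0, 0, 0]
print(long)   # [1, 2, 3, 4, 5, 6, 7, 8]
```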

--do_train (flag)

Whether to run training. (default: False)

--do_eval (flag)

Whether to run eval on the dev set. (default: False)

--do_predict (flag)

Whether to run predictions on the test set. (default: False)

--per_device_train_batch_size

Batch size per GPU/TPU core/CPU for training. (default: 8)

--learning_rate

The initial learning rate for AdamW. (default: 5e-05)

--num_train_epochs

Total number of training epochs to perform. (default: 3.0)
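Since the batch size is per device, the effective batch size and the number of optimizer steps depend on how many devices are used. The arithmetic can be sketched as below. This is only an estimate under stated assumptions: no gradient accumulation, the last partial batch is kept, and the example count of 1000 is made up.

```python
import math


def total_optimizer_steps(
    num_examples: int,
    per_device_batch_size: int = 8,
    num_devices: int = 1,
    num_train_epochs: float = 3.0,
) -> int:
    """Estimate the number of optimizer steps for a training run."""
    # Each step consumes one batch on every device.
    effective_batch = per_device_batch_size * num_devices
    # Assumption: the last, smaller batch of an epoch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return math.ceil(steps_per_epoch * num_train_epochs)


print(total_optimizer_steps(1000))                 # 375
print(total_optimizer_steps(1000, num_devices=2))  # 189
```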

--output_dir

The output directory where the model predictions and checkpoints will be written. (default: None)

--seed

Random seed that will be set at the beginning of training. (default: 42)

--data_seed

Random seed to be used with data samplers. (default: None)

Was the reason my results did not exactly match the README that data_seed was not set to the same value?
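One way to check whether seeding explains such a mismatch is to pin both seeds explicitly on the command line. The sketch below only builds and prints the command; it does not run it. The helper name build_command, the model bert-base-cased, the task mrpc, and the output directory /tmp/mrpc_run are placeholders chosen for illustration, not the settings from the README.

```python
import shlex
import sys
from typing import List


def build_command(model: str, task_name: str, output_dir: str,
                  seed: int = 42, data_seed: int = 42) -> List[str]:
    """Build a run_glue.py command line with both seeds set explicitly."""
    return [
        sys.executable, "run_glue.py",
        "--model_name_or_path", model,
        "--task_name", task_name,
        "--do_train", "--do_eval",
        "--max_seq_length", "128",
        "--per_device_train_batch_size", "8",
        "--learning_rate", "5e-5",
        "--num_train_epochs", "3",
        "--output_dir", output_dir,
        "--seed", str(seed),
        "--data_seed", str(data_seed),
    ]


# Placeholder values; replace them with the ones from the run being compared.
command = build_command("bert-base-cased", "mrpc", "/tmp/mrpc_run")
print(shlex.join(command))
```

Even with identical seeds, results may still differ across hardware or library versions, so this only rules out one possible cause.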

How do I point the script at local CSV or JSON files?

The following are the candidates to try

--train_file

A csv or a json file containing the training data. (default: None)

--validation_file

A csv or a json file containing the validation data. (default: None)

--test_file

A csv or a json file containing the test data. (default: None)
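A minimal sketch of how these options might be combined for local CSV files is shown below. It writes two tiny files and prints the command without running it. The column names sentence1 and label are my assumption about what the script expects, and the file names and sample rows are made up; check the comments in run_glue.py before relying on them.

```python
import csv
import shlex
import sys
import tempfile
from pathlib import Path


def write_csv(path: Path, rows) -> None:
    """Write rows of (text, label) under an assumed sentence1/label header."""
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence1", "label"])  # assumed column names
        writer.writerows(rows)


data_dir = Path(tempfile.mkdtemp())
train_path = data_dir / "train.csv"
valid_path = data_dir / "valid.csv"
write_csv(train_path, [("good movie", 1), ("bad movie", 0)])
write_csv(valid_path, [("fine movie", 1)])

# No --task_name or --dataset_name: the data comes from the two files only.
command = [
    sys.executable, "run_glue.py",
    "--model_name_or_path", "cl-tohoku/bert-base-japanese-whole-word-masking",
    "--train_file", str(train_path),
    "--validation_file", str(valid_path),
    "--do_train", "--do_eval",
    "--output_dir", str(data_dir / "output"),
]
print(shlex.join(command))
```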

Both the training file and the validation file must be specified. Otherwise the script raises:

ValueError: Need either a GLUE task, a training/validation file or a dataset name.
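The message suggests a check along the following lines. This is my own sketch of the kind of validation that would produce the error, not the script's actual code; the function name check_data_args is hypothetical.

```python
from typing import Optional


def check_data_args(
    task_name: Optional[str] = None,
    dataset_name: Optional[str] = None,
    train_file: Optional[str] = None,
    validation_file: Optional[str] = None,
) -> None:
    """Raise ValueError unless some source of training data is given."""
    # A GLUE task or a dataset name is enough on its own.
    if task_name is not None or dataset_name is not None:
        return
    # Otherwise both local files are needed, not just one of them.
    if train_file is None or validation_file is None:
        raise ValueError(
            "Need either a GLUE task, a training/validation file or a dataset name."
        )


check_data_args(train_file="train.csv", validation_file="valid.csv")  # passes

try:
    check_data_args(train_file="train.csv")  # validation file is missing
except ValueError as error:
    message = str(error)
    print(message)
```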