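About the arguments of run_glue.py, the text classification example in transformers
A summary to understand the arguments used when running the script as described below
Excerpted from python run_glue.py -h
--model_name_or_path (the only required argument)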
Path to pretrained model or model identifier from huggingface.co/models (default: None)
The identifier is looked up under huggingface.co, so for the Tohoku University BERT below it is cl-tohoku/bert-base-japanese-whole-word-masking
The cl-tohoku/ prefix must be included in the identifier
--task_name
The name of the task to train on: cola, mnli, mrpc, qnli, qqp, rte, sst2, stsb, wnli (default: None)
--dataset_name
The name of the dataset to use (via the datasets library). (default: None)
--max_seq_length
The maximum total input sequence length after tokenization.
Sequences longer than this will be truncated, sequences shorter will be padded. (default: 128)
--do_train (flag)
Whether to run training. (default: False)
--do_eval (flag)
Whether to run eval on the dev set. (default: False)
--do_predict (flag)
Whether to run predictions on the test set. (default: False)
--per_device_train_batch_size
Batch size per GPU/TPU core/CPU for training. (default: 8)
--learning_rate
The initial learning rate for AdamW. (default: 5e-05)
--num_train_epochs
Total number of training epochs to perform. (default: 3.0)
--output_dir
The output directory where the model predictions and checkpoints will be written. (default: None)
--seed
Random seed that will be set at the beginning of training. (default: 42)
--data_seed
Random seed to be used with data samplers. (default: None)
The results did not match the README exactly; was that because data_seed was not set to the same value?
How do I specify local CSV or JSON files?
Candidates to try:
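For reference, the arguments listed above can be assembled into one command line. This is a minimal sketch, not the exact command used for these notes: the model (bert-base-cased), the task (mrpc), and the output directory are placeholder assumptions, and the remaining values are simply the defaults quoted from the help text.

```python
import shlex

# Arguments described above. Values marked "assumed" are placeholders,
# not the settings actually used in these notes.
args = {
    "--model_name_or_path": "bert-base-cased",  # assumed model identifier
    "--task_name": "mrpc",                      # assumed: any task from the list
    "--max_seq_length": "128",                  # default
    "--per_device_train_batch_size": "8",       # default
    "--learning_rate": "5e-05",                 # default
    "--num_train_epochs": "3.0",                # default
    "--output_dir": "/tmp/mrpc_output",         # assumed path
    "--seed": "42",                             # default
}
# Boolean flags take no value on the command line.
flags = ["--do_train", "--do_eval"]

command = ["python", "run_glue.py"]
for name, value in args.items():
    command += [name, value]
command += flags

# Print a copy-pasteable shell command.
print(shlex.join(command))
```

Building the command as a list and joining it with shlex.join keeps the quoting correct if a path ever contains spaces. To test the data_seed question, one could add a "--data_seed" entry to args and compare runs.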
--train_file
A csv or a json file containing the training data. (default: None)
--validation_file
A csv or a json file containing the validation data. (default: None)
--test_file
A csv or a json file containing the test data. (default: None)
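When neither --task_name nor --dataset_name is given, both --train_file and --validation_file must be specified; otherwise the script fails with: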
ValueError: Need either a GLUE task, a training/validation file or a dataset name.
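The error message suggests a check along the following lines. This is a rough sketch of that logic as inferred from the message and the observed behaviour, not the script's actual code; the function name check_data_args and the file names train.csv and dev.csv are hypothetical.

```python
from typing import Optional


def check_data_args(
    task_name: Optional[str] = None,
    dataset_name: Optional[str] = None,
    train_file: Optional[str] = None,
    validation_file: Optional[str] = None,
) -> str:
    """Return which data source would be used, or raise like the script does.

    Hypothetical helper: it mimics the observed behaviour, not the real code.
    """
    if task_name is not None:
        return "glue_task"
    if dataset_name is not None:
        return "dataset"
    # Local files: both the training and the validation file are required.
    if train_file is None or validation_file is None:
        raise ValueError(
            "Need either a GLUE task, a training/validation file or a dataset name."
        )
    return "local_files"


# Only a training file: reproduces the error above.
try:
    check_data_args(train_file="train.csv")
except ValueError as err:
    print(f"ValueError: {err}")

# Both files given: accepted.
print(check_data_args(train_file="train.csv", validation_file="dev.csv"))
```

Under this reading, --test_file is optional and would only matter when --do_predict is set.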