transformers.DataCollatorForLanguageModeling
tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) — The tokenizer used for encoding the data.
mlm (bool, optional, defaults to True) — Whether or not to use masked language modeling.
The default, True, sets up training for a masked language model.
Setting it to False sets up training for a causal language model (text generation).
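To make the mlm=False case concrete, here is a minimal pure-Python sketch of how labels are built for causal language modeling, as far as I understand the collator: labels are a copy of input_ids, with padding positions set to -100 so the loss skips them. The function name causal_lm_labels and the token ids are hypothetical, made up for illustration; this is not the library's code.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, ignoring padding positions.

    Hypothetical helper for illustration only.
    """
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Hypothetical token ids; 0 stands in for the padding token.
input_ids = [101, 2023, 2003, 102, 0, 0]
labels = causal_lm_labels(input_ids, pad_token_id=0)
print(labels)  # [101, 2023, 2003, 102, -100, -100]
```

The labels are not shifted here; the shift by one position for next-token prediction is normally done inside the model when it computes the loss.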
mlm_probability (float, optional, defaults to 0.15) — The probability with which to (randomly) mask tokens in the input, when mlm is set to True.
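As a rough picture of what mlm_probability controls, the sketch below imitates BERT-style masking in plain Python. It assumes the usual defaults: each non-special token is selected with probability mlm_probability; of the selected tokens, about 80% become the mask token, about 10% become a random token, and about 10% stay unchanged; unselected positions get the label -100. The function mask_tokens, its arguments, and the token ids are all hypothetical and are not the library's implementation.

```python
import random

IGNORE_INDEX = -100  # label value ignored by the loss


def mask_tokens(input_ids, special_ids, mask_token_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_input_ids, labels) for one sequence.

    Hypothetical helper that imitates BERT-style masking.
    """
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Special tokens are never selected; others are selected at random.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id              # ~80%: replace with mask
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # ~10%: random token
        # remaining ~10%: keep the original token
    return masked, labels


# Hypothetical ids: 101/102 as special tokens, 103 as the mask token.
masked, labels = mask_tokens(
    [101, 7, 8, 9, 102], special_ids={101, 102},
    mask_token_id=103, vocab_size=1000, rng=random.Random(0),
)
print(masked, labels)
```

With mlm_probability=0.15, most positions end up with the label -100, so only a small fraction of tokens contributes to the loss in each batch.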
return_tensors (str) — The type of Tensor to return. Allowable values are “np”, “pt” and “tf”.
The default is "pt".