tokenizers.Tokenizer
Parameters
model (Model) — The core algorithm that this Tokenizer should be using.
A Tokenizer works as a pipeline. It processes some raw text as input and outputs an Encoding.
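To make the pipeline concrete, here is a minimal sketch that wraps a model in a Tokenizer and inspects the resulting Encoding. The toy vocabulary, the choice of WordLevel as the model, and the Whitespace pre-tokenizer are assumptions made for illustration; they do not come from the notes above.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary, invented for this example.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}

# The model is the core algorithm; the Tokenizer wraps it in a pipeline.
tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
# Split the raw text on whitespace before it reaches the model.
tokenizer.pre_tokenizer = Whitespace()

# Raw text in, Encoding out. "again" is not in the vocabulary,
# so it is mapped to the unknown token.
encoding = tokenizer.encode("hello world again")
print(encoding.tokens)
print(encoding.ids)
```

The Encoding object carries the tokens, their ids, and further alignment information such as offsets.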
train method
Parameters
files (List[str]) — A list of paths to the files that we should use for training
trainer (~tokenizers.trainers.Trainer, optional) — An optional trainer that should be used to train our Model
Reads the files line by line, keeping all whitespace, including newlines.
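A small sketch of calling train with both parameters might look like the following. The corpus contents, the file name corpus.txt, and the choice of WordLevel with WordLevelTrainer are assumptions for illustration; any model with a matching trainer should work the same way.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Start from an empty model; training fills in the vocabulary.
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# The trainer must match the model type. Progress output is disabled
# to keep the example quiet.
trainer = WordLevelTrainer(special_tokens=["[UNK]"], show_progress=False)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical two-line training corpus.
    corpus_path = os.path.join(tmp_dir, "corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as f:
        f.write("hello world\nhello tokenizers\n")

    # files is a list of paths; the files are read line by line.
    tokenizer.train(files=[corpus_path], trainer=trainer)

vocab = tokenizer.get_vocab()
print(sorted(vocab))
```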
encode method
Parameters
add_special_tokens (bool, defaults to True) — Whether to add the special tokens
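The effect of add_special_tokens only becomes visible once the tokenizer has a post-processor that inserts special tokens. The sketch below assumes a toy vocabulary and a BERT-style "[CLS] ... [SEP]" template; both are chosen for illustration and are not taken from the notes above.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy vocabulary, invented for this example.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# The post-processor wraps a single sequence in [CLS] ... [SEP].
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

# Default behaviour: special tokens are added.
with_special = tokenizer.encode("hello world")
# Special tokens are skipped when the flag is False.
without_special = tokenizer.encode("hello world", add_special_tokens=False)

print(with_special.tokens)
print(without_special.tokens)
```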
Why can't the following be found in the documentation?
tokenize
convert_ids_to_tokens
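A likely explanation is that tokenize and convert_ids_to_tokens are methods of the tokenizer classes in the transformers library (for example PreTrainedTokenizer), not of tokenizers.Tokenizer, so they are documented there instead. With tokenizers.Tokenizer, the closest equivalents appear to be the tokens attribute of the Encoding and the id_to_token method. The sketch below uses a toy vocabulary that is assumed for illustration.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary, invented for this example.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode("hello world")

# Rough counterpart of tokenize(): read the tokens off the Encoding.
tokens = encoding.tokens

# Rough counterpart of convert_ids_to_tokens(): map each id back
# to its token with id_to_token.
tokens_from_ids = [tokenizer.id_to_token(i) for i in encoding.ids]

print(tokens)
print(tokens_from_ids)
```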