tokenizers.Tokenizer

Parameters

model (Model) — The core algorithm that this Tokenizer should be using.

A Tokenizer works as a pipeline. It processes some raw text as input and outputs an Encoding.
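As a minimal sketch of this pipeline, the snippet below builds a Tokenizer around a WordLevel model and encodes a short string. The toy vocabulary and the choice of WordLevel with a Whitespace pre-tokenizer are assumptions made for illustration only; any other Model could be passed instead.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocabulary, used only for this illustration.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}

# The model is the core algorithm handed to the Tokenizer.
tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
# Split the raw text into words before the model maps them to ids.
tokenizer.pre_tokenizer = Whitespace()

# Raw text in, Encoding out.
encoding = tokenizer.encode("hello world again")
print(encoding.tokens)  # words missing from the vocabulary become "[UNK]"
print(encoding.ids)
```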

train method

Parameters

files (List[str]) — A list of paths to the files that we should use for training

trainer (~tokenizers.trainers.Trainer, optional) — An optional trainer that should be used to train our Model

Reads the files line by line, while keeping all the whitespace, even new lines.

That is, each file is read one line at a time, and every whitespace character is preserved, including newlines.
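A hedged sketch of calling train with the two parameters above. The corpus file, its contents, and the WordLevel/WordLevelTrainer pairing are assumptions for illustration. The `raw_lines` part is only a plain-Python analogy for what "keeping newlines" means, not the library's internal reader.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical two-line training corpus.
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello world\nhello tokenizers\n")

    # Analogy only: iterating a file in Python also yields each line
    # with its trailing newline still attached.
    with open(path, encoding="utf-8") as f:
        raw_lines = list(f)

    tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordLevelTrainer(special_tokens=["[UNK]"], show_progress=False)

    # files is a list of paths; trainer is optional in the signature.
    tokenizer.train(files=[path], trainer=trainer)

vocab = tokenizer.get_vocab()
print(raw_lines)
print(sorted(vocab))
```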

encode method

Parameters

add_special_tokens
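add_special_tokens is a bool that defaults to True and controls whether special tokens are added to the output. A minimal sketch of its effect follows; the vocabulary and the BERT-style template are assumptions for illustration, and the flag only has a visible effect when a post-processor that inserts special tokens is set.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Hypothetical toy vocabulary that includes two special tokens.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# The post-processor is the step that inserts the special tokens.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

with_special = tokenizer.encode("hello world")
without_special = tokenizer.encode("hello world", add_special_tokens=False)

print(with_special.tokens)
print(without_special.tokens)
```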

Why can't the following be found in the documentation?

tokenize

convert_ids_to_tokens
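One likely explanation: tokenize and convert_ids_to_tokens are methods of the tokenizer classes in the transformers library, not of tokenizers.Tokenizer. The sketch below shows rough counterparts on the tokenizers side, under the assumption that Encoding.tokens and Tokenizer.id_to_token cover the same need. The toy vocabulary is hypothetical.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocabulary, used only for this illustration.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode("hello world")

# Rough counterpart of tokenize: the token strings of the Encoding.
tokens = encoding.tokens

# Rough counterpart of convert_ids_to_tokens: map each id back to its token.
tokens_from_ids = [tokenizer.id_to_token(i) for i in encoding.ids]

print(tokens)
print(tokens_from_ids)
```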