transformers.PreTrainedTokenizerBase
https://huggingface.co/docs/transformers/v4.18.0/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase
from_pretrained
https://huggingface.co/docs/transformers/v4.18.0/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.from_pretrained
First argument: pretrained_model_name_or_path
Accepted forms (several are supported), for example:
A path to a directory containing vocabulary files required by the tokenizer
Return value: cls._from_pretrained(...)
https://github.com/huggingface/transformers/blob/v4.18.0/src/transformers/tokenization_utils_base.py#L1780-L1788
Any kwargs that from_pretrained does not use itself are passed on to cls._from_pretrained
Example: max_length
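The forwarding pattern described above can be sketched with a toy class. This is only an illustration, not the transformers source: `ForwardingTokenizer`, its `cache_dir` parameter, and the hard-coded vocabulary are hypothetical stand-ins, and `max_length` is used simply because it is the example named in this note.

```python
class ForwardingTokenizer:
    """Hypothetical stand-in; NOT the transformers implementation."""

    def __init__(self, vocab, **init_kwargs):
        self.vocab = vocab
        # Whatever the loader did not consume ends up here.
        self.init_kwargs = init_kwargs

    @classmethod
    def from_pretrained(cls, path, cache_dir=None, **kwargs):
        # `cache_dir` is consumed by this loader (a hypothetical loader-only
        # option); every other keyword argument is forwarded untouched.
        return cls._from_pretrained(path, **kwargs)

    @classmethod
    def _from_pretrained(cls, path, **kwargs):
        # A real loader would read vocabulary files under `path`;
        # this toy uses a fixed vocabulary instead.
        vocab = {"[UNK]": 0}
        return cls(vocab, **kwargs)


tok = ForwardingTokenizer.from_pretrained(
    "some/dir", cache_dir="/tmp/cache", max_length=128
)
print(tok.init_kwargs)  # {'max_length': 128}
```

Only `max_length` survives to the constructor because `cache_dir` was named explicitly in the signature of the toy `from_pretrained`.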
The object returned by from_pretrained can convert text into a sequence of token IDs with encode(...)
https://huggingface.co/docs/transformers/v4.18.0/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode
Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.
Same as doing self.convert_tokens_to_ids(self.tokenize(text)).
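The composition quoted from the docs can be illustrated with a toy tokenizer. This is a minimal sketch under stated assumptions: `WhitespaceTokenizer`, its lowercase-and-split rule, and the three-entry vocabulary are all hypothetical and only mirror the method names from the docs, not the library's behavior.

```python
class WhitespaceTokenizer:
    """Hypothetical toy; mirrors the method names only, not the library."""

    def __init__(self, vocab, unk_token="[UNK]"):
        self.vocab = vocab
        self.unk_id = vocab[unk_token]

    def tokenize(self, text):
        # Step 1: string -> list of token strings (toy rule: lowercase + split).
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        # Step 2: token strings -> integer IDs, unknown tokens map to [UNK].
        return [self.vocab.get(token, self.unk_id) for token in tokens]

    def encode(self, text):
        # encode is the two steps chained together.
        return self.convert_tokens_to_ids(self.tokenize(text))


tok = WhitespaceTokenizer({"[UNK]": 0, "hello": 1, "world": 2})
ids = tok.encode("Hello world again")
print(ids)  # [1, 2, 0]
```

In the library itself, encode also accepts an add_special_tokens argument that defaults to True, so the IDs it returns may include special tokens that the plain two-step composition does not produce.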