tokenizers.pre_tokenizers.ByteLevel

https://huggingface.co/docs/tokenizers/main/en/api/pre-tokenizers#tokenizers.pre_tokenizers.ByteLevel

This pre-tokenizer takes care of replacing all bytes of the given string with a corresponding representation, as well as splitting into words.

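In other words, it does two things: it splits the text into words, and it replaces every byte of the given string with a corresponding representation.

Reference: ByteLevelBPETokenizer output seems weird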

Parameters

add_prefix_space (bool, optional, defaults to True) — Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello.

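That is, a space is added before the first word.

As a result, "hello" on its own is treated the same as the hello in "say hello" (in both cases it becomes " hello", i.e. hello preceded by a single half-width space).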

use_regex (bool, optional, defaults to True) — Set this to False to prevent this pre_tokenizer from using the GPT2 specific regexp for splitting on whitespace.
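
The effect of the two parameters can be illustrated with a small standard-library sketch. This is not the library's implementation: `sketch_split` is a hypothetical helper, and `APPROX_GPT2_PATTERN` only approximates the GPT-2 pattern (the real one relies on `\p{L}` and `\p{N}`, which Python's `re` does not support). It also assumes that with `use_regex=False` the input is kept as a single piece.

```python
import re

# Approximation of the GPT-2 splitting pattern using only the standard `re` module.
# Assumption: letters ~ [^\W\d_], numbers ~ \d (the real pattern uses \p{L} and \p{N}).
APPROX_GPT2_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"  # common English contractions
    r"| ?[^\W\d_]+"             # optional leading space + letters
    r"| ?\d+"                   # optional leading space + digits
    r"| ?(?:[^\s\w]|_)+"        # optional leading space + other symbols
    r"|\s+(?!\S)|\s+"           # runs of whitespace
)


def sketch_split(text, add_prefix_space=True, use_regex=True):
    """Hypothetical helper that mimics the two ByteLevel parameters."""
    # add_prefix_space: prepend one space unless the text already starts with one.
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    # use_regex=False (assumed behaviour): no splitting, the text stays one piece.
    if not use_regex:
        return [text]
    return APPROX_GPT2_PATTERN.findall(text)


print(sketch_split("hello"))                           # [' hello']
print(sketch_split("say hello"))                       # [' say', ' hello']
print(sketch_split("hello", add_prefix_space=False))   # ['hello']
print(sketch_split("say hello", use_regex=False))      # [' say hello']
```

With the prefix space, the piece for hello is `' hello'` in both inputs, which is the point of add_prefix_space.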

alphabet method (static method)

Returns the alphabet used by this PreTokenizer.

Since the ByteLevel works as its name suggests, at the byte level, it encodes each byte value to a unique visible character. This means that there is a total of 256 different characters composing this alphabet.

As its name suggests, ByteLevel works at the byte level, so it encodes each byte value as a unique visible character.

In other words, this alphabet consists of 256 distinct characters in total.

code:show_pre_tokenizer_alphabet.py

>>> from tokenizers import pre_tokenizers

>>> len(pre_tokenizers.ByteLevel.alphabet())

256

>>> pre_tokenizers.ByteLevel.alphabet()

['²', 'q', 'ı', 'ĺ', '®', '7', 'h', 'A', 'K', 'À', 'È', '3', 'Z', 'Ě', ')', 'ă', 'Ĵ', 'Ģ', 'ñ', 's', '~', 'Ï', '§', 'Ī', '!', ';', 'Ď', 'Ė', 'ĵ', 'č', 'U', '&', 'ß', '¬', '9', '±', 'ğ', '»', 'Ú', 'ø', '1', '©', 'Ń', 'Ü', 'ó', 'Ô', 'N', 'æ', 'É', 'Û', 'a', 'ç', 'p', 'ü', 'ò', 'ď', 'Õ', '/', 'Ĉ', '4', 'o', 'Ì', '>', '[', 'ï', 'ĩ', ',', 'Ë', 'ļ', 'G', 'F', '0', '¥', 'Ò', '·', 'V', 'ö', 'r', 'Í', '-', '`', '*', '$', 'ù', 'Ã', 'Î', 'Ă', 'ą', 'W', 'X', 'Ĳ', 'T', 'Į', 'Á', 'R', 'ċ', 'ĳ', 'Đ', 'º', 'Ġ', 'Ħ', 'D', 'J', 'Ù', 't', '÷', 'Â', 'þ', 'm', '@', 'à', '#', 'Ä', 'Ć', 'Þ', 'ì', 'Å', 'ľ', 'ô', '%', 'ħ', 'ª', 'Ç', 'e', 'I', '_', 'Ę', '¡', '?', 'İ', 'ĝ', 'ł', '¼', 'Ğ', 'y', 'í', 'ĉ', 'f', 'õ', '´', '¸', 'Y', 'z', 'ú', 'Ł', 'Ā', 'L', 'Ŀ', 'Ļ', 'Ĕ', 'ě', '¦', 'ā', '(', 'M', '×', 'Ķ', 'ė', 'Ý', '°', 'd', 'k', '¹', 'á', '¤', 'Ċ', 'Ö', 'â', 'đ', 'c', '\\', '.', 'b', 'ę', '8', 'Ĩ', 'n', 'Æ', 'x', 'Ľ', '|', '£', 'ë', '¾', 'Q', 'ð', '^', 'ê', 'H', 'î', 'Ê', 'ý', 'ĥ', 'Ą', 'Č', 'Ĝ', '"', 'Ĺ', '5', '¨', '6', 'Ĥ', 'Ĭ', 'ķ', 'S', 'å', 'ī', 'ĭ', '}', 'i', 'u', 'C', '¶', 'ġ', 'v', '+', 'P', '¯', 'Ñ', 'ã', 'j', 'ĕ', '2', 'ĸ', 'B', '³', 'é', 'ē', 'ŀ', 'Ē', 'l', 'E', 'ä', ']', 'Ó', 'ć', 'ģ', 'O', ':', "'", 'è', 'Ø', 'į', 'µ', '¢', 'û', 'Ð', '«', '<', '¿', '=', 'ÿ', '{', '½', 'w', 'g']
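
To see where these 256 characters come from, the following is a minimal sketch of the byte-to-character table used by GPT-2 (its `bytes_to_unicode` helper), written with the standard library only. It assumes ByteLevel follows the same table. `byte_level_encode` is a hypothetical helper added for illustration and is not part of tokenizers. Note that the list printed by `alphabet()` above is not in byte order, whereas this sketch builds the table byte by byte.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to one visible character, GPT-2 style."""
    # Bytes whose own code point is already a visible, non-whitespace character.
    visible = set(range(ord("!"), ord("~") + 1))
    visible |= set(range(ord("¡"), ord("¬") + 1))
    visible |= set(range(ord("®"), ord("ÿ") + 1))

    mapping = {}
    shift = 0
    for b in range(256):
        if b in visible:
            # Visible bytes map to themselves.
            mapping[b] = chr(b)
        else:
            # Control characters, space, etc. are shifted to code points >= 256.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


def byte_level_encode(text):
    """Hypothetical helper: UTF-8 encode, then replace every byte by its character."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


table = bytes_to_unicode()
print(len(set(table.values())))      # 256
print(byte_level_encode(" hello"))   # Ġhello
print(byte_level_encode("é"))        # Ã©
```

The space byte (0x20) becomes 'Ġ', which is why pieces that start with a space show up with a leading 'Ġ', and a multi-byte UTF-8 character such as 'é' turns into one visible character per byte.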