How to count tokens with tiktoken

You can get an encoding from a model name

Encodings specify how text is converted into tokens. Different models use different encodings.

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

The encoding.encode method converts text into token IDs

code:python

>>> encoding.encode("tiktoken is great!") # 6 tokens

>>> encoding.encode("お誕生日おめでとう") # 9 tokens (33334 is "お")

The length of the list returned by encode tells you how many tokens the text takes, without sending anything to the ChatGPT API
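
The steps so far (look up the encoding for a model, encode the text, take the length of the result) can be wrapped into a small helper. This is only a sketch: the names `count_tokens` and `count_tokens_for_model` are made up for this note, and the second function assumes tiktoken is installed. tiktoken is imported inside that function so the generic helper works with any object that has an `encode` method.

```python
def count_tokens(text, encoding):
    """Return how many tokens `text` becomes under `encoding`.

    `encoding` is anything with an encode(str) -> list method,
    e.g. the object returned by tiktoken.encoding_for_model.
    """
    return len(encoding.encode(text))


def count_tokens_for_model(text, model_name="gpt-3.5-turbo"):
    """Count tokens locally, without calling the API (hypothetical helper)."""
    import tiktoken  # assumption: tiktoken is installed

    encoding = tiktoken.encoding_for_model(model_name)
    return count_tokens(text, encoding)
```

With the encoding used in this note, `count_tokens_for_model("tiktoken is great!")` should give 6, the same as the length of the list returned by encode.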

The encoding.decode method converts token IDs back into text

Warning: although .decode() can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries.

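In other words, decode works on a single token, but a token whose bytes stop in the middle of a UTF-8 character cannot be turned back into the original text on its own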

encoding.decode_single_token_bytes converts a token ID back into bytes

code:python

>>> "お誕生日おめでとう".encode() # idea: compare against the bytes of the whole string

b'\xe3\x81\x8a\xe8\xaa\x95\xe7\x94\x9f\xe6\x97\xa5\xe3\x81\x8a\xe3\x82\x81\xe3\x81\xa7\xe3\x81\xa8\xe3\x81\x86'

>>> b'\xe3\x81\x8a'.decode()

'お'

>>> b'\xe8\xaa'.decode() # "誕" is split across 2 tokens!

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

>>> b'\xe8\xaa\x95'.decode()

'誕'

>>> b'\xe3\x81\xa8\xe3\x81\x86'.decode() # "とう" is 1 token

'とう'
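
The session above hints at why decoding token by token can be lossy. The following sketch reproduces the effect with the standard library only, so it runs without tiktoken. The byte chunks are taken from the session; that the leftover byte `b'\x95'` forms the second token of "誕" is an assumption, and the function names are made up for this note.

```python
# Byte chunks as seen in the session above: "お" is one token, while
# "誕" (b'\xe8\xaa\x95') is split after its second byte.
# Assumption: the leftover byte b'\x95' is the second of the two tokens.
token_bytes = [b"\xe3\x81\x8a", b"\xe8\xaa", b"\x95"]


def decode_each(chunks):
    """Decode every chunk on its own; undecodable chunks become U+FFFD."""
    return [chunk.decode("utf-8", errors="replace") for chunk in chunks]


def decode_joined(chunks):
    """Join the bytes first, then decode once."""
    return b"".join(chunks).decode("utf-8")


each = decode_each(token_bytes)      # "お" survives, the two halves of "誕" do not
joined = decode_joined(token_bytes)  # 'お誕'
```

Decoding each chunk separately keeps "お" but replaces both halves of "誕" with the replacement character, while joining the bytes before decoding recovers the text. This is presumably the kind of loss the warning above is about.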

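TODO from here on

An example that compares encodings

An example confirming that the count matches the number of tokens reported in the API response

Sequences of token IDs (encode)

👉Getting Dense Word Embeddings (PyTorch tutorial "Word Embeddings: Encoding Lexical Semantics")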
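
As a starting point for the first TODO item (comparing encodings), a minimal sketch might look like the following. The helper names `compare_encodings` and `load_encodings` are hypothetical. `load_encodings` assumes tiktoken is installed and uses `tiktoken.get_encoding` with the encoding names "gpt2", "p50k_base" and "cl100k_base".

```python
def compare_encodings(text, encodings):
    """Return {encoding name: token count} for the same text.

    `encodings` maps a name to any object with encode(str) -> list.
    """
    return {name: len(enc.encode(text)) for name, enc in encodings.items()}


def load_encodings(names=("gpt2", "p50k_base", "cl100k_base")):
    """Load tiktoken encodings by name (assumes tiktoken is installed)."""
    import tiktoken

    return {name: tiktoken.get_encoding(name) for name in names}
```

Calling `compare_encodings("お誕生日おめでとう", load_encodings())` would give one token count per encoding, which makes it easy to see how much the same text costs under each of them.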