embedding

from text-embedding-ada-002

Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts.

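A representation of some concept as a sequence of numbers

This lets a computer work with the relationships (similarity) between concepts

Presumably the model is built so that similar things map to similar number sequences and dissimilar things map to dissimilar ones 基素.icon

How is a model like this built?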

https://platform.openai.com/docs/guides/embeddings

An embedding is a vector (list) of floating point numbers

The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
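To make "distance" concrete, here is a minimal sketch using cosine distance, which is one common choice for comparing embeddings. The quoted text does not say which metric is meant, so treat the metric as an assumption; the three vectors are made-up 3-dimensional toy values, not real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Small distance suggests high relatedness, large distance suggests low."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical toy vectors, for illustration only
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print(cosine_distance(cat, kitten))  # small: related concepts
print(cosine_distance(cat, car))     # large: unrelated concepts
```

Real embeddings have far more dimensions, but the calculation is the same.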

Use cases

Search (where results are ranked by relevance to a query string)

Search (retrieval)

Clustering (where text strings are grouped by similarity)

Recommendations (where items with related text strings are recommended)

Information recommendation (recommender systems)

Anomaly detection (where outliers with little relatedness are identified)

Diversity measurement (where similarity distributions are analyzed)

Classification (where text strings are classified by their most similar label)
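Two of the use cases above, search and classification, can be sketched with nothing more than a similarity function. This is a rough illustration under assumptions: the function names `rank_by_relevance` and `classify` are hypothetical, cosine similarity is assumed as the measure, and all vectors are toy values rather than real embeddings.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_relevance(query_vec, doc_vecs):
    """Search: document names sorted from most to least similar to the query."""
    return sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )

def classify(text_vec, label_vecs):
    """Classification: the label whose embedding is most similar to the text."""
    return max(
        label_vecs,
        key=lambda label: cosine_similarity(text_vec, label_vecs[label]),
    )

# Hypothetical toy vectors standing in for embeddings of documents and labels
docs = {
    "doc_pets": [0.9, 0.1, 0.0],
    "doc_cars": [0.0, 0.2, 0.9],
    "doc_mixed": [0.5, 0.5, 0.5],
}
labels = {"animal": [1.0, 0.0, 0.0], "vehicle": [0.0, 0.0, 1.0]}
query = [0.8, 0.2, 0.1]

print(rank_by_relevance(query, docs))
print(classify(query, labels))
```

Clustering, recommendations, and anomaly detection follow the same pattern: embed the texts once, then compare vectors instead of strings.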

Example using text-embedding-ada-002

Passing text to this model returns a representation like the following

code:json
 "embedding": [
   -0.006929283495992422,
   -0.005336422007530928,
   ...
   -4.547132266452536e-05,
   -0.024047505110502243
 ]
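As a sketch of how such a response might be read in code: the snippet below parses an abbreviated response body with the standard `json` module. The surrounding envelope (`object`, `data`, `index`, `model`) is an assumption based on the shape of the OpenAI embeddings API response, and the vector holds only the four values visible above; the elided values are simply left out, so a real vector is much longer.

```python
import json

# Abbreviated, illustrative response body. Only the four values shown in the
# note are included; the envelope fields are an assumed response shape.
response_text = """
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        -4.547132266452536e-05,
        -0.024047505110502243
      ]
    }
  ],
  "model": "text-embedding-ada-002"
}
"""

response = json.loads(response_text)

# One entry in "data" per input text; take the vector for the first input
vector = response["data"][0]["embedding"]

print(len(vector))                  # number of dimensions in this sample
print(type(vector[0]).__name__)     # each element is a float
```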

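The size varies by model, but the result is always a high-dimensional vector (text-embedding-ada-002 returns 1536 dimensions)

t-SNE can be used to visualize this kind of data on a two-dimensional plane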
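A minimal sketch of the t-SNE step, assuming scikit-learn and NumPy are installed. The input here is synthetic random data (two clusters in 50 dimensions) standing in for real embeddings, and the `perplexity` value is an arbitrary choice for this small sample, not a recommendation.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for real embeddings: two clusters of 20 points each
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 50))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(20, 50))
embeddings = np.vstack([cluster_a, cluster_b])

# perplexity must be smaller than the number of samples (40 here)
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
points_2d = tsne.fit_transform(embeddings)

print(points_2d.shape)  # (40, 2): one 2D point per input vector
```

The resulting `points_2d` array can then be passed to a plotting library as x/y coordinates. Note that t-SNE preserves local neighborhoods rather than global distances, so the spacing between clusters in the plot should not be over-interpreted.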