Trying out the pgvector module for PostgreSQL

https://gyazo.com/4c348b9d995536f455aea1de0274bb01

I heard that a module that adds vector search to PostgreSQL has been released, so I gave it a try.

https://github.com/pgvector/pgvector

Setup

A Docker image is also published, so I'll use that.

The module itself only needs to be enabled as an extension.

code:docker-compose.yml
 version: '3.7'
 services:
   postgres:
     image: ankane/pgvector
     container_name: sample-pgvector
     ports:
       - '5432:5432'
     volumes:
       - sample-pg-vector-store:/var/lib/postgresql/data
     environment:
       - POSTGRES_PASSWORD=root
 volumes:
   sample-pg-vector-store:

After starting the container, log in to the database running in it and enable the extension.

code:sql
 $ psql -h localhost -U postgres
 CREATE EXTENSION vector;

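I'll also create a sample database and table. Note that PostgreSQL extensions are enabled per database, so CREATE EXTENSION vector; has to be run inside sample_pgvector as well before the vector type can be used there.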

code:sql
 CREATE DATABASE sample_pgvector;
 # ...
 \c sample_pgvector;
 CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
 \dt;
 # ...
 \d items;
   Column   |   Type    | Collation | Nullable |              Default
 -----------+-----------+-----------+----------+-----------------------------------
  id        | bigint    |           | not null | nextval('items_id_seq'::regclass)
  embedding | vector(3) |           |          |
 Indexes:
     "items_pkey" PRIMARY KEY, btree (id)

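A column named embedding has been created with the vector type.

The length is 3, so I first assumed this meant vectors of up to 3 dimensions. As it turns out below, it means exactly 3.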

Queries

Let's actually insert some data.

code:sql
 INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
 SELECT * FROM items;
  id | embedding
 ----+-----------
   1 | [1,2,3]
   2 | [4,5,6]
 (2 rows)

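When I tried to insert 2- or 4-dimensional data in the same way, I got a type error, so the number of dimensions apparently has to match the column definition.

According to the documentation, vectors with up to 2,000 dimensions can be indexed.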

https://github.com/pgvector/pgvector#indexing

Following the sample in the documentation, let's run a query.

code:sql
 SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
  id | embedding
 ----+-----------
   1 | [1,2,3]
   2 | [4,5,6]

This alone didn't tell me much, so I also selected the distance itself.

code:sql
 SELECT *, embedding <-> '[3,1,2]' AS l2 FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
  id | embedding |        l2
 ----+-----------+-------------------
   1 | [1,2,3]   | 2.449489742783178
   2 | [4,5,6]   | 5.744562646538029

The <-> operator computes the L2 distance (Euclidean distance).

Using <-> in the ORDER BY clause sorts the rows from nearest to farthest by L2 distance.

Incidentally, <#> computes the negative inner product and <=> computes the cosine distance.
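
To check my understanding of the three operators, the same values can be reproduced by hand. The following is a minimal sketch in plain Python, not pgvector's implementation. The function names are my own, and the definitions assume the usual formulas: Euclidean distance, the inner product with its sign flipped, and 1 minus cosine similarity.

```python
import math


def l2_distance(a, b):
    """Euclidean distance, which is what <-> returned above."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a, b):
    """Inner product with the sign flipped, as described for <#>."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity, as described for <=>."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


query = [3, 1, 2]
rows = {1: [1, 2, 3], 2: [4, 5, 6]}  # same data as the items table

for row_id, embedding in rows.items():
    print(
        row_id,
        l2_distance(embedding, query),
        negative_inner_product(embedding, query),
        cosine_distance(embedding, query),
    )
```

The L2 values for ids 1 and 2 are the square roots of 6 and 33, which agree with the l2 column in the query result above.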

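Creating a test table

As an experiment, I'll see whether I can get decent search results over the articles on this Scrapbox blog of mine.

I'm adding an embedding column to a typical blog table.

As explained later, the embedding length is set to 1536.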

code:sql
 CREATE TABLE posts (
   id TEXT PRIMARY KEY,
   title TEXT,
   content TEXT,
   embedding vector(1536),
   created TIMESTAMP,
   updated TIMESTAMP
 );

Using OpenAI's Embedding API

To vectorize the article bodies, the API provided by OpenAI looks like the easiest option.

https://platform.openai.com/docs/api-reference/embeddings/create

Here is an example of calling it from Python.

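code:python
 import openai
 openai.api_key = "..."
 # Uses the pre-1.0 interface of the openai package (openai.Embedding).
 response = openai.Embedding.create(
     input="content",
     model="text-embedding-ada-002"
 )
 # The vector itself is in the first element of "data".
 print(response["data"][0]["embedding"])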

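With this model, the vector comes back with 1536 dimensions.

That is why the column type in the table above is vector(1536).

When inserting this blog's data, I vectorized each post and stored the embedding along with it.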
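
The insert script itself is not shown here, but the step can be sketched roughly as follows. This is only an illustration under assumptions: `to_vector_literal` and `build_insert` are hypothetical helper names, the column list follows the posts table above, and the embedding is passed in pgvector's text form (`[x1,x2,...]`) and cast with `::vector`. The dimension check mirrors the error seen earlier when the number of dimensions did not match the column.

```python
from datetime import datetime

EMBEDDING_DIM = 1536  # matches vector(1536) in the posts table


def to_vector_literal(values, dim=EMBEDDING_DIM):
    """Format a list of numbers as a pgvector text literal such as '[1.0,2.0,3.0]'."""
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def build_insert(post_id, title, content, embedding, now=None):
    """Return (sql, params) in the form expected by cursor.execute() in psycopg2."""
    now = now or datetime.now()
    sql = (
        "INSERT INTO posts (id, title, content, embedding, created, updated) "
        "VALUES (%s, %s, %s, %s::vector, %s, %s)"
    )
    params = (post_id, title, content, to_vector_literal(embedding), now, now)
    return sql, params


# Dummy embedding for illustration; in practice this would be
# response["data"][0]["embedding"] from the Embedding API.
sql, params = build_insert("post-1", "sample title", "sample body", [0.0] * EMBEDDING_DIM)
print(sql)
print(params[3][:20])
```

With psycopg2, the pair can then be passed as cur.execute(sql, params), followed by conn.commit().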

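Searching from a query

Putting the steps so far together, I'll write a script that takes a query, vectorizes it, and searches for similar posts.

I use psycopg2 as the PostgreSQL client, together with an additional module for working with pgvector.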

https://github.com/pgvector/pgvector-python

code:python
 import sys
 import numpy as np
 import openai
 import psycopg2
 from pgvector.psycopg2 import register_vector
 openai.api_key = "..."
 def main(query):
     # Vectorize the query with the same model used for the posts.
     query_embedding = embedding_content(query)
     conn = connection()
     # Let psycopg2 pass numpy arrays as the vector type.
     register_vector(conn)
     cur = conn.cursor()
     # Fetch the 5 posts nearest to the query by L2 distance.
     select_query = "SELECT id, title FROM posts ORDER BY embedding <-> %s LIMIT 5;"
     e = np.array(query_embedding["data"][0]["embedding"])
     values = (e,)
     cur.execute(select_query, values)
     results = cur.fetchall()
     for row in results:
         print(row)
     cur.close()
     conn.close()
 def connection():
     # Connection settings match docker-compose.yml above.
     database = "sample_pgvector"
     user = "postgres"
     password = "root"
     host = "localhost"
     port = "5432"
     return psycopg2.connect(database=database, user=user, password=password, host=host, port=port)
 def embedding_content(content):
     print(content)
     # Pre-1.0 interface of the openai package.
     return openai.Embedding.create(
         input=content,
         model="text-embedding-ada-002"
     )
 if __name__ == '__main__':
     # The first command-line argument is the search query.
     argument = sys.argv[1]
     print(f"引数: {argument}")
     main(argument)

For example, run it as follows. The second query, 機械学習, means "machine learning".

code:sh
 $ python query.py AWS
 ('API Gateway の API キー認証を試し、リクエスト数の制御、閲覧を試す')
 ('AWS CodeBuild を使ってみた')
 ('AWS SAA に合格するために勉強したこと')
 ('Serverless Next.js Plugin を使ってデプロイしてみる')
 ('terraform で AWS のリソースを管理するようにした')
 $ python query.py 機械学習
 ('G検定学習メモ')
 ('Coursera "Machine Learning" を受講した')
 ('教師あり学習単語メモ')
 ('G検定を受験したので学習方法などをまとめてみる')
 ('勾配降下法の微分式が分からなかった')

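The results look plausible: for 機械学習, the hits are posts about the G検定 (a Japanese deep learning certification exam), a Coursera machine learning course, supervised learning (教師あり学習), and gradient descent (勾配降下法).

A more question-like query, AWSでAPIを使いたい ("I want to use an API on AWS"), also worked. (I haven't written that many AWS articles, so it's hard to compare, though.)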

code:sh
 $ python query.py AWSでAPIを使いたい
 ('API Gateway の API キー認証を試し、リクエスト数の制御、閲覧を試す')
 ('AWS CodeBuild を使ってみた')
 ('AWS Secrets Manager を使ってみた')
 ('AWS ECS を使ってみた')
 ('AWS SES を使ってメールを送る')

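There have long been several options for full-text search, but I have a feeling that vector search will become mainstream from here on.

Once vector search is available in an RDB, it can be adopted as an extension of existing technology, so there should be a lot to experiment with.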

References

https://dev.classmethod.jp/articles/search-with-openai-embeddings/

https://dev.classmethod.jp/articles/amazon-rds-postgresql-pgvector-embedding/