LanceDB - pokutuna

LanceDB

Apparently a "Serverless Vector Database for AI"

Writes to the filesystem

File Format — Lance documentation

code:layout

/path/to/dataset:

data/*.lance -- Data directory

latest.manifest -- The manifest file for the latest version.

_versions/*.manifest -- Manifest file for each dataset version.

_indices/{UUID-*}/index.idx -- Secondary index, each index per directory.

_deletions/*.{arrow,bin} -- Deletion files, which contain ids of rows

that have been deleted.
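To get a feel for the layout above, here is a minimal sketch that counts files under each top-level directory of a dataset. It only uses the directory names from the listing; the dataset path and the placeholder files are made up for illustration and are not real Lance data.

```python
from pathlib import Path
import tempfile

# Top-level directories taken from the layout listing above.
LAYOUT = {
    "data": "data files (*.lance)",
    "_versions": "manifest per dataset version",
    "_indices": "secondary indices, one directory per index",
    "_deletions": "deletion files (ids of deleted rows)",
}


def summarize_dataset(root: Path) -> dict[str, int]:
    """Count the files found under each known top-level directory."""
    counts = {}
    for name in LAYOUT:
        directory = root / name
        if directory.is_dir():
            counts[name] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            counts[name] = 0
    return counts


def build_fake_dataset(root: Path) -> None:
    """Create empty placeholder files that mimic the layout (hypothetical names)."""
    for rel in [
        "data/0.lance",
        "data/1.lance",
        "_versions/1.manifest",
        "_versions/2.manifest",
        "_deletions/0.arrow",
    ]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "example.lance"  # hypothetical dataset path
    build_fake_dataset(dataset)
    summary = summarize_dataset(dataset)
    for name, count in summary.items():
        print(f"{name:<12} {count:>3}  {LAYOUT[name]}")
```

Pointing `summarize_dataset` at a real table directory should give a rough idea of how many data files and versions have piled up, assuming the on-disk layout matches the listing.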

Can it also write to Cloud Storage and the like?

Serverless QA Bot with S3 and Lambda - LanceDB Docs

Wonder if it would work by mounting a bucket via FUSE on Cloud Run

Input is converted to an Apache Arrow table and stored in Lance format

Apache Arrow | Apache Arrow

Working with tables - LanceDB

Compaction to physically remove deleted rows, keeping the fragment count at 100 or below, etc.

Data management - LanceDB

Schema definition

Working with tables - LanceDB

Data Types and Schemas — Apache Arrow v15.0.2

Even with nullable=True, the value must be explicitly set to None when calling add; omitting the field raises an error

The schema can now be defined by passing a Pydantic model or a DataFrame as-is, which makes things easier

A vector column is not required

Consistency ≒ the interval at which the filesystem is checked

lancedb.connect("./.lancedb", read_consistency_interval=timedelta(0))

With 0 seconds, it checks on every read

By default, it does not check for updates from other processes
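A toy model of what the interval means, as I understand it: a reader only refreshes its view of the dataset when the interval has elapsed since its last check. This is not LanceDB's implementation; `VersionedStore` and `Reader` are hypothetical names, and time is passed in explicitly to keep the sketch deterministic.

```python
from datetime import timedelta
from typing import Optional


class VersionedStore:
    """Stand-in for the dataset on disk; another process bumps `version`."""

    def __init__(self) -> None:
        self.version = 1


class Reader:
    """Toy reader that refreshes its view according to a consistency interval."""

    def __init__(self, store: VersionedStore, interval: Optional[timedelta]) -> None:
        self.store = store
        self.interval = interval  # None models "never re-check"
        self.seen_version = store.version
        self.last_check = 0.0

    def read(self, now: float) -> int:
        """Return the version this reader sees at time `now` (in seconds)."""
        if self.interval is not None:
            if now - self.last_check >= self.interval.total_seconds():
                self.seen_version = self.store.version
                self.last_check = now
        return self.seen_version


store = VersionedStore()
never = Reader(store, None)
always = Reader(store, timedelta(0))
every_5s = Reader(store, timedelta(seconds=5))

store.version = 2  # another process writes a new version

results = {
    "never": never.read(now=1.0),
    "always": always.read(now=1.0),
    "every_5s_at_1s": every_5s.read(now=1.0),
    "every_5s_at_6s": every_5s.read(now=6.0),
}
print(results)
```

The reader without an interval keeps seeing the old version, the zero-interval reader sees the write immediately, and the 5-second reader catches up only after its interval has passed.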

Queries

Working with tables - LanceDB

Filtering - LanceDB

db.open_table(name)

table.search().select(["col1", "col2"]) selects columns by name

SQL filter

Filtering - LanceDB

.where(..., prefilter=True) filters rows before the vector search

table.search().where('id = 10').limit(1) with an empty search, filters without running a vector search

DataFusion functions are available: Scalar Functions — Apache DataFusion documentation
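Why filtering before or after the vector search matters, as a rough sketch: a brute-force L2 search over a few hand-written rows stands in for the real vector search, and the function names are hypothetical. Filtering after taking the top k can return fewer than k rows (or none), while filtering first always searches within the matching rows.

```python
import math

# Hypothetical rows: two "en" rows near the origin, two "ja" rows far away.
rows = [
    {"id": 1, "lang": "en", "vector": [0.0, 0.1]},
    {"id": 2, "lang": "en", "vector": [0.1, 0.0]},
    {"id": 3, "lang": "ja", "vector": [0.9, 0.9]},
    {"id": 4, "lang": "ja", "vector": [1.0, 1.0]},
]


def top_k(candidates, query, k):
    """Brute-force nearest neighbours by L2 distance."""
    return sorted(candidates, key=lambda r: math.dist(r["vector"], query))[:k]


def search_postfilter(rows, query, k, predicate):
    """Take the k nearest rows first, then drop the ones that fail the filter."""
    return [r for r in top_k(rows, query, k) if predicate(r)]


def search_prefilter(rows, query, k, predicate):
    """Narrow down with the filter first, then take the k nearest of what is left."""
    return top_k([r for r in rows if predicate(r)], query, k)


def is_ja(row):
    return row["lang"] == "ja"


query = [0.0, 0.0]
post_ids = [r["id"] for r in search_postfilter(rows, query, 2, is_ja)]
pre_ids = [r["id"] for r in search_prefilter(rows, query, 2, is_ja)]
print("postfilter:", post_ids)
print("prefilter: ", pre_ids)
```

Here the two nearest rows are both "en", so the post-filtered search comes back empty while the pre-filtered one returns both "ja" rows.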

Fetching results

Build the query, then call the method for the output format you want

.to_list()

.to_pydantic(model) returns a list of Pydantic model instances

.to_arrow()

.to_pandas() / .to_df()

table.count_rows()

Index

By default there is no index, so it just scans through everything with pyarrow?

Vector Index

Small datasets (1k dimensions, <100k vectors) don't need an index; for just playing around locally you almost never need one

IVF_PQ

table.create_index(distance_type='L2', num_partitions=256, num_sub_vectors=96)

distance_type is one of L2 (Euclidean), cosine, dot
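Some back-of-the-envelope arithmetic for what num_partitions and num_sub_vectors imply. The assumptions are mine, not from the docs: 1536-dimensional float32 vectors, 8-bit PQ codes (one byte per sub-vector), a dimension divisible by num_sub_vectors, and a hypothetical table of one million rows.

```python
def pq_layout(dim: int, num_sub_vectors: int, bits_per_code: int = 8) -> dict:
    """Estimate per-vector storage before and after product quantization."""
    if dim % num_sub_vectors != 0:
        raise ValueError("dim must be divisible by num_sub_vectors in this sketch")
    raw_bytes = dim * 4  # float32 takes 4 bytes per dimension
    code_bytes = num_sub_vectors * bits_per_code // 8
    return {
        "sub_vector_dim": dim // num_sub_vectors,
        "raw_bytes": raw_bytes,
        "code_bytes": code_bytes,
        "compression_ratio": raw_bytes / code_bytes,
    }


def avg_partition_size(num_rows: int, num_partitions: int) -> float:
    """Average number of vectors per IVF partition if rows spread evenly."""
    return num_rows / num_partitions


layout = pq_layout(dim=1536, num_sub_vectors=96)
per_partition = avg_partition_size(num_rows=1_000_000, num_partitions=256)
print(layout)
print(per_partition)
```

Under these assumptions each vector shrinks from 6144 bytes to 96 bytes, and a query that probes one partition scans roughly 3900 compressed vectors instead of the whole table.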

Scalar Index

table.create_scalar_index("publisher", index_type="BITMAP")

index_type is one of BTREE, BITMAP, LABEL_LIST

Use BITMAP when the column has 1000 or fewer distinct values
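A toy bitmap index, to show why low cardinality is the sweet spot: it keeps one bitmap per distinct value, so the number of bitmaps grows with the number of distinct values. This is a generic sketch using Python integers as bitmaps, not how LanceDB stores its index; the column values are hypothetical.

```python
from collections import defaultdict


class BitmapIndex:
    """One integer bitmask per distinct value; bit i is set when row i has that value."""

    def __init__(self, values):
        self.bitmaps = defaultdict(int)
        for row_id, value in enumerate(values):
            self.bitmaps[value] |= 1 << row_id

    def rows_equal(self, value):
        """Row ids where column == value."""
        return self._to_row_ids(self.bitmaps.get(value, 0))

    def rows_in(self, values):
        """Row ids where column is any of `values` (bitwise OR of the bitmaps)."""
        mask = 0
        for value in values:
            mask |= self.bitmaps.get(value, 0)
        return self._to_row_ids(mask)

    @staticmethod
    def _to_row_ids(mask):
        return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


publishers = ["A", "B", "A", "C", "B", "A"]  # hypothetical column values
index = BitmapIndex(publishers)
print(index.rows_equal("A"))
print(index.rows_in(["B", "C"]))
print(len(index.bitmaps), "bitmaps for", len(publishers), "rows")
```

Equality and IN filters turn into a lookup plus bitwise OR, which stays cheap as long as there are only a handful of bitmaps to keep around.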

FTS Index

table.create_fts_index("text")

Used when calling search(str)

Full-text search

Full-text search - LanceDB = FTS

(The other kind, vector search, is Approximate Nearest Neighbor = ANN)

Uses tantivy internally

After calling table.create_fts_index("text"),

table.search('word') is accepted and runs a full-text search, returning results scored with BM25
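For reference, a minimal BM25 sketch in plain Python. It uses the common Okapi form with k1=1.2, b=0.75 and whitespace tokenization; I have not checked tantivy's exact formula or parameters, so treat the numbers as illustrative only. The documents are made up.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalised through `b`.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "lance is a columnar data format",
    "lancedb is a vector database built on lance",
    "tantivy is a full text search engine",
]
scores = bm25_scores(docs, "lance")
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(scores)
print(ranking)
```

Both of the first two documents contain "lance" once, but the shorter one ranks higher because of the length normalisation; the third does not contain the term and scores zero.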

...but it does not work for Japanese

Since it uses Tantivy, I thought Japanese might work by specifying Lindera, but

tantivy (PyPI) is https://github.com/quickwit-oss/tantivy-py

There has been discussion about letting it accept custom tokenizers, and work on it is in progress

Supporting tokenizer register · Issue 25 · quickwit-oss/tantivy-py

Optional Lindera tokenizer support (was: Custom tokenizer support) by cjrh · Pull Request 200 · quickwit-oss/tantivy-py: waiting on this

Well, LIKE is available, so a crude search is still possible
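A small sketch of why token-based search misses Japanese while a LIKE-style substring match still works. The tokenizer below splits at non-alphanumeric characters and lowercases, which is my assumption about how a simple default tokenizer behaves; it is not tantivy's actual code, and the function names are hypothetical.

```python
from itertools import groupby


def simple_tokenize(text):
    """Split at non-alphanumeric characters and lowercase each token."""
    return [
        "".join(chars).lower()
        for is_word, chars in groupby(text, key=str.isalnum)
        if is_word
    ]


def term_match(query, text):
    """Token-based match: the query must equal one whole token."""
    return query.lower() in simple_tokenize(text)


def like_match(query, text):
    """Substring match, roughly what `text LIKE '%query%'` does."""
    return query in text


english = "LanceDB is a vector database"
japanese = "東京都に住んでいます。"

english_tokens = simple_tokenize(english)
japanese_tokens = simple_tokenize(japanese)
print(english_tokens)
print(japanese_tokens)
print(term_match("vector", english), term_match("東京", japanese), like_match("東京", japanese))
```

Japanese has no spaces between words, so the whole sentence ends up as a single token and a query for 東京 never equals it; the substring match finds it, at the cost of scanning the text without any relevance scoring.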