Playing around with feeding Scrapbox content into GPTIndex

Playing around with feeding Scrapbox content into GPTIndex blu3mo.icon

@robbalian: Hey GPT: When did I peak?

I build a model that queries thousands of pages of my emails and personal notes. You can use it too at https://t.co/tSHSzWoM6q

Here's what I learned... 🧵

This Colab notebook looks usable blu3mo.icon

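code: py
 import json
 import os
 # open the Scrapbox export (JSON) file
 with open('scrapbox_export.json', encoding='utf-8') as json_file:
     data = json.load(json_file)
 # make sure the output directory exists
 os.makedirs('data', exist_ok=True)
 # iterate through the pages
 for page in data['pages']:
     title = page['title']
     lines = page['lines']
     # join the lines using newline character
     content = "\n".join(lines)
     # "/" cannot appear in a file name, so replace it
     title = title.replace("/", "-")
     # write the content of each page to data/<title>.txt
     with open("data/" + title + ".txt", "w", encoding='utf-8') as f:
         f.write(content)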

For now, this converts the JSON export into one txt file per page

For some reason it throws an error with /blu3mo; still unresolved
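The cause of the /blu3mo error is not recorded here, so the following is only a more defensive sketch of the same conversion, not a confirmed fix. It rests on two assumptions: that some exports store each line as an object with a `text` field instead of a plain string (which would break `"\n".join(lines)`), and that some titles contain characters other than `/` that are invalid in file names. The helper names `line_text`, `safe_filename` and `export_pages` are made up for this example.

```python
import json
import os
import re


def line_text(line):
    """Return the text of a line that is either a plain string or a dict with a "text" key."""
    # Assumption: dict-style lines carry their content under "text".
    return line if isinstance(line, str) else line.get("text", "")


def safe_filename(title):
    """Replace characters that commonly break file paths with "-"."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "-", title).strip()
    return cleaned or "untitled"


def export_pages(data, out_dir="data"):
    """Write each page to out_dir/<title>.txt and return (title, error) pairs for failures."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for page in data.get("pages", []):
        try:
            content = "\n".join(line_text(line) for line in page["lines"])
            path = os.path.join(out_dir, safe_filename(page["title"]) + ".txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(content)
        except (KeyError, TypeError, AttributeError, OSError) as e:
            # Keep going and report the page instead of aborting the whole export.
            failed.append((page.get("title"), repr(e)))
    return failed


if __name__ == "__main__" and os.path.exists("scrapbox_export.json"):
    with open("scrapbox_export.json", encoding="utf-8") as json_file:
        for title, error in export_pages(json.load(json_file)):
            print("failed:", title, error)
```

Collecting the failures instead of stopping at the first one at least shows which pages trigger the error, which should narrow down the actual cause.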

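With those files in place, Semantic Search and Q&A become possible

Or they should, but the step where Semantic Search pulls in the useful files does not seem to be working well

To a human eye, it mostly pulls in files that are not very relevant

Is the problem that the source is in Japanese..?

I won't know until I read up on how the Embedding and Indexing work blu3mo.icon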
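To get a feel for what the retrieval step is doing, here is a minimal sketch of similarity-based ranking: turn the query and each file into a vector, then return the files whose vectors are closest to the query. This is not GPT Index's implementation. As a stand-in for a real embedding model it uses character bigram counts, chosen only because they need no word segmentation and therefore behave sensibly on Japanese. The names `char_ngrams`, `cosine` and `rank` are made up for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping character n-grams after removing whitespace."""
    text = "".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(query, docs, top_k=3):
    """Return (score, name) pairs for the top_k documents most similar to the query."""
    q = char_ngrams(query)
    scored = [(cosine(q, char_ngrams(text)), name) for name, text in docs.items()]
    return sorted(scored, reverse=True)[:top_k]


if __name__ == "__main__":
    docs = {
        "cats": "猫が好きです",
        "weather": "今日は雨が降っています",
    }
    print(rank("猫が好き", docs, top_k=1))
```

A crude lexical baseline like this can serve as a sanity check: if it surfaces more relevant pages than the embedding search does for the same query, that points toward the embedding or splitting stage as the weak spot.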

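@mutaguchi: Also, as a Japanese-specific problem: the embeddings used at indexing time are probably weak for Japanese, so the text would likely need to be translated into English before being fed in. The synthesis prompt is also English by default, which should affect the LLM's answers, so the prompts would need customizing too.

@mutaguchi: All things considered, using GPT Index as-is looks fairly hard.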

+1 blu3mo.icon
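On the prompt customization point in the quoted tweet, a minimal sketch of what a custom answer prompt could look like. The template wording and the function `build_qa_prompt` are hypothetical, not GPT Index's default prompt or API. The idea is only that the retrieved chunks and the question are pasted into a template that states the answer language explicitly.

```python
def build_qa_prompt(context_chunks, question, answer_language="Japanese"):
    """Assemble a question-answering prompt from retrieved chunks (hypothetical template)."""
    context = "\n---\n".join(context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        f"Using only the context above, answer the question in {answer_language}.\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_qa_prompt(["猫が好きです", "今日は雨が降っています"], "何が好きですか?"))
```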

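To begin with, even the step that splits the text assumes English, so a lot of customization seems necessary

It might actually be easier to write the whole thing from scratch.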
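As a sketch of what a from-scratch splitting step might look like for Japanese: Japanese has no spaces between words, so instead of splitting on whitespace this version splits after sentence-ending punctuation (。！？ and their ASCII counterparts) and after newlines, then packs whole sentences into chunks under a character limit. The limit of 200 characters and the names `split_sentences` and `chunk_text` are arbitrary choices for illustration.

```python
import re

# Zero-width split point right after sentence-ending punctuation or a newline.
_SENTENCE_END = re.compile(r"(?<=[。！？!?\n])")


def split_sentences(text):
    """Split after sentence-ending punctuation or newlines, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def chunk_text(text, max_chars=200):
    """Pack whole sentences into newline-joined chunks of at most max_chars characters."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        # A single sentence longer than the limit is cut into fixed-size pieces.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current}\n{sentence}" if current else sentence
        if len(candidate) > max_chars:
            # The sentence does not fit; close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("今日は晴れ。明日は雨！どうする？", max_chars=8):
        print(repr(chunk))
```

Scrapbox lines are often short fragments without punctuation, so treating each newline as a boundary keeps them as separate units rather than gluing unrelated lines together.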