pVectorSearch2023-06-07
Next action: pull the data directly from the public project and index it, without going through an export.
First from /halsk
127 sec
Two left.
335 sec
done
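For reference, a minimal sketch of pulling a public project without going through export, assuming the public Scrapbox REST API (the endpoint paths and helper names here are my assumptions, not something this page specifies):
code:fetch_project.py
 import requests
 from urllib.parse import quote
 
 def list_pages(project, limit=1000, skip=0):
     # /api/pages/:project returns page metadata; paginate with limit/skip
     url = f"https://scrapbox.io/api/pages/{project}"
     return requests.get(url, params={"limit": limit, "skip": skip}).json()["pages"]
 
 def get_page_text(project, title):
     # /api/pages/:project/:title/text returns the raw page text
     url = f"https://scrapbox.io/api/pages/{project}/{quote(title, safe='')}/text"
     return requests.get(url).text
 
 # e.g. pages = list_pages("halsk")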
I guess I need to switch to aiohttp.ClientSession for the endpoint I'm currently hitting with requests...
Actually, "currently hitting it with requests" is false.
I'm using the official openai library.
The inputs can be passed as a list for batch processing.
3,500 RPM / 350,000 TPM
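A minimal sketch of what "passed as a list" looks like with the official library, assuming the pre-1.0 openai interface and the text-embedding-ada-002 model (the model name is my assumption):
code:embed_batch.py
 import openai
 
 # One request embeds a whole batch: the endpoint accepts a list of strings.
 texts = ["first chunk", "second chunk", "third chunk"]
 resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
 vectors = [d["embedding"] for d in resp["data"]]  # one 1536-dim vector per input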
The parallelization ended up at a different layer than originally planned.
I had assumed it was a plain function that calls the embedding API.
It is actually a class representing an index, with a method that runs the embedding in batches.
It's done, so I'll run it and go to lunch.
This model's maximum context length is 8191 tokens, however you requested 9158 tokens (9158 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
oops
Drop into pdb and take a look:
code:pdb
 (Pdb) print(list(map(get_size, texts)))
 [85, 122, 88, 118, 5, 106, 100, 100, 9158, 9152, 9164, 121, 113, 81, 456, 133, 45, 6, 6, 7, 514, 351, 278, 441, 322, 27, 210, 515, 517, 252, 23, 28, 363, 92, 486, 318, 350, 326, 276, 340, 418, 355, 309, 292, 366, 334, 273, 387, 349, 341]
Ah, I see, each of those huge ones is a single line.
If a chunk is huge, take only the first part of it and embed that.
I originally did that, but the truncation got dropped when I added the parallelization, so the text was being thrown at the API at full length.
I don't know how long this trouble will take.
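A minimal sketch of the truncation fix described above, assuming tiktoken and the cl100k_base tokenizer (the function name is hypothetical):
code:truncate.py
 import tiktoken
 
 enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by text-embedding-ada-002
 MAX_TOKENS = 8191  # the model's maximum context length
 
 def truncate_chunk(text, limit=MAX_TOKENS):
     # Keep only the head of a chunk that would exceed the context length.
     tokens = enc.encode(text)
     return text if len(tokens) <= limit else enc.decode(tokens[:limit])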
238/238 [11:46<00:00, 2.97s/it]
This is /tkgshn's data, run after the fix. The 5748-page Scrapbox is split into a little over 10,000 chunks and processed in 238 batches of 50 chunks each.
It is about 3 seconds per batch, and the total time is like 12 minutes.
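The batching itself is nothing special; roughly this shape, where embed() stands in for the actual embedding call (names are hypothetical):
code:batches.py
 from tqdm import tqdm
 
 BATCH_SIZE = 50  # ~10,000 chunks -> 238 batches, ~3 s each
 
 def batched(seq, size=BATCH_SIZE):
     # Yield consecutive slices of at most `size` items.
     for i in range(0, len(seq), size):
         yield seq[i:i + size]
 
 # for batch in tqdm(list(batched(chunks))):
 #     vectors.extend(embed(batch))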
Put in Qdrant
ResponseHandlingException: The write operation timed out
Without wait=True I could push /nishio in fine, but when I tried to load three people's projects at once, about 1.5 of them died.
WAL overflowed?
PS: It was conflicting with index building.
With wait=True added, it looks like this:
117/117 [02:57<00:00, 1.51s/it]
44/44 [00:54<00:00, 1.23s/it]
Well, it all goes in within a few minutes, so nothing to worry about.
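For reference, a minimal sketch of the upsert with wait=True, assuming qdrant-client (the collection name, vector size, and payload shape are assumptions):
code:upsert.py
 from qdrant_client import QdrantClient
 from qdrant_client.http import models
 
 client = QdrantClient(url="http://localhost:6333")
 
 def upsert_batch(ids, vectors, payloads):
     # wait=True blocks until the write is actually applied,
     # instead of returning while the server is still busy.
     client.upsert(
         collection_name="scrapbox",
         points=models.Batch(ids=ids, vectors=vectors, payloads=payloads),
         wait=True,
     )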
Now you can cross search.
What's that? Why do I get hits only on yuiseki?
Is that possible?
Oh, I was calling client.recreate_collection, which wipes the collection each time.
redoing
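One way to avoid wiping the earlier data is to create the collection only when it does not exist yet, instead of calling recreate_collection on every run; a sketch with assumed names and vector size:
code:ensure_collection.py
 from qdrant_client import QdrantClient
 from qdrant_client.http import models
 
 client = QdrantClient(url="http://localhost:6333")
 
 # Create the collection only if it is missing, so existing points survive reruns.
 existing = [c.name for c in client.get_collections().collections]
 if "scrapbox" not in existing:
     client.create_collection(
         collection_name="scrapbox",
         vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
     )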
I get ResponseHandlingException: The write operation timed out even if wait=True.
It happens even if you sleep(1).
Ah, I see, so you stop the index generation (during the upload).
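If "stop the index generation" means pausing Qdrant's background indexing during the bulk upload, the usual way (per the Qdrant docs; the exact keyword argument may differ by client version) is to set the indexing threshold to 0 and restore it afterwards:
code:pause_indexing.py
 from qdrant_client import QdrantClient
 from qdrant_client.http import models
 
 client = QdrantClient(url="http://localhost:6333")
 
 # Disable background index building while uploading...
 client.update_collection(
     collection_name="scrapbox",
     optimizer_config=models.OptimizersConfigDiff(indexing_threshold=0),
 )
 # ...upload the points, then re-enable indexing with the default threshold.
 client.update_collection(
     collection_name="scrapbox",
     optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000),
 )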
23/23 [00:34<00:00, 1.48s/it]
117/117 [02:59<00:00, 1.53s/it]
44/44 [01:15<00:00, 1.72s/it]
It's done.
I can do a cross search.
https://gyazo.com/e23a56be5c403dfdf5d2452e0ec051bf
Plenty of room.
What we were able to verify this time
If the other person's Scrapbox project is public, no special work is required on their side.
The time and monetary costs are not significant.
@yuiseki_: If the idea is to combine multiple people's personal Scrapboxes and make them vector-searchable to test how useful that is for cooperation, consensus building, and so on, then the priority of what and how much to write in my own Scrapbox shoots way up!
@nishio: Before long, anyone will be able to throw an agenda at this.
I looked at it on my iPhone and it looked terrible.
https://gyazo.com/fb090d7d77fef80327e2a58e395d05fb
@yuiseki_: I've added about 100 pages of important information for now!
I need to implement an update function...
---
This page is auto-translated from /nishio/pVectorSearch2023-06-07 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.