Efficient Vector Search Algorithms.
#chatgpt
From a set of N multidimensional vectors, I want to retrieve the vector closest in distance to a given query vector. The naive method takes N distance comparisons, but I want to search with as few computations as possible by building an index or database. I would also like to add new data to the database with as little computation time as possible. Do you know of any such algorithms?
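For reference, the naive linear scan described above can be sketched in a few lines of numpy (the dataset and dimensions here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 1000, 64
data = rng.normal(size=(N, dim))  # the stored set of N vectors

def nearest_naive(q):
    # one distance computation per stored vector: O(N * dim) per query
    dists = np.linalg.norm(data - q, axis=1)
    return int(np.argmin(dists))

print(nearest_naive(data[42]))
```

Every indexing method below is an attempt to beat this O(N) per-query cost.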
There are several algorithms and data structures that can be used to efficiently find the closest vector in a large set of vectors.
One popular method is the k-d tree, a binary search tree specialized for multidimensional data. k-d trees recursively split the data into subspaces with axis-aligned hyperplanes, allowing search and insertion in O(log N) time on average for low-dimensional data.
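As a minimal sketch, scipy ships a k-d tree implementation (the data here is an illustrative assumption):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.random((1000, 3))   # N points in 3 dimensions

tree = cKDTree(data)           # build the index once

query = np.array([0.5, 0.5, 0.5])
dist, idx = tree.query(query)  # nearest neighbor, O(log N) on average
print(idx, dist)
```

Note that `cKDTree` is a static index: it does not support incremental insertion, so to add data cheaply you would either rebuild periodically or use a library with a mutable index.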
Another popular algorithm is the approximate nearest neighbor (ANN) search. This approach uses a variety of techniques to quickly find approximate nearest neighbors, such as hashing and clustering, while sacrificing some accuracy for speed. Some popular libraries for ANN search are:
Annoy
Faiss
NMSLIB
Both k-d tree and ANN algorithms can be further optimized with parallel processing.
Finally, if your vectors are sentence embeddings, a pre-trained model like Facebook's InferSent can be used to encode the sentences, with cosine similarity as the distance measure for finding the closest vector.
Sorry, I left out one assumption: the dimensionality is larger than 100. Are k-d trees impractical under this assumption?
k-d trees are typically not practical for high-dimensional data, especially with more than 100 dimensions, because the curse of dimensionality degrades their performance: the tree becomes deep and its pruning becomes ineffective, so queries degenerate toward a full scan. The curse of dimensionality refers to the fact that as dimensionality increases, the volume of the space grows exponentially while the amount of available data stays fixed. Data points therefore become increasingly sparse and nearly equidistant from each other, making nearest neighbors harder to isolate.
Instead, you can use algorithms such as Locality-Sensitive Hashing (LSH), which solve the approximate nearest neighbor problem in high-dimensional spaces. LSH is a probabilistic technique that hashes data points into buckets such that points close to each other in the high-dimensional space are likely to receive the same hash value. A query then only needs to examine the buckets it hashes into, rather than the whole dataset.
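A minimal LSH sketch using random hyperplanes (a standard scheme for angular distance; the data, number of bits, and fallback behavior are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_points, n_bits = 128, 1000, 16

data = rng.normal(size=(n_points, dim))
planes = rng.normal(size=(n_bits, dim))  # random hyperplanes

def signature(v):
    # which side of each hyperplane v falls on -> n_bits-bit hash
    return tuple((planes @ v > 0).astype(int))

# index: bucket every point by its signature
buckets = {}
for i, v in enumerate(data):
    buckets.setdefault(signature(v), []).append(i)

def query(q):
    # scan only the candidates in the query's bucket
    # (falling back to a full scan if the bucket is empty)
    cand = list(buckets.get(signature(q), range(n_points)))
    dists = np.linalg.norm(data[cand] - q, axis=1)
    return cand[int(np.argmin(dists))]

print(query(data[0]))
```

Note that inserting a new vector only costs one signature computation and a bucket append, which matches the requirement of cheap incremental updates; production LSH uses several hash tables to improve recall.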
Another option is random projection, which maps the data onto a lower-dimensional space while approximately preserving pairwise distances (the Johnson-Lindenstrauss lemma); related random-feature techniques such as Random Kitchen Sinks use similar projections for kernel approximation.
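A sketch of distance preservation under a Gaussian random projection (the dimensions and tolerance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d_high, d_low, n = 1000, 100, 50

X = rng.normal(size=(n, d_high))
# Gaussian random projection, scaled so expected norms are preserved
R = rng.normal(size=(d_high, d_low)) / np.sqrt(d_low)
Y = X @ R

# pairwise distance before and after projection stays close,
# per the Johnson-Lindenstrauss lemma
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(orig, proj)
```

After projecting, any nearest neighbor index (k-d tree, LSH, brute force) can be run in the cheaper low-dimensional space.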
Comparison of approximate nearest neighbor search libraries - Qiita
Fuzzy search and Locality-Sensitive Hashing - 人生なんて思い出づくり