Use PCA & update maximum allowed Vector Dims #18
Draft
shindi-renuo wants to merge 4 commits into main
This PR will not be merged. I decided against using PCA to reduce the vectors of larger models down to 2000 dimensions, because I observed a significant drop in the quality of the search results.
The fewer keywords there were, the less accurate the results were, even when the query had an exact match.
For example:
The issue lies in the fact that vectors with fewer than 5500 dimensions need to be padded with 0s all the way up to 5500 dimensions. This is called zero padding. Only after that can we apply the maths of PCA, and only then do we have a vector ready for similarity search.
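The padding step described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `zero_pad` and `cosine_similarity` are hypothetical helper names, and the only value taken from the description is the target size of 5500. The sketch also shows that appending zeros to two vectors of the same original length leaves their cosine similarity unchanged.

```python
import math

TARGET_DIMS = 5500  # target size mentioned in the description above


def zero_pad(vector, target_dims=TARGET_DIMS):
    """Return a copy of `vector` with trailing zeros appended up to `target_dims`."""
    if len(vector) > target_dims:
        raise ValueError("vector is longer than the target size")
    return list(vector) + [0.0] * (target_dims - len(vector))


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two toy embeddings of the same (small) original size.
a = [0.1, 0.7, -0.2]
b = [0.3, 0.5, 0.4]

padded_a = zero_pad(a)
padded_b = zero_pad(b)

before = cosine_similarity(a, b)
after = cosine_similarity(padded_a, padded_b)
```

Here `before` and `after` agree, since the added zeros contribute nothing to the dot product or to either norm.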
It is possible that my implementation of PCA is inaccurate, as I am quite inexperienced with it. I tried to follow the Wikipedia article as closely as possible.
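For comparison, here is a small, dependency-free PCA sketch. It is not the implementation from this PR. It centers the data, finds the leading principal directions by power iteration on the covariance (a swapped-in technique chosen to keep the sketch self-contained; libraries typically use SVD or an eigendecomposition), and projects vectors onto those directions. All function names are hypothetical, and the toy data is 3-dimensional rather than 5500-dimensional.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the per-dimension mean from every row; return rows and mean."""
    count = len(rows)
    mean = [sum(col) / count for col in zip(*rows)]
    return [[x - m for x, m in zip(row, mean)] for row in rows], mean


def principal_components(centered, k, iters=100, seed=0):
    """Approximate the top-k principal directions by power iteration."""
    rng = random.Random(seed)
    dim = len(centered[0])
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            # Remove any overlap with directions that were already found.
            for c in components:
                overlap = dot(v, c)
                v = [a - overlap * b for a, b in zip(v, c)]
            # w = X^T (X v), proportional to the covariance matrix times v.
            scores = [dot(row, v) for row in centered]
            w = [sum(s * row[j] for s, row in zip(scores, centered))
                 for j in range(dim)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                return components  # no variance left to explain
            v = [a / norm for a in w]
        components.append(v)
    return components


def project(vector, mean, components):
    """Reduce `vector` to one coordinate per principal component."""
    shifted = [x - m for x, m in zip(vector, mean)]
    return [dot(shifted, c) for c in components]


# Toy data: 3-dimensional points that all lie on a single line.
rows = [[t * 1.0, t * 2.0, t * 2.0] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
centered, mean = center(rows)
components = principal_components(centered, k=1)
reduced = [project(row, mean, components) for row in rows]
```

The same `mean` and `components` have to be applied to query vectors as to the stored vectors, otherwise the reduced vectors are not comparable in a similarity search.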
Note
The reason I don't simply change the vector size of the embedding column in a new migration is that it's impossible.
pgvector and PostgreSQL do not natively support vector dimensions larger than 2000. Using a custom build is possible, but it would be very hacky. From the threads I have read about this problem, PCA seems to be the standard way to solve it, rather than compiling PostgreSQL yourself.