
Use PCA & update maximum allowed Vector Dims#18

Draft

shindi-renuo wants to merge 4 commits into main from feature/update-max-dimensions

Conversation

@shindi-renuo
Contributor

@shindi-renuo shindi-renuo commented May 14, 2025

This PR will not be merged. I decided not to use PCA to reduce vector dimensions from larger models down to 2000 dims, because I observed a significant drop in the quality of the search results.

The fewer keywords the query contained, the less accurate the results were, even when the query matched a document exactly.

For example:

(Screenshot: CleanShot 2025-05-14 at 11 52 40)

The issue lies in the fact that vectors with fewer than 5500 dimensions must first be padded with zeros up to a length of 5500. This is called zero padding. Only after that can we apply the PCA transformation, and only then do we have a vector ready for similarity search.
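A minimal sketch of the zero-padding step, assuming NumPy; the function name `zero_pad` and the example embedding sizes are illustrative, while 5500 (padded length) and 2000 (PCA target) come from the discussion above:

```python
import numpy as np

TARGET_DIM = 5500   # every vector is padded to this length before PCA
REDUCED_DIM = 2000  # the dimension PCA reduces to, per the PR

def zero_pad(vec: np.ndarray, target_dim: int = TARGET_DIM) -> np.ndarray:
    """Right-pad a vector with zeros so all embeddings share one length."""
    if len(vec) > target_dim:
        raise ValueError(f"vector has {len(vec)} dims, more than {target_dim}")
    return np.pad(vec, (0, target_dim - len(vec)))

# Example: embeddings of two different model sizes both become 5500-dim.
short = np.ones(1536)
long_ = np.ones(3072)
padded = np.stack([zero_pad(short), zero_pad(long_)])
assert padded.shape == (2, TARGET_DIM)
```

One caveat this sketch makes visible: padding with zeros changes the geometry of the smaller vectors, which may contribute to the quality loss described above.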

It is possible my implementation of PCA is inaccurate, as I am quite inexperienced with it. I tried to follow the Wikipedia article as closely as possible.
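For reference, a textbook PCA along the lines the Wikipedia article describes (center the data, form the covariance matrix, keep the eigenvectors with the largest eigenvalues) looks roughly like this; it is a sketch with NumPy, not the code from this PR, and the toy dimensions are illustrative:

```python
import numpy as np

def pca_fit(X: np.ndarray, n_components: int):
    """Fit PCA: center, build covariance, take top eigenvectors."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]               # shape (d, n_components)
    return mean, components

def pca_transform(X: np.ndarray, mean: np.ndarray, components: np.ndarray):
    """Project (possibly new) data onto the fitted principal components."""
    return (X - mean) @ components

# Toy sanity check: reduce 100 samples from 4 dims to 2 dims.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
mean, comps = pca_fit(X, 2)
Z = pca_transform(X, mean, comps)
assert Z.shape == (100, 2)
```

Note that a faithful PCA fit needs substantially more sample vectors than target components; fitting a 5500-to-2000 projection on a small corpus would itself degrade the projection, independent of any implementation bug.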

Note

The reason I don't just change the vector size of the embedding column in a new migration is that it's impossible: pgvector and PostgreSQL do not natively support vector dimensions larger than 2000. Using a custom build is possible, but it would be very hacky, and from the threads I've read about this problem, PCA is the standard way to solve it, rather than compiling a patched Postgres.

@shindi-renuo shindi-renuo self-assigned this May 14, 2025
@shindi-renuo shindi-renuo added the enhancement New feature or request label May 14, 2025