Overview
During the recent investigation into Model Training Support in opensearch-project/opensearch-jvector#472, we identified that pre-training PQ (product quantization) models may offer some benefits. Currently, PQ training happens automatically during indexing, using the vectors of the current segment. This issue is to discuss potential APIs for pre-training PQ models on dedicated training datasets before indexing.
Idea
The current implementation offers no way to pre-train a PQ model. Pre-training could bring additional benefits, such as:
- No need to train while indexing
- Lower indexing latency and faster indexing overall
- A trained model might be reusable across multiple indices
- Easier verification of compressed vector quality
- Training once on a large dataset instead of once per segment
- Possibly helpful for testing different subspace counts/code sizes when compressing vectors for a particular dataset
Training Flow:
User Training Vectors → trainProductQuantization() → Trained Model → Serialize → Store
Indexing Flow (with pre-trained model):
Load Pre-trained Model → Skip Training → Use Codebooks → Compress Vectors → Index
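To make the two flows concrete, here is a minimal, self-contained sketch of a toy product quantizer. It is not the jvector implementation or API: the class `ToyPq` and its methods `train`, `encode`, and `decode` are hypothetical names chosen for illustration, and the sketch assumes the vector dimension is divisible by the number of subspaces and that each codebook has at most 256 centroids (so a code fits in one byte). `train` stands in for the training flow, while `encode` shows the indexing flow, where a pre-trained model only performs codebook lookups.

```java
import java.util.Random;

/** Toy product quantizer for illustration only (hypothetical, not the jvector API). */
class ToyPq {
    final int subspaces;          // number of subvectors per vector
    final int subDim;             // dimensions per subvector
    final float[][][] codebooks;  // [subspace][centroid][subDim]

    ToyPq(int subspaces, int subDim, float[][][] codebooks) {
        this.subspaces = subspaces;
        this.subDim = subDim;
        this.codebooks = codebooks;
    }

    /** Training flow: learn one codebook per subspace with a few Lloyd (k-means) iterations. */
    static ToyPq train(float[][] vectors, int subspaces, int k, int iterations, long seed) {
        int subDim = vectors[0].length / subspaces; // assumes an exact division
        Random rnd = new Random(seed);
        float[][][] codebooks = new float[subspaces][k][subDim];
        for (int m = 0; m < subspaces; m++) {
            // Initialise centroids from randomly chosen training vectors.
            for (int c = 0; c < k; c++) {
                float[] v = vectors[rnd.nextInt(vectors.length)];
                System.arraycopy(v, m * subDim, codebooks[m][c], 0, subDim);
            }
            for (int it = 0; it < iterations; it++) {
                float[][] sums = new float[k][subDim];
                int[] counts = new int[k];
                // Assignment step: attach each subvector to its nearest centroid.
                for (float[] v : vectors) {
                    int c = nearest(codebooks[m], v, m * subDim);
                    counts[c]++;
                    for (int d = 0; d < subDim; d++) sums[c][d] += v[m * subDim + d];
                }
                // Update step: move each non-empty centroid to the mean of its members.
                for (int c = 0; c < k; c++) {
                    if (counts[c] == 0) continue; // leave empty centroids unchanged
                    for (int d = 0; d < subDim; d++) codebooks[m][c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return new ToyPq(subspaces, subDim, codebooks);
    }

    /** Index of the centroid closest (squared L2) to the subvector starting at offset. */
    static int nearest(float[][] centroids, float[] v, int offset) {
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            float dist = 0;
            for (int d = 0; d < centroids[c].length; d++) {
                float diff = v[offset + d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    /** Indexing flow: no training, only codebook lookups (one byte per subspace). */
    byte[] encode(float[] v) {
        byte[] codes = new byte[subspaces];
        for (int m = 0; m < subspaces; m++) codes[m] = (byte) nearest(codebooks[m], v, m * subDim);
        return codes;
    }

    /** Reconstructs an approximate vector by concatenating the selected centroids. */
    float[] decode(byte[] codes) {
        float[] out = new float[subspaces * subDim];
        for (int m = 0; m < subspaces; m++)
            System.arraycopy(codebooks[m][codes[m] & 0xFF], 0, out, m * subDim, subDim);
        return out;
    }
}
```

In this sketch, `ToyPq.train(trainingVectors, 2, 2, 5, 42L)` would run once on the dedicated training set, and every later indexing call would only use `encode`, which is where the expected latency benefit comes from.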
Implementation
The pre-training APIs will most likely be implemented in the jvector library, in a similar way to the existing core training. Based on the initial investigation, a few additional changes would be required on top of the core training API:
- Expose a separate training API that allows pre-training on dedicated training datasets
- Serialising and deserialising of trained models
- Loading a pre-trained model when indexing
- An option to skip training during indexing when a pre-trained model is available
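As a rough illustration of the serialisation and skip-training points above, the sketch below round-trips codebooks through a byte array and falls back to training only when no stored model exists. Everything here is an assumption made for discussion: the class `CodebookStore`, the `MAGIC` format tag, the byte layout, and the `loadOrTrain` helper are hypothetical and do not describe jvector's actual on-disk format or API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

/** Hypothetical codebook persistence sketch (not jvector's format). */
class CodebookStore {
    static final int MAGIC = 0x50514D31; // made-up format tag, ASCII "PQM1"

    /** Writes a header (magic, subspaces, centroids, subDim) followed by all floats. */
    static byte[] serialize(float[][][] codebooks) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(MAGIC);
            out.writeInt(codebooks.length);
            out.writeInt(codebooks[0].length);
            out.writeInt(codebooks[0][0].length);
            for (float[][] book : codebooks)
                for (float[] centroid : book)
                    for (float x : centroid) out.writeFloat(x);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads codebooks written by serialize, rejecting data with an unknown header. */
    static float[][][] deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            if (in.readInt() != MAGIC) throw new IllegalArgumentException("not a serialized codebook");
            int subspaces = in.readInt();
            int centroids = in.readInt();
            int subDim = in.readInt();
            float[][][] codebooks = new float[subspaces][centroids][subDim];
            for (float[][] book : codebooks)
                for (float[] centroid : book)
                    for (int d = 0; d < subDim; d++) centroid[d] = in.readFloat();
            return codebooks;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Skips training when a stored model is available, otherwise trains as today. */
    static float[][][] loadOrTrain(byte[] stored, Supplier<float[][][]> trainer) {
        return stored != null ? deserialize(stored) : trainer.get();
    }
}
```

Passing the per-segment training routine as the `trainer` supplier keeps the current behaviour as the default, so indexing without a pre-trained model would be unaffected in this design.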
Open questions
- What training dataset size would be optimal?
- What should the API design look like?
- How should the trained model be stored?
- How can compression quality be verified? Are there metrics that could be used to check it?
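As one possible starting point for the quality question, the sketch below computes two commonly used measures: the mean squared reconstruction error between original and decoded vectors, and recall@k between exact and compressed search results. The class `PqQuality` and both method names are hypothetical, and whether these are the right metrics for this feature is still open.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical quality checks for compressed vectors (names are illustrative). */
class PqQuality {
    /** Average squared L2 distance between each original vector and its reconstruction. */
    static double meanSquaredError(float[][] original, float[][] reconstructed) {
        double total = 0;
        for (int i = 0; i < original.length; i++) {
            for (int d = 0; d < original[i].length; d++) {
                double diff = original[i][d] - reconstructed[i][d];
                total += diff * diff;
            }
        }
        return total / original.length;
    }

    /** Fraction of the exact top-k ids that also appear in the approximate top-k ids. */
    static double recallAtK(int[] exactTopK, int[] approxTopK) {
        Set<Integer> approx = new HashSet<>();
        for (int id : approxTopK) approx.add(id);
        int hits = 0;
        for (int id : exactTopK) if (approx.contains(id)) hits++;
        return (double) hits / exactTopK.length;
    }
}
```

Reconstruction error can be measured right after training, without building an index, while recall@k needs a query set and exact ground-truth neighbours.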