Separate Training APIs for PQ Vectors #676


### Overview

During the recent investigation into Model Training Support in https://github.com/opensearch-project/opensearch-jvector/issues/472, we identified potential benefits of pre-training product quantization (PQ) models. Currently, PQ training happens automatically during indexing, using the vectors of the current segment. This issue discusses potential APIs for pre-training PQ models on dedicated training datasets before indexing.

### Idea

The current implementation offers no way to pre-train a PQ model. Pre-training could bring several benefits:
- No training is needed during indexing
- Lower indexing latency and faster indexing overall
- A trained model could potentially be reused across multiple indices
- Easier verification of compressed vector quality
- The model can be trained once on a large dataset instead of separately on each segment
- Possibly easier testing of different subspace counts and code sizes when compressing vectors for a particular dataset

#### Training Flow:
```
User Training Vectors → trainProductQuantization() → Trained Model → Serialize → Store
```
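To make the training flow concrete, here is a minimal, self-contained sketch in plain Java. It does not use the `jvector` API: the class `PqTrainingSketch`, its `train` and `serialize` methods, and the byte layout (three ints followed by the centroid floats) are all assumptions made for illustration. It trains one k-means codebook per subspace and serializes the result so it could be stored.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;

/** Toy product quantizer trainer. Illustrative only, not the jvector implementation. */
final class PqTrainingSketch {

    /** Trains one k-means codebook per subspace and returns codebooks[subspace][cluster][subDim]. */
    static float[][][] train(float[][] vectors, int subspaces, int clusters, int iterations, long seed) {
        int dim = vectors[0].length;
        if (dim % subspaces != 0) {
            throw new IllegalArgumentException("dimension must be divisible by the number of subspaces");
        }
        int subDim = dim / subspaces;
        Random rnd = new Random(seed);
        float[][][] codebooks = new float[subspaces][clusters][subDim];
        for (int m = 0; m < subspaces; m++) {
            int offset = m * subDim;
            // Initialise centroids from randomly chosen training vectors.
            for (int k = 0; k < clusters; k++) {
                System.arraycopy(vectors[rnd.nextInt(vectors.length)], offset, codebooks[m][k], 0, subDim);
            }
            for (int it = 0; it < iterations; it++) {
                float[][] sums = new float[clusters][subDim];
                int[] counts = new int[clusters];
                // Assignment step: attach each sub-vector to its nearest centroid.
                for (float[] v : vectors) {
                    int best = nearest(codebooks[m], v, offset);
                    counts[best]++;
                    for (int d = 0; d < subDim; d++) sums[best][d] += v[offset + d];
                }
                // Update step: move each centroid to the mean of its members.
                for (int k = 0; k < clusters; k++) {
                    if (counts[k] == 0) continue; // keep the old centroid for an empty cluster
                    for (int d = 0; d < subDim; d++) codebooks[m][k][d] = sums[k][d] / counts[k];
                }
            }
        }
        return codebooks;
    }

    /** Index of the centroid closest (squared L2) to the sub-vector starting at offset. */
    static int nearest(float[][] centroids, float[] v, int offset) {
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int k = 0; k < centroids.length; k++) {
            float dist = 0f;
            for (int d = 0; d < centroids[k].length; d++) {
                float diff = v[offset + d] - centroids[k][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = k; }
        }
        return best;
    }

    /** Serializes codebooks as: subspaces, clusters, subDim, then every centroid float (assumed layout). */
    static byte[] serialize(float[][][] codebooks) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(codebooks.length);
            out.writeInt(codebooks[0].length);
            out.writeInt(codebooks[0][0].length);
            for (float[][] book : codebooks)
                for (float[] centroid : book)
                    for (float x : centroid) out.writeFloat(x);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Random rnd = new Random(7);
        float[][] training = new float[200][8];
        for (float[] v : training) for (int d = 0; d < v.length; d++) v[d] = rnd.nextFloat();
        byte[] model = serialize(train(training, 4, 16, 10, 42L));
        System.out.println("serialized model size in bytes: " + model.length);
    }
}
```

The resulting byte array stands in for the "Store" step; where the model would actually live (a file, an index, a system index) is one of the open questions below.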

#### Indexing Flow (with pre-trained model):
```
Load Pre-trained Model → Skip Training → Use Codebooks → Compress Vectors → Index
```
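The indexing flow above could look roughly like the following sketch, again in plain Java rather than the `jvector` API. The names `PqIndexingSketch`, `store`, `load`, `resolveModel`, and `encode` are hypothetical, and the serialized layout (three ints followed by the centroid floats) is an assumption. The point is the decision in `resolveModel`: if a pre-trained model is supplied, per-segment training is skipped and the stored codebooks are used to compress vectors directly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.function.Supplier;

/** Illustrative only: hypothetical names, not the jvector API. */
final class PqIndexingSketch {

    /** Writes codebooks[subspace][cluster][subDim] as three ints followed by all floats (assumed layout). */
    static byte[] store(float[][][] codebooks) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(codebooks.length);
            out.writeInt(codebooks[0].length);
            out.writeInt(codebooks[0][0].length);
            for (float[][] book : codebooks)
                for (float[] centroid : book)
                    for (float x : centroid) out.writeFloat(x);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads codebooks back from the layout written by store. */
    static float[][][] load(byte[] blob) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob))) {
            int subspaces = in.readInt();
            int clusters = in.readInt();
            int subDim = in.readInt();
            float[][][] codebooks = new float[subspaces][clusters][subDim];
            for (float[][] book : codebooks)
                for (float[] centroid : book)
                    for (int d = 0; d < subDim; d++) centroid[d] = in.readFloat();
            return codebooks;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Uses the pre-trained model when present; otherwise falls back to per-segment training. */
    static float[][][] resolveModel(byte[] pretrainedBlob, Supplier<float[][][]> trainOnSegment) {
        return pretrainedBlob != null ? load(pretrainedBlob) : trainOnSegment.get();
    }

    /** Compresses a vector to one centroid index per subspace (assumes at most 256 clusters). */
    static byte[] encode(float[][][] codebooks, float[] vector) {
        int subDim = codebooks[0][0].length;
        byte[] codes = new byte[codebooks.length];
        for (int m = 0; m < codebooks.length; m++) {
            int best = 0;
            float bestDist = Float.MAX_VALUE;
            for (int k = 0; k < codebooks[m].length; k++) {
                float dist = 0f;
                for (int d = 0; d < subDim; d++) {
                    float diff = vector[m * subDim + d] - codebooks[m][k][d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = k; }
            }
            codes[m] = (byte) best;
        }
        return codes;
    }

    public static void main(String[] args) {
        float[][][] trained = {{{0f, 0f}, {10f, 10f}}, {{0f, 0f}, {5f, 5f}}};
        byte[] blob = store(trained);
        // The fallback must never run here, because a pre-trained model is supplied.
        float[][][] model = resolveModel(blob, () -> { throw new IllegalStateException("training should be skipped"); });
        System.out.println(Arrays.toString(encode(model, new float[] {9f, 9f, 1f, 1f}))); // prints [1, 0]
    }
}
```

Passing the fallback as a `Supplier` keeps the current behavior (train on the segment) available whenever no pre-trained model is configured.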

### Implementation

The pre-training APIs will most likely be implemented in the `jvector` library, in a similar way to the core training implementation. Based on the initial investigation, a few additional changes would be required on top of the core training API:
- Expose a separate training API that allows pre-training on dedicated training datasets
- Serialize and deserialize trained models
- Load a pre-trained model at indexing time
- Provide an option to skip training during indexing when a pre-trained model is available

### Open questions

- What training dataset size would be optimal?
- What should the API design look like?
- How should trained models be stored?
- How can compression quality be verified? Which metrics could be used to check it?
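
On the last question, one candidate metric is the mean squared reconstruction error: encode each vector, decode it back from the codebooks, and measure how far the result is from the original. The sketch below is plain Java with hypothetical names (`PqQualitySketch`, `reconstruct`, `meanSquaredError`); it is offered as a starting point for discussion, not as the metric this issue settles on. Search recall against exact nearest neighbors would be a complementary check.

```java
/** Illustrative only: one possible quality metric for a set of PQ codebooks. */
final class PqQualitySketch {

    /** Replaces each sub-vector with its nearest centroid (encode followed by decode). */
    static float[] reconstruct(float[][][] codebooks, float[] vector) {
        int subDim = codebooks[0][0].length;
        float[] out = new float[vector.length];
        for (int m = 0; m < codebooks.length; m++) {
            int offset = m * subDim;
            float[] best = codebooks[m][0];
            double bestDist = Double.MAX_VALUE;
            for (float[] centroid : codebooks[m]) {
                double dist = 0;
                for (int d = 0; d < subDim; d++) {
                    double diff = vector[offset + d] - centroid[d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = centroid; }
            }
            System.arraycopy(best, 0, out, offset, subDim);
        }
        return out;
    }

    /** Mean over all vectors of the squared L2 distance between a vector and its reconstruction. */
    static double meanSquaredError(float[][][] codebooks, float[][] vectors) {
        double total = 0;
        for (float[] v : vectors) {
            float[] r = reconstruct(codebooks, v);
            for (int d = 0; d < v.length; d++) {
                double diff = v[d] - r[d];
                total += diff * diff;
            }
        }
        return total / vectors.length;
    }

    public static void main(String[] args) {
        float[][][] codebooks = {{{0f, 0f}, {10f, 10f}}, {{0f, 0f}, {5f, 5f}}};
        float[][] vectors = {{9f, 9f, 1f, 1f}, {10f, 10f, 5f, 5f}};
        // First vector reconstructs to (10, 10, 0, 0) with squared error 4; the second is exact.
        System.out.println(meanSquaredError(codebooks, vectors)); // prints 2.0
    }
}
```

Running the same metric on a held-out dataset for several subspace counts or code sizes would give a simple way to compare candidate configurations before indexing.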
