Separate Training APIs for PQ Vectors #676

@kush992

Overview

During the recent investigation of model training support in opensearch-project/opensearch-jvector#472, we identified that pre-training PQ (product quantization) models might bring some benefits. Currently, PQ training happens automatically during indexing, using the vectors of the current segment. This issue is for discussing potential APIs for pre-training PQ models on dedicated training datasets before indexing.

Idea

The current implementation offers no way to pre-train a PQ model. Pre-training could bring additional benefits, such as:

  • No need to train while indexing
  • Lower indexing latency and faster indexing overall
  • A trained model could possibly be reused across multiple indices
  • Easier verification of the quality of compressed vectors
  • A model trained once on a large dataset instead of per-segment training
  • Possibly easier testing of different subspace counts/code sizes when compressing vectors of a particular dataset

Training Flow:

User Training Vectors → trainProductQuantization() → Trained Model → Serialize → Store
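
The training flow above can be sketched in a few lines. This is only an illustration in Python, not jvector code: `train_pq`, `serialize`, the JSON encoding, and the `model_store` dict are all hypothetical stand-ins, and plain Lloyd's k-means is assumed as the per-subspace training step.

```python
import json
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means over a list of equal-length float lists."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def train_pq(vectors, num_subspaces, num_centroids):
    """Hypothetical stand-in for trainProductQuantization(): one codebook per subspace."""
    dim = len(vectors[0])
    if dim % num_subspaces != 0:
        raise ValueError("dimension must be divisible by the number of subspaces")
    sub_dim = dim // num_subspaces
    codebooks = []
    for s in range(num_subspaces):
        chunk = [v[s * sub_dim:(s + 1) * sub_dim] for v in vectors]
        codebooks.append(kmeans(chunk, num_centroids, seed=s))
    return {"dim": dim, "num_subspaces": num_subspaces, "codebooks": codebooks}

def serialize(model):
    """JSON is used only for readability; a real format would likely be binary."""
    return json.dumps(model).encode("utf-8")

# User training vectors -> train -> trained model -> serialize -> store
rng = random.Random(42)
training_vectors = [[rng.random() for _ in range(8)] for _ in range(200)]
model = train_pq(training_vectors, num_subspaces=4, num_centroids=16)
model_store = {"my-pq-model": serialize(model)}  # stand-in for real storage
```

The point of the sketch is that training depends only on the training vectors and two parameters, so it can run ahead of time and independently of any index.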

Indexing Flow (with pre-trained model):

Load Pre-trained Model → Skip Training → Use Codebooks → Compress Vectors → Index
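
A minimal sketch of the indexing flow above, again in Python with hypothetical names (`load_model`, `encode`, `index_segment`) rather than anything from jvector. The stored model here is a tiny hand-written codebook so the result is easy to check by hand.

```python
import json

# Stand-in for a stored pre-trained model: 2 subspaces of 2 dims, 2 centroids each.
stored_blob = json.dumps({
    "dim": 4,
    "num_subspaces": 2,
    "codebooks": [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 1.0], [1.0, 0.0]],
    ],
}).encode("utf-8")

def load_model(blob):
    return json.loads(blob.decode("utf-8"))

def encode(model, vector):
    """Compress one vector into one centroid id per subspace."""
    sub_dim = model["dim"] // model["num_subspaces"]
    codes = []
    for s, codebook in enumerate(model["codebooks"]):
        chunk = vector[s * sub_dim:(s + 1) * sub_dim]
        codes.append(min(
            range(len(codebook)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(chunk, codebook[c])),
        ))
    return codes

def index_segment(vectors, pretrained=None, train_fn=None):
    """Use the pre-trained model if given; otherwise fall back to per-segment training."""
    model = pretrained if pretrained is not None else train_fn(vectors)
    return [encode(model, v) for v in vectors]

model = load_model(stored_blob)
segment = [[0.1, 0.0, 0.9, 0.2], [0.9, 1.0, 0.1, 0.8]]
codes = index_segment(segment, pretrained=model)
print(codes)  # [[0, 1], [1, 0]]
```

Keeping the `train_fn` fallback means the existing per-segment behaviour stays the default when no pre-trained model is supplied.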

Implementation

The pre-training APIs will most likely be implemented in the jvector library, in a similar way to the core training implementation. Based on the initial investigation, this would require a few additional changes on top of the core training API:

  • Expose a separate training API that allows pre-training on training datasets
  • Serialization and deserialization of trained models
  • Loading a pre-trained model when indexing
  • An option to skip training when indexing if a pre-trained model is available
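
To make the serialization point concrete, here is one possible shape for a binary round trip, sketched in Python. Everything in it is an assumption for discussion: the `PQM1` tag, the header layout, and the function names are invented here and do not describe jvector's existing on-disk format.

```python
import struct

MAGIC = b"PQM1"  # hypothetical 4-byte tag used to reject unrelated blobs

def serialize(codebooks):
    """Pack codebooks[subspace][centroid][component] as header + float32 payload."""
    m, k, d = len(codebooks), len(codebooks[0]), len(codebooks[0][0])
    flat = [x for cb in codebooks for centroid in cb for x in centroid]
    header = MAGIC + struct.pack("<III", m, k, d)
    return header + struct.pack(f"<{len(flat)}f", *flat)

def deserialize(blob):
    """Validate the tag, read the three sizes, then rebuild the nested lists."""
    if blob[:4] != MAGIC:
        raise ValueError("not a PQ model blob")
    m, k, d = struct.unpack_from("<III", blob, 4)
    flat = iter(struct.unpack_from(f"<{m * k * d}f", blob, 16))
    return [[[next(flat) for _ in range(d)] for _ in range(k)] for _ in range(m)]

# Values chosen to be exactly representable as float32, so the round trip is lossless.
codebooks = [[[0.0, 0.5], [1.0, 1.5]], [[2.0, 2.5], [3.0, 3.5]]]
blob = serialize(codebooks)
restored = deserialize(blob)
```

A self-describing header like this lets the loader fail fast on a wrong or corrupted model before indexing starts. Storing float32 loses precision for arbitrary values, which is usually acceptable for centroids.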

Open questions

  • What training dataset size would be optimal?
  • What should the API design look like?
  • How should the trained model be stored?
  • How can the compression quality be verified? Perhaps with some metrics?
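
On the last question, one candidate metric is the mean squared reconstruction error on held-out vectors: encode each vector, decode it back from its centroids, and measure how far the result is from the original. The Python sketch below is only an illustration with hypothetical names and a hand-written codebook. Recall against exact nearest-neighbour results would be another natural check, not shown here.

```python
def nearest(chunk, codebook):
    """Index of the centroid closest to this sub-vector."""
    return min(
        range(len(codebook)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(chunk, codebook[c])),
    )

def reconstruct(vector, codebooks):
    """Encode then decode: replace each sub-vector with its nearest centroid."""
    sub_dim = len(vector) // len(codebooks)
    out = []
    for s, codebook in enumerate(codebooks):
        chunk = vector[s * sub_dim:(s + 1) * sub_dim]
        out.extend(codebook[nearest(chunk, codebook)])
    return out

def mean_squared_error(vectors, codebooks):
    """Average squared L2 distance between vectors and their reconstructions."""
    total = 0.0
    for v in vectors:
        r = reconstruct(v, codebooks)
        total += sum((a - b) ** 2 for a, b in zip(v, r))
    return total / len(vectors)

codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
held_out = [[0.0, 0.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.5]]
mse = mean_squared_error(held_out, codebooks)
print(mse)  # 0.125
```

Computing this on vectors that were not part of the training set would also help answer the dataset-size question, since the error should stop improving once the training set is large enough.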
