perf(knn): reduce memory for batch flat vector search #6950
Avoid retaining whole scan RecordBatches (and vectors) in batch flat KNN heaps when the final projection does not include the vector column. When vectors are requested, retain only a per-row copy.
@BubbleCal Could you review this PR when you have time? It addresses #6940
| input | ||
| }; | ||
| let retain_vector = if self.is_batch_nearest { | ||
| let vector_field_id = self.dataset.schema().field_id(q.column.as_str())?; |
Does this work with a nested vector field? Can we add a test for this?
| let input_schema = Arc::new(input_schema); | ||
| let stored_schema = if is_batch && !retain_vector { | ||
| Arc::new(Schema::new(vec![ROW_ID_FIELD.clone()])) |
Something like input_schema.drop_column(vector_column) would be better; we may have more columns than just the row ID here.
only vector column will be dropped now
| column: &str, | ||
| row_index: u32, | ||
| ) -> DataFusionResult<ArrayRef> { | ||
| let vectors = batch.column_by_name(column).ok_or_else(|| { |
just doing batch[column] is fine
| )) | ||
| })?; | ||
| let indices = UInt32Array::from(vec![row_index]); | ||
| arrow::compute::take(vectors, &indices, None) |
I wonder whether there is an easier way that avoids constructing these indices, probably just copying the data out directly?
would introducing Array::slice be a better choice?
| .clone(); | ||
| let slim_batch = if retain_vector { | ||
| Some(Arc::new( |
slim batch will be pinned when necessary
| } | ||
| #[cfg(test)] | ||
| mod batch_knn_memory_tests { |
I guess we don't need these memory tests
probably only test that if the query doesn't select the vector column, then the batch shouldn't contain it
| row_index, | ||
| let candidate = if retain_vector { | ||
| let row_index = row_index as u32; | ||
| let vector_row = Self::take_vector_row(&batch, &column, row_index)?; |
Don't copy the vector unless it will be pushed into the candidate heap
there will be a would_enter_heap check ahead and a slice for vector when necessary
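The deferred copy discussed above can be sketched with the standard library alone. This is a minimal model, not the PR's code: `Candidate`, `TopK`, `offer`, and the `copies` counter are hypothetical names, and only `would_enter_heap` and `retain_vector` are taken from the thread. The row is copied only after the distance check says it will displace the current worst candidate.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical heap entry: distance, row id, and an optional owned vector copy.
struct Candidate {
    distance: f32,
    row_id: u64,
    vector: Option<Vec<f32>>,
}

impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f32 a total order, so the max-heap keeps the worst row on top.
        self.distance
            .total_cmp(&other.distance)
            .then(self.row_id.cmp(&other.row_id))
    }
}
impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Candidate {}

/// Hypothetical per-query top-k state; `copies` counts vector copies for illustration.
struct TopK {
    k: usize,
    heap: BinaryHeap<Candidate>,
    copies: usize,
}

impl TopK {
    fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::new(), copies: 0 }
    }

    /// True if a row with this distance would be kept in the top-k.
    fn would_enter_heap(&self, distance: f32) -> bool {
        if self.k == 0 {
            return false;
        }
        self.heap.len() < self.k
            || self.heap.peek().map_or(true, |worst| distance < worst.distance)
    }

    /// Copies `row` only when the candidate is going to be stored.
    fn offer(&mut self, row_id: u64, distance: f32, row: &[f32], retain_vector: bool) {
        if !self.would_enter_heap(distance) {
            return; // rejected rows never pay for a copy
        }
        let vector = if retain_vector {
            self.copies += 1;
            Some(row.to_vec())
        } else {
            None
        };
        if self.heap.len() == self.k {
            self.heap.pop(); // evict the current worst candidate
        }
        self.heap.push(Candidate { distance, row_id, vector });
    }

    /// Candidates ordered from nearest to farthest.
    fn into_sorted(self) -> Vec<Candidate> {
        self.heap.into_sorted_vec()
    }
}
```

With `k = 2`, offering four rows of which one is worse than the current worst results in three copies rather than four; with `retain_vector` set to false, no copies are made at all.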
| row_id: u64, | ||
| batch: RecordBatch, | ||
| row_index: u32, | ||
| enum BatchKnnCandidate { |
keep it a struct and make only the differing fields into an enum
minimize the diff and make a nested enum inside
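One way to read the "struct plus nested enum" suggestion is sketched below. The names `BatchKnnCandidate`, `BatchKnnExtra`, and `RowIdOnly` appear in this PR; the `WithVector` variant, the field layout, and `retained_floats` are assumptions made for illustration and may not match the actual code.

```rust
/// The part of a candidate that differs between projection modes.
#[derive(Clone, Debug, PartialEq)]
enum BatchKnnExtra {
    /// Nothing beyond the row id is retained.
    RowIdOnly,
    /// Hypothetical variant: an owned copy of the single vector row.
    WithVector(Vec<f32>),
}

/// Fields shared by every candidate stay in the struct.
#[derive(Clone, Debug, PartialEq)]
struct BatchKnnCandidate {
    distance: f32,
    row_id: u64,
    extra: BatchKnnExtra,
}

impl BatchKnnCandidate {
    /// Number of f32 values this candidate keeps alive (illustration only).
    fn retained_floats(&self) -> usize {
        match &self.extra {
            BatchKnnExtra::RowIdOnly => 0,
            BatchKnnExtra::WithVector(v) => v.len(),
        }
    }
}
```

Keeping `distance` and `row_id` in the struct means heap ordering code does not need to match on the enum at all.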
Refactor heap candidates to struct + extra enum, defer vector and slim batch copies until rows enter top-k, and use slice/batched take for assembly. Add correctness tests asserting vector column is omitted when not projected.
For batch flat KNN, keep stored schema as input without the vector column instead of row-id-only, and carry slim batches when needed so non-vector columns can be assembled without retaining vectors. Add a filter-based batch KNN regression test to verify non-vector outputs remain correct.
Add regression tests for payload.vec batch KNN omit/retain behavior and use qualified batch column access when retaining nested vectors.
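The idea of dropping only the vector column, including a nested one such as payload.vec, can be modeled with a small field tree. This is a standard-library sketch under assumed names (`Field`, `leaf`, `parent`, `remove_path`); it does not use the Arrow or Lance schema types, and it takes the path already split into segments rather than parsing escaped names.

```rust
/// Minimal stand-in for a schema field with optional nested children.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    children: Vec<Field>,
}

fn leaf(name: &str) -> Field {
    Field { name: name.to_string(), children: Vec::new() }
}

fn parent(name: &str, children: Vec<Field>) -> Field {
    Field { name: name.to_string(), children }
}

/// Removes the field addressed by `path`, leaving every sibling in place.
/// Returns true if a field was removed.
fn remove_path(fields: &mut Vec<Field>, path: &[&str]) -> bool {
    match path {
        [] => false,
        [last] => {
            let before = fields.len();
            fields.retain(|f| f.name != *last);
            fields.len() != before
        }
        [head, rest @ ..] => fields
            .iter_mut()
            .find(|f| f.name == *head)
            .map_or(false, |f| remove_path(&mut f.children, rest)),
    }
}
```

Applied to a schema of `_rowid`, `payload { vec, label }`, and `category`, removing `["payload", "vec"]` leaves `payload { label }` and the other two columns untouched, which is the property the filter-based regression test is checking at a higher level.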
| #[derive(Clone)] | ||
| enum BatchKnnExtra { | ||
| RowIdOnly, |
Can you add a test which selects row_id and row_addr but no vectors?
I think it can't work with the current implementation
Rebuild batch KNN partition statistics from output schema, remove vector columns via nested path-aware helpers, and switch projected vector row capture from slice to take/copy to avoid retaining full buffers.
Add a batch flat KNN regression test for projecting row_id and row_addr without vectors to ensure system columns are materialized correctly.
Use parse_field_path-based vector resolution for batch retained vectors and remove the nested projected fallback that cloned full batches, then add regressions for escaped nested vector projection and nested slim-batch vector removal.
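The reason for moving from slice to take/copy can be illustrated without Arrow. A zero-copy slice holds a reference to the whole underlying buffer, so one retained row keeps the entire scan batch's vector data alive, while a copy holds only its own values. The sketch below models that with `Arc<Vec<f32>>`; `RowSlice`, `slice_row`, and `copy_row` are hypothetical names, not Arrow or Lance APIs.

```rust
use std::sync::Arc;

/// Stand-in for a zero-copy slice: it shares the full column buffer.
struct RowSlice {
    buffer: Arc<Vec<f32>>,
    offset: usize,
    len: usize,
}

impl RowSlice {
    fn values(&self) -> &[f32] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}

/// Zero-copy view of one row; keeps the whole buffer alive while it exists.
fn slice_row(buffer: &Arc<Vec<f32>>, row: usize, dim: usize) -> RowSlice {
    RowSlice { buffer: Arc::clone(buffer), offset: row * dim, len: dim }
}

/// Owned copy of one row; independent of the source buffer's lifetime.
fn copy_row(buffer: &Arc<Vec<f32>>, row: usize, dim: usize) -> Vec<f32> {
    buffer[row * dim..(row + 1) * dim].to_vec()
}
```

Dropping the original `Arc` while a `RowSlice` is still held leaves the full buffer allocated; dropping it while only the copied row is held frees the buffer and keeps just `dim` floats.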
Summary
Closes #6940.
Batch flat KNN previously kept a full scan RecordBatch (including the vector column) for each top-k candidate, which could use on the order of m × k × d × 4 bytes when many query vectors and large k/d are used.
This PR changes batch flat KNN so that:
Behavior
Scanner::flat_knn sets retain_vector from whether the user projection includes the vector field.
KNNVectorDistanceExec batch mode maintains a per-query top-k heap across all scan batches and emits one result batch with query_index and _distance.
Memory (qualitative)
vec + _rowid
Test plan
cargo test -p lance --lib test_batch_knn_flat
cargo test -p lance --lib test_batch_knn_flat_omits_vector_without_projection
cargo test -p lance --lib test_batch_knn_flat_filter_keeps_non_vector_columns
cargo test -p lance --lib test_batch_knn_flat_respects_distance_range
cargo fmt --all and cargo clippy -p lance --lib -- -D warnings
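To make the m × k × d × 4 estimate from the summary concrete, here is a small calculation. The function name and the sample values (1,000 queries, k = 100, 768 dimensions) are assumptions chosen for illustration; the factor 4 is the size of an f32 in bytes.

```rust
/// Rough upper bound, in bytes, on vector data retained by top-k candidates:
/// m query vectors × k candidates each × d dimensions × 4 bytes per f32.
fn retained_vector_bytes(m: u64, k: u64, d: u64) -> u64 {
    m * k * d * 4
}

/// Converts bytes to mebibytes for readability.
fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

For the assumed values, `retained_vector_bytes(1_000, 100, 768)` is 307,200,000 bytes, roughly 293 MiB. When the projection omits the vector column, that term is what the change is meant to avoid retaining.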