
feat(index): add streaming ivf kmeans training#6913

Open: BubbleCal wants to merge 11 commits into main from yang/streaming-ivf-kmeans-training
BubbleCal (Contributor) commented May 22, 2026

Summary

Adds LanceStream, a streaming IVF kmeans training path that keeps peak raw-vector training memory bounded for very large IVF partition counts while preserving the existing non-streaming trainer when streaming is not requested.

The latest update also bounds the sampling read buffer: fixed sampled ranges are split into smaller range reads and take_scan readahead defaults to one batch. This avoids the previous RSS spike from large contiguous sampled ranges while keeping an environment override for benchmark tuning.

Feature

  • Adds streaming_sample_rate, streaming_coreset_rate, and streaming_refine_passes to IVF build parameters.
  • Exposes the new parameters through the Python index creation path.
  • Keeps num_partitions > 256 on hierarchical kmeans; large-k streaming does not fall back to flat kmeans.
  • Uses fixed sampled row ranges for streaming training so coreset construction and optional refine passes read the same sample.
  • Splits fixed sample ranges into bounded sub-ranges before take_scan; default range chunk is 8192 rows.
  • Adds LANCE_STREAMING_IVF_PREFETCH_DEPTH and LANCE_STREAMING_IVF_TAKE_RANGE_ROWS for experimental tuning. Default prefetch depth is 1 because deeper prefetch gave minimal speedup and higher RSS.
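The range splitting described in the list above can be sketched as follows. This is an illustrative Python mirror, not the Rust code in ivf.rs: the helper names `take_range_rows` and `split_ranges` are hypothetical, and the way the environment variable is parsed (invalid or non-positive values fall back to the default) is an assumption. Only the variable name and the 8192-row default come from the description.

```python
import os

# Default chunk size quoted in the PR description.
DEFAULT_TAKE_RANGE_ROWS = 8192


def take_range_rows() -> int:
    """Read the chunk size override, falling back to the default (assumed parsing)."""
    raw = os.environ.get("LANCE_STREAMING_IVF_TAKE_RANGE_ROWS", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_TAKE_RANGE_ROWS
    return value if value > 0 else DEFAULT_TAKE_RANGE_ROWS


def split_ranges(ranges, max_rows):
    """Split half-open [start, end) row ranges into pieces of at most max_rows rows."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    pieces = []
    for start, end in ranges:
        while start < end:
            stop = min(start + max_rows, end)
            pieces.append((start, stop))
            start = stop
    return pieces
```

Splitting preserves the total row count and the order of the ranges, so every pass over the pieces still reads the same fixed sample; only the size of each individual read changes.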

Algorithm

The existing IVF trainer samples up to num_partitions * sample_rate raw vectors before training. For very large num_partitions, that materializes too many raw vectors at once. LanceStream decouples total sample size from peak raw-vector memory.

The streaming path works as follows:

  1. Build a fixed sample plan as sorted random row ranges, bounded by num_partitions * sample_rate and dataset row count.
  2. Split those ranges into training chunks with at most num_partitions * streaming_sample_rate raw vectors.
  3. For each chunk, load only the current raw vectors, train a local kmeans summary, assign raw vectors to local centroids, and record weighted local centroids plus local loss contribution.
  4. Merge weighted local summaries into a global coreset bounded by num_partitions * streaming_coreset_rate.
  5. Train final IVF centroids with weighted hierarchical kmeans over the coreset. The hierarchical trainer starts with a bounded fanout and repeatedly splits high-loss/high-weight partitions until it reaches num_partitions.
  6. Run a short weighted Lloyd refinement over the coreset centroids.
  7. Optionally run streaming_refine_passes raw-vector Lloyd passes over the same fixed sample. Each pass streams chunks, assigns raw vectors to current centroids, accumulates global sums/counts/loss, updates non-empty centroids at pass end, and preserves old centroids for empty clusters.
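Step 7 can be illustrated with a minimal pure-Python sketch of one raw-vector Lloyd pass. The function names are hypothetical and the code makes no attempt to match the Rust implementation's data layout or SIMD kernels; it only shows the accumulate-then-update structure and the empty-cluster rule described above.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centroids):
    """Return (index, squared distance) of the closest centroid."""
    best_idx, best_dist = 0, squared_l2(vector, centroids[0])
    for idx in range(1, len(centroids)):
        dist = squared_l2(vector, centroids[idx])
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist


def streaming_lloyd_pass(chunks, centroids):
    """One Lloyd pass over an iterable of chunks; only one chunk is live at a time."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    loss = 0.0
    for chunk in chunks:  # chunks may be a generator that loads data lazily
        for vector in chunk:
            idx, dist = nearest(vector, centroids)
            counts[idx] += 1
            loss += dist
            for d in range(dim):
                sums[idx][d] += vector[d]
    updated = []
    for idx, old in enumerate(centroids):
        if counts[idx] == 0:
            updated.append(list(old))  # empty cluster keeps its old centroid
        else:
            updated.append([s / counts[idx] for s in sums[idx]])
    return updated, loss
```

The accumulators hold one sum vector and one count per centroid, which is why centroid-sized buffers remain even though the raw vectors are streamed. The returned loss is measured against the centroids used for assignment, before the update.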

Peak raw-vector memory is bounded by:

num_partitions * streaming_sample_rate * dimension * sizeof(float32)

The coreset budget is separately bounded by:

num_partitions * streaming_coreset_rate

Centroid and accumulator buffers still scale with num_partitions * dimension.
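As a worked example of the raw-vector bound, the snippet below plugs in the largest benchmarked configuration (k=131,072, 128-dimensional float32, stream=64) and compares it with the full k * 256 sample that the non-streaming trainer materializes. The helper name is hypothetical; the figures cover raw vectors only, so they are not expected to match measured RSS, which also includes centroids, accumulators, and other allocations.

```python
FLOAT32_BYTES = 4
GIB = 1024 ** 3


def raw_vector_bytes(num_partitions, rate, dimension):
    """Bytes needed to hold num_partitions * rate float32 vectors."""
    return num_partitions * rate * dimension * FLOAT32_BYTES


streaming = raw_vector_bytes(131_072, 64, 128)       # streaming_sample_rate = 64
non_streaming = raw_vector_bytes(131_072, 256, 128)  # sample_rate = 256

print(streaming / GIB)      # 4.0
print(non_streaming / GIB)  # 16.0
print(131_072 * 16)         # coreset budget at coreset rate 16: 2097152 points
```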

Benchmarks

Setup

  • VM: GCP c2-standard-16, 16 vCPU, ~64 GiB RAM, AVX512, 500 GB disk.
  • Dataset: real SIFT vectors, 128-dimensional float32.
  • Dataset choice: k=1024 on SIFT1M subset; k>=4096 on SIFT1B prefix.
  • Evaluation set: the first k * 256 rows, used for unified exact-loss evaluation; the training sample size is also k * 256.
  • LanceStream defaults in this run: stream=64, coreset=16, prefetch=1, refine=0, sample_rate=256, max_iters=50.
  • Tuned LanceStream variants only changed stream, coreset, and prefetch_depth.
  • Timeout: 1 hour per run.

Loss Methodology

All comparable losses below are computed outside the training algorithm:

  1. Save trained centroids to f32 files.
  2. Read the first k * 256 vectors from the dataset.
  3. Compute exact kmeans objective with faiss.IndexFlatL2:
sum_x min_c ||x - c||^2

No table below uses algorithm-internal reported loss for comparison. Exact loss is reported for k <= 16,384. For k=65,536 and 131,072, exact loss was not run because one evaluation is roughly O(k * k * 256) distance work; those rows report train time/RSS/status only.
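For reference, the objective being evaluated can be written as a brute-force pure-Python sketch. This is not the evaluation script used for the tables, which relied on faiss.IndexFlatL2; the function names are hypothetical, and the second helper only counts the distance evaluations implied by the O(k * k * 256) estimate above.

```python
def exact_kmeans_loss(vectors, centroids):
    """Sum over vectors of the squared L2 distance to the nearest centroid."""
    total = 0.0
    for x in vectors:
        total += min(
            sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids
        )
    return total


def distance_evaluations(k, sample_rate=256):
    """Vector-to-centroid distance computations for k * sample_rate vectors."""
    return (k * sample_rate) * k


print(exact_kmeans_loss([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [3.0, 0.0]]))  # 16.0
print(distance_evaluations(65_536))  # 1099511627776, i.e. 2**40
```

At k=65,536 a single evaluation therefore needs about 1.1e12 distance computations, each over 128 dimensions.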

LanceStream Tuning

Percentages are relative to LanceStream stream=64, coreset=16, prefetch=1 at the same k.

| k | config | status | train time | time delta | RSS | RSS delta | exact loss | loss delta |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1,024 | s64 c16 p1 | ok | 10.38s | baseline | 0.15 GiB | baseline | 13.839e9 | baseline |
| 1,024 | s64 c16 p16 | ok | 10.04s | -3.2% | 0.21 GiB | +40.0% | 13.823e9 | -0.1% |
| 1,024 | s64 c8 p1 | ok | 5.39s | -48.0% | 0.15 GiB | +0.9% | 13.971e9 | +1.0% |
| 4,096 | s64 c16 p1 | ok | 29.73s | baseline | 0.32 GiB | baseline | 63.509e9 | baseline |
| 4,096 | s64 c16 p16 | ok | 27.58s | -7.2% | 0.63 GiB | +98.3% | 63.516e9 | +0.0% |
| 4,096 | s64 c8 p1 | ok | 15.52s | -47.8% | 0.29 GiB | -8.8% | 64.025e9 | +0.8% |
| 16,384 | s64 c16 p1 | ok | 94.72s | baseline | 0.92 GiB | baseline | 228.083e9 | baseline |
| 16,384 | s64 c16 p16 | ok | 95.40s | +0.7% | 1.24 GiB | +35.9% | 225.302e9 | -1.2% |
| 16,384 | s64 c8 p1 | ok | 58.07s | -38.7% | 0.80 GiB | -12.4% | 241.662e9 | +6.0% |
| 65,536 | s64 c16 p1 | ok | 522.12s | baseline | 3.04 GiB | baseline | n/a | n/a |
| 65,536 | s64 c16 p16 | ok | 513.28s | -1.7% | 3.52 GiB | +15.9% | n/a | n/a |
| 65,536 | s64 c8 p1 | ok | 304.78s | -41.6% | 2.73 GiB | -10.3% | n/a | n/a |
| 131,072 | s64 c16 p1 | ok | 935.95s | baseline | 6.02 GiB | baseline | n/a | n/a |
| 131,072 | s64 c16 p16 | ok | 920.70s | -1.6% | 6.42 GiB | +6.7% | n/a | n/a |
| 131,072 | s64 c8 p1 | ok | 619.45s | -33.8% | 5.33 GiB | -11.4% | n/a | n/a |
| 131,072 | s16 c8 p1 | ok | 1022.17s | +9.2% | 2.39 GiB | -60.2% | n/a | n/a |

Takeaways:

  • Increasing prefetch_depth is not a good default. At 128K, p16 was only 1.6% faster than p1 but used 6.7% more RSS. At smaller k it often increased RSS much more.
  • Lowering coreset from 16 to 8 is the strongest speed/RSS knob, but it can raise loss. At 16K it was 38.7% faster and 12.4% lower RSS, but loss was 6.0% worse.
  • Smaller stream lowers RSS but does not necessarily improve time. At 128K, stream=16, coreset=8 used only 2.39 GiB but took 17.0 minutes.
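The delta columns in the tuning table follow the usual relative-change formula against the same-k baseline. The helper below is a hypothetical illustration, checked against the stream=16, coreset=8 row at k=131,072. Recomputing from the rounded figures printed in the table can differ from the published percentages in the last digit, presumably because those were derived from unrounded measurements.

```python
def pct_delta(value, baseline):
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Train time, k=131,072: s16 c8 p1 versus the s64 c16 p1 baseline.
print(round(pct_delta(1022.17, 935.95), 1))  # 9.2

# The same run in minutes, matching the 17.0 minutes quoted in the takeaways.
print(round(1022.17 / 60, 1))  # 17.0
```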

Algorithm Comparison

Percentages are relative to LanceStream stream=64, coreset=16, prefetch=1 at the same k.

| k | algorithm | status | train time | time delta | RSS | RSS delta | exact loss | loss delta |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1,024 | LanceStream | ok | 10.38s | baseline | 0.15 GiB | baseline | 13.839e9 | baseline |
| 1,024 | Lance non-stream | ok | 1.83s | -82.4% | 0.31 GiB | +102.5% | 14.154e9 | +2.3% |
| 1,024 | Faiss | ok | 33.51s | +223.0% | 0.20 GiB | +29.4% | 13.279e9 | -4.1% |
| 1,024 | MiniBatch | ok | 31.75s | +205.9% | 0.32 GiB | +111.2% | 13.494e9 | -2.5% |
| 1,024 | BICO-style | ok | 34.55s | +233.0% | 0.53 GiB | +252.7% | 14.307e9 | +3.4% |
| 1,024 | treeCoreset | ok | 13.77s | +32.7% | 1.35 GiB | +792.7% | 14.745e9 | +6.5% |
| 4,096 | LanceStream | ok | 29.73s | baseline | 0.32 GiB | baseline | 63.509e9 | baseline |
| 4,096 | Lance non-stream | ok | 17.36s | -41.6% | 1.17 GiB | +269.9% | 62.694e9 | -1.3% |
| 4,096 | Faiss | ok | 544.94s | +1732.8% | 0.59 GiB | +85.8% | 60.843e9 | -4.2% |
| 4,096 | MiniBatch | ok | 446.67s | +1402.3% | 0.86 GiB | +171.4% | 61.763e9 | -2.8% |
| 4,096 | BICO-style | ok | 147.32s | +395.5% | 1.05 GiB | +230.4% | 65.588e9 | +3.3% |
| 4,096 | treeCoreset | ok | 176.30s | +492.9% | 3.49 GiB | +1003.3% | 67.386e9 | +6.1% |
| 16,384 | LanceStream | ok | 94.72s | baseline | 0.92 GiB | baseline | 228.083e9 | baseline |
| 16,384 | Lance non-stream | ok | 26.04s | -72.5% | 4.04 GiB | +341.3% | 228.268e9 | +0.1% |
| 16,384 | Faiss | timeout | 1h | n/a | 2.16 GiB | +135.9% | n/a | n/a |
| 16,384 | MiniBatch | timeout | 1h | n/a | 3.02 GiB | +230.5% | n/a | n/a |
| 16,384 | BICO-style | ok | 2705.57s | +2756.4% | 3.78 GiB | +313.4% | 230.613e9 | +1.1% |
| 16,384 | treeCoreset | ok | 2393.10s | +2426.5% | 12.32 GiB | +1245.9% | 237.508e9 | +4.1% |
| 65,536 | LanceStream | ok | 522.12s | baseline | 3.04 GiB | baseline | n/a | n/a |
| 65,536 | Lance non-stream | ok | 147.08s | -71.8% | 16.61 GiB | +446.0% | n/a | n/a |
| 65,536 | Faiss | skipped | n/a | n/a | n/a | n/a | n/a | n/a |
| 65,536 | MiniBatch | skipped | n/a | n/a | n/a | n/a | n/a | n/a |
| 65,536 | BICO-style | timeout | 1h | n/a | 7.29 GiB | +139.7% | n/a | n/a |
| 65,536 | treeCoreset | error 139 | 26.93s | n/a | 22.31 GiB | +633.4% | n/a | n/a |
| 131,072 | LanceStream | ok | 935.95s | baseline | 6.02 GiB | baseline | n/a | n/a |
| 131,072 | Lance non-stream | ok | 297.29s | -68.2% | 32.62 GiB | +442.0% | n/a | n/a |
| 131,072 | Faiss | skipped | n/a | n/a | n/a | n/a | n/a | n/a |
| 131,072 | MiniBatch | skipped | n/a | n/a | n/a | n/a | n/a | n/a |
| 131,072 | BICO-style | skipped | n/a | n/a | n/a | n/a | n/a | n/a |
| 131,072 | treeCoreset | skipped | n/a | n/a | n/a | n/a | n/a | n/a |

Large-k notes:

  • Faiss and MiniBatch both timed out at 16K, so 64K and 128K were skipped.
  • BICO-style completed 16K but took 45.1 minutes and timed out at 64K.
  • treeCoreset completed 16K but used 12.32 GiB RSS; at 64K it crashed with signal 11 after reaching 22.31 GiB RSS.
  • Lance non-stream is faster but scales memory with the full k * 256 raw sample; at 128K it used 32.62 GiB RSS on this 128D dataset.
  • LanceStream completed every tested k with much lower RSS than Lance non-stream. At 128K, default LanceStream used 6.02 GiB; the lower-memory stream=16, coreset=8 variant used 2.39 GiB.

Validation

  • cargo fmt --all --check
  • cargo check -p lance --locked
  • cargo test -p lance --lib test_split_ranges_by_row_count --locked on a clean temp worktree containing only the production ivf.rs change
  • git diff --check


@claude claude Bot left a comment


Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

github-actions Bot added the enhancement (New feature or request) and python labels May 22, 2026

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 72.26322% with 451 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/index/vector/ivf.rs | 71.50% | 402 Missing and 34 partials ⚠️ |
| rust/lance-index/src/vector/kmeans.rs | 83.33% | 11 Missing and 4 partials ⚠️ |


Xuanwo (Collaborator) left a comment


Nice change. One concern I have is that the training time increased a lot, to 4x. I hope we can open a follow-up issue to address it. I think it would be a good tradeoff if it only added 10% to 50%.
