feat(index): add streaming ivf kmeans training by BubbleCal · Pull Request #6913 · lance-format/lance

BubbleCal · 2026-05-22T08:52:52Z

Summary

Adds LanceStream, a streaming IVF kmeans training path that keeps peak raw-vector training memory bounded for very large IVF partition counts while preserving the existing non-streaming trainer when streaming is not requested.

The latest update also bounds the sampling read buffer: fixed sampled ranges are split into smaller range reads and take_scan readahead defaults to one batch. This avoids the previous RSS spike from large contiguous sampled ranges while keeping an environment override for benchmark tuning.

Feature

Adds streaming_sample_rate, streaming_coreset_rate, and streaming_refine_passes to IVF build parameters.
Exposes the new parameters through the Python index creation path.
Keeps num_partitions > 256 on hierarchical kmeans; large-k streaming does not fall back to flat kmeans.
Uses fixed sampled row ranges for streaming training so coreset construction and optional refine passes read the same sample.
Splits fixed sample ranges into bounded sub-ranges before take_scan; default range chunk is 8192 rows.
Adds LANCE_STREAMING_IVF_PREFETCH_DEPTH and LANCE_STREAMING_IVF_TAKE_RANGE_ROWS for experimental tuning. Default prefetch depth is 1 because deeper prefetch gave minimal speedup and higher RSS.
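
The bounded sub-range splitting described above can be illustrated with a minimal sketch. This is not the Rust code in ivf.rs: the function name is borrowed from the test named in the Validation section, while its signature, the half-open `(start, end)` range representation, and the example ranges are assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

# Assumption: a sampled range is a half-open [start, end) pair of row offsets.
Range = Tuple[int, int]


def split_ranges_by_row_count(ranges: List[Range], max_rows: int) -> Iterator[Range]:
    """Yield sub-ranges of at most `max_rows` rows, preserving the input order."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    for start, end in ranges:
        while start < end:
            stop = min(start + max_rows, end)
            yield (start, stop)
            start = stop


# Hypothetical sample plan: one large contiguous range and one small one.
sampled = [(0, 20_000), (50_000, 50_100)]
chunks = list(split_ranges_by_row_count(sampled, max_rows=8192))
```

With the default chunk of 8192 rows, the 20,000-row range becomes three reads and the 100-row range stays as one, so no single read is larger than the chunk size.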

Algorithm

The existing IVF trainer samples up to num_partitions * sample_rate raw vectors before training. For very large num_partitions, that materializes too many raw vectors at once. LanceStream decouples total sample size from peak raw-vector memory.

The streaming path works as follows:

Build a fixed sample plan as sorted random row ranges, bounded by num_partitions * sample_rate and dataset row count.
Split those ranges into training chunks with at most num_partitions * streaming_sample_rate raw vectors.
For each chunk, load only the current raw vectors, train a local kmeans summary, assign raw vectors to local centroids, and record weighted local centroids plus local loss contribution.
Merge weighted local summaries into a global coreset bounded by num_partitions * streaming_coreset_rate.
Train final IVF centroids with weighted hierarchical kmeans over the coreset. The hierarchical trainer starts with a bounded fanout and repeatedly splits high-loss/high-weight partitions until it reaches num_partitions.
Run a short weighted Lloyd refinement over the coreset centroids.
Optionally run streaming_refine_passes raw-vector Lloyd passes over the same fixed sample. Each pass streams chunks, assigns raw vectors to current centroids, accumulates global sums/counts/loss, updates non-empty centroids at pass end, and preserves old centroids for empty clusters.
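
As a rough illustration of the optional raw-vector refine pass (the last step above), the following sketch streams chunks, accumulates per-centroid sums, counts, and loss, and keeps the old centroid when a cluster receives no vectors. It is a pure-Python stand-in under stated assumptions: `refine_pass`, `squared_l2`, and the toy data are hypothetical names and values, not the PR's implementation.

```python
from typing import Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def refine_pass(
    chunks: Iterable[List[Vector]], centroids: List[List[float]]
) -> Tuple[List[List[float]], float]:
    """Run one Lloyd pass over streamed chunks; return (new_centroids, loss)."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    loss = 0.0
    for chunk in chunks:  # only the current chunk of raw vectors is held
        for v in chunk:
            dists = [squared_l2(v, c) for c in centroids]
            best = min(range(k), key=dists.__getitem__)
            loss += dists[best]
            counts[best] += 1
            for d in range(dim):
                sums[best][d] += v[d]
    # Update non-empty clusters at pass end; preserve old centroids otherwise.
    new_centroids = [
        [s / counts[i] for s in sums[i]] if counts[i] > 0 else list(centroids[i])
        for i in range(k)
    ]
    return new_centroids, loss


# Toy data: two chunks, three centroids, the third far from every vector.
chunks = [[[0.0, 0.0], [2.0, 0.0]], [[10.0, 0.0]]]
centroids = [[1.0, 1.0], [9.0, 0.0], [100.0, 100.0]]
new_centroids, loss = refine_pass(chunks, centroids)
```

The accumulators are sized by the number of centroids and the dimension, not by the sample size, which is why the pass can cover the whole fixed sample while holding one chunk at a time.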

Peak raw-vector memory is bounded by:

num_partitions * streaming_sample_rate * dimension * sizeof(float32)

The coreset budget is separately bounded by:

num_partitions * streaming_coreset_rate

Centroid and accumulator buffers still scale with num_partitions * dimension.
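
To make the bound concrete, here is a small worked calculation using the largest benchmark configuration in this PR (num_partitions = 131,072, 128-dimensional float32 vectors, streaming_sample_rate = 64, sample_rate = 256). The helper name is hypothetical, and the figures are bounds on raw-vector buffers only, not predictions of process RSS.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
F32_BYTES = 4  # sizeof(float32)


def raw_vector_bytes(num_partitions: int, rate: int, dim: int) -> int:
    """Bytes needed to hold num_partitions * rate float32 vectors."""
    return num_partitions * rate * dim * F32_BYTES


k, dim = 131_072, 128
streaming_gib = raw_vector_bytes(k, 64, dim) / GIB   # streaming chunk bound
full_gib = raw_vector_bytes(k, 256, dim) / GIB       # full k * sample_rate sample
centroid_mib = k * dim * F32_BYTES / MIB             # one centroid buffer

print(streaming_gib, full_gib, centroid_mib)  # prints 4.0 16.0 64.0
```

Under these assumptions the streaming chunk holds at most 4 GiB of raw vectors, compared with 16 GiB for the full sample, while a single centroid buffer is 64 MiB.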

Benchmarks

Setup

VM: GCP c2-standard-16, 16 vCPU, ~64 GiB RAM, AVX512, 500 GB disk.
Dataset: real SIFT vectors, 128-dimensional float32.
Dataset choice: k=1024 on SIFT1M subset; k>=4096 on SIFT1B prefix.
Evaluation set: the first k * 256 rows, used for unified exact-loss evaluation. The training sample size is also k * 256.
LanceStream defaults in this run: stream=64, coreset=16, prefetch=1, refine=0, sample_rate=256, max_iters=50.
Tuned LanceStream variants only changed stream, coreset, and prefetch_depth.
Timeout: 1 hour per run.

Loss Methodology

All comparable losses below are computed outside the training algorithm:

Save trained centroids to f32 files.
Read the first k * 256 vectors from the dataset.
Compute exact kmeans objective with faiss.IndexFlatL2:

sum_x min_c ||x - c||^2

No table below uses algorithm-internal reported loss for comparison. Exact loss is reported for k <= 16,384. For k=65,536 and 131,072, exact loss was not run because one evaluation is roughly O(k * k * 256) distance work; those rows report train time/RSS/status only.
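
For readers who want to see the objective spelled out, the sketch below computes the same quantity by brute force in pure Python. The PR itself uses faiss.IndexFlatL2 for this; the function name and toy data here are assumptions, and this version is only practical for small inputs.

```python
from typing import Sequence


def exact_kmeans_loss(
    vectors: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]
) -> float:
    """Return sum over x of min over c of ||x - c||^2, by exhaustive search."""
    total = 0.0
    for x in vectors:
        total += min(
            sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids
        )
    return total


# Toy check: the middle vector is 25 away (squared) from its nearest centroid.
vectors = [[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]]
centroids = [[0.0, 0.0], [10.0, 0.0]]
toy_loss = exact_kmeans_loss(vectors, centroids)
```

Each vector is compared against every centroid, which is the k * k * 256 distance cost mentioned above when the evaluation set has k * 256 vectors.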

LanceStream Tuning

Percentages are relative to LanceStream stream=64, coreset=16, prefetch=1 at the same k. In the config column, s = stream, c = coreset, and p = prefetch depth.

k	config	status	train time	time delta	RSS	RSS delta	exact loss	loss delta
1,024	s64 c16 p1	ok	10.38s	baseline	0.15 GiB	baseline	13.839e9	baseline
1,024	s64 c16 p16	ok	10.04s	-3.2%	0.21 GiB	+40.0%	13.823e9	-0.1%
1,024	s64 c8 p1	ok	5.39s	-48.0%	0.15 GiB	+0.9%	13.971e9	+1.0%
4,096	s64 c16 p1	ok	29.73s	baseline	0.32 GiB	baseline	63.509e9	baseline
4,096	s64 c16 p16	ok	27.58s	-7.2%	0.63 GiB	+98.3%	63.516e9	+0.0%
4,096	s64 c8 p1	ok	15.52s	-47.8%	0.29 GiB	-8.8%	64.025e9	+0.8%
16,384	s64 c16 p1	ok	94.72s	baseline	0.92 GiB	baseline	228.083e9	baseline
16,384	s64 c16 p16	ok	95.40s	+0.7%	1.24 GiB	+35.9%	225.302e9	-1.2%
16,384	s64 c8 p1	ok	58.07s	-38.7%	0.80 GiB	-12.4%	241.662e9	+6.0%
65,536	s64 c16 p1	ok	522.12s	baseline	3.04 GiB	baseline	n/a	n/a
65,536	s64 c16 p16	ok	513.28s	-1.7%	3.52 GiB	+15.9%	n/a	n/a
65,536	s64 c8 p1	ok	304.78s	-41.6%	2.73 GiB	-10.3%	n/a	n/a
131,072	s64 c16 p1	ok	935.95s	baseline	6.02 GiB	baseline	n/a	n/a
131,072	s64 c16 p16	ok	920.70s	-1.6%	6.42 GiB	+6.7%	n/a	n/a
131,072	s64 c8 p1	ok	619.45s	-33.8%	5.33 GiB	-11.4%	n/a	n/a
131,072	s16 c8 p1	ok	1022.17s	+9.2%	2.39 GiB	-60.2%	n/a	n/a

Takeaways:

Increasing prefetch_depth is not a good default. At 128K, p16 was only 1.6% faster than p1 but used 6.7% more RSS. At smaller k it often increased RSS much more.
Lowering coreset from 16 to 8 is the strongest speed/RSS knob, but it can raise loss. At 16K it was 38.7% faster and 12.4% lower RSS, but loss was 6.0% worse.
Smaller stream lowers RSS but does not necessarily improve time. At 128K, stream=16, coreset=8 used only 2.39 GiB but took 17.0 minutes.
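
The deltas quoted in these takeaways can be recomputed from the table. The sketch below does this for the 16,384 row with coreset=8, using the rounded values as printed; the helper name is hypothetical. Because the table's deltas were presumably derived from unrounded measurements, recomputing from rounded cells can differ in the last digit.

```python
def pct_delta(value: float, baseline: float) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# k = 16,384: s64 c8 p1 versus the s64 c16 p1 baseline.
time_delta = pct_delta(58.07, 94.72)          # train time in seconds
loss_delta = pct_delta(241.662e9, 228.083e9)  # exact loss

print(f"{time_delta:+.1f}% time, {loss_delta:+.1f}% loss")
```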

Algorithm Comparison

Percentages are relative to LanceStream stream=64, coreset=16, prefetch=1 at the same k.

k	algorithm	status	train time	time delta	RSS	RSS delta	exact loss	loss delta
1,024	LanceStream	ok	10.38s	baseline	0.15 GiB	baseline	13.839e9	baseline
1,024	Lance non-stream	ok	1.83s	-82.4%	0.31 GiB	+102.5%	14.154e9	+2.3%
1,024	Faiss	ok	33.51s	+223.0%	0.20 GiB	+29.4%	13.279e9	-4.1%
1,024	MiniBatch	ok	31.75s	+205.9%	0.32 GiB	+111.2%	13.494e9	-2.5%
1,024	BICO-style	ok	34.55s	+233.0%	0.53 GiB	+252.7%	14.307e9	+3.4%
1,024	treeCoreset	ok	13.77s	+32.7%	1.35 GiB	+792.7%	14.745e9	+6.5%
4,096	LanceStream	ok	29.73s	baseline	0.32 GiB	baseline	63.509e9	baseline
4,096	Lance non-stream	ok	17.36s	-41.6%	1.17 GiB	+269.9%	62.694e9	-1.3%
4,096	Faiss	ok	544.94s	+1732.8%	0.59 GiB	+85.8%	60.843e9	-4.2%
4,096	MiniBatch	ok	446.67s	+1402.3%	0.86 GiB	+171.4%	61.763e9	-2.8%
4,096	BICO-style	ok	147.32s	+395.5%	1.05 GiB	+230.4%	65.588e9	+3.3%
4,096	treeCoreset	ok	176.30s	+492.9%	3.49 GiB	+1003.3%	67.386e9	+6.1%
16,384	LanceStream	ok	94.72s	baseline	0.92 GiB	baseline	228.083e9	baseline
16,384	Lance non-stream	ok	26.04s	-72.5%	4.04 GiB	+341.3%	228.268e9	+0.1%
16,384	Faiss	timeout	1h	n/a	2.16 GiB	+135.9%	n/a	n/a
16,384	MiniBatch	timeout	1h	n/a	3.02 GiB	+230.5%	n/a	n/a
16,384	BICO-style	ok	2705.57s	+2756.4%	3.78 GiB	+313.4%	230.613e9	+1.1%
16,384	treeCoreset	ok	2393.10s	+2426.5%	12.32 GiB	+1245.9%	237.508e9	+4.1%
65,536	LanceStream	ok	522.12s	baseline	3.04 GiB	baseline	n/a	n/a
65,536	Lance non-stream	ok	147.08s	-71.8%	16.61 GiB	+446.0%	n/a	n/a
65,536	Faiss	skipped	n/a	n/a	n/a	n/a	n/a	n/a
65,536	MiniBatch	skipped	n/a	n/a	n/a	n/a	n/a	n/a
65,536	BICO-style	timeout	1h	n/a	7.29 GiB	+139.7%	n/a	n/a
65,536	treeCoreset	error 139	26.93s	n/a	22.31 GiB	+633.4%	n/a	n/a
131,072	LanceStream	ok	935.95s	baseline	6.02 GiB	baseline	n/a	n/a
131,072	Lance non-stream	ok	297.29s	-68.2%	32.62 GiB	+442.0%	n/a	n/a
131,072	Faiss	skipped	n/a	n/a	n/a	n/a	n/a	n/a
131,072	MiniBatch	skipped	n/a	n/a	n/a	n/a	n/a	n/a
131,072	BICO-style	skipped	n/a	n/a	n/a	n/a	n/a	n/a
131,072	treeCoreset	skipped	n/a	n/a	n/a	n/a	n/a	n/a

Large-k notes:

Faiss and MiniBatch both timed out at 16K, so 64K and 128K were skipped.
BICO-style completed 16K but took 45.1 minutes and timed out at 64K.
treeCoreset completed 16K but used 12.32 GiB RSS; at 64K it crashed with signal 11 after reaching 22.31 GiB RSS.
Lance non-stream is faster but scales memory with the full k * 256 raw sample; at 128K it used 32.62 GiB RSS on this 128D dataset.
LanceStream completed every tested k with much lower RSS than Lance non-stream. At 128K, default LanceStream used 6.02 GiB; the lower-memory stream=16, coreset=8 variant used 2.39 GiB.

Validation

cargo fmt --all --check
cargo check -p lance --locked
cargo test -p lance --lib test_split_ranges_by_row_count --locked (run on a clean temp worktree containing only the production ivf.rs change)
git diff --check

codecov · 2026-05-22T09:38:03Z

Codecov Report

❌ Patch coverage is 72.26322% with 451 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/index/vector/ivf.rs	71.50%	402 Missing and 34 partials ⚠️
rust/lance-index/src/vector/kmeans.rs	83.33%	11 Missing and 4 partials ⚠️

Xuanwo

Nice change. One concern I have is that training time increased a lot, to about 4x. I hope we can open a follow-up issue to address it. I think it would be a good tradeoff if it only added 10% to 50%.

feat(index): add streaming ivf kmeans training

ed5b3be

claude Bot reviewed May 22, 2026

github-actions Bot added the enhancement and python labels May 22, 2026

chore: merge main into streaming ivf kmeans branch

6c5574d

BubbleCal added 2 commits May 22, 2026 19:06

fix(index): simplify kmeans helper return types

0b1c7d8

fix(index): satisfy streaming ivf clippy lints

ac0e6cb

BubbleCal requested review from Xuanwo, westonpace and wjones127 May 26, 2026 03:12

Xuanwo approved these changes May 26, 2026

BubbleCal added 7 commits May 26, 2026 18:40

perf(index): skip streaming ivf loss pass

6ecfb20

perf(index): sample streaming ivf training by ranges

ee1970f

perf(index): bound streaming ivf sample reads

f0e3ea7

perf(index): throttle streaming ivf progress updates

6112cba

perf(index): skip unused kmeans step loss

7f5fbac

perf(index): reduce streaming ivf progress frequency

bfe6da8

perf(index): reuse streaming coreset materialization

3ffe1f1

BubbleCal wants to merge 11 commits into main from yang/streaming-ivf-kmeans-training
