Skip to content

PS-11020 [8.0]: MyRocks - concurrent ADD INDEX could pollute L0 or crash the server.#5921

Draft
polchawa-percona wants to merge 5 commits into
percona:8.0from
polchawa-percona:PS-11020
Draft

PS-11020 [8.0]: MyRocks - concurrent ADD INDEX could pollute L0 or crash the server.#5921
polchawa-percona wants to merge 5 commits into
percona:8.0from
polchawa-percona:PS-11020

Conversation

@polchawa-percona
Copy link
Copy Markdown

Problem:

Concurrent ADD INDEX could lead to L0 level being polluted with the bulk-loaded data that was supposed to go to LMAX. Moreover, if the user has set rocksdb_bulk_load_fail_if_not_bottommost_level session variable to true, the server could crash when concurrently running ADD INDEX.

The same issues occurred when ADD INDEX was running while a new table was concurrently created.

Solution:

Serialize bulk loads together with all index id allocations so they could ingest SST files only having a guarantee that all RocksDB keys are the highest when inserted (to avoid the risk of collision with overlapping key ranges in merged SST files that include both smaller and higher index id values).

Wait when ingesting bulk-loaded SST, or when creating a table, until all bulk loads for the smaller index id are finished.

… the server.

Problem:

Concurrent ADD INDEX could lead to L0 level being polluted with the bulk-loaded data that was supposed to go to LMAX.
Moreover, if the user has set rocksdb_bulk_load_fail_if_not_bottommost_level session variable to true, the server
could crash when concurrently running ADD INDEX.

The same issues occurred when ADD INDEX was running while a new table was concurrently created.

Solution:

Serialize bulk loads together with all index id allocations so they could ingest SST files only having guarantee that
all RocksDB keys are the highest when inserted (to avoid risk of collision with overlapping key ranges in merged SST
files that include both smaller and higher index id values).

Wait when ingesting bulk loaded SST, or when creating a table up, until all bulk loads for smaller index id are finished.
@polchawa-percona
Copy link
Copy Markdown
Author

polchawa-percona commented Apr 9, 2026

This is draft of PR - I still want to double check that I don’t need to remove index id in other cases than finish_bulk_load and I want to add some assertions.

@polchawa-percona polchawa-percona changed the title BUG PS-11020 MyRocks - concurrent ADD INDEX could pollute L0 or crash the server. PS-11020 [8.0]: MyRocks - concurrent ADD INDEX could pollute L0 or crash the server. Apr 9, 2026
@inikep
Copy link
Copy Markdown
Collaborator

inikep commented Apr 9, 2026

Commits reviewed:

  • 3b006d7a34831060fbbd4805b6623900eb4cdfb4 - Added MTR test
  • 1bdec346d3383e073aa102f56abd9a2083fce2a6 - BUG PS-11020 MyRocks - concurrent ADD INDEX could pollute L0 or crash the server.

Scope:

  • Static review only.
  • I did not build or run tests.

Findings

1. High: multi-index ALTER drops reservations for indexes that have not been populated yet

inplace_populate_sk() populates added secondary indexes one by one, but it passes the full indexes set into finish_bulk_load() on every iteration. finish_bulk_load() then waits on and removes reservations for every index in that set immediately after the first ingest.

That means ALTER TABLE ... ADD INDEX ..., ADD INDEX ... can clear the reservation for later indexes before their own bulk load starts. A concurrent table/index creation with a larger index_id can then proceed between the first and second index population steps, reintroducing the same ordering violation this patch is trying to prevent for the remaining indexes.

References:

  • storage/rocksdb/ha_rocksdb.cc:14905-14907
  • storage/rocksdb/ha_rocksdb.cc:3993-4009

2. High: reserved index ids are leaked when finish_bulk_load() has no work to ingest

The new reservation is removed only in the post-ingest path. However, finish_bulk_load() has multiple successful early returns before that block, including the m_curr_bulk_load.size() == 0 case and the sst_commit_list.size() == 0 case.

For ALTER TABLE ... ADD INDEX on an empty table, or any other path where no SST files are produced, the new index id remains reserved forever even if the DDL succeeds. Future DDL that gets a larger index_id will keep waiting on that stale reservation.

References:

  • storage/rocksdb/ha_rocksdb.cc:3588-3593
  • storage/rocksdb/ha_rocksdb.cc:3931-3951
  • storage/rocksdb/ha_rocksdb.cc:4004-4009
  • storage/rocksdb/ha_rocksdb.cc:8900-8915

3. Medium: rollback/error paths do not clear the new reservation state

On failure, commit_inplace_alter_table(commit=false) cleans up uncommitted keydefs and ongoing create-index metadata, but it never removes entries from m_index_id_reserved_for_create.

So any error after create_key_def() reserves the id, but before the successful ingest cleanup runs, leaves a stale in-memory reservation behind until restart. This can happen on allocation failures, scan errors, duplicate-key failures before ingest, or any other abort path before the post-ingest cleanup.

References:

  • storage/rocksdb/ha_rocksdb.cc:14986-15034
  • storage/rocksdb/ha_rocksdb.cc:8900-8915

4. Medium: the new MTR test is timing-based and misses the leaked-reservation cases

The new test replaces deterministic synchronization with --sleep 5, so it now depends on machine speed/load rather than a specific engine state. That makes it vulnerable to CI flakiness.

It also inserts a row specifically to make the new secondary index non-empty, so it does not cover the no-work reservation leak described above.

References:

  • mysql-test/suite/rocksdb/t/alter_table_add_secondary_key.test:39-91

- always reclaim index id reservation in finish_bulk_load
- finish_bulk_load is invoked within a for-loop (not outside)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants