Skip to content

fix(upsert): skip redundant writes on idempotent upsert_e and upsert_v#888

Open
keithf123r wants to merge 3 commits intoHelixDB:mainfrom
keithf123r:fix/upsert-noop-writes
Open

fix(upsert): skip redundant writes on idempotent upsert_e and upsert_v#888
keithf123r wants to merge 3 commits intoHelixDB:mainfrom
keithf123r:fix/upsert-noop-writes

Conversation

@keithf123r
Copy link
Copy Markdown
Contributor

@keithf123r keithf123r commented Mar 20, 2026

Description

upsert_n already skips the write when properties haven't changed. upsert_e and upsert_v were missing the same check, so repeated upserts with identical data still wrote to disk every time, bloating the database file over time.

This PR adds the same property-equality check to all upsert paths:

  • upsert_e: skip the write entirely when nothing changed.
  • upsert_v (Vector): same — skip the write entirely.
  • upsert_v (VectorNodeWithoutVectorData): skip the property merge, but still write because the vector data itself is being updated.

Checklist when merging to main

  • No compiler warnings (if applicable)
  • Code is formatted with rustfmt
  • No useless or dead code (if applicable)
  • Code is easy to understand
  • Doc comments are used for all functions, enums, structs, and fields (where appropriate)
  • All tests pass
  • Performance has not regressed (assuming change was not to fix a bug)
  • Version number has been updated in helix-cli/Cargo.toml and helixdb/Cargo.toml

Additional Notes

This PR adds the same property-equality check to all upsert paths:

  • upsert_e: skip the write entirely when nothing changed.
  • upsert_v (Vector): same — skip the write entirely.
  • upsert_v (VectorNodeWithoutVectorData): skip the property merge, but still write because the vector data itself is being updated.

Greptile Summary

This PR extends the idempotent-write optimisation (already present in upsert_n) to upsert_e and upsert_v, short-circuiting redundant LMDB writes when an upsert's incoming properties are identical to what is already stored. The logic is correct for the edge and the full-vector paths, but there is a meaningful asymmetry in how the three upsert_v variants handle embedding data that is worth resolving.

  • upsert_e — two new early returns: one for edge.properties == None && props.is_empty(), one for Some(old) when all incoming props already match old. Both correctly skip the edges_db.put call.
  • upsert_v (Vector) — mirrors the above for vectors. However, the Vector branch never sets vector.data = query, so a caller that passes a different embedding vector but unchanged properties will hit the early return and silently lose the embedding update. The VectorNodeWithoutVectorData branch avoids this by always writing (it sets vector.data = query first), which highlights the inconsistency.
  • upsert_v (VectorNodeWithoutVectorData) — correctly wraps only the property-merge logic in the skip guard, but unconditionally calls put_vector to persist updated embedding data.
  • Tests — four new tests cover the primary noop paths for edges and vectors. Missing: a test for VectorNodeWithoutVectorData (noop merge but still writes), and for Some(old) + empty props (vacuous-truth early return).

Important Files Changed

Filename Overview
helix-db/src/helix_engine/traversal_core/ops/util/upsert.rs Adds early-return guards to upsert_e and upsert_v (Vector) to skip redundant LMDB writes when properties are unchanged; the VectorNodeWithoutVectorData path correctly skips only the property merge but still writes for vector data. Main concern: the Vector variant early return silently skips put_vector even when query (new embedding data) differs from the stored vector.data, because vector.data = query is never set in that branch.
helix-db/src/helix_engine/tests/traversal_tests/upsert_tests.rs Adds four new tests covering the noop paths for upsert_e (identical props, empty props) and upsert_v (identical props, empty props). Tests are well-structured and correctly assert identity and property preservation. Missing coverage for: VectorNodeWithoutVectorData noop-property-merge-but-still-writes path, and Some(old) + empty props edge case for both ops.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[upsert_e / upsert_v called] --> B{Existing record found?}
    B -- No --> C[Create new record + write to LMDB]
    B -- Yes: Vector case --> D{vector.properties?}
    B -- Yes: Edge case --> E{edge.properties?}
    B -- Yes: VectorNodeWithoutVectorData --> VND[Set vector.data = query]

    D -- None --> D1{props.is_empty?}
    D1 -- Yes --> SKIP1[⏭ Early return — skip put_vector]
    D1 -- No --> D2[Add secondary indices + create props map]
    D2 --> WRITE1[put_vector]

    D -- Some old --> D3{all props match old?}
    D3 -- Yes --> SKIP2[⏭ Early return — skip put_vector]
    D3 -- No --> D4[Update secondary indices + merge props]
    D4 --> WRITE2[put_vector]

    E -- None --> E1{props.is_empty?}
    E1 -- Yes --> SKIP3[⏭ Early return — skip edges_db.put]
    E1 -- No --> E2[Create new props map]
    E2 --> WRITE3[edges_db.put]

    E -- Some old --> E3{all props match old?}
    E3 -- Yes --> SKIP4[⏭ Early return — skip edges_db.put]
    E3 -- No --> E4[Merge props]
    E4 --> WRITE4[edges_db.put]

    VND --> VND2{vector.properties?}
    VND2 -- None --> VND3{props.is_empty?}
    VND3 -- Yes --> VND_WRITE[put_vector — always writes]
    VND3 -- No --> VND4[Add secondary indices + create props map]
    VND4 --> VND_WRITE
    VND2 -- Some old --> VND5{all props match old?}
    VND5 -- Yes --> VND_WRITE
    VND5 -- No --> VND6[Update secondary indices + merge props]
    VND6 --> VND_WRITE

    style SKIP1 fill:#ffd700
    style SKIP2 fill:#ffd700
    style SKIP3 fill:#ffd700
    style SKIP4 fill:#ffd700
    style VND_WRITE fill:#90EE90
    style WRITE1 fill:#90EE90
    style WRITE2 fill:#90EE90
    style WRITE3 fill:#90EE90
    style WRITE4 fill:#90EE90
Loading

Last reviewed commit: "fix(upsert): skip re..."

Greptile also left 1 inline comment on this PR.

Comment on lines 563 to +609
@@ -590,6 +604,10 @@ impl<'db, 'arena, 'txn, I: Iterator<Item = Result<TraversalValue<'arena>, GraphE
vector.properties = Some(map);
}
Some(old) => {
if props.iter().all(|(k, v)| old.get(k) == Some(v)) {
return Ok(TraversalValue::Vector(vector));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Early return skips put_vector even when query data differs

In the Vector variant (Some(Ok(TraversalValue::Vector(...)))), when the early-return fires (either props.is_empty() with None properties, or all props match old), put_vector is skipped entirely. However, the query parameter — carrying potentially updated vector embedding data — is never applied to vector.data in this branch. This means a caller that passes a different query but identical properties will silently get a no-op: the stored vector data stays unchanged, with no error or indication.

By contrast, the VectorNodeWithoutVectorData path (lines ~762–908) correctly sets vector.data = query before writing, which is exactly why that path always calls put_vector even when properties are unchanged.

If the Vector path is intentionally designed to not update vector embedding data (i.e., query is used only for creation / VectorNodeWithoutVectorData), the early-return optimization is correct but the silent data-skipping is easy to misuse. Consider either:

  1. Setting vector.data = query before the early-return check (so changed embeddings are always persisted), or
  2. Adding a code comment explicitly documenting that the Vector path never updates vector.data, so the skip is safe.

The new tests (test_upsert_v_noop_when_all_properties_identical) only pass an identical query when testing the no-op path, so the case of "same props, different embedding" is not exercised.

Copy link
Copy Markdown
Contributor Author

@keithf123r keithf123r Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by design. 2 paths serve different purposes:

  • TraversalValue::Vector: the vector already has its data loaded. This path only updates props
  • VectorNodeWithoutVectorData: vector was loaded w/o its embedding data, so it explicitly sets vector.data = query to fill it in

early return doesnt change this. the query parameter was already being ignored

@keithf123r keithf123r changed the title fix(upsert): skip redundant LMDB writes on idempotent upsert_e and upsert_v fix(upsert): skip redundant writes on idempotent upsert_e and upsert_v Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant