
Implemented Merkle{Packer,Unpacker} for FileBackend#504

Closed
malcolmgreaves wants to merge 1 commit into mg/merkle_packing_interfaces from mg/merkle_pack_impls

Conversation

Collaborator

@malcolmgreaves malcolmgreaves commented Apr 29, 2026

MerkleTransport
Implemented the new MerkleTransport traits for packing up
and unpacking Merkle tree nodes as they're sent on the wire
from the oxen client to server. Preserves the existing tar-gz
formats specific to the FileBackend's physical store layout.

Supports the two unique tar-gz paths:

  • upload: only captures {prefix}/{suffix}/{node,children}
  • everything else: tree/nodes/{prefix}/{suffix}/{node,children}
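The two layouts differ only in whether the `tree/nodes/` prefix is already present in the entry path. A minimal sketch of the resolution rule, using only the standard library (the helper name `resolve_entry_path` is illustrative, not the PR's actual API):

```rust
use std::path::{Path, PathBuf};

/// Map a tar entry path to its destination under the repo's hidden dir.
/// Server-style entries already carry `tree/nodes/`; legacy upload
/// entries start at `{prefix}/{suffix}/...` and need it prepended.
fn resolve_entry_path(oxen_hidden: &Path, entry: &Path) -> PathBuf {
    let tree_nodes_prefix = Path::new("tree").join("nodes");
    if entry.starts_with(&tree_nodes_prefix) {
        oxen_hidden.join(entry)
    } else {
        oxen_hidden.join(&tree_nodes_prefix).join(entry)
    }
}

fn main() {
    let hidden = Path::new(".oxen");
    // Both layouts resolve to the same on-disk location.
    println!("{}", resolve_entry_path(hidden, Path::new("ab/cdef/node")).display());
    println!("{}", resolve_entry_path(hidden, Path::new("tree/nodes/ab/cdef/node")).display());
}
```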

Progress Bar in Bytes
Changes the progress bar for uploads & downloads to use bytes/s
instead of nodes/s. Nodes are not evenly sized, so progress will
look like it stops when handling a large file. This now hooks
into the reader & writer types that the trait uses.
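One way to hook byte counting into a reader can be sketched like this; `CountingReader` is a hypothetical stand-in (std-only) for the reader/writer hooks the PR wires into the progress bar:

```rust
use std::io::{self, Read};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Wrap any reader and tally bytes as they pass through, so a progress
/// bar can tick in bytes rather than nodes.
struct CountingReader<R> {
    inner: R,
    bytes: Arc<AtomicU64>,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.bytes.fetch_add(n as u64, Ordering::Relaxed);
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let counter = Arc::new(AtomicU64::new(0));
    let payload: &[u8] = b"hello merkle"; // stands in for a tar-gz stream
    let mut reader = CountingReader { inner: payload, bytes: Arc::clone(&counter) };
    let mut sink = Vec::new();
    reader.read_to_end(&mut sink)?;
    // The progress bar would poll `counter` while the transfer runs.
    println!("{} bytes transferred", counter.load(Ordering::Relaxed));
    Ok(())
}
```

Because the count advances with every `read` call, progress moves smoothly even while a single large node is in flight.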

```rust
    })
}

/// Gunzip + collect tar entries into a deterministic map for byte-compat comparison.
```
Collaborator Author


All test code helpers from here down. This is quite verbose, but the point is to extract the existing logic for Merkle tree node packing & unpacking on main and preserve that here. This way, we can run both the copied old code as well as the newly refactored code and prove that they behave the same.
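The comparison strategy described above can be sketched as follows: unpack each archive (old packer vs. refactored packer) into a path-to-bytes map, then compare entry-for-entry. A `BTreeMap` keeps iteration deterministic. The gunzip/tar walking is elided, and `assert_byte_compatible` is an illustrative name, not the PR's helper:

```rust
use std::collections::BTreeMap;

/// Compare two unpacked archives entry-for-entry. Keys are entry paths,
/// values are the raw bytes of each entry.
fn assert_byte_compatible(
    old: &BTreeMap<String, Vec<u8>>,
    new: &BTreeMap<String, Vec<u8>>,
) -> Result<(), String> {
    if old.keys().ne(new.keys()) {
        return Err("archives contain different entry paths".to_string());
    }
    for (path, old_bytes) in old {
        if &new[path] != old_bytes {
            return Err(format!("entry {path} differs between old and new packers"));
        }
    }
    Ok(())
}

fn main() {
    let mut old = BTreeMap::new();
    old.insert("ab/cdef/node".to_string(), vec![0x01, 0x02]);
    let new = old.clone();
    assert!(assert_byte_compatible(&old, &new).is_ok());
}
```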

Contributor

coderabbitai Bot commented Apr 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9700ebaa-17e4-4e27-ac6d-9125f32cb2a6

📥 Commits

Reviewing files that changed from the base of the PR and between f7406e4 and 7d78e5c.

📒 Files selected for processing (4)
  • .gitignore
  • Cargo.toml
  • crates/lib/src/core/db/merkle_node/file_backend.rs
  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs
🚧 Files skipped from review as they are similar to previous changes (4)
  • .gitignore
  • Cargo.toml
  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs
  • crates/lib/src/core/db/merkle_node/file_backend.rs

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Merkle node transport now supports tar-gz serialization with configurable compression levels
    • Added progress tracking capability for merkle operations
  • Bug Fixes

    • Enhanced security by preventing path traversal during archive extraction
    • Improved error diagnostics for archive validation and structure issues
    • Added strict validation for archive integrity and node identification

Walkthrough

Implements tar-gz Merkle node pack/unpack in FileBackend with legacy and server wire layouts, adds precise MerkleDbError variants, enables additional tokio-util features, and updates .gitignore to ignore .claude/scheduled_tasks.lock.

Changes

Cohort / File(s): Summary

Config / Dependencies (.gitignore, Cargo.toml):
Added .claude/scheduled_tasks.lock to gitignore; expanded workspace tokio-util features to include io and io-util.

Merkle transport implementation (crates/lib/src/core/db/merkle_node/file_backend.rs):
Large addition: FileBackend now implements MerklePacker and MerkleUnpacker using tar-gz serialization, supports two historical wire layouts (different gzip levels), validates tar entries (rejects path traversal and unsupported types), stages VFS repos to temp dirs, and exposes pack_nodes_byte_estimate. Extensive tests added.

Error types (crates/lib/src/core/db/merkle_node/merkle_node_db.rs):
Added multiple MerkleDbError variants for transport/tar validation and IO errors (FsTransport, CannotReadMerkle, UnsupportedTarEntry, PathTraversal, InvalidTarStructure, InvalidNodeIdHex, MissingNodeDir, MissingTreeNodesDir).
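A sketch of what a few of the listed variants might look like; the payload types and `Display` messages here are assumptions for illustration (the real crate likely derives them, e.g. via `thiserror`):

```rust
use std::fmt;

/// Illustrative subset of the new MerkleDbError variants named above.
#[derive(Debug)]
enum MerkleDbError {
    PathTraversal(String),
    UnsupportedTarEntry { path: String },
    InvalidNodeIdHex(String),
    MissingNodeDir(String),
}

impl fmt::Display for MerkleDbError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::PathTraversal(p) => write!(f, "tar entry escapes repo root: {p}"),
            Self::UnsupportedTarEntry { path } => write!(f, "unsupported tar entry type at: {path}"),
            Self::InvalidNodeIdHex(h) => write!(f, "node id is not valid hex: {h}"),
            Self::MissingNodeDir(p) => write!(f, "expected node dir is missing: {p}"),
        }
    }
}

fn main() {
    let err = MerkleDbError::PathTraversal("../../etc/passwd".to_string());
    println!("{err}");
}
```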

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant FileBackend
    participant TarBuilder
    participant Compressor
    participant FileSystem

    Caller->>FileBackend: pack_nodes(repo, hashes, options)
    FileBackend->>FileBackend: check node dirs & options
    loop per hash
        FileBackend->>FileBackend: read node files/metadata
        FileBackend->>TarBuilder: add_entry(path, content)
    end
    FileBackend->>TarBuilder: finalize()
    alt gzip layout
        TarBuilder->>Compressor: compress(stream)
        Compressor-->>FileBackend: compressed_stream
    else legacy layout (different gzip level)
        TarBuilder->>Compressor: compress(stream, level)
        Compressor-->>FileBackend: compressed_stream
    end
    FileBackend->>FileSystem: write output
    FileSystem-->>FileBackend: ok
    FileBackend-->>Caller: packed_artifact
```
```mermaid
sequenceDiagram
    participant Caller
    participant FileBackend
    participant Validator
    participant Extractor
    participant VFSStaging
    participant FileSystem

    Caller->>FileBackend: unpack_nodes(archive, repo, options)
    FileBackend->>FileBackend: decompress(archive)
    loop for each tar entry
        FileBackend->>Validator: validate_entry(path, type)
        Validator-->>FileBackend: ok / reject
        alt valid
            FileBackend->>Extractor: extract(entry)
            alt VFS repo
                Extractor->>VFSStaging: stage to temp
                VFSStaging->>FileSystem: copy into VFS store
            else standard repo
                Extractor->>FileSystem: write to repo path
            end
        else rejected
            FileBackend-->>Caller: error (e.g., PathTraversal/UnsupportedTarEntry)
        end
    end
    FileSystem-->>FileBackend: all_written
    FileBackend-->>Caller: extracted_hashes
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • gschoeni
  • subygan

Poem

🐰 I hop through tarballs, gzip in tow,
Packing hashes row by row,
I guard against .. and broken names,
Stage VFS safely, no sneaky games.
A byte-estimate hums — off we go! 🥕✨

🚥 Pre-merge checks: ✅ 5 passed
  • Title check: ✅ Passed. The title directly summarizes the main changes: implementing MerklePacker and MerkleUnpacker traits for FileBackend, which is the primary focus of the changeset.
  • Description check: ✅ Passed. The description is directly related to the changeset, explaining the MerkleTransport trait implementations and progress reporting changes that align with the code modifications.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/lib/src/core/db/merkle_node/file_backend.rs`:
- Around line 377-389: The code currently calls file.unpack(&dst_path) and only
afterwards calls extract_hash_from_entry_path(&dst_path, oxen_hidden), which
lets malformed entries be written and causes skipped-existing entries to not
report their parsed hash; move the call to extract_hash_from_entry_path before
any side-effecting operations (before the exists/overwrite_existing check and
before file.unpack), so validation/classification runs first and returns or
records InvalidTarStructure/InvalidNodeIdHex errors as required, and when
extract_hash_from_entry_path yields a hash still allow the subsequent existence
check (using dst_path and overwrite_existing) to skip unpack but still include
the parsed hash in the returned set.
- Around line 355-373: The code currently only rejects ParentDir components but
still allows absolute or Windows-prefixed paths which will make dst_path escape
oxen_hidden; update the validation on the local variable path (used to compute
dst_path) to also reject absolute/root/prefix components (e.g.,
Component::RootDir and Component::Prefix) and any path where path.is_absolute()
or the first component is RootDir/Prefix, returning MerkleDbError::PathTraversal
or a new appropriate error; keep the existing ParentDir check and then perform
that extra guard before computing dst_path (the oxen_hidden, tree_nodes_prefix,
and dst_path logic should remain unchanged once the path is verified).
- Around line 349-354: The loop over tar entries in unpack() currently logs and
continues on an unreadable entry leading to partial installs; instead propagate
the error immediately: change the `for entry in entries { let Ok(mut file) =
entry else { ... continue; } }` handling to return an Err constructed from the
entry's error (or map it to the function's error type) rather than calling
log::error and continuing, so unpack() fails fast on corrupt tar entries and the
caller receives the failure.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1f5a48c2-b500-4b25-8ee1-7c341da833bb

📥 Commits

Reviewing files that changed from the base of the PR and between b37ba5f and f7406e4.

📒 Files selected for processing (4)
  • .gitignore
  • Cargo.toml
  • crates/lib/src/core/db/merkle_node/file_backend.rs
  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs

Comment on lines +349 to +354

```rust
for entry in entries {
    let Ok(mut file) = entry else {
        log::error!("Could not unpack file in merkle tar archive");
        // TODO: raise this error to the caller instead!?
        continue;
    };
```
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on unreadable tar entries.

If one entry is corrupt, this currently logs and continues, so unpack() can return Ok(...) after a partial install. That makes archive corruption look successful.

Suggested change

```diff
-        let Ok(mut file) = entry else {
-            log::error!("Could not unpack file in merkle tar archive");
-            // TODO: raise this error to the caller instead!?
-            continue;
-        };
+        let mut file = entry.map_err(MerkleDbError::CannotReadMerkle)?;
```

Comment on lines +355 to +373

```rust
let path = file.path()?.into_owned();
// Path-traversal guard: refuse any entry whose path resolves above its container.
if path.components().any(|c| matches!(c, Component::ParentDir)) {
    return Err(MerkleDbError::PathTraversal(path.display().to_string()));
}
// Entry-type validation: only regular files and directories are allowed.
let entry_type = file.header().entry_type();
if !entry_type.is_file() && !entry_type.is_dir() {
    return Err(MerkleDbError::UnsupportedTarEntry {
        path: path.display().to_string(),
    });
}
// Server-style entries already contain `tree/nodes/...`; join directly.
// Legacy client-push entries begin at `{prefix}/{suffix}/...`; prepend `tree/nodes/`.
let dst_path = if path.starts_with(&tree_nodes_prefix) {
    oxen_hidden.join(&path)
} else {
    oxen_hidden.join(&tree_nodes_prefix).join(&path)
};
```
Contributor


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Reject absolute and prefixed tar paths before join().

The current guard only blocks `..` (ParentDir) components. An entry like `/tmp/pwn` or a Windows-prefixed path will cause `PathBuf::join` to discard `oxen_hidden` and unpack outside the repo.
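The escape happens because `Path::join` replaces the base entirely when given an absolute path. A quick std-only demonstration (Unix-style paths):

```rust
use std::path::{Path, PathBuf};

fn main() {
    let oxen_hidden = Path::new("repo/.oxen");
    // A relative entry stays under the base, as expected...
    assert_eq!(
        oxen_hidden.join("ab/cd/node"),
        PathBuf::from("repo/.oxen/ab/cd/node")
    );
    // ...but an absolute entry REPLACES the base entirely, so an archive
    // entry like `/tmp/pwn` would be written outside the repo.
    assert_eq!(oxen_hidden.join("/tmp/pwn"), PathBuf::from("/tmp/pwn"));
    println!("join() silently discards the base for absolute paths");
}
```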

Suggested change

```diff
-        if path.components().any(|c| matches!(c, Component::ParentDir)) {
+        if path.components().any(|c| {
+            matches!(
+                c,
+                Component::ParentDir | Component::RootDir | Component::Prefix(_)
+            )
+        }) {
             return Err(MerkleDbError::PathTraversal(path.display().to_string()));
         }
```

Comment on lines +377 to +389

```rust
if dst_path.exists() && !overwrite_existing {
    log::info!("Node already exists at {dst_path:?}, skipping");
    continue;
}
file.unpack(&dst_path)?;

// Extract the merkle hash from this entry's path, if it identifies one.
//
// After the path-resolution above, `dst_path` is of the form
// `<oxen_hidden>/tree/nodes/<rest>`. We classify entries by the SHAPE
// of `<rest>`, never by whether components happen to be hex. We assume that
// we have the hex-encoded hash as the `{prefix}/{suffix}` dirs.
if let Some(hash) = extract_hash_from_entry_path(&dst_path, oxen_hidden)? {
```
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate and classify the entry before any skip/write side effects.

extract_hash_from_entry_path() is the only place that records parsed hashes and rejects InvalidTarStructure / InvalidNodeIdHex, but it runs only after the SkipExisting early return and after file.unpack(&dst_path). That means malformed entries can already be written to disk before unpack() returns Err, and duplicate entries are omitted from the returned hash set even though the unpack contract says parsed hashes should still be reported.
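The reordered control flow can be sketched like this (std-only; `extract_hash_from_entry_path` and `unpack_entry` are stubs standing in for the PR's helpers, intended only to show the ordering):

```rust
use std::collections::HashSet;
use std::path::Path;

/// Stub: the real helper validates the `{prefix}/{suffix}` shape and hex
/// encoding; here it just takes the final path component as the "hash".
fn extract_hash_from_entry_path(dst: &Path) -> Result<Option<String>, String> {
    Ok(dst.file_name().and_then(|n| n.to_str()).map(|s| s.to_string()))
}

/// Per-entry flow with validation hoisted above all side effects.
fn unpack_entry(
    dst: &Path,
    already_exists: bool,
    overwrite_existing: bool,
    hashes: &mut HashSet<String>,
) -> Result<(), String> {
    // 1. Classify/validate FIRST: malformed entries error out here and
    //    never touch disk.
    let hash = extract_hash_from_entry_path(dst)?;
    // 2. Record the parsed hash even when the write below is skipped, so
    //    skipped-existing entries still appear in the returned set.
    if let Some(h) = hash {
        hashes.insert(h);
    }
    // 3. Only now perform the side effect (`file.unpack` in the real code).
    if already_exists && !overwrite_existing {
        return Ok(()); // skip the write; hash is already recorded
    }
    // file.unpack(&dst)? would go here
    Ok(())
}

fn main() {
    let mut hashes = HashSet::new();
    unpack_entry(Path::new("tree/nodes/ab/cd/node"), true, false, &mut hashes).unwrap();
    // The skipped-existing entry's hash is still reported.
    assert!(hashes.contains("node"));
}
```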


@malcolmgreaves
Collaborator Author

STACKED PR: Do not merge until #502 has been merged.

