
Implement Merkle{Packer,Unpacker} for FileBackend #515

Open
malcolmgreaves wants to merge 1 commit into mg/merkle_dyn_pack_interfaces from mg/merkle_dyn_pack_impls

Conversation

@malcolmgreaves
Collaborator

MerklePacker & MerkleUnpacker
Implemented the new MerkleTransport traits for packing and unpacking Merkle tree nodes as they are sent over the wire from the oxen client to the server. Preserves the existing tar-gz formats specific to the FileBackend's physical store layout.

Supports the two distinct tar-gz layouts:

  • upload: captures only {prefix}/{suffix}/{node,children}
  • everything else: tree/nodes/{prefix}/{suffix}/{node,children}

The PackOptions enum selects between these layouts.
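As a rough illustration of the layout split above (a sketch, not the PR's actual API; the enum variants and the `entry_path` helper are assumptions made for this example):

```rust
use std::path::PathBuf;

// Hypothetical stand-in for the PR's PackOptions enum; variant names
// are illustrative, not the crate's actual definitions.
enum PackOptions {
    Upload,
    Full,
}

// Build the tar entry path for a node file, given the hash's
// prefix/suffix directory split.
fn entry_path(opts: &PackOptions, prefix: &str, suffix: &str, file: &str) -> PathBuf {
    match opts {
        // upload: only {prefix}/{suffix}/{node,children}
        PackOptions::Upload => PathBuf::from(prefix).join(suffix).join(file),
        // everything else: tree/nodes/{prefix}/{suffix}/{node,children}
        PackOptions::Full => PathBuf::from("tree")
            .join("nodes")
            .join(prefix)
            .join(suffix)
            .join(file),
    }
}
```

The point of the enum is that the same packing loop can serve both the upload endpoint and all other transfers by changing only how entry paths are constructed.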

Progress Bar in Bytes
Changes the progress bar for uploads & downloads to report bytes/s instead of nodes/s. Nodes are not evenly sized, so a node-based bar appears to stall while a large file is being handled. The byte counting now hooks into the reader & writer types that the trait uses.
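The reader-side hook can be sketched roughly as follows (an assumption about the mechanism, not the PR's actual types; `CountingReader` and `on_bytes` are made up for illustration):

```rust
use std::io::{self, Read};

// Wraps any reader and reports each chunk of bytes that passes
// through, so a progress bar can advance by data volume rather than
// by node count. Illustrative only; not the PR's actual type.
struct CountingReader<R, F: FnMut(u64)> {
    inner: R,
    on_bytes: F,
}

impl<R: Read, F: FnMut(u64)> Read for CountingReader<R, F> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        // Report the bytes actually moved on this call.
        (self.on_bytes)(n as u64);
        Ok(n)
    }
}
```

An equivalent wrapper over `Write` would cover the upload side.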

Errors
MerkleDbError gains transport-side variants for tar-archive parsing
(CannotReadMerkle, UnsupportedTarEntry, PathTraversal,
InvalidTarStructure, InvalidNodeIdHex, MissingNodeDir,
MissingTreeNodesDir, FsTransport), plus a corresponding tokio-util feature bump (io, io-util) for async I/O bridging.
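For a feel of the shape, the tar-parsing variants might look roughly like this (a hedged sketch using plain std; the real enum lives in merkle_node_db.rs and its exact field and source types are not reproduced here):

```rust
use std::fmt;

// Illustrative subset of the transport-side variants; field shapes
// follow the description above but are not the crate's definitions.
#[derive(Debug)]
enum MerkleDbError {
    CannotReadMerkle(String),
    UnsupportedTarEntry { path: String },
    PathTraversal(String),
    InvalidTarStructure { entry_path: String, reason: String },
}

impl fmt::Display for MerkleDbError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MerkleDbError::CannotReadMerkle(msg) => write!(f, "cannot read merkle entry: {msg}"),
            MerkleDbError::UnsupportedTarEntry { path } => write!(f, "unsupported tar entry type: {path}"),
            MerkleDbError::PathTraversal(path) => write!(f, "path traversal attempt: {path}"),
            MerkleDbError::InvalidTarStructure { entry_path, reason } => {
                write!(f, "invalid tar structure at {entry_path}: {reason}")
            }
        }
    }
}

impl std::error::Error for MerkleDbError {}
```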

Testing
Tests cover round-trips, path-traversal rejection, malformed entries,
empty tarballs, and on-disk byte parity with the existing compress_* /
unpack_async_tar_archive paths.

@coderabbitai
Contributor

coderabbitai Bot commented May 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds tar.gz packing/unpacking transport for Merkle nodes to FileBackend (with VFS staging and byte-estimate), extends MerkleDbError with tar/transport-specific variants, adds extensive wire-format and security tests, and enables additional features for the workspace tokio-util dependency.

Changes

Dependency Features

  • Cargo.toml (Manifest): tokio-util workspace dependency remains at 0.7.8; the io and io-util features are added alongside the existing compat feature.

Merkle Node Transport & Errors

  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs (Data Shape / Errors): Added MerkleDbError variants: FsTransport, CannotReadMerkle, UnsupportedTarEntry, PathTraversal, InvalidTarStructure { entry_path, reason }, InvalidNodeIdHex { id, source }, MissingNodeDir(MerkleHash), MissingTreeNodesDir.
  • crates/lib/src/core/db/merkle_node/file_backend.rs (API Surface): FileBackend gains is_vfs: bool. Added pub fn pack_nodes_byte_estimate(...). Implemented MerklePacker (pack_nodes, pack_all) and MerkleUnpacker (unpack) for FileBackend.
  • crates/lib/src/core/db/merkle_node/file_backend.rs (Core Implementation): Refactored packing into write_hashes_tar/write_all_tar supporting server-canonical (tree/nodes/...) and legacy layouts, compression-level selection, and skip-vs-error policies. Refactored unpacking into extract_tar_under with gzip+tar extraction, path-traversal rejection, allowed-entry-type enforcement, layout normalization, hex parsing, and hash collection.
  • crates/lib/src/core/db/merkle_node/file_backend.rs (Wiring / Repo Integration): unpack honors UnpackOptions overwrite semantics and, when is_vfs, stages extraction into a temp directory then copy_dir_all into .oxen/; otherwise writes directly. FileBackend::new reads LocalRepository::is_vfs().
  • crates/lib/src/core/db/merkle_node/file_backend.rs, test module (Tests): Added a deterministic tar-entry listing helper, retained legacy helpers for parity, and many tests: pack/unpack parity across layouts, empty-tar handling, skipping absent hashes, pack_nodes_byte_estimate bounds, VFS staging flow, and security/format regressions (path traversal, unsupported entry types, invalid node ids, over-deep structures, unknown leaf filenames, leading-zero nibble handling).

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant FileBackend
    participant Filesystem
    participant TarWriter

    Client->>FileBackend: pack_nodes(hashes, PackOptions)
    FileBackend->>Filesystem: stat/iterate node directories
    Filesystem-->>FileBackend: dir entries & file sizes
    FileBackend->>TarWriter: write entries (tree/nodes/... or legacy layout)
    TarWriter->>TarWriter: gzip compress (chosen level)
    FileBackend->>Client: stream tar.gz bytes
sequenceDiagram
    participant Client
    participant FileBackend
    participant TarReader
    participant Validator
    participant Filesystem

    Client->>FileBackend: unpack(tar_gz_stream, UnpackOptions)
    FileBackend->>TarReader: decompress & iterate entries
    TarReader->>Validator: validate (no '..', allowed types, layout)
    Validator-->>FileBackend: entry ok / error
    alt VFS repository
        FileBackend->>Filesystem: write to TempDir
        FileBackend->>Filesystem: copy_dir_all -> .oxen/
    else Regular repository
        FileBackend->>Filesystem: write directly under .oxen/
    end
    FileBackend->>FileBackend: parse node ids -> HashSet<MerkleHash>
    FileBackend->>Client: return unpacked HashSet

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • Oxen-AI/Oxen#470 — related changes to MerkleDbError surface and integration with OxenError.
  • Oxen-AI/Oxen#634 — related to merkle-node tar packaging and archive path/layout logic.

Suggested reviewers

  • CleanCut
  • rpschoenburg

Poem

🐇 I packed the nodes with careful art,
Gzipped tight so bytes won't part.
No ".." may creep or slyly pry,
Staged in temp, then copied by.
Hop—unpack done—now hashes fly.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check (✅ Passed): The title directly and concisely summarizes the main change: implementing MerklePacker and MerkleUnpacker traits for FileBackend, which is the primary objective of this PR.
  • Description check (✅ Passed): The description comprehensively relates to the changeset, covering MerklePacker/MerkleUnpacker implementation, tar-gz format handling, progress bar changes, error handling extensions, and testing.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/lib/src/core/db/merkle_node/file_backend.rs`:
- Around line 379-395: The code currently calls
extract_hash_from_entry_path(&dst_path, oxen_hidden) after create_dir_all, the
overwrite_existing check and file.unpack(), which lets invalid tar paths be
written and causes hashes to be skipped under UnpackOptions::SkipExisting; fix
by calling extract_hash_from_entry_path early—compute let hash_opt =
extract_hash_from_entry_path(&dst_path, oxen_hidden)? (or equivalent parser that
returns validation) immediately after computing dst_path and before any
create_dir_all / existence check / file.unpack(), then use that parsed result
for validation (return InvalidTarStructure if None/invalid), use hash_opt to
decide whether to skip adding to hashes when skipping existing files, and only
perform create_dir_all and file.unpack() after validation succeeds so invalid
entries are never written and hashes are consistently collected.
- Around line 354-359: The loop in file_backend.rs currently swallows tar entry
errors (for entry in entries { let Ok(mut file) = entry else { log::error!(...);
continue; } }) which allows partial installs; change it to propagate the entry
error as a MerkleDbError::CannotReadMerkle so the unpack fails immediately.
Replace the else/continue with an early Err return that converts the tar entry
error into MerkleDbError::CannotReadMerkle (including the original error message
for context), or use entry.map_err(|e| MerkleDbError::CannotReadMerkle(...))? to
bind mut file; ensure the error type returned by the surrounding function
matches this change.
- Around line 360-377: The code only checks for ParentDir components and misses
absolute/Windows-rooted tar paths which can discard the oxen_hidden prefix when
joined; in extract_tar_under (and similar to unpack_async_tar_archive) add a
guard that rejects path.is_absolute() and any component matching
Component::RootDir or Component::Prefix(_) (e.g., via path.components().any(|c|
matches!(c, Component::RootDir | Component::Prefix(_)))) and return a
MerkleDbError::PathTraversal (or existing appropriate error) before constructing
dst_path so absolute or rooted entries are refused.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a9f3ced5-5195-4725-8535-49844571d33c

📥 Commits

Reviewing files that changed from the base of the PR and between b38fcd2 and d9c8c29.

📒 Files selected for processing (3)
  • Cargo.toml
  • crates/lib/src/core/db/merkle_node/file_backend.rs
  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs

Comment thread crates/lib/src/core/db/merkle_node/file_backend.rs
Comment on lines +360 to +377
let path = file.path()?.into_owned();
// Path-traversal guard: refuse any entry whose path resolves above its container.
if path.components().any(|c| matches!(c, Component::ParentDir)) {
    return Err(MerkleDbError::PathTraversal(path.display().to_string()));
}
// Entry-type validation: only regular files and directories are allowed.
let entry_type = file.header().entry_type();
if !entry_type.is_file() && !entry_type.is_dir() {
    return Err(MerkleDbError::UnsupportedTarEntry {
        path: path.display().to_string(),
    });
}
// Server-style entries already contain `tree/nodes/...`; join directly.
// Legacy client-push entries begin at `{prefix}/{suffix}/...`; prepend `tree/nodes/`.
let dst_path = if path.starts_with(&tree_nodes_prefix) {
    oxen_hidden.join(&path)
} else {
    oxen_hidden.join(&tree_nodes_prefix).join(&path)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

In Rust, what does std::path::PathBuf::join do when the right-hand path is absolute or has a Windows drive prefix? In the tar crate, can Entry::path() surface archive paths with those forms?

💡 Result:

In Rust's standard library, std::path::PathBuf::join(path) creates a new PathBuf by adjoining the given path to self. If the right-hand path is absolute (starts with '/' on Unix or has both prefix and root on Windows), it replaces the current path entirely, returning the absolute path. On Windows, if the path has a prefix (e.g., drive letter like "C:") but no root, self is ignored and the prefixed path is returned unchanged. Examples from docs: Path::new("/etc").join("/bin/sh") yields "/bin/sh".



🏁 Script executed:

# First, let's find and examine the file mentioned in the review
fd "file_backend.rs" | head -5



🏁 Script executed:

# Check the Cargo.toml to see what tar crate version is used
fd "Cargo.toml" | xargs grep -l "tar" | head -3



🏁 Script executed:

# Let's examine the actual code around lines 360-377
wc -l crates/lib/src/core/db/merkle_node/file_backend.rs



🏁 Script executed:

# Read the specific lines mentioned in the review
sed -n '360,377p' crates/lib/src/core/db/merkle_node/file_backend.rs



🏁 Script executed:

# Get broader context around those lines to understand the function
sed -n '340,400p' crates/lib/src/core/db/merkle_node/file_backend.rs



🏁 Script executed:

# Check tar crate version in lib's Cargo.toml
grep -A2 -B2 "^tar" crates/lib/Cargo.toml



🏁 Script executed:

# Check workspace Cargo.toml to see tar version
cat Cargo.toml | grep -A5 "^\[workspace"



🏁 Script executed:

# Also search for any other path validation or safety checks in the codebase
rg "path_traversal|absolute|RootDir" crates/lib/src/core/db/merkle_node/ -A 2 -B 2



🏁 Script executed:

# Check if there's a util::fs::unpack_async_tar_archive mentioned in the comment
fd "*.rs" -path "*/util/*" | xargs grep -l "unpack_async_tar_archive" | head -3



🏁 Script executed:

# Get the tar version from workspace dependencies
grep "^tar" Cargo.toml



🏁 Script executed:

# Find and read the path traversal test
sed -n '/test_unpack_rejects_path_traversal/,/^[[:space:]]*\}/p' crates/lib/src/core/db/merkle_node/file_backend.rs | head -100



🏁 Script executed:

# Find util::fs::unpack_async_tar_archive
rg "unpack_async_tar_archive" crates/lib/src/



🏁 Script executed:

# Read more of the path traversal test
sed -n '/async fn test_unpack_rejects_path_traversal/,/^    \}/p' crates/lib/src/core/db/merkle_node/file_backend.rs



🏁 Script executed:

# Read the unpack_async_tar_archive function to see what checks it does
sed -n '/pub async fn unpack_async_tar_archive/,/^}\s*$/p' crates/lib/src/util/fs.rs | head -80



🏁 Script executed:

# Search for any tests with absolute paths in tar entries
rg "^/" crates/lib/src/core/db/merkle_node/file_backend.rs -B3 -A3



🏁 Script executed:

# Check if there are other tests for extract_tar_under
rg "extract_tar_under" crates/lib/src/core/db/merkle_node/file_backend.rs -B2 -A2



🏁 Script executed:

# Look for test cases with absolute paths in the tar module
rg "test.*tar.*absolute|absolute.*tar.*test" crates/



🏁 Script executed:

# Check if tar crate version 0.4.44 has any security considerations
cat Cargo.toml | grep -A20 "\[workspace.dependencies\]" | grep -A2 "tar"



🏁 Script executed:

# Look for any CVE or security issues related to absolute paths in tar unpacking
rg "absolute.*path|is_absolute" crates/lib/src/core/db/merkle_node/file_backend.rs



🏁 Script executed:

# Search for test with absolute paths or verify the fix suggestion
rg "RootDir|Prefix\|Component" crates/lib/src/



🏁 Script executed:

# See the context of the RootDir check in util/fs.rs
sed -n '/if components.next() == Some/,+5p' crates/lib/src/util/fs.rs



🏁 Script executed:

# See the context of RootDir | Prefix check in fs.rs
rg "Component::RootDir \| Component::Prefix" crates/lib/src/util/fs.rs -B3 -A3



🏁 Script executed:

# Check the workspaces/files.rs context
rg "ParentDir \| Component::Prefix.*RootDir" crates/lib/src/core/v_latest/workspaces/files.rs -B3 -A3



🏁 Script executed:

# Check if there are any GitHub issues or discussions about tar with absolute paths
# Let's look at the imports and see if there's any normalization happening
sed -n '1,50p' crates/lib/src/core/db/merkle_node/file_backend.rs | grep -E "use|import"



🏁 Script executed:

# Let's see the full file.path()?.into_owned() context
sed -n '355,385p' crates/lib/src/core/db/merkle_node/file_backend.rs



🏁 Script executed:

# Let's create a simple test to see if tar crate allows absolute paths
cat > /tmp/test_tar_absolute.rs << 'EOF'
use tar::Builder;
use std::io::Cursor;

fn main() {
    let mut buf = Vec::new();
    {
        let mut tar = Builder::new(&mut buf);
        let mut header = tar::Header::new_old();
        header.set_size(0);
        header.set_entry_type(tar::EntryType::Regular);
        
        // Try setting an absolute path
        let result = header.set_path("/etc/passwd");
        println!("set_path with absolute: {:?}", result);
        
        // Try raw bytes approach (like the test does)
        let name_bytes = b"/absolute/path";
        let old = header.as_old_mut();
        old.name[..name_bytes.len()].copy_from_slice(name_bytes);
        
        // Check what Entry::path would see
        println!("Raw header name: {:?}", std::str::from_utf8(&old.name));
    }
}
EOF
# Note: We can't run this without rustc, but we can reason about it



🏁 Script executed:

# Let's look at what the tar crate documentation says - check if there are any cargo docs
rg "absolute.*path|set_path" crates/lib/Cargo.toml



🏁 Script executed:

# Let's check the tar crate version and look for any path normalization
# by examining how paths are handled in existing code
rg "file\.path\(\)" crates/lib/src/ -B2 -A2 | head -50



Add guards for absolute and rooted tar paths.

The current guard only checks for .. components. A tar entry with an absolute path (e.g., /etc/passwd) or Windows drive prefix will cause PathBuf::join to discard the oxen_hidden prefix entirely, writing the file outside the repository. The unpack_async_tar_archive function uses a safer component-iteration approach that skips RootDir and Prefix components; extract_tar_under should do the same. Add checks for path.is_absolute() and Component::RootDir | Component::Prefix(_) before constructing dst_path.

Suggested fix
-        if path.components().any(|c| matches!(c, Component::ParentDir)) {
+        if path.is_absolute()
+            || path.components().any(|c| {
+                matches!(
+                    c,
+                    Component::ParentDir | Component::RootDir | Component::Prefix(_)
+                )
+            })
+        {
             return Err(MerkleDbError::PathTraversal(path.display().to_string()));
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lib/src/core/db/merkle_node/file_backend.rs` around lines 360 - 377,
The code only checks for ParentDir components and misses absolute/Windows-rooted
tar paths which can discard the oxen_hidden prefix when joined; in
extract_tar_under (and similar to unpack_async_tar_archive) add a guard that
rejects path.is_absolute() and any component matching Component::RootDir or
Component::Prefix(_) (e.g., via path.components().any(|c| matches!(c,
Component::RootDir | Component::Prefix(_)))) and return a
MerkleDbError::PathTraversal (or existing appropriate error) before constructing
dst_path so absolute or rooted entries are refused.
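The hazard and the guard can be demonstrated with std alone (a sketch of the suggested check; `is_safe_entry` is a hypothetical helper name, not the PR's function):

```rust
use std::path::{Component, Path};

// A tar entry path is safe to join under a base directory only if it
// is relative and contains no `..`, root, or drive-prefix components.
fn is_safe_entry(path: &Path) -> bool {
    !path.is_absolute()
        && !path.components().any(|c| {
            matches!(
                c,
                Component::ParentDir | Component::RootDir | Component::Prefix(_)
            )
        })
}
```

The underlying hazard is exactly the `join` behavior quoted above: `Path::new("/etc").join("/bin/sh")` yields `/bin/sh`, silently discarding the base, which is why rejecting `..` alone is not enough.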

Comment thread crates/lib/src/core/db/merkle_node/file_backend.rs
@malcolmgreaves force-pushed the mg/merkle_dyn_pack_interfaces branch from b38fcd2 to c5c8270 on May 4, 2026, 17:47
@malcolmgreaves force-pushed the mg/merkle_dyn_pack_impls branch from d9c8c29 to 3e16ed9 on May 4, 2026, 17:48
@malcolmgreaves force-pushed the mg/merkle_dyn_pack_interfaces branch from c5c8270 to f12af68 on May 4, 2026, 17:58
@malcolmgreaves force-pushed the mg/merkle_dyn_pack_impls branch from 3e16ed9 to 5009a3a on May 4, 2026, 17:59
@malcolmgreaves force-pushed the mg/merkle_dyn_pack_interfaces branch 2 times, most recently from c65ffd7 to 5813fbf on May 5, 2026, 00:42
@malcolmgreaves
Collaborator Author

NOTE: Stacked PR! Must merge #514 before merging.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
crates/lib/src/core/db/merkle_node/file_backend.rs (3)

352-357: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate unreadable tar entries instead of continuing.

Continuing here lets a malformed archive partially install files and still return Ok(...). Convert the entry error into MerkleDbError::CannotReadMerkle and abort the unpack.

Suggested fix
-        let Ok(mut file) = entry else {
-            log::error!("Could not unpack file in merkle tar archive");
-            // TODO: raise this error to the caller instead!?
-            continue;
-        };
+        let mut file = entry.map_err(MerkleDbError::CannotReadMerkle)?;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lib/src/core/db/merkle_node/file_backend.rs` around lines 352 - 357,
The loop over tar archive entries in file_backend.rs currently swallows entry
unpack errors and continues; change this to propagate the failure by converting
the entry unpack error into MerkleDbError::CannotReadMerkle and returning
Err(...) so unpack aborts instead of partially installing files. Locate the for
entry in entries loop where it currently does let Ok(mut file) = entry else {
log::error!(...); continue; } and replace the continue path with constructing
and returning Err(MerkleDbError::CannotReadMerkle(...)) (include the original
entry error message/context) so the caller receives the failure immediately.
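The fail-fast pattern generalizes as below (a generic sketch, not the PR's code; `install_all` and the `io::Result` item type are illustrative stand-ins for the tar entry loop):

```rust
use std::io;

// Iterate fallible entries and abort on the first error instead of
// logging and continuing, so a malformed archive never leaves a
// partial install behind.
fn install_all<I>(entries: I) -> io::Result<usize>
where
    I: IntoIterator<Item = io::Result<Vec<u8>>>,
{
    let mut installed = 0;
    for entry in entries {
        // `?` propagates the entry error to the caller immediately.
        let _bytes = entry?;
        installed += 1;
    }
    Ok(installed)
}
```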

358-376: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Reject absolute and rooted tar paths before joining them.

The current guard only rejects `..` components. Absolute paths and Windows-rooted/prefixed paths can still cause join to discard oxen_hidden, which lets an archive write outside the repo.

Suggested fix
-        if path.components().any(|c| matches!(c, Component::ParentDir)) {
+        if path.is_absolute()
+            || path.components().any(|c| {
+                matches!(
+                    c,
+                    Component::ParentDir | Component::RootDir | Component::Prefix(_)
+                )
+            })
+        {
             return Err(MerkleDbError::PathTraversal(path.display().to_string()));
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lib/src/core/db/merkle_node/file_backend.rs` around lines 358 - 376,
The path guard only rejects ParentDir but still allows absolute or
platform-rooted paths which can make oxen_hidden.join(...) discard the prefix;
update the validation around file.path() to also reject paths that are absolute
or contain Component::RootDir or Component::Prefix (or use path.is_absolute())
and return MerkleDbError::PathTraversal with the offending path string; keep the
existing checks for entry_type and then safely build dst_path using the
now-ensured relative path (variables to edit: the `path` from file.path(), the
guard that returns MerkleDbError::PathTraversal, and the dst_path construction
that uses `tree_nodes_prefix` and `oxen_hidden`).

372-392: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Parse and validate the entry path before any write or skip decision.

extract_hash_from_entry_path() runs too late here. That means invalid tar structures can create directories/files before you return InvalidTarStructure, and UnpackOptions::SkipExisting drops hashes from the returned set even though the unpack contract says skipped entries should still be reported.

Suggested fix
         let dst_path = if path.starts_with(&tree_nodes_prefix) {
             oxen_hidden.join(&path)
         } else {
             oxen_hidden.join(&tree_nodes_prefix).join(&path)
         };
+        let parsed_hash = extract_hash_from_entry_path(&dst_path, oxen_hidden)?;
+
         if let Some(parent) = dst_path.parent() {
             std::fs::create_dir_all(parent)?;
         }
         if dst_path.exists() && !overwrite_existing {
             log::info!("Node already exists at {dst_path:?}, skipping");
+            if let Some(hash) = parsed_hash {
+                hashes.insert(hash);
+            }
             continue;
         }
         file.unpack(&dst_path)?;
-        if let Some(hash) = extract_hash_from_entry_path(&dst_path, oxen_hidden)? {
+        if let Some(hash) = parsed_hash {
             hashes.insert(hash);
         } else {
             log::warn!(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lib/src/core/db/merkle_node/file_backend.rs` around lines 372 - 392,
Call extract_hash_from_entry_path(...) and validate the entry path before
performing any filesystem side-effects (creating dirs or calling file.unpack) so
malformed tar entries cannot create files/dirs before we return
InvalidTarStructure; specifically move the
extract_hash_from_entry_path(&dst_path, oxen_hidden)? call to precede
std::fs::create_dir_all(parent)? and file.unpack(&dst_path)?, and when
overwrite_existing is false and dst_path.exists() do not continue without
recording the extracted hash into the unpack result set
(UnpackOptions::SkipExisting must still report skipped-entry hashes) — update
the logic around dst_path, overwrite_existing, and the returned hashes to always
extract/record the hash before skipping or writing.
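The validate-before-write ordering can be sketched like this (a hypothetical helper; `process_entry` and `parse_hash` stand in for the real extraction loop and extract_hash_from_entry_path, and a plain String replaces MerkleHash):

```rust
use std::collections::HashSet;
use std::path::Path;

// Validate the destination path and record its hash before any
// filesystem side effect, so invalid entries are rejected cleanly and
// skipped entries are still reported in the returned set.
fn process_entry(
    dst: &Path,
    exists: bool,
    overwrite: bool,
    hashes: &mut HashSet<String>,
    parse_hash: impl Fn(&Path) -> Option<String>,
) -> Result<bool, String> {
    // 1. Parse/validate first: nothing has been created on disk yet.
    let hash = parse_hash(dst)
        .ok_or_else(|| format!("invalid tar structure: {}", dst.display()))?;
    // 2. Record the hash whether or not the entry is written.
    hashes.insert(hash);
    // 3. Skip-existing returns without writing, but after recording.
    if exists && !overwrite {
        return Ok(false);
    }
    // 4. Only here would create_dir_all + unpack run in the real code.
    Ok(true)
}
```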
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/lib/src/core/db/merkle_node/file_backend.rs`:
- Around line 197-529: The impls fail because FileBackend lacks the &'repo
LocalRepository reference and uses a non-existent self.repo; add a repo: &'repo
LocalRepository field to the FileBackend struct (making it FileBackend<'repo>)
and update its constructor/creators to populate that field (you can keep
repo_path if still needed or remove it), then ensure methods referencing
self.repo (write_hashes_tar, write_all_tar, extract_tar_under usage in
MerklePacker::pack_nodes and MerkleUnpacker::unpack) compile against the new
field; alternatively, if you prefer not to store the repo, refactor those
callers (pack_nodes, pack_all, unpack and the helper functions) to accept &Path
and an is_vfs: bool instead of using self.repo.
- Around line 519-523: The VFS branch bypasses UnpackOptions::SkipExisting by
extracting into a TempDir then unconditionally calling util::fs::copy_dir_all,
which overwrites existing files; fix it by making the VFS path respect
overwrite_existing: either call extract_tar_under(reader, &oxen_hidden,
overwrite_existing) directly when self.repo.is_vfs(), or (if a TempDir is
required) replace the unconditional util::fs::copy_dir_all(tmp.path(),
&oxen_hidden) with a copy routine that checks overwrite_existing (the same logic
used by extract_tar_under) and skips existing files when overwrite_existing is
false; update the code around self.repo.is_vfs(), TempDir::new(),
extract_tar_under, and util::fs::copy_dir_all to apply this change.

---

Duplicate comments:
In `@crates/lib/src/core/db/merkle_node/file_backend.rs`:
- Around line 352-357: The loop over tar archive entries in file_backend.rs
currently swallows entry unpack errors and continues; change this to propagate
the failure by converting the entry unpack error into
MerkleDbError::CannotReadMerkle and returning Err(...) so unpack aborts instead
of partially installing files. Locate the for entry in entries loop where it
currently does let Ok(mut file) = entry else { log::error!(...); continue; } and
replace the continue path with constructing and returning
Err(MerkleDbError::CannotReadMerkle(...)) (include the original entry error
message/context) so the caller receives the failure immediately.
- Around line 358-376: The path guard only rejects ParentDir but still allows
absolute or platform-rooted paths which can make oxen_hidden.join(...) discard
the prefix; update the validation around file.path() to also reject paths that
are absolute or contain Component::RootDir or Component::Prefix (or use
path.is_absolute()) and return MerkleDbError::PathTraversal with the offending
path string; keep the existing checks for entry_type and then safely build
dst_path using the now-ensured relative path (variables to edit: the `path` from
file.path(), the guard that returns MerkleDbError::PathTraversal, and the
dst_path construction that uses `tree_nodes_prefix` and `oxen_hidden`).
- Around line 372-392: Call extract_hash_from_entry_path(...) and validate the
entry path before performing any filesystem side-effects (creating dirs or
calling file.unpack) so malformed tar entries cannot create files/dirs before we
return InvalidTarStructure; specifically move the
extract_hash_from_entry_path(&dst_path, oxen_hidden)? call to precede
std::fs::create_dir_all(parent)? and file.unpack(&dst_path)?, and when
overwrite_existing is false and dst_path.exists() do not continue without
recording the extracted hash into the unpack result set
(UnpackOptions::SkipExisting must still report skipped-entry hashes) — update
the logic around dst_path, overwrite_existing, and the returned hashes to always
extract/record the hash before skipping or writing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3c40c3a9-3131-457a-a28f-2563c95f5a61

📥 Commits

Reviewing files that changed from the base of the PR and between 3e16ed9 and 484843a.

📒 Files selected for processing (3)
  • Cargo.toml
  • crates/lib/src/core/db/merkle_node/file_backend.rs
  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs
✅ Files skipped from review due to trivial changes (2)
  • Cargo.toml
  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs

@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_interfaces branch from 11eacec to d4d0762 Compare May 5, 2026 01:27
@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_impls branch from 484843a to d524e90 Compare May 5, 2026 01:27
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
crates/lib/src/core/db/merkle_node/file_backend.rs (1)

1757-1790: 💤 Low value

Test only exercises VFS with Overwrite, not SkipExisting.

This test verifies that the VFS staging branch works, but it uses UnpackOptions::Overwrite. It doesn't exercise the SkipExisting policy, which is bypassed in the VFS code path (see the earlier comment on lines 519-523). Consider adding a test that pre-populates the target, unpacks with SkipExisting, and asserts that the pre-existing files weren't overwritten.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lib/src/core/db/merkle_node/file_backend.rs` around lines 1757 - 1790,
The current test test_unpack_via_vfs_branch only exercises the VFS branch with
UnpackOptions::Overwrite; add a new async test (e.g.,
test_unpack_via_vfs_skip_existing) that follows the same setup (pack_all into a
buffer, init a temp repo, clone.set_vfs(Some(true))) but before calling
FileBackend::unpack use the clone to pre-populate one or more hashes/files
(create files or call FileBackend::new(&clone).unpack or otherwise write known
content), then call FileBackend::new(&clone).unpack(&mut &packed[..],
UnpackOptions::SkipExisting) and assert the pre-populated items were not changed
(by reading content or checking timestamps) and that other items were installed;
reference FileBackend::pack_all, FileBackend::unpack,
UnpackOptions::SkipExisting, FileBackend::new, exists and clone.set_vfs to
locate the relevant code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/lib/src/core/db/merkle_node/file_backend.rs`:
- Around line 1757-1790: The current test test_unpack_via_vfs_branch only
exercises the VFS branch with UnpackOptions::Overwrite; add a new async test
(e.g., test_unpack_via_vfs_skip_existing) that follows the same setup (pack_all
into a buffer, init a temp repo, clone.set_vfs(Some(true))) but before calling
FileBackend::unpack use the clone to pre-populate one or more hashes/files
(create files or call FileBackend::new(&clone).unpack or otherwise write known
content), then call FileBackend::new(&clone).unpack(&mut &packed[..],
UnpackOptions::SkipExisting) and assert the pre-populated items were not changed
(by reading content or checking timestamps) and that other items were installed;
reference FileBackend::pack_all, FileBackend::unpack,
UnpackOptions::SkipExisting, FileBackend::new, exists and clone.set_vfs to
locate the relevant code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e4ca3faf-c1d7-4bcd-a447-6e5af9082d90

📥 Commits

Reviewing files that changed from the base of the PR and between 484843a and d524e90.

📒 Files selected for processing (3)
  • Cargo.toml
  • crates/lib/src/core/db/merkle_node/file_backend.rs
  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs
✅ Files skipped from review due to trivial changes (2)
  • Cargo.toml
  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs

@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_impls branch 2 times, most recently from 11e30a5 to 5a896ca Compare May 5, 2026 01:44
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (3)
crates/lib/src/core/db/merkle_node/file_backend.rs (3)

373-375: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Add guards for absolute and rooted tar paths.

The path-traversal guard only checks for Component::ParentDir. A tar entry with an absolute path (e.g., /etc/passwd) or a Windows drive prefix will cause PathBuf::join to discard the oxen_hidden prefix entirely, writing files outside the repository. This matches the check in util::fs::unpack_async_tar_archive, which also guards against RootDir and Prefix components.

Suggested fix
-        if path.components().any(|c| matches!(c, Component::ParentDir)) {
+        if path.is_absolute()
+            || path.components().any(|c| {
+                matches!(
+                    c,
+                    Component::ParentDir | Component::RootDir | Component::Prefix(_)
+                )
+            })
+        {
             return Err(MerkleDbError::PathTraversal(path.display().to_string()));
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lib/src/core/db/merkle_node/file_backend.rs` around lines 373 - 375,
The path-traversal check in file_backend.rs only rejects Component::ParentDir
but must also reject absolute/rooted tar paths; update the guard in the function
that inspects `path.components()` (the block that currently returns
`MerkleDbError::PathTraversal(path.display().to_string())`) to also match
Component::RootDir and Component::Prefix (i.e., any(|c| matches!(c,
Component::ParentDir | Component::RootDir | Component::Prefix(_)))). Use the same
MerkleDbError::PathTraversal error with the path display as before; mirror the
behavior of util::fs::unpack_async_tar_archive when validating entries.
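The core claim here, that `PathBuf::join` silently discards the base when handed an absolute path, can be checked in isolation. The sketch below is illustrative, not the crate's code; `is_safe_entry` is a hypothetical helper mirroring the suggested guard:

```rust
use std::path::{Component, Path, PathBuf};

// Reject tar entry paths that could escape the extraction root:
// parent-dir components, absolute paths, and Windows drive prefixes.
fn is_safe_entry(path: &Path) -> bool {
    !path.is_absolute()
        && !path.components().any(|c| {
            matches!(
                c,
                Component::ParentDir | Component::RootDir | Component::Prefix(_)
            )
        })
}

fn main() {
    let base = Path::new("/repo/.oxen");
    // An absolute entry path replaces the base entirely on join,
    // which is why checking only for ".." is insufficient.
    assert_eq!(base.join("/etc/passwd"), PathBuf::from("/etc/passwd"));

    assert!(!is_safe_entry(Path::new("/etc/passwd")));
    assert!(!is_safe_entry(Path::new("a/../../etc/passwd")));
    assert!(is_safe_entry(Path::new("tree/nodes/ab/cdef/node")));
}
```

Running this on Unix demonstrates the escape: the joined path has no trace of the intended `/repo/.oxen` root.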

533-541: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

VFS staging path bypasses SkipExisting semantics.

When is_vfs is true, extraction happens into a fresh temp directory where the overwrite_existing check inside extract_tar_under is always a no-op (nothing exists yet). Then util::fs::copy_dir_all performs an unconditional recursive copy that overwrites any colliding files in oxen_hidden, ignoring the UnpackOptions::SkipExisting parameter entirely.

The non-VFS path at line 540 correctly respects UnpackOptions::SkipExisting because it invokes extract_tar_under directly on the real destination.

Possible approaches
  1. Direct extraction for VFS: If the VFS can handle small file streams, extract directly like the non-VFS path.

  2. Copy with skip logic: Replace copy_dir_all with a variant that checks overwrite_existing and skips existing destination files when false.

  3. Post-copy cleanup: After extracting to temp, iterate and only copy files that don't exist at the destination (if !overwrite_existing).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lib/src/core/db/merkle_node/file_backend.rs` around lines 533 - 541,
The VFS branch bypasses UnpackOptions::SkipExisting by extracting into a TempDir
then unconditionally calling util::fs::copy_dir_all; change this so that when
self.is_vfs and overwrite_existing is false you do a conditional copy instead of
copy_dir_all: after extract_tar_under(reader, tmp.path(), overwrite_existing)
succeeds, if overwrite_existing is true keep util::fs::copy_dir_all(tmp.path(),
&oxen_hidden) but if false iterate the entries under tmp.path() and for each
entry compute the corresponding destination under oxen_hidden and only copy the
entry when the destination does not already exist (preserving file/dir
semantics, symlinks, and metadata); this keeps extract_tar_under, TempDir::new,
is_vfs, and oxen_hidden intact while enforcing UnpackOptions::SkipExisting.
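Approaches 2 and 3 above amount to a recursive copy that consults the destination first. A minimal sketch, assuming plain files and directories (no symlink or metadata handling); `copy_dir_skip_existing` is a hypothetical stand-in for a SkipExisting-aware `copy_dir_all`, not the crate's API:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Recursively copy `src` into `dst`, skipping any file that already
// exists at the destination (SkipExisting semantics).
fn copy_dir_skip_existing(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_dir_skip_existing(&entry.path(), &target)?;
        } else if !target.exists() {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Demo: a pre-existing destination file survives; new files are installed.
    let tmp = std::env::temp_dir().join("skip_existing_demo");
    let (src, dst) = (tmp.join("src"), tmp.join("dst"));
    let _ = fs::remove_dir_all(&tmp);
    fs::create_dir_all(src.join("sub"))?;
    fs::create_dir_all(&dst)?;
    fs::write(src.join("a.txt"), "new")?;
    fs::write(src.join("sub/b.txt"), "new")?;
    fs::write(dst.join("a.txt"), "old")?;

    copy_dir_skip_existing(&src, &dst)?;

    assert_eq!(fs::read_to_string(dst.join("a.txt"))?, "old"); // skipped
    assert_eq!(fs::read_to_string(dst.join("sub/b.txt"))?, "new"); // installed
    Ok(())
}
```

Calling this in place of `copy_dir_all` only when `overwrite_existing` is false would keep the Overwrite path unchanged while restoring SkipExisting semantics for the VFS branch.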

365-370: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate tar entry errors instead of continuing.

The continue on entry errors allows a malformed archive to partially install files and still return Ok(...). The TODO comment indicates this was already recognized as problematic. The fix suggested in a previous review was to use entry.map_err(MerkleDbError::CannotReadMerkle)? to fail immediately.

Suggested fix
-        let Ok(mut file) = entry else {
-            log::error!("Could not unpack file in merkle tar archive");
-            // TODO: raise this error to the caller instead!?
-            continue;
-        };
+        let mut file = entry.map_err(MerkleDbError::CannotReadMerkle)?;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lib/src/core/db/merkle_node/file_backend.rs` around lines 365 - 370,
Replace the current non-failing unpack logic that logs and continues when a tar
entry is Err (the `let Ok(mut file) = entry else { log::error!(...); continue;
}` pattern) with immediate error propagation by mapping the tar entry error into
MerkleDbError::CannotReadMerkle and using the `?` operator (e.g., `let mut file
= entry.map_err(MerkleDbError::CannotReadMerkle)?;`) so the function fails fast
instead of partially installing files.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@crates/lib/src/core/db/merkle_node/file_backend.rs`:
- Around line 373-375: The path-traversal check in file_backend.rs only rejects
Component::ParentDir but must also reject absolute/rooted tar paths; update the
guard in the function that inspects `path.components()` (the block that
currently returns `MerkleDbError::PathTraversal(path.display().to_string())`) to
also match Component::RootDir and Component::Prefix (i.e., any(|c| matches!(c,
Component::ParentDir | Component::RootDir | Component::Prefix(_)))). Use the same
MerkleDbError::PathTraversal error with the path display as before; mirror the
behavior of util::fs::unpack_async_tar_archive when validating entries.
- Around line 533-541: The VFS branch bypasses UnpackOptions::SkipExisting by
extracting into a TempDir then unconditionally calling util::fs::copy_dir_all;
change this so that when self.is_vfs and overwrite_existing is false you do a
conditional copy instead of copy_dir_all: after extract_tar_under(reader,
tmp.path(), overwrite_existing) succeeds, if overwrite_existing is true keep
util::fs::copy_dir_all(tmp.path(), &oxen_hidden) but if false iterate the
entries under tmp.path() and for each entry compute the corresponding
destination under oxen_hidden and only copy the entry when the destination does
not already exist (preserving file/dir semantics, symlinks, and metadata); this
keeps extract_tar_under, TempDir::new, is_vfs, and oxen_hidden intact while
enforcing UnpackOptions::SkipExisting.
- Around line 365-370: Replace the current non-failing unpack logic that logs
and continues when a tar entry is Err (the `let Ok(mut file) = entry else {
log::error!(...); continue; }` pattern) with immediate error propagation by
mapping the tar entry error into MerkleDbError::CannotReadMerkle and using the
`?` operator (e.g., `let mut file =
entry.map_err(MerkleDbError::CannotReadMerkle)?;`) so the function fails fast
instead of partially installing files.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3e00cbca-b911-4883-be41-02e5c929e3dd

📥 Commits

Reviewing files that changed from the base of the PR and between d524e90 and 11e30a5.

📒 Files selected for processing (3)
  • Cargo.toml
  • crates/lib/src/core/db/merkle_node/file_backend.rs
  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs
✅ Files skipped from review due to trivial changes (2)
  • Cargo.toml
  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs

@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_impls branch 2 times, most recently from 98cb275 to 120f270 Compare May 5, 2026 17:06
@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_interfaces branch from d4d0762 to 234e31d Compare May 5, 2026 17:09
@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_impls branch from 120f270 to c28bcd5 Compare May 5, 2026 17:09
@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_interfaces branch from 234e31d to 029e820 Compare May 7, 2026 19:24
@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_impls branch from c28bcd5 to 35659ea Compare May 7, 2026 19:24
@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_interfaces branch 2 times, most recently from 2bd348f to cbf9b0a Compare May 7, 2026 20:12
@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_impls branch from 35659ea to a82b35d Compare May 7, 2026 20:30
@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_interfaces branch from cbf9b0a to df8d6fb Compare May 7, 2026 21:40
@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_impls branch from a82b35d to b12893f Compare May 7, 2026 21:41
@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_interfaces branch from df8d6fb to f7950a6 Compare May 7, 2026 23:34
@malcolmgreaves malcolmgreaves force-pushed the mg/merkle_dyn_pack_impls branch from b12893f to ac7ffb3 Compare May 7, 2026 23:36