35 commits
be70bdf
Switch dependency manager from poetry to uv
datadavev Dec 4, 2025
f19b491
Update 'README.md' with instructions on how to install 'uv' for depen…
doulikecookiedough Dec 4, 2025
b91e8bb
Add 'exceptiongroup' dependency to 'pyproject.toml' to resolve python…
doulikecookiedough Dec 4, 2025
16a2598
Add python 3.11 to github workflow test
doulikecookiedough Dec 4, 2025
243953d
Add workflow for CI with uv, version bump, limit python upper version
datadavev Dec 4, 2025
ca94cdd
Merge branch 'feature-150-switch-uv' of https://github.com/DataONEorg…
datadavev Dec 4, 2025
7ca4b33
Update uv CI workflow to use current
datadavev Dec 4, 2025
1d3fdf6
uv CI withough lock
datadavev Dec 4, 2025
886de81
Update lockfile
datadavev Dec 23, 2025
93eaf6c
Add support for folder store / retrieve. This is a WIP.
datadavev Jan 15, 2026
154a1cb
Added notes about folders in hashstore
datadavev Jan 15, 2026
e3056b7
Adjust typehints for 3.9
datadavev Jan 15, 2026
c5a5e72
Adjust check_string to check for leading or trailing whitespace
datadavev Jan 15, 2026
b8ff3bf
Rename header and update folder hierarchy example
datadavev Jan 20, 2026
1b72b4d
KeyError instead of ValueError, add list_pids()
datadavev Feb 20, 2026
040868e
some refactoring for folder support
datadavev Apr 30, 2026
74f1870
Dependency updates require 3.10 minimum python version
datadavev May 8, 2026
fda4af6
Adjust folder related method signatures
datadavev May 8, 2026
880c355
Dependency updates
datadavev May 8, 2026
e84f4de
Simplify folder methods, build recursion is responsibility of caller;…
datadavev May 8, 2026
067955b
Add structure for folder entries
datadavev May 8, 2026
45ef848
Make find_object less noisy for common and expected cases
datadavev May 8, 2026
97ffe89
Make find_object part of base hashstore, tweak hints
datadavev May 8, 2026
d324b9f
WIP: adjust cli for revised hashstore folder support
datadavev May 8, 2026
ca34815
Merge branch 'feat-152_folder_store' of https://github.com/DataONEorg…
datadavev May 8, 2026
0970b1e
Added option to capture object creation events to an index file
datadavev May 8, 2026
f7b24be
Added folder inspection methods
datadavev May 8, 2026
9b12931
Enable folder creation without folderEntry cids
datadavev May 11, 2026
9cf4f21
Change type to bool for efficiency
datadavev May 11, 2026
67e2668
Use alternate path delimiter
datadavev May 11, 2026
b1403b3
switch path delim, path as list instead of str
datadavev May 12, 2026
8e852e3
Refactor to use path segments in api
datadavev May 12, 2026
8c2a673
Fix recursion step, allow delimtier to be specified
datadavev May 12, 2026
07e32a7
Initial store - WIP
datadavev May 14, 2026
fa1eb13
Limit code changes to folder support
datadavev May 14, 2026
2 changes: 1 addition & 1 deletion .github/workflows/poetry-package-test.yml
@@ -16,7 +16,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["3.9", "3.10"]
+        python-version: ["3.9", "3.10", "3.11"]
 
     steps:
       - uses: actions/checkout@v3
28 changes: 28 additions & 0 deletions .github/workflows/uv-package-test.yml
@@ -0,0 +1,28 @@
name: Python CI with uv and pytest
on:
  workflow_dispatch:
  push:
    branches: [ "main"]
  pull_request:
    branches: [ "main" ]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14" ]
    steps:
      - uses: actions/checkout@v5

      - name: Setup uv
        uses: astral-sh/setup-uv@v7
        with:
          version: '0.9.15'
          python-version: ${{ matrix.python-version }}

      - name: Install the project
        run: uv sync --all-extras --dev

      - name: Run tests with pytest
        run: uv run pytest tests
3 changes: 2 additions & 1 deletion .vscode/settings.json
@@ -8,5 +8,6 @@
   "editor.formatOnSave": true,
   "[python]": {
     "editor.defaultFormatter": "ms-python.black-formatter"
-  }
+  },
+  "python-envs.pythonProjects": []
 }
24 changes: 17 additions & 7 deletions README.md
@@ -316,13 +316,23 @@ use_multiprocessing = os.getenv("USE_MULTIPROCESSING", "False") == "True"
 
 ## Development build
 
-HashStore is a python package, and built using the [Python Poetry](https://python-poetry.org)
-build tool.
-
-To install `hashstore` locally, create a virtual environment for python 3.9+,
-install poetry, and then install or build the package with `poetry install` or `poetry build`,
-respectively. Note, installing `hashstore` with poetry will also make the `hashstore` command
-available through the command line terminal (see `HashStore Client` section below for details).
+HashStore is a Python package. We recommend installing it using `uv`. Instructions on how to install and set up `uv` can be found [here](https://gist.github.com/datadavev/3975f244e5db500ba0328ef771ca74dd).
+
+Friendly Notes:
+- You may run into a `command not found: compdef` error when adding code to your `.zshrc` file. This can be resolved by adjusting the code to be:
+```sh
+# .zshrc
+autoload -Uz compinit
+compinit
+eval "$(uv generate-shell-completion zsh)"
+eval "$(uvx --generate-shell-completion zsh)"
+```
+- When downloading the script `uv-python-symlink`, an extension may be added to it (for example, `uv-python-symlink.txt`), and it may not be marked as executable. You can run the following to adjust it:
+```sh
+mv uv-python-symlink uv-python-symlink.sh
+chmod +x uv-python-symlink.sh
+```
+- After following the steps and navigating to the Python project, `uv` may not have sufficient permissions to run. Follow the given prompts and execute `direnv allow`.
 
 To run tests, navigate to the root directory and run `pytest`. The test suite contains tests that
 take a longer time to run (relating to the storage of large files) - to execute all tests, run
164 changes: 164 additions & 0 deletions folder_operations.md
@@ -0,0 +1,164 @@
# Folders in HashStore

This document describes how directory trees are stored in hashstore (`hs`).

## Assumptions

- The root of a folder hierarchy is identified by a PID
- A folder hierarchy (including content) identified by a PID is immutable
- A mutation to a folder hierarchy results in a new folder hierarchy identified by a new PID
- Any subfolder may optionally be identified by a PID
- Any file contained within a folder hierarchy may be identified by a PID
- Permissions are associated with a PID, and so apply to the content of PID-identified containers or files.
- A folder hierarchy may reference all or part of another identified folder hierarchy
- A folder is represented by a `container` in hashstore.

## Virtual hashstore

When a folder is added to `hs`, it is necessary to calculate file and folder hashes and compare these with any existing content in the target `hs`. The efficiency of updating an existing folder entry in `hs` can be significantly improved by computing the hashes locally and determining what may need to be sent to the target `hs`. This is especially important for large folder structures that may have isolated changes.

A virtual `hs` (`vhs`) is a local folder structure similar to an `hs`, except that it stores only hashes of the content rather than the content bytes (container files are the exception and are stored). Time stamps of the hash entries are compared with content time stamps to identify candidates for hash recalculation. If hash values have changed, then the files are tagged for upload to the target `hs`.

A `vhs` is composed of CID and PID ref files, and container files for folder hashes. Even though content ids are calculated, the content files are not stored.
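
The time stamp comparison described above could be sketched as follows. This is a minimal illustration, not the `vhs` implementation: the function name `needs_rehash` and the idea of one hash entry file per content file are assumptions made for the example.

```python
from pathlib import Path


def needs_rehash(content_path: Path, hash_entry_path: Path) -> bool:
    """Return True when the recorded hash may be stale for content_path.

    Hypothetical helper: assumes each content file has a matching hash
    entry file whose modification time records when the hash was computed.
    """
    # No recorded hash yet, so the hash must be calculated.
    if not hash_entry_path.exists():
        return True
    # Content modified after the hash was recorded: candidate for rehash.
    return content_path.stat().st_mtime > hash_entry_path.stat().st_mtime
```

A file flagged by this check is only a candidate; per the description above, it is tagged for upload only if the recalculated hash actually differs.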



## Containers

Hashstore is augmented by adding an additional type of content that represents a `container`, the contents of which represent a single folder. A `container` has two types of entries: `file` that represents a single file and `folder` which represents a single subfolder. Each entry in a `container` has properties: `type`, `cid`, and `name`, where:

`type` - Indicates if the entry is a folder (`0`) or file (`1`).

`cid` - The content ID for the respective file or container.

`name` - The name component of the path to the entry, that is, the last path segment for a subfolder or the file name (without path) for a file.

The CID for a container is computed from the serialized content of the container, which includes the CID values for any subfolders. Hence, computing the CID for folders in a hierarchy requires a depth-first approach where the CIDs for the leaves of a branch are computed before the branch itself.

A container is serialized as space-delimited rows in a text file. Each row represents an entry in the container, with values `type`, `cid`, and `name` in that order. Since folder or file names *may* contain whitespace, the `name` entry consumes the remainder of the row.

Since the CID for a container depends on its content, rows are sorted by `type` and then `cid` so that hashing is deterministic. Because folders have `type` `0`, rows referencing subfolder containers always appear before rows referencing files.
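
The serialization rules above can be illustrated with a short sketch. It is not the hashstore implementation: the names `Entry`, `serialize_container`, `parse_row`, and `container_cid` are hypothetical, SHA-256 and a trailing newline per row are assumptions, and `name` is used as a tie-breaker in the sort (an addition not stated above) so that entries sharing a `cid` still sort deterministically.

```python
import hashlib
from typing import NamedTuple


class Entry(NamedTuple):
    type: int  # 0 = folder, 1 = file
    cid: str
    name: str


def serialize_container(entries: list[Entry]) -> str:
    # Sort by type, then cid; name is an assumed tie-breaker.
    rows = sorted(entries, key=lambda e: (e.type, e.cid, e.name))
    # One space-delimited row per entry; newline terminator is an assumption.
    return "".join(f"{e.type} {e.cid} {e.name}\n" for e in rows)


def parse_row(row: str) -> Entry:
    # Split on the first two spaces only, so name keeps any whitespace.
    type_, cid, name = row.rstrip("\n").split(" ", 2)
    return Entry(int(type_), cid, name)


def container_cid(entries: list[Entry]) -> str:
    # SHA-256 is assumed here; the hash algorithm is not stated above.
    body = serialize_container(entries).encode("utf-8")
    return hashlib.sha256(body).hexdigest()
```

Because the rows are sorted before hashing, the same set of entries yields the same CID regardless of the order in which the entries were collected.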

For example, given the folder hierarchy:

```
PID_1           <- dbc15
├── A           <- ad5eb
│   ├── a1.txt
│   └── a2.txt
└── B           <- cc08d
    └── b1.csv
```

The following `container` entries are created (`cid` values are truncated):

Container `ad5eb`:
```
1 10fbd a1.txt
1 c880c a2.txt
```

Container `cc08d`:
```
1 00e99 b1.csv
```

Container `dbc15`:
```
0 ad5eb A
0 cc08d B
```

The hashstore entry for `PID_1` might be:
```
$ cat refs/pids/53/b2/f2/58a2f3061a7bee4ba8b157aab217795c4692e2a2d8856e2fd97eb7fa3f
dbc1516e49e7437ea441f279570d32b1e2f149c44ab0a77682629215f4a5970b

$ cat refs/cids/db/c1/51/6e49e7437ea441f279570d32b1e2f149c44ab0a77682629215f4a5970b
PID_1
```
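
The reference paths in the listing above appear to split a hex digest into three two-character directories followed by the remainder. A minimal sketch of that layout, assuming a depth of 3 and a width of 2 (read off the example paths, not taken from the hashstore configuration), with a hypothetical helper named `shard`:

```python
from pathlib import PurePosixPath


def shard(digest: str, depth: int = 3, width: int = 2) -> PurePosixPath:
    """Split a hex digest into `depth` directories of `width` characters.

    Hypothetical helper; depth and width are inferred from the example paths.
    """
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    parts.append(digest[depth * width:])  # remainder becomes the file name
    return PurePosixPath(*parts)


cid = "dbc1516e49e7437ea441f279570d32b1e2f149c44ab0a77682629215f4a5970b"
ref_path = PurePosixPath("refs/cids") / shard(cid)
print(ref_path)
```

For the CID shown above, this yields the same `refs/cids/db/c1/51/...` path as in the listing. The path under `refs/pids` has the same shape, but the sketch does not attempt to reproduce how the PID is turned into a digest.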

Each container is resolvable by the combination of PID and path. For example,
the folder `B` within the context of `PID_1` can be resolved using the identifier `PID_1 B`.
Similarly, the file `A/a2.txt` can be resolved with the identifier `PID_1 A/a2.txt`.
Corresponding entries in hashstore `refs/pids` and `refs/cids` are created.
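
To show how such an identifier maps onto the container structure, the sketch below walks the example containers in memory. It is illustrative only: since `refs/pids` entries are created for these identifiers, the real lookup may well be direct rather than a traversal. The names `CONTAINERS`, `PID_REFS`, and `resolve` are hypothetical, and splitting at the first delimiter assumes the PID itself does not contain one.

```python
# Example containers from above, keyed by (truncated) CID.
CONTAINERS = {
    "dbc15": [(0, "ad5eb", "A"), (0, "cc08d", "B")],
    "ad5eb": [(1, "10fbd", "a1.txt"), (1, "c880c", "a2.txt")],
    "cc08d": [(1, "00e99", "b1.csv")],
}
PID_REFS = {"PID_1": "dbc15"}  # PID -> root container CID


def resolve(path_pid: str, delimiter: str = " ") -> tuple[int, str]:
    """Return (type, cid) for an identifier such as 'PID_1 A/a2.txt'."""
    pid, _, path = path_pid.partition(delimiter)
    entry_type, cid = 0, PID_REFS[pid]  # the root is always a folder
    for segment in filter(None, path.split("/")):
        if entry_type != 0:
            raise KeyError(path_pid)  # cannot descend into a file
        for child_type, child_cid, name in CONTAINERS[cid]:
            if name == segment:
                entry_type, cid = child_type, child_cid
                break
        else:
            raise KeyError(path_pid)  # no entry with that name
    return entry_type, cid


print(resolve("PID_1 B"))
print(resolve("PID_1 A/a2.txt"))
```

An unknown PID or path segment raises `KeyError` in this sketch.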

## Operations

### Get an object by path

Given a PID and a path, retrieve the corresponding object (file or folder) from hashstore.

Persistent identifiers for objects within a folder hierarchy are constructed by concatenating the PID with the path using a space as a delimiter. For example, to retrieve the object at path `data/file1.txt` within the folder hierarchy identified by PID `abc123`, the identifier would be `abc123 data/file1.txt`.

```
hashstore = HashStore(...)
path_pid = "<PID>" + " " + "<path>"
object_stream = hashstore.retrieve_object(path_pid)
```

### Store a new folder hierarchy

To store a new folder hierarchy, recursively create `container` entries for each folder in the hierarchy, starting from the leaves and working up to the root. For each folder, create a `container` with entries for its subfolders and files, compute the CID for the container, and store it in hashstore. Finally, associate the root container's CID with the PID representing the entire folder hierarchy.

This is achieved by the `hashstore.store_folder()` method.

```
hashstore = HashStore(...)
pid = "<PID>"
source_path = "<local_folder_path>"
hashstore.store_folder(pid, source_path)
```
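
The leaves-first construction described above can be sketched independently of hashstore. This is not the body of `store_folder()`: `file_cid` and `build_containers` are hypothetical names, SHA-256 is an assumed hash algorithm, containers are collected in a plain dictionary rather than written to a store, and `name` is used as an extra sort tie-breaker.

```python
import hashlib
from pathlib import Path


def file_cid(path: Path) -> str:
    # Hash file bytes in chunks so large files are not loaded into memory.
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_containers(folder: Path, containers: dict[str, str]) -> str:
    """Fill `containers` (cid -> serialized rows) and return the folder's CID."""
    rows = []
    for child in folder.iterdir():
        if child.is_dir():
            # Recurse first: a subfolder's CID is needed for this folder's row.
            rows.append((0, build_containers(child, containers), child.name))
        elif child.is_file():
            rows.append((1, file_cid(child), child.name))
    rows.sort()  # by type, then cid, then name (name is an assumed tie-breaker)
    body = "".join(f"{t} {cid} {name}\n" for t, cid, name in rows)
    cid = hashlib.sha256(body.encode("utf-8")).hexdigest()
    containers[cid] = body
    return cid
```

The returned value is the root container's CID, which is what would be associated with the PID. Changing any file in the tree changes the CID of every container on the path from that file up to the root.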

### Retrieve folder hierarchy structure

To retrieve the structure of a folder hierarchy identified by a PID, recursively resolve each `container` starting from the root PID. For each folder, read its `container` entries to identify subfolders and files, and continue resolving subfolders until the entire hierarchy is reconstructed.

This is achieved by the `hashstore.retrieve_folder()` method.

```
hashstore = HashStore(...)
pid = "<PID>"
destination_path = "<local_folder_path>"
hashstore.retrieve_folder(pid, destination_path)
```
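
The recursive resolution described above can be sketched over serialized containers held in memory. This is an illustration under assumptions, not the body of `retrieve_folder()`: the `list_paths` name is hypothetical, and the `containers` mapping stands in for reading container files from a store.

```python
from collections.abc import Iterator

# Serialized example containers from earlier in this document, keyed by CID.
containers = {
    "dbc15": "0 ad5eb A\n0 cc08d B\n",
    "ad5eb": "1 10fbd a1.txt\n1 c880c a2.txt\n",
    "cc08d": "1 00e99 b1.csv\n",
}


def list_paths(cid: str, containers: dict[str, str], prefix: str = "") -> Iterator[str]:
    """Yield every path below the container `cid`; folders end with '/'."""
    for row in containers[cid].splitlines():
        # name consumes the remainder of the row, so split at most twice.
        type_, child_cid, name = row.split(" ", 2)
        path = f"{prefix}{name}"
        if type_ == "0":
            yield path + "/"
            yield from list_paths(child_cid, containers, path + "/")
        else:
            yield path


print(list(list_paths("dbc15", containers)))
# ['A/', 'A/a1.txt', 'A/a2.txt', 'B/', 'B/b1.csv']
```

Writing the hierarchy to a destination folder would follow the same traversal, creating a directory for each folder entry and copying object bytes for each file entry.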


---

## `add`

`add(PID:str, path:pathlib.Path)->None`

Add an object or folder to `vhs`.


## `init`

`init(path:pathlib.Path)->None`

Initializes a `vhs` folder within the current folder.


## `status`

`status()->VhsStatus`

Reports the status of entries in the `vhs` relative to the current state of the
registered content.


## `update`

`update(PID:str|None)`

Recalculates CID values based on the current content of registered entries.


## `commit`

`commit()`

Makes entries in the `vhs` immutable, preventing any further updates to existing
PIDs. Any further changes require a new PID.
