Add dataset split methodology: SplitMetadata infrastructure, benchmark split enforcement, and CLI split commands #18

Draft
Copilot wants to merge 2 commits into main from copilot/fix-dataset-split-methodology

Copilot AI commented Apr 1, 2026

The benchmark workflow had no train/val/test separation, making it trivial to overfit profiles to benchmark images and cherry-pick results. The Anti-DreamBooth set_C split was documented as "holdout metadata" rather than being actively used for held-out evaluation.

Core split infrastructure (src/auralock/core/splits.py)

  • SplitType enum (TRAIN, VALIDATION, TEST, DEVELOPMENT)
  • SplitMetadata dataclass — tracks image_ids, split_method, deterministic split_hash, split_ratio, random_seed; includes verify_no_leakage() and full JSON serialization
  • compute_split_hash() — short SHA-256 fingerprint over sorted image IDs + seed, enabling tamper detection
  • create_random_split() — seeded, leakage-free train/val/test partitioning with full metadata
  • save_split_manifest() / load_split_manifest() — JSON persistence for reproducible split assignments
  • validate_split_manifest() — checks cross-split overlap and hash integrity, returns list of issues
  • warn_non_test_split() — emits UserWarning when benchmarking on non-TEST splits
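
To make the hashing and partitioning ideas above concrete, here is a minimal standalone sketch. It is not the code in src/auralock/core/splits.py: the function names (sketch_split_hash, sketch_random_split, sketch_find_leakage), the 16-character hash length, the payload layout, and the plain-dict return type are all assumptions chosen for illustration.

```python
import hashlib
import random


def sketch_split_hash(image_ids, seed, length=16):
    """Short SHA-256 fingerprint over sorted image IDs plus the seed.

    Assumption: the payload layout and the 16-character truncation are
    illustrative, not the project's actual format.
    """
    payload = "\n".join(sorted(image_ids)) + f"\nseed={seed}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


def sketch_random_split(image_ids, ratios=(0.7, 0.15, 0.15), seed=42):
    """Seeded, disjoint train/validation/test partition."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    # Deduplicate and sort first so the result depends only on the set of
    # IDs and the seed, not on the order the caller supplied them in.
    ids = sorted(set(image_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder, so no ID is dropped
    }


def sketch_find_leakage(splits):
    """Return the set of IDs that appear in more than one split."""
    seen, leaked = set(), set()
    for members in splits.values():
        leaked |= seen & set(members)
        seen |= set(members)
    return leaked
```

Because the hash is computed over sorted IDs, reordering the input does not change the fingerprint, while adding, removing, or renaming an image (or changing the seed) does. That is what lets a stored hash serve as a tamper check.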

Placed in core/ (not benchmarks/) to avoid the circular import services → benchmarks → lora → services. benchmarks/splits.py is a re-export shim.

Benchmark enforcement (src/auralock/services/protection.py)

  • BenchmarkSummary gains split_metadata: SplitMetadata | None = None; serialized in to_report_dict()
  • benchmark_file() — when split_metadata provided: raises ValueError if image not in split, warns on non-TEST
  • benchmark_directory() — filters images to declared split members, warns on excluded images and non-TEST splits
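Example usage:

from pathlib import Path
from auralock.core.splits import SplitType, create_random_split, save_split_manifest

# image_paths, dataset_dir, and service are assumed to be defined by the caller
splits = create_random_split(image_paths, random_seed=42)
save_split_manifest(splits, Path("splits.json"))

# Later: benchmark only the held-out test images
test_split = splits[SplitType.TEST]
summary = service.benchmark_directory(dataset_dir, profiles=("balanced",), split_metadata=test_split)
# benchmark_directory() warns (UserWarning) if split_type != TEST and skips images outside the split;
# benchmark_file() raises ValueError for an image outside the split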
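
The enforcement rules described for benchmark_file() and benchmark_directory() can be sketched independently of the service class. This is an illustration only: the helper names below are hypothetical, split types are plain strings rather than the SplitType enum, and image IDs are assumed to be file names, which the PR does not state.

```python
import warnings
from pathlib import Path


def sketch_filter_to_split(image_paths, split_ids, split_type):
    """Directory-style enforcement: keep declared members, warn on the rest.

    Assumption: an image's ID is its file name.
    """
    if split_type != "test":
        warnings.warn(
            f"Benchmarking on the '{split_type}' split; results may be biased.",
            UserWarning,
            stacklevel=2,
        )
    members = set(split_ids)
    kept = [p for p in image_paths if Path(p).name in members]
    excluded = [p for p in image_paths if Path(p).name not in members]
    if excluded:
        warnings.warn(
            f"{len(excluded)} image(s) outside the declared split were skipped.",
            UserWarning,
            stacklevel=2,
        )
    return kept


def sketch_require_member(image_path, split_ids):
    """Single-file enforcement: refuse an image that is not in the split."""
    if Path(image_path).name not in set(split_ids):
        raise ValueError(f"{image_path} is not part of the declared split")
```

The asymmetry mirrors the description above: a single file outside the split is treated as a caller error, while a directory is filtered so that one manifest can be applied to a folder that also holds train and validation images.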

CLI split commands (src/auralock/cli.py)

New auralock split sub-command group:

  • auralock split create <dir> --output splits.json [--train-ratio 0.7 --val-ratio 0.15 --test-ratio 0.15 --seed 42]
  • auralock split validate splits.json — exits 1 with details on leakage or hash mismatch

auralock benchmark gains --split-manifest and --split-type options; prints a non-test-split bias warning to stdout before running.
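
As a rough picture of the command shape, the sketch below models the split sub-command group with argparse. The PR does not say which CLI framework src/auralock/cli.py uses, so argparse is a stand-in. Treating the bracketed ratio and seed values as defaults is also an assumption, and build_parser and exit_code_for are hypothetical names.

```python
import argparse


def build_parser():
    """Build a parser mirroring `auralock split create|validate`."""
    parser = argparse.ArgumentParser(prog="auralock")
    commands = parser.add_subparsers(dest="command", required=True)

    split = commands.add_parser("split", help="dataset split utilities")
    split_commands = split.add_subparsers(dest="split_command", required=True)

    create = split_commands.add_parser("create", help="write a split manifest")
    create.add_argument("directory")
    create.add_argument("--output", required=True)
    # Assumption: the values shown in the PR text are the defaults.
    create.add_argument("--train-ratio", type=float, default=0.7)
    create.add_argument("--val-ratio", type=float, default=0.15)
    create.add_argument("--test-ratio", type=float, default=0.15)
    create.add_argument("--seed", type=int, default=42)

    validate = split_commands.add_parser("validate", help="check a manifest")
    validate.add_argument("manifest")
    return parser


def exit_code_for(issues):
    """Print each validation issue and return 1 if any exist, else 0."""
    for issue in issues:
        print(issue)
    return 1 if issues else 0
```

A caller would pass the list returned by the validation step to exit_code_for() and hand the result to sys.exit(), which gives the "exits 1 with details" behavior described for split validate.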

Anti-DreamBooth set_C note

Corrected the misleading "set_C is preserved as holdout metadata" note to explicitly document set_C as the held-out validation split for measuring out-of-sample protection effectiveness.

…nforcement, CLI split commands

Agent-Logs-Url: https://github.com/VoDaiLocz/Lock-ART./sessions/8c1bc471-0b09-4dd2-9406-3208cb4888c3

Co-authored-by: VoDaiLocz <88762074+VoDaiLocz@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix dataset split methodology risks and enforce separation" to "Add dataset split methodology: SplitMetadata infrastructure, benchmark split enforcement, and CLI split commands" on Apr 1, 2026
Copilot AI requested a review from VoDaiLocz April 1, 2026 13:38