Add benchmark-integrity audit issue pack with evidence-backed remediation criteria #6

Draft: Copilot wants to merge 3 commits into main from copilot/audit-benchmark-issues

Copilot AI commented Mar 21, 2026

This PR adds a strict benchmark audit artifact focused on anti-mimicry evaluation quality. It translates the identified risk areas into ready-to-file GitHub issues with concrete repo evidence and measurable upgrade criteria.

  • Scope: benchmark audit issue set

    • Added docs/BENCHMARK_AUDIT_ISSUES.md with 7 issue drafts covering:
      • proxy metric validity
      • benchmark framing accuracy
      • missing ground-truth LoRA/DreamBooth evaluation
      • robustness transform gaps (incl. JPEG/crop)
      • dataset split leakage risk
      • reproducibility metadata gaps
      • weak comparative methodology
  • Issue format standardization

    • Each draft includes:
      • Title
      • What is wrong
      • Why it misleads
      • Evidence from repo (file/line references)
      • Proposed benchmark upgrade
      • Acceptance criteria
      • Labels
  • Traceability note

    • Added a header note to re-verify line references at filing time to account for line drift across commits.
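The re-verification step in the traceability note could be partly automated. The following is a minimal sketch, not part of the PR: it assumes references are written as backticked `path:start-end` (or `path:line`) strings, as in the excerpt format, and the function name `find_stale_refs` is hypothetical. It only flags missing files and ranges that fall outside the current file length; a range that still exists but now points at different code still needs a manual check.

```python
import re
from pathlib import Path

# Matches backticked references such as `src/pkg/module.py:217-231` or `src/pkg/module.py:42`.
REF_PATTERN = re.compile(r"`([\w./-]+\.\w+):(\d+)(?:-(\d+))?`")


def find_stale_refs(markdown_text, repo_root):
    """Return (reference, reason) pairs for references that no longer resolve."""
    stale = []
    for match in REF_PATTERN.finditer(markdown_text):
        rel_path = match.group(1)
        start = int(match.group(2))
        # A single-line reference is treated as a one-line range.
        end = int(match.group(3)) if match.group(3) else start
        ref = match.group(0).strip("`")

        target = Path(repo_root) / rel_path
        if not target.is_file():
            stale.append((ref, "file not found"))
            continue

        line_count = len(target.read_text(encoding="utf-8").splitlines())
        if start < 1 or end < start or end > line_count:
            stale.append((ref, f"range outside 1-{line_count}"))
    return stale
```

Running this over `docs/BENCHMARK_AUDIT_ISSUES.md` from the repository root before filing would list the references that need attention first.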

Example excerpt format used in the new document:

## 4) Robustness suite misses common purification transforms

**Title**
`Robustness testing gap: no JPEG compression or crop transforms in protection-readability benchmark`

**Evidence from the repo**
- Current transform suite: identity, blur, resize(0.75), resize(0.5): `src/auralock/core/style.py:217-231`

**Acceptance criteria**
- [ ] Transform suite includes JPEG and crop variants by name.
- [ ] Protection report contains per-transform metrics for all added transforms.
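The two acceptance criteria in this excerpt lend themselves to an automated check. The sketch below is illustrative only: the report layout (a `per_transform` mapping from transform name to a metrics dict) and the family names `jpeg` and `crop` are assumptions, not the repository's actual schema.

```python
def missing_transform_families(report, required=("jpeg", "crop")):
    """Return required transform families that have no per-transform metrics.

    A family counts as covered when at least one transform whose name starts
    with the family name (e.g. "jpeg_q75") has a non-empty metrics dict.
    The "per_transform" key is a hypothetical report layout.
    """
    per_transform = report.get("per_transform", {})
    return [
        family
        for family in required
        if not any(
            name.startswith(family) and metrics
            for name, metrics in per_transform.items()
        )
    ]


# Usage: a report that names a crop variant but records no metrics for it.
example_report = {
    "per_transform": {
        "identity": {"score": 0.9},
        "jpeg_q75": {"score": 0.7},
        "crop_center": {},
    }
}
print(missing_transform_families(example_report))
```

A test built on a check like this would fail until both criteria are met, which keeps the checklist from being ticked on the strength of transform names alone.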

Original prompt

This section details the original issue you should resolve.

<issue_title>audit benchmark</issue_title>
<issue_description>Analyze this repository specifically for benchmark integrity and anti-mimicry evaluation quality.

Create GitHub issues for:

  • proxy metrics that may not reflect real anti-mimicry strength
  • misleading benchmark framing
  • missing ground-truth LoRA/DreamBooth evaluation
  • weak robustness testing (resize, JPEG, blur, crop)
  • dataset split risks
  • missing reproducibility details
  • weak comparison methodology

For each issue, include:

  • Title
  • What is wrong with the benchmark
  • Why it can mislead users
  • Evidence from the repo
  • Proposed benchmark upgrade
  • Acceptance criteria
  • Labels

Be strict and technical.
</issue_description>

Comments on the Issue (you are @copilot in this section)


Copilot AI changed the title from "[WIP] Create GitHub issues for benchmark integrity evaluation" to "Add benchmark-integrity audit issue pack with evidence-backed remediation criteria" on Mar 21, 2026
Copilot AI requested a review from VoDaiLocz March 21, 2026 06:40
Development

Successfully merging this pull request may close these issues.

audit benchmark

2 participants