Skip to content

fix(compaction): tighten memory rules, fix history integrity, honest event flags#1273

Merged
njbrake merged 6 commits intomainfrom
fix/compaction-quality
May 8, 2026
Merged

fix(compaction): tighten memory rules, fix history integrity, honest event flags#1273
njbrake merged 6 commits intomainfrom
fix/compaction-quality

Conversation

@njbrake
Copy link
Copy Markdown
Member

@njbrake njbrake commented May 8, 2026

Description

Bundles the four sub-tasks from #1243 plus the broken-tool memory concern surfaced this week, when the CompanyCam photo-search bug (now fixed in v0.4.1) was persisted into a user's MEMORY.md as a durable "fact" by the compaction LLM.

Reframes the whole compaction layer around the principle that Clawbolt is not the system of record. The contractor's authoritative data lives in their integrations:

Source of truth What it owns
QuickBooks customers, contacts, invoices, estimates, items, payments
CompanyCam projects, addresses, photos, project status
AppFolio work orders, tenant info, vendor jobs
Heartbeat / Calendar time-bounded reminders
Drive saved files, receipt images

Memory exists for cross-system knowledge that lives nowhere else: pricing rules, cross-system relationships, disambiguation guidance, communication conventions. Anything an integration can answer live, the agent should look up live.

What changed (per #1243 sub-task)

1. HISTORY.md append integrity (memory_db.py, compaction.py)

  • MemoryStore.append_history now returns the row's new full plaintext under the same row-level lock that wrote it. Compaction's audit-snapshot path uses that return value instead of reconstructing the snapshot from a .strip()'d read of pre-write text. The reconstruction silently dropped the trailing newline that the DB write actually retains, so two consecutive history entries showed up jammed together in the audit even when they were separated in the live row.
  • Append also guarantees a newline separator at the entry boundary even if the stored text lacks one (e.g. a manual edit, or pre-guarantee data).
  • Regression tests in test_memory.py and test_compaction.py.

2. Reduce compaction prompt noise (compaction.py)

  • New _strip_assistant_noise pre-processing pass strips URLs from assistant content before sending to the compactor. Deep links to CompanyCam photos and QBO drafts no longer eat context budget.
  • User content is left untouched: contractor-pasted URLs are usually intentional.
  • Two regression tests pin the asymmetric behavior.

3. Tighten durable-memory rules (prompts/compaction.md) + broken-tool memory concern

  • Full prompt rewrite around the system-of-record principle. Explicit include / exclude lists. Memory must not duplicate facts that already live in QBO / CompanyCam / AppFolio / Drive / heartbeat.
  • Explicit prune rule on rewrite: drop items that have moved into a source-of-truth integration; drop completed or expired reminders; drop transient bug notes (a bug that was true yesterday is not a durable fact).
  • HISTORY.md format guidance: breadcrumbs over numbers, pointers over deep links.

4. Honest event flags (compaction.py)

  • memory_updated / user_profile_updated / soul_updated now reflect real persisted diffs. An LLM that echoes the existing memory verbatim no longer flips the flag to True.
  • The persisted CompactionEvent row, the compaction.summary structured log line, and the admin "memory updated" indicator are all now honest about whether anything actually changed.
  • Two regression tests cover the echo case.

Type

  • Feature
  • Bug fix
  • Refactor
  • Test
  • CI/CD
  • Documentation

Checklist

  • Tests pass (uv run pytest -v) — 2466 passed, 2 skipped
  • Lint passes (ruff check backend/ && ruff format --check backend/)
  • New tests added for new functionality
  • Bug fixes include regression tests

New regression tests:

  • test_format_messages_strips_urls_from_assistant_content
  • test_format_messages_keeps_user_content_urls
  • test_compact_session_does_not_write_when_memory_unchanged
  • test_compact_session_summary_log_marks_memory_unchanged_when_llm_echoes
  • test_compaction_history_snapshot_matches_db_after_append
  • test_append_history_returns_new_full_text
  • test_append_history_inserts_separator_when_existing_lacks_newline

What this does NOT do

  • The "snapshot has entry, DB does not" symptom in Fix compaction history integrity and reduce low-signal compaction noise #1243 sub-task 1 is not directly diagnosed; I could not reproduce it from code-reading alone. The append-path hardening above removes the most plausible cause (the snapshot/DB mismatch on the trailing-newline boundary). If it recurs, we now have append_history's return-value as the source of truth for the snapshot, which closes the read-after-write race window.
  • Existing polluted MEMORY.md content (e.g. customer IDs, invoice line items already in users' memory) is not migrated. The next compaction will gradually prune under the new rules. If you want immediate cleanup, run an admin-triggered compaction with a steering note after this lands.

AI Usage

  • AI-assisted (designed and drafted by Claude via discussion with @njbrake; design pass walked through Jesse's actual MEMORY.md against the system-of-record principle before implementation; reasoning and design decisions are mine, prose is Claude's.)
  • No AI used

Fixes #1243

njbrake and others added 2 commits May 8, 2026 10:49
…event flags

Bundles the four sub-tasks from #1243 plus the broken-tool memory
concern surfaced when the CompanyCam search bug got persisted into a
user's MEMORY.md as a "fact."

Reframes the compaction prompt around the principle that Clawbolt is
not the system of record. QuickBooks owns customers, contacts,
invoices, estimates. CompanyCam owns projects, addresses, photos.
AppFolio owns work orders. Heartbeat / Calendar own time-bounded
reminders. Drive owns saved files. Memory exists for cross-system
knowledge that lives nowhere else: pricing rules, cross-system
relationships, disambiguation guidance, communication conventions.

Specific changes:

1. Rewrite prompts/compaction.md with explicit include / exclude
   lists and a prune rule. Forbid: source-of-truth duplicates
   (customer IDs, emails, invoice line items, project addresses),
   transient tool-call failures and "X is broken" notes, completed
   reminders. Tighten HISTORY.md rules: breadcrumbs over numbers.

2. Pre-process assistant content with a URL-stripping pass before
   sending to the compactor, so deep links to CompanyCam photos and
   QBO drafts no longer eat context budget. User content untouched
   since contractor-pasted URLs are intentional.

3. Make MemoryStore.append_history return the row's new full
   plaintext under the same row-level lock that wrote it, and
   guarantee a newline separator at the entry boundary even if the
   stored text was missing the trailing newline. Compaction's audit
   snapshot now records what actually landed in the DB instead of
   reconstructing from a .strip()'d read that silently dropped the
   separator.

4. Make memory_updated / user_profile_updated / soul_updated reflect
   real persisted diffs. An LLM that echoes the existing memory
   verbatim no longer flips the flag to True. The persisted event
   row, the structured-summary log, and the admin "memory updated"
   indicator are all now honest.

Fixes #1243

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…values

Two follow-ups to the review of #1273:

1. Re-add the explicit "preserve every existing field" guard to the
   USER.md guidance. The previous prompt revision dropped it; without
   it, an LLM rewriting USER.md while updating one field could
   silently trim others.

2. Promote the underlying durable-memory rule from "the integration
   owns this fact" to "the integration can change this fact without
   telling Clawbolt." Memorizing a value the source system can mutate
   creates a stale-cache risk that's worse than just inefficient: a
   memorized phone number, email, status, or amount can become
   actively wrong by the next read.

The worked example anchors the rule with a concrete archetype:
AppFolio rotates tenant contact phone numbers every few days for
privacy. A memorized number quoted back next week now belongs to a
different tenant, and the contractor calls a stranger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
njbrake and others added 4 commits May 8, 2026 12:00
Apply the OSS AGENTS.md "Editing prompt files" guidance to the
compaction prompt (107 -> 75 lines, same rules). Specifically:

- Promote the stale-cache rule and AppFolio worked example into the
  Operating principle. They're the most important rules in the
  document and were buried halfway down.
- Merge the standalone "Maintaining MEMORY.md" section into the
  MEMORY.md section. The prune rules were stated twice.
- Flatten the section nesting. ``## What goes in each file`` with
  ``### user_profile``, ``### memory``, ``### soul`` subsections is
  more structure than the document needs.
- Drop the redundant dispatch reminder at the end. The USER.md and
  MEMORY.md sections already say where each kind of fact lives.
- Drop the parenthetical "(the agent should look these up...)" that
  duplicated what the source-of-truth table already states.

Also tighten the no-op-echo comment in compaction.py from six lines
to three. The "why" (honest event flags) is enough; the enumeration
of downstream consumers (persisted flag, summary log, admin UI) was
restating what's evident from where the flags get used.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The slim pass left real customer / contact names from a production
user's MEMORY.md inline as worked examples (Arbors, Wilham, Brett
Rentschler, Surman, paula@..., $7,413.49 estimate, txnId=544).
Those are real businesses and people Jesse works with; they should
not ship as durable values inside the system prompt of a public
codebase.

Replaced with placeholder slots (<Client>, <Tenant>, <Property Mgmt>)
and generic numbers ($X,XXX, txnId=NNN). The patterns are identical;
the LLM still gets the same shape signal without us shipping real
relationships in the OSS repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… user-explicit save override

Two prompts touch the same files but were drifting in different
directions. The base agent prompt told Marshall to write "client
names and contact info, pricing history, supplier details" into
MEMORY.md. The new compaction prompt says: those live in the
integrations and don't belong in memory. The base agent was
actively writing things the compactor was supposed to drop.

Aligning the two:

- ``instructions.md``: rewrite the MEMORY.md / USER.md descriptions
  to match the system-of-record framing. Drop "client names and
  contact info" from the include list; explicitly call out customer
  contact details, invoice contents, project addresses, and
  work-order state as do-not-write.
- ``instructions.md``: new section "Remember this requests" that
  tells Marshall how to handle explicit user save requests. Honor
  them, but flag staleness risk on values that mutate upstream
  (rotating phone numbers being the canonical case) and warn against
  duplicating canonical-in-integration values.
- ``instructions.md``: new short section explaining how Marshall
  answers "how do you remember things?" when asked. Honest, brief,
  do-not-volunteer.
- ``compaction.md``: corresponding override clause: an explicit user
  save request in the conversation overrides the exclusion rules,
  even when the value overlaps with what an integration owns. The
  base agent owns the staleness warning; by the time the conversation
  reaches the compactor, an explicit save is intentional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strip the inline "e.g." illustrative examples from both prompts.
The category names alone carry the rule; the worked-quote shapes
were padding the bullet list. One disambiguating note kept on
"Cross-system relationships" since the term itself benefits from
a brief definition. The standalone AppFolio worked-example block
in compaction.md stays as it anchors the most important rule
(stale-cache risk) with a concrete failure mode.

Also strips a similar e.g. from instructions.md "Multi-field
tasks" that pre-dated this PR but matched the same pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@njbrake njbrake merged commit 071edc2 into main May 8, 2026
8 checks passed
@njbrake njbrake deleted the fix/compaction-quality branch May 8, 2026 12:43
njbrake added a commit that referenced this pull request May 8, 2026
…#1280)

* feat(agent): bounded-growth policies for agent-managed markdown files

Every agent-managed markdown surface (USER.md, SOUL.md, HEARTBEAT.md,
MEMORY.md, HISTORY.md, BOOTSTRAP.md) now has an explicit byte budget
declared in a single source-of-truth registry and enforced at the
write boundary. Without these bounds, a long-lived user's working
memory could grow without limit, bloating every system prompt and
degrading both cost and response quality.

Storage and prompt-injection paths are protected:

* Workspace write_file / edit_file reject over-budget content with a
  clean ToolResult error so the agent can rewrite smaller. HISTORY.md
  rejects write_file / edit_file outright so the append-only invariant
  from #1273 cannot be bypassed.
* memory_db write_*_async raise BudgetExceededError; compaction
  catches and logs, leaving the previous file in place rather than
  crashing or corrupting durable state.
* memory_db.append_history applies windowed FIFO drop on overflow so
  HISTORY.md storage stays bounded across long-lived users.
* System prompt builders tail-truncate over-budget legacy rows so a
  pre-cap row cannot poison every prompt forever.

Cap is a uniform 25 KiB across surfaces, matching Claude Code's
MEMORY.md cap and the existing compaction_event_snapshot per-file
audit cap (so an in-budget surface always fits in audit rows without
truncation).

Fixes #1244

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(agent): tighten bounded-growth tests and logging

Self-review follow-ups for #1280:

* Tighten history-entry boundary regex from ``^\[`` to
  ``^\[\d{4}-\d{2}-\d{2}`` so a multi-line summary whose body line
  happens to start with ``[`` is not mis-split into two entries.
* Strengthen the FIFO drop test to assert per-iteration sentinels
  (``iter=0000`` dropped, ``iter=0199`` retained). The old assertion
  used cycling timestamps that re-introduced earlier dates and was
  effectively tautological.
* Add a BOOTSTRAP.md test for the disk path through write_file,
  and an under-budget scratch-file test that locks in the
  registry-is-opt-in behavior for unknown markdown surfaces.
* Demote ``markdown_registry: truncated ...`` log from WARNING to
  INFO. The truncation event fires every turn for an affected user
  until the agent rewrites the file; a per-turn WARNING crowds out
  actionable signals. INFO still surfaces the event to operators
  searching for it.
* Promote observability from a one-line out-of-scope note to its
  own section in ``docs/markdown_growth_policies.md`` covering the
  current state (compaction WARNING on rejection, registry INFO on
  truncation, ``compaction_events`` audit rows) and the shape of a
  follow-up dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(agent): close PUT /user/profile bypass for bounded-growth cap

Second-pass review of #1280 caught that ``PUT /api/user/profile``
accepts ``user_text`` / ``soul_text`` / ``heartbeat_text`` and writes
them via raw ``setattr`` without going through the registry. The
dashboard editor (or a premium admin) could push an over-budget blob
that the agent paths now reject. The read-side
``truncate_for_injection`` clamps it for prompts, but the asymmetry
breaks the cap as a "what the LLM sees" guarantee.

* ``update_profile`` now validates each string field against
  ``assert_column_within_budget`` before any commit and returns 413
  with the registry's actual / allowed sizes.
* Three regression tests cover ``user_text`` / ``soul_text`` /
  ``heartbeat_text`` rejection and one sanity test covers the
  under-budget happy path.
* Drive-by: switch the 413 status constant in ``user_memory.py`` and
  ``user_profile.py`` from the deprecated
  ``HTTP_413_REQUEST_ENTITY_TOO_LARGE`` to ``HTTP_413_CONTENT_TOO_LARGE``
  to silence the Starlette deprecation warning the new tests
  surfaced.
* Document ``_huge()``'s ASCII assumption so a future refactor does
  not silently change byte-size semantics in the integration tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix compaction history integrity and reduce low-signal compaction noise

1 participant