fix(compaction): tighten memory rules, fix history integrity, honest event flags#1273
Merged
fix(compaction): tighten memory rules, fix history integrity, honest event flags#1273
Conversation
…event flags Bundles the four sub-tasks from #1243 plus the broken-tool memory concern surfaced when the CompanyCam search bug got persisted into a user's MEMORY.md as a "fact." Reframes the compaction prompt around the principle that Clawbolt is not the system of record. QuickBooks owns customers, contacts, invoices, estimates. CompanyCam owns projects, addresses, photos. AppFolio owns work orders. Heartbeat / Calendar own time-bounded reminders. Drive owns saved files. Memory exists for cross-system knowledge that lives nowhere else: pricing rules, cross-system relationships, disambiguation guidance, communication conventions. Specific changes: 1. Rewrite prompts/compaction.md with explicit include / exclude lists and a prune rule. Forbid: source-of-truth duplicates (customer IDs, emails, invoice line items, project addresses), transient tool-call failures and "X is broken" notes, completed reminders. Tighten HISTORY.md rules: breadcrumbs over numbers. 2. Pre-process assistant content with a URL-stripping pass before sending to the compactor, so deep links to CompanyCam photos and QBO drafts no longer eat context budget. User content untouched since contractor-pasted URLs are intentional. 3. Make MemoryStore.append_history return the row's new full plaintext under the same row-level lock that wrote it, and guarantee a newline separator at the entry boundary even if the stored text was missing the trailing newline. Compaction's audit snapshot now records what actually landed in the DB instead of reconstructing from a .strip()'d read that silently dropped the separator. 4. Make memory_updated / user_profile_updated / soul_updated reflect real persisted diffs. An LLM that echoes the existing memory verbatim no longer flips the flag to True. The persisted event row, the structured-summary log, and the admin "memory updated" indicator are all now honest. Fixes #1243 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…values Two follow-ups to the review of #1273: 1. Re-add the explicit "preserve every existing field" guard to the USER.md guidance. The previous prompt revision dropped it; without it, an LLM rewriting USER.md while updating one field could silently trim others. 2. Promote the underlying durable-memory rule from "the integration owns this fact" to "the integration can change this fact without telling Clawbolt." Memorizing a value the source system can mutate creates a stale-cache risk that's worse than just inefficient: a memorized phone number, email, status, or amount can become actively wrong by the next read. The worked example anchors the rule with a concrete archetype: AppFolio rotates tenant contact phone numbers every few days for privacy. A memorized number quoted back next week now belongs to a different tenant, and the contractor calls a stranger. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply the OSS AGENTS.md "Editing prompt files" guidance to the compaction prompt (107 -> 75 lines, same rules). Specifically: - Promote the stale-cache rule and AppFolio worked example into the Operating principle. They're the most important rules in the document and were buried halfway down. - Merge the standalone "Maintaining MEMORY.md" section into the MEMORY.md section. The prune rules were stated twice. - Flatten the section nesting. ``## What goes in each file`` with ``### user_profile``, ``### memory``, ``### soul`` subsections is more structure than the document needs. - Drop the redundant dispatch reminder at the end. The USER.md and MEMORY.md sections already say where each kind of fact lives. - Drop the parenthetical "(the agent should look these up...)" that duplicated what the source-of-truth table already states. Also tighten the no-op-echo comment in compaction.py from six lines to three. The "why" (honest event flags) is enough; the enumeration of downstream consumers (persisted flag, summary log, admin UI) was restating what's evident from where the flags get used. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The slim pass left real customer / contact names from a production user's MEMORY.md inline as worked examples (Arbors, Wilham, Brett Rentschler, Surman, paula@..., $7,413.49 estimate, txnId=544). Those are real businesses and people Jesse works with; they should not ship as durable values inside the system prompt of a public codebase. Replaced with placeholder slots (<Client>, <Tenant>, <Property Mgmt>) and generic numbers ($X,XXX, txnId=NNN). The patterns are identical; the LLM still gets the same shape signal without us shipping real relationships in the OSS repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… user-explicit save override Two prompts touch the same files but were drifting in different directions. The base agent prompt told Marshall to write "client names and contact info, pricing history, supplier details" into MEMORY.md. The new compaction prompt says: those live in the integrations and don't belong in memory. The base agent was actively writing things the compactor was supposed to drop. Aligning the two: - ``instructions.md``: rewrite the MEMORY.md / USER.md descriptions to match the system-of-record framing. Drop "client names and contact info" from the include list; explicitly call out customer contact details, invoice contents, project addresses, and work-order state as do-not-write. - ``instructions.md``: new section "Remember this requests" that tells Marshall how to handle explicit user save requests. Honor them, but flag staleness risk on values that mutate upstream (rotating phone numbers being the canonical case) and warn against duplicating canonical-in-integration values. - ``instructions.md``: new short section explaining how Marshall answers "how do you remember things?" when asked. Honest, brief, do-not-volunteer. - ``compaction.md``: corresponding override clause: an explicit user save request in the conversation overrides the exclusion rules, even when the value overlaps with what an integration owns. The base agent owns the staleness warning; by the time the conversation reaches the compactor, an explicit save is intentional. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strip the inline "e.g." illustrative examples from both prompts. The category names alone carry the rule; the worked-quote shapes were padding the bullet list. One disambiguating note kept on "Cross-system relationships" since the term itself benefits from a brief definition. The standalone AppFolio worked-example block in compaction.md stays as it anchors the most important rule (stale-cache risk) with a concrete failure mode. Also strips a similar e.g. from instructions.md "Multi-field tasks" that pre-dated this PR but matched the same pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
11 tasks
njbrake
added a commit
that referenced
this pull request
May 8, 2026
…#1280) * feat(agent): bounded-growth policies for agent-managed markdown files Every agent-managed markdown surface (USER.md, SOUL.md, HEARTBEAT.md, MEMORY.md, HISTORY.md, BOOTSTRAP.md) now has an explicit byte budget declared in a single source-of-truth registry and enforced at the write boundary. Without these bounds, a long-lived user's working memory could grow without limit, bloating every system prompt and degrading both cost and response quality. Storage and prompt-injection paths are protected: * Workspace write_file / edit_file reject over-budget content with a clean ToolResult error so the agent can rewrite smaller. HISTORY.md rejects write_file / edit_file outright so the append-only invariant from #1273 cannot be bypassed. * memory_db write_*_async raise BudgetExceededError; compaction catches and logs, leaving the previous file in place rather than crashing or corrupting durable state. * memory_db.append_history applies windowed FIFO drop on overflow so HISTORY.md storage stays bounded across long-lived users. * System prompt builders tail-truncate over-budget legacy rows so a pre-cap row cannot poison every prompt forever. Cap is a uniform 25 KiB across surfaces, matching Claude Code's MEMORY.md cap and the existing compaction_event_snapshot per-file audit cap (so an in-budget surface always fits in audit rows without truncation). Fixes #1244 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(agent): tighten bounded-growth tests and logging Self-review follow-ups for #1280: * Tighten history-entry boundary regex from ``^\[`` to ``^\[\d{4}-\d{2}-\d{2}`` so a multi-line summary whose body line happens to start with ``[`` is not mis-split into two entries. * Strengthen the FIFO drop test to assert per-iteration sentinels (``iter=0000`` dropped, ``iter=0199`` retained). The old assertion used cycling timestamps that re-introduced earlier dates and was effectively tautological. * Add a BOOTSTRAP.md test for the disk path through write_file, and an under-budget scratch-file test that locks in the registry-is-opt-in behavior for unknown markdown surfaces. * Demote ``markdown_registry: truncated ...`` log from WARNING to INFO. The truncation event fires every turn for an affected user until the agent rewrites the file; a per-turn WARNING crowds out actionable signals. INFO still surfaces the event to operators searching for it. * Promote observability from a one-line out-of-scope note to its own section in ``docs/markdown_growth_policies.md`` covering the current state (compaction WARNING on rejection, registry INFO on truncation, ``compaction_events`` audit rows) and the shape of a follow-up dashboard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(agent): close PUT /user/profile bypass for bounded-growth cap Second-pass review of #1280 caught that ``PUT /api/user/profile`` accepts ``user_text`` / ``soul_text`` / ``heartbeat_text`` and writes them via raw ``setattr`` without going through the registry. The dashboard editor (or a premium admin) could push an over-budget blob that the agent paths now reject. The read-side ``truncate_for_injection`` clamps it for prompts, but the asymmetry breaks the cap as a "what the LLM sees" guarantee. * ``update_profile`` now validates each string field against ``assert_column_within_budget`` before any commit and returns 413 with the registry's actual / allowed sizes. * Three regression tests cover ``user_text`` / ``soul_text`` / ``heartbeat_text`` rejection and one sanity test covers the under-budget happy path. * Drive-by: switch the 413 status constant in ``user_memory.py`` and ``user_profile.py`` from the deprecated ``HTTP_413_REQUEST_ENTITY_TOO_LARGE`` to ``HTTP_413_CONTENT_TOO_LARGE`` to silence the Starlette deprecation warning the new tests surfaced. * Document ``_huge()``'s ASCII assumption so a future refactor does not silently change byte-size semantics in the integration tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Bundles the four sub-tasks from #1243 plus the broken-tool memory concern surfaced this week, when the CompanyCam photo-search bug (now fixed in v0.4.1) was persisted into a user's MEMORY.md as a durable "fact" by the compaction LLM.
Reframes the whole compaction layer around the principle that Clawbolt is not the system of record. The contractor's authoritative data lives in their integrations:
Memory exists for cross-system knowledge that lives nowhere else: pricing rules, cross-system relationships, disambiguation guidance, communication conventions. Anything an integration can answer live, the agent should look up live.
What changed (per #1243 sub-task)
1. HISTORY.md append integrity (
memory_db.py,compaction.py)MemoryStore.append_historynow returns the row's new full plaintext under the same row-level lock that wrote it. Compaction's audit-snapshot path uses that return value instead of reconstructing the snapshot from a.strip()'d read of pre-write text. The reconstruction silently dropped the trailing newline that the DB write actually retains, so two consecutive history entries showed up jammed together in the audit even when they were separated in the live row.test_memory.pyandtest_compaction.py.2. Reduce compaction prompt noise (
compaction.py)_strip_assistant_noisepre-processing pass strips URLs from assistant content before sending to the compactor. Deep links to CompanyCam photos and QBO drafts no longer eat context budget.3. Tighten durable-memory rules (
prompts/compaction.md) + broken-tool memory concern4. Honest event flags (
compaction.py)memory_updated/user_profile_updated/soul_updatednow reflect real persisted diffs. An LLM that echoes the existing memory verbatim no longer flips the flag to True.CompactionEventrow, thecompaction.summarystructured log line, and the admin "memory updated" indicator are all now honest about whether anything actually changed.Type
Checklist
uv run pytest -v) — 2466 passed, 2 skippedruff check backend/ && ruff format --check backend/)New regression tests:
test_format_messages_strips_urls_from_assistant_contenttest_format_messages_keeps_user_content_urlstest_compact_session_does_not_write_when_memory_unchangedtest_compact_session_summary_log_marks_memory_unchanged_when_llm_echoestest_compaction_history_snapshot_matches_db_after_appendtest_append_history_returns_new_full_texttest_append_history_inserts_separator_when_existing_lacks_newlineWhat this does NOT do
append_history's return-value as the source of truth for the snapshot, which closes the read-after-write race window.AI Usage
Fixes #1243