
feat(agent): persist extended-thinking output on outbound messages#1267

Merged
njbrake merged 2 commits into main from feat/persist-llm-thinking
May 8, 2026
Conversation

Member

@njbrake njbrake commented May 7, 2026

Description

Captures the LLM's extended-thinking blocks (Anthropic ThinkingBlock) on the final agent response and persists them in a new Message.thinking_text column. Today the SDK returns thinking content alongside text and tool calls, but get_response_text extracts only the text and parse_tool_calls only the tool uses, so the reasoning stream is silently discarded. With the column populated, the assistant message row carries the reasoning that produced its body.

The motivation is the admin-side activity-pane request in clawbolt-premium#456: "give me a dropdown for each agent response that lets me expand and see the reasoning." Premium has nothing to surface today because the data isn't persisted; this PR fills that gap.

Capture point: the FINAL response in the agent loop, the one whose reply_text becomes Message.body. Earlier rounds produce tool calls, and their thinking justifies a tool decision rather than the user-visible reply; keeping the persisted record aligned with what the user saw is the load-bearing semantic for the admin view that consumes this column. Heartbeat-driven outbound messages also pick up thinking via the same AgentResponse.thinking_text field, so proactive replies are covered too.
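The capture shape described above can be sketched as follows. Everything except AgentResponse.thinking_text is a hypothetical stand-in (the real loop lives in backend/app/agent/core.py and calls the Anthropic SDK); the point is that both exit paths, clean break and max-rounds, fall through to the same capture of the last response:

```python
from dataclasses import dataclass, field


@dataclass
class LLMResponse:
    """Stand-in for one round's SDK response."""
    text: str
    thinking: str = ""
    tool_calls: list = field(default_factory=list)


@dataclass
class AgentResponse:
    reply_text: str
    thinking_text: str = ""


def run_agent(call_llm, max_rounds: int = 3) -> AgentResponse:
    response = None
    for _ in range(max_rounds):
        response = call_llm()
        if not response.tool_calls:  # clean break: this is the final response
            break
        # ... execute tools, extend the conversation, loop again ...
    # Both exit paths (clean break AND exhausting max_rounds) land here, so
    # the thinking persisted is always the one behind the user-visible reply.
    return AgentResponse(reply_text=response.text,
                         thinking_text=response.thinking)
```

Earlier rounds' thinking is deliberately dropped: it justified a tool decision, not the reply the admin view will show.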

Encryption: thinking_text rides through EncryptedString like body and tool_interactions_json. Thinking blocks routinely quote user content back at length (names, addresses, integration payloads), so they need the same at-rest treatment as the message body.
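To make the at-rest treatment concrete, here is a toy stand-in for the bind/result round-trip an encrypting column type performs. Base64 is used only so the transformation is visible; the real EncryptedString presumably applies actual keyed encryption, and its internals are not shown in this PR:

```python
import base64


class ToyEncryptedString:
    """Toy stand-in for EncryptedString's bind/result round-trip.

    NOT real encryption -- base64 only makes the stored form visibly
    different from the plaintext for illustration.
    """

    def process_bind_param(self, value: str) -> str:
        # plaintext -> value actually stored in the thinking_text column
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    def process_result_value(self, stored: str) -> str:
        # stored value -> plaintext handed back to the ORM
        return base64.b64decode(stored.encode("ascii")).decode("utf-8")
```

Because thinking blocks quote user content back, the stored form must never be the plaintext, which is exactly what routing the column through the same type as body guarantees.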

Architecture

  • alembic/versions/033_add_thinking_text_to_messages.py: new column on messages as NOT NULL with server_default="" so existing rows remain readable and raw-SQL inserts in older migration tests keep working.
  • backend/app/models.py: Message.thinking_text mapped through EncryptedString; the model declares server_default="" so Base.metadata.create_all() (used by tests) builds an identical schema to what the migration produces.
  • backend/app/agent/llm_parsing.py: new get_response_thinking() joins ThinkingBlock content across response blocks. Empty thinking strings are skipped so we don't render stray separators; the signature field is intentionally not surfaced (it has no audit value to a human reader and is only meaningful for replaying the block back to Anthropic).
  • backend/app/agent/core.py: capture from the FINAL response only, in both exit paths (clean break with no more tool calls AND the max-rounds branch). AgentResponse.thinking_text carries the value out to the persistence layer.
  • backend/app/agent/router.py + heartbeat.py: thread response.thinking_text into session_store.add_message.
  • backend/app/agent/session_db.py: add_message[_async] and add_message_by_session_id[_async] accept thinking_text kwarg; included in _MESSAGE_UPDATABLE_FIELDS so future updates stay consistent.
  • backend/app/agent/dto.py: StoredMessage.thinking_text defaults to "" so existing StoredMessage(direction=..., body=...) constructors stay compatible.
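The get_response_thinking join described above might look roughly like this. The block types are local stand-ins for the Anthropic SDK's content blocks, and the separator is an assumption; only the behaviors named in the bullets (skip empty strings, ignore the signature field) are taken from the PR:

```python
from dataclasses import dataclass


@dataclass
class ThinkingBlock:  # stand-in for the SDK's thinking content block
    thinking: str
    signature: str = ""  # intentionally not surfaced, per the PR


@dataclass
class TextBlock:  # stand-in for the SDK's text content block
    text: str


def get_response_thinking(blocks: list) -> str:
    """Join non-empty ThinkingBlock content across response blocks.

    Empty thinking strings are skipped so no stray separators are
    rendered; the separator itself is an assumption of this sketch.
    """
    parts = [b.thinking for b in blocks
             if isinstance(b, ThinkingBlock) and b.thinking]
    return "\n\n".join(parts)
```

This mirrors how get_response_text handles TextBlock content: each helper pulls only its own block type, and thinking_text is simply a third extraction alongside text and tool uses.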

Tests

  • tests/test_llm_parsing.py (5 new cases): single block, multi-block join, empty when absent, skips empty blocks, empty content list.
  • tests/test_session_db_async.py (1 new case): outbound thinking round-trips through the encryption decorator and is visible on the reloaded StoredMessage.

Full OSS suite: 2396 passed, 2 skipped, 13 deselected. ruff/format/ty all clean.

Type

  • Feature
  • Bug fix
  • Refactor
  • Test
  • CI/CD
  • Documentation

Checklist

  • Tests pass (uv run pytest -v)
  • Lint passes (ruff check backend/ && ruff format --check backend/)
  • New tests added for new functionality
  • Bug fixes include regression tests

AI Usage

  • AI-assisted (Claude Code drafted the implementation, tests, and PR text after a planning conversation; reviewed by njbrake)
  • No AI used

njbrake added 2 commits May 7, 2026 18:36
Captures the LLM's ``ThinkingBlock`` content from the final response in
the agent loop and stores it on ``Message.thinking_text``. Until now the
content was extracted by the SDK but silently discarded by
``get_response_text``, so admins debugging "why did the agent reply
this way" had no audit surface short of re-running the call.

Encrypted at rest under ``EncryptedString`` like ``body`` and
``tool_interactions_json``: thinking blocks routinely quote user
content back at length, so they need the same treatment as the
message body.

The capture point is the FINAL response in the agent loop (the one
whose ``reply_text`` becomes ``Message.body``). Earlier rounds produce
tool calls and their thinking justifies a tool decision rather than
the user-visible reply; keeping the persisted record aligned with what
the user saw is the load-bearing semantic for the admin view that
consumes this column (clawbolt-premium #456).

Heartbeat-driven outbounds also pick up thinking via the same
``AgentResponse.thinking_text`` field, so the admin activity pane
covers proactive replies too.

Migration 033 adds the column as ``NOT NULL`` with a server default of
empty string, which keeps existing rows readable and lets ORM ``insert``
calls that don't list the column (raw SQL paths in tests, the
``messages_001..`` migration end-to-end test) keep working.
Two cleanups from the self-review of #1267:

1. Migration 033 docstring claimed "Nullable / default empty" but the
   column is ``NOT NULL`` with a server default of empty string. The
   behavior is correct and intentional (raw-SQL inserts in older
   migration tests don't list the column and need the server default
   to backfill); only the docstring drifted. Brought the docstring in
   line with the actual ``op.add_column`` call.

2. The error-stop branch in ``ClawboltAgent.run`` returned
   ``AgentResponse(is_error_fallback=True, ...)`` with an implicit
   empty ``thinking_text`` even when the LLM produced thinking blocks
   before hitting the error stop. ``persist_outbound`` short-circuits
   on ``is_error_fallback`` so this rides along the in-memory response
   only today, but downstream observers (and any future policy that
   records error fallbacks for triage) now get the reasoning that led
   to the bail-out.
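The second cleanup can be sketched as follows. All names except AgentResponse's is_error_fallback and thinking_text fields are hypothetical; the point is that the error stop now forwards whatever thinking the LLM had already produced instead of defaulting it to empty:

```python
from dataclasses import dataclass


@dataclass
class AgentResponse:
    reply_text: str
    is_error_fallback: bool = False
    thinking_text: str = ""


FALLBACK_TEXT = "Sorry, something went wrong."  # illustrative copy only


def error_stop(thinking_so_far: str) -> AgentResponse:
    # Before the cleanup, this branch returned
    #     AgentResponse(FALLBACK_TEXT, is_error_fallback=True)
    # with an implicit empty thinking_text. Now the reasoning that led to
    # the bail-out rides along on the in-memory response for observers,
    # even though persist_outbound still short-circuits on the flag.
    return AgentResponse(
        FALLBACK_TEXT,
        is_error_fallback=True,
        thinking_text=thinking_so_far,
    )
```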
@njbrake njbrake merged commit 59f0aa4 into main May 8, 2026
8 checks passed
@njbrake njbrake deleted the feat/persist-llm-thinking branch May 8, 2026 11:55