diff --git a/designs/0009-background-tasks.md b/designs/0009-background-tasks.md
new file mode 100644
index 000000000..2556cf4a3
--- /dev/null
+++ b/designs/0009-background-tasks.md
@@ -0,0 +1,1590 @@
+# Background Tasks: Non-Blocking, Async Concurrency for Strands Agents
+
+**Status**: Proposed
+**Date**: 2026-04-24
+**Scope**: TypeScript SDK. Python SDK parity implementation to follow.
+
+```
+ STANDARD                                   BACKGROUND
+
+ agent  ▓▓▓·························▓▓▓     agent  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
+ tools  ████                                tools  ████
+        ████████                                   ████████
+        ██████████████                             ██████████████
+        ├──────────────────────────────┤           ├────────────────┤
+        0s                           30s           0s             15s
+```
+
+---
+
+
+## Definitions
+
+| Term | Definition |
+|------|-----------|
+| **Background tool** | A tool declared in `backgroundTools` that the SDK dispatches on a fork instead of executing inline. The model calls it normally; the async behavior is invisible to the model except through the system prompt augmentation. |
+| **Fork** | An independent copy of the agent created via `fork()`. Has its own conversation, execution lock, and task manager, but shares the parent's model client and tool registry. The isolation primitive that makes concurrent execution safe. |
+| **Decision Point** | One of three locations in the modified agent loop where background task logic is injected. **A** (top of cycle): pop settled results. **B** (per tool): fork and dispatch or execute inline. **C** (end of turn): wait if tasks are pending. |
+| **ACK** | The immediate `tool_result` returned to the model when a background tool is dispatched. Contains "Background task dispatched" or "Background task queued." Not a real result — the actual output arrives later via injection. |
+| **Injection** | The mechanism by which background task results enter the parent's conversation. Results are appended as user text messages with a `[Background Task Result]` prefix and `toolUseId` for correlation. |
+| **Settle window** | A configurable pause (default: 0ms) after a background task completes, giving closely-completing tasks a chance to finish so their results are injected as one batch rather than triggering separate model calls. |
+| **Wait state** | The blocking state entered when the agent has no foreground work but background tasks are still in flight. Entered from Decision Point C (model ended turn) or after an all-ACK turn (every tool was background). |
+| **BackgroundTask** | The handle wrapping a background operation. Provides synchronous status inspection (`status`, `result`, `error`), cancellation (`cancel()`), and implements `PromiseLike` for `await`. |
+| **TaskManager** | Per-agent lifecycle manager for `BackgroundTask` instances. Owns concurrency limits, queuing, settlement detection, cancellation, and TTL enforcement. |
+
+
+
+---
+
+## Table of Contents
+
+- [Problem(s)](#problems) · [Proposal](#proposal) · [How It Works](#how-it-works) · [New Primitives](#new-primitives) · [Failure Modes](#failure-modes) · [Retry](#retry) · [Cancellation](#cancellation) · [Context Management](#context-management) · [Consequences](#consequences) · [Design Decisions & Alternatives](#design-decisions--alternatives) · [Developer Experience](#developer-experience) · [Benchmark Results](#benchmark-results) · [Real-World Demos](#real-world-demos)
+- Appendices: [A (Landscape Analysis)](#appendix-a-landscape-analysis) · [B (Interface Design)](#appendix-b-interface-design-rationale) · [C (Background Tasks vs Graph)](#appendix-c-background-tasks-vs-graph) · [D (Naming)](#appendix-d-naming-alternatives) · [E (Extensions & Roadmap)](#appendix-e-extension-to-containerized-dispatch) · [G (System Prompt Rationale)](#appendix-g-system-prompt-augmentation-rationale) · [H (Conversation Traces)](#appendix-h-conversation-traces) · [I (Development Plan)](#appendix-i-development-plan)
+
+---
+
+## Problem(s)
+
+The agent loop is built on a synchronous assumption: the model calls tools, the SDK executes all of them, collects every result, and only *then* lets the model reason again. Even with concurrent tool execution within a turn (i.e. using `ConcurrentToolExecutor`), the model still waits for the slowest tool to finish before it can reason about any result.
+
+### The agent blocks on every tool, with no ability to continue working in parallel
+
+When the model calls six tools and one of them takes 30 seconds (while the rest complete in just a few), the agent remains idle for the entire 30 seconds. Every tool call is treated as part of the same batch, even though they might be completely independent and have their own follow-up work. Unrelated tasks should not block each other.
+
+### The agent cannot drive concurrency on its own
+
+Graph and Swarm multiagent primitives let developers define parallel pipelines, but this topology is *fixed* before the agent runs. The model has no way to dispatch tasks simultaneously based on its own reasoning, or to adjust what it dispatches based on earlier results.
+
+### A single agent instance cannot be invoked more than once at a time
+
+The existing `.invoke()` and `.invokeAsync()` methods both acquire a per-instance `_isInvoking` lock. If a second call is made while the first is still running — whether via `Promise.all` or overlapping async calls — a `ConcurrentInvocationError` is immediately thrown. There is no built-in mechanism to create an independent copy of an agent and run it alongside the original.
+
+## Proposal
+
+Background task scheduling removes the synchronous assumption. Instead of a single execution path where everything queues behind everything else, the agent can now dispatch independent work onto *separate* paths and continue reasoning. Results flow back as they finish (success, error, cancellation). Existing framework-level parallelism (`ConcurrentToolExecutor`) continues to work for foreground tools — background tasks are complementary, not a replacement.
+
+### The agent no longer blocks on every tool
+
+Tools can now be executed asynchronously and independently, no longer blocking the agent. Results are injected back into the conversation as they finish, and the model sees them on its next turn.
+
+### The agent can now drive concurrency on its own
+
+The model can dispatch multiple tools simultaneously, and they all begin executing immediately. As each result arrives, the model can react and adjust its strategy in real time, triggering follow-up work as needed or cancelling tasks that are no longer required. No predefined topology is required; the model's dispatch strategy emerges from its own reasoning.
+
+### A single agent instance can now run concurrent work
+
+The SDK now provides a built-in mechanism to "fork" an agent (create an independent copy) and run it alongside the original. No manual cloning, no lock conflicts, no coordination overhead.
+
+**Zero overhead when not used.** Agents that don't configure `backgroundTools` pay no cost. No system prompt augmentation is injected, no management tools are registered, no token or context overhead is added. The decision points check `taskManager.size` and `_backgroundToolNames` and short-circuit immediately — no forks, no queues, no settlement checks. The agent loop behaves identically to today.
+
+## How It Works
+
+### Current Agent Loop
+
+*[Diagram: current agent loop]*
+
+### Modified Agent Loop
+
+*[Diagram: modified agent loop]*
+
+Legend: **green** = new, **gray** = existing (unchanged from current loop)
+
+### Three Decision Points
+
+| Point | Location | Blocking | Purpose |
+|-------|----------|----------|---------|
+| **A** | Start of each loop cycle | No | Pop any background tasks that finished (success, error, or cancelled) since the last cycle and inject their results into the conversation as user messages. Proceeds immediately if none have settled. |
+| **B** | Per tool, during dispatch | No | For each tool the model calls, check if it's designated as a background tool. If yes: fork the agent (or queue if `maxConcurrentBackgroundTasks` is reached), dispatch the tool on the fork, and return an immediate ACK to the model. If no: execute the tool inline as normal. The agent continues without waiting for background results. |
+| **C** | End of turn, no foreground work remaining | Conditional | Wait for the next background task to settle, then re-enter the loop. Prevents the agent from exiting while work is still in flight. |
+
+#### The Wait State — Why Block?
+
+Two paths in the modified loop converge on the same blocking wait:
+
+1. **Decision Point C**: The model ends its turn (no foreground work remaining) but background tasks are still in flight. Without the wait, the loop would exit and the `finally` block would cancel all pending tasks.
+2. **All-ACK turns**: The model calls multiple tools and *all* of them are background tasks — every tool result is merely an acknowledgement. Without the wait, the model would re-enter the loop with nothing but ACKs to reason about, risking hallucinated results or redundant re-dispatches.
+
+In both cases, the agent blocks until the next background task settles, injects the result into the conversation, and re-enters the loop. See [Cancellation](#cancellation) for the safety bounds that prevent indefinite waits.
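+
+The two paths into the wait state, together with Decision Point A, can be summarised as a small pure function. This is only a sketch under stated assumptions: `LoopState`, `LoopAction`, and `nextAction` are hypothetical names introduced for illustration and are not SDK APIs, and the real loop interleaves these checks with model and tool calls.
+
+```typescript
+// Hypothetical sketch of the loop's per-cycle decision; not the SDK's implementation.
+type LoopState = {
+  settledCount: number // background tasks that settled since the last cycle
+  pendingCount: number // background tasks still queued or in progress
+  modelEndedTurn: boolean // the model ended its turn with no foreground work left
+  lastTurnAllAcks: boolean // every tool result in the last turn was an ACK
+}
+
+type LoopAction = 'inject' | 'wait' | 'callModel' | 'exit'
+
+function nextAction(s: LoopState): LoopAction {
+  // Decision Point A: settled results are injected before anything else.
+  if (s.settledCount > 0) return 'inject'
+  // Wait state: entered from Decision Point C or after an all-ACK turn.
+  if (s.pendingCount > 0 && (s.modelEndedTurn || s.lastTurnAllAcks)) return 'wait'
+  // Nothing in flight and the model is done: the loop can exit.
+  if (s.modelEndedTurn) return 'exit'
+  // Otherwise there are foreground results for the model to reason about.
+  return 'callModel'
+}
+```
+
+In this sketch, `'wait'` is the only blocking outcome; every other branch proceeds immediately, which matches the "Blocking" column in the table above.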
+
+### How the Model Sees Background Tasks
+
+Background tools appear in the model's tool definitions identically to foreground tools — same name, same description, same input schema. There is no schema-level async marker. The model learns which tools run asynchronously and how to interact with them solely from the system prompt augmentation described below.
+
+When any background tools are passed to the agent, the SDK auto-generates and appends the following block to the system prompt:
+
+```
+## Background Tools
+
+The following tools run asynchronously:
+- `searchWeb`
+- `analyzeData`
+
+**How it works:**
+- Calling one returns: `Background task dispatched` (started immediately) or `Background task queued` (waiting for a slot).
+- Do not fabricate results. You are not blocked — continue working or yield your turn.
+- Results arrive as user messages:
+
+[Background Task Result]
+tool:
+toolUseId:
+status: success|error|cancelled
+
+...output...
+
+**Incorporate all [Background Task Result] messages into your work before producing your final response.**
+
+**Management tools:**
+- `list_background_tasks()` — check which tasks are still running and how long they've been running.
+- `cancel_background_task({ toolUseId })` — cancel a specific task. `cancel_background_task({ toolName })` — cancel all instances of a tool.
+```
+
+The tool names are populated from the `backgroundTools` list. The block costs ~100 tokens, paid once per invocation. See [Appendix G](#appendix-g-system-prompt-augmentation-rationale) for why this augmentation is necessary and why alternatives were rejected.
+
+#### Dispatch and Acknowledgement
+
+When the model calls a background tool, it receives an immediate acknowledgement as a standard `tool_result` — satisfying the provider API's synchronous pairing requirement:
+
+```
+[assistant] tool_use: { toolUseId: "tu-1", name: "searchWeb",
+ input: { query: "research latest advancements in agentic AI" } }
+[user] tool_result: { toolUseId: "tu-1", status: "success",
+ content: "Background task dispatched" }
+```
+
+The ACK pairs with the original `tool_use` via `toolUseId`, just like any foreground tool result. The system prompt tells the model this is not a real result — actual output will arrive later.
+
+#### Result Notification
+
+When the background task completes, its result is injected into the conversation as a user text message rather than a `tool_result`, because the original `tool_use` has already been paired with the ACK and there is no unpaired `tool_use` left for it to answer. The `toolUseId` from the original dispatch is echoed for correlation:
+
+```
+[user] text: "[Background Task Result]
+ tool: searchWeb
+ toolUseId: tu-1
+ status: success
+
+ "
+```
+
+This is the structural asymmetry at the heart of background tasks: the ACK uses native `tool_result` pairing, but the real result arrives as a plain text message. This is why the system prompt augmentation is necessary — see [Appendix G](#appendix-g-system-prompt-augmentation-rationale).
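+
+As an illustration of the notification layout shown above, the following sketch renders a result into the `[Background Task Result]` text format. The `BackgroundResult` shape and the `formatBackgroundResult` helper are hypothetical names chosen for this example; the SDK's actual formatter may differ.
+
+```typescript
+// Hypothetical helper that mirrors the notification layout described above.
+type ResultStatus = 'success' | 'error' | 'cancelled'
+
+interface BackgroundResult {
+  tool: string
+  toolUseId: string // echoed from the original dispatch for correlation
+  status: ResultStatus
+  output: string
+}
+
+function formatBackgroundResult(r: BackgroundResult): string {
+  return [
+    '[Background Task Result]',
+    `tool: ${r.tool}`,
+    `toolUseId: ${r.toolUseId}`,
+    `status: ${r.status}`,
+    '', // blank line separates the header fields from the output
+    r.output,
+  ].join('\n')
+}
+```
+
+The returned string would be sent as a plain user text message, not as a `tool_result` block.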
+
+#### Model-Driven Task Management
+
+Two foreground tools are auto-registered whenever background tools are configured:
+
+**`list_background_tasks`** — Returns all currently pending tasks with their elapsed time, giving the model visibility into what's still running:
+
+```
+Model calls: list_background_tasks()
+Result: "2 tasks in progress:
+ - searchWeb (toolUseId: tu-1, elapsed: 12.3s)
+ - analyzeData (toolUseId: tu-2, elapsed: 8.7s)"
+```
+
+The model can use this to make informed decisions — for example, checking whether a task is still running before dispatching a follow-up, or deciding whether to cancel a long-running task and try a different approach.
+
+**`cancel_background_task`** — If the model determines that a pending task is no longer needed — because an earlier result already answered the question, or the user changed direction — it can cancel by `toolUseId` (specific task) or `toolName` (all instances). See [Cancellation](#cancellation) for the full cancellation API.
+
+For full conversation traces covering mixed turns, all-ACK turns, batched results, error scenarios, and non-deterministic ordering, see [Appendix H: Conversation Traces](#appendix-h-conversation-traces).
+
+---
+
+## New Primitives
+
+### BackgroundTask
+
+`BackgroundTask` is the foundational unit of background work in the SDK. Every background operation, whether dispatched by the model via `backgroundTools` or by the developer via `invokeBackground()`, is represented as a `BackgroundTask`. It wraps a Promise with synchronous status inspection and implements `PromiseLike` for direct `await`, enabling both the agent loop and the developer to check status, read results, or cancel work without blocking.
+
+```typescript
+type TaskStatus = 'queued' | 'inProgress' | 'success' | 'error' | 'cancelled'
+
+class BackgroundTask implements PromiseLike<unknown> {
+  readonly id: string                 // UUID
+  readonly name: string               // Human-readable label
+  readonly createdAt: number          // ms since epoch
+  get status(): TaskStatus
+  get result(): unknown | undefined   // Defined when status is 'success'
+  get error(): Error | undefined      // Defined when status is 'error'
+  cancel(): void                      // Marks cancelled + fires fork abort signal
+}
+```
+
+#### Awaiting
+
+`BackgroundTask` implements `PromiseLike`, so the primary consumption path is `await`:
+
+```typescript
+// Success: resolves with the tool result
+const result = await task
+
+// Error: rejects with the tool's error
+try {
+  await task
+} catch (e) {
+  // e is the Error from the failed tool
+}
+
+// Cancelled: rejects with TaskCancelledError
+try {
+  await task
+} catch (e) {
+  if (e instanceof TaskCancelledError) {
+    // Intentional cancellation, not a failure
+  }
+}
+```
+
+Cancellation follows the same pattern as `fetch` with an aborted `AbortSignal` — intentional cancellation is a rejection, not a silent resolve. The synchronous getters (`status`, `result`, `error`) enable inspection without awaiting.
+
+#### Status Derivation
+
+`status` is the single source of truth. The full lifecycle:
+
+| Condition | `status` |
+|---|---|
+| Created but `maxConcurrentTasks` reached — waiting for a slot | `'queued'` |
+| Slot available, fork created, tool executing | `'inProgress'` |
+| `cancel()` called | `'cancelled'` |
+| Promise rejected | `'error'` |
+| Promise resolved, tool returned `status: 'error'` | `'error'` |
+| Promise resolved, tool returned `status: 'success'` | `'success'` |
+
+Task state is tracked via an internal discriminated union rather than promise state, because the agent loop must check settlement without blocking — a raw Promise offers no synchronous status inspection. The union carries the associated data for each state (result value, error, cancellation reason), eliminating the need for separate flags. When `cancel()` is called, status transitions to `'cancelled'` immediately regardless of current state (queued or inProgress). No-op if the task has already settled. See [Cancellation](#cancellation) for the full API.
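+
+A minimal sketch of such a discriminated union is shown below. The type and class names (`TaskState`, `TaskStateCell`) and the exact transition methods are assumptions made for illustration; only the five status values and the "cancel is a no-op once settled" rule come from the description above.
+
+```typescript
+// Hypothetical sketch; names and transition methods are assumptions, not the SDK's code.
+type TaskState =
+  | { status: 'queued' }
+  | { status: 'inProgress' }
+  | { status: 'success'; result: unknown }
+  | { status: 'error'; error: Error }
+  | { status: 'cancelled'; reason: string }
+
+class TaskStateCell {
+  private state: TaskState = { status: 'queued' }
+
+  get status(): TaskState['status'] {
+    return this.state.status
+  }
+
+  // Each getter narrows on the discriminant, so no separate flags are needed.
+  get result(): unknown {
+    return this.state.status === 'success' ? this.state.result : undefined
+  }
+
+  get error(): Error | undefined {
+    return this.state.status === 'error' ? this.state.error : undefined
+  }
+
+  private get settled(): boolean {
+    return this.state.status !== 'queued' && this.state.status !== 'inProgress'
+  }
+
+  start(): void {
+    if (this.state.status === 'queued') this.state = { status: 'inProgress' }
+  }
+
+  succeed(result: unknown): void {
+    if (!this.settled) this.state = { status: 'success', result }
+  }
+
+  fail(error: Error): void {
+    if (!this.settled) this.state = { status: 'error', error }
+  }
+
+  // Cancels from 'queued' or 'inProgress'; a no-op once the task has settled.
+  cancel(reason = 'Task cancelled'): void {
+    if (!this.settled) this.state = { status: 'cancelled', reason }
+  }
+}
+```
+
+Because every read goes through the union, a late result arriving after `cancel()` cannot overwrite the cancelled state in this sketch, and the loop can poll `status` synchronously without awaiting anything.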
+
+#### TaskManager
+
+`TaskManager` is the lifecycle manager for `BackgroundTask` instances created during the `backgroundTools` dispatch path ([Decision Point B](#three-decision-points)). It owns settlement detection, cancellation, and cleanup. Each agent instance holds its own `TaskManager`; forks get fresh instances with the same config. This isolation ensures a fork's background tasks are the fork's responsibility — the parent only sees the fork itself as one task, never the sub-tasks the fork may spawn internally.
+
+These settings tune the agent loop's waiting behavior at [the wait state](#the-wait-state--why-block):
+
+```typescript
+interface TaskManagerConfig {
+  heartbeatMs?: number         // How often to emit BackgroundTaskPendingEvent while waiting (default: 5000ms)
+  settleWindowMs?: number      // How long to wait for additional tasks to finish before injecting (default: 0ms)
+  maxCycles?: number           // Max loop re-entries from background results per invocation (default: 50)
+  defaultTtlMs?: number        // Auto-cancel tasks that exceed this duration (no default; tasks live until settled)
+  maxConcurrentTasks?: number  // Max simultaneous background tasks (default: 10). Excess dispatches queue until a slot opens.
+}
+```
+
+These are exposed on `AgentConfig` as `backgroundToolHeartbeatMs`, `backgroundToolSettleWindowMs`, `maxBackgroundCycles`, `backgroundTaskTtlMs`, and `maxConcurrentBackgroundTasks` respectively.
+
+`maxConcurrentTasks` prevents runaway forking. When the limit is reached, new dispatches are queued rather than forked immediately:
+
+1. Decision Point B calls `TaskManager.enqueue()` instead of creating a fork directly.
+2. If a slot is available (`runningCount < maxConcurrentTasks`), the task is forked and started immediately → status `inProgress`, ACK: `"Background task dispatched"`.
+3. If no slot is available, the task enters an internal FIFO queue → status `queued`, ACK: `"Background task queued"`.
+4. When a running task settles (via `popCompleted()` or cancellation), the TaskManager automatically drains the queue — pulls the next queued task, creates its fork, and starts execution. The task transitions from `queued` to `inProgress`.
+
+The model receives an ACK for every dispatched tool immediately regardless of queue state. The distinction between "dispatched" and "queued" ACKs lets the model know whether work has started.
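+
+The queueing steps above can be sketched as a small concurrency-limited FIFO. `SketchTaskQueue` and its `settle()` method are hypothetical and track only task identifiers; a real `TaskManager` would also create forks, hold `BackgroundTask` handles, and handle cancellation and TTLs.
+
+```typescript
+// Hypothetical sketch of slot accounting and queue drain; not the SDK's TaskManager.
+type Ack = 'Background task dispatched' | 'Background task queued'
+
+class SketchTaskQueue {
+  private readonly maxConcurrentTasks: number
+  private readonly running = new Set<string>()
+  private readonly waiting: string[] = [] // FIFO of queued toolUseIds
+
+  constructor(maxConcurrentTasks: number) {
+    this.maxConcurrentTasks = maxConcurrentTasks
+  }
+
+  // Start immediately if a slot is free, otherwise queue; the ACK reflects which happened.
+  enqueue(toolUseId: string): Ack {
+    if (this.running.size < this.maxConcurrentTasks) {
+      this.running.add(toolUseId)
+      return 'Background task dispatched'
+    }
+    this.waiting.push(toolUseId)
+    return 'Background task queued'
+  }
+
+  // Free the settled task's slot and drain the queue; returns the ids that were started.
+  settle(toolUseId: string): string[] {
+    if (!this.running.delete(toolUseId)) return []
+    const started: string[] = []
+    while (this.running.size < this.maxConcurrentTasks && this.waiting.length > 0) {
+      const next = this.waiting.shift()
+      if (next === undefined) break
+      this.running.add(next)
+      started.push(next)
+    }
+    return started
+  }
+
+  get runningCount(): number {
+    return this.running.size
+  }
+
+  get queuedCount(): number {
+    return this.waiting.length
+  }
+}
+```
+
+With a limit of 2, a third `enqueue()` returns the "queued" ACK, and settling either running task promotes the queued one to running.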
+
+Key methods:
+
+- **`enqueue()`** — the entry point for background dispatch. Creates a `BackgroundTask`, either starts it immediately or queues it based on concurrency, and returns the task handle with the appropriate ACK.
+- **`popCompleted()`** — returns and removes all settled tasks from the registry. Triggers queue drain if slots opened. This is the only way tasks leave the registry, ensuring no task is accidentally lost or processed twice.
+- **`cancel(id)`** — cancel a specific task by its internal ID. Works on both queued and running tasks. Cancelling a queued task removes it from the queue without ever forking.
+- **`cancelByToolUseId(toolUseId)`** — cancel a specific task by its `toolUseId`. This is the primary path for model-driven cancellation, since the model knows `toolUseId` from its own `tool_use` blocks in conversation history.
+- **`cancelByName(name)`** — cancel all in-progress and queued tasks with a given tool name. Returns the count. Available to both the model (via `cancel_background_task`) and developers directly.
+- **`cancelAll()`** — cancel everything (running and queued) and clear the registry. Called by the agent loop's `finally` block on exit.
+
+#### Events
+
+Three new hookable events:
+
+| Event | When | Key Fields |
+|-------|------|------------|
+| `BackgroundTaskDispatchEvent` | Tool call dispatched to background (Decision Point B) | `toolUse`, `taskId`, `taskName` |
+| `BackgroundTaskResultEvent` | Task finishes and result injected (Decision Point A or C) | `taskId`, `taskName`, `result`, `durationMs`, `error?`, mutable `retry` |
+| `BackgroundTaskPendingEvent` | Heartbeat while in the [wait state](#the-wait-state--why-block) | `pendingTasks: BackgroundTask[]`, `completedCount`, `elapsedMs` |
+
+`BackgroundTaskResultEvent.error` carries the `Error` object when the tool failed or threw — available for logging, metrics, and retry decisions. The mutable `retry` flag lets hook callbacks suppress error notifications and re-dispatch the same tool call as a new background task. See [Retry](#retry) for the full three-layer retry story.
+
+All three are available via hooks (`addHook`) and the streaming interface (`AgentStreamEvent`):
+
+```typescript
+agent.addHook(BackgroundTaskDispatchEvent, (event) => {
+  console.log(`Dispatched ${event.toolUse.name} as task ${event.taskId}`)
+})
+
+agent.addHook(BackgroundTaskResultEvent, (event) => {
+  if (event.error) {
+    console.log(`Task ${event.taskName} failed in ${event.durationMs}ms: ${event.error.message}`)
+  }
+})
+```
+
+#### backgroundTools Config
+
+`backgroundTools` accepts the same types as `tools` — `Tool`, `McpClient`, `Agent`, `Graph`, `Swarm`, or nested arrays. Anything that can be a tool can be a background tool.
+
+```typescript
+const agent = new Agent({
+  tools: [calculateMetrics, formatReport],
+  backgroundTools: [searchWeb, analyzeData, researcher],
+})
+```
+
+The model sees background tools as normal tools in its tool definitions. A system prompt augmentation explains the async contract — see [How the Model Sees Background Tasks](#how-the-model-sees-background-tasks) for the full prompt block and message format details.
+
+#### fork()
+
+`fork()` creates an independent copy of the agent that can be invoked concurrently with the original. It is both the isolation primitive that makes background tasks possible (each dispatch at Decision Point B creates a fork) and a standalone capability for developers who want to parallelize work using the same agent configuration.
+
+Why it's needed: `invoke()` on the same agent instance acquires an `_isInvoking` lock. A second concurrent call throws `ConcurrentInvocationError`. This is a deliberate safety rail — concurrent writes to the same `messages` array would corrupt conversation state. `fork()` gives each concurrent invocation its own messages, state, and lock:
+
+```typescript
+// This throws ConcurrentInvocationError
+await Promise.all([
+  agent.invoke('query A'),
+  agent.invoke('query B'),
+])
+
+// fork() gives each invocation its own lock and state
+const [a, b] = await Promise.all([
+  agent.fork().invoke('query A'),
+  agent.fork().invoke('query B'),
+])
+```
+
+```typescript
+interface ForkOptions {
+  name?: string                // Override fork name
+  messages?: Message[]         // Override conversation history
+  systemPrompt?: SystemPrompt  // Override system prompt
+  printer?: boolean            // Control output printing (false for background)
+}
+```
+
+##### What gets shared vs. isolated
+
+| What | Fork behavior |
+|---|---|
+| **Conversation** (messages, app state) | Deep-copied from parent. Independent after forking. Background tool forks start empty ([Context Management](#context-management)). |
+| **Model and tools** | Shared reference. Same model, tool registry, and MCP clients. |
+| **System prompt** | Copied from parent. Overridable via `ForkOptions`. |
+| **Hooks** | Rebuilt with propagating callbacks only ([Hook Propagation](#hook-propagation)). |
+| **Session** | Dropped. Forks do not persist to the parent's session. |
+| **Execution** (invocation lock, cancellation, metrics) | Fresh. Each fork can be invoked and cancelled independently. |
+| **Task management** | Fresh `TaskManager`, same config. Fork manages its own background tasks. |
+
+Messages are deep-copied by default. For background tool forks specifically, the SDK automatically passes `messages: []` at Decision Point B — see [Context Management](#context-management) for why and how to opt out via `inheritMessages`.
+
+Session managers are dropped — forks do not persist to the parent's session. Fork results reach the parent's session indirectly: they are injected into the parent's conversation, and the parent's session hooks persist them on the next save. Intermediate work inside forks (tool calls, reasoning steps) is not persisted.
+
+##### Cancellation
+
+Each fork has its own `AbortController`. Cancellation flows in one direction:
+
+- **Parent exit → forks cancelled.** When the parent's agent loop exits (normal completion, error, or cancel), its `finally` block calls `taskManager.cancelAll()`, which fires each background fork's abort signal. No orphaned forks.
+- **Fork cancelled → parent unaffected.** A fork's abort controller is independent. Cancelling a fork (via `BackgroundTask.cancel()` or TTL) does not propagate to the parent.
+- **Manual forks.** For forks created directly by the developer (not via `backgroundTools` or `invokeBackground`), the developer is responsible for cancellation — there is no automatic cleanup.
+
+##### Fork depth guard
+
+A configurable depth limit (default: 20, set via `maxForkDepth` on `AgentConfig`) prevents infinite recursive forking, such as a background tool that itself dispatches background tools. Exceeding the limit throws an error.
+
+##### Model-driven self-forking (emergent capability)
+
+This was not the primary use case for background tasks — the goal was unlocking async tool and agent-as-tool dispatch. But during development, a broader capability emerged: the `fork()` + `invokeBackground()` + result injection infrastructure composes to let the model spawn copies of itself.
+
+A `fork_self` tool — a thin wrapper over `fork()` + `invokeBackground()` — would let the model dispatch a copy of itself with a new prompt. The fork has the same tools, model, and system prompt, but its own conversation and execution state. Results flow back via the standard background task injection mechanism.
+
+This unlocks patterns that go beyond tool-level parallelism: self-parallelization (the model decomposes a task and dispatches a fork per sub-problem), context window extension (fork with a summarized conversation to work in a fresh context), speculative execution (fork to explore competing approaches in parallel), and recursive decomposition (forks that fork themselves, up to `maxForkDepth`).
+
+The infrastructure is already in place — no new primitives are needed. A dedicated design document will explore model-driven self-forking patterns, system prompt strategies, and benchmark results.
+
+#### Hook Propagation
+
+Forks inherit hook callbacks selectively. An `options` parameter on `addHook` controls this:
+
+```typescript
+agent.addHook(AfterModelCallEvent, callback) // propagate: true (default)
+agent.addHook(AfterModelCallEvent, callback, { propagate: false }) // stays on this agent only
+```
+
+Session persistence hooks and conversation manager overflow hooks must not propagate to forks. A fork writing to the parent's session or triggering the parent's overflow recovery would corrupt state. The built-in hooks set `propagate: false`:
+
+- `ConversationManager.initAgent()` → overflow recovery hook
+- `SlidingWindowConversationManager.initAgent()` → after-invocation trimming hook
+- `SessionManager` → all persistence hooks
+
+User-registered hooks propagate by default, which is the expected behavior for logging and guardrails. The `options` parameter is backwards-compatible — existing `addHook` calls without it continue to work with the default `propagate: true`.
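+
+The selective inheritance described above can be sketched with a toy registry. `SketchHookRegistry` and `forkRegistry()` are hypothetical names; the sketch handles a single event type and is not the SDK's hook system, but it shows how a `propagate` flag with a default of `true` decides which callbacks a fork receives.
+
+```typescript
+// Hypothetical sketch of propagate-aware hook inheritance; not the SDK's hook registry.
+type HookCallback<E> = (event: E) => void
+
+interface HookEntry<E> {
+  callback: HookCallback<E>
+  propagate: boolean
+}
+
+class SketchHookRegistry<E> {
+  private readonly entries: HookEntry<E>[] = []
+
+  // Omitting options keeps the backwards-compatible default of propagate: true.
+  addHook(callback: HookCallback<E>, options: { propagate?: boolean } = {}): void {
+    this.entries.push({ callback, propagate: options.propagate ?? true })
+  }
+
+  // Build the registry a fork would receive: propagating callbacks only.
+  forkRegistry(): SketchHookRegistry<E> {
+    const child = new SketchHookRegistry<E>()
+    for (const entry of this.entries) {
+      if (entry.propagate) child.addHook(entry.callback)
+    }
+    return child
+  }
+
+  emit(event: E): void {
+    for (const entry of this.entries) entry.callback(event)
+  }
+}
+```
+
+A logging hook registered without options would fire in both parent and fork, while a session-persistence hook registered with `{ propagate: false }` would fire only in the parent.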
+
+#### invokeBackground()
+
+Dispatches a full agent invocation as a background task:
+
+```typescript
+interface InvokeBackgroundOptions {
+  name?: string         // Task label for debugging
+  messages?: Message[]  // Override fork conversation
+}
+```
+
+This creates a fork, starts its `invoke()`, wraps the resulting Promise in a `BackgroundTask`, and returns immediately. Unlike `backgroundTools` dispatch, `invokeBackground` does not route through the `TaskManager` — the developer is responsible for awaiting, cancelling, and handling errors on the returned task:
+
+```typescript
+const task = agent.invokeBackground(`Summarize: ${text}`, { name: 'summarizer' })
+const result = await task
+```
+
+---
+
+## Failure Modes
+
+Background task failures produce a `[Background Task Result]` notification with one of two statuses: `error` for tool failures, `cancelled` for intentional stops (TTL expiration, developer cancellation, or model-driven cancellation via `cancel_background_task`). If a fork crashes entirely — unrecoverable context overflow, provider errors, unhandled exceptions — the parent is unaffected. The fork's promise rejects, the task settles with `status: error`, and the model receives the error notification like any other failure. Conversely, if the parent agent exits — whether normally, via error, or cancellation — all in-flight background tasks are cancelled automatically via the `finally` block.
+
+**Tool-level failure (`status: error`).** The tool returns a `ToolResultBlock` with `status: 'error'`, or throws an exception (which `executeTool` catches and wraps in an error `ToolResultBlock`). The notification carries the tool's error text.
+
+```
+Model calls: searchWeb({ query: "quantum computing" })
+Tool result ACK: "Background task dispatched"
+Model continues working...
+
+Tool fails → model receives:
+
+[Background Task Result]
+tool: searchWeb
+toolUseId: tu-1
+status: error
+
+Connection timeout after 30000ms
+```
+
+**Cancellation (`status: cancelled`).** The task was intentionally stopped — by TTL expiration, developer `cancel()`, or the model's `cancel_background_task` tool. The TaskManager fires the fork's abort signal and the task's status transitions to `'cancelled'` immediately (see [What Cancellation Does to the Fork](#what-cancellation-does-to-the-fork)). Example with TTL:
+
+```
+Model calls: analyzeData({ dataset: "large_corpus" })
+Tool result ACK: "Background task dispatched"
+
+30 seconds pass, backgroundTaskTtlMs fires...
+
+Model receives:
+
+[Background Task Result]
+tool: analyzeData
+toolUseId: tu-2
+status: cancelled
+
+Task cancelled (TTL exceeded)
+```
+
+In both cases the model receives the notification inline with other results and can decide how to proceed. The distinction matters: `status: error` suggests retrying may help, while `status: cancelled` signals the task was intentionally stopped.
+
+---
+
+## Retry
+
+Retry operates at three layers. Each layer handles a different failure granularity, and they compose — Layer 1 catches transient failures silently, Layer 2 gives developers programmatic retry policies, Layer 3 gives the model final say.
+
+### Layer 1: Tool-Level Retry (Inside the Fork)
+
+Register an `AfterToolCallEvent` hook. It fires inside the fork's `executeTool` loop when the tool fails. Set `retry = true` and the fork retries the tool internally. The parent never sees the failure.
+
+This handles transient, mechanical failures such as network timeouts, rate limits, and flaky APIs. It already works with no changes needed; the same hook mechanism used for foreground tool retry propagates to forks automatically.
+
+### Layer 2: Task-Level Retry (Developer-Controlled)
+
+`BackgroundTaskResultEvent` carries the full `ToolResultBlock` result, the `Error` object (when available), and a mutable `retry` flag. When a hook callback sets `retry = true` on an error result, the framework suppresses the error notification (the model never sees it), creates a new fork, and re-dispatches the same tool call as a fresh background task.
+
+```typescript
+// Simplified — production code should track retry counts per task (e.g., keyed by event.taskId)
+let retryCount = 0
+
+agent.addHook(BackgroundTaskResultEvent, (event) => {
+  if (event.result.status === 'error' && retryCount < 3) {
+    retryCount++
+    event.retry = true
+  }
+})
+```
+
+The `retry` flag is only honoured when `result.status === 'error'`. Setting it on a success or cancelled result is a no-op — cancelled tasks were intentionally stopped, so automatic retry would likely hit the same deadline. The re-dispatched task carries the original `ToolUseBlock`, so the retry uses the same tool name and input arguments.
+
+**Safety bound:** `maxBackgroundCycles` (default: 50) limits how many times the agent loop can re-enter after injecting background results. Each retry dispatch settles on a future cycle, counting toward this limit. A hook that always sets `retry = true` will exhaust the cycle budget rather than loop infinitely.
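+
+The per-task retry counting that the comment in the example above recommends could be sketched as follows. This is an illustration under stated assumptions, not SDK code: `ResultEventLike` and `createRetryPolicy` are hypothetical names, and the event fields (`taskId`, `result.status`, `retry`) are inferred from the description in this section.
+
+```typescript
+// Hypothetical event shape for illustration; the real BackgroundTaskResultEvent may differ.
+interface ResultEventLike {
+  taskId: string
+  result: { status: 'success' | 'error' | 'cancelled' }
+  retry: boolean
+}
+
+// Returns a hook callback that allows at most `maxRetries` retries per key.
+// `keyOf` defaults to taskId; pass another extractor if a re-dispatched task
+// receives a new id and a stable key is needed instead.
+function createRetryPolicy(
+  maxRetries: number,
+  keyOf: (event: ResultEventLike) => string = (event) => event.taskId
+): (event: ResultEventLike) => void {
+  const attempts = new Map<string, number>()
+  return (event) => {
+    // The flag is only honoured on error results, so skip everything else.
+    if (event.result.status !== 'error') return
+    const key = keyOf(event)
+    const used = attempts.get(key) ?? 0
+    if (used < maxRetries) {
+      attempts.set(key, used + 1)
+      event.retry = true
+    }
+  }
+}
+```
+
+If the event shape matches these assumptions, the callback could be registered with `agent.addHook(BackgroundTaskResultEvent, createRetryPolicy(3))`. Keeping the counter per key means one flaky task cannot consume the retry budget of the others.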
+
+### Layer 3: Model-Driven Retry
+
+If retries at Layers 1 and 2 are exhausted (or not configured), the notification reaches the model as a `[Background Task Result]` with `status: error` or `status: cancelled`. The model can call the same tool again if it chooses — it's still in `backgroundTools`, so the new call dispatches as a fresh background task. The `error` vs `cancelled` distinction helps the model decide: errors may be worth retrying, cancellations typically are not.
+
+**Validated:** The `demos/cases/error-retry` teaching example confirms Layer 2 end-to-end with Sonnet: a flaky tool (503 on first call) is retried transparently by the hook, the model never sees the error, and the coordinator synthesizes all three researcher results as if nothing failed.
+
+---
+
+## Cancellation
+
+Four independent cancellation paths cover every actor that might need to stop background work.
+
+### Developer Cancellation
+
+`BackgroundTask` exposes a public `cancel()` method. This is the primary cancellation path for tasks created via `invokeBackground()`, where the developer holds the handle directly. For `backgroundTools` tasks, cancellation goes through the TaskManager or the model's `cancel_background_task` tool (see below).
+
+```typescript
+const task = agent.invokeBackground('Research topic A')
+// ... later
+task.cancel() // marks cancelled, fires fork abort signal
+```
+
+### TaskManager Cancellation
+
+`TaskManager.cancel(id)` cancels a specific task by internal ID. `TaskManager.cancelByToolUseId(toolUseId)` cancels by the model-facing `toolUseId`. `TaskManager.cancelByName(name)` cancels all in-progress tasks with a given tool name. `TaskManager.cancelAll()` cancels everything. The agent loop's `finally` block calls `cancelAll()` on exit, ensuring no orphaned tasks.
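+
+The four cancellation entry points can be pictured as filters over one registry. The sketch below is a simplified stand-in, not the SDK's `TaskManager`: the class name `TaskRegistrySketch`, the record fields, and the returned counts are assumptions made for illustration.
+
+```typescript
+type TaskStatus = 'in_progress' | 'completed' | 'cancelled'
+
+// Hypothetical record shape; field names are assumptions for this sketch.
+interface TaskRecord {
+  id: string
+  toolUseId: string
+  toolName: string
+  status: TaskStatus
+  controller: AbortController
+}
+
+class TaskRegistrySketch {
+  private readonly tasks = new Map<string, TaskRecord>()
+
+  register(id: string, toolUseId: string, toolName: string): TaskRecord {
+    const record: TaskRecord = { id, toolUseId, toolName, status: 'in_progress', controller: new AbortController() }
+    this.tasks.set(id, record)
+    return record
+  }
+
+  // Cancels every in-progress task matching the predicate and returns how many were cancelled.
+  private cancelWhere(match: (task: TaskRecord) => boolean): number {
+    let cancelled = 0
+    this.tasks.forEach((task) => {
+      if (task.status === 'in_progress' && match(task)) {
+        task.status = 'cancelled' // status changes immediately
+        task.controller.abort() // signals the fork, which may take time to stop
+        cancelled++
+      }
+    })
+    return cancelled
+  }
+
+  cancel(id: string): number {
+    return this.cancelWhere((task) => task.id === id)
+  }
+
+  cancelByToolUseId(toolUseId: string): number {
+    return this.cancelWhere((task) => task.toolUseId === toolUseId)
+  }
+
+  cancelByName(name: string): number {
+    return this.cancelWhere((task) => task.toolName === name)
+  }
+
+  cancelAll(): number {
+    return this.cancelWhere(() => true)
+  }
+}
+```
+
+Routing every path through a single predicate-based helper keeps the "only in-progress tasks are affected" rule in one place, so repeated or overlapping cancellations stay idempotent.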
+
+### Model-Driven Cancellation
+
+A `cancel_background_task` tool is auto-registered whenever `backgroundTools` is configured. The model can cancel a specific task by `toolUseId`, or all instances of a tool by `toolName`:
+
+```
+// Cancel a specific task
+Model calls: cancel_background_task({ toolUseId: "tu-1" })
+Result: "Cancelled task for 'searchWeb' (toolUseId: tu-1)"
+
+// Cancel all instances of a tool
+Model calls: cancel_background_task({ toolName: "searchWeb" })
+Result: "Cancelled 2 task(s) for 'searchWeb'"
+```
+
+The model knows both `toolUseId` and tool names from its own `tool_use` blocks in conversation history. `toolUseId` is the primary path for precise cancellation; `toolName` is the bulk option. This closes the asymmetry where the model can start background work but can't stop it.
+
+### Automatic Cancellation (TTL)
+
+TTL auto-cancels tasks that exceed a deadline. Configured via `backgroundTaskTtlMs` on `AgentConfig`, which applies to all background tasks dispatched by that agent.
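+
+One way to picture the deadline check is a sweep over in-progress tasks. The design does not specify whether the SDK uses per-task timers or a sweep, so the following is only an assumed model; `TimedTask` and `cancelExpired` are hypothetical names.
+
+```typescript
+// Hypothetical task shape; `startedAt` is a millisecond timestamp.
+interface TimedTask {
+  id: string
+  startedAt: number
+  status: 'in_progress' | 'completed' | 'cancelled'
+}
+
+// Marks in-progress tasks older than ttlMs as cancelled and returns their ids.
+// `now` is injected so the check is deterministic and easy to test.
+function cancelExpired(tasks: TimedTask[], ttlMs: number, now: number): string[] {
+  const expired: string[] = []
+  for (const task of tasks) {
+    if (task.status === 'in_progress' && now - task.startedAt > ttlMs) {
+      task.status = 'cancelled'
+      expired.push(task.id)
+    }
+  }
+  return expired
+}
+```
+
+In this model, `ttlMs` plays the role of `backgroundTaskTtlMs`, and tasks that have already settled are never touched.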
+
+### What Cancellation Does to the Fork
+
+`cancel()` fires the fork's `AbortController.abort()`. The fork's agent loop checks `isCancelled` at built-in checkpoints (between cycles, during model streaming, between tool executions). If the fork is mid-tool-execution, cancellation depends on the tool: tools that check `cancelSignal` (e.g., `fetch()` with signal forwarding) abort immediately; tools that don't check will run to completion. The cancelled task's status transitions to `'cancelled'` immediately — the agent loop does not wait for an unresponsive fork.
+
+**Validated:** The `demos/cases/cancellation` teaching example confirms developer cancellation end-to-end with Sonnet: 3 tasks dispatched via `invokeBackground()`, the fast one completes in 9.4s, the developer cancels the other two (which were still in progress), status transitions to `cancelled` immediately. Total: 9.4s instead of 90s+.
+
+---
+
+## Context Management
+
+Background tasks create two pathways for context growth: fork creation copies conversation history (input side), and result injection adds content to the parent's conversation (output side). The strategy uses three layers — a structural fix for the input side, reactive compaction for the output side, and an opt-in mechanism for developers who want proactive control.
+
+### Background Tool Forks Start with Empty Messages
+
+When a background tool is dispatched at Decision Point B, the SDK creates a fork to execute the tool on. The fork is created with `messages: []` — it does not copy the parent's conversation history.
+
+This is safe because the fork never reads its own messages. `executeTool()` passes the tool input directly from the `toolUseBlock` — the tool receives its arguments through `ToolContext.toolUse`, not from the conversation. The fork's `messages` array is accessible via `toolContext.agent.messages` but no built-in tool reads it. For Agent-as-tool specifically, `AgentAsTool` resets the sub-agent to its initial state via `loadSnapshot()` before invoking — the fork's messages are irrelevant.
+
+Without this, each background tool fork would deep-copy the parent's entire message history: six concurrent forks of a 50-message conversation means six copies of all 50 messages. As the parent's conversation grows with injected results, each subsequent wave of forks would copy the now-larger conversation. Starting forks empty eliminates this compounding cost.
+
+For tools that genuinely need conversation context when running in the background, `AgentAsToolOptions` accepts `inheritMessages?: boolean` (default `false`). When set to `true`, the fork copies the parent's messages instead of starting empty. This follows the same pattern as `propagate` on hooks — safe default, per-tool opt-in:
+
+```typescript
+// Default: fork starts with empty messages
+const researcher = new Agent({ ... })
+
+// Opt-in: fork inherits parent messages
+const conversationAnalyzer = new Agent({ ... })
+conversationAnalyzer.asTool({ inheritMessages: true })
+```
+
+When the tool runs as a foreground tool, `inheritMessages` is ignored — no fork is involved.
+
+This does not affect `invokeBackground()` (developer-driven dispatch), which copies the parent's messages by default since the developer is invoking the fork as a full agent. For developer-driven dispatch where conversation context is not needed, `invokeBackground(prompt, { messages: [] })` is the recommended pattern — this is the developer's responsibility to manage.
+
+### Reactive Compaction Handles Result Accumulation
+
+When background task results are injected into the parent's conversation, the conversation grows. If it grows past the context window limit, the conversation manager's overflow recovery fires — the same mechanism used for any long conversation:
+
+1. Background results are injected at the **end** of the conversation (appended as the most recent messages).
+2. The model is called with the updated conversation.
+3. If the conversation exceeds the context window, the model call fails with `ContextWindowOverflowError`.
+4. The conversation manager compacts: `SlidingWindowConversationManager` trims messages from the **front** (oldest first); `SummarizingConversationManager` summarizes old messages.
+5. The model call retries with the compacted conversation.
+
+This ordering works well for background tasks: stale early results from earlier waves get trimmed while recent results survive. Crucially, the model always gets a full turn to process injected results before compaction can trim them — results are injected, the model is called, and only a subsequent overflow (from later injections or the model's own output) triggers compaction. Early results that the model already reasoned about are safe to trim because that reasoning is preserved in the model's own prior responses.
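+
+The overflow-and-retry steps can be sketched as a small loop. This is a synchronous simplification with local stand-ins: the `ContextWindowOverflowError` class is redefined here so the example is self-contained, and `callWithSlidingWindow` and `Message` are hypothetical names rather than the conversation manager's actual API.
+
+```typescript
+// Local stand-in for the SDK's overflow error so the sketch runs on its own.
+class ContextWindowOverflowError extends Error {
+  constructor(message: string) {
+    super(message)
+    this.name = 'ContextWindowOverflowError'
+    Object.setPrototypeOf(this, ContextWindowOverflowError.prototype) // keeps instanceof working on ES5 targets
+  }
+}
+
+type Message = { role: 'user' | 'assistant'; text: string }
+
+// Calls the model; on overflow, drops the oldest message and retries.
+// Rethrows when only one message is left, because nothing more can be trimmed.
+function callWithSlidingWindow<T>(messages: Message[], callModel: (messages: Message[]) => T): T {
+  for (;;) {
+    try {
+      return callModel(messages)
+    } catch (error) {
+      if (!(error instanceof ContextWindowOverflowError) || messages.length <= 1) throw error
+      messages.shift() // trim from the front: oldest first
+    }
+  }
+}
+```
+
+Because injected results sit at the end of `messages`, front trimming removes older history first and the newest results are the last to go.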
+
+**Settle window batching helps.** When multiple background tasks finish within the settle window, their results are injected as one batch before the model is called. If the combined batch overflows the context, compaction fires once for the entire batch — one compaction pass for N results, not N separate passes.
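+
+To make the batching effect concrete, the sketch below groups completion timestamps into injection batches. It is a model over timestamps rather than the SDK's timer logic, and it assumes the window is fixed from the first completion in a batch and is not extended by later ones; `Completion` and `batchBySettleWindow` are hypothetical names.
+
+```typescript
+interface Completion {
+  toolUseId: string
+  settledAt: number // milliseconds
+}
+
+// Groups completions into batches. A batch opens when a task settles and stays
+// open for windowMs; anything settling within that span joins the same batch.
+function batchBySettleWindow(completions: Completion[], windowMs: number): Completion[][] {
+  const sorted = completions.slice().sort((a, b) => a.settledAt - b.settledAt)
+  const batches: Completion[][] = []
+  let current: Completion[] = []
+  let windowEnd = -Infinity
+  for (const completion of sorted) {
+    if (completion.settledAt > windowEnd) {
+      current = [completion] // start a new batch, which later costs one model call
+      batches.push(current)
+      windowEnd = completion.settledAt + windowMs
+    } else {
+      current.push(completion)
+    }
+  }
+  return batches
+}
+```
+
+With completions at 100 ms, 120 ms, and 400 ms, a 50 ms window yields two batches in this model, while a 0 ms window yields three.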
+
+**Limitation: compaction cannot recover if results alone exceed the context window.** If the combined size of injected results in a single batch exceeds the model's context window — even after all other messages are trimmed — compaction cannot help. This is an extreme case (e.g., six Agent-as-tool results each producing 10K+ tokens on a small context window) but it is real. Use structured output on sub-agents (below) to bound individual result sizes when running many concurrent background tools or targeting models with smaller context windows.
+
+**Caveat for precision-critical workflows.** When compaction trims early background task results, specific data points that the model referenced but didn't reproduce verbatim in its own responses are lost from the context. In background task workflows, the model's inter-wave reasoning tends to be brief (dispatch tools, wait, incorporate results, dispatch more) — so compacted early results may carry data that the model's thin intermediate responses don't preserve. For most workflows this is acceptable. For precision-critical use cases (financial analysis, legal review, quantitative research) where exact figures must remain available for later reference, use structured output on sub-agents (below) to extract specific data points into compact named fields that survive compaction.
+
+### Structured Output on Sub-Agents for Proactive Control
+
+For developers who want to proactively control how much data crosses the fork-to-parent boundary, the recommended pattern is `structuredOutputSchema` on the sub-agent.
+
+The context blowup problem is specific to Agent-as-tool, where an unconstrained model generates free-form text — including reasoning, prose, and verbose explanations — that becomes a tool result. Plain tools (FunctionTool, ZodTool, MCP tools) control their return value directly in the callback. Structured output on sub-agents strips the output down to just the actionable data, making the sub-agent behave more like a regular tool that returns structured results.
+
+The sub-agent's agent loop runs freely — multiple tool calls, intermediate reasoning, as many cycles as needed. On the final cycle, the structured output tool forces the model to distill its work into the schema. `AgentAsTool` detects the structured output and returns a compact `JsonBlock` instead of the full free-form text:
+
+```typescript
+// Assumes `z` is zod's schema builder (import { z } from 'zod')
+const researcher = new Agent({
+  name: 'researcher',
+  tools: [searchGitHub, searchArxiv, fetchUrl],
+  systemPrompt: 'You are a research specialist. Search thoroughly and report your findings.',
+  structuredOutputSchema: z.object({
+    summary: z.string().describe('2-3 sentence executive summary'),
+    findings: z.array(z.object({
+      source: z.string(),
+      finding: z.string(),
+    })).max(5).describe('Top findings with sources'),
+    sources: z.array(z.string()).max(5).describe('URLs of key sources'),
+  }),
+  printer: false,
+})
+
+const coordinator = new Agent({
+  // analyst and writer are assumed to be sub-agents defined like researcher
+  backgroundTools: [researcher, analyst, writer],
+  systemPrompt: 'Dispatch all researchers, then synthesize findings.',
+})
+```
+
+The researcher searches, reads, and reasons without restriction. Its final output is a compact JSON object — roughly 300-500 tokens crossing the boundary instead of 2000+. The sub-agent's reasoning quality is unaffected; structured output constrains only the final output format.
+
+This is opt-in. Developers who don't set `structuredOutputSchema` get full free-form results. Reactive compaction (above) handles any resulting overflow.
+
+### Summary
+
+| Developer action | Input side | Output side |
+|---|---|---|
+| Nothing | Forks start with empty messages | Reactive compaction on context overflow |
+| `structuredOutputSchema` on sub-agent | Forks start with empty messages | Compact structured JSON crosses the boundary |
+| `inheritMessages: true` on tool | Fork copies parent messages | Reactive compaction on context overflow |
+| `structuredOutputSchema` + `inheritMessages` | Fork copies parent messages | Compact structured JSON crosses the boundary |
+
+A developer who does nothing gets safe defaults — empty forks and reactive compaction. A developer who wants tighter control adds structured output to their sub-agents.
+
+**Fork-internal overflow.** Each fork has its own conversation manager. If a fork's tool execution pushes it over its own context window limit, the fork's overflow handler fires independently — the parent is unaffected.
+
+---
+
+## Design Decisions & Alternatives
+
+### Why backgroundTools as a First-Class Config (And Not Something Else)
+
+We considered three alternative approaches to giving the model async dispatch capability:
+
+**1. Meta-tool: `run_in_background(tool_name, args)`**
+
+A built-in tool the model calls to wrap any other tool call. Instead of the developer declaring which tools run in the background, the model decides at runtime.
+
+- Pro: no new agent parameter. The model adapts per-conversation — it can background a tool in one context and run it foreground in another based on what the task requires.
+- Con: the model must learn a wrapper pattern. Instead of calling `searchWeb({ query: "..." })` directly, it calls `run_in_background({ tool: "searchWeb", args: { query: "..." } })`. This is strictly more complex for the model, and the framework gets no new information from the indirection — it could have intercepted the direct call.
+- Con: without guardrails, the model decides what's safe to background — but it has no knowledge of tool statefulness, latency, or fork safety. Adding a developer-specified allowlist to restrict what the meta-tool can dispatch solves this, but then the developer is declaring a list of backgroundable tools — which is what `backgroundTools` already is, without the meta-tool indirection.
+- Con: incompatible with the system prompt augmentation strategy. The current augmentation lists specific tool names ("The following tools run asynchronously: searchWeb, analyzeData"). A meta-tool approach can't name the tools upfront because any tool could be backgrounded at runtime — the augmentation would have to be generic ("any tool you call through run_in_background is async"). Prompt ablation testing showed that less specific augmentations regressed model compliance (see [Appendix G](#appendix-g-system-prompt-augmentation-rationale)), and a fully generic instruction would be the least specific variant.
+
+**2. Per-tool option: `{ tool: searchWeb, background: true }` in the tools array**
+
+Same dispatch semantics, but expressed as a property on each tool entry rather than a separate parameter.
+
+- Pro: single source of truth — each tool appears once with its configuration. No possibility of a developer adding the same tool to both `tools` and `backgroundTools` and wondering which wins.
+- Pro: extensible wrapper — the config object could carry other per-tool options in the future (per-tool TTL, priority).
+- Con: no precedent in the SDK. Tool arrays are flat lists of `Tool | McpClient | Agent | Graph | Swarm`. Introducing config wrapper objects is a new pattern that every consumer of the array — tool registry, model formatters, MCP integration — must handle, even if the unwrapping itself is trivial.
+- Con: `backgroundTools` as a separate parameter achieves identical semantics with zero changes to the existing tools pipeline. It's additive to `AgentConfig` rather than a new shape in `ToolList`.
+- Verdict: a valid alternative. We chose `backgroundTools` because it requires no type changes, is immediately discoverable via IDE autocomplete on `AgentConfig`, and keeps the existing tools pipeline untouched. If developer feedback shows that managing two lists is confusing, per-tool options can be added as a backwards-compatible enhancement later.
+
+In both `backgroundTools` and the per-tool option, foreground/background assignment is static: it is declared at agent construction and cannot be switched at runtime. This is deliberate. The developer knows which tools are safe to fork (stateless, independent, no shared resources) and which must run inline. Static assignment also has a UX implication: background tool forks run with printing disabled, so the user does not see intermediate reasoning, streaming output, or tool call traces from background work. Developers who want real-time visibility into a tool's execution should keep it foreground. If different contexts require different modes for the same tool, use separate agent configurations.
+
+**Prior art: Mastra's dynamic dispatch.** Mastra solves the "same tool, both modes" problem by allowing the model to include a `_background` field in tool call args to override background/foreground per-call. This is a valid approach that adds flexibility, but it adds a hidden parameter to every tool's input schema, requires the model to learn when to use it, and means the developer can't guarantee a tool will always run in a specific mode. We chose static assignment for v1 because it's simpler, predictable, and lets the developer reason about fork safety at construction time. Dynamic dispatch via an opt-in allowlist (developer pre-approves which tools can be dynamically backgrounded, model decides per-call) is the natural extension path if static assignment proves too restrictive.
+
+**3. Task management tool: `manage_tasks({ action: "create" | "status" | "stop" | "get_result", ... })`**
+
+A further extension of the meta-tool pattern where the model explicitly manages the full task lifecycle — creating tasks, polling for status, stopping tasks, and retrieving results.
+
+- Pro: full visibility — the model can query task status at any time and make decisions based on it (e.g., "if task A is still running, start task B").
+- Con: requires the model to learn a task management API and manually poll for results, adding complexity and wasted cycles. Each status check is a tool call that consumes a full agent loop cycle.
+- Con: polling is fundamentally wasteful — the model repeatedly calls `manage_tasks({ action: "status", taskId: "..." })` instead of receiving results automatically.
+- Our design eliminates the polling overhead: dispatch is transparent (the model calls tools normally), and result delivery is automatic (the agent loop's injection points push results to the model when they're ready). Rather than a monolithic task management API, we provide two focused tools — `list_background_tasks` for status visibility and `cancel_background_task` for stopping work. The model gets on-demand inspection without learning a lifecycle protocol, and results still arrive automatically without polling.
+
+### How Overlap Between tools and backgroundTools Is Handled
+
+If a tool name appears in both `tools` and `backgroundTools`, the agent logs a warning at construction and treats it as a background tool.
+
+**Alternatives considered:**
+
+- **Throw at construction.** Loudest signal, zero ambiguity. But it punishes a common progressive adoption pattern: developer starts with `tools: [a, b, c]`, moves `c` to `backgroundTools`, and forgets to remove it from `tools`. A hard error on what is most likely a migration-in-progress is unnecessarily harsh.
+
+- **`backgroundTools` wins silently.** Does what the developer most likely intended (they added it to `backgroundTools` for a reason) but gives no signal that the overlap exists. A developer who thinks a tool is running foreground because they see it in `tools` will be confused when it actually runs in the background.
+
+- **`tools` wins silently.** Foreground is the "safer" default — if in doubt, block. But this silently ignores the developer's explicit decision to put the tool in `backgroundTools`, which is worse than the reverse.
+
+We chose warn + backgroundTools wins because it respects the developer's most likely intent (they want this tool backgrounded), surfaces the conflict visibly, and doesn't crash the application during an incremental migration.
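+
+The chosen rule (warn, then let `backgroundTools` win) could be sketched as a small partitioning step. This is illustrative only: `resolveToolOverlap` and `NamedTool` are hypothetical names, and the warning text is invented for the example.
+
+```typescript
+interface NamedTool {
+  name: string
+}
+
+// Splits tools into foreground and background sets. A name present in both
+// lists is treated as background and reported through `warn`.
+function resolveToolOverlap<T extends NamedTool>(
+  tools: T[],
+  backgroundTools: T[],
+  warn: (message: string) => void = (message) => console.warn(message)
+): { foreground: T[]; background: T[] } {
+  const backgroundNames = new Set(backgroundTools.map((tool) => tool.name))
+  const foreground: T[] = []
+  for (const tool of tools) {
+    if (backgroundNames.has(tool.name)) {
+      warn(`Tool '${tool.name}' is in both tools and backgroundTools; treating it as a background tool.`)
+    } else {
+      foreground.push(tool)
+    }
+  }
+  return { foreground, background: backgroundTools }
+}
+```
+
+Passing `warn` as a parameter keeps the conflict visible by default while letting tests or applications route the message to their own logger.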
+
+### Why No `invokeAll()` Convenience Method
+
+We considered adding `agent.invokeAll(['query A', 'query B', 'query C'])` as sugar over `Promise.all` + `invokeBackground`. It saves one line but loses composability — you can't cancel individual tasks, await them selectively, or mix them with other async work. Every variation (per-task timeout, partial result handling, error strategies) would require extending the method signature. `invokeBackground()` + `Promise.all` is explicit, composable, and only one step above the raw primitive.
+
+### Why Fork Isolation (Not Shared State)
+
+We considered having forks share the parent's `messages` array, with concurrent writes synchronized via a lock. This would eliminate the need for result injection, because forks would write directly to the conversation. But concurrent writes to a shared conversation create ordering ambiguity, break the model's expectation of a linear conversation, and make debugging very difficult. Isolated forks with explicit result injection preserve linearity and keep the decision points auditable.
+
+### Why In-Process TaskManager (Not External Queue)
+
+We considered delegating background execution to an external task queue (SQS, Redis, Temporal). This would enable durable execution and cross-process distribution, but introduces infrastructure dependencies that conflict with the SDK's zero-config philosophy. The in-process approach handles the common case — I/O-bound tool calls, sub-agent delegation — without requiring users to set up external services. The TaskManager's API surface is designed so that a persistent/distributed backend could replace the in-memory Maps without changing any consumer code (see [Appendix E: Extension to Containerized Dispatch](#appendix-e-extension-to-containerized-dispatch)).
+
+### Why the Agent Loop Waits for Pending Tasks
+
+The blocking wait (see [The Wait State](#the-wait-state--why-block)) serves two paths: Decision Point C (model ends turn, tasks still in flight) and all-ACK turns (every tool result is a background dispatch acknowledgement). In both cases, if the model still has foreground work (calling tools, reasoning, generating text), it continues normally — background tasks settle in the background and get picked up at Decision Point A on the next loop iteration. The wait never blocks the model from doing other work.
+
+When the model has no foreground work and tasks are still pending, the loop blocks until at least one settles. Without this wait, the loop returns, the `finally` block fires `cancelAll()`, and pending tasks are killed — emails don't send, logs don't write. The wait keeps the loop alive so tasks can complete and results can be injected.
+
+This means background tools cannot be true fire-and-forget today. Side-effect tools still benefit from parallel execution (three 4-second calls settle in ~4s instead of 12s), but the agent's response is held until all tasks finish. A near-term enhancement is a per-tool `fireAndForget` flag that excludes specific tasks from the wait — the agent returns immediately while those tasks complete on detached promises, avoiding the infrastructure cost of a persistent backend for the common case of side-effect tools where silent loss on process exit is acceptable.
+
+The path to durable fire-and-forget is a persistent TaskManager backend (see [Appendix E](#appendix-e-extension-to-containerized-dispatch)). With external persistence, task state survives the invocation boundary: the loop returns immediately, tasks finish on their own and write results to the store, and the next `invoke()` picks them up at Decision Point A. The `SessionManager` is a natural near-term integration point, since it already persists conversation state across invocations.
+
+### Why Model-Level Concurrency (Not Just Framework-Level)
+
+The state machine design ([0005](./0005-state-machine.md)) proposes `ConcurrentToolOrchestrator` as a sub-orchestrator that runs tool steps in parallel. This is complementary, not competing:
+
+- `ConcurrentToolOrchestrator` parallelizes tools within a single model turn at the framework level. The model requests N tools, the SDK runs them concurrently, collects all results, and sends them back together. The model doesn't know concurrency happened.
+- `backgroundTools` parallelizes across turns at the model level. The model dispatches work, continues reasoning, and incorporates results incrementally as they arrive. The model actively participates in the concurrency.
+
+Both are useful. Concurrent tool execution is a performance optimization invisible to the model. Background dispatch is a capability the model uses to coordinate complex workflows. Once the state machine lands, a `ConcurrentToolOrchestrator` for foreground tools would compose naturally alongside background dispatch.
+
+### Why a Sentinel in the ACK
+
+The dispatch acknowledgment contains `\uE001STRANDS_BGT\uE001`: the marker `STRANDS_BGT` wrapped in U+E001, a Unicode private-use character that most renderers do not display and that cannot be typed on a standard keyboard. Without this, a user could craft a message containing "Background task dispatched" and the agent loop's all-ack detection would misidentify it as a background dispatch, causing the agent to enter the blocking wait loop incorrectly. The sentinel is machine-generated only, so ordinary text cannot trigger this misdetection. A tool's actual output is not expected to contain it either: U+E001 lies in Unicode's Private Use Area (U+E000 to U+F8FF), which is reserved for application-specific use and is not produced by standard text processing.
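+
+A minimal sketch of sentinel-based detection follows. Only the sentinel string comes from this design; the ACK wording and the helper names (`buildAck`, `isDispatchAck`, `isAllAckTurn`) are assumptions for illustration.
+
+```typescript
+// Sentinel from the design: U+E001 on both sides of a fixed marker.
+const ACK_SENTINEL = '\uE001STRANDS_BGT\uE001'
+
+// Hypothetical ACK format; the exact wording is an assumption.
+function buildAck(toolName: string): string {
+  return `${ACK_SENTINEL}Background task dispatched: ${toolName}`
+}
+
+// A result counts as a dispatch ACK only if it carries the sentinel.
+function isDispatchAck(text: string): boolean {
+  return text.indexOf(ACK_SENTINEL) !== -1
+}
+
+// An all-ACK turn: at least one tool result, and every result carries the sentinel.
+function isAllAckTurn(toolResults: string[]): boolean {
+  return toolResults.length > 0 && toolResults.every((text) => isDispatchAck(text))
+}
+```
+
+Matching on the sentinel rather than on the visible phrase means a message that merely says "Background task dispatched" is not mistaken for an ACK.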
+
+### Why System Prompt Augmentation (And Can It Be Eliminated?)
+
+The ~100-token system prompt block is required because no model provider API supports async tool results natively, and two of the four augmentation bullets address behavioral failure modes (hallucination, premature finalization) that no structural change can eliminate. We investigated five alternative approaches — including synthetic tool_use fabrication, XML-style envelopes, and prompt ablation — and none matched the current augmentation's reliability across Sonnet, Haiku, and Nova at benchmark scale. See [Appendix G](#appendix-g-system-prompt-augmentation-rationale) for the full analysis.
+
+### Why Three Decision Points (And Not One)
+
+The current design has three distinct decision points: A (inject completed results), B (dispatch new tasks), C (wait for next task to finish). The all-ack scenario is a separate branch in the loop that converges on the same blocking wait as C (see [The Wait State](#the-wait-state--why-block)).
+
+A natural simplification: collapse A, C, and all-ack into a single top-of-loop operation that (a) pops any settled bg results and (b) asks "does the model have anything new to reason about?". If yes, call the model. If no and tasks are pending, block and wait. If no and nothing is pending, return. B remains its own concern because it is structurally different (fires mid-turn on each tool call).
+
+We considered this and kept the A/B/C structure for several reasons:
+
+- **The A/B/C vocabulary is already load-bearing.** Implementation, tests, benchmarks, demos, and hook documentation reference specific decision points. A unified design would require renaming these touchpoints for marginal conceptual gain — and the unified design produces identical behavior.
+- **The A/B/C labels are teachable.** Bug reports, stack traces, and hook documentation can reference specific points. "Result came from Point C" is crisper than "result came from the top-of-loop settlement check."
+- **Three dedicated code paths are easier to reason about per-case than one unified predicate.** The predicate "is there anything new for the model?" must correctly answer across many states — foreground results, bg settlements, all-ack dispatch, TTL cancellations, interrupt resumption, structured output forcing. Separating these paths makes each easier to test in isolation.
+- **Heartbeat semantics tied to Point C have a clear observability meaning** ("agent finished its turn, waiting on bg work"). Unifying the wait changes when heartbeats fire relative to turn boundaries — not harmful, but different from the current contract.
+
+The unified design is a legitimate alternative that produces identical user-facing behavior (same events, same latency, same wall-clock results). If future constraints make A/B/C burdensome, it is a safe no-op refactor — no consumer code changes required. For the initial ship, A/B/C wins on clarity, debuggability, and alignment with the existing implementation.
+
+---
+
+## Consequences
+
+### Unlocks
+
+| | Before | After |
+|---|---|---|
+| **Tool-level parallelism** | Model calls tools sequentially, idles while each runs | Model dispatches tools to background, continues reasoning, incorporates results as they arrive |
+| **Concurrent agent invocations** | `invoke()` on the same instance throws; manual cloning required | `fork()` + `invokeBackground()` — one line, guaranteed config match |
+| **Non-blocking side effects** | Emails, logs, CRM updates block the agent until each completes | Fire concurrently; the agent waits only for the slowest task instead of the sum of all calls |
+| **Adaptive pipelines** | Pipeline topology defined at build time (Graph nodes and edges) | Model discovers what to parallelize at runtime based on context and intermediate results |
+
+### Tradeoffs
+
+| | What you pay | How we're addressing it |
+|---|---|---|
+| **Token cost** | Output tokens: 1-18% increase across benchmarks (see Token Δ columns). Input tokens increase proportionally to cycle count — each additional model call re-reads the conversation including previously injected results. Both are the structural cost of incremental reasoning across multiple turns. | Settle window batching reduces turn count. Structured output on sub-agents reduces per-result size. Compaction reclaims context from stale results. |
+| **Context growth** | Each injected result grows the parent's conversation. Multiple waves compound. | Three-layer strategy: empty forks (input), reactive compaction (output), structured output (proactive). See [Context Management](#context-management). |
+| **Session gaps** | Fork internals not persisted. Pending results lost if session saves mid-flight. | v1 limitation. Fork results reach the session via parent injection. Persistent TaskManager (Phase 2) closes this gap fully. |
+| **Fire-and-forget** | Agent loop waits for all tasks to settle before returning. | Side effects still run in parallel (3 emails in ~4s vs ~12s). True fire-and-forget lands with Phase 2 persistent execution. |
+
+---
+
+## Developer Experience
+
+Common patterns using background tasks, from simplest to most advanced.
+
+### Parallel Tool Execution
+
+```typescript
+const agent = new Agent({
+  backgroundTools: [searchWeb, queryDatabase, analyzeCode],
+})
+
+const result = await agent.invoke(
+  'Search for React best practices, check our DB for usage stats, and review the auth module'
+)
+```
+
+### Mixed Foreground and Background Tools
+
+Most agents will have some tools that should block and some that should run in the background. Foreground tools return results inline; background tools return ACKs and deliver results later:
+
+```typescript
+const agent = new Agent({
+  tools: [quickLookup],
+  backgroundTools: [deepResearch, analyzeDataset],
+  systemPrompt: 'Look up the topic first, then dispatch research and analysis in parallel.',
+})
+
+const result = await agent.invoke('What are the latest trends in edge computing?')
+```
+
+The model can call `quickLookup` and get the result inline within the same turn. When it dispatches `deepResearch` and `analyzeDataset`, the SDK returns ACKs immediately and the model continues reasoning. Results arrive as they settle, and the model incorporates them on subsequent cycles.
+
+### Multi-Agent Coordination
+
+Agents convert to tools automatically via `asTool()`, so they can be passed directly to `backgroundTools`:
+
+```typescript
+const techResearcher = new Agent({ name: 'tech_researcher', tools: [searchWeb, readDocs] })
+const marketAnalyst = new Agent({ name: 'market_analyst', tools: [searchMarket] })
+const riskAssessor = new Agent({ name: 'risk_assessor', tools: [analyzeRisk] })
+
+const coordinator = new Agent({
+ backgroundTools: [techResearcher, marketAnalyst, riskAssessor],
+ systemPrompt: 'You are a due diligence coordinator. Dispatch all three analysts, then synthesize.',
+})
+
+const result = await coordinator.invoke('Evaluate Acme Corp as an acquisition target')
+```
+
+### Pipeline as Background Tool
+
+A Graph pipeline used as a background tool runs its entire DAG concurrently with the coordinator's foreground work:
+
+```typescript
+const researchPipeline = new Graph({
+ id: 'research',
+ nodes: [gatherAgent, analyzeAgent, synthesizeAgent],
+ edges: [['gather', 'analyze'], ['analyze', 'synthesize']],
+})
+
+const coordinator = new Agent({
+ backgroundTools: [researchPipeline.asTool({ name: 'deep_research', description: 'Run research pipeline' })],
+})
+```
+
+### Developer-Driven Dispatch
+
+Use `invokeBackground()` when the developer, not the model, should control what runs concurrently. Each call forks the agent internally, so multiple concurrent calls on the same instance are safe. `maxForkDepth` (default 20) limits nested fork chains (a fork that forks another fork), not the number of concurrent tasks:
+
+```typescript
+const researcher = new Agent({ name: 'researcher', tools: [searchWeb] })
+
+const tasks = [
+ researcher.invokeBackground('Research lithium-ion batteries'),
+ researcher.invokeBackground('Research solid-state batteries'),
+ researcher.invokeBackground('Research hydrogen fuel cells'),
+]
+
+const results = await Promise.all(tasks)
+```
+
+`BackgroundTask` implements `PromiseLike`, so `await task` resolves on success and rejects on error. Use `Promise.allSettled` if individual failures should not reject the batch:
+
+```typescript
+const settled = await Promise.allSettled(tasks)
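+// flatMap narrows each result inside the callback, so this type-checks without inferred type predicates (TypeScript 5.5+)
+const successes = settled.flatMap(r => (r.status === 'fulfilled' ? [r.value] : []))
+const failures = settled.flatMap(r => (r.status === 'rejected' ? [r.reason] : []))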
+```
+
+### Bounded Results with Structured Output
+
+For sub-agents running as background tools, `structuredOutputSchema` controls how much data crosses the fork-to-parent boundary — the sub-agent reasons freely, but its final output is constrained to a compact schema. This prevents context blowup when many sub-agents inject results in parallel. See [Context Management](#structured-output-on-sub-agents-for-proactive-control) for the full rationale and example.
+
+### Observability
+
+```typescript
+agent.addHook(BackgroundTaskDispatchEvent, (event) => {
+ console.log(`Dispatched ${event.toolUse.name} as task ${event.taskId}`)
+})
+
+agent.addHook(BackgroundTaskResultEvent, (event) => {
+ console.log(`Task ${event.taskName} ${event.result.status} in ${event.durationMs}ms`)
+})
+
+agent.addHook(BackgroundTaskPendingEvent, (event) => {
+ const names = event.pendingTasks.map(t => t.name).join(', ')
+ console.log(`Waiting: ${names} — ${event.completedCount} completed, ${(event.elapsedMs / 1000).toFixed(1)}s elapsed`)
+ if (event.elapsedMs > 60_000) {
+ console.warn(`Tasks stalled — cancelling`)
+ event.pendingTasks.forEach(t => t.cancel())
+ }
+})
+```
+
+### Configuration Tuning
+
+```typescript
+const agent = new Agent({
+ backgroundTools: [searchWeb, analyzeData],
+ backgroundToolHeartbeatMs: 3000, // Faster heartbeat for time-sensitive workflows
+ backgroundToolSettleWindowMs: 100, // Enable batching for closely-completing tasks (default: 0)
+ maxBackgroundCycles: 100, // More re-entries for many-task workflows
+ backgroundTaskTtlMs: 30_000, // Auto-cancel after 30 seconds
+ maxConcurrentBackgroundTasks: 5, // Max simultaneous forks (excess queues until a slot opens)
+ maxForkDepth: 5, // Limit recursive forking
+})
+```
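+
+The queueing behaviour behind `maxConcurrentBackgroundTasks` can be pictured as a slot limiter. The sketch below is illustrative only: `SlotLimiter` and `demoPeakConcurrency` are hypothetical names, not SDK APIs, and it assumes FIFO ordering for queued work, which this design does not specify.
+
+```typescript
+// Hypothetical sketch of "excess queues until a slot opens"; not the SDK's implementation.
+class SlotLimiter {
+  private readonly maxConcurrent: number
+  private active = 0
+  private readonly waiting: Array<() => void> = []
+
+  constructor(maxConcurrent: number) {
+    this.maxConcurrent = maxConcurrent
+  }
+
+  // Runs `work` immediately if a slot is free; otherwise waits in FIFO order.
+  async run<T>(work: () => Promise<T>): Promise<T> {
+    if (this.active >= this.maxConcurrent) {
+      await new Promise<void>((resolve) => this.waiting.push(resolve))
+    } else {
+      this.active++
+    }
+    try {
+      return await work()
+    } finally {
+      const next = this.waiting.shift()
+      if (next) next() // hand the slot directly to the next queued task
+      else this.active-- // nobody waiting: free the slot
+    }
+  }
+}
+
+// Starts five tasks against a limit of two and reports the peak number running at once.
+async function demoPeakConcurrency(): Promise<number> {
+  const limiter = new SlotLimiter(2)
+  let running = 0
+  let peak = 0
+  const task = async (): Promise<void> => {
+    running++
+    peak = Math.max(peak, running)
+    await new Promise<void>((resolve) => setTimeout(resolve, 10))
+    running--
+  }
+  await Promise.all([1, 2, 3, 4, 5].map(() => limiter.run(task)))
+  return peak
+}
+```
+
+With a limit of 2 and five tasks, at most two run at any moment; each finishing task hands its slot to the next waiter.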
+
+---
+
+## Benchmark Results
+
+All benchmarks use 5-10 runs per configuration. Tool delays are deterministic stubs. Models are accessed via Amazon Bedrock. Validations check output content, tool trajectory, and token parity.
+
+### Case 1: probe-dispatch (Single Layer, 6 Independent Tools)
+
+A NASA mission controller dispatches probes to 3 planets and researches each one. 6 tool calls, 1 wave.
+
+#### Cross-Model Summary
+
+| Model | Runs | Avg Standard | Avg Background | Avg Speedup | σ | Token Δ | Validations |
+|-------|------|-------------|----------------|-------------|---|---------|-------------|
+| Sonnet 4.6 | 5 | 86.5s | 31.4s | **2.80x** | ±0.40 | 4.4% | 110/110 |
+| Haiku 4.5 | 5 | 79.6s | 28.2s | **2.89x** | ±0.40 | 18.2% | 110/110 |
+
+Context growth (avg messages): 6-7 standard vs 8-10 background. The additional messages are injected background task results — expected behavior, not overhead. Input tokens increase 11-40% due to the model seeing injected results across multiple turns rather than in a single batch. See [Context Management](#context-management).
+
+### Case 2: due-diligence (Two Dependent Layers, 10 Tools)
+
+An M&A analyst gathers data across 5 categories, then runs 5 analysis models that depend on the gathered data.
+
+#### Cross-Model Summary
+
+| Model | Runs | Avg Standard | Avg Background | Avg Speedup | σ | Token Δ | Validations |
+|-------|------|-------------|----------------|-------------|---|---------|-------------|
+| Sonnet 4.6 | 5 | 115.2s | 68.2s | **1.70x** | ±0.14 | 9.7% | 170/170 |
+| Haiku 4.5 | 5 | 101.3s | 49.6s | **2.05x** | ±0.12 | 1.3% | 170/170 |
+
+### Case 3: incident-response (Four Dependent Layers, 16 Tools)
+
+A security responder triages an alert, investigates 6 data sources, correlates from 4 angles, then executes 4 containment actions.
+
+#### Cross-Model Summary
+
+| Model | Runs | Avg Standard | Avg Background | Avg Speedup | σ | Token Δ | Validations |
+|-------|------|-------------|----------------|-------------|---|---------|-------------|
+| Sonnet 4.6 | 5 | 199.5s | 114.6s | **1.75x** | ±0.17 | 9.1% | 240/240 |
+| Haiku 4.5 | 5 | 158.9s | 73.9s | **2.17x** | ±0.22 | 4.5% | 240/240 |
+
+### Case 4: session-enrichment (Multi-Turn, Developer-Driven Dispatch)
+
+A customer support agent handles a 5-message conversation. After each response, an `AfterInvocationEvent` hook fires two enrichment agents (summarizer + sentiment analyzer). In standard mode, the hook `await`s both agents sequentially — blocking before `invoke()` returns. In background mode, the hook fires both via `invokeBackground()` — returning immediately so the next customer message can be processed without waiting for enrichment.
+
+#### Cross-Model Summary
+
+| Model | Runs | Avg Standard | Avg Background | Avg Speedup | σ | Token Δ | Validations |
+|-------|------|-------------|----------------|-------------|---|---------|-------------|
+| Sonnet 4.6 | 5 | 163.5s | 16.7s | **10.00x** | ±1.14 | 3.3% | 30/30 |
+| Haiku 4.5 | 5 | 77.3s | 8.3s | **9.35x** | ±0.22 | 6.1% | 30/30 |
+
+Session-enrichment shows the highest speedups because the enrichment work (two sub-agent invocations per turn × 5 turns) runs entirely in parallel via `invokeBackground()`. In standard mode, each enrichment pair blocks for ~12-30s. In background mode, `invokeBackground()` returns immediately, so enrichment completes while the next customer message is processed. Message counts are identical between modes, and token usage differs by only 3.3-6.1% (the Token Δ column above).
+
+### Overall Summary
+
+| Case | Layers | Tools | Best Speedup | Model |
+|------|--------|-------|-------------|-------|
+| probe-dispatch | 1 | 6 | 2.89x | Haiku 4.5 |
+| due-diligence | 2 | 10 | 2.05x | Haiku 4.5 |
+| incident-response | 4 | 16 | 2.17x | Haiku 4.5 |
+| session-enrichment | multi-turn | 2 per turn | 10.00x | Sonnet 4.6 |
+
+Patterns:
+- **Model-driven dispatch** (probe-dispatch, due-diligence, incident-response): 1.7-2.9x speedup. The model dispatches tools in the background and reasons about results as they arrive. Speedup scales with the ratio of tool I/O to model reasoning time.
+- **Developer-driven dispatch** (session-enrichment): 9-10x. `invokeBackground()` in hooks runs work in parallel with the main agent — the highest speedups come from patterns where the main agent doesn't need the background results to proceed.
+
+Note on mixed foreground/background: Background tasks add the most value when there is substantial foreground work to overlap with background execution, or when the goal is non-blocking dispatch. When foreground work is fast and the agent is primarily waiting on background results, the async pattern adds model reasoning overhead (extra cycles processing injected results) that can offset the time saved. For these cases, standard sequential execution is simpler and equally fast.
+
+---
+
+## Real-World Demos
+
+### Research Briefing Generator
+
+A coordinator agent dispatches 4 researcher sub-agents — each an independent Agent-as-tool making real model calls and fetching from live public APIs (GitHub, ArXiv, HackerNews, web docs). Each researcher searches, reasons about results, and produces a structured summary. The coordinator synthesizes all findings into a briefing.
+
+**Topic:** "Background task scheduling in AI agent frameworks"
+
+| Event | Standard | Background |
+|-------|----------|------------|
+| github_researcher | +3.5s → +33.5s (30.0s) | dispatched +4.4s, arrived +31.4s |
+| arxiv_researcher | +33.5s → +63.3s (29.8s) | dispatched +4.4s, arrived +33.3s |
+| hackernews_researcher | +63.3s → +85.9s (22.6s) | dispatched +4.4s, arrived +26.9s |
+| docs_researcher | +85.9s → +138.4s (52.5s) | dispatched +4.4s, arrived +66.2s |
+| **Total research phase** | **134.9s (sequential)** | **61.8s (parallel)** |
+| **Total wall clock** | **192.9s** | **121.9s** |
+
+| Metric | Standard | Background | Delta |
+|--------|----------|------------|-------|
+| Wall clock | 192.9s | 121.9s | **1.58x speedup** |
+| Output tokens | 3,042 | 3,177 | +4.4% |
+| Briefing length | 10,302 chars | 10,887 chars | +5.7% |
+
+### Parallel MCP Dispatch (Gmail Email Dispatch)
+
+A project manager agent composes and sends personalized sprint update emails to 4 team members. The `send_email` tool is provided by a Gmail MCP server over stdio transport (stubbed with 3-5s SMTP delay). The agent doesn't need delivery results — all 4 emails execute in parallel rather than sequentially, and the agent waits for all to complete before returning.
+
+**Key point:** The MCP tool is consumed identically to a real Gmail MCP server — same transport, same protocol, same tool discovery. The only code change is `tools: [mcpClient]` → `backgroundTools: [mcpClient]`.
+
+| Event | Standard | Background |
+|-------|----------|------------|
+| send_email to alice@acme.com | +7.9s → +12.1s (4.2s) | dispatched +9.2s, delivered +13.3s |
+| send_email to bob@acme.com | +12.1s → +15.6s (3.6s) | dispatched +9.2s, delivered +13.6s |
+| send_email to carol@acme.com | +15.6s → +20.0s (4.4s) | dispatched +9.2s, delivered +13.0s |
+| send_email to dave@acme.com | +20.0s → +23.7s (3.6s) | dispatched +9.2s, delivered +13.8s |
+| **Total email time** | **15.8s (sequential)** | **4.6s (parallel)** |
+
+| Metric | Standard | Background | Delta |
+|--------|----------|------------|-------|
+| Wall clock | 29.3s | 17.2s | **1.70x speedup** |
+| Output tokens | 730 | 766 | +4.9% |
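+
+The totals in the event table follow from simple arithmetic: sequential cost is the sum of the per-email durations, while parallel cost is bounded by the slowest delivery measured from the common dispatch time. The snippet below reproduces that calculation from the table's own numbers; the variable names are illustrative.
+
+```typescript
+// Timings copied from the event table above, in seconds.
+const sequentialDurations = [4.2, 3.6, 4.4, 3.6] // per-email time in standard mode
+const dispatchedAt = 9.2 // all four background dispatches share this timestamp
+const deliveredAt = [13.3, 13.6, 13.0, 13.8] // background delivery timestamps
+
+// Sequential cost: the durations add up.
+const sequentialTotal = sequentialDurations.reduce((sum, d) => sum + d, 0)
+
+// Parallel cost: only the slowest task matters, measured from dispatch.
+const parallelTotal = Math.max(...deliveredAt) - dispatchedAt
+
+// Speedup of the tool phase alone (model time excluded).
+const toolPhaseSpeedup = sequentialTotal / parallelTotal
+
+console.log(sequentialTotal.toFixed(1), parallelTotal.toFixed(1), toolPhaseSpeedup.toFixed(1))
+```
+
+The tool phase alone speeds up by roughly 3.4x, while wall clock improves by 1.70x, presumably because the model calls before and after the emails take the same time in both modes.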
+
+### Standard vs Background vs Graph (Product Launch Pipeline)
+
+A product launch announcement generated through a 3-layer pipeline of 6 specialist agents. Three approaches execute the same pipeline:
+
+- **Standard:** Single coordinator, all 6 sub-agents as `tools`. Sequential.
+- **Background:** Single coordinator, all 6 sub-agents as `backgroundTools`. Model discovers parallelism.
+- **Graph:** Explicit DAG with 6 nodes and 6 edges. Graph engine handles parallelism.
+
+| Approach | Wall Clock | vs Standard | Output Length |
+|----------|-----------|-------------|---------------|
+| Standard | 98.1s | baseline | 4,006 chars |
+| Background | 66.8s | **1.47x faster** | 3,371 chars |
+| Graph | 34.8s | **2.82x faster** | 3,866 chars |
+
+```
+STANDARD (sequential):
+ market_analyst ██████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
+ tech_researcher ░░░░░░░░░░██████████░░░░░░░░░░░░░░░░░░░░░░░░░░
+ competitor_scout ░░░░░░░░░░░░░░░░░░░░██████████░░░░░░░░░░░░░░░░
+ marketing_writer ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░██████████░░░░░░
+ technical_writer ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░██████
+ editor ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█
+ 0s 20s 40s 60s 80s 97s
+
+BACKGROUND (model-discovered parallelism):
+ market_analyst ██████████░░░░░░░░░░░░░░░░░░░░░░░░░
+ tech_researcher ██████████░░░░░░░░░░░░░░░░░░░░░░░░░ Layer 1
+ competitor_scout ██████████░░░░░░░░░░░░░░░░░░░░░░░░░ concurrent
+ ──── coordinator model call ────
+ marketing_writer ░░░░░░░░░░░░░░██████████░░░░░░░░░░░ Layer 2
+ technical_writer ░░░░░░░░░░░░░░██████████░░░░░░░░░░░ concurrent
+ ──── coordinator model call ────
+ editor ░░░░░░░░░░░░░░░░░░░░░░░░░░██████░░░
+ 0s 20s 40s 60s 74s
+
+GRAPH (explicit DAG, zero orchestration overhead):
+ market_analyst ██████████░░░░░░░░░░░░
+ tech_researcher ██████████░░░░░░░░░░░░ Layer 1 concurrent
+ competitor_scout ██████████░░░░░░░░░░░░
+ marketing_writer ░░░░░░░░░░██████████░░ Layer 2 concurrent
+ technical_writer ░░░░░░░░░░██████████░░
+ editor ░░░░░░░░░░░░░░░░░░░░██ Layer 3
+ 0s 10s 20s 31s
+```
+
+For guidance on when to use each approach, see [Appendix C: Background Tasks vs Graph](#appendix-c-background-tasks-vs-graph).
+
+---
+
+
+## Appendix A: Landscape Analysis
+
+How other agent frameworks handle concurrent and async execution, as of April 2026.
+
+**The key distinction this appendix tracks:** can the model dispatch a tool and continue reasoning before the result arrives? This is what `backgroundTools` enables. Most frameworks support a weaker form — parallel tool execution within a single turn, where multiple tools run concurrently but the model blocks until all complete.
+
+| Capability | OpenAI Agents SDK | LangGraph | CrewAI | AutoGen | Google ADK | Mastra | **Strands (this design)** |
+|---|---|---|---|---|---|---|---|
+| **Background tools (model continues reasoning)** | No | No | No | No | No | Yes | **Yes** |
+| **Parallel tool execution (same turn, model waits)** | No | Yes (ToolNode) | No | Yes | Yes | Unclear | Yes (ConcurrentToolExecutor, planned) |
+| **Concurrent agent invocations** | Manual | Yes (fan-out/Send) | Yes (async threads) | Yes (actor model) | Yes (ParallelAgent) | Via backgroundTools config | **Yes (fork + invokeBackground)** |
+| **Fork/clone isolation** | Shallow clone | Checkpoint forking | No | AgentId isolation | Deep clone + branch ctx | No agent fork — tools execute directly | **Deep fork with selective sharing** |
+| **Cancellation** | Guardrail tripwire | interrupt\_before/after | No | CancellationToken | Task.cancel (limited) | AbortSignal + timeout | **Per-task cancel + TTL + model-driven** |
+| **Task status tracking** | No | Checkpoint metadata | No | Queue state | Resumable invocations | Persistent storage with status enum | **BackgroundTask status + TaskManager** |
+
+Only Mastra and Strands offer model-driven background dispatch. The remaining frameworks offer at most parallel tool execution within a single turn (the model waits for all results); per the table, OpenAI Agents SDK and CrewAI lack even that. None of them supports async dispatch, where the model continues reasoning while tools run.
+
+#### Mastra Comparison
+
+Mastra is the only other framework with background tool support. Their feature shipped on April 15, 2026 ([PR #15307](https://github.com/mastra-ai/mastra/pull/15307)), with active follow-up work still in progress ([PR #15686](https://github.com/mastra-ai/mastra/pull/15686)). Comparison based on Mastra's source code and PR description:
+
+| | Mastra | Strands (this design) |
+|---|---|---|
+| **Dispatch config** | Per-tool flag (`ToolBackgroundConfig.enabled`) with 4-level priority: LLM override → agent → tool → manager defaults | Separate `backgroundTools` list, static assignment at construction |
+| **LLM dynamic dispatch** | Model includes `_background` field in tool args to override per-call | No dynamic dispatch in v1 — deferred as future extension |
+| **Execution isolation** | None — background tools run as bare function calls, not inside an agent. No conversation copy, no independent lock, no per-task agent state. | Deep fork per background dispatch — full agent copy with isolated messages, state, hooks |
+| **Result injection** | `ResultInjector` → `messageList.addToolResult()` — injected as a standard tool result | User text message with `[Background Task Result]` format and `toolUseId` correlation |
+| **System prompt** | Generated section listing bg-eligible tools and defaults, plus `_background` schema field in tool definitions | ~100-token block with dispatch/result format and behavioral instructions. Ablation-tested across Sonnet, Haiku, Nova ([Appendix G](#appendix-g-system-prompt-augmentation-rationale)). |
+| **Concurrency** | Global limit (default 10), per-agent limit (default 5), backpressure: queue / reject / fallback-sync | Per-agent `maxConcurrentBackgroundTasks` (default 10), queue backpressure |
+| **Cancellation** | Per-task `cancel()` via `BackgroundTaskHandle`, timeout per task | Per-task cancel + TTL + model-driven (`cancel_background_task` + `list_background_tasks`) |
+| **Retry** | Configurable backoff, max retries, retryable error filter, PubSub nack redelivery | Three-layer: tool-level (hooks in fork), task-level (event flag), model-driven |
+| **Persistence** | Storage-backed across Convex, Lance, in-memory. Tasks survive process restarts. Stale task recovery on startup. | In-process only (in-memory Maps). Persistent backend is Phase 2. |
+| **Fire-and-forget** | Yes — `waitTimeoutMs` lets the loop end while tasks run. Results picked up on next user message. | No — agent loop waits for all tasks. Fire-and-forget requires Phase 2 persistence. |
+| **Settlement** | `waitTimeoutMs` per-agent/per-tool, waits for next task one at a time | Heartbeat, settle window (batches closely-completing tasks), all-ack wait guard |
+| **Model-facing tools** | None — model cannot inspect or cancel tasks directly | `list_background_tasks` + `cancel_background_task` auto-registered |
+| **Server/API** | HTTP API, SSE streaming, client SDK, playground UI | None — in-process only |
+| **Benchmarks** | No published benchmarks or cross-model validation data | 4 cases × 2 models, 1,100+ validations, 3 real-world demos with live APIs |
+
+**Where Mastra is ahead:** Persistence (storage-backed, crash recovery, true fire-and-forget), dynamic dispatch (LLM override per-call), server integration (HTTP API, SSE, client SDK, playground UI), and more granular concurrency control (global + per-agent + three backpressure modes). Their architecture is distributed — PubSub for dispatch/result, pluggable storage backends.
+
+**Where Strands is ahead:** Settlement mechanics (heartbeat, settle window batching, and the all-ack guard, which together prevent hallucination and redundant re-dispatches), model-facing tools (the model can inspect and cancel its own background work), fork isolation (full agent copy with explicit shared-vs-copied boundaries), and cross-model validation (benchmarked on Sonnet and Haiku, with the system prompt ablation-tested on Sonnet, Haiku, and Nova, plus published results and real-world demos).
+
+**Key architectural difference — result injection:** Mastra injects results via `messageList.addToolResult()` — the result appears as a standard `tool_result` in the message history. This means the model sees a normal tool call/result pair and may not need to learn async semantics. Strands injects results as user text messages with a structured `[Background Task Result]` format, which requires system prompt augmentation to teach the model the async contract. We chose this approach because retroactive `tool_result` insertion mutates conversation history after the model has already reasoned past that point — see [Appendix G](#appendix-g-system-prompt-augmentation-rationale) for the full analysis. Whether Mastra's approach causes issues with model reasoning or history consistency is unknown — no benchmark or validation data has been published.
+
+**Maturity context:** Mastra's background tasks shipped April 15, 2026, with active development continuing (open PRs for `streamUntilIdle` and sub-agent backgrounding). The feature is new for both frameworks — neither has significant production usage data yet.
+
+#### Why Strands Is Uniquely Positioned for This
+
+Strands' `Agent` class draws a clean boundary between per-invocation state (messages, abort controller, metrics, task manager, conversation manager) and shared infrastructure (model client, tool registry, MCP clients). This is what makes `fork()` both cheap and safe: deep-copy the mutable state, reference-share the immutable config. The result is a fully functional agent that can `invoke()` independently — make model calls, execute tools, manage its own conversation — without touching the parent's state.
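+
+A minimal sketch of the copy-versus-share split described above. `MiniAgent`, `SharedInfra`, and `ChatMessage` are hypothetical stand-ins rather than SDK types, and copying the parent's messages is an assumption made for illustration; the real `fork()` may start from a different conversation.
+
+```typescript
+interface ChatMessage {
+  role: 'user' | 'assistant'
+  text: string
+}
+
+// Shared infrastructure: treated as immutable, so forks hold a reference to it.
+interface SharedInfra {
+  readonly modelId: string
+  readonly toolNames: readonly string[]
+}
+
+class MiniAgent {
+  readonly infra: SharedInfra
+  messages: ChatMessage[]
+
+  constructor(infra: SharedInfra, messages: ChatMessage[] = []) {
+    this.infra = infra
+    this.messages = messages
+  }
+
+  // Copy the per-invocation state, share the infrastructure by reference.
+  fork(): MiniAgent {
+    return new MiniAgent(this.infra, this.messages.map((m) => ({ ...m })))
+  }
+}
+
+const parentAgent = new MiniAgent({ modelId: 'example-model', toolNames: ['searchWeb'] }, [
+  { role: 'user', text: 'Compare battery chemistries' },
+])
+const forkedAgent = parentAgent.fork()
+
+// Mutating the fork's conversation leaves the parent untouched.
+forkedAgent.messages.push({ role: 'assistant', text: 'Working on it' })
+```
+
+After the push, the parent still holds one message while the fork holds two, and both point at the same `infra` object.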
+
+The hook propagation system reinforces this: the `propagate` flag on `addHook` lets the framework control which callbacks make sense on a fork (logging, guardrails) and which don't (session persistence, overflow recovery), preventing forks from corrupting the parent's session state.
+
+
+
+
+## Appendix B: Interface Design Rationale
+
+Key API shape decisions and alternatives rejected.
+
+**Why `BackgroundTask` implements `PromiseLike` instead of exposing `.promise`.**
+
+A developer who gets a task handle wants to `await` it. With `PromiseLike`, `await task` works directly. The alternative — `await task.promise` — is one extra property access that adds nothing and creates a second path to the same value. `PromiseLike` also lets `BackgroundTask` work with `Promise.all`, `Promise.race`, and any utility that accepts thenables.
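+
+To make the thenable argument concrete, here is a hedged sketch of a handle that implements `PromiseLike` by delegating `then`. `TaskHandle` and `demoAwait` are hypothetical names; the real `BackgroundTask` carries more state than this.
+
+```typescript
+// Hypothetical stand-in for BackgroundTask; only the thenable behaviour is sketched.
+class TaskHandle<T> implements PromiseLike<T> {
+  readonly name: string
+  private readonly settled: Promise<T>
+
+  constructor(name: string, work: Promise<T>) {
+    this.name = name
+    this.settled = work
+  }
+
+  // Delegating `then` is all that `await`, Promise.all, and Promise.race require.
+  then<R1 = T, R2 = never>(
+    onfulfilled?: ((value: T) => R1 | PromiseLike<R1>) | null,
+    onrejected?: ((reason: any) => R2 | PromiseLike<R2>) | null,
+  ): PromiseLike<R1 | R2> {
+    return this.settled.then(onfulfilled, onrejected)
+  }
+}
+
+async function demoAwait(): Promise<string[]> {
+  const a = new TaskHandle('a', Promise.resolve('alpha'))
+  const b = new TaskHandle('b', Promise.resolve('beta'))
+  const first = await a // awaits the handle directly, no `.promise` property
+  const both = await Promise.all([a, b]) // thenables are accepted as-is
+  return [first, ...both]
+}
+```
+
+Because the handle is itself a thenable, there is only one way to wait on it, which is the point the paragraph above makes against a separate `.promise` property.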
+
+**Why `BackgroundTask` tracks state with internal flags instead of delegating to the Promise.**
+
+A Promise is either pending, fulfilled, or rejected — there is no "cancelled" or "queued" state. We need these as distinct statuses because TTL expiration, explicit cancel, and concurrency backpressure are not errors from the developer's perspective. A discriminated union tracks the full state (status + associated data: result value, error, or cancellation reason) synchronously, enabling race-safe status checks without awaiting the promise.
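+
+One way to picture the discriminated union is the sketch below. The status names other than "queued" and "cancelled" are assumptions, and `TrackedTask` and `describeState` are hypothetical; the example only shows how a synchronous state field lets a late completion lose to an earlier cancel.
+
+```typescript
+// Each variant carries only the data that makes sense for that status.
+type TaskState<T> =
+  | { status: 'queued' }
+  | { status: 'running' }
+  | { status: 'completed'; result: T }
+  | { status: 'failed'; error: Error }
+  | { status: 'cancelled'; reason: string }
+
+class TrackedTask<T> {
+  private current: TaskState<T> = { status: 'queued' }
+
+  // Synchronous read: no need to await anything to learn the status.
+  get state(): TaskState<T> {
+    return this.current
+  }
+
+  start(): void {
+    if (this.current.status === 'queued') this.current = { status: 'running' }
+  }
+
+  // Ignored unless the task is still running, so a late result cannot overwrite a cancel.
+  complete(result: T): void {
+    if (this.current.status === 'running') this.current = { status: 'completed', result }
+  }
+
+  cancel(reason: string): void {
+    const { status } = this.current
+    if (status === 'queued' || status === 'running') this.current = { status: 'cancelled', reason }
+  }
+}
+
+function describeState<T>(state: TaskState<T>): string {
+  switch (state.status) {
+    case 'completed':
+      return 'completed: ' + String(state.result)
+    case 'failed':
+      return 'failed: ' + state.error.message
+    case 'cancelled':
+      return 'cancelled: ' + state.reason
+    default:
+      return state.status
+  }
+}
+```
+
+A task that is cancelled by TTL and then receives its result stays `cancelled`, which a bare Promise could only express as a rejection.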
+
+**Why `invokeBackground()` takes a prompt string, not a message array.**
+
+It mirrors `invoke(prompt)`. The common case is "run this agent with this instruction." Developers who need full control over the conversation pass `{ messages }` in the options object — same pattern as `fork()`.
+
+**Why `TaskManager` is per-agent-instance, not a singleton.**
+
+Each agent's background tasks are scoped to that agent's lifecycle. When the agent's stream exits, its `TaskManager.cancelAll()` fires. A singleton would require tracking which tasks belong to which agent, and cleanup would depend on agents explicitly de-registering — a leak-prone pattern. Per-instance managers also mean forks get fresh, independent task tracking with no cross-contamination.
+
+**Why `fork()` takes an options object instead of positional parameters.**
+
+Every fork option is optional. Positional parameters would force callers to pass `undefined` for options they don't use: `agent.fork(undefined, undefined, customSystemPrompt)`. An options object lets callers specify only what they need: `agent.fork({ systemPrompt: customSystemPrompt })`.
+
+**Why `backgroundTools` is a flat array, not a map of `{ name: config }`.**
+
+Tools already have names. A map keyed by tool name (`{ searchWeb: { ttl: 30000 } }`) duplicates the name and creates a sync problem — if the tool is renamed, the map key must be updated separately. A flat array with optional per-tool config (deferred to a future iteration) avoids this. The current design intentionally keeps `backgroundTools` as simple as `tools`: just list them.
+
+
+
+
+## Appendix C: Background Tasks vs Graph
+
+Both `backgroundTools` and `Graph` execute work concurrently. They solve different problems and compose together.
+
+**Graph wins when the pipeline is known at build time.**
+
+A Graph defines nodes, edges, and data flow before the agent runs. The engine resolves the DAG, runs independent nodes in parallel, and passes results along edges. No model calls are needed for orchestration — the topology IS the orchestration. This makes Graph faster (no coordinator model calls between layers) and deterministic (same graph, same execution order).
+
+Use Graph when:
+- The pipeline structure doesn't change between invocations
+- Maximum throughput matters more than flexibility
+- The steps and their dependencies are known upfront
+
+**Background tasks win when the pipeline emerges from reasoning.**
+
+With `backgroundTools`, the model decides what to dispatch based on context. It can dispatch follow-up work based on early results, skip steps that turn out to be unnecessary, or add steps that weren't anticipated. The pipeline is discovered at runtime, not declared at build time.
+
+Use background tasks when:
+- The number of steps, their ordering, or even which tools to call depends on intermediate results
+- The workflow should adapt to user input without code changes (change the prompt, change the pipeline)
+- Side-effect tools (emails, logs) should execute in parallel rather than sequentially
+
+**They compose.**
+
+A Graph pipeline used as a background tool gets both: the Graph engine handles internal parallelism within the pipeline, while background dispatch lets the coordinator continue reasoning while the entire pipeline runs:
+
+```typescript
+const coordinator = new Agent({
+ backgroundTools: [
+ researchPipeline.asTool({ name: 'deep_research', description: '...' }),
+ quickLookup,
+ ],
+})
+```
+
+The coordinator dispatches `deep_research` (a full Graph DAG) in the background, calls `quickLookup` in the foreground, and reasons about both results as they arrive. The Graph handles its internal node-level concurrency; background tasks handle the agent-level concurrency.
+
+**The speed gap is the cost of flexibility.**
+
+In the Product Launch Pipeline demo, Graph was 2.8x faster than standard while background tasks were 1.5x faster. The difference is coordinator model calls — background tasks require the model to reason between dispatch waves, while Graph has zero orchestration overhead. This is a fundamental tradeoff, not a performance bug: the model calls are where the adaptive decision-making happens.
+
+#### When to Use Each
+
+| | Standard | Background | Graph |
+|---|---|---|---|
+| **Code complexity** | Simplest — list tools | One line change + prompt | Define nodes, edges, sources |
+| **Parallelism** | None | Model-discovered | Developer-defined DAG |
+| **Orchestration overhead** | None | 2-3 coordinator model calls | None |
+| **Flexibility** | N/A | Change prompt → change pipeline | Change code → change pipeline |
+| **Speed** | Slowest (baseline) | ~1.5x faster | ~2.8x faster |
+
+- **Standard:** Simple workflows where parallelism doesn't matter.
+- **Background:** Dynamic workflows where the pipeline may change. Trades speed for flexibility.
+- **Graph:** Fixed pipelines where maximum throughput matters. Requires upfront design.
+
+
+
+
+## Appendix D: Naming Alternatives
+
+Names considered for each new primitive. Current choices are bolded.
+
+#### Agent Config
+
+| Name | Pros | Cons |
+|------|------|------|
+| **`backgroundTools`** | Describes the relationship to the agent loop (runs in the background). Self-documenting. | Longer than alternatives |
+| `asyncTools` | Shorter | In JS/TS, "async" means "returns a Promise" — all tools are already async in that sense. Misleading |
+| `concurrentTools` | Technically accurate | Describes the execution model, not the relationship to the agent. Graph tools also run concurrently |
+| `deferredTools` | Captures the result-delivery semantics | Sounds like the tool itself is delayed, not that its execution is non-blocking |
+| `parallelTools` | Intuitive | Same problem as `concurrentTools` — Graph is also parallel |
+
+#### Isolation Method
+
+| Name | Pros | Cons |
+|------|------|------|
+| **`fork()`** | Established concurrency metaphor. Implies independent execution from a shared origin | OS-level fork connotation (process duplication) may mislead |
+| `clone()` | Common in JS | Implies a full deep copy. We share immutable config — it's not a clone |
+| `branch()` | Git-inspired, implies divergence from a common point | No established SDK precedent for this meaning |
+| `spawn()` | Process/thread connotation, implies new independent execution | Implies heavier-weight than what we do (no new process) |
+
+#### Developer Dispatch Method
+
+| Name | Pros | Cons |
+|------|------|------|
+| **`invokeBackground()`** | Mirrors `invoke()`. The "Background" suffix is self-documenting | Verbose |
+| `dispatch()` | Short, clean | Too generic — doesn't indicate non-blocking or background semantics |
+| `invokeAsync()` | Mirrors `invoke()` | Already used in the SDK for a different purpose (async/await invocation that still blocks) |
+| `spawnTask()` | Combines spawn + task | "spawn" isn't established in the Strands vocabulary |
+
+#### Task Handle
+
+| Name | Pros | Cons |
+|------|------|------|
+| **`BackgroundTask`** | Matches `backgroundTools`. Unambiguous | Verbose |
+| `Task` | Short | Collides with too many things — DOM, Node.js, libraries |
+| `AgentTask` | More specific to Strands | Doesn't indicate background/async nature |
+
+#### Lifecycle Manager
+
+| Name | Pros | Cons |
+|------|------|------|
+| **`TaskManager`** | Matches `BackgroundTask`. Clear responsibility | Generic — could mean anything that manages tasks |
+| `TaskStore` | MCP precedent (`TaskStore` in MCP spec) | We do more than store — settlement, heartbeat, TTL. "Store" undersells |
+| `TaskRegistry` | Focuses on registration/lookup | Doesn't capture lifecycle management (cancel, TTL, settlement) |
+| `Scheduler` | Implies timing and ordering | We don't schedule in the traditional sense — tasks run immediately |
+
+
+
+
+## Appendix E: Extension to Containerized Dispatch
+
+The TaskManager's API surface — `create`, `get`, `list`, `cancel`, `cancelAll`, `popCompleted`, `waitForNextSettlement` — maps directly to the operations needed for containerized/persistent agent dispatch:
+
+| TaskManager (in-process) | Containerized equivalent |
+|--------------------------|-------------------------|
+| `create(name, promise, { onCancel })` | Spawn container, register task in persistent store |
+| `cancel(id)` | Send stop signal to container |
+| `popCompleted()` | Poll persistent store for settled tasks |
+| `waitForNextSettlement()` | Subscribe to task status notifications |
+| TTL timers | Container idle timeout / max runtime |
+
+A future implementation could replace the in-memory Maps with HTTP calls + persistent storage without changing the agent loop or any consumer code. The interface is ready — only the storage layer would change.
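+
+The storage seam could look roughly like the sketch below. This is an assumption-laden illustration: `TaskBackend`, `InMemoryBackend`, `TaskRecord`, and the `settle` method are hypothetical, and only `create`, `cancel`, and `popCompleted` echo names from the table above. Methods return promises so that an HTTP-backed implementation could satisfy the same interface.
+
+```typescript
+interface TaskRecord {
+  id: string
+  name: string
+  status: 'running' | 'completed' | 'cancelled'
+  result?: string
+}
+
+// Hypothetical storage seam: the agent loop would depend only on this interface.
+interface TaskBackend {
+  create(name: string): Promise<TaskRecord>
+  settle(id: string, result: string): Promise<void>
+  cancel(id: string): Promise<boolean>
+  popCompleted(): Promise<TaskRecord[]>
+}
+
+// In-process variant backed by a Map; a containerized variant would call a remote store.
+class InMemoryBackend implements TaskBackend {
+  private readonly tasks = new Map<string, TaskRecord>()
+  private nextId = 1
+
+  async create(name: string): Promise<TaskRecord> {
+    const record: TaskRecord = { id: 'task-' + this.nextId++, name, status: 'running' }
+    this.tasks.set(record.id, record)
+    return record
+  }
+
+  async settle(id: string, result: string): Promise<void> {
+    const record = this.tasks.get(id)
+    if (record && record.status === 'running') {
+      record.status = 'completed'
+      record.result = result
+    }
+  }
+
+  async cancel(id: string): Promise<boolean> {
+    const record = this.tasks.get(id)
+    if (!record || record.status !== 'running') return false
+    record.status = 'cancelled'
+    return true
+  }
+
+  // Returns settled tasks exactly once by removing them from the store.
+  async popCompleted(): Promise<TaskRecord[]> {
+    const done = Array.from(this.tasks.values()).filter((t) => t.status !== 'running')
+    done.forEach((t) => this.tasks.delete(t.id))
+    return done
+  }
+}
+
+async function demoBackend(backend: TaskBackend): Promise<string[]> {
+  const search = await backend.create('searchWeb')
+  const slow = await backend.create('deepResearch')
+  await backend.settle(search.id, '3 hits')
+  await backend.cancel(slow.id)
+  const popped = await backend.popCompleted()
+  return popped.map((t) => t.name + ':' + t.status)
+}
+```
+
+`demoBackend` takes the interface rather than the concrete class, which is the property that would let a persistent backend be swapped in without touching consumer code.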
+
+This aligns with the [containerized Strands agents](https://github.com/mkmeral/containerized-strands-agents) approach, which uses MCP task-based communication between a host and Docker containers. The natural extension path:
+
+1. **Phase 1 (this PR):** In-process background task scheduling with the TaskManager abstraction. Forks are ephemeral — they run to completion and return a result.
+2. **Phase 2:** Pluggable TaskManager backend — persistent storage for crash recovery, distributed dispatch for containerized agents.
+3. **Phase 3:** Persistent forks with message channels. Today, forks are ephemeral — the parent dispatches, the fork runs to completion, and the result comes back. The natural evolution is forks that survive invocation boundaries and accept follow-up messages — enabling interactive background agents. This requires `SessionManager` integration (fork state persists across process restarts), a communication primitive between parent and fork (message inbox per fork), and TaskManager awareness of long-lived tasks with pause/resume lifecycle.
+
+#### Other Future Extensions
+
+- **Task priority.** Results arrive in completion order, not dispatch order. The model cannot prioritize — if it dispatches A, B, C and needs A's result before deciding on D, it may receive B or C first. The queue could support priority levels so higher-priority tasks get slots first and their results are injected before lower-priority results that finished earlier.
+- **Dynamic dispatch (LLM override).** Allow the model to include a `_background` flag in tool call args to override foreground/background per-call, gated by a developer-specified allowlist of tools eligible for dynamic dispatch. Inspired by Mastra's approach. See [Design Decisions](#why-backgroundtools-as-a-first-class-config-and-not-something-else).
+- **`resultSchema` on `asTool()`.** Project a sub-agent's full output into a compact schema at the tool boundary, without constraining the sub-agent itself. Enables different output sizes for the same agent in foreground vs. background contexts.
+- **Per-tool TTL.** Currently `backgroundTaskTtlMs` applies uniformly. Per-tool TTL would let developers set appropriate deadlines per tool (fast lookups get 10s, deep research gets 60s).
+- **Context utilization metrics.** Injected result token count on `BackgroundTaskResultEvent`, context utilization percentage on `BackgroundTaskPendingEvent`. Gives developers observability into context growth from background tasks.
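As an illustration of the task-priority item in the list above, the following sketch orders settled results before injection. The `priority` field and the `orderForInjection` helper are assumptions for illustration; the current design has no priority concept and injects in completion order.

```typescript
// Hypothetical: `priority` is not a field in the current design.
interface PendingInjection {
  toolUseId: string;
  priority: number; // higher values are injected first
  settledAt: number; // completion time in epoch milliseconds
}

// Sort by priority (descending), then by completion time (ascending).
// The input array is copied so the caller's queue is left untouched.
function orderForInjection(results: readonly PendingInjection[]): PendingInjection[] {
  return results
    .slice()
    .sort((a, b) => b.priority - a.priority || a.settledAt - b.settledAt);
}
```

With equal priorities this degrades to today's completion-order behaviour, so it could be introduced without changing existing conversations.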
+
+
+
+
+
+Appendix G: System Prompt Augmentation Rationale
+
+The current design appends a ~100-token block to the system prompt explaining the async contract to the model. Can we eliminate this with pure API/struct changes?
+
+**The hard constraint.** All major model provider APIs (Anthropic, OpenAI, Bedrock, Gemini) require synchronous `tool_use` → `tool_result` pairing within adjacent turns. There is no "pending" or "deferred" `tool_result` state. We are forced to return *something* synchronously when a background tool is dispatched, and the real result, which arrives later, cannot be a `tool_result` because the API rejects late-arriving pairs. It has to be some other content block.
+
+**Two layers of the problem.** Eliminating the prompt would require fixing both:
+
+1. **Wire protocol** — the API schema itself must accept a representation of async tool results. Only providers can change this.
+2. **Model behavior** — given valid inputs, the model must interpret the async pattern correctly. This is partially a training artifact.
+
+These are coupled: we cannot test whether frontier models would natively handle async semantics because we cannot send the API a structure that expresses one. The protocol gates the experiment.
+
+**Why each bullet of the augmentation earns its place.** Each of the four bullets in the current augmentation was validated during benchmark development. Removing any one regresses a specific failure mode:
+
+| Bullet | Prevents | Structural fix possible? |
+|---|---|---|
+| "Calling one returns: Background task dispatched" | Re-dispatch loops | Partially — richer ACK structure helps, not bulletproof |
+| "Do not guess results. Continue working or yield." | Hallucination | No — training artifact |
+| "Results arrive as user messages: [Background Task Result]..." | Ignoring inbound results | Partially — richer inbound format helps |
+| "Incorporate all [Background Task Result] messages before final response." | Premature finalization | No — training artifact |
+
+Two of the four are behavioral failures that no structural change eliminates. The other two are partially addressable, but only by paying explanation tokens per message rather than once per invocation, which scales worse for multi-dispatch conversations.
+
+**Alternatives considered.**
+
+- **Retroactive history rewriting** — delete the ACK and intervening reasoning when the real result arrives, replace with the real `tool_result`. Rejected: destroys the async benefit by discarding the model's intervening work, corrupts audit trails, and requires sending fabricated history on every turn.
+
+- **Additive `tool_result` injection (Mastra's approach)** — when the background task completes, insert the result as a `tool_result` entry in the message list paired with the original `tool_use`, without deleting the ACK or intervening reasoning. The model sees both the placeholder and the real result in its history. This is what Mastra ships ([PR #15307](https://github.com/mastra-ai/mastra/pull/15307), April 2026). The approach avoids needing the model to learn a new result format (it's a standard `tool_result`), but raises open questions: does the model get confused by two results for the same `tool_use` (ACK + real result)? Does the conversation history remain valid for providers that enforce strict `tool_use`/`tool_result` pairing? No published benchmarks or cross-model validation data exists for this approach. Our user-message injection with system prompt augmentation is more explicit but validated across 3 models with 1,100+ validations.
+
+- **Self-describing ACK/result blocks** — make the ACK and inbound result formats so structurally obvious (e.g., `{"async": true, "status": "dispatched", "note": "..."}`) that the model infers the contract without system-level explanation. Eliminates the system prompt entry but pays explanation tokens *per message*. A conversation with 10 dispatches and 10 results pays 20x the tokens vs. one system prompt entry.
+
+- **Two-tool dispatch/await pattern** — expose `dispatch_bg(name, args)` and `get_bg_result(task_id)` as explicit tools the model calls. Rejected: abandons the "model calls tools normally" UX that is the defining feature. The model must decide *when* to poll (too early = wasted round-trip, too late = unnecessary delay), each poll is a full model call with token and latency cost, and the conversation accumulates result-checking turns that add no reasoning value. Injection-based delivery eliminates all polling — results appear automatically when ready.
+
+- **Synthetic `receive_task` tool_use + tool_result** — before injecting a settled result, forge an assistant `tool_use` for a fictional `receive_task` tool, then deliver the real result as a `tool_result` paired with it. Considered and deferred: the risks outweigh the benefit.
+ - **In-context learning / confabulation.** The conversation history would permanently contain tool calls the model never authored. Models pick up the pattern mid-invocation and start emitting `receive_task` spontaneously.
+ - **Provider validation is unforgiving.** Anthropic requires a `thinking` block to precede `tool_use` in the same assistant turn when extended thinking is enabled. Bedrock and OpenAI each have their own invariants. A per-provider fabrication layer adds more surface area than the system-prompt entry.
+ - **Prompt cache fragmentation.** Injecting synthesized assistant content invalidates prefix caches on every bg completion.
+ - **Turn-boundary sharp edges.** If the model is mid-stream when a bg result settles, the synthetic `tool_use` has nowhere clean to land.
+
+- **Tool schema injection** — add framework fields (e.g., `_background: { enabled, timeoutMs }`) to each background tool's input schema, letting the model control execution mode per-call. This is Mastra's approach. Rejected: pollutes every background tool's domain interface with framework-internal parameters, adds cognitive load for the model, and creates a vector for the model to misconfigure execution semantics. Tool schemas should describe the tool's interface, not its execution mode.
+
+- **Per-tool description augmentation** — append "this tool runs asynchronously" to each background tool's description instead of a centralized system prompt block. Rejected: distributes the async contract across N tool descriptions instead of centralizing it. With 5 background tools, that's 5x the augmentation tokens vs one ~100-token block. Behavioral instructions ("don't fabricate results," "incorporate all results before responding") would need to be repeated per-tool or hoisted to the system prompt anyway.
+
+- **Envelope refinement (XML-style delimiters)** — tested during benchmark development. Models hallucinated additional `` tags in their output and confabulated results for tasks that never existed. XML tags carry heavier trained weight as authored content than as framework-injected data. Bracket-prefix avoids these confabulation modes.
+
+- **Minimum-viable prompt** (ablation) — we tested shorter augmentations removing structural bullets while keeping only the behavioral ones. Haiku 4.5 regressed on re-dispatch loops.
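The token arithmetic behind the self-describing and per-tool alternatives above can be sketched as follows. The function names are illustrative only. Like the prose, the sketch assumes each per-message or per-tool explanation costs roughly as much as the centralized block.

```typescript
// Tokens spent explaining the async contract under each placement strategy.
// All three helpers are illustrative and not part of the SDK.

// One centralized block, paid once per invocation.
function systemPromptTokens(blockTokens: number): number {
  return blockTokens;
}

// Self-describing ACK and result blocks: paid on every dispatch and every result.
function perMessageTokens(explanationTokens: number, dispatches: number, results: number): number {
  return explanationTokens * (dispatches + results);
}

// Per-tool description augmentation: paid once per background tool.
function perToolDescriptionTokens(noteTokens: number, backgroundToolCount: number): number {
  return noteTokens * backgroundToolCount;
}
```

For example, `perMessageTokens(100, 10, 10)` is 20 times `systemPromptTokens(100)`, and `perToolDescriptionTokens(100, 5)` is 5 times, matching the ratios quoted in the list.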
+
+**The long-term path.** If any provider adds `tool_result.status: "pending"` (or equivalent) to their API, this augmentation becomes obsolete. Async tool semantics will likely follow the same trajectory as streaming tool calls — framework-level experimentation first, native API support later.
+
+
+
+
+Appendix H: Conversation Traces
+
+Detailed message-level traces showing exact roles, content block types, and ordering for every background task scenario. Each trace shows what the model sees in its conversation history.
+
+#### Mixed Turn (Foreground + Background)
+
+The most common case. The model calls one foreground tool and one background tool in the same turn. The foreground result arrives inline; the background result arrives later.
+
+```
+Turn 1:
+ [user] text: "Search for agentic AI trends and calculate our current metrics"
+
+Turn 2:
+ [assistant] text: "I'll search in the background and calculate metrics now."
+ tool_use: { toolUseId: "tu-1", name: "searchWeb",
+ input: { query: "agentic AI trends 2026" } }
+ tool_use: { toolUseId: "tu-2", name: "calculateMetrics",
+ input: { quarter: "Q2" } }
+
+ [user] tool_result: { toolUseId: "tu-1", status: "success",
+ content: "Background task dispatched" }
+ tool_result: { toolUseId: "tu-2", status: "success",
+ content: "Revenue: $4.2M, Growth: 18%" }
+
+Turn 3:
+ [assistant] text: "Metrics show $4.2M revenue with 18% growth.
+ I'll incorporate the search results when they arrive."
+
+ ← searchWeb completes, SDK injects result →
+
+ [user] text: "[Background Task Result]
+ tool: searchWeb
+ toolUseId: tu-1
+ status: success
+
+ 1. Multi-agent orchestration frameworks gaining traction...
+ 2. Background task scheduling emerging as key differentiator..."
+
+Turn 4:
+ [assistant] text: "Based on both the metrics and search results, here's my analysis..."
+```
+
+The model reasons about the foreground result immediately (Turn 3), then incorporates the background result when it arrives (Turn 4).
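The injected message shape in the trace can be reproduced with a small formatter. This is a sketch that assumes the layout is exactly what the traces show (prefix line, three header fields, a blank line, then the payload). `BackgroundResult` and `formatBackgroundResult` are hypothetical names, not SDK exports.

```typescript
// Hypothetical helper that renders the text of an injected user message.
type InjectedStatus = "success" | "error" | "cancelled";

interface BackgroundResult {
  tool: string;
  toolUseId: string; // correlates the result with the original tool_use block
  status: InjectedStatus;
  content: string;
}

function formatBackgroundResult(result: BackgroundResult): string {
  return [
    "[Background Task Result]",
    `tool: ${result.tool}`,
    `toolUseId: ${result.toolUseId}`,
    `status: ${result.status}`,
    "", // blank line separates the header from the payload
    result.content,
  ].join("\n");
}
```

Keeping the header fields on fixed lines makes the format easy to assert against in tests and easy for the model to scan for `toolUseId`.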
+
+#### All-ACK Turn
+
+Every tool in the batch is a background tool. The SDK waits for at least one real result before calling the model — the model never sees a turn of pure ACKs.
+
+```
+Turn 1:
+ [assistant] tool_use: { toolUseId: "tu-1", name: "searchWeb",
+ input: { query: "agentic AI" } }
+ tool_use: { toolUseId: "tu-2", name: "analyzeData",
+ input: { dataset: "usage_logs" } }
+
+ [user] tool_result: { toolUseId: "tu-1", status: "success",
+ content: "Background task dispatched" }
+ tool_result: { toolUseId: "tu-2", status: "success",
+ content: "Background task dispatched" }
+
+ ← SDK waits. Model is NOT called yet. →
+ ← searchWeb finishes first, SDK injects result →
+
+ [user] text: "[Background Task Result]
+ tool: searchWeb
+ toolUseId: tu-1
+ status: success
+
+ "
+
+Turn 2:
+ [assistant] text: "Search results show... I'll wait for the data analysis."
+
+ ← analyzeData finishes, SDK injects result →
+
+ [user] text: "[Background Task Result]
+ tool: analyzeData
+ toolUseId: tu-2
+ status: success
+
+ "
+
+Turn 3:
+ [assistant] text: "Now that I have both results..."
+```
+
+The two `[user]` messages between Turn 1 and Turn 2 (the tool_result ACKs and the first background result) are both `role: user`. Providers that require alternating roles coalesce these automatically (Anthropic natively, Bedrock and Google via SDK adapter).
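For providers that need strict role alternation, the coalescing step could look roughly like the sketch below. The message and block types are simplified stand-ins chosen for illustration; they are not the SDK's types, and real adapters may merge differently.

```typescript
// Simplified stand-in types, assumed for illustration only.
type ChatRole = "user" | "assistant";

interface ContentBlock {
  type: "text" | "tool_use" | "tool_result";
  value: string;
}

interface ChatMessage {
  role: ChatRole;
  content: ContentBlock[];
}

// Merge consecutive messages that share a role into one message,
// preserving block order. The input messages are not mutated.
function coalesceAdjacentRoles(messages: readonly ChatMessage[]): ChatMessage[] {
  const merged: ChatMessage[] = [];
  for (const message of messages) {
    const last = merged[merged.length - 1];
    if (last !== undefined && last.role === message.role) {
      last.content.push(...message.content);
    } else {
      merged.push({ role: message.role, content: [...message.content] });
    }
  }
  return merged;
}
```

Applied to the trace above, the ACK `tool_result` message and the first injected text message collapse into a single `user` message with three blocks.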
+
+#### Batched Results
+
+When `settleWindowMs` is configured (default: 0ms — disabled), multiple tasks that finish within the window are injected as separate `TextBlock`s in a single user message.
+
+```
+ ← searchWeb and analyzeData both finish within 100ms of each other →
+
+ [user] text: "[Background Task Result]
+ tool: searchWeb
+ toolUseId: tu-1
+ status: success
+
+ "
+
+ text: "[Background Task Result]
+ tool: analyzeData
+ toolUseId: tu-2
+ status: success
+
+ "
+```
+
+The model sees both results at once and should process all of them before responding. This is a single user message with two `TextBlock`s, not two separate messages.
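The batching rule can be approximated as a pure function over completion timestamps. This sketch assumes a fixed window opened by the first settlement in each batch; whether the real implementation extends the window on later settlements is not specified here. `Settlement` and `batchBySettleWindow` are hypothetical names.

```typescript
// Hypothetical model of settle-window batching, using timestamps instead of timers.
interface Settlement {
  toolUseId: string;
  settledAt: number; // completion time in milliseconds
}

// Group settlements so that everything finishing within `settleWindowMs`
// of a batch's first settlement lands in that batch (one user message).
function batchBySettleWindow(settlements: readonly Settlement[], settleWindowMs: number): Settlement[][] {
  const sorted = settlements.slice().sort((a, b) => a.settledAt - b.settledAt);
  const batches: Settlement[][] = [];
  let current: Settlement[] = [];
  let windowEnd = 0;
  for (const settlement of sorted) {
    if (current.length > 0 && settlement.settledAt <= windowEnd) {
      current.push(settlement);
    } else {
      current = [settlement];
      batches.push(current);
      windowEnd = settlement.settledAt + settleWindowMs;
    }
  }
  return batches;
}
```

Each inner array corresponds to one injected user message, with one `TextBlock` per settlement.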
+
+#### Non-Deterministic Ordering
+
+Results arrive in completion order, not dispatch order. The `toolUseId` is the correlation key.
+
+```
+ Dispatch order: searchWeb (tu-1), analyzeData (tu-2), fetchDocs (tu-3)
+ Completion order: analyzeData (tu-2), fetchDocs (tu-3), searchWeb (tu-1)
+
+ First injection:
+ [user] text: "[Background Task Result]
+ tool: analyzeData
+ toolUseId: tu-2
+ status: success
+ ..."
+
+ Second injection:
+ [user] text: "[Background Task Result]
+ tool: fetchDocs
+ toolUseId: tu-3
+ status: success
+ ..."
+
+ Third injection:
+ [user] text: "[Background Task Result]
+ tool: searchWeb
+ toolUseId: tu-1
+ status: success
+ ..."
+```
+
+The model must not assume results arrive in dispatch order. `toolUseId` correlates each result to its original `tool_use` block.
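On the developer side, correlation by `toolUseId` amounts to a keyed lookup rather than a queue. The sketch below uses a hypothetical `PendingCalls` class, assumed for illustration, to show why arrival order does not matter.

```typescript
// Hypothetical bookkeeping for dispatched background calls.
interface DispatchedCall {
  toolUseId: string;
  tool: string;
}

class PendingCalls {
  private readonly pending = new Map<string, DispatchedCall>();

  dispatch(call: DispatchedCall): void {
    this.pending.set(call.toolUseId, call);
  }

  // Returns the matching call, or undefined for unknown or already-settled ids.
  settle(toolUseId: string): DispatchedCall | undefined {
    const call = this.pending.get(toolUseId);
    this.pending.delete(toolUseId);
    return call;
  }

  count(): number {
    return this.pending.size;
  }
}
```

Settling `tu-2` before `tu-1` works the same as any other order, because nothing in the lookup depends on dispatch position.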
+
+#### Error Result
+
+A tool fails — the model sees `status: error` with the error text.
+
+```
+ [user] text: "[Background Task Result]
+ tool: searchWeb
+ toolUseId: tu-1
+ status: error
+
+ Connection timeout after 30000ms"
+```
+
+The model can retry by calling the same tool again (it's still in `backgroundTools`), or proceed without the result. See [Retry](#retry) for the three-layer retry story.
+
+#### Cancellation Result
+
+A task was intentionally stopped — by TTL, developer `cancel()`, or the model's `cancel_background_task` tool.
+
+```
+ [user] text: "[Background Task Result]
+ tool: analyzeData
+ toolUseId: tu-2
+ status: cancelled
+
+ Task cancelled (TTL exceeded)"
+```
+
+The `cancelled` status signals the task was intentionally stopped, not that it failed. Retrying is usually not appropriate — the same deadline or cancellation reason likely applies.
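For tests or logging, the injected text can be parsed back into a structure, and the error versus cancelled distinction turned into a retry hint. This is a sketch that assumes the layout shown in the traces; `parseBackgroundResult` and `isRetryCandidate` are hypothetical helpers, not SDK functions.

```typescript
// Hypothetical parser for the injected "[Background Task Result]" text.
type ParsedStatus = "success" | "error" | "cancelled";

interface ParsedResult {
  tool: string;
  toolUseId: string;
  status: ParsedStatus;
  body: string;
}

function toStatus(value: string | undefined): ParsedStatus | undefined {
  switch (value) {
    case "success":
    case "error":
    case "cancelled":
      return value;
    default:
      return undefined;
  }
}

function parseBackgroundResult(text: string): ParsedResult | undefined {
  const lines = text.split("\n").map((line) => line.trim());
  if (lines[0] !== "[Background Task Result]") return undefined;
  const blank = lines.indexOf("", 1); // the header ends at the first blank line
  const headerLines = lines.slice(1, blank === -1 ? lines.length : blank);
  const fields: Record<string, string> = {};
  for (const line of headerLines) {
    const sep = line.indexOf(":");
    if (sep > 0) fields[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
  }
  const tool: string | undefined = fields["tool"];
  const toolUseId: string | undefined = fields["toolUseId"];
  const status = toStatus(fields["status"]);
  if (tool === undefined || toolUseId === undefined || status === undefined) return undefined;
  const body = blank === -1 ? "" : lines.slice(blank + 1).join("\n");
  return { tool, toolUseId, status, body };
}

// Errors may be worth retrying; cancellations usually are not.
function isRetryCandidate(result: ParsedResult): boolean {
  return result.status === "error";
}
```

The retry hint only encodes the guidance in the text above; the model or developer still decides whether a retry makes sense.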
+
+#### Model-Driven Cancellation
+
+The model calls `cancel_background_task` to stop a specific task, then receives the cancellation notification when the task settles.
+
+```
+Turn N:
+ [assistant] text: "The earlier search result already answered this. I'll cancel the analysis."
+ tool_use: { toolUseId: "tu-5", name: "cancel_background_task",
+ input: { toolUseId: "tu-2" } }
+
+ [user] tool_result: { toolUseId: "tu-5", status: "success",
+ content: "Cancelled task for 'analyzeData' (toolUseId: tu-2)" }
+
+ ← cancelled task settles, SDK injects notification →
+
+ [user] text: "[Background Task Result]
+ tool: analyzeData
+ toolUseId: tu-2
+ status: cancelled
+
+ Task cancelled"
+```
+
+The `cancel_background_task` tool result confirms the cancellation. The `[Background Task Result]` notification arrives separately when the cancelled fork settles.
+
+#### What the Model Does NOT See
+
+The model's view is deliberately simplified to: dispatch → ACK → result. It never sees:
+
+- **Internal task IDs** — the `BackgroundTask.id` (UUID) used by `TaskManager` internally. The model uses `toolUseId` for correlation.
+- **Task lifecycle transitions** — `inProgress → success/error/cancelled` state changes happen inside the SDK.
+- **Heartbeat events** — `BackgroundTaskPendingEvent` fires on the developer's hook/streaming interface, not in the conversation.
+- **Fork internals** — the forked agent's conversation, intermediate tool calls, and reasoning steps. Only the final result crosses to the parent.
+- **Settle windows and batching mechanics** — whether results were batched or injected individually is invisible to the model.
+
+
+
+
+Appendix I: Development Plan
+
+#### v1 (this design)
+
+- `backgroundTools` config, `fork()`, `invokeBackground()`, `BackgroundTask`, `TaskManager`
+- Three decision points (A/B/C) with settlement mechanics (heartbeat, settle window, all-ack wait)
+- System prompt augmentation with `toolUseId` correlation
+- `cancel_background_task` and `list_background_tasks` auto-registered tools
+- Concurrency limits with queue backpressure (`maxConcurrentTasks`)
+- Three-layer retry: tool-level (hooks), task-level (event flag), model-driven
+- Context management: empty forks, reactive compaction, structured output on sub-agents
+- Hook propagation with `propagate` flag
+- Events: `BackgroundTaskDispatchEvent`, `BackgroundTaskResultEvent`, `BackgroundTaskPendingEvent`
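The concurrency-limit item in the list above can be sketched as a slot counter with a FIFO queue. The ACK strings are the two defined earlier in this document; the `DispatchQueue` class and its method names are assumptions for illustration and say nothing about how the SDK implements `maxConcurrentTasks`.

```typescript
// Hypothetical slot counter with FIFO backpressure.
type AckText = "Background task dispatched" | "Background task queued";

class DispatchQueue {
  private running = 0;
  private readonly waiting: string[] = [];
  private readonly maxConcurrentTasks: number;

  constructor(maxConcurrentTasks: number) {
    this.maxConcurrentTasks = maxConcurrentTasks;
  }

  // Returns the ACK the model would see for this tool call.
  submit(toolUseId: string): AckText {
    if (this.running < this.maxConcurrentTasks) {
      this.running += 1;
      return "Background task dispatched";
    }
    this.waiting.push(toolUseId);
    return "Background task queued";
  }

  // Call when a running task settles. If a task is waiting, it takes over
  // the freed slot and its id is returned; otherwise the slot is released.
  release(): string | undefined {
    const next = this.waiting.shift();
    if (next === undefined) {
      this.running = Math.max(0, this.running - 1);
    }
    return next;
  }
}
```

Either way the model receives an immediate `tool_result`, so queueing stays invisible to the conversation apart from the ACK wording.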
+
+#### Post-v1
+
+- **Phase 2:** Persistent TaskManager backend — storage-backed tasks, crash recovery, true fire-and-forget
+- **Phase 3:** Persistent forks with message channels — interactive background agents
+- **Phase 4:** Model-driven self-forking via `fork_self` tool
+- Dynamic dispatch — model overrides foreground/background per-call via developer allowlist
+- Per-tool TTL, task priority, `resultSchema` on `asTool()`
+- Context utilization metrics on events
+- Python SDK parity implementation
+
+
diff --git a/designs/0009-current-agent-loop.drawio b/designs/0009-current-agent-loop.drawio
new file mode 100644
index 000000000..1cecce6f9
--- /dev/null
+++ b/designs/0009-current-agent-loop.drawio
@@ -0,0 +1,45 @@
diff --git a/designs/0009-current-agent-loop.png b/designs/0009-current-agent-loop.png
new file mode 100644
index 000000000..230fce316
Binary files /dev/null and b/designs/0009-current-agent-loop.png differ
diff --git a/designs/0009-demos/background-tasks/index.ts b/designs/0009-demos/background-tasks/index.ts
new file mode 100644
index 000000000..eeaaaaea4
--- /dev/null
+++ b/designs/0009-demos/background-tasks/index.ts
@@ -0,0 +1,90 @@
+/**
+ * Getting Started — backgroundTools
+ *
+ * The simplest demo of background task scheduling. A research assistant searches
+ * 3 sources (web, docs, news). Each search takes 5 seconds.
+ *
+ * Standard: 3 searches run sequentially → ~15s of tool time
+ * Background: 3 searches run concurrently → ~5s of tool time
+ */
+
+import { Agent, BedrockModel, tool } from '@strands-agents/sdk'
+import { z } from 'zod'
+
+const modelId = process.argv.find((_, i) => process.argv[i - 1] === '--model')
+const model = new BedrockModel({ ...(modelId && { modelId }), region: 'us-east-1' })
+
+const searchWeb = tool({
+ name: 'search_web',
+ description: 'Search the web for recent information on a topic.',
+ inputSchema: z.object({ query: z.string() }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, 5000))
+ return `Web results for "${input.query}": Found 3 relevant articles on recent developments in this area, including industry analysis and expert commentary.`
+ },
+})
+
+const searchDocs = tool({
+ name: 'search_docs',
+ description: 'Search technical documentation and reference materials.',
+ inputSchema: z.object({ query: z.string() }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, 5000))
+ return `Documentation results for "${input.query}": Found official guides, API references, and best-practice recommendations from authoritative sources.`
+ },
+})
+
+const searchNews = tool({
+ name: 'search_news',
+ description: 'Search recent news articles and press releases.',
+ inputSchema: z.object({ query: z.string() }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, 5000))
+ return `News results for "${input.query}": Found 5 recent articles covering market trends, product launches, and analyst perspectives.`
+ },
+})
+
+const systemPrompt =
+ 'You are a research assistant. When asked a question, search all 3 sources ' +
+ '(web, docs, news) and write a brief summary of your findings.'
+
+const prompt = 'What are the latest trends in AI agent frameworks?'
+
+// ── Standard ────────────────────────────────────────────────────────────────
+
+console.log('--- STANDARD (sequential) ---\n')
+const standardAgent = new Agent({ model, systemPrompt, tools: [searchWeb, searchDocs, searchNews] })
+const standardStart = Date.now()
+const standardResult = await standardAgent.invoke(prompt)
+const standardMs = Date.now() - standardStart
+
+// ── Background ──────────────────────────────────────────────────────────────
+
+console.log('\n--- BACKGROUND (concurrent) ---\n')
+const backgroundAgent = new Agent({ model, systemPrompt, backgroundTools: [searchWeb, searchDocs, searchNews] })
+const backgroundStart = Date.now()
+const backgroundResult = await backgroundAgent.invoke(prompt)
+const backgroundMs = Date.now() - backgroundStart
+
+// ── Results ─────────────────────────────────────────────────────────────────
+
+const stdUsage = standardResult.metrics?.accumulatedUsage
+const bgUsage = backgroundResult.metrics?.accumulatedUsage
+
+const pad = (s: string, n: number) => s.padEnd(n)
+
+console.log('\n' + '='.repeat(60))
+console.log(' RESULTS')
+console.log('='.repeat(60))
+console.log()
+console.log(` ${pad('Metric', 24)} ${pad('Standard', 16)} Background`)
+console.log(` ${'-'.repeat(24)} ${'-'.repeat(16)} ${'-'.repeat(16)}`)
+console.log(` ${pad('Wall clock', 24)} ${pad((standardMs / 1000).toFixed(1) + 's', 16)} ${(backgroundMs / 1000).toFixed(1)}s`)
+console.log(` ${pad('Speedup', 24)} ${pad('baseline', 16)} ${(standardMs / backgroundMs).toFixed(2)}x`)
+console.log(` ${pad('Input tokens', 24)} ${pad(String(stdUsage?.inputTokens ?? 'N/A'), 16)} ${bgUsage?.inputTokens ?? 'N/A'}`)
+console.log(` ${pad('Output tokens', 24)} ${pad(String(stdUsage?.outputTokens ?? 'N/A'), 16)} ${bgUsage?.outputTokens ?? 'N/A'}`)
+console.log(` ${pad('Total tokens', 24)} ${pad(String(stdUsage?.totalTokens ?? 'N/A'), 16)} ${bgUsage?.totalTokens ?? 'N/A'}`)
+console.log(` ${pad('Agent cycles', 24)} ${pad(String(standardResult.metrics?.cycleCount ?? 'N/A'), 16)} ${backgroundResult.metrics?.cycleCount ?? 'N/A'}`)
+console.log(` ${pad('Output length (chars)', 24)} ${pad(String(standardResult.toString().length), 16)} ${backgroundResult.toString().length}`)
+console.log()
+console.log('='.repeat(60))
diff --git a/designs/0009-demos/benchmarks/cases/due-diligence.ts b/designs/0009-demos/benchmarks/cases/due-diligence.ts
new file mode 100644
index 000000000..766ddaed8
--- /dev/null
+++ b/designs/0009-demos/benchmarks/cases/due-diligence.ts
@@ -0,0 +1,265 @@
+/**
+ * Due Diligence Benchmark
+ *
+ * Pattern: 2-layer dependent pipeline (fetch → analyze).
+ *
+ * Layer 1: 5 independent data-gathering tools (5s each base)
+ * Layer 2: 5 dependent analysis tools (8s each base), each consuming Layer 1 data
+ *
+ * All tools are stubs with fixed delays and canned outputs.
+ *
+ * Standard: 10 sequential tool calls (5×5s + 5×8s = 65s)
+ * Background: Layer 1 concurrent (5s) + Layer 2 concurrent (8s) = 13s
+ * Expected speedup: ~2.5-3.5x wall clock (tool time alone is 65s / 13s = 5x; model calls make up the rest)
+ */
+
+import { Agent, tool } from '@strands-agents/sdk'
+import type { Model } from '@strands-agents/sdk'
+import { z } from 'zod'
+import type { BenchmarkCase } from '../framework/types.js'
+
+// ── Tool stubs ───────────────────────────────────────────────────────────────
+
+function makeTools(delayMultiplier: number) {
+ // Simulates API calls to financial data providers (~5s)
+ const fetchDelay = 5000
+ // Simulates running analysis models / algorithms (~8s)
+ const analyzeDelay = 8000
+
+ // Layer 1: Data gathering (independent)
+
+ const fetchFinancials = tool({
+ name: 'fetch_financials',
+ description: 'Retrieve target company financial statements and key metrics.',
+ inputSchema: z.object({ company: z.string() }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, fetchDelay * delayMultiplier))
+ return `Financial data for ${input.company}: Revenue $2.4B (+18% YoY), gross margin 72%, operating margin 28%, free cash flow $340M, debt-to-equity 0.45, current ratio 2.1. Cash reserves $890M. R&D spend 22% of revenue.`
+ },
+ })
+
+ const fetchMarketPosition = tool({
+ name: 'fetch_market_position',
+ description: 'Retrieve competitive landscape and market positioning data.',
+ inputSchema: z.object({ company: z.string() }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, fetchDelay * delayMultiplier))
+ return `Market data for ${input.company}: #3 player in $18B market (13.3% share). Market growing 24% CAGR. Top competitor holds 31%. Target gaining share fastest (+3.2pp last year). Strong mid-market, weak enterprise. NPS 62 vs industry avg 45.`
+ },
+ })
+
+ const fetchLegalRecords = tool({
+ name: 'fetch_legal_records',
+ description: 'Retrieve legal filings, litigation history, and regulatory records.',
+ inputSchema: z.object({ company: z.string() }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, fetchDelay * delayMultiplier))
+ return `Legal records for ${input.company}: 2 active patent disputes (non-material, <$5M combined). 1 resolved SEC inquiry (no action). No pending class actions. Clean GDPR/CCPA record. 3 minor contract disputes in arbitration.`
+ },
+ })
+
+ const fetchPatentPortfolio = tool({
+ name: 'fetch_patent_portfolio',
+ description: 'Retrieve patent and IP portfolio data.',
+ inputSchema: z.object({ company: z.string() }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, fetchDelay * delayMultiplier))
+ return `Patent portfolio for ${input.company}: 147 granted patents, 38 pending. Core patents in ML inference optimization (12), data pipeline architecture (8), edge computing (15). 23 patents cited >50 times. Portfolio valued at $180-220M.`
+ },
+ })
+
+ const fetchCustomerSentiment = tool({
+ name: 'fetch_customer_sentiment',
+ description: 'Retrieve customer satisfaction, retention, and sentiment data.',
+ inputSchema: z.object({ company: z.string() }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, fetchDelay * delayMultiplier))
+ return `Customer sentiment for ${input.company}: NPS 62, CSAT 4.3/5. Annual churn 8% (industry avg 15%). Net revenue retention 124%. Top complaints: pricing complexity (34%), onboarding friction (28%). G2 rating 4.6/5.`
+ },
+ })
+
+ // Layer 2: Analysis (each depends on corresponding Layer 1 data)
+
+ const analyzeFinancialRisk = tool({
+ name: 'analyze_financial_risk',
+ description: 'Run financial risk model. Requires financial data from fetch_financials.',
+ inputSchema: z.object({
+ company: z.string(),
+ financial_data: z.string().describe('Raw financial data from fetch_financials'),
+ }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, analyzeDelay * delayMultiplier))
+ return `FINANCIAL RISK — ${input.company}: Overall risk LOW (2.1/10). Strong cash position covers 2.6 years. Debt manageable with 5.8x interest coverage. Revenue concentration moderate: top 10 customers = 34%. Valuation range: $4.2-5.8B.`
+ },
+ })
+
+ const analyzeCompetitiveThreat = tool({
+ name: 'analyze_competitive_threat',
+ description: 'Assess competitive threats. Requires market data from fetch_market_position.',
+ inputSchema: z.object({
+ company: z.string(),
+ market_data: z.string().describe('Raw market data from fetch_market_position'),
+ }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, analyzeDelay * delayMultiplier))
+ return `COMPETITIVE THREAT — ${input.company}: Primary threat: market leader launching competing product Q3. Moat MODERATE — strong product but switching costs lower than peers. Technical differentiation durable (18-24 month lead). Enterprise penetration is critical growth vector.`
+ },
+ })
+
+ const analyzeLegalLiability = tool({
+ name: 'analyze_legal_liability',
+ description: 'Forecast legal liability. Requires legal records from fetch_legal_records.',
+ inputSchema: z.object({
+ company: z.string(),
+ legal_data: z.string().describe('Raw legal records from fetch_legal_records'),
+ }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, analyzeDelay * delayMultiplier))
+ return `LEGAL LIABILITY — ${input.company}: Total estimated exposure $3-8M (immaterial). Patent disputes defensive, low injunction risk. Clean regulatory record reduces compliance risk. Recommend standard IP indemnification. No deal-blocking issues.`
+ },
+ })
+
+ const analyzeIPValue = tool({
+ name: 'analyze_ip_value',
+ description: 'Perform IP valuation. Requires patent data from fetch_patent_portfolio.',
+ inputSchema: z.object({
+ company: z.string(),
+ patent_data: z.string().describe('Raw patent data from fetch_patent_portfolio'),
+ }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, analyzeDelay * delayMultiplier))
+ return `IP VALUATION — ${input.company}: Portfolio fair value $195M. Strategic premium for ML inference patents +$40-60M (scarcity). 12 core ML patents = 68% of value. Freedom-to-operate risk LOW. Portfolio is significant strategic asset.`
+ },
+ })
+
+ const analyzeCustomerLTV = tool({
+ name: 'analyze_customer_ltv',
+ description: 'Model customer lifetime value. Requires sentiment data from fetch_customer_sentiment.',
+ inputSchema: z.object({
+ company: z.string(),
+ sentiment_data: z.string().describe('Raw customer sentiment from fetch_customer_sentiment'),
+ }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, analyzeDelay * delayMultiplier))
+ return `CUSTOMER LTV — ${input.company}: Avg LTV $287K (enterprise), $18K (mid-market). Payback 11 months. 124% net revenue retention indicates strong expansion. Churn concentrated in first 90 days — fixable with onboarding investment. 3-year customer base value $1.8B.`
+ },
+ })
+
+ return [
+ fetchFinancials,
+ fetchMarketPosition,
+ fetchLegalRecords,
+ fetchPatentPortfolio,
+ fetchCustomerSentiment,
+ analyzeFinancialRisk,
+ analyzeCompetitiveThreat,
+ analyzeLegalLiability,
+ analyzeIPValue,
+ analyzeCustomerLTV,
+ ]
+}
+
+// ── Structured output schema ─────────────────────────────────────────────────
+
+const dueDiligenceSchema = z.object({
+ target_company: z.string().max(100).describe('Name of the acquisition target'),
+ findings: z
+ .array(
+ z.object({
+ domain: z
+ .enum(['financials', 'market_position', 'legal', 'ip_portfolio', 'customer_sentiment'])
+ .describe('Analysis domain'),
+ risk_level: z.enum(['LOW', 'MEDIUM', 'HIGH', 'CRITICAL']).describe('Risk rating'),
+ headline: z.string().max(200).describe('One-line finding summary'),
+ detail: z.string().max(400).describe('Supporting detail'),
+ }),
+ )
+ .length(5)
+ .describe('Exactly 5 findings, one per analysis domain'),
+ financial_summary: z.object({
+ revenue: z.string().max(100).describe('Revenue figure with growth'),
+ valuation_range: z.string().max(100).describe('Estimated valuation range'),
+ risk_score: z.string().max(80).describe('Overall financial risk score'),
+ }),
+ recommendation: z.enum(['GO', 'NO-GO']).describe('Final acquisition recommendation'),
+ rationale: z.string().max(600).describe('Brief rationale for the recommendation'),
+})
+
+// ── Prompts ──────────────────────────────────────────────────────────────────
+
+const systemPrompt =
+ 'You are a senior M&A analyst conducting acquisition due diligence. You have access to ' +
+ 'data-gathering tools (fetch_*) and analysis tools (analyze_*). ' +
+ 'The analysis tools require data from the corresponding fetch tools as input.\n\n' +
+ 'PROCESS:\n' +
+ '1. Gather ALL data: call fetch_financials, fetch_market_position, fetch_legal_records, ' +
+ 'fetch_patent_portfolio, and fetch_customer_sentiment.\n' +
+ '2. Run ALL analyses: call analyze_financial_risk, analyze_competitive_threat, ' +
+ 'analyze_legal_liability, analyze_ip_value, and analyze_customer_ltv with the fetched data.\n' +
+ '3. Write an acquisition memo with the following structure:\n' +
+ ' - One finding per analysis domain (financials, market, legal, IP, customer) with risk level and detail\n' +
+ ' - Financial summary with revenue, valuation range, and risk score\n' +
+ ' - GO or NO-GO recommendation with rationale\n\n' +
+ 'You MUST call all 10 tools before writing the memo.\n' +
+ 'Include specific numbers and data points from the tool results — do not paraphrase generically.'
+
+const userPrompt =
+ 'Conduct full acquisition due diligence on Nextera Analytics. ' +
+ 'Gather all data (5 fetch tools), run all analysis models (5 analyze tools), ' +
+ 'then write the acquisition memo with findings, financial summary, and GO/NO-GO recommendation.'
+
+// ── Case definition ──────────────────────────────────────────────────────────
+
+export const dueDiligence: BenchmarkCase = {
+ name: 'due-diligence',
+ description: 'M&A due diligence pipeline (10 tool stubs, 2 layers with dependencies)',
+
+ prompt: userPrompt,
+
+ createStandardAgent(delayMultiplier: number, model: Model): Agent {
+ return new Agent({
+ model,
+ systemPrompt,
+ tools: makeTools(delayMultiplier),
+ structuredOutputSchema: dueDiligenceSchema,
+ printer: false,
+ })
+ },
+
+ createBackgroundAgent(delayMultiplier: number, model: Model): Agent {
+ return new Agent({
+ model,
+ systemPrompt,
+ backgroundTools: makeTools(delayMultiplier),
+ structuredOutputSchema: dueDiligenceSchema,
+ printer: false,
+ })
+ },
+
+ outputValidation: {
+ requireStructuredOutput: true,
+ requiredContent: [
+ 'Nextera Analytics',
+ 'risk', // from analyze_financial_risk (model synthesizes valuation figures in varying formats)
+ '$195M', // from analyze_ip_value
+ '$287K', // from analyze_customer_ltv
+ ],
+ minLength: 200,
+ },
+
+ trajectoryValidation: {
+ requiredTools: [
+ 'fetch_financials',
+ 'fetch_market_position',
+ 'fetch_legal_records',
+ 'fetch_patent_portfolio',
+ 'fetch_customer_sentiment',
+ 'analyze_financial_risk',
+ 'analyze_competitive_threat',
+ 'analyze_legal_liability',
+ 'analyze_ip_value',
+ 'analyze_customer_ltv',
+ ],
+ minToolCalls: 10,
+ },
+}
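
Reviewer note: the `outputValidation` block above lists substrings the final memo must contain plus a minimum length. The framework's real checker is not part of this hunk, so the following is only a minimal sketch of how such rules might be evaluated. It assumes case-insensitive substring matching; `OutputRules` and `checkOutput` are hypothetical names, not SDK or framework APIs.

```typescript
// Hypothetical sketch: the benchmark framework's actual validator is not shown in this diff.
interface OutputRules {
  requiredContent?: string[]
  minLength?: number
}

// Returns a list of human-readable failures; an empty list means the output passed.
function checkOutput(text: string, rules: OutputRules): string[] {
  const failures: string[] = []
  // Assumption: matching is case-insensitive substring search.
  const haystack = text.toLowerCase()
  for (const needle of rules.requiredContent ?? []) {
    if (!haystack.includes(needle.toLowerCase())) {
      failures.push(`missing required content: ${needle}`)
    }
  }
  if (rules.minLength !== undefined && text.length < rules.minLength) {
    failures.push(`output shorter than ${rules.minLength} characters`)
  }
  return failures
}
```

If the real validator is case-sensitive instead, required strings would have to match the casing used in the canned tool outputs.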
diff --git a/designs/0009-demos/benchmarks/cases/incident-response.ts b/designs/0009-demos/benchmarks/cases/incident-response.ts
new file mode 100644
index 000000000..ac3ac43e2
--- /dev/null
+++ b/designs/0009-demos/benchmarks/cases/incident-response.ts
@@ -0,0 +1,395 @@
+/**
+ * Incident Response Benchmark
+ *
+ * Pattern: 4-layer deep pipeline with high fan-out.
+ *
+ * Layer 1 — Triage: 2 tools (5s each — SIEM queries)
+ * Layer 2 — Investigation: 6 tools (8s each — deep log analysis)
+ * Layer 3 — Correlation: 4 tools (10s each — cross-referencing data sources)
+ * Layer 4 — Response: 4 tools (5s each — infrastructure API calls)
+ *
+ * All tools are stubs with fixed delays and canned outputs.
+ *
+ * Standard: 16 sequential calls (2×5 + 6×8 + 4×10 + 4×5 = 118s)
+ * Background: 4 concurrent layers (5 + 8 + 10 + 5 = 28s)
+ * Expected speedup: ~3-4x
+ */
+
+import { Agent, tool } from '@strands-agents/sdk'
+import type { Model } from '@strands-agents/sdk'
+import { z } from 'zod'
+import type { BenchmarkCase } from '../framework/types.js'
+
+// ── Tool stubs ───────────────────────────────────────────────────────────────
+
+function makeTools(delayMultiplier: number) {
+ // Realistic delays per layer:
+ const triageDelay = 5000 // SIEM / alerting system queries (~5s)
+ const investigateDelay = 8000 // Deep log analysis, scanning (~8s)
+ const correlateDelay = 10000 // Cross-referencing multiple data sources (~10s)
+ const respondDelay = 5000 // Infrastructure API calls (~5s)
+
+ // ── Layer 1: Triage ──────────────────────────────────────────────────────
+
+ const detectAnomaly = tool({
+ name: 'detect_anomaly',
+ description: 'Run anomaly detection on recent system activity to identify the initial threat vector.',
+ inputSchema: z.object({ timeframe: z.string().describe('Time range to analyze, e.g. "last 4 hours"') }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, triageDelay * delayMultiplier))
+ return 'ANOMALY DETECTED: Unusual outbound data transfer spike at 02:14 UTC from subnet 10.0.3.0/24. Volume: 4.2GB over 47 minutes to external IP 203.0.113.42. Pattern consistent with staged exfiltration. Affected systems: app-server-07, app-server-12, db-replica-03.'
+ },
+ })
+
+ const collectInitialIndicators = tool({
+ name: 'collect_initial_indicators',
+ description: 'Collect initial indicators of compromise (IOCs) from available telemetry.',
+ inputSchema: z.object({ timeframe: z.string().describe('Time range to collect IOCs from') }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, triageDelay * delayMultiplier))
+ return 'IOC COLLECTION: External IP 203.0.113.42 (known C2 infrastructure, Lazarus Group attribution). Suspicious binary hash: a3f2b8c1... (matches known RAT variant). Compromised credentials: svc-deploy@corp (service account). Initial entry vector: phishing email to eng-team@corp at 2024-01-14 23:48 UTC.'
+ },
+ })
+
+ // ── Layer 2: Investigation ─────────────────────────────────────────────
+
+ const analyzeNetworkLogs = tool({
+ name: 'analyze_network_logs',
+ description: 'Deep analysis of network flow logs for lateral movement and data exfiltration patterns.',
+ inputSchema: z.object({
+ indicators: z.string().describe('IOCs and anomaly data from triage phase'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, investigateDelay * delayMultiplier))
+ return 'NETWORK ANALYSIS: Lateral movement detected via SMB from app-server-07 to 4 additional hosts between 01:30-02:10 UTC. DNS tunneling to 203.0.113.42 confirmed (encoded queries averaging 180 bytes). Exfiltration used HTTPS to blend with normal traffic. Firewall rules bypassed via allowed port 443.'
+ },
+ })
+
+ const analyzeAccessLogs = tool({
+ name: 'analyze_access_logs',
+ description: 'Analyze authentication and access logs for unauthorized access patterns.',
+ inputSchema: z.object({
+ indicators: z.string().describe('IOCs and anomaly data from triage phase'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, investigateDelay * delayMultiplier))
+ return 'ACCESS ANALYSIS: svc-deploy account authenticated from unusual source (10.0.3.15 instead of CI/CD subnet). 23 privilege escalation events in 12 minutes. Accessed production secrets vault at 01:52 UTC. 3 additional service accounts compromised via credential harvesting: svc-backup, svc-monitor, svc-analytics.'
+ },
+ })
+
+ const analyzeEndpointTelemetry = tool({
+ name: 'analyze_endpoint_telemetry',
+ description: 'Analyze endpoint detection and response (EDR) telemetry from affected systems.',
+ inputSchema: z.object({
+ indicators: z.string().describe('IOCs and anomaly data from triage phase'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, investigateDelay * delayMultiplier))
+ return 'ENDPOINT ANALYSIS: Malicious process chain on app-server-07: outlook.exe → powershell.exe → svchost.exe (masqueraded). Memory-resident payload detected — no disk artifacts. Persistence via scheduled task "WindowsHealthCheck" running every 15 minutes. Anti-forensic: event logs cleared on 3 hosts.'
+ },
+ })
+
+ const checkThreatIntel = tool({
+ name: 'check_threat_intel',
+ description: 'Cross-reference IOCs against threat intelligence databases.',
+ inputSchema: z.object({
+ indicators: z.string().describe('IOCs from triage phase'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, investigateDelay * delayMultiplier))
+ return 'THREAT INTEL: IP 203.0.113.42 linked to Lazarus Group infrastructure (confidence: HIGH, 7 reports since 2023). RAT variant matches "DreamJob" campaign TTPs. Similar attack pattern seen in 3 financial sector breaches in last 90 days. MITRE ATT&CK: T1566.001, T1059.001, T1071.001, T1048.003.'
+ },
+ })
+
+ const scanAffectedSystems = tool({
+ name: 'scan_affected_systems',
+ description: 'Run vulnerability and malware scans on systems identified in triage.',
+ inputSchema: z.object({
+ systems: z.string().describe('List of affected systems from triage'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, investigateDelay * delayMultiplier))
+ return 'SYSTEM SCAN: 6 systems confirmed compromised: app-server-07 (patient zero), app-server-12, db-replica-03, jump-host-02, ci-runner-05, monitoring-01. Unpatched CVE-2024-1234 on 4 hosts (initial exploitation vector). Webshell found on app-server-12:/tmp/.cache/health.php. Rootkit detected on jump-host-02.'
+ },
+ })
+
+ const retrieveUserActivity = tool({
+ name: 'retrieve_user_activity',
+ description: 'Retrieve and analyze user activity timelines for compromised accounts.',
+ inputSchema: z.object({
+ accounts: z.string().describe('Compromised accounts identified in triage'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, investigateDelay * delayMultiplier))
+ return 'USER ACTIVITY: svc-deploy accessed 142 repos in 8 minutes (vs normal 3/hour). svc-backup triggered 47 snapshot exports. svc-analytics ran 12 bulk data queries against customer PII tables. Human user j.chen@corp phishing victim — email opened at 23:48 UTC, macro executed at 23:49 UTC. No other human accounts compromised.'
+ },
+ })
+
+ // ── Layer 3: Correlation ───────────────────────────────────────────────
+
+ const correlateNetworkAccess = tool({
+ name: 'correlate_network_access',
+ description: 'Correlate network and access log findings to map the attack path.',
+ inputSchema: z.object({
+ network_findings: z.string().describe('Network analysis results'),
+ access_findings: z.string().describe('Access log analysis results'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, correlateDelay * delayMultiplier))
+ return 'CORRELATION — Network+Access: Attack path reconstructed: phishing → j.chen workstation → svc-deploy credential theft via LSASS dump → lateral movement via SMB to app-server-07 → privilege escalation → secrets vault access → spread to 5 additional systems. Total dwell time: ~2.5 hours before exfiltration began.'
+ },
+ })
+
+ const correlateThreatIndicators = tool({
+ name: 'correlate_threat_indicators',
+ description: 'Correlate threat intel with investigation findings for attribution.',
+ inputSchema: z.object({
+ threat_intel: z.string().describe('Threat intelligence findings'),
+ endpoint_findings: z.string().describe('Endpoint telemetry findings'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, correlateDelay * delayMultiplier))
+ return 'CORRELATION — Threat+Endpoint: HIGH confidence attribution to Lazarus Group. TTP overlap: 89% match with DreamJob campaign. Custom RAT variant confirms state-sponsored tooling. Target profile (financial data, customer PII) consistent with previous Lazarus operations. Estimated attacker sophistication: ADVANCED.'
+ },
+ })
+
+ const correlateEndpointBehavior = tool({
+ name: 'correlate_endpoint_behavior',
+ description: 'Correlate endpoint behavior across compromised systems for pattern analysis.',
+ inputSchema: z.object({
+ endpoint_findings: z.string().describe('Endpoint telemetry findings'),
+ system_scan: z.string().describe('System scan results'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, correlateDelay * delayMultiplier))
+ return 'CORRELATION — Endpoint Behavior: Consistent malware deployment pattern across all 6 hosts. Persistence mechanism identical (scheduled task). Anti-forensic cleanup ran on 3 of 6 hosts (partial). Memory-only payload prevents traditional AV detection. Webshell on app-server-12 served as backup C2 channel.'
+ },
+ })
+
+ const buildAttackTimeline = tool({
+ name: 'build_attack_timeline',
+ description: 'Construct a comprehensive attack timeline from all investigation data.',
+ inputSchema: z.object({
+ user_activity: z.string().describe('User activity timeline'),
+ network_findings: z.string().describe('Network analysis results'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, correlateDelay * delayMultiplier))
+ return 'ATTACK TIMELINE: 23:48 — Phishing email opened (j.chen). 23:49 — Macro executed, RAT deployed. 00:15 — Credential harvesting (svc-deploy). 00:30 — Lateral movement begins. 01:30 — 6 systems compromised. 01:52 — Secrets vault accessed. 02:00 — Data staging begins. 02:14 — Exfiltration starts (4.2GB over 47 min). 03:01 — Anomaly detection triggered alert.'
+ },
+ })
+
+ // ── Layer 4: Response ──────────────────────────────────────────────────────
+
+ const isolateCompromisedSystems = tool({
+ name: 'isolate_compromised_systems',
+ description: 'Network-isolate all compromised systems to contain the breach.',
+ inputSchema: z.object({
+ systems: z.string().describe('List of compromised systems to isolate'),
+ correlation_data: z.string().describe('Correlation findings confirming scope'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, respondDelay * delayMultiplier))
+ return 'ISOLATION COMPLETE: 6 systems moved to quarantine VLAN. Network ACLs blocking all external traffic for subnet 10.0.3.0/24. C2 IP 203.0.113.42 added to firewall blocklist. DNS sinkhole configured for known C2 domains. Monitoring confirms zero outbound connections from quarantined systems.'
+ },
+ })
+
+ const revokeCredentials = tool({
+ name: 'revoke_credentials',
+ description: 'Revoke and rotate all compromised credentials.',
+ inputSchema: z.object({
+ accounts: z.string().describe('Compromised accounts to revoke'),
+ correlation_data: z.string().describe('Correlation data confirming compromised scope'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, respondDelay * delayMultiplier))
+ return 'CREDENTIAL REVOCATION: 4 service accounts disabled and rotated (svc-deploy, svc-backup, svc-monitor, svc-analytics). j.chen account locked, password reset forced. All active sessions terminated. Secrets vault keys rotated. API tokens regenerated for affected services. MFA enforcement verified for all accounts.'
+ },
+ })
+
+ const deployPatches = tool({
+ name: 'deploy_patches',
+ description: 'Deploy emergency patches for exploited vulnerabilities.',
+ inputSchema: z.object({
+ vulnerabilities: z.string().describe('CVEs and vulnerabilities to patch'),
+ correlation_data: z.string().describe('Correlation data confirming affected systems'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, respondDelay * delayMultiplier))
+ return 'PATCH DEPLOYMENT: CVE-2024-1234 patched on all 47 production hosts (not just the 6 compromised). Emergency OS patches applied. Scheduled tasks "WindowsHealthCheck" removed from all systems. Webshell removed from app-server-12. EDR signatures updated with new IOCs. Full system reimage scheduled for 6 compromised hosts.'
+ },
+ })
+
+ const generateIncidentReport = tool({
+ name: 'generate_incident_report',
+ description: 'Generate the formal incident report for executive and regulatory communication.',
+ inputSchema: z.object({
+ timeline: z.string().describe('Attack timeline'),
+ response_actions: z.string().describe('Summary of response actions taken'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, respondDelay * delayMultiplier))
+ return 'INCIDENT REPORT GENERATED: INC-2024-0042. Classification: Data Breach — Customer PII potentially exfiltrated. Regulatory notifications required: GDPR (72hr), state breach notification laws. Estimated affected records: 142,000 customer profiles. Insurance carrier notified. External forensics firm engaged for independent validation.'
+ },
+ })
+
+ return [
+ // Layer 1
+ detectAnomaly,
+ collectInitialIndicators,
+ // Layer 2
+ analyzeNetworkLogs,
+ analyzeAccessLogs,
+ analyzeEndpointTelemetry,
+ checkThreatIntel,
+ scanAffectedSystems,
+ retrieveUserActivity,
+ // Layer 3
+ correlateNetworkAccess,
+ correlateThreatIndicators,
+ correlateEndpointBehavior,
+ buildAttackTimeline,
+ // Layer 4
+ isolateCompromisedSystems,
+ revokeCredentials,
+ deployPatches,
+ generateIncidentReport,
+ ]
+}
+
+// ── Structured output schema ─────────────────────────────────────────────────
+
+const incidentReportSchema = z.object({
+ incident_id: z.string().describe('Incident identifier'),
+ severity: z.enum(['LOW', 'MEDIUM', 'HIGH', 'CRITICAL']).describe('Incident severity'),
+ triage_summary: z.object({
+ anomaly_type: z.string().describe('Type of anomaly detected'),
+ initial_indicators: z.array(z.string()).min(2).describe('Key IOCs identified during triage'),
+ }),
+ investigation_findings: z
+ .array(
+ z.object({
+ source: z.string().describe('Investigation data source'),
+ finding: z.string().describe('Key finding from this source'),
+ }),
+ )
+ .min(4)
+ .describe('Findings from investigation tools'),
+ correlation_results: z
+ .array(
+ z.object({
+ analysis: z.string().describe('Correlation analysis name'),
+ conclusion: z.string().describe('Key conclusion'),
+ }),
+ )
+ .min(2)
+ .describe('Results from correlation analyses'),
+ response_actions: z
+ .array(
+ z.object({
+ action: z.string().describe('Response action taken'),
+ status: z.enum(['completed', 'in_progress', 'failed']).describe('Action status'),
+ }),
+ )
+ .min(2)
+ .describe('Response actions executed'),
+ executive_summary: z.string().describe('Executive summary of the incident'),
+})
+
+// ── Prompts ──────────────────────────────────────────────────────────────────
+
+const systemPrompt =
+ 'You are a senior security incident responder. You have access to triage, investigation, ' +
+ 'correlation, and response tools.\n\n' +
+ 'PROTOCOL (follow this exact order):\n' +
+ '1. TRIAGE: Call detect_anomaly and collect_initial_indicators.\n' +
+ '2. INVESTIGATE: Using triage results, call ALL 6 investigation tools:\n' +
+ ' analyze_network_logs, analyze_access_logs, analyze_endpoint_telemetry,\n' +
+ ' check_threat_intel, scan_affected_systems, retrieve_user_activity.\n' +
+ '3. CORRELATE: Using investigation results, call ALL 4 correlation tools:\n' +
+ ' correlate_network_access, correlate_threat_indicators,\n' +
+ ' correlate_endpoint_behavior, build_attack_timeline.\n' +
+ '4. RESPOND: Using correlation results, call ALL 4 response tools:\n' +
+ ' isolate_compromised_systems, revoke_credentials,\n' +
+ ' deploy_patches, generate_incident_report.\n' +
+ '5. Write an incident report with the following structure:\n' +
+ ' - Incident ID and severity\n' +
+ ' - Triage summary: anomaly type, key IOCs\n' +
+ ' - Investigation findings: one finding per source\n' +
+ ' - Correlation results: conclusions from each correlation analysis\n' +
+ ' - Response actions: what was done and status\n' +
+ ' - Executive summary\n\n' +
+ 'You MUST call all 16 tools before writing the report.\n' +
+ 'Include specific data points from tool results — do not paraphrase generically.'
+
+const userPrompt =
+ 'ALERT: Anomaly detection flagged unusual outbound data transfer from production subnet. ' +
+ 'Execute the full incident response protocol: triage, investigate, correlate, respond. ' +
+ 'Then write a detailed incident report with specific findings from each phase.'
+
+// ── Case definition ──────────────────────────────────────────────────────────
+
+export const incidentResponse: BenchmarkCase = {
+ name: 'incident-response',
+ description: 'Security incident response pipeline (16 tool stubs, 4 layers with dependencies)',
+
+ prompt: userPrompt,
+
+ createStandardAgent(delayMultiplier: number, model: Model): Agent {
+ return new Agent({
+ model,
+ systemPrompt,
+ tools: makeTools(delayMultiplier),
+ structuredOutputSchema: incidentReportSchema,
+ printer: false,
+ })
+ },
+
+ createBackgroundAgent(delayMultiplier: number, model: Model): Agent {
+ return new Agent({
+ model,
+ systemPrompt,
+ backgroundTools: makeTools(delayMultiplier),
+ structuredOutputSchema: incidentReportSchema,
+ printer: false,
+ })
+ },
+
+ outputValidation: {
+ requireStructuredOutput: true,
+ requiredContent: [
+ 'incident',
+ '203.0.113.42', // C2 IP first reported in Layer 1 triage (repeated by later layers)
+ 'lateral movement', // from Layer 2 network analysis and Layer 3 attack-path correlation
+ 'Lazarus', // attribution: Layer 1 IOCs, Layer 2 threat intel, Layer 3 correlation
+ 'quarantine', // from Layer 4 isolation response
+ ],
+ minLength: 200,
+ },
+
+ trajectoryValidation: {
+ requiredTools: [
+ // Layer 1
+ 'detect_anomaly',
+ 'collect_initial_indicators',
+ // Layer 2
+ 'analyze_network_logs',
+ 'analyze_access_logs',
+ 'analyze_endpoint_telemetry',
+ 'check_threat_intel',
+ 'scan_affected_systems',
+ 'retrieve_user_activity',
+ // Layer 3
+ 'correlate_network_access',
+ 'correlate_threat_indicators',
+ 'correlate_endpoint_behavior',
+ 'build_attack_timeline',
+ // Layer 4
+ 'isolate_compromised_systems',
+ 'revoke_credentials',
+ 'deploy_patches',
+ 'generate_incident_report',
+ ],
+ minToolCalls: 16,
+ },
+}
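
Reviewer note: the timing estimate in the incident-response header comment (118s sequential vs. 28s with concurrent layers) can be reproduced with a small calculation. This is a sketch under the same assumptions as the comment: fixed per-layer delays, full overlap within a layer, and no model latency. `Layer` and `estimateLayeredTimes` are hypothetical names used only for illustration.

```typescript
// Illustrative only: these names are hypothetical, not part of the SDK or the benchmark framework.
interface Layer {
  name: string
  toolCount: number
  delaySec: number
}

function estimateLayeredTimes(layers: Layer[]) {
  // Standard mode: every tool call runs one after another.
  const sequentialSec = layers.reduce((sum, l) => sum + l.toolCount * l.delaySec, 0)
  // Background mode: tools inside a layer overlap, but layers still run in order.
  const concurrentSec = layers.reduce((sum, l) => sum + l.delaySec, 0)
  return { sequentialSec, concurrentSec, speedup: sequentialSec / concurrentSec }
}

const incidentLayers: Layer[] = [
  { name: 'triage', toolCount: 2, delaySec: 5 },
  { name: 'investigation', toolCount: 6, delaySec: 8 },
  { name: 'correlation', toolCount: 4, delaySec: 10 },
  { name: 'response', toolCount: 4, delaySec: 5 },
]

// sequentialSec is 118 and concurrentSec is 28 for the delays above.
const estimate = estimateLayeredTimes(incidentLayers)
```

The tool-time ratio alone is about 4.2x. The header's "~3-4x" expectation is lower, presumably because model calls add latency to both modes.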
diff --git a/designs/0009-demos/benchmarks/cases/index.ts b/designs/0009-demos/benchmarks/cases/index.ts
new file mode 100644
index 000000000..eb4225421
--- /dev/null
+++ b/designs/0009-demos/benchmarks/cases/index.ts
@@ -0,0 +1,9 @@
+import { sessionEnrichment } from './session-enrichment.js'
+import { probeDispatch } from './probe-dispatch.js'
+import { mixedDispatch } from './mixed-dispatch.js'
+import { dueDiligence } from './due-diligence.js'
+import { incidentResponse } from './incident-response.js'
+import type { BenchmarkCase } from '../framework/types.js'
+
+/** All benchmark cases, ordered simplest → most complex. */
+export const allCases: BenchmarkCase[] = [sessionEnrichment, probeDispatch, mixedDispatch, dueDiligence, incidentResponse]
diff --git a/designs/0009-demos/benchmarks/cases/mixed-dispatch.ts b/designs/0009-demos/benchmarks/cases/mixed-dispatch.ts
new file mode 100644
index 000000000..866c00efb
--- /dev/null
+++ b/designs/0009-demos/benchmarks/cases/mixed-dispatch.ts
@@ -0,0 +1,190 @@
+/**
+ * Mixed Dispatch Benchmark
+ *
+ * Pattern: Slow background tasks dispatched up front while the model does
+ * multi-round dependent foreground work in the meantime.
+ *
+ * A financial analyst agent evaluates a company for investment. Two slow
+ * external verifications — credit check (20s) and regulatory screening (15s) —
+ * fire off to the background in round 1. The model then does sequential
+ * foreground work where each step depends on the previous:
+ *
+ * Round 1: fetch_company_profile (fg, 1s) + submit bg tasks
+ * Round 2: analyze_financials (fg, 3s) — needs the company profile
+ * Round 3: assess_market_position (fg, 2s) — needs the financial analysis
+ * Round 4: bg results arrive, model compiles final investment report
+ *
+ * The model is NEVER idle — it's always reasoning with foreground results
+ * while the background verifications run. In standard mode, the 35s of
+ * external verification blocks everything.
+ *
+ * Standard: all sequential → 20s + 15s + 1s + 3s + 2s = ~41s tool time
+ * Background: bg max(20s,15s) overlaps fg rounds (1s + 3s + 2s = 6s) → ~20s effective
+ * Expected speedup: ~1.5-2x
+ */
+
+import { Agent, tool } from '@strands-agents/sdk'
+import type { Model } from '@strands-agents/sdk'
+import { z } from 'zod'
+import type { BenchmarkCase } from '../framework/types.js'
+
+// ── Canned data ─────────────────────────────────────────────────────────────
+
+const companyProfile = [
+ 'Acme Corp (ACME) — Founded 2011, HQ: Austin, TX.',
+ 'Revenue: $2.4B (FY2024), up 18% YoY. Operating margin: 22%.',
+ 'Employees: 8,200. Market cap: $14.2B.',
+ 'Primary segments: Enterprise SaaS (68%), Professional Services (32%).',
+ 'Key customers: 340 enterprise accounts, 97% net revenue retention.',
+].join(' ')
+
+const financialAnalysis = [
+ 'Financial Health: STRONG.',
+ 'Current ratio: 2.8 (excellent liquidity). Debt-to-equity: 0.35 (conservative leverage).',
+ 'Free cash flow: $480M (20% FCF margin). R&D spend: 24% of revenue.',
+ 'Revenue growth accelerating: 14% → 16% → 18% over last 3 years.',
+ 'Gross margin expanding: 71% → 73% → 75%. Path to 30% operating margin visible.',
+].join(' ')
+
+const marketAssessment = [
+ 'Market Position: LEADER in mid-market enterprise SaaS.',
+ 'TAM: $45B growing 12% annually. ACME share: ~5.3%, up from 4.1% two years ago.',
+ 'Competitive moat: proprietary data platform + high switching costs (avg contract: 3.2 years).',
+ 'Key risk: AWS and Azure building competing offerings, but ACME holds an 18-month feature lead.',
+ 'Analyst consensus: 8 Buy, 3 Hold, 0 Sell. Average price target: $185 (current: $162).',
+].join(' ')
+
+// ── Tool stubs ───────────────────────────────────────────────────────────────
+
+function makeForegroundTools(delayMultiplier: number) {
+ const fetchProfile = tool({
+ name: 'fetch_company_profile',
+ description: 'Fetches the company profile including revenue, headcount, and business segments. Call this FIRST — other analysis tools need this data.',
+ inputSchema: z.object({
+ ticker: z.string().describe('Stock ticker symbol'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, 1000 * delayMultiplier))
+ return companyProfile
+ },
+ })
+
+ const analyzeFinancials = tool({
+ name: 'analyze_financials',
+ description: 'Analyzes financial health from the company profile data. Requires fetch_company_profile to have been called first.',
+ inputSchema: z.object({
+ ticker: z.string().describe('Stock ticker symbol'),
+ revenue: z.string().describe('Revenue figure from the company profile'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, 3000 * delayMultiplier))
+ return financialAnalysis
+ },
+ })
+
+ const assessMarket = tool({
+ name: 'assess_market_position',
+ description: 'Assesses competitive position and market share. Requires analyze_financials to have been called first.',
+ inputSchema: z.object({
+ ticker: z.string().describe('Stock ticker symbol'),
+ financial_health: z.string().describe('Financial health rating from analyze_financials'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, 2000 * delayMultiplier))
+ return marketAssessment
+ },
+ })
+
+ return [fetchProfile, analyzeFinancials, assessMarket]
+}
+
+function makeBackgroundTools(delayMultiplier: number) {
+ const creditCheck = tool({
+ name: 'submit_credit_check',
+ description: 'Submits a credit verification request to external rating agencies. SLOW — takes 15-20 seconds as it contacts Moody\'s, S&P, and Fitch APIs sequentially. Call this IMMEDIATELY so it starts processing while you do other analysis.',
+ inputSchema: z.object({
+ ticker: z.string().describe('Stock ticker symbol'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, 20000 * delayMultiplier))
+ return 'Credit Rating: A- (S&P), A3 (Moody\'s), A- (Fitch). Outlook: Stable across all agencies. Credit default swap spread: 45bps (low risk). No credit events in past 5 years.'
+ },
+ })
+
+ const regulatoryScreen = tool({
+ name: 'submit_regulatory_screening',
+ description: 'Runs regulatory compliance screening against SEC, DOJ, and international databases. SLOW — takes 10-15 seconds. Call this IMMEDIATELY so it starts processing while you do other analysis.',
+ inputSchema: z.object({
+ ticker: z.string().describe('Stock ticker symbol'),
+ company_name: z.string().describe('Full company name'),
+ }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, 15000 * delayMultiplier))
+ return 'Regulatory Screening: CLEAR. No active SEC investigations. No DOJ actions. FCPA compliance: verified. SOX compliance: current. No sanctions matches. Last audit: clean (Deloitte, Q3 2024).'
+ },
+ })
+
+ return [creditCheck, regulatoryScreen]
+}
+
+// ── Prompts ──────────────────────────────────────────────────────────────────
+
+const systemPrompt =
+ 'You are a senior financial analyst evaluating companies for investment.\n\n' +
+ 'CRITICAL WORKFLOW — follow this exact order:\n' +
+ '1. IMMEDIATELY call submit_credit_check and submit_regulatory_screening — these are SLOW external calls. Start them FIRST.\n' +
+ '2. In the SAME tool call block, also call fetch_company_profile to get the basic company data.\n' +
+ '3. Once you have the profile, call analyze_financials (it needs the profile data).\n' +
+ '4. Once you have the financial analysis, call assess_market_position (it needs the financial health rating).\n' +
+ '5. Once ALL results are in (including the background credit check and regulatory screening), compile a complete investment report.\n\n' +
+ 'The foreground tools have dependencies — each step feeds the next. The background tools are independent external verifications. ' +
+ 'Do NOT wait for credit/regulatory results before starting your analysis chain.'
+
+const userPrompt =
+ 'Evaluate Acme Corp (ticker: ACME) for a potential $50M growth equity investment. ' +
+ 'Run the full analysis: company profile, financial analysis, market assessment, credit check, and regulatory screening. ' +
+ 'Produce a comprehensive investment report with a Buy/Hold/Pass recommendation.'
+
+// ── Case definition ──────────────────────────────────────────────────────────
+
+export const mixedDispatch: BenchmarkCase = {
+ name: 'mixed-dispatch',
+ description: 'Investment analysis: 2 slow background verifications + 3 dependent foreground analysis steps',
+
+ prompt: userPrompt,
+
+ createStandardAgent(delayMultiplier: number, model: Model): Agent {
+ return new Agent({
+ model,
+ systemPrompt,
+ tools: [...makeBackgroundTools(delayMultiplier), ...makeForegroundTools(delayMultiplier)],
+ printer: false,
+ })
+ },
+
+ createBackgroundAgent(delayMultiplier: number, model: Model): Agent {
+ return new Agent({
+ model,
+ systemPrompt,
+ tools: makeForegroundTools(delayMultiplier),
+ backgroundTools: makeBackgroundTools(delayMultiplier),
+ printer: false,
+ })
+ },
+
+ outputValidation: {
+ requiredContent: ['credit', 'regulatory', 'revenue'],
+ minLength: 100,
+ },
+
+ trajectoryValidation: {
+ requiredTools: [
+ 'fetch_company_profile',
+ 'analyze_financials',
+ 'assess_market_position',
+ 'submit_credit_check',
+ 'submit_regulatory_screening',
+ ],
+ minToolCalls: 5,
+ },
+}
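
Reviewer note: the mixed-dispatch header comment estimates ~41s of tool time in standard mode and ~20s in background mode. A minimal sketch of that arithmetic follows. It assumes the background tools all start in round 1 and run concurrently with each other, the foreground chain runs sequentially alongside them, and model latency is ignored. `estimateMixedDispatch` is a hypothetical helper, not an SDK API.

```typescript
// Illustrative only: a hypothetical helper mirroring the header comment's estimate.
function estimateMixedDispatch(backgroundSec: number[], foregroundSec: number[]) {
  const sum = (xs: number[]) => xs.reduce((a, b) => a + b, 0)
  // Standard mode: every tool blocks the loop, so all delays add up.
  const standardSec = sum(backgroundSec) + sum(foregroundSec)
  // Background mode: the slowest background task overlaps the sequential
  // foreground chain, so the longer of the two determines tool time.
  const backgroundModeSec = Math.max(Math.max(0, ...backgroundSec), sum(foregroundSec))
  return { standardSec, backgroundModeSec }
}

// Credit check (20s) and regulatory screening (15s) vs. the 1s + 3s + 2s foreground chain.
// standardSec is 41 and backgroundModeSec is 20 for these inputs.
const mixed = estimateMixedDispatch([20, 15], [1, 3, 2])
```

Under these assumptions the foreground chain (6s) is fully hidden behind the 20s credit check, which is why the benefit here comes from overlap rather than fan-out.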
diff --git a/designs/0009-demos/benchmarks/cases/probe-dispatch.ts b/designs/0009-demos/benchmarks/cases/probe-dispatch.ts
new file mode 100644
index 000000000..41f595cdf
--- /dev/null
+++ b/designs/0009-demos/benchmarks/cases/probe-dispatch.ts
@@ -0,0 +1,148 @@
+/**
+ * Probe Dispatch Benchmark
+ *
+ * Pattern: Basic concurrent tool dispatch (single layer).
+ *
+ * A NASA mission controller dispatches probes to 3 planets and researches each.
+ * All tools are stubs with fixed delays and canned outputs.
+ *
+ * Standard: 6 tool calls sequential (3×15s dispatch + 3×8s research = ~69s)
+ * Background: 6 tool calls concurrent (max = ~15s)
+ * Expected speedup: ~2-3x (model overhead adds to both)
+ */
+
+import { Agent, tool } from '@strands-agents/sdk'
+import type { Model } from '@strands-agents/sdk'
+import { z } from 'zod'
+import type { BenchmarkCase } from '../framework/types.js'
+
+// ── Canned planet data ───────────────────────────────────────────────────────
+
+const planetFacts: Record<string, string> = {
+ mars: 'Mars is the fourth planet from the Sun. Average temperature: -62C. Thin CO2 atmosphere. Evidence of ancient river valleys and polar ice caps. Olympus Mons is the tallest volcano in the solar system at 21.9 km.',
+ jupiter:
+ 'Jupiter is the largest planet in our solar system with a mass 318x Earth. The Great Red Spot is a storm larger than Earth, raging for 350+ years. Jupiter has 95 known moons including Ganymede, the largest moon in the solar system.',
+ neptune:
+ 'Neptune is the eighth and farthest planet from the Sun. Wind speeds reach 2,100 km/h, the fastest in the solar system. Has 16 known moons. Triton, its largest moon, orbits retrograde and is geologically active with nitrogen geysers.',
+}
+
+// ── Tool stubs ───────────────────────────────────────────────────────────────
+
+function makeTools(delayMultiplier: number) {
+ // Simulates a heavyweight external API call: command submission + telemetry confirmation (~15s)
+ const probeDelay = 15000
+
+ // Simulates a research search / knowledge retrieval (~8s)
+ const researchDelay = 8000
+
+ const dispatchProbe = tool({
+ name: 'dispatch_probe',
+ description: 'Dispatches an interplanetary space probe to the specified planet.',
+ inputSchema: z.object({
+ planet: z.string().describe('Target planet name'),
+ }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, probeDelay * delayMultiplier))
+ return `Probe successfully dispatched to ${input.planet}. All systems nominal, telemetry confirmed.`
+ },
+ })
+
+ const researchPlanet = tool({
+ name: 'research_planet',
+ description: 'Research and retrieve key facts about a planet.',
+ inputSchema: z.object({
+ planet: z.string().describe('Planet name to research'),
+ }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, researchDelay * delayMultiplier))
+ const key = input.planet.toLowerCase()
+ return planetFacts[key] ?? `No data available for ${input.planet}.`
+ },
+ })
+
+ return [dispatchProbe, researchPlanet]
+}
+
+// ── Structured output schema ─────────────────────────────────────────────────
+
+const missionReportSchema = z.object({
+ missions: z
+ .array(
+ z.object({
+ planet: z.enum(['Mars', 'Jupiter', 'Neptune']).describe('Target planet'),
+ probe_status: z.enum(['launched', 'failed', 'pending']).describe('Dispatch outcome'),
+ facts: z
+ .array(z.string().max(250).describe('A key fact about the planet'))
+ .min(2)
+ .max(3)
+ .describe('2-3 key facts about the planet'),
+ }),
+ )
+ .length(3)
+ .describe('Exactly 3 mission entries, one per planet'),
+ summary: z.string().max(500).describe('Overall mission status summary'),
+})
+
+// ── Prompts ──────────────────────────────────────────────────────────────────
+
+const systemPrompt =
+ 'You are a NASA deep space mission controller. You dispatch probes using dispatch_probe ' +
+ 'and gather intelligence using research_planet.\n\n' +
+ 'PROTOCOL:\n' +
+ '1. Dispatch probes to ALL requested planets (use dispatch_probe for each).\n' +
+ '2. Research each planet (use research_planet for each).\n' +
+ '3. Write a mission report with the following structure:\n' +
+ ' - For each planet: dispatch status, key facts from research (include specific data points)\n' +
+ ' - Overall mission summary\n\n' +
+ 'You MUST call all 6 tools (3 dispatches + 3 research) before writing the report.\n' +
+ 'Include specific facts and numbers from the research results — do not paraphrase generically.'
+
+const userPrompt =
+ 'Commander directive: dispatch probes to Mars, Jupiter, and Neptune immediately. ' +
+ 'Research key facts about each planet. ' +
+ 'Write a detailed mission report covering dispatch status and specific planet facts from your research.'
+
+// ── Case definition ──────────────────────────────────────────────────────────
+
+export const probeDispatch: BenchmarkCase = {
+ name: 'probe-dispatch',
+ description: 'NASA probe dispatch to 3 planets (2 tool stubs, 6 tool calls, 1 layer)',
+
+ prompt: userPrompt,
+
+ createStandardAgent(delayMultiplier: number, model: Model): Agent {
+ return new Agent({
+ model,
+ systemPrompt,
+ tools: makeTools(delayMultiplier),
+ structuredOutputSchema: missionReportSchema,
+ printer: false,
+ })
+ },
+
+ createBackgroundAgent(delayMultiplier: number, model: Model): Agent {
+ return new Agent({
+ model,
+ systemPrompt,
+ backgroundTools: makeTools(delayMultiplier),
+ structuredOutputSchema: missionReportSchema,
+ printer: false,
+ })
+ },
+
+ outputValidation: {
+ requireStructuredOutput: true,
+ requiredContent: [
+ 'Mars', 'Jupiter', 'Neptune',
+ 'Olympus Mons', // from Mars research tool output
+ 'Great Red Spot', // from Jupiter research tool output
+ 'nitrogen geysers', // from Neptune research tool output
+ ],
+ minLength: 200,
+ },
+
+ trajectoryValidation: {
+ requiredTools: ['dispatch_probe', 'research_planet'],
+ minToolCalls: 6,
+ },
+}
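Note on the header arithmetic in probe-dispatch.ts: the "~2-3x" expectation follows from adding model time to both sides of the ratio. The sketch below is a back-of-envelope model only. It assumes all six stub calls overlap fully in background mode, and `modelOverheadMs` is a hypothetical figure rather than a measurement; `estimateSpeedup` is an illustrative helper, not part of the benchmark framework.

```typescript
// Stub delays mirror probeDelay (15s) and researchDelay (8s), three calls each.
const stubDelaysMs: number[] = [15000, 15000, 15000, 8000, 8000, 8000]

// Hypothetical helper: ratio of sequential to concurrent wall clock,
// with the same model overhead added to both modes.
function estimateSpeedup(delaysMs: number[], modelOverheadMs: number): number {
  const sequentialMs = delaysMs.reduce((sum, d) => sum + d, 0) // 69000 for the stubs above
  const concurrentMs = Math.max(...delaysMs) // 15000: bounded by the slowest call
  return (sequentialMs + modelOverheadMs) / (concurrentMs + modelOverheadMs)
}

// Tool time alone: 69s / 15s.
const noOverhead = estimateSpeedup(stubDelaysMs, 0)
// With an assumed 12s of model time in each mode: 81s / 27s.
const withOverhead = estimateSpeedup(stubDelaysMs, 12000)
```

Under these assumptions the tool-only ratio is 4.6x, and it drops to 3x once 12s of model time is added to each mode. Larger overheads push the ratio further toward 1x.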
diff --git a/designs/0009-demos/benchmarks/cases/session-enrichment.ts b/designs/0009-demos/benchmarks/cases/session-enrichment.ts
new file mode 100644
index 000000000..1e9adbe8e
--- /dev/null
+++ b/designs/0009-demos/benchmarks/cases/session-enrichment.ts
@@ -0,0 +1,132 @@
+/**
+ * Session Enrichment Benchmark
+ *
+ * Pattern: Developer-driven background dispatch via invokeBackground().
+ *
+ * A support agent has an AfterInvocationEvent hook that runs two enrichment
+ * agents (summarizer + sentiment analyzer) after every response.
+ *
+ * Standard: hook awaits both enrichment agents sequentially — blocks before
+ * invoke() returns. Each turn pays model + enrichment overhead.
+ * Background: hook fires both enrichment agents via invokeBackground() — returns
+ * immediately. invoke() returns after the model's response only.
+ * Enrichment runs concurrently in the background.
+ *
+ * Same agent, same prompt, same enrichment. Only difference: await vs invokeBackground().
+ */
+
+import { Agent, tool, AfterInvocationEvent } from '@strands-agents/sdk'
+import type { Model } from '@strands-agents/sdk'
+import { z } from 'zod'
+import type { BenchmarkCase } from '../framework/types.js'
+
+// ── Enrichment tool stubs (deterministic delays, canned output) ──────────────
+
+function makeEnrichmentAgents(model: Model, delayMultiplier: number) {
+ const summarizeTool = tool({
+ name: 'summarize',
+ description: 'Summarize the conversation.',
+ inputSchema: z.object({ text: z.string() }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, 3000 * delayMultiplier))
+ return 'Customer inquiring about order #78432 shipping delay; exploring refund options.'
+ },
+ })
+
+ const sentimentTool = tool({
+ name: 'analyze_sentiment',
+ description: 'Classify sentiment.',
+ inputSchema: z.object({ text: z.string() }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, 2000 * delayMultiplier))
+ return 'NEGATIVE — customer frustrated with shipping delay.'
+ },
+ })
+
+ const summarizer = new Agent({
+ model,
+ name: 'summarizer',
+ tools: [summarizeTool],
+ systemPrompt: 'Call the summarize tool with the provided text. Return only the tool result.',
+ printer: false,
+ })
+
+ const sentimentAnalyzer = new Agent({
+ model,
+ name: 'sentiment_analyzer',
+ tools: [sentimentTool],
+ systemPrompt: 'Call the analyze_sentiment tool with the provided text. Return only the tool result.',
+ printer: false,
+ })
+
+ return { summarizer, sentimentAnalyzer }
+}
+
+// ── Prompts ──────────────────────────────────────────────────────────────────
+
+const customerMessages = [
+ "Hi, I placed an order 3 days ago (order #78432) and it still hasn't shipped. When will it arrive?",
+ "That's really frustrating — I needed it by Friday for a birthday gift. Can you expedite the shipping?",
+ "Friday is tomorrow. There's no way to make it in time?",
+ 'Fine. Can you at least cancel the order and issue a full refund?',
+ 'How long does the refund take, and will I get a return label if the package shows up anyway?',
+]
+
+const supportPrompt =
+ 'You are a helpful and empathetic customer support agent for an e-commerce company. ' +
+ 'Respond in exactly 2 sentences: acknowledge the concern, then state the next action. Keep under 80 words.'
+
+// ── Case definition ──────────────────────────────────────────────────────────
+
+export const sessionEnrichment: BenchmarkCase = {
+ name: 'session-enrichment',
+ description: 'Multi-turn support with enrichment: await vs invokeBackground() in AfterInvocationEvent hook',
+
+ prompt: customerMessages,
+
+ createStandardAgent(delayMultiplier: number, model: Model): Agent {
+ const { summarizer, sentimentAnalyzer } = makeEnrichmentAgents(model, delayMultiplier)
+
+ const agent = new Agent({
+ model,
+ systemPrompt: supportPrompt,
+ printer: false,
+ })
+
+ agent.addHook(AfterInvocationEvent, async (event) => {
+ const text = JSON.stringify(event.agent.messages.map((m) => m.toJSON()))
+ await summarizer.invoke(text)
+ await sentimentAnalyzer.invoke(text)
+ })
+
+ return agent
+ },
+
+ createBackgroundAgent(delayMultiplier: number, model: Model): Agent {
+ const { summarizer, sentimentAnalyzer } = makeEnrichmentAgents(model, delayMultiplier)
+
+ const agent = new Agent({
+ model,
+ systemPrompt: supportPrompt,
+ printer: false,
+ })
+
+ agent.addHook(AfterInvocationEvent, (event) => {
+ const text = JSON.stringify(event.agent.messages.map((m) => m.toJSON()))
+ summarizer.invokeBackground(text)
+ sentimentAnalyzer.invokeBackground(text)
+ })
+
+ return agent
+ },
+
+ outputValidation: {
+ requiredContent: ['refund'],
+ minLength: 50,
+ },
+
+ trajectoryValidation: {
+ requiredTools: [],
+ minToolCalls: 0,
+ },
+}
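Note on session-enrichment.ts: the difference between the two hooks can be reproduced with plain Promises. The sketch below does not call the SDK. `fakeEnrichment` is a hypothetical stand-in for an enrichment agent, and the delays are scaled down to milliseconds.

```typescript
// Ordered record of what happened, used to observe when each task starts and ends.
const enrichmentLog: string[] = []

// Hypothetical stand-in for summarizer.invoke() / sentimentAnalyzer.invoke().
async function fakeEnrichment(label: string, delayMs: number): Promise<void> {
  enrichmentLog.push(`${label}:start`)
  await new Promise<void>((resolve) => setTimeout(resolve, delayMs))
  enrichmentLog.push(`${label}:done`)
}

// Blocking style: the caller resumes only after both tasks finish, one after the other.
async function blockingHook(): Promise<void> {
  await fakeEnrichment('summarize', 30)
  await fakeEnrichment('sentiment', 20)
}

// Background style: both tasks start immediately and the hook returns at once.
// Returning the promises lets the caller observe completion or failure later.
function backgroundHook(): Promise<void>[] {
  return [fakeEnrichment('summarize', 30), fakeEnrichment('sentiment', 20)]
}

const pendingEnrichment = backgroundHook()
// Both tasks have started, neither has finished.
const startedSynchronously = enrichmentLog.join(',')
// Swallow rejections here only because this is a demo; real code should log them.
void Promise.all(pendingEnrichment).catch(() => {})
```

In the background style the caller should still keep a handle on the work, since a rejected promise that nobody observes is easy to miss.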
diff --git a/designs/0009-demos/benchmarks/framework/models.ts b/designs/0009-demos/benchmarks/framework/models.ts
new file mode 100644
index 000000000..9fe70b656
--- /dev/null
+++ b/designs/0009-demos/benchmarks/framework/models.ts
@@ -0,0 +1,22 @@
+/**
+ * Model factory for benchmark runs.
+ *
+ * Maps CLI model names to Bedrock Model instances.
+ */
+
+import { BedrockModel } from '@strands-agents/sdk'
+import type { Model } from '@strands-agents/sdk'
+
+export function createModel(name: string): Model {
+ switch (name) {
+ case 'sonnet-4.6':
+ return new BedrockModel({ modelId: 'us.anthropic.claude-sonnet-4-6', region: 'us-east-1' })
+ case 'haiku-4.5':
+ return new BedrockModel({ modelId: 'us.anthropic.claude-haiku-4-5-20251001-v1:0', region: 'us-east-1' })
+ case 'nova-pro':
+ return new BedrockModel({ modelId: 'us.amazon.nova-pro-v1:0', region: 'us-east-1' })
+ default:
+ console.error(`Unknown model: "${name}". Available: sonnet-4.6, haiku-4.5, nova-pro`)
+ process.exit(1)
+ }
+}
diff --git a/designs/0009-demos/benchmarks/framework/report.ts b/designs/0009-demos/benchmarks/framework/report.ts
new file mode 100644
index 000000000..8ab379797
--- /dev/null
+++ b/designs/0009-demos/benchmarks/framework/report.ts
@@ -0,0 +1,214 @@
+import { writeFileSync, mkdirSync } from 'node:fs'
+import type { BenchmarkReport, CaseReport, Stats } from './types.js'
+
+// ── Statistics ───────────────────────────────────────────────────────────────
+
+export function computeStats(values: number[]): Stats {
+ if (values.length === 0) return { mean: 0, stddev: 0, min: 0, max: 0, p50: 0, p95: 0 }
+
+ const sorted = [...values].sort((a, b) => a - b)
+ const n = sorted.length
+ const mean = sorted.reduce((s, v) => s + v, 0) / n
+ const variance = sorted.reduce((s, v) => s + (v - mean) ** 2, 0) / n
+ const stddev = Math.sqrt(variance)
+
+ return {
+ mean,
+ stddev,
+ min: sorted[0]!,
+ max: sorted[n - 1]!,
+ p50: sorted[Math.floor(n * 0.5)]!,
+ p95: sorted[Math.min(Math.floor(n * 0.95), n - 1)]!,
+ }
+}
+
+// ── JSON Report ──────────────────────────────────────────────────────────────
+
+export function writeReport(report: BenchmarkReport, outputDir: string): string {
+ mkdirSync(outputDir, { recursive: true })
+
+ // Replace regex patterns with their source strings for JSON serialization
+ const serializable = JSON.parse(JSON.stringify(report, (_key, value) => {
+ if (value instanceof RegExp) return value.source
+ return value
+ }))
+
+ const filename = `report-${report.timestamp.replace(/[:.]/g, '-')}.json`
+ const filepath = `${outputDir}/${filename}`
+ writeFileSync(filepath, JSON.stringify(serializable, null, 2))
+ return filepath
+}
+
+// ── Console Summary ──────────────────────────────────────────────────────────
+
+function fmt(ms: number): string {
+ return (ms / 1000).toFixed(1) + 's'
+}
+
+function fmtStats(s: Stats, unit: 'ms' | 'x' | 'tokens'): string {
+ if (unit === 'ms') return `${fmt(s.mean)} ± ${fmt(s.stddev)}`
+ if (unit === 'x') return `${s.mean.toFixed(2)}x ± ${s.stddev.toFixed(2)}`
+ return `${Math.round(s.mean)} ± ${Math.round(s.stddev)}`
+}
+
+function pad(str: string, len: number): string {
+ return str.padEnd(len)
+}
+
+function printCaseReport(c: CaseReport): void {
+ const s = c.summary
+ const line = '\u2500'.repeat(70)
+
+ console.log(`\n\u2500\u2500 ${c.caseName} ${'\u2500'.repeat(Math.max(0, 66 - c.caseName.length))}`)
+ console.log()
+
+ // Per-run table
+ console.log(
+ ` ${pad('Run', 6)} ${pad('Standard', 12)} ${pad('Background', 12)} ${pad('Speedup', 10)} ${pad('Std Tok', 10)} ${pad('Bg Tok', 10)} ${pad('Std Msg', 8)} Bg Msg`,
+ )
+ console.log(` ${'\u2500'.repeat(6)} ${'\u2500'.repeat(12)} ${'\u2500'.repeat(12)} ${'\u2500'.repeat(10)} ${'\u2500'.repeat(10)} ${'\u2500'.repeat(10)} ${'\u2500'.repeat(8)} ${'\u2500'.repeat(8)}`)
+
+ for (const r of c.runs) {
+ console.log(
+ ` ${pad(String(r.runIndex + 1), 6)} ${pad(fmt(r.standard.wallClockMs), 12)} ${pad(fmt(r.background.wallClockMs), 12)} ${pad(r.speedup.toFixed(2) + 'x', 10)} ${pad(String(r.standard.outputTokens), 10)} ${pad(String(r.background.outputTokens), 10)} ${pad(String(r.standard.messageCount), 8)} ${r.background.messageCount}`,
+ )
+ }
+
+ console.log(` ${'\u2500'.repeat(6)} ${'\u2500'.repeat(12)} ${'\u2500'.repeat(12)} ${'\u2500'.repeat(10)} ${'\u2500'.repeat(10)} ${'\u2500'.repeat(10)} ${'\u2500'.repeat(8)} ${'\u2500'.repeat(8)}`)
+ console.log(
+ ` ${pad('Avg', 6)} ${pad(fmt(s.standardWallClockMs.mean), 12)} ${pad(fmt(s.backgroundWallClockMs.mean), 12)} ${pad(s.speedup.mean.toFixed(2) + 'x', 10)} ${pad(String(Math.round(s.standardOutputTokens.mean)), 10)} ${pad(String(Math.round(s.backgroundOutputTokens.mean)), 10)} ${pad(String(Math.round(s.standardMessageCount.mean)), 8)} ${Math.round(s.backgroundMessageCount.mean)}`,
+ )
+
+ // Time decomposition
+ console.log()
+ console.log(' Time Decomposition (avg):')
+ const toolSpeedupStr =
+ s.backgroundToolTimeMs.mean > 0
+ ? `${s.toolTimeSpeedup.mean.toFixed(2)}x`
+ : 'N/A (bg tools run in forks)'
+ console.log(
+ ` Tool execution: ${fmt(s.standardToolTimeMs.mean)} std \u2192 ${fmt(s.backgroundToolTimeMs.mean)} bg` +
+ ` (${toolSpeedupStr})`,
+ )
+
+ // Token usage
+ const outDeltaPct = s.outputTokenDeltaPct
+ const outDeltaOk = Math.abs(outDeltaPct) < 15
+ const inDeltaPct = s.inputTokenDeltaPct
+ console.log()
+ console.log(' Token Usage (avg):')
+ console.log(
+ ` Output: ${Math.round(s.standardOutputTokens.mean)} std vs ${Math.round(s.backgroundOutputTokens.mean)} bg` +
+ ` \u2014 \u0394 ${Math.abs(outDeltaPct).toFixed(1)}% ${outDeltaOk ? '\u2713' : '\u2717 WARNING: >15%'}`,
+ )
+ console.log(
+ ` Input: ${Math.round(s.standardInputTokens.mean)} std vs ${Math.round(s.backgroundInputTokens.mean)} bg` +
+ ` \u2014 \u0394 ${Math.abs(inDeltaPct).toFixed(1)}%`,
+ )
+
+ // Context size
+ console.log()
+ console.log(' Context (avg):')
+ console.log(
+ ` Messages: ${Math.round(s.standardMessageCount.mean)} std vs ${Math.round(s.backgroundMessageCount.mean)} bg`,
+ )
+ console.log(
+ ` Cycles: ${Math.round(s.standardCycleCount.mean)} std vs ${Math.round(s.backgroundCycleCount.mean)} bg`,
+ )
+
+ // Speedup stats
+ console.log()
+ console.log(
+ ` Speedup: ${fmtStats(s.speedup, 'x')} (p50: ${s.speedup.p50.toFixed(2)}x, p95: ${s.speedup.p95.toFixed(2)}x)`,
+ )
+
+ // Trajectory (from first run, both modes)
+ const stdTools = c.runs[0]?.standard.metrics.toolMetrics
+ const bgTools = c.runs[0]?.background.metrics.toolMetrics
+ if (stdTools && Object.keys(stdTools).length > 0) {
+ console.log()
+ console.log(' Trajectory:')
+ const allToolNames = new Set([...Object.keys(stdTools), ...Object.keys(bgTools ?? {})])
+ allToolNames.delete('strands_structured_output')
+ for (const name of allToolNames) {
+ const sc = stdTools[name]?.callCount ?? 0
+ const bc = bgTools?.[name]?.callCount ?? 0
+ console.log(` ${pad(name + ':', 40)} ${sc} (std) | ${bc} (bg)`)
+ }
+ }
+
+ // Validations
+ const totalValidations = c.runs.reduce(
+ (sum, r) => sum + r.standard.validationResults.length + r.background.validationResults.length,
+ 0,
+ )
+ const passedValidations = c.runs.reduce(
+ (sum, r) =>
+ sum +
+ r.standard.validationResults.filter((v) => v.passed).length +
+ r.background.validationResults.filter((v) => v.passed).length,
+ 0,
+ )
+ console.log()
+ console.log(` Validations: ${passedValidations}/${totalValidations} passed ${s.allValidationsPassed ? '\u2713' : '\u2717'}`)
+
+ // Print failed validations
+ if (!s.allValidationsPassed) {
+ for (const r of c.runs) {
+ for (const v of [...r.standard.validationResults, ...r.background.validationResults]) {
+ if (!v.passed) {
+ console.log(` \u2717 Run ${r.runIndex + 1}: ${v.name} — ${v.reason}`)
+ }
+ }
+ }
+ }
+}
+
+export function printSummary(report: BenchmarkReport): void {
+ const border = '\u2550'.repeat(72)
+
+ console.log(`\n${border}`)
+ console.log(` BENCHMARK REPORT \u2014 ${report.timestamp}`)
+ console.log(` Model: ${report.config.model} | Runs per case: ${report.config.runs} | Delay multiplier: ${report.config.delayMultiplier}x`)
+ console.log(border)
+
+ for (const c of report.cases) {
+ printCaseReport(c)
+ }
+
+ // Overall summary
+ console.log(`\n${border}`)
+ console.log(' OVERALL SUMMARY')
+ console.log(border)
+ console.log()
+ console.log(
+ ` ${pad('Case', 26)} ${pad('Avg Speedup', 14)} ${pad('\u03c3', 10)} ${pad('Validations', 14)} Token \u0394`,
+ )
+ console.log(` ${'\u2500'.repeat(26)} ${'\u2500'.repeat(14)} ${'\u2500'.repeat(10)} ${'\u2500'.repeat(14)} ${'\u2500'.repeat(10)}`)
+
+ for (const c of report.cases) {
+ const s = c.summary
+ const totalV = c.runs.reduce(
+ (sum, r) => sum + r.standard.validationResults.length + r.background.validationResults.length,
+ 0,
+ )
+ const passedV = c.runs.reduce(
+ (sum, r) =>
+ sum +
+ r.standard.validationResults.filter((v) => v.passed).length +
+ r.background.validationResults.filter((v) => v.passed).length,
+ 0,
+ )
+
+ console.log(
+ ` ${pad(c.caseName, 26)} ${pad(s.speedup.mean.toFixed(2) + 'x', 14)} ${pad('\u00b1' + s.speedup.stddev.toFixed(2), 10)} ${pad(`${s.allValidationsPassed ? '\u2713' : '\u2717'} ${passedV}/${totalV}`, 14)} ${Math.abs(s.outputTokenDeltaPct).toFixed(1)}%`,
+ )
+ }
+
+ console.log()
+ console.log(
+ ` Overall: ${report.overallSummary.avgSpeedupAcrossCases.toFixed(2)}x avg | ` +
+ `${report.overallSummary.allValidationsPassed ? '\u2713 All validations passed' : '\u2717 Some validations failed'}`,
+ )
+ console.log()
+}
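Note on computeStats in report.ts: the variance divides by `n` (population form), and `p50` is read from `sorted[Math.floor(n * 0.5)]`, which for an even number of runs is the upper of the two middle values. With the default of 3 runs, `p95` is simply the maximum. If small run counts matter, one possible alternative is sketched below. `sampleStddev` and `interpolatedPercentile` are hypothetical helpers, not part of the diff.

```typescript
// Sample standard deviation (divides by n - 1), often preferred for a handful of runs.
function sampleStddev(values: number[]): number {
  const n = values.length
  if (n < 2) return 0
  const mean = values.reduce((s, v) => s + v, 0) / n
  const sumSq = values.reduce((s, v) => s + (v - mean) ** 2, 0)
  return Math.sqrt(sumSq / (n - 1))
}

// Percentile with linear interpolation between the two nearest ranks.
function interpolatedPercentile(values: number[], q: number): number {
  if (values.length === 0) return 0
  const sorted = [...values].sort((a, b) => a - b)
  const pos = (sorted.length - 1) * q
  const lo = Math.floor(pos)
  const hi = Math.ceil(pos)
  const loVal = sorted[lo] ?? 0
  const hiVal = sorted[hi] ?? loVal
  return loVal + (hiVal - loVal) * (pos - lo)
}

// Two runs with speedups of 2.0x and 3.0x.
const exampleSpeedups = [2.0, 3.0]
const exampleMedian = interpolatedPercentile(exampleSpeedups, 0.5)
const exampleSpread = sampleStddev(exampleSpeedups)
```

For the two-run example the interpolated median is 2.5, whereas indexing at `Math.floor(n * 0.5)` would report 3.0.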
diff --git a/designs/0009-demos/benchmarks/framework/runner.ts b/designs/0009-demos/benchmarks/framework/runner.ts
new file mode 100644
index 000000000..f695592cb
--- /dev/null
+++ b/designs/0009-demos/benchmarks/framework/runner.ts
@@ -0,0 +1,288 @@
+import {
+ type AgentResult,
+ type Agent,
+ type Model,
+ BeforeToolCallEvent,
+ AfterToolCallEvent,
+ BackgroundTaskDispatchEvent,
+ BackgroundTaskResultEvent,
+} from '@strands-agents/sdk'
+import type {
+ BenchmarkCase,
+ BenchmarkConfig,
+ BenchmarkReport,
+ CaseReport,
+ ModeMetrics,
+ ModeResult,
+ RunResult,
+} from './types.js'
+import { validateOutput, validateTrajectory } from './validators.js'
+import { computeStats } from './report.js'
+
+// ── Metric extraction ────────────────────────────────────────────────────────
+
+function extractMetrics(result: AgentResult): ModeMetrics {
+ const m = result.metrics
+ return {
+ cycleCount: m?.cycleCount ?? 0,
+ totalDuration: m?.totalDuration ?? 0,
+ averageCycleTime: m?.averageCycleTime ?? 0,
+ accumulatedUsage: {
+ inputTokens: m?.accumulatedUsage?.inputTokens ?? 0,
+ outputTokens: m?.accumulatedUsage?.outputTokens ?? 0,
+ totalTokens: m?.accumulatedUsage?.totalTokens ?? 0,
+ cacheReadInputTokens: m?.accumulatedUsage?.cacheReadInputTokens,
+ cacheWriteInputTokens: m?.accumulatedUsage?.cacheWriteInputTokens,
+ },
+ modelLatencyMs: m?.accumulatedMetrics?.latencyMs ?? 0,
+ toolMetrics: m?.toolMetrics ?? {},
+ }
+}
+
+function totalToolTime(metrics: ModeMetrics): number {
+ return Object.values(metrics.toolMetrics).reduce((sum, t) => sum + t.totalTime, 0)
+}
+
+// ── Tool lifecycle logging ────────────────────────────────────────────────────
+
+function elapsed(start: number): string {
+ return `+${((Date.now() - start) / 1000).toFixed(1)}s`
+}
+
+function attachToolLogging(
+ agent: Agent,
+ mode: string,
+ runStart: () => number,
+ bgDispatches: Record<string, number>,
+): void {
+ // Standard tool calls: before/after
+ agent.addHook(BeforeToolCallEvent, (event) => {
+ const toolName = event.toolUse.name
+ if (toolName === 'strands_structured_output') return
+ console.log(` [${elapsed(runStart())}] ${mode} | ${toolName} started`)
+ })
+
+ agent.addHook(AfterToolCallEvent, (event) => {
+ const toolName = event.toolUse.name
+ if (toolName === 'strands_structured_output') return
+ console.log(` [${elapsed(runStart())}] ${mode} | ${toolName} finished`)
+ })
+
+ // Background dispatches and results — also track dispatch counts
+ agent.addHook(BackgroundTaskDispatchEvent, (event) => {
+ const toolName = event.toolUse.name
+ bgDispatches[toolName] = (bgDispatches[toolName] ?? 0) + 1
+ console.log(` [${elapsed(runStart())}] ${mode} | ${toolName} dispatched (bg)`)
+ })
+
+ agent.addHook(BackgroundTaskResultEvent, (event) => {
+ console.log(` [${elapsed(runStart())}] ${mode} | ${event.taskName} result arrived (bg)`)
+ })
+}
+
+// ── Single-mode execution ────────────────────────────────────────────────────
+
+async function runMode(
+ agent: Agent,
+ prompt: string | string[],
+ mode: 'standard' | 'background',
+ benchmarkCase: BenchmarkCase,
+): Promise<ModeResult> {
+ const start = Date.now()
+
+ // Track background dispatches for trajectory validation
+ // (background tools execute in forks — their metrics don't appear in primary agent's toolMetrics)
+ const bgDispatches: Record<string, number> = {}
+ attachToolLogging(agent, mode, () => start, bgDispatches)
+
+ let result: AgentResult
+ const allOutputTexts: string[] = []
+
+ if (Array.isArray(prompt)) {
+ // Multi-turn: invoke sequentially, collect all outputs
+ let lastResult: AgentResult | undefined
+ for (let t = 0; t < prompt.length; t++) {
+ const turnStart = Date.now()
+ lastResult = await agent.invoke(prompt[t]!)
+ const turnMs = Date.now() - turnStart
+ allOutputTexts.push(lastResult.toString())
+ process.stdout.write(` turn ${t + 1}/${prompt.length} (${(turnMs / 1000).toFixed(1)}s)\n`)
+ }
+ result = lastResult!
+ } else {
+ result = await agent.invoke(prompt)
+ allOutputTexts.push(result.toString())
+ }
+
+ const wallClockMs = Date.now() - start
+ const metrics = extractMetrics(result)
+
+ // For output validation: combine toString() text with structured output JSON
+ // (structured output has the actual data, toString() may be sparse)
+ const textParts = [...allOutputTexts]
+ if (result.structuredOutput) {
+ textParts.push(JSON.stringify(result.structuredOutput))
+ }
+ const outputText = textParts.join('\n')
+ const toolTimeMs = totalToolTime(metrics)
+
+ // Run validations
+ const outputValidations = validateOutput(outputText, benchmarkCase.outputValidation, result.structuredOutput)
+
+ // Merge toolMetrics with background dispatch counts.
+ // Background tools execute in forks so their metrics don't appear in the primary agent's
+ // toolMetrics. We track dispatches via hooks and merge them in.
+ const mergedToolMetrics = { ...metrics.toolMetrics }
+ for (const [name, count] of Object.entries(bgDispatches)) {
+ if (!mergedToolMetrics[name]) {
+ mergedToolMetrics[name] = { callCount: count, successCount: count, errorCount: 0, totalTime: 0 }
+ }
+ }
+ const trajectoryValidations = validateTrajectory(mergedToolMetrics, benchmarkCase.trajectoryValidation)
+ const validationResults = [...outputValidations, ...trajectoryValidations]
+
+ // Use merged tool metrics so background dispatch counts appear in the report
+ const metricsWithBgDispatches = { ...metrics, toolMetrics: mergedToolMetrics }
+
+ return {
+ mode,
+ wallClockMs,
+ metrics: metricsWithBgDispatches,
+ totalToolTimeMs: toolTimeMs,
+ totalModelLatencyMs: metrics.modelLatencyMs,
+ outputTokens: metrics.accumulatedUsage.outputTokens,
+ inputTokens: metrics.accumulatedUsage.inputTokens,
+ messageCount: agent.messages.length,
+ cycleCount: metrics.cycleCount,
+ outputText,
+ structuredOutput: result.structuredOutput,
+ stopReason: result.stopReason,
+ validationResults,
+ allValidationsPassed: validationResults.every((v) => v.passed),
+ }
+}
+
+// ── BenchmarkRunner ──────────────────────────────────────────────────────────
+
+export class BenchmarkRunner {
+ constructor(private readonly _config: BenchmarkConfig, private readonly _model: Model) {}
+
+ async runCase(benchmarkCase: BenchmarkCase): Promise<CaseReport> {
+ const runs: RunResult[] = []
+
+ for (let i = 0; i < this._config.runs; i++) {
+ console.log(`\n Run ${i + 1}/${this._config.runs}`)
+
+ // Fresh agents per run — no state bleed
+ console.log(' Standard...')
+ const standardAgent = benchmarkCase.createStandardAgent(this._config.delayMultiplier, this._model)
+ const standard = await runMode(standardAgent, benchmarkCase.prompt, 'standard', benchmarkCase)
+ console.log(` Standard: ${(standard.wallClockMs / 1000).toFixed(1)}s`)
+
+ console.log(' Background...')
+ const backgroundAgent = benchmarkCase.createBackgroundAgent(this._config.delayMultiplier, this._model)
+ const background = await runMode(backgroundAgent, benchmarkCase.prompt, 'background', benchmarkCase)
+ console.log(` Background: ${(background.wallClockMs / 1000).toFixed(1)}s`)
+
+ const speedup = standard.wallClockMs / background.wallClockMs
+
+ console.log(` Speedup: ${speedup.toFixed(2)}x`)
+
+ runs.push({ runIndex: i, standard, background, speedup })
+ }
+
+ // Compute summary statistics
+ const speedups = runs.map((r) => r.speedup)
+ const stdWallClocks = runs.map((r) => r.standard.wallClockMs)
+ const bgWallClocks = runs.map((r) => r.background.wallClockMs)
+ const stdOutputTokens = runs.map((r) => r.standard.outputTokens)
+ const bgOutputTokens = runs.map((r) => r.background.outputTokens)
+ const stdInputTokens = runs.map((r) => r.standard.inputTokens)
+ const bgInputTokens = runs.map((r) => r.background.inputTokens)
+ const stdMessageCounts = runs.map((r) => r.standard.messageCount)
+ const bgMessageCounts = runs.map((r) => r.background.messageCount)
+ const stdCycleCounts = runs.map((r) => r.standard.cycleCount)
+ const bgCycleCounts = runs.map((r) => r.background.cycleCount)
+ const stdToolTimes = runs.map((r) => r.standard.totalToolTimeMs)
+ const bgToolTimes = runs.map((r) => r.background.totalToolTimeMs)
+ const toolTimeSpeedups = runs.map((r) =>
+ r.background.totalToolTimeMs > 0 ? r.standard.totalToolTimeMs / r.background.totalToolTimeMs : 1,
+ )
+
+ const avgStdOutputTokens = stdOutputTokens.reduce((s, v) => s + v, 0) / runs.length
+ const avgBgOutputTokens = bgOutputTokens.reduce((s, v) => s + v, 0) / runs.length
+ const outputTokenDeltaPct = avgStdOutputTokens > 0 ? ((avgBgOutputTokens - avgStdOutputTokens) / avgStdOutputTokens) * 100 : 0
+
+ const avgStdInputTokens = stdInputTokens.reduce((s, v) => s + v, 0) / runs.length
+ const avgBgInputTokens = bgInputTokens.reduce((s, v) => s + v, 0) / runs.length
+ const inputTokenDeltaPct = avgStdInputTokens > 0 ? ((avgBgInputTokens - avgStdInputTokens) / avgStdInputTokens) * 100 : 0
+
+ const allValidationsPassed = runs.every((r) => r.standard.allValidationsPassed && r.background.allValidationsPassed)
+
+ return {
+ caseName: benchmarkCase.name,
+ description: benchmarkCase.description,
+ runs,
+ summary: {
+ speedup: computeStats(speedups),
+ standardWallClockMs: computeStats(stdWallClocks),
+ backgroundWallClockMs: computeStats(bgWallClocks),
+ standardOutputTokens: computeStats(stdOutputTokens),
+ backgroundOutputTokens: computeStats(bgOutputTokens),
+ outputTokenDeltaPct,
+ standardInputTokens: computeStats(stdInputTokens),
+ backgroundInputTokens: computeStats(bgInputTokens),
+ inputTokenDeltaPct,
+ standardMessageCount: computeStats(stdMessageCounts),
+ backgroundMessageCount: computeStats(bgMessageCounts),
+ standardCycleCount: computeStats(stdCycleCounts),
+ backgroundCycleCount: computeStats(bgCycleCounts),
+ standardToolTimeMs: computeStats(stdToolTimes),
+ backgroundToolTimeMs: computeStats(bgToolTimes),
+ toolTimeSpeedup: computeStats(toolTimeSpeedups),
+ allValidationsPassed,
+ },
+ }
+ }
+
+ async runAll(cases: BenchmarkCase[]): Promise<BenchmarkReport> {
+ const filtered = this._config.cases
+ ? cases.filter((c) => this._config.cases!.includes(c.name))
+ : cases
+
+ if (filtered.length === 0) {
+ console.error('No matching cases found. Available:', cases.map((c) => c.name).join(', '))
+ process.exit(1)
+ }
+
+ const caseReports: CaseReport[] = []
+
+ for (const benchmarkCase of filtered) {
+ console.log(`\n${'='.repeat(60)}`)
+ console.log(` CASE: ${benchmarkCase.name}`)
+ console.log(` ${benchmarkCase.description}`)
+ console.log('='.repeat(60))
+
+ const report = await this.runCase(benchmarkCase)
+ caseReports.push(report)
+ }
+
+ const allValidationsPassed = caseReports.every((c) => c.summary.allValidationsPassed)
+ const avgSpeedup =
+ caseReports.length > 0
+ ? caseReports.reduce((sum, c) => sum + c.summary.speedup.mean, 0) / caseReports.length
+ : 0
+
+ return {
+ timestamp: new Date().toISOString(),
+ config: this._config,
+ cases: caseReports,
+ overallSummary: {
+ totalCases: caseReports.length,
+ totalRuns: caseReports.length * this._config.runs,
+ allValidationsPassed,
+ avgSpeedupAcrossCases: avgSpeedup,
+ },
+ }
+ }
+}
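Note on the merge step in runMode: background dispatch counts are added only for tools that have no entry in the primary agent's `toolMetrics`, and each dispatch is recorded as a success with zero `totalTime`. The standalone sketch below isolates that behavior. The `ToolStat` shape is inferred from the object literal in the diff, and `mergeDispatchCounts` is a hypothetical name.

```typescript
// Per-tool counters, with fields inferred from the literal used in runMode.
interface ToolStat {
  callCount: number
  successCount: number
  errorCount: number
  totalTime: number
}

function mergeDispatchCounts(
  toolMetrics: Record<string, ToolStat>,
  bgDispatches: Record<string, number>,
): Record<string, ToolStat> {
  const merged: Record<string, ToolStat> = { ...toolMetrics }
  for (const [toolName, count] of Object.entries(bgDispatches)) {
    // Existing entries win; only tools the primary agent never recorded are filled in.
    if (!merged[toolName]) {
      merged[toolName] = { callCount: count, successCount: count, errorCount: 0, totalTime: 0 }
    }
  }
  return merged
}

// research_planet already has an inline entry, dispatch_probe exists only as dispatches.
const mergedExample = mergeDispatchCounts(
  { research_planet: { callCount: 1, successCount: 1, errorCount: 0, totalTime: 8000 } },
  { dispatch_probe: 3, research_planet: 2 },
)
```

Two consequences follow from this shape: a tool that runs both inline and in the background keeps only its inline count, and a background task that later fails is still counted as a success for trajectory validation.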
diff --git a/designs/0009-demos/benchmarks/framework/types.ts b/designs/0009-demos/benchmarks/framework/types.ts
new file mode 100644
index 000000000..e9448e9a9
--- /dev/null
+++ b/designs/0009-demos/benchmarks/framework/types.ts
@@ -0,0 +1,176 @@
+import type { Agent, AgentResult, Model } from '@strands-agents/sdk'
+
+// ── Configuration ────────────────────────────────────────────────────────────
+
+export interface BenchmarkConfig {
+ /** Case names to run. Undefined = all. */
+ cases?: string[]
+ /** Repetitions per case. */
+ runs: number
+ /** Model identifier; must be a name that createModel() accepts (e.g. "sonnet-4.6", "haiku-4.5", "nova-pro"). */
+ model: string
+ /** Multiplier applied to tool stub delays (0.1 = 10x faster). */
+ delayMultiplier: number
+ /** Directory for JSON report output. */
+ outputDir: string
+}
+
+/** Model presets for --help display. */
+export const MODEL_PRESETS: Record<string, string> = {
+ 'sonnet-4.6': 'Claude Sonnet 4.6 via Bedrock (default)',
+ 'haiku-4.5': 'Claude Haiku 4.5 via Bedrock (fast/cheap)',
+ 'nova-pro': 'Amazon Nova Pro via Bedrock',
+}
+
+export const DEFAULT_CONFIG: BenchmarkConfig = {
+ model: 'sonnet-4.6',
+ runs: 3,
+ delayMultiplier: 1,
+ outputDir: './results',
+}
+
+// ── Case Definition ──────────────────────────────────────────────────────────
+
+export interface BenchmarkCase {
+ /** Unique identifier, e.g. "probe-dispatch". */
+ name: string
+ /** One-line description of what the case tests. */
+ description: string
+
+ /**
+ * Single-turn: a prompt string.
+ * Multi-turn: an array of sequential prompts (e.g. customer messages).
+ */
+ prompt: string | string[]
+
+ /** Creates a fresh standard-mode agent. Called once per run. */
+ createStandardAgent(delayMultiplier: number, model: Model): Agent
+
+ /** Creates a fresh background-mode agent. Called once per run. */
+ createBackgroundAgent(delayMultiplier: number, model: Model): Agent
+
+ /** Output validation spec. */
+ outputValidation: OutputValidation
+
+ /** Tool trajectory validation spec. */
+ trajectoryValidation: TrajectoryValidation
+}
+
+// ── Validation Specs ─────────────────────────────────────────────────────────
+
+export interface OutputValidation {
+ /** Required substrings (case-insensitive) in the output text. */
+ requiredContent?: string[]
+ /** Regex patterns that must match somewhere in the output. */
+ requiredPatterns?: RegExp[]
+ /** Minimum output length (characters). */
+ minLength?: number
+ /** Maximum output length (characters). */
+ maxLength?: number
+ /** When true, validates that the agent produced non-null structuredOutput. */
+ requireStructuredOutput?: boolean
+}
+
+export interface TrajectoryValidation {
+ /** Tool names that must appear in toolMetrics with callCount >= 1. */
+ requiredTools?: string[]
+ /** Minimum total tool calls across all tools. */
+ minToolCalls?: number
+}
+
+// ── Results ──────────────────────────────────────────────────────────────────
+
+export interface ValidationResult {
+ name: string
+ passed: boolean
+ reason: string
+}
+
+export interface ModeMetrics {
+ cycleCount: number
+ totalDuration: number
+ averageCycleTime: number
+ accumulatedUsage: {
+ inputTokens: number
+ outputTokens: number
+ totalTokens: number
+ cacheReadInputTokens?: number
+ cacheWriteInputTokens?: number
+ }
+ modelLatencyMs: number
+ toolMetrics: Record<string, { callCount: number }>
+}
+
+export interface ModeResult {
+ mode: 'standard' | 'background'
+ wallClockMs: number
+ metrics: ModeMetrics
+ totalToolTimeMs: number
+ totalModelLatencyMs: number
+ outputTokens: number
+ inputTokens: number
+ messageCount: number
+ cycleCount: number
+ outputText: string
+ structuredOutput?: unknown
+ stopReason: string
+ validationResults: ValidationResult[]
+ allValidationsPassed: boolean
+}
+
+export interface RunResult {
+ runIndex: number
+ standard: ModeResult
+ background: ModeResult
+ speedup: number
+}
+
+// ── Statistics ───────────────────────────────────────────────────────────────
+
+export interface Stats {
+ mean: number
+ stddev: number
+ min: number
+ max: number
+ p50: number
+ p95: number
+}
+
+// ── Reports ──────────────────────────────────────────────────────────────────
+
+export interface CaseReport {
+ caseName: string
+ description: string
+ runs: RunResult[]
+ summary: {
+ speedup: Stats
+ standardWallClockMs: Stats
+ backgroundWallClockMs: Stats
+ standardOutputTokens: Stats
+ backgroundOutputTokens: Stats
+ outputTokenDeltaPct: number
+ standardInputTokens: Stats
+ backgroundInputTokens: Stats
+ inputTokenDeltaPct: number
+ standardMessageCount: Stats
+ backgroundMessageCount: Stats
+ standardCycleCount: Stats
+ backgroundCycleCount: Stats
+ standardToolTimeMs: Stats
+ backgroundToolTimeMs: Stats
+ toolTimeSpeedup: Stats
+ allValidationsPassed: boolean
+ }
+}
+
+export interface BenchmarkReport {
+ timestamp: string
+ config: BenchmarkConfig
+ cases: CaseReport[]
+ overallSummary: {
+ totalCases: number
+ totalRuns: number
+ allValidationsPassed: boolean
+ avgSpeedupAcrossCases: number
+ }
+}
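
The `Stats` shape above can be filled from a list of per-run samples. The following is a minimal sketch, not the benchmark framework's actual implementation: `computeStats` and `percentile` are hypothetical helpers, the sample values are illustrative, and the choice of population standard deviation and nearest-rank percentiles is an assumption that the types do not specify.

```typescript
// Mirrors the Stats interface from types.ts so the sketch is self-contained.
interface Stats {
  mean: number
  stddev: number
  min: number
  max: number
  p50: number
  p95: number
}

// Hypothetical helper: nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length)
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1))
  return sorted[index]
}

// Hypothetical helper: summarizes samples, assuming population (not sample) stddev.
function computeStats(samples: number[]): Stats {
  if (samples.length === 0) {
    throw new Error('computeStats requires at least one sample')
  }
  const sorted = [...samples].sort((a, b) => a - b)
  const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length
  const variance = sorted.reduce((sum, v) => sum + (v - mean) ** 2, 0) / sorted.length
  return {
    mean,
    stddev: Math.sqrt(variance),
    min: sorted[0],
    max: sorted[sorted.length - 1],
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
  }
}

// Illustrative samples: speedup per run is standard wall clock / background wall clock.
const speedups = [30000 / 15000, 28000 / 14000, 33000 / 11000]
const speedupStats = computeStats(speedups)
```

With only a handful of runs, nearest-rank `p95` coincides with the maximum, so the percentile fields become informative only at larger run counts.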
diff --git a/designs/0009-demos/benchmarks/framework/validators.ts b/designs/0009-demos/benchmarks/framework/validators.ts
new file mode 100644
index 000000000..df3054a2e
--- /dev/null
+++ b/designs/0009-demos/benchmarks/framework/validators.ts
@@ -0,0 +1,108 @@
+import type { OutputValidation, TrajectoryValidation, ValidationResult } from './types.js'
+
+/**
+ * Validate agent output text against the case's output spec.
+ */
+export function validateOutput(
+ output: string,
+ spec: OutputValidation,
+ structuredOutput?: unknown,
+): ValidationResult[] {
+ const results: ValidationResult[] = []
+ const lower = output.toLowerCase()
+
+ if (spec.requireStructuredOutput) {
+ const present = structuredOutput != null
+ results.push({
+ name: 'structured_output_present',
+ passed: present,
+ reason: present ? 'Structured output produced' : 'Structured output missing (null or undefined)',
+ })
+ }
+
+ if (spec.requiredContent) {
+ for (const content of spec.requiredContent) {
+ const found = lower.includes(content.toLowerCase())
+ results.push({
+ name: `required_content:${content}`,
+ passed: found,
+ reason: found ? `Found "${content}" in output` : `Missing "${content}" in output`,
+ })
+ }
+ }
+
+ if (spec.requiredPatterns) {
+ for (const pattern of spec.requiredPatterns) {
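+ const matched = output.search(pattern) !== -1 // search() starts at index 0 regardless of lastIndex, so reused /g patterns stay stateless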
+ results.push({
+ name: `required_pattern:${pattern.source}`,
+ passed: matched,
+ reason: matched ? `Pattern /${pattern.source}/ matched` : `Pattern /${pattern.source}/ not found`,
+ })
+ }
+ }
+
+ if (spec.minLength !== undefined) {
+ const passed = output.length >= spec.minLength
+ results.push({
+ name: `min_length:${spec.minLength}`,
+ passed,
+ reason: passed
+ ? `Output length ${output.length} >= ${spec.minLength}`
+ : `Output length ${output.length} < ${spec.minLength}`,
+ })
+ }
+
+ if (spec.maxLength !== undefined) {
+ const passed = output.length <= spec.maxLength
+ results.push({
+ name: `max_length:${spec.maxLength}`,
+ passed,
+ reason: passed
+ ? `Output length ${output.length} <= ${spec.maxLength}`
+ : `Output length ${output.length} > ${spec.maxLength}`,
+ })
+ }
+
+ return results
+}
+
+/**
+ * Validate tool call trajectory against the case's trajectory spec.
+ */
+export function validateTrajectory(
+ toolMetrics: Record<string, { callCount: number }>,
+ spec: TrajectoryValidation,
+): ValidationResult[] {
+ const results: ValidationResult[] = []
+
+ if (spec.requiredTools) {
+ for (const toolName of spec.requiredTools) {
+ const entry = toolMetrics[toolName]
+ const called = entry !== undefined && entry.callCount >= 1
+ results.push({
+ name: `required_tool:${toolName}`,
+ passed: called,
+ reason: called
+ ? `${toolName} called ${entry!.callCount} time(s)`
+ : `${toolName} was not called`,
+ })
+ }
+ }
+
+ if (spec.minToolCalls !== undefined) {
+ const totalCalls = Object.entries(toolMetrics)
+ .filter(([name]) => name !== 'strands_structured_output')
+ .reduce((sum, [, t]) => sum + t.callCount, 0)
+ const passed = totalCalls >= spec.minToolCalls
+ results.push({
+ name: `min_tool_calls:${spec.minToolCalls}`,
+ passed,
+ reason: passed
+ ? `Total tool calls ${totalCalls} >= ${spec.minToolCalls}`
+ : `Total tool calls ${totalCalls} < ${spec.minToolCalls}`,
+ })
+ }
+
+ return results
+}
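
Pattern checks such as `requiredPatterns` are sensitive to regex flags when the same `RegExp` object is reused across runs and modes. The following is a minimal sketch of that pitfall using only standard JavaScript behavior; `matchesStateless` is a hypothetical helper introduced for illustration and is not part of the framework.

```typescript
// A pattern with the global flag keeps state in lastIndex between test() calls.
const pattern = /founded/g
const output = 'The project was founded in 2024.'

const firstTest = pattern.test(output) // matches, and lastIndex moves past the match
const secondTest = pattern.test(output) // resumes from lastIndex, so the match is missed
pattern.lastIndex = 0 // explicit reset before the pattern is reused below

// Hypothetical helper: String.prototype.search always starts at index 0 and
// restores lastIndex afterwards, so repeated checks give the same answer.
function matchesStateless(text: string, re: RegExp): boolean {
  return text.search(re) !== -1
}

const repeated = [matchesStateless(output, pattern), matchesStateless(output, pattern)]
```

Case authors who stick to patterns without the `g` or `y` flags avoid the issue entirely; the helper only matters when a flagged pattern is shared.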
diff --git a/designs/0009-demos/benchmarks/index.ts b/designs/0009-demos/benchmarks/index.ts
new file mode 100644
index 000000000..eda626193
--- /dev/null
+++ b/designs/0009-demos/benchmarks/index.ts
@@ -0,0 +1,99 @@
+import { allCases } from './cases/index.js'
+import { BenchmarkRunner } from './framework/runner.js'
+import { writeReport, printSummary } from './framework/report.js'
+import { createModel } from './framework/models.js'
+import type { BenchmarkConfig } from './framework/types.js'
+import { DEFAULT_CONFIG, MODEL_PRESETS } from './framework/types.js'
+
+// ── CLI arg parsing ──────────────────────────────────────────────────────────
+
+function parseArgs(): BenchmarkConfig {
+ const args = process.argv.slice(2)
+ const config = { ...DEFAULT_CONFIG }
+
+ for (let i = 0; i < args.length; i++) {
+ const arg = args[i]
+ const next = args[i + 1]
+
+ switch (arg) {
+ case '--cases':
+ if (next) {
+ config.cases = next.split(',').map((s) => s.trim())
+ i++
+ }
+ break
+ case '--runs':
+ if (next) {
+ config.runs = Math.max(1, parseInt(next, 10) || DEFAULT_CONFIG.runs) // default on non-numeric input; never fewer than one run
+ i++
+ }
+ break
+ case '--delay-multiplier':
+ if (next) {
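+ config.delayMultiplier = Number.isFinite(parseFloat(next)) ? parseFloat(next) : DEFAULT_CONFIG.delayMultiplier // keep the default on non-numeric input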
+ i++
+ }
+ break
+ case '--output-dir':
+ if (next) {
+ config.outputDir = next
+ i++
+ }
+ break
+ case '--model':
+ if (next) {
+ config.model = next
+ i++
+ }
+ break
+ case '--help':
+ console.log(`
+Background Tasks Benchmark Suite
+
+Usage: npm start -- [options]
+
+Options:
+ --cases Comma-separated case names to run (default: all)
+ --model Model preset to use (default: ${DEFAULT_CONFIG.model})
+ --runs Repetitions per case (default: ${DEFAULT_CONFIG.runs})
+ --delay-multiplier Tool delay scale factor (default: ${DEFAULT_CONFIG.delayMultiplier})
+ --output-dir Report output directory (default: ${DEFAULT_CONFIG.outputDir})
+ --help Show this help
+
+Available models:
+${Object.entries(MODEL_PRESETS).map(([name, desc]) => ` ${name.padEnd(24)} ${desc}`).join('\n')}
+
+Available cases:
+${allCases.map((c) => ` ${c.name.padEnd(24)} ${c.description}`).join('\n')}
+`)
+ process.exit(0)
+ }
+ }
+
+ return config
+}
+
+// ── Main ─────────────────────────────────────────────────────────────────────
+
+const config = parseArgs()
+const model = createModel(config.model)
+const runner = new BenchmarkRunner(config, model)
+
+console.log('\n' + '='.repeat(60))
+console.log(' BACKGROUND TASKS BENCHMARK SUITE')
+console.log('='.repeat(60))
+console.log(` Model: ${config.model}`)
+console.log(` Runs: ${config.runs}`)
+console.log(` Delay multiplier: ${config.delayMultiplier}x`)
+console.log(` Cases: ${config.cases?.join(', ') ?? 'all'}`)
+console.log(` Output: ${config.outputDir}`)
+console.log('='.repeat(60))
+
+const report = await runner.runAll(allCases)
+const filepath = writeReport(report, config.outputDir)
+printSummary(report)
+
+console.log(` Report written to: ${filepath}`)
+console.log()
+
+process.exit(report.overallSummary.allValidationsPassed ? 0 : 1)
diff --git a/designs/0009-demos/cancellation/index.ts b/designs/0009-demos/cancellation/index.ts
new file mode 100644
index 000000000..a29aefbde
--- /dev/null
+++ b/designs/0009-demos/cancellation/index.ts
@@ -0,0 +1,121 @@
+/**
+ * Cancellation Demo
+ *
+ * Proves that background task cancellation works: dispatch multiple tasks,
+ * get the first result, cancel the rest immediately.
+ *
+ * Scenario: Find when the Strands Agents project was founded.
+ * Three search agents with different speeds are dispatched in parallel via
+ * invokeBackground(). The fast one returns the answer in ~10-15s (5s tool delay
+ * plus model overhead). The slow ones would take ~65-100s. After the fast result,
+ * the developer cancels the slow ones, and they stop immediately.
+ */
+
+import { Agent, BedrockModel, tool } from '@strands-agents/sdk'
+import { z } from 'zod'
+
+const modelId = process.argv.find((_, i) => process.argv[i - 1] === '--model')
+const model = new BedrockModel({ ...(modelId && { modelId }), region: 'us-east-1' })
+
+// ── Search agents with very different tool delays ───────────────────────────
+
+function createSearchAgent(name: string, delayMs: number, result: string) {
+ const searchTool = tool({
+ name: 'search',
+ description: 'Execute the search.',
+ inputSchema: z.object({ query: z.string() }),
+ callback: async () => {
+ await new Promise((r) => setTimeout(r, delayMs))
+ return result
+ },
+ })
+
+ return new Agent({
+ model,
+ name,
+ tools: [searchTool],
+ systemPrompt: 'Call the search tool with the user query. Return ONLY the tool result, nothing else.',
+ printer: false,
+ })
+}
+
+// Fast: 5s tool delay (~10-15s total with model overhead)
+const quickAgent = createSearchAgent('quick_search', 5000,
+ 'FOUND: The Strands Agents project was founded in October 2024 by the AWS AI team. Open-sourced March 2025.')
+
+// Slow: 60s tool delay (~65-70s total)
+const docsAgent = createSearchAgent('docs_search', 60000,
+ 'DOCS: Strands Agents documentation confirms the project started in late 2024.')
+
+// Slowest: 90s tool delay (~95-100s total)
+const archiveAgent = createSearchAgent('archive_search', 90000,
+ 'ARCHIVE: Historical records show early commits dating to October 2024.')
+
+// ── Run ─────────────────────────────────────────────────────────────────────
+
+console.log('\n' + '='.repeat(70))
+console.log(' CANCELLATION DEMO')
+console.log('='.repeat(70))
+console.log(' Question: "When was the Strands Agents project founded?"')
+console.log(' 3 search agents dispatched via invokeBackground():')
+console.log(' quick_search — fast tool (~5s + model overhead)')
+console.log(' docs_search — slow tool (~60s + model overhead)')
+console.log(' archive_search — slowest tool (~90s + model overhead)')
+console.log(' After the fast result arrives, cancel the other two.')
+console.log('='.repeat(70))
+
+const start = Date.now()
+const elapsed = () => `+${((Date.now() - start) / 1000).toFixed(1)}s`
+
+const query = 'When was the Strands Agents project founded?'
+
+// Dispatch all three search agents in parallel
+const taskQuick = quickAgent.invokeBackground(query, { name: 'quick_search' })
+const taskDocs = docsAgent.invokeBackground(query, { name: 'docs_search' })
+const taskArchive = archiveAgent.invokeBackground(query, { name: 'archive_search' })
+
+console.log(`\n [${elapsed()}] Dispatched 3 search agents`)
+console.log(` quick_search: ${taskQuick.status}`)
+console.log(` docs_search: ${taskDocs.status}`)
+console.log(` archive_search: ${taskArchive.status}`)
+
+// Wait for just the fast one
+const quickResult = await taskQuick
+console.log(`\n [${elapsed()}] quick_search completed!`)
+console.log(` Result: "${String(quickResult).slice(0, 80)}..."`)
+
+// The other two should still be running
+console.log(`\n [${elapsed()}] Status check before cancel:`)
+console.log(` quick_search: ${taskQuick.status}`)
+console.log(` docs_search: ${taskDocs.status}`)
+console.log(` archive_search: ${taskArchive.status}`)
+
+// Cancel the slow ones
+taskDocs.cancel()
+taskArchive.cancel()
+
+console.log(`\n [${elapsed()}] Cancelled docs_search and archive_search`)
+console.log(` quick_search: ${taskQuick.status}`)
+console.log(` docs_search: ${taskDocs.status}`)
+console.log(` archive_search: ${taskArchive.status}`)
+
+const wallClock = Date.now() - start
+
+const pad = (s: string, n: number) => s.padEnd(n)
+
+console.log('\n' + '='.repeat(70))
+console.log(' RESULTS')
+console.log('='.repeat(70))
+console.log()
+console.log(` ${pad('Metric', 28)} Value`)
+console.log(` ${'-'.repeat(28)} ${'-'.repeat(30)}`)
+console.log(` ${pad('Wall clock', 28)} ${(wallClock / 1000).toFixed(1)}s`)
+console.log(` ${pad('Without cancel', 28)} ~90s+ (waiting for archive_search)`)
+console.log(` ${pad('Time saved', 28)} ~${Math.max(0, 90 - wallClock / 1000).toFixed(0)}s`)
+console.log(` ${pad('quick_search status', 28)} ${taskQuick.status}`)
+console.log(` ${pad('docs_search status', 28)} ${taskDocs.status}`)
+console.log(` ${pad('archive_search status', 28)} ${taskArchive.status}`)
+console.log()
+console.log('='.repeat(70))
+
+process.exit(0)
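
The dispatch, await-the-first, cancel-the-rest flow in this demo can be illustrated without the SDK. The following is a minimal sketch built on the standard `AbortController`; `startTask`, `SketchTask`, and `demo` are hypothetical names, and nothing here describes how `invokeBackground()` or `cancel()` are implemented in Strands.

```typescript
type TaskStatus = 'running' | 'completed' | 'cancelled'

interface SketchTask {
  readonly status: TaskStatus
  promise: Promise<string>
  cancel(): void
}

// Hypothetical helper: a timer-backed task that settles early when cancelled.
function startTask(name: string, delayMs: number): SketchTask {
  const controller = new AbortController()
  let status: TaskStatus = 'running'

  const promise = new Promise<string>((resolve, reject) => {
    const timer = setTimeout(() => {
      status = 'completed'
      resolve(`${name} finished`)
    }, delayMs)
    controller.signal.addEventListener('abort', () => {
      if (status !== 'running') return // already settled, nothing to cancel
      clearTimeout(timer) // release the timer so no work continues
      status = 'cancelled'
      reject(new Error(`${name} cancelled`))
    })
  })
  // Cancelled tasks reject; a no-op handler avoids an unhandled-rejection report
  // when the caller never awaits the cancelled task.
  promise.catch(() => {})

  return {
    get status() {
      return status
    },
    promise,
    cancel: () => controller.abort(),
  }
}

async function demo(): Promise<TaskStatus[]> {
  const quick = startTask('quick', 10)
  const slow = startTask('slow', 5_000)
  await quick.promise // wait only for the fast task
  slow.cancel() // stop the slow one instead of waiting for it
  return [quick.status, slow.status]
}
```

The sketch cancels cooperatively: the task itself listens for the abort signal and cleans up, which is why the status flips synchronously when `cancel()` is called.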
diff --git a/designs/0009-demos/error-retry/index.ts b/designs/0009-demos/error-retry/index.ts
new file mode 100644
index 000000000..9cb7f36af
--- /dev/null
+++ b/designs/0009-demos/error-retry/index.ts
@@ -0,0 +1,149 @@
+/**
+ * Error Handling & Retry Demo
+ *
+ * Demonstrates the three-layer retry mechanism for background tasks:
+ *
+ * Layer 1: Tool-level retry (inside the fork) — via AfterToolCallEvent hook
+ * Layer 2: Task-level retry (developer-controlled) — via BackgroundTaskResultEvent hook
+ * Layer 3: Model-driven retry — model sees the error and re-calls the tool
+ *
+ * Scenario: A research coordinator dispatches 3 background researchers.
+ * One researcher (market_researcher) has a flaky API that fails on the first call.
+ * A Layer 2 retry hook catches the failure and re-dispatches automatically.
+ * The model never sees the error — it gets the successful result transparently.
+ */
+
+import {
+ Agent,
+ BedrockModel,
+ tool,
+ BackgroundTaskDispatchEvent,
+ BackgroundTaskResultEvent,
+} from '@strands-agents/sdk'
+import { z } from 'zod'
+
+const modelId = process.argv.find((_, i) => process.argv[i - 1] === '--model')
+const model = new BedrockModel({ ...(modelId && { modelId }), region: 'us-east-1' })
+
+// ── Tool stubs ──────────────────────────────────────────────────────────────
+
+const techResearcher = tool({
+ name: 'research_tech',
+ description: 'Research technical landscape for a given topic. Always succeeds.',
+ inputSchema: z.object({ topic: z.string() }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, 5000))
+ return `Tech research for "${input.topic}": Found 12 relevant open-source projects, 3 major frameworks (Strands, LangGraph, CrewAI), and growing adoption of async tool patterns. Key trend: model-driven concurrency is emerging as a differentiator.`
+ },
+})
+
+const policyResearcher = tool({
+ name: 'research_policy',
+ description: 'Research governance and policy implications. Always succeeds.',
+ inputSchema: z.object({ topic: z.string() }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, 4000))
+ return `Policy research for "${input.topic}": Enterprise adoption requires audit trails for concurrent operations. Key concerns: task isolation (preventing data leakage between forks), cancellation guarantees (no orphaned processes), and observability (tracking background task lifecycle for compliance).`
+ },
+})
+
+// Flaky tool: fails on first call, succeeds on second
+let marketCallCount = 0
+
+const marketResearcher = tool({
+ name: 'research_market',
+ description: 'Research market landscape. Note: this API is flaky and may fail intermittently.',
+ inputSchema: z.object({ topic: z.string() }),
+ callback: async (input) => {
+ marketCallCount++
+ await new Promise((r) => setTimeout(r, 3000))
+ if (marketCallCount === 1) {
+ throw new Error('Market API timeout: upstream provider returned 503 after 3000ms')
+ }
+ return `Market research for "${input.topic}": $2.4B TAM for AI agent tooling (2026). Background task scheduling identified as top-3 requested feature by enterprise users. Competitors: LangGraph (checkpoint-based), CrewAI (thread-based), Mastra (background flag). None offer model-driven dispatch.`
+ },
+})
+
+// ── Run ─────────────────────────────────────────────────────────────────────
+
+console.log('\n' + '='.repeat(70))
+console.log(' ERROR HANDLING & RETRY DEMO')
+console.log('='.repeat(70))
+console.log(' Scenario: 3 background researchers, one with a flaky API')
+console.log(' research_market will fail on first call, succeed on retry')
+console.log(' Layer 2 retry hook re-dispatches automatically')
+console.log('='.repeat(70))
+
+const coordinator = new Agent({
+ model,
+ systemPrompt:
+ 'You are a research coordinator. Dispatch all three researchers in parallel, ' +
+ 'then synthesize their findings into a brief summary. ' +
+ 'Include specific data points from each researcher.',
+ backgroundTools: [techResearcher, policyResearcher, marketResearcher],
+ printer: false,
+})
+
+const start = Date.now()
+const elapsed = () => `+${((Date.now() - start) / 1000).toFixed(1)}s`
+
+// Track dispatches
+coordinator.addHook(BackgroundTaskDispatchEvent, (e) => {
+ console.log(` [${elapsed()}] ${e.toolUse.name} dispatched`)
+})
+
+// Layer 2 retry: catch errors and retry; the budget of one retry is shared across all tasks
+let retryCount = 0
+let errorReachedModel = false
+const maxRetries = 1
+
+coordinator.addHook(BackgroundTaskResultEvent, (e) => {
+ if (e.result.status === 'error') {
+ if (retryCount < maxRetries) {
+ retryCount++
+ e.retry = true
+ console.log(` [${elapsed()}] ${e.taskName} FAILED: ${e.error?.message ?? 'unknown error'}`)
+ console.log(` [${elapsed()}] ${e.taskName} RETRYING (attempt ${retryCount + 1})...`)
+ } else {
+ errorReachedModel = true
+ console.log(` [${elapsed()}] ${e.taskName} FAILED: retries exhausted, error will reach model`)
+ }
+ } else {
+ console.log(` [${elapsed()}] ${e.taskName} result arrived`)
+ }
+})
+
+console.log('\n--- Running with background tools + retry hook ---\n')
+
+const result = await coordinator.invoke(
+ 'Research the landscape of background task scheduling in AI agent frameworks. ' +
+ 'Cover technical, market, and policy dimensions.'
+)
+
+const wallClock = Date.now() - start
+
+console.log('\n--- COORDINATOR OUTPUT ---\n')
+console.log(result.toString())
+
+const usage = result.metrics?.accumulatedUsage
+const pad = (s: string, n: number) => s.padEnd(n)
+
+console.log('\n' + '='.repeat(70))
+console.log(' RESULTS')
+console.log('='.repeat(70))
+console.log()
+console.log(` ${pad('Metric', 28)} Value`)
+console.log(` ${'-'.repeat(28)} ${'-'.repeat(30)}`)
+console.log(` ${pad('Wall clock', 28)} ${(wallClock / 1000).toFixed(1)}s`)
+console.log(` ${pad('Input tokens', 28)} ${usage?.inputTokens ?? 'N/A'}`)
+console.log(` ${pad('Output tokens', 28)} ${usage?.outputTokens ?? 'N/A'}`)
+console.log(` ${pad('Total tokens', 28)} ${usage?.totalTokens ?? 'N/A'}`)
+console.log(` ${pad('Agent cycles', 28)} ${result.metrics?.cycleCount ?? 'N/A'}`)
+console.log(` ${pad('Output length (chars)', 28)} ${result.toString().length}`)
+console.log()
+console.log(' Retry Mechanics:')
+console.log(` ${pad('research_market calls', 28)} ${marketCallCount}`)
+console.log(` ${pad('Layer 2 retries fired', 28)} ${retryCount}`)
+console.log(` ${pad('Error reached model', 28)} ${errorReachedModel ? 'YES (retries exhausted)' : 'NO (handled transparently)'}`)
+console.log()
+console.log('='.repeat(70))
diff --git a/designs/0009-demos/fire-and-forget-mcp/gmail-mcp-server.ts b/designs/0009-demos/fire-and-forget-mcp/gmail-mcp-server.ts
new file mode 100644
index 000000000..ebb6cce76
--- /dev/null
+++ b/designs/0009-demos/fire-and-forget-mcp/gmail-mcp-server.ts
@@ -0,0 +1,54 @@
+/**
+ * Stub Gmail MCP Server
+ *
+ * Exposes the same tool interface a real Gmail MCP server would:
+ * - send_email(to, subject, body) — simulates SMTP delivery with a 3-5s delay
+ *
+ * Runs over stdio transport. To the agent, this is indistinguishable from
+ * a real Gmail MCP server — same tool name, same input schema, same response shape.
+ * The only difference is it logs to stderr instead of actually sending.
+ */
+
+import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'
+import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js'
+import * as z from 'zod'
+
+const server = new McpServer(
+ { name: 'gmail-mcp-server', version: '1.0.0' },
+ { capabilities: { tools: {} } },
+)
+
+server.registerTool(
+ 'send_email',
+ {
+ title: 'Send Email',
+ description: 'Sends an email via Gmail. Takes a recipient address, subject line, and body text.',
+ inputSchema: {
+ to: z.string().describe('Recipient email address'),
+ subject: z.string().describe('Email subject line'),
+ body: z.string().describe('Email body text'),
+ },
+ },
+ async ({ to, subject, body }) => {
+ // Simulate SMTP round-trip (3-5s)
+ const delay = 3000 + Math.random() * 2000
+ await new Promise((r) => setTimeout(r, delay))
+
+ const messageId = `msg-${Date.now()}-${Math.random().toString(36).slice(2, 8)}`
+
+ // Log to stderr so it doesn't interfere with stdio MCP transport
+ process.stderr.write(
+ `[Gmail MCP] Sent email to=${to} subject="${subject}" body=${body.length} chars messageId=${messageId} (${(delay / 1000).toFixed(1)}s)\n`
+ )
+
+ return {
+ content: [{
+ type: 'text' as const,
+ text: `Email sent successfully. Message ID: ${messageId}`,
+ }],
+ }
+ },
+)
+
+const transport = new StdioServerTransport()
+await server.connect(transport)
diff --git a/designs/0009-demos/fire-and-forget-mcp/index.ts b/designs/0009-demos/fire-and-forget-mcp/index.ts
new file mode 100644
index 000000000..52d5cf38c
--- /dev/null
+++ b/designs/0009-demos/fire-and-forget-mcp/index.ts
@@ -0,0 +1,160 @@
+/**
+ * Fire-and-Forget MCP Demo — Standard vs Background
+ *
+ * Demonstrates background tasks with MCP tools in a fire-and-forget pattern.
+ * A project manager agent reviews a sprint and composes status update emails
+ * for 4 team members. The send_email tool (provided by a stub Gmail MCP server
+ * over stdio) takes 3-5s per email to simulate SMTP delivery.
+ *
+ * Standard: Agent composes each email, waits for send_email to complete before
+ * moving to the next team member. Emails block the main flow.
+ * Background: send_email is a backgroundTool — the agent fires off each email
+ * as it's composed and immediately moves to the next. All 4 emails
+ * send concurrently while the agent continues working.
+ */
+
+import {
+ Agent,
+ BedrockModel,
+ McpClient,
+ BeforeToolCallEvent,
+ AfterToolCallEvent,
+ BackgroundTaskDispatchEvent,
+ BackgroundTaskResultEvent,
+} from '@strands-agents/sdk'
+import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js'
+import { fileURLToPath } from 'node:url'
+import { dirname, join } from 'node:path'
+
+const __dirname = dirname(fileURLToPath(import.meta.url))
+
+// ── MCP client factory ──────────────────────────────────────────────────────
+
+function createGmailMcpClient(): McpClient {
+ return new McpClient({
+ transport: new StdioClientTransport({
+ command: 'node',
+ args: [join(__dirname, 'gmail-mcp-server.js')],
+ }),
+ })
+}
+
+// ── Common config ───────────────────────────────────────────────────────────
+
+const modelId = process.argv.find((_, i) => process.argv[i - 1] === '--model')
+const model = new BedrockModel({ ...(modelId && { modelId }), region: 'us-east-1' })
+
+const systemPrompt =
+ 'You are a project manager sending sprint status update emails to your team.\n\n' +
+ 'You have a send_email tool that sends emails via Gmail.\n\n' +
+ 'For EACH team member below, compose a personalized 2-3 sentence status update email ' +
+ 'and send it using send_email. The email should reference their specific work.\n\n' +
+ 'Team members:\n' +
+ '1. alice@acme.com — Worked on the authentication service refactor\n' +
+ '2. bob@acme.com — Shipped the new dashboard analytics feature\n' +
+ '3. carol@acme.com — Fixed 3 critical production bugs in the payment pipeline\n' +
+ '4. dave@acme.com — Completed the API rate limiting implementation\n\n' +
+ 'After sending all emails, produce a brief confirmation summary listing who was emailed.\n' +
+ 'Keep each email body under 100 words. Use subject line "Sprint 42 Status Update".'
+
+const userPrompt = 'Send sprint status update emails to all 4 team members now.'
+
+// ── Standard run ────────────────────────────────────────────────────────────
+
+console.log('\n' + '='.repeat(70))
+console.log(' FIRE-AND-FORGET MCP DEMO — Standard vs Background')
+console.log('='.repeat(70))
+console.log(' Tool: send_email via Gmail MCP Server (stdio, 3-5s per email)')
+console.log(' Scenario: Send personalized sprint update emails to 4 team members')
+console.log('='.repeat(70))
+
+console.log('\n--- STANDARD MODE (sequential — each email blocks) ---\n')
+
+const stdMcpClient = createGmailMcpClient()
+const standardAgent = new Agent({
+ model,
+ systemPrompt,
+ tools: [stdMcpClient],
+ printer: false,
+})
+
+const stdStart = Date.now()
+
+standardAgent.addHook(BeforeToolCallEvent, (e) => {
+ if (e.toolUse.name === 'send_email') {
+ const input = e.toolUse.input as { to?: string }
+ console.log(` [+${((Date.now() - stdStart) / 1000).toFixed(1)}s] send_email to ${input.to ?? '?'} — started`)
+ }
+})
+standardAgent.addHook(AfterToolCallEvent, (e) => {
+ if (e.toolUse.name === 'send_email') {
+ const input = e.toolUse.input as { to?: string }
+ console.log(` [+${((Date.now() - stdStart) / 1000).toFixed(1)}s] send_email to ${input.to ?? '?'} — delivered`)
+ }
+})
+
+const standardResult = await standardAgent.invoke(userPrompt)
+const standardMs = Date.now() - stdStart
+
+console.log(`\n Standard: ${(standardMs / 1000).toFixed(1)}s`)
+
+await stdMcpClient.disconnect()
+
+// ── Background run ──────────────────────────────────────────────────────────
+
+console.log('\n--- BACKGROUND MODE (fire-and-forget — emails send concurrently) ---\n')
+
+const bgMcpClient = createGmailMcpClient()
+const backgroundAgent = new Agent({
+ model,
+ systemPrompt,
+ backgroundTools: [bgMcpClient],
+ printer: false,
+})
+
+const bgStart = Date.now()
+
+backgroundAgent.addHook(BackgroundTaskDispatchEvent, (e) => {
+ const input = e.toolUse.input as { to?: string }
+ console.log(` [+${((Date.now() - bgStart) / 1000).toFixed(1)}s] send_email to ${input.to ?? '?'} — dispatched (bg)`)
+})
+backgroundAgent.addHook(BackgroundTaskResultEvent, (e) => {
+ console.log(` [+${((Date.now() - bgStart) / 1000).toFixed(1)}s] ${e.taskName} — delivered (bg)`)
+})
+
+const backgroundResult = await backgroundAgent.invoke(userPrompt)
+const backgroundMs = Date.now() - bgStart
+
+console.log(`\n Background: ${(backgroundMs / 1000).toFixed(1)}s`)
+
+await bgMcpClient.disconnect()
+
+// ── Results ─────────────────────────────────────────────────────────────────
+
+console.log('\n--- STANDARD OUTPUT ---\n')
+console.log(standardResult.toString())
+console.log('\n--- BACKGROUND OUTPUT ---\n')
+console.log(backgroundResult.toString())
+
+const speedup = standardMs / backgroundMs
+
+const stdUsage = standardResult.metrics?.accumulatedUsage
+const bgUsage = backgroundResult.metrics?.accumulatedUsage
+
+const pad = (s: string, n: number) => s.padEnd(n)
+
+console.log('\n' + '='.repeat(70))
+console.log(' RESULTS')
+console.log('='.repeat(70))
+console.log()
+console.log(` ${pad('Metric', 24)} ${pad('Standard', 16)} Background`)
+console.log(` ${'-'.repeat(24)} ${'-'.repeat(16)} ${'-'.repeat(16)}`)
+console.log(` ${pad('Wall clock', 24)} ${pad((standardMs / 1000).toFixed(1) + 's', 16)} ${(backgroundMs / 1000).toFixed(1)}s`)
+console.log(` ${pad('Speedup', 24)} ${pad('baseline', 16)} ${speedup.toFixed(2)}x`)
+console.log(` ${pad('Input tokens', 24)} ${pad(String(stdUsage?.inputTokens ?? 'N/A'), 16)} ${bgUsage?.inputTokens ?? 'N/A'}`)
+console.log(` ${pad('Output tokens', 24)} ${pad(String(stdUsage?.outputTokens ?? 'N/A'), 16)} ${bgUsage?.outputTokens ?? 'N/A'}`)
+console.log(` ${pad('Total tokens', 24)} ${pad(String(stdUsage?.totalTokens ?? 'N/A'), 16)} ${bgUsage?.totalTokens ?? 'N/A'}`)
+console.log(` ${pad('Agent cycles', 24)} ${pad(String(standardResult.metrics?.cycleCount ?? 'N/A'), 16)} ${backgroundResult.metrics?.cycleCount ?? 'N/A'}`)
+console.log(` ${pad('Output length (chars)', 24)} ${pad(String(standardResult.toString().length), 16)} ${backgroundResult.toString().length}`)
+console.log()
+console.log('='.repeat(70))
diff --git a/designs/0009-demos/graph-vs-background/index.ts b/designs/0009-demos/graph-vs-background/index.ts
new file mode 100644
index 000000000..2e29894e1
--- /dev/null
+++ b/designs/0009-demos/graph-vs-background/index.ts
@@ -0,0 +1,331 @@
+/**
+ * Graph vs Background Tools — Side-by-side comparison
+ *
+ * Same 3-layer content pipeline executed two different ways:
+ *
+ * Layer 1 (parallel): market_analyst, tech_researcher, competitor_scout
+ * Layer 2 (parallel): marketing_writer (needs market + competitors),
+ * technical_writer (needs tech + competitors)
+ * Layer 3: editor (combines both drafts into final announcement)
+ *
+ * Approach A — Graph: Explicit DAG with 6 nodes. Graph orchestrates parallelism
+ * and dependencies. No model decides what runs when.
+ *
+ * Approach B — Single Agent + backgroundTools: One coordinator agent with 5
+ * sub-agents as background tools. The model discovers the pipeline
+ * and dispatches accordingly.
+ *
+ * Both produce the same artifact: a product launch announcement.
+ */
+
+import {
+ Agent,
+ BedrockModel,
+ Graph,
+ BeforeToolCallEvent,
+ AfterToolCallEvent,
+ BackgroundTaskDispatchEvent,
+ BackgroundTaskResultEvent,
+} from '@strands-agents/sdk'
+import { BeforeNodeCallEvent, AfterNodeCallEvent } from '@strands-agents/sdk/multiagent'
+
+const modelId = process.argv.find((_, i) => process.argv[i - 1] === '--model')
+const model = new BedrockModel({ ...(modelId && { modelId }), region: 'us-east-1' })
+
+const product = 'QuantumDB — a distributed database that uses quantum-inspired algorithms for sub-millisecond queries at petabyte scale'
+
+// ── Shared agent factories (used by both approaches) ────────────────────────
+
+function createMarketAnalyst() {
+ return new Agent({
+ model,
+ printer: false,
+ id: 'market_analyst',
+ name: 'market_analyst',
+ description: 'Analyzes market landscape, target audience, and market sizing for a product launch.',
+ systemPrompt:
+ 'You are a market analyst. Given a product description, produce a concise market analysis (under 200 words) covering: ' +
+ 'target market, market size estimate, key buyer personas, and go-to-market positioning. Be specific and data-oriented.',
+ })
+}
+
+function createTechResearcher() {
+ return new Agent({
+ model,
+ printer: false,
+ id: 'tech_researcher',
+ name: 'tech_researcher',
+ description: 'Researches and summarizes the technical architecture and key differentiators of a product.',
+ systemPrompt:
+ 'You are a technical researcher. Given a product description, produce a concise technical summary (under 200 words) covering: ' +
+ 'architecture highlights, key technical differentiators, performance characteristics, and how it compares to existing approaches.',
+ })
+}
+
+function createCompetitorScout() {
+ return new Agent({
+ model,
+ printer: false,
+ id: 'competitor_scout',
+ name: 'competitor_scout',
+ description: 'Identifies and analyzes key competitors and their positioning relative to the product.',
+ systemPrompt:
+ 'You are a competitive intelligence analyst. Given a product description, produce a concise competitor analysis (under 200 words) covering: ' +
+ '3-4 key competitors, their strengths/weaknesses, and how this product differentiates.',
+ })
+}
+
+function createMarketingWriter() {
+ return new Agent({
+ model,
+ printer: false,
+ id: 'marketing_writer',
+ name: 'marketing_writer',
+ description: 'Writes marketing copy for a product launch using market analysis and competitor intelligence.',
+ systemPrompt:
+ 'You are a marketing copywriter. Given market analysis and competitor intelligence, write a compelling ' +
+ 'marketing section (under 250 words) for a product launch announcement. Include: value proposition, ' +
+ 'key benefits, and competitive positioning. Write for a technical decision-maker audience.',
+ })
+}
+
+function createTechnicalWriter() {
+ return new Agent({
+ model,
+ printer: false,
+ id: 'technical_writer',
+ name: 'technical_writer',
+ description: 'Writes the technical section of a product launch announcement using research and competitor data.',
+ systemPrompt:
+ 'You are a technical writer. Given technical research and competitor analysis, write a clear ' +
+ 'technical section (under 250 words) for a product launch announcement. Cover: architecture, ' +
+ 'performance benchmarks, and technical advantages over competitors.',
+ })
+}
+
+function createEditor() {
+ return new Agent({
+ model,
+ printer: false,
+ id: 'editor',
+ name: 'editor',
+ description: 'Combines marketing and technical drafts into a polished final product launch announcement.',
+ systemPrompt:
+ 'You are a senior editor. Given a marketing draft and a technical draft, combine them into a single ' +
+ 'polished product launch announcement (under 600 words). Structure: headline, executive summary, ' +
+ 'market opportunity, technical innovation, competitive advantage, and call to action. ' +
+ 'Make it professional, compelling, and cohesive.',
+ })
+}
+
+// ── Approach A: Graph ───────────────────────────────────────────────────────
+
+interface RunResult {
+ ms: number
+ output: string
+ inputTokens: number
+ outputTokens: number
+ totalTokens: number
+}
+
+async function runGraph(): Promise<RunResult> {
+ const graph = new Graph({
+ id: 'launch-pipeline',
+ nodes: [
+ createMarketAnalyst(),
+ createTechResearcher(),
+ createCompetitorScout(),
+ createMarketingWriter(),
+ createTechnicalWriter(),
+ createEditor(),
+ ],
+ edges: [
+ // Layer 1 → Layer 2
+ ['market_analyst', 'marketing_writer'],
+ ['competitor_scout', 'marketing_writer'],
+ ['tech_researcher', 'technical_writer'],
+ ['competitor_scout', 'technical_writer'],
+ // Layer 2 → Layer 3
+ ['marketing_writer', 'editor'],
+ ['technical_writer', 'editor'],
+ ],
+ })
+
+ const start = Date.now()
+
+ graph.addHook(BeforeNodeCallEvent, (e) => {
+ console.log(` [+${((Date.now() - start) / 1000).toFixed(1)}s] ${e.nodeId} started`)
+ })
+ graph.addHook(AfterNodeCallEvent, (e) => {
+ console.log(` [+${((Date.now() - start) / 1000).toFixed(1)}s] ${e.nodeId} finished${e.error ? ` (error: ${e.error.message})` : ''}`)
+ })
+
+ const result = await graph.invoke(`Product: ${product}`)
+ const ms = Date.now() - start
+
+ const output = result.content
+ .filter((b) => b.type === 'textBlock')
+ .map((b) => 'text' in b ? (b as { text: string }).text : '')
+ .join('\n')
+
+ return {
+ ms,
+ output,
+ inputTokens: result.usage?.inputTokens ?? 0,
+ outputTokens: result.usage?.outputTokens ?? 0,
+ totalTokens: result.usage?.totalTokens ?? 0,
+ }
+}
+
+// ── Shared coordinator prompt ────────────────────────────────────────────────
+
+const coordinatorPrompt =
+ 'You are a silent orchestrator. You NEVER produce text — only tool calls.\n\n' +
+ 'PIPELINE:\n' +
+ '1. Call market_analyst, tech_researcher, competitor_scout simultaneously.\n' +
+ '2. When results arrive, call marketing_writer (pass it market + competitor data) and technical_writer (pass it tech + competitor data) simultaneously.\n' +
+ '3. When drafts arrive, call editor (pass it both drafts).\n' +
+ '4. Return the editor\'s output verbatim.\n\n' +
+ 'RULES:\n' +
+ '- NEVER write prose, commentary, or narration between tool calls.\n' +
+ '- ONLY emit tool_use blocks. Nothing else.\n' +
+ '- Pass the FULL text from earlier results into later tool inputs — do not summarize.'
+
+// ── Approach B: Single Agent + standard tools (sequential) ──────────────────
+
+async function runStandard(): Promise<RunResult> {
+ const coordinator = new Agent({
+ model,
+ systemPrompt: coordinatorPrompt,
+ tools: [
+ createMarketAnalyst(),
+ createTechResearcher(),
+ createCompetitorScout(),
+ createMarketingWriter(),
+ createTechnicalWriter(),
+ createEditor(),
+ ],
+ printer: false,
+ })
+
+ const start = Date.now()
+
+ coordinator.addHook(BeforeToolCallEvent, (e) => {
+ if (e.toolUse.name !== 'strands_structured_output')
+ console.log(` [+${((Date.now() - start) / 1000).toFixed(1)}s] ${e.toolUse.name} started`)
+ })
+ coordinator.addHook(AfterToolCallEvent, (e) => {
+ if (e.toolUse.name !== 'strands_structured_output')
+ console.log(` [+${((Date.now() - start) / 1000).toFixed(1)}s] ${e.toolUse.name} finished`)
+ })
+
+ const result = await coordinator.invoke(`Product: ${product}`)
+ const ms = Date.now() - start
+ const output = result.toString()
+ const usage = result.metrics?.accumulatedUsage
+
+ return {
+ ms,
+ output,
+ inputTokens: usage?.inputTokens ?? 0,
+ outputTokens: usage?.outputTokens ?? 0,
+ totalTokens: usage?.totalTokens ?? 0,
+ }
+}
+
+// ── Approach C: Single Agent + backgroundTools (parallel) ───────────────────
+
+async function runBackground(): Promise<RunResult> {
+ const coordinator = new Agent({
+ model,
+ systemPrompt: coordinatorPrompt,
+ backgroundTools: [
+ createMarketAnalyst(),
+ createTechResearcher(),
+ createCompetitorScout(),
+ createMarketingWriter(),
+ createTechnicalWriter(),
+ createEditor(),
+ ],
+ printer: false,
+ })
+
+ const start = Date.now()
+
+ coordinator.addHook(BackgroundTaskDispatchEvent, (e) => {
+ console.log(` [+${((Date.now() - start) / 1000).toFixed(1)}s] ${e.toolUse.name} dispatched (bg)`)
+ })
+ coordinator.addHook(BackgroundTaskResultEvent, (e) => {
+ console.log(` [+${((Date.now() - start) / 1000).toFixed(1)}s] ${e.taskName} result arrived (bg)`)
+ })
+
+ const result = await coordinator.invoke(`Product: ${product}`)
+ const ms = Date.now() - start
+ const output = result.toString()
+ const usage = result.metrics?.accumulatedUsage
+
+ return {
+ ms,
+ output,
+ inputTokens: usage?.inputTokens ?? 0,
+ outputTokens: usage?.outputTokens ?? 0,
+ totalTokens: usage?.totalTokens ?? 0,
+ }
+}
+
+// ── Main ────────────────────────────────────────────────────────────────────
+
+console.log('\n' + '='.repeat(70))
+console.log(' STANDARD vs BACKGROUND vs GRAPH — Product Launch Pipeline')
+console.log('='.repeat(70))
+console.log(` Product: ${product}`)
+console.log(` Pipeline: 3 researchers → 2 writers → 1 editor (6 agents)`)
+console.log('='.repeat(70))
+
+console.log('\n--- STANDARD: Single agent + standard tools (sequential) ---\n')
+const stdResult = await runStandard()
+console.log(`\n Standard: ${(stdResult.ms / 1000).toFixed(1)}s`)
+
+console.log('\n--- BACKGROUND: Single agent + backgroundTools (parallel dispatch) ---\n')
+const bgResult = await runBackground()
+console.log(`\n Background: ${(bgResult.ms / 1000).toFixed(1)}s`)
+
+console.log('\n--- GRAPH: Explicit DAG ---\n')
+const graphResult = await runGraph()
+console.log(`\n Graph: ${(graphResult.ms / 1000).toFixed(1)}s`)
+
+// ── Outputs ─────────────────────────────────────────────────────────────────
+
+console.log('\n--- STANDARD OUTPUT ---\n')
+console.log(stdResult.output)
+console.log('\n--- BACKGROUND OUTPUT ---\n')
+console.log(bgResult.output)
+console.log('\n--- GRAPH OUTPUT ---\n')
+console.log(graphResult.output)
+
+// ── Results (printed last for easy reading) ─────────────────────────────────
+
+const bgVsStd = stdResult.ms / bgResult.ms
+const graphVsStd = stdResult.ms / graphResult.ms
+
+const pad = (s: string, n: number) => s.padEnd(n)
+
+console.log('\n' + '='.repeat(70))
+console.log(' RESULTS')
+console.log('='.repeat(70))
+console.log()
+console.log(` ${pad('Metric', 24)} ${pad('Standard', 14)} ${pad('Background', 14)} Graph`)
+console.log(` ${'-'.repeat(24)} ${'-'.repeat(14)} ${'-'.repeat(14)} ${'-'.repeat(14)}`)
+console.log(` ${pad('Wall clock', 24)} ${pad((stdResult.ms / 1000).toFixed(1) + 's', 14)} ${pad((bgResult.ms / 1000).toFixed(1) + 's', 14)} ${(graphResult.ms / 1000).toFixed(1)}s`)
+console.log(` ${pad('Speedup', 24)} ${pad('baseline', 14)} ${pad(bgVsStd.toFixed(2) + 'x', 14)} ${graphVsStd.toFixed(2)}x`)
+console.log(` ${pad('Input tokens', 24)} ${pad(String(stdResult.inputTokens), 14)} ${pad(String(bgResult.inputTokens), 14)} ${graphResult.inputTokens}`)
+console.log(` ${pad('Output tokens', 24)} ${pad(String(stdResult.outputTokens), 14)} ${pad(String(bgResult.outputTokens), 14)} ${graphResult.outputTokens}`)
+console.log(` ${pad('Total tokens', 24)} ${pad(String(stdResult.totalTokens), 14)} ${pad(String(bgResult.totalTokens), 14)} ${graphResult.totalTokens}`)
+console.log(` ${pad('Output length (chars)', 24)} ${pad(String(stdResult.output.length), 14)} ${pad(String(bgResult.output.length), 14)} ${graphResult.output.length}`)
+console.log()
+console.log(' Tradeoffs (expected; check against the measured numbers above):')
+console.log(' Standard: Simplest code. No parallelism. Expected slowest.')
+console.log(' Background: One agent + prompt. Model discovers parallelism. Middle ground.')
+console.log(' Graph: Explicit DAG. Maximum parallelism. Expected fastest. Requires upfront design.')
+console.log()
+console.log('='.repeat(70))
diff --git a/designs/0009-demos/mixed-foreground-background/index.ts b/designs/0009-demos/mixed-foreground-background/index.ts
new file mode 100644
index 000000000..06faceb32
--- /dev/null
+++ b/designs/0009-demos/mixed-foreground-background/index.ts
@@ -0,0 +1,186 @@
+/**
+ * Mixed Foreground + Background Demo
+ *
+ * Demonstrates the most common production pattern: an agent with both foreground
+ * tools (results needed immediately) and background tools (side-effects that
+ * don't block the response).
+ *
+ * Scenario: A customer support agent handles a refund request.
+ * - Foreground: look up order details (model needs this to respond)
+ * - Foreground: process the refund (model needs confirmation)
+ * - Background: send confirmation email (fire-and-forget)
+ * - Background: log the interaction for analytics (fire-and-forget)
+ * - Background: update CRM record (fire-and-forget)
+ *
+ * The agent responds to the user immediately after the refund is processed,
+ * while the email, logging, and CRM update happen in the background.
+ */
+
+import {
+ Agent,
+ BedrockModel,
+ tool,
+ BeforeToolCallEvent,
+ AfterToolCallEvent,
+ BackgroundTaskDispatchEvent,
+ BackgroundTaskResultEvent,
+ BackgroundTaskPendingEvent,
+} from '@strands-agents/sdk'
+import { z } from 'zod'
+
+const modelId = process.argv.find((_, i) => process.argv[i - 1] === '--model')
+const model = new BedrockModel({ ...(modelId && { modelId }), region: 'us-east-1' })
+
+// ── Foreground tools (model needs results immediately) ──────────────────────
+
+const lookupOrder = tool({
+ name: 'lookup_order',
+ description: 'Look up order details by order ID. Returns order status, items, and total.',
+ inputSchema: z.object({ orderId: z.string().describe('The order ID to look up') }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, 1000))
+ return `Order ${input.orderId}: 2x Widget Pro ($49.99 each), 1x Widget Case ($12.99). Total: $112.97. Status: delivered 2026-04-15. Payment: Visa ending 4242.`
+ },
+})
+
+const processRefund = tool({
+ name: 'process_refund',
+ description: 'Process a refund for an order. Returns refund confirmation with reference number.',
+ inputSchema: z.object({
+ orderId: z.string().describe('The order ID to refund'),
+ amount: z.string().describe('The refund amount'),
+ reason: z.string().describe('Reason for the refund'),
+ }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, 2000))
+ return `Refund processed: $${input.amount} for order ${input.orderId}. Reference: REF-2026-04-21-7829. Reason: ${input.reason}. Funds will appear in 3-5 business days on Visa ending 4242.`
+ },
+})
+
+// ── Background tools (fire-and-forget side effects) ─────────────────────────
+
+const sendEmail = tool({
+ name: 'send_confirmation_email',
+ description: 'Send a refund confirmation email to the customer. Fire-and-forget — no need to wait for delivery.',
+ inputSchema: z.object({
+ to: z.string().describe('Customer email address'),
+ orderId: z.string().describe('Order ID for the refund'),
+ refundRef: z.string().describe('Refund reference number'),
+ amount: z.string().describe('Refund amount'),
+ }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, 8000))
+ return `Email sent to ${input.to}: "Your refund of $${input.amount} for order ${input.orderId} has been processed. Reference: ${input.refundRef}. Expect funds in 3-5 business days."`
+ },
+})
+
+const logInteraction = tool({
+ name: 'log_interaction',
+ description: 'Log this support interaction for analytics and quality review. Fire-and-forget.',
+ inputSchema: z.object({
+ orderId: z.string().describe('Order ID'),
+ action: z.string().describe('Action taken'),
+ resolution: z.string().describe('Resolution summary'),
+ }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, 5000))
+ return `Logged: order=${input.orderId}, action=${input.action}, resolution=${input.resolution}. Ticket #SUP-48291 created.`
+ },
+})
+
+const updateCrm = tool({
+ name: 'update_crm',
+ description: 'Update the customer CRM record with this interaction. Fire-and-forget.',
+ inputSchema: z.object({
+ orderId: z.string().describe('Order ID'),
+ note: z.string().describe('Note to add to CRM record'),
+ }),
+ callback: async (input) => {
+ await new Promise((r) => setTimeout(r, 6000))
+ return `CRM updated for order ${input.orderId}: "${input.note}". Customer satisfaction score recalculated.`
+ },
+})
+
+// ── Run ─────────────────────────────────────────────────────────────────────
+
+console.log('\n' + '='.repeat(70))
+console.log(' MIXED FOREGROUND + BACKGROUND DEMO')
+console.log('='.repeat(70))
+console.log(' Foreground (blocking): lookup_order, process_refund')
+console.log(' Background (fire-and-forget): send_confirmation_email, log_interaction, update_crm')
+console.log(' The agent responds after the refund, while side-effects run in parallel')
+console.log('='.repeat(70))
+
+const agent = new Agent({
+ model,
+ systemPrompt:
+ 'You are a customer support agent. When a customer requests a refund:\n' +
+ '1. Look up the order details using lookup_order.\n' +
+ '2. Once you have the order details, call ALL of the following tools IN THE SAME TURN:\n' +
+ ' - process_refund (foreground — you need the refund confirmation)\n' +
+ ' - send_confirmation_email (background — fire and forget)\n' +
+ ' - log_interaction (background — fire and forget)\n' +
+ ' - update_crm (background — fire and forget)\n' +
+ '3. Respond to the customer with the refund confirmation. Do NOT wait for the background tools.\n\n' +
+ 'IMPORTANT: In step 2, call process_refund together with the background tools in one batch.\n' +
+ 'The customer email is customer@example.com.',
+ tools: [lookupOrder, processRefund],
+ backgroundTools: [sendEmail, logInteraction, updateCrm],
+ backgroundToolHeartbeatMs: 1000,
+ printer: false,
+})
+
+const start = Date.now()
+const elapsed = () => `+${((Date.now() - start) / 1000).toFixed(1)}s`
+
+agent.addHook(BeforeToolCallEvent, (e) => {
+ console.log(` [${elapsed()}] ${e.toolUse.name} started (foreground)`)
+}, { propagate: false })
+agent.addHook(AfterToolCallEvent, (e) => {
+ console.log(` [${elapsed()}] ${e.toolUse.name} finished (foreground)`)
+}, { propagate: false })
+agent.addHook(BackgroundTaskDispatchEvent, (e) => {
+ console.log(` [${elapsed()}] ${e.toolUse.name} dispatched (background)`)
+})
+agent.addHook(BackgroundTaskResultEvent, (e) => {
+ console.log(` [${elapsed()}] ${e.taskName} completed (background)`)
+})
+agent.addHook(BackgroundTaskPendingEvent, (e) => {
+ const names = e.pendingTasks.map((t) => t.name).join(', ')
+ console.log(` [${elapsed()}] ⏳ Waiting for: ${names} — ${e.completedCount} completed, ${(e.elapsedMs / 1000).toFixed(1)}s elapsed`)
+})
+
+console.log('\n--- Customer request ---\n')
+console.log(' Customer: "I need a refund for order ORD-99421. The widgets arrived damaged."\n')
+
+const result = await agent.invoke(
+ 'I need a refund for order ORD-99421. The widgets arrived damaged.'
+)
+
+const wallClock = Date.now() - start
+
+console.log('\n--- AGENT RESPONSE ---\n')
+console.log(result.toString())
+
+const usage = result.metrics?.accumulatedUsage
+const pad = (s: string, n: number) => s.padEnd(n)
+
+console.log('\n' + '='.repeat(70))
+console.log(' RESULTS')
+console.log('='.repeat(70))
+console.log()
+console.log(` ${pad('Metric', 28)} Value`)
+console.log(` ${'-'.repeat(28)} ${'-'.repeat(30)}`)
+console.log(` ${pad('Wall clock', 28)} ${(wallClock / 1000).toFixed(1)}s`)
+console.log(` ${pad('Input tokens', 28)} ${usage?.inputTokens ?? 'N/A'}`)
+console.log(` ${pad('Output tokens', 28)} ${usage?.outputTokens ?? 'N/A'}`)
+console.log(` ${pad('Total tokens', 28)} ${usage?.totalTokens ?? 'N/A'}`)
+console.log(` ${pad('Agent cycles', 28)} ${result.metrics?.cycleCount ?? 'N/A'}`)
+console.log(` ${pad('Output length (chars)', 28)} ${result.toString().length}`)
+console.log()
+console.log(' Tool Breakdown:')
+console.log(` ${pad('Foreground (blocking)', 28)} lookup_order (1s) + process_refund (2s)`)
+console.log(` ${pad('Background (parallel)', 28)} send_email (8s) + log (5s) + CRM (6s)`)
+console.log(` ${pad('Without backgroundTools', 28)} would add ~19s of sequential blocking`)
+console.log()
+console.log('='.repeat(70))
diff --git a/designs/0009-demos/research-briefing/index.ts b/designs/0009-demos/research-briefing/index.ts
new file mode 100644
index 000000000..c8ba392fc
--- /dev/null
+++ b/designs/0009-demos/research-briefing/index.ts
@@ -0,0 +1,300 @@
+/**
+ * Research Briefing Generator — Standard vs Background comparison
+ *
+ * Demonstrates background task scheduling with real data sources.
+ * Runs the same research pipeline twice — once with standard tools (sequential)
+ * and once with backgroundTools (parallel dispatch) — then compares wall-clock time.
+ *
+ * Data sources (all free, no auth):
+ * - GitHub API (repository search)
+ * - ArXiv API (academic papers)
+ * - HackerNews Algolia API (community discussions)
+ * - Web fetch (documentation sites)
+ */
+
+import {
+ Agent,
+ BedrockModel,
+ tool,
+ BeforeToolCallEvent,
+ AfterToolCallEvent,
+ BackgroundTaskDispatchEvent,
+ BackgroundTaskResultEvent,
+} from '@strands-agents/sdk'
+import { z } from 'zod'
+
+// ── Fetch tools ─────────────────────────────────────────────────────────────
+
+const searchGitHub = tool({
+ name: 'search_github',
+ description: 'Search GitHub for repositories and discussions matching a query. Returns repo names, descriptions, stars, and recent activity.',
+ inputSchema: z.object({
+ query: z.string().describe('Search query'),
+ }),
+ callback: async (input) => {
+ const url = `https://api.github.com/search/repositories?q=${encodeURIComponent(input.query)}&sort=updated&per_page=10`
+ const res = await fetch(url, { headers: { 'Accept': 'application/vnd.github.v3+json', 'User-Agent': 'strands-research-agent' } })
+ if (!res.ok) return `GitHub API error: ${res.status}`
+ const data = await res.json() as { items: Array<{ full_name: string; description: string | null; stargazers_count: number; updated_at: string; html_url: string; topics?: string[] }> }
+ return data.items.map(r =>
+ `${r.full_name} (${r.stargazers_count} stars, updated ${r.updated_at.slice(0, 10)})\n ${r.description ?? 'No description'}\n Topics: ${r.topics?.join(', ') || 'none'}\n ${r.html_url}`
+ ).join('\n\n')
+ },
+})
+
+const searchArxiv = tool({
+ name: 'search_arxiv',
+ description: 'Search ArXiv for recent academic papers matching a query. Returns titles, authors, abstracts, and links.',
+ inputSchema: z.object({
+ query: z.string().describe('Search query'),
+ }),
+ callback: async (input) => {
+ const url = `https://export.arxiv.org/api/query?search_query=all:${encodeURIComponent(input.query)}&max_results=8&sortBy=submittedDate&sortOrder=descending`
+ const res = await fetch(url)
+ if (!res.ok) return `ArXiv API error: ${res.status}`
+ const xml = await res.text()
+ // Each Atom <entry> element is one paper; the chunk before the first <entry> is feed metadata.
+ const entries = xml.split('<entry>').slice(1)
+ return entries.map(entry => {
+ const title = entry.match(/<title>([\s\S]*?)<\/title>/)?.[1]?.trim().replace(/\n\s+/g, ' ') ?? 'Unknown'
+ const summary = entry.match(/<summary>([\s\S]*?)<\/summary>/)?.[1]?.trim().replace(/\n\s+/g, ' ').slice(0, 300) ?? ''
+ const link = entry.match(/<id>([\s\S]*?)<\/id>/)?.[1]?.trim() ?? ''
+ const published = entry.match(/<published>([\s\S]*?)<\/published>/)?.[1]?.trim().slice(0, 10) ?? ''
+ return `${title}\n Published: ${published}\n ${summary}...\n ${link}`
+ }).join('\n\n')
+ },
+})
+
+const searchHackerNews = tool({
+ name: 'search_hackernews',
+ description: 'Search HackerNews for recent discussions and stories matching a query. Returns titles, points, comment counts, and links.',
+ inputSchema: z.object({
+ query: z.string().describe('Search query'),
+ }),
+ callback: async (input) => {
+ const url = `https://hn.algolia.com/api/v1/search?query=${encodeURIComponent(input.query)}&tags=story&hitsPerPage=10`
+ const res = await fetch(url)
+ if (!res.ok) return `HN API error: ${res.status}`
+ const data = await res.json() as { hits: Array<{ title: string; points: number; num_comments: number; url: string; objectID: string; created_at: string }> }
+ return data.hits.map(h =>
+ `${h.title} (${h.points} points, ${h.num_comments} comments, ${h.created_at.slice(0, 10)})\n ${h.url || `https://news.ycombinator.com/item?id=${h.objectID}`}`
+ ).join('\n\n')
+ },
+})
+
+const fetchUrl = tool({
+ name: 'fetch_url',
+ description: 'Fetch content from a URL and return the text. Use for documentation pages, blog posts, etc.',
+ inputSchema: z.object({
+ url: z.string().describe('URL to fetch'),
+ }),
+ callback: async (input) => {
+ try {
+ const res = await fetch(input.url, { headers: { 'User-Agent': 'strands-research-agent' } })
+ if (!res.ok) return `Fetch error: ${res.status}`
+ const text = await res.text()
+ return text.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').slice(0, 8000)
+ } catch (e) {
+ return `Fetch failed: ${e instanceof Error ? e.message : String(e)}`
+ }
+ },
+})
+
+// ── Researcher sub-agents ───────────────────────────────────────────────────
+
+const modelId = process.argv.find((_, i) => process.argv[i - 1] === '--model')
+const model = new BedrockModel({ ...(modelId && { modelId }), region: 'us-east-1' })
+
+function createResearchers() {
+ const githubResearcher = new Agent({
+ model,
+ name: 'github_researcher',
+ description: 'Searches GitHub for relevant repositories, frameworks, and open-source projects. Returns a structured summary of findings with links.',
+ tools: [searchGitHub],
+ systemPrompt:
+ 'You are a GitHub research specialist. Given a topic, search GitHub for the most relevant ' +
+ 'repositories, frameworks, and projects. Make 2-3 searches with different query variations. ' +
+ 'Produce a structured summary with:\n' +
+ '- Key repositories (name, stars, what it does)\n' +
+ '- Notable patterns across repos\n' +
+ '- Links to the most important repos\n' +
+ 'Keep your summary under 600 words.',
+ printer: false,
+ })
+
+ const arxivResearcher = new Agent({
+ model,
+ name: 'arxiv_researcher',
+ description: 'Searches ArXiv for recent academic papers and research. Returns a structured summary of key papers and their contributions.',
+ tools: [searchArxiv],
+ systemPrompt:
+ 'You are an academic research specialist. Given a topic, search ArXiv for recent papers. ' +
+ 'Try 2-3 query variations. Produce a structured summary with:\n' +
+ '- Key papers (title, date, main contribution)\n' +
+ '- Common themes across the literature\n' +
+ '- Links to the most important papers\n' +
+ 'Keep your summary under 600 words.',
+ printer: false,
+ })
+
+ const hnResearcher = new Agent({
+ model,
+ name: 'hackernews_researcher',
+ description: 'Searches HackerNews for community discussions, opinions, and real-world experiences. Returns a structured summary of community sentiment.',
+ tools: [searchHackerNews],
+ systemPrompt:
+ 'You are a community research specialist focused on HackerNews. Given a topic, search HN ' +
+ 'for relevant discussions. Produce a structured summary with:\n' +
+ '- Top discussions (title, engagement, key takeaway)\n' +
+ '- Community sentiment\n' +
+ '- Real-world experiences shared by practitioners\n' +
+ 'Keep your summary under 600 words.',
+ printer: false,
+ })
+
+ const docsResearcher = new Agent({
+ model,
+ name: 'docs_researcher',
+ description: 'Fetches and analyzes documentation from framework and project websites. Returns a structured summary of how different frameworks approach the topic.',
+ tools: [fetchUrl],
+ systemPrompt:
+ 'You are a documentation research specialist. Given a topic, fetch documentation from ' +
+ '2-3 relevant framework websites to understand how different projects approach it. ' +
+ 'Produce a structured summary with:\n' +
+ '- How each framework handles the topic\n' +
+ '- API patterns and design choices\n' +
+ '- Commonalities and differences\n' +
+ 'Keep your summary under 600 words.',
+ printer: false,
+ })
+
+ return [githubResearcher, arxivResearcher, hnResearcher, docsResearcher]
+}
+
+// ── Common coordinator prompt ───────────────────────────────────────────────
+
+const coordinatorPrompt =
+ 'You are a senior technology analyst producing research briefings.\n\n' +
+ 'You have 4 researcher agents:\n' +
+ '- github_researcher: finds relevant open-source projects\n' +
+ '- arxiv_researcher: finds recent academic papers\n' +
+ '- hackernews_researcher: finds community discussions\n' +
+ '- docs_researcher: analyzes framework documentation\n\n' +
+ 'WORKFLOW:\n' +
+ '1. Dispatch ALL 4 researchers with appropriate queries for the topic.\n' +
+ '2. Synthesize results into a structured briefing.\n\n' +
+ 'BRIEFING FORMAT:\n' +
+ '## Executive Summary\n' +
+ '2-3 sentences.\n\n' +
+ '## Key Developments\n' +
+ 'Most important findings with source links.\n\n' +
+ '## Open Source Landscape\n' +
+ 'Major projects from GitHub.\n\n' +
+ '## Academic Research\n' +
+ 'Notable papers from ArXiv.\n\n' +
+ '## Community Sentiment\n' +
+ 'What practitioners say, from HN.\n\n' +
+ '## Emerging Patterns\n' +
+ 'Cross-cutting themes.\n\n' +
+ '## Open Problems & Gaps\n' +
+ 'What is underexplored.\n\n' +
+ '## Recommended Reading\n' +
+ 'Top 5-8 links.\n\n' +
+ 'Cite specific projects, papers, and discussions by name with links. Keep the briefing under 1500 words.'
+
+// ── Run ─────────────────────────────────────────────────────────────────────
+
+const args = process.argv.slice(2).filter((a, i, arr) => a !== '--model' && arr[i - 1] !== '--model')
+const topic = args[0] || 'background task scheduling in AI agent frameworks'
+
+console.log('\n' + '='.repeat(70))
+console.log(' RESEARCH BRIEFING GENERATOR — Standard vs Background')
+console.log('='.repeat(70))
+console.log(` Topic: ${topic}`)
+console.log(` Researchers: 4 (GitHub, ArXiv, HackerNews, Docs)`)
+console.log('='.repeat(70))
+
+// ── Standard run ────────────────────────────────────────────────────────────
+
+console.log('\n--- STANDARD MODE (sequential) ---\n')
+
+const standardCoordinator = new Agent({
+ model,
+ systemPrompt: coordinatorPrompt,
+ tools: createResearchers(),
+ printer: false,
+})
+
+const stdStart = Date.now()
+
+standardCoordinator.addHook(BeforeToolCallEvent, (e) => {
+ if (e.toolUse.name !== 'strands_structured_output')
+ console.log(` [+${((Date.now() - stdStart) / 1000).toFixed(1)}s] ${e.toolUse.name} started`)
+})
+standardCoordinator.addHook(AfterToolCallEvent, (e) => {
+ if (e.toolUse.name !== 'strands_structured_output')
+ console.log(` [+${((Date.now() - stdStart) / 1000).toFixed(1)}s] ${e.toolUse.name} finished`)
+})
+
+const standardResult = await standardCoordinator.invoke(
+ `Produce a comprehensive research briefing on: ${topic}`
+)
+const standardMs = Date.now() - stdStart
+
+console.log(`\n Standard: ${(standardMs / 1000).toFixed(1)}s | Output tokens: ${standardResult.metrics?.accumulatedUsage?.outputTokens ?? 'N/A'}`)
+
+// ── Background run ──────────────────────────────────────────────────────────
+
+console.log('\n--- BACKGROUND MODE (parallel dispatch) ---\n')
+
+const backgroundCoordinator = new Agent({
+ model,
+ systemPrompt: coordinatorPrompt,
+ backgroundTools: createResearchers(),
+ printer: false,
+})
+
+const bgStart = Date.now()
+
+backgroundCoordinator.addHook(BackgroundTaskDispatchEvent, (e) => {
+ console.log(` [+${((Date.now() - bgStart) / 1000).toFixed(1)}s] ${e.toolUse.name} dispatched (bg)`)
+})
+backgroundCoordinator.addHook(BackgroundTaskResultEvent, (e) => {
+ console.log(` [+${((Date.now() - bgStart) / 1000).toFixed(1)}s] ${e.taskName} result arrived (bg)`)
+})
+
+const backgroundResult = await backgroundCoordinator.invoke(
+ `Produce a comprehensive research briefing on: ${topic}`
+)
+const backgroundMs = Date.now() - bgStart
+
+console.log(`\n Background: ${(backgroundMs / 1000).toFixed(1)}s | Output tokens: ${backgroundResult.metrics?.accumulatedUsage?.outputTokens ?? 'N/A'}`)
+
+// ── Results ─────────────────────────────────────────────────────────────────
+
+const speedup = standardMs / backgroundMs
+
+const stdUsage = standardResult.metrics?.accumulatedUsage
+const bgUsage = backgroundResult.metrics?.accumulatedUsage
+
+console.log('\n--- STANDARD BRIEFING OUTPUT ---\n')
+console.log(standardResult.toString())
+console.log('\n--- BACKGROUND BRIEFING OUTPUT ---\n')
+console.log(backgroundResult.toString())
+
+const pad = (s: string, n: number) => s.padEnd(n)
+
+console.log('\n' + '='.repeat(70))
+console.log(' RESULTS')
+console.log('='.repeat(70))
+console.log()
+console.log(` ${pad('Metric', 24)} ${pad('Standard', 16)} Background`)
+console.log(` ${'-'.repeat(24)} ${'-'.repeat(16)} ${'-'.repeat(16)}`)
+console.log(` ${pad('Wall clock', 24)} ${pad((standardMs / 1000).toFixed(1) + 's', 16)} ${(backgroundMs / 1000).toFixed(1)}s`)
+console.log(` ${pad('Speedup', 24)} ${pad('baseline', 16)} ${speedup.toFixed(2)}x`)
+console.log(` ${pad('Input tokens', 24)} ${pad(String(stdUsage?.inputTokens ?? 'N/A'), 16)} ${bgUsage?.inputTokens ?? 'N/A'}`)
+console.log(` ${pad('Output tokens', 24)} ${pad(String(stdUsage?.outputTokens ?? 'N/A'), 16)} ${bgUsage?.outputTokens ?? 'N/A'}`)
+console.log(` ${pad('Total tokens', 24)} ${pad(String(stdUsage?.totalTokens ?? 'N/A'), 16)} ${bgUsage?.totalTokens ?? 'N/A'}`)
+console.log(` ${pad('Agent cycles', 24)} ${pad(String(standardResult.metrics?.cycleCount ?? 'N/A'), 16)} ${backgroundResult.metrics?.cycleCount ?? 'N/A'}`)
+console.log(` ${pad('Output length (chars)', 24)} ${pad(String(standardResult.toString().length), 16)} ${backgroundResult.toString().length}`)
+console.log()
+console.log('='.repeat(70))
diff --git a/designs/0009-modified-agent-loop.drawio b/designs/0009-modified-agent-loop.drawio
new file mode 100644
index 000000000..672e579e6
--- /dev/null
+++ b/designs/0009-modified-agent-loop.drawio
@@ -0,0 +1,186 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/designs/0009-modified-agent-loop.png b/designs/0009-modified-agent-loop.png
new file mode 100644
index 000000000..9e701495a
Binary files /dev/null and b/designs/0009-modified-agent-loop.png differ