Skip to content

Configurable custom jina url#10

Open
uhercik wants to merge 1 commit intodanny-avila:mainfrom
gratex:jina-custom-url
Open

Configurable custom jina url#10
uhercik wants to merge 1 commit intodanny-avila:mainfrom
gratex:jina-custom-url

Conversation

@uhercik
Copy link
Copy Markdown

@uhercik uhercik commented Aug 5, 2025

Made jina url configurable using JINA_URL env variable

@davidjrh
Copy link
Copy Markdown

This is quite useful to allow hosting the JINA reranker in Azure Foundry, since it is available through the Azure Marketplace. https://jina.ai/news/jina-embeddings-and-reranker-on-azure-scalable-business-ready-ai-solutions/

BTW, would be awesome to have the same for the COHERE_URL (I'm going to do another similar PR)

danny-avila added a commit that referenced this pull request Apr 12, 2026
…t, cache patterns

Resolves the comprehensive review on #87:

- Rename `HookMatcher.matcher` -> `HookMatcher.pattern` to fix the
  self-referential naming (`m.matcher` reads as "the matcher's matcher").
- Wire `PostToolUseHookOutput.updatedOutput` through `AggregatedHookResult`
  and `fold()`. The type was previously a promise the executor didn't
  fulfill — a hook returning `{ updatedOutput: ... }` was silently dropped.
- Correct the JSDoc on `updatedInput`/`updatedOutput`. The previous copy
  claimed non-determinism "in Promise.all resolution order" — but
  `Promise.all` preserves input order, so the fold is deterministic in
  registration order (outer loop over matchers, inner loop over each
  matcher's hooks). Updated the `executeHooks` function docstring to match.
- Add regex compilation cache and length bound to `matchers.ts`. Patterns
  compile at most once per unique string and are reused across calls;
  patterns over MAX_PATTERN_LENGTH (512) are rejected outright as a cheap
  ReDoS mitigation. Documented the trust model (patterns are trusted input;
  hosts must validate user-supplied patterns upstream).
- Document the wildcard-only semantics on query-less events: a non-empty
  pattern on an event that doesn't supply a matchQuery (RunStart, Stop,
  etc.) never matches.
- Document the `once: true` concurrent-dispatch limitation: two parallel
  `executeHooks` calls each snapshot the matcher list, so `once` means
  "once per call", not "once globally". Matches Claude Code's semantics.
  Added a test that pins this behavior.
- Merge `matchers.filter(...)` + task-build loop into a single pass per
  AGENTS.md "consolidate sequential O(n) operations."
- Scope `eslint-disable no-console` to the single `console.warn` call in
  `reportErrors` instead of disabling it file-wide.
- Fix the timeout-ignoring-hook test: clear the dangling `setTimeout` and
  assert the error surfaces an abort-shaped label.
- Add tests: multi-hook `preventContinuation` (first-writer-wins on
  stopReason, flag without reason), `updatedOutput` flow + registration
  order, registration-order determinism for `updatedInput`, pattern
  length bound, compilation cache hit/miss/clear, concurrent `once`
  dispatch. Total: 47 -> 59 tests.

Findings audited and resolved:
- #1 ReDoS: cache + length bound + trust model doc (full fix blocked on
  host-side validation)
- #2 Wrong updatedInput ordering JSDoc: fixed
- #3 Dead `updatedOutput` type: wired through
- #4 Concurrent `once` race: documented + test
- #5 `HookMatcher.matcher` naming: renamed to `pattern`
- #6 `WideCallback` dead code: rejected as inaccurate (used in `runHook`)
- #7 Eslint-disable scope: line-scoped
- #8 File-path comments: rejected; matches existing project convention
- #9 Two-pass filter: single pass
- #10 Regex recompilation: cached
- CL-4 Undocumented wildcard semantics: documented
- RT-3 Timeout test dangling timer: fixed + error content verified
- RT-4 No multi-hook preventContinuation test: added
danny-avila added a commit that referenced this pull request Apr 12, 2026
* feat: add inert hook registry and executor scaffolding (phase 1/1)

Introduces `src/hooks/` with types, registry, and executor for a 12-event
hook lifecycle. Purely additive — not exported from `src/index.ts` and not
yet wired into `Run`, `Graph`, or `ToolNode`. Integration lands in the
next PR so this change is shippable independently with zero behavioral
impact on hosts that don't opt in.

- Discriminated-union input/output per event; generic `HookCallback<E>`
  and `HookMatcher<E>` for type-safe registration.
- `HookRegistry` uses `Map<sessionId, bucket>` (not `Record`) to avoid
  O(n^2) churn under parallel registration in multi-tenant hosts.
- `executeHooks` fans out in parallel, races hooks against a combined
  parent + timeout `AbortSignal` so non-listening hooks still time out,
  folds decisions with `deny > ask > allow` precedence, accumulates
  `additionalContext`, and self-removes `once: true` matchers after any
  successful hook. Errors are non-fatal: swallowed into the aggregated
  result and routed through an optional winston logger (falling back to
  `console.warn`), with internal matcher errors suppressed.

47 tests under `src/hooks/__tests__/` cover registry scoping, regex
matching, precedence folding, once-self-removal (success/failure/mixed),
timeout enforcement (including non-listening hooks), error non-fatality,
and parent signal combination.

* fix(hooks): address review findings — rename field, wire updatedOutput, cache patterns

Resolves the comprehensive review on #87:

- Rename `HookMatcher.matcher` -> `HookMatcher.pattern` to fix the
  self-referential naming (`m.matcher` reads as "the matcher's matcher").
- Wire `PostToolUseHookOutput.updatedOutput` through `AggregatedHookResult`
  and `fold()`. The type was previously a promise the executor didn't
  fulfill — a hook returning `{ updatedOutput: ... }` was silently dropped.
- Correct the JSDoc on `updatedInput`/`updatedOutput`. The previous copy
  claimed non-determinism "in Promise.all resolution order" — but
  `Promise.all` preserves input order, so the fold is deterministic in
  registration order (outer loop over matchers, inner loop over each
  matcher's hooks). Updated the `executeHooks` function docstring to match.
- Add regex compilation cache and length bound to `matchers.ts`. Patterns
  compile at most once per unique string and are reused across calls;
  patterns over MAX_PATTERN_LENGTH (512) are rejected outright as a cheap
  ReDoS mitigation. Documented the trust model (patterns are trusted input;
  hosts must validate user-supplied patterns upstream).
- Document the wildcard-only semantics on query-less events: a non-empty
  pattern on an event that doesn't supply a matchQuery (RunStart, Stop,
  etc.) never matches.
- Document the `once: true` concurrent-dispatch limitation: two parallel
  `executeHooks` calls each snapshot the matcher list, so `once` means
  "once per call", not "once globally". Matches Claude Code's semantics.
  Added a test that pins this behavior.
- Merge `matchers.filter(...)` + task-build loop into a single pass per
  AGENTS.md "consolidate sequential O(n) operations."
- Scope `eslint-disable no-console` to the single `console.warn` call in
  `reportErrors` instead of disabling it file-wide.
- Fix the timeout-ignoring-hook test: clear the dangling `setTimeout` and
  assert the error surfaces an abort-shaped label.
- Add tests: multi-hook `preventContinuation` (first-writer-wins on
  stopReason, flag without reason), `updatedOutput` flow + registration
  order, registration-order determinism for `updatedInput`, pattern
  length bound, compilation cache hit/miss/clear, concurrent `once`
  dispatch. Total: 47 -> 59 tests.

Findings audited and resolved:
- #1 ReDoS: cache + length bound + trust model doc (full fix blocked on
  host-side validation)
- #2 Wrong updatedInput ordering JSDoc: fixed
- #3 Dead `updatedOutput` type: wired through
- #4 Concurrent `once` race: documented + test
- #5 `HookMatcher.matcher` naming: renamed to `pattern`
- #6 `WideCallback` dead code: rejected as inaccurate (used in `runHook`)
- #7 Eslint-disable scope: line-scoped
- #8 File-path comments: rejected; matches existing project convention
- #9 Two-pass filter: single pass
- #10 Regex recompilation: cached
- CL-4 Undocumented wildcard semantics: documented
- RT-3 Timeout test dangling timer: fixed + error content verified
- RT-4 No multi-hook preventContinuation test: added

* perf(hooks): atomic once, bounded LRU, ReDoS heuristic, shared signal

Rework the phase-1 hooks scaffolding for the multi-tenant scale we
actually operate at. These are behavior changes (semantics of `once`
tightened, new pattern-rejection path) but none of them are wired into
the runtime yet — this PR is still inert — so the blast radius is the
hooks directory only.

## Atomic `once: true`

Removal now happens synchronously inside `executeHooks`, between
`registry.getMatchers` and the first `await`. Node's event loop
serialises sync work, so two concurrent `executeHooks` calls can never
both observe the same `once` matcher — whichever call runs its sync
prefix first consumes it, and the loser sees an empty bucket.

Semantic change: `once` is now at-most-one-dispatch, not
at-most-one-successful-execution. If every hook in a `once` matcher
throws, the matcher is still gone. The old retry-on-failure behavior
was Claude Code's model — fine for a single-user CLI, but in a
multi-tenant server it opens a window where concurrent bursts can
re-dispatch a "once" hook non-idempotently. "At most once, ever" is
the correct semantic for a shared registry. Hosts that need retry
should register a normal matcher and self-unregister via the callback
returned from `registry.register`.

Dropped `collectOnceMatchersForRemoval` + its Map/array allocations —
the post-hoc collection step is no longer needed.

Tests: replaced "keeps the matcher when every hook throws" with
"removes the matcher even when every hook in it throws". Added
"fires exactly once across 3 concurrent calls" and "fires exactly once
across 8-way concurrent dispatch when hooks are slow" (the slow test
deliberately uses a 10ms hook to force overlap).

## Bounded LRU regex cache

`matchers.ts` was unbounded. Under multi-tenant use where different
tenants register different patterns, the cache would leak. Capped at
`MAX_CACHE_SIZE = 256` entries with LRU eviction: hits refresh position
(delete-then-set), misses at capacity evict the oldest key. Failed
compiles are cached the same way so a tenant spamming bad patterns
doesn't burn CPU on every call.

Exposed `getMatcherCacheSize` and `MAX_CACHE_SIZE` for test
observability. Tests verify eviction order (cold patterns recompiled
after overflow) and that hot patterns refreshed mid-stream survive
eviction pressure.

## ReDoS heuristic: nested-quantifier detector

`hasNestedQuantifier` walks a pattern linearly, tracks paren depth,
and rejects any pattern where a quantified group contains another
quantifier — the `(a+)+`, `(.*)*`, `(\w+)+` shape that is the #1
catastrophic-backtracking source. Character classes (`[*+?]`) and
escaped quantifiers (`\+`) are correctly ignored. Rejection happens
before `new RegExp`, so adversarial input never reaches the compiler.

This is a floor, not a ceiling — it doesn't catch ambiguous-alternation
ReDoS like `(a|a)+`. Hosts that accept user-supplied patterns must
still validate upstream. Documented in the `HookMatcher.pattern` and
`matchesQuery` JSDoc.

Test verifies that `(a+)+$` on a 35-char adversarial input resolves in
under 50ms (would be seconds without the heuristic).

## Shared `AbortSignal` per matcher

Was: one `AbortSignal.timeout(ms)` + one `AbortSignal.any([parent,
timeout])` per hook. For a matcher with N hooks and the same timeout,
that's N timers + N composite signals.

Now: one signal per matcher, shared across all its hooks. Since hooks
in a matcher already share `matcher.timeout`, they may as well share
the timer. Behavior is identical (all hooks fire and race the same
deadline) but allocation drops from N to 1 per matcher.

## Tests

72 passing (was 59, +13):
- Atomic once: 2 concurrent tests + 1 failure-removal test updated
- LRU: cache size cap, eviction order, hit refresh
- Nested quantifier: classic shapes, deep nesting, escaped chars,
  character classes, false positives on legitimate patterns
- ReDoS mitigation: rejection path, <50ms response on adversarial input

* fix(hooks): fix ReDoS heuristic false-positives on (?:...) groups

Resolves the follow-up review on #87. Four findings, all valid.

## #1 MAJOR — hasNestedQuantifier false-positives on group syntax

The previous scanner treated `?` as a quantifier at any depth > 0, but
`?` immediately after `(` is group syntax, not a quantifier:
`(?:...)`, `(?=...)`, `(?!...)`, `(?<name>...)`, `(?<=...)`, `(?<!...)`.
Patterns like `(?:pre_)?tool_name` were silently rejected and the
registering hook never fired — with no error surfaced to the caller.

Fix: explicit group-syntax prefix skip inside the `(` handler. When the
scanner enters a group it peeks for `?` and advances past the modifier
characters (`:`, `=`, `!`, `<=`, `<!`, or a named-group label `<name>`)
before processing the group body. The `?` is therefore never considered
a quantifier at the start of a group.

While in there I also fixed a second correctness bug the reviewer
didn't catch: the old depth-indexed array didn't propagate a child
group's "contains quantifier" signal up to its parent, so `(?:(a+))+`
(outer quantifier over a wrapped quantified group, equivalent to
`(a+)+`) escaped detection. The scanner now uses an explicit stack of
`QuantifierFrame` records, and when a child group closes it propagates
`hasBacktrackRisk` to the parent frame — whether the child itself was
quantified or not. This catches `(?:(a+))+` correctly while still
allowing non-risky wrappers like `(?:(ab))+`.

Added 9 tests covering non-capturing groups, lookaheads, lookbehinds,
named captures, `(?:(a+))+` risk propagation, and `((ab)+)+` deep
wrapping. Verified existing ReDoS-detection tests still pass.

## #2 MINOR — executeHooks.test.ts did not clear the pattern cache

Matcher cache is module-level. `matchers.test.ts` clears it in
`beforeEach`; `executeHooks.test.ts` did not, so patterns compiled
during one test persisted across subsequent tests in the same file.
Added `clearMatcherCache()` to `executeHooks`'s `beforeEach` — no
test failures today, but restores test independence.

## #3 MINOR — Test utilities leaked into production barrel

`clearMatcherCache` and `getMatcherCacheSize` are test-only helpers
(their JSDoc says so) but were exported from `src/hooks/index.ts`.
When the integration PR eventually re-exports `src/hooks` from
`src/index.ts` they would become public API.

Removed both from the barrel. Tests already import them directly from
`../matchers`, so no test changes needed. `hasNestedQuantifier`,
`MAX_PATTERN_LENGTH`, and `MAX_CACHE_SIZE` remain exported — hosts can
legitimately use them for upstream validation.

## #4 NIT — LRU refresh overhead at low cache pressure

Every `compile()` cache hit was doing `delete + set` to refresh LRU
position, even when the cache was nowhere near capacity. With
`MAX_CACHE_SIZE = 256` and typical working sets of tens of patterns,
eviction pressure is near-zero and the refresh is pure overhead.

Added a `LRU_REFRESH_THRESHOLD = 75%` gate: refresh only runs when the
cache is above 192 entries. Below that the code fast-paths straight
back from `compile()`. Above it the LRU semantics kick in so hot
patterns still survive evictions — the existing "refreshes LRU
position on hit" test still passes because by the time it runs the
check, the cache is at capacity.

Tests: 72 -> 81 (+9 for group-syntax and risk-propagation coverage).
danny-avila added a commit that referenced this pull request Apr 12, 2026
Resolves all findings from comprehensive review:

MAJOR fixes:
- #1: convertInjectedMessages now passes MessageContentComplex[] content
  through directly instead of JSON.stringify (preserves structured content)
- #2: Injected messages placed AFTER ToolMessages to respect provider
  message ordering contracts (AIMessage tool_calls -> ToolMessage adjacency).
  Added JSDoc documenting host formatter contract.
- #3: Removed duplicate Zod/JSON Schema. SkillToolSchema is now the single
  JSON Schema source of truth (used by both SkillToolDefinition and
  createSkillTool). Internal Zod schema derives descriptions from it.

MINOR fixes:
- #4: Budget calculation now accounts for (n-1) newline chars between
  entries. Added final length guard after proportional truncation.
- #5: Moved InjectedMessage type from skill.ts to tools.ts alongside
  ToolExecuteResult that references it.
- #6: Removed dead SkillExecutionMeta type (YAGNI).
- #7: Added 4 new tests: MessageContentComplex[] passthrough, multi-tool
  aggregation ordering, mixed mode (direct + event-driven), and
  MessageContentComplex type-check.

NIT fixes:
- #8: Removed formatNamesOnly duplicate; formatEntries handles empty
  descriptions natively.
- #9: Replaced em dash with period in SkillToolDescription.
- #10: Tests import events module at top level instead of fragile
  jest.spyOn(await import(...)) pattern.
danny-avila added a commit that referenced this pull request Apr 12, 2026
…ests

Resolves all 10 findings from the review on #90.

## Structural rewrite of dispatchToolEvents

The post-processing loop was rewritten to fix #1, #2, #6, #8, #10:

- #1 MAJOR: Denied calls now dispatch ON_RUN_STEP_COMPLETED via a
  shared `dispatchStepCompleted` helper (extracted from the duplicate
  logic in the executed-results loop). Without this, the frontend
  would show stuck spinners for denied tool calls.

- #2 MAJOR: Output messages are now collected in a Map keyed by
  tool_call_id, then returned in the original `toolCalls` input order
  via `toolCalls.map(call => messageByCallId.get(call.id))`. Before,
  denied messages were prepended to executed messages, breaking the
  implicit ordering contract.

- #6 MINOR: `toolUsageCount` is now incremented only for approved
  calls (moved from the initial `preToolCalls.map` into the
  `approvedEntries.map` that builds ToolCallRequests). Denied calls no
  longer inflate the turn counter.

- #8 MINOR: Eliminated the double `result.status === 'error'` check.
  ToolMessage is now constructed inside each status branch directly
  after hook dispatch.

- #10 NIT: `requests.find()` replaced with a pre-built
  `Map<id, ToolCallRequest>` for O(1) lookup per result, per AGENTS.md
  guidance on Map/Set over Array.find.

## Error protection

- #3 MAJOR: PreToolUse `Promise.all` now wraps each `executeHooks`
  call in `.catch(() => emptyResult)` so an infrastructure error in
  the hooks subsystem cannot crash the entire tool batch. Matches the
  stated "hook errors are non-fatal" contract and is consistent with
  PostToolUse/PostToolUseFailure which already had `.catch()`.

## New tests

- #4 MAJOR: Added `updatedOutput replaces the ToolMessage content` —
  registers a PostToolUse hook returning `{updatedOutput: 'REDACTED'}`,
  verifies the ToolMessage content in the graph's run messages is
  `'REDACTED'` (not the original output).

- #7 MINOR: PermissionDenied test replaced 50ms `setTimeout` with a
  `Promise` resolved by the hook callback. No more flaky sleep.

## Cleanup

- #9 NIT: Updated stale "once the tool-hook PR lands" comment in
  `src/hooks/index.ts`.

Tests: 103 -> 104 (+1 for updatedOutput).
danny-avila added a commit that referenced this pull request Apr 13, 2026
* feat: wire tool hooks into event-driven ToolNode (phase 1/3)

Fires PreToolUse, PostToolUse, PostToolUseFailure, and PermissionDenied
hooks inside `ToolNode.dispatchToolEvents` — the single chokepoint for
all event-driven tool execution in LibreChat.

## Hook lifecycle in dispatchToolEvents

1. **PreToolUse** fires per call in parallel before ON_TOOL_EXECUTE
   dispatch. Denied calls (deny or ask) produce error ToolMessages and
   fire PermissionDenied; surviving calls proceed with optional
   updatedInput applied to tool args.

2. Only approved calls are batched into ON_TOOL_EXECUTE and sent to the
   host. If all calls are denied, the batch dispatch is skipped entirely.

3. **PostToolUse** fires per successful result. If a hook returns
   updatedOutput, the ToolMessage content is replaced before the message
   is appended to the graph.

4. **PostToolUseFailure** fires per error result. Observational only —
   cannot modify the error message.

Post hooks are awaited (not fire-and-forget) so updatedOutput takes
effect before the ToolMessage is consumed. Hook errors are swallowed
via .catch() to avoid disrupting the tool result flow.
PermissionDenied is fire-and-forget since it's purely observational.

## Plumbing

- `ToolNodeOptions.hookRegistry?: HookRegistry` — new optional field
- `ToolNode` constructor stores `this.hookRegistry`
- `Graph.createToolNode()` passes `this.hookRegistry` at both
  construction sites (event-driven and traditional)
- `hookRegistry` is set on Graph BEFORE `createWorkflow()` is called
  (moved from the post-create assignment in Run constructor into
  `createLegacyGraph` and `createMultiAgentGraph`) so ToolNode receives
  a non-undefined registry at construction time

## hasHookFor guards

All four hook fire points use `hasHookFor(event, runId)` before calling
`executeHooks`, so when no hooks are registered for the event the
overhead is a single Map lookup + array length check — no Promise
allocations, no function calls.

## Tests

9 new integration tests in `src/hooks/__tests__/toolHooks.test.ts`
using event-driven mode (toolDefinitions + ON_TOOL_EXECUTE handler):

- PreToolUse fires with correct fields
- PreToolUse deny blocks execution (tool handler never called)
- PreToolUse ask blocks execution (v1)
- PreToolUse updatedInput rewrites args before dispatch to host
- PreToolUse hook errors are non-fatal (tool still executes)
- PermissionDenied fires after deny with reason
- PostToolUse fires with tool output
- PostToolUseFailure fires on error result
- No-hooks baseline works identically

Total: 94 existing + 9 new = 103 passing.

* fix(hooks): address PR 3 review — step completion, ordering, catch, tests

Resolves all 10 findings from the review on #90.

## Structural rewrite of dispatchToolEvents

The post-processing loop was rewritten to fix #1, #2, #6, #8, #10:

- #1 MAJOR: Denied calls now dispatch ON_RUN_STEP_COMPLETED via a
  shared `dispatchStepCompleted` helper (extracted from the duplicate
  logic in the executed-results loop). Without this, the frontend
  would show stuck spinners for denied tool calls.

- #2 MAJOR: Output messages are now collected in a Map keyed by
  tool_call_id, then returned in the original `toolCalls` input order
  via `toolCalls.map(call => messageByCallId.get(call.id))`. Before,
  denied messages were prepended to executed messages, breaking the
  implicit ordering contract.

- #6 MINOR: `toolUsageCount` is now incremented only for approved
  calls (moved from the initial `preToolCalls.map` into the
  `approvedEntries.map` that builds ToolCallRequests). Denied calls no
  longer inflate the turn counter.

- #8 MINOR: Eliminated the double `result.status === 'error'` check.
  ToolMessage is now constructed inside each status branch directly
  after hook dispatch.

- #10 NIT: `requests.find()` replaced with a pre-built
  `Map<id, ToolCallRequest>` for O(1) lookup per result, per AGENTS.md
  guidance on Map/Set over Array.find.

## Error protection

- #3 MAJOR: PreToolUse `Promise.all` now wraps each `executeHooks`
  call in `.catch(() => emptyResult)` so an infrastructure error in
  the hooks subsystem cannot crash the entire tool batch. Matches the
  stated "hook errors are non-fatal" contract and is consistent with
  PostToolUse/PostToolUseFailure which already had `.catch()`.

## New tests

- #4 MAJOR: Added `updatedOutput replaces the ToolMessage content` —
  registers a PostToolUse hook returning `{updatedOutput: 'REDACTED'}`,
  verifies the ToolMessage content in the graph's run messages is
  `'REDACTED'` (not the original output).

- #7 MINOR: PermissionDenied test replaced 50ms `setTimeout` with a
  `Promise` resolved by the hook callback. No more flaky sleep.

## Cleanup

- #9 NIT: Updated stale "once the tool-hook PR lands" comment in
  `src/hooks/index.ts`.

Tests: 103 -> 104 (+1 for updatedOutput).

* fix(hooks): off-by-one in dispatchStepCompleted index, test guard, turn fallback

- #1 MAJOR: dispatchStepCompleted read post-increment toolUsageCount
  for the `index` field. Added `turn?: number` parameter; approved-call
  site passes `request?.turn` (the pre-increment value), denied-call
  site omits it (fallback is correct since count was never bumped).
- #2 NIT: updatedOutput test now asserts `expect(toolMsg).toBeDefined()`
  before content extraction so a missing tool message produces a clear
  failure rather than "Expected undefined to be 'REDACTED'".
- #3 NIT: PreToolUse input `turn` field restored `?? 0` fallback so
  first-use tools see `turn: 0` instead of `turn: undefined`.

* fix(hooks): batch tests, turn JSDoc, freeze sentinel, remove unused hookRegistry

Addresses follow-up review findings on #90.

- #1 MAJOR: Add multi-call batch tests — partial deny (echo denied,
  math approved, order preserved) and all-denied (ON_TOOL_EXECUTE
  handler never called). Uses counter-based IDs (#7 fix) to avoid
  Date.now() collisions across calls in the same test.
- #2 MINOR: Document PreToolUseHookInput.turn semantics — within a
  single batch, all calls to the same tool share the same turn value.
  Per-call discrimination within a batch is not supported in v1.
- #3 MINOR: Remove hookRegistry from the traditional (non-event-driven)
  ToolNode construction site in Graph.ts. Hooks only fire in
  dispatchToolEvents; passing the registry to a ToolNode that never
  uses it created a false impression of hook support.
- #4 MINOR: Add PostToolUse error non-fatality test — throwing hook
  does not crash the batch, original content is preserved.
- #5+#8 MINOR: Freeze the hook-error fallback sentinel via
  Object.freeze and rename from emptyResult to HOOK_FALLBACK. Prevents
  future mutation bugs if someone pushes to the shared arrays.
- #6 MINOR: Document in ToolNodeOptions.hookRegistry JSDoc that hooks
  only fire for event-driven calls — directToolNames bypass dispatch.

Tests: 104 -> 107 (+3 for batch partial-deny, all-denied, and
PostToolUse error resilience).

* nit: simplify test assertions, keep shallow freeze (deep conflicts with mutable type)
danny-avila added a commit that referenced this pull request Apr 13, 2026
* feat: add inert hook registry and executor scaffolding (phase 1/1)

Introduces `src/hooks/` with types, registry, and executor for a 12-event
hook lifecycle. Purely additive — not exported from `src/index.ts` and not
yet wired into `Run`, `Graph`, or `ToolNode`. Integration lands in the
next PR so this change is shippable independently with zero behavioral
impact on hosts that don't opt in.

- Discriminated-union input/output per event; generic `HookCallback<E>`
  and `HookMatcher<E>` for type-safe registration.
- `HookRegistry` uses `Map<sessionId, bucket>` (not `Record`) to avoid
  O(n^2) churn under parallel registration in multi-tenant hosts.
- `executeHooks` fans out in parallel, races hooks against a combined
  parent + timeout `AbortSignal` so non-listening hooks still time out,
  folds decisions with `deny > ask > allow` precedence, accumulates
  `additionalContext`, and self-removes `once: true` matchers after any
  successful hook. Errors are non-fatal: swallowed into the aggregated
  result and routed through an optional winston logger (falling back to
  `console.warn`), with internal matcher errors suppressed.

47 tests under `src/hooks/__tests__/` cover registry scoping, regex
matching, precedence folding, once-self-removal (success/failure/mixed),
timeout enforcement (including non-listening hooks), error non-fatality,
and parent signal combination.

* fix(hooks): address review findings — rename field, wire updatedOutput, cache patterns

Resolves the comprehensive review on #87:

- Rename `HookMatcher.matcher` -> `HookMatcher.pattern` to fix the
  self-referential naming (`m.matcher` reads as "the matcher's matcher").
- Wire `PostToolUseHookOutput.updatedOutput` through `AggregatedHookResult`
  and `fold()`. The type was previously a promise the executor didn't
  fulfill — a hook returning `{ updatedOutput: ... }` was silently dropped.
- Correct the JSDoc on `updatedInput`/`updatedOutput`. The previous copy
  claimed non-determinism "in Promise.all resolution order" — but
  `Promise.all` preserves input order, so the fold is deterministic in
  registration order (outer loop over matchers, inner loop over each
  matcher's hooks). Updated the `executeHooks` function docstring to match.
- Add regex compilation cache and length bound to `matchers.ts`. Patterns
  compile at most once per unique string and are reused across calls;
  patterns over MAX_PATTERN_LENGTH (512) are rejected outright as a cheap
  ReDoS mitigation. Documented the trust model (patterns are trusted input;
  hosts must validate user-supplied patterns upstream).
- Document the wildcard-only semantics on query-less events: a non-empty
  pattern on an event that doesn't supply a matchQuery (RunStart, Stop,
  etc.) never matches.
- Document the `once: true` concurrent-dispatch limitation: two parallel
  `executeHooks` calls each snapshot the matcher list, so `once` means
  "once per call", not "once globally". Matches Claude Code's semantics.
  Added a test that pins this behavior.
- Merge `matchers.filter(...)` + task-build loop into a single pass per
  AGENTS.md "consolidate sequential O(n) operations."
- Scope `eslint-disable no-console` to the single `console.warn` call in
  `reportErrors` instead of disabling it file-wide.
- Fix the timeout-ignoring-hook test: clear the dangling `setTimeout` and
  assert the error surfaces an abort-shaped label.
- Add tests: multi-hook `preventContinuation` (first-writer-wins on
  stopReason, flag without reason), `updatedOutput` flow + registration
  order, registration-order determinism for `updatedInput`, pattern
  length bound, compilation cache hit/miss/clear, concurrent `once`
  dispatch. Total: 47 -> 59 tests.

Findings audited and resolved:
- #1 ReDoS: cache + length bound + trust model doc (full fix blocked on
  host-side validation)
- #2 Wrong updatedInput ordering JSDoc: fixed
- #3 Dead `updatedOutput` type: wired through
- #4 Concurrent `once` race: documented + test
- #5 `HookMatcher.matcher` naming: renamed to `pattern`
- #6 `WideCallback` dead code: rejected as inaccurate (used in `runHook`)
- #7 Eslint-disable scope: line-scoped
- #8 File-path comments: rejected; matches existing project convention
- #9 Two-pass filter: single pass
- #10 Regex recompilation: cached
- CL-4 Undocumented wildcard semantics: documented
- RT-3 Timeout test dangling timer: fixed + error content verified
- RT-4 No multi-hook preventContinuation test: added

* perf(hooks): atomic once, bounded LRU, ReDoS heuristic, shared signal

Rework the phase-1 hooks scaffolding for the multi-tenant scale we
actually operate at. These are behavior changes (semantics of `once`
tightened, new pattern-rejection path) but none of them are wired into
the runtime yet — this PR is still inert — so the blast radius is the
hooks directory only.

## Atomic `once: true`

Removal now happens synchronously inside `executeHooks`, between
`registry.getMatchers` and the first `await`. Node's event loop
serialises sync work, so two concurrent `executeHooks` calls can never
both observe the same `once` matcher — whichever call runs its sync
prefix first consumes it, and the loser sees an empty bucket.

Semantic change: `once` is now at-most-one-dispatch, not
at-most-one-successful-execution. If every hook in a `once` matcher
throws, the matcher is still gone. The old retry-on-failure behavior
was Claude Code's model — fine for a single-user CLI, but in a
multi-tenant server it opens a window where concurrent bursts can
re-dispatch a "once" hook non-idempotently. "At most once, ever" is
the correct semantic for a shared registry. Hosts that need retry
should register a normal matcher and self-unregister via the callback
returned from `registry.register`.

Dropped `collectOnceMatchersForRemoval` + its Map/array allocations —
the post-hoc collection step is no longer needed.

Tests: replaced "keeps the matcher when every hook throws" with
"removes the matcher even when every hook in it throws". Added
"fires exactly once across 3 concurrent calls" and "fires exactly once
across 8-way concurrent dispatch when hooks are slow" (the slow test
deliberately uses a 10ms hook to force overlap).

## Bounded LRU regex cache

`matchers.ts` was unbounded. Under multi-tenant use where different
tenants register different patterns, the cache would leak. Capped at
`MAX_CACHE_SIZE = 256` entries with LRU eviction: hits refresh position
(delete-then-set), misses at capacity evict the oldest key. Failed
compiles are cached the same way so a tenant spamming bad patterns
doesn't burn CPU on every call.

Exposed `getMatcherCacheSize` and `MAX_CACHE_SIZE` for test
observability. Tests verify eviction order (cold patterns recompiled
after overflow) and that hot patterns refreshed mid-stream survive
eviction pressure.

## ReDoS heuristic: nested-quantifier detector

`hasNestedQuantifier` walks a pattern linearly, tracks paren depth,
and rejects any pattern where a quantified group contains another
quantifier — the `(a+)+`, `(.*)*`, `(\w+)+` shape that is the #1
catastrophic-backtracking source. Character classes (`[*+?]`) and
escaped quantifiers (`\+`) are correctly ignored. Rejection happens
before `new RegExp`, so adversarial input never reaches the compiler.

This is a floor, not a ceiling — it doesn't catch ambiguous-alternation
ReDoS like `(a|a)+`. Hosts that accept user-supplied patterns must
still validate upstream. Documented in the `HookMatcher.pattern` and
`matchesQuery` JSDoc.

Test verifies that `(a+)+$` on a 35-char adversarial input resolves in
under 50ms (would be seconds without the heuristic).

## Shared `AbortSignal` per matcher

Was: one `AbortSignal.timeout(ms)` + one `AbortSignal.any([parent,
timeout])` per hook. For a matcher with N hooks and the same timeout,
that's N timers + N composite signals.

Now: one signal per matcher, shared across all its hooks. Since hooks
in a matcher already share `matcher.timeout`, they may as well share
the timer. Behavior is identical (all hooks fire and race the same
deadline) but allocation drops from N to 1 per matcher.

## Tests

72 passing (was 59, +13):
- Atomic once: 2 concurrent tests + 1 failure-removal test updated
- LRU: cache size cap, eviction order, hit refresh
- Nested quantifier: classic shapes, deep nesting, escaped chars,
  character classes, false positives on legitimate patterns
- ReDoS mitigation: rejection path, <50ms response on adversarial input

* fix(hooks): fix ReDoS heuristic false-positives on (?:...) groups

Resolves the follow-up review on #87. Four findings, all valid.

## #1 MAJOR — hasNestedQuantifier false-positives on group syntax

The previous scanner treated `?` as a quantifier at any depth > 0, but
`?` immediately after `(` is group syntax, not a quantifier:
`(?:...)`, `(?=...)`, `(?!...)`, `(?<name>...)`, `(?<=...)`, `(?<!...)`.
Patterns like `(?:pre_)?tool_name` were silently rejected and the
registering hook never fired — with no error surfaced to the caller.

Fix: explicit group-syntax prefix skip inside the `(` handler. When the
scanner enters a group it peeks for `?` and advances past the modifier
characters (`:`, `=`, `!`, `<=`, `<!`, or a named-group label `<name>`)
before processing the group body. The `?` is therefore never considered
a quantifier at the start of a group.

While in there I also fixed a second correctness bug the reviewer
didn't catch: the old depth-indexed array didn't propagate a child
group's "contains quantifier" signal up to its parent, so `(?:(a+))+`
(outer quantifier over a wrapped quantified group, equivalent to
`(a+)+`) escaped detection. The scanner now uses an explicit stack of
`QuantifierFrame` records, and when a child group closes it propagates
`hasBacktrackRisk` to the parent frame — whether the child itself was
quantified or not. This catches `(?:(a+))+` correctly while still
allowing non-risky wrappers like `(?:(ab))+`.

Added 9 tests covering non-capturing groups, lookaheads, lookbehinds,
named captures, `(?:(a+))+` risk propagation, and `((ab)+)+` deep
wrapping. Verified existing ReDoS-detection tests still pass.

## #2 MINOR — executeHooks.test.ts did not clear the pattern cache

Matcher cache is module-level. `matchers.test.ts` clears it in
`beforeEach`; `executeHooks.test.ts` did not, so patterns compiled
during one test persisted across subsequent tests in the same file.
Added `clearMatcherCache()` to `executeHooks`'s `beforeEach` — no
test failures today, but restores test independence.

## #3 MINOR — Test utilities leaked into production barrel

`clearMatcherCache` and `getMatcherCacheSize` are test-only helpers
(their JSDoc says so) but were exported from `src/hooks/index.ts`.
When the integration PR eventually re-exports `src/hooks` from
`src/index.ts` they would become public API.

Removed both from the barrel. Tests already import them directly from
`../matchers`, so no test changes needed. `hasNestedQuantifier`,
`MAX_PATTERN_LENGTH`, and `MAX_CACHE_SIZE` remain exported — hosts can
legitimately use them for upstream validation.

## #4 NIT — LRU refresh overhead at low cache pressure

Every `compile()` cache hit was doing `delete + set` to refresh LRU
position, even when the cache was nowhere near capacity. With
`MAX_CACHE_SIZE = 256` and typical working sets of tens of patterns,
eviction pressure is near-zero and the refresh is pure overhead.

Added a `LRU_REFRESH_THRESHOLD = 75%` gate: refresh only runs when the
cache is above 192 entries. Below that the code fast-paths straight
back from `compile()`. Above it the LRU semantics kick in so hot
patterns still survive evictions — the existing "refreshes LRU
position on hit" test still passes because by the time it runs the
check, the cache is at capacity.

Tests: 72 -> 81 (+9 for group-syntax and risk-propagation coverage).
danny-avila added a commit that referenced this pull request Apr 13, 2026
* feat: wire tool hooks into event-driven ToolNode (phase 1/3)

Fires PreToolUse, PostToolUse, PostToolUseFailure, and PermissionDenied
hooks inside `ToolNode.dispatchToolEvents` — the single chokepoint for
all event-driven tool execution in LibreChat.

## Hook lifecycle in dispatchToolEvents

1. **PreToolUse** fires per call in parallel before ON_TOOL_EXECUTE
   dispatch. Denied calls (deny or ask) produce error ToolMessages and
   fire PermissionDenied; surviving calls proceed with optional
   updatedInput applied to tool args.

2. Only approved calls are batched into ON_TOOL_EXECUTE and sent to the
   host. If all calls are denied, the batch dispatch is skipped entirely.

3. **PostToolUse** fires per successful result. If a hook returns
   updatedOutput, the ToolMessage content is replaced before the message
   is appended to the graph.

4. **PostToolUseFailure** fires per error result. Observational only —
   cannot modify the error message.

Post hooks are awaited (not fire-and-forget) so updatedOutput takes
effect before the ToolMessage is consumed. Hook errors are swallowed
via .catch() to avoid disrupting the tool result flow.
PermissionDenied is fire-and-forget since it's purely observational.

## Plumbing

- `ToolNodeOptions.hookRegistry?: HookRegistry` — new optional field
- `ToolNode` constructor stores `this.hookRegistry`
- `Graph.createToolNode()` passes `this.hookRegistry` at both
  construction sites (event-driven and traditional)
- `hookRegistry` is set on Graph BEFORE `createWorkflow()` is called
  (moved from the post-create assignment in Run constructor into
  `createLegacyGraph` and `createMultiAgentGraph`) so ToolNode receives
  a non-undefined registry at construction time

## hasHookFor guards

All four hook fire points use `hasHookFor(event, runId)` before calling
`executeHooks`, so when no hooks are registered for the event the
overhead is a single Map lookup + array length check — no Promise
allocations, no function calls.

## Tests

9 new integration tests in `src/hooks/__tests__/toolHooks.test.ts`
using event-driven mode (toolDefinitions + ON_TOOL_EXECUTE handler):

- PreToolUse fires with correct fields
- PreToolUse deny blocks execution (tool handler never called)
- PreToolUse ask blocks execution (v1)
- PreToolUse updatedInput rewrites args before dispatch to host
- PreToolUse hook errors are non-fatal (tool still executes)
- PermissionDenied fires after deny with reason
- PostToolUse fires with tool output
- PostToolUseFailure fires on error result
- No-hooks baseline works identically

Total: 94 existing + 9 new = 103 passing.

* fix(hooks): address PR 3 review — step completion, ordering, catch, tests

Resolves all 10 findings from the review on #90.

## Structural rewrite of dispatchToolEvents

The post-processing loop was rewritten to fix #1, #2, #6, #8, #10:

- #1 MAJOR: Denied calls now dispatch ON_RUN_STEP_COMPLETED via a
  shared `dispatchStepCompleted` helper (extracted from the duplicate
  logic in the executed-results loop). Without this, the frontend
  would show stuck spinners for denied tool calls.

- #2 MAJOR: Output messages are now collected in a Map keyed by
  tool_call_id, then returned in the original `toolCalls` input order
  via `toolCalls.map(call => messageByCallId.get(call.id))`. Before,
  denied messages were prepended to executed messages, breaking the
  implicit ordering contract.

- #6 MINOR: `toolUsageCount` is now incremented only for approved
  calls (moved from the initial `preToolCalls.map` into the
  `approvedEntries.map` that builds ToolCallRequests). Denied calls no
  longer inflate the turn counter.

- #8 MINOR: Eliminated the double `result.status === 'error'` check.
  ToolMessage is now constructed inside each status branch directly
  after hook dispatch.

- #10 NIT: `requests.find()` replaced with a pre-built
  `Map<id, ToolCallRequest>` for O(1) lookup per result, per AGENTS.md
  guidance on Map/Set over Array.find.

## Error protection

- #3 MAJOR: PreToolUse `Promise.all` now wraps each `executeHooks`
  call in `.catch(() => emptyResult)` so an infrastructure error in
  the hooks subsystem cannot crash the entire tool batch. Matches the
  stated "hook errors are non-fatal" contract and is consistent with
  PostToolUse/PostToolUseFailure which already had `.catch()`.

## New tests

- #4 MAJOR: Added `updatedOutput replaces the ToolMessage content` —
  registers a PostToolUse hook returning `{updatedOutput: 'REDACTED'}`,
  verifies the ToolMessage content in the graph's run messages is
  `'REDACTED'` (not the original output).

- #7 MINOR: PermissionDenied test replaced 50ms `setTimeout` with a
  `Promise` resolved by the hook callback. No more flaky sleep.

## Cleanup

- #9 NIT: Updated stale "once the tool-hook PR lands" comment in
  `src/hooks/index.ts`.

Tests: 103 -> 104 (+1 for updatedOutput).

* fix(hooks): off-by-one in dispatchStepCompleted index, test guard, turn fallback

- #1 MAJOR: dispatchStepCompleted read post-increment toolUsageCount
  for the `index` field. Added `turn?: number` parameter; approved-call
  site passes `request?.turn` (the pre-increment value), denied-call
  site omits it (fallback is correct since count was never bumped).
- #2 NIT: updatedOutput test now asserts `expect(toolMsg).toBeDefined()`
  before content extraction so a missing tool message produces a clear
  failure rather than "Expected undefined to be 'REDACTED'".
- #3 NIT: PreToolUse input `turn` field restored `?? 0` fallback so
  first-use tools see `turn: 0` instead of `turn: undefined`.

* fix(hooks): batch tests, turn JSDoc, freeze sentinel, remove unused hookRegistry

Addresses follow-up review findings on #90.

- #1 MAJOR: Add multi-call batch tests — partial deny (echo denied,
  math approved, order preserved) and all-denied (ON_TOOL_EXECUTE
  handler never called). Uses counter-based IDs (#7 fix) to avoid
  Date.now() collisions across calls in the same test.
- #2 MINOR: Document PreToolUseHookInput.turn semantics — within a
  single batch, all calls to the same tool share the same turn value.
  Per-call discrimination within a batch is not supported in v1.
- #3 MINOR: Remove hookRegistry from the traditional (non-event-driven)
  ToolNode construction site in Graph.ts. Hooks only fire in
  dispatchToolEvents; passing the registry to a ToolNode that never
  uses it created a false impression of hook support.
- #4 MINOR: Add PostToolUse error non-fatality test — throwing hook
  does not crash the batch, original content is preserved.
- #5+#8 MINOR: Freeze the hook-error fallback sentinel via
  Object.freeze and rename from emptyResult to HOOK_FALLBACK. Prevents
  future mutation bugs if someone pushes to the shared arrays.
- #6 MINOR: Document in ToolNodeOptions.hookRegistry JSDoc that hooks
  only fire for event-driven calls — directToolNames bypass dispatch.

Tests: 104 -> 107 (+3 for batch partial-deny, all-denied, and
PostToolUse error resilience).

* nit: simplify test assertions, keep shallow freeze (deep conflicts with mutable type)
danny-avila added a commit that referenced this pull request Apr 13, 2026
* feat: add inert hook registry and executor scaffolding (phase 1/1)

Introduces `src/hooks/` with types, registry, and executor for a 12-event
hook lifecycle. Purely additive — not exported from `src/index.ts` and not
yet wired into `Run`, `Graph`, or `ToolNode`. Integration lands in the
next PR so this change is shippable independently with zero behavioral
impact on hosts that don't opt in.

- Discriminated-union input/output per event; generic `HookCallback<E>`
  and `HookMatcher<E>` for type-safe registration.
- `HookRegistry` uses `Map<sessionId, bucket>` (not `Record`) to avoid
  O(n^2) churn under parallel registration in multi-tenant hosts.
- `executeHooks` fans out in parallel, races hooks against a combined
  parent + timeout `AbortSignal` so non-listening hooks still time out,
  folds decisions with `deny > ask > allow` precedence, accumulates
  `additionalContext`, and self-removes `once: true` matchers after any
  successful hook. Errors are non-fatal: swallowed into the aggregated
  result and routed through an optional winston logger (falling back to
  `console.warn`), with internal matcher errors suppressed.

47 tests under `src/hooks/__tests__/` cover registry scoping, regex
matching, precedence folding, once-self-removal (success/failure/mixed),
timeout enforcement (including non-listening hooks), error non-fatality,
and parent signal combination.

* fix(hooks): address review findings — rename field, wire updatedOutput, cache patterns

Resolves the comprehensive review on #87:

- Rename `HookMatcher.matcher` -> `HookMatcher.pattern` to fix the
  self-referential naming (`m.matcher` reads as "the matcher's matcher").
- Wire `PostToolUseHookOutput.updatedOutput` through `AggregatedHookResult`
  and `fold()`. The type was previously a promise the executor didn't
  fulfill — a hook returning `{ updatedOutput: ... }` was silently dropped.
- Correct the JSDoc on `updatedInput`/`updatedOutput`. The previous copy
  claimed non-determinism "in Promise.all resolution order" — but
  `Promise.all` preserves input order, so the fold is deterministic in
  registration order (outer loop over matchers, inner loop over each
  matcher's hooks). Updated the `executeHooks` function docstring to match.
- Add regex compilation cache and length bound to `matchers.ts`. Patterns
  compile at most once per unique string and are reused across calls;
  patterns over MAX_PATTERN_LENGTH (512) are rejected outright as a cheap
  ReDoS mitigation. Documented the trust model (patterns are trusted input;
  hosts must validate user-supplied patterns upstream).
- Document the wildcard-only semantics on query-less events: a non-empty
  pattern on an event that doesn't supply a matchQuery (RunStart, Stop,
  etc.) never matches.
- Document the `once: true` concurrent-dispatch limitation: two parallel
  `executeHooks` calls each snapshot the matcher list, so `once` means
  "once per call", not "once globally". Matches Claude Code's semantics.
  Added a test that pins this behavior.
- Merge `matchers.filter(...)` + task-build loop into a single pass per
  AGENTS.md "consolidate sequential O(n) operations."
- Scope `eslint-disable no-console` to the single `console.warn` call in
  `reportErrors` instead of disabling it file-wide.
- Fix the timeout-ignoring-hook test: clear the dangling `setTimeout` and
  assert the error surfaces an abort-shaped label.
- Add tests: multi-hook `preventContinuation` (first-writer-wins on
  stopReason, flag without reason), `updatedOutput` flow + registration
  order, registration-order determinism for `updatedInput`, pattern
  length bound, compilation cache hit/miss/clear, concurrent `once`
  dispatch. Total: 47 -> 59 tests.

Findings audited and resolved:
- #1 ReDoS: cache + length bound + trust model doc (full fix blocked on
  host-side validation)
- #2 Wrong updatedInput ordering JSDoc: fixed
- #3 Dead `updatedOutput` type: wired through
- #4 Concurrent `once` race: documented + test
- #5 `HookMatcher.matcher` naming: renamed to `pattern`
- #6 `WideCallback` dead code: rejected as inaccurate (used in `runHook`)
- #7 Eslint-disable scope: line-scoped
- #8 File-path comments: rejected; matches existing project convention
- #9 Two-pass filter: single pass
- #10 Regex recompilation: cached
- CL-4 Undocumented wildcard semantics: documented
- RT-3 Timeout test dangling timer: fixed + error content verified
- RT-4 No multi-hook preventContinuation test: added

* perf(hooks): atomic once, bounded LRU, ReDoS heuristic, shared signal

Rework the phase-1 hooks scaffolding for the multi-tenant scale we
actually operate at. These are behavior changes (semantics of `once`
tightened, new pattern-rejection path) but none of them are wired into
the runtime yet — this PR is still inert — so the blast radius is the
hooks directory only.

## Atomic `once: true`

Removal now happens synchronously inside `executeHooks`, between
`registry.getMatchers` and the first `await`. Node's event loop
serialises sync work, so two concurrent `executeHooks` calls can never
both observe the same `once` matcher — whichever call runs its sync
prefix first consumes it, and the loser sees an empty bucket.

Semantic change: `once` is now at-most-one-dispatch, not
at-most-one-successful-execution. If every hook in a `once` matcher
throws, the matcher is still gone. The old retry-on-failure behavior
was Claude Code's model — fine for a single-user CLI, but in a
multi-tenant server it opens a window where concurrent bursts can
re-dispatch a "once" hook non-idempotently. "At most once, ever" is
the correct semantic for a shared registry. Hosts that need retry
should register a normal matcher and self-unregister via the callback
returned from `registry.register`.

Dropped `collectOnceMatchersForRemoval` + its Map/array allocations —
the post-hoc collection step is no longer needed.

Tests: replaced "keeps the matcher when every hook throws" with
"removes the matcher even when every hook in it throws". Added
"fires exactly once across 3 concurrent calls" and "fires exactly once
across 8-way concurrent dispatch when hooks are slow" (the slow test
deliberately uses a 10ms hook to force overlap).

## Bounded LRU regex cache

`matchers.ts` was unbounded. Under multi-tenant use where different
tenants register different patterns, the cache would leak. Capped at
`MAX_CACHE_SIZE = 256` entries with LRU eviction: hits refresh position
(delete-then-set), misses at capacity evict the oldest key. Failed
compiles are cached the same way so a tenant spamming bad patterns
doesn't burn CPU on every call.

Exposed `getMatcherCacheSize` and `MAX_CACHE_SIZE` for test
observability. Tests verify eviction order (cold patterns recompiled
after overflow) and that hot patterns refreshed mid-stream survive
eviction pressure.

## ReDoS heuristic: nested-quantifier detector

`hasNestedQuantifier` walks a pattern linearly, tracks paren depth,
and rejects any pattern where a quantified group contains another
quantifier — the `(a+)+`, `(.*)*`, `(\w+)+` shape that is the #1
catastrophic-backtracking source. Character classes (`[*+?]`) and
escaped quantifiers (`\+`) are correctly ignored. Rejection happens
before `new RegExp`, so adversarial input never reaches the compiler.

This is a floor, not a ceiling — it doesn't catch ambiguous-alternation
ReDoS like `(a|a)+`. Hosts that accept user-supplied patterns must
still validate upstream. Documented in the `HookMatcher.pattern` and
`matchesQuery` JSDoc.

Test verifies that `(a+)+$` on a 35-char adversarial input resolves in
under 50ms (would be seconds without the heuristic).

## Shared `AbortSignal` per matcher

Was: one `AbortSignal.timeout(ms)` + one `AbortSignal.any([parent,
timeout])` per hook. For a matcher with N hooks and the same timeout,
that's N timers + N composite signals.

Now: one signal per matcher, shared across all its hooks. Since hooks
in a matcher already share `matcher.timeout`, they may as well share
the timer. Behavior is identical (all hooks fire and race the same
deadline) but allocation drops from N to 1 per matcher.

## Tests

72 passing (was 59, +13):
- Atomic once: 2 concurrent tests + 1 failure-removal test updated
- LRU: cache size cap, eviction order, hit refresh
- Nested quantifier: classic shapes, deep nesting, escaped chars,
  character classes, false positives on legitimate patterns
- ReDoS mitigation: rejection path, <50ms response on adversarial input

* fix(hooks): fix ReDoS heuristic false-positives on (?:...) groups

Resolves the follow-up review on #87. Four findings, all valid.

## #1 MAJOR — hasNestedQuantifier false-positives on group syntax

The previous scanner treated `?` as a quantifier at any depth > 0, but
`?` immediately after `(` is group syntax, not a quantifier:
`(?:...)`, `(?=...)`, `(?!...)`, `(?<name>...)`, `(?<=...)`, `(?<!...)`.
Patterns like `(?:pre_)?tool_name` were silently rejected and the
registering hook never fired — with no error surfaced to the caller.

Fix: explicit group-syntax prefix skip inside the `(` handler. When the
scanner enters a group it peeks for `?` and advances past the modifier
characters (`:`, `=`, `!`, `<=`, `<!`, or a named-group label `<name>`)
before processing the group body. The `?` is therefore never considered
a quantifier at the start of a group.

While in there I also fixed a second correctness bug the reviewer
didn't catch: the old depth-indexed array didn't propagate a child
group's "contains quantifier" signal up to its parent, so `(?:(a+))+`
(outer quantifier over a wrapped quantified group, equivalent to
`(a+)+`) escaped detection. The scanner now uses an explicit stack of
`QuantifierFrame` records, and when a child group closes it propagates
`hasBacktrackRisk` to the parent frame — whether the child itself was
quantified or not. This catches `(?:(a+))+` correctly while still
allowing non-risky wrappers like `(?:(ab))+`.

Added 9 tests covering non-capturing groups, lookaheads, lookbehinds,
named captures, `(?:(a+))+` risk propagation, and `((ab)+)+` deep
wrapping. Verified existing ReDoS-detection tests still pass.

## #2 MINOR — executeHooks.test.ts did not clear the pattern cache

Matcher cache is module-level. `matchers.test.ts` clears it in
`beforeEach`; `executeHooks.test.ts` did not, so patterns compiled
during one test persisted across subsequent tests in the same file.
Added `clearMatcherCache()` to `executeHooks`'s `beforeEach` — no
test failures today, but restores test independence.

## #3 MINOR — Test utilities leaked into production barrel

`clearMatcherCache` and `getMatcherCacheSize` are test-only helpers
(their JSDoc says so) but were exported from `src/hooks/index.ts`.
When the integration PR eventually re-exports `src/hooks` from
`src/index.ts` they would become public API.

Removed both from the barrel. Tests already import them directly from
`../matchers`, so no test changes needed. `hasNestedQuantifier`,
`MAX_PATTERN_LENGTH`, and `MAX_CACHE_SIZE` remain exported — hosts can
legitimately use them for upstream validation.

## #4 NIT — LRU refresh overhead at low cache pressure

Every `compile()` cache hit was doing `delete + set` to refresh LRU
position, even when the cache was nowhere near capacity. With
`MAX_CACHE_SIZE = 256` and typical working sets of tens of patterns,
eviction pressure is near-zero and the refresh is pure overhead.

Added a `LRU_REFRESH_THRESHOLD = 75%` gate: refresh only runs when the
cache is above 192 entries. Below that the code fast-paths straight
back from `compile()`. Above it the LRU semantics kick in so hot
patterns still survive evictions — the existing "refreshes LRU
position on hit" test still passes because by the time it runs the
check, the cache is at capacity.

Tests: 72 -> 81 (+9 for group-syntax and risk-propagation coverage).
danny-avila added a commit that referenced this pull request Apr 13, 2026
* feat: wire tool hooks into event-driven ToolNode (phase 1/3)

Fires PreToolUse, PostToolUse, PostToolUseFailure, and PermissionDenied
hooks inside `ToolNode.dispatchToolEvents` — the single chokepoint for
all event-driven tool execution in LibreChat.

## Hook lifecycle in dispatchToolEvents

1. **PreToolUse** fires per call in parallel before ON_TOOL_EXECUTE
   dispatch. Denied calls (deny or ask) produce error ToolMessages and
   fire PermissionDenied; surviving calls proceed with optional
   updatedInput applied to tool args.

2. Only approved calls are batched into ON_TOOL_EXECUTE and sent to the
   host. If all calls are denied, the batch dispatch is skipped entirely.

3. **PostToolUse** fires per successful result. If a hook returns
   updatedOutput, the ToolMessage content is replaced before the message
   is appended to the graph.

4. **PostToolUseFailure** fires per error result. Observational only —
   cannot modify the error message.

Post hooks are awaited (not fire-and-forget) so updatedOutput takes
effect before the ToolMessage is consumed. Hook errors are swallowed
via .catch() to avoid disrupting the tool result flow.
PermissionDenied is fire-and-forget since it's purely observational.

## Plumbing

- `ToolNodeOptions.hookRegistry?: HookRegistry` — new optional field
- `ToolNode` constructor stores `this.hookRegistry`
- `Graph.createToolNode()` passes `this.hookRegistry` at both
  construction sites (event-driven and traditional)
- `hookRegistry` is set on Graph BEFORE `createWorkflow()` is called
  (moved from the post-create assignment in Run constructor into
  `createLegacyGraph` and `createMultiAgentGraph`) so ToolNode receives
  a non-undefined registry at construction time

## hasHookFor guards

All four hook fire points use `hasHookFor(event, runId)` before calling
`executeHooks`, so when no hooks are registered for the event the
overhead is a single Map lookup + array length check — no Promise
allocations, no function calls.

## Tests

9 new integration tests in `src/hooks/__tests__/toolHooks.test.ts`
using event-driven mode (toolDefinitions + ON_TOOL_EXECUTE handler):

- PreToolUse fires with correct fields
- PreToolUse deny blocks execution (tool handler never called)
- PreToolUse ask blocks execution (v1)
- PreToolUse updatedInput rewrites args before dispatch to host
- PreToolUse hook errors are non-fatal (tool still executes)
- PermissionDenied fires after deny with reason
- PostToolUse fires with tool output
- PostToolUseFailure fires on error result
- No-hooks baseline works identically

Total: 94 existing + 9 new = 103 passing.

* fix(hooks): address PR 3 review — step completion, ordering, catch, tests

Resolves all 10 findings from the review on #90.

## Structural rewrite of dispatchToolEvents

The post-processing loop was rewritten to fix #1, #2, #6, #8, #10:

- #1 MAJOR: Denied calls now dispatch ON_RUN_STEP_COMPLETED via a
  shared `dispatchStepCompleted` helper (extracted from the duplicate
  logic in the executed-results loop). Without this, the frontend
  would show stuck spinners for denied tool calls.

- #2 MAJOR: Output messages are now collected in a Map keyed by
  tool_call_id, then returned in the original `toolCalls` input order
  via `toolCalls.map(call => messageByCallId.get(call.id))`. Before,
  denied messages were prepended to executed messages, breaking the
  implicit ordering contract.

- #6 MINOR: `toolUsageCount` is now incremented only for approved
  calls (moved from the initial `preToolCalls.map` into the
  `approvedEntries.map` that builds ToolCallRequests). Denied calls no
  longer inflate the turn counter.

- #8 MINOR: Eliminated the double `result.status === 'error'` check.
  ToolMessage is now constructed inside each status branch directly
  after hook dispatch.

- #10 NIT: `requests.find()` replaced with a pre-built
  `Map<id, ToolCallRequest>` for O(1) lookup per result, per AGENTS.md
  guidance on Map/Set over Array.find.

## Error protection

- #3 MAJOR: PreToolUse `Promise.all` now wraps each `executeHooks`
  call in `.catch(() => emptyResult)` so an infrastructure error in
  the hooks subsystem cannot crash the entire tool batch. Matches the
  stated "hook errors are non-fatal" contract and is consistent with
  PostToolUse/PostToolUseFailure which already had `.catch()`.

## New tests

- #4 MAJOR: Added `updatedOutput replaces the ToolMessage content` —
  registers a PostToolUse hook returning `{updatedOutput: 'REDACTED'}`,
  verifies the ToolMessage content in the graph's run messages is
  `'REDACTED'` (not the original output).

- #7 MINOR: PermissionDenied test replaced 50ms `setTimeout` with a
  `Promise` resolved by the hook callback. No more flaky sleep.

## Cleanup

- #9 NIT: Updated stale "once the tool-hook PR lands" comment in
  `src/hooks/index.ts`.

Tests: 103 -> 104 (+1 for updatedOutput).

* fix(hooks): off-by-one in dispatchStepCompleted index, test guard, turn fallback

- #1 MAJOR: dispatchStepCompleted read post-increment toolUsageCount
  for the `index` field. Added `turn?: number` parameter; approved-call
  site passes `request?.turn` (the pre-increment value), denied-call
  site omits it (fallback is correct since count was never bumped).
- #2 NIT: updatedOutput test now asserts `expect(toolMsg).toBeDefined()`
  before content extraction so a missing tool message produces a clear
  failure rather than "Expected undefined to be 'REDACTED'".
- #3 NIT: PreToolUse input `turn` field restored `?? 0` fallback so
  first-use tools see `turn: 0` instead of `turn: undefined`.

* fix(hooks): batch tests, turn JSDoc, freeze sentinel, remove unused hookRegistry

Addresses follow-up review findings on #90.

- #1 MAJOR: Add multi-call batch tests — partial deny (echo denied,
  math approved, order preserved) and all-denied (ON_TOOL_EXECUTE
  handler never called). Uses counter-based IDs (#7 fix) to avoid
  Date.now() collisions across calls in the same test.
- #2 MINOR: Document PreToolUseHookInput.turn semantics — within a
  single batch, all calls to the same tool share the same turn value.
  Per-call discrimination within a batch is not supported in v1.
- #3 MINOR: Remove hookRegistry from the traditional (non-event-driven)
  ToolNode construction site in Graph.ts. Hooks only fire in
  dispatchToolEvents; passing the registry to a ToolNode that never
  uses it created a false impression of hook support.
- #4 MINOR: Add PostToolUse error non-fatality test — throwing hook
  does not crash the batch, original content is preserved.
- #5+#8 MINOR: Freeze the hook-error fallback sentinel via
  Object.freeze and rename from emptyResult to HOOK_FALLBACK. Prevents
  future mutation bugs if someone pushes to the shared arrays.
- #6 MINOR: Document in ToolNodeOptions.hookRegistry JSDoc that hooks
  only fire for event-driven calls — directToolNames bypass dispatch.

Tests: 104 -> 107 (+3 for batch partial-deny, all-denied, and
PostToolUse error resilience).

* nit: simplify test assertions, keep shallow freeze (deep conflicts with mutable type)
danny-avila added a commit that referenced this pull request Apr 16, 2026
Address 12 valid findings from the audit — 3 MAJOR, 6 MINOR, 3 NIT:

MAJOR fixes:

1. Depth tracking now works across graph boundaries. Previously the
   `depth`/`maxDepth` check only worked within a single executor; when
   `allowNested: true`, the child graph spawned its own executor at
   depth=0, making `maxSubagentDepth` non-functional. Fix: decrement
   `maxSubagentDepth` in `buildChildInputs`, so the child's own
   executor inherits a smaller budget and naturally blocks further
   nesting when it reaches 0. Removed the now-vestigial `depth` field.

2. Detect duplicate `SubagentConfig.type` values. `Map` construction
   silently dropped duplicate keys. Now `resolveSubagentConfigs`
   throws with a clear error.

3. Add test coverage for `allowNested: true`: verifies depth is
   decremented across boundaries, child inherits subagentConfigs,
   and budget is clamped to >= 0.

MINOR fixes:

4. Runtime guard for empty `description` in the tool callback — the
   LLM can send undefined/empty values; now we default to a safe
   fallback string.

5. Updated `src/hooks/index.ts` barrel comment to list SubagentExecutor
   as the fourth hook consumer (SubagentStart/SubagentStop).

6. Extracted property description strings as constants in
   SubagentTool.ts; `SubagentToolSchema` and `buildSubagentToolParams`
   now share a single source of truth.

7. Changed SubagentStop from fire-and-forget to awaited, matching the
   PostCompact pattern. Removes fragile `setTimeout` synchronization
   in tests. Errors still swallowed (observational).

8. Truncate error messages from child graph to 200 chars to prevent
   leaking internal details (stack traces, API errors) to the parent
   LLM.

9. Document that `ask` decision is treated identically to `deny` in
   the non-interactive subagent context.

NIT fixes:

11. Import order in SubagentExecutor.ts — `@/types` (multi-line,
    longest) now comes first in the local types sub-group.

12. Removed unused `contentParts` destructuring in verification script.

13. Fixed unsafe `msg.content as string` cast in verification script;
    now handles array content correctly.

Skipped #10 (NIT): `_sourceInputs` underscore prefix is an accepted
TS convention for cross-file-accessible-but-semantically-internal
fields.

Test count: 1074 passed (up from 1065, +9 new tests), 0 regressions.
danny-avila added a commit that referenced this pull request Apr 17, 2026
#104)

* feat: add subagent-as-tool primitive for hierarchical agent delegation

Adds a SubagentTool that lets a parent agent delegate tasks to child
agents with isolated context windows. Verbose tool outputs stay in
the child's context; only a filtered text summary returns to the parent.

Key components:
- SubagentConfig type on AgentInputs for declaring child agent types
- SubagentExecutor: creates child StandardGraph, invokes with isolated
  state, filters result (strips tool_use/thinking blocks), fires
  SubagentStart/SubagentStop hooks
- SubagentTool: LCTool definition with dynamic enum from configs
- Graph integration: tool created in createAgentNode() via graphTools,
  automatically becomes a direct tool (bypasses event-driven dispatch)
- Self-spawn support: reuse parent's AgentInputs for context isolation
  without separate agent config

Wires the existing SubagentStart (with deny support) and SubagentStop
hook types that were defined but had zero fire points.

* fix: resolve review findings for subagent primitive

Address all 6 issues from PR review:

1. Add subagentHooks.test.ts — end-to-end hook integration test
   through real Run pipeline (SubagentStart payload, SubagentStop
   messages, deny blocks execution). 3 new tests.

2. Remove `as unknown as` cast — follow MultiAgentGraph pattern
   (init graphTools as empty array, push directly).

3. Deduplicate schema — extract buildSubagentToolParams() in
   SubagentTool.ts, use it in Graph.ts. createSubagentToolDefinition()
   now delegates to it. Single source of truth for schema/description.

4. Propagate threadId — tool callback extracts thread_id from
   LangGraph config.configurable, passes to executor.execute().
   Hooks now receive correct threadId.

5. Pass tokenCounter to child graph — added to SubagentExecutorOptions,
   threaded from agentContext.tokenCounter through to child
   StandardGraph constructor.

6. Pass signal in invoke() config — workflow.invoke() now receives
   signal for more responsive abort between graph steps.

* fix: resolve second-pass audit findings for subagent primitive

Address 12 valid findings from the audit — 3 MAJOR, 6 MINOR, 3 NIT:

MAJOR fixes:

1. Depth tracking now works across graph boundaries. Previously the
   `depth`/`maxDepth` check only worked within a single executor; when
   `allowNested: true`, the child graph spawned its own executor at
   depth=0, making `maxSubagentDepth` non-functional. Fix: decrement
   `maxSubagentDepth` in `buildChildInputs`, so the child's own
   executor inherits a smaller budget and naturally blocks further
   nesting when it reaches 0. Removed the now-vestigial `depth` field.

2. Detect duplicate `SubagentConfig.type` values. `Map` construction
   silently dropped duplicate keys. Now `resolveSubagentConfigs`
   throws with a clear error.

3. Add test coverage for `allowNested: true`: verifies depth is
   decremented across boundaries, child inherits subagentConfigs,
   and budget is clamped to >= 0.

MINOR fixes:

4. Runtime guard for empty `description` in the tool callback — the
   LLM can send undefined/empty values; now we default to a safe
   fallback string.

5. Updated `src/hooks/index.ts` barrel comment to list SubagentExecutor
   as the fourth hook consumer (SubagentStart/SubagentStop).

6. Extracted property description strings as constants in
   SubagentTool.ts; `SubagentToolSchema` and `buildSubagentToolParams`
   now share a single source of truth.

7. Changed SubagentStop from fire-and-forget to awaited, matching the
   PostCompact pattern. Removes fragile `setTimeout` synchronization
   in tests. Errors still swallowed (observational).

8. Truncate error messages from child graph to 200 chars to prevent
   leaking internal details (stack traces, API errors) to the parent
   LLM.

9. Document that `ask` decision is treated identically to `deny` in
   the non-interactive subagent context.

NIT fixes:

11. Import order in SubagentExecutor.ts — `@/types` (multi-line,
    longest) now comes first in the local types sub-group.

12. Removed unused `contentParts` destructuring in verification script.

13. Fixed unsafe `msg.content as string` cast in verification script;
    now handles array content correctly.

Skipped #10 (NIT): `_sourceInputs` underscore prefix is an accepted
TS convention for cross-file-accessible-but-semantically-internal
fields.

Test count: 1074 passed (up from 1065, +9 new tests), 0 regressions.

* fix: address third-pass audit findings for subagent primitive

Resolve 3 NITs from the second review pass plus 2 new codex comments:

NIT #1 — Tighten truncation test: previously asserted only
`result.content.length < 500`. Now asserts exact envelope of 219
chars ("Subagent error: " prefix + 200 body + "..." suffix) so that
regressions in ERROR_MESSAGE_MAX_CHARS are caught. Added companion
test verifying short messages are not truncated.

NIT #2 — Skip tool creation when maxSubagentDepth: 0. When a host
set `maxSubagentDepth: 0` alongside `subagentConfigs`, the tool was
still bound to the LLM, and every invocation returned the depth-
exceeded error. The LLM wasted a turn. Now the `createAgentNode`
guard includes `effectiveSubagentDepth > 0`, so the tool is not
created at the leaf level. Added integration test.

NIT #3 — Document buildChildInputs audience. Added @remarks block
explaining this is an advanced utility exported primarily for
testing and internal use; hosts configuring subagents should not
call it directly.

Codex P1 (false positive, clarified): maxSubagentDepth propagation
across graph boundaries was already correct via the countdown in
buildChildInputs, but it wasn't obvious from the call site. Added
a multi-line JSDoc comment above the executor construction in
createAgentNode explaining how depth consumes across boundaries
through the decremented AgentInputs.maxSubagentDepth.

Codex P2 (real bug, fixed): toolSchemaTokens did not include the
subagent tool's schema. `calculateInstructionTokens()` was called
in `fromConfig()` before graphTools was populated, so the later-
injected subagent tool's schema was missing from the budget. Fixes:
  1. Include `graphTools` in the iteration in
     `calculateInstructionTokens` (also helps handoff tools, a
     pre-existing issue with the same root cause).
  2. Re-trigger the calculation at the end of `createAgentNode`
     after the subagent tool is pushed, so the token count reflects
     the final graphTools state before `createCallModel` awaits
     `tokenCalculationPromise`.

Added test asserting `toolSchemaTokens` is larger for agents with
subagents vs without, confirming the recalculation runs.

Test count: 1077 passed (up from 1074, +3 new tests), 0 regressions.

* fix: break circular dep between SubagentExecutor and StandardGraph

Rollup warned on build:
  Graph.ts → subagent/index.ts → SubagentExecutor.ts → Graph.ts

SubagentExecutor needed `StandardGraph` at runtime to construct the
child graph; Graph.ts needs SubagentExecutor to attach the tool. With
preserveModules, these land in separate chunks producing circular
chunk imports with undefined execution order.

Fix: dependency injection. SubagentExecutor now takes a required
`createChildGraph: ChildGraphFactory` option and only imports
`StandardGraph` as `import type` (erased at build). Graph.ts passes
`(input) => new StandardGraph(input)` when constructing the executor.
Runtime module graph is now acyclic; `import type` does not appear in
emitted JS.

Test updates: replaced `jest.spyOn(GraphModule, 'StandardGraph')`
(which relied on the runtime import we just removed) with direct mock
factories passed through the new option. Cleaner — tests no longer
need to reach into module internals. Same coverage.

Build: `npx rollup -c` now emits zero warnings.
Test count: 1077 passed, 0 regressions.

* chore: add multi-agent-subagent script to package.json

Introduced a new script entry for multi-agent-subagent, enabling execution of the corresponding TypeScript file with necessary configurations. This addition enhances the multi-agent functionality by allowing for more granular control over subagent operations.

* fix: isolate child subagent from parent's callback chain

The manual verification script (`npm run multi-agent-subagent`)
crashed with:

  Error: No agent context found for agent ID researcher
    at StandardGraph.getAgentContext
    at ChatModelStreamHandler.handle

Root cause: `Run.processStream` drives the parent via
`streamEvents`, which captures events from every nested runnable in
the call tree — including the child graph's LLM calls. When the
child's "researcher" agent streamed `on_chat_model_stream` events,
the parent's `ChatModelStreamHandler` received them and tried to
resolve the child's agent ID in the parent's `agentContexts` map,
which only contains the supervisor. Error thrown.

The Phase 1 plan was for child events to stay fully isolated. Unit
tests missed this because the child graph was always mocked (no
real `streamEvents` plumbing); only running the full pipeline
end-to-end surfaced it.

Fix: pass `callbacks: []` in the child `workflow.invoke()` config.
This overrides the inherited callbacks for that invocation, breaking
the event propagation chain. Also set `runName: subagent:<type>` and
a child-scoped `configurable.thread_id` so the child has a distinct
trace root rather than polluting the parent's trace.

Manual verification now works: supervisor delegates to researcher,
receives filtered text, synthesizes the final answer.

Test suite: 1079 passed, 0 subagent-suite regressions. The two newly
failing suites (ProgrammaticToolCalling / ToolSearch Live API tests)
are pre-existing — they require `LIBRECHAT_CODE_BASEURL` to be
configured and fail identically on the prior commit. They were
skipped before because OPENAI_API_KEY was absent.

* feat: align filterSubagentResult fallback with Claude Code

When a subagent's last AIMessage is pure tool_use (e.g. the model
hit maxTurns mid-tool-call), the previous implementation returned
the "Task completed" sentinel, discarding any textual progress from
earlier turns.

Claude Code's agentToolUtils.finalizeAgentTool walks backwards from
the last message, and on messages without text content continues
scanning earlier AIMessages, surfacing the most recent textual
thought instead of the sentinel.

This change matches that behavior. The three `return` sites inside
the loop that previously exited with the sentinel now `continue`
when the AIMessage yields no text; the loop exits with "Task
completed" only when no AIMessage in the history contains any text.

Concrete example:
  [0] Human: "What's the capital of France?"
  [1] AI:    "Let me search." + tool_use(search, ...)
  [2] Tool:  "Paris."
  [3] AI:    tool_use(search, ...)   <-- maxTurns, no text

  Before: "Task completed"
  After:  "Let me search."

Added two tests — the turn-limit scenario above, and the analogous
empty-string-content edge case. All nine existing filterSubagentResult
tests still pass unchanged.

* fix: update model name for non-Anthropic API to gpt-5.4

Changed the model name used in the multi-agent subagent script from 'gpt-4.1' to 'gpt-5.4' for improved performance and capabilities. This update ensures compatibility with the latest API offerings.
danny-avila added a commit that referenced this pull request Apr 17, 2026
* feat: add inert hook registry and executor scaffolding (phase 1/1)

Introduces `src/hooks/` with types, registry, and executor for a 12-event
hook lifecycle. Purely additive — not exported from `src/index.ts` and not
yet wired into `Run`, `Graph`, or `ToolNode`. Integration lands in the
next PR so this change is shippable independently with zero behavioral
impact on hosts that don't opt in.

- Discriminated-union input/output per event; generic `HookCallback<E>`
  and `HookMatcher<E>` for type-safe registration.
- `HookRegistry` uses `Map<sessionId, bucket>` (not `Record`) to avoid
  O(n^2) churn under parallel registration in multi-tenant hosts.
- `executeHooks` fans out in parallel, races hooks against a combined
  parent + timeout `AbortSignal` so non-listening hooks still time out,
  folds decisions with `deny > ask > allow` precedence, accumulates
  `additionalContext`, and self-removes `once: true` matchers after any
  successful hook. Errors are non-fatal: swallowed into the aggregated
  result and routed through an optional winston logger (falling back to
  `console.warn`), with internal matcher errors suppressed.

47 tests under `src/hooks/__tests__/` cover registry scoping, regex
matching, precedence folding, once-self-removal (success/failure/mixed),
timeout enforcement (including non-listening hooks), error non-fatality,
and parent signal combination.

* fix(hooks): address review findings — rename field, wire updatedOutput, cache patterns

Resolves the comprehensive review on #87:

- Rename `HookMatcher.matcher` -> `HookMatcher.pattern` to fix the
  self-referential naming (`m.matcher` reads as "the matcher's matcher").
- Wire `PostToolUseHookOutput.updatedOutput` through `AggregatedHookResult`
  and `fold()`. The type was previously a promise the executor didn't
  fulfill — a hook returning `{ updatedOutput: ... }` was silently dropped.
- Correct the JSDoc on `updatedInput`/`updatedOutput`. The previous copy
  claimed non-determinism "in Promise.all resolution order" — but
  `Promise.all` preserves input order, so the fold is deterministic in
  registration order (outer loop over matchers, inner loop over each
  matcher's hooks). Updated the `executeHooks` function docstring to match.
- Add regex compilation cache and length bound to `matchers.ts`. Patterns
  compile at most once per unique string and are reused across calls;
  patterns over MAX_PATTERN_LENGTH (512) are rejected outright as a cheap
  ReDoS mitigation. Documented the trust model (patterns are trusted input;
  hosts must validate user-supplied patterns upstream).
- Document the wildcard-only semantics on query-less events: a non-empty
  pattern on an event that doesn't supply a matchQuery (RunStart, Stop,
  etc.) never matches.
- Document the `once: true` concurrent-dispatch limitation: two parallel
  `executeHooks` calls each snapshot the matcher list, so `once` means
  "once per call", not "once globally". Matches Claude Code's semantics.
  Added a test that pins this behavior.
- Merge `matchers.filter(...)` + task-build loop into a single pass per
  AGENTS.md "consolidate sequential O(n) operations."
- Scope `eslint-disable no-console` to the single `console.warn` call in
  `reportErrors` instead of disabling it file-wide.
- Fix the timeout-ignoring-hook test: clear the dangling `setTimeout` and
  assert the error surfaces an abort-shaped label.
- Add tests: multi-hook `preventContinuation` (first-writer-wins on
  stopReason, flag without reason), `updatedOutput` flow + registration
  order, registration-order determinism for `updatedInput`, pattern
  length bound, compilation cache hit/miss/clear, concurrent `once`
  dispatch. Total: 47 -> 59 tests.

Findings audited and resolved:
- #1 ReDoS: cache + length bound + trust model doc (full fix blocked on
  host-side validation)
- #2 Wrong updatedInput ordering JSDoc: fixed
- #3 Dead `updatedOutput` type: wired through
- #4 Concurrent `once` race: documented + test
- #5 `HookMatcher.matcher` naming: renamed to `pattern`
- #6 `WideCallback` dead code: rejected as inaccurate (used in `runHook`)
- #7 Eslint-disable scope: line-scoped
- #8 File-path comments: rejected; matches existing project convention
- #9 Two-pass filter: single pass
- #10 Regex recompilation: cached
- CL-4 Undocumented wildcard semantics: documented
- RT-3 Timeout test dangling timer: fixed + error content verified
- RT-4 No multi-hook preventContinuation test: added

* perf(hooks): atomic once, bounded LRU, ReDoS heuristic, shared signal

Rework the phase-1 hooks scaffolding for the multi-tenant scale we
actually operate at. These are behavior changes (semantics of `once`
tightened, new pattern-rejection path) but none of them are wired into
the runtime yet — this PR is still inert — so the blast radius is the
hooks directory only.

## Atomic `once: true`

Removal now happens synchronously inside `executeHooks`, between
`registry.getMatchers` and the first `await`. Node's event loop
serialises sync work, so two concurrent `executeHooks` calls can never
both observe the same `once` matcher — whichever call runs its sync
prefix first consumes it, and the loser sees an empty bucket.

Semantic change: `once` is now at-most-one-dispatch, not
at-most-one-successful-execution. If every hook in a `once` matcher
throws, the matcher is still gone. The old retry-on-failure behavior
was Claude Code's model — fine for a single-user CLI, but in a
multi-tenant server it opens a window where concurrent bursts can
re-dispatch a "once" hook non-idempotently. "At most once, ever" is
the correct semantic for a shared registry. Hosts that need retry
should register a normal matcher and self-unregister via the callback
returned from `registry.register`.

Dropped `collectOnceMatchersForRemoval` + its Map/array allocations —
the post-hoc collection step is no longer needed.

Tests: replaced "keeps the matcher when every hook throws" with
"removes the matcher even when every hook in it throws". Added
"fires exactly once across 3 concurrent calls" and "fires exactly once
across 8-way concurrent dispatch when hooks are slow" (the slow test
deliberately uses a 10ms hook to force overlap).

## Bounded LRU regex cache

`matchers.ts` was unbounded. Under multi-tenant use where different
tenants register different patterns, the cache would leak. Capped at
`MAX_CACHE_SIZE = 256` entries with LRU eviction: hits refresh position
(delete-then-set), misses at capacity evict the oldest key. Failed
compiles are cached the same way so a tenant spamming bad patterns
doesn't burn CPU on every call.

Exposed `getMatcherCacheSize` and `MAX_CACHE_SIZE` for test
observability. Tests verify eviction order (cold patterns recompiled
after overflow) and that hot patterns refreshed mid-stream survive
eviction pressure.

## ReDoS heuristic: nested-quantifier detector

`hasNestedQuantifier` walks a pattern linearly, tracks paren depth,
and rejects any pattern where a quantified group contains another
quantifier — the `(a+)+`, `(.*)*`, `(\w+)+` shape that is the #1
catastrophic-backtracking source. Character classes (`[*+?]`) and
escaped quantifiers (`\+`) are correctly ignored. Rejection happens
before `new RegExp`, so adversarial input never reaches the compiler.

This is a floor, not a ceiling — it doesn't catch ambiguous-alternation
ReDoS like `(a|a)+`. Hosts that accept user-supplied patterns must
still validate upstream. Documented in the `HookMatcher.pattern` and
`matchesQuery` JSDoc.

Test verifies that `(a+)+$` on a 35-char adversarial input resolves in
under 50ms (would be seconds without the heuristic).

## Shared `AbortSignal` per matcher

Was: one `AbortSignal.timeout(ms)` + one `AbortSignal.any([parent,
timeout])` per hook. For a matcher with N hooks and the same timeout,
that's N timers + N composite signals.

Now: one signal per matcher, shared across all its hooks. Since hooks
in a matcher already share `matcher.timeout`, they may as well share
the timer. Behavior is identical (all hooks fire and race the same
deadline) but allocation drops from N to 1 per matcher.

## Tests

72 passing (was 59, +13):
- Atomic once: 2 concurrent tests + 1 failure-removal test updated
- LRU: cache size cap, eviction order, hit refresh
- Nested quantifier: classic shapes, deep nesting, escaped chars,
  character classes, false positives on legitimate patterns
- ReDoS mitigation: rejection path, <50ms response on adversarial input

* fix(hooks): fix ReDoS heuristic false-positives on (?:...) groups

Resolves the follow-up review on #87. Four findings, all valid.

## #1 MAJOR — hasNestedQuantifier false-positives on group syntax

The previous scanner treated `?` as a quantifier at any depth > 0, but
`?` immediately after `(` is group syntax, not a quantifier:
`(?:...)`, `(?=...)`, `(?!...)`, `(?<name>...)`, `(?<=...)`, `(?<!...)`.
Patterns like `(?:pre_)?tool_name` were silently rejected and the
registering hook never fired — with no error surfaced to the caller.

Fix: explicit group-syntax prefix skip inside the `(` handler. When the
scanner enters a group it peeks for `?` and advances past the modifier
characters (`:`, `=`, `!`, `<=`, `<!`, or a named-group label `<name>`)
before processing the group body. The `?` is therefore never considered
a quantifier at the start of a group.

While in there I also fixed a second correctness bug the reviewer
didn't catch: the old depth-indexed array didn't propagate a child
group's "contains quantifier" signal up to its parent, so `(?:(a+))+`
(outer quantifier over a wrapped quantified group, equivalent to
`(a+)+`) escaped detection. The scanner now uses an explicit stack of
`QuantifierFrame` records, and when a child group closes it propagates
`hasBacktrackRisk` to the parent frame — whether the child itself was
quantified or not. This catches `(?:(a+))+` correctly while still
allowing non-risky wrappers like `(?:(ab))+`.

Added 9 tests covering non-capturing groups, lookaheads, lookbehinds,
named captures, `(?:(a+))+` risk propagation, and `((ab)+)+` deep
wrapping. Verified existing ReDoS-detection tests still pass.

## #2 MINOR — executeHooks.test.ts did not clear the pattern cache

Matcher cache is module-level. `matchers.test.ts` clears it in
`beforeEach`; `executeHooks.test.ts` did not, so patterns compiled
during one test persisted across subsequent tests in the same file.
Added `clearMatcherCache()` to `executeHooks`'s `beforeEach` — no
test failures today, but restores test independence.

## #3 MINOR — Test utilities leaked into production barrel

`clearMatcherCache` and `getMatcherCacheSize` are test-only helpers
(their JSDoc says so) but were exported from `src/hooks/index.ts`.
When the integration PR eventually re-exports `src/hooks` from
`src/index.ts` they would become public API.

Removed both from the barrel. Tests already import them directly from
`../matchers`, so no test changes needed. `hasNestedQuantifier`,
`MAX_PATTERN_LENGTH`, and `MAX_CACHE_SIZE` remain exported — hosts can
legitimately use them for upstream validation.

## #4 NIT — LRU refresh overhead at low cache pressure

Every `compile()` cache hit was doing `delete + set` to refresh LRU
position, even when the cache was nowhere near capacity. With
`MAX_CACHE_SIZE = 256` and typical working sets of tens of patterns,
eviction pressure is near-zero and the refresh is pure overhead.

Added a `LRU_REFRESH_THRESHOLD = 75%` gate: refresh only runs when the
cache is above 192 entries. Below that the code fast-paths straight
back from `compile()`. Above it the LRU semantics kick in so hot
patterns still survive evictions — the existing "refreshes LRU
position on hit" test still passes because by the time it runs the
check, the cache is at capacity.

Tests: 72 -> 81 (+9 for group-syntax and risk-propagation coverage).
danny-avila added a commit that referenced this pull request Apr 17, 2026
* feat: wire tool hooks into event-driven ToolNode (phase 1/3)

Fires PreToolUse, PostToolUse, PostToolUseFailure, and PermissionDenied
hooks inside `ToolNode.dispatchToolEvents` — the single chokepoint for
all event-driven tool execution in LibreChat.

## Hook lifecycle in dispatchToolEvents

1. **PreToolUse** fires per call in parallel before ON_TOOL_EXECUTE
   dispatch. Denied calls (deny or ask) produce error ToolMessages and
   fire PermissionDenied; surviving calls proceed with optional
   updatedInput applied to tool args.

2. Only approved calls are batched into ON_TOOL_EXECUTE and sent to the
   host. If all calls are denied, the batch dispatch is skipped entirely.

3. **PostToolUse** fires per successful result. If a hook returns
   updatedOutput, the ToolMessage content is replaced before the message
   is appended to the graph.

4. **PostToolUseFailure** fires per error result. Observational only —
   cannot modify the error message.

Post hooks are awaited (not fire-and-forget) so updatedOutput takes
effect before the ToolMessage is consumed. Hook errors are swallowed
via .catch() to avoid disrupting the tool result flow.
PermissionDenied is fire-and-forget since it's purely observational.

## Plumbing

- `ToolNodeOptions.hookRegistry?: HookRegistry` — new optional field
- `ToolNode` constructor stores `this.hookRegistry`
- `Graph.createToolNode()` passes `this.hookRegistry` at both
  construction sites (event-driven and traditional)
- `hookRegistry` is set on Graph BEFORE `createWorkflow()` is called
  (moved from the post-create assignment in Run constructor into
  `createLegacyGraph` and `createMultiAgentGraph`) so ToolNode receives
  a non-undefined registry at construction time

## hasHookFor guards

All four hook fire points use `hasHookFor(event, runId)` before calling
`executeHooks`, so when no hooks are registered for the event the
overhead is a single Map lookup + array length check — no Promise
allocations, no function calls.

## Tests

9 new integration tests in `src/hooks/__tests__/toolHooks.test.ts`
using event-driven mode (toolDefinitions + ON_TOOL_EXECUTE handler):

- PreToolUse fires with correct fields
- PreToolUse deny blocks execution (tool handler never called)
- PreToolUse ask blocks execution (v1)
- PreToolUse updatedInput rewrites args before dispatch to host
- PreToolUse hook errors are non-fatal (tool still executes)
- PermissionDenied fires after deny with reason
- PostToolUse fires with tool output
- PostToolUseFailure fires on error result
- No-hooks baseline works identically

Total: 94 existing + 9 new = 103 passing.

* fix(hooks): address PR 3 review — step completion, ordering, catch, tests

Resolves all 10 findings from the review on #90.

## Structural rewrite of dispatchToolEvents

The post-processing loop was rewritten to fix #1, #2, #6, #8, #10:

- #1 MAJOR: Denied calls now dispatch ON_RUN_STEP_COMPLETED via a
  shared `dispatchStepCompleted` helper (extracted from the duplicate
  logic in the executed-results loop). Without this, the frontend
  would show stuck spinners for denied tool calls.

- #2 MAJOR: Output messages are now collected in a Map keyed by
  tool_call_id, then returned in the original `toolCalls` input order
  via `toolCalls.map(call => messageByCallId.get(call.id))`. Before,
  denied messages were prepended to executed messages, breaking the
  implicit ordering contract.

- #6 MINOR: `toolUsageCount` is now incremented only for approved
  calls (moved from the initial `preToolCalls.map` into the
  `approvedEntries.map` that builds ToolCallRequests). Denied calls no
  longer inflate the turn counter.

- #8 MINOR: Eliminated the double `result.status === 'error'` check.
  ToolMessage is now constructed inside each status branch directly
  after hook dispatch.

- #10 NIT: `requests.find()` replaced with a pre-built
  `Map<id, ToolCallRequest>` for O(1) lookup per result, per AGENTS.md
  guidance on Map/Set over Array.find.

## Error protection

- #3 MAJOR: PreToolUse `Promise.all` now wraps each `executeHooks`
  call in `.catch(() => emptyResult)` so an infrastructure error in
  the hooks subsystem cannot crash the entire tool batch. Matches the
  stated "hook errors are non-fatal" contract and is consistent with
  PostToolUse/PostToolUseFailure which already had `.catch()`.

## New tests

- #4 MAJOR: Added `updatedOutput replaces the ToolMessage content` —
  registers a PostToolUse hook returning `{updatedOutput: 'REDACTED'}`,
  verifies the ToolMessage content in the graph's run messages is
  `'REDACTED'` (not the original output).

- #7 MINOR: PermissionDenied test replaced 50ms `setTimeout` with a
  `Promise` resolved by the hook callback. No more flaky sleep.

## Cleanup

- #9 NIT: Updated stale "once the tool-hook PR lands" comment in
  `src/hooks/index.ts`.

Tests: 103 -> 104 (+1 for updatedOutput).

* fix(hooks): off-by-one in dispatchStepCompleted index, test guard, turn fallback

- #1 MAJOR: dispatchStepCompleted read post-increment toolUsageCount
  for the `index` field. Added `turn?: number` parameter; approved-call
  site passes `request?.turn` (the pre-increment value), denied-call
  site omits it (fallback is correct since count was never bumped).
- #2 NIT: updatedOutput test now asserts `expect(toolMsg).toBeDefined()`
  before content extraction so a missing tool message produces a clear
  failure rather than "Expected undefined to be 'REDACTED'".
- #3 NIT: PreToolUse input `turn` field restored `?? 0` fallback so
  first-use tools see `turn: 0` instead of `turn: undefined`.

* fix(hooks): batch tests, turn JSDoc, freeze sentinel, remove unused hookRegistry

Addresses follow-up review findings on #90.

- #1 MAJOR: Add multi-call batch tests — partial deny (echo denied,
  math approved, order preserved) and all-denied (ON_TOOL_EXECUTE
  handler never called). Uses counter-based IDs (#7 fix) to avoid
  Date.now() collisions across calls in the same test.
- #2 MINOR: Document PreToolUseHookInput.turn semantics — within a
  single batch, all calls to the same tool share the same turn value.
  Per-call discrimination within a batch is not supported in v1.
- #3 MINOR: Remove hookRegistry from the traditional (non-event-driven)
  ToolNode construction site in Graph.ts. Hooks only fire in
  dispatchToolEvents; passing the registry to a ToolNode that never
  uses it created a false impression of hook support.
- #4 MINOR: Add PostToolUse error non-fatality test — throwing hook
  does not crash the batch, original content is preserved.
- #5+#8 MINOR: Freeze the hook-error fallback sentinel via
  Object.freeze and rename from emptyResult to HOOK_FALLBACK. Prevents
  future mutation bugs if someone pushes to the shared arrays.
- #6 MINOR: Document in ToolNodeOptions.hookRegistry JSDoc that hooks
  only fire for event-driven calls — directToolNames bypass dispatch.

Tests: 104 -> 107 (+3 for batch partial-deny, all-denied, and
PostToolUse error resilience).

* nit: simplify test assertions, keep shallow freeze (deep conflicts with mutable type)
danny-avila added a commit that referenced this pull request Apr 17, 2026
#104)

* feat: add subagent-as-tool primitive for hierarchical agent delegation

Adds a SubagentTool that lets a parent agent delegate tasks to child
agents with isolated context windows. Verbose tool outputs stay in
the child's context; only a filtered text summary returns to the parent.

Key components:
- SubagentConfig type on AgentInputs for declaring child agent types
- SubagentExecutor: creates child StandardGraph, invokes with isolated
  state, filters result (strips tool_use/thinking blocks), fires
  SubagentStart/SubagentStop hooks
- SubagentTool: LCTool definition with dynamic enum from configs
- Graph integration: tool created in createAgentNode() via graphTools,
  automatically becomes a direct tool (bypasses event-driven dispatch)
- Self-spawn support: reuse parent's AgentInputs for context isolation
  without separate agent config

Wires the existing SubagentStart (with deny support) and SubagentStop
hook types that were defined but had zero fire points.

* fix: resolve review findings for subagent primitive

Address all 6 issues from PR review:

1. Add subagentHooks.test.ts — end-to-end hook integration test
   through real Run pipeline (SubagentStart payload, SubagentStop
   messages, deny blocks execution). 3 new tests.

2. Remove `as unknown as` cast — follow MultiAgentGraph pattern
   (init graphTools as empty array, push directly).

3. Deduplicate schema — extract buildSubagentToolParams() in
   SubagentTool.ts, use it in Graph.ts. createSubagentToolDefinition()
   now delegates to it. Single source of truth for schema/description.

4. Propagate threadId — tool callback extracts thread_id from
   LangGraph config.configurable, passes to executor.execute().
   Hooks now receive correct threadId.

5. Pass tokenCounter to child graph — added to SubagentExecutorOptions,
   threaded from agentContext.tokenCounter through to child
   StandardGraph constructor.

6. Pass signal in invoke() config — workflow.invoke() now receives
   signal for more responsive abort between graph steps.

* fix: resolve second-pass audit findings for subagent primitive

Address 12 valid findings from the audit — 3 MAJOR, 6 MINOR, 3 NIT:

MAJOR fixes:

1. Depth tracking now works across graph boundaries. Previously the
   `depth`/`maxDepth` check only worked within a single executor; when
   `allowNested: true`, the child graph spawned its own executor at
   depth=0, making `maxSubagentDepth` non-functional. Fix: decrement
   `maxSubagentDepth` in `buildChildInputs`, so the child's own
   executor inherits a smaller budget and naturally blocks further
   nesting when it reaches 0. Removed the now-vestigial `depth` field.

2. Detect duplicate `SubagentConfig.type` values. `Map` construction
   silently dropped duplicate keys. Now `resolveSubagentConfigs`
   throws with a clear error.

3. Add test coverage for `allowNested: true`: verifies depth is
   decremented across boundaries, child inherits subagentConfigs,
   and budget is clamped to >= 0.

MINOR fixes:

4. Runtime guard for empty `description` in the tool callback — the
   LLM can send undefined/empty values; now we default to a safe
   fallback string.

5. Updated `src/hooks/index.ts` barrel comment to list SubagentExecutor
   as the fourth hook consumer (SubagentStart/SubagentStop).

6. Extracted property description strings as constants in
   SubagentTool.ts; `SubagentToolSchema` and `buildSubagentToolParams`
   now share a single source of truth.

7. Changed SubagentStop from fire-and-forget to awaited, matching the
   PostCompact pattern. Removes fragile `setTimeout` synchronization
   in tests. Errors still swallowed (observational).

8. Truncate error messages from child graph to 200 chars to prevent
   leaking internal details (stack traces, API errors) to the parent
   LLM.

9. Document that `ask` decision is treated identically to `deny` in
   the non-interactive subagent context.

NIT fixes:

11. Import order in SubagentExecutor.ts — `@/types` (multi-line,
    longest) now comes first in the local types sub-group.

12. Removed unused `contentParts` destructuring in verification script.

13. Fixed unsafe `msg.content as string` cast in verification script;
    now handles array content correctly.

Skipped #10 (NIT): `_sourceInputs` underscore prefix is an accepted
TS convention for cross-file-accessible-but-semantically-internal
fields.

Test count: 1074 passed (up from 1065, +9 new tests), 0 regressions.

* fix: address third-pass audit findings for subagent primitive

Resolve 3 NITs from the second review pass plus 2 new codex comments:

NIT #1 — Tighten truncation test: previously asserted only
`result.content.length < 500`. Now asserts exact envelope of 219
chars ("Subagent error: " prefix + 200 body + "..." suffix) so that
regressions in ERROR_MESSAGE_MAX_CHARS are caught. Added companion
test verifying short messages are not truncated.

NIT #2 — Skip tool creation when maxSubagentDepth: 0. When a host
set `maxSubagentDepth: 0` alongside `subagentConfigs`, the tool was
still bound to the LLM, and every invocation returned the depth-
exceeded error. The LLM wasted a turn. Now the `createAgentNode`
guard includes `effectiveSubagentDepth > 0`, so the tool is not
created at the leaf level. Added integration test.

NIT #3 — Document buildChildInputs audience. Added @remarks block
explaining this is an advanced utility exported primarily for
testing and internal use; hosts configuring subagents should not
call it directly.

Codex P1 (false positive, clarified): maxSubagentDepth propagation
across graph boundaries was already correct via the countdown in
buildChildInputs, but it wasn't obvious from the call site. Added
a multi-line JSDoc comment above the executor construction in
createAgentNode explaining how depth consumes across boundaries
through the decremented AgentInputs.maxSubagentDepth.

Codex P2 (real bug, fixed): toolSchemaTokens did not include the
subagent tool's schema. `calculateInstructionTokens()` was called
in `fromConfig()` before graphTools was populated, so the later-
injected subagent tool's schema was missing from the budget. Fixes:
  1. Include `graphTools` in the iteration in
     `calculateInstructionTokens` (also helps handoff tools, a
     pre-existing issue with the same root cause).
  2. Re-trigger the calculation at the end of `createAgentNode`
     after the subagent tool is pushed, so the token count reflects
     the final graphTools state before `createCallModel` awaits
     `tokenCalculationPromise`.

Added test asserting `toolSchemaTokens` is larger for agents with
subagents vs without, confirming the recalculation runs.

Test count: 1077 passed (up from 1074, +3 new tests), 0 regressions.

* fix: break circular dep between SubagentExecutor and StandardGraph

Rollup warned on build:
  Graph.ts → subagent/index.ts → SubagentExecutor.ts → Graph.ts

SubagentExecutor needed `StandardGraph` at runtime to construct the
child graph; Graph.ts needs SubagentExecutor to attach the tool. With
preserveModules, these land in separate chunks producing circular
chunk imports with undefined execution order.

Fix: dependency injection. SubagentExecutor now takes a required
`createChildGraph: ChildGraphFactory` option and only imports
`StandardGraph` as `import type` (erased at build). Graph.ts passes
`(input) => new StandardGraph(input)` when constructing the executor.
Runtime module graph is now acyclic; `import type` does not appear in
emitted JS.

Test updates: replaced `jest.spyOn(GraphModule, 'StandardGraph')`
(which relied on the runtime import we just removed) with direct mock
factories passed through the new option. Cleaner — tests no longer
need to reach into module internals. Same coverage.

Build: `npx rollup -c` now emits zero warnings.
Test count: 1077 passed, 0 regressions.

* chore: add multi-agent-subagent script to package.json

Introduced a new script entry for multi-agent-subagent, enabling execution of the corresponding TypeScript file with necessary configurations. This addition enhances the multi-agent functionality by allowing for more granular control over subagent operations.

* fix: isolate child subagent from parent's callback chain

The manual verification script (`npm run multi-agent-subagent`)
crashed with:

  Error: No agent context found for agent ID researcher
    at StandardGraph.getAgentContext
    at ChatModelStreamHandler.handle

Root cause: `Run.processStream` drives the parent via
`streamEvents`, which captures events from every nested runnable in
the call tree — including the child graph's LLM calls. When the
child's "researcher" agent streamed `on_chat_model_stream` events,
the parent's `ChatModelStreamHandler` received them and tried to
resolve the child's agent ID in the parent's `agentContexts` map,
which only contains the supervisor. Error thrown.

The Phase 1 plan was for child events to stay fully isolated. Unit
tests missed this because the child graph was always mocked (no
real `streamEvents` plumbing); only running the full pipeline
end-to-end surfaced it.

Fix: pass `callbacks: []` in the child `workflow.invoke()` config.
This overrides the inherited callbacks for that invocation, breaking
the event propagation chain. Also set `runName: subagent:<type>` and
a child-scoped `configurable.thread_id` so the child has a distinct
trace root rather than polluting the parent's trace.

Manual verification now works: supervisor delegates to researcher,
receives filtered text, synthesizes the final answer.

Test suite: 1079 passed, 0 subagent-suite regressions. The two newly
failing suites (ProgrammaticToolCalling / ToolSearch Live API tests)
are pre-existing — they require `LIBRECHAT_CODE_BASEURL` to be
configured and fail identically on the prior commit. They were
skipped before because OPENAI_API_KEY was absent.

* feat: align filterSubagentResult fallback with Claude Code

When a subagent's last AIMessage is pure tool_use (e.g. the model
hit maxTurns mid-tool-call), the previous implementation returned
the "Task completed" sentinel, discarding any textual progress from
earlier turns.

Claude Code's agentToolUtils.finalizeAgentTool walks backwards from
the last message, and on messages without text content continues
scanning earlier AIMessages, surfacing the most recent textual
thought instead of the sentinel.

This change matches that behavior. The three `return` sites inside
the loop that previously exited with the sentinel now `continue`
when the AIMessage yields no text; the loop exits with "Task
completed" only when no AIMessage in the history contains any text.

Concrete example:
  [0] Human: "What's the capital of France?"
  [1] AI:    "Let me search." + tool_use(search, ...)
  [2] Tool:  "Paris."
  [3] AI:    tool_use(search, ...)   <-- maxTurns, no text

  Before: "Task completed"
  After:  "Let me search."

Added two tests — the turn-limit scenario above, and the analogous
empty-string-content edge case. All nine existing filterSubagentResult
tests still pass unchanged.

* fix: update model name for non-Anthropic API to gpt-5.4

Changed the model name used in the multi-agent subagent script from 'gpt-4.1' to 'gpt-5.4' for improved performance and capabilities. This update ensures compatibility with the latest API offerings.
danny-avila added a commit that referenced this pull request Apr 17, 2026
…tests

Addresses the comprehensive audit of #107:

### MAJOR #1 — ON_TOOL_EXECUTE forwarding coverage
Two new tests in the `event forwarding` suite drive the forwarder
callback directly with a synthesized `ToolExecuteBatchRequest`:
- `routes child ON_TOOL_EXECUTE dispatches through the parent registry`
  asserts the parent's `ON_TOOL_EXECUTE` handler is invoked, the batch's
  `resolve` fires with canned results, and the child receives them.
- `does NOT forward ON_TOOL_EXECUTE when the parent registry has no
  handler (safe fallback)` documents the companion case: without a
  handler the `resolve` never fires, which is why the executor gates
  `keepToolDefinitions` on handler presence in the first place.

### MINOR #2 — unused `childGraph` parameter
Removed from `createForwarderCallback`'s arg shape, destructuring, and
call site. It was accepted as `_childGraph` and never read.

### MINOR #3 — `data as never`
`EventHandler.handle` didn't include `ToolExecuteBatchRequest` in its
accepted data union, which is why the forwarder used `as never` to
shut the compiler up. Extended the union and switched the cast to the
concrete `ToolExecuteBatchRequest` type — the forwarder is now fully
type-checked on the hot path.

### MINOR #4 — error-phase envelope test
`emits an 'error' phase envelope when the child graph throws` uses
the throwing factory helper already in the file, registers a real
`ON_SUBAGENT_UPDATE` handler, and asserts start/error ordering plus
the `data.message` payload and `parentToolCallId` passthrough.

### MINOR #6 — dynamic imports in tests
Replaced `await import('@/events')` / `await import('@/common')` with
static top-level imports. Consistent with the rest of the file.

### MINOR #7 — `summarizeEvent` unit tests
Exported the pure function from `@/tools/subagent` and added 10 focused
tests covering every branch: tool_calls with one/many names, empty
tool_calls, message_creation, ON_TOOL_EXECUTE with/without names,
completed steps with/without a tool name, message_delta, and the
unknown-event fallback. Every user-visible label is now regression
protected.

### MINOR #8 — `config.toolCall` comment
Expanded the JSDoc to note the specific LangChain source
(`ToolRunnableConfig` in `@langchain/core/tools`, stable since 0.3.x)
and that the defensive read degrades to `undefined` on upstream
changes.

### NIT #9 — unused import
Removed `createContentAggregator` from `subagent-tools-debug.ts`.

### NIT #10 — `awaitHandlers = true` doc
Added a JSDoc block explaining the trade-off: required for
`ON_TOOL_EXECUTE` to actually block on the host's resolve, but it
also serializes observational events through any slow host handlers.
Refinement path noted for later if host-side latency becomes a problem.

Version bumped to `3.1.67-dev.3`.

Tests: 51 pass (40 SubagentExecutor + 10 new summarizeEvent + 1 new
error-phase + existing coverage retained).
danny-avila added a commit that referenced this pull request Apr 17, 2026
* feat: forward subagent child events and enable event-driven tools

Adds an opt-in event forwarding path for `SubagentExecutor` so host apps
that rely on event-driven tool dispatch (`toolDefinitions` +
`ON_TOOL_EXECUTE` handler) can use subagents — including self-spawn.

Before: a subagent in event-driven mode got `toolDefinitions` stripped in
`buildChildInputs` and ran with `callbacks: []`, so the LLM had no tool
schemas and ON_TOOL_EXECUTE never reached the host. The subagent replied
in prose without invoking any tools.

Now:
- `SubagentExecutor` accepts `parentHandlerRegistry` (a direct registry
  or a lazy getter — `Run` wires the registry onto the graph AFTER
  `createWorkflow`, so `createAgentNode` must capture lazily).
- When provided, `buildChildInputs` keeps `toolDefinitions`, and the
  child's invoke attaches a `BaseCallbackHandler` forwarder that:
    - routes ON_TOOL_EXECUTE to the parent's handler (event-driven tools
      work), and
    - wraps other custom events (run_step, run_step_delta,
      run_step_completed, message_delta, reasoning_delta) in a new
      `ON_SUBAGENT_UPDATE` envelope so hosts can render child progress
      in a separate UI surface.
- New `SubagentUpdateEvent` type and `ON_SUBAGENT_UPDATE` GraphEvents
  entry; start/stop/error envelopes fire around the child invoke.
- Legacy isolation is preserved when no registry is passed.

Test plan:
- `npx jest SubagentExecutor SubagentTool` — 36 pass (4 new covering
  forwarding, toolDefinition retention, legacy strip, lazy getter).
- Integration test against OpenAI (`subagent-event-driven-debug.ts`)
  confirms the child uses `calculator` via the parent's ON_TOOL_EXECUTE
  path and the host sees start/run_step/run_step_completed/stop
  ON_SUBAGENT_UPDATE envelopes.

* fix: gate child toolDefinitions on ON_TOOL_EXECUTE handler + propagate tool_call_id

Two follow-ups from review on #107:

1. Codex P1: `SubagentExecutor` was treating any non-null parent registry
   as "forwarding enabled" and always kept `toolDefinitions` on the
   child. Because `Run.create` always constructs a `HandlerRegistry`,
   this meant configurations with tools declared but no `ON_TOOL_EXECUTE`
   handler would hang forever — the child's `ToolNode` would dispatch a
   batch request and the forwarder would find no handler to resolve or
   reject it, leaving the promise pending.

   Now `keepToolDefinitions` is gated on
   `parentRegistry.getHandler(ON_TOOL_EXECUTE) != null`. A registry
   without the handler falls back to the legacy "strip toolDefinitions"
   path, which is a recoverable no-tools state instead of a hang.

2. `SubagentUpdateEvent` now carries `parentToolCallId`. `Graph.ts`
   pulls it from `config.toolCall.id` (LangChain threads the originating
   `ToolCall` onto `RunnableConfig`) when the subagent tool is invoked.
   This lets hosts correlate subagent updates to the parent tool call
   deterministically instead of inferring by event ordering — a concern
   raised on the LibreChat PR using this event stream.

Also bumps to `3.1.67-dev.2`.

Tests: `npx jest SubagentExecutor` — 38 pass (2 new covering the handler
gate and `parentToolCallId` propagation). Integration script verified
against real OpenAI: every ON_SUBAGENT_UPDATE envelope now carries the
parent's `tool_call_id`.

* chore: address audit findings — drop unused param, tighten cast, add tests

Addresses the comprehensive audit of #107:

### MAJOR #1 — ON_TOOL_EXECUTE forwarding coverage
Two new tests in the `event forwarding` suite drive the forwarder
callback directly with a synthesized `ToolExecuteBatchRequest`:
- `routes child ON_TOOL_EXECUTE dispatches through the parent registry`
  asserts the parent's `ON_TOOL_EXECUTE` handler is invoked, the batch's
  `resolve` fires with canned results, and the child receives them.
- `does NOT forward ON_TOOL_EXECUTE when the parent registry has no
  handler (safe fallback)` documents the companion case: without a
  handler the `resolve` never fires, which is why the executor gates
  `keepToolDefinitions` on handler presence in the first place.

### MINOR #2 — unused `childGraph` parameter
Removed from `createForwarderCallback`'s arg shape, destructuring, and
call site. It was accepted as `_childGraph` and never read.

### MINOR #3 — `data as never`
`EventHandler.handle` didn't include `ToolExecuteBatchRequest` in its
accepted data union, which is why the forwarder used `as never` to
shut the compiler up. Extended the union and switched the cast to the
concrete `ToolExecuteBatchRequest` type — the forwarder is now fully
type-checked on the hot path.

### MINOR #4 — error-phase envelope test
`emits an 'error' phase envelope when the child graph throws` uses
the throwing factory helper already in the file, registers a real
`ON_SUBAGENT_UPDATE` handler, and asserts start/error ordering plus
the `data.message` payload and `parentToolCallId` passthrough.

### MINOR #6 — dynamic imports in tests
Replaced `await import('@/events')` / `await import('@/common')` with
static top-level imports. Consistent with the rest of the file.

### MINOR #7 — `summarizeEvent` unit tests
Exported the pure function from `@/tools/subagent` and added 10 focused
tests covering every branch: tool_calls with one/many names, empty
tool_calls, message_creation, ON_TOOL_EXECUTE with/without names,
completed steps with/without a tool name, message_delta, and the
unknown-event fallback. Every user-visible label is now regression
protected.

### MINOR #8 — `config.toolCall` comment
Expanded the JSDoc to note the specific LangChain source
(`ToolRunnableConfig` in `@langchain/core/tools`, stable since 0.3.x)
and that the defensive read degrades to `undefined` on upstream
changes.

### NIT #9 — unused import
Removed `createContentAggregator` from `subagent-tools-debug.ts`.

### NIT #10 — `awaitHandlers = true` doc
Added a JSDoc block explaining the trade-off: required for
`ON_TOOL_EXECUTE` to actually block on the host's resolve, but it
also serializes observational events through any slow host handlers.
Refinement path noted for later if host-side latency becomes a problem.

Version bumped to `3.1.67-dev.3`.

Tests: 51 pass (40 SubagentExecutor + 10 new summarizeEvent + 1 new
error-phase + existing coverage retained).

* fix: clear parent run summary and discoveredTools from child inputs

Paired with Codex P1 on the LibreChat PR (danny-avila/LibreChat#12725)
that uses this event stream. `buildChildInputs` shallow-spreads the
parent's `AgentInputs`, which carries two run-scoped fields that leak
into the isolated child by default:

- `initialSummary` — cross-run conversation summary. On a conversation
  that has already been summarized, every child would inherit it and
  reason against unrelated prior chat.
- `discoveredTools` — tool names the parent's LLM searched for in
  earlier turns via `tool_search`. The child gets primed to use
  whatever the parent happened to find.

Both are cleared on the returned child inputs. Configuration fields
(summarization policy, provider options, context pruning config) are
preserved — those are policy, not context.

Affects both paths:
- Self-spawn: `resolveSubagentConfigs` clones parent `_sourceInputs`,
  then `buildChildInputs` strips.
- Explicit children: host-built `AgentInputs` goes through the same
  `buildChildInputs` before the child graph runs.

Version bumped to `3.1.67-dev.4`.

Regression test: `strips parent-run-scoped initialSummary and
discoveredTools from child inputs` — seeds parent inputs with
`initialSummary: { text, tokenCount }` and `discoveredTools: [...]`,
asserts both are `undefined` on the returned child. 52 pass.

* test: cover summarizeEvent step.type fallback branches

Addresses reviewer F2.1 (NIT) on #107. `summarizeEvent` resolves
run-step detail type via `step.stepDetails?.type ?? step.type ?? 'step'`.
The existing tests all passed `{ stepDetails: { type } }`, leaving the
second and third clauses of the chain uncovered.

Two new tests:
- top-level `step.type` (no `stepDetails` wrapper) for both
  `tool_calls` and `message_creation` — exercises the second clause
- empty `{}` payload — exercises the `'step'` default and the generic
  `Step: <detailType>` branch

Carried MINOR #5 (`SubagentUpdateEvent.data: unknown`) left as a
documented deferral; the JSDoc already notes "shape depends on phase"
and a discriminated union is a larger refactor.

54/54 pass.

* chore: add npm run scripts for subagent debug scripts

- `npm run subagent` — supervisor + researcher/coder subagent demo
  (no tool use; from #104)
- `npm run subagent:events` — event-driven flow matching LibreChat's
  architecture (toolDefinitions + ON_TOOL_EXECUTE handler + self-spawn
  with calculator). This is the script that demonstrates the #107 fix.
- `npm run subagent:tools` — traditional LangChain tool instances with
  self-spawn; pre-fix parity case.

All three accept `OPENAI_API_KEY` (and `--provider anthropic` for the
first).
danny-avila added a commit that referenced this pull request Apr 19, 2026
* feat: add hook registry and executor scaffolding (phase 1/1) (#87)

* feat: add inert hook registry and executor scaffolding (phase 1/1)

Introduces `src/hooks/` with types, registry, and executor for a 12-event
hook lifecycle. Purely additive — not exported from `src/index.ts` and not
yet wired into `Run`, `Graph`, or `ToolNode`. Integration lands in the
next PR so this change is shippable independently with zero behavioral
impact on hosts that don't opt in.

- Discriminated-union input/output per event; generic `HookCallback<E>`
  and `HookMatcher<E>` for type-safe registration.
- `HookRegistry` uses `Map<sessionId, bucket>` (not `Record`) to avoid
  O(n^2) churn under parallel registration in multi-tenant hosts.
- `executeHooks` fans out in parallel, races hooks against a combined
  parent + timeout `AbortSignal` so non-listening hooks still time out,
  folds decisions with `deny > ask > allow` precedence, accumulates
  `additionalContext`, and self-removes `once: true` matchers after any
  successful hook. Errors are non-fatal: swallowed into the aggregated
  result and routed through an optional winston logger (falling back to
  `console.warn`), with internal matcher errors suppressed.

47 tests under `src/hooks/__tests__/` cover registry scoping, regex
matching, precedence folding, once-self-removal (success/failure/mixed),
timeout enforcement (including non-listening hooks), error non-fatality,
and parent signal combination.

* fix(hooks): address review findings — rename field, wire updatedOutput, cache patterns

Resolves the comprehensive review on #87:

- Rename `HookMatcher.matcher` -> `HookMatcher.pattern` to fix the
  self-referential naming (`m.matcher` reads as "the matcher's matcher").
- Wire `PostToolUseHookOutput.updatedOutput` through `AggregatedHookResult`
  and `fold()`. The type was previously a promise the executor didn't
  fulfill — a hook returning `{ updatedOutput: ... }` was silently dropped.
- Correct the JSDoc on `updatedInput`/`updatedOutput`. The previous copy
  claimed non-determinism "in Promise.all resolution order" — but
  `Promise.all` preserves input order, so the fold is deterministic in
  registration order (outer loop over matchers, inner loop over each
  matcher's hooks). Updated the `executeHooks` function docstring to match.
- Add regex compilation cache and length bound to `matchers.ts`. Patterns
  compile at most once per unique string and are reused across calls;
  patterns over MAX_PATTERN_LENGTH (512) are rejected outright as a cheap
  ReDoS mitigation. Documented the trust model (patterns are trusted input;
  hosts must validate user-supplied patterns upstream).
- Document the wildcard-only semantics on query-less events: a non-empty
  pattern on an event that doesn't supply a matchQuery (RunStart, Stop,
  etc.) never matches.
- Document the `once: true` concurrent-dispatch limitation: two parallel
  `executeHooks` calls each snapshot the matcher list, so `once` means
  "once per call", not "once globally". Matches Claude Code's semantics.
  Added a test that pins this behavior.
- Merge `matchers.filter(...)` + task-build loop into a single pass per
  AGENTS.md "consolidate sequential O(n) operations."
- Scope `eslint-disable no-console` to the single `console.warn` call in
  `reportErrors` instead of disabling it file-wide.
- Fix the timeout-ignoring-hook test: clear the dangling `setTimeout` and
  assert the error surfaces an abort-shaped label.
- Add tests: multi-hook `preventContinuation` (first-writer-wins on
  stopReason, flag without reason), `updatedOutput` flow + registration
  order, registration-order determinism for `updatedInput`, pattern
  length bound, compilation cache hit/miss/clear, concurrent `once`
  dispatch. Total: 47 -> 59 tests.

Findings audited and resolved:
- #1 ReDoS: cache + length bound + trust model doc (full fix blocked on
  host-side validation)
- #2 Wrong updatedInput ordering JSDoc: fixed
- #3 Dead `updatedOutput` type: wired through
- #4 Concurrent `once` race: documented + test
- #5 `HookMatcher.matcher` naming: renamed to `pattern`
- #6 `WideCallback` dead code: rejected as inaccurate (used in `runHook`)
- #7 Eslint-disable scope: line-scoped
- #8 File-path comments: rejected; matches existing project convention
- #9 Two-pass filter: single pass
- #10 Regex recompilation: cached
- CL-4 Undocumented wildcard semantics: documented
- RT-3 Timeout test dangling timer: fixed + error content verified
- RT-4 No multi-hook preventContinuation test: added

* perf(hooks): atomic once, bounded LRU, ReDoS heuristic, shared signal

Rework the phase-1 hooks scaffolding for the multi-tenant scale we
actually operate at. These are behavior changes (semantics of `once`
tightened, new pattern-rejection path) but none of them are wired into
the runtime yet — this PR is still inert — so the blast radius is the
hooks directory only.

## Atomic `once: true`

Removal now happens synchronously inside `executeHooks`, between
`registry.getMatchers` and the first `await`. Node's event loop
serialises sync work, so two concurrent `executeHooks` calls can never
both observe the same `once` matcher — whichever call runs its sync
prefix first consumes it, and the loser sees an empty bucket.

Semantic change: `once` is now at-most-one-dispatch, not
at-most-one-successful-execution. If every hook in a `once` matcher
throws, the matcher is still gone. The old retry-on-failure behavior
was Claude Code's model — fine for a single-user CLI, but in a
multi-tenant server it opens a window where concurrent bursts can
re-dispatch a "once" hook non-idempotently. "At most once, ever" is
the correct semantic for a shared registry. Hosts that need retry
should register a normal matcher and self-unregister via the callback
returned from `registry.register`.

Dropped `collectOnceMatchersForRemoval` + its Map/array allocations —
the post-hoc collection step is no longer needed.

Tests: replaced "keeps the matcher when every hook throws" with
"removes the matcher even when every hook in it throws". Added
"fires exactly once across 3 concurrent calls" and "fires exactly once
across 8-way concurrent dispatch when hooks are slow" (the slow test
deliberately uses a 10ms hook to force overlap).

## Bounded LRU regex cache

`matchers.ts` was unbounded. Under multi-tenant use where different
tenants register different patterns, the cache would leak. Capped at
`MAX_CACHE_SIZE = 256` entries with LRU eviction: hits refresh position
(delete-then-set), misses at capacity evict the oldest key. Failed
compiles are cached the same way so a tenant spamming bad patterns
doesn't burn CPU on every call.

Exposed `getMatcherCacheSize` and `MAX_CACHE_SIZE` for test
observability. Tests verify eviction order (cold patterns recompiled
after overflow) and that hot patterns refreshed mid-stream survive
eviction pressure.

## ReDoS heuristic: nested-quantifier detector

`hasNestedQuantifier` walks a pattern linearly, tracks paren depth,
and rejects any pattern where a quantified group contains another
quantifier — the `(a+)+`, `(.*)*`, `(\w+)+` shape that is the #1
catastrophic-backtracking source. Character classes (`[*+?]`) and
escaped quantifiers (`\+`) are correctly ignored. Rejection happens
before `new RegExp`, so adversarial input never reaches the compiler.

This is a floor, not a ceiling — it doesn't catch ambiguous-alternation
ReDoS like `(a|a)+`. Hosts that accept user-supplied patterns must
still validate upstream. Documented in the `HookMatcher.pattern` and
`matchesQuery` JSDoc.

Test verifies that `(a+)+$` on a 35-char adversarial input resolves in
under 50ms (would be seconds without the heuristic).

## Shared `AbortSignal` per matcher

Was: one `AbortSignal.timeout(ms)` + one `AbortSignal.any([parent,
timeout])` per hook. For a matcher with N hooks and the same timeout,
that's N timers + N composite signals.

Now: one signal per matcher, shared across all its hooks. Since hooks
in a matcher already share `matcher.timeout`, they may as well share
the timer. Behavior is identical (all hooks fire and race the same
deadline) but allocation drops from N to 1 per matcher.

## Tests

72 passing (was 59, +13):
- Atomic once: 2 concurrent tests + 1 failure-removal test updated
- LRU: cache size cap, eviction order, hit refresh
- Nested quantifier: classic shapes, deep nesting, escaped chars,
  character classes, false positives on legitimate patterns
- ReDoS mitigation: rejection path, <50ms response on adversarial input

* fix(hooks): fix ReDoS heuristic false-positives on (?:...) groups

Resolves the follow-up review on #87. Four findings, all valid.

## #1 MAJOR — hasNestedQuantifier false-positives on group syntax

The previous scanner treated `?` as a quantifier at any depth > 0, but
`?` immediately after `(` is group syntax, not a quantifier:
`(?:...)`, `(?=...)`, `(?!...)`, `(?<name>...)`, `(?<=...)`, `(?<!...)`.
Patterns like `(?:pre_)?tool_name` were silently rejected and the
registering hook never fired — with no error surfaced to the caller.

Fix: explicit group-syntax prefix skip inside the `(` handler. When the
scanner enters a group it peeks for `?` and advances past the modifier
characters (`:`, `=`, `!`, `<=`, `<!`, or a named-group label `<name>`)
before processing the group body. The `?` is therefore never considered
a quantifier at the start of a group.

While in there I also fixed a second correctness bug the reviewer
didn't catch: the old depth-indexed array didn't propagate a child
group's "contains quantifier" signal up to its parent, so `(?:(a+))+`
(outer quantifier over a wrapped quantified group, equivalent to
`(a+)+`) escaped detection. The scanner now uses an explicit stack of
`QuantifierFrame` records, and when a child group closes it propagates
`hasBacktrackRisk` to the parent frame — whether the child itself was
quantified or not. This catches `(?:(a+))+` correctly while still
allowing non-risky wrappers like `(?:(ab))+`.

Added 9 tests covering non-capturing groups, lookaheads, lookbehinds,
named captures, `(?:(a+))+` risk propagation, and `((ab)+)+` deep
wrapping. Verified existing ReDoS-detection tests still pass.

## #2 MINOR — executeHooks.test.ts did not clear the pattern cache

Matcher cache is module-level. `matchers.test.ts` clears it in
`beforeEach`; `executeHooks.test.ts` did not, so patterns compiled
during one test persisted across subsequent tests in the same file.
Added `clearMatcherCache()` to `executeHooks`'s `beforeEach` — no
test failures today, but restores test independence.

## #3 MINOR — Test utilities leaked into production barrel

`clearMatcherCache` and `getMatcherCacheSize` are test-only helpers
(their JSDoc says so) but were exported from `src/hooks/index.ts`.
When the integration PR eventually re-exports `src/hooks` from
`src/index.ts` they would become public API.

Removed both from the barrel. Tests already import them directly from
`../matchers`, so no test changes needed. `hasNestedQuantifier`,
`MAX_PATTERN_LENGTH`, and `MAX_CACHE_SIZE` remain exported — hosts can
legitimately use them for upstream validation.

## #4 NIT — LRU refresh overhead at low cache pressure

Every `compile()` cache hit was doing `delete + set` to refresh LRU
position, even when the cache was nowhere near capacity. With
`MAX_CACHE_SIZE = 256` and typical working sets of tens of patterns,
eviction pressure is near-zero and the refresh is pure overhead.

Added a `LRU_REFRESH_THRESHOLD = 75%` gate: refresh only runs when the
cache is above 192 entries. Below that the code fast-paths straight
back from `compile()`. Above it the LRU semantics kick in so hot
patterns still survive evictions — the existing "refreshes LRU
position on hit" test still passes because by the time it runs the
check, the cache is at capacity.

Tests: 72 -> 81 (+9 for group-syntax and risk-propagation coverage).

* feat: wire run-level hooks into processStream (#88)

* feat: wire run-level hooks into processStream (phase 1/2)

Fires four hook lifecycle events from `Run.processStream`, making the
inert hooks module from PR #87 live. Hosts pass a pre-constructed
`HookRegistry` via `RunConfig.hooks`; when absent, all hook dispatch
is skipped (zero overhead for non-adopters).

## Integration points

- `RunConfig.hooks?: HookRegistry` — new optional field, mirrors the
  existing `customHandlers` pattern
- `Graph.hookRegistry` — field alongside `handlerRegistry`, cleared in
  `clearHeavyState()`
- `Run` constructor wires the registry onto both `this.hookRegistry`
  and `this.Graph.hookRegistry`

## Hook fire points in `processStream`

1. **RunStart** — after config setup, before stream creation. Receives
   `inputs.messages`, `runId`, `threadId`, `agentId`.
2. **UserPromptSubmit** — immediately after RunStart. Extracts prompt
   text from the last human message. If the hook returns
   `{decision: 'deny'}` or `{decision: 'ask'}`, the run aborts early
   and returns `undefined`. The `ask` case is an abort in v1 — the SDK
   returns the decision and the host decides how to implement
   interactive approval (see Phase 2 plan in the skills design report).
3. **Stop** — at the end of the try block, after the for-await loop.
   Receives the accumulated messages from `Graph.getRunMessages()`.
   Only fires on successful completion (not on error).
4. **StopFailure** — in a new catch block between try and finally.
   Receives the error message and last assistant message. Hook errors
   are swallowed (`.catch(() => {})`) so the original error propagates.

## Session teardown

`hookRegistry.clearSession(runId)` is called in the finally block,
guaranteeing cleanup even on error. Fulfills §3.9 rule #5 from the
design report.

## Public API

`src/hooks/` is now re-exported from `src/index.ts`. Hosts can import
`HookRegistry`, `executeHooks`, and all hook types directly from
`@librechat/agents`.

## Helper functions

Three module-level functions added to `run.ts`:
- `findLastHumanMessage` — backward scan for `getType() === 'human'`
- `findLastAssistantMessage` — backward scan for `getType() === 'ai'`
- `extractPromptText` — handles string and complex content arrays

## Tests

10 new integration tests using `Run.create` + `overrideTestModel` +
`processStream` (same pattern as `custom-event-await.test.ts`):
- RunStart fires with correct fields
- UserPromptSubmit extracts prompt, deny aborts, ask aborts
- Stop fires on success, does not fire on error
- StopFailure fires on error, preserves original exception
- Session cleared in finally (both success and error paths)
- No-hooks baseline works identically

Total: 81 existing + 10 integration = 91 passing.

* fix(hooks): address PR 2 review — DRY helpers, test coverage, guards

Resolves all findings from the comprehensive review on #88.

- #1 MAJOR: Add 3 tests for `extractPromptText` — multi-part content
  (text + image blocks yields 'hello\nworld'), image-only content
  (yields ''), and empty content array (yields '').
- #2 MINOR: Merge `findLastHumanMessage`/`findLastAssistantMessage`
  into parameterized `findLastMessageOfType(messages, type)`.
- #3 MINOR: Rewrite `extractPromptText` array path from
  `.filter().map().join()` triple-pass to single `for...of` loop
  accumulating into `parts[]` then `.join('\n')`.
- #4 MINOR: Add `config.callbacks = undefined` on deny/ask early
  return path so Langfuse handler reference is cleaned up even
  though no stream ran.
- #5 MINOR: Add comment on `UserPromptSubmit` fire point noting
  `attachments` is not yet wired (Phase 2).
- #6 MINOR: Strengthen `StopFailure` test assertion from
  `toBeTruthy()` to `typeof === 'string'` + `length > 0`.
- #7 MINOR: Add test for empty-content human message — verifies
  `UserPromptSubmit` fires with `prompt: ''`.
- #8 NIT: Add inline comment on `stopHookActive: false` explaining
  it will be `true` in Phase 2 when stop is triggered by a hook.
- #9 NIT: Use `hasHookFor('Stop', sessionId)` and
  `hasHookFor('StopFailure', sessionId)` guards before
  `getRunMessages()` to avoid the array copy when no hooks are
  registered for the event.
- CL-2: Wrap Stop hook dispatch in its own `.catch()` so an
  infrastructure error in `executeHooks` (theoretical — it catches
  internally) cannot masquerade as a stream failure in the catch
  block.

Tests: 91 -> 94 (+3 for extractPromptText coverage).

* feat: wire tool hooks into event-driven ToolNode (phase 1/3) (#90)

* feat: wire tool hooks into event-driven ToolNode (phase 1/3)

Fires PreToolUse, PostToolUse, PostToolUseFailure, and PermissionDenied
hooks inside `ToolNode.dispatchToolEvents` — the single chokepoint for
all event-driven tool execution in LibreChat.

## Hook lifecycle in dispatchToolEvents

1. **PreToolUse** fires per call in parallel before ON_TOOL_EXECUTE
   dispatch. Denied calls (deny or ask) produce error ToolMessages and
   fire PermissionDenied; surviving calls proceed with optional
   updatedInput applied to tool args.

2. Only approved calls are batched into ON_TOOL_EXECUTE and sent to the
   host. If all calls are denied, the batch dispatch is skipped entirely.

3. **PostToolUse** fires per successful result. If a hook returns
   updatedOutput, the ToolMessage content is replaced before the message
   is appended to the graph.

4. **PostToolUseFailure** fires per error result. Observational only —
   cannot modify the error message.

Post hooks are awaited (not fire-and-forget) so updatedOutput takes
effect before the ToolMessage is consumed. Hook errors are swallowed
via .catch() to avoid disrupting the tool result flow.
PermissionDenied is fire-and-forget since it's purely observational.

## Plumbing

- `ToolNodeOptions.hookRegistry?: HookRegistry` — new optional field
- `ToolNode` constructor stores `this.hookRegistry`
- `Graph.createToolNode()` passes `this.hookRegistry` at both
  construction sites (event-driven and traditional)
- `hookRegistry` is set on Graph BEFORE `createWorkflow()` is called
  (moved from the post-create assignment in Run constructor into
  `createLegacyGraph` and `createMultiAgentGraph`) so ToolNode receives
  a non-undefined registry at construction time

## hasHookFor guards

All four hook fire points use `hasHookFor(event, runId)` before calling
`executeHooks`, so when no hooks are registered for the event the
overhead is a single Map lookup + array length check — no Promise
allocations, no function calls.

## Tests

9 new integration tests in `src/hooks/__tests__/toolHooks.test.ts`
using event-driven mode (toolDefinitions + ON_TOOL_EXECUTE handler):

- PreToolUse fires with correct fields
- PreToolUse deny blocks execution (tool handler never called)
- PreToolUse ask blocks execution (v1)
- PreToolUse updatedInput rewrites args before dispatch to host
- PreToolUse hook errors are non-fatal (tool still executes)
- PermissionDenied fires after deny with reason
- PostToolUse fires with tool output
- PostToolUseFailure fires on error result
- No-hooks baseline works identically

Total: 94 existing + 9 new = 103 passing.

* fix(hooks): address PR 3 review — step completion, ordering, catch, tests

Resolves all 10 findings from the review on #90.

## Structural rewrite of dispatchToolEvents

The post-processing loop was rewritten to fix #1, #2, #6, #8, #10:

- #1 MAJOR: Denied calls now dispatch ON_RUN_STEP_COMPLETED via a
  shared `dispatchStepCompleted` helper (extracted from the duplicate
  logic in the executed-results loop). Without this, the frontend
  would show stuck spinners for denied tool calls.

- #2 MAJOR: Output messages are now collected in a Map keyed by
  tool_call_id, then returned in the original `toolCalls` input order
  via `toolCalls.map(call => messageByCallId.get(call.id))`. Before,
  denied messages were prepended to executed messages, breaking the
  implicit ordering contract.

- #6 MINOR: `toolUsageCount` is now incremented only for approved
  calls (moved from the initial `preToolCalls.map` into the
  `approvedEntries.map` that builds ToolCallRequests). Denied calls no
  longer inflate the turn counter.

- #8 MINOR: Eliminated the double `result.status === 'error'` check.
  ToolMessage is now constructed inside each status branch directly
  after hook dispatch.

- #10 NIT: `requests.find()` replaced with a pre-built
  `Map<id, ToolCallRequest>` for O(1) lookup per result, per AGENTS.md
  guidance on Map/Set over Array.find.

## Error protection

- #3 MAJOR: PreToolUse `Promise.all` now wraps each `executeHooks`
  call in `.catch(() => emptyResult)` so an infrastructure error in
  the hooks subsystem cannot crash the entire tool batch. Matches the
  stated "hook errors are non-fatal" contract and is consistent with
  PostToolUse/PostToolUseFailure which already had `.catch()`.

## New tests

- #4 MAJOR: Added `updatedOutput replaces the ToolMessage content` —
  registers a PostToolUse hook returning `{updatedOutput: 'REDACTED'}`,
  verifies the ToolMessage content in the graph's run messages is
  `'REDACTED'` (not the original output).

- #7 MINOR: PermissionDenied test replaced 50ms `setTimeout` with a
  `Promise` resolved by the hook callback. No more flaky sleep.

## Cleanup

- #9 NIT: Updated stale "once the tool-hook PR lands" comment in
  `src/hooks/index.ts`.

Tests: 103 -> 104 (+1 for updatedOutput).

* fix(hooks): off-by-one in dispatchStepCompleted index, test guard, turn fallback

- #1 MAJOR: dispatchStepCompleted read post-increment toolUsageCount
  for the `index` field. Added `turn?: number` parameter; approved-call
  site passes `request?.turn` (the pre-increment value), denied-call
  site omits it (fallback is correct since count was never bumped).
- #2 NIT: updatedOutput test now asserts `expect(toolMsg).toBeDefined()`
  before content extraction so a missing tool message produces a clear
  failure rather than "Expected undefined to be 'REDACTED'".
- #3 NIT: PreToolUse input `turn` field restored `?? 0` fallback so
  first-use tools see `turn: 0` instead of `turn: undefined`.

* fix(hooks): batch tests, turn JSDoc, freeze sentinel, remove unused hookRegistry

Addresses follow-up review findings on #90.

- #1 MAJOR: Add multi-call batch tests — partial deny (echo denied,
  math approved, order preserved) and all-denied (ON_TOOL_EXECUTE
  handler never called). Uses counter-based IDs (#7 fix) to avoid
  Date.now() collisions across calls in the same test.
- #2 MINOR: Document PreToolUseHookInput.turn semantics — within a
  single batch, all calls to the same tool share the same turn value.
  Per-call discrimination within a batch is not supported in v1.
- #3 MINOR: Remove hookRegistry from the traditional (non-event-driven)
  ToolNode construction site in Graph.ts. Hooks only fire in
  dispatchToolEvents; passing the registry to a ToolNode that never
  uses it created a false impression of hook support.
- #4 MINOR: Add PostToolUse error non-fatality test — throwing hook
  does not crash the batch, original content is preserved.
- #5+#8 MINOR: Freeze the hook-error fallback sentinel via
  Object.freeze and rename from emptyResult to HOOK_FALLBACK. Prevents
  future mutation bugs if someone pushes to the shared arrays.
- #6 MINOR: Document in ToolNodeOptions.hookRegistry JSDoc that hooks
  only fire for event-driven calls — directToolNames bypass dispatch.

Tests: 104 -> 107 (+3 for batch partial-deny, all-denied, and
PostToolUse error resilience).

* nit: simplify test assertions, keep shallow freeze (deep conflicts with mutable type)

* feat: SkillTool primitive, injectedMessages, and catalog formatter (#91)

Phases 1+2 of the Skills system, rebased on hooks PRs (#88, #90):

Types:
- InjectedMessage type in tools.ts (generic, any tool can use it)
- SkillCatalogEntry type in skill.ts
- ToolExecuteResult.injectedMessages field
- Constants.SKILL_TOOL = 'skill'

SkillTool (src/tools/SkillTool.ts):
- JSON Schema (single source of truth), LCTool definition, createSkillTool()
- Runtime-agnostic description (files "may" become accessible)
- Requires event-driven execution mode

ToolNode integration (src/tools/ToolNode.ts):
- convertInjectedMessages() converts to HumanMessage (both roles, avoids
  Anthropic/Google SystemMessage rejection). Original role in additional_kwargs.
- dispatchToolEvents returns { toolMessages, injected } tuple
- Injected messages placed AFTER ToolMessages (provider ordering)
- try/catch isolation around injection conversion
- Works with hooks-based PreToolUse/PostToolUse lifecycle

Catalog formatter (src/tools/skillCatalog.ts):
- formatSkillCatalog() with truncation ladder and budget enforcement
- fitNamesOnly drops trailing entries when names-only exceeds budget

Tests: 30 total (18 SkillTool + 12 skillCatalog)

* refactor: remove unused Zod schema and createSkillTool function from SkillTool.ts

* feat: add bash execution and bash programmatic tool calling tools (#93)

* feat: add local bash execution and bash programmatic tool calling tools

Implement BashExecutor for local bash command execution via child_process
and BashProgrammaticToolCalling for multi-tool orchestration using bash
scripts with a local HTTP server for tool IPC via curl.

* fix: use remote Code API for bash tools instead of local child_process

BashExecutor now sends `lang: 'bash'` to the same /exec endpoint as
CodeExecutor. BashProgrammaticToolCalling sends bash code to
/exec/programmatic with the same round-trip loop as PTC. Both tools
share session/file handling with their counterparts. ToolNode session
injection and storage updated for all four code execution tool names.

* refactor: rename EXECUTE_BASH constant to BASH_TOOL for consistency

Updated the constant name from EXECUTE_BASH to BASH_TOOL in the Constants enum and adjusted references in BashExecutor and ToolNode to reflect this change. This improves clarity and aligns with naming conventions.

* feat: add SKILL_TOOL to session storage and context in ToolNode (#94)

When the skill handler uploads files to the code env and returns
artifact: { session_id, files }, ToolNode now stores that session
context (via storeCodeSessionFromResults). This makes skill-primed
files available to subsequent bash/code tool calls in the same run.

Also adds SKILL_TOOL to the codeSessionContext request building in
dispatchToolEvents, so the handler receives the current session
context and can check if files are already uploaded.

* feat: initialSessions on RunConfig + CODE_EXECUTION_TOOLS DRY helper (#95)

* feat: add initialSessions to RunConfig for seeding Graph session state

Allows hosts to seed the Graph's ToolSessionMap at run creation time.
Used by skill file re-priming: after uploading skill files to the code
env at run start, the host passes the session info so ToolNode can
inject session_id + files into subsequent bash/code tool calls.

* refactor: add CODE_EXECUTION_TOOLS set, DRY ToolNode checks

Add CODE_EXECUTION_TOOLS ReadonlySet to common/enum.ts containing
all four code execution tool constants (EXECUTE_CODE, BASH_TOOL,
PROGRAMMATIC_TOOL_CALLING, BASH_PROGRAMMATIC_TOOL_CALLING).

Replace 5 repeated 4-constant inline checks in ToolNode with
CODE_EXECUTION_TOOLS.has(). SKILL_TOOL remains a separate check
where needed since it's not a code execution tool itself.

* chore: update version to 3.1.66-dev.0 and refresh package-lock.json

* feat: reconstruct skill body HumanMessages in formatAgentMessages (#96)

When a skill is invoked, the body is injected as a HumanMessage into
LangGraph state but NOT persisted to conversation history. On follow-up
runs the skill body is lost.

formatAgentMessages now accepts an optional invokedSkillBodies Map
(skillName → body) and reconstructs HumanMessages at the correct
position in the message sequence — after the skill's ToolMessage,
matching where ToolNode originally injected them.

Detection happens inside the existing tool_call processing loop (both
with and without tools filtering), so there are zero extra message
iteration passes.

* fix: skill body reconstruction token distribution + review fixes (#97)

Critical fix:
- Capture endMessageIndex BEFORE skill body HumanMessage injection so
  injected messages are excluded from the assistant message's token
  distribution range. Prevents indexTokenCountMap corruption.

Review fixes:
- Remove dead else-if(!discoveredTools) branch inside if(discoveredTools)
  block and revert redundant optional chaining
- Extract extractSkillName() helper to eliminate DRY violation (identical
  args parsing IIFE was duplicated in two locations)
- Use Set<string> instead of string[] for pendingSkillNames (dedup)
- Lazy allocation via ??= (only allocated when skills are present)

* chore: eslint

* chore: remove `createSkillTool`

* fix: resolve review findings for hooks & skills PR (#100)

* fix: resolve review findings for hooks & skills PR

- Fix stale JSDoc referencing non-existent future PR (run.ts)
- Use requestMap (Map.get) instead of Array.find in storeCodeSessionFromResults
- Extract shared updateCodeSession helper to eliminate duplicated session storage logic
- Add sync critical section markers around once-matcher removal in executeHooks
- Optimize fitNamesOnly from O(n²) to O(n) with running character sum
- Widen ReDoS timing assertion from 50ms to 200ms for CI stability
- Add test verifying ON_RUN_STEP_COMPLETED dispatch for denied tool calls

* test: assert event shape in deny step completion test

* feat: add ReadFile tool definition for general-purpose file reading (#99)

New READ_FILE constant and ReadFileToolDefinition for a general-purpose
file reader. Schema: { file_path: string }. Skill files use the path
format {skillName}/{path} (e.g. "pdf-analyzer/src/utils.py").

The tool is event-driven only (host handles via ON_TOOL_EXECUTE).
Works without code execution for skill files (DB/storage reads).
With code env, also supports code execution output files.

* chore: update ToolNode session tests for requestMap signature change (#101)

* chore: add --tag dev for prerelease npm publish (#102)

* feat: wire PreCompact/PostCompact hooks around summarization node (#103)

* feat: wire PreCompact/PostCompact hooks around summarization node

Fires PreCompact and PostCompact hooks inside `createSummarizeNode`,
the last two SDK-side hook events. These are auditing hooks — hosts
can archive the full transcript before compaction, observe the summary
produced, or log compaction events for telemetry.

## Fire points

- **PreCompact** fires after ON_SUMMARIZE_START, before the LLM
  summarization call. Receives `messagesBeforeCount` and the trigger
  type from the agent's summarization config.
- **PostCompact** fires after ON_SUMMARIZE_COMPLETE, after the summary
  is generated and enriched. Receives the summary text and
  `messagesAfterCount: 0` (the node removes all messages; the summary
  is injected into the system prompt, not as a message).

Both hooks are observational and non-blocking — errors are swallowed
via `.catch()`. Results are not consumed.

## Type fix

`PreCompactHookInput.trigger` was defined as `'threshold' | 'manual'
| 'error'` (from the initial PR #87) but those don't match the real
summarization trigger types. Updated to:
`'token_ratio' | 'remaining_tokens' | 'messages_to_refine' | 'default'`
with `(string & {})` for forward compatibility. `'default'` covers the
case where no trigger is configured and compaction fires on any pruning.

## Plumbing

`hookRegistry` added to `CreateSummarizeNodeParams.graph` interface.
`Graph.ts` passes `this.hookRegistry` at the `createSummarizeNode`
call site, same pattern as ToolNode.

## Tests

4 new integration tests using `FakeListChatModel` for the summarization
provider (no API keys required):
- PreCompact fires with correct messagesBeforeCount and trigger
- PostCompact fires with summary text
- Hook errors don't crash compaction
- No-hooks baseline works identically

Total: 108 existing + 4 new = 112 passing.

* fix(hooks): add threadId to compaction hooks, strengthen tests, fix imports

Resolves review findings on #103.

- #1 MAJOR: Add threadId to both PreCompact and PostCompact hook
  inputs. PreCompact extracts from the node's config parameter,
  PostCompact from runnableConfig. Restores consistency with all
  other hook sites in run.ts.
- #2 MINOR: Assert trigger field in PreCompact test — verifies the
  'default' fallback when no trigger is configured.
- #3 MINOR: Strengthen PostCompact summary assertion from
  length > 0 to toContain('Summary') against the known
  FakeListChatModel response.
- #4 MINOR: Add PostCompact error resilience test — throwing hook
  in dispatchCompletionEvents does not crash compaction.
- #5 MINOR: Fix import order in node.ts per AGENTS.md (type imports
  longest→shortest, value imports longest→shortest).
- #6 NIT: Update hooks/index.ts barrel comment to list
  createSummarizeNode as third consumer.
- #8 NIT: Assert agentId in both PreCompact and PostCompact tests.

Tests: 112 -> 113 (+1 PostCompact error resilience).

* test: assert threadId propagation and exact summary match in compaction hooks

* fix: use runnableConfig for PreCompact threadId extraction (matches PostCompact)

* feat: add subagent-as-tool primitive for hierarchical agent delegation (#104)

* feat: add subagent-as-tool primitive for hierarchical agent delegation

Adds a SubagentTool that lets a parent agent delegate tasks to child
agents with isolated context windows. Verbose tool outputs stay in
the child's context; only a filtered text summary returns to the parent.

Key components:
- SubagentConfig type on AgentInputs for declaring child agent types
- SubagentExecutor: creates child StandardGraph, invokes with isolated
  state, filters result (strips tool_use/thinking blocks), fires
  SubagentStart/SubagentStop hooks
- SubagentTool: LCTool definition with dynamic enum from configs
- Graph integration: tool created in createAgentNode() via graphTools,
  automatically becomes a direct tool (bypasses event-driven dispatch)
- Self-spawn support: reuse parent's AgentInputs for context isolation
  without separate agent config

Wires the existing SubagentStart (with deny support) and SubagentStop
hook types that were defined but had zero fire points.

* fix: resolve review findings for subagent primitive

Address all 6 issues from PR review:

1. Add subagentHooks.test.ts — end-to-end hook integration test
   through real Run pipeline (SubagentStart payload, SubagentStop
   messages, deny blocks execution). 3 new tests.

2. Remove `as unknown as` cast — follow MultiAgentGraph pattern
   (init graphTools as empty array, push directly).

3. Deduplicate schema — extract buildSubagentToolParams() in
   SubagentTool.ts, use it in Graph.ts. createSubagentToolDefinition()
   now delegates to it. Single source of truth for schema/description.

4. Propagate threadId — tool callback extracts thread_id from
   LangGraph config.configurable, passes to executor.execute().
   Hooks now receive correct threadId.

5. Pass tokenCounter to child graph — added to SubagentExecutorOptions,
   threaded from agentContext.tokenCounter through to child
   StandardGraph constructor.

6. Pass signal in invoke() config — workflow.invoke() now receives
   signal for more responsive abort between graph steps.

* fix: resolve second-pass audit findings for subagent primitive

Address 12 valid findings from the audit — 3 MAJOR, 6 MINOR, 3 NIT:

MAJOR fixes:

1. Depth tracking now works across graph boundaries. Previously the
   `depth`/`maxDepth` check only worked within a single executor; when
   `allowNested: true`, the child graph spawned its own executor at
   depth=0, making `maxSubagentDepth` non-functional. Fix: decrement
   `maxSubagentDepth` in `buildChildInputs`, so the child's own
   executor inherits a smaller budget and naturally blocks further
   nesting when it reaches 0. Removed the now-vestigial `depth` field.

2. Detect duplicate `SubagentConfig.type` values. `Map` construction
   silently dropped duplicate keys. Now `resolveSubagentConfigs`
   throws with a clear error.

3. Add test coverage for `allowNested: true`: verifies depth is
   decremented across boundaries, child inherits subagentConfigs,
   and budget is clamped to >= 0.

MINOR fixes:

4. Runtime guard for empty `description` in the tool callback — the
   LLM can send undefined/empty values; now we default to a safe
   fallback string.

5. Updated `src/hooks/index.ts` barrel comment to list SubagentExecutor
   as the fourth hook consumer (SubagentStart/SubagentStop).

6. Extracted property description strings as constants in
   SubagentTool.ts; `SubagentToolSchema` and `buildSubagentToolParams`
   now share a single source of truth.

7. Changed SubagentStop from fire-and-forget to awaited, matching the
   PostCompact pattern. Removes fragile `setTimeout` synchronization
   in tests. Errors still swallowed (observational).

8. Truncate error messages from child graph to 200 chars to prevent
   leaking internal details (stack traces, API errors) to the parent
   LLM.

9. Document that `ask` decision is treated identically to `deny` in
   the non-interactive subagent context.

NIT fixes:

11. Import order in SubagentExecutor.ts — `@/types` (multi-line,
    longest) now comes first in the local types sub-group.

12. Removed unused `contentParts` destructuring in verification script.

13. Fixed unsafe `msg.content as string` cast in verification script;
    now handles array content correctly.

Skipped #10 (NIT): `_sourceInputs` underscore prefix is an accepted
TS convention for cross-file-accessible-but-semantically-internal
fields.

Test count: 1074 passed (up from 1065, +9 new tests), 0 regressions.

* fix: address third-pass audit findings for subagent primitive

Resolve 3 NITs from the second review pass plus 2 new codex comments:

NIT #1 — Tighten truncation test: previously asserted only
`result.content.length < 500`. Now asserts exact envelope of 219
chars ("Subagent error: " prefix + 200 body + "..." suffix) so that
regressions in ERROR_MESSAGE_MAX_CHARS are caught. Added companion
test verifying short messages are not truncated.

NIT #2 — Skip tool creation when maxSubagentDepth: 0. When a host
set `maxSubagentDepth: 0` alongside `subagentConfigs`, the tool was
still bound to the LLM, and every invocation returned the depth-
exceeded error. The LLM wasted a turn. Now the `createAgentNode`
guard includes `effectiveSubagentDepth > 0`, so the tool is not
created at the leaf level. Added integration test.

NIT #3 — Document buildChildInputs audience. Added @remarks block
explaining this is an advanced utility exported primarily for
testing and internal use; hosts configuring subagents should not
call it directly.

Codex P1 (false positive, clarified): maxSubagentDepth propagation
across graph boundaries was already correct via the countdown in
buildChildInputs, but it wasn't obvious from the call site. Added
a multi-line JSDoc comment above the executor construction in
createAgentNode explaining how depth consumes across boundaries
through the decremented AgentInputs.maxSubagentDepth.

Codex P2 (real bug, fixed): toolSchemaTokens did not include the
subagent tool's schema. `calculateInstructionTokens()` was called
in `fromConfig()` before graphTools was populated, so the later-
injected subagent tool's schema was missing from the budget. Fixes:
  1. Include `graphTools` in the iteration in
     `calculateInstructionTokens` (also helps handoff tools, a
     pre-existing issue with the same root cause).
  2. Re-trigger the calculation at the end of `createAgentNode`
     after the subagent tool is pushed, so the token count reflects
     the final graphTools state before `createCallModel` awaits
     `tokenCalculationPromise`.

Added test asserting `toolSchemaTokens` is larger for agents with
subagents vs without, confirming the recalculation runs.

Test count: 1077 passed (up from 1074, +3 new tests), 0 regressions.

* fix: break circular dep between SubagentExecutor and StandardGraph

Rollup warned on build:
  Graph.ts → subagent/index.ts → SubagentExecutor.ts → Graph.ts

SubagentExecutor needed `StandardGraph` at runtime to construct the
child graph; Graph.ts needs SubagentExecutor to attach the tool. With
preserveModules, these land in separate chunks producing circular
chunk imports with undefined execution order.

Fix: dependency injection. SubagentExecutor now takes a required
`createChildGraph: ChildGraphFactory` option and only imports
`StandardGraph` as `import type` (erased at build). Graph.ts passes
`(input) => new StandardGraph(input)` when constructing the executor.
Runtime module graph is now acyclic; `import type` does not appear in
emitted JS.

Test updates: replaced `jest.spyOn(GraphModule, 'StandardGraph')`
(which relied on the runtime import we just removed) with direct mock
factories passed through the new option. Cleaner — tests no longer
need to reach into module internals. Same coverage.

Build: `npx rollup -c` now emits zero warnings.
Test count: 1077 passed, 0 regressions.

* chore: add multi-agent-subagent script to package.json

Introduced a new script entry for multi-agent-subagent, enabling execution of the corresponding TypeScript file with necessary configurations. This addition enhances the multi-agent functionality by allowing for more granular control over subagent operations.

* fix: isolate child subagent from parent's callback chain

The manual verification script (`npm run multi-agent-subagent`)
crashed with:

  Error: No agent context found for agent ID researcher
    at StandardGraph.getAgentContext
    at ChatModelStreamHandler.handle

Root cause: `Run.processStream` drives the parent via
`streamEvents`, which captures events from every nested runnable in
the call tree — including the child graph's LLM calls. When the
child's "researcher" agent streamed `on_chat_model_stream` events,
the parent's `ChatModelStreamHandler` received them and tried to
resolve the child's agent ID in the parent's `agentContexts` map,
which only contains the supervisor. Error thrown.

The Phase 1 plan was for child events to stay fully isolated. Unit
tests missed this because the child graph was always mocked (no
real `streamEvents` plumbing); only running the full pipeline
end-to-end surfaced it.

Fix: pass `callbacks: []` in the child `workflow.invoke()` config.
This overrides the inherited callbacks for that invocation, breaking
the event propagation chain. Also set `runName: subagent:<type>` and
a child-scoped `configurable.thread_id` so the child has a distinct
trace root rather than polluting the parent's trace.

Manual verification now works: supervisor delegates to researcher,
receives filtered text, synthesizes the final answer.

Test suite: 1079 passed, 0 subagent-suite regressions. The two newly
failing suites (ProgrammaticToolCalling / ToolSearch Live API tests)
are pre-existing — they require `LIBRECHAT_CODE_BASEURL` to be
configured and fail identically on the prior commit. They were
skipped before because OPENAI_API_KEY was absent.

* feat: align filterSubagentResult fallback with Claude Code

When a subagent's last AIMessage is pure tool_use (e.g. the model
hit maxTurns mid-tool-call), the previous implementation returned
the "Task completed" sentinel, discarding any textual progress from
earlier turns.

Claude Code's agentToolUtils.finalizeAgentTool walks backwards from
the last message, and on messages without text content continues
scanning earlier AIMessages, surfacing the most recent textual
thought instead of the sentinel.

This change matches that behavior. The three `return` sites inside
the loop that previously exited with the sentinel now `continue`
when the AIMessage yields no text; the loop exits with "Task
completed" only when no AIMessage in the history contains any text.

Concrete example:
  [0] Human: "What's the capital of France?"
  [1] AI:    "Let me search." + tool_use(search, ...)
  [2] Tool:  "Paris."
  [3] AI:    tool_use(search, ...)   <-- maxTurns, no text

  Before: "Task completed"
  After:  "Let me search."

Added two tests — the turn-limit scenario above, and the analogous
empty-string-content edge case. All nine existing filterSubagentResult
tests still pass unchanged.

* fix: update model name for non-Anthropic API to gpt-5.4

Changed the model name used in the multi-agent subagent script from 'gpt-4.1' to 'gpt-5.4' for improved performance and capabilities. This update ensures compatibility with the latest API offerings.

* v3.1.67-dev.0

* feat: forward subagent child events and enable event-driven tools (#107)

* feat: forward subagent child events and enable event-driven tools

Adds an opt-in event forwarding path for `SubagentExecutor` so host apps
that rely on event-driven tool dispatch (`toolDefinitions` +
`ON_TOOL_EXECUTE` handler) can use subagents — including self-spawn.

Before: a subagent in event-driven mode got `toolDefinitions` stripped in
`buildChildInputs` and ran with `callbacks: []`, so the LLM had no tool
schemas and ON_TOOL_EXECUTE never reached the host. The subagent replied
in prose without invoking any tools.

Now:
- `SubagentExecutor` accepts `parentHandlerRegistry` (a direct registry
  or a lazy getter — `Run` wires the registry onto the graph AFTER
  `createWorkflow`, so `createAgentNode` must capture lazily).
- When provided, `buildChildInputs` keeps `toolDefinitions`, and the
  child's invoke attaches a `BaseCallbackHandler` forwarder that:
    - routes ON_TOOL_EXECUTE to the parent's handler (event-driven tools
      work), and
    - wraps other custom events (run_step, run_step_delta,
      run_step_completed, message_delta, reasoning_delta) in a new
      `ON_SUBAGENT_UPDATE` envelope so hosts can render child progress
      in a separate UI surface.
- New `SubagentUpdateEvent` type and `ON_SUBAGENT_UPDATE` GraphEvents
  entry; start/stop/error envelopes fire around the child invoke.
- Legacy isolation is preserved when no registry is passed.

Test plan:
- `npx jest SubagentExecutor SubagentTool` — 36 pass (4 new covering
  forwarding, toolDefinition retention, legacy strip, lazy getter).
- Integration test against OpenAI (`subagent-event-driven-debug.ts`)
  confirms the child uses `calculator` via the parent's ON_TOOL_EXECUTE
  path and the host sees start/run_step/run_step_completed/stop
  ON_SUBAGENT_UPDATE envelopes.

* fix: gate child toolDefinitions on ON_TOOL_EXECUTE handler + propagate tool_call_id

Two follow-ups from review on #107:

1. Codex P1: `SubagentExecutor` was treating any non-null parent registry
   as "forwarding enabled" and always kept `toolDefinitions` on the
   child. Because `Run.create` always constructs a `HandlerRegistry`,
   this meant configurations with tools declared but no `ON_TOOL_EXECUTE`
   handler would hang forever — the child's `ToolNode` would dispatch a
   batch request and the forwarder would find no handler to resolve or
   reject it, leaving the promise pending.

   Now `keepToolDefinitions` is gated on
   `parentRegistry.getHandler(ON_TOOL_EXECUTE) != null`. A registry
   without the handler falls back to the legacy "strip toolDefinitions"
   path, which is a recoverable no-tools state instead of a hang.

2. `SubagentUpdateEvent` now carries `parentToolCallId`. `Graph.ts`
   pulls it from `config.toolCall.id` (LangChain threads the originating
   `ToolCall` onto `RunnableConfig`) when the subagent tool is invoked.
   This lets hosts correlate subagent updates to the parent tool call
   deterministically instead of inferring by event ordering — a concern
   raised on the LibreChat PR using this event stream.

Also bumps to `3.1.67-dev.2`.

Tests: `npx jest SubagentExecutor` — 38 pass (2 new covering the handler
gate and `parentToolCallId` propagation). Integration script verified
against real OpenAI: every ON_SUBAGENT_UPDATE envelope now carries the
parent's `tool_call_id`.

* chore: address audit findings — drop unused param, tighten cast, add tests

Addresses the comprehensive audit of #107:

### MAJOR #1 — ON_TOOL_EXECUTE forwarding coverage
Two new tests in the `event forwarding` suite drive the forwarder
callback directly with a synthesized `ToolExecuteBatchRequest`:
- `routes child ON_TOOL_EXECUTE dispatches through the parent registry`
  asserts the parent's `ON_TOOL_EXECUTE` handler is invoked, the batch's
  `resolve` fires with canned results, and the child receives them.
- `does NOT forward ON_TOOL_EXECUTE when the parent registry has no
  handler (safe fallback)` documents the companion case: without a
  handler the `resolve` never fires, which is why the executor gates
  `keepToolDefinitions` on handler presence in the first place.

### MINOR #2 — unused `childGraph` parameter
Removed from `createForwarderCallback`'s arg shape, destructuring, and
call site. It was accepted as `_childGraph` and never read.

### MINOR #3 — `data as never`
`EventHandler.handle` didn't include `ToolExecuteBatchRequest` in its
accepted data union, which is why the forwarder used `as never` to
shut the compiler up. Extended the union and switched the cast to the
concrete `ToolExecuteBatchRequest` type — the forwarder is now fully
type-checked on the hot path.

### MINOR #4 — error-phase envelope test
`emits an 'error' phase envelope when the child graph throws` uses
the throwing factory helper already in the file, registers a real
`ON_SUBAGENT_UPDATE` handler, and asserts start/error ordering plus
the `data.message` payload and `parentToolCallId` passthrough.

### MINOR #6 — dynamic imports in tests
Replaced `await import('@/events')` / `await import('@/common')` with
static top-level imports. Consistent with the rest of the file.

### MINOR #7 — `summarizeEvent` unit tests
Exported the pure function from `@/tools/subagent` and added 10 focused
tests covering every branch: tool_calls with one/many names, empty
tool_calls, message_creation, ON_TOOL_EXECUTE with/without names,
completed steps with/without a tool name, message_delta, and the
unknown-event fallback. Every user-visible label is now regression
protected.

### MINOR #8 — `config.toolCall` comment
Expanded the JSDoc to note the specific LangChain source
(`ToolRunnableConfig` in `@langchain/core/tools`, stable since 0.3.x)
and that the defensive read degrades to `undefined` on upstream
changes.

### NIT #9 — unused import
Removed `createContentAggregator` from `subagent-tools-debug.ts`.

### NIT #10 — `awaitHandlers = true` doc
Added a JSDoc block explaining the trade-off: required for
`ON_TOOL_EXECUTE` to actually block on the host's resolve, but it
also serializes observational events through any slow host handlers.
Refinement path noted for later if host-side latency becomes a problem.

Version bumped to `3.1.67-dev.3`.

Tests: 51 pass (40 SubagentExecutor + 10 new summarizeEvent + 1 new
error-phase + existing coverage retained).

* fix: clear parent run summary and discoveredTools from child inputs

Paired with Codex P1 on the LibreChat PR (danny-avila/LibreChat#12725)
that uses this event stream. `buildChildInputs` shallow-spreads the
parent's `AgentInputs`, which carries two run-scoped fields that leak
into the isolated child by default:

- `initialSummary` — cross-run conversation summary. On a conversation
  that has already been summarized, every child would inherit it and
  reason against unrelated prior chat.
- `discoveredTools` — tool names the parent's LLM searched for in
  earlier turns via `tool_search`. The child gets primed to use
  whatever the parent happened to find.

Both are cleared on the returned child inputs. Configuration fields
(summarization policy, provider options, context pruning config) are
preserved — those are policy, not context.

Affects both paths:
- Self-spawn: `resolveSubagentConfigs` clones parent `_sourceInputs`,
  then `buildChildInputs` strips.
- Explicit children: host-built `AgentInputs` goes through the same
  `buildChildInputs` before the child graph runs.

Version bumped to `3.1.67-dev.4`.

Regression test: `strips parent-run-scoped initialSummary and
discoveredTools from child inputs` — seeds parent inputs with
`initialSummary: { text, tokenCount }` and `discoveredTools: [...]`,
asserts both are `undefined` on the returned child. 52 pass.

* test: cover summarizeEvent step.type fallback branches

Addresses reviewer F2.1 (NIT) on #107. `summarizeEvent` resolves
run-step detail type via `step.stepDetails?.type ?? step.type ?? 'step'`.
The existing tests all passed `{ stepDetails: { type } }`, leaving the
second and third clauses of the chain uncovered.

Two new tests:
- top-level `step.type` (no `stepDetails` wrapper) for both
  `tool_calls` and `message_creation` — exercises the second clause
- empty `{}` payload — exercises the `'step'` default and the generic
  `Step: <detailType>` branch

Carried MINOR #5 (`SubagentUpdateEvent.data: unknown`) left as a
documented deferral; the JSDoc already notes "shape depends on phase"
and a discriminated union is a larger refactor.

54/54 pass.

* chore: add npm run scripts for subagent debug scripts

- `npm run subagent` — supervisor + researcher/coder subagent demo
  (no tool use; from #104)
- `npm run subagent:events` — event-driven flow matching LibreChat's
  architecture (toolDefinitions + ON_TOOL_EXECUTE handler + self-spawn
  with calculator). This is the script that demonstrates the #107 fix.
- `npm run subagent:tools` — traditional LangChain tool instances with
  self-spawn; pre-fix parity case.

All three accept `OPENAI_API_KEY` (and `--provider anthropic` for the
first).

* fix(summarization): catch model init errors and log them

Move `initializeModel` inside the try/catch in `executeSummarizationWithFallback`
so that errors like `Unsupported LLM provider: <name>` are surfaced through the
`log('error', ...)` path and fall through to the metadata-stub fallback, instead
of bubbling up silently.

This is the defense-in-depth half of the fix for the LibreChat Discussion #12614
report: when a caller forwards an unrecognized summarization provider name
(e.g. a custom-endpoint label), the error is now observable in operator logs
rather than only the client UI.
danny-avila added a commit that referenced this pull request Apr 21, 2026
* feat: add inert hook registry and executor scaffolding (phase 1/1)

Introduces `src/hooks/` with types, registry, and executor for a 12-event
hook lifecycle. Purely additive — not exported from `src/index.ts` and not
yet wired into `Run`, `Graph`, or `ToolNode`. Integration lands in the
next PR so this change is shippable independently with zero behavioral
impact on hosts that don't opt in.

- Discriminated-union input/output per event; generic `HookCallback<E>`
  and `HookMatcher<E>` for type-safe registration.
- `HookRegistry` uses `Map<sessionId, bucket>` (not `Record`) to avoid
  O(n^2) churn under parallel registration in multi-tenant hosts.
- `executeHooks` fans out in parallel, races hooks against a combined
  parent + timeout `AbortSignal` so non-listening hooks still time out,
  folds decisions with `deny > ask > allow` precedence, accumulates
  `additionalContext`, and self-removes `once: true` matchers after any
  successful hook. Errors are non-fatal: swallowed into the aggregated
  result and routed through an optional winston logger (falling back to
  `console.warn`), with internal matcher errors suppressed.

47 tests under `src/hooks/__tests__/` cover registry scoping, regex
matching, precedence folding, once-self-removal (success/failure/mixed),
timeout enforcement (including non-listening hooks), error non-fatality,
and parent signal combination.

* fix(hooks): address review findings — rename field, wire updatedOutput, cache patterns

Resolves the comprehensive review on #87:

- Rename `HookMatcher.matcher` -> `HookMatcher.pattern` to fix the
  self-referential naming (`m.matcher` reads as "the matcher's matcher").
- Wire `PostToolUseHookOutput.updatedOutput` through `AggregatedHookResult`
  and `fold()`. The type was previously a promise the executor didn't
  fulfill — a hook returning `{ updatedOutput: ... }` was silently dropped.
- Correct the JSDoc on `updatedInput`/`updatedOutput`. The previous copy
  claimed non-determinism "in Promise.all resolution order" — but
  `Promise.all` preserves input order, so the fold is deterministic in
  registration order (outer loop over matchers, inner loop over each
  matcher's hooks). Updated the `executeHooks` function docstring to match.
- Add regex compilation cache and length bound to `matchers.ts`. Patterns
  compile at most once per unique string and are reused across calls;
  patterns over MAX_PATTERN_LENGTH (512) are rejected outright as a cheap
  ReDoS mitigation. Documented the trust model (patterns are trusted input;
  hosts must validate user-supplied patterns upstream).
- Document the wildcard-only semantics on query-less events: a non-empty
  pattern on an event that doesn't supply a matchQuery (RunStart, Stop,
  etc.) never matches.
- Document the `once: true` concurrent-dispatch limitation: two parallel
  `executeHooks` calls each snapshot the matcher list, so `once` means
  "once per call", not "once globally". Matches Claude Code's semantics.
  Added a test that pins this behavior.
- Merge `matchers.filter(...)` + task-build loop into a single pass per
  AGENTS.md "consolidate sequential O(n) operations."
- Scope `eslint-disable no-console` to the single `console.warn` call in
  `reportErrors` instead of disabling it file-wide.
- Fix the timeout-ignoring-hook test: clear the dangling `setTimeout` and
  assert the error surfaces an abort-shaped label.
- Add tests: multi-hook `preventContinuation` (first-writer-wins on
  stopReason, flag without reason), `updatedOutput` flow + registration
  order, registration-order determinism for `updatedInput`, pattern
  length bound, compilation cache hit/miss/clear, concurrent `once`
  dispatch. Total: 47 -> 59 tests.

Findings audited and resolved:
- #1 ReDoS: cache + length bound + trust model doc (full fix blocked on
  host-side validation)
- #2 Wrong updatedInput ordering JSDoc: fixed
- #3 Dead `updatedOutput` type: wired through
- #4 Concurrent `once` race: documented + test
- #5 `HookMatcher.matcher` naming: renamed to `pattern`
- #6 `WideCallback` dead code: rejected as inaccurate (used in `runHook`)
- #7 Eslint-disable scope: line-scoped
- #8 File-path comments: rejected; matches existing project convention
- #9 Two-pass filter: single pass
- #10 Regex recompilation: cached
- CL-4 Undocumented wildcard semantics: documented
- RT-3 Timeout test dangling timer: fixed + error content verified
- RT-4 No multi-hook preventContinuation test: added

* perf(hooks): atomic once, bounded LRU, ReDoS heuristic, shared signal

Rework the phase-1 hooks scaffolding for the multi-tenant scale we
actually operate at. These are behavior changes (semantics of `once`
tightened, new pattern-rejection path) but none of them are wired into
the runtime yet — this PR is still inert — so the blast radius is the
hooks directory only.

## Atomic `once: true`

Removal now happens synchronously inside `executeHooks`, between
`registry.getMatchers` and the first `await`. Node's event loop
serialises sync work, so two concurrent `executeHooks` calls can never
both observe the same `once` matcher — whichever call runs its sync
prefix first consumes it, and the loser sees an empty bucket.

Semantic change: `once` is now at-most-one-dispatch, not
at-most-one-successful-execution. If every hook in a `once` matcher
throws, the matcher is still gone. The old retry-on-failure behavior
was Claude Code's model — fine for a single-user CLI, but in a
multi-tenant server it opens a window where concurrent bursts can
re-dispatch a "once" hook non-idempotently. "At most once, ever" is
the correct semantic for a shared registry. Hosts that need retry
should register a normal matcher and self-unregister via the callback
returned from `registry.register`.

Dropped `collectOnceMatchersForRemoval` + its Map/array allocations —
the post-hoc collection step is no longer needed.

Tests: replaced "keeps the matcher when every hook throws" with
"removes the matcher even when every hook in it throws". Added
"fires exactly once across 3 concurrent calls" and "fires exactly once
across 8-way concurrent dispatch when hooks are slow" (the slow test
deliberately uses a 10ms hook to force overlap).

## Bounded LRU regex cache

`matchers.ts` was unbounded. Under multi-tenant use where different
tenants register different patterns, the cache would leak. Capped at
`MAX_CACHE_SIZE = 256` entries with LRU eviction: hits refresh position
(delete-then-set), misses at capacity evict the oldest key. Failed
compiles are cached the same way so a tenant spamming bad patterns
doesn't burn CPU on every call.

Exposed `getMatcherCacheSize` and `MAX_CACHE_SIZE` for test
observability. Tests verify eviction order (cold patterns recompiled
after overflow) and that hot patterns refreshed mid-stream survive
eviction pressure.

## ReDoS heuristic: nested-quantifier detector

`hasNestedQuantifier` walks a pattern linearly, tracks paren depth,
and rejects any pattern where a quantified group contains another
quantifier — the `(a+)+`, `(.*)*`, `(\w+)+` shape that is the #1
catastrophic-backtracking source. Character classes (`[*+?]`) and
escaped quantifiers (`\+`) are correctly ignored. Rejection happens
before `new RegExp`, so adversarial input never reaches the compiler.

This is a floor, not a ceiling — it doesn't catch ambiguous-alternation
ReDoS like `(a|a)+`. Hosts that accept user-supplied patterns must
still validate upstream. Documented in the `HookMatcher.pattern` and
`matchesQuery` JSDoc.

Test verifies that `(a+)+$` on a 35-char adversarial input resolves in
under 50ms (would be seconds without the heuristic).

## Shared `AbortSignal` per matcher

Was: one `AbortSignal.timeout(ms)` + one `AbortSignal.any([parent,
timeout])` per hook. For a matcher with N hooks and the same timeout,
that's N timers + N composite signals.

Now: one signal per matcher, shared across all its hooks. Since hooks
in a matcher already share `matcher.timeout`, they may as well share
the timer. Behavior is identical (all hooks fire and race the same
deadline) but allocation drops from N to 1 per matcher.

## Tests

72 passing (was 59, +13):
- Atomic once: 2 concurrent tests + 1 failure-removal test updated
- LRU: cache size cap, eviction order, hit refresh
- Nested quantifier: classic shapes, deep nesting, escaped chars,
  character classes, false positives on legitimate patterns
- ReDoS mitigation: rejection path, <50ms response on adversarial input

* fix(hooks): fix ReDoS heuristic false-positives on (?:...) groups

Resolves the follow-up review on #87. Four findings, all valid.

## #1 MAJOR — hasNestedQuantifier false-positives on group syntax

The previous scanner treated `?` as a quantifier at any depth > 0, but
`?` immediately after `(` is group syntax, not a quantifier:
`(?:...)`, `(?=...)`, `(?!...)`, `(?<name>...)`, `(?<=...)`, `(?<!...)`.
Patterns like `(?:pre_)?tool_name` were silently rejected and the
registering hook never fired — with no error surfaced to the caller.

Fix: explicit group-syntax prefix skip inside the `(` handler. When the
scanner enters a group it peeks for `?` and advances past the modifier
characters (`:`, `=`, `!`, `<=`, `<!`, or a named-group label `<name>`)
before processing the group body. The `?` is therefore never considered
a quantifier at the start of a group.

While in there I also fixed a second correctness bug the reviewer
didn't catch: the old depth-indexed array didn't propagate a child
group's "contains quantifier" signal up to its parent, so `(?:(a+))+`
(outer quantifier over a wrapped quantified group, equivalent to
`(a+)+`) escaped detection. The scanner now uses an explicit stack of
`QuantifierFrame` records, and when a child group closes it propagates
`hasBacktrackRisk` to the parent frame — whether the child itself was
quantified or not. This catches `(?:(a+))+` correctly while still
allowing non-risky wrappers like `(?:(ab))+`.

Added 9 tests covering non-capturing groups, lookaheads, lookbehinds,
named captures, `(?:(a+))+` risk propagation, and `((ab)+)+` deep
wrapping. Verified existing ReDoS-detection tests still pass.

## #2 MINOR — executeHooks.test.ts did not clear the pattern cache

Matcher cache is module-level. `matchers.test.ts` clears it in
`beforeEach`; `executeHooks.test.ts` did not, so patterns compiled
during one test persisted across subsequent tests in the same file.
Added `clearMatcherCache()` to `executeHooks`'s `beforeEach` — no
test failures today, but restores test independence.

## #3 MINOR — Test utilities leaked into production barrel

`clearMatcherCache` and `getMatcherCacheSize` are test-only helpers
(their JSDoc says so) but were exported from `src/hooks/index.ts`.
When the integration PR eventually re-exports `src/hooks` from
`src/index.ts` they would become public API.

Removed both from the barrel. Tests already import them directly from
`../matchers`, so no test changes needed. `hasNestedQuantifier`,
`MAX_PATTERN_LENGTH`, and `MAX_CACHE_SIZE` remain exported — hosts can
legitimately use them for upstream validation.

## #4 NIT — LRU refresh overhead at low cache pressure

Every `compile()` cache hit was doing `delete + set` to refresh LRU
position, even when the cache was nowhere near capacity. With
`MAX_CACHE_SIZE = 256` and typical working sets of tens of patterns,
eviction pressure is near-zero and the refresh is pure overhead.

Added a `LRU_REFRESH_THRESHOLD = 75%` gate: refresh only runs when the
cache is above 192 entries. Below that the code fast-paths straight
back from `compile()`. Above it the LRU semantics kick in so hot
patterns still survive evictions — the existing "refreshes LRU
position on hit" test still passes because by the time it runs the
check, the cache is at capacity.

Tests: 72 -> 81 (+9 for group-syntax and risk-propagation coverage).
danny-avila added a commit that referenced this pull request Apr 21, 2026
* feat: wire tool hooks into event-driven ToolNode (phase 1/3)

Fires PreToolUse, PostToolUse, PostToolUseFailure, and PermissionDenied
hooks inside `ToolNode.dispatchToolEvents` — the single chokepoint for
all event-driven tool execution in LibreChat.

## Hook lifecycle in dispatchToolEvents

1. **PreToolUse** fires per call in parallel before ON_TOOL_EXECUTE
   dispatch. Denied calls (deny or ask) produce error ToolMessages and
   fire PermissionDenied; surviving calls proceed with optional
   updatedInput applied to tool args.

2. Only approved calls are batched into ON_TOOL_EXECUTE and sent to the
   host. If all calls are denied, the batch dispatch is skipped entirely.

3. **PostToolUse** fires per successful result. If a hook returns
   updatedOutput, the ToolMessage content is replaced before the message
   is appended to the graph.

4. **PostToolUseFailure** fires per error result. Observational only —
   cannot modify the error message.

Post hooks are awaited (not fire-and-forget) so updatedOutput takes
effect before the ToolMessage is consumed. Hook errors are swallowed
via .catch() to avoid disrupting the tool result flow.
PermissionDenied is fire-and-forget since it's purely observational.

## Plumbing

- `ToolNodeOptions.hookRegistry?: HookRegistry` — new optional field
- `ToolNode` constructor stores `this.hookRegistry`
- `Graph.createToolNode()` passes `this.hookRegistry` at both
  construction sites (event-driven and traditional)
- `hookRegistry` is set on Graph BEFORE `createWorkflow()` is called
  (moved from the post-create assignment in Run constructor into
  `createLegacyGraph` and `createMultiAgentGraph`) so ToolNode receives
  a non-undefined registry at construction time

## hasHookFor guards

All four hook fire points use `hasHookFor(event, runId)` before calling
`executeHooks`, so when no hooks are registered for the event the
overhead is a single Map lookup + array length check — no Promise
allocations, no function calls.

## Tests

9 new integration tests in `src/hooks/__tests__/toolHooks.test.ts`
using event-driven mode (toolDefinitions + ON_TOOL_EXECUTE handler):

- PreToolUse fires with correct fields
- PreToolUse deny blocks execution (tool handler never called)
- PreToolUse ask blocks execution (v1)
- PreToolUse updatedInput rewrites args before dispatch to host
- PreToolUse hook errors are non-fatal (tool still executes)
- PermissionDenied fires after deny with reason
- PostToolUse fires with tool output
- PostToolUseFailure fires on error result
- No-hooks baseline works identically

Total: 94 existing + 9 new = 103 passing.

* fix(hooks): address PR 3 review — step completion, ordering, catch, tests

Resolves all 10 findings from the review on #90.

## Structural rewrite of dispatchToolEvents

The post-processing loop was rewritten to fix #1, #2, #6, #8, #10:

- #1 MAJOR: Denied calls now dispatch ON_RUN_STEP_COMPLETED via a
  shared `dispatchStepCompleted` helper (extracted from the duplicate
  logic in the executed-results loop). Without this, the frontend
  would show stuck spinners for denied tool calls.

- #2 MAJOR: Output messages are now collected in a Map keyed by
  tool_call_id, then returned in the original `toolCalls` input order
  via `toolCalls.map(call => messageByCallId.get(call.id))`. Before,
  denied messages were prepended to executed messages, breaking the
  implicit ordering contract.

- #6 MINOR: `toolUsageCount` is now incremented only for approved
  calls (moved from the initial `preToolCalls.map` into the
  `approvedEntries.map` that builds ToolCallRequests). Denied calls no
  longer inflate the turn counter.

- #8 MINOR: Eliminated the double `result.status === 'error'` check.
  ToolMessage is now constructed inside each status branch directly
  after hook dispatch.

- #10 NIT: `requests.find()` replaced with a pre-built
  `Map<id, ToolCallRequest>` for O(1) lookup per result, per AGENTS.md
  guidance on Map/Set over Array.find.

## Error protection

- #3 MAJOR: PreToolUse `Promise.all` now wraps each `executeHooks`
  call in `.catch(() => emptyResult)` so an infrastructure error in
  the hooks subsystem cannot crash the entire tool batch. Matches the
  stated "hook errors are non-fatal" contract and is consistent with
  PostToolUse/PostToolUseFailure which already had `.catch()`.

## New tests

- #4 MAJOR: Added `updatedOutput replaces the ToolMessage content` —
  registers a PostToolUse hook returning `{updatedOutput: 'REDACTED'}`,
  verifies the ToolMessage content in the graph's run messages is
  `'REDACTED'` (not the original output).

- #7 MINOR: PermissionDenied test replaced 50ms `setTimeout` with a
  `Promise` resolved by the hook callback. No more flaky sleep.

## Cleanup

- #9 NIT: Updated stale "once the tool-hook PR lands" comment in
  `src/hooks/index.ts`.

Tests: 103 -> 104 (+1 for updatedOutput).

* fix(hooks): off-by-one in dispatchStepCompleted index, test guard, turn fallback

- #1 MAJOR: dispatchStepCompleted read post-increment toolUsageCount
  for the `index` field. Added `turn?: number` parameter; approved-call
  site passes `request?.turn` (the pre-increment value), denied-call
  site omits it (fallback is correct since count was never bumped).
- #2 NIT: updatedOutput test now asserts `expect(toolMsg).toBeDefined()`
  before content extraction so a missing tool message produces a clear
  failure rather than "Expected undefined to be 'REDACTED'".
- #3 NIT: PreToolUse input `turn` field restored `?? 0` fallback so
  first-use tools see `turn: 0` instead of `turn: undefined`.

* fix(hooks): batch tests, turn JSDoc, freeze sentinel, remove unused hookRegistry

Addresses follow-up review findings on #90.

- #1 MAJOR: Add multi-call batch tests — partial deny (echo denied,
  math approved, order preserved) and all-denied (ON_TOOL_EXECUTE
  handler never called). Uses counter-based IDs (#7 fix) to avoid
  Date.now() collisions across calls in the same test.
- #2 MINOR: Document PreToolUseHookInput.turn semantics — within a
  single batch, all calls to the same tool share the same turn value.
  Per-call discrimination within a batch is not supported in v1.
- #3 MINOR: Remove hookRegistry from the traditional (non-event-driven)
  ToolNode construction site in Graph.ts. Hooks only fire in
  dispatchToolEvents; passing the registry to a ToolNode that never
  uses it created a false impression of hook support.
- #4 MINOR: Add PostToolUse error non-fatality test — throwing hook
  does not crash the batch, original content is preserved.
- #5+#8 MINOR: Freeze the hook-error fallback sentinel via
  Object.freeze and rename from emptyResult to HOOK_FALLBACK. Prevents
  future mutation bugs if someone pushes to the shared arrays.
- #6 MINOR: Document in ToolNodeOptions.hookRegistry JSDoc that hooks
  only fire for event-driven calls — directToolNames bypass dispatch.

Tests: 104 -> 107 (+3 for batch partial-deny, all-denied, and
PostToolUse error resilience).

* nit: simplify test assertions, keep shallow freeze (deep conflicts with mutable type)
danny-avila added a commit that referenced this pull request Apr 21, 2026
#104)

* feat: add subagent-as-tool primitive for hierarchical agent delegation

Adds a SubagentTool that lets a parent agent delegate tasks to child
agents with isolated context windows. Verbose tool outputs stay in
the child's context; only a filtered text summary returns to the parent.

Key components:
- SubagentConfig type on AgentInputs for declaring child agent types
- SubagentExecutor: creates child StandardGraph, invokes with isolated
  state, filters result (strips tool_use/thinking blocks), fires
  SubagentStart/SubagentStop hooks
- SubagentTool: LCTool definition with dynamic enum from configs
- Graph integration: tool created in createAgentNode() via graphTools,
  automatically becomes a direct tool (bypasses event-driven dispatch)
- Self-spawn support: reuse parent's AgentInputs for context isolation
  without separate agent config

Wires the existing SubagentStart (with deny support) and SubagentStop
hook types that were defined but had zero fire points.

* fix: resolve review findings for subagent primitive

Address all 6 issues from PR review:

1. Add subagentHooks.test.ts — end-to-end hook integration test
   through real Run pipeline (SubagentStart payload, SubagentStop
   messages, deny blocks execution). 3 new tests.

2. Remove `as unknown as` cast — follow MultiAgentGraph pattern
   (init graphTools as empty array, push directly).

3. Deduplicate schema — extract buildSubagentToolParams() in
   SubagentTool.ts, use it in Graph.ts. createSubagentToolDefinition()
   now delegates to it. Single source of truth for schema/description.

4. Propagate threadId — tool callback extracts thread_id from
   LangGraph config.configurable, passes to executor.execute().
   Hooks now receive correct threadId.

5. Pass tokenCounter to child graph — added to SubagentExecutorOptions,
   threaded from agentContext.tokenCounter through to child
   StandardGraph constructor.

6. Pass signal in invoke() config — workflow.invoke() now receives
   signal for more responsive abort between graph steps.

* fix: resolve second-pass audit findings for subagent primitive

Address 12 valid findings from the audit — 3 MAJOR, 6 MINOR, 3 NIT:

MAJOR fixes:

1. Depth tracking now works across graph boundaries. Previously the
   `depth`/`maxDepth` check only worked within a single executor; when
   `allowNested: true`, the child graph spawned its own executor at
   depth=0, making `maxSubagentDepth` non-functional. Fix: decrement
   `maxSubagentDepth` in `buildChildInputs`, so the child's own
   executor inherits a smaller budget and naturally blocks further
   nesting when it reaches 0. Removed the now-vestigial `depth` field.

2. Detect duplicate `SubagentConfig.type` values. `Map` construction
   silently dropped duplicate keys. Now `resolveSubagentConfigs`
   throws with a clear error.

3. Add test coverage for `allowNested: true`: verifies depth is
   decremented across boundaries, child inherits subagentConfigs,
   and budget is clamped to >= 0.

MINOR fixes:

4. Runtime guard for empty `description` in the tool callback — the
   LLM can send undefined/empty values; now we default to a safe
   fallback string.

5. Updated `src/hooks/index.ts` barrel comment to list SubagentExecutor
   as the fourth hook consumer (SubagentStart/SubagentStop).

6. Extracted property description strings as constants in
   SubagentTool.ts; `SubagentToolSchema` and `buildSubagentToolParams`
   now share a single source of truth.

7. Changed SubagentStop from fire-and-forget to awaited, matching the
   PostCompact pattern. Removes fragile `setTimeout` synchronization
   in tests. Errors still swallowed (observational).

8. Truncate error messages from child graph to 200 chars to prevent
   leaking internal details (stack traces, API errors) to the parent
   LLM.

9. Document that `ask` decision is treated identically to `deny` in
   the non-interactive subagent context.

NIT fixes:

11. Import order in SubagentExecutor.ts — `@/types` (multi-line,
    longest) now comes first in the local types sub-group.

12. Removed unused `contentParts` destructuring in verification script.

13. Fixed unsafe `msg.content as string` cast in verification script;
    now handles array content correctly.

Skipped #10 (NIT): `_sourceInputs` underscore prefix is an accepted
TS convention for cross-file-accessible-but-semantically-internal
fields.

Test count: 1074 passed (up from 1065, +9 new tests), 0 regressions.

* fix: address third-pass audit findings for subagent primitive

Resolve 3 NITs from the second review pass plus 2 new codex comments:

NIT #1 — Tighten truncation test: previously asserted only
`result.content.length < 500`. Now asserts exact envelope of 219
chars ("Subagent error: " prefix + 200 body + "..." suffix) so that
regressions in ERROR_MESSAGE_MAX_CHARS are caught. Added companion
test verifying short messages are not truncated.

NIT #2 — Skip tool creation when maxSubagentDepth: 0. When a host
set `maxSubagentDepth: 0` alongside `subagentConfigs`, the tool was
still bound to the LLM, and every invocation returned the depth-
exceeded error. The LLM wasted a turn. Now the `createAgentNode`
guard includes `effectiveSubagentDepth > 0`, so the tool is not
created at the leaf level. Added integration test.

NIT #3 — Document buildChildInputs audience. Added @remarks block
explaining this is an advanced utility exported primarily for
testing and internal use; hosts configuring subagents should not
call it directly.

Codex P1 (false positive, clarified): maxSubagentDepth propagation
across graph boundaries was already correct via the countdown in
buildChildInputs, but it wasn't obvious from the call site. Added
a multi-line JSDoc comment above the executor construction in
createAgentNode explaining how depth consumes across boundaries
through the decremented AgentInputs.maxSubagentDepth.

Codex P2 (real bug, fixed): toolSchemaTokens did not include the
subagent tool's schema. `calculateInstructionTokens()` was called
in `fromConfig()` before graphTools was populated, so the later-
injected subagent tool's schema was missing from the budget. Fixes:
  1. Include `graphTools` in the iteration in
     `calculateInstructionTokens` (also helps handoff tools, a
     pre-existing issue with the same root cause).
  2. Re-trigger the calculation at the end of `createAgentNode`
     after the subagent tool is pushed, so the token count reflects
     the final graphTools state before `createCallModel` awaits
     `tokenCalculationPromise`.

Added test asserting `toolSchemaTokens` is larger for agents with
subagents vs without, confirming the recalculation runs.

Test count: 1077 passed (up from 1074, +3 new tests), 0 regressions.

* fix: break circular dep between SubagentExecutor and StandardGraph

Rollup warned on build:
  Graph.ts → subagent/index.ts → SubagentExecutor.ts → Graph.ts

SubagentExecutor needed `StandardGraph` at runtime to construct the
child graph; Graph.ts needs SubagentExecutor to attach the tool. With
preserveModules, these land in separate chunks producing circular
chunk imports with undefined execution order.

Fix: dependency injection. SubagentExecutor now takes a required
`createChildGraph: ChildGraphFactory` option and only imports
`StandardGraph` as `import type` (erased at build). Graph.ts passes
`(input) => new StandardGraph(input)` when constructing the executor.
Runtime module graph is now acyclic; `import type` does not appear in
emitted JS.

Test updates: replaced `jest.spyOn(GraphModule, 'StandardGraph')`
(which relied on the runtime import we just removed) with direct mock
factories passed through the new option. Cleaner — tests no longer
need to reach into module internals. Same coverage.

Build: `npx rollup -c` now emits zero warnings.
Test count: 1077 passed, 0 regressions.

* chore: add multi-agent-subagent script to package.json

Introduced a new script entry for multi-agent-subagent, enabling execution of the corresponding TypeScript file with necessary configurations. This addition enhances the multi-agent functionality by allowing for more granular control over subagent operations.

* fix: isolate child subagent from parent's callback chain

The manual verification script (`npm run multi-agent-subagent`)
crashed with:

  Error: No agent context found for agent ID researcher
    at StandardGraph.getAgentContext
    at ChatModelStreamHandler.handle

Root cause: `Run.processStream` drives the parent via
`streamEvents`, which captures events from every nested runnable in
the call tree — including the child graph's LLM calls. When the
child's "researcher" agent streamed `on_chat_model_stream` events,
the parent's `ChatModelStreamHandler` received them and tried to
resolve the child's agent ID in the parent's `agentContexts` map,
which only contains the supervisor. Error thrown.

The Phase 1 plan was for child events to stay fully isolated. Unit
tests missed this because the child graph was always mocked (no
real `streamEvents` plumbing); only running the full pipeline
end-to-end surfaced it.

Fix: pass `callbacks: []` in the child `workflow.invoke()` config.
This overrides the inherited callbacks for that invocation, breaking
the event propagation chain. Also set `runName: subagent:<type>` and
a child-scoped `configurable.thread_id` so the child has a distinct
trace root rather than polluting the parent's trace.

Manual verification now works: supervisor delegates to researcher,
receives filtered text, synthesizes the final answer.

Test suite: 1079 passed, 0 subagent-suite regressions. The two newly
failing suites (ProgrammaticToolCalling / ToolSearch Live API tests)
are pre-existing — they require `LIBRECHAT_CODE_BASEURL` to be
configured and fail identically on the prior commit. They were
skipped before because OPENAI_API_KEY was absent.

* feat: align filterSubagentResult fallback with Claude Code

When a subagent's last AIMessage is pure tool_use (e.g. the model
hit maxTurns mid-tool-call), the previous implementation returned
the "Task completed" sentinel, discarding any textual progress from
earlier turns.

Claude Code's agentToolUtils.finalizeAgentTool walks backwards from
the last message, and on messages without text content continues
scanning earlier AIMessages, surfacing the most recent textual
thought instead of the sentinel.

This change matches that behavior. The three `return` sites inside
the loop that previously exited with the sentinel now `continue`
when the AIMessage yields no text; the loop exits with "Task
completed" only when no AIMessage in the history contains any text.

Concrete example:
  [0] Human: "What's the capital of France?"
  [1] AI:    "Let me search." + tool_use(search, ...)
  [2] Tool:  "Paris."
  [3] AI:    tool_use(search, ...)   <-- maxTurns, no text

  Before: "Task completed"
  After:  "Let me search."

Added two tests — the turn-limit scenario above, and the analogous
empty-string-content edge case. All nine existing filterSubagentResult
tests still pass unchanged.

* fix: update model name for non-Anthropic API to gpt-5.4

Changed the model name used in the multi-agent subagent script from 'gpt-4.1' to 'gpt-5.4' for improved performance and capabilities. This update ensures compatibility with the latest API offerings.
danny-avila added a commit that referenced this pull request Apr 21, 2026
* feat: forward subagent child events and enable event-driven tools

Adds an opt-in event forwarding path for `SubagentExecutor` so host apps
that rely on event-driven tool dispatch (`toolDefinitions` +
`ON_TOOL_EXECUTE` handler) can use subagents — including self-spawn.

Before: a subagent in event-driven mode got `toolDefinitions` stripped in
`buildChildInputs` and ran with `callbacks: []`, so the LLM had no tool
schemas and ON_TOOL_EXECUTE never reached the host. The subagent replied
in prose without invoking any tools.

Now:
- `SubagentExecutor` accepts `parentHandlerRegistry` (a direct registry
  or a lazy getter — `Run` wires the registry onto the graph AFTER
  `createWorkflow`, so `createAgentNode` must capture lazily).
- When provided, `buildChildInputs` keeps `toolDefinitions`, and the
  child's invoke attaches a `BaseCallbackHandler` forwarder that:
    - routes ON_TOOL_EXECUTE to the parent's handler (event-driven tools
      work), and
    - wraps other custom events (run_step, run_step_delta,
      run_step_completed, message_delta, reasoning_delta) in a new
      `ON_SUBAGENT_UPDATE` envelope so hosts can render child progress
      in a separate UI surface.
- New `SubagentUpdateEvent` type and `ON_SUBAGENT_UPDATE` GraphEvents
  entry; start/stop/error envelopes fire around the child invoke.
- Legacy isolation is preserved when no registry is passed.

Test plan:
- `npx jest SubagentExecutor SubagentTool` — 36 pass (4 new covering
  forwarding, toolDefinition retention, legacy strip, lazy getter).
- Integration test against OpenAI (`subagent-event-driven-debug.ts`)
  confirms the child uses `calculator` via the parent's ON_TOOL_EXECUTE
  path and the host sees start/run_step/run_step_completed/stop
  ON_SUBAGENT_UPDATE envelopes.

* fix: gate child toolDefinitions on ON_TOOL_EXECUTE handler + propagate tool_call_id

Two follow-ups from review on #107:

1. Codex P1: `SubagentExecutor` was treating any non-null parent registry
   as "forwarding enabled" and always kept `toolDefinitions` on the
   child. Because `Run.create` always constructs a `HandlerRegistry`,
   this meant configurations with tools declared but no `ON_TOOL_EXECUTE`
   handler would hang forever — the child's `ToolNode` would dispatch a
   batch request and the forwarder would find no handler to resolve or
   reject it, leaving the promise pending.

   Now `keepToolDefinitions` is gated on
   `parentRegistry.getHandler(ON_TOOL_EXECUTE) != null`. A registry
   without the handler falls back to the legacy "strip toolDefinitions"
   path, which is a recoverable no-tools state instead of a hang.

2. `SubagentUpdateEvent` now carries `parentToolCallId`. `Graph.ts`
   pulls it from `config.toolCall.id` (LangChain threads the originating
   `ToolCall` onto `RunnableConfig`) when the subagent tool is invoked.
   This lets hosts correlate subagent updates to the parent tool call
   deterministically instead of inferring by event ordering — a concern
   raised on the LibreChat PR using this event stream.

Also bumps to `3.1.67-dev.2`.

Tests: `npx jest SubagentExecutor` — 38 pass (2 new covering the handler
gate and `parentToolCallId` propagation). Integration script verified
against real OpenAI: every ON_SUBAGENT_UPDATE envelope now carries the
parent's `tool_call_id`.

* chore: address audit findings — drop unused param, tighten cast, add tests

Addresses the comprehensive audit of #107:

### MAJOR #1 — ON_TOOL_EXECUTE forwarding coverage
Two new tests in the `event forwarding` suite drive the forwarder
callback directly with a synthesized `ToolExecuteBatchRequest`:
- `routes child ON_TOOL_EXECUTE dispatches through the parent registry`
  asserts the parent's `ON_TOOL_EXECUTE` handler is invoked, the batch's
  `resolve` fires with canned results, and the child receives them.
- `does NOT forward ON_TOOL_EXECUTE when the parent registry has no
  handler (safe fallback)` documents the companion case: without a
  handler the `resolve` never fires, which is why the executor gates
  `keepToolDefinitions` on handler presence in the first place.

### MINOR #2 — unused `childGraph` parameter
Removed from `createForwarderCallback`'s arg shape, destructuring, and
call site. It was accepted as `_childGraph` and never read.

### MINOR #3 — `data as never`
`EventHandler.handle` didn't include `ToolExecuteBatchRequest` in its
accepted data union, which is why the forwarder used `as never` to
shut the compiler up. Extended the union and switched the cast to the
concrete `ToolExecuteBatchRequest` type — the forwarder is now fully
type-checked on the hot path.

### MINOR #4 — error-phase envelope test
`emits an 'error' phase envelope when the child graph throws` uses
the throwing factory helper already in the file, registers a real
`ON_SUBAGENT_UPDATE` handler, and asserts start/error ordering plus
the `data.message` payload and `parentToolCallId` passthrough.

### MINOR #6 — dynamic imports in tests
Replaced `await import('@/events')` / `await import('@/common')` with
static top-level imports. Consistent with the rest of the file.

### MINOR #7 — `summarizeEvent` unit tests
Exported the pure function from `@/tools/subagent` and added 10 focused
tests covering every branch: tool_calls with one/many names, empty
tool_calls, message_creation, ON_TOOL_EXECUTE with/without names,
completed steps with/without a tool name, message_delta, and the
unknown-event fallback. Every user-visible label is now regression
protected.

### MINOR #8 — `config.toolCall` comment
Expanded the JSDoc to note the specific LangChain source
(`ToolRunnableConfig` in `@langchain/core/tools`, stable since 0.3.x)
and that the defensive read degrades to `undefined` on upstream
changes.

### NIT #9 — unused import
Removed `createContentAggregator` from `subagent-tools-debug.ts`.

### NIT #10 — `awaitHandlers = true` doc
Added a JSDoc block explaining the trade-off: required for
`ON_TOOL_EXECUTE` to actually block on the host's resolve, but it
also serializes observational events through any slow host handlers.
Refinement path noted for later if host-side latency becomes a problem.

Version bumped to `3.1.67-dev.3`.

Tests: 51 pass (40 SubagentExecutor + 10 new summarizeEvent + 1 new
error-phase + existing coverage retained).

* fix: clear parent run summary and discoveredTools from child inputs

Paired with Codex P1 on the LibreChat PR (danny-avila/LibreChat#12725)
that uses this event stream. `buildChildInputs` shallow-spreads the
parent's `AgentInputs`, which carries two run-scoped fields that leak
into the isolated child by default:

- `initialSummary` — cross-run conversation summary. On a conversation
  that has already been summarized, every child would inherit it and
  reason against unrelated prior chat.
- `discoveredTools` — tool names the parent's LLM searched for in
  earlier turns via `tool_search`. The child gets primed to use
  whatever the parent happened to find.

Both are cleared on the returned child inputs. Configuration fields
(summarization policy, provider options, context pruning config) are
preserved — those are policy, not context.

Affects both paths:
- Self-spawn: `resolveSubagentConfigs` clones parent `_sourceInputs`,
  then `buildChildInputs` strips.
- Explicit children: host-built `AgentInputs` goes through the same
  `buildChildInputs` before the child graph runs.

Version bumped to `3.1.67-dev.4`.

Regression test: `strips parent-run-scoped initialSummary and
discoveredTools from child inputs` — seeds parent inputs with
`initialSummary: { text, tokenCount }` and `discoveredTools: [...]`,
asserts both are `undefined` on the returned child. 52 pass.

* test: cover summarizeEvent step.type fallback branches

Addresses reviewer F2.1 (NIT) on #107. `summarizeEvent` resolves
run-step detail type via `step.stepDetails?.type ?? step.type ?? 'step'`.
The existing tests all passed `{ stepDetails: { type } }`, leaving the
second and third clauses of the chain uncovered.

Two new tests:
- top-level `step.type` (no `stepDetails` wrapper) for both
  `tool_calls` and `message_creation` — exercises the second clause
- empty `{}` payload — exercises the `'step'` default and the generic
  `Step: <detailType>` branch

Carried MINOR #5 (`SubagentUpdateEvent.data: unknown`) left as a
documented deferral; the JSDoc already notes "shape depends on phase"
and a discriminated union is a larger refactor.

54/54 pass.

* chore: add npm run scripts for subagent debug scripts

- `npm run subagent` — supervisor + researcher/coder subagent demo
  (no tool use; from #104)
- `npm run subagent:events` — event-driven flow matching LibreChat's
  architecture (toolDefinitions + ON_TOOL_EXECUTE handler + self-spawn
  with calculator). This is the script that demonstrates the #107 fix.
- `npm run subagent:tools` — traditional LangChain tool instances with
  self-spawn; pre-fix parity case.

All three accept `OPENAI_API_KEY` (and `--provider anthropic` for the
first).
danny-avila added a commit that referenced this pull request May 3, 2026
Codex P1 (preserve session hooks across HITL resumes):
- `Run.processStream` finally block was unconditionally calling
  `hookRegistry.clearSession(this.id)` — including after an interrupt
  fired. The very next call is `Run.resume()`, which needs the same
  policy hooks (e.g. the `PreToolUse` matcher that produced the
  interrupt) to fire on the re-executed node and uphold the approval
  flow. Clearing leaked the gate on resume. Now gated on
  `this._interrupt == null` so the session survives until either
  natural completion, error, or hook-driven halt.
- Regression test covers the full lifecycle: interrupt → session
  matcher still present, resume → hook fires again (preCallCount === 2,
  proving policy actually applied), then natural completion → matcher
  cleared.

Comprehensive review (audited 10 findings, addressed 7 valid):

#2 — Mixed deny/ask/allow batch test: previously missing. Added a
test with three tools where the policy denies one, asks one, and
allows one. Verifies that on first pass only the ask appears in the
interrupt and NO tools dispatched (LangGraph rolls back the node's
effects on `interrupt()` throw — partial execution while a human is
being asked would leak side effects ahead of approval). On resume,
all three tools land in state with the correct outcomes.

#4 — Documented in `ToolApprovalDecision` JSDoc that `respond` does
NOT fire per-tool `PostToolUse` (no real execution happened) but DOES
appear in the `PostToolBatch` entry array, so batch-level audit /
convention hooks see the full set of outcomes.

#5 — Added clarifying JSDoc at both hook-context injection sites
(`Run.runPreStreamHooks`, `ToolNode.dispatchToolEvents`) explaining
why we use `HumanMessage` with `additional_kwargs.role: 'system'`:
mid-conversation `SystemMessage`s are rejected by Anthropic and
Google providers; the `role` field is metadata only for hosts
inspecting state. Mirrors the existing `convertInjectedMessages`
convention.

#6 — Decomposed `processStream`: extracted the pre-stream hook
block (RunStart + UserPromptSubmit + context injection + halt
short-circuit) into a private `runPreStreamHooks()` helper. Returns
a boolean: `true` to halt, `false` to proceed. Cuts the method's
body and matches AGENTS.md "Break complex operations into
well-named helpers."

#8 — Moved `HookHaltSignal` interface above the `HookRegistry`
class JSDoc block so it doesn't read like a continuation of the
Map-mutation rationale.

#9 — `blockEntry` now records `turn` on its `PostToolBatch` entry
push, matching the executed-path push. Same applies to the `respond`
branch. Batch hooks now see uniform entry shapes regardless of
outcome.

Not addressed (intentionally, per project direction):
- #1 default-on HITL: deliberate — most apps expect this. JSDoc on
  `RunConfig.humanInTheLoop` already calls out the contract.
- #3 mixed-resume test: already covered by the existing "bundles
  multiple ask decisions into a single interrupt and resolves per
  call" test (resumes with `[approve, reject]`).
- #7 unconditional MemorySaver allocation: HITL-on-by-default
  contract; opt out with `{ enabled: false }` for the latency-
  sensitive path. Per-Run MemorySaver overhead is small.
- #10 `HumanInterruptType` exported but unused: kept as public API
  surface; LibreChat's wire types reference it.

Validation: 32/32 HITL tests pass (2 new under "Codex review fixes"
describe), full hooks/tools/summarization suite (621 tests) green,
typecheck clean, lint clean on touched files (one pre-existing
warning at `src/tools/ToolNode.ts:751` from `df6b739`, untouched),
build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
danny-avila added a commit that referenced this pull request May 4, 2026
* 🛡️ feat: First-class HITL & permissions surface

Adds the SDK side of human-in-the-loop and tool-permission support so
hosts can build approval flows, audit policies, and ask-user prompts
against a stable surface that mirrors Claude Code's vocabulary.

What's in:

- HITL on by default. `RunConfig.humanInTheLoop` is opt-out via
  `{ enabled: false }`. When enabled and the host did not provide a
  checkpointer, `Run.create` installs a process-local `MemorySaver`
  fallback so `interrupt()` / `Command({ resume })` work out of the box.

- ToolNode bundles every `ask`-decision tool call from one batch into
  a single `interrupt()` carrying a `tool_approval` payload, anchored
  to the node config via `AsyncLocalStorageProviderSingleton.runWithConfig`
  (works around `ToolNode.trace = false`). Resume decisions support
  approve / reject / edit / respond, accepted as either an array or a
  record keyed by `tool_call_id`.

- AskUserQuestion as a separate `HumanInterruptType`. New
  `askUserQuestion()` helper for use inside any custom node;
  `Run.resume<T>` is generic so the same method handles both interrupt
  categories. Type guards `isToolApprovalInterrupt` /
  `isAskUserQuestionInterrupt` for narrowing the payload union.

- `createToolPolicyHook` declarative permission hook with allow / deny
  / ask lists + `mode: 'default' | 'dontAsk' | 'bypass'`. Glob `*`
  patterns. Evaluation order matches Claude Code's:
  deny → bypass → allow → ask → dontAsk → fallthrough(ask).

- New `PostToolBatch` hook event with full `PostToolBatchEntry[]` per
  dispatch. Per-hook `allowedDecisions` override flows into the
  interrupt's `review_configs`. `additionalContext` from any hook now
  injects a single `HumanMessage` into the conversation (previously
  collected but discarded). `preventContinuation` honored both
  pre-stream (early return) and mid-flight (`HookRegistry.haltRun()`
  signal polled by `Run.processStream`, skips the `Stop` hook). Async
  fire-and-forget hooks via `{ async: true }` — output is ignored,
  background work continues.

- `Run.getInterrupt()` and `Run.getHaltReason()` for hosts to inspect
  why the run paused or stopped. Re-exports `Command`, `MemorySaver`,
  `BaseCheckpointSaver`, `INTERRUPT`, `interrupt`, `isInterrupted`,
  and the `Interrupt` type from `@langchain/langgraph` so hosts that
  build durable checkpointers extend the same instance the SDK was
  compiled against (no version skew).

49 new tests across `src/hooks/__tests__/createToolPolicyHook.test.ts`
and `src/tools/__tests__/hitl.test.ts`; full hooks/tools/summarization
suite (616 tests) green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🔖 chore: bump version to 3.1.75-dev.2

Carries the HITL & permissions surface to the next dev pre-release so
LibreChat can pin it during Slice B integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🛠️ fix: address Codex review on HITL ToolNode

Three issues from the PR #134 review:

1. P1: Unknown HITL resume decision types now fail closed instead of
   silently flowing into `approvedEntries`. Hosts deserialize resume
   payloads from untyped JSON, so a typo (`'aproved'`) or schema drift
   would have bypassed the approval gate entirely. Restructured to
   explicit approve/edit/reject/respond branches with a typed-widened
   final check that routes anything else through `blockEntry` with a
   diagnostic reason.

2. P2: `PostToolBatch` entries now carry the post-`PostToolUse`-hook
   output. Previously the batch entry recorded `result.content` even
   when a hook had replaced it via `updatedOutput`, so audit /
   redaction / convention-injecting batch hooks observed stale or
   unredacted data. Added a loop-scoped `finalToolOutput` that the
   PostToolUse-hook branch updates and the batch entry reads.

3. P2: `PostToolUseFailure` `additionalContext` is now injected into
   the next model turn. The hook result was awaited but discarded, so
   recovery guidance returned on tool errors (e.g. "if this errors,
   suggest the user retry with X") was silently dropped despite the
   API surface advertising the field. Now collected into the same
   batch additional-contexts accumulator.

Three new tests in `src/tools/__tests__/hitl.test.ts` under the new
"Codex review fixes" describe block:
  - host sends `{ type: 'aproved' }` → blocked, tool never dispatched
  - PostToolUse rewrite → batch entry sees the redacted output
  - tool errors with PostToolUseFailure additionalContext → injected

31/31 HITL tests pass; full hooks/tools/summarization suite (619
tests) still green; typecheck clean; build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 📝 docs: correct HITL default-on contract on RunConfig.humanInTheLoop

Codex review on PR #134 caught that the JSDoc still described the
opt-in semantics from before the default-on flip. Integrators reading
the type docs would omit the field expecting no interrupts, then hit
unexpected pauses + a checkpointer fallback in production.

Aligns the contract with runtime behavior: HITL is on by default
(omit or `{ enabled: true }` engages everything); opt out with
`{ enabled: false }` to restore the pre-HITL fail-closed path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🛠️ fix: address comprehensive review on HITL surface

Codex P1 (preserve session hooks across HITL resumes):
- `Run.processStream` finally block was unconditionally calling
  `hookRegistry.clearSession(this.id)` — including after an interrupt
  fired. The very next call is `Run.resume()`, which needs the same
  policy hooks (e.g. the `PreToolUse` matcher that produced the
  interrupt) to fire on the re-executed node and uphold the approval
  flow. Clearing leaked the gate on resume. Now gated on
  `this._interrupt == null` so the session survives until either
  natural completion, error, or hook-driven halt.
- Regression test covers the full lifecycle: interrupt → session
  matcher still present, resume → hook fires again (preCallCount === 2,
  proving policy actually applied), then natural completion → matcher
  cleared.

Comprehensive review (audited 10 findings, addressed 7 valid):

#2 — Mixed deny/ask/allow batch test: previously missing. Added a
test with three tools where the policy denies one, asks one, and
allows one. Verifies that on first pass only the ask appears in the
interrupt and NO tools dispatched (LangGraph rolls back the node's
effects on `interrupt()` throw — partial execution while a human is
being asked would leak side effects ahead of approval). On resume,
all three tools land in state with the correct outcomes.

#4 — Documented in `ToolApprovalDecision` JSDoc that `respond` does
NOT fire per-tool `PostToolUse` (no real execution happened) but DOES
appear in the `PostToolBatch` entry array, so batch-level audit /
convention hooks see the full set of outcomes.

#5 — Added clarifying JSDoc at both hook-context injection sites
(`Run.runPreStreamHooks`, `ToolNode.dispatchToolEvents`) explaining
why we use `HumanMessage` with `additional_kwargs.role: 'system'`:
mid-conversation `SystemMessage`s are rejected by Anthropic and
Google providers; the `role` field is metadata only for hosts
inspecting state. Mirrors the existing `convertInjectedMessages`
convention.

#6 — Decomposed `processStream`: extracted the pre-stream hook
block (RunStart + UserPromptSubmit + context injection + halt
short-circuit) into a private `runPreStreamHooks()` helper. Returns
a boolean: `true` to halt, `false` to proceed. Cuts the method's
body and matches AGENTS.md "Break complex operations into
well-named helpers."

#8 — Moved `HookHaltSignal` interface above the `HookRegistry`
class JSDoc block so it doesn't read like a continuation of the
Map-mutation rationale.

#9 — `blockEntry` now records `turn` on its `PostToolBatch` entry
push, matching the executed-path push. Same applies to the `respond`
branch. Batch hooks now see uniform entry shapes regardless of
outcome.

Not addressed (intentionally, per project direction):
- #1 default-on HITL: deliberate — most apps expect this. JSDoc on
  `RunConfig.humanInTheLoop` already calls out the contract.
- #3 mixed-resume test: already covered by the existing "bundles
  multiple ask decisions into a single interrupt and resolves per
  call" test (resumes with `[approve, reject]`).
- #7 unconditional MemorySaver allocation: HITL-on-by-default
  contract; opt out with `{ enabled: false }` for the latency-
  sensitive path. Per-Run MemorySaver overhead is small.
- #10 `HumanInterruptType` exported but unused: kept as public API
  surface; LibreChat's wire types reference it.

Validation: 32/32 HITL tests pass (2 new under "Codex review fixes"
describe), full hooks/tools/summarization suite (621 tests) green,
typecheck clean, lint clean on touched files (one pre-existing
warning at `src/tools/ToolNode.ts:751` from `df6b739`, untouched),
build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🛠️ fix: address follow-up review on HITL extraction

Four findings from the follow-up review on commit 389cc88:

#1 (MAJOR) — runPreStreamHooks side-effect documentation. The
extracted helper had three side effects (mutating
`stateInputs.messages`, mutating `config.callbacks`, calling
`registry.clearSession`) but the JSDoc only mentioned the first.
Added a "Side effects" section that calls out each one and explains
why `config.callbacks` is dropped inside the helper rather than by
the caller (mirrors processStream's natural-completion cleanup).
Also added a one-line comment on the defensive `this.Graph == null`
guard explaining why it's there even though `processStream` already
validates Graph (TS narrowing doesn't propagate across method
boundaries; the guard keeps the body free of `!` assertions).

#2 (MINOR) — stream errors after interrupt detection no longer
preserve stale session hooks. Previously the `finally` guard
checked only `_interrupt == null`, so a downstream handler throwing
after the interrupt was stashed would re-throw, hit the finally
with `_interrupt != null`, and leak the session matchers into the
next run. Added a `streamThrew` flag set in the `catch` block and
widened the guard to `_interrupt == null || streamThrew` so a
captured-but-stale interrupt also triggers cleanup. New regression
test covers this exactly: `customHandler` throws once it detects
`run.getInterrupt() != null`, asserts the run still exposed the
interrupt, AND that session hooks were cleared anyway.

#3 (MINOR) — mixed deny/ask/allow batch test now subscribes a
`PostToolBatch` hook and verifies the batch entry shape after
resume reflects the FINAL outcomes (tool_a → error with policy
reason, tool_b → success with ran:tool_b output, tool_c → success
with ran:tool_c output). Closes the validation loop on the
PostToolBatch entry data that batch-level audit / convention hooks
would observe.

#4 (NIT) — blockEntry's `turn` JSDoc no longer overpromises
"mirrors the executed-path turn." Now correctly describes it as the
pre-invocation count of prior successful invocations of the same
tool name, with the same shape the executed path captures before
incrementing.

Not changed:
- #5 (NIT) defensive Graph-null guard kept; reviewer agreed "no
  change needed." Added a one-line comment for clarity.

Validation: 34/34 HITL tests (2 new under "Codex review fixes"
describe), 622 across hooks/tools/summarization, typecheck clean,
lint clean on touched files (only the pre-existing
src/tools/ToolNode.ts:751 warning untouched), build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🛠️ fix: address two more Codex P2s on ToolNode HITL flow

Two findings on the HITL deny path; both were silent correctness
bugs in mixed batches:

1. Duplicate ON_RUN_STEP_COMPLETED on resume (deny + ask batch)
   — `blockEntry` was dispatching `ON_RUN_STEP_COMPLETED` and the
   `PermissionDenied` hook synchronously the moment a deny decision
   landed. When the same batch also held an `ask` decision, the
   subsequent `interrupt()` threw and LangGraph rolled back the node;
   on resume the node re-executed from scratch, `blockEntry` fired
   again, and the host saw two completion events for one logical
   denial. Fix: queue both side effects in
   `deferredBlockedSideEffects` instead of firing immediately, then
   flush exactly once at the end of the hook-handling block. The
   first-pass throw skips the flush; resume / no-interrupt passes
   reach it once. New test counts the dispatched
   `ON_RUN_STEP_COMPLETED` events for the denied tool's call_id and
   asserts exactly one across the interrupt + resume lifecycle.

2. PostToolBatch entry order non-deterministic in mixed batches
   — entries used to be `push`'d into a flat array at three different
   call sites (deny synchronous, approved post-execution, respond on
   resume), so a batch with `[allowed, denied]` could surface
   `[denied, allowed]` in the entry list whenever the deny path
   resolved first. The API contract documents these entries as
   reflecting the original `toolCalls` order, and hooks correlating
   outcomes by position would misattribute. Fix: track entries in a
   `Map<callId, PostToolBatchEntry>` and materialize the array in
   `toolCalls` order at dispatch time. Two new tests pin the order
   under both `[deny, allow]` and `[allow, deny]` orderings; the
   existing mixed-batch test now also asserts the full
   `[call_a, call_b, call_c]` sequence on the snapshot.

Validation: 36/36 HITL tests (3 new), 624 across hooks/tools/
summarization, typecheck clean, lint clean (only pre-existing
ToolNode.ts:751 warning untouched), build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🧹 chore: tighten mixed-batch test assertion + correct stale comment

Round 3 review #2 (NIT). The mixed deny/ask/allow batch test
comment claimed PostToolBatch fires on the interrupted pass with
only tool_a, then again on resume with all three entries. That was
incorrect: PostToolBatch is dispatched at the bottom of
`dispatchToolEvents`, after tool execution, so `interrupt()` throws
before the dispatch ever runs. Only the resume pass yields a
snapshot.

Tightened the assertion from `toBeGreaterThanOrEqual(1)` to
`toHaveLength(1)` so the test pins the actual behavior, and reworded
the comment to match. Future contributors reading this test now
build the right mental model of when PostToolBatch fires relative
to interrupts. The snapshot's per-entry assertions and order
assertion (added in commit 6a9b6fd) are unchanged.

Round 3 review #1 (the duplicate denial side-effect events) was
already addressed in commit 6a9b6fd via the
`deferredBlockedSideEffects` queue + post-interrupt flush.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🧹 chore: handle two NITs from Round 4 review

#1: Add explanatory comment on the `respond` decision branch's
immediate `dispatchStepCompleted` call. Without context, the
asymmetry with `blockEntry`'s deferred dispatch reads as inconsistent
— a future contributor might either "fix" respond to also defer
(unnecessary, wastes effort) or wonder if it's a latent bug. The new
comment explains: respond only executes inside the decision-
processing loop, which is reachable only AFTER `interrupt()` has
returned, so there's no rollback risk and no risk of duplicate
ON_RUN_STEP_COMPLETED dispatch.

#2: Add a test for mixed respond + reject in the same resume batch.
Both decisions land on the same resume pass: respond dispatches
ON_RUN_STEP_COMPLETED immediately via the resume branch; reject
queues into deferredBlockedSideEffects and dispatches via the flush.
The test asserts each tool dispatches exactly once, the PostToolBatch
snapshot fires once with entries in the original toolCalls order, and
the ToolMessages match the resume decisions. Pins the timing
interaction between the immediate-dispatch and deferred-flush paths.

Validation: 37/37 HITL tests pass, 625 across hooks/tools/
summarization, typecheck clean, lint clean (only pre-existing
ToolNode.ts:751 warning untouched), build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🛡️ fix: enforce allowedDecisions on resume + widen interrupt payload typing

Two new Codex findings, both real correctness gaps in the HITL
surface:

P1 — `allowedDecisions` not enforced on resume. A `PreToolUse` hook
returning `{ decision: 'ask', allowedDecisions: ['approve', 'reject'] }`
advertises the restricted set in the interrupt's `review_configs`,
but the SDK's resume handler never validated incoming
`decision.type` against it. A buggy or hostile host UI could submit
`{ type: 'edit', updatedInput: {...} }` and bypass the policy —
mutating arguments or substituting a synthetic tool result that
the hook explicitly forbade.

Fix: per-entry check that reads `decision.type` through a widened
view (handles untyped JSON / typo / missing field) and routes
anything not in the per-call `allowedDecisions` allowlist through
`blockEntry` with a diagnostic reason. The check fires before the
existing approve/edit/reject/respond branches, so policy
enforcement happens regardless of which decision type was offered.

Two new tests pin both directions: an `edit` submitted against an
allowlist of `['approve', 'reject']` is blocked and the host event
never dispatches; an `approve` submitted against the same allowlist
passes through and the tool runs with original args.

P2 — `RunInterruptResult.payload` was typed as `HumanInterruptPayload`
(the SDK's `tool_approval` / `ask_user_question` discriminated
union), but `Run.resume` documents support for custom interrupt
payloads from custom graph nodes. Hosts raising arbitrary shapes
got back a mistyped value and downstream code could falsely assume
the discriminator-based fields existed.

Fix:
- `RunInterruptResult<TPayload = HumanInterruptPayload>` is now
  generic. Default preserves the existing ergonomics for HITL-only
  consumers; custom hosts pass their own type.
- `Run.getInterrupt<T = HumanInterruptPayload>(): RunInterruptResult<T>`
  is generic. Callers assert what shape they expect.
- Internal storage typed as `RunInterruptResult<unknown>`,
  reflecting that the SDK does not validate the runtime payload.
- `isToolApprovalInterrupt` / `isAskUserQuestionInterrupt` widened
  to accept `unknown` and narrow defensively (`typeof === 'object'`,
  null check, then discriminator). Lets host code call them on raw
  values from `getInterrupt()` without first asserting a type.

Two new tests pin both: a custom-payload interrupt is captured and
returned through `getInterrupt<MyCustom>()` with the right shape; the
type guards safely return `false` for `null`, `undefined`, primitives,
empty objects, and wrong-discriminator objects.

Validation: 41/41 HITL tests pass (4 new), 629 across hooks/tools/
summarization, typecheck clean, lint clean (only pre-existing
ToolNode.ts:751 warning untouched), build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🧹 chore: hoist single declaredType read in HITL decision loop

Round 5 review NIT. The `allowedDecisions` enforcement guard
(commit 21692f2) introduced `declaredDecisionType`, which read the
exact same widened value as the existing `declaredType` further
down. Two reads from the same object under different names with
no functional difference — pure cosmetic redundancy.

Hoisted one `declaredType` constant before both checks, removed
the duplicate. The two callers (allowlist enforcement, unknown-type
fallthrough) now share the single source so any future change to
how the wire shape is widened lands in one place.

Behavior unchanged. 41/41 HITL tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🛡️ fix: apply updatedInput before ask + capture nullish interrupt payloads

Two new Codex findings, both real correctness bugs:

P1 — Hook returning `decision: 'ask'` + `updatedInput` silently
dropped the rewrite. The `ask` branch in `dispatchToolEvents`'s
PreToolUse loop pushed to `askEntries` and immediately `continue`d,
skipping the `applyInputOverride` block below. Result:
  - The interrupt payload's `action_requests[i].arguments` surfaced
    the ORIGINAL args to the reviewer (e.g. unredacted secrets a
    sister matcher had rewritten).
  - On approve, the host event dispatched with the ORIGINAL args
    too — the policy rewrite was lost entirely.

Fix: call `applyInputOverride` before queuing into `askEntries`
when both signals are present. Real-world pattern: one matcher
redacts/sanitizes, another matcher requires approval; both must
take effect. New test seeds a hook returning
`{decision: 'ask', updatedInput: {command: 'redacted-command'}}`
against an original arg of `'original-secret'`, then asserts (a)
the interrupt payload shows the redacted value, and (b) the
host event dispatches with the redacted value after approve.

P2 — Nullish interrupt payloads (`interrupt(null)` /
`interrupt(undefined)`) silently downgraded paused runs to
"completed." The detection block in `processStream` only stashed
the interrupt when `first.value != null`; for nullish payloads,
`_interrupt` stayed undefined, `getInterrupt()` returned undefined,
the run looked like a natural completion, and the `Stop` hook
fired. A custom node that pauses without metadata (the pause
itself is the signal) was effectively invisible to the host's
resume handling.

Fix: capture unconditionally on any `__interrupt__` chunk; preserve
the payload as-is (may be `null` / `undefined`). New test pins:
a custom node calling `interrupt(null)` produces a defined
`getInterrupt()` result with `payload === null`, and the Stop
hook does NOT fire.

Validation: 43/43 HITL tests pass (2 new), 631 across hooks/tools/
summarization, typecheck clean, lint clean (only pre-existing
ToolNode.ts:751 warning untouched), build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🛡️ fix: scope halt signals per session + add tool_call_id to review_configs

Two findings from a Round 6 review (the colleague review aligned with
the latest Codex P1):

P1 — Cross-run halt-signal bleed. `HookRegistry._haltSignal` was a
single `HookHaltSignal | undefined` field. Hosts that share one
registry across concurrent runs (a normal pattern — a global policy
hook lives on a shared registry) would see `preventContinuation`
from run A trip run B's stream-loop poll on the next iteration,
silently terminating an unrelated run. Real correctness bug for
multi-tenant / parallel-run hosts.

Fix:
- Replace `_haltSignal` with `haltSignals: Map<sessionId, HookHaltSignal>`.
- `haltRun(sessionId, reason, source)` — first-write-wins per session.
- `getHaltSignal(sessionId)` / `clearHaltSignal(sessionId)` — scoped reads
  and clears.
- `executeHooks` passes its `sessionId` (= the run id every in-tree
  callsite already uses) into `haltRun`. Hooks fired without a
  sessionId can't raise a halt — there's no run for the loop to poll
  under, which is the correct semantic.
- `Run.processStream` polls / clears with `this.id` everywhere
  (resetValues, both pre-stream early-return paths, the mid-stream
  loop poll, and the `finally` cleanup).

New regression test runs two Run instances on a shared registry: a
RunStart hook halts run A; run B's signal stays undefined throughout,
B's processStream completes naturally, and A's session entry is
cleaned up before B even starts.

Optional nit also addressed: add `tool_call_id` to
`ToolApprovalReviewConfig`. The previous shape only carried
`action_name`, so a UI mapping `review_configs[i]` →
`action_requests[j]` had to assume positional pairing — fragile when
a batch contains the same tool called twice (different `tool_call_id`s,
same `name`). Carrying `tool_call_id` lets the UI key by id directly,
matching the same field already on the action_requests side. New test
pins both sides for a duplicate-tool batch.

Validation: 45/45 HITL tests pass (2 new), 633 across hooks/tools/
summarization, typecheck clean, lint clean (only pre-existing
ToolNode.ts:751 warning untouched), build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🛡️ fix: address Round 7 comprehensive review (8 findings, 5 new tests)

Three MAJOR fixes (real correctness gaps), three MINOR (refactor +
coverage + observability), two NIT (docs + style). One MINOR
(MemorySaver warning) deferred — see below.

F1 (MAJOR) — `respond` decision bypassed truncation. Every other
tool-output path runs through `truncateToolResultContent` to enforce
`maxToolResultChars`, but the `respond` branch wrote
`decision.responseText` raw into the ToolMessage. A user pasting a
large document as a manual response could blow past the model
context window. Fix: truncate up front and use the truncated value
in the ToolMessage, the PostToolBatch entry, and the
`dispatchStepCompleted` event so all three see the same final
output. Test pins: 200-char response with `maxToolResultChars: 50`
→ ToolMessage and batch entry both carry the truncated value.

F3 (MAJOR) — Session hooks leaked when interrupt + halt coexisted.
A hook returning BOTH `decision: 'ask'` AND `preventContinuation:
true` set both signals (interrupt captured, halt raised). The
`finally` block's session-clear guard checked only
`_interrupt == null || streamThrew`, so it preserved sessions even
though the run was halted (no resume expected). Fix: widen the
guard to `_interrupt == null || _haltedReason != null ||
streamThrew`. Test pins: hook returns ask + preventContinuation,
both signals land on the Run, and session matchers are cleared.

F2 (MAJOR) — `async: true` hook docs over-promised. The current
implementation awaits the hook callback fully (subject to timeout)
and only ignores its OUTPUT in the fold. The previous JSDoc said
"the agent already moved on" implying detachment, which is false.
Fix: rewrote the JSDoc to be honest about the semantic — the
return value is ignored for influence, but the callback promise
is still awaited. The fire-and-forget pattern requires the hook
body to detach via `void promise.catch()` and return immediately
(the SDK can't speculatively detach based on output shape it
hasn't seen yet). Added a WRONG-pattern example showing the
common mistake (`await` inside the body undoes the intent).

F5 (MINOR) — Extracted two helpers from `dispatchToolEvents`:
- `buildToolApprovalInterruptPayload(askEntries)` — pure module-
  scope function, takes the askEntries list, returns the
  `tool_approval` payload. Exposed `AskEntry` as a module-scope
  type so it can move out of the function body.
- `dispatchPostToolBatchAndInjectContext({...})` — private async
  method on ToolNode, materializes batch entries in `toolCalls`
  order, fires `PostToolBatch`, and pushes the consolidated
  context HumanMessage. Cuts ~70 lines from the main function.
Both extractions preserve behavior; existing tests cover them.

F6 (MINOR) — Added direct tests for `Run.resume()` and
`Run.getHaltReason()`. Previously the resume codepath was only
exercised through `graph.invoke(new Command(...))` and getHaltReason
had only the pre-stream `preventContinuation` test. New tests:
  - `Run.resume([{type:'approve'}], config)` end-to-end through
    the wrapper API after an interrupt.
  - `Run.getHaltReason()` returns `'PII detected'` when
    `UserPromptSubmit` denies the prompt with a reason.
  - `Run.getHaltReason()` returns `'prompt_requires_approval'`
    when `UserPromptSubmit` returns `decision: 'ask'` without
    a custom reason.

F7 (MINOR) — `UserPromptSubmit` deny/ask now sets `_haltedReason`.
Previously the only signal that landed in `_haltedReason` was
`preventContinuation`; deny/ask returned `undefined` from
processStream with no way for hosts to distinguish "blocked" from
"completed naturally with no output." Fix: set canonical reasons
(`prompt_denied`, `prompt_requires_approval`) when the hook
doesn't supply one, surface the hook's `reason` when it does.

F8 (NIT) — Added JSDoc on `HumanInterruptType` documenting the
downstream consumer (LibreChat's wire types in
`librechat-data-provider`) so the export isn't read as dead code.

F9 (NIT) — Removed `// src/path/to/file.ts` header comments from
the four new files (askUserQuestion, hitl/index, createToolPolicyHook,
types/hitl). Per AGENTS.md "Avoid standalone `//` comments unless
absolutely necessary." The descriptive block comments below them
stay.

Deferred: F4 (MemorySaver warning). The `humanInTheLoop` JSDoc
already documents "production hosts should always provide a durable
checkpointer." A console.warn per Run.create would push hosts to
opt out, defeating the deliberate default-on contract. If observability
becomes a real issue, a debug-gated log via `emitAgentLog` is the
right shape — but that needs a logger plumb-through and is out of
scope for this PR.

Validation: 50/50 HITL tests pass (5 new), 638 across hooks/tools/
summarization, typecheck clean, lint clean (only pre-existing
ToolNode.ts:808 warning untouched), build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🛡️ fix: guard malformed respond decisions + cover prompt_denied canonical default

Codex P1: malformed respond payloads (no responseText, or non-string
responseText) crashed `truncateToolResultContent` on `content.length`
— turning a fail-closed approval path into a hard run failure for one
bad wire payload. Hosts deserialize resume payloads from untyped JSON,
so the SDK can't trust the shape.

Fix: validate `responseText` is a string BEFORE passing to truncation.
If missing or non-string, route through `blockEntry` with a diagnostic
that includes the actual offered shape (`<missing>` or the typeof) so
the host can debug. Two regression tests pin both cases:
- `{ type: 'respond' }` (no field) → blocked, error mentions
  `'<missing>'`, host event never dispatched
- `{ type: 'respond', responseText: 42 }` → blocked, error mentions
  `'number'`

Also addresses Round 7 follow-up review N2 (NIT, marked "not worth
filing" by the reviewer): added an explicit test for the
`prompt_denied` canonical default — the existing test only covered
the custom-reason path. Now both the custom and canonical paths are
explicitly pinned for both `decision: 'deny'` and `decision: 'ask'`.

Skipped follow-up review N1 (pre-existing `// src/run.ts` header) —
F9 was scoped to new files only; the reviewer agreed the scope is
intentional.

Validation: 53/53 HITL tests pass (3 new), 641 across hooks/tools/
summarization, typecheck clean, lint clean (only pre-existing
warning untouched), build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🛡️ fix: validate edit payloads + correct HITL default doc + scope direct-tool gap

Three findings from Round 8 review (Codex's 5; 1 stale, 2 architectural
deferred per the reviewer's own option, 2 fixed here + 1 doc):

P2 (Codex new) — Malformed `edit` decisions weren't validated. Same
trust boundary as the `respond` fix in 6a93e55: hosts deserialize
resume payloads from untyped JSON, and `{ type: 'edit' }` (no
`updatedInput`), `{ type: 'edit', updatedInput: 'string' }`, or
`{ type: 'edit', updatedInput: [...] }` would feed garbage into
`applyInputOverride` and silently approve a tool with the wrong-shape
args. Fix: typeof + null + Array.isArray check before approving;
fail-closed via `blockEntry` with a diagnostic that includes the
offered shape (`<missing>` / `null` / `array` / typeof). Three
regression tests pin all three malformed cases.

Refactored both `respond` and `edit` diagnostic-formatting code into a
single `describeOfferedShape(value)` module-scope helper to drop the
nested-ternary lint warnings the new check introduced and keep the
diagnostic strings consistent across both decision branches.

P3 (Codex new) — `ToolNodeOptions.humanInTheLoop` JSDoc still
described the pre-default-on contract ("undefined keeps fail-closed").
Updated to mirror `RunConfig.humanInTheLoop`'s default-on language and
explicitly note the event-driven-only scope (the same caveat already
documented on the parallel `hookRegistry` field).

Architectural P2s deferred — Codex's reviewer explicitly offered this
path: "If this PR is only meant to unblock LibreChat's event-driven
tool approval path, they can be documented as out of scope."
  - Direct tools (those routed through `directToolNames` like
    handoffs/subagents) bypass HITL entirely.
  - Mixed direct+event batches re-execute direct tools on resume
    because LangGraph rolls back the entire ToolNode on `interrupt()`.

Both are real but require restructuring `ToolNode.run` to defer
direct execution until after HITL approval — non-trivial and better
done as a focused follow-up. Documented explicitly in
`HumanInTheLoopConfig`'s JSDoc with practical implications for hosts:
LibreChat-style event-driven hosts get the full surface; direct-tool
hosts need to switch to event-driven mode or accept that direct
tools aren't approval-gated; mixed batches with side-effecting
direct tools should not be combined with approval-gated event tools.

Skipped: P1 (malformed respond) — already fixed in 6a93e55. Codex
just hadn't re-evaluated yet.

Validation: 56/56 HITL tests pass (3 new), 644 across hooks/tools/
summarization, typecheck clean, lint clean (only pre-existing
ToolNode.ts:826 warning untouched), build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🛡️ fix: preserve Graph sidecars across HITL interrupt + resume

Codex P1: after a clean HITL interrupt, `processStream` was wiping
the sidecars `Run.resume()` needs:
- `Graph.resetValues()` on the next invocation cleared
  `toolCallStepIds`, `stepKeyIds`, `messageStepHasToolCalls`,
  accumulated `messages`, etc.
- `Graph.clearHeavyState()` in the prior `finally` cleared
  `toolCallStepIds`, `_toolOutputRegistry`, `sessions`,
  `hookRegistry`, `humanInTheLoop`.

Result: when LangGraph re-entered ToolNode on resume, the lookup
in `dispatchStepCompleted`'s `toolCallStepIds.get(toolCallId) ?? ''`
returned `''`. The host then dispatched `ON_RUN_STEP_COMPLETED`
with an empty step id, and `stream.ts` dropped the completed tool
result. Specifically broken for LibreChat's stream-resume story
where the resumed approval path needs to complete the
already-rendered tool step.

The existing HITL tests didn't catch this because they replace
`run.graphRunnable` with a custom graph whose ToolNode owns its
own prefilled `toolCallStepIds` Map — disjoint from the SDK
Graph's, so the SDK Graph's clear didn't observably affect the
test ToolNode's lookups.

Two gates added:

1. `processStream` finally: skip `Graph.clearHeavyState()` when
   `_interrupt != null && _haltedReason == null && !streamThrew`
   — i.e., a clean interrupt awaiting resume. Halt-driven and
   error-driven paths still clean up because no resume is expected.

2. `processStream` entry: skip `Graph.resetValues()` when entering
   via `Command` (resume). We're continuing an in-flight run, not
   starting fresh — the sidecars must survive the boundary.

Cross-process resume (host rebuilds Run from scratch) is a
separate concern handled host-side; documented in
`HumanInTheLoopConfig` JSDoc earlier. This fix covers the
same-Run-instance pause-and-resume path that's the SDK's primary
contract.

Two regression tests, paired:
- "preserves Graph sidecars across HITL interrupt + resume" wires
  the test ToolNode to share `run.Graph.toolCallStepIds` by
  reference (mirroring how `StandardGraph` builds its inner
  ToolNode at Graph.ts:587). Pre-populates the map inside the
  agent node (mirroring `attemptInvoke`'s timing). Asserts the
  entry survives both the interrupt finally AND the resume entry,
  AND that the resumed dispatch fires `ON_RUN_STEP_COMPLETED`
  with the real step id (not `''`).
- "clears Graph sidecars on natural completion" pins the negative
  case: a non-interrupt run still triggers `clearHeavyState`, so
  this gate doesn't accidentally leak memory across runs.

Verified the regression tests catch the bug pre-fix by stashing
`run.ts` and re-running: the preservation test fails on
`expect(toolCallStepIds.has('call_1')).toBe(true)` exactly as
the reviewer described.

Validation: 58/58 HITL tests pass (2 new), 646 across hooks/tools/
summarization, typecheck/lint/build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* 🚦 chore: flip HITL default to OFF until host UI ships

Hosts (notably LibreChat) don't yet render `tool_approval` interrupts,
so a default-on HITL surface would let `ask` decisions pause the run
with no resolver — surfaces to end users as a hung tool-call card.
Flipping the default to OFF keeps existing hosts on the pre-HITL
fail-closed path until they're ready to wire the resume UI.

Hosts opt in explicitly with `humanInTheLoop: { enabled: true }`. The
plan of record (documented on `HumanInTheLoopConfig`) is to flip the
default back to ON in a future minor once the consumer ecosystem is
ready end-to-end.

- `Run.applyHITLCheckpointerFallback`: only attach the in-memory
  `MemorySaver` when `enabled === true`. Omitted/false leaves
  `compileOptions.checkpointer` untouched.
- `ToolNode` ask-decision branch: collapse `ask` into a synchronous
  block (the pre-HITL behavior) unless `enabled === true`.
- JSDoc on `HumanInTheLoopConfig`, `RunConfig.humanInTheLoop`, and
  `ToolNodeOptions.humanInTheLoop` rewritten to lead with default-off
  semantics + the migration plan.
- Tests: rename "default-on" assertion to "default-off blocks", and
  flip the host-checkpointer-preservation test to thread `enabled: true`
  explicitly (since default-off no longer engages the fallback path).

Bump to 3.1.75-dev.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
danny-avila added a commit that referenced this pull request May 4, 2026
#9 P1 — dangerous-command gate missed quoted targets
----------------------------------------------------
`validateBashCommand` checks `dangerousCommandPatterns` against
`normalized` — the post-`stripQuotedContent` form. That pass blanks
the contents of every quoted span, so destructive targets that the
LLM (accidentally or not) wraps in quotes — `rm -rf "/"`,
`rm -rf "$HOME"`, `chmod -R 777 "/"` — slip past the regex even
though they execute identically to the bare form. Real safety
regression on the default-on guard.

Fix: split the destructive check into two passes.

  1. The existing `dangerousCommandPatterns` continue to run against
     `normalized`. Same false-positive defence (`echo "rm -rf /"`
     stays valid because the destructive text is wrapped by the
     OUTER quote pair, not by quotes around the path argument).
  2. New `quotedDestructivePatterns` run against the ORIGINAL
     command string. Each pattern explicitly REQUIRES a matching
     quote pair around the destructive path (`"/"`, `"$HOME"`,
     `'/'`, etc.). `echo "rm -rf /"` doesn't match — the `/` isn't
     surrounded by quotes there, only the whole `rm -rf /` string is.

Tests: `codex review fixes (round 4) › quoted destructive targets`
covers `"/"`, `"$HOME"`, `'/'`, `chmod -R 777 "/"`, no-regression
on bare forms, and the `echo "rm -rf /"` no-false-positive case.

#10 P1 — workspace policy hook compared lexical paths only
----------------------------------------------------------
`createWorkspacePolicyHook` did `isAbsolute(p) ? resolve(p) :
resolve(root, p)` and called it done. A symlink inside the workspace
pointing outside (e.g. `workspace/escape → /etc/secret`) lexically
appears under `workspace/` and got auto-allowed — even though the
agent reading through it touches a path the policy meant to gate.

Critical when the hook is the primary gate: the documented "ask the
user" setup turns `allowReadOutside: true` so the file tools' own
clamp is off, leaving the hook as the sole defence.

Fix: the hook now realpaths both the candidate and the workspace
roots before comparing. Roots are pre-realpath'd once at construction
(memoized via a Promise so the per-call cost is one realpath) and
candidate paths get the same `realpathOfPathOrAncestor` walk
`resolveWorkspacePathSafe` already uses — so paths that don't yet
exist (e.g. `write_file` to a brand-new path) still get checked
against their nearest existing ancestor's realpath.

Two test cases pin both halves:
  - `workspace/escape → extra/secret.txt` — hook DENIES read of
    `escape` even though it's lexically inside.
  - `extra/alt-mount → workspace/` — hook ALLOWS read of
    `extra/alt-mount/file.ts` because the realpath lands inside the
    workspace root (covers the alternate-mount case so we don't
    over-correct).

Tests: 1272 passing (was 1264; +8 across both fixes).
danny-avila added a commit that referenced this pull request May 4, 2026
, #5, #7, #9, #10, #11)

Comprehensive-review audit findings, fixed in one pass:

#1 (attachments full-file MIME sniff): switch classifyAttachment to
   open() + 12-byte header read for sniffing; only readFile() the
   full buffer when we're about to embed. A 9 MB image now allocates
   12 bytes for sniffing instead of 9 MB.

#3 (fallback walker SKIP_DIRS too small): expanded the Node-fallback
   walker's skip set to cover dist/build/.next/.cache/__pycache__/
   .venv/venv/target/vendor/coverage/.tox/.mypy_cache and friends.
   ripgrep itself respects .gitignore so this only affects the
   fallback.

#5 (dynamic fs/promises import): replaced two `await import('fs/
   promises').then(...)` calls in CompileCheckTool with a static
   `import { readFile, stat } from 'fs/promises'` per AGENTS.md.

#7 (test-only resets in public API): added @internal JSDoc to all
   three `_reset*ForTests` functions so IDE autocomplete signals
   they aren't part of the SDK surface. (The leading underscore was
   convention-only; @internal is the standard tag.)

#9 (legacy fields missing @deprecated): added @deprecated tags to
   `LocalExecutionConfig.cwd` and `allowOutsideWorkspace` pointing
   to their workspace.* replacements.

#10 (dead lexicallyInside compute in workspace policy hook): dropped
   the unused variable. Realpath is the source of truth for both
   the symlink-escape and alternate-mount cases; the lexical check
   provides no information so we don't pay for it.

#11 (voided unused imports in pi comparison script): removed the
   `void readdir; void cp;` workaround and the dead imports they
   were silencing.

Bonus: also addressed the comprehensive review's #2 (bitwise OR) by
verifying it was a misread — the code uses `||`, not `|`. Manual
review's reading was wrong; no fix needed.

Larger findings addressed:

#4 (no editStrategies tests): added
   `src/tools/local/__tests__/editStrategies.test.ts` with 12 cases
   covering exact / line-trimmed / whitespace-normalized /
   indentation-flexible match boundaries, start/end-of-file matches,
   unicode content, multi-line needles spanning blank lines,
   ambiguous-match rejection, and applyEdit semantics.

#12 (no FileCheckpointer tests): added
   `src/tools/local/__tests__/FileCheckpointer.test.ts` with 6
   cases covering capture+rewind happy path, idempotent capture,
   absent-file rewind (deletion), multi-file rewind, oversize-file
   tracking-but-not-snapshotting, and rewind across a deleted
   parent directory.

Performance:

#8 (lineWindow split-whole-file): replaced `content.split('\\n')`
   in `lineWindow` with a newline-walk that's O(start + limit)
   instead of O(file). For a 10 MB file with `offset: 1, limit: 10`
   this drops from "split the whole file into millions of strings"
   to ~10 indexOf calls. Falls back to the simple split when no
   limit is supplied.

Deferred:

#6 (spill files never cleaned) needs a Run-lifecycle hook to run
   cleanup; tracked separately rather than fixed inline.
danny-avila added a commit that referenced this pull request May 4, 2026
Follow-up audit on commit ebf8b6a flagged 4 nits/minors. All valid:

F1 (NIT): the lineWindow perf fix introduced exactly the same
  `void <var>` pattern that finding #10 had me remove from
  createWorkspacePolicyHook. Embarrassing — dropped the dead
  `line` variable + `line++` + `void line`.

F2 (MINOR): the editStrategies test commit message claimed coverage
  of "ambiguous-match rejection" but no test actually exercised the
  multi-match → null path. Added the missing case for the exact
  strategy. Noted in a comment that the looser strategies in the
  chain may still resolve duplicates unambiguously by falling
  through; whether that's correct is a separate design call.

F3 (NIT): the new FileCheckpointer.test.ts had a `import('fs/
  promises')` inside a catch block when `mkdir` was already
  available via the static import at the top. Static-imported
  `mkdir` and restructured the test to mkdir-then-writeFile
  unconditionally (the catch-and-retry pattern was unnecessary
  defensive plumbing).

F4 (NIT): the unused-imports cleanup missed `sum`/`void sum` in
  the comparison harness. Removed the function and its void.
danny-avila added a commit that referenced this pull request May 5, 2026
* feat: add local execution engine

* feat(local-engine): hardening + ergonomics follow-ups

Adds the follow-up items flagged in the design comparison so the local
engine ships with a defensible default posture.

Security & correctness
- Bridge bearer token: per-Run 256-bit token; programmatic Python and
  bash callers send `x-librechat-bridge-token`; server uses
  `timingSafeEqual`. Stops cross-process listeners on the same loopback
  port from invoking SDK tools.
- Symlink-aware workspace path: new `resolveWorkspacePathSafe()` walks
  up to the nearest existing ancestor, realpaths it, and rejects
  containment escapes — covers `write_file` to a brand-new path
  without false positives on `/tmp -> /private/tmp` style ancestors.
- Read tool guards: stat-cap (default 10 MiB, configurable via
  `local.maxReadBytes`), NUL-byte binary detection on the first 8 KiB
  before reading the rest, returns descriptive stubs.
- Local engine warns once per process when it runs without
  `@anthropic-ai/sandbox-runtime` wrapping so the no-sandbox default
  stays loud in CI/server logs.

Validation
- New `bashAst` config: `'off' | 'auto' | 'strict'` heuristic
  categorical-hazard pass over the quote-stripped command
  (command substitution, zsh privileged builtins,
  `/proc/<pid>/environ` reads, `IFS=` injection, hex-escape
  obfuscation, `source $VAR`, plus `eval`/`exec` denied in strict).
  Forward-compatible name for a future tree-sitter-bash AST drop-in.

Ergonomics
- File checkpointing: `LocalFileCheckpointerImpl` snapshots pre-write
  contents for `write_file`/`edit_file` and rewinds them on demand.
  `createLocalCodingToolBundle` returns the checkpointer alongside the
  tools when `local.fileCheckpointing: true`.
- Pluggable spawn seam: `LocalExecutionConfig.spawn?: LocalSpawn` lets
  callers route every command through SSH/containers/remote runners
  without forking the engine.
- Ripgrep fallback: `grep_search` and `glob_search` now probe for `rg`
  once per process and fall back to a Node-side walker + JS regex when
  ripgrep isn't on PATH; the old hard requirement is gone.

Tests
- 20 new unit tests covering AST findings, sandbox-off warning latch,
  checkpointer round-trip and file deletion, binary-file refusal,
  oversize stub, symlink escape, ripgrep fallback, and bridge auth
  rejection. Existing 109 tests still pass.

* feat(toolnode): fire PreToolUse/PostToolUse/PostToolUseFailure on direct path

Before this change, in-process tools added to `directToolNames` (every
graphTool with a real implementation, including the new local-engine
tools) skipped the hook lifecycle entirely. Only schema-only tools
that round-tripped through the host event-dispatch path actually fired
PreToolUse/PostToolUse/PostToolUseFailure/PermissionDenied. That meant
`createToolPolicyHook`, the new HITL `'ask'` flow, and PostToolUse
output rewriting all silently no-op'd on direct tools — including
every `bash`/`code`/`edit_file`/`write_file` from the local engine.

This adds `runDirectToolWithLifecycleHooks(call, config, batchContext)`
which wraps `runTool` with the same hook semantics
`dispatchToolEvents` uses for the single-call case, and routes the
three direct-call sites in `run()` through it. Behavior:

- Fast path: when the registry has none of PreToolUse / PostToolUse /
  PostToolUseFailure registered for this run, falls through to
  `runTool` with zero overhead. No regression for callers without
  hooks.
- PreToolUse: pre-resolves args (matching the event path), fires the
  hook with resolved args.
  - `decision: 'deny'` → synthesizes a Blocked error ToolMessage and
    fires PermissionDenied (observational).
  - `decision: 'ask'` → fail-closed deny regardless of
    `humanInTheLoop.enabled`. The event path handles 'ask' via
    LangGraph's interrupt(), which requires restructuring the direct
    invoke path; until that lands, returning 'ask' on a direct tool
    blocks the call (strict improvement vs. ignoring the hook
    entirely). A one-time warning fires when HITL is enabled, so
    hosts notice the gap.
  - `updatedInput` → applied to the call before runTool; runTool's
    placeholder resolution is idempotent on already-resolved args.
- PostToolUse: when runTool returns a non-error ToolMessage, fires
  PostToolUse and applies `updatedOutput` (preserving id/name/status).
- PostToolUseFailure: when runTool returns a status='error'
  ToolMessage, fires the failure hook (observational); the original
  error continues to propagate.

PostToolBatch participation across direct + dispatched outcomes is
intentionally out of scope: the batch hook accumulates inside
dispatchToolEvents and is fired at the end of its scope, and merging
direct-call outcomes into that aggregation crosses the two paths'
result-shaping logic. Left as a follow-up.

8 new unit tests in directToolHooks.test.ts cover deny+PermissionDenied,
ask fail-closed, allow, updatedInput, updatedOutput, PostToolUseFailure,
no-hooks fast path, and a mixed batch where one direct call denies and
the sibling runs.

Total non-API test count: 1163 -> 1171 passing.

* chore(scripts): add live-API smoke scripts for the local engine

Four scripts that exercise the local execution engine end-to-end with
a real LLM provider (matching the convention in src/scripts/*). Each
spins up a temporary workspace, runs the graph, then dumps observable
state so the behavior under test is visible in the console.

- local_engine.ts        — baseline e2e: bash + write_file + read_file
                           + edit_file + list_directory in a single
                           multi-step task. Sets `bashAst: 'auto'` so
                           the categorical-hazard pass is exercised.
- local_engine_ptc.ts    — `run_tools_with_code` (Python) calling
                           write_file / read_file / edit_file through
                           the localhost bridge in a single call. The
                           bridge token auth (timingSafeEqual) is on
                           by default for every Run, so this also
                           exercises that path.
- local_engine_hooks.ts  — direct-path hook integration: registers a
                           `createToolPolicyHook` that denies
                           write_file / edit_file plus an explicit
                           PostToolUse that prefixes every successful
                           tool result with "[reviewed]". Asks the
                           model to write then read; the write must
                           be blocked, the read must come back tagged.
                           Also asserts blocked.txt does not land on
                           disk.
- local_engine_checkpointer.ts — builds the local coding tools via
                           `createLocalCodingToolBundle({ fileCheckpointing: true })`
                           so the host keeps a handle on the
                           checkpointer, lets the model edit two
                           seeded files, then calls `rewind()` and
                           verifies originals come back. Exits non-zero
                           if rewind didn't restore both files.

npm scripts added for each: `local`, `local:ptc`, `local:hooks`,
`local:checkpointer`. All four typecheck clean (`tsc --noEmit`).

* fix(graph): pass hookRegistry on the non-event-driven path

Live integration test (`npm run local:hooks`) surfaced that hooks
weren't actually firing on local-engine tools when used with a plain
`graphConfig: { type: 'standard' }` (no `agentContext.toolDefinitions`).
Reason: `Graph.ts` only forwarded `hookRegistry` and `humanInTheLoop`
to ToolNode in the event-driven branch — the legacy non-event-driven
branch dropped both, so the policy hook silently no-op'd and a
`write_file` that the script's `createToolPolicyHook` was supposed to
deny landed on disk.

Two changes:

1. `StandardGraph.createToolNode` now passes `hookRegistry` and
   `humanInTheLoop` in BOTH branches. Hooks are conceptually
   orthogonal to whether tools dispatch as events or invoke in
   process; only the event path was historically wired.
2. `ToolNode.run`'s non-event-driven `else` branch now invokes
   `runDirectToolWithLifecycleHooks` instead of `runTool` directly,
   so PreToolUse / PostToolUse / PostToolUseFailure / PermissionDenied
   fire the same way they do on the event-driven mixed-batch path.
   The helper still falls through to `runTool` on the no-hooks fast
   path, so callers without hooks pay zero overhead.

Verified by re-running `npm run local:hooks`:
  - PermissionDenied fired 1 time(s) for write_file with the policy
    reason
  - PostToolUse saw read_file (the allowed read got the [reviewed]
    tag)
  - blocked.txt landed on disk? false (expected: false)

Also re-ran the full non-API jest suite — 1171 passing, no regressions.

* feat(local-engine): coding-tool parity push (fuzzy edit, diff, BOM/EOL, overflow)

Live cross-comparison against claude-code, pi-mono, and opencode (sst)
surfaced four meaningful gaps in the foundation tools. This commit
closes them; LSP integration and apply_patch (multi-file) are larger
and stay as documented follow-ups.

Edit fuzzy matching
-------------------
Our edit_file required exact `oldText` match, so a single mistyped
space would force the LLM to re-read and retry. New
`editStrategies.ts` walks a four-stage chain — exact, line-trimmed,
whitespace-normalized, indentation-flexible — stopping at the first
strategy that locates EXACTLY ONE match. The matched on-disk slice is
then literally replaced with `newText`; we never modify newText. The
strategy that fired is surfaced in the tool result's `strategies`
metadata so callers can see when fuzzy fallback kicked in. Inspired
by opencode's nine-strategy chain; kept to the four highest-yield
strategies for a first cut, room for more (block-anchor + Levenshtein
etc.) as needed.

Unified-diff output on edit/write
---------------------------------
`edit_file` and `write_file` now embed a unified diff in the
ToolMessage so the model sees what actually changed instead of the
prior `Applied 1 edit(s)` blob. Uses npm `diff`'s
`createTwoFilesPatch` with 3 lines of context, capped at 4 KB so
large rewrites don't blow context.

BOM + line-ending preservation
------------------------------
New `textEncoding.ts`:
- `decodeFile` strips and remembers UTF-8 BOM and CRLF/LF newline.
- `encodeFile` reapplies them on write.

`edit_file` / `write_file` now read the file, decode to LF + remember,
operate, and re-encode on write. Verified by tests: editing a CRLF
file keeps CRLF; overwriting a BOM file keeps the BOM.

Bash output overflow → temp file
--------------------------------
`spawnLocalProcess` previously truncated overflowing output in place.
Now: when total bytes exceed 2× `maxOutputChars`, the FULL
stdout/stderr is written to a temp file (`/tmp/lc-local-output-*.txt`)
and the path is appended to the formatted output as
`full_output_path: …`. The model can then `bash tail -n N <path>` (or
`grep`) to inspect specifics it actually needs. Mirrors pi-mono's and
opencode's overflow handling.

New deps
--------
- `diff@9` and `@types/diff@7` — the canonical TS diff lib opencode
  uses too, MIT-licensed.

Tests
-----
Added unit tests for each new behavior — 5 new cases for fuzzy
matching, diff embedding, CRLF preservation, BOM preservation. Ran
the full non-API jest suite: 1176 passing (was 1171). Ran all four
live integration scripts (`local`, `local:hooks`, `local:checkpointer`,
`local:ptc`) end-to-end against the real Anthropic API; all four
green and the diff payload now shows up in the model-visible tool
results.

* feat(local-engine): inline image/PDF attachments in read_file

Closes the last "Close now" parity gap: read_file can now return image
and PDF files as inline `MessageContentComplex[]` content blocks so
vision-capable models actually see the bytes, instead of getting our
old "binary file" stub.

New module
----------
src/tools/local/attachments.ts

- Magic-byte sniff for the five formats we care about (PNG, JPEG,
  GIF, WebP, PDF). Inlined the signatures rather than pulling in
  `file-type` (ESM-only, awkward under ts-jest); LibreChat's
  api/server/utils/files.js takes the same approach but uses the
  package since it's a CJS-OK Node server.
- Returns a typed Attachment union that the read tool branches on:
  image -> image_url block, pdf -> image_url block (for Anthropic's
  PDF-as-data-url path), oversize -> stub, binary -> stub,
  text-or-unknown -> falls through to the existing line-window read.
- Configurable via two new fields on `LocalExecutionConfig`:
  - `attachReadAttachments`: 'off' (default) | 'images-only' |
    'images-and-pdf'. Defaults `'off'` to preserve current behavior;
    hosts opt in when their model is vision-capable.
  - `maxAttachmentBytes`: pre-encoding size cap (default 5 MiB) that
    bounds the post-base64 token cost.

Plumbing
--------
- read_file branches on attachment.kind. For an image, it returns
  `[ [textBlock, imageUrlBlock], { artifact } ]`. LangChain's
  Anthropic adapter (verified at
  node_modules/@langchain/anthropic/dist/utils/message_inputs.js)
  preserves image_url blocks inside `tool_result` content arrays;
  OpenAI Chat Completions accepts them on vision models too. The
  Responses API flattens to text on the wire (function_call_output
  is string-only) which is the safe degrade.
- The textBlock that goes alongside the image_url carries the
  filename/mime/byte count so non-vision models still get useful
  signal.

Tests
-----
4 new unit cases:
- default behavior: still returns the binary stub
- images-only: returns array content with `image_url` data URL +
  artifact { mime: 'image/png', attachment: 'image' }
- oversize: caps embedding past `maxAttachmentBytes` -> "Refusing to
  embed" stub
- text files unaffected when embedding is enabled

29 LocalExecutionTools tests pass; 1180 non-API tests overall.

Live verification
-----------------
New `npm run local:image` script:
1. Copies a real PNG (the macOS Certificate Assistant droppedImage)
   into a temp workspace.
2. Asks the agent to `read_file` it and describe what's in it.
3. Inspects the run's ToolMessage to confirm the content is
   `[textBlock, imageBlock]` with a `data:image/png;base64,...` URL.

Live run against real Anthropic Sonnet:
  ToolMessage(name=read_file, content=[text,image_url] url=data:image/png;base64,iVBORw0KGgo…)
  Image attachment landed in tool result: true ✔
  ASSISTANT: "The image shows a blank certificate template with a
  light blue border. The word 'Certificate' appears at the top in an
  elegant, cursive script font. There's a decorative flourish or
  underline beneath the text, and a gold/yellow seal or badge
  decoration in the lower right portion..."

That's the actual content of the source PNG — vision is wired
through end-to-end.

* feat(local-engine): post-edit syntax check + compile_check tool

The two cheap "tell-the-agent-when-it-broke-the-file" alternatives I
called out as the pragmatic substitute for full LSP integration. Both
are opt-in.

1. Post-edit syntax check
-------------------------
New `local.postEditSyntaxCheck: 'off' | 'auto' | 'strict'`
(default `'off'`).

After every successful `edit_file` / `write_file`, runs a fast
per-file syntax checker keyed on extension and appends any error to
the tool result so the model self-corrects on the next turn without
a separate read round-trip:

  - `.js` / `.mjs` / `.cjs` / `.jsx`  -> `node --check <file>`
  - `.py`  / `.pyw`                   -> `python3 -m py_compile`
  - `.json`                           -> `JSON.parse` (in process)
  - `.sh`  / `.bash`                  -> `bash -n <file>`

`'auto'` appends `[syntax-check warning ...]` to the tool result and
returns success. `'strict'` throws the error so the next turn must
react. Each checker probes once for tool availability and silently
skips if missing.

TypeScript per-file syntax check is intentionally not in this list:
per-file `tsc` is slow and only catches a small fraction of TS
issues without type information. Use `compile_check` (below) for that.

2. compile_check tool
---------------------
New tool auto-bound when `engine: 'local'` and `includeCodingTools`
is on. Lets the agent ask "did my change break anything?" without us
shipping a real LSP client.

Auto-detection (first hit wins):
  - tsconfig.json or package.json[typescript] -> `npx --no-install tsc --noEmit`
  - Cargo.toml                                -> `cargo check --message-format=short`
  - go.mod                                    -> `go vet ./...`
  - pyproject.toml/setup.py with mypy         -> `python3 -m mypy .`
  - pyproject.toml/setup.py without mypy      -> `compileall .`
  - none of the above                         -> tells the agent "no toolchain"

Honours `local.compileCheck.command` and `compile_check.command` arg
overrides; honours `local.compileCheck.timeoutMs` (default 120s).
Output is truncated through the standard local-engine output cap and
spills overflow to a temp file like the bash tool does. Returns
exit-code-aware `PASSED`/`FAILED` headline + body.

Tests
-----
8 new unit cases:
  - JS/JSON broken/valid round-trip
  - returns null for unknown extensions
  - write_file appends `[syntax-check warning ...]` in `auto`
  - write_file in `strict` throws
  - compile_check reports "no recognised project marker" in an empty workspace
  - compile_check honours an explicit command override and reports exit codes

37 LocalExecutionTools tests pass; 1188 non-API tests overall.

Live integration
----------------
New `npm run local:compile` script that:
  1. Seeds a tiny TS workspace (tsconfig.json + index.ts).
  2. Asks Anthropic Sonnet to write a broken JS file (post-edit
     syntax-check warns) -> fix it -> write a broken TS file with a
     real type error -> call compile_check (must FAIL with
     `error TS2322`) -> fix it -> call compile_check again (must
     PASS).

Live run, end-to-end:
  - `[syntax-check warning via node --check]` appended to write_file
  - First compile_check: FAILED via `npx --no-install tsc --noEmit`
    (exit=2): `broken.ts(1,14): error TS2322: Type 'string' is not
    assignable to type 'number'.`
  - Second compile_check (after edit_file fix): PASSED
  - Both assertions ✔.

Other live scripts (local, local:hooks, local:checkpointer,
local:ptc, local:image) still green.

* chore(scripts): side-by-side comparison harness vs pi-mono

`npm run compare:pi` runs four identical tasks against pi-mono's CLI
and our local engine in parallel temp workspaces using the same model
(claude-sonnet-4-5 by default). Captures: tool-call shape, wall time,
input/output/cache tokens, and verifies the workspace ended in the
expected state.

Tasks
-----
- T1 simple-edit       : single literal substitution in greet.py
- T2 fuzzy-edit        : edit a file with trailing whitespace + tabs;
                         exercises both agents' fuzzy-match recovery
- T3 syntax-error-fix  : pre-seeded broken JS; ours has post-edit
                         syntax check, pi has to discover via bash
- T4 type-error-fix    : pre-seeded TS file with a real type error in
                         a tiny tsconfig project; ours has
                         compile_check, pi has bash + tsc

Pi runner spawns `node $PI_BIN --print --mode json --no-session
--provider anthropic --model <model>` and parses its JSON event
stream. Ours uses Run.create with `engine: 'local'` and walks the
final conversation. PI_BIN defaults to
~/Projects/pi-mono/packages/coding-agent/dist/cli.js.

Live results (real Anthropic, Sonnet 4.5)
-----------------------------------------
  task                    metric      pi     ours
  T1 simple-edit          verify      ✔      ✔
                          wall        8.4s   11.0s
                          tool calls  2      2
                          output tok  313    281
  T2 fuzzy-edit           verify      ✔      ✔
                          wall        16.1s  7.2s
                          tool calls  2      2
                          output tok  384    246
  T4 type-error-fix-loop  verify      ✔      ✔
                          wall        23.5s  13.6s
                          tool calls  6      4
                          output tok  805    397
  T3 syntax-error-fix     verify      ✔      ✔
                          wall        16.2s  11.7s
                          tool calls  3      4    (note: ours used 1
                                                  bash, pi used 2)
                          output tok  371    503

Both agents 4/4 verify ✔. Ours wins wall-time on 3/4 tasks, output
tokens on every task, and tool-call count on T4 (compile_check
short-circuits the bash+read+bash loop pi did).

Open follow-up surfaced by the harness: pi reports thousands of
cacheRead tokens per turn (Anthropic prompt cache rebate), ours
reports zero. We're paying full rate for the system+tools prefix on
every turn. Not a regression introduced here, but worth its own PR
to wire prompt caching on the Anthropic adapter.

* chore(scripts): compare:pi — caching enabled, +2 tasks, +variance

Updates to the side-by-side harness based on first-run findings.

1. Anthropic prompt caching wired on our side
-----------------------------------------------
The first comparison reported ours' cache_read=0 — pre-existing PR
clarification: caching IS supported (`addCacheControl` + agent
context's `hasAnthropicPromptCache()`), but it's gated on
`clientOptions.promptCache === true`. The harness's first cut set
that on `graphConfig.clientOptions`, but `Run.createLegacyGraph`
(src/run.ts:156) destructures `llmConfig` and rebuilds
`agentConfig.clientOptions` from it — `graphConfig.clientOptions`
is silently dropped on the standard path. So in this harness we
set `promptCache: true` on the llmConfig itself.

After the fix, ours reports real cache_read / cache_creation per
turn. Caching works.

2. New tasks
------------
- T5 multi-file-rename : rename `calc_total` → `calculateTotal` across
  src/lib.ts, src/index.ts, src/index.test.ts. Tests how each agent
  finds + applies the rename.
- T6 image-read-and-describe (ours-only): copies a real PNG into the
  workspace (macOS Certificate Assistant icon) and asks the model to
  describe it. Exercises our `attachReadAttachments: 'images-only'`
  end-to-end. pi has no equivalent and is skipped via the new
  `skip: 'pi'` field.

3. Variance support
-------------------
COMPARE_ITERS=N runs each task N times per side and reports the
mean. Skipped tasks count cleanly (no false N/A entries). The
verify check now force-fails when a runner errored, so a soft
verify (e.g. T6's "file-still-on-disk" check) can't mask a real
provider rejection.

4. Cost computation
-------------------
We now compute our cost from the same per-turn breakdown using
Sonnet 4.5 pricing ($3/$15 per Mtok input/output, $3.75 cache write,
$0.30 cache read), so the cost columns are directly comparable.

Live results (N=2)
------------------
Functional: pi 10/10 ✔, ours 12/12 ✔ (T6 ours-only adds 2).
Wall time: ours wins all 5 shared tasks (~30% faster on average).
Tool calls: ours fewer or equal on every shared task.
Output tokens: ours fewer on every shared task.
Cost: ours is ~6× more expensive per task even with caching on.

Why the cost gap survives caching
---------------------------------
Per-turn "input new" tokens:
  pi (T1): 35      ours (T1): 28298
  pi (T4): 76      ours (T4): 48630

Pi only sends the new turn payload (~35 tokens); ~the entire system
prefix and tool defs are cache-hits. Ours sends ~28k tokens of new
input per turn even with caching on, which strongly suggests our
cache prefix isn't covering tool definitions, OR the breakpoint
position isn't matching across turns. Worth a dedicated PR to tune
the cache breakpoints in `addCacheControl` / `AgentContext.systemRunnable`
so tool defs sit inside the cached prefix; this will close the cost
gap considerably.

The harness made this gap impossible to miss — the entire reason for
the cross-comparison.

* feat(cache): tool-prefix cache breakpoint + harness accounting fix

Two related changes; second one was the bigger win.

1. Tool-prefix cache breakpoint (Anthropic)
-------------------------------------------
New `partitionAndMarkAnthropicToolCache` helper at
src/messages/anthropicToolCache.ts. When `clientOptions.promptCache
=== true` and the provider is Anthropic, we now:

  - Stable-partition the bound tool list into [static, deferred] using
    `defer_loading` from `agentContext.toolDefinitions`.
  - Stamp `cache_control: { type: 'ephemeral' }` on the LAST static
    tool via the `extras` field LangChain's Anthropic adapter
    forwards (`AnthropicToolExtrasSchema`).

This way, the cached prefix covers exactly the static tool inventory.
Discovered deferred tools that arrive across turns sit *after* the
breakpoint and don't invalidate the prefix.

The wrap is non-mutating (clones with `Object.create(getProto…)` so
the original tool reference is untouched) and idempotent. 10 unit
tests in `src/messages/__tests__/anthropicToolCache.test.ts` cover
partition order, stamp behaviour, prototype preservation, extras
merging, and idempotency.

Wired in `Graph.ts` right after `resolveLocalToolsForBinding`, gated
on Anthropic + `promptCache: true`.

In practice the impact is smaller than I expected because
`addCacheControl` on the last user message ALREADY creates a cache
breakpoint that covers the tools+system prefix as part of the
cumulative cached prefix. The tool-level breakpoint becomes load-
bearing in cases where:
  - A turn arrives without a fresh user message (resume after tool
    call only),
  - The user message would otherwise blow the cache breakpoint, or
  - Cache budget needs to be split across multiple breakpoints to fit
    inside Anthropic's 4-breakpoint cap.

Still correct; still worth shipping; just not a 6× win on its own.

2. Harness accounting fix (the actual ~6× swing)
------------------------------------------------
`compare:pi`'s previous run reported ours' input tokens as ~28k–48k
per turn vs pi's ~35–80, suggesting a 6× cost gap that survived
caching. That number was wrong.

LangChain's Anthropic adapter at
src/llm/anthropic/utils/message_outputs.ts:31 reports
`usage_metadata.input_tokens` as the SUM of uncached input +
cache_creation + cache_read — i.e., the total prompt size, NOT the
uncached portion. Pi reports `usage.input` as uncached only. So our
"input new" column was double-counting cached tokens.

The harness now subtracts `cache_read + cache_creation` from
`input_tokens` when computing our "input new", so the column is
apples-to-apples with pi's `input` field.

Recomputed live results (N=2, caching on, real Anthropic):

  task                pi cost     ours cost   delta
  T1 simple-edit      $0.0158     $0.0171     +8%
  T2 fuzzy-edit       $0.0157     $0.0173     +10%
  T3 syntax-fix       $0.0210     $0.0248     +18%
  T4 type-error-loop  $0.0325     $0.0292     -10%   ← we win
  T5 multi-file       $0.0264     $0.0312     +18%
  T6 image (ours-only) —          $0.0158

Functional: pi 10/10 ✔, ours 12/12 ✔.
Wall time: ours wins T3/T4/T5 (most-complex tasks), pi wins T1/T2.
Tool calls: ours wins or ties every shared task.
Output tokens: ours wins every shared task.
Cost: pi ~10–20% cheaper on simpler tasks, ours wins the most
complex one (T4).

We're effectively at parity. The "we're 6× more expensive" claim
posted earlier was the harness error.

* feat(local-engine): workspace + exec seams + HITL on direct path

Three changes that fit together — first two answer the "what about a
second engine" + "what's the workspace" questions; the third was the
last gap blocking workspace-policy `'ask'` from working end-to-end.

1. WorkspaceFS exec seam (engine reusability)
---------------------------------------------
New `WorkspaceFS` interface in src/tools/local/workspaceFS.ts (read,
write, stat, readdir, mkdir, realpath, unlink, open) with a
`nodeWorkspaceFS` default. Every file-touching factory in the local
coding suite (read_file, write_file, edit_file, grep_search,
glob_search, list_directory, file checkpointer, binary detection
helper) now routes through this interface.

A future engine — stateful remote sandbox, in-memory test FS, ssh
jail — supplies its own `WorkspaceFS` and inherits every tool's
behavior (fuzzy-match edit, BOM/EOL preservation, syntax check,
checkpointer, attachment classifier) for free. Same story for
process spawning: the existing `local.spawn` is now nested under
`local.exec.spawn` with the legacy top-level field still honoured.

The seams are wired through three new resolvers:
  - `getWorkspaceRoots`  — canonical root + additionalRoots
  - `getReadRoots`/`getWriteRoots` — split allow-outside knobs
  - `getWorkspaceFS`     — `local.exec.fs` ?? nodeWorkspaceFS
  - `getSpawn`           — `local.exec.spawn` ?? local.spawn ?? Node

Plus 7 unit tests in src/tools/__tests__/workspaceSeam.test.ts that
verify additionalRoots, the new allowReadOutside/allowWriteOutside
split, legacy `cwd` + `allowOutsideWorkspace` back-compat, and that
a custom WorkspaceFS actually receives the file tool calls.

2. First-class workspace boundary + policy hook
-----------------------------------------------
Nested `local.workspace` config:
  workspace: {
    root: string,
    additionalRoots?: readonly string[],   // monorepo: extend boundary
    allowReadOutside?: boolean,            // split from write
    allowWriteOutside?: boolean,
  }
Back-compat: top-level `cwd` and `allowOutsideWorkspace` still work
and map to `workspace.root` / both `allowOutside*` flags.

`resolveWorkspacePathSafe(path, config, intent)` now takes a `'read'`
| `'write'` intent so reads and writes can have different allow-
outside semantics. Default callers pass `'write'` for stricter
clamping; read tools (read_file, grep_search, glob_search,
list_directory) explicitly pass `'read'`.

New `createWorkspacePolicyHook` in src/hooks/createWorkspacePolicyHook.ts
returns a `PreToolUse` callback that:
  - inspects each tool call's input via per-tool path extractors
    (defaults cover the local-engine coding suite; hosts can
    override / extend),
  - returns `allow` for in-workspace paths (incl. additionalRoots),
  - returns `'ask'` (default) / `'allow'` / `'deny'` for outside
    paths, with separate read/write policies.

This is exactly the user-visible "ask once / always" semantic
claude-code and opencode have, but built on the existing
PreToolUse + HITL machinery instead of a parallel permissions API.
14 unit tests in src/hooks/__tests__/createWorkspacePolicyHook.test.ts.

3. HITL interrupt() on the direct path (the missing piece)
----------------------------------------------------------
The previous direct-path hook commit fail-closed `'ask'` decisions
because the interrupt machinery only existed in the event-dispatch
path. The workspace policy hook above wants to use `'ask'` for
real, so this commit lifts the gap.

`runDirectToolWithLifecycleHooks` now:
  - When `humanInTheLoop.enabled === true` and PreToolUse returns
    `'ask'`: builds a single-tool `tool_approval` payload and raises
    a real LangGraph `interrupt()` (anchored to the node's
    RunnableConfig the same way `dispatchToolEvents` does — ToolNode
    disables LangSmith tracing so the AsyncLocalStorage frame must
    be re-established).
  - On resume:
      - `approve` runs the tool with the (possibly hook-rewritten) args
      - `reject` blocks via the new `blockDirectCall` helper
      - `respond` returns the host-supplied `responseText` as a
        synthetic success ToolMessage (no tool execution)
      - `edit` re-runs the tool with the host-edited args
  - When HITL is off: collapses to a fail-closed deny (matches the
    rest of the SDK's HITL-disabled default). One-time warning
    logged so hosts notice the gap.
  - `allowedDecisions` enforcement matches the event path — a
    decision type outside the policy's allowlist is fail-closed.

`blockDirectCall` is the shared helper that synthesises the Blocked
ToolMessage and fires `PermissionDenied`, used by every deny path
(deny decision, fail-closed ask, reject, allowedDecisions violation,
malformed respond).

Updated the existing fail-closed-on-ask test in
directToolHooks.test.ts to assert the new HITL-disabled behaviour
explicitly.

Live verification
-----------------
New `npm run local:workspace` script:
  - workspace.root = temp dir
  - additionalRoots = sibling temp dir
  - createWorkspacePolicyHook with outsideRead/outsideWrite='ask'
  - humanInTheLoop.enabled = true
  - asks the model to read 3 files: in-workspace, sibling, outside

Live result against real Anthropic Sonnet:
  ✔ inside.txt → workspace policy 'allow' → read INSIDE
  ✔ sibling.txt → 'allow' (additionalRoots) → read SIBLING
  ✔ outside.txt → 'ask' → INTERRUPT raised with action_request
                          carrying the path → script auto-approves
                          → tool runs → read SECRET
  Final: HITL interrupts handled = 1, successful tool messages = 3

The model didn't even need its "retry if blocked" instruction — HITL
handled the approval transparently.

Functional + non-API tests: 1245 passing (was 1224; +21 new across
workspace seam + workspace policy hook + the updated direct-tool
ask test).

* feat(common): canonical tool-name constants for local coding tools

LibreChat's `getToolIconType` (client/src/components/Chat/Messages/
Content/ToolOutput/ToolIcon.tsx) special-renders four of our tools
by exact name match: `bash_tool`, `read_file`, `execute_code`,
`run_tools_with_code`. We already use those names, so they pick up
the right icon end-to-end.

The newer local-engine tools (`write_file`, `edit_file`,
`grep_search`, `glob_search`, `list_directory`, `compile_check`)
were defined as per-file `Local*ToolName` consts — string literals
that would silently drift if anyone renamed them in only one place,
and that no consumer UI could discover programmatically.

Promotes them to `Constants.*` (single source of truth) and adds
two grouped lists alongside the workspace types so consumers can
match against canonical strings:

  - `Constants.WRITE_FILE`     ('write_file')
  - `Constants.EDIT_FILE`      ('edit_file')
  - `Constants.GREP_SEARCH`    ('grep_search')
  - `Constants.GLOB_SEARCH`    ('glob_search')
  - `Constants.LIST_DIRECTORY` ('list_directory')
  - `Constants.COMPILE_CHECK`  ('compile_check')
  - `LOCAL_CODING_TOOL_NAMES`  — the 7 local-only tools (the 6 above
                                 plus READ_FILE)
  - `LOCAL_CODING_BUNDLE_NAMES` — the local-only set + the bash/code/
                                  PTC pair the bundle wraps around
                                  the existing factories

The per-file `Local*ToolName` and `CompileCheckToolName` exports are
retained as back-compat aliases pointing at the canonical Constants
so existing consumers keep working.

Wired through:

- `createWorkspacePolicyHook` default path extractors now key off
  `Constants.*` (was string literals).
- `LocalCodingTools.ts` Local*ToolName aliases collapsed into
  `Constants.*` references.
- `CompileCheckTool.ts` likewise.

5 new unit tests in localToolNames.test.ts pin:
  - per-file aliases match Constants
  - canonical strings match the wire-level expectations LibreChat
    (and other consumer UIs) special-case
  - LOCAL_CODING_BUNDLE_NAMES matches what `createLocalCodingTools`
    actually emits
  - LOCAL_CODING_TOOL_NAMES matches what
    `createLocalCodingToolDefinitions` emits
  - bundle list is a strict superset of the local-only list

Net effect: when LibreChat (or any other consumer UI) eventually
adds icons for write_file / edit_file / grep_search / glob_search /
list_directory / compile_check, the wiring picks up automatically
without an SDK change. And renames from one side without the other
get caught at test time.

Tests: 1250 passing (was 1245; +5 new).

* fix(local): three Codex P2 review nits — bash args, rg cache, root-relative extras

Three real issues Codex flagged on the PR. Each is small but
behavior-affecting; tests pin each.

#1 — `executeLocalCode` dropped `args` when lang was bash
--------------------------------------------------------
For every other runtime (py, js, ts, php, go, rs, c, cpp, java, r,
d, f90), `getRuntimeCommand` appends `input.args` as positional
parameters. The `lang === 'bash'` short-circuit was calling
`executeLocalBash(input.code, config)` and silently discarding
`args`, so `{lang:'bash', code:'echo $1', args:['hi']}` printed an
empty string instead of `hi`.

Fix: when lang is bash AND args is non-empty, route through a new
`executeLocalBashWithArgs` helper that uses the standard
`bash -lc <code> -- <args...>` form so `$1`, `$2`, … resolve. Same
AST validation as the no-args path. The plain `executeLocalBash`
(used by the `bash_tool` factory itself) is unchanged.

Test: `codex review fixes › executeLocalCode bash args › passes
input.args as positional shell parameters when lang is bash`.

#2 — ripgrep availability cache was process-global
--------------------------------------------------
With per-Run `local.exec.spawn` backends (the seam added for the
future remote-engine), a single `let cachedRgAvailable` would let
Run A's "rg works" verdict poison Run B whose backend (e.g. a
remote sandbox without rg) doesn't have it. Run B would skip the
probe, try to spawn rg, throw, and never reach the Node fallback.

Fix: cache is now a `WeakMap<LocalSpawn, Promise<boolean>>` keyed
on the effective backend (whatever `getSpawn(config)` returns).
WeakMap lets disposed backends GC their entry; the test reset hook
re-creates the map.

Test: `codex review fixes › ripgrep cache backend scope › does not
bleed an "rg available" verdict from one backend to another`.

#3 — `additionalRoots` resolved against process.cwd, not workspace root
----------------------------------------------------------------------
With `workspace: { root: '/repo/app', additionalRoots: ['../shared'] }`,
the intended boundary is `/repo/shared`. The previous
`path.resolve(extra)` anchored to the process cwd, producing
`${process.cwd()}/../shared` — completely different on a server
with a different cwd, and could deny the legitimate sibling or
allow an unrelated directory.

Fix: `getWorkspaceRoots` and `createWorkspacePolicyHook` now both
do `isAbsolute(extra) ? resolve(extra) : resolve(root, extra)`.
Behaviour matches what users would expect from a "monorepo
sibling" config.

Test: `codex review fixes › additionalRoots resolved against
workspace root › treats relative additionalRoots as siblings of
root, not of process.cwd`.

Tests: 1254 passing (was 1250; +4 new). Lint + tsc clean.

* docs(hitl): unstale the "event-driven only" caveats; pin resume scope

Three JSDoc blocks were honest at PR #134 ship time but went stale
once `6122460` (hooks fire on the legacy non-event-driven path) and
`caed7a2` (HITL interrupt() lifted into the direct path) landed:

  - `HumanInTheLoopConfig` — "Scope: event-driven tools only" header
    plus the bullet declaring direct-path tools "bypass the hook
    system entirely. PreToolUse hooks do not fire for those tools and
    HITL approval does not gate them." Both halves are now false.
  - `ToolNodeOptions.hookRegistry` — "Only fires for event-driven
    tool calls; tools routed through directToolNames bypass hook
    dispatch entirely." Now false.
  - `ToolNodeOptions.humanInTheLoop` — "the interrupt path is only
    wired into the event-driven dispatch (dispatchToolEvents), not
    into directToolNames execution — direct tools bypass HITL
    entirely." Now false.

Also re-evaluated the "mixed direct + event batches re-execute the
direct half on resume" caveat. New tests in
`directToolHITLResumeScope.test.ts` pin the actual scope:

  - When a single tool interrupts via the direct path, its body
    runs **once** total. The first pass interrupted before reaching
    runTool; the resume pass ran the body after the host's decision
    was applied. The PreToolUse hook fires twice (the documented
    idempotency caveat).
  - When a sibling tool already ran in the same batch before the
    interrupting tool, the sibling's body runs **twice** — once on
    each pass. This applies symmetrically to event-dispatched and
    direct siblings; LangGraph rolls back the entire ToolNode
    invocation, not just the direct half.

Updated wording reflects both: HITL spans both paths uniformly, and
the side-effect warning applies to "any tool with side effects in a
batch where another tool interrupts" rather than singling out direct
tools. Hosts that want a tool to skip HITL must omit it from any
registered matcher rather than relying on the path it takes.

Tests: 1256 passing (was 1254; +2 new for resume-scope pinning).

* fix(local): two more Codex review nits — output OOM cap + bash_tool args

#4 P1 — `spawnLocalProcess` could OOM the host on noisy commands
----------------------------------------------------------------
Previous implementation accumulated every stdout/stderr chunk into
in-memory strings and only truncated post-`close`. A noisy command
(`yes`, `cat /dev/urandom | base64`, a verbose build) could stream
gigabytes before the cap kicked in — production-stability risk
since local tools execute arbitrary shell.

Fix: stream-time cap with three layers, all in `spawnLocalProcess`:

  1. Per-stream in-memory buffer capped at `maxOutputChars * 2`.
     Past that, chunks divert to a lazy temp file (the spill).
  2. The spill file is seeded with whatever was already buffered so
     it holds the FULL output (head from memory + tail from stream),
     not just the post-cap remainder.
  3. New `local.maxSpawnedBytes` config (default 50 MiB) hard-kills
     the process tree once total stdout+stderr crosses the cap.
     Independent from `maxOutputChars` (which only affects what the
     model sees) — this is the OOM backstop.

`finish()` now waits for the spill stream to flush before resolving,
so the model never sees `full_output_path: …` for a still-being-
written file.

Tests pinned: `codex review fixes (round 2) › streaming output cap`:
  - `yes` gets killed in ~200ms when capped at 64 KiB (vs the 30s
    timeout that would otherwise apply)
  - 200 KiB output with an 8 KB inline cap successfully spills; the
    file holds more than the in-memory truncation
  - small outputs do NOT create a spill file (no perf regression)

#5 P2 — `bash_tool` factory broke positional shell parameters
-------------------------------------------------------------
Same shape as the previous `executeLocalCode` bash-args fix
(`9b5fb85`), but in the `bash_tool` factory itself. It was
appending args literally to the command string:

    `${command} ${args.map(shellQuote).join(' ')}`

So `command: 'echo "$1"'` with `args: ['hi']` ran
`echo "$1" hi` — `$1` was empty, `hi` got appended literally.
Tool schema advertised args as execution arguments; the actual
behavior didn't match.

Fix: route through the `executeLocalBashWithArgs` helper (now
exported from LocalExecutionEngine) which uses
`bash -lc <command> -- <args...>` so `$1`, `$2`, … resolve
correctly. Falls back to the no-args `executeLocalBash` when args
is empty (no behavior change for that case).

Test: `bash_tool args › populates positional shell parameters from
input.args` (and a second case for missing args).

Tests: 1261 passing (was 1256; +5 new).

* fix(local): three more Codex nits — config.shell, probe cache scope, grep -e

#6 P1 — validateBashCommand hard-coded DEFAULT_SHELL
----------------------------------------------------
The `bash -n -c <command>` syntax preflight was hard-coded to
DEFAULT_SHELL ('bash' on POSIX, 'bash.exe' on Windows), so a Run
configured with `local.shell: '/bin/sh'` (or pointing at zsh, dash,
or any host where bash isn't available) would get its commands
syntax-validated against a different binary than the one that would
actually execute them. Valid commands could be rejected before
execution; commands relying on bashisms could be falsely accepted
on hosts without bash.

Fix: route the syntax preflight through `config.shell ?? DEFAULT_SHELL`
so validation uses the same binary the actual execution path uses.

Test: `codex review fixes (round 3) › validateBashCommand honours
configured shell › routes the -n preflight through local.shell when
set`. Intercepts the spawn calls and asserts the syntax check used
`/bin/sh`, not the DEFAULT_SHELL fallback.

#7 P2 — syntax-check probe cache was process-global
---------------------------------------------------
Same shape as the ripgrep-cache fix in `9b5fb85`: per-Run
`local.exec.spawn` backends mean a "node available" / "python
available" / "bash available" verdict from one backend's probe
could poison subsequent Runs whose backend (e.g. a remote sandbox
that has python but not node) doesn't have those tools.

Fix: cache is now `WeakMap<LocalSpawn, ProbeCache>` keyed on the
effective spawn backend (whatever `getSpawn(config)` returns).
Disposed backends GC their entry; the test reset hook re-creates the
map. Mirrors the ripgrep cache exactly.

Test: `codex review fixes (round 3) › syntax-check probe cache is
backend-keyed › does not bleed an "rg/node/python available" verdict
from one backend to another`.

#8 P2 — grep pattern parsed as flag when dash-prefixed
------------------------------------------------------
The `grep_search` tool sent `input.pattern` as a positional arg to
`rg`. A pattern like `-foo` got parsed as an unknown flag and rg
bailed out instead of searching for it. `rg --help` explicitly
requires `-e/--regexp` (or `--`) for dash-prefixed patterns.

Fix: pass the pattern via `-e <pattern>`. Same trick also defends
against any future flag-conflict if a user query happens to look like
an rg long option. The Node fallback path doesn't need this — JS
RegExp accepts dash-prefixed patterns natively.

Test: `codex review fixes (round 3) › grep passes pattern via -e ›
handles dash-prefixed patterns without rg interpreting them as flags`.

Tests: 1264 passing (was 1261; +3 new).

* fix(local): two more P1 Codex nits — quoted destructive + symlink-escape

#9 P1 — dangerous-command gate missed quoted targets
----------------------------------------------------
`validateBashCommand` checks `dangerousCommandPatterns` against
`normalized` — the post-`stripQuotedContent` form. That pass blanks
the contents of every quoted span, so destructive targets that the
LLM (accidentally or not) wraps in quotes — `rm -rf "/"`,
`rm -rf "$HOME"`, `chmod -R 777 "/"` — slip past the regex even
though they execute identically to the bare form. Real safety
regression on the default-on guard.

Fix: split the destructive check into two passes.

  1. The existing `dangerousCommandPatterns` continue to run against
     `normalized`. Same false-positive defence (`echo "rm -rf /"`
     stays valid because the destructive text is wrapped by the
     OUTER quote pair, not by quotes around the path argument).
  2. New `quotedDestructivePatterns` run against the ORIGINAL
     command string. Each pattern explicitly REQUIRES a matching
     quote pair around the destructive path (`"/"`, `"$HOME"`,
     `'/'`, etc.). `echo "rm -rf /"` doesn't match — the `/` isn't
     surrounded by quotes there, only the whole `rm -rf /` string is.

Tests: `codex review fixes (round 4) › quoted destructive targets`
covers `"/"`, `"$HOME"`, `'/'`, `chmod -R 777 "/"`, no-regression
on bare forms, and the `echo "rm -rf /"` no-false-positive case.

#10 P1 — workspace policy hook compared lexical paths only
----------------------------------------------------------
`createWorkspacePolicyHook` did `isAbsolute(p) ? resolve(p) :
resolve(root, p)` and called it done. A symlink inside the workspace
pointing outside (e.g. `workspace/escape → /etc/secret`) lexically
appears under `workspace/` and got auto-allowed — even though the
agent reading through it touches a path the policy meant to gate.

Critical when the hook is the primary gate: the documented "ask the
user" setup turns `allowReadOutside: true` so the file tools' own
clamp is off, leaving the hook as the sole defence.

Fix: the hook now realpaths both the candidate and the workspace
roots before comparing. Roots are pre-realpath'd once at construction
(memoized via a Promise so the per-call cost is one realpath) and
candidate paths get the same `realpathOfPathOrAncestor` walk
`resolveWorkspacePathSafe` already uses — so paths that don't yet
exist (e.g. `write_file` to a brand-new path) still get checked
against their nearest existing ancestor's realpath.

Two test cases pin both halves:
  - `workspace/escape → extra/secret.txt` — hook DENIES read of
    `escape` even though it's lexically inside.
  - `extra/alt-mount → workspace/` — hook ALLOWS read of
    `extra/alt-mount/file.ts` because the realpath lands inside the
    workspace root (covers the alternate-mount case so we don't
    over-correct).

Tests: 1272 passing (was 1264; +8 across both fixes).

* fix(local): treat maxSpawnedBytes=0 as unlimited (Codex P2 #11)

The public LocalExecutionConfig contract documents `maxSpawnedBytes: 0`
as "no cap", but the kill check `totalSpawnedBytes > hardKillBytes`
triggered on the very first byte when the cap was zero. Skip the kill
path entirely when `hardKillBytes <= 0` so configs that explicitly opt
out of the OOM backstop behave as documented.

* chore(scripts): forward ON_RUN_STEP in pi-vs-ours harness

Without an ON_RUN_STEP handler the aggregator's stepMap is empty when
ON_RUN_STEP_COMPLETED arrives, so every tool call logged "No run step
or runId found for completed step event" to stderr. Mirror the pattern
used in src/scripts/code_exec.ts (and friends): register both events.
Harness-only — no SDK behavior change.

* fix(local): ESM-safe spill + surface rg failures in glob_search (Codex P1 #12, P2 #13)

Two issues from the latest review pass:

1) `spawnLocalProcess`'s spill path used `require('fs')` inside an
   ESM-shipped module. Fine in CJS test runs, throws
   `ReferenceError: require is not defined` for any ESM consumer that
   triggers the overflow path. Switch to a static
   `import { createWriteStream } from 'fs'` so both build outputs work.

2) `createLocalGlobSearchTool` ignored ripgrep's exit code and stderr,
   so `rg --files <missing-target>` (exit 2) silently mapped to "No
   files found." — the agent then treated a tooling failure as a real
   absence of matches. Now we check `result.exitCode > 1` (and
   `timedOut`) and return an explicit `glob_search failed: <stderr>`
   string + an artifact carrying the error.

Tests pinned: `codex review fixes (round 5) › spill path is ESM-safe`
and `… › glob_search surfaces ripgrep failures`. The glob test uses an
injected spawn backend so it runs deterministically on hosts without
ripgrep installed.

* fix(local): sandbox loopback bridge + additionalRoots writes (Codex P1 #14, P2 #15)

Two issues with how `buildSandboxRuntimeConfig` lays out the
SandboxRuntimeConfig:

1) The local programmatic-tool bridge listens on 127.0.0.1, but the
   sandbox default `allowedDomains: []` denies all outbound network —
   so `run_tools_with_code` / `run_tools_with_bash` fail under sandbox
   even though they work unsandboxed. Seed allowedDomains with
   `127.0.0.1`, `localhost`, `::1` (skip any host the user explicitly
   listed in `deniedDomains`, and dedupe against user-supplied
   `allowedDomains`).

2) The sandbox `allowWrite` allowlist was seeded with `cwd` only, so
   `workspace.additionalRoots` paths were resolvable by file tools but
   blocked for sandboxed shell/code. Now seed from `getWorkspaceRoots`
   so both surfaces share the same boundary.

Exported `buildSandboxRuntimeConfig` so the round-6 tests can drive it
directly without standing up the real native sandbox runtime. Test
count: 1283 passing (was 1276). Lint baseline unchanged.

* fix(direct-path): HITL edit reads updatedInput; PostToolUse updates registry (Codex P1 #16, P2 #17)

Two bugs in `runDirectToolWithLifecycleHooks`:

1) The `edit` decision branch read `decision.args`, but the documented
   `ToolApprovalDecision` shape uses `updatedInput`. Hosts following
   the public type signature were silently ignored — the tool ran
   with the original (un-edited) arguments. Now mirrors the
   event-driven path's validation: read `updatedInput`, fail closed
   on missing/wrong-shaped payloads (returns an error ToolMessage
   instead of executing with undefined args).

2) When a `PostToolUse` hook returns `updatedOutput`, the direct
   path returned a new ToolMessage with the replacement content but
   never updated the `toolOutputReferences` registry. Subsequent
   `{{tool<i>turn<n>}}` substitutions then delivered the stale
   pre-hook bytes while the model observed the post-hook
   replacement. Now reads `_refKey`/`_refScope` from the message
   metadata (already stamped by `recordOutputReference`) and calls
   `toolOutputRegistry.set` with the replacement.

Tests: `direct-path HITL: resume scope > edit decision (Codex P1 #16)`
covers both the happy path (edited args reach the body) and the
fail-closed branch. `ToolNode tool output references > PostToolUse
updatedOutput updates the registry (Codex P2 #17)` chains two calls
where the second references the first via {{tool0turn0}} and asserts
the substituted value matches the post-hook content.

* fix: snapshot direct-batch args + curl-first bash bridge (Codex P1 #18, P2 #19)

P1 #18 (ToolNode.ts): direct-path tools resolved {{tool…turn…}}
placeholders against the LIVE registry inside `runTool`, which runs
AFTER awaiting PreToolUse hooks. In a `Promise.all`-driven batch a
slow hook on call A could let call B finish first and register its
output, then A's late re-resolve would substitute B's output into A's
args — order-dependent leakage that violates same-turn isolation.
Snapshot the registry once per batch in `run()` (mirrors the
event-driven path), thread it through `RunToolBatchContext`, and have
both `runDirectToolWithLifecycleHooks` and `runTool` resolve against
the snapshot when present.

P2 #19 (LocalProgrammaticToolCalling.ts): the bash bridge required
`python3` on PATH, breaking `run_tools_with_bash` on minimal
containers and Windows hosts without python3 installed. Bridge now
serves a `?mode=text` variant that returns the already-serialized
result body so bash callers can use `curl` (universally available)
without pulling in a JSON parser. Bash helper tries curl first, falls
back to python3 for environments without curl, errors helpfully if
neither is available.

Tests pinned: `ToolNode tool output references › direct-batch
snapshot isolation (Codex P1 #18)` (slow PreToolUse hook + sibling
finishes-first scenario), and `ProgrammaticToolCalling › bash bridge
script does not require python3 (Codex P2 #19)` (asserts both helper
branches present in the generated script with curl preferred).

* fix(local): block `--` end-of-options + gate compile_check overrides (Codex P1 #20, P1 #21)

P1 #20: the destructive-command guard required the target path to
follow option flags directly, so `rm -rf -- "/"` (and similarly
`chmod -R 777 -- "/"`) bypassed the check via the GNU/BSD
end-of-options marker. Both `dangerousCommandPatterns` and
`quotedDestructivePatterns` now allow an optional `(?:--\\s+)?`
between the flags and the destructive target.

P1 #21: `compile_check` ran the resolved command through `bash -lc`
without going through `validateBashCommand`, so a host-supplied
`command` override (or `compileCheck.command` config entry) could
execute mutating/destructive commands even with `readOnly: true` or
the dangerous-command guard normally in force. Now routes through
`validateBashCommand(detection.command, config)` first; on failure
returns an explanatory ToolMessage with `ran: false` rather than
spawning the process.

Tests pinned: `codex review fixes (round 6) › destructive guard
handles `--` end-of-options` (5 cases including a benign-`--`
no-regression check) and `… › compile_check enforces
validateBashCommand + readOnly` (rm -rf, touch under readOnly,
benign echo). Lint baseline unchanged.

* chore: audit-pass cleanups + missing test coverage (manual review #1, #3, #5, #7, #9, #10, #11)

Comprehensive-review audit findings, fixed in one pass:

#1 (attachments full-file MIME sniff): switch classifyAttachment to
   open() + 12-byte header read for sniffing; only readFile() the
   full buffer when we're about to embed. A 9 MB image now allocates
   12 bytes for sniffing instead of 9 MB.

#3 (fallback walker SKIP_DIRS too small): expanded the Node-fallback
   walker's skip set to cover dist/build/.next/.cache/__pycache__/
   .venv/venv/target/vendor/coverage/.tox/.mypy_cache and friends.
   ripgrep itself respects .gitignore so this only affects the
   fallback.

#5 (dynamic fs/promises import): replaced two `await import('fs/
   promises').then(...)` calls in CompileCheckTool with a static
   `import { readFile, stat } from 'fs/promises'` per AGENTS.md.

#7 (test-only resets in public API): added @internal JSDoc to all
   three `_reset*ForTests` functions so IDE autocomplete signals
   they aren't part of the SDK surface. (The leading underscore was
   convention-only; @internal is the standard tag.)

#9 (legacy fields missing @deprecated): added @deprecated tags to
   `LocalExecutionConfig.cwd` and `allowOutsideWorkspace` pointing
   to their workspace.* replacements.

#10 (dead lexicallyInside compute in workspace policy hook): dropped
   the unused variable. Realpath is the source of truth for both
   the symlink-escape and alternate-mount cases; the lexical check
   provides no information so we don't pay for it.

#11 (voided unused imports in pi comparison script): removed the
   `void readdir; void cp;` workaround and the dead imports they
   were silencing.

Bonus: also addressed the comprehensive review's #2 (bitwise OR) by
verifying it was a misread — the code uses `||`, not `|`. Manual
review's reading was wrong; no fix needed.

Larger findings addressed:

#4 (no editStrategies tests): added
   `src/tools/local/__tests__/editStrategies.test.ts` with 12 cases
   covering exact / line-trimmed / whitespace-normalized /
   indentation-flexible match boundaries, start/end-of-file matches,
   unicode content, multi-line needles spanning blank lines,
   ambiguous-match rejection, and applyEdit semantics.

#12 (no FileCheckpointer tests): added
   `src/tools/local/__tests__/FileCheckpointer.test.ts` with 6
   cases covering capture+rewind happy path, idempotent capture,
   absent-file rewind (deletion), multi-file rewind, oversize-file
   tracking-but-not-snapshotting, and rewind across a deleted
   parent directory.

Performance:

#8 (lineWindow split-whole-file): replaced `content.split('\\n')`
   in `lineWindow` with a newline-walk that's O(start + limit)
   instead of O(file). For a 10 MB file with `offset: 1, limit: 10`
   this drops from "split the whole file into millions of strings"
   to ~10 indexOf calls. Falls back to the simple split when no
   limit is supplied.

Deferred:

#6 (spill files never cleaned) needs a Run-lifecycle hook to run
   cleanup; tracked separately rather than fixed inline.

* fix: comprehensive review (round 7) — bridge hooks, sandbox latch, nested-shell, regex DoS, attachments via WorkspaceFS, expose checkpointer

Codex P2 (sandbox warning latch): syntax preflight, ripgrep-availability,
and node/python/bash probes used to spawn through the public path that
fires `maybeWarnSandboxOff`, both emitting a misleading "sandbox is off"
warning when the run had `sandbox.enabled: true` AND latching
`sandboxOffWarned = true` so a later genuinely-unsandboxed execution
silently skipped its warning. Added an opt-in `{ internal: true }`
options arg to spawnLocalProcess that suppresses both. Threaded it
through every internal probe.

Manual #A (P1): the in-process programmatic-tool bridge invoked inner
tools via `executeTools` directly, bypassing every PreToolUse hook the
host registered. ToolNode now plumbs a `hookContext` (registry +
runId/threadId/agentId) into the programmatic-tool factory; the bridge
runs PreToolUse before each inner tool.invoke. `deny` and `ask` (HITL
not reachable from inside an HTTP handler) fail closed; `updatedInput`
is applied to the inner tool args. 4 tests pinned.

Manual #B (P1): already fixed in c6e2632 (round 9 — compile_check
routes through validateBashCommand). Reviewer was at 75a54b3, predates
that commit.

Manual #C (P1): `bash -lc "rm -rf $HOME"` (and friends) bypassed the
destructive-command guard because `stripQuotedContent` blanks the
nested payload before the patterns run. Added a third pattern list
`nestedShellDestructivePatterns` that runs against the original
command and matches destructive ops (`rm -rf`, `chmod -R 777`,
`chown -R`) inside `<shell> -[l]?c "..."` and `eval "..."` payloads.
4 tests including a benign-nested-echo no-regression check.

Manual #D (P2): `fallbackGrep` compiled the model-supplied pattern via
`new RegExp` and ran it across every file, opening the door to
catastrophic-backtracking DoS (`(a+)+$`-style patterns). Added a
guardrail layer: 1024-char pattern length cap, nested-quantifier
heuristic rejection, 5-second wall-clock budget for the overall
search. Surfaced as a structured `FallbackGrepError` -> tool result
with engine: 'node-fallback' + the kind tag. Hosts that need
bulletproof regex safety should install ripgrep (RE2-based, no
backtracking). 2 tests pinned.

Manual #E (P2): `local.fileCheckpointing: true` in RunConfig was a
silent no-op in the auto-bind path — `createLocalCodingTools()` made
a checkpointer that was immediately discarded. Now the auto-bind path
uses `createLocalCodingToolBundle()` when checkpointing is on,
returns the checkpointer through `ResolveLocalToolsResult`, and
ToolNode stashes it + exposes via `getFileCheckpointer()`. 2 tests
pinned.

Manual #F (P3): `classifyAttachment` used host `fs/promises` directly
instead of the configured `WorkspaceFS`. A custom/remote engine could
fail to embed valid attachments OR accidentally read a host path with
the same absolute name. Added optional `fs` arg to classifyAttachment
threaded from `read_file` through to its `WorkspaceFS`. Backward
compatible — defaults to host fs/promises when no fs is supplied.

Tests touched: 768 passing across all suites (was 750ish). Lint
baseline unchanged.

* chore: address audit-of-audit findings (4/4 valid)

Follow-up audit on commit ebf8b6a flagged 4 nits/minors. All valid:

F1 (NIT): the lineWindow perf fix introduced exactly the same
  `void <var>` pattern that finding #10 had me remove from
  createWorkspacePolicyHook. Embarrassing — dropped the dead
  `line` variable + `line++` + `void line`.

F2 (MINOR): the editStrategies test commit message claimed coverage
  of "ambiguous-match rejection" but no test actually exercised the
  multi-match → null path. Added the missing case for the exact
  strategy. Noted in a comment that the looser strategies in the
  chain may still resolve duplicates unambiguously by falling
  through; whether that's correct is a separate design call.

F3 (NIT): the new FileCheckpointer.test.ts had a `import('fs/
  promises')` inside a catch block when `mkdir` was already
  available via the static import at the top. Static-imported
  `mkdir` and restructured the test to mkdir-then-writeFile
  unconditionally (the catch-and-retry pattern was unnecessary
  defensive plumbing).

F4 (NIT): the unused-imports cleanup missed `sum`/`void sum` in
  the comparison harness. Removed the function and its void.

* fix: expose file checkpointer through Run/Graph (audit follow-up)

Round-7 finding E only fixed half the problem: ToolNode got
`getFileCheckpointer()` so direct `new ToolNode(...)` users could
reach it, but the normal `Run.create(...)` path constructed
ToolNodes inline inside `StandardGraph.initializeTools` and dropped
the reference. So `RunConfig.toolExecution.local.fileCheckpointing:
true` was still a no-op for Run callers — and the public JSDoc on
`fileCheckpointing` referenced a `Run.rewindFiles()` that didn't
exist.

Now:

- Graph adds `getOrCreateFileCheckpointer()` — lazily constructs ONE
  per-Run checkpointer when local fileCheckpointing is on. Cleared
  in `clearHeavyState()` along with the other per-Run state.

- `StandardGraph.initializeTools` threads it into BOTH ToolNode
  constructions (the agentContext branch and the traditional-tools
  branch). Multi-agent graphs end up sharing a single snapshot
  store, so `rewind()` reaches writes from any agent.

- ToolNode accepts `fileCheckpointer?` in its constructor and
  passes it to `resolveLocalExecutionTools`, which forwards into
  `createLocalCodingToolBundle({ checkpointer })`. Caller-provided
  wins over the auto-created one.

- Run gains `getFileCheckpointer()` and `rewindFiles()` — the
  documented APIs. `rewindFiles()` returns 0 when checkpointing is
  disabled (no-op rather than throwing).

- `LocalExecutionConfig.fileCheckpointing` JSDoc updated to
  reference the now-real APIs.

Tests pinned: `… › fileCheckpointer reachable through
Run.getFileChec…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants