Skip to content

🐛 refactor: anchor code-generated file lookup on threadFileIds for branched conversations#13004

Merged
danny-avila merged 2 commits intodevfrom
fix/code-generated-files-thread-by-fileid
May 8, 2026
Merged

🐛 refactor: anchor code-generated file lookup on threadFileIds for branched conversations#13004
danny-avila merged 2 commits intodevfrom
fix/code-generated-files-thread-by-fileid

Conversation

@danny-avila
Copy link
Copy Markdown
Owner

Summary

In a branched conversation (regenerations producing the same code-output filename), getCodeGeneratedFiles silently excluded files whose File-record messageId lived on a sibling branch. User-visible symptom: turn 2 asks the agent to modify the prior turn's output and gets FileNotFoundError — LC sent _injected_files: [] to codeapi instead of priming the prior turn's file.

Root cause

claimCodeFile is keyed by (filename, conversationId, context) — not by messageId. When sibling A first creates output.csv, the File record persists with messageId = A. When sibling N (regeneration of A's parent) recreates output.csv, the claim finds A's record and processCodeOutput deliberately preserves messageId = A to keep file → original-creator provenance intact (correct for the linear case).

Turn N+1's parentMessageId = N. getThreadData walks back from N: thread = [N, root] — sibling A is NOT in it. The pre-fix query filtered by messageId IN [N, root], so the file was excluded.

Fix

getUserCodeFiles (the user-uploaded counterpart) has always filtered by file_id IN threadFileIds — the file_ids referenced by messages.files[] during the thread walk. The asymmetry between the two functions (user-uploaded anchored on the message's reference, code-generated anchored on the File's own creator) was the bug.

Anchoring both on threadFileIds reaches the right files regardless of which sibling first generated them. File.messageId stays informational for provenance; processCodeOutput's preservation logic stays as-is. Only the thread-scoped lookup key changes.

Files

  • packages/data-schemas/src/methods/file.ts: signature + filter change. JSDoc spells out the branched-conversation rationale.
  • packages/api/src/agents/initialize.ts: pass threadFileIds instead of threadMessageIds. The dead threadMessageIds local removed since its only consumer is gone.
  • packages/data-schemas/src/methods/file.spec.ts: 5 new cases including the explicit regression — file's creator messageId is on a sibling branch but file_id is in threadFileIds → finds it.

Independent of #12960

Applies cleanly on top of dev. When LC #12960 (the typed CodeEnvRef cutover) lands, the only conflict is the metadata key (metadata.fileIdentifiermetadata.codeEnvRef) — same line, trivial resolve.

Test plan

  • `cd packages/data-schemas && npx jest src/methods/file.spec` — 42/42 pass (5 new regression cases)
  • `cd packages/api && npx jest src/agents` — 722/722 pass (2 pre-existing summarization e2e failures unrelated)
  • `cd api && npx jest server/services/Files server/controllers/agents` — 432/432 pass
  • `npx tsc --noEmit -p packages/api/tsconfig.json` — clean
  • Manual: branched conversation reproducer — generate a file in turn 1, regenerate the parent (sibling), then in turn N+1 ask the agent to read the file. Pre-fix: `FileNotFoundError`. Post-fix: the file is primed and `load_workbook` succeeds.

Copilot AI review requested due to automatic review settings May 7, 2026 22:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a bug where code-generated files could fail to be re-injected into the execute_code sandbox for branched (regenerated) conversations when the File record’s messageId belonged to a sibling branch.

Changes:

  • Update getCodeGeneratedFiles to scope by file_id IN threadFileIds (thread message references) instead of messageId IN threadMessageIds (File creator).
  • Update agent initialization to pass threadFileIds and remove the now-dead threadMessageIds plumbing.
  • Add regression tests covering sibling-branch creator messageId scenarios and boundary conditions.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
packages/data-schemas/src/methods/file.ts Changes execute_code file lookup to be anchored on thread-referenced file_ids to work across sibling branches.
packages/data-schemas/src/methods/file.spec.ts Adds new test coverage for the branched-conversation regression and related constraints.
packages/api/src/agents/initialize.ts Updates thread traversal consumption to pass threadFileIds into code-generated file lookup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +407 to +408
/* `tool_resources.execute_code.file_ids` is the source of
* threadFileIds, but `messages.files[]` includes files of
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

GitNexus: 🚀 deployed

The LibreChat-pr-13004 index is now live on the MCP server.
Deploy run

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

GitNexus: 🚀 deployed

The LibreChat-pr-13004 index is now live on the MCP server.
Deploy run

@danny-avila
Copy link
Copy Markdown
Owner Author

@codex review

@danny-avila danny-avila changed the title 🐛 fix: anchor code-generated file lookup on threadFileIds for branched conversations 🐛 refactor: anchor code-generated file lookup on threadFileIds for branched conversations May 8, 2026
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Bravo.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…geIds

In a branched conversation (regenerations producing the same code-output
filename), `getCodeGeneratedFiles` would silently exclude files whose
File-record `messageId` lived on a sibling branch. The user-visible
symptom: "the previous file isn't persisted" — the LLM tries
`load_workbook("output.xlsx")` on turn 2 and gets `FileNotFoundError`
because LC sent `_injected_files: []` to codeapi instead of priming
the prior turn's output.

`claimCodeFile` is keyed by `(filename, conversationId, context)` —
not by messageId. When sibling A first creates `output.csv`, the File
record persists with `messageId = A`. When sibling N (a regeneration
of A's parent) recreates `output.csv`, the claim finds A's record and
`processCodeOutput` deliberately preserves `messageId = A` to keep
file→original-creator provenance intact (correct behavior for the
linear case where the original creator is in-thread).

Turn N+1's `parentMessageId = N`. `getThreadData` walks back from N:
the thread is `[N, root]` — sibling A is NOT in it. The pre-fix query
filtered by `messageId IN [N, root]`, so the file was excluded.

`getCodeGeneratedFiles` already lives next to `getUserCodeFiles`,
which has always filtered by `file_id IN threadFileIds` (the file_ids
referenced by `messages.files[]` arrays during the thread walk). The
asymmetry — user-uploaded files anchored on the message's reference,
code-generated files anchored on the File's own creator — was the
bug. Anchoring both functions on `threadFileIds` reaches the right
files regardless of which sibling first generated them.

`File.messageId` stays informational ("who first generated this") for
provenance and `processCodeOutput`'s "preserve original messageId on
update" logic stays as-is — only the lookup key for thread-scoped
fetches changes.

- `packages/data-schemas/src/methods/file.ts`: signature + filter
  change. JSDoc spells out the branched-conversation rationale.
- `packages/api/src/agents/initialize.ts`: pass `threadFileIds` instead
  of `threadMessageIds`. The local `threadMessageIds` declaration is
  removed since the only consumer is gone.
- `packages/data-schemas/src/methods/file.spec.ts`: 5 new cases:
  - basic happy-path (file referenced by current thread)
  - **the regression**: file's creator messageId is on a sibling
    branch but file_id is in threadFileIds → finds it
  - empty/missing threadFileIds returns []
  - cross-conversation isolation
  - non-execute_code context filter still applies (a chat attachment
    won't be returned even if its file_id is in threadFileIds —
    that's `getUserCodeFiles`'s job)

Applies cleanly on top of dev. When LC #12960 (the typed CodeEnvRef
cutover) lands, the only conflict is the legacy `metadata.fileIdentifier`
metadata key flipping to `metadata.codeEnvRef` — same line, trivial
resolve.

- [x] `cd packages/data-schemas && npx jest src/methods/file.spec` —
  42/42 pass (including the 5 new regression cases)
- [x] `cd packages/api && npx jest src/agents` — 722/722 pass
  (modulo 2 pre-existing summarization e2e failures unrelated)
- [x] `cd api && npx jest server/services/Files server/controllers/agents` —
  432/432 pass
- [x] `npx tsc --noEmit -p packages/api/tsconfig.json` — clean
- [ ] Manual: branched conversation reproducer — generate a file in
  turn 1, regenerate the parent (sibling), then in turn N+1 ask the
  agent to read the file. Pre-fix: `FileNotFoundError`. Post-fix:
  the file is primed and load_workbook succeeds.
Integration-level regression test asserting initializeAgent passes
`threadFileIds` (not `threadMessageIds`) to getCodeGeneratedFiles
in branched-conversation scenarios. Locks in the API shape from the
previous commit, sitting one layer above the data-schemas unit test —
so a future refactor to the priming chain can't silently revert to
the messageId-based filter without surfacing a test failure here.

Two cases:
- The full call shape: agent.tools=['execute_code'], resendFiles=true,
  threadData mock returns distinct messageIds and fileIds. Asserts the
  call uses fileIds, and that getUserCodeFiles uses the same array
  (the symmetric design that closes the sibling-branch hole).
- Empty threadFileIds: getCodeGeneratedFiles is still called with []
  (its own internal early-return handles the empty case); getUserCodeFiles
  is gated at the call site and stays unscheduled.
@danny-avila danny-avila force-pushed the fix/code-generated-files-thread-by-fileid branch from 3cc44fd to 4814b1b Compare May 8, 2026 14:22
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

GitNexus: 🚀 deployed

The LibreChat-pr-13004 index is now live on the MCP server.
Deploy run

@danny-avila danny-avila merged commit e1d998f into dev May 8, 2026
11 checks passed
@danny-avila danny-avila deleted the fix/code-generated-files-thread-by-fileid branch May 8, 2026 14:39
danny-avila added a commit that referenced this pull request May 8, 2026
…lookup

Code-execution outputs land on `messages.attachments` (set by
`processCodeOutput`), while user uploads land on `messages.files`.
The threadFileIds switch (#13004) walked only `files`, so on a
single linear thread:

  Turn 1: assistant produces sample.xlsx → attachment with codeEnvRef
  Turn 2: user says "add 2 rows"
          → primeCodeFiles: file_ids=0 resourceFiles=0
          → /exec sent files=[]
          → sandbox: FileNotFoundError: 'sample.xlsx'

The `getThreadData` walk found zero file_ids because the assistant's
codeEnvRef was on `attachments`, not `files`. Compounded by the
DB select string `'messageId parentMessageId files'` which didn't
pull `attachments` into memory in the first place — so even fixing
the walk in isolation wouldn't have surfaced them.

Both layers fixed:
  - `ThreadMessage` type adds `attachments?: Array<{ file_id?: string }>`
  - `getThreadData` walks both arrays, dedups via the same Set
  - `initialize.ts` selects `'messageId parentMessageId files attachments'`

## Test plan

`packages/api/src/utils/message.spec.ts` (+6 cases):
- collects file_ids from `attachments`
- walks both `files` and `attachments` on the same message
- regression: linear thread with code-output attachments across
  user→assistant→user→assistant produces the right file_ids
- dedupes shared ids that appear in both arrays
- skips attachments without file_id (mirrors `files` behavior)
- empty `attachments` array

`packages/api/src/agents/__tests__/initialize.test.ts` (+1 case):
- locks the DB select string includes `attachments` alongside
  `files` / `messageId` / `parentMessageId`

- [x] `npx jest src/utils/message.spec.ts` — 39/39 pass
- [x] `npx jest src/agents/__tests__/initialize.test.ts` — 33/33 pass
- [x] lint clean on all four touched files
danny-avila added a commit that referenced this pull request May 8, 2026
Code-execution outputs land on `messages.attachments` (set by
`processCodeOutput`), while user uploads land on `messages.files`.
The threadFileIds switch (#13004) walked only `files`, so on a
single linear thread:

  Turn 1: assistant produces sample.xlsx → attachment with codeEnvRef
  Turn 2: user says "add 2 rows"
          → primeCodeFiles: file_ids=0 resourceFiles=0
          → /exec sent files=[]
          → sandbox: FileNotFoundError: 'sample.xlsx'

The `getThreadData` walk found zero file_ids because the assistant's
codeEnvRef was on `attachments`, not `files`. Compounded by the
DB select string `'messageId parentMessageId files'` which didn't
pull `attachments` into memory in the first place — so even fixing
the walk in isolation wouldn't have surfaced them.

Both layers fixed:
  - `ThreadMessage` type adds `attachments?: Array<{ file_id?: string }>`
  - `getThreadData` walks both arrays, dedups via the same Set
  - `initialize.ts` selects `'messageId parentMessageId files attachments'`

## Test plan

`packages/api/src/utils/message.spec.ts` (+6 cases):
- collects file_ids from `attachments`
- walks both `files` and `attachments` on the same message
- regression: linear thread with code-output attachments across
  user→assistant→user→assistant produces the right file_ids
- dedupes shared ids that appear in both arrays
- skips attachments without file_id (mirrors `files` behavior)
- empty `attachments` array

`packages/api/src/agents/__tests__/initialize.test.ts` (+1 case):
- locks the DB select string includes `attachments` alongside
  `files` / `messageId` / `parentMessageId`

- [x] `npx jest src/utils/message.spec.ts` — 39/39 pass
- [x] `npx jest src/agents/__tests__/initialize.test.ts` — 33/33 pass
- [x] lint clean on all four touched files
danny-avila added a commit that referenced this pull request May 8, 2026
…anched conversations (#13004)

* 🐛 fix: anchor getCodeGeneratedFiles on threadFileIds, not threadMessageIds

In a branched conversation (regenerations producing the same code-output
filename), `getCodeGeneratedFiles` would silently exclude files whose
File-record `messageId` lived on a sibling branch. The user-visible
symptom: "the previous file isn't persisted" — the LLM tries
`load_workbook("output.xlsx")` on turn 2 and gets `FileNotFoundError`
because LC sent `_injected_files: []` to codeapi instead of priming
the prior turn's output.

`claimCodeFile` is keyed by `(filename, conversationId, context)` —
not by messageId. When sibling A first creates `output.csv`, the File
record persists with `messageId = A`. When sibling N (a regeneration
of A's parent) recreates `output.csv`, the claim finds A's record and
`processCodeOutput` deliberately preserves `messageId = A` to keep
file→original-creator provenance intact (correct behavior for the
linear case where the original creator is in-thread).

Turn N+1's `parentMessageId = N`. `getThreadData` walks back from N:
the thread is `[N, root]` — sibling A is NOT in it. The pre-fix query
filtered by `messageId IN [N, root]`, so the file was excluded.

`getCodeGeneratedFiles` already lives next to `getUserCodeFiles`,
which has always filtered by `file_id IN threadFileIds` (the file_ids
referenced by `messages.files[]` arrays during the thread walk). The
asymmetry — user-uploaded files anchored on the message's reference,
code-generated files anchored on the File's own creator — was the
bug. Anchoring both functions on `threadFileIds` reaches the right
files regardless of which sibling first generated them.

`File.messageId` stays informational ("who first generated this") for
provenance and `processCodeOutput`'s "preserve original messageId on
update" logic stays as-is — only the lookup key for thread-scoped
fetches changes.

- `packages/data-schemas/src/methods/file.ts`: signature + filter
  change. JSDoc spells out the branched-conversation rationale.
- `packages/api/src/agents/initialize.ts`: pass `threadFileIds` instead
  of `threadMessageIds`. The local `threadMessageIds` declaration is
  removed since the only consumer is gone.
- `packages/data-schemas/src/methods/file.spec.ts`: 5 new cases:
  - basic happy-path (file referenced by current thread)
  - **the regression**: file's creator messageId is on a sibling
    branch but file_id is in threadFileIds → finds it
  - empty/missing threadFileIds returns []
  - cross-conversation isolation
  - non-execute_code context filter still applies (a chat attachment
    won't be returned even if its file_id is in threadFileIds —
    that's `getUserCodeFiles`'s job)

Applies cleanly on top of dev. When LC #12960 (the typed CodeEnvRef
cutover) lands, the only conflict is the legacy `metadata.fileIdentifier`
metadata key flipping to `metadata.codeEnvRef` — same line, trivial
resolve.

- [x] `cd packages/data-schemas && npx jest src/methods/file.spec` —
  42/42 pass (including the 5 new regression cases)
- [x] `cd packages/api && npx jest src/agents` — 722/722 pass
  (modulo 2 pre-existing summarization e2e failures unrelated)
- [x] `cd api && npx jest server/services/Files server/controllers/agents` —
  432/432 pass
- [x] `npx tsc --noEmit -p packages/api/tsconfig.json` — clean
- [ ] Manual: branched conversation reproducer — generate a file in
  turn 1, regenerate the parent (sibling), then in turn N+1 ask the
  agent to read the file. Pre-fix: `FileNotFoundError`. Post-fix:
  the file is primed and load_workbook succeeds.

* 🧪 test: lock initialize.ts → getCodeGeneratedFiles call shape

Integration-level regression test asserting initializeAgent passes
`threadFileIds` (not `threadMessageIds`) to getCodeGeneratedFiles
in branched-conversation scenarios. Locks in the API shape from the
previous commit, sitting one layer above the data-schemas unit test —
so a future refactor to the priming chain can't silently revert to
the messageId-based filter without surfacing a test failure here.

Two cases:
- The full call shape: agent.tools=['execute_code'], resendFiles=true,
  threadData mock returns distinct messageIds and fileIds. Asserts the
  call uses fileIds, and that getUserCodeFiles uses the same array
  (the symmetric design that closes the sibling-branch hole).
- Empty threadFileIds: getCodeGeneratedFiles is still called with []
  (its own internal early-return handles the empty case); getUserCodeFiles
  is gated at the call site and stays unscheduled.
danny-avila added a commit that referenced this pull request May 8, 2026
Code-execution outputs land on `messages.attachments` (set by
`processCodeOutput`), while user uploads land on `messages.files`.
The threadFileIds switch (#13004) walked only `files`, so on a
single linear thread:

  Turn 1: assistant produces sample.xlsx → attachment with codeEnvRef
  Turn 2: user says "add 2 rows"
          → primeCodeFiles: file_ids=0 resourceFiles=0
          → /exec sent files=[]
          → sandbox: FileNotFoundError: 'sample.xlsx'

The `getThreadData` walk found zero file_ids because the assistant's
codeEnvRef was on `attachments`, not `files`. Compounded by the
DB select string `'messageId parentMessageId files'` which didn't
pull `attachments` into memory in the first place — so even fixing
the walk in isolation wouldn't have surfaced them.

Both layers fixed:
  - `ThreadMessage` type adds `attachments?: Array<{ file_id?: string }>`
  - `getThreadData` walks both arrays, dedups via the same Set
  - `initialize.ts` selects `'messageId parentMessageId files attachments'`

## Test plan

`packages/api/src/utils/message.spec.ts` (+6 cases):
- collects file_ids from `attachments`
- walks both `files` and `attachments` on the same message
- regression: linear thread with code-output attachments across
  user→assistant→user→assistant produces the right file_ids
- dedupes shared ids that appear in both arrays
- skips attachments without file_id (mirrors `files` behavior)
- empty `attachments` array

`packages/api/src/agents/__tests__/initialize.test.ts` (+1 case):
- locks the DB select string includes `attachments` alongside
  `files` / `messageId` / `parentMessageId`

- [x] `npx jest src/utils/message.spec.ts` — 39/39 pass
- [x] `npx jest src/agents/__tests__/initialize.test.ts` — 33/33 pass
- [x] lint clean on all four touched files
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants