fix(core): Guard event log parsing against unbounded memory growth #28594
Conversation
Add a configurable working-set cap to readLoggedMessagesFromFile so that startup recovery aborts parsing a single event log file when the in-memory message count exceeds the configured threshold. Prevents crash loops on instances with legacy log files containing many unconfirmed messages.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
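The working-set cap can be sketched as follows. This is a simplified illustration, not the actual n8n implementation: the `LogLine` shape, `replayLines` name, and `maxMessages` parameter are assumptions standing in for the real types in `message-event-bus-log-writer.ts`.

```typescript
interface LogLine {
  kind: 'event' | 'confirm';
  id: string;
}

// Sketch of a bounded working set during line-by-line log replay.
// Confirm records prune their matching event, so healthy logs stay
// small; parsing aborts once the set exceeds `maxMessages`.
function replayLines(
  lines: Iterable<LogLine>,
  maxMessages = 10_000,
): { messages: Map<string, LogLine>; aborted: boolean } {
  const messages = new Map<string, LogLine>();
  for (const line of lines) {
    if (line.kind === 'event') {
      messages.set(line.id, line);
    } else {
      // A confirm prunes the matching unconfirmed event from memory.
      messages.delete(line.id);
    }
    if (messages.size > maxMessages) {
      // Pathological legacy log: stop parsing, return partial results.
      return { messages, aborted: true };
    }
  }
  return { messages, aborted: false };
}
```

On a healthy log the paired confirms keep `messages` near-empty, so the guard never fires; only a long run of orphaned events trips it.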
Bundle Report: Changes will increase total bundle size by 1.52 kB (0.0%) ⬆️. This is within the configured threshold ✅
No issues found across 4 files
Architecture diagram
```mermaid
sequenceDiagram
    participant CLI as n8n Startup Process
    participant Bus as MessageEventBusLogWriter
    participant Config as GlobalConfig (Env Vars)
    participant FS as File System (Event Log)
    participant Log as Logger
    Note over CLI,FS: Startup Recovery: Replaying Unsent Events
    CLI->>Bus: readLoggedMessagesFromFile()
    Bus->>Config: NEW: Get N8N_EVENTBUS_LOGWRITER_MAXMESSAGESPERPARSE
    Config-->>Bus: limit (Default: 10,000)
    Bus->>FS: createReadStream(logFileName)
    loop For each line in file
        FS-->>Bus: event/confirm line
        Bus->>Bus: processLoggedLine()
        alt is Event Message
            Bus->>Bus: Add message to in-memory working set
        else is Confirm Message
            Bus->>Bus: Prune matching message from memory
        end
        alt NEW: working set size > limit
            Note right of Bus: Guard triggered (Pathological legacy log)
            Bus->>FS: CHANGED: destroy() stream
            Bus->>Log: NEW: warn("exceeded in-memory messages... aborting")
            Bus-->>CLI: Return partial results (prevents OOM)
        end
    end
    Note over Bus,FS: Healthy Path: confirm records keep working set small
    FS-->>Bus: EOF
    Bus-->>CLI: Return collected messages
```
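The abort path in the diagram, where the stream is destroyed mid-parse and partial results are returned, can be sketched with Node's `readline`. This is a simplified standalone sketch, not the real `MessageEventBusLogWriter` method; `readLogBounded` and its parameters are illustrative names.

```typescript
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// Sketch: stream a log file line by line and destroy the underlying
// stream once the working set exceeds `limit` (illustrative names).
async function readLogBounded(path: string, limit: number): Promise<string[]> {
  const stream = createReadStream(path, { encoding: 'utf8' });
  const rl = createInterface({ input: stream, crlfDelay: Infinity });
  const workingSet: string[] = [];
  for await (const line of rl) {
    workingSet.push(line);
    if (workingSet.length > limit) {
      console.warn(`Event log ${path} exceeded ${limit} in-memory messages during parse, aborting`);
      rl.close();
      stream.destroy(); // stops reading; partial results are returned
      break;
    }
  }
  return workingSet;
}
```

Destroying the stream (rather than just breaking out of the loop) releases the file descriptor immediately, which matters when startup recovery iterates over many log files.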
Performance Comparison
Comparing current → latest master → 14-day baseline. Memory consumption baseline with Starter-plan resources.
docker-stats
Idle baseline with Instance AI module loaded
How to read this table
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/cli/src/eventbus/message-event-bus-writer/message-event-bus-log-writer.ts">
<violation number="1" location="packages/cli/src/eventbus/message-event-bus-writer/message-event-bus-log-writer.ts:209">
P2: Do not swallow stream errors with an empty handler; it hides real log-read failures and makes recovery silently degrade.</violation>
</file>
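The pattern the reviewer asks for, ignoring only the error caused by the intentional `destroy()` while surfacing everything else, might look like the following sketch. The `logger` object is an assumed stand-in for n8n's `Logger`, and `openLogStream` is an illustrative name, not the actual code.

```typescript
import { createReadStream } from 'node:fs';

// Assumed stand-in for n8n's Logger.
const logger = { error: (msg: string) => console.error(msg) };

// Ignore only the error our own destroy() produces; surface real
// read failures (ENOENT, EACCES, EIO, ...) instead of swallowing them.
function openLogStream(logFileName: string) {
  const stream = createReadStream(logFileName);
  stream.on('error', (error) => {
    const code = (error as NodeJS.ErrnoException).code;
    if (code !== 'ERR_STREAM_DESTROYED') {
      logger.error(`Failed to read event log ${logFileName}: ${error.message}`);
    }
  });
  return stream;
}
```

Logging rather than rethrowing keeps a single unreadable log file from failing the whole startup, while still leaving a trace for operators.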
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…test Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…builder test" This reverts commit 67e4ada.
phyllis-noester
left a comment
This doesn't work for rolling updates, in multi-main, or with workers, right? Not your change, but ingesting the logs from a local file in general.
```typescript
try {
	const stream = createReadStream(logFileName);
	stream.on('error', (error) => {
		if ((error as NodeJS.ErrnoException).code !== 'ERR_STREAM_DESTROYED') {
```
so we swallow all errors and never throw?
Yep, errors are logged but not propagated. I think this makes sense: we don't want this recovery process (let alone the reading of a single log file) to hard-fail the whole instance. It's a non-critical recovery AFAIK (for instance, we've allowed ourselves to tamper with those files in the cloud medic system to resolve instance failures).
Yes, you're right, this feature is built for a single-node scope with a persistent disk; each instance would try to recover. I think it's fine though (although probably improvable), as instances will replay the failed events that they initiated (so a main instance would replay a workflow execution failure, which is expected).
Got released with
Summary
Adds a configurable working-set guard to `readLoggedMessagesFromFile` so startup recovery aborts parsing a single event log file when the in-memory message count exceeds `N8N_EVENTBUS_LOGWRITER_MAXMESSAGESPERPARSE` (default 10000). Healthy logs are unaffected because confirm records prune the working set as the file streams; only legacy files full of unconfirmed messages (pre-PR #27334) trip the guard, which prevents the startup crash loop on Starter-plan containers.

How to test: unit tests in `message-event-bus-log-writer.test.ts` cover both the bloat-abort path and the healthy paired-confirm path. To verify manually, seed `~/.n8n/n8nEventLog.log` with >10k orphaned `n8n.workflow.started` lines, restart n8n, and observe the warn `Event log ... exceeded 10000 in-memory messages during parse` instead of an OOM.

Related Linear tickets, Github issues, and Community forum posts
https://linear.app/n8n/issue/IAM-528
Review / Merge checklist
Backport to Beta, Backport to Stable, or Backport to v1 (if the PR is an urgent fix that needs to be backported)
🤖 PR Summary generated by AI