fix(node): prevent stdout double-write when streamOutput=true in executeJar()#399
fix(node): prevent stdout double-write when streamOutput=true in executeJar()#399kuishou68 wants to merge 2 commits intoopendataloader-project:mainfrom
Conversation
When streamOutput=true, executeJar() writes each data chunk to process.stdout in real-time AND accumulates it in the stdout string. The caller (cli.ts) then writes the same accumulated string to process.stdout again, causing every line to be printed twice. Fix: when streamOutput=true, do not accumulate chunks in the stdout variable; instead, resolve with an empty string since the output was already streamed. This prevents the CLI layer from re-writing the already-displayed output. Closes opendataloader-project#398
WalkthroughThis change fixes stdout duplication in the PDF converter CLI by having Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@node/opendataloader-pdf/src/index.ts`:
- Around line 45-47: When streaming mode is enabled the existing code skips
accumulating stdout (stdout stays empty) which can make the error body empty on
non-zero exits; change the logic so that while you still do not return
accumulated stdout on success, you keep a bounded tail buffer (e.g., stdoutTail)
by appending chunks but trimming to a max size (e.g., 8KB) at the same spot
where stdout was previously appended (the line with "stdout += chunk"). Then,
when constructing the error body (the place that currently uses "stderr ||
stdout" at exit handling), use "stderr || stdoutTail" so Java errors written to
stdout are preserved for diagnostics. Ensure the tail buffer is size-limited and
only used for failure diagnostics; do not change success return behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 969e8499-0b1f-4339-9e2a-86d67b712823
📒 Files selected for processing (1)
node/opendataloader-pdf/src/index.ts
…am mode When streamOutput=true, the previous fix left stdout empty, which means on non-zero exit codes the error body (stderr || stdout) would be empty when Java reports errors via stdout. Add a streamedStdoutTail buffer (capped at 64 KB) that records the last portion of streamed stdout chunks. On failure, use this tail instead of the empty stdout to preserve Java error messages for diagnostics. Success behavior is unchanged: still resolves with '' when streaming.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@node/opendataloader-pdf/src/index.ts`:
- Around line 67-69: The convert() implementation currently uses streamOutput
derived from options.quiet to decide its resolved string (resolve(streamOutput ?
'' : stdout)), which causes library callers to get an empty string when quiet is
undefined; instead, restore convert() to always resolve the actual stdout string
(use stdout in the Promise resolution) and move the CLI suppression into the CLI
caller by passing an explicit flag (e.g., cliSuppressOutput or streamOutput
explicitly set by the CLI) or by having the CLI ignore the resolved string when
it streams to stdout; update the other similar resolution sites (the other
resolve calls referenced around the 110-112 area) to follow the same approach so
library behavior remains Promise<string> with real output while the CLI handles
suppression separately.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c0822178-4c9b-4696-9d34-0102adca1ec9
📒 Files selected for processing (1)
node/opendataloader-pdf/src/index.ts
| // When streamOutput=true, content was already written to process.stdout in real-time. | ||
| // Return empty string to prevent callers from double-writing the same output. | ||
| resolve(streamOutput ? '' : stdout); |
There was a problem hiding this comment.
This changes convert()'s default success result to ''.
convert() passes streamOutput: !options.quiet, and quiet defaults to undefined, so the normal success path now resolves an empty string for successful conversions. That fixes the CLI duplication, but it also changes the public Promise<string> behavior for non-quiet/default library callers. Please keep the CLI suppression separate from the library return value.
Also applies to: 110-112
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@node/opendataloader-pdf/src/index.ts` around lines 67 - 69, The convert()
implementation currently uses streamOutput derived from options.quiet to decide
its resolved string (resolve(streamOutput ? '' : stdout)), which causes library
callers to get an empty string when quiet is undefined; instead, restore
convert() to always resolve the actual stdout string (use stdout in the Promise
resolution) and move the CLI suppression into the CLI caller by passing an
explicit flag (e.g., cliSuppressOutput or streamOutput explicitly set by the
CLI) or by having the CLI ignore the resolved string when it streams to stdout;
update the other similar resolution sites (the other resolve calls referenced
around the 110-112 area) to follow the same approach so library behavior remains
Promise<string> with real output while the CLI handles suppression separately.
There was a problem hiding this comment.
Good catch — you are right that tying streamOutput to options.quiet changes the public Promise<string> contract for library callers. The fix should separate CLI stdout suppression from the library return value so convert() always resolves with the actual stdout string, and the CLI handles suppression on its own side. Will address this.
There was a problem hiding this comment.
@kuishou68, glad the analysis is helpful! Here's a concrete approach for the fix:
Approach: always accumulate + always return real stdout; CLI skips re-printing
In executeJar, accumulate stdout unconditionally (even when streaming), and always resolve with the real string:
javaProcess.stdout.on('data', (data) => {
const chunk = data.toString();
stdout += chunk; // always accumulate for callers
if (streamOutput) {
process.stdout.write(chunk); // stream in real-time for CLI
}
});
javaProcess.on('close', (code) => {
if (code === 0) {
resolve(stdout); // always return real content — callers see the full string
} else {
const errorOutput = stderr || stdout;
reject(new Error(`The opendataloader-pdf CLI exited with code ${code}.\n\n${errorOutput}`));
}
});Then in cli.ts, suppress the redundant write when already streaming:
// cli.ts (rough sketch)
const result = await convert(inputPaths, options);
if (options.quiet) {
// streaming was off, nothing was written yet — print now
process.stdout.write(result);
}
// else: output was already streamed in real-time; skip re-printingThis keeps the convert() contract clean (Promise<string> always resolves with the actual output), and the double-write prevention is owned entirely by the CLI layer where it belongs. The streamedStdoutTail and MAX_STDOUT_TAIL machinery from the latest commit can also be removed since we're accumulating fully anyway.
🧠 Learnings used
Learnt from: the-waste-land
Repo: opendataloader-project/opendataloader-pdf PR: 409
File: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java:339-351
Timestamp: 2026-04-11T14:50:33.654Z
Learning: In `ImagesUtils.java` (Java, opendataloader-pdf), the `drawImage()` method in `PageImageCollector` intentionally stores full transformed bounds without applying the clipping path. The PR author (the-waste-land) explicitly deferred clipping-aware bound intersection to a follow-up, as it requires representative clipped-image PDF samples and targeted tests. Do not flag this as a bug in PRs that are scoped only to the native-crash fix (direct PDFBox image-object extraction path).
|
Here's how the review process will go from here:
|
Summary
Fixes #398
Problem
In
node/opendataloader-pdf/src/index.ts, theexecuteJar()function had a bug where output was written toprocess.stdouttwice when thequietoption was not set.How the double-write happened
executeJar()is called withstreamOutput: !options.quietstreamOutput=true, each data chunk is immediately written toprocess.stdoutAND accumulated in thestdoutstringstdoutstringcli.tsreceives the resolved value and writes it toprocess.stdoutagainResult: every line of output appears twice for all CLI users who don't pass
--quiet.Fix
When
streamOutput=true, do not accumulate chunks in thestdoutvariable. Instead, return an empty string from the promise (since the output was already streamed in real-time). This prevents the CLI layer from re-writing already-displayed content.Impact
All CLI users who do not pass
--quietpreviously saw every line of conversion output duplicated. This fix ensures output is written exactly once.Summary by CodeRabbit