fix(node): prevent stdout double-write when streamOutput=true in executeJar() #399

Open
kuishou68 wants to merge 2 commits into opendataloader-project:main from kuishou68:fix/issue-398-stdout-double-write

Conversation

@kuishou68 kuishou68 commented Apr 9, 2026

Summary

Fixes #398

Problem

In node/opendataloader-pdf/src/index.ts, the executeJar() function had a bug where output was written to process.stdout twice when the quiet option was not set.

How the double-write happened

  1. executeJar() is called with streamOutput: !options.quiet
  2. When streamOutput=true, each data chunk is immediately written to process.stdout AND accumulated in the stdout string
  3. The function resolves with the accumulated stdout string
  4. cli.ts receives the resolved value and writes it to process.stdout again

Result: every line of output appears twice for all CLI users who don't pass --quiet.
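
The four steps above can be modeled without spawning a Java process. The following is a minimal sketch, assuming an in-memory array stands in for process.stdout; the names executeJarModel, cliModel, and Sink are hypothetical and are not taken from the repository.

```typescript
// Hypothetical in-memory model of the double-write; not the actual
// opendataloader-pdf code. A Sink stands in for process.stdout.
type Sink = string[];

// Models the buggy executeJar(): streams each chunk when streamOutput is
// true and also accumulates it unconditionally.
function executeJarModel(chunks: string[], streamOutput: boolean, sink: Sink): string {
  let stdout = "";
  for (const chunk of chunks) {
    if (streamOutput) {
      sink.push(chunk); // first write, in real time
    }
    stdout += chunk; // accumulated even when already streamed
  }
  return stdout;
}

// Models cli.ts: passes streamOutput = !quiet, then writes the resolved value.
function cliModel(chunks: string[], quiet: boolean): string {
  const sink: Sink = [];
  const result = executeJarModel(chunks, !quiet, sink);
  sink.push(result); // second write of the same content
  return sink.join("");
}

const javaOutput = ["page 1 done\n", "page 2 done\n"];
const withoutQuiet = cliModel(javaOutput, false); // each line appears twice
const withQuiet = cliModel(javaOutput, true); // each line appears once
```

In this model the duplication only shows up when quiet is false, which matches the report that users passing --quiet were unaffected.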

Fix

When streamOutput=true, do not accumulate chunks in the stdout variable. Instead, return an empty string from the promise (since the output was already streamed in real-time). This prevents the CLI layer from re-writing already-displayed content.

// Before (buggy)
javaProcess.stdout.on('data', (data) => {
  const chunk = data.toString();
  if (streamOutput) {
    process.stdout.write(chunk);
  }
  stdout += chunk;  // ← accumulated even when already streamed
});

javaProcess.on('close', (code) => {
  if (code === 0) {
    resolve(stdout);  // ← returned and written again by cli.ts
  }
  ...
});

// After (fixed)
javaProcess.stdout.on('data', (data) => {
  const chunk = data.toString();
  if (streamOutput) {
    process.stdout.write(chunk);
  } else {
    stdout += chunk;  // ← only accumulate when NOT streaming
  }
});

javaProcess.on('close', (code) => {
  if (code === 0) {
    resolve(streamOutput ? '' : stdout);  // ← empty when already streamed
  }
  ...
});

Impact

All CLI users who do not pass --quiet previously saw every line of conversion output duplicated. This fix ensures output is written exactly once.

Summary by CodeRabbit

  • Refactor
    • Improved streaming output handling to prevent duplicate console output and reduce resource use.
    • Error messages now include recent program output when available, providing clearer context on failures.

When streamOutput=true, executeJar() writes each data chunk to process.stdout
in real-time AND accumulates it in the stdout string. The caller (cli.ts) then
writes the same accumulated string to process.stdout again, causing every line
to be printed twice.

Fix: when streamOutput=true, do not accumulate chunks in the stdout variable;
instead, resolve with an empty string since the output was already streamed.
This prevents the CLI layer from re-writing the already-displayed output.

Closes opendataloader-project#398

CLAassistant commented Apr 9, 2026

CLA assistant check
All committers have signed the CLA.

coderabbitai Bot commented Apr 9, 2026

Walkthrough

This change fixes stdout duplication in the PDF converter CLI by having executeJar() return an empty string when streamOutput is true (output was already written to process.stdout), and returning accumulated stdout only when streamOutput is false.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Stdout Output Fix (node/opendataloader-pdf/src/index.ts) | executeJar() now streams stdout chunks to process.stdout when streamOutput is true, keeps a bounded tail for error messages, and resolves with '' on success when streaming to avoid double-writing; resolves with full accumulated stdout when not streaming. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main fix: preventing stdout double-write when streamOutput=true in the executeJar() function, which is exactly what the PR accomplishes. |
| Linked Issues check | ✅ Passed | The PR fulfills the core requirement from #398: when streamOutput=true, executeJar() returns an empty string instead of accumulated stdout, preventing CLI double-write. The implementation also includes a bounded tail for error diagnostics. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the stated objective of fixing issue #398. The modification to executeJar() behavior and the addition of streamedStdoutTail for error messages are necessary and related to the linked issue requirements. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@node/opendataloader-pdf/src/index.ts`:
- Around line 45-47: When streaming mode is enabled the existing code skips
accumulating stdout (stdout stays empty) which can make the error body empty on
non-zero exits; change the logic so that while you still do not return
accumulated stdout on success, you keep a bounded tail buffer (e.g., stdoutTail)
by appending chunks but trimming to a max size (e.g., 8KB) at the same spot
where stdout was previously appended (the line with "stdout += chunk"). Then,
when constructing the error body (the place that currently uses "stderr ||
stdout" at exit handling), use "stderr || stdoutTail" so Java errors written to
stdout are preserved for diagnostics. Ensure the tail buffer is size-limited and
only used for failure diagnostics; do not change success return behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 969e8499-0b1f-4339-9e2a-86d67b712823

📥 Commits

Reviewing files that changed from the base of the PR and between e134531 and d6cded2.

📒 Files selected for processing (1)
  • node/opendataloader-pdf/src/index.ts

Comment thread node/opendataloader-pdf/src/index.ts
…am mode

When streamOutput=true, the previous fix left stdout empty, which means
on non-zero exit codes the error body (stderr || stdout) would be empty
when Java reports errors via stdout.

Add a streamedStdoutTail buffer (capped at 64 KB) that records the last
portion of streamed stdout chunks. On failure, use this tail instead of
the empty stdout to preserve Java error messages for diagnostics.

Success behavior is unchanged: still resolves with '' when streaming.
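
A bounded tail buffer of the kind this commit message describes might look like the sketch below. The actual streamedStdoutTail code is not shown on this page, so the helper names appendTail, errorBody, and TAIL_LIMIT are hypothetical, and the cap is assumed to be counted in string length (UTF-16 code units) rather than bytes.

```typescript
// Hypothetical helpers; not the PR's actual implementation.
// Assumption: the 64 KB cap is applied to string length, not byte length.
const TAIL_LIMIT = 64 * 1024;

// Appends a chunk and keeps only the last `max` characters.
// Note: slicing by length can split a surrogate pair at the cut point,
// which is usually acceptable for diagnostics.
function appendTail(tail: string, chunk: string, max: number = TAIL_LIMIT): string {
  const combined = tail + chunk;
  return combined.length > max ? combined.slice(combined.length - max) : combined;
}

// On a non-zero exit, prefer stderr and fall back to the streamed tail.
function errorBody(stderr: string, stdoutTail: string): string {
  return stderr || stdoutTail;
}

// Small demonstration with a cap of 5 characters.
let tail = "";
for (const chunk of ["abc", "def", "ghij"]) {
  tail = appendTail(tail, chunk, 5);
}
// tail is now "fghij": only the most recent 5 characters are retained.
```

Keeping only the tail bounds memory for long conversions while still preserving the final lines, which is where a failing process typically reports its error.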

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@node/opendataloader-pdf/src/index.ts`:
- Around line 67-69: The convert() implementation currently uses streamOutput
derived from options.quiet to decide its resolved string (resolve(streamOutput ?
'' : stdout)), which causes library callers to get an empty string when quiet is
undefined; instead, restore convert() to always resolve the actual stdout string
(use stdout in the Promise resolution) and move the CLI suppression into the CLI
caller by passing an explicit flag (e.g., cliSuppressOutput or streamOutput
explicitly set by the CLI) or by having the CLI ignore the resolved string when
it streams to stdout; update the other similar resolution sites (the other
resolve calls referenced around the 110-112 area) to follow the same approach so
library behavior remains Promise<string> with real output while the CLI handles
suppression separately.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c0822178-4c9b-4696-9d34-0102adca1ec9

📥 Commits

Reviewing files that changed from the base of the PR and between d6cded2 and 7e750d2.

📒 Files selected for processing (1)
  • node/opendataloader-pdf/src/index.ts

Comment on lines +67 to +69
// When streamOutput=true, content was already written to process.stdout in real-time.
// Return empty string to prevent callers from double-writing the same output.
resolve(streamOutput ? '' : stdout);

@coderabbitai coderabbitai Bot Apr 9, 2026

⚠️ Potential issue | 🟠 Major

This changes convert()'s default success result to ''.

convert() passes streamOutput: !options.quiet, and quiet defaults to undefined, so the normal success path now resolves an empty string for successful conversions. That fixes the CLI duplication, but it also changes the public Promise<string> behavior for non-quiet/default library callers. Please keep the CLI suppression separate from the library return value.

Also applies to: 110-112
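
To make the concern concrete, here is a minimal sketch of what a library caller would observe under the changed behavior. It models only the resolved value, synchronously; resolvedValueModel and ConvertOptionsModel are hypothetical names, and the real convert() returns a Promise<string>.

```typescript
// Hypothetical model of the value convert() would resolve with after this
// change; the real function is asynchronous and spawns a Java process.
interface ConvertOptionsModel {
  quiet?: boolean;
}

function resolvedValueModel(javaStdout: string, options: ConvertOptionsModel = {}): string {
  // quiet defaults to undefined, so streamOutput is true by default.
  const streamOutput = !options.quiet;
  // Mirrors resolve(streamOutput ? '' : stdout) from the diff.
  return streamOutput ? "" : javaStdout;
}

const byDefault = resolvedValueModel("converted: a.pdf\n");
const whenQuiet = resolvedValueModel("converted: a.pdf\n", { quiet: true });
// byDefault is "" even though the conversion produced output;
// whenQuiet still carries the full string.
```

A caller that parses the resolved string would therefore need to pass quiet: true to keep receiving it, which is the contract change the comment objects to.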

@kuishou68 (Author) replied:

Good catch — you are right that tying streamOutput to options.quiet changes the public Promise<string> contract for library callers. The fix should separate CLI stdout suppression from the library return value so convert() always resolves with the actual stdout string, and the CLI handles suppression on its own side. Will address this.

@coderabbitai (Contributor) replied:

@kuishou68, glad the analysis is helpful! Here's a concrete approach for the fix:

Approach: always accumulate + always return real stdout; CLI skips re-printing

In executeJar, accumulate stdout unconditionally (even when streaming), and always resolve with the real string:

javaProcess.stdout.on('data', (data) => {
  const chunk = data.toString();
  stdout += chunk; // always accumulate for callers
  if (streamOutput) {
    process.stdout.write(chunk); // stream in real-time for CLI
  }
});

javaProcess.on('close', (code) => {
  if (code === 0) {
    resolve(stdout); // always return real content — callers see the full string
  } else {
    const errorOutput = stderr || stdout;
    reject(new Error(`The opendataloader-pdf CLI exited with code ${code}.\n\n${errorOutput}`));
  }
});

Then in cli.ts, suppress the redundant write when already streaming:

// cli.ts (rough sketch)
const result = await convert(inputPaths, options);
if (options.quiet) {
  // streaming was off, nothing was written yet — print now
  process.stdout.write(result);
}
// else: output was already streamed in real-time; skip re-printing

This keeps the convert() contract clean (Promise<string> always resolves with the actual output), and the double-write prevention is owned entirely by the CLI layer where it belongs. The streamedStdoutTail and MAX_STDOUT_TAIL machinery from the latest commit can also be removed since we're accumulating fully anyway.
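
The property this approach aims for (output printed exactly once, with the full string always returned) can be checked with a small in-memory model. This is a sketch under the same assumptions as the rough cli.ts sketch above; executeJarAlwaysAccumulate, cliSkipReprint, and OutputSink are hypothetical names, and an array stands in for process.stdout.

```typescript
// Hypothetical model of the proposed approach; not the repository's code.
type OutputSink = string[];

// executeJar side: always accumulate, stream only when asked.
function executeJarAlwaysAccumulate(
  chunks: string[],
  streamOutput: boolean,
  sink: OutputSink
): string {
  let stdout = "";
  for (const chunk of chunks) {
    stdout += chunk; // always accumulate for callers
    if (streamOutput) {
      sink.push(chunk); // real-time streaming for the CLI
    }
  }
  return stdout; // callers always see the full string
}

// CLI side: print the result only when nothing was streamed.
function cliSkipReprint(chunks: string[], quiet: boolean): { printed: string; result: string } {
  const sink: OutputSink = [];
  const result = executeJarAlwaysAccumulate(chunks, !quiet, sink);
  if (quiet) {
    sink.push(result); // streaming was off, so print now
  }
  return { printed: sink.join(""), result };
}

const sample = ["page 1 done\n", "page 2 done\n"];
const streamed = cliSkipReprint(sample, false);
const deferred = cliSkipReprint(sample, true);
```

In both modes the model prints the text once and returns the same full string, so the double-write prevention lives entirely in the CLI function.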


🧠 Learnings used
Learnt from: the-waste-land
Repo: opendataloader-project/opendataloader-pdf PR: 409
File: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java:339-351
Timestamp: 2026-04-11T14:50:33.654Z
Learning: In `ImagesUtils.java` (Java, opendataloader-pdf), the `drawImage()` method in `PageImageCollector` intentionally stores full transformed bounds without applying the clipping path. The PR author (the-waste-land) explicitly deferred clipping-aware bound intersection to a follow-up, as it requires representative clipped-image PDF samples and targeted tests. Do not flag this as a bug in PRs that are scoped only to the native-crash fix (direct PDFBox image-object extraction path).

@bundolee (Contributor) commented:

Here's how the review process will go from here:

  1. CodeRabbit — all review threads resolved with a fix commit
  2. CI — all checks pass (codecov excluded)
  3. Component owner approval

Development

Successfully merging this pull request may close these issues.

fix(node): stdout output is written twice when quiet option is not set — streamOutput in executeJar conflicts with CLI output layer

3 participants