Skip to content
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 6 additions & 2 deletions node/opendataloader-pdf/src/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -42,8 +42,10 @@ function executeJar(args: string[], executionOptions: JarExecutionOptions = {}):
const chunk = data.toString();
if (streamOutput) {
process.stdout.write(chunk);
} else {
// Only accumulate when not streaming to avoid double-write by the caller
stdout += chunk;
Comment thread
coderabbitai[bot] marked this conversation as resolved.
}
stdout += chunk;
});

javaProcess.stderr.on('data', (data) => {
Expand All @@ -56,7 +58,9 @@ function executeJar(args: string[], executionOptions: JarExecutionOptions = {}):

javaProcess.on('close', (code) => {
if (code === 0) {
resolve(stdout);
// When streamOutput=true, content was already written to process.stdout in real-time.
// Return empty string to prevent callers from double-writing the same output.
resolve(streamOutput ? '' : stdout);
Comment on lines +67 to +69
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This changes convert()'s default success result to ''.

convert() passes streamOutput: !options.quiet, and quiet defaults to undefined, so the normal success path now resolves an empty string for successful conversions. That fixes the CLI duplication, but it also changes the public Promise<string> behavior for non-quiet/default library callers. Please keep the CLI suppression separate from the library return value.

Also applies to: 110-112

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node/opendataloader-pdf/src/index.ts` around lines 67 - 69, The convert()
implementation currently uses streamOutput derived from options.quiet to decide
its resolved string (resolve(streamOutput ? '' : stdout)), which causes library
callers to get an empty string when quiet is undefined; instead, restore
convert() to always resolve the actual stdout string (use stdout in the Promise
resolution) and move the CLI suppression into the CLI caller by passing an
explicit flag (e.g., cliSuppressOutput or streamOutput explicitly set by the
CLI) or by having the CLI ignore the resolved string when it streams to stdout;
update the other similar resolution sites (the other resolve calls referenced
around the 110-112 area) to follow the same approach so library behavior remains
Promise<string> with real output while the CLI handles suppression separately.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you are right that tying streamOutput to options.quiet changes the public Promise<string> contract for library callers. The fix should separate CLI stdout suppression from the library return value so convert() always resolves with the actual stdout string, and the CLI handles suppression on its own side. Will address this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kuishou68, glad the analysis is helpful! Here's a concrete approach for the fix:

Approach: always accumulate + always return real stdout; CLI skips re-printing

In executeJar, accumulate stdout unconditionally (even when streaming), and always resolve with the real string:

javaProcess.stdout.on('data', (data) => {
  const chunk = data.toString();
  stdout += chunk; // always accumulate for callers
  if (streamOutput) {
    process.stdout.write(chunk); // stream in real-time for CLI
  }
});

javaProcess.on('close', (code) => {
  if (code === 0) {
    resolve(stdout); // always return real content — callers see the full string
  } else {
    const errorOutput = stderr || stdout;
    reject(new Error(`The opendataloader-pdf CLI exited with code ${code}.\n\n${errorOutput}`));
  }
});

Then in cli.ts, suppress the redundant write when already streaming:

// cli.ts (rough sketch)
const result = await convert(inputPaths, options);
if (options.quiet) {
  // streaming was off, nothing was written yet — print now
  process.stdout.write(result);
}
// else: output was already streamed in real-time; skip re-printing

This keeps the convert() contract clean (Promise<string> always resolves with the actual output), and the double-write prevention is owned entirely by the CLI layer where it belongs. The streamedStdoutTail and MAX_STDOUT_TAIL machinery from the latest commit can also be removed since we're accumulating fully anyway.


🧠 Learnings used
Learnt from: the-waste-land
Repo: opendataloader-project/opendataloader-pdf PR: 409
File: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/ImagesUtils.java:339-351
Timestamp: 2026-04-11T14:50:33.654Z
Learning: In `ImagesUtils.java` (Java, opendataloader-pdf), the `drawImage()` method in `PageImageCollector` intentionally stores full transformed bounds without applying the clipping path. The PR author (the-waste-land) explicitly deferred clipping-aware bound intersection to a follow-up, as it requires representative clipped-image PDF samples and targeted tests. Do not flag this as a bug in PRs that are scoped only to the native-crash fix (direct PDFBox image-object extraction path).

} else {
const errorOutput = stderr || stdout;
const error = new Error(
Expand Down