Fix: Handle invalid file paths in OpenDataLoaderPDF #431

Closed
pavanpai769 wants to merge 6 commits into opendataloader-project:main from pavanpai769:fix/handle-invalid-input-paths

@pavanpai769 pavanpai769 commented Apr 15, 2026

Fixes the issue discussed in #375.

Changes:

  • Added input validation for invalid file paths (null, blank, non-existent, non-PDF)
  • Throws IllegalArgumentException for invalid inputs
  • Preserved existing IOException contract
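
From a caller's point of view, the two contracts listed above could be handled separately. The following is a minimal sketch under stated assumptions: `CallerSketch`, its `processFile` stand-in, and `describe` are hypothetical names for illustration, not the library's API. It relies on `InvalidPathException` being a subclass of `IllegalArgumentException` in the JDK, so a single `catch` covers malformed paths as well.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Caller-side sketch; processFile here is a hypothetical stand-in, not the library method. */
final class CallerSketch {

    /** Hypothetical stand-in: rejects bad input eagerly, then performs I/O. */
    static long processFile(String inputPdfName) throws IOException {
        if (inputPdfName == null || inputPdfName.isBlank()) {
            throw new IllegalArgumentException("Input PDF path is null or blank");
        }
        // Paths.get may throw InvalidPathException, which is an IllegalArgumentException.
        Path path = Paths.get(inputPdfName);
        if (!Files.isRegularFile(path)) {
            throw new IllegalArgumentException("Input PDF path is not a regular file");
        }
        return Files.size(path); // I/O can still fail with the checked IOException
    }

    /** Maps the unchecked and checked failure modes to distinct outcomes. */
    static String describe(String inputPdfName) {
        try {
            processFile(inputPdfName);
            return "ok";
        } catch (IllegalArgumentException e) {
            return "rejected: " + e.getMessage();
        } catch (IOException e) {
            return "io-failure: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(null));
        System.out.println(describe("   "));
    }
}
```

Keeping the two `catch` branches apart lets a caller treat a rejected argument as a programming or input error while still retrying or reporting genuine I/O failures.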

Summary by CodeRabbit

  • Bug Fixes

    • Strengthened pre-processing validation for input files (blank/invalid paths, non-existent or non-regular files, and non-.pdf names), with clearer error reporting and warning-level logging to improve feedback on invalid inputs.
  • Documentation

    • Updated method documentation to reflect the new validation behavior and removed an outdated exception declaration.

Signed-off-by: pavanpai769 <151814231+pavanpai769@users.noreply.github.com>
Signed-off-by: pavanpai769 <151814231+pavanpai769@users.noreply.github.com>
Signed-off-by: pavanpai769 <151814231+pavanpai769@users.noreply.github.com>
Signed-off-by: pavanpai769 <151814231+pavanpai769@users.noreply.github.com>

coderabbitai Bot commented Apr 15, 2026

Walkthrough

The OpenDataLoaderPDF.processFile() method now performs input validation via a new private validateInputFile() that checks for null/blank input, valid path parsing, existence as a regular file, and a case-insensitive .pdf extension; validation failures log warnings and throw IllegalArgumentException. Javadoc @throws IOException removed.
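
The order of checks in the walkthrough can be exercised with temporary files. This is a sketch, assuming a standalone class: `ValidationSketch`, `validate`, and `rejects` are hypothetical names, the messages are illustrative, and logging is left out. It is not the PR's `validateInputFile()` itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

final class ValidationSketch {

    /** Runs the checks in the order the walkthrough lists them and returns the parsed path. */
    static Path validate(String inputPdfName) {
        if (inputPdfName == null || inputPdfName.isBlank()) {
            throw new IllegalArgumentException("Input PDF path is null or blank");
        }
        final Path path;
        try {
            path = Paths.get(inputPdfName);
        } catch (InvalidPathException ex) {
            throw new IllegalArgumentException("Invalid input PDF path", ex);
        }
        if (!Files.exists(path)) {
            throw new IllegalArgumentException("Input PDF file does not exist");
        }
        if (!Files.isRegularFile(path)) {
            throw new IllegalArgumentException("Input PDF path is not a regular file");
        }
        // Locale.ROOT keeps the lower-casing independent of the default locale.
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        if (!name.endsWith(".pdf")) {
            throw new IllegalArgumentException("Not a PDF file");
        }
        return path;
    }

    /** Returns true when validate(...) rejects the input. */
    static boolean rejects(String input) {
        try {
            validate(input);
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("validation-sketch");
        Path pdf = Files.createFile(dir.resolve("Report.PDF"));
        Path txt = Files.createFile(dir.resolve("notes.txt"));

        System.out.println(rejects(null));                                  // null input
        System.out.println(rejects("  "));                                  // blank input
        System.out.println(rejects(dir.resolve("missing.pdf").toString())); // non-existent
        System.out.println(rejects(dir.toString()));                        // directory
        System.out.println(rejects(txt.toString()));                        // wrong extension
        System.out.println(rejects(pdf.toString()));                        // upper-case .PDF
    }
}
```

Only the last call, with the upper-case `.PDF` name, is expected to pass validation. The extension check looks at the file name only, so a file with a `.pdf` name and non-PDF content would still get through this stage.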

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Input Validation (java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java) | Added Logger field and private validateInputFile(String) performing null/blank checks, Paths.get parsing (handling InvalidPathException), Files.exists/Files.isRegularFile checks, and case-insensitive .pdf extension validation; integrated into processFile() before calling DocumentProcessor.processFile(). Removed @throws IOException from javadoc. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

  • Issue #430: Adds the same input-path validation to OpenDataLoaderPDF.processFile() and throws IllegalArgumentException on invalid inputs.
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding validation for invalid file paths in the OpenDataLoaderPDF class. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java (1)

41-47: ⚠️ Potential issue | 🟡 Minor

Document thrown exceptions in the public method Javadoc.

processFile(...) now rejects invalid input with IllegalArgumentException and still propagates IOException, but the Javadoc does not declare either via @throws. Please make the API contract explicit.

📝 Suggested Javadoc update
     /**
      * Processes a PDF file to extract its content and structure based on the provided configuration.
      *
      * @param inputPdfName The path to the input PDF file.
      * @param config       The configuration object specifying output formats and other options.
+     * @throws IllegalArgumentException if the input path is null, blank, invalid, missing, not a regular file, or not a PDF.
+     * @throws IOException if an I/O error occurs during processing.
      *
      */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`
around lines 41 - 47, The Javadoc for the public method processFile in class
OpenDataLoaderPDF is missing `@throws` tags for the runtime and checked exceptions
it can produce; update the method Javadoc to explicitly document that it throws
IllegalArgumentException for invalid input (e.g., null/empty inputPdfName or
invalid config) and IOException for I/O failures encountered while
reading/processing the PDF, using `@throws` IllegalArgumentException and `@throws`
IOException lines so the API contract matches the method implementation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6bbb458e-d96c-4d3e-b2c0-69f9695fa559

📥 Commits

Reviewing files that changed from the base of the PR and between c3391de and 36a1cb0.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

Comment on lines +61 to +88
if (inputPdfName == null || inputPdfName.isBlank()) {
    LOGGER.log(Level.WARNING,"Input PDF name is null or Empty");
    throw new IllegalArgumentException("Input PDF name is null or Empty");
}

final Path path;

try {
    path = Paths.get(inputPdfName);
} catch (InvalidPathException ex) {
    LOGGER.log(Level.WARNING,"Invalid Path: " + inputPdfName);
    throw new IllegalArgumentException("Invalid Path: " + inputPdfName);
}

if (!Files.exists(path)) {
    LOGGER.log(Level.WARNING,"File not found at " + inputPdfName + " location");
    throw new IllegalArgumentException("File not found at " + inputPdfName + " location");
}

if (!Files.isRegularFile(path)) {
    LOGGER.log(Level.WARNING,"Not a valid file " + inputPdfName);
    throw new IllegalArgumentException("Not a valid file " + inputPdfName);
}

if (!path.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
    LOGGER.log(Level.WARNING,"Not a PDF file");
    throw new IllegalArgumentException("Not a PDF file");
}

⚠️ Potential issue | 🟠 Major

Avoid logging full user-supplied paths at WARNING level.

These warning logs include raw input paths, which can leak sensitive local filesystem details (e.g., usernames/home paths) into centralized logs.

🔧 Suggested hardening
-            LOGGER.log(Level.WARNING,"Input PDF name is null or Empty");
+            LOGGER.log(Level.WARNING, "Input PDF path is null or blank");

-            LOGGER.log(Level.WARNING,"Invalid Path: " + inputPdfName);
+            LOGGER.log(Level.WARNING, "Invalid input PDF path");

-            LOGGER.log(Level.WARNING,"File not found at " + inputPdfName + " location");
+            LOGGER.log(Level.WARNING, "Input PDF file does not exist");

-            LOGGER.log(Level.WARNING,"Not a valid file " + inputPdfName);
+            LOGGER.log(Level.WARNING, "Input PDF path is not a regular file");
📝 Committable suggestion
Suggested change
if (inputPdfName == null || inputPdfName.isBlank()) {
    LOGGER.log(Level.WARNING, "Input PDF path is null or blank");
    throw new IllegalArgumentException("Input PDF name is null or Empty");
}
final Path path;
try {
    path = Paths.get(inputPdfName);
} catch (InvalidPathException ex) {
    LOGGER.log(Level.WARNING, "Invalid input PDF path");
    throw new IllegalArgumentException("Invalid Path: " + inputPdfName);
}
if (!Files.exists(path)) {
    LOGGER.log(Level.WARNING, "Input PDF file does not exist");
    throw new IllegalArgumentException("File not found at " + inputPdfName + " location");
}
if (!Files.isRegularFile(path)) {
    LOGGER.log(Level.WARNING, "Input PDF path is not a regular file");
    throw new IllegalArgumentException("Not a valid file " + inputPdfName);
}
if (!path.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
    LOGGER.log(Level.WARNING, "Not a PDF file");
    throw new IllegalArgumentException("Not a PDF file");
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`
around lines 61 - 88, The warning logs in OpenDataLoaderPDF (the block using
inputPdfName, path, LOGGER) currently print the full user-supplied path and must
be removed or sanitized; update each LOGGER.log call in that validation block to
avoid echoing inputPdfName or the full path and instead log a safe identifier
such as the file base name (path.getFileName().toString()) or a constant
placeholder like "<redacted-path>" (for the InvalidPathException case where path
is not available), e.g. replace messages that concatenate inputPdfName with ones
that use only the safeName or the placeholder so no absolute/user path is
emitted.
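
The sanitization described in the prompt above (base name when a path is available, placeholder otherwise) might look like the following. This is only a sketch: `SafeLogNameSketch`, `safeName`, and the `REDACTED` constant are hypothetical and are not taken from the PR. Note that a file name on its own can still be sensitive in some deployments, so this reduces what is logged and does not fully anonymize it.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.logging.Level;
import java.util.logging.Logger;

final class SafeLogNameSketch {
    private static final Logger LOGGER = Logger.getLogger(SafeLogNameSketch.class.getName());

    /** Placeholder used whenever no safe name can be derived from the input. */
    static final String REDACTED = "<redacted-path>";

    /** Returns only the last path element, or the placeholder when none is available. */
    static String safeName(String rawInput) {
        if (rawInput == null || rawInput.isBlank()) {
            return REDACTED;
        }
        try {
            Path fileName = Paths.get(rawInput).getFileName();
            // getFileName() is null for a path with no name elements, such as a root.
            return fileName == null ? REDACTED : fileName.toString();
        } catch (InvalidPathException ex) {
            return REDACTED; // the raw input cannot be parsed, so nothing of it is echoed
        }
    }

    public static void main(String[] args) {
        // Built with Paths.get so the example does not depend on a platform separator.
        String input = Paths.get("home", "alice", "private", "scan.pdf").toString();
        // Parameterized message: the directory part of the path never reaches the log.
        LOGGER.log(Level.WARNING, "Input PDF file does not exist: {0}", safeName(input));
    }
}
```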

Signed-off-by: pavanpai769 <151814231+pavanpai769@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java (1)

41-47: ⚠️ Potential issue | 🟡 Minor

Restore the missing @throws IOException Javadoc contract.

Line 48 still declares throws IOException, but the Javadoc no longer documents it. This creates API-doc drift.

🛠️ Suggested Javadoc fix
 /**
  * Processes a PDF file to extract its content and structure based on the provided configuration.
  *
  * @param inputPdfName The path to the input PDF file.
  * @param config       The configuration object specifying output formats and other options.
+ * @throws IllegalArgumentException if the input path is invalid.
+ * @throws IOException if an I/O error occurs during processing.
  *
  */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`
around lines 41 - 47, The Javadoc for the PDF processing method in class
OpenDataLoaderPDF is missing the `@throws` IOException tag; update the Javadoc for
the method that "Processes a PDF file to extract its content and structure" (the
method that declares "throws IOException") to include an `@throws` IOException
entry describing when an IOException is thrown (e.g., when the input PDF cannot
be read or I/O fails), ensuring the docblock matches the method signature.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 26c2ad76-59dd-4c7d-9020-3e6ddfa1b820

📥 Commits

Reviewing files that changed from the base of the PR and between 36a1cb0 and b537a81.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

@bundolee
Contributor

Here's how the review process will go from here:

  1. CodeRabbit — all review threads resolved with a fix commit
  2. CI — all checks pass (codecov excluded)
  3. Component owner approval

@pavanpai769 pavanpai769 closed this by deleting the head repository Apr 19, 2026

2 participants