Fix #430: add input validation to OpenDataLoaderPDF.processFile #432

Open
nandanadileep wants to merge 5 commits into opendataloader-project:main from nandanadileep:fix/issue-430-invalid-path-validation


@nandanadileep nandanadileep commented Apr 15, 2026

Closes #430.

Problem

OpenDataLoaderPDF.processFile threw an unspecific exception on invalid paths, crashing an entire batch loop. The CLI handled this gracefully; the Java API did not.

Changes

  • Added validateInputFile(String) that checks for null/blank input, syntactically invalid path, non-existent file, non-regular file, and missing .pdf extension — each throws IllegalArgumentException with a clear message.
  • Preserved the existing throws IOException contract on processFile.
  • Logs only the file name (not the full path) to avoid path-disclosure in log output — addresses the security concern flagged in CodeRabbit's review of Fix: Handle invalid file paths in OpenDataLoaderPDF #431.
  • Updated Javadoc on processFile with both @throws tags and a batch-usage code example.
  • Added Javadoc to the private validateInputFile helper to meet docstring coverage requirements.
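
A minimal sketch of how the checks listed above could be ordered. This is not the code from the PR: the class name `InputValidationSketch`, the `fail` helper, the `Path` return type, and the exact message wording are all assumptions made for illustration.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustrative stand-in; names and messages are hypothetical, not taken from the PR. */
public final class InputValidationSketch {
    private static final Logger LOGGER = Logger.getLogger(InputValidationSketch.class.getName());

    private InputValidationSketch() {}

    /** Returns the parsed path, or throws IllegalArgumentException for the first failed check. */
    public static Path validateInputFile(String inputPdfName) {
        if (inputPdfName == null || inputPdfName.trim().isEmpty()) {
            throw fail("Input PDF path must not be null or blank", null);
        }
        Path path;
        try {
            path = Paths.get(inputPdfName);
        } catch (InvalidPathException e) {
            // Keep the original exception as the cause for debugging.
            throw fail("Input PDF path is not a valid path", e);
        }
        Path fileNamePath = path.getFileName();
        if (fileNamePath == null) {
            throw fail("Path has no file name component", null);
        }
        // Only the last path element appears in logs and messages.
        String fileName = fileNamePath.toString();
        if (!Files.exists(path)) {
            throw fail("File not found: " + fileName, null);
        }
        if (!Files.isRegularFile(path)) {
            throw fail("Not a regular file: " + fileName, null);
        }
        if (!fileName.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            throw fail("Not a .pdf file: " + fileName, null);
        }
        return path;
    }

    /** Logs the message at WARNING and builds the exception to throw. */
    private static IllegalArgumentException fail(String message, Throwable cause) {
        LOGGER.log(Level.WARNING, message);
        return new IllegalArgumentException(message, cause);
    }
}
```

Because every failure is an `IllegalArgumentException`, a caller can tell invalid input apart from the `IOException` that `processFile` still declares for genuine I/O failures.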

Improvements over #431

| Area | #431 | This PR |
|---|---|---|
| @throws IOException in Javadoc | Removed (breaks contract) | Restored |
| @throws IllegalArgumentException in Javadoc | Missing | Present |
| Log path disclosure | Logs full inputPdfName | Logs only path.getFileName() |
| validateInputFile Javadoc | Present | Present |
| Batch-loop usage example | Missing | Included in Javadoc |

Test manually

// Should throw IllegalArgumentException, not NPE or IOException
OpenDataLoaderPDF.processFile(null, config);
OpenDataLoaderPDF.processFile("", config);
OpenDataLoaderPDF.processFile("/does/not/exist.pdf", config);
OpenDataLoaderPDF.processFile("/tmp/somefile.txt", config);

// Should proceed to DocumentProcessor
OpenDataLoaderPDF.processFile("/path/to/valid.pdf", config);

Summary by CodeRabbit

  • Bug Fixes
    • Stricter validation for provided PDF input names: rejects null/blank, syntactically invalid paths, root-only paths, non-existent or non-regular files, and names not ending with “.pdf” (case-insensitive).
    • API contract updated to surface and document invalid-input errors when validation fails.
    • Validation failures now emit warning-level diagnostic logs to aid troubleshooting.

…rPDF.processFile

Throw IllegalArgumentException early for null/blank paths, invalid path
syntax, missing files, non-regular files, and non-.pdf extensions so callers
can catch and skip bad entries in a batch loop without losing IOException
propagation for genuine I/O failures.

- Log only file name (not full path) to avoid path-disclosure in logs
- Add @throws IllegalArgumentException and @throws IOException to Javadoc
- Include batch-usage example in Javadoc
- Add Javadoc to private validateInputFile helper
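
The catch-and-skip pattern this commit message describes could look roughly like the sketch below. `PdfProcessor` is a hypothetical stand-in for a call to `OpenDataLoaderPDF.processFile(path, config)`, and `BatchLoopSketch` and `processAll` are illustrative names that do not appear in the PR or its Javadoc example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public final class BatchLoopSketch {
    private BatchLoopSketch() {}

    /** Hypothetical stand-in for OpenDataLoaderPDF.processFile(inputPdfName, config). */
    @FunctionalInterface
    public interface PdfProcessor {
        void process(String inputPdfName) throws IOException;
    }

    /** Processes every entry in order and returns the entries rejected by validation. */
    public static List<String> processAll(List<String> inputs, PdfProcessor processor)
            throws IOException {
        List<String> skipped = new ArrayList<>();
        for (String input : inputs) {
            try {
                processor.process(input);
            } catch (IllegalArgumentException e) {
                // Invalid entry: record it and continue with the rest of the batch.
                skipped.add(input);
            }
            // IOException is deliberately not caught here, so genuine I/O failures propagate.
        }
        return skipped;
    }
}
```

Catching only `IllegalArgumentException` keeps bad entries from ending the loop while leaving the decision about I/O failures to the caller.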

CLAassistant commented Apr 15, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

coderabbitai Bot commented Apr 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c1d28d65-88c8-4f14-8e36-08de60155f03

📥 Commits

Reviewing files that changed from the base of the PR and between 045d62c and de35ebc.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

Walkthrough

processFile now validates the input path before processing: it checks for null/blank input, path syntax, existence, regular file, filename component, and a case-insensitive .pdf extension. Failures log at WARNING and throw IllegalArgumentException (InvalidPathException preserved as cause); valid inputs delegate to DocumentProcessor.processFile.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Input validation & API contract: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java | Added private validateInputFile(String) and a static Logger; processFile now calls validation and documents throwing IllegalArgumentException for null/blank input, invalid path syntax (cause preserved), non-existent path, non-regular file, missing filename component, or non-.pdf extension. Validation failures log at WARNING. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Caller
    participant OpenDataLoaderPDF
    participant DocumentProcessor
    participant FileSystem
    Caller->>OpenDataLoaderPDF: processFile(inputPdfName, config)
    OpenDataLoaderPDF->>OpenDataLoaderPDF: validateInputFile(inputPdfName)
    alt invalid input
        OpenDataLoaderPDF->>OpenDataLoaderPDF: log WARNING
        OpenDataLoaderPDF-->>Caller: throw IllegalArgumentException
    else valid input
        OpenDataLoaderPDF->>DocumentProcessor: processFile(resolvedPath, config)
        DocumentProcessor->>FileSystem: open/read file
        FileSystem-->>DocumentProcessor: file stream / IO results
        DocumentProcessor-->>OpenDataLoaderPDF: processing result
        OpenDataLoaderPDF-->>Caller: return / complete
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and directly references the main change: adding input validation to the OpenDataLoaderPDF.processFile method to fix issue #430. |
| Linked Issues check | ✅ Passed | The PR implements all required objectives from #430: input validation throwing IllegalArgumentException for invalid paths [#430], preservation of throws IOException contract [#430], and documentation with batch usage examples in Javadoc [#430]. |
| Out of Scope Changes check | ✅ Passed | The single file change is entirely scoped to input validation for OpenDataLoaderPDF.processFile as specified in #430, with no unrelated modifications. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 91-92: The exception messages in OpenDataLoaderPDF currently
append the raw inputPdfName (e.g., in the IllegalArgumentException throws),
leaking full file paths; change these throws to avoid exposing full paths by
using only a sanitized basename or a generic identifier (derive the file name
via the inputPdfName's filename component or replace with "input file" or the
file's basename) instead of the raw path, and update every throw that references
inputPdfName (the three occurrences flagged) so they emit non-sensitive text
while preserving enough context for debugging.
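
One way to act on this finding is a small helper that reduces the raw input to its last path element before it reaches an exception message. The sketch below is an assumption about how that could be done; `SafeNameSketch`, `displayName`, and the `"input file"` fallback label are hypothetical names, not the fix that was committed.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class SafeNameSketch {
    private static final String GENERIC_LABEL = "input file";

    private SafeNameSketch() {}

    /** Returns only the last path element, or a generic label when none can be derived. */
    public static String displayName(String inputPdfName) {
        if (inputPdfName == null) {
            return GENERIC_LABEL;
        }
        try {
            Path name = Paths.get(inputPdfName).getFileName();
            // getFileName() is null for root paths and empty for an empty path.
            if (name == null || name.toString().isEmpty()) {
                return GENERIC_LABEL;
            }
            return name.toString();
        } catch (InvalidPathException e) {
            // The raw string could not be parsed, so do not echo it back.
            return GENERIC_LABEL;
        }
    }
}
```

A throw site would then build its message from `displayName(inputPdfName)` instead of concatenating `inputPdfName` directly, so directory names stay out of the message while the file name remains available for debugging.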

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 82ebf40c-f171-4de2-8a81-f12cf84c9ca4

📥 Commits

Reviewing files that changed from the base of the PR and between c3391de and 121afb7.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

@nandanadileep
Author

Could a maintainer please approve the workflow run so CI can execute? (Fork PRs require manual approval on this repo.)

Contributor

bundolee commented Apr 16, 2026

Thanks for this PR, @nandanadileep — clean work and a solid improvement over #431. The comparison table in the description is especially helpful.

Before we move forward, could you please resolve the open CodeRabbit review conversation? It flags path disclosure in the exception messages (lines 91, 96, 101) — the logger already uses path.getFileName(), but the thrown IllegalArgumentException messages still include the raw inputPdfName. The suggested fix looks straightforward.

Here's how the review process will go from here:

  1. CodeRabbit — resolve all open review threads (with a fix commit)
  2. CI — all checks must pass
  3. @MaximPlusov — final review from the Java component owner

Once the CodeRabbit thread is addressed and CI is green, Maxim will take it from there.


codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 0% with 20 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../org/opendataloader/pdf/api/OpenDataLoaderPDF.java 0.00% 20 Missing ⚠️


Use only the file name (not the full path) in exception messages to
avoid leaking absolute paths — consistent with the logger which already
used path.getFileName(). Addresses CodeRabbit and bundolee review feedback.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 89-104: The code calls path.getFileName().toString() without
guarding for null (root paths), causing NPEs; modify the validation in
OpenDataLoaderPDF to first obtain Path name = path.getFileName() and if name is
null treat it as invalid—log a warning via LOGGER (same style as existing
messages) and throw IllegalArgumentException (use path.toString() or a fallback
like path.toString() in the message) before any use of name.toString(); then
continue the existing checks (isRegularFile, extension) using the safe fileName
string.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 80c36d68-26cd-45d0-adef-98acfc8407cb

📥 Commits

Reviewing files that changed from the base of the PR and between 121afb7 and 93a0919.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

path.getFileName() returns null for root paths (e.g. "/" on Unix,
"C:\" on Windows). Add an explicit null check and throw
IllegalArgumentException before any use of the filename string.
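
The behaviour this commit guards against can be checked with a few lines of standard `java.nio.file` code. The sketch below is illustrative only; `RootPathSketch` and `hasNoFileName` are hypothetical names, and the roots are taken from the default file system rather than hard-coded so the check is not tied to one platform.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;

public final class RootPathSketch {
    private RootPathSketch() {}

    /** True when the path has no name elements, which is the case for a root directory. */
    public static boolean hasNoFileName(Path path) {
        return path.getFileName() == null;
    }

    /** True when every root of the default file system lacks a file name component. */
    public static boolean allRootsLackFileName() {
        for (Path root : FileSystems.getDefault().getRootDirectories()) {
            if (!hasNoFileName(root)) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `toString()` on the result of `getFileName()` without this check is what would raise a `NullPointerException` for such inputs.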

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java (1)

90-93: ⚠️ Potential issue | 🟠 Major

Avoid full path disclosure in the root-path exception message.

The exception currently includes path text directly, which can expose filesystem details. Use a sanitized/generic identifier instead, consistent with the other validation exceptions.

🔧 Suggested fix
-        if (fileNamePath == null) {
-            LOGGER.log(Level.WARNING, "Path has no file name component (root path not allowed)");
-            throw new IllegalArgumentException("Path has no file name component: " + path);
-        }
+        if (fileNamePath == null) {
+            LOGGER.log(Level.WARNING, "Path has no file name component (root path not allowed)");
+            throw new IllegalArgumentException("Path has no file name component");
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`
around lines 90 - 93, Replace the exception message that currently includes the
full path when fileNamePath is null by throwing an IllegalArgumentException with
a sanitized/generic identifier (e.g., "invalid path" or "root path not allowed")
instead of the raw path string; update the throw site referencing fileNamePath,
path and the LOGGER block in OpenDataLoaderPDF (the if (fileNamePath == null)
branch) so the log can still warn with context but the exception message does
not disclose filesystem details.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: dcb35156-723f-42e9-a92f-65a4d01c193f

📥 Commits

Reviewing files that changed from the base of the PR and between 93a0919 and cd1fa2e.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java


Development

Successfully merging this pull request may close these issues.

Java API: invalid file path crashes batch processing (CLI handles gracefully)

3 participants