Fix #430: add input validation to OpenDataLoaderPDF.processFile by nandanadileep · Pull Request #432 · opendataloader-project/opendataloader-pdf

nandanadileep · 2026-04-15T18:28:46Z

Closes #430.

Problem

OpenDataLoaderPDF.processFile threw a nonspecific exception on invalid paths, which could abort an entire batch loop. The CLI handled this gracefully; the Java API did not.

Changes

Added validateInputFile(String) that checks for null/blank input, syntactically invalid path, non-existent file, non-regular file, and missing .pdf extension — each throws IllegalArgumentException with a clear message.
Preserved the existing throws IOException contract on processFile.
Logs only the file name (not the full path) to avoid path disclosure in log output; this addresses the security concern flagged in CodeRabbit's review of #431 (Fix: Handle invalid file paths in OpenDataLoaderPDF).
Updated Javadoc on processFile with both @throws tags and a batch-usage code example.
Added Javadoc to the private validateInputFile helper to meet docstring coverage requirements.
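
The checks listed above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `InputValidationSketch`, the `Path` return type, the order of the checks, and the message wording are all assumptions, and logging is omitted.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Illustrative stand-in; not the project's actual OpenDataLoaderPDF class. */
class InputValidationSketch {

    /** Returns the validated path, or throws IllegalArgumentException for the first failed check. */
    static Path validateInputFile(String inputPdfName) {
        if (inputPdfName == null || inputPdfName.trim().isEmpty()) {
            throw new IllegalArgumentException("Input file name must not be null or blank");
        }
        final Path path;
        try {
            path = Paths.get(inputPdfName);
        } catch (InvalidPathException e) {
            // Keep the original exception as the cause, but do not echo the raw input in the message.
            throw new IllegalArgumentException("Input file name is not a valid path", e);
        }
        // getFileName() is null for root-only paths, so check before calling toString().
        Path fileNamePath = path.getFileName();
        if (fileNamePath == null) {
            throw new IllegalArgumentException("Path has no file name component");
        }
        String fileName = fileNamePath.toString();
        if (!Files.exists(path)) {
            throw new IllegalArgumentException("File does not exist: " + fileName);
        }
        if (!Files.isRegularFile(path)) {
            throw new IllegalArgumentException("Not a regular file: " + fileName);
        }
        // Locale.ROOT keeps the case-insensitive extension check independent of the default locale.
        if (!fileName.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            throw new IllegalArgumentException("Not a .pdf file: " + fileName);
        }
        return path;
    }
}
```

Every failure surfaces as the same unchecked exception type, so a caller needs only one `catch` clause to tell bad input apart from a genuine `IOException`.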

Improvements over #431

| Area | #431 | This PR |
| --- | --- | --- |
| `@throws IOException` in Javadoc | Removed (breaks contract) | Restored |
| `@throws IllegalArgumentException` in Javadoc | Missing | Present |
| Log path disclosure | Logs full `inputPdfName` | Logs only `path.getFileName()` |
| `validateInputFile` Javadoc | Present | Present |
| Batch-loop usage example | Missing | Included in Javadoc |

Test manually

// Should throw IllegalArgumentException, not NPE or IOException
OpenDataLoaderPDF.processFile(null, config);
OpenDataLoaderPDF.processFile("", config);
OpenDataLoaderPDF.processFile("/does/not/exist.pdf", config);
OpenDataLoaderPDF.processFile("/tmp/somefile.txt", config);

// Should proceed to DocumentProcessor
OpenDataLoaderPDF.processFile("/path/to/valid.pdf", config);

Summary by CodeRabbit

Bug Fixes
- Stricter validation for provided PDF input names: rejects null/blank, syntactically invalid paths, root-only paths, non-existent or non-regular files, and names not ending with “.pdf” (case-insensitive).
- API contract updated to surface and document invalid-input errors when validation fails.
- Validation failures now emit warning-level diagnostic logs to aid troubleshooting.

@throws

…rPDF.processFile

Throw IllegalArgumentException early for null/blank paths, invalid path syntax, missing files, non-regular files, and non-.pdf extensions so callers can catch and skip bad entries in a batch loop without losing IOException propagation for genuine I/O failures.

- Log only file name (not full path) to avoid path-disclosure in logs
- Add @throws IllegalArgumentException and @throws IOException to Javadoc
- Include batch-usage example in Javadoc
- Add Javadoc to private validateInputFile helper
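
The batch-loop pattern this commit message describes might look like the sketch below. The `FileProcessor` interface is a hypothetical stand-in for a `processFile`-style call (the real method also takes a config argument), and `BatchLoopSketch` and `processAll` are names invented for illustration; the actual Javadoc example in the PR may differ.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class BatchLoopSketch {

    /** Hypothetical stand-in: IllegalArgumentException for bad input, IOException for I/O failures. */
    interface FileProcessor {
        void process(String inputPdfName) throws IOException;
    }

    /** Processes every entry, skips invalid ones, and returns the entries that were skipped. */
    static List<String> processAll(List<String> inputs, FileProcessor processor) throws IOException {
        List<String> skipped = new ArrayList<>();
        for (String input : inputs) {
            try {
                processor.process(input);
            } catch (IllegalArgumentException e) {
                // Bad entry: record it and continue with the rest of the batch.
                skipped.add(input);
            }
            // IOException is deliberately not caught, so genuine I/O failures still propagate.
        }
        return skipped;
    }
}
```

Catching only `IllegalArgumentException` keeps the two failure modes separate: invalid entries are skipped, while an `IOException` still stops the batch unless the caller chooses to handle it.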

CLAassistant · 2026-04-15T18:28:56Z

All committers have signed the CLA.

coderabbitai · 2026-04-15T18:29:02Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c1d28d65-88c8-4f14-8e36-08de60155f03

📥 Commits

Reviewing files that changed from the base of the PR and between 045d62c and de35ebc.

📒 Files selected for processing (1)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

Walkthrough

processFile now validates the input path before processing: checks null/blank, path syntax, existence, regular file, filename component, and case-insensitive .pdf extension. Failures log WARNING and throw IllegalArgumentException (InvalidPathException preserved as cause); valid inputs delegate to DocumentProcessor.processFile.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Input validation & API contract `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java` | Added private `validateInputFile(String)` and a static `Logger`; `processFile` now calls validation and documents throwing `IllegalArgumentException` for null/blank input, invalid path syntax (cause preserved), non-existent path, non-regular file, missing filename component, or non-`.pdf` extension. Validation failures log at WARNING. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Caller
    participant OpenDataLoaderPDF
    participant DocumentProcessor
    participant FileSystem
    Caller->>OpenDataLoaderPDF: processFile(inputPdfName, config)
    OpenDataLoaderPDF->>OpenDataLoaderPDF: validateInputFile(inputPdfName)
    alt invalid input
        OpenDataLoaderPDF->>OpenDataLoaderPDF: log WARNING
        OpenDataLoaderPDF-->>Caller: throw IllegalArgumentException
    else valid input
        OpenDataLoaderPDF->>DocumentProcessor: processFile(resolvedPath, config)
        DocumentProcessor->>FileSystem: open/read file
        FileSystem-->>DocumentProcessor: file stream / IO results
        DocumentProcessor-->>OpenDataLoaderPDF: processing result
        OpenDataLoaderPDF-->>Caller: return / complete
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and directly references the main change: adding input validation to the OpenDataLoaderPDF.processFile method to fix issue `#430`. |
| Linked Issues check | ✅ Passed | The PR implements all required objectives from `#430`: input validation throwing IllegalArgumentException for invalid paths [`#430`], preservation of throws IOException contract [`#430`], and documentation with batch usage examples in Javadoc [`#430`]. |
| Out of Scope Changes check | ✅ Passed | The single file change is entirely scoped to input validation for OpenDataLoaderPDF.processFile as specified in `#430`, with no unrelated modifications. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 91-92: The exception messages in OpenDataLoaderPDF currently
append the raw inputPdfName (e.g., in the IllegalArgumentException throws),
leaking full file paths; change these throws to avoid exposing full paths by
using only a sanitized basename or a generic identifier (derive the file name
via the inputPdfName's filename component or replace with "input file" or the
file's basename) instead of the raw path, and update every throw that references
inputPdfName (the three occurrences flagged) so they emit non-sensitive text
while preserving enough context for debugging.
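
One way to follow this suggestion is to derive a display name from the last path element and fall back to a generic label when none is available. The sketch below is illustrative only: `SafeMessageSketch`, `displayName`, and `notFound` are hypothetical names, and the message text is an assumption rather than the wording used in the PR.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

class SafeMessageSketch {

    /** Returns only the last path element for use in messages, or a generic label if none can be derived. */
    static String displayName(String rawInput) {
        if (rawInput == null) {
            return "input file";
        }
        try {
            Path fileName = Paths.get(rawInput).getFileName();
            // getFileName() is null for root-only paths.
            return fileName == null ? "input file" : fileName.toString();
        } catch (InvalidPathException e) {
            // The raw input cannot be parsed, so do not echo it back.
            return "input file";
        }
    }

    /** Builds an exception whose message names the file without its parent directories. */
    static IllegalArgumentException notFound(String rawInput) {
        return new IllegalArgumentException("File does not exist: " + displayName(rawInput));
    }
}
```

The base name usually gives enough context to find the offending entry in a batch, while parent directories such as user home paths stay out of exception messages.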

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 82ebf40c-f171-4de2-8a81-f12cf84c9ca4

📥 Commits

Reviewing files that changed from the base of the PR and between c3391de and 121afb7.

📒 Files selected for processing (1)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

nandanadileep · 2026-04-15T18:34:40Z

Could a maintainer please approve the workflow run so CI can execute? (Fork PRs require manual approval on this repo.)

bundolee · 2026-04-16T02:01:24Z

Thanks for this PR, @nandanadileep — clean work and a solid improvement over #431. The comparison table in the description is especially helpful.

Before we move forward, could you please resolve the open CodeRabbit review conversation? It flags path disclosure in the exception messages (lines 91, 96, 101) — the logger already uses path.getFileName(), but the thrown IllegalArgumentException messages still include the raw inputPdfName. The suggested fix looks straightforward.

Here's how the review process will go from here:

CodeRabbit — resolve all open review threads (with a fix commit)
CI — all checks must pass
@MaximPlusov — final review from the Java component owner

Once the CodeRabbit thread is addressed and CI is green, Maxim will take it from there.

codecov · 2026-04-16T02:08:32Z

Codecov Report

❌ Patch coverage is 0% with 20 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../org/opendataloader/pdf/api/OpenDataLoaderPDF.java | 0.00% | 20 Missing ⚠️ |

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 89-104: The code calls path.getFileName().toString() without
guarding for null (root paths), causing NPEs; modify the validation in
OpenDataLoaderPDF to first obtain Path name = path.getFileName() and if name is
null treat it as invalid—log a warning via LOGGER (same style as existing
messages) and throw IllegalArgumentException (use path.toString() or a fallback
like path.toString() in the message) before any use of name.toString(); then
continue the existing checks (isRegularFile, extension) using the safe fileName
string.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 80c36d68-26cd-45d0-adef-98acfc8407cb

📥 Commits

Reviewing files that changed from the base of the PR and between 121afb7 and 93a0919.

📒 Files selected for processing (1)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

coderabbitai

♻️ Duplicate comments (1)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java (1)

90-93: ⚠️ Potential issue | 🟠 Major

Avoid full path disclosure in the root-path exception message.

The exception currently includes path text directly, which can expose filesystem details. Use a sanitized/generic identifier instead, consistent with the other validation exceptions.

🔧 Suggested fix

-        if (fileNamePath == null) {
-            LOGGER.log(Level.WARNING, "Path has no file name component (root path not allowed)");
-            throw new IllegalArgumentException("Path has no file name component: " + path);
-        }
+        if (fileNamePath == null) {
+            LOGGER.log(Level.WARNING, "Path has no file name component (root path not allowed)");
+            throw new IllegalArgumentException("Path has no file name component");
+        }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`
around lines 90 - 93, Replace the exception message that currently includes the
full path when fileNamePath is null by throwing an IllegalArgumentException with
a sanitized/generic identifier (e.g., "invalid path" or "root path not allowed")
instead of the raw path string; update the throw site referencing fileNamePath,
path and the LOGGER block in OpenDataLoaderPDF (the if (fileNamePath == null)
branch) so the log can still warn with context but the exception message does
not disclose filesystem details.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: dcb35156-723f-42e9-a92f-65a4d01c193f

📥 Commits

Reviewing files that changed from the base of the PR and between 93a0919 and cd1fa2e.

📒 Files selected for processing (1)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

nandanadileep requested review from LonelyMidoriya, MaximPlusov, bundolee, hnc-jglee and hyunhee-jo as code owners April 15, 2026 18:28

coderabbitai Bot reviewed Apr 15, 2026

Fix path disclosure in IllegalArgumentException messages

93a0919

Use only the file name (not the full path) in exception messages to avoid leaking absolute paths — consistent with the logger which already used path.getFileName(). Addresses CodeRabbit and bundolee review feedback.

coderabbitai Bot reviewed Apr 16, 2026

Guard against null getFileName() for root paths

cd1fa2e

path.getFileName() returns null for root paths (e.g. "/" on Unix, "C:\" on Windows). Add an explicit null check and throw IllegalArgumentException before any use of the filename string.
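
The root-path behaviour described in this commit can be observed with a short standalone program. This is a sketch for illustration: `RootPathSketch` and `hasNoFileName` are hypothetical names and are not part of the PR.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;

class RootPathSketch {

    /** True when the path has no name elements, for example a filesystem root. */
    static boolean hasNoFileName(Path path) {
        return path.getFileName() == null;
    }

    public static void main(String[] args) {
        // Iterate over the roots of the default filesystem so this runs on both Unix and Windows.
        for (Path root : FileSystems.getDefault().getRootDirectories()) {
            System.out.println(root + " -> getFileName() == null: " + hasNoFileName(root));
        }
    }
}
```

Because `getFileName()` can return null, any call to `toString()` on its result needs a null check first, which is what the guard in this commit adds.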

coderabbitai Bot reviewed Apr 16, 2026

nandanadileep added 2 commits April 16, 2026 13:23

Remove path from root-path exception message to avoid path disclosure

045d62c

Add Javadoc to private constructor to meet docstring coverage threshold

de35ebc

nandanadileep wants to merge 5 commits into opendataloader-project:main from nandanadileep:fix/issue-430-invalid-path-validation
