Fix #430: add input validation to OpenDataLoaderPDF.processFile#432
Fix #430: add input validation to OpenDataLoaderPDF.processFile#432nandanadileep wants to merge 5 commits intoopendataloader-project:mainfrom
Conversation
…rPDF.processFile Throw IllegalArgumentException early for null/blank paths, invalid path syntax, missing files, non-regular files, and non-.pdf extensions so callers can catch and skip bad entries in a batch loop without losing IOException propagation for genuine I/O failures. - Log only file name (not full path) to avoid path-disclosure in logs - Add @throws IllegalArgumentException and @throws IOException to Javadoc - Include batch-usage example in Javadoc - Add Javadoc to private validateInputFile helper
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughProcessFile now validates the input path before processing: checks null/blank, path syntax, existence, regular file, filename component, and case-insensitive Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Caller
participant OpenDataLoaderPDF
participant DocumentProcessor
participant FileSystem
Caller->>OpenDataLoaderPDF: processFile(inputPdfName, config)
OpenDataLoaderPDF->>OpenDataLoaderPDF: validateInputFile(inputPdfName)
alt invalid input
OpenDataLoaderPDF->>OpenDataLoaderPDF: log WARNING
OpenDataLoaderPDF-->>Caller: throw IllegalArgumentException
else valid input
OpenDataLoaderPDF->>DocumentProcessor: processFile(resolvedPath, config)
DocumentProcessor->>FileSystem: open/read file
FileSystem-->>DocumentProcessor: file stream / IO results
DocumentProcessor-->>OpenDataLoaderPDF: processing result
OpenDataLoaderPDF-->>Caller: return / complete
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 91-92: The exception messages in OpenDataLoaderPDF currently
append the raw inputPdfName (e.g., in the IllegalArgumentException throws),
leaking full file paths; change these throws to avoid exposing full paths by
using only a sanitized basename or a generic identifier (derive the file name
via the inputPdfName's filename component or replace with "input file" or the
file's basename) instead of the raw path, and update every throw that references
inputPdfName (the three occurrences flagged) so they emit non-sensitive text
while preserving enough context for debugging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 82ebf40c-f171-4de2-8a81-f12cf84c9ca4
📒 Files selected for processing (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java
|
Could a maintainer please approve the workflow run so CI can execute? (Fork PRs require manual approval on this repo.) |
|
Thanks for this PR, @nandanadileep — clean work and a solid improvement over #431. The comparison table in the description is especially helpful. Before we move forward, could you please resolve the open CodeRabbit review conversation? It flags path disclosure in the exception messages (lines 91, 96, 101) — the logger already uses Here's how the review process will go from here:
Once the CodeRabbit thread is addressed and CI is green, Maxim will take it from there. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Use only the file name (not the full path) in exception messages to avoid leaking absolute paths — consistent with the logger which already used path.getFileName(). Addresses CodeRabbit and bundolee review feedback.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 89-104: The code calls path.getFileName().toString() without
guarding for null (root paths), causing NPEs; modify the validation in
OpenDataLoaderPDF to first obtain Path name = path.getFileName() and if name is
null treat it as invalid—log a warning via LOGGER (same style as existing
messages) and throw IllegalArgumentException (use path.toString() or a fallback
like path.toString() in the message) before any use of name.toString(); then
continue the existing checks (isRegularFile, extension) using the safe fileName
string.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 80c36d68-26cd-45d0-adef-98acfc8407cb
📒 Files selected for processing (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java
path.getFileName() returns null for root paths (e.g. "/" on Unix, "C:\" on Windows). Add an explicit null check and throw IllegalArgumentException before any use of the filename string.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java (1)
90-93:⚠️ Potential issue | 🟠 MajorAvoid full path disclosure in the root-path exception message.
The exception currently includes
pathtext directly, which can expose filesystem details. Use a sanitized/generic identifier instead, consistent with the other validation exceptions.🔧 Suggested fix
- if (fileNamePath == null) { - LOGGER.log(Level.WARNING, "Path has no file name component (root path not allowed)"); - throw new IllegalArgumentException("Path has no file name component: " + path); - } + if (fileNamePath == null) { + LOGGER.log(Level.WARNING, "Path has no file name component (root path not allowed)"); + throw new IllegalArgumentException("Path has no file name component"); + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java` around lines 90 - 93, Replace the exception message that currently includes the full path when fileNamePath is null by throwing an IllegalArgumentException with a sanitized/generic identifier (e.g., "invalid path" or "root path not allowed") instead of the raw path string; update the throw site referencing fileNamePath, path and the LOGGER block in OpenDataLoaderPDF (the if (fileNamePath == null) branch) so the log can still warn with context but the exception message does not disclose filesystem details.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 90-93: Replace the exception message that currently includes the
full path when fileNamePath is null by throwing an IllegalArgumentException with
a sanitized/generic identifier (e.g., "invalid path" or "root path not allowed")
instead of the raw path string; update the throw site referencing fileNamePath,
path and the LOGGER block in OpenDataLoaderPDF (the if (fileNamePath == null)
branch) so the log can still warn with context but the exception message does
not disclose filesystem details.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: dcb35156-723f-42e9-a92f-65a4d01c193f
📒 Files selected for processing (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java
Closes #430.
Problem
OpenDataLoaderPDF.processFilethrew an unspecific exception on invalid paths, crashing an entire batch loop. The CLI handled this gracefully; the Java API did not.Changes
validateInputFile(String)that checks for null/blank input, syntactically invalid path, non-existent file, non-regular file, and missing.pdfextension — each throwsIllegalArgumentExceptionwith a clear message.throws IOExceptioncontract onprocessFile.processFilewith both@throwstags and a batch-usage code example.validateInputFilehelper to meet docstring coverage requirements.Improvements over #431
@throws IOExceptionin Javadoc@throws IllegalArgumentExceptionin JavadocinputPdfNamepath.getFileName()validateInputFileJavadocTest manually
Summary by CodeRabbit