Bug: PDF file cannot be deleted after OpenDataLoaderPDF.processFile (file in use by another process)
Environment
- OpenDataLoader Version: 2.2.1
- Java Version: OpenJDK 17.0.10
- OS: Windows (file lock behavior observed)
Issue Description
After calling OpenDataLoaderPDF.processFile to process a PDF document, attempting to delete the source PDF file fails with the error:
The file cannot be deleted because it is being used by another process.
Root cause: The PDDocument and related resources are not closed after processing completes, which leaves an open file handle (and, on Windows, a lock) on the PDF file.
Steps to Reproduce
- Follow the official Java Quick Start guide: https://opendataloader.org/docs/quick-start-java
- Process a local PDF file using OpenDataLoaderPDF.processFile(...)
- Immediately attempt to delete the processed PDF file
- Observe the file-in-use deletion error
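The steps above can be turned into a small check. This is a minimal sketch, not part of OpenDataLoader: the class DeleteAfterProcessCheck and the PdfProcessor interface are hypothetical names, and the interface stands in for the real processing call because its exact signature is not shown in this report. The harness runs the processor and then tries to delete the file, reporting whether the delete succeeded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical harness for this report; not part of OpenDataLoader.
final class DeleteAfterProcessCheck {

    // Stand-in for the real processing call (assumption: it takes a file path).
    interface PdfProcessor {
        void process(Path pdf) throws IOException;
    }

    // Runs the processor, then tries to delete the file.
    // Returns true if the delete succeeded, false if it failed.
    static boolean canDeleteAfter(Path pdf, PdfProcessor processor) throws IOException {
        processor.process(pdf);
        try {
            Files.delete(pdf);
            return true;
        } catch (IOException e) {
            // On Windows, a leaked handle is expected to surface here.
            System.err.println("Delete failed: " + e);
            return false;
        }
    }

    // Convenience wrapper that uses a throwaway file with placeholder bytes.
    // For a real reproduction, call canDeleteAfter on a copy of an actual PDF.
    static boolean checkWithTempFile(PdfProcessor processor) {
        try {
            Path tmp = Files.createTempFile("odl-lock-check", ".pdf");
            Files.write(tmp, new byte[] {'%', 'P', 'D', 'F'});
            boolean deleted = canDeleteAfter(tmp, processor);
            if (!deleted) {
                tmp.toFile().deleteOnExit();
            }
            return deleted;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A well-behaved processor: opens the file and closes it again.
        boolean deleted = checkWithTempFile(pdf -> {
            try (InputStream in = Files.newInputStream(pdf)) {
                in.read();
            }
        });
        System.out.println(deleted ? "deleted" : "delete failed");
    }
}
```

To reproduce the bug, replace the lambda body with the actual processing call and run it on Windows against a copy of a real PDF. The same harness can be rerun after a fix to confirm the file is released.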
Suggested Fix
Add resource cleanup logic to ensure all PDF-related resources (including PDDocument) are closed after processing finishes.
Changes to DocumentProcessor.java
- Add a new closePdfResources() private method to safely close PDF resources
- Wrap the existing processFile logic in a try-finally block to guarantee cleanup
// New method to release all PDF resources
private static void closePdfResources() {
    try {
        StaticLayoutContainers.closeContrastRatioConsumer();
    } catch (Exception e) {
        LOGGER.log(Level.WARNING, "Unable to close contrast ratio consumer: " + e.getMessage());
    }
    PDDocument document = StaticResources.getDocument();
    if (document != null) {
        try {
            document.close();
        } catch (Exception e) {
            LOGGER.log(Level.WARNING, "Unable to close PDF document: " + e.getMessage());
        }
    }
}
// Updated processFile method with try-finally cleanup
public static void processFile(String inputPdfName, Config config) throws IOException {
    try {
        preprocessing(inputPdfName, config);
        calculateDocumentInfo();
        Set<Integer> pagesToProcess = getValidPageNumbers(config);
        List<List<IObject>> contents;
        if (StaticLayoutContainers.isUseStructTree()) {
            contents = TaggedDocumentProcessor.processDocument(inputPdfName, config, pagesToProcess);
        } else if (config.isHybridEnabled()) {
            contents = HybridDocumentProcessor.processDocument(inputPdfName, config, pagesToProcess);
        } else {
            contents = processDocument(inputPdfName, config, pagesToProcess);
        }
        if (config.needsStructuredProcessing()) {
            sortContents(contents, config);
        }
        ContentSanitizer contentSanitizer = new ContentSanitizer(
            config.getFilterConfig().getFilterRules(),
            config.getFilterConfig().isFilterSensitiveData()
        );
        contentSanitizer.sanitizeContents(contents);
        generateOutputs(inputPdfName, contents, config);
    } finally {
        // Critical: Ensure resources are closed even if an exception occurs
        closePdfResources();
    }
}
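The guarantee this change relies on can be shown in isolation. In the sketch below, CleanupSketch and FakeDocument are hypothetical stand-ins (FakeDocument plays the role of a PDDocument-like resource), not OpenDataLoader classes. It shows that a finally block closes the resource whether the work succeeds or throws, and that the original exception still reaches the caller.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration of the try-finally cleanup pattern.
final class CleanupSketch {
    private static final Logger LOGGER = Logger.getLogger(CleanupSketch.class.getName());

    // Stand-in for a document-like resource; it only records that close() ran.
    static final class FakeDocument implements AutoCloseable {
        private boolean closed;

        @Override
        public void close() {
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    // Closes a resource and logs, rather than rethrows, any failure,
    // so a cleanup error cannot mask an exception from the work itself.
    static void closeQuietly(AutoCloseable resource, String label) {
        if (resource == null) {
            return;
        }
        try {
            resource.close();
        } catch (Exception e) {
            // Passing the exception as the third argument keeps the stack trace.
            LOGGER.log(Level.WARNING, "Unable to close " + label, e);
        }
    }

    // Runs the work and always closes the document afterwards.
    static void process(FakeDocument document, Runnable work) {
        try {
            work.run();
        } finally {
            closeQuietly(document, "document");
        }
    }

    public static void main(String[] args) {
        FakeDocument document = new FakeDocument();
        try {
            process(document, () -> {
                throw new IllegalStateException("simulated processing failure");
            });
        } catch (IllegalStateException e) {
            System.out.println("closed after failure: " + document.isClosed());
        }
    }
}
```

One design choice worth considering for the real patch: logging with the exception object (as closeQuietly does here) instead of only e.getMessage() preserves the stack trace, which helps when a close failure needs diagnosing.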
Verification
After applying this fix:
- PDF files are properly unlocked after processing
- Files can be deleted immediately after processFile returns
- No resource leaks or file locks remain