feat: Added "document parser" OCR strategy #11519
dlew wants to merge 3 commits into danny-avila:dev
Pull request overview
Adds a new "document_parser" OCR strategy that extracts text directly from common document formats (PDF/DOCX/XLS/XLSX) using local parsing libraries, avoiding external OCR/RAG services for these cases.
Changes:
- Introduces a new document_parser OCR strategy and wires it into the file strategy selector.
- Implements document parsing for PDF (pdfjs), DOCX (mammoth), and XLS/XLSX (SheetJS/xlsx).
- Adds dependencies plus Jest coverage for DOCX/XLSX parsing (with fixtures).
Reviewed changes
Copilot reviewed 6 out of 10 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| packages/data-provider/src/types/files.ts | Adds document_parser to FileSources so it can be referenced across packages. |
| packages/data-provider/src/config.ts | Adds DOCUMENT_PARSER to OCRStrategy and allows it via the config schema. |
| api/server/services/Files/strategies.js | Registers the new document_parser strategy and routes it to parseDocument. |
| api/server/services/Files/Documents/crud.js | Implements document parsing logic for PDF/DOCX/XLS/XLSX and returns extracted text payload. |
| api/server/services/Files/Documents/tests/documents.spec.js | Adds unit tests for DOCX/XLSX parsing and error cases. |
| api/server/services/Files/Documents/tests/sample.xlsx | XLSX fixture for tests. |
| api/server/services/Files/Documents/tests/sample.docx | DOCX fixture for tests. |
| api/server/services/Files/Documents/tests/empty.docx | Empty DOCX fixture for tests. |
| api/package.json | Adds new runtime deps: mammoth, pdfjs-dist, xlsx. |
| package-lock.json | Locks newly added dependencies and transitive tree. |
    switch (file.mimetype) {
      case 'application/pdf':
        text = await pdfToText(file);
        break;
The PDF branch (case 'application/pdf') introduces new behavior but isn't covered by tests (only docx/xlsx are covered). Even a minimal test (e.g., a tiny fixture PDF or mocking pdfjs) would help prevent regressions in the PDF path.
I tried hard to make this happen. However, since pdfjs ships as ES modules, we'd need a larger refactor to support ESM in LibreChat's Jest tests. Such a refactor didn't seem worth it for now.
A minimal test would only check that we call pdfToText() for the right mimetype, which didn't seem worth the effort of writing (it would give minimal benefit and create the illusion that this path is being tested).
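For context, the routing-only check discussed above could be sketched roughly as follows. This is a hypothetical illustration, not the PR's code: `parseByMimetype`, the `parsers` object, and the parser names other than `pdfToText` are assumptions, and the parsers are injected so that nothing has to import pdfjs. As the reply notes, a check like this only confirms that the PDF mimetype reaches the PDF parser; it says nothing about what pdfjs actually extracts.

```javascript
// Hypothetical sketch: route a file to a parser by MIME type.
// Parsers are passed in so a test can substitute stubs for pdfjs/mammoth/xlsx.
const MIME = {
  pdf: 'application/pdf',
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  xls: 'application/vnd.ms-excel',
  xlsx: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
};

// Returns whatever the selected parser returns. Real parsers would be async,
// in which case the caller awaits the returned promise.
function parseByMimetype(file, parsers) {
  switch (file.mimetype) {
    case MIME.pdf:
      return parsers.pdfToText(file);
    case MIME.docx:
      return parsers.wordDocToText(file); // assumed name
    case MIME.xls:
    case MIME.xlsx:
      return parsers.excelSheetToText(file); // assumed name
    default:
      throw new Error(`Unsupported file type: ${file.mimetype}`);
  }
}

// Routing check with stubs: records which parser was selected.
function demoRouting() {
  const calls = [];
  const stubs = {
    pdfToText: (f) => { calls.push('pdf'); return `pdf:${f.path}`; },
    wordDocToText: (f) => { calls.push('docx'); return `docx:${f.path}`; },
    excelSheetToText: (f) => { calls.push('sheet'); return `sheet:${f.path}`; },
  };
  const text = parseByMimetype({ mimetype: MIME.pdf, path: 'a.pdf' }, stubs);
  return { text, calls };
}
```

Injecting the parsers keeps the ESM-only dependency out of the test module, at the cost of testing only the switch statement.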
I've addressed & responded to all the Copilot feedback.
The document parser uses libraries to parse the text out of known document types. This lets LibreChat handle some complex document types without having to use a secondary service (like Mistral or standing up a RAG API server).
To enable the document parser, set the ocr strategy to "document_parser" in librechat.yaml.
We now support:
- PDFs using pdfjs
- DOCX using mammoth
- XLS/XLSX using SheetJS
(The associated packages were also added to the project.)
- Properly calculate length of text based on UTF8.
- Avoid issues with loading / blocking PDF parsing.
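To illustrate why the UTF-8 length matters: in Node.js, a string's `.length` counts UTF-16 code units, which differs from the encoded byte count as soon as the text contains non-ASCII characters. The sketch below assumes the fix measures bytes with `Buffer.byteLength`; the helper name `utf8ByteLength` is hypothetical and the commit may implement this differently.

```javascript
// Hypothetical helper: size of extracted text in UTF-8 bytes.
function utf8ByteLength(text) {
  return Buffer.byteLength(text, 'utf8');
}

const sample = 'na\u00efve caf\u00e9'; // "naïve café"
console.log(sample.length);            // 10 UTF-16 code units
console.log(utf8ByteLength(sample));   // 12 bytes (two 2-byte characters)
```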
I wanted a few more changes before merging, continued here: #11900. I don't have access to push changes to your fork, so I opened a new PR with your original commits.
Sounds good - do you also want to merge the docs update? LibreChat-AI/librechat.ai#492
Pull Request Template
Summary
The document parser uses libraries to parse the text out of known document types. This lets LibreChat handle some complex document types without having to use a secondary service (like Mistral or standing up a RAG API server).
To enable the document parser, set the ocr strategy to "document_parser" in librechat.yaml.
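As a rough sketch of that setting, the fragment below assumes the strategy is set under the top-level `ocr` key in librechat.yaml, as with the other OCR strategies; check the linked documentation update for the exact layout.

```yaml
# librechat.yaml (fragment; key placement is an assumption)
ocr:
  strategy: "document_parser"
```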
We now support:
- PDFs using pdfjs
- DOCX using mammoth
- XLS/XLSX using SheetJS
(The associated packages were also added to the project.)
Here's a documentation update PR as well.
Change Type
Please delete any irrelevant options.
Testing
I have added automated tests for most cases (the exception being PDFs, as getting Jest to work with ECMAScript modules would be a big lift just for this one small PR).
I also manually tested uploading PDFs, Word documents, and Excel sheets to LibreChat as text, to make sure they are parsed out.
Test Configuration:
- ocr agent capability in librechat.yaml
- ocr strategy in librechat.yaml: document_parser
Checklist
Please delete any irrelevant options.