feat: Added "document parser" OCR strategy #11519

Closed
dlew wants to merge 3 commits into danny-avila:dev from newjersey:dlew/document-parser-ocr

@dlew dlew commented Jan 26, 2026

Pull Request Template

Summary

The document parser uses local libraries to extract text from known document types. This lets LibreChat handle some complex document types without relying on a secondary service (such as Mistral, or a self-hosted RAG API server).

To enable the document parser, set the ocr strategy to "document_parser" in librechat.yaml.

We now support:

  • PDFs using pdfjs
  • DOCX using mammoth
  • XLS/XLSX using SheetJS

(The associated packages were also added to the project.)
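
As a rough illustration of the per-type routing described above, the sketch below dispatches on the file's mimetype. It is not the PR's actual code: `parseDocumentSketch`, `docxToText`, `excelToText`, and the injected `parsers` object are hypothetical names, and the parsers are passed in as arguments only so the example runs without mammoth, pdfjs-dist, or xlsx installed.

```javascript
// Hypothetical sketch of mimetype-based dispatch; the real implementation
// lives in api/server/services/Files/Documents/crud.js and may differ.
const DOCX = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';
const XLSX = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet';
const XLS = 'application/vnd.ms-excel';

// `parsers` is an assumption for illustration: an object of async functions
// that each take the uploaded file and resolve to its extracted text.
async function parseDocumentSketch(file, parsers) {
  let text;
  switch (file.mimetype) {
    case 'application/pdf':
      text = await parsers.pdfToText(file);
      break;
    case DOCX:
      text = await parsers.docxToText(file);
      break;
    case XLS:
    case XLSX:
      text = await parsers.excelToText(file);
      break;
    default:
      throw new Error(`Unsupported file type: ${file.mimetype}`);
  }
  if (!text) {
    throw new Error('No text found in document');
  }
  return text;
}

// Usage with stub parsers standing in for the real libraries.
const exampleParsers = {
  pdfToText: async () => 'text from a PDF',
  docxToText: async () => 'text from a Word document',
  excelToText: async () => 'text from a spreadsheet',
};

parseDocumentSketch({ mimetype: DOCX }, exampleParsers).then((text) => {
  console.log(text); // text from a Word document
});
```

Injecting the parsers is purely a convenience for this example; a real implementation would more likely call the library wrappers directly.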

Here's a documentation update PR as well.

Change Type

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update (here)

Testing

I have added automated tests for most cases (the exception being PDFs, as getting Jest to work with ECMAScript modules would be a big lift just for this one small PR).

I also manually tested uploading PDFs, Word documents, and Excel sheets to LibreChat as text, to confirm that their text is extracted.

Test Configuration:

  • Enable the ocr agent capability in librechat.yaml:
    agents:
      capabilities:
        - "ocr"

  • Set the ocr strategy to document_parser:
    ocr:
      strategy: "document_parser"

Checklist

  • My code adheres to this project's style guidelines
  • I have performed a self-review of my own code
  • I have commented in any complex areas of my code
  • I have made pertinent documentation changes
  • My changes do not introduce new warnings
  • I have written tests demonstrating that my changes are effective or that my feature works
  • Local unit tests pass with my changes
  • A pull request for updating the documentation has been submitted (here).

Copilot AI left a comment

Pull request overview

Adds a new "document_parser" OCR strategy that extracts text directly from common document formats (PDF/DOCX/XLS/XLSX) using local parsing libraries, avoiding external OCR/RAG services for these cases.

Changes:

  • Introduces a new document_parser OCR strategy and wires it into the file strategy selector.
  • Implements document parsing for PDF (pdfjs), DOCX (mammoth), and XLS/XLSX (SheetJS/xlsx).
  • Adds dependencies plus Jest coverage for DOCX/XLSX parsing (with fixtures).

Reviewed changes

Copilot reviewed 6 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| packages/data-provider/src/types/files.ts | Adds document_parser to FileSources so it can be referenced across packages. |
| packages/data-provider/src/config.ts | Adds DOCUMENT_PARSER to OCRStrategy and allows it via the config schema. |
| api/server/services/Files/strategies.js | Registers the new document_parser strategy and routes it to parseDocument. |
| api/server/services/Files/Documents/crud.js | Implements document parsing logic for PDF/DOCX/XLS/XLSX and returns extracted text payload. |
| api/server/services/Files/Documents/__tests__/documents.spec.js | Adds unit tests for DOCX/XLSX parsing and error cases. |
| api/server/services/Files/Documents/__tests__/sample.xlsx | XLSX fixture for tests. |
| api/server/services/Files/Documents/__tests__/sample.docx | DOCX fixture for tests. |
| api/server/services/Files/Documents/__tests__/empty.docx | Empty DOCX fixture for tests. |
| api/package.json | Adds new runtime deps: mammoth, pdfjs-dist, xlsx. |
| package-lock.json | Locks newly added dependencies and transitive tree. |

Comment thread api/server/services/Files/Documents/crud.js Outdated
Comment on lines +16 to +19
switch (file.mimetype) {
case 'application/pdf':
text = await pdfToText(file);
break;

Copilot AI Feb 12, 2026

The PDF branch (case 'application/pdf') introduces new behavior but isn't covered by tests (only docx/xlsx are covered). Even a minimal test (e.g., a tiny fixture PDF or mocking pdfjs) would help prevent regressions in the PDF path.

dlew (Contributor Author) replied:

I tried hard to make this happen. However, since pdfjs ships as ES modules, we'd need a larger refactor to support ESM in LibreChat's Jest tests. Such a refactor didn't seem worth it for now.

A minimal test would only check that pdfToText() is called for the right mimetype. That didn't seem worth writing: it would add little benefit while giving the illusion that this path is being tested.

Comment thread api/server/services/Files/Documents/__tests__/documents.spec.js
@dlew dlew force-pushed the dlew/document-parser-ocr branch from 2c5c352 to 500b9eb, on February 12, 2026 15:39

dlew commented Feb 12, 2026

I've addressed & responded to all the Copilot feedback.

@dlew dlew force-pushed the dlew/document-parser-ocr branch 2 times, most recently from ebe5b77 to 3049969, on February 18, 2026 15:13
dlew added 3 commits February 19, 2026 08:34
The document parser uses libraries to parse the text out of known document types.
This lets LibreChat handle some complex document types without having to use a
secondary service (like Mistral or standing up a RAG API server).

To enable the document parser, set the ocr strategy to "document_parser" in
librechat.yaml.

We now support:

- PDFs using pdfjs
- DOCX using mammoth
- XLS/XLSX using SheetJS

(The associated packages were also added to the project.)
- Properly calculate length of text based on UTF8.

- Avoid issues with loading / blocking PDF parsing.
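
The commit code itself is not shown in this conversation, but the difference between character count and UTF-8 length can be sketched as follows. This is a minimal example assuming Node.js; `utf8ByteLength` is a hypothetical helper name, not something taken from the PR.

```javascript
// String.prototype.length counts UTF-16 code units, while
// Buffer.byteLength counts the bytes of the encoded string.
function utf8ByteLength(text) {
  return Buffer.byteLength(text, 'utf8');
}

// "café ✓" written with escapes to avoid any normalization ambiguity.
const sample = 'caf\u00e9 \u2713';

console.log(sample.length);          // 6
console.log(utf8ByteLength(sample)); // 9
```

The two values diverge for any non-ASCII text, so a size reported from `length` would undercount documents containing accented or non-Latin characters.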
danny-avila (Owner) commented

I wanted a few more changes before merging, so I continued the work here: #11900

I don't have access to push changes to your fork, so I opened a new PR with your original commits.


dlew commented Feb 23, 2026

Sounds good - do you also want to merge the docs update? LibreChat-AI/librechat.ai#492

@dlew dlew deleted the dlew/document-parser-ocr branch February 23, 2026 14:07