feat: Added "document parser" OCR strategy by dlew · Pull Request #11519 · danny-avila/LibreChat

dlew · 2026-01-26T17:17:16Z

Pull Request Template

Summary

The document parser uses libraries to parse the text out of known document types. This lets LibreChat handle some complex document types without having to use a secondary service (like Mistral or standing up a RAG API server).

To enable the document parser, set the ocr strategy to "document_parser" in librechat.yaml.

We now support:

- PDFs using pdfjs
- DOCX using mammoth
- XLS/XLSX using SheetJS

(The associated packages were also added to the project.)
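
As a rough illustration of the approach described above, the sketch below picks a parser by MIME type. It is not the PR's code: `pickParser`, `extractText`, `MIME_TYPES`, and the stub parsers are hypothetical names, and the stubs stand in for calls into pdfjs, mammoth, and SheetJS so the example runs without those packages.

```javascript
// Sketch only: all names here are hypothetical, and the parsers are stubs so
// the example runs without pdfjs, mammoth, or SheetJS installed.
const MIME_TYPES = {
  pdf: 'application/pdf',
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  xls: 'application/vnd.ms-excel',
  xlsx: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
};

/** Returns the parser registered for a MIME type, or throws if none exists. */
function pickParser(parsers, mimetype) {
  const parser = parsers.get(mimetype);
  if (typeof parser !== 'function') {
    throw new Error(`Unsupported document type: ${mimetype}`);
  }
  return parser;
}

/** Extracts text from an uploaded file using the matching parser. */
async function extractText(parsers, file) {
  const parser = pickParser(parsers, file.mimetype);
  return parser(file);
}

// Stand-in parsers; real ones would delegate to the parsing libraries.
const stubParsers = new Map([
  [MIME_TYPES.pdf, async (file) => `pdf text from ${file.path}`],
  [MIME_TYPES.docx, async (file) => `docx text from ${file.path}`],
  [MIME_TYPES.xls, async (file) => `sheet text from ${file.path}`],
  [MIME_TYPES.xlsx, async (file) => `sheet text from ${file.path}`],
]);
```

Keeping the lookup in a `Map` means an unsupported type fails fast with a clear error instead of silently returning empty text.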

A companion documentation PR is open as well: LibreChat-AI/librechat.ai#492.

Change Type

Please delete any irrelevant options.

New feature (non-breaking change which adds functionality)
This change requires a documentation update (LibreChat-AI/librechat.ai#492)

Testing

I have added automated tests for most cases (the exception being PDFs, as getting Jest to work with ECMAScript modules would be a big lift just for this one small PR).

I also manually tested uploading PDFs, Word documents, and Excel sheets to LibreChat as text, to make sure they are parsed out.

Test Configuration:

Enable ocr agent capability in librechat.yaml:
```
agents:
  capabilities:
    - "ocr"
```
Set ocr strategy to document_parser:
```
ocr:
  strategy: "document_parser"
```

Checklist

Please delete any irrelevant options.

My code adheres to this project's style guidelines
I have performed a self-review of my own code
I have commented in any complex areas of my code
I have made pertinent documentation changes
My changes do not introduce new warnings
I have written tests demonstrating that my changes are effective or that my feature works
Local unit tests pass with my changes
A pull request for updating the documentation has been submitted (LibreChat-AI/librechat.ai#492).

Copilot

Pull request overview

Adds a new "document_parser" OCR strategy that extracts text directly from common document formats (PDF/DOCX/XLS/XLSX) using local parsing libraries, avoiding external OCR/RAG services for these cases.

Changes:

Introduces a new document_parser OCR strategy and wires it into the file strategy selector.
Implements document parsing for PDF (pdfjs), DOCX (mammoth), and XLS/XLSX (SheetJS/xlsx).
Adds dependencies plus Jest coverage for DOCX/XLSX parsing (with fixtures).

Reviewed changes

Copilot reviewed 6 out of 10 changed files in this pull request and generated 6 comments.


| File | Description |
| --- | --- |
| packages/data-provider/src/types/files.ts | Adds `document_parser` to `FileSources` so it can be referenced across packages. |
| packages/data-provider/src/config.ts | Adds `DOCUMENT_PARSER` to `OCRStrategy` and allows it via the config schema. |
| api/server/services/Files/strategies.js | Registers the new `document_parser` strategy and routes it to `parseDocument`. |
| api/server/services/Files/Documents/crud.js | Implements document parsing logic for PDF/DOCX/XLS/XLSX and returns extracted text payload. |
| api/server/services/Files/Documents/tests/documents.spec.js | Adds unit tests for DOCX/XLSX parsing and error cases. |
| api/server/services/Files/Documents/tests/sample.xlsx | XLSX fixture for tests. |
| api/server/services/Files/Documents/tests/sample.docx | DOCX fixture for tests. |
| api/server/services/Files/Documents/tests/empty.docx | Empty DOCX fixture for tests. |
| api/package.json | Adds new runtime deps: `mammoth`, `pdfjs-dist`, `xlsx`. |
| package-lock.json | Locks newly added dependencies and transitive tree. |


Copilot · 2026-02-12T03:22:03Z

+  switch (file.mimetype) {
+    case 'application/pdf':
+      text = await pdfToText(file);
+      break;


The PDF branch (case 'application/pdf') introduces new behavior but isn't covered by tests (only docx/xlsx are covered). Even a minimal test (e.g., a tiny fixture PDF or mocking pdfjs) would help prevent regressions in the PDF path.

dlew · 2026-02-12 (reply)

I tried hard to make this happen. However, because pdfjs ships as ES modules, we'd need a larger refactor to support ES modules in LibreChat's Jest tests, and that didn't seem worth it for now.

A minimal test would only check that we call pdfToText() with the right mimetype, which didn't seem worth writing: it would add little benefit while giving the illusion that this path is tested.

dlew · 2026-02-12T15:45:09Z

I've addressed & responded to all the Copilot feedback.


danny-avila · 2026-02-22T18:08:26Z

I wanted a few more changes before merging, continued here: #11900

I don't have access to push changes to your fork so opened a new PR with your original commits.

dlew · 2026-02-23T14:07:01Z

Sounds good - do you also want to merge the docs update? LibreChat-AI/librechat.ai#492

dlew mentioned this pull request Jan 26, 2026

Added docs for OCR document parser strategy LibreChat-AI/librechat.ai#492

Open

dlew force-pushed the dlew/document-parser-ocr branch 4 times, most recently from f74940f to 6cfa3db Compare February 5, 2026 15:07

dlew force-pushed the dlew/document-parser-ocr branch 4 times, most recently from d1e671d to 2c5c352 Compare February 11, 2026 16:38

dlew mentioned this pull request Feb 11, 2026

Added document parser OCR strategy & enable it newjersey/LibreChat#107

Merged

danny-avila requested a review from Copilot February 12, 2026 03:12

Copilot started reviewing on behalf of danny-avila February 12, 2026 03:13

Copilot AI reviewed Feb 12, 2026


dlew force-pushed the dlew/document-parser-ocr branch from 2c5c352 to 500b9eb Compare February 12, 2026 15:39

dlew force-pushed the dlew/document-parser-ocr branch 2 times, most recently from ebe5b77 to 3049969 Compare February 18, 2026 15:13

dlew added 3 commits February 19, 2026 08:34

fix: applied Copilot code review suggestions

a4dce5b

- Properly calculate length of text based on UTF-8.
- Avoid issues with loading / blocking PDF parsing.

fix: improved docs on parseDocument()

ee6f2b2
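
The UTF-8 length fix mentioned in commit a4dce5b can be illustrated with a small sketch. This is not the PR's code, and `utf8ByteLength` is a hypothetical helper name. It only shows the general difference between a string's `length` (UTF-16 code units) and its encoded size in bytes, which presumably is what the fix addresses.

```javascript
// Sketch only: utf8ByteLength is a hypothetical helper name, not from the PR.
function utf8ByteLength(text) {
  // Buffer.byteLength counts encoded bytes, whereas String.prototype.length
  // counts UTF-16 code units, so the two differ for non-ASCII text.
  return Buffer.byteLength(text, 'utf8');
}

const sample = 'caf\u00e9'; // "café", with é as a single code point
const codeUnits = sample.length; // 4
const bytes = utf8ByteLength(sample); // 5, because é takes two bytes in UTF-8
console.log({ codeUnits, bytes });
```

For documents with accented or CJK text, the gap between the two counts grows quickly, so any size reported in bytes should come from the encoded length.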

dlew force-pushed the dlew/document-parser-ocr branch from 3049969 to ee6f2b2 Compare February 19, 2026 14:34

danny-avila mentioned this pull request Feb 22, 2026

📄 feat: Local Text Extraction for PDF, DOCX, and XLS/XLSX #11900

Merged

8 tasks

danny-avila closed this Feb 22, 2026

dlew deleted the dlew/document-parser-ocr branch February 23, 2026 14:07

dlew wants to merge 3 commits into danny-avila:dev from newjersey:dlew/document-parser-ocr
