📄 feat: Local Text Extraction for PDF, DOCX, and XLS/XLSX by danny-avila · Pull Request #11900 · danny-avila/LibreChat

danny-avila · 2026-02-22T17:36:39Z

Summary

Originally #11519

Introduces a new document_parser strategy for extracting text from binary document formats locally, without requiring an external OCR service or a RAG API server. This makes structured document uploads to agent context work out of the box for the most common file types.

Adds parseDocument() in packages/api/src/files/documents/crud.ts (TypeScript, replacing the original JS draft) that dispatches to format-specific parsers: pdfjs-dist for PDFs, mammoth for DOCX, and xlsx for XLS/XLSX.
Lazily imports all three parsing libraries (pdfjs-dist, mammoth, xlsx) via await import() inside their respective functions, so instances not using document_parser pay zero startup cost.
Registers document_parser as a new FileSources enum value and OCRStrategy enum value in packages/data-provider.
Wires parseDocument into the file strategy system via documentParserStrategy() in strategies.js.
Updates processAgentFileUpload in process.js to automatically route PDF, DOCX, XLS, and XLSX files through document_parser even when no ocr: block is present in librechat.yaml, removing the need for any configuration to get basic document parsing.
Changes the explicit OCR strategy fallback from mistral_ocr to document_parser, so deployments that configure ocr: without specifying a strategy: no longer silently require a Mistral API key.
Scopes the AgentCapabilities.ocr capability gate to explicitly configured OCR only — the automatic document_parser path bypasses the capability check entirely.
Adds unit tests for parseDocument() covering DOCX parsing, XLSX parsing, unsupported type errors, and empty document errors (crud.spec.ts).
Adds integration tests for processAgentFileUpload covering all OCR strategy selection branches, including automatic fallback, configured strategy, missing capability, and mixed mime/config scenarios (process.spec.js).
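
The routing rules in the list above can be sketched as a small pure function. This is a simplified model, not the code in process.js: the `OcrConfig` shape, its `supportedMimeTypes` field, the `Selection` type, and the `'text'` label for the parseText path are hypothetical names introduced for illustration. The sketch only reports whether the capability gate applies; what happens when the gate fails is not modelled.

```typescript
// Hypothetical model of the strategy selection described in the summary.
interface OcrConfig {
  strategy?: string; // e.g. 'mistral_ocr'; may be omitted in librechat.yaml
  supportedMimeTypes: string[]; // assumption: MIME types the configured OCR accepts
}

interface Selection {
  strategy: string;
  gatedByCapability: boolean; // true only for explicitly configured OCR
}

// MIME types for PDF, DOCX, XLS, and XLSX.
const DOCUMENT_PARSER_MIME_TYPES = new Set<string>([
  'application/pdf',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'application/vnd.ms-excel',
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
]);

function selectStrategy(mimetype: string, ocr?: OcrConfig): Selection {
  // Explicitly configured OCR: falls back to document_parser (no longer
  // mistral_ocr) when no strategy is given, and is subject to the capability gate.
  if (ocr !== undefined && ocr.supportedMimeTypes.includes(mimetype)) {
    return { strategy: ocr.strategy ?? 'document_parser', gatedByCapability: true };
  }
  // Automatic path: document types are parsed locally without any ocr: block
  // and without consulting the capability check.
  if (DOCUMENT_PARSER_MIME_TYPES.has(mimetype)) {
    return { strategy: 'document_parser', gatedByCapability: false };
  }
  // Everything else keeps falling through to plain text parsing.
  return { strategy: 'text', gatedByCapability: false };
}
```

For example, `selectStrategy('application/pdf')` selects `document_parser` with no gate, while the same call with `{ strategy: 'mistral_ocr', supportedMimeTypes: ['application/pdf'] }` selects `mistral_ocr` and marks it as gated.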

Change Type

New feature (non-breaking change which adds functionality)
This change requires a documentation update

Testing

Upload a PDF, DOCX, XLSX, and XLS file to an agent using the context tool resource with no ocr: key in librechat.yaml. Each file should be parsed and its text content made available to the agent without any additional configuration. Verify that an explicitly configured ocr: strategy: mistral_ocr still routes through Mistral and is unaffected. Verify that text/plain and other non-document types still fall through to parseText as before. Unit and integration tests can be run directly:

# TS unit tests (crud)
cd packages/api && npx jest src/files/documents/crud.spec.ts

# Integration tests (processAgentFileUpload)
cd api && npx jest server/services/Files/process.spec.js

Test Configuration:

No ocr: key in librechat.yaml. Agents with the context tool resource enabled. Test files: one PDF, one DOCX, one XLSX, one XLS, one .txt.

Checklist

My code adheres to this project's style guidelines
I have performed a self-review of my own code
I have commented in any complex areas of my code
My changes do not introduce new warnings
I have written tests demonstrating that my changes are effective or that my feature works
Local unit tests pass with my changes

The document parser uses libraries to parse the text out of known document types. This lets LibreChat handle some complex document types without having to use a secondary service (like Mistral or standing up a RAG API server).

To enable the document parser, set the ocr strategy to "document_parser" in librechat.yaml.

We now support:

- PDFs using pdfjs
- DOCX using mammoth
- XLS/XLSX using SheetJS

(The associated packages were also added to the project.)

- Properly calculate length of text based on UTF8.
- Avoid issues with loading / blocking PDF parsing.

- Introduced support for additional document types in the OCR strategy, including PDF, DOCX, and XLS/XLSX.
- Updated the file upload handling to dynamically select the appropriate parsing strategy based on the file type.
- Refactored the document parsing functions to use asynchronous imports for improved performance and maintainability.

- Introduced a new test suite for the processAgentFileUpload function in process.spec.js.
- Implemented various test cases to validate OCR strategy selection based on file types, including PDF, DOCX, XLSX, and XLS.
- Mocked dependencies to ensure isolated testing of file upload handling and strategy selection logic.
- Enhanced coverage for scenarios involving OCR capability checks and default strategy fallbacks.

Copilot

Pull request overview

Adds a new local “document_parser” OCR strategy that extracts text from common binary document formats (PDF/DOCX/XLS/XLSX) without external OCR services, and wires it into the agent file upload flow as the default fallback for those MIME types.

Changes:

Introduces parseDocument() (PDF via pdfjs-dist, DOCX via mammoth, XLS/XLSX via xlsx) and exports it from @librechat/api.
Registers document_parser in FileSources / OCRStrategy and integrates it into the server’s file strategy selector and processAgentFileUpload routing logic.
Adds unit/integration tests plus DOCX/XLSX fixtures for the new parser and strategy selection behavior.
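
A minimal sketch of the dispatch-with-lazy-loading idea described in these changes. It is not the code in crud.ts: `createParseDocument`, `ParserLoader`, and the error messages are hypothetical, and the loaders are injected so the example runs without the real libraries. In the PR, each loader's role is played by an `await import()` of pdfjs-dist, mammoth, or xlsx inside the format-specific function.

```typescript
type Parser = (filePath: string) => Promise<string>;
type ParserLoader = () => Promise<Parser>; // stands in for `await import(...)`

interface UploadedDocument {
  path: string;
  mimetype: string;
}

function createParseDocument(loaders: Map<string, ParserLoader>) {
  return async function parseDocument(file: UploadedDocument): Promise<string> {
    const load = loaders.get(file.mimetype);
    if (load === undefined) {
      throw new Error(`Unsupported file type: ${file.mimetype}`);
    }
    // The parser is loaded only when a file of this type is actually seen,
    // so unused formats add nothing to startup.
    const parse = await load();
    const text = await parse(file.path);
    if (text.trim().length === 0) {
      throw new Error('No text found in document');
    }
    return text;
  };
}

// Usage with a stub loader that counts how often it is invoked.
let pdfLoads = 0;
const parseDocument = createParseDocument(
  new Map<string, ParserLoader>([
    [
      'application/pdf',
      async () => {
        pdfLoads += 1;
        return async (filePath: string) => `text of ${filePath}`;
      },
    ],
  ]),
);
```

Unlike a real dynamic `import()`, which the module system caches, the stub loader here runs on every call; a production version would not need extra caching for that reason.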

Reviewed changes

Copilot reviewed 10 out of 14 changed files in this pull request and generated 5 comments.

File	Description
packages/data-provider/src/types/files.ts	Adds `FileSources.document_parser`.
packages/data-provider/src/config.ts	Adds `OCRStrategy.DOCUMENT_PARSER`.
packages/api/src/files/index.ts	Re-exports document parser CRUD entrypoint from `@librechat/api`.
packages/api/src/files/documents/crud.ts	Implements local parsing for PDF/DOCX/XLS(X) and returns `MistralOCRUploadResult`-shaped output.
packages/api/src/files/documents/crud.spec.ts	Adds unit tests for DOCX/XLSX + error conditions.
packages/api/src/files/documents/sample.docx	DOCX fixture for parser unit tests.
packages/api/src/files/documents/empty.docx	Empty DOCX fixture for “no text found” unit test.
packages/api/src/files/documents/sample.xlsx	XLSX fixture for parser unit tests.
packages/api/package.json	Adds parser library dependencies (mammoth/pdfjs-dist/xlsx) for the `@librechat/api` workspace package.
api/server/services/Files/strategies.js	Adds `document_parser` strategy that routes `handleFileUpload` to `parseDocument`.
api/server/services/Files/process.js	Routes PDF/DOCX/XLS/XLSX through `document_parser` automatically when OCR isn’t configured; changes default configured OCR fallback to `document_parser`.
api/server/services/Files/process.spec.js	Adds integration-style tests for OCR strategy selection branches in `processAgentFileUpload`.
api/package.json	Adds parser libraries to backend runtime deps.
package-lock.json	Updates lockfile for new dependencies.

Copilot · 2026-02-22T17:43:00Z

+    "pdfjs-dist": "^5.4.530",
    "rate-limit-redis": "^4.2.0",
    "tiktoken": "^1.0.15",
    "undici": "^7.18.2",
+    "xlsx": "*",


pdfjs-dist and xlsx are declared with inconsistent versions between devDependencies (pdfjs-dist ^5.4.624, xlsx tarball) and peerDependencies (pdfjs-dist ^5.4.530, xlsx "*"). This can lead to peer-dep warnings, duplicate installs, and different behavior between the package build/tests vs consumer runtime. Align the versions across both sections and avoid a wildcard peer dependency for xlsx (pin to the same version/tarball you test against).

Suggested change

-    "pdfjs-dist": "^5.4.530",
+    "pdfjs-dist": "^5.4.624",
     "rate-limit-redis": "^4.2.0",
     "tiktoken": "^1.0.15",
     "undici": "^7.18.2",
-    "xlsx": "*",
+    "xlsx": "^0.0.0",

Copilot · 2026-02-22T17:43:01Z

@@ -110,6 +112,7 @@
    "undici": "^7.18.2",
    "winston": "^3.11.0",
    "winston-daily-rotate-file": "^5.0.0",
+    "xlsx": "https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz",
    "zod": "^3.22.4"


pdfjs-dist is pinned here at ^5.4.530, but the workspace lockfile resolves pdfjs-dist to 5.4.624 (pulled in by @librechat/api). Keeping the version spec in sync with the resolved version (and with @librechat/api's peer range) will avoid multiple installs/deduping surprises and makes engine requirements more predictable.

Copilot · 2026-02-22T17:43:01Z

+  return {
+    filename: file.filename,
+    bytes: Buffer.byteLength(text, 'utf8'),
+    filepath: FileSources.document_parser,
+    text,
+    images: [],


parseDocument() returns filename: file.filename, but the existing OCR upload functions return filename: file.originalname (e.g., uploadMistralOCR). Using originalname here would keep MistralOCRUploadResult consistent across strategies and avoids cases where file.filename is unset/meaningless (depends on Multer storage config).
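
A small sketch of the result shape with the suggestion applied, assuming a cut-down `UploadedFile` type in place of `Express.Multer.File` and a hypothetical `buildResult` helper. The PR computes the size with `Buffer.byteLength(text, 'utf8')`; the sketch uses `TextEncoder`, which yields the same UTF-8 byte count and also runs outside Node.

```typescript
// Cut-down stand-in for Express.Multer.File (assumption for illustration).
interface UploadedFile {
  originalname: string; // name supplied by the client
  filename?: string; // only set by some Multer storage engines
}

interface ParseResult {
  filename: string;
  bytes: number;
  filepath: string;
  text: string;
  images: string[];
}

function buildResult(file: UploadedFile, text: string): ParseResult {
  return {
    // originalname is always populated by Multer, unlike filename.
    filename: file.originalname,
    // Byte length, not text.length: multi-byte characters count fully.
    bytes: new TextEncoder().encode(text).length,
    filepath: 'document_parser',
    text,
    images: [],
  };
}
```

The distinction between bytes and characters matters for non-ASCII text: a string containing one two-byte character has a `bytes` value one larger than its `length`.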

Copilot · 2026-02-22T17:43:01Z

+  let fullText = '';
+  for (let i = 1; i <= pdf.numPages; i++) {
+    const page = await pdf.getPage(i);
+    const textContent = await page.getTextContent();
+    const pageText = textContent.items
+      .filter((item): item is TextItem => !('type' in item))
+      .map((item) => item.str)
+      .join(' ');
+    fullText += pageText + '\n';
+  }


pdfToText() builds fullText via += inside the page loop. For multi-page PDFs this can cause repeated reallocations and degraded performance. Collect page strings into an array and join('\n') (and similarly for the Excel aggregation) to keep runtime/memory more predictable on large documents.
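
One way the array-and-join approach could look. The `TextItemLike` and `MarkedContentLike` types are local stand-ins for the pdfjs-dist item types, and `pagesToText` is a hypothetical helper operating on already-extracted items rather than on a real PDF. The sketch keeps the trailing newline after every page so the output matches the original loop.

```typescript
// Local stand-ins for the two kinds of items a page's text content may hold.
interface TextItemLike {
  str: string;
}
interface MarkedContentLike {
  type: string;
}
type ContentItem = TextItemLike | MarkedContentLike;

// Same filter/map/join as the reviewed snippet: keep only items carrying text.
function pageToText(items: ContentItem[]): string {
  return items
    .filter((item): item is TextItemLike => !('type' in item))
    .map((item) => item.str)
    .join(' ');
}

// Collect one string per page and join once, instead of repeated `+=`.
function pagesToText(pages: ContentItem[][]): string {
  const parts: string[] = [];
  for (const items of pages) {
    parts.push(pageToText(items) + '\n');
  }
  return parts.join('');
}
```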

Copilot · 2026-02-22T17:43:01Z

+      text: 'Sheet One:\nData,on,first,sheet\nSecond Sheet:\nData,On\nSecond,Sheet\n',
+    });
+  });
+


Unit tests cover DOCX and XLSX parsing, but the new PDF path (application/pdf) and the XLS path (application/vnd.ms-excel) in parseDocument() are currently untested. Adding fixtures/tests for those code paths would prevent regressions (especially since the PDF parsing relies on pdfjs-dist internals and can be sensitive to version/Node changes).

Suggested change

+  test('parseDocument() parses text from pdf', async () => {
+    const file = {
+      filename: 'sample.pdf',
+      path: path.join(__dirname, 'sample.pdf'),
+      mimetype: 'application/pdf',
+    } as Express.Multer.File;
+    const document = await parseDocument({ file });
+    expect(document).toEqual({
+      bytes: 30,
+      filename: 'sample.pdf',
+      filepath: 'document_parser',
+      images: [],
+      text: 'This is a sample PDF file.\n\n',
+    });
+  });
+  test('parseDocument() parses text from xls', async () => {
+    const file = {
+      filename: 'sample.xls',
+      path: path.join(__dirname, 'sample.xls'),
+      mimetype: 'application/vnd.ms-excel',
+    } as Express.Multer.File;
+    const document = await parseDocument({ file });
+    expect(document).toEqual({
+      bytes: 66,
+      filename: 'sample.xls',
+      filepath: 'document_parser',
+      images: [],
+      text: 'Sheet One:\nData,on,first,sheet\nSecond Sheet:\nData,On\nSecond,Sheet\n',
+    });
+  });
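
The expected strings in these tests suggest that each sheet is emitted as its name, a colon, and then its rows as comma-separated lines. The sketch below reproduces that layout from in-memory data to show where the `bytes: 66` figure comes from. `Sheet` and `sheetsToText` are hypothetical; the real parser reads workbooks through the xlsx library, and this version does no CSV quoting or escaping.

```typescript
interface Sheet {
  name: string;
  rows: string[][];
}

// Emit "<name>:\n<csv rows>\n" per sheet and concatenate.
function sheetsToText(sheets: Sheet[]): string {
  return sheets
    .map((sheet) => {
      const csv = sheet.rows.map((row) => row.join(',')).join('\n');
      return `${sheet.name}:\n${csv}\n`;
    })
    .join('');
}

const sampleText = sheetsToText([
  { name: 'Sheet One', rows: [['Data', 'on', 'first', 'sheet']] },
  { name: 'Second Sheet', rows: [['Data', 'On'], ['Second', 'Sheet']] },
]);

// UTF-8 size of the extracted text; all characters here are single-byte.
const sampleBytes = new TextEncoder().encode(sampleText).length;
```

With the fixture contents shown in the tests, `sampleText` equals the expected `text` value and `sampleBytes` is 66.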

- Bumped pdfjs-dist dependency to version 5.4.624 in both api and packages/api.
- Refactored document parsing tests to use 'originalname' instead of 'filename' for file objects.
- Added a new test case for parsing XLS files to improve coverage of document types supported by the parser.
- Introduced a sample XLS file for testing purposes.

…ocessAgentFileUpload

- Added a check to ensure extracted text does not exceed the 15MB storage limit, throwing an error if it does.
- Refactored the OCR handling logic to improve fallback behavior when the configured OCR fails, ensuring a more robust document processing flow.
- Enhanced unit tests to cover scenarios for oversized text and fallback mechanisms, ensuring proper error handling and functionality.
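
A rough sketch of the size check mentioned in this commit. The commit message only states a 15MB limit; interpreting it as 15 * 1024 * 1024 UTF-8 bytes, as well as the names `MAX_TEXT_BYTES` and `assertTextWithinLimit` and the error wording, are assumptions made for illustration.

```typescript
// Assumption: the 15MB limit is measured in UTF-8 bytes of the extracted text.
const MAX_TEXT_BYTES = 15 * 1024 * 1024;

function assertTextWithinLimit(text: string, limit: number = MAX_TEXT_BYTES): void {
  const bytes = new TextEncoder().encode(text).length;
  if (bytes > limit) {
    throw new Error(`Extracted text is ${bytes} bytes, which exceeds the ${limit}-byte limit`);
  }
}
```

Passing the limit as a parameter keeps the check easy to exercise in tests without allocating a 15MB string.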

- Updated the OCR URL construction to ensure it correctly appends '/ocr' to the base URL if not already present, improving the reliability of the OCR request.
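
A minimal sketch of the URL handling this commit describes, assuming a hypothetical `buildOcrUrl` helper; the actual change lives in performOCR and may differ. Stripping trailing slashes before the check is an extra assumption, and the example.com base URL is a placeholder.

```typescript
// Append '/ocr' to the base URL unless it already ends with it.
function buildOcrUrl(baseURL: string): string {
  const trimmed = baseURL.replace(/\/+$/, ''); // assumption: ignore trailing slashes
  return trimmed.endsWith('/ocr') ? trimmed : `${trimmed}/ocr`;
}

// Placeholder base URLs, not real endpoints.
const fromBase = buildOcrUrl('https://api.example.com/v1');
const alreadyComplete = buildOcrUrl('https://api.example.com/v1/ocr');
```

Both calls produce `https://api.example.com/v1/ocr`, so a base URL configured either way results in the same request target.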

dlew and others added 6 commits February 22, 2026 08:59

fix: applied Copilot code review suggestions

6224291

- Properly calculate length of text based on UTF8.
- Avoid issues with loading / blocking PDF parsing.

fix: improved docs on parseDocument()

0a504a8

chore: move to packages/api for TS support

cab5826

Copilot AI review requested due to automatic review settings February 22, 2026 17:36

Copilot started reviewing on behalf of danny-avila February 22, 2026 17:37

danny-avila changed the title ~~📄 feat: Document Parser: Local Text Extraction for PDF, DOCX, and XLS/XLSX~~ 📄 feat: Local Text Extraction for PDF, DOCX, and XLS/XLSX Feb 22, 2026

danny-avila changed the base branch from main to dev February 22, 2026 17:37

Copilot AI reviewed Feb 22, 2026
danny-avila mentioned this pull request Feb 22, 2026

feat: Added "document parser" OCR strategy #11519

Closed

10 tasks

danny-avila added 3 commits February 22, 2026 13:13

fix: correct OCR URL construction in performOCR function

9c6d2c4

- Updated the OCR URL construction to ensure it correctly appends '/ocr' to the base URL if not already present, improving the reliability of the OCR request.

danny-avila merged commit 7ce898d into dev Feb 22, 2026
6 checks passed

danny-avila deleted the pr-11519 branch February 22, 2026 19:22

danny-avila merged 9 commits into dev from pr-11519

3 participants
