📄 feat: Local Text Extraction for PDF, DOCX, and XLS/XLSX #11900

Merged
danny-avila merged 9 commits into dev from pr-11519
Feb 22, 2026

@danny-avila danny-avila commented Feb 22, 2026

Summary

Originally #11519

Introduces a new document_parser strategy for extracting text from binary document formats locally, without requiring an external OCR service or a RAG API server. This makes structured document uploads to agent context work out of the box for the most common file types.

  • Adds parseDocument() in packages/api/src/files/documents/crud.ts (TypeScript, replacing the original JS draft) that dispatches to format-specific parsers: pdfjs-dist for PDFs, mammoth for DOCX, and xlsx for XLS/XLSX.
  • Lazily imports all three parsing libraries (pdfjs-dist, mammoth, xlsx) via await import() inside their respective functions, so instances not using document_parser pay zero startup cost.
  • Registers document_parser as a new FileSources enum value and OCRStrategy enum value in packages/data-provider.
  • Wires parseDocument into the file strategy system via documentParserStrategy() in strategies.js.
  • Updates processAgentFileUpload in process.js to automatically route PDF, DOCX, XLS, and XLSX files through document_parser even when no ocr: block is present in librechat.yaml, removing the need for any configuration to get basic document parsing.
  • Changes the explicit OCR strategy fallback from mistral_ocr to document_parser, so deployments that configure ocr: without specifying a strategy: no longer silently require a Mistral API key.
  • Scopes the AgentCapabilities.ocr capability gate to explicitly configured OCR only; the automatic document_parser path bypasses the capability check entirely.
  • Adds unit tests for parseDocument() covering DOCX parsing, XLSX parsing, unsupported type errors, and empty document errors (crud.spec.ts).
  • Adds integration tests for processAgentFileUpload covering all OCR strategy selection branches, including automatic fallback, configured strategy, missing capability, and mixed mime/config scenarios (process.spec.js).
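The routing rules described in the bullets above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the code in process.js: the names `selectRoute`, `RouteInput`, and `DOCUMENT_MIME_TYPES` are hypothetical, the sketch only considers the four document MIME types (a configured OCR service may accept others), and the behavior for "OCR configured but capability missing" is assumed to fall back to the automatic document_parser path.

```typescript
// Hypothetical sketch of the strategy selection described in the summary.
// None of these names come from the PR; they exist only for illustration.
type OcrStrategy = 'document_parser' | 'mistral_ocr';
type Route = OcrStrategy | 'parse_text';

interface RouteInput {
  mimetype: string;
  /** The `ocr:` block from librechat.yaml, or undefined when the key is absent. */
  ocr?: { strategy?: OcrStrategy };
  /** Whether the agents endpoint has the `ocr` capability enabled. */
  hasOcrCapability: boolean;
}

const DOCUMENT_MIME_TYPES = new Set<string>([
  'application/pdf',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document', // DOCX
  'application/vnd.ms-excel', // XLS
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', // XLSX
]);

function selectRoute({ mimetype, ocr, hasOcrCapability }: RouteInput): Route {
  // Non-document types keep falling through to plain text parsing.
  if (!DOCUMENT_MIME_TYPES.has(mimetype)) {
    return 'parse_text';
  }
  // Explicitly configured OCR is gated by the capability check;
  // a missing `strategy:` falls back to document_parser, not mistral_ocr.
  if (ocr !== undefined && hasOcrCapability) {
    return ocr.strategy ?? 'document_parser';
  }
  // Automatic path: no `ocr:` block (or, by assumption, no capability).
  return 'document_parser';
}
```

With no `ocr:` key, a PDF upload resolves to `document_parser`, while an explicit `strategy: mistral_ocr` with the capability enabled still resolves to `mistral_ocr`.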

Change Type

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Testing

Upload a PDF, DOCX, XLSX, and XLS file to an agent using the context tool resource with no ocr: key in librechat.yaml. Each file should be parsed and its text content made available to the agent without any additional configuration. Verify that an explicitly configured ocr: strategy: mistral_ocr still routes through Mistral and is unaffected. Verify that text/plain and other non-document types still fall through to parseText as before. Unit and integration tests can be run directly:

# TS unit tests (crud)
cd packages/api && npx jest src/files/documents/crud.spec.ts

# Integration tests (processAgentFileUpload)
cd api && npx jest server/services/Files/process.spec.js

Test Configuration:

No ocr: key in librechat.yaml. Agents with the context tool resource enabled. Test files: one PDF, one DOCX, one XLSX, one XLS, one .txt.

Checklist

  • My code adheres to this project's style guidelines
  • I have performed a self-review of my own code
  • I have commented in any complex areas of my code
  • My changes do not introduce new warnings
  • I have written tests demonstrating that my changes are effective or that my feature works
  • Local unit tests pass with my changes

dlew and others added 6 commits February 22, 2026 08:59
The document parser uses libraries to parse the text out of known document types.
This lets LibreChat handle some complex document types without having to use a
secondary service (like Mistral or standing up a RAG API server).

To enable the document parser, set the ocr strategy to "document_parser" in
librechat.yaml.

We now support:

- PDFs using pdfjs
- DOCX using mammoth
- XLS/XLSX using SheetJS

(The associated packages were also added to the project.)
- Properly calculate length of text based on UTF8.

- Avoid issues with loading / blocking PDF parsing.
- Introduced support for additional document types in the OCR strategy, including PDF, DOCX, and XLS/XLSX.
- Updated the file upload handling to dynamically select the appropriate parsing strategy based on the file type.
- Refactored the document parsing functions to use asynchronous imports for improved performance and maintainability.
- Introduced a new test suite for the processAgentFileUpload function in process.spec.js.
- Implemented various test cases to validate OCR strategy selection based on file types, including PDF, DOCX, XLSX, and XLS.
- Mocked dependencies to ensure isolated testing of file upload handling and strategy selection logic.
- Enhanced coverage for scenarios involving OCR capability checks and default strategy fallbacks.
Copilot AI review requested due to automatic review settings February 22, 2026 17:36
@danny-avila danny-avila changed the title from "📄 feat: Document Parser: Local Text Extraction for PDF, DOCX, and XLS/XLSX" to "📄 feat: Local Text Extraction for PDF, DOCX, and XLS/XLSX" Feb 22, 2026
@danny-avila danny-avila changed the base branch from main to dev February 22, 2026 17:37

Copilot AI left a comment

Pull request overview

Adds a new local “document_parser” OCR strategy that extracts text from common binary document formats (PDF/DOCX/XLS/XLSX) without external OCR services, and wires it into the agent file upload flow as the default fallback for those MIME types.

Changes:

  • Introduces parseDocument() (PDF via pdfjs-dist, DOCX via mammoth, XLS/XLSX via xlsx) and exports it from @librechat/api.
  • Registers document_parser in FileSources / OCRStrategy and integrates it into the server’s file strategy selector and processAgentFileUpload routing logic.
  • Adds unit/integration tests plus DOCX/XLSX fixtures for the new parser and strategy selection behavior.
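The dispatch-plus-lazy-loading idea from the first bullet can be sketched as follows. This is a minimal illustration, not the contents of crud.ts: `createDispatcher`, `Parser`, and `Loader` are hypothetical names, the error messages are invented, and the loaders are injected so the sketch runs without the real libraries. In the PR, each loader corresponds to an `await import()` of pdfjs-dist, mammoth, or xlsx inside its parsing function.

```typescript
// Hypothetical sketch: dispatch by MIME type, loading each parser on first use.
type Parser = (path: string) => Promise<string>;
type Loader = () => Promise<Parser>;

function createDispatcher(loaders: Map<string, Loader>) {
  // Cache the in-flight or resolved parser so each loader runs at most once,
  // mirroring how the module system caches a dynamic import().
  const cache = new Map<string, Promise<Parser>>();

  return async function parse(mimetype: string, path: string): Promise<string> {
    const loader = loaders.get(mimetype);
    if (!loader) {
      throw new Error(`Unsupported file type: ${mimetype}`);
    }
    let pending = cache.get(mimetype);
    if (!pending) {
      pending = loader(); // nothing is loaded until a file of this type arrives
      cache.set(mimetype, pending);
    }
    const parser = await pending;
    const text = await parser(path);
    if (text.trim().length === 0) {
      throw new Error('No text found in document');
    }
    return text;
  };
}
```

Because loaders only run inside `parse`, a deployment that never receives a PDF never pays for loading the PDF library, which is the "zero startup cost" property the summary describes.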

Reviewed changes

Copilot reviewed 10 out of 14 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| packages/data-provider/src/types/files.ts | Adds FileSources.document_parser. |
| packages/data-provider/src/config.ts | Adds OCRStrategy.DOCUMENT_PARSER. |
| packages/api/src/files/index.ts | Re-exports document parser CRUD entrypoint from @librechat/api. |
| packages/api/src/files/documents/crud.ts | Implements local parsing for PDF/DOCX/XLS(X) and returns MistralOCRUploadResult-shaped output. |
| packages/api/src/files/documents/crud.spec.ts | Adds unit tests for DOCX/XLSX + error conditions. |
| packages/api/src/files/documents/sample.docx | DOCX fixture for parser unit tests. |
| packages/api/src/files/documents/empty.docx | Empty DOCX fixture for “no text found” unit test. |
| packages/api/src/files/documents/sample.xlsx | XLSX fixture for parser unit tests. |
| packages/api/package.json | Adds parser library dependencies (mammoth/pdfjs-dist/xlsx) for the @librechat/api workspace package. |
| api/server/services/Files/strategies.js | Adds document_parser strategy that routes handleFileUpload to parseDocument. |
| api/server/services/Files/process.js | Routes PDF/DOCX/XLS/XLSX through document_parser automatically when OCR isn’t configured; changes default configured OCR fallback to document_parser. |
| api/server/services/Files/process.spec.js | Adds integration-style tests for OCR strategy selection branches in processAgentFileUpload. |
| api/package.json | Adds parser libraries to backend runtime deps. |
| package-lock.json | Updates lockfile for new dependencies. |

Comment thread packages/api/package.json Outdated
Comment on lines +117 to +121
"pdfjs-dist": "^5.4.530",
"rate-limit-redis": "^4.2.0",
"tiktoken": "^1.0.15",
"undici": "^7.18.2",
"xlsx": "*",

Copilot AI Feb 22, 2026

pdfjs-dist and xlsx are declared with inconsistent versions between devDependencies (pdfjs-dist ^5.4.624, xlsx tarball) and peerDependencies (pdfjs-dist ^5.4.530, xlsx "*"). This can lead to peer-dep warnings, duplicate installs, and different behavior between the package build/tests vs consumer runtime. Align the versions across both sections and avoid a wildcard peer dependency for xlsx (pin to the same version/tarball you test against).

Suggested change
- "pdfjs-dist": "^5.4.530",
- "rate-limit-redis": "^4.2.0",
- "tiktoken": "^1.0.15",
- "undici": "^7.18.2",
- "xlsx": "*",
+ "pdfjs-dist": "^5.4.624",
+ "rate-limit-redis": "^4.2.0",
+ "tiktoken": "^1.0.15",
+ "undici": "^7.18.2",
+ "xlsx": "^0.0.0",

Comment thread api/package.json
Comment on lines 103 to 116
@@ -110,6 +112,7 @@
"undici": "^7.18.2",
"winston": "^3.11.0",
"winston-daily-rotate-file": "^5.0.0",
"xlsx": "https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz",
"zod": "^3.22.4"

Copilot AI Feb 22, 2026

pdfjs-dist is pinned here at ^5.4.530, but the workspace lockfile resolves pdfjs-dist to 5.4.624 (pulled in by @librechat/api). Keeping the version spec in sync with the resolved version (and with @librechat/api's peer range) will avoid multiple installs/deduping surprises and makes engine requirements more predictable.
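The kind of drift flagged in the two package.json comments can be caught mechanically. The sketch below is a hypothetical helper (`findMismatchedRanges` is not part of the PR) that compares the range strings declared in devDependencies and peerDependencies. It checks textual equality only, not semver overlap, so treat it as a rough lint rather than a resolver.

```typescript
// Hypothetical helper: list packages whose devDependencies and
// peerDependencies specs differ textually within one manifest.
type DepMap = Record<string, string>;

interface PackageManifest {
  devDependencies?: DepMap;
  peerDependencies?: DepMap;
}

function findMismatchedRanges(pkg: PackageManifest): string[] {
  const dev = pkg.devDependencies ?? {};
  const peer = pkg.peerDependencies ?? {};
  return Object.keys(peer)
    .filter(
      (name) =>
        Object.prototype.hasOwnProperty.call(dev, name) && dev[name] !== peer[name],
    )
    .sort();
}

// Example manifest using the values quoted in the review comments.
const example: PackageManifest = {
  devDependencies: {
    mammoth: '^1.0.0', // placeholder range, not taken from the PR
    'pdfjs-dist': '^5.4.624',
    xlsx: 'https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz',
  },
  peerDependencies: {
    mammoth: '^1.0.0', // placeholder range, not taken from the PR
    'pdfjs-dist': '^5.4.530',
    xlsx: '*',
  },
};

const mismatched = findMismatchedRanges(example);
```

For the example manifest, `mismatched` contains `pdfjs-dist` and `xlsx`, the two packages the review called out.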

Comment on lines +36 to +41
return {
  filename: file.filename,
  bytes: Buffer.byteLength(text, 'utf8'),
  filepath: FileSources.document_parser,
  text,
  images: [],

Copilot AI Feb 22, 2026

parseDocument() returns filename: file.filename, but the existing OCR upload functions return filename: file.originalname (e.g., uploadMistralOCR). Using originalname here would keep MistralOCRUploadResult consistent across strategies and avoids cases where file.filename is unset/meaningless (depends on Multer storage config).
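A minimal sketch of the result construction with the suggested `originalname` field is shown below. The `UploadedFile` and `ParseResult` interfaces and the `buildResult` helper are hypothetical stand-ins (the real types are Express.Multer.File and MistralOCRUploadResult, whose exact shapes are not reproduced here). The sketch uses `TextEncoder` to count UTF-8 bytes so that it runs without Node typings; in Node, `Buffer.byteLength(text, 'utf8')` from the quoted snippet yields the same count.

```typescript
// Hypothetical stand-ins for the Multer file and the upload result.
interface UploadedFile {
  originalname: string; // name of the file on the user's machine
  filename?: string; // only set by some Multer storage engines
}

interface ParseResult {
  filename: string;
  bytes: number;
  filepath: 'document_parser';
  text: string;
  images: string[];
}

function buildResult(file: UploadedFile, text: string): ParseResult {
  return {
    filename: file.originalname,
    // Count UTF-8 bytes, not UTF-16 code units: 'ï' is 1 unit but 2 bytes.
    bytes: new TextEncoder().encode(text).length,
    filepath: 'document_parser',
    text,
    images: [],
  };
}

const sampleResult = buildResult({ originalname: 'notes.docx' }, 'naïve');
```

Here `sampleResult.bytes` is 6 while `'naïve'.length` is 5, which is why the byte count is computed from the encoded text rather than from the string length.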

Comment on lines +53 to +62
let fullText = '';
for (let i = 1; i <= pdf.numPages; i++) {
  const page = await pdf.getPage(i);
  const textContent = await page.getTextContent();
  const pageText = textContent.items
    .filter((item): item is TextItem => !('type' in item))
    .map((item) => item.str)
    .join(' ');
  fullText += pageText + '\n';
}

Copilot AI Feb 22, 2026

pdfToText() builds fullText via += inside the page loop. For multi-page PDFs this can cause repeated reallocations and degraded performance. Collect page strings into an array and join('\n') (and similarly for the Excel aggregation) to keep runtime/memory more predictable on large documents.
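The array-and-join variant suggested here might look like the sketch below. It is an illustration only: `PdfLike`, `PageLike`, and the item interfaces are hypothetical minimal shapes modeled on the `numPages`, `getPage()`, and `getTextContent()` calls in the quoted snippet, so the function can be exercised without pdfjs-dist. Note one behavioral difference: `join('\n')` produces no trailing newline, whereas the quoted loop appends `'\n'` after every page.

```typescript
// Hypothetical minimal shapes mirroring the calls used in the quoted snippet.
interface TextItemLike { str: string }
interface MarkedContentLike { type: string }
interface PageLike {
  getTextContent(): Promise<{ items: Array<TextItemLike | MarkedContentLike> }>;
}
interface PdfLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

async function pdfToText(pdf: PdfLike): Promise<string> {
  const pages: string[] = [];
  // Page numbers are 1-based, as in the quoted loop.
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const { items } = await page.getTextContent();
    pages.push(
      items
        .filter((item): item is TextItemLike => 'str' in item) // skip marked-content entries
        .map((item) => item.str)
        .join(' '),
    );
  }
  // One join at the end instead of repeated string concatenation.
  return pages.join('\n');
}
```

The same pattern applies to the spreadsheet path: push one string per sheet and join once at the end.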

text: 'Sheet One:\nData,on,first,sheet\nSecond Sheet:\nData,On\nSecond,Sheet\n',
});
});


Copilot AI Feb 22, 2026

Unit tests cover DOCX and XLSX parsing, but the new PDF path (application/pdf) and the XLS path (application/vnd.ms-excel) in parseDocument() are currently untested. Adding fixtures/tests for those code paths would prevent regressions (especially since the PDF parsing relies on pdfjs-dist internals and can be sensitive to version/Node changes).

Suggested change
test('parseDocument() parses text from pdf', async () => {
  const file = {
    filename: 'sample.pdf',
    path: path.join(__dirname, 'sample.pdf'),
    mimetype: 'application/pdf',
  } as Express.Multer.File;
  const document = await parseDocument({ file });
  expect(document).toEqual({
    bytes: 28,
    filename: 'sample.pdf',
    filepath: 'document_parser',
    images: [],
    text: 'This is a sample PDF file.\n\n',
  });
});
test('parseDocument() parses text from xls', async () => {
  const file = {
    filename: 'sample.xls',
    path: path.join(__dirname, 'sample.xls'),
    mimetype: 'application/vnd.ms-excel',
  } as Express.Multer.File;
  const document = await parseDocument({ file });
  expect(document).toEqual({
    bytes: 66,
    filename: 'sample.xls',
    filepath: 'document_parser',
    images: [],
    text: 'Sheet One:\nData,on,first,sheet\nSecond Sheet:\nData,On\nSecond,Sheet\n',
  });
});

- Bumped pdfjs-dist dependency to version 5.4.624 in both api and packages/api.
- Refactored document parsing tests to use 'originalname' instead of 'filename' for file objects.
- Added a new test case for parsing XLS files to improve coverage of document types supported by the parser.
- Introduced a sample XLS file for testing purposes.
…ocessAgentFileUpload

- Added a check to ensure extracted text does not exceed the 15MB storage limit, throwing an error if it does.
- Refactored the OCR handling logic to improve fallback behavior when the configured OCR fails, ensuring a more robust document processing flow.
- Enhanced unit tests to cover scenarios for oversized text and fallback mechanisms, ensuring proper error handling and functionality.
- Updated the OCR URL construction to ensure it correctly appends '/ocr' to the base URL if not already present, improving the reliability of the OCR request.
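Two of the changes in this commit message, the size guard and the URL normalization, can be sketched as small helpers. Both are assumptions made for illustration: `assertTextWithinLimit`, `toOcrUrl`, and `MAX_TEXT_BYTES` are hypothetical names, the limit is assumed to mean 15 × 1024 × 1024 bytes of UTF-8 text, and stripping trailing slashes before the `/ocr` check is an extra step not stated in the commit message.

```typescript
// Assumption: "15MB" is interpreted as 15 MiB of UTF-8 encoded text.
const MAX_TEXT_BYTES = 15 * 1024 * 1024;

/** Throws when the extracted text is larger than the storage limit. */
function assertTextWithinLimit(text: string, maxBytes: number = MAX_TEXT_BYTES): void {
  const bytes = new TextEncoder().encode(text).length;
  if (bytes > maxBytes) {
    throw new Error(`Extracted text is ${bytes} bytes, which exceeds the ${maxBytes}-byte limit`);
  }
}

/** Appends '/ocr' to a base URL unless it already ends with it. */
function toOcrUrl(baseURL: string): string {
  const trimmed = baseURL.replace(/\/+$/, ''); // drop trailing slashes (sketch-only step)
  return trimmed.endsWith('/ocr') ? trimmed : `${trimmed}/ocr`;
}
```

For instance, `toOcrUrl('https://api.example.com/v1')` and `toOcrUrl('https://api.example.com/v1/ocr')` both resolve to the same `/v1/ocr` endpoint, so a base URL can be configured either way.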
@danny-avila danny-avila merged commit 7ce898d into dev Feb 22, 2026
6 checks passed
@danny-avila danny-avila deleted the pr-11519 branch February 22, 2026 19:22
3 participants