📄 feat: Local Text Extraction for PDF, DOCX, and XLS/XLSX #11900
danny-avila merged 9 commits into dev
The document parser uses libraries to parse the text out of known document types. This lets LibreChat handle some complex document types without having to use a secondary service (like Mistral or standing up a RAG API server). To enable the document parser, set the ocr strategy to "document_parser" in librechat.yaml.

We now support:
- PDFs using pdfjs
- DOCX using mammoth
- XLS/XLSX using SheetJS

(The associated packages were also added to the project.)
- Properly calculate length of text based on UTF8.
- Avoid issues with loading / blocking PDF parsing.
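
To illustrate why the length is computed from UTF-8 bytes, here is a minimal sketch. The diff excerpt quoted in this review uses `Buffer.byteLength(text, 'utf8')`; the sketch below uses the standard `TextEncoder` as a stand-in so it runs without Node-specific types. `utf8ByteLength` is a hypothetical helper name, not something from the PR.

```typescript
// Hypothetical helper (not from the PR): counts UTF-8 bytes,
// whereas String.prototype.length counts UTF-16 code units.
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

const ascii = 'Data,on,first,sheet';
const accented = 'caf\u00e9'; // "café" with a single precomposed é

// For pure ASCII the two measures agree.
console.log(ascii.length, utf8ByteLength(ascii));

// For non-ASCII text the byte count is larger than the character count.
console.log(accented.length, utf8ByteLength(accented));
```

Any limit expressed in bytes (such as a storage cap) should be checked against the byte count, not `text.length`.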
- Introduced support for additional document types in the OCR strategy, including PDF, DOCX, and XLS/XLSX.
- Updated the file upload handling to dynamically select the appropriate parsing strategy based on the file type.
- Refactored the document parsing functions to use asynchronous imports for improved performance and maintainability.

- Introduced a new test suite for the processAgentFileUpload function in process.spec.js.
- Implemented various test cases to validate OCR strategy selection based on file types, including PDF, DOCX, XLSX, and XLS.
- Mocked dependencies to ensure isolated testing of file upload handling and strategy selection logic.
- Enhanced coverage for scenarios involving OCR capability checks and default strategy fallbacks.
Pull request overview
Adds a new local “document_parser” OCR strategy that extracts text from common binary document formats (PDF/DOCX/XLS/XLSX) without external OCR services, and wires it into the agent file upload flow as the default fallback for those MIME types.
Changes:
- Introduces parseDocument() (PDF via pdfjs-dist, DOCX via mammoth, XLS/XLSX via xlsx) and exports it from @librechat/api.
- Registers document_parser in FileSources/OCRStrategy and integrates it into the server’s file strategy selector and processAgentFileUpload routing logic.
- Adds unit/integration tests plus DOCX/XLSX fixtures for the new parser and strategy selection behavior.
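
The MIME-type dispatch described in the changes above could be sketched roughly as follows. This is not the code in crud.ts: `selectParser`, the `Parser` type, and the error message are hypothetical, and the real parsers (pdfjs-dist, mammoth, xlsx) are replaced by injected functions so the example stays self-contained.

```typescript
// A parser takes a file path and resolves to the extracted text.
type Parser = (path: string) => Promise<string>;

// Standard MIME types for the four supported formats.
const MIME = {
  pdf: 'application/pdf',
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  xls: 'application/vnd.ms-excel',
  xlsx: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
} as const;

interface Parsers {
  pdf: Parser;
  docx: Parser;
  excel: Parser; // one parser handles both XLS and XLSX
}

// Hypothetical helper: picks the format-specific parser for a MIME type
// and rejects anything else. The error wording is an assumption.
function selectParser(mimetype: string, parsers: Parsers): Parser {
  switch (mimetype) {
    case MIME.pdf:
      return parsers.pdf;
    case MIME.docx:
      return parsers.docx;
    case MIME.xls:
    case MIME.xlsx:
      return parsers.excel;
    default:
      throw new Error(`Unsupported file type: ${mimetype}`);
  }
}
```

Injecting the parsers keeps the selection logic testable without loading any of the heavy libraries.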
Reviewed changes
Copilot reviewed 10 out of 14 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| packages/data-provider/src/types/files.ts | Adds FileSources.document_parser. |
| packages/data-provider/src/config.ts | Adds OCRStrategy.DOCUMENT_PARSER. |
| packages/api/src/files/index.ts | Re-exports document parser CRUD entrypoint from @librechat/api. |
| packages/api/src/files/documents/crud.ts | Implements local parsing for PDF/DOCX/XLS(X) and returns MistralOCRUploadResult-shaped output. |
| packages/api/src/files/documents/crud.spec.ts | Adds unit tests for DOCX/XLSX + error conditions. |
| packages/api/src/files/documents/sample.docx | DOCX fixture for parser unit tests. |
| packages/api/src/files/documents/empty.docx | Empty DOCX fixture for “no text found” unit test. |
| packages/api/src/files/documents/sample.xlsx | XLSX fixture for parser unit tests. |
| packages/api/package.json | Adds parser library dependencies (mammoth/pdfjs-dist/xlsx) for the @librechat/api workspace package. |
| api/server/services/Files/strategies.js | Adds document_parser strategy that routes handleFileUpload to parseDocument. |
| api/server/services/Files/process.js | Routes PDF/DOCX/XLS/XLSX through document_parser automatically when OCR isn’t configured; changes default configured OCR fallback to document_parser. |
| api/server/services/Files/process.spec.js | Adds integration-style tests for OCR strategy selection branches in processAgentFileUpload. |
| api/package.json | Adds parser libraries to backend runtime deps. |
| package-lock.json | Updates lockfile for new dependencies. |
| "pdfjs-dist": "^5.4.530", | ||
| "rate-limit-redis": "^4.2.0", | ||
| "tiktoken": "^1.0.15", | ||
| "undici": "^7.18.2", | ||
| "xlsx": "*", |
pdfjs-dist and xlsx are declared with inconsistent versions between devDependencies (pdfjs-dist ^5.4.624, xlsx tarball) and peerDependencies (pdfjs-dist ^5.4.530, xlsx "*"). This can lead to peer-dep warnings, duplicate installs, and different behavior between the package build/tests vs consumer runtime. Align the versions across both sections and avoid a wildcard peer dependency for xlsx (pin to the same version/tarball you test against).
| "pdfjs-dist": "^5.4.530", | |
| "rate-limit-redis": "^4.2.0", | |
| "tiktoken": "^1.0.15", | |
| "undici": "^7.18.2", | |
| "xlsx": "*", | |
| "pdfjs-dist": "^5.4.624", | |
| "rate-limit-redis": "^4.2.0", | |
| "tiktoken": "^1.0.15", | |
| "undici": "^7.18.2", | |
| "xlsx": "^0.0.0", |
| @@ -110,6 +112,7 @@ | |||
| "undici": "^7.18.2", | |||
| "winston": "^3.11.0", | |||
| "winston-daily-rotate-file": "^5.0.0", | |||
| "xlsx": "https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz", | |||
| "zod": "^3.22.4" | |||
pdfjs-dist is pinned here at ^5.4.530, but the workspace lockfile resolves pdfjs-dist to 5.4.624 (pulled in by @librechat/api). Keeping the version spec in sync with the resolved version (and with @librechat/api's peer range) will avoid multiple installs/deduping surprises and makes engine requirements more predictable.
| return { | ||
| filename: file.filename, | ||
| bytes: Buffer.byteLength(text, 'utf8'), | ||
| filepath: FileSources.document_parser, | ||
| text, | ||
| images: [], |
parseDocument() returns filename: file.filename, but the existing OCR upload functions return filename: file.originalname (e.g., uploadMistralOCR). Using originalname here would keep MistralOCRUploadResult consistent across strategies and avoids cases where file.filename is unset/meaningless (depends on Multer storage config).
| let fullText = ''; | ||
| for (let i = 1; i <= pdf.numPages; i++) { | ||
| const page = await pdf.getPage(i); | ||
| const textContent = await page.getTextContent(); | ||
| const pageText = textContent.items | ||
| .filter((item): item is TextItem => !('type' in item)) | ||
| .map((item) => item.str) | ||
| .join(' '); | ||
| fullText += pageText + '\n'; | ||
| } |
pdfToText() builds fullText via += inside the page loop. For multi-page PDFs this can cause repeated reallocations and degraded performance. Collect page strings into an array and join('\n') (and similarly for the Excel aggregation) to keep runtime/memory more predictable on large documents.
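
A minimal sketch of the suggested array-and-join approach follows. It does not call pdfjs-dist; `TextItemLike`, `MarkedContentLike`, and `pagesToText` are hypothetical stand-ins that mirror the shape of the excerpt above, and the input is assumed to be the already-fetched items of each page. Each page still ends with a newline, matching the output of the original loop.

```typescript
// Stand-ins for the pdfjs item shapes used in the excerpt (assumed, simplified).
interface TextItemLike {
  str: string;
}
interface MarkedContentLike {
  type: string;
}
type PageItem = TextItemLike | MarkedContentLike;

// Same filter idea as the excerpt: text items are the ones without a `type` field.
function isTextItem(item: PageItem): item is TextItemLike {
  return !('type' in item);
}

// Collects one string per page, then joins once instead of using `+=` in a loop.
function pagesToText(pages: PageItem[][]): string {
  const pageTexts: string[] = pages.map((items) =>
    items
      .filter(isTextItem)
      .map((item) => item.str)
      .join(' '),
  );
  // Append '\n' to every page so the result matches the loop's trailing newline.
  return pageTexts.map((pageText) => pageText + '\n').join('');
}
```

The same pattern would apply to the per-sheet aggregation for Excel files.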
| text: 'Sheet One:\nData,on,first,sheet\nSecond Sheet:\nData,On\nSecond,Sheet\n', | ||
| }); | ||
| }); | ||
|
|
Unit tests cover DOCX and XLSX parsing, but the new PDF path (application/pdf) and the XLS path (application/vnd.ms-excel) in parseDocument() are currently untested. Adding fixtures/tests for those code paths would prevent regressions (especially since the PDF parsing relies on pdfjs-dist internals and can be sensitive to version/Node changes).
| test('parseDocument() parses text from pdf', async () => { | |
| const file = { | |
| filename: 'sample.pdf', | |
| path: path.join(__dirname, 'sample.pdf'), | |
| mimetype: 'application/pdf', | |
| } as Express.Multer.File; | |
| const document = await parseDocument({ file }); | |
| expect(document).toEqual({ | |
| bytes: 30, | |
| filename: 'sample.pdf', | |
| filepath: 'document_parser', | |
| images: [], | |
| text: 'This is a sample PDF file.\n\n', | |
| }); | |
| }); | |
| test('parseDocument() parses text from xls', async () => { | |
| const file = { | |
| filename: 'sample.xls', | |
| path: path.join(__dirname, 'sample.xls'), | |
| mimetype: 'application/vnd.ms-excel', | |
| } as Express.Multer.File; | |
| const document = await parseDocument({ file }); | |
| expect(document).toEqual({ | |
| bytes: 66, | |
| filename: 'sample.xls', | |
| filepath: 'document_parser', | |
| images: [], | |
| text: 'Sheet One:\nData,on,first,sheet\nSecond Sheet:\nData,On\nSecond,Sheet\n', | |
| }); | |
| }); |
- Bumped pdfjs-dist dependency to version 5.4.624 in both api and packages/api.
- Refactored document parsing tests to use 'originalname' instead of 'filename' for file objects.
- Added a new test case for parsing XLS files to improve coverage of document types supported by the parser.
- Introduced a sample XLS file for testing purposes.
…ocessAgentFileUpload
- Added a check to ensure extracted text does not exceed the 15MB storage limit, throwing an error if it does.
- Refactored the OCR handling logic to improve fallback behavior when the configured OCR fails, ensuring a more robust document processing flow.
- Enhanced unit tests to cover scenarios for oversized text and fallback mechanisms, ensuring proper error handling and functionality.
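
The size check in this commit could look something like the sketch below. It is an assumption-laden illustration, not the PR's code: `assertTextWithinLimit` and `MAX_TEXT_BYTES` are hypothetical names, the limit is assumed to mean 15 × 1024 × 1024 bytes, and the error message is invented for the example.

```typescript
// Assumption: "15MB" is interpreted as binary megabytes.
const MAX_TEXT_BYTES = 15 * 1024 * 1024;

// Hypothetical guard: returns the UTF-8 byte count, or throws when it exceeds the limit.
function assertTextWithinLimit(text: string, maxBytes: number = MAX_TEXT_BYTES): number {
  const bytes = new TextEncoder().encode(text).length;
  if (bytes > maxBytes) {
    throw new Error(`Extracted text is ${bytes} bytes, which exceeds the ${maxBytes}-byte limit`);
  }
  return bytes;
}
```

Passing `maxBytes` as a parameter lets a test exercise the oversized branch without allocating a 15MB string.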
- Updated the OCR URL construction to ensure it correctly appends '/ocr' to the base URL if not already present, improving the reliability of the OCR request.
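
As a rough sketch of that URL handling, under the assumption that trailing slashes should be ignored: `buildOcrUrl` is a hypothetical name and the example.com host is a placeholder, neither taken from the PR.

```typescript
// Hypothetical helper: ensures the URL ends with exactly one '/ocr' segment.
function buildOcrUrl(baseUrl: string): string {
  // Drop trailing slashes so 'https://host/v1/' and 'https://host/v1' behave the same.
  const trimmed = baseUrl.replace(/\/+$/, '');
  return trimmed.endsWith('/ocr') ? trimmed : `${trimmed}/ocr`;
}

console.log(buildOcrUrl('https://api.example.com/v1'));
console.log(buildOcrUrl('https://api.example.com/v1/ocr'));
```

Both calls resolve to the same endpoint, so a base URL configured with or without the suffix works.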
Summary
Originally #11519
Introduces a new document_parser strategy for extracting text from binary document formats locally, without requiring an external OCR service or a RAG API server. This makes structured document uploads to agent context work out of the box for the most common file types.

- Adds parseDocument() in packages/api/src/files/documents/crud.ts (TypeScript, replacing the original JS draft), which dispatches to format-specific parsers: pdfjs-dist for PDFs, mammoth for DOCX, and xlsx for XLS/XLSX.
- Loads the parsing libraries (pdfjs-dist, mammoth, xlsx) via await import() inside their respective functions, so instances not using document_parser pay zero startup cost.
- Registers document_parser as a new FileSources enum value and OCRStrategy enum value in packages/data-provider.
- Wires parseDocument into the file strategy system via documentParserStrategy() in strategies.js.
- Updates processAgentFileUpload in process.js to automatically route PDF, DOCX, XLS, and XLSX files through document_parser even when no ocr: block is present in librechat.yaml, removing the need for any configuration to get basic document parsing.
- Changes the default configured OCR strategy from mistral_ocr to document_parser, so deployments that configure ocr: without specifying a strategy: no longer silently require a Mistral API key.
- Limits the AgentCapabilities.ocr capability gate to explicitly configured OCR only; the automatic document_parser path bypasses the capability check entirely.
- Adds unit tests for parseDocument() covering DOCX parsing, XLSX parsing, unsupported type errors, and empty document errors (crud.spec.ts).
- Adds integration tests for processAgentFileUpload covering all OCR strategy selection branches, including automatic fallback, configured strategy, missing capability, and mixed mime/config scenarios (process.spec.js).

Change Type
Testing
Upload a PDF, DOCX, XLSX, and XLS file to an agent using the context tool resource with no ocr: key in librechat.yaml. Each file should be parsed and its text content made available to the agent without any additional configuration. Verify that an explicitly configured ocr: strategy: mistral_ocr still routes through Mistral and is unaffected. Verify that text/plain and other non-document types still fall through to parseText as before. Unit and integration tests can be run directly:

Test Configuration:
No ocr: key in librechat.yaml. Agents with the context tool resource enabled. Test files: one PDF, one DOCX, one XLSX, one XLS, one .txt.

Checklist