📃 fix: preserve raw formatting for .md files on /text endpoint #278
danny-avila merged 2 commits into main
The /text endpoint is called by LibreChat's "Upload as Text" feature, where the user's intent is to send the file contents verbatim to the model. Previously, .md files were loaded with UnstructuredMarkdownLoader, which parses the Markdown and returns plain text with syntax removed (headings, bold, lists, blockquotes all stripped). Two byte-identical uploads - one named .txt, one named .md - would therefore reach the model as very different text. Add a raw_text flag to get_loader / load_file_content, off by default so the embedding path (/embed, used for vector search) continues to use UnstructuredMarkdownLoader for better retrieval quality. Pass raw_text=True only from the /text endpoint, where Markdown files are now routed through TextLoader(autodetect_encoding=True) to preserve formatting byte-for-byte. Reported against LibreChat as danny-avila/LibreChat#12731.
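To make the divergence concrete, here is a minimal sketch of why two byte-identical uploads could reach the model as different text. The function strip_markdown_syntax is a hypothetical, deliberately crude stand-in for a Markdown-parsing loader (it is not how UnstructuredMarkdownLoader works internally), and read_raw stands in for the raw-text path.

```python
import re


def strip_markdown_syntax(text: str) -> str:
    """Hypothetical stand-in for a Markdown-parsing loader.

    Drops a few common syntax markers; the real loader's output may differ.
    """
    lines = []
    for line in text.splitlines():
        # Leading heading, blockquote, or bullet markers.
        line = re.sub(r"^\s{0,3}(#{1,6}|>|[-*+])\s+", "", line)
        # Bold markers around a span of text.
        line = re.sub(r"\*\*(.+?)\*\*", r"\1", line)
        lines.append(line)
    return "\n".join(lines)


def read_raw(text: str) -> str:
    """Stand-in for the raw-text path: content passes through untouched."""
    return text


sample = "# Title\n\n- **bold** item\n> quoted"
parsed = strip_markdown_syntax(sample)  # what a .md upload used to become
raw = read_raw(sample)                  # what a .txt upload always was
```

With the same bytes as input, parsed loses the heading, bullet, bold, and blockquote markers while raw keeps them, which is the mismatch the raw_text flag is meant to remove on the /text path.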
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c501616c34
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    if raw_text and is_markdown:
        return TextLoader(filepath, autodetect_encoding=True), True, file_ext
Restore binary loader precedence with raw_text enabled
The new early return in get_loader runs before all format-specific branches, so on the /text path (raw_text=True) any upload labeled with a markdown MIME type is forced into TextLoader even when the file extension is a known binary type like .pdf or .docx. This is a regression from prior behavior where those extensions were handled by their dedicated loaders first, and it can cause failed or nonsensical extraction when clients send conflicting multipart content types.
Thanks, good catch. Pushed a442f0f to address this:
- Dropped the early return; raw_text is now handled inside the existing markdown elif branch so the earlier PDF/CSV/RST/XML/PPT branches retain their precedence.
- Tightened the markdown elif itself: it still accepts .md by extension, but only accepts a markdown Content-Type when the extension is not a known binary format (new _BINARY_FILE_EXTENSIONS set). This also closes a pre-existing edge where .docx/.xlsx/.epub uploaded with text/markdown would fall through the md branch even without raw_text.
- Added a parameterised regression test covering doc.pdf, report.docx, book.epub, data.xlsx, and slides.pptx, each with Content-Type: text/markdown on the raw_text=True path; each now routes to its dedicated binary loader (22/22 pytest passing).
Addressing review feedback from codex: the previous early-return variant of get_loader could force a binary file (.pdf, .docx, etc.) through TextLoader if the client sent a markdown Content-Type alongside the binary filename. Handle raw_text inside the existing markdown branch instead, and tighten the content-type matcher so markdown MIME types do not override binary extensions. Add a dedicated frozenset _BINARY_FILE_EXTENSIONS covering pdf / doc(x) / xls(x) / ppt(x) / epub. Add tests that cover each binary extension with a conflicting text/markdown content type on the raw_text path.
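The branch ordering described above can be sketched roughly as follows. This is a simplified illustration, not the code in app/utils/document_loader.py: pick_loader_kind is a hypothetical helper that returns a label rather than the (loader, known_type, file_ext) tuple the real get_loader returns, and _MARKDOWN_MIME_TYPES lists only text/markdown because the other MIME types are not named in this thread.

```python
import os

# Extensions listed in the commit message: pdf / doc(x) / xls(x) / ppt(x) / epub.
_BINARY_FILE_EXTENSIONS = frozenset(
    {"pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx", "epub"}
)

# Assumption: the PR covers four markdown MIME types, but only this one is
# named in the thread, so the rest are left out of the sketch.
_MARKDOWN_MIME_TYPES = frozenset({"text/markdown"})


def pick_loader_kind(filename: str, content_type: str, raw_text: bool = False) -> str:
    """Hypothetical, label-returning model of the loader selection order."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()

    # Format-specific branches come first and keep their precedence.
    if ext == "pdf":
        return "pdf"
    # Markdown branch: .md by extension always qualifies; a markdown
    # Content-Type qualifies only when the extension is not a binary format.
    elif ext == "md" or (
        content_type in _MARKDOWN_MIME_TYPES and ext not in _BINARY_FILE_EXTENSIONS
    ):
        # raw_text is decided here, inside the branch, not by an early return.
        return "raw_text" if raw_text else "markdown"
    # Later branches for the remaining binary formats.
    elif ext in _BINARY_FILE_EXTENSIONS:
        return "binary:" + ext
    return "text"
```

Under these assumptions, report.docx sent with Content-Type: text/markdown skips the markdown branch and reaches its binary branch, while notes.md goes to the raw-text loader only when raw_text=True.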
Codex Review: Didn't find any major issues. What shall we delve into next?
@codex review
Codex Review: Didn't find any major issues. What shall we delve into next?
Summary
Paired with LibreChat#12734, closes the server-side half of LibreChat#12731.
The /text endpoint is called by LibreChat's Upload as Text feature, where the user's intent is to send raw file contents to the model. .md files were being loaded with UnstructuredMarkdownLoader, which parses Markdown and returns plain text with syntax removed (#, **, -, > all stripped). Two byte-identical uploads - one .txt, one .md - therefore reached the model as very different text.
Changes
- app/utils/document_loader.py: add raw_text: bool = False parameter to get_loader. When raw_text=True and the file is Markdown (extension .md or any of the four markdown MIME types), return TextLoader(filepath, autodetect_encoding=True) instead of UnstructuredMarkdownLoader. Binary formats (PDF/DOCX/XLSX/PPTX/EPUB) and non-markdown text files are unaffected.
- app/routes/document_routes.py: thread a raw_text flag through load_file_content, and pass raw_text=True only from the /text handler. /embed and /embed-upload keep the default (raw_text=False) so vector search quality is unchanged.
- tests/utils/test_document_loader.py: cover the embed path (still UnstructuredMarkdownLoader), all four Markdown MIME types on the raw-text path, extension-only detection, and a PDF regression to confirm raw_text doesn't disturb binary formats.
Test plan
- pytest tests/utils/test_document_loader.py - 17/17 pass
- /text for .md, upload a markdown file with headings/bold/lists; confirm raw content is returned
- /embed upload with .md still uses UnstructuredMarkdownLoader (unchanged)
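The way the flag is threaded from the route handlers down to loader selection might look roughly like this. All signatures here are assumptions made for illustration: the real load_file_content and route handlers take other arguments, and get_loader_kind is a toy stand-in that returns a class name instead of a loader object.

```python
def get_loader_kind(filename: str, raw_text: bool = False) -> str:
    """Toy stand-in for get_loader: returns a loader name, not a loader."""
    if filename.lower().endswith(".md"):
        # Only the Markdown branch looks at raw_text.
        return "TextLoader" if raw_text else "UnstructuredMarkdownLoader"
    return "TextLoader"


def load_file_content(filename: str, raw_text: bool = False) -> str:
    """Hypothetical signature: forwards the flag unchanged to loader selection."""
    return get_loader_kind(filename, raw_text=raw_text)


def text_handler(filename: str) -> str:
    """Models the /text handler, the only caller that opts in."""
    return load_file_content(filename, raw_text=True)


def embed_handler(filename: str) -> str:
    """Models /embed and /embed-upload, which keep the default."""
    return load_file_content(filename)
```

Keeping the default at False means existing callers need no changes, and only the /text handler has to opt in.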