📃 fix: preserve raw formatting for .md files on /text endpoint #278

Merged
danny-avila merged 2 commits into main from fix/upload-text-preserve-markdown
Apr 19, 2026

@danny-avila
Owner

Summary

Paired with LibreChat#12734, closes the server-side half of LibreChat#12731.

The /text endpoint is called by LibreChat's Upload as Text feature, where the user's intent is to send raw file contents to the model. .md files were being loaded with UnstructuredMarkdownLoader, which parses Markdown and returns plain text with syntax removed (#, **, -, > all stripped). Two byte-identical uploads - one .txt, one .md - therefore reached the model as very different text.
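
To make the difference concrete, here is a rough illustration. The `strip_markdown_syntax` helper is hypothetical: a regex-based stand-in for a Markdown-to-plain-text parser, not UnstructuredMarkdownLoader itself, and it only approximates the kind of stripping described above.

```python
import re


def strip_markdown_syntax(text: str) -> str:
    """Crude stand-in for a Markdown parser (NOT the real loader)."""
    lines = []
    for line in text.splitlines():
        # Drop a leading heading marker, blockquote marker, or list bullet.
        line = re.sub(r"^\s*(#{1,6}|>|[-*+])\s+", "", line)
        # Drop bold markers but keep the text between them.
        line = re.sub(r"\*\*(.+?)\*\*", r"\1", line)
        lines.append(line)
    return "\n".join(lines)


SOURCE = "# Title\n\n**Important:** read this\n\n- first\n- second\n\n> quoted"

as_txt = SOURCE                        # a .txt upload: contents pass through as-is
as_md = strip_markdown_syntax(SOURCE)  # roughly what the parsed .md path yields

print(as_txt)
print("-----")
print(as_md)
```

The same bytes produce two different strings depending only on which path the upload takes.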

Changes

  • app/utils/document_loader.py: add raw_text: bool = False parameter to get_loader. When raw_text=True and the file is Markdown (extension .md or any of the four markdown MIME types), return TextLoader(filepath, autodetect_encoding=True) instead of UnstructuredMarkdownLoader. Binary formats (PDF/DOCX/XLSX/PPTX/EPUB) and non-markdown text files are unaffected.
  • app/routes/document_routes.py: thread a raw_text flag through load_file_content, and pass raw_text=True only from the /text handler. /embed and /embed-upload keep the default (raw_text=False) so vector search quality is unchanged.
  • tests/utils/test_document_loader.py: cover the embed path (still UnstructuredMarkdownLoader), all four Markdown MIME types on the raw-text path, extension-only detection, and a PDF regression to confirm raw_text doesn't disturb binary formats.
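
The flag threading can be sketched as follows. This is a simplified illustration, not the code in this PR: `get_loader` here is a stub that returns a loader name instead of a loader instance, the signatures are assumptions, and `handle_text` / `handle_embed` are hypothetical stand-ins for the route handlers.

```python
def get_loader(filename: str, content_type: str, raw_text: bool = False) -> str:
    """Stub: returns the name of the loader that would be chosen."""
    if filename.lower().endswith(".md"):
        # Only the Markdown branch looks at raw_text.
        return "TextLoader" if raw_text else "UnstructuredMarkdownLoader"
    return "TextLoader"


def load_file_content(filename: str, content_type: str, raw_text: bool = False) -> str:
    """Pass the flag straight through; the default keeps the old behaviour."""
    return get_loader(filename, content_type, raw_text=raw_text)


def handle_text(filename: str, content_type: str) -> str:
    """Stand-in for the /text handler: the only caller that opts in."""
    return load_file_content(filename, content_type, raw_text=True)


def handle_embed(filename: str, content_type: str) -> str:
    """Stand-in for /embed and /embed-upload: relies on the default."""
    return load_file_content(filename, content_type)


print(handle_text("notes.md", "text/markdown"))
print(handle_embed("notes.md", "text/markdown"))
```

Keeping the default at `False` means existing callers need no changes, and only the one call site that wants raw content has to say so.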

Test plan

  • pytest tests/utils/test_document_loader.py - 17/17 pass
  • Integration: with a LibreChat build that still calls /text for .md, upload a markdown file with headings/bold/lists; confirm raw content is returned
  • Integration: /embed upload with .md still uses UnstructuredMarkdownLoader (unchanged)

The /text endpoint is called by LibreChat's "Upload as Text" feature,
where the user's intent is to send the file contents verbatim to the
model. Previously, .md files were loaded with UnstructuredMarkdownLoader,
which parses the Markdown and returns plain text with syntax removed
(headings, bold, lists, blockquotes all stripped). Two byte-identical
uploads - one named .txt, one named .md - would therefore reach the
model as very different text.

Add a raw_text flag to get_loader / load_file_content, off by default
so the embedding path (/embed, used for vector search) continues to use
UnstructuredMarkdownLoader for better retrieval quality. Pass
raw_text=True only from the /text endpoint, where Markdown files are
now routed through TextLoader(autodetect_encoding=True) to preserve
formatting byte-for-byte.

Reported against LibreChat as danny-avila/LibreChat#12731.
@danny-avila
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c501616c34

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread app/utils/document_loader.py Outdated
Comment on lines +95 to +96
if raw_text and is_markdown:
    return TextLoader(filepath, autodetect_encoding=True), True, file_ext

P2: Restore binary loader precedence with raw_text enabled

The new early return in get_loader runs before all format-specific branches, so on the /text path (raw_text=True) any upload labeled with a markdown MIME type is forced into TextLoader even when the file extension is a known binary type like .pdf or .docx. This is a regression from prior behavior where those extensions were handled by their dedicated loaders first, and it can cause failed or nonsensical extraction when clients send conflicting multipart content types.
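
A minimal reproduction of the ordering problem, under simplifying assumptions: the function below returns loader names rather than loader instances, models only the PDF branch of the format-specific chain, and checks a single Markdown MIME type (`text/markdown`). The name `get_loader_early_return` is hypothetical.

```python
import os

# Only text/markdown is checked here; the PR refers to four Markdown MIME types.
_MARKDOWN_MIME_TYPES = frozenset({"text/markdown"})


def get_loader_early_return(filename: str, content_type: str, raw_text: bool = False) -> str:
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    is_markdown = ext == "md" or content_type in _MARKDOWN_MIME_TYPES

    # Early return: evaluated before any format-specific branch.
    if raw_text and is_markdown:
        return "TextLoader"

    if ext == "pdf":
        return "PDF loader"
    if is_markdown:
        return "UnstructuredMarkdownLoader"
    return "TextLoader"


# A PDF uploaded with a conflicting Content-Type is misrouted on the /text path.
print(get_loader_early_return("doc.pdf", "text/markdown", raw_text=True))
# Without raw_text, the PDF branch still wins.
print(get_loader_early_return("doc.pdf", "text/markdown", raw_text=False))
```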

Owner Author

Thanks — good catch. Pushed a442f0f to address this:

  • Dropped the early return; raw_text is now handled inside the existing markdown elif branch so the earlier PDF/CSV/RST/XML/PPT branches retain their precedence.
  • Tightened the markdown elif itself: it still accepts .md by extension, but accepts a markdown Content-Type only when the extension is not a known binary format (new _BINARY_FILE_EXTENSIONS set). This also closes a pre-existing edge case in which .docx / .xlsx / .epub files uploaded with text/markdown were caught by the md branch even without raw_text.
  • Added a parameterised regression test covering doc.pdf, report.docx, book.epub, data.xlsx, and slides.pptx each with Content-Type: text/markdown on the raw_text=True path — each now routes to its dedicated binary loader (22/22 pytest passing).
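
The revised branching can be sketched roughly like this. It is a simplified model, not the merged code: `choose_loader` is a hypothetical name, it returns loader names instead of instances, the extension set is taken from the commit message (pdf / doc(x) / xls(x) / ppt(x) / epub), and only `text/markdown` is checked as a Markdown MIME type.

```python
import os

_BINARY_FILE_EXTENSIONS = frozenset(
    {"pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx", "epub"}
)
# Only text/markdown is checked in this sketch.
_MARKDOWN_MIME_TYPES = frozenset({"text/markdown"})


def choose_loader(filename: str, content_type: str, raw_text: bool = False) -> str:
    ext = os.path.splitext(filename)[1].lower().lstrip(".")

    # Branches that precede the Markdown branch keep their precedence.
    if ext == "pdf":
        return "pdf-loader"

    # A Markdown MIME type counts only if the extension is not a binary one.
    mime_says_markdown = (
        content_type in _MARKDOWN_MIME_TYPES and ext not in _BINARY_FILE_EXTENSIONS
    )
    if ext == "md" or mime_says_markdown:
        # raw_text is consulted inside the Markdown branch, not before it.
        return "TextLoader" if raw_text else "UnstructuredMarkdownLoader"

    # Branches that follow the Markdown branch in the chain.
    if ext in _BINARY_FILE_EXTENSIONS:
        return f"{ext}-loader"
    return "TextLoader"


# The conflicting-Content-Type cases from the regression test.
for name in ("doc.pdf", "report.docx", "book.epub", "data.xlsx", "slides.pptx"):
    print(name, "->", choose_loader(name, "text/markdown", raw_text=True))
```

Because the extension check is part of the Markdown condition itself, a binary file never enters that branch, whether or not `raw_text` is set.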

Addressing review feedback from codex: the previous early-return variant
of get_loader could force a binary file (.pdf, .docx, etc.) through
TextLoader if the client sent a markdown Content-Type alongside the
binary filename.

Handle raw_text inside the existing markdown branch instead, and tighten
the content-type matcher so markdown MIME types do not override binary
extensions. Add a dedicated frozenset _BINARY_FILE_EXTENSIONS covering
pdf / doc(x) / xls(x) / ppt(x) / epub.

Add tests that cover each binary extension with a conflicting
text/markdown content type on the raw_text path.
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?

@danny-avila
Owner Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?

@danny-avila danny-avila changed the title fix: preserve raw formatting for .md files on /text endpoint 📃 fix: preserve raw formatting for .md files on /text endpoint Apr 19, 2026
@danny-avila danny-avila merged commit dcafc08 into main Apr 19, 2026
1 check passed