📃 fix: preserve raw formatting for .md files on /text endpoint by danny-avila · Pull Request #278 · danny-avila/rag_api

danny-avila · 2026-04-18T14:23:58Z

Summary

Paired with LibreChat#12734, closes the server-side half of LibreChat#12731.

The /text endpoint is called by LibreChat's Upload as Text feature, where the user's intent is to send raw file contents to the model. .md files were being loaded with UnstructuredMarkdownLoader, which parses Markdown and returns plain text with syntax removed (#, **, -, > all stripped). Two byte-identical uploads - one .txt, one .md - therefore reached the model as very different text.
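To make the difference concrete, here is a toy illustration. `strip_markdown_syntax` is a hypothetical helper written for this example; it is not how UnstructuredMarkdownLoader works internally and only mimics the effect described above (heading, bold, list, and blockquote markers disappearing) on a small input.

```python
import re


def strip_markdown_syntax(text: str) -> str:
    """Toy stand-in for a Markdown-parsing loader (hypothetical, illustration only)."""
    cleaned = []
    for line in text.splitlines():
        # Drop a leading heading, blockquote, or bullet marker followed by whitespace.
        line = re.sub(r"^\s*(#{1,6}|>|[-*+])\s+", "", line)
        # Unwrap **bold** spans.
        line = re.sub(r"\*\*(.+?)\*\*", r"\1", line)
        cleaned.append(line)
    return "\n".join(cleaned)


source = "# Title\n\n**bold** text\n- item\n> quote"

as_txt = source                         # what a raw text load would hand to the model
as_md = strip_markdown_syntax(source)   # roughly what a parsing load would hand over

print(repr(as_txt))
print(repr(as_md))
```

The same bytes produce two different strings depending only on which loader handled them, which is the mismatch this PR removes for the /text path.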

Changes

app/utils/document_loader.py: add raw_text: bool = False parameter to get_loader. When raw_text=True and the file is Markdown (extension .md or any of the four markdown MIME types), return TextLoader(filepath, autodetect_encoding=True) instead of UnstructuredMarkdownLoader. Binary formats (PDF/DOCX/XLSX/PPTX/EPUB) and non-markdown text files are unaffected.
app/routes/document_routes.py: thread a raw_text flag through load_file_content, and pass raw_text=True only from the /text handler. /embed and /embed-upload keep the default (raw_text=False) so vector search quality is unchanged.
tests/utils/test_document_loader.py: cover the embed path (still UnstructuredMarkdownLoader), all four Markdown MIME types on the raw-text path, extension-only detection, and a PDF regression to confirm raw_text doesn't disturb binary formats.
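The flag threading described in the list above can be sketched as follows. This is a simplified model under stated assumptions, not the code in app/utils/document_loader.py or app/routes/document_routes.py: loaders are represented by their names as strings, the handler functions are plain Python stand-ins for the real route handlers, the argument lists are guesses, and the four MIME types listed are an assumption since the PR text does not enumerate them.

```python
import os

# Assumption: the PR mentions "four markdown MIME types" without listing them.
MARKDOWN_MIME_TYPES = frozenset({
    "text/markdown",
    "text/x-markdown",
    "application/markdown",
    "application/x-markdown",
})


def get_loader(filename: str, content_type: str, raw_text: bool = False):
    """Return (loader_name, known_type, file_ext); loader names stand in for loader objects."""
    file_ext = os.path.splitext(filename)[1].lower().lstrip(".")
    is_markdown = file_ext == "md" or content_type in MARKDOWN_MIME_TYPES
    if is_markdown:
        # raw_text=True keeps the file verbatim; the default keeps the parsing loader.
        name = "TextLoader" if raw_text else "UnstructuredMarkdownLoader"
        return name, True, file_ext
    return "TextLoader", False, file_ext


def load_file_content(filename: str, content_type: str, raw_text: bool = False) -> str:
    """Thread the flag through to get_loader (hypothetical signature)."""
    loader_name, _known_type, _file_ext = get_loader(filename, content_type, raw_text=raw_text)
    return loader_name


def text_handler(filename: str, content_type: str) -> str:
    # Stand-in for the /text handler: the only caller that opts in.
    return load_file_content(filename, content_type, raw_text=True)


def embed_handler(filename: str, content_type: str) -> str:
    # Stand-in for /embed and /embed-upload: rely on the default (raw_text=False).
    return load_file_content(filename, content_type)


print(text_handler("notes.md", "text/markdown"))
print(embed_handler("notes.md", "text/markdown"))
```

Keeping the default at `False` means existing callers do not need to change, which is why only the /text handler is touched.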

Test plan

pytest tests/utils/test_document_loader.py - 17/17 pass
Integration: with a LibreChat build that still calls /text for .md, upload a markdown file with headings/bold/lists; confirm raw content is returned
Integration: /embed upload with .md still uses UnstructuredMarkdownLoader (unchanged)

The /text endpoint is called by LibreChat's "Upload as Text" feature, where the user's intent is to send the file contents verbatim to the model. Previously, .md files were loaded with UnstructuredMarkdownLoader, which parses the Markdown and returns plain text with syntax removed (headings, bold, lists, blockquotes all stripped). Two byte-identical uploads - one named .txt, one named .md - would therefore reach the model as very different text.

Add a raw_text flag to get_loader / load_file_content, off by default so the embedding path (/embed, used for vector search) continues to use UnstructuredMarkdownLoader for better retrieval quality. Pass raw_text=True only from the /text endpoint, where Markdown files are now routed through TextLoader(autodetect_encoding=True) to preserve formatting byte-for-byte.

Reported against LibreChat as danny-avila/LibreChat#12731.

danny-avila · 2026-04-18T14:33:54Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c501616c34

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-18T14:36:20Z

+    if raw_text and is_markdown:
+        return TextLoader(filepath, autodetect_encoding=True), True, file_ext


Restore binary loader precedence with raw_text enabled

The new early return in get_loader runs before all format-specific branches, so on the /text path (raw_text=True) any upload labeled with a markdown MIME type is forced into TextLoader even when the file extension is a known binary type like .pdf or .docx. This is a regression from prior behavior where those extensions were handled by their dedicated loaders first, and it can cause failed or nonsensical extraction when clients send conflicting multipart content types.
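The precedence problem described in this comment can be reproduced with a small model. The sketch below is an assumption-laden simplification: only a PDF branch is modelled, loader objects are replaced by name strings, `get_loader_early_return` is a hypothetical name, and the MIME type set is a guess at the four types the PR refers to.

```python
import os

# Assumption: stand-in for the four markdown MIME types mentioned in the PR.
MARKDOWN_MIME_TYPES = frozenset({
    "text/markdown",
    "text/x-markdown",
    "application/markdown",
    "application/x-markdown",
})


def get_loader_early_return(filename: str, content_type: str, raw_text: bool = False):
    """Model of the first revision: the raw_text check runs before every other branch."""
    file_ext = os.path.splitext(filename)[1].lower().lstrip(".")
    is_markdown = file_ext == "md" or content_type in MARKDOWN_MIME_TYPES

    if raw_text and is_markdown:
        # Early return: taken even when the extension says the file is binary.
        return "TextLoader", True, file_ext

    if file_ext == "pdf":
        return "pdf-loader (placeholder)", True, file_ext
    if is_markdown:
        return "UnstructuredMarkdownLoader", True, file_ext
    return "TextLoader", False, file_ext


# A PDF uploaded with a conflicting markdown Content-Type.
embed_choice = get_loader_early_return("doc.pdf", "text/markdown")[0]
text_choice = get_loader_early_return("doc.pdf", "text/markdown", raw_text=True)[0]

print(embed_choice)
print(text_choice)
```

With the default flag the PDF branch still wins, but on the raw-text path the same upload is sent to `TextLoader`, which is the regression the review flags.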

Thanks — good catch. Pushed a442f0f to address this:

Dropped the early return; raw_text is now handled inside the existing markdown elif branch so the earlier PDF/CSV/RST/XML/PPT branches retain their precedence.

Tightened the markdown elif itself: it still accepts .md by extension, but only accepts a markdown Content-Type when the extension is not a known binary format (new _BINARY_FILE_EXTENSIONS set). This also closes a pre-existing edge where .docx / .xlsx / .epub uploaded with text/markdown would fall through the md branch even without raw_text.

Added a parameterised regression test covering doc.pdf, report.docx, book.epub, data.xlsx, and slides.pptx each with Content-Type: text/markdown on the raw_text=True path — each now routes to its dedicated binary loader (22/22 pytest passing).
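A minimal sketch of the revised selection logic and the regression check, under stated assumptions: loaders are represented by name strings, the branch order is a simplification of the real `get_loader`, and the MIME type set is a guess. The contents of `_BINARY_FILE_EXTENSIONS` follow the extensions named in this PR (pdf, doc(x), xls(x), ppt(x), epub).

```python
import os

# Assumption: stand-in for the four markdown MIME types mentioned in the PR.
MARKDOWN_MIME_TYPES = frozenset({
    "text/markdown",
    "text/x-markdown",
    "application/markdown",
    "application/x-markdown",
})

_BINARY_FILE_EXTENSIONS = frozenset(
    {"pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx", "epub"}
)


def get_loader(filename: str, content_type: str, raw_text: bool = False):
    """Return (loader_name, known_type, file_ext) with binary extensions taking precedence."""
    file_ext = os.path.splitext(filename)[1].lower().lstrip(".")

    if file_ext == "pdf":
        # Earlier format-specific branch keeps its precedence.
        return "pdf-loader (placeholder)", True, file_ext
    elif file_ext == "md" or (
        content_type in MARKDOWN_MIME_TYPES
        and file_ext not in _BINARY_FILE_EXTENSIONS
    ):
        # raw_text is decided inside the markdown branch, not before it.
        name = "TextLoader" if raw_text else "UnstructuredMarkdownLoader"
        return name, True, file_ext
    elif file_ext in _BINARY_FILE_EXTENSIONS:
        # Later dedicated branches (docx, xlsx, epub, ...) collapsed into one placeholder.
        return f"{file_ext}-loader (placeholder)", True, file_ext
    return "TextLoader", False, file_ext


# Regression check: binary filenames with a conflicting markdown Content-Type.
conflicting = ["doc.pdf", "report.docx", "book.epub", "data.xlsx", "slides.pptx"]
choices = {
    name: get_loader(name, "text/markdown", raw_text=True)[0] for name in conflicting
}
for name, choice in choices.items():
    print(name, "->", choice)
```

In the actual test suite this loop would presumably be a `pytest.mark.parametrize` case; a plain loop is used here to keep the sketch dependency-free.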

Addressing review feedback from codex: the previous early-return variant of get_loader could force a binary file (.pdf, .docx, etc.) through TextLoader if the client sent a markdown Content-Type alongside the binary filename. Handle raw_text inside the existing markdown branch instead, and tighten the content-type matcher so markdown MIME types do not override binary extensions. Add a dedicated frozenset _BINARY_FILE_EXTENSIONS covering pdf / doc(x) / xls(x) / ppt(x) / epub. Add tests that cover each binary extension with a conflicting text/markdown content type on the raw_text path.

chatgpt-codex-connector · 2026-04-18T15:01:05Z

Codex Review: Didn't find any major issues. What shall we delve into next?

danny-avila · 2026-04-18T15:03:44Z

@codex review

chatgpt-codex-connector · 2026-04-18T15:06:40Z

Codex Review: Didn't find any major issues. What shall we delve into next?

chatgpt-codex-connector Bot reviewed Apr 18, 2026

danny-avila changed the title ~~fix: preserve raw formatting for .md files on /text endpoint~~ 📃 fix: preserve raw formatting for .md files on /text endpoint Apr 19, 2026

danny-avila merged commit dcafc08 into main Apr 19, 2026
1 check passed

danny-avila merged 2 commits into main from fix/upload-text-preserve-markdown
