📝 fix: Preserve Raw Markdown Formatting on Upload as Text by danny-avila · Pull Request #12734 · danny-avila/LibreChat

danny-avila · 2026-04-18T14:23:14Z

Summary

When RAG_API_URL is set, uploading a .md file via Upload as Text sent the file to the RAG API /text endpoint, which routes Markdown through UnstructuredMarkdownLoader and strips formatting (#, **, lists, blockquotes). A .txt file with identical bytes round-tripped verbatim, so two byte-identical uploads produced very different extracted text depending only on the file extension.

This PR adds a short-circuit in parseText: if the file is Markdown (by MIME type or .md / .markdown extension) it skips the RAG API call and reads the file verbatim via parseTextNative. This is the defensive fix on the LibreChat side — a matching loader fix in danny-avila/rag_api covers the server path as well.

The embedding path (/embed, used by file_search/RAG) is untouched and continues to use the semantic markdown loader so vector search quality is unchanged.
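
The routing decision described above can be sketched roughly as follows. This is an illustration, not the code from `text.ts`: `UploadedFileLike`, `TextRoute`, and `chooseTextRoute` are hypothetical names, the MIME set is inferred from the types exercised in `text.spec.ts`, and the sketch assumes native parsing is also the path taken when no RAG API is configured.

```typescript
// Hypothetical minimal file shape; the PR's code takes Express.Multer.File.
interface UploadedFileLike {
  mimetype: string;
  originalname: string;
}

// MIME types inferred from the cases listed in text.spec.ts (assumption).
const MARKDOWN_MIME_TYPES = new Set<string>([
  'text/markdown',
  'text/x-markdown',
  'application/markdown',
  'application/x-markdown',
]);

// Markdown by exact MIME match, or by .md / .markdown extension (case-insensitive).
function isMarkdownFile(file: UploadedFileLike): boolean {
  if (file.mimetype && MARKDOWN_MIME_TYPES.has(file.mimetype)) {
    return true;
  }
  return /\.(md|markdown)$/i.test(file.originalname);
}

type TextRoute = 'native' | 'rag';

// Hypothetical stand-in for the branch inside parseText: it reports which
// path would handle the upload instead of performing any I/O.
function chooseTextRoute(file: UploadedFileLike, ragApiUrl?: string): TextRoute {
  if (!ragApiUrl) {
    return 'native'; // assumption: no RAG API configured means native parsing
  }
  if (isMarkdownFile(file)) {
    return 'native'; // short-circuit: read Markdown verbatim
  }
  return 'rag'; // everything else still goes to the RAG API /text endpoint
}
```

Returning a route label keeps the sketch free of network and file-system calls; in the PR the two branches correspond to `parseTextNative` and the RAG API `/text` request.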

Changes

packages/api/src/files/text.ts: add isMarkdownFile() helper and short-circuit in parseText
packages/api/src/files/text.spec.ts: cover each markdown MIME type, .md / .MARKDOWN by extension alone, and regression-check that non-markdown still hits the RAG API

Test plan

npx jest src/files/text.spec.ts — 15/15 pass
Manual: with RAG_API_URL set, upload a .md file with headings/bold/lists via Upload as Text; verify raw markdown reaches the model
Manual: upload a .txt with identical content; behavior should match
Manual regression: file_search (RAG) against a .md still uses the semantic loader on the RAG API side (unchanged)

When `RAG_API_URL` is configured, `.md` uploads were sent to the RAG API `/text` endpoint, which routes Markdown through `UnstructuredMarkdownLoader` and strips formatting (`#`, `**`, lists, blockquotes). Users expect `Upload as Text` to preserve raw content - identical bytes in a `.txt` file round-trip verbatim, while the `.md` came back stripped. Short-circuit the RAG API call for Markdown files (by MIME type or `.md` / `.markdown` extension) and read the file verbatim via `parseTextNative`. Non-Markdown paths are unaffected, and the embedding path (`/embed`) keeps its existing loader so vector search quality is unchanged.

Copilot

Pull request overview

This PR fixes a LibreChat-side regression where uploading Markdown via Upload as Text would route .md files through the RAG API /text endpoint and lose raw Markdown formatting, by short-circuiting Markdown uploads to native parsing when RAG_API_URL is set.

Changes:

Add Markdown detection helper and bypass the RAG /text call for Markdown files, using parseTextNative to preserve raw content.
Add Jest coverage to ensure Markdown files short-circuit to native parsing and non-Markdown files still use the RAG API path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
packages/api/src/files/text.ts	Adds Markdown detection + early return to native parsing to preserve raw Markdown formatting when RAG is configured.
packages/api/src/files/text.spec.ts	Adds regression tests for Markdown short-circuit behavior and ensures non-Markdown still calls the RAG API.


Copilot · 2026-04-18T14:26:43Z


+const MARKDOWN_MIME_TYPES = new Set([
+  'text/markdown',
+  'text/x-markdown',


isMarkdownFile() checks a fixed set of MIME types but does not include text/md. In the codebase, text/md is treated as a valid Markdown MIME type (e.g., artifacts prompt), and the Upload-as-Text path accepts arbitrary type/subtype values—so a Markdown upload with mimetype text/md (especially if the filename lacks a .md extension) will still go through the RAG /text endpoint and have formatting stripped. Consider adding text/md to the Markdown MIME set (or normalizing MIME types using the same canonicalization used elsewhere) so Markdown is consistently short-circuited.

Suggested change

-  'text/x-markdown',
+  'text/x-markdown',
+  'text/md',

Good catch — text/md is treated as a valid markdown MIME elsewhere in the codebase (e.g. artifacts prompt). Added it to MARKDOWN_MIME_TYPES in 99f062d2a, plus MIME-type normalization so parameterized variants like text/markdown; charset=utf-8 also short-circuit.

Copilot · 2026-04-18T14:26:44Z

+    it.each([
+      { mimetype: 'text/markdown', originalname: 'notes.md' },
+      { mimetype: 'text/x-markdown', originalname: 'notes.md' },
+      { mimetype: 'application/markdown', originalname: 'notes.md' },
+      { mimetype: 'application/x-markdown', originalname: 'notes.md' },
+      { mimetype: 'application/octet-stream', originalname: 'README.md' },
+      { mimetype: 'application/octet-stream', originalname: 'GUIDE.MARKDOWN' },
+    ])(


The markdown short-circuit tests cover several markdown MIME types and extension-only detection, but they don't cover the text/md MIME type which is treated elsewhere in the codebase as valid Markdown. Adding a text/md case (ideally with a non-.md filename to ensure the MIME-type path is what triggers) would prevent regressions where markdown still gets routed through the RAG /text endpoint.

Added a text/md test case (with an extension-less filename so the MIME path is what triggers) and three more cases covering the normalization edges: text/markdown; charset=utf-8, uppercase TEXT/MARKDOWN, and whitespace-padded input. See 99f062d2a.

github-actions · 2026-04-18T14:29:18Z

GitNexus: 🚀 deployed

The LibreChat-pr-12734 index is now live on the MCP server.
Deploy run

chatgpt-codex-connector · 2026-04-18T14:35:41Z

Codex Review: Didn't find any major issues. Bravo.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-18T14:38:40Z

Codex Review: Didn't find any major issues. Can't wait for the next one!


danny-avila · 2026-04-18T14:50:33Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 275b20dbc3


chatgpt-codex-connector · 2026-04-18T14:53:33Z

+]);
+
+function isMarkdownFile(file: Express.Multer.File): boolean {
+  if (file.mimetype && MARKDOWN_MIME_TYPES.has(file.mimetype)) {


Normalize markdown MIME before set lookup

isMarkdownFile does an exact lookup on file.mimetype, so valid multipart values like text/markdown; charset=utf-8 will not be recognized as markdown unless the filename also ends in .md/.markdown. In that scenario (for example, markdown uploads named without extension), parseText still calls the RAG /text path and the markdown-formatting loss this change is meant to prevent can still occur. Please normalize the MIME type (lowercase and strip parameters) before checking the markdown set.


Thanks — pushed 99f062d2a to address this. isMarkdownFile now runs file.mimetype through a normalizeMimeType() helper that lowercases and strips parameters before the set lookup, so text/markdown; charset=utf-8, TEXT/MARKDOWN, and whitespace-padded variants all short-circuit as expected. New parametrized test cases cover each of those shapes.

Addressing review feedback on the `Upload as Text` short-circuit:

- Accept `text/md` in the markdown MIME set (LibreChat treats it as a valid markdown type elsewhere, e.g. the artifact-rendering prompt).
- Normalize the incoming MIME type (lowercase + strip parameters) before the set lookup so parameterized values like `text/markdown; charset=utf-8` and uppercase `TEXT/MARKDOWN` still short-circuit. Extensionless uploads relying only on the `Content-Type` header would otherwise fall through to the RAG `/text` endpoint and lose their markdown formatting.

Extend `text.spec.ts` parametrized cases with `text/md`, parameterized MIME, uppercase, and whitespace-padded variants.
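
A minimal sketch of the normalization step, assuming the helper simply drops everything after the first `;`, trims whitespace, and lowercases the result. The name `normalizeMimeType` comes from the discussion above; its body and signature here, the abbreviated MIME set, and the `hasMarkdownMime` wrapper are assumptions for illustration.

```typescript
// Abbreviated set for illustration; includes the newly accepted text/md.
const MARKDOWN_MIME_TYPES = new Set<string>([
  'text/markdown',
  'text/x-markdown',
  'text/md',
]);

// Lowercase the type/subtype and strip parameters such as "; charset=utf-8".
function normalizeMimeType(mimetype: string): string {
  if (!mimetype) {
    return ''; // empty MIME: callers fall back to the extension check
  }
  const [base = ''] = mimetype.split(';');
  return base.trim().toLowerCase();
}

// Hypothetical wrapper showing the set lookup on the normalized value.
function hasMarkdownMime(mimetype: string): boolean {
  const normalized = normalizeMimeType(mimetype);
  return normalized !== '' && MARKDOWN_MIME_TYPES.has(normalized);
}
```

With this shape, `text/markdown; charset=utf-8`, `TEXT/MARKDOWN`, and a whitespace-padded `text/markdown` all normalize to the same key before the lookup, which is the behavior the new test cases are described as covering.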

chatgpt-codex-connector · 2026-04-18T15:00:13Z

Codex Review: Didn't find any major issues. Chef's kiss.


danny-avila · 2026-04-18T15:03:44Z

@codex review

github-actions · 2026-04-18T15:03:45Z

GitNexus: 🚀 deployed

The LibreChat-pr-12734 index is now live on the MCP server.
Deploy run

chatgpt-codex-connector · 2026-04-18T15:06:38Z

Codex Review: Didn't find any major issues. Swish!


Addressing comprehensive review feedback:

- Debug log now includes filename and MIME type so operators can identify which upload triggered the short-circuit without having to correlate other logs.
- Expand markdown extension detection beyond `.md` / `.markdown` to cover `.mdown`, `.mkdn`, `.mkd`, `.mdwn` (case-insensitive regex).
- Tighten `normalizeMimeType` parameter type from `string | undefined` to `string` to match the actual Express.Multer.File type. The falsy-check still protects against empty strings at runtime.
- Extend parametrized tests with the most common real-world shapes: `text/plain` + `.md` (the MIME most browsers/servers assign), the new rare extensions, and empty MIME + `.md` (pure extension fallback path).
- Add a positive assertion that `readFileAsString` was called with the expected arguments on every short-circuit case, so tests fail loudly if the native-parse path ever regresses.
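
The expanded extension check might look like the sketch below. The constant name `MARKDOWN_EXTENSIONS_RE` appears in this PR, but the exact pattern is an assumption reconstructed from the extensions listed above, and `hasMarkdownExtension` and `extensionCases` are hypothetical names. The table exercises every alternation at least once, so a typo in any single branch would show up as a failure.

```typescript
// Assumed pattern: one alternation per extension named in the commit message.
const MARKDOWN_EXTENSIONS_RE = /\.(md|markdown|mdown|mkdn|mkd|mdwn)$/i;

// Hypothetical helper: true when the filename ends in a Markdown extension.
function hasMarkdownExtension(originalname: string): boolean {
  return MARKDOWN_EXTENSIONS_RE.test(originalname);
}

// One row per regex branch, plus a few names that must not match.
const extensionCases: Array<[string, boolean]> = [
  ['README.md', true],
  ['GUIDE.MARKDOWN', true],
  ['notes.mdown', true],
  ['notes.mkdn', true],
  ['notes.mkd', true],
  ['wiki.mdwn', true],
  ['notes.txt', false],
  ['notes.md.txt', false],
  ['notes.mdx', false],
  ['md', false],
];

// Names whose actual result differs from the expected one.
const failures = extensionCases
  .filter(([name, expected]) => hasMarkdownExtension(name) !== expected)
  .map(([name]) => name);

console.log(failures.length === 0 ? 'all extension cases pass' : failures);
```

The regex has no `g` flag, so repeated `test()` calls carry no `lastIndex` state between filenames.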

github-actions · 2026-04-18T15:35:17Z

GitNexus: 🚀 deployed

The LibreChat-pr-12734 index is now live on the MCP server.
Deploy run

Every other alternation in `MARKDOWN_EXTENSIONS_RE` has at least one test case (`md`, `markdown`, `mdown`, `mkdn`, `mkd`) but `mdwn` did not, leaving a typo in that branch undetectable.

github-actions · 2026-04-18T16:16:35Z

GitNexus: 🚀 deployed

The LibreChat-pr-12734 index is now live on the MCP server.
Deploy run

…a#12734)

* 🐛 fix: Preserve Raw Markdown on `Upload as Text`
* 🐛 fix: normalize markdown MIME and accept `text/md`
* 🧹 chore: Address Code Review Follow-ups on `Upload as Text` fix
* 🧪 test: Cover `.mdwn` regex branch in Markdown short-circuit

Copilot AI review requested due to automatic review settings April 18, 2026 14:23

Copilot started reviewing on behalf of danny-avila April 18, 2026 14:23

danny-avila mentioned this pull request Apr 18, 2026

📃 fix: preserve raw formatting for .md files on /text endpoint danny-avila/rag_api#278

Merged

3 tasks

Copilot AI reviewed Apr 18, 2026

chatgpt-codex-connector Bot reviewed Apr 18, 2026

🧪 test: Cover .mdwn regex branch in Markdown short-circuit

3e3917c


danny-avila changed the base branch from main to dev April 18, 2026 16:13

danny-avila changed the title ~~🐛 fix: Preserve Raw Markdown on Upload as Text~~ 📝 fix: Preserve Raw Markdown Formatting on Upload as Text Apr 20, 2026

danny-avila merged commit 2358d07 into dev Apr 20, 2026
9 checks passed

danny-avila deleted the claude/ecstatic-cartwright-652388 branch April 20, 2026 02:31
