📝 fix: Preserve Raw Markdown Formatting on Upload as Text #12734

Merged
danny-avila merged 4 commits into dev from claude/ecstatic-cartwright-652388 on Apr 20, 2026

Conversation

@danny-avila
Owner

Summary

Fixes #12731.

When RAG_API_URL is set, uploading a .md file via Upload as Text sent the file to the RAG API /text endpoint, which routes Markdown through UnstructuredMarkdownLoader and strips formatting (#, **, lists, blockquotes). A .txt file with identical bytes round-tripped verbatim, so two byte-identical uploads produced very different extracted text depending only on the file extension.

This PR adds a short-circuit in parseText: if the file is Markdown (by MIME type or .md / .markdown extension) it skips the RAG API call and reads the file verbatim via parseTextNative. This is the defensive fix on the LibreChat side — a matching loader fix in danny-avila/rag_api covers the server path as well.

The embedding path (/embed, used by file_search/RAG) is untouched and continues to use the semantic markdown loader so vector search quality is unchanged.
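
The short-circuit described above can be sketched roughly as follows. This is a minimal illustration, not the code in `text.ts`: `UploadLike`, `TextParser`, and the injected `viaRagApi` / `viaNative` parameters are hypothetical stand-ins for the Multer file type, the RAG `/text` request, and `parseTextNative`, whose real signatures are not shown in this description. The MIME list is taken from the test cases quoted later in the review, and the "no `RAG_API_URL` means native parsing" branch is an assumption.

```typescript
// Hypothetical stand-in for the Express.Multer.File fields used here.
interface UploadLike {
  mimetype: string;
  originalname: string;
}

// Hypothetical parser signature; the real parseTextNative may differ.
type TextParser = (file: UploadLike) => Promise<string>;

// MIME types that appear in the first commit's test cases.
const MARKDOWN_MIME_TYPES = new Set([
  'text/markdown',
  'text/x-markdown',
  'application/markdown',
  'application/x-markdown',
]);

function isMarkdownFile(file: UploadLike): boolean {
  if (file.mimetype && MARKDOWN_MIME_TYPES.has(file.mimetype)) {
    return true;
  }
  // Extension fallback for uploads labelled e.g. application/octet-stream.
  const name = file.originalname.toLowerCase();
  return name.endsWith('.md') || name.endsWith('.markdown');
}

async function parseText(
  file: UploadLike,
  ragApiUrl: string | undefined,
  viaRagApi: TextParser,
  viaNative: TextParser,
): Promise<string> {
  // Assumption: without a RAG API, or for Markdown, read the file verbatim.
  if (!ragApiUrl || isMarkdownFile(file)) {
    return viaNative(file);
  }
  return viaRagApi(file);
}
```

Injecting the two parsers keeps the sketch runnable without a network call; in the real module both are presumably module-level functions.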

Changes

  • packages/api/src/files/text.ts: add isMarkdownFile() helper and short-circuit in parseText
  • packages/api/src/files/text.spec.ts: cover each markdown MIME type, .md / .MARKDOWN by extension alone, and regression-check that non-markdown still hits the RAG API

Test plan

  • npx jest src/files/text.spec.ts — 15/15 pass
  • Manual: with RAG_API_URL set, upload a .md file with headings/bold/lists via Upload as Text; verify raw markdown reaches the model
  • Manual: upload a .txt with identical content; behavior should match
  • Manual regression: file_search (RAG) against a .md still uses the semantic loader on the RAG API side (unchanged)

When `RAG_API_URL` is configured, `.md` uploads were sent to the RAG API
`/text` endpoint, which routes Markdown through `UnstructuredMarkdownLoader`
and strips formatting (`#`, `**`, lists, blockquotes). Users expect `Upload
as Text` to preserve raw content - identical bytes in a `.txt` file round-trip
verbatim, while the `.md` came back stripped.

Short-circuit the RAG API call for Markdown files (by MIME type or `.md` /
`.markdown` extension) and read the file verbatim via `parseTextNative`.
Non-Markdown paths are unaffected, and the embedding path (`/embed`) keeps
its existing loader so vector search quality is unchanged.
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a LibreChat-side regression where uploading Markdown via Upload as Text would route .md files through the RAG API /text endpoint and lose raw Markdown formatting, by short-circuiting Markdown uploads to native parsing when RAG_API_URL is set.

Changes:

  • Add Markdown detection helper and bypass the RAG /text call for Markdown files, using parseTextNative to preserve raw content.
  • Add Jest coverage to ensure Markdown files short-circuit to native parsing and non-Markdown files still use the RAG API path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
packages/api/src/files/text.ts | Adds Markdown detection + early return to native parsing to preserve raw Markdown formatting when RAG is configured.
packages/api/src/files/text.spec.ts | Adds regression tests for Markdown short-circuit behavior and ensures non-Markdown still calls the RAG API.


const MARKDOWN_MIME_TYPES = new Set([
'text/markdown',
'text/x-markdown',

Copilot AI Apr 18, 2026

isMarkdownFile() checks a fixed set of MIME types but does not include text/md. In the codebase, text/md is treated as a valid Markdown MIME type (e.g., artifacts prompt), and the Upload-as-Text path accepts arbitrary type/subtype values—so a Markdown upload with mimetype text/md (especially if the filename lacks a .md extension) will still go through the RAG /text endpoint and have formatting stripped. Consider adding text/md to the Markdown MIME set (or normalizing MIME types using the same canonicalization used elsewhere) so Markdown is consistently short-circuited.

Suggested change
- 'text/x-markdown',
+ 'text/x-markdown',
+ 'text/md',

Owner Author

Good catch — text/md is treated as a valid markdown MIME elsewhere in the codebase (e.g. artifacts prompt). Added it to MARKDOWN_MIME_TYPES in 99f062d2a, plus MIME-type normalization so parameterized variants like text/markdown; charset=utf-8 also short-circuit.

Comment on lines +304 to +311
it.each([
{ mimetype: 'text/markdown', originalname: 'notes.md' },
{ mimetype: 'text/x-markdown', originalname: 'notes.md' },
{ mimetype: 'application/markdown', originalname: 'notes.md' },
{ mimetype: 'application/x-markdown', originalname: 'notes.md' },
{ mimetype: 'application/octet-stream', originalname: 'README.md' },
{ mimetype: 'application/octet-stream', originalname: 'GUIDE.MARKDOWN' },
])(

Copilot AI Apr 18, 2026

The markdown short-circuit tests cover several markdown MIME types and extension-only detection, but they don't cover the text/md MIME type which is treated elsewhere in the codebase as valid Markdown. Adding a text/md case (ideally with a non-.md filename to ensure the MIME-type path is what triggers) would prevent regressions where markdown still gets routed through the RAG /text endpoint.

Owner Author

Added a text/md test case (with an extension-less filename so the MIME path is what triggers) and three more cases covering the normalization edges: text/markdown; charset=utf-8, uppercase TEXT/MARKDOWN, and whitespace-padded input. See 99f062d2a.

@github-actions
Contributor

GitNexus: 🚀 deployed

The LibreChat-pr-12734 index is now live on the MCP server.
Deploy run

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Bravo.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Can't wait for the next one!

@danny-avila
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 275b20dbc3

Comment thread packages/api/src/files/text.ts Outdated
]);

function isMarkdownFile(file: Express.Multer.File): boolean {
if (file.mimetype && MARKDOWN_MIME_TYPES.has(file.mimetype)) {

[P2] Normalize markdown MIME before set lookup

isMarkdownFile does an exact lookup on file.mimetype, so valid multipart values like text/markdown; charset=utf-8 will not be recognized as markdown unless the filename also ends in .md/.markdown. In that scenario (for example, markdown uploads named without extension), parseText still calls the RAG /text path and the markdown-formatting loss this change is meant to prevent can still occur. Please normalize the MIME type (lowercase and strip parameters) before checking the markdown set.

Owner Author

Thanks — pushed 99f062d2a to address this. isMarkdownFile now runs file.mimetype through a normalizeMimeType() helper that lowercases and strips parameters before the set lookup, so text/markdown; charset=utf-8, TEXT/MARKDOWN, and whitespace-padded variants all short-circuit as expected. New parametrized test cases cover each of those shapes.
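
As a rough illustration of the normalization described in this reply, the sketch below lowercases the MIME type and strips parameters before the set lookup. Only the names `normalizeMimeType` and `MARKDOWN_MIME_TYPES` come from the thread; the function body, the `hasMarkdownMime` wrapper, and the exact contents of the set are assumptions pieced together from the MIME types mentioned in the review.

```typescript
// Sketch only: the helper name comes from the PR thread, the body is assumed.
function normalizeMimeType(mimetype: string): string {
  if (!mimetype) {
    return '';
  }
  // Drop parameters such as "; charset=utf-8", then trim and lowercase.
  const semicolon = mimetype.indexOf(';');
  const essence = semicolon === -1 ? mimetype : mimetype.slice(0, semicolon);
  return essence.trim().toLowerCase();
}

// Assumed set contents: the types from the test cases plus text/md.
const MARKDOWN_MIME_TYPES = new Set([
  'text/markdown',
  'text/x-markdown',
  'application/markdown',
  'application/x-markdown',
  'text/md',
]);

// Hypothetical wrapper, used here only to exercise the lookup.
function hasMarkdownMime(mimetype: string): boolean {
  return MARKDOWN_MIME_TYPES.has(normalizeMimeType(mimetype));
}
```

With this shape, `hasMarkdownMime('text/markdown; charset=utf-8')`, `hasMarkdownMime('TEXT/MARKDOWN')`, and `hasMarkdownMime('  text/md  ')` all return `true`, while `hasMarkdownMime('text/plain')` returns `false`.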

Addressing review feedback on the `Upload as Text` short-circuit:

- Accept `text/md` in the markdown MIME set (LibreChat treats it as a
  valid markdown type elsewhere, e.g. the artifact-rendering prompt).
- Normalize the incoming MIME type (lowercase + strip parameters) before
  the set lookup so parameterized values like
  `text/markdown; charset=utf-8` and uppercase `TEXT/MARKDOWN` still
  short-circuit. Extensionless uploads relying only on the `Content-Type`
  header would otherwise fall through to the RAG `/text` endpoint and
  lose their markdown formatting.

Extend `text.spec.ts` parametrized cases with `text/md`, parameterized
MIME, uppercase, and whitespace-padded variants.
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Chef's kiss.

@danny-avila
Owner Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Swish!

Addressing comprehensive review feedback:

- Debug log now includes filename and MIME type so operators can
  identify which upload triggered the short-circuit without having
  to correlate other logs.
- Expand markdown extension detection beyond `.md` / `.markdown` to
  cover `.mdown`, `.mkdn`, `.mkd`, `.mdwn` (case-insensitive regex).
- Tighten `normalizeMimeType` parameter type from `string | undefined`
  to `string` to match the actual Express.Multer.File type. The
  falsy-check still protects against empty strings at runtime.
- Extend parametrized tests with the most common real-world shapes:
  `text/plain` + `.md` (the MIME most browsers/servers assign),
  the new rare extensions, and empty MIME + `.md` (pure extension
  fallback path).
- Add a positive assertion that `readFileAsString` was called with the
  expected arguments on every short-circuit case, so tests fail loudly
  if the native-parse path ever regresses.
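
The expanded extension check in the list above might look something like the following. The constant name `MARKDOWN_EXTENSIONS_RE` and the list of extensions come from the commit messages in this PR; the exact pattern and the `hasMarkdownExtension` helper are assumptions for illustration.

```typescript
// Assumed pattern: only the constant name and the extension list appear in the PR.
// The `i` flag makes the match case-insensitive; `$` anchors it to the end of the name.
const MARKDOWN_EXTENSIONS_RE = /\.(md|markdown|mdown|mkdn|mkd|mdwn)$/i;

// Hypothetical helper wrapping the regex test.
function hasMarkdownExtension(originalname: string): boolean {
  return MARKDOWN_EXTENSIONS_RE.test(originalname);
}
```

Because the check looks only at the filename, a `text/plain` upload named `notes.md` would still short-circuit, which matches the `text/plain` + `.md` case called out in the list.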
Every other alternation in `MARKDOWN_EXTENSIONS_RE` has at least one
test case (`md`, `markdown`, `mdown`, `mkdn`, `mkd`) but `mdwn` did
not, leaving a typo in that branch undetectable.
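
One way to picture the gap this commit closes: if the expected extensions are written out separately from the regex, a helper can report any extension the pattern fails to match. The sketch below is hypothetical and is not the Jest test added in this PR; `EXPECTED_EXTENSIONS` and `unmatchedExtensions` are illustrative names, and the regex shape is assumed.

```typescript
// Assumed pattern; only the constant name and extension list appear in the PR.
const MARKDOWN_EXTENSIONS_RE = /\.(md|markdown|mdown|mkdn|mkd|mdwn)$/i;

// Written out independently of the regex so a typo in either one is caught.
const EXPECTED_EXTENSIONS = ['md', 'markdown', 'mdown', 'mkdn', 'mkd', 'mdwn'];

// Returns the extensions for which a sample filename does not match the regex.
function unmatchedExtensions(re: RegExp, extensions: string[]): string[] {
  return extensions.filter((ext) => !re.test(`notes.${ext}`));
}
```

With the pattern as written, `unmatchedExtensions(MARKDOWN_EXTENSIONS_RE, EXPECTED_EXTENSIONS)` returns an empty array. If the last alternation were mistyped as `mwdn`, the same call would return `['mdwn']`, which is the kind of failure a missing test case would have hidden.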
@danny-avila danny-avila changed the base branch from main to dev April 18, 2026 16:13
@danny-avila danny-avila changed the title 🐛 fix: Preserve Raw Markdown on Upload as Text 📝 fix: Preserve Raw Markdown Formatting on Upload as Text Apr 20, 2026
@danny-avila danny-avila merged commit 2358d07 into dev Apr 20, 2026
9 checks passed
@danny-avila danny-avila deleted the claude/ecstatic-cartwright-652388 branch April 20, 2026 02:31
krgokul pushed a commit to syedhabib39/LibreChat that referenced this pull request Apr 20, 2026
…a#12734)

* 🐛 fix: Preserve Raw Markdown on `Upload as Text`

When `RAG_API_URL` is configured, `.md` uploads were sent to the RAG API
`/text` endpoint, which routes Markdown through `UnstructuredMarkdownLoader`
and strips formatting (`#`, `**`, lists, blockquotes). Users expect `Upload
as Text` to preserve raw content - identical bytes in a `.txt` file round-trip
verbatim, while the `.md` came back stripped.

Short-circuit the RAG API call for Markdown files (by MIME type or `.md` /
`.markdown` extension) and read the file verbatim via `parseTextNative`.
Non-Markdown paths are unaffected, and the embedding path (`/embed`) keeps
its existing loader so vector search quality is unchanged.

* 🐛 fix: normalize markdown MIME and accept `text/md`

Addressing review feedback on the `Upload as Text` short-circuit:

- Accept `text/md` in the markdown MIME set (LibreChat treats it as a
  valid markdown type elsewhere, e.g. the artifact-rendering prompt).
- Normalize the incoming MIME type (lowercase + strip parameters) before
  the set lookup so parameterized values like
  `text/markdown; charset=utf-8` and uppercase `TEXT/MARKDOWN` still
  short-circuit. Extensionless uploads relying only on the `Content-Type`
  header would otherwise fall through to the RAG `/text` endpoint and
  lose their markdown formatting.

Extend `text.spec.ts` parametrized cases with `text/md`, parameterized
MIME, uppercase, and whitespace-padded variants.

* 🧹 chore: Address Code Review Follow-ups on `Upload as Text` fix

Addressing comprehensive review feedback:

- Debug log now includes filename and MIME type so operators can
  identify which upload triggered the short-circuit without having
  to correlate other logs.
- Expand markdown extension detection beyond `.md` / `.markdown` to
  cover `.mdown`, `.mkdn`, `.mkd`, `.mdwn` (case-insensitive regex).
- Tighten `normalizeMimeType` parameter type from `string | undefined`
  to `string` to match the actual Express.Multer.File type. The
  falsy-check still protects against empty strings at runtime.
- Extend parametrized tests with the most common real-world shapes:
  `text/plain` + `.md` (the MIME most browsers/servers assign),
  the new rare extensions, and empty MIME + `.md` (pure extension
  fallback path).
- Add a positive assertion that `readFileAsString` was called with the
  expected arguments on every short-circuit case, so tests fail loudly
  if the native-parse path ever regresses.

* 🧪 test: Cover `.mdwn` regex branch in Markdown short-circuit

Every other alternation in `MARKDOWN_EXTENSIONS_RE` has at least one
test case (`md`, `markdown`, `mdown`, `mkdn`, `mkd`) but `mdwn` did
not, leaving a typo in that branch undetectable.
krgokul pushed a commit to syedhabib39/LibreChat that referenced this pull request Apr 21, 2026
…a#12734)

Development

Successfully merging this pull request may close these issues.

[Bug]: "Upload as Text" strips most markdown formatting from .md files

2 participants