📝 fix: Preserve Raw Markdown Formatting on Upload as Text #12734
danny-avila merged 4 commits into dev
When `RAG_API_URL` is configured, `.md` uploads were sent to the RAG API `/text` endpoint, which routes Markdown through `UnstructuredMarkdownLoader` and strips formatting (`#`, `**`, lists, blockquotes). Users expect `Upload as Text` to preserve raw content - identical bytes in a `.txt` file round-trip verbatim, while the `.md` came back stripped. Short-circuit the RAG API call for Markdown files (by MIME type or `.md` / `.markdown` extension) and read the file verbatim via `parseTextNative`. Non-Markdown paths are unaffected, and the embedding path (`/embed`) keeps its existing loader so vector search quality is unchanged.
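The routing decision described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in the PR: `UploadedFile` and `selectTextRoute` are hypothetical names introduced here, the MIME list is taken from the PR's test cases, and the assumption that uploads are parsed natively when `RAG_API_URL` is unset is inferred from the description rather than confirmed by it.

```typescript
// Hypothetical minimal file shape; the PR's helper takes Express.Multer.File.
interface UploadedFile {
  mimetype: string;
  originalname: string;
}

// MIME types drawn from the PR's test cases for the initial commit.
const MARKDOWN_MIME_TYPES = new Set<string>([
  'text/markdown',
  'text/x-markdown',
  'application/markdown',
  'application/x-markdown',
]);

// Markdown is detected by MIME type first, then by `.md` / `.markdown` extension.
function isMarkdownFile(file: UploadedFile): boolean {
  if (file.mimetype && MARKDOWN_MIME_TYPES.has(file.mimetype)) {
    return true;
  }
  return /\.(md|markdown)$/i.test(file.originalname);
}

// Hypothetical helper naming the path a text upload would take.
// Assumption: without a RAG API URL, everything is parsed natively.
function selectTextRoute(file: UploadedFile, ragApiUrl?: string): 'native' | 'rag' {
  if (!ragApiUrl) {
    return 'native';
  }
  return isMarkdownFile(file) ? 'native' : 'rag';
}
```

Under these assumptions, a `README.md` sent as `application/octet-stream` takes the native path, while a `.txt` or `.pdf` upload still goes to the RAG API.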
Pull request overview
This PR fixes a LibreChat-side regression where uploading Markdown via Upload as Text would route .md files through the RAG API /text endpoint and lose raw Markdown formatting, by short-circuiting Markdown uploads to native parsing when RAG_API_URL is set.
Changes:
- Add Markdown detection helper and bypass the RAG `/text` call for Markdown files, using `parseTextNative` to preserve raw content.
- Add Jest coverage to ensure Markdown files short-circuit to native parsing and non-Markdown files still use the RAG API path.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/api/src/files/text.ts | Adds Markdown detection + early return to native parsing to preserve raw Markdown formatting when RAG is configured. |
| packages/api/src/files/text.spec.ts | Adds regression tests for Markdown short-circuit behavior and ensures non-Markdown still calls the RAG API. |
const MARKDOWN_MIME_TYPES = new Set([
  'text/markdown',
  'text/x-markdown',
isMarkdownFile() checks a fixed set of MIME types but does not include text/md. In the codebase, text/md is treated as a valid Markdown MIME type (e.g., artifacts prompt), and the Upload-as-Text path accepts arbitrary type/subtype values—so a Markdown upload with mimetype text/md (especially if the filename lacks a .md extension) will still go through the RAG /text endpoint and have formatting stripped. Consider adding text/md to the Markdown MIME set (or normalizing MIME types using the same canonicalization used elsewhere) so Markdown is consistently short-circuited.
Suggested change:
- 'text/x-markdown',
+ 'text/x-markdown',
+ 'text/md',
Good catch — text/md is treated as a valid markdown MIME elsewhere in the codebase (e.g. artifacts prompt). Added it to MARKDOWN_MIME_TYPES in 99f062d2a, plus MIME-type normalization so parameterized variants like text/markdown; charset=utf-8 also short-circuit.
it.each([
  { mimetype: 'text/markdown', originalname: 'notes.md' },
  { mimetype: 'text/x-markdown', originalname: 'notes.md' },
  { mimetype: 'application/markdown', originalname: 'notes.md' },
  { mimetype: 'application/x-markdown', originalname: 'notes.md' },
  { mimetype: 'application/octet-stream', originalname: 'README.md' },
  { mimetype: 'application/octet-stream', originalname: 'GUIDE.MARKDOWN' },
])(
The markdown short-circuit tests cover several markdown MIME types and extension-only detection, but they don't cover the text/md MIME type which is treated elsewhere in the codebase as valid Markdown. Adding a text/md case (ideally with a non-.md filename to ensure the MIME-type path is what triggers) would prevent regressions where markdown still gets routed through the RAG /text endpoint.
Added a text/md test case (with an extension-less filename so the MIME path is what triggers) and three more cases covering the normalization edges: text/markdown; charset=utf-8, uppercase TEXT/MARKDOWN, and whitespace-padded input. See 99f062d2a.
Codex Review: Didn't find any major issues. Bravo.
Codex Review: Didn't find any major issues. Can't wait for the next one!
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 275b20dbc3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
]);

function isMarkdownFile(file: Express.Multer.File): boolean {
  if (file.mimetype && MARKDOWN_MIME_TYPES.has(file.mimetype)) {
Normalize markdown MIME before set lookup
isMarkdownFile does an exact lookup on file.mimetype, so valid multipart values like text/markdown; charset=utf-8 will not be recognized as markdown unless the filename also ends in .md/.markdown. In that scenario (for example, markdown uploads named without extension), parseText still calls the RAG /text path and the markdown-formatting loss this change is meant to prevent can still occur. Please normalize the MIME type (lowercase and strip parameters) before checking the markdown set.
Thanks — pushed 99f062d2a to address this. isMarkdownFile now runs file.mimetype through a normalizeMimeType() helper that lowercases and strips parameters before the set lookup, so text/markdown; charset=utf-8, TEXT/MARKDOWN, and whitespace-padded variants all short-circuit as expected. New parametrized test cases cover each of those shapes.
Addressing review feedback on the `Upload as Text` short-circuit:
- Accept `text/md` in the markdown MIME set (LibreChat treats it as a valid markdown type elsewhere, e.g. the artifact-rendering prompt).
- Normalize the incoming MIME type (lowercase + strip parameters) before the set lookup so parameterized values like `text/markdown; charset=utf-8` and uppercase `TEXT/MARKDOWN` still short-circuit. Extensionless uploads relying only on the `Content-Type` header would otherwise fall through to the RAG `/text` endpoint and lose their markdown formatting.

Extend `text.spec.ts` parametrized cases with `text/md`, parameterized MIME, uppercase, and whitespace-padded variants.
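The normalization step can be sketched along these lines. The helper name `normalizeMimeType` comes from the review thread, but the body below is an assumed implementation (strip parameters, trim, lowercase), and `isMarkdownMime` is a hypothetical wrapper added only for illustration.

```typescript
// Markdown MIME set after the review follow-up, including `text/md`.
const MARKDOWN_MIME_TYPES = new Set<string>([
  'text/markdown',
  'text/x-markdown',
  'text/md',
  'application/markdown',
  'application/x-markdown',
]);

// Assumed implementation: drop everything after the first `;`,
// trim surrounding whitespace, and lowercase the remaining type/subtype.
function normalizeMimeType(mimetype: string): string {
  const [base = ''] = mimetype.split(';');
  return base.trim().toLowerCase();
}

// Hypothetical wrapper: the falsy check guards against empty strings at runtime.
function isMarkdownMime(mimetype: string): boolean {
  if (!mimetype) {
    return false;
  }
  return MARKDOWN_MIME_TYPES.has(normalizeMimeType(mimetype));
}

console.log(normalizeMimeType('text/markdown; charset=utf-8')); // text/markdown
console.log(isMarkdownMime('  TEXT/MARKDOWN  ')); // true
console.log(isMarkdownMime('text/plain')); // false
```

Normalizing before the set lookup keeps the set itself small, since every parameterized or differently cased variant reduces to one canonical entry.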
Codex Review: Didn't find any major issues. Chef's kiss.
@codex review
Codex Review: Didn't find any major issues. Swish!
Addressing comprehensive review feedback:
- Debug log now includes filename and MIME type so operators can identify which upload triggered the short-circuit without having to correlate other logs.
- Expand markdown extension detection beyond `.md` / `.markdown` to cover `.mdown`, `.mkdn`, `.mkd`, `.mdwn` (case-insensitive regex).
- Tighten `normalizeMimeType` parameter type from `string | undefined` to `string` to match the actual Express.Multer.File type. The falsy-check still protects against empty strings at runtime.
- Extend parametrized tests with the most common real-world shapes: `text/plain` + `.md` (the MIME most browsers/servers assign), the new rare extensions, and empty MIME + `.md` (pure extension fallback path).
- Add a positive assertion that `readFileAsString` was called with the expected arguments on every short-circuit case, so tests fail loudly if the native-parse path ever regresses.
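A rough sketch of the widened extension check and the richer debug message follows. The constant name `MARKDOWN_EXTENSIONS_RE` appears in the commit messages, but the exact pattern is an assumption reconstructed from the listed extensions; `hasMarkdownExtension`, `describeShortCircuit`, and the message wording are hypothetical.

```typescript
// Assumed pattern: one alternation per extension named in the commit message,
// anchored to the end of the filename and case-insensitive.
const MARKDOWN_EXTENSIONS_RE = /\.(md|markdown|mdown|mkdn|mkd|mdwn)$/i;

// Hypothetical helper: extension-only fallback for uploads whose MIME type
// is generic (for example `text/plain`) or empty.
function hasMarkdownExtension(filename: string): boolean {
  return MARKDOWN_EXTENSIONS_RE.test(filename);
}

// Hypothetical debug message carrying filename and MIME type, so the upload
// that triggered the short-circuit can be identified from this line alone.
function describeShortCircuit(originalname: string, mimetype: string): string {
  const shownMime = mimetype || '(empty)';
  return `Markdown upload parsed natively: name=${originalname} mimetype=${shownMime}`;
}

console.log(hasMarkdownExtension('notes.mkd')); // true
console.log(hasMarkdownExtension('component.mdx')); // false
console.log(describeShortCircuit('README.md', 'text/plain'));
```

Anchoring the pattern with `$` means near-miss extensions such as `.mdx` do not match; whether those should also bypass the RAG API is outside what the PR describes.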
Every other alternation in `MARKDOWN_EXTENSIONS_RE` has at least one test case (`md`, `markdown`, `mdown`, `mkdn`, `mkd`) but `mdwn` did not, leaving a typo in that branch undetectable.
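The coverage gap described here can be checked mechanically. The sketch below is illustrative only and is not part of the PR's test suite: `findUncoveredExtensions` is a hypothetical helper, and the sample filenames are made up to mirror the situation before the `.mdwn` case was added.

```typescript
// Extensions named as alternations in the commit message.
const EXTENSIONS = ['md', 'markdown', 'mdown', 'mkdn', 'mkd', 'mdwn'] as const;

// Hypothetical helper: returns every extension that no test filename exercises.
function findUncoveredExtensions(
  extensions: readonly string[],
  filenames: readonly string[],
): string[] {
  return extensions.filter(
    (ext) => !filenames.some((name) => name.toLowerCase().endsWith(`.${ext}`)),
  );
}

// Made-up filenames standing in for the test table before the follow-up commit.
const before = ['notes.md', 'GUIDE.MARKDOWN', 'a.mdown', 'b.mkdn', 'c.mkd'];
console.log(findUncoveredExtensions(EXTENSIONS, before)); // [ 'mdwn' ]
console.log(findUncoveredExtensions(EXTENSIONS, [...before, 'd.mdwn'])); // []
```

A check like this only confirms that each branch has a sample filename; the parametrized test is still what verifies that the regex accepts it.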
Upload as Text…a#12734)
* 🐛 fix: Preserve Raw Markdown on `Upload as Text`
* 🐛 fix: normalize markdown MIME and accept `text/md`
* 🧹 chore: Address Code Review Follow-ups on `Upload as Text` fix
* 🧪 test: Cover `.mdwn` regex branch in Markdown short-circuit
Summary
Fixes #12731.
When `RAG_API_URL` is set, uploading a `.md` file via Upload as Text sent the file to the RAG API `/text` endpoint, which routes Markdown through `UnstructuredMarkdownLoader` and strips formatting (`#`, `**`, lists, blockquotes). A `.txt` file with identical bytes round-tripped verbatim, so two byte-identical uploads produced very different extracted text depending only on the file extension.
This PR adds a short-circuit in `parseText`: if the file is Markdown (by MIME type or `.md` / `.markdown` extension), it skips the RAG API call and reads the file verbatim via `parseTextNative`. This is the defensive fix on the LibreChat side; a matching loader fix in `danny-avila/rag_api` covers the server path as well.
The embedding path (`/embed`, used by file_search/RAG) is untouched and continues to use the semantic markdown loader, so vector search quality is unchanged.
Changes
- `packages/api/src/files/text.ts`: add `isMarkdownFile()` helper and short-circuit in `parseText`
- `packages/api/src/files/text.spec.ts`: cover each markdown MIME type, `.md` / `.MARKDOWN` by extension alone, and regression-check that non-markdown still hits the RAG API
Test plan
- `npx jest src/files/text.spec.ts`: 15/15 pass
- With `RAG_API_URL` set, upload a `.md` file with headings/bold/lists via Upload as Text; verify raw markdown reaches the model
- Upload a `.txt` with identical content; behavior should match
- `.md` still uses the semantic loader on the RAG API side (unchanged)