
Commit 1225a3f

danny-avila authored and krgokul committed
📝 fix: Preserve Raw Markdown Formatting on Upload as Text (danny-avila#12734)
* 🐛 fix: Preserve Raw Markdown on `Upload as Text`

  When `RAG_API_URL` is configured, `.md` uploads were sent to the RAG API `/text` endpoint, which routes Markdown through `UnstructuredMarkdownLoader` and strips formatting (`#`, `**`, lists, blockquotes). Users expect `Upload as Text` to preserve raw content - identical bytes in a `.txt` file round-trip verbatim, while the `.md` came back stripped.

  Short-circuit the RAG API call for Markdown files (by MIME type or `.md` / `.markdown` extension) and read the file verbatim via `parseTextNative`. Non-Markdown paths are unaffected, and the embedding path (`/embed`) keeps its existing loader so vector search quality is unchanged.

* 🐛 fix: normalize markdown MIME and accept `text/md`

  Addressing review feedback on the `Upload as Text` short-circuit:

  - Accept `text/md` in the markdown MIME set (LibreChat treats it as a valid markdown type elsewhere, e.g. the artifact-rendering prompt).
  - Normalize the incoming MIME type (lowercase + strip parameters) before the set lookup so parameterized values like `text/markdown; charset=utf-8` and uppercase `TEXT/MARKDOWN` still short-circuit. Extensionless uploads relying only on the `Content-Type` header would otherwise fall through to the RAG `/text` endpoint and lose their markdown formatting.

  Extend `text.spec.ts` parametrized cases with `text/md`, parameterized MIME, uppercase, and whitespace-padded variants.

* 🧹 chore: Address Code Review Follow-ups on `Upload as Text` fix

  Addressing comprehensive review feedback:

  - Debug log now includes filename and MIME type so operators can identify which upload triggered the short-circuit without having to correlate other logs.
  - Expand markdown extension detection beyond `.md` / `.markdown` to cover `.mdown`, `.mkdn`, `.mkd`, `.mdwn` (case-insensitive regex).
  - Tighten `normalizeMimeType` parameter type from `string | undefined` to `string` to match the actual Express.Multer.File type. The falsy-check still protects against empty strings at runtime.
  - Extend parametrized tests with the most common real-world shapes: `text/plain` + `.md` (the MIME most browsers/servers assign), the new rare extensions, and empty MIME + `.md` (pure extension fallback path).
  - Add a positive assertion that `readFileAsString` was called with the expected arguments on every short-circuit case, so tests fail loudly if the native-parse path ever regresses.

* 🧪 test: Cover `.mdwn` regex branch in Markdown short-circuit

  Every other alternation in `MARKDOWN_EXTENSIONS_RE` has at least one test case (`md`, `markdown`, `mdown`, `mkdn`, `mkd`) but `mdwn` did not, leaving a typo in that branch undetectable.
1 parent 034f2ef commit 1225a3f

2 files changed

Lines changed: 101 additions & 0 deletions


packages/api/src/files/text.spec.ts

Lines changed: 68 additions & 0 deletions
@@ -300,5 +300,73 @@ describe('text', () => {
         source: FileSources.text,
       });
     });
+
+    it.each([
+      { mimetype: 'text/markdown', originalname: 'notes.md' },
+      { mimetype: 'text/x-markdown', originalname: 'notes.md' },
+      { mimetype: 'text/md', originalname: 'notes' },
+      { mimetype: 'application/markdown', originalname: 'notes.md' },
+      { mimetype: 'application/x-markdown', originalname: 'notes.md' },
+      { mimetype: 'text/plain', originalname: 'notes.md' },
+      { mimetype: 'application/octet-stream', originalname: 'README.md' },
+      { mimetype: 'application/octet-stream', originalname: 'GUIDE.MARKDOWN' },
+      { mimetype: 'application/octet-stream', originalname: 'post.mdown' },
+      { mimetype: 'application/octet-stream', originalname: 'post.mkdn' },
+      { mimetype: 'application/octet-stream', originalname: 'post.mkd' },
+      { mimetype: 'application/octet-stream', originalname: 'docs.mdwn' },
+      { mimetype: 'text/markdown; charset=utf-8', originalname: 'notes' },
+      { mimetype: 'TEXT/MARKDOWN', originalname: 'notes' },
+      { mimetype: ' text/markdown ; charset=UTF-8 ', originalname: 'notes' },
+      { mimetype: '', originalname: 'notes.md' },
+    ])(
+      'should short-circuit to native parsing for markdown file (%o)',
+      async ({ mimetype, originalname }) => {
+        process.env.RAG_API_URL = 'http://rag-api.test';
+        const mockText = '# Heading\n\n**bold** text';
+        const mockBytes = Buffer.byteLength(mockText, 'utf8');
+
+        mockedReadFileAsString.mockResolvedValue({
+          content: mockText,
+          bytes: mockBytes,
+        });
+
+        const result = await parseText({
+          req: mockReq,
+          file: { ...mockFile, mimetype, originalname },
+          file_id: mockFileId,
+        });
+
+        expect(mockedAxios.get).not.toHaveBeenCalled();
+        expect(mockedAxios.post).not.toHaveBeenCalled();
+        expect(mockedReadFileAsString).toHaveBeenCalledWith('/tmp/test.txt', {
+          fileSize: 100,
+        });
+        expect(result).toEqual({
+          text: mockText,
+          bytes: mockBytes,
+          source: FileSources.text,
+        });
+      },
+    );
+
+    it('should still call the RAG API for non-markdown text files', async () => {
+      process.env.RAG_API_URL = 'http://rag-api.test';
+      const mockText = 'plain text content';
+
+      mockedAxios.get.mockResolvedValue({ status: 200, statusText: 'OK' });
+      mockedAxios.post.mockResolvedValue({ data: { text: mockText } });
+
+      await parseText({
+        req: mockReq,
+        file: mockFile,
+        file_id: mockFileId,
+      });
+
+      expect(mockedAxios.post).toHaveBeenCalledWith(
+        'http://rag-api.test/text',
+        expect.any(Object),
+        expect.objectContaining({ timeout: 300000 }),
+      );
+    });
   });
 });

packages/api/src/files/text.ts

Lines changed: 33 additions & 0 deletions
@@ -7,6 +7,32 @@ import type { ServerRequest } from '~/types';
 import { logAxiosError, readFileAsString } from '~/utils';
 import { generateShortLivedToken } from '~/crypto/jwt';
 
+const MARKDOWN_MIME_TYPES = new Set([
+  'text/markdown',
+  'text/x-markdown',
+  'text/md',
+  'application/markdown',
+  'application/x-markdown',
+]);
+
+const MARKDOWN_EXTENSIONS_RE = /\.(md|markdown|mdown|mkdn|mkd|mdwn)$/i;
+
+function normalizeMimeType(mimetype: string): string {
+  if (!mimetype) {
+    return '';
+  }
+  const semi = mimetype.indexOf(';');
+  const base = semi === -1 ? mimetype : mimetype.slice(0, semi);
+  return base.trim().toLowerCase();
+}
+
+function isMarkdownFile(file: Express.Multer.File): boolean {
+  if (MARKDOWN_MIME_TYPES.has(normalizeMimeType(file.mimetype))) {
+    return true;
+  }
+  return MARKDOWN_EXTENSIONS_RE.test(file.originalname ?? '');
+}
+
 /**
  * Attempts to parse text using RAG API, falls back to native text parsing
  * @param params - The parameters object
@@ -29,6 +55,13 @@ export async function parseText({
     return parseTextNative(file);
   }
 
+  if (isMarkdownFile(file)) {
+    logger.debug(
+      `[parseText] Markdown file detected (${file.originalname}, ${file.mimetype}), using native parsing to preserve raw formatting`,
+    );
+    return parseTextNative(file);
+  }
+
   const userId = req.user?.id;
   if (!userId) {
     logger.debug('[parseText] No user ID provided, falling back to native text parsing');
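
To make the detection rules in the `text.ts` diff easier to check by hand, here is a standalone sketch that can be run outside the repository. It copies the MIME set, the extension regex, and the two helpers from the diff. `UploadLike` is a hypothetical stand-in for the two `Express.Multer.File` fields the check reads, and the `samples` list (including `archive.md.zip`) is illustrative and not part of the commit's test suite.

```typescript
// Hypothetical stand-in for the Express.Multer.File fields used by the check.
interface UploadLike {
  mimetype: string;
  originalname?: string;
}

const MARKDOWN_MIME_TYPES = new Set([
  'text/markdown',
  'text/x-markdown',
  'text/md',
  'application/markdown',
  'application/x-markdown',
]);

// Anchored at the end of the name, so only the final extension counts.
const MARKDOWN_EXTENSIONS_RE = /\.(md|markdown|mdown|mkdn|mkd|mdwn)$/i;

// Drop parameters such as "; charset=utf-8", then trim and lowercase.
function normalizeMimeType(mimetype: string): string {
  if (!mimetype) {
    return '';
  }
  const semi = mimetype.indexOf(';');
  const base = semi === -1 ? mimetype : mimetype.slice(0, semi);
  return base.trim().toLowerCase();
}

// MIME type is checked first; the file extension is the fallback.
function isMarkdownFile(file: UploadLike): boolean {
  if (MARKDOWN_MIME_TYPES.has(normalizeMimeType(file.mimetype))) {
    return true;
  }
  return MARKDOWN_EXTENSIONS_RE.test(file.originalname ?? '');
}

// Illustrative inputs only; the last two are expected to go to the RAG API path.
const samples: UploadLike[] = [
  { mimetype: ' text/markdown ; charset=UTF-8 ', originalname: 'notes' },
  { mimetype: 'text/plain', originalname: 'notes.md' },
  { mimetype: '', originalname: 'docs.mdwn' },
  { mimetype: 'text/plain', originalname: 'notes.txt' },
  { mimetype: 'application/octet-stream', originalname: 'archive.md.zip' },
];

for (const sample of samples) {
  console.log(JSON.stringify(sample.mimetype), sample.originalname, '->', isMarkdownFile(sample));
}
```

The first three samples are classified as Markdown (by normalized MIME type, by extension despite a `text/plain` MIME, and by extension with an empty MIME). The last two are not, because the regex is anchored to the end of the filename and neither MIME type is in the set.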
