
Fix Mistral tokenizer missing spaces in decode_stream (Issue #1822) #2211

Open
mrshibly wants to merge 3 commits into Lightning-AI:main from mrshibly:fix-mistral-tokenizer-spaces


mrshibly commented Mar 5, 2026


Summary
Fixes #1822, where litgpt chat generated text without spaces for Mistral and other HuggingFace tokenizers.

Changes
Unified Tokenizer.decode_stream to use a running-buffer approach for all backends. This ensures that BPE-based tokenizers correctly preserve leading spaces and context-dependent characters during streaming, which were previously stripped when each token was decoded individually.

Testing
Verified manually. The fix aligns the HuggingFace backend logic with the implementation already used for SentencePiece.
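
For readers unfamiliar with the running-buffer approach described under Changes, here is a minimal sketch of the idea. It is not the code from this PR: `toy_decode`, `VOCAB`, and both `decode_stream_*` functions are hypothetical names, and the toy decoder only imitates the common behavior of dropping the leading space at the start of a decoded sequence.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical toy vocabulary; "\u2581" marks a leading space, as in SentencePiece-style pieces.
VOCAB = {0: "\u2581Hello", 1: "\u2581world", 2: "!", 3: "\u2581How"}


def toy_decode(ids: List[int]) -> str:
    """Stand-in for a tokenizer's decode(): joins pieces and drops the
    leading space at the start of the sequence (assumed behavior)."""
    text = "".join(VOCAB[i] for i in ids).replace("\u2581", " ")
    return text[1:] if text.startswith(" ") else text


def decode_stream_naive(
    token_stream: Iterable[int], decode: Callable[[List[int]], str]
) -> Iterator[str]:
    """Decode each token on its own; every token looks like a sequence start,
    so its leading space is lost."""
    for token in token_stream:
        yield decode([token])


def decode_stream_buffered(
    token_stream: Iterable[int], decode: Callable[[List[int]], str]
) -> Iterator[str]:
    """Keep all tokens seen so far, decode the whole buffer each step,
    and emit only the text that was not emitted before."""
    buffer: List[int] = []
    decoded_so_far = ""
    for token in token_stream:
        buffer.append(token)
        decoded_new = decode(buffer)
        yield decoded_new[len(decoded_so_far):]
        decoded_so_far = decoded_new


if __name__ == "__main__":
    tokens = [0, 1, 2, 3]
    print("".join(decode_stream_naive(tokens, toy_decode)))     # Helloworld!How
    print("".join(decode_stream_buffered(tokens, toy_decode)))  # Hello world! How
```

The trade-off is that the buffered variant re-decodes the full sequence on every step, so its cost grows with the length of the generated text.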




Development

Successfully merging this pull request may close these issues.

chatting with mistral generates answer with no spaces
