Fix Mistral tokenizer missing spaces in decode_stream (Issue #1822) by mrshibly · Pull Request #2211 · Lightning-AI/litgpt

mrshibly · 2026-03-05T00:27:20Z

Summary: Fixes #1822, where litgpt chat generated text without spaces for Mistral and other HuggingFace tokenizers.

Changes: Unified Tokenizer.decode_stream to use a running-buffer approach for all backends. This ensures that BPE-based tokenizers correctly preserve leading spaces and context-dependent characters during streaming; these were previously stripped because each token was decoded individually.
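To illustrate the running-buffer idea described above, here is a minimal sketch. It is not the code from this PR: the vocabulary, `toy_decode`, and both `decode_stream_*` functions are hypothetical stand-ins, and `toy_decode` only assumes that the tokenizer drops the leading space of whatever sequence it is asked to decode, which is the behavior that loses spaces when tokens are decoded one at a time.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical toy vocabulary; "▁" marks a leading space, as in many BPE vocabularies.
TOY_VOCAB = {0: "▁Hello", 1: "▁world", 2: "!", 3: "▁How", 4: "▁are", 5: "▁you"}


def toy_decode(ids: List[int]) -> str:
    """Stand-in decoder (assumption): drops the leading space of the sequence."""
    text = "".join(TOY_VOCAB[i] for i in ids).replace("▁", " ")
    return text[1:] if text.startswith(" ") else text


def decode_stream_naive(
    ids: Iterable[int], decode: Callable[[List[int]], str]
) -> Iterator[str]:
    """Decode each token on its own; every token loses its leading space."""
    for token_id in ids:
        yield decode([token_id])


def decode_stream_buffered(
    ids: Iterable[int], decode: Callable[[List[int]], str]
) -> Iterator[str]:
    """Decode the whole buffer each step and emit only the newly added text."""
    buffer: List[int] = []
    emitted = ""
    for token_id in ids:
        buffer.append(token_id)
        text = decode(buffer)
        # Simplification: assumes the decoded text only grows by appending.
        yield text[len(emitted):]
        emitted = text


if __name__ == "__main__":
    ids = [0, 1, 2, 3, 4, 5]
    print("".join(decode_stream_naive(ids, toy_decode)))
    print("".join(decode_stream_buffered(ids, toy_decode)))
```

Re-decoding the full buffer on every step costs more than decoding a single token, but it lets the tokenizer see each token in context. A real implementation would also need to handle cases this sketch ignores, such as multi-byte characters split across tokens.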

Testing: Verified manually. The fix aligns the HuggingFace backend logic with the implementation already used for SentencePiece.

Fix Mistral tokenizer missing spaces in decode_stream (Issue Lightnin…

f90c740

…g-AI#1822)

mrshibly requested review from KaelanDt, andyland, k223kim, lantiga, lianakoleva and t-vi as code owners March 5, 2026 00:27

[pre-commit.ci] auto fixes from pre-commit.com hooks

fb3fd03

for more information, see https://pre-commit.ci

mrshibly commented Mar 5, 2026

lianakoleva approved these changes Mar 21, 2026

Merge branch 'main' into fix-mistral-tokenizer-spaces

843647f

mrshibly wants to merge 3 commits into Lightning-AI:main from mrshibly:fix-mistral-tokenizer-spaces