feat: make ICU the default FTS tokenizer #6968

Draft
Xuanwo wants to merge 1 commit into main from xuanwo/icu-default-fts-tokenizer

Xuanwo (Collaborator) commented May 27, 2026

This changes the default native FTS tokenizer from simple to icu, so new inverted indexes handle mixed-language text without users having to opt into multilingual tokenization. Existing indexes whose metadata has no tokenizer entry continue to resolve to simple, and builds without the ICU feature still fall back to simple.
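The resolution rules described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names, the `icu_available` flag, and the `"tokenizer"` metadata key are hypothetical stand-ins for whatever the implementation actually uses.

```python
from typing import Mapping

ICU = "icu"
SIMPLE = "simple"


def default_tokenizer_for_new_index(icu_available: bool) -> str:
    """Pick the default tokenizer for a newly created inverted index.

    `icu_available` is a hypothetical flag standing in for "this build
    was compiled with the ICU feature".
    """
    # New default is ICU; builds without the ICU feature fall back to simple.
    return ICU if icu_available else SIMPLE


def tokenizer_from_metadata(metadata: Mapping[str, str]) -> str:
    """Resolve the tokenizer recorded in an existing index's metadata.

    The "tokenizer" key is a hypothetical name for the stored setting.
    """
    # Older indexes that never recorded a tokenizer keep resolving to
    # simple, so their behavior does not change when the default does.
    return metadata.get("tokenizer", SIMPLE)
```

The point of keeping the two paths separate is that changing the default for new indexes must not reinterpret indexes that were built before the tokenizer was recorded.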

Benchmark context from the 100M-row runs: on English-only data, recall was unchanged, ICU index build time rose 15.4%, index size rose 0.6%, and common-term latency was effectively flat. On mixed English/CJK/French/Thai/Japanese data, ICU raised rare-term recall for CJK, Japanese, and Thai from 0.0 to 1.0, at a cost of +20.4% build time and +25.7% index size.
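For readers comparing these figures, the sketch below shows how a relative overhead and a recall value of this kind are conventionally computed. The helper names and all inputs are made up for illustration; they are not the benchmark's measurements or harness.

```python
from typing import Set


def relative_change_pct(baseline: float, candidate: float) -> float:
    """Percentage change of `candidate` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def recall(retrieved: Set[int], relevant: Set[int]) -> float:
    """Fraction of relevant row ids that the query returned."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    return len(retrieved & relevant) / len(relevant)


# Hypothetical build times in seconds, chosen only so the result is +15.4%.
overhead = round(relative_change_pct(1000.0, 1154.0), 1)

# Hypothetical rare-term query: no relevant rows found vs. all of them found.
recall_miss = recall(set(), {1, 2, 3})
recall_full = recall({1, 2, 3}, {1, 2, 3})
```

A recall of 0.0 means the query returned none of the rows that contain the term, which is the failure mode reported for rare CJK, Japanese, and Thai terms under the simple tokenizer.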

github-actions bot added the enhancement and python labels May 27, 2026