feat: make ICU the default FTS tokenizer by Xuanwo · Pull Request #6968 · lance-format/lance

Xuanwo · 2026-05-27T19:46:33Z

This changes the default native FTS tokenizer from simple to icu, so new inverted indexes handle mixed-language text without requiring users to opt into multilingual tokenization. Existing indexes whose metadata has no tokenizer entry continue to resolve to simple, and builds without the ICU feature still fall back to simple.
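
The resolution rules described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function names `default_tokenizer` and `resolve_stored_tokenizer` and the `icu_available` flag are hypothetical, and the sketch assumes an explicitly stored tokenizer name is returned unchanged.

```python
from typing import Optional

# Tokenizer names as they appear in the PR description.
ICU = "icu"
SIMPLE = "simple"


def default_tokenizer(icu_available: bool) -> str:
    """Pick the tokenizer for a NEW index (hypothetical helper).

    New indexes default to ICU; builds compiled without the ICU
    feature fall back to simple.
    """
    return ICU if icu_available else SIMPLE


def resolve_stored_tokenizer(stored: Optional[str]) -> str:
    """Resolve the tokenizer for an EXISTING index (hypothetical helper).

    Legacy metadata with no tokenizer entry keeps resolving to simple,
    so the new default does not change how old indexes are read.
    Assumption: an explicitly stored name is returned as-is.
    """
    return SIMPLE if stored is None else stored
```

Keeping the two paths separate is what makes the change backward compatible in this sketch: only index creation consults the new default, while reading old metadata never does.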

Benchmark context from the 100M-row runs: on English-only data, recall was unchanged, ICU increased index build time by 15.4% and index size by 0.6%, and common-term latency was effectively flat. On mixed English/CJK/French/Thai/Japanese data, ICU recovered CJK/Japanese/Thai rare-term recall from 0.0 to 1.0, at a cost of +20.4% build time and +25.7% index size.
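
One plausible reading of the 0.0 to 1.0 recall jump, assuming the simple tokenizer splits only on whitespace and punctuation: a run of CJK text with no spaces becomes a single token, so a query for a term inside that run never matches. The toy sketch below illustrates that effect. Neither function is the actual Lance or ICU implementation; `toy_segmenting_tokenize` splits CJK ideographs one character at a time as a crude stand-in for ICU's word segmentation, and the sample text is made up.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Toy stand-in for a whitespace/punctuation tokenizer."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def _is_cjk(ch: str) -> bool:
    # CJK Unified Ideographs block only; enough for this illustration.
    return "\u4e00" <= ch <= "\u9fff"


def toy_segmenting_tokenize(text: str) -> list[str]:
    """Toy stand-in for a segmenting tokenizer (NOT real ICU behavior).

    CJK ideographs are emitted one per token; other runs are kept whole.
    """
    tokens: list[str] = []
    for run in re.findall(r"\w+", text):
        buf = ""
        for ch in run:
            if _is_cjk(ch):
                if buf:
                    tokens.append(buf.lower())
                    buf = ""
                tokens.append(ch)
            else:
                buf += ch
        if buf:
            tokens.append(buf.lower())
    return tokens


def matches(query: str, doc: str, tokenize) -> bool:
    """True if every query token appears among the document's tokens."""
    doc_tokens = set(tokenize(doc))
    return all(tok in doc_tokens for tok in tokenize(query))


doc = "Lance 全文检索很快"
found_simple = matches("检索", doc, simple_tokenize)
found_segmented = matches("检索", doc, toy_segmenting_tokenize)
```

With the toy simple tokenizer the whole CJK run is indexed as one token, so the query term is not found; with segmentation it is, while the English term matches either way.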

feat: make ICU the default FTS tokenizer

a3fb622

github-actions bot added the enhancement and python labels on May 27, 2026

Xuanwo wants to merge 1 commit into main from xuanwo/icu-default-fts-tokenizer
