Add a dict-based tokenizer for fulltext #24297
fengttt wants to merge 4 commits into matrixorigin:main from
Conversation
XuPeng-SH
left a comment
Review: Add a dict based tokenizer for fulltext (gojieba)
Summary
This PR adds Chinese word segmentation via gojieba to the fulltext search engine. It vendors dictionary files (~600K lines), refactors SimpleTokenizer to accept input at Tokenize() time (enabling a Tokenizer interface), and threads the parser name through the fulltext pipeline so both index-build and query tokenization use the same segmenter.
Re-submission of #24272 (closed).
Architecture
The design is sound:
- Index build (fulltext_tokenize.go): uses SharedJiebaTokenizer(false) — dictionary-only, no HMM
- Query parse (parsePatternInNLModeJieba): also uses SharedJiebaTokenizer(false) — consistent with the index
- ASCII spans: pre-split and routed through SimpleTokenizer's Latin path before handing CJK text to gojieba — correct boundary handling
- Shared singleton: process-wide sync.Once singletons, safe because gojieba's C++ read path (Tokenize) is thread-safe after init
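The ASCII pre-splitting step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code: the span type and splitASCIISpans name are invented here.

```go
package main

// Hypothetical sketch of the pre-splitting step: scan the input once and
// partition it into ASCII runs (handed to SimpleTokenizer's Latin path) and
// non-ASCII runs (handed to the CJK segmenter), preserving byte boundaries.
type span struct {
	Text  string
	ASCII bool
}

func splitASCIISpans(s string) []span {
	var spans []span
	begin := 0
	cur := false
	first := true
	for i, r := range s {
		a := r < 0x80 // ASCII runes are a single byte
		if first {
			cur, first = a, false
			continue
		}
		if a != cur {
			// rune class changed: close the previous span at this byte offset
			spans = append(spans, span{s[begin:i], cur})
			begin, cur = i, a
		}
	}
	if !first {
		spans = append(spans, span{s[begin:], cur})
	}
	return spans
}
```

Splitting on byte offsets from `range` over the string keeps multi-byte runes intact, which is exactly the boundary-handling property the review credits the PR with.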
Issues
1. (High) gojieba.NewJieba panics if dictionary files are missing — no graceful degradation
SharedJiebaTokenizer is called inside sync.Once. If resolveJiebaDictDir() returns "" (env var unset + source tree gone), gojieba.NewJieba receives empty paths and panics. This will crash the entire MO process on first fulltext query against a gojieba index.
Suggestion: validate paths before calling NewJieba; return an error (or at minimum log a fatal) instead of letting the panic propagate uncontrolled. Consider making SharedJiebaTokenizer return (*JiebaTokenizer, error).
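A minimal sketch of the suggested error-returning singleton. The jiebaHandle type and newJieba function below are stand-ins for gojieba's *gojieba.Jieba and gojieba.NewJieba (which panics rather than returning an error); the point is caching both the instance and the first-init error under sync.Once.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// jiebaHandle stands in for *gojieba.Jieba in this sketch.
type jiebaHandle struct{ dictDir string }

var (
	jiebaOnce sync.Once
	jiebaInst *jiebaHandle
	jiebaErr  error
)

// newJieba validates the dictionary files up front, where the real
// gojieba.NewJieba would panic on a missing file.
func newJieba(dictDir string) (*jiebaHandle, error) {
	for _, f := range []string{"jieba.dict.utf8", "hmm_model.utf8", "idf.utf8"} {
		if _, err := os.Stat(filepath.Join(dictDir, f)); err != nil {
			return nil, fmt.Errorf("jieba dict file missing: %w", err)
		}
	}
	return &jiebaHandle{dictDir: dictDir}, nil
}

// SharedJiebaTokenizer returns the process-wide singleton, or the error
// recorded on first initialization; later calls see the same cached result
// instead of crashing the process.
func SharedJiebaTokenizer(dictDir string) (*jiebaHandle, error) {
	jiebaOnce.Do(func() {
		if dictDir == "" {
			jiebaErr = errors.New("jieba dictionary dir not set (MO_JIEBA_DICT_DIR)")
			return
		}
		jiebaInst, jiebaErr = newJieba(dictDir)
	})
	return jiebaInst, jiebaErr
}
```

Callers on the fulltext query path can then surface the error as a normal moerr instead of a process crash.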
2. (Medium) ~600K lines of dictionary data vendored directly in the repo
jieba.dict.utf8 (349K lines) and idf.utf8 (259K lines) add ~20MB to the repo. Every clone pays this cost forever. Consider:
- Git LFS for the dict files
- A build-time download step (e.g., go generate)
- At minimum, document the size tradeoff in the PR description
3. (Medium) resolveJiebaDictDir fallback via runtime.Caller(0) is fragile
The fallback path uses the source file's compile-time location to find ./dict. This only works when the source tree is still on disk (i.e., go test in a dev checkout). In production:
- Docker: handled via the MO_JIEBA_DICT_DIR env var ✓
- Bare-metal deployments: if the env var isn't set, the fallback silently returns "" → panic
Suggestion: document that MO_JIEBA_DICT_DIR is required for production deployments, or embed the dicts at compile time.
4. (Low) Typo in error message
return nil, moerr.NewInternalErrorNoCtx("Invalid input search string. search string onverted to empty pattern")
"onverted" → "converted"
5. (Low) Misleading docstring on SharedJiebaTokenizer
The comment says "true at query time: HMM new-word discovery broadens recall" but the actual implementation uses false at both index and query time (for good reason — consistency). The docstring describes aspirational usage, not actual usage, which could confuse future readers.
6. (Low) SimpleTokenizer is now stateful across Tokenize() calls
Tokenize() mutates t.input, t.begin, etc. This makes it unsafe to reuse concurrently. Currently all callers create fresh instances, but the Tokenizer interface doesn't document this constraint. Consider either:
- Adding a doc comment on the interface noting "not safe for concurrent use"
- Or making Tokenize use only local state (pass through the closure without storing on the struct)
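The second option might look like the following sketch: a minimal whitespace tokenizer whose per-call cursor lives on the stack of Tokenize, so the zero-field struct is trivially safe to share across goroutines. Token and the callback signature are simplified stand-ins for the PR's types.

```go
package main

import "unicode"

// Token is a simplified stand-in for the PR's token type.
type Token struct {
	Text string
	Pos  int // rune offset of the token start
}

// SimpleTokenizer has no mutable fields, so one instance is safe for
// concurrent use: all per-call state is local to Tokenize.
type SimpleTokenizer struct{}

func (SimpleTokenizer) Tokenize(input string, yield func(Token) bool) {
	// cursor lives on this call's stack, never on the struct
	type cursor struct{ begin, pos int }
	c := cursor{}
	runes := []rune(input)
	flush := func(end int) bool {
		if end > c.begin {
			ok := yield(Token{Text: string(runes[c.begin:end]), Pos: c.begin})
			c.begin = end
			return ok
		}
		c.begin = end
		return true
	}
	for c.pos = 0; c.pos < len(runes); c.pos++ {
		if unicode.IsSpace(runes[c.pos]) {
			if !flush(c.pos) {
				return
			}
			c.begin = c.pos + 1
		}
	}
	flush(len(runes))
}
```

With this shape, the shared-singleton pattern used for the jieba tokenizer would also be safe for SimpleTokenizer.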
Positive Aspects
- HMM-on/off decision is well-reasoned and consistent between index and query
- Pre-splitting ASCII spans before gojieba avoids the known issue of gojieba fragmenting English words into single characters
- Comprehensive integration tests covering NL mode, boolean mode, Chinese queries, update/delete/re-insert cycles
- truncateUTF8 correctly handles multi-byte boundary truncation
- Dockerfile changes properly set up the env var for production
- The Tokenizer interface is a clean abstraction enabling future tokenizer backends
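A sketch of what such boundary-safe truncation typically looks like; the real truncateUTF8 signature in the PR may differ.

```go
package main

import "unicode/utf8"

// truncateUTF8 cuts s to at most max bytes, then backs up so the cut
// never lands inside a multi-byte UTF-8 sequence (continuation bytes
// have the form 0b10xxxxxx, which utf8.RuneStart rejects).
func truncateUTF8(s string, max int) string {
	if len(s) <= max {
		return s
	}
	cut := max
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}
```

Without the back-up loop, a cut inside a CJK character would leave a trailing invalid byte sequence in the stored token.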
Make SimpleTokenizer stateless by moving per-call state into a closure-local struct, so a single instance is safe to share across goroutines. Change the Tokenizer interface to iter.Seq2[Token, error] so implementations can report errors before or during iteration. Convert gojieba's missing-dictionary panic into a returned error in NewJiebaTokenizer and SharedJiebaTokenizer. Also fix the "search string onverted" typo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Use gojieba. #24271.
What type of PR is this?
Which issue(s) this PR fixes:
issue #24271
What this PR does / why we need it:
Add a dict-based fulltext tokenizer. This is a re-submission of #24272.