
Add a dict based tokenizer for fulltext.#24297

Open
fengttt wants to merge 4 commits into matrixorigin:main from fengttt:main

Conversation


@fengttt fengttt commented May 7, 2026

Use gojieba. #24271.

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24271

What this PR does / why we need it:

Add a dict-based fulltext tokenizer. This is a re-submission of #24272.



@XuPeng-SH XuPeng-SH left a comment


Review: Add a dict based tokenizer for fulltext (gojieba)

Summary

This PR adds Chinese word segmentation via gojieba to the fulltext search engine. It vendors dictionary files (~600K lines), refactors SimpleTokenizer to accept input at Tokenize() time (enabling a Tokenizer interface), and threads the parser name through the fulltext pipeline so both index-build and query tokenization use the same segmenter.

Re-submission of #24272 (closed).

Architecture

The design is sound:

  • Index build (fulltext_tokenize.go): uses SharedJiebaTokenizer(false) — dictionary-only, no HMM
  • Query parse (parsePatternInNLModeJieba): also uses SharedJiebaTokenizer(false) — consistent with index
  • ASCII spans: pre-split and routed through SimpleTokenizer's Latin path before handing CJK to gojieba — correct boundary handling
  • Shared singleton: process-wide sync.Once singletons, safe because gojieba's C++ read path (Tokenize) is thread-safe after init

Issues

1. (High) gojieba.NewJieba panics if dictionary files are missing — no graceful degradation

SharedJiebaTokenizer is called inside sync.Once. If resolveJiebaDictDir() returns "" (env var unset + source tree gone), gojieba.NewJieba receives empty paths and panics. This will crash the entire MO process on first fulltext query against a gojieba index.

Suggestion: validate paths before calling NewJieba; return an error (or at minimum log a fatal) instead of letting the panic propagate uncontrolled. Consider making SharedJiebaTokenizer return (*JiebaTokenizer, error).

2. (Medium) ~600K lines of dictionary data vendored directly in the repo

jieba.dict.utf8 (349K lines) and idf.utf8 (259K lines) add ~20MB to the repo. Every clone pays this cost forever. Consider:

  • Git LFS for the dict files
  • A build-time download step (e.g., go generate)
  • Or at minimum, document the size tradeoff in the PR description

3. (Medium) resolveJiebaDictDir fallback via runtime.Caller(0) is fragile

The fallback path uses the source file's compile-time location to find ./dict. This only works when the source tree is still on disk (i.e., go test in a dev checkout). In production:

  • Docker: handled via MO_JIEBA_DICT_DIR env var ✓
  • Bare-metal deployments: if the env var isn't set, the fallback silently returns "" → panic

Suggestion: document that MO_JIEBA_DICT_DIR is required for production deployments, or embed the dicts at compile time.

4. (Low) Typo in error message

return nil, moerr.NewInternalErrorNoCtx("Invalid input search string.  search string onverted to empty pattern")

"onverted" → "converted"

5. (Low) Misleading docstring on SharedJiebaTokenizer

The comment says "true at query time: HMM new-word discovery broadens recall" but the actual implementation uses false at both index and query time (for good reason — consistency). The docstring describes aspirational usage, not actual usage, which could confuse future readers.

6. (Low) SimpleTokenizer is now stateful across Tokenize() calls

Tokenize() mutates t.input, t.begin, etc. This makes it unsafe to reuse concurrently. Currently all callers create fresh instances, but the Tokenizer interface doesn't document this constraint. Consider either:

  • Adding a doc comment on the interface noting "not safe for concurrent use"
  • Or making Tokenize use only local state (pass through the closure without storing on the struct)

Positive Aspects

  • HMM-on/off decision is well-reasoned and consistent between index and query
  • Pre-splitting ASCII spans before gojieba avoids the known issue of gojieba fragmenting English words into single characters
  • Comprehensive integration tests covering NL mode, boolean mode, Chinese queries, update/delete/re-insert cycles
  • truncateUTF8 correctly handles multi-byte boundary truncation
  • Dockerfile changes properly set up the env var for production
  • The Tokenizer interface is a clean abstraction enabling future tokenizer backends

fengttt and others added 2 commits May 7, 2026 09:01
Make SimpleTokenizer stateless by moving per-call state into a closure-
local struct, so a single instance is safe to share across goroutines.
Change the Tokenizer interface to iter.Seq2[Token, error] so impls can
report errors before or during iteration. Convert gojieba's missing-
dictionary panic into a returned error in NewJiebaTokenizer and
SharedJiebaTokenizer. Also fix the "search string onverted" typo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

kind/feature, size/XXL (denotes a PR that changes 2000+ lines)


Development

Successfully merging this pull request may close these issues.

5 participants