
Add a dict based tokenizer for fulltext.#24297

Open
fengttt wants to merge 4 commits into matrixorigin:main from fengttt:main

Conversation


@fengttt fengttt commented May 7, 2026

Use gojieba. #24271.

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24271

What this PR does / why we need it:

Add a dict-based fulltext tokenizer. This is a re-submission of #24272.



@XuPeng-SH XuPeng-SH left a comment


Review: Add a dict based tokenizer for fulltext (gojieba)

Summary

This PR adds Chinese word segmentation via gojieba to the fulltext search engine. It vendors dictionary files (~600K lines), refactors SimpleTokenizer to accept input at Tokenize() time (enabling a Tokenizer interface), and threads the parser name through the fulltext pipeline so both index-build and query tokenization use the same segmenter.

Re-submission of #24272 (closed).

Architecture

The design is sound:

  • Index build (fulltext_tokenize.go): uses SharedJiebaTokenizer(false) — dictionary-only, no HMM
  • Query parse (parsePatternInNLModeJieba): also uses SharedJiebaTokenizer(false) — consistent with index
  • ASCII spans: pre-split and routed through SimpleTokenizer's Latin path before handing CJK to gojieba — correct boundary handling
  • Shared singleton: process-wide sync.Once singletons, safe because gojieba's C++ read path (Tokenize) is thread-safe after init

Issues

1. (High) gojieba.NewJieba panics if dictionary files are missing — no graceful degradation

SharedJiebaTokenizer is called inside sync.Once. If resolveJiebaDictDir() returns "" (env var unset + source tree gone), gojieba.NewJieba receives empty paths and panics. This will crash the entire MO process on first fulltext query against a gojieba index.

Suggestion: validate paths before calling NewJieba; return an error (or at minimum log a fatal) instead of letting the panic propagate uncontrolled. Consider making SharedJiebaTokenizer return (*JiebaTokenizer, error).

2. (Medium) ~600K lines of dictionary data vendored directly in the repo

jieba.dict.utf8 (349K lines) and idf.utf8 (259K lines) add ~20MB to the repo. Every clone pays this cost forever. Consider:

  • Git LFS for the dict files
  • A build-time download step (e.g., go generate)
  • Or at minimum, document the size tradeoff in the PR description

3. (Medium) resolveJiebaDictDir fallback via runtime.Caller(0) is fragile

The fallback path uses the source file's compile-time location to find ./dict. This only works when the source tree is still on disk (i.e., go test in a dev checkout). In production:

  • Docker: handled via MO_JIEBA_DICT_DIR env var ✓
  • Bare-metal deployments: if the env var isn't set, the fallback silently returns "" → panic

Suggestion: document that MO_JIEBA_DICT_DIR is required for production deployments, or embed the dicts at compile time.

4. (Low) Typo in error message

return nil, moerr.NewInternalErrorNoCtx("Invalid input search string.  search string onverted to empty pattern")

"onverted" → "converted"

5. (Low) Misleading docstring on SharedJiebaTokenizer

The comment says "true at query time: HMM new-word discovery broadens recall" but the actual implementation uses false at both index and query time (for good reason — consistency). The docstring describes aspirational usage, not actual usage, which could confuse future readers.

6. (Low) SimpleTokenizer is now stateful across Tokenize() calls

Tokenize() mutates t.input, t.begin, etc. This makes it unsafe to reuse concurrently. Currently all callers create fresh instances, but the Tokenizer interface doesn't document this constraint. Consider either:

  • Adding a doc comment on the interface noting "not safe for concurrent use"
  • Or making Tokenize use only local state (pass through the closure without storing on the struct)

Positive Aspects

  • HMM-on/off decision is well-reasoned and consistent between index and query
  • Pre-splitting ASCII spans before gojieba avoids the known issue of gojieba fragmenting English words into single characters
  • Comprehensive integration tests covering NL mode, boolean mode, Chinese queries, update/delete/re-insert cycles
  • truncateUTF8 correctly handles multi-byte boundary truncation
  • Dockerfile changes properly set up the env var for production
  • The Tokenizer interface is a clean abstraction enabling future tokenizer backends

fengttt and others added 2 commits May 7, 2026 09:01
Make SimpleTokenizer stateless by moving per-call state into a closure-
local struct, so a single instance is safe to share across goroutines.
Change the Tokenizer interface to iter.Seq2[Token, error] so impls can
report errors before or during iteration. Convert gojieba's missing-
dictionary panic into a returned error in NewJiebaTokenizer and
SharedJiebaTokenizer. Also fix the "search string onverted" typo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

kind/feature, size/XXL (denotes a PR that changes 2000+ lines)


Development

Successfully merging this pull request may close these issues.

5 participants