feat: add keyword argument support for train_from_stream (#114) #155

Open
Yegorov wants to merge 5 commits into cardmagic:master from Yegorov:114
Conversation

@Yegorov
Contributor

@Yegorov Yegorov commented May 21, 2026

Closes #114

@greptile-apps
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR extends train_from_stream across all four classifiers (Bayes, KNN, LogisticRegression, LSI) to accept multiple categories via keyword arguments (e.g. train_from_stream(spam: io1, ham: io2)), alongside the existing positional form. New tests cover the multi-category path and invalid-IO validation.

  • The LSI implementation moves the begin/ensure block inside the per-category loop, causing build_index to fire once per category rather than once after all streams have been loaded.
  • Making category and io optional without an explicit guard means callers who supply only one of the two arguments silently get a no-op instead of an error.
  • The Streaming base module stub was not updated to match the optional-positional signature now used by every implementation.

Confidence Score: 3/5

The multi-category keyword API is a useful addition, but the LSI implementation rebuilds the index after each category stream and the missing-argument case silently does nothing across all classifiers.

The LSI build_index regression means any caller who streams multiple categories while auto_rebuild is enabled will trigger redundant and increasingly expensive index rebuilds. The silent no-op for partial arguments removes a safety net that existed for free under the old required-parameter contract.

lib/classifier/lsi.rb needs the auto_rebuild guard hoisted outside the loop; lib/classifier/bayes.rb and logistic_regression.rb need an explicit guard for the partial-argument case; lib/classifier/streaming.rb base stub should be updated to match the optional signature.

Important Files Changed

| Filename | Overview |
| --- | --- |
| lib/classifier/lsi.rb | Moving the begin/ensure block inside the category loop causes build_index to fire after each individual stream instead of once after all categories, breaking the original optimisation. |
| lib/classifier/bayes.rb | Adds multi-category keyword support. Missing guard means partial argument combinations silently do nothing instead of raising. |
| lib/classifier/logistic_regression.rb | Same pattern as Bayes; same silent no-op concern applies. synchronize and dirty handling looks structurally correct inside the loop. |
| lib/classifier/knn.rb | Thin delegation to lsi.train_from_stream; correctly threads **categories through. |
| lib/classifier/streaming.rb | Base stub not updated to match the optional-positional signature adopted by all implementations; @rbs annotation is also inconsistent. |
| test/lsi/streaming_test.rb | Adds multi-category and invalid-IO tests; does not cover the auto_rebuild regression. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["train_from_stream(category=nil, io=nil, **categories)"] --> B{category && io?}
    B -- Yes --> C["{ category => io }"]
    B -- No --> D["categories hash"]
    C --> E[".each do |cat, stream|"]
    D --> E
    E --> F[Validate stream]
    F --> G[LineReader + Progress]
    G --> H[each_batch]
    H --> I[Train + update progress]
    I --> N{More batches?}
    N -- Yes --> H
    N -- No --> O{More categories?}
    O -- Yes --> E
    O -- No --> P[Done]
    subgraph LSI_BUG ["LSI only"]
        Q["save auto_rebuild, set false"]
        R["ensure: restore + build_index per-category"]
    end
    E -.-> Q
    Q --> G
    G -.-> R
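
The main path of the chart (normalize the arguments, validate each stream, read it in batches) can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the gem's code: `stream_batches` is a hypothetical helper, and the `each_line` check and `each_slice` batching stand in for the validation, `LineReader`, and `each_batch` steps, whose real implementations are not shown on this page.

```ruby
require 'stringio'

# Hypothetical helper (not part of the classifier gem): normalizes the two
# call forms into one { category => io } hash, validates each stream, and
# yields its lines in fixed-size batches.
def stream_batches(category = nil, io = nil, batch_size: 2, **categories)
  sources = category && io ? { category => io } : categories

  sources.each do |cat, stream|
    # Assumed validation rule: anything that can iterate lines counts as a stream.
    raise ArgumentError, "#{cat}: expected an IO-like object" unless stream.respond_to?(:each_line)

    stream.each_line(chomp: true).each_slice(batch_size) do |batch|
      yield cat, batch
    end
  end
end

seen = []
stream_batches(spam: StringIO.new("buy now\nfree money\nclick here\n"),
               ham: StringIO.new("hello\n")) do |cat, batch|
  seen << [cat, batch]
end
```

Each category is drained completely before the next one starts, which matches the "More batches?" and "More categories?" loops in the chart.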
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
lib/classifier/lsi.rb:667-688
**`build_index` called once per category instead of once after all streams**

The `begin/ensure` block (which saves/restores `@auto_rebuild` and calls `build_index`) is now inside the `.each` loop. When multiple categories are provided, `build_index` fires after every stream rather than once at the end. This defeats the original optimisation: if `auto_rebuild` was `true` before the call, you'll pay the full re-index cost for each category, and every intermediate rebuild immediately becomes obsolete. The fix is to hoist the `auto_rebuild` save, the `@auto_rebuild = false` assignment, and the `ensure` block outside the loop so the index is rebuilt exactly once after all categories have been loaded.
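
A minimal sketch of the hoisted form, using a toy class rather than the real `lib/classifier/lsi.rb`. `ToyIndex`, `train_all`, and the rule that `build_index` only runs when auto-rebuild was originally enabled are assumptions made for illustration; only the placement of `begin/ensure` outside the loop is the point.

```ruby
# Toy stand-in for the LSI classifier; only the rebuild bookkeeping is modeled.
class ToyIndex
  attr_reader :rebuilds, :items

  def initialize(auto_rebuild: true)
    @auto_rebuild = auto_rebuild
    @rebuilds = 0
    @items = []
  end

  def add_item(text, category)
    @items << [text, category]
    build_index if @auto_rebuild
  end

  def build_index
    @rebuilds += 1
  end

  def train_all(**categories)
    original = @auto_rebuild
    @auto_rebuild = false # suppress per-item rebuilds while loading
    begin
      categories.each do |category, lines|
        lines.each { |line| add_item(line, category) }
      end
    ensure
      # Runs once, after every category has been loaded (or on error).
      @auto_rebuild = original
      build_index if original
    end
  end
end

index = ToyIndex.new
index.train_all(spam: ['buy now', 'free money'], ham: ['hello'])
```

With the `begin/ensure` inside the `each` block instead, the same call would rebuild once per category.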

### Issue 2 of 3
lib/classifier/bayes.rb:332-333
**Silent no-op when only one of `category`/`io` is supplied**

When a caller passes only `category` (e.g. `train_from_stream(:spam, batch_size: 100)`) or only `io`, the ternary resolves to `categories` which is an empty hash, so the method returns without training or raising. Before this PR, the required positional parameters made such a call a Ruby `ArgumentError`. Consider an explicit guard to preserve that contract.

```suggestion
    def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories)
      raise ArgumentError, 'Provide either (category, io) or keyword category: io pairs' if category.nil? && io.nil? && categories.empty?
      raise ArgumentError, 'Provide both category and io, or use keyword arguments' if category.nil? ^ io.nil?

      (category && io ? { category => io } : categories).each do |category, io|
```
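
The effect of these guards can be tried in isolation. The sketch below uses a hypothetical `normalize_sources` helper that only returns the normalized hash and trains nothing; the error messages are taken from the suggestion, everything else is an assumption. As in the suggestion, a complete positional pair takes precedence over any keyword pairs.

```ruby
# Hypothetical helper mirroring the suggested guards; it returns the
# normalized { category => io } hash instead of training anything.
def normalize_sources(category = nil, io = nil, **categories)
  # Exactly one of the two positional arguments was given.
  if category.nil? ^ io.nil?
    raise ArgumentError, 'Provide both category and io, or use keyword arguments'
  end
  # Nothing was given at all.
  if category.nil? && categories.empty?
    raise ArgumentError, 'Provide either (category, io) or keyword category: io pairs'
  end

  category ? { category => io } : categories
end

positional = normalize_sources(:spam, $stdin)
keywords = normalize_sources(spam: $stdin, ham: $stderr)
```

Calling `normalize_sources(:spam)` or `normalize_sources` with no arguments raises `ArgumentError` instead of silently returning.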

### Issue 3 of 3
lib/classifier/streaming.rb:29-30
**Base method signature still requires `category` and `io` as positional arguments**

The `Streaming` module's stub still declares `category` and `io` as required, but every concrete implementation now makes them optional. A caller coding to the base interface would be incorrectly told these are mandatory. Update the stub and its `@rbs` annotation to match.

```suggestion
    # @rbs (?(Symbol | String), ?IO, ?batch_size: Integer, **Hash[Symbol, IO]) { (Progress) -> void } -> void
    def train_from_stream(category = nil, io = nil, batch_size: DEFAULT_BATCH_SIZE, **categories, &block)
```

Reviews (1): Last reviewed commit: "feat: add keyword argument support for `..."

Comment thread lib/classifier/lsi.rb Outdated
Comment thread lib/classifier/bayes.rb Outdated
Comment thread lib/classifier/streaming.rb Outdated
@Yegorov
Contributor Author

Yegorov commented May 21, 2026

@cardmagic can you take a look, please?

Owner

@cardmagic cardmagic left a comment

The keyword-arg API is a nice addition and the multi-category path works — issues are around the edges (validation ordering, state mutation, base-module rbs). Test suite passes locally (678 runs, 0 failures).

Note: Greptile's earlier review is stale — it flagged a "build_index per-category" regression and "silent no-op for partial args" that you already fixed in a later commit. Those no longer apply.

See inline comments. Must-fix: #1, #2, #3. Should-fix: #4. The rest are nits/suggestions.

Comment thread lib/classifier/lsi.rb Outdated
Comment thread lib/classifier/lsi.rb Outdated
Comment thread lib/classifier/streaming.rb Outdated
  #
- # @rbs (Symbol | String, IO, ?batch_size: Integer) { (Progress) -> void } -> void
- def train_from_stream(category, io, batch_size: DEFAULT_BATCH_SIZE, &block)
+ # @rbs (?(Symbol | String | nil), ?IO?, ?batch_size: Integer, **Hash[Symbol, IO]) { (Progress) -> void } -> void
Owner

@cardmagic cardmagic May 30, 2026

Must-fix #3 — base-module type signature mismatches the implementations.

In RBS, **T is the type of each rest-kwarg value. So:

  • **IO (all four implementations) = each kwarg value is an IO — matches the runtime: train_from_stream(spam: io1) passes an IO as the value
  • **Hash[Symbol, IO] (this base stub) = each kwarg value is itself a Hash[Symbol, IO] — would mean train_from_stream(spam: { foo: io })

The base stub should match:

# @rbs (?(Symbol | String | nil), ?IO?, ?batch_size: Integer, **IO) { (Progress) -> void } -> void

This may also let you drop the # @type var categories: untyped escape hatch in knn.rb:273.
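
A small runtime check of the point about `**T`, written as a sketch with a hypothetical `io_values?` method annotated in the same `# @rbs` comment style as the PR. It only shows what Ruby passes at runtime; how a type checker treats the annotation is not demonstrated here.

```ruby
# Hypothetical method, not from the gem. `**IO` in the annotation describes
# each keyword value; the collected `categories` is a Hash of those values.
# @rbs (**IO) -> Hash[Symbol, bool]
def io_values?(**categories)
  categories.transform_values { |value| value.is_a?(IO) }
end

# IO.pipe returns two real IO objects, convenient for a self-contained demo.
reader, writer = IO.pipe
result = io_values?(spam: reader, ham: writer)
collected_class = { spam: reader }.class
reader.close
writer.close
```

Here `result` maps every category to `true`, since each keyword value is itself an `IO`, while the hash that collects them is a plain `Hash`.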

Contributor Author

@cardmagic thanks, fixed!

This may also let you drop the # @type var categories: untyped escape hatch in knn.rb:273.

But the # @type var categories: untyped comment is still needed to avoid errors.

Comment thread lib/classifier/bayes.rb Outdated
Comment thread lib/classifier/bayes.rb Outdated
Comment thread lib/classifier/bayes.rb Outdated
Comment thread lib/classifier/knn.rb Outdated
- def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE, &block)
-   @lsi.train_from_stream(category, io, batch_size: batch_size, &block)
+ # @rbs (?(String | Symbol | nil), ?IO?, ?batch_size: Integer, **IO) { (Streaming::Progress) -> void } -> void
+ def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block)
Owner

Nit — inconsistent block forwarding.

KNN uses &block here but never references block by name; the other three classifiers all use anonymous & (Ruby 3.1+). Switch for consistency:

def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &)
  @lsi.train_from_stream(category, io, batch_size: batch_size, **categories, &)
  synchronize { @dirty = true }
end

Also: when must-fix #3 is fixed, the # @type var categories: untyped on line 273 may no longer be needed.
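
For reference, anonymous block forwarding in isolation looks like the sketch below. `Inner` and `Outer` are hypothetical classes standing in for the LSI and KNN roles; the example needs Ruby 3.1 or newer.

```ruby
# Hypothetical classes; only the block-forwarding pattern is the point.
class Inner
  def each_item(items)
    items.each { |item| yield item }
  end
end

class Outer
  def initialize
    @inner = Inner.new
  end

  # Anonymous block parameter (Ruby 3.1+): the block is forwarded to the
  # delegate without ever being given a name.
  def each_item(items, &)
    @inner.each_item(items, &)
  end
end

collected = []
Outer.new.each_item(%w[a b]) { |item| collected << item.upcase }
```

The named form `&block` behaves the same way here; the anonymous form just avoids introducing a name that is never used.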

Contributor Author

@cardmagic thanks, fixed!

Also: when must-fix #3 is fixed, the # @type var categories: untyped on line 273 may no longer be needed.

Unfortunately, this has no effect.

Comment thread lib/classifier/logistic_regression.rb Outdated
Comment thread test/bayes/streaming_test.rb
@Yegorov Yegorov requested a review from cardmagic May 31, 2026 18:37
Development

Successfully merging this pull request may close these issues.

feat: Add keyword argument support for train_from_stream

2 participants