
fix(memstreamer): park Recv on sync.Cond instead of busy-spin #196

Open
andrewwormald wants to merge 1 commit into main from fix-memstreamer-busy-spin

Conversation

@andrewwormald

@andrewwormald andrewwormald commented Jun 25, 2026

Collaborator

Summary

adapters/memstreamer/(*Stream).Recv polls the in-memory log in an unconditional for { ... continue } loop with no backoff. Each iteration acquires and releases s.mu and cursorStore.mu. With N step consumers per workflow, all N goroutines stay permanently runnable and contend on those two mutexes, even when there is nothing to deliver.

In a downstream daemon (everflow) using memstreamer as the production EventStreamer (durability already provided by a sqlite-backed RecordStore + transactional outbox), this pinned an Apple Silicon laptop at ~380% CPU for 15+ hours, even after the single in-flight Run had already reached a terminal status and the event log was no longer growing. A SIGQUIT goroutine dump showed all step consumers in runnable state inside (*cursorStore).Get → Mutex.lockSlow.

This PR replaces the busy loop with sync.Cond parking: Recv blocks on cond.Wait when there's nothing to deliver and is woken by cond.Broadcast on Send.

What changed

  • adapters/memstreamer/memstreamer.go

    • Added cond *sync.Cond to the shared Stream struct, constructed once in New() keyed off the existing shared s.mu. Sender + receiver Streams share the same cond pointer (same wiring pattern as mu and log).
    • Send calls s.cond.Broadcast() while still holding s.mu — the natural place after appending.
    • Recv holds s.mu across the for-loop; when the log is exhausted it calls s.cond.Wait() instead of continue. Topic-skip still advances the cursor and loops without parking.
    • A per-call watcher goroutine selects on <-ctx.Done() vs <-stop; on ctx done it locks s.mu, broadcasts, unlocks. defer close(stop) ensures the watcher exits when Recv returns — no per-call leak.
  • adapters/memstreamer/memstreamer_test.go — three new regression tests:

    • TestRecv_DoesNotBusySpin — 8 parked receivers; asserts runtime.NumGoroutine() stable over 200 ms AND dumps runtime.Stack to assert at least 8 stacks point into memstreamer.(*Stream).Recv and at least one is in sync.runtime_notifyListWait / sync.(*Cond).Wait. The stack-shape assertion is what actually distinguishes spin from park.
    • TestRecv_WakesOnSend — parked Recv returns within 100 ms of a Send.
    • TestRecv_WakesOnCtxCancel — parked Recv returns ctx.Err() within 100 ms of cancel; NumGoroutine does not grow after cancel (no watcher leak).
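
The stack-shape assertion can be sketched roughly as follows. This is not the test from the PR: countParkedOnCond and parkAndCount are hypothetical helpers, and the sketch parks plain goroutines on a bare sync.Cond rather than real receivers. It only shows how a full runtime.Stack dump can be scanned for sync.(*Cond).Wait frames.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
	"time"
)

// countParkedOnCond returns how many goroutine stacks in a full dump
// contain a sync.(*Cond).Wait frame.
func countParkedOnCond() int {
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true: include all goroutines
	count := 0
	// Goroutine stacks in the dump are separated by blank lines.
	for _, g := range strings.Split(string(buf[:n]), "\n\n") {
		if strings.Contains(g, "sync.(*Cond).Wait") {
			count++
		}
	}
	return count
}

// parkAndCount parks `waiters` goroutines on a cond, counts how many the
// stack dump shows as parked, then releases and joins them.
func parkAndCount(waiters int) int {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	released := false
	var wg sync.WaitGroup
	for i := 0; i < waiters; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			for !released {
				cond.Wait()
			}
			mu.Unlock()
		}()
	}

	// Poll until every waiter shows up as parked, or give up after 2s.
	parked := 0
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if parked = countParkedOnCond(); parked >= waiters {
			break
		}
		time.Sleep(5 * time.Millisecond)
	}

	mu.Lock()
	released = true
	cond.Broadcast()
	mu.Unlock()
	wg.Wait()
	return parked
}

func main() {
	fmt.Println("parked on Cond.Wait:", parkAndCount(8))
}
```

A spinning goroutine would show up in the dump as runnable with mutex frames on its stack instead, which is why a goroutine count alone cannot tell the two cases apart.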

Semantics preserved

  • Public interfaces unchanged: StreamConstructor.NewSender, NewReceiver, Send, Recv, Close, the ack closure, and the WithClock option are all identical. No new exported symbols or options.
  • Topic filtering, per-receiver cursor advancement via cursorStore, StreamFromLatest option, and the idempotent ack-closure-advances-cursor-exactly-once semantics all behave the same.
  • Lock order unchanged: Recv takes s.mu and holds it across the loop (cond.Wait releases it only while parked); the ack closure only touches cursorStore.mu; Send takes s.mu briefly and broadcasts.

Verification

go test ./adapters/memstreamer/... -count=1 -race:

--- PASS: TestStreamer (1.02s)
--- PASS: TestConnector (0.26s)
--- PASS: TestRecv_DoesNotBusySpin (0.30s)
--- PASS: TestRecv_WakesOnSend (0.05s)
--- PASS: TestRecv_WakesOnCtxCancel (0.07s)
ok  	github.com/luno/workflow/adapters/memstreamer	2.94s

go test ./... -count=1 -race — all packages green (~23s), including the workflow consumer paths that exercise the adapter.

I verified that the new tests fail against the original busy-loop implementation by stashing the fix and re-running: TestRecv_DoesNotBusySpin dumped goroutines stuck in internal/sync.(*Mutex).lockSlow in the runnable state inside Recv (matching exactly the SIGQUIT dump described above) before failing.

Behavioural notes

  • Wakeup ordering / fairness: cond.Broadcast() wakes all parked receivers; they then race for s.mu. The Go runtime does not guarantee FIFO order on sync.Mutex.Lock. The previous busy-loop had no fairness either (all goroutines hammered the mutex), so this is no worse — but consumers should not depend on a particular wake order. For per-receiver cursor independence (multiple names, same topic) this is fine because each receiver has its own cursor in cursorStore.
  • Per-call ctx watcher: every Recv invocation now spawns one ephemeral goroutine. Workflow consumers typically call Recv in a tight loop (one event at a time), so this is one short-lived goroutine per delivered event — bounded by event throughput, cleaned up via defer close(stop). Negligible compared to the CPU saved.
  • Lock around Wait: Recv holds s.mu when it calls s.cond.Wait(), which atomically releases s.mu while parked and reacquires it before returning. Send is therefore not blocked by parked receivers.
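
A small sketch of the first and third notes, under assumptions: sharedLog and its cursors map are hypothetical stand-ins for the stream and cursorStore. Several named receivers wait on one cond, a single send acquires the mutex despite any parked receivers, and each receiver gets the event through its own cursor regardless of the order in which they wake.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedLog is a hypothetical stand-in for the shared stream state.
type sharedLog struct {
	mu      sync.Mutex
	cond    *sync.Cond
	events  []string
	cursors map[string]int // per-receiver cursor, keyed by receiver name
}

// recv parks until the named receiver's cursor is behind the log.
func (l *sharedLog) recv(name string) string {
	l.mu.Lock()
	defer l.mu.Unlock()
	for l.cursors[name] >= len(l.events) {
		l.cond.Wait()
	}
	ev := l.events[l.cursors[name]]
	l.cursors[name]++
	return ev
}

// send can take the lock even with receivers parked, because Wait
// released the mutex on their behalf.
func (l *sharedLog) send(ev string) {
	l.mu.Lock()
	l.events = append(l.events, ev)
	l.cond.Broadcast() // wakes all waiters; they race to reacquire l.mu
	l.mu.Unlock()
}

// deliverToAll starts one receiver per name, sends one event, and
// returns what each receiver observed.
func deliverToAll(names []string, ev string) map[string]string {
	l := &sharedLog{cursors: map[string]int{}}
	l.cond = sync.NewCond(&l.mu)

	got := make(map[string]string)
	var gotMu sync.Mutex
	var wg sync.WaitGroup
	for _, name := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			v := l.recv(name)
			gotMu.Lock()
			got[name] = v
			gotMu.Unlock()
		}(name)
	}
	l.send(ev)
	wg.Wait()
	return got
}

func main() {
	got := deliverToAll([]string{"a", "b", "c"}, "e1")
	fmt.Println(len(got), got["a"], got["b"], got["c"])
}
```

The result does not depend on wake order, which is the property consumers should rely on instead of FIFO wakeups.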

Test plan

  • go test ./adapters/memstreamer/... -race
  • go test ./... -race
  • Reverse-verify: stash the fix, re-run; new tests fail as expected
  • (Reviewer) Confirm no consumer in the codebase depends on FIFO wakeup order across multiple Recv callers on the same topic

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved streaming so waiting receivers now sleep efficiently instead of busy-spinning.
    • Receivers wake promptly when new events arrive or when their context is cancelled, reducing latency and avoiding goroutine leaks.
    • Event delivery and acknowledgement behaviour remains unchanged.

The previous Recv implementation polled the log in a tight for-loop with
no backoff: each iteration acquired and released s.mu and cursorStore.mu.
With N step consumers per workflow, all N goroutines stayed runnable on
those two mutexes even when the log was idle, pinning hardware at high
CPU (observed ~380% on Apple Silicon for hours after a single Run had
already reached a terminal status).

This change adds a sync.Cond keyed off the existing shared mutex. Recv
now blocks on cond.Wait when there is nothing to deliver and is woken by
cond.Broadcast on Send. A short-lived watcher goroutine per Recv call
broadcasts on ctx.Done so cancellation still unblocks parked receivers;
it is cleaned up via a stop channel on Recv return, so there is no
per-call goroutine leak.

Semantics preserved: topic filtering, per-receiver cursor advancement
via the existing cursorStore, StreamFromLatest receiver option, and the
idempotent ack closure that advances the cursor exactly once.

Adds three regression tests:
- TestRecv_DoesNotBusySpin (8 parked receivers, asserts stable
  goroutine count and that stacks show goroutines parked on
  sync.Cond.Wait rather than spinning in the Recv for-loop)
- TestRecv_WakesOnSend (parked Recv returns within 100ms of Send)
- TestRecv_WakesOnCtxCancel (parked Recv returns ctx.Err() within
  100ms of cancel and the watcher goroutine does not leak)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 25, 2026


📝 Walkthrough

memstreamer now uses a shared mutex and sync.Cond across sender and receiver construction. Send broadcasts after appending an event. Recv now waits on the condition variable instead of polling, wakes on ctx.Done(), keeps skipping non-matching topics, and preserves ack cursor updates. New tests cover idle waiting, wake-up on send, and wake-up on context cancellation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • luno/workflow#190: Also changes adapters/memstreamer/memstreamer.go receiver-path behaviour and Recv/NewReceiver cursor handling in the same area.

Suggested reviewers

  • ScaleneZA
  • jesseLuno

Poem

A rabbit hopped where streams did hum,
The cond woke up with a gentle thrum.
No busy spin, just wait and see,
Then send and recv in harmony.
🐇✨ Hop, wake, ack — hooray for the key!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarises the main change: parking Recv on sync.Cond instead of busy-spinning. |
| Description check | ✅ Passed | The description is directly about the memstreamer Recv fix and matches the changeset details. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot requested review from ScaleneZA and jesseLuno June 25, 2026 15:29

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
adapters/memstreamer/memstreamer.go (1)

139-149: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Start the context watcher only when Recv is about to park.

Right now every Recv call spawns a goroutine, even when an event is immediately available. Moving watcher creation behind the cond.Wait() path keeps the hot path goroutine-free while preserving cancellation wake-ups.

Suggested local refactor
-	stop := make(chan struct{})
-	defer close(stop)
-	go func() {
-		select {
-		case <-ctx.Done():
-			s.mu.Lock()
-			s.cond.Broadcast()
-			s.mu.Unlock()
-		case <-stop:
-		}
-	}()
+	var stop chan struct{}
+	defer func() {
+		if stop != nil {
+			close(stop)
+		}
+	}()
+	startCtxWatcher := func() {
+		if stop != nil {
+			return
+		}
+		stop = make(chan struct{})
+		go func(stop <-chan struct{}) {
+			select {
+			case <-ctx.Done():
+				s.mu.Lock()
+				s.cond.Broadcast()
+				s.mu.Unlock()
+			case <-stop:
+			}
+		}(stop)
+	}

Then call startCtxWatcher() immediately before s.cond.Wait().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@adapters/memstreamer/memstreamer.go` around lines 139 - 149, The Recv path is
spawning a context-watcher goroutine on every call, even when an item is already
available. Move the watcher setup behind the blocking path so it only starts
when Recv is about to call cond.Wait(), and keep the cancellation wake-up
behavior by extracting that logic into a helper such as startCtxWatcher that is
invoked immediately before waiting.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@adapters/memstreamer/memstreamer_test.go`:
- Around line 167-187: The goroutine leak assertion in the Recv cancel test is
taking the baseline too late, after the Recv goroutine and its context watcher
are already running. Move the runtime.NumGoroutine baseline capture in
memstreamer_test.go so it happens before starting the goroutine that calls Recv,
then keep the existing cancel/wait/after check around done to ensure a leaked
watcher is counted correctly.
- Around line 35-55: Ensure TestRecv_DoesNotBusySpin waits for all receiver
goroutines to finish before exiting, since the current deferred cancel can leave
Recv loops from memstreamer.NewReceiver still unwinding when later tests inspect
runtime.NumGoroutine(). Add synchronization in the test (for example around the
goroutines launched from the Recv loop) so the test does not return until every
consumer has observed cancellation and exited cleanly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31456de6-c40a-4b56-80a8-9a883a25f46d

📥 Commits

Reviewing files that changed from the base of the PR and between 0a9e052 and edb9581.

📒 Files selected for processing (2)
  • adapters/memstreamer/memstreamer.go
  • adapters/memstreamer/memstreamer_test.go

Comment on lines +35 to +55
func TestRecv_DoesNotBusySpin(t *testing.T) {
	s := memstreamer.New()
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	const consumers = 8
	for i := 0; i < consumers; i++ {
		name := "consumer-" + string(rune('a'+i))
		rec, err := s.NewReceiver(ctx, testTopic, name)
		if err != nil {
			t.Fatalf("NewReceiver: %v", err)
		}
		go func() {
			for {
				_, _, err := rec.Recv(ctx)
				if err != nil {
					return
				}
			}
		}()
	}


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Wait for the receiver goroutines before the test exits.

This test uses process-wide goroutine counts, but it cancels via defer without joining the goroutines it started. Those goroutines can still be unwinding when the next test samples runtime.NumGoroutine().

Suggested cleanup
 import (
 	"context"
 	"runtime"
 	"strings"
+	"sync"
 	"testing"
 	"time"
@@
 	s := memstreamer.New()
 	ctx, cancel := context.WithCancel(context.Background())
-	defer cancel()
+	var wg sync.WaitGroup
+	t.Cleanup(func() {
+		cancel()
+		wg.Wait()
+	})
@@
 		if err != nil {
 			t.Fatalf("NewReceiver: %v", err)
 		}
+		wg.Add(1)
 		go func() {
+			defer wg.Done()
 			for {
 				_, _, err := rec.Recv(ctx)
 				if err != nil {

Comment on lines +167 to +187
	before := runtime.NumGoroutine()
	cancelAt := time.Now()
	cancel()

	select {
	case err := <-done:
		if elapsed := time.Since(cancelAt); elapsed > 100*time.Millisecond {
			t.Errorf("Recv took %v to wake after cancel (want <100ms)", elapsed)
		}
		if err == nil {
			t.Errorf("Recv should return ctx.Err(), got nil")
		}
	case <-time.After(500 * time.Millisecond):
		t.Fatalf("Recv did not unblock on ctx cancel within 500ms")
	}

	// Give the watcher goroutine a moment to exit after Recv returns.
	time.Sleep(50 * time.Millisecond)
	after := runtime.NumGoroutine()
	if after > before {
		t.Errorf("goroutine leak after ctx cancel: before=%d after=%d", before, after)


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Take the leak baseline before starting Recv.

before is captured while both the Recv goroutine and its context watcher are already alive, so a leaked watcher can still leave after <= before once only the Recv goroutine exits.
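
The counting argument can be reproduced with a small simulation. Nothing here uses memstreamer: sampleCounts is a hypothetical helper in which one goroutine stands in for the Recv caller and a second, deliberately held open, stands in for a leaked watcher.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sampleCounts simulates a Recv goroutine that leaks its watcher and
// returns goroutine counts taken at three points.
func sampleCounts() (baseline, before, after int) {
	baseline = runtime.NumGoroutine() // taken before anything is started

	leak := make(chan struct{}) // held open to simulate a watcher that never exits
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})

	go func() { // stands in for the goroutine calling Recv
		go func() { <-leak }() // stands in for the leaked ctx watcher
		close(started)
		<-release
		close(done)
	}()

	<-started
	before = runtime.NumGoroutine() // late sample: both goroutines already alive

	close(release)
	<-done
	time.Sleep(50 * time.Millisecond) // let the Recv stand-in finish exiting
	after = runtime.NumGoroutine()    // only the leaked watcher remains

	close(leak) // clean up the simulated leak
	return baseline, before, after
}

func main() {
	baseline, before, after := sampleCounts()
	fmt.Printf("baseline=%d before=%d after=%d\n", baseline, before, after)
	fmt.Println("late check (after > before) flags leak: ", after > before)
	fmt.Println("early check (after > baseline) flags leak:", after > baseline)
}
```

With the late sample, the exit of the Recv stand-in hides the surviving watcher; comparing against the early baseline exposes it.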

Suggested assertion fix
+	baseline := runtime.NumGoroutine()
 	done := make(chan error, 1)
 	go func() {
 		_, _, err := rec.Recv(ctx)
 		done <- err
 	}()
@@
-	before := runtime.NumGoroutine()
 	cancelAt := time.Now()
 	cancel()
@@
 	time.Sleep(50 * time.Millisecond)
 	after := runtime.NumGoroutine()
-	if after > before {
-		t.Errorf("goroutine leak after ctx cancel: before=%d after=%d", before, after)
+	if after > baseline {
+		t.Errorf("goroutine leak after ctx cancel: baseline=%d after=%d", baseline, after)
 	}
