diff --git a/.github/images/benchmark_ping.png b/.github/images/benchmark_ping.png
new file mode 100644
index 000000000..dcd083d59
Binary files /dev/null and b/.github/images/benchmark_ping.png differ
diff --git a/CHANGELOG.md b/CHANGELOG.md
index e1c5ac9d2..402088a91 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,87 @@ All notable changes to this project will be documented in this file.
This format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+#### [v3.3.0](https://github.com/ergo-services/ergo/releases/tag/v1.999.330) 2026-xx-xx [tag version v1.999.330] ####
+
+* Added **pointer type support** in EDF - `*int`, `*string`, `[]*T`, `map[K]*V`, pointer struct fields. Nil state preserved. Nested pointers (`**T`) not supported. Max encoding depth limit (100) prevents stack overflow on deeply nested structures. See [Network Transparency](https://docs.ergo.services/networking/network-transparency) documentation
+* Fixed logger to preserve the Behavior name when a process registers a name
+* Added **process lifecycle counters** to `gen.NodeInfo` - `ProcessesSpawned`, `ProcessesSpawnFailed`, `ProcessesTerminated` for cumulative statistics
+* Added **mailbox latency measurement** (build with `-tags=latency`). `QueueMPSC.Latency()` returns the age of the oldest message in the queue (nanoseconds), -1 if disabled. `ProcessMailbox.Latency()` returns the max across all four queues. Added `MailboxLatency` field to `ProcessShortInfo` and latency fields to `MailboxQueues` in `ProcessInfo`. See [Debugging](https://docs.ergo.services/advanced/debugging) documentation
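+
+  The semantics can be sketched with a toy queue (illustrative only — this is not ergo's `QueueMPSC` implementation, just the age-of-oldest-message idea):
+
+  ```go
+  package main
+
+  import (
+  	"fmt"
+  	"time"
+  )
+
+  // item stamps each message on enqueue so the age of the
+  // oldest (head) element can be reported on demand.
+  type item struct {
+  	msg any
+  	at  time.Time
+  }
+
+  type toyQueue struct{ items []item }
+
+  func (q *toyQueue) Push(msg any) {
+  	q.items = append(q.items, item{msg, time.Now()})
+  }
+
+  // Latency returns the age of the oldest message in nanoseconds,
+  // or -1 when there is nothing queued (the same sentinel the real
+  // API uses for "disabled").
+  func (q *toyQueue) Latency() int64 {
+  	if len(q.items) == 0 {
+  		return -1
+  	}
+  	return time.Since(q.items[0].at).Nanoseconds()
+  }
+
+  func main() {
+  	q := &toyQueue{}
+  	fmt.Println(q.Latency()) // -1: nothing queued
+  	q.Push("hello")
+  	time.Sleep(10 * time.Millisecond)
+  	fmt.Println(q.Latency() > 0) // true
+  }
+  ```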
+* Added **`Node.ProcessRangeShortInfo`** for efficient callback-based iteration over all processes with their current state. See [Metrics actor](https://docs.ergo.services/extra-library/actors/metrics) for Prometheus integration
+* Added **per-event metrics** - `EventInfo` now includes `MessagesPublished`, `MessagesLocalSent`, `MessagesRemoteSent` counters. Added `Node.EventInfo` and `Node.EventRangeInfo` for querying event statistics. Added `EventsPublished`, `EventsReceived`, `EventsLocalSent`, `EventsRemoteSent` to `NodeInfo`. `EventsPublished` counts only local producer publishes, `EventsReceived` counts events arriving from remote nodes
+* Added **process init time measurement** - `InitTime` field in `ProcessShortInfo` and `ProcessInfo` records the time spent in `ProcessInit` callback (nanoseconds)
+* Fixed **message counters for meta processes** - meta process traffic now propagates to parent process counters, making `ProcessRangeShortInfo` aggregates balanced
+* Fixed **self-send message counter** - `messagesOut` is now incremented for self-sends
+* Fixed **simultaneous connect dead loop** - two nodes dialing each other at the same time no longer cause infinite retry loops. Deterministic connection IDs and Erlang-style collision detection (`EnableSimultaneousConnect` flag) ensure exactly one connection per pair. Fixed related connection leaks
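+
+  The collision handling relies on a tie-break that both peers compute identically. The rule below (comparing node names) is illustrative; ergo's actual mechanism uses deterministic connection IDs, but the deadlock-avoidance idea is the same:
+
+  ```go
+  package main
+
+  import "fmt"
+
+  // yieldToInbound reports whether this node should drop its own
+  // outbound dial and keep the inbound connection instead. Both peers
+  // evaluate the same rule with swapped arguments, so exactly one of
+  // the two simultaneous connections survives and neither side
+  // retries forever.
+  func yieldToInbound(localNode, remoteNode string) bool {
+  	return remoteNode < localNode // the lower name wins the dial
+  }
+
+  func main() {
+  	fmt.Println(yieldToInbound("a@host", "b@host")) // false: a keeps dialing
+  	fmt.Println(yieldToInbound("b@host", "a@host")) // true: b yields to a's dial
+  }
+  ```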
+* Fixed **silent data loss on connection pool write failure** - a transient write error could permanently break a pool item's write path without detection, causing all subsequent messages to be silently dropped while the connection appeared healthy
+* Added **software keepalive** for inter-node connections. Application-level heartbeat detects silent failures that TCP keepalive cannot: stuck processes, broken flushers, goroutine starvation. Each side advertises its period during handshake (8 bits in `NetworkFlags`); receiver uses peer's period for timeout. Enabled by default (15s period, 3 misses, 45s timeout). Configure via `NetworkFlags.EnableSoftwareKeepAlive` and `NetworkOptions.SoftwareKeepAliveMisses`. See [Network Stack](https://docs.ergo.services/networking/network-stack#software-keepalive) documentation
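+
+  A configuration sketch — the two option names come from this release; the surrounding `gen.NodeOptions` field layout is assumed here, so check the godoc for exact paths:
+
+  ```go
+  opts := gen.NodeOptions{}
+  // enabled by default; shown for completeness
+  opts.Network.Flags.EnableSoftwareKeepAlive = true
+  // declare the peer dead after this many missed heartbeats
+  opts.Network.SoftwareKeepAliveMisses = 3
+  node, err := ergo.StartNode("mynode@localhost", opts)
+  ```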
+* Added **handshake deadline** (5s) to prevent hung handshakes from blocking connection goroutines indefinitely
+* Added **message fragmentation** for large messages. Messages exceeding the fragment size (default 65000 bytes) are automatically split for transmission and reassembled on the receiving side. Works with compression, important delivery, and all message types. With `KeepNetworkOrder` disabled, fragments are distributed across all TCP connections in the pool for maximum throughput. Both nodes must enable `EnableFragmentation` flag (enabled by default). Configure via `NetworkOptions.FragmentSize`, `FragmentTimeout`, `MaxFragmentAssemblies`. See [Network Stack](https://docs.ergo.services/networking/network-stack#message-fragmentation) documentation
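+
+  The split/reassemble arithmetic can be sketched as follows (the real protocol additionally tags each fragment with a message ID and sequence number so out-of-order arrivals over different pool connections can be reassembled; this shows only the sizing):
+
+  ```go
+  package main
+
+  import (
+  	"bytes"
+  	"fmt"
+  )
+
+  // fragment splits a payload into chunks of at most size bytes.
+  func fragment(payload []byte, size int) [][]byte {
+  	var frags [][]byte
+  	for len(payload) > size {
+  		frags = append(frags, payload[:size])
+  		payload = payload[size:]
+  	}
+  	return append(frags, payload)
+  }
+
+  // reassemble concatenates fragments back into the original payload.
+  func reassemble(frags [][]byte) []byte {
+  	return bytes.Join(frags, nil)
+  }
+
+  func main() {
+  	msg := bytes.Repeat([]byte("x"), 150000)
+  	frags := fragment(msg, 65000) // default fragment size
+  	fmt.Println(len(frags))                          // 3
+  	fmt.Println(bytes.Equal(reassemble(frags), msg)) // true
+  }
+  ```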
+* Fixed **important delivery use-after-release** - reference ID for acknowledgment was read from buffer after it was returned to the pool, causing corrupted ACK responses under load. Affected `SendImportant` for PID, ProcessID, and Alias targets
+
+#### [v3.2.0](https://github.com/ergo-services/ergo/releases/tag/v1.999.320) 2026-02-04 [tag version v1.999.320] ####
+
+* Introduced **mTLS support** - new `gen.CertAuthManager` interface for mutual TLS with CA pool management (`ClientCAs`, `RootCAs`, `ClientAuth`, `ServerName`). See [Mutual TLS](https://docs.ergo.services/networking/mutual-tls) documentation
+* Introduced **NAT support** - new `RouteHost` and `RoutePort` options in `gen.AcceptorOptions` for nodes behind NAT or load balancers. See [Behind the NAT](https://docs.ergo.services/networking/behind-the-nat) documentation
+* Introduced **spawn time control** - `InitTimeout` option in `gen.ProcessOptions` limits `ProcessInit` duration for both local and remote spawn. Remote spawn and application processes limited to max 15 seconds. See [Process](https://docs.ergo.services/basics/process) documentation
+* Introduced **zip-bomb protection** - decompression size limits to prevent memory exhaustion attacks
+* Added `gen.Ref` methods for request timeout tracking. See [Generic Types](https://docs.ergo.services/basics/generic-types#gen.ref):
+ - `Deadline` - returns deadline timestamp stored in reference
+ - `IsAlive` - checks if reference is still valid (deadline not exceeded)
+* Added `gen.Node` methods. See [Node](https://docs.ergo.services/basics/node) documentation:
+ - `ProcessPID` / `ProcessName` - resolve process PID by name and vice versa
+ - `Call`, `CallWithTimeout`, `CallWithPriority`, `CallImportant`, `CallPID`, `CallProcessID`, `CallAlias` - synchronous requests from Node interface
+ - `Inspect` / `InspectMeta` - inspect processes and meta processes
+ - `MakeRefWithDeadline` - create reference with embedded deadline
+* Added `gen.RemoteNode.ApplicationInfo` - query application information from remote nodes. See [Remote Start Application](https://docs.ergo.services/networking/remote-start-application) documentation
+* Added `gen.Process` methods. See [Process](https://docs.ergo.services/basics/process) documentation:
+ - `SendWithPriorityAfter` - delayed send with priority
+ - `SendExitAfter` / `SendExitMetaAfter` - delayed exit signals
+ - `SendResponseImportant` / `SendResponseErrorImportant` - important delivery for responses
+* Added `gen.Meta` methods. See [Meta Process](https://docs.ergo.services/basics/meta-process) documentation:
+ - `SendResponse` / `SendResponseError` - respond to requests from meta process
+ - `SendPriority` / `SetSendPriority` - message priority control
+ - `Compression` / `SetCompression` - compression settings
+ - `EnvDefault` - get environment variable with default value
+* Added `gen.ApplicationSpec` / `gen.ApplicationInfo` fields:
+ - `Tags` - labels for instance selection (blue/green, canary, maintenance). See [Tags for Instance Selection](https://docs.ergo.services/basics/application#tags-for-instance-selection)
+ - `Map` - logical role to process name mapping. See [Process Role Mapping](https://docs.ergo.services/basics/application#process-role-mapping)
+* Added **HandleInspect** implementations for all supervisor types (OFO, ARFO, SOFO)
+* Fixed **LinkChild** in `RemoteNode.Spawn` / `RemoteNode.SpawnRegister`
+* Fixed **args persistence** for Simple One For One supervisor - child processes now restart with their original spawn arguments
+* Fixed **critical bug**: terminate signals (Link/Monitor exits) were incorrectly rejected due to wrong incarnation validation in network layer. Thanks to [@qjpcpu](https://github.com/qjpcpu) for reporting [#248](https://github.com/ergo-services/ergo/issues/248)
+* Completely reworked internal **Target Manager** (`node/tm/`) - improved architecture for process, event, and node target management with comprehensive test coverage
+* Completely reworked internal **Pub/Sub** mechanism - improved reliability and performance
+* Improved **ProcessInit state** - more `gen.Process` methods now available during initialization:
+ - `Link*`, `Unlink*`, `Monitor*`, `Demonitor*`
+ - `Call*`, `Inspect`, `InspectMeta`
+ - `RegisterName`, `UnregisterName`, `RegisterEvent`, `UnregisterEvent`
+ - `SendResponse*`, `SendResponseError*`
+ - `CreateAlias`, `DeleteAlias`
+* Introduced **shutdown timeout** - `ShutdownTimeout` option in `gen.NodeOptions` (default 3 minutes). During graceful shutdown, pending processes are logged every 5 seconds with state and queue info. After timeout, node force exits with error code 1. See [Node](https://docs.ergo.services/basics/node) documentation
+* Added **pprof labels** for actor and meta process goroutines (build with `-tags=pprof`) - each process goroutine is labeled with its PID, each meta process with its Alias, making it easy to identify stuck processes in pprof output
+* Improved API documentation - comprehensive godoc comments for all public interfaces
+* **Documentation rewritten** - complete documentation now included in the repository (`docs/`) and available at [docs.ergo.services](https://docs.ergo.services)
+* New documentation articles:
+ - [Project Structure](https://docs.ergo.services/basics/project-structure) - organizing projects with message isolation levels, deployment patterns, and evolution strategies
+ - [Building a Cluster](https://docs.ergo.services/advanced/building-a-cluster) - step-by-step guide to distributed systems with service discovery, load balancing, and failover
+ - [Message Versioning](https://docs.ergo.services/advanced/message-versioning) - evolving message contracts in distributed clusters with explicit versioning strategies
+ - [Handle Sync](https://docs.ergo.services/advanced/handle-sync) - synchronous message handling patterns
+ - [Important Delivery](https://docs.ergo.services/advanced/important-delivery) - guaranteed delivery mechanism
+ - [Pub/Sub Internals](https://docs.ergo.services/advanced/pub-sub-internals) - event system architecture
+ - [Debugging](https://docs.ergo.services/advanced/debugging) - build tags, pprof integration, troubleshooting stuck processes
+
+* **Extra Library - Actors** (https://github.com/ergo-services/actor):
+ - Introduced **Leader** actor - distributed leader election with Raft-inspired consensus algorithm. Features: term-based disambiguation, automatic failover, split-brain prevention through majority quorum, dynamic peer discovery. See [documentation](https://docs.ergo.services/extra-library/actors/leader)
+ - Introduced **Metrics** actor - Prometheus metrics exporter that collects node/network telemetry via HTTP endpoint. Features: automatic collection of node metrics (uptime, processes, memory), network metrics per remote node, extensible for custom metrics. See [documentation](https://docs.ergo.services/extra-library/actors/metrics)
+
+* **Extra Library - Meta Processes** (https://github.com/ergo-services/meta):
+ - Introduced **SSE** (Server-Sent Events) meta-process - unidirectional server-to-client streaming over HTTP. Features: server handler for accepting connections, client connection for external SSE endpoints, full SSE spec support (event types, IDs, retry hints, multi-line data), process pool with round-robin load balancing, Last-Event-ID for reconnection. See [documentation](https://docs.ergo.services/extra-library/meta-processes/sse)
+
+* **Benchmarks** (https://github.com/ergo-services/benchmarks):
+ - Introduced **Distributed Pub/Sub** benchmark - demonstrates event delivery to 1,000,000 subscribers across 10 nodes. Achieves 2.9M msg/sec delivery rate with only 10 network messages (one per consumer node) instead of 1M
+
+
#### [v3.1.0](https://github.com/ergo-services/ergo/releases/tag/v1.999.310) 2025-09-04 [tag version v1.999.310] ####
**New Features**
diff --git a/README.md b/README.md
index 4a9d938b5..349d5960f 100644
--- a/README.md
+++ b/README.md
@@ -1,65 +1,157 @@
[](https://docs.ergo.services)
-[](https://pkg.go.dev/ergo.services/ergo)
[](https://opensource.org/licenses/MIT)
[](https://t.me/ergo_services)
[](https://reddit.com/r/ergo_services)
+[](https://ergo.cloud)
-The Ergo Framework is an implementation of ideas, technologies, and design patterns from the Erlang world in the Go programming language. It is based on the actor model, network transparency, and a set of ready-to-use components for development. This significantly simplifies the creation of complex and distributed solutions while maintaining a high level of reliability and performance.
+**Actor model for Go. Build distributed systems without the distributed systems headache.**
-### Features ###
+Goroutines and channels work great until your system grows. Then come the mutexes, the race conditions, the service discovery configs, the retry logic, the connection pool management. Ergo replaces all of that with one model: isolated processes that communicate through messages, supervised automatically, addressable across any cluster.
-1. **Actor Model**: isolated processes communicate through message passing, handling messages sequentially in their own mailbox with four priority queues. Processes support both asynchronous messaging and synchronous request-response patterns, enabling flexible communication while maintaining the actor model guarantees.
+Inspired by Erlang/OTP. Zero external dependencies. Pure Go.
-2. **Network Transparency**: actors interact the same way whether local or remote. The framework uses custom serialization and protocol for [efficient](https://github.com/ergo-services/benchmarks) distributed communication with connection pooling, compression, and type caching, making network location transparent to application code.
+### The core idea in 30 seconds ###
-3. **Supervision Trees**: hierarchical fault recovery where supervisors monitor child processes and apply restart strategies when failures occur. Supports multiple supervision types (One For One, All For One, Rest For One, Simple One For One) and restart strategies (Transient, Temporary, Permanent) for building self-healing systems.
+```go
+type Counter struct {
+ act.Actor
+ count int
+}
-4. **Meta Processes**: bridge blocking I/O with the actor model through dedicated meta processes that handle TCP, UDP, Port, and Web protocols. Meta processes run blocking operations without affecting regular actor message processing.
+type MessageInc struct{}
-5. **Distributed Systems**: service discovery through embedded or external registrars (etcd, Saturn), distributed publish/subscribe events with token-based authorization and buffering, remote process spawning with factory-based permissions, and remote application orchestration across nodes.
+func (c *Counter) HandleMessage(from gen.PID, msg any) error {
+ switch msg.(type) {
+ case MessageInc:
+ // safe without locks even with thousands of concurrent senders:
+ // messages are processed one at a time
+ c.count++
+ c.Log().Info("count: %d", c.count)
+ }
+ return nil
+}
-6. **Ready-to-use Components**: core framework includes Actor, Supervisor, Pool, and WebWorker actors plus TCP, UDP, Port, and Web meta processes. Extra library provides Leader, Metrics actors, WebSocket, SSE meta processes, Observer application, Colored and Rotate loggers, Erlang protocol support.
+func factory_Counter() gen.ProcessBehavior { return &Counter{} }
-7. **Flexibility**: customize network stack components, certificate management, compression and message priorities, logging, distributed events, and meta processes. Framework supports mTLS, NAT traversal, important delivery for guaranteed messaging, and Cron-based scheduling.
+// Start a node and spawn the actor
+node, _ := ergo.StartNode("mynode@localhost", gen.NodeOptions{})
+pid, _ := node.Spawn(factory_Counter, gen.ProcessOptions{})
-Examples demonstrating the framework's capabilities are available in the [examples repository](https://github.com/ergo-services/examples).
+// Same API whether local or on another continent
+node.Send(pid, MessageInc{})
+node.Send(pid, MessageInc{})
+```
-### Benchmarks ###
+No locks. No race conditions. Sequential message handling is the guarantee.
-On a 64-core processor, Ergo Framework demonstrates a performance of **over 21 million messages per second locally** and **nearly 5 million messages per second over the network**.
+### Why not just goroutines + channels? ###
-
+| | Goroutines + channels | Ergo |
+|---|---|---|
+| Shared state | You manage with mutexes | No shared state by design |
+| Failure recovery | Manual | Supervision trees restart automatically |
+| Cross-node messaging | Build it yourself | Same API, transparent |
+| Service discovery | External tool needed | Built in |
+| Race conditions | Possible | Impossible within a process |
-Available benchmarks can be found in the [benchmarks repository](https://github.com/ergo-services/benchmarks).
+### What you can build ###
-* Messaging performance (local, network)
+**Real-time backends.** Each WebSocket connection becomes an addressable actor. Any node in your cluster can push to any specific client. No pub/sub intermediaries.
-* Memory consumption per process (demonstrates framework memory footprint)
+**IoT platforms.** One actor per device. Thousands of devices per node. Supervisors restart failed device actors automatically.
-* Serialization performance comparison: EDF vs Protobuf vs Gob
+**Multi-agent AI systems.** Each agent is an isolated actor with a mailbox. Crash isolation, supervision, distributed addressability, and a built-in [MCP server](https://docs.ergo.services/extra-library/applications/mcp) that exposes your running cluster to any AI assistant (Claude Code, Cursor, and other MCP-compatible clients). See [AI Agents](https://docs.ergo.services/ai-agents) for patterns and diagnostics.
-* Distributed Pub/Sub (event delivery to 1,000,000 subscribers across 10 nodes)
+**Financial and event-driven systems.** Four priority queues per mailbox, guaranteed delivery, no dropped messages.
-### Observer ###
-To inspect the node, network stack, running applications, and processes, you can use the [`observer`](https://github.com/ergo-services/tools/) tool
+**Distributed Pub/Sub across the cluster.** Producer registers an event once; any process on any node subscribes. The framework delivers one network message per node, not per subscriber. 1M subscribers across 10 nodes cost 10 network messages, not 1M.
-
+```go
+// Producer on any node
+token, _ := producer.RegisterEvent("prices", gen.EventOptions{})
+producer.SendEvent("prices", token, PriceUpdate{Asset: "BTC", Price: 95000})
-To install the Observer tool, you need to have the Go compiler version 1.20 or higher. Run the following command:
+// Subscriber on any other node, identical API
+process.MonitorEvent(gen.Event{Name: "prices", Node: "producer@host"})
+
+func (s *Sub) HandleEvent(event gen.MessageEvent) error {
+ fmt.Println(event.Message.(PriceUpdate))
+ return nil
+}
```
-$ go install ergo.tools/observer@latest
+
+### Performance ###
+
+On a 64-core processor:
+
+* **21M+ messages/second** locally
+* **~5.5M messages/second** over the network
+* **Distributed Pub/Sub**: 2.9M msg/sec delivery to 1,000,000 subscribers across 10 nodes
+
+Lock-free queues. Processes sleep when idle. No CPU wasted.
+
+
+
+Full benchmarks: [benchmarks repository](https://github.com/ergo-services/benchmarks).
+
+### Observer ###
+
+Observer is a real-time web UI for monitoring and inspecting Ergo nodes. It provides live visibility into every layer of the system:
+
+- **Processes** - full process list with state, mailbox depth, latency, running time, wakeups, and uptime. Click any process to inspect its supervision tree, links, monitors, aliases, environment, and internal actor state
+- **Applications** - running applications with their process trees, modes, and uptime
+- **Network** - cluster topology, per-node connection details, traffic counters, and protocol info
+- **Events** - registered events with producer, subscriber counts, and publication statistics
+- **Logs** - live log stream with level filtering across the cluster
+- **Profiler** - goroutine dump with grouping and stack traces, heap profile with allocation breakdown, and GC pressure charts
+
+
+
+Add Observer to your node as an application:
+
+```go
+import "ergo.services/application/observer"
+
+options.Applications = []gen.ApplicationBehavior{
+ observer.CreateApp(observer.Options{}),
+}
```
-You can also embed the [Observer application](https://docs.ergo.services/extra-library/applications/observer) into your node. To see it in action, see the [demo example](https://github.com/ergo-services/examples/tree/master/demo). For more information, visit the [Observer documentation](https://docs.ergo.services/tools/observer).
+To see it in action with a fully loaded cluster, see the [observability example](https://github.com/ergo-services/examples/tree/master/observability). For more information, visit the [Observer documentation](https://docs.ergo.services/extra-library/applications/observer).
+
+### Features ###
+
+1. **Actor Model:** isolated processes communicate through message passing, handling messages sequentially with four priority queues. Supports asynchronous messaging and synchronous request-response, with per-process [mailbox latency measurement](https://docs.ergo.services/advanced/debugging#mailbox-latency) (`-tags=latency`) for production diagnostics.
+
+2. **Network Transparency:** actors interact the same way whether local or remote. Uses EDF (Ergo Data Format), a custom binary serialization with type caching, pointer support, and [message versioning](https://docs.ergo.services/advanced/message-versioning) for seamless upgrades. Includes connection pooling, compression, [message fragmentation](https://docs.ergo.services/networking/network-stack#message-fragmentation), and [application-level keepalive](https://docs.ergo.services/networking/network-stack#software-keepalive) for silent failure detection.
+
+3. **Supervision Trees:** hierarchical fault recovery where supervisors monitor child processes and apply configurable restart strategies. Supports One For One, All For One, Rest For One, and Simple One For One supervision types with Transient, Temporary, and Permanent restart policies.
+
+4. **Meta Processes:** bridge blocking I/O with the actor model through dedicated meta processes handling [TCP](https://docs.ergo.services/meta-processes/tcp), [UDP](https://docs.ergo.services/meta-processes/udp), [Port](https://docs.ergo.services/meta-processes/port), [Web](https://docs.ergo.services/meta-processes/web), [WebSocket](https://docs.ergo.services/extra-library/meta-processes/websocket), and [SSE](https://docs.ergo.services/extra-library/meta-processes/sse) protocols without affecting regular actor message processing.
+
+5. **Distributed Systems:** service discovery via embedded or external registrars ([etcd](https://docs.ergo.services/extra-library/registrars/etcd-client), [Saturn](https://docs.ergo.services/extra-library/registrars/saturn-client)), distributed [publish/subscribe events](https://docs.ergo.services/basics/events) with token-based authorization and buffering, [remote process spawning](https://docs.ergo.services/networking/remote-spawn-process) with factory-based permissions, [remote application orchestration](https://docs.ergo.services/networking/remote-start-application) across nodes, and Raft-based [leader election](https://docs.ergo.services/extra-library/actors/leader) without external dependencies for coordinating exclusive work across cluster replicas.
+
+6. **Observability:** real-time cluster inspection via the [Observer](https://docs.ergo.services/extra-library/applications/observer) web UI, native [distributed tracing](https://docs.ergo.services/advanced/distributed-tracing) that follows message chains across nodes with automatic propagation (exportable to OTLP backends like Grafana Tempo or Jaeger via [Pulse](https://docs.ergo.services/extra-library/applications/pulse)), and production metrics via [Radar](https://docs.ergo.services/extra-library/applications/radar) with a ready-to-use Grafana dashboard covering process lifecycle, mailbox pressure, network traffic, and event fanout. The extensible [Metrics](https://docs.ergo.services/extra-library/actors/metrics) actor adds custom Prometheus collectors alongside built-in node telemetry.
+
+7. **AI-Native:** built-in [MCP server](https://docs.ergo.services/extra-library/applications/mcp) exposes the full cluster to AI agents (Claude, Cursor, and any MCP-compatible client). Inspect processes, query events, capture goroutine dumps, stream logs, and run real-time samplers through natural language, turning any AI assistant into an interactive SRE for your Ergo cluster.
+
+8. **Cloud Native:** built-in Kubernetes health probes (liveness, readiness, startup) via the [Health](https://docs.ergo.services/extra-library/actors/health) actor, [Prometheus](https://docs.ergo.services/extra-library/actors/metrics) metrics endpoint, and [mTLS](https://docs.ergo.services/networking/mutual-tls) support for zero-trust deployments.
+
+9. **Ready-to-use Components:** core framework includes Actor, Supervisor, Pool, and WebWorker actors plus TCP, UDP, Port, WebSocket, SSE, and Web meta processes. Extra library provides [Leader](https://docs.ergo.services/extra-library/actors/leader), [Metrics](https://docs.ergo.services/extra-library/actors/metrics), and [Health](https://docs.ergo.services/extra-library/actors/health) actors, [Observer](https://docs.ergo.services/extra-library/applications/observer) and [Radar](https://docs.ergo.services/extra-library/applications/radar) applications, [Colored](https://docs.ergo.services/extra-library/loggers/colored) and [Rotate](https://docs.ergo.services/extra-library/loggers/rotate) loggers.
+10. **Erlang Interoperability:** native support for the [Erlang distribution protocol](https://docs.ergo.services/extra-library/network-protocols/erlang) enables heterogeneous clusters where Ergo (Go) and Erlang/Elixir nodes participate as equal peers. Send messages, spawn processes, and set up links and monitors across language boundaries without any proxies or bridges.
+11. **Flexibility:** customize network stack, certificate management ([mTLS](https://docs.ergo.services/networking/mutual-tls), [NAT traversal](https://docs.ergo.services/networking/behind-the-nat)), compression and message priorities, [Cron-based scheduling](https://docs.ergo.services/basics/cron), [important delivery](https://docs.ergo.services/advanced/important-delivery) for guaranteed messaging, and logging. The [`ergo`](https://docs.ergo.services/tools/ergo) CLI tool generates project scaffolding, actors, supervisors, and message types from the command line.
+
+Examples demonstrating the framework's capabilities are available in the [examples repository](https://github.com/ergo-services/examples).
+
+Questions and answers: [FAQ](https://docs.ergo.services/faq).
### Quick start ###
-For a quick start, use the [`ergo`](https://docs.ergo.services/tools/ergo) tool — a command-line utility designed to simplify the process of generating boilerplate code for your project based on the Ergo Framework. With this tool, you can rapidly create a complete project structure, including applications, actors, supervisors, network components, and more. It offers a set of arguments that allow you to customize the project according to specific requirements, ensuring it is ready for immediate development.
+The [`ergo`](https://docs.ergo.services/tools/ergo) CLI generates project scaffolding for you: applications, actors, supervisors, message types. The output is a complete, runnable project structure. Add components incrementally as your service grows.
To install use the following command:
@@ -67,135 +159,85 @@ To install use the following command:
$ go install ergo.tools/ergo@latest
```
-Now, you can create your project with just one command. Here is example:
-
-Supervision tree
+Create a project and start adding components:
```
- mynode
- ├─ myapp
- │ │
- │ └─ mysup
- │ │
- │ └─ myactor
- ├─ myweb
- └─ myactor2
+$ ergo init MyNode github.com/myorg/mynode
+$ cd mynode
+$ ergo add supervisor MyNodeApp:MySup
+$ ergo add actor MySup:MyActor
+$ go run ./cmd
```
-To generate project for this design use the following command:
+The generated project is ready to run immediately. Add more components as your service grows:
```
-$ ergo -init MyNode \
- -with-app MyApp \
- -with-sup MyApp:MySup \
- -with-actor MySup:MyActor \
- -with-web MyWeb \
- -with-actor MyActor2 \
- -with-observer
+$ ergo add actor --pool MySup:MyPool
+$ ergo add app BackgroundApp
+$ ergo add message MessageConnect --field ID:gen.Alias --field Addr:string
```
-as a result you will get generated project:
+For the full command reference, see the [ergo tool documentation](https://docs.ergo.services/tools/ergo).
-```
- mynode
- ├── apps
- │ └── myapp
- │ ├── myactor.go
- │ ├── myapp.go
- │ └── mysup.go
- ├── cmd
- │ ├── myactor2.go
- │ ├── mynode.go
- │ ├── myweb.go
- │ └── myweb_worker.go
- ├── go.mod
- ├── go.sum
- └── README.md
-```
+### Claude Code integration ###
+
+Pre-built agents and skills for [Claude Code](https://claude.com/claude-code) turn any Claude session into an Ergo-aware collaborator. Two paired toolkits are shipped in the [ergo-services/claude](https://github.com/ergo-services/claude) repository:
+
+- **framework** - designing and implementing actor systems. An architect agent (DDD bounded contexts, supervision trees, cluster topology, load analysis) plus a skill with progressive-disclosure references covering actors, supervision, messages, applications, pool, meta processes, node configuration, EDF, cluster, unit testing, and every extension library.
+
+- **devops** - diagnosing running clusters via the built-in [MCP application](https://docs.ergo.services/extra-library/applications/mcp). An SRE agent that runs hypothesis-driven investigations (observe, hypothesize, test, confirm) plus a skill with the full 48-tool catalog, counters reference, 10 diagnostic playbooks, active/passive sampler recipes, and build-tag awareness.
-to try it:
+Install as a Claude Code plugin (one-shot, updates managed by Claude Code):
```
-$ cd mynode
-$ go run ./cmd
+/plugin marketplace add ergo-services/claude
+/plugin install ergo@ergo-services
```
-Since we included Observer application, open http://localhost:9911 to inspect your node and running processes.
+After installation, invoke the skills as `/ergo:framework` or `/ergo:devops`. Agents activate automatically on trigger phrases ("design ergo application", "why is it slow", "check cluster health", etc.).
-### Erlang support ###
+### ergo.cloud ###
-Starting from version 3.0.0, support for the Erlang network stack has been moved to a [separate module](https://github.com/ergo-services/proto). Version 3.0 was distributed under the BSL 1.1 license, but starting from version 3.1 it is available under the MIT license. Detailed information is available in the [Erlang protocol documentation](https://docs.ergo.services/extra-library/network-protocols/erlang).
+[ergo.cloud](https://ergo.cloud) is an overlay network that connects Ergo nodes across AWS, GCP, Azure, and bare metal into a single transparent cluster without VPNs, proxies, or tunnels. End-to-end encrypted. Currently available via [waitlist](https://ergo.cloud).
### Requirements ###
-* Go 1.20.x and above
+* Go 1.21.x and above
### Changelog ###
See the [ChangeLog](CHANGELOG.md) file for the fully detailed changelog.
-#### [v3.2.0](https://github.com/ergo-services/ergo/releases/tag/v1.999.320) 2026-02-04 [tag version v1.999.320] ####
-
-* Introduced **mTLS support** - new `gen.CertAuthManager` interface for mutual TLS with CA pool management (`ClientCAs`, `RootCAs`, `ClientAuth`, `ServerName`). See [Mutual TLS](https://docs.ergo.services/networking/mutual-tls) documentation
-* Introduced **NAT support** - new `RouteHost` and `RoutePort` options in `gen.AcceptorOptions` for nodes behind NAT or load balancers. See [Behind the NAT](https://docs.ergo.services/networking/behind-the-nat) documentation
-* Introduced **spawn time control** - `InitTimeout` option in `gen.ProcessOptions` limits `ProcessInit` duration for both local and remote spawn. Remote spawn and application processes limited to max 15 seconds. See [Process](https://docs.ergo.services/basics/process) documentation
-* Introduced **zip-bomb protection** - decompression size limits to prevent memory exhaustion attacks
-* Added `gen.Ref` methods for request timeout tracking. See [Generic Types](https://docs.ergo.services/basics/generic-types#gen.ref):
- - `Deadline` - returns deadline timestamp stored in reference
- - `IsAlive` - checks if reference is still valid (deadline not exceeded)
-* Added `gen.Node` methods. See [Node](https://docs.ergo.services/basics/node) documentation:
- - `ProcessPID` / `ProcessName` - resolve process PID by name and vice versa
- - `Call`, `CallWithTimeout`, `CallWithPriority`, `CallImportant`, `CallPID`, `CallProcessID`, `CallAlias` - synchronous requests from Node interface
- - `Inspect` / `InspectMeta` - inspect processes and meta processes
- - `MakeRefWithDeadline` - create reference with embedded deadline
-* Added `gen.RemoteNode.ApplicationInfo` - query application information from remote nodes. See [Remote Start Application](https://docs.ergo.services/networking/remote-start-application) documentation
-* Added `gen.Process` methods. See [Process](https://docs.ergo.services/basics/process) documentation:
- - `SendWithPriorityAfter` - delayed send with priority
- - `SendExitAfter` / `SendExitMetaAfter` - delayed exit signals
- - `SendResponseImportant` / `SendResponseErrorImportant` - important delivery for responses
-* Added `gen.Meta` methods. See [Meta Process](https://docs.ergo.services/basics/meta-process) documentation:
- - `SendResponse` / `SendResponseError` - respond to requests from meta process
- - `SendPriority` / `SetSendPriority` - message priority control
- - `Compression` / `SetCompression` - compression settings
- - `EnvDefault` - get environment variable with default value
-* Added `gen.ApplicationSpec` / `gen.ApplicationInfo` fields:
- - `Tags` - labels for instance selection (blue/green, canary, maintenance). See [Tags for Instance Selection](https://docs.ergo.services/basics/application#tags-for-instance-selection)
- - `Map` - logical role to process name mapping. See [Process Role Mapping](https://docs.ergo.services/basics/application#process-role-mapping)
-* Added **HandleInspect** implementations for all supervisor types (OFO, ARFO, SOFO)
-* Fixed **LinkChild** in `RemoteNode.Spawn` / `RemoteNode.SpawnRegister`
-* Fixed **args persistence** for Simple One For One supervisor - child processes now restart with their original spawn arguments
-* Fixed **critical bug**: terminate signals (Link/Monitor exits) were incorrectly rejected due to wrong incarnation validation in network layer. Thanks to [@qjpcpu](https://github.com/qjpcpu) for reporting [#248](https://github.com/ergo-services/ergo/issues/248)
-* Completely reworked internal **Target Manager** (`node/tm/`) - improved architecture for process, event, and node target management with comprehensive test coverage
-* Completely reworked internal **Pub/Sub** mechanism - improved reliability and performance
-* Improved **ProcessInit state** - more `gen.Process` methods now available during initialization:
- - `Link*`, `Unlink*`, `Monitor*`, `Demonitor*`
- - `Call*`, `Inspect`, `InspectMeta`
- - `RegisterName`, `UnregisterName`, `RegisterEvent`, `UnregisterEvent`
- - `SendResponse*`, `SendResponseError*`
- - `CreateAlias`, `DeleteAlias`
-* Introduced **shutdown timeout** - `ShutdownTimeout` option in `gen.NodeOptions` (default 3 minutes). During graceful shutdown, pending processes are logged every 5 seconds with state and queue info. After timeout, node force exits with error code 1. See [Node](https://docs.ergo.services/basics/node) documentation
-* Added **pprof labels** for actor and meta process goroutines (with `--tags pprof`) - each process goroutine is labeled with its PID, each meta process with its Alias, making it easy to identify stuck processes in pprof output
-* Improved API documentation - comprehensive godoc comments for all public interfaces
-* **Documentation rewritten** - complete documentation now included in the repository (`docs/`) and available at [docs.ergo.services](https://docs.ergo.services)
-* New documentation articles:
- - [Project Structure](https://docs.ergo.services/basics/project-structure) - organizing projects with message isolation levels, deployment patterns, and evolution strategies
- - [Building a Cluster](https://docs.ergo.services/advanced/building-a-cluster) - step-by-step guide to distributed systems with service discovery, load balancing, and failover
- - [Message Versioning](https://docs.ergo.services/advanced/message-versioning) - evolving message contracts in distributed clusters with explicit versioning strategies
- - [Handle Sync](https://docs.ergo.services/advanced/handle-sync) - synchronous message handling patterns
- - [Important Delivery](https://docs.ergo.services/advanced/important-delivery) - guaranteed delivery mechanism
- - [Pub/Sub Internals](https://docs.ergo.services/advanced/pub-sub-internals) - event system architecture
- - [Debugging](https://docs.ergo.services/advanced/debugging) - build tags, pprof integration, troubleshooting stuck processes
-
-* **Extra Library - Actors** (https://github.com/ergo-services/actor):
- - Introduced **Leader** actor - distributed leader election with Raft-inspired consensus algorithm. Features: term-based disambiguation, automatic failover, split-brain prevention through majority quorum, dynamic peer discovery. See [documentation](https://docs.ergo.services/extra-library/actors/leader)
- - Introduced **Metrics** actor - Prometheus metrics exporter that collects node/network telemetry via HTTP endpoint. Features: automatic collection of node metrics (uptime, processes, memory), network metrics per remote node, extensible for custom metrics. See [documentation](https://docs.ergo.services/extra-library/actors/metrics)
-
-* **Extra Library - Meta Processes** (https://github.com/ergo-services/meta):
- - Introduced **SSE** (Server-Sent Events) meta-process - unidirectional server-to-client streaming over HTTP. Features: server handler for accepting connections, client connection for external SSE endpoints, full SSE spec support (event types, IDs, retry hints, multi-line data), process pool with round-robin load balancing, Last-Event-ID for reconnection. See [documentation](https://docs.ergo.services/extra-library/meta-processes/sse)
-
-* **Benchmarks** (https://github.com/ergo-services/benchmarks):
- - Introduced **Distributed Pub/Sub** benchmark - demonstrates event delivery to 1,000,000 subscribers across 10 nodes. Achieves 2.9M msg/sec delivery rate with only 10 network messages (one per consumer node) instead of 1M
+#### [v3.3.0](https://github.com/ergo-services/ergo/releases/tag/v1.999.330) 2026-xx-xx [tag version v1.999.330] ####
+
+* Added **pointer type support** in EDF - `*int`, `*string`, `[]*T`, `map[K]*V`, pointer struct fields. Nil state preserved. Nested pointers (`**T`) not supported. Max encoding depth limit (100) prevents stack overflow on deeply nested structures. See [Network Transparency](https://docs.ergo.services/networking/network-transparency) documentation
+* Added **per-type encode/decode statistics** (build with `-tags=typestats`). Tracks the number of root-level encode/decode operations and the decompressed wire-byte volume per registered EDF type. Available via `Network().RegisteredTypes()` and the Observer Types panel. Helps identify heavy message types and decide where to enable compression. Overhead is approximately 2-3% on encode/decode throughput
+* Fixed logger to preserve the Behavior name when a process registers a name
+
+* Added **process lifecycle counters** to `NodeInfo` - `ProcessesSpawned`, `ProcessesSpawnFailed`, `ProcessesTerminated` for cumulative statistics
+
+* Added **mailbox latency measurement** (build with `-tags=latency`). `QueueMPSC.Latency()` returns the age of the oldest message in the queue (nanoseconds), -1 if disabled. `ProcessMailbox.Latency()` returns the max across all four queues. Added a `MailboxLatency` field to `ProcessShortInfo` and latency fields to `MailboxQueues` in `ProcessInfo`. Added `Node.ProcessRangeShortInfo` for efficient iteration over all processes. See [actor/metrics](https://github.com/ergo-services/actor-metrics) for Prometheus integration with histogram, top-N, and Grafana dashboard
+
+* Added **per-event metrics** - `EventInfo` now includes `MessagesPublished`, `MessagesLocalSent`, `MessagesRemoteSent` counters. Added `Node.EventInfo` and `Node.EventRangeInfo` for querying event statistics. Added `EventsPublished`, `EventsReceived`, `EventsLocalSent`, `EventsRemoteSent` to `NodeInfo`. `EventsPublished` counts only local producer publishes, `EventsReceived` counts events arriving from remote nodes
+
+* Added **process init time measurement** - `InitTime` field in `ProcessShortInfo` and `ProcessInfo` records the time spent in the `ProcessInit` callback (nanoseconds), enabling detection of slow process initialization
+
+* Fixed **message counters for meta processes** - meta process traffic now propagates to the parent process counters (`messagesIn`/`messagesOut`), so aggregates built via `ProcessRangeShortInfo` stay balanced. Meta processes keep their own counters for meta-level observability
+
+* Fixed **self-send message counter** - `messagesOut` is now incremented for self-sends (a process sending to itself), consistent with other send paths
+
+* Fixed **simultaneous connect loop** - two nodes dialing each other at the same time no longer fall into an infinite retry loop. Deterministic connection IDs and Erlang-style collision detection (`EnableSimultaneousConnect` flag) ensure exactly one connection per pair. Also fixed related connection leaks
+
+* Fixed **silent data loss on connection pool write failure** - a transient write error could permanently break a pool item's write path without detection, causing all subsequent messages to be silently dropped while the connection appeared healthy
+
+* Added **software keepalive** for inter-node connections - application-level heartbeat that detects silent failures invisible to TCP keepalive. Enabled by default (15s period, 3 misses, 45s timeout). See [Network Stack](https://docs.ergo.services/networking/network-stack#software-keepalive) documentation
+
+* Added **handshake deadline** (5s) to prevent hung handshakes from blocking connection goroutines indefinitely
+
+* Added **message fragmentation** for large messages. Messages exceeding the fragment size (default 65000 bytes) are automatically split and reassembled. With `KeepNetworkOrder` disabled, fragments use all TCP connections for maximum throughput. See [Network Stack](https://docs.ergo.services/networking/network-stack#message-fragmentation) documentation
+
+* Fixed **important delivery use-after-release** - the reference ID was read from a buffer after it had been released back to the pool, producing corrupted ACK responses
### Development and debugging ###
@@ -220,7 +262,11 @@ This helps identify stuck processes during shutdown by matching PIDs/Aliases fro
To disable panic recovery use `--tags norecover`.
-To enable trace logging level for the internals (node, network,...) use `--tags trace` and set the log level `gen.LogLevelTrace` for your node.
+To enable mailbox latency measurement use `--tags latency`. This adds a monotonic timestamp to every message pushed into the MPSC queue, allowing `QueueMPSC.Latency()` and `ProcessMailbox.Latency()` to report the age of the oldest unprocessed message. Overhead is approximately 10-25% on micro-benchmarks (LOCAL 1-1 scenario). Without the tag, `Latency()` returns -1 and there is zero overhead.
+
+To enable per-type encode/decode statistics use `--tags typestats`. This tracks the count of root-level encode/decode operations and decompressed wire-byte volume per registered EDF type, exposed via `Network().RegisteredTypes()` and visible in the Observer Types panel. Helps identify which message types dominate network traffic and which processes would benefit from compression. Overhead is approximately 2-3% on encode/decode throughput. Without the tag, counters remain zero and there is zero overhead.
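The bookkeeping behind these counters can be pictured as a per-type map of operation counts and byte volumes. The sketch below is hypothetical: the struct fields and the `onEncode` helper are illustrative and do not reflect the actual shape returned by `Network().RegisteredTypes()`.

```go
package main

import "fmt"

// typeStat holds the two quantities the typestats tag tracks per registered
// EDF type: root-level operation counts and decompressed wire-byte volume.
type typeStat struct {
	Encoded, Decoded  uint64
	BytesOut, BytesIn uint64
}

type registry map[string]*typeStat

// onEncode records one root-level encode of the given type and its
// decompressed wire size.
func (r registry) onEncode(typeName string, wireBytes int) {
	s, ok := r[typeName]
	if !ok {
		s = &typeStat{}
		r[typeName] = s
	}
	s.Encoded++
	s.BytesOut += uint64(wireBytes)
}

func main() {
	r := registry{}
	r.onEncode("main.Order", 512)
	r.onEncode("main.Order", 2048)
	fmt.Println(r["main.Order"].Encoded, r["main.Order"].BytesOut) // 2 2560
}
```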
+
+To enable trace logging level for the internals (node, network,...) use `--tags verbose` and set the log level `gen.LogLevelTrace` for your node.
For detailed debugging techniques, troubleshooting scenarios, and best practices, see the [Debugging](https://docs.ergo.services/advanced/debugging) documentation.
diff --git a/act/actor.go b/act/actor.go
index ee4599df5..dea127a75 100644
--- a/act/actor.go
+++ b/act/actor.go
@@ -4,7 +4,7 @@ import (
"fmt"
"reflect"
"runtime"
- "strings"
+ "time"
"ergo.services/ergo/gen"
"ergo.services/ergo/lib"
@@ -45,6 +45,9 @@ type ActorBehavior interface {
// this event using gen.Process.LinkEvent or gen.Process.MonitorEvent
HandleEvent(message gen.MessageEvent) error
+ // HandleSpan invoked on a tracing span if this process was added as a tracing exporter.
+ HandleSpan(message gen.TracingSpan) error
+
// HandleInspect invoked on the request made with gen.Process.Inspect(...)
HandleInspect(from gen.PID, item ...string) map[string]string
}
@@ -102,8 +105,7 @@ func (a *Actor) ProcessInit(process gen.Process, args ...any) (rr error) {
var ok bool
if a.behavior, ok = process.Behavior().(ActorBehavior); ok == false {
- unknown := strings.TrimPrefix(reflect.TypeOf(process.Behavior()).String(), "*")
- return fmt.Errorf("ProcessInit: not an ActorBehavior %s", unknown)
+ return fmt.Errorf("ProcessInit: not an ActorBehavior %s", process.BehaviorName())
}
if lib.Recover() {
@@ -125,6 +127,7 @@ func (a *Actor) ProcessInit(process gen.Process, args ...any) (rr error) {
func (a *Actor) ProcessRun() (rr error) {
var message *gen.MailboxMessage
+ var savedTracing gen.Tracing
if lib.Recover() {
defer func() {
@@ -186,6 +189,13 @@ func (a *Actor) ProcessRun() (rr error) {
retry:
switch message.Type {
case gen.MailboxMessageTypeRegular:
+ // activate tracing context from the incoming message
+ messageHasTracing := message.Tracing.ID != [2]uint64{}
+ if messageHasTracing {
+ savedTracing = a.PropagatingTrace()
+ a.SetPropagatingTrace(message.Tracing)
+ }
+
var reason error
if a.split {
@@ -202,10 +212,28 @@ func (a *Actor) ProcessRun() (rr error) {
}
if reason != nil {
+ if messageHasTracing {
+ a.sendSpanProcessed(message, gen.TracingKindSend, reason.Error())
+ }
return reason
}
+ if messageHasTracing {
+ a.sendSpanProcessed(message, gen.TracingKindSend, "")
+ // restore tracing only if handler didn't change it
+ if a.PropagatingTrace().ID == message.Tracing.ID {
+ a.SetPropagatingTrace(savedTracing)
+ }
+ }
+
case gen.MailboxMessageTypeRequest:
+ // activate tracing context from the incoming message
+ messageHasTracing := message.Tracing.ID != [2]uint64{}
+ if messageHasTracing {
+ savedTracing = a.PropagatingTrace()
+ a.SetPropagatingTrace(message.Tracing)
+ }
+
var reason error
var result any
@@ -223,21 +251,40 @@ func (a *Actor) ProcessRun() (rr error) {
}
if reason != nil {
- // if reason is "normal" and we got response - send it before termination
if reason == gen.TerminateReasonNormal && result != nil {
+ if messageHasTracing {
+ a.sendSpanProcessed(message, gen.TracingKindRequest, "")
+ }
a.SendResponse(message.From, message.Ref, result)
+ return reason
+ }
+ if messageHasTracing {
+ a.sendSpanProcessed(message, gen.TracingKindRequest, reason.Error())
}
return reason
}
if result == nil {
- // async handling of sync request. response could be sent
- // later, even by the other process
+ // async handling: emit Processed for tracing chain completeness
+ if messageHasTracing {
+ a.sendSpanProcessed(message, gen.TracingKindRequest, "")
+ if a.PropagatingTrace().ID == message.Tracing.ID {
+ a.SetPropagatingTrace(savedTracing)
+ }
+ }
continue
}
+ if messageHasTracing {
+ a.sendSpanProcessed(message, gen.TracingKindRequest, "")
+ }
+
a.SendResponse(message.From, message.Ref, result)
+ if messageHasTracing && a.PropagatingTrace().ID == message.Tracing.ID {
+ a.SetPropagatingTrace(savedTracing)
+ }
+
case gen.MailboxMessageTypeEvent:
if reason := a.behavior.HandleEvent(message.Message.(gen.MessageEvent)); reason != nil {
return reason
@@ -289,6 +336,11 @@ func (a *Actor) ProcessRun() (rr error) {
case gen.MailboxMessageTypeInspect:
result := a.behavior.HandleInspect(message.From, message.Message.([]string)...)
a.SendResponse(message.From, message.Ref, result)
+
+ case gen.MailboxMessageTypeSpan:
+ if reason := a.behavior.HandleSpan(message.Message.(gen.TracingSpan)); reason != nil {
+ return reason
+ }
}
}
@@ -330,6 +382,11 @@ func (a *Actor) HandleEvent(message gen.MessageEvent) error {
return nil
}
+func (a *Actor) HandleSpan(message gen.TracingSpan) error {
+ a.Log().Warning("Actor.HandleSpan: unhandled tracing span %#v", message)
+ return nil
+}
+
func (a *Actor) Terminate(reason error) {}
func (a *Actor) HandleMessageName(name gen.Atom, from gen.PID, message any) error {
@@ -351,3 +408,26 @@ func (a *Actor) HandleCallAlias(alias gen.Alias, from gen.PID, ref gen.Ref, requ
a.Log().Warning("Actor.HandleCallAlias %s: unhandled request from %s", alias, from)
return nil, nil
}
+
+func (a *Actor) sendSpanProcessed(message *gen.MailboxMessage, kind gen.TracingKind, errStr string) {
+ var msgType string
+ if message.Message != nil {
+ msgType = reflect.TypeOf(message.Message).String()
+ }
+ a.SendTracingSpan(gen.TracingSpan{
+ TraceID: message.Tracing.ID,
+ SpanID: message.Tracing.SpanID,
+ Point: gen.TracingPointProcessed,
+ Kind: kind,
+ Timestamp: time.Now().UnixNano(),
+ Node: a.Node().Name(),
+ From: message.From,
+ To: a.PID(),
+ Ref: message.Ref,
+ Behavior: a.BehaviorName(),
+ Message: msgType,
+ Error: errStr,
+ Attributes: a.TracingAttributes(),
+ })
+ a.ClearTracingSpanAttributes()
+}
diff --git a/act/pool.go b/act/pool.go
index 38b2ff40f..6a7ec73da 100644
--- a/act/pool.go
+++ b/act/pool.go
@@ -4,7 +4,7 @@ import (
"fmt"
"reflect"
"runtime"
- "strings"
+ "time"
"ergo.services/ergo/gen"
"ergo.services/ergo/lib"
@@ -104,8 +104,7 @@ func (p *Pool) ProcessInit(process gen.Process, args ...any) (rr error) {
var ok bool
if p.behavior, ok = process.Behavior().(PoolBehavior); ok == false {
- unknown := strings.TrimPrefix(reflect.TypeOf(process.Behavior()).String(), "*")
- return fmt.Errorf("ProcessInit: not a PoolBehavior %s", unknown)
+ return fmt.Errorf("ProcessInit: not a PoolBehavior %s", process.BehaviorName())
}
p.Process = process
p.mailbox = process.Mailbox()
@@ -197,7 +196,7 @@ func (p *Pool) ProcessRun() (rr error) {
// got new regular message. handle it
message = msg.(*gen.MailboxMessage)
if message.Type < gen.MailboxMessageTypeExit {
- // MailboxMessageTypeRegular, MailboxMessageTypeRequest, MailboxMessageTypeEvent
+ // MailboxMessageTypeRegular, MailboxMessageTypeRequest, MailboxMessageTypeEvent, MailboxMessageTypeSpan
p.forward(message)
// it shouldn't be "released" back to the pool
message = nil
@@ -217,32 +216,57 @@ func (p *Pool) ProcessRun() (rr error) {
switch message.Type {
case gen.MailboxMessageTypeRegular:
+ messageHasTracing := message.Tracing.ID != [2]uint64{}
+ if messageHasTracing {
+ p.SetPropagatingTrace(message.Tracing)
+ }
+
if reason := p.behavior.HandleMessage(message.From, message.Message); reason != nil {
+ p.sendSpanProcessed(message, gen.TracingKindSend, reason.Error())
return reason
}
+ p.sendSpanProcessed(message, gen.TracingKindSend, "")
+
+ if messageHasTracing {
+ p.SetPropagatingTrace(gen.Tracing{})
+ }
case gen.MailboxMessageTypeRequest:
+ messageHasTracing := message.Tracing.ID != [2]uint64{}
+ if messageHasTracing {
+ p.SetPropagatingTrace(message.Tracing)
+ }
+
var reason error
var result any
result, reason = p.behavior.HandleCall(message.From, message.Ref, message.Message)
if reason != nil {
- // if reason is "normal" and we got response - send it before termination
if reason == gen.TerminateReasonNormal && result != nil {
+ p.sendSpanProcessed(message, gen.TracingKindRequest, "")
p.SendResponse(message.From, message.Ref, result)
+ } else {
+ p.sendSpanProcessed(message, gen.TracingKindRequest, reason.Error())
}
return reason
}
if result == nil {
- // async handling of sync request. response could be sent
- // later, even by the other process
+ p.sendSpanProcessed(message, gen.TracingKindRequest, "")
+ if messageHasTracing {
+ p.SetPropagatingTrace(gen.Tracing{})
+ }
continue
}
+ p.sendSpanProcessed(message, gen.TracingKindRequest, "")
p.SendResponse(message.From, message.Ref, result)
+ if messageHasTracing {
+ p.SetPropagatingTrace(gen.Tracing{})
+ }
+
case gen.MailboxMessageTypeEvent:
if reason := p.behavior.HandleEvent(message.Message.(gen.MessageEvent)); reason != nil {
return reason
@@ -272,6 +296,7 @@ func (p *Pool) ProcessRun() (rr error) {
case gen.MailboxMessageTypeInspect:
result := p.behavior.HandleInspect(message.From, message.Message.([]string)...)
p.SendResponse(message.From, message.Ref, result)
+
}
}
@@ -298,6 +323,31 @@ func (p *Pool) HandleEvent(message gen.MessageEvent) error {
p.Log().Warning("Pool.HandleEvent: unhandled event message %#v", message)
return nil
}
+func (p *Pool) sendSpanProcessed(message *gen.MailboxMessage, kind gen.TracingKind, errStr string) {
+ if message.Tracing.ID == [2]uint64{} {
+ return
+ }
+ var msgType string
+ if message.Message != nil {
+ msgType = reflect.TypeOf(message.Message).String()
+ }
+ p.SendTracingSpan(gen.TracingSpan{
+ TraceID: message.Tracing.ID,
+ SpanID: message.Tracing.SpanID,
+ Point: gen.TracingPointProcessed,
+ Kind: kind,
+ Timestamp: time.Now().UnixNano(),
+ Node: p.Node().Name(),
+ From: message.From,
+ To: p.PID(),
+ Ref: message.Ref,
+ Message: msgType,
+ Error: errStr,
+ Attributes: p.TracingAttributes(),
+ })
+ p.ClearTracingSpanAttributes()
+}
+
func (p *Pool) HandleInspect(from gen.PID, item ...string) map[string]string {
return map[string]string{
"pool_size": fmt.Sprintf("%d", p.options.PoolSize),
diff --git a/act/supervisor.go b/act/supervisor.go
index c06de4f80..97f934370 100644
--- a/act/supervisor.go
+++ b/act/supervisor.go
@@ -5,7 +5,6 @@ import (
"reflect"
"runtime"
"sort"
- "strings"
"time"
"ergo.services/ergo/gen"
@@ -254,8 +253,7 @@ func (s *Supervisor) ProcessInit(process gen.Process, args ...any) (rr error) {
var ok bool
if s.behavior, ok = process.Behavior().(SupervisorBehavior); ok == false {
- unknown := strings.TrimPrefix(reflect.TypeOf(process.Behavior()).String(), "*")
- return fmt.Errorf("ProcessInit: not a SupervisorBehavior %s", unknown)
+ return fmt.Errorf("ProcessInit: not a SupervisorBehavior %s", process.BehaviorName())
}
s.Process = process
@@ -410,6 +408,11 @@ func (s *Supervisor) ProcessRun() (rr error) {
switch message.Type {
case gen.MailboxMessageTypeRegular:
+ messageHasTracing := message.Tracing.ID != [2]uint64{}
+ if messageHasTracing {
+ s.SetPropagatingTrace(message.Tracing)
+ }
+
var reason error
if s.handleChild {
switch m := message.Message.(type) {
@@ -425,14 +428,33 @@ func (s *Supervisor) ProcessRun() (rr error) {
}
if reason != nil {
+ s.sendSpanProcessed(message, gen.TracingKindSend, reason.Error())
action := s.sup.childTerminated(s.Name(), s.PID(), reason)
if err := s.handleAction(action); err != nil {
return err
}
+ } else {
+ s.sendSpanProcessed(message, gen.TracingKindSend, "")
+ }
+
+ if messageHasTracing {
+ s.SetPropagatingTrace(gen.Tracing{})
}
case gen.MailboxMessageTypeRequest:
+ messageHasTracing := message.Tracing.ID != [2]uint64{}
+ if messageHasTracing {
+ s.SetPropagatingTrace(message.Tracing)
+ }
+
result, reason := s.behavior.HandleCall(message.From, message.Ref, message.Message)
+
+ if reason != nil {
+ s.sendSpanProcessed(message, gen.TracingKindRequest, reason.Error())
+ } else {
+ s.sendSpanProcessed(message, gen.TracingKindRequest, "")
+ }
+
if result != nil {
s.SendResponse(message.From, message.Ref, result)
}
@@ -443,6 +465,10 @@ func (s *Supervisor) ProcessRun() (rr error) {
}
}
+ if messageHasTracing {
+ s.SetPropagatingTrace(gen.Tracing{})
+ }
+
case gen.MailboxMessageTypeEvent:
if reason := s.behavior.HandleEvent(message.Message.(gen.MessageEvent)); reason != nil {
return reason
@@ -497,6 +523,9 @@ func (s *Supervisor) ProcessRun() (rr error) {
case gen.MailboxMessageTypeInspect:
result := s.behavior.HandleInspect(message.From, message.Message.([]string)...)
s.SendResponse(message.From, message.Ref, result)
+
+ case gen.MailboxMessageTypeSpan:
+ panic("supervisor process cannot be a tracing exporter")
}
}
}
@@ -752,3 +781,29 @@ func sortSupChild(c []supChild) []SupervisorChild {
}
return children
}
+
+func (s *Supervisor) sendSpanProcessed(message *gen.MailboxMessage, kind gen.TracingKind, errStr string) {
+ if message.Tracing.ID == [2]uint64{} {
+ return
+ }
+ var msgType string
+ if message.Message != nil {
+ msgType = reflect.TypeOf(message.Message).String()
+ }
+ s.SendTracingSpan(gen.TracingSpan{
+ TraceID: message.Tracing.ID,
+ SpanID: message.Tracing.SpanID,
+ Point: gen.TracingPointProcessed,
+ Kind: kind,
+ Timestamp: time.Now().UnixNano(),
+ Node: s.Node().Name(),
+ From: message.From,
+ To: s.PID(),
+ Ref: message.Ref,
+ Behavior: s.BehaviorName(),
+ Message: msgType,
+ Error: errStr,
+ Attributes: s.TracingAttributes(),
+ })
+ s.ClearTracingSpanAttributes()
+}
diff --git a/act/web_worker.go b/act/web_worker.go
index 11637029d..b92b9dd26 100644
--- a/act/web_worker.go
+++ b/act/web_worker.go
@@ -5,7 +5,7 @@ import (
"net/http"
"reflect"
"runtime"
- "strings"
+ "time"
"ergo.services/ergo/gen"
"ergo.services/ergo/lib"
@@ -67,8 +67,7 @@ type WebWorker struct {
func (w *WebWorker) ProcessInit(process gen.Process, args ...any) (rr error) {
var ok bool
if w.behavior, ok = process.Behavior().(WebWorkerBehavior); ok == false {
- unknown := strings.TrimPrefix(reflect.TypeOf(process.Behavior()).String(), "*")
- return fmt.Errorf("ProcessInit: not a WebWorkerBehavior %s", unknown)
+ return fmt.Errorf("ProcessInit: not a WebWorkerBehavior %s", process.BehaviorName())
}
w.Process = process
w.mailbox = process.Mailbox()
@@ -89,6 +88,7 @@ func (w *WebWorker) ProcessInit(process gen.Process, args ...any) (rr error) {
func (w *WebWorker) ProcessRun() (rr error) {
var message *gen.MailboxMessage
+ var savedTracing gen.Tracing
if lib.Recover() {
defer func() {
@@ -145,6 +145,12 @@ func (w *WebWorker) ProcessRun() (rr error) {
switch message.Type {
case gen.MailboxMessageTypeRegular:
+ messageHasTracing := message.Tracing.ID != [2]uint64{}
+ if messageHasTracing {
+ savedTracing = w.PropagatingTrace()
+ w.SetPropagatingTrace(message.Tracing)
+ }
+
if r, ok := message.Message.(meta.MessageWebRequest); ok {
var reason error
switch r.Request.Method {
@@ -169,36 +175,61 @@ func (w *WebWorker) ProcessRun() (rr error) {
}
r.Done()
if reason != nil {
+ w.sendSpanProcessed(message, gen.TracingKindSend, r.Request.Method+" "+r.Request.RequestURI, reason.Error())
return reason
}
+ w.sendSpanProcessed(message, gen.TracingKindSend, r.Request.Method+" "+r.Request.RequestURI, "")
+ if messageHasTracing && w.PropagatingTrace().ID == message.Tracing.ID {
+ w.SetPropagatingTrace(savedTracing)
+ }
continue
}
if reason := w.behavior.HandleMessage(message.From, message.Message); reason != nil {
+ w.sendSpanProcessed(message, gen.TracingKindSend, reflectMsgType(message.Message), reason.Error())
return reason
}
+ w.sendSpanProcessed(message, gen.TracingKindSend, reflectMsgType(message.Message), "")
+ if messageHasTracing && w.PropagatingTrace().ID == message.Tracing.ID {
+ w.SetPropagatingTrace(savedTracing)
+ }
case gen.MailboxMessageTypeRequest:
+ messageHasTracing := message.Tracing.ID != [2]uint64{}
+ if messageHasTracing {
+ savedTracing = w.PropagatingTrace()
+ w.SetPropagatingTrace(message.Tracing)
+ }
+
var reason error
var result any
result, reason = w.behavior.HandleCall(message.From, message.Ref, message.Message)
if reason != nil {
- // if reason is "normal" and we got response - send it before termination
if reason == gen.TerminateReasonNormal && result != nil {
+ w.sendSpanProcessed(message, gen.TracingKindRequest, reflectMsgType(message.Message), "")
w.SendResponse(message.From, message.Ref, result)
+ return reason
}
+ w.sendSpanProcessed(message, gen.TracingKindRequest, reflectMsgType(message.Message), reason.Error())
return reason
}
if result == nil {
- // async handling of sync request. response could be sent
- // later, even by the other process
+ w.sendSpanProcessed(message, gen.TracingKindRequest, reflectMsgType(message.Message), "")
+ if messageHasTracing && w.PropagatingTrace().ID == message.Tracing.ID {
+ w.SetPropagatingTrace(savedTracing)
+ }
continue
}
+ w.sendSpanProcessed(message, gen.TracingKindRequest, reflectMsgType(message.Message), "")
+
w.SendResponse(message.From, message.Ref, result)
+ if messageHasTracing && w.PropagatingTrace().ID == message.Tracing.ID {
+ w.SetPropagatingTrace(savedTracing)
+ }
case gen.MailboxMessageTypeEvent:
if reason := w.behavior.HandleEvent(message.Message.(gen.MessageEvent)); reason != nil {
@@ -229,6 +260,9 @@ func (w *WebWorker) ProcessRun() (rr error) {
case gen.MailboxMessageTypeInspect:
result := w.behavior.HandleInspect(message.From, message.Message.([]string)...)
w.SendResponse(message.From, message.Ref, result)
+
+ case gen.MailboxMessageTypeSpan:
+ panic("web worker process cannot be a tracing exporter")
}
}
@@ -295,3 +329,32 @@ func (w *WebWorker) HandleOptions(from gen.PID, writer http.ResponseWriter, requ
http.Error(writer, "unhandled request", http.StatusNotImplemented)
return nil
}
+
+func reflectMsgType(msg any) string {
+ if msg == nil {
+ return ""
+ }
+ return reflect.TypeOf(msg).String()
+}
+
+func (w *WebWorker) sendSpanProcessed(message *gen.MailboxMessage, kind gen.TracingKind, msgType string, errStr string) {
+ if message.Tracing.ID == [2]uint64{} {
+ return
+ }
+ w.SendTracingSpan(gen.TracingSpan{
+ TraceID: message.Tracing.ID,
+ SpanID: message.Tracing.SpanID,
+ Point: gen.TracingPointProcessed,
+ Kind: kind,
+ Timestamp: time.Now().UnixNano(),
+ Node: w.Node().Name(),
+ From: message.From,
+ To: w.PID(),
+ Ref: message.Ref,
+ Behavior: w.BehaviorName(),
+ Message: msgType,
+ Error: errStr,
+ Attributes: w.TracingAttributes(),
+ })
+ w.ClearTracingSpanAttributes()
+}
diff --git a/app/system/app.go b/app/system/app.go
index 9d712d55f..7a32cfa49 100644
--- a/app/system/app.go
+++ b/app/system/app.go
@@ -1,6 +1,9 @@
package system
import (
+ "fmt"
+
+ "ergo.services/ergo/app/system/inspect"
"ergo.services/ergo/gen"
)
@@ -15,6 +18,9 @@ type systemApp struct {
}
func (sa *systemApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
+ if err := inspect.RegisterTypes(node.Network()); err != nil {
+ return gen.ApplicationSpec{}, fmt.Errorf("inspect types: %w", err)
+ }
return gen.ApplicationSpec{
Name: Name,
Description: "System Application",
diff --git a/app/system/inspect/application_list.go b/app/system/inspect/application_list.go
index 42cb5967a..cdcb91b08 100644
--- a/app/system/inspect/application_list.go
+++ b/app/system/inspect/application_list.go
@@ -16,21 +16,36 @@ type application_list struct {
token gen.Ref
generating bool
+ loopID uint64
event gen.Atom
}
func (ial *application_list) Init(args ...any) error {
ial.Log().SetLogger("default")
ial.Log().Debug("application list inspector started")
- // RegisterEvent is not allowed here
- ial.Send(ial.PID(), register{})
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1, // keep the last event
+ }
+ evname := gen.Atom(inspectApplicationList)
+ token, err := ial.RegisterEvent(evname, eopts)
+ if err != nil {
+ ial.Log().Error("unable to register event: %s", err)
+ return err
+ }
+ ial.Log().Info("registered event %s", evname)
+ ial.event = evname
+ ial.token = token
+ ial.SendAfter(ial.PID(), shutdown{}, inspectApplicationListIdlePeriod)
+
return nil
}
func (ial *application_list) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case generate:
- if ial.generating == false {
+ if m.id != ial.loopID || ial.generating == false {
ial.Log().Debug("generating canceled")
break // cancelled
}
@@ -60,7 +75,7 @@ func (ial *application_list) HandleMessage(from gen.PID, message any) error {
return gen.TerminateReasonNormal
}
- ial.SendAfter(ial.PID(), generate{}, inspectApplicationListPeriod)
+ ial.SendAfter(ial.PID(), generate{id: ial.loopID}, inspectApplicationListPeriod)
case requestInspect:
response := ResponseInspectApplicationList{
@@ -72,23 +87,6 @@ func (ial *application_list) HandleMessage(from gen.PID, message any) error {
ial.SendResponse(m.pid, m.ref, response)
ial.Log().Debug("sent response for the inspect application list request to: %s", m.pid)
- case register:
- eopts := gen.EventOptions{
- Notify: true,
- Buffer: 1, // keep the last event
- }
- evname := gen.Atom(fmt.Sprintf("%s", inspectApplicationList))
- token, err := ial.RegisterEvent(evname, eopts)
- if err != nil {
- ial.Log().Error("unable to register event: %s", err)
- return err
- }
- ial.Log().Info("registered event %s", evname)
- ial.event = evname
-
- ial.token = token
- ial.SendAfter(ial.PID(), shutdown{}, inspectApplicationListIdlePeriod)
-
case shutdown:
if ial.generating {
ial.Log().Debug("ignore shutdown. generating is active")
@@ -98,7 +96,8 @@ func (ial *application_list) HandleMessage(from gen.PID, message any) error {
case gen.MessageEventStart: // got first subscriber
ial.Log().Debug("got first subscriber. start generating events...")
- ial.Send(ial.PID(), generate{})
+ ial.loopID++
+ ial.Send(ial.PID(), generate{id: ial.loopID})
ial.generating = true
case gen.MessageEventStop: // no subscribers
diff --git a/app/system/inspect/application_tree.go b/app/system/inspect/application_tree.go
index 68a032c37..71ea803a0 100644
--- a/app/system/inspect/application_tree.go
+++ b/app/system/inspect/application_tree.go
@@ -18,6 +18,7 @@ type application_tree struct {
application gen.Atom
limit int
generating bool
+ loopID uint64
event gen.Atom
}
@@ -26,15 +27,30 @@ func (iat *application_tree) Init(args ...any) error {
iat.limit = args[1].(int)
iat.Log().SetLogger("default")
iat.Log().Debug("application tree inspector started for %s with limit %d", iat.application, iat.limit)
- // RegisterEvent is not allowed here
- iat.Send(iat.PID(), register{})
+ iat.SetCompression(true)
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1, // keep the last event
+ }
+ evname := gen.Atom(fmt.Sprintf("%s_%s_%d", inspectApplicationTree, iat.application, iat.limit))
+ token, err := iat.RegisterEvent(evname, eopts)
+ if err != nil {
+ iat.Log().Error("unable to register event: %s", err)
+ return err
+ }
+ iat.Log().Info("registered event %s", evname)
+ iat.event = evname
+ iat.token = token
+ iat.SendAfter(iat.PID(), shutdown{}, inspectApplicationTreeIdlePeriod)
+
return nil
}
func (iat *application_tree) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case generate:
- if iat.generating == false {
+ if m.id != iat.loopID || iat.generating == false {
iat.Log().Debug("generating canceled")
break // cancelled
}
@@ -56,7 +72,7 @@ func (iat *application_tree) HandleMessage(from gen.PID, message any) error {
return gen.TerminateReasonNormal
}
- iat.SendAfter(iat.PID(), generate{}, inspectApplicationTreePeriod)
+ iat.SendAfter(iat.PID(), generate{id: iat.loopID}, inspectApplicationTreePeriod)
case requestInspect:
response := ResponseInspectApplicationTree{
@@ -68,23 +84,6 @@ func (iat *application_tree) HandleMessage(from gen.PID, message any) error {
iat.SendResponse(m.pid, m.ref, response)
iat.Log().Debug("sent response for the inspect application tree request to: %s", m.pid)
- case register:
- eopts := gen.EventOptions{
- Notify: true,
- Buffer: 1, // keep the last event
- }
- evname := gen.Atom(fmt.Sprintf("%s_%s_%d", inspectApplicationTree, iat.application, iat.limit))
- token, err := iat.RegisterEvent(evname, eopts)
- if err != nil {
- iat.Log().Error("unable to register event: %s", err)
- return err
- }
- iat.Log().Info("registered event %s", evname)
- iat.event = evname
-
- iat.token = token
- iat.SendAfter(iat.PID(), shutdown{}, inspectApplicationTreeIdlePeriod)
-
case shutdown:
if iat.generating {
iat.Log().Debug("ignore shutdown. generating is active")
@@ -94,7 +93,8 @@ func (iat *application_tree) HandleMessage(from gen.PID, message any) error {
case gen.MessageEventStart: // got first subscriber
iat.Log().Debug("got first subscriber. start generating events...")
- iat.Send(iat.PID(), generate{})
+ iat.loopID++
+ iat.Send(iat.PID(), generate{id: iat.loopID})
iat.generating = true
case gen.MessageEventStop: // no subscribers
diff --git a/app/system/inspect/connection.go b/app/system/inspect/connection.go
index 85393fdd1..60e7777fc 100644
--- a/app/system/inspect/connection.go
+++ b/app/system/inspect/connection.go
@@ -17,6 +17,7 @@ type connection struct {
event gen.Atom
generating bool
+ loopID uint64
remote gen.Atom
}
@@ -24,15 +25,29 @@ func (ic *connection) Init(args ...any) error {
ic.remote = args[0].(gen.Atom)
ic.Log().SetLogger("default")
ic.Log().Debug("connection inspector started")
- // RegisterEvent is not allowed here
- ic.Send(ic.PID(), register{})
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1, // keep the last event
+ }
+ evname := gen.Atom(fmt.Sprintf("%s_%s", inspectConnection, ic.remote))
+ token, err := ic.RegisterEvent(evname, eopts)
+ if err != nil {
+ ic.Log().Error("unable to register connection event: %s", err)
+ return err
+ }
+ ic.Log().Info("registered event %s", evname)
+ ic.event = evname
+ ic.token = token
+ ic.SendAfter(ic.PID(), shutdown{}, inspectNetworkIdlePeriod)
+
return nil
}
func (ic *connection) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case generate:
- if ic.generating == false {
+ if m.id != ic.loopID || ic.generating == false {
ic.Log().Debug("generating canceled")
break // cancelled
}
@@ -60,7 +75,7 @@ func (ic *connection) HandleMessage(from gen.PID, message any) error {
if ev.Disconnected {
return gen.TerminateReasonNormal
}
- ic.SendAfter(ic.PID(), generate{}, inspectNetworkPeriod)
+ ic.SendAfter(ic.PID(), generate{id: ic.loopID}, inspectNetworkPeriod)
case requestInspect:
response := ResponseInspectConnection{
@@ -80,23 +95,6 @@ func (ic *connection) HandleMessage(from gen.PID, message any) error {
return gen.TerminateReasonNormal
}
- case register:
- eopts := gen.EventOptions{
- Notify: true,
- Buffer: 1, // keep the last event
- }
- evname := gen.Atom(fmt.Sprintf("%s_%s", inspectConnection, ic.remote))
- token, err := ic.RegisterEvent(evname, eopts)
- if err != nil {
- ic.Log().Error("unable to register connection event: %s", err)
- return err
- }
- ic.Log().Info("registered event %s", inspectNetwork)
- ic.event = evname
-
- ic.token = token
- ic.SendAfter(ic.PID(), shutdown{}, inspectNetworkIdlePeriod)
-
case shutdown:
if ic.generating {
ic.Log().Debug("ignore shutdown. generating is active")
@@ -106,7 +104,8 @@ func (ic *connection) HandleMessage(from gen.PID, message any) error {
case gen.MessageEventStart: // got first subscriber
ic.Log().Debug("got first subscriber. start generating events...")
- ic.Send(ic.PID(), generate{})
+ ic.loopID++
+ ic.Send(ic.PID(), generate{id: ic.loopID})
ic.generating = true
case gen.MessageEventStop: // no subscribers
diff --git a/app/system/inspect/connection_list.go b/app/system/inspect/connection_list.go
new file mode 100644
index 000000000..63b455384
--- /dev/null
+++ b/app/system/inspect/connection_list.go
@@ -0,0 +1,140 @@
+package inspect
+
+import (
+ "fmt"
+ "slices"
+ "strings"
+
+ "ergo.services/ergo/act"
+ "ergo.services/ergo/gen"
+)
+
+func factory_connection_list() gen.ProcessBehavior {
+ return &connection_list{}
+}
+
+type connection_list struct {
+ act.Actor
+ token gen.Ref
+
+ name string
+ limit int
+ hash string
+ generating bool
+ loopID uint64
+ event gen.Atom
+}
+
+func (icl *connection_list) Init(args ...any) error {
+ icl.name = args[0].(string)
+ icl.limit = args[1].(int)
+ icl.hash = args[2].(string)
+
+ icl.Log().SetLogger("default")
+ icl.Log().Debug("connection list inspector started. name=%q limit=%d", icl.name, icl.limit)
+ icl.SetCompression(true)
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1,
+ }
+ icl.event = gen.Atom(fmt.Sprintf("%s_%s", inspectConnectionList, icl.hash))
+ token, err := icl.RegisterEvent(icl.event, eopts)
+ if err != nil {
+ icl.Log().Error("unable to register event: %s", err)
+ return err
+ }
+ icl.Log().Info("registered event %s", icl.event)
+ icl.token = token
+ icl.SendAfter(icl.PID(), shutdown{}, inspectConnectionListIdlePeriod)
+
+ return nil
+}
+
+func (icl *connection_list) HandleMessage(from gen.PID, message any) error {
+ switch m := message.(type) {
+ case generate:
+ if m.id != icl.loopID || icl.generating == false {
+ break
+ }
+
+ networkInfo, err := icl.Node().Network().Info()
+ if err != nil {
+ return err
+ }
+
+ nameLower := strings.ToLower(icl.name)
+ var connections []gen.RemoteNodeInfo
+
+ // sort node names for stable output
+ slices.Sort(networkInfo.Nodes)
+
+ for _, n := range networkInfo.Nodes {
+ if nameLower != "" {
+ if strings.Contains(strings.ToLower(string(n)), nameLower) == false {
+ continue
+ }
+ }
+
+ remote, rerr := icl.Node().Network().Node(n)
+ if rerr != nil {
+ continue
+ }
+
+ connections = append(connections, remote.Info())
+
+ if icl.limit > 0 && len(connections) >= icl.limit {
+ break
+ }
+ }
+
+ ev := MessageInspectConnectionList{
+ Node: icl.Node().Name(),
+ Connections: connections,
+ }
+
+ if err := icl.SendEvent(icl.event, icl.token, ev); err != nil {
+ icl.Log().Error("unable to send event %q: %s", icl.event, err)
+ return gen.TerminateReasonNormal
+ }
+
+ icl.SendAfter(icl.PID(), generate{id: icl.loopID}, inspectConnectionListPeriod)
+
+ case requestInspect:
+ response := ResponseInspectConnectionList{
+ Event: gen.Event{
+ Name: icl.event,
+ Node: icl.Node().Name(),
+ },
+ }
+ icl.SendResponse(m.pid, m.ref, response)
+
+ case shutdown:
+ if icl.generating {
+ break
+ }
+ return gen.TerminateReasonNormal
+
+ case gen.MessageEventStart:
+ icl.Log().Debug("got first subscriber. start generating events...")
+ icl.loopID++
+ icl.Send(icl.PID(), generate{id: icl.loopID})
+ icl.generating = true
+
+ case gen.MessageEventStop:
+ icl.Log().Debug("no subscribers. stop generating")
+ if icl.generating {
+ icl.generating = false
+ icl.SendAfter(icl.PID(), shutdown{}, inspectConnectionListIdlePeriod)
+ }
+
+ default:
+ icl.Log().Error("unknown message (ignored) %#v", message)
+ }
+
+ return nil
+}
+
+func (icl *connection_list) Terminate(reason error) {
+ icl.Log().Debug("connection list inspector terminated: %s", reason)
+}
diff --git a/app/system/inspect/event_list.go b/app/system/inspect/event_list.go
new file mode 100644
index 000000000..76da4952a
--- /dev/null
+++ b/app/system/inspect/event_list.go
@@ -0,0 +1,157 @@
+package inspect
+
+import (
+ "fmt"
+ "strings"
+
+ "ergo.services/ergo/act"
+ "ergo.services/ergo/gen"
+)
+
+func factory_event_list() gen.ProcessBehavior {
+ return &event_list{}
+}
+
+type event_list struct {
+ act.Actor
+ token gen.Ref
+
+ timestamp int64
+ name string
+ notify int
+ buffered int
+ open int
+ minSubscribers int64
+ limit int
+ hash string
+
+ generating bool
+ loopID uint64
+ event gen.Atom
+}
+
+func (iel *event_list) Init(args ...any) error {
+ iel.timestamp = args[0].(int64)
+ iel.name = args[1].(string)
+ iel.notify = args[2].(int)
+ iel.buffered = args[3].(int)
+ iel.open = args[4].(int)
+ iel.minSubscribers = args[5].(int64)
+ iel.limit = args[6].(int)
+ iel.hash = args[7].(string)
+
+ iel.Log().SetLogger("default")
+ iel.Log().Debug("event list inspector started. timestamp=%d name=%q notify=%d buffered=%d open=%d minSubs=%d limit=%d",
+ iel.timestamp, iel.name, iel.notify, iel.buffered, iel.open, iel.minSubscribers, iel.limit)
+ iel.SetCompression(true)
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1,
+ }
+ iel.event = gen.Atom(fmt.Sprintf("%s_%s", inspectEventList, iel.hash))
+ token, err := iel.RegisterEvent(iel.event, eopts)
+ if err != nil {
+ iel.Log().Error("unable to register event: %s", err)
+ return err
+ }
+ iel.Log().Info("registered event %s", iel.event)
+ iel.token = token
+ iel.SendAfter(iel.PID(), shutdown{}, inspectEventListIdlePeriod)
+
+ return nil
+}
+
+func (iel *event_list) HandleMessage(from gen.PID, message any) error {
+ switch m := message.(type) {
+ case generate:
+ if m.id != iel.loopID || iel.generating == false {
+ iel.Log().Debug("generating canceled")
+ break
+ }
+ iel.Log().Debug("generating event")
+
+ events, _ := iel.Node().EventListInfo(iel.timestamp, iel.limit, iel.filterEvent)
+
+ ev := MessageInspectEventList{
+ Node: iel.Node().Name(),
+ Events: events,
+ }
+
+ if err := iel.SendEvent(iel.event, iel.token, ev); err != nil {
+ iel.Log().Error("unable to send event %q: %s", iel.event, err)
+ return gen.TerminateReasonNormal
+ }
+
+ iel.SendAfter(iel.PID(), generate{id: iel.loopID}, inspectEventListPeriod)
+
+ case requestInspect:
+ response := ResponseInspectEventList{
+ Event: gen.Event{
+ Name: iel.event,
+ Node: iel.Node().Name(),
+ },
+ }
+ iel.SendResponse(m.pid, m.ref, response)
+ iel.Log().Debug("sent response for the inspect event list request to: %s", m.pid)
+
+ case shutdown:
+ if iel.generating {
+ iel.Log().Debug("ignore shutdown. generating is active")
+ break
+ }
+ return gen.TerminateReasonNormal
+
+ case gen.MessageEventStart:
+ iel.Log().Debug("got first subscriber. start generating events...")
+ iel.loopID++
+ iel.Send(iel.PID(), generate{id: iel.loopID})
+ iel.generating = true
+
+ case gen.MessageEventStop:
+ iel.Log().Debug("no subscribers. stop generating")
+ if iel.generating {
+ iel.generating = false
+ iel.SendAfter(iel.PID(), shutdown{}, inspectEventListIdlePeriod)
+ }
+
+ default:
+ iel.Log().Error("unknown message (ignored) %#v", message)
+ }
+
+ return nil
+}
+
+func (iel *event_list) filterEvent(info gen.EventInfo) bool {
+ if iel.name != "" {
+ if strings.Contains(strings.ToLower(string(info.Event.Name)), strings.ToLower(iel.name)) == false {
+ return false
+ }
+ }
+ if iel.notify == 1 && info.Notify == false {
+ return false
+ }
+ if iel.notify == -1 && info.Notify == true {
+ return false
+ }
+ if iel.buffered == 1 && info.BufferSize == 0 {
+ return false
+ }
+ if iel.buffered == -1 && info.BufferSize > 0 {
+ return false
+ }
+ if iel.open == 1 && info.Open == false {
+ return false
+ }
+ if iel.open == -1 && info.Open == true {
+ return false
+ }
+ if iel.minSubscribers > 0 && info.Subscribers < iel.minSubscribers {
+ return false
+ }
+ return true
+}
+
+func (iel *event_list) Terminate(reason error) {
+ iel.Log().Debug("event list inspector terminated: %s", reason)
+}
diff --git a/app/system/inspect/goroutines.go b/app/system/inspect/goroutines.go
new file mode 100644
index 000000000..f427204f2
--- /dev/null
+++ b/app/system/inspect/goroutines.go
@@ -0,0 +1,169 @@
+package inspect
+
+import (
+ "runtime"
+ "sort"
+ "strconv"
+ "strings"
+)
+
+func captureGoroutines(req RequestDoGoroutines) ResponseDoGoroutines {
+ buf := make([]byte, 1<<20)
+ for {
+ n := runtime.Stack(buf, true)
+ if n < len(buf) {
+ buf = buf[:n]
+ break
+ }
+ buf = make([]byte, len(buf)*2)
+ }
+
+ blocks := strings.Split(string(buf), "\n\n")
+
+ stackFilter := strings.ToLower(req.Stack)
+ stateFilter := strings.ToLower(req.State)
+
+ type parsed struct {
+ id int
+ state string
+ waitSec int64
+ frames string
+ top string
+ bottom string
+ full string
+ }
+
+ var matched []parsed
+ total := 0
+
+ for _, block := range blocks {
+ block = strings.TrimSpace(block)
+ if strings.HasPrefix(block, "goroutine ") == false {
+ continue
+ }
+ total++
+
+ id, state, waitSec := parseHeader(block)
+
+ if stateFilter != "" && strings.ToLower(state) != stateFilter {
+ continue
+ }
+ if req.MinWait > 0 && waitSec < req.MinWait {
+ continue
+ }
+ if stackFilter != "" && strings.Contains(strings.ToLower(block), stackFilter) == false {
+ continue
+ }
+
+ funcs := parseFuncLines(block)
+ top := ""
+ bottom := ""
+ if len(funcs) > 0 {
+ top = funcs[0]
+ bottom = funcs[len(funcs)-1]
+ }
+
+ matched = append(matched, parsed{
+ id: id,
+ state: state,
+ waitSec: waitSec,
+ frames: state + "|" + strings.Join(funcs, "|"),
+ top: top,
+ bottom: bottom,
+ full: block,
+ })
+ }
+
+ // group by identical stack
+ groupMap := make(map[string]*GoroutineGroup)
+ var order []string
+
+ for _, p := range matched {
+ g, ok := groupMap[p.frames]
+ if ok == false {
+ g = &GoroutineGroup{State: p.state, WaitSec: p.waitSec, Current: p.top, Origin: p.bottom, Stack: p.full}
+ groupMap[p.frames] = g
+ order = append(order, p.frames)
+ }
+ g.Count++
+ g.IDs = append(g.IDs, p.id)
+ }
+
+ groups := make([]GoroutineGroup, 0, len(order))
+ for _, key := range order {
+ groups = append(groups, *groupMap[key])
+ }
+
+ sort.Slice(groups, func(i, j int) bool {
+ return groups[i].Count > groups[j].Count
+ })
+
+ return ResponseDoGoroutines{
+ Groups: groups,
+ Total: total,
+ Filtered: len(matched),
+ }
+}
+
+func parseHeader(block string) (id int, state string, waitSec int64) {
+ lines := strings.SplitN(block, "\n", 2)
+ header := lines[0]
+
+ rest := header[len("goroutine "):]
+ spaceIdx := strings.IndexByte(rest, ' ')
+ if spaceIdx < 0 {
+ return
+ }
+ id, _ = strconv.Atoi(rest[:spaceIdx])
+
+ open := strings.IndexByte(rest, '[')
+ closeIdx := strings.IndexByte(rest, ']') // avoid shadowing the builtin close
+ if open < 0 || closeIdx <= open {
+ return
+ }
+
+ stateStr := rest[open+1 : closeIdx]
+ parts := strings.SplitN(stateStr, ",", 2)
+ state = strings.TrimSpace(parts[0])
+
+ if len(parts) > 1 {
+ waitSec = parseWaitDuration(strings.TrimSpace(parts[1]))
+ }
+ return
+}
+
+func parseWaitDuration(s string) int64 {
+ // "5 minutes", "847 minutes", "2 hours"
+ parts := strings.Fields(s)
+ if len(parts) < 2 {
+ return 0
+ }
+ n, err := strconv.ParseInt(parts[0], 10, 64)
+ if err != nil {
+ return 0
+ }
+ unit := strings.TrimSuffix(parts[1], "s") // "minute" from "minutes"
+ switch unit {
+ case "minute":
+ return n * 60
+ case "hour":
+ return n * 3600
+ case "second":
+ return n
+ }
+ return 0
+}
+
+func parseFuncLines(block string) []string {
+ lines := strings.Split(block, "\n")
+ var funcs []string
+ for i := 1; i < len(lines); i++ {
+ if len(lines[i]) > 0 && lines[i][0] != '\t' {
+ f := strings.TrimSpace(lines[i])
+ if f != "" {
+ funcs = append(funcs, f)
+ }
+ }
+ }
+ return funcs
+}
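The parsing above splits each `runtime.Stack` block on its header line, e.g. `goroutine 12 [chan receive, 5 minutes]:`. A self-contained sketch of the same header parsing that `parseHeader` and `parseWaitDuration` perform together (simplified, with illustrative names):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGoroutineHeader extracts the ID, state, and wait duration (seconds)
// from the first line of a goroutine dump block, e.g.
// "goroutine 12 [chan receive, 5 minutes]:".
func parseGoroutineHeader(header string) (id int, state string, waitSec int64) {
	rest := strings.TrimPrefix(header, "goroutine ")
	if sp := strings.IndexByte(rest, ' '); sp > 0 {
		id, _ = strconv.Atoi(rest[:sp])
	}
	open, closeIdx := strings.IndexByte(rest, '['), strings.IndexByte(rest, ']')
	if open < 0 || closeIdx <= open {
		return
	}
	// state string may carry a wait duration after a comma
	parts := strings.SplitN(rest[open+1:closeIdx], ",", 2)
	state = strings.TrimSpace(parts[0])
	if len(parts) > 1 {
		fields := strings.Fields(parts[1]) // e.g. ["5", "minutes"]
		if len(fields) == 2 {
			n, _ := strconv.ParseInt(fields[0], 10, 64)
			switch strings.TrimSuffix(fields[1], "s") { // "minutes" -> "minute"
			case "second":
				waitSec = n
			case "minute":
				waitSec = n * 60
			case "hour":
				waitSec = n * 3600
			}
		}
	}
	return
}

func main() {
	id, state, wait := parseGoroutineHeader("goroutine 12 [chan receive, 5 minutes]:")
	fmt.Println(id, state, wait) // 12 chan receive 300
}
```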
diff --git a/app/system/inspect/heap.go b/app/system/inspect/heap.go
new file mode 100644
index 000000000..818ce4ef2
--- /dev/null
+++ b/app/system/inspect/heap.go
@@ -0,0 +1,62 @@
+package inspect
+
+import (
+ "runtime"
+ "sort"
+)
+
+func captureHeapProfile(req RequestDoHeapProfile) ResponseDoHeapProfile {
+ // force up-to-date stats
+ runtime.GC()
+
+ var p []runtime.MemProfileRecord
+ n, _ := runtime.MemProfile(nil, true)
+ p = make([]runtime.MemProfileRecord, n+50) // headroom: records may be added between the two calls
+ if n, ok := runtime.MemProfile(p, true); ok {
+ p = p[:n]
+ }
+
+ var records []HeapRecord
+ var totalInuse, totalAlloc, totalObjects int64
+
+ for _, r := range p {
+ inuse := r.InUseBytes()
+ if req.MinBytes > 0 && inuse < req.MinBytes {
+ continue
+ }
+
+ frames := runtime.CallersFrames(r.Stack())
+ var stack []string
+ for {
+ frame, more := frames.Next()
+ if frame.Function != "" {
+ stack = append(stack, frame.Function)
+ }
+ if more == false {
+ break
+ }
+ }
+
+ rec := HeapRecord{
+ InuseBytes: inuse,
+ InuseObjects: r.InUseObjects(),
+ AllocBytes: r.AllocBytes,
+ AllocObjects: r.AllocObjects,
+ Stack: stack,
+ }
+ records = append(records, rec)
+
+ totalInuse += inuse
+ totalAlloc += r.AllocBytes
+ totalObjects += r.InUseObjects()
+ }
+
+ sort.Slice(records, func(i, j int) bool {
+ return records[i].InuseBytes > records[j].InuseBytes
+ })
+
+ return ResponseDoHeapProfile{
+ Records: records,
+ TotalInuse: totalInuse,
+ TotalAlloc: totalAlloc,
+ TotalObjects: totalObjects,
+ }
+}
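The two-call pattern above (size query, then fill) can race with ongoing allocation: the profile may grow between the calls, in which case `runtime.MemProfile` leaves the slice untouched and returns `ok == false`. The runtime documentation recommends retrying with headroom until the snapshot fits; a sketch of that loop (function name is illustrative):

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshotHeapProfile fetches a heap profile snapshot using the retry loop
// suggested by the runtime.MemProfile documentation: query the record count,
// allocate with headroom, and retry if the profile grew between the calls.
func snapshotHeapProfile(inuseZero bool) []runtime.MemProfileRecord {
	n, _ := runtime.MemProfile(nil, inuseZero)
	for {
		p := make([]runtime.MemProfileRecord, n+50)
		var ok bool
		n, ok = runtime.MemProfile(p, inuseZero)
		if ok {
			return p[:n]
		}
		// profile grew: n now holds the required size, loop and retry
	}
}

func main() {
	records := snapshotHeapProfile(true)
	var inuse int64
	for _, r := range records {
		inuse += r.InUseBytes()
	}
	fmt.Println(len(records) >= 0 && inuse >= 0) // true
}
```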
diff --git a/app/system/inspect/heap_inspector.go b/app/system/inspect/heap_inspector.go
new file mode 100644
index 000000000..f9ecc2941
--- /dev/null
+++ b/app/system/inspect/heap_inspector.go
@@ -0,0 +1,201 @@
+package inspect
+
+import (
+ "fmt"
+ "runtime"
+ "sort"
+ "strings"
+
+ "ergo.services/ergo/act"
+ "ergo.services/ergo/gen"
+)
+
+func factory_heap() gen.ProcessBehavior {
+ return &heap_inspector{}
+}
+
+type heap_inspector struct {
+ act.Actor
+ token gen.Ref
+
+ limit int
+ name string
+
+ generating bool
+ loopID uint64
+ event gen.Atom
+}
+
+func (h *heap_inspector) Init(args ...any) error {
+ h.limit = args[0].(int)
+ h.name = args[1].(string)
+
+ h.Log().SetLogger("default")
+ h.SetCompression(true)
+
+ eopts := gen.EventOptions{Notify: true, Buffer: 1}
+ hash := filterHash(h.name, "", "", "", 0, h.limit)
+ h.event = gen.Atom(fmt.Sprintf("%s_%s", inspectHeap, hash))
+ token, err := h.RegisterEvent(h.event, eopts)
+ if err != nil {
+ h.Log().Error("unable to register event: %s", err)
+ return err
+ }
+ h.token = token
+ h.SendAfter(h.PID(), shutdown{}, inspectHeapIdlePeriod)
+
+ return nil
+}
+
+func (h *heap_inspector) HandleMessage(from gen.PID, message any) error {
+ switch m := message.(type) {
+ case generate:
+ if m.id != h.loopID || h.generating == false {
+ break
+ }
+
+ records, totalAlloc, totalFree := h.captureTop()
+
+ var totalInuse, totalObjects int64
+ for _, r := range records {
+ totalInuse += r.InuseBytes
+ totalObjects += r.InuseObjects
+ }
+
+ var ms runtime.MemStats
+ runtime.ReadMemStats(&ms)
+
+ ev := MessageInspectHeap{
+ Node: h.Node().Name(),
+ Records: records,
+ TotalInuse: totalInuse,
+ TotalObjects: totalObjects,
+ TotalAlloc: totalAlloc,
+ TotalFree: totalFree,
+ GCCPUFraction: ms.GCCPUFraction,
+ }
+
+ if err := h.SendEvent(h.event, h.token, ev); err != nil {
+ h.Log().Error("unable to send event %q: %s", h.event, err)
+ return gen.TerminateReasonNormal
+ }
+
+ h.SendAfter(h.PID(), generate{id: h.loopID}, inspectHeapPeriod)
+
+ case requestInspect:
+ response := ResponseInspectHeap{
+ Event: gen.Event{
+ Name: h.event,
+ Node: h.Node().Name(),
+ },
+ }
+ h.SendResponse(m.pid, m.ref, response)
+
+ case shutdown:
+ if h.generating {
+ break
+ }
+ return gen.TerminateReasonNormal
+
+ case gen.MessageEventStart:
+ h.loopID++
+ h.Send(h.PID(), generate{id: h.loopID})
+ h.generating = true
+
+ case gen.MessageEventStop:
+ if h.generating {
+ h.generating = false
+ h.SendAfter(h.PID(), shutdown{}, inspectHeapIdlePeriod)
+ }
+ }
+
+ return nil
+}
+
+func (h *heap_inspector) Terminate(reason error) {}
+
+func (h *heap_inspector) captureTop() ([]HeapRecord, int64, int64) {
+ var p []runtime.MemProfileRecord
+ // inuseZero=true includes fully-freed records, which totalAlloc/totalFree below rely on
+ n, _ := runtime.MemProfile(nil, true)
+ p = make([]runtime.MemProfileRecord, n+50) // headroom: records may be added between the two calls
+ if n, ok := runtime.MemProfile(p, true); ok {
+ p = p[:n]
+ }
+
+ nameLower := strings.ToLower(h.name)
+
+ type entry struct {
+ inuse int64
+ objects int64
+ alloc int64
+ allocN int64
+ freeN int64
+ stack []string
+ }
+
+ var entries []entry
+ var totalAlloc, totalFree int64
+
+ for _, r := range p {
+ totalAlloc += r.AllocObjects
+ totalFree += r.FreeObjects
+
+ inuse := r.InUseBytes()
+ if inuse <= 0 {
+ continue
+ }
+
+ frames := runtime.CallersFrames(r.Stack())
+ var stack []string
+ for {
+ frame, more := frames.Next()
+ if frame.Function != "" {
+ stack = append(stack, frame.Function)
+ }
+ if more == false {
+ break
+ }
+ }
+
+ if nameLower != "" {
+ matched := false
+ for _, f := range stack {
+ if strings.Contains(strings.ToLower(f), nameLower) {
+ matched = true
+ break
+ }
+ }
+ if matched == false {
+ continue
+ }
+ }
+
+ entries = append(entries, entry{
+ inuse: inuse,
+ objects: r.InUseObjects(),
+ alloc: r.AllocBytes,
+ allocN: r.AllocObjects,
+ freeN: r.FreeObjects,
+ stack: stack,
+ })
+ }
+
+ sort.Slice(entries, func(i, j int) bool {
+ return entries[i].inuse > entries[j].inuse
+ })
+
+ if len(entries) > h.limit {
+ entries = entries[:h.limit]
+ }
+
+ records := make([]HeapRecord, len(entries))
+ for i, e := range entries {
+ records[i] = HeapRecord{
+ InuseBytes: e.inuse,
+ InuseObjects: e.objects,
+ AllocBytes: e.alloc,
+ AllocObjects: e.allocN,
+ FreeObjects: e.freeN,
+ Stack: e.stack,
+ }
+ }
+ return records, totalAlloc, totalFree
+}
diff --git a/app/system/inspect/inspect.go b/app/system/inspect/inspect.go
index f85eb7b08..30d1eb6c9 100644
--- a/app/system/inspect/inspect.go
+++ b/app/system/inspect/inspect.go
@@ -48,6 +48,8 @@ const (
inspectLog = "inspect_log"
inspectLogIdlePeriod = 10 * time.Second
+ inspectTracing = "inspect_tracing"
+
inspectApplicationList = "inspect_application_list"
inspectApplicationListPeriod = time.Second
inspectApplicationListIdlePeriod = 5 * time.Second
@@ -55,6 +57,22 @@ const (
inspectApplicationTree = "inspect_application_tree"
inspectApplicationTreePeriod = time.Second
inspectApplicationTreeIdlePeriod = 5 * time.Second
+
+ inspectEventList = "inspect_event_list"
+ inspectEventListPeriod = time.Second
+ inspectEventListIdlePeriod = 5 * time.Second
+
+ inspectProcessRange = "inspect_process_range"
+ inspectProcessRangePeriod = time.Second
+ inspectProcessRangeIdlePeriod = 5 * time.Second
+
+ inspectConnectionList = "inspect_connection_list"
+ inspectConnectionListPeriod = time.Second
+ inspectConnectionListIdlePeriod = 5 * time.Second
+
+ inspectHeap = "inspect_heap"
+ inspectHeapPeriod = time.Second
+ inspectHeapIdlePeriod = 5 * time.Second
)
var (
@@ -68,9 +86,67 @@ var (
)
func Factory() gen.ProcessBehavior {
+ return &inspectPool{}
+}
+
+// RegisterTypes registers all inspector wire-format types with the given network.
+// Called by the system application during Load, after the network stack is up.
+func RegisterTypes(network gen.Network) error {
+ types := []any{
+ RequestInspectNode{}, ResponseInspectNode{}, MessageInspectNode{},
+ RequestInspectNetwork{}, ResponseInspectNetwork{}, MessageInspectNetwork{},
+ RequestInspectConnection{}, ResponseInspectConnection{}, MessageInspectConnection{},
+ RequestInspectConnectionList{}, ResponseInspectConnectionList{}, MessageInspectConnectionList{},
+ RequestInspectProcessList{}, ResponseInspectProcessList{}, MessageInspectProcessList{},
+ RequestInspectProcessRange{}, ResponseInspectProcessRange{},
+ RequestInspectEventList{}, ResponseInspectEventList{}, MessageInspectEventList{},
+ RequestInspectLog{}, ResponseInspectLog{}, InspectLogEntry{}, MessageInspectLog{},
+ RequestInspectProcess{}, ResponseInspectProcess{}, MessageInspectProcess{},
+ RequestInspectProcessState{}, ResponseInspectProcessState{}, MessageInspectProcessState{},
+ RequestInspectMeta{}, ResponseInspectMeta{}, MessageInspectMeta{},
+ RequestInspectMetaState{}, ResponseInspectMetaState{}, MessageInspectMetaState{},
+ RequestInspectApplicationList{}, ResponseInspectApplicationList{}, MessageInspectApplicationList{},
+ RequestInspectApplicationTree{}, ResponseInspectApplicationTree{}, MessageInspectApplicationTree{},
+ RequestInspectHeap{}, ResponseInspectHeap{}, MessageInspectHeap{},
+ RequestInspectTracing{}, ResponseInspectTracing{}, MessageInspectTracing{},
+
+ RequestDoSend{}, ResponseDoSend{},
+ RequestDoSendMeta{}, ResponseDoSendMeta{},
+ RequestDoSendExit{}, ResponseDoSendExit{},
+ RequestDoSendExitMeta{}, ResponseDoSendExitMeta{},
+ RequestDoKill{}, ResponseDoKill{},
+ RequestDoSetLogLevel{}, RequestDoSetProcessLogLevel{}, RequestDoSetMetaLogLevel{}, ResponseDoSetLogLevel{},
+ RequestDoSetProcessSendPriority{}, RequestDoSetProcessCompression{},
+ RequestDoSetProcessCompressionType{}, RequestDoSetProcessCompressionLevel{},
+ RequestDoSetProcessCompressionThreshold{}, RequestDoSetProcessKeepNetworkOrder{},
+ RequestDoSetProcessImportantDelivery{}, RequestDoSetMetaSendPriority{}, ResponseDoSet{},
+ RequestDoSetNodeTracingSampler{}, RequestDoSetProcessTracingSampler{},
+ RequestDoAppStart{}, ResponseDoAppStart{},
+ RequestDoAppStop{}, ResponseDoAppStop{},
+ RequestDoAppUnload{}, ResponseDoAppUnload{},
+ RequestDoInspect{}, ResponseDoInspect{},
+ RequestDoGoroutines{}, GoroutineGroup{}, ResponseDoGoroutines{},
+ RequestDoHeapProfile{}, HeapRecord{}, ResponseDoHeapProfile{},
+ RequestDoTypes{}, ResponseDoTypes{},
+ }
+ return network.RegisterTypes(types)
+}
+
+func workerFactory() gen.ProcessBehavior {
return &inspect{}
}
+type inspectPool struct {
+ act.Pool
+}
+
+func (p *inspectPool) Init(args ...any) (act.PoolOptions, error) {
+ return act.PoolOptions{
+ PoolSize: 5,
+ WorkerFactory: workerFactory,
+ }, nil
+}
+
type inspect struct {
act.Actor
}
@@ -82,11 +158,13 @@ type requestInspect struct {
type register struct{}
type shutdown struct{}
-type generate struct{}
+type generate struct{ id uint64 }
+type flushLog struct{ id uint64 }
func (i *inspect) Init(args ...any) error {
i.Log().SetLogger("default")
i.Log().Debug("%s started", i.Name())
+ i.SetCompression(true)
return nil
}
@@ -146,14 +224,16 @@ func (i *inspect) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error
opts := gen.ProcessOptions{
LinkParent: true,
}
- if r.Start < 1000 {
+ if r.Start >= 0 && r.Start < 1000 {
r.Start = 1000
}
if r.Limit < 1 {
r.Limit = 1000
}
- pname := gen.Atom(fmt.Sprintf("%s_%d_%d", inspectProcessList, r.Start, r.Start+r.Limit-1))
- _, err := i.SpawnRegister(pname, factory_process_list, opts, r.Start, r.Limit)
+ hash := filterHash(r.Name, r.Behavior, r.Application, r.State, r.MinMailbox, r.Limit)
+ pname := gen.Atom(fmt.Sprintf("%s_%d_%s", inspectProcessList, r.Start, hash))
+ _, err := i.SpawnRegister(pname, factory_process_list, opts,
+ r.Start, r.Limit, r.Name, r.Behavior, r.Application, r.State, r.MinMailbox)
if err != nil && err != gen.ErrTaken {
return err, nil
}
@@ -165,6 +245,27 @@ func (i *inspect) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error
i.Send(pname, forward)
return nil, nil // no reply
+ case RequestInspectProcessRange:
+ opts := gen.ProcessOptions{
+ LinkParent: true,
+ }
+ if r.Limit < 1 {
+ r.Limit = 10000
+ }
+ hash := filterHash(r.Name, r.Behavior, r.Application, r.State, r.MinMailbox, r.Limit)
+ pname := gen.Atom(fmt.Sprintf("%s_%s", inspectProcessRange, hash))
+ _, err := i.SpawnRegister(pname, factory_process_range, opts,
+ r.Name, r.Behavior, r.Application, r.State, r.MinMailbox, r.Limit, hash)
+ if err != nil && err != gen.ErrTaken {
+ return err, nil
+ }
+ forward := requestInspect{
+ pid: from,
+ ref: ref,
+ }
+ i.Send(pname, forward)
+ return nil, nil // no reply
+
case RequestInspectProcess:
opts := gen.ProcessOptions{
LinkParent: true,
@@ -240,36 +341,28 @@ func (i *inspect) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error
// try to spawn node inspector process
opts := gen.ProcessOptions{
LinkParent: true,
+ Compression: gen.Compression{
+ Enable: true,
+ Type: gen.CompressionTypeGZIP,
+ Level: gen.CompressionBestSpeed,
+ },
}
- name := "diwep"
levels := r.Levels
if len(r.Levels) > 0 {
- b := []byte{}
- sort.Slice(r.Levels, func(i, j int) bool {
- return r.Levels[i] < r.Levels[j]
- })
- for i := range r.Levels {
- switch r.Levels[i] {
- case gen.LogLevelDebug:
- b = append(b, 'd')
- case gen.LogLevelInfo:
- b = append(b, 'i')
- case gen.LogLevelWarning:
- b = append(b, 'w')
- case gen.LogLevelError:
- b = append(b, 'e')
- case gen.LogLevelPanic:
- b = append(b, 'p')
- }
- }
- name = string(b)
+ sort.Slice(levels, func(i, j int) bool { return levels[i] < levels[j] })
} else {
levels = inspectLogFilter
}
- pname := gen.Atom(fmt.Sprintf("%s_%s", inspectLog, name))
- _, err := i.SpawnRegister(pname, factory_log, opts, levels)
+ limit := r.Limit
+ if limit < 1 {
+ limit = 500
+ }
+
+ hash := fmt.Sprintf("%x", hashStr(fmt.Sprintf("%v|%d|%s|%v", levels, limit, r.MessagePattern, r.MessageExclude)))
+ pname := gen.Atom(fmt.Sprintf("%s_%s", inspectLog, hash))
+ _, err := i.SpawnRegister(pname, factory_log, opts, levels, limit, r.MessagePattern, r.MessageExclude)
if err != nil && err != gen.ErrTaken {
return err, nil
}
@@ -281,6 +374,75 @@ func (i *inspect) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error
i.Send(pname, forward)
return nil, nil // no reply
+ case RequestInspectTracing:
+ opts := gen.ProcessOptions{
+ LinkParent: true,
+ Compression: gen.Compression{
+ Enable: true,
+ Type: gen.CompressionTypeGZIP,
+ Level: gen.CompressionBestSpeed,
+ },
+ }
+
+ limit := r.Limit
+ if limit < 1 {
+ limit = 500
+ }
+
+ hash := fmt.Sprintf("%x", hashStr(fmt.Sprintf("%v|%d|%d|%d|%s|%v", r.Flags, limit, r.Kinds, r.Points, r.MessagePattern, r.MessageExclude)))
+ pname := gen.Atom(fmt.Sprintf("%s_%s", inspectTracing, hash))
+ _, err := i.SpawnRegister(pname, factory_tracing, opts, r.Flags, limit, r.Kinds, r.Points, r.MessagePattern, r.MessageExclude)
+ if err != nil && err != gen.ErrTaken {
+ return err, nil
+ }
+ forward := requestInspect{
+ pid: from,
+ ref: ref,
+ }
+ i.Send(pname, forward)
+ return nil, nil
+
+ case RequestInspectEventList:
+ opts := gen.ProcessOptions{
+ LinkParent: true,
+ }
+ if r.Limit < 1 {
+ r.Limit = 500
+ }
+ hash := eventListHash(r.Timestamp, r.Name, r.Notify, r.Buffered, r.Open, r.MinSubscribers, r.Limit)
+ pname := gen.Atom(fmt.Sprintf("%s_%s", inspectEventList, hash))
+ _, err := i.SpawnRegister(pname, factory_event_list, opts,
+ r.Timestamp, r.Name, r.Notify, r.Buffered, r.Open, r.MinSubscribers, r.Limit, hash)
+ if err != nil && err != gen.ErrTaken {
+ return err, nil
+ }
+ forward := requestInspect{
+ pid: from,
+ ref: ref,
+ }
+ i.Send(pname, forward)
+ return nil, nil
+
+ case RequestInspectConnectionList:
+ opts := gen.ProcessOptions{
+ LinkParent: true,
+ }
+ if r.Limit < 1 {
+ r.Limit = 100
+ }
+ hash := connectionListHash(r.Name, r.Limit)
+ pname := gen.Atom(fmt.Sprintf("%s_%s", inspectConnectionList, hash))
+ _, err := i.SpawnRegister(pname, factory_connection_list, opts, r.Name, r.Limit, hash)
+ if err != nil && err != gen.ErrTaken {
+ return err, nil
+ }
+ forward := requestInspect{
+ pid: from,
+ ref: ref,
+ }
+ i.Send(pname, forward)
+ return nil, nil
+
case RequestInspectApplicationList:
opts := gen.ProcessOptions{
LinkParent: true,
@@ -317,6 +479,21 @@ func (i *inspect) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error
i.Send(pname, forward)
return nil, nil // no reply
+ case RequestInspectHeap:
+ opts := gen.ProcessOptions{LinkParent: true}
+ if r.Limit < 1 {
+ r.Limit = 100
+ }
+ hash := filterHash(r.Name, "", "", "", 0, r.Limit)
+ pname := gen.Atom(fmt.Sprintf("%s_%s", inspectHeap, hash))
+ _, err := i.SpawnRegister(pname, factory_heap, opts, r.Limit, r.Name)
+ if err != nil && err != gen.ErrTaken {
+ return err, nil
+ }
+ forward := requestInspect{pid: from, ref: ref}
+ i.Send(pname, forward)
+ return nil, nil
+
// do commands
case RequestDoSend:
@@ -355,19 +532,111 @@ func (i *inspect) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error
}
return response, nil
- case RequestDoSetLogLevelProcess:
+ case RequestDoSetNodeTracingSampler:
+ sampler := makeSampler(r.Type, r.Rate, r.Limit)
+ return ResponseDoSet{Error: i.Node().SetTracingSampler(sampler)}, nil
+
+ case RequestDoSetProcessTracingSampler:
+ sampler := makeSampler(r.Type, r.Rate, r.Limit)
+ return ResponseDoSet{Error: i.Node().SetProcessTracingSampler(r.PID, sampler)}, nil
+
+ case RequestDoSetProcessLogLevel:
response := ResponseDoSetLogLevel{
- Error: i.Node().SetLogLevelProcess(r.PID, r.Level),
+ Error: i.Node().SetProcessLogLevel(r.PID, r.Level),
}
return response, nil
- case RequestDoSetLogLevelMeta:
+ case RequestDoSetMetaLogLevel:
response := ResponseDoSetLogLevel{
- Error: i.Node().SetLogLevelMeta(r.Meta, r.Level),
+ Error: i.Node().SetMetaLogLevel(r.Meta, r.Level),
}
return response, nil
+
+ // process settings
+
+ case RequestDoSetProcessSendPriority:
+ return ResponseDoSet{Error: i.Node().SetProcessSendPriority(r.PID, r.Priority)}, nil
+
+ case RequestDoSetProcessCompression:
+ return ResponseDoSet{Error: i.Node().SetProcessCompression(r.PID, r.Enabled)}, nil
+
+ case RequestDoSetProcessCompressionType:
+ return ResponseDoSet{Error: i.Node().SetProcessCompressionType(r.PID, r.Type)}, nil
+
+ case RequestDoSetProcessCompressionLevel:
+ return ResponseDoSet{Error: i.Node().SetProcessCompressionLevel(r.PID, r.Level)}, nil
+
+ case RequestDoSetProcessCompressionThreshold:
+ return ResponseDoSet{Error: i.Node().SetProcessCompressionThreshold(r.PID, r.Threshold)}, nil
+
+ case RequestDoSetProcessKeepNetworkOrder:
+ return ResponseDoSet{Error: i.Node().SetProcessKeepNetworkOrder(r.PID, r.Order)}, nil
+
+ case RequestDoSetProcessImportantDelivery:
+ return ResponseDoSet{Error: i.Node().SetProcessImportantDelivery(r.PID, r.Important)}, nil
+
+ // meta settings
+
+ case RequestDoSetMetaSendPriority:
+ return ResponseDoSet{Error: i.Node().SetMetaSendPriority(r.Meta, r.Priority)}, nil
+
+ // app lifecycle
+
+ case RequestDoAppStart:
+ opts := gen.ApplicationOptions{}
+ var err error
+ switch r.Mode {
+ case gen.ApplicationModeTemporary:
+ err = i.Node().ApplicationStartTemporary(r.Name, opts)
+ case gen.ApplicationModeTransient:
+ err = i.Node().ApplicationStartTransient(r.Name, opts)
+ case gen.ApplicationModePermanent:
+ err = i.Node().ApplicationStartPermanent(r.Name, opts)
+ default:
+ err = i.Node().ApplicationStart(r.Name, opts)
+ }
+ return ResponseDoAppStart{Error: err}, nil
+
+ case RequestDoAppStop:
+ var err error
+ if r.Force {
+ err = i.Node().ApplicationStopForce(r.Name)
+ } else {
+ err = i.Node().ApplicationStop(r.Name)
+ }
+ return ResponseDoAppStop{Error: err}, nil
+
+ case RequestDoAppUnload:
+ return ResponseDoAppUnload{Error: i.Node().ApplicationUnload(r.Name)}, nil
+
+ // one-shot inspect
+
+ case RequestDoInspect:
+ state, err := i.Inspect(r.PID)
+ return ResponseDoInspect{State: state, Error: err}, nil
+
+ case RequestDoGoroutines:
+ return captureGoroutines(r), nil
+
+ case RequestDoHeapProfile:
+ return captureHeapProfile(r), nil
+
+ case RequestDoTypes:
+ return ResponseDoTypes{Types: i.Node().Network().RegisteredTypes()}, nil
}
i.Log().Error("unsupported request: %#v", request)
return gen.ErrUnsupported, nil
}
+
+// makeSampler maps a request's sampler type string to a gen.TracingSampler.
+// Unknown types (including "disable") fall back to the disabled sampler.
+func makeSampler(typ string, rate float64, limit int) gen.TracingSampler {
+ switch typ {
+ case "always":
+ return gen.TracingSamplerAlways
+ case "ratio":
+ return gen.TracingSamplerRatio(rate)
+ case "rate_limit":
+ return gen.TracingSamplerRateLimit(limit)
+ }
+ return gen.TracingSamplerDisable
+}
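Every `RequestInspect*` branch above follows the same spawn-or-forward pattern: derive a deterministic process name from the filter hash, tolerate `gen.ErrTaken` so concurrent requests reuse an already-running inspector, then forward the request to that name. A toy sketch of that idempotent spawn — the map, `errTaken`, and function names here are stand-ins for illustration, not the `gen` API:

```go
package main

import (
	"errors"
	"fmt"
)

var errTaken = errors.New("taken") // stand-in for gen.ErrTaken

// registry is a toy stand-in for the node's named-process table.
var registry = map[string][]string{}

// spawnRegister "spawns" a named worker, returning errTaken if the
// name is already registered (like SpawnRegister in the handler).
func spawnRegister(name string) error {
	if _, ok := registry[name]; ok {
		return errTaken
	}
	registry[name] = []string{}
	return nil
}

// handle mirrors the HandleCall pattern: per-filter inspector name,
// errTaken tolerated, request forwarded to new or existing worker.
func handle(name, req string) error {
	if err := spawnRegister(name); err != nil && err != errTaken {
		return err
	}
	registry[name] = append(registry[name], req)
	return nil
}

func main() {
	handle("inspect_log_abc", "req1")
	handle("inspect_log_abc", "req2") // reuses the existing worker
	fmt.Println(len(registry))                    // 1
	fmt.Println(len(registry["inspect_log_abc"])) // 2
}
```

Because the name encodes the filter hash, two clients asking for the same filter share one inspector, while different filters get independent ones.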
diff --git a/app/system/inspect/log.go b/app/system/inspect/log.go
index 1ffb873f5..a568daec3 100644
--- a/app/system/inspect/log.go
+++ b/app/system/inspect/log.go
@@ -2,6 +2,8 @@ package inspect
import (
"fmt"
+ "strings"
+ "time"
"ergo.services/ergo/act"
"ergo.services/ergo/gen"
@@ -16,16 +18,47 @@ type log struct {
token gen.Ref
event gen.Atom
- levels []gen.LogLevel
- generating bool
+ levels []gen.LogLevel
+ limit int
+ messagePattern string // lower-cased for fast matching
+ messageExclude bool
+ generating bool
+ loopID uint64
+
+ // ring buffer
+ ring []InspectLogEntry
+ pos int
+ full bool
+ received int64
}
+const logFlushInterval = time.Second
+
func (il *log) Init(args ...any) error {
il.levels = args[0].([]gen.LogLevel)
+ il.limit = args[1].(int)
+ if len(args) > 3 {
+ il.messagePattern = strings.ToLower(args[2].(string))
+ il.messageExclude = args[3].(bool)
+ }
+ il.ring = make([]InspectLogEntry, il.limit)
il.Log().SetLogger("default")
- il.Log().Debug("log inspector started")
- // RegisterEvent is not allowed here
- il.Send(il.PID(), register{})
+ il.Log().Debug("log inspector started (limit: %d)", il.limit)
+ il.SetCompression(true)
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ }
+ evname := gen.Atom(fmt.Sprintf("%s_%s", string(il.Name()), il.PID()))
+ token, err := il.RegisterEvent(evname, eopts)
+ if err != nil {
+ return err
+ }
+
+ il.event = evname
+ il.token = token
+ il.SendAfter(il.PID(), shutdown{}, inspectLogIdlePeriod)
+
return nil
}
@@ -34,6 +67,49 @@ func (il *log) Init(args ...any) error {
func (il *log) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
+ case flushLog:
+ // drop stale flush timers scheduled by a previous subscription loop
+ if m.id != il.loopID || il.generating == false {
+ break
+ }
+ if il.received == 0 {
+ il.SendAfter(il.PID(), flushLog{id: il.loopID}, logFlushInterval)
+ break
+ }
+
+ // collect entries from ring buffer in correct order
+ var entries []InspectLogEntry
+ if il.full {
+ // ring wrapped: oldest at pos, newest at pos-1
+ entries = make([]InspectLogEntry, il.limit)
+ copy(entries, il.ring[il.pos:])
+ copy(entries[il.limit-il.pos:], il.ring[:il.pos])
+ } else {
+ entries = make([]InspectLogEntry, il.pos)
+ copy(entries, il.ring[:il.pos])
+ }
+
+ suppressed := il.received - int64(len(entries))
+ if suppressed < 0 {
+ suppressed = 0
+ }
+
+ ev := MessageInspectLog{
+ Node: il.Node().Name(),
+ Entries: entries,
+ Suppressed: suppressed,
+ }
+
+ // reset ring
+ il.pos = 0
+ il.full = false
+ il.received = 0
+
+ if err := il.SendEvent(il.event, il.token, ev); err != nil {
+ return gen.TerminateReasonNormal
+ }
+
+ il.SendAfter(il.PID(), flushLog{id: il.loopID}, logFlushInterval)
+
case requestInspect:
response := ResponseInspectLog{
Event: gen.Event{
@@ -43,39 +119,26 @@ func (il *log) HandleMessage(from gen.PID, message any) error {
}
il.SendResponse(m.pid, m.ref, response)
- case register:
- eopts := gen.EventOptions{
- Notify: true,
- }
- evname := gen.Atom(fmt.Sprintf("%s_%s", string(il.Name()), il.PID()))
- token, err := il.RegisterEvent(evname, eopts)
- if err != nil {
- return err
- }
-
- il.event = evname
- il.token = token
- il.SendAfter(il.PID(), shutdown{}, inspectLogIdlePeriod)
-
case shutdown:
if il.generating {
- break // ignore.
+ break // ignore
}
return gen.TerminateReasonNormal
case gen.MessageEventStart: // got first subscriber
- // register this process as a logger
il.Log().Debug("add this process as a logger")
il.Node().LoggerAddPID(il.PID(), il.PID().String(), il.levels...)
- // we cant use Log() method while this process registered as a logger
+ il.loopID++
il.generating = true
+ il.SendAfter(il.PID(), flushLog{id: il.loopID}, logFlushInterval)
case gen.MessageEventStop: // no subscribers
- // unregister this process as a logger
il.Node().LoggerDeletePID(il.PID())
- // now we can use Log() method
il.Log().Debug("removed this process as a logger")
il.generating = false
+ il.pos = 0
+ il.full = false
+ il.received = 0
il.SendAfter(il.PID(), shutdown{}, inspectLogIdlePeriod)
}
@@ -83,67 +146,52 @@ func (il *log) HandleMessage(from gen.PID, message any) error {
}
func (il *log) HandleLog(message gen.MessageLog) error {
+ msg := fmt.Sprintf(message.Format, message.Args...)
+
+ if il.messagePattern != "" {
+ contains := strings.Contains(strings.ToLower(msg), il.messagePattern)
+ // skip: an excluded match, or a non-match in include mode
+ if il.messageExclude == contains {
+ return nil
+ }
+ }
+
+ entry := InspectLogEntry{
+ Timestamp: message.Time.UnixNano(),
+ Level: message.Level,
+ Message: msg,
+ Fields: message.Fields,
+ }
+
switch m := message.Source.(type) {
case gen.MessageLogNode:
- // handle message
- ev := MessageInspectLogNode{
- Node: m.Node,
- Creation: m.Creation,
- Timestamp: message.Time.UnixNano(),
- Level: message.Level,
- Message: fmt.Sprintf(message.Format, message.Args...),
- }
- if err := il.SendEvent(il.event, il.token, ev); err != nil {
- return gen.TerminateReasonNormal
- }
+ entry.Source = "node"
+ entry.Creation = m.Creation
case gen.MessageLogProcess:
- // handle message
- ev := MessageInspectLogProcess{
- Node: m.Node,
- Name: m.Name,
- PID: m.PID,
- Timestamp: message.Time.UnixNano(),
- Level: message.Level,
- Message: fmt.Sprintf(message.Format, message.Args...),
- }
- if err := il.SendEvent(il.event, il.token, ev); err != nil {
- return gen.TerminateReasonNormal
- }
-
+ entry.Source = "process"
+ entry.Name = m.Name
+ entry.PID = m.PID
+ entry.Behavior = m.Behavior
case gen.MessageLogMeta:
- // handle message
- ev := MessageInspectLogMeta{
- Node: m.Node,
- Parent: m.Parent,
- Meta: m.Meta,
- Timestamp: message.Time.UnixNano(),
- Level: message.Level,
- Message: fmt.Sprintf(message.Format, message.Args...),
- }
-
- if err := il.SendEvent(il.event, il.token, ev); err != nil {
- return gen.TerminateReasonNormal
- }
+ entry.Source = "meta"
+ entry.Parent = m.Parent
+ entry.Meta = m.Meta
+ entry.Behavior = m.Behavior
case gen.MessageLogNetwork:
- ev := MessageInspectLogNetwork{
- Node: m.Node,
- Peer: m.Peer,
- Timestamp: message.Time.UnixNano(),
- Level: message.Level,
- Message: fmt.Sprintf(message.Format, message.Args...),
- }
- if err := il.SendEvent(il.event, il.token, ev); err != nil {
- return gen.TerminateReasonNormal
- }
+ entry.Source = "network"
+ entry.Peer = gen.Atom(m.Peer.CRC32())
}
- // ignore any other log messages
- // TODO should we handle them?
+
+ il.ring[il.pos] = entry
+ il.pos++
+ if il.pos >= il.limit {
+ il.pos = 0
+ il.full = true
+ }
+ il.received++
+
return nil
}
func (il *log) Terminate(reason error) {
- // since this process is already unregistered
- // it is also unregistered as a logger
- // so we can use Log() here
il.Log().Debug("log inspector terminated: %s", reason)
}
diff --git a/app/system/inspect/message.go b/app/system/inspect/message.go
index 547c6b3e9..c51042703 100644
--- a/app/system/inspect/message.go
+++ b/app/system/inspect/message.go
@@ -4,13 +4,15 @@ import "ergo.services/ergo/gen"
type RequestInspectNode struct{}
type ResponseInspectNode struct {
- CRC32 string
- Event gen.Event
- OS string
- Arch string
- Cores int
- Version gen.Version
- Creation int64
+ CRC32 string
+ Event gen.Event
+ OS string
+ Arch string
+ Cores int
+ Timezone string
+ GoVersion string
+ Version gen.Version
+ Creation int64
}
type MessageInspectNode struct {
@@ -48,11 +50,31 @@ type MessageInspectConnection struct {
Info gen.RemoteNodeInfo
}
+// connection list (scoped)
+
+type RequestInspectConnectionList struct {
+ Limit int
+ Name string
+}
+type ResponseInspectConnectionList struct {
+ Event gen.Event
+}
+
+type MessageInspectConnectionList struct {
+ Node gen.Atom
+ Connections []gen.RemoteNodeInfo
+}
+
// process list
type RequestInspectProcessList struct {
- Start int
- Limit int
+ Start int
+ Limit int
+ Name string
+ Behavior string
+ Application string
+ State string
+ MinMailbox uint64
}
type ResponseInspectProcessList struct {
Event gen.Event
@@ -66,44 +88,34 @@ type MessageInspectProcessList struct {
// node logs
type RequestInspectLog struct {
- Levels []gen.LogLevel
+ Levels []gen.LogLevel
+ Limit int
+ MessagePattern string
+ MessageExclude bool
}
type ResponseInspectLog struct {
Event gen.Event
}
-type MessageInspectLogNode struct {
- Node gen.Atom
- Creation int64
- Timestamp int64
- Level gen.LogLevel
- Message string
-}
-
-type MessageInspectLogProcess struct {
- Node gen.Atom
+type InspectLogEntry struct {
+ Source string // "node", "process", "network", "meta"
Name gen.Atom
PID gen.PID
- Timestamp int64
- Level gen.LogLevel
- Message string
-}
-
-type MessageInspectLogNetwork struct {
- Node gen.Atom
+ Behavior string
Peer gen.Atom
- Timestamp int64
- Level gen.LogLevel
- Message string
-}
-
-type MessageInspectLogMeta struct {
- Node gen.Atom
Parent gen.PID
Meta gen.Alias
+ Creation int64
Timestamp int64
Level gen.LogLevel
Message string
+ Fields []gen.LogField
+}
+
+type MessageInspectLog struct {
+ Node gen.Atom
+ Entries []InspectLogEntry
+ Suppressed int64
}
// process
@@ -116,9 +128,8 @@ type ResponseInspectProcess struct {
}
type MessageInspectProcess struct {
- Node gen.Atom
- Info gen.ProcessInfo
- Terminated bool
+ Node gen.Atom
+ Info gen.ProcessInfo
}
// process state
@@ -145,9 +156,8 @@ type ResponseInspectMeta struct {
}
type MessageInspectMeta struct {
- Node gen.Atom
- Info gen.MetaInfo
- Terminated bool
+ Node gen.Atom
+ Info gen.MetaInfo
}
// meta state
@@ -221,18 +231,235 @@ type ResponseDoSetLogLevel struct {
Error error
}
+// do set tracing sampler and flags (node-level)
+
+type RequestDoSetNodeTracingSampler struct {
+ Type string // "always", "disable", "ratio", "rate_limit"
+ Rate float64 // for ratio
+ Limit int // for rate_limit
+}
+
+type RequestDoSetProcessTracingSampler struct {
+ PID gen.PID
+ Type string
+ Rate float64
+ Limit int
+}
+
// process
-type RequestDoSetLogLevelProcess struct {
+type RequestDoSetProcessLogLevel struct {
PID gen.PID
Level gen.LogLevel
}
// meta
-type RequestDoSetLogLevelMeta struct {
+type RequestDoSetMetaLogLevel struct {
Meta gen.Alias
Level gen.LogLevel
}
+// do set process settings
+
+type RequestDoSetProcessSendPriority struct {
+ PID gen.PID
+ Priority gen.MessagePriority
+}
+
+type RequestDoSetProcessCompression struct {
+ PID gen.PID
+ Enabled bool
+}
+
+type RequestDoSetProcessCompressionType struct {
+ PID gen.PID
+ Type gen.CompressionType
+}
+
+type RequestDoSetProcessCompressionLevel struct {
+ PID gen.PID
+ Level gen.CompressionLevel
+}
+
+type RequestDoSetProcessCompressionThreshold struct {
+ PID gen.PID
+ Threshold int
+}
+
+type RequestDoSetProcessKeepNetworkOrder struct {
+ PID gen.PID
+ Order bool
+}
+
+type RequestDoSetProcessImportantDelivery struct {
+ PID gen.PID
+ Important bool
+}
+
+// do set meta settings
+
+type RequestDoSetMetaSendPriority struct {
+ Meta gen.Alias
+ Priority gen.MessagePriority
+}
+
+// generic response for do-set operations
+type ResponseDoSet struct {
+ Error error
+}
+
+// do app lifecycle
+
+type RequestDoAppStart struct {
+ Name gen.Atom
+ Mode gen.ApplicationMode
+}
+type ResponseDoAppStart struct {
+ Error error
+}
+
+type RequestDoAppStop struct {
+ Name gen.Atom
+ Force bool
+}
+type ResponseDoAppStop struct {
+ Error error
+}
+
+type RequestDoAppUnload struct {
+ Name gen.Atom
+}
+type ResponseDoAppUnload struct {
+ Error error
+}
+
+// do one-shot inspect
+
+type RequestDoInspect struct {
+ PID gen.PID
+}
+type ResponseDoInspect struct {
+ State map[string]string
+ Error error
+}
+
+// goroutine dump
+
+type RequestDoGoroutines struct {
+ Stack string // substring match in stack text
+ State string // exact state match (running, chan receive, etc.)
+ MinWait int64 // minimum wait duration in seconds (0 = any)
+}
+
+type GoroutineInfo struct {
+ ID int
+ State string
+ Wait string
+ Frames []string
+ FullText string
+}
+
+type GoroutineGroup struct {
+ Count int
+ State string
+ WaitSec int64
+ Origin string
+ Current string
+ Stack string
+ IDs []int
+}
+
+type ResponseDoGoroutines struct {
+ Groups []GoroutineGroup
+ Total int
+ Filtered int
+ Error error
+}
+
+// heap profile
+
+type RequestDoHeapProfile struct {
+ MinBytes int64
+}
+
+type HeapRecord struct {
+ InuseBytes int64
+ InuseObjects int64
+ AllocBytes int64
+ AllocObjects int64
+ FreeObjects int64
+ Stack []string
+}
+
+type HeapStats struct {
+ TotalInuse int64
+ TotalObjects int64
+ TotalAlloc int64
+ TotalFree int64
+}
+
+type ResponseDoHeapProfile struct {
+ Records []HeapRecord
+ TotalInuse int64
+ TotalAlloc int64
+ TotalObjects int64
+ Error error
+}
+
+// heap inspector (event-based)
+
+type RequestInspectHeap struct {
+ Limit int
+ Name string
+}
+type ResponseInspectHeap struct {
+ Event gen.Event
+}
+
+type MessageInspectHeap struct {
+ Node gen.Atom
+ Records []HeapRecord
+ TotalInuse int64
+ TotalObjects int64
+ TotalAlloc int64
+ TotalFree int64
+ GCCPUFraction float64
+}
+
+// process range (full scan with filters)
+
+type RequestInspectProcessRange struct {
+ Name string
+ Behavior string
+ Application string
+ State string
+ MinMailbox uint64
+ Limit int
+}
+type ResponseInspectProcessRange struct {
+ Event gen.Event
+}
+
+// event list
+
+type RequestInspectEventList struct {
+ Timestamp int64 // 0=oldest first, -1=newest first, >0=start from this Unix nanosecond timestamp
+ Limit int
+ Name string
+ Notify int // 0=any, 1=yes, -1=no
+ Buffered int // 0=any, 1=yes, -1=no
+ Open int // 0=any, 1=yes, -1=no
+ MinSubscribers int64
+}
+type ResponseInspectEventList struct {
+ Event gen.Event
+}
+
+type MessageInspectEventList struct {
+ Node gen.Atom
+ Events []gen.EventInfo
+}
+
// application list
type RequestInspectApplicationList struct{}
@@ -260,3 +487,33 @@ type MessageInspectApplicationTree struct {
Application gen.Atom
Processes []gen.ProcessShortInfo
}
+
+// tracing
+
+type RequestInspectTracing struct {
+ Flags gen.TracingFlags
+ Limit int
+ Kinds uint32 // bitmask: 1=send, 2=request, 4=response, 8=spawn, 16=terminate
+ Points uint32 // bitmask: 1=sent, 2=delivered, 4=processed
+ MessagePattern string
+ MessageExclude bool
+}
+
+type ResponseInspectTracing struct {
+ Event gen.Event
+}
+
+type MessageInspectTracing struct {
+ Node gen.Atom
+ Spans []gen.TracingSpan
+ Suppressed int64
+}
+
+// types
+
+type RequestDoTypes struct{}
+
+type ResponseDoTypes struct {
+ Types []gen.RegisteredTypeInfo
+ Error error
+}
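`RequestInspectTracing` filters by bitmask: `Kinds` selects which span kinds to capture and `Points` which trace points, per the field comments above. A hedged sketch of composing and testing such masks — the constant names below are illustrative only, not exported by the inspect package:

```go
package main

import "fmt"

// Bit values mirroring the comment on RequestInspectTracing.Kinds
// (1=send, 2=request, 4=response, 8=spawn, 16=terminate). These
// names are assumptions for this sketch.
const (
	kindSend      uint32 = 1 << iota // 1
	kindRequest                      // 2
	kindResponse                     // 4
	kindSpawn                        // 8
	kindTerminate                    // 16
)

func main() {
	kinds := kindSend | kindSpawn // trace sends and spawns only
	fmt.Println(kinds)            // 9

	// membership test: a set bit means "capture this kind"
	fmt.Println(kinds&kindSend != 0)    // true
	fmt.Println(kinds&kindRequest != 0) // false
}
```

A zero mask can then be treated as "no filter" by the consumer, the same convention the `Notify`/`Buffered`/`Open` tri-state fields use with 0=any.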
diff --git a/app/system/inspect/meta.go b/app/system/inspect/meta.go
index 7e7dc8e1d..f9c1c378d 100644
--- a/app/system/inspect/meta.go
+++ b/app/system/inspect/meta.go
@@ -17,6 +17,7 @@ type meta struct {
event gen.Atom
generating bool
+ loopID uint64
meta gen.Alias
}
@@ -24,22 +25,39 @@ func (im *meta) Init(args ...any) error {
im.meta = args[0].(gen.Alias)
im.Log().SetLogger("default")
im.Log().Debug("meta process inspector started. pid %s", im.meta)
- // RegisterEvent is not allowed here
- im.Send(im.PID(), register{})
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1, // keep the last event
+ }
+ evname := gen.Atom(fmt.Sprintf("%s_%s", inspectMeta, im.meta))
+ token, err := im.RegisterEvent(evname, eopts)
+ if err != nil {
+ im.Log().Error("unable to register meta process event: %s", err)
+ return err
+ }
+ im.Log().Info("registered event %s", evname)
+ im.event = evname
+ im.token = token
+ im.SendAfter(im.PID(), shutdown{}, inspectMetaIdlePeriod)
+
return nil
}
func (im *meta) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case generate:
- if im.generating == false {
+ if m.id != im.loopID || im.generating == false {
im.Log().Debug("generating canceled")
break // cancelled
}
im.Log().Debug("generating event")
ev := MessageInspectMeta{
- Node: im.Node().Name(),
- Terminated: true,
+ Node: im.Node().Name(),
+ Info: gen.MetaInfo{
+ ID: im.meta,
+ State: gen.MetaStateTerminated,
+ },
}
info, err := im.MetaInfo(im.meta)
@@ -56,12 +74,11 @@ func (im *meta) HandleMessage(from gen.PID, message any) error {
default:
im.Log().Error("unable to inspect meta process %s: %s", im.meta, err)
// will try next time
- im.SendAfter(im.PID(), generate{}, inspectMetaPeriod)
+ im.SendAfter(im.PID(), generate{id: im.loopID}, inspectMetaPeriod)
return nil
}
- ev.Terminated = false
ev.Info = info
if err := im.SendEvent(im.event, im.token, ev); err != nil {
@@ -69,7 +86,7 @@ func (im *meta) HandleMessage(from gen.PID, message any) error {
return gen.TerminateReasonNormal
}
- im.SendAfter(im.PID(), generate{}, inspectMetaPeriod)
+ im.SendAfter(im.PID(), generate{id: im.loopID}, inspectMetaPeriod)
case requestInspect:
response := ResponseInspectMeta{
@@ -81,23 +98,6 @@ func (im *meta) HandleMessage(from gen.PID, message any) error {
im.SendResponse(m.pid, m.ref, response)
im.Log().Debug("sent response for the inspect meta request to: %s", m.pid)
- case register:
- eopts := gen.EventOptions{
- Notify: true,
- Buffer: 1, // keep the last event
- }
- evname := gen.Atom(fmt.Sprintf("%s_%s", inspectMeta, im.meta))
- token, err := im.RegisterEvent(evname, eopts)
- if err != nil {
- im.Log().Error("unable to register meta process event: %s", err)
- return err
- }
- im.Log().Info("registered event %s", evname)
- im.event = evname
-
- im.token = token
- im.SendAfter(im.PID(), shutdown{}, inspectMetaIdlePeriod)
-
case shutdown:
if im.generating {
im.Log().Debug("ignore shutdown. generating is active")
@@ -107,7 +107,8 @@ func (im *meta) HandleMessage(from gen.PID, message any) error {
case gen.MessageEventStart: // got first subscriber
im.Log().Debug("got first subscriber. start generating events...")
- im.Send(im.PID(), generate{})
+ im.loopID++
+ im.Send(im.PID(), generate{id: im.loopID})
im.generating = true
case gen.MessageEventStop: // no subscribers
diff --git a/app/system/inspect/meta_state.go b/app/system/inspect/meta_state.go
index c27c8e6c2..31787de81 100644
--- a/app/system/inspect/meta_state.go
+++ b/app/system/inspect/meta_state.go
@@ -17,6 +17,7 @@ type meta_state struct {
event gen.Atom
generating bool
+ loopID uint64
meta gen.Alias
}
@@ -24,15 +25,29 @@ func (ims *meta_state) Init(args ...any) error {
ims.meta = args[0].(gen.Alias)
ims.Log().SetLogger("default")
ims.Log().Debug("meta state inspector started. id %s", ims.meta)
- // RegisterEvent is not allowed here
- ims.Send(ims.PID(), register{})
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1, // keep the last event
+ }
+ evname := gen.Atom(fmt.Sprintf("%s_%s", inspectMetaState, ims.meta))
+ token, err := ims.RegisterEvent(evname, eopts)
+ if err != nil {
+ ims.Log().Error("unable to register meta state event: %s", err)
+ return err
+ }
+ ims.Log().Info("registered event %s", evname)
+ ims.event = evname
+ ims.token = token
+ ims.SendAfter(ims.PID(), shutdown{}, inspectMetaStateIdlePeriod)
+
return nil
}
func (ims *meta_state) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case generate:
- if ims.generating == false {
+ if m.id != ims.loopID || ims.generating == false {
ims.Log().Debug("generating canceled")
break // cancelled
}
@@ -44,7 +59,7 @@ func (ims *meta_state) HandleMessage(from gen.PID, message any) error {
}
ims.Log().Error("unable to inspect meta state %s: %s", ims.meta, err)
// will try next time
- ims.SendAfter(ims.PID(), generate{}, inspectMetaStatePeriod)
+ ims.SendAfter(ims.PID(), generate{id: ims.loopID}, inspectMetaStatePeriod)
return nil
}
if state == nil {
@@ -62,7 +77,7 @@ func (ims *meta_state) HandleMessage(from gen.PID, message any) error {
return gen.TerminateReasonNormal
}
- ims.SendAfter(ims.PID(), generate{}, inspectMetaStatePeriod)
+ ims.SendAfter(ims.PID(), generate{id: ims.loopID}, inspectMetaStatePeriod)
case requestInspect:
response := ResponseInspectMetaState{
@@ -74,23 +89,6 @@ func (ims *meta_state) HandleMessage(from gen.PID, message any) error {
ims.SendResponse(m.pid, m.ref, response)
ims.Log().Debug("sent response for the inspect meta state %s request to: %s", ims.meta, m.pid)
- case register:
- eopts := gen.EventOptions{
- Notify: true,
- Buffer: 1, // keep the last event
- }
- evname := gen.Atom(fmt.Sprintf("%s_%s", inspectMetaState, ims.meta))
- token, err := ims.RegisterEvent(evname, eopts)
- if err != nil {
- ims.Log().Error("unable to register meta state event: %s", err)
- return err
- }
- ims.Log().Info("registered event %s", evname)
- ims.event = evname
-
- ims.token = token
- ims.SendAfter(ims.PID(), shutdown{}, inspectMetaStateIdlePeriod)
-
case shutdown:
if ims.generating {
ims.Log().Debug("ignore shutdown. generating is active")
@@ -100,7 +98,8 @@ func (ims *meta_state) HandleMessage(from gen.PID, message any) error {
case gen.MessageEventStart: // got first subscriber
ims.Log().Debug("got first subscriber. start generating events...")
- ims.Send(ims.PID(), generate{})
+ ims.loopID++
+ ims.Send(ims.PID(), generate{id: ims.loopID})
ims.generating = true
case gen.MessageEventStop: // no subscribers
diff --git a/app/system/inspect/network.go b/app/system/inspect/network.go
index 4a054662d..919766ef3 100644
--- a/app/system/inspect/network.go
+++ b/app/system/inspect/network.go
@@ -15,20 +15,33 @@ type network struct {
token gen.Ref
generating bool
+ loopID uint64
}
func (in *network) Init(args ...any) error {
in.Log().SetLogger("default")
in.Log().Debug("network inspector started")
- // RegisterEvent is not allowed here
- in.Send(in.PID(), register{})
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1, // keep the last event
+ }
+ token, err := in.RegisterEvent(inspectNetwork, eopts)
+ if err != nil {
+ in.Log().Error("unable to register network event: %s", err)
+ return err
+ }
+ in.Log().Info("registered event %s", inspectNetwork)
+ in.token = token
+ in.SendAfter(in.PID(), shutdown{}, inspectNetworkIdlePeriod)
+
return nil
}
func (in *network) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case generate:
- if in.generating == false {
+ if m.id != in.loopID || in.generating == false {
in.Log().Debug("generating canceled")
break // cancelled
}
@@ -47,7 +60,7 @@ func (in *network) HandleMessage(from gen.PID, message any) error {
return gen.TerminateReasonNormal
}
- in.SendAfter(in.PID(), generate{}, inspectNetworkPeriod)
+ in.SendAfter(in.PID(), generate{id: in.loopID}, inspectNetworkPeriod)
case requestInspect:
info, err := in.Node().Network().Info()
@@ -62,21 +75,6 @@ func (in *network) HandleMessage(from gen.PID, message any) error {
in.SendResponse(m.pid, m.ref, response)
in.Log().Debug("sent response for the inspect network request to: %s", m.pid)
- case register:
- eopts := gen.EventOptions{
- Notify: true,
- Buffer: 1, // keep the last event
- }
- token, err := in.RegisterEvent(inspectNetwork, eopts)
- if err != nil {
- in.Log().Error("unable to register network event: %s", err)
- return err
- }
- in.Log().Info("registered event %s", inspectNetwork)
-
- in.token = token
- in.SendAfter(in.PID(), shutdown{}, inspectNetworkIdlePeriod)
-
case shutdown:
if in.generating {
in.Log().Debug("ignore shutdown. generating is active")
@@ -86,7 +84,8 @@ func (in *network) HandleMessage(from gen.PID, message any) error {
case gen.MessageEventStart: // got first subscriber
in.Log().Debug("got first subscriber. start generating events...")
- in.Send(in.PID(), generate{})
+ in.loopID++
+ in.Send(in.PID(), generate{id: in.loopID})
in.generating = true
case gen.MessageEventStop: // no subscribers
diff --git a/app/system/inspect/node.go b/app/system/inspect/node.go
index e7c275fc7..a41698091 100644
--- a/app/system/inspect/node.go
+++ b/app/system/inspect/node.go
@@ -5,6 +5,7 @@ import (
"fmt"
"runtime"
"slices"
+ "time"
"ergo.services/ergo/act"
"ergo.services/ergo/gen"
@@ -19,20 +20,33 @@ type node struct {
token gen.Ref
generating bool
+ loopID uint64
}
func (in *node) Init(args ...any) error {
in.Log().SetLogger("default")
in.Log().Debug("node inspector started")
- // RegisterEvent is not allowed here
- in.Send(in.PID(), register{})
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1, // keep the last event
+ }
+ token, err := in.RegisterEvent(inspectNode, eopts)
+ if err != nil {
+ in.Log().Error("unable to register event: %s", err)
+ return err
+ }
+ in.Log().Info("registered event %s", inspectNode)
+ in.token = token
+ in.SendAfter(in.PID(), shutdown{}, inspectNodeIdlePeriod)
+
return nil
}
func (in *node) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case generate:
- if in.generating == false {
+ if m.id != in.loopID || in.generating == false {
in.Log().Debug("generating canceled")
break // cancelled
}
@@ -60,7 +74,7 @@ func (in *node) HandleMessage(from gen.PID, message any) error {
return gen.TerminateReasonNormal
}
- in.SendAfter(in.PID(), generate{}, inspectNodePeriod)
+ in.SendAfter(in.PID(), generate{id: in.loopID}, inspectNodePeriod)
case requestInspect:
response := ResponseInspectNode{
@@ -69,9 +83,19 @@ func (in *node) HandleMessage(from gen.PID, message any) error {
Node: in.Node().Name(),
},
- Arch: runtime.GOARCH,
- OS: runtime.GOOS,
- Cores: runtime.NumCPU(),
+ Arch: runtime.GOARCH,
+ OS: runtime.GOOS,
+ Cores: runtime.NumCPU(),
+ GoVersion: runtime.Version(),
+ Timezone: func() string {
+ now := time.Now()
+ name, _ := now.Zone()
+ loc := now.Location().String()
+ if loc == "Local" {
+ return name // e.g. "MSK", "CET"
+ }
+ return loc // e.g. "Europe/Moscow"
+ }(),
Version: in.Node().Version(),
Creation: in.Node().Creation(),
CRC32: in.Node().Name().CRC32(),
@@ -79,21 +103,6 @@ func (in *node) HandleMessage(from gen.PID, message any) error {
in.SendResponse(m.pid, m.ref, response)
in.Log().Debug("sent response for the inspect node request to: %s", m.pid)
- case register:
- eopts := gen.EventOptions{
- Notify: true,
- Buffer: 1, // keep the last event
- }
- token, err := in.RegisterEvent(inspectNode, eopts)
- if err != nil {
- in.Log().Error("unable to register event: %s", err)
- return err
- }
- in.Log().Info("registered event %s", inspectNode)
-
- in.token = token
- in.SendAfter(in.PID(), shutdown{}, inspectNodeIdlePeriod)
-
case shutdown:
if in.generating {
in.Log().Debug("ignore shutdown. generating is active")
@@ -103,14 +112,14 @@ func (in *node) HandleMessage(from gen.PID, message any) error {
case gen.MessageEventStart: // got first subscriber
in.Log().Debug("got first subscriber. start generating events...")
- in.Send(in.PID(), generate{})
+ in.loopID++
+ in.Send(in.PID(), generate{id: in.loopID})
in.generating = true
case gen.MessageEventStop: // no subscribers
in.Log().Debug("no subscribers. stop generating")
if in.generating {
in.generating = false
- // wait 10 seconds and terminate this process
in.SendAfter(in.PID(), shutdown{}, inspectNodeIdlePeriod)
}
diff --git a/app/system/inspect/process.go b/app/system/inspect/process.go
index 9b94092b8..c0fdb524b 100644
--- a/app/system/inspect/process.go
+++ b/app/system/inspect/process.go
@@ -18,29 +18,47 @@ type process struct {
event gen.Atom
pid gen.PID
generating bool
+ loopID uint64
}
func (ip *process) Init(args ...any) error {
ip.pid = args[0].(gen.PID)
ip.Log().SetLogger("default")
ip.Log().Debug("process inspector started. pid %s", ip.pid)
- // RegisterEvent is not allowed here
- ip.Send(ip.PID(), register{})
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1, // keep the last event
+ }
+ evname := gen.Atom(fmt.Sprintf("%s_%s", inspectProcess, ip.pid))
+ token, err := ip.RegisterEvent(evname, eopts)
+ if err != nil {
+ ip.Log().Error("unable to register event: %s", err)
+ return err
+ }
+ ip.Log().Info("registered event %s", evname)
+ ip.event = evname
+ ip.token = token
+ ip.SendAfter(ip.PID(), shutdown{}, inspectProcessIdlePeriod)
+
return nil
}
func (ip *process) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case generate:
- if ip.generating == false {
+ if m.id != ip.loopID || ip.generating == false {
ip.Log().Debug("generating canceled")
break // cancelled
}
ip.Log().Debug("generating event")
ev := MessageInspectProcess{
- Node: ip.Node().Name(),
- Terminated: true,
+ Node: ip.Node().Name(),
+ Info: gen.ProcessInfo{
+ PID: ip.pid,
+ State: gen.ProcessStateTerminated,
+ },
}
info, err := ip.Node().ProcessInfo(ip.pid)
@@ -59,7 +77,7 @@ func (ip *process) HandleMessage(from gen.PID, message any) error {
default:
ip.Log().Error("unable to inspect process %s: %s", ip.pid, err)
// will try next time (seems to be busy)
- ip.SendAfter(ip.PID(), generate{}, inspectProcessPeriod)
+ ip.SendAfter(ip.PID(), generate{id: ip.loopID}, inspectProcessPeriod)
return nil
}
@@ -67,7 +85,6 @@ func (ip *process) HandleMessage(from gen.PID, message any) error {
info.Env[k] = fmt.Sprintf("%#v", v)
}
- ev.Terminated = false
ev.Info = info
if err := ip.SendEvent(ip.event, ip.token, ev); err != nil {
@@ -75,7 +92,7 @@ func (ip *process) HandleMessage(from gen.PID, message any) error {
return gen.TerminateReasonNormal
}
- ip.SendAfter(ip.PID(), generate{}, inspectProcessPeriod)
+ ip.SendAfter(ip.PID(), generate{id: ip.loopID}, inspectProcessPeriod)
case requestInspect:
response := ResponseInspectProcess{
@@ -87,23 +104,6 @@ func (ip *process) HandleMessage(from gen.PID, message any) error {
ip.SendResponse(m.pid, m.ref, response)
ip.Log().Debug("sent response for the inspect process request to: %s", m.pid)
- case register:
- eopts := gen.EventOptions{
- Notify: true,
- Buffer: 1, // keep the last event
- }
- evname := gen.Atom(fmt.Sprintf("%s_%s", inspectProcess, ip.pid))
- token, err := ip.RegisterEvent(evname, eopts)
- if err != nil {
- ip.Log().Error("unable to register event: %s", err)
- return err
- }
- ip.Log().Info("registered event %s", evname)
- ip.event = evname
-
- ip.token = token
- ip.SendAfter(ip.PID(), shutdown{}, inspectProcessIdlePeriod)
-
case shutdown:
if ip.generating {
ip.Log().Debug("ignore shutdown. generating is active")
@@ -113,7 +113,8 @@ func (ip *process) HandleMessage(from gen.PID, message any) error {
case gen.MessageEventStart: // got first subscriber
ip.Log().Debug("got first subscriber. start generating events...")
- ip.Send(ip.PID(), generate{})
+ ip.loopID++
+ ip.Send(ip.PID(), generate{id: ip.loopID})
ip.generating = true
case gen.MessageEventStop: // no subscribers
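The same loop-id guard appears in every inspector above: `gen.MessageEventStart` bumps `loopID`, each scheduled `generate` tick carries the id it was issued under, and a tick whose id no longer matches is dropped. This cancels timers left over from a previous subscriber cycle without tracking cancel functions. A minimal sketch of the generation-counter pattern (plain structs with illustrative names, not the ergo actor API):

```go
package main

import "fmt"

// ticker demonstrates the generation-counter pattern: each start bumps
// loopID; queued ticks carry the id they were scheduled with and are
// dropped if stale.
type ticker struct {
	loopID     uint64
	generating bool
	handled    int
}

type tick struct{ id uint64 }

func (t *ticker) start() tick {
	t.loopID++
	t.generating = true
	return tick{id: t.loopID}
}

func (t *ticker) stop() { t.generating = false }

// handle reports whether the tick was processed.
func (t *ticker) handle(m tick) bool {
	if m.id != t.loopID || !t.generating {
		return false // stale or cancelled: ignore
	}
	t.handled++
	return true
}

func main() {
	var t ticker
	old := t.start() // first subscriber arrives
	t.stop()         // subscribers gone
	fresh := t.start()
	fmt.Println(t.handle(old), t.handle(fresh)) // false true
}
```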
diff --git a/app/system/inspect/process_list.go b/app/system/inspect/process_list.go
index 9380337bd..e8c2d4c67 100644
--- a/app/system/inspect/process_list.go
+++ b/app/system/inspect/process_list.go
@@ -3,6 +3,7 @@ package inspect
import (
"fmt"
"slices"
+ "strings"
"ergo.services/ergo/act"
"ergo.services/ergo/gen"
@@ -16,31 +17,66 @@ type process_list struct {
act.Actor
token gen.Ref
- start int
- limit int
+ start int
+ limit int
+ name string
+ behavior string
+ application string
+ state string
+ minMailbox uint64
+
generating bool
+ loopID uint64
event gen.Atom
}
func (ipl *process_list) Init(args ...any) error {
ipl.start = args[0].(int)
ipl.limit = args[1].(int)
+ ipl.name = args[2].(string)
+ ipl.behavior = args[3].(string)
+ ipl.application = args[4].(string)
+ ipl.state = args[5].(string)
+ ipl.minMailbox = args[6].(uint64)
+
ipl.Log().SetLogger("default")
ipl.Log().Debug("process list inspector started. %d...%d", ipl.start, ipl.start+ipl.limit-1)
- // RegisterEvent is not allowed here
- ipl.Send(ipl.PID(), register{})
+ ipl.SetCompression(true)
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1,
+ }
+ hash := filterHash(ipl.name, ipl.behavior, ipl.application, ipl.state, ipl.minMailbox, ipl.limit)
+ evname := gen.Atom(fmt.Sprintf("%s_%d_%s", inspectProcessList, ipl.start, hash))
+ token, err := ipl.RegisterEvent(evname, eopts)
+ if err != nil {
+ ipl.Log().Error("unable to register event: %s", err)
+ return err
+ }
+ ipl.Log().Info("registered event %s", evname)
+ ipl.event = evname
+ ipl.token = token
+ ipl.SendAfter(ipl.PID(), shutdown{}, inspectProcessListIdlePeriod)
+
return nil
}
+
func (ipl *process_list) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case generate:
- if ipl.generating == false {
+ if m.id != ipl.loopID || ipl.generating == false {
ipl.Log().Debug("generating canceled")
- break // cancelled
+ break
}
ipl.Log().Debug("generating event")
- list, err := ipl.Node().ProcessListShortInfo(ipl.start, ipl.limit)
+ var filter []func(gen.ProcessShortInfo) bool
+ if ipl.hasFilters() {
+ filter = append(filter, ipl.matchFilter)
+ }
+
+ list, err := ipl.Node().ProcessListShortInfo(ipl.start, ipl.limit, filter...)
if err != nil {
return err
}
@@ -59,7 +95,7 @@ func (ipl *process_list) HandleMessage(from gen.PID, message any) error {
return gen.TerminateReasonNormal
}
- ipl.SendAfter(ipl.PID(), generate{}, inspectProcessListPeriod)
+ ipl.SendAfter(ipl.PID(), generate{id: ipl.loopID}, inspectProcessListPeriod)
case requestInspect:
response := ResponseInspectProcessList{
@@ -71,36 +107,20 @@ func (ipl *process_list) HandleMessage(from gen.PID, message any) error {
ipl.SendResponse(m.pid, m.ref, response)
ipl.Log().Debug("sent response for the inspect process list request to: %s", m.pid)
- case register:
- eopts := gen.EventOptions{
- Notify: true,
- Buffer: 1, // keep the last event
- }
- evname := gen.Atom(fmt.Sprintf("%s_%d_%d", inspectProcessList, ipl.start, ipl.start+ipl.limit-1))
- token, err := ipl.RegisterEvent(evname, eopts)
- if err != nil {
- ipl.Log().Error("unable to register event: %s", err)
- return err
- }
- ipl.Log().Info("registered event %s", evname)
- ipl.event = evname
-
- ipl.token = token
- ipl.SendAfter(ipl.PID(), shutdown{}, inspectProcessListIdlePeriod)
-
case shutdown:
if ipl.generating {
ipl.Log().Debug("ignore shutdown. generating is active")
- break // ignore.
+ break
}
return gen.TerminateReasonNormal
- case gen.MessageEventStart: // got first subscriber
+ case gen.MessageEventStart:
ipl.Log().Debug("got first subscriber. start generating events...")
- ipl.Send(ipl.PID(), generate{})
+ ipl.loopID++
+ ipl.Send(ipl.PID(), generate{id: ipl.loopID})
ipl.generating = true
- case gen.MessageEventStop: // no subscribers
+ case gen.MessageEventStop:
ipl.Log().Debug("no subscribers. stop generating")
if ipl.generating {
ipl.generating = false
@@ -117,3 +137,26 @@ func (ipl *process_list) HandleMessage(from gen.PID, message any) error {
func (ipl *process_list) Terminate(reason error) {
ipl.Log().Debug("process list inspector terminated: %s", reason)
}
+
+func (ipl *process_list) hasFilters() bool {
+ return ipl.name != "" || ipl.behavior != "" || ipl.application != "" || ipl.state != "" || ipl.minMailbox > 0
+}
+
+func (ipl *process_list) matchFilter(info gen.ProcessShortInfo) bool {
+ if ipl.name != "" && strings.Contains(strings.ToLower(string(info.Name)), strings.ToLower(ipl.name)) == false {
+ return false
+ }
+ if ipl.behavior != "" && strings.Contains(strings.ToLower(info.Behavior), strings.ToLower(ipl.behavior)) == false {
+ return false
+ }
+ if ipl.application != "" && strings.Contains(strings.ToLower(string(info.Application)), strings.ToLower(ipl.application)) == false {
+ return false
+ }
+ if ipl.state != "" && strings.EqualFold(info.State.String(), ipl.state) == false {
+ return false
+ }
+ if ipl.minMailbox > 0 && info.MessagesMailbox < ipl.minMailbox {
+ return false
+ }
+ return true
+}
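`matchFilter` above combines the filters with AND semantics: substring matches are case-insensitive, state is an exact case-insensitive match, and an empty filter field matches everything. A reduced sketch of that predicate with hypothetical field names (the real code operates on `gen.ProcessShortInfo`):

```go
package main

import (
	"fmt"
	"strings"
)

// proc stands in for gen.ProcessShortInfo; fields are illustrative.
type proc struct{ name, behavior, state string }

// match applies AND-combined filters: case-insensitive substring for
// name/behavior, case-insensitive equality for state; empty filter
// fields match everything.
func match(p proc, name, behavior, state string) bool {
	if name != "" && !strings.Contains(strings.ToLower(p.name), strings.ToLower(name)) {
		return false
	}
	if behavior != "" && !strings.Contains(strings.ToLower(p.behavior), strings.ToLower(behavior)) {
		return false
	}
	if state != "" && !strings.EqualFold(p.state, state) {
		return false
	}
	return true
}

func main() {
	p := proc{name: "web_worker_1", behavior: "act.Actor", state: "Running"}
	fmt.Println(match(p, "WORKER", "actor", "running")) // true
	fmt.Println(match(p, "worker", "", "sleeping"))     // false
}
```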
diff --git a/app/system/inspect/process_range.go b/app/system/inspect/process_range.go
new file mode 100644
index 000000000..9fbe27bac
--- /dev/null
+++ b/app/system/inspect/process_range.go
@@ -0,0 +1,191 @@
+package inspect
+
+import (
+ "fmt"
+ "slices"
+ "strings"
+
+ "ergo.services/ergo/act"
+ "ergo.services/ergo/gen"
+)
+
+func factory_process_range() gen.ProcessBehavior {
+ return &process_range{}
+}
+
+type process_range struct {
+ act.Actor
+ token gen.Ref
+
+ name string
+ behavior string
+ application string
+ state string
+ minMailbox uint64
+ limit int
+ hash string
+
+ generating bool
+ loopID uint64
+ event gen.Atom
+}
+
+func (ipr *process_range) Init(args ...any) error {
+ ipr.name = args[0].(string)
+ ipr.behavior = args[1].(string)
+ ipr.application = args[2].(string)
+ ipr.state = args[3].(string)
+ ipr.minMailbox = args[4].(uint64)
+ ipr.limit = args[5].(int)
+ ipr.hash = args[6].(string)
+
+ ipr.Log().SetLogger("default")
+ ipr.Log().Debug("process range inspector started. name=%q behavior=%q app=%q state=%q mailbox>=%d limit=%d",
+ ipr.name, ipr.behavior, ipr.application, ipr.state, ipr.minMailbox, ipr.limit)
+ ipr.SetCompression(true)
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1,
+ }
+ ipr.event = gen.Atom(fmt.Sprintf("%s_%s", inspectProcessRange, ipr.hash))
+ token, err := ipr.RegisterEvent(ipr.event, eopts)
+ if err != nil {
+ ipr.Log().Error("unable to register event: %s", err)
+ return err
+ }
+ ipr.Log().Info("registered event %s", ipr.event)
+ ipr.token = token
+ ipr.SendAfter(ipr.PID(), shutdown{}, inspectProcessRangeIdlePeriod)
+
+ return nil
+}
+
+func (ipr *process_range) HandleMessage(from gen.PID, message any) error {
+ switch m := message.(type) {
+ case generate:
+ if m.id != ipr.loopID || ipr.generating == false {
+ break
+ }
+
+ var list []gen.ProcessShortInfo
+ nameLower := strings.ToLower(ipr.name)
+ behaviorLower := strings.ToLower(ipr.behavior)
+ appLower := strings.ToLower(ipr.application)
+
+ ipr.Node().ProcessRangeShortInfo(func(info gen.ProcessShortInfo) bool {
+ // apply filters
+ if nameLower != "" {
+ if strings.Contains(strings.ToLower(string(info.Name)), nameLower) == false {
+ return true // skip, continue
+ }
+ }
+ if behaviorLower != "" {
+ if strings.Contains(strings.ToLower(info.Behavior), behaviorLower) == false {
+ return true
+ }
+ }
+ if appLower != "" {
+ if strings.Contains(strings.ToLower(string(info.Application)), appLower) == false {
+ return true
+ }
+ }
+ if ipr.state != "" {
+ if strings.EqualFold(info.State.String(), ipr.state) == false {
+ return true
+ }
+ }
+ if ipr.minMailbox > 0 {
+ if info.MessagesMailbox < ipr.minMailbox {
+ return true
+ }
+ }
+
+ list = append(list, info)
+
+ if ipr.limit > 0 && len(list) >= ipr.limit {
+ return false // stop iteration
+ }
+ return true
+ })
+
+	slices.SortStableFunc(list, func(a, b gen.ProcessShortInfo) int {
+		// compare explicitly instead of subtracting: a uint64
+		// difference cast to int can wrap for large IDs
+		switch {
+		case a.PID.ID < b.PID.ID:
+			return -1
+		case a.PID.ID > b.PID.ID:
+			return 1
+		}
+		return 0
+	})
+
+ // reuse MessageInspectProcessList, same payload format
+ ev := MessageInspectProcessList{
+ Node: ipr.Node().Name(),
+ Processes: list,
+ }
+
+ if err := ipr.SendEvent(ipr.event, ipr.token, ev); err != nil {
+ ipr.Log().Error("unable to send event %q: %s", ipr.event, err)
+ return gen.TerminateReasonNormal
+ }
+
+ ipr.SendAfter(ipr.PID(), generate{id: ipr.loopID}, inspectProcessRangePeriod)
+
+ case requestInspect:
+ response := ResponseInspectProcessRange{
+ Event: gen.Event{
+ Name: ipr.event,
+ Node: ipr.Node().Name(),
+ },
+ }
+ ipr.SendResponse(m.pid, m.ref, response)
+
+ case shutdown:
+ if ipr.generating {
+ break
+ }
+ return gen.TerminateReasonNormal
+
+ case gen.MessageEventStart:
+ ipr.Log().Debug("got first subscriber. start generating events...")
+ ipr.loopID++
+ ipr.Send(ipr.PID(), generate{id: ipr.loopID})
+ ipr.generating = true
+
+ case gen.MessageEventStop:
+ ipr.Log().Debug("no subscribers. stop generating")
+ if ipr.generating {
+ ipr.generating = false
+ ipr.SendAfter(ipr.PID(), shutdown{}, inspectProcessRangeIdlePeriod)
+ }
+
+ default:
+ ipr.Log().Error("unknown message (ignored) %#v", message)
+ }
+
+ return nil
+}
+
+func (ipr *process_range) Terminate(reason error) {
+ ipr.Log().Debug("process range inspector terminated: %s", reason)
+}
+
+// filterHash builds a short deterministic suffix from filter fields
+func filterHash(name, behavior, application, state string, minMailbox uint64, limit int) string {
+ return fmt.Sprintf("%x", hashStr(fmt.Sprintf("%s|%s|%s|%s|%d|%d",
+ name, behavior, application, state, minMailbox, limit)))
+}
+
+// eventListHash builds a short deterministic suffix from event list filter fields
+func eventListHash(timestamp int64, name string, notify, buffered, open int, minSubscribers int64, limit int) string {
+ return fmt.Sprintf("%x", hashStr(fmt.Sprintf("%d|%s|%d|%d|%d|%d|%d",
+ timestamp, name, notify, buffered, open, minSubscribers, limit)))
+}
+
+func connectionListHash(name string, limit int) string {
+ return fmt.Sprintf("%x", hashStr(fmt.Sprintf("%s|%d", name, limit)))
+}
+
+func hashStr(s string) uint32 {
+ h := uint32(2166136261)
+ for i := 0; i < len(s); i++ {
+ h ^= uint32(s[i])
+ h *= 16777619
+ }
+ return h
+}
diff --git a/app/system/inspect/process_state.go b/app/system/inspect/process_state.go
index 4c3badad9..655ac511b 100644
--- a/app/system/inspect/process_state.go
+++ b/app/system/inspect/process_state.go
@@ -17,6 +17,7 @@ type process_state struct {
event gen.Atom
generating bool
+ loopID uint64
pid gen.PID
}
@@ -24,15 +25,29 @@ func (ips *process_state) Init(args ...any) error {
ips.pid = args[0].(gen.PID)
ips.Log().SetLogger("default")
ips.Log().Debug("process state inspector started. pid %s", ips.pid)
- // RegisterEvent is not allowed here
- ips.Send(ips.PID(), register{})
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ Buffer: 1, // keep the last event
+ }
+ evname := gen.Atom(fmt.Sprintf("%s_%s", inspectProcessState, ips.pid))
+ token, err := ips.RegisterEvent(evname, eopts)
+ if err != nil {
+ ips.Log().Error("unable to register process state event: %s", err)
+ return err
+ }
+ ips.Log().Info("registered event %s", evname)
+ ips.event = evname
+ ips.token = token
+ ips.SendAfter(ips.PID(), shutdown{}, inspectProcessStateIdlePeriod)
+
return nil
}
func (ips *process_state) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case generate:
- if ips.generating == false {
+ if m.id != ips.loopID || ips.generating == false {
ips.Log().Debug("generating canceled")
break // cancelled
}
@@ -44,7 +59,7 @@ func (ips *process_state) HandleMessage(from gen.PID, message any) error {
}
ips.Log().Error("unable to inspect process state %s: %s", ips.pid, err)
// will try next time
- ips.SendAfter(ips.PID(), generate{}, inspectProcessStatePeriod)
+ ips.SendAfter(ips.PID(), generate{id: ips.loopID}, inspectProcessStatePeriod)
return nil
}
@@ -59,7 +74,7 @@ func (ips *process_state) HandleMessage(from gen.PID, message any) error {
return gen.TerminateReasonNormal
}
- ips.SendAfter(ips.PID(), generate{}, inspectProcessStatePeriod)
+ ips.SendAfter(ips.PID(), generate{id: ips.loopID}, inspectProcessStatePeriod)
case requestInspect:
response := ResponseInspectProcessState{
@@ -71,23 +86,6 @@ func (ips *process_state) HandleMessage(from gen.PID, message any) error {
ips.SendResponse(m.pid, m.ref, response)
ips.Log().Debug("sent response for the inspect process state %s request to: %s", ips.pid, m.pid)
- case register:
- eopts := gen.EventOptions{
- Notify: true,
- Buffer: 1, // keep the last event
- }
- evname := gen.Atom(fmt.Sprintf("%s_%s", inspectProcessState, ips.pid))
- token, err := ips.RegisterEvent(evname, eopts)
- if err != nil {
- ips.Log().Error("unable to register process state event: %s", err)
- return err
- }
- ips.Log().Info("registered event %s", evname)
- ips.event = evname
-
- ips.token = token
- ips.SendAfter(ips.PID(), shutdown{}, inspectProcessStateIdlePeriod)
-
case shutdown:
if ips.generating {
ips.Log().Debug("ignore shutdown. generating is active")
@@ -97,7 +95,8 @@ func (ips *process_state) HandleMessage(from gen.PID, message any) error {
case gen.MessageEventStart: // got first subscriber
ips.Log().Debug("got first subscriber. start generating events...")
- ips.Send(ips.PID(), generate{})
+ ips.loopID++
+ ips.Send(ips.PID(), generate{id: ips.loopID})
ips.generating = true
case gen.MessageEventStop: // no subscribers
diff --git a/app/system/inspect/tracing.go b/app/system/inspect/tracing.go
new file mode 100644
index 000000000..085c9e516
--- /dev/null
+++ b/app/system/inspect/tracing.go
@@ -0,0 +1,189 @@
+package inspect
+
+import (
+ "fmt"
+ "strings"
+ "time"
+
+ "ergo.services/ergo/act"
+ "ergo.services/ergo/gen"
+)
+
+func factory_tracing() gen.ProcessBehavior {
+ return &tracing{}
+}
+
+type tracing struct {
+ act.Actor
+ token gen.Ref
+ event gen.Atom
+
+ flags gen.TracingFlags
+ limit int
+ kinds uint32
+ points uint32
+ messagePattern string
+ messageExclude bool
+ generating bool
+ loopID uint64
+
+ // ring buffer
+ ring []gen.TracingSpan
+ pos int
+ full bool
+ received int64
+}
+
+type flushTracing struct{ id uint64 }
+
+const tracingFlushInterval = time.Second
+const inspectTracingIdlePeriod = 10 * time.Second
+
+func (it *tracing) Init(args ...any) error {
+ it.flags = args[0].(gen.TracingFlags)
+ it.limit = args[1].(int)
+ if it.limit < 1 {
+ it.limit = 500
+ }
+ it.kinds = args[2].(uint32)
+ it.points = args[3].(uint32)
+ it.messagePattern = args[4].(string)
+ it.messageExclude = args[5].(bool)
+ it.ring = make([]gen.TracingSpan, it.limit)
+ it.Log().Debug("tracing inspector started (limit: %d)", it.limit)
+ it.SetCompression(true)
+
+ eopts := gen.EventOptions{
+ Notify: true,
+ }
+ evname := gen.Atom(fmt.Sprintf("%s_%s", string(it.Name()), it.PID()))
+ token, err := it.RegisterEvent(evname, eopts)
+ if err != nil {
+ return err
+ }
+
+ it.event = evname
+ it.token = token
+ it.SendAfter(it.PID(), shutdown{}, inspectTracingIdlePeriod)
+
+ return nil
+}
+
+func (it *tracing) HandleMessage(from gen.PID, message any) error {
+ switch m := message.(type) {
+ case flushTracing:
+ if m.id != it.loopID || it.generating == false {
+ break
+ }
+ if it.received == 0 {
+ it.SendAfter(it.PID(), flushTracing{id: it.loopID}, tracingFlushInterval)
+ break
+ }
+
+ var spans []gen.TracingSpan
+ if it.full {
+ spans = make([]gen.TracingSpan, it.limit)
+ copy(spans, it.ring[it.pos:])
+ copy(spans[it.limit-it.pos:], it.ring[:it.pos])
+ } else {
+ spans = make([]gen.TracingSpan, it.pos)
+ copy(spans, it.ring[:it.pos])
+ }
+
+ suppressed := it.received - int64(len(spans))
+ if suppressed < 0 {
+ suppressed = 0
+ }
+
+ ev := MessageInspectTracing{
+ Node: it.Node().Name(),
+ Spans: spans,
+ Suppressed: suppressed,
+ }
+
+ it.pos = 0
+ it.full = false
+ it.received = 0
+
+ if err := it.SendEvent(it.event, it.token, ev); err != nil {
+ return gen.TerminateReasonNormal
+ }
+
+ it.SendAfter(it.PID(), flushTracing{id: it.loopID}, tracingFlushInterval)
+
+ case requestInspect:
+ response := ResponseInspectTracing{
+ Event: gen.Event{
+ Name: it.event,
+ Node: it.Node().Name(),
+ },
+ }
+ it.SendResponse(m.pid, m.ref, response)
+
+ case shutdown:
+ if it.generating {
+ break
+ }
+ return gen.TerminateReasonNormal
+
+ case gen.MessageEventStart:
+ it.Log().Debug("registering as tracing exporter")
+ it.Node().TracingExporterAddPID(it.PID(), it.PID().String(), it.flags)
+ it.loopID++
+ it.generating = true
+ it.SendAfter(it.PID(), flushTracing{id: it.loopID}, tracingFlushInterval)
+
+ case gen.MessageEventStop:
+ it.Node().TracingExporterDeletePID(it.PID())
+ it.Log().Debug("removed as tracing exporter")
+ it.generating = false
+ it.pos = 0
+ it.full = false
+ it.received = 0
+ it.SendAfter(it.PID(), shutdown{}, inspectTracingIdlePeriod)
+ }
+
+ return nil
+}
+
+func (it *tracing) HandleSpan(span gen.TracingSpan) error {
+ // kind filter: bitmask 1=send, 2=request, 4=response, 8=spawn, 16=terminate
+ if it.kinds != 0 && it.kinds != 31 {
+ kindBit := uint32(1) << (uint32(span.Kind) - 1)
+ if it.kinds&kindBit == 0 {
+ return nil
+ }
+ }
+
+ // point filter: bitmask 1=sent, 2=delivered, 4=processed
+ if it.points != 0 && it.points != 7 {
+ pointBit := uint32(1) << (uint32(span.Point) - 1)
+ if it.points&pointBit == 0 {
+ return nil
+ }
+ }
+
+ if it.messagePattern != "" {
+ match := strings.Contains(span.Message, it.messagePattern) ||
+ strings.Contains(span.Error, it.messagePattern)
+ if it.messageExclude == true && match == true {
+ return nil
+ }
+ if it.messageExclude == false && match == false {
+ return nil
+ }
+ }
+
+ it.ring[it.pos] = span
+ it.pos++
+ if it.pos >= it.limit {
+ it.pos = 0
+ it.full = true
+ }
+ it.received++
+ return nil
+}
+
+func (it *tracing) Terminate(reason error) {
+ it.Node().TracingExporterDeletePID(it.PID())
+}
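The flush path above drains the span ring buffer oldest-first: once the ring has wrapped (`full == true`), the oldest entry sits at `pos`, so the drain copies `ring[pos:]` followed by `ring[:pos]`. A self-contained sketch of that copy order on a plain `int` ring (names are illustrative):

```go
package main

import "fmt"

// ring is a fixed-size overwrite-on-wrap buffer; once full, the
// oldest element lives at pos.
type ring struct {
	buf  []int
	pos  int
	full bool
}

func (r *ring) push(v int) {
	r.buf[r.pos] = v
	r.pos++
	if r.pos >= len(r.buf) {
		r.pos = 0
		r.full = true
	}
}

// drain returns the retained elements oldest-first.
func (r *ring) drain() []int {
	if !r.full {
		out := make([]int, r.pos)
		copy(out, r.buf[:r.pos])
		return out
	}
	out := make([]int, len(r.buf))
	n := copy(out, r.buf[r.pos:]) // oldest segment first
	copy(out[n:], r.buf[:r.pos])  // then the wrapped segment
	return out
}

func main() {
	r := ring{buf: make([]int, 3)}
	for v := 1; v <= 5; v++ { // 4 and 5 overwrite 1 and 2
		r.push(v)
	}
	fmt.Println(r.drain()) // [3 4 5]
}
```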
diff --git a/app/system/metrics.go b/app/system/metrics.go
deleted file mode 100644
index 9736367d2..000000000
--- a/app/system/metrics.go
+++ /dev/null
@@ -1,205 +0,0 @@
-package system
-
-import (
- "bytes"
- "crypto/aes"
- "crypto/cipher"
- "crypto/rand"
- "crypto/rsa"
- "crypto/sha256"
- "crypto/x509"
- "encoding/base64"
- "encoding/binary"
- "fmt"
- "io"
- "net"
- "runtime"
- "strconv"
- "strings"
- "time"
-
- "ergo.services/ergo/act"
- "ergo.services/ergo/gen"
- "ergo.services/ergo/lib"
- "ergo.services/ergo/net/edf"
-)
-
-const (
- period time.Duration = time.Second * 300
-
- DISABLE_METRICS gen.Env = "disable_metrics"
-)
-
-type MessageMetrics struct {
- Name gen.Atom
- Creation int64
- Uptime int64
- Arch string
- OS string
- NumCPU int
- GoVersion string
- Version string
- ErgoVersion string
- Commercial string
-}
-
-func factory_metrics() gen.ProcessBehavior {
- return &metrics{}
-}
-
-type doSendMetrics struct{}
-
-type metrics struct {
- act.Actor
- cancelSend gen.CancelFunc
- key []byte
- block cipher.Block
-}
-
-func (m *metrics) Init(args ...any) error {
- if err := edf.RegisterTypeOf(MessageMetrics{}); err != nil {
- if err != gen.ErrTaken {
- return err
- }
- }
-
- if _, disabled := m.Env(DISABLE_METRICS); disabled {
- if comm := m.Node().Commercial(); len(comm) == 0 {
- m.Log().Trace("metrics disabled")
- return nil
- }
- m.Log().Trace("a commercial package is used. enforce sending metrics")
- }
-
- m.key = []byte(lib.RandomString(32))
- b, err := aes.NewCipher(m.key)
- if err != nil {
- return nil
- }
- m.block = b
-
- m.Log().Trace("scheduled sending metrics in %v", period)
- m.cancelSend, _ = m.SendAfter(m.PID(), doSendMetrics{}, period)
- return nil
-}
-
-func (m *metrics) HandleMessage(from gen.PID, message any) error {
-
- switch message.(type) {
- case doSendMetrics:
- m.send()
- m.Log().Trace("scheduled sending metrics in %v", period)
- m.cancelSend, _ = m.SendAfter(m.PID(), doSendMetrics{}, period)
-
- default:
- m.Log().Trace("received unknown message: %#v", message)
- }
- return nil
-}
-
-func (m *metrics) Terminate(reason error) {
- if m.cancelSend == nil {
- return
- }
- m.cancelSend()
-}
-
-func (m *metrics) send() {
- var msrv = "metrics.ergo.services"
-
- values, err := net.LookupTXT(msrv)
- if err != nil || len(values) == 0 {
- m.Log().Trace("lookup TXT record in %s failed or returned empty result", msrv)
- return
- }
- v, err := base64.StdEncoding.DecodeString(values[0])
- if err != nil {
- return
- }
-
- pk, err := x509.ParsePKCS1PublicKey([]byte(v))
- if err != nil {
- m.Log().Trace("unable to parse public key (TXT record in %s)", msrv)
- return
- }
-
- _, srv, err := net.LookupSRV("data", "mt1", msrv)
- if err != nil || len(srv) == 0 {
- m.Log().Trace("unable to resolve SRV record: %s", err)
- return
- }
-
- dsn := net.JoinHostPort(strings.TrimSuffix(srv[0].Target, "."),
- strconv.Itoa(int(srv[0].Port)))
- c, err := net.Dial("udp", dsn)
- if err != nil {
- m.Log().Trace("unable to dial the host %s: %s", dsn, err)
- return
- }
- defer c.Close()
-
- msg := MessageMetrics{
- Name: m.Node().Name(),
- Creation: m.Node().Creation(),
- Uptime: m.Node().Uptime(),
- Arch: runtime.GOARCH,
- OS: runtime.GOOS,
- NumCPU: runtime.NumCPU(),
- GoVersion: runtime.Version(),
- Version: m.Node().Version().String(),
- ErgoVersion: m.Node().FrameworkVersion().String(),
- Commercial: fmt.Sprintf("%v", m.Node().Commercial()),
- }
-
- buf := lib.TakeBuffer()
- defer lib.ReleaseBuffer(buf)
-
- hash := sha256.New()
- cipher, err := rsa.EncryptOAEP(hash, rand.Reader, pk, m.key, nil)
- if err != nil {
- m.Log().Trace("unable to encrypt metrics message: %s (len: %d)", err, buf.Len())
- return
- }
-
- // 2 (magic: 1144) + 2 (length) + len(cipher)
- buf.Allocate(4)
- buf.Append(cipher)
- binary.BigEndian.PutUint16(buf.B[0:2], uint16(1144))
- binary.BigEndian.PutUint16(buf.B[2:4], uint16(len(cipher)))
-
- // encrypt payload and append to the buf
- payload := lib.TakeBuffer()
- defer lib.ReleaseBuffer(payload)
- if err := edf.Encode(msg, payload, edf.Options{}); err != nil {
- m.Log().Trace("unable to encode metrics message: %s", err)
- return
- }
-
- x := encrypt(payload.B, m.block)
- if x == nil {
- return
- }
- buf.Append(x)
-
- if _, err := c.Write(buf.B); err != nil {
- m.Log().Trace("unable to send metrics: %s", err)
- }
- m.Log().Trace("sent metrics to %s", dsn)
-}
-
-func encrypt(data []byte, block cipher.Block) []byte {
- l := len(data)
- padding := aes.BlockSize - l%aes.BlockSize
- padtext := bytes.Repeat([]byte{byte(padding)}, padding)
- data = append(data, padtext...)
- l = len(data)
-
- x := make([]byte, aes.BlockSize+l)
- iv := x[:aes.BlockSize]
- if _, err := io.ReadFull(rand.Reader, iv); err != nil {
- return nil
- }
- cfb := cipher.NewCFBEncrypter(block, iv)
- cfb.XORKeyStream(x[aes.BlockSize:], data)
- return x
-}
diff --git a/app/system/sup.go b/app/system/sup.go
index 7254d190b..1fae1de87 100644
--- a/app/system/sup.go
+++ b/app/system/sup.go
@@ -19,10 +19,6 @@ func (s *sup) Init(args ...any) (act.SupervisorSpec, error) {
spec := act.SupervisorSpec{
Type: act.SupervisorTypeOneForOne,
Children: []act.SupervisorChildSpec{
- {
- Factory: factory_metrics,
- Name: "system_metrics",
- },
{
Factory: inspect.Factory,
Name: inspect.Name,
diff --git a/docs/.gitbook/assets/Screenshot from 2023-08-18 16-57-50.png b/docs/.gitbook/assets/Screenshot from 2023-08-18 16-57-50.png
deleted file mode 100644
index 2e61f679f..000000000
Binary files a/docs/.gitbook/assets/Screenshot from 2023-08-18 16-57-50.png and /dev/null differ
diff --git a/docs/.gitbook/assets/Screenshot from 2023-08-30 23-00-56.png b/docs/.gitbook/assets/Screenshot from 2023-08-30 23-00-56.png
deleted file mode 100644
index 3aa8dd240..000000000
Binary files a/docs/.gitbook/assets/Screenshot from 2023-08-30 23-00-56.png and /dev/null differ
diff --git a/docs/.gitbook/assets/Screenshot from 2023-08-30 23-07-52.png b/docs/.gitbook/assets/Screenshot from 2023-08-30 23-07-52.png
deleted file mode 100644
index 478212071..000000000
Binary files a/docs/.gitbook/assets/Screenshot from 2023-08-30 23-07-52.png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (1).png b/docs/.gitbook/assets/image (1).png
deleted file mode 100644
index 113a50770..000000000
Binary files a/docs/.gitbook/assets/image (1).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (10).png b/docs/.gitbook/assets/image (10).png
deleted file mode 100644
index 2e61f679f..000000000
Binary files a/docs/.gitbook/assets/image (10).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (11).png b/docs/.gitbook/assets/image (11).png
deleted file mode 100644
index baeb2a358..000000000
Binary files a/docs/.gitbook/assets/image (11).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (12).png b/docs/.gitbook/assets/image (12).png
deleted file mode 100644
index 292cc7164..000000000
Binary files a/docs/.gitbook/assets/image (12).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (14).png b/docs/.gitbook/assets/image (14).png
deleted file mode 100644
index d33192cf7..000000000
Binary files a/docs/.gitbook/assets/image (14).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (15).png b/docs/.gitbook/assets/image (15).png
deleted file mode 100644
index d33192cf7..000000000
Binary files a/docs/.gitbook/assets/image (15).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (16).png b/docs/.gitbook/assets/image (16).png
deleted file mode 100644
index d33192cf7..000000000
Binary files a/docs/.gitbook/assets/image (16).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (17).png b/docs/.gitbook/assets/image (17).png
deleted file mode 100644
index 6b08bb983..000000000
Binary files a/docs/.gitbook/assets/image (17).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (18).png b/docs/.gitbook/assets/image (18).png
deleted file mode 100644
index 0d9c81ea6..000000000
Binary files a/docs/.gitbook/assets/image (18).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (19).png b/docs/.gitbook/assets/image (19).png
deleted file mode 100644
index 8e28afc0d..000000000
Binary files a/docs/.gitbook/assets/image (19).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (2).png b/docs/.gitbook/assets/image (2).png
deleted file mode 100644
index 6d96f5c9f..000000000
Binary files a/docs/.gitbook/assets/image (2).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (20).png b/docs/.gitbook/assets/image (20).png
deleted file mode 100644
index 1c2c457ce..000000000
Binary files a/docs/.gitbook/assets/image (20).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (21).png b/docs/.gitbook/assets/image (21).png
deleted file mode 100644
index 839386179..000000000
Binary files a/docs/.gitbook/assets/image (21).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (22).png b/docs/.gitbook/assets/image (22).png
deleted file mode 100644
index cfdc5af2a..000000000
Binary files a/docs/.gitbook/assets/image (22).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (23).png b/docs/.gitbook/assets/image (23).png
deleted file mode 100644
index 19f8b888c..000000000
Binary files a/docs/.gitbook/assets/image (23).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (24).png b/docs/.gitbook/assets/image (24).png
deleted file mode 100644
index afe15170d..000000000
Binary files a/docs/.gitbook/assets/image (24).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (25).png b/docs/.gitbook/assets/image (25).png
deleted file mode 100644
index 47d23e60d..000000000
Binary files a/docs/.gitbook/assets/image (25).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (26).png b/docs/.gitbook/assets/image (26).png
deleted file mode 100644
index e925c9365..000000000
Binary files a/docs/.gitbook/assets/image (26).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (27).png b/docs/.gitbook/assets/image (27).png
deleted file mode 100644
index eb4bcaeb9..000000000
Binary files a/docs/.gitbook/assets/image (27).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (28).png b/docs/.gitbook/assets/image (28).png
deleted file mode 100644
index 42b85ddf1..000000000
Binary files a/docs/.gitbook/assets/image (28).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (29).png b/docs/.gitbook/assets/image (29).png
deleted file mode 100644
index 449a7e94b..000000000
Binary files a/docs/.gitbook/assets/image (29).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (3).png b/docs/.gitbook/assets/image (3).png
deleted file mode 100644
index 22f1c6bb3..000000000
Binary files a/docs/.gitbook/assets/image (3).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (30).png b/docs/.gitbook/assets/image (30).png
deleted file mode 100644
index c30066eb4..000000000
Binary files a/docs/.gitbook/assets/image (30).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (31).png b/docs/.gitbook/assets/image (31).png
deleted file mode 100644
index cb50d3426..000000000
Binary files a/docs/.gitbook/assets/image (31).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (32).png b/docs/.gitbook/assets/image (32).png
deleted file mode 100644
index d9abc8822..000000000
Binary files a/docs/.gitbook/assets/image (32).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (33).png b/docs/.gitbook/assets/image (33).png
deleted file mode 100644
index dc8690c1b..000000000
Binary files a/docs/.gitbook/assets/image (33).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (34).png b/docs/.gitbook/assets/image (34).png
deleted file mode 100644
index 8de8a9144..000000000
Binary files a/docs/.gitbook/assets/image (34).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (35).png b/docs/.gitbook/assets/image (35).png
deleted file mode 100644
index 8a2d435d0..000000000
Binary files a/docs/.gitbook/assets/image (35).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (36).png b/docs/.gitbook/assets/image (36).png
deleted file mode 100644
index 8a2d435d0..000000000
Binary files a/docs/.gitbook/assets/image (36).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (37).png b/docs/.gitbook/assets/image (37).png
deleted file mode 100644
index 138a90e57..000000000
Binary files a/docs/.gitbook/assets/image (37).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (38).png b/docs/.gitbook/assets/image (38).png
deleted file mode 100644
index d90445547..000000000
Binary files a/docs/.gitbook/assets/image (38).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (39).png b/docs/.gitbook/assets/image (39).png
deleted file mode 100644
index 7da04f4e8..000000000
Binary files a/docs/.gitbook/assets/image (39).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (4).png b/docs/.gitbook/assets/image (4).png
deleted file mode 100644
index 73f534e64..000000000
Binary files a/docs/.gitbook/assets/image (4).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (5).png b/docs/.gitbook/assets/image (5).png
deleted file mode 100644
index aaaaca99c..000000000
Binary files a/docs/.gitbook/assets/image (5).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (6).png b/docs/.gitbook/assets/image (6).png
deleted file mode 100644
index aaaaca99c..000000000
Binary files a/docs/.gitbook/assets/image (6).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (7).png b/docs/.gitbook/assets/image (7).png
deleted file mode 100644
index 93ab08cb4..000000000
Binary files a/docs/.gitbook/assets/image (7).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (8).png b/docs/.gitbook/assets/image (8).png
deleted file mode 100644
index 33ea6bf57..000000000
Binary files a/docs/.gitbook/assets/image (8).png and /dev/null differ
diff --git a/docs/.gitbook/assets/image (9).png b/docs/.gitbook/assets/image (9).png
deleted file mode 100644
index 642d00047..000000000
Binary files a/docs/.gitbook/assets/image (9).png and /dev/null differ
diff --git a/docs/.gitbook/assets/observer.png b/docs/.gitbook/assets/observer.png
new file mode 100644
index 000000000..903ac0561
Binary files /dev/null and b/docs/.gitbook/assets/observer.png differ
diff --git a/docs/README.md b/docs/README.md
index 6b3a38e08..9aa1ae096 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -46,7 +46,7 @@ Benchmarks measuring message passing, network communication, and serialization p
## Zero Dependencies
-The framework uses only the Go standard library. No external dependencies means no version conflicts, no supply chain vulnerabilities, no surprise breaking changes from third-party packages. The requirement is just Go 1.20 or higher.
+The framework uses only the Go standard library. No external dependencies means no version conflicts, no supply chain vulnerabilities, no surprise breaking changes from third-party packages. The requirement is just Go 1.21 or higher.
This isn't ideological purity. It's practical stability. The framework's behavior depends only on Go itself. Updates are predictable. Supply chain is simple. The code you write today will compile and run the same way years from now, assuming Go maintains backward compatibility (which it does).
diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
index ec15d0c42..fad927bdb 100644
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -1,6 +1,8 @@
# Table of contents
* [Overview](README.md)
+* [FAQ](faq.md)
+* [AI Agents](ai-agents.md)
## Basics
@@ -56,14 +58,20 @@
* [Message Versioning](advanced/message-versioning.md)
* [Building a Cluster](advanced/building-a-cluster.md)
* [Debugging](advanced/debugging.md)
+* [Distributed Tracing](advanced/distributed-tracing.md)
+* [Inspecting With Observer](advanced/observer.md)
## extra library
* [Actors](extra-library/actors/README.md)
+ * [Health](extra-library/actors/health.md)
* [Leader](extra-library/actors/leader.md)
* [Metrics](extra-library/actors/metrics.md)
* [Applications](extra-library/applications/README.md)
* [Observer](extra-library/applications/observer.md)
+ * [MCP](extra-library/applications/mcp.md)
+ * [Radar](extra-library/applications/radar.md)
+ * [Pulse](extra-library/applications/pulse.md)
* [Meta-Processes](extra-library/meta-processes/README.md)
* [WebSocket](extra-library/meta-processes/websocket.md)
* [SSE](extra-library/meta-processes/sse.md)
@@ -79,5 +87,4 @@
## Tools
* [Boilerplate Code Generation](tools/ergo.md)
-* [Inspecting With Observer](tools/observer.md)
* [Saturn - Central Registrar](tools/saturn.md)
diff --git a/docs/actors/actor.md b/docs/actors/actor.md
index e9535562f..4dabc9f12 100644
--- a/docs/actors/actor.md
+++ b/docs/actors/actor.md
@@ -382,8 +382,8 @@ Log messages have the lowest priority. They're processed after Urgent, System, a
If your actor subscribed to an event (via `LinkEvent` or `MonitorEvent`), it receives event messages:
```go
-func (w *Worker) HandleEvent(message gen.MessageEvent) error {
- switch message.Name {
+func (w *Worker) HandleEvent(event gen.MessageEvent) error {
+ switch event.Event.Name {
case "config_updated":
w.reloadConfig()
case "cache_invalidated":
diff --git a/docs/actors/supervisor.md b/docs/actors/supervisor.md
index 189b4c908..304bca114 100644
--- a/docs/actors/supervisor.md
+++ b/docs/actors/supervisor.md
@@ -80,7 +80,7 @@ func createSupervisorFactory() gen.ProcessBehavior {
pid, err := node.Spawn(createSupervisorFactory, gen.ProcessOptions{})
```
-The supervisor spawns all children during `Init` (except Simple One For One, which starts with zero children). Each child is linked bidirectionally to the supervisor (`LinkChild` and `LinkParent` set automatically). If a child terminates, the supervisor receives an exit signal and applies the restart strategy.
+The supervisor spawns all children during `Init` (except Simple One For One, which starts with zero children). Each child is connected to the supervisor with a pair of unidirectional links (`LinkChild` and `LinkParent` set automatically). If a child terminates, the supervisor receives an exit signal and applies the restart strategy.
Children are started sequentially in declaration order. If any child's spawn fails (the factory's `ProcessInit` returns an error), the supervisor terminates immediately with that error. This ensures the supervision tree is fully initialized or not at all - no partial states.
diff --git a/docs/advanced/building-a-cluster.md b/docs/advanced/building-a-cluster.md
index cd294ff96..955af0a12 100644
--- a/docs/advanced/building-a-cluster.md
+++ b/docs/advanced/building-a-cluster.md
@@ -360,8 +360,8 @@ func (c *Coordinator) Init(args ...any) error {
return nil
}
-func (c *Coordinator) HandleEvent(ev gen.MessageEvent) error {
- switch msg := ev.Message.(type) {
+func (c *Coordinator) HandleEvent(event gen.MessageEvent) error {
+ switch msg := event.Message.(type) {
case etcd.EventApplicationStarted:
if msg.Name == "worker" {
@@ -825,8 +825,8 @@ cacheEnabled := config["cache.enabled"].(bool) // true
React to config changes in real-time:
```go
-func (a *App) HandleEvent(ev gen.MessageEvent) error {
- switch msg := ev.Message.(type) {
+func (a *App) HandleEvent(event gen.MessageEvent) error {
+ switch msg := event.Message.(type) {
case etcd.EventConfigUpdate:
a.Log().Info("config changed: %s = %v", msg.Item, msg.Value)
@@ -937,8 +937,8 @@ func (c *Coordinator) HandleBecomeFollower(leader gen.PID) error {
return nil
}
-func (c *Coordinator) HandleEvent(ev gen.MessageEvent) error {
- switch ev.Message.(type) {
+func (c *Coordinator) HandleEvent(event gen.MessageEvent) error {
+ switch event.Message.(type) {
case etcd.EventApplicationStarted, etcd.EventApplicationStopped:
c.refreshWorkers()
}
diff --git a/docs/advanced/debugging.md b/docs/advanced/debugging.md
index 036737a6a..61b967cb2 100644
--- a/docs/advanced/debugging.md
+++ b/docs/advanced/debugging.md
@@ -47,12 +47,12 @@ With `norecover`, panics propagate normally, providing full stack traces and all
- Tracking down type assertion failures
- Understanding the call sequence leading to a panic
-### The `trace` Tag
+### The `verbose` Tag
-The `trace` tag enables verbose logging of framework internals:
+The `verbose` tag enables verbose logging of framework internals:
```bash
-go run --tags trace ./cmd
+go run --tags verbose ./cmd
```
This produces detailed output about:
@@ -72,6 +72,57 @@ options := gen.NodeOptions{
}
```
+### The `latency` Tag
+
+The `latency` tag enables mailbox latency measurement for all processes:
+
+```bash
+go run --tags latency ./cmd
+```
+
+This activates:
+
+- **Monotonic timestamp** on every message pushed into the MPSC queue
+- **`QueueMPSC.Latency()`** returns the age (in nanoseconds) of the oldest unprocessed message in the queue
+- **`ProcessMailbox.Latency()`** returns the maximum latency across all four mailbox queues (Main, System, Urgent, Log)
+- **`MailboxLatency` field** in `ProcessShortInfo` for per-process latency snapshots
+- **`Node.ProcessRangeShortInfo()`** for efficient iteration over all processes with their latency data
+
+Without the tag, `Latency()` returns -1 (disabled) and there is zero runtime overhead: no timestamps are recorded, no atomic operations are added to the message path.
+
+The overhead with the tag enabled is approximately 10-25% on micro-benchmarks (LOCAL 1-1 scenario with a single producer and consumer exchanging messages). In real applications with many processes, the overhead is lower because the cost is amortized across concurrent operations.
+
+Latency measurement answers the question "how long has the oldest message been sitting in this process's mailbox?" A high value means the process is not keeping up with incoming messages: it is either overloaded, stuck in a long-running callback, or blocked. This is particularly useful for:
+
+- Identifying backpressure in actor pipelines
+- Detecting stuck processes before they cause cascading failures
+- Finding hotspot processes in large clusters
+
+For cluster-wide observability with Prometheus and Grafana, see the [Metrics actor](../extra-library/actors/metrics.md) which integrates latency data into distribution, top-N, and per-node panels when built with the `latency` tag.
+
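+Given these fields, a load monitor can sweep all processes and flag the laggards. Below is a minimal standalone sketch; the `processShortInfo` struct is a local stand-in for the framework's `ProcessShortInfo`, not the real type:
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// Stand-in for the per-process snapshot described above. The real
+// framework exposes this data via Node.ProcessRangeShortInfo; a local
+// struct keeps the sketch self-contained.
+type processShortInfo struct {
+	PID            string
+	MailboxLatency time.Duration // -1 when built without -tags=latency
+}
+
+// flagSlow returns the PIDs whose oldest queued message is older than
+// the given threshold.
+func flagSlow(infos []processShortInfo, threshold time.Duration) []string {
+	var slow []string
+	for _, info := range infos {
+		if info.MailboxLatency > threshold {
+			slow = append(slow, info.PID)
+		}
+	}
+	return slow
+}
+
+func main() {
+	infos := []processShortInfo{
+		{PID: "<A1B2.0.1001>", MailboxLatency: 3 * time.Millisecond},
+		{PID: "<A1B2.0.1002>", MailboxLatency: 750 * time.Millisecond},
+		{PID: "<A1B2.0.1003>", MailboxLatency: -1}, // latency disabled
+	}
+	fmt.Println(flagSlow(infos, 100*time.Millisecond))
+}
+```
+
+With the real API, the same comparison would run inside the callback passed to `Node.ProcessRangeShortInfo`.
+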
+### The `typestats` Tag
+
+The `typestats` tag enables per-type encode/decode statistics:
+
+```bash
+go run --tags typestats ./cmd
+```
+
+This activates:
+
+- **Encoded/Decoded counts** per registered EDF type for root-level operations (calls at the message boundary, not nested fields)
+- **EncodedBytes/DecodedBytes** measured as decompressed wire size, pre-compression on encode and post-decompression on decode, including the type-prefix header
+- **`Stats.Enabled` flag** in `gen.RegisteredTypeInfo` set to `true` to signal counters are active
+- Counters visible via **`Network().RegisteredTypes()`** API and the **Observer Types panel**
+
+Without the tag, counters remain zero, `Stats.Enabled` is `false`, and there is zero runtime overhead. Encode and decode go through pass-through wrappers that the Go inliner reduces to direct calls.
+
+The overhead with the tag enabled is approximately 2-3% on encode/decode throughput, from two `atomic.AddInt64` operations per root call.
+
+A counter increments only when a value of that type is the message itself, that is, at the root of an `Encode` or `Decode` call. Built-in primitives like `gen.PID`, `gen.Atom`, and `gen.Ref` typically appear as fields inside other messages, so their bytes contribute to the parent message's byte total, not to their own counters. Encoded and Decoded are independent: a node may only ever receive certain types while only ever sending others.
+
+Use case: identify message types that dominate network traffic. The average byte size per operation (`EncodedBytes / Encoded`) indicates whether a type is a candidate for compression at the producer process. Types with a high average are strong candidates for compressing at the source; types with a low average are not worth the framing overhead.
+
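+As a sketch of that calculation, here is a standalone example; the `typeStats` struct below is illustrative and only mirrors the counter fields described above, it is not the framework's actual stats type:
+
+```go
+package main
+
+import "fmt"
+
+// Hypothetical snapshot of the per-type counters described above.
+type typeStats struct {
+	Name         string
+	Encoded      int64 // root-level encode operations
+	EncodedBytes int64 // decompressed wire bytes produced
+}
+
+// avgEncodedSize returns the average wire size per encode operation,
+// or 0 if the type was never encoded at the root level.
+func avgEncodedSize(s typeStats) int64 {
+	if s.Encoded == 0 {
+		return 0
+	}
+	return s.EncodedBytes / s.Encoded
+}
+
+func main() {
+	stats := []typeStats{
+		{Name: "main.BulkReport", Encoded: 120, EncodedBytes: 6_000_000},
+		{Name: "main.Heartbeat", Encoded: 90_000, EncodedBytes: 2_700_000},
+	}
+	for _, s := range stats {
+		fmt.Printf("%s: avg %d bytes/op\n", s.Name, avgEncodedSize(s))
+	}
+}
+```
+
+In this hypothetical snapshot, `main.BulkReport` averages 50000 bytes per operation and is a compression candidate, while `main.Heartbeat` averages 30 bytes and is not.
+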
### Combining Tags
Tags can be combined for comprehensive debugging:
@@ -80,7 +131,19 @@ Tags can be combined for comprehensive debugging:
go run --tags "pprof,norecover,trace" ./cmd
```
-This enables all debugging features simultaneously. Use this combination when investigating complex issues that span multiple subsystems.
+or with latency measurement:
+
+```bash
+go run --tags "pprof,latency" ./cmd
+```
+
+or with type statistics:
+
+```bash
+go run --tags "pprof,latency,typestats" ./cmd
+```
+
+This enables all specified features simultaneously. Use combinations when investigating complex issues that span multiple subsystems.
## Profiler Integration
@@ -347,9 +410,10 @@ Observer runs at `http://localhost:9911` by default when included in your node.
Debugging actor systems requires tools that bridge the gap between logical actors and runtime goroutines. Ergo Framework provides this bridge through:
-- **Build tags** that enable profiling and diagnostics without production overhead
+- **Build tags** that enable profiling, diagnostics, and latency measurement without production overhead
- **Goroutine labels** that link runtime goroutines to their actor (PID) and meta process (Alias) identities
- **Shutdown diagnostics** that identify processes preventing clean termination
- **Observer integration** for visual inspection of running systems
Combined with Go's standard profiling tools, these capabilities enable effective debugging of even complex distributed systems.
+
diff --git a/docs/advanced/distributed-tracing.md b/docs/advanced/distributed-tracing.md
new file mode 100644
index 000000000..d5542aaba
--- /dev/null
+++ b/docs/advanced/distributed-tracing.md
@@ -0,0 +1,955 @@
+---
+description: Distributed tracing across actor message chains
+---
+
+# Distributed Tracing
+
+In a distributed actor system, a single user request can touch dozens of processes across multiple nodes. Messages hop from actor to actor, crossing network boundaries invisibly. When something goes wrong (latency spikes, a message seems to disappear, an error surfaces three hops away from its cause) you need to follow the message trail across the entire cluster.
+
+Traditional logging shows you individual perspectives. Process A logged a send at 10:00:00.001, process B logged a receive at 10:00:00.003. Connecting these fragments manually, with hundreds of messages per second, is impractical. Tracing solves this by giving the framework itself the job of tracking messages end-to-end.
+
+## What Is a Trace
+
+A trace is an identity that follows a chain of causally related messages. When a process sends a message and the framework decides to track it, a 128-bit trace ID is generated and attached to that message. From that moment, the trace identity travels with every message in the chain. When the recipient handles the message and sends new messages of its own, those messages carry the same trace ID. When those recipients send further messages, the identity continues. The trace follows the causal chain across processes and nodes until the chain ends.
+
+This is fundamentally different from HTTP tracing. In HTTP, a request enters a service, the service calls other services, and eventually a response comes back. The trace follows a request-response tree with clear boundaries. In an actor system, there are no such boundaries. A message arrives, the handler sends three async messages to different processes, each of those handlers sends more messages, and the chain branches and spreads across the cluster. There's no single "response" that marks the end. The trace ends when the last handler in the chain finishes without sending more traced messages.
+
+
+```mermaid
+sequenceDiagram
+ box rgb(200,220,255) Node X
+ participant A as Process A
+ end
+ box rgb(200,255,220) Node Y
+ participant B as Process B
+ end
+ box rgb(255,230,200) Node Z
+ participant C as Process C
+ participant D as Process D
+ end
+
+ Note over A: New trace starts (TraceID=abc)
+
+ A->>B: Send(Order)
+ rect rgb(245,245,245)
+ Note over A,B: TraceID=abc travels with the message
+ end
+
+ activate B
+ Note over B: Handling Order...
+
+ B->>C: Send(ReserveStock)
+ B->>D: Send(CreateInvoice)
+ deactivate B
+
+ activate C
+ Note over C: Handling ReserveStock...
+ deactivate C
+
+ activate D
+ Note over D: Handling CreateInvoice...
+ deactivate D
+```
+
+Process B never opted into tracing. Neither did C or D. The trace reached them because the message carried it. This is the key property: you configure tracing on entry-point processes, and the trace propagates through the entire downstream chain automatically.
+
+### The Lifecycle of a Trace
+
+A trace goes through three phases:
+
+**Birth.** A process handles a message and calls `Send`, `Call`, or `SendResponse`. The framework checks: is there an active trace from the incoming message being handled? If yes, the outgoing message inherits it. If no, the framework asks the process's sampler: "should we start a new trace?" If the sampler says yes, a new trace ID is generated. If it says no, the message goes out untraced. The sampler is covered in the Enabling Tracing section below.
+
+**Propagation.** The trace identity travels with the message. When the recipient's handler runs, the framework stores the trace as the "propagating context" for the duration of that handler. Every `Send`, `Call`, or `SendResponse` during the handler inherits the trace identity. When the handler returns, the context is restored. If the handler sends messages to five different processes, all five messages carry the same trace identity. Each recipient propagates it further in the same way.
+
+**End.** A trace has no explicit end and no timeout. It ends naturally when the last handler in the chain finishes processing and sends no further messages. A trace that spans a 30-second `Call` timeout will simply have a 30-second gap between observations. The trace identity is a value in the message, not a timer.
+
+### Observation Points
+
+As a trace flows through the system, the framework records observations at three points for each message:
+
+**Sent.** Recorded when the message leaves the sender. This is the sender's perspective: who sent what, to whom, and when.
+
+**Delivered.** Recorded when the message enters the recipient's mailbox. The recipient hasn't started processing yet; the message is still queued.
+
+**Processed.** Recorded when the recipient's handler returns. If the handler returned an error, the observation captures it.
+
+```mermaid
+sequenceDiagram
+ box rgb(200,220,255) Sender
+ participant A as Sender
+ end
+ box rgb(200,255,220) Recipient
+ participant B as Recipient
+ end
+
+ A->>B: message
+ Note right of A: Sent
+ Note left of B: Delivered
+ activate B
+ Note over B: Handler runs...
+ deactivate B
+ Note left of B: Processed
+```
+
+These three points are not the trace itself. They are what gets recorded as the trace passes through. One message produces up to three observations. A trace spanning five messages across three nodes produces up to fifteen observations. Together, these observations reconstruct the complete message flow.
+
+The timing gaps between observations tell you where time is spent:
+
+| Gap | What It Tells You |
+|-----|-------------------|
+| Sent to Delivered | Network latency (remote) or scheduling delay (local) |
+| Delivered to Processed | Mailbox wait time + handler execution time |
+| Sent to Processed | Total end-to-end latency for this message |
+
+For local messages, Sent and Delivered happen nearly simultaneously. For remote messages, the gap is the network transit time. This makes tracing particularly valuable in distributed systems: you can see exactly how much time is spent in transit versus in processing.
+
+Each observation carries context: which node emitted it, the sender and recipient identities, the message type name, the actor behavior type, a timestamp, and any custom attributes. Together, the observations for a single trace form a tree that you can visualize as a waterfall in tools like Grafana Tempo or the Observer UI.
+
+### Why Three Points, Not Two
+
+HTTP tracing typically records two points per span: the start and end of a service call. Actor tracing needs three because messages go through a mailbox. In HTTP, when service A calls service B, B starts processing immediately. In an actor system, when A sends to B, the message enters B's mailbox and waits. B might be busy handling a previous message. The wait time can be significant under load.
+
+Without the Delivered point, you'd see Sent at time T and Processed at T+50ms, but you wouldn't know whether the 50ms was network latency, mailbox wait, or handler execution. With Delivered, you know: Sent to Delivered was 2ms (network), Delivered to Processed was 48ms (the message sat in the mailbox for 40ms and the handler took 8ms). This distinction is critical for diagnosing performance issues.
+
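+The gap arithmetic above can be written down directly. A minimal sketch, assuming a hypothetical struct holding the three observation timestamps (the framework's actual observation type is not shown here):
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// Illustrative container for the three observation timestamps; the
+// struct and field names are assumptions for this sketch.
+type observationTimes struct {
+	Sent      time.Time
+	Delivered time.Time
+	Processed time.Time
+}
+
+// breakdown splits end-to-end latency into network transit and
+// mailbox wait plus handler execution.
+func breakdown(o observationTimes) (transit, queueAndHandle, total time.Duration) {
+	transit = o.Delivered.Sub(o.Sent)
+	queueAndHandle = o.Processed.Sub(o.Delivered)
+	total = o.Processed.Sub(o.Sent)
+	return
+}
+
+func main() {
+	t0 := time.Date(2026, 1, 1, 10, 0, 0, 0, time.UTC)
+	o := observationTimes{
+		Sent:      t0,
+		Delivered: t0.Add(2 * time.Millisecond),
+		Processed: t0.Add(50 * time.Millisecond),
+	}
+	transit, queueAndHandle, total := breakdown(o)
+	fmt.Printf("transit=%v queue+handler=%v total=%v\n", transit, queueAndHandle, total)
+}
+```
+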
+### What Gets Traced
+
+All message kinds that go through the framework's routing:
+
+| Kind | Description | Observations |
+|------|-------------|--------------|
+| Send | Asynchronous message (`Send`) | Sent, Delivered, Processed |
+| Request | Synchronous call (`Call`) | Sent, Delivered, Processed |
+| Response | Return value from `HandleCall` | Sent, Delivered |
+| Spawn | Process creation | Sent, Processed |
+| Terminate | Process termination | Processed |
+
+Response has no Processed because the response delivery completes the Call. There's no separate handler on the caller side. Spawn has no Delivered because it's not a mailbox delivery. Terminate has only Processed because it's an internal lifecycle event, not a message between two processes.
+
+### What Doesn't Get Traced
+
+Exit signals (`SendExit`) do not carry trace context. These are control-plane operations outside of message chains.
+
+Events (`SendEvent`) also do not carry trace context. An event with a thousand subscribers would generate thousands of trace observations from a single publish, creating a storm that overwhelms exporters and backends. If you need to trace event-driven flows, trace the messages that your event handlers send in response to receiving events.
+
+Delayed messages (`SendAfter`) do not carry trace context. A delayed message is a scheduled future action, not a continuation of the current processing chain. By the time it fires, the original handler has long finished. This prevents periodic self-tick patterns from creating infinite traces. Each tick is an independent starting point for the sampler. See the Delayed Messages section for details.
+
+## Enabling Tracing
+
+By default, no processes create traces. You enable tracing by setting a sampler that decides whether to start a new trace for each outgoing message.
+
+```go
+func (a *OrderProcessor) Init(args ...any) error {
+ a.SetTracingSampler(gen.TracingSamplerAlways)
+ return nil
+}
+```
+
+Four sampler types are available:
+
+```go
+gen.TracingSamplerDisable // never start traces (default)
+gen.TracingSamplerAlways // trace every outgoing message
+gen.TracingSamplerRatio(0.01) // trace 1% of messages
+gen.TracingSamplerRateLimit(100) // at most 100 new traces per second
+```
+
+The sampler is only consulted when there is no active trace. If a process is already handling a traced message, every outgoing message inherits the trace regardless of the sampler. This means you can set a sampler on a single entry-point process and the trace will follow the entire message chain automatically.
+
+`TracingSamplerRatio(0.1)` traces approximately 10% of messages. `TracingSamplerRateLimit(100)` allows at most 100 new traces per second. During traffic spikes the effective sampling rate drops; during quiet periods a larger share of messages is traced.
+
+The sampler is set during `Init()` but only starts working when the process begins handling messages. Messages sent during `Init()` itself, including periodic ticks set up with `SendAfter`, are not traced. This is because `Init()` is a setup phase, not message processing. The sampler becomes active starting from the first `HandleMessage` or `HandleCall` invocation.
+
+### Setting Samplers at Runtime
+
+You can change a process's sampler without restarting it:
+
+```go
+node.SetProcessTracingSampler(pid, gen.TracingSamplerAlways)
+```
+
+The node itself has a sampler for messages sent via `node.Send()` and `node.Call()`:
+
+```go
+node.SetTracingSampler(gen.TracingSamplerRatio(0.01))
+```
+
+Process samplers and the node sampler are independent.
+
+### Custom Samplers
+
+If the built-in samplers don't fit your needs, implement the `gen.TracingSampler` interface:
+
+```go
+type TracingSampler interface {
+ Sample() bool
+ String() string
+}
+```
+
+`Sample()` is called for each outgoing message that doesn't already carry a trace. Return `true` to start a new trace. `String()` provides a human-readable description shown in Observer and inspection APIs.
+
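+As an illustration, here is a custom sampler that traces only the first N outgoing messages and then goes quiet, which can be handy for capturing a few traces right after a deploy. The interface is redeclared locally so the sketch compiles standalone; `burstSampler` is an assumption for this example, not a built-in:
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync/atomic"
+)
+
+// Local redeclaration of the gen.TracingSampler interface so the
+// sketch is self-contained.
+type TracingSampler interface {
+	Sample() bool
+	String() string
+}
+
+// burstSampler traces only the first n messages, then goes silent.
+type burstSampler struct {
+	remaining int64
+}
+
+var _ TracingSampler = (*burstSampler)(nil)
+
+func NewBurstSampler(n int64) *burstSampler {
+	return &burstSampler{remaining: n}
+}
+
+func (b *burstSampler) Sample() bool {
+	// Atomic decrement: Sample may be called from concurrent sends.
+	return atomic.AddInt64(&b.remaining, -1) >= 0
+}
+
+func (b *burstSampler) String() string {
+	return fmt.Sprintf("burst(left=%d)", atomic.LoadInt64(&b.remaining))
+}
+
+func main() {
+	s := NewBurstSampler(2)
+	fmt.Println(s.Sample(), s.Sample(), s.Sample())
+}
+```
+
+A process would install it like any built-in sampler, e.g. `a.SetTracingSampler(NewBurstSampler(100))`.
+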
+## Tracing in Practice: Send
+
+The simplest traced scenario: process A handles a message and sends to process B on the same node.
+
+```go
+func (a *gateway) Init(args ...any) error {
+ a.SetTracingSampler(gen.TracingSamplerAlways)
+ a.SetTracingAttribute("service", "gateway")
+ return nil
+}
+
+func (a *gateway) HandleMessage(from gen.PID, message any) error {
+ req := message.(IncomingRequest)
+ a.Send(processorPID, ProcessOrder{ID: req.OrderID})
+ return nil
+}
+```
+
+When `a.Send()` executes, the sampler decides to start a new trace. The framework generates a trace identity shared by all observations for this message. Three observations are recorded:
+
+1. **Sent** on the sender's node, capturing: sender PID, receiver PID, message type `main.ProcessOrder`, behavior `gateway`, the custom attribute `service=gateway`.
+
+2. **Delivered** on the same node (it's local), capturing: the same message identity, the receiver's behavior name, the receiver's permanent attributes.
+
+3. **Processed** after the receiver's `HandleMessage` returns, capturing: whether the handler succeeded or returned an error, plus any one-shot attributes the receiver set during handling.
+
+The receiver didn't set a sampler. It didn't need to. The trace arrived with the message and the observations were recorded automatically.
+
+### Remote Send
+
+When process A on node X sends to process B on node Y, the trace crosses the network:
+
+```mermaid
+sequenceDiagram
+ box rgb(200,220,255) Node X
+ participant A as Process A
+ end
+ box rgb(200,255,220) Node Y
+ participant B as Process B
+ end
+
+ A->>B: Send(ProcessOrder)
+ Note right of A: Sent (on node X)
+
+ rect rgb(245,245,245)
+ Note over A,B: network transit
+ end
+
+ Note left of B: Delivered (on node Y)
+ activate B
+ Note over B: Handler runs...
+ deactivate B
+ Note left of B: Processed (on node Y)
+```
+
+Sent is recorded on node X, but Delivered and Processed are recorded on node Y. The framework preserves the message's identity across the network, so all three observations can be correlated even though they were emitted on different nodes.
+
+The gap between Sent and Delivered now represents real network latency. If you see a 50ms gap, that's 50ms of network transit.
+
+## Tracing in Practice: Message Chains
+
+The real power of tracing appears when messages form chains. Process A sends to B, and B sends to C and D while handling A's message. All hops share the same trace.
+
+```go
+func (p *processor) HandleMessage(from gen.PID, message any) error {
+ order := message.(ProcessOrder)
+ p.SetTracingSpanAttribute("order_id", order.ID)
+
+ p.Send(warehousePID, ReserveStock{OrderID: order.ID})
+ p.Send(billingPID, CreateInvoice{OrderID: order.ID})
+ return nil
+}
+```
+
+```mermaid
+sequenceDiagram
+ box rgb(200,220,255) Gateway Node
+ participant GW as gateway
+ end
+ box rgb(200,255,220) Worker Node
+ participant P as processor
+ end
+ box rgb(255,230,200) Service Node
+ participant W as warehouse
+ participant B as billing
+ end
+
+ Note over GW: Sampler starts trace
+
+ GW->>P: ProcessOrder
+ Note right of GW: Sent
+ Note left of P: Delivered
+
+ activate P
+ Note over P: Handler runs
+ P->>W: ReserveStock
+ P->>B: CreateInvoice
+ deactivate P
+ Note left of P: Processed
+
+ activate W
+ Note left of W: Delivered
+ Note over W: Handler runs
+ deactivate W
+ Note left of W: Processed
+
+ activate B
+ Note left of B: Delivered
+ Note over B: Handler runs
+ deactivate B
+ Note left of B: Processed
+```
+
+The gateway started the trace. The processor inherited it from the incoming message. The warehouse and billing processes also inherited it. Five messages, three nodes, one trace.
+
+The propagation is automatic. During a handler, the framework stores the incoming message's trace context. Every `Send`, `Call`, or `SendResponse` during that handler carries the trace forward. When the handler returns, the context is restored to whatever it was before.
+
+The trace captures causality: the processor's messages to warehouse and billing were sent **because of** the gateway's message to the processor. This creates a tree of messages that represents the complete processing flow for the original request.
+
+## Tracing in Practice: Call and Response
+
+Synchronous calls create two traced message flows within the same trace: the request going out and the response coming back.
+
+```go
+func (c *client) HandleMessage(from gen.PID, message any) error {
+ to := gen.ProcessID{Name: "inventory", Node: "warehouse@host"}
+ result, err := c.Call(to, CheckStockRequest{SKU: "WIDGET-42"})
+ if err != nil {
+ c.Log().Warning("stock check failed: %s", err)
+ return nil
+ }
+ resp := result.(CheckStockResponse)
+ c.Log().Info("stock level: %d", resp.Available)
+ return nil
+}
+
+func (inv *inventory) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
+ req := request.(CheckStockRequest)
+ level := inv.checkWarehouse(req.SKU)
+ return CheckStockResponse{Available: level}, nil
+}
+```
+
+```mermaid
+sequenceDiagram
+ box rgb(200,220,255) Node A
+ participant C as client
+ end
+ box rgb(200,255,220) Node B
+ participant I as inventory
+ end
+
+ C->>I: Call(CheckStockRequest)
+ Note right of C: Sent (request)
+
+ rect rgb(245,245,245)
+ Note over C,I: network
+ end
+
+ Note left of I: Delivered (request)
+ activate I
+ Note over I: HandleCall runs
+ deactivate I
+
+ I->>C: CheckStockResponse
+ Note right of I: Sent (response)
+ Note left of I: Processed (request)
+
+ rect rgb(245,245,245)
+ Note over C,I: network
+ end
+
+ Note left of C: Delivered (response)
+```
+
+The request and the response are separate messages, each with its own observations. They share a call reference (`gen.Ref`) that links them, so tools like Tempo and Observer can pair request and response even when multiple concurrent calls are in flight.
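+
+To make the pairing concrete, here is a standalone sketch of how a tracing backend might join request and response observations by their shared reference (the string keys stand in for `gen.Ref` values; this is illustrative, not Tempo's or Observer's actual code):
+
+```go
+package main
+
+import "fmt"
+
+// pairByRef joins request and response observations that share a call
+// reference. Unmatched requests are calls still in flight.
+func pairByRef(requests, responses map[string]string) map[string][2]string {
+	pairs := make(map[string][2]string)
+	for ref, req := range requests {
+		if resp, ok := responses[ref]; ok {
+			pairs[ref] = [2]string{req, resp}
+		}
+	}
+	return pairs
+}
+
+func main() {
+	requests := map[string]string{"ref-1": "CheckStockRequest", "ref-2": "PingRequest"}
+	responses := map[string]string{"ref-1": "CheckStockResponse"}
+	fmt.Println(pairByRef(requests, responses)) // ref-1 paired; ref-2 still in flight
+}
+```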
+
+If the inventory process sends additional messages during `HandleCall` (for example, querying a database actor), those messages are also part of the same trace, linked causally to the incoming request.
+
+### Forward Pattern
+
+In the actor model, a process handling a synchronous request can forward it to another process instead of responding directly. The relay wraps the original caller's identity and reference into the forwarded message, and the final recipient responds straight to the original caller:
+
+```go
+func (r *relay) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
+ req := request.(Request)
+ target := gen.ProcessID{Name: "backend", Node: "target@node"}
+
+ r.Send(target, MessageForward{
+ OriginalFrom: from,
+ OriginalRef: ref,
+ Payload: req,
+ })
+ return nil, nil // no direct response -- backend will respond to the original caller
+}
+
+func (b *backend) HandleMessage(from gen.PID, message any) error {
+ fwd := message.(MessageForward)
+ // process the forwarded payload, then respond straight to the original caller
+ result := b.process(fwd.Payload)
+ b.SendResponse(fwd.OriginalFrom, fwd.OriginalRef, result)
+ return nil
+}
+```
+
+The trace follows the entire chain: A's call to the relay, the relay's forward to the backend, and the backend's response to A. Three messages, potentially three nodes, one trace. The response skips the relay entirely, and the trace captures this topology accurately.
+
+### Async Response and Trace Context
+
+When `HandleCall` returns `nil, nil` (async response), the process stores the caller's identity and reference to respond later. Between the request handler and the eventual response, other messages may arrive. The response will happen in a different handler invocation, potentially with a different trace context.
+
+If you need the response to be in the same trace as the original request, save the trace context alongside the caller identity:
+
+```go
+type pendingCall struct {
+ From gen.PID
+ Ref gen.Ref
+ Tracing gen.Tracing
+}
+
+func (s *service) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
+ s.pending = pendingCall{
+ From: from,
+ Ref: ref,
+ Tracing: s.PropagatingTrace(),
+ }
+ return nil, nil
+}
+
+func (s *service) HandleMessage(from gen.PID, message any) error {
+ // some event triggers the response; buildResult is a placeholder for your own logic
+ result := s.buildResult(message)
+
+ saved := s.PropagatingTrace()
+ s.SetPropagatingTrace(s.pending.Tracing)
+ s.SendResponse(s.pending.From, s.pending.Ref, result)
+ s.SetPropagatingTrace(saved)
+ return nil
+}
+```
+
+`PropagatingTrace()` returns the current trace context. In `HandleCall`, this is the request's trace. Saving it and restoring before `SendResponse` ensures the response carries the original request's trace, regardless of which trace context the current handler is working with.
+
+The save-restore pattern is important: `SetPropagatingTrace` changes the trace context for all subsequent operations in the handler. If you don't restore the previous context, the modified trace will leak beyond the current handler into all subsequent handler invocations. Every message the process sends from that point on will carry the leaked trace until another traced message arrives and resets it. Always save before, always restore after.
+
+## Custom Attributes
+
+Traces show message flow. Custom attributes add business context that makes traces searchable and meaningful.
+
+Attributes describe the place where a message was sent, delivered, or processed. They are part of the observation record, not part of the trace context. Over the network, only the trace ID and span ID travel with the message, just enough to link observations into a chain. Attributes stay local to the node that emitted the observation. This keeps the network overhead minimal and lets each process describe its own context independently.
+
+### Permanent Attributes
+
+Set on a process, attached to every observation from that process for its entire lifetime:
+
+```go
+func (a *PaymentService) Init(args ...any) error {
+ a.SetTracingSampler(gen.TracingSamplerRatio(0.01))
+ a.SetTracingAttribute("service", "payment")
+ a.SetTracingAttribute("version", "2.1")
+ a.SetTracingAttribute("region", "eu-west")
+ return nil
+}
+```
+
+When a message passes through this process, the process's attributes appear on every observation in which it participates. If another process sends a message to PaymentService, the Delivered and Processed observations carry `service=payment, version=2.1, region=eu-west`. When PaymentService sends a message to someone else, the Sent observation carries the same attributes. The attributes describe the location in the system where the observation was recorded.
+
+Setting an attribute with a key that already exists overwrites the value. Remove with `RemoveTracingAttribute(key)`.
+
+### Node-Level Attributes
+
+The node has its own permanent attributes, independent from process attributes:
+
+```go
+node.SetTracingAttribute("env", "production")
+node.SetTracingAttribute("cluster", "payments-eu")
+```
+
+Same mechanics as process attributes: set, overwrite, or remove at any time.
+
+### One-Shot Span Attributes
+
+Set during message handling, scoped to a single handler invocation:
+
+```go
+func (a *OrderProcessor) HandleMessage(from gen.PID, message any) error {
+ order := message.(Order)
+ a.SetTracingSpanAttribute("order_id", order.ID)
+ a.SetTracingSpanAttribute("customer", order.CustomerID)
+ a.SetTracingSpanAttribute("amount", fmt.Sprintf("%.2f", order.Total))
+
+ a.Send(warehousePID, ReserveStock{OrderID: order.ID})
+ a.Send(billingPID, CreateInvoice{OrderID: order.ID})
+ return nil
+}
+```
+
+One-shot attributes appear on the observations emitted during this handler invocation: the Processed observation for the incoming message, and the Sent observations for outgoing messages. When the handler returns, one-shot attributes are cleared automatically. The next handler invocation starts with a clean slate.
+
+If a one-shot attribute has the same key as a permanent attribute, the one-shot value takes priority for that handler invocation. The permanent attribute is not modified.
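+
+The precedence can be modeled as a plain map merge, a minimal sketch rather than the framework's actual implementation:
+
+```go
+package main
+
+import "fmt"
+
+// mergeAttributes models how attributes combine for one handler invocation:
+// one-shot values override permanent values with the same key, while the
+// permanent map itself is left untouched.
+func mergeAttributes(permanent, oneShot map[string]string) map[string]string {
+	merged := make(map[string]string, len(permanent)+len(oneShot))
+	for k, v := range permanent {
+		merged[k] = v
+	}
+	for k, v := range oneShot {
+		merged[k] = v // one-shot takes priority for this invocation
+	}
+	return merged
+}
+
+func main() {
+	permanent := map[string]string{"service": "payment", "region": "eu-west"}
+	oneShot := map[string]string{"region": "eu-central", "order_id": "ORD-456"}
+	fmt.Println(mergeAttributes(permanent, oneShot))
+}
+```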
+
+### Where Attributes Appear
+
+Different observations carry different attributes:
+
+| Observation | Attributes |
+|-------------|-----------|
+| Sent | Sender's permanent + one-shot attributes |
+| Delivered | Receiver's permanent attributes |
+| Processed | Receiver's permanent + one-shot attributes |
+
+This means: the sender decides what context to attach at send time. The receiver's permanent identity (service name, version) appears on its Delivered and Processed observations. The receiver can add handler-specific context (order ID, customer) that appears on its Processed observation and on any Sent observations during that handler.
+
+### Searching by Attributes
+
+In Grafana Tempo or the Observer UI, search by any attribute value. If one observation in a trace has `order_id=ORD-456`, searching for it returns the complete trace, all observations across all nodes in the chain. You don't need the same attribute on every observation.
+
+This makes attributes a powerful debugging tool. Set `order_id` on the entry-point process, and you can find the complete processing trace for any order by searching for its ID.
+
+The `ergo.` prefix is reserved for framework-generated attributes (`ergo.node`, `ergo.from`, `ergo.behavior`). Attempts to set attributes with this prefix are silently ignored.
+
+## Delayed Messages
+
+`SendAfter` does not carry trace context. This is a deliberate design choice: a delayed message is a future action, not a continuation of the current processing chain.
+
+Consider a common pattern, a process that does periodic work via a self-tick:
+
+```go
+func (w *worker) Init(args ...any) error {
+ w.SetTracingSampler(gen.TracingSamplerAlways)
+ w.SendAfter(w.PID(), messageTick{}, 3*time.Second)
+ return nil
+}
+
+func (w *worker) HandleMessage(from gen.PID, message any) error {
+ switch message.(type) {
+ case messageTick:
+ w.Send(targetPID, DoWork{})
+ w.SendAfter(w.PID(), messageTick{}, 3*time.Second)
+ }
+ return nil
+}
+```
+
+Each tick arrives as an untraced message. The sampler on the worker decides independently for each `Send(targetPID, DoWork{})` whether to create a trace. The `SendAfter` at the end schedules the next tick without trace context, breaking the chain and ensuring the next tick starts fresh.
+
+If `SendAfter` inherited the trace, the first tick that happened to be traced would create an infinite trace: tick carries trace, handler sends traced tick, next handler sends traced tick, forever. A process running for days would accumulate millions of observations in a single trace. Decoupling `SendAfter` from the trace context prevents this.
+
+The same applies to `SendAfter` to other processes. If you need a delayed message to carry trace context, send it through a regular `Send` to an intermediary that schedules the delay, or store the trace context and restore it when the delayed action triggers (the same pattern as Async Response above).
+
+### Self-Send and Trace Propagation
+
+`Send` to self behaves like `Send` to any other process. The message carries the current trace context. This is consistent and enables patterns like async `HandleCall` where a process sends work to itself and responds later within the same trace.
+
+For periodic self-loops, use `SendAfter` which does not carry trace context. This is the natural choice for tick patterns since `SendAfter` provides the timing control that loops need. Each tick starts fresh, and the sampler decides independently whether to trace it.
+
+If your actor uses `Send` to itself for a finite internal sequence (state machine, batch processing), the internal steps will appear in the trace. For a three-step state machine triggered by a traced message, this adds six extra observations. This is proportional to the work done and finite, not a concern in practice.
+
+## Lifecycle Events: Spawn and Terminate
+
+When a process spawns a child during a traced handler, the spawn itself is part of the trace.
+
+```go
+func (m *manager) HandleMessage(from gen.PID, message any) error {
+ task := message.(NewTask)
+
+ pid, err := m.Spawn(workerFactory, gen.ProcessOptions{}, task.Config)
+ if err != nil {
+ return err
+ }
+
+ m.Send(pid, BeginWork{TaskID: task.ID})
+ return nil
+}
+```
+
+The framework records two observations for the spawn:
+
+**Sent.** Emitted before the child's `Init()` runs. This is "spawn initiated."
+
+**Processed.** Emitted after `Init()` returns. If `Init()` returned an error, the error is recorded in this observation's Error field.
+
+The gap between Sent and Processed is the `Init()` execution time. If a spawn is slow, you'll see it in the trace.
+
+After `Init()` completes, the child process starts with a clean slate, no inherited trace context. Messages the child sends during `Init()` are not traced. The child's sampler decides whether to trace its own outgoing messages starting from the first `HandleMessage` or `HandleCall`. The `Send(pid, BeginWork{})` in the example above carries the parent's trace (it's a regular `Send` during the parent's traced handler), so the child receives and processes it within the parent's trace.
+
+### Terminate
+
+A terminate observation is recorded when a process terminates while handling a traced message. If the handler returns an error that causes the process to exit, the framework records the termination reason in the same trace as the message that caused the crash. This gives you the complete picture in one trace: the message arrived, the handler failed, the process terminated.
+
+Processes that terminate between handler invocations (normal shutdown, supervisor stop, `node.Kill`) do not generate a terminate observation. Normal lifecycle events don't produce tracing noise.
+
+## Exporters
+
+Observations go nowhere by themselves. To see them, you register one or more tracing exporters on the node. This works similarly to loggers: a node can have multiple loggers, each receiving log messages according to its own level filter. A node can have multiple tracing exporters, each receiving observations according to its own flags.
+
+The framework emits all observations unconditionally for traced messages. Each exporter declares which types of observations it wants to receive, and the framework delivers only those. One exporter might receive everything for a waterfall UI, while another on the same node receives only Sent observations for counting outgoing messages.
+
+### Exporter Flags
+
+When you register an exporter, you specify which observations it should receive:
+
+```go
+gen.TracingFlagSend // Sent observations
+gen.TracingFlagReceive // Delivered and Processed observations
+gen.TracingFlagProcs // Spawn and Terminate lifecycle events
+```
+
+Combine with bitwise OR:
+
+```go
+// receive everything
+flags := gen.TracingFlagSend | gen.TracingFlagReceive | gen.TracingFlagProcs
+
+// only message delivery observations
+flags := gen.TracingFlagReceive
+```
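+
+The filtering itself reduces to a bitwise test. A minimal standalone model (the bit layout here is an assumption; only the `gen` package defines the real constant values):
+
+```go
+package main
+
+import "fmt"
+
+type TracingFlags uint8
+
+// Illustrative bit layout; the actual values live in the gen package.
+const (
+	TracingFlagSend TracingFlags = 1 << iota
+	TracingFlagReceive
+	TracingFlagProcs
+)
+
+// wants reports whether an exporter registered with the given flags
+// should receive an observation of the given kind.
+func wants(exporterFlags, observationKind TracingFlags) bool {
+	return exporterFlags&observationKind != 0
+}
+
+func main() {
+	flags := TracingFlagSend | TracingFlagProcs
+	fmt.Println(wants(flags, TracingFlagSend))    // true
+	fmt.Println(wants(flags, TracingFlagReceive)) // false
+}
+```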
+
+### Two Kinds of Exporters
+
+**Process-based.** An actor process that receives observations in its mailbox. Use this when the exporter needs actor capabilities: batching with timers, sending over the network, accessing node services. This is how Observer and Pulse work internally.
+
+```go
+node.TracingExporterAddPID(pid, "my-exporter",
+ gen.TracingFlagSend | gen.TracingFlagReceive | gen.TracingFlagProcs)
+```
+
+The process implements `HandleSpan(gen.TracingSpan)` to process each observation. If the process's mailbox is full, observations are silently dropped. Ensure the exporter can keep up with the observation rate.
+
+**Behavior-based.** A simple implementation of the `gen.TracingBehavior` interface. `HandleSpan` is called synchronously when an observation is emitted. Use this for lightweight exporters that don't need actor capabilities.
+
+```go
+type TracingBehavior interface {
+ HandleSpan(TracingSpan)
+ Terminate()
+}
+```
+
+```go
+node.TracingExporterAdd("counter", &spanCounter{},
+ gen.TracingFlagSend | gen.TracingFlagReceive)
+```
+
+Keep `HandleSpan` fast. It blocks delivery to the next exporter in the chain.
+
+### Registering Exporters
+
+At node startup:
+
+```go
+options := gen.NodeOptions{
+ Tracing: gen.TracingOptions{
+ Exporters: []gen.TracingExporter{
+ {
+ Name: "my-exporter",
+ Exporter: &myExporter{},
+ Flags: gen.TracingFlagSend | gen.TracingFlagReceive,
+ },
+ },
+ },
+}
+```
+
+At runtime:
+
+```go
+node.TracingExporterAdd("counter", &spanCounter{}, gen.TracingFlagSend)
+node.TracingExporterAddPID(pid, "observer", gen.TracingFlagSend | gen.TracingFlagReceive | gen.TracingFlagProcs)
+```
+
+Each exporter has a unique name. Attempting to register a name that's already taken returns `gen.ErrTaken`. A process can only be registered as one exporter. A second attempt returns `gen.ErrNotAllowed`.
+
+### Removing Exporters
+
+```go
+names := node.TracingExporters() // list registered exporter names
+node.TracingExporterDelete("name") // remove by name
+node.TracingExporterDeletePID(pid) // remove by PID
+```
+
+Removing a behavior-based exporter calls its `Terminate()` method. Exporters can be added and removed at any time while the node is running.
+
+## Observer and Pulse
+
+Two ready-made exporters are available out of the box.
+
+[Observer](observer.md) provides real-time tracing visualization directly in the web UI. It connects to a specific node and shows traces passing through that node, useful for live debugging and runtime sampler control. Since Observer sees only one node at a time, traces that span multiple nodes will appear partial. See [Inspecting With Observer](observer.md) for details.
+
+[Pulse](../extra-library/applications/pulse.md) exports traces to an OTLP-compatible backend (Grafana Tempo, Jaeger). Each node runs its own Pulse instance, sending observations to a shared collector. The backend assembles complete cross-cluster traces from all nodes, so you can see the full message chain end-to-end. See the [Pulse documentation](../extra-library/applications/pulse.md) for setup and configuration.
+
+## Production Patterns
+
+### Sampling at the Edge
+
+In production, you rarely want to trace everything. Set a ratio sampler on your entry-point processes and let propagation handle the rest:
+
+```go
+func (gw *APIGateway) Init(args ...any) error {
+ gw.SetTracingSampler(gen.TracingSamplerRatio(0.01))
+ gw.SetTracingAttribute("service", "api-gateway")
+ return nil
+}
+```
+
+One percent of requests are traced end-to-end across the entire cluster. The other 99% have near-zero overhead: one `Sample()` call returning `false`.
+
+Downstream processes don't need samplers. They inherit traces from incoming messages. This means adding tracing to a complex system requires changes only at the entry points.
+
+### Rate Limiting Under Load
+
+When traffic volume varies, `TracingSamplerRateLimit` provides a steady flow of traces regardless of load:
+
+```go
+gw.SetTracingSampler(gen.TracingSamplerRateLimit(50))
+```
+
+This creates at most 50 new traces per second. During a traffic spike, the effective sampling rate drops. During quiet periods, more messages are traced.
+
+This is useful when your tracing backend or exporters have throughput limits. You get consistent trace volume without overwhelming the pipeline.
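+
+Conceptually the limiter is a per-second budget. A standalone sketch with an injectable clock (an illustration of the idea; the framework's actual implementation may differ):
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// rateLimitSampler models "at most limit new traces per second".
+type rateLimitSampler struct {
+	limit  int
+	window time.Time        // start of the current one-second window
+	used   int              // traces started in this window
+	now    func() time.Time // injectable for testing
+}
+
+func (s *rateLimitSampler) Sample() bool {
+	sec := s.now().Truncate(time.Second)
+	if sec.After(s.window) {
+		s.window, s.used = sec, 0 // new window: reset the budget
+	}
+	if s.used >= s.limit {
+		return false // budget exhausted: do not start a trace
+	}
+	s.used++
+	return true
+}
+
+func main() {
+	t := time.Now()
+	s := &rateLimitSampler{limit: 2, now: func() time.Time { return t }}
+	fmt.Println(s.Sample(), s.Sample(), s.Sample()) // true true false
+}
+```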
+
+### Debugging a Specific Process
+
+Something is wrong with a particular process. Enable full tracing on it without restarting:
+
+```go
+node.SetProcessTracingSampler(problemPID, gen.TracingSamplerAlways)
+```
+
+Or through the Observer UI: open the process, go to Config, set the sampler to "always". Every message this process handles and every message it sends will be traced. When you're done investigating, set it back to "disable."
+
+Because trace propagation is automatic, you'll see not just this process's messages but the entire downstream chain. If the process calls a remote service, you'll see the round-trip. If it spawns workers, you'll see the spawn and the workers' activity.
+
+### Finding Specific Requests
+
+A customer reports a problem with order ORD-789. You need to see what happened:
+
+```go
+func (a *OrderProcessor) HandleMessage(from gen.PID, message any) error {
+ order := message.(Order)
+ a.SetTracingSpanAttribute("order_id", order.ID)
+ // ... process the order
+ return nil
+}
+```
+
+In Grafana Tempo, search for `order_id=ORD-789`. The complete trace appears: every message in the processing chain, across every node, with timing at every hop. You can see where the latency was, which service returned an error, and what happened next.
+
+This requires that the entry-point process was tracing when order ORD-789 came through. With 1% sampling, you won't have traces for every request. For critical flows where you always need traces, use `TracingSamplerAlways` on the entry-point process or a higher ratio.
+
+### Temporary Tracing for Incident Response
+
+During an incident, you need more visibility. Increase sampling temporarily:
+
+```go
+// before: 1% sampling
+node.SetProcessTracingSampler(gatewayPID, gen.TracingSamplerRatio(0.01))
+
+// during incident: trace everything
+node.SetProcessTracingSampler(gatewayPID, gen.TracingSamplerAlways)
+
+// after resolution: back to normal
+node.SetProcessTracingSampler(gatewayPID, gen.TracingSamplerRatio(0.01))
+```
+
+You can do this through the Observer UI without any code changes: open the process, change the sampler in the Config tab, investigate, and set it back.
+
+## Understanding Trace Trees
+
+As traces propagate through message chains, they form trees. Understanding the tree structure helps when reading traces in Tempo or Observer.
+
+### Linear Chain
+
+The simplest tree: A sends to B, B sends to C, C sends to D.
+
+```mermaid
+sequenceDiagram
+ box rgb(200,220,255)
+ participant A
+ end
+ box rgb(200,255,220)
+ participant B
+ end
+ box rgb(255,230,200)
+ participant C
+ end
+ box rgb(230,220,255)
+ participant D
+ end
+
+ A->>B: Send
+ activate B
+ Note over B: handles...
+ B->>C: Send
+ deactivate B
+ activate C
+ Note over C: handles...
+ C->>D: Send
+ deactivate C
+ activate D
+ Note over D: handles...
+ deactivate D
+```
+
+Each message is a child of the message that caused it. In a waterfall view, you see a staircase pattern: each hop begins only after the previous handler has run.
+
+### Fan-Out
+
+One handler sends to multiple recipients:
+
+```mermaid
+sequenceDiagram
+ box rgb(200,220,255)
+ participant A
+ end
+ box rgb(200,255,220)
+ participant B
+ end
+ box rgb(255,230,200)
+ participant C
+ participant D
+ participant E
+ end
+
+ A->>B: Send
+ activate B
+ Note over B: handles...
+ B->>C: ReserveStock
+ B->>D: CreateInvoice
+ B->>E: SendNotification
+ deactivate B
+```
+
+B's handler sends three messages. All three are children of B's incoming message. In a waterfall view, the three sends appear at roughly the same timestamp, fanning out from B's processing.
+
+### Fan-Out with Call
+
+B calls C synchronously, then uses the result to send to D:
+
+```go
+func (b *processor) HandleMessage(from gen.PID, message any) error {
+ result, err := b.Call(validatorPID, ValidateRequest{...})
+ if err != nil {
+ return err
+ }
+ b.Send(executorPID, ExecuteRequest{Validated: result})
+ return nil
+}
+```
+
+```mermaid
+sequenceDiagram
+ box rgb(200,220,255)
+ participant A
+ end
+ box rgb(200,255,220)
+ participant B
+ end
+ box rgb(255,230,200)
+ participant C as C (Validator)
+ end
+ box rgb(230,220,255)
+ participant D as D (Executor)
+ end
+
+ A->>B: Send
+ activate B
+ Note over B: handles...
+ B->>C: Call(Validate)
+ activate C
+ C->>B: Response
+ deactivate C
+ B->>D: Send(Execute)
+ deactivate B
+```
+
+In the waterfall, you see B waiting for C's response before sending to D. The gap between the response arriving and D's Sent observation shows B's processing time between the call return and the next send.
+
+### Deep Chains Across Nodes
+
+In a microservice-style architecture with many nodes, traces can span many hops:
+
+```mermaid
+sequenceDiagram
+ box rgb(200,220,255) Edge
+ participant GW as Gateway
+ end
+ box rgb(200,255,220) Auth
+ participant Auth
+ end
+ box rgb(255,230,200) Orders
+ participant Ord as OrderService
+ end
+ box rgb(230,220,255) Stock
+ participant Inv as Inventory
+ participant WH as Warehouse
+ end
+
+ GW->>Auth: Call
+ activate Auth
+ Auth->>GW: Response
+ deactivate Auth
+ GW->>Ord: Send
+ activate Ord
+ Ord->>Inv: Call
+ activate Inv
+ Inv->>WH: Call
+ activate WH
+ WH->>Inv: Response
+ deactivate WH
+ Inv->>Ord: Response
+ deactivate Inv
+ deactivate Ord
+```
+
+Each arrow is a message with up to three observation points. The complete trace might have 15-20 observations across 5 nodes. In Tempo's waterfall view, you see exactly where time is spent: if the warehouse is slow, the gap between its Delivered and Processed observations will be large.
+
+## Writing a Custom Exporter
+
+For specialized needs beyond Pulse and Observer, you can write your own exporter. Here's an example that counts observations by kind:
+
+```go
+type traceCounter struct {
+ sends int64
+ requests int64
+ responses int64
+}
+
+func (tc *traceCounter) HandleSpan(span gen.TracingSpan) {
+ switch span.Kind {
+ case gen.TracingKindSend:
+ atomic.AddInt64(&tc.sends, 1)
+ case gen.TracingKindRequest:
+ atomic.AddInt64(&tc.requests, 1)
+ case gen.TracingKindResponse:
+ atomic.AddInt64(&tc.responses, 1)
+ }
+}
+
+func (tc *traceCounter) Terminate() {}
+```
+
+Register it at node startup:
+
+```go
+options := gen.NodeOptions{
+ Tracing: gen.TracingOptions{
+ Exporters: []gen.TracingExporter{
+ {
+ Name: "counter",
+ Exporter: &traceCounter{},
+ Flags: gen.TracingFlagSend,
+ },
+ },
+ },
+}
+```
+
+Or register at runtime:
+
+```go
+node.TracingExporterAdd("counter", &traceCounter{},
+ gen.TracingFlagSend)
+```
+
+The flags on the exporter determine which observations it receives. The counter above only gets Sent observations (because of `TracingFlagSend`). To also receive Delivered and Processed, add `gen.TracingFlagReceive`.
+
+For more complex exporters that need actor capabilities (sending messages, using timers, accessing the network), register a process as an exporter with `TracingExporterAddPID` and implement `HandleSpan` in your actor. This is how Pulse works: a pool of actor processes that batch observations and flush them over HTTP.
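+
+The batch-and-flush core of such an exporter can be sketched independently of the framework (the names here are illustrative, not Pulse's actual API):
+
+```go
+package main
+
+import "fmt"
+
+// spanBatcher accumulates observations and hands a full batch to flush.
+type spanBatcher struct {
+	max   int
+	buf   []string // stand-in for gen.TracingSpan values
+	flush func(batch []string)
+}
+
+// Add buffers one observation and flushes when the batch is full.
+func (b *spanBatcher) Add(span string) {
+	b.buf = append(b.buf, span)
+	if len(b.buf) >= b.max {
+		b.Flush()
+	}
+}
+
+// Flush hands off whatever is buffered, e.g. on a timer tick or shutdown.
+func (b *spanBatcher) Flush() {
+	if len(b.buf) == 0 {
+		return
+	}
+	b.flush(b.buf)
+	b.buf = nil
+}
+
+func main() {
+	batcher := &spanBatcher{max: 2, flush: func(batch []string) {
+		fmt.Println("flushing", len(batch), "spans")
+	}}
+	batcher.Add("sent")
+	batcher.Add("processed") // triggers a flush at batch size 2
+	batcher.Add("delivered")
+	batcher.Flush() // flush the remainder explicitly
+}
+```
+
+In a real process-based exporter, `Add` would be called from `HandleSpan` and `Flush` from a timer message, keeping each `HandleSpan` invocation cheap.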
diff --git a/docs/advanced/images/observer/connection.png b/docs/advanced/images/observer/connection.png
new file mode 100644
index 000000000..735e4802a
Binary files /dev/null and b/docs/advanced/images/observer/connection.png differ
diff --git a/docs/advanced/images/observer/events.png b/docs/advanced/images/observer/events.png
new file mode 100644
index 000000000..97e68c5a3
Binary files /dev/null and b/docs/advanced/images/observer/events.png differ
diff --git a/docs/advanced/images/observer/log.png b/docs/advanced/images/observer/log.png
new file mode 100644
index 000000000..94cf5a09b
Binary files /dev/null and b/docs/advanced/images/observer/log.png differ
diff --git a/docs/advanced/images/observer/network.png b/docs/advanced/images/observer/network.png
new file mode 100644
index 000000000..864113dda
Binary files /dev/null and b/docs/advanced/images/observer/network.png differ
diff --git a/docs/advanced/images/observer/process_info.png b/docs/advanced/images/observer/process_info.png
new file mode 100644
index 000000000..2d3a3b652
Binary files /dev/null and b/docs/advanced/images/observer/process_info.png differ
diff --git a/docs/advanced/images/observer/processes.png b/docs/advanced/images/observer/processes.png
new file mode 100644
index 000000000..b3750d9a0
Binary files /dev/null and b/docs/advanced/images/observer/processes.png differ
diff --git a/docs/advanced/images/observer/profiler.png b/docs/advanced/images/observer/profiler.png
new file mode 100644
index 000000000..b9d0f811e
Binary files /dev/null and b/docs/advanced/images/observer/profiler.png differ
diff --git a/docs/advanced/images/observer/tracing.png b/docs/advanced/images/observer/tracing.png
new file mode 100644
index 000000000..7dda3f571
Binary files /dev/null and b/docs/advanced/images/observer/tracing.png differ
diff --git a/docs/advanced/message-versioning.md b/docs/advanced/message-versioning.md
index defab15e2..9b4c1834f 100644
--- a/docs/advanced/message-versioning.md
+++ b/docs/advanced/message-versioning.md
@@ -43,23 +43,22 @@ func (a *Actor) HandleMessage(from gen.PID, message any) error {
}
```
-All message types must be registered with EDF before connection establishment:
+All message types must be registered with the network stack before connection establishment. Register them from your application's `Load(node)` callback:
```go
-func init() {
- types := []any{
+func (a *MyApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
+ err := node.Network().RegisterTypes([]any{
OrderCreatedV1{},
OrderCreatedV2{},
+ })
+ if err != nil {
+ return gen.ApplicationSpec{}, err
}
- for _, t := range types {
- if err := edf.RegisterTypeOf(t); err != nil && err != gen.ErrTaken {
- panic(err)
- }
- }
+ return gen.ApplicationSpec{ /* ... */ }, nil
}
```
-For details on EDF and type registration, see [Network Transparency](../networking/network-transparency.md).
+For details on the type registry and the legacy `edf.RegisterTypeOf` API, see [Network Transparency](../networking/network-transparency.md).
## Versioning Strategies
@@ -352,45 +351,41 @@ company.com/
### Registration Helper
-All message types must be registered with EDF before connection establishment - during handshake, nodes exchange their registered type lists which become the encoding dictionaries. Registration typically happens in `init()` functions before node startup. There are two approaches: centralized registration in the shared module or manual registration in each client.
+All message types must be registered with the network stack before connection establishment. During handshake, nodes exchange their registered type lists which become the encoding dictionaries. Registration happens from an application's `Load(node)` callback, which runs after the network stack is initialized but before any traffic. There are two approaches: a centralized helper exported by the shared module, or manual registration per client.
-**Centralized registration** uses `init()` to register all types when the package is imported:
+**Centralized helper** exposes a single function that the consumer's application calls from `Load`:
```go
// events/register.go
package events
-import (
- "ergo.services/ergo/gen"
- "ergo.services/ergo/net/edf"
-)
+import "ergo.services/ergo/gen"
-func init() {
- types := []any{
+func RegisterTypes(network gen.Network) error {
+ return network.RegisterTypes([]any{
OrderCreatedV1{},
OrderCreatedV2{},
PaymentReceivedV1{},
- }
- for _, t := range types {
- if err := edf.RegisterTypeOf(t); err != nil && err != gen.ErrTaken {
- panic(err)
- }
- }
+ })
}
```
-When clients import the package to use message types, `init()` runs automatically at program startup and registers all types:
+Each consumer calls it from its application:
```go
import "company.com/events"
-// Using events.OrderCreatedV1 means the package is imported,
-// init() has already run, types are registered
+func (a *OrderService) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
+ if err := events.RegisterTypes(node.Network()); err != nil {
+ return gen.ApplicationSpec{}, err
+ }
+ return gen.ApplicationSpec{ /* ... */ }, nil
+}
```
-No risk of forgetting a type.
+The shared `events` module owns the canonical list of types. Consumers register them all without having to enumerate each type, so there is no risk of forgetting one. `RegisterTypes` accepts a slice in any order and resolves nested-type dependencies internally.
-**Manual registration** means each client registers only the types it uses. This gives more control but introduces risk: a missing registration is only detected at runtime - `"no encoder for type"` when sending, `"unknown reg type for decoding"` when receiving. For most projects, centralized registration is simpler and safer. Choose based on your needs.
+**Manual registration** means each client registers only the types it uses. This gives more control but introduces risk: a missing registration is only detected at runtime, surfacing as `"no encoder for type"` when sending or `"unknown reg type for decoding"` when receiving. For most projects, centralized registration is simpler and safer. Choose based on your needs.
For message isolation patterns within a single codebase, see [Project Structure](../basics/project-structure.md).
@@ -635,11 +630,11 @@ type OrderV2 struct {
**Forgetting to register new types**
```go
-// Type exists but not registered - encoding fails at runtime
+// Type exists but not registered. Encoding fails at runtime.
type OrderV3 struct { ... }
-// Must register before node starts
-edf.RegisterTypeOf(OrderV3{})
+// Register from your application's Load callback before any traffic.
+node.Network().RegisterType(OrderV3{})
```
**Long coexistence periods**
@@ -648,7 +643,7 @@ Supporting V1 for months creates maintenance burden. Set clear deprecation deadl
**Registering after connection established**
-Types must be registered before node starts. Dynamic registration requires connection cycling.
+Types must be registered before connections are formed. Dynamic registration requires connection cycling.
## Summary
diff --git a/docs/advanced/observer.md b/docs/advanced/observer.md
new file mode 100644
index 000000000..dd2226b44
--- /dev/null
+++ b/docs/advanced/observer.md
@@ -0,0 +1,259 @@
+---
+description: Real-time inspection and management of Ergo nodes
+---
+
+# Inspecting With Observer
+
+This page walks through each page of the Observer web interface. For installation and configuration, see [Observer Application](../extra-library/applications/observer.md).
+
+The sidebar contains a node selector listing all nodes discovered through the registrar. Select a different node and Observer switches to showing that node's data. You deploy Observer on one node and monitor the entire cluster from a single browser tab.
+
+To try Observer with a live cluster:
+
+```bash
+git clone https://github.com/ergo-services/examples
+cd examples/observability
+make up
+```
+
+This starts a multi-node cluster with Observer, tracing, health probes, Prometheus metrics, and Grafana dashboards. Open `http://localhost:9911` for Observer, `http://localhost:8888/dashboards` for Grafana.
+
+## Dashboard
+
+
+
+The dashboard is the landing page with summary cards, real-time charts, and node-wide counters. Two controls let you manage the node directly from here: the log level dropdown changes the node-level severity threshold (see [Logging](../basics/logging.md)), and the [tracing](distributed-tracing.md) sampler dropdown controls whether the node starts new traces for messages sent via `node.Send()` and `node.Call()`. Both take effect immediately.
+
+The applications page lets you manage the lifecycle of applications running on the node: start in a selected [mode](../basics/application.md#application-modes), stop, or unload. When something goes wrong at the application level, this is where you act. But most investigation happens one level deeper, at individual processes.
+
+## Processes
+
+The processes page is where you spend most of your time when investigating issues.
+
+Every process on the node appears in a table that updates every second. The columns cover identification (PID, name, behavior, application), messaging (messages in/out, mailbox depth, latency), and lifecycle (running time, init time, wakeups, uptime, state). This is enough to answer most diagnostic questions without opening individual process details.
+
+When message counts change between updates, a green delta indicator appears next to the number. A "+42" next to Messages In tells you this process received 42 messages in the last second. The mailbox column changes color as the queue grows, making overloaded processes visually obvious in a list of thousands. The state column shows how long the process has been in its current state. A process stuck in "running" for 30 seconds is probably blocked inside a handler.
+
+All columns are sortable. Clicking Messages In sorts by busiest processes. Clicking Mailbox puts the most backlogged processes at the top. Clicking Running Time reveals which processes spend the most time executing handlers.
+
+Click any PID to open a floating detail window.
+
+### Scope
+
+The table does not show all processes at once. What you see is controlled by the Scope panel, which determines what the server sends to the browser.
+
+The scope works in two modes. In the default mode, you choose a window into the process ID space: "first 500" returns the 500 oldest processes (lowest PIDs), "last 500" returns the 500 newest, and entering a specific PID starts the window from that point. The node scans only the requested range and applies filters within it. This is fast even on nodes with tens of thousands of processes because the node never iterates beyond the requested window.
+
+The "All" mode switches to a full scan: the node iterates all processes, applies filters during iteration, and returns up to 10,000 matches. This mode requires at least one filter to be active to prevent the browser from receiving an unmanageable amount of data.
+
+Filters narrow results by name, behavior type, application, state, or minimum mailbox depth. In windowed mode, filters reduce the result count within the window. In All mode, filters are applied during the scan so only matching processes are counted toward the limit.
+
+Active filters appear as removable chips in the toolbar, and the scope label shows a compact summary like `first 500` or `last 100 · name:"worker"`. A separate search field adds client-side regex filtering on top of the server results for quick ad-hoc lookups without changing the scope.
+
+
+Processes page with scope panel
+
+
+
+**Mailbox.** Total messages across all four mailbox queues (Main, System, Urgent, Log). Changes color as the queue grows: yellow for moderate, red for deep backlog.
+
+**Latency.** Time between a message entering the mailbox and the process starting to handle it. High latency means the process has a backlog and incoming messages are waiting. Requires the `latency` build tag to be enabled (see [Debugging](debugging.md)).
+
+**Running Time.** Total time spent inside handler callbacks (HandleMessage, HandleCall). High running time relative to uptime means the process spends most of its life executing handlers, whether due to computation or blocking I/O.
+
+**Init Time.** Time spent in the `Init` callback during startup. Highlighted red if over one second. Keep initialization fast: spawn has a timeout, and under a supervisor a slow Init blocks the restart of sibling processes.
+
+**Wakeups.** How many times the process was activated to handle messages. Each activation processes one batch from the mailbox. A high wakeup count with low message counts can indicate many small deliveries.
+
+
+
+## Process Details
+
+Floating detail windows are the primary tool for investigating individual processes. Multiple windows can be open simultaneously. They persist when you switch between pages, so you can keep a problematic process open while you check logs or traces elsewhere.
+
+The overview tab shows two real-time charts. The messages chart tracks incoming and outgoing message rates over the last 60 seconds, with a toggle between rate and cumulative views. The mailbox chart tracks the four queue depths: Main, System, Urgent, and Log. Below the charts, cards show running time, init time, and uptime. If the init time is suspiciously long, you know the process took a while to start. If the running time is high relative to uptime, the process is spending most of its life inside handlers rather than waiting for messages. The parent and leader processes appear as clickable links that open their own windows.
+
+The relations tab reveals the process's connections: aliases it has registered, meta processes it owns, events it has created, and its links and monitors grouped by type. This is valuable when you need to understand the supervision tree or figure out which processes will be affected if this one terminates.
+
+The inspect tab shows the output of the process's `HandleInspect` callback as key-value pairs. If your actor implements this method, it can expose internal state: queue lengths, cache sizes, connection counts, or any application-specific metrics. Auto-refresh polls the process once per second.
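
A self-contained sketch of the idea: the actor returns internal state as string key-value pairs, which is what the inspect tab renders. The `pid` stub and the method shape below are illustrative stand-ins, not the framework's exact `HandleInspect` signature:

```go
package main

import "fmt"

// pid stands in for the framework's gen.PID type; illustration only.
type pid struct {
	node string
	id   uint64
}

// CacheActor keeps private counters it wants to expose to the Observer.
type CacheActor struct {
	hits, misses int
}

// handleInspect returns internal state as key-value pairs.
func (c *CacheActor) handleInspect(from pid, item ...string) map[string]string {
	return map[string]string{
		"hits":   fmt.Sprintf("%d", c.hits),
		"misses": fmt.Sprintf("%d", c.misses),
	}
}

func main() {
	c := &CacheActor{hits: 42, misses: 3}
	fmt.Println(c.handleInspect(pid{}))
}
```

Keeping the returned values cheap to compute matters here, since auto-refresh polls the process once per second.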
+
+### Managing a Process
+
+The config tab lets you change settings that take effect immediately. You can raise the log level to get more verbose output from a specific process, enable compression for network messages, change the tracing sampler for targeted diagnostics, or adjust message priority and delivery guarantees. The environment variables section is available if the node has `ExposeEnvInfo` enabled in its security settings.
+
+Three action buttons let you interact with the process. Send Message opens a dialog with a text field; the message is sent as a string value to the process. Send Exit sends an exit signal with a configurable reason. Kill forcefully terminates the process. These actions are disabled for system processes.
+
+
+Process detail window
+
+
+
+
+
+## Events
+
+The events page works like the processes page: it shows only what the scope defines, not the full list.
+
+Each row includes the event name, the producer process, registration time, subscriber count, and message statistics. Delta indicators highlight which events are actively publishing. The default sort is by registration time, newest first.
+
+The Scope panel controls which events the server returns. The From control chooses between First (oldest registered) and Last (newest registered). The node iterates events in registration order and stops after collecting the requested number of matches. Filters narrow by name, notify mode, buffered mode, and minimum subscriber count, and are applied during iteration so only matching events count toward the limit.
+
+Three toggle buttons in the toolbar control how the Registered column displays timestamps: 24h/12h clock format, raw millisecond timestamps for precise correlation, and an optional date prefix. These settings are shared with the Log and Tracing pages.
+
+
+Events page
+
+
+
+**Published.** Total number of times PublishEvent was called by the producer. Each call increments this counter once regardless of how many subscribers receive the message.
+
+**Local Sent.** Total messages delivered to local subscribers. If one publish reaches 5 local subscribers, this increments by 5.
+
+**Remote Sent.** Total messages sent to remote nodes. Counted per remote node, not per subscriber. If a remote node has 10 subscribers, this increments by 1 because the framework uses [shared subscriptions](pub-sub-internals.md#network-optimization-shared-subscriptions) to send one message per node.
+
+**Fanout.** Ratio of Local Sent to Published. Shows the average number of local deliveries per publish. A fanout of 3.0 means each publish reaches about 3 local subscribers.
+
+**Buffer.** Current messages in the event's ring buffer / buffer capacity. [Buffered events](pub-sub-internals.md#buffered-events-partial-optimization) retain recent messages so that new subscribers receive catch-up data. Yellow highlight if the buffer has pending messages.
+
+**Notify.** Whether the producer receives [notifications](pub-sub-internals.md#producer-notifications) (`MessageEventStart`/`MessageEventStop`) when the first subscriber arrives or the last subscriber leaves.
+
+
+
+## Network
+
+The network page shows how the node connects to the rest of the cluster.
+
+The top section displays network configuration: mode, max message size, handshake and protocol versions, and negotiated flags. The registrar section shows the service discovery backend with its capabilities.
+
+The acceptors section lists network listeners with their addresses, TLS configuration, and per-acceptor flags.
+
+Below the acceptors, the page splits into three tabs.
+
+The **Connections** tab is the default view. Four real-time charts show aggregate traffic across all connections: messages per second (in/out), bytes per second (in/out), compression operations per second (sent/received), and fragmentation operations per second (sent/received). A connection list table with its own scope controls shows all connections with delta indicators for message and byte counts. Click a row to open a floating window with detailed connection statistics.
+
+The **Routes** tab shows configured static routes and proxy routes side by side. Static routes are user-defined patterns that tell the node where to dial when a name matches; proxy routes describe how to reach nodes via an intermediate proxy.
+
+The **Types** tab is a one-shot view of the wire-format type registry. Each row shows registration ID, owning proto (the protocol version that registered the type), kind, MinSize (wire size of a zero-value), and canonical name. Click a row to expand its inferred schema (Go-syntax shape, multi-line for structs). Two filters at the top of the panel narrow the list by name and by schema content (useful for finding all types containing a specific field). The Refresh button re-fetches the registry; the panel does not subscribe to live updates because the registry rarely changes after node startup.
+
+When the node is built with `-tags=typestats`, four additional columns appear: **Encoded** and **Decoded** (operation counts), **Bytes Out** and **Bytes In** (decompressed wire-byte totals with average per operation). Counters reflect only root encode/decode at the message boundary; bytes folded inside other messages are accounted to the parent type. See [The typestats Tag](debugging.md#the-typestats-tag) for what gets counted and how to use the averages to pick compression candidates.
+
+The cluster nodes section shows all nodes known through the registrar or active connections, giving you a picture of the cluster topology.
+
+
+Network page
+
+
+
+**Node.** Contains several elements: a direction arrow, the node name, a CRC32 badge, and a TLS badge. The blue arrow (up-right) means the connection was initiated by this node (outgoing). The green arrow (down-left) means the connection was accepted from the remote node (incoming). The badge shows "TLS" if the connection uses TLS or "Plain" if it does not.
+
+**Node Uptime / Connection Uptime.** Node uptime is how long the remote node has been running. Connection uptime is how long this specific connection has been active. If the connection was recently re-established after a network issue, connection uptime will be shorter than node uptime.
+
+**Pool.** Number of TCP connections in the ENP protocol pool for this logical connection. Higher pool size allows more parallel message delivery.
+
+**Reconnections.** How many times the connection was re-established. Non-zero values are highlighted in red. Frequent reconnections may indicate network instability.
+
+**Clock Skew.** Measured difference between the local and remote node clocks. Used by the tracing waterfall to compensate for clock drift when displaying cross-node traces.
+
+
+
+### Connection Details
+
+Clicking a connection row opens a floating window with full connection information.
+
+At the top, four metric cards show messages and bytes in each direction. The identity section shows node and connection uptimes, framework and protocol versions, max message size, and negotiated network flags as colored pills (Remote Spawn, Fragmentation, Important Delivery, etc.). Each flag shows green if both nodes agreed to enable it.
+
+Below the identity section, the pool size and reconnection counter are shown. For outgoing connections, the Pool DSN lists the addresses of TCP connections in the pool.
+
+Two real-time charts track messages per second and bytes per second in each direction. If the connection carries proxy traffic, a third chart shows transit throughput.
+
+The compression section shows how many messages were compressed and decompressed, the compression ratio, and total bytes saved. The fragmentation section shows fragment counts and reassembly timeouts. These sections help diagnose whether compression and fragmentation are working efficiently or causing overhead.
+
+A "Switch observer to this node" button lets you start inspecting the remote node directly.
+
+
+Connection detail window
+
+
+
+
+
+## Log
+
+The log page captures log messages in real time from every source on the node: processes, meta processes, the node itself, and the network stack.
+
+Each log entry shows a timestamp, severity level (color-coded badge), source, registered name, behavior type, and message text. The source column identifies where the message came from: a process PID, meta-process alias, node CRC, or network peer, each with its own color. The rich source toggle adds a type icon and makes the source clickable, opening a floating window for the process, meta-process, or network connection that generated the message. Long messages (over 200 characters or containing newlines) are truncated to three lines and expandable with a click. If the log entry carries structured fields, they appear below the message as key=value pairs.
+
+The Scope panel controls what the server captures. Level toggle buttons let you enable or disable each severity independently. This is server-side filtering: disabling debug means the server stops collecting debug messages entirely, reducing overhead on the node. Additional filters match against source, behavior, field names/values, and message text, with an exclude mode to filter out noise. The limit controls the ring buffer size.
+
+The Play/Pause button stops log capture without disconnecting. When you spot something interesting, pause and read through existing entries without new messages pushing them away.
+
+When the server drops messages because the ring buffer is full, a suppressed count indicator appears as a yellow alert in the toolbar. If you see this frequently, increase the limit in the scope panel.
+
+
+Log page
+
+
+
+
+
+## Profiler
+
+The profiler page has two tabs and a GC Pressure section that is always visible at the top. The key difference between the tabs: the Heap tab updates continuously via a live subscription, while the Goroutines tab captures snapshots on demand when you press the Capture button.
+
+The GC Pressure section shows four real-time charts: allocation rate (objects per second), dead rate (objects collected per second), live ratio (percentage of allocated objects still alive), and GC CPU fraction (percentage of CPU spent in garbage collection). These help you spot memory pressure trends before they become problems.
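
For intuition, similar ratios can be derived in plain Go from `runtime.MemStats`. This is a hedged sketch of the idea, not Observer's implementation; the per-second rate charts would additionally require deltas between two samples taken over time:

```go
package main

import (
	"fmt"
	"runtime"
)

// gcPressure derives rough pressure indicators from runtime.MemStats:
// liveObjects - heap objects still alive (cumulative mallocs minus frees),
// liveRatio   - fraction of ever-allocated objects still alive,
// gcCPU       - fraction of available CPU spent in GC since start.
func gcPressure() (liveObjects uint64, liveRatio float64, gcCPU float64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	liveObjects = m.Mallocs - m.Frees
	if m.Mallocs > 0 {
		liveRatio = float64(liveObjects) / float64(m.Mallocs)
	}
	return liveObjects, liveRatio, m.GCCPUFraction
}

func main() {
	live, ratio, gc := gcPressure()
	fmt.Printf("live objects=%d live ratio=%.2f gc cpu=%.4f\n", live, ratio, gc)
}
```

A live ratio that keeps climbing while the allocation rate stays flat is the pattern worth catching early: more of what you allocate is being retained.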
+
+### Heap
+
+The Heap tab updates continuously and shows allocation records sorted by in-use bytes. Each record shows in-use bytes, in-use objects, total allocated bytes, total allocated objects, and the function name (the first non-runtime function in the allocation stack). Expanding a record reveals the full stack trace. A scope panel filters by function name and limits how many records the server returns. A Pause button freezes the current data so you can examine it without updates overwriting what you are reading.
+
+Use the heap view when memory grows unexpectedly. The allocation stack traces show exactly which code paths are responsible. If a single function dominates the in-use bytes, that is your starting point.
+
+### Goroutines
+
+The Goroutines tab captures snapshots on demand. Press the Capture button to take a goroutine dump. The dump groups goroutines by their call stack: if 500 goroutines are all blocked on the same channel receive, they appear as one group with count 500. Each group shows the count, state (running, IO wait, chan receive, select, sleep, semacquire), wait duration (color-coded: green under 60s, yellow under 5 minutes, red above), and two function names: Origin (where the goroutine was spawned) and Current (where it is now). Expanding a group reveals the full stack trace and goroutine IDs. A scope panel filters the server-side capture by stack content, state, and minimum wait time. A search field filters the captured results client-side.
+
+This is how you diagnose deadlocks and blocking. Filter by state to isolate goroutines stuck in "chan receive". Search by package name to find goroutines from specific actors. A large group with a long wait time in a state that should be transient usually points directly at the problem.
+
+
+Profiler
+
+
+
+
+
+## Tracing
+
+The tracing page shows distributed traces. Traces are collected continuously while Observer is connected, so data is already available when you navigate here. For background on how tracing works, see [Distributed Tracing](distributed-tracing.md).
+
+Because Observer connects to one node at a time, it shows only the observations emitted on that node. For complete cross-cluster traces, use [Pulse](../extra-library/applications/pulse.md) with Grafana Tempo or Jaeger.
+
+### Trace List
+
+Traces are sorted newest first. Each row shows a copyable trace ID, timestamp, root process PID with the root message type, an error icon if any span recorded an error, the span count, a duration bar showing this trace's duration as a proportion of the longest trace in the current scope buffer (red if any span recorded an error), and the total duration.
+
+The search field filters across trace ID, root process, root message, root node, and within spans across span ID, from, to, message text, and attribute keys and values. The Pause button stops the page from accepting new traces until resumed. The Clear button removes all collected traces.
+
+### Waterfall
+
+Click a trace row to expand its waterfall. The waterfall groups all observation points (Sent, Delivered, Processed) for the same message into a single row and arranges rows in a tree by parent-child relationships, with indentation showing the causal chain.
+
+Each row shows a color-coded kind badge (SEND in blue, CALL in violet, RESP in green, SPAWN in amber, TERM in red), the sender and receiver PIDs with their behavior types, the message type, and a timeline bar. The bar renders two phases: a lighter segment for transit time (Sent to Delivered) and a full segment for processing time (Delivered to Processed). Three dot markers show the observation points: blue for Sent, green for Delivered, orange for Processed.
+
+Hovering over the bar shows a tooltip with the node name at each point and the transit and processing durations. The duration column shows both the total and the breakdown. For cross-node spans (where Sent and Delivered happen on different nodes), the transit time calculation subtracts the measured clock skew between the nodes to show a more accurate transit duration.
+
+Local process PIDs in the waterfall are clickable and open detail windows.
+
+Click a span row to expand its detail panel. The panel has two columns: the left shows span fields (trace ID, span ID, kind, which points are present, behavior, from, to, call reference, message, node names, error) and the right shows custom attributes merged from all observation points. All values are copyable.
+
+Expanded traces persist when switching to other pages and back.
+
+### Scope
+
+The Scope panel has toggle buttons for span kinds (SEND, CALL, RESP, SPAWN, TERM) and observation points (Sent, Delivered, Processed). Disabled items appear with strikethrough. A message pattern filter matches against message type and error text, with an exclude toggle that inverts the match. The buffer limit controls how many traces are kept. Active scope filters appear as removable pills below the toolbar with a "Clear all" link.
+
+
+Tracing page with waterfall
+
+
+
+
diff --git a/docs/advanced/pub-sub-internals.md b/docs/advanced/pub-sub-internals.md
index 770b995aa..00d212229 100644
--- a/docs/advanced/pub-sub-internals.md
+++ b/docs/advanced/pub-sub-internals.md
@@ -442,7 +442,7 @@ process.SendEvent("market.prices", token, update)
**What subscribers see:**
```go
-func (c *Consumer) HandleEvent(message gen.MessageEvent) error {
+func (c *Consumer) HandleEvent(event gen.MessageEvent) error {
// Event arrives in your mailbox
// Same timing whether you're the only subscriber or one of thousands
// Same timing whether producer is local or remote
@@ -724,6 +724,8 @@ func (p *Producer) HandleMessage(from gen.PID, message any) error {
You only receive notifications when crossing the zero threshold. The notifications answer: "is anyone listening?" - not "how many are listening?"
+Node-level events do not produce these notifications. The producer of a node-level event is the node core, which does not consume `MessageEventStart` or `MessageEventStop` messages.
+
### Practical Use Case: On-Demand Data Production
```go
diff --git a/docs/ai-agents.md b/docs/ai-agents.md
new file mode 100644
index 000000000..63b90dd08
--- /dev/null
+++ b/docs/ai-agents.md
@@ -0,0 +1,229 @@
+---
+description: Build, run, and diagnose multi-agent AI systems on Ergo Framework
+---
+
+# AI Agents
+
+Modern AI systems are multi-agent by nature. A research agent delegates to an analysis agent. A planner coordinates with executors. A conversation manager spawns short-lived task agents. Moving from a demo with a handful of agents to a production deployment with hundreds or thousands surfaces the same problems that distributed systems have solved for decades: crash isolation, supervision, cross-node coordination, observability at scale.
+
+Ergo was built for telecom workloads where these requirements are baseline. AI agents have the same profile: many concurrent isolated workers with fault tolerance, coordination, and real-time behavior. This page shows how to use Ergo as a runtime for your agents and as a live diagnostic surface for the running system.
+
+## Why Ergo fits AI agents
+
+Four problems appear as soon as you move AI agents out of a notebook:
+
+**Agent crashes.** One stuck LLM call or panicking tool handler takes down the whole process. Everything running in that process dies with it.
+
+**Coordination.** Agents need to talk to each other. Without a framework this becomes a web of channels, shared state, and custom routing code.
+
+**Observability.** You can't see what's happening inside a running agent system: mailbox depth, per-agent CPU, which agents are waiting on which external calls, where cascade failures originate.
+
+**Scaling.** Distributing agents across nodes requires rethinking addressing, message delivery, and failure semantics.
+
+Ergo addresses all four: isolated processes with supervision, named event streams for coordination, a built-in MCP diagnostic surface, and network-transparent PIDs. The design choices were made for telecom-class distributed systems; the fit to AI workloads is incidental, but exact.
+
+## Your agent as an actor
+
+An AI agent in Ergo is just an actor: a process with private state and a mailbox, handling messages sequentially.
+
+```go
+type ResearchAgent struct {
+ act.Actor
+ notes []string
+}
+
+type MessageResearchTask struct {
+ Query string
+ ReplyTo gen.PID
+}
+
+func (a *ResearchAgent) HandleMessage(from gen.PID, msg any) error {
+ switch m := msg.(type) {
+ case MessageResearchTask:
+ result := callLLM(m.Query) // blocking call, isolated per agent
+ a.notes = append(a.notes, result)
+ a.Send(m.ReplyTo, result)
+ }
+ return nil
+}
+
+func factory_ResearchAgent() gen.ProcessBehavior { return &ResearchAgent{} }
+```
+
+What you get automatically:
+
+- **Crash isolation.** A panicking LLM call or tool handler terminates only this actor. See [Process](basics/process.md).
+- **Supervision.** Put the agent under a supervisor and it restarts on failure with your chosen strategy. See [Supervisor](actors/supervisor.md).
+- **Distributed addressability.** Each agent has a PID that works across nodes. See [Remote Spawn Process](networking/remote-spawn-process.md).
+- **Event-based coordination.** Agents publish to and subscribe to named event streams, fanning out one network message per node instead of one per subscriber. See [Events](basics/events.md).
+- **Live diagnostics.** Expose the running system to any AI assistant via the [MCP application](extra-library/applications/mcp.md).
+
+The actor's private state (`notes` in the example) is safe without any synchronization. Messages arrive one at a time. The actor never shares memory with anyone.
+
+## Multi-agent architecture patterns
+
+### Agent Pool
+
+Run N identical worker agents and distribute incoming tasks across them. Ideal for stateless agents that process requests in parallel (LLM calls, embedding lookups, tool invocations).
+
+```go
+type AgentPool struct {
+ act.Pool
+}
+
+func (p *AgentPool) Init(args ...any) (act.PoolOptions, error) {
+ return act.PoolOptions{
+ PoolSize: 10,
+ WorkerFactory: factory_ResearchAgent,
+ }, nil
+}
+
+func factory_AgentPool() gen.ProcessBehavior { return &AgentPool{} }
+
+// Spawn the pool
+poolPID, _ := node.Spawn(factory_AgentPool, gen.ProcessOptions{})
+
+// Send tasks. The pool forwards to an available worker automatically.
+node.Send(poolPID, MessageResearchTask{Query: "Summarize Q3 report"})
+```
+
+Pool size and worker mailbox size together form a natural rate limit: at most `PoolSize × WorkerMailboxSize` tasks in flight. See [Pool](actors/pool.md).
+
+### Agent Pipeline
+
+Chain agents by sending from one stage to the next. Each stage runs under a supervisor. Failure in any stage is isolated and restarted.
+
+```go
+// ResearchAgent forwards its result to AnalysisAgent
+func (a *ResearchAgent) HandleMessage(from gen.PID, msg any) error {
+ switch m := msg.(type) {
+ case MessageResearchTask:
+ findings := a.research(m.Query)
+ a.Send(a.analysisPID, MessageAnalyze{Findings: findings, ReplyTo: m.ReplyTo})
+ }
+ return nil
+}
+```
+
+If `AnalysisAgent` crashes, the supervisor restarts it without affecting the other stages. Pipelines compose naturally with pools: a stage can be a single actor or a pool of identical workers.
+
+### Distributed Agent Cluster
+
+Spawn agents on specific nodes and address them with the same API as local agents.
+
+```go
+// Register the factory on the target node. Security: only named factories
+// can be spawned remotely.
+network.EnableSpawn("research-agent", factory_ResearchAgent)
+
+// From any other node, get a handle and spawn
+remote, _ := node.Network().GetNode("worker@otherhost")
+pid, _ := remote.Spawn("research-agent", gen.ProcessOptions{})
+
+// Send works identically whether pid is local or remote
+node.Send(pid, MessageResearchTask{Query: "..."})
+```
+
+See [Remote Spawn Process](networking/remote-spawn-process.md) for the security model and application-level inheritance.
+
+### Event-Driven Coordination
+
+Agents communicate through named event streams. One producer, any number of subscribers on any nodes.
+
+```go
+// Producer: research agent publishes findings
+token, _ := producer.RegisterEvent("research.findings", gen.EventOptions{})
+producer.SendEvent("research.findings", token, Finding{Topic: "market-trends"})
+
+// Subscribers on any nodes
+process.MonitorEvent(gen.Event{Name: "research.findings", Node: "research@host"})
+```
+
+The framework delivers one network message per subscriber node regardless of how many subscribers that node has. 1M subscribers across 10 nodes cost 10 network messages, not 1M. See [Events](basics/events.md) and [Pub/Sub Internals](advanced/pub-sub-internals.md).
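
The arithmetic can be sketched as counting distinct subscriber nodes; the function name below is illustrative, not the framework's API:

```go
package main

import "fmt"

// networkMessagesPerPublish models shared subscriptions: one publish costs
// one network message per subscriber *node*, not per subscriber.
func networkMessagesPerPublish(subscriberNodes []string) int {
	distinct := map[string]struct{}{}
	for _, node := range subscriberNodes {
		distinct[node] = struct{}{}
	}
	return len(distinct)
}

func main() {
	// 1,000,000 subscribers spread over 10 nodes...
	subs := make([]string, 0, 1_000_000)
	for i := 0; i < 1_000_000; i++ {
		subs = append(subs, fmt.Sprintf("node-%d@host", i%10))
	}
	fmt.Println(networkMessagesPerPublish(subs)) // 10
}
```

Each receiving node fans the single message out to its local subscribers, so the per-publish network cost grows with cluster size, not with subscriber count.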
+
+## Live diagnostics for AI systems
+
+AI agents are nondeterministic. Behavior depends on prompts, external API latency, model temperature, and tool responses. Predefined metrics cover known failure modes, but the interesting failures are the ones you didn't anticipate.
+
+Add the [MCP application](extra-library/applications/mcp.md) to your node:
+
+```go
+import "ergo.services/application/mcp"
+
+node, _ := ergo.StartNode("mynode@localhost", gen.NodeOptions{
+ Applications: []gen.ApplicationBehavior{
+ mcp.CreateApp(mcp.Options{Port: 9922}),
+ },
+})
+```
+
+Connect Claude Code (or any MCP-compatible client):
+
+```
+claude mcp add --transport http ergo http://localhost:9922/mcp
+```
+
+Now you describe a symptom in plain English and the AI runs a diagnostic sequence against the live system:
+
+```
+You: "Why is the order processing agent slow?"
+
+AI: Checking process list sorted by mailbox...
+ -> order_processor has 847 queued messages (normal: <10)
+ Inspecting order_processor upstream dependencies...
+ -> payment_validator is processing 1 message per 3.2 seconds
+ Checking payment_validator CPU profile...
+ -> 73% time in external_api.Call(). The payment API is the bottleneck.
+```
+
+The MCP application exposes 48 diagnostic tools covering process inspection, profiling, cluster visibility, and samplers (continuous data collection into ring buffers for trend analysis). One entry point node gives access to every node in the cluster. Other nodes run MCP in agent mode without exposing an HTTP port.
+
+For the full toolkit, cluster proxy mechanics, sampler recipes, profiling options, and Claude Code plugin configuration, see [MCP](extra-library/applications/mcp.md).
+
+## Getting started
+
+```
+# Install the project generator
+go install ergo.tools/ergo@latest
+
+# Create a project
+ergo init AgentNode github.com/myorg/agentnode
+cd agentnode
+
+# Add components
+ergo add supervisor AgentNodeApp:AgentSup
+ergo add actor AgentSup:ResearchAgent
+ergo add actor AgentSup:AnalysisAgent
+
+# Run
+go run ./cmd
+```
+
+Add MCP to the generated node setup:
+
+```go
+import "ergo.services/application/mcp"
+
+options.Applications = []gen.ApplicationBehavior{
+ agentnodeapp.CreateApp(),
+ mcp.CreateApp(mcp.Options{Port: 9922}),
+}
+```
+
+Connect your AI assistant and start investigating:
+
+```
+claude mcp add --transport http ergo http://localhost:9922/mcp
+```
+
+## Cloud-connected agents
+
+Running agents across AWS, GCP, Azure, or bare metal is supported via [ergo.cloud](https://ergo.cloud), a managed overlay network that connects nodes without VPNs, proxies, or tunnels. End-to-end encrypted. Currently available via waitlist.
+
+## Next steps
+
+- [Process](basics/process.md) for the actor lifecycle
+- [Supervisor](actors/supervisor.md) for restart strategies
+- [Events](basics/events.md) for pub/sub coordination
+- [MCP](extra-library/applications/mcp.md) for live diagnostics and AI-driven investigation
+- [Examples](https://github.com/ergo-services/examples) for working reference projects
diff --git a/docs/basics/events.md b/docs/basics/events.md
index b51b89670..ef0dd758d 100644
--- a/docs/basics/events.md
+++ b/docs/basics/events.md
@@ -21,12 +21,33 @@ token, err := process.RegisterEvent("price_update", gen.EventOptions{
})
```
-The `Notify` option controls whether the producer receives notifications about subscriber changes. When enabled, the producer receives `gen.MessageEventStart` when the first subscriber appears and `gen.MessageEventStop` when the last subscriber leaves. This allows the producer to start or stop expensive operations based on demand. If nobody's watching the price feed, why fetch prices?
+The `Notify` option controls whether the producer receives notifications about subscriber changes. When enabled, the producer receives `gen.MessageEventStart` when the first subscriber appears and `gen.MessageEventStop` when the last subscriber leaves. This allows the producer to start or stop expensive operations based on demand. If nobody's watching the price feed, why fetch prices? This option is ignored for events registered at the node level, since the node core does not consume such messages.
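The demand-driven pattern looks roughly like this (a sketch; `PriceFeed`, `startFetching`, and `stopFetching` are hypothetical names standing in for your own producer logic):

```go
// Producer registered the event with Notify: true. The framework
// delivers these messages when the subscriber count crosses zero.
func (p *PriceFeed) HandleMessage(from gen.PID, message any) error {
	switch message.(type) {
	case gen.MessageEventStart:
		p.startFetching() // first subscriber appeared
	case gen.MessageEventStop:
		p.stopFetching() // last subscriber left
	}
	return nil
}
```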
The `Buffer` option specifies how many recent events to keep. When a new subscriber joins, it receives the buffered events as a catch-up mechanism. Set this to zero if events are only relevant at the moment they're published. Set it to a reasonable number if new subscribers should see recent history.
Events are identified by name and node. The combination must be unique. Two processes on the same node can't register events with the same name. But processes on different nodes can register events with the same name - they're different events.
+## Open Events
+
+*Introduced in v3.3.0.*
+
+By default, only the token holder can publish to an event. This prevents unauthorized processes from publishing events they don't own. For events that represent an internal node-wide bus, this protection is sometimes more friction than benefit. You end up distributing the token across multiple processes, or through environment variables, just so known participants can publish.
+
+The `Open` option disables the token check on publish.
+
+```go
+token, _ := process.RegisterEvent("app.events", gen.EventOptions{
+ Open: true,
+ Buffer: 50,
+})
+```
+
+Any local process can now publish to this event by name, regardless of the token value. The owner check on `UnregisterEvent` is unaffected. Only the registering process (or the node, for node-level events) can unregister.
+
+Consider a bus inside the node where application events land: "order created", "user signed up", "payment received". They come from different modules, and subscribers (notifier, analytics, search indexer) pick up whichever ones matter. Nobody owns the bus. Handing a shared token to every emitter is plumbing that protects nothing.
+
+Open events trade the typo and bug protection that the token provides for simpler distribution. A process can accidentally publish to an event it was never supposed to touch. Use this option when the event is deliberately a shared bus and the token ceremony adds no real security in your context.
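Publishing to an open event is then a plain `SendEvent` by name from any local process. A sketch, assuming the token argument (normally the value returned by `RegisterEvent`) can be passed as the zero `gen.Ref` since open events skip the check; `OrderCreated` is a hypothetical payload type:

```go
// No token distribution needed: the event name is the only contract.
process.SendEvent("app.events", gen.Ref{}, OrderCreated{ID: "o-1042"})
```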
+
## Publishing Events
Publishing an event sends it to all current subscribers.
@@ -75,6 +96,45 @@ The producer can explicitly unregister an event with `UnregisterEvent`. This tri
If a subscriber terminates or unsubscribes (via `UnlinkEvent` or `DemonitorEvent`), the producer doesn't receive notification unless `Notify` was enabled. With `Notify`, the producer receives `gen.MessageEventStop` when the last subscriber leaves.
+## Node-Level Events
+
+All examples so far registered events from a process. That process is the producer, and the event exists only as long as the process runs. When the process terminates, the event is unregistered and subscribers receive termination notifications. If another process later registers the same event name, subscribers must subscribe again.
+
+Some events belong to the node itself, not to any particular process. Application events, health signals, internal buses. You want these events to exist for the entire lifetime of the node, regardless of which process currently publishes. The `gen.Node` interface provides `RegisterEvent` for this.
+
+```go
+token, err := node.RegisterEvent("notifications", gen.EventOptions{
+ Open: true,
+ Buffer: 100,
+})
+```
+
+The event's producer is the node core. It survives any publisher process coming and going. A process that restarts continues publishing to the same event after restart. Subscribers are not affected.
+
+### Race on Subscription
+
+*Introduced in v3.3.0.*
+
+There is a timing problem with process-registered events. If a subscriber's `Init()` tries to `LinkEvent` before the producer process has called `RegisterEvent`, the link fails with `gen.ErrEventUnknown`. The subscriber then needs retry logic or some other coordination mechanism.
+
+The `NodeOptions.Events` field registers node-level events before any application is started.
+
+```go
+options := gen.NodeOptions{
+ Events: []gen.NodeEventSpec{
+ {Name: "notifications", Buffer: 100},
+ {Name: "audit", Buffer: 10},
+ },
+ Applications: []gen.ApplicationBehavior{...},
+}
+
+node, err := ergo.StartNode("mynode@localhost", options)
+```
+
+Events declared here are registered as open events with the node as producer. By the time the first application starts, these events already exist. Any process can subscribe from `Init()` without a race. Any process can publish by name.
+
+If your event requires the token check and you only want a specific process to publish, register it imperatively via `node.RegisterEvent(..., gen.EventOptions{Open: false})` and distribute the token through environment variables or process arguments.
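With the event declared in `NodeOptions.Events`, a subscriber can link from `Init()` with no retry logic. A sketch, assuming `LinkEvent` returns the buffered backlog along with an error:

```go
func (n *Notifier) Init(args ...any) error {
	// The node-level event exists before any application starts,
	// so gen.ErrEventUnknown cannot happen here.
	backlog, err := n.LinkEvent(gen.Event{Name: "notifications", Node: n.Node().Name()})
	if err != nil {
		return err
	}
	for _, ev := range backlog {
		_ = ev // handle catch-up events from the buffer
	}
	return nil
}
```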
+
## Network Transparency
Events work across nodes seamlessly. A producer on node A can publish events that subscribers on nodes B, C, and D receive. The framework handles the network distribution.
@@ -104,6 +164,38 @@ Each `gen.MessageEvent` contains:
Subscribers receive these wrapped messages and extract the application data. The wrapping provides context: which event this came from, when it was published, allowing subscribers to handle events from multiple sources or correlate timing.
+## Event Statistics
+
+Each registered event tracks per-event counters: how many messages were published, how many were delivered to local subscribers, and how many were sent to remote nodes. These counters are available through `Node.EventInfo` and `Node.EventRangeInfo`.
+
+To query a specific event:
+
+```go
+info, err := node.EventInfo(gen.Event{Name: "price_update", Node: "node@host"})
+// info.MessagesPublished - total messages published to this event
+// info.MessagesLocalSent - messages delivered to local subscribers
+// info.MessagesRemoteSent - messages sent to remote subscriber nodes
+// info.Subscribers - current subscriber count
+```
+
+To iterate over all registered events on the node:
+
+```go
+node.EventRangeInfo(func(info gen.EventInfo) bool {
+ fmt.Printf("event %s: published %d, local %d, remote %d\n",
+ info.Event.Name,
+ info.MessagesPublished,
+ info.MessagesLocalSent,
+ info.MessagesRemoteSent,
+ )
+ return true // continue iteration
+})
+```
+
+Node-level aggregate counters are also available in `gen.NodeInfo` via `node.Info()`: `EventsPublished` (local producer publishes), `EventsReceived` (events arriving from remote nodes), `EventsLocalSent`, and `EventsRemoteSent`.
+
+The [Metrics actor](../extra-library/actors/metrics.md) automatically exports these counters as Prometheus metrics, along with per-event top-N breakdowns by subscriber count, messages published, local deliveries, and remote sends. It also tracks event utilization state: whether events are actively used, waiting on demand, or idle.
+
## Practical Patterns
Events fit several common scenarios.
diff --git a/docs/basics/logging.md b/docs/basics/logging.md
index fd0355c64..8eeb47a08 100644
--- a/docs/basics/logging.md
+++ b/docs/basics/logging.md
@@ -36,7 +36,7 @@ The framework provides six severity levels, ordered from most to least verbose:
`gen.LogLevelError` - Errors that prevent specific operations but don't crash the system. Failed requests, unavailable resources, validation failures.
-`gen.LogLevelPanic` - Critical errors requiring immediate attention. Despite the name, logging at this level doesn't trigger a panic - it's just the highest severity marker.
+`gen.LogLevelPanic` - Recovered panics inside actor callbacks. The highest severity marker.
Setting a level creates a threshold. Set a process to `gen.LogLevelWarning` and it logs warnings, errors, and panics, but suppresses info, debug, and trace. Each level implicitly includes all higher severity levels.
@@ -48,6 +48,8 @@ Two special levels control behavior rather than representing severity:
Trace deserves special mention. It's so verbose that enabling it accidentally could flood storage. You can't enable it dynamically via `SetLevel`. It must be set at startup through `gen.NodeOptions.Log.Level` or `gen.ProcessOptions.LogLevel`. This restriction prevents operational mistakes.
+Panic also deserves explanation. The framework recovers Go panics that occur inside actor callbacks and logs them at this level. A nil pointer dereference in `HandleMessage`, a failed type assertion in `HandleCall`, an index out of bounds in `Init` - these are structural problems in actor code, not operational failures. Logging them at Panic level separates them from the business and technical errors you log at Error level. The framework catches these so your node keeps running, but the Panic log entry tells you something in your code needs fixing.
+
+Note that Go's standard library `log.Panic()` actually triggers a panic, while Ergo's `Log().Panic()` simply logs at the Panic severity level without panicking. If you build actors with `act.Actor`, `act.Supervisor`, or `act.Pool`, you won't need to log at this level yourself; the framework handles it. It becomes relevant only if you implement an actor directly through the `gen.ProcessBehavior` interface and want to recover panics in your own processing loop.
+
The node starts at `gen.LogLevelInfo`. Processes inherit this unless their spawn options specify otherwise. After startup, you can adjust a process's level dynamically with `SetLevel`, allowing surgical verbosity changes during debugging.
## Identifying Log Sources
@@ -282,10 +284,10 @@ Different processes often need different verbosity. Most processes log at Info.
```go
// Debugging a specific process
-node.SetLogLevelProcess(suspiciousPID, gen.LogLevelDebug)
+node.SetProcessLogLevel(suspiciousPID, gen.LogLevelDebug)
// Later, restore normal level
-node.SetLogLevelProcess(suspiciousPID, gen.LogLevelInfo)
+node.SetProcessLogLevel(suspiciousPID, gen.LogLevelInfo)
```
For processes generating high-volume logs, route them to a dedicated logger using a hidden logger. A trading engine logging every order would overwhelm general logs:
diff --git a/docs/basics/node.md b/docs/basics/node.md
index 25fa36cee..1dccd24fa 100644
--- a/docs/basics/node.md
+++ b/docs/basics/node.md
@@ -10,13 +10,13 @@ When you start a node, you're launching a complete system with several subsystem
## What a Node Provides
-**Process Management** - The node tracks every process running on it. When you spawn a process, the node assigns it a unique PID, registers it in the process table, and manages its lifecycle. When a process terminates, the node cleans up its resources and notifies any processes that were linked or monitoring it.
+**Process Management** - The node tracks every process running on it. When you spawn a process, the node assigns it a unique PID, registers it in the process table, and manages its lifecycle. When a process terminates, the node cleans up its resources and notifies any processes that were linked or monitoring it. The node provides `ProcessRangeShortInfo` for efficient iteration over all processes with their current state, including mailbox latency when built with `-tags=latency`.
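A monitoring process might use it like this (a sketch: the callback shape is assumed analogous to `EventRangeInfo`, returning `true` to continue, and `MailboxLatency` is in nanoseconds, `-1` without the latency tag):

```go
// Flag processes whose oldest queued message is older than 50ms.
threshold := (50 * time.Millisecond).Nanoseconds()
node.ProcessRangeShortInfo(func(info gen.ProcessShortInfo) bool {
	if info.MailboxLatency > threshold {
		node.Log().Warning("process %s is falling behind: %s",
			info.PID, time.Duration(info.MailboxLatency))
	}
	return true // continue iteration
})
```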
**Message Routing** - When a process sends a message, the node figures out where it needs to go. Local process? Route it directly to the mailbox. Remote process? Establish a network connection if needed and send it there. The sender doesn't need to know these details.
**Network Stack** - The node handles all network communication. It discovers other nodes, establishes connections, encodes messages, and manages the complexity of distributed communication. This is what makes network transparency possible.
-**Pub/Sub System** - Links, monitors, and events all work through a publisher/subscriber mechanism in the node core. When a process terminates or an event fires, the node knows who's subscribed and delivers the notifications.
+**Pub/Sub System** - Links, monitors, and events all work through a publisher/subscriber mechanism in the node core. When a process terminates or an event fires, the node knows who's subscribed and delivers the notifications. The node provides `EventInfo` to query statistics for a specific event and `EventRangeInfo` for callback-based iteration over all registered events with their per-event counters (messages published, local/remote deliveries).
**Logging** - Every log message goes through the node, which fans it out to registered loggers. This centralized logging makes it easy to capture, filter, and route log output.
diff --git a/docs/basics/process.md b/docs/basics/process.md
index acc65dcbe..c9688ad10 100644
--- a/docs/basics/process.md
+++ b/docs/basics/process.md
@@ -8,6 +8,8 @@ A process is an actor - a lightweight entity that handles messages sequentially
Every process has a mailbox where incoming messages wait to be processed. The mailbox contains four queues with different priorities: Urgent for critical system messages, System for framework control, Main for regular application messages, and Log for logging. When the process wakes up to handle messages, it processes them in priority order, taking from Urgent first, then System, then Main, and finally Log.
+When built with `-tags=latency`, each queue tracks the age of its oldest unprocessed message. `ProcessMailbox.Latency()` returns the maximum latency across all four queues in nanoseconds, or -1 if the tag is not enabled. This helps identify processes that are falling behind on message processing. See [Debugging](../advanced/debugging.md) for details.
+
The process runs only when it has messages to handle. When the mailbox is empty, the process sleeps, consuming no CPU. When a message arrives, the process wakes, handles the message, and sleeps again if nothing else is waiting. This efficiency is why you can have thousands of processes in a single application.
## Identifying Processes
diff --git a/docs/basics/project-structure.md b/docs/basics/project-structure.md
index 310a9d3c1..92bb5718e 100644
--- a/docs/basics/project-structure.md
+++ b/docs/basics/project-structure.md
@@ -269,7 +269,8 @@ package types
import (
"time"
- "ergo.services/ergo/net/edf"
+
+ "ergo.services/ergo/gen"
)
// Events published by the orders application
@@ -285,14 +286,16 @@ type OrderCompleted struct {
CompletedAt time.Time
}
-func init() {
- // Register for network serialization
- edf.RegisterTypeOf(OrderCreated{})
- edf.RegisterTypeOf(OrderCompleted{})
+// Helper that consumers call from their application's Load() callback.
+func RegisterTypes(network gen.Network) error {
+ return network.RegisterTypes([]any{
+ OrderCreated{},
+ OrderCompleted{},
+ })
}
```
-Both `apps/orders` and `apps/shipping` can import `types` without importing each other. This breaks the circular dependency while maintaining strong typing.
+Both `apps/orders` and `apps/shipping` can import `types` and call `types.RegisterTypes(node.Network())` from their `Load(node)` callbacks. This breaks the circular dependency while maintaining strong typing.
### Shared Libraries (`lib/`)
@@ -476,8 +479,6 @@ Messages that form public contracts between applications across the cluster.
// types/commands.go
package types
-import "ergo.services/ergo/net/edf"
-
// EXPORTED type, EXPORTED fields
// CAN be referenced by any package
// CAN be serialized
@@ -493,10 +494,29 @@ type TaskResult struct {
Output []byte
Error string
}
+```
+
+Each consuming application registers the shared types from its `Load` callback:
+
+```go
+// apps/worker/app.go
+package worker
-func init() {
- edf.RegisterTypeOf(ProcessTask{})
- edf.RegisterTypeOf(TaskResult{})
+import (
+ "ergo.services/ergo/gen"
+
+ "myapp/types"
+)
+
+func (a *Worker) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
+ err := node.Network().RegisterTypes([]any{
+ types.ProcessTask{},
+ types.TaskResult{},
+ })
+ if err != nil {
+ return gen.ApplicationSpec{}, err
+ }
+ return gen.ApplicationSpec{ /* ... */ }, nil
}
```
@@ -718,8 +738,8 @@ func (l *Listener) Init(args ...any) error {
return nil
}
-func (l *Listener) HandleEvent(ev gen.MessageEvent) error {
- switch e := ev.Message.(type) {
+func (l *Listener) HandleEvent(event gen.MessageEvent) error {
+ switch e := event.Message.(type) {
case types.OrderCompleted:
l.createShipment(e.OrderID)
}
@@ -1134,7 +1154,7 @@ apps/
- Default to Level 4 for everything
- Mix isolation levels arbitrarily
- Use `any` or `interface{}` for messages
-- Include pointers in network messages
+- Include pointers to external resources (connections, files) in network messages
### Dependencies
diff --git a/docs/extra-library/actors/README.md b/docs/extra-library/actors/README.md
index b92ead6f2..0db998c11 100644
--- a/docs/extra-library/actors/README.md
+++ b/docs/extra-library/actors/README.md
@@ -2,6 +2,12 @@
An extra library of actors implementations not included in the standard Ergo Framework library. This library contains packages with a narrow specialization. It also includes packages with external dependencies, as Ergo Framework adheres to a "zero dependency" policy.
+## [Health](health.md)
+
+Kubernetes health probe actor that serves `/health/live`, `/health/ready`, and `/health/startup` endpoints. Actors register named signals with probe type and optional heartbeat timeout. When a signal goes down, the corresponding probe endpoint returns 503.
+
+**Use cases:** Kubernetes liveness/readiness/startup probes, container orchestration integration, dependency health tracking, graceful degradation.
+
## [Leader](leader.md)
Distributed leader election actor implementing Raft-inspired consensus algorithm. Provides coordination primitives for building systems that require single leader selection across a cluster.
diff --git a/docs/extra-library/actors/health.md b/docs/extra-library/actors/health.md
new file mode 100644
index 000000000..d9fbaa254
--- /dev/null
+++ b/docs/extra-library/actors/health.md
@@ -0,0 +1,466 @@
+# Health
+
+The health actor provides Kubernetes-compatible health probe endpoints for Ergo applications. Instead of each application building its own HTTP health check logic, the health actor centralizes probe management into a single process that serves `/health/live`, `/health/ready`, and `/health/startup` endpoints.
+
+Actors register named signals with the health actor, optionally sending periodic heartbeats. The health actor aggregates signal states and serves HTTP responses that Kubernetes (or any other orchestrator) can use to determine whether to restart a pod, route traffic to it, or wait for it to finish starting.
+
+## The Problem
+
+Kubernetes uses three types of probes to manage pod lifecycle:
+
+**Liveness:** Is the application alive? A failing liveness probe causes Kubernetes to restart the pod. Use this for detecting deadlocks, infinite loops, or corrupted state that prevents the application from functioning.
+
+**Readiness:** Can the application serve traffic? A failing readiness probe removes the pod from service endpoints. Use this for temporary conditions like database connection loss, cache warming, or downstream dependency outages where restarting would not help.
+
+**Startup:** Has the application finished initializing? A failing startup probe prevents liveness and readiness checks from running. Use this for slow-starting applications that need time to load data, run migrations, or establish connections before health checks begin.
+
+In traditional applications, you implement these probes as HTTP handlers that check internal state. In actor systems, the "state" is distributed across many processes. A database connection actor, a cache warmer, and a message queue consumer each know their own status, but no single actor knows the overall health.
+
+The health actor solves this by accepting signal registrations from any actor in the system. Each actor reports its own status, and the health actor aggregates these signals into per-probe HTTP responses.
+
+## How It Works
+
+The health actor follows a registration and heartbeat pattern:
+
+1. **Actors register signals:** Each actor that contributes to health sends a `RegisterRequest` to the health actor (synchronous Call), specifying a signal name, which probes it affects, and an optional heartbeat timeout. The Call returns after the signal is registered, preventing race conditions with subsequent heartbeats.
+
+2. **The health actor monitors registrants:** When a signal is registered, the health actor monitors the registering process. If that process terminates, all its signals are automatically marked as down.
+
+3. **Actors send heartbeats:** For signals with a timeout, the registering actor periodically sends `MessageHeartbeat`. If the heartbeat interval exceeds the timeout, the health actor marks the signal as down.
+
+4. **HTTP handlers read atomic state:** The HTTP handlers read pre-built JSON responses from atomic values. The actor goroutine rebuilds these atomic values after every state change. No mutexes or channels are involved in serving HTTP requests.
+
+```mermaid
+sequenceDiagram
+ participant DB as DB Actor
+ participant H as Health Actor
+ participant K as Kubernetes
+
+ DB->>H: RegisterRequest{Signal: "db", Probe: Liveness|Readiness, Timeout: 5s}
+ H->>DB: RegisterResponse{}
+ Note over H: Monitor DB Actor<br/>Signal "db" = up
+
+ loop Every 2 seconds
+ DB->>H: MessageHeartbeat{Signal: "db"}
+ end
+
+ K->>H: GET /health/live
+ H->>K: 200 {"status":"healthy","signals":[{"signal":"db","status":"up","timeout":"5s"}]}
+
+ Note over DB: Process crashes
+ Note over H: MessageDownPID received<br/>Signal "db" = down
+
+ K->>H: GET /health/live
+ H->>K: 503 {"status":"unhealthy","signals":[{"signal":"db","status":"down","timeout":"5s"}]}
+
+ Note over K: Restart pod
+```
+
+## ActorBehavior Interface
+
+The health actor extends `gen.ProcessBehavior` with a specialized interface:
+
+```go
+type ActorBehavior interface {
+ gen.ProcessBehavior
+
+ Init(args ...any) (Options, error)
+ HandleMessage(from gen.PID, message any) error
+ HandleCall(from gen.PID, ref gen.Ref, message any) (any, error)
+ HandleInspect(from gen.PID, item ...string) map[string]string
+ HandleSignalDown(signal gen.Atom) error
+ HandleSignalUp(signal gen.Atom) error
+ Terminate(reason error)
+}
+```
+
+All callbacks have default (no-op) implementations. You only override what you need.
+
+`HandleSignalDown` is called when a signal transitions from up to down, due to heartbeat timeout, process termination, or explicit `MessageSignalDown`. Use this for alerting, logging, or triggering recovery actions.
+
+`HandleSignalUp` is called when a signal transitions from down to up, via heartbeat recovery or explicit `MessageSignalUp`. Use this to log recovery events or update external systems.
+
+## Basic Usage
+
+Spawn the health actor and register it with a name so other actors can find it:
+
+```go
+package main
+
+import (
+ "ergo.services/actor/health"
+ "ergo.services/ergo"
+ "ergo.services/ergo/gen"
+)
+
+func main() {
+ node, _ := ergo.StartNode("mynode@localhost", gen.NodeOptions{})
+ defer node.Stop()
+
+ node.SpawnRegister(gen.Atom("health"), health.Factory, gen.ProcessOptions{},
+ health.Options{Port: 8080})
+
+ // Endpoints:
+ // http://localhost:8080/health/live
+ // http://localhost:8080/health/ready
+ // http://localhost:8080/health/startup
+ node.Wait()
+}
+```
+
+Default configuration:
+- **Host**: `localhost`
+- **Port**: `3000`
+- **Path**: `/health`
+- **CheckInterval**: `1 second`
+
+With no signals registered, all three endpoints return 200 with `{"status":"healthy"}`. This means a freshly started health actor does not block deployment. Signals opt in to health checking; only registered signals can cause a probe to fail.
+
+## Configuration
+
+```go
+options := health.Options{
+ Host: "0.0.0.0", // Listen on all interfaces
+ Port: 8080, // HTTP port
+ Path: "/health", // Path prefix (default: "/health")
+ CheckInterval: 2 * time.Second, // Heartbeat check interval
+ Mux: nil, // Optional external *http.ServeMux (see below)
+}
+```
+
+**Host** determines which network interface the HTTP server binds to. Use `"0.0.0.0"` for production/containerized environments.
+
+**Port** should not conflict with other services on the same pod.
+
+**Path** sets the prefix for health endpoints. Endpoints are registered as `Path+"/live"`, `Path+"/ready"`, `Path+"/startup"`. Change this when the default conflicts with your routing or when deploying behind a reverse proxy. For example, with `Path: "/k8s"` the endpoints become `/k8s/live`, `/k8s/ready`, `/k8s/startup`.
+
+**CheckInterval** controls how frequently the actor checks for expired heartbeats. The actor sends itself a timer message at this interval and iterates over all signals with a non-zero timeout, marking expired ones as down. Shorter intervals detect failures faster but increase message processing overhead. For most applications, 1-2 seconds provides a good balance.
+
+**Mux** accepts an external `*http.ServeMux`. When provided, the health actor registers its handlers on this mux and skips starting its own HTTP server. This is useful when you want to serve health endpoints alongside other HTTP handlers on a single port, for example, combining with the [Metrics](metrics.md) actor.
+
+```go
+mux := http.NewServeMux()
+
+healthOpts := health.Options{Mux: mux}
+node.SpawnRegister("health", health.Factory, gen.ProcessOptions{}, healthOpts)
+
+metricsOpts := metrics.Options{Mux: mux}
+node.Spawn(metrics.Factory, gen.ProcessOptions{}, metricsOpts)
+
+// Serve the shared mux yourself
+```
+
+When `Mux` is set, `Host` and `Port` are ignored.
+
+## Signal Registration
+
+### Probe Types
+
+Each signal specifies which probes it affects using a bitmask:
+
+```go
+const (
+ ProbeLiveness Probe = 1 << iota // 1 -- /health/live
+ ProbeReadiness // 2 -- /health/ready
+ ProbeStartup // 4 -- /health/startup
+)
+```
+
+Combine probes with bitwise OR. A database connection that affects both liveness and readiness:
+
+```go
+health.Register(w, gen.Atom("health"), "db",
+ health.ProbeLiveness|health.ProbeReadiness, 5*time.Second)
+```
+
+A migration signal that only affects startup:
+
+```go
+health.Register(w, gen.Atom("health"), "migrations",
+ health.ProbeStartup, 0)
+```
+
+When `Probe` is 0, it defaults to `ProbeLiveness`.
+
+### Helper Functions
+
+The package provides convenience functions:
+
+```go
+// Register a signal (sync Call -- blocks until registered)
+health.Register(process, to, signal, probe, timeout)
+
+// Remove a signal (sync Call -- blocks until removed)
+health.Unregister(process, to, signal)
+
+// Send heartbeat (async Send)
+health.Heartbeat(process, to, signal)
+
+// Manual control (async Send)
+health.SignalUp(process, to, signal)
+health.SignalDown(process, to, signal)
+```
+
+`Register` and `Unregister` use synchronous Call to confirm the operation completed. This prevents race conditions where a heartbeat or status update arrives before the signal is registered. All other helpers use async Send.
+
+The `to` parameter accepts anything that identifies a process: a `gen.Atom` name, `gen.PID`, `gen.ProcessID`, or `gen.Alias`.
+
+### Message Types
+
+If you prefer sending messages directly instead of using helpers:
+
+| Message | Type | Description |
+|---------|------|-------------|
+| `RegisterRequest` / `RegisterResponse` | sync (Call) | Register a signal. Fields: `Signal gen.Atom`, `Probe Probe`, `Timeout time.Duration` |
+| `UnregisterRequest` / `UnregisterResponse` | sync (Call) | Remove a signal. Fields: `Signal gen.Atom` |
+| `MessageHeartbeat` | async (Send) | Update heartbeat timestamp. Fields: `Signal gen.Atom` |
+| `MessageSignalUp` | async (Send) | Mark a signal as up. Fields: `Signal gen.Atom` |
+| `MessageSignalDown` | async (Send) | Mark a signal as down. Fields: `Signal gen.Atom` |
+
+All types are registered with EDF for network transparency. Actors on remote nodes can register signals with a health actor on any node in the cluster.
+
+## Heartbeat Pattern
+
+The heartbeat pattern is the primary mechanism for detecting failures in long-running dependencies. The actor that owns a resource (database, external API, message queue) knows best whether the resource is healthy. It registers a signal with a timeout and sends periodic heartbeats as long as the resource is available.
+
+```go
+type DBWorker struct {
+ act.Actor
+ heartbeatTimer gen.CancelFunc
+}
+
+type messageHeartbeatTick struct{}
+
+func (w *DBWorker) Init(args ...any) error {
+ // Register with 5-second heartbeat timeout
+ health.Register(w, gen.Atom("health"), "db",
+ health.ProbeLiveness|health.ProbeReadiness, 5*time.Second)
+
+ // Send heartbeat every 2 seconds (well within the 5s timeout)
+ w.heartbeatTimer, _ = w.SendAfter(w.PID(), messageHeartbeatTick{}, 2*time.Second)
+ return nil
+}
+
+func (w *DBWorker) HandleMessage(from gen.PID, message any) error {
+ switch message.(type) {
+ case messageHeartbeatTick:
+ health.Heartbeat(w, gen.Atom("health"), "db")
+ w.heartbeatTimer, _ = w.SendAfter(w.PID(), messageHeartbeatTick{}, 2*time.Second)
+ }
+ return nil
+}
+
+func (w *DBWorker) Terminate(reason error) {
+ if w.heartbeatTimer != nil {
+ w.heartbeatTimer()
+ }
+}
+```
+
+Choose a heartbeat interval of no more than half the timeout. This leaves room for at least one missed heartbeat before the signal is marked as down.
+
+When the actor crashes, the health actor receives a `gen.MessageDownPID` (because it monitors the registrant) and marks all signals from that process as down. Heartbeat timeout is a secondary detection mechanism for situations where the process is alive but the resource it manages is not: for example, a database connection pool actor that is running but has lost all connections.
+
+## HTTP Endpoints
+
+| Path | Probe | Default (no signals) |
+|------|-------|---------------------|
+| `{Path}/live` | ProbeLiveness | 200 healthy |
+| `{Path}/ready` | ProbeReadiness | 200 healthy |
+| `{Path}/startup` | ProbeStartup | 200 healthy |
+
+Each endpoint evaluates only signals registered for that specific probe. A signal registered only for `ProbeLiveness` does not affect `/health/ready` or `/health/startup`.
+
+**200 OK:** all signals for this probe are up, or no signals are registered.
+
+**503 Service Unavailable:** at least one signal for this probe is down.
+
+### Response Format
+
+Healthy response with signals:
+
+```json
+{"status":"healthy","signals":[{"signal":"db","status":"up","timeout":"5s"}]}
+```
+
+Unhealthy response:
+
+```json
+{"status":"unhealthy","signals":[{"signal":"db","status":"down","timeout":"5s"},{"signal":"cache","status":"up"}]}
+```
+
+Healthy response with no signals (probe has no registered signals):
+
+```json
+{"status":"healthy"}
+```
+
+The `timeout` field appears only for signals that have a heartbeat timeout configured. Signals without timeout omit this field.
+
+## Failure Detection
+
+The health actor detects failures through three mechanisms:
+
+### Process Termination
+
+When a process that registered signals terminates (normally or abnormally), the health actor receives `gen.MessageDownPID` through its monitor. All signals from that process are immediately marked as down. This is the fastest and most reliable detection mechanism.
+
+### Heartbeat Timeout
+
+For signals with a non-zero timeout, the health actor periodically checks whether the last heartbeat was received within the timeout window. If a heartbeat is overdue, the signal is marked as down and `HandleSignalDown` is called.
+
+Heartbeat timeout catches situations where the process is alive but the resource it monitors is unavailable. The process continues to run (so no `MessageDownPID` arrives) but stops sending heartbeats because the resource check fails.
+
+### Manual Control
+
+Actors can explicitly report status changes using `MessageSignalUp` and `MessageSignalDown`. Use this when you can detect failures immediately without waiting for a timeout, for example, catching a database connection error in a callback and immediately marking the signal as down, then marking it up again when the connection is re-established.
+
+## Extending with Custom Behavior
+
+Embed `health.Actor` in your own struct to add custom behavior:
+
+```go
+type MyHealth struct {
+ health.Actor
+}
+
+func MyHealthFactory() gen.ProcessBehavior {
+ return &MyHealth{}
+}
+
+func (h *MyHealth) Init(args ...any) (health.Options, error) {
+ return health.Options{Port: 8080}, nil
+}
+
+func (h *MyHealth) HandleSignalDown(signal gen.Atom) error {
+ h.Log().Error("signal went down: %s", signal)
+ // Alert external monitoring, update metrics, trigger recovery
+ return nil
+}
+
+func (h *MyHealth) HandleSignalUp(signal gen.Atom) error {
+ h.Log().Info("signal recovered: %s", signal)
+ return nil
+}
+```
+
+Override `HandleMessage` to handle application-specific messages alongside health management. The health actor dispatches its own types internally (`RegisterRequest`/`UnregisterRequest` via HandleCall, `MessageHeartbeat`/`MessageSignalUp`/`MessageSignalDown` via HandleMessage); only unrecognized messages are forwarded to your callbacks.
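
A sketch extending `MyHealth` from above; the `messageStats` type and its handling are hypothetical, shown only to illustrate the dispatch order:

```go
// Hypothetical application message.
type messageStats struct{}

// Health-management messages (RegisterRequest, MessageHeartbeat, and so on)
// are consumed by health.Actor internally and never reach this callback.
func (h *MyHealth) HandleMessage(from gen.PID, message any) error {
	switch message.(type) {
	case messageStats:
		h.Log().Info("stats requested by %s", from)
	}
	return nil
}
```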
+
+## Kubernetes Configuration
+
+Configure Kubernetes probes to point to the health actor's endpoints:
+
+```yaml
+apiVersion: v1
+kind: Pod
+spec:
+ containers:
+ - name: myapp
+ livenessProbe:
+ httpGet:
+ path: /health/live
+ port: 3000
+ initialDelaySeconds: 5
+ periodSeconds: 10
+ readinessProbe:
+ httpGet:
+ path: /health/ready
+ port: 3000
+ initialDelaySeconds: 5
+ periodSeconds: 10
+ startupProbe:
+ httpGet:
+ path: /health/startup
+ port: 3000
+ failureThreshold: 30
+ periodSeconds: 2
+```
+
+Adjust `initialDelaySeconds` based on how long your application takes to start and register signals. The startup probe with `failureThreshold: 30` and `periodSeconds: 2` gives the application 60 seconds to complete initialization before Kubernetes considers it failed.
+
+## Common Patterns
+
+### Database Health
+
+Register liveness and readiness signals with heartbeat:
+
+```go
+func (w *DBWorker) Init(args ...any) error {
+ health.Register(w, gen.Atom("health"), "postgres",
+ health.ProbeLiveness|health.ProbeReadiness, 10*time.Second)
+
+ w.scheduleHeartbeat()
+ return nil
+}
+
+func (w *DBWorker) HandleMessage(from gen.PID, message any) error {
+ switch message.(type) {
+ case messageHeartbeat:
+ if w.db.Ping() == nil {
+ health.Heartbeat(w, gen.Atom("health"), "postgres")
+ }
+ w.scheduleHeartbeat()
+ }
+ return nil
+}
+```
+
+If `db.Ping()` fails, no heartbeat is sent, and the signal times out. The health actor marks it as down, causing Kubernetes to remove the pod from service endpoints (readiness) and eventually restart it (liveness).
+
+### Startup Gate
+
+Use the startup probe for slow initialization:
+
+```go
+func (w *Migrator) Init(args ...any) error {
+ health.Register(w, gen.Atom("health"), "migrations",
+ health.ProbeStartup, 0) // No timeout -- manual control
+
+ w.Send(w.PID(), messageRunMigrations{})
+ return nil
+}
+
+func (w *Migrator) HandleMessage(from gen.PID, message any) error {
+ switch message.(type) {
+ case messageRunMigrations:
+ if err := w.runMigrations(); err != nil {
+ health.SignalDown(w, gen.Atom("health"), "migrations")
+ return err
+ }
+ health.SignalUp(w, gen.Atom("health"), "migrations")
+ // Unregister since startup is complete
+ health.Unregister(w, gen.Atom("health"), "migrations")
+ }
+ return nil
+}
+```
+
+While migrations run, the startup probe returns 503, preventing Kubernetes from running liveness and readiness checks. Once migrations complete, the signal is unregistered and the startup probe returns 200.
+
+### Temporary Degradation
+
+Use readiness-only signals for recoverable issues:
+
+```go
+func (w *CacheWorker) HandleMessage(from gen.PID, message any) error {
+ switch msg := message.(type) {
+ case CacheConnectionLost:
+ health.SignalDown(w, gen.Atom("health"), "cache")
+ // Pod removed from service but not restarted
+
+ case CacheConnectionRestored:
+ health.SignalUp(w, gen.Atom("health"), "cache")
+ // Pod added back to service
+ }
+ return nil
+}
+```
+
+Register the signal for `ProbeReadiness` only. The pod stops receiving traffic during the outage but is not restarted, since the cache connection will likely recover on its own.
+
+## Observer Integration
+
+The health actor integrates with Observer via `HandleInspect()`. Inspecting the health actor shows the endpoint URL, signal count, check interval, and current status of each registered signal.
+
+## Radar Application
+
+If your node needs both health probes and Prometheus metrics, consider the [Radar](../applications/radar.md) application. It runs the health actor and metrics actor together on a single HTTP port and provides helper functions so your actors don't need to import either package directly.
diff --git a/docs/extra-library/actors/metrics.md b/docs/extra-library/actors/metrics.md
index aa1771534..ece1b1c33 100644
--- a/docs/extra-library/actors/metrics.md
+++ b/docs/extra-library/actors/metrics.md
@@ -1,29 +1,15 @@
# Metrics
-The metrics actor provides observability for Ergo applications by collecting and exposing runtime statistics in Prometheus format. Instead of manually instrumenting your code with counters and gauges scattered throughout, the metrics actor centralizes telemetry into a single process that exposes an HTTP endpoint for Prometheus to scrape.
+The metrics actor collects runtime statistics from an Ergo node and exposes them as a Prometheus HTTP endpoint. It runs as a regular process: spawn it, and it starts serving `/metrics` with node, network, process, and event telemetry.
-This approach separates monitoring concerns from application logic. Your actors focus on business functionality while the metrics actor handles collection, aggregation, and exposure of operational data. Prometheus or compatible monitoring systems poll the `/metrics` endpoint periodically, building time-series data for alerting and visualization.
+For application-specific metrics (request rates, business counters), you extend the actor with custom Prometheus collectors.
## Why Monitor Actors
-Actor systems present unique monitoring challenges. Traditional thread-based applications have predictable resource usage patterns - you monitor thread pools, request queues, and database connections. Actor systems are more dynamic - processes spawn and terminate constantly, messages flow asynchronously through mailboxes, and work distribution depends on supervision trees and message routing.
-
-The metrics actor addresses this by tracking:
-
-**Process metrics** - How many processes exist, how many are running vs. idle vs. zombie. This reveals whether your node is under load or experiencing process leaks.
-
-**Memory metrics** - Heap allocation and actual memory used. Actor systems can accumulate small allocations across thousands of processes. Memory metrics help identify whether garbage collection keeps pace with allocation.
-
-**Network metrics** - For distributed Ergo clusters, tracking bytes and messages flowing between nodes reveals network bottlenecks, routing inefficiencies, or failing connections.
-
-**Application metrics** - How many applications are loaded and running. Applications failing to start or terminating unexpectedly appear in these counts.
-
-These base metrics provide system-level visibility. For application-specific metrics (request rates, business transactions, custom counters), you extend the metrics actor with your own Prometheus collectors.
+Actor systems are dynamic. Processes spawn and terminate constantly, messages flow through mailboxes asynchronously, and load depends on message routing and supervision trees. Traditional monitoring (thread pools, request queues) does not capture this. The metrics actor tracks process lifecycle, mailbox pressure, message throughput, event fanout, network traffic, and delivery errors, giving visibility into what the actor runtime is actually doing.
## ActorBehavior Interface
-The metrics actor extends `gen.ProcessBehavior` with a specialized interface:
-
```go
type ActorBehavior interface {
gen.ProcessBehavior
@@ -40,32 +26,16 @@ type ActorBehavior interface {
}
```
-Only `Init()` is required - register your custom metrics and return options; all other callbacks have default implementations you can override as needed.
-
-You have two main patterns:
-
-**Periodic collection** - Implement `CollectMetrics()` to query state at intervals. Use when metrics reflect current state from other actors or external sources.
-
-**Event-driven updates** - Implement `HandleMessage()` or `HandleEvent()` to update metrics when events occur. Use when your application produces natural event streams or publishes events.
-
-## How It Works
+Only `Init()` is required. All other callbacks have default implementations.
-When you spawn the metrics actor:
+Two patterns for custom metrics:
-1. **HTTP endpoint starts** at the configured host and port. The `/metrics` endpoint immediately serves Prometheus-formatted data.
+**Periodic collection:** implement `CollectMetrics()` to query state at intervals. Use when metrics reflect current state from other actors or external sources.
-2. **Base metrics collect automatically**. Node information (processes, memory, CPU) and network statistics (connected nodes, message rates) update at the configured interval.
-
-3. **Custom metrics update** via `CollectMetrics()` callback or `HandleMessage()` processing, depending on your implementation.
-
-4. **Prometheus scrapes** the `/metrics` endpoint and receives current values for all registered collectors (base + custom).
-
-The actor handles HTTP serving and registry management. You focus on defining metrics and updating their values.
+**Event-driven updates:** implement `HandleMessage()` or `HandleEvent()` to update metrics as events occur. Use when your application produces natural event streams.
## Basic Usage
-Spawn the metrics actor like any other process:
-
```go
package main
@@ -79,7 +49,6 @@ func main() {
node, _ := ergo.StartNode("mynode@localhost", gen.NodeOptions{})
defer node.Stop()
- // Spawn metrics actor with defaults
node.Spawn(metrics.Factory, gen.ProcessOptions{}, metrics.Options{})
// Metrics available at http://localhost:3000/metrics
@@ -90,79 +59,125 @@ func main() {
Default configuration:
- **Host**: `localhost`
- **Port**: `3000`
+- **Path**: `/metrics`
- **CollectInterval**: `10 seconds`
-
-The HTTP endpoint starts automatically during initialization. The first metrics collection happens immediately, and subsequent collections run at the configured interval.
+- **TopN**: `50`
## Configuration
-Customize the HTTP endpoint and collection frequency:
-
```go
options := metrics.Options{
Host: "0.0.0.0", // Listen on all interfaces
- Port: 9090, // Prometheus default port
- CollectInterval: 5 * time.Second, // Collect every 5 seconds
+ Port: 9090, // HTTP port
+ Path: "/metrics", // HTTP path
+ CollectInterval: 5 * time.Second, // Collection frequency
+ TopN: 50, // Top-N entries per metric group
}
node.Spawn(metrics.Factory, gen.ProcessOptions{}, options)
```
-**Host** determines which network interface the HTTP server binds to. Use `"localhost"` to restrict access to local connections only (development, testing). Use `"0.0.0.0"` to accept connections from any interface (production, containerized environments).
+**Host** determines which interface the HTTP server binds to. Use `"localhost"` for development, `"0.0.0.0"` for production/containers.
+
+**Port** should not conflict with other services. Prometheus conventionally uses `9090`, Observer UI defaults to `9911`.
+
+**TopN** controls how many top entries are tracked for each metric group (mailbox depth, utilization, latency for processes; subscribers, published, deliveries for events). Higher values increase Prometheus cardinality.
-**Port** should not conflict with other services. Prometheus conventionally uses `9090`, but many Ergo applications use that for other purposes. Choose a port that doesn't collide with your application's HTTP servers, Observer UI (default `9911`), or other metrics exporters.
+**CollectInterval** controls how frequently the actor queries node statistics. Collecting more frequently than your Prometheus scrape interval wastes resources.
-**CollectInterval** controls how frequently the actor queries node statistics. Shorter intervals provide more granular time-series data but increase CPU usage for collection. Longer intervals reduce overhead but miss short-lived spikes. For most applications, 10-15 seconds balances responsiveness with resource usage. Prometheus typically scrapes every 15-60 seconds, so collecting more frequently than your scrape interval wastes resources.
+**Mux** accepts an external `*http.ServeMux`. The metrics actor registers its handler on this mux and skips starting its own HTTP server. Useful for serving metrics alongside other handlers on a single port:
+
+```go
+mux := http.NewServeMux()
+
+metricsOpts := metrics.Options{
+ Mux: mux,
+ CollectInterval: 5 * time.Second,
+}
+node.Spawn(metrics.Factory, gen.ProcessOptions{}, metricsOpts)
+
+healthOpts := health.Options{Mux: mux}
+node.SpawnRegister("health", health.Factory, gen.ProcessOptions{}, healthOpts)
+```
+
+When `Mux` is set, `Host` and `Port` are ignored.
## Base Metrics
-The metrics actor automatically exposes these Prometheus metrics without any configuration:
+The actor automatically collects metrics without any configuration. All metrics carry a `node` label identifying the source node.
### Node Metrics
-| Metric | Type | Description |
-|--------|------|-------------|
-| `ergo_node_uptime_seconds` | Gauge | Time since node started. Useful for detecting node restarts and calculating availability. |
-| `ergo_processes_total` | Gauge | Total number of processes including running, idle, and zombie. High counts suggest process leaks or inefficient cleanup. |
-| `ergo_processes_running` | Gauge | Processes actively handling messages. Low relative to total suggests most processes are idle (good) or blocked (bad - investigate what they're waiting for). |
-| `ergo_processes_zombie` | Gauge | Processes terminated but not yet fully cleaned up. These should be transient. Persistent zombies indicate bugs in termination handling. |
-| `ergo_memory_used_bytes` | Gauge | Total memory obtained from OS (uses `runtime.MemStats.Sys`). |
-| `ergo_memory_alloc_bytes` | Gauge | Bytes of allocated heap objects (uses `runtime.MemStats.Alloc`). |
-| `ergo_cpu_user_seconds` | Gauge | CPU time spent executing user code. Increases as the node does work. Rate of change indicates CPU utilization. |
-| `ergo_cpu_system_seconds` | Gauge | CPU time spent in kernel (system calls). High system time relative to user time suggests I/O bottlenecks or excessive syscalls. |
-| `ergo_applications_total` | Gauge | Number of applications loaded. Should match your expected count. Unexpected changes indicate applications starting or stopping. |
-| `ergo_applications_running` | Gauge | Applications currently active. Compare to total to identify stopped or failed applications. |
-| `ergo_registered_names_total` | Gauge | Processes registered with atom names. High counts suggest heavy use of named processes for routing. |
-| `ergo_registered_aliases_total` | Gauge | Total number of registered aliases. Includes aliases created by processes via `CreateAlias()` and aliases identifying meta-processes. |
-| `ergo_registered_events_total` | Gauge | Event subscriptions active in the node. High counts indicate extensive pub/sub usage. |
+Uptime, process counts (total, running, zombie), spawn/termination counters, memory (OS used, runtime allocated), CPU time (user, system), application counts, registered names/aliases/events, event publish/receive/delivery counters, and Send/Call delivery error counters (local and remote).
+
+Delivery errors are split by type: `ergo_send_errors_local_total` and `ergo_call_errors_local_total` count failures where the target process is unknown, terminated, or has a full mailbox. `ergo_send_errors_remote_total` and `ergo_call_errors_remote_total` count connection failures to remote nodes.
+
+### Log Metrics
+
+Log message count by level (`trace`, `debug`, `info`, `warning`, `error`, `panic`). Counted once before fan-out to loggers.
### Network Metrics
-| Metric | Type | Labels | Description |
-|--------|------|--------|-------------|
-| `ergo_connected_nodes_total` | Gauge | - | Number of remote nodes connected. For distributed systems, this should match your expected cluster size. |
-| `ergo_remote_node_uptime_seconds` | Gauge | `node` | Uptime of each connected remote node. Resets when the remote node restarts. |
-| `ergo_remote_messages_in_total` | Gauge | `node` | Messages received from each remote node. Rate indicates traffic volume. |
-| `ergo_remote_messages_out_total` | Gauge | `node` | Messages sent to each remote node. Asymmetric in/out rates may reveal routing issues. |
-| `ergo_remote_bytes_in_total` | Gauge | `node` | Bytes received from each remote node. Disproportionate bytes-to-messages ratio suggests large messages or inefficient serialization. |
-| `ergo_remote_bytes_out_total` | Gauge | `node` | Bytes sent to each remote node. Monitors network bandwidth usage per peer. |
+Connected node count, per-node uptime, message and byte rates (in/out per remote node), cumulative connections established/lost, and per-acceptor handshake error count. Fragmentation metrics per remote node: fragments sent/received, fragmented messages sent/reassembled, assembly timeouts. Compression metrics per remote node: compressed messages sent, bytes before/after compression, decompressed messages received, bytes before/after decompression. Compression ratio (`original / compressed`) reveals whether compression is effective for each connection.
+
+### Mailbox Latency Metrics
+
+Requires building with `-tags=latency`. Measures how long the oldest message has been waiting in each process's mailbox. Provides distribution across ranges (1ms to 60s+), max latency, and top-N processes by latency.
-Network metrics use labels (`node="..."`) to separate per-node data. This creates multiple time series - one per connected node. Prometheus queries can aggregate across labels or filter to specific nodes.
+### Mailbox Depth Metrics
+
+Always active. Counts messages queued in each process's mailbox. Distribution across ranges (1 to 10K+), max depth, and top-N processes by depth. Complementary to latency: depth is "how many messages are waiting", latency is "how long the oldest has been waiting".
+
+### Process Metrics
+
+Always active. Includes:
+
+- **Utilization:** ratio of callback running time to uptime. Distribution, max, and top-N.
+- **Init time:** ProcessInit duration. Max and top-N.
+- **Throughput:** messages in/out per process (top-N) and node-level aggregates.
+- **Wakeups and drains:** wakeup count and drain ratio (messages processed per wakeup). Drain ratio distinguishes between slow callbacks (drain ~1) and high-throughput batching (drain ~100) at the same utilization level.
+- **Liveness:** detects processes stuck in blocking calls. Computed as `RunningTime / (Uptime * MailboxLatency)`. A healthy process has RunningTime growing with activity (high score). A process blocked in a mutex, channel, or IO has RunningTime frozen while uptime and latency keep growing (score drops over time). Zombie processes are excluded (detected separately). Bottom-N surfaces the most stuck processes. Requires `-tags=latency`.
+
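The liveness formula can be checked in isolation. A self-contained sketch (an illustrative reimplementation of the documented ratio, not the actor's code; all inputs in seconds):

```go
package main

import "fmt"

// livenessScore mirrors the documented ratio
// RunningTime / (Uptime * MailboxLatency).
func livenessScore(runningTime, uptime, mailboxLatency float64) float64 {
	if uptime <= 0 || mailboxLatency <= 0 {
		return 0 // guard for the sketch; the real metric handles these cases itself
	}
	return runningTime / (uptime * mailboxLatency)
}

func main() {
	// Healthy: running time tracks activity, oldest message is fresh.
	fmt.Println(livenessScore(30, 60, 0.5))
	// Stuck in a blocking call: running time frozen at 5s while
	// uptime and mailbox latency keep growing; the score decays.
	fmt.Println(livenessScore(5, 60, 10))
	fmt.Println(livenessScore(5, 120, 70))
}
```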
+### Event Metrics
+
+Always active. Per-event subscriber count, publish/delivery counts, and utilization state (`active`, `on_demand`, `idle`, `no_subscribers`, `no_publishing`). See [Events](../../basics/events.md) for the pub/sub model and [Pub/Sub Internals](../../advanced/pub-sub-internals.md) for the shared subscription optimization that affects delivery counters.
+
+For the complete list of metric names, types, labels, and descriptions, see the [metrics actor README](https://github.com/ergo-services/actor).
## Custom Metrics
-Extend the metrics actor by embedding `metrics.Actor`. You register custom Prometheus collectors in `Init()` and update them via `CollectMetrics()` or `HandleMessage()`.
+All custom metrics automatically receive a `node` const label. Do not include `"node"` in your variable label names.
-### Approach 1: Periodic Collection
+### Helper Functions
-Implement `CollectMetrics()` to poll state at regular intervals:
+Any actor on the same node can register and update custom metrics without importing `prometheus` or embedding the metrics actor:
+
+```go
+// Register metrics (sync Call, returns error)
+metrics.RegisterGauge(w, "metrics_actor", "db_connections", "Active connections", []string{"pool"})
+metrics.RegisterCounter(w, "metrics_actor", "cache_ops", "Cache operations", []string{"op"})
+metrics.RegisterHistogram(w, "metrics_actor", "request_seconds", "Latency", []string{"path"}, nil)
+
+// Update metrics (async Send)
+metrics.GaugeSet(w, "metrics_actor", "db_connections", 42, []string{"primary"})
+metrics.CounterAdd(w, "metrics_actor", "cache_ops", 1, []string{"hit"})
+metrics.HistogramObserve(w, "metrics_actor", "request_seconds", 0.023, []string{"/api"})
+
+// Remove a metric (async Send)
+metrics.Unregister(w, "metrics_actor", "db_connections")
+```
+
+When the registering process terminates, the metrics actor automatically unregisters all metrics it owned.
+
+### Embedding metrics.Actor
+
+For direct access to the Prometheus registry or periodic collection via `CollectMetrics`:
```go
type AppMetrics struct {
metrics.Actor
- activeUsers prometheus.Gauge
- queueDepth prometheus.Gauge
+ activeUsers prometheus.Gauge
}
func (m *AppMetrics) Init(args ...any) (metrics.Options, error) {
@@ -171,12 +186,7 @@ func (m *AppMetrics) Init(args ...any) (metrics.Options, error) {
Help: "Current number of active users",
})
- m.queueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
- Name: "myapp_queue_depth",
- Help: "Current queue depth",
- })
-
- m.Registry().MustRegister(m.activeUsers, m.queueDepth)
+ m.Registry().MustRegister(m.activeUsers)
return metrics.Options{
Port: 9090,
@@ -185,104 +195,79 @@ func (m *AppMetrics) Init(args ...any) (metrics.Options, error) {
}
func (m *AppMetrics) CollectMetrics() error {
- // Called every CollectInterval
- // Query other processes for current state
-
count, err := m.Call(userService, getActiveUsersMessage{})
if err != nil {
m.Log().Warning("failed to get user count: %s", err)
- return nil // Non-fatal, continue
+ return nil
}
m.activeUsers.Set(float64(count.(int)))
-
- depth, _ := m.Call(queueService, getDepthMessage{})
- m.queueDepth.Set(float64(depth.(int)))
-
return nil
}
```
-Use this when metrics reflect state you need to query - current values from other actors, computed aggregates, external API calls.
-
-### Approach 2: Event-Driven Updates
-
-Update metrics immediately when events occur:
+For event-driven updates, implement `HandleMessage()` instead of `CollectMetrics()`:
```go
-type AppMetrics struct {
- metrics.Actor
-
- requestsTotal prometheus.Counter
- requestLatency prometheus.Histogram
-}
-
-func (m *AppMetrics) Init(args ...any) (metrics.Options, error) {
- m.requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
- Name: "myapp_requests_total",
- Help: "Total requests processed",
- })
-
- m.requestLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
- Name: "myapp_request_duration_seconds",
- Help: "Request latency distribution",
- Buckets: prometheus.DefBuckets,
- })
-
- m.Registry().MustRegister(m.requestsTotal, m.requestLatency)
-
- return metrics.Options{Port: 9090}, nil
-}
-
func (m *AppMetrics) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case requestCompletedMessage:
m.requestsTotal.Inc()
m.requestLatency.Observe(msg.duration.Seconds())
- case errorOccurredMessage:
- m.errorsTotal.Inc()
}
return nil
}
```
-Application actors send events to the metrics actor:
+### Custom Top-N Metrics
+
+Top-N metrics track the N highest (or lowest) values observed during each collection cycle. Unlike gauges or counters, a top-N metric accumulates observations and periodically flushes only the top entries to Prometheus as a GaugeVec. This is useful when you want to identify the most active, slowest, or largest items out of many, without creating a separate time series for each one.
+
+Each top-N metric is managed by a dedicated actor spawned under a SimpleOneForOne supervisor. Registration creates this actor; observations are sent to it asynchronously. On each flush interval the actor writes the current top-N entries to Prometheus and resets for the next cycle.
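
The accumulate-and-flush cycle can be modeled in isolation. A minimal sketch (illustrative only, not the actor's implementation):

```go
package main

import (
	"fmt"
	"sort"
)

// topN accumulates observations (keeping the largest value per label)
// and, on flush, emits only the N largest entries before resetting.
type topN struct {
	n       int
	entries map[string]float64
}

func (tn *topN) observe(label string, v float64) {
	if cur, ok := tn.entries[label]; !ok || v > cur {
		tn.entries[label] = v
	}
}

// flush returns the top-N labels by value and resets for the next cycle.
func (tn *topN) flush() []string {
	labels := make([]string, 0, len(tn.entries))
	for l := range tn.entries {
		labels = append(labels, l)
	}
	sort.Slice(labels, func(i, j int) bool {
		return tn.entries[labels[i]] > tn.entries[labels[j]]
	})
	if len(labels) > tn.n {
		labels = labels[:tn.n]
	}
	tn.entries = map[string]float64{}
	return labels
}

func main() {
	tn := &topN{n: 2, entries: map[string]float64{}}
	tn.observe("q1", 0.25)
	tn.observe("q2", 1.10)
	tn.observe("q3", 0.40)
	fmt.Println(tn.flush()) // the two largest observations survive the flush
}
```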
```go
-// In your request handler actor
-func (h *RequestHandler) HandleMessage(from gen.PID, message any) error {
- switch msg := message.(type) {
- case ProcessRequest:
- start := time.Now()
- // ... process request ...
- elapsed := time.Since(start)
-
- // Send metrics event
- h.Send(metricsPID, requestCompletedMessage{duration: elapsed})
- }
- return nil
-}
+// Register a top-N metric (sync Call, returns error)
+// TopNMax keeps the N largest values; TopNMin keeps the N smallest
+metrics.RegisterTopN(w, "topn_supervisor_name", "slowest_queries", "Slowest DB queries",
+ 10, metrics.TopNMax, []string{"query", "table"})
+
+// Observe values (async Send)
+metrics.TopNObserve(w, gen.Atom("radar_topn_slowest_queries"), 0.250, []string{"SELECT ...", "users"})
+metrics.TopNObserve(w, gen.Atom("radar_topn_slowest_queries"), 1.100, []string{"JOIN ...", "orders"})
```
-Use this when your application naturally produces events. Metrics update in real-time without polling.
+The `to` parameter in `RegisterTopN` is the name of the supervisor managing top-N actors. The `to` parameter in `TopNObserve` is the actor name, by convention `"radar_topn_" + metricName`.
-## Metric Types
+Ordering modes:
-Prometheus defines four metric types, each suited for different use cases:
+- `metrics.TopNMax`: keeps the N largest values (e.g., slowest queries, busiest actors, highest memory usage)
+- `metrics.TopNMin`: keeps the N smallest values (e.g., lowest latency, least active processes)
-**Counter** - Monotonically increasing value. Use for events that accumulate (requests processed, errors occurred, bytes sent). Counters never decrease except on process restart. Prometheus queries typically use `rate()` to calculate per-second rates or `increase()` for total change over a time window.
+When the process that registered a top-N metric terminates, the actor automatically cleans up and unregisters its GaugeVec from Prometheus.
-**Gauge** - Value that can go up or down. Use for current state (active connections, queue depth, memory usage, CPU utilization). Gauges represent snapshots. Prometheus queries can graph them directly or use functions like `avg_over_time()` to smooth spikes.
+When used through the [Radar](../applications/radar.md) application, the supervisor is already wired in and you use `radar.RegisterTopN` / `radar.TopNObserve` helpers instead.
-**Histogram** - Observations bucketed into configurable ranges. Use for latency or size distributions. Histograms let you calculate percentiles (p50, p95, p99) in Prometheus queries. They're more resource-intensive than gauges because they maintain multiple buckets per metric.
+### Shared Mode
-**Summary** - Similar to histogram but calculates quantiles client-side. Use when you need precise quantiles but can't predict bucket boundaries. Summaries are more expensive than histograms because they track exact quantiles, not approximations.
+A single metrics actor processes messages sequentially. Under high throughput, its mailbox can become a bottleneck. Shared mode lets multiple metrics actor instances share the same Prometheus registry:
-For most use cases, counters and gauges suffice. Use histograms when you need latency percentiles. Avoid summaries unless you have specific reasons - histograms are more flexible for Prometheus queries.
+```go
+shared := metrics.NewShared()
+// Primary actor: owns HTTP endpoint and base metrics
+primaryOpts := metrics.Options{
+ Port: 9090,
+ Shared: shared,
+}
-## Integration with Prometheus
+// Worker actors: handle custom metric updates only
+workerOpts := metrics.Options{
+ Shared: shared,
+}
+```
+
+The primary actor starts the HTTP server and collects base metrics. Workers only process custom metric messages. All actors write to the same registry through the shared object. Works well with `act.Pool` for automatic load distribution.
-Configure Prometheus to scrape the metrics endpoint:
+## Integration with Prometheus
```yaml
scrape_configs:
@@ -291,37 +276,27 @@ scrape_configs:
- targets:
- 'localhost:3000'
- 'node1.example.com:3000'
- - 'node2.example.com:3000'
scrape_interval: 15s
```
-Prometheus fetches `/metrics` every 15 seconds, parses the text format, and stores time-series data. You can then query, alert, and visualize metrics using Prometheus queries or Grafana dashboards.
+For dynamic discovery in Kubernetes, use Prometheus service discovery instead of static targets.
-For dynamic discovery in Kubernetes or cloud environments, use Prometheus service discovery instead of static targets. The metrics actor itself doesn't need to know about Prometheus - it just exposes an HTTP endpoint.
+## Grafana Dashboard
-## Observer Integration
+The metrics package includes a pre-built Grafana dashboard (`ergo-cluster.json`) for monitoring Ergo clusters.
+
+Import it in Grafana: Dashboards > Import > upload `ergo-cluster.json` > select your Prometheus data source. The `$node` dropdown at the top filters all panels by selected nodes.
-The metrics actor includes built-in Observer support via `HandleInspect()`. When you inspect it in Observer UI (http://localhost:9911), you see:
+The dashboard is organized top-down: Summary row at the top for cluster health at a glance, then Mailbox Latency and Depth for backpressure analysis, then collapsed rows for Events, Process Activity, Processes, Resources, Logging, and Network. The Network row includes compression overview (ratio, rate, percentage), per-node compression ratio, fragmentation rates (cluster and per-node), connectivity strength, and connection events. Each row focuses on a specific aspect of cluster behavior and can be expanded when investigating issues.
-- Total number of registered metrics
-- HTTP endpoint URL for Prometheus scraping
-- Collection interval
-- Current values for all metrics (base + custom)
+For detailed panel descriptions, see the [metrics actor README](https://github.com/ergo-services/actor).
-This works automatically for custom metrics - register them in `Init()` and they appear in Observer alongside base metrics.
+## Observer Integration
-If you need custom inspection behavior, override `HandleInspect()` in your implementation:
+The metrics actor integrates with Observer via `HandleInspect()`. Inspecting the process shows total metric count, HTTP endpoint, collection interval, and current values for all metrics.
-```go
-func (m *AppMetrics) HandleInspect(from gen.PID, item ...string) map[string]string {
- result := make(map[string]string)
-
- // Custom inspection logic
- result["status"] = "healthy"
- result["custom_info"] = "some value"
-
- return result
-}
-```
+When embedding `metrics.Actor` and overriding `HandleInspect()`, your keys are merged on top of base inspection data.
+
+## Radar Application
-For detailed configuration options, see the `metrics.Options` struct and `ActorBehavior` interface in the package. For examples of custom metrics, see the [example directory](https://github.com/ergo-services/actor/tree/main/metrics/example).
+If your node needs both Prometheus metrics and Kubernetes health probes, consider the [Radar](../applications/radar.md) application. It runs the metrics actor and [Health](health.md) actor together on a single HTTP port.
diff --git a/docs/extra-library/applications/mcp.md b/docs/extra-library/applications/mcp.md
new file mode 100644
index 000000000..7c0e35349
--- /dev/null
+++ b/docs/extra-library/applications/mcp.md
@@ -0,0 +1,285 @@
+---
+description: AI-powered diagnostics for running Ergo nodes via Model Context Protocol
+---
+
+# MCP
+
+Diagnosing a distributed actor system is hard. The problem isn't a lack of data - it's knowing what to look for. A node has hundreds of processes, dozens of connections, thousands of events flowing between them. Something is slow, but where? A process is stuck, but why? Memory is growing, but what's holding it?
+
+Traditional monitoring collects predefined metrics at fixed intervals. You decide upfront what matters, build dashboards, and then interpret the data when something breaks. This works for known failure modes. It doesn't work when the failure is something you haven't anticipated - and in distributed systems, the interesting failures are always unanticipated.
+
+MCP takes a different approach. Instead of predefined metrics, it exposes the full diagnostic surface of the node - processes, applications, events, network, profiling, runtime - as tools that an AI agent can call on demand. The agent decides what to inspect based on the symptom you describe. It runs diagnostic sequences, correlates findings across tools, narrows down root causes, and explains what it found. You describe the problem in words; the agent finds the answer in data.
+
+The real power comes from combination. The agent can see your source code, inspect the live cluster via MCP, and query your log storage - all in the same conversation. It reads the actor implementation to understand intent, checks runtime state to see what actually happens, and correlates with error logs to see the history. Together these eliminate guesswork in a way no single tool can.
+
+The application runs as a regular Ergo sidecar. Add it to your node's application list, and every process, connection, and event becomes inspectable - without restarting, redeploying, or attaching a debugger.
+
+## Two Deployment Modes
+
+MCP has two modes: entry point and agent.
+
+An entry point node runs an HTTP listener that accepts MCP protocol requests. This is the node your AI client connects to. An agent node has no HTTP listener at all - it's invisible from outside the cluster. But it runs the same diagnostic tools internally, and any entry point can reach it through cluster proxy.
+
+In practice, you deploy one entry point and make everything else an agent:
+
+```go
+import (
+ "ergo.services/ergo"
+ "ergo.services/application/mcp"
+ "ergo.services/ergo/gen"
+)
+
+func main() {
+ node, _ := ergo.StartNode("example@localhost", gen.NodeOptions{
+ Applications: []gen.ApplicationBehavior{
+ // Entry point - the one HTTP endpoint for the entire cluster
+ mcp.CreateApp(mcp.Options{Port: 9922}),
+ },
+ })
+ node.Wait()
+}
+```
+
+On every other node, the same application with no port:
+
+```go
+// Agent mode - no HTTP, but fully diagnosable via cluster proxy
+mcp.CreateApp(mcp.Options{})
+```
+
+The AI client connects to `http://entry-point:9922/mcp` and reaches any node in the cluster through that single endpoint.
+
+## Configuration
+
+```go
+mcp.Options{
+ Host: "localhost", // Listen address
+ Port: 9922, // HTTP port (0 = agent mode)
+ Token: "secret", // Bearer token (empty = no auth)
+ ReadOnly: false, // Disable action tools
+ AllowedTools: nil, // Tool whitelist (nil = all)
+ PoolSize: 5, // Worker processes
+ CertManager: nil, // TLS certificate manager
+ LogLevel: gen.LogLevelInfo,
+}
+```
+
+**Port** controls the deployment mode. A non-zero value starts an HTTP listener - this is an entry point. Zero means agent mode: no listener, accessible only via cluster proxy from another node that has an entry point.
+
+**Token** enables Bearer token authentication. When set, every HTTP request must include `Authorization: Bearer <token>`. When empty, no authentication is required. Agent mode nodes don't need a token - they're accessed through the Ergo inter-node protocol, which has its own authentication via handshake cookies.
+
+**ReadOnly** disables tools that modify state: `send_message`, `call_process`, `send_exit`, `process_kill`. Everything else - inspection, profiling, sampling - remains available. Use this on production nodes where you want full visibility without the ability to interfere.
+
+**AllowedTools** restricts the tool set to a whitelist. When set, only the named tools are available. This is finer-grained than ReadOnly - you can, for example, allow `send_message` but not `process_kill`. When nil, all tools are enabled (respecting ReadOnly).
+
+## Connecting a Client
+
+### Claude Code
+
+```bash
+claude mcp add --transport http ergo http://localhost:9922/mcp
+
+# Available from any directory (user scope)
+claude mcp add --transport http ergo --scope user http://localhost:9922/mcp
+
+# With authentication
+claude mcp add --transport http ergo http://localhost:9922/mcp \
+ -H "Authorization: Bearer my-secret-token"
+```
+
+To allow all MCP tools without per-call permission prompts, add to `.claude/settings.json`:
+
+```json
+{
+ "permissions": {
+ "allow": ["mcp__ergo"]
+ }
+}
+```
+
+### Other Clients
+
+The application implements MCP protocol version `2025-06-18` over HTTP. Any MCP-compatible client can connect by sending JSON-RPC 2.0 POST requests to `http://<host>:<port>/mcp`.
+
+## How Cluster Proxy Works
+
+Every tool accepts a `node` parameter. When specified, the entry point node forwards the request to the target node via native Ergo inter-node protocol - not HTTP. The target node's MCP worker executes the tool locally and returns the result through the same path.
+
+This works because of network transparency. The entry point calls `gen.ProcessID{Name: "mcp", Node: targetNode}` - the framework establishes a connection if needed, routes the request, and delivers the response. You never need to explicitly connect to a node before querying it. If the registrar knows about the target node, the connection happens automatically.
+
+The `timeout` parameter (default 30 seconds, max 120) controls how long the entry point waits for a remote response. Most tools respond in milliseconds. But CPU profiling collects data for a requested duration before responding, and goroutine dumps on large nodes take time to serialize. For these, pass a higher timeout.
+
+If a remote tool call fails with "remote call failed", it usually means the target node doesn't have the MCP application running. All proxy calls require an MCP pool process on the target node - agent mode is sufficient, but the application must be loaded and started.
+
+## Profiling Remote Nodes
+
+Profiling tools generate large output. A goroutine dump from a node with 500 goroutines can be megabytes of text. A heap profile with hundreds of allocation sites isn't much smaller. Push all of that through the proxy chain - remote node, entry point, HTTP, JSON-RPC - and you hit timeouts or transport limits.
+
+The solution is server-side filtering. All profiling tools accept `filter` and `exclude` parameters that reduce the output before it leaves the remote node. Instead of transferring 500 goroutine stacks and searching locally, you tell the remote node to return only the stacks that match:
+
+```
+pprof_goroutines node=backend@host debug=1 filter="orderHandler" limit=20
+```
+
+The response header preserves the full picture: `goroutine profile: total 500, matched 3, showing 3`. You know the node has 500 goroutines, but only 3 matched your filter, and all 3 were returned. The agent can refine the filter, broaden it, or switch to a different angle - each query is cheap because the heavy lifting happens on the remote node.
+
+### CPU Profiling
+
+The `pprof_cpu` tool collects a CPU profile for a given duration and returns the top functions by CPU usage:
+
+```
+pprof_cpu node=backend@host duration=5 exclude="runtime" limit=15 timeout=30
+```
+
+The node samples CPU activity for 5 seconds, aggregates by function, filters out Go runtime internals, and returns the top 15 application functions with flat and cumulative percentages. The `timeout` should be higher than `duration` to account for collection and transfer time.
+
+### Heap Profiling
+
+The `pprof_heap` tool shows the top memory allocators with two columns: `inuse` (live objects currently in memory) and `alloc` (cumulative allocations over the node's lifetime). A function with low `inuse` but high `alloc` is churning memory - allocating and releasing rapidly, putting pressure on the garbage collector.
+
+```
+pprof_heap node=backend@host filter="myapp" limit=20
+```
+
+### Goroutine Analysis
+
+The `pprof_goroutines` tool has two modes. Without `pid`, it returns all goroutines on the node - use `filter` and `exclude` to narrow down. With `pid`, it returns the stack trace of a specific process's goroutine (requires `-tags=pprof`).
+
+Debug level controls the output format: `debug=1` groups goroutines by identical stack (compact summary with counts), `debug=2` shows individual goroutine traces with state and wait duration.
+
+A sleeping process parks its goroutine - it won't appear in the dump. To catch it, use an active sampler that polls until the process wakes up:
+
+```
+sample_start tool=pprof_goroutines arguments={"pid":"<PID>"} interval_ms=300 count=1 max_errors=0
+```
+
+The sampler ignores the "goroutine not found" error (`max_errors=0`) and keeps polling every 300ms until it catches the process in a non-sleep state.
+
+## Samplers
+
+Snapshots show one moment. Trends show the story. Samplers bridge this gap by collecting data into ring buffers that the agent reads incrementally.
+
+### Active Samplers
+
+An active sampler periodically calls any MCP tool and stores the results. It's a generic periodic executor - any tool with any arguments can be sampled.
+
+```
+sample_start tool=process_list arguments={"sort_by":"mailbox","limit":10} interval_ms=5000 duration_sec=300
+```
+
+This calls `process_list` every 5 seconds for 5 minutes, storing each result in a ring buffer. The agent reads with `sample_read sampler_id=<id>` to get all buffered entries, or `sample_read sampler_id=<id> since=5` to get only entries newer than sequence 5.
+
+The `max_errors` parameter controls error tolerance. The default (0) means ignore all errors and keep retrying - useful for polling rare conditions. A non-zero value stops the sampler after that many consecutive failures.
+
+### Passive Samplers
+
+A passive sampler listens for events instead of polling. It captures log messages and event publications as they happen:
+
+```
+sample_listen log_levels=["warning","error"] duration_sec=120
+sample_listen event=order_events duration_sec=60
+sample_listen log_levels=["error"] event=order_events duration_sec=120
+```
+
+Log capture and event subscription can be combined in a single sampler.
+
+### Linger
+
+Every sampler has a `linger_sec` parameter (default 30). After the sampler completes - duration expires, count reached, or max errors exceeded - it stays alive for this many additional seconds so the agent can retrieve the collected data. Without linger, a sampler that runs for 10 seconds would terminate before the agent gets a chance to read the results.
+
+The `sample_list` tool shows sampler status: `running`, `completed, lingering 25s`, or `completed`. The `sample_stop` tool terminates a sampler immediately, bypassing the linger period.
+
+### What to Sample
+
+| Goal | Sampler |
+|------|---------|
+| Mailbox pressure trend | `sample_start tool=process_list arguments={"sort_by":"mailbox","limit":10}` |
+| Memory and GC trend | `sample_start tool=runtime_stats interval_ms=5000` |
+| Error storm detection | `sample_listen log_levels=["error","panic"]` |
+| Event traffic monitoring | `sample_listen event=<name>` |
+| Network health trend | `sample_start tool=network_nodes interval_ms=30000` |
+| CPU hotspot sampling | `sample_start tool=pprof_goroutines arguments={"debug":1,"filter":"ProcessRun","exclude":"toolPprof","limit":20} interval_ms=500` |
+
+## Typed Messages
+
+When `ReadOnly` is not set, the agent can send messages to processes and make synchronous calls using the EDF type registry. This isn't raw JSON injection - the framework constructs real Go structs from the type information.
+
+If your application registers a type:
+
+```go
+type StatusRequest struct {
+ Verbose bool
+}
+
+func (a *MyApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
+ if err := node.Network().RegisterType(StatusRequest{}); err != nil {
+ return gen.ApplicationSpec{}, err
+ }
+ return gen.ApplicationSpec{ /* ... */ }, nil
+}
+```
+
+The agent discovers it with `message_types`, inspects its fields with `message_type_info`, and sends it with `call_process`. The process receives a real `StatusRequest{Verbose: true}` in its `HandleCall`, not a map or raw bytes.
+
+This makes interactive debugging possible: the agent can call any process with any registered request type, inspect the response, and reason about the behavior.
+
+## Network Diagnostics
+
+The `network_ping` tool sends a request through the full network path - flusher, TCP connection, remote MCP worker, response - and measures the round-trip time. This is an end-to-end health check, not a TCP-level ping. If the flusher is broken, the connection pool is degraded, or the remote node is overloaded, the ping will reflect it.
+
+```
+network_ping name=backend@host
+→ ping backend@host: rtt 0.42ms
+```
+
+For deeper connection analysis, `network_node_info` shows per-connection statistics: messages in/out, bytes in/out, pool size, pool DSN (which side dialed), and a `Reconnections` counter that tracks how many times pool items have reconnected. A non-zero reconnection count indicates connection instability.
+
+When investigating connection problems, always check both sides:
+
+```
+network_node_info node=A name=B # A's view of the connection to B
+network_node_info node=B name=A # B's view of the connection to A
+```
+
+Asymmetry between the two sides - one sees thousands of messages out while the other sees one message in - indicates data loss at the connection level.
+
+## Build Tags
+
+Two build tags enable additional diagnostic capabilities. Both add a small amount of overhead and should be enabled in staging and production builds where diagnostics matter.
+
+**`-tags=pprof`** enables the Go profiler and labels actor goroutines with their process PID. The labels appear in goroutine dumps as `{"pid":"<PID>"}` for actors and `{"meta":"Alias#...", "role":"reader"}` for meta processes. The `pprof_goroutines` tool with `pid` parameter uses these labels to extract a specific actor's stack trace. Without this tag, the `pid` parameter returns an error.
+
+This tag also starts a pprof HTTP endpoint at `localhost:9009/debug/pprof/` (configurable via `PPROF_HOST` and `PPROF_PORT` environment variables) for use with `go tool pprof`.
+
+**`-tags=latency`** enables mailbox latency measurement. Each mailbox queue tracks the age of its oldest unprocessed message. The `process_list` tool gains `min_mailbox_latency_ms` filter and `mailbox_latency` sort field. Without this tag, latency fields return -1.
+
+## Relationship to Metrics Actor
+
+The [Metrics](../actors/metrics.md) actor collects predefined metrics into Prometheus format for scraping. MCP reads from the same underlying data sources - `ProcessRangeShortInfo`, `NodeInfo`, `EventRangeInfo` - but exposes them interactively.
+
+Active samplers can replicate any Prometheus metric: `sample_start tool=process_list arguments={"sort_by":"mailbox","limit":10}` is equivalent to `ergo_mailbox_depth_top`. The difference is that MCP samplers are on-demand and agent-driven, while Prometheus metrics are always-on and scraper-driven.
+
+Use the metrics actor for long-term trends, alerting, and Grafana dashboards. Use MCP for interactive investigation when alerts fire or when you need to explore something unexpected.
+
+## Agent and Skill for Claude Code
+
+A ready-to-use diagnostic agent and skill are available at [github.com/ergo-services/claude](https://github.com/ergo-services/claude). The agent contains playbooks for common scenarios: performance bottlenecks, process leaks, restart loops, zombie processes, memory growth, network issues, event system problems, goroutine investigation, and cluster health checks. Trigger it by describing a symptom - "why is it slow", "check the cluster", "find the process leak" - and it runs the appropriate diagnostic sequence.
+
+Install as a Claude Code plugin:
+
+```bash
+/plugin marketplace add ergo-services/claude
+/plugin install ergo@ergo-services
+```
+
+Or symlink into `~/.claude/` for local development:
+
+```bash
+cd ergo.services/claude
+ln -sf $(pwd)/agents/devops.md ~/.claude/agents/
+ln -sf $(pwd)/skills/devops ~/.claude/skills/
+```
+
+## Full Tool Reference
+
+The complete list of 48 tools with parameters and descriptions is in the [MCP application README](https://github.com/ergo-services/application/blob/master/mcp/README.md).
diff --git a/docs/extra-library/applications/observer.md b/docs/extra-library/applications/observer.md
index 797e29b77..04a7e8074 100644
--- a/docs/extra-library/applications/observer.md
+++ b/docs/extra-library/applications/observer.md
@@ -1,6 +1,14 @@
+---
+description: Real-time web UI for monitoring and inspecting Ergo nodes
+---
+
# Observer
-The Application _Observer_ provides a convenient web interface to view node status, network activity, and running processes in the node built with Ergo Framework. Additionally, it allows you to inspect the internal state of processes or meta-processes. The application is can also be used as a standalone tool Observer. For more details, see the section [Inspecting With Observer](../../tools/observer.md). You can add the _Observer_ application to your node during startup by including it in the node's startup options:
+Observer is a web application that embeds into your node and provides real-time visibility into the running system. It uses Server-Sent Events (SSE) for live push updates to the browser.
+
+For a detailed description of the UI and all available views, see [Inspecting With Observer](../../advanced/observer.md).
+
+## Adding Observer to a node
-The function `observer.CreateApp` takes `observer.Options` as an argument, allowing you to configure the _Observer_ application. You can set:
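
A minimal sketch, assuming the `ergo.services/application/observer` import path (mirroring how the other applications in this library are added):

```go
import (
	"ergo.services/application/observer"
	"ergo.services/ergo"
	"ergo.services/ergo/gen"
)

func main() {
	node, _ := ergo.StartNode("example@localhost", gen.NodeOptions{
		Applications: []gen.ApplicationBehavior{
			// Observer web UI on the default port
			observer.CreateApp(observer.Options{Port: 9911}),
		},
	})
	node.Wait()
}
```
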
+Open `http://localhost:9911` in your browser.
+
+## Options
+
+`observer.CreateApp` accepts `observer.Options`:
+
+* **Host**: interface to listen on. Default: `localhost`.
+* **Port**: HTTP port. Default: `9911`.
+* **PoolSize**: number of worker processes handling requests. Default: `10`.
+* **LogLevel**: log level for Observer's own processes. Default: `gen.LogLevelInfo`.
+
+## What Observer shows
+
+### Node
+
+General node information: name, version, OS, architecture, CPU cores, timezone, uptime, memory usage, process count, goroutine count. Memory graph updates live over the last 60 seconds. Node-level log level can be changed directly from this view.
+
+### Processes
+
+Full process list with per-process metrics: state, mailbox depth, message latency, running time, wakeup count, uptime. Supports filtering by name pattern, behavior type, application, state, and minimum mailbox depth.
+
+Clicking a process opens its detail view: supervision tree position, links, monitors, registered names, aliases, environment variables, and the internal state returned by `HandleInspect`.
+
+Any actor that implements `HandleInspect` exposes its state as a live-updating key-value panel in the browser:
+
+```go
+func (a *MyActor) HandleInspect(from gen.PID, item ...string) map[string]string {
+ return map[string]string{
+ "connections": fmt.Sprintf("%d", a.connCount),
+ "last_error": a.lastError,
+ }
+}
+```
+
+### Meta-processes
+
+Meta-processes (TCP servers, WebSocket handlers, SSE handlers, Port processes, and others) with their state, type, and parent process.
+
+### Applications
+
+All loaded applications with state (loaded, running, stopping), mode, uptime, and their process groups. Full application process tree on click. Applications can be started, stopped, and unloaded from this view.
+
+### Network
+
+Network stack details: mode, acceptors, protocol and handshake versions, registrar. Below the top stat cards and acceptors, three tabs are available:
+
+* **Connections** lists all active remote node connections with traffic counters and sortable columns.
+* **Routes** shows configured static routes and proxy routes.
+* **Types** shows the wire-format type registry (one entry per proto): registration ID, name, kind, MinSize (zero-value wire size), and the inferred schema. Two filters narrow the list by name and by schema content. The data is captured on demand via the Refresh button. With `-tags=typestats`, additional columns show per-type encode/decode counts and decompressed wire-byte totals (see [Debugging: typestats Tag](../../advanced/debugging.md#the-typestats-tag)).
+
+### Events
+
+All registered events: producer PID, subscriber count, buffered flag, and publication statistics. Filter by name, notification mode, buffered mode, and minimum subscriber count.
+
+### Logs
+
+Live log stream from the observed node. Filter by level (debug, info, warning, error, panic).
+
+### Profiler
+
+**Goroutines** - dump of all goroutines with stack traces, grouping, and filtering by state and minimum wait time.
+
+**Heap** - allocation profile showing top call sites by bytes. Filter by minimum allocation size.
+
+Both are available without restarting the node or enabling any special build flags.
+
+## Actions
+
+Observer is not read-only. From process and meta-process views you can:
+
+* **Send a message** to a process or meta-process
+* **Send an exit signal** with a custom reason
+* **Kill** a process
+* **Change log level** for the node, a specific process, or a specific meta-process
+* **Adjust per-process network settings**: send priority, message ordering (`KeepNetworkOrder`), important delivery, compression type/level/threshold
+
+## Inspecting the whole cluster
-* **Port**: The port number for the web server (default: `9911` if not specified).
-* **Host**: The interface name (default: `localhost`).
-* **LogLevel**: The logging level for the Observer application (useful for debugging). The default is `gen.LogLevelInfo`
+Observer communicates with the `system` application, which is started automatically on every Ergo node. Because of this, a single Observer instance can switch to any node in the cluster and inspect it without deploying anything extra to that node. Use the node selector in the UI to connect to any cluster node, via the registrar if configured, or by entering the host, port, and cookie explicitly.
diff --git a/docs/extra-library/applications/pulse.md b/docs/extra-library/applications/pulse.md
new file mode 100644
index 000000000..a80ab55d5
--- /dev/null
+++ b/docs/extra-library/applications/pulse.md
@@ -0,0 +1,260 @@
+# Pulse
+
+Tracing in Ergo Framework records observations locally on each node. To see the complete picture of a trace spanning multiple nodes, you need to send those observations to an external system that assembles them. Pulse exports tracing observations to any OTLP-compatible backend (Grafana Tempo, Jaeger, OpenTelemetry Collector) over HTTP.
+
+Pulse runs as an application on your node. It registers itself as a tracing exporter, receives observations from the framework, batches them, and periodically flushes them to the configured collector. Each node in your cluster runs its own Pulse instance pointing to the same collector, and the backend assembles cross-node traces automatically.
+
+## Adding to Your Node
+
+```go
+import (
+ "ergo.services/application/pulse"
+ "ergo.services/ergo"
+ "ergo.services/ergo/gen"
+)
+
+func main() {
+ node, err := ergo.StartNode("mynode@localhost", gen.NodeOptions{
+ Applications: []gen.ApplicationBehavior{
+ pulse.CreateApp(pulse.Options{
+ URL: "http://tempo:4318/v1/traces",
+ }),
+ },
+ })
+ if err != nil {
+ panic(err)
+ }
+ node.Wait()
+}
+```
+
+With this configuration, Pulse sends observations to `http://tempo:4318/v1/traces` using protobuf encoding. The node name (`mynode@localhost`) is used as the OTLP resource `service.name`, so the backend groups observations by node.
+
+## Configuration
+
+```go
+pulse.Options{
+ URL: "http://tempo:4318/v1/traces", // full collector URL
+ Headers: map[string]string{ // custom HTTP headers
+ "Authorization": "Bearer <token>",
+ },
+ BatchSize: 512, // flush after N observations
+ FlushInterval: 5 * time.Second, // max time between flushes
+ PoolSize: 3, // number of export workers
+ ExportTimeout: 10 * time.Second, // HTTP request timeout
+ Flags: gen.TracingFlagSend | // which observations to receive
+ gen.TracingFlagReceive |
+ gen.TracingFlagProcs,
+}
+```
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| URL | `http://localhost:4318/v1/traces` | Full OTLP/HTTP collector URL. |
+| Headers | none | Custom HTTP headers sent with every export request. Use for authentication tokens or routing headers. |
+| BatchSize | `512` | Maximum number of observations in a batch. When the batch reaches this size, it is flushed immediately. |
+| FlushInterval | `5s` | Maximum time between flushes. Even if the batch is not full, it is flushed after this interval. |
+| PoolSize | `3` | Number of export workers. Each worker maintains its own HTTP client and batch buffer. Increase if your observation rate exceeds what three workers can export. |
+| ExportTimeout | `10s` | HTTP request timeout per flush. If the collector doesn't respond within this time, the flush fails and the error is logged. |
+| Flags | Send + Receive + Procs | Which observation types Pulse receives. By default, Pulse receives everything. Set a subset to reduce volume, for example `TracingFlagSend` to export only Sent observations. |
+
+## How It Works
+
+Pulse starts a pool of worker actors. The pool registers itself as a process-based tracing exporter on the node. When the framework emits an observation matching the configured flags, it delivers the observation to the pool, which distributes it to a worker.
+
+Each worker maintains a batch buffer. Observations accumulate until either the batch reaches `BatchSize` or `FlushInterval` elapses, whichever comes first. On flush, the worker converts the batch to OTLP protobuf format and sends it via HTTP POST to the collector.
+
+Each worker has its own HTTP client with persistent connections. Workers operate independently. If one worker's flush is slow (waiting on the network), others continue batching and flushing. This provides throughput resilience under variable network conditions.
+
+If a flush fails (network error, collector down, non-2xx response), the error is logged and the worker continues with the next batch. Observations from the failed batch are lost. This is a deliberate trade-off: retrying failed batches would introduce unbounded memory growth and backpressure that could affect the node's primary workload.
+
+On shutdown, each worker flushes any remaining observations before terminating.
+
+## OTLP Span Mapping
+
+Each Ergo observation becomes one OTLP span. The mapping is deterministic. Any node can compute the OTLP span ID for any observation without coordination.
+
+### Span ID Encoding
+
+The OTLP span ID encodes both the Ergo span ID and the observation point:
+
+```
+OTLP SpanID = ErgoSpanID << 2 | Point
+```
+
+Where Point is: Sent=1, Delivered=2, Processed=3.
+
+This means the three observations for a single message (Sent, Delivered, Processed) have related but distinct OTLP span IDs. Given any one, you can compute the other two.
+
+### Parent-Child Relationships
+
+| Observation | OTLP Parent | Meaning |
+|-------------|-------------|---------|
+| Sent (with parent) | Processed of causing message | "sent because of processing that message" |
+| Sent (root) | none | first message in trace |
+| Delivered | Sent of same message | "delivered after sent" |
+| Processed | Sent of same message | "processed after sent" |
+| Terminate.Processed | Processed of parent context | "process terminated" (no Sent for Terminate) |
+
+Sent is the anchor for each message. Delivered and Processed are its children at the same level. Response spans nest under Request.Processed, forming a natural call hierarchy:
+
+```
+Req.Sent
+├── Req.Delivered
+└── Req.Processed
+ └── Resp.Sent
+ └── Resp.Delivered
+```
+
+### Span Attributes
+
+Every OTLP span includes framework attributes prefixed with `ergo.`:
+
+- `ergo.node` : node where the observation was recorded
+- `ergo.from` : sender process identity
+- `ergo.to` : recipient identity
+- `ergo.kind` : Send, Request, Response, Spawn, or Terminate
+- `ergo.point` : Sent, Delivered, or Processed
+- `ergo.behavior` : actor behavior type name
+- `ergo.message` : message type name
+- `ergo.ref` : call reference (for Request/Response correlation)
+
+Custom attributes set by the process via `SetTracingAttribute` and `SetTracingSpanAttribute` are included as additional OTLP span attributes.
+
+### Span Name
+
+The OTLP span name is formatted as:
+
+```
+{behavior} {kind}.{point} {message}
+```
+
+For example: `OrderProcessor Send.Sent main.ReserveStock`.
+
+### Span Kind Mapping
+
+The OTLP SpanKind depends on both the Ergo kind and the observation point:
+
+| Ergo Kind + Point | OTLP SpanKind |
+|-------------------|---------------|
+| Send.Sent | PRODUCER |
+| Send.Delivered | CONSUMER |
+| Send.Processed | CONSUMER |
+| Request.Sent | CLIENT |
+| Request.Delivered | SERVER |
+| Request.Processed | SERVER |
+| Response.Sent | SERVER |
+| Response.Delivered | CLIENT |
+| Response.Processed | SERVER |
+| Spawn | INTERNAL |
+| Terminate | INTERNAL |
+
+The Sent side of a message gets the initiator kind (CLIENT/PRODUCER), while the Delivered/Processed side gets the handler kind (SERVER/CONSUMER). For Response, the roles are inverted: Sent is SERVER (handler sending back), Delivered is CLIENT (caller receiving the answer).
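+
+The table reads as a small decision function. A standalone sketch of that mapping (plain strings stand in for the actual Ergo and OTLP enum types):
+
+```go
+package main
+
+import "fmt"
+
+// spanKind maps an Ergo kind and observation point to the OTLP SpanKind
+// following the table above. Strings stand in for the real enum types.
+func spanKind(kind, point string) string {
+    switch kind {
+    case "Send":
+        if point == "Sent" {
+            return "PRODUCER"
+        }
+        return "CONSUMER"
+    case "Request":
+        if point == "Sent" {
+            return "CLIENT"
+        }
+        return "SERVER"
+    case "Response":
+        if point == "Delivered" {
+            return "CLIENT"
+        }
+        return "SERVER"
+    default: // Spawn, Terminate
+        return "INTERNAL"
+    }
+}
+
+func main() {
+    fmt.Println(spanKind("Request", "Sent"))       // CLIENT
+    fmt.Println(spanKind("Response", "Delivered")) // CLIENT
+    fmt.Println(spanKind("Send", "Processed"))     // CONSUMER
+}
+```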
+
+## Reading Traces in Grafana
+
+OTLP was designed for request-response services where a span represents a unit of work with a start and end time. Ergo's actor model is different: messages are instantaneous events (sent, delivered, processed), not duration-based operations. Pulse maps each event to a zero-duration OTLP span placed at the exact timestamp when the event occurred.
+
+In trace visualization tools (Grafana, Jaeger, Zipkin), these appear as dots on a timeline rather than bars. This is expected. The horizontal distance between dots shows actual timing, and the tree structure shows causality.
+
+### Call (Request/Response)
+
+```
+Time ─────────────────────────────────────────────────────────►
+
+Node A    ●                                                               ●
+          Req.Sent                                                        Resp.Delivered
+          (CLIENT)                                                        (CLIENT)
+
+Node B                    ●              ●              ●
+                          Req.Delivered  Req.Processed  Resp.Sent
+                          (SERVER)       (SERVER)       (SERVER)
+
+          ├─── network ───┤── handling ───┤             ├─── network ───┤
+```
+
+- Req.Sent to Req.Delivered = network latency from A to B
+- Req.Delivered to Req.Processed = time B spent handling the request
+- Req.Processed to Resp.Sent = response creation time
+- Resp.Sent to Resp.Delivered = network latency from B back to A
+
+### Send (async)
+
+```
+Time ──────────────────────────────────────►
+
+Node A    ●
+          Send.Sent
+          (PRODUCER)
+
+Node B                    ●              ●
+                          Send.Delivered  Send.Processed
+                          (CONSUMER)      (CONSUMER)
+
+          ├─── network ───┤── handling ───┤
+```
+
+### Forward (multi-hop)
+
+```
+Time ──────────────────────────────────────────────────────────────────────►
+
+Node A    ●                                                                   ●
+          Req.Sent                                                            Resp.Delivered
+
+Node B                  ●          ● ●
+                        Req.Delivered Req.Processed
+                                     Fwd.Sent
+
+Node C                                             ●          ● ●
+                                                   Fwd.Delivered Fwd.Processed
+                                                                Resp.Sent
+
+          ├── network ──┤─ handling ─┤── network ──┤─ handling ─┤── network ──┤
+```
+
+For duration-based visualization with timing bars, use the Observer web UI which renders Ergo traces natively.
+
+## Inspecting Workers
+
+Each Pulse worker exposes statistics through the standard inspection mechanism. In the Observer process list, find the Pulse worker processes and inspect them to see:
+
+- `spans_received` : total observations received by this worker
+- `spans_exported` : total observations successfully exported
+- `export_errors` : total failed flush attempts
+- `batch_size` : current batch length
+
+These counters help diagnose export problems: if `export_errors` is growing, the collector may be unreachable or overloaded.
+
+## Grafana Dashboard
+
+Pulse includes a ready-to-use Grafana dashboard for trace search. Import `grafana-tracing.json` from the Pulse module into your Grafana instance. During import, Grafana will ask you to select a Tempo datasource.
+
+The dashboard provides a TraceQL filter for searching traces by node, behavior, message type, or any span attribute. Results include columns for service name, ergo.kind, ergo.behavior, and ergo.message. Click any Trace ID to open the full waterfall view.
+
+## Grafana Tempo Setup
+
+A minimal Tempo configuration for local development:
+
+```yaml
+# tempo.yaml
+server:
+  http_listen_port: 3200
+
+distributor:
+  receivers:
+    otlp:
+      protocols:
+        http:
+          endpoint: "0.0.0.0:4318"
+
+storage:
+  trace:
+    backend: local
+    local:
+      path: /var/tempo/traces
+    wal:
+      path: /var/tempo/wal
+```
+
+Point Pulse at `tempo:4318` with `Insecure: true`. In Grafana, add Tempo as a data source (`http://tempo:3200`) and use the Explore view to search for traces by trace ID or attributes.
diff --git a/docs/extra-library/applications/radar.md b/docs/extra-library/applications/radar.md
new file mode 100644
index 000000000..f0a8fe74c
--- /dev/null
+++ b/docs/extra-library/applications/radar.md
@@ -0,0 +1,368 @@
+# Radar
+
+Running an Ergo node in production typically requires two things: health probes for Kubernetes and a Prometheus metrics endpoint. Setting them up separately means two HTTP servers on two ports, two actor packages to import, and the same wiring code repeated on every node.
+
+Radar bundles both into a single application on one HTTP port. Internally it runs a [Health](../actors/health.md) actor for probe endpoints, a [Metrics](../actors/metrics.md) actor for base Ergo telemetry, and a pool of metrics workers for custom metric updates, all behind a shared mux served by one HTTP server. Actors interact with Radar through helper functions in the `radar` package without importing the underlying packages or knowing the internal actor names.
+
+## Adding to Your Node
+
+```go
+import (
+ "ergo.services/application/radar"
+ "ergo.services/ergo"
+ "ergo.services/ergo/gen"
+)
+
+func main() {
+    node, _ := ergo.StartNode("mynode@localhost", gen.NodeOptions{
+        Applications: []gen.ApplicationBehavior{
+            radar.CreateApp(radar.Options{Port: 9090}),
+        },
+    })
+
+    // Health:  http://localhost:9090/health/live
+    //          http://localhost:9090/health/ready
+    //          http://localhost:9090/health/startup
+    // Metrics: http://localhost:9090/metrics
+
+    node.Wait()
+}
+```
+
+With no signals registered, all three health endpoints return 200 with `{"status":"healthy"}`. The metrics endpoint immediately serves base Ergo metrics. No additional configuration is required for a working production setup.
+
+## Configuration
+
+```go
+radar.Options{
+    Host:                   "0.0.0.0",
+    Port:                   9090,
+    HealthPath:             "/health",
+    MetricsPath:            "/metrics",
+    HealthCheckInterval:    2 * time.Second,
+    MetricsCollectInterval: 15 * time.Second,
+    MetricsTopN:            100,
+    MetricsPoolSize:        5,
+}
+```
+
+**Host** determines which network interface the HTTP server binds to. Default is `"localhost"`. Use `"0.0.0.0"` for containerized environments where probes and scraping come from outside the pod.
+
+**Port** sets the single HTTP port for all endpoints. Default is `9090`. Choose a port that does not conflict with your application's own listeners.
+
+**HealthPath** sets the URL prefix for health probe endpoints. Default is `"/health"`. The actual endpoints become `HealthPath+"/live"`, `HealthPath+"/ready"`, `HealthPath+"/startup"`. Change this when deploying behind a reverse proxy that expects a different path prefix.
+
+**MetricsPath** sets the URL path for the Prometheus scrape target. Default is `"/metrics"`.
+
+**HealthCheckInterval** controls how often the health actor checks for expired heartbeats. Default is 1 second. Shorter intervals detect failures faster but increase internal message traffic. For most applications, 1-2 seconds provides a good balance.
+
+**MetricsCollectInterval** sets how often base Ergo metrics are collected (processes, memory, CPU, network, events). Default is 10 seconds. Align this with your Prometheus scrape interval: collecting more frequently than Prometheus scrapes wastes CPU, while collecting less frequently means Prometheus may see stale values.
+
+**MetricsTopN** limits the number of entries in per-process and per-event top-N metrics tables. Default is 50. Increase this for large nodes with thousands of processes where you need broader visibility into the tail. The collection cost scales linearly with TopN.
+
+**MetricsPoolSize** sets the number of worker actors in the custom metrics pool. Default is 3. Under normal load, a single worker is sufficient. Increase this if many actors send frequent metric updates and you observe the metrics mailbox growing.
+
+## Health Probes
+
+Actors register signals with Radar, specifying which probes the signal affects and an optional heartbeat timeout. The health actor monitors the registering process; if it terminates, all its signals are automatically marked as down.
+
+### Registering a Signal
+
+```go
+func (w *DBWorker) Init(args ...any) error {
+    radar.RegisterService(w, "postgres",
+        radar.ProbeLiveness|radar.ProbeReadiness, 10*time.Second)
+
+    w.scheduleHeartbeat()
+    return nil
+}
+
+func (w *DBWorker) HandleMessage(from gen.PID, message any) error {
+    switch message.(type) {
+    case messageHeartbeat:
+        radar.Heartbeat(w, "postgres")
+        w.scheduleHeartbeat()
+    }
+    return nil
+}
+
+func (w *DBWorker) scheduleHeartbeat() {
+    w.cancelHeartbeat, _ = w.SendAfter(w.PID(), messageHeartbeat{}, 3*time.Second)
+}
+```
+
+The signal `"postgres"` participates in both liveness and readiness probes. If the heartbeat stops arriving (timeout expires) or the process terminates, Kubernetes receives a 503 on both `/health/live` and `/health/ready`. Keep the heartbeat period well below the registered timeout (here 3 seconds against a 10-second timeout) so a single delayed message does not mark the signal down.
+
+### Probe Types
+
+| Constant | Endpoint |
+|----------|----------|
+| `radar.ProbeLiveness` | `/health/live` |
+| `radar.ProbeReadiness` | `/health/ready` |
+| `radar.ProbeStartup` | `/health/startup` |
+
+Combine with bitwise OR. A signal registered for `ProbeLiveness|ProbeReadiness` affects both endpoints independently.
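+
+The combination behaves like ordinary bit flags. A self-contained sketch (the concrete flag values below are illustrative; the real constants are defined by the `radar` package):
+
+```go
+package main
+
+import "fmt"
+
+// Illustrative bit-flag values standing in for the radar.Probe* constants.
+type Probe uint8
+
+const (
+    ProbeLiveness Probe = 1 << iota
+    ProbeReadiness
+    ProbeStartup
+)
+
+func main() {
+    p := ProbeLiveness | ProbeReadiness
+    fmt.Println(p&ProbeLiveness != 0)  // true
+    fmt.Println(p&ProbeReadiness != 0) // true
+    fmt.Println(p&ProbeStartup != 0)   // false
+}
+```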
+
+### Manual Signal Control
+
+When you can detect failures immediately without waiting for a timeout:
+
+```go
+case CacheConnectionLost:
+    radar.ServiceDown(w, "cache")
+
+case CacheConnectionRestored:
+    radar.ServiceUp(w, "cache")
+```
+
+### Helper Functions
+
+```go
+radar.RegisterService(process, signal, probe, timeout) // sync Call
+radar.UnregisterService(process, signal)               // sync Call
+radar.Heartbeat(process, signal)                       // async Send
+radar.ServiceUp(process, signal)                       // async Send
+radar.ServiceDown(process, signal)                     // async Send
+```
+
+`RegisterService` and `UnregisterService` are synchronous calls that return an error on failure. `Heartbeat`, `ServiceUp`, and `ServiceDown` are asynchronous sends (fire-and-forget).
+
+For a detailed explanation of the heartbeat model, failure detection mechanisms, and the HTTP response format, see the [Health](../actors/health.md) actor documentation.
+
+## Custom Metrics
+
+Actors register Prometheus metric collectors and update them through Radar's helper functions. The underlying metrics actor manages the Prometheus registry and HTTP exposition. Registration is synchronous, updates are asynchronous.
+
+All custom metrics automatically receive a `node` const label set to the node name. Do not include `"node"` in your variable label names; it will cause a "duplicate label names" registration error.
+
+### Registering Metrics
+
+```go
+func (w *APIHandler) Init(args ...any) error {
+    radar.RegisterGauge(w, "active_connections",
+        "Number of active client connections", []string{"protocol"})
+
+    radar.RegisterCounter(w, "requests_total",
+        "Total HTTP requests processed", []string{"method", "status"})
+
+    radar.RegisterHistogram(w, "request_duration_seconds",
+        "Request latency distribution", []string{"method"},
+        []float64{0.01, 0.05, 0.1, 0.5, 1.0, 5.0})
+
+    return nil
+}
+```
+
+The `labels` parameter defines the label names for the metric. When updating, you provide label values in the same order. Pass `nil` for metrics without labels. The `buckets` parameter in `RegisterHistogram` defines histogram bucket boundaries; pass `nil` for Prometheus default buckets.
+
+### Updating Metrics
+
+```go
+func (w *APIHandler) HandleMessage(from gen.PID, message any) error {
+    switch msg := message.(type) {
+    case RequestCompleted:
+        radar.CounterAdd(w, "requests_total", 1,
+            []string{msg.Method, msg.StatusCode})
+        radar.HistogramObserve(w, "request_duration_seconds",
+            msg.Duration.Seconds(), []string{msg.Method})
+    case ConnectionChange:
+        radar.GaugeSet(w, "active_connections",
+            float64(msg.Count), []string{msg.Protocol})
+    }
+    return nil
+}
+```
+
+Updates are distributed across the worker pool. Under high throughput, multiple actors can send updates concurrently without contending on a single actor's mailbox.
+
+### Automatic Cleanup
+
+When a process that registered metrics terminates, all its metrics are automatically unregistered from the Prometheus registry. No explicit cleanup is needed. To remove a metric while the process is still running, use `radar.UnregisterMetric(process, name)`.
+
+### Helper Functions
+
+```go
+// Registration (sync Call, returns error)
+radar.RegisterGauge(process, name, help, labels)
+radar.RegisterCounter(process, name, help, labels)
+radar.RegisterHistogram(process, name, help, labels, buckets)
+radar.UnregisterMetric(process, name)
+
+// Updates (async Send, fire-and-forget)
+radar.GaugeSet(process, name, value, labels)
+radar.GaugeAdd(process, name, value, labels)
+radar.CounterAdd(process, name, value, labels)
+radar.HistogramObserve(process, name, value, labels)
+```
+
+For a detailed explanation of metric types, the Grafana dashboard, and advanced usage (embedding, shared mode), see the [Metrics](../actors/metrics.md) actor documentation.
+
+## Top-N Metrics
+
+Top-N metrics track the N highest (or lowest) values observed during each collection cycle and flush them to Prometheus as a GaugeVec. This is useful when you want to identify outliers (slowest queries, busiest workers, largest payloads) without creating a time series per item.
+
+### Registering and Observing
+
+```go
+func (w *QueryTracker) Init(args ...any) error {
+    // Keep the 10 slowest queries each cycle
+    radar.RegisterTopN(w, "slowest_queries", "Slowest DB queries",
+        10, radar.TopNMax, []string{"query", "table"})
+    return nil
+}
+
+func (w *QueryTracker) HandleMessage(from gen.PID, message any) error {
+    switch msg := message.(type) {
+    case queryCompleted:
+        radar.TopNObserve(w, "slowest_queries", msg.Duration.Seconds(),
+            []string{msg.SQL, msg.Table})
+    }
+    return nil
+}
+```
+
+Registration is synchronous (returns error). Observations are asynchronous (fire-and-forget). Each top-N metric is managed by a dedicated actor that accumulates observations and flushes the top entries to Prometheus on the same interval as base metrics collection.
+
+### Ordering Modes
+
+- `radar.TopNMax`: keeps the N largest values (e.g., slowest queries, busiest actors, highest memory)
+- `radar.TopNMin`: keeps the N smallest values (e.g., lowest latency, least active processes)
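+
+Conceptually, a `TopNMax` metric behaves like a bounded min-heap: the smallest kept value sits at the root and is evicted when a larger observation arrives. A standalone sketch of that selection logic (a conceptual illustration, not the actual Radar implementation):
+
+```go
+package main
+
+import (
+    "container/heap"
+    "fmt"
+    "sort"
+)
+
+// minHeap keeps the N largest observed values (TopNMax-style):
+// the smallest kept value is at the root and is evicted first.
+type minHeap []float64
+
+func (h minHeap) Len() int           { return len(h) }
+func (h minHeap) Less(i, j int) bool { return h[i] < h[j] }
+func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
+func (h *minHeap) Push(x any)        { *h = append(*h, x.(float64)) }
+func (h *minHeap) Pop() any {
+    old := *h
+    v := old[len(old)-1]
+    *h = old[:len(old)-1]
+    return v
+}
+
+// observe records a value, keeping only the n largest seen so far.
+func observe(h *minHeap, n int, v float64) {
+    if h.Len() < n {
+        heap.Push(h, v)
+        return
+    }
+    if v > (*h)[0] {
+        (*h)[0] = v // replace the smallest kept value
+        heap.Fix(h, 0)
+    }
+}
+
+func main() {
+    h := &minHeap{}
+    for _, v := range []float64{0.3, 1.2, 0.05, 2.5, 0.7} {
+        observe(h, 3, v)
+    }
+    kept := append([]float64(nil), *h...)
+    sort.Float64s(kept)
+    fmt.Println(kept) // [0.7 1.2 2.5]
+}
+```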
+
+### Automatic Cleanup
+
+When the process that registered a top-N metric terminates, the metric actor cleans up and unregisters from Prometheus. No explicit teardown needed.
+
+### Helper Functions
+
+```go
+// Registration (sync Call, returns error)
+radar.RegisterTopN(process, name, help, topN, order, labels)
+
+// Observation (async Send, fire-and-forget)
+radar.TopNObserve(process, name, value, labels)
+```
+
+## Common Patterns
+
+### Database Connection Pool
+
+An actor that manages a connection pool reports both health and metrics through Radar:
+
+```go
+func (w *DBPool) Init(args ...any) error {
+    // Health: liveness + readiness with heartbeat
+    radar.RegisterService(w, "db_pool",
+        radar.ProbeLiveness|radar.ProbeReadiness, 10*time.Second)
+
+    // Metrics: connection pool gauge
+    radar.RegisterGauge(w, "db_pool_connections",
+        "Database connection pool size", []string{"state"})
+
+    w.scheduleCheck()
+    return nil
+}
+
+func (w *DBPool) HandleMessage(from gen.PID, message any) error {
+    switch message.(type) {
+    case messageCheck:
+        if w.pool.Ping() == nil {
+            radar.Heartbeat(w, "db_pool")
+        }
+        radar.GaugeSet(w, "db_pool_connections",
+            float64(w.pool.ActiveCount()), []string{"active"})
+        radar.GaugeSet(w, "db_pool_connections",
+            float64(w.pool.IdleCount()), []string{"idle"})
+
+        w.scheduleCheck()
+    }
+    return nil
+}
+```
+
+A single periodic check updates both the health signal and connection pool metrics. If the database becomes unreachable, the heartbeat stops and Kubernetes removes the pod from service. The metrics endpoint continues to show the last known pool state until the pod restarts.
+
+### Startup Gate with Progress
+
+An actor that runs migrations uses the startup probe to prevent premature traffic, and reports progress via a gauge:
+
+```go
+func (w *Migrator) Init(args ...any) error {
+    radar.RegisterService(w, "migrations", radar.ProbeStartup, 0)
+    radar.RegisterGauge(w, "migrations_pending",
+        "Number of pending migrations", nil)
+
+    w.Send(w.PID(), messageRunMigrations{})
+    return nil
+}
+
+func (w *Migrator) HandleMessage(from gen.PID, message any) error {
+    switch message.(type) {
+    case messageRunMigrations:
+        pending := w.countPending()
+        radar.GaugeSet(w, "migrations_pending", float64(pending), nil)
+
+        if err := w.runNext(); err != nil {
+            return err
+        }
+
+        if w.countPending() > 0 {
+            w.Send(w.PID(), messageRunMigrations{})
+            return nil
+        }
+
+        // All done -- mark startup complete
+        radar.GaugeSet(w, "migrations_pending", 0, nil)
+        radar.ServiceUp(w, "migrations")
+        radar.UnregisterService(w, "migrations")
+    }
+    return nil
+}
+```
+
+While migrations run, the startup probe returns 503, Kubernetes waits, and Prometheus shows the remaining migration count. Once complete, the startup signal is released and liveness/readiness probes take over.
+
+## Kubernetes Configuration
+
+Configure Kubernetes probes and Prometheus scraping to point at the same port:
+
+```yaml
+apiVersion: v1
+kind: Pod
+spec:
+  containers:
+    - name: myapp
+      livenessProbe:
+        httpGet:
+          path: /health/live
+          port: 9090
+        periodSeconds: 10
+      readinessProbe:
+        httpGet:
+          path: /health/ready
+          port: 9090
+        periodSeconds: 10
+      startupProbe:
+        httpGet:
+          path: /health/startup
+          port: 9090
+        failureThreshold: 30
+        periodSeconds: 2
+```
+
+Prometheus scrape configuration:
+
+```yaml
+scrape_configs:
+  - job_name: 'ergo'
+    static_configs:
+      - targets: ['localhost:9090']
+    scrape_interval: 15s
+```
+
+Align `scrape_interval` with `MetricsCollectInterval` in Radar options. The default collect interval is 10 seconds; scraping more frequently than the collect interval returns identical data.
+
+## Relationship to Health and Metrics Actors
+
+Radar uses [Health](../actors/health.md) and [Metrics](../actors/metrics.md) actors internally. The helper functions in the `radar` package delegate to these actors by their internal registered names. If you need capabilities beyond what the helpers expose (embedding the metrics actor for direct Prometheus registry access, custom health actor behavior with `HandleSignalDown` callbacks, or shared mux with additional HTTP handlers), use the underlying actors directly.
+
+Radar is designed for the common case: production nodes that need standard health probes and Prometheus metrics with minimal setup. For advanced scenarios, the building blocks are available as separate packages.
diff --git a/docs/extra-library/network-protocols/erlang.md b/docs/extra-library/network-protocols/erlang.md
index 1af0d6b18..c1f58767d 100644
--- a/docs/extra-library/network-protocols/erlang.md
+++ b/docs/extra-library/network-protocols/erlang.md
@@ -33,6 +33,8 @@ To use this package, include `ergo.services/proto/erlang23/handshake`.
The `ergo.services/proto/erlang/dist` package implements the `gen.NetworkProto` and `gen.Connection` interfaces. To create it, use the `dist.Create` function and provide `dist.Options` as an argument, where you can specify the `FragmentationUnit` size in bytes. This value is used for fragmenting large messages. The default size is set to `65000` bytes.
+The Erlang DIST proto deliberately does **not** implement `gen.TypeRegistry`, because the Erlang external term format (ETF) carries primitives, atoms, lists, tuples, and binaries directly on the wire without a separate type-registration step. In a multi-proto setup, calls to `node.Network().RegisterType` skip the Erlang proto and register only in TypeRegistry-capable protos like the default ENP/EDF stack. Use `etf.RegisterTypeOf` (described below) to teach the Erlang decoder how to map incoming tuples or atoms to your Go types.
+
To use this package, include `ergo.services/proto/erlang/dist`.
### ETF data format
diff --git a/docs/faq.md b/docs/faq.md
new file mode 100644
index 000000000..949cf28fd
--- /dev/null
+++ b/docs/faq.md
@@ -0,0 +1,236 @@
+---
+description: Answers to the questions developers and AI assistants ask most often
+---
+
+# FAQ
+
+## General
+
+### What is Ergo Framework?
+
+Ergo is an open-source Go framework for building concurrent and distributed systems using the actor model. It brings Erlang/OTP design patterns, including isolated processes, supervision trees, and network-transparent messaging, to Go with zero external dependencies.
+
+### Is Ergo production-ready?
+
+Yes. Ergo is used in production systems. It supports [mTLS](networking/mutual-tls.md), [NAT traversal](networking/behind-the-nat.md), graceful shutdown, panic recovery with stack traces, and has a comprehensive test suite. The framework has been in active development since 2019.
+
+### What license is Ergo distributed under?
+
+MIT License. Free to use in commercial projects without restrictions.
+
+### What Go version is required?
+
+Go 1.21 or higher. No other dependencies.
+
+## Actor Model
+
+### What is the actor model and why use it in Go?
+
+The actor model is a concurrency paradigm where independent units (actors, also called processes) communicate exclusively through message passing. Each actor has private state and processes messages one at a time. No shared memory, no mutexes, no race conditions.
+
+Go's goroutines and channels are powerful but don't enforce isolation. Goroutines can share memory, which requires manual synchronization. Ergo enforces the actor model guarantees: isolated state, message-only communication, and sequential processing per actor. See [Actor Model](basics/actor-model.md).
+
+### How is an Ergo process different from a goroutine?
+
+| | Goroutine | Ergo Process |
+|---|---|---|
+| Identity | No stable address | Has PID, addressable locally and remotely |
+| State | Can share memory | Strictly private |
+| Failure recovery | Manual | Automatic via supervision |
+| Cross-node messaging | Not built in | Same API, transparent |
+| Race conditions | Possible | Impossible within a process |
+
+See [Process](basics/process.md) for details.
+
+### How many processes can run on a single node?
+
+Thousands to hundreds of thousands. Processes sleep when idle and consume no CPU. Memory footprint per process is minimal, comparable to a goroutine plus a small mailbox struct.
+
+### Can actors communicate synchronously?
+
+Yes. Ergo supports both async (`Send`) and sync (`Call`) patterns. `Call` blocks the calling process until a response arrives or a timeout occurs, while maintaining full actor model guarantees. See [Handling Sync Requests](advanced/handle-sync.md).
+
+## Fault Tolerance
+
+### What happens when an actor crashes?
+
+Its supervisor detects the failure and applies a restart strategy:
+
+- **One-For-One**: restart only the failed child
+- **All-For-One**: restart all children when one fails
+- **Rest-For-One**: restart the failed child and all children started after it
+- **Simple-One-For-One**: identical children spawned dynamically at runtime, restart failed ones
+
+Supervision trees are hierarchical. A failed subtree is isolated and recovered without affecting the rest of the system. See [Supervision Tree](basics/supervision-tree.md) and [Supervisor](actors/supervisor.md).
+
+### Do I need to write retry logic?
+
+No. Supervision handles process recovery automatically. For message delivery, use the [Important Delivery](advanced/important-delivery.md) flag for guaranteed delivery semantics. The sender receives an immediate error if the target doesn't exist, rather than a timeout.
+
+### What happens if a remote node disconnects?
+
+All processes that were monitoring or linked to processes on the disconnected node receive a notification (`MessageDownNode` or exit signal). Your actors handle this notification and decide how to respond: retry, failover, or graceful degradation. See [Links and Monitors](basics/links-and-monitors.md).
+
+## Distributed Systems
+
+### How do nodes find each other?
+
+Through a registrar. Each node runs a minimal built-in registrar by default. Nodes on the same host discover each other automatically via localhost. For production clusters across multiple hosts, configure an external registrar:
+
+- **etcd**: distributed key-value store, widely used
+- **Saturn**: Ergo's own central registrar, purpose-built for Ergo clusters
+
+See [Service Discovering](networking/service-discovering.md).
+
+### Do I need Kubernetes or a service mesh?
+
+No. Ergo eliminates the integration tax of traditional microservice architectures. No HTTP or gRPC endpoints to define between services, no sidecar proxies, no API gateways for internal routing. Process-to-process communication is direct through the framework's network layer.
+
+Ergo does support Kubernetes for deployment. The [Health](extra-library/actors/health.md) actor provides liveness, readiness, and startup health probes, and the [Metrics](extra-library/actors/metrics.md) actor provides Prometheus metrics on a single port.
+
+### How does Ergo handle network partitions?
+
+The [Leader](extra-library/actors/leader.md) actor uses a Raft-inspired consensus algorithm with majority quorum to prevent split-brain scenarios. When a partition occurs, only the partition with a majority of nodes continues to elect a leader. Minority partitions stop processing leader-dependent operations until connectivity is restored.
+
+### Can I run Ergo nodes across different clouds?
+
+Yes. [ergo.cloud](https://ergo.cloud) is a managed overlay network that connects Ergo nodes across AWS, GCP, Azure, and bare metal into one transparent cluster without VPNs, proxies, or tunnels. End-to-end encrypted. Currently available via waitlist.
+
+## Pub/Sub
+
+### How does distributed Pub/Sub work in Ergo?
+
+A producer process registers a named event. Any process on any node subscribes using `LinkEvent` or `MonitorEvent`. The framework delivers messages to all subscribers transparently across the cluster.
+
+```go
+// Producer
+token, _ := producer.RegisterEvent("market.prices", gen.EventOptions{})
+producer.SendEvent("market.prices", token, PriceUpdate{Asset: "BTC", Price: 95000})
+
+// Subscriber on any node
+process.MonitorEvent(gen.Event{Name: "market.prices", Node: "producer@host"})
+
+// Event messages arrive in HandleEvent
+func (s *Sub) HandleEvent(event gen.MessageEvent) error {
+    update := event.Message.(PriceUpdate)
+    _ = update // handle the update here
+    return nil
+}
+
+// Producer termination or event unregister arrives in HandleMessage as MessageDownEvent
+func (s *Sub) HandleMessage(from gen.PID, msg any) error {
+    switch msg.(type) {
+    case gen.MessageDownEvent:
+        // producer terminated or unregistered
+    }
+ return nil
+}
+```
+
+See [Events](basics/events.md).
+
+### How does Ergo Pub/Sub scale?
+
+The framework uses fan-out at the consumer node level, not per subscriber. One network message is sent per remote node regardless of how many subscribers that node has. Local delivery then fans out within the node.
+
+Result: 2.9M messages/second delivery rate to 1,000,000 subscribers across 10 nodes using only 10 network messages, not 1,000,000. See [Pub/Sub Internals](advanced/pub-sub-internals.md).
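+
+The arithmetic behind that claim, as a sketch: network cost scales with the number of remote nodes, not with the number of subscribers (figures taken from the benchmark above).
+
+```go
+package main
+
+import "fmt"
+
+func main() {
+    nodes := 10
+    subscribersPerNode := 100_000
+
+    // One network message per remote node carries the event...
+    networkMessages := nodes
+    // ...then local fan-out on each node reaches every subscriber.
+    deliveries := nodes * subscribersPerNode
+
+    fmt.Println(networkMessages, deliveries) // 10 1000000
+}
+```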
+
+### What's the difference between Links, Monitors, and Events?
+
+All three use the same underlying pub/sub mechanism internally. All three are unidirectional: the notification flows from the target to the watcher, not the other way around. Note this differs from Erlang, where links are bidirectional.
+
+- **Link**: when the target terminates, the watcher receives an exit signal on its Urgent queue. The default behavior is to terminate the watcher. Actors can enable exit trapping to receive the signal as a `gen.MessageExit*` message and decide how to react.
+- **Monitor**: when the target terminates, the watcher receives a `gen.MessageDown*` notification on its System queue. The watcher continues running.
+- **Event**: the watcher subscribes to a named stream of messages published by a producer. The producer terminating also delivers a notification (exit signal for link-based subscriptions, down message for monitor-based).
+
+See [Links and Monitors](basics/links-and-monitors.md) and [Pub/Sub Internals](advanced/pub-sub-internals.md).
+
+## Performance
+
+### How fast is Ergo?
+
+- 21M+ messages/second locally on a 64-core processor
+- ~5.5M messages/second over the network
+- EDF serialization: up to 47% faster encoding than Protobuf, 6 to 14 times faster than Gob
+- Distributed Pub/Sub: 2.9M msg/sec to 1M subscribers across 10 nodes
+
+Full benchmarks: [benchmarks repository](https://github.com/ergo-services/benchmarks).
+
+### How does Ergo serialization compare to Protobuf?
+
+Ergo uses EDF (Ergo Data Format) with type caching. For repeated message types, type metadata is cached after the first transmission. Subsequent messages of the same type skip type information entirely. This makes EDF significantly faster than Protobuf for encoding and decoding in high-throughput scenarios.
+
+## Observability
+
+### Does Ergo support distributed tracing?
+
+Yes. Ergo has native distributed tracing that follows message chains across processes and nodes. When a traced process sends a message, the trace identity travels with the message and propagates automatically through the entire downstream chain of handlers. You configure tracing on entry-point processes. Downstream actors need no instrumentation.
+
+Traces can be viewed directly in Observer as waterfall diagrams or exported to OTLP-compatible backends (Grafana Tempo, Jaeger, OpenTelemetry Collector) via the [Pulse application](extra-library/applications/pulse.md). See [Distributed Tracing](advanced/distributed-tracing.md) for details.
+
+### How do I inspect a running node?
+
+Run the [Observer](extra-library/applications/observer.md) web UI for live visibility into processes, applications, network connections, events, logs, tracing waterfalls, and heap profiles. For AI-driven investigation, use the [MCP application](extra-library/applications/mcp.md) to expose the running system to Claude Code, Cursor, or any MCP-compatible client. For continuous metrics, the [Radar](extra-library/applications/radar.md) application provides a Prometheus endpoint with a ready-to-use Grafana dashboard.
+
+## Integration
+
+### Can Ergo nodes talk to Erlang/Elixir nodes?
+
+Yes. Ergo supports the full Erlang network stack: EPMD, ETF (External Term Format), and DIST protocol. You can build hybrid Go/Erlang clusters where Ergo nodes and BEAM nodes coexist and communicate natively. See [Erlang protocol](extra-library/network-protocols/erlang.md).
+
+### Does Ergo work with Prometheus and Grafana?
+
+Yes. The [Metrics](extra-library/actors/metrics.md) actor exports node and network telemetry via a Prometheus HTTP endpoint. A ready-to-use Grafana dashboard is provided via [Radar](extra-library/applications/radar.md).
+
+### Does Ergo support WebSockets and SSE?
+
+Yes, via [Meta Processes](basics/meta-process.md). Each [WebSocket](extra-library/meta-processes/websocket.md) or [SSE](extra-library/meta-processes/sse.md) connection becomes an independent meta-process with a stable identifier (`gen.Alias`). Any actor anywhere in the cluster can send messages directly to a specific client connection. No routing intermediaries needed. This enables real-time push from any cluster node to any specific connected client.
+
+### Can I use Ergo with standard Go HTTP libraries?
+
+Yes. Ergo's [Web](meta-processes/web.md) meta-process integrates with standard `net/http`. You use any Go router (stdlib ServeMux, gorilla/mux, chi, echo) and any HTTP middleware. Actors are an implementation detail invisible to the HTTP layer.
+
+## AI and MCP
+
+### Can Ergo be used for AI agent infrastructure?
+
+Yes, and it is particularly well-suited. Each AI agent runs as an isolated process with a mailbox. No shared state between agents, no race conditions. Supervisor trees restart stuck or crashed agents automatically. Multiple agents coordinate through message passing. Agents distribute transparently across cluster nodes as load grows. See [AI Agents](ai-agents.md) for patterns and diagnostics.
+
+### What is MCP support in Ergo?
+
+Ergo has built-in support for the Model Context Protocol (MCP), an emerging standard for AI tool integration. The [MCP application](extra-library/applications/mcp.md) exposes the running cluster to AI assistants (Claude Code, Cursor, and any MCP-compatible client) as a set of diagnostic tools. The AI inspects processes, queries events, captures goroutine dumps, reads logs, and runs samplers through natural language.
+
+Two deployment modes:
+
+- **Entry point**: the node runs an HTTP listener that accepts MCP requests. This is the node your AI client connects to.
+- **Agent**: no HTTP listener. Accessible via cluster proxy from the entry point node. Use this for internal nodes that should be inspectable without exposing an HTTP port.
+
+## Getting Started
+
+### How do I create my first Ergo project?
+
+```shell
+# Install the project generator
+go install ergo.tools/ergo@latest
+
+# Create a project
+ergo init MyNode github.com/myorg/mynode
+cd mynode
+
+# Add components
+ergo add supervisor MyNodeApp:MySup
+ergo add actor MySup:MyWorker
+
+# Run
+go run ./cmd
+```
+
+See [ergo tool documentation](tools/ergo.md) for the full command reference.
+
+### Where can I get help?
+
+- [Documentation](https://docs.ergo.services)
+- [Examples](https://github.com/ergo-services/examples)
+- [Telegram community](https://t.me/ergo_services)
+- [GitHub Discussions](https://github.com/ergo-services/ergo/discussions)
+- Commercial support: support@ergo.services
diff --git a/docs/networking/network-stack.md b/docs/networking/network-stack.md
index cd9558956..27a96677c 100644
--- a/docs/networking/network-stack.md
+++ b/docs/networking/network-stack.md
@@ -103,9 +103,37 @@ After handshake, the accepting node tells the dialing node to create a connectio
The dialing node opens additional TCP connections using a shortened join handshake (skips full authentication since the first connection already authenticated). These connections join the pool, forming a single logical connection with multiple physical TCP links.
-Multiple connections enable parallel message delivery. Each message goes to a connection based on the sender's identity (derived from sender PID). Messages from the same sender always use the same connection, preserving order. Messages from different senders use different connections, enabling parallelism.
+Multiple connections enable parallel message delivery. Each message goes to a connection based on the sender's identity, and the receiving side creates multiple receive queues per TCP connection for concurrent processing. This two-level mechanism (sender-side link selection and receiver-side queue routing) preserves per-sender message ordering while enabling parallelism across different senders. For details on how ordering works, including the `KeepNetworkOrder` flag and when to disable it, see [Message Ordering](network-transparency.md#message-ordering).
-The receiving side creates 4 receive queues per TCP connection. A 3-connection pool has 12 receive queues processing messages concurrently. This parallel processing improves throughput while preserving per-sender message ordering.
+### Software Keepalive
+
+*Introduced in v3.3.0.*
+
+TCP keepalive operates at the OS level - it detects hard network failures like unplugged cables or crashed hosts. But it can't detect application-level problems: a stuck process that stopped reading from a connection, a flusher that failed silently, a goroutine that never got scheduled. The connection looks alive to TCP while no useful data flows.
+
+Software keepalive works at the protocol level. When a connection pool item has nothing to send, its flusher periodically writes a small keepalive packet. The receiving side expects these packets and sets a read deadline based on the sender's advertised period. If nothing arrives - no real messages and no keepalive packets - the deadline fires and the connection is terminated.
+
+Each side advertises its keepalive period during handshake. This allows asymmetric configuration: a node in a reliable datacenter might send keepalive every 15 seconds, while a node on an unstable network might send every 5 seconds. The receiver calculates its deadline from the sender's period, not its own.
+
+```go
+node, err := ergo.StartNode("myapp@localhost", gen.NodeOptions{
+ Network: gen.NetworkOptions{
+ Flags: gen.NetworkFlags{
+ // ... other flags ...
+ EnableSoftwareKeepAlive: 15, // send keepalive every 15 seconds when idle
+ },
+ SoftwareKeepAliveMisses: 3, // tolerate 3 missed keepalives before disconnect
+ },
+})
+```
+
+The timeout calculation uses the remote node's period, not the local one. If the remote node advertises a 15-second period and you configure 3 misses, the connection is considered dead after 45 seconds of silence. Real messages reset the deadline just like keepalive packets do - on a busy connection, keepalive is never sent because regular traffic keeps the deadline from expiring.
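+
+As a back-of-the-envelope check of that arithmetic, a minimal sketch (the helper name `readDeadline` is illustrative, not framework API):
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// readDeadline sketches how the receiver derives its read deadline:
+// the remote node's advertised keepalive period multiplied by the
+// locally configured miss tolerance.
+func readDeadline(remotePeriodSec, misses int) time.Duration {
+	return time.Duration(remotePeriodSec*misses) * time.Second
+}
+
+func main() {
+	// Remote advertises 15s, local tolerates 3 misses: 45s of silence.
+	fmt.Println(readDeadline(15, 3)) // 45s
+	// An unstable peer advertising 5s is cut off after 15s.
+	fmt.Println(readDeadline(5, 3)) // 15s
+}
+```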
+
+When a keepalive timeout fires on any pool item, the entire connection is terminated - not just the affected TCP link. A single unresponsive link is strong evidence that the whole network path to the remote node is down. This triggers the standard cleanup flow: monitors receive `MessageDown`, links receive `MessageExit`, and the connection is removed from the node's connection map.
+
+Software keepalive is enabled by default (15-second period, 3 misses, 45-second timeout). Set `EnableSoftwareKeepAlive` to 0 to disable it. Acceptors and routes can override the misses count; zero inherits from `NetworkOptions`.
+
+Both sides must have keepalive enabled for the feature to activate. If either side advertises period 0, the connection falls back to TCP-only keepalive with infinite read deadline - neither side sends keepalive packets and neither side sets read deadlines. This means a single node with keepalive disabled in a cluster removes protection for all its connections, not just its own. During a rolling upgrade from older nodes (which don't support the feature) to newer ones, connections between old and new nodes will not have software keepalive until both sides are upgraded.
## Message Encoding and Transmission
@@ -115,7 +143,7 @@ Once a connection exists, messages flow through encoding and framing.
EDF is a binary encoding specifically designed for the framework's communication patterns. It's type-aware - each value is prefixed with a type tag (e.g., `0x95` for int64, `0xaa` for PID, `0x9d` for slice). The decoder reads the tag and knows what follows.
-Framework types like `gen.PID` and `gen.Ref` have optimized encodings. Structs are encoded field-by-field in declaration order (no field names on the wire). Custom types must be registered on both sides - registration happens during `init()`, and during handshake nodes exchange their type lists to agree on encoding.
+Framework types like `gen.PID` and `gen.Ref` have optimized encodings. Structs are encoded field-by-field in declaration order (no field names on the wire). Custom types must be registered on both sides via `node.Network().RegisterType` (typically from an application's `Load` callback). During handshake, nodes exchange their type lists to agree on encoding.
Compression is automatic. If a message exceeds the compression threshold (default 1024 bytes), it's compressed using GZIP, ZLIB, or LZW. The protocol frame indicates compression, so the receiver decompresses before decoding.
@@ -129,6 +157,39 @@ The order byte preserves message ordering per sender. Messages from the same sen
For details on protocol framing, order bytes, receive queue distribution, and the exact byte layout, see [Network Transparency](network-transparency.md).
+### Message Fragmentation
+
+*Introduced in v3.3.0.*
+
+When a message exceeds the fragment size threshold (default 65000 bytes), the framework splits it into smaller pieces for transmission and reassembles them on the receiving side. This happens after compression; if a compressed message is still too large, it gets fragmented. From your code's perspective, nothing changes. You send a large message, and it arrives intact.
+
+Fragmentation works with all message types: regular sends, important delivery, calls, and events. It composes with compression: a message can be compressed first, then fragmented, and on the receiving side defragmented and then decompressed.
+
+When [`KeepNetworkOrder`](network-transparency.md#message-ordering) is disabled for a process, the framework distributes fragments across all TCP connections in the pool, using the full bandwidth of the connection pool. This is useful for transferring large payloads where throughput matters more than ordering. When `KeepNetworkOrder` is enabled (the default), all fragments travel through a single TCP connection to preserve message ordering for that sender.
+
+Both nodes must have `EnableFragmentation` set in their network flags. If either side doesn't support it, large messages are sent as-is (subject to `MaxMessageSize` limits). During handshake, nodes exchange their fragmentation capability, and the feature activates only when both sides agree.
+
+`MaxMessageSize` is a logical limit on the EDF-encoded message, checked before compression and fragmentation. On the receiving side, the framework tracks the accumulated size of received fragments and rejects the assembly if it exceeds the limit.
+
+```go
+node, err := ergo.StartNode("myapp@localhost", gen.NodeOptions{
+ Network: gen.NetworkOptions{
+ Flags: gen.NetworkFlags{
+ EnableFragmentation: true, // default: true
+ },
+ FragmentSize: 65000, // bytes per fragment, 0 = default
+ FragmentTimeout: 30, // seconds, assembly timeout, 0 = default
+ MaxFragmentAssemblies: 1000, // max concurrent assemblies, 0 = default
+ },
+})
+```
+
+`FragmentSize` controls at what point messages get split. This is a sender-side setting; the receiver reassembles whatever arrives regardless of the sender's fragment size. Two nodes can use different fragment sizes.
+
+`FragmentTimeout` sets how long the receiver waits for all fragments before discarding an incomplete assembly. If a sender crashes mid-message or a connection drops, partial assemblies are cleaned up after this timeout.
+
+`MaxFragmentAssemblies` limits how many messages can be simultaneously reassembled per connection, protecting against memory exhaustion from many concurrent large messages.
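+
+To make the splitting and the receive-side limit concrete, a small illustrative sketch (the helper names are hypothetical, not framework API):
+
+```go
+package main
+
+import "fmt"
+
+// fragmentCount sketches the sender side: a payload is cut into
+// ceil(size / fragmentSize) pieces.
+func fragmentCount(payloadSize, fragmentSize int) int {
+	return (payloadSize + fragmentSize - 1) / fragmentSize
+}
+
+// withinLimit sketches the receiver side: the accumulated size of
+// received fragments must stay within MaxMessageSize (0 = unlimited).
+func withinLimit(accumulated, maxMessageSize int) bool {
+	return maxMessageSize == 0 || accumulated <= maxMessageSize
+}
+
+func main() {
+	// A 200 KB payload with the default 65000-byte fragment size
+	// travels as 4 fragments.
+	fmt.Println(fragmentCount(200_000, 65_000))
+	// With MaxMessageSize unset (unlimited), the assembly is accepted.
+	fmt.Println(withinLimit(200_000, 0))
+}
+```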
+
## Network Transparency in Practice
Network transparency means remote operations look like local operations. You send to a PID without checking if it's local or remote. You establish links and monitors the same way regardless of location. The framework handles discovery, encoding, and transmission automatically.
@@ -158,7 +219,12 @@ node, err := ergo.StartNode("myapp@localhost", gen.NodeOptions{
EnableRemoteSpawn: true,
EnableRemoteApplicationStart: true,
EnableImportantDelivery: true,
+ EnableFragmentation: true, // default: true
+ EnableSoftwareKeepAlive: 15, // seconds, 0 to disable
},
+ SoftwareKeepAliveMisses: 3, // tolerate 3 missed keepalives
+ FragmentSize: 65000, // 0 = default
+ FragmentTimeout: 30, // seconds, 0 = default
Acceptors: []gen.AcceptorOptions{
{
Port: 15000,
@@ -176,13 +242,13 @@ node, err := ergo.StartNode("myapp@localhost", gen.NodeOptions{
**MaxMessageSize** - Maximum incoming message size. Protects against memory exhaustion. Default unlimited (fine for trusted clusters).
-**Flags** - Control capabilities. Remote nodes learn your flags during handshake and can only use features you've enabled. `EnableRemoteSpawn` allows spawning (with explicit permission per process). `EnableImportantDelivery` enables delivery confirmation.
+**Flags** - Control capabilities. Remote nodes learn your flags during handshake and can only use features you've enabled. `EnableRemoteSpawn` allows spawning (with explicit permission per process). `EnableImportantDelivery` enables delivery confirmation. `EnableFragmentation` enables message fragmentation for large messages (both sides must enable). `EnableSoftwareKeepAlive` sets the keepalive period in seconds (see [Software Keepalive](#software-keepalive)).
**Acceptors** - Define listeners for incoming connections. Multiple acceptors on different ports are supported. Each can have its own cookie, TLS, and protocol.
## Custom Network Stacks
-The framework provides three extension points:
+The framework provides four extension points:
**gen.NetworkHandshake** - Control connection establishment and authentication. Implement this to change how nodes authenticate or how connection pools are created.
@@ -190,6 +256,8 @@ The framework provides three extension points:
**gen.Connection** - The actual connection handling. Implement this for custom framing, routing, or error handling.
+**gen.TypeRegistry** - Optional capability that proto implementations may declare to expose a wire-format type registry. The default ENP/EDF stack implements it. The Erlang distribution proto does not, since the Erlang external term format is schemaless on the wire. When a node has multiple protos configured, `node.Network().RegisterType` distributes registration to every TypeRegistry-capable proto strictly: a failure in any one proto fails the whole call. Protos that do not implement TypeRegistry are skipped silently.
+
You can register multiple handshakes and protos, allowing one node to support multiple protocol stacks simultaneously:
```go
diff --git a/docs/networking/network-transparency.md b/docs/networking/network-transparency.md
index bdcd61faa..979dd36cc 100644
--- a/docs/networking/network-transparency.md
+++ b/docs/networking/network-transparency.md
@@ -156,8 +156,11 @@ type Order struct {
Items []string
}
-func init() {
- edf.RegisterTypeOf(Order{}) // Analyzed once, functions built
+func (a *MyApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
+ if err := node.Network().RegisterType(Order{}); err != nil {
+ return gen.ApplicationSpec{}, err
+ }
+ return gen.ApplicationSpec{ /* ... */ }, nil
}
// Later, during message sending:
@@ -166,7 +169,7 @@ process.Send(to, Order{ID: 42, Items: []string{"item1"}}) // Uses pre-built enc
This approach delivers Protocol Buffers-class performance without `.proto` files or `protoc` code generation.
-Registration happens at runtime - no build step, no generated files. You call `edf.RegisterTypeOf()` in your `init()` function, and EDF builds the optimized encoders. Framework types like `gen.PID`, `gen.Ref`, and `gen.Event` have native support with specialized encodings. During node handshake, both sides exchange their registered type lists and negotiate short numeric IDs, turning a full type name into 3 bytes on the wire. Field names aren't encoded - only field values in declaration order.
+Registration happens at runtime - no build step, no generated files. You call `node.Network().RegisterType()` from your application's `Load()` callback, and the framework builds the optimized encoders. Framework types like `gen.PID`, `gen.Ref`, and `gen.Event` have native support with specialized encodings. During node handshake, both sides exchange their registered type lists and negotiate short numeric IDs, turning a full type name into 3 bytes on the wire. Field names aren't encoded - only field values in declaration order.
Performance benchmarks (see `benchmarks/serial/`) show encoding is 50-100% faster than Protocol Buffers, while decoding is 20-60% slower. The encoding advantage comes from the specialized functions built during registration.
@@ -201,9 +204,9 @@ These limits are enforced during encoding. If you attempt to encode a 70,000 byt
## Type Registration Requirements
-For custom types to cross the network, both sending and receiving nodes must register them. Registration tells EDF how to encode and decode the type, and creates a numeric ID that's shared during handshake for efficient encoding.
+For custom types to cross the network, both sending and receiving nodes must register them. Registration tells the active wire-format proto how to encode and decode the type, and creates a numeric ID that's shared during handshake for efficient encoding.
-Register types during initialization:
+Register types from your application's `Load()` callback:
```go
type Order struct {
@@ -211,11 +214,34 @@ type Order struct {
Items []string
}
-func init() {
- edf.RegisterTypeOf(Order{})
+func (a *MyApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
+ if err := node.Network().RegisterType(Order{}); err != nil {
+ return gen.ApplicationSpec{}, err
+ }
+ return gen.ApplicationSpec{ /* ... */ }, nil
}
```
+`Network().RegisterType` distributes registration across every active wire-format proto (e.g., the default ENP/EDF stack). If your node has multiple wire-format protocols configured (for example, a legacy ENP and a newer one running side by side), one call registers in all of them. The call fails if any proto rejects the type. Wire-format consistency is enforced strictly to prevent silent split-brain registries.
+
+For batch registration of multiple types, use `RegisterTypes` (see the **Nested types** subsection below for the dependency-resolution behavior):
+
+```go
+func (a *MyApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
+ err := node.Network().RegisterTypes([]any{
+ Order{},
+ Customer{},
+ Address{},
+ })
+ if err != nil {
+ return gen.ApplicationSpec{}, err
+ }
+ return gen.ApplicationSpec{ /* ... */ }, nil
+}
+```
+
+`RegisterTypes` resolves inter-type dependencies internally. You can list types in any order, and the framework figures out the correct registration sequence.
+
### Registration Requirements
**Only exported fields** - Structs must have all fields exported (starting with uppercase). This is by design: exported fields define your actor's contract. When actors communicate - locally or across the network - they exchange messages according to explicit contracts. Unexported fields are implementation details, internal state that shouldn't cross actor boundaries. If registration encounters unexported fields, it fails with `"struct Order has unexported field(s)"`.
@@ -227,18 +253,21 @@ type Order struct {
}
```
-**No pointer types** - EDF rejects pointer types and structs containing pointer fields. This is by design: pointers are a local memory optimization and shouldn't be part of network contracts. A `*Database` field is meaningless to a remote actor - it can't dereference your memory address. Pointers express local sharing semantics that don't translate across address spaces.
+**Pointer types** - Starting from version 3.3, EDF supports pointer types. Pointers can be `nil` or point to a value, and this state is preserved during encoding/decoding. Nested pointers (`**int`) are not supported.
```go
+var discount *float64 // nil or value
+var prices []*int // slice with nil elements
+var cache map[string]*Config // map with nil values
+
type Order struct {
- ID int64
- Cache *OrderCache // Registration fails - pointer is local optimization
+ Priority *int // optional field
}
```
-For distributed references, use framework types designed for remote access: `gen.PID` (process reference), `gen.Alias` (named reference), `gen.Ref` (call reference). These work across nodes and provide location-independent semantics.
+Note that pointers to external resources like `*Database` or `*Connection` are meaningless to a remote actor - it cannot dereference your memory address. Use pointers for optional value semantics, not for sharing local resources. For distributed references, use framework types: `gen.PID`, `gen.Alias`, `gen.Ref`.
-**Nested types must be registered first** - If your type contains other custom types, register the inner types before the outer type:
+**Nested types** - If your type contains other custom types, the inner types must be registered before the outer type. Use `RegisterTypes` (batch) which resolves dependency order automatically:
```go
type Address struct {
@@ -251,13 +280,18 @@ type Person struct {
Address Address
}
-func init() {
- edf.RegisterTypeOf(Address{}) // register child first
- edf.RegisterTypeOf(Person{}) // then parent
+func (a *MyApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
+ // Order in the slice doesn't matter. The framework registers
+ // inner types first and retries until everything resolves.
+ err := node.Network().RegisterTypes([]any{Person{}, Address{}})
+ if err != nil {
+ return gen.ApplicationSpec{}, err
+ }
+ return gen.ApplicationSpec{ /* ... */ }, nil
}
```
-The order matters because registration builds the encoding schema by examining fields. When registering `Person`, EDF sees the `Address` field. If `Address` isn't registered yet, registration fails with `"type Address must be registered first"`. If `Address` is already registered, EDF references its schema, creating an efficient nested encoding.
+If you call `RegisterType` (singular) on `Person` before `Address`, registration fails with `"type Address must be registered first"`. With `RegisterTypes`, the framework iteratively retries pending types whose dependencies become available. Only types that genuinely cannot be resolved produce an error. Registration builds the encoding schema by examining fields; once `Address` is registered, registering `Person` references its schema for efficient nested encoding.
### Custom Marshaling for Special Cases
@@ -317,9 +351,14 @@ var (
ErrOutOfStock = errors.New("out of stock")
)
-func init() {
- edf.RegisterError(ErrInvalidOrder)
- edf.RegisterError(ErrOutOfStock)
+func (a *MyApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
+ if err := node.Network().RegisterError(ErrInvalidOrder); err != nil {
+ return gen.ApplicationSpec{}, err
+ }
+ if err := node.Network().RegisterError(ErrOutOfStock); err != nil {
+ return gen.ApplicationSpec{}, err
+ }
+ return gen.ApplicationSpec{ /* ... */ }, nil
}
```
@@ -333,17 +372,32 @@ Type registration must happen before connection establishment. During handshake,
If you register a type after a connection is established, that type isn't in the dictionary. Attempting to send a value of that type fails - the encoder can't find it in the shared schema. The only way to use the newly registered type is to disconnect and reconnect, forcing a new handshake that includes the type.
-This is why registration typically happens in `init()` functions. The registration runs before `main()`, which runs before node startup, which runs before any connections are established. By the time connections form, all types are registered.
+The recommended place to register types is the application's `Load()` callback. Applications are loaded after the network stack is initialized but before any outgoing or incoming traffic, so all types end up in the handshake dictionaries. An application owns its message types and registers them itself, keeping registration co-located with the code that defines the types.
For dynamic type registration (registering types based on runtime configuration or plugin loading), you have limited options:
-**Register before node start** - Load your configuration, determine which types you need, register them all, then start the node. This works but requires knowing all types upfront.
+**Register before any traffic** - Load your configuration, determine which types you need, and register them in your application's `Load()` callback. This works but requires knowing all the application's types upfront.
-**Coordinate reconnection** - Register the new type, disconnect existing connections to nodes that need the type, wait for reconnection with new handshake. This is complex and causes temporary communication loss.
+**Coordinate reconnection** - Register the new type via `node.Network().RegisterType`, disconnect existing connections to nodes that need the type, wait for reconnection with new handshake. This is complex and causes temporary communication loss.
**Use custom marshaling** - Implement `edf.Marshaler`/`Unmarshaler` or `encoding.BinaryMarshaler`/`Unmarshaler`. These don't require pre-registration - they work immediately. The tradeoff is you write the encoding logic yourself.
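+
+A minimal sketch of the `encoding.BinaryMarshaler`/`encoding.BinaryUnmarshaler` route - the `Temperature` type and its 8-byte wire layout are made up for illustration; only the standard library interfaces are real:
+
+```go
+package main
+
+import (
+	"encoding/binary"
+	"fmt"
+	"math"
+)
+
+// Temperature is a hypothetical message type that controls its own wire
+// format instead of relying on registration-based encoding.
+type Temperature struct {
+	Celsius float64
+}
+
+// MarshalBinary implements encoding.BinaryMarshaler: 8 bytes holding the
+// big-endian IEEE 754 representation of the value.
+func (t Temperature) MarshalBinary() ([]byte, error) {
+	buf := make([]byte, 8)
+	binary.BigEndian.PutUint64(buf, math.Float64bits(t.Celsius))
+	return buf, nil
+}
+
+// UnmarshalBinary implements encoding.BinaryUnmarshaler.
+func (t *Temperature) UnmarshalBinary(data []byte) error {
+	if len(data) != 8 {
+		return fmt.Errorf("expected 8 bytes, got %d", len(data))
+	}
+	t.Celsius = math.Float64frombits(binary.BigEndian.Uint64(data))
+	return nil
+}
+
+func main() {
+	wire, _ := Temperature{Celsius: 21.5}.MarshalBinary()
+	var decoded Temperature
+	_ = decoded.UnmarshalBinary(wire)
+	fmt.Println(decoded.Celsius) // 21.5
+}
+```
+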
-Most applications register types statically in `init()` and avoid these complications.
+Most applications register types statically from `Load()` and avoid these complications.
+
+## Legacy Registration API
+
+Earlier versions of the framework exposed registration as package-level functions on `ergo.services/ergo/net/edf`:
+
+```go
+// Deprecated. Use node.Network().RegisterType / RegisterError / RegisterAtom instead.
+edf.RegisterTypeOf(Order{})
+edf.RegisterError(ErrInvalidOrder)
+edf.RegisterAtom("my_atom")
+```
+
+These functions remain for backward compatibility but are **deprecated**. They write directly into the EDF package state, bypassing the `gen.Network` abstraction. In a multi-proto setup (more than one wire-format proto registered on the node), they only register in EDF, and other protos won't see the type. The new `Network` API distributes registration to every active wire-format proto strictly.
+
+Prefer `node.Network().RegisterType` / `RegisterTypes` / `RegisterError` / `RegisterAtom` from your application's `Load()` callback. The legacy package-level functions emit a one-time deprecation warning when called from user code.
## Compression
@@ -395,6 +449,8 @@ The caches are bidirectional - both nodes maintain the same mappings. During enc
This caching is automatic. You don't manage the cache or invalidate entries. The framework handles it. You just benefit from smaller messages.
+To measure how much each registered type actually contributes to network traffic and to identify candidates for compression, build the node with `-tags=typestats`. This enables per-type encode/decode counters and wire-byte totals exposed via `Network().RegisteredTypes()` and visible in the Observer Types panel. Counters increment only on root operations (a type sent or received as a message in its own right); bytes embedded inside other messages are accounted to the parent type. The cost is approximately 2-3% on encode/decode throughput; without the tag there is zero overhead. See [The typestats Tag](../advanced/debugging.md#the-typestats-tag) for details.
+
## Important Delivery
Network transparency breaks down when dealing with failures. Sending to a local process that doesn't exist returns an error immediately - the framework checks the process table and sees the PID isn't registered. Sending to a remote process that doesn't exist returns... nothing. The message is encoded, sent to the remote node, and the remote node silently drops it because there's no recipient. Your code doesn't know the process was missing.
@@ -427,6 +483,71 @@ The cost is latency. Normal `Send` returns immediately - it queues the message a
For detailed exploration of Important Delivery patterns, reliability guarantees, and protocols like RR-2PC and FR-2PC, see [Important Delivery](../advanced/important-delivery.md).
+## Message Ordering
+
+Messages sent from process A to process B arrive in sending order. This is a per-sender FIFO guarantee; it applies to each sender independently, not globally across all senders. The guarantee is enabled by default for every process.
+
+### KeepNetworkOrder Flag
+
+Message ordering is controlled by a per-process flag called `KeepNetworkOrder`, which defaults to `true`. You can change it using `SetKeepNetworkOrder(bool)` during `Init` or at any point while the process is running. The flag applies to all outgoing messages from that process: `Send`, `Call`, `SendResponse`, and `SendEvent`. There is no per-message override; ordering is all-or-nothing for a given sender.
+
+### How It Works: Sender Side
+
+With ordering enabled, all messages from a process go through the same TCP link in the connection pool. The link is selected deterministically: `sender.ID % 255 % pool_size`. Since TCP guarantees FIFO delivery within a single connection, messages arrive at the remote node in exactly the order they were sent.
+
+With ordering disabled, messages are distributed round-robin across all pool links. This spreads the load for maximum throughput, but the arrival order across different TCP connections is no longer deterministic.
+
+### How It Works: Receiver Side
+
+Each message carries an **order byte** in the protocol header (byte 6 of the ENP frame). When ordering is enabled, the order byte is derived from the recipient's identity:
+- For `gen.PID` recipients: `to.ID % 255`
+- For `gen.Alias` recipients: `to.ID[1] % 255`
+
+The receiving node routes messages to receive queues based on this byte: `order_byte % queue_count`. Messages destined for the same recipient land in the same queue and are decoded sequentially, preserving order.
+
+When ordering is disabled, the order byte is zero. Messages distribute round-robin across receive queues, enabling parallel decoding at the cost of non-deterministic arrival order.
+
+### Two-Level Guarantee
+
+The ordering mechanism works at two levels:
+
+1. **Sender side:** pins messages to one TCP link, preserving send order in the TCP stream
+2. **Receiver side:** pins messages to one decode queue, preserving decode order
+
+Together they ensure end-to-end FIFO from sender to recipient. The sender side prevents reordering during transmission; the receiver side prevents reordering during decoding and dispatch.
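+
+The selection arithmetic at both levels can be sketched together (the helper names and the pool/queue sizes are illustrative; the formulas are the ones stated in the two subsections):
+
+```go
+package main
+
+import "fmt"
+
+// senderLink sketches sender-side link selection with ordering enabled:
+// sender.ID % 255 % pool_size.
+func senderLink(senderID uint64, poolSize int) int {
+	return int(senderID % 255 % uint64(poolSize))
+}
+
+// receiveQueue sketches receiver-side routing: the order byte
+// (to.ID % 255 for PID recipients) picks the decode queue
+// via order_byte % queue_count.
+func receiveQueue(recipientID uint64, queueCount int) int {
+	orderByte := recipientID % 255
+	return int(orderByte % uint64(queueCount))
+}
+
+func main() {
+	// Every message from sender 1001 over a 3-link pool uses the same link...
+	fmt.Println(senderLink(1001, 3)) // 2
+	// ...and every message to recipient 2002 lands in the same decode queue.
+	fmt.Println(receiveQueue(2002, 4)) // 1
+}
+```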
+
+### Special Cases
+
+Some system messages have fixed ordering semantics regardless of the `KeepNetworkOrder` flag:
+
+| Operation       | Ordering         | Notes |
+|-----------------|------------------|-------|
+| `SendExit`      | Always ordered   | No `KeepNetworkOrder` check; always uses the sender-derived order byte |
+| `SendTerminate` | Always unordered | Order byte is always 0 |
+| Link/Monitor    | Always ordered   | System operations that must arrive in sequence |
+
+These are internal system messages where ordering behavior is fixed by the protocol, not configurable by the process.
+
+### When to Disable Ordering
+
+Processes that don't need ordering benefit from disabling it. When `KeepNetworkOrder` is `false`, messages spread across all TCP links in the pool and all receive queues on the remote side. This increases parallelism on both ends: more connections are utilized for sending, and more goroutines participate in decoding.
+
+Good candidates for disabling ordering:
+- **Stateless workers** that process each request independently
+- **Fan-out producers** that distribute work to many recipients
+- **High-throughput event emitters** where each event is self-contained
+
+The tradeoff is straightforward: message arrival order becomes non-deterministic. If your process logic doesn't depend on message order, disabling ordering gives you better throughput.
+
+```go
+func (w *Worker) Init(args ...any) error {
+	// This worker processes each request independently,
+	// so ordering doesn't matter.
+ w.SetKeepNetworkOrder(false)
+ return nil
+}
+```
+
## Protocol Frame Structure
EDF-encoded messages are wrapped in ENP (Ergo Network Protocol) frames for transmission over TCP.
@@ -445,9 +566,7 @@ For PID messages, the frame contains:
- Recipient PID (8 bytes)
- EDF-encoded message payload
-The **order byte** (byte 6) preserves message ordering per sender. It's calculated as `senderPID.ID % 255`, ensuring messages from the same sender have the same order value. This guarantees sequential processing on the receiving side even if messages arrive on different TCP connections in the pool. Messages from different senders have different order values, enabling parallel processing.
-
-When the receiving node reads a frame from TCP, it extracts the order byte and routes the frame to the appropriate receive queue. The connection creates **4 receive queues per TCP connection** in the pool. So a 3-connection pool has 12 receive queues total. Frames are distributed to queues based on `order_byte % queue_count`. Each queue is processed by a dedicated goroutine that decodes frames and delivers messages to recipients. This parallel processing improves throughput while preserving per-sender ordering.
+The **order byte** (byte 6) controls message ordering and receive queue routing. For details on how the order byte is calculated and how it interacts with the connection pool and receive queues, see [Message Ordering](#message-ordering) above.
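+The arithmetic behind this routing is simple modular math. The sketch below is a standalone illustration with hypothetical helper names (the framework computes this internally; `orderByte` follows the `senderPID.ID % 255` rule, and queue selection follows `order_byte % queue_count`):
+
+```go
+package main
+
+import "fmt"
+
+// orderByte mirrors the documented calculation: messages from the
+// same sender always get the same order value (senderPID.ID % 255).
+func orderByte(senderID uint64) uint64 {
+	return senderID % 255
+}
+
+// receiveQueue picks one of a connection's receive queues from the
+// order byte, so same-sender frames always land in the same queue.
+func receiveQueue(order uint64, queueCount uint64) uint64 {
+	return order % queueCount
+}
+
+func main() {
+	const queues = 4 // receive queues per TCP connection
+	for _, sender := range []uint64{1001, 1002, 1001} {
+		ob := orderByte(sender)
+		fmt.Printf("sender %d -> order byte %d -> queue %d\n",
+			sender, ob, receiveQueue(ob, queues))
+	}
+}
+```
+
+Note how both messages from sender `1001` map to the same queue, while sender `1002` can be decoded in parallel on a different one.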
## Limits of Transparency
@@ -461,7 +580,7 @@ Network transparency is powerful but not magical. The network has physical prope
**Partial failures** - In a distributed system, some nodes can fail while others continue working. A local system either works entirely or crashes entirely. A distributed system can be partially operational - some nodes reachable, others not. This partial failure is the hardest aspect of distributed systems. The framework can't hide it entirely.
-**Ordering** - Message ordering is preserved per-sender within a connection. Messages from process A to process B arrive in the order sent. But messages from different senders can interleave arbitrarily. And if a connection drops and reconnects, messages sent during disconnection are lost or delayed. Don't assume global ordering across the cluster.
+**Ordering** - Message ordering is preserved per-sender, not globally. Messages from process A to process B arrive in sending order, but messages from different senders can interleave arbitrarily. If a connection drops and reconnects, messages sent during disconnection are lost or delayed. Don't assume global ordering across the cluster. See [Message Ordering](#message-ordering) for how the ordering mechanism works and when to disable it.
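+If a consumer needs to notice reordering or loss (for example, across a reconnect), a common application-level pattern - not something the framework provides - is tagging messages with per-sender sequence numbers. A minimal sketch:
+
+```go
+package main
+
+import "fmt"
+
+// Tracker records the last sequence number seen per sender and
+// reports gaps or reordering. Purely illustrative.
+type Tracker struct {
+	last map[string]uint64
+}
+
+func NewTracker() *Tracker {
+	return &Tracker{last: make(map[string]uint64)}
+}
+
+// Observe returns false if seq is not the expected next number
+// for this sender (a gap, a duplicate, or an out-of-order arrival).
+func (t *Tracker) Observe(sender string, seq uint64) bool {
+	prev, seen := t.last[sender]
+	t.last[sender] = seq
+	if !seen {
+		return true
+	}
+	return seq == prev+1
+}
+
+func main() {
+	tr := NewTracker()
+	fmt.Println(tr.Observe("a", 1)) // true
+	fmt.Println(tr.Observe("a", 2)) // true
+	fmt.Println(tr.Observe("a", 4)) // false: message 3 lost or delayed
+}
+```
+
+Whether to resynchronize, buffer, or crash on a detected gap is an application decision; the framework only guarantees per-sender ordering while a connection stays up.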
Network transparency makes distributed programming feel local. But distributed programming has fundamental differences from local programming. The transparency is a tool that simplifies common cases - it doesn't eliminate the need to think about distributed system challenges.
@@ -481,6 +600,6 @@ Understanding network transparency helps you design better distributed systems.
**Leverage compression** - Enable compression for processes that send large messages. The CPU cost of compression is usually worth the network bandwidth savings. But don't compress tiny messages - the overhead exceeds the benefit.
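The "don't compress tiny messages" point is easy to verify with Go's standard `compress/gzip` - a standalone illustration, not the framework's own compression codec:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressedSize returns the gzip-compressed length of data.
func compressedSize(data []byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return buf.Len()
}

func main() {
	tiny := []byte("ping")
	large := bytes.Repeat([]byte("payload "), 4096) // 32 KiB of repetitive data

	// The 4-byte message grows (header and trailer overhead exceed the
	// payload); the repetitive 32 KiB payload shrinks dramatically.
	fmt.Printf("tiny:  %d bytes -> %d bytes compressed\n", len(tiny), compressedSize(tiny))
	fmt.Printf("large: %d bytes -> %d bytes compressed\n", len(large), compressedSize(large))
}
```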
-**Register types early** - Do all type registration in `init()` functions before the node starts. Avoid dynamic type registration that requires connection cycling. Static registration is simpler and more reliable.
+**Register types early** - Do all type registration from your application's `Load(node)` callback so types are in the registry before any traffic. Avoid dynamic type registration that requires connection cycling. Static registration is simpler and more reliable.
For details on how the network stack implements transparency, see [Network Stack](network-stack.md). For understanding how nodes discover each other, see [Service Discovery](service-discovering.md).
diff --git a/docs/networking/remote-spawn-process.md b/docs/networking/remote-spawn-process.md
index fbdc6a199..aeb6087b3 100644
--- a/docs/networking/remote-spawn-process.md
+++ b/docs/networking/remote-spawn-process.md
@@ -222,7 +222,7 @@ node, err := ergo.StartNode("myapp@localhost", gen.NodeOptions{
Now when you use `process.RemoteSpawn`, the remote process receives a copy of the calling process's environment. The remote node reads these values and sets them on the spawned process.
-**Important:** Environment variable values must be EDF-serializable. Strings, numbers, booleans work fine. Custom types require registration via `edf.RegisterTypeOf`. If an environment variable contains a non-serializable value (e.g., a channel, function, or unregistered struct), the remote spawn fails entirely with an error like `"no encoder for type "`. The framework doesn't skip problematic variables - any non-serializable value causes the entire spawn request to fail.
+**Important:** Environment variable values must be EDF-serializable. Strings, numbers, booleans work fine. Custom types require registration via `node.Network().RegisterType` (see [Network Transparency](network-transparency.md) for details on the type registry; the legacy `edf.RegisterTypeOf` still works but is deprecated). If an environment variable contains a non-serializable value (e.g., a channel, function, or unregistered struct), the remote spawn fails entirely with an error like `"no encoder for type "`. The framework doesn't skip problematic variables: any non-serializable value causes the entire spawn request to fail.
Environment inheritance only works with `process.RemoteSpawn`. Using `RemoteNode.Spawn` doesn't inherit environment because there's no calling process - it's a node-level operation.
diff --git a/docs/networking/remote-start-application.md b/docs/networking/remote-start-application.md
index ebeedd84a..1adbccf7a 100644
--- a/docs/networking/remote-start-application.md
+++ b/docs/networking/remote-start-application.md
@@ -190,7 +190,7 @@ node, err := ergo.StartNode("scheduler@localhost", gen.NodeOptions{
Now when you start an application remotely, the application's processes receive a copy of the requesting node's core environment. This enables configuration propagation - your scheduler node has configuration in its environment, and applications started remotely inherit it.
-**Important:** Environment variable values must be EDF-serializable. Strings, numbers, booleans work fine. Custom types require registration via `edf.RegisterTypeOf`. If an environment variable contains a non-serializable value (e.g., a channel, function, or unregistered struct), the remote application start fails entirely with an error like `"no encoder for type "`. The framework doesn't skip problematic variables - any non-serializable value causes the entire start request to fail.
+**Important:** Environment variable values must be EDF-serializable. Strings, numbers, booleans work fine. Custom types require registration via `node.Network().RegisterType` (see [Network Transparency](network-transparency.md) for details on the type registry; the legacy `edf.RegisterTypeOf` still works but is deprecated). If an environment variable contains a non-serializable value (e.g., a channel, function, or unregistered struct), the remote application start fails entirely with an error like `"no encoder for type "`. The framework doesn't skip problematic variables: any non-serializable value causes the entire start request to fail.
## How It Works
diff --git a/docs/networking/service-discovering.md b/docs/networking/service-discovering.md
index 25ca81078..b0f5d829c 100644
--- a/docs/networking/service-discovering.md
+++ b/docs/networking/service-discovering.md
@@ -250,8 +250,8 @@ if err != nil {
process.LinkEvent(event)
// In your HandleEvent callback (etcd example):
-func (w *Worker) HandleEvent(message gen.MessageEvent) error {
- switch ev := message.Message.(type) {
+func (w *Worker) HandleEvent(event gen.MessageEvent) error {
+ switch ev := event.Message.(type) {
case etcd.EventConfigUpdate:
// Configuration item changed
diff --git a/docs/tools/ergo.md b/docs/tools/ergo.md
index 411c46ffb..81cddc45d 100644
--- a/docs/tools/ergo.md
+++ b/docs/tools/ergo.md
@@ -1,127 +1,384 @@
-# Boilerplate Code Generation
-
-The `ergo` tool allows you to generate the structure and source code for a project based on the Ergo Framework. To install it, use the following command:
-
-`go install ergo.tools/ergo@latest`
-
-Alternatively, you can build it from the source code available at [https://github.com/ergo-services/tools](https://github.com/ergo-services/tools).
-
-When using `ergo` tool, you need to follow the specific template for providing arguments:
-
-`Parent:Actor{param1:value1,param2:value2...}`
-
-* **Parent** can be a _supervisor_ (specified earlier with `-with-sup`) or an _application_ (specified earlier with `-with-app`).
-* **Actor** can be an _actor_ (added earlier with `-with-actor`) or a _supervisor_ (specified earlier with `-with-sup`).
-
-This structured approach ensures the proper hierarchy and parameters are defined for your _actors_ and _supervisors_
-
-### Available Arguments and Parameters :
-
-* **`-init `**: a required argument that sets the name of the node for your service. Available parameters:
- * **`tls`**: enables encryption for network connections (a self-signed certificate will be used).
- * **`module`**: allows you to specify the module name for the `go.mod` file.
-* **`-path `**: specifies the path for the code of the generated project.
-* **`-with-actor `**: adds an actor (based on `act.Actor`).
-* **`-with-app `**: adds an application. Available parameters:
- * **`mode`**: specifies the application's [start mode](../basics/application.md#application-startup-modes) (`temp` - Temporary, `perm` - Permanent, `trans` - Transient). The default mode is `trans`.\
- Example: `-with-app MyApp{mode:perm}`
-* **`-with-sup `**: adds a supervisor (based on `act.Supervisor`). Available parameters:
- * **`type`**: specifies the [type of supervisor](../actors/supervisor.md#supervisor-types) (`ofo` - One For One, `sofo` - Simple One For One, `afo` - All For One, `rfo` - Rest For One). The default type is `ofo`.
- * **`strategy`**: specifies the [restart strategy](../actors/supervisor.md#restart-strategy) for the supervisor (`temp` - Temporary, `perm` - Permanent, `trans` - Transient). The default strategy is `trans`.
-* **`-with-pool `**: adds a process pool actor (based on `act.Pool`). Available parameters:
- * **`size`**: Specifies the number of worker processes in the pool. By default, 3 processes are started.
-* **`-with-web `**: adds a Web server (based on `act.Pool` and `act.WebHandler`). Available parameters:
- * **`host`**: specifies the hostname for the Web server.
- * **`port`**: specifies the port number for the Web server. The default is `9090`.
- * **`tls`**: enables encryption for the Web server using the node's `CertManager`.
-* **`-with-tcp `**: adds a TCP server actor (based on `act.Actor` and `meta.TCPServer` meta-process). Available parameters:
- * **`host`**: specifies the hostname for the TCP server.
- * **`port`**: specifies the port number for the TCP server. The default is `7654`.
- * **`tls`**: enables encryption for the TCP server using the node's `CertManager`.
-* **`-with-udp `**: adds a UDP server actor (based on `act.Pool` , `meta.UDPServer` and `act.Actor` as worker processes). Available parameters:
- * **`host`**: specifies the hostname for the UDP server.
- * **`port`**: specifies the port number for the UDP server. The default is `7654`.
-* **`-with-msg `**: adds a message type for network interactions.
-* **`-with-logger `**: adds a logger from the extended library. Available loggers: [colored](../extra-library/loggers/colored.md), [rotate](../extra-library/loggers/rotate.md)
-* **`-with-observer`**: adds the [Observer application](../extra-library/applications/observer.md).
-
-### Example
-
-For clarity, let's use all available arguments for `ergo` in the following example:
-
-