Skip to content

feat(gnmi-discovery): event-driven gNMI discovery backend#436

Open
leoparente wants to merge 98 commits into
developfrom
feat/gnmi-discovery-backend
Open

feat(gnmi-discovery): event-driven gNMI discovery backend#436
leoparente wants to merge 98 commits into
developfrom
feat/gnmi-discovery-backend

Conversation

@leoparente

@leoparente leoparente commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds gnmi-discovery — a new event-driven network-discovery backend (sibling to snmp-discovery / network-discovery) that maintains long-lived gNMI subscriptions and keeps NetBox current, via Diode, within seconds of an inventory/config change.

It reuses the existing backend contract: HTTP policy API, Diode ingestion, OTEL metrics, dry-run, ${ENV} interpolation — implemented in Go, on default port 8074.

How it works

  • Event-driven, with graceful fallback: per target it resolves auto → on_change → sample → get. on_change is preferred; sample/get are used (and surfaced in status) when a device can't stream. Curated, inventory-relevant OpenConfig paths only (no volatile telemetry — NetBox stays a source of truth, not a metrics store).
  • Layered OpenConfig profiles (_base + thin vendor overlays). Unknown vendors auto-fall-back to _base (zero-config). Operators can drop overrides via profiles_dir with no rebuild.
  • Reconcile model with last-seen-cycle TTL pruning + debounce, so departed objects stop being ingested and bursts collapse into one upsert. Per-flush runs (with run_id/policy_name Diode metadata) surface in /api/v1/status, mirroring the other backends.

Supported platform types

Vendor is auto-detected from the gNMI Capabilities Organization token; the matched overlay only adds a match alias on top of _base (and documents that vendor's known-weak OpenConfig leaves) — real per-vendor leaf-path overrides are deferred until device captures exist, so the overlays bet on standard OpenConfig and degrade gracefully on absent leaves. Interface name→type recognition for each vendor's naming format lives in a shared cross-vendor table.

Overlay Match Notes
arista_eos Arista
nokia_sros Nokia
nvidia_cumulus nvidia / cumulus / mellanox alias-matched across release names
cisco Cisco covers IOS-XR / NX-OS / IOS-XE
juniper Juniper Junos (EX/QFX/MX); strongest OpenConfig fit
huawei Huawei VRP / CloudEngine; 10GE/25GE/40GE/100GE/Eth-Trunk/Vlanif name typing
dell_os10 Dell OS10
sonic SONiC matched as a network-OS signal, independent of the hardware vendor — a Dell-built SONiC box attributes manufacturer Dell while selecting the sonic overlay; requires the device's OpenConfig/Translib framework (SONiC's default sonic-db origin is out of scope)

Any vendor not listed still discovers via _base (standard OpenConfig) with zero configuration.

What it discovers (dcim / ipam, no custom fields — mirrors snmp/device-discovery)

  • Device — hostname, serial (chassis), manufacturer, model (chassis part-no), software version (folded into Platform name).
  • Interface — name, description, admin-status→enabled, MTU, type (OpenConfig state/type → name-pattern → speed ladder), speed, MAC, LAG membership, duplex.
  • IP addresses — IPv4/IPv6 on (sub)interfaces, with device primary IP resolution.
  • VLANs / VLAN groups — switchport access/trunk membership, real VLAN names/status, site-scoped VLAN-group defaults.
  • VRFs — L3VRF detection with subinterface-precise interface/IP binding and RD.
  • Prefixes — connected networks derived from discovered IPs, VRF-inherited and site-scoped.
  • Module + Module Bay — linecards/supervisors/fabric/transceivers with per-component manufacturer, model, serial.

Testing & review

  • go build ./..., go vet ./..., go test ./... -race green; golangci-lint clean; per-package coverage ≥90%. CI workflows (lint/tests/release) added.
  • gnmic transport is isolated behind a gnmi.Session interface (fake-driven unit tests); the real openconfig/gnmic client is pinned and confined to one file.
  • The change went through multiple adversarial review rounds; notable fixes include gNMI subscribe goroutine/connection leaks, an async auto-fallback correctness bug, a getStatus data race, negative/overflowing-interval crash guards, OpenConfig identityref normalization, and decoupling network-OS profile selection from hardware-vendor manufacturer attribution.

Notes

  • Complementary to SNMP/NAPALM, not a replacement (gNMI requires gRPC/TLS + OpenConfig).
  • Removals are intentionally not propagated as NetBox deletes (Diode lacks a delete primitive); departed data simply stops being ingested.
  • Follow-up: a separate orb-agent docs PR for the new backend.

🤖 Generated with Claude Code

leoparente and others added 30 commits June 2, 2026 18:08
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…iasing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fake

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements GnmicDialer + gnmicSession satisfying the Session/Dialer
interfaces using github.com/openconfig/gnmic/pkg/api. All gnmi proto
types are confined to gnmi/gnmic.go; build/vet/test green; no leakage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…de Decimal64

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sh-in-name paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ncile

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rrors, active-mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Expand _base.yaml with the H-2 comment explaining why speed and
lag_member are intentionally excluded from the interface keys
(subscribed-but-untranslated → churn; deferred pending Interface.Speed
/ Interface.Lag mapping in translate.go). Add arista_eos.yaml and
nokia_sros.yaml vendor overlays that extend _base.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add TestDryRunEOSGolden which drives the real Runner with a FakeSession
replaying a canned EOS notification stream (testdata/eos_stream.json)
and asserts the ingested entity set: exactly 1 Device, 1 Interface, and
1 Module (ModuleBay is emitted separately). The test pins profile:
arista_eos and uses the existing recordingClient/FakeDialer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ics shutdown, content-type/timeout hardening

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ction (real-transport path)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ic); robust idempotent metrics shutdown

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@leoparente

Copy link
Copy Markdown
Contributor Author

@codex review

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 72 out of 73 changed files in this pull request and generated 3 comments.

Comment thread gnmi-discovery/server/server.go
Comment thread gnmi-discovery/policy/manager.go
Comment thread gnmi-discovery/policy/manager_test.go
…oned IPv6

- server: getStatus computes uptime from the copied st.StartTime (not s.stat),
  consistent with the race-avoidance copy.
- policy: ensurePort validates a zone-qualified IPv6 literal (fe80::1%eth0) by
  ParseIP-ing the zone-stripped host, then brackets the original (with zone) and
  appends the default port, instead of treating it as malformed. Added regression
  cases (manager + ensurePort table tests).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@leoparente leoparente requested a review from Copilot June 12, 2026 14:45
@leoparente

Copy link
Copy Markdown
Contributor Author

@codex review

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 72 out of 73 changed files in this pull request and generated 1 comment.

Comment thread README.md Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e469f4ea0e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread gnmi-discovery/mapping/interface_type.go
…p link label

- policy: validateInterfaceRegexes rejects an interface_patterns entry with an
  empty type (would set Interface.Type to "", unresolvable by NetBox/Diode, and
  short-circuit the OC/name/speed/default fallback). resolveInterfaceType also
  defensively skips empty-typed patterns.
- docs: fix the top-level README label "snmp-discover" -> "snmp-discovery".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@leoparente

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 72 out of 73 changed files in this pull request and generated no new comments.

@leoparente

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

Reviewed commit: 8f3ba86b03

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@leoparente leoparente marked this pull request as ready for review June 12, 2026 19:41

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f3ba86b03

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread gnmi-discovery/mapping/translate.go Outdated
…anslation

An interface excluded via interface_exclude_patterns was skipped by
translateInterfaces but its subinterface IPs were still emitted (assigned to a
stub interface), and translatePrefixes then derived prefixes from them — so a
"skipped entirely" interface leaked back in via its addresses. Extracted the
exclude compile/match into shared helpers (compileInterfaceExcludes/nameExcluded)
and applied them in translateIPs too, dropping excluded interfaces' addresses
(and their derived prefixes). Added a regression test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@leoparente

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6aab305cf1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

return fmt.Errorf("parse profile %s: %w", filename, err)
}
p.Name = strings.TrimSuffix(filename, ".yaml")
into[p.Name] = &p

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve bundled profile when override inheritance is invalid

When an override reuses a bundled filename but has a bad extends chain or inheritance cycle, this assignment replaces the bundled raw profile before semantic validation. The later resolve error skips that name entirely, so a typo in something like /profiles/arista_eos.yaml removes the built-in Arista profile instead of skipping the bad override and falling back to the bundled profile; keep the bundled entry until the override resolves successfully.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants