Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
92 changes: 92 additions & 0 deletions enhancements/SEP3-observability-integration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
|Status | Authors | Created |
|---------------------------------------------------|--------------|------------|
|Implementation | @mayleighnmyers, @yxun | 2026-06-01 |

# Observability Integration (Metrics)

## Overview

Upstream Istio generates telemetry metrics for the control plane and sidecar proxies. Users can customize Istio metrics with the Telemetry API and create scraping jobs by updating a Prometheus ConfigMap. For example, [prometheus.yaml](https://raw.githubusercontent.com/istio/istio/master/samples/addons/prometheus.yaml). However, that approach requires restarting the Prometheus instance and users may not have enough permission to update the existing Prometheus's ConfigMap.

Alternatively, those can be configured by using two custom resources `ServiceMonitor` and `PodMonitor`. We want to automate the creation of `ServiceMonitor` and `PodMonitor` resources with default values in Sail Operator. And then users can query Istio metrics in Prometheus or Kiali dashboard without manual configuration step.

## Goals

* Provide a monitoring controller that reconciles `ServiceMonitor` and `PodMonitor` resources for Istio control plane and sidecar proxy metrics.
* Apply platform-appropriate default relabeling rules on Kubernetes and OpenShift.
Comment on lines +15 to +16

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ambient is out of scope?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is out of scope, we should mention this; if not, we need to plan how metrics are gathered from node level ztunnel components and namespace level waypoint proxies

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, ambient is out of scope in this SEP. We want to keep changes smaller in each SEP. We can have the next SEP for ambient mode metrics.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's update then the SEP to clearly mention that ambient is out of scope


## Non-goals

* Deploying observability stack components (Prometheus, OpenShift Cluster Observability Operator, OpenShift User-Workload Monitoring, etc.). We assume those are installed separately.
* Updating `ServiceMonitor` and `PodMonitor` for user customization of scraping paths, ports, or relabeling rules via the Sail Operator API.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need to explicitly mention this in the docs:
If users require custom relabeling or scraping configs, they must leave spec.monitoring.enabled: false, and deploy their own independent PodMonitors.

* Observability Distributed Tracing integration.

## Design

### User Stories

1. As a user running Istio with a Prometheus monitoring stack, I want the Sail Operator to configure metrics scraping jobs for Istio-generated telemetry metrics without restart the Prometheus instance.
2. As a user running OpenShift Service Mesh(OSSM), I want the Sail Operator to create `PodMonitor` resources with OSSM-documented relabeling rules so metrics appear correctly in the OpenShift console and Kiali dashboard.

### API Changes

We will add a boolean field `spec.monitoring.enabled` in the `Istio` Custom Resource(CR). For example,

```yaml
apiVersion: sailoperator.io/v1
kind: Istio
metadata:
name: default
spec:
monitoring:
enabled: false
```

This field defaults to `false`. When it is set to `true`, the monitoring controller reconciles `ServiceMonitor` and `PodMonitor` CRs for each owned `IstioRevision`.

#### Alternatives API Considered

A broader Observability Integration API with target references to a monitoring and tracing stacks is being discussed in the initial PR comments. We plan to design and implement the monitoring controller and broader Integration API controller separately. This enhancement proposal is focused on the monitoring controller with API change in the `Istio` Custom Resource. We will work toward a broader Integration API design and implementation in next phase.

### Architecture

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will deletion be handled? I think it worth to mention how we plan to hanlde the deletion of the resources

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a concern from existing environments where users have configured their custom ServiceMonitor or PodMontior resources manually. They expect those existing resources should not be deleted from Sail Operator.
We will add a paragraph about deletion in SEP. Thanks.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking ownership can help us to solve this? we can add specific metadata to allow us to easily filter which resources are handle by our operator


We assume `ServiceMonitor` and `PodMonitor` CRDs are available under `monitoring.coreos.com/v1` and/or `monitoring.rhobs/v1` group. The Sail Operator ClusterRole grants permissions for both API groups' `ServiceMonitor` and `PodMonitor` custom resources.

A monitoring controller watches `IstioRevision` resources and reconciles `ServiceMonitor` and `PodMonitor` CRs when the boolean field `spec.monitoring.enabled` is set to `true` in the parent `Istio` CR. It also watches namespaces with the `istio-injection=enabled` label and reconciles `PodMonitor` CRs for Istio sidecar injection namespaces.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means it will only reconcile the resource for namespaces with stio-injection=enabled`? What about the revision labels used?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mayleighnmyers , here is a doc about injection label:
https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/#controlling-the-injection-policy

Currently, we only have the istio-injection label watch. There is another label we want to watch. istio.io/rev
when the istio.io/rev label has a value , it is same as the istio-injection=enabled.

I can update the SEP part, would you like to add that additional istio.io/rev label watch in the monitoring controller code ? thanks.


The monitoring controller applies platform-specific relabeling defaults when creating `ServiceMonitor` and `PodMonitor` CRs. The monitoring controller starts and detects the running platform at the Sail Operator startup time.

#### Kubernetes vs OpenShift

On **Kubernetes** platform, applying a single `PodMonitor` CR in the control plane namespace with `namespaceSelector.any: true` is the preferred approach. That is matching upstream Istio Prometheus configuration example.
On **OpenShift** platform, the User-Workload Monitoring stack ignores `namespaceSelector` field, so we need create a `PodMonitor` CR in each sidecar injection enabled namespace.

### Performance Impact

The `ServiceMonitor` reconciliation is only required in the control plane namespace and it has low performance impact. The `PodMonitor` reconciliation watches namespaces with the `istio-injection=enabled` label on OpenShift platform and it may impact performance on clusters when there are large number of existing namespaces.

## Implementation Plan

v1alpha1
- [x] Monitoring controller reconciling `ServiceMonitor` and `PodMonitor` for `IstioRevision`
- [x] `Istio.spec.monitoring.enabled` API
- [x] Platform-specific PodMonitor relabeling defaults
- [x] Unit tests for controller and relabeling package
- [ ] External CRD/API group detection (`monitoring.coreos.com` vs `monitoring.rhobs`)
- [ ] Kubernetes PodMonitor strategy using `namespaceSelector.any: true`
- [ ] Validation when monitoring CRDs are unavailable

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add to this document design details for this? From my pov this can be discussed here to set a base. I think we can set a Condition field with Type: MonitoringAvailable, Status: False, Reason: MissingCRDs, Message: "Prometheus Operator CRDs not found" something like this will be very helpfull

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @fjglira , yes, I am working on this validation change and having a draft change.
ok, we can add status condition update in the Istio custom resources. I will update this detail in SEP later today.

- [ ] Integration tests (`tests/integration/api/monitoring_test.go`)
- [ ] KinD e2e tests
- [ ] User-facing documentation

v1alpha2
- [ ] Integration API with target references for metrics and tracing

## Test Plan

Functionality will be tested in unit tests, integration tests (envtest), and KinD e2e tests.

## Updates

* 2026-06-11: Updated with platform relabeling implementation, remaining v1alpha1 work items.
* 2026-06-23: Updated Overview, Goals, Non-goals, Design descriptions. Added Alternatives API Considered. Removed OSSM JIRA references.