diff --git a/concepts/AI-monitoring-user-apps.adoc b/concepts/AI-monitoring-user-apps.adoc index 8cdbe01..7d885da 100644 --- a/concepts/AI-monitoring-user-apps.adoc +++ b/concepts/AI-monitoring-user-apps.adoc @@ -66,11 +66,12 @@ Number of requests. *Unit:* integer gen_ai.usage.cost:: -The distribution of GenAI request costs. +The distribution of GenAI request costs. +This is a non-standard metric, available only when explicitly set by the application or when the application is instrumented with {openlit}. + -*Type:* histogram +*Type:* histogram + -*Unit:* USD +*Unit:* USD gen_ai.usage.input_tokens:: Number of prompt tokens processed. @@ -107,7 +108,7 @@ No metrics received from any components. :: No metrics received from the GPU. :: + -* Verify if the RBAC rules were applied. +* Verify that the `clusterRole` configuration is included in the `otel-values.yaml` and the collector has been installed or upgraded with it. * Verify if the metrics receiver scraper is configured. * Check the {nvidia} DCGM Exporter for errors. diff --git a/tasks/AI-monitoring-gpu.adoc b/tasks/AI-monitoring-gpu.adoc index 6cff5a4..a35599c 100644 --- a/tasks/AI-monitoring-gpu.adoc +++ b/tasks/AI-monitoring-gpu.adoc @@ -7,56 +7,15 @@ To effectively monitor the performance and utilization of your GPUs, configure t [#ai-monitoring-gpu-metrics] .Collect GPU metrics (recommended) -. *Grant permissions (RBAC).* The {otelemetry} Collector requires specific permissions to discover the GPU metrics endpoint within the cluster. -+ -Create a file named `otel-rbac.yaml` -with the following content. -It defines a `Role` with permissions to get services and endpoints, and a `RoleBinding` to grant these permissions to the {otelemetry} Collector's service account. +. *Verify RBAC permissions.* The {otelemetry} Collector requires specific permissions to discover the GPU metrics endpoint within the cluster. 
+These permissions are automatically configured when you install the collector with the `clusterRole` section in the `otel-values.yaml` file (see xref:observability-settingup-ai.adoc[]). + ----- ---- -apiVersion: rbac.authorization.k8s.io/v1 -kind: Role -metadata: - name: suse-observability-otel-scraper -rules: - - apiGroups: - - "" - resources: - - services - - endpoints - verbs: - - list - - watch - ---- -apiVersion: rbac.authorization.k8s.io/v1 -kind: RoleBinding -metadata: - name: suse-observability-otel-scraper -roleRef: - apiGroup: rbac.authorization.k8s.io - kind: Role - name: suse-observability-otel-scraper -subjects: - - kind: ServiceAccount - name: OPENTELEMETRY-COLLECTOR - namespace: OBSERVABILITY ---- ----- -+ -[IMPORTANT] +[NOTE] ==== -Verify that the `ServiceAccount` name and namespace in the `RoleBinding` match your {otelemetry} Collector's deployment. +If you installed the {otelemetry} Collector without the `clusterRole` configuration, you must upgrade the collector with the updated `otel-values.yaml` that includes the `clusterRole` section. ==== + -. Apply this configuration to the `gpu-operator` namespace. -+ -[source,bash] ----- -> kubectl apply -n gpu-operator -f otel-rbac.yaml ----- -. *Configure the {otelemetry} Collector.* Add the following Prometheus receiver configuration to your {otelemetry} Collector's values file. This tells the collector to scrape metrics from any endpoint in the `gpu-operator` namespace every 10 seconds. +. *Configure the {otelemetry} Collector.* Add the following {prometheus} receiver configuration to your {otelemetry} Collector's values file. This tells the collector to scrape metrics from any endpoint in the `gpu-operator` namespace every 10 seconds. 
+ [source,yaml] ---- diff --git a/tasks/AI-monitoring-owui.adoc b/tasks/AI-monitoring-owui.adoc index aca906a..fd7e0fa 100644 --- a/tasks/AI-monitoring-owui.adoc +++ b/tasks/AI-monitoring-owui.adoc @@ -19,13 +19,13 @@ pipelines: storageClass: longhorn <.> extraEnvVars: <.> - name: PIPELINES_URLS <.> - value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/suse_ai_filter.py" + value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/tags/v2.0.0/integrations/oi-filter/suse_ai_filter.py" - name: OTEL_SERVICE_NAME <.> value: "Open WebUI" - name: OTEL_EXPORTER_HTTP_OTLP_ENDPOINT <.> value: "http://opentelemetry-collector.suse-observability.svc.cluster.local:4318" - name: PRICING_JSON <.> - value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/pricing.json" + value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/tags/v2.0.0/integrations/oi-filter/pricing.json" extraEnvVars: - name: OPENAI_API_KEY <.> value: "0p3n-w3bu!" @@ -102,7 +102,7 @@ include::../snippets/openwebui-requirement-admin-privileges.adoc[] . In the bottom left of the {owui} window, click your avatar icon to open the user menu and select menu:Admin Panel[]. . Click the menu:Settings[] tab and select menu:Pipelines[] from the left menu. -. In the menu:Install from Github URL[] section, enter `https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/suse_ai_filter.py` and click the upload button on the right to upload the pipeline from the URL. +. In the menu:Install from Github URL[] section, enter `https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/tags/v2.0.0/integrations/oi-filter/suse_ai_filter.py` and click the upload button on the right to upload the pipeline from the URL. . 
After the upload is finished, you can review the configuration of the pipeline. Confirm with menu:Save[]. + [#fig-ai-monitoring-owui-pipelines-webui] diff --git a/tasks/observability-settingup-ai.adoc b/tasks/observability-settingup-ai.adoc index a7831eb..0d312c1 100644 --- a/tasks/observability-settingup-ai.adoc +++ b/tasks/observability-settingup-ai.adoc @@ -122,14 +122,14 @@ For multi-cluster deployments, this is the external URL. For single-cluster deployments, this can be the internal service URL. Example: `https://suse-observability-api.your-domain.com` SUSE_OBSERVABILITY_API_KEY:: The API key from the `baseConfig_values.yaml` file used during the {sobservability} installation. -SUSE_OBSERVABILITY_API_TOKEN_TYPE:: Can be `api` for a token from the Web UI or `service` for a Service Token. +SUSE_OBSERVABILITY_API_TOKEN_TYPE:: Can be `api` for a token from the Web UI or `service` for a Service Token (used for automation/CI). SUSE_OBSERVABILITY_TOKEN:: The API or Service token itself. +Ignored if `existingSecret` is set. TLS_CA_CERTIFICATE:: The CA certificate content in PEM format (optional). -TLS_CLIENT_CERTIFICATE:: The client certificate content in PEM format (optional). -TLS_CLIENT_KEY:: The client private key content in PEM format (optional). -OBSERVED_SERVER_NAME:: The name of the cluster to observe. -It must match the name used in the {kube} StackPack configuration. -Example: `suse-ai-cluster`. +Ignored if `tls.existingSecret` is set. +OBSERVED_CLUSTER_NAMES:: A list of {kube} cluster names to install StackPacks for. +Each cluster listed here will get a kubernetes-v2 StackPack instance provisioned in {sobservability}. +Example: `["suse-ai-cluster"]`. .. 
Create the `genai_values.yaml` file with the following content: + @@ -140,21 +140,26 @@ global: imagePullSecrets: - application-collection <.> serverUrl: -apiKey: tokenType: -apiToken: -clusterName: +apiToken: <.> +kubernetesClusters: <.> + - tls: <.> enabled: false skipVerify: false + existingSecret: "" certificates: ca: "" - client: "" - clientKey: "" ---- <.> Instructs {helm} to use credentials from the {sappco}. For instructions on how to configure the image pull secrets for the {sappco}, refer to the link:https://docs.apps.rancher.io/get-started/authentication/[official documentation]. +<.> Alternatively, you can reference an existing {kube} secret by setting `existingSecret` and `existingSecretKey` instead of providing the token inline. +This is the recommended approach for production deployments. +<.> List of {kube} cluster names to install StackPacks for. +Add one entry per cluster you want to observe. <.> Provides optional TLS configuration for secure communication. +Set `tls.enabled` to `true` and provide a CA certificate via `tls.certificates.ca` or reference an existing secret with `tls.existingSecret`. +Set `tls.skipVerify` to `true` only for development and testing environments. endif::[] ifeval::["{PROF_DEPLOYMENT}" == "airgapped"] [source,yaml] @@ -164,21 +169,26 @@ global: - application-collection <.> imageRegistry: :5043 serverUrl: -apiKey: tokenType: -apiToken: -clusterName: +apiToken: <.> +kubernetesClusters: <.> + - tls: <.> enabled: false skipVerify: false + existingSecret: "" certificates: ca: "" - client: "" - clientKey: "" ---- <.> Instructs {helm} to use credentials from the {sappco}. For instructions on how to configure the image pull secrets for the {sappco}, refer to the link:https://docs.apps.rancher.io/get-started/authentication/[official documentation]. +<.> Alternatively, you can reference an existing {kube} secret by setting `existingSecret` and `existingSecretKey` instead of providing the token inline. 
+This is the recommended approach for production deployments. +<.> List of {kube} cluster names to install StackPacks for. +Add one entry per cluster you want to observe. <.> Provides optional TLS configuration for secure communication. +Set `tls.enabled` to `true` and provide a CA certificate via `tls.certificates.ca` or reference an existing secret with `tls.existingSecret`. +Set `tls.skipVerify` to `true` only for development and testing environments. endif::[] .. Run the install command. @@ -195,7 +205,7 @@ ifeval::["{PROF_DEPLOYMENT}" == "airgapped"] [source,bash,subs="+attributes"] ---- {prompt_user}helm upgrade --install ai-obs \ - charts/suse-ai-observability-extension-1.5.0.tgz \ + charts/suse-ai-observability-extension-2.0.0.tgz \ -f genai_values.yaml --namespace so-extensions --create-namespace ---- endif::[] @@ -208,6 +218,20 @@ If you are using self-signed certificates or a custom CA, you can provide the ce Alternatively, consider running the extension in the same cluster as {sobservability} and then use the internal {k8s} address. ==== + +[NOTE] +.{otelemetry} GenAI Semantic Conventions compatibility +==== +The {suseai} Observability Extension is compatible with the link:https://opentelemetry.io/docs/specs/semconv/gen-ai/[{otelemetry} Semantic Conventions for Generative AI] as defined in version v1.40.0. +These conventions are currently in *Development* status within the {otelemetry} specification. +==== ++ +[IMPORTANT] +.Upgrading from version 1.5.0 to 2.0.0 +==== +When upgrading the {suseai} Observability Extension from version 1.5.0 to 2.0.0, you must also update the {otelemetry} Collector and the {suseai} Filter (`suse_ai_filter.py`) to their corresponding versions. +These components are designed to work together, and running mismatched versions may result in missing telemetry data or broken topology views. 
+==== ++ After the installation is complete, a new menu called btn:[GenAI] is added to the Web interface and also a {kube} cron job is created that synchronizes the topology view with the components found in the {productname} cluster. . *Verify {sobservability} extension.* @@ -246,35 +270,63 @@ The endpoint of the {sobservability} Collector. . *Install {nvoperator}.* Follow the instructions in link:https://documentation.suse.com/cloudnative/rke2/latest/en/advanced.html#_deploy_nvidia_operator[]. . *Install {otelemetry} collector.* -Create a secret with your {sobservability} API key in the namespace where you want to install the collector. +Create the namespace for the collector if it does not exist yet. ++ +[source,bash,subs="+attributes"] +---- +{prompt_user}kubectl create namespace observability +---- ++ +Create a secret with your {sobservability} API key. Retrieve the API key using the Web UI or from the `baseConfig_values.yaml` file that you used during the {sobservability} installation. -If the namespace does not exist yet, create it. + -[source,bash] +[source,bash,subs="+attributes"] ---- -kubectl create namespace observability -kubectl create secret generic open-telemetry-collector \ +{prompt_user}kubectl create secret generic open-telemetry-collector \ --namespace observability \ --from-literal=API_KEY='' ---- + +Create the image pull secret for the {sregistry}. +The username is `regcode` and the password is the {scca} registration code of your {productname} subscription. ++ +[source,bash,subs="+attributes"] +---- +{prompt_user}kubectl create secret docker-registry suse-ai-registry \ + --docker-server=registry.suse.com \ + --docker-username=regcode \ + --docker-password= \ + -n observability +---- ++ Create a new file named `otel-values.yaml` with the following content. 
+ ifeval::["{PROF_DEPLOYMENT}" == "standard"] [source,yaml] ---- -image: - registry: docker.io - repository: otel/opentelemetry-collector-contrib - tag: 0.140.0 - pullPolicy: Always global: imagePullSecrets: - - application-collection + - suse-ai-registry +extraEnvs: + - name: K8S_CLUSTER_NAME + value: "" <.> + - name: SUSE_AI_NAMESPACE + value: "" <.> extraEnvsFrom: - secretRef: name: open-telemetry-collector mode: deployment +clusterRole: + create: true + rules: + - apiGroups: [""] + resources: ["pods", "nodes", "endpoints", "services"] + verbs: ["get", "list", "watch"] +image: + registry: registry.suse.com + repository: ai/containers/suse-ai-opentelemetry-collector + tag: 0.149.0 + pullPolicy: IfNotPresent ports: metrics: enabled: true @@ -315,6 +367,11 @@ config: metrics_path: '/metrics' static_configs: - targets: ['..svc.cluster.local:9091'] <.> + - job_name: 'qdrant' + scrape_interval: 10s + metrics_path: '/metrics' + static_configs: + - targets: ['..svc.cluster.local:6333'] <.> - job_name: 'vllm' scrape_interval: 10s scheme: http @@ -329,13 +386,71 @@ config: action: keep regex: '.*.*' <.> exporters: + topology: + endpoint: http://:8080 <.> + api_key: ${env:API_KEY} + instance_url: ${env:K8S_CLUSTER_NAME} + namespace: ${env:SUSE_AI_NAMESPACE} + tls: <.> + insecure_skip_verify: true otlp: - endpoint: https://.suse-observability.svc.cluster.local:4317 <.> + endpoint: http://.suse-observability.svc.cluster.local:4317 <.> headers: Authorization: "SUSEObservability ${env:API_KEY}" tls: insecure: true processors: + resource/elasticsearch: + attributes: + - key: suse.ai.managed + value: "true" + action: insert + - key: suse.ai.component.name + value: "opensearch" + action: insert + - key: suse.ai.component.type + value: "search-engine" + action: insert + - key: service.name + value: "opensearch" + action: insert + - key: service.namespace + value: ${env:SUSE_AI_NAMESPACE} + action: insert + - key: k8s.namespace.name + value: ${env:SUSE_AI_NAMESPACE} + action: 
upsert + - key: service.instance.id + value: "opensearch-cluster" + action: insert + transform/vllm: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["suse.ai.managed"], "true") where attributes["service.name"] == "vllm" + - set(attributes["suse.ai.component.name"], "vllm") where attributes["service.name"] == "vllm" + - set(attributes["suse.ai.component.type"], "inference-engine") where attributes["service.name"] == "vllm" + - set(attributes["service.instance.id"], "vllm") where attributes["service.name"] == "vllm" + - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["service.name"] == "vllm" + transform/qdrant: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["suse.ai.managed"], "true") where attributes["service.name"] == "qdrant" + - set(attributes["suse.ai.component.name"], "qdrant") where attributes["service.name"] == "qdrant" + - set(attributes["suse.ai.component.type"], "vectordb") where attributes["service.name"] == "qdrant" + - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["service.name"] == "qdrant" + transform/milvus: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["suse.ai.managed"], "true") where attributes["service.name"] == "milvus" + - set(attributes["suse.ai.component.name"], "milvus") where attributes["service.name"] == "milvus" + - set(attributes["suse.ai.component.type"], "vectordb") where attributes["service.name"] == "milvus" + - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["service.name"] == "milvus" tail_sampling: decision_wait: 10s policies: @@ -365,11 +480,88 @@ config: resource: attributes: - key: k8s.cluster.name - action: upsert - value: <.> + action: insert + value: ${env:K8S_CLUSTER_NAME} - key: service.instance.id from_attribute: k8s.pod.uid action: insert + # Infer inference engines 
from GenAI client metrics. + filter/genai-metrics-only: + error_mode: ignore + metrics: + metric: + - not(IsMatch(name, "gen_ai\\..*")) + groupbyattrs/infer-providers: + keys: + - gen_ai.provider.name + transform/infer-providers: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["service.name"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil + - set(attributes["service.instance.id"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil + - set(attributes["suse.ai.managed"], "true") where attributes["gen_ai.provider.name"] != nil + - set(attributes["suse.ai.component.name"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil + - set(attributes["suse.ai.component.type"], "inference-engine") where attributes["gen_ai.provider.name"] != nil + - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["gen_ai.provider.name"] != nil + # Infer LLM models from GenAI client metrics. 
+ groupbyattrs/infer-models: + keys: + - gen_ai.request.model + - gen_ai.provider.name + transform/infer-models: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["service.name"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil + - set(attributes["service.instance.id"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil + - set(attributes["suse.ai.managed"], "true") where attributes["gen_ai.request.model"] != nil + - set(attributes["suse.ai.component.name"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil + - set(attributes["suse.ai.component.type"], "llm-model") where attributes["gen_ai.request.model"] != nil + - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["gen_ai.request.model"] != nil + # Exclude services that already have suse.ai.component.name set. + filter/exclude-already-tagged: + error_mode: ignore + metrics: + resource: + - attributes["suse.ai.component.name"] != nil + # Infer application components from GenAI client metrics. + transform/infer-applications: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["suse.ai.managed"], "true") + - set(attributes["suse.ai.component.type"], "application") + - set(attributes["suse.ai.component.name"], attributes["service.name"]) + # Create provider -> model relations from GenAI trace spans. 
+ filter/genai-spans: + error_mode: ignore + traces: + span: + - attributes["gen_ai.request.model"] == nil + groupbyattrs/model-relations: + keys: + - gen_ai.provider.name + transform/model-relations: + error_mode: ignore + trace_statements: + - context: resource + statements: + - set(attributes["service.name"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil + - set(attributes["service.instance.id"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil + - context: span + statements: + - set(attributes["peer.service"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil + # Create application -> provider relations from GenAI trace spans. + transform/provider-relations: + error_mode: ignore + trace_statements: + - context: span + statements: + - set(attributes["peer.service"], resource.attributes["gen_ai.provider.name"]) where resource.attributes["gen_ai.provider.name"] != nil filter/dropMissingK8sAttributes: error_mode: ignore traces: @@ -386,7 +578,7 @@ config: error_mode: ignore table: - statement: route() - pipelines: [traces/sampling, traces/spanmetrics] + pipelines: [traces/sampling, traces/spanmetrics, traces/model-relations, traces/provider-relations, traces/topology] service: extensions: - health_check @@ -404,34 +596,86 @@ config: processors: [tail_sampling, batch] exporters: [debug, otlp] metrics: - receivers: [otlp, spanmetrics, prometheus, elasticsearch] - processors: [memory_limiter, resource, batch] + receivers: [otlp, spanmetrics, prometheus] + processors: [memory_limiter, transform/qdrant, transform/milvus, transform/vllm, resource, batch] + exporters: [debug, otlp] + # Infer inference engines from GenAI client metrics. + metrics/infer-providers: + receivers: [otlp] + processors: [filter/genai-metrics-only, groupbyattrs/infer-providers, transform/infer-providers, resource, batch] + exporters: [otlp] + # Infer LLM models from GenAI client metrics. 
+ metrics/infer-models: + receivers: [otlp] + processors: [filter/genai-metrics-only, groupbyattrs/infer-models, transform/infer-models, resource, batch] + exporters: [otlp] + # Infer application components from GenAI client metrics. + metrics/infer-applications: + receivers: [otlp] + processors: [filter/genai-metrics-only, filter/exclude-already-tagged, transform/infer-applications, resource, batch] + exporters: [otlp] + # Create provider -> model relations from trace spans. + traces/model-relations: + receivers: [routing/traces] + processors: [filter/genai-spans, groupbyattrs/model-relations, transform/model-relations, batch] + exporters: [otlp] + # Create application -> provider relations from trace spans. + traces/provider-relations: + receivers: [routing/traces] + processors: [filter/genai-spans, transform/provider-relations, batch] + exporters: [otlp] + # Push product topology to {sobservability}. + traces/topology: + receivers: [routing/traces] + processors: [batch] + exporters: [topology] + metrics/elasticsearch: + receivers: [elasticsearch] + processors: [memory_limiter, resource/elasticsearch, resource, batch] exporters: [debug, otlp] ---- +<.> Replace `` with the cluster's name. +<.> Replace `` with the namespace where {productname} components are installed. <.> Configure the {milvus} service and namespace for the {prometheus} scraper. Because {milvus} will be installed in subsequent steps, you can return to this step and edit the endpoint if necessary. +<.> Configure the {qdrant} service and namespace for the {prometheus} scraper. <.> Update to match the values in the {vllm} deployment section. <.> Update to match the values in the {vllm} deployment section. +<.> Set the topology exporter to your exposed {sobservability} router. +For single-cluster deployments, use the internal service URL (for example, `suse-observability-router.suse-observability.svc.cluster.local`). +For multi-cluster deployments, use the external URL. 
+<.> Optional TLS configuration for the topology exporter. +Set `insecure_skip_verify` to `true` for self-signed certificates. <.> Set the exporter to your exposed {sobservability} collector. Remember that the value can be distinct, depending on the deployment pattern. For production usage, we recommend using TLS communication. -<.> Replace `` with the cluster's name. endif::[] ifeval::["{PROF_DEPLOYMENT}" == "airgapped"] [source,yaml] ---- -image: - registry: :5043 - repository: opentelemetry-collector-contrib - tag: 0.140.0 - pullPolicy: Always global: imagePullSecrets: - - application-collection + - suse-ai-registry +extraEnvs: + - name: K8S_CLUSTER_NAME + value: "" <.> + - name: SUSE_AI_NAMESPACE + value: "" <.> extraEnvsFrom: - secretRef: name: open-telemetry-collector mode: deployment +clusterRole: + create: true + rules: + - apiGroups: [""] + resources: ["pods", "nodes", "endpoints", "services"] + verbs: ["get", "list", "watch"] +image: + registry: :5043 + repository: suse-ai-opentelemetry-collector + tag: 0.149.0 + pullPolicy: IfNotPresent ports: metrics: enabled: true @@ -472,6 +716,11 @@ config: metrics_path: '/metrics' static_configs: - targets: ['..svc.cluster.local:9091'] <.> + - job_name: 'qdrant' + scrape_interval: 10s + metrics_path: '/metrics' + static_configs: + - targets: ['..svc.cluster.local:6333'] <.> - job_name: 'vllm' scrape_interval: 10s scheme: http @@ -486,13 +735,71 @@ config: action: keep regex: '.*.*' <.> exporters: + topology: + endpoint: http://:8080 <.> + api_key: ${env:API_KEY} + instance_url: ${env:K8S_CLUSTER_NAME} + namespace: ${env:SUSE_AI_NAMESPACE} + tls: <.> + insecure_skip_verify: true otlp: - endpoint: https://.suse-observability.svc.cluster.local:4317 <.> + endpoint: http://.suse-observability.svc.cluster.local:4317 <.> headers: Authorization: "SUSEObservability ${env:API_KEY}" tls: insecure: true processors: + resource/elasticsearch: + attributes: + - key: suse.ai.managed + value: "true" + action: insert + - key: 
suse.ai.component.name + value: "opensearch" + action: insert + - key: suse.ai.component.type + value: "search-engine" + action: insert + - key: service.name + value: "opensearch" + action: insert + - key: service.namespace + value: ${env:SUSE_AI_NAMESPACE} + action: insert + - key: k8s.namespace.name + value: ${env:SUSE_AI_NAMESPACE} + action: upsert + - key: service.instance.id + value: "opensearch-cluster" + action: insert + transform/vllm: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["suse.ai.managed"], "true") where attributes["service.name"] == "vllm" + - set(attributes["suse.ai.component.name"], "vllm") where attributes["service.name"] == "vllm" + - set(attributes["suse.ai.component.type"], "inference-engine") where attributes["service.name"] == "vllm" + - set(attributes["service.instance.id"], "vllm") where attributes["service.name"] == "vllm" + - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["service.name"] == "vllm" + transform/qdrant: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["suse.ai.managed"], "true") where attributes["service.name"] == "qdrant" + - set(attributes["suse.ai.component.name"], "qdrant") where attributes["service.name"] == "qdrant" + - set(attributes["suse.ai.component.type"], "vectordb") where attributes["service.name"] == "qdrant" + - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["service.name"] == "qdrant" + transform/milvus: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["suse.ai.managed"], "true") where attributes["service.name"] == "milvus" + - set(attributes["suse.ai.component.name"], "milvus") where attributes["service.name"] == "milvus" + - set(attributes["suse.ai.component.type"], "vectordb") where attributes["service.name"] == "milvus" + - set(attributes["k8s.namespace.name"], 
"${env:SUSE_AI_NAMESPACE}") where attributes["service.name"] == "milvus" tail_sampling: decision_wait: 10s policies: @@ -522,11 +829,88 @@ config: resource: attributes: - key: k8s.cluster.name - action: upsert - value: <.> + action: insert + value: ${env:K8S_CLUSTER_NAME} - key: service.instance.id from_attribute: k8s.pod.uid action: insert + # Infer inference engines from GenAI client metrics. + filter/genai-metrics-only: + error_mode: ignore + metrics: + metric: + - not(IsMatch(name, "gen_ai\\..*")) + groupbyattrs/infer-providers: + keys: + - gen_ai.provider.name + transform/infer-providers: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["service.name"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil + - set(attributes["service.instance.id"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil + - set(attributes["suse.ai.managed"], "true") where attributes["gen_ai.provider.name"] != nil + - set(attributes["suse.ai.component.name"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil + - set(attributes["suse.ai.component.type"], "inference-engine") where attributes["gen_ai.provider.name"] != nil + - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["gen_ai.provider.name"] != nil + # Infer LLM models from GenAI client metrics. 
+ groupbyattrs/infer-models: + keys: + - gen_ai.request.model + - gen_ai.provider.name + transform/infer-models: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["service.name"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil + - set(attributes["service.instance.id"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil + - set(attributes["suse.ai.managed"], "true") where attributes["gen_ai.request.model"] != nil + - set(attributes["suse.ai.component.name"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil + - set(attributes["suse.ai.component.type"], "llm-model") where attributes["gen_ai.request.model"] != nil + - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["gen_ai.request.model"] != nil + # Exclude services that already have suse.ai.component.name set. + filter/exclude-already-tagged: + error_mode: ignore + metrics: + resource: + - attributes["suse.ai.component.name"] != nil + # Infer application components from GenAI client metrics. + transform/infer-applications: + error_mode: ignore + metric_statements: + - context: resource + statements: + - set(attributes["suse.ai.managed"], "true") + - set(attributes["suse.ai.component.type"], "application") + - set(attributes["suse.ai.component.name"], attributes["service.name"]) + # Create provider -> model relations from GenAI trace spans. 
+ filter/genai-spans: + error_mode: ignore + traces: + span: + - attributes["gen_ai.request.model"] == nil + groupbyattrs/model-relations: + keys: + - gen_ai.provider.name + transform/model-relations: + error_mode: ignore + trace_statements: + - context: resource + statements: + - set(attributes["service.name"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil + - set(attributes["service.instance.id"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil + - context: span + statements: + - set(attributes["peer.service"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil + # Create application -> provider relations from GenAI trace spans. + transform/provider-relations: + error_mode: ignore + trace_statements: + - context: span + statements: + - set(attributes["peer.service"], resource.attributes["gen_ai.provider.name"]) where resource.attributes["gen_ai.provider.name"] != nil filter/dropMissingK8sAttributes: error_mode: ignore traces: @@ -543,7 +927,7 @@ config: error_mode: ignore table: - statement: route() - pipelines: [traces/sampling, traces/spanmetrics] + pipelines: [traces/sampling, traces/spanmetrics, traces/model-relations, traces/provider-relations, traces/topology] service: extensions: - health_check @@ -561,18 +945,59 @@ config: processors: [tail_sampling, batch] exporters: [debug, otlp] metrics: - receivers: [otlp, spanmetrics, prometheus, elasticsearch] - processors: [memory_limiter, resource, batch] + receivers: [otlp, spanmetrics, prometheus] + processors: [memory_limiter, transform/qdrant, transform/milvus, transform/vllm, resource, batch] + exporters: [debug, otlp] + # Infer inference engines from GenAI client metrics. + metrics/infer-providers: + receivers: [otlp] + processors: [filter/genai-metrics-only, groupbyattrs/infer-providers, transform/infer-providers, resource, batch] + exporters: [otlp] + # Infer LLM models from GenAI client metrics. 
+ metrics/infer-models: + receivers: [otlp] + processors: [filter/genai-metrics-only, groupbyattrs/infer-models, transform/infer-models, resource, batch] + exporters: [otlp] + # Infer application components from GenAI client metrics. + metrics/infer-applications: + receivers: [otlp] + processors: [filter/genai-metrics-only, filter/exclude-already-tagged, transform/infer-applications, resource, batch] + exporters: [otlp] + # Create provider -> model relations from trace spans. + traces/model-relations: + receivers: [routing/traces] + processors: [filter/genai-spans, groupbyattrs/model-relations, transform/model-relations, batch] + exporters: [otlp] + # Create application -> provider relations from trace spans. + traces/provider-relations: + receivers: [routing/traces] + processors: [filter/genai-spans, transform/provider-relations, batch] + exporters: [otlp] + # Push product topology to {sobservability}. + traces/topology: + receivers: [routing/traces] + processors: [batch] + exporters: [topology] + metrics/elasticsearch: + receivers: [elasticsearch] + processors: [memory_limiter, resource/elasticsearch, resource, batch] exporters: [debug, otlp] ---- +<.> Replace `` with the cluster's name. +<.> Replace `` with the namespace where {productname} components are installed. <.> Configure the {milvus} service and namespace for the {prometheus} scraper. Because {milvus} will be installed in subsequent steps, you can return to this step and edit the endpoint if necessary. +<.> Configure the {qdrant} service and namespace for the {prometheus} scraper. <.> Update to match the values in the {vllm} deployment section. <.> Update to match the values in the {vllm} deployment section. +<.> Set the topology exporter to your exposed {sobservability} router. +For single-cluster deployments, use the internal service URL (for example, `suse-observability-router.suse-observability.svc.cluster.local`). +For multi-cluster deployments, use the external URL. 
+<.> Optional TLS configuration for the topology exporter. +Set `insecure_skip_verify` to `true` for self-signed certificates. <.> Set the exporter to your exposed {sobservability} collector. Remember that the value can be distinct, depending on the deployment pattern. For production usage, we recommend using TLS communication. -<.> Replace `` with the cluster's name. endif::[] + Finally, run the installation command. @@ -585,50 +1010,6 @@ Finally, run the installation command. ---- + Verify the installation by checking the existence of a new deployment and service in the observability namespace. -. The GPU metrics scraper that we configure in the OTEL Collector requires custom RBAC rules. -Create a file named `otel-rbac.yaml` with the following content: -+ -[source,yaml] ----- -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - name: otel-scraper-cluster-role -rules: - - apiGroups: - - "" - resources: - - services - - pods - - nodes - - endpoints - verbs: - - get - - list - - watch - ---- -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRoleBinding -metadata: - name: otel-scraper-binding -roleRef: - apiGroup: rbac.authorization.k8s.io - kind: ClusterRole - name: otel-scraper-cluster-role -subjects: - - kind: ServiceAccount - name: opentelemetry-collector - namespace: <.> ----- -<.> Replace `` with the namespace where the {otelemetry} Collector is installed, for example, `observability`. -+ -Then apply the configuration by running the following command. -+ -[source,bash,subs="+attributes"] ----- -{prompt_user}kubectl apply -f otel-rbac.yaml ----- . *Install the {sobservability} Agent.* + --
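Taken together, the hunks above replace the hand-applied `otel-rbac.yaml` manifest (a standalone `ClusterRole` plus `ClusterRoleBinding`) with a `clusterRole` section in the collector's `otel-values.yaml`, so the Helm chart renders the RBAC objects itself. A minimal sketch of the resulting values section, as introduced by this change (surrounding keys such as `mode` and `extraEnvsFrom` are shown only for orientation and are assumptions about the final file layout):

```yaml
# Sketch of the RBAC-relevant portion of otel-values.yaml after this change.
# The chart creates a ClusterRole/ClusterRoleBinding for the collector's
# service account, replacing the previously kubectl-applied otel-rbac.yaml.
mode: deployment
clusterRole:
  create: true
  rules:
    - apiGroups: [""]
      resources: ["pods", "nodes", "endpoints", "services"]
      verbs: ["get", "list", "watch"]
extraEnvsFrom:
  - secretRef:
      name: open-telemetry-collector
```

With this in place, upgrading an existing installation (`helm upgrade --install ... -f otel-values.yaml`) is sufficient for the Prometheus receiver to discover scrape targets such as the DCGM exporter in the `gpu-operator` namespace; no separately applied Role/RoleBinding is needed, which is why the docs' troubleshooting and GPU-monitoring sections now point at the `clusterRole` section instead of `otel-rbac.yaml`.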