Merged
Changes from 6 commits
9 changes: 5 additions & 4 deletions concepts/AI-monitoring-user-apps.adoc
@@ -66,11 +66,12 @@ Number of requests.
*Unit:* integer

gen_ai.usage.cost::
The distribution of GenAI request costs.
The distribution of GenAI request costs.
This is a non-standard metric, available only when explicitly set by the application or when the application is instrumented with {openlit}.
+
*Type:* histogram
*Type:* histogram
+
*Unit:* USD
*Unit:* USD

gen_ai.usage.input_tokens::
Number of prompt tokens processed.
@@ -107,7 +108,7 @@ No metrics received from any components. ::

No metrics received from the GPU. ::
+
* Verify if the RBAC rules were applied.
* Verify that the `clusterRole` configuration is included in the `otel-values.yaml` file and that the collector has been installed or upgraded with it.
* Verify if the metrics receiver scraper is configured.
* Check the {nvidia} DCGM Exporter for errors.
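
The RBAC-related checks in the list above can be approximated from the command line. The following is a minimal sketch, not a verified procedure. It assumes that the collector runs under a service account named `opentelemetry-collector` in the `suse-observability` namespace, that the DCGM Exporter is deployed in the `gpu-operator` namespace, and that the collector needs `list` and `watch` on services and endpoints. The helper names `sa_subject` and `check_gpu_rbac` are hypothetical. Adjust all values to match your deployment.

```shell
#!/bin/sh
# Assumed values; override them through the environment if yours differ.
SA_NAMESPACE="${SA_NAMESPACE:-suse-observability}"   # assumption
SA_NAME="${SA_NAME:-opentelemetry-collector}"        # assumption
GPU_NAMESPACE="${GPU_NAMESPACE:-gpu-operator}"       # assumption

# Build the user name that Kubernetes assigns to a service account.
sa_subject() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Ask the API server whether the collector's service account may
# list and watch services and endpoints in the GPU namespace.
check_gpu_rbac() {
  subject="$(sa_subject "$SA_NAMESPACE" "$SA_NAME")"
  for verb in list watch; do
    for resource in services endpoints; do
      printf '%s %s: ' "$verb" "$resource"
      kubectl auth can-i "$verb" "$resource" \
        --as="$subject" -n "$GPU_NAMESPACE"
    done
  done
}
```

After sourcing the snippet, run `check_gpu_rbac`. Each line is expected to end in `yes`. A `no` may indicate that the `clusterRole` section was not applied when the collector was installed or upgraded. The account running the command needs permission to impersonate service accounts.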

51 changes: 5 additions & 46 deletions tasks/AI-monitoring-gpu.adoc
@@ -7,56 +7,15 @@ To effectively monitor the performance and utilization of your GPUs, configure t

[#ai-monitoring-gpu-metrics]
.Collect GPU metrics (recommended)
. *Grant permissions (RBAC).* The {otelemetry} Collector requires specific permissions to discover the GPU metrics endpoint within the cluster.
+
Create a file named `otel-rbac.yaml`
with the following content.
It defines a `Role` with permissions to list and watch services and endpoints, and a `RoleBinding` that grants these permissions to the {otelemetry} Collector's service account.
. *Verify RBAC permissions.* The {otelemetry} Collector requires specific permissions to discover the GPU metrics endpoint within the cluster.
These permissions are automatically configured when you install the collector with the `clusterRole` section in the `otel-values.yaml` file (see xref:observability-settingup-ai.adoc[]).
+
----
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: suse-observability-otel-scraper
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
    verbs:
      - list
      - watch

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: suse-observability-otel-scraper
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: suse-observability-otel-scraper
subjects:
  - kind: ServiceAccount
    name: OPENTELEMETRY-COLLECTOR
    namespace: OBSERVABILITY
---
----
+
[IMPORTANT]
[NOTE]
====
Verify that the `ServiceAccount` name and namespace in the `RoleBinding` match your {otelemetry} Collector's deployment.
If you installed the {otelemetry} Collector without the `clusterRole` configuration, you must upgrade the collector with the updated `otel-values.yaml` that includes the `clusterRole` section.
====
+
. Apply this configuration to the `gpu-operator` namespace.
+
[source,bash]
----
> kubectl apply -n gpu-operator -f otel-rbac.yaml
----
. *Configure the {otelemetry} Collector.* Add the following Prometheus receiver configuration to your {otelemetry} Collector's values file. This tells the collector to scrape metrics from any endpoint in the `gpu-operator` namespace every 10 seconds.
. *Configure the {otelemetry} Collector.* Add the following Prometheus receiver configuration to your {otelemetry} Collector's values file. This tells the collector to scrape metrics from any endpoint in the `gpu-operator` namespace every 10 seconds.
+
[source,yaml]
----
6 changes: 3 additions & 3 deletions tasks/AI-monitoring-owui.adoc
@@ -19,13 +19,13 @@ pipelines:
storageClass: longhorn <.>
extraEnvVars: <.>
- name: PIPELINES_URLS <.>
value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/suse_ai_filter.py"
value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/tags/v2.0.0/integrations/oi-filter/suse_ai_filter.py"
- name: OTEL_SERVICE_NAME <.>
value: "Open WebUI"
- name: OTEL_EXPORTER_HTTP_OTLP_ENDPOINT <.>
value: "http://opentelemetry-collector.suse-observability.svc.cluster.local:4318"
- name: PRICING_JSON <.>
value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/pricing.json"
value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/tags/v2.0.0/integrations/oi-filter/pricing.json"
extraEnvVars:
- name: OPENAI_API_KEY <.>
value: "0p3n-w3bu!"
@@ -102,7 +102,7 @@ include::../snippets/openwebui-requirement-admin-privileges.adoc[]

. In the bottom left of the {owui} window, click your avatar icon to open the user menu and select menu:Admin Panel[].
. Click the menu:Settings[] tab and select menu:Pipelines[] from the left menu.
. In the menu:Install from Github URL[] section, enter `https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/suse_ai_filter.py` and click the upload button on the right to upload the pipeline from the URL.
. In the menu:Install from Github URL[] section, enter `https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/tags/v2.0.0/integrations/oi-filter/suse_ai_filter.py` and click the upload button on the right to upload the pipeline from the URL.
. After the upload is finished, you can review the configuration of the pipeline. Confirm with menu:Save[].
+
[#fig-ai-monitoring-owui-pipelines-webui]
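
The changes to `tasks/AI-monitoring-owui.adoc` replace `refs/heads/main` with `refs/tags/v2.0.0` in the pipeline and pricing URLs. A branch reference follows whatever is currently on that branch, whereas a tag normally identifies a fixed release. As a rough illustration, the sketch below classifies a raw GitHub URL by the kind of ref it points to. The helper name `ref_kind` is hypothetical and is not part of {owui} or the extension. The two URLs are copied from the diff.

```shell
#!/bin/sh
# Classify a raw GitHub URL by the Git ref it points to.
# Prints "tag", "branch", or "unknown".
ref_kind() {
  case "$1" in
    */refs/tags/*)  echo "tag" ;;
    */refs/heads/*) echo "branch" ;;
    *)              echo "unknown" ;;
  esac
}

# URLs as they appear before and after the change.
OLD_URL="https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/suse_ai_filter.py"
NEW_URL="https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/tags/v2.0.0/integrations/oi-filter/suse_ai_filter.py"

ref_kind "$OLD_URL"   # prints "branch"
ref_kind "$NEW_URL"   # prints "tag"
```

A check like this could be run against the `PIPELINES_URLS` and `PRICING_JSON` values before deployment to catch URLs that still point to a branch.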