Azure SRE Agent

Deploy an Azure SRE Agent with Terraform. The AI-powered agent automates site reliability engineering tasks: incident triage, diagnostics, remediation, and scheduled operational workflows.

Quick Start

Prerequisites

| Tool | Version | Install |
|---|---|---|
| Terraform | >= 1.5.0 | Install |
| Azure CLI | >= 2.60 | Install |
| yq | any | brew install yq |

You also need:
  • An active Azure subscription
  • Owner or User Access Administrator role on the subscription
  • The Microsoft.App resource provider registered (the script handles this automatically)
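
The deployment script is described as registering the provider for you. For readers who want to check it by hand first, here is a minimal sketch. The function name `ensure_provider_registered` is hypothetical and is not taken from the repository's scripts; it relies only on the standard `az provider show` and `az provider register` commands.

```shell
# Hypothetical helper (not from the repo's scripts): make sure a resource
# provider namespace is registered in the current subscription.
ensure_provider_registered() {
  local namespace="$1"
  local state
  # Ask ARM for the registration state, e.g. "Registered" or "NotRegistered".
  state=$(az provider show --namespace "$namespace" \
            --query registrationState --output tsv) || return 1
  if [ "$state" = "Registered" ]; then
    echo "$namespace is already registered"
  else
    echo "$namespace is $state; registering..."
    # --wait blocks until the registration has completed.
    az provider register --namespace "$namespace" --wait
  fi
}

# Usage (after `az login`):
#   ensure_provider_registered Microsoft.App
```

Registration is a subscription-level operation, so it needs the same elevated role as the rest of the deployment.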

Deploy

# Sign in to Azure
az login

# Review the plan
./scripts/manage-environment.sh plan

# Deploy all resources + agent configuration
./scripts/manage-environment.sh up

# Deploy agent configuration only (connectors, skills, subagents)
./scripts/manage-environment.sh configure

# Tear down everything
./scripts/manage-environment.sh down

After deployment, the script prints the agent portal URL. Open it to start using your SRE Agent at sre.azure.com.

What Gets Deployed

Infrastructure (Terraform)

Created in a single resource group:

| Resource | Purpose |
|---|---|
| Azure SRE Agent | The AI agent (Microsoft.App/agents) |
| Log Analytics Workspace | Backing store for Application Insights and KQL queries |
| Application Insights | Agent telemetry, action logging, performance monitoring |
| Smart Detection Alert Rule | Automatic failure anomaly detection |
| User-Assigned Managed Identity | Agent authentication to Azure (no secrets to manage) |
| RBAC Role Assignments | Reader, Log Analytics Reader, Monitoring Reader/Contributor on the resource group; Monitoring Contributor on the subscription; SRE Agent Administrator for the deployer |

Agent Configuration (REST API)

Deployed via configure-agent.sh after infrastructure is up:

| Component | Source | API | Description |
|---|---|---|---|
| Connectors | configuration/connectors/*.yaml | ARM (az rest) | Datadog MCP (19 tools), Coralogix MCP (17 tools) |
| Skills | configuration/skills/*.md | Data plane | Reusable skill definitions with tool lists and instructions |
| Subagents | configuration/agents/*.yaml | Data plane | Custom AI agents routed to skills via allowed_skills |

Connector Secrets

Connector YAML files use ${VAR} placeholders substituted from .env at deploy time:

cp .env.example .env   # copy the template
vim .env               # fill in your API keys
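
How `configure-agent.sh` performs the substitution is not shown here, so the following is only a sketch of one way `${VAR}` placeholders could be expanded. The function name `render_placeholders` is hypothetical. It reads a template on stdin and looks each name up in the exported environment through awk's `ENVIRON` array, failing if a variable is missing rather than writing an empty value.

```shell
# Hypothetical helper: expand ${VAR} placeholders on stdin from the exported
# environment. Illustrative only; the repo's script may work differently.
render_placeholders() {
  awk '
    {
      out = ""
      rest = $0
      # Consume the line left to right, one placeholder at a time, so that
      # substituted values are never re-scanned for placeholders.
      while (match(rest, /\$\{[A-Za-z_][A-Za-z0-9_]*\}/)) {
        name = substr(rest, RSTART + 2, RLENGTH - 3)
        if (!(name in ENVIRON)) {
          printf "error: ${%s} is not set\n", name > "/dev/stderr"
          exit 1
        }
        out = out substr(rest, 1, RSTART - 1) ENVIRON[name]
        rest = substr(rest, RSTART + RLENGTH)
      }
      print out rest
    }'
}

# Usage: export everything from .env, then render one connector file.
#   set -a; . ./.env; set +a
#   render_placeholders < configuration/connectors/<name>.yaml
```

`set -a` marks every variable assigned while sourcing `.env` as exported, which is what makes the values visible to awk.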

Required variables:

| Variable | Description |
|---|---|
| DD_MCP_URL | Datadog MCP server endpoint |
| DD_API_KEY | Datadog API key |
| DD_APPLICATION_KEY | Datadog application key |
| CX_MCP_URL | Coralogix MCP server endpoint |
| CX_API_KEY | Coralogix API key |

.env is git-ignored. Never commit secrets.
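
A pre-flight check can catch a half-filled `.env` before anything is sent to Azure. The sketch below uses the five variable names listed above; the function name `missing_connector_vars` is hypothetical and not part of the repository.

```shell
# Hypothetical pre-flight check: report every required connector variable
# that is unset or empty, and return non-zero if any is missing.
missing_connector_vars() {
  local name value missing=0
  for name in DD_MCP_URL DD_API_KEY DD_APPLICATION_KEY CX_MCP_URL CX_API_KEY; do
    # eval is safe here because the names are fixed literals, not user input.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   set -a; . ./.env; set +a
#   missing_connector_vars || exit 1
```

Only names are printed, never values, so the output is safe to keep in logs.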

Configuration

Copy the example tfvars file and customise:

cp infrastructure/terraform.tfvars.example infrastructure/terraform.tfvars

agent_name          = "sre-agent-dev"       # Name of the SRE Agent
resource_group_name = "rg-sre-agent-dev"    # Resource group name
location            = "swedencentral"       # Azure region (eastus2 | swedencentral | australiaeast)
access_level        = "High"                # High = Reader + Contributor, Low = Reader only
agent_mode          = "Review"              # Review | Autonomous | ReadOnly
log_retention_days  = 30                    # Log Analytics retention (30-730)

terraform.tfvars is git-ignored. Commit changes via terraform.tfvars.example instead.
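
The allowed values in the comments above can also be checked in shell before Terraform runs. This is a sketch with hypothetical function names; the authoritative validation lives in `variables.tf`, which is not shown here, and may differ.

```shell
# Hypothetical checks mirroring the allowed values documented above.
valid_agent_mode() {
  case "$1" in
    Review|Autonomous|ReadOnly) return 0 ;;
    *) return 1 ;;
  esac
}

valid_access_level() {
  case "$1" in
    High|Low) return 0 ;;
    *) return 1 ;;
  esac
}

valid_log_retention_days() {
  # Reject empty or non-numeric input before the numeric range test.
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge 30 ] && [ "$1" -le 730 ]
}

# Usage:
#   valid_agent_mode "Review" || echo "unsupported agent_mode" >&2
```

The comparisons are case-sensitive, matching how the values are written in the example file.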

Supported Regions

Azure SRE Agent is available in three regions only:

| Region | Code |
|---|---|
| East US 2 | eastus2 |
| Sweden Central | swedencentral |
| Australia East | australiaeast |
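
Because only these three codes are accepted, a wrong `location` value is cheap to catch before running a plan. A minimal sketch, with a hypothetical function name:

```shell
# Hypothetical guard: succeed only for the three supported region codes.
is_supported_region() {
  case "$1" in
    eastus2|swedencentral|australiaeast) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   is_supported_region "swedencentral" || echo "region not supported" >&2
```

The function expects the region code, not the display name, so "Sweden Central" is rejected.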

Project Structure

sre-agent/
├── README.md                                 # This file
├── AGENT.md                                  # Internal project tracker, architecture decisions, changelog
├── .gitignore
├── infrastructure/                           # Stage 1: Terraform
│   ├── providers.tf                          # azurerm ~>4.0, azapi ~>2.0
│   ├── variables.tf                          # Input variables with validation
│   ├── main.tf                               # Root module calling child modules
│   ├── outputs.tf                            # Outputs: agent ID, portal URL, endpoint, UAMI, LAW
│   ├── terraform.tfvars.example              # Template for terraform.tfvars (git-ignored)
│   └── modules/
│       ├── resource-group/                   # azurerm_resource_group
│       ├── monitoring/                       # LAW + App Insights + Smart Detection alert
│       ├── identity/                         # User-Assigned Managed Identity
│       ├── sre-agent/                        # azapi_resource (Microsoft.App/agents)
│       └── role-assignments/                 # RBAC roles for the UAMI and deployer
├── configuration/                            # Stage 2: YAML/MD → REST API
│   ├── agents/                               # Subagent .yaml (api_version, kind, spec)
│   ├── connectors/                           # Connector .yaml (ARM sub-resources)
│   └── skills/                               # Skill .md (frontmatter + skill content)
└── scripts/
    ├── manage-environment.sh                 # up / down / plan / configure
    └── configure-agent.sh                    # Deploys config via SRE Agent REST API

Scripts

manage-environment.sh

Primary script for managing the environment locally.

Usage: manage-environment.sh <command> [options]

Commands:
  up        Initialize, plan, apply Terraform, then deploy agent configuration
  down      Destroy all resources
  plan      Initialize and show Terraform plan
  configure Deploy agent configuration only (connectors, skills, subagents)

Options:
  -v, --var-file FILE   Use a custom .tfvars file (default: terraform.tfvars)
  -h, --help            Show help
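
The usage text above maps naturally onto a `case`-based parser. The sketch below shows how such parsing could look. `parse_args` and the globals `COMMAND` and `VAR_FILE` are hypothetical names, not taken from the actual script.

```shell
# Hypothetical sketch of the command-line parsing. Sets COMMAND and VAR_FILE.
parse_args() {
  COMMAND=""
  VAR_FILE="terraform.tfvars"   # documented default
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -v|--var-file)
        [ "$#" -ge 2 ] || { echo "error: $1 needs a FILE" >&2; return 2; }
        VAR_FILE="$2"
        shift 2
        ;;
      -h|--help)
        COMMAND="help"
        shift
        ;;
      up|down|plan|configure)
        COMMAND="$1"
        shift
        ;;
      *)
        echo "error: unknown argument: $1" >&2
        return 2
        ;;
    esac
  done
  [ -n "$COMMAND" ] || { echo "error: no command given" >&2; return 2; }
}

# Usage:
#   parse_args "$@" || exit $?
#   echo "command=$COMMAND var_file=$VAR_FILE"
```

Options are accepted before or after the command, so `up -v prod.tfvars` and `-v prod.tfvars up` behave the same in this sketch.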

configure-agent.sh

Deploys agent configuration: connectors via ARM, skills and subagents via the data plane REST API. It is called automatically by manage-environment.sh up and can also be run standalone:

# Deploy all configuration
./scripts/configure-agent.sh

# With explicit endpoint
./scripts/configure-agent.sh --endpoint https://your-agent.azuresre.dev

# With custom config directory
./scripts/configure-agent.sh --config /path/to/config

Agent Configuration

All agent configuration lives in configuration/ as code. The script deploys it in dependency order: connectors → skills → subagents.
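
The ordering matters because subagents refer to skills by name and skills refer to connector tools. The sketch below shows one way to walk the directory in that order. `list_config_in_order` is a hypothetical name, and the real script may discover files differently.

```shell
# Hypothetical sketch: print config files in deployment order
# (connectors, then skills, then subagents).
list_config_in_order() {
  local config_dir="$1"
  local spec dir pattern file
  for spec in "connectors:*.yaml" "skills:*.md" "agents:*.yaml"; do
    dir=${spec%%:*}        # part before the colon: the sub-directory
    pattern=${spec#*:}     # part after the colon: the file glob
    for file in "$config_dir/$dir"/$pattern; do
      # An unmatched glob stays literal, so skip anything that does not exist.
      [ -e "$file" ] && printf '%s\n' "$file"
    done
  done
  return 0
}

# Usage:
#   list_config_in_order configuration | while IFS= read -r f; do
#     echo "would deploy $f"
#   done
```

Within each directory the shell expands the glob in sorted order, which keeps repeated runs deterministic.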

Subagents (configuration/agents/*.yaml)

YAML files following the AgentConfiguration schema. The system_prompt field contains the agent's instructions.

api_version: azuresre.ai/v1
kind: AgentConfiguration
spec:
  name: datadog-coralogix-agent
  system_prompt: >-
    You are an Observability Investigation Agent...
  tools: []
  handoffs: []
  handoff_description: ""
  agent_type: Review
  enable_skills: true
  allowed_skills:
    - sre-observability-investigation-skill

Skills (configuration/skills/*.md)

Markdown files with YAML frontmatter. The tools array lists MCP tools the skill can use. The body is the skill content.

---
name: sre-observability-investigation-skill
description: Investigates production issues using Coralogix and Datadog.
tools:
  - datadog-mcp_analyze_datadog_logs
  - coralogix-mcp_get_logs
---

# Skill Behavior
...
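
To illustrate the file layout, the frontmatter can be separated from the body with standard tools. The helpers below are a sketch with hypothetical names. They use awk and sed rather than a YAML parser, so they handle only flat `key: value` lines and simple `- item` lists like the example above.

```shell
# Hypothetical helpers for reading a skill file's frontmatter.

# Print the lines between the opening and closing '---' markers.
frontmatter() {
  awk 'NR == 1 && $0 == "---" { inside = 1; next }
       inside && $0 == "---"  { exit }
       inside                 { print }' "$1"
}

# Print the value of the "name:" key.
skill_name() {
  frontmatter "$1" | sed -n 's/^name:[[:space:]]*//p' | head -n 1
}

# Print one tool per line from the "tools:" list.
skill_tools() {
  frontmatter "$1" | awk '
    /^tools:/ { in_tools = 1; next }
    in_tools && /^[[:space:]]*-[[:space:]]*/ {
      sub(/^[[:space:]]*-[[:space:]]*/, "")
      print
      next
    }
    in_tools { exit }'
}

# Usage:
#   skill_name  configuration/skills/<skill>.md
#   skill_tools configuration/skills/<skill>.md
```

For anything beyond this simple shape, a real YAML parser such as yq is the safer choice.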

Technical Notes

  • No native azurerm resource exists for Azure SRE Agent. The agent is created via the azapi_resource provider using the ARM type Microsoft.App/agents@2025-05-01-preview. The schema is sourced from official Microsoft Bicep samples.
  • Schema validation is disabled on the azapi_resource because the provider doesn't include this preview API schema.
  • Agent configuration uses two APIs. Connectors are ARM sub-resources deployed via az rest. Subagents and skills use the SRE Agent data plane REST API. This follows Microsoft's own pattern (infra via IaC, config via post-provision script).
  • Both connectors are MCP type. Datadog uses partnerType: DatadogMcp (appears under Telemetry in portal). Coralogix uses a generic MCP connector with bearer token auth.
  • Terraform state is local by default.
  • Model provider defaults to Azure OpenAI. Anthropic (Claude) is an alternative but is excluded from EU Data Boundary commitments.

Roadmap

| Stage | Description | Status |
|---|---|---|
| 1. Infrastructure | Terraform: SRE Agent + RG, LAW, App Insights, UAMI, RBAC | Done |
| 2. Agent Configuration | Connectors, skills, subagents | Done |
