RunbookAI - AI-Powered Incident Response for Modern SRE Teams

How It Works

Hypothesis-Driven Investigation

Unlike traditional monitoring that alerts on symptoms, RunbookAI thinks like your best engineer - forming hypotheses, gathering evidence, and systematically narrowing down to the root cause.

Gather Context

Pulls incident details from PagerDuty/OpsGenie, queries recent deployments, and searches your knowledge base for relevant runbooks and past incidents.

Form Hypotheses

Creates ranked hypotheses based on symptoms, organizational knowledge, and infrastructure patterns. Each hypothesis gets a probability score.

Test & Branch

Runs targeted queries against CloudWatch, Kubernetes, and your infrastructure. Branches deeper on strong evidence and prunes dead ends quickly.

Root Cause

Identifies root cause with a confidence level, suggests prioritized remediation steps, and can auto-execute approved runbooks with safety gates.
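
The four steps above can be pictured as a small scoring loop. The sketch below is only an illustration of the idea, not RunbookAI's actual implementation: the Hypothesis class, the thresholds, and the evidence format are all assumptions made for this example. It ranks hypotheses by probability, revises each score with Bayes' rule when evidence arrives, prunes weak candidates, and stops once one is confident enough.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A candidate root cause with a probability score (hypothetical shape)."""
    description: str
    probability: float


def update(prior: float, p_if_true: float, p_if_false: float) -> float:
    """Bayes' rule: revise a score after observing one piece of evidence."""
    numerator = p_if_true * prior
    denominator = numerator + p_if_false * (1.0 - prior)
    return numerator / denominator if denominator else prior


def investigate(hypotheses, evidence, prune_below=0.05, confirm_above=0.9):
    """Test hypotheses in ranked order; prune weak ones, stop at a confident one.

    evidence maps a hypothesis description to a pair:
    (P(observation | hypothesis true), P(observation | hypothesis false)).
    Returns (confirmed hypothesis or None, hypotheses still worth exploring).
    """
    surviving = []
    ranked = sorted(hypotheses, key=lambda h: h.probability, reverse=True)
    for h in ranked:
        if h.description in evidence:
            p_true, p_false = evidence[h.description]
            h.probability = update(h.probability, p_true, p_false)
        if h.probability >= confirm_above:
            return h, surviving          # confident enough to call root cause
        if h.probability >= prune_below:
            surviving.append(h)          # keep as a branch to explore further
        # anything below prune_below is dropped as a dead end
    return None, surviving
```

In this sketch the surviving list is what a deeper "branch" step would pick up next; the thresholds are arbitrary and would need tuning in any real system.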

Features

Built for Production SRE

Everything you need for real incident response - from investigation to remediation, with safety built in.

Knowledge Integration

Indexes your runbooks, postmortems, and architecture docs from Confluence, Google Drive, or local files. Learns from your organization's tribal knowledge and past incidents.

runbook knowledge sync
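
To make "indexing" concrete, here is a minimal sketch of a keyword index over documents. It is an assumption-laden illustration, not how RunbookAI indexes or searches: the function names, the in-memory dictionary of documents, and the simple term-overlap ranking are all invented for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for name in index.get(term, ()):
            scores[name] += 1
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical documents standing in for synced runbooks and postmortems.
docs = {
    "redis-failover.md": "Redis primary failover procedure",
    "db-latency.md": "Postgres latency and connection pool saturation",
    "2023-redis-outage.md": "Postmortem: redis connection pool exhausted",
}
index = build_index(docs)
results = search(index, "redis connection pool")
```

A production knowledge base would more likely use embeddings or a proper search engine; term overlap is used here only because it fits in a few lines.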

Safety First

Every mutation requires explicit approval and is paired with a rollback command. Every action, hypothesis, and decision is recorded in a full audit trail.
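
One way to picture an approval gate of this kind is sketched below. This is a hedged illustration rather than RunbookAI's code: the Action class, the approve and execute callbacks, and the audit entry fields are hypothetical. Read-only actions run directly, mutations are refused without a rollback command and wait on explicit approval, and every outcome is appended to an audit log.

```python
import time
from dataclasses import dataclass


@dataclass
class Action:
    """A command the agent wants to run (hypothetical shape)."""
    command: str
    rollback: str = ""
    mutates: bool = False


def run_with_gate(action, approve, execute, audit_log):
    """Run an action; mutations need a rollback command and explicit approval.

    approve(action) -> bool and execute(command) -> result are supplied by
    the caller, so this sketch never touches real infrastructure itself.
    """
    entry = {
        "ts": time.time(),
        "command": action.command,
        "rollback": action.rollback,
    }
    if action.mutates:
        if not action.rollback:
            entry["outcome"] = "rejected: no rollback"
            audit_log.append(entry)
            raise ValueError("mutation has no rollback command")
        if not approve(action):
            entry["outcome"] = "denied"
            audit_log.append(entry)
            return None
    result = execute(action.command)
    entry["outcome"] = "executed"
    audit_log.append(entry)
    return result
```

For example, a restart could be submitted as Action("kubectl rollout restart deploy/api", rollback="kubectl rollout undo deploy/api", mutates=True); it runs only if the approve callback returns True, and a denial still leaves an audit entry.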

Cloud Native

First-class AWS and Kubernetes support. Query EC2, RDS, CloudWatch, ECS, EKS, pods, deployments, and more.

Incident Providers

PagerDuty and OpsGenie integration out of the box. Pull incident context, update status, and post findings automatically.

Dynamic Skills

Built-in and custom skills executed step-by-step with approval hooks. Extend with your own workflows and automations.

Slack Gateway

Mention @runbookAI in your alert channels. Socket Mode for local development, Events API for production.

Claude Code Integration

Deep integration with Claude Code for contextual knowledge during coding sessions. Auto-inject relevant runbooks, known issues, and postmortems based on what you're discussing. MCP server for on-demand queries.

runbook integrations claude enable

Learning Loop

Automatically generates runbook updates and postmortems from investigations. Your knowledge base grows with every incident.

Trust & Transparency

See exactly what the agent is thinking. Evidence trails, confidence scores, and clear reasoning for every hypothesis.

Session Checkpoints

Save and resume investigation state across sessions. Never lose context when switching between incidents.
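
Saving and resuming state can be as simple as serializing the investigation to disk. The sketch below assumes a JSON file per incident and a plain dictionary for state; both are illustrative choices, not RunbookAI's actual checkpoint format. It writes to a temporary file first and then renames it, so an interrupted save does not leave a half-written checkpoint behind.

```python
import json
from pathlib import Path


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the target path."""
    path = Path(path)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)  # replaces any existing checkpoint in one step


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    path = Path(path)
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

A caller might save after each tested hypothesis, for instance save_checkpoint("inc-123.json", {"incident": "inc-123", "hypotheses": [...]}), and call load_checkpoint with the same path when returning to the incident. The file name and state keys here are made up for the example.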

Use Cases

What Can RunbookAI Do?

From investigating production incidents to answering infrastructure questions - automate the repetitive parts of on-call.

PagerDuty Alert

Investigate Production Incidents

Agent pulls incident context, searches for similar past issues, forms hypotheses, and gathers evidence from your infrastructure automatically.

Root cause identified with confidence score and remediation steps

Natural Language Query

Ask About Infrastructure

"What EC2 instances are running in prod?" "Show me pods with restart loops" "Who owns the payments service?"

Instant answers without digging through consoles

Skill Execution

Execute Runbooks Automatically

Skills are step-by-step workflows loaded from your runbooks. Agent executes them with approval gates for any mutations.

Consistent remediation following your documented procedures

Kubernetes Query

Cluster Health Checks

Get cluster status, list deployments, check node health, view recent events, find resource hogs - all with natural language.

Read-only queries safe to run anytime
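
A read-only guarantee implies some way of telling queries from mutations. The sketch below shows one conservative approach under stated assumptions: the allow-list of kubectl verbs and the is_read_only helper are hypothetical and not taken from RunbookAI. Anything the check does not positively recognize is treated as not read-only.

```python
import shlex

# Assumed allow-list of kubectl verbs that only read cluster state.
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}


def is_read_only(command):
    """Return True only when the command is a recognized read-only kubectl call.

    The check fails closed: unknown tools, unknown verbs, and commands whose
    verb cannot be located with confidence are all reported as not read-only.
    """
    tokens = shlex.split(command)
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    for token in tokens[1:]:
        if token.startswith("-"):
            continue  # skip flags such as --context=prod
        # The first non-flag token is taken to be the verb. A flag with a
        # separate value (e.g. "-n prod") makes the value look like the verb,
        # which is rejected; that is deliberately conservative.
        return token in READ_ONLY_VERBS
    return False
```

With this allow-list, "kubectl get pods -n prod" passes, while "kubectl delete pod api-0" and any non-kubectl command would be routed to an approval step instead.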

Integrations

Connect Your Stack

Out-of-the-box integrations with your existing infrastructure, incident management, and knowledge systems.

Infrastructure

AWS: EC2, RDS, CloudWatch, ECS, EKS, Lambda

Kubernetes: Pods, deployments, nodes, events, metrics

CloudWatch: Metrics, logs, alarms, insights

Incident Management

PagerDuty: Incidents, alerts, escalations

OpsGenie: Alerts, on-call schedules

Slack: Mentions, alerts, notifications

Knowledge Sources

Confluence: Pages, runbooks, postmortems

Google Drive: Docs, sheets, folders

Filesystem: Local markdown, watch mode

Your AI SRE, Always On Call