Open Source | Claude Code Integration | Production Ready

Your AI SRE, Always On Call

Automate incident investigation with hypothesis-driven AI. RunbookAI diagnoses issues, executes runbooks, and suggests remediations, so your team can focus on shipping instead of firefighting.

24/7 Coverage
100% Audit Trail
Install with one command:
curl -fsSL https://userunbook.ai/install.sh | bash
$ runbook investigate PD-12345
Gathering incident context from PagerDuty...
Searching knowledge base for similar incidents...
Found 3 relevant runbooks
Forming hypotheses:
H1: Database connection pool exhaustion (0.72)
H2: Recent deployment memory leak (0.54)
H3: Upstream API timeout cascade (0.31)
Testing H1: Querying CloudWatch metrics...
Evidence: STRONG - Connection count 340% above baseline
Branching deeper on H1...
Evidence: STRONG - Traffic spike at 14:32 UTC
ROOT CAUSE IDENTIFIED
Database connection exhaustion due to unexpected traffic spike
Confidence: HIGH (0.94)
Suggested remediation:
1. Scale RDS read replicas (requires approval)
2. Increase connection pool limit to 200
3. Enable connection queuing in pgbouncer


Hypothesis-Driven Investigation

Unlike traditional monitoring that alerts on symptoms, RunbookAI thinks like your best engineer - forming hypotheses, gathering evidence, and systematically narrowing down to root cause.

1. Gather Context

Pulls incident details from PagerDuty/OpsGenie, queries recent deployments, and searches your knowledge base for relevant runbooks and past incidents.

2. Form Hypotheses

Creates ranked hypotheses based on symptoms, organizational knowledge, and infrastructure patterns. Each hypothesis gets a probability score.
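As a rough illustration of what a ranked hypothesis list could look like in code, here is a minimal Python sketch. The `Hypothesis` class and `rank` function are hypothetical names chosen for this example and are not RunbookAI's actual data model; the scores are the ones shown in the sample investigation output on this page.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    label: str          # short identifier such as "H1"
    description: str    # what the hypothesis claims
    probability: float  # score between 0.0 and 1.0


def rank(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Return hypotheses ordered from most to least probable."""
    return sorted(hypotheses, key=lambda h: h.probability, reverse=True)


# Deliberately unsorted input, using the scores from the sample output.
candidates = [
    Hypothesis("H3", "Upstream API timeout cascade", 0.31),
    Hypothesis("H1", "Database connection pool exhaustion", 0.72),
    Hypothesis("H2", "Recent deployment memory leak", 0.54),
]

ranked = rank(candidates)
for h in ranked:
    print(f"{h.label}: {h.description} ({h.probability:.2f})")
```

Ranking by score means the most likely explanation is tested first, which is the order the sample investigation follows.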

3. Test & Branch

Runs targeted queries against CloudWatch, Kubernetes, and your infrastructure. Branches deeper on strong evidence, prunes dead ends quickly.
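The branch-and-prune idea can be sketched as a small recursive search. This is an illustrative Python example under stated assumptions, not RunbookAI's implementation: `Node`, `investigate`, the 0.7 threshold, and the lookup table of evidence scores are all hypothetical, and the table stands in for real queries against CloudWatch or Kubernetes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """A hypothesis plus the narrower hypotheses to try if it holds up."""
    description: str
    children: list["Node"] = field(default_factory=list)


def investigate(
    node: Node,
    evidence_for: Callable[[str], float],
    strong: float = 0.7,  # assumed cutoff for "strong" evidence
    path: tuple[str, ...] = (),
) -> list[tuple[str, ...]]:
    """Return the deepest chains of hypotheses backed by strong evidence."""
    if evidence_for(node.description) < strong:
        return []  # weak evidence: prune this branch and everything under it
    path = path + (node.description,)
    deeper: list[tuple[str, ...]] = []
    for child in node.children:
        deeper.extend(investigate(child, evidence_for, strong, path))
    # If no child is confirmed, this node is the deepest supported explanation.
    return deeper or [path]


# Hypothetical evidence scores standing in for real infrastructure queries.
scores = {
    "Database connection pool exhaustion": 0.9,
    "Traffic spike": 0.8,
    "Connection leak in application": 0.2,
    "Recent deployment memory leak": 0.1,
}
roots = [
    Node("Database connection pool exhaustion",
         [Node("Traffic spike"), Node("Connection leak in application")]),
    Node("Recent deployment memory leak"),
]

findings: list[tuple[str, ...]] = []
for root in roots:
    findings.extend(investigate(root, lambda d: scores.get(d, 0.0)))
print(findings)
```

In this sketch a weak score stops the search immediately, so no queries are spent on the children of a hypothesis that has already been ruled out.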

4. Root Cause

Identifies root cause with a confidence level, suggests prioritized remediation steps, and can auto-execute approved runbooks with safety gates.

Built for Production SRE

Everything you need for real incident response - from investigation to remediation, with safety built in.

Knowledge Integration

Indexes your runbooks, post-mortems, and architecture docs from Confluence, Google Drive, or local files. Learns from your organization's tribal knowledge and past incidents.

runbook knowledge sync

Safety First

All mutations require explicit approval with rollback commands. Full audit trail of every action, hypothesis, and decision.
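One way to picture an approval gate with a rollback command and an audit trail is the Python sketch below. It is a simplified illustration, not RunbookAI's code: `Mutation`, `run_with_approval`, and the callback signatures are hypothetical, and the command strings are placeholders rather than real commands.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Mutation:
    description: str
    command: str    # what would be run if approved
    rollback: str   # what would undo the change


def run_with_approval(
    mutation: Mutation,
    approve: Callable[[Mutation], bool],
    execute: Callable[[str], None],
    audit_log: list[dict],
) -> bool:
    """Record the decision, then execute only if the approver said yes."""
    if not mutation.rollback:
        raise ValueError("refusing a mutation that has no rollback command")
    approved = approve(mutation)
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "approved": approved,
        **asdict(mutation),
    })
    if approved:
        execute(mutation.command)
    return approved


executed: list[str] = []
audit: list[dict] = []
change = Mutation(
    description="Increase connection pool limit to 200",
    command="<command that applies the change>",          # placeholder
    rollback="<command that restores the previous limit>",  # placeholder
)

# A denied request is logged but never executed.
run_with_approval(change, lambda m: False, executed.append, audit)
# An approved request is logged and then executed.
run_with_approval(change, lambda m: True, executed.append, audit)
print(len(audit), executed)
```

Writing the audit entry before executing means a denied or failed action still leaves a record.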

Cloud Native

First-class AWS and Kubernetes support. Query EC2, RDS, CloudWatch, ECS, EKS, pods, deployments, and more.

Incident Providers

PagerDuty and OpsGenie integration out of the box. Pull incident context, update status, and post findings automatically.

Dynamic Skills

Built-in and custom skills executed step-by-step with approval hooks. Extend with your own workflows and automations.

Slack Gateway

Mention @runbookAI in your alert channels. Use Socket Mode for local development and the Events API for production.

Claude Code Integration

Deep integration with Claude Code for contextual knowledge during coding sessions. Relevant runbooks, known issues, and postmortems are injected automatically based on what you're discussing, and an MCP server handles on-demand queries.

runbook integrations claude enable

Learning Loop

Automatically generates runbook updates and postmortems from investigations. Your knowledge base grows with every incident.

Trust & Transparency

See exactly what the agent is thinking. Evidence trails, confidence scores, and clear reasoning for every hypothesis.

Session Checkpoints

Save and resume investigation state across sessions. Never lose context when switching between incidents.
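Saving and resuming investigation state can be as simple as serializing it to disk. The following Python sketch assumes a JSON file per incident; the file layout, field names, and the `save_checkpoint` and `load_checkpoint` helpers are hypothetical and do not describe RunbookAI's actual checkpoint format.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temporary file first, then rename it over the target."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)  # avoids leaving a half-written checkpoint behind


def load_checkpoint(path: Path) -> dict:
    """Read a previously saved investigation state."""
    return json.loads(path.read_text())


# Example state, using values from the sample investigation on this page.
state = {
    "incident": "PD-12345",
    "hypotheses": [
        {"label": "H1", "probability": 0.72, "status": "testing"},
        {"label": "H2", "probability": 0.54, "status": "pending"},
    ],
    "evidence": ["Connection count 340% above baseline"],
}

with tempfile.TemporaryDirectory() as directory:
    checkpoint = Path(directory) / "PD-12345.json"
    save_checkpoint(checkpoint, state)
    restored = load_checkpoint(checkpoint)

print(restored["incident"], len(restored["hypotheses"]))
```

Keeping one file per incident is one simple way to switch between investigations without mixing their state.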

What Can RunbookAI Do?

From investigating production incidents to answering infrastructure questions - automate the repetitive parts of on-call.

PagerDuty Alert

Investigate Production Incidents

Agent pulls incident context, searches for similar past issues, forms hypotheses, and gathers evidence from your infrastructure automatically.

Root cause identified with confidence score and remediation steps

Natural Language Query

Ask About Infrastructure

"What EC2 instances are running in prod?" "Show me pods with restart loops" "Who owns the payments service?" Get instant answers.

Instant answers without digging through consoles

Skill Execution

Execute Runbooks Automatically

Skills are step-by-step workflows loaded from your runbooks. Agent executes them with approval gates for any mutations.

Consistent remediation following your documented procedures

Kubernetes Query

Cluster Health Checks

Get cluster status, list deployments, check node health, view recent events, find resource hogs - all with natural language.

Read-only queries safe to run anytime

Connect Your Stack

Out-of-the-box integrations with your existing infrastructure, incident management, and knowledge systems.

Infrastructure

AWS: EC2, RDS, CloudWatch, ECS, EKS, Lambda
Kubernetes: Pods, deployments, nodes, events, metrics
CloudWatch: Metrics, logs, alarms, insights

Incident Management

PagerDuty: Incidents, alerts, escalations
OpsGenie: Alerts, on-call schedules
Slack: Mentions, alerts, notifications

Knowledge Sources

Confluence: Pages, runbooks, postmortems
Google Drive: Docs, sheets, folders
Filesystem: Local markdown, watch mode

Up and Running in Minutes

Get started with RunbookAI in just a few commands. No complex setup required.

1. Install
$ git clone https://github.com/Runbook-Agent/RunbookAI.git
$ cd RunbookAI && bun install
2. Configure
$ mkdir -p .runbook
$ cp examples/config.yaml .runbook/config.yaml
$ export ANTHROPIC_API_KEY=your-api-key
3. Run
$ bun run dev ask "What EC2 instances are running?"

Ready to Automate Incident Response?

Join SRE teams using AI to investigate incidents faster and more consistently. Open source, self-hosted, no vendor lock-in.