Documentation

Complete guide to installing, configuring, and using RunbookAI for AI-powered incident response.

Installation

Install RunbookAI with a single command:

Terminal

curl -fsSL https://userunbook.ai/install.sh | bash

After installation, restart your shell:

Terminal

exec $SHELL -l

What the install script does

The script installs Bun (if needed), clones the repo to ~/.runbook, builds the CLI, and adds it to your PATH.

Manual Installation

Alternatively, you can install manually:

Terminal

# Clone the repository
git clone https://github.com/Runbook-Agent/RunbookAI.git
cd RunbookAI

# Install dependencies with Bun (recommended)
bun install

# Build the CLI
bun run build

Requirements

Bun 1.0+ or Node.js 20+
Anthropic API key (for the Claude LLM)
AWS credentials (optional, for AWS integration)
kubectl configured (optional, for Kubernetes integration)
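
To check these requirements before installing, a short script along the following lines may help. It is a sketch, not part of RunbookAI: the helper names major_version and check_runtime are made up for illustration, and it compares only the major version printed by bun --version or node --version.

```shell
#!/usr/bin/env bash
# Sketch: check the runtime requirements listed above.
# major_version and check_runtime are illustrative helpers, not RunbookAI commands.

# Print the major version from a string such as "v20.11.0" or "1.1.8".
major_version() {
  local v="${1#v}"   # drop a leading "v" (node prints v20.11.0)
  echo "${v%%.*}"    # keep everything before the first dot
}

# Usage: check_runtime <command> <minimum major version>
check_runtime() {
  local cmd="$1" min="$2" major
  if ! command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: not installed"
    return 1
  fi
  major="$(major_version "$("$cmd" --version)")"
  if [ "$major" -ge "$min" ]; then
    echo "$cmd: ok (major version $major)"
  else
    echo "$cmd: too old (major version $major, need $min+)"
    return 1
  fi
}

# Either runtime satisfies the requirement.
check_runtime bun 1 || check_runtime node 20 || echo "Install Bun 1.0+ or Node.js 20+"

# kubectl is optional, so only report whether it is present.
if command -v kubectl >/dev/null 2>&1; then
  echo "kubectl: found (optional)"
else
  echo "kubectl: not found (optional, needed only for Kubernetes integration)"
fi
```

The script does not check the API key or AWS credentials, because where those are read from depends on your setup.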

Configuration

Initialize your configuration with the setup wizard:

Terminal

runbook init

Demo output (abridged):

Output

$ runbook init
═══════════════════════════════════════════
 Runbook Setup Wizard
═══════════════════════════════════════════
Step 1: Choose your AI provider
Step 2: Enter your API key
...
 Setup Complete!
Configuration complete! Your settings have been saved to .runbook/services.yaml

Full Configuration Reference

Here's a complete configuration example with all available options:

.runbook/config.yaml

llm:
  provider: anthropic
  model: claude-sonnet-4-20250514

providers:
  aws:
    enabled: true
    regions: [us-east-1, us-west-2]
  kubernetes:
    enabled: true

incident:
  pagerduty:
    enabled: true
    apiKey: ${PAGERDUTY_API_KEY}
  opsgenie:
    enabled: false
    apiKey: ${OPSGENIE_API_KEY}
  slack:
    enabled: false
    botToken: ${SLACK_BOT_TOKEN}
    appToken: ${SLACK_APP_TOKEN}
    signingSecret: ${SLACK_SIGNING_SECRET}
    events:
      enabled: false
      mode: socket
      port: 3001
      alertChannels: [C01234567]
      allowedUsers: [U01234567]
      requireThreadedMentions: true

knowledge:
  sources:
    - type: filesystem
      path: .runbook/runbooks/
      watch: true

    # Confluence Cloud/Server
    - type: confluence
      baseUrl: https://mycompany.atlassian.net
      spaceKey: SRE
      labels: [runbook, postmortem]
      auth:
        email: ${CONFLUENCE_EMAIL}
        apiToken: ${CONFLUENCE_API_TOKEN}

    # Google Drive (requires OAuth)
    - type: google_drive
      folderIds: ['your-folder-id']
      clientId: ${GOOGLE_CLIENT_ID}
      clientSecret: ${GOOGLE_CLIENT_SECRET}
      refreshToken: ${GOOGLE_REFRESH_TOKEN}
      includeSubfolders: true
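
The ${NAME} values in this example look like environment variable placeholders. Assuming RunbookAI resolves them from the environment at startup (this page does not say so explicitly), the sketch below lists every placeholder in a config file and reports the ones that are not exported in the current shell. The function names list_placeholders and report_unset are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: find ${NAME} placeholders in a config file and report unset ones.
# Assumes placeholders are resolved from environment variables.

# List the distinct ${NAME} placeholders found in a file, one name per line.
list_placeholders() {
  grep -o '\${[A-Za-z_][A-Za-z0-9_]*}' "$1" | tr -d '${}' | sort -u
}

# Print "unset: NAME" for each placeholder with no exported, non-empty value.
report_unset() {
  local name
  for name in $(list_placeholders "$1"); do
    # printenv sees only exported variables, which is what a child process gets.
    if [ -z "$(printenv "$name")" ]; then
      echo "unset: $name"
    fi
  done
}

config="${1:-.runbook/config.yaml}"
if [ -f "$config" ]; then
  report_unset "$config"
else
  echo "No config found at $config"
fi
```

Placeholders for integrations you have disabled may legitimately stay unset, so treat the output as a checklist rather than a list of errors.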

Your First Query

Test your installation by running a simple infrastructure query:

Terminal

# Ask about your infrastructure
runbook ask "What EC2 instances are running in prod?"

# Check Kubernetes cluster status
runbook ask "Show cluster status and any warning events"

# Get a status overview
runbook status

Tip

RunbookAI runs read-only queries by default. Any mutation requires explicit approval, and rollback commands are provided alongside it.

Commands

runbook ask

Ask questions about your infrastructure in natural language. The agent will query AWS, Kubernetes, and your knowledge base to provide answers.

Terminal

runbook ask "What's the status of the checkout-api service?"
runbook ask "Show me RDS instances with high CPU"
runbook ask "Who owns the payments service?"
runbook ask "List pods with restart loops in the last hour"

runbook investigate

Perform a hypothesis-driven investigation of a PagerDuty or OpsGenie incident. The agent uses a structured approach to identify root causes.

Terminal

# Basic investigation
runbook investigate PD-12345

# Investigation with auto-remediation
runbook investigate PD-12345 --auto-remediate

The investigation workflow:

1. Gather Context - Pulls incident details, recent deployments, and relevant runbooks
2. Form Hypotheses - Creates ranked hypotheses based on symptoms and organizational knowledge
3. Test Hypotheses - Runs targeted queries against infrastructure to gather evidence
4. Branch or Prune - Branches deeper on strong evidence, prunes dead ends
5. Root Cause - Identifies root cause with confidence level
6. Remediation - Suggests remediation steps with approval gates for mutations

runbook knowledge

Manage the knowledge base that powers contextual understanding:

Terminal

# Sync knowledge from all configured sources
runbook knowledge sync

# Search the knowledge base
runbook knowledge search "redis connection timeout"

# Authenticate with Google Drive
runbook knowledge auth google

runbook status

Get a quick overview of your infrastructure health across all configured providers.

Terminal

runbook status

runbook slack-gateway

Start the Slack gateway for @runbookAI mentions in alert channels:

Terminal

# Local development (Socket Mode)
runbook slack-gateway --mode socket

# Production (HTTP Events API)
runbook slack-gateway --mode http --port 3001

Integrations

AWS

RunbookAI provides read-only access to AWS services, including EC2, RDS, ECS, EKS, Lambda, CloudWatch, DynamoDB, and S3.

.runbook/config.yaml

providers:
  aws:
    enabled: true
    regions: [us-east-1, us-west-2, eu-west-1]

Configure your AWS credentials through one of the following:

Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
AWS credentials file (~/.aws/credentials)
IAM role (when running on EC2/ECS)
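
To see which of these sources is present on the current machine, something like the sketch below can be used. It assumes RunbookAI follows the standard AWS SDK lookup order, in which environment variables take precedence over the credentials file; that is not confirmed on this page. The function name aws_credential_source is hypothetical. AWS_SHARED_CREDENTIALS_FILE is the standard AWS variable for overriding the credentials file location.

```shell
#!/usr/bin/env bash
# Sketch: report which local AWS credential source is present.
# Assumes the standard AWS SDK order: environment variables, then credentials file.

aws_credential_source() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "environment variables"
  elif [ -f "${AWS_SHARED_CREDENTIALS_FILE:-$HOME/.aws/credentials}" ]; then
    echo "credentials file"
  else
    # An IAM role cannot be detected from local files or variables.
    echo "none found locally (an IAM role may still apply on EC2/ECS)"
  fi
}

echo "AWS credentials: $(aws_credential_source)"
```

This only checks that a source exists. If the AWS CLI is installed, aws sts get-caller-identity confirms that the credentials actually authenticate.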

Available AWS Operations

EC2 - Describe instances, security groups, volumes
RDS - Describe instances, clusters, snapshots
ECS - Describe clusters, services, tasks
EKS - Describe clusters, node groups
Lambda - Describe functions, invocations
CloudWatch - Query metrics, logs, alarms
DynamoDB - Describe tables, scan items
S3 - List buckets, objects

Kubernetes

Query Kubernetes clusters with read-only operations, using your existing kubeconfig:

.runbook/config.yaml

providers:
  kubernetes:
    enabled: true

Available Kubernetes Operations

status - Cluster status overview
contexts - List available contexts
namespaces - List namespaces
pods - List pods with status
deployments - List deployments
nodes - Node health and capacity
events - Recent cluster events
top_pods - Resource usage by pod
top_nodes - Resource usage by node

PagerDuty

Integrate with PagerDuty to pull incident context during investigations:

.runbook/config.yaml

incident:
  pagerduty:
    enabled: true
    apiKey: ${PAGERDUTY_API_KEY}

Create a read-only API key in PagerDuty under Configuration → API Access Keys.

OpsGenie

Integrate with OpsGenie for alert and incident context:

.runbook/config.yaml

incident:
  opsgenie:
    enabled: true
    apiKey: ${OPSGENIE_API_KEY}

Slack

Enable the Slack gateway to respond to @runbookAI mentions in alert channels:

.runbook/config.yaml

incident:
  slack:
    enabled: true
    botToken: ${SLACK_BOT_TOKEN}
    appToken: ${SLACK_APP_TOKEN}
    signingSecret: ${SLACK_SIGNING_SECRET}
    events:
      enabled: true
      mode: socket          # 'socket' for dev, 'http' for production
      port: 3001
      alertChannels: [C01234567]
      allowedUsers: [U01234567]
      requireThreadedMentions: true

See the Slack Gateway Guide for detailed setup instructions.

Knowledge Sources

Filesystem

Sync runbooks and documentation from local markdown files:

.runbook/config.yaml

knowledge:
  sources:
    - type: filesystem
      path: .runbook/runbooks/
      watch: true    # Auto-sync on file changes

Confluence

Sync runbooks from Confluence Cloud or Server:

.runbook/config.yaml

knowledge:
  sources:
    - type: confluence
      baseUrl: https://mycompany.atlassian.net
      spaceKey: SRE
      labels: [runbook, postmortem]
      auth:
        email: ${CONFLUENCE_EMAIL}
        apiToken: ${CONFLUENCE_API_TOKEN}

Google Drive

Sync documents from Google Drive folders:

.runbook/config.yaml

knowledge:
  sources:
    - type: google_drive
      folderIds: ['your-folder-id']
      clientId: ${GOOGLE_CLIENT_ID}
      clientSecret: ${GOOGLE_CLIENT_SECRET}
      refreshToken: ${GOOGLE_REFRESH_TOKEN}
      includeSubfolders: true

Run the OAuth authentication flow:

Terminal

# Set up OAuth credentials first
export GOOGLE_CLIENT_ID=your-client-id
export GOOGLE_CLIENT_SECRET=your-client-secret

# Run authentication flow
runbook knowledge auth google

Skills

Skills are step-by-step workflows that can be executed with approval gates. RunbookAI includes built-in skills and supports custom skills.

Built-in Skills

investigate-incident - Full incident investigation workflow
deploy-service - Deploy a service to production
scale-service - Scale a service up or down
rollback-deployment - Roll back to a previous deployment
troubleshoot-service - General service troubleshooting
cost-analysis - Analyze infrastructure costs
security-audit - Security posture check
investigate-cost-spike - Investigate unexpected cost increases

Writing Runbooks

Create Markdown files with YAML frontmatter to help RunbookAI understand your runbooks:

.runbook/runbooks/redis-connection.md

---
type: runbook
services: [checkout-api, cart-service]
symptoms:
  - "Redis connection timeout"
  - "Connection pool exhausted"
severity: sev2
---

# Redis Connection Exhaustion

## Symptoms
- Connection timeouts in checkout-api logs
- Connection pool exhausted errors
- Increased latency in checkout flow

## Quick Diagnosis
1. Check Redis connection count: `redis-cli info clients`
2. Check client memory usage: `redis-cli info memory`
3. Review recent traffic patterns in CloudWatch

## Mitigation Steps
1. Scale Redis cluster (requires approval)
2. Increase connection pool limit in application config
3. Enable connection queuing in the Redis client library

Frontmatter Fields

type - Document type: runbook, postmortem, architecture
services - Array of related service names
symptoms - Array of symptom descriptions for matching
severity - Severity level: sev1, sev2, sev3
tags - Additional tags for search and categorization
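
Before syncing, it can be useful to check that each runbook carries these fields. The sketch below extracts the frontmatter block from every Markdown file in .runbook/runbooks/ and reports missing fields. Treating type, services, symptoms, and severity as expected and tags as optional is an assumption, since this page does not say which fields are required. The function names frontmatter and missing_fields are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report runbooks whose frontmatter lacks the documented fields.
# Which fields are mandatory is an assumption; adjust the list as needed.

# Print the frontmatter: the lines between the first two '---' markers.
# Prints nothing if the file does not start with '---'.
frontmatter() {
  awk '
    NR == 1 && !/^---[[:space:]]*$/ { exit }
    /^---[[:space:]]*$/ { n++; next }
    n == 1 { print }
    n >= 2 { exit }
  ' "$1"
}

# Print one line per expected field that is absent from the frontmatter.
missing_fields() {
  local field fm
  fm="$(frontmatter "$1")"
  for field in type services symptoms severity; do
    if ! printf '%s\n' "$fm" | grep -q "^$field:"; then
      echo "$1: missing $field"
    fi
  done
}

for file in .runbook/runbooks/*.md; do
  [ -f "$file" ] || continue   # the glob matched nothing
  missing_fields "$file"
done
```

The check looks only for top-level keys by name. It does not validate values such as the severity level.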

Evaluation

RunbookAI includes evaluation tooling to benchmark investigation accuracy against datasets:

Terminal

# Run evaluation with RCAEval fixtures
npm run eval:investigate -- \
  --fixtures examples/evals/rcaeval-fixtures.generated.json \
  --out .runbook/evals/rcaeval-report.json

# Run all benchmark adapters
npm run eval:all -- \
  --out-dir .runbook/evals/all-benchmarks \
  --rcaeval-input examples/evals/rcaeval-input.sample.json \
  --tracerca-input examples/evals/tracerca-input.sample.json

See the Investigation Evaluation Guide for detailed documentation on benchmarking.

Need Help?

If you have questions or run into issues, please open an issue on GitHub or join the discussions.