Documentation
Complete guide to installing, configuring, and using RunbookAI for AI-powered incident response.
Installation
Install RunbookAI with a single command:
curl -fsSL https://userunbook.ai/install.sh | bash
After installation, restart your shell:
exec $SHELL -l
The script installs Bun (if needed), clones the repo to ~/.runbook, builds the CLI, and adds it to your PATH.
Manual Installation
Alternatively, you can install manually:
# Clone the repository
git clone https://github.com/Runbook-Agent/RunbookAI.git
cd RunbookAI
# Install dependencies with Bun (recommended)
bun install
# Build the CLI
bun run build
Requirements
- Bun 1.0+ or Node.js 20+
- Anthropic API Key for Claude LLM
- AWS credentials (optional, for AWS integration)
- kubectl configured (optional, for Kubernetes integration)
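The requirements above can be checked before running the installer. The following is a minimal sketch, not part of RunbookAI: `have_cmd` and `preflight` are hypothetical helper names, and the sketch only confirms that the tools are on your PATH without verifying the minimum versions.

```shell
# Preflight sketch: report which prerequisites are on the PATH.
# have_cmd and preflight are illustrative helpers, not RunbookAI commands.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

preflight() {
  local status=0
  # A JavaScript runtime is required: Bun 1.0+ or Node.js 20+.
  # Only presence is checked here; compare the printed version yourself.
  if have_cmd bun; then
    echo "runtime: bun $(bun --version)"
  elif have_cmd node; then
    echo "runtime: node $(node --version)"
  else
    echo "runtime: MISSING (install Bun 1.0+ or Node.js 20+)"
    status=1
  fi
  # Optional integrations: reported, but never treated as a failure.
  for tool in aws kubectl; do
    if have_cmd "$tool"; then
      echo "$tool: found"
    else
      echo "$tool: not found (optional)"
    fi
  done
  return "$status"
}
```

Calling `preflight` prints one line per prerequisite and returns a non-zero status only when no JavaScript runtime is found.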
Configuration
Initialize your configuration with the setup wizard:
runbook init
Demo output (abridged):
$ runbook init
═══════════════════════════════════════════
Runbook Setup Wizard
═══════════════════════════════════════════
Step 1: Choose your AI provider
Step 2: Enter your API key
...
Setup Complete!
Configuration complete! Your settings have been saved to .runbook/services.yaml
Full Configuration Reference
Here's a complete configuration example with all available options:
llm:
  provider: anthropic
  model: claude-sonnet-4-20250514
providers:
  aws:
    enabled: true
    regions: [us-east-1, us-west-2]
  kubernetes:
    enabled: true
incident:
  pagerduty:
    enabled: true
    apiKey: ${PAGERDUTY_API_KEY}
  opsgenie:
    enabled: false
    apiKey: ${OPSGENIE_API_KEY}
  slack:
    enabled: false
    botToken: ${SLACK_BOT_TOKEN}
    appToken: ${SLACK_APP_TOKEN}
    signingSecret: ${SLACK_SIGNING_SECRET}
    events:
      enabled: false
      mode: socket
      port: 3001
      alertChannels: [C01234567]
      allowedUsers: [U01234567]
      requireThreadedMentions: true
knowledge:
  sources:
    - type: filesystem
      path: .runbook/runbooks/
      watch: true
    # Confluence Cloud/Server
    - type: confluence
      baseUrl: https://mycompany.atlassian.net
      spaceKey: SRE
      labels: [runbook, postmortem]
      auth:
        email: ${CONFLUENCE_EMAIL}
        apiToken: ${CONFLUENCE_API_TOKEN}
    # Google Drive (requires OAuth)
    - type: google_drive
      folderIds: ['your-folder-id']
      clientId: ${GOOGLE_CLIENT_ID}
      clientSecret: ${GOOGLE_CLIENT_SECRET}
      refreshToken: ${GOOGLE_REFRESH_TOKEN}
      includeSubfolders: true
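The `${...}` placeholders in the configuration above look like references to environment variables. Assuming that is how they are resolved (the source does not spell out the substitution rules), a small check can list the variables a config file references that are unset or empty in the current shell. `missing_env_vars` is a hypothetical helper, not a RunbookAI command.

```shell
# Sketch: print the name of every ${VAR} placeholder in a file whose
# environment variable is unset or empty. Illustrative helper only.
missing_env_vars() {
  local file="$1"
  # Extract ${NAME} tokens, strip the "${" and "}" wrapper, de-duplicate.
  grep -o '\${[A-Za-z_][A-Za-z0-9_]*}' "$file" | tr -d '${}' | sort -u |
    while read -r name; do
      # printenv prints nothing for an unset variable.
      if [ -z "$(printenv "$name")" ]; then
        echo "$name"
      fi
    done
}
```

Run it against your configuration file, for example `missing_env_vars path/to/config.yaml`. No output means every placeholder maps to a non-empty variable.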
Your First Query
Test your installation by running a simple infrastructure query:
# Ask about your infrastructure
runbook ask "What EC2 instances are running in prod?"
# Check Kubernetes cluster status
runbook ask "Show cluster status and any warning events"
# Get a status overview
runbook status
RunbookAI runs read-only queries by default. Any mutation requires explicit approval, and rollback commands are provided alongside it.
Commands
runbook ask
Ask questions about your infrastructure in natural language. The agent will query AWS, Kubernetes, and your knowledge base to provide answers.
runbook ask "What's the status of the checkout-api service?"
runbook ask "Show me RDS instances with high CPU"
runbook ask "Who owns the payments service?"
runbook ask "List pods with restart loops in the last hour"
runbook investigate
Perform a hypothesis-driven investigation of a PagerDuty or OpsGenie incident. The agent uses a structured approach to identify root causes.
# Basic investigation
runbook investigate PD-12345
# Investigation with auto-remediation
runbook investigate PD-12345 --auto-remediate
The investigation workflow:
- Gather Context - Pulls incident details, recent deployments, and relevant runbooks
- Form Hypotheses - Creates ranked hypotheses based on symptoms and organizational knowledge
- Test Hypotheses - Runs targeted queries against infrastructure to gather evidence
- Branch or Prune - Branches deeper on strong evidence, prunes dead ends
- Root Cause - Identifies root cause with confidence level
- Remediation - Suggests remediation steps with approval gates for mutations
runbook knowledge
Manage the knowledge base that powers contextual understanding:
# Sync knowledge from all configured sources
runbook knowledge sync
# Search the knowledge base
runbook knowledge search "redis connection timeout"
# Authenticate with Google Drive
runbook knowledge auth google
runbook status
Get a quick overview of your infrastructure health across all configured providers.
runbook status
runbook slack-gateway
Start the Slack gateway for @runbookAI mentions in alert channels:
# Local development (Socket Mode)
runbook slack-gateway --mode socket
# Production (HTTP Events API)
runbook slack-gateway --mode http --port 3001
Integrations
AWS
RunbookAI provides read-only access to AWS services including EC2, RDS, ECS, EKS, Lambda, CloudWatch, and more.
providers:
  aws:
    enabled: true
    regions: [us-east-1, us-west-2, eu-west-1]
Ensure your AWS credentials are configured via:
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- AWS credentials file (~/.aws/credentials)
- IAM role (when running on EC2/ECS)
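To see which of the credential sources listed above is visible from your shell, a rough check like the following can help. It is a simplified sketch, not the full AWS credential resolution chain: `aws_credential_source` is a hypothetical helper, and it cannot detect an IAM role attached to an EC2 instance or ECS task.

```shell
# Sketch: report the first locally visible AWS credential source.
# Illustrative helper only; the AWS SDK's real lookup order has more steps.
aws_credential_source() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "environment"
  elif [ -f "${AWS_SHARED_CREDENTIALS_FILE:-$HOME/.aws/credentials}" ]; then
    # AWS_SHARED_CREDENTIALS_FILE overrides the default file location.
    echo "credentials-file"
  else
    # An IAM role may still apply when running on EC2/ECS.
    echo "none-detected"
  fi
}
```

If the AWS CLI is installed, `aws sts get-caller-identity` confirms that the detected credentials are actually accepted by AWS.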
Available AWS Operations
- EC2 - Describe instances, security groups, volumes
- RDS - Describe instances, clusters, snapshots
- ECS - Describe clusters, services, tasks
- EKS - Describe clusters, node groups
- Lambda - Describe functions, invocations
- CloudWatch - Query metrics, logs, alarms
- DynamoDB - Describe tables, scan items
- S3 - List buckets, objects
Kubernetes
Query Kubernetes clusters using your configured kubeconfig with read-only operations:
providers:
  kubernetes:
    enabled: true
Available Kubernetes Operations
- status - Cluster status overview
- contexts - List available contexts
- namespaces - List namespaces
- pods - List pods with status
- deployments - List deployments
- nodes - Node health and capacity
- events - Recent cluster events
- top_pods - Resource usage by pod
- top_nodes - Resource usage by node
PagerDuty
Integrate with PagerDuty to pull incident context during investigations:
incident:
  pagerduty:
    enabled: true
    apiKey: ${PAGERDUTY_API_KEY}
Create a read-only API key in PagerDuty under Configuration → API Access Keys.
OpsGenie
Integrate with OpsGenie for alert and incident context:
incident:
  opsgenie:
    enabled: true
    apiKey: ${OPSGENIE_API_KEY}
Slack
Enable the Slack gateway to respond to @runbookAI mentions in alert channels:
incident:
  slack:
    enabled: true
    botToken: ${SLACK_BOT_TOKEN}
    appToken: ${SLACK_APP_TOKEN}
    signingSecret: ${SLACK_SIGNING_SECRET}
    events:
      enabled: true
      mode: socket # 'socket' for dev, 'http' for production
      port: 3001
      alertChannels: [C01234567]
      allowedUsers: [U01234567]
      requireThreadedMentions: true
See the Slack Gateway Guide for detailed setup instructions.
Knowledge Sources
Filesystem
Sync runbooks and documentation from local markdown files:
knowledge:
  sources:
    - type: filesystem
      path: .runbook/runbooks/
      watch: true # Auto-sync on file changes
Confluence
Sync runbooks from Confluence Cloud or Server:
knowledge:
  sources:
    - type: confluence
      baseUrl: https://mycompany.atlassian.net
      spaceKey: SRE
      labels: [runbook, postmortem]
      auth:
        email: ${CONFLUENCE_EMAIL}
        apiToken: ${CONFLUENCE_API_TOKEN}
Google Drive
Sync documents from Google Drive folders:
knowledge:
  sources:
    - type: google_drive
      folderIds: ['your-folder-id']
      clientId: ${GOOGLE_CLIENT_ID}
      clientSecret: ${GOOGLE_CLIENT_SECRET}
      refreshToken: ${GOOGLE_REFRESH_TOKEN}
      includeSubfolders: true
Run the OAuth authentication flow:
# Set up OAuth credentials first
export GOOGLE_CLIENT_ID=your-client-id
export GOOGLE_CLIENT_SECRET=your-client-secret
# Run authentication flow
runbook knowledge auth google
Skills
Skills are step-by-step workflows that can be executed with approval gates. RunbookAI includes built-in skills and supports custom skills.
Built-in Skills
- investigate-incident - Full incident investigation workflow
- deploy-service - Deploy a service to production
- scale-service - Scale a service up or down
- rollback-deployment - Rollback to a previous deployment
- troubleshoot-service - General service troubleshooting
- cost-analysis - Analyze infrastructure costs
- security-audit - Security posture check
- investigate-cost-spike - Investigate unexpected cost increases
Writing Runbooks
Create markdown files with frontmatter to help RunbookAI understand your runbooks:
---
type: runbook
services: [checkout-api, cart-service]
symptoms:
- "Redis connection timeout"
- "Connection pool exhausted"
severity: sev2
---
# Redis Connection Exhaustion
## Symptoms
- Connection timeouts in checkout-api logs
- Connection pool exhausted errors
- Increased latency in checkout flow
## Quick Diagnosis
1. Check Redis connection count: `redis-cli info clients`
2. Check client memory usage: `redis-cli info memory`
3. Review recent traffic patterns in CloudWatch
## Mitigation Steps
1. Scale Redis cluster (requires approval)
2. Increase connection pool limit in application config
3. Enable connection queuing in the application's Redis client pool
Frontmatter Fields
- type - Document type: runbook, postmortem, architecture
- services - Array of related service names
- symptoms - Array of symptom descriptions for matching
- severity - Severity level: sev1, sev2, sev3
- tags - Additional tags for search and categorization
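The documented values for type and severity can be linted before a runbook is synced. The sketch below is a hedged illustration, not RunbookAI's own validation: `frontmatter`, `fm_field`, and `check_runbook` are hypothetical helpers. It treats type as required and severity as optional, which is an assumption the source does not confirm, and it reads only single-line scalar fields.

```shell
# Sketch: lint the 'type' and 'severity' frontmatter fields of a runbook.
# All three functions are illustrative helpers, not RunbookAI commands.

# Print the lines between the opening and closing '---' markers.
frontmatter() {
  awk 'NR==1 && !/^---[ \t]*$/ {exit}
       /^---[ \t]*$/ {n++; next}
       n==1 {print}
       n>=2 {exit}' "$1"
}

# Print the value of a single-line scalar field, e.g. fm_field FILE severity.
fm_field() {
  frontmatter "$1" | sed -n "s/^$2:[[:space:]]*//p" | head -n 1
}

check_runbook() {
  local file="$1" status=0
  case "$(fm_field "$file" type)" in
    runbook|postmortem|architecture) ;;
    *) echo "$file: type must be runbook, postmortem, or architecture"; status=1 ;;
  esac
  # Assumption: severity may be omitted, but must be valid when present.
  case "$(fm_field "$file" severity)" in
    ""|sev1|sev2|sev3) ;;
    *) echo "$file: severity must be sev1, sev2, or sev3"; status=1 ;;
  esac
  return "$status"
}
```

For example, `check_runbook .runbook/runbooks/redis.md` (a hypothetical file name) prints nothing and returns zero for the Redis runbook shown earlier.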
Evaluation
RunbookAI includes evaluation tooling to benchmark investigation accuracy against datasets:
# Run evaluation with RCAEval fixtures
npm run eval:investigate -- \
--fixtures examples/evals/rcaeval-fixtures.generated.json \
--out .runbook/evals/rcaeval-report.json
# Run all benchmark adapters
npm run eval:all -- \
--out-dir .runbook/evals/all-benchmarks \
--rcaeval-input examples/evals/rcaeval-input.sample.json \
--tracerca-input examples/evals/tracerca-input.sample.json
See the Investigation Evaluation Guide for detailed documentation on benchmarking.
If you have questions or run into issues, please open an issue on GitHub or join the discussions.