Safety & Control
RunbookAI is built for production. Every design decision prioritizes safety, auditability, and your team's control over what runs in your infrastructure.
Overview
Autonomous incident response is powerful — but only if you trust it. RunbookAI's safety model ensures no action is taken without explicit human approval, every step is logged, and you retain full ownership of your data and deployment.
Three principles guide every feature:
- Human-in-the-loop: Every mutation requires explicit approval before execution.
- Full observability: Every action, query, and decision is logged with complete context.
- Zero trust in the agent: RunbookAI assumes it can be wrong. It presents evidence and recommendations — your team decides.
Architecture & Safety Model
RunbookAI runs entirely within your infrastructure. There is no hosted service, no external data plane, and no telemetry sent back to us.
┌──────────────────────────────────────────────────┐
│               Your Infrastructure                │
│                                                  │
│  ┌─────────┐    ┌──────────────┐    ┌──────────┐ │
│  │  Alert  │───▶│  RunbookAI   │───▶│ Slack /  │ │
│  │ Source  │    │    Agent     │    │ Approval │ │
│  └─────────┘    └──────┬───────┘    └──────────┘ │
│                        │                         │
│          ┌─────────────┼─────────────┐           │
│          ▼             ▼             ▼           │
│     ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│     │   AWS    │  │   K8s    │  │ Knowledge│     │
│     │  (read)  │  │  (read)  │  │   Base   │     │
│     └──────────┘  └──────────┘  └──────────┘     │
│                                                  │
│   All queries are read-only by default.          │
│   Mutations require human approval via Slack.    │
└──────────────────────────────────────────────────┘
Key properties:
- Local execution: The agent runs as a CLI process or self-hosted server on your machines.
- Read-only by default: Infrastructure queries (AWS, Kubernetes, CloudWatch) are read-only. No write operations without approval.
- LLM calls go direct: API calls to your configured LLM provider go directly from your network. RunbookAI never proxies or stores prompts.
- No phone-home: Zero telemetry, no usage tracking, no analytics sent externally.
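The read-only default can be pictured as a guard that classifies every command before it runs. The following sketch uses a hypothetical verb allowlist, not RunbookAI's actual classification logic:

```python
# Sketch of a read-only guard. The verb allowlist below is
# illustrative only -- not RunbookAI's real classifier.
READ_ONLY_VERBS = {"get", "list", "describe", "logs", "top"}

def is_read_only(command: str) -> bool:
    """Treat a command as safe only if its leading verb is a known read."""
    verb = command.split()[0].lower()
    return verb in READ_ONLY_VERBS

is_read_only("get pods -n payments")           # True: runs immediately
is_read_only("scale deploy/web --replicas=3")  # False: routed to approval
```

Anything that fails the check falls through to the approval gate described in the next section.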
Approval Gates
Every mutating action — scaling a deployment, restarting a service, modifying a configuration — requires explicit human approval before execution. This is not optional; it's the core safety mechanism.
How it works
1. RunbookAI identifies a potential remediation (e.g., "scale payment-service to 5 replicas").
2. The proposed action is sent to your team via Slack with full context: what will change, why, and supporting evidence.
3. A team member approves or rejects the action directly in Slack.
4. Only after approval does RunbookAI execute the command.
Approval gates work independently of runbook permissions. Even if a runbook allows a mutation, the gate still requires human sign-off at execution time.
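The gate boils down to a wrapper that refuses to execute anything a human has not signed off on. This is a minimal sketch with hypothetical function names, not RunbookAI's internal API; the Slack round-trip is stubbed out:

```python
def request_approval(action: str, evidence: list[str]) -> bool:
    """Post the proposed action to the approval channel and block
    until a human decides or the timeout expires. Stubbed here:
    the real system round-trips through Slack."""
    return True  # simulated approval for this sketch

def execute_with_gate(action: str, evidence: list[str]) -> str:
    """A mutation never runs unless a human explicitly approved it."""
    if not request_approval(action, evidence):
        return "rejected"
    # Only after approval does the command actually execute.
    return f"executed: {action}"
```

The key property is structural: the execution path is unreachable without a truthy approval, so there is no code path that mutates infrastructure unattended.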
Granular control
Configure approval requirements per-action, per-environment, or per-team:
```yaml
safety:
  approval_required: true        # Global default
  auto_approve_reads: true       # Skip approval for read-only queries
  approval_channel: "#incident-ops"
  timeout: 300                   # Seconds to wait for approval
  environments:
    production:
      approval_required: true    # Always require in prod
    staging:
      approval_required: false   # Auto-approve in staging
```
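Assuming environment-level settings take precedence over the global default (the natural reading of the config above), the effective setting resolves like this. The helper below is an illustrative sketch, not RunbookAI's loader:

```python
def approval_required(config: dict, environment: str) -> bool:
    """Resolve the effective approval setting: an environment-level
    override wins; otherwise fall back to the global default (true)."""
    env_cfg = config.get("environments", {}).get(environment, {})
    return env_cfg.get("approval_required", config.get("approval_required", True))

safety = {
    "approval_required": True,  # global default
    "environments": {
        "production": {"approval_required": True},
        "staging": {"approval_required": False},
    },
}

approval_required(safety, "staging")     # False: auto-approved
approval_required(safety, "production")  # True: gate enforced
approval_required(safety, "dev")         # True: unlisted envs use the default
```

Note that an unlisted environment inherits the global default, so the safe behavior (require approval) holds unless explicitly relaxed.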
Audit Trails
Every action RunbookAI takes is logged with full context: what was done, why, who approved it, and what evidence led to the decision. This creates a complete, queryable record of every incident investigation.
What gets logged
- Hypotheses formed: Ranked list with confidence scores and reasoning.
- Evidence gathered: Every query, its target system, and the response.
- Actions proposed: The exact command, expected impact, and supporting evidence.
- Approval decisions: Who approved/rejected, when, and any comments.
- Execution results: Command output, success/failure, and post-action state.
Audit logs are stored locally alongside your investigation results. Export them to your existing logging infrastructure or compliance tools — there are no proprietary formats.
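Since logs are structured JSON, a single audit record might look like the following. The field names here are illustrative, not RunbookAI's actual schema:

```python
import json

# Hypothetical audit record -- field names are illustrative only.
record = {
    "timestamp": "2024-06-01T14:32:10Z",
    "phase": "action_proposed",
    "action": "kubectl scale deploy/payment-service --replicas=5",
    "evidence": ["CPU at 94% across 3 pods", "p99 latency 2.1s"],
    "approved_by": "alice",
    "result": "success",
}

# One JSON object per line (JSONL) ships cleanly into most
# existing logging and compliance pipelines.
line = json.dumps(record)
json.loads(line)["approved_by"]  # "alice"
```

Because each record is a flat, self-describing JSON object, export is a matter of tailing the file into whatever pipeline you already run.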
Self-Hosted
RunbookAI runs entirely on your infrastructure. There is no SaaS component, no hosted control plane, and no external dependency beyond your chosen LLM provider.
- CLI mode: Run directly on any engineer's machine as a local process.
- Server mode: Deploy the shared knowledge server within your network for team-wide access.
- Your LLM, your keys: Bring your own Anthropic (or other provider) API key. Calls go direct from your network.
- Your data stays yours: Runbooks, knowledge bases, investigation results, and audit logs never leave your infrastructure.
With a self-hosted LLM, RunbookAI can operate in fully air-gapped environments with zero external network access.
Open Source
RunbookAI is released under the MIT License. The entire codebase — agent logic, integrations, knowledge indexing, safety mechanisms — is open for inspection, modification, and contribution.
- Full source available: Every line of code that runs in your infrastructure is auditable.
- No hidden services: There are no closed-source components, proprietary plugins, or opaque backends.
- Community-driven: Security issues are reported and fixed in the open. No security-through-obscurity.
- Fork-friendly: MIT license means you can modify, redistribute, and build on RunbookAI without restriction.
No Vendor Lock-In
RunbookAI is designed to integrate with your existing stack, not replace it. Every component is swappable, every format is standard, and there is no proprietary data layer.
- Standard runbook format: Markdown with YAML frontmatter. Your runbooks work with or without RunbookAI.
- Bring any LLM: Swap providers (Anthropic, OpenAI, local models) without changing your runbooks or configuration.
- No proprietary APIs: All integrations use the official APIs of the tools you already run (AWS SDK, Kubernetes API, PagerDuty API).
- Plain-text everything: Configuration is YAML. Knowledge sources are Markdown. Logs are structured JSON. No proprietary formats.
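For instance, a runbook in this format is just a Markdown file with YAML frontmatter. The frontmatter fields shown below are hypothetical, not a required schema:

```markdown
---
title: Payment service high latency
services: [payment-service]
severity: high
---

## Symptoms
p99 latency above 2s; CPU saturation on payment pods.

## Remediation
1. Check recent deploys: `kubectl rollout history deploy/payment-service`
2. If CPU-bound, scale up (requires approval):
   `kubectl scale deploy/payment-service --replicas=5`
```

A file like this is readable in any editor and usable by an on-call engineer whether or not RunbookAI is in the loop.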
If you stop using RunbookAI, you keep everything: your runbooks, knowledge, and investigation history. Nothing is locked behind a proprietary format or service.
Questions or feedback? Open a GitHub Discussion or file an issue. For security reports, use GitHub's private vulnerability reporting rather than a public issue.