📋 Runbooks Index
Operator incident hub — grouped by incident type with severity triage guidance.
Agent Fleet Incidents
Primary operator guide for fleet emergencies. Covers beak-key rotation, agent re-registration, pulse escalation, and live fleet triage checklists. Start here when agents go silent or peck failures spike.
Open Ops Runbook →
Step-by-step fleet restoration when agents.alive = 0. Guides re-registration, heartbeat validation, beak-key verification, and post-recovery health checks. Use when Mission Control shows zero alive agents.
Open Fleet Recovery →
Deployment & Lambda
Chronological record of all deployments, rollbacks, and infrastructure changes. Use to correlate incidents with recent deploys, verify which Lambda versions are live, and audit change history during post-mortems.
Open Deployment Log →
Live Lambda function version matrix across all environments. Use to detect version drift, verify deployment parity, and identify which function revision is active per alias. Cross-reference with Deployment Log for rollback targets.
Open Lambda Versions →
System Health
Real-time SLA metrics, uptime history, and performance trends. Open first during any suspected outage to establish platform baseline. Shows latency, error rates, and 30-day uptime bars across all core services.
Open System Health →
Alert thresholds, notification routing, and on-call configuration. Use to verify that alerts are firing correctly, adjust sensitivity after a noisy incident, or confirm which channels receive P0/P1 notifications.
Open Alerts Config →
Incident Management
Active and historical incident tracker. Log new incidents here when escalating, track status updates in real time, and record resolution timelines. Required for any P0/P1 event to maintain operator audit trail.
Open Incident Log →
Structured postmortem template for P0/P1 incidents. Use after resolution to document root cause, contributing factors, timeline of events, and action items to prevent recurrence. Template in progress.
Documentation in progress
Governance
Audit trail for operator decisions, policy changes, and compliance checkpoints. Review before major releases or certification audits to verify change authority and operator accountability records.
Open Governance Log →
Pre-deployment and daily operator checklist. Run before pushing to production or during scheduled maintenance windows. Covers cert health, fleet pulse, API gateway status, and infrastructure readiness gates.
Open Ops Checklist →
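
The version-drift check mentioned under Lambda Versions could be sketched roughly as follows. This is only an illustration: the matrix shape (function name, then environment, then version), the function names, and the environment names are all assumptions, not the format the Lambda Versions page actually uses.

```python
from typing import Dict

# Assumed shape (hypothetical): function name -> environment -> live version.
VersionMatrix = Dict[str, Dict[str, str]]


def find_version_drift(matrix: VersionMatrix) -> VersionMatrix:
    """Return only the functions whose live version differs between environments."""
    return {
        name: versions
        for name, versions in matrix.items()
        if len(set(versions.values())) > 1
    }


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample: VersionMatrix = {
        "heartbeat-ingest": {"staging": "14", "production": "14"},
        "agent-register": {"staging": "9", "production": "8"},
    }
    for name, versions in find_version_drift(sample).items():
        print(f"drift in {name}: {versions}")
```

A function that is missing from one environment is not flagged by this sketch; it only compares the versions that are listed. Any drifted function it reports would still need to be cross-referenced with the Deployment Log to pick a rollback target.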