✅ Ops Checklist
Pre-deploy, rollback, and incident-triage checklists. State persists across sessions.
🚀 Pre-Deploy Checklist
Complete before promoting any Lambda version to prod or deploying frontend changes.
Deployment
Confirm lambda_alias_version matches deployed function version. Check alias drift card on Mission Control.
Mission Control → Alias Drift ↗
Fetch the status endpoint and verify response is valid JSON with lambda.version, database, and agents keys present.
System Health ↗
A dead fleet during deployment may mask downstream failures. Check Agent Fleet tile on Mission Control before promoting.
Mission Control → Agent Fleet ↗
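The alias and status-endpoint checks above can be scripted. The following is a minimal sketch, not the platform's own tooling. It assumes that version sits under a lambda object and that lambda_alias_version is a top-level key of the status response; neither layout is confirmed here. The function name check_status is hypothetical.

```python
import json

# Top-level keys the checklist expects in the status response.
REQUIRED_KEYS = ("lambda", "database", "agents")


def check_status(body: str, deployed_version: str) -> list[str]:
    """Return a list of problems found in a status response body (empty if none)."""
    try:
        status = json.loads(body)
    except json.JSONDecodeError as exc:
        return [f"response is not valid JSON: {exc.msg}"]
    if not isinstance(status, dict):
        return ["response is not a JSON object"]

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in status]

    # Assumption: lambda.version is nested as {"lambda": {"version": ...}}.
    lam = status.get("lambda")
    if lam is not None and (not isinstance(lam, dict) or "version" not in lam):
        problems.append("missing key: lambda.version")

    # Assumption: lambda_alias_version is a top-level key.
    alias = status.get("lambda_alias_version")
    if str(alias) != str(deployed_version):
        problems.append(f"alias drift: alias={alias!r}, deployed={deployed_version!r}")
    return problems
```

An empty list means both checks passed. Any entry should be resolved before promoting.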
Check Peck Failure Analysis on Mission Control. A high failure rate may indicate a callback issue that would worsen post-deploy.
Peck Audit ↗
Check Audit Activity strip on Mission Control. Look for abnormal counts of peck_notify, peck_requested, or duck.cert_issued events in the last 30 minutes.
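If audit events are available as an export, the 30-minute count can be reproduced offline. This sketch assumes each event is a dict with a type string and a timezone-aware at timestamp; that shape is hypothetical. What counts as "abnormal" is left to the operator, since no threshold is defined here.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Event types named in the checklist item above.
WATCHED = {"peck_notify", "peck_requested", "duck.cert_issued"}


def recent_counts(events, now, window=timedelta(minutes=30)):
    """Count watched event types whose timestamp falls in the window ending at `now`."""
    cutoff = now - window
    return Counter(
        event["type"]
        for event in events
        if event["type"] in WATCHED and cutoff <= event["at"] <= now
    )
```

Typical use would be recent_counts(events, datetime.now(timezone.utc)), then comparing the result against a normal baseline for the same time of day.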
Deployment can trigger transactional emails. If quota is near 160/200 (amber), defer to off-peak or notify T-JOSH.
Mission Control → SES Quota ↗
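The amber rule above reduces to a simple threshold check. The sketch below takes the 160 and 200 figures from the checklist item. The "exhausted" state and the function name are assumptions added for illustration.

```python
def ses_quota_state(sent: int, limit: int = 200, amber_at: int = 160) -> str:
    """Classify SES usage against the quota figures quoted in the checklist."""
    if sent >= limit:
        return "exhausted"  # assumption: not a state named in the checklist
    if sent >= amber_at:
        return "amber"      # defer to off-peak or notify T-JOSH
    return "ok"
```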
Check the Change Freeze banner on Mission Control. If a freeze is active, deployment is visually blocked until the freeze is cleared or explicitly overridden with T-JOSH approval.
Mission Control → Change Freeze ↗
Invoke the function with test payload and confirm no cold-start errors. Review CloudWatch logs if cold starts exceed 2,000ms.
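For the CloudWatch review, cold starts show up as an Init Duration field on a Lambda REPORT log line. The following sketch scans exported log lines for values above the 2,000 ms threshold. It assumes the lines are already available as plain strings; fetching them from CloudWatch is out of scope here.

```python
import re

# Lambda REPORT lines include "Init Duration: <n> ms" only on cold starts.
INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")


def slow_cold_starts(log_lines, threshold_ms=2000.0):
    """Return Init Duration values (ms) from REPORT lines that exceed the threshold."""
    slow = []
    for line in log_lines:
        if not line.startswith("REPORT"):
            continue
        match = INIT_RE.search(line)
        if match and float(match.group(1)) > threshold_ms:
            slow.append(float(match.group(1)))
    return slow
```

A non-empty result means at least one cold start exceeded the threshold and the surrounding log entries deserve a closer look.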
Document: version number, what changed, timestamp, and operator. This feeds the deployment-log.html surface and incident handoff bundles.
Deployment Log ↗
All Lambda alias promotions require T-JOSH gate. Record the approval timestamp and approval method before executing the promotion.
Governance Log ↗
⏪ Rollback Checklist
Steps to safely roll back a Lambda alias to a previous known-good version.
P0-Ready
Note the current lambda.version and lambda_alias_version values. This is the version being rolled back from.
Mission Control → Alias Drift ↗
Check the deployment history on Mission Control or deployment-log.html. Note the prior Lambda version number that was stable.
Deployment Log ↗
Use the Incident Mode toggle on Mission Control to freeze the dashboard and flag the active incident with a UTC start time. This prevents refresh races during rollback.
Mission Control → Incident Mode ↗
Rollback is a governance action. Record the approval in GOVERNANCE-LOG.md with the target version, reason, and approver before running the AWS CLI command.
Governance Log ↗
Replace <prev-version> with the known-good version number identified in step 2. Do not use $LATEST for prod alias.
Refresh Mission Control and confirm the Alias Drift card shows the expected version with no drift. The /beak/system/status lambda_alias_version should match the target.
Mission Control ↗
After rollback, verify core API routes respond correctly. Check system-health.html for green status across all SLA metrics.
System Health ↗
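The alias update step above does not show the command itself. As a hedged sketch, the standard AWS CLI call for repointing an alias is aws lambda update-alias. The helper below builds that argument list and refuses $LATEST or an unfilled placeholder. The function name and alias passed in are hypothetical placeholders, not this platform's real values.

```python
import re


def rollback_command(function_name: str, alias: str, prev_version: str) -> list[str]:
    """Build the argv for `aws lambda update-alias`, rejecting non-numeric versions."""
    # Published Lambda versions are positive integers; this rejects "$LATEST"
    # and a literal "<prev-version>" placeholder left unfilled.
    if not re.fullmatch(r"[1-9][0-9]*", prev_version):
        raise ValueError(f"not a published version number: {prev_version!r}")
    return [
        "aws", "lambda", "update-alias",
        "--function-name", function_name,
        "--name", alias,
        "--function-version", prev_version,
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True), but only once the approval is recorded in GOVERNANCE-LOG.md.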
Document: reverted version, target version, reason for rollback, approver, and timestamp. Exit incident mode on Mission Control after verification.
Incident Log ↗
🚨 Incident Triage Checklist
First-response steps for any platform incident. Work top-to-bottom.
P0 Response
Go to Mission Control and toggle Incident Mode. This freezes the dashboard, highlights drift and failure cards, and records a UTC start time to localStorage.
Mission Control ↗
Mission Control Anomaly Summary surfaces the top operator issues (alias drift, dead agents, SES/SNS sandbox, peck callback errors). Record what's flagged before taking action.
Mission Control → Anomaly Summary ↗
The health score (0–100) on Mission Control penalises dead agents, stale snapshots, peck failures, and sandbox blockers. Score <60 = red alert. Investigate the highest-penalty item first.
System Health ↗
Verify /beak, /beak/metrics, and /beak/system/status each return 200. A route returning 502 or 504 indicates a Lambda cold start or crash loop.
Mission Control → Route Health ↗
A total fleet outage means no spaceducks are pulsing. This prevents peck approvals and cert issuance. Follow INC-003 in the ops runbook.
Ops Runbook → INC-003 ↗
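The route check above can be run from a terminal when the dashboard itself is suspect. This is a minimal sketch using only the standard library. The base_url argument is an assumption, since the API host is not given here, and the probe sends unauthenticated GET requests, which may not match how the routes are actually protected.

```python
from urllib import error, request

# Routes named in the checklist item above.
ROUTES = ("/beak", "/beak/metrics", "/beak/system/status")


def classify(status: int) -> str:
    """Map an HTTP status code to the triage reading given in the checklist."""
    if status == 200:
        return "ok"
    if status in (502, 504):
        return "lambda cold start or crash loop suspected"
    return f"unexpected status {status}"


def probe(base_url: str, timeout: float = 5.0) -> dict[str, str]:
    """GET each route under base_url and classify the outcome."""
    results = {}
    for route in ROUTES:
        try:
            with request.urlopen(base_url + route, timeout=timeout) as resp:
                results[route] = classify(resp.status)
        except error.HTTPError as exc:   # non-2xx responses raise HTTPError
            results[route] = classify(exc.code)
        except OSError as exc:           # DNS failure, refused connection, timeout
            results[route] = f"unreachable: {exc}"
    return results
```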
Different failure types require different responses. callback_error spikes indicate webhook delivery issues. denied spikes may indicate auth config drift. expired spikes may indicate slow response times.
Peck Audit ↗
Use the "Copy Diagnostics JSON" or "Copy Incident Snapshot" action on Mission Control to capture platform state (lambda version, pipeline counts, sandbox state) with a UTC timestamp.
Mission Control → Diagnostics ↗
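The failure-type guidance above can be encoded as a small lookup, so the dominant spike points straight at a likely cause. This is a sketch only. The counts argument is assumed to be a mapping from failure type to a recent count, which is a hypothetical shape rather than the Peck Audit export format.

```python
# Likely causes, as stated in the checklist item on failure types.
HINTS = {
    "callback_error": "webhook delivery issues",
    "denied": "possible auth config drift",
    "expired": "possible slow response times",
}


def dominant_failure(counts):
    """Return (failure_type, hint) for the most frequent known failure, or None."""
    known = {kind: n for kind, n in counts.items() if kind in HINTS and n > 0}
    if not known:
        return None
    kind = max(known, key=known.get)
    return kind, HINTS[kind]
```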
Any destructive action (rollback, cert revocation, connection termination) requires T-JOSH gate. Do not execute without approval captured in GOVERNANCE-LOG.md.
Governance Log ↗
After resolution, document: severity (P0/P1/P2), incident start time, detection time, resolution time, affected components, root cause, and corrective actions. Exit incident mode on Mission Control.
Incident Log ↗
Incident Postmortem ↗
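The fields listed in the final triage step map naturally onto a structured record. The following is one possible shape, offered as a sketch. The class name, field names, and the derived durations are assumptions, not the Incident Log's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class IncidentRecord:
    severity: str                    # "P0", "P1", or "P2"
    started_at: datetime             # use UTC, matching Incident Mode
    detected_at: datetime
    resolved_at: datetime
    affected_components: list[str]
    root_cause: str
    corrective_actions: list[str]

    def __post_init__(self):
        if self.severity not in ("P0", "P1", "P2"):
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not (self.started_at <= self.detected_at <= self.resolved_at):
            raise ValueError("expected start <= detection <= resolution")

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.started_at
```

Validating the timestamp order at construction catches transposed times before they reach the postmortem.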