Ops Checklist — Space Duck Mission Control

✅ Ops Checklist

Pre-deploy, rollback, and incident-triage checklists. State persists across sessions.

Checklist state auto-saved to localStorage.

Checklists: Pre-Deploy (10 items), Rollback (8 items), Incident Triage (9 items)

⏪ Rollback Checklist

Steps to safely roll back a Lambda alias to a previous known-good version.

P0-Ready

Identify the failing version number from the Mission Control Alias Drift card

Note the current lambda.version and lambda_alias_version values. These identify the version you are rolling back from.

Mission Control → Alias Drift ↗

Identify the last known-good version from the deployment log

Check the deployment history on Mission Control or deployment-log.html. Note the most recent Lambda version number that was stable.

Deployment Log ↗

Enable incident mode on Mission Control — pause auto-refresh

Use the Incident Mode toggle on Mission Control to freeze the dashboard and flag the active incident with a UTC start time. This prevents refresh races during rollback.

Mission Control → Incident Mode ↗

Notify T-JOSH — obtain rollback approval before executing

Rollback is a governance action. Record the approval in GOVERNANCE-LOG.md with the target version, reason, and approver before running the AWS CLI command.

Governance Log ↗

Execute rollback: aws lambda update-alias --function-name mission-control-api --name prod --function-version <prev-version>

Replace <prev-version> with the known-good version number identified in step 2. Never point the prod alias at $LATEST.
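
A minimal sketch of guarding this step before the command runs. The helper name build_rollback_command is hypothetical; it only assembles the argument list for the aws lambda update-alias call shown above and refuses $LATEST or any non-numeric version. It does not call AWS itself.

```python
def build_rollback_command(prev_version: str,
                           function_name: str = "mission-control-api",
                           alias: str = "prod") -> list:
    """Return the argv for the rollback command, rejecting unsafe versions."""
    version = str(prev_version).strip()
    # Published Lambda versions are positive integers; "$LATEST" fails this check.
    if not (version.isascii() and version.isdigit()) or int(version) < 1:
        raise ValueError(
            f"refusing to roll back to {prev_version!r}: "
            "expected a published version number"
        )
    return [
        "aws", "lambda", "update-alias",
        "--function-name", function_name,
        "--name", alias,
        "--function-version", str(int(version)),
    ]
```

Once approval is recorded, the returned list could be handed to subprocess.run(cmd, check=True), or simply printed for an operator to review and paste.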

Verify rollback on Mission Control — confirm alias now points to target version

Refresh Mission Control and confirm the Alias Drift card shows the expected version with no drift. The lambda_alias_version field returned by /beak/system/status should match the target version.

Mission Control ↗
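
The verification above could be scripted roughly as follows. This is a sketch under an assumption: that the /beak/system/status response is a JSON object with lambda_alias_version as a top-level key. The real payload shape is not documented here, and alias_matches_target is a hypothetical helper name.

```python
def alias_matches_target(status: dict, target_version) -> bool:
    """True when the reported alias version equals the rollback target."""
    # Assumed field location: top-level "lambda_alias_version".
    reported = status.get("lambda_alias_version")
    # Compare as strings so 41 and "41" count as the same version.
    return reported is not None and str(reported) == str(target_version)
```

A missing field is treated as a mismatch, so an incomplete response never passes verification by accident.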

Run smoke test — confirm /beak/system/status returns 200 with correct version

After rollback, verify core API routes respond correctly. Check system-health.html for green status across all SLA metrics.

System Health ↗

Record rollback in DEPLOY-LOG.md and incident-log.html with root cause

Document the version rolled back from, the target version, the reason for rollback, the approver, and a timestamp. Exit incident mode on Mission Control after verification.

Incident Log ↗
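
One way to keep rollback entries consistent is to build them from a fixed set of fields. The sketch below is illustrative only: RollbackRecord, make_rollback_record, and the field names are hypothetical, and the format actually used in DEPLOY-LOG.md may differ.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class RollbackRecord:
    reverted_version: str  # version rolled back from
    target_version: str    # known-good version now behind the alias
    reason: str
    approver: str
    timestamp_utc: str

def make_rollback_record(reverted, target, reason, approver, now=None) -> dict:
    """Build a log entry, refusing to create one with a missing field."""
    if not all([reverted, target, reason, approver]):
        raise ValueError("reverted, target, reason, and approver are all required")
    # Default to the current time; always store the timestamp in UTC.
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).isoformat(timespec="seconds")
    return asdict(RollbackRecord(str(reverted), str(target), reason, approver, stamp))
```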

🚨 Incident Triage Checklist

First-response steps for any platform incident. Work top-to-bottom.

P0 Response

Enable incident mode on Mission Control to pause auto-refresh and stamp start time

Go to Mission Control and toggle Incident Mode. This freezes the dashboard, highlights drift and failure cards, and records a UTC start time to localStorage.

Mission Control ↗

Check anomaly summary — note top 3 active anomalies

Mission Control Anomaly Summary surfaces the top operator issues (alias drift, dead agents, SES/SNS sandbox, peck callback errors). Record what's flagged before taking action.

Mission Control → Anomaly Summary ↗

Check System Health Score — note current score and penalties

The health score (0–100) on Mission Control penalises dead agents, stale snapshots, peck failures, and sandbox blockers. A score below 60 is a red alert. Investigate the highest-penalty item first.

System Health ↗
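
The triage rule in this step can be sketched in a few lines. The threshold of 60 comes from the checklist; the function names and the penalty keys in the usage example are hypothetical, since the actual penalty breakdown format is not specified here.

```python
from typing import Dict, Optional

RED_ALERT_THRESHOLD = 60  # from the checklist: a score below 60 is a red alert

def is_red_alert(score: float) -> bool:
    """True when a 0-100 health score falls in the red-alert range."""
    if not 0 <= score <= 100:
        raise ValueError(f"health score out of range: {score}")
    return score < RED_ALERT_THRESHOLD

def highest_penalty(penalties: Dict[str, float]) -> Optional[str]:
    """Return the name of the item with the largest penalty, or None if empty."""
    if not penalties:
        return None
    return max(penalties, key=penalties.get)
```

For example, highest_penalty({"dead_agents": 25, "stale_snapshots": 10}) picks "dead_agents" as the item to investigate first.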

Confirm API routes are responding — check Route Health matrix

Verify that /beak, /beak/metrics, and /beak/system/status each return 200. A route returning 502 or 504 suggests the Lambda behind it is failing or timing out, for example a crash loop or a slow cold start.

Mission Control → Route Health ↗
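
A small sketch of the check described in this step, assuming the status codes have already been collected by whatever probe is in use. The failing_routes helper is hypothetical; the three routes are the ones named above.

```python
from typing import Dict, Optional

ROUTES = ("/beak", "/beak/metrics", "/beak/system/status")

def failing_routes(status_codes: Dict[str, int]) -> Dict[str, Optional[int]]:
    """Return every required route that did not answer 200.

    A route with no recorded status code is reported with None.
    """
    failures = {}
    for route in ROUTES:
        code = status_codes.get(route)
        if code != 200:
            failures[route] = code
    return failures
```

An empty result means all three routes are healthy; anything else lists what to look at in the Route Health matrix.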

Check agent fleet — if agents.alive = 0, initiate fleet recovery

A total fleet outage means no spaceducks are pulsing. This prevents peck approvals and cert issuance. Follow INC-003 in the ops runbook.

Ops Runbook → INC-003 ↗

Check peck failure breakdown — identify dominant failure type (denied / expired / callback_error)

Different failure types require different responses. callback_error spikes indicate webhook delivery issues. denied spikes may indicate auth config drift. expired spikes may indicate slow response times.

Peck Audit ↗
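
Picking the dominant failure type from a set of counts could look like the sketch below. The three failure types and their likely causes are taken from this step; the dominant_failure helper and the shape of the counts input are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

# Likely cause per failure type, as described in the checklist.
FAILURE_HINTS = {
    "callback_error": "webhook delivery issues",
    "denied": "possible auth config drift",
    "expired": "possible slow response times",
}

def dominant_failure(counts: Dict[str, int]) -> Optional[Tuple[str, str]]:
    """Return (failure_type, hint) for the most frequent known failure type."""
    known = {name: counts.get(name, 0) for name in FAILURE_HINTS}
    if not any(known.values()):
        return None  # no failures of a known type were recorded
    top = max(known, key=known.get)
    return top, FAILURE_HINTS[top]
```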

Copy diagnostics JSON — capture system state for incident handoff

Use the "Copy Diagnostics JSON" or "Copy Incident Snapshot" action on Mission Control to capture platform state (lambda version, pipeline counts, sandbox state) with a UTC timestamp.

Mission Control → Diagnostics ↗

Notify T-JOSH if rollback or escalation is required

Any destructive action (rollback, cert revocation, connection termination) requires T-JOSH approval. Do not execute it until the approval is captured in GOVERNANCE-LOG.md.

Governance Log ↗

Log incident in incident-log.html with severity, timeline, impact, and resolution

After resolution, document: severity (P0/P1/P2), incident start time, detection time, resolution time, affected components, root cause, and corrective actions. Exit incident mode on Mission Control.

Incident Log ↗ Incident Postmortem ↗
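
The three timestamps recorded above give two durations that are often useful in a postmortem. A minimal sketch, with hypothetical names, that derives them and rejects timestamps entered out of order:

```python
from datetime import datetime, timedelta
from typing import Dict

def incident_durations(start: datetime, detected: datetime,
                       resolved: datetime) -> Dict[str, timedelta]:
    """Compute time-to-detect and time-to-resolve from the logged timestamps."""
    if not start <= detected <= resolved:
        raise ValueError("expected start <= detected <= resolved")
    return {
        "time_to_detect": detected - start,
        "time_to_resolve": resolved - start,
    }
```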
