Operational Runbooks
Runbook 1: Emergency Policy Override (All Users Locked Out)
Section titled “Runbook 1: Emergency Policy Override (All Users Locked Out)”Symptom: A policy change has locked out all users — every request returns not authorized. No one can submit requests or approve them.
Cause: A misconfigured eligibility policy with default allow = false that does not match any legitimate request.
Recovery
Section titled “Recovery”# 1. Identify the broken policyjitsudo policy list
# 2. Option A: Disable the broken policyjitsudo policy update --name broken-policy --enabled falsejitsudo server reload-policies
# 3. Verify access is restoredjitsudo policy eval \ --type eligibility \ --input '{"user":{"email":"admin@example.com","groups":["sre"]},"request":{"provider":"aws","role":"prod-infra-admin","resource_scope":"123456789012","duration_seconds":3600}}'If jitsudo CLI access is also broken (can’t authenticate):
# 4. Option B: Direct database update (last resort)# Set enabled = false on all eligibility policies temporarilypsql "$DATABASE_URL" << 'EOF'UPDATE policies SET enabled = false WHERE type = 'ELIGIBILITY';EOF
# 5. Restart jitsudod to pick up the changekubectl rollout restart deployment/jitsudod
# 6. Re-apply the correct policy via CLIjitsudo policy apply -f correct-eligibility.rego --type eligibility
# 7. Re-enable remaining policiespsql "$DATABASE_URL" << 'EOF'UPDATE policies SET enabled = true WHERE type = 'ELIGIBILITY';EOFRunbook 2: TLS Certificate Rotation (Zero-Downtime)
Section titled “Runbook 2: TLS Certificate Rotation (Zero-Downtime)”Trigger: Certificate approaching expiry (automate a 30-day alert), CA rotation, or security incident requiring immediate certificate replacement.
On Kubernetes (Helm)
Section titled “On Kubernetes (Helm)”# 1. Generate new certificate (example using cert-manager or openssl)openssl req -x509 -newkey rsa:4096 \ -keyout new-server.key \ -out new-server.crt \ -days 90 -nodes \ -subj "/CN=jitsudo.internal"
# 2. Update the Kubernetes Secretkubectl create secret tls jitsudo-tls \ --cert=new-server.crt \ --key=new-server.key \ --dry-run=client -o yaml \ | kubectl apply -f -
# 3. Rolling restart (reads new cert from mounted Secret)kubectl rollout restart deployment/jitsudod
# 4. Verify new certificate is in useecho | openssl s_client -connect jitsudo.internal:443 2>/dev/null \ | openssl x509 -noout -dates
# 5. Update CLI clients with new CA if CA changed# Distribute the new CA cert to engineers who need to update ~/.jitsudo/ca.crtOn single-server deployment
Section titled “On single-server deployment”# 1. Replace the cert and key filescp new-server.crt /etc/jitsudo/tls/server.crtcp new-server.key /etc/jitsudo/tls/server.keychmod 600 /etc/jitsudo/tls/server.key
# 2. Restart jitsudodsystemctl restart jitsudod
# 3. Verifysystemctl status jitsudodRunbook 3: Database Recovery from Backup
Section titled “Runbook 3: Database Recovery from Backup”Trigger: Database corruption, accidental data deletion, or migration failure.
# 1. Stop jitsudod to prevent writes during restorekubectl scale deployment jitsudod --replicas=0# Wait for all pods to terminatekubectl wait --for=delete pod -l app=jitsudod --timeout=60s
# 2. Create a fresh database (if needed)psql -U postgres -c "DROP DATABASE IF EXISTS jitsudo_restore;"psql -U postgres -c "CREATE DATABASE jitsudo_restore;"
# 3. Restore from backupgunzip -c backup-2026-03-21.sql.gz \ | psql -U postgres jitsudo_restore
# 4. Verify the restorepsql -U postgres jitsudo_restore -c "SELECT COUNT(*) FROM audit_events;"psql -U postgres jitsudo_restore -c "SELECT COUNT(*) FROM requests;"
# 5. Verify audit log hash chain integrity on the restored dataexport DATABASE_URL="postgres://jitsudo_app:${DB_PASSWORD}@localhost:5432/jitsudo_restore"jitsudo audit verify# Expected: "Chain intact: N events verified"
# 6. If chain is intact, swap databasespsql -U postgres -c "ALTER DATABASE jitsudo RENAME TO jitsudo_old;"psql -U postgres -c "ALTER DATABASE jitsudo_restore RENAME TO jitsudo;"
# 7. Restart jitsudodkubectl scale deployment jitsudod --replicas=2
# 8. Verify jitsudod healthkubectl get pods -l app=jitsudodcurl https://jitsudo.internal/healthzRunbook 4: All Approvers Unavailable — Break-Glass Procedure
Section titled “Runbook 4: All Approvers Unavailable — Break-Glass Procedure”Trigger: P0/P1 incident at an unusual hour; all policy-designated approvers are unreachable.
# 1. Attempt to reach an approver first (Slack, phone, PagerDuty)# Document the attempts in the incident ticket.
# 2. If no approver is reachable within your SLA window, invoke break-glassjitsudo request \ --provider aws \ --role prod-infra-admin \ --scope YOUR_ACCOUNT_ID \ --duration 2h \ --reason "P0 ECS crash - INC-4421. No approvers reachable. Break-glass invoked per runbook." \ --break-glass
# 3. This triggers:# - Immediate ACTIVE grant (no approval needed)# - High-priority Slack alert to ALL configured channels# - Audit log entry with break_glass: true
# 4. After the incident is resolved, revoke the grantjitsudo revoke req_01...
# 5. Post-incident: file a review covering:# - Why approvers were unavailable# - What access was taken and why# - Whether the policy-designated approver group needs updating# - Whether the break-glass eligibility policy should be tightenedRunbook 5: Revoking All Active Grants for a Compromised User
Section titled “Runbook 5: Revoking All Active Grants for a Compromised User”Trigger: Engineer’s credentials are compromised; account takeover suspected; offboarding with active grants.
# 1. Immediately revoke IdP account or remove from all groups# (Do this first — prevents new requests)# Okta: deactivate user, Entra ID: disable account, etc.
# 2. Find all active grants for the userjitsudo status \ --user compromised-user@example.com \ --state active
# 3. Revoke each active grant# For each req_ID in the output:jitsudo revoke req_01...jitsudo revoke req_02...# (repeat for all active grants)
# Or use xargs for automation:jitsudo status \ --user compromised-user@example.com \ --state active \ --output json \ | jq -r '.[].id' \ | xargs -I{} jitsudo revoke {}
# 4. Verify all grants are revokedjitsudo status --user compromised-user@example.com --state active# Should return empty list
# 5. Review the audit log for the userjitsudo audit --user compromised-user@example.com --since 7d
# 6. Check for any unusual access patterns in the audit log:# - Requests at unusual hours# - Unusual providers or roles# - Short-duration grants that expired before you could revoke themRunbook 6: jitsudod Is Down — Restoring the Control Plane
Section titled “Runbook 6: jitsudod Is Down — Restoring the Control Plane”Trigger: All jitsudod instances are returning errors; health checks failing.
# 1. Check pod statuskubectl get pods -l app=jitsudodkubectl describe pod <pod-name>kubectl logs <pod-name> --previous
# 2. Common causes and fixes:## CrashLoopBackOff with "database connection refused":# → Check PostgreSQL connectivity# → Verify DATABASE_URL env var is set correctly# → Check network policy allows jitsudod → PostgreSQL## CrashLoopBackOff with "TLS: failed to load certificate":# → Check TLS cert/key Secret is mounted correctly# → Verify cert has not expired## OOMKilled:# → Increase memory limits in Helm values# → Check for runaway requests or policy eval loops
# 3. Force restartkubectl rollout restart deployment/jitsudod
# 4. Monitor rolloutkubectl rollout status deployment/jitsudod
# 5. Verify healthcurl https://jitsudo.internal/healthzcurl https://jitsudo.internal/readyz