hermes-mcp/docs/runbooks/2026-06-14-public-edge-outage-plan.md
2026-06-14 12:26:34 -04:00

Plan: Document the outage, build a deployment runbook, and add diagnostics/monitoring

Goal

Turn the June 2026 public-edge outage into repeatable, observable infrastructure, with all artifacts stored in the SquareMCP repository (/home/garfield/hermes-mcp/).

  1. Write a clear post-incident / RCA document.
  2. Create a step-by-step deployment runbook that the next operator can follow without guessing.
  3. Add probes, metrics, and alerting so the same class of failure is detected and escalated before users notice.

Root cause (condensed)

  • Public ports 80/443/8080 are owned by a Docker Traefik container. Its iptables rules intercept all inbound traffic before the host-network K8s nginx-ingress can serve it.
  • Traefik had no routers or valid TLS certificates for the commercial squaremcp.com / fetcherpay.com domains, so it returned 404 page not found with a self-signed cert.
  • K8s cert-manager held valid certs, but the active nginx-ingress controller watches ingressClass=public while the Ingress resources declare ingressClassName=nginx. The controller therefore never reconciled them, so K8s could not have served the traffic even if it had reached the cluster.
  • Several Docker backends were stopped: fetcherpay-web, poste, postgres, gitea. The temporal-ui container was running but Traefik was pointed at its gRPC port (7233) instead of its HTTP UI port (8080).
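
The ingress class mismatch described above can be detected mechanically. The following is a minimal sketch, not existing repository tooling: find_class_mismatches is a hypothetical helper that reads "namespace name class" triples on stdin and reports every Ingress whose class differs from the one the controller watches. The kubectl invocation in the trailing comment is one assumed way to produce that input.

```shell
# Print every Ingress whose ingressClassName differs from the class the
# active controller watches. Input: "namespace name class" per line on stdin.
# Returns non-zero if at least one mismatch is found.
# find_class_mismatches is a hypothetical helper name.
find_class_mismatches() {
  controller_class="$1"
  awk -v want="$controller_class" '
    NF >= 3 && $3 != want {
      printf "%s/%s uses class %s, controller watches %s\n", $1, $2, $3, want
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Assumed usage against a live cluster (the class name "public" is taken
# from the root cause above):
#   kubectl get ingress -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName \
#     | find_class_mismatches public
```

An Ingress with no ingressClassName is printed by kubectl as <none> in that column, so it would also be reported as a mismatch.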

Deliverable 1: Post-incident / RCA document

Location: hermes-mcp/docs/runbooks/2026-06-14-public-edge-outage-rca.md

Sections:

  • Summary — what was down, for how long, user impact.
  • Timeline — detection, mitigation, full restoration.
  • Root cause — Traefik/Docker edge + missing routes/certs + K8s ingress class mismatch + stopped containers.
  • Why detection failed — no synthetic uptime checks, no cert-expiry alerting, no Traefik routing alert, Docker restart did not catch stopped non-Hermes services.
  • Remediation actions taken — static cert extraction, file-provider routers, network attachment, container restarts, port conflict resolution.
  • Follow-up work — this plan's runbook and monitoring deliverables.

Deliverable 2: Deployment runbook

Location: hermes-mcp/docs/runbooks/deployment.md

The runbook will cover:

  1. Pre-flight checks
    • Confirm Traefik is attached to required networks (hermes-net, obsidian-net, fetcherpay).
    • Confirm all expected Docker networks exist.
    • Confirm static cert directory (/home/garfield/letsencrypt/manual/certs/) contains current certs for all file-provider domains.
  2. Deploy / update the edge proxy
    • Rebuild / restart Traefik from traefik-compose.yml.
    • Validate tls.yml routers, services, and certificate entries.
    • Smoke-test every public host immediately after restart.
  3. Deploy Hermes / SquareMCP (K8s path)
    • Build, push, update digest in hermes-k8s.yaml.
    • Apply manifests and wait for rollout.
    • Verify /health, /openapi-living-brief.json, OAuth endpoints, /api/pilot-request.
  4. Deploy FetcherPay stack (Docker path)
    • Export required env vars (or ensure .env is present).
    • docker compose -p fetcherpay up -d for web, api, mail, git, workflow.
    • Verify fetcherpay.com, mail.fetcherpay.com, git.fetcherpay.com, workflow.fetcherpay.com.
  5. Certificate renewal / rotation
    • When Traefik ACME works vs. when to fall back to K8s cert-manager secret extraction.
    • Step-by-step secret extraction command template.
  6. Rollback checklist
    • Revert image digest / compose change, restart, verify.
  7. Verification script
    • A single hermes-mcp/scripts/verify-public-endpoints.sh that curls every critical URL and exits non-zero on failure.
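
The verification script in step 7 could take roughly the following shape. This is a sketch under assumptions, not the final script: the function names, the FETCH_STATUS override, and the default expected status of 200 are all illustrative choices rather than anything already in the repository.

```shell
#!/bin/sh
# Sketch of scripts/verify-public-endpoints.sh (assumed layout).
# FETCH_STATUS names the command that prints an HTTP status code for a URL.
# It defaults to a curl wrapper and can be replaced by a stub in tests.
FETCH_STATUS="${FETCH_STATUS:-curl_status}"

curl_status() {
  # -s: quiet, -o /dev/null: discard the body, -w: print only the status code.
  # curl prints 000 when the connection or TLS verification fails, so an
  # untrusted (for example self-signed) certificate also counts as a failure.
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" </dev/null
}

# Read "url [expected_status]" lines from stdin; expected_status defaults
# to 200 (an assumption). Returns non-zero if any endpoint differs.
verify_all() {
  failures=0
  while read -r url expected; do
    [ -n "$url" ] || continue
    actual=$("$FETCH_STATUS" "$url") || actual=000
    if [ "$actual" = "${expected:-200}" ]; then
      echo "OK   $url ($actual)"
    else
      echo "FAIL $url (got $actual, expected ${expected:-200})"
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}
```

The runbook would then call something like `verify_all < endpoints.txt || exit 1`, where endpoints.txt is a hypothetical file listing one public URL per line with its expected status.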

Deliverable 3: Diagnostics, metrics, and probes

Two viable approaches. The recommended one keeps the current architecture and hardens it; the alternative migrates the edge to K8s.

Option A (recommended): Keep the current architecture and harden it

Why: Lowest risk, fastest to implement, directly protects against the exact failure modes we just saw.

Implementation pieces:

  1. Synthetic uptime probes (blackbox exporter)
    • Add prom/blackbox-exporter config inside the repo (e.g. hermes-mcp/monitoring/blackbox.yml).
    • Probe all public URLs every 60s: HTTPS, TLS cert validity, expected HTTP status.
    • Domains: hermes.squaremcp.com/openapi-living-brief.json, app.squaremcp.com, docs.squaremcp.com, squaremcp.com, www.squaremcp.com, tiktok.squaremcp.com, fetcherpay.com, www.fetcherpay.com, workflow.fetcherpay.com, mail.fetcherpay.com, git.fetcherpay.com.
    • Path-specific probes: POST /api/pilot-request, GET /auth/tiktok/start.
  2. Certificate expiry alerting
    • Blackbox probe_ssl_earliest_cert_expiry alert when any cert has < 7 days left.
    • Separate alert for Traefik default / self-signed cert (would fire immediately on a routing miss).
  3. Traefik routing health
    • Enable Traefik metrics endpoint (--metrics.prometheus).
    • Alert on a sustained 5xx rate (for example from traefik_service_requests_total, filtered by its code label) or on traefik_service_server_up == 0.
  4. Container health & restart policy
    • Ensure every commercial service has restart: unless-stopped and a Docker healthcheck.
    • Add a simple systemd user timer or cron that runs docker compose -p fetcherpay ps and alerts if any expected container is not Up.
  5. K8s ingress reconciliation check
    • A probe/script (hermes-mcp/scripts/check-k8s-ingress.sh) that confirms all squaremcp.com Ingresses have a matching ADDRESS and valid TLS secret.
    • Alert if kubectl get ingress -A shows missing addresses or if any cert-manager Certificate reports Ready=False.
  6. Hermes application metrics
    • Add a /metrics endpoint using prom-client in src/index.ts.
    • Instrument request latency, error rate, active OAuth sessions, tool call counts.
    • Scrape it from Prometheus.
  7. Separate readiness probe
    • Keep /health for liveness; add /ready that checks DB/Redis connectivity before reporting ready.
  8. Alertmanager + Slack / email
    • Deploy prom/alertmanager alongside Prometheus.
    • Route critical alerts (site down, cert expiring, service unhealthy) to a Slack webhook and/or email.
  9. Verification script
    • hermes-mcp/scripts/verify-public-endpoints.sh used in runbook and optionally in CI.
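
The container-up check from item 4 can be sketched as a small filter. It is a minimal illustration under assumptions: missing_services is a hypothetical helper name, and the service names in the usage comment are placeholders based on the containers named in the root cause, not a confirmed list from the compose file.

```shell
# Print each expected service that is absent from the list of running
# services read on stdin (one name per line). Returns non-zero if any
# expected service is missing.
# missing_services is a hypothetical helper name.
missing_services() {
  running=$(cat)
  rc=0
  for svc in "$@"; do
    # -x: match the whole line, -F: treat the name as a fixed string.
    if ! printf '%s\n' "$running" | grep -qxF -- "$svc"; then
      echo "$svc"
      rc=1
    fi
  done
  return "$rc"
}

# Assumed usage from cron or a systemd timer (service names are placeholders):
#   docker compose -p fetcherpay ps --services --status running \
#     | missing_services fetcherpay-web poste postgres gitea temporal-ui
```

A non-zero return can drive whatever alerting hook is chosen. Note that a container in the running state may still be unhealthy, which is why the Docker healthcheck in the same item remains necessary.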

Option B — Migrate public edge to K8s nginx-ingress

Why: Eliminates the split-ingress complexity that caused the routing confusion.

Implementation pieces:

  1. Reconcile ingressClassName: change the Ingress resources from nginx to public (or change the controller to nginx).
  2. Reconfigure Traefik to not bind public 80/443, or move it to an internal Docker-only role.
  3. Point public DNS/router directly at the K8s nginx-ingress controller (host-network or NodePort).
  4. Re-issue all certs via cert-manager and remove the static-cert workaround.
  5. Still add blackbox exporter / Alertmanager / Hermes metrics as in Option A.

Trade-off: Larger architectural change, risk of another outage during migration, but cleaner long term.
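
Before the static-cert workaround is removed in step 4, it may help to confirm that every cert-manager Certificate is actually Ready. The sketch below assumes a hypothetical helper, not_ready_certs, that reads "namespace name ready" lines on stdin. The kubectl command in the comment is an assumed way to produce that input.

```shell
# Print every cert-manager Certificate whose Ready condition is not "True".
# Input: "namespace name ready" per line on stdin.
# Returns non-zero if at least one Certificate is not ready.
# not_ready_certs is a hypothetical helper name.
not_ready_certs() {
  awk '
    NF >= 3 && $3 != "True" {
      printf "%s/%s Ready=%s\n", $1, $2, $3
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Assumed usage against a live cluster:
#   kubectl get certificate -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status' \
#     | not_ready_certs
```

The same check could back the cert-manager alert listed under Option A, since both paths depend on the Ready condition.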


Suggested file changes (all under hermes-mcp/)

  • New: docs/runbooks/2026-06-14-public-edge-outage-rca.md
  • New / rewrite: docs/runbooks/deployment.md
  • New: docs/runbooks/monitoring-playbook.md (alert runbook)
  • New: scripts/verify-public-endpoints.sh
  • New: scripts/check-k8s-ingress.sh
  • Modify: src/index.ts — add /metrics, /ready, enhance /health
  • Modify: hermes-k8s.yaml — add startup probe, resource requests/limits
  • New: monitoring/blackbox.yml, monitoring/prometheus.yml, monitoring/alert-rules.yml, monitoring/alertmanager.yml
  • Modify: root docker-compose.fetcherpay.yml, or create monitoring/docker-compose.monitoring.yml if the production compose file should stay untouched.

Phasing recommendation

  • Phase 1 (immediate): RCA doc + runbook + scripts/verify-public-endpoints.sh.
  • Phase 2 (this week): blackbox exporter + cert-expiry alerts + container-up check.
  • Phase 3 (next sprint): Hermes /metrics + dashboards + Alertmanager Slack routing.
  • Phase 4 (future): decide on the Option B edge migration once Phases 1-3 are stable.