# Plan: Document the outage, build a deployment runbook, and add diagnostics/monitoring ## Goal Turn the June 2026 public-edge outage into repeatable, observable infrastructure, with all artifacts stored in the SquareMCP repository (`/home/garfield/hermes-mcp/`). 1. Write a clear post-incident / RCA document. 2. Create a step-by-step deployment runbook that the next operator can follow without guessing. 3. Add probes, metrics, and alerting so the same class of failure is detected and escalated before users notice. --- ## Root cause (condensed) - **Public ports 80/443/8080 are owned by a Docker Traefik container.** Its iptables rules intercept all inbound traffic before the host-network K8s nginx-ingress can serve it. - **Traefik had no routers or valid TLS certificates** for the commercial `squaremcp.com` / `fetcherpay.com` domains, so it returned `404 page not found` with a self-signed cert. - **K8s cert-manager held valid certs**, but the active nginx-ingress controller uses `ingressClass=public` while the Ingress resources use `ingressClassName=nginx`, so K8s never reconciled them and could not serve traffic anyway. - **Several Docker backends were stopped**: `fetcherpay-web`, `poste`, `postgres`, `gitea`. The `temporal-ui` container was running but Traefik was pointed at its gRPC port (`7233`) instead of its HTTP UI port (`8080`). --- ## Deliverable 1: Post-incident / RCA document **Location:** `hermes-mcp/docs/runbooks/2026-06-14-public-edge-outage-rca.md` Sections: - **Summary** — what was down, for how long, user impact. - **Timeline** — detection, mitigation, full restoration. - **Root cause** — Traefik/Docker edge + missing routes/certs + K8s ingress class mismatch + stopped containers. - **Why detection failed** — no synthetic uptime checks, no cert-expiry alerting, no Traefik routing alert, Docker restart did not catch stopped non-Hermes services. - **Remediation actions taken** — static cert extraction, file-provider routers, network attachment, container restarts, port conflict resolution. - **Follow-up work** — this plan’s runbook and monitoring deliverables. --- ## Deliverable 2: Deployment runbook **Location:** `hermes-mcp/docs/runbooks/deployment.md` The runbook will cover: 1. **Pre-flight checks** - Confirm Traefik is attached to required networks (`hermes-net`, `obsidian-net`, `fetcherpay`). - Confirm all expected Docker networks exist. - Confirm static cert directory (`/home/garfield/letsencrypt/manual/certs/`) contains current certs for all file-provider domains. 2. **Deploy / update the edge proxy** - Rebuild / restart Traefik from `traefik-compose.yml`. - Validate `tls.yml` routers, services, and certificate entries. - Smoke-test every public host immediately after restart. 3. **Deploy Hermes / SquareMCP (K8s path)** - Build, push, update digest in `hermes-k8s.yaml`. - Apply manifests and wait for rollout. - Verify `/health`, `/openapi-living-brief.json`, OAuth endpoints, `/api/pilot-request`. 4. **Deploy FetcherPay stack (Docker path)** - Export required env vars (or ensure `.env` is present). - `docker compose -p fetcherpay up -d` for web, api, mail, git, workflow. - Verify `fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`, `workflow.fetcherpay.com`. 5. **Certificate renewal / rotation** - When Traefik ACME works vs. when to fall back to K8s cert-manager secret extraction. - Step-by-step secret extraction command template. 6. **Rollback checklist** - Revert image digest / compose change, restart, verify. 7. **Verification script** - A single `hermes-mcp/scripts/verify-public-endpoints.sh` that curls every critical URL and exits non-zero on failure. --- ## Deliverable 3: Diagnostics, metrics, and probes Two viable approaches. The recommended one keeps the current architecture and hardens it; the alternative migrates the edge to K8s. ### Option A — Harden the existing Traefik edge (recommended) **Why:** Lowest risk, fastest to implement, directly protects against the exact failure modes we just saw. Implementation pieces: 1. **Synthetic uptime probes (blackbox exporter)** - Add `prom/blackbox-exporter` config inside the repo (e.g. `hermes-mcp/monitoring/blackbox.yml`). - Probe all public URLs every 60s: HTTPS, TLS cert validity, expected HTTP status. - Domains: `hermes.squaremcp.com/openapi-living-brief.json`, `app.squaremcp.com`, `docs.squaremcp.com`, `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`, `fetcherpay.com`, `www.fetcherpay.com`, `workflow.fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`. - Path-specific probes: `POST /api/pilot-request`, `GET /auth/tiktok/start`. 2. **Certificate expiry alerting** - Blackbox `probe_ssl_earliest_cert_expiry` alert when any cert has < 7 days left. - Separate alert for Traefik default / self-signed cert (would fire immediately on a routing miss). 3. **Traefik routing health** - Enable Traefik metrics endpoint (`--metrics.prometheus`). - Alert on `traefik_router_server_errors` or `traefik_service_server_up == 0`. 4. **Container health & restart policy** - Ensure every commercial service has `restart: unless-stopped` and a Docker `healthcheck`. - Add a simple systemd user timer or cron that runs `docker compose -p fetcherpay ps` and alerts if any expected container is not `Up`. 5. **K8s ingress reconciliation check** - A probe/script (`hermes-mcp/scripts/check-k8s-ingress.sh`) that confirms all `squaremcp.com` Ingresses have a matching `ADDRESS` and valid TLS secret. - Alert if `kubectl get ingress -A` shows missing addresses or cert-manager `CertificateReady=False`. 6. **Hermes application metrics** - Add a `/metrics` endpoint using `prom-client` in `src/index.ts`. - Instrument request latency, error rate, active OAuth sessions, tool call counts. - Scrape it from Prometheus. 7. **Separate readiness probe** - Keep `/health` for liveness; add `/ready` that checks DB/Redis connectivity before reporting ready. 8. **Alertmanager + Slack / email** - Deploy `prom/alertmanager` alongside Prometheus. - Route critical alerts (site down, cert expiring, service unhealthy) to a Slack webhook and/or email. 9. **Verification script** - `hermes-mcp/scripts/verify-public-endpoints.sh` used in runbook and optionally in CI. ### Option B — Migrate public edge to K8s nginx-ingress **Why:** Eliminates the split-ingress complexity that caused the routing confusion. Implementation pieces: 1. Reconcile `ingressClassName: nginx` → `public` (or change the controller to `nginx`). 2. Reconfigure Traefik to not bind public 80/443, or move it to an internal Docker-only role. 3. Point public DNS/router directly at the K8s nginx-ingress controller (host-network or NodePort). 4. Re-issue all certs via cert-manager and remove the static-cert workaround. 5. Still add blackbox exporter / Alertmanager / Hermes metrics as in Option A. **Trade-off:** Larger architectural change, risk of another outage during migration, but cleaner long term. --- ## Suggested file changes (all under `hermes-mcp/`) - **New:** `docs/runbooks/2026-06-14-public-edge-outage-rca.md` - **New / rewrite:** `docs/runbooks/deployment.md` - **New:** `docs/runbooks/monitoring-playbook.md` (alert runbook) - **New:** `scripts/verify-public-endpoints.sh` - **New:** `scripts/check-k8s-ingress.sh` - **Modify:** `src/index.ts` — add `/metrics`, `/ready`, enhance `/health` - **Modify:** `hermes-k8s.yaml` — add startup probe, resource requests/limits - **New:** `monitoring/blackbox.yml`, `monitoring/prometheus.yml`, `monitoring/alert-rules.yml`, `monitoring/alertmanager.yml` - **Modify:** root `docker-compose.fetcherpay.yml` or create `monitoring/docker-compose.monitoring.yml` if the user prefers not to touch the prod compose file. --- ## Phasing recommendation - **Phase 1 (immediate):** RCA doc + runbook + `scripts/verify-public-endpoints.sh`. - **Phase 2 (this week):** blackbox exporter + cert-expiry alerts + container-up check. - **Phase 3 (next sprint):** Hermes `/metrics` + dashboards + Alertmanager Slack routing. - **Phase 4 (future):** decide on Option B edge migration after Phase 1–3 are stable.