# Plan: Document the outage, build a deployment runbook, and add diagnostics/monitoring

## Goal
Turn the lessons of the June 2026 public-edge outage into a repeatable deployment process and observable infrastructure, with all artifacts stored in the SquareMCP repository (`/home/garfield/hermes-mcp/`):
1. Write a clear post-incident / RCA document.
2. Create a step-by-step deployment runbook that the next operator can follow without guessing.
3. Add probes, metrics, and alerting so the same class of failure is detected and escalated before users notice.

---

## Root cause (condensed)
- **Public ports 80/443/8080 are owned by a Docker Traefik container.** Docker's iptables rules for those published ports intercept inbound traffic before the host-network K8s nginx-ingress can serve it.
- **Traefik had no routers or valid TLS certificates** for the commercial `squaremcp.com` / `fetcherpay.com` domains, so it returned `404 page not found` with a self-signed cert.
- **K8s cert-manager held valid certs**, but the active nginx-ingress controller uses `ingressClass=public` while the Ingress resources use `ingressClassName=nginx`. The controller therefore never reconciled them and could not have served traffic even if requests had reached the cluster.
- **Several Docker backends were stopped**: `fetcherpay-web`, `poste`, `postgres`, `gitea`. The `temporal-ui` container was running but Traefik was pointed at its gRPC port (`7233`) instead of its HTTP UI port (`8080`).
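
The ingress class mismatch described above can be detected mechanically. The following is a minimal sketch, not the project's actual tooling: the function name `find_mismatched_ingresses` and the three-column input layout are assumptions made for illustration, and `public` is the controller class named in the root cause.

```shell
#!/usr/bin/env bash
# find_mismatched_ingresses EXPECTED_CLASS   (hypothetical helper)
# Reads "NAMESPACE NAME CLASS" rows on stdin and prints every Ingress whose
# class differs from EXPECTED_CLASS, including rows with no class column.
find_mismatched_ingresses() {
  local expected="$1"
  awk -v expected="$expected" '
    NF >= 2 && $3 != expected {
      printf "%s/%s class=%s (controller expects %s)\n",
             $1, $2, ($3 == "" ? "<none>" : $3), expected
    }'
}

# Intended use against a live cluster (shown as a comment, not executed here):
#   kubectl get ingress -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName \
#     | find_mismatched_ingresses public
```

Any output means at least one Ingress will be ignored by the active controller, which is the condition that left the cert-manager certificates unused during the outage.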

---

## Deliverable 1: Post-incident / RCA document
**Location:** `hermes-mcp/docs/runbooks/2026-06-14-public-edge-outage-rca.md`

Sections:
- **Summary** — what was down, for how long, user impact.
- **Timeline** — detection, mitigation, full restoration.
- **Root cause** — Traefik/Docker edge + missing routes/certs + K8s ingress class mismatch + stopped containers.
- **Why detection failed** — no synthetic uptime checks, no cert-expiry alerting, no Traefik routing alert, Docker restart did not catch stopped non-Hermes services.
- **Remediation actions taken** — static cert extraction, file-provider routers, network attachment, container restarts, port conflict resolution.
- **Follow-up work** — this plan’s runbook and monitoring deliverables.

---

## Deliverable 2: Deployment runbook
**Location:** `hermes-mcp/docs/runbooks/deployment.md`

The runbook will cover:
1. **Pre-flight checks**
   - Confirm Traefik is attached to required networks (`hermes-net`, `obsidian-net`, `fetcherpay`).
   - Confirm all expected Docker networks exist.
   - Confirm static cert directory (`/home/garfield/letsencrypt/manual/certs/`) contains current certs for all file-provider domains.
2. **Deploy / update the edge proxy**
   - Rebuild / restart Traefik from `traefik-compose.yml`.
   - Validate `tls.yml` routers, services, and certificate entries.
   - Smoke-test every public host immediately after restart.
3. **Deploy Hermes / SquareMCP (K8s path)**
   - Build and push the image, then update its digest in `hermes-k8s.yaml`.
   - Apply manifests and wait for rollout.
   - Verify `/health`, `/openapi-living-brief.json`, OAuth endpoints, `/api/pilot-request`.
4. **Deploy FetcherPay stack (Docker path)**
   - Export required env vars (or ensure `.env` is present).
   - Run `docker compose -p fetcherpay up -d` to start the web, api, mail, git, and workflow services.
   - Verify `fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`, `workflow.fetcherpay.com`.
5. **Certificate renewal / rotation**
   - When to rely on Traefik ACME and when to fall back to extracting certs from K8s cert-manager secrets.
   - Step-by-step secret extraction command template.
6. **Rollback checklist**
   - Revert image digest / compose change, restart, verify.
7. **Verification script**
   - A single `hermes-mcp/scripts/verify-public-endpoints.sh` that curls every critical URL and exits non-zero on failure.
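
As a starting point for the verification script in step 7, the sketch below shows one possible shape. It is an assumption-laden outline rather than the final script: the function names (`status_ok`, `probe`, `verify_all`), the 10-second timeout, and the choice to treat any 2xx or 3xx response as success are all illustrative. The URLs come from the probe targets listed in this plan. Because `curl` verifies TLS certificates by default, a host serving a self-signed certificate yields status `000` and is reported as a failure.

```shell
#!/usr/bin/env bash
# Sketch of scripts/verify-public-endpoints.sh (names and thresholds are assumptions).

URLS=(
  "https://hermes.squaremcp.com/openapi-living-brief.json"
  "https://app.squaremcp.com"
  "https://docs.squaremcp.com"
  "https://squaremcp.com"
  "https://fetcherpay.com"
  "https://workflow.fetcherpay.com"
  "https://mail.fetcherpay.com"
  "https://git.fetcherpay.com"
)

# status_ok CODE: succeed for 2xx/3xx (assumed success criterion).
status_ok() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# probe URL: print the HTTP status code; curl prints 000 when the
# connection or TLS handshake fails.
probe() {
  curl --silent --output /dev/null --max-time 10 \
       --write-out '%{http_code}' "$1" || true
}

# verify_all URL...: check every URL, report each result, and return
# non-zero if any check failed.
verify_all() {
  local failures=0 url code
  for url in "$@"; do
    code="$(probe "$url")"
    if status_ok "$code"; then
      echo "OK   $code $url"
    else
      echo "FAIL $code $url" >&2
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}

# Entry point for the real script (commented out in this sketch):
#   verify_all "${URLS[@]}"
```

The `POST /api/pilot-request` check is left out on purpose, since its expected request body and response status still need to be defined.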

---

## Deliverable 3: Diagnostics, metrics, and probes
Two viable approaches. The recommended one keeps the current architecture and hardens it; the alternative migrates the edge to K8s.

### Option A — Harden the existing Traefik edge (recommended)
**Why:** Lowest risk, fastest to implement, directly protects against the exact failure modes we just saw.

Implementation pieces:
1. **Synthetic uptime probes (blackbox exporter)**
   - Add `prom/blackbox-exporter` config inside the repo (e.g. `hermes-mcp/monitoring/blackbox.yml`).
   - Probe all public URLs every 60s: HTTPS, TLS cert validity, expected HTTP status.
   - Targets: `hermes.squaremcp.com/openapi-living-brief.json`, `app.squaremcp.com`, `docs.squaremcp.com`, `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`, `fetcherpay.com`, `www.fetcherpay.com`, `workflow.fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`.
   - Path-specific probes: `POST /api/pilot-request`, `GET /auth/tiktok/start`.
2. **Certificate expiry alerting**
   - Alert on blackbox `probe_ssl_earliest_cert_expiry` when any cert has fewer than 7 days left.
   - Separate alert for Traefik default / self-signed cert (would fire immediately on a routing miss).
3. **Traefik routing health**
   - Enable Traefik metrics endpoint (`--metrics.prometheus`).
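   - Alert on the 5xx rate derived from `traefik_router_requests_total` (via its `code` label; router-level metrics typically require `--metrics.prometheus.addRoutersLabels=true`) or on `traefik_service_server_up == 0`.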
4. **Container health & restart policy**
   - Ensure every commercial service has `restart: unless-stopped` and a Docker `healthcheck`.
   - Add a simple systemd user timer or cron job that runs `docker compose -p fetcherpay ps` and alerts if any expected container is not `Up`.
5. **K8s ingress reconciliation check**
   - A probe/script (`hermes-mcp/scripts/check-k8s-ingress.sh`) that confirms every `squaremcp.com` Ingress has a populated `ADDRESS` and a valid TLS secret.
   - Alert if `kubectl get ingress -A` shows missing addresses or a cert-manager Certificate reports `Ready=False`.
6. **Hermes application metrics**
   - Add a `/metrics` endpoint using `prom-client` in `src/index.ts`.
   - Instrument request latency, error rate, active OAuth sessions, tool call counts.
   - Scrape it from Prometheus.
7. **Separate readiness probe**
   - Keep `/health` for liveness; add `/ready` that checks DB/Redis connectivity before reporting ready.
8. **Alertmanager + Slack / email**
   - Deploy `prom/alertmanager` alongside Prometheus.
   - Route critical alerts (site down, cert expiring, service unhealthy) to a Slack webhook and/or email.
9. **Verification script**
   - `hermes-mcp/scripts/verify-public-endpoints.sh` used in runbook and optionally in CI.
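
The container-up check from item 4 could look roughly like the sketch below. It is illustrative only: `missing_services` is a hypothetical helper, the service names are taken from the containers named in the root cause and may not match the actual Compose service names, and the `--services --status running` flags assume Docker Compose v2.

```shell
#!/usr/bin/env bash
# missing_services EXPECTED...   (hypothetical helper)
# Reads the names of running services (one per line) on stdin, prints each
# expected service that is absent, and returns non-zero if any are missing.
missing_services() {
  local running missing=0 svc
  running="$(cat)"
  for svc in "$@"; do
    # -F literal match, -x whole line, so "post" does not match "postgres".
    if ! grep -Fxq -- "$svc" <<<"$running"; then
      echo "$svc"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Intended use from a timer or cron job (shown as a comment, not executed here):
#   docker compose -p fetcherpay ps --services --status running \
#     | missing_services fetcherpay-web poste postgres gitea temporal-ui \
#     || echo "ALERT: expected fetcherpay services are not running"
```

The `echo` is a placeholder for whatever notification channel is chosen in item 8; the function only decides whether something is missing.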

### Option B — Migrate public edge to K8s nginx-ingress
**Why:** Eliminates the split-ingress complexity that caused the routing confusion.

Implementation pieces:
1. Change the Ingress resources from `ingressClassName: nginx` to `public` (or reconfigure the controller to use class `nginx`) so the two match.
2. Reconfigure Traefik to not bind public 80/443, or move it to an internal Docker-only role.
3. Point public DNS/router directly at the K8s nginx-ingress controller (host-network or NodePort).
4. Re-issue all certs via cert-manager and remove the static-cert workaround.
5. Still add blackbox exporter / Alertmanager / Hermes metrics as in Option A.

**Trade-off:** Larger architectural change, risk of another outage during migration, but cleaner long term.

---

## Suggested file changes (all under `hermes-mcp/`)
- **New:** `docs/runbooks/2026-06-14-public-edge-outage-rca.md`
- **New / rewrite:** `docs/runbooks/deployment.md`
- **New:** `docs/runbooks/monitoring-playbook.md` (alert runbook)
- **New:** `scripts/verify-public-endpoints.sh`
- **New:** `scripts/check-k8s-ingress.sh`
- **Modify:** `src/index.ts` — add `/metrics`, `/ready`, enhance `/health`
- **Modify:** `hermes-k8s.yaml` — add startup probe, resource requests/limits
- **New:** `monitoring/blackbox.yml`, `monitoring/prometheus.yml`, `monitoring/alert-rules.yml`, `monitoring/alertmanager.yml`
- **Modify:** the root `docker-compose.fetcherpay.yml`, or create `monitoring/docker-compose.monitoring.yml` instead if the prod compose file should stay untouched.

---

## Phasing recommendation
- **Phase 1 (immediate):** RCA doc + runbook + `scripts/verify-public-endpoints.sh`.
- **Phase 2 (this week):** blackbox exporter + cert-expiry alerts + container-up check.
- **Phase 3 (next sprint):** Hermes `/metrics` + dashboards + Alertmanager Slack routing.
- **Phase 4 (future):** decide on the Option B edge migration once Phases 1–3 are stable.