Plan: Document the outage, build a deployment runbook, and add diagnostics/monitoring
Goal
Turn the June 2026 public-edge outage into repeatable, observable infrastructure, with all artifacts stored in the SquareMCP repository (/home/garfield/hermes-mcp/).
- Write a clear post-incident / RCA document.
- Create a step-by-step deployment runbook that the next operator can follow without guessing.
- Add probes, metrics, and alerting so the same class of failure is detected and escalated before users notice.
Root cause (condensed)
- Public ports 80/443/8080 are owned by a Docker Traefik container. Its iptables rules intercept all inbound traffic before the host-network K8s nginx-ingress can serve it.
- Traefik had no routers or valid TLS certificates for the commercial `squaremcp.com` / `fetcherpay.com` domains, so it returned `404 page not found` with a self-signed cert.
- K8s cert-manager held valid certs, but the active nginx-ingress controller uses `ingressClass=public` while the Ingress resources use `ingressClassName=nginx`, so K8s never reconciled them and could not serve traffic anyway.
- Several Docker backends were stopped: `fetcherpay-web`, `poste`, `postgres`, `gitea`. The `temporal-ui` container was running but Traefik was pointed at its gRPC port (7233) instead of its HTTP UI port (8080).
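The ingress class mismatch described above can be made visible with a short check. This is a minimal sketch, assuming `kubectl` access to the cluster and that `public` is the class the active controller watches (as stated in the root cause); the function name `find_class_mismatches` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list Ingresses whose class differs from the one the active
# controller watches. The function name is illustrative only.

# Reads "namespace name class" triples on stdin and prints the mismatches.
# kubectl prints "<none>" for an Ingress that sets no ingressClassName.
find_class_mismatches() {
  awk -v want="$1" '$3 != want {
    print $1 "/" $2 " has class " $3 ", controller expects " want
  }'
}

# Usage (needs cluster access):
#   kubectl get ingress -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName \
#     | find_class_mismatches public
```

Any output line points at an Ingress the controller will ignore, which is the state the cluster was in during the outage.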
Deliverable 1: Post-incident / RCA document
Location: hermes-mcp/docs/runbooks/2026-06-14-public-edge-outage-rca.md
Sections:
- Summary — what was down, for how long, user impact.
- Timeline — detection, mitigation, full restoration.
- Root cause — Traefik/Docker edge + missing routes/certs + K8s ingress class mismatch + stopped containers.
- Why detection failed — no synthetic uptime checks, no cert-expiry alerting, no Traefik routing alert, Docker restart did not catch stopped non-Hermes services.
- Remediation actions taken — static cert extraction, file-provider routers, network attachment, container restarts, port conflict resolution.
- Follow-up work — this plan’s runbook and monitoring deliverables.
Deliverable 2: Deployment runbook
Location: hermes-mcp/docs/runbooks/deployment.md
The runbook will cover:
- Pre-flight checks
  - Confirm Traefik is attached to required networks (`hermes-net`, `obsidian-net`, `fetcherpay`).
  - Confirm all expected Docker networks exist.
  - Confirm static cert directory (`/home/garfield/letsencrypt/manual/certs/`) contains current certs for all file-provider domains.
- Deploy / update the edge proxy
  - Rebuild / restart Traefik from `traefik-compose.yml`.
  - Validate `tls.yml` routers, services, and certificate entries.
  - Smoke-test every public host immediately after restart.
- Deploy Hermes / SquareMCP (K8s path)
  - Build, push, update digest in `hermes-k8s.yaml`.
  - Apply manifests and wait for rollout.
  - Verify `/health`, `/openapi-living-brief.json`, OAuth endpoints, `/api/pilot-request`.
- Deploy FetcherPay stack (Docker path)
  - Export required env vars (or ensure `.env` is present).
  - `docker compose -p fetcherpay up -d` for web, api, mail, git, workflow.
  - Verify `fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`, `workflow.fetcherpay.com`.
- Certificate renewal / rotation
  - When Traefik ACME works vs. when to fall back to K8s cert-manager secret extraction.
  - Step-by-step secret extraction command template.
- Rollback checklist
  - Revert image digest / compose change, restart, verify.
- Verification script
  - A single `hermes-mcp/scripts/verify-public-endpoints.sh` that curls every critical URL and exits non-zero on failure.
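The verification script named in the last runbook item could start from something like the following. It is a sketch under stated assumptions, not the final script: it treats HTTP 200 as the only passing status (hosts that redirect would need their own expected code), and the function names `http_status` and `verify_endpoints` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of scripts/verify-public-endpoints.sh. The expected status (200)
# and the function names are assumptions, not settled design.

# Print the HTTP status code for one URL. curl reports 000 when the
# connection or TLS handshake fails; it verifies certificates by default,
# so a self-signed edge cert also ends up as 000.
http_status() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$1"
}

# Check every URL passed as an argument; return 1 if any is not 200.
verify_endpoints() {
  local failed=0 url code
  for url in "$@"; do
    code="$(http_status "$url")" || true
    if [ "$code" = "200" ]; then
      echo "OK   $code $url"
    else
      echo "FAIL $code $url"
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   verify_endpoints \
#     "https://hermes.squaremcp.com/openapi-living-brief.json" \
#     "https://fetcherpay.com" || exit 1
```

Keeping the status lookup in its own function makes it easy to swap in per-URL expectations later, for example a different code for the `POST /api/pilot-request` probe.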
Deliverable 3: Diagnostics, metrics, and probes
Two viable approaches. The recommended one keeps the current architecture and hardens it; the alternative migrates the edge to K8s.
Option A — Harden the existing Traefik edge (recommended)
Why: Lowest risk, fastest to implement, directly protects against the exact failure modes we just saw.
Implementation pieces:
- Synthetic uptime probes (blackbox exporter)
  - Add `prom/blackbox-exporter` config inside the repo (e.g. `hermes-mcp/monitoring/blackbox.yml`).
  - Probe all public URLs every 60s: HTTPS, TLS cert validity, expected HTTP status.
  - Domains: `hermes.squaremcp.com/openapi-living-brief.json`, `app.squaremcp.com`, `docs.squaremcp.com`, `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`, `fetcherpay.com`, `www.fetcherpay.com`, `workflow.fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`.
  - Path-specific probes: `POST /api/pilot-request`, `GET /auth/tiktok/start`.
- Certificate expiry alerting
  - Blackbox `probe_ssl_earliest_cert_expiry` alert when any cert has < 7 days left.
  - Separate alert for Traefik default / self-signed cert (would fire immediately on a routing miss).
- Traefik routing health
  - Enable Traefik metrics endpoint (`--metrics.prometheus`).
  - Alert on a sustained 5xx rate (for example from `traefik_service_requests_total`, filtered on its `code` label) or on `traefik_service_server_up == 0`.
- Container health & restart policy
  - Ensure every commercial service has `restart: unless-stopped` and a Docker `healthcheck`.
  - Add a simple systemd user timer or cron that runs `docker compose -p fetcherpay ps` and alerts if any expected container is not `Up`.
- K8s ingress reconciliation check
  - A probe/script (`hermes-mcp/scripts/check-k8s-ingress.sh`) that confirms all `squaremcp.com` Ingresses have a matching `ADDRESS` and valid TLS secret.
  - Alert if `kubectl get ingress -A` shows missing addresses or cert-manager `Certificate` `Ready=False`.
- Hermes application metrics
  - Add a `/metrics` endpoint using `prom-client` in `src/index.ts`.
  - Instrument request latency, error rate, active OAuth sessions, tool call counts.
  - Scrape it from Prometheus.
- Separate readiness probe
  - Keep `/health` for liveness; add `/ready` that checks DB/Redis connectivity before reporting ready.
- Alertmanager + Slack / email
  - Deploy `prom/alertmanager` alongside Prometheus.
  - Route critical alerts (site down, cert expiring, service unhealthy) to a Slack webhook and/or email.
- Verification script
  - `hermes-mcp/scripts/verify-public-endpoints.sh` used in the runbook and optionally in CI.
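The container-up check under "Container health & restart policy" might look roughly like this. It is a sketch only: the service names in the usage comment are placeholders taken from the runbook's "web, api, mail, git, workflow" wording and may not match the real compose service names, the function name `missing_services` is hypothetical, and the `--status` filter assumes Docker Compose v2.

```shell
#!/usr/bin/env bash
# Sketch for the container-up check. Service names are placeholders;
# take the real ones from the compose file.

# Reads the names of running services on stdin (one per line) and prints
# every expected service (given as arguments) that is not among them.
missing_services() {
  local running svc
  running="$(cat)"
  for svc in "$@"; do
    printf '%s\n' "$running" | grep -qxF -- "$svc" || echo "$svc"
  done
}

# Usage from cron or a systemd timer (Compose v2 syntax):
#   missing="$(docker compose -p fetcherpay ps --services --status running \
#     | missing_services web api mail git workflow)"
#   [ -z "$missing" ] || echo "not running: $missing"   # hand off to alerting
```

Comparing against an explicit expected list matters here: a stopped container simply disappears from the running set, so a check that only inspects what is running would have missed the stopped backends.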
Option B — Migrate public edge to K8s nginx-ingress
Why: Eliminates the split-ingress complexity that caused the routing confusion.
Implementation pieces:
- Reconcile `ingressClassName: nginx` → `public` (or change the controller to `nginx`).
- Reconfigure Traefik to not bind public 80/443, or move it to an internal Docker-only role.
- Point public DNS/router directly at the K8s nginx-ingress controller (host-network or NodePort).
- Re-issue all certs via cert-manager and remove the static-cert workaround.
- Still add blackbox exporter / Alertmanager / Hermes metrics as in Option A.
Trade-off: Larger architectural change, risk of another outage during migration, but cleaner long term.
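The first Option B step, reconciling the ingress class, can be previewed before anything is changed. The sketch below only prints `kubectl patch` commands for review and runs none of them. It assumes the Ingresses set their class through `spec.ingressClassName` rather than the legacy annotation, and the function name `print_class_patches` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: generate (not run) the patch commands that would move Ingresses
# to the class the controller watches. The function name is illustrative.

# Reads "namespace name" pairs on stdin and prints one kubectl patch
# command per Ingress, setting spec.ingressClassName to the given class.
print_class_patches() {
  local class="$1" ns name
  while read -r ns name; do
    [ -n "$name" ] || continue   # skip blank or malformed lines
    printf "kubectl patch ingress %s -n %s --type=merge -p '{\"spec\":{\"ingressClassName\":\"%s\"}}'\n" \
      "$name" "$ns" "$class"
  done
}

# Usage (review the output before running any of it):
#   kubectl get ingress -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name \
#     | print_class_patches public
```

Printing the commands instead of applying them fits the trade-off noted above: the migration carries outage risk, so each change should be reviewed first.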
Suggested file changes (all under hermes-mcp/)
- New: `docs/runbooks/2026-06-14-public-edge-outage-rca.md`
- New / rewrite: `docs/runbooks/deployment.md`
- New: `docs/runbooks/monitoring-playbook.md` (alert runbook)
- New: `scripts/verify-public-endpoints.sh`
- New: `scripts/check-k8s-ingress.sh`
- Modify: `src/index.ts` (add `/metrics`, `/ready`, enhance `/health`)
- Modify: `hermes-k8s.yaml` (add startup probe, resource requests/limits)
- New: `monitoring/blackbox.yml`, `monitoring/prometheus.yml`, `monitoring/alert-rules.yml`, `monitoring/alertmanager.yml`
- Modify: root `docker-compose.fetcherpay.yml` or create `monitoring/docker-compose.monitoring.yml` if the user prefers not to touch the prod compose file.
Phasing recommendation
- Phase 1 (immediate): RCA doc + runbook + `scripts/verify-public-endpoints.sh`.
- Phase 2 (this week): blackbox exporter + cert-expiry alerts + container-up check.
- Phase 3 (next sprint): Hermes `/metrics` + dashboards + Alertmanager Slack routing.
- Phase 4 (future): decide on Option B edge migration after Phases 1–3 are stable.