hermes-mcp/docs/runbooks/2026-06-14-public-edge-outage-rca.md
2026-06-14 12:26:34 -04:00

Public Edge Outage — Root Cause Analysis

Date: 2026-06-14
Severity: High — all public squaremcp.com and fetcherpay.com properties unreachable or certificate-invalid.
Status: Resolved. All listed commercial domains reachable with valid TLS.


Summary

On 2026-06-14, every public-facing SquareMCP / FetcherPay domain was either returning 404 page not found or serving an invalid/default TLS certificate. The root cause was a misconfigured public edge proxy combined with stopped backends and a K8s ingress class mismatch. Traffic from the internet never reached the Kubernetes nginx-ingress controller that held valid certificates; instead it was intercepted by a Docker Traefik container that had no routes and no valid certificates for the affected domains.


Timeline (all times UTC-4)

  • ~09:30: User reports that commercial sites are not reachable.
  • 09:30–10:00: Diagnosis. The Traefik container owns public :80/:443, serves only its default cert, and has no routers for *.squaremcp.com / *.fetcherpay.com.
  • 10:00–10:30: Added file-provider routers and static K8s-extracted certificates for squaremcp.com, www.squaremcp.com, app.squaremcp.com, docs.squaremcp.com, tiktok.squaremcp.com.
  • 10:30–11:00: Fixed fetcherpay.com / www.fetcherpay.com by attaching Traefik to the fetcherpay Docker network and starting the stopped fetcherpay-web container.
  • 11:00–11:30: Fixed workflow.fetcherpay.com (Traefik was routing to gRPC port 7233 instead of HTTP UI port 8080).
  • 11:30–12:00: Fixed mail.fetcherpay.com by starting poste, extracting the K8s cert, and adding a Traefik router/service.
  • 12:00–13:30: Fixed git.fetcherpay.com by starting postgres and gitea, extracting the K8s cert, adding a router/service, and resolving a host port 2222 conflict by remapping Gitea SSH to 22222.
  • 13:30–14:00: Final verification of all domains and Hermes path-specific routes.

Root cause

1. Docker Traefik intercepts all public ingress

  • The Traefik v3 container binds host ports 80, 443, and 8080.
  • Docker publishes these ports with iptables DNAT rules installed by the Docker daemon, alongside docker-proxy listeners on the host ports.
  • Those rules intercept all inbound public HTTP/S traffic before it can reach the host-network MicroK8s nginx-ingress controller.
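
To confirm which host ports are being redirected into a container, the DNAT rules can be read back from the nat table. The sketch below parses text in the format printed by `iptables -t nat -S DOCKER`; the sample rules and the 172.18.0.5 address are illustrative assumptions, not values captured during the incident.

```python
import re

# Matches DNAT rules in the nat table's DOCKER chain, for example:
# -A DOCKER ! -i docker0 -p tcp -m tcp --dport 443 -j DNAT --to-destination 172.18.0.5:443
DNAT_RULE = re.compile(
    r"--dport (?P<port>\d+) .*-j DNAT --to-destination (?P<target>\S+)"
)

# Illustrative sample only; the container address is an assumption.
SAMPLE_RULES = """\
-N DOCKER
-A DOCKER -i docker0 -j RETURN
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 80 -j DNAT --to-destination 172.18.0.5:80
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 443 -j DNAT --to-destination 172.18.0.5:443
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 8080 -j DNAT --to-destination 172.18.0.5:8080
"""


def published_ports(rules_text: str) -> dict[int, str]:
    """Map each DNAT'd host port to the container address it is redirected to."""
    ports = {}
    for line in rules_text.splitlines():
        match = DNAT_RULE.search(line)
        if match:
            ports[int(match.group("port"))] = match.group("target")
    return ports


def intercepted_edge_ports(rules_text: str, edge_ports=(80, 443)) -> dict[int, str]:
    """Return only the public edge ports that Docker redirects to a container."""
    found = published_ports(rules_text)
    return {port: found[port] for port in edge_ports if port in found}
```

A non-empty result for ports 80 and 443 means inbound traffic on those ports is rewritten toward a container before any host-network listener sees it.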

2. Traefik had no routes or valid TLS for the commercial domains

  • Traefik's dynamic config comes from Docker labels and a file provider (/home/garfield/letsencrypt/manual/tls.yml).
  • At the start of the incident the file provider only had a partial set of routers.
  • There were no valid Let's Encrypt certificates for most domains because GoDaddy DNS-01 returns DUPLICATE_RECORD for _acme-challenge.* TXT records, blocking issuance.
  • Result: any request for an unmatched host fell through to Traefik's default self-signed certificate and returned 404 page not found.
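
The fall-through described above can be checked ahead of time by comparing the hosts that must be served against the hosts named in router rules. This is a minimal sketch, assuming the file-provider YAML has already been loaded into a Python dict and that rules use the single-domain Host(`...`) matcher; routers defined through Docker labels are not covered, and the function names are hypothetical.

```python
import re

# Extracts the domain from each Host(`example.com`) matcher in a router rule.
HOST_MATCHER = re.compile(r"Host\(`([^`]+)`\)")


def routed_hosts(dynamic_config: dict) -> set[str]:
    """Collect every host named in an HTTP router rule of the dynamic config."""
    routers = dynamic_config.get("http", {}).get("routers", {})
    hosts = set()
    for router in routers.values():
        hosts.update(HOST_MATCHER.findall(router.get("rule", "")))
    return hosts


def unrouted_hosts(required: list[str], dynamic_config: dict) -> list[str]:
    """Return required hosts with no matching router, sorted for stable output."""
    return sorted(set(required) - routed_hosts(dynamic_config))
```

Any host returned by `unrouted_hosts` would be answered with the default certificate and a 404 rather than reaching a backend.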

3. K8s nginx-ingress was unreachable even though it had valid certs

  • Cert-manager inside MicroK8s held valid TLS secrets for the affected domains.
  • The active nginx-ingress-microk8s controller is configured for ingressClass=public.
  • Most Ingress resources specify ingressClassName: nginx.
  • Because of the class mismatch, those Ingresses were never reconciled by the active controller, so K8s could not serve traffic even if Traefik had forwarded it.
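
The class mismatch can be surfaced by comparing each Ingress's declared class with the class the active controller watches. The sketch below assumes its input is the parsed JSON of `kubectl get ingress -A -o json`; the Ingress names in the sample are hypothetical. As a simplification, an Ingress with no declared class is also reported, although a default IngressClass could still pick it up.

```python
def mismatched_ingresses(ingress_list: dict, active_class: str) -> list:
    """Return (namespace, name, declared_class) for Ingresses the controller ignores."""
    mismatches = []
    for item in ingress_list.get("items", []):
        meta = item["metadata"]
        declared = item.get("spec", {}).get("ingressClassName")
        if declared is None:
            # Fall back to the older annotation form of the class setting.
            declared = meta.get("annotations", {}).get("kubernetes.io/ingress.class")
        if declared != active_class:
            mismatches.append((meta["namespace"], meta["name"], declared))
    return mismatches


# Hypothetical sample: one Ingress on the wrong class, one on the active class.
SAMPLE_INGRESSES = {
    "items": [
        {"metadata": {"namespace": "default", "name": "docs-site"},
         "spec": {"ingressClassName": "nginx"}},
        {"metadata": {"namespace": "default", "name": "app-site"},
         "spec": {"ingressClassName": "public"}},
    ]
}
```

Called with `active_class="public"`, the sample reports only the Ingress that declares `nginx`.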

4. Several Docker backends were stopped

  • fetcherpay-web — stopped.
  • poste (mail) — stopped.
  • postgres and gitea (git) — stopped.
  • temporal-ui was running, but the Traefik Docker label pointed at the gRPC port 7233 instead of the HTTP UI port 8080, causing 502s for workflow.fetcherpay.com.
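
A check for expected-but-stopped backends could look like the sketch below. It assumes the input is the output of `docker ps -a --format '{{.Names}}\t{{.State}}'` and that the container names match the service names used in this document, which may not hold if Compose prefixes them.

```python
# Service names as used in this document; actual container names are an assumption.
EXPECTED_CONTAINERS = ["fetcherpay-web", "poste", "postgres", "gitea", "temporal-ui"]


def not_running(docker_ps_output: str, expected=EXPECTED_CONTAINERS) -> dict:
    """Map each expected container that is not running to its state, or 'missing'."""
    states = {}
    for line in docker_ps_output.splitlines():
        if not line.strip():
            continue
        # Each line is "<name>\t<state>", e.g. "gitea\texited".
        name, _, state = line.partition("\t")
        states[name.strip()] = state.strip()

    problems = {}
    for name in expected:
        state = states.get(name, "missing")
        if state != "running":
            problems[name] = state
    return problems
```

An empty dict means every expected container is running; anything else is a candidate for an alert.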

Why detection failed

  • No synthetic uptime probes were running against the public endpoints.
  • No alerting on certificate expiry or on the default certificate being served.
  • No Traefik routing-health alert.
  • Docker's restart: unless-stopped policy does not bring back a container that was explicitly stopped or never started; there was no watchdog for expected-but-stopped services.
  • K8s ingress reconciliation was not monitored, so the class mismatch went unnoticed.
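
A synthetic probe covering the first two gaps could be as small as the sketch below: it makes one HTTPS request with full certificate and hostname verification and records the status code. The set of acceptable status codes is an assumption, since this document does not list the expected code per endpoint.

```python
import http.client
import ssl
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    host: str
    tls_valid: bool          # False also covers "could not connect at all"
    status: Optional[int]    # HTTP status code, or None if no response arrived
    error: str = ""


def probe(host: str, path: str = "/", timeout: float = 5.0) -> ProbeResult:
    """Request https://host/path, verifying the certificate chain and hostname."""
    context = ssl.create_default_context()
    conn = http.client.HTTPSConnection(host, 443, timeout=timeout, context=context)
    try:
        conn.request("GET", path)
        return ProbeResult(host, True, conn.getresponse().status)
    except ssl.SSLCertVerificationError as exc:
        # A self-signed default certificate ends up here.
        return ProbeResult(host, False, None, exc.verify_message)
    except (OSError, http.client.HTTPException) as exc:
        return ProbeResult(host, False, None, str(exc))
    finally:
        conn.close()


def healthy(result: ProbeResult, expected_statuses=(200, 301, 302)) -> bool:
    """Healthy means a valid certificate and an expected status code (assumed set)."""
    return result.tls_valid and result.status in expected_statuses
```

Running `probe` on a schedule against each public domain and alerting when `healthy` is false would have flagged both the default certificate and the 404 responses.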

Remediation actions taken

  1. Rebuilt the Traefik file-provider config (/home/garfield/letsencrypt/manual/tls.yml) with explicit routers and services for every commercial domain.
  2. Attached Traefik to the fetcherpay Docker network in /home/garfield/traefik-compose.yml so it could reach FetcherPay backends.
  3. Extracted valid K8s cert-manager secrets and loaded them as static TLS certificates in Traefik to bypass the GoDaddy duplicate-TXT issue.
  4. Started stopped backend containers: fetcherpay-web, poste, postgres, gitea.
  5. Fixed workflow.fetcherpay.com by routing to temporal-ui:8080 instead of 7233.
  6. Fixed git.fetcherpay.com SSH port conflict by changing the host mapping from 2222:22 to 22222:22 in /home/garfield/Downloads/docker-compose.prod.yml.
  7. Verified all public endpoints return expected HTTP codes with TLS certificates that validate.
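
Steps 1, 3, and 5 can be sketched together: decode a cert-manager TLS secret, then build the router, service, and certificate entries for the file provider. This is an illustration rather than the commands run during the incident. It assumes the secret is the parsed JSON of `kubectl get secret <name> -o json`, that the HTTPS entry point is named `websecure`, and that the resulting dict is serialized to YAML separately; the function names are hypothetical.

```python
import base64


def decode_tls_secret(secret: dict) -> tuple:
    """Return (certificate PEM, key PEM) as bytes from a kubernetes.io/tls secret."""
    data = secret["data"]
    return base64.b64decode(data["tls.crt"]), base64.b64decode(data["tls.key"])


def certificate_entry(cert_path: str, key_path: str) -> dict:
    """One item for the tls.certificates list of Traefik's dynamic config."""
    return {"certFile": cert_path, "keyFile": key_path}


def route_config(name: str, host: str, backend_url: str,
                 entrypoint: str = "websecure") -> dict:
    """Router plus service for one host; the entry point name is an assumption."""
    return {
        "http": {
            "routers": {
                name: {
                    "rule": f"Host(`{host}`)",
                    "entryPoints": [entrypoint],
                    "service": name,
                    "tls": {},  # terminate TLS with a certificate matching the host
                }
            },
            "services": {
                name: {"loadBalancer": {"servers": [{"url": backend_url}]}}
            },
        }
    }
```

For example, `route_config("workflow", "workflow.fetcherpay.com", "http://temporal-ui:8080")` produces a router that targets the HTTP UI port rather than the gRPC port.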

Remaining technical debt

  • K8s nginx-ingress is still effectively bypassed for public traffic. Long term, either the ingress classes should be reconciled or the public edge should be migrated to a single controller.
  • Several fetcherpay.com subdomains that depend on stopped services remain down: api.fetcherpay.com, prometheus.fetcherpay.com, grafana.fetcherpay.com, adminer.fetcherpay.com, traefik.fetcherpay.com.
  • Secrets are still stored in plaintext in manifests and compose files.
  • No centralized logging, metrics, or alerting exists for Hermes or the edge proxy.

Follow-up work

See 2026-06-14-public-edge-outage-plan.md for the full runbook / monitoring / probing plan.