# Public Edge Outage: Root Cause Analysis

**Date:** 2026-06-14

**Severity:** High. All public `squaremcp.com` and `fetcherpay.com` properties were unreachable or certificate-invalid.

**Status:** Resolved. All listed commercial domains reachable with valid TLS.

---

## Summary

On 2026-06-14, every public-facing SquareMCP / FetcherPay domain was either returning `404 page not found` or serving an invalid/default TLS certificate. The root cause was a **misconfigured public edge proxy combined with stopped backends and a K8s ingress class mismatch**. Traffic from the internet never reached the Kubernetes nginx-ingress controller that held valid certificates; instead it was intercepted by a Docker Traefik container that had no routes and no valid certificates for the affected domains.

---

## Timeline (all times UTC-4)

- **~09:30** - User reports that commercial sites are not reachable.
- **09:30–10:00** - Diagnosis: Traefik container owns public `:80`/`:443`, has default cert, no routers for `*.squaremcp.com` / `*.fetcherpay.com`.
- **10:00–10:30** - Added file-provider routers and static K8s-extracted certificates for `squaremcp.com`, `www.squaremcp.com`, `app.squaremcp.com`, `docs.squaremcp.com`, `tiktok.squaremcp.com`.
- **10:30–11:00** - Fixed `fetcherpay.com` / `www.fetcherpay.com` by attaching Traefik to the `fetcherpay` Docker network and starting the stopped `fetcherpay-web` container.
- **11:00–11:30** - Fixed `workflow.fetcherpay.com` (Traefik was routing to gRPC port `7233` instead of HTTP UI port `8080`).
- **11:30–12:00** - Fixed `mail.fetcherpay.com` by starting `poste`, extracting the K8s cert, and adding a Traefik router/service.
- **12:00–13:30** - Fixed `git.fetcherpay.com` by starting `postgres` and `gitea`, extracting the K8s cert, adding a router/service, and resolving a host port `2222` conflict by remapping Gitea SSH to `22222`.
- **13:30–14:00** - Final verification of all domains and Hermes path-specific routes.

---

## Root cause

### 1. Docker Traefik intercepts all public ingress

- The Traefik v3 container binds host ports `80`, `443`, and `8080`.
- Docker publishes these ports by inserting `iptables` DNAT rules, with `docker-proxy` as the userland listener on the host.
- Those rules intercept all inbound public HTTP/S traffic **before** it can reach the host-network MicroK8s nginx-ingress controller.

### 2. Traefik had no routes or valid TLS for the commercial domains

- Traefik’s dynamic config comes from Docker labels and a file provider (`/home/garfield/letsencrypt/manual/tls.yml`).
- At the start of the incident the file provider had only a partial set of routers.
- There were no valid Let’s Encrypt certificates for most domains because GoDaddy DNS-01 returns `DUPLICATE_RECORD` for `_acme-challenge.*` TXT records, blocking issuance.
- Result: any request for an unmatched host fell through to Traefik’s default self-signed certificate and returned `404 page not found`.

### 3. K8s nginx-ingress was unreachable even though it had valid certs

- Cert-manager inside MicroK8s held valid TLS secrets for the affected domains.
- The active nginx-ingress-microk8s controller is configured for `ingressClass=public`.
- Most Ingress resources specify `ingressClassName: nginx`.
- Because of the class mismatch, those Ingresses were never reconciled by the active controller, so K8s could not have served the traffic even if Traefik had forwarded it.

### 4. Several Docker backends were stopped

- `fetcherpay-web`: stopped.
- `poste` (mail): stopped.
- `postgres` and `gitea` (git): stopped.
- `temporal-ui` was running, but the Traefik Docker label pointed at the gRPC port `7233` instead of the HTTP UI port `8080`, causing 502s for `workflow.fetcherpay.com`.

---

## Why detection failed

- No synthetic uptime probes were running against the public endpoints.
- No certificate-expiry or certificate-default alerting.
- No Traefik routing-health alert.
- Docker `restart: unless-stopped` only helps if the container was started; there was no watchdog for expected-but-stopped services.
- K8s ingress reconciliation was not monitored, so the class mismatch went unnoticed.

---

## Remediation actions taken

1. **Rebuilt the Traefik file-provider config** (`/home/garfield/letsencrypt/manual/tls.yml`) with explicit routers and services for every commercial domain.
2. **Attached Traefik to the `fetcherpay` Docker network** in `/home/garfield/traefik-compose.yml` so it could reach FetcherPay backends.
3. **Extracted valid K8s cert-manager secrets** and loaded them as static TLS certificates in Traefik to bypass the GoDaddy duplicate-TXT issue.
4. **Started stopped backend containers**: `fetcherpay-web`, `poste`, `postgres`, `gitea`.
5. **Fixed `workflow.fetcherpay.com`** by routing to `temporal-ui:8080` instead of `7233`.
6. **Fixed `git.fetcherpay.com`** SSH port conflict by changing the host mapping from `2222:22` to `22222:22` in `/home/garfield/Downloads/docker-compose.prod.yml`.
7. **Verified** all public endpoints return expected HTTP codes with TLS certificates that validate.

---

## Remaining technical debt

- K8s nginx-ingress is still effectively bypassed for public traffic. Long-term the ingress classes should be reconciled or the public edge should be migrated to a single controller.
- Several `fetcherpay.com` subdomains that depend on stopped services remain down: `api.fetcherpay.com`, `prometheus.fetcherpay.com`, `grafana.fetcherpay.com`, `adminer.fetcherpay.com`, `traefik.fetcherpay.com`.
- Secrets are still stored plaintext in manifests and compose files.
- No centralized logging, metrics, or alerting exists for Hermes or the edge proxy.

---

## Follow-up work

See `2026-06-14-public-edge-outage-plan.md` for the full runbook / monitoring / probing plan.
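
The plan file is not reproduced here. As a rough illustration of the synthetic probing whose absence is noted under "Why detection failed", the following Python sketch makes one certificate-verified HTTPS request per host and classifies the outcome. It is a minimal sketch under stated assumptions, not the probe described in the plan: the names `ProbeResult`, `probe`, `classify`, and `EXPECTED_STATUSES` are hypothetical, and the set of acceptable status codes is a guess, since this report only says "expected HTTP codes".

```python
import ssl
from dataclasses import dataclass
from http.client import HTTPException, HTTPSConnection
from typing import Optional, Tuple

# Assumption: illustrative set only; the report does not list the expected codes.
EXPECTED_STATUSES: Tuple[int, ...] = (200, 301, 302)


@dataclass
class ProbeResult:
    host: str
    status: Optional[int] = None     # HTTP status, if a response was received
    tls_error: Optional[str] = None  # TLS handshake or certificate verification failure
    net_error: Optional[str] = None  # DNS, connect, timeout, or protocol failure


def probe(host: str, timeout: float = 5.0) -> ProbeResult:
    """Make one verified HTTPS request to https://<host>/ and record the outcome."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    conn = HTTPSConnection(host, 443, timeout=timeout, context=context)
    try:
        conn.request("GET", "/")
        return ProbeResult(host, status=conn.getresponse().status)
    except ssl.SSLError as exc:
        # Must come before OSError: SSLError (and SSLCertVerificationError) subclass it.
        return ProbeResult(host, tls_error=str(exc))
    except (OSError, HTTPException) as exc:
        return ProbeResult(host, net_error=str(exc))
    finally:
        conn.close()


def classify(result: ProbeResult,
             expected: Tuple[int, ...] = EXPECTED_STATUSES) -> str:
    """Map a probe result onto the failure modes seen in this incident."""
    if result.tls_error is not None:
        return "tls-invalid"        # e.g. a default self-signed certificate
    if result.net_error is not None or result.status is None:
        return "unreachable"
    if result.status not in expected:
        return "unexpected-status"  # e.g. 404 from an unmatched router, 502 from a bad port
    return "ok"
```

Calling `classify(probe("www.squaremcp.com"))` from a scheduler and alerting on anything other than `"ok"` would cover both symptoms of this outage, because certificate verification fails on a default certificate before any HTTP status is read.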