Public Edge Outage — Root Cause Analysis
Date: 2026-06-14
Severity: High — all public squaremcp.com and fetcherpay.com properties unreachable or certificate-invalid.
Status: Resolved. All listed commercial domains reachable with valid TLS.
Summary
On 2026-06-14, every public-facing SquareMCP / FetcherPay domain was either returning 404 page not found or serving an invalid/default TLS certificate. The root cause was a misconfigured public edge proxy combined with stopped backends and a K8s ingress class mismatch. Traffic from the internet never reached the Kubernetes nginx-ingress controller that held valid certificates; instead it was intercepted by a Docker Traefik container that had no routes and no valid certificates for the affected domains.
Timeline (all times UTC-4)
- ~09:30: User reports that commercial sites are not reachable.
- 09:30–10:00: Diagnosis found that the Traefik container owns public `:80`/`:443`, serves its default cert, and has no routers for `*.squaremcp.com`/`*.fetcherpay.com`.
- 10:00–10:30: Added file-provider routers and static K8s-extracted certificates for `squaremcp.com`, `www.squaremcp.com`, `app.squaremcp.com`, `docs.squaremcp.com`, and `tiktok.squaremcp.com`.
- 10:30–11:00: Fixed `fetcherpay.com`/`www.fetcherpay.com` by attaching Traefik to the `fetcherpay` Docker network and starting the stopped `fetcherpay-web` container.
- 11:00–11:30: Fixed `workflow.fetcherpay.com` (Traefik was routing to gRPC port `7233` instead of HTTP UI port `8080`).
- 11:30–12:00: Fixed `mail.fetcherpay.com` by starting `poste`, extracting the K8s cert, and adding a Traefik router/service.
- 12:00–13:30: Fixed `git.fetcherpay.com` by starting `postgres` and `gitea`, extracting the K8s cert, adding a router/service, and resolving a host port `2222` conflict by remapping Gitea SSH to `22222`.
- 13:30–14:00: Final verification of all domains and Hermes path-specific routes.
Root cause
1. Docker Traefik intercepts all public ingress
- The Traefik v3 container binds host ports `80`, `443`, and `8080`.
- Docker publishes these ports by starting `docker-proxy` listeners on the host and inserting `iptables` DNAT rules.
- Those rules intercept all inbound public HTTP/S traffic before it can reach the host-network MicroK8s nginx-ingress controller.
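To see which host ports Docker has claimed, the DNAT rules can be listed with `iptables -t nat -S DOCKER` and reduced to a port map. The sketch below parses text in the shape that command typically prints; the sample lines, the bridge name `br-edge`, and the container IP are illustrative assumptions, not output captured during the incident.

```python
import re

# Matches Docker's port-publishing rules in the nat table's DOCKER chain, e.g.
#   -A DOCKER ! -i br-edge -p tcp -m tcp --dport 443 -j DNAT --to-destination 172.18.0.5:443
DNAT_RULE = re.compile(
    r"--dport (?P<host_port>\d+) -j DNAT --to-destination (?P<target>[\d.]+:\d+)"
)


def published_ports(nat_rules: str) -> dict:
    """Map each published host port to the container address it is DNAT'ed to."""
    ports = {}
    for line in nat_rules.splitlines():
        match = DNAT_RULE.search(line)
        if match:
            ports[int(match.group("host_port"))] = match.group("target")
    return ports


# Illustrative input only: the bridge name and container IP are made up.
SAMPLE = """\
-N DOCKER
-A DOCKER -i br-edge -j RETURN
-A DOCKER ! -i br-edge -p tcp -m tcp --dport 80 -j DNAT --to-destination 172.18.0.5:80
-A DOCKER ! -i br-edge -p tcp -m tcp --dport 443 -j DNAT --to-destination 172.18.0.5:443
-A DOCKER ! -i br-edge -p tcp -m tcp --dport 8080 -j DNAT --to-destination 172.18.0.5:8080
"""

if __name__ == "__main__":
    for port, target in sorted(published_ports(SAMPLE).items()):
        print(f"host :{port} -> {target}")
```

Any port that appears in this map is rewritten to a container address before local delivery, so a host-network listener on the same port does not see that traffic.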
2. Traefik had no routes or valid TLS for the commercial domains
- Traefik’s dynamic config comes from Docker labels and a file provider (`/home/garfield/letsencrypt/manual/tls.yml`).
- At the start of the incident the file provider held only a partial set of routers.
- There were no valid Let’s Encrypt certificates for most domains because GoDaddy DNS-01 returns `DUPLICATE_RECORD` for `_acme-challenge.*` TXT records, blocking issuance.
- Result: any request for an unmatched host fell through to Traefik’s default self-signed certificate and returned `404 page not found`.
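For a host to be served correctly, the file provider needs three pieces: a router that matches the host, a service that points at the backend, and a certificate entry. The sketch below builds that structure as a Python dict that mirrors the layout of Traefik's dynamic configuration. The entry point name `websecure` and the certificate paths are assumptions for illustration; the actual contents of `tls.yml` are not reproduced here.

```python
import json


def edge_entry(name: str, host: str, backend_url: str,
               cert_file: str, key_file: str) -> dict:
    """Build one router, one service, and one static certificate entry."""
    return {
        "http": {
            "routers": {
                name: {
                    "rule": f"Host(`{host}`)",
                    "entryPoints": ["websecure"],  # assumed entry point name
                    "service": name,
                    "tls": {},  # terminate TLS with a certificate from the store
                }
            },
            "services": {
                name: {"loadBalancer": {"servers": [{"url": backend_url}]}}
            },
        },
        "tls": {"certificates": [{"certFile": cert_file, "keyFile": key_file}]},
    }


if __name__ == "__main__":
    entry = edge_entry(
        "workflow",
        "workflow.fetcherpay.com",
        "http://temporal-ui:8080",
        "/certs/workflow.fetcherpay.com.crt",  # hypothetical path
        "/certs/workflow.fetcherpay.com.key",  # hypothetical path
    )
    print(json.dumps(entry, indent=2))
```

A host without a matching router gets the 404 described above, and a host without a matching certificate entry gets the default certificate, which is why both have to be present.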
3. K8s nginx-ingress was unreachable even though it had valid certs
- Cert-manager inside MicroK8s held valid TLS secrets for the affected domains.
- The active nginx-ingress-microk8s controller is configured for `ingressClass=public`.
- Most Ingress resources specify `ingressClassName: nginx`.
- Because of the class mismatch, those Ingresses were never reconciled by the active controller, so K8s could not serve traffic even if Traefik had forwarded it.
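The class mismatch can be checked mechanically from the output of `kubectl get ingress -A -o json`. The following is a minimal sketch that flags every Ingress whose `spec.ingressClassName` differs from the class the active controller watches; the namespaces and resource names in the sample are hypothetical.

```python
def unreconciled_ingresses(ingress_list: dict, controller_class: str) -> list:
    """Return namespace/name for each Ingress the controller would not pick up.

    An Ingress with no class at all is flagged too. That is conservative:
    it may still be handled through a default IngressClass or the legacy
    kubernetes.io/ingress.class annotation, which this sketch ignores.
    """
    orphans = []
    for item in ingress_list.get("items", []):
        meta = item.get("metadata", {})
        declared = item.get("spec", {}).get("ingressClassName")
        if declared != controller_class:
            orphans.append(f"{meta.get('namespace', 'default')}/{meta.get('name')}")
    return orphans


# Hypothetical sample in the shape of `kubectl get ingress -A -o json`.
SAMPLE = {
    "items": [
        {"metadata": {"namespace": "web", "name": "site"},
         "spec": {"ingressClassName": "nginx"}},
        {"metadata": {"namespace": "web", "name": "docs"},
         "spec": {"ingressClassName": "public"}},
    ]
}

if __name__ == "__main__":
    for orphan in unreconciled_ingresses(SAMPLE, "public"):
        print("not reconciled:", orphan)
```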
4. Several Docker backends were stopped
- `fetcherpay-web`: stopped.
- `poste` (mail): stopped.
- `postgres` and `gitea` (git): stopped.
- `temporal-ui` was running, but the Traefik Docker label pointed at the gRPC port `7233` instead of the HTTP UI port `8080`, causing 502s for `workflow.fetcherpay.com`.
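A stopped backend is easy to detect by comparing the containers that should be running with the ones that are. The sketch below assumes the container names match the service names used in this report and reads the running set from `docker ps`; the comparison itself is a plain set difference.

```python
import subprocess

# Assumption: container names equal the service names used in this report.
EXPECTED = {"fetcherpay-web", "poste", "postgres", "gitea", "temporal-ui"}


def running_containers() -> set:
    """Names of currently running containers, read from the docker CLI."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return set(result.stdout.split())


def missing_services(expected: set, running: set) -> list:
    """Expected containers that are not running, in sorted order."""
    return sorted(expected - running)


if __name__ == "__main__":
    for name in missing_services(EXPECTED, running_containers()):
        print("expected but not running:", name)
```

This only catches stopped containers. A container that is running but wired to the wrong port, as `temporal-ui` was, needs an end-to-end HTTP check instead.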
Why detection failed
- No synthetic uptime probes were running against the public endpoints.
- No certificate-expiry or certificate-default alerting.
- No Traefik routing-health alert.
- Docker `restart: unless-stopped` only helps if the container was started; there was no watchdog for expected-but-stopped services.
- K8s ingress reconciliation was not monitored, so the class mismatch went unnoticed.
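A synthetic probe that covers both symptoms of this outage (default certificate and 404) can be small. The sketch below uses only the Python standard library: it opens a verified TLS connection, issues one `GET /`, and classifies the result. Treating every status of 400 or above as a failure is an assumption; some endpoints may legitimately answer 401 or 403 and would need their own expected codes.

```python
import http.client
import ssl
from typing import Optional, Tuple


def probe(host: str, timeout: float = 5.0) -> Tuple[Optional[bool], Optional[int]]:
    """Return (cert_valid, http_status); cert_valid is None if unreachable."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    conn = http.client.HTTPSConnection(host, timeout=timeout, context=context)
    try:
        conn.request("GET", "/")
        return True, conn.getresponse().status
    except ssl.SSLCertVerificationError:
        # Raised for self-signed, expired, or wrong-host certificates.
        return False, None
    except (OSError, http.client.HTTPException):
        return None, None
    finally:
        conn.close()


def classify(cert_valid: Optional[bool], status: Optional[int]) -> str:
    """Collapse a probe result into one alertable label."""
    if cert_valid is None:
        return "unreachable"
    if not cert_valid:
        return "bad-cert"
    if status is None or status >= 400:
        return "http-error"
    return "ok"


if __name__ == "__main__":
    for host in ("squaremcp.com", "fetcherpay.com", "workflow.fetcherpay.com"):
        print(host, classify(*probe(host)))
```

Run on a schedule from outside the host, a check like this reports `bad-cert` when the edge serves its default certificate and `http-error` when a host has no matching route.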
Remediation actions taken
- Rebuilt the Traefik file-provider config (`/home/garfield/letsencrypt/manual/tls.yml`) with explicit routers and services for every commercial domain.
- Attached Traefik to the `fetcherpay` Docker network in `/home/garfield/traefik-compose.yml` so it could reach FetcherPay backends.
- Extracted valid K8s cert-manager secrets and loaded them as static TLS certificates in Traefik to bypass the GoDaddy duplicate-TXT issue.
- Started stopped backend containers: `fetcherpay-web`, `poste`, `postgres`, `gitea`.
- Fixed `workflow.fetcherpay.com` by routing to `temporal-ui:8080` instead of `7233`.
- Fixed the `git.fetcherpay.com` SSH port conflict by changing the host mapping from `2222:22` to `22222:22` in `/home/garfield/Downloads/docker-compose.prod.yml`.
- Verified all public endpoints return expected HTTP codes with TLS certificates that validate.
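The certificate extraction step amounts to base64-decoding two fields of a `kubernetes.io/tls` Secret. The sketch below assumes the secret has been fetched with `kubectl get secret <name> -n <namespace> -o json` and parsed into a dict; the helper names and the output paths are hypothetical, and this is not the exact procedure used during the incident.

```python
import base64
import os


def decode_tls_secret(secret: dict) -> tuple:
    """Return (certificate_pem, key_pem) from a kubernetes.io/tls Secret."""
    if secret.get("type") != "kubernetes.io/tls":
        raise ValueError("not a kubernetes.io/tls secret")
    data = secret["data"]
    return base64.b64decode(data["tls.crt"]), base64.b64decode(data["tls.key"])


def write_private(path: str, content: bytes) -> None:
    """Write a file that is created readable and writable by its owner only."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(content)


if __name__ == "__main__":
    # Placeholder payloads stand in for real PEM data.
    fake_secret = {
        "type": "kubernetes.io/tls",
        "data": {
            "tls.crt": base64.b64encode(b"CERT-PEM-PLACEHOLDER").decode(),
            "tls.key": base64.b64encode(b"KEY-PEM-PLACEHOLDER").decode(),
        },
    }
    cert_pem, key_pem = decode_tls_secret(fake_secret)
    write_private("example.crt", cert_pem)  # hypothetical output paths
    write_private("example.key", key_pem)
```

Certificates copied out this way are a snapshot: when cert-manager renews the secret, the files Traefik loads have to be refreshed as well.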
Remaining technical debt
- K8s nginx-ingress is still effectively bypassed for public traffic. Long-term the ingress classes should be reconciled or the public edge should be migrated to a single controller.
- Several `fetcherpay.com` subdomains that depend on stopped services remain down: `api.fetcherpay.com`, `prometheus.fetcherpay.com`, `grafana.fetcherpay.com`, `adminer.fetcherpay.com`, `traefik.fetcherpay.com`.
- Secrets are still stored in plaintext in manifests and compose files.
- No centralized logging, metrics, or alerting exists for Hermes or the edge proxy.
Follow-up work
See 2026-06-14-public-edge-outage-plan.md for the full runbook / monitoring / probing plan.