# Public Edge Outage — Root Cause Analysis

**Date:** 2026-06-14  
**Severity:** High — all public `squaremcp.com` and `fetcherpay.com` properties unreachable or certificate-invalid.  
**Status:** Resolved. All listed commercial domains reachable with valid TLS.

---

## Summary

On 2026-06-14, every public-facing SquareMCP / FetcherPay domain was either returning `404 page not found` or serving an invalid/default TLS certificate. The root cause was a **misconfigured public edge proxy combined with stopped backends and a K8s ingress class mismatch**. Traffic from the internet never reached the Kubernetes nginx-ingress controller that held valid certificates; instead it was intercepted by a Docker Traefik container that had no routes and no valid certificates for the affected domains.

---

## Timeline (all times UTC-4)

- **~09:30** — User reports that commercial sites are not reachable.
- **09:30–10:00** — Diagnosis: the Traefik container owns public `:80`/`:443`, serves only its default cert, and has no routers for `*.squaremcp.com` / `*.fetcherpay.com`.
- **10:00–10:30** — Added file-provider routers and static K8s-extracted certificates for `squaremcp.com`, `www.squaremcp.com`, `app.squaremcp.com`, `docs.squaremcp.com`, `tiktok.squaremcp.com`.
- **10:30–11:00** — Fixed `fetcherpay.com` / `www.fetcherpay.com` by attaching Traefik to the `fetcherpay` Docker network and starting the stopped `fetcherpay-web` container.
- **11:00–11:30** — Fixed `workflow.fetcherpay.com` (Traefik was routing to gRPC port `7233` instead of HTTP UI port `8080`).
- **11:30–12:00** — Fixed `mail.fetcherpay.com` by starting `poste`, extracting the K8s cert, and adding a Traefik router/service.
- **12:00–13:30** — Fixed `git.fetcherpay.com` by starting `postgres` and `gitea`, extracting the K8s cert, adding a router/service, and resolving a host port `2222` conflict by remapping Gitea SSH to `22222`.
- **13:30–14:00** — Final verification of all domains and Hermes path-specific routes.

---

## Root cause

### 1. Docker Traefik intercepts all public ingress
- The Traefik v3 container binds host ports `80`, `443`, and `8080`.
- Docker publishes these ports by inserting `iptables` DNAT rules in the `nat` table (a `docker-proxy` userland process also listens on each published port).
- Those rules intercept all inbound public HTTP/S traffic **before** it can reach the host-network MicroK8s nginx-ingress controller.
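
One way to confirm this interception is to read the DNAT rules Docker installed. The sketch below parses text in the format printed by `iptables -t nat -S DOCKER` and maps each published port to its container destination. The sample rules and the `172.17.0.2` address are hypothetical, not captured from the incident host.

```python
import re

# Matches Docker-style publish rules such as:
#   -A DOCKER ! -i docker0 -p tcp -m tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443
DNAT_RE = re.compile(r"--dport (\d+)\b.*-j DNAT --to-destination (\S+)")


def dnat_targets(rules_text):
    """Map each published host port to the container address it is rewritten to."""
    targets = {}
    for line in rules_text.splitlines():
        match = DNAT_RE.search(line)
        if match:
            targets[int(match.group(1))] = match.group(2)
    return targets


# Hypothetical sample; on a real host this text would come from
# `iptables -t nat -S DOCKER` (requires root).
SAMPLE_RULES = """\
-N DOCKER
-A DOCKER -i docker0 -j RETURN
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 80 -j DNAT --to-destination 172.17.0.2:80
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443
"""
```

If ports `80` and `443` appear in the result, inbound traffic to those ports is rewritten to the container before any host-network listener sees it.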

### 2. Traefik had no routes or valid TLS for the commercial domains
- Traefik’s dynamic config comes from Docker labels and a file provider (`/home/garfield/letsencrypt/manual/tls.yml`).
- At the start of the incident the file provider held only an incomplete set of routers.
- Most domains had no valid Let’s Encrypt certificate: GoDaddy rejected the DNS-01 `_acme-challenge.*` TXT records with `DUPLICATE_RECORD`, which blocked issuance.
- Result: any request for an unmatched host fell through to Traefik’s default self-signed certificate and returned `404 page not found`.
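
For illustration, the shape of a file-provider config that would have matched these hosts can be sketched as a plain data structure. This is a minimal sketch, not the config used during the incident: the `websecure` entry point name, the certificate directory, and the file naming scheme are assumptions. Serializing the result to YAML is left out to avoid a third-party dependency.

```python
def build_dynamic_config(routes, cert_dir):
    """Build a Traefik dynamic-config dict from a {hostname: backend_url} map.

    One router, one service, and one static certificate entry are emitted
    per hostname.
    """
    routers, services, certificates = {}, {}, []
    for host, backend_url in sorted(routes.items()):
        name = host.replace(".", "-")
        routers[name] = {
            "rule": f"Host(`{host}`)",
            "entryPoints": ["websecure"],  # assumed entry point name
            "service": name,
            "tls": {},  # terminate TLS on this router
        }
        services[name] = {
            "loadBalancer": {"servers": [{"url": backend_url}]},
        }
        certificates.append({
            # Assumed naming scheme: <cert_dir>/<host>.crt and .key
            "certFile": f"{cert_dir}/{host}.crt",
            "keyFile": f"{cert_dir}/{host}.key",
        })
    return {
        "http": {"routers": routers, "services": services},
        "tls": {"certificates": certificates},
    }
```

For example, `build_dynamic_config({"workflow.fetcherpay.com": "http://temporal-ui:8080"}, "/certs")` yields a router named `workflow-fetcherpay-com` with the rule ``Host(`workflow.fetcherpay.com`)``. A host with no entry in `routes` gets no router, which is the condition that produced the default certificate and the `404`.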

### 3. K8s nginx-ingress was unreachable even though it had valid certs
- Cert-manager inside MicroK8s held valid TLS secrets for the affected domains.
- The active nginx-ingress-microk8s controller is configured for `ingressClass=public`.
- Most Ingress resources specify `ingressClassName: nginx`.
- Because of the class mismatch, those Ingresses were never reconciled by the active controller, so K8s could not serve traffic even if Traefik had forwarded it.
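
A class mismatch of this kind can be detected mechanically. The sketch below takes the `items` list from `kubectl get ingress -A -o json` and reports every Ingress whose class differs from the one the active controller watches. The function name and the sample object names are hypothetical. Note that it also flags Ingresses with no class at all, which may be too strict if the cluster defines a default IngressClass.

```python
def find_unreconciled(ingresses, active_class):
    """Return (namespace, name, class) for Ingresses the controller will ignore."""
    mismatched = []
    for ingress in ingresses:
        metadata = ingress.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        # Prefer spec.ingressClassName; fall back to the legacy annotation.
        ingress_class = (
            ingress.get("spec", {}).get("ingressClassName")
            or annotations.get("kubernetes.io/ingress.class")
        )
        if ingress_class != active_class:
            mismatched.append(
                (metadata.get("namespace"), metadata.get("name"), ingress_class)
            )
    return mismatched


# Hypothetical sample objects, trimmed to the fields the check reads.
SAMPLE_INGRESSES = [
    {"metadata": {"namespace": "web", "name": "site"},
     "spec": {"ingressClassName": "nginx"}},
    {"metadata": {"namespace": "web", "name": "docs"},
     "spec": {"ingressClassName": "public"}},
]
```

With `active_class="public"`, only the `site` Ingress is reported, mirroring the `nginx` versus `public` split described above.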

### 4. Several Docker backends were stopped
- `fetcherpay-web` — stopped.
- `poste` (mail) — stopped.
- `postgres` and `gitea` (git) — stopped.
- `temporal-ui` was running, but the Traefik Docker label pointed at the gRPC port `7233` instead of the HTTP UI port `8080`, causing 502s for `workflow.fetcherpay.com`.
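
A check for expected-but-stopped containers needs only a set difference against the running container names. The following is a minimal sketch: the expected set is taken from the containers named above, and treating that set as complete is an assumption.

```python
import subprocess

# Containers named in this report; the full expected set is an assumption.
EXPECTED = {"fetcherpay-web", "poste", "postgres", "gitea", "temporal-ui"}


def running_container_names():
    """Return the names of running containers as reported by `docker ps`."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def missing_containers(expected, running):
    """Return the expected container names that are not currently running."""
    return sorted(set(expected) - set(running))
```

Calling `missing_containers(EXPECTED, running_container_names())` on the host returns the names that should be alerted on. It does not cover the `temporal-ui` case, where the container was running but the routed port was wrong.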

---

## Why detection failed

- No synthetic uptime probes were running against the public endpoints.
- No alerting on certificate expiry or on the default certificate being served.
- No Traefik routing-health alert.
- Docker `restart: unless-stopped` does not bring back a container that was explicitly stopped or never started, and there was no watchdog for expected-but-stopped services.
- K8s ingress reconciliation was not monitored, so the class mismatch went unnoticed.
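
A synthetic probe covering the first two gaps can be small. The sketch below makes one verified HTTPS request per host and classifies the result. The rule that only 2xx and 3xx responses count as healthy is an assumption; endpoints that legitimately answer `401` would need their own expected codes.

```python
import http.client
import ssl


def probe(host, path="/", timeout=5.0):
    """Return (status, error) for one HTTPS GET with certificate verification."""
    context = ssl.create_default_context()  # verifies chain and hostname
    conn = http.client.HTTPSConnection(host, 443, timeout=timeout, context=context)
    try:
        conn.request("GET", path, headers={"User-Agent": "edge-probe"})
        return conn.getresponse().status, None
    except ssl.SSLCertVerificationError as exc:
        # A default or self-signed certificate ends up here.
        return None, f"tls: {exc.verify_message}"
    except (OSError, http.client.HTTPException) as exc:
        return None, f"net: {exc}"
    finally:
        conn.close()


def classify(status, error):
    """Treat TLS/network errors and any status outside 200-399 as a failure."""
    if error is not None or status is None:
        return "FAIL"
    return "OK" if 200 <= status < 400 else "FAIL"
```

Running `classify(*probe("squaremcp.com"))` on a schedule from outside the host would have flagged both failure modes seen in this incident: the default certificate surfaces as a `tls:` error, and the unrouted host surfaces as a `404`.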

---

## Remediation actions taken

1. **Rebuilt the Traefik file-provider config** (`/home/garfield/letsencrypt/manual/tls.yml`) with explicit routers and services for every commercial domain.
2. **Attached Traefik to the `fetcherpay` Docker network** in `/home/garfield/traefik-compose.yml` so it could reach FetcherPay backends.
3. **Extracted valid K8s cert-manager secrets** and loaded them as static TLS certificates in Traefik to bypass the GoDaddy duplicate-TXT issue.
4. **Started stopped backend containers**: `fetcherpay-web`, `poste`, `postgres`, `gitea`.
5. **Fixed `workflow.fetcherpay.com`** by routing to `temporal-ui:8080` instead of `7233`.
6. **Fixed `git.fetcherpay.com`** SSH port conflict by changing the host mapping from `2222:22` to `22222:22` in `/home/garfield/Downloads/docker-compose.prod.yml`.
7. **Verified** all public endpoints return expected HTTP codes with TLS certificates that validate.
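
The certificate extraction in step 3 amounts to base64-decoding the two data fields of a `kubernetes.io/tls` Secret. The sketch below assumes the Secret was exported with `kubectl get secret <name> -n <namespace> -o json`; the function name is hypothetical and this is not necessarily the procedure used during the incident.

```python
import base64
import json


def decode_tls_secret(secret_json):
    """Return (cert_pem, key_pem) bytes from a kubernetes.io/tls Secret in JSON form."""
    secret = json.loads(secret_json)
    if secret.get("type") != "kubernetes.io/tls":
        raise ValueError(f"not a TLS secret: type={secret.get('type')!r}")
    data = secret["data"]
    # Secret values are base64-encoded; the decoded payloads are PEM text.
    return base64.b64decode(data["tls.crt"]), base64.b64decode(data["tls.key"])
```

The two returned byte strings can be written to the certificate and key files that the Traefik file provider references. Because these are static copies, they do not renew when cert-manager rotates the Secret, so the extraction has to be repeated before expiry.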

---

## Remaining technical debt

- K8s nginx-ingress is still effectively bypassed for public traffic. Long term, either the ingress classes should be reconciled or the public edge should be migrated to a single controller.
- Several `fetcherpay.com` subdomains that depend on stopped services remain down: `api.fetcherpay.com`, `prometheus.fetcherpay.com`, `grafana.fetcherpay.com`, `adminer.fetcherpay.com`, `traefik.fetcherpay.com`.
- Secrets are still stored in plaintext in manifests and compose files.
- No centralized logging, metrics, or alerting exists for Hermes or the edge proxy.

---

## Follow-up work

See `2026-06-14-public-edge-outage-plan.md` for the full runbook / monitoring / probing plan.