# Active Issues and Remaining Debt - 2026-06-14

## What is working now

All commercial domains verified reachable with valid TLS:

- `hermes.squaremcp.com` / `openapi-living-brief.json`
- `app.squaremcp.com`
- `docs.squaremcp.com`
- `squaremcp.com` / `www.squaremcp.com`
- `tiktok.squaremcp.com`
- `fetcherpay.com` / `www.fetcherpay.com`
- `workflow.fetcherpay.com`
- `mail.fetcherpay.com`
- `git.fetcherpay.com`

Hermes path-specific routes verified:

- `POST /api/pilot-request` → `201` on `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`
- `GET /auth/tiktok/start` → `302` on `tiktok.squaremcp.com`

---

## Still down / not addressed

| Subdomain / Service | Why it is down | What would fix it |
|---|---|---|
| `api.fetcherpay.com` | `fetcherpay-api` container not running | Start `fetcherpay-api` (needs env vars, Postgres, Redis) |
| `prometheus.fetcherpay.com` | Prometheus container not running | Start Prometheus from `docker-compose.fetcherpay.yml` |
| `grafana.fetcherpay.com` | Grafana container not running | Start Grafana from `docker-compose.fetcherpay.yml` |
| `adminer.fetcherpay.com` | Adminer container not running | Start Adminer from `docker-compose.fetcherpay.yml` |
| `traefik.fetcherpay.com` | Traefik dashboard is on `:8080` but not routed through a public host label | Add a secure router or restrict dashboard to localhost/VPN |

---

## Architectural debt

1. **K8s nginx-ingress is bypassed**
   - Traefik’s Docker iptables rules intercept all public HTTP/S traffic.
   - The active nginx-ingress controller class is `public`; manifests use `nginx`.
   - Long term: either reconcile `ingressClassName` or migrate the public edge to K8s.

2. **Manual static certificate workaround**
   - Traefik cannot issue new certs via GoDaddy DNS-01 for several domains because of `DUPLICATE_RECORD` TXT errors.
   - Certs are extracted from K8s cert-manager secrets and loaded statically.
   - These must be manually rotated before expiry.

3. **No observability**
   - No synthetic uptime probes.
   - No cert-expiry alerting.
   - No Hermes `/metrics` endpoint.
   - No Alertmanager / Slack alerts.
   - No centralized logs.

4. **Secret management**
   - Plaintext secrets in `hermes-k8s.yaml` and compose env vars.
   - No Sealed Secrets / External Secrets / Vault.

5. **Single point of failure**
   - One host, one residential IP, one edge proxy.
   - No redundancy or failover.

6. **Gitea SSH port**
   - Changed from `2222` to `22222` due to an unknown process binding `2222`.
   - The original occupant of port `2222` was never identified; a reboot would be needed to clear it.

---

## Recommended next steps

See `2026-06-14-public-edge-outage-plan.md` for the full phased plan. Priorities:

1. **Immediate:** finalize RCA, runbook, and `scripts/verify-public-endpoints.sh`.
2. **This week:** deploy blackbox exporter + cert-expiry alerts + container-up check.
3. **Next sprint:** add Hermes `/metrics`, Grafana dashboards, Alertmanager Slack routing.
4. **Future:** decide on K8s edge migration vs. reconciling ingress classes.
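
Until the blackbox exporter and cert-expiry alerts are in place, the manually rotated static certs can be watched with a small probe. The following is a minimal Python sketch, not the contents of `scripts/verify-public-endpoints.sh`. It assumes every host serves TLS on port 443 and uses a 21-day warning threshold; the function names, `HOSTS`, and `WARN_DAYS` are hypothetical.

```python
"""Hypothetical TLS reachability and cert-expiry probe (a sketch, not the repo's script)."""
import socket
import ssl
import sys
from datetime import datetime, timezone

# Hosts taken from the "working now" list; port 443 for all of them is an assumption.
HOSTS = [
    "hermes.squaremcp.com",
    "app.squaremcp.com",
    "docs.squaremcp.com",
    "squaremcp.com",
    "www.squaremcp.com",
    "tiktok.squaremcp.com",
    "fetcherpay.com",
    "www.fetcherpay.com",
    "workflow.fetcherpay.com",
    "mail.fetcherpay.com",
    "git.fetcherpay.com",
]

WARN_DAYS = 21  # assumed alert threshold; not specified in this document


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left on a cert, given the notAfter string from getpeercert()."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def classify(days_left: float, warn_days: int = WARN_DAYS) -> str:
    """Map remaining lifetime to a status label."""
    if days_left < 0:
        return "EXPIRED"
    if days_left < warn_days:
        return "WARN"
    return "OK"


def cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Complete a verified TLS handshake and return the leaf cert's notAfter field."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def main() -> int:
    now = datetime.now(timezone.utc)
    failures = 0
    for host in HOSTS:
        try:
            days_left = days_until_expiry(cert_not_after(host), now)
            status = classify(days_left)
            print(f"{status:8} {host} ({days_left:.0f} days left)")
        except OSError as exc:
            # Covers connection errors and ssl.SSLError; an already expired
            # cert normally fails the verified handshake and lands here.
            status = "DOWN"
            print(f"{status:8} {host} ({exc})")
        if status != "OK":
            failures += 1
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```

The process exits non-zero when any host is down or within the warning window, so it can run from cron or a CI job as a stopgap. It does not cover the path-specific Hermes routes or the container-up check.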