docs(runbooks): add 2026-06-14 public edge outage RCA, fix log, infra findings, debt, and monitoring plan

Garfield
2026-06-14 12:26:34 -04:00
parent 2014e03190
commit 0e255e570a
6 changed files with 848 additions and 0 deletions


@@ -0,0 +1,75 @@
# Active Issues and Remaining Debt — 2026-06-14
## What is working now
All commercial domains verified reachable with valid TLS:
- `hermes.squaremcp.com` / `openapi-living-brief.json`
- `app.squaremcp.com`
- `docs.squaremcp.com`
- `squaremcp.com` / `www.squaremcp.com`
- `tiktok.squaremcp.com`
- `fetcherpay.com` / `www.fetcherpay.com`
- `workflow.fetcherpay.com`
- `mail.fetcherpay.com`
- `git.fetcherpay.com`
Hermes path-specific routes verified:
- `POST /api/pilot-request` → `201` on `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`
- `GET /auth/tiktok/start` → `302` on `tiktok.squaremcp.com`
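
The checks above could be re-run with a short script. What follows is a minimal Python sketch under stated assumptions, not the contents of `scripts/verify-public-endpoints.sh`. The names `CHECKS`, `probe`, `failures`, and `main` are hypothetical, and only a few of the hosts are listed. A host-level check passes on any HTTP response over verified TLS. The `POST /api/pilot-request` check is left out because it needs a request body and would create a record.

```python
import urllib.error
import urllib.request

# (method, url, expected status); None means "any response over valid TLS".
# URLs are taken from the lists above; the selection is illustrative.
CHECKS = [
    ("GET", "https://app.squaremcp.com/", None),
    ("GET", "https://docs.squaremcp.com/", None),
    ("GET", "https://git.fetcherpay.com/", None),
    ("GET", "https://tiktok.squaremcp.com/auth/tiktok/start", 302),
]


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so a 302 is reported as a 302."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def probe(method, url, timeout=10.0):
    """Return the HTTP status code, or None on connection or TLS failure."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method=method)
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx/4xx/5xx still prove the edge answered
    except OSError:
        return None  # URLError, TLS verification errors, timeouts


def failures(checks, observed):
    """Return the checks whose observed status is missing or unexpected."""
    failed = []
    for method, url, want in checks:
        got = observed.get((method, url))
        if got is None or (want is not None and got != want):
            failed.append((method, url, want, got))
    return failed


def main():
    observed = {(m, u): probe(m, u) for m, u, _ in CHECKS}
    failed = failures(CHECKS, observed)
    for method, url, want, got in failed:
        print(f"FAIL {method} {url}: wanted {want}, got {got}")
    return 1 if failed else 0
```

Calling `main()` returns `0` when every check passes and `1` otherwise, so it can drive a cron job or CI step. Certificate verification uses Python's default trust store, so an expired or mismatched cert shows up as a failed probe.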
---
## Still down / not addressed
| Subdomain / Service | Why it is down | What would fix it |
|---|---|---|
| `api.fetcherpay.com` | `fetcherpay-api` container not running | Start `fetcherpay-api` (needs env vars, Postgres, Redis) |
| `prometheus.fetcherpay.com` | Prometheus container not running | Start Prometheus from `docker-compose.fetcherpay.yml` |
| `grafana.fetcherpay.com` | Grafana container not running | Start Grafana from `docker-compose.fetcherpay.yml` |
| `adminer.fetcherpay.com` | Adminer container not running | Start Adminer from `docker-compose.fetcherpay.yml` |
| `traefik.fetcherpay.com` | Traefik dashboard is on `:8080` but not routed through a public host label | Add a secure router or restrict dashboard to localhost/VPN |
---
## Architectural debt
1. **K8s nginx-ingress is bypassed**
- Traefik's Docker iptables rules intercept all public HTTP/S traffic.
- The active nginx-ingress controller class is `public`, but the manifests request `nginx`, so the two do not match.
- Long term: either reconcile `ingressClassName` or migrate the public edge to K8s.
2. **Manual static certificate workaround**
- Traefik cannot issue new certs via GoDaddy DNS-01 for several domains because of `DUPLICATE_RECORD` TXT errors.
- Certs are extracted from K8s cert-manager secrets and loaded statically.
- These must be manually rotated before expiry.
3. **No observability**
- No synthetic uptime probes.
- No cert-expiry alerting.
- No Hermes `/metrics` endpoint.
- No Alertmanager / Slack alerts.
- No centralized logs.
4. **Secret management**
- Plaintext secrets in `hermes-k8s.yaml` and compose env vars.
- No Sealed Secrets / External Secrets / Vault.
5. **Single point of failure**
- One host, one residential IP, one edge proxy.
- No redundancy or failover.
6. **Gitea SSH port**
- Changed from `2222` to `22222` due to an unknown process binding `2222`.
- The original occupant of port `2222` was never identified; a reboot would be needed to clear it.
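
For item 6, one way to look for whatever holds port `2222` without rebooting is to read the kernel's socket table directly. The sketch below is Linux-only and assumes root access to `/proc`. The function names are hypothetical, and it is not something the fix log records as having been run. A listener that shows an inode but no owning PID may be held by something other than an ordinary userspace process.

```python
import glob
import os

LISTEN = "0A"  # TCP state code for LISTEN in /proc/net/tcp


def listening_inodes(proc_net_tcp_text, port):
    """Return socket inodes listening on `port`, parsed from /proc/net/tcp text."""
    inodes = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 10:
            continue
        # local_address is HEXADDR:HEXPORT for both tcp and tcp6
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if fields[3] == LISTEN and local_port == port:
            inodes.add(fields[9])
    return inodes


def pids_holding(inodes):
    """Map PID -> command name for processes holding any of the socket inodes."""
    targets = {f"socket:[{inode}]" for inode in inodes}
    owners = {}
    for fd_path in glob.glob("/proc/[0-9]*/fd/*"):
        try:
            if os.readlink(fd_path) in targets:
                pid = fd_path.split("/")[2]
                with open(f"/proc/{pid}/comm") as handle:
                    owners[pid] = handle.read().strip()
        except OSError:
            continue  # process exited, or fd not readable without root
    return owners


def main(port=2222):
    inodes = set()
    for table in ("/proc/net/tcp", "/proc/net/tcp6"):
        if not os.path.exists(table):
            continue  # e.g. IPv6 disabled
        with open(table) as handle:
            inodes |= listening_inodes(handle.read(), port)
    print(f"port {port}: inodes={sorted(inodes)} owners={pids_holding(inodes)}")
```

Run `main()` as root on the host. An empty `inodes` list means nothing is listening on the port in the host's network namespace.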
---
## Recommended next steps
See `2026-06-14-public-edge-outage-plan.md` for the full phased plan. Priorities:
1. **Immediate:** finalize RCA, runbook, and `scripts/verify-public-endpoints.sh`.
2. **This week:** deploy blackbox exporter + cert-expiry alerts + container-up check.
3. **Next sprint:** add Hermes `/metrics`, Grafana dashboards, Alertmanager Slack routing.
4. **Future:** decide on K8s edge migration vs. reconciling ingress classes.
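
Until the cert-expiry alerts in step 2 exist, the statically loaded certificates could be checked by hand. The following is a minimal Python sketch, not part of the phased plan. The function names and the 21-day `WARN_DAYS` threshold are hypothetical, and the list of hosts that rely on static certs is not given here, so the caller has to supply hostnames.

```python
import socket
import ssl
import time

WARN_DAYS = 21  # hypothetical threshold, not taken from the plan


def fetch_not_after(host, port=443, timeout=10.0):
    """Return the served certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after, now=None):
    """Days until expiry; negative if the certificate has already expired."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def needs_rotation(not_after, now=None, warn_days=WARN_DAYS):
    """True when the certificate expires within `warn_days` days."""
    return days_remaining(not_after, now) < warn_days
```

For example, `needs_rotation(fetch_not_after("git.fetcherpay.com"))` reports whether that host's served certificate is inside the warning window. Because the handshake verifies the chain, an already-expired cert raises `ssl.SSLCertVerificationError` instead of returning a date, which should itself be treated as a failure.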