docs(runbooks): add 2026-06-14 public edge outage RCA, fix log, infra findings, debt, and monitoring plan
docs/runbooks/2026-06-14-active-issues-and-debt.md
# Active Issues and Remaining Debt — 2026-06-14

## What is working now

All commercial domains verified reachable with valid TLS:

- `hermes.squaremcp.com` / `openapi-living-brief.json`
- `app.squaremcp.com`
- `docs.squaremcp.com`
- `squaremcp.com` / `www.squaremcp.com`
- `tiktok.squaremcp.com`
- `fetcherpay.com` / `www.fetcherpay.com`
- `workflow.fetcherpay.com`
- `mail.fetcherpay.com`
- `git.fetcherpay.com`

Hermes path-specific routes verified:

- `POST /api/pilot-request` → `201` on `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`
- `GET /auth/tiktok/start` → `302` on `tiktok.squaremcp.com`
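
A minimal Python sketch of how the route checks above could be repeated on demand. It is not the contents of `scripts/verify-public-endpoints.sh`, which this runbook does not show. The hosts, paths, and expected status codes come from the list above; the empty JSON body sent to `/api/pilot-request` and the names `Check`, `fetch_status`, and `run_checks` are assumptions. The real endpoint may reject that body, and a successful `POST` may create a real pilot request in production. Calling `run_checks(CHECKS)` returns a list of failure messages, which is empty when every route answers as expected.

```python
import urllib.error
import urllib.request
from typing import Callable, List, NamedTuple


class Check(NamedTuple):
    method: str
    url: str
    expected_status: int


# Routes and expected codes are taken from the verified list above.
CHECKS = [
    Check("POST", "https://squaremcp.com/api/pilot-request", 201),
    Check("POST", "https://www.squaremcp.com/api/pilot-request", 201),
    Check("POST", "https://tiktok.squaremcp.com/api/pilot-request", 201),
    Check("GET", "https://tiktok.squaremcp.com/auth/tiktok/start", 302),
]


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so a 302 can be observed."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


def fetch_status(method: str, url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for one request."""
    opener = urllib.request.build_opener(_NoRedirect)
    # Assumption: an empty JSON object is an acceptable POST body.
    data = b"{}" if method == "POST" else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    try:
        with opener.open(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx, and 5xx responses arrive here


def run_checks(
    checks: List[Check],
    fetch: Callable[[str, str], int] = fetch_status,
) -> List[str]:
    """Run every check and return human-readable failure messages."""
    failures = []
    for check in checks:
        try:
            got = fetch(check.method, check.url)
        except OSError as err:  # DNS, TCP, TLS, and timeout errors
            failures.append(f"{check.method} {check.url}: {err}")
            continue
        if got != check.expected_status:
            failures.append(
                f"{check.method} {check.url}: "
                f"expected {check.expected_status}, got {got}"
            )
    return failures
```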

---

## Still down / not addressed

| Subdomain / Service | Why it is down | What would fix it |
|---|---|---|
| `api.fetcherpay.com` | `fetcherpay-api` container not running | Start `fetcherpay-api` (needs env vars, Postgres, Redis) |
| `prometheus.fetcherpay.com` | Prometheus container not running | Start Prometheus from `docker-compose.fetcherpay.yml` |
| `grafana.fetcherpay.com` | Grafana container not running | Start Grafana from `docker-compose.fetcherpay.yml` |
| `adminer.fetcherpay.com` | Adminer container not running | Start Adminer from `docker-compose.fetcherpay.yml` |
| `traefik.fetcherpay.com` | Traefik dashboard is on `:8080` but not routed through a public host label | Add a secure router or restrict dashboard to localhost/VPN |
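
Most rows in the table above come down to a container that is not running, so a container-up check could be as small as the following Python sketch. Only `fetcherpay-api` is a container name taken from this runbook; `prometheus`, `grafana`, and `adminer` are placeholder names that would need to match whatever `docker-compose.fetcherpay.yml` actually defines, and the function names are illustrative. `missing_containers(EXPECTED, running_containers())` returns the expected containers that `docker ps` does not list.

```python
import subprocess
from typing import Iterable, List, Set

# `fetcherpay-api` is named in the table above. The other three names are
# assumptions; replace them with the names from docker-compose.fetcherpay.yml.
EXPECTED = ["fetcherpay-api", "prometheus", "grafana", "adminer"]


def parse_names(docker_ps_output: str) -> Set[str]:
    """Turn the output of `docker ps --format '{{.Names}}'` into a set."""
    return {
        line.strip()
        for line in docker_ps_output.splitlines()
        if line.strip()
    }


def running_containers() -> Set[str]:
    """Ask the local Docker daemon for the names of running containers."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        check=True, capture_output=True, text=True,
    )
    return parse_names(result.stdout)


def missing_containers(expected: Iterable[str], running: Set[str]) -> List[str]:
    """Return expected container names that are not running, in order."""
    return [name for name in expected if name not in running]
```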

---

## Architectural debt

1. **K8s nginx-ingress is bypassed**
   - Traefik’s Docker iptables rules intercept all public HTTP/S traffic.
   - The active nginx-ingress controller class is `public`; manifests use `nginx`.
   - Long term: either reconcile `ingressClassName` or migrate the public edge to K8s.

2. **Manual static certificate workaround**
   - Traefik cannot issue new certs via GoDaddy DNS-01 for several domains because of `DUPLICATE_RECORD` TXT errors.
   - Certs are extracted from K8s cert-manager secrets and loaded statically.
   - These must be manually rotated before expiry.

3. **No observability**
   - No synthetic uptime probes.
   - No cert-expiry alerting.
   - No Hermes `/metrics` endpoint.
   - No Alertmanager / Slack alerts.
   - No centralized logs.

4. **Secret management**
   - Plaintext secrets in `hermes-k8s.yaml` and compose env vars.
   - No Sealed Secrets / External Secrets / Vault.

5. **Single point of failure**
   - One host, one residential IP, one edge proxy.
   - No redundancy or failover.

6. **Gitea SSH port**
   - Changed from `2222` to `22222` due to an unknown process binding `2222`.
   - The original occupant of port `2222` was never identified; a reboot would be needed to clear it.
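
For the Gitea SSH port item, one way to look for the occupant of `2222` before resorting to a reboot is to parse the listener table. The Python sketch below assumes a Linux host where `ss -ltnp` is available and is run with enough privilege to show process names; the function names are illustrative. It will not explain a port held by something that `ss` cannot attribute to a process, in which case the owner is reported as `unknown`. `find_listeners(listening_sockets(), 2222)` returns `(local address, owner)` pairs, and `port_is_free(2222)` reports whether the port can currently be bound.

```python
import re
import socket
import subprocess
from typing import List, Tuple


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def listening_sockets() -> str:
    """Return the raw output of `ss -ltnp` (listening TCP sockets)."""
    result = subprocess.run(
        ["ss", "-ltnp"], check=True, capture_output=True, text=True,
    )
    return result.stdout


def find_listeners(ss_output: str, port: int) -> List[Tuple[str, str]]:
    """Extract (local address, owning process) pairs for one port."""
    listeners = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue  # skips the header row and malformed lines
        local = fields[3]
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        # The process column is only present when ss can see the owner.
        match = re.search(r'users:\(\("([^"]+)",pid=(\d+)', line)
        owner = f"{match.group(1)} (pid {match.group(2)})" if match else "unknown"
        listeners.append((local, owner))
    return listeners
```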

---

## Recommended next steps

See `2026-06-14-public-edge-outage-plan.md` for the full phased plan. Priorities:

1. **Immediate:** finalize RCA, runbook, and `scripts/verify-public-endpoints.sh`.
2. **This week:** deploy blackbox exporter + cert-expiry alerts + container-up check.
3. **Next sprint:** add Hermes `/metrics`, Grafana dashboards, Alertmanager Slack routing.
4. **Future:** decide on K8s edge migration vs. reconciling ingress classes.
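
Until the blackbox exporter and cert-expiry alerts from step 2 are in place, the manually rotated static certificates could be watched with a small stopgap check. The Python sketch below is not part of the plan referenced above; the 21-day threshold and the function names are assumptions. It reads the `notAfter` field of the certificate a host actually serves and converts it to days remaining. A certificate that has already expired fails the TLS handshake instead, raising `ssl.SSLCertVerificationError`.

```python
import socket
import ssl
import time
from typing import Dict, List, Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the served certificate's notAfter string for a host."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a notAfter string such as 'Jun 14 00:00:00 2026 GMT' to days left."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def expiring_soon(
    days_by_host: Dict[str, float],
    threshold_days: float = 21,  # assumption: pick a window that fits rotation
) -> List[str]:
    """Return hosts whose certificates expire within the threshold."""
    return sorted(
        host for host, days in days_by_host.items() if days < threshold_days
    )
```

A cron job could build `days_by_host` by calling `days_until_expiry(fetch_not_after(host))` for each domain listed under "What is working now" and send a notification whenever `expiring_soon` returns a non-empty list.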