# Active Issues and Remaining Debt — 2026-06-14

## What is working now

All commercial domains verified reachable with valid TLS:

- `hermes.squaremcp.com` / `openapi-living-brief.json`
- `app.squaremcp.com`
- `docs.squaremcp.com`
- `squaremcp.com` / `www.squaremcp.com`
- `tiktok.squaremcp.com`
- `fetcherpay.com` / `www.fetcherpay.com`
- `workflow.fetcherpay.com`
- `mail.fetcherpay.com`
- `git.fetcherpay.com`

Hermes path-specific routes verified:

- `POST /api/pilot-request` → `201` on `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`
- `GET /auth/tiktok/start` → `302` on `tiktok.squaremcp.com`
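The route checks above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not the contents of `scripts/verify-public-endpoints.sh`: the `Check` type and helper names are illustrative, and the empty JSON body sent to `POST /api/pilot-request` is an assumption because the real payload is not documented here. A successful POST probe may also create a real pilot request. `main()` is defined but not called; invoke it explicitly to run the probes.

```python
import http.client
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    method: str
    host: str
    path: str
    expected_status: int


# Hosts, paths, and expected status codes are taken from the verified-routes list.
CHECKS = [
    Check("POST", host, "/api/pilot-request", 201)
    for host in ("squaremcp.com", "www.squaremcp.com", "tiktok.squaremcp.com")
] + [Check("GET", "tiktok.squaremcp.com", "/auth/tiktok/start", 302)]


def probe(check: Check, timeout: float = 10.0) -> int:
    """Send one HTTPS request and return the response status code.

    http.client does not follow redirects, so a 302 is reported as 302.
    HTTPSConnection verifies the certificate chain and hostname by default,
    so an invalid certificate raises an exception instead of returning a status.
    """
    conn = http.client.HTTPSConnection(check.host, timeout=timeout)
    try:
        body = None
        headers = {}
        if check.method == "POST":
            # Assumption: an empty JSON object stands in for the real payload.
            body = json.dumps({})
            headers["Content-Type"] = "application/json"
        conn.request(check.method, check.path, body=body, headers=headers)
        return conn.getresponse().status
    finally:
        conn.close()


def failures(checks, observed):
    """Return (check, status) pairs whose observed status differs from the expected one."""
    return [(c, s) for c, s in zip(checks, observed) if s != c.expected_status]


def main() -> int:
    """Probe every route, print mismatches, and return how many there were."""
    statuses = [probe(c) for c in CHECKS]
    bad = failures(CHECKS, statuses)
    for check, status in bad:
        print(
            f"FAIL {check.method} https://{check.host}{check.path}: "
            f"got {status}, want {check.expected_status}"
        )
    return len(bad)
```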

---

## Still down / not addressed

| Subdomain / Service | Why it is down | What would fix it |
|---|---|---|
| `api.fetcherpay.com` | `fetcherpay-api` container not running | Start `fetcherpay-api` (needs env vars, Postgres, Redis) |
| `prometheus.fetcherpay.com` | Prometheus container not running | Start Prometheus from `docker-compose.fetcherpay.yml` |
| `grafana.fetcherpay.com` | Grafana container not running | Start Grafana from `docker-compose.fetcherpay.yml` |
| `adminer.fetcherpay.com` | Adminer container not running | Start Adminer from `docker-compose.fetcherpay.yml` |
| `traefik.fetcherpay.com` | Traefik dashboard is on `:8080` but not routed through a public host label | Add a secure router or restrict dashboard to localhost/VPN |
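A container-up check for the services in the table might look like the sketch below. Only `fetcherpay-api` is a container name given in this document; `prometheus`, `grafana`, and `adminer` are assumed names, and Compose often prefixes names with the project name, so the exact-match comparison would need adjusting to the real names. The sketch relies on `docker ps --format '{{.Names}}'`, which prints one running container name per line.

```python
import subprocess

# `fetcherpay-api` comes from the table; the other three names are assumptions.
EXPECTED = ["fetcherpay-api", "prometheus", "grafana", "adminer"]


def parse_names(docker_ps_output: str) -> set:
    """Turn `docker ps --format '{{.Names}}'` output into a set of names."""
    return {line.strip() for line in docker_ps_output.splitlines() if line.strip()}


def running_containers() -> set:
    """Ask the local Docker daemon which containers are currently running."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_names(result.stdout)


def missing(expected, running) -> list:
    """Return the expected names that are not running, in the order given."""
    return [name for name in expected if name not in running]
```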

---

## Architectural debt

1. **K8s nginx-ingress is bypassed**
   - Traefik’s Docker iptables rules intercept all public HTTP/S traffic.
   - The active nginx-ingress controller class is `public`; manifests use `nginx`.
   - Long term: either reconcile `ingressClassName` or migrate the public edge to K8s.

2. **Manual static certificate workaround**
   - Traefik cannot issue new certs via GoDaddy DNS-01 for several domains because of `DUPLICATE_RECORD` TXT errors.
   - Certs are extracted from K8s cert-manager secrets and loaded statically.
   - These must be manually rotated before expiry.

3. **No observability**
   - No synthetic uptime probes.
   - No cert-expiry alerting.
   - No Hermes `/metrics` endpoint.
   - No Alertmanager / Slack alerts.
   - No centralized logs.

4. **Secret management**
   - Plaintext secrets in `hermes-k8s.yaml` and compose env vars.
   - No Sealed Secrets / External Secrets / Vault.

5. **Single point of failure**
   - One host, one residential IP, one edge proxy.
   - No redundancy or failover.

6. **Gitea SSH port**
   - Changed from `2222` to `22222` due to an unknown process binding `2222`.
   - The original occupant of port `2222` was never identified; a reboot would be needed to clear it.
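For the Gitea SSH port item, a quick way to confirm whether `2222` is still bound is to attempt a bind and look for `EADDRINUSE`. The sketch below is illustrative only: `port_in_use` is a hypothetical helper, it checks IPv4 TCP only, and it cannot say which process holds the port (a tool such as `ss -ltnp` run as root is one way to find that out).

```python
import errno
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if binding host:port fails because the address is already in use."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            # Other errors (e.g. permission denied) are not evidence of a listener.
            raise
        return False


def report(ports=(2222, 22222)) -> dict:
    """Map each port from the item above to whether something is bound to it."""
    return {port: port_in_use(port) for port in ports}
```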

---

## Recommended next steps

See `2026-06-14-public-edge-outage-plan.md` for the full phased plan. Priorities:

1. **Immediate:** finalize RCA, runbook, and `scripts/verify-public-endpoints.sh`.
2. **This week:** deploy blackbox exporter + cert-expiry alerts + container-up check.
3. **Next sprint:** add Hermes `/metrics`, Grafana dashboards, Alertmanager Slack routing.
4. **Future:** decide on K8s edge migration vs. reconciling ingress classes.
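Until the blackbox exporter and alerting are in place, the cert-expiry check for the statically loaded certificates could be approximated with the Python standard library. This is a rough sketch, not the planned alerting setup: the function names and the 21-day threshold are assumptions chosen for illustration. It reads the served certificate's `notAfter` field and converts it with `ssl.cert_time_to_seconds`.

```python
import socket
import ssl
import time


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the `notAfter` string of the certificate served by host:port.

    The default context verifies the chain and hostname, so an already
    expired or mismatched certificate raises instead of returning.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_left(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and the certificate's expiry."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def needs_rotation(not_after: str, now: float, threshold_days: float = 21) -> bool:
    """True when fewer than `threshold_days` remain (21 is an assumed default)."""
    return days_left(not_after, now) < threshold_days
```

For a live check, something like `needs_rotation(fetch_not_after("git.fetcherpay.com"), time.time())` could be run per domain from a scheduled job.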