hermes-mcp/docs/runbooks/2026-06-14-active-issues-and-debt.md
2026-06-14 12:26:34 -04:00

# Active Issues and Remaining Debt — 2026-06-14
## What is working now
All commercial domains verified reachable with valid TLS:
- `hermes.squaremcp.com` / `openapi-living-brief.json`
- `app.squaremcp.com`
- `docs.squaremcp.com`
- `squaremcp.com` / `www.squaremcp.com`
- `tiktok.squaremcp.com`
- `fetcherpay.com` / `www.fetcherpay.com`
- `workflow.fetcherpay.com`
- `mail.fetcherpay.com`
- `git.fetcherpay.com`
Hermes path-specific routes verified:
- `POST /api/pilot-request` → `201` on `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`
- `GET /auth/tiktok/start` → `302` on `tiktok.squaremcp.com`
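
The route checks above could be automated along these lines. This is a minimal sketch, not the contents of `scripts/verify-public-endpoints.sh`. The `https://` scheme, the empty JSON body for the `POST`, and the names `Check`, `http_status`, and `run_checks` are all assumptions. The runbook does not say what payload `/api/pilot-request` expects, and a real `POST` may create a pilot-request record, so the network call is kept separate from the comparison logic. Calling `run_checks(CHECKS)` performs the live requests and returns a list of failure messages, empty when every route matches.

```python
from __future__ import annotations

import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    method: str
    url: str
    expected_status: int


# Routes and status codes come from the list above; the https:// scheme is assumed.
CHECKS = [
    Check("POST", "https://squaremcp.com/api/pilot-request", 201),
    Check("POST", "https://www.squaremcp.com/api/pilot-request", 201),
    Check("POST", "https://tiktok.squaremcp.com/api/pilot-request", 201),
    Check("GET", "https://tiktok.squaremcp.com/auth/tiktok/start", 302),
]


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so a 302 can be observed."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


def http_status(check: Check, timeout: float = 10.0) -> int:
    """Perform one request and return its HTTP status code."""
    opener = urllib.request.build_opener(_NoRedirect)
    # Assumption: an empty JSON object as the POST body. The real payload is
    # not documented in this runbook and may be required to get a 201.
    data = b"{}" if check.method == "POST" else None
    req = urllib.request.Request(
        check.url,
        data=data,
        method=check.method,
        headers={"Content-Type": "application/json"},
    )
    try:
        with opener.open(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx and 5xx responses arrive here


def run_checks(
    checks: list[Check], fetch: Callable[[Check], int] = http_status
) -> list[str]:
    """Return one message per failed check; an empty list means all passed."""
    failures = []
    for check in checks:
        try:
            got = fetch(check)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            failures.append(f"{check.method} {check.url}: {exc}")
            continue
        if got != check.expected_status:
            failures.append(
                f"{check.method} {check.url}: expected {check.expected_status}, got {got}"
            )
    return failures
```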
---
## Still down / not addressed
| Subdomain / Service | Why it is down | What would fix it |
|---|---|---|
| `api.fetcherpay.com` | `fetcherpay-api` container not running | Start `fetcherpay-api` (needs env vars, Postgres, Redis) |
| `prometheus.fetcherpay.com` | Prometheus container not running | Start Prometheus from `docker-compose.fetcherpay.yml` |
| `grafana.fetcherpay.com` | Grafana container not running | Start Grafana from `docker-compose.fetcherpay.yml` |
| `adminer.fetcherpay.com` | Adminer container not running | Start Adminer from `docker-compose.fetcherpay.yml` |
| `traefik.fetcherpay.com` | Traefik dashboard is on `:8080` but not routed through a public host label | Add a secure router or restrict dashboard to localhost/VPN |
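
Four of the five rows above come down to a container that is not running, so a container-up check can be as simple as comparing expected names against `docker ps`. The following is a rough sketch under stated assumptions: `fetcherpay-api` is the only container name given in this runbook, and `prometheus`, `grafana`, and `adminer` are hypothetical placeholders for whatever names `docker-compose.fetcherpay.yml` actually assigns. `missing_containers(EXPECTED, running_containers())` would list whatever is still down.

```python
from __future__ import annotations

import subprocess

# "fetcherpay-api" is named in the table above. The other three names are
# hypothetical; replace them with the names the compose file really uses.
EXPECTED = ["fetcherpay-api", "prometheus", "grafana", "adminer"]


def parse_names(docker_ps_output: str) -> set[str]:
    """Turn `docker ps --format '{{.Names}}'` output into a set of names."""
    return {line.strip() for line in docker_ps_output.splitlines() if line.strip()}


def running_containers() -> set[str]:
    """Ask the local Docker daemon which containers are currently running."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_names(result.stdout)


def missing_containers(expected: list[str], running: set[str]) -> list[str]:
    """Return the expected containers that are not running, in input order."""
    return [name for name in expected if name not in running]
```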
---
## Architectural debt
1. **K8s nginx-ingress is bypassed**
- Traefik's Docker iptables rules intercept all public HTTP/S traffic.
- The active nginx-ingress controller class is `public`, but the manifests use `nginx`, so the controller would not pick them up even if traffic reached it.
- Long term: either reconcile `ingressClassName` or migrate the public edge to K8s.
2. **Manual static certificate workaround**
- Traefik cannot issue new certs via GoDaddy DNS-01 for several domains because of `DUPLICATE_RECORD` TXT errors.
- Certs are extracted from K8s cert-manager secrets and loaded statically.
- These must be manually rotated before expiry.
3. **No observability**
- No synthetic uptime probes.
- No cert-expiry alerting.
- No Hermes `/metrics` endpoint.
- No Alertmanager / Slack alerts.
- No centralized logs.
4. **Secret management**
- Plaintext secrets in `hermes-k8s.yaml` and compose env vars.
- No Sealed Secrets / External Secrets / Vault.
5. **Single point of failure**
- One host, one residential IP, one edge proxy.
- No redundancy or failover.
6. **Gitea SSH port**
- Changed from `2222` to `22222` due to an unknown process binding `2222`.
- The original occupant of port `2222` was never identified; a reboot would be needed to clear it.
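
Items 2 and 3 together mean a statically loaded certificate can expire unnoticed. Until cert-expiry alerting exists, something like the sketch below could report how many days each served certificate has left. It uses only the Python standard library. The 21-day threshold and the function names are assumptions, and the host list is left as a parameter because this runbook does not say which domains use the static certs. `expiring_soon(hosts)` would return the hosts that need a manual rotation soon.

```python
from __future__ import annotations

import socket
import ssl
import time
from typing import Callable, Optional


def days_until(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  1 00:00:00 2027 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to host:port over TLS and return the served cert's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiring_soon(
    hosts: list[str],
    threshold_days: float = 21,  # assumed threshold, not taken from the runbook
    fetch: Callable[[str], str] = fetch_not_after,
    now: Optional[float] = None,
) -> list[tuple[str, float]]:
    """Return (host, days_left) for every cert under the threshold."""
    results = []
    for host in hosts:
        days_left = days_until(fetch(host), now)
        if days_left < threshold_days:
            results.append((host, days_left))
    return results
```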
---
## Recommended next steps
See `2026-06-14-public-edge-outage-plan.md` for the full phased plan. Priorities:
1. **Immediate:** finalize RCA, runbook, and `scripts/verify-public-endpoints.sh`.
2. **This week:** deploy blackbox exporter + cert-expiry alerts + container-up check.
3. **Next sprint:** add Hermes `/metrics`, Grafana dashboards, Alertmanager Slack routing.
4. **Future:** decide on K8s edge migration vs. reconciling ingress classes.
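
For step 4, it may help to first measure how many Ingress resources do not match the active controller class. The sketch below is one possible way to do that, assuming the input is the JSON produced by `kubectl get ingress -A -o json`. The function name is hypothetical, and only `spec.ingressClassName` is inspected, so Ingresses that select a class through the legacy `kubernetes.io/ingress.class` annotation would need a separate check.

```python
from __future__ import annotations

import json
from typing import Optional

# The active controller class according to the "Architectural debt" section.
ACTIVE_CLASS = "public"


def mismatched_ingresses(
    ingress_list_json: str, active_class: str = ACTIVE_CLASS
) -> list[tuple[str, str, Optional[str]]]:
    """Return (namespace, name, ingressClassName) for Ingresses whose
    spec.ingressClassName differs from the active controller class.

    An Ingress with no ingressClassName is reported with None.
    """
    items = json.loads(ingress_list_json).get("items", [])
    mismatches = []
    for item in items:
        metadata = item.get("metadata", {})
        class_name = item.get("spec", {}).get("ingressClassName")
        if class_name != active_class:
            mismatches.append(
                (metadata.get("namespace", "default"), metadata.get("name", ""), class_name)
            )
    return mismatches
```

An empty result would suggest the manifests already agree with the controller. A long list gives a rough size for the reconciliation work when weighing it against an edge migration.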