hermes-mcp/docs/runbooks/2026-06-14-active-issues-and-debt.md
2026-06-14 12:26:34 -04:00

Active Issues and Remaining Debt — 2026-06-14

What is working now

All commercial domains verified reachable with valid TLS:

  • hermes.squaremcp.com / openapi-living-brief.json
  • app.squaremcp.com
  • docs.squaremcp.com
  • squaremcp.com / www.squaremcp.com
  • tiktok.squaremcp.com
  • fetcherpay.com / www.fetcherpay.com
  • workflow.fetcherpay.com
  • mail.fetcherpay.com
  • git.fetcherpay.com

Hermes path-specific routes verified:

  • POST /api/pilot-request returns 201 on squaremcp.com, www.squaremcp.com, tiktok.squaremcp.com
  • GET /auth/tiktok/start returns 302 on tiktok.squaremcp.com
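
The route checks above could be scripted roughly as follows. This is a minimal sketch, not the contents of scripts/verify-public-endpoints.sh. The function names and the empty JSON body sent with the POST are assumptions, and the real pilot-request endpoint may require a valid payload before it answers 201. Be aware that running it against production sends a real POST.

```python
import urllib.error
import urllib.request

# (method, host, path, expected status), taken from the verified routes above.
ROUTES = [
    ("POST", "squaremcp.com", "/api/pilot-request", 201),
    ("POST", "www.squaremcp.com", "/api/pilot-request", 201),
    ("POST", "tiktok.squaremcp.com", "/api/pilot-request", 201),
    ("GET", "tiktok.squaremcp.com", "/auth/tiktok/start", 302),
]


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so a 302 is reported as a 302."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(method, host, path, timeout=10.0):
    """Return the HTTP status code for one HTTPS request."""
    data = b"{}" if method == "POST" else None  # placeholder body (assumption)
    req = urllib.request.Request(f"https://{host}{path}", data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses, including the unfollowed 302, arrive as HTTPError.
        return err.code


def check_routes(routes, fetch=fetch_status):
    """Return (method, host, path, expected, actual) for every mismatch."""
    failures = []
    for method, host, path, expected in routes:
        try:
            actual = fetch(method, host, path)
        except OSError as err:  # DNS, TLS, and connection failures
            actual = f"error: {err}"
        if actual != expected:
            failures.append((method, host, path, expected, actual))
    return failures
```

Calling check_routes(ROUTES) returns an empty list when every route answers with its expected status. The fetch parameter is injectable so the comparison logic can be exercised without touching the network.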

Still down / not addressed

| Subdomain / Service | Why it is down | What would fix it |
| --- | --- | --- |
| api.fetcherpay.com | fetcherpay-api container not running | Start fetcherpay-api (needs env vars, Postgres, Redis) |
| prometheus.fetcherpay.com | Prometheus container not running | Start Prometheus from docker-compose.fetcherpay.yml |
| grafana.fetcherpay.com | Grafana container not running | Start Grafana from docker-compose.fetcherpay.yml |
| adminer.fetcherpay.com | Adminer container not running | Start Adminer from docker-compose.fetcherpay.yml |
| traefik.fetcherpay.com | Traefik dashboard is on :8080 but not routed through a public host label | Add a secure router or restrict dashboard to localhost/VPN |
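
Four of the five entries above come down to a container that is not running, which a small container-up check could catch. The sketch below assumes the Docker CLI is on the PATH. Only fetcherpay-api is a container name taken from this runbook; the other three names are guesses and should be matched against docker-compose.fetcherpay.yml.

```python
import subprocess

# fetcherpay-api is named in the runbook; the remaining names are assumptions
# and must be checked against docker-compose.fetcherpay.yml.
EXPECTED = ["fetcherpay-api", "prometheus", "grafana", "adminer"]


def running_containers():
    """Return the names of running containers as reported by `docker ps`."""
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}


def missing_containers(expected, running):
    """Return the expected container names that are not running, in order."""
    return [name for name in expected if name not in running]
```

On the host, missing_containers(EXPECTED, running_containers()) lists whatever still needs to be started. The Traefik dashboard entry is a routing issue, not a stopped container, so this check does not cover it.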

Architectural debt

  1. K8s nginx-ingress is bypassed

    • Traefik's Docker iptables rules intercept all public HTTP/S traffic before it can reach nginx-ingress.
    • The active nginx-ingress controller's class is public, but the manifests specify nginx, so the two do not match.
    • Long term: either reconcile ingressClassName or migrate the public edge to K8s.
  2. Manual static certificate workaround

    • Traefik cannot issue new certs via GoDaddy DNS-01 for several domains because of DUPLICATE_RECORD TXT errors.
    • Certs are extracted from K8s cert-manager secrets and loaded statically.
    • These must be manually rotated before expiry.
  3. No observability

    • No synthetic uptime probes.
    • No cert-expiry alerting.
    • No Hermes /metrics endpoint.
    • No Alertmanager / Slack alerts.
    • No centralized logs.
  4. Secret management

    • Plaintext secrets in hermes-k8s.yaml and compose env vars.
    • No Sealed Secrets / External Secrets / Vault.
  5. Single point of failure

    • One host, one residential IP, one edge proxy.
    • No redundancy or failover.
  6. Gitea SSH port

    • Changed from 2222 to 22222 due to an unknown process binding 2222.
    • The original occupant of port 2222 was never identified; a reboot would be needed to clear it.
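
Items 2 and 3 together mean a statically loaded certificate can expire without anyone noticing. Until cert-expiry alerting exists, a check along these lines could flag certificates that are close to expiry. It is a sketch using only the Python standard library; the function names and the 21-day threshold are assumptions, not an agreed policy.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days from `now` until a notAfter string such as 'Jan  5 09:34:43 2018 GMT'."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), timezone.utc)
    return (expires - now).total_seconds() / 86400


def served_not_after(host, port=443, timeout=10.0):
    """Fetch the notAfter field of the certificate a host currently serves."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiring_soon(hosts, threshold_days=21):
    """Return (host, days_left) for hosts whose cert expires within the threshold."""
    results = []
    for host in hosts:
        days = days_until_expiry(served_not_after(host))
        if days < threshold_days:  # threshold is an assumed value
            results.append((host, round(days, 1)))
    return results
```

For example, expiring_soon(["hermes.squaremcp.com", "git.fetcherpay.com"]) would return the hosts that need rotation soon. Because the default context verifies the chain, a certificate that has already expired raises an ssl.SSLCertVerificationError during the handshake instead of being reported with a negative day count, so callers should treat that exception as a failure too.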

See 2026-06-14-public-edge-outage-plan.md for the full phased plan. Priorities:

  1. Immediate: finalize RCA, runbook, and scripts/verify-public-endpoints.sh.
  2. This week: deploy blackbox exporter + cert-expiry alerts + container-up check.
  3. Next sprint: add Hermes /metrics, Grafana dashboards, Alertmanager Slack routing.
  4. Future: decide on K8s edge migration vs. reconciling ingress classes.