hermes-mcp/docs/runbooks/2026-06-14-outage-index.md
2026-06-14 12:26:34 -04:00

2026-06-14 Public Edge Outage — Vault Index

All documentation for the outage, its root cause, the fix, and the follow-up plan lives in this SquareMCP vault folder.

Files

| File | Purpose |
|------|---------|
| 2026-06-14-public-edge-outage-rca.md | Root cause analysis and incident timeline. |
| 2026-06-14-outage-fix-log.md | Step-by-step record of every config change, command, and verification result. |
| 2026-06-14-infrastructure-findings.md | As-built architecture, Traefik/K8s behavior, Hermes route table, and monitoring gaps. |
| 2026-06-14-active-issues-and-debt.md | What is still down, remaining technical debt, and recommended next steps. |
| 2026-06-14-public-edge-outage-plan.md | Proposed runbook, monitoring, probes, and alerting plan (Phase 14). |
| 2026-06-14-outage-index.md | This file. |

Quick status

  • All listed squaremcp.com domains reachable with valid TLS.
  • All listed fetcherpay.com domains reachable with valid TLS.
  • Hermes path routes (/api/pilot-request, /auth/tiktok) verified.
  • ⚠️ K8s nginx-ingress remains bypassed by Traefik.
  • ⚠️ Several FetcherPay services still stopped (api, Prometheus, Grafana, Adminer).
  • ⚠️ No automated monitoring or alerting yet.
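
Since nothing automated re-checks the "reachable with valid TLS" status yet, a manual spot check could look roughly like the sketch below. It is not the probe proposed in the plan document. The function names are hypothetical, and the hostnames passed in are up to the operator because this index does not enumerate the domains.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days between `now` and a certificate's notAfter string."""
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" form
    # that getpeercert() returns, giving seconds since the epoch.
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


def check_host(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Open a fully verified TLS connection; return days until the cert expires.

    Raises ssl.SSLError or OSError when the host is unreachable or the
    certificate chain / hostname does not validate.
    """
    ctx = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"], datetime.now(timezone.utc))


def report(hosts: list[str]) -> dict[str, str]:
    """Check each host and return a human-readable status per hostname."""
    results = {}
    for host in hosts:
        try:
            results[host] = f"OK, cert expires in {check_host(host):.0f} days"
        except (ssl.SSLError, OSError) as exc:
            results[host] = f"FAILED ({exc})"
    return results
```

Calling `report(["squaremcp.com", "fetcherpay.com"])` would cover only the two apex domains, which are used here as placeholders. The real list of hostnames should come from the RCA or infrastructure findings documents.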

Reference paths on disk

  • Traefik compose: /home/garfield/traefik-compose.yml
  • Traefik static config: /home/garfield/traefik.yml
  • Traefik dynamic config: /home/garfield/letsencrypt/manual/tls.yml
  • Static certs: /home/garfield/letsencrypt/manual/certs/
  • FetcherPay prod compose: /home/garfield/Downloads/docker-compose.prod.yml
  • Hermes K8s manifest: /home/garfield/hermes-mcp/hermes-k8s.yaml
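
Before following the fix log on the host, it may be worth confirming that these paths still exist. The sketch below copies the paths from the list above. The helper name `missing_paths` is hypothetical, and the script only tests for existence without reading or validating any config.

```python
from pathlib import Path

# Paths copied verbatim from the "Reference paths on disk" list.
REFERENCE_PATHS = [
    "/home/garfield/traefik-compose.yml",
    "/home/garfield/traefik.yml",
    "/home/garfield/letsencrypt/manual/tls.yml",
    "/home/garfield/letsencrypt/manual/certs/",
    "/home/garfield/Downloads/docker-compose.prod.yml",
    "/home/garfield/hermes-mcp/hermes-k8s.yaml",
]


def missing_paths(paths: list[str]) -> list[str]:
    """Return the subset of `paths` that does not exist on this machine."""
    return [p for p in paths if not Path(p).exists()]
```

Running `missing_paths(REFERENCE_PATHS)` on the host should return an empty list if the layout still matches this index.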