From 0e255e570a4324626ab60e67f44eb52d07b0ca42 Mon Sep 17 00:00:00 2001
From: Garfield
Date: Sun, 14 Jun 2026 12:26:34 -0400
Subject: [PATCH] docs(runbooks): add 2026-06-14 public edge outage RCA, fix log, infra findings, debt, and monitoring plan

---
 .../2026-06-14-active-issues-and-debt.md      |  75 ++++
 .../2026-06-14-infrastructure-findings.md     | 164 ++++++++
 docs/runbooks/2026-06-14-outage-fix-log.md    | 360 ++++++++++++++++++
 docs/runbooks/2026-06-14-outage-index.md      |  32 ++
 .../2026-06-14-public-edge-outage-plan.md     | 129 +++++++
 .../2026-06-14-public-edge-outage-rca.md      |  88 +++++
 6 files changed, 848 insertions(+)
 create mode 100644 docs/runbooks/2026-06-14-active-issues-and-debt.md
 create mode 100644 docs/runbooks/2026-06-14-infrastructure-findings.md
 create mode 100644 docs/runbooks/2026-06-14-outage-fix-log.md
 create mode 100644 docs/runbooks/2026-06-14-outage-index.md
 create mode 100644 docs/runbooks/2026-06-14-public-edge-outage-plan.md
 create mode 100644 docs/runbooks/2026-06-14-public-edge-outage-rca.md

diff --git a/docs/runbooks/2026-06-14-active-issues-and-debt.md b/docs/runbooks/2026-06-14-active-issues-and-debt.md
new file mode 100644
index 0000000..2922598
--- /dev/null
+++ b/docs/runbooks/2026-06-14-active-issues-and-debt.md
@@ -0,0 +1,75 @@
+# Active Issues and Remaining Debt — 2026-06-14
+
+## What is working now
+
+All commercial domains verified reachable with valid TLS:
+
+- `hermes.squaremcp.com` / `openapi-living-brief.json`
+- `app.squaremcp.com`
+- `docs.squaremcp.com`
+- `squaremcp.com` / `www.squaremcp.com`
+- `tiktok.squaremcp.com`
+- `fetcherpay.com` / `www.fetcherpay.com`
+- `workflow.fetcherpay.com`
+- `mail.fetcherpay.com`
+- `git.fetcherpay.com`
+
+Hermes path-specific routes verified:
+- `POST /api/pilot-request` → `201` on `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`
+- `GET /auth/tiktok/start` → `302` on `tiktok.squaremcp.com`
+
+---
+
+## Still down / not addressed
+
+| Subdomain / Service | Why it is down | What would fix it |
+|---|---|---|
+| `api.fetcherpay.com` | `fetcherpay-api` container not running | Start `fetcherpay-api` (needs env vars, Postgres, Redis) |
+| `prometheus.fetcherpay.com` | Prometheus container not running | Start Prometheus from `docker-compose.fetcherpay.yml` |
+| `grafana.fetcherpay.com` | Grafana container not running | Start Grafana from `docker-compose.fetcherpay.yml` |
+| `adminer.fetcherpay.com` | Adminer container not running | Start Adminer from `docker-compose.fetcherpay.yml` |
+| `traefik.fetcherpay.com` | Traefik dashboard is on `:8080` but not routed through a public host label | Add a secure router or restrict dashboard to localhost/VPN |
+
+---
+
+## Architectural debt
+
+1. **K8s nginx-ingress is bypassed**
+   - Traefik’s Docker iptables rules intercept all public HTTP/S traffic.
+   - The active nginx-ingress controller class is `public`; manifests use `nginx`.
+   - Long term: either reconcile `ingressClassName` or migrate the public edge to K8s.
+
+2. **Manual static certificate workaround**
+   - Traefik cannot issue new certs via GoDaddy DNS-01 for several domains because of `DUPLICATE_RECORD` TXT errors.
+   - Certs are extracted from K8s cert-manager secrets and loaded statically.
+   - These must be manually rotated before expiry.
+
+3. **No observability**
+   - No synthetic uptime probes.
+   - No cert-expiry alerting.
+   - No Hermes `/metrics` endpoint.
+   - No Alertmanager / Slack alerts.
+   - No centralized logs.
+
+4. **Secret management**
+   - Plaintext secrets in `hermes-k8s.yaml` and compose env vars.
+   - No Sealed Secrets / External Secrets / Vault.
+
+5. **Single point of failure**
+   - One host, one residential IP, one edge proxy.
+   - No redundancy or failover.
+
+6. **Gitea SSH port**
+   - Changed from `2222` to `22222` due to an unknown process binding `2222`.
+   - The original occupant of port `2222` was never identified; a reboot would be needed to clear it.
+
+---
+
+## Recommended next steps
+
+See `2026-06-14-public-edge-outage-plan.md` for the full phased plan. Priorities:
+
+1. **Immediate:** finalize RCA, runbook, and `scripts/verify-public-endpoints.sh`.
+2. **This week:** deploy blackbox exporter + cert-expiry alerts + container-up check.
+3. **Next sprint:** add Hermes `/metrics`, Grafana dashboards, Alertmanager Slack routing.
+4. **Future:** decide on K8s edge migration vs. reconciling ingress classes.
diff --git a/docs/runbooks/2026-06-14-infrastructure-findings.md b/docs/runbooks/2026-06-14-infrastructure-findings.md
new file mode 100644
index 0000000..b90e8d8
--- /dev/null
+++ b/docs/runbooks/2026-06-14-infrastructure-findings.md
@@ -0,0 +1,164 @@
+# Infrastructure Findings — SquareMCP / FetcherPay
+
+This document captures the as-built architecture, ingress behavior, monitoring state, and Hermes route table discovered during the 2026-06-14 outage response.
+
+---
+
+## 1. High-level architecture
+
+The single production server (`104.190.60.129`) hosts two separate ingress layers:
+
+| Ingress Layer | Technology | Serves |
+|---|---|---|
+| **Docker edge proxy** | Traefik v3 | `*.fetcherpay.com` Docker Compose stacks, plus static file-provider routes for `*.squaremcp.com` |
+| **Kubernetes ingress** | nginx-ingress-microk8s + cert-manager | `*.squaremcp.com` K8s workloads (currently bypassed by Traefik) |
+
+Both layers use Let’s Encrypt TLS. Public ports `80`/`443` are bound by the Docker Traefik container, so its `iptables` rules win over host-network K8s services.
+
+---
+
+## 2. Traefik configuration
+
+### Static config
+**File:** `/home/garfield/traefik.yml`
+
+- Dashboard enabled on `:8080` with `insecure: true`.
+- Entrypoints: `web` (HTTP → HTTPS redirect) and `websecure` (HTTPS, `:443`).
+- Providers: Docker (socket) + file provider (`/letsencrypt/manual/tls.yml`, `watch: true`).
+- Certificate resolver: `letsencrypt` via GoDaddy DNS-01.
+
+### Compose
+**File:** `/home/garfield/traefik-compose.yml`
+
+- Networks: `hermes-net`, `obsidian-net`, `fetcherpay` (all external).
+- Volumes: Docker socket, static config, `letsencrypt` directory.
+
+### Dynamic routing
+**File:** `/home/garfield/letsencrypt/manual/tls.yml`
+
+Final state after the fix has file-provider routers for all commercial domains and path-specific rules that send `/api/pilot-request` and `/auth/tiktok` to Hermes.
+
+---
+
+## 3. Kubernetes ingress mismatch
+
+- **Controller class:** `public`
+- **Ingress class used by manifests:** `nginx`
+
+This means the active controller ignores most Ingress resources. Even if Traefik were removed, those Ingresses would not be served until the class is reconciled.
+
+Affected manifests include:
+- `hermes-mcp/hermes-k8s.yaml`
+- `hermes-mcp/product/app/app-k8s.yaml`
+- `hermes-mcp/docs/docs-k8s.yaml`
+- `hermes-mcp/product/site/squaremcp-k8s-ingress.yaml`
+
+---
+
+## 4. Hermes MCP route table
+
+**File:** `hermes-mcp/src/index.ts`
+
+### Public / commercial endpoints
+
+| Method | Path | Notes |
+|---|---|---|
+| `GET` | `/` | Static files from `../product` |
+| `GET` | `/openapi-living-brief.json` | Obsidian-only OpenAPI spec for ChatGPT |
+| `GET` | `/openapi.json` | Full OpenAPI spec |
+| `GET` | `/auth/tiktok/start` | Redirect to TikTok Login Kit |
+| `GET` | `/auth/tiktok/callback` | TikTok OAuth callback |
+| `POST` | `/api/pilot-request` | Public form submission; origin-gated |
+| `GET` | `/health` | Liveness/readiness probe |
+
+### OAuth / MCP discovery
+
+| Method | Path |
+|---|---|
+| `POST` | `/oauth/register` |
+| `GET` / `POST` | `/oauth/authorize` |
+| `POST` | `/oauth/token` |
+| `GET` | `/.well-known/oauth-authorization-server` |
+| `GET` | `/.well-known/openid-configuration` |
+| `GET` / `POST` / `DELETE` | `/mcp` |
+| `GET` | `/sse` |
+| `POST` | `/messages` |
+| `GET` | `/tools` |
+
+### Capability-guarded tool API
+
+All `/api/*` tool routes require auth + capability grant:
+
+| Capability | Example endpoints |
+|---|---|
+| `obsidian` | `/api/obsidian/search`, `/api/obsidian/note`, `/api/obsidian/note/append`, `/api/obsidian/sync` |
+| `email` | `/api/email/profile`, `/api/email/search`, `/api/email/read`, `/api/email/send` |
+| `whatsapp` | `/api/whatsapp/send`, `/api/whatsapp/templates` |
+| `linkedin` | `/api/linkedin/profile`, `/api/linkedin/post`, `/api/linkedin/message` |
+| `telegram` | `/api/telegram/me`, `/api/telegram/message`, `/api/telegram/updates` |
+| `discord` | `/api/discord/me`, `/api/discord/guilds`, `/api/discord/message` |
+| `instagram` | `/api/instagram/profile`, `/api/instagram/media`, `/api/instagram/post` |
+| `twitter` | `/api/twitter/search`, `/api/twitter/tweets`, `/api/twitter/tweet` |
+| `facebook` | `/api/facebook/page`, `/api/facebook/posts`, `/api/facebook/post` |
+| `tiktok` | `/api/tiktok/profile`, `/api/tiktok/video`, `/api/tiktok/video/status` |
+
+### Health endpoint
+
+```typescript
+app.get('/health', (_req, res) => {
+  res.json({
+    status: 'ok',
+    service: 'hermes-mcp',
+    toolCount,
+    transports,
+    endpoints,
+  });
+});
+```
+
+Used by both K8s readiness and liveness probes in `hermes-k8s.yaml`.
+
+---
+
+## 5. Monitoring gaps
+
+### Prometheus / Grafana
+
+- Prometheus and Grafana containers exist in `docker-compose.fetcherpay.yml`.
+- Prometheus scrapes itself, `fetcherpay-api:3000`, and Docker metrics at `172.20.0.1:9323`.
+- **Hermes MCP is not scraped** and has no `/metrics` endpoint.
+- No Alertmanager, no alert rules.
+
+### Health checks
+
+- Hermes has `/health` but no `/ready` or `/livez` separation.
+- Docker health checks exist for Postgres, MySQL, Redis, Gitea, and FetcherPay API, but **not for Hermes**.
+
+### Uptime / synthetic probes
+
+- No blackbox exporter.
+- No external uptime monitoring (Pingdom, UptimeRobot, Grafana Cloud, etc.).
+- No cert-expiry alerting.
+- No K8s ingress reconciliation check.
+
+### Logs
+
+- No centralized log aggregation (Loki, Vector, Fluentd).
+
+---
+
+## 6. Secret management
+
+- `hermes-k8s.yaml` is gitignored and contains plaintext secrets (email, DB, OAuth, API keys).
+- Docker Compose stacks rely on exported env vars or `.env` files.
+- No Sealed Secrets, External Secrets Operator, or Vault in use.
+
+---
+
+## 7. Notable risks
+
+1. **Single point of failure:** one residential IP, one host, one edge proxy.
+2. **Split edge:** two ingress controllers with conflicting class configuration.
+3. **Manual certificate workaround:** static K8s-extracted certs in Traefik must be manually rotated before expiry.
+4. **No observability:** no metrics, alerting, or synthetic probes for the commercial domains.
+5. **Stopped services not detected:** Docker restart policies only help if containers were initially started.
diff --git a/docs/runbooks/2026-06-14-outage-fix-log.md b/docs/runbooks/2026-06-14-outage-fix-log.md
new file mode 100644
index 0000000..e573541
--- /dev/null
+++ b/docs/runbooks/2026-06-14-outage-fix-log.md
@@ -0,0 +1,360 @@
+# Outage Fix Log — 2026-06-14
+
+This is the step-by-step record of what was changed to restore public access to the SquareMCP / FetcherPay commercial sites.
+
+---
+
+## Environment
+
+- **Host:** `104.190.60.129` (MicroK8s + Docker)
+- **Edge proxy:** Traefik v3 in Docker, binds `:80`, `:443`, `:8080`
+- **Hermes MCP:** K8s pod with `hostNetwork: true` on `:3456`
+- **Key files:**
+  - `/home/garfield/traefik-compose.yml`
+  - `/home/garfield/traefik.yml`
+  - `/home/garfield/letsencrypt/manual/tls.yml`
+  - `/home/garfield/Downloads/docker-compose.prod.yml`
+
+---
+
+## 1. Attach Traefik to the FetcherPay network
+
+**File:** `/home/garfield/traefik-compose.yml`
+
+Added the `fetcherpay` external network so Traefik can reach FetcherPay Docker backends.
+
+```yaml
+services:
+  traefik:
+    ...
+    networks:
+      - hermes-net
+      - obsidian-net
+      - fetcherpay
+
+networks:
+  hermes-net:
+    external: true
+    name: hermes-mcp_hermes-net
+  obsidian-net:
+    external: true
+    name: obsidian_obsidian-net
+  fetcherpay:
+    external: true
+    name: fetcherpay_fetcherpay
+```
+
+---
+
+## 2. Rebuild the Traefik file-provider routing config
+
+**File:** `/home/garfield/letsencrypt/manual/tls.yml`
+
+Final config includes routers and services for:
+- `hermes.squaremcp.com`
+- `app.squaremcp.com`
+- `docs.squaremcp.com`
+- `squaremcp.com` / `www.squaremcp.com`
+- `tiktok.squaremcp.com`
+- `fetcherpay.com` / `www.fetcherpay.com`
+- `workflow.fetcherpay.com`
+- `mail.fetcherpay.com`
+- `git.fetcherpay.com`
+
+Path-specific rules that route to Hermes (`104.190.60.129:3456`):
+- `/api/pilot-request` on `squaremcp.com` / `www.squaremcp.com`
+- `/auth/tiktok` and `/api/pilot-request` on `tiktok.squaremcp.com`
+
+Full final config:
+
+```yaml
+http:
+  routers:
+    hermes:
+      rule: "Host(`hermes.squaremcp.com`)"
+      service: hermes
+      entryPoints: [websecure]
+      tls: { certResolver: letsencrypt }
+
+    squaremcp-app:
+      rule: "Host(`app.squaremcp.com`)"
+      service: squaremcp-app
+      entryPoints: [websecure]
+      tls: {}
+
+    squaremcp-docs:
+      rule: "Host(`docs.squaremcp.com`)"
+      service: squaremcp-docs
+      entryPoints: [websecure]
+      tls: {}
+
+    squaremcp-site-main:
+      rule: "Host(`squaremcp.com`) || Host(`www.squaremcp.com`)"
+      service: squaremcp-site
+      priority: 10
+      entryPoints: [websecure]
+      tls: {}
+
+    squaremcp-site-pilot:
+      rule: "(Host(`squaremcp.com`) || Host(`www.squaremcp.com`)) && PathPrefix(`/api/pilot-request`)"
+      service: hermes
+      priority: 30
+      entryPoints: [websecure]
+      tls: {}
+
+    squaremcp-tiktok-main:
+      rule: "Host(`tiktok.squaremcp.com`)"
+      service: squaremcp-site
+      priority: 10
+      entryPoints: [websecure]
+      tls: {}
+
+    squaremcp-tiktok-auth:
+      rule: "Host(`tiktok.squaremcp.com`) && PathPrefix(`/auth/tiktok`)"
+      service: hermes
+      priority: 30
+      entryPoints: [websecure]
+      tls: {}
+
+    squaremcp-tiktok-pilot:
+      rule: "Host(`tiktok.squaremcp.com`) && PathPrefix(`/api/pilot-request`)"
+      service: hermes
+      priority: 30
+      entryPoints: [websecure]
+      tls: {}
+
+    fetcherpay-root:
+      rule: "Host(`fetcherpay.com`) || Host(`www.fetcherpay.com`)"
+      service: fetcherpay-web
+      priority: 60
+      entryPoints: [websecure]
+      tls: {}
+
+    workflow:
+      rule: "Host(`workflow.fetcherpay.com`)"
+      service: temporal-ui
+      priority: 60
+      entryPoints: [websecure]
+      tls: {}
+
+    mail:
+      rule: "Host(`mail.fetcherpay.com`)"
+      service: poste
+      priority: 60
+      entryPoints: [websecure]
+      tls: {}
+
+    git:
+      rule: "Host(`git.fetcherpay.com`)"
+      service: gitea
+      priority: 60
+      entryPoints: [websecure]
+      tls: {}
+
+  services:
+    hermes:
+      loadBalancer:
+        servers: [{ url: "http://104.190.60.129:3456" }]
+        passHostHeader: true
+    squaremcp-app:
+      loadBalancer:
+        servers: [{ url: "http://10.152.183.164:80" }]
+        passHostHeader: true
+    squaremcp-docs:
+      loadBalancer:
+        servers: [{ url: "http://10.152.183.130:80" }]
+        passHostHeader: true
+    squaremcp-site:
+      loadBalancer:
+        servers: [{ url: "http://10.152.183.48:80" }]
+        passHostHeader: true
+    fetcherpay-web:
+      loadBalancer:
+        servers: [{ url: "http://172.20.0.9:80" }]
+        passHostHeader: true
+    temporal-ui:
+      loadBalancer:
+        servers: [{ url: "http://172.20.0.3:8080" }]
+        passHostHeader: true
+    poste:
+      loadBalancer:
+        servers: [{ url: "http://poste:80" }]
+        passHostHeader: true
+    gitea:
+      loadBalancer:
+        servers: [{ url: "http://gitea:3000" }]
+        passHostHeader: true
+
+tls:
+  certificates:
+    - certFile: /letsencrypt/manual/certs/squaremcp-app.crt
+      keyFile: /letsencrypt/manual/certs/squaremcp-app.key
+    - certFile: /letsencrypt/manual/certs/squaremcp-docs.crt
+      keyFile: /letsencrypt/manual/certs/squaremcp-docs.key
+    - certFile: /letsencrypt/manual/certs/squaremcp-site.crt
+      keyFile: /letsencrypt/manual/certs/squaremcp-site.key
+    - certFile: /letsencrypt/manual/certs/fetcherpay-root.crt
+      keyFile: /letsencrypt/manual/certs/fetcherpay-root.key
+    - certFile: /letsencrypt/manual/certs/mail-fetcherpay.crt
+      keyFile: /letsencrypt/manual/certs/mail-fetcherpay.key
+    - certFile: /letsencrypt/manual/certs/git-fetcherpay.crt
+      keyFile: /letsencrypt/manual/certs/git-fetcherpay.key
+```
+
+---
+
+## 3. Extract static TLS certificates from K8s cert-manager secrets
+
+Because Traefik’s GoDaddy DNS-01 resolver fails with `DUPLICATE_RECORD` for existing `_acme-challenge.*` TXT records, valid certificates were pulled from the K8s secrets that cert-manager already held.
+
+```bash
+mkdir -p /home/garfield/letsencrypt/manual/certs && cd /home/garfield/letsencrypt/manual/certs
+
+# squaremcp-app
+microk8s kubectl get secret squaremcp-app-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-app.crt
+microk8s kubectl get secret squaremcp-app-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-app.key
+
+# squaremcp-docs
+microk8s kubectl get secret squaremcp-docs-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-docs.crt
+microk8s kubectl get secret squaremcp-docs-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-docs.key
+
+# squaremcp-site (covers squaremcp.com / www.squaremcp.com / tiktok.squaremcp.com)
+microk8s kubectl get secret squaremcp-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-site.crt
+microk8s kubectl get secret squaremcp-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-site.key
+
+# fetcherpay-root
+microk8s kubectl get secret fetcherpay-root-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > fetcherpay-root.crt
+microk8s kubectl get secret fetcherpay-root-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > fetcherpay-root.key
+
+# mail.fetcherpay.com
+microk8s kubectl get secret mail-fetcherpay-tls -n email -o jsonpath='{.data.tls\.crt}' | base64 -d > mail-fetcherpay.crt
+microk8s kubectl get secret mail-fetcherpay-tls -n email -o jsonpath='{.data.tls\.key}' | base64 -d > mail-fetcherpay.key
+
+# git.fetcherpay.com
+microk8s kubectl get secret fetcherpay-git-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > git-fetcherpay.crt
+microk8s kubectl get secret fetcherpay-git-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > git-fetcherpay.key
+```
+
+---
+
+## 4. Start stopped backend containers
+
+### FetcherPay web
+
+```bash
+docker compose -p fetcherpay -f /home/garfield/docker-compose.fetcherpay.yml up -d fetcherpay-web
+```
+
+### Poste (mail)
+
+```bash
+docker compose -p fetcherpay -f /home/garfield/Downloads/docker-compose.prod.yml up -d poste
+```
+
+### Postgres + Gitea (git)
+
+Gitea credentials were recovered from the existing Gitea config volume:
+
+```bash
+docker run --rm -v fetcherpay_gitea_data:/data alpine \
+  sh -c 'cat /data/gitea/conf/app.ini | grep -E "^(NAME|USER|PASSWD|HOST|DB_TYPE)"'
+# DB_TYPE = postgres
+# HOST = postgres:5432
+# NAME = gitea
+# USER = fetcherpay
+# PASSWD = fetcherpay_secure_2024
+```
+
+Then postgres and gitea were started with the required env vars:
+
+```bash
+cd /home/garfield/Downloads
+export POSTGRES_USER=fetcherpay
+export POSTGRES_PASSWORD=fetcherpay_secure_2024
+export POSTGRES_DB=postgres
+export GITEA_HOST=git.fetcherpay.com
+export GITEA_DB=gitea
+export MAIL_HOST=mail.fetcherpay.com
+export WEB_HOST=fetcherpay.com
+export API_HOST=api.fetcherpay.com
+export PROM_HOST=prometheus.fetcherpay.com
+export GRAFANA_HOST=grafana.fetcherpay.com
+export ADMINER_HOST=adminer.fetcherpay.com
+export TEMPORAL_HOST=workflow.fetcherpay.com
+export REDIS_PASSWORD=redis_pass
+export MYSQL_ROOT_PASSWORD=mysql_root
+export MYSQL_DATABASE=fetcherpay
+export MYSQL_USER=fetcherpay
+export MYSQL_PASSWORD=mysql_pass
+export GRAFANA_ADMIN_PASSWORD=admin
+export ADMINER_USERS=admin:admin
+export TRAEFIK_DASHBOARD_HOST=traefik.fetcherpay.com
+
+docker compose -p fetcherpay -f docker-compose.prod.yml up -d postgres gitea
+```
+
+---
+
+## 5. Fix `workflow.fetcherpay.com`
+
+The Docker label on the `temporal` service pointed Traefik at port `7233` (gRPC), causing 502s. A file-provider router was added in `tls.yml` pointing `workflow.fetcherpay.com` → `temporal-ui:8080`.
+
+---
+
+## 6. Fix Gitea SSH port conflict
+
+The host port `2222` was already in use by an unknown process and could not be freed. The Gitea SSH mapping was changed from `2222:22` to `22222:22`.
+
+**File:** `/home/garfield/Downloads/docker-compose.prod.yml`
+
+```yaml
+gitea:
+  ...
+  ports:
+    - "22222:22" # SSH (optional for git over SSH)
+```
+
+The `gitea` container was then recreated with the new mapping.
+
+---
+
+## 7. Restart Traefik after every config change
+
+```bash
+docker restart traefik
+```
+
+---
+
+## 8. Verification results
+
+Final public reachability check:
+
+```
+https://hermes.squaremcp.com/openapi-living-brief.json -> 200 (cert=0)
+https://app.squaremcp.com/ -> 200 (cert=0)
+https://docs.squaremcp.com/ -> 200 (cert=0)
+https://squaremcp.com/ -> 200 (cert=0)
+https://www.squaremcp.com/ -> 200 (cert=0)
+https://tiktok.squaremcp.com/ -> 200 (cert=0)
+https://tiktok.squaremcp.com/auth/tiktok/start -> 302 (cert=0)
+https://fetcherpay.com/ -> 200 (cert=0)
+https://www.fetcherpay.com/ -> 200 (cert=0)
+https://workflow.fetcherpay.com/ -> 200 (cert=0)
+https://mail.fetcherpay.com/ -> 302 (cert=0)
+https://git.fetcherpay.com/ -> 200 (cert=0)
+
+POST /api/pilot-request (tiktok) -> 201
+POST /api/pilot-request (root/www) -> 201
+GET /auth/tiktok/start -> 302
+```
+
+`cert=0` means TLS verification passed.
+
+---
+
+## Notes / gotchas
+
+- `/api/pilot-request` is `POST`-only. A `GET` request returns `404`, which is expected.
+- The `/auth/tiktok` routes are `/auth/tiktok/start` and `/auth/tiktok/callback`; the Traefik ``PathPrefix(`/auth/tiktok`)`` rule correctly forwards both.
+- Static certificate extraction required root access; Docker root containers were used when `sudo` began prompting for a password.
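
Reviewer note on the fix log's "Verification results" section: the reachability check recorded there could be reproduced with a small script along the following lines. This is a minimal sketch, not the script that was actually run. The function names (`probe`, `report`, `targets`, `verify_all`) are illustrative assumptions. It also assumes that the log's `cert=` field corresponds to curl's `%{ssl_verify_result}` write-out variable, where `0` means the certificate verified. The `POST /api/pilot-request` checks are left out because the request body is not shown in the log.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not taken from the repository.

# probe URL: prints "<http_code> <ssl_verify_result>" for one GET request.
# curl reports ssl_verify_result=0 when the certificate chain verified.
probe() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code} %{ssl_verify_result}\n' "$1"
}

# report URL EXPECTED CODE CERT: prints one result line in the log's format and
# succeeds only when the status matches and TLS verification passed.
report() {
  local url=$1 expected=$2 code=$3 cert=$4
  printf '%s -> %s (cert=%s)\n' "$url" "$code" "$cert"
  [[ "$code" == "$expected" && "$cert" == "0" ]]
}

# targets: URL and expected-status pairs copied from the verification results.
targets() {
  cat <<'EOF'
https://hermes.squaremcp.com/openapi-living-brief.json 200
https://app.squaremcp.com/ 200
https://docs.squaremcp.com/ 200
https://squaremcp.com/ 200
https://www.squaremcp.com/ 200
https://tiktok.squaremcp.com/ 200
https://tiktok.squaremcp.com/auth/tiktok/start 302
https://fetcherpay.com/ 200
https://www.fetcherpay.com/ 200
https://workflow.fetcherpay.com/ 200
https://mail.fetcherpay.com/ 302
https://git.fetcherpay.com/ 200
EOF
}

# verify_all: reads "URL EXPECTED_STATUS" pairs on stdin and returns non-zero
# if any endpoint fails its check.
verify_all() {
  local failures=0 url expected code cert
  while read -r url expected; do
    [[ -z "$url" ]] && continue
    read -r code cert < <(probe "$url")
    report "$url" "$expected" "${code:-000}" "${cert:-1}" || failures=$((failures + 1))
  done
  (( failures == 0 ))
}
```

Running `targets | verify_all` prints one line per endpoint and exits non-zero on any mismatch, which is the behavior the plan asks of `scripts/verify-public-endpoints.sh`. Keeping the comparison in `report` separate from the network call makes the pass/fail logic testable offline.
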
diff --git a/docs/runbooks/2026-06-14-outage-index.md b/docs/runbooks/2026-06-14-outage-index.md
new file mode 100644
index 0000000..01ac1ad
--- /dev/null
+++ b/docs/runbooks/2026-06-14-outage-index.md
@@ -0,0 +1,32 @@
+# 2026-06-14 Public Edge Outage — Vault Index
+
+All documentation for the outage, its root cause, the fix, and the follow-up plan lives in this SquareMCP vault folder.
+
+## Files
+
+| File | Purpose |
+|---|---|
+| `2026-06-14-public-edge-outage-rca.md` | Root cause analysis and incident timeline. |
+| `2026-06-14-outage-fix-log.md` | Step-by-step record of every config change, command, and verification result. |
+| `2026-06-14-infrastructure-findings.md` | As-built architecture, Traefik/K8s behavior, Hermes route table, and monitoring gaps. |
+| `2026-06-14-active-issues-and-debt.md` | What is still down, remaining technical debt, and recommended next steps. |
+| `2026-06-14-public-edge-outage-plan.md` | Proposed runbook, monitoring, probes, and alerting plan (Phase 1–4). |
+| `2026-06-14-outage-index.md` | This file. |
+
+## Quick status
+
+- ✅ All listed `squaremcp.com` domains reachable with valid TLS.
+- ✅ All listed `fetcherpay.com` domains reachable with valid TLS.
+- ✅ Hermes path routes (`/api/pilot-request`, `/auth/tiktok`) verified.
+- ⚠️ K8s nginx-ingress remains bypassed by Traefik.
+- ⚠️ Several FetcherPay services still stopped (`api`, Prometheus, Grafana, Adminer).
+- ⚠️ No automated monitoring or alerting yet.
+
+## Reference paths on disk
+
+- Traefik compose: `/home/garfield/traefik-compose.yml`
+- Traefik static config: `/home/garfield/traefik.yml`
+- Traefik dynamic config: `/home/garfield/letsencrypt/manual/tls.yml`
+- Static certs: `/home/garfield/letsencrypt/manual/certs/`
+- FetcherPay prod compose: `/home/garfield/Downloads/docker-compose.prod.yml`
+- Hermes K8s manifest: `/home/garfield/hermes-mcp/hermes-k8s.yaml`
diff --git a/docs/runbooks/2026-06-14-public-edge-outage-plan.md b/docs/runbooks/2026-06-14-public-edge-outage-plan.md
new file mode 100644
index 0000000..f52e7f4
--- /dev/null
+++ b/docs/runbooks/2026-06-14-public-edge-outage-plan.md
@@ -0,0 +1,129 @@
+# Plan: Document the outage, build a deployment runbook, and add diagnostics/monitoring
+
+## Goal
+Turn the June 2026 public-edge outage into repeatable, observable infrastructure, with all artifacts stored in the SquareMCP repository (`/home/garfield/hermes-mcp/`).
+1. Write a clear post-incident / RCA document.
+2. Create a step-by-step deployment runbook that the next operator can follow without guessing.
+3. Add probes, metrics, and alerting so the same class of failure is detected and escalated before users notice.
+
+---
+
+## Root cause (condensed)
+- **Public ports 80/443/8080 are owned by a Docker Traefik container.** Its iptables rules intercept all inbound traffic before the host-network K8s nginx-ingress can serve it.
+- **Traefik had no routers or valid TLS certificates** for the commercial `squaremcp.com` / `fetcherpay.com` domains, so it returned `404 page not found` with a self-signed cert.
+- **K8s cert-manager held valid certs**, but the active nginx-ingress controller uses `ingressClass=public` while the Ingress resources use `ingressClassName=nginx`, so K8s never reconciled them and could not serve traffic anyway.
+- **Several Docker backends were stopped**: `fetcherpay-web`, `poste`, `postgres`, `gitea`. The `temporal-ui` container was running but Traefik was pointed at its gRPC port (`7233`) instead of its HTTP UI port (`8080`).
+
+---
+
+## Deliverable 1: Post-incident / RCA document
+**Location:** `hermes-mcp/docs/runbooks/2026-06-14-public-edge-outage-rca.md`
+
+Sections:
+- **Summary** — what was down, for how long, user impact.
+- **Timeline** — detection, mitigation, full restoration.
+- **Root cause** — Traefik/Docker edge + missing routes/certs + K8s ingress class mismatch + stopped containers.
+- **Why detection failed** — no synthetic uptime checks, no cert-expiry alerting, no Traefik routing alert, Docker restart policies did not catch stopped non-Hermes services.
+- **Remediation actions taken** — static cert extraction, file-provider routers, network attachment, container restarts, port conflict resolution.
+- **Follow-up work** — this plan’s runbook and monitoring deliverables.
+
+---
+
+## Deliverable 2: Deployment runbook
+**Location:** `hermes-mcp/docs/runbooks/deployment.md`
+
+The runbook will cover:
+1. **Pre-flight checks**
+   - Confirm Traefik is attached to required networks (`hermes-net`, `obsidian-net`, `fetcherpay`).
+   - Confirm all expected Docker networks exist.
+   - Confirm static cert directory (`/home/garfield/letsencrypt/manual/certs/`) contains current certs for all file-provider domains.
+2. **Deploy / update the edge proxy**
+   - Rebuild / restart Traefik from `traefik-compose.yml`.
+   - Validate `tls.yml` routers, services, and certificate entries.
+   - Smoke-test every public host immediately after restart.
+3. **Deploy Hermes / SquareMCP (K8s path)**
+   - Build, push, update digest in `hermes-k8s.yaml`.
+   - Apply manifests and wait for rollout.
+   - Verify `/health`, `/openapi-living-brief.json`, OAuth endpoints, `/api/pilot-request`.
+4. **Deploy FetcherPay stack (Docker path)**
+   - Export required env vars (or ensure `.env` is present).
+   - `docker compose -p fetcherpay up -d` for web, api, mail, git, workflow.
+   - Verify `fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`, `workflow.fetcherpay.com`.
+5. **Certificate renewal / rotation**
+   - When Traefik ACME works vs. when to fall back to K8s cert-manager secret extraction.
+   - Step-by-step secret extraction command template.
+6. **Rollback checklist**
+   - Revert image digest / compose change, restart, verify.
+7. **Verification script**
+   - A single `hermes-mcp/scripts/verify-public-endpoints.sh` that curls every critical URL and exits non-zero on failure.
+
+---
+
+## Deliverable 3: Diagnostics, metrics, and probes
+Two viable approaches. The recommended one keeps the current architecture and hardens it; the alternative migrates the edge to K8s.
+
+### Option A — Harden the existing Traefik edge (recommended)
+**Why:** Lowest risk, fastest to implement, directly protects against the exact failure modes we just saw.
+
+Implementation pieces:
+1. **Synthetic uptime probes (blackbox exporter)**
+   - Add `prom/blackbox-exporter` config inside the repo (e.g. `hermes-mcp/monitoring/blackbox.yml`).
+   - Probe all public URLs every 60s: HTTPS, TLS cert validity, expected HTTP status.
+   - Domains: `hermes.squaremcp.com/openapi-living-brief.json`, `app.squaremcp.com`, `docs.squaremcp.com`, `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`, `fetcherpay.com`, `www.fetcherpay.com`, `workflow.fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`.
+   - Path-specific probes: `POST /api/pilot-request`, `GET /auth/tiktok/start`.
+2. **Certificate expiry alerting**
+   - Blackbox `probe_ssl_earliest_cert_expiry` alert when any cert has < 7 days left.
+   - Separate alert for Traefik default / self-signed cert (would fire immediately on a routing miss).
+3. **Traefik routing health**
+   - Enable Traefik metrics endpoint (`--metrics.prometheus`).
+   - Alert on 5xx responses counted in `traefik_service_requests_total` or on `traefik_service_server_up == 0`.
+4. **Container health & restart policy**
+   - Ensure every commercial service has `restart: unless-stopped` and a Docker `healthcheck`.
+   - Add a simple systemd user timer or cron that runs `docker compose -p fetcherpay ps` and alerts if any expected container is not `Up`.
+5. **K8s ingress reconciliation check**
+   - A probe/script (`hermes-mcp/scripts/check-k8s-ingress.sh`) that confirms all `squaremcp.com` Ingresses have a matching `ADDRESS` and valid TLS secret.
+   - Alert if `kubectl get ingress -A` shows missing addresses or cert-manager `CertificateReady=False`.
+6. **Hermes application metrics**
+   - Add a `/metrics` endpoint using `prom-client` in `src/index.ts`.
+   - Instrument request latency, error rate, active OAuth sessions, tool call counts.
+   - Scrape it from Prometheus.
+7. **Separate readiness probe**
+   - Keep `/health` for liveness; add `/ready` that checks DB/Redis connectivity before reporting ready.
+8. **Alertmanager + Slack / email**
+   - Deploy `prom/alertmanager` alongside Prometheus.
+   - Route critical alerts (site down, cert expiring, service unhealthy) to a Slack webhook and/or email.
+9. **Verification script**
+   - `hermes-mcp/scripts/verify-public-endpoints.sh` used in runbook and optionally in CI.
+
+### Option B — Migrate public edge to K8s nginx-ingress
+**Why:** Eliminates the split-ingress complexity that caused the routing confusion.
+
+Implementation pieces:
+1. Reconcile `ingressClassName: nginx` → `public` (or change the controller to `nginx`).
+2. Reconfigure Traefik to not bind public 80/443, or move it to an internal Docker-only role.
+3. Point public DNS/router directly at the K8s nginx-ingress controller (host-network or NodePort).
+4. Re-issue all certs via cert-manager and remove the static-cert workaround.
+5. Still add blackbox exporter / Alertmanager / Hermes metrics as in Option A.
+
+**Trade-off:** Larger architectural change, risk of another outage during migration, but cleaner long term.
+ +--- + +## Suggested file changes (all under `hermes-mcp/`) +- **New:** `docs/runbooks/2026-06-14-public-edge-outage-rca.md` +- **New / rewrite:** `docs/runbooks/deployment.md` +- **New:** `docs/runbooks/monitoring-playbook.md` (alert runbook) +- **New:** `scripts/verify-public-endpoints.sh` +- **New:** `scripts/check-k8s-ingress.sh` +- **Modify:** `src/index.ts` — add `/metrics`, `/ready`, enhance `/health` +- **Modify:** `hermes-k8s.yaml` — add startup probe, resource requests/limits +- **New:** `monitoring/blackbox.yml`, `monitoring/prometheus.yml`, `monitoring/alert-rules.yml`, `monitoring/alertmanager.yml` +- **Modify:** root `docker-compose.fetcherpay.yml` or create `monitoring/docker-compose.monitoring.yml` if the user prefers not to touch the prod compose file. + +--- + +## Phasing recommendation +- **Phase 1 (immediate):** RCA doc + runbook + `scripts/verify-public-endpoints.sh`. +- **Phase 2 (this week):** blackbox exporter + cert-expiry alerts + container-up check. +- **Phase 3 (next sprint):** Hermes `/metrics` + dashboards + Alertmanager Slack routing. +- **Phase 4 (future):** decide on Option B edge migration after Phase 1–3 are stable. diff --git a/docs/runbooks/2026-06-14-public-edge-outage-rca.md b/docs/runbooks/2026-06-14-public-edge-outage-rca.md new file mode 100644 index 0000000..1c7870b --- /dev/null +++ b/docs/runbooks/2026-06-14-public-edge-outage-rca.md @@ -0,0 +1,88 @@ +# Public Edge Outage — Root Cause Analysis + +**Date:** 2026-06-14 +**Severity:** High — all public `squaremcp.com` and `fetcherpay.com` properties unreachable or certificate-invalid. +**Status:** Resolved. All listed commercial domains reachable with valid TLS. + +--- + +## Summary + +On 2026-06-14, every public-facing SquareMCP / FetcherPay domain was either returning `404 page not found` or serving an invalid/default TLS certificate. The root cause was a **misconfigured public edge proxy combined with stopped backends and a K8s ingress class mismatch**. 
Traffic from the internet never reached the Kubernetes nginx-ingress controller that held valid certificates; instead it was intercepted by a Docker Traefik container that had no routes and no valid certificates for the affected domains. + +--- + +## Timeline (all times UTC-4) + +- **~09:30** — User reports that commercial sites are not reachable. +- **09:30–10:00** — Diagnosis: Traefik container owns public `:80`/`:443`, has default cert, no routers for `*.squaremcp.com` / `*.fetcherpay.com`. +- **10:00–10:30** — Added file-provider routers and static K8s-extracted certificates for `squaremcp.com`, `www.squaremcp.com`, `app.squaremcp.com`, `docs.squaremcp.com`, `tiktok.squaremcp.com`. +- **10:30–11:00** — Fixed `fetcherpay.com` / `www.fetcherpay.com` by attaching Traefik to the `fetcherpay` Docker network and starting the stopped `fetcherpay-web` container. +- **11:00–11:30** — Fixed `workflow.fetcherpay.com` (Traefik was routing to gRPC port `7233` instead of HTTP UI port `8080`). +- **11:30–12:00** — Fixed `mail.fetcherpay.com` by starting `poste`, extracting the K8s cert, and adding a Traefik router/service. +- **12:00–13:30** — Fixed `git.fetcherpay.com` by starting `postgres` and `gitea`, extracting the K8s cert, adding a router/service, and resolving a host port `2222` conflict by remapping Gitea SSH to `22222`. +- **13:30–14:00** — Final verification of all domains and Hermes path-specific routes. + +--- + +## Root cause + +### 1. Docker Traefik intercepts all public ingress +- The Traefik v3 container binds host ports `80`, `443`, and `8080`. +- Docker publishes these ports with `iptables` DNAT rules, alongside a `docker-proxy` listener on each published port. +- Those rules intercept all inbound public HTTP/S traffic **before** it can reach the host-network MicroK8s nginx-ingress controller. + +### 2. Traefik had no routes or valid TLS for the commercial domains +- Traefik’s dynamic config comes from Docker labels and a file provider (`/home/garfield/letsencrypt/manual/tls.yml`). 
+- At the start of the incident the file provider only had a partial/incomplete set of routers. +- There were no valid Let’s Encrypt certificates for most domains because GoDaddy DNS-01 returns `DUPLICATE_RECORD` for `_acme-challenge.*` TXT records, blocking issuance. +- Result: any request for an unmatched host fell through to Traefik’s default self-signed certificate and returned `404 page not found`. + +### 3. K8s nginx-ingress was unreachable even though it had valid certs +- Cert-manager inside MicroK8s held valid TLS secrets for the affected domains. +- The active nginx-ingress-microk8s controller is configured for `ingressClass=public`. +- Most Ingress resources specify `ingressClassName: nginx`. +- Because of the class mismatch, those Ingresses were never reconciled by the active controller, so K8s could not serve traffic even if Traefik had forwarded it. + +### 4. Several Docker backends were stopped +- `fetcherpay-web` — stopped. +- `poste` (mail) — stopped. +- `postgres` and `gitea` (git) — stopped. +- `temporal-ui` was running, but the Traefik Docker label pointed at the gRPC port `7233` instead of the HTTP UI port `8080`, causing 502s for `workflow.fetcherpay.com`. + +--- + +## Why detection failed + +- No synthetic uptime probes were running against the public endpoints. +- No certificate-expiry or certificate-default alerting. +- No Traefik routing-health alert. +- Docker `restart: unless-stopped` only helps if the container was started; there was no watchdog for expected-but-stopped services. +- K8s ingress reconciliation was not monitored, so the class mismatch went unnoticed. + +--- + +## Remediation actions taken + +1. **Rebuilt the Traefik file-provider config** (`/home/garfield/letsencrypt/manual/tls.yml`) with explicit routers and services for every commercial domain. +2. **Attached Traefik to the `fetcherpay` Docker network** in `/home/garfield/traefik-compose.yml` so it could reach FetcherPay backends. +3. 
**Extracted valid K8s cert-manager secrets** and loaded them as static TLS certificates in Traefik to bypass the GoDaddy duplicate-TXT issue. +4. **Started stopped backend containers**: `fetcherpay-web`, `poste`, `postgres`, `gitea`. +5. **Fixed `workflow.fetcherpay.com`** by routing to `temporal-ui:8080` instead of `7233`. +6. **Fixed the `git.fetcherpay.com` SSH port conflict** by changing the host mapping from `2222:22` to `22222:22` in `/home/garfield/Downloads/docker-compose.prod.yml`. +7. **Verified** that all public endpoints return the expected HTTP codes with valid TLS certificates. + +--- + +## Remaining technical debt + +- K8s nginx-ingress is still effectively bypassed for public traffic. Long term, the ingress classes should be reconciled or the public edge should be migrated to a single controller. +- Several `fetcherpay.com` subdomains that depend on stopped services remain down: `api.fetcherpay.com`, `prometheus.fetcherpay.com`, `grafana.fetcherpay.com`, `adminer.fetcherpay.com`, `traefik.fetcherpay.com`. +- Secrets are still stored in plaintext in manifests and compose files. +- No centralized logging, metrics, or alerting exists for Hermes or the edge proxy. + +--- + +## Follow-up work + +See `2026-06-14-public-edge-outage-plan.md` for the full runbook / monitoring / probing plan.
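
The endpoint verification in remediation step 7 could be automated along the following lines. This is a hedged TypeScript sketch rather than the planned `verify-public-endpoints.sh`: `Expectation`, `judgeProbe`, `verifyEndpoints`, and `FetchLike` are hypothetical names, and the only expectation listed is the `302` for `GET /auth/tiktok/start` recorded in the active-issues document. It assumes a runtime whose `fetch` rejects when certificate validation fails, which is Node's default behavior.

```typescript
interface Expectation {
  method: "GET" | "POST";
  url: string;
  expectedStatus: number;
}

interface ProbeResult {
  url: string;
  ok: boolean;
  detail: string;
}

// Minimal fetch-like signature so the sketch does not depend on one runtime's typings.
type FetchLike = (
  url: string,
  init: { method: string; redirect: "manual" },
) => Promise<{ status: number }>;

// Pure comparison of one observation against its expectation.
// A thrown error (DNS failure, refused connection, invalid TLS) counts as a failure.
function judgeProbe(exp: Expectation, observed: number | Error): ProbeResult {
  if (observed instanceof Error) {
    return { url: exp.url, ok: false, detail: `request failed: ${observed.message}` };
  }
  return {
    url: exp.url,
    ok: observed === exp.expectedStatus,
    detail: `${exp.method} returned ${observed}, expected ${exp.expectedStatus}`,
  };
}

// Probes every endpoint concurrently; redirects are not followed so a 302 stays visible.
async function verifyEndpoints(
  expectations: Expectation[],
  fetchImpl: FetchLike,
): Promise<ProbeResult[]> {
  return Promise.all(
    expectations.map(async (exp) => {
      try {
        const res = await fetchImpl(exp.url, { method: exp.method, redirect: "manual" });
        return judgeProbe(exp, res.status);
      } catch (err) {
        return judgeProbe(exp, err instanceof Error ? err : new Error(String(err)));
      }
    }),
  );
}

// The one route and status documented for this incident; extend with care,
// since probing a POST route may create real records.
const expectations: Expectation[] = [
  { method: "GET", url: "https://tiktok.squaremcp.com/auth/tiktok/start", expectedStatus: 302 },
];
```

On Node 18 or later, `verifyEndpoints(expectations, fetch)` would run the probes for real; any result with `ok: false` would fail the runbook step or a CI job.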