docs(runbooks): add 2026-06-14 public edge outage RCA, fix log, infra findings, debt, and monitoring plan
Some checks failed
CI / test (push) Has been cancelled

Garfield
2026-06-14 12:26:34 -04:00
parent 2014e03190
commit 0e255e570a
6 changed files with 848 additions and 0 deletions


@@ -0,0 +1,75 @@
# Active Issues and Remaining Debt — 2026-06-14
## What is working now
All commercial domains verified reachable with valid TLS:
- `hermes.squaremcp.com` (checked at `/openapi-living-brief.json`)
- `app.squaremcp.com`
- `docs.squaremcp.com`
- `squaremcp.com` / `www.squaremcp.com`
- `tiktok.squaremcp.com`
- `fetcherpay.com` / `www.fetcherpay.com`
- `workflow.fetcherpay.com`
- `mail.fetcherpay.com`
- `git.fetcherpay.com`
Hermes path-specific routes verified:
- `POST /api/pilot-request` → `201` on `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`
- `GET /auth/tiktok/start` → `302` on `tiktok.squaremcp.com`
---
## Still down / not addressed
| Subdomain / Service | Why it is down | What would fix it |
|---|---|---|
| `api.fetcherpay.com` | `fetcherpay-api` container not running | Start `fetcherpay-api` (needs env vars, Postgres, Redis) |
| `prometheus.fetcherpay.com` | Prometheus container not running | Start Prometheus from `docker-compose.fetcherpay.yml` |
| `grafana.fetcherpay.com` | Grafana container not running | Start Grafana from `docker-compose.fetcherpay.yml` |
| `adminer.fetcherpay.com` | Adminer container not running | Start Adminer from `docker-compose.fetcherpay.yml` |
| `traefik.fetcherpay.com` | Traefik dashboard is on `:8080` but not routed through a public host label | Add a secure router or restrict dashboard to localhost/VPN |
---
## Architectural debt
1. **K8s nginx-ingress is bypassed**
- Traefik's Docker iptables rules intercept all public HTTP/S traffic.
- The active nginx-ingress controller class is `public`; manifests use `nginx`.
- Long term: either reconcile `ingressClassName` or migrate the public edge to K8s.
2. **Manual static certificate workaround**
- Traefik cannot issue new certs via GoDaddy DNS-01 for several domains because of `DUPLICATE_RECORD` TXT errors.
- Certs are extracted from K8s cert-manager secrets and loaded statically.
- These must be manually rotated before expiry.
3. **No observability**
- No synthetic uptime probes.
- No cert-expiry alerting.
- No Hermes `/metrics` endpoint.
- No Alertmanager / Slack alerts.
- No centralized logs.
4. **Secret management**
- Plaintext secrets in `hermes-k8s.yaml` and compose env vars.
- No Sealed Secrets / External Secrets / Vault.
5. **Single point of failure**
- One host, one residential IP, one edge proxy.
- No redundancy or failover.
6. **Gitea SSH port**
- Changed from `2222` to `22222` due to an unknown process binding `2222`.
- The original occupant of port `2222` was never identified; a reboot would be needed to clear it.
---
## Recommended next steps
See `2026-06-14-public-edge-outage-plan.md` for the full phased plan. Priorities:
1. **Immediate:** finalize RCA, runbook, and `scripts/verify-public-endpoints.sh`.
2. **This week:** deploy blackbox exporter + cert-expiry alerts + container-up check.
3. **Next sprint:** add Hermes `/metrics`, Grafana dashboards, Alertmanager Slack routing.
4. **Future:** decide on K8s edge migration vs. reconciling ingress classes.


@@ -0,0 +1,164 @@
# Infrastructure Findings — SquareMCP / FetcherPay
This document captures the as-built architecture, ingress behavior, monitoring state, and Hermes route table discovered during the 2026-06-14 outage response.
---
## 1. High-level architecture
The single production server (`104.190.60.129`) hosts two separate ingress layers:
| Ingress Layer | Technology | Serves |
|---|---|---|
| **Docker edge proxy** | Traefik v3 | `*.fetcherpay.com` Docker Compose stacks, plus static file-provider routes for `*.squaremcp.com` |
| **Kubernetes ingress** | nginx-ingress-microk8s + cert-manager | `*.squaremcp.com` K8s workloads (currently bypassed by Traefik) |
Both layers use Let's Encrypt TLS. Public ports `80`/`443` are bound by the Docker Traefik container, so its `iptables` rules win over host-network K8s services.
---
## 2. Traefik configuration
### Static config
**File:** `/home/garfield/traefik.yml`
- Dashboard enabled on `:8080` with `insecure: true`.
- Entrypoints: `web` (HTTP → HTTPS redirect) and `websecure` (HTTPS, `:443`).
- Providers: Docker (socket) + file provider (`/letsencrypt/manual/tls.yml`, `watch: true`).
- Certificate resolver: `letsencrypt` via GoDaddy DNS-01.
### Compose
**File:** `/home/garfield/traefik-compose.yml`
- Networks: `hermes-net`, `obsidian-net`, `fetcherpay` (all external).
- Volumes: Docker socket, static config, `letsencrypt` directory.
### Dynamic routing
**File:** `/home/garfield/letsencrypt/manual/tls.yml`
Final state after the fix has file-provider routers for all commercial domains and path-specific rules that send `/api/pilot-request` and `/auth/tiktok` to Hermes.
---
## 3. Kubernetes ingress mismatch
- **Controller class:** `public`
- **Ingress class used by manifests:** `nginx`
This means the active controller ignores most Ingress resources. Even if Traefik were removed, those Ingresses would not be served until the class is reconciled.
Affected manifests include:
- `hermes-mcp/hermes-k8s.yaml`
- `hermes-mcp/product/app/app-k8s.yaml`
- `hermes-mcp/docs/docs-k8s.yaml`
- `hermes-mcp/product/site/squaremcp-k8s-ingress.yaml`
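The mismatch described above can be surfaced mechanically by listing every Ingress next to its `spec.ingressClassName` and filtering out the ones that match the controller. The following is a minimal sketch, not the project's actual tooling: the function names `mismatched_ingresses` and `list_ingress_classes` are hypothetical, and it assumes Ingresses declare their class through `spec.ingressClassName` rather than the legacy `kubernetes.io/ingress.class` annotation.

```bash
# Print Ingresses whose class differs from the class given as $1.
# Reads "namespace<TAB>name<TAB>class" lines on stdin; the class may be empty.
mismatched_ingresses() {
  awk -F '\t' -v want="$1" '
    NF > 0 && $3 != want {
      printf "%s/%s class=%s\n", $1, $2, ($3 == "" ? "<none>" : $3)
    }'
}

# Hypothetical wrapper that feeds live cluster state into the filter above.
list_ingress_classes() {
  microk8s kubectl get ingress -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.ingressClassName}{"\n"}{end}'
}

# Usage: list_ingress_classes | mismatched_ingresses public
```

Any line printed is an Ingress the `public` controller would ignore. An empty result would suggest the classes have been reconciled.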
---
## 4. Hermes MCP route table
**File:** `hermes-mcp/src/index.ts`
### Public / commercial endpoints
| Method | Path | Notes |
|---|---|---|
| `GET` | `/` | Static files from `../product` |
| `GET` | `/openapi-living-brief.json` | Obsidian-only OpenAPI spec for ChatGPT |
| `GET` | `/openapi.json` | Full OpenAPI spec |
| `GET` | `/auth/tiktok/start` | Redirect to TikTok Login Kit |
| `GET` | `/auth/tiktok/callback` | TikTok OAuth callback |
| `POST` | `/api/pilot-request` | Public form submission; origin-gated |
| `GET` | `/health` | Liveness/readiness probe |
### OAuth / MCP discovery
| Method | Path |
|---|---|
| `POST` | `/oauth/register` |
| `GET` / `POST` | `/oauth/authorize` |
| `POST` | `/oauth/token` |
| `GET` | `/.well-known/oauth-authorization-server` |
| `GET` | `/.well-known/openid-configuration` |
| `GET` / `POST` / `DELETE` | `/mcp` |
| `GET` | `/sse` |
| `POST` | `/messages` |
| `GET` | `/tools` |
### Capability-guarded tool API
All `/api/*` tool routes require auth + capability grant:
| Capability | Example endpoints |
|---|---|
| `obsidian` | `/api/obsidian/search`, `/api/obsidian/note`, `/api/obsidian/note/append`, `/api/obsidian/sync` |
| `email` | `/api/email/profile`, `/api/email/search`, `/api/email/read`, `/api/email/send` |
| `whatsapp` | `/api/whatsapp/send`, `/api/whatsapp/templates` |
| `linkedin` | `/api/linkedin/profile`, `/api/linkedin/post`, `/api/linkedin/message` |
| `telegram` | `/api/telegram/me`, `/api/telegram/message`, `/api/telegram/updates` |
| `discord` | `/api/discord/me`, `/api/discord/guilds`, `/api/discord/message` |
| `instagram` | `/api/instagram/profile`, `/api/instagram/media`, `/api/instagram/post` |
| `twitter` | `/api/twitter/search`, `/api/twitter/tweets`, `/api/twitter/tweet` |
| `facebook` | `/api/facebook/page`, `/api/facebook/posts`, `/api/facebook/post` |
| `tiktok` | `/api/tiktok/profile`, `/api/tiktok/video`, `/api/tiktok/video/status` |
### Health endpoint
```typescript
app.get('/health', (_req, res) => {
  res.json({
    status: 'ok',
    service: 'hermes-mcp',
    toolCount,
    transports,
    endpoints,
  });
});
```
Used by both K8s readiness and liveness probes in `hermes-k8s.yaml`.
---
## 5. Monitoring gaps
### Prometheus / Grafana
- Prometheus and Grafana containers exist in `docker-compose.fetcherpay.yml`.
- Prometheus scrapes itself, `fetcherpay-api:3000`, and Docker metrics at `172.20.0.1:9323`.
- **Hermes MCP is not scraped** and has no `/metrics` endpoint.
- No Alertmanager, no alert rules.
### Health checks
- Hermes has `/health` but no `/ready` or `/livez` separation.
- Docker health checks exist for Postgres, MySQL, Redis, Gitea, and FetcherPay API, but **not for Hermes**.
### Uptime / synthetic probes
- No blackbox exporter.
- No external uptime monitoring (Pingdom, UptimeRobot, Grafana Cloud, etc.).
- No cert-expiry alerting.
- No K8s ingress reconciliation check.
### Logs
- No centralized log aggregation (Loki, Vector, Fluentd).
---
## 6. Secret management
- `hermes-k8s.yaml` is gitignored and contains plaintext secrets (email, DB, OAuth, API keys).
- Docker Compose stacks rely on exported env vars or `.env` files.
- No Sealed Secrets, External Secrets Operator, or Vault in use.
---
## 7. Notable risks
1. **Single point of failure:** one residential IP, one host, one edge proxy.
2. **Split edge:** two ingress controllers with conflicting class configuration.
3. **Manual certificate workaround:** static K8s-extracted certs in Traefik must be manually rotated before expiry.
4. **No observability:** no metrics, alerting, or synthetic probes for the commercial domains.
5. **Stopped services not detected:** Docker restart policies only help if containers were initially started.
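Risk 5 is the kind of gap a small watchdog could close: compare the services that should be running against what Docker reports. The sketch below is one possible shape, not an existing script. The function names are hypothetical, the service list is taken from the containers named in this commit, and it assumes all of them carry the Compose project label `fetcherpay`.

```bash
# Print every expected service that is absent from the running list.
# Running service names arrive on stdin, one per line; expected names are args.
missing_services() {
  running="$(cat)"
  for svc in "$@"; do
    printf '%s\n' "$running" | grep -Fxq -- "$svc" || printf '%s\n' "$svc"
  done
}

# Hypothetical cron/systemd-timer entry point.
check_fetcherpay_stack() {
  missing="$(docker ps --filter status=running \
      --filter label=com.docker.compose.project=fetcherpay \
      --format '{{.Label "com.docker.compose.service"}}' \
    | missing_services fetcherpay-web poste postgres gitea)"
  if [ -n "$missing" ]; then
    # $missing is intentionally unquoted so each name gets its own line.
    printf 'not running: %s\n' $missing >&2
    return 1
  fi
}
```

A non-zero exit from `check_fetcherpay_stack` could then be wired to whatever alert channel is eventually chosen.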


@@ -0,0 +1,360 @@
# Outage Fix Log — 2026-06-14
This is the step-by-step record of what was changed to restore public access to the SquareMCP / FetcherPay commercial sites.
---
## Environment
- **Host:** `104.190.60.129` (MicroK8s + Docker)
- **Edge proxy:** Traefik v3 in Docker, binds `:80`, `:443`, `:8080`
- **Hermes MCP:** K8s pod with `hostNetwork: true` on `:3456`
- **Key files:**
- `/home/garfield/traefik-compose.yml`
- `/home/garfield/traefik.yml`
- `/home/garfield/letsencrypt/manual/tls.yml`
- `/home/garfield/Downloads/docker-compose.prod.yml`
---
## 1. Attach Traefik to the FetcherPay network
**File:** `/home/garfield/traefik-compose.yml`
Added the `fetcherpay` external network so Traefik can reach FetcherPay Docker backends.
```yaml
services:
  traefik:
    ...
    networks:
      - hermes-net
      - obsidian-net
      - fetcherpay
networks:
  hermes-net:
    external: true
    name: hermes-mcp_hermes-net
  obsidian-net:
    external: true
    name: obsidian_obsidian-net
  fetcherpay:
    external: true
    name: fetcherpay_fetcherpay
```
---
## 2. Rebuild the Traefik file-provider routing config
**File:** `/home/garfield/letsencrypt/manual/tls.yml`
Final config includes routers and services for:
- `hermes.squaremcp.com`
- `app.squaremcp.com`
- `docs.squaremcp.com`
- `squaremcp.com` / `www.squaremcp.com`
- `tiktok.squaremcp.com`
- `fetcherpay.com` / `www.fetcherpay.com`
- `workflow.fetcherpay.com`
- `mail.fetcherpay.com`
- `git.fetcherpay.com`
Path-specific rules that route to Hermes (`104.190.60.129:3456`):
- `/api/pilot-request` on `squaremcp.com` / `www.squaremcp.com`
- `/auth/tiktok` and `/api/pilot-request` on `tiktok.squaremcp.com`
Full final config:
```yaml
http:
  routers:
    hermes:
      rule: "Host(`hermes.squaremcp.com`)"
      service: hermes
      entryPoints: [websecure]
      tls: { certResolver: letsencrypt }
    squaremcp-app:
      rule: "Host(`app.squaremcp.com`)"
      service: squaremcp-app
      entryPoints: [websecure]
      tls: {}
    squaremcp-docs:
      rule: "Host(`docs.squaremcp.com`)"
      service: squaremcp-docs
      entryPoints: [websecure]
      tls: {}
    squaremcp-site-main:
      rule: "Host(`squaremcp.com`) || Host(`www.squaremcp.com`)"
      service: squaremcp-site
      priority: 10
      entryPoints: [websecure]
      tls: {}
    squaremcp-site-pilot:
      rule: "(Host(`squaremcp.com`) || Host(`www.squaremcp.com`)) && PathPrefix(`/api/pilot-request`)"
      service: hermes
      priority: 30
      entryPoints: [websecure]
      tls: {}
    squaremcp-tiktok-main:
      rule: "Host(`tiktok.squaremcp.com`)"
      service: squaremcp-site
      priority: 10
      entryPoints: [websecure]
      tls: {}
    squaremcp-tiktok-auth:
      rule: "Host(`tiktok.squaremcp.com`) && PathPrefix(`/auth/tiktok`)"
      service: hermes
      priority: 30
      entryPoints: [websecure]
      tls: {}
    squaremcp-tiktok-pilot:
      rule: "Host(`tiktok.squaremcp.com`) && PathPrefix(`/api/pilot-request`)"
      service: hermes
      priority: 30
      entryPoints: [websecure]
      tls: {}
    fetcherpay-root:
      rule: "Host(`fetcherpay.com`) || Host(`www.fetcherpay.com`)"
      service: fetcherpay-web
      priority: 60
      entryPoints: [websecure]
      tls: {}
    workflow:
      rule: "Host(`workflow.fetcherpay.com`)"
      service: temporal-ui
      priority: 60
      entryPoints: [websecure]
      tls: {}
    mail:
      rule: "Host(`mail.fetcherpay.com`)"
      service: poste
      priority: 60
      entryPoints: [websecure]
      tls: {}
    git:
      rule: "Host(`git.fetcherpay.com`)"
      service: gitea
      priority: 60
      entryPoints: [websecure]
      tls: {}
  services:
    hermes:
      loadBalancer:
        servers: [{ url: "http://104.190.60.129:3456" }]
        passHostHeader: true
    squaremcp-app:
      loadBalancer:
        servers: [{ url: "http://10.152.183.164:80" }]
        passHostHeader: true
    squaremcp-docs:
      loadBalancer:
        servers: [{ url: "http://10.152.183.130:80" }]
        passHostHeader: true
    squaremcp-site:
      loadBalancer:
        servers: [{ url: "http://10.152.183.48:80" }]
        passHostHeader: true
    fetcherpay-web:
      loadBalancer:
        servers: [{ url: "http://172.20.0.9:80" }]
        passHostHeader: true
    temporal-ui:
      loadBalancer:
        servers: [{ url: "http://172.20.0.3:8080" }]
        passHostHeader: true
    poste:
      loadBalancer:
        servers: [{ url: "http://poste:80" }]
        passHostHeader: true
    gitea:
      loadBalancer:
        servers: [{ url: "http://gitea:3000" }]
        passHostHeader: true
tls:
  certificates:
    - certFile: /letsencrypt/manual/certs/squaremcp-app.crt
      keyFile: /letsencrypt/manual/certs/squaremcp-app.key
    - certFile: /letsencrypt/manual/certs/squaremcp-docs.crt
      keyFile: /letsencrypt/manual/certs/squaremcp-docs.key
    - certFile: /letsencrypt/manual/certs/squaremcp-site.crt
      keyFile: /letsencrypt/manual/certs/squaremcp-site.key
    - certFile: /letsencrypt/manual/certs/fetcherpay-root.crt
      keyFile: /letsencrypt/manual/certs/fetcherpay-root.key
    - certFile: /letsencrypt/manual/certs/mail-fetcherpay.crt
      keyFile: /letsencrypt/manual/certs/mail-fetcherpay.key
    - certFile: /letsencrypt/manual/certs/git-fetcherpay.crt
      keyFile: /letsencrypt/manual/certs/git-fetcherpay.key
```
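Several services in the config above point at literal IP addresses rather than names. Docker normally assigns container addresses at start, so a `172.20.0.x` target may change when a container is recreated. As a hedged pre-flight aid, the sketch below lists the IP-pinned backends so they can be re-checked after any restart. Both function names are hypothetical and the parsing assumes the one-line `servers: [{ url: "..." }]` style used in this file.

```bash
# Extract every backend URL from a Traefik dynamic config supplied on stdin.
backend_urls() {
  sed -n 's/.*url:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Of those, keep only the URLs whose host is a literal IPv4 address.
ip_pinned_backends() {
  backend_urls | grep -E '^https?://([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]+)?(/|$)'
}

# Usage: ip_pinned_backends < /home/garfield/letsencrypt/manual/tls.yml
```

Name-based targets such as `http://poste:80` are skipped because Docker's DNS follows the container across restarts.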
---
## 3. Extract static TLS certificates from K8s cert-manager secrets
Because Traefik's GoDaddy DNS-01 resolver fails with `DUPLICATE_RECORD` for existing `_acme-challenge.*` TXT records, valid certificates were pulled from the K8s secrets that cert-manager already held.
```bash
mkdir -p /home/garfield/letsencrypt/manual/certs
cd /home/garfield/letsencrypt/manual/certs   # the redirects below write relative paths
# squaremcp-app
microk8s kubectl get secret squaremcp-app-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-app.crt
microk8s kubectl get secret squaremcp-app-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-app.key
# squaremcp-docs
microk8s kubectl get secret squaremcp-docs-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-docs.crt
microk8s kubectl get secret squaremcp-docs-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-docs.key
# squaremcp-site (covers squaremcp.com / www.squaremcp.com / tiktok.squaremcp.com)
microk8s kubectl get secret squaremcp-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-site.crt
microk8s kubectl get secret squaremcp-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-site.key
# fetcherpay-root
microk8s kubectl get secret fetcherpay-root-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > fetcherpay-root.crt
microk8s kubectl get secret fetcherpay-root-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > fetcherpay-root.key
# mail.fetcherpay.com
microk8s kubectl get secret mail-fetcherpay-tls -n email -o jsonpath='{.data.tls\.crt}' | base64 -d > mail-fetcherpay.crt
microk8s kubectl get secret mail-fetcherpay-tls -n email -o jsonpath='{.data.tls\.key}' | base64 -d > mail-fetcherpay.key
# git.fetcherpay.com
microk8s kubectl get secret fetcherpay-git-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > git-fetcherpay.crt
microk8s kubectl get secret fetcherpay-git-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > git-fetcherpay.key
```
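Because these extracted certificates are static files, nothing renews them automatically. A small check along the following lines could flag files that are close to expiry. This is a sketch under assumptions: the function name `certs_expiring_within` is hypothetical, and it relies on `openssl x509 -checkend`, which exits non-zero when the first certificate in the file expires within the given number of seconds.

```bash
# List certificates in a directory that expire within the given number of days.
# Arguments: <days> <directory containing *.crt files>
certs_expiring_within() {
  days="$1"
  dir="$2"
  for crt in "$dir"/*.crt; do
    [ -e "$crt" ] || continue   # the glob matched nothing
    if ! openssl x509 -checkend "$((days * 86400))" -noout -in "$crt" >/dev/null; then
      printf '%s\n' "$crt"
    fi
  done
}

# Usage: certs_expiring_within 14 /home/garfield/letsencrypt/manual/certs
```

Any path printed is a candidate for re-extraction from the corresponding cert-manager secret.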
---
## 4. Start stopped backend containers
### FetcherPay web
```bash
docker compose -p fetcherpay -f /home/garfield/docker-compose.fetcherpay.yml up -d fetcherpay-web
```
### Poste (mail)
```bash
docker compose -p fetcherpay -f /home/garfield/Downloads/docker-compose.prod.yml up -d poste
```
### Postgres + Gitea (git)
Gitea credentials were recovered from the existing Gitea config volume:
```bash
docker run --rm -v fetcherpay_gitea_data:/data alpine \
sh -c 'cat /data/gitea/conf/app.ini | grep -E "^(NAME|USER|PASSWD|HOST|DB_TYPE)"'
# DB_TYPE = postgres
# HOST = postgres:5432
# NAME = gitea
# USER = fetcherpay
# PASSWD = fetcherpay_secure_2024
```
Then postgres and gitea were started with the required env vars:
```bash
cd /home/garfield/Downloads
export POSTGRES_USER=fetcherpay
export POSTGRES_PASSWORD=fetcherpay_secure_2024
export POSTGRES_DB=postgres
export GITEA_HOST=git.fetcherpay.com
export GITEA_DB=gitea
export MAIL_HOST=mail.fetcherpay.com
export WEB_HOST=fetcherpay.com
export API_HOST=api.fetcherpay.com
export PROM_HOST=prometheus.fetcherpay.com
export GRAFANA_HOST=grafana.fetcherpay.com
export ADMINER_HOST=adminer.fetcherpay.com
export TEMPORAL_HOST=workflow.fetcherpay.com
export REDIS_PASSWORD=redis_pass
export MYSQL_ROOT_PASSWORD=mysql_root
export MYSQL_DATABASE=fetcherpay
export MYSQL_USER=fetcherpay
export MYSQL_PASSWORD=mysql_pass
export GRAFANA_ADMIN_PASSWORD=admin
export ADMINER_USERS=admin:admin
export TRAEFIK_DASHBOARD_HOST=traefik.fetcherpay.com
docker compose -p fetcherpay -f docker-compose.prod.yml up -d postgres gitea
```
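Typing this many exports by hand is easy to get wrong on the next restart. One option, sketched here as an assumption rather than what was done during the incident, is to keep the same `KEY=VALUE` pairs in a file outside the repository and load them in one step. The function name `load_env_file` and the path in the usage comment are hypothetical.

```bash
# Load KEY=VALUE pairs from a file and export each one.
# Assumes the file holds only simple, shell-safe assignments.
load_env_file() {
  set -a      # auto-export every variable assigned while sourcing
  . "$1"
  set +a
}

# Usage (assumed path):
#   load_env_file /home/garfield/fetcherpay.env
#   docker compose -p fetcherpay -f docker-compose.prod.yml up -d postgres gitea
```

Docker Compose also accepts an `--env-file` option, which avoids touching the calling shell's environment at all.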
---
## 5. Fix `workflow.fetcherpay.com`
The Docker label on the `temporal` service pointed Traefik at port `7233` (gRPC), causing 502s. A file-provider router was added in `tls.yml` pointing `workflow.fetcherpay.com` → `temporal-ui:8080`.
---
## 6. Fix Gitea SSH port conflict
The host port `2222` was already in use by an unknown process and could not be freed. The Gitea SSH mapping was changed from `2222:22` to `22222:22`.
**File:** `/home/garfield/Downloads/docker-compose.prod.yml`
```yaml
gitea:
  ...
  ports:
    - "22222:22"   # SSH (optional for git over SSH)
```
The `gitea` container was then recreated with the new mapping.
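For future occurrences, the process holding a port can usually be identified from `ss` output, provided it is run as root (without root, `ss` omits the process column for sockets owned by other users). The helper below is a hypothetical sketch that parses `ss -ltnpH` output. It was not run during this incident.

```bash
# Print the process column for listeners on the given TCP port.
# Reads `ss -ltnpH` output on stdin.
port_owner() {
  awk -v port="$1" '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ (":" port "$")) {
        # The last field is users:((...)) only when ss could see the owner.
        print ($NF ~ /^users:/ ? $NF : "<owner hidden, rerun as root>")
        next
      }
    }
  }'
}

# Usage: sudo ss -ltnpH | port_owner 2222
```

An owner of `docker-proxy` would point at a leftover published port from another container rather than a host process.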
---
## 7. Restart Traefik after every config change
```bash
docker restart traefik
```
---
## 8. Verification results
Final public reachability check:
```
https://hermes.squaremcp.com/openapi-living-brief.json -> 200 (cert=0)
https://app.squaremcp.com/ -> 200 (cert=0)
https://docs.squaremcp.com/ -> 200 (cert=0)
https://squaremcp.com/ -> 200 (cert=0)
https://www.squaremcp.com/ -> 200 (cert=0)
https://tiktok.squaremcp.com/ -> 200 (cert=0)
https://tiktok.squaremcp.com/auth/tiktok/start -> 302 (cert=0)
https://fetcherpay.com/ -> 200 (cert=0)
https://www.fetcherpay.com/ -> 200 (cert=0)
https://workflow.fetcherpay.com/ -> 200 (cert=0)
https://mail.fetcherpay.com/ -> 302 (cert=0)
https://git.fetcherpay.com/ -> 200 (cert=0)
POST /api/pilot-request (tiktok) -> 201
POST /api/pilot-request (root/www) -> 201
GET /auth/tiktok/start -> 302
```
`cert=0` means TLS verification passed.
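The output format above can be reproduced with curl's `--write-out` variables, where `ssl_verify_result` is `0` when the certificate chain validated. The following is a minimal sketch of what one check in `scripts/verify-public-endpoints.sh` might look like. The function name `check_url` is hypothetical, and the sketch only covers `GET` requests.

```bash
# Check one URL: print "<url> -> <status> (cert=<n>)" and fail on any mismatch.
# Arguments: <url> <expected HTTP status>
check_url() {
  url="$1"
  expected="$2"
  # "|| true" keeps going on TLS or connection errors; curl still prints -w output.
  result="$(curl -sS -o /dev/null --max-time 10 \
    -w '%{http_code} %{ssl_verify_result}' "$url" || true)"
  status="${result% *}"
  cert="${result#* }"
  printf '%s -> %s (cert=%s)\n' "$url" "$status" "$cert"
  [ "$status" = "$expected" ] && [ "$cert" = "0" ]
}

# Usage: check_url https://squaremcp.com/ 200 || exit 1
```

The `POST /api/pilot-request` checks would need an extra `-X POST` variant with a request body, which is left out here.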
---
## Notes / gotchas
- `/api/pilot-request` is `POST`-only. A `GET` request returns `404`, which is expected.
- The `/auth/tiktok` routes are `/auth/tiktok/start` and `/auth/tiktok/callback`; the Traefik ``PathPrefix(`/auth/tiktok`)`` rule correctly forwards both.
- Static certificate extraction required root access; Docker root containers were used when `sudo` began prompting for a password.


@@ -0,0 +1,32 @@
# 2026-06-14 Public Edge Outage — Vault Index
All documentation for the outage, its root cause, the fix, and the follow-up plan lives in this SquareMCP vault folder.
## Files
| File | Purpose |
|---|---|
| `2026-06-14-public-edge-outage-rca.md` | Root cause analysis and incident timeline. |
| `2026-06-14-outage-fix-log.md` | Step-by-step record of every config change, command, and verification result. |
| `2026-06-14-infrastructure-findings.md` | As-built architecture, Traefik/K8s behavior, Hermes route table, and monitoring gaps. |
| `2026-06-14-active-issues-and-debt.md` | What is still down, remaining technical debt, and recommended next steps. |
| `2026-06-14-public-edge-outage-plan.md` | Proposed runbook, monitoring, probes, and alerting plan (Phases 1-4). |
| `2026-06-14-outage-index.md` | This file. |
## Quick status
- ✅ All listed `squaremcp.com` domains reachable with valid TLS.
- ✅ All listed `fetcherpay.com` domains reachable with valid TLS.
- ✅ Hermes path routes (`/api/pilot-request`, `/auth/tiktok`) verified.
- ⚠️ K8s nginx-ingress remains bypassed by Traefik.
- ⚠️ Several FetcherPay services still stopped (`api`, Prometheus, Grafana, Adminer).
- ⚠️ No automated monitoring or alerting yet.
## Reference paths on disk
- Traefik compose: `/home/garfield/traefik-compose.yml`
- Traefik static config: `/home/garfield/traefik.yml`
- Traefik dynamic config: `/home/garfield/letsencrypt/manual/tls.yml`
- Static certs: `/home/garfield/letsencrypt/manual/certs/`
- FetcherPay prod compose: `/home/garfield/Downloads/docker-compose.prod.yml`
- Hermes K8s manifest: `/home/garfield/hermes-mcp/hermes-k8s.yaml`


@@ -0,0 +1,129 @@
# Plan: Document the outage, build a deployment runbook, and add diagnostics/monitoring
## Goal
Turn the June 2026 public-edge outage into repeatable, observable infrastructure, with all artifacts stored in the SquareMCP repository (`/home/garfield/hermes-mcp/`).
1. Write a clear post-incident / RCA document.
2. Create a step-by-step deployment runbook that the next operator can follow without guessing.
3. Add probes, metrics, and alerting so the same class of failure is detected and escalated before users notice.
---
## Root cause (condensed)
- **Public ports 80/443/8080 are owned by a Docker Traefik container.** Its iptables rules intercept all inbound traffic before the host-network K8s nginx-ingress can serve it.
- **Traefik had no routers or valid TLS certificates** for the commercial `squaremcp.com` / `fetcherpay.com` domains, so it returned `404 page not found` with a self-signed cert.
- **K8s cert-manager held valid certs**, but the active nginx-ingress controller uses `ingressClass=public` while the Ingress resources use `ingressClassName=nginx`, so K8s never reconciled them and could not serve traffic anyway.
- **Several Docker backends were stopped**: `fetcherpay-web`, `poste`, `postgres`, `gitea`. The `temporal-ui` container was running but Traefik was pointed at its gRPC port (`7233`) instead of its HTTP UI port (`8080`).
---
## Deliverable 1: Post-incident / RCA document
**Location:** `hermes-mcp/docs/runbooks/2026-06-14-public-edge-outage-rca.md`
Sections:
- **Summary** — what was down, for how long, user impact.
- **Timeline** — detection, mitigation, full restoration.
- **Root cause** — Traefik/Docker edge + missing routes/certs + K8s ingress class mismatch + stopped containers.
- **Why detection failed** — no synthetic uptime checks, no cert-expiry alerting, no Traefik routing alert, Docker restart did not catch stopped non-Hermes services.
- **Remediation actions taken** — static cert extraction, file-provider routers, network attachment, container restarts, port conflict resolution.
- **Follow-up work**: this plan's runbook and monitoring deliverables.
---
## Deliverable 2: Deployment runbook
**Location:** `hermes-mcp/docs/runbooks/deployment.md`
The runbook will cover:
1. **Pre-flight checks**
- Confirm Traefik is attached to required networks (`hermes-net`, `obsidian-net`, `fetcherpay`).
- Confirm all expected Docker networks exist.
- Confirm static cert directory (`/home/garfield/letsencrypt/manual/certs/`) contains current certs for all file-provider domains.
2. **Deploy / update the edge proxy**
- Rebuild / restart Traefik from `traefik-compose.yml`.
- Validate `tls.yml` routers, services, and certificate entries.
- Smoke-test every public host immediately after restart.
3. **Deploy Hermes / SquareMCP (K8s path)**
- Build, push, update digest in `hermes-k8s.yaml`.
- Apply manifests and wait for rollout.
- Verify `/health`, `/openapi-living-brief.json`, OAuth endpoints, `/api/pilot-request`.
4. **Deploy FetcherPay stack (Docker path)**
- Export required env vars (or ensure `.env` is present).
- `docker compose -p fetcherpay up -d` for web, api, mail, git, workflow.
- Verify `fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`, `workflow.fetcherpay.com`.
5. **Certificate renewal / rotation**
- When Traefik ACME works vs. when to fall back to K8s cert-manager secret extraction.
- Step-by-step secret extraction command template.
6. **Rollback checklist**
- Revert image digest / compose change, restart, verify.
7. **Verification script**
- A single `hermes-mcp/scripts/verify-public-endpoints.sh` that curls every critical URL and exits non-zero on failure.
---
## Deliverable 3: Diagnostics, metrics, and probes
Two viable approaches. The recommended one keeps the current architecture and hardens it; the alternative migrates the edge to K8s.
### Option A — Harden the existing Traefik edge (recommended)
**Why:** Lowest risk, fastest to implement, directly protects against the exact failure modes we just saw.
Implementation pieces:
1. **Synthetic uptime probes (blackbox exporter)**
- Add `prom/blackbox-exporter` config inside the repo (e.g. `hermes-mcp/monitoring/blackbox.yml`).
- Probe all public URLs every 60s: HTTPS, TLS cert validity, expected HTTP status.
- Domains: `hermes.squaremcp.com/openapi-living-brief.json`, `app.squaremcp.com`, `docs.squaremcp.com`, `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`, `fetcherpay.com`, `www.fetcherpay.com`, `workflow.fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`.
- Path-specific probes: `POST /api/pilot-request`, `GET /auth/tiktok/start`.
2. **Certificate expiry alerting**
- Blackbox `probe_ssl_earliest_cert_expiry` alert when any cert has < 7 days left.
- Separate alert for Traefik default / self-signed cert (would fire immediately on a routing miss).
3. **Traefik routing health**
- Enable Traefik metrics endpoint (`--metrics.prometheus`).
- Alert on elevated 5xx counts in `traefik_service_requests_total` (by its `code` label) or on `traefik_service_server_up == 0`.
4. **Container health & restart policy**
- Ensure every commercial service has `restart: unless-stopped` and a Docker `healthcheck`.
- Add a simple systemd user timer or cron that runs `docker compose -p fetcherpay ps` and alerts if any expected container is not `Up`.
5. **K8s ingress reconciliation check**
- A probe/script (`hermes-mcp/scripts/check-k8s-ingress.sh`) that confirms all `squaremcp.com` Ingresses have a matching `ADDRESS` and valid TLS secret.
- Alert if `kubectl get ingress -A` shows missing addresses or a cert-manager `Certificate` reports `Ready=False`.
6. **Hermes application metrics**
- Add a `/metrics` endpoint using `prom-client` in `src/index.ts`.
- Instrument request latency, error rate, active OAuth sessions, tool call counts.
- Scrape it from Prometheus.
7. **Separate readiness probe**
- Keep `/health` for liveness; add `/ready` that checks DB/Redis connectivity before reporting ready.
8. **Alertmanager + Slack / email**
- Deploy `prom/alertmanager` alongside Prometheus.
- Route critical alerts (site down, cert expiring, service unhealthy) to a Slack webhook and/or email.
9. **Verification script**
- `hermes-mcp/scripts/verify-public-endpoints.sh` used in runbook and optionally in CI.
### Option B — Migrate public edge to K8s nginx-ingress
**Why:** Eliminates the split-ingress complexity that caused the routing confusion.
Implementation pieces:
1. Reconcile `ingressClassName: nginx` → `public` (or change the controller to `nginx`).
2. Reconfigure Traefik to not bind public 80/443, or move it to an internal Docker-only role.
3. Point public DNS/router directly at the K8s nginx-ingress controller (host-network or NodePort).
4. Re-issue all certs via cert-manager and remove the static-cert workaround.
5. Still add blackbox exporter / Alertmanager / Hermes metrics as in Option A.
**Trade-off:** Larger architectural change, risk of another outage during migration, but cleaner long term.
---
## Suggested file changes (all under `hermes-mcp/`)
- **New:** `docs/runbooks/2026-06-14-public-edge-outage-rca.md`
- **New / rewrite:** `docs/runbooks/deployment.md`
- **New:** `docs/runbooks/monitoring-playbook.md` (alert runbook)
- **New:** `scripts/verify-public-endpoints.sh`
- **New:** `scripts/check-k8s-ingress.sh`
- **Modify:** `src/index.ts`: add `/metrics`, `/ready`, enhance `/health`
- **Modify:** `hermes-k8s.yaml`: add startup probe, resource requests/limits
- **New:** `monitoring/blackbox.yml`, `monitoring/prometheus.yml`, `monitoring/alert-rules.yml`, `monitoring/alertmanager.yml`
- **Modify:** root `docker-compose.fetcherpay.yml` or create `monitoring/docker-compose.monitoring.yml` if the user prefers not to touch the prod compose file.
---
## Phasing recommendation
- **Phase 1 (immediate):** RCA doc + runbook + `scripts/verify-public-endpoints.sh`.
- **Phase 2 (this week):** blackbox exporter + cert-expiry alerts + container-up check.
- **Phase 3 (next sprint):** Hermes `/metrics` + dashboards + Alertmanager Slack routing.
- **Phase 4 (future):** decide on Option B edge migration after Phases 1-3 are stable.


@@ -0,0 +1,88 @@
# Public Edge Outage — Root Cause Analysis
**Date:** 2026-06-14
**Severity:** High — all public `squaremcp.com` and `fetcherpay.com` properties unreachable or certificate-invalid.
**Status:** Resolved. All listed commercial domains reachable with valid TLS.
---
## Summary
On 2026-06-14, every public-facing SquareMCP / FetcherPay domain was either returning `404 page not found` or serving an invalid/default TLS certificate. The root cause was a **misconfigured public edge proxy combined with stopped backends and a K8s ingress class mismatch**. Traffic from the internet never reached the Kubernetes nginx-ingress controller that held valid certificates; instead it was intercepted by a Docker Traefik container that had no routes and no valid certificates for the affected domains.
---
## Timeline (all times UTC-4)
- **~09:30:** User reports that commercial sites are not reachable.
- **09:30-10:00:** Diagnosis. The Traefik container owns public `:80`/`:443`, serves its default cert, and has no routers for `*.squaremcp.com` / `*.fetcherpay.com`.
- **10:00-10:30:** Added file-provider routers and static K8s-extracted certificates for `squaremcp.com`, `www.squaremcp.com`, `app.squaremcp.com`, `docs.squaremcp.com`, `tiktok.squaremcp.com`.
- **10:30-11:00:** Fixed `fetcherpay.com` / `www.fetcherpay.com` by attaching Traefik to the `fetcherpay` Docker network and starting the stopped `fetcherpay-web` container.
- **11:00-11:30:** Fixed `workflow.fetcherpay.com` (Traefik was routing to gRPC port `7233` instead of HTTP UI port `8080`).
- **11:30-12:00:** Fixed `mail.fetcherpay.com` by starting `poste`, extracting the K8s cert, and adding a Traefik router/service.
- **12:00-13:30:** Fixed `git.fetcherpay.com` by starting `postgres` and `gitea`, extracting the K8s cert, adding a router/service, and resolving a host port `2222` conflict by remapping Gitea SSH to `22222`.
- **13:30-14:00:** Final verification of all domains and Hermes path-specific routes.
---
## Root cause
### 1. Docker Traefik intercepts all public ingress
- The Traefik v3 container binds host ports `80`, `443`, and `8080`.
- Docker publishes these ports by installing `iptables` DNAT rules (the userland `docker-proxy` process only holds the host-side listeners).
- Those rules intercept all inbound public HTTP/S traffic **before** it can reach the host-network MicroK8s nginx-ingress controller.
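To confirm which container address a published port is being redirected to, the `DOCKER` chain of the NAT table can be inspected. The helper below is a hedged sketch: the name `dnat_target` is hypothetical, and it assumes rules in the `iptables -S` format that Docker typically generates.

```bash
# From `iptables -t nat -S DOCKER` output on stdin, print the DNAT
# destination (container ip:port) for the given published port.
dnat_target() {
  awk -v port="$1" '
    /-j DNAT/ {
      dport = ""; dest = ""
      for (i = 1; i < NF; i++) {
        if ($i == "--dport") dport = $(i + 1)
        if ($i == "--to-destination") dest = $(i + 1)
      }
      if (dport == port) print dest
    }'
}

# Usage: sudo iptables -t nat -S DOCKER | dnat_target 443
```

If the printed address belongs to the Traefik container, inbound `:443` traffic is being redirected there before any host-network listener sees it.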
### 2. Traefik had no routes or valid TLS for the commercial domains
- Traefik's dynamic config comes from Docker labels and a file provider (`/home/garfield/letsencrypt/manual/tls.yml`).
- At the start of the incident the file provider only had an incomplete set of routers.
- There were no valid Let's Encrypt certificates for most domains because GoDaddy DNS-01 returns `DUPLICATE_RECORD` for `_acme-challenge.*` TXT records, blocking issuance.
- Result: any request for an unmatched host fell through to Traefik's default self-signed certificate and returned `404 page not found`.
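A routing miss of this kind is detectable from outside by looking at the subject of the certificate a host actually serves. The sketch below assumes Traefik's generated fallback certificate carries the common name `TRAEFIK DEFAULT CERT`, which should be confirmed against the Traefik version in use. Both function names are hypothetical.

```bash
# Succeeds when the subject line on stdin belongs to Traefik's fallback cert.
is_traefik_default_cert() {
  grep -q 'TRAEFIK DEFAULT CERT'
}

# Print the subject of the certificate a host serves for its own SNI name.
served_cert_subject() {
  host="$1"
  openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -subject
}

# Usage:
#   served_cert_subject app.squaremcp.com | is_traefik_default_cert \
#     && echo "app.squaremcp.com is falling through to the default cert"
```

Unlike an expiry check, this fires immediately when a router or certificate entry is missing.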
### 3. K8s nginx-ingress was unreachable even though it had valid certs
- Cert-manager inside MicroK8s held valid TLS secrets for the affected domains.
- The active nginx-ingress-microk8s controller is configured for `ingressClass=public`.
- Most Ingress resources specify `ingressClassName: nginx`.
- Because of the class mismatch, those Ingresses were never reconciled by the active controller, so K8s could not serve traffic even if Traefik had forwarded it.
### 4. Several Docker backends were stopped
- `fetcherpay-web` — stopped.
- `poste` (mail) — stopped.
- `postgres` and `gitea` (git) — stopped.
- `temporal-ui` was running, but the Traefik Docker label pointed at the gRPC port `7233` instead of the HTTP UI port `8080`, causing 502s for `workflow.fetcherpay.com`.
---
## Why detection failed
- No synthetic uptime probes were running against the public endpoints.
- No certificate-expiry or certificate-default alerting.
- No Traefik routing-health alert.
- Docker `restart: unless-stopped` only helps if the container was started; there was no watchdog for expected-but-stopped services.
- K8s ingress reconciliation was not monitored, so the class mismatch went unnoticed.
---
## Remediation actions taken
1. **Rebuilt the Traefik file-provider config** (`/home/garfield/letsencrypt/manual/tls.yml`) with explicit routers and services for every commercial domain.
2. **Attached Traefik to the `fetcherpay` Docker network** in `/home/garfield/traefik-compose.yml` so it could reach FetcherPay backends.
3. **Extracted valid K8s cert-manager secrets** and loaded them as static TLS certificates in Traefik to bypass the GoDaddy duplicate-TXT issue.
4. **Started stopped backend containers**: `fetcherpay-web`, `poste`, `postgres`, `gitea`.
5. **Fixed `workflow.fetcherpay.com`** by routing to `temporal-ui:8080` instead of `7233`.
6. **Fixed `git.fetcherpay.com`** SSH port conflict by changing the host mapping from `2222:22` to `22222:22` in `/home/garfield/Downloads/docker-compose.prod.yml`.
7. **Verified** all public endpoints return expected HTTP codes with TLS certificates that validate.
---
## Remaining technical debt
- K8s nginx-ingress is still effectively bypassed for public traffic. Long-term the ingress classes should be reconciled or the public edge should be migrated to a single controller.
- Several `fetcherpay.com` subdomains that depend on stopped services remain down: `api.fetcherpay.com`, `prometheus.fetcherpay.com`, `grafana.fetcherpay.com`, `adminer.fetcherpay.com`, `traefik.fetcherpay.com`.
- Secrets are still stored plaintext in manifests and compose files.
- No centralized logging, metrics, or alerting exists for Hermes or the edge proxy.
---
## Follow-up work
See `2026-06-14-public-edge-outage-plan.md` for the full runbook / monitoring / probing plan.