docs(runbooks): add 2026-06-14 public edge outage RCA, fix log, infra findings, debt, and monitoring plan

Garfield
2026-06-14 12:26:34 -04:00
parent 2014e03190
commit 0e255e570a
6 changed files with 848 additions and 0 deletions


@@ -0,0 +1,164 @@
# Infrastructure Findings — SquareMCP / FetcherPay
This document captures the as-built architecture, ingress behavior, monitoring state, and Hermes route table discovered during the 2026-06-14 outage response.
---
## 1. High-level architecture
The single production server (`104.190.60.129`) hosts two separate ingress layers:
| Ingress Layer | Technology | Serves |
|---|---|---|
| **Docker edge proxy** | Traefik v3 | `*.fetcherpay.com` Docker Compose stacks, plus static file-provider routes for `*.squaremcp.com` |
| **Kubernetes ingress** | nginx-ingress-microk8s + cert-manager | `*.squaremcp.com` K8s workloads (currently bypassed by Traefik) |
Both layers use Let's Encrypt TLS certificates. Public ports `80`/`443` are bound by the Docker Traefik container, so the `iptables` rules Docker installs for it win over host-network K8s services.
---
## 2. Traefik configuration
### Static config
**File:** `/home/garfield/traefik.yml`
- Dashboard enabled on `:8080` with `insecure: true`.
- Entrypoints: `web` (HTTP → HTTPS redirect) and `websecure` (HTTPS, `:443`).
- Providers: Docker (socket) + file provider (`/letsencrypt/manual/tls.yml`, `watch: true`).
- Certificate resolver: `letsencrypt` via GoDaddy DNS-01.
### Compose
**File:** `/home/garfield/traefik-compose.yml`
- Networks: `hermes-net`, `obsidian-net`, `fetcherpay` (all external).
- Volumes: Docker socket, static config, `letsencrypt` directory.
### Dynamic routing
**File:** `/home/garfield/letsencrypt/manual/tls.yml`
In its final state after the fix, this file defines file-provider routers for all commercial domains, plus path-specific rules that send `/api/pilot-request` and `/auth/tiktok` to Hermes.
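
The routing shape described above can be sketched as a small generator for the `http.routers` section of a Traefik dynamic config. This is only an illustration under stated assumptions: the router names, the service names (`squaremcp-site`, `hermes`), and the choice of `squaremcp.com` as the example host are hypothetical and were not read from `tls.yml`. The explicit `priority` is also an assumption; it just makes it unambiguous that the path-specific router is evaluated before the host-wide one.

```typescript
// Sketch only: router, service, and host names are illustrative assumptions,
// not values taken from /home/garfield/letsencrypt/manual/tls.yml.
interface Router {
  rule: string;
  entryPoints: string[];
  service: string;
  priority?: number;
  tls: Record<string, unknown>;
}

// Path prefixes that this document says are routed to Hermes.
const HERMES_PATHS: readonly string[] = ["/api/pilot-request", "/auth/tiktok"];

// Build a Traefik Host(...) matcher for one hostname.
function hostRule(host: string): string {
  return "Host(`" + host + "`)";
}

// Produce two routers per host: a catch-all site router and a
// higher-priority router for the Hermes-bound paths.
function buildRouters(
  host: string,
  siteService: string,
  hermesService: string,
): Record<string, Router> {
  const slug = host.replace(/\./g, "-");
  const paths = HERMES_PATHS.map((p) => "PathPrefix(`" + p + "`)").join(" || ");
  return {
    [slug + "-site"]: {
      rule: hostRule(host),
      entryPoints: ["websecure"],
      service: siteService,
      tls: {},
    },
    [slug + "-hermes"]: {
      rule: hostRule(host) + " && (" + paths + ")",
      entryPoints: ["websecure"],
      service: hermesService,
      priority: 100, // assumed value; anything above the site router works
      tls: {},
    },
  };
}

const exampleRouters = buildRouters("squaremcp.com", "squaremcp-site", "hermes");
console.log(JSON.stringify({ http: { routers: exampleRouters } }, null, 2));
```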
---
## 3. Kubernetes ingress mismatch
- **Controller class:** `public`
- **Ingress class used by manifests:** `nginx`
Because the class names differ, the active controller ignores most Ingress resources. Even if Traefik were removed, those Ingresses would not be served until the classes are reconciled.
Affected manifests include:
- `hermes-mcp/hermes-k8s.yaml`
- `hermes-mcp/product/app/app-k8s.yaml`
- `hermes-mcp/docs/docs-k8s.yaml`
- `hermes-mcp/product/site/squaremcp-k8s-ingress.yaml`
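
A reconciliation check for this mismatch could look roughly like the sketch below. It assumes the Ingress class of each manifest has already been extracted (for example by parsing the YAML elsewhere) and that each of the four manifests above declares `nginx`; the `IngressRef` type and `findUnserved` helper are hypothetical names. It also treats an Ingress with no class as unserved, which is a simplification, since a controller may be configured to pick those up.

```typescript
// Sketch only: IngressRef and findUnserved are hypothetical helpers.
interface IngressRef {
  manifest: string;
  ingressClassName?: string;
}

// Return every Ingress whose class does not match the class the
// running controller reconciles.
function findUnserved(
  controllerClass: string,
  ingresses: readonly IngressRef[],
): IngressRef[] {
  return ingresses.filter((ing) => ing.ingressClassName !== controllerClass);
}

// Assumption: each affected manifest declares the `nginx` class.
const affectedManifests: IngressRef[] = [
  { manifest: "hermes-mcp/hermes-k8s.yaml", ingressClassName: "nginx" },
  { manifest: "hermes-mcp/product/app/app-k8s.yaml", ingressClassName: "nginx" },
  { manifest: "hermes-mcp/docs/docs-k8s.yaml", ingressClassName: "nginx" },
  { manifest: "hermes-mcp/product/site/squaremcp-k8s-ingress.yaml", ingressClassName: "nginx" },
];

for (const ing of findUnserved("public", affectedManifests)) {
  console.log(`${ing.manifest}: class ${ing.ingressClassName ?? "(none)"} is not served`);
}
```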
---
## 4. Hermes MCP route table
**File:** `hermes-mcp/src/index.ts`
### Public / commercial endpoints
| Method | Path | Notes |
|---|---|---|
| `GET` | `/` | Static files from `../product` |
| `GET` | `/openapi-living-brief.json` | Obsidian-only OpenAPI spec for ChatGPT |
| `GET` | `/openapi.json` | Full OpenAPI spec |
| `GET` | `/auth/tiktok/start` | Redirect to TikTok Login Kit |
| `GET` | `/auth/tiktok/callback` | TikTok OAuth callback |
| `POST` | `/api/pilot-request` | Public form submission; origin-gated |
| `GET` | `/health` | Liveness/readiness probe |
### OAuth / MCP discovery
| Method | Path |
|---|---|
| `POST` | `/oauth/register` |
| `GET` / `POST` | `/oauth/authorize` |
| `POST` | `/oauth/token` |
| `GET` | `/.well-known/oauth-authorization-server` |
| `GET` | `/.well-known/openid-configuration` |
| `GET` / `POST` / `DELETE` | `/mcp` |
| `GET` | `/sse` |
| `POST` | `/messages` |
| `GET` | `/tools` |
### Capability-guarded tool API
All `/api/*` tool routes require auth + capability grant:
| Capability | Example endpoints |
|---|---|
| `obsidian` | `/api/obsidian/search`, `/api/obsidian/note`, `/api/obsidian/note/append`, `/api/obsidian/sync` |
| `email` | `/api/email/profile`, `/api/email/search`, `/api/email/read`, `/api/email/send` |
| `whatsapp` | `/api/whatsapp/send`, `/api/whatsapp/templates` |
| `linkedin` | `/api/linkedin/profile`, `/api/linkedin/post`, `/api/linkedin/message` |
| `telegram` | `/api/telegram/me`, `/api/telegram/message`, `/api/telegram/updates` |
| `discord` | `/api/discord/me`, `/api/discord/guilds`, `/api/discord/message` |
| `instagram` | `/api/instagram/profile`, `/api/instagram/media`, `/api/instagram/post` |
| `twitter` | `/api/twitter/search`, `/api/twitter/tweets`, `/api/twitter/tweet` |
| `facebook` | `/api/facebook/page`, `/api/facebook/posts`, `/api/facebook/post` |
| `tiktok` | `/api/tiktok/profile`, `/api/tiktok/video`, `/api/tiktok/video/status` |
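
The auth-plus-capability rule can be illustrated with a small decision function. This is a sketch under assumptions, not the logic in `index.ts`: the `Caller` shape, the `authorize` helper, and the decision labels are hypothetical, and the capability is assumed to be the first path segment after `/api/`, as the table suggests. The public `/api/pilot-request` route has no capability segment, so it falls outside the guard here.

```typescript
// Sketch only: Caller, authorize, and the decision labels are hypothetical.
const CAPABILITIES = [
  "obsidian", "email", "whatsapp", "linkedin", "telegram",
  "discord", "instagram", "twitter", "facebook", "tiktok",
] as const;
type Capability = (typeof CAPABILITIES)[number];

interface Caller {
  authenticated: boolean;
  grants: ReadonlySet<Capability>;
}

type Decision = "allow" | "unauthenticated" | "forbidden" | "not-a-tool-route";

// Map "/api/<capability>/..." to its capability, or null for other paths.
function capabilityForPath(path: string): Capability | null {
  const match = /^\/api\/([^/]+)\//.exec(path);
  const segment = match?.[1];
  if (segment === undefined) return null;
  return CAPABILITIES.find((c) => c === segment) ?? null;
}

// Require authentication first, then the matching capability grant.
function authorize(path: string, caller: Caller): Decision {
  const capability = capabilityForPath(path);
  if (capability === null) return "not-a-tool-route";
  if (!caller.authenticated) return "unauthenticated";
  return caller.grants.has(capability) ? "allow" : "forbidden";
}

const emailOnly: Caller = {
  authenticated: true,
  grants: new Set<Capability>(["email"]),
};
console.log(authorize("/api/email/send", emailOnly));
console.log(authorize("/api/discord/message", emailOnly));
```

In an HTTP handler, `unauthenticated` and `forbidden` would typically map to `401` and `403` responses respectively.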
### Health endpoint
```typescript
app.get('/health', (_req, res) => {
  res.json({
    status: 'ok',
    service: 'hermes-mcp',
    toolCount,
    transports,
    endpoints,
  });
});
```
This endpoint backs both the K8s readiness and liveness probes in `hermes-k8s.yaml`.
---
## 5. Monitoring gaps
### Prometheus / Grafana
- Prometheus and Grafana containers exist in `docker-compose.fetcherpay.yml`.
- Prometheus scrapes itself, `fetcherpay-api:3000`, and Docker metrics at `172.20.0.1:9323`.
- **Hermes MCP is not scraped** and has no `/metrics` endpoint.
- No Alertmanager, no alert rules.
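
To close the scrape gap, Hermes would need to expose something in the Prometheus text exposition format. The sketch below renders a single counter by hand to show what that payload looks like; the metric name `hermes_http_requests_total` and the `renderCounter` helper are hypothetical, and a real service would more likely use a metrics client library than string building.

```typescript
// Sketch only: renderCounter and the metric name are hypothetical.
interface CounterSample {
  labels: Record<string, string>;
  value: number;
}

// Escape a label value per the text format: backslash, quote, newline.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one counter family with HELP and TYPE headers.
function renderCounter(name: string, help: string, samples: CounterSample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const sample of samples) {
    const labels = Object.keys(sample.labels)
      .sort()
      .map((key) => `${key}="${escapeLabel(sample.labels[key] ?? "")}"`)
      .join(",");
    lines.push(labels ? `${name}{${labels}} ${sample.value}` : `${name} ${sample.value}`);
  }
  return lines.join("\n") + "\n";
}

const metricsBody = renderCounter("hermes_http_requests_total", "Total HTTP requests.", [
  { labels: { path: "/health", status: "200" }, value: 3 },
]);
console.log(metricsBody);
```

A `/metrics` route would return this string with a plain-text content type, and Prometheus would need a matching scrape job for the Hermes target.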
### Health checks
- Hermes has `/health` but no `/ready` or `/livez` separation.
- Docker health checks exist for Postgres, MySQL, Redis, Gitea, and FetcherPay API, but **not for Hermes**.
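
One way to separate the two concerns is sketched below: liveness answers only "is the process responsive", while readiness also depends on downstream state. The function names, the `ProbeResult` shape, and the dependency names are hypothetical, and the dependency states are passed in as plain booleans on the assumption that they are gathered elsewhere.

```typescript
// Sketch only: ProbeResult, livez, and ready are hypothetical names.
interface ProbeResult {
  status: number;
  body: { status: "ok" | "unavailable"; failing: string[] };
}

// Liveness: succeeds whenever the process can answer at all.
function livez(): ProbeResult {
  return { status: 200, body: { status: "ok", failing: [] } };
}

// Readiness: succeeds only when every named dependency is healthy.
function ready(dependencies: Record<string, boolean>): ProbeResult {
  const failing = Object.keys(dependencies)
    .filter((name) => dependencies[name] !== true)
    .sort();
  return failing.length === 0
    ? { status: 200, body: { status: "ok", failing } }
    : { status: 503, body: { status: "unavailable", failing } };
}

// Hypothetical dependency names, for illustration.
console.log(livez().status);
console.log(ready({ database: true, obsidianVault: false }));
```

With this split, a failed dependency takes the instance out of rotation through the readiness probe without the liveness probe restarting it.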
### Uptime / synthetic probes
- No blackbox exporter.
- No external uptime monitoring (Pingdom, UptimeRobot, Grafana Cloud, etc.).
- No cert-expiry alerting.
- No K8s ingress reconciliation check.
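
The cert-expiry gap in particular is cheap to close once a certificate's `notAfter` date is available. The sketch below shows only the classification step; the thresholds (21 and 7 days) and the helper names are assumptions, and obtaining the date from the live endpoint is left out.

```typescript
// Sketch only: thresholds and helper names are assumptions.
type Severity = "ok" | "warning" | "critical" | "expired";

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days remaining until the certificate's notAfter timestamp.
function daysUntil(notAfter: Date, now: Date): number {
  return Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
}

// Classify remaining lifetime against warning and critical thresholds.
function classifyExpiry(
  notAfter: Date,
  now: Date,
  warnDays = 21,
  criticalDays = 7,
): Severity {
  if (notAfter.getTime() <= now.getTime()) return "expired";
  const days = daysUntil(notAfter, now);
  if (days < criticalDays) return "critical";
  if (days < warnDays) return "warning";
  return "ok";
}

const checkedAt = new Date(Date.UTC(2026, 5, 14)); // 2026-06-14
const exampleNotAfter = new Date(Date.UTC(2026, 5, 24)); // hypothetical expiry date
console.log(daysUntil(exampleNotAfter, checkedAt), classifyExpiry(exampleNotAfter, checkedAt));
```

Because the static certificates in Traefik are rotated manually, a check like this would need to run against the certificate actually served on `:443`, not only the one cert-manager holds.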
### Logs
- No centralized log aggregation (Loki, Vector, Fluentd).
---
## 6. Secret management
- `hermes-k8s.yaml` is gitignored and contains plaintext secrets (email, DB, OAuth, API keys).
- Docker Compose stacks rely on exported env vars or `.env` files.
- No Sealed Secrets, External Secrets Operator, or Vault in use.
---
## 7. Notable risks
1. **Single point of failure:** one residential IP, one host, one edge proxy.
2. **Split edge:** two ingress controllers with conflicting class configuration.
3. **Manual certificate workaround:** static K8s-extracted certs in Traefik must be manually rotated before expiry.
4. **No observability:** no metrics, alerting, or synthetic probes for the commercial domains.
5. **Stopped services not detected:** Docker restart policies only take effect for containers that have already been started, so a container that never came up stays down and nothing alerts on it.