hermes-mcp/docs/runbooks/2026-06-14-infrastructure-findings.md
2026-06-14 12:26:34 -04:00

# Infrastructure Findings — SquareMCP / FetcherPay
This document captures the as-built architecture, ingress behavior, monitoring state, and Hermes route table discovered during the 2026-06-14 outage response.

---
## 1. High-level architecture
The single production server (`104.190.60.129`) hosts two separate ingress layers:
| Ingress Layer | Technology | Serves |
|---|---|---|
| **Docker edge proxy** | Traefik v3 | `*.fetcherpay.com` Docker Compose stacks, plus static file-provider routes for `*.squaremcp.com` |
| **Kubernetes ingress** | nginx-ingress-microk8s + cert-manager | `*.squaremcp.com` K8s workloads (currently bypassed by Traefik) |

Both layers use Let's Encrypt TLS. Public ports `80`/`443` are bound by the Docker Traefik container, so the `iptables` rules Docker installs for that container win over host-network K8s services.

---
## 2. Traefik configuration
### Static config
**File:** `/home/garfield/traefik.yml`
- Dashboard enabled on `:8080` with `insecure: true`.
- Entrypoints: `web` (HTTP → HTTPS redirect) and `websecure` (HTTPS, `:443`).
- Providers: Docker (socket) + file provider (`/letsencrypt/manual/tls.yml`, `watch: true`).
- Certificate resolver: `letsencrypt` via GoDaddy DNS-01.
### Compose
**File:** `/home/garfield/traefik-compose.yml`
- Networks: `hermes-net`, `obsidian-net`, `fetcherpay` (all external).
- Volumes: Docker socket, static config, `letsencrypt` directory.
### Dynamic routing
**File:** `/home/garfield/letsencrypt/manual/tls.yml`

After the fix, this file defines file-provider routers for all commercial domains, plus path-specific rules that send `/api/pilot-request` and `/auth/tiktok` to Hermes.
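The path-specific rules can coexist with the per-domain routers because Traefik, by default, gives a router with a longer rule a higher priority. The sketch below models that behavior only; the router names, the `example.squaremcp.com` host, and the service names are hypothetical and are not taken from `tls.yml`.

```typescript
// Hypothetical router shape; the real definitions live in tls.yml.
interface Router {
  name: string;
  host: string;
  pathPrefix?: string; // omitted means "any path on this host"
  service: string;
}

// Build the rule string a router of this shape would carry.
function ruleOf(r: Router): string {
  const host = `Host(\`${r.host}\`)`;
  return r.pathPrefix ? `${host} && PathPrefix(\`${r.pathPrefix}\`)` : host;
}

// Among matching routers, pick the one with the longest rule,
// which approximates Traefik's default priority ordering.
function resolve(routers: Router[], host: string, path: string): Router | undefined {
  return routers
    .filter((r) => r.host === host && (!r.pathPrefix || path.startsWith(r.pathPrefix)))
    .sort((a, b) => ruleOf(b).length - ruleOf(a).length)[0];
}

// Illustrative data only.
const routers: Router[] = [
  { name: "site", host: "example.squaremcp.com", service: "static-site" },
  { name: "pilot", host: "example.squaremcp.com", pathPrefix: "/api/pilot-request", service: "hermes" },
  { name: "tiktok", host: "example.squaremcp.com", pathPrefix: "/auth/tiktok", service: "hermes" },
];
```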

---
## 3. Kubernetes ingress mismatch
- **Controller class:** `public`
- **Ingress class used by manifests:** `nginx`

Because the active controller only serves class `public`, it ignores most Ingress resources, which declare class `nginx`. Even if Traefik were removed, those Ingresses would not be served until the class is reconciled.
Affected manifests include:
- `hermes-mcp/hermes-k8s.yaml`
- `hermes-mcp/product/app/app-k8s.yaml`
- `hermes-mcp/docs/docs-k8s.yaml`
- `hermes-mcp/product/site/squaremcp-k8s-ingress.yaml`
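A reconciliation check for this mismatch could be as small as the following sketch. It assumes the class names have already been extracted from each manifest; the `IngressRef` shape and `ignoredBy` helper are hypothetical, and an Ingress with no class is treated as ignored, which is a simplification (a controller whose IngressClass is marked default may still serve it).

```typescript
// Hypothetical summary of one manifest's Ingress resource.
interface IngressRef {
  file: string;
  ingressClassName?: string;
}

// Return the manifests whose class does not match the controller's class.
function ignoredBy(controllerClass: string, ingresses: IngressRef[]): string[] {
  return ingresses
    .filter((i) => i.ingressClassName !== controllerClass)
    .map((i) => i.file);
}

// The four manifests listed above, all declaring class "nginx".
const manifests: IngressRef[] = [
  { file: "hermes-mcp/hermes-k8s.yaml", ingressClassName: "nginx" },
  { file: "hermes-mcp/product/app/app-k8s.yaml", ingressClassName: "nginx" },
  { file: "hermes-mcp/docs/docs-k8s.yaml", ingressClassName: "nginx" },
  { file: "hermes-mcp/product/site/squaremcp-k8s-ingress.yaml", ingressClassName: "nginx" },
];

// With the controller on class "public", every manifest is reported.
const unserved = ignoredBy("public", manifests);
```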
---
## 4. Hermes MCP route table
**File:** `hermes-mcp/src/index.ts`
### Public / commercial endpoints
| Method | Path | Notes |
|---|---|---|
| `GET` | `/` | Static files from `../product` |
| `GET` | `/openapi-living-brief.json` | Obsidian-only OpenAPI spec for ChatGPT |
| `GET` | `/openapi.json` | Full OpenAPI spec |
| `GET` | `/auth/tiktok/start` | Redirect to TikTok Login Kit |
| `GET` | `/auth/tiktok/callback` | TikTok OAuth callback |
| `POST` | `/api/pilot-request` | Public form submission; origin-gated |
| `GET` | `/health` | Liveness/readiness probe |
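The origin gate on `/api/pilot-request` is not shown in this document. As a rough illustration of what such a gate usually looks like, the sketch below compares the request's `Origin` header against an allowlist; the `ALLOWED_ORIGINS` values and the `isAllowedOrigin` helper are assumptions, not the contents of `src/index.ts`.

```typescript
// Hypothetical allowlist; the real list is not documented here.
const ALLOWED_ORIGINS: ReadonlySet<string> = new Set([
  "https://squaremcp.com",
  "https://www.squaremcp.com",
]);

// Accept the request only when an Origin header is present and allowlisted.
// Browsers send the origin without a trailing slash, so an exact,
// case-insensitive comparison is enough for this sketch.
function isAllowedOrigin(
  originHeader: string | undefined,
  allowed: ReadonlySet<string> = ALLOWED_ORIGINS,
): boolean {
  if (!originHeader) return false;
  return allowed.has(originHeader.trim().toLowerCase());
}
```

A handler would typically answer `403` when this returns `false`, before touching the form body.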
### OAuth / MCP discovery
| Method | Path |
|---|---|
| `POST` | `/oauth/register` |
| `GET` / `POST` | `/oauth/authorize` |
| `POST` | `/oauth/token` |
| `GET` | `/.well-known/oauth-authorization-server` |
| `GET` | `/.well-known/openid-configuration` |
| `GET` / `POST` / `DELETE` | `/mcp` |
| `GET` | `/sse` |
| `POST` | `/messages` |
| `GET` | `/tools` |
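The discovery document at `/.well-known/oauth-authorization-server` is what ties these routes together for a client. The actual Hermes response is not reproduced in this document; the sketch below only shows how the standard OAuth 2.0 authorization server metadata fields would map onto the paths in the table, with `buildMetadata` and the issuer value being hypothetical.

```typescript
// Subset of the standard authorization server metadata fields.
interface AuthServerMetadata {
  issuer: string;
  authorization_endpoint: string;
  token_endpoint: string;
  registration_endpoint: string;
}

// Derive endpoint URLs from the issuer, using the paths in the route table.
function buildMetadata(issuer: string): AuthServerMetadata {
  const base = issuer.replace(/\/+$/, ""); // drop trailing slashes
  return {
    issuer: base,
    authorization_endpoint: `${base}/oauth/authorize`,
    token_endpoint: `${base}/oauth/token`,
    registration_endpoint: `${base}/oauth/register`,
  };
}

// Illustrative issuer only.
const metadata = buildMetadata("https://hermes.example/");
```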
### Capability-guarded tool API
All `/api/*` tool routes require auth + capability grant:
| Capability | Example endpoints |
|---|---|
| `obsidian` | `/api/obsidian/search`, `/api/obsidian/note`, `/api/obsidian/note/append`, `/api/obsidian/sync` |
| `email` | `/api/email/profile`, `/api/email/search`, `/api/email/read`, `/api/email/send` |
| `whatsapp` | `/api/whatsapp/send`, `/api/whatsapp/templates` |
| `linkedin` | `/api/linkedin/profile`, `/api/linkedin/post`, `/api/linkedin/message` |
| `telegram` | `/api/telegram/me`, `/api/telegram/message`, `/api/telegram/updates` |
| `discord` | `/api/discord/me`, `/api/discord/guilds`, `/api/discord/message` |
| `instagram` | `/api/instagram/profile`, `/api/instagram/media`, `/api/instagram/post` |
| `twitter` | `/api/twitter/search`, `/api/twitter/tweets`, `/api/twitter/tweet` |
| `facebook` | `/api/facebook/page`, `/api/facebook/posts`, `/api/facebook/post` |
| `tiktok` | `/api/tiktok/profile`, `/api/tiktok/video`, `/api/tiktok/video/status` |
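The guard logic implied by this table can be sketched as a pure function: derive the capability from the first path segment after `/api/`, then check authentication and the grant. This is an illustration under assumptions; the `User` shape, `capabilityFor`, `authorize`, and the chosen status codes are hypothetical and may differ from the middleware in `src/index.ts`.

```typescript
const CAPABILITIES = [
  "obsidian", "email", "whatsapp", "linkedin", "telegram",
  "discord", "instagram", "twitter", "facebook", "tiktok",
] as const;
type Capability = (typeof CAPABILITIES)[number];

// Hypothetical authenticated-user shape.
interface User {
  grants: Capability[];
}

// Map "/api/<capability>/..." to its capability; other paths return undefined.
function capabilityFor(path: string): Capability | undefined {
  const match = /^\/api\/([^/]+)\//.exec(path);
  if (!match) return undefined;
  const segment = match[1];
  return CAPABILITIES.find((c) => c === segment);
}

// Return the HTTP status a guard would produce for this request.
function authorize(path: string, user: User | undefined): number {
  const capability = capabilityFor(path);
  if (!capability) return 404; // not a capability-guarded tool route
  if (!user) return 401; // no valid credentials
  return user.grants.includes(capability) ? 200 : 403;
}
```

Note that `/api/pilot-request` has no second path segment, so this sketch does not treat it as a tool route, which matches its listing as a public endpoint.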
### Health endpoint
```typescript
app.get('/health', (_req, res) => {
  res.json({
    status: 'ok',
    service: 'hermes-mcp',
    toolCount,
    transports,
    endpoints,
  });
});
```
Used by both K8s readiness and liveness probes in `hermes-k8s.yaml`.
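On the probing side, a check against this endpoint only needs the status code and two fields of the body. The following is a minimal sketch of such a check; `isHealthy` is a hypothetical helper, and the types of `transports` and `endpoints` are left out because this document does not specify them.

```typescript
// Only the fields whose types are evident from the handler above.
interface HealthBody {
  status: string;
  service: string;
  toolCount: number;
}

// Decide whether a /health response indicates a working Hermes instance.
function isHealthy(statusCode: number, body: unknown): boolean {
  if (statusCode !== 200 || typeof body !== "object" || body === null) {
    return false;
  }
  const parsed = body as Partial<HealthBody>;
  return parsed.status === "ok" && parsed.service === "hermes-mcp";
}
```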

---
## 5. Monitoring gaps
### Prometheus / Grafana
- Prometheus and Grafana containers exist in `docker-compose.fetcherpay.yml`.
- Prometheus scrapes itself, `fetcherpay-api:3000`, and Docker metrics at `172.20.0.1:9323`.
- **Hermes MCP is not scraped** and has no `/metrics` endpoint.
- No Alertmanager, no alert rules.
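Closing the scrape gap means Hermes has to serve the Prometheus text exposition format on a `/metrics` route. In practice a client library would normally produce this output; the dependency-free sketch below just shows the format. The metric names (`hermes_tool_count`, `hermes_http_requests_total`) and the `renderMetrics` helper are hypothetical.

```typescript
interface Sample {
  name: string;
  help: string;
  type: "counter" | "gauge";
  value: number;
  labels?: Record<string, string>;
}

// Escape backslash, double quote, and newline inside a label value.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render samples as Prometheus text exposition lines.
function renderMetrics(samples: Sample[]): string {
  return samples
    .map((s) => {
      const labels = Object.entries(s.labels ?? {})
        .map(([key, value]) => `${key}="${escapeLabel(value)}"`)
        .join(",");
      const series = labels ? `${s.name}{${labels}}` : s.name;
      return `# HELP ${s.name} ${s.help}\n# TYPE ${s.name} ${s.type}\n${series} ${s.value}\n`;
    })
    .join("");
}

// Illustrative values only.
const exposition = renderMetrics([
  { name: "hermes_tool_count", help: "Registered MCP tools", type: "gauge", value: 12 },
  {
    name: "hermes_http_requests_total",
    help: "HTTP requests served",
    type: "counter",
    value: 3,
    labels: { path: "/health" },
  },
]);
```

The handler would return this string with content type `text/plain; version=0.0.4`, and Prometheus would need a matching scrape job for the Hermes target.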
### Health checks
- Hermes has `/health` but no `/ready` or `/livez` separation.
- Docker health checks exist for Postgres, MySQL, Redis, Gitea, and FetcherPay API, but **not for Hermes**.
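Separating liveness from readiness usually means `/livez` answers as long as the process can respond, while `/ready` also verifies dependencies. The sketch below illustrates that split with synchronous checks for brevity (real dependency checks would be asynchronous); the function names and the dependency names in the usage are hypothetical.

```typescript
type Check = () => boolean;

interface ProbeResult {
  status: number;
  body: { status: string; failed: string[] };
}

// Liveness: the process is up and able to answer; no dependency checks.
function livez(): ProbeResult {
  return { status: 200, body: { status: "ok", failed: [] } };
}

// Readiness: every named dependency check must pass, otherwise 503.
function ready(checks: Record<string, Check>): ProbeResult {
  const failed = Object.entries(checks)
    .filter(([, check]) => !check())
    .map(([name]) => name);
  return failed.length === 0
    ? { status: 200, body: { status: "ok", failed } }
    : { status: 503, body: { status: "unavailable", failed } };
}

// Illustrative dependencies only.
const result = ready({ database: () => true, obsidianVault: () => false });
```

With this split, a failing dependency takes the pod out of rotation via the readiness probe without triggering a restart through the liveness probe.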
### Uptime / synthetic probes
- No blackbox exporter.
- No external uptime monitoring (Pingdom, UptimeRobot, Grafana Cloud, etc.).
- No cert-expiry alerting.
- No K8s ingress reconciliation check.
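The decision logic for a cert-expiry alert is small once the certificate's not-after date is known (for example, from the peer certificate of a TLS handshake). The sketch below covers only that logic; `daysUntilExpiry`, `certStatus`, and the 21-day warning threshold are assumptions for illustration.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Whole days remaining until the certificate expires (negative once expired).
function daysUntilExpiry(notAfter: Date, now: Date): number {
  return Math.floor((notAfter.getTime() - now.getTime()) / DAY_MS);
}

type CertStatus = "ok" | "warning" | "expired";

// Classify a certificate against a hypothetical warning threshold.
function certStatus(notAfter: Date, now: Date, warnDays = 21): CertStatus {
  const remaining = daysUntilExpiry(notAfter, now);
  if (remaining < 0) return "expired";
  return remaining <= warnDays ? "warning" : "ok";
}

// Illustrative dates only.
const notAfter = new Date("2026-09-12T00:00:00Z");
const statusAtIncident = certStatus(notAfter, new Date("2026-06-14T00:00:00Z"));
```

A scheduled job could run this per commercial domain and page when the result is anything other than `"ok"`.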
### Logs
- No centralized log aggregation (Loki, Vector, Fluentd).
---
## 6. Secret management
- `hermes-k8s.yaml` is gitignored and contains plaintext secrets (email, DB, OAuth, API keys).
- Docker Compose stacks rely on exported env vars or `.env` files.
- No Sealed Secrets, External Secrets Operator, or Vault in use.
---
## 7. Notable risks
1. **Single point of failure:** one residential IP, one host, one edge proxy.
2. **Split edge:** two ingress controllers with conflicting class configuration.
3. **Manual certificate workaround:** static K8s-extracted certs in Traefik must be manually rotated before expiry.
4. **No observability:** no metrics, alerting, or synthetic probes for the commercial domains.
5. **Stopped services not detected:** Docker restart policies only take effect for containers that were started in the first place, and nothing alerts when an expected container is not running.
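Risk 5 could be narrowed with a check that compares an expected container list against what is actually running. The sketch below assumes the running names come from `docker ps --format '{{.Names}}'` output captured by some wrapper; the helper names and the container names in the usage are hypothetical.

```typescript
// Split raw `docker ps --format '{{.Names}}'` output into container names.
function parseDockerPs(output: string): string[] {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Return the expected containers that are not currently running.
function missingContainers(expected: string[], running: string[]): string[] {
  const up = new Set(running);
  return expected.filter((name) => !up.has(name));
}

// Illustrative names only.
const expected = ["traefik", "hermes-mcp", "fetcherpay-api"];
const running = parseDockerPs("traefik\nfetcherpay-api\n");
const missing = missingContainers(expected, running);
```

Because restart policies never act on a container that was not started, a non-empty result here is the signal that would otherwise go unnoticed.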