# Infrastructure Findings — SquareMCP / FetcherPay

This document captures the as-built architecture, ingress behavior, monitoring state, and Hermes route table discovered during the 2026-06-14 outage response.

---

## 1. High-level architecture

The single production server (`104.190.60.129`) hosts two separate ingress layers:

| Ingress Layer | Technology | Serves |
|---|---|---|
| **Docker edge proxy** | Traefik v3 | `*.fetcherpay.com` Docker Compose stacks, plus static file-provider routes for `*.squaremcp.com` |
| **Kubernetes ingress** | nginx-ingress-microk8s + cert-manager | `*.squaremcp.com` K8s workloads (currently bypassed by Traefik) |

Both layers use Let’s Encrypt TLS. Public ports `80`/`443` are bound by the Docker Traefik container, so its `iptables` rules win over host-network K8s services.

---

## 2. Traefik configuration

### Static config
**File:** `/home/garfield/traefik.yml`

- Dashboard enabled on `:8080` with `insecure: true`.
- Entrypoints: `web` (HTTP → HTTPS redirect) and `websecure` (HTTPS, `:443`).
- Providers: Docker (socket) + file provider (`/letsencrypt/manual/tls.yml`, `watch: true`).
- Certificate resolver: `letsencrypt` via GoDaddy DNS-01.

### Compose
**File:** `/home/garfield/traefik-compose.yml`

- Networks: `hermes-net`, `obsidian-net`, `fetcherpay` (all external).
- Volumes: Docker socket, static config, `letsencrypt` directory.

### Dynamic routing
**File:** `/home/garfield/letsencrypt/manual/tls.yml`

After the fix, the file provider defines routers for all commercial domains, plus path-specific rules that send `/api/pilot-request` and `/auth/tiktok` to Hermes.
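To illustrate the path-specific routing described above, the sketch below builds the general shape of a Traefik file-provider `http` section as a TypeScript object, which could then be serialized to YAML. It is not a copy of `tls.yml`: the host, the router and service names, the backend URLs, and the explicit `priority` values are all assumptions chosen to show how a path rule can take precedence over a host-wide rule.

```typescript
// Sketch only: host, router/service names, backend URLs, and priorities
// are hypothetical and are not taken from the real tls.yml.
interface Router {
  rule: string;
  service: string;
  entryPoints: string[];
  priority: number;
  tls: Record<string, never>;
}

interface Service {
  loadBalancer: { servers: { url: string }[] };
}

interface DynamicConfig {
  http: {
    routers: Record<string, Router>;
    services: Record<string, Service>;
  };
}

// Build a rule that matches one host and any of the given path prefixes.
function hostAndPaths(host: string, prefixes: string[]): string {
  const paths = prefixes.map((p) => `PathPrefix(\`${p}\`)`).join(" || ");
  return `Host(\`${host}\`) && (${paths})`;
}

function buildConfig(host: string, siteUrl: string, hermesUrl: string): DynamicConfig {
  return {
    http: {
      routers: {
        // Catch-all router for the domain.
        site: {
          rule: `Host(\`${host}\`)`,
          service: "site",
          entryPoints: ["websecure"],
          priority: 10,
          tls: {},
        },
        // Higher priority so these paths reach Hermes instead of the site.
        "hermes-paths": {
          rule: hostAndPaths(host, ["/api/pilot-request", "/auth/tiktok"]),
          service: "hermes",
          entryPoints: ["websecure"],
          priority: 100,
          tls: {},
        },
      },
      services: {
        site: { loadBalancer: { servers: [{ url: siteUrl }] } },
        hermes: { loadBalancer: { servers: [{ url: hermesUrl }] } },
      },
    },
  };
}
```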

---

## 3. Kubernetes ingress mismatch

- **Controller class:** `public`
- **Ingress class used by manifests:** `nginx`

Because the two classes differ, the active controller ignores most Ingress resources. Even if Traefik were removed, those Ingresses would not be served until the class is reconciled.

Affected manifests include:
- `hermes-mcp/hermes-k8s.yaml`
- `hermes-mcp/product/app/app-k8s.yaml`
- `hermes-mcp/docs/docs-k8s.yaml`
- `hermes-mcp/product/site/squaremcp-k8s-ingress.yaml`
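The mismatch can be checked mechanically. The sketch below assumes the Ingress manifests have already been parsed into plain objects; the `IngressLike` type, the function names, and the sample data are illustrative. It reads the class from `spec.ingressClassName` or, failing that, from the legacy `kubernetes.io/ingress.class` annotation, and reports every Ingress whose class differs from the controller's.

```typescript
// Minimal shape of a parsed Ingress manifest (illustrative, not exhaustive).
interface IngressLike {
  metadata: { name: string; annotations?: Record<string, string> };
  spec?: { ingressClassName?: string };
}

// Prefer spec.ingressClassName, then fall back to the legacy annotation.
function ingressClassOf(ing: IngressLike): string | undefined {
  return (
    ing.spec?.ingressClassName ??
    ing.metadata.annotations?.["kubernetes.io/ingress.class"]
  );
}

// Names of Ingresses the given controller class would not pick up.
// An Ingress with no class at all is also reported; whether it is served
// in practice depends on whether a default IngressClass exists.
function findMismatched(ingresses: IngressLike[], controllerClass: string): string[] {
  return ingresses
    .filter((ing) => ingressClassOf(ing) !== controllerClass)
    .map((ing) => ing.metadata.name);
}

// Hypothetical sample data.
const sampleIngresses: IngressLike[] = [
  { metadata: { name: "hermes" }, spec: { ingressClassName: "nginx" } },
  { metadata: { name: "docs", annotations: { "kubernetes.io/ingress.class": "nginx" } } },
  { metadata: { name: "already-public" }, spec: { ingressClassName: "public" } },
];

console.log(findMismatched(sampleIngresses, "public")); // hermes and docs
```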

---

## 4. Hermes MCP route table

**File:** `hermes-mcp/src/index.ts`

### Public / commercial endpoints

| Method | Path | Notes |
|---|---|---|
| `GET` | `/` | Static files from `../product` |
| `GET` | `/openapi-living-brief.json` | Obsidian-only OpenAPI spec for ChatGPT |
| `GET` | `/openapi.json` | Full OpenAPI spec |
| `GET` | `/auth/tiktok/start` | Redirect to TikTok Login Kit |
| `GET` | `/auth/tiktok/callback` | TikTok OAuth callback |
| `POST` | `/api/pilot-request` | Public form submission; origin-gated |
| `GET` | `/health` | Liveness/readiness probe |

### OAuth / MCP discovery

| Method | Path |
|---|---|
| `POST` | `/oauth/register` |
| `GET` / `POST` | `/oauth/authorize` |
| `POST` | `/oauth/token` |
| `GET` | `/.well-known/oauth-authorization-server` |
| `GET` | `/.well-known/openid-configuration` |
| `GET` / `POST` / `DELETE` | `/mcp` |
| `GET` | `/sse` |
| `POST` | `/messages` |
| `GET` | `/tools` |

### Capability-guarded tool API

All `/api/*` tool routes require authentication and a capability grant:

| Capability | Example endpoints |
|---|---|
| `obsidian` | `/api/obsidian/search`, `/api/obsidian/note`, `/api/obsidian/note/append`, `/api/obsidian/sync` |
| `email` | `/api/email/profile`, `/api/email/search`, `/api/email/read`, `/api/email/send` |
| `whatsapp` | `/api/whatsapp/send`, `/api/whatsapp/templates` |
| `linkedin` | `/api/linkedin/profile`, `/api/linkedin/post`, `/api/linkedin/message` |
| `telegram` | `/api/telegram/me`, `/api/telegram/message`, `/api/telegram/updates` |
| `discord` | `/api/discord/me`, `/api/discord/guilds`, `/api/discord/message` |
| `instagram` | `/api/instagram/profile`, `/api/instagram/media`, `/api/instagram/post` |
| `twitter` | `/api/twitter/search`, `/api/twitter/tweets`, `/api/twitter/tweet` |
| `facebook` | `/api/facebook/page`, `/api/facebook/posts`, `/api/facebook/post` |
| `tiktok` | `/api/tiktok/profile`, `/api/tiktok/video`, `/api/tiktok/video/status` |
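A minimal sketch of how such a guard could decide on a request, assuming the capability is always the first path segment after `/api/` (as in the table) and that any other `/api/*` path, such as `/api/pilot-request`, is not a tool route. The function names, the `Decision` values, and the `grants` set are illustrative and are not taken from `index.ts`.

```typescript
const CAPABILITIES = [
  "obsidian", "email", "whatsapp", "linkedin", "telegram",
  "discord", "instagram", "twitter", "facebook", "tiktok",
] as const;

type Capability = (typeof CAPABILITIES)[number];

// Returns the capability a path requires, or null if it is not a tool route.
function requiredCapability(path: string): Capability | null {
  const match = /^\/api\/([^/]+)(?:\/|$)/.exec(path);
  if (!match) return null;
  const segment = match[1];
  return (CAPABILITIES as readonly string[]).includes(segment)
    ? (segment as Capability)
    : null;
}

type Decision = "allow" | "unauthenticated" | "forbidden" | "not-a-tool-route";

// Authentication is checked first, then the capability grant.
function authorize(
  path: string,
  authenticated: boolean,
  grants: ReadonlySet<Capability>,
): Decision {
  const capability = requiredCapability(path);
  if (capability === null) return "not-a-tool-route";
  if (!authenticated) return "unauthenticated";
  return grants.has(capability) ? "allow" : "forbidden";
}
```

In an HTTP middleware, `unauthenticated` and `forbidden` would typically map to `401` and `403` responses.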

### Health endpoint

```typescript
// Excerpt: toolCount, transports, and endpoints are defined outside this snippet.
app.get('/health', (_req, res) => {
  res.json({
    status: 'ok',
    service: 'hermes-mcp',
    toolCount,
    transports,
    endpoints,
  });
});
```

Both the K8s readiness and liveness probes in `hermes-k8s.yaml` use this endpoint.

---

## 5. Monitoring gaps

### Prometheus / Grafana

- Prometheus and Grafana containers exist in `docker-compose.fetcherpay.yml`.
- Prometheus scrapes itself, `fetcherpay-api:3000`, and Docker metrics at `172.20.0.1:9323`.
- **Hermes MCP is not scraped** and has no `/metrics` endpoint.
- No Alertmanager, no alert rules.
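One way to close the scrape gap would be a `/metrics` route that returns the Prometheus text exposition format. The sketch below only renders that format from an in-memory counter. The `Counter` class, the metric name `hermes_http_requests_total`, and its labels are assumptions; a production service would more likely rely on an existing Prometheus client library.

```typescript
type Labels = Record<string, string>;

// Escape backslash, double quote, and newline as required for label values.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

class Counter {
  readonly name: string;
  readonly help: string;
  private series = new Map<string, { labels: Labels; value: number }>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  // Increment the series identified by this label set.
  inc(labels: Labels = {}, by = 1): void {
    const key = JSON.stringify(Object.entries(labels).sort());
    const entry = this.series.get(key) ?? { labels, value: 0 };
    entry.value += by;
    this.series.set(key, entry);
  }

  // Render HELP and TYPE lines, then one sample line per label set.
  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const { labels, value } of this.series.values()) {
      const parts = Object.entries(labels).map(
        ([k, v]) => `${k}="${escapeLabel(v)}"`,
      );
      const suffix = parts.length > 0 ? `{${parts.join(",")}}` : "";
      lines.push(`${this.name}${suffix} ${value}`);
    }
    return lines.join("\n") + "\n";
  }
}
```

A handler for `/metrics` would return `render()` with a `text/plain` content type, and Prometheus would need a matching scrape job for Hermes.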

### Health checks

- Hermes has `/health` but no `/ready` or `/livez` separation.
- Docker health checks exist for Postgres, MySQL, Redis, Gitea, and FetcherPay API, but **not for Hermes**.
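The probe separation mentioned above could look like the following sketch. It assumes a hypothetical list of dependency checks (the `DependencyCheck` type and the check names are illustrative, and real checks would probably be asynchronous): liveness only reports that the process can respond, while readiness also requires every dependency check to pass.

```typescript
interface DependencyCheck {
  name: string;
  check: () => boolean; // kept synchronous to keep the sketch short
}

interface ProbeResult {
  status: number;
  body: { status: "ok" | "unavailable"; failed: string[] };
}

// Liveness: the process is up and able to answer.
function liveness(): ProbeResult {
  return { status: 200, body: { status: "ok", failed: [] } };
}

// Readiness: every dependency must pass; a throwing check counts as failed.
function readiness(deps: DependencyCheck[]): ProbeResult {
  const failed: string[] = [];
  for (const dep of deps) {
    let ok = false;
    try {
      ok = dep.check();
    } catch {
      ok = false;
    }
    if (!ok) failed.push(dep.name);
  }
  return failed.length === 0
    ? { status: 200, body: { status: "ok", failed } }
    : { status: 503, body: { status: "unavailable", failed } };
}
```

With this split, a `/livez` route would return `liveness()` and a `/ready` route would return `readiness(...)`, so an orchestrator can stop routing traffic to an instance without restarting it.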

### Uptime / synthetic probes

- No blackbox exporter.
- No external uptime monitoring (Pingdom, UptimeRobot, Grafana Cloud, etc.).
- No cert-expiry alerting.
- No K8s ingress reconciliation check.
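Cert-expiry alerting reduces to comparing a certificate's `notAfter` timestamp with the current time. The sketch below shows only that comparison. The 14-day warning threshold and the function names are assumptions, and obtaining `notAfter` (for example from a TLS handshake against each domain) is left out.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days remaining until the certificate expires (negative if past).
function daysUntilExpiry(notAfter: Date, now: Date): number {
  return Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
}

type CertState = "ok" | "expiring" | "expired";

// Classify a certificate against a warning window, in days.
function classifyCert(notAfter: Date, now: Date, warnDays = 14): CertState {
  if (notAfter.getTime() <= now.getTime()) return "expired";
  return daysUntilExpiry(notAfter, now) < warnDays ? "expiring" : "ok";
}
```

A scheduled job could run this for each commercial domain and raise an alert on any result other than `ok`.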

### Logs

- No centralized log aggregation (Loki, Vector, Fluentd).

---

## 6. Secret management

- `hermes-k8s.yaml` is gitignored and contains plaintext secrets (email, DB, OAuth, API keys).
- Docker Compose stacks rely on exported env vars or `.env` files.
- No Sealed Secrets, External Secrets Operator, or Vault in use.

---

## 7. Notable risks

1. **Single point of failure:** one residential IP, one host, one edge proxy.
2. **Split edge:** two ingress controllers with conflicting class configuration.
3. **Manual certificate workaround:** static K8s-extracted certs in Traefik must be manually rotated before expiry.
4. **No observability:** no metrics, alerting, or synthetic probes for the commercial domains.
5. **Stopped services not detected:** Docker restart policies only help if containers were initially started.