hermes-mcp/docs/runbooks/2026-06-14-infrastructure-findings.md
2026-06-14 12:26:34 -04:00

Infrastructure Findings — SquareMCP / FetcherPay

This document captures the as-built architecture, ingress behavior, monitoring state, and Hermes route table discovered during the 2026-06-14 outage response.


1. High-level architecture

The single production server (104.190.60.129) hosts two separate ingress layers:

| Ingress layer | Technology | Serves |
| --- | --- | --- |
| Docker edge proxy | Traefik v3 | *.fetcherpay.com Docker Compose stacks, plus static file-provider routes for *.squaremcp.com |
| Kubernetes ingress | nginx-ingress-microk8s + cert-manager | *.squaremcp.com K8s workloads (currently bypassed by Traefik) |

Both layers use Let's Encrypt TLS. Public ports 80/443 are bound by the Docker Traefik container, so Docker's iptables rules for those published ports win over host-network K8s services.


2. Traefik configuration

Static config

File: /home/garfield/traefik.yml

  • Dashboard enabled on :8080 with insecure: true.
  • Entrypoints: web (HTTP → HTTPS redirect) and websecure (HTTPS, :443).
  • Providers: Docker (socket) + file provider (/letsencrypt/manual/tls.yml, watch: true).
  • Certificate resolver: letsencrypt via GoDaddy DNS-01.

Compose

File: /home/garfield/traefik-compose.yml

  • Networks: hermes-net, obsidian-net, fetcherpay (all external).
  • Volumes: Docker socket, static config, letsencrypt directory.

Dynamic routing

File: /home/garfield/letsencrypt/manual/tls.yml

After the fix, this file defines file-provider routers for all commercial domains, plus path-specific rules that send /api/pilot-request and /auth/tiktok to Hermes.
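The routing described above can be modeled roughly as follows. This is a minimal sketch, not the contents of tls.yml: the host `example.squaremcp.com`, the router names, and the service names `site` and `hermes` are hypothetical placeholders. It builds Traefik-style rule strings and orders routers by rule length, which is Traefik's documented default priority when none is set explicitly. The matcher is a simplification written for this sketch only.

```typescript
// Hypothetical model of the file-provider routers; all names are placeholders.
interface Route {
  name: string;
  host: string;
  pathPrefix?: string; // present only on path-specific routers
  service: string;
}

// Build a Traefik rule expression such as Host(`a`) && PathPrefix(`/b`).
function toRule(r: Route): string {
  const host = "Host(`" + r.host + "`)";
  return r.pathPrefix ? host + " && PathPrefix(`" + r.pathPrefix + "`)" : host;
}

// Pick the target service. Longer rules are tried first, mirroring Traefik's
// default priority (rule length), so path-specific routers beat the catch-all.
function resolve(routes: Route[], host: string, path: string): string | undefined {
  const ordered = [...routes].sort((a, b) => toRule(b).length - toRule(a).length);
  const hit = ordered.find(
    (r) => r.host === host && (r.pathPrefix === undefined || path.startsWith(r.pathPrefix)),
  );
  return hit?.service;
}

const HOST = "example.squaremcp.com"; // assumption: placeholder host

const routes: Route[] = [
  { name: "site", host: HOST, service: "site" },
  { name: "hermes-pilot", host: HOST, pathPrefix: "/api/pilot-request", service: "hermes" },
  { name: "hermes-tiktok", host: HOST, pathPrefix: "/auth/tiktok", service: "hermes" },
];
```

Under these assumptions, `resolve(routes, HOST, "/auth/tiktok/start")` selects the Hermes service, while any other path on the same host falls through to the site router.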


3. Kubernetes ingress mismatch

  • Controller class: public
  • Ingress class used by manifests: nginx

This means the active controller ignores most Ingress resources. Even if Traefik were removed, those Ingresses would not be served until the class is reconciled.

Affected manifests include:

  • hermes-mcp/hermes-k8s.yaml
  • hermes-mcp/product/app/app-k8s.yaml
  • hermes-mcp/docs/docs-k8s.yaml
  • hermes-mcp/product/site/squaremcp-k8s-ingress.yaml
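The class mismatch can be expressed as a simple reconciliation check. The sketch below assumes each manifest's class has already been extracted into a plain object (for example from `spec.ingressClassName`; which field the manifests actually use is not stated here), and the `IngressRef` type and `findUnserved` helper are hypothetical names.

```typescript
// Hypothetical record of an Ingress manifest and the class it requests.
interface IngressRef {
  manifest: string;
  ingressClassName: string;
}

// Return the manifests whose class differs from the controller's class.
// Those Ingresses are ignored by the controller and would not be served.
function findUnserved(controllerClass: string, ingresses: IngressRef[]): string[] {
  return ingresses
    .filter((i) => i.ingressClassName !== controllerClass)
    .map((i) => i.manifest);
}

// Values taken from the findings above: controller class "public",
// manifests declaring "nginx".
const manifests: IngressRef[] = [
  { manifest: "hermes-mcp/hermes-k8s.yaml", ingressClassName: "nginx" },
  { manifest: "hermes-mcp/product/app/app-k8s.yaml", ingressClassName: "nginx" },
  { manifest: "hermes-mcp/docs/docs-k8s.yaml", ingressClassName: "nginx" },
  { manifest: "hermes-mcp/product/site/squaremcp-k8s-ingress.yaml", ingressClassName: "nginx" },
];

const unserved = findUnserved("public", manifests);
```

A check like this could run in CI or as a periodic job, so a class drift is caught before an outage rather than during one.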

4. Hermes MCP route table

File: hermes-mcp/src/index.ts

Public / commercial endpoints

| Method | Path | Notes |
| --- | --- | --- |
| GET | / | Static files from ../product |
| GET | /openapi-living-brief.json | Obsidian-only OpenAPI spec for ChatGPT |
| GET | /openapi.json | Full OpenAPI spec |
| GET | /auth/tiktok/start | Redirect to TikTok Login Kit |
| GET | /auth/tiktok/callback | TikTok OAuth callback |
| POST | /api/pilot-request | Public form submission; origin-gated |
| GET | /health | Liveness/readiness probe |

OAuth / MCP discovery

| Method | Path |
| --- | --- |
| POST | /oauth/register |
| GET, POST | /oauth/authorize |
| POST | /oauth/token |
| GET | /.well-known/oauth-authorization-server |
| GET | /.well-known/openid-configuration |
| GET, POST, DELETE | /mcp |
| GET | /sse |
| POST | /messages |
| GET | /tools |

Capability-guarded tool API

All /api/* tool routes require authentication and a matching capability grant:

| Capability | Example endpoints |
| --- | --- |
| obsidian | /api/obsidian/search, /api/obsidian/note, /api/obsidian/note/append, /api/obsidian/sync |
| email | /api/email/profile, /api/email/search, /api/email/read, /api/email/send |
| whatsapp | /api/whatsapp/send, /api/whatsapp/templates |
| linkedin | /api/linkedin/profile, /api/linkedin/post, /api/linkedin/message |
| telegram | /api/telegram/me, /api/telegram/message, /api/telegram/updates |
| discord | /api/discord/me, /api/discord/guilds, /api/discord/message |
| instagram | /api/instagram/profile, /api/instagram/media, /api/instagram/post |
| twitter | /api/twitter/search, /api/twitter/tweets, /api/twitter/tweet |
| facebook | /api/facebook/page, /api/facebook/posts, /api/facebook/post |
| tiktok | /api/tiktok/profile, /api/tiktok/video, /api/tiktok/video/status |
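The guard can be pictured as a lookup from path to capability followed by a grant check. The following is an illustrative sketch only, not the logic in src/index.ts: the `authorize` and `capabilityFor` helpers, the `Decision` values, and the idea that the capability is the first path segment after /api/ are all assumptions drawn from the endpoint names listed above.

```typescript
// Capability names as listed in the findings.
const CAPABILITIES = [
  "obsidian", "email", "whatsapp", "linkedin", "telegram",
  "discord", "instagram", "twitter", "facebook", "tiktok",
] as const;
type Capability = (typeof CAPABILITIES)[number];

type Decision = "allow" | "unauthenticated" | "forbidden" | "not-a-tool-route";

// Assumption: the capability is the path segment right after /api/.
// Paths such as /api/pilot-request have no such segment and are not tool routes.
function capabilityFor(path: string): Capability | undefined {
  const m = /^\/api\/([^/]+)\//.exec(path);
  if (!m) return undefined;
  return CAPABILITIES.find((c) => c === m[1]);
}

// grants === null models a request without valid authentication.
function authorize(path: string, grants: ReadonlySet<string> | null): Decision {
  const cap = capabilityFor(path);
  if (cap === undefined) return "not-a-tool-route";
  if (grants === null) return "unauthenticated";
  return grants.has(cap) ? "allow" : "forbidden";
}
```

In an HTTP handler, "unauthenticated" and "forbidden" would typically map to 401 and 403 respectively.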

Health endpoint

app.get('/health', (_req, res) => {
  res.json({
    status: 'ok',
    service: 'hermes-mcp',
    toolCount,
    transports,
    endpoints,
  });
});

Both the K8s readiness and liveness probes in hermes-k8s.yaml use this endpoint.


5. Monitoring gaps

Prometheus / Grafana

  • Prometheus and Grafana containers exist in docker-compose.fetcherpay.yml.
  • Prometheus scrapes itself, fetcherpay-api:3000, and Docker metrics at 172.20.0.1:9323.
  • Hermes MCP is not scraped and has no /metrics endpoint.
  • No Alertmanager, no alert rules.
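Closing the scrape gap would start with a /metrics endpoint that returns the Prometheus text exposition format. The sketch below only renders that format from in-memory samples; the metric names (`hermes_tool_count`, `hermes_http_requests_total`) and the `Sample` type are hypothetical, and a real service would more likely use an established client library than hand-rolled rendering.

```typescript
// Hypothetical in-memory sample; names and values are illustrative only.
interface Sample {
  name: string;
  help: string;
  type: "counter" | "gauge";
  value: number;
  labels?: Record<string, string>;
}

// Escape backslash, double quote, and newline in label values,
// as the text exposition format requires.
function escapeLabel(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render samples as HELP/TYPE comments followed by one value line each.
function renderMetrics(samples: Sample[]): string {
  const lines: string[] = [];
  for (const s of samples) {
    lines.push(`# HELP ${s.name} ${s.help}`);
    lines.push(`# TYPE ${s.name} ${s.type}`);
    const labels = s.labels ?? {};
    const pairs = Object.keys(labels).map((k) => `${k}="${escapeLabel(labels[k])}"`);
    const selector = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
    lines.push(`${s.name}${selector} ${s.value}`);
  }
  return lines.join("\n") + "\n";
}

const exampleOutput = renderMetrics([
  { name: "hermes_tool_count", help: "Registered tools.", type: "gauge", value: 12 },
  {
    name: "hermes_http_requests_total",
    help: "Requests served.",
    type: "counter",
    value: 3,
    labels: { path: "/health" },
  },
]);
```

The rendered string would be served with a `text/plain; version=0.0.4` content type, and Prometheus would need a matching scrape job for the Hermes target.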

Health checks

  • Hermes has /health but no /ready or /livez separation.
  • Docker health checks exist for Postgres, MySQL, Redis, Gitea, and FetcherPay API, but not for Hermes.
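One way to separate the two probes is to keep liveness trivial and make readiness depend on downstream checks. This is a sketch under assumptions: the `livez` and `ready` helpers, the `ProbeResult` shape, and the check names are hypothetical, and the checks are synchronous here purely to keep the example short.

```typescript
interface ProbeResult {
  status: number; // HTTP status the handler would send
  body: { status: string; failing?: string[] };
}

// A check returns true when its dependency is usable.
type Check = () => boolean;

// Liveness: the process is up and able to answer. No dependency checks,
// so a failing dependency does not cause the container to be restarted.
function livez(): ProbeResult {
  return { status: 200, body: { status: "ok" } };
}

// Readiness: report 503 with the failing check names if any check fails,
// so traffic is withheld until dependencies recover.
function ready(checks: Record<string, Check>): ProbeResult {
  const failing = Object.keys(checks).filter((name) => !checks[name]());
  if (failing.length === 0) {
    return { status: 200, body: { status: "ok" } };
  }
  return { status: 503, body: { status: "unavailable", failing } };
}
```

For example, `ready({ database: () => false, vault: () => true })` yields a 503 listing `database`, while `livez()` still returns 200.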

Uptime / synthetic probes

  • No blackbox exporter.
  • No external uptime monitoring (Pingdom, UptimeRobot, Grafana Cloud, etc.).
  • No cert-expiry alerting.
  • No K8s ingress reconciliation check.
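Cert-expiry alerting reduces to comparing a certificate's notAfter date with the current time. The sketch below assumes the notAfter value has already been obtained by some probe; the `daysUntilExpiry` and `classify` helpers and the 21-day and 7-day thresholds are illustrative choices, not values from this environment.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days remaining until the certificate expires (negative if expired).
function daysUntilExpiry(notAfter: Date, now: Date): number {
  return Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
}

type Severity = "ok" | "warning" | "critical";

// Thresholds are assumptions; tune them to the renewal cadence in use.
function classify(days: number, warnDays = 21, critDays = 7): Severity {
  if (days <= critDays) return "critical";
  if (days <= warnDays) return "warning";
  return "ok";
}

// Hypothetical dates for illustration.
const now = new Date("2026-06-14T00:00:00Z");
const notAfter = new Date("2026-06-24T00:00:00Z");
const severity = classify(daysUntilExpiry(notAfter, now));
```

With the hypothetical dates shown, ten days remain, which falls inside the warning window.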

Logs

  • No centralized log aggregation (Loki, Vector, Fluentd).

6. Secret management

  • hermes-k8s.yaml is gitignored and contains plaintext secrets (email, DB, OAuth, API keys).
  • Docker Compose stacks rely on exported env vars or .env files.
  • No Sealed Secrets, External Secrets Operator, or Vault in use.

7. Notable risks

  1. Single point of failure: one residential IP, one host, one edge proxy.
  2. Split edge: two ingress controllers with conflicting class configuration.
  3. Manual certificate workaround: static K8s-extracted certs in Traefik must be manually rotated before expiry.
  4. No observability: no metrics, alerting, or synthetic probes for the commercial domains.
  5. Stopped services not detected: Docker restart policies only act on containers that have been started, so a service that was never started, or was stopped manually, stays down without any alert.
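The last risk could be caught by comparing an expected service list against what is actually running. This is a minimal sketch: the `missingServices` helper and the container names are hypothetical, and the running list is assumed to come from something like the output of `docker ps --format '{{.Names}}'`.

```typescript
// Return expected services that are absent from the running list.
function missingServices(expected: string[], running: string[]): string[] {
  const up = new Set(running);
  return expected.filter((name) => !up.has(name));
}

// Hypothetical container names for illustration.
const expectedServices = ["traefik", "hermes-mcp", "fetcherpay-api", "prometheus"];
const runningServices = ["traefik", "fetcherpay-api", "prometheus"];

const missing = missingServices(expectedServices, runningServices);
```

Any non-empty result is a candidate for an alert, since a restart policy will not bring back a container that was never started.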