Compare commits
10 Commits
a326611806
...
0e255e570a
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
0e255e570a | ||
|
|
2014e03190 | ||
|
|
f084be6bc6 | ||
|
|
6604ab5d2b | ||
|
|
51315527c0 | ||
|
|
d4b2ec2902 | ||
|
|
7e32dca0d8 | ||
|
|
45cf9cafe6 | ||
|
|
723cf17869 | ||
|
|
de6d6ae9de |
@@ -1,7 +1,7 @@
|
||||
# Reviewer Account Setup — Execution Summary
|
||||
|
||||
**Date:** 2026-06-12
|
||||
**Status:** Steps 1 and 4 complete; Step 2 blocked on network access
|
||||
**Status:** All steps complete
|
||||
|
||||
---
|
||||
|
||||
@@ -34,10 +34,13 @@ Cannot reach the mail server from the agent environment:
|
||||
|
||||
**Action required by user:** Create `reviewer@squaremcp.com` mailbox on the mail server.
|
||||
|
||||
## Step 3 — Connect email via API ⏳ PENDING
|
||||
## Step 3 — Connect email via API ✅
|
||||
|
||||
Blocked until Step 2 completes. Once the mailbox exists, run:
|
||||
Ran after user created the mailbox. Also required a Redis reconnection fix (`src/redis.ts`).
|
||||
|
||||
**Result:** `{"connected":true,"platform":"email"}`
|
||||
|
||||
**Command:**
|
||||
```bash
|
||||
curl -X POST https://hermes.squaremcp.com/api/connect/email \
|
||||
-H "Content-Type: application/json" \
|
||||
@@ -52,6 +55,13 @@ curl -X POST https://hermes.squaremcp.com/api/connect/email \
|
||||
}'
|
||||
```
|
||||
|
||||
**Verified:**
|
||||
```bash
|
||||
curl -s -H "x-api-key: fdb6fb01bb7f4c50a9ab329c7287b81c" \
|
||||
"https://hermes.squaremcp.com/api/email/profile?account=sqcp_reviewer"
|
||||
# → {"email":"reviewer@squaremcp.com","name":"reviewer","account":"custom"}
|
||||
```
|
||||
|
||||
## Reviewer credentials reference
|
||||
|
||||
| Field | Value |
|
||||
|
||||
@@ -0,0 +1,120 @@
|
||||
# Handoff: Generate + Deploy Long-Lived Facebook/Instagram Token
|
||||
|
||||
**For:** Claude Cowork (browser session)
|
||||
**Goal:** Replace the expired Facebook/Instagram env token in K8s with a long-lived Page token
|
||||
**Blocker:** Claude.ai MCP Directory form #18 cannot be checked until Facebook + Instagram API calls return success
|
||||
|
||||
---
|
||||
|
||||
## Current state
|
||||
|
||||
- `https://hermes.squaremcp.com/api/facebook/page` returns:
|
||||
> Error validating access token: Session has expired on Friday, 12-Jun-26 08:00:00 PDT
|
||||
- `https://hermes.squaremcp.com/api/instagram/profile` returns the same error
|
||||
- The env vars `FACEBOOK_DEFAULT_ACCESS_TOKEN` and `INSTAGRAM_DEFAULT_ACCESS_TOKEN` are set in `hermes-k8s.yaml` but the token is dead
|
||||
- `INSTAGRAM_DEFAULT_BUSINESS_ACCOUNT_ID` is already correct: `17841422623735880`
|
||||
- `FACEBOOK_DEFAULT_PAGE_ID` is already correct: `1152192567968569`
|
||||
|
||||
---
|
||||
|
||||
## Step 1: Get a short-lived User Token
|
||||
|
||||
1. Open [Facebook Developer Console → Graph API Explorer](https://developers.facebook.com/tools/explorer/)
|
||||
2. Select app: `squaremcp`
|
||||
3. Click **Generate Access Token**
|
||||
4. Grant these permissions:
|
||||
- `pages_show_list`
|
||||
- `pages_read_engagement`
|
||||
- `pages_manage_posts`
|
||||
- `instagram_basic`
|
||||
- `instagram_content_publish`
|
||||
5. Copy the User Token (starts with `EAAY...`)
|
||||
|
||||
---
|
||||
|
||||
## Step 2: Exchange for a long-lived User Token
|
||||
|
||||
In Graph API Explorer, run:
|
||||
|
||||
```
|
||||
GET /oauth/access_token
|
||||
?grant_type=fb_exchange_token
|
||||
&client_id=<SQUAREMCP_APP_ID>
|
||||
&client_secret=<SQUAREMCP_APP_SECRET>
|
||||
&fb_exchange_token=<SHORT_LIVED_USER_TOKEN_FROM_STEP_1>
|
||||
```
|
||||
|
||||
Copy the `access_token` from the response. This is the **long-lived User Token**.
|
||||
|
||||
---
|
||||
|
||||
## Step 3: Get the long-lived Page Token + Instagram Business Account ID
|
||||
|
||||
In Graph API Explorer, run with the long-lived User Token:
|
||||
|
||||
```
|
||||
GET /me/accounts?fields=id,name,access_token,instagram_business_account{username,id}
|
||||
```
|
||||
|
||||
From the response, copy:
|
||||
- `data[0].access_token` for the page named **"Squaremcp"** → this is the new `FACEBOOK_DEFAULT_ACCESS_TOKEN` and `INSTAGRAM_DEFAULT_ACCESS_TOKEN`
|
||||
- `data[0].instagram_business_account.id` → confirm it is `17841422623735880`
|
||||
|
||||
---
|
||||
|
||||
## Step 4: Test the new token
|
||||
|
||||
Run these in a terminal and confirm they return page/profile info, not an error:
|
||||
|
||||
```bash
|
||||
TOKEN="<PASTE_LONG_LIVED_PAGE_TOKEN_HERE>"
|
||||
curl -s "https://graph.facebook.com/v22.0/1152192567968569?fields=id,name,category,about,fan_count,followers_count,link&access_token=$TOKEN" | python3 -m json.tool
|
||||
curl -s "https://graph.facebook.com/v22.0/17841422623735880?fields=username,name,followers_count,follows_count,media_count&access_token=$TOKEN" | python3 -m json.tool
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 5: Deploy to K8s
|
||||
|
||||
Paste the new token into the terminal and run:
|
||||
|
||||
```bash
|
||||
NEW_TOKEN="<PASTE_LONG_LIVED_PAGE_TOKEN_HERE>"
|
||||
ssh -p 2222 garfield@23.120.207.35 "microk8s kubectl set env deployment/hermes-mcp -n fetcherpay \
|
||||
FACEBOOK_DEFAULT_ACCESS_TOKEN='$NEW_TOKEN' \
|
||||
INSTAGRAM_DEFAULT_ACCESS_TOKEN='$NEW_TOKEN' && \
|
||||
microk8s kubectl rollout restart deployment/hermes-mcp -n fetcherpay && \
|
||||
microk8s kubectl rollout status deployment/hermes-mcp -n fetcherpay"
|
||||
```
|
||||
|
||||
If SSH is unavailable from the browser environment, give the token to kimi-cli instead.
|
||||
|
||||
---
|
||||
|
||||
## Step 6: Verify through SquareMCP API
|
||||
|
||||
```bash
|
||||
API_KEY="fdb6fb01bb7f4c50a9ab329c7287b81c"
|
||||
|
||||
echo "=== Facebook ==="
|
||||
curl -s -H "x-api-key: $API_KEY" "https://hermes.squaremcp.com/api/facebook/page" | python3 -m json.tool
|
||||
|
||||
echo "=== Instagram ==="
|
||||
curl -s -H "x-api-key: $API_KEY" "https://hermes.squaremcp.com/api/instagram/profile" | python3 -m json.tool
|
||||
```
|
||||
|
||||
Both must return actual data (not an error) before checking box #18 on the Claude.ai form.
|
||||
|
||||
---
|
||||
|
||||
## Step 7: Update hermes-k8s.yaml
|
||||
|
||||
If kimi-cli is handling deploy, also ask it to update `hermes-k8s.yaml` placeholders with the new token so the manifest stays in sync.
|
||||
|
||||
---
|
||||
|
||||
## Notes
|
||||
|
||||
- The Page token from a long-lived User token should not expire on a fixed schedule.
|
||||
- If the token expires again, the root cause is using a short-lived User Token in Step 1. Make sure to do Step 2 exchange.
|
||||
- Do not commit the token to git. It lives in `hermes-k8s.yaml` (which is `.gitignore`d) and in K8s env vars only.
|
||||
@@ -15,7 +15,7 @@ spec:
|
||||
spec:
|
||||
containers:
|
||||
- name: squaremcp-docs
|
||||
image: localhost:32000/squaremcp-docs@sha256:2e49e8ab602cd5069be89bbba538db06ce9dc2c49064472f399566a8fcc54d9c
|
||||
image: localhost:32000/squaremcp-docs@sha256:762051a6eeadc6a95d22816f3495a567a85ebe955f32189dda5c0346ae427687
|
||||
imagePullPolicy: Always
|
||||
ports:
|
||||
- containerPort: 80
|
||||
|
||||
75
docs/runbooks/2026-06-14-active-issues-and-debt.md
Normal file
75
docs/runbooks/2026-06-14-active-issues-and-debt.md
Normal file
@@ -0,0 +1,75 @@
|
||||
# Active Issues and Remaining Debt — 2026-06-14
|
||||
|
||||
## What is working now
|
||||
|
||||
All commercial domains verified reachable with valid TLS:
|
||||
|
||||
- `hermes.squaremcp.com` / `openapi-living-brief.json`
|
||||
- `app.squaremcp.com`
|
||||
- `docs.squaremcp.com`
|
||||
- `squaremcp.com` / `www.squaremcp.com`
|
||||
- `tiktok.squaremcp.com`
|
||||
- `fetcherpay.com` / `www.fetcherpay.com`
|
||||
- `workflow.fetcherpay.com`
|
||||
- `mail.fetcherpay.com`
|
||||
- `git.fetcherpay.com`
|
||||
|
||||
Hermes path-specific routes verified:
|
||||
- `POST /api/pilot-request` → `201` on `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`
|
||||
- `GET /auth/tiktok/start` → `302` on `tiktok.squaremcp.com`
|
||||
|
||||
---
|
||||
|
||||
## Still down / not addressed
|
||||
|
||||
| Subdomain / Service | Why it is down | What would fix it |
|
||||
|---|---|---|
|
||||
| `api.fetcherpay.com` | `fetcherpay-api` container not running | Start `fetcherpay-api` (needs env vars, Postgres, Redis) |
|
||||
| `prometheus.fetcherpay.com` | Prometheus container not running | Start Prometheus from `docker-compose.fetcherpay.yml` |
|
||||
| `grafana.fetcherpay.com` | Grafana container not running | Start Grafana from `docker-compose.fetcherpay.yml` |
|
||||
| `adminer.fetcherpay.com` | Adminer container not running | Start Adminer from `docker-compose.fetcherpay.yml` |
|
||||
| `traefik.fetcherpay.com` | Traefik dashboard is on `:8080` but not routed through a public host label | Add a secure router or restrict dashboard to localhost/VPN |
|
||||
|
||||
---
|
||||
|
||||
## Architectural debt
|
||||
|
||||
1. **K8s nginx-ingress is bypassed**
|
||||
- Traefik’s Docker iptables rules intercept all public HTTP/S traffic.
|
||||
- The active nginx-ingress controller class is `public`; manifests use `nginx`.
|
||||
- Long term: either reconcile `ingressClassName` or migrate the public edge to K8s.
|
||||
|
||||
2. **Manual static certificate workaround**
|
||||
- Traefik cannot issue new certs via GoDaddy DNS-01 for several domains because of `DUPLICATE_RECORD` TXT errors.
|
||||
- Certs are extracted from K8s cert-manager secrets and loaded statically.
|
||||
- These must be manually rotated before expiry.
|
||||
|
||||
3. **No observability**
|
||||
- No synthetic uptime probes.
|
||||
- No cert-expiry alerting.
|
||||
- No Hermes `/metrics` endpoint.
|
||||
- No Alertmanager / Slack alerts.
|
||||
- No centralized logs.
|
||||
|
||||
4. **Secret management**
|
||||
- Plaintext secrets in `hermes-k8s.yaml` and compose env vars.
|
||||
- No Sealed Secrets / External Secrets / Vault.
|
||||
|
||||
5. **Single point of failure**
|
||||
- One host, one residential IP, one edge proxy.
|
||||
- No redundancy or failover.
|
||||
|
||||
6. **Gitea SSH port**
|
||||
- Changed from `2222` to `22222` due to an unknown process binding `2222`.
|
||||
- The original occupant of port `2222` was never identified; a reboot would be needed to clear it.
|
||||
|
||||
---
|
||||
|
||||
## Recommended next steps
|
||||
|
||||
See `2026-06-14-public-edge-outage-plan.md` for the full phased plan. Priorities:
|
||||
|
||||
1. **Immediate:** finalize RCA, runbook, and `scripts/verify-public-endpoints.sh`.
|
||||
2. **This week:** deploy blackbox exporter + cert-expiry alerts + container-up check.
|
||||
3. **Next sprint:** add Hermes `/metrics`, Grafana dashboards, Alertmanager Slack routing.
|
||||
4. **Future:** decide on K8s edge migration vs. reconciling ingress classes.
|
||||
164
docs/runbooks/2026-06-14-infrastructure-findings.md
Normal file
164
docs/runbooks/2026-06-14-infrastructure-findings.md
Normal file
@@ -0,0 +1,164 @@
|
||||
# Infrastructure Findings — SquareMCP / FetcherPay
|
||||
|
||||
This document captures the as-built architecture, ingress behavior, monitoring state, and Hermes route table discovered during the 2026-06-14 outage response.
|
||||
|
||||
---
|
||||
|
||||
## 1. High-level architecture
|
||||
|
||||
The single production server (`104.190.60.129`) hosts two separate ingress layers:
|
||||
|
||||
| Ingress Layer | Technology | Serves |
|
||||
|---|---|---|
|
||||
| **Docker edge proxy** | Traefik v3 | `*.fetcherpay.com` Docker Compose stacks, plus static file-provider routes for `*.squaremcp.com` |
|
||||
| **Kubernetes ingress** | nginx-ingress-microk8s + cert-manager | `*.squaremcp.com` K8s workloads (currently bypassed by Traefik) |
|
||||
|
||||
Both layers use Let’s Encrypt TLS. Public ports `80`/`443` are bound by the Docker Traefik container, so its `iptables` rules win over host-network K8s services.
|
||||
|
||||
---
|
||||
|
||||
## 2. Traefik configuration
|
||||
|
||||
### Static config
|
||||
**File:** `/home/garfield/traefik.yml`
|
||||
|
||||
- Dashboard enabled on `:8080` with `insecure: true`.
|
||||
- Entrypoints: `web` (HTTP → HTTPS redirect) and `websecure` (HTTPS, `:443`).
|
||||
- Providers: Docker (socket) + file provider (`/letsencrypt/manual/tls.yml`, `watch: true`).
|
||||
- Certificate resolver: `letsencrypt` via GoDaddy DNS-01.
|
||||
|
||||
### Compose
|
||||
**File:** `/home/garfield/traefik-compose.yml`
|
||||
|
||||
- Networks: `hermes-net`, `obsidian-net`, `fetcherpay` (all external).
|
||||
- Volumes: Docker socket, static config, `letsencrypt` directory.
|
||||
|
||||
### Dynamic routing
|
||||
**File:** `/home/garfield/letsencrypt/manual/tls.yml`
|
||||
|
||||
Final state after the fix has file-provider routers for all commercial domains and path-specific rules that send `/api/pilot-request` and `/auth/tiktok` to Hermes.
|
||||
|
||||
---
|
||||
|
||||
## 3. Kubernetes ingress mismatch
|
||||
|
||||
- **Controller class:** `public`
|
||||
- **Ingress class used by manifests:** `nginx`
|
||||
|
||||
This means the active controller ignores most Ingress resources. Even if Traefik were removed, those Ingresses would not be served until the class is reconciled.
|
||||
|
||||
Affected manifests include:
|
||||
- `hermes-mcp/hermes-k8s.yaml`
|
||||
- `hermes-mcp/product/app/app-k8s.yaml`
|
||||
- `hermes-mcp/docs/docs-k8s.yaml`
|
||||
- `hermes-mcp/product/site/squaremcp-k8s-ingress.yaml`
|
||||
|
||||
---
|
||||
|
||||
## 4. Hermes MCP route table
|
||||
|
||||
**File:** `hermes-mcp/src/index.ts`
|
||||
|
||||
### Public / commercial endpoints
|
||||
|
||||
| Method | Path | Notes |
|
||||
|---|---|---|
|
||||
| `GET` | `/` | Static files from `../product` |
|
||||
| `GET` | `/openapi-living-brief.json` | Obsidian-only OpenAPI spec for ChatGPT |
|
||||
| `GET` | `/openapi.json` | Full OpenAPI spec |
|
||||
| `GET` | `/auth/tiktok/start` | Redirect to TikTok Login Kit |
|
||||
| `GET` | `/auth/tiktok/callback` | TikTok OAuth callback |
|
||||
| `POST` | `/api/pilot-request` | Public form submission; origin-gated |
|
||||
| `GET` | `/health` | Liveness/readiness probe |
|
||||
|
||||
### OAuth / MCP discovery
|
||||
|
||||
| Method | Path |
|
||||
|---|---|
|
||||
| `POST` | `/oauth/register` |
|
||||
| `GET` / `POST` | `/oauth/authorize` |
|
||||
| `POST` | `/oauth/token` |
|
||||
| `GET` | `/.well-known/oauth-authorization-server` |
|
||||
| `GET` | `/.well-known/openid-configuration` |
|
||||
| `GET` / `POST` / `DELETE` | `/mcp` |
|
||||
| `GET` | `/sse` |
|
||||
| `POST` | `/messages` |
|
||||
| `GET` | `/tools` |
|
||||
|
||||
### Capability-guarded tool API
|
||||
|
||||
All `/api/*` tool routes require auth + capability grant:
|
||||
|
||||
| Capability | Example endpoints |
|
||||
|---|---|
|
||||
| `obsidian` | `/api/obsidian/search`, `/api/obsidian/note`, `/api/obsidian/note/append`, `/api/obsidian/sync` |
|
||||
| `email` | `/api/email/profile`, `/api/email/search`, `/api/email/read`, `/api/email/send` |
|
||||
| `whatsapp` | `/api/whatsapp/send`, `/api/whatsapp/templates` |
|
||||
| `linkedin` | `/api/linkedin/profile`, `/api/linkedin/post`, `/api/linkedin/message` |
|
||||
| `telegram` | `/api/telegram/me`, `/api/telegram/message`, `/api/telegram/updates` |
|
||||
| `discord` | `/api/discord/me`, `/api/discord/guilds`, `/api/discord/message` |
|
||||
| `instagram` | `/api/instagram/profile`, `/api/instagram/media`, `/api/instagram/post` |
|
||||
| `twitter` | `/api/twitter/search`, `/api/twitter/tweets`, `/api/twitter/tweet` |
|
||||
| `facebook` | `/api/facebook/page`, `/api/facebook/posts`, `/api/facebook/post` |
|
||||
| `tiktok` | `/api/tiktok/profile`, `/api/tiktok/video`, `/api/tiktok/video/status` |
|
||||
|
||||
### Health endpoint
|
||||
|
||||
```typescript
|
||||
app.get('/health', (_req, res) => {
|
||||
res.json({
|
||||
status: 'ok',
|
||||
service: 'hermes-mcp',
|
||||
toolCount,
|
||||
transports,
|
||||
endpoints,
|
||||
});
|
||||
});
|
||||
```
|
||||
|
||||
Used by both K8s readiness and liveness probes in `hermes-k8s.yaml`.
|
||||
|
||||
---
|
||||
|
||||
## 5. Monitoring gaps
|
||||
|
||||
### Prometheus / Grafana
|
||||
|
||||
- Prometheus and Grafana containers exist in `docker-compose.fetcherpay.yml`.
|
||||
- Prometheus scrapes itself, `fetcherpay-api:3000`, and Docker metrics at `172.20.0.1:9323`.
|
||||
- **Hermes MCP is not scraped** and has no `/metrics` endpoint.
|
||||
- No Alertmanager, no alert rules.
|
||||
|
||||
### Health checks
|
||||
|
||||
- Hermes has `/health` but no `/ready` or `/livez` separation.
|
||||
- Docker health checks exist for Postgres, MySQL, Redis, Gitea, and FetcherPay API, but **not for Hermes**.
|
||||
|
||||
### Uptime / synthetic probes
|
||||
|
||||
- No blackbox exporter.
|
||||
- No external uptime monitoring (Pingdom, UptimeRobot, Grafana Cloud, etc.).
|
||||
- No cert-expiry alerting.
|
||||
- No K8s ingress reconciliation check.
|
||||
|
||||
### Logs
|
||||
|
||||
- No centralized log aggregation (Loki, Vector, Fluentd).
|
||||
|
||||
---
|
||||
|
||||
## 6. Secret management
|
||||
|
||||
- `hermes-k8s.yaml` is gitignored and contains plaintext secrets (email, DB, OAuth, API keys).
|
||||
- Docker Compose stacks rely on exported env vars or `.env` files.
|
||||
- No Sealed Secrets, External Secrets Operator, or Vault in use.
|
||||
|
||||
---
|
||||
|
||||
## 7. Notable risks
|
||||
|
||||
1. **Single point of failure:** one residential IP, one host, one edge proxy.
|
||||
2. **Split edge:** two ingress controllers with conflicting class configuration.
|
||||
3. **Manual certificate workaround:** static K8s-extracted certs in Traefik must be manually rotated before expiry.
|
||||
4. **No observability:** no metrics, alerting, or synthetic probes for the commercial domains.
|
||||
5. **Stopped services not detected:** Docker restart policies only help if containers were initially started.
|
||||
360
docs/runbooks/2026-06-14-outage-fix-log.md
Normal file
360
docs/runbooks/2026-06-14-outage-fix-log.md
Normal file
@@ -0,0 +1,360 @@
|
||||
# Outage Fix Log — 2026-06-14
|
||||
|
||||
This is the step-by-step record of what was changed to restore public access to the SquareMCP / FetcherPay commercial sites.
|
||||
|
||||
---
|
||||
|
||||
## Environment
|
||||
|
||||
- **Host:** `104.190.60.129` (MicroK8s + Docker)
|
||||
- **Edge proxy:** Traefik v3 in Docker, binds `:80`, `:443`, `:8080`
|
||||
- **Hermes MCP:** K8s pod with `hostNetwork: true` on `:3456`
|
||||
- **Key files:**
|
||||
- `/home/garfield/traefik-compose.yml`
|
||||
- `/home/garfield/traefik.yml`
|
||||
- `/home/garfield/letsencrypt/manual/tls.yml`
|
||||
- `/home/garfield/Downloads/docker-compose.prod.yml`
|
||||
|
||||
---
|
||||
|
||||
## 1. Attach Traefik to the FetcherPay network
|
||||
|
||||
**File:** `/home/garfield/traefik-compose.yml`
|
||||
|
||||
Added the `fetcherpay` external network so Traefik can reach FetcherPay Docker backends.
|
||||
|
||||
```yaml
|
||||
services:
|
||||
traefik:
|
||||
...
|
||||
networks:
|
||||
- hermes-net
|
||||
- obsidian-net
|
||||
- fetcherpay
|
||||
|
||||
networks:
|
||||
hermes-net:
|
||||
external: true
|
||||
name: hermes-mcp_hermes-net
|
||||
obsidian-net:
|
||||
external: true
|
||||
name: obsidian_obsidian-net
|
||||
fetcherpay:
|
||||
external: true
|
||||
name: fetcherpay_fetcherpay
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Rebuild the Traefik file-provider routing config
|
||||
|
||||
**File:** `/home/garfield/letsencrypt/manual/tls.yml`
|
||||
|
||||
Final config includes routers and services for:
|
||||
- `hermes.squaremcp.com`
|
||||
- `app.squaremcp.com`
|
||||
- `docs.squaremcp.com`
|
||||
- `squaremcp.com` / `www.squaremcp.com`
|
||||
- `tiktok.squaremcp.com`
|
||||
- `fetcherpay.com` / `www.fetcherpay.com`
|
||||
- `workflow.fetcherpay.com`
|
||||
- `mail.fetcherpay.com`
|
||||
- `git.fetcherpay.com`
|
||||
|
||||
Path-specific rules that route to Hermes (`104.190.60.129:3456`):
|
||||
- `/api/pilot-request` on `squaremcp.com` / `www.squaremcp.com`
|
||||
- `/auth/tiktok` and `/api/pilot-request` on `tiktok.squaremcp.com`
|
||||
|
||||
Full final config:
|
||||
|
||||
```yaml
|
||||
http:
|
||||
routers:
|
||||
hermes:
|
||||
rule: "Host(`hermes.squaremcp.com`)"
|
||||
service: hermes
|
||||
entryPoints: [websecure]
|
||||
tls: { certResolver: letsencrypt }
|
||||
|
||||
squaremcp-app:
|
||||
rule: "Host(`app.squaremcp.com`)"
|
||||
service: squaremcp-app
|
||||
entryPoints: [websecure]
|
||||
tls: {}
|
||||
|
||||
squaremcp-docs:
|
||||
rule: "Host(`docs.squaremcp.com`)"
|
||||
service: squaremcp-docs
|
||||
entryPoints: [websecure]
|
||||
tls: {}
|
||||
|
||||
squaremcp-site-main:
|
||||
rule: "Host(`squaremcp.com`) || Host(`www.squaremcp.com`)"
|
||||
service: squaremcp-site
|
||||
priority: 10
|
||||
entryPoints: [websecure]
|
||||
tls: {}
|
||||
|
||||
squaremcp-site-pilot:
|
||||
rule: "(Host(`squaremcp.com`) || Host(`www.squaremcp.com`)) && PathPrefix(`/api/pilot-request`)"
|
||||
service: hermes
|
||||
priority: 30
|
||||
entryPoints: [websecure]
|
||||
tls: {}
|
||||
|
||||
squaremcp-tiktok-main:
|
||||
rule: "Host(`tiktok.squaremcp.com`)"
|
||||
service: squaremcp-site
|
||||
priority: 10
|
||||
entryPoints: [websecure]
|
||||
tls: {}
|
||||
|
||||
squaremcp-tiktok-auth:
|
||||
rule: "Host(`tiktok.squaremcp.com`) && PathPrefix(`/auth/tiktok`)"
|
||||
service: hermes
|
||||
priority: 30
|
||||
entryPoints: [websecure]
|
||||
tls: {}
|
||||
|
||||
squaremcp-tiktok-pilot:
|
||||
rule: "Host(`tiktok.squaremcp.com`) && PathPrefix(`/api/pilot-request`)"
|
||||
service: hermes
|
||||
priority: 30
|
||||
entryPoints: [websecure]
|
||||
tls: {}
|
||||
|
||||
fetcherpay-root:
|
||||
rule: "Host(`fetcherpay.com`) || Host(`www.fetcherpay.com`)"
|
||||
service: fetcherpay-web
|
||||
priority: 60
|
||||
entryPoints: [websecure]
|
||||
tls: {}
|
||||
|
||||
workflow:
|
||||
rule: "Host(`workflow.fetcherpay.com`)"
|
||||
service: temporal-ui
|
||||
priority: 60
|
||||
entryPoints: [websecure]
|
||||
tls: {}
|
||||
|
||||
mail:
|
||||
rule: "Host(`mail.fetcherpay.com`)"
|
||||
service: poste
|
||||
priority: 60
|
||||
entryPoints: [websecure]
|
||||
tls: {}
|
||||
|
||||
git:
|
||||
rule: "Host(`git.fetcherpay.com`)"
|
||||
service: gitea
|
||||
priority: 60
|
||||
entryPoints: [websecure]
|
||||
tls: {}
|
||||
|
||||
services:
|
||||
hermes:
|
||||
loadBalancer:
|
||||
servers: [{ url: "http://104.190.60.129:3456" }]
|
||||
passHostHeader: true
|
||||
squaremcp-app:
|
||||
loadBalancer:
|
||||
servers: [{ url: "http://10.152.183.164:80" }]
|
||||
passHostHeader: true
|
||||
squaremcp-docs:
|
||||
loadBalancer:
|
||||
servers: [{ url: "http://10.152.183.130:80" }]
|
||||
passHostHeader: true
|
||||
squaremcp-site:
|
||||
loadBalancer:
|
||||
servers: [{ url: "http://10.152.183.48:80" }]
|
||||
passHostHeader: true
|
||||
fetcherpay-web:
|
||||
loadBalancer:
|
||||
servers: [{ url: "http://172.20.0.9:80" }]
|
||||
passHostHeader: true
|
||||
temporal-ui:
|
||||
loadBalancer:
|
||||
servers: [{ url: "http://172.20.0.3:8080" }]
|
||||
passHostHeader: true
|
||||
poste:
|
||||
loadBalancer:
|
||||
servers: [{ url: "http://poste:80" }]
|
||||
passHostHeader: true
|
||||
gitea:
|
||||
loadBalancer:
|
||||
servers: [{ url: "http://gitea:3000" }]
|
||||
passHostHeader: true
|
||||
|
||||
tls:
|
||||
certificates:
|
||||
- certFile: /letsencrypt/manual/certs/squaremcp-app.crt
|
||||
keyFile: /letsencrypt/manual/certs/squaremcp-app.key
|
||||
- certFile: /letsencrypt/manual/certs/squaremcp-docs.crt
|
||||
keyFile: /letsencrypt/manual/certs/squaremcp-docs.key
|
||||
- certFile: /letsencrypt/manual/certs/squaremcp-site.crt
|
||||
keyFile: /letsencrypt/manual/certs/squaremcp-site.key
|
||||
- certFile: /letsencrypt/manual/certs/fetcherpay-root.crt
|
||||
keyFile: /letsencrypt/manual/certs/fetcherpay-root.key
|
||||
- certFile: /letsencrypt/manual/certs/mail-fetcherpay.crt
|
||||
keyFile: /letsencrypt/manual/certs/mail-fetcherpay.key
|
||||
- certFile: /letsencrypt/manual/certs/git-fetcherpay.crt
|
||||
keyFile: /letsencrypt/manual/certs/git-fetcherpay.key
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. Extract static TLS certificates from K8s cert-manager secrets
|
||||
|
||||
Because Traefik’s GoDaddy DNS-01 resolver fails with `DUPLICATE_RECORD` for existing `_acme-challenge.*` TXT records, valid certificates were pulled from the K8s secrets that cert-manager already held.
|
||||
|
||||
```bash
|
||||
mkdir -p /home/garfield/letsencrypt/manual/certs
|
||||
|
||||
# squaremcp-app
|
||||
microk8s kubectl get secret squaremcp-app-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-app.crt
|
||||
microk8s kubectl get secret squaremcp-app-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-app.key
|
||||
|
||||
# squaremcp-docs
|
||||
microk8s kubectl get secret squaremcp-docs-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-docs.crt
|
||||
microk8s kubectl get secret squaremcp-docs-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-docs.key
|
||||
|
||||
# squaremcp-site (covers squaremcp.com / www.squaremcp.com / tiktok.squaremcp.com)
|
||||
microk8s kubectl get secret squaremcp-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-site.crt
|
||||
microk8s kubectl get secret squaremcp-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-site.key
|
||||
|
||||
# fetcherpay-root
|
||||
microk8s kubectl get secret fetcherpay-root-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > fetcherpay-root.crt
|
||||
microk8s kubectl get secret fetcherpay-root-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > fetcherpay-root.key
|
||||
|
||||
# mail.fetcherpay.com
|
||||
microk8s kubectl get secret mail-fetcherpay-tls -n email -o jsonpath='{.data.tls\.crt}' | base64 -d > mail-fetcherpay.crt
|
||||
microk8s kubectl get secret mail-fetcherpay-tls -n email -o jsonpath='{.data.tls\.key}' | base64 -d > mail-fetcherpay.key
|
||||
|
||||
# git.fetcherpay.com
|
||||
microk8s kubectl get secret fetcherpay-git-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > git-fetcherpay.crt
|
||||
microk8s kubectl get secret fetcherpay-git-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > git-fetcherpay.key
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Start stopped backend containers
|
||||
|
||||
### FetcherPay web
|
||||
|
||||
```bash
|
||||
docker compose -p fetcherpay -f /home/garfield/docker-compose.fetcherpay.yml up -d fetcherpay-web
|
||||
```
|
||||
|
||||
### Poste (mail)
|
||||
|
||||
```bash
|
||||
docker compose -p fetcherpay -f /home/garfield/Downloads/docker-compose.prod.yml up -d poste
|
||||
```
|
||||
|
||||
### Postgres + Gitea (git)
|
||||
|
||||
Gitea credentials were recovered from the existing Gitea config volume:
|
||||
|
||||
```bash
|
||||
docker run --rm -v fetcherpay_gitea_data:/data alpine \
|
||||
sh -c 'cat /data/gitea/conf/app.ini | grep -E "^(NAME|USER|PASSWD|HOST|DB_TYPE)"'
|
||||
# DB_TYPE = postgres
|
||||
# HOST = postgres:5432
|
||||
# NAME = gitea
|
||||
# USER = fetcherpay
|
||||
# PASSWD = fetcherpay_secure_2024
|
||||
```
|
||||
|
||||
Then postgres and gitea were started with the required env vars:
|
||||
|
||||
```bash
|
||||
cd /home/garfield/Downloads
|
||||
export POSTGRES_USER=fetcherpay
|
||||
export POSTGRES_PASSWORD=fetcherpay_secure_2024
|
||||
export POSTGRES_DB=postgres
|
||||
export GITEA_HOST=git.fetcherpay.com
|
||||
export GITEA_DB=gitea
|
||||
export MAIL_HOST=mail.fetcherpay.com
|
||||
export WEB_HOST=fetcherpay.com
|
||||
export API_HOST=api.fetcherpay.com
|
||||
export PROM_HOST=prometheus.fetcherpay.com
|
||||
export GRAFANA_HOST=grafana.fetcherpay.com
|
||||
export ADMINER_HOST=adminer.fetcherpay.com
|
||||
export TEMPORAL_HOST=workflow.fetcherpay.com
|
||||
export REDIS_PASSWORD=redis_pass
|
||||
export MYSQL_ROOT_PASSWORD=mysql_root
|
||||
export MYSQL_DATABASE=fetcherpay
|
||||
export MYSQL_USER=fetcherpay
|
||||
export MYSQL_PASSWORD=mysql_pass
|
||||
export GRAFANA_ADMIN_PASSWORD=admin
|
||||
export ADMINER_USERS=admin:admin
|
||||
export TRAEFIK_DASHBOARD_HOST=traefik.fetcherpay.com
|
||||
|
||||
docker compose -p fetcherpay -f docker-compose.prod.yml up -d postgres gitea
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Fix `workflow.fetcherpay.com`
|
||||
|
||||
The Docker label on the `temporal` service pointed Traefik at port `7233` (gRPC), causing 502s. A file-provider router was added in `tls.yml` pointing `workflow.fetcherpay.com` → `temporal-ui:8080`.
|
||||
|
||||
---
|
||||
|
||||
## 6. Fix Gitea SSH port conflict
|
||||
|
||||
The host port `2222` was already in use by an unknown process and could not be freed. The Gitea SSH mapping was changed from `2222:22` to `22222:22`.
|
||||
|
||||
**File:** `/home/garfield/Downloads/docker-compose.prod.yml`
|
||||
|
||||
```yaml
|
||||
gitea:
|
||||
...
|
||||
ports:
|
||||
- "22222:22" # SSH (optional for git over SSH)
|
||||
```
|
||||
|
||||
The `gitea` container was then recreated with the new mapping.
|
||||
|
||||
---
|
||||
|
||||
## 7. Restart Traefik after every config change
|
||||
|
||||
```bash
|
||||
docker restart traefik
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Verification results
|
||||
|
||||
Final public reachability check:
|
||||
|
||||
```
|
||||
https://hermes.squaremcp.com/openapi-living-brief.json -> 200 (cert=0)
|
||||
https://app.squaremcp.com/ -> 200 (cert=0)
|
||||
https://docs.squaremcp.com/ -> 200 (cert=0)
|
||||
https://squaremcp.com/ -> 200 (cert=0)
|
||||
https://www.squaremcp.com/ -> 200 (cert=0)
|
||||
https://tiktok.squaremcp.com/ -> 200 (cert=0)
|
||||
https://tiktok.squaremcp.com/auth/tiktok/start -> 302 (cert=0)
|
||||
https://fetcherpay.com/ -> 200 (cert=0)
|
||||
https://www.fetcherpay.com/ -> 200 (cert=0)
|
||||
https://workflow.fetcherpay.com/ -> 200 (cert=0)
|
||||
https://mail.fetcherpay.com/ -> 302 (cert=0)
|
||||
https://git.fetcherpay.com/ -> 200 (cert=0)
|
||||
|
||||
POST /api/pilot-request (tiktok) -> 201
|
||||
POST /api/pilot-request (root/www) -> 201
|
||||
GET /auth/tiktok/start -> 302
|
||||
```
|
||||
|
||||
`cert:0` means TLS verification passed.
|
||||
|
||||
---
|
||||
|
||||
## Notes / gotchas
|
||||
|
||||
- `/api/pilot-request` is `POST`-only. A `GET` request returns `404`, which is expected.
|
||||
- The `/auth/tiktok` routes are `/auth/tiktok/start` and `/auth/tiktok/callback`; the Traefik `PathPrefix(`/auth/tiktok`)` rule correctly forwards both.
|
||||
- Static certificate extraction required root access; Docker root containers were used when `sudo` began prompting for a password.
|
||||
32
docs/runbooks/2026-06-14-outage-index.md
Normal file
32
docs/runbooks/2026-06-14-outage-index.md
Normal file
@@ -0,0 +1,32 @@
|
||||
# 2026-06-14 Public Edge Outage — Vault Index
|
||||
|
||||
All documentation for the outage, its root cause, the fix, and the follow-up plan lives in this SquareMCP vault folder.
|
||||
|
||||
## Files
|
||||
|
||||
| File | Purpose |
|
||||
|---|---|
|
||||
| `2026-06-14-public-edge-outage-rca.md` | Root cause analysis and incident timeline. |
|
||||
| `2026-06-14-outage-fix-log.md` | Step-by-step record of every config change, command, and verification result. |
|
||||
| `2026-06-14-infrastructure-findings.md` | As-built architecture, Traefik/K8s behavior, Hermes route table, and monitoring gaps. |
|
||||
| `2026-06-14-active-issues-and-debt.md` | What is still down, remaining technical debt, and recommended next steps. |
|
||||
| `2026-06-14-public-edge-outage-plan.md` | Proposed runbook, monitoring, probes, and alerting plan (Phase 1–4). |
|
||||
| `2026-06-14-outage-index.md` | This file. |
|
||||
|
||||
## Quick status
|
||||
|
||||
- ✅ All listed `squaremcp.com` domains reachable with valid TLS.
|
||||
- ✅ All listed `fetcherpay.com` domains reachable with valid TLS.
|
||||
- ✅ Hermes path routes (`/api/pilot-request`, `/auth/tiktok`) verified.
|
||||
- ⚠️ K8s nginx-ingress remains bypassed by Traefik.
|
||||
- ⚠️ Several FetcherPay services still stopped (`api`, Prometheus, Grafana, Adminer).
|
||||
- ⚠️ No automated monitoring or alerting yet.
|
||||
|
||||
## Reference paths on disk
|
||||
|
||||
- Traefik compose: `/home/garfield/traefik-compose.yml`
|
||||
- Traefik static config: `/home/garfield/traefik.yml`
|
||||
- Traefik dynamic config: `/home/garfield/letsencrypt/manual/tls.yml`
|
||||
- Static certs: `/home/garfield/letsencrypt/manual/certs/`
|
||||
- FetcherPay prod compose: `/home/garfield/Downloads/docker-compose.prod.yml`
|
||||
- Hermes K8s manifest: `/home/garfield/hermes-mcp/hermes-k8s.yaml`
|
||||
129
docs/runbooks/2026-06-14-public-edge-outage-plan.md
Normal file
129
docs/runbooks/2026-06-14-public-edge-outage-plan.md
Normal file
@@ -0,0 +1,129 @@
|
||||
# Plan: Document the outage, build a deployment runbook, and add diagnostics/monitoring
|
||||
|
||||
## Goal
|
||||
Turn the June 2026 public-edge outage into repeatable, observable infrastructure, with all artifacts stored in the SquareMCP repository (`/home/garfield/hermes-mcp/`).
|
||||
1. Write a clear post-incident / RCA document.
|
||||
2. Create a step-by-step deployment runbook that the next operator can follow without guessing.
|
||||
3. Add probes, metrics, and alerting so the same class of failure is detected and escalated before users notice.
|
||||
|
||||
---
|
||||
|
||||
## Root cause (condensed)
|
||||
- **Public ports 80/443/8080 are owned by a Docker Traefik container.** Its iptables rules intercept all inbound traffic before the host-network K8s nginx-ingress can serve it.
|
||||
- **Traefik had no routers or valid TLS certificates** for the commercial `squaremcp.com` / `fetcherpay.com` domains, so it returned `404 page not found` with a self-signed cert.
|
||||
- **K8s cert-manager held valid certs**, but the active nginx-ingress controller uses `ingressClass=public` while the Ingress resources use `ingressClassName=nginx`, so K8s never reconciled them and could not serve traffic anyway.
|
||||
- **Several Docker backends were stopped**: `fetcherpay-web`, `poste`, `postgres`, `gitea`. The `temporal-ui` container was running but Traefik was pointed at its gRPC port (`7233`) instead of its HTTP UI port (`8080`).
|
||||
|
||||
---
|
||||
|
||||
## Deliverable 1: Post-incident / RCA document
|
||||
**Location:** `hermes-mcp/docs/runbooks/2026-06-14-public-edge-outage-rca.md`
|
||||
|
||||
Sections:
|
||||
- **Summary** — what was down, for how long, user impact.
|
||||
- **Timeline** — detection, mitigation, full restoration.
|
||||
- **Root cause** — Traefik/Docker edge + missing routes/certs + K8s ingress class mismatch + stopped containers.
|
||||
- **Why detection failed** — no synthetic uptime checks, no cert-expiry alerting, no Traefik routing alert, Docker restart did not catch stopped non-Hermes services.
|
||||
- **Remediation actions taken** — static cert extraction, file-provider routers, network attachment, container restarts, port conflict resolution.
|
||||
- **Follow-up work** — this plan’s runbook and monitoring deliverables.
|
||||
|
||||
---
|
||||
|
||||
## Deliverable 2: Deployment runbook
|
||||
**Location:** `hermes-mcp/docs/runbooks/deployment.md`
|
||||
|
||||
The runbook will cover:
|
||||
1. **Pre-flight checks**
|
||||
- Confirm Traefik is attached to required networks (`hermes-net`, `obsidian-net`, `fetcherpay`).
|
||||
- Confirm all expected Docker networks exist.
|
||||
- Confirm static cert directory (`/home/garfield/letsencrypt/manual/certs/`) contains current certs for all file-provider domains.
|
||||
2. **Deploy / update the edge proxy**
|
||||
- Rebuild / restart Traefik from `traefik-compose.yml`.
|
||||
- Validate `tls.yml` routers, services, and certificate entries.
|
||||
- Smoke-test every public host immediately after restart.
|
||||
3. **Deploy Hermes / SquareMCP (K8s path)**
|
||||
- Build, push, update digest in `hermes-k8s.yaml`.
|
||||
- Apply manifests and wait for rollout.
|
||||
- Verify `/health`, `/openapi-living-brief.json`, OAuth endpoints, `/api/pilot-request`.
|
||||
4. **Deploy FetcherPay stack (Docker path)**
|
||||
- Export required env vars (or ensure `.env` is present).
|
||||
- `docker compose -p fetcherpay up -d` for web, api, mail, git, workflow.
|
||||
- Verify `fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`, `workflow.fetcherpay.com`.
|
||||
5. **Certificate renewal / rotation**
|
||||
- When Traefik ACME works vs. when to fall back to K8s cert-manager secret extraction.
|
||||
- Step-by-step secret extraction command template.
|
||||
6. **Rollback checklist**
|
||||
- Revert image digest / compose change, restart, verify.
|
||||
7. **Verification script**
|
||||
- A single `hermes-mcp/scripts/verify-public-endpoints.sh` that curls every critical URL and exits non-zero on failure.
|
||||
|
||||
---
|
||||
|
||||
## Deliverable 3: Diagnostics, metrics, and probes
|
||||
Two viable approaches. The recommended one keeps the current architecture and hardens it; the alternative migrates the edge to K8s.
|
||||
|
||||
### Option A — Harden the existing Traefik edge (recommended)
|
||||
**Why:** Lowest risk, fastest to implement, directly protects against the exact failure modes we just saw.
|
||||
|
||||
Implementation pieces:
|
||||
1. **Synthetic uptime probes (blackbox exporter)**
|
||||
- Add `prom/blackbox-exporter` config inside the repo (e.g. `hermes-mcp/monitoring/blackbox.yml`).
|
||||
- Probe all public URLs every 60s: HTTPS, TLS cert validity, expected HTTP status.
|
||||
- Domains: `hermes.squaremcp.com/openapi-living-brief.json`, `app.squaremcp.com`, `docs.squaremcp.com`, `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`, `fetcherpay.com`, `www.fetcherpay.com`, `workflow.fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`.
|
||||
- Path-specific probes: `POST /api/pilot-request`, `GET /auth/tiktok/start`.
|
||||
2. **Certificate expiry alerting**
|
||||
- Blackbox `probe_ssl_earliest_cert_expiry` alert when any cert has < 7 days left.
|
||||
- Separate alert for Traefik default / self-signed cert (would fire immediately on a routing miss).
|
||||
3. **Traefik routing health**
|
||||
- Enable Traefik metrics endpoint (`--metrics.prometheus`).
|
||||
- Alert on `traefik_router_server_errors` or `traefik_service_server_up == 0`.
|
||||
4. **Container health & restart policy**
|
||||
- Ensure every commercial service has `restart: unless-stopped` and a Docker `healthcheck`.
|
||||
- Add a simple systemd user timer or cron that runs `docker compose -p fetcherpay ps` and alerts if any expected container is not `Up`.
|
||||
5. **K8s ingress reconciliation check**
|
||||
- A probe/script (`hermes-mcp/scripts/check-k8s-ingress.sh`) that confirms all `squaremcp.com` Ingresses have a matching `ADDRESS` and valid TLS secret.
|
||||
- Alert if `kubectl get ingress -A` shows missing addresses or cert-manager `CertificateReady=False`.
|
||||
6. **Hermes application metrics**
|
||||
- Add a `/metrics` endpoint using `prom-client` in `src/index.ts`.
|
||||
- Instrument request latency, error rate, active OAuth sessions, tool call counts.
|
||||
- Scrape it from Prometheus.
|
||||
7. **Separate readiness probe**
|
||||
- Keep `/health` for liveness; add `/ready` that checks DB/Redis connectivity before reporting ready.
|
||||
8. **Alertmanager + Slack / email**
|
||||
- Deploy `prom/alertmanager` alongside Prometheus.
|
||||
- Route critical alerts (site down, cert expiring, service unhealthy) to a Slack webhook and/or email.
|
||||
9. **Verification script**
|
||||
- `hermes-mcp/scripts/verify-public-endpoints.sh` used in runbook and optionally in CI.
|
||||
|
||||
### Option B — Migrate public edge to K8s nginx-ingress
|
||||
**Why:** Eliminates the split-ingress complexity that caused the routing confusion.
|
||||
|
||||
Implementation pieces:
|
||||
1. Reconcile `ingressClassName: nginx` → `public` (or change the controller to `nginx`).
|
||||
2. Reconfigure Traefik to not bind public 80/443, or move it to an internal Docker-only role.
|
||||
3. Point public DNS/router directly at the K8s nginx-ingress controller (host-network or NodePort).
|
||||
4. Re-issue all certs via cert-manager and remove the static-cert workaround.
|
||||
5. Still add blackbox exporter / Alertmanager / Hermes metrics as in Option A.
|
||||
|
||||
**Trade-off:** Larger architectural change, risk of another outage during migration, but cleaner long term.
|
||||
|
||||
---
|
||||
|
||||
## Suggested file changes (all under `hermes-mcp/`)
|
||||
- **New:** `docs/runbooks/2026-06-14-public-edge-outage-rca.md`
|
||||
- **New / rewrite:** `docs/runbooks/deployment.md`
|
||||
- **New:** `docs/runbooks/monitoring-playbook.md` (alert runbook)
|
||||
- **New:** `scripts/verify-public-endpoints.sh`
|
||||
- **New:** `scripts/check-k8s-ingress.sh`
|
||||
- **Modify:** `src/index.ts` — add `/metrics`, `/ready`, enhance `/health`
|
||||
- **Modify:** `hermes-k8s.yaml` — add startup probe, resource requests/limits
|
||||
- **New:** `monitoring/blackbox.yml`, `monitoring/prometheus.yml`, `monitoring/alert-rules.yml`, `monitoring/alertmanager.yml`
|
||||
- **Modify:** root `docker-compose.fetcherpay.yml` or create `monitoring/docker-compose.monitoring.yml` if the user prefers not to touch the prod compose file.
|
||||
|
||||
---
|
||||
|
||||
## Phasing recommendation
|
||||
- **Phase 1 (immediate):** RCA doc + runbook + `scripts/verify-public-endpoints.sh`.
|
||||
- **Phase 2 (this week):** blackbox exporter + cert-expiry alerts + container-up check.
|
||||
- **Phase 3 (next sprint):** Hermes `/metrics` + dashboards + Alertmanager Slack routing.
|
||||
- **Phase 4 (future):** decide on Option B edge migration after Phase 1–3 are stable.
|
||||
88
docs/runbooks/2026-06-14-public-edge-outage-rca.md
Normal file
88
docs/runbooks/2026-06-14-public-edge-outage-rca.md
Normal file
@@ -0,0 +1,88 @@
|
||||
# Public Edge Outage — Root Cause Analysis
|
||||
|
||||
**Date:** 2026-06-14
|
||||
**Severity:** High — all public `squaremcp.com` and `fetcherpay.com` properties unreachable or certificate-invalid.
|
||||
**Status:** Resolved. All listed commercial domains reachable with valid TLS.
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
On 2026-06-14, every public-facing SquareMCP / FetcherPay domain was either returning `404 page not found` or serving an invalid/default TLS certificate. The root cause was a **misconfigured public edge proxy combined with stopped backends and a K8s ingress class mismatch**. Traffic from the internet never reached the Kubernetes nginx-ingress controller that held valid certificates; instead it was intercepted by a Docker Traefik container that had no routes and no valid certificates for the affected domains.
|
||||
|
||||
---
|
||||
|
||||
## Timeline (all times UTC-4)
|
||||
|
||||
- **~09:30** — User reports that commercial sites are not reachable.
|
||||
- **09:30–10:00** — Diagnosis: Traefik container owns public `:80`/`:`443`, has default cert, no routers for `*.squaremcp.com` / `*.fetcherpay.com`.
|
||||
- **10:00–10:30** — Added file-provider routers and static K8s-extracted certificates for `squaremcp.com`, `www.squaremcp.com`, `app.squaremcp.com`, `docs.squaremcp.com`, `tiktok.squaremcp.com`.
|
||||
- **10:30–11:00** — Fixed `fetcherpay.com` / `www.fetcherpay.com` by attaching Traefik to the `fetcherpay` Docker network and starting the stopped `fetcherpay-web` container.
|
||||
- **11:00–11:30** — Fixed `workflow.fetcherpay.com` (Traefik was routing to gRPC port `7233` instead of HTTP UI port `8080`).
|
||||
- **11:30–12:00** — Fixed `mail.fetcherpay.com` by starting `poste`, extracting the K8s cert, and adding a Traefik router/service.
|
||||
- **12:00–13:30** — Fixed `git.fetcherpay.com` by starting `postgres` and `gitea`, extracting the K8s cert, adding a router/service, and resolving a host port `2222` conflict by remapping Gitea SSH to `22222`.
|
||||
- **13:30–14:00** — Final verification of all domains and Hermes path-specific routes.
|
||||
|
||||
---
|
||||
|
||||
## Root cause
|
||||
|
||||
### 1. Docker Traefik intercepts all public ingress
|
||||
- The Traefik v3 container binds host ports `80`, `443`, and `8080`.
|
||||
- Docker publishes these ports via `docker-proxy`, which inserts `iptables` DNAT rules.
|
||||
- Those rules intercept all inbound public HTTP/S traffic **before** it can reach the host-network MicroK8s nginx-ingress controller.
|
||||
|
||||
### 2. Traefik had no routes or valid TLS for the commercial domains
|
||||
- Traefik’s dynamic config comes from Docker labels and a file provider (`/home/garfield/letsencrypt/manual/tls.yml`).
|
||||
- At the start of the incident the file provider only had a partial/incomplete set of routers.
|
||||
- There were no valid Let’s Encrypt certificates for most domains because GoDaddy DNS-01 returns `DUPLICATE_RECORD` for `_acme-challenge.*` TXT records, blocking issuance.
|
||||
- Result: any request for an unmatched host fell through to Traefik’s default self-signed certificate and returned `404 page not found`.
|
||||
|
||||
### 3. K8s nginx-ingress was unreachable even though it had valid certs
|
||||
- Cert-manager inside MicroK8s held valid TLS secrets for the affected domains.
|
||||
- The active nginx-ingress-microk8s controller is configured for `ingressClass=public`.
|
||||
- Most Ingress resources specify `ingressClassName: nginx`.
|
||||
- Because of the class mismatch, those Ingresses were never reconciled by the active controller, so K8s could not serve traffic even if Traefik had forwarded it.
|
||||
|
||||
### 4. Several Docker backends were stopped
|
||||
- `fetcherpay-web` — stopped.
|
||||
- `poste` (mail) — stopped.
|
||||
- `postgres` and `gitea` (git) — stopped.
|
||||
- `temporal-ui` was running, but the Traefik Docker label pointed at the gRPC port `7233` instead of the HTTP UI port `8080`, causing 502s for `workflow.fetcherpay.com`.
|
||||
|
||||
---
|
||||
|
||||
## Why detection failed
|
||||
|
||||
- No synthetic uptime probes were running against the public endpoints.
|
||||
- No certificate-expiry or certificate-default alerting.
|
||||
- No Traefik routing-health alert.
|
||||
- Docker `restart: unless-stopped` only helps if the container was started; there was no watchdog for expected-but-stopped services.
|
||||
- K8s ingress reconciliation was not monitored, so the class mismatch went unnoticed.
|
||||
|
||||
---
|
||||
|
||||
## Remediation actions taken
|
||||
|
||||
1. **Rebuilt the Traefik file-provider config** (`/home/garfield/letsencrypt/manual/tls.yml`) with explicit routers and services for every commercial domain.
|
||||
2. **Attached Traefik to the `fetcherpay` Docker network** in `/home/garfield/traefik-compose.yml` so it could reach FetcherPay backends.
|
||||
3. **Extracted valid K8s cert-manager secrets** and loaded them as static TLS certificates in Traefik to bypass the GoDaddy duplicate-TXT issue.
|
||||
4. **Started stopped backend containers**: `fetcherpay-web`, `poste`, `postgres`, `gitea`.
|
||||
5. **Fixed `workflow.fetcherpay.com`** by routing to `temporal-ui:8080` instead of `7233`.
|
||||
6. **Fixed `git.fetcherpay.com`** SSH port conflict by changing the host mapping from `2222:22` to `22222:22` in `/home/garfield/Downloads/docker-compose.prod.yml`.
|
||||
7. **Verified** all public endpoints return expected HTTP codes with TLS certificates that validate.
|
||||
|
||||
---
|
||||
|
||||
## Remaining technical debt
|
||||
|
||||
- K8s nginx-ingress is still effectively bypassed for public traffic. Long-term the ingress classes should be reconciled or the public edge should be migrated to a single controller.
|
||||
- Several `fetcherpay.com` subdomains that depend on stopped services remain down: `api.fetcherpay.com`, `prometheus.fetcherpay.com`, `grafana.fetcherpay.com`, `adminer.fetcherpay.com`, `traefik.fetcherpay.com`.
|
||||
- Secrets are still stored plaintext in manifests and compose files.
|
||||
- No centralized logging, metrics, or alerting exists for Hermes or the edge proxy.
|
||||
|
||||
---
|
||||
|
||||
## Follow-up work
|
||||
|
||||
See `2026-06-14-public-edge-outage-plan.md` for the full runbook / monitoring / probing plan.
|
||||
@@ -22,7 +22,7 @@ spec:
|
||||
fsGroup: 1000
|
||||
containers:
|
||||
- name: hermes-mcp
|
||||
image: localhost:32000/hermes-mcp@sha256:f7895aad093acb740dde7f1acbb97644ac33b825c68b8119c294d2ed6d675158
|
||||
image: localhost:32000/hermes-mcp@sha256:c12b7fcfa46eac5cbb5a2ccbb8e9ea8062f52494ddf2700cb5d0bcdd51744e4b
|
||||
imagePullPolicy: Always
|
||||
securityContext:
|
||||
allowPrivilegeEscalation: false
|
||||
@@ -158,11 +158,11 @@ spec:
|
||||
- name: PILOT_CUSTOMER_ID
|
||||
value: "9a3f1a23-3080-4f9f-932c-02dae813ee96"
|
||||
- name: FACEBOOK_DEFAULT_ACCESS_TOKEN
|
||||
value: "EAAYG3FLDWzMBRgOmCM5GX7E3L6zk5utoZCn9eZAVvk0Ein6NaYtDZCtD5aMP3yMDnB0X2EoqvIYeOU77PhCCNaCve9LwX8iyQ2UsxsCajeHc7SXQL4EYWB7UEsDbcRA2gRF8GITYgbhBKKRlE3ehlwWBySwfxVexzMDgkGgz3ctzK4144hgJnE3LZB8EHP2FvolqNpXPVitexunWN0hxRwVXUSDgZCiOfzXfa1t0smxDs5wZDZD"
|
||||
value: "EAAYG3FLDWzMBRmZBDhn1rePtuKDCLUkzHLyJHNJA7yXXdcNUPXmyZA36BwLp7vXHhOxguCIGZB3JfJIhgX2ZBRZBTmZCDfdAYeZBrFAye2L5cIUKvYdjYYA3mlT3ZAacEQgmbhYuKBp4eCOQz0rrNUwLZB2qspvO9wczZAM3tWqFctYBP10oGfgOJIQ8ITweRU2Bgdte2hod66"
|
||||
- name: FACEBOOK_DEFAULT_PAGE_ID
|
||||
value: "1152192567968569"
|
||||
- name: INSTAGRAM_DEFAULT_ACCESS_TOKEN
|
||||
value: "EAAYG3FLDWzMBRgOmCM5GX7E3L6zk5utoZCn9eZAVvk0Ein6NaYtDZCtD5aMP3yMDnB0X2EoqvIYeOU77PhCCNaCve9LwX8iyQ2UsxsCajeHc7SXQL4EYWB7UEsDbcRA2gRF8GITYgbhBKKRlE3ehlwWBySwfxVexzMDgkGgz3ctzK4144hgJnE3LZB8EHP2FvolqNpXPVitexunWN0hxRwVXUSDgZCiOfzXfa1t0smxDs5wZDZD"
|
||||
value: "EAAYG3FLDWzMBRmZBDhn1rePtuKDCLUkzHLyJHNJA7yXXdcNUPXmyZA36BwLp7vXHhOxguCIGZB3JfJIhgX2ZBRZBTmZCDfdAYeZBrFAye2L5cIUKvYdjYYA3mlT3ZAacEQgmbhYuKBp4eCOQz0rrNUwLZB2qspvO9wczZAM3tWqFctYBP10oGfgOJIQ8ITweRU2Bgdte2hod66"
|
||||
- name: INSTAGRAM_DEFAULT_BUSINESS_ACCOUNT_ID
|
||||
value: "17841422623735880"
|
||||
- name: WHATSAPP_DEFAULT_ACCESS_TOKEN
|
||||
|
||||
@@ -15,7 +15,7 @@ spec:
|
||||
spec:
|
||||
containers:
|
||||
- name: squaremcp-app
|
||||
image: localhost:32000/squaremcp-app@sha256:c2bc1ee1bd6eed3981c6cf4c253d61cc1022373720f65debaea03dd8b53ed494
|
||||
image: localhost:32000/squaremcp-app@sha256:9c2601dd74bfca9f22350a38dc616eb8a76580090587803911bb2e5633ace361
|
||||
imagePullPolicy: Always
|
||||
ports:
|
||||
- containerPort: 8080
|
||||
|
||||
@@ -252,9 +252,71 @@ logoutBtn.addEventListener('click', async () => {
|
||||
showLogin();
|
||||
});
|
||||
|
||||
// Connect MCP Client — start the browser OAuth flow
|
||||
// Connect MCP Client — show picker for Claude.ai / ChatGPT / desktop / CLI
|
||||
document.getElementById('connect-mcp-btn')?.addEventListener('click', () => {
|
||||
window.open(`${API_BASE}/oauth/connect-mcp`, '_blank', 'width=560,height=600,noopener');
|
||||
openModal(renderMcpClientPicker());
|
||||
});
|
||||
|
||||
function renderMcpClientPicker() {
|
||||
return `
|
||||
<div class="mcp-picker">
|
||||
<h3>Connect an AI client</h3>
|
||||
<p class="picker-subtitle">Choose where you want to use SquareMCP tools.</p>
|
||||
|
||||
<div class="picker-option">
|
||||
<div class="picker-meta">
|
||||
<div class="picker-title">Claude.ai (web)</div>
|
||||
<div class="picker-desc">Use SquareMCP directly in your browser at claude.ai.</div>
|
||||
</div>
|
||||
<a class="btn btn-primary" href="${API_BASE}/oauth/connect-claude-ai" target="_blank" rel="noopener" onclick="window.closeMcpPicker && window.closeMcpPicker()">Connect</a>
|
||||
</div>
|
||||
|
||||
<div class="picker-option">
|
||||
<div class="picker-meta">
|
||||
<div class="picker-title">Claude Desktop</div>
|
||||
<div class="picker-desc">macOS / Windows app with local MCP config.</div>
|
||||
</div>
|
||||
<button class="btn btn-primary" data-connect="claude-desktop">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="picker-option">
|
||||
<div class="picker-meta">
|
||||
<div class="picker-title">Codex CLI / OpenCode</div>
|
||||
<div class="picker-desc">Terminal-based agents (OpenAI, opencode).</div>
|
||||
</div>
|
||||
<button class="btn btn-primary" data-connect="codex">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="picker-option">
|
||||
<div class="picker-meta">
|
||||
<div class="picker-title">ChatGPT (web)</div>
|
||||
<div class="picker-desc">Copy the OpenAPI spec URL for GPT Actions.</div>
|
||||
</div>
|
||||
<button class="btn btn-secondary" data-connect="chatgpt">Get URL</button>
|
||||
</div>
|
||||
</div>
|
||||
`;
|
||||
}
|
||||
|
||||
window.closeMcpPicker = closeModal;
|
||||
|
||||
modalBody.addEventListener('click', (e) => {
|
||||
const btn = e.target.closest('[data-connect]');
|
||||
if (!btn) return;
|
||||
const type = btn.dataset.connect;
|
||||
if (type === 'claude-desktop' || type === 'codex') {
|
||||
window.open(`${API_BASE}/oauth/connect-mcp`, '_blank', 'width=560,height=600,noopener');
|
||||
} else if (type === 'chatgpt') {
|
||||
openModal(`
|
||||
<div class="mcp-picker">
|
||||
<h3>ChatGPT / GPT Actions</h3>
|
||||
<p class="picker-subtitle">ChatGPT browser does not yet support native MCP. Use this OpenAPI spec URL in a GPT Action:</p>
|
||||
<div class="token-box" style="margin:16px 0;">${API_BASE}/openapi.json</div>
|
||||
<p class="picker-subtitle">Set authentication to <strong>Bearer token</strong> and paste your API key from <em>Settings → API Keys</em>.</p>
|
||||
<button class="btn btn-primary" onclick="window.open('${API_BASE}/openapi.json','_blank')">Open spec</button>
|
||||
</div>
|
||||
`);
|
||||
}
|
||||
});
|
||||
|
||||
// Password reset request
|
||||
|
||||
@@ -17,7 +17,7 @@
|
||||
<p>Enter your email to receive a reset link</p>
|
||||
</div>
|
||||
<form id="reset-request-form" class="auth-form">
|
||||
<input type="email" name="email" placeholder="Email" required>
|
||||
<input type="email" name="email" placeholder="Email" aria-label="Email address" required>
|
||||
<button type="submit" class="btn btn-primary">Send Reset Link</button>
|
||||
<p class="error-msg" id="reset-request-error"></p>
|
||||
<p class="success-msg" id="reset-request-success"></p>
|
||||
@@ -35,7 +35,7 @@
|
||||
<p>Enter your new password below</p>
|
||||
</div>
|
||||
<form id="reset-confirm-form" class="auth-form">
|
||||
<input type="password" name="password" placeholder="New password (min 8 chars)" required minlength="8">
|
||||
<input type="password" name="password" placeholder="New password (min 8 chars)" aria-label="New password" required minlength="8">
|
||||
<button type="submit" class="btn btn-primary">Update Password</button>
|
||||
<p class="error-msg" id="reset-confirm-error"></p>
|
||||
<p class="success-msg" id="reset-confirm-success"></p>
|
||||
@@ -56,14 +56,14 @@
|
||||
<button class="tab-btn" data-tab="signup">Create Account</button>
|
||||
</div>
|
||||
<form id="login-form" class="auth-form">
|
||||
<input type="email" name="email" placeholder="Email" required>
|
||||
<input type="password" name="password" placeholder="Password" required minlength="8">
|
||||
<input type="email" name="email" placeholder="Email" aria-label="Email address" required>
|
||||
<input type="password" name="password" placeholder="Password" aria-label="Password" required minlength="8">
|
||||
<button type="submit" class="btn btn-primary">Sign In</button>
|
||||
<p class="error-msg" id="login-error"></p>
|
||||
</form>
|
||||
<form id="signup-form" class="auth-form hidden">
|
||||
<input type="email" name="email" placeholder="Email" required>
|
||||
<input type="password" name="password" placeholder="Password (min 8 chars)" required minlength="8">
|
||||
<input type="email" name="email" placeholder="Email" aria-label="Email address" required>
|
||||
<input type="password" name="password" placeholder="Password (min 8 chars)" aria-label="Password" required minlength="8">
|
||||
<button type="submit" class="btn btn-primary">Create Account</button>
|
||||
<p class="error-msg" id="signup-error"></p>
|
||||
</form>
|
||||
@@ -94,7 +94,7 @@
|
||||
<section class="welcome">
|
||||
<h2>Connect your accounts</h2>
|
||||
<p>Connect once. Then ask Claude or ChatGPT to post, search your notes, or send email — without touching any of these apps.</p>
|
||||
<button id="connect-mcp-btn" class="btn btn-primary" style="margin-top:16px;">Connect to Claude / ChatGPT</button>
|
||||
<button id="connect-mcp-btn" class="btn btn-primary" style="margin-top:16px;" aria-label="Connect Claude.ai, ChatGPT, Claude Desktop, or Codex CLI" title="Connect Claude.ai, ChatGPT, Claude Desktop, or Codex CLI">Connect AI Client</button>
|
||||
</section>
|
||||
|
||||
<section class="usage-bar" id="usage-bar">
|
||||
@@ -118,7 +118,7 @@
|
||||
<p class="platform-desc">Search and edit your notes vault</p>
|
||||
<span class="status-badge disconnected" id="status-obsidian">Not connected</span>
|
||||
</div>
|
||||
<button class="btn btn-connect" data-platform="obsidian">Connect</button>
|
||||
<button class="btn btn-connect" data-platform="obsidian" aria-label="Connect Obsidian" title="Connect Obsidian">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="platform-card v1-platform" data-platform="email">
|
||||
@@ -128,7 +128,7 @@
|
||||
<p class="platform-desc">Gmail, Yahoo, and IMAP accounts</p>
|
||||
<span class="status-badge disconnected" id="status-email">Not connected</span>
|
||||
</div>
|
||||
<button class="btn btn-connect" data-platform="email">Connect</button>
|
||||
<button class="btn btn-connect" data-platform="email" aria-label="Connect Email" title="Connect Email">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="platform-card v1-platform" data-platform="facebook">
|
||||
@@ -138,7 +138,7 @@
|
||||
<p class="platform-desc">Post to pages and manage content</p>
|
||||
<span class="status-badge disconnected" id="status-facebook">Not connected</span>
|
||||
</div>
|
||||
<button class="btn btn-connect" data-platform="facebook">Connect</button>
|
||||
<button class="btn btn-connect" data-platform="facebook" aria-label="Connect Facebook" title="Connect Facebook">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="platform-card v1-platform" data-platform="instagram">
|
||||
@@ -148,7 +148,7 @@
|
||||
<p class="platform-desc">Publish reels and images</p>
|
||||
<span class="status-badge disconnected" id="status-instagram">Not connected</span>
|
||||
</div>
|
||||
<button class="btn btn-connect" data-platform="instagram">Connect</button>
|
||||
<button class="btn btn-connect" data-platform="instagram" aria-label="Connect Instagram" title="Connect Instagram">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="platform-divider">
|
||||
@@ -163,7 +163,7 @@
|
||||
<p class="platform-desc">Share posts, images, and videos</p>
|
||||
<span class="status-badge disconnected" id="status-linkedin">Not connected</span>
|
||||
</div>
|
||||
<button class="btn btn-connect" data-platform="linkedin">Connect</button>
|
||||
<button class="btn btn-connect" data-platform="linkedin" aria-label="Connect LinkedIn" title="Connect LinkedIn">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="platform-card" data-platform="twitter">
|
||||
@@ -173,7 +173,7 @@
|
||||
<p class="platform-desc">Tweet with media support</p>
|
||||
<span class="status-badge disconnected" id="status-twitter">Not connected</span>
|
||||
</div>
|
||||
<button class="btn btn-connect" data-platform="twitter">Connect</button>
|
||||
<button class="btn btn-connect" data-platform="twitter" aria-label="Connect Twitter / X" title="Connect Twitter / X">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="platform-card" data-platform="whatsapp">
|
||||
@@ -183,7 +183,7 @@
|
||||
<p class="platform-desc">Business messaging</p>
|
||||
<span class="status-badge disconnected" id="status-whatsapp">Not connected</span>
|
||||
</div>
|
||||
<button class="btn btn-connect" data-platform="whatsapp">Connect</button>
|
||||
<button class="btn btn-connect" data-platform="whatsapp" aria-label="Connect WhatsApp" title="Connect WhatsApp">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="platform-card" data-platform="telegram">
|
||||
@@ -193,7 +193,7 @@
|
||||
<p class="platform-desc">Send messages via bot</p>
|
||||
<span class="status-badge disconnected" id="status-telegram">Not connected</span>
|
||||
</div>
|
||||
<button class="btn btn-connect" data-platform="telegram">Connect</button>
|
||||
<button class="btn btn-connect" data-platform="telegram" aria-label="Connect Telegram" title="Connect Telegram">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="platform-card" data-platform="discord">
|
||||
@@ -203,7 +203,7 @@
|
||||
<p class="platform-desc">Send messages to channels</p>
|
||||
<span class="status-badge disconnected" id="status-discord">Not connected</span>
|
||||
</div>
|
||||
<button class="btn btn-connect" data-platform="discord">Connect</button>
|
||||
<button class="btn btn-connect" data-platform="discord" aria-label="Connect Discord" title="Connect Discord">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="platform-card" data-platform="slack">
|
||||
@@ -213,7 +213,7 @@
|
||||
<p class="platform-desc">Send messages to channels</p>
|
||||
<span class="status-badge disconnected" id="status-slack">Not connected</span>
|
||||
</div>
|
||||
<button class="btn btn-connect" data-platform="slack">Connect</button>
|
||||
<button class="btn btn-connect" data-platform="slack" aria-label="Connect Slack" title="Connect Slack">Connect</button>
|
||||
</div>
|
||||
|
||||
<div class="platform-card" data-platform="tiktok">
|
||||
@@ -223,7 +223,7 @@
|
||||
<p class="platform-desc">Publish videos and view analytics</p>
|
||||
<span class="status-badge disconnected" id="status-tiktok">Not connected</span>
|
||||
</div>
|
||||
<button class="btn btn-connect" data-platform="tiktok">Connect</button>
|
||||
<button class="btn btn-connect" data-platform="tiktok" aria-label="Connect TikTok" title="Connect TikTok">Connect</button>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
@@ -261,10 +261,10 @@
|
||||
<div class="webhook-card" id="webhook-card">
|
||||
<div id="webhook-status-row" class="webhook-status-row">
|
||||
<span id="webhook-url-display" class="webhook-url-display">No webhook configured</span>
|
||||
<button id="webhook-delete-btn" class="btn btn-ghost hidden">Remove</button>
|
||||
<button id="webhook-delete-btn" class="btn btn-ghost hidden" aria-label="Remove webhook URL" title="Remove webhook URL">Remove</button>
|
||||
</div>
|
||||
<form id="webhook-form" class="webhook-form">
|
||||
<input type="url" id="webhook-url-input" placeholder="https://your-server.com/webhook" required>
|
||||
<input type="url" id="webhook-url-input" placeholder="https://your-server.com/webhook" aria-label="Webhook URL" required>
|
||||
<button type="submit" class="btn btn-primary">Save & generate secret</button>
|
||||
</form>
|
||||
<div id="webhook-secret-box" class="webhook-secret-box hidden">
|
||||
@@ -284,7 +284,7 @@
|
||||
<div id="connect-modal" class="modal hidden">
|
||||
<div class="modal-backdrop"></div>
|
||||
<div class="modal-content">
|
||||
<button class="modal-close">×</button>
|
||||
<button class="modal-close" aria-label="Close" title="Close">×</button>
|
||||
<div id="modal-body"></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
@@ -751,3 +751,66 @@ body {
|
||||
text-align: center;
|
||||
min-height: 18px;
|
||||
}
|
||||
|
||||
|
||||
/* MCP client picker modal */
|
||||
.mcp-picker h3 {
|
||||
margin: 0 0 6px;
|
||||
font-size: 1.25rem;
|
||||
}
|
||||
|
||||
.picker-subtitle {
|
||||
color: var(--text-secondary);
|
||||
font-size: 0.9rem;
|
||||
margin: 0 0 20px;
|
||||
line-height: 1.5;
|
||||
}
|
||||
|
||||
.picker-option {
|
||||
display: flex;
|
||||
align-items: center;
|
||||
justify-content: space-between;
|
||||
gap: 16px;
|
||||
padding: 16px;
|
||||
border: 1px solid var(--border);
|
||||
border-radius: var(--radius);
|
||||
margin-bottom: 12px;
|
||||
background: var(--background);
|
||||
}
|
||||
|
||||
.picker-option:last-child {
|
||||
margin-bottom: 0;
|
||||
}
|
||||
|
||||
.picker-meta {
|
||||
flex: 1;
|
||||
min-width: 0;
|
||||
}
|
||||
|
||||
.picker-title {
|
||||
font-weight: 600;
|
||||
font-size: 0.95rem;
|
||||
margin-bottom: 4px;
|
||||
}
|
||||
|
||||
.picker-desc {
|
||||
color: var(--text-secondary);
|
||||
font-size: 0.8rem;
|
||||
line-height: 1.4;
|
||||
}
|
||||
|
||||
.picker-option .btn {
|
||||
white-space: nowrap;
|
||||
padding: 8px 14px;
|
||||
font-size: 0.85rem;
|
||||
}
|
||||
|
||||
.token-box {
|
||||
background: var(--background);
|
||||
border: 1px solid var(--border);
|
||||
border-radius: var(--radius);
|
||||
padding: 12px;
|
||||
font-family: 'SF Mono', monospace;
|
||||
font-size: 0.85rem;
|
||||
word-break: break-all;
|
||||
}
|
||||
|
||||
@@ -22,8 +22,8 @@ async function resolveCreds(
|
||||
): Promise<{ accessToken: string; pageId: string }> {
|
||||
if (customer) {
|
||||
const creds = await customer.getCredential<FacebookCredentials>('facebook');
|
||||
if (!creds) throw new Error('Facebook not connected for this account');
|
||||
return { accessToken: creds.accessToken, pageId: creds.pageId };
|
||||
if (creds) return { accessToken: creds.accessToken, pageId: creds.pageId };
|
||||
// Fall back to default env credentials when customer has no per-account creds
|
||||
}
|
||||
const account = args.account ?? 'default';
|
||||
const accessToken = getEnvToken(account);
|
||||
|
||||
@@ -24,8 +24,8 @@ async function resolveCreds(
|
||||
): Promise<{ accessToken: string; businessAccountId: string }> {
|
||||
if (customer) {
|
||||
const creds = await customer.getCredential<InstagramCredentials>('instagram');
|
||||
if (!creds) throw new Error('Instagram not connected for this account');
|
||||
return { accessToken: creds.accessToken, businessAccountId: creds.businessAccountId };
|
||||
if (creds) return { accessToken: creds.accessToken, businessAccountId: creds.businessAccountId };
|
||||
// Fall back to default env credentials when customer has no per-account creds
|
||||
}
|
||||
const account = args.account ?? 'default';
|
||||
const accessToken = getEnvToken(account);
|
||||
|
||||
59
src/index.ts
59
src/index.ts
@@ -724,6 +724,26 @@ app.get('/oauth/connect-mcp', (req, res) => {
|
||||
res.redirect(`/oauth/authorize?${params}`);
|
||||
});
|
||||
|
||||
// Dedicated entry point for the Claude.ai web MCP client. It uses the official
|
||||
// Anthropic redirect_uri so Claude.ai receives the authorization code directly.
|
||||
// A state parameter is included because Claude.ai's callback requires it.
|
||||
app.get('/oauth/connect-claude-ai', (req, res) => {
|
||||
const clientId = process.env.OAUTH_CLIENT_ID;
|
||||
if (!clientId) {
|
||||
res.status(503).send('MCP OAuth app not configured (OAUTH_CLIENT_ID missing)');
|
||||
return;
|
||||
}
|
||||
const state = crypto.randomBytes(16).toString('hex');
|
||||
const params = new URLSearchParams({
|
||||
client_id: clientId,
|
||||
redirect_uri: 'https://claude.ai/api/mcp/auth_callback',
|
||||
response_type: 'code',
|
||||
scope: 'mcp',
|
||||
state,
|
||||
});
|
||||
res.redirect(`/oauth/authorize?${params}`);
|
||||
});
|
||||
|
||||
// Callback — exchange code for token and render the config snippet page
|
||||
app.get('/oauth/mcp-callback', async (req, res) => {
|
||||
const code = req.query.code as string | undefined;
|
||||
@@ -762,11 +782,12 @@ h1{color:#dc2626;margin:0 0 12px}p{color:#888;margin:0}</style></head>
|
||||
}
|
||||
|
||||
const { token, serverUrl } = opts;
|
||||
const mcpUrl = `${serverUrl}/mcp`;
|
||||
const claudeConfig = JSON.stringify({
|
||||
mcpServers: { 'hermes-mcp': { type: 'http', url: `${serverUrl}/mcp`, headers: { Authorization: `Bearer ${token}` } } }
|
||||
mcpServers: { 'hermes-mcp': { type: 'http', url: mcpUrl, headers: { Authorization: `Bearer ${token}` } } }
|
||||
}, null, 2);
|
||||
const codexConfig = JSON.stringify({
|
||||
mcpServers: { 'hermes-mcp': { type: 'http', url: `${serverUrl}/mcp`, headers: { Authorization: `Bearer ${token}` } } }
|
||||
mcpServers: { 'hermes-mcp': { type: 'http', url: mcpUrl, headers: { Authorization: `Bearer ${token}` } } }
|
||||
}, null, 2);
|
||||
|
||||
const esc = (s: string) => s.replace(/&/g, '&').replace(/</g, '<').replace(/>/g, '>');
|
||||
@@ -781,12 +802,18 @@ body{font-family:system-ui,sans-serif;background:#0f0f10;color:#e5e5e5;margin:0;
|
||||
.card{background:#1a1a1b;border:1px solid #2a2a2b;border-radius:12px;padding:32px;max-width:680px;margin:0 auto}
|
||||
h1{font-size:22px;margin:0 0 8px;color:#10a37f}
|
||||
.subtitle{color:#888;margin:0 0 28px;font-size:14px}
|
||||
h2{font-size:14px;font-weight:600;color:#888;text-transform:uppercase;letter-spacing:.05em;margin:20px 0 8px}
|
||||
h2{font-size:14px;font-weight:600;color:#888;text-transform:uppercase;letter-spacing:.05em;margin:24px 0 10px}
|
||||
pre{background:#0f0f10;border:1px solid #2a2a2b;border-radius:8px;padding:16px;font-size:12px;overflow-x:auto;position:relative}
|
||||
.copy-btn{position:absolute;top:8px;right:8px;background:#2a2a2b;border:none;color:#888;padding:4px 10px;border-radius:6px;cursor:pointer;font-size:11px}
|
||||
.copy-btn:hover{color:#e5e5e5}
|
||||
.token-box{background:#0f0f10;border:1px solid #2a2a2b;border-radius:8px;padding:12px 16px;font-family:monospace;font-size:13px;word-break:break-all;margin-bottom:8px}
|
||||
.warn{color:#888;font-size:12px;margin:4px 0 20px}
|
||||
.instruct{color:#a1a1aa;font-size:13px;line-height:1.6;margin:8px 0}
|
||||
.instruct code{background:#0f0f10;border:1px solid #2a2a2b;border-radius:4px;padding:2px 5px;font-size:12px}
|
||||
.instruct ol{margin:8px 0;padding-left:20px}
|
||||
.instruct li{margin:6px 0}
|
||||
.client-section{border-top:1px solid #2a2a2b;padding-top:18px;margin-top:18px}
|
||||
.client-section:first-of-type{border-top:none;padding-top:0;margin-top:0}
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
@@ -798,11 +825,28 @@ pre{background:#0f0f10;border:1px solid #2a2a2b;border-radius:8px;padding:16px;f
|
||||
<div class="token-box">${esc(token!)}</div>
|
||||
<p class="warn">Store this securely — it won't be shown again.</p>
|
||||
|
||||
<h2>Claude Desktop <code>claude_desktop_config.json</code></h2>
|
||||
<pre id="claude-cfg">${esc(claudeConfig)}<button class="copy-btn" onclick="copy('claude-cfg')">Copy</button></pre>
|
||||
<div class="client-section">
|
||||
<h2>Claude.ai (browser)</h2>
|
||||
<p class="instruct">In <a href="https://claude.ai" target="_blank" rel="noopener" style="color:#10a37f">claude.ai</a> go to <strong>Settings → Integrations → Add MCP server</strong> and paste:</p>
|
||||
<pre id="claude-web-cfg">${esc(mcpUrl)}<button class="copy-btn" onclick="copy('claude-web-cfg')">Copy</button></pre>
|
||||
<p class="instruct">When prompted, use the access token above.</p>
|
||||
</div>
|
||||
|
||||
<h2>Codex CLI / opencode config</h2>
|
||||
<pre id="codex-cfg">${esc(codexConfig)}<button class="copy-btn" onclick="copy('codex-cfg')">Copy</button></pre>
|
||||
<div class="client-section">
|
||||
<h2>Claude Desktop</h2>
|
||||
<p class="instruct">Paste this into <code>claude_desktop_config.json</code>:</p>
|
||||
<pre id="claude-cfg">${esc(claudeConfig)}<button class="copy-btn" onclick="copy('claude-cfg')">Copy</button></pre>
|
||||
</div>
|
||||
|
||||
<div class="client-section">
|
||||
<h2>ChatGPT / GPT Actions</h2>
|
||||
<p class="instruct">For ChatGPT, use the OpenAPI spec at <code>${esc(serverUrl!)}/openapi.json</code> and add a Bearer token header with the token above. Native MCP support in chatgpt.com is not yet available.</p>
|
||||
</div>
|
||||
|
||||
<div class="client-section">
|
||||
<h2>Codex CLI / OpenCode</h2>
|
||||
<pre id="codex-cfg">${esc(codexConfig)}<button class="copy-btn" onclick="copy('codex-cfg')">Copy</button></pre>
|
||||
</div>
|
||||
</div>
|
||||
<script>
|
||||
function copy(id) {
|
||||
@@ -2320,6 +2364,7 @@ async function main() {
|
||||
if (oauthClientId && oauthClientSecret) {
|
||||
await ensureOAuthAppRegistered(oauthClientId, oauthClientSecret, [
|
||||
`${SERVER_URL}/oauth/mcp-callback`,
|
||||
'https://claude.ai/api/mcp/auth_callback',
|
||||
'http://localhost:*',
|
||||
'claude-desktop://callback',
|
||||
'opencode://callback',
|
||||
|
||||
13
src/redis.ts
13
src/redis.ts
@@ -2,9 +2,18 @@ import { createClient } from 'redis';
|
||||
|
||||
const redis = createClient({
|
||||
url: process.env.REDIS_URL,
|
||||
socket: { connectTimeout: 3000, socketTimeout: 5000 },
|
||||
socket: {
|
||||
connectTimeout: 3000,
|
||||
socketTimeout: 5000,
|
||||
reconnectStrategy: (retries) => Math.min(retries * 100, 3000),
|
||||
},
|
||||
});
|
||||
|
||||
redis.on('error', (err) => console.error('[redis] error:', err.message));
|
||||
redis.connect().catch((err) => console.error('[redis] connect error:', err));
|
||||
redis.on('connect', () => console.log('[redis] connected'));
|
||||
redis.on('reconnecting', () => console.log('[redis] reconnecting...'));
|
||||
redis.on('end', () => console.log('[redis] connection ended'));
|
||||
|
||||
redis.connect().catch((err) => console.error('[redis] initial connect error:', err.message));
|
||||
|
||||
export default redis;
|
||||
|
||||
Reference in New Issue
Block a user