docs(runbooks): add 2026-06-14 public edge outage RCA, fix log, infra findings, debt, and monitoring plan
Some checks failed
CI / test (push) Has been cancelled
This commit is contained in:
75
docs/runbooks/2026-06-14-active-issues-and-debt.md
Normal file
@@ -0,0 +1,75 @@
|
||||
# Active Issues and Remaining Debt — 2026-06-14
|
||||
|
||||
## What is working now
|
||||
|
||||
All commercial domains verified reachable with valid TLS:
|
||||
|
||||
- `hermes.squaremcp.com` / `openapi-living-brief.json`
|
||||
- `app.squaremcp.com`
|
||||
- `docs.squaremcp.com`
|
||||
- `squaremcp.com` / `www.squaremcp.com`
|
||||
- `tiktok.squaremcp.com`
|
||||
- `fetcherpay.com` / `www.fetcherpay.com`
|
||||
- `workflow.fetcherpay.com`
|
||||
- `mail.fetcherpay.com`
|
||||
- `git.fetcherpay.com`
|
||||
|
||||
Hermes path-specific routes verified:
|
||||
- `POST /api/pilot-request` → `201` on `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`
|
||||
- `GET /auth/tiktok/start` → `302` on `tiktok.squaremcp.com`
|
||||
|
||||
---
|
||||
|
||||
## Still down / not addressed
|
||||
|
||||
| Subdomain / Service | Why it is down | What would fix it |
|---|---|---|
| `api.fetcherpay.com` | `fetcherpay-api` container not running | Start `fetcherpay-api` (needs env vars, Postgres, Redis) |
| `prometheus.fetcherpay.com` | Prometheus container not running | Start Prometheus from `docker-compose.fetcherpay.yml` |
| `grafana.fetcherpay.com` | Grafana container not running | Start Grafana from `docker-compose.fetcherpay.yml` |
| `adminer.fetcherpay.com` | Adminer container not running | Start Adminer from `docker-compose.fetcherpay.yml` |
| `traefik.fetcherpay.com` | Traefik dashboard is on `:8080` but not routed through a public host label | Add a secure router or restrict dashboard to localhost/VPN |
|
||||
|
||||
---
|
||||
|
||||
## Architectural debt
|
||||
|
||||
1. **K8s nginx-ingress is bypassed**
|
||||
- Traefik’s Docker iptables rules intercept all public HTTP/S traffic.
|
||||
- The active nginx-ingress controller class is `public`; manifests use `nginx`.
|
||||
- Long term: either reconcile `ingressClassName` or migrate the public edge to K8s.
|
||||
|
||||
2. **Manual static certificate workaround**
|
||||
- Traefik cannot issue new certs via GoDaddy DNS-01 for several domains because of `DUPLICATE_RECORD` TXT errors.
|
||||
- Certs are extracted from K8s cert-manager secrets and loaded statically.
|
||||
- These must be manually rotated before expiry.
|
||||
|
||||
3. **No observability**
|
||||
- No synthetic uptime probes.
|
||||
- No cert-expiry alerting.
|
||||
- No Hermes `/metrics` endpoint.
|
||||
- No Alertmanager / Slack alerts.
|
||||
- No centralized logs.
|
||||
|
||||
4. **Secret management**
|
||||
- Plaintext secrets in `hermes-k8s.yaml` and compose env vars.
|
||||
- No Sealed Secrets / External Secrets / Vault.
|
||||
|
||||
5. **Single point of failure**
|
||||
- One host, one residential IP, one edge proxy.
|
||||
- No redundancy or failover.
|
||||
|
||||
6. **Gitea SSH port**
|
||||
- Changed from `2222` to `22222` due to an unknown process binding `2222`.
|
||||
- The original occupant of port `2222` was never identified; a reboot would be needed to clear it.
|
||||
|
||||
---
|
||||
|
||||
## Recommended next steps
|
||||
|
||||
See `2026-06-14-public-edge-outage-plan.md` for the full phased plan. Priorities:
|
||||
|
||||
1. **Immediate:** finalize RCA, runbook, and `scripts/verify-public-endpoints.sh`.
|
||||
2. **This week:** deploy blackbox exporter + cert-expiry alerts + container-up check.
|
||||
3. **Next sprint:** add Hermes `/metrics`, Grafana dashboards, Alertmanager Slack routing.
|
||||
4. **Future:** decide on K8s edge migration vs. reconciling ingress classes.
|
||||
164
docs/runbooks/2026-06-14-infrastructure-findings.md
Normal file
@@ -0,0 +1,164 @@
|
||||
# Infrastructure Findings — SquareMCP / FetcherPay
|
||||
|
||||
This document captures the as-built architecture, ingress behavior, monitoring state, and Hermes route table discovered during the 2026-06-14 outage response.
|
||||
|
||||
---
|
||||
|
||||
## 1. High-level architecture
|
||||
|
||||
The single production server (`104.190.60.129`) hosts two separate ingress layers:
|
||||
|
||||
| Ingress Layer | Technology | Serves |
|---|---|---|
| **Docker edge proxy** | Traefik v3 | `*.fetcherpay.com` Docker Compose stacks, plus static file-provider routes for `*.squaremcp.com` |
| **Kubernetes ingress** | nginx-ingress-microk8s + cert-manager | `*.squaremcp.com` K8s workloads (currently bypassed by Traefik) |
|
||||
|
||||
Both layers use Let’s Encrypt TLS. Public ports `80`/`443` are bound by the Docker Traefik container, so its `iptables` rules win over host-network K8s services.
|
||||
|
||||
---
|
||||
|
||||
## 2. Traefik configuration
|
||||
|
||||
### Static config
|
||||
**File:** `/home/garfield/traefik.yml`
|
||||
|
||||
- Dashboard enabled on `:8080` with `insecure: true`.
|
||||
- Entrypoints: `web` (HTTP → HTTPS redirect) and `websecure` (HTTPS, `:443`).
|
||||
- Providers: Docker (socket) + file provider (`/letsencrypt/manual/tls.yml`, `watch: true`).
|
||||
- Certificate resolver: `letsencrypt` via GoDaddy DNS-01.
|
||||
|
||||
### Compose
|
||||
**File:** `/home/garfield/traefik-compose.yml`
|
||||
|
||||
- Networks: `hermes-net`, `obsidian-net`, `fetcherpay` (all external).
|
||||
- Volumes: Docker socket, static config, `letsencrypt` directory.
|
||||
|
||||
### Dynamic routing
|
||||
**File:** `/home/garfield/letsencrypt/manual/tls.yml`
|
||||
|
||||
Final state after the fix has file-provider routers for all commercial domains and path-specific rules that send `/api/pilot-request` and `/auth/tiktok` to Hermes.
|
||||
|
||||
---
|
||||
|
||||
## 3. Kubernetes ingress mismatch
|
||||
|
||||
- **Controller class:** `public`
|
||||
- **Ingress class used by manifests:** `nginx`
|
||||
|
||||
This means the active controller ignores most Ingress resources. Even if Traefik were removed, those Ingresses would not be served until the class is reconciled.
|
||||
|
||||
Affected manifests include:
|
||||
- `hermes-mcp/hermes-k8s.yaml`
|
||||
- `hermes-mcp/product/app/app-k8s.yaml`
|
||||
- `hermes-mcp/docs/docs-k8s.yaml`
|
||||
- `hermes-mcp/product/site/squaremcp-k8s-ingress.yaml`
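
The class mismatch described in this section can be detected mechanically rather than by reading manifests. The following is a minimal sketch, assuming `microk8s kubectl` is available on the host and that `public` is the class the active controller watches, as stated above. The helper names `find_class_mismatches` and `list_ingress_classes` are hypothetical and do not exist in the repository.

```bash
#!/usr/bin/env bash
# Sketch: list Ingresses whose ingressClassName differs from the class the
# active controller watches. Helper names are hypothetical.

# Reads "namespace/name class" lines on stdin and prints those whose class
# is not the expected one. A missing class is reported as "<none>".
find_class_mismatches() {
  local expected="$1"
  awk -v want="$expected" '
    NF >= 1 {
      cls = (NF >= 2) ? $2 : "<none>"
      if (cls != want) print $1 " uses class " cls ", controller watches " want
    }'
}

# Emits one "namespace/name class" line per Ingress in the cluster.
list_ingress_classes() {
  microk8s kubectl get ingress -A -o \
    jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{" "}{.spec.ingressClassName}{"\n"}{end}'
}
```

On the host this would be run as `list_ingress_classes | find_class_mismatches public`. Keeping the comparison separate from the `kubectl` call means the logic can be exercised with canned input.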
|
||||
|
||||
---
|
||||
|
||||
## 4. Hermes MCP route table
|
||||
|
||||
**File:** `hermes-mcp/src/index.ts`
|
||||
|
||||
### Public / commercial endpoints
|
||||
|
||||
| Method | Path | Notes |
|---|---|---|
| `GET` | `/` | Static files from `../product` |
| `GET` | `/openapi-living-brief.json` | Obsidian-only OpenAPI spec for ChatGPT |
| `GET` | `/openapi.json` | Full OpenAPI spec |
| `GET` | `/auth/tiktok/start` | Redirect to TikTok Login Kit |
| `GET` | `/auth/tiktok/callback` | TikTok OAuth callback |
| `POST` | `/api/pilot-request` | Public form submission; origin-gated |
| `GET` | `/health` | Liveness/readiness probe |
|
||||
|
||||
### OAuth / MCP discovery
|
||||
|
||||
| Method | Path |
|---|---|
| `POST` | `/oauth/register` |
| `GET` / `POST` | `/oauth/authorize` |
| `POST` | `/oauth/token` |
| `GET` | `/.well-known/oauth-authorization-server` |
| `GET` | `/.well-known/openid-configuration` |
| `GET` / `POST` / `DELETE` | `/mcp` |
| `GET` | `/sse` |
| `POST` | `/messages` |
| `GET` | `/tools` |
|
||||
|
||||
### Capability-guarded tool API
|
||||
|
||||
All `/api/*` tool routes require auth + capability grant:
|
||||
|
||||
| Capability | Example endpoints |
|---|---|
| `obsidian` | `/api/obsidian/search`, `/api/obsidian/note`, `/api/obsidian/note/append`, `/api/obsidian/sync` |
| `email` | `/api/email/profile`, `/api/email/search`, `/api/email/read`, `/api/email/send` |
| `whatsapp` | `/api/whatsapp/send`, `/api/whatsapp/templates` |
| `linkedin` | `/api/linkedin/profile`, `/api/linkedin/post`, `/api/linkedin/message` |
| `telegram` | `/api/telegram/me`, `/api/telegram/message`, `/api/telegram/updates` |
| `discord` | `/api/discord/me`, `/api/discord/guilds`, `/api/discord/message` |
| `instagram` | `/api/instagram/profile`, `/api/instagram/media`, `/api/instagram/post` |
| `twitter` | `/api/twitter/search`, `/api/twitter/tweets`, `/api/twitter/tweet` |
| `facebook` | `/api/facebook/page`, `/api/facebook/posts`, `/api/facebook/post` |
| `tiktok` | `/api/tiktok/profile`, `/api/tiktok/video`, `/api/tiktok/video/status` |
|
||||
|
||||
### Health endpoint
|
||||
|
||||
```typescript
app.get('/health', (_req, res) => {
  res.json({
    status: 'ok',
    service: 'hermes-mcp',
    toolCount,
    transports,
    endpoints,
  });
});
```
|
||||
|
||||
Used by both K8s readiness and liveness probes in `hermes-k8s.yaml`.
|
||||
|
||||
---
|
||||
|
||||
## 5. Monitoring gaps
|
||||
|
||||
### Prometheus / Grafana
|
||||
|
||||
- Prometheus and Grafana containers exist in `docker-compose.fetcherpay.yml`.
|
||||
- Prometheus scrapes itself, `fetcherpay-api:3000`, and Docker metrics at `172.20.0.1:9323`.
|
||||
- **Hermes MCP is not scraped** and has no `/metrics` endpoint.
|
||||
- No Alertmanager, no alert rules.
|
||||
|
||||
### Health checks
|
||||
|
||||
- Hermes has `/health` but no `/ready` or `/livez` separation.
|
||||
- Docker health checks exist for Postgres, MySQL, Redis, Gitea, and FetcherPay API, but **not for Hermes**.
|
||||
|
||||
### Uptime / synthetic probes
|
||||
|
||||
- No blackbox exporter.
|
||||
- No external uptime monitoring (Pingdom, UptimeRobot, Grafana Cloud, etc.).
|
||||
- No cert-expiry alerting.
|
||||
- No K8s ingress reconciliation check.
|
||||
|
||||
### Logs
|
||||
|
||||
- No centralized log aggregation (Loki, Vector, Fluentd).
|
||||
|
||||
---
|
||||
|
||||
## 6. Secret management
|
||||
|
||||
- `hermes-k8s.yaml` is gitignored and contains plaintext secrets (email, DB, OAuth, API keys).
|
||||
- Docker Compose stacks rely on exported env vars or `.env` files.
|
||||
- No Sealed Secrets, External Secrets Operator, or Vault in use.
|
||||
|
||||
---
|
||||
|
||||
## 7. Notable risks
|
||||
|
||||
1. **Single point of failure:** one residential IP, one host, one edge proxy.
|
||||
2. **Split edge:** two ingress controllers with conflicting class configuration.
|
||||
3. **Manual certificate workaround:** static K8s-extracted certs in Traefik must be manually rotated before expiry.
|
||||
4. **No observability:** no metrics, alerting, or synthetic probes for the commercial domains.
|
||||
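5. **Stopped services not detected:** Docker restart policies only bring back containers that were started and then exited; a container that was never started, or was explicitly stopped under `unless-stopped`, stays down, and nothing alerts on it.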
|
||||
360
docs/runbooks/2026-06-14-outage-fix-log.md
Normal file
@@ -0,0 +1,360 @@
|
||||
# Outage Fix Log — 2026-06-14
|
||||
|
||||
This is the step-by-step record of what was changed to restore public access to the SquareMCP / FetcherPay commercial sites.
|
||||
|
||||
---
|
||||
|
||||
## Environment
|
||||
|
||||
- **Host:** `104.190.60.129` (MicroK8s + Docker)
|
||||
- **Edge proxy:** Traefik v3 in Docker, binds `:80`, `:443`, `:8080`
|
||||
- **Hermes MCP:** K8s pod with `hostNetwork: true` on `:3456`
|
||||
- **Key files:**
|
||||
- `/home/garfield/traefik-compose.yml`
|
||||
- `/home/garfield/traefik.yml`
|
||||
- `/home/garfield/letsencrypt/manual/tls.yml`
|
||||
- `/home/garfield/Downloads/docker-compose.prod.yml`
|
||||
|
||||
---
|
||||
|
||||
## 1. Attach Traefik to the FetcherPay network
|
||||
|
||||
**File:** `/home/garfield/traefik-compose.yml`
|
||||
|
||||
Added the `fetcherpay` external network so Traefik can reach FetcherPay Docker backends.
|
||||
|
||||
```yaml
services:
  traefik:
    ...
    networks:
      - hermes-net
      - obsidian-net
      - fetcherpay

networks:
  hermes-net:
    external: true
    name: hermes-mcp_hermes-net
  obsidian-net:
    external: true
    name: obsidian_obsidian-net
  fetcherpay:
    external: true
    name: fetcherpay_fetcherpay
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Rebuild the Traefik file-provider routing config
|
||||
|
||||
**File:** `/home/garfield/letsencrypt/manual/tls.yml`
|
||||
|
||||
Final config includes routers and services for:
|
||||
- `hermes.squaremcp.com`
|
||||
- `app.squaremcp.com`
|
||||
- `docs.squaremcp.com`
|
||||
- `squaremcp.com` / `www.squaremcp.com`
|
||||
- `tiktok.squaremcp.com`
|
||||
- `fetcherpay.com` / `www.fetcherpay.com`
|
||||
- `workflow.fetcherpay.com`
|
||||
- `mail.fetcherpay.com`
|
||||
- `git.fetcherpay.com`
|
||||
|
||||
Path-specific rules that route to Hermes (`104.190.60.129:3456`):
|
||||
- `/api/pilot-request` on `squaremcp.com` / `www.squaremcp.com`
|
||||
- `/auth/tiktok` and `/api/pilot-request` on `tiktok.squaremcp.com`
|
||||
|
||||
Full final config:
|
||||
|
||||
```yaml
http:
  routers:
    hermes:
      rule: "Host(`hermes.squaremcp.com`)"
      service: hermes
      entryPoints: [websecure]
      tls: { certResolver: letsencrypt }

    squaremcp-app:
      rule: "Host(`app.squaremcp.com`)"
      service: squaremcp-app
      entryPoints: [websecure]
      tls: {}

    squaremcp-docs:
      rule: "Host(`docs.squaremcp.com`)"
      service: squaremcp-docs
      entryPoints: [websecure]
      tls: {}

    squaremcp-site-main:
      rule: "Host(`squaremcp.com`) || Host(`www.squaremcp.com`)"
      service: squaremcp-site
      priority: 10
      entryPoints: [websecure]
      tls: {}

    squaremcp-site-pilot:
      rule: "(Host(`squaremcp.com`) || Host(`www.squaremcp.com`)) && PathPrefix(`/api/pilot-request`)"
      service: hermes
      priority: 30
      entryPoints: [websecure]
      tls: {}

    squaremcp-tiktok-main:
      rule: "Host(`tiktok.squaremcp.com`)"
      service: squaremcp-site
      priority: 10
      entryPoints: [websecure]
      tls: {}

    squaremcp-tiktok-auth:
      rule: "Host(`tiktok.squaremcp.com`) && PathPrefix(`/auth/tiktok`)"
      service: hermes
      priority: 30
      entryPoints: [websecure]
      tls: {}

    squaremcp-tiktok-pilot:
      rule: "Host(`tiktok.squaremcp.com`) && PathPrefix(`/api/pilot-request`)"
      service: hermes
      priority: 30
      entryPoints: [websecure]
      tls: {}

    fetcherpay-root:
      rule: "Host(`fetcherpay.com`) || Host(`www.fetcherpay.com`)"
      service: fetcherpay-web
      priority: 60
      entryPoints: [websecure]
      tls: {}

    workflow:
      rule: "Host(`workflow.fetcherpay.com`)"
      service: temporal-ui
      priority: 60
      entryPoints: [websecure]
      tls: {}

    mail:
      rule: "Host(`mail.fetcherpay.com`)"
      service: poste
      priority: 60
      entryPoints: [websecure]
      tls: {}

    git:
      rule: "Host(`git.fetcherpay.com`)"
      service: gitea
      priority: 60
      entryPoints: [websecure]
      tls: {}

  services:
    hermes:
      loadBalancer:
        servers: [{ url: "http://104.190.60.129:3456" }]
        passHostHeader: true
    squaremcp-app:
      loadBalancer:
        servers: [{ url: "http://10.152.183.164:80" }]
        passHostHeader: true
    squaremcp-docs:
      loadBalancer:
        servers: [{ url: "http://10.152.183.130:80" }]
        passHostHeader: true
    squaremcp-site:
      loadBalancer:
        servers: [{ url: "http://10.152.183.48:80" }]
        passHostHeader: true
    fetcherpay-web:
      loadBalancer:
        servers: [{ url: "http://172.20.0.9:80" }]
        passHostHeader: true
    temporal-ui:
      loadBalancer:
        servers: [{ url: "http://172.20.0.3:8080" }]
        passHostHeader: true
    poste:
      loadBalancer:
        servers: [{ url: "http://poste:80" }]
        passHostHeader: true
    gitea:
      loadBalancer:
        servers: [{ url: "http://gitea:3000" }]
        passHostHeader: true

tls:
  certificates:
    - certFile: /letsencrypt/manual/certs/squaremcp-app.crt
      keyFile: /letsencrypt/manual/certs/squaremcp-app.key
    - certFile: /letsencrypt/manual/certs/squaremcp-docs.crt
      keyFile: /letsencrypt/manual/certs/squaremcp-docs.key
    - certFile: /letsencrypt/manual/certs/squaremcp-site.crt
      keyFile: /letsencrypt/manual/certs/squaremcp-site.key
    - certFile: /letsencrypt/manual/certs/fetcherpay-root.crt
      keyFile: /letsencrypt/manual/certs/fetcherpay-root.key
    - certFile: /letsencrypt/manual/certs/mail-fetcherpay.crt
      keyFile: /letsencrypt/manual/certs/mail-fetcherpay.key
    - certFile: /letsencrypt/manual/certs/git-fetcherpay.crt
      keyFile: /letsencrypt/manual/certs/git-fetcherpay.key
```
|
||||
|
||||
---
|
||||
|
||||
## 3. Extract static TLS certificates from K8s cert-manager secrets
|
||||
|
||||
Because Traefik’s GoDaddy DNS-01 resolver fails with `DUPLICATE_RECORD` for existing `_acme-challenge.*` TXT records, valid certificates were pulled from the K8s secrets that cert-manager already held.
|
||||
|
||||
```bash
mkdir -p /home/garfield/letsencrypt/manual/certs
# The redirects below use relative paths, so run them from the certs directory.
cd /home/garfield/letsencrypt/manual/certs

# squaremcp-app
microk8s kubectl get secret squaremcp-app-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-app.crt
microk8s kubectl get secret squaremcp-app-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-app.key

# squaremcp-docs
microk8s kubectl get secret squaremcp-docs-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-docs.crt
microk8s kubectl get secret squaremcp-docs-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-docs.key

# squaremcp-site (covers squaremcp.com / www.squaremcp.com / tiktok.squaremcp.com)
microk8s kubectl get secret squaremcp-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > squaremcp-site.crt
microk8s kubectl get secret squaremcp-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > squaremcp-site.key

# fetcherpay-root
microk8s kubectl get secret fetcherpay-root-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > fetcherpay-root.crt
microk8s kubectl get secret fetcherpay-root-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > fetcherpay-root.key

# mail.fetcherpay.com
microk8s kubectl get secret mail-fetcherpay-tls -n email -o jsonpath='{.data.tls\.crt}' | base64 -d > mail-fetcherpay.crt
microk8s kubectl get secret mail-fetcherpay-tls -n email -o jsonpath='{.data.tls\.key}' | base64 -d > mail-fetcherpay.key

# git.fetcherpay.com
microk8s kubectl get secret fetcherpay-git-tls -n fetcherpay -o jsonpath='{.data.tls\.crt}' | base64 -d > git-fetcherpay.crt
microk8s kubectl get secret fetcherpay-git-tls -n fetcherpay -o jsonpath='{.data.tls\.key}' | base64 -d > git-fetcherpay.key
```
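
Because these static certificates no longer renew automatically, it helps to know how long each one has left. The sketch below is one possible check, not part of the original fix. It assumes the `openssl` CLI and GNU `date` (for `date -d`) are installed; `days_between` and `check_cert_dir` are hypothetical helper names, and the 7-day default is an illustrative threshold.

```bash
#!/usr/bin/env bash
# Sketch: report how many days remain on each extracted certificate.
# Assumes GNU date (for `date -d`) and the openssl CLI. Helper names are
# hypothetical.

# Whole days between two Unix timestamps, truncated toward zero.
days_between() {
  local now="$1" expiry="$2"
  echo $(( (expiry - now) / 86400 ))
}

# Prints "<file> <days-left>" for each .crt in the directory and returns
# non-zero if any certificate has fewer than `min_days` days left.
check_cert_dir() {
  local dir="$1" min_days="${2:-7}" rc=0 crt end_date expiry left
  for crt in "$dir"/*.crt; do
    [ -e "$crt" ] || continue   # glob matched nothing
    # openssl prints "notAfter=<date>"; keep only the date part.
    end_date="$(openssl x509 -enddate -noout -in "$crt" | cut -d= -f2)"
    expiry="$(date -d "$end_date" +%s)"
    left="$(days_between "$(date +%s)" "$expiry")"
    echo "$crt $left"
    [ "$left" -ge "$min_days" ] || rc=1
  done
  return "$rc"
}
```

A run such as `check_cert_dir /home/garfield/letsencrypt/manual/certs 14` could be scheduled from cron; the non-zero exit status is what a wrapper would alert on.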
|
||||
|
||||
---
|
||||
|
||||
## 4. Start stopped backend containers
|
||||
|
||||
### FetcherPay web
|
||||
|
||||
```bash
docker compose -p fetcherpay -f /home/garfield/docker-compose.fetcherpay.yml up -d fetcherpay-web
```
|
||||
|
||||
### Poste (mail)
|
||||
|
||||
```bash
docker compose -p fetcherpay -f /home/garfield/Downloads/docker-compose.prod.yml up -d poste
```
|
||||
|
||||
### Postgres + Gitea (git)
|
||||
|
||||
Gitea credentials were recovered from the existing Gitea config volume:
|
||||
|
||||
```bash
docker run --rm -v fetcherpay_gitea_data:/data alpine \
  sh -c 'cat /data/gitea/conf/app.ini | grep -E "^(NAME|USER|PASSWD|HOST|DB_TYPE)"'
# DB_TYPE = postgres
# HOST = postgres:5432
# NAME = gitea
# USER = fetcherpay
# PASSWD = fetcherpay_secure_2024
```
|
||||
|
||||
Then postgres and gitea were started with the required env vars:
|
||||
|
||||
```bash
cd /home/garfield/Downloads
export POSTGRES_USER=fetcherpay
export POSTGRES_PASSWORD=fetcherpay_secure_2024
export POSTGRES_DB=postgres
export GITEA_HOST=git.fetcherpay.com
export GITEA_DB=gitea
export MAIL_HOST=mail.fetcherpay.com
export WEB_HOST=fetcherpay.com
export API_HOST=api.fetcherpay.com
export PROM_HOST=prometheus.fetcherpay.com
export GRAFANA_HOST=grafana.fetcherpay.com
export ADMINER_HOST=adminer.fetcherpay.com
export TEMPORAL_HOST=workflow.fetcherpay.com
export REDIS_PASSWORD=redis_pass
export MYSQL_ROOT_PASSWORD=mysql_root
export MYSQL_DATABASE=fetcherpay
export MYSQL_USER=fetcherpay
export MYSQL_PASSWORD=mysql_pass
export GRAFANA_ADMIN_PASSWORD=admin
export ADMINER_USERS=admin:admin
export TRAEFIK_DASHBOARD_HOST=traefik.fetcherpay.com

docker compose -p fetcherpay -f docker-compose.prod.yml up -d postgres gitea
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Fix `workflow.fetcherpay.com`
|
||||
|
||||
The Docker label on the `temporal` service pointed Traefik at port `7233`, which speaks gRPC rather than HTTP, so requests returned `502`. A file-provider router was added in `tls.yml` that sends `workflow.fetcherpay.com` to the `temporal-ui` service on port `8080`.
|
||||
|
||||
---
|
||||
|
||||
## 6. Fix Gitea SSH port conflict
|
||||
|
||||
The host port `2222` was already in use by an unknown process and could not be freed. The Gitea SSH mapping was changed from `2222:22` to `22222:22`.
|
||||
|
||||
**File:** `/home/garfield/Downloads/docker-compose.prod.yml`
|
||||
|
||||
```yaml
gitea:
  ...
  ports:
    - "22222:22" # SSH (optional for git over SSH)
```
|
||||
|
||||
The `gitea` container was then recreated with the new mapping.
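
The process holding port `2222` was not identified during the incident. One way to look it up is sketched below, assuming the `ss` utility from iproute2 is installed. `port_owner` is a hypothetical helper that parses `ss -ltnpH` output; process names are only visible when `ss` runs as root.

```bash
#!/usr/bin/env bash
# Sketch: identify which process holds a listening TCP port.
# `port_owner` is a hypothetical helper; it parses `ss -ltnpH` output.

# Reads `ss -ltnpH` lines on stdin and prints "name pid" for each listener
# on the given port.
port_owner() {
  local port="$1"
  awk -v p=":$port" '
    # Field 4 is the local address:port, e.g. 0.0.0.0:2222 or [::]:2222.
    # Compare only the trailing ":<port>" so 22222 does not match 2222.
    substr($4, length($4) - length(p) + 1) == p {
      if (match($0, /users:\(\("[^"]+",pid=[0-9]+/)) {
        s = substr($0, RSTART, RLENGTH)
        sub(/^users:\(\("/, "", s)
        sub(/",pid=/, " ", s)
        print s
      } else {
        print "unknown (rerun as root to see the owner)"
      }
    }'
}
```

On the host this would be run as `sudo ss -ltnpH | port_owner 2222`. If nothing is printed, no socket is listening on that port, and the conflict may come from something other than a listening process.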
|
||||
|
||||
---
|
||||
|
||||
## 7. Restart Traefik after every config change
|
||||
|
||||
```bash
docker restart traefik
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Verification results
|
||||
|
||||
Final public reachability check:
|
||||
|
||||
```
https://hermes.squaremcp.com/openapi-living-brief.json -> 200 (cert=0)
https://app.squaremcp.com/ -> 200 (cert=0)
https://docs.squaremcp.com/ -> 200 (cert=0)
https://squaremcp.com/ -> 200 (cert=0)
https://www.squaremcp.com/ -> 200 (cert=0)
https://tiktok.squaremcp.com/ -> 200 (cert=0)
https://tiktok.squaremcp.com/auth/tiktok/start -> 302 (cert=0)
https://fetcherpay.com/ -> 200 (cert=0)
https://www.fetcherpay.com/ -> 200 (cert=0)
https://workflow.fetcherpay.com/ -> 200 (cert=0)
https://mail.fetcherpay.com/ -> 302 (cert=0)
https://git.fetcherpay.com/ -> 200 (cert=0)

POST /api/pilot-request (tiktok) -> 201
POST /api/pilot-request (root/www) -> 201
GET /auth/tiktok/start -> 302
```
|
||||
|
||||
`cert=0` means TLS verification passed.
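
The script that produced this report is not shown in the log. The sketch below is one way such lines could be generated, assuming `curl` is installed: its `%{ssl_verify_result}` write-out variable is `0` when the certificate chain verifies. `format_result` and `probe` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch: produce one line in the "URL -> status (cert=N)" report format.
# `format_result` and `probe` are hypothetical helper names.

# Formats a single result line.
format_result() {
  local url="$1" code="$2" cert="$3"
  printf '%s -> %s (cert=%s)\n' "$url" "$code" "$cert"
}

# Fetches the URL without following redirects and reports the HTTP status
# plus curl's TLS verification result (0 means the chain verified).
probe() {
  local url="$1" code cert
  read -r code cert < <(curl -s -o /dev/null --max-time 10 \
    -w '%{http_code} %{ssl_verify_result}\n' "$url")
  format_result "$url" "${code:-000}" "${cert:-?}"
}
```

Calling `probe https://squaremcp.com/` would print one report line. Redirects are deliberately not followed, so that a `302` from a host such as `mail.fetcherpay.com` is reported as-is.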
|
||||
|
||||
---
|
||||
|
||||
## Notes / gotchas
|
||||
|
||||
- `/api/pilot-request` is `POST`-only. A `GET` request returns `404`, which is expected.
|
||||
- The `/auth/tiktok` routes are `/auth/tiktok/start` and `/auth/tiktok/callback`; the Traefik ``PathPrefix(`/auth/tiktok`)`` rule correctly forwards both.
|
||||
- Static certificate extraction required root access; Docker root containers were used when `sudo` began prompting for a password.
|
||||
32
docs/runbooks/2026-06-14-outage-index.md
Normal file
@@ -0,0 +1,32 @@
|
||||
# 2026-06-14 Public Edge Outage — Vault Index
|
||||
|
||||
All documentation for the outage, its root cause, the fix, and the follow-up plan lives in this SquareMCP vault folder.
|
||||
|
||||
## Files
|
||||
|
||||
| File | Purpose |
|---|---|
| `2026-06-14-public-edge-outage-rca.md` | Root cause analysis and incident timeline. |
| `2026-06-14-outage-fix-log.md` | Step-by-step record of every config change, command, and verification result. |
| `2026-06-14-infrastructure-findings.md` | As-built architecture, Traefik/K8s behavior, Hermes route table, and monitoring gaps. |
| `2026-06-14-active-issues-and-debt.md` | What is still down, remaining technical debt, and recommended next steps. |
| `2026-06-14-public-edge-outage-plan.md` | Proposed runbook, monitoring, probes, and alerting plan (Phase 1–4). |
| `2026-06-14-outage-index.md` | This file. |
|
||||
|
||||
## Quick status
|
||||
|
||||
- ✅ All listed `squaremcp.com` domains reachable with valid TLS.
|
||||
- ✅ All listed `fetcherpay.com` domains reachable with valid TLS.
|
||||
- ✅ Hermes path routes (`/api/pilot-request`, `/auth/tiktok`) verified.
|
||||
- ⚠️ K8s nginx-ingress remains bypassed by Traefik.
|
||||
- ⚠️ Several FetcherPay services still stopped (`api`, Prometheus, Grafana, Adminer).
|
||||
- ⚠️ No automated monitoring or alerting yet.
|
||||
|
||||
## Reference paths on disk
|
||||
|
||||
- Traefik compose: `/home/garfield/traefik-compose.yml`
|
||||
- Traefik static config: `/home/garfield/traefik.yml`
|
||||
- Traefik dynamic config: `/home/garfield/letsencrypt/manual/tls.yml`
|
||||
- Static certs: `/home/garfield/letsencrypt/manual/certs/`
|
||||
- FetcherPay prod compose: `/home/garfield/Downloads/docker-compose.prod.yml`
|
||||
- Hermes K8s manifest: `/home/garfield/hermes-mcp/hermes-k8s.yaml`
|
||||
129
docs/runbooks/2026-06-14-public-edge-outage-plan.md
Normal file
@@ -0,0 +1,129 @@
|
||||
# Plan: Document the outage, build a deployment runbook, and add diagnostics/monitoring
|
||||
|
||||
## Goal
|
||||
Turn the June 2026 public-edge outage into repeatable, observable infrastructure, with all artifacts stored in the SquareMCP repository (`/home/garfield/hermes-mcp/`).
|
||||
1. Write a clear post-incident / RCA document.
|
||||
2. Create a step-by-step deployment runbook that the next operator can follow without guessing.
|
||||
3. Add probes, metrics, and alerting so the same class of failure is detected and escalated before users notice.
|
||||
|
||||
---
|
||||
|
||||
## Root cause (condensed)
|
||||
- **Public ports 80/443/8080 are owned by a Docker Traefik container.** Its iptables rules intercept all inbound traffic before the host-network K8s nginx-ingress can serve it.
|
||||
- **Traefik had no routers or valid TLS certificates** for the commercial `squaremcp.com` / `fetcherpay.com` domains, so it returned `404 page not found` with a self-signed cert.
|
||||
- **K8s cert-manager held valid certs**, but the active nginx-ingress controller uses `ingressClass=public` while the Ingress resources use `ingressClassName=nginx`, so K8s never reconciled them and could not serve traffic anyway.
|
||||
- **Several Docker backends were stopped**: `fetcherpay-web`, `poste`, `postgres`, `gitea`. The `temporal-ui` container was running, but Traefik was pointed at the Temporal gRPC port (`7233`) instead of the UI's HTTP port (`8080`).
|
||||
|
||||
---
|
||||
|
||||
## Deliverable 1: Post-incident / RCA document
|
||||
**Location:** `hermes-mcp/docs/runbooks/2026-06-14-public-edge-outage-rca.md`
|
||||
|
||||
Sections:
|
||||
- **Summary** — what was down, for how long, user impact.
|
||||
- **Timeline** — detection, mitigation, full restoration.
|
||||
- **Root cause** — Traefik/Docker edge + missing routes/certs + K8s ingress class mismatch + stopped containers.
|
||||
- **Why detection failed** — no synthetic uptime checks, no cert-expiry alerting, no Traefik routing alert, Docker restart did not catch stopped non-Hermes services.
|
||||
- **Remediation actions taken** — static cert extraction, file-provider routers, network attachment, container restarts, port conflict resolution.
|
||||
- **Follow-up work** — this plan’s runbook and monitoring deliverables.
|
||||
|
||||
---
|
||||
|
||||
## Deliverable 2: Deployment runbook
|
||||
**Location:** `hermes-mcp/docs/runbooks/deployment.md`
|
||||
|
||||
The runbook will cover:
|
||||
1. **Pre-flight checks**
|
||||
- Confirm Traefik is attached to required networks (`hermes-net`, `obsidian-net`, `fetcherpay`).
|
||||
- Confirm all expected Docker networks exist.
|
||||
- Confirm static cert directory (`/home/garfield/letsencrypt/manual/certs/`) contains current certs for all file-provider domains.
|
||||
2. **Deploy / update the edge proxy**
|
||||
- Rebuild / restart Traefik from `traefik-compose.yml`.
|
||||
- Validate `tls.yml` routers, services, and certificate entries.
|
||||
- Smoke-test every public host immediately after restart.
|
||||
3. **Deploy Hermes / SquareMCP (K8s path)**
|
||||
- Build, push, update digest in `hermes-k8s.yaml`.
|
||||
- Apply manifests and wait for rollout.
|
||||
- Verify `/health`, `/openapi-living-brief.json`, OAuth endpoints, `/api/pilot-request`.
|
||||
4. **Deploy FetcherPay stack (Docker path)**
|
||||
- Export required env vars (or ensure `.env` is present).
|
||||
- `docker compose -p fetcherpay up -d` for web, api, mail, git, workflow.
|
||||
- Verify `fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`, `workflow.fetcherpay.com`.
|
||||
5. **Certificate renewal / rotation**
|
||||
- When Traefik ACME works vs. when to fall back to K8s cert-manager secret extraction.
|
||||
- Step-by-step secret extraction command template.
|
||||
6. **Rollback checklist**
|
||||
- Revert image digest / compose change, restart, verify.
|
||||
7. **Verification script**
|
||||
- A single `hermes-mcp/scripts/verify-public-endpoints.sh` that curls every critical URL and exits non-zero on failure.
|
||||
|
||||
---
|
||||
|
||||
## Deliverable 3: Diagnostics, metrics, and probes
|
||||
Two viable approaches. The recommended one keeps the current architecture and hardens it; the alternative migrates the edge to K8s.
|
||||
|
||||
### Option A — Harden the existing Traefik edge (recommended)
|
||||
**Why:** Lowest risk, fastest to implement, directly protects against the exact failure modes we just saw.
|
||||
|
||||
Implementation pieces:
|
||||
1. **Synthetic uptime probes (blackbox exporter)**
|
||||
- Add `prom/blackbox-exporter` config inside the repo (e.g. `hermes-mcp/monitoring/blackbox.yml`).
|
||||
- Probe all public URLs every 60s: HTTPS, TLS cert validity, expected HTTP status.
|
||||
- Domains: `hermes.squaremcp.com/openapi-living-brief.json`, `app.squaremcp.com`, `docs.squaremcp.com`, `squaremcp.com`, `www.squaremcp.com`, `tiktok.squaremcp.com`, `fetcherpay.com`, `www.fetcherpay.com`, `workflow.fetcherpay.com`, `mail.fetcherpay.com`, `git.fetcherpay.com`.
|
||||
- Path-specific probes: `POST /api/pilot-request`, `GET /auth/tiktok/start`.
|
||||
2. **Certificate expiry alerting**
|
||||
- Blackbox `probe_ssl_earliest_cert_expiry` alert when any cert has < 7 days left.
|
||||
- Separate alert for Traefik default / self-signed cert (would fire immediately on a routing miss).
|
||||
3. **Traefik routing health**
|
||||
- Enable Traefik metrics endpoint (`--metrics.prometheus`).
|
||||
- Alert on a rising rate of 5xx responses in `traefik_service_requests_total` (filtered on the `code` label) or on `traefik_service_server_up == 0`.
|
||||
4. **Container health & restart policy**
|
||||
- Ensure every commercial service has `restart: unless-stopped` and a Docker `healthcheck`.
|
||||
- Add a simple systemd user timer or cron that runs `docker compose -p fetcherpay ps` and alerts if any expected container is not `Up`.
|
||||
5. **K8s ingress reconciliation check**
|
||||
- A probe/script (`hermes-mcp/scripts/check-k8s-ingress.sh`) that confirms all `squaremcp.com` Ingresses have a matching `ADDRESS` and valid TLS secret.
|
||||
- Alert if `kubectl get ingress -A` shows missing addresses or any cert-manager `Certificate` reports `Ready=False`.
|
||||
6. **Hermes application metrics**
|
||||
- Add a `/metrics` endpoint using `prom-client` in `src/index.ts`.
|
||||
- Instrument request latency, error rate, active OAuth sessions, tool call counts.
|
||||
- Scrape it from Prometheus.
|
||||
7. **Separate readiness probe**
|
||||
- Keep `/health` for liveness; add `/ready` that checks DB/Redis connectivity before reporting ready.
|
||||
8. **Alertmanager + Slack / email**
|
||||
- Deploy `prom/alertmanager` alongside Prometheus.
|
||||
- Route critical alerts (site down, cert expiring, service unhealthy) to a Slack webhook and/or email.
|
||||
9. **Verification script**
|
||||
- `hermes-mcp/scripts/verify-public-endpoints.sh` used in runbook and optionally in CI.
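A minimal sketch of what `verify-public-endpoints.sh` could look like, assuming `curl` is available on the host. The `probe` and `run_all` helpers, the `CURL_BIN` override, and the `--run` flag are hypothetical names introduced for illustration. The `200` status codes for the bare domains are assumptions; only the `302` for `GET /auth/tiktok/start` comes from the verification notes. The `POST /api/pilot-request` check is left out here because a real POST creates data and its payload is not documented in this runbook.

```shell
#!/usr/bin/env bash
# Sketch only: expected status codes for the bare domains are assumptions.

# Command used to fetch a URL; overridable so the logic can be exercised offline.
CURL_BIN="${CURL_BIN:-curl}"

# probe METHOD EXPECTED_STATUS URL
# Returns 0 only when the response status equals EXPECTED_STATUS.
# curl verifies the TLS chain by default, so an untrusted certificate
# makes the request fail and the status is reported as 000.
probe() {
  local method="$1" expected="$2" url="$3" actual
  actual="$("$CURL_BIN" -sS -o /dev/null -m 10 -X "$method" \
    -w '%{http_code}' "$url" </dev/null 2>/dev/null)" || actual="${actual:-000}"
  if [ "$actual" = "$expected" ]; then
    printf 'ok   %s %s -> %s\n' "$method" "$url" "$actual"
  else
    printf 'FAIL %s %s -> %s (expected %s)\n' "$method" "$url" "$actual" "$expected"
    return 1
  fi
}

# run_all: reads "METHOD EXPECTED_STATUS URL" lines on stdin,
# probes each one, and returns non-zero if any probe failed.
run_all() {
  local failures=0 method expected url
  while read -r method expected url; do
    [ -n "$method" ] || continue
    probe "$method" "$expected" "$url" || failures=$((failures + 1))
  done
  [ "$failures" -eq 0 ]
}

# Only hit the network when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
  run_all <<'EOF'
GET 200 https://hermes.squaremcp.com/openapi-living-brief.json
GET 200 https://app.squaremcp.com/
GET 200 https://docs.squaremcp.com/
GET 302 https://tiktok.squaremcp.com/auth/tiktok/start
GET 200 https://fetcherpay.com/
EOF
  exit $?
fi
```

Invoked as `./verify-public-endpoints.sh --run`, the script exits non-zero if any endpoint returns an unexpected status or fails TLS validation, which is the behaviour the runbook and CI would rely on. The remaining domains from the probe list can be appended to the here-document in the same three-column form.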
|
||||
|
||||
### Option B — Migrate public edge to K8s nginx-ingress
|
||||
**Why:** Eliminates the split-ingress complexity that caused the routing confusion.
|
||||
|
||||
Implementation pieces:
|
||||
1. Reconcile `ingressClassName: nginx` → `public` (or change the controller to `nginx`).
|
||||
2. Reconfigure Traefik to not bind public 80/443, or move it to an internal Docker-only role.
|
||||
3. Point public DNS/router directly at the K8s nginx-ingress controller (host-network or NodePort).
|
||||
4. Re-issue all certs via cert-manager and remove the static-cert workaround.
|
||||
5. Still add blackbox exporter / Alertmanager / Hermes metrics as in Option A.
|
||||
|
||||
**Trade-off:** Larger architectural change, risk of another outage during migration, but cleaner long term.
|
||||
|
||||
---
|
||||
|
||||
## Suggested file changes (all under `hermes-mcp/`)
|
||||
- **New:** `docs/runbooks/2026-06-14-public-edge-outage-rca.md`
|
||||
- **New / rewrite:** `docs/runbooks/deployment.md`
|
||||
- **New:** `docs/runbooks/monitoring-playbook.md` (alert runbook)
|
||||
- **New:** `scripts/verify-public-endpoints.sh`
|
||||
- **New:** `scripts/check-k8s-ingress.sh`
|
||||
- **Modify:** `src/index.ts` — add `/metrics`, `/ready`, enhance `/health`
|
||||
- **Modify:** `hermes-k8s.yaml` — add startup probe, resource requests/limits
|
||||
- **New:** `monitoring/blackbox.yml`, `monitoring/prometheus.yml`, `monitoring/alert-rules.yml`, `monitoring/alertmanager.yml`
|
||||
- **Modify:** root `docker-compose.fetcherpay.yml` or create `monitoring/docker-compose.monitoring.yml` if the user prefers not to touch the prod compose file.
|
||||
|
||||
---
|
||||
|
||||
## Phasing recommendation
|
||||
- **Phase 1 (immediate):** RCA doc + runbook + `scripts/verify-public-endpoints.sh`.
|
||||
- **Phase 2 (this week):** blackbox exporter + cert-expiry alerts + container-up check.
|
||||
- **Phase 3 (next sprint):** Hermes `/metrics` + dashboards + Alertmanager Slack routing.
|
||||
- **Phase 4 (future):** decide on the Option B edge migration once Phases 1–3 are stable.
|
||||
88
docs/runbooks/2026-06-14-public-edge-outage-rca.md
Normal file
@@ -0,0 +1,88 @@
|
||||
# Public Edge Outage — Root Cause Analysis
|
||||
|
||||
**Date:** 2026-06-14
|
||||
**Severity:** High — all public `squaremcp.com` and `fetcherpay.com` properties unreachable or certificate-invalid.
|
||||
**Status:** Resolved. All listed commercial domains reachable with valid TLS.
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
On 2026-06-14, every public-facing SquareMCP / FetcherPay domain was either returning `404 page not found` or serving an invalid/default TLS certificate. The root cause was a **misconfigured public edge proxy combined with stopped backends and a K8s ingress class mismatch**. Traffic from the internet never reached the Kubernetes nginx-ingress controller that held valid certificates; instead it was intercepted by a Docker Traefik container that had no routes and no valid certificates for the affected domains.
|
||||
|
||||
---
|
||||
|
||||
## Timeline (all times UTC-4)
|
||||
|
||||
- **~09:30** — User reports that commercial sites are not reachable.
|
||||
- **09:30–10:00** — Diagnosis: the Traefik container owns public `:80`/`:443`, serves its default cert, and has no routers for `*.squaremcp.com` / `*.fetcherpay.com`.
|
||||
- **10:00–10:30** — Added file-provider routers and static K8s-extracted certificates for `squaremcp.com`, `www.squaremcp.com`, `app.squaremcp.com`, `docs.squaremcp.com`, `tiktok.squaremcp.com`.
|
||||
- **10:30–11:00** — Fixed `fetcherpay.com` / `www.fetcherpay.com` by attaching Traefik to the `fetcherpay` Docker network and starting the stopped `fetcherpay-web` container.
|
||||
- **11:00–11:30** — Fixed `workflow.fetcherpay.com` (Traefik was routing to gRPC port `7233` instead of HTTP UI port `8080`).
|
||||
- **11:30–12:00** — Fixed `mail.fetcherpay.com` by starting `poste`, extracting the K8s cert, and adding a Traefik router/service.
|
||||
- **12:00–13:30** — Fixed `git.fetcherpay.com` by starting `postgres` and `gitea`, extracting the K8s cert, adding a router/service, and resolving a host port `2222` conflict by remapping Gitea SSH to `22222`.
|
||||
- **13:30–14:00** — Final verification of all domains and Hermes path-specific routes.
|
||||
|
||||
---
|
||||
|
||||
## Root cause
|
||||
|
||||
### 1. Docker Traefik intercepts all public ingress
|
||||
- The Traefik v3 container binds host ports `80`, `443`, and `8080`.
|
||||
- Docker publishes these ports by inserting `iptables` DNAT rules, with the userland `docker-proxy` process as a fallback listener on each port.
|
||||
- Those rules intercept all inbound public HTTP/S traffic **before** it can reach the host-network MicroK8s nginx-ingress controller.
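One way to confirm this on the host is to list Docker's DNAT rules and see which published ports are redirected into containers. The sketch below assumes the rules live in the `DOCKER` chain of the `nat` table, which is Docker's default layout; the `dnat_targets` helper is a hypothetical name introduced here.

```shell
#!/usr/bin/env bash
# dnat_targets: reads `iptables -t nat -S DOCKER` output on stdin and prints
# one "HOST_PORT -> CONTAINER_IP:PORT" line per DNAT rule.
dnat_targets() {
  awk '/-j DNAT/ {
    port = ""; dest = ""
    for (i = 1; i < NF; i++) {
      if ($i == "--dport") port = $(i + 1)
      if ($i == "--to-destination") dest = $(i + 1)
    }
    if (port != "") print port " -> " dest
  }'
}

# Typical use (needs root, so it is left as a comment):
#   sudo iptables -t nat -S DOCKER | dnat_targets
```

If ports `80` and `443` appear in the output pointing at the Traefik container's address, inbound traffic is being redirected before any host-network listener sees it.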
|
||||
|
||||
### 2. Traefik had no routes or valid TLS for the commercial domains
|
||||
- Traefik’s dynamic config comes from Docker labels and a file provider (`/home/garfield/letsencrypt/manual/tls.yml`).
|
||||
- At the start of the incident the file provider only had a partial/incomplete set of routers.
|
||||
- There were no valid Let’s Encrypt certificates for most domains because GoDaddy DNS-01 returns `DUPLICATE_RECORD` for `_acme-challenge.*` TXT records, blocking issuance.
|
||||
- Result: any request for an unmatched host fell through to Traefik’s default self-signed certificate and returned `404 page not found`.
|
||||
|
||||
### 3. K8s nginx-ingress was unreachable even though it had valid certs
|
||||
- Cert-manager inside MicroK8s held valid TLS secrets for the affected domains.
|
||||
- The active nginx-ingress-microk8s controller is configured for `ingressClass=public`.
|
||||
- Most Ingress resources specify `ingressClassName: nginx`.
|
||||
- Because of the class mismatch, those Ingresses were never reconciled by the active controller, so K8s could not serve traffic even if Traefik had forwarded it.
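The mismatch can be detected mechanically by comparing each Ingress's `spec.ingressClassName` with the class the controller watches. The following is a rough sketch of what `check-k8s-ingress.sh` might contain; the `mismatched_ingresses` helper and the `--run` flag are hypothetical, and the controller class `public` is taken from the finding above. Ingresses that rely on the legacy `kubernetes.io/ingress.class` annotation would show up as `<none>` and need a separate check.

```shell
#!/usr/bin/env bash
# mismatched_ingresses CLASS
# Reads "namespace/name<TAB>ingressClassName" lines on stdin and prints every
# Ingress whose class differs from CLASS (an empty class is shown as <none>).
mismatched_ingresses() {
  awk -F '\t' -v want="$1" '
    $1 != "" && $2 != want { printf "%s\t%s\n", $1, ($2 == "" ? "<none>" : $2) }
  '
}

# Only query the cluster when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
  bad="$(kubectl get ingress -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.ingressClassName}{"\n"}{end}' \
    | mismatched_ingresses public)"
  if [ -n "$bad" ]; then
    printf 'Ingresses not handled by the "public" controller:\n%s\n' "$bad"
    exit 1
  fi
fi
```

Running it with `--run` from a cron job or CI step would have flagged every `ingressClassName: nginx` resource before the outage.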
|
||||
|
||||
### 4. Several Docker backends were stopped
|
||||
- `fetcherpay-web` — stopped.
|
||||
- `poste` (mail) — stopped.
|
||||
- `postgres` and `gitea` (git) — stopped.
|
||||
- `temporal-ui` was running, but the Traefik Docker label pointed at the gRPC port `7233` instead of the HTTP UI port `8080`, causing 502s for `workflow.fetcherpay.com`.
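A small watchdog could catch expected-but-stopped containers like these. The sketch below assumes the container names match the service names listed above, which may not hold if Compose adds a project prefix; `missing_containers` and the `--run` flag are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# missing_containers NAME...
# Reads running container names on stdin (one per line) and prints each
# expected NAME that is not among them.
missing_containers() {
  local running name
  running="$(cat)"
  for name in "$@"; do
    printf '%s\n' "$running" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# Only talk to the Docker daemon when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
  missing="$(docker ps --format '{{.Names}}' \
    | missing_containers fetcherpay-web poste postgres gitea temporal-ui)"
  if [ -n "$missing" ]; then
    printf 'expected containers not running:\n%s\n' "$missing"
    exit 1
  fi
fi
```

Because `docker ps` lists only running containers by default, any name printed by the script is either stopped or was never created. This check would not have caught the `temporal-ui` port mistake, which needs an HTTP probe instead.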
|
||||
|
||||
---
|
||||
|
||||
## Why detection failed
|
||||
|
||||
- No synthetic uptime probes were running against the public endpoints.
|
||||
- No certificate-expiry or certificate-default alerting.
|
||||
- No Traefik routing-health alert.
|
||||
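- Docker `restart: unless-stopped` only restarts a container that was running and then exited; it never brings up a container that was explicitly stopped or never started, and there was no watchdog for expected-but-stopped services.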
|
||||
- K8s ingress reconciliation was not monitored, so the class mismatch went unnoticed.
|
||||
|
||||
---
|
||||
|
||||
## Remediation actions taken
|
||||
|
||||
1. **Rebuilt the Traefik file-provider config** (`/home/garfield/letsencrypt/manual/tls.yml`) with explicit routers and services for every commercial domain.
|
||||
2. **Attached Traefik to the `fetcherpay` Docker network** in `/home/garfield/traefik-compose.yml` so it could reach FetcherPay backends.
|
||||
3. **Extracted valid K8s cert-manager secrets** and loaded them as static TLS certificates in Traefik to bypass the GoDaddy duplicate-TXT issue.
|
||||
4. **Started stopped backend containers**: `fetcherpay-web`, `poste`, `postgres`, `gitea`.
|
||||
5. **Fixed `workflow.fetcherpay.com`** by routing to `temporal-ui:8080` instead of `7233`.
|
||||
6. **Fixed `git.fetcherpay.com`** SSH port conflict by changing the host mapping from `2222:22` to `22222:22` in `/home/garfield/Downloads/docker-compose.prod.yml`.
|
||||
7. **Verified** all public endpoints return expected HTTP codes with TLS certificates that validate.
|
||||
|
||||
---
|
||||
|
||||
## Remaining technical debt
|
||||
|
||||
- K8s nginx-ingress is still effectively bypassed for public traffic. Long-term the ingress classes should be reconciled or the public edge should be migrated to a single controller.
|
||||
- Several `fetcherpay.com` subdomains that depend on stopped services remain down: `api.fetcherpay.com`, `prometheus.fetcherpay.com`, `grafana.fetcherpay.com`, `adminer.fetcherpay.com`, `traefik.fetcherpay.com`.
|
||||
- Secrets are still stored plaintext in manifests and compose files.
|
||||
- No centralized logging, metrics, or alerting exists for Hermes or the edge proxy.
|
||||
|
||||
---
|
||||
|
||||
## Follow-up work
|
||||
|
||||
See `2026-06-14-public-edge-outage-plan.md` for the full runbook / monitoring / probing plan.
|
||||