---
name: tunnel-doctor
description: Diagnoses and fixes conflicts between Tailscale and proxy/VPN tools (Shadowrocket, Clash, Surge) on macOS. Covers five conflict layers - (1) route hijacking, (2) HTTP proxy env var interception, (3) system proxy bypass, (4) SSH ProxyCommand double tunneling, and (5) VM/container runtime proxy propagation (OrbStack/Docker). Includes SOP for remote development via SSH tunnels with proxy-safe Makefile patterns. Use when Tailscale ping works but SSH/HTTP times out, when browser returns 503 but curl works, when git push fails with "failed to begin relaying via HTTP", when Docker pull times out behind TUN/VPN, when setting up Tailscale SSH to WSL instances, or when bootstrapping remote dev environments over Tailscale.
allowed-tools: Read, Grep, Edit, Bash
---

# Tunnel Doctor

Diagnose and fix conflicts when Tailscale coexists with proxy/VPN tools on macOS, with specific guidance for SSH access to WSL instances.

## Five Conflict Layers

Proxy/VPN tools on macOS create conflicts at five independent layers. Layers 1-3 affect Tailscale connectivity; Layer 4 affects SSH git operations; Layer 5 affects VM/container runtimes:

| Layer | What breaks | What still works | Root cause |
|-------|-------------|------------------|------------|
| 1. Route table | Everything (SSH, curl, browser) | `tailscale ping` | `tun-excluded-routes` adds `en0` route overriding Tailscale utun |
| 2. HTTP env vars | `curl`, Python requests, Node.js fetch | SSH, browser | `http_proxy` set without `NO_PROXY` for Tailscale |
| 3. System proxy (browser) | Browser only (HTTP 503) | SSH, `curl` (both with/without proxy) | Browser uses VPN system proxy; DIRECT rule routes via Wi-Fi, not Tailscale utun |
| 4. SSH ProxyCommand double tunnel | `git push/pull` (intermittent) | `ssh -T` (small data) | `connect -H` creates HTTP CONNECT tunnel redundant with Shadowrocket TUN; landing proxy drops large/long-lived transfers |
| 5. VM/Container proxy propagation | `docker pull`, `docker build` | Host `curl`, running containers | VM runtime (OrbStack/Docker Desktop) auto-injects or caches proxy config; removing proxy makes it worse (VM traffic via TUN → TLS timeout) |

## Diagnostic Workflow

### Step 1: Identify the Symptom

Determine which scenario applies:

- **Browser returns HTTP 503, but `curl` and SSH both work** → System proxy bypass conflict (Step 2C)
- **`local.<domain>` fails in browser/default `curl`, but direct/no-proxy request works** → Local vanity domain proxy interception (Step 2C-1)
- **Tailscale ping works, SSH works, but curl/HTTP times out** → HTTP proxy env var conflict (Step 2A)
- **Tailscale ping works, SSH/TCP times out** → Route conflict (Step 2B)
- **Remote dev server auth redirects to `localhost` → browser can't follow** → SSH tunnel needed (Step 2D)
- **`make status` / scripts that curl localhost fail when a proxy is set** → localhost proxy interception (Step 2E)
- **`git push/pull` fails with `FATAL: failed to begin relaying via HTTP`** → SSH double tunnel (Step 2F)
- **`docker build` `RUN apk/apt` fails with `Connection refused` instantly** → OrbStack transparent proxy + TUN conflict (Step 2G-1, fix: `--network host`)
- **`docker pull` fails with `TLS handshake timeout`** → VM proxy misconfiguration (Step 2G-2, fix: `docker.json` with `host.internal`)
- **Container healthcheck `(unhealthy)` but app runs fine** → Lowercase proxy env var leak (Step 2G-4, fix: clear `http_proxy` + `HTTP_PROXY`)
- **`docker build` can't fetch base images** → VM/container proxy propagation (Step 2G)
- **`git clone` fails with `Connection closed by 198.18.x.x`** → TUN DNS hijack for SSH (Step 2H)
- **SSH connects but `operation not permitted`** → Tailscale SSH ACL issue (Step 4)
- **SSH connects but `be-child ssh` exits code 1** → WSL snap sandbox issue (Step 5)
- **TCP port 22 reachable (`nc -z` succeeds) but SSH fails with `kex_exchange_identification: Connection closed`** → Tailscale SSH proxy intercept on WSL (Step 5A)
- **`tailscale ssh` returns "not available on App Store builds"** → Wrong Tailscale distribution on macOS (Step 5B)

**Key distinctions**:

- SSH does NOT use `http_proxy`/`NO_PROXY` env vars. If SSH works but HTTP doesn't → Layer 2.
- `curl` uses the `http_proxy` env var, NOT the system proxy. The browser uses the system proxy (set by the VPN). If `curl` works but the browser doesn't → Layer 3.
- If `tailscale ping` works but regular `ping` doesn't → Layer 1 (route table corrupted).
- If `ssh -T git@github.com` works but `git push` fails intermittently → Layer 4 (double tunnel).
- If host `curl https://...` works but `docker pull` times out → Layer 5 (VM proxy propagation).
- If `docker pull` works but `docker build` `RUN apk add` fails instantly with `Connection refused` → OrbStack transparent proxy broken by TUN (Step 2G-1).
- If a container healthcheck shows `(unhealthy)` but the app works → lowercase `http_proxy` leaked into the container (Step 2G-4).
- If DNS resolves to `198.18.x.x` virtual IPs → TUN DNS hijack (Step 2H).
- If `nc -z` succeeds on port 22 but SSH gets no banner (`kex_exchange_identification`) → Tailscale SSH proxy intercept (Step 5A). Confirm with `tcpdump -i any port 22` on the remote — 0 packets means Tailscale intercepts above the kernel.
- If `tailscale ssh` fails with "not available on App Store builds" → install Standalone Tailscale (Step 5B).

### Fast Path: Run Automated Checks

For common macOS conflicts (env proxy, system proxy exceptions, direct/proxy path split, local TLS trust), run:

```bash
python3 scripts/quick_diagnose.py --host local.claude4.dev --url https://local.claude4.dev/health
```

Optional route ownership check for a Tailscale destination:

```bash
python3 scripts/quick_diagnose.py --host <host> --url http://<host>:<port>/health --tailscale-ip <100.x.x.x>
```

Interpretation:

- `direct=PASS` + `forced_proxy=FAIL` = host must bypass proxy (`skip-proxy` + `NO_PROXY`).
- `strict_tls=FAIL` + `direct=PASS` = path is reachable; trust issue only (install/trust local CA).
- `host in scutil exceptions: no` = browser/system clients still likely proxied.

### Step 2A: Fix HTTP Proxy Environment Variables

Check if proxy env vars are intercepting Tailscale HTTP traffic:

```bash
env | grep -i proxy
```

**Broken output** — proxy is set but `NO_PROXY` doesn't exclude Tailscale:

```
http_proxy=http://127.0.0.1:1082
https_proxy=http://127.0.0.1:1082
NO_PROXY=localhost,127.0.0.1    ← Missing Tailscale!
```

**Fix** — add the Tailscale MagicDNS domain + CIDR to `NO_PROXY`:

```bash
export NO_PROXY=localhost,127.0.0.1,.ts.net,100.64.0.0/10,192.168.*,10.*,172.16.*
```

| Entry | Covers | Why |
|-------|--------|-----|
| `.ts.net` | MagicDNS domains (`host.tailnet.ts.net`) | Matched before DNS resolution |
| `100.64.0.0/10` | Tailscale IPs (`100.64.*` – `100.127.*`) | Precise CIDR, no public IP false positives |
| `192.168.*,10.*,172.16.*` | RFC 1918 private networks | LAN should never be proxied |

**Two layers complement each other**: `.ts.net` handles domain-based access, `100.64.0.0/10` handles direct IP access.

**NO_PROXY syntax pitfalls** — see [references/proxy_conflict_reference.md](references/proxy_conflict_reference.md) for the compatibility matrix.

**Go `net/http` CIDR caveat**: Go's standard `net/http` does NOT support CIDR notation in `NO_PROXY`. Setting `NO_PROXY=100.64.0.0/10` works for curl and Python, but Go programs (including Tailscale-adjacent tooling) will still send traffic through the proxy.
The fix is to use MagicDNS hostnames (e.g., `workstation-4090-wsl`) instead of raw IPs, or add explicit hostnames to `NO_PROXY`:

```bash
# WRONG for Go programs — CIDR is silently ignored
NO_PROXY=100.64.0.0/10 go-program http://100.101.102.103:8002/health  # → goes through proxy

# CORRECT — use a hostname (matched as suffix) or an explicit IP
export NO_PROXY=localhost,127.0.0.1,.ts.net,workstation-4090-wsl,100.101.102.103,192.168.*,10.*,172.16.*
```

This is especially relevant when accessing Tailscale services from Go-based tools (e.g., custom CLIs, Go test suites hitting remote APIs).

Verify the fix:

```bash
# Both must return HTTP 200:
NO_PROXY="...(new value)..." curl -s --connect-timeout 5 http://<host>.ts.net:<port>/health -w "HTTP %{http_code}\n"
NO_PROXY="...(new value)..." curl -s --connect-timeout 5 http://<tailscale-ip>:<port>/health -w "HTTP %{http_code}\n"
```

Then persist in shell config (`~/.zshrc` or `~/.bashrc`).

### Step 2B: Detect Route Conflicts

Check if a proxy tool hijacked the Tailscale CGNAT range:

```bash
route -n get <tailscale-ip>
```

**Healthy output** — traffic goes through the Tailscale interface:

```
destination: 100.64.0.0
  interface: utun7    # Tailscale interface (utunN varies)
```

**Broken output** — proxy hijacked the route:

```
destination: 100.64.0.0
    gateway: 192.168.x.1    # Default gateway
  interface: en0            # Physical interface, NOT Tailscale
```

**Important**: Not all `utun` interfaces are Tailscale's. Verify which utun belongs to Tailscale before concluding the route is correct:

```bash
# Find Tailscale's utun interface (has a 100.x.x.x IP)
ifconfig | grep -B2 'inet 100\.'
```

Quick indicators by MTU:

- **MTU 1280** → typically Tailscale
- **MTU 4064** → typically Shadowrocket TUN

If `route -n get <tailscale-ip>` shows traffic going to a utun with MTU 4064, it is hitting Shadowrocket's TUN, not Tailscale — this is still a route conflict even though the interface name starts with `utun`.
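The interface-identification step can be scripted instead of eyeballed. A minimal sketch that parses `ifconfig`-style text and returns the utun holding a CGNAT address (the sample output, function name, and interface numbers are illustrative):

```python
import re

def find_tailscale_utun(ifconfig_text: str):
    """Return (interface, mtu) for the utun that holds a 100.64.0.0/10 address."""
    current, mtu = None, None
    for line in ifconfig_text.splitlines():
        header = re.match(r"^(utun\d+): flags=.* mtu (\d+)", line)
        if header:
            current, mtu = header.group(1), int(header.group(2))
        # Second octet 64-127 keeps the match inside the CGNAT /10
        elif current and re.search(r"\binet 100\.(6[4-9]|[7-9]\d|1[01]\d|12[0-7])\.", line):
            return current, mtu
    return None

sample = """\
utun6: flags=8051<UP,POINTOPOINT,RUNNING> mtu 4064
	inet 198.18.0.1 --> 198.18.0.1 netmask 0xffffffff
utun7: flags=8051<UP,POINTOPOINT,RUNNING> mtu 1280
	inet 100.101.102.103 --> 100.101.102.103 netmask 0xffffffff
"""
print(find_tailscale_utun(sample))  # → ('utun7', 1280)
```

On a real machine you would feed it `subprocess.run(["ifconfig"], capture_output=True, text=True).stdout`; the MTU it returns doubles as the Tailscale-vs-Shadowrocket indicator described above.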
Confirm with the full route table:

```bash
netstat -rn | grep 100.64
```

Two competing routes indicate a conflict:

```
100.64/10   192.168.x.1   UGSc   en0     ← Proxy added this (wins)
100.64/10   link#N        UCSI   utun7   ← Tailscale route (loses)
```

**Root cause**: On macOS, `UGSc` (Static Gateway) takes priority over `UCSI` (Cloned Static Interface) for the same prefix length.

### Step 2C: Fix System Proxy Bypass (Browser 503)

**Symptom**: Browser shows HTTP 503 for `http://<tailscale-ip>:<port>`, but both `curl --noproxy '*'` and `curl` (with the proxy env var) return 200. SSH also works.

**Root cause**: The browser uses the system proxy configured by the VPN profile (Shadowrocket/Clash/Surge). The proxy matches `IP-CIDR,100.64.0.0/10,DIRECT` and tries to connect directly — but "directly" means via the Wi-Fi interface (en0), NOT through Tailscale's utun interface. The proxy process itself doesn't have a route to Tailscale IPs, so the connection fails with 503.

**Diagnosis**:

```bash
# curl with the proxy env var works (curl connects to the proxy port, but traffic flows differently)
curl -s -o /dev/null -w "%{http_code}" http://<tailscale-ip>:<port>/
# → 200

# Browser gets 503 because it goes through the VPN system proxy, not the http_proxy env var
```

**Fix** — add the Tailscale CGNAT range to `skip-proxy` in the proxy tool config. For Shadowrocket, in `[General]`:

```
skip-proxy = 192.168.0.0/16, 10.0.0.0/8, 172.16.0.0/12, 100.64.0.0/10, localhost, *.local, captive.apple.com
```

`skip-proxy` tells the system "bypass the proxy entirely for these addresses." The browser then connects directly through the OS network stack, where Tailscale's routing table correctly handles the traffic.

**Why `skip-proxy` works but `tun-excluded-routes` doesn't**:

- `skip-proxy`: Bypasses the HTTP proxy layer only. Traffic still flows through the TUN interface and Tailscale utun handles it. Safe.
- `tun-excluded-routes`: Removes the CIDR from TUN routing entirely. This creates a competing `en0` route that overrides Tailscale. Breaks everything.
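Why `100.64.0.0/10` is exactly the right `skip-proxy` entry can be sanity-checked with Python's `ipaddress` module — a sketch, with illustrative sample addresses:

```python
import ipaddress

CGNAT = ipaddress.ip_network("100.64.0.0/10")  # Tailscale's address pool

def needs_bypass(ip: str) -> bool:
    """True if the destination is a Tailscale IP the proxy must skip."""
    return ipaddress.ip_address(ip) in CGNAT

print(needs_bypass("100.64.0.1"))       # → True  (first CGNAT address)
print(needs_bypass("100.127.255.255"))  # → True  (last CGNAT address)
print(needs_bypass("100.128.0.1"))      # → False (public IP, must stay proxied)
```

The `/10` boundary at `100.127.255.255` is why broader entries like `100.*` would be wrong: they would also bypass public `100.128.0.0+` addresses.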
#### Step 2C-1: Fix Local Vanity Domain Interception (`local.<domain>`)

**Symptom**: `https://local.<domain>` fails in the browser or default `curl`, but succeeds with a direct/no-proxy command:

```bash
env -u http_proxy -u https_proxy curl -k -I https://local.<domain>/health  # -> 200
curl -I https://local.<domain>/health  # -> proxy CONNECT, then TLS reset/failure
```

**Root cause**: The domain is routed through the system/shell proxy instead of the local direct path.

**Fix**:

1. Add the domain to the proxy app bypass list (`skip-proxy` for Shadowrocket).
2. Add the domain to the shell bypass list (`NO_PROXY`/`no_proxy`).
3. If local TLS uses an internal CA, trust the local root certificate.

```bash
# ~/.zshrc
export NO_PROXY=localhost,127.0.0.1,.ts.net,100.64.0.0/10,192.168.*,10.*,172.16.*,local.<domain>,www.local.<domain>
export no_proxy="$NO_PROXY"
```

**Verification**:

```bash
python3 scripts/quick_diagnose.py --host local.<domain> --url https://local.<domain>/health
```

Expected:

- `host in NO_PROXY: yes`
- `host in scutil exceptions: yes`
- `ambient=PASS` and `direct=PASS`

### Step 2D: Fix Auth Redirect for Remote Dev (SSH Tunnel)

**Symptom**: The dev server runs on a remote machine (e.g., a Mac Mini via Tailscale). You access `http://<tailscale-ip>:3010` in the browser. Login/signup works, but after auth, the app redirects to `http://localhost:3010/`, which fails — `localhost` on your machine isn't running the dev server.

**Root cause**: The app's `APP_URL` (or equivalent) is set to `http://localhost:3010`. Auth libraries (Better-Auth, NextAuth, etc.) use this URL for callback redirects. Changing `APP_URL` to the Tailscale IP introduces Shadowrocket proxy conflicts and breaks local development on the remote machine.

**Fix** — SSH local port forwarding.
This avoids all three conflict layers entirely:

```bash
# Forward local port 3010 to the remote machine's localhost:3010
ssh -NL 3010:localhost:3010 <user>@<remote-host>

# Or with autossh for auto-reconnect (recommended for long sessions)
autossh -M 0 -f -N -L 3010:localhost:3010 \
  -o "ServerAliveInterval=30" \
  -o "ServerAliveCountMax=3" \
  -o "ExitOnForwardFailure=yes" \
  <user>@<remote-host>
```

Now access `http://localhost:3010` in the browser. Auth redirects to `localhost:3010` → tunnel → remote dev server → works correctly.

**Why this is the best approach**:

- No `.env` changes needed — `APP_URL=http://localhost:3010` works everywhere
- No Shadowrocket conflicts — `localhost` is always in `skip-proxy`
- No code changes — same behavior as local development
- Industry standard — VS Code Remote SSH and GitHub Codespaces use the same pattern

**Install autossh**: `brew install autossh` (macOS) or `apt install autossh` (Linux)

**Kill background tunnel**: `pkill -f 'autossh.*<remote-host>'`

### Step 2E: Fix localhost Proxy Interception in Scripts

**Symptom**: Makefile targets or scripts that `curl` localhost (health checks, warmup routes) fail or time out when `http_proxy` is set globally in the shell.

**Root cause**: `http_proxy=http://127.0.0.1:1082` is set in `~/.zshrc` but `no_proxy` doesn't include `localhost`. All curl commands send localhost requests through the proxy.

**Fix** — add `--noproxy localhost` to all localhost curl commands in scripts:

```makefile
# WRONG — fails when http_proxy is set
@curl -sf http://localhost:9000/minio/health/live && echo "OK"

# CORRECT — always bypasses the proxy for localhost
@curl --noproxy localhost -sf http://localhost:9000/minio/health/live && echo "OK"
```

Alternatively, set `no_proxy` globally in `~/.zshrc`:

```bash
export no_proxy=localhost,127.0.0.1
```

### Step 2F: Fix SSH ProxyCommand Double Tunnel (git push/pull failures)

**Symptom**: `ssh -T git@github.com` succeeds consistently, but `git push` or `git pull` fails intermittently with:

```
FATAL: failed to begin relaying via HTTP.
Connection closed by UNKNOWN port 65535
```

Small operations (auth, fetch metadata) work; large data transfers fail.

**Root cause**: When Shadowrocket TUN is active, it already routes all TCP traffic through its VPN tunnel. If the SSH config also uses `ProxyCommand connect -H`, data flows through two proxy layers — the landing proxy drops large/long-lived HTTP CONNECT connections.

**Diagnosis**:

```bash
# 1. Confirm Shadowrocket TUN is active
ifconfig | grep '^utun'

# 2. Check SSH config for ProxyCommand
grep -A5 'Host github.com' ~/.ssh/config

# 3. Confirm: removing ProxyCommand fixes push
GIT_SSH_COMMAND="ssh -o ProxyCommand=none" git push origin main
```

**Fix** — remove ProxyCommand and switch to `ssh.github.com:443`. See [references/proxy_conflict_reference.md § SSH ProxyCommand and Git Operations](references/proxy_conflict_reference.md) for the full SSH config, why port 443 helps, and fallback options when the VPN is off.

### Step 2G: Fix VM/Container Runtime Proxy Propagation (Docker pull/build failures)

**Symptom**: `docker pull` or `docker build` fails with `net/http: TLS handshake timeout`, `Connection refused` from Alpine/Debian repos, or `Internal Server Error` from `auth.docker.io`, while host `curl` to the same URLs works fine.

**Applies to**: OrbStack, Docker Desktop, or any VM-based Docker runtime on macOS with Shadowrocket/Clash TUN active.

**Root cause**: VM-based Docker runtimes (OrbStack, Docker Desktop) run the Docker daemon inside a lightweight VM. The VM's outbound traffic takes a different network path than host processes:

```
Host process (curl):  Process → TUN (Shadowrocket) → landing proxy → internet  ✅
VM process (Docker):  Docker daemon → VM bridge → host network → TUN → ???     ❌
```

The TUN handles host-originated traffic correctly but may drop or delay VM-bridged traffic (different TCP stack, MTU, keepalive behavior).
**Critical distinction: `docker pull` and `docker build` use different proxy paths**:

| Operation | Proxy source | What controls it |
|-----------|-------------|------------------|
| `docker pull` | Docker daemon config | `~/.orbstack/config/docker.json` or `docker info` |
| `docker build` (`RUN apt/apk`) | Build container env | `--build-arg http_proxy=...` or `--network host` |
| `docker run` | Container env | `-e http_proxy=...` or inherited from daemon |

Fixing `docker.json` alone will NOT fix `docker build` — the `RUN` commands inside the build container don't inherit daemon proxy settings.

**Diagnosis** — identify which sub-problem:

```bash
# 1. Can the Docker daemon pull images?
docker pull --quiet alpine:latest 2>&1

# 2. Can a RUN command inside a build reach the internet?
docker build --no-cache - <<'EOF' 2>&1
FROM alpine:latest
RUN apk update && echo "APK OK"
EOF

# 3. Can a running container reach the internet?
docker run --rm alpine:latest sh -c "apk update 2>&1 | head -3"
```

**Four sub-problems and their fixes**:

#### 2G-1: `docker build` fails but host works (most common with OrbStack + Shadowrocket)

**Symptom**: `RUN apk add` or `RUN apt-get install` inside `docker build` fails with `Connection refused` instantly (< 0.2s), even though host `curl` to the same URL works.

**Root cause**: OrbStack's `network_proxy: auto` creates a transparent proxy inside the VM that intercepts all HTTPS traffic. When Shadowrocket TUN is also active, the transparent proxy's upstream connection breaks — it redirects HTTPS to `127.0.0.1` inside the VM, which has nothing listening.
**Diagnosis**:

```bash
# Verify: inside the container, HTTPS goes to 127.0.0.1 (broken transparent proxy)
docker run --rm alpine:latest sh -c "wget -q --timeout=5 -O /dev/null https://dl-cdn.alpinelinux.org/ 2>&1"
# → "wget: can't connect to remote host (127.0.0.1): Connection refused"
#                                      ^^^^^^^^^^^^ This is the smoking gun

# Verify: --network host bypasses the VM bridge and works
docker run --rm --network host alpine:latest sh -c "apk update 2>&1 | head -3"
# → "v3.23.x ... OK: 27431 distinct packages available"  ← Works!
```

**Fix** — use `--network host` for docker build:

```bash
docker build --network host -f Dockerfile -t myimage .
```

This bypasses OrbStack's VM network bridge entirely. The build container uses the host's network stack directly, where Shadowrocket TUN correctly handles traffic.

**Trade-off**: `--network host` disables build-time network isolation. For CI/CD, prefer fixing the proxy config (2G-2). For local development, `--network host` is the pragmatic fix.

**Permanent fix** — if all your builds need this, add to `~/.docker/daemon.json` or use a shell alias:

```bash
# Shell alias (add to ~/.zshrc)
alias docker-build='docker build --network host'
```

#### 2G-2: OrbStack auto-detects and caches proxy config

OrbStack's `network_proxy: auto` reads `http_proxy` from the shell environment and configures the Docker daemon. The config is stored in `~/.orbstack/config/docker.json`.
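A cached `docker.json` can be audited mechanically before restarting OrbStack. A sketch — the function name and the required-entry list are assumptions based on the `host.internal` fix described in this step, not an OrbStack API:

```python
import json

# Entries the 2G-2 fix expects: proxy via host.internal, local + Tailscale ranges un-proxied
REQUIRED_NO_PROXY = {"localhost", "127.0.0.1", "100.64.0.0/10", "host.internal"}

def audit_docker_json(text: str) -> list[str]:
    """Return a list of problems found in a docker.json 'proxies' section."""
    problems = []
    proxies = json.loads(text).get("proxies", {})
    if "host.internal" not in proxies.get("http-proxy", ""):
        problems.append("http-proxy does not point at host.internal")
    missing = REQUIRED_NO_PROXY - set(proxies.get("no-proxy", "").split(","))
    problems.extend(f"no-proxy missing {m}" for m in sorted(missing))
    return problems

# A typical broken config: proxy points at 127.0.0.1 (the VM loopback), no-proxy too narrow
sample = '{"proxies": {"http-proxy": "http://127.0.0.1:1082", "no-proxy": "localhost"}}'
for problem in audit_docker_json(sample):
    print(problem)
```

Run it against `open(os.path.expanduser('~/.orbstack/config/docker.json')).read()`; an empty result means the file already matches the fix below.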
**Key behaviors**:

- `network_proxy: auto` — OrbStack reads the host env and creates a transparent proxy in the VM
- `network_proxy: none` — Disables the transparent proxy, but VM bridge traffic still routes through the TUN (may time out)
- `docker.json` — Controls the `docker pull` proxy, NOT `docker build` RUN commands

**Diagnosis**:

```bash
# Check all three layers
echo "=== OrbStack config ==="
orbctl config get network_proxy

echo "=== docker.json (daemon proxy) ==="
cat ~/.orbstack/config/docker.json

echo "=== Docker info (effective proxy) ==="
docker info | grep -iE "proxy|No Proxy"
```

**Fix** — configure `docker.json` with `host.internal` (OrbStack resolves this to the host IP):

```bash
python3 -c "
import json, os
config = {
    'proxies': {
        'http-proxy': 'http://host.internal:1082',
        'https-proxy': 'http://host.internal:1082',
        'no-proxy': 'localhost,127.0.0.1,::1,192.168.128.0/24,100.64.0.0/10,host.internal,*.local'
    }
}
path = os.path.expanduser('~/.orbstack/config/docker.json')
json.dump(config, open(path, 'w'), indent=2)
print('Written:', path)
"

# Full restart required
orbctl stop && sleep 3 && orbctl start
```

**Important**: Use `host.internal` (OrbStack-specific), NOT `127.0.0.1` (points to the VM loopback) and NOT `host.docker.internal` (may not resolve in all contexts).

**Why NOT remove the proxy**: When the TUN is active, removing the Docker proxy means VM traffic goes directly through the bridge → TUN path, which causes TLS handshake timeouts. The proxy provides a working outbound channel.

#### 2G-3: Removing the proxy makes Docker worse (counter-intuitive)

| Docker config | Traffic path | Result |
|---------------|-------------|--------|
| Proxy ON (`127.0.0.1`), no `no-proxy` | Docker → VM proxy → ??? | `docker pull` may work, localhost probes ❌ |
| Proxy ON (`host.internal`), + `no-proxy` | External: Docker → host proxy → internet; Local: direct | **Both work ✅** |
| Proxy OFF (`network_proxy: none`) | Docker → VM bridge → host → TUN → internet | TLS timeout ❌ |
| **`--network host` (build only)** | **Build container → host network → TUN → internet** | **Build works ✅** |

**Decision tree**:

- `docker pull` broken → Fix `docker.json` with the `host.internal` proxy (2G-2)
- `docker build` broken → Use `--network host` (2G-1) OR pass `--build-arg http_proxy=http://host.internal:1082`
- Both broken → Fix both: `docker.json` + `--network host`

#### 2G-4: Deploy scripts and container healthchecks probe localhost through the proxy

Deploy scripts that `curl` localhost inside containers, or Docker healthchecks that use `wget http://localhost`, will route through the proxy if env vars leak into the container.

**Common symptoms**:

- Container healthcheck shows `(unhealthy)` but the app inside is running fine
- `wget: can't connect to remote host (127.0.0.1): Connection refused` in healthcheck logs (proxy port, not app port)

**Root cause**: Docker inherits uppercase AND lowercase proxy env vars from the host. Many tools only clear uppercase (`HTTP_PROXY=`) but forget lowercase (`http_proxy=http://127.0.0.1:1082`). The healthcheck `wget` uses lowercase.
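The both-cases trap can be checked mechanically. A sketch (the helper name and sample environment are illustrative) that reports which proxy variables are still live, the way `wget`/`curl` would see them:

```python
PROXY_VARS = ("http_proxy", "HTTP_PROXY", "https_proxy", "HTTPS_PROXY")

def leaked_proxies(env: dict) -> dict:
    """Proxy vars that are set AND non-empty — any one can hijack a localhost probe."""
    return {k: v for k, v in env.items() if k in PROXY_VARS and v}

# Host cleared only the uppercase var — the lowercase one still leaks into the container:
container_env = {"HTTP_PROXY": "", "http_proxy": "http://127.0.0.1:1082", "PATH": "/usr/bin"}
print(leaked_proxies(container_env))  # → {'http_proxy': 'http://127.0.0.1:1082'}
```

Against a real container, feed it the output of `docker exec <container> env` parsed into a dict; an empty result is what the compose fix below aims for.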
**Fix in docker-compose.yml** — clear BOTH cases:

```yaml
environment:
  # Must clear both uppercase and lowercase — wget/curl check different vars
  - HTTP_PROXY=
  - HTTPS_PROXY=
  - http_proxy=
  - https_proxy=
  - NO_PROXY=*
  - no_proxy=*
```

**Fix in deploy scripts**:

```bash
_local_bypass="localhost,127.0.0.1,::1"
export NO_PROXY="${_local_bypass}${NO_PROXY:+,${NO_PROXY}}"
export no_proxy="$NO_PROXY"

# Use 127.0.0.1 instead of localhost in probe URLs (some proxy implementations
# match no-proxy against the literal host string only, so "localhost" and the
# resolved IP are treated as different entries)
curl http://127.0.0.1:3001/health   # ✅ bypasses proxy
curl http://localhost:3001/health   # ❌ may still go through proxy
```

**Verify the fix**:

```bash
# Docker proxy check (should show proxy + no-proxy)
docker info | grep -iE "proxy|No Proxy"

# Pull test
docker pull --quiet hello-world

# Build test (the real verification)
docker build --network host --no-cache - <<'EOF'
FROM alpine:latest
RUN apk update && echo "BUILD OK"
EOF

# Container env check (no proxy leak)
docker exec <container> env | grep -i proxy
# Expected: all empty or not set
```

### Step 2H: Fix TUN DNS Hijack for SSH/Git (198.18.x.x virtual IPs)

**Symptom**: `git clone/fetch/push` fails with `Connection closed by 198.18.0.x port 443`. `ssh -T git@github.com` may also fail. DNS resolution returns `198.18.x.x` addresses instead of real IPs.

**Root cause**: Shadowrocket TUN intercepts all DNS queries and returns virtual IPs in the `198.18.0.0/15` range. It then routes traffic to these virtual IPs through the TUN for protocol-aware proxying. HTTP/HTTPS works because the landing proxy understands these protocols, but SSH-over-443 (used by GitHub) gets mishandled — the TUN sees port 443 traffic, expects HTTPS, and drops the SSH handshake.
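Whether a resolved address is one of these fake IPs is easy to test, since the pool is the `198.18.0.0/15` benchmarking range (RFC 2544). A sketch using Python's `ipaddress` module, with illustrative sample addresses:

```python
import ipaddress

FAKE_IP_POOL = ipaddress.ip_network("198.18.0.0/15")  # range used for fake-IP DNS answers

def is_fake_ip(resolved: str) -> bool:
    """True if DNS handed back a TUN virtual IP instead of a real address."""
    return ipaddress.ip_address(resolved) in FAKE_IP_POOL

print(is_fake_ip("198.18.0.26"))    # → True  (virtual IP from the TUN resolver)
print(is_fake_ip("140.82.112.35"))  # → False (a real GitHub address)
```

Pairing this with `socket.gethostbyname("ssh.github.com")` gives a one-line scripted version of the `nslookup` check below.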
**Diagnosis**:

```bash
# DNS returns a virtual IP (TUN hijack)
nslookup ssh.github.com
# → 198.18.0.26  ← Shadowrocket virtual IP, NOT a real GitHub IP

# Direct IP works (bypasses the DNS hijack)
ssh -o HostName=140.82.112.35 -o Port=443 git@github.com
# → "Hi user! You've successfully authenticated"
```

**Fix** — use a direct IP in the SSH config to bypass the DNS hijack:

```bash
# ~/.ssh/config
Host github.com
    HostName 140.82.112.35   # GitHub SSH server real IP (bypasses TUN DNS hijack)
    Port 443
    User git
    ServerAliveInterval 60
    ServerAliveCountMax 3
    IdentityFile ~/.ssh/id_ed25519
```

**GitHub SSH server IPs** (as of 2026, verify with `dig +short ssh.github.com @8.8.8.8`):

- `140.82.112.35` (primary)
- `140.82.112.36` (alternate)

**Trade-off**: Hardcoded IPs break if GitHub changes them. Monitor `ssh -T git@github.com` — if it starts failing, update the IP. A cron job can automate the check:

```bash
# Weekly check (add to crontab)
0 9 * * 1 dig +short ssh.github.com @8.8.8.8 | head -1 > /tmp/github-ssh-ip.txt
```

**Alternative** (if you control the Shadowrocket rules): Add the GitHub SSH IPs to a DIRECT rule so the TUN passes them through without protocol inspection:

```
IP-CIDR,140.82.112.0/24,DIRECT
IP-CIDR,192.30.252.0/22,DIRECT
```

This is more robust but requires proxy tool config access.

### Step 3: Fix Proxy Tool Configuration

Identify the proxy tool and apply the appropriate fix. See [references/proxy_conflict_reference.md](references/proxy_conflict_reference.md) for detailed instructions per tool.

**Key principle**: Do NOT use `tun-excluded-routes` to exclude `100.64.0.0/10`. This causes the proxy to add a `→ en0` route that overrides Tailscale. Instead, let the traffic enter the proxy TUN and use a DIRECT rule to pass it through.
**Universal fix** — add this rule to any proxy tool:

```
IP-CIDR,100.64.0.0/10,DIRECT
IP-CIDR,fd7a:115c:a1e0::/48,DIRECT
```

After applying fixes, verify:

```bash
route -n get <tailscale-ip>
# Should show the Tailscale utun interface, NOT en0
```

### Step 4: Configure Tailscale SSH ACL

If SSH connects but returns `operation not permitted`, the Tailscale ACL may require browser authentication for each connection. At [Tailscale ACL admin](https://login.tailscale.com/admin/acls), ensure the SSH section uses `"action": "accept"`:

```json
"ssh": [
    {
        "action": "accept",
        "src": ["autogroup:member"],
        "dst": ["autogroup:self"],
        "users": ["autogroup:nonroot", "root"]
    }
]
```

**Note**: `"action": "check"` requires browser authentication each time. Change it to `"accept"` for non-interactive SSH access.

### Step 5: Fix WSL Tailscale Installation

If SSH connects and the ACL passes, but the connection fails with `be-child ssh` exit code 1 in the tailscaled logs, the snap-installed Tailscale has sandbox restrictions that prevent SSH shell execution.

**Diagnosis** — check the WSL tailscaled logs:

```bash
# For snap installs:
sudo journalctl -u snap.tailscale.tailscaled -n 30 --no-pager
# For apt installs:
sudo journalctl -u tailscaled -n 30 --no-pager
```

Look for:

```
access granted to user@example.com as ssh-user "username"
starting non-pty command: [/snap/tailscale/.../tailscaled be-child ssh ...]
Wait: code=1
```

**Fix** — replace the snap install with the apt package:

```bash
# Remove snap version
sudo snap remove tailscale

# Install apt version
curl -fsSL https://tailscale.com/install.sh | sh

# Start with SSH enabled
sudo tailscale up --ssh
```

**Important**: The new installation may assign a different Tailscale IP. Check with `tailscale status --self`.

### Step 5A: Fix Tailscale SSH Proxy Silent Failure on WSL

**Symptom**: TCP port 22 is reachable (`nc -z -w 5 <tailscale-ip> 22` succeeds), but SSH fails immediately with:

```
kex_exchange_identification: Connection closed by remote host
```

No SSH banner is ever received.
This happens even with apt-installed Tailscale (not snap).

**Root cause**: When `tailscale up --ssh` is enabled on WSL, Tailscale intercepts port 22 connections at the application layer (above the kernel network stack). If Tailscale's built-in SSH proxy malfunctions, it accepts the TCP connection but immediately closes it before sending the SSH banner.

**Key diagnostic** — on the WSL instance:

```bash
# This will show 0 packets even during active SSH attempts
sudo tcpdump -i any port 22 -c 5 -w /dev/null 2>&1
```

Zero packets means Tailscale is intercepting connections before they reach the kernel network stack. The kernel's `sshd` never sees the connection.

**Distinction from Step 5**: Step 5 covers snap sandbox issues where `be-child ssh` fails. This is a different problem — Tailscale's SSH proxy itself silently fails, regardless of installation method.

**Fix** — disable Tailscale's SSH proxy and use regular sshd:

```bash
# On the WSL instance:
sudo tailscale up --ssh=false

# Verify sshd is running
sudo service ssh status
# If not running:
sudo service ssh start

# Verify from the client machine:
ssh -o ConnectTimeout=10 <user>@<tailscale-ip> 'echo SSH_OK'
```

After disabling Tailscale SSH, connections go through the kernel network stack to `sshd` as normal. The Tailscale ACL `"action": "accept"` from Step 4 is no longer relevant — authentication is handled by `sshd` using SSH keys or passwords.

**When to keep `--ssh` enabled**: Only if you specifically need Tailscale's SSH features (ACL-based access control, no SSH key management). If standard sshd works, prefer `--ssh=false` for reliability.

### Step 5B: Fix App Store Tailscale on macOS (Missing `tailscale ssh`)

**Symptom**: Running `tailscale ssh` returns:

```
The 'tailscale ssh' subcommand is not available on macOS builds distributed through the App Store or TestFlight.
```

**Root cause**: The App Store version of Tailscale for macOS is sandboxed and does not include the `tailscale ssh` subcommand.
**Fix** — install the Standalone version:

1. Uninstall the App Store version (delete it from /Applications)
2. Download the Standalone build from https://pkgs.tailscale.com/stable/#macos
3. Install it to /Applications

**Post-install CLI setup**: The standalone `tailscale` CLI binary is embedded inside the app bundle. Add an alias to your shell config:

```bash
# ~/.zshrc
alias tailscale="/Applications/Tailscale.app/Contents/MacOS/Tailscale"
```

Verify:

```bash
source ~/.zshrc
tailscale version
tailscale ssh <user>@<host>   # Should work now
```

### Step 6: Verify End-to-End

Run a complete connectivity test:

```bash
# 1. Check the route is correct (must show Tailscale's utun, not en0 or Shadowrocket's utun)
route -n get <tailscale-ip>
# Also confirm which utun is Tailscale's:
ifconfig | grep -B2 'inet 100\.'

# 2. Test TCP connectivity
nc -z -w 5 <tailscale-ip> 22

# 3. Test SSH
ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no <user>@<tailscale-ip> 'echo SSH_OK && hostname && whoami'
```

All three must pass:

- If step 1 fails, revisit Step 3. If step 1 shows the wrong utun (e.g., Shadowrocket's utun with MTU 4064 instead of Tailscale's with MTU 1280), that is also a route conflict.
- If step 2 fails, check WSL sshd or the firewall.
- If step 2 passes but step 3 fails with `kex_exchange_identification`, revisit Step 5A (Tailscale SSH proxy intercept). If step 3 fails with other errors, revisit Steps 4-5.

## SOP: Remote Development via Tailscale

Proactive setup guide for remote development over Tailscale with proxy tools. Follow these steps **before** encountering problems.

### Prerequisites

- Tailscale installed and running on both machines
- Proxy tool (Shadowrocket/Clash/Surge) configured for Tailscale compatibility (see Step 3 above)
- SSH access working: `ssh <host> 'echo ok'`

### 1. Proxy-Safe Makefile Pattern

Any Makefile target that curls `localhost` must use `--noproxy localhost`. This is required because `http_proxy` is often set globally in `~/.zshrc` (common in China), and Make inherits shell environment variables.
```makefile
## ── Health Checks ─────────────────────────────────────
status: ## Health check dashboard
	@echo "=== Dev Infrastructure ==="
	@docker exec my-postgres pg_isready -U postgres 2>/dev/null && echo "PostgreSQL: OK" || echo "PostgreSQL: FAIL"
	@curl --noproxy localhost -sf http://localhost:9000/minio/health/live >/dev/null 2>&1 && echo "MinIO: OK" || echo "MinIO: FAIL"
	@curl --noproxy localhost -sf http://localhost:3001/api/status >/dev/null 2>&1 && echo "API: OK" || echo "API: FAIL"

## ── Route Warmup ──────────────────────────────────────
warmup: ## Pre-compile key routes (run after dev server is ready)
	@echo "Warming up dev server routes..."
	@echo -n " /api/health → " && curl --noproxy localhost -s -o /dev/null -w '%{http_code} (%{time_total}s)\n' http://localhost:3010/api/health
	@echo -n " / → " && curl --noproxy localhost -s -o /dev/null -w '%{http_code} (%{time_total}s)\n' http://localhost:3010/
	@echo "Warmup complete."
```

**Rules**:

- Every `curl http://localhost` call MUST include `--noproxy localhost`
- Docker commands (`docker exec`) are unaffected by `http_proxy` — no fix needed
- `redis-cli` and `pg_isready` connect via TCP directly — no fix needed

### 2. SSH Tunnel Makefile Targets

Add these targets for remote development via Tailscale SSH tunnels:

```makefile
## ── Remote Development ────────────────────────────────
REMOTE_HOST ?= <remote-tailscale-hostname>
TUNNEL_FORWARD ?= -L 3010:localhost:3010

tunnel: ## SSH tunnel to remote machine (foreground)
	ssh -N $(TUNNEL_FORWARD) $(REMOTE_HOST)

tunnel-bg: ## SSH tunnel to remote machine (background, auto-reconnect)
	autossh -M 0 -f -N $(TUNNEL_FORWARD) \
		-o "ServerAliveInterval=30" \
		-o "ServerAliveCountMax=3" \
		-o "ExitOnForwardFailure=yes" \
		$(REMOTE_HOST)
	@echo "Tunnel running in background. Kill with: pkill -f 'autossh.*$(REMOTE_HOST)'"
```

**Design decisions**:

| Choice | Rationale |
|--------|-----------|
| `?=` (conditional assign) | Allows override: `make tunnel REMOTE_HOST=100.x.x.x` |
| `TUNNEL_FORWARD` as variable | Supports multi-port: `make tunnel TUNNEL_FORWARD="-L 3010:localhost:3010 -L 9000:localhost:9000"` |
| `autossh -M 0` | Disables autossh's own monitoring port; relies on `ServerAliveInterval` instead (more reliable through NAT) |
| `ExitOnForwardFailure=yes` | Fails immediately if a port is already bound, instead of silently running without the tunnel |
| Kill hint uses `autossh.*$(REMOTE_HOST)` | Precise pattern — won't accidentally kill other SSH sessions |

**Install autossh**: `brew install autossh` (macOS) or `apt install autossh` (Linux/WSL)

### 3. Multi-Port Tunnels

When the project requires multiple services (dev server + object storage + API gateway):

```bash
# Forward multiple ports in one tunnel
make tunnel TUNNEL_FORWARD="-L 3010:localhost:3010 -L 9000:localhost:9000 -L 3001:localhost:3001"

# Or define a project-specific default in the Makefile
TUNNEL_FORWARD ?= -L 3010:localhost:3010 -L 9000:localhost:9000
```

Each `-L` flag is independent. If one port is already bound locally, `ExitOnForwardFailure=yes` will abort the entire tunnel — fix the port conflict first.

### 4. SSH Non-Login Shell Setup

**This is a frequent source of "it works interactively but fails in scripts" bugs.**

SSH non-login shells don't load `~/.zshrc` (or `~/.bashrc` on Linux), so tools installed via nvm, Homebrew, uv, cargo, or any shell-level manager won't be in `$PATH`. Proxy env vars set in `~/.zshrc` also won't be loaded. This affects **all** remote commands run via `ssh user@host "command"`, including CI/CD pipelines, cron-triggered SSH, and Makefile remote targets.

Prefix all remote commands with `source ~/.zshrc 2>/dev/null;` (macOS) or `source ~/.bashrc 2>/dev/null;` (Linux/WSL).
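For one-off remote commands outside Make, the same prefix can be generated by a small helper. A sketch (`remote_wrap` is a hypothetical name; the quoting assumes the wrapped command contains no single quotes):

```shell
# Sketch: wrap a command so the remote non-login shell sources its rc file
# first. Adjust ~/.zshrc to ~/.bashrc for Linux/WSL remotes.
remote_wrap() {
  printf 'source ~/.zshrc 2>/dev/null; %s' "$*"
}

# Usage (host is a placeholder):
#   ssh <remote-host> "$(remote_wrap uv run pytest)"
remote_wrap node --version
# → source ~/.zshrc 2>/dev/null; node --version
```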
**Common failure**: `ssh user@host "uv run ..."` or `ssh user@host "node ..."` returns `command not found` even though the command works in an interactive SSH session. See [references/proxy_conflict_reference.md § SSH Non-Login Shell Pitfall](references/proxy_conflict_reference.md) for details and examples.

For Makefile targets that run remote commands:

```makefile
REMOTE_CMD = ssh $(REMOTE_HOST) 'source ~/.zshrc 2>/dev/null; $(1)'

remote-status: ## Check remote dev server status
	$(call REMOTE_CMD,curl --noproxy localhost -sf http://localhost:3010/api/health && echo "OK" || echo "FAIL")
```

### 5. End-to-End Workflow

#### First-time setup (remote machine)

```bash
# 1. Clone repo and install dependencies
ssh <remote-host>
cd /path/to/project
git clone git@github.com:user/repo.git && cd repo
pnpm install   # Add --registry https://registry.npmmirror.com if in China

# 2. Copy .env from local machine (run on local)
scp .env <remote-host>:/path/to/project/repo/.env

# 3. Start Docker infrastructure
make up && make status

# 4. Run database migrations
bun run db:migrate

# 5. Start dev server
bun run dev
```

#### Daily workflow (local machine)

```bash
# 1. Start tunnel
make tunnel-bg

# 2. Open browser
open http://localhost:3010

# 3. Auth, coding, testing — everything works as if local

# 4. When done, kill tunnel
pkill -f 'autossh.*<remote-host>'
```

#### Why this works

```
Browser → localhost:3010 → SSH tunnel → Remote localhost:3010 → Dev server
                                             ↓
                              Auth redirects to localhost:3010
                                             ↓
                    Browser follows redirect → same tunnel → works
```

The key insight: `APP_URL=http://localhost:3010` in `.env` is correct for **both** local and remote development. The SSH tunnel makes the remote server's localhost accessible as the local machine's localhost. Auth callback redirects to `localhost:3010` always resolve correctly.

### 6. Checklist

Before starting remote development, verify:

- [ ] Tailscale connected: `tailscale status`
- [ ] SSH works: `ssh <remote-host> 'echo ok'`
- [ ] Proxy tool configured: `[Rule]` has `IP-CIDR,100.64.0.0/10,DIRECT`
- [ ] `skip-proxy` includes `100.64.0.0/10`
- [ ] `tun-excluded-routes` does NOT include `100.64.0.0/10`
- [ ] `NO_PROXY` includes `.ts.net,100.64.0.0/10`
- [ ] `autossh` installed: `which autossh`
- [ ] Makefile curl commands have `--noproxy localhost`
- [ ] Remote dev server running: `ssh <remote-host> 'source ~/.zshrc 2>/dev/null; curl --noproxy localhost -sf http://localhost:3010/'`
- [ ] Tunnel works: `make tunnel-bg && curl -sf http://localhost:3010/`

## References

- [references/proxy_conflict_reference.md](references/proxy_conflict_reference.md) — Per-tool configuration (Shadowrocket, Clash, Surge), NO_PROXY syntax, SSH ProxyCommand, and conflict architecture
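The `NO_PROXY` item in the checklist above can be spot-checked mechanically. A sketch (`noproxy_ok` is a hypothetical helper; it only checks the two entries this guide requires):

```shell
# Sketch: verify NO_PROXY contains the Tailscale-related entries from the
# checklist (.ts.net and 100.64.0.0/10). Comma-delimited matching only.
noproxy_ok() {
  v=",$1,"
  case "$v" in *",.ts.net,"*) ;; *) echo "missing .ts.net"; return 1 ;; esac
  case "$v" in *",100.64.0.0/10,"*) ;; *) echo "missing 100.64.0.0/10"; return 1 ;; esac
  echo "ok"
}

noproxy_ok "localhost,127.0.0.1,.ts.net,100.64.0.0/10"
# → ok
```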