# Vanguard Networking Skill

## Purpose

Provides the agent with knowledge of the GPU cluster topology and networking configuration for the Vanguard SOC cluster, a live 4-node (expanding to 6) GPU mesh on the 192.168.86.x subnet.

## Cluster Topology (Live)

### Head Node: Mega (RTX 5090)
- **Hostname**: `Mega`
- **IP Address**: `192.168.86.29`
- **gRPC Port**: `50051`
- **GPU**: RTX 5090 (32GB VRAM)
- **Role**: Head node, MCP server host, orchestrator
- **Services**: Vanguard MCP, BOINC, Folding@home

### Compute Node 1: AMDMSIX870E-1 (RTX 5090)
- **Hostname**: `AMDMSIX870E-1`
- **IP Address**: `192.168.86.16`
- **gRPC Port**: `50052`
- **GPU**: RTX 5090 (32GB VRAM)
- **Role**: Compute node
- **Services**: Vanguard Node Agent, BOINC, Folding@home

### Compute Node 2: AMDMSIX870E-2 (RTX 5090)
- **Hostname**: `AMDMSIX870E-2`
- **IP Address**: `192.168.86.22`
- **gRPC Port**: `50053`
- **GPU**: RTX 5090 (32GB VRAM)
- **Role**: Compute node
- **Services**: Vanguard Node Agent, BOINC, Folding@home

### Compute Node 3: DellUltracore9 (RTX 4090)
- **Hostname**: `DellUltracore9`
- **IP Address**: `192.168.86.3`
- **gRPC Port**: `50054`
- **CPU**: Intel Core Ultra 9 285
- **GPU**: RTX 4090 (24GB VRAM)
- **Role**: Compute node
- **Services**: Vanguard Node Agent, BOINC, Folding@home

### Placeholder Node 5: Ian's Aurora (RTX 4090) — PENDING
- **Hostname**: `aurora-ian`
- **IP Address**: TBD
- **gRPC Port**: `50055`
- **CPU**: Intel i9-14900KF
- **GPU**: RTX 4090 (24GB VRAM)
- **Role**: Future compute node
- **Status**: Awaiting network integration

### Placeholder Node 6: (RTX 4080 Super) — PENDING
- **Hostname**: TBD
- **IP Address**: TBD
- **gRPC Port**: `50056`
- **CPU**: Intel i9-14900K
- **GPU**: RTX 4080 Super (16GB VRAM)
- **Role**: Future compute node
- **Status**: Awaiting network integration

## Network Configuration

### Subnet
- **Network**: `192.168.86.0/24`
- **Gateway**: `192.168.86.1`

### Cluster Service Endpoints
- **MCP Server (Head)**: `grpc://192.168.86.29:50051`
- **Compute 1**: `grpc://192.168.86.16:50052`
- **Compute 2**: `grpc://192.168.86.22:50053`
- **Compute 3**: `grpc://192.168.86.3:50054`
- **Heartbeat Interval**: 10 seconds
- **Task Timeout**: 300 seconds (default)

### GPU Affinity Preferences
- **Ising Parallel Tempering (high-replica)**: Distribute replicas across all 4 nodes
- **Fractal Generation (Sierpinski, Menger)**: Prefer RTX 5090 nodes (Mega, AMDMSIX870E-1/2)
- **Parallel Stepping (>100K nodes)**: Prefer RTX 5090 (higher compute)
- **Visualization Rendering**: Any GPU
- **Small Simulations (<10K nodes)**: Prefer RTX 4090 (DellUltracore9)

### Resource Reservations (Normal Mode)
- **BOINC**: 15% GPU per card
- **Folding@home**: 10% GPU per card
- **UtilityFog**: 75% GPU per card

### Resource Reservations (Grokking Run)
- **BOINC**: 0% (gracefully paused)
- **Folding@home**: 0% (gracefully paused)
- **UtilityFog**: 100% GPU per card
- All 4 nodes dedicated to the grokking computation
- BOINC/F@H auto-restored when grokking ends

## Grokking Run Protocol

1. Watchdog broadcasts `GrokkingRun` mode to all 4 nodes
2. Each node pauses BOINC (`boinccmd --set_gpu_mode never`) and F@H (`FAHClient --pause`)
3. GPU router lifts the 25% reserve ceiling — full 100% capacity available
4. Parallel Tempering replicas distributed across all available GPUs
5. Timer counts down; on expiry, watchdog restores Normal mode
6. BOINC and F@H resume automatically
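For steps 2 and 6, a minimal node-side sketch of the pause/restore cycle, assuming `boinccmd` and `FAHClient` are on each node's PATH. The pause commands come from the protocol above; `boinccmd --set_gpu_mode auto` and `FAHClient --unpause` are assumed as the corresponding resume commands, and `enter_grokking_run` is a hypothetical helper, not the actual watchdog code.

```python
import subprocess
import time

def enter_grokking_run(duration_secs: int) -> None:
    """Pause donor compute, hold for the run, then restore Normal mode."""
    # Step 2: pause BOINC and F@H (commands from the protocol above)
    subprocess.run(['boinccmd', '--set_gpu_mode', 'never'], check=True)
    subprocess.run(['FAHClient', '--pause'], check=True)
    try:
        # Steps 3-5 happen in the router/watchdog; sleeping here is just
        # a stand-in for the countdown timer.
        time.sleep(duration_secs)
    finally:
        # Step 6: restore donor compute even if the run is interrupted.
        # Resume flags are assumptions (see lead-in), not from this skill.
        subprocess.run(['boinccmd', '--set_gpu_mode', 'auto'], check=True)
        subprocess.run(['FAHClient', '--unpause'], check=True)
```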
## Usage Examples

### Submit a Fractal Task

```python
import grpc
from cluster_pb2 import TaskRequest, GpuPreference
from cluster_pb2_grpc import ClusterServiceStub

channel = grpc.insecure_channel('192.168.86.29:50051')
stub = ClusterServiceStub(channel)

request = TaskRequest(
    task_type='fractal_step',
    payload=b'...',
    gpu_preference=GpuPreference.GPU_PREFER_5090,
    priority=5,
    branch_id='sierpinski-d4-b0'
)
receipt = stub.SubmitTask(request)
print(f"Task {receipt.task_id} assigned to {receipt.assigned_node}/{receipt.assigned_gpu}")
```

### Query Cluster Status

```python
from cluster_pb2 import Empty

node_list = stub.ListNodes(Empty())
for node in node_list.nodes:
    print(f"{node.hostname}: {len(node.gpus)} GPUs, {node.total_vram_mb}MB VRAM")
```

### Trigger Grokking Run

```python
result = mcp_client.call_tool('trigger_grokking_run', {
    'duration_secs': 600,
    'confirm': True
})
```

## Monitoring

### Health Checks
- Heartbeat every 10s from each node
- GPU temperature threshold: 85C (tasks rejected above this)
- GPU utilization tracked per-card
- VRAM availability monitored

### Failure Handling
- Node offline: tasks re-queued to other nodes
- GPU overheating: tasks paused until temp < 80C
- Task timeout: automatic retry (max 3 attempts; see the retry sketch at the end of this skill)

## Security

- gRPC uses insecure channel (local 192.168.86.0/24 only)
- Firewall: ports 50051-50056 open only to subnet
- No external access

## Task Priority Levels

- **0-3**: Low (background tasks)
- **4-6**: Normal (default)
- **7-9**: High (interactive)
- **10**: Critical (grokking run)

## Routing Strategies

- `LeastLoaded`: Pick GPU with lowest utilization (default)
- `RoundRobin`: Cycle through all available GPUs
- `VramCapacity`: Pick GPU with most free VRAM
- `AffinityFirst`: Respect GPU model preference strictly

## Troubleshooting

### Common Issues
1. **"Queue full"**: Increase `max_capacity` in TaskQueue or wait for tasks to complete
2. **"No available GPUs"**: Check if all GPUs are above 85C or fully utilized
3. **"Node not responding"**: Verify network connectivity on 192.168.86.x and check that the node process is running (see the probe sketch at the end of this skill)
4. **"BOINC/F@H starved"**: Watchdog logs violations; reduce UFT task load or trigger grokking run

### Logs
- MCP Server: `journalctl -u vanguard-mcp -f`
- Node Agent: `journalctl -u vanguard-node -f`
- Watchdog: `journalctl -u vanguard-watchdog -f`
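For the task-timeout retry policy, a minimal client-side sketch using the same `ClusterServiceStub` as in the usage examples. `submit_with_retry` is a hypothetical helper and the backoff is illustrative; the router's own re-queue-on-failure logic is separate from this client-visible loop.

```python
import time
import grpc

MAX_ATTEMPTS = 3          # matches "automatic retry (max 3 attempts)"
TASK_TIMEOUT_SECS = 300   # default task timeout from the endpoint config

def submit_with_retry(stub, request):
    """Submit a task, retrying on gRPC deadline expiry up to MAX_ATTEMPTS."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return stub.SubmitTask(request, timeout=TASK_TIMEOUT_SECS)
        except grpc.RpcError as err:
            # Retry only on timeout; re-raise any other error immediately.
            if err.code() != grpc.StatusCode.DEADLINE_EXCEEDED or attempt == MAX_ATTEMPTS:
                raise
            time.sleep(2 ** attempt)  # simple backoff between attempts
```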
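And a small connectivity probe for the "Node not responding" case, using gRPC's channel-readiness future. `probe_node` is a hypothetical helper; the endpoints come from the Cluster Service Endpoints list above.

```python
import grpc

NODES = {
    'Mega': '192.168.86.29:50051',
    'AMDMSIX870E-1': '192.168.86.16:50052',
    'AMDMSIX870E-2': '192.168.86.22:50053',
    'DellUltracore9': '192.168.86.3:50054',
}

def probe_node(endpoint: str, timeout_secs: float = 5.0) -> bool:
    """Return True if a gRPC channel to `endpoint` becomes ready in time."""
    channel = grpc.insecure_channel(endpoint)
    try:
        grpc.channel_ready_future(channel).result(timeout=timeout_secs)
        return True
    except grpc.FutureTimeoutError:
        return False
    finally:
        channel.close()

for hostname, endpoint in NODES.items():
    print(f"{hostname}: {'up' if probe_node(endpoint) else 'NOT RESPONDING'}")
```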