---
name: distributed-compute
description: Work with PrimePath's distributed computation engine across Metal GPU, CPU NEON, and multi-device networking. Use when modifying GPU shaders, load balancing, Conductor/Carriage networking, Nester Carry Chain math, or the three-gear auto-shifting engine.
user-invocable: true
argument-hint: [component] [action]
effort: high
---

# PrimePath Distributed Computation Engine

You are working on PrimePath's distributed prime search engine. This system coordinates computation across Metal GPU, CPU with ARM64 NEON, and multiple networked Macs.

## Architecture Overview

```
CONDUCTOR (master, port 9807, Bonjour)
  |-- Splits work via split_range() into WorkChunks
  |-- Sends WorkAssignment JSON to Carriages
  |-- Heartbeat 5s, timeout 10s, auto-reassign on disconnect
  |
CARRIAGE (worker, auto-discovers via Bonjour)
  |-- Receives WorkAssignment, runs local TaskManager
  |-- Reports Progress at 1Hz, DiscoveryReport immediately
  |-- Sends WorkDone on completion

LOCAL MACHINE (runs both Conductor + TaskManager)
  |
  |-- CPU Sieve Pipeline: wheel-210 + CRT + matrix filter + small prime test
  |-- GPU Metal: ring-buffered async dispatch (3 slots, 262K candidates/batch)
  |-- NEON Pre-filter: trial div by primes {3..47}, eliminates ~75% composites
  |-- Load Balancer: 50/50 GPU/CPU split, pool nudge, GPU throttle at 85%
  |-- Three-Gear Engine (Nester Carry Chain divisibility):
      Gear 1: CPU single-thread, 8-wide Barrett (<5K divisors)
      Gear 2: CPU 10-thread, 8-wide Barrett (5K-50K)
      Gear 3: GPU Metal, one thread per divisor (50K+)
```
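
The gear thresholds in the diagram can be read as a simple selection function. The following is a minimal sketch, not the engine's actual code: `Gear`, `select_gear`, and the `gpu_available` fallback are hypothetical names, and it assumes that exactly 5,000 divisors goes to Gear 2 and exactly 50,000 goes to Gear 3, which the diagram leaves ambiguous.

```cpp
#include <cstddef>

// Hypothetical names for illustration; only the thresholds come from the diagram.
enum class Gear { CpuSingleThread = 1, CpuMultiThread = 2, GpuMetal = 3 };

const std::size_t kGear2MinDivisors = 5000;   // below this: Gear 1
const std::size_t kGear3MinDivisors = 50000;  // at or above this: Gear 3

// Picks a gear from the number of divisors to test.
// Assumption: without a usable GPU, large divisor sets stay on Gear 2.
inline Gear select_gear(std::size_t divisor_count, bool gpu_available = true) {
    if (divisor_count < kGear2MinDivisors) return Gear::CpuSingleThread;
    if (divisor_count < kGear3MinDivisors || !gpu_available) return Gear::CpuMultiThread;
    return Gear::GpuMetal;
}
```

Under these assumptions, `select_gear(4999)` stays on Gear 1, `select_gear(5000)` shifts to Gear 2, and `select_gear(50000)` shifts to Gear 3 when a GPU is available.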

## Key Files

| Component | Files |
|-----------|-------|
| **Metal GPU dispatch** | `PrimePath/MetalCompute.mm`, `MetalCompute.h` |
| **Metal shaders** | `PrimePath/PrimeShaders.metal` |
| **GPU backend abstraction** | `PrimePath/GPUBackend.hpp`, `GPUBackend.cpp` |
| **Load balancer + NEON** | `PrimePath/LoadBalancer.hpp`, `LoadBalancer.cpp` |
| **Nester Carry Chain math** | `PrimePath/PrimeEngine.hpp` (Barrett reduction, streaming divisibility) |
| **Task orchestration** | `PrimePath/TaskManager.hpp`, `TaskManager.mm` |
| **Conductor (master)** | `PrimePath/Network/ConductorServer.mm` |
| **Carriage (worker)** | `PrimePath/Network/CarriageClient.mm` |
| **Network protocol** | `PrimePath/Network/NetworkProtocol.hpp`, `WorkSplitter.hpp` |
| **PrimeNet/GIMPS** | `PrimePath/Network/PrimeNetClient.hpp`, `PrimeNetClient.mm` |
| **UI + orchestration** | `PrimePath/AppDelegate.mm` |

## GPU Specifics

- Ring buffer size 3: up to 2 batches in-flight, CPU processes batch N-1 while GPU runs batch N
- All buffers `MTLResourceStorageModeShared` (Apple Silicon unified memory, zero-copy)
- u128 arithmetic in Metal: custom `u128` struct with lo/hi u64 limbs, `mulhi` for 64x64->128
- Barrett reduction replaces division: about 8 multiplies versus a 128-iteration bitwise division loop
- GPU pacing: 2ms min gap between dispatches to avoid starving CPU
- Mersenne TF kernel: fused sieve+modexp, 96-bit Barrett, one thread per candidate
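
To illustrate the lo/hi limb layout and the 64x64->128 multiply described above, here is a host-side C++ sketch. It is not the shader code: `mulhi64`, `mul_wide`, and `add128` are hypothetical names, and `mulhi64` is a portable stand-in for the `mulhi` call used in the Metal shader.

```cpp
#include <cstdint>

// Two 64-bit limbs, mirroring the lo/hi layout described for the shader struct.
struct u128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// High 64 bits of a 64x64 product, built from four 32x32 partial products.
inline std::uint64_t mulhi64(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t a_lo = a & 0xFFFFFFFFull, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xFFFFFFFFull, b_hi = b >> 32;
    const std::uint64_t p0 = a_lo * b_lo;
    const std::uint64_t p1 = a_lo * b_hi;
    const std::uint64_t p2 = a_hi * b_lo;
    const std::uint64_t p3 = a_hi * b_hi;
    // Sum of the middle 32-bit column; its upper half carries into the high word.
    const std::uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFull) + (p2 & 0xFFFFFFFFull);
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

// Full 128-bit product of two 64-bit values.
inline u128 mul_wide(std::uint64_t a, std::uint64_t b) {
    u128 r;
    r.lo = a * b;          // wraps modulo 2^64, which is exactly the low limb
    r.hi = mulhi64(a, b);
    return r;
}

// 128-bit addition with carry from the low limb into the high limb.
inline u128 add128(u128 x, u128 y) {
    u128 r;
    r.lo = x.lo + y.lo;
    r.hi = x.hi + y.hi + (r.lo < x.lo ? 1u : 0u);
    return r;
}
```

A host-side mirror like this can be handy for cross-checking shader results on the CPU-fallback path.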

## CPU/NEON Specifics

- Nester Carry Chain: modular multiplication without UDIV instruction
  - `q = floor(x * inv_d / 2^64)` via ARM64 UMULH
  - Precomputed reciprocal: `inv_d = UINT64_MAX / d`
  - Streaming divisibility: processes number segment-by-segment MSB to LSB
- NEON pre-filter: `uint64x2_t` lanes, 2 candidates at a time, ~30% survival rate
- Template batching: 1/2/4/8/16 divisors per pass, compile-time unroll at -O2
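
The reciprocal-multiply step and the MSB-to-LSB streaming pass can be sketched as follows. This is an illustration under stated assumptions, not the code in `PrimeEngine.hpp`: `BarrettDivisor`, `make_divisor`, `barrett_mod`, and `divides_streaming` are hypothetical names, the sketch assumes 32-bit segments and divisors below 2^32 so each step fits in 64 bits, and it uses the GCC/Clang `unsigned __int128` extension to obtain the high product word (which compiles to UMULH on ARM64).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

__extension__ typedef unsigned __int128 wide_t;  // GCC/Clang extension

struct BarrettDivisor {
    std::uint64_t d;      // divisor
    std::uint64_t inv_d;  // precomputed reciprocal: UINT64_MAX / d
};

inline BarrettDivisor make_divisor(std::uint64_t d) {
    assert(d != 0);
    BarrettDivisor bd;
    bd.d = d;
    bd.inv_d = UINT64_MAX / d;
    return bd;
}

// x mod d without a divide in the hot path.
// q = floor(x * inv_d / 2^64) never exceeds the true quotient and is at most
// one below it, so a single conditional subtraction fixes the remainder.
inline std::uint64_t barrett_mod(std::uint64_t x, const BarrettDivisor& bd) {
    const std::uint64_t q =
        static_cast<std::uint64_t>((static_cast<wide_t>(x) * bd.inv_d) >> 64);
    std::uint64_t r = x - q * bd.d;
    if (r >= bd.d) r -= bd.d;
    return r;
}

// Streaming divisibility over 32-bit segments, most significant first.
// Assumption: d < 2^32, so (rem << 32) | segment always fits in 64 bits.
inline bool divides_streaming(const std::vector<std::uint32_t>& segs_msb_first,
                              const BarrettDivisor& bd) {
    assert(bd.d <= 0xFFFFFFFFull);
    std::uint64_t rem = 0;
    for (std::uint32_t seg : segs_msb_first) {
        rem = barrett_mod((rem << 32) | seg, bd);
    }
    return rem == 0;
}
```

For example, the two segments `{0xFFFFFFFF, 0xFFFFFFFF}` encode 2^64 - 1, which this sketch reports as divisible by 641 and not divisible by 7.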

## Network Protocol

- Framing: 4-byte big-endian length prefix + JSON body
- Message types: AssignWork (0x01), Progress (0x10), DiscoveryReport (0x11), WorkDone (0x12), Ping/Pong (0x03/0x13), Hello (0x04)
- Work splitting: `split_range(type, start, end, num_workers)` creates equal-sized WorkChunks
- Failover: when a Carriage dies, its work is reassigned from the last known position
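
The length-prefix framing above can be sketched as a pair of helpers. This is a minimal illustration, not the code in `NetworkProtocol.hpp`: `frame_message` and `try_pop_frame` are hypothetical names, the JSON body is treated as an opaque string, and the sketch does not show how the message type code is carried because that is not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Prepends a 4-byte big-endian length to the JSON body.
inline std::vector<std::uint8_t> frame_message(const std::string& json_body) {
    assert(json_body.size() <= 0xFFFFFFFFull);
    const std::uint32_t n = static_cast<std::uint32_t>(json_body.size());
    std::vector<std::uint8_t> out;
    out.reserve(4 + json_body.size());
    out.push_back(static_cast<std::uint8_t>((n >> 24) & 0xFF));
    out.push_back(static_cast<std::uint8_t>((n >> 16) & 0xFF));
    out.push_back(static_cast<std::uint8_t>((n >> 8) & 0xFF));
    out.push_back(static_cast<std::uint8_t>(n & 0xFF));
    out.insert(out.end(), json_body.begin(), json_body.end());
    return out;
}

// Pops one complete frame from the front of a receive buffer.
// Returns false and leaves rx untouched if the frame is still incomplete,
// which is the normal case when TCP delivers a message in pieces.
inline bool try_pop_frame(std::vector<std::uint8_t>& rx, std::string& body_out) {
    if (rx.size() < 4) return false;
    const std::uint32_t n = (static_cast<std::uint32_t>(rx[0]) << 24) |
                            (static_cast<std::uint32_t>(rx[1]) << 16) |
                            (static_cast<std::uint32_t>(rx[2]) << 8) |
                            static_cast<std::uint32_t>(rx[3]);
    const std::size_t total = 4 + static_cast<std::size_t>(n);
    if (rx.size() < total) return false;
    body_out.assign(rx.begin() + 4, rx.begin() + static_cast<std::ptrdiff_t>(total));
    rx.erase(rx.begin(), rx.begin() + static_cast<std::ptrdiff_t>(total));
    return true;
}
```

A receiver would append incoming bytes to `rx` and call `try_pop_frame` in a loop until it returns false, so that several frames arriving in one read are all processed.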

## Task Types

Wieferich, Wall-Sun-Sun, Wilson, Twin, Sophie Germain, Cousin, Sexy, General, Emirp, Mersenne TF, Fermat Factor (11 types)

## Build Commands

```bash
# Debug build
xcodebuild -scheme PrimePath -configuration Debug build

# Release build
xcodebuild -scheme PrimePath -configuration Release build

# Release with notarization
./scripts/build-dmg.sh

# Unit tests
clang++ -std=c++17 -O2 -I. test_engine.cpp PrimePath/PrimeEngine.cpp -o test_engine -lpthread
./test_engine
```

## Guidelines

- All GPU work goes through GPUBackend abstraction. Never call Metal APIs directly from TaskManager.
- Load balancer is decoupled from search implementations. Query `advise()` for routing decisions.
- u128 math in Metal shaders must use the custom u128 struct, not compiler extensions.
- Barrett reduction must be used for all modular arithmetic in hot loops (no division).
- Ring buffer slot management is critical: always release the previous slot before submitting new work.
- NEON pre-filter runs on every candidate batch regardless of gear selection.
- Conductor/Carriage messages are JSON over TCP with length-prefix framing.
- Test on both GPU and CPU-fallback paths when modifying compute kernels.
- Checkpoint files are written every 30s. Mersenne TF checkpoints use per-exponent filenames: `mersenne_tf_checkpoint_M{exponent}.txt`

When asked to work on: $ARGUMENTS