--- name: exo-distributed description: Distributed LLM inference across Apple Silicon clusters with exo. Run models across Mac Studios via Thunderbolt RDMA, auto peer discovery, and MLX sharding. Use for multi-device inference, model parallelism, or building LLM clusters. version: 1.0.0 --- # exo-distributed Skill > *"Run models across heterogeneous devices by forming GPU clusters with zero configuration."* **Trit**: 0 (ERGODIC - coordination/orchestration) **Color**: Neutral (60-180° hues) **Source**: Random walk fusion over DuckLake interactions + DeepWiki exo-explore/exo ## Overview [exo](https://github.com/exo-explore/exo) enables distributed LLM inference across multiple Apple Silicon devices: - **Auto Peer Discovery**: Devices find each other automatically - **RDMA over Thunderbolt 5**: Low-latency direct memory access - **MLX Backend**: Native Apple Silicon acceleration via mlx.distributed - **Pipeline + Tensor Parallelism**: Shard models across devices ## Quick Start ```bash # Install exo pip install exo-explore # Start on first device (becomes master if elected) exo # Start on additional devices (auto-discovers peers) exo # Devices automatically form a cluster and expose OpenAI-compatible API # Default: http://localhost:8080 ``` ## Architecture ``` ┌─────────────────────────────────────────────────────────────────┐ │ EXO CLUSTER │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ ┌────────────┐ Thunderbolt 5 ┌────────────┐ │ │ │ Mac Studio │◄───── RDMA ────►│ Mac Studio │ │ │ │ M4 Max │ │ M4 Max │ │ │ │ Layers 0-15│ │Layers 16-31│ │ │ └──────┬─────┘ └──────┬─────┘ │ │ │ │ │ │ └──────────────┬───────────────┘ │ │ │ │ │ ┌────▼────┐ │ │ │ Master │ │ │ │ (Elected)│ │ │ └────┬────┘ │ │ │ │ │ ┌─────────▼─────────┐ │ │ │ REST API :8080 │ │ │ │ OpenAI Compatible │ │ │ └───────────────────┘ │ └─────────────────────────────────────────────────────────────────┘ ``` ## Core Components ### 1. Peer Discovery (Gossipsub/libp2p) ```python # Automatic discovery via mDNS/UDP broadcast # Builds Topology graph of connected nodes # No manual configuration required # View discovered peers exo --list-peers ``` ### 2. MLX Backend ```python from exo.worker.engines.mlx.utils_mlx import mlx_distributed_init # Initializes mlx.distributed group # Backends: # - MlxRing: TCP/IP (fallback) # - MlxJaccl: RDMA over Thunderbolt (preferred) mlx_distributed_init( rank=0, # This device's rank world_size=4, # Total devices backend="jaccl" # RDMA backend ) ``` ### 3. Shard Distribution ```python from exo.master.placement import place_instance # Determines model sharding across devices # Filters by available memory # Prioritizes Thunderbolt cycles shard_assignments = place_instance( model="llama-3.3-70b", topology=discovered_topology, strategy="pipeline" # or "tensor" ) ``` ### 4. RDMA over Thunderbolt 5 ```python # IBV device matrix: N×N connectivity # matrix[i][j] = interface on device i connecting to device j # Environment variables for Jaccl backend: # MLX_IBV_DEVICES= # MLX_RANK= # MLX_IBV_COORDINATOR= # MLX_METAL_FAST_SYNCH=1 ``` ## Sharding Strategies ### Pipeline Parallelism ``` Device 0: Layers 0-15 → embeddings + early layers Device 1: Layers 16-31 → middle layers Device 2: Layers 32-47 → late layers Device 3: Layers 48-63 → final layers + head Data flows: D0 → D1 → D2 → D3 → output ``` ```python from exo.master.placement import get_shard_assignments_for_pipeline_parallel # Inserts PipelineFirstLayer and PipelineLastLayer # for inter-device communication assignments = get_shard_assignments_for_pipeline_parallel( model="llama-3.3-70b", num_devices=4 ) ``` ### Tensor Parallelism ``` All devices: All layers (replicated) But: Attention heads partitioned across devices MLP tensors partitioned across devices Each device computes partial results → all-reduce → combined output ``` ```python from exo.master.placement import get_shard_assignments_for_tensor_parallel # Specific sharding for: # - LlamaModel # - DeepseekV3Model # - Qwen3MoeModel assignments = get_shard_assignments_for_tensor_parallel( model="deepseek-r1", num_devices=4 ) ``` ## Supported Models | Model | Size | Min Devices | Strategy | |-------|------|-------------|----------| | Llama 3.3 | 70B | 2 × M4 Max | Pipeline | | DeepSeek R1 | 671B | 8+ × M4 Max | Tensor | | Qwen 2.5 | 72B | 2 × M4 Max | Pipeline | | Mixtral 8×22B | 141B | 4 × M4 Max | Tensor | | Llama 3.1 | 405B | 8+ × M4 Max | Tensor | ## API Usage ### OpenAI-Compatible Endpoint ```python import openai client = openai.OpenAI( base_url="http://localhost:8080/v1", api_key="not-needed" # exo doesn't require API key ) response = client.chat.completions.create( model="llama-3.3-70b", messages=[{"role": "user", "content": "Hello!"}], stream=True ) for chunk in response: print(chunk.choices[0].delta.content, end="") ``` ### Direct API ```bash curl http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "llama-3.3-70b", "messages": [{"role": "user", "content": "Hello"}], "stream": true }' ``` ## Hardware Setup ### Thunderbolt Cluster (Recommended) ``` Mac Studio 1 ──TB5──► Mac Studio 2 │ │ TB5 TB5 │ │ ▼ ▼ Mac Studio 3 ──TB5──► Mac Studio 4 ``` - Use Thunderbolt 5 cables (120 Gbps bidirectional) - All-to-all connectivity required for RDMA - RDMA gives ~10× lower latency than TCP/IP ### Network Cluster (Fallback) ```bash # TCP/IP via MlxRing backend # Works but higher latency exo --backend ring ``` ## GF(3) Triadic Integration ``` exo-distributed (0) ⊗ mlx-apple-silicon (+1) ⊗ bisimulation-game (-1) = 0 ✓ exo-distributed (0) ⊗ parallel-fanout (+1) ⊗ sheaf-cohomology (-1) = 0 ✓ exo-distributed (0) ⊗ gay-mcp (+1) ⊗ temporal-coalgebra (-1) = 0 ✓ ``` ### Trifurcated Inference Pattern ```clojure ;; Distribute inference across 3 device groups (defn trifurcated-inference [prompt] (let [minus (future (exo-infer :validator prompt)) ; -1: Check safety ergodic (future (exo-infer :main prompt)) ; 0: Main inference plus (future (exo-infer :speculative prompt))] ; +1: Speculative draft ;; GF(3) sum: -1 + 0 + 1 = 0 ✓ {:validated @minus :response @ergodic :speculative @plus})) ``` ## Derivational Chaining (from Bumpus/DuckLake) Each inference step derives from previous via seed chaining: ```python # Seed derivation for deterministic distributed inference GOLDEN = 0x9E3779B97F4A7C15 def derive_shard_seed(base_seed: int, shard_id: int, step: int) -> int: """Deterministic seed for each shard at each step""" z = (base_seed + GOLDEN * shard_id + step) & 0xFFFFFFFFFFFFFFFF z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF return z ^ (z >> 31) # Each device uses its shard_seed for sampling reproducibility ``` ## Commands ```bash # Start exo node exo # List discovered peers exo --list-peers # Specify backend exo --backend jaccl # RDMA (default if available) exo --backend ring # TCP/IP fallback # Run specific model exo --model llama-3.3-70b # Set API port exo --port 8080 # Benchmark exo bench --config bench_simple.yaml ``` ## Monitoring ### Dashboard ```bash # Web dashboard at http://localhost:8080/dashboard # Shows: # - Connected peers # - Model shards # - Inference throughput # - Memory usage per device ``` ### DuckLake Integration ```sql -- Track exo inference events in DuckLake CREATE TABLE exo_inferences ( id INTEGER PRIMARY KEY, timestamp TIMESTAMP, model VARCHAR, prompt_tokens INTEGER, completion_tokens INTEGER, latency_ms FLOAT, devices INTEGER, strategy VARCHAR, trit INTEGER, seed BIGINT ); -- Query inference history SELECT model, AVG(latency_ms) as avg_latency, COUNT(*) as count FROM exo_inferences GROUP BY model ORDER BY avg_latency; ``` ## Troubleshooting | Issue | Solution | |-------|----------| | Peers not discovered | Check firewall, ensure same network | | RDMA not working | Verify Thunderbolt cables, check `ibv_devices` | | OOM on device | Reduce batch size or use more devices | | Slow inference | Switch from `ring` to `jaccl` backend | | Model not loading | Check `~/.cache/huggingface` for space | ## References - [exo-explore/exo](https://github.com/exo-explore/exo) (GitHub) - [DeepWiki exo documentation](https://deepwiki.com/exo-explore/exo) - [MLX Distributed](https://github.com/ml-explore/mlx) - [Jaccl RDMA Backend](https://github.com/ml-explore/mlx/tree/main/mlx/distributed) --- **Skill Name**: exo-distributed **Type**: Distributed LLM Inference / Cluster Orchestration **Trit**: 0 (ERGODIC - coordination) **GF(3)**: Coordinates multi-device inference with balanced sharding **Platform**: Apple Silicon clusters (macOS) **Discovery**: Random walk fusion over DuckLake + DeepWiki exo-explore/exo ## Scientific Skill Interleaving This skill connects to the K-Dense-AI/claude-scientific-skills ecosystem: ### Graph Theory - **networkx** [○] via bicomodule - Universal graph hub ### Bibliography References - `distributed-systems`: 3 citations in bib.duckdb ## SDF Interleaving This skill connects to **Software Design for Flexibility** (Hanson & Sussman, 2021): ### Primary Chapter: 8. Degeneracy **Concepts**: redundancy, fallback, multiple strategies, robustness ### GF(3) Balanced Triad ``` exo-distributed (○) + SDF.Ch8 (−) + [balancer] (+) = 0 ``` **Skill Trit**: 0 (ERGODIC - coordination) ### Secondary Chapters - Ch3: Variations on an Arithmetic Theme - Ch4: Pattern Matching - Ch5: Evaluation - Ch6: Layering - Ch10: Adventure Game Example - Ch7: Propagators ### Connection Pattern Degeneracy provides fallbacks. This skill offers redundant strategies. ## Cat# Integration This skill maps to **Cat# = Comod(P)** as a bicomodule in the equipment structure: ``` Trit: 0 (ERGODIC) Home: Prof Poly Op: ⊗ Kan Role: Adj Color: #26D826 ``` ### GF(3) Naturality The skill participates in triads satisfying: ``` (-1) + (0) + (+1) ≡ 0 (mod 3) ``` This ensures compositional coherence in the Cat# equipment structure.