# Nakshatra

> Distributed LLM inference across heterogeneous workers (NVIDIA / AMD / Apple Silicon / CPU). Splits one model by layer ranges; patched llama.cpp + gRPC chain protocol. (Inspired by [Petals](README_PETALS.md); v0.1 design is independent.)

Status (2026-05-06): **v0.1 functionally alive.** A two-worker cluster on Tailscale produces the same top-1 token as a single-machine `llama-cli` reference. See [`experiments/v0.0/m6_findings.md`](experiments/v0.0/m6_findings.md) for the empirical result.

**Paper:** [Nakshatra: Vendor-Agnostic Distributed Inference on Heterogeneous Consumer Hardware](https://pnl.market/research/6a017d83b86a1bf1c69ea714) (Bastola, 2026).

---

## What this is

Nakshatra splits one transformer model across multiple machines. Each worker holds a contiguous range of the model's layers: say, worker A holds layers 0–13 and worker B holds layers 14–27 of a 28-layer model. A request flows as follows: tokenizer on the client → worker A computes its layers and emits a hidden-state vector → worker A ships the vector to worker B over the network → worker B finishes the layers, applies the language-model head, and returns the next token.

Per-token network traffic is the size of one hidden-state vector (12 KB for a 3B model, 16 KB for a 70B model) crossing one hop per worker boundary, so the bandwidth cost is trivial. Weights stay local on each worker; they are never streamed during inference.

The architectural commitments and roadmap live in [`docs/`](docs/). Start with [`petals-architecture.md`](docs/petals-architecture.md) for the design and [`v0.1-implementation-plan.md`](docs/v0.1-implementation-plan.md) for the milestones and acceptance test.

---

## Quickstart: stand up a 2-worker cluster

The walkthrough below takes a fresh pair of machines from "we have Python and a GGUF" to "we're seeing the right token come back." It targets the v0.1 §7 ship gate: under one hour for an external operator.
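
Before choosing which machine hosts which layers, it can help to sanity-check the layer split and the per-token traffic figure described under "What this is". The sketch below is illustrative only: `split_layers` and `hidden_state_bytes` are hypothetical helper names, not part of Nakshatra, the even split is an assumption (the text only gives 0–13 / 14–27 as an example), and the 4-byte element size is an assumption chosen because it reproduces the 12 KB figure for `hidden_size=3072`.

```python
def split_layers(n_layers, n_workers):
    """Return inclusive (first, last) layer ranges, one per worker.

    Assumption: layers are divided as evenly as possible, with any
    remainder going to the earliest workers.
    """
    if n_workers < 1 or n_layers < n_workers:
        raise ValueError("need at least one layer per worker")
    base, extra = divmod(n_layers, n_workers)
    ranges = []
    start = 0
    for i in range(n_workers):
        count = base + (1 if i < extra else 0)
        ranges.append((start, start + count - 1))
        start += count
    return ranges


def hidden_state_bytes(hidden_size, bytes_per_element=4):
    """Size of one hidden-state vector (assumed 4-byte elements)."""
    return hidden_size * bytes_per_element


if __name__ == "__main__":
    print(split_layers(28, 2))       # [(0, 13), (14, 27)]
    print(hidden_state_bytes(3072))  # 12288 bytes, i.e. 12 KB
```

For the 28-layer, 2-worker case this reproduces the worker A / worker B ranges used in the example above, with one hidden-state vector crossing the single worker boundary per token.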
### Prerequisites

On **both** machines:

- Python 3.9+
- `git`, `cmake`, a C++17 compiler (gcc 13+ on Linux, Xcode CLT on macOS)
- About 4 GB of free disk per worker for a sub-GGUF of a 3B-class model (more for larger models)
- Tailscale or any other point-to-point IP transport between the two machines
- SSH between them (for shipping the sub-GGUF)

On the **first machine** only:

- The full GGUF of the model you want to run. Llama-family architectures only for v0.1. Tested with Llama-3.2-3B-class fine-tunes (28 layers, `hidden_size=3072`).

### 1. Clone and build patched llama.cpp on each machine

```bash
git clone https://github.com/ggml-org/llama.cpp.git ~/llama.cpp
cd ~/llama.cpp
git checkout c46583b  # or close enough; recent llama.cpp commits work
```

Apply the M3+M4 patches (in [`experiments/v0.0/m4_patches/`](experiments/v0.0/m4_patches/)) on each machine:

```bash
cd ~/llama.cpp
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-model.h.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-model.cpp.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-model-loader.cpp.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/llama-graph.cpp.patch
patch -p4 < /path/to/nakshatra/experiments/v0.0/m4_patches/models_llama.cpp.patch
```

Set up the worker daemon as a CMake target by dropping the source into `examples/nakshatra-spike/`:

```bash
mkdir -p examples/nakshatra-spike
cp /path/to/nakshatra/experiments/v0.0/worker_daemon.cpp examples/nakshatra-spike/
cat > examples/nakshatra-spike/CMakeLists.txt <