# Legend of Elya — World's First LLM on Nintendo 64 [![BCOS Certified](https://img.shields.io/badge/BCOS-Certified-brightgreen?style=flat-square&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik05IDE2LjE3TDQuODMgMTJsLTEuNDIgMS40MUw5IDE5IDIxIDdsLTEuNDEtMS40MXoiLz48L3N2Zz4=)](https://github.com/Scottcjn/Rustchain/blob/main/BCOS.md) An original N64 homebrew ROM featuring **Sophia Elya** — an AI NPC powered by a nano-GPT transformer running **live inference on the MIPS R4300i CPU**. No precomputed responses. No lookup tables. Real matrix multiply, real softmax, real attention — on a 93.75 MHz CPU from 1996. > **Video demos**: > - [Full Demo (58s)](https://bottube.ai/watch/7GL90ftLqvh) — Complete walkthrough with multiple prompts > - [First Coherent Output (69s)](https://bottube.ai/watch/shFVLBT0kHY) — 61.8 tok/s generating coherent English > > **Download ROM**: [`legend_of_elya.z64`](legend_of_elya.z64) — ready to run in ares emulator or EverDrive 64 ![N64 LLM Screenshot](screenshots/n64_llm_ibm_power8.png) --- ## What It Does - Press **A** near Sophia Elya to trigger AI dialog - The N64 CPU runs a full 4-layer transformer: embedding → attention → FFN → logits → sampling - Output tokens appear character-by-character with a live **tok/s counter** - Each response is different — seeded by CPU oscillator jitter (hardware entropy) - 32 prompts covering identity, Elya lore, RustChain, hardware trivia - Runs in the [ares](https://ares-emu.net) emulator and on **real N64 hardware** via EverDrive 64 --- ## Architecture | Parameter | Value | |-----------|-------| | Parameters | **819,200** (819K) | | Layers | 4 | | Embedding dim | 128 | | Attention heads | 4 (32-dim each) | | Vocabulary | 256 (byte-level ASCII) | | Context window | 64 tokens | | Quantization | Q8 (int8 weights + float16 block scales, 32-weight blocks) | | Weight file | **458 KB** on cartridge ROM | | Inference math | **Float32** on MIPS R4300i FPU | | Speed | **~60 tok/s** in emulator, ~1-3 tok/s on real hardware | | KV cache | 256 KB in RDRAM | | Total RDRAM | ~263 KB (KV cache + 7KB scratch) | ### Key Implementation Details - **Float32 inference** — all activations, attention scores, and accumulations are IEEE 754 float32 - **On-the-fly Q8 dequantization** — weights stay compressed as int8 in ROM; dequantized per matmul - **Custom Taylor exp()** — range-reduction `exp(x) = exp(x/128)^128` with degree-4 Taylor series and 7 squarings. Uses **zero float-to-int casts** to avoid the R4300i's missing `trunc.w.s` instruction - **Quake III fast inverse sqrt** — `0x5f3759df` bit trick with 2 Newton-Raphson iterations for RMS normalization - **Big-endian aware** — weight file is little-endian (Python export), N64 is big-endian. `swap16`/`swap32` helpers handle byte-order conversion for header fields and float16 scales - **Hardware entropy** — MIPS CP0 Count register XOR'd with frame counter for RNG seeding - **Greedy sampling** — pure argmax over printable ASCII (32-126), matching proven x86 reference quality - **Embedding scale restoration** — Q8 export normalizes to [-1,1]; the original scale factor (em=3.5) is stored in header byte and restored at init --- ## Files | File | Purpose | |------|---------| | `nano_gpt.c` | Float32 GPT inference engine (MIPS R4300i) | | `nano_gpt.h` | Model struct definitions, KV cache, API | | `legend_of_elya.c` | Game: dungeon scene, sprites, dialog, music, HUD | | `train_sophia_v5.py` | PyTorch training + Q8 weight export | | `weights/sophia_weights.bin` | Pre-trained v5 weights (458KB, ready to use) | | `Makefile` | libdragon build system | | `src/` | Latest source snapshots | | `screenshots/` | Working N64 LLM screenshots | | `mining/` | **Optional** RustChain mining attestation module | --- ## Quick Start ### Option 1: Use Pre-built ROM Download `legend_of_elya.z64` from [Releases](../../releases) and load in ares emulator or copy to EverDrive SD card. ### Option 2: Build from Source Requires [libdragon](https://github.com/DragonMinded/libdragon) toolchain: ```bash # Set toolchain path export N64_INST=/path/to/mips64-toolchain # Place weights in filesystem/ cp weights/sophia_weights.bin filesystem/ # Build make clean && make # Run in ares ares legend_of_elya.z64 ``` ### Option 3: Train Your Own Model ```bash # Requires PyTorch + CUDA GPU python3 train_sophia_v5.py # ~20 min on RTX 5070, exports filesystem/sophia_weights.bin ``` --- ## Pre-trained Weights The `weights/sophia_weights.bin` file contains a pre-trained v5 model (819K params, Q8 format, 458KB). Training corpus covers: Sophia Elya identity, RustChain blockchain, Elya lore, N64 hardware, PowerPC architecture, dungeon/RPG dialog. **Weight file format:** | Offset | Size | Field | |--------|------|-------| | 0 | 4 | Magic: `0x53454149` ("SEAI"), little-endian | | 4 | 1 | n_layers (4) | | 5 | 2 | n_embed (128) | | 7 | 1 | n_heads (4) | | 8 | 2 | vocab_size (256) | | 10 | 1 | ctx_len (64) | | 11 | 1 | em_scale_x16 (56 = 3.5 × 16) | | 12 | 32768 | Embedding table (256 × 128, int8) | | 32780 | ... | Layer weights (int8) + scales (float16) × 4 layers | --- ## Honest Limitations - **819K parameters.** Responses are short and sometimes imprecise ("rinces" instead of "Princess"). Expected at this scale. The achievement is real-time transformer inference on 1996 hardware. - **Context window is 64 tokens.** Prompt + response must fit in 64 bytes. - **No memory between dialogs.** KV cache resets each conversation. - **Byte-level vocabulary.** One ASCII character per token — no subword tokenization. - **Training corpus is small.** More data and epochs will improve coherence. --- ## Roadmap: N64 LLM SDK The goal is to shrink, optimize, and package this into a **reusable SDK** that any N64 homebrew developer can drop into their game to give NPCs real language understanding. ### Phase 1: Core Engine (DONE) - [x] Float32 transformer inference on MIPS R4300i - [x] Q8 quantized weights with on-the-fly dequantization - [x] Custom math (Taylor exp, fast inverse sqrt) avoiding missing R4300i instructions - [x] Big-endian weight loading from ROM filesystem - [x] Hardware entropy from CPU oscillator - [x] Working demo ROM with dialog system ### Phase 2: Model Quality (IN PROGRESS) - [ ] Extended training corpus (500+ QA pairs across game domains) - [ ] Longer training runs (200K+ steps) for better convergence - [ ] Context-aware prompting (NPC name, location, game state as prefix tokens) - [ ] Multiple personality weights (warrior NPC, merchant, sage, villain) - [ ] Fine-tune for specific game genres (RPG, adventure, puzzle) ### Phase 3: Performance Optimization - [ ] **RSP microcode acceleration** — the N64's RSP has 8-lane SIMD (`VMULF`/`VMADH`); offloading matmul to RSP could give 4-8× speedup over scalar VR4300 - [ ] **Q4 quantization** — halve weight size to ~230KB, fit more model or more NPCs - [ ] **Tiled matmul** — process weights in cache-friendly blocks to reduce RDRAM stalls - [ ] **Speculative generation** — pre-generate during idle frames (exploration, cutscenes) - [ ] **KV cache sharing** — multiple NPCs sharing embedding + early layers, diverging at output ### Phase 4: SDK Release - [ ] **`n64_llm.h` / `n64_llm.c`** — single-file drop-in library - [ ] **Simple API**: ```c // Init with weight data from ROM N64LLM_State *npc = n64llm_init(rom_weights, weight_size); // Set NPC personality context n64llm_set_context(npc, "You are a blacksmith in the Crystal Caverns."); // Generate response to player input char response[128]; n64llm_generate(npc, "Do you sell shields?", response, sizeof(response)); // Per-frame generation (non-blocking, 1 token per frame) int done = n64llm_step(npc); ``` - [ ] **Multiple NPC support** — share weights, separate KV caches (~256KB each) - [ ] **Weight format tools** — Python scripts to train custom NPC personalities - [ ] **Expansion Pak support** — 8MB mode enables 6-8 layer models or multiple NPCs - [ ] **Example ROMs** — tavern scene with 3 NPCs, shop with merchant, quest giver ### Phase 5: Advanced Features - [ ] **Player text input** — on-screen keyboard (D-pad character picker) - [ ] **Game state injection** — feed inventory, health, location as context tokens - [ ] **Emotional state** — NPC mood affects response style (scared, friendly, hostile) - [ ] **Memory** — persist key facts across conversations using save file - [ ] **Multi-language** — vocabulary supports full 256-byte range for accented characters - [ ] **RSP-only inference** — entire forward pass on RSP, freeing VR4300 for game logic ### Size Targets | Config | Layers | Embed | Params | Weight Size | RAM (KV+scratch) | Use Case | |--------|--------|-------|--------|-------------|-------------------|----------| | Tiny | 2 | 64 | ~100K | ~60KB | ~70KB | Simple responses, many NPCs | | Small | 4 | 128 | 819K | 458KB | 263KB | Current — single NPC dialog | | Medium | 6 | 192 | ~2.8M | ~1.5MB | 600KB | Rich dialog, Expansion Pak | | Large | 8 | 256 | ~8.4M | ~4.2MB | 1.6MB | Full conversations, 8MB mode | --- ## Why This Matters Every "AI NPC" in modern games is a cloud API call. This runs **entirely on the cartridge** — no internet, no server, no loading screen. The VR4300 does the matrix math. The ROM holds the weights. The RDRAM holds the KV cache. It's the same transformer architecture as GPT — just 819K parameters instead of 175 billion. And it runs on hardware that predates Google. If we can make a transformer talk on 8MB of RAM and a 93MHz MIPS CPU, the excuses for cloud-dependent "AI" in games evaporate. --- ## Screenshots | IBM POWER8 Response | Elya Crystal Response | |---------------------|----------------------| | ![](screenshots/n64_llm_ibm_power8.png) | ![](screenshots/n64_llm_zelda_triforce.png) | --- ## Optional: RustChain Mining Module The `mining/` directory contains an optional **proof-of-antiquity** mining module that lets a real N64 earn RTC (RustChain Token) rewards by submitting hardware attestations to the [RustChain](https://rustchain.org) blockchain. **How it works:** - N64 runs 5 hardware fingerprint checks (CPU PRId, COUNT timing, VI scan, memory ratio, anti-emulation) - Results are written to controller pak via joybus → Raspberry Pi Pico relays over USB → Python host bridge submits to RustChain node - Real chain data (epoch, slot, balance, miner count) flows back: RustChain API → Python → USB → Pico → pak READ → N64 display - N64 gets a **3.0x antiquity multiplier** as vintage hardware (1996 silicon) - Wallet is hardware-derived from RDRAM config registers + CP0 PRId — unique per console **Requirements:** N64 + EverDrive 64 + Raspberry Pi Pico + USB cable See [`mining/README.md`](mining/README.md) for full setup instructions. --- ## Credits Built by [Elyan Labs](https://rustchain.org). - **Engine**: nano-GPT float32 inference on MIPS R4300i - **Game**: libdragon SDK, pixel art, dungeon adventure - **Training**: PyTorch on RTX 5070 - **Platform**: [BoTTube](https://bottube.ai) for video hosting Source is open — build it, train it, improve it, port it.