---
name: mlx-apple-silicon
description: Run LLMs on Apple Silicon with MLX/mlx_lm - unified memory, 4-bit quantization, streaming generation, prompt caching. Optimal for M-series chips.
version: 1.0.0
---

# MLX Apple Silicon Skill

> *"Unified memory means no GPU↔CPU transfers - arrays live in shared memory."*

**Trit**: +1 (PLUS - generative)
**Color**: Warm (optimistic/fast)

## Overview

[MLX](https://github.com/ml-explore/mlx) is Apple's ML framework for Apple Silicon:

- **Unified Memory**: No GPU↔CPU data transfers
- **Lazy Evaluation**: Compute only what's needed
- **Metal Backend**: Native GPU acceleration
- **4-bit Quantization**: 75% smaller models

[MLX-LM](https://github.com/ml-explore/mlx-lm) provides high-level LLM APIs.

## Quick Start

```bash
# Install (macOS Apple Silicon)
pip install mlx mlx-lm

# Install (Linux CUDA - v0.28+)
pip install "mlx[cuda]"

# Generate text
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --prompt "Hello" --max-tokens 100

# Interactive chat
mlx_lm.chat --model mlx-community/Mistral-7B-Instruct-v0.3-4bit

# Vision/Multimodal (mlx-vlm)
pip install mlx-vlm
mlx_vlm.chat --model mlx-community/Qwen2.5-VL-7B-Instruct-4bit
```

## Python API

### Basic Generation

```python
from mlx_lm import load, generate

# Load 4-bit quantized model
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Generate
messages = [{"role": "user", "content": "Write a haiku"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
text = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(text)
```

### Streaming Generation

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

for response in stream_generate(model, tokenizer, prompt="Hello", max_tokens=100):
    print(response.text, end="", flush=True)
    # response.token, response.logprobs, response.generation_tps available
```

### Batch Generation

```python
from mlx_lm import load, batch_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompts = ["Story about AI", "Explain ML", "Write a poem"]
result = batch_generate(model, tokenizer, prompts, max_tokens=100)
for text in result.texts:
    print(text)
```

### Sampling Control

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

sampler = make_sampler(
    temp=0.7,    # Temperature
    top_p=0.9,   # Nucleus sampling
    top_k=50,    # Top-k sampling
    min_p=0.05,  # Min probability threshold
)
# Repetition penalty is applied as a logits processor, not a sampler option
logits_processors = make_logits_processors(repetition_penalty=1.1)

text = generate(
    model, tokenizer, prompt="Tell me a joke",
    sampler=sampler, logits_processors=logits_processors
)
```

### Prompt Caching (Multi-turn)

```python
from mlx_lm import load, stream_generate
from mlx_lm.models.cache import make_prompt_cache, save_prompt_cache, load_prompt_cache

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Create cache for system prompt + context
long_context = "..."  # your reusable context document
system = "You are an expert. " + long_context
cache = make_prompt_cache(model)

# Prime the cache
for r in stream_generate(model, tokenizer, system, prompt_cache=cache, max_tokens=1):
    break

# Save for reuse
save_prompt_cache("my_cache.safetensors", cache)

# Later: reuse with different queries
cache = load_prompt_cache("my_cache.safetensors")
for r in stream_generate(model, tokenizer, "What is 2+2?", prompt_cache=cache, max_tokens=50):
    print(r.text, end="", flush=True)
```

### KV Cache Rotation (Long Sequences)

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Limit KV cache to 512 tokens (bounded memory for long sequences)
text = generate(
    model, tokenizer,
    prompt="Very long context...",
    max_kv_size=512,
    max_tokens=1000
)
```

### Speculative Decoding

```python
from mlx_lm import load, stream_generate

# Main model
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
# Faster draft model (illustrative repo name - use any smaller model with a compatible tokenizer)
draft_model, _ = load("mlx-community/Mistral-3B-Instruct-4bit")

for r in stream_generate(
    model, tokenizer,
    prompt="Tell me about ML",
    draft_model=draft_model,
    num_draft_tokens=3,
    max_tokens=512
):
    print(r.text, end="", flush=True)
```

## Model Conversion & Quantization

```python
from mlx_lm import convert

# Download, quantize, and optionally upload
convert(
    hf_path="mistralai/Mistral-7B-Instruct-v0.3",
    mlx_path="./my-mistral-4bit",
    quantize=True,
    q_bits=4,          # 4-bit, 8-bit, or MXFP4/NVFP4
    q_group_size=64,
    dtype="float16",
    upload_repo="mlx-community/my-mistral-4bit"  # Optional
)
```

```bash
# CLI conversion
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
  -q --upload-repo mlx-community/my-mistral-4bit
```

## LoRA/QLoRA Fine-Tuning

### LoRALinear Adapter

```python
import math

import mlx.core as mx
import mlx.nn as nn

class LoRALinear(nn.Module):
    """Low-Rank Adaptation: W' = W + scale * (A @ B)"""

    def __init__(self, input_dims, output_dims, r=8, scale=20.0, dropout=0.0):
        super().__init__()
        self.linear = nn.Linear(input_dims, output_dims)
        self.dropout = nn.Dropout(p=dropout)
        self.scale = scale
        # A: (input, r), B: (r, output) - B zero-init for stable start
        bound = 1 / math.sqrt(input_dims)
        self.lora_a = mx.random.uniform(low=-bound, high=bound, shape=(input_dims, r))
        self.lora_b = mx.zeros((r, output_dims))

    def __call__(self, x):
        y = self.linear(x)
        z = (self.dropout(x) @ self.lora_a) @ self.lora_b
        return y + (self.scale * z).astype(x.dtype)
```

### Training Loop with Gradient Accumulation

```python
from functools import partial

import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlx.utils import tree_map

# Freeze base, unfreeze LoRA layers
# (LoRALinear.from_linear wraps an existing nn.Linear with LoRA weights)
model.freeze()
for l in model.model.layers[-16:]:  # Last 16 layers
    l.self_attn.q_proj = LoRALinear.from_linear(l.self_attn.q_proj)
    l.self_attn.v_proj = LoRALinear.from_linear(l.self_attn.v_proj)

optimizer = optim.Adam(learning_rate=1e-5)

def loss_fn(model, inputs, targets, lengths):
    logits = model(inputs)
    mask = build_mask(lengths)  # user-provided padding mask over target positions
    ce = nn.losses.cross_entropy(logits, targets) * mask
    return ce.sum() / mask.sum()

loss_and_grad = nn.value_and_grad(model, loss_fn)

# Compiled step with gradient accumulation
@partial(mx.compile, inputs=model.state, outputs=model.state)
def step(batch, accumulated_grad, do_update, accum_steps):
    loss, grad = loss_and_grad(model, *batch)
    if accumulated_grad:
        grad = tree_map(lambda a, b: a + b, grad, accumulated_grad)
    if do_update:
        grad = tree_map(lambda g: g / accum_steps, grad)
        optimizer.update(model, grad)
        grad = None
    return loss, grad

# Gradient checkpointing for memory
mx.checkpoint(layer.__call__)  # Recompute activations in backward
```
### CLI Fine-Tuning

```bash
# --data expects a directory containing train.jsonl / valid.jsonl (or a HF dataset name)
mlx_lm.lora --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --data ./data --iters 1000 --batch-size 4 \
  --lora-layers 16 --lora-rank 8 --learning-rate 1e-5 \
  --adapter-path ./adapters
```

## Sampling Strategies

```python
from mlx_lm.sample_utils import make_sampler, make_logits_processors

# Temperature: higher = more random
# Top-K: keep top K tokens only
# Top-P (nucleus): keep tokens until cumsum(prob) > p
# Min-P: keep tokens with prob > top_prob * min_p
# Repetition penalty: discourage repeated tokens (applied as a logits processor)

sampler = make_sampler(
    temp=0.7,
    top_p=0.9,
    top_k=50,
    min_p=0.05,
)
logits_processors = make_logits_processors(
    repetition_penalty=1.1,
    repetition_context_size=100
)

# Sampler internals:
# 1. Apply repetition penalty to recently seen tokens (logits processor)
# 2. Apply top-k filter (argpartition)
# 3. Apply min-p filter (relative to top logprob)
# 4. Apply top-p filter (cumulative threshold)
# 5. Sample with temperature: categorical(logits / temp)
```

## Generation Loop Internals

```python
# Prefill: process prompt in chunks
for i in range(0, len(prompt), prefill_step_size):
    chunk = prompt[i:i + prefill_step_size]
    _ = model(chunk[None], cache=cache)

# Decode: async token generation
stream = mx.new_stream(mx.default_device())
with mx.stream(stream):
    for _ in range(max_tokens):
        logits = model(tokens[None], cache=cache)[:, -1, :]
        logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
        token = sampler(logprobs)
        mx.async_eval(token)
        yield token
```

## Speculative Decoding

```python
from mlx_lm import load, stream_generate

# Main model + faster draft model (illustrative repo names)
model, tok = load("mlx-community/Mistral-7B-4bit")
draft, _ = load("mlx-community/Mistral-1B-4bit")

for r in stream_generate(
    model, tok,
    prompt="...",
    draft_model=draft,
    num_draft_tokens=4,  # Draft proposes 4 tokens, main model verifies
):
    print(r.text, end="")

# Pattern: draft → verify → accept prefix → rewind cache
```

## Supported Models

### Text Models (mlx-lm)

- **Llama** (2, 3, 3.2, 3.3)
- **Mistral** (v0.1-v0.3, Nemo)
- **Phi** (3, 3.5, 4)
- **Gemma** (2, 3)
- **Qwen** (2, 2.5, 3, Coder)
- **DeepSeek** (v2, v3, R1)
- **Mixtral** (MoE 8x7B, 8x22B)
- 100+ more on [mlx-community](https://huggingface.co/mlx-community)

### Vision/Multimodal (mlx-vlm)

- **Qwen-VL** (2, 2.5, 3)
- **LLaVA** (1.5, 1.6, NeXT, Interleave)
- **PaliGemma** (2)
- **Pixtral** (12B)
- **Molmo** (7B, 72B)
- **DeepSeek-VL** (v2)
- **Phi-3-Vision**, **Florence2**, **Idefics3**

```python
# Vision example
from mlx_vlm import load, generate

model, processor = load("mlx-community/Qwen2.5-VL-7B-Instruct-4bit")
output = generate(model, processor, "image.jpg", "Describe this image")
```

## Core MLX Concepts

### Unified Memory

```python
import mlx.core as mx

# Arrays live in shared memory - no GPU↔CPU transfers
a = mx.random.normal((1000, 1000))
b = mx.random.normal((1000, 1000))
c = mx.matmul(a, b)  # Automatic device selection, no data copy
```

### Lazy Evaluation

```python
import mlx.core as mx

a = mx.ones((1000, 1000))
b = mx.ones((1000, 1000))
c = mx.matmul(a, b)  # Not computed yet
mx.eval(c)           # Now computed
```

### Composable Transforms

```python
import mlx.core as mx

def loss_fn(w, x, y):
    return mx.mean((mx.matmul(x, w) - y) ** 2)

# Automatic differentiation
grad_fn = mx.grad(loss_fn)

# Vectorization
vmap_fn = mx.vmap(loss_fn)
```
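As a concrete end-to-end use of these transforms, a minimal sketch (a toy least-squares fit with synthetic data, reusing only the `loss_fn` defined above):

```python
import mlx.core as mx

def loss_fn(w, x, y):
    return mx.mean((mx.matmul(x, w) - y) ** 2)

# Synthetic data: y = x @ w_true
x = mx.random.normal((64, 8))
w_true = mx.random.normal((8,))
y = mx.matmul(x, w_true)

grad_fn = mx.grad(loss_fn)  # gradient w.r.t. the first argument (w)
w = mx.zeros((8,))
for _ in range(200):
    w = w - 0.1 * grad_fn(w, x, y)  # plain gradient descent
mx.eval(w)                          # force the lazy computation
print(loss_fn(w, x, y).item())      # should be near zero
```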
## Performance

| Feature | Benefit |
|---------|---------|
| Unified Memory | No GPU↔CPU transfers |
| Metal Backend | Native M-series acceleration |
| CUDA Backend | Linux NVIDIA GPU support (v0.28+) |
| 4-bit Quantization | 75% smaller, fits on small Macs |
| MXFP4/NVFP4 | New microscaling formats (v0.29+) |
| Lazy Evaluation | Reduced memory footprint |
| Prompt Caching | Fast multi-turn dialogue |
| KV Rotation | Arbitrarily long contexts in bounded memory |
| Speculative Decoding | 2-3x faster with draft model |
| M5 Neural Accelerators | 3.5-4x TTFT speedup (v0.30+) |
| Wired Memory | Large models on macOS 15+ |
| mx.distributed | Multi-GPU training (NCCL) |

## GF(3) Triads

```
mlx-apple-silicon (+1) ⊗ unworld (0) ⊗ segal-types (-1) = 0 ✓
mlx-apple-silicon (+1) ⊗ gay-mcp (0) ⊗ temporal-coalgebra (-1) = 0 ✓
mlx-apple-silicon (+1) ⊗ rama-gay-clojure (0) ⊗ bisimulation-game (-1) = 0 ✓
```

## Commands

```bash
# Generate
mlx_lm.generate --model MODEL --prompt "..." --max-tokens N

# Chat
mlx_lm.chat --model MODEL

# Convert
mlx_lm.convert --hf-path HF_MODEL -q --mlx-path ./local

# Cache prompt
mlx_lm.cache_prompt --model MODEL --prompt "..." --prompt-cache-file cache.safetensors

# LoRA fine-tune
mlx_lm.lora --model MODEL --data ./data --adapter-path ./lora-adapters
```

## Integration with Gay.jl Coloring

```python
from mlx_lm import load, stream_generate

# Each generation step can be colored by trit
GOLDEN = 0x9E3779B97F4A7C15

def splitmix64(x):
    z = (x + GOLDEN) & 0xFFFFFFFFFFFFFFFF
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF
    return (z ^ (z >> 31)) & 0xFFFFFFFFFFFFFFFF

def token_to_trit(token_id, seed):
    h = splitmix64(seed ^ token_id)
    return (h % 3) - 1  # {-1, 0, +1}

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
seed = 0x42D

for i, r in enumerate(stream_generate(model, tokenizer, prompt="Hello", max_tokens=10)):
    trit = token_to_trit(r.token, seed + i)
    print(f"{r.text} [trit={trit:+d}]", end=" ")
```
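As a quick sanity check that the splitmix64-derived trits land close to uniformly on {-1, 0, +1}, a small sketch (pure Python, reusing `token_to_trit` from above):

```python
from collections import Counter

# Count trits over a range of hypothetical token ids
counts = Counter(token_to_trit(token_id, seed=0x42D) for token_id in range(30_000))
print(counts)  # expect roughly 10,000 each of -1, 0, +1
```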
## Model Architecture Internals (LLaMA)

### Attention with Grouped Query Attention (GQA)

```python
class Attention(nn.Module):
    def __init__(self, args):
        super().__init__()
        dim = args.hidden_size
        self.n_heads = args.num_attention_heads     # e.g., 32
        self.n_kv_heads = args.num_key_value_heads  # e.g., 8 (GQA compression)
        self.head_dim = args.hidden_size // self.n_heads
        self.scale = self.head_dim ** -0.5

        self.q_proj = nn.Linear(dim, self.n_heads * self.head_dim)
        self.k_proj = nn.Linear(dim, self.n_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(dim, self.n_kv_heads * self.head_dim)
        self.o_proj = nn.Linear(self.n_heads * self.head_dim, dim)
        self.rope = initialize_rope(...)

    def __call__(self, x, mask=None, cache=None):
        B, L, D = x.shape
        q = self.q_proj(x).reshape(B, L, self.n_heads, -1).transpose(0, 2, 1, 3)
        k = self.k_proj(x).reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
        v = self.v_proj(x).reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)

        # RoPE: Rotary Position Embeddings (θ_i = base^(-2i/d))
        offset = cache.offset if cache else 0
        q, k = self.rope(q, offset=offset), self.rope(k, offset=offset)

        if cache:
            k, v = cache.update_and_fetch(k, v)

        out = mx.fast.scaled_dot_product_attention(q, k, v, scale=self.scale, mask=mask)
        return self.o_proj(out.transpose(0, 2, 1, 3).reshape(B, L, -1))
```

### SwiGLU MLP

```python
class MLP(nn.Module):
    def __call__(self, x):
        # SwiGLU: Down(SiLU(Gate(x)) ⊙ Up(x))
        return self.down_proj(nn.silu(self.gate_proj(x)) * self.up_proj(x))
```

### TransformerBlock (Pre-Norm)

```python
class TransformerBlock(nn.Module):
    def __call__(self, x, mask=None, cache=None):
        h = x + self.self_attn(self.input_layernorm(x), mask, cache)
        return h + self.mlp(self.post_attention_layernorm(h))
```

## Automatic Differentiation

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

def loss_fn(model, x, y):
    logits = model(x)
    return mx.mean(nn.losses.cross_entropy(logits, y))

# Value and gradient in one pass
loss_and_grad_fn = nn.value_and_grad(model, loss_fn)
loss, grads = loss_and_grad_fn(model, inputs, targets)

# Gradient clipping + optimizer step
grads, _ = optim.clip_grad_norm(grads, max_norm=1.0)
optimizer.update(model, grads)
mx.eval(model.parameters(), optimizer.state)
```

### Gradient Flow Through Attention

```
# For O = A @ V with A = softmax(S) and S = (Q @ K^T) · scale:
∂L/∂V ← A^T @ ∂L/∂O
∂L/∂A ← ∂L/∂O @ V^T
∂L/∂S ← softmax_backward(A, ∂L/∂A)
∂L/∂Q ← (∂L/∂S @ K) · scale
∂L/∂K ← ((∂L/∂S)^T @ Q) · scale

# All fused in mx.fast.scaled_dot_product_attention's backward pass
```

## RoPE Variants

| Variant | Context | Base θ Formula |
|---------|---------|----------------|
| Default | 4K-8K | `10000^(-2i/d)` |
| Llama3RoPE | 128K | Frequency interpolation + scaling |
| YarnRoPE | 64K+ | Smooth frequency scaling |
| SuScaledRoPE | 100K+ | Split short/long frequency scaling |

## KV Cache Strategies

```python
from mlx_lm.models.cache import KVCache, RotatingKVCache, make_prompt_cache, save_prompt_cache

# Standard incremental cache
cache = KVCache()  # Pre-allocates in 256-token chunks

# Rotating cache for sliding window attention (Mistral, LLaMA 3.2)
cache = RotatingKVCache(max_size=4096, keep=4)  # keep=N attention sinks

# Prompt caching (reuse system prompt)
cache = make_prompt_cache(model)
save_prompt_cache("system.safetensors", cache)
```

## Latent Space Topology

### Extracting Hidden States

```python
# Hook into transformer layers for latent analysis
def extract_activations(model, inputs):
    activations = []
    h = model.model.embed_tokens(inputs)
    for layer in model.model.layers:
        h = layer(h, mask=None, cache=None)
        activations.append(h)  # Snapshot each layer's residual stream
    return activations

# Analyze residual stream
residual_norms = [mx.linalg.norm(a, axis=-1).mean() for a in activations]
```

### Hyperbolic Distance (Beyond Euclid)

```python
def poincare_distance(u, v, eps=1e-5):
    """Hyperbolic distance in Poincaré ball model"""
    diff = u - v
    norm_u = mx.linalg.norm(u, axis=-1, keepdims=True)
    norm_v = mx.linalg.norm(v, axis=-1, keepdims=True)
    norm_diff = mx.linalg.norm(diff, axis=-1, keepdims=True)
    denom = (1 - norm_u**2) * (1 - norm_v**2) + eps
    return mx.arccosh(1 + 2 * norm_diff**2 / denom)

# For attention patterns: heads form hyperbolic tree structures
# Low curvature → flat Euclidean, high curvature → hierarchical
```
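A hypothetical usage sketch, assuming `model` and a batched token array `inputs` (shape `(1, L)`) as in the extraction snippet above; the `to_ball` helper is an illustrative radial shrink that keeps vectors inside the unit ball, which the Poincaré model requires:

```python
import mlx.core as mx

def to_ball(h):
    # Crude radial shrink: maps any vector to norm < 1
    norm = mx.linalg.norm(h, axis=-1, keepdims=True)
    return h / (1.0 + norm)

acts = extract_activations(model, inputs)  # list of (1, L, D) residual states
u = to_ball(acts[0][0, -1])                # last-token state, first layer
v = to_ball(acts[-1][0, -1])               # last-token state, last layer
print(poincare_distance(u, v).item())      # hyperbolic distance between layers
```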
### Active Inference Integration

```python
import mlx.core as mx
import mlx.nn as nn

def free_energy(model, tokens, prior_mean, prior_var, post_mean, post_var):
    """Variational free energy = prediction error (accuracy) + complexity (KL from prior)"""
    # Prediction: forward pass gives the expected next-token distribution
    inputs, targets = tokens[:-1][None], tokens[1:][None]
    logits = model(inputs)
    # Prediction error: negative log-likelihood of the observed tokens
    pred_error = mx.mean(nn.losses.cross_entropy(logits, targets))
    # Complexity: KL(q ‖ p) between diagonal Gaussian belief q and prior p
    kl = 0.5 * mx.sum(
        post_var / prior_var
        + (post_mean - prior_mean) ** 2 / prior_var
        - 1
        + mx.log(prior_var / post_var)
    )
    return pred_error + kl  # Minimize to update beliefs
```

## References

- [ml-explore/mlx](https://github.com/ml-explore/mlx) (23K★)
- [ml-explore/mlx-lm](https://github.com/ml-explore/mlx-lm) (3.1K★)
- [mlx-community on HuggingFace](https://huggingface.co/mlx-community)
- [MLX Documentation](https://ml-explore.github.io/mlx/)
- [LLaMA model implementation](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/models/llama.py)

---

**Skill Name**: mlx-apple-silicon
**Type**: LLM Inference / Apple Silicon / Autodiff
**Trit**: +1 (PLUS - generative)
**GF(3)**: Generates tokens deterministically
**Platform**: macOS with Apple Silicon
**Active Inference**: Supports latent space extraction + free energy minimization

## Scientific Skill Interleaving

This skill connects to the K-Dense-AI/claude-scientific-skills ecosystem:

### Autodiff

- **jax** [○] via bicomodule - Automatic differentiation

### Bibliography References

- `general`: 734 citations in bib.duckdb

## SDF Interleaving

This skill connects to **Software Design for Flexibility** (Hanson & Sussman, 2021):

### Primary Chapter: 5. Evaluation

**Concepts**: eval, apply, interpreter, environment

### GF(3) Balanced Triad

```
mlx-apple-silicon (−) + SDF.Ch5 (−) + [balancer] (−) = 0
```

**Skill Trit**: -1 (MINUS - verification)

### Secondary Chapters

- Ch3: Variations on an Arithmetic Theme
- Ch4: Pattern Matching
- Ch6: Layering
- Ch10: Adventure Game Example
- Ch1: Flexibility through Abstraction

### Connection Pattern

Evaluation interprets expressions. This skill processes or generates evaluable forms.

## Cat# Integration

This skill maps to **Cat# = Comod(P)** as a bicomodule in the equipment structure:

```
Trit: 0 (ERGODIC)
Home: Prof
Poly Op: ⊗
Kan Role: Adj
Color: #26D826
```

### GF(3) Naturality

The skill participates in triads satisfying:

```
(-1) + (0) + (+1) ≡ 0 (mod 3)
```

This ensures compositional coherence in the Cat# equipment structure.
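A one-line check of the stated congruence, treating trits as integers in {-1, 0, +1} (the triads below are taken from this document):

```python
# Balanced triads sum to 0 mod 3
for triad in [(+1, 0, -1), (-1, -1, -1)]:
    assert sum(triad) % 3 == 0
```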