---
name: mlx
description: "Use when working with Apple's MLX or MLX-LM: fact-checking current behavior against upstream source/runtime, patching MLX-based repos, porting PyTorch/JAX code to MLX, validating lazy evaluation, indexing, compilation, streams, channels-last layouts, Metal kernels, quantization, caches, and local MLX model loading or generation on Apple silicon."
---

# MLX

Use this skill for MLX or MLX-LM engineering work where correctness depends on current upstream behavior, not model memory.

## When to Use

- Auditing or patching MLX or MLX-LM repos
- Fact-checking "latest" MLX or MLX-LM behavior
- Porting PyTorch or JAX code to MLX
- Debugging MLX indexing, lazy evaluation, compilation, or stream behavior
- Deciding when to use stock ops, `mx.fast.*`, `mx.fast.metal_kernel(...)`, or a deeper extension path
- Profiling or debugging MLX GPU execution with Metal capture hooks
- Profiling MLX memory usage or allocator/cache behavior on Apple silicon
- Reviewing MLX-LM model load, cache, prompt-cache, quantization, or generation code
- Validating local MLX model paths on Apple silicon

## Core Rules

- If the user asks for current or latest MLX facts, verify releases and source first.
- Prefer upstream docs/source plus runtime checks over memory.
- Treat undocumented runtime behavior as unstable.
- Distinguish documented contracts from observed caveats.
- Keep MLX-LM inference checks local and minimal: `lazy=True`, short prompts, small `max_tokens`.

## Quick Start

Set the skill path once:

```bash
export CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
export MLX_SKILL="$CODEX_HOME/skills/mlx"
```

Check latest upstream releases:

```bash
"$MLX_SKILL/scripts/mlx_release_info.sh"
```

Run the bundled runtime probe:

```bash
"$MLX_SKILL/scripts/mlx_probe.sh"
```

The launcher checks both `python3` and `python` and picks one that can import `mlx`.

Run the probe with a local MLX model:

```bash
MLX_LM_LOCAL_MODEL=/path/to/model "$MLX_SKILL/scripts/mlx_probe.sh"
```

## Workflow

### 1. Classify the task

- **Current facts**: verify latest `mlx` / `mlx-lm` releases, then inspect source
- **Repo validation**: run the repo's own validator if it exists; otherwise use the bundled probe
- **Porting or debugging**: check the current facts reference, then validate the specific behavior locally
- **Local model inference**: use a local MLX model path and keep decode checks short

### 2. For current upstream facts

Use authenticated GitHub workflows when possible:

```bash
"$MLX_SKILL/scripts/mlx_release_info.sh"
gh repo clone ml-explore/mlx /tmp/mlx-upstream -- --depth 1
gh repo clone ml-explore/mlx-lm /tmp/mlx-lm-upstream -- --depth 1
```

Inspect only the files relevant to the question. Typical targets:

- MLX: `docs/src/usage/indexing.rst`, `lazy_evaluation.rst`, `compile.rst`, `numpy.rst`, `python/data_types.rst`, `python/memory_management.rst`, `python/mlx/nn/layers/convolution.py`, `docs/src/dev/custom_metal_kernels.rst`, `docs/src/dev/metal_debugger.rst`, `docs/src/dev/extensions.rst`
- MLX-LM: `mlx_lm/generate.py`, `mlx_lm/utils.py`, `mlx_lm/models/base.py`, `mlx_lm/models/cache.py`

### 3. For runtime validation

If the repo already has an MLX validator, prefer that first.
Otherwise run:

```bash
"$MLX_SKILL/scripts/mlx_probe.sh"
```

The bundled probe checks high-signal MLX and MLX-LM behavior:

- indexing and mask limitations
- slice-copy vs aliasing
- compile and retracing rules
- training flow and optimizer semantics
- channels-last inputs
- stream APIs
- custom Metal kernel and capture-hook surface
- MLX-LM API surface, attention mask, caches, prompt-cache roundtrip
- AutoAWQ/GPTQ transform helpers

### 4. For local model checks

Use a local MLX model path when load/generate behavior matters:

```bash
MLX_LM_LOCAL_MODEL=/path/to/model "$MLX_SKILL/scripts/mlx_probe.sh"
```

This adds:

- real `load(..., lazy=True)`
- one-step `generate(...)`
- `stream_generate(...)` response validation
- prompt-cache save/load on the actual model cache
- generation-stream / `async_eval` / `clear_cache` checks

### 5. For porting or reviews

Check [current-facts.md](./references/current-facts.md) first. Then use [porting-checklist.md](./references/porting-checklist.md) for the common MLX-specific failure modes:

- boolean mask selection unsupported
- slices are copies, not views
- no tensor `backward()` pattern
- explicit `mx.eval(...)` required in training and timing
- channels-last activations
- stream-aware benchmarking
- MLX-LM cache and generation API differences

## Kernel Escalation Path

- Start with stock MLX ops.
- If there is already a tuned kernel in `mx.fast.*`, prefer that first.
- Use `mx.fast.metal_kernel(...)` for Apple-only fused hot paths when the stock op graph is the bottleneck (see the kernel sketch below).
- Be explicit about contiguity: `ensure_row_contiguous=True` can hide copies.
- Use `@mx.custom_function` when the custom kernel also needs custom gradient logic.
- Move to C++ `Primitive` extensions only when Python-level Metal kernels are not enough.
- For serious GPU profiling, capture a `.gputrace` with `mx.metal.start_capture(...)` / `mx.metal.stop_capture()` and inspect it in Xcode.

## High-Signal MLX Differences

- Training is `nn.value_and_grad(...)` plus `optimizer.update(...)` plus `mx.eval(model.parameters(), optimizer.state)` (see the training sketch below).
- Module parameters are created lazily; explicit `mx.eval(model.parameters())` matters before timing and export.
- Conv inputs are channels-last: `NLC`, `NHWC`, `NDHWC`.
- `mx.compile(...)` retraces on dtype, rank, and input-arity changes.
- `shapeless=True` avoids shape-only retracing but can break shape-dependent code.
- Streams are first-class, and timing without `mx.eval(...)` or `mx.synchronize(...)` is often wrong.
- Memory profiling should use the top-level `mx.get_*_memory()` helpers and `mx.device_info()`, not deprecated `mx.metal.*` aliases.
- MLX has a real Python-level fused-kernel escape hatch in `mx.fast.metal_kernel(...)`.

## High-Signal MLX-LM Differences

- `generate(...)` and `stream_generate(...)` accept strings or token IDs (see the generation sketch below).
- `batch_generate(...)` expects token ID lists, not raw strings.
- `stream_generate(...)` yields `GenerationResponse` objects.
- Prompt caches are not always pure KV caches; hybrid models can mix `ArraysCache` and `KVCache`.
- Current `mlx-lm==0.31.0` caveat: `batch_generate(..., max_tokens=1)` can hit a `ZeroDivisionError`.
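## Illustrative Sketches

For the `mx.fast.metal_kernel(...)` rung of the escalation path, a minimal element-wise sketch modeled on the upstream custom-Metal-kernels doc. The kernel name, Metal body, and launch sizes are illustrative, not a tuned implementation; confirm the exact keyword surface against the pinned `mlx` version.

```python
import mlx.core as mx

# Metal body only: MLX generates the kernel signature from input/output names.
source = """
    uint elem = thread_position_in_grid.x;
    T tmp = inp[elem];
    out[elem] = metal::exp(tmp);
"""

# Build the kernel once; reuse the returned callable for repeated launches.
exp_kernel = mx.fast.metal_kernel(
    name="myexp",          # illustrative name
    input_names=["inp"],
    output_names=["out"],
    source=source,
)

def exp_elementwise(a: mx.array) -> mx.array:
    outputs = exp_kernel(
        inputs=[a],
        template=[("T", a.dtype)],   # fills the templated dtype T in the Metal source
        grid=(a.size, 1, 1),         # total threads: one per element
        threadgroup=(256, 1, 1),     # illustrative threadgroup size
        output_shapes=[a.shape],
        output_dtypes=[a.dtype],
    )
    return outputs[0]

x = mx.random.normal(shape=(4096,))
y = exp_elementwise(x)
assert mx.allclose(y, mx.exp(x))  # forces evaluation and checks against the stock op
```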
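For the training contract in the MLX differences list, a minimal sketch of the `value_and_grad` / `update` / `eval` loop; the layer sizes, optimizer choice, and synthetic data are illustrative only.

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
mx.eval(model.parameters())  # parameters are created lazily; materialize before timing/export

optimizer = optim.SGD(learning_rate=1e-2)

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y, reduction="mean")

# No tensor.backward(): gradients come from a transformed loss function.
loss_and_grad = nn.value_and_grad(model, loss_fn)

x = mx.random.normal((8, 16))  # synthetic batch, feature dim last (channels-last style)
y = mx.random.normal((8, 1))

for _ in range(3):
    loss, grads = loss_and_grad(model, x, y)
    optimizer.update(model, grads)
    # Nothing has actually run yet; force the step and optimizer state to evaluate.
    mx.eval(model.parameters(), optimizer.state)
```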
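For the MLX-LM surface, a minimal local smoke check that follows the core rules (`lazy=True`, short prompt, small `max_tokens`); `/path/to/model` is a placeholder for a local MLX model directory on Apple silicon.

```python
from mlx_lm import load, generate, stream_generate

# Placeholder path: point at a real local MLX model directory.
model, tokenizer = load("/path/to/model", lazy=True)

# String prompts are accepted; keep max_tokens small for a quick check.
text = generate(model, tokenizer, prompt="Say hi.", max_tokens=8)
print(text)

# stream_generate(...) yields GenerationResponse objects, not raw strings.
for response in stream_generate(model, tokenizer, prompt="Say hi.", max_tokens=8):
    print(response.text, end="", flush=True)
print()
```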
## References

- Current validated facts and caveats: [current-facts.md](./references/current-facts.md)
- Porting and review checklist: [porting-checklist.md](./references/porting-checklist.md)

## Helpers

- Release helper: [scripts/mlx_release_info.sh](./scripts/mlx_release_info.sh)
- Runtime probe launcher: [scripts/mlx_probe.sh](./scripts/mlx_probe.sh)
- Runtime probe implementation: [scripts/mlx_probe.py](./scripts/mlx_probe.py)