---
title: "Distributed Locking for AI Agent Coordination"
date: "2026-02-10"
description: "How Redis-backed file claims solve the coordination problem when multiple AI agents work on the same codebase simultaneously."
topics: [distributed-systems, redis, agent-coordination]
status: "published"
author: "Michael Hofweller"
readingTime: "12 min read"
---

# Distributed Locking for AI Agent Coordination

When multiple AI agents work on the same codebase simultaneously, they need a way to avoid stepping on each other's toes. This is the distributed coordination problem, a long-standing problem in computer science that is now showing up in an entirely new context.

## The Problem

Imagine three Claude Code agents working on a feature branch. Agent A is refactoring the authentication module. Agent B is updating the API routes that depend on auth. Agent C is writing tests for both. Without coordination, Agent B might read a file that Agent A is halfway through rewriting. Agent C might test against a state that no longer exists.

This isn't hypothetical. It's what happens in every multi-agent engineering setup that lacks coordination primitives.

## Redis-Backed File Claims

The solution we implemented in Nexus uses Redis as a distributed lock manager with file-level granularity. When an agent needs to modify a file, it "claims" it:

```
CLAIM file:src/auth/login.ts agent:agent-a ttl:30000
```

The claim is a Redis key with a TTL (time-to-live): the key names the file, and the value names the agent that holds it. This gives us several properties for free:

- **Mutual exclusion**: Only one agent can hold a claim on a file at a time
- **Crash tolerance**: If an agent dies, the TTL expires and the lock is automatically released
- **Visibility**: Any agent can query Redis to see who holds what
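
To make these three properties concrete, here is a minimal in-memory model of the claim semantics. It is a sketch for illustration only, not the Nexus implementation and not a Redis client: the `ClaimTable` class and its method names are hypothetical, and time is passed in explicitly so that expiry is easy to follow.

```typescript
// In-memory model of "set if absent, with a TTL". Not a Redis client.
type Claim = { agentId: string; expiresAt: number };

class ClaimTable {
  private claims = new Map<string, Claim>();

  // Drop the claim once its TTL has elapsed, as Redis would expire the key.
  private live(file: string, now: number): Claim | undefined {
    const claim = this.claims.get(file);
    if (claim && claim.expiresAt <= now) {
      this.claims.delete(file);
      return undefined;
    }
    return claim;
  }

  // Mutual exclusion: succeeds only if no live claim exists (the NX behavior).
  claim(file: string, agentId: string, ttlMs: number, now: number): boolean {
    if (this.live(file, now)) return false;
    this.claims.set(file, { agentId, expiresAt: now + ttlMs });
    return true;
  }

  // Visibility: anyone can ask who currently holds a file.
  holder(file: string, now: number): string | null {
    return this.live(file, now)?.agentId ?? null;
  }
}

const claimTable = new ClaimTable();
const loginFile = "src/auth/login.ts";

// agent-a claims the file at t=0 with a 30 s TTL.
const firstClaim = claimTable.claim(loginFile, "agent-a", 30000, 0);
// agent-b is rejected one second later because agent-a still holds the claim.
const secondClaim = claimTable.claim(loginFile, "agent-b", 30000, 1000);
const holderDuringClaim = claimTable.holder(loginFile, 1000);
// Crash tolerance: if agent-a never renews, the claim is gone at t=30 s.
const claimAfterExpiry = claimTable.claim(loginFile, "agent-b", 30000, 30000);
```

In Redis itself, the same behavior comes from `SET` with the `NX` and `PX` options, which the next section uses.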

## Pipeline Pattern for Atomic Operations

A single file claim is simple, but real work often requires claiming multiple files atomically. You don't want to claim `auth/login.ts` but fail on `auth/types.ts` — that leaves you in a half-locked state.

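We use a Redis pipeline to send all the claims in a single round trip:

```typescript
// Assumes `redis`, `files`, `agentId`, and `ttl` (in milliseconds) are in scope.
const pipeline = redis.pipeline();
for (const file of files) {
  // NX: only set if the key does not exist. PX: expire after `ttl` ms.
  pipeline.set(`claim:${file}`, agentId, "PX", ttl, "NX");
}
// One [error, reply] pair per queued command, in order.
const results = await pipeline.exec();
```

The `NX` flag means "only set if not exists." Each `SET` is atomic on its own, but a pipeline is a batching mechanism, not a transaction: some claims can succeed while others fail. If any claim fails, we roll back the ones that succeeded. That rollback, not the pipeline itself, is what provides the all-or-nothing behavior that makes the system reliable. Other agents may briefly see a partial set of claims before the rollback completes.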
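
The rollback step depends on knowing which claims this call actually acquired. The following is a minimal sketch of that bookkeeping, assuming an ioredis-style `exec()` result in which each entry is an `[error, reply]` pair and a successful `SET ... NX` replies `"OK"`. The `summarizeClaims` helper and the `ClaimOutcome` shape are hypothetical names introduced for illustration.

```typescript
// Shape of one pipeline reply, assuming an ioredis-style exec(): [error, reply].
type PipelineReply = [Error | null, unknown];

interface ClaimOutcome {
  ok: boolean;        // true only if every file was claimed
  acquired: string[]; // files this call actually claimed
  failed: string[];   // files already claimed elsewhere, or whose command errored
}

// Hypothetical helper: pair each file with its reply and classify it.
function summarizeClaims(files: string[], results: PipelineReply[] | null): ClaimOutcome {
  if (results === null || results.length !== files.length) {
    // The replies cannot be matched to files; report failure and let any
    // claims that did land lapse when their TTL expires.
    return { ok: false, acquired: [], failed: [...files] };
  }
  const acquired: string[] = [];
  const failed: string[] = [];
  results.forEach(([err, reply], i) => {
    // SET ... NX replies "OK" when the key was created, null when it already existed.
    if (err === null && reply === "OK") acquired.push(files[i]);
    else failed.push(files[i]);
  });
  return { ok: failed.length === 0, acquired, failed };
}

// Example: the second file was already claimed by another agent.
const outcome = summarizeClaims(
  ["src/auth/login.ts", "src/auth/types.ts"],
  [[null, "OK"], [null, null]],
);
// Only the keys this call created are candidates for rollback.
const keysToRelease = outcome.ok ? [] : outcome.acquired.map((f) => `claim:${f}`);
```

Releasing only the `acquired` keys matters: deleting every requested key would also remove claims held by other agents.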

## Heartbeat-Based Liveness

TTLs handle the crash case, but what about an agent that's alive but slow? A 30-second TTL might expire while a legitimate operation is still in progress.

The solution is heartbeats. Every agent with active claims sends periodic heartbeat signals that extend the TTL:

```
PEXPIRE claim:src/auth/login.ts 30000
```

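If heartbeats stop (because the agent crashed, lost network, or was terminated), the claims expire naturally. No manual cleanup required. One caveat: a bare `PEXPIRE` does not check who holds the key, so a heartbeat that arrives after the claim has expired and been taken by another agent would extend that agent's claim. A renewal should first confirm that the key's value is still the renewing agent's ID.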
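
One way to make a heartbeat renew only a claim the agent still holds is a small Lua script, which Redis runs atomically. The sketch below assumes the `claim:<file>` key layout from the pipeline example and a client whose `eval` takes `(script, numKeys, ...keysAndArgs)`, as ioredis does. The `EvalClient` interface and the `renewClaim` helper are hypothetical names, and this is not necessarily how Nexus implements renewal.

```typescript
// Lua run atomically by Redis: extend the TTL only if the caller still owns the key.
// KEYS[1] = claim key, ARGV[1] = agent ID, ARGV[2] = new TTL in milliseconds.
const RENEW_IF_OWNER = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("PEXPIRE", KEYS[1], ARGV[2])
end
return 0
`;

// Minimal client surface this sketch needs (an assumption, not a library type).
interface EvalClient {
  eval(script: string, numKeys: number, ...args: (string | number)[]): Promise<unknown>;
}

// Hypothetical helper: resolves to true if the claim was renewed,
// false if the key is missing or now belongs to another agent.
async function renewClaim(
  client: EvalClient,
  file: string,
  agentId: string,
  ttlMs: number,
): Promise<boolean> {
  const reply = await client.eval(RENEW_IF_OWNER, 1, `claim:${file}`, agentId, ttlMs);
  return reply === 1;
}
```

An agent that gets `false` back has lost the claim and should stop writing to the file until it claims it again.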

## Conflict Resolution

What happens when two agents try to claim the same file? The first one wins (Redis `NX` guarantees this). The second agent gets a rejection and must decide:

1. **Wait and retry** — poll until the claim is released
2. **Request release** — send a message to the holding agent asking it to finish up
3. **Escalate** — flag the conflict for human review

In practice, option 2 works best for AI agents. They're cooperative by nature and can often reorganize their work to avoid the conflict entirely.
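
The three options can also be combined into a single policy. The sketch below is one hypothetical ordering, not the policy Nexus uses: ask the holder to release first, then poll with capped exponential backoff, and escalate once the attempts run out. The type names, the `nextAction` function, and the default limits are all assumptions chosen for illustration.

```typescript
type ConflictAction =
  | { kind: "request-release" }
  | { kind: "retry"; delayMs: number }
  | { kind: "escalate" };

interface ConflictState {
  attempts: number;          // failed claim attempts so far
  releaseRequested: boolean; // whether the holder has already been asked to finish
}

// Hypothetical policy: ask first, then back off and retry, then escalate.
function nextAction(
  state: ConflictState,
  maxAttempts = 5,
  baseDelayMs = 500,
  maxDelayMs = 8000,
): ConflictAction {
  if (!state.releaseRequested) return { kind: "request-release" };
  if (state.attempts >= maxAttempts) return { kind: "escalate" };
  // Double the wait on each attempt, up to the cap.
  const delayMs = Math.min(baseDelayMs * 2 ** state.attempts, maxDelayMs);
  return { kind: "retry", delayMs };
}
```

Capping the delay keeps a waiting agent responsive once the holder releases the file, while the attempt limit ensures that a stuck conflict eventually reaches a human.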

## Lessons Learned

Building this system taught us several things about distributed coordination for AI agents:

**Agents are more cooperative than processes.** Traditional distributed locking assumes adversarial or at least independent actors. AI agents can actually communicate about their intentions, which makes conflict resolution much smoother.

**TTLs should be generous.** AI agents doing code generation can take unpredictable amounts of time. Short TTLs cause spurious expirations. We settled on 30 seconds with heartbeat renewal every 10 seconds.
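
The relationship between the TTL and the heartbeat interval determines how many heartbeats can be lost before a claim expires. The sketch below works through that arithmetic under idealized assumptions: heartbeats fire exactly on schedule and arrive instantly, so real network latency and timer jitter shrink the margin. The function name is hypothetical.

```typescript
// How many consecutive heartbeats can be missed before a claim expires.
// Idealized: ignores network latency and timer jitter.
function missedHeartbeatsTolerated(ttlMs: number, intervalMs: number): number {
  if (intervalMs <= 0 || ttlMs <= intervalMs) {
    throw new RangeError("heartbeat interval must be positive and shorter than the TTL");
  }
  // Heartbeats fire at interval, 2 * interval, ...; count those strictly before expiry.
  const beatsBeforeExpiry = Math.ceil(ttlMs / intervalMs) - 1;
  // At least one of them has to land, so the rest may be missed.
  return beatsBeforeExpiry - 1;
}

// 30 s TTL with a 10 s heartbeat: beats at 10 s and 20 s fall before expiry.
const toleratedMisses = missedHeartbeatsTolerated(30000, 10000);
```

With a 30-second TTL and a 10-second interval, the claim survives one missed heartbeat. After two consecutive misses, the next heartbeat is due just as the key expires, which is too late to rely on.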

**Visibility matters more than speed.** The ability for any agent (or human) to see who holds what locks is incredibly valuable for debugging. We built a dashboard view of all active claims that updates in real-time via WebSocket.

The full implementation lives in the [Nexus coordination server](https://github.com/mhofwell/nexus-2), where it's battle-tested across multi-agent engineering sessions.