---
name: anth-rate-limits
description: 'Implement Anthropic Claude API rate limiting, backoff, and quota management.

  Use when handling 429 errors, optimizing request throughput,

  or managing RPM/TPM limits across usage tiers.

  Trigger with phrases like "anthropic rate limit", "claude 429",

  "anthropic throttling", "claude retry", "anthropic backoff".

  '
allowed-tools: Read, Write, Edit
version: 1.0.0
license: MIT
author: Jeremy Longshore <jeremy@intentsolutions.io>
tags:
- saas
- ai
- anthropic
compatibility: Designed for Claude Code
---
# Anthropic Rate Limits

## Overview

The Claude API uses token-bucket rate limiting measured in three dimensions: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). Limits increase automatically as you move through usage tiers.

## Rate Limit Dimensions

| Dimension | Header | Description |
|-----------|--------|-------------|
| RPM | `anthropic-ratelimit-requests-limit` | Requests per minute |
| ITPM | `anthropic-ratelimit-tokens-limit` | Input tokens per minute |
| OTPM | `anthropic-ratelimit-tokens-limit` | Output tokens per minute |

Limits are per-organization and per-model-class. Cached input tokens do NOT count toward ITPM limits.

## Usage Tiers (Auto-Upgrade)

| Tier | Monthly Spend | Key Benefit |
|------|---------------|-------------|
| Tier 1 (Free) | $0 | Evaluation access |
| Tier 2 | $40+ | Higher RPM |
| Tier 3 | $200+ | Production-grade limits |
| Tier 4 | $2,000+ | High-throughput access |
| Scale | Custom | Custom limits via sales |

Check your current tier and limits at [console.anthropic.com](https://console.anthropic.com/settings/limits).

## SDK Built-In Retry

```python
import anthropic

# The SDK retries 429 and 5xx errors automatically (2 retries by default)
client = anthropic.Anthropic(max_retries=5)  # Increase for high-traffic apps

# Disable auto-retry for manual control
client = anthropic.Anthropic(max_retries=0)
```

```typescript
const client = new Anthropic({ maxRetries: 5 });
```

## Custom Rate Limiter with Header Awareness

```python
import time
import anthropic

class RateLimitedClient:
    def __init__(self):
        self.client = anthropic.Anthropic(max_retries=0)  # We handle retries
        self.remaining_requests = 100
        self.remaining_tokens = 100000
        self.reset_at = 0.0

    def create_message(self, **kwargs):
        # Pre-check: wait if near limit
        if self.remaining_requests < 3 and time.time() < self.reset_at:
            wait = self.reset_at - time.time()
            print(f"Pre-throttle: waiting {wait:.1f}s")
            time.sleep(wait)

        for attempt in range(5):
            try:
                response = self.client.messages.create(**kwargs)

                # Update from response headers (via _response)
                headers = response._response.headers
                self.remaining_requests = int(headers.get("anthropic-ratelimit-requests-remaining", 100))
                self.remaining_tokens = int(headers.get("anthropic-ratelimit-tokens-remaining", 100000))
                reset = headers.get("anthropic-ratelimit-requests-reset")
                if reset:
                    from datetime import datetime
                    self.reset_at = datetime.fromisoformat(reset.replace("Z", "+00:00")).timestamp()

                return response
            except anthropic.RateLimitError as e:
                retry_after = float(e.response.headers.get("retry-after", 2 ** attempt))
                print(f"429 — retry in {retry_after}s (attempt {attempt + 1})")
                time.sleep(retry_after)

        raise Exception("Exhausted rate limit retries")
```

## Queue-Based Throughput Control

```typescript
import PQueue from 'p-queue';
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Enforce 50 RPM with concurrency limit
const queue = new PQueue({
  concurrency: 10,
  interval: 60_000,
  intervalCap: 50,
});

async function rateLimitedCall(prompt: string) {
  return queue.add(() =>
    client.messages.create({
      model: 'claude-sonnet-4-20250514',
      max_tokens: 1024,
      messages: [{ role: 'user', content: prompt }],
    })
  );
}

// Process 200 prompts without hitting limits
const results = await Promise.all(
  prompts.map(p => rateLimitedCall(p))
);
```

## Cost-Saving: Use Batches for Bulk Work

```python
# Message Batches API: 50% cheaper, no rate limit pressure on real-time quota
batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"req-{i}", "params": {
            "model": "claude-sonnet-4-20250514",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}]
        }}
        for i, prompt in enumerate(prompts)
    ]
)
```

## Error Handling

| Header | Description | Action |
|--------|-------------|--------|
| `retry-after` | Seconds until next request allowed | Sleep this duration exactly |
| `anthropic-ratelimit-requests-remaining` | Requests left in window | Throttle if < 5 |
| `anthropic-ratelimit-tokens-remaining` | Tokens left in window | Reduce `max_tokens` if low |
| `anthropic-ratelimit-requests-reset` | ISO timestamp of window reset | Schedule retry after this time |

## Resources

- [Rate Limits Documentation](https://docs.anthropic.com/en/api/rate-limits)
- [Usage Tiers](https://docs.anthropic.com/en/api/service-tiers)
- [Message Batches API](https://docs.anthropic.com/en/api/creating-message-batches)

## Next Steps

For security configuration, see `anth-security-basics`.