---
name: kafka-development
description: Best practices and guidelines for Apache Kafka event streaming and distributed messaging
---

# Kafka Development

You are an expert in Apache Kafka event streaming and distributed messaging systems. Follow these best practices when building Kafka-based applications.

## Core Principles

- Kafka is a distributed event streaming platform for high-throughput, fault-tolerant messaging
- Unlike traditional pub/sub systems, Kafka uses a pull model: consumers pull messages from partitions
- Design for scalability, durability, and exactly-once semantics where needed
- Leave NO todos, placeholders, or missing pieces in the implementation

## Architecture Overview

### Core Components

- **Topics**: Categories/feeds for organizing messages
- **Partitions**: Ordered, immutable sequences within topics enabling parallelism
- **Producers**: Clients that publish messages to topics
- **Consumers**: Clients that read messages from topics
- **Consumer Groups**: Coordinate consumption across multiple consumers
- **Brokers**: Kafka servers that store data and serve clients

### Key Concepts

- Messages are appended to partitions in order
- Each message has an offset: a unique, sequential ID within its partition
- Consumers maintain their own cursor (offset) and can re-read a stream repeatedly
- Partitions are distributed across brokers for scalability

## Topic Design

### Partitioning Strategy

- Use partition keys to place related events in the same partition
- Messages with the same key always go to the same partition, which guarantees ordering for related events
- Choose keys carefully: uneven key distribution causes hot partitions

### Partition Count

- More partitions mean more parallelism but also more overhead
- Consider expected throughput, consumer count, and broker resources
- Start with the number of consumers you expect to run concurrently
- Partition count can be increased but never decreased; note that increasing it changes the key-to-partition mapping

### Topic Configuration

- `retention.ms`: How long to keep messages (default 7 days)
- `retention.bytes`: Maximum size per partition
- `cleanup.policy`: `delete` (remove old segments) or `compact` (keep the latest record per key)
- `min.insync.replicas`: Minimum number of in-sync replicas that must acknowledge a write when `acks=all`

## Producer Best Practices

### Reliability Settings

```
acks=all                  # Wait for all in-sync replicas to acknowledge
retries=2147483647        # Retry on transient failures (Integer.MAX_VALUE)
enable.idempotence=true   # Prevent duplicate messages on retry
```

### Performance Tuning

- `batch.size`: Accumulate messages before sending (default 16 KB)
- `linger.ms`: Wait time for batching (0 = send immediately)
- `buffer.memory`: Total memory for buffering unsent messages
- `compression.type`: gzip, snappy, lz4, or zstd for bandwidth savings

### Error Handling

- Implement retry logic with exponential backoff
- Handle retriable vs. non-retriable exceptions differently
- Log and alert on send failures (see the producer sketch after this section)
- Consider dead letter topics for messages that fail repeatedly

### Partitioner

- Default: a hash of the key determines the partition (null keys use sticky partitioning in modern clients, round-robin in older ones)
- Write a custom partitioner for specific routing needs
- Ensure even distribution to avoid hot partitions
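The settings above come together in a producer roughly like this minimal Java sketch. The broker address, the `orders` topic, and the key/value payloads are illustrative assumptions, not part of these guidelines:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Reliability settings from the section above
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        // Batching and compression for throughput
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by order ID keeps all events for one order in the same partition
            ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", "order-42", "{\"status\":\"created\"}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    // Failure surfaced after the client exhausted its own retries: log and alert
                    System.err.println("Send failed: " + exception.getMessage());
                } else {
                    System.out.printf("Sent to %s-%d@%d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any buffered records
    }
}
```

Keying the record by order ID illustrates the partitioning guidance: every event for `order-42` lands in the same partition, so those events stay ordered relative to each other.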
## Consumer Best Practices

### Offset Management

- Consumers track which messages they have processed via offsets
- `auto.offset.reset`: `earliest` (start from the beginning) or `latest` (only new messages) when no committed offset exists
- Commit offsets after successful processing, not before
- Set `enable.auto.commit=false` and commit manually for at-least-once processing; exactly-once additionally requires transactions

### Consumer Groups

- Consumers in a group share partitions (each partition is assigned to exactly one consumer in the group)
- More consumers than partitions means some consumers sit idle
- Group rebalancing occurs when consumers join or leave
- Use `group.instance.id` for static membership to reduce rebalances

### Processing Patterns

- Process messages in order within a partition
- Handle out-of-order messages across partitions if needed
- Implement idempotent processing for at-least-once delivery
- Consider transactional processing for exactly-once

### Timeouts and Failures

- Enforce a processing timeout so one slow event cannot stall its partition
- When a timeout occurs, set the event aside and continue to the next message
- Favor overall system performance over processing every single event
- Use dead letter queues for messages that fail all retries

## Error Handling and Retry

### Retry Strategy

- Allow multiple runtime retries per processing attempt
- Example: 3 runtime retries per redrive with a maximum of 5 redrives = 15 total retries
- Runtime retries typically cover 99% of failures
- After exhausting retries, route the message to a dead letter queue

### Dead Letter Topics

- Create a dedicated DLT for messages that cannot be processed
- Include the original topic, partition, offset, and error details
- Monitor the DLT for patterns indicating systemic issues
- Implement manual or automated retry from the DLT, as sketched below
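Tying the consumer and dead-letter guidance together, a minimal poll loop with manual commits might look like the following sketch. The topic and group names are illustrative, `process` stands in for your business logic (with its own in-process retries), and `sendToDlt` is sketched after this example:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReliableConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        // Commit manually, only after processing succeeds (at-least-once)
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    try {
                        process(record); // hypothetical business logic with its own retries
                    } catch (Exception e) {
                        // Retries exhausted: set the event aside and keep the partition moving
                        sendToDlt(record, e);
                    }
                }
                // Commit offsets only for records already handled (processed or dead-lettered)
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* ... */ }
    static void sendToDlt(ConsumerRecord<String, String> record, Exception e) { /* see below */ }
}
```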
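The `sendToDlt` helper can delegate to a small publisher that attaches provenance headers, as the list above recommends. The `DltPublisher` class and the `dlt.*` header names are illustrative assumptions, not a Kafka convention:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DltPublisher {
    private final KafkaProducer<String, String> producer;
    private final String dltTopic;

    public DltPublisher(KafkaProducer<String, String> producer, String dltTopic) {
        this.producer = producer;
        this.dltTopic = dltTopic;
    }

    public void send(ConsumerRecord<String, String> failed, Exception error) {
        ProducerRecord<String, String> dltRecord =
            new ProducerRecord<>(dltTopic, failed.key(), failed.value());
        // Preserve provenance so a DLT consumer can diagnose the failure and replay the message
        dltRecord.headers()
            .add("dlt.original.topic", failed.topic().getBytes(StandardCharsets.UTF_8))
            .add("dlt.original.partition",
                String.valueOf(failed.partition()).getBytes(StandardCharsets.UTF_8))
            .add("dlt.original.offset",
                String.valueOf(failed.offset()).getBytes(StandardCharsets.UTF_8))
            .add("dlt.error", String.valueOf(error.getMessage()).getBytes(StandardCharsets.UTF_8));
        producer.send(dltRecord);
    }
}
```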
## Schema Management

### Schema Registry

- Use Confluent Schema Registry for schema management
- Producers validate data against registered schemas during serialization
- Schema mismatches throw exceptions, preventing malformed data from being published
- Provides a common reference for producers and consumers

### Schema Evolution

- Design schemas for forward and backward compatibility
- Add optional fields with defaults for backward compatibility
- Avoid removing or renaming fields
- Use schema versioning and migration strategies

## Kafka Streams

### State Management

- Use log compaction on changelog topics to retain the latest value per key
- Periodically purge old data from state stores
- Monitor state store size and access patterns
- Use appropriate storage backends for your scale

### Windowing Operations

- Handle out-of-order events and skewed timestamps
- Use appropriate timestamp extraction and watermarking techniques
- Configure grace periods for late-arriving data
- Choose window types based on the use case (tumbling, hopping, sliding, session)

## Security

### Authentication

- Use SASL/SSL for client authentication
- Supported SASL mechanisms: PLAIN, SCRAM, OAUTHBEARER, GSSAPI
- Enable SSL/TLS for encryption in transit
- Rotate credentials regularly

### Authorization

- Use Kafka ACLs for fine-grained access control
- Grant the minimum necessary permissions per principal
- Separate read/write permissions by topic
- Audit access patterns regularly

## Monitoring and Observability

### Key Metrics

- **Producer**: record-send-rate, record-error-rate, batch-size-avg
- **Consumer**: records-consumed-rate, records-lag, commit-latency
- **Broker**: under-replicated-partitions, request-latency, disk-usage

### Lag Monitoring

- Consumer lag = latest produced offset (log end offset) minus last committed offset
- High lag indicates consumers cannot keep up
- Alert on increasing lag trends
- Scale out consumers or optimize processing in response

### Distributed Tracing

- Propagate trace context in message headers
- Use OpenTelemetry for end-to-end tracing
- Correlate producer and consumer spans
- Track each message's journey through the pipeline

## Testing

### Unit Testing

- Mock Kafka clients for isolated testing
- Test serialization/deserialization logic
- Verify partitioning logic
- Test error handling paths

### Integration Testing

- Use embedded Kafka or Testcontainers
- Test full producer-consumer flows
- Verify exactly-once semantics if used
- Test rebalancing scenarios

### Performance Testing

- Load test with production-like message rates
- Test consumer throughput and lag behavior
- Verify broker resource usage under load
- Test failure and recovery scenarios

## Common Patterns

### Event Sourcing

- Store all state changes as immutable events
- Rebuild state by replaying events (see the replay sketch at the end of this document)
- Use log compaction for snapshots
- Enables time-travel debugging

### CQRS (Command Query Responsibility Segregation)

- Separate write (command) and read (query) models
- Use Kafka as the event store
- Build read-optimized projections from events
- Handle eventual consistency appropriately

### Saga Pattern

- Coordinate distributed transactions across services
- Each service publishes events that trigger the next step
- Implement compensating transactions for rollback
- Use correlation IDs to track saga instances

### Change Data Capture (CDC)

- Capture database changes as Kafka events
- Use Debezium or similar CDC tools
- Enable real-time data synchronization
- Build event-driven integrations
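To make the event-sourcing pattern concrete, here is a minimal replay sketch that rebuilds a key-value projection by reading a topic's partition 0 from the beginning. The topic name, the single-partition assumption, and the last-write-wins fold are all illustrative:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StateRebuilder {
    public static Map<String, String> replay(String topic) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Map<String, String> state = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign the partition directly (no consumer group) and rewind to the start
            TopicPartition tp = new TopicPartition(topic, 0);
            consumer.assign(List.of(tp));
            consumer.seekToBeginning(List.of(tp));
            long end = consumer.endOffsets(List.of(tp)).get(tp);

            while (consumer.position(tp) < end) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Fold each event into the projection; last write per key wins here
                    state.put(record.key(), record.value());
                }
            }
        }
        return state;
    }
}
```

On a compacted topic this converges to the latest value per key, which is exactly the snapshot behavior described in the Event Sourcing pattern above.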