---
name: docker-containerization
description: Write production-grade Dockerfiles, docker-compose configurations, multi-stage builds, and container optimization. Activate on "Dockerfile", "docker", "docker-compose", "container", "multi-stage
  build", "docker image", "container optimization", "docker security". NOT for Kubernetes orchestration, cloud-specific container services (ECS, Cloud Run), or CI/CD pipelines (use github-actions-pipeline-builder).
allowed-tools: Read,Write,Edit,Bash,Grep,Glob
metadata:
  category: DevOps & Site Reliability
  tags:
  - docker
  - containerization
  - dockerfile
  - docker-compose
  pairs-with:
  - skill: devops-automator
    reason: Docker images are built, tested, and deployed through CI/CD automation pipelines
  - skill: github-actions-pipeline-builder
    reason: Container builds and registry pushes are common GitHub Actions workflow stages
  - skill: site-reliability-engineer
    reason: Container health checks, resource limits, and orchestration are SRE responsibilities
  - skill: microservices-patterns
    reason: Microservices are typically deployed as individual Docker containers
---

# Docker Containerization

Write production-grade Dockerfiles with multi-stage builds, security hardening, and size optimization. Covers docker-compose for local development, image layer caching, health checks, and the patterns that separate a 2GB image from a 50MB one.

## When to Use

**Use for**:
- Writing Dockerfiles from scratch or improving existing ones
- Multi-stage builds for compiled or transpiled languages (Go, Rust, TypeScript)
- Docker Compose for local development environments
- Image size optimization (choosing base images, layer caching)
- Docker security scanning and hardening
- Development vs production Dockerfile patterns
- Debugging container build failures
- .dockerignore optimization

**NOT for**:
- Kubernetes deployment/orchestration (different domain)
- Cloud-specific container services (ECS, Cloud Run, App Runner)
- CI/CD pipeline configuration (use `github-actions-pipeline-builder`)
- Container networking beyond docker-compose
- Docker Swarm

---

## Dockerfile Decision Tree

```mermaid
flowchart TD
    Start[What are you building?] --> Lang{Language/runtime?}
    Lang -->|Node.js/TypeScript| Node[Node pattern]
    Lang -->|Python| Python[Python pattern]
    Lang -->|Go| Go[Go pattern]
    Lang -->|Rust| Rust[Rust pattern]
    Lang -->|Static site| Static[Static pattern]

    Node --> NQ{Need build step?}
    NQ -->|Yes, TypeScript/bundler| MultiNode[Multi-stage: build + runtime]
    NQ -->|No, plain JS| SingleNode[Single stage with slim base]

    Python --> PQ{Package manager?}
    PQ -->|pip| PipPattern[pip + venv pattern]
    PQ -->|uv| UvPattern[uv pattern — fastest]
    PQ -->|poetry| PoetryPattern[poetry export pattern]

    Go --> GoMulti[Multi-stage: build + scratch/distroless]
    Rust --> RustMulti[Multi-stage: build + debian-slim]
    Static --> StaticMulti[Multi-stage: build + nginx/caddy]
```

---

## Production Patterns by Language

### Node.js / TypeScript (Multi-Stage)

```dockerfile
# Stage 1: Dependencies
FROM node:22-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# Stage 2: Build (TypeScript/bundler)
FROM node:22-alpine AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 3: Production
FROM node:22-alpine AS production
WORKDIR /app
ENV NODE_ENV=production

# Security: non-root user that belongs to the group created alongside it
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001 -G nodejs

COPY --from=deps /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY package.json ./

USER nodejs
EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s \
  CMD wget -qO- http://localhost:3000/health || exit 1

CMD ["node", "dist/index.js"]
```

### Python (uv — Fastest)

```dockerfile
FROM python:3.12-slim AS base

# Install uv (pin a specific version tag instead of :latest for reproducible builds)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

WORKDIR /app

# Install dependencies only (cached layer); the project source is not copied yet
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev --no-install-project

# Copy application code
COPY . .

# Put the virtualenv on PATH so the app runs without invoking uv at runtime
ENV PATH="/app/.venv/bin:$PATH"

# Non-root user
RUN useradd -r -s /bin/false appuser
USER appuser

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=3s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Go (Multi-Stage → Distroless)

```dockerfile
# Build stage
FROM golang:1.22-alpine AS build
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /server ./cmd/server

# Production: distroless (no shell, no package manager, minimal attack surface)
FROM gcr.io/distroless/static-debian12
COPY --from=build /server /server
EXPOSE 8080
USER nonroot:nonroot
ENTRYPOINT ["/server"]
```

---

## Layer Caching Strategy

```mermaid
flowchart TD
    subgraph "Slow to change (cache hit)"
        A[Base image] --> B[System packages]
        B --> C[Language runtime deps]
    end
    subgraph "Medium change frequency"
        C --> D[Application dependencies]
    end
    subgraph "Fast changing (cache miss OK)"
        D --> E[Application code]
        E --> F[Build step]
    end
```

**Rule**: Order Dockerfile instructions from least-frequently-changed to most-frequently-changed. `RUN`, `COPY`, and `ADD` each create a filesystem layer; other instructions only add metadata. When an instruction's cache is invalidated, every instruction after it is rebuilt.

### Anti-Pattern: COPY Before Dependencies

**Novice**:
```dockerfile
COPY . .                    # ← Busts cache on ANY file change
RUN npm install             # ← Reinstalls everything every build
```

**Expert**:
```dockerfile
COPY package.json package-lock.json ./  # ← Only busts on dependency changes
RUN npm ci                              # ← Cached when deps unchanged
COPY . .                                # ← Only app code changes trigger rebuild
```

**Timeline**: This has been best practice since Docker layer caching was introduced, but LLMs trained on older tutorials still generate the wrong order.
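
To make the rule concrete, here is a toy JavaScript model of cache invalidation. It is a simplification under stated assumptions: real build cache keys are derived from the instruction text and checksums of the copied files, and `rebuiltSteps`, `novice`, and `expert` are illustrative names, not Docker APIs.

```javascript
// Toy model: a step misses the cache when one of its inputs changed,
// and every step after the first miss is rebuilt as well.
// '*' stands for "the whole build context" (COPY . .).
function rebuiltSteps(steps, changedFiles) {
  const changed = new Set(changedFiles);
  const firstMiss = steps.findIndex((step) =>
    step.inputs.some((input) => (input === '*' ? changed.size > 0 : changed.has(input)))
  );
  return firstMiss === -1 ? [] : steps.slice(firstMiss).map((step) => step.name);
}

const novice = [
  { name: 'COPY . .', inputs: ['*'] },
  { name: 'RUN npm install', inputs: [] },
];

const expert = [
  { name: 'COPY package.json package-lock.json ./', inputs: ['package.json', 'package-lock.json'] },
  { name: 'RUN npm ci', inputs: [] },
  { name: 'COPY . .', inputs: ['*'] },
];

// Editing one source file reinstalls dependencies in the novice order...
console.log(rebuiltSteps(novice, ['src/index.js']));
// ...but only re-copies the source in the expert order.
console.log(rebuiltSteps(expert, ['src/index.js']));
// A dependency change still rebuilds all three expert steps.
console.log(rebuiltSteps(expert, ['package.json']));
```

In the expert order, the expensive `npm ci` step is rebuilt only when a manifest file changes, which is the whole point of copying the manifests first.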

---

## Docker Compose for Development

```yaml
# docker-compose.yml
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      target: development          # Dev-specific stage; the Dockerfile must define a stage named "development"
    ports:
      - "${PORT:-3000}:3000"
    volumes:
      - .:/app                     # Hot reload via bind mount
      - /app/node_modules          # Anonymous volume: don't override node_modules
    environment:
      - NODE_ENV=development
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/myapp
    depends_on:
      db:
        condition: service_healthy
    develop:
      watch:                       # Compose Watch (Compose 2.22+): an alternative to the bind mount above; pick one
        - action: sync
          path: ./src
          target: /app/src
        - action: rebuild
          path: package.json

  db:
    image: postgres:16-alpine
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: myapp
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5
    ports:
      - "5432:5432"

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s

volumes:
  pgdata:
```

### Anti-Pattern: No Health Checks

**Novice**: Relies on `depends_on` alone, which only waits for the container to start, not for the service to be ready.
**Expert**: Always add `healthcheck` to database/cache services and use `condition: service_healthy` in `depends_on`. A Postgres container that has started but is still initializing or replaying WAL refuses connections, and an app that connects once at boot without retries will crash.

---

## Image Size Optimization

| Base Image | Size | Use When |
|-----------|------|----------|
| `node:22` | ~1.1 GB | Never in production |
| `node:22-slim` | ~200 MB | Need apt packages |
| `node:22-alpine` | ~130 MB | Default choice |
| `distroless` | ~20 MB | Go/Rust compiled binaries |
| `scratch` | 0 MB | Fully static binaries |
| `chainguard/*` | ~10-30 MB | Security-hardened alternatives |

### Quick Wins

```dockerfile
# 1. Alpine: --no-cache keeps the package index out of the image
RUN apk add --no-cache curl

# 2. Debian/Ubuntu: install and clean up in the same RUN so the apt lists never land in a layer
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*

# 3. Use .dockerignore aggressively
# .dockerignore:
#   node_modules
#   .git
#   *.md
#   .env*
#   dist
#   coverage
#   .next
```

---

## Security Hardening

```dockerfile
# 1. Non-root user (MANDATORY). Alpine/BusyBox syntax; use groupadd/useradd on Debian-based images
RUN addgroup -g 1001 -S appgroup && \
    adduser -S appuser -u 1001 -G appgroup
USER appuser

# 2. Read-only filesystem (in compose)
# docker-compose.yml:
#   read_only: true
#   tmpfs:
#     - /tmp

# 3. No new privileges
# docker run --security-opt no-new-privileges ...

# 4. Pin image digests for reproducibility
FROM node:22-alpine@sha256:abc123...

# 5. Scan for vulnerabilities
# docker scout quickview myimage:latest
# trivy image myimage:latest
```

### Anti-Pattern: Running as Root

**Novice**: Skips the USER instruction. Everything runs as root.
**Expert**: A process running as root inside a container is also root on the host unless user namespaces or rootless mode remap it, so a container escape hands the attacker host root. Always create and switch to a non-root user. Only use root for package installation in build stages.
**Detection**: `docker inspect --format='{{.Config.User}}' image:tag` — if empty, it's root.
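
The detection command can be wrapped in a small script. The sketch below assumes the `docker` CLI is on the PATH; `isRootUser` and `imageRunsAsRoot` are hypothetical helper names. It also treats an explicit `root` or `0` as root, since an image can set `USER root` rather than leaving the field empty.

```javascript
const { execFileSync } = require('node:child_process');

// Interprets the value of .Config.User. Docker accepts "user", "user:group",
// "uid", or "uid:gid"; an empty value means the image default, which is root.
function isRootUser(configUser) {
  const user = configUser.trim().split(':')[0];
  return user === '' || user === 'root' || user === '0';
}

// Runs the docker inspect command shown above for a local image.
// Requires the docker CLI and throws if the image does not exist.
function imageRunsAsRoot(image) {
  const output = execFileSync(
    'docker',
    ['inspect', '--format', '{{.Config.User}}', image],
    { encoding: 'utf8' }
  );
  return isRootUser(output);
}
```

A CI step could call `imageRunsAsRoot('image:tag')` and fail the build when it returns `true`.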

---

## Health Check Strategy by Service Type

### Principle: Liveness, Not Readiness

Docker HEALTHCHECK answers one question: "Is this process alive and minimally functional?" It does NOT answer "Are all dependencies reachable?"; that is readiness (a Kubernetes concept). Conflating them causes cascading restarts: DB goes down → every API container fails its health check → the orchestrator restarts them all → thundering herd on DB recovery. Plain Docker only marks a container `unhealthy`; the restart comes from whatever acts on that status, such as Swarm or an autoheal-style watcher.

### API Services

```dockerfile
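# ${PORT} is expanded by the container's shell at check time; default it so an unset PORT doesn't break the URL
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD wget -qO- http://localhost:${PORT:-3000}/health || exit 1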
```

The `/health` endpoint should:
- Return 200 if the process can serve HTTP requests
- NOT check database connectivity (that's readiness)
- NOT run expensive queries or computations
- Respond in <100ms — it runs every 30 seconds

```js
// Minimal /health endpoint
app.get('/health', (req, res) => res.status(200).json({ status: 'ok' }));
```

If you need a richer health check for monitoring dashboards (DB status, queue depth, cache hit rate), expose it on `/health/detailed` and do NOT wire it to Docker HEALTHCHECK.

Compose equivalent:
```yaml
healthcheck:
  test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
  interval: 30s
  timeout: 3s
  start_period: 10s
  retries: 3
```
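
The split between a cheap `/health` and a richer `/health/detailed` could look like the following sketch. It uses Node's built-in `http` module rather than the Express-style `app` shown earlier so that it stays self-contained; `createHealthHandler`, `createServer`, and the `probes` map are hypothetical names, and each probe is assumed to be an async function that resolves truthy when its dependency is reachable.

```javascript
const http = require('node:http');

function sendJson(res, statusCode, payload) {
  res.writeHead(statusCode, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify(payload));
}

// probes: { name: async () => boolean }, hypothetical dependency checks.
function createHealthHandler(probes = {}) {
  return async function handle(req, res) {
    if (req.url === '/health') {
      // Liveness only: no I/O, no dependency checks. Wire this to HEALTHCHECK.
      return sendJson(res, 200, { status: 'ok' });
    }
    if (req.url === '/health/detailed') {
      // For dashboards only: a failing dependency must not restart the container.
      const dependencies = {};
      for (const [name, probe] of Object.entries(probes)) {
        try {
          dependencies[name] = (await probe()) ? 'ok' : 'failing';
        } catch {
          dependencies[name] = 'failing';
        }
      }
      const healthy = Object.values(dependencies).every((state) => state === 'ok');
      return sendJson(res, healthy ? 200 : 503, {
        status: healthy ? 'ok' : 'degraded',
        dependencies,
      });
    }
    return sendJson(res, 404, { error: 'not found' });
  };
}

// Builds the server without listening, so the caller picks the port.
function createServer(probes) {
  return http.createServer(createHealthHandler(probes));
}
```

Calling `createServer({ db: checkDb }).listen(3000)` with your own `checkDb` function would serve both endpoints. With this layout a database outage turns `/health/detailed` into a 503 while `/health` keeps returning 200.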

### Worker / Background Job Services

Workers don't serve HTTP. Use a heartbeat file pattern:

```dockerfile
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
  CMD test $(find /tmp/worker-heartbeat -mmin -1 2>/dev/null | wc -l) -gt 0 || exit 1
```

The worker writes a timestamp file on each successful job loop iteration:
```js
const fs = require('node:fs');

// Inside your worker loop
await processJob();
fs.writeFileSync('/tmp/worker-heartbeat', Date.now().toString());
```

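If the heartbeat file is older than 1 minute, the worker is stuck. A fresh heartbeat shows that the process is alive, the event loop is not blocked, and the job loop is still iterating. If the loop can sit idle on an empty queue, write the heartbeat on idle polls too; otherwise an idle worker is reported as unhealthy.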
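
A fuller sketch of the heartbeat pattern follows, under a few assumptions: `fetchJob`, `processJob`, and `shouldStop` are hypothetical callbacks supplied by your queue library or application, and `fetchJob` resolves to `null` when the queue is empty. The loop beats after every iteration, including idle ones, and `isHeartbeatFresh` mirrors the `find -mmin -1` check in JavaScript.

```javascript
const fs = require('node:fs');

const HEARTBEAT_PATH = '/tmp/worker-heartbeat';

// Touches the heartbeat file; the file's mtime is what the health check reads.
function beat(path = HEARTBEAT_PATH) {
  fs.writeFileSync(path, Date.now().toString());
}

// True when the heartbeat file was modified within the last maxAgeMs.
function isHeartbeatFresh(path = HEARTBEAT_PATH, maxAgeMs = 60_000, now = Date.now()) {
  try {
    return now - fs.statSync(path).mtimeMs <= maxAgeMs;
  } catch {
    return false; // missing file: the worker never completed an iteration
  }
}

// fetchJob, processJob, and shouldStop are hypothetical callbacks.
async function runWorker({ fetchJob, processJob, shouldStop, idleMs = 1000, heartbeatPath = HEARTBEAT_PATH }) {
  while (!shouldStop()) {
    const job = await fetchJob();
    if (job) {
      await processJob(job);
    } else {
      // Empty queue: wait briefly instead of blocking forever.
      await new Promise((resolve) => setTimeout(resolve, idleMs));
    }
    beat(heartbeatPath); // beat on busy and idle iterations alike
  }
}
```

A job that runs longer than the freshness window would still look stuck under this scheme, so size the window to your slowest expected job or beat from inside long jobs.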

### Static File Servers (nginx, Caddy)

```dockerfile
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD wget -qO- http://localhost:80/ || exit 1
```

Short start period — static servers boot fast. Just check it serves a page. No `/health` endpoint needed.

### Database Containers

Use the database's native client for health checks, not HTTP:

```yaml
# PostgreSQL
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 10s
  timeout: 5s
  start_period: 30s    # DBs are slow to start — generous grace period
  retries: 5

# Redis
healthcheck:
  test: ["CMD", "redis-cli", "ping"]
  interval: 10s
  timeout: 3s
  retries: 5

# MySQL
healthcheck:
  test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
  interval: 10s
  timeout: 5s
  start_period: 30s
  retries: 5
```

### Tuning Parameters

| Parameter | Guidance |
|-----------|----------|
| `interval` | 30s for apps, 10s for databases. Lower = more CPU overhead and log noise. |
| `timeout` | 3-5s. If your health check takes longer, it's too expensive. |
| `start_period` | Grace period after container start: checks still run, but failures don't count toward `retries` until it ends. 5s for static, 10s for APIs, 30s for databases, 60s+ for JVM apps. |
| `retries` | 3 for apps, 5 for databases. Too low = restarts on transient blips. |
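
These parameters combine into a detection delay that is worth estimating before you settle on values. The sketch below is a rough upper bound, not Docker's exact scheduling: it assumes each failing check runs for the full `timeout`, that the next check starts `interval` seconds after the previous one completes, and it ignores jitter. The function names are illustrative.

```javascript
// All values in seconds.
// Worst case for a running container that hangs right after a passing check:
// each of the `retries` failures waits one interval, then burns the full timeout.
function worstCaseDetectionSeconds({ interval, timeout, retries }) {
  return retries * (interval + timeout);
}

// Worst case for a container that never becomes healthy: failures inside the
// start period are not counted, so the countdown begins when it ends.
function worstCaseStartupFailureSeconds({ interval, timeout, retries, startPeriod = 0 }) {
  return startPeriod + worstCaseDetectionSeconds({ interval, timeout, retries });
}

const api = { interval: 30, timeout: 3, retries: 3, startPeriod: 10 };
const database = { interval: 10, timeout: 5, retries: 5, startPeriod: 30 };

console.log(worstCaseDetectionSeconds(api));           // 99
console.log(worstCaseDetectionSeconds(database));      // 75
console.log(worstCaseStartupFailureSeconds(api));      // 109
console.log(worstCaseStartupFailureSeconds(database)); // 105
```

With the API defaults from the table, a hung process can go unnoticed for about a minute and a half, so lower `interval` or `retries` if that is too slow for your service.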

---

## References

- `references/multi-stage-patterns.md` — Consult for complex multi-stage builds: build caching with BuildKit, cross-compilation, monorepo Dockerfiles, Bun/Deno patterns
- `references/compose-patterns.md` — Consult for advanced docker-compose: profiles, extends, override files, networking, secrets management, GPU passthrough