---
name: karpenter
description: Kubernetes node autoscaling and cost optimization with Karpenter. Use when implementing node provisioning, spot instance management, cluster right-sizing, node consolidation, or reducing compute costs. Covers NodePool configuration, EC2NodeClass setup, disruption budgets, spot/on-demand mix strategies, multi-architecture support, and capacity-type selection.
triggers:
  - karpenter
  - node autoscaling
  - nodepool
  - ec2nodeclass
  - provisioner
  - spot instances
  - on-demand instances
  - node consolidation
  - node termination
  - cluster autoscaling
  - right-sizing
  - capacity-type
  - node disruption
  - compute costs
  - instance selection
  - graviton
  - arm64
allowed-tools: Read, Grep, Glob, Edit, Write, Bash
---

# Karpenter

## Overview

Karpenter is a Kubernetes node autoscaler that provisions right-sized compute resources in response to changing application load. Unlike Cluster Autoscaler which scales predefined node groups, Karpenter provisions nodes based on aggregate pod resource requirements, enabling better bin-packing and cost optimization.

### Key Differences from Cluster Autoscaler

- **Direct provisioning**: Talks directly to cloud provider APIs (no node groups required)
- **Fast scaling**: Provisions nodes in seconds vs minutes
- **Flexible instance selection**: Chooses from all available instance types automatically
- **Consolidation**: Actively replaces nodes with cheaper alternatives
- **Spot instance optimization**: First-class support with automatic fallback

### When to Use Karpenter

- Running workloads with diverse resource requirements
- Need for fast scaling (sub-minute response)
- Cost optimization with spot instances and Graviton (ARM64)
- Consolidation to reduce cluster waste and over-provisioning
- Clusters with unpredictable or bursty workloads
- Right-sizing infrastructure to actual usage patterns
- Managing mixed capacity types (spot/on-demand) automatically

## Instructions

### 1. Installation and Setup

- Install Karpenter controller in cluster
- Configure cloud provider credentials (IAM roles)
- Set up instance profiles and security groups
- Create NodePools for different workload types
- Define EC2NodeClass (AWS) or equivalent for your provider

### 2. Design NodePool Strategy

- Separate NodePools for different workload classes
- Define instance type families and sizes
- Configure spot/on-demand mix
- Set resource limits per NodePool
- Plan for multi-AZ distribution

### 3. Configure Disruption Management

- Set disruption budgets to control churn
- Configure consolidation policies
- Define expiration windows for node lifecycle
- Handle workload-specific disruption constraints
- Test disruption scenarios

### 4. Optimize for Cost and Performance

- Enable consolidation for cost savings
- Use spot instances with fallback strategies
- Set appropriate resource requests on pods (Karpenter depends on accurate requests)
- Monitor node utilization and waste
- Adjust instance type restrictions based on usage
- Leverage Graviton (ARM64) instances for 20% cost reduction
- Configure capacity-type weighting to prefer spot over on-demand

### 5. Cost Optimization Strategies

- **Spot instances**: Configure 70-90% spot mix for fault-tolerant workloads
- **Graviton (ARM64)**: Use c7g, m7g, r7g families for lower costs
- **Consolidation**: Enable WhenUnderutilized policy to replace expensive nodes
- **Instance diversity**: Wide instance family selection improves spot availability
- **Right-sizing**: Let Karpenter bin-pack efficiently instead of over-provisioning

### 6. Spot Instance Management

- Use wide instance type selection (10+ families) for better spot availability
- Configure automatic fallback to on-demand when spot unavailable
- Implement Pod Disruption Budgets to control blast radius
- Set graceful termination handlers in applications (preStop hooks)
- Monitor spot interruption rates and adjust instance selection
- Use diverse availability zones to reduce correlated failures

### 7. Node Consolidation

- **WhenUnderutilized**: Replaces nodes with cheaper/smaller alternatives actively
- **WhenEmpty**: Only consolidates completely empty nodes (conservative)
- Configure consolidateAfter delay to prevent churn (30s-600s typical)
- Use disruption budgets to limit consolidation rate (5-20% per window)
- Respect Pod Disruption Budgets during consolidation
- Set expiration windows to force periodic node refresh

## Best Practices

1. **Start Conservative**: Begin with restrictive instance types, expand based on observation
2. **Use Disruption Budgets**: Prevent too many nodes from being disrupted simultaneously
3. **Set Pod Resource Requests**: Karpenter relies on accurate requests for scheduling
4. **Enable Consolidation**: Let Karpenter optimize node utilization automatically
5. **Separate Workload Classes**: Use multiple NodePools for different requirements
6. **Monitor Provisioning**: Track provisioning latency and failures
7. **Test Spot Interruptions**: Ensure graceful handling of spot instance terminations
8. **Use Topology Spread**: Combine with pod topology constraints for availability

## Examples

### Example 1: Basic NodePool with Multiple Instance Types

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  # Template for nodes created by this NodePool
  template:
    spec:
      # Reference to EC2NodeClass (AWS-specific configuration)
      nodeClassRef:
        name: default

      # Requirements that constrain instance selection
      requirements:
        # Use amd64 or arm64 architectures
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]

        # Allow multiple instance families
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            ["c6a", "c6i", "c7i", "m6a", "m6i", "m7i", "r6a", "r6i", "r7i"]

        # Allow a range of instance sizes
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge"]

        # Use 80% spot, 20% on-demand
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]

        # Spread across availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b", "us-west-2c"]

      # Kubelet configuration
      kubelet:
        # Set max pods based on instance size
        maxPods: 110
        # Memory reservation for system components
        systemReserved:
          cpu: 100m
          memory: 100Mi
          ephemeral-storage: 1Gi
        # Eviction thresholds
        evictionHard:
          memory.available: 5%
          nodefs.available: 10%
        # Image garbage collection
        imageGCHighThresholdPercent: 85
        imageGCLowThresholdPercent: 80

      # Taints and labels
      taints:
        - key: workload-type
          value: general
          effect: NoSchedule

      # Metadata applied to nodes
      metadata:
        labels:
          workload-type: general
          managed-by: karpenter

  # Limits for this NodePool
  limits:
    cpu: 1000
    memory: 1000Gi

  # Disruption controls
  disruption:
    # Consolidation policy
    consolidationPolicy: WhenUnderutilized

    # Time window for when disruptions are allowed
    consolidateAfter: 30s

    # Budgets control the rate of disruptions
    budgets:
      - nodes: 10%
        duration: 5m

  # Node weight for scheduling decisions (higher = preferred)
  weight: 10
```

### Example 2: EC2NodeClass for AWS-Specific Configuration

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # AMI selection
  amiFamily: AL2

  # Alternative: Use specific AMI selector
  # amiSelectorTerms:
  #   - id: ami-0123456789abcdef0
  #   - tags:
  #       karpenter.sh/discovery: my-cluster

  # IAM role for nodes (instance profile)
  role: KarpenterNodeRole-my-cluster

  # Subnet selection - use tags to identify subnets
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        kubernetes.io/role/internal-elb: "1"

  # Security group selection
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
    - name: my-cluster-node-security-group

  # User data for node initialization
  userData: |
    #!/bin/bash
    echo "Custom node initialization"
    # Configure container runtime
    # Set up logging
    # Install monitoring agents

  # Block device mappings for EBS volumes
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        encrypted: true
        deleteOnTermination: true

  # Metadata options for IMDS
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required

  # Detailed monitoring
  detailedMonitoring: true

  # Tags applied to EC2 instances
  tags:
    Name: karpenter-node
    Environment: production
    ManagedBy: karpenter
    ClusterName: my-cluster
```

### Example 3: Specialized NodePools for Different Workloads

```yaml
---
# GPU workload NodePool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-workloads
spec:
  template:
    spec:
      nodeClassRef:
        name: gpu-nodes

      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6", "p4", "p5"]

        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"] # GPU instances typically on-demand

        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]

      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule

      metadata:
        labels:
          workload-type: gpu
          nvidia.com/gpu: "true"

  limits:
    cpu: 500
    memory: 2000Gi
    nvidia.com/gpu: 16

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s

---
# Batch/Spot-heavy NodePool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: batch-workloads
spec:
  template:
    spec:
      nodeClassRef:
        name: default

      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"] # Only spot instances

        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6a", "c6i", "c7i", "m6a", "m6i"] # Compute-optimized

        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["2xlarge", "4xlarge", "8xlarge"]

      taints:
        - key: workload-type
          value: batch
          effect: NoSchedule

      metadata:
        labels:
          workload-type: batch
          spot-interruption-handler: enabled

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
    budgets:
      - nodes: 20% # Allow more aggressive disruption for batch

---
# Stateful workload NodePool (on-demand only)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: stateful-workloads
spec:
  template:
    spec:
      nodeClassRef:
        name: stateful-nodes

      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"] # Only on-demand for stability

        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["r6i", "r7i"] # Memory-optimized

        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["xlarge", "2xlarge", "4xlarge"]

        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b"]

      kubelet:
        maxPods: 50 # Lower density for stateful workloads

      taints:
        - key: workload-type
          value: stateful
          effect: NoSchedule

      metadata:
        labels:
          workload-type: stateful
          storage-optimized: "true"

  limits:
    cpu: 200
    memory: 800Gi

  disruption:
    consolidationPolicy: WhenEmpty # Only consolidate when completely empty
    consolidateAfter: 600s # Wait 10 minutes
    budgets:
      - nodes: 1 # Very conservative disruption
        duration: 30m
```

### Example 4: Disruption Budgets and Consolidation Policies

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: production-apps
spec:
  template:
    spec:
      nodeClassRef:
        name: default

      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]

        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6i", "m6i", "r6i"]

  # Advanced disruption configuration
  disruption:
    # Consolidation policy options:
    # - WhenUnderutilized: Replace nodes with cheaper/smaller nodes
    # - WhenEmpty: Only replace completely empty nodes
    consolidationPolicy: WhenUnderutilized

    # How soon after a node becomes eligible for consolidation
    consolidateAfter: 30s

    # Expiration settings - force node replacement after time period
    expireAfter: 720h # 30 days

    # Multiple budget windows for different times/scenarios
    budgets:
      # During business hours: conservative disruption
      - nodes: 5%
        duration: 8h
        schedule: "0 8 * * MON-FRI"

      # During off-hours: more aggressive consolidation
      - nodes: 20%
        duration: 16h
        schedule: "0 18 * * MON-FRI"

      # Weekends: most aggressive
      - nodes: 30%
        duration: 48h
        schedule: "0 0 * * SAT"

      # Default budget (always active)
      - nodes: 10%
```

### Example 5: Pod Scheduling with Karpenter

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-application
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-application
  template:
    metadata:
      labels:
        app: my-application
    spec:
      # Tolerations to allow scheduling on Karpenter nodes
      tolerations:
        - key: workload-type
          operator: Equal
          value: general
          effect: NoSchedule

      # Node selector to target specific NodePool
      nodeSelector:
        workload-type: general
        karpenter.sh/capacity-type: spot # Prefer spot

      # Affinity rules for better placement
      affinity:
        # Spread across zones for availability
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-application
                topologyKey: topology.kubernetes.io/zone

        # Node affinity for instance type preferences
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            # Prefer ARM instances (cheaper)
            - weight: 50
              preference:
                matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values: ["arm64"]

            # Prefer larger instances (better bin-packing)
            - weight: 30
              preference:
                matchExpressions:
                  - key: karpenter.k8s.aws/instance-size
                    operator: In
                    values: ["2xlarge", "4xlarge"]

      # Topology spread constraints
      topologySpreadConstraints:
        # Spread across zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-application

        # Spread across nodes
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-application

      containers:
        - name: app
          image: my-app:latest

          # CRITICAL: Accurate resource requests for Karpenter
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 1000m
              memory: 2Gi

          # Graceful shutdown for spot interruptions
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - sleep 15 # Allow time for deregistration

      # Termination grace period for spot interruptions
      terminationGracePeriodSeconds: 30
```

### Example 6: Spot Instance Handling and Fallback

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-with-fallback
spec:
  template:
    spec:
      nodeClassRef:
        name: default

      requirements:
        # Prioritize spot, but allow on-demand as fallback
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]

        # Wide instance type selection for better spot availability
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - "c5a"
            - "c6a"
            - "c6i"
            - "c7i"
            - "m5a"
            - "m6a"
            - "m6i"
            - "m7i"
            - "r5a"
            - "r6a"
            - "r6i"
            - "r7i"

        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge"]

        # Support both architectures for more spot options
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]

      # Metadata to track spot usage
      metadata:
        labels:
          spot-enabled: "true"
        annotations:
          karpenter.sh/spot-to-spot-consolidation: "true"

  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s

    # More aggressive for spot since they can be interrupted anyway
    budgets:
      - nodes: 25%

  # Weight influences Karpenter's NodePool selection
  # Higher weight = more preferred
  # Use lower weight so other NodePools are tried first
  weight: 5
```

### Example 7: Karpenter with Pod Disruption Budget

```yaml
# Application Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-service
spec:
  replicas: 6
  selector:
    matchLabels:
      app: critical-service
  template:
    metadata:
      labels:
        app: critical-service
    spec:
      tolerations:
        - key: workload-type
          operator: Equal
          value: general
          effect: NoSchedule

      containers:
        - name: app
          image: critical-service:latest
          resources:
            requests:
              cpu: 1000m
              memory: 2Gi
            limits:
              cpu: 2000m
              memory: 4Gi

---
# Pod Disruption Budget to protect during consolidation
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-service-pdb
spec:
  minAvailable: 4 # Always keep at least 4 replicas running
  selector:
    matchLabels:
      app: critical-service
# Karpenter respects PDBs during consolidation
# It will not disrupt nodes if doing so would violate the PDB
```

### Example 8: Multi-Architecture NodePool

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    spec:
      nodeClassRef:
        name: default

      requirements:
        # Support both AMD64 and ARM64
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]

        # ARM instances (Graviton) - typically 20% cheaper
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            # ARM (Graviton2)
            - "c6g"
            - "m6g"
            - "r6g"
            # ARM (Graviton3)
            - "c7g"
            - "m7g"
            - "r7g"
            # AMD64 alternatives
            - "c6i"
            - "m6i"
            - "r6i"

        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]

      metadata:
        labels:
          multi-arch: "true"

  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 60s

---
# EC2NodeClass with multi-architecture AMI support
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # AL2 automatically selects the right AMI for architecture
  amiFamily: AL2

  # Alternative: Explicit AMI selection by architecture
  # amiSelectorTerms:
  #   - tags:
  #       karpenter.sh/discovery: my-cluster
  #       kubernetes.io/arch: amd64
  #   - tags:
  #       karpenter.sh/discovery: my-cluster
  #       kubernetes.io/arch: arm64

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

## Monitoring and Troubleshooting

### Key Metrics to Monitor

```text
# Provisioning metrics
karpenter_nodes_created_total
karpenter_nodes_terminated_total
karpenter_provisioner_scheduling_duration_seconds

# Disruption metrics
karpenter_disruption_replacement_node_initialized_seconds
karpenter_disruption_consolidation_actions_performed_total
karpenter_disruption_budgets_allowed_disruptions

# Cost metrics
karpenter_provisioner_instance_type_price_estimate
karpenter_cloudprovider_instance_type_offering_price_estimate

# Pod metrics
karpenter_pods_state (pending, running, etc.)
```

### Common Issues and Solutions

#### Issue: Pods stuck in Pending

- Check NodePool requirements match pod node selectors/tolerations
- Verify cloud provider limits not exceeded
- Check instance type availability in selected zones
- Ensure subnet capacity available

#### Issue: Excessive node churn

- Adjust consolidation delay (consolidateAfter)
- Review disruption budgets
- Check if pod resource requests are accurate
- Consider using WhenEmpty instead of WhenUnderutilized

#### Issue: High costs despite using Karpenter

- Enable consolidation if not already active
- Verify spot instances are being used
- Check if pods have unnecessarily large resource requests
- Review instance type selection (allow more variety)

#### Issue: Spot interruptions causing service disruption

- Implement Pod Disruption Budgets
- Use diverse instance types for better spot availability
- Configure appropriate replica counts
- Implement graceful shutdown in applications

## Integration with Terraform

```hcl
# Install Karpenter via Terraform
resource "helm_release" "karpenter" {
  namespace        = "karpenter"
  create_namespace = true
  name             = "karpenter"
  repository       = "oci://public.ecr.aws/karpenter"
  chart            = "karpenter"
  version          = "v0.33.0"

  values = [
    <<-EOT
    settings:
      clusterName: ${var.cluster_name}
      clusterEndpoint: ${var.cluster_endpoint}
      interruptionQueue: ${var.interruption_queue_name}

    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: ${var.karpenter_irsa_arn}

    controller:
      resources:
        requests:
          cpu: 1
          memory: 1Gi
        limits:
          cpu: 2
          memory: 2Gi
    EOT
  ]

  depends_on = [
    aws_iam_role_policy_attachment.karpenter_controller
  ]
}

# Deploy default NodePool
resource "kubectl_manifest" "karpenter_nodepool_default" {
  yaml_body = <<-YAML
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: default
    spec:
      template:
        spec:
          nodeClassRef:
            name: default
          requirements:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot", "on-demand"]
            - key: karpenter.k8s.aws/instance-family
              operator: In
              values: ["c6i", "m6i", "r6i"]
      limits:
        cpu: 1000
        memory: 1000Gi
      disruption:
        consolidationPolicy: WhenUnderutilized
        consolidateAfter: 30s
  YAML

  depends_on = [helm_release.karpenter]
}
```

## Migration from Cluster Autoscaler

1. **Plan the migration**

   - Identify current node groups and their characteristics
   - Map workloads to new NodePool configurations
   - Plan for coexistence period

2. **Deploy Karpenter alongside Cluster Autoscaler**

   - Install Karpenter in the cluster
   - Create NodePools with distinct labels
   - Test with non-critical workloads first

3. **Migrate workloads incrementally**

   - Update pod specs with Karpenter tolerations/node selectors
   - Monitor provisioning and consolidation behavior
   - Validate cost and performance metrics

4. **Remove Cluster Autoscaler**

   - Once all workloads migrated, scale down CA node groups
   - Remove Cluster Autoscaler deployment
   - Clean up CA-specific resources