--- name: canary-deployment description: Implement canary deployment strategies to gradually roll out new versions to subset of users with automatic rollback based on metrics. --- # Canary Deployment ## Overview Deploy new versions gradually to a small percentage of users, monitor metrics for issues, and automatically rollback or proceed based on predefined thresholds. ## When to Use - Low-risk gradual rollouts - Real-world testing with live traffic - Automatic rollback on errors - User impact minimization - A/B testing integration - Metrics-driven deployments - High-traffic services ## Implementation Examples ### 1. **Istio-based Canary Deployment** ```yaml # canary-deployment-istio.yaml apiVersion: apps/v1 kind: Deployment metadata: name: myapp-v1 namespace: production spec: replicas: 3 selector: matchLabels: app: myapp version: v1 template: metadata: labels: app: myapp version: v1 spec: containers: - name: myapp image: myrepo/myapp:1.0.0 ports: - containerPort: 8080 --- apiVersion: apps/v1 kind: Deployment metadata: name: myapp-v2 namespace: production spec: replicas: 1 # Start with minimal replicas for canary selector: matchLabels: app: myapp version: v2 template: metadata: labels: app: myapp version: v2 spec: containers: - name: myapp image: myrepo/myapp:2.0.0 ports: - containerPort: 8080 --- apiVersion: networking.istio.io/v1beta1 kind: VirtualService metadata: name: myapp namespace: production spec: hosts: - myapp http: # Canary: 5% to v2, 95% to v1 - match: - headers: user-agent: regex: ".*Chrome.*" # Test with Chrome route: - destination: host: myapp subset: v2 weight: 100 timeout: 10s # Default route with traffic split - route: - destination: host: myapp subset: v1 weight: 95 - destination: host: myapp subset: v2 weight: 5 timeout: 10s --- apiVersion: networking.istio.io/v1beta1 kind: DestinationRule metadata: name: myapp namespace: production spec: host: myapp trafficPolicy: connectionPool: http: http1MaxPendingRequests: 100 maxRequestsPerConnection: 2 subsets: - name: v1 labels: version: v1 - name: v2 labels: version: v2 trafficPolicy: outlierDetection: consecutive5xxErrors: 3 interval: 30s baseEjectionTime: 30s --- apiVersion: flagger.app/v1beta1 kind: Canary metadata: name: myapp namespace: production spec: targetRef: apiVersion: apps/v1 kind: Deployment name: myapp progressDeadlineSeconds: 300 service: port: 80 analysis: interval: 1m threshold: 5 maxWeight: 50 stepWeight: 5 stepWeightPromotion: 10 metrics: - name: request-success-rate thresholdRange: min: 99 interval: 1m - name: request-duration thresholdRange: max: 500 interval: 30s webhooks: - name: acceptance-test url: http://flagger-loadtester/ timeout: 30s metadata: type: smoke cmd: "curl -sd 'test' http://myapp-canary/api/test" - name: load-test url: http://flagger-loadtester/ timeout: 5s metadata: cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/" logCmdOutput: "true" # Automatic rollback on failure skipAnalysis: false ``` ### 2. **Kubernetes Native Canary Script** ```bash #!/bin/bash # canary-rollout.sh - Canary deployment with k8s native tools set -euo pipefail NAMESPACE="${1:-production}" DEPLOYMENT="${2:-myapp}" NEW_VERSION="${3:-latest}" CANARY_WEIGHT=10 MAX_WEIGHT=100 STEP_WEIGHT=10 CHECK_INTERVAL=60 MAX_ERROR_RATE=0.05 echo "Starting canary deployment for $DEPLOYMENT with version $NEW_VERSION" # Get current replicas CURRENT_REPLICAS=$(kubectl get deployment $DEPLOYMENT -n "$NAMESPACE" \ -o jsonpath='{.spec.replicas}') CANARY_REPLICAS=$((CURRENT_REPLICAS / 10 + 1)) echo "Current replicas: $CURRENT_REPLICAS, Canary replicas: $CANARY_REPLICAS" # Create canary deployment kubectl set image deployment/${DEPLOYMENT}-canary \ ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION} \ -n "$NAMESPACE" # Scale up canary gradually CURRENT_WEIGHT=$CANARY_WEIGHT while [ $CURRENT_WEIGHT -le $MAX_WEIGHT ]; do echo "Setting traffic to canary: ${CURRENT_WEIGHT}%" # Update ingress or service to split traffic kubectl patch virtualservice ${DEPLOYMENT} -n "$NAMESPACE" --type merge \ -p '{"spec":{"http":[{"route":[{"destination":{"host":"'${DEPLOYMENT}-stable'"},"weight":'$((100-CURRENT_WEIGHT))'},{"destination":{"host":"'${DEPLOYMENT}-canary'"},"weight":'${CURRENT_WEIGHT}'}]}]}}' # Wait and check metrics echo "Monitoring metrics for ${CHECK_INTERVAL}s..." sleep $CHECK_INTERVAL # Check error rate ERROR_RATE=$(kubectl exec -it deployment/${DEPLOYMENT}-canary -n "$NAMESPACE" -- \ curl -s http://localhost:8080/metrics | grep http_requests_total | \ awk '{print $2}' || echo "0") if (( $(echo "$ERROR_RATE > $MAX_ERROR_RATE" | bc -l) )); then echo "ERROR: Error rate exceeded threshold: $ERROR_RATE" echo "Rolling back canary deployment..." kubectl patch virtualservice ${DEPLOYMENT} -n "$NAMESPACE" --type merge \ -p '{"spec":{"http":[{"route":[{"destination":{"host":"'${DEPLOYMENT}-stable'"},"weight":100}]}]}}' exit 1 fi CURRENT_WEIGHT=$((CURRENT_WEIGHT + STEP_WEIGHT)) done # Promote canary to stable echo "Canary deployment successful! Promoting to stable..." kubectl set image deployment/${DEPLOYMENT} \ ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION} \ -n "$NAMESPACE" kubectl rollout status deployment/${DEPLOYMENT} -n "$NAMESPACE" --timeout=5m echo "Canary deployment complete!" ``` ### 3. **Metrics-Based Canary Analysis** ```yaml # canary-monitoring.yaml apiVersion: v1 kind: ConfigMap metadata: name: canary-analysis namespace: production data: analyze.sh: | #!/bin/bash set -euo pipefail CANARY_DEPLOYMENT="${1:-myapp-canary}" STABLE_DEPLOYMENT="${2:-myapp-stable}" THRESHOLD="${3:-0.05}" # 5% error rate threshold NAMESPACE="production" echo "Analyzing canary metrics..." # Query Prometheus for metrics CANARY_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \ --data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${CANARY_DEPLOYMENT}'"}[5m])' | \ jq -r '.data.result[0].value[1]' || echo "0") STABLE_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \ --data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${STABLE_DEPLOYMENT}'"}[5m])' | \ jq -r '.data.result[0].value[1]' || echo "0") CANARY_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \ --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${CANARY_DEPLOYMENT}'"}[5m]))' | \ jq -r '.data.result[0].value[1]' || echo "0") STABLE_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \ --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${STABLE_DEPLOYMENT}'"}[5m]))' | \ jq -r '.data.result[0].value[1]' || echo "0") echo "Canary Error Rate: $CANARY_ERROR_RATE" echo "Stable Error Rate: $STABLE_ERROR_RATE" echo "Canary P95 Latency: ${CANARY_LATENCY}s" echo "Stable P95 Latency: ${STABLE_LATENCY}s" # Check if canary is within acceptable range if (( $(echo "$CANARY_ERROR_RATE > $THRESHOLD" | bc -l) )); then echo "FAIL: Canary error rate exceeds threshold" exit 1 fi if (( $(echo "$CANARY_LATENCY > $STABLE_LATENCY * 1.2" | bc -l) )); then echo "FAIL: Canary latency is 20% higher than stable" exit 1 fi echo "PASS: Canary meets quality criteria" exit 0 --- apiVersion: batch/v1 kind: Job metadata: name: canary-analysis namespace: production spec: template: spec: containers: - name: analyzer image: curlimages/curl:latest command: - sh - -c - | apk add --no-cache bc jq bash /scripts/analyze.sh volumeMounts: - name: scripts mountPath: /scripts volumes: - name: scripts configMap: name: canary-analysis restartPolicy: Never ``` ### 4. **Automated Canary Promotion** ```bash #!/bin/bash # promote-canary.sh - Automatically promote successful canary set -euo pipefail NAMESPACE="${1:-production}" DEPLOYMENT="${2:-myapp}" MAX_DURATION="${3:-600}" # Max 10 minutes for canary start_time=$(date +%s) echo "Starting automated canary promotion for $DEPLOYMENT" while true; do current_time=$(date +%s) elapsed=$((current_time - start_time)) if [ $elapsed -gt $MAX_DURATION ]; then echo "ERROR: Canary exceeded max duration" exit 1 fi # Check canary health CANARY_REPLICAS=$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \ -o jsonpath='{.status.readyReplicas}') CANARY_DESIRED=$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \ -o jsonpath='{.spec.replicas}') if [ "$CANARY_REPLICAS" -ne "$CANARY_DESIRED" ]; then echo "Waiting for canary pods to be ready..." sleep 10 continue fi # Run analysis if bash /scripts/analyze.sh "$DEPLOYMENT-canary" "$DEPLOYMENT-stable"; then echo "Canary analysis passed! Promoting to stable..." # Merge canary into stable kubectl set image deployment/${DEPLOYMENT} \ ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \ -o jsonpath='{.spec.template.spec.containers[0].image}' | cut -d: -f2) \ -n "$NAMESPACE" kubectl rollout status deployment/${DEPLOYMENT} -n "$NAMESPACE" --timeout=5m echo "Canary promoted successfully!" exit 0 else echo "Canary analysis failed. Rolling back..." exit 1 fi done ``` ## Canary Best Practices ### ✅ DO - Start with small traffic percentage (5-10%) - Monitor key metrics continuously - Increase gradually based on metrics - Implement automatic rollback - Run load tests on canary - Test with real user traffic - Set appropriate thresholds - Document rollback procedures ### ❌ DON'T - Rush through canary phases - Ignore metrics - Mix canary and stable versions - Deploy to all users at once - Skip rollback testing - Use artificial load only - Set unrealistic thresholds - Deploy unvalidated changes ## Metrics to Monitor - **Error Rate**: 5xx errors increase - **Latency**: P95/P99 response time - **Throughput**: Requests per second - **Resource Usage**: CPU, memory - **Business Metrics**: Conversion rate, revenue - **User Experience**: Session duration, bounce rate ## Resources - [Flagger Canary Deployments](https://flagger.app/) - [Istio Traffic Management](https://istio.io/latest/docs/concepts/traffic-management/) - [Kubernetes Canary Pattern](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy)