---
name: backup-disaster-recovery
description: Implement backup strategies, disaster recovery plans, and data restoration procedures for protecting critical infrastructure and data.
---

# Backup and Disaster Recovery

## Overview

Design and implement comprehensive backup and disaster recovery strategies to ensure data protection, business continuity, and rapid recovery from infrastructure failures.

## When to Use

- Data protection
- Business continuity planning
- Disaster recovery planning
- Point-in-time recovery
- Cross-region failover
- Data migration
- Compliance and audit requirements
- Recovery time objective (RTO) optimization

## Implementation Examples

### 1. **Database Backup Configuration**

```yaml
# postgres-backup-cronjob.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-script
  namespace: databases
data:
  backup.sh: |
    #!/bin/bash
    set -euo pipefail

    BACKUP_DIR="/backups/postgresql"
    LOCAL_RETENTION_DAYS=7   # local scratch copies only; S3 retention is handled separately
    DB_HOST="${POSTGRES_HOST}"
    DB_PORT="${POSTGRES_PORT:-5432}"
    DB_USER="${POSTGRES_USER}"
    DB_NAME="${POSTGRES_DB:-postgres}"
    DB_PASSWORD="${POSTGRES_PASSWORD}"

    export PGPASSWORD="$DB_PASSWORD"

    # Create backup directory
    mkdir -p "$BACKUP_DIR"

    # Full backup
    BACKUP_FILE="$BACKUP_DIR/full-$(date +%Y%m%d-%H%M%S).sql"
    echo "Starting backup to $BACKUP_FILE"
    pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -v \
      --format=plain --no-owner --no-privileges > "$BACKUP_FILE"

    # Compress backup
    gzip "$BACKUP_FILE"
    echo "Backup compressed: ${BACKUP_FILE}.gz"

    # Upload to S3
    aws s3 cp "${BACKUP_FILE}.gz" \
      "s3://my-backups/postgres/$(date +%Y/%m/%d)/"

    # Clean old local backups
    find "$BACKUP_DIR" -type f -mtime "+$LOCAL_RETENTION_DAYS" -delete

    # Verify the backup by replaying it into a scratch database.
    # pg_restore only reads custom/directory/tar formats, so a
    # plain-format dump is replayed with psql instead.
    dropdb --if-exists -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
    createdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
    if gunzip -c "${BACKUP_FILE}.gz" | \
       psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d test_restore \
            --single-transaction -v ON_ERROR_STOP=1 > /dev/null; then
      echo "Backup verification successful"
      dropdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
    else
      echo "Backup verification FAILED" >&2
      dropdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
      exit 1
    fi

    echo "Backup complete: ${BACKUP_FILE}.gz"
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: databases
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup-sa
          containers:
            - name: backup
              image: postgres:15-alpine
              env:
                - name: POSTGRES_HOST
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: host
                - name: POSTGRES_USER
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: username
                - name: POSTGRES_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: password
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-key
              volumeMounts:
                - name: backup-script
                  mountPath: /backup
                - name: backup-storage
                  mountPath: /backups
              command:
                - sh
                - -c
                - apk add --no-cache aws-cli && bash /backup/backup.sh
          volumes:
            - name: backup-script
              configMap:
                name: backup-script
                defaultMode: 0755
            - name: backup-storage
              emptyDir:
                sizeLimit: 100Gi
          restartPolicy: OnFailure
```
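The CronJob above only gives you dump-granularity recovery. For the point-in-time recovery use case listed earlier, scheduled dumps need to be paired with continuous WAL archiving. A minimal sketch, assuming the same `s3://my-backups` bucket and a superuser connection; the timeout value is illustrative:

```bash
# pitr-wal-archiving.sh - enable continuous WAL archiving for point-in-time
# recovery. Bucket and connection details are assumptions carried over from
# the CronJob example above; adjust for your environment.
psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d postgres <<'SQL'
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'aws s3 cp %p s3://my-backups/wal/%f';
ALTER SYSTEM SET archive_timeout = '300s';
SQL

# archive_mode takes effect only after a server restart; archive_command
# alone can be applied with a reload (SELECT pg_reload_conf()).
# Pair the WAL stream with periodic base backups (e.g. pg_basebackup) so
# there is a starting point to replay WAL against.
```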
### 2. **Disaster Recovery Plan Template**

```yaml
# disaster-recovery-plan.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-procedures
  namespace: operations
data:
  dr-runbook.md: |
    # Disaster Recovery Runbook

    ## RTO and RPO Targets
    - RTO (Recovery Time Objective): 4 hours
    - RPO (Recovery Point Objective): 1 hour

    ## Pre-Disaster Checklist
    - [ ] Verify backups are current
    - [ ] Test backup restoration process
    - [ ] Verify DR site resources are provisioned
    - [ ] Confirm failover DNS is configured

    ## Primary Region Failure

    ### Detection (0-15 minutes)
    - Alerting system detects primary region down
    - Incident commander declared
    - War room opened in Slack #incidents

    ### Initial Actions (15-30 minutes)
    - Verify primary region is truly down
    - Check backup systems in secondary region
    - Validate latest backup timestamp

    ### Failover Procedure (30 minutes - 2 hours)
    1. Validate backup integrity
    2. Restore database from latest backup
    3. Update application configuration
    4. Perform DNS failover to secondary region
    5. Verify application health

    ### Recovery Steps
    1. Restore from backup: `restore-backup.sh --backup-id=latest`
    2. Update DNS: `aws route53 change-resource-record-sets --cli-input-json file://failover.json`
    3. Verify: `curl https://myapp.com/health`
    4. Run smoke tests
    5. Monitor error rates and performance

    ## Post-Disaster
    - Document timeline and root-cause analysis (RCA)
    - Update runbooks
    - Schedule a post-mortem
    - Re-test backups
---
apiVersion: v1
kind: Secret
metadata:
  name: dr-credentials
  namespace: operations
type: Opaque
stringData:
  # Placeholder values only (AWS's documented example keys). Source real
  # credentials from a secret manager; never commit them to version control.
  backup_aws_access_key: "AKIAIOSFODNN7EXAMPLE"
  backup_aws_secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  dr_site_password: "secure-password-here"
```
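The runbook commits to a one-hour RPO, and that target is only meaningful if something continuously checks that backups are actually landing. A minimal sketch that could run as its own CronJob, assuming the `s3://my-backups` layout from the earlier examples and GNU `date`; the Slack webhook URL is a placeholder:

```bash
#!/bin/bash
# check-rpo.sh - alert when the newest backup is older than the runbook's
# one-hour RPO. Assumes the s3://my-backups layout from the examples above
# and GNU date; the Slack webhook URL is a placeholder.
set -euo pipefail

RPO_SECONDS=3600
BUCKET_PREFIX="s3://my-backups/postgres/"

# Newest object under the prefix ("YYYY-MM-DD HH:MM:SS size key", local time).
latest=$(aws s3 ls "$BUCKET_PREFIX" --recursive | sort | tail -n 1 || true)
if [ -z "$latest" ]; then
  echo "No backups found under $BUCKET_PREFIX" >&2
  exit 2
fi

backup_epoch=$(date -d "$(echo "$latest" | awk '{print $1" "$2}')" +%s)
age=$(( $(date +%s) - backup_epoch ))

if [ "$age" -gt "$RPO_SECONDS" ]; then
  echo "RPO breach: newest backup is ${age}s old (limit ${RPO_SECONDS}s)"
  # Placeholder alert hook - swap in your paging integration.
  curl -s -X POST -d "{\"text\":\"RPO breach: backup ${age}s old\"}" \
    "https://hooks.slack.com/services/EXAMPLE" || true
  exit 1
fi
echo "RPO OK: newest backup is ${age}s old"
```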
### 3. **Backup and Restore Script**

```bash
#!/bin/bash
# backup-restore.sh - Complete backup and restore utilities
set -euo pipefail

BACKUP_BUCKET="s3://my-backups"
BACKUP_RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

log_info() {
  echo -e "${GREEN}[INFO]${NC} $1"
}

log_error() {
  echo -e "${RED}[ERROR]${NC} $1"
}

log_warn() {
  echo -e "${YELLOW}[WARN]${NC} $1"
}

# Backup function
backup_all() {
  local environment=$1

  log_info "Starting backup for $environment environment"

  # Backup databases
  log_info "Backing up databases..."
  for db in myapp_db analytics_db; do
    local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${TIMESTAMP}.sql.gz"
    pg_dump "$db" | gzip | aws s3 cp - "$backup_file"
    log_info "Backed up $db to $backup_file"
  done

  # Backup Kubernetes resources
  log_info "Backing up Kubernetes resources..."
  kubectl get all,configmap,secret,ingress,pvc -A -o yaml | \
    gzip | aws s3 cp - "$BACKUP_BUCKET/$environment/kubernetes-${TIMESTAMP}.yaml.gz"
  log_info "Kubernetes resources backed up"

  # Backup volumes. This assumes a dedicated backup-pod with each PVC
  # mounted at /data; for anything beyond a sketch, a purpose-built tool
  # such as Velero (see Resources) handles volume backups properly.
  log_info "Backing up persistent volumes..."
  for pvc in $(kubectl get pvc -A -o name); do
    local pvc_name=$(echo "$pvc" | cut -d'/' -f2)
    log_info "Backing up PVC: $pvc_name"
    # No -t/-i flags: cron and CI contexts have no TTY.
    kubectl exec -n default backup-pod -- \
      tar czf - /data | aws s3 cp - "$BACKUP_BUCKET/$environment/volumes/${pvc_name}-${TIMESTAMP}.tar.gz"
  done

  log_info "All backups completed successfully"
}

# Restore function
restore_all() {
  local environment=$1
  local backup_date=$2

  log_warn "Restoring from backup date: $backup_date"
  read -p "Are you sure? (yes/no): " confirm
  if [ "$confirm" != "yes" ]; then
    log_error "Restore cancelled"
    exit 1
  fi

  # Restore databases
  log_info "Restoring databases..."
  for db in myapp_db analytics_db; do
    local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${backup_date}.sql.gz"
    log_info "Restoring $db from $backup_file"
    aws s3 cp "$backup_file" - | gunzip | psql "$db"
  done

  # Restore Kubernetes resources
  log_info "Restoring Kubernetes resources..."
  local k8s_backup="$BACKUP_BUCKET/$environment/kubernetes-${backup_date}.yaml.gz"
  aws s3 cp "$k8s_backup" - | gunzip | kubectl apply -f -

  log_info "Restore completed successfully"
}

# Test restore
test_restore() {
  local environment=$1

  log_info "Testing restore procedure..."

  # Get latest backup
  local latest_backup=$(aws s3 ls "$BACKUP_BUCKET/$environment/databases/" | \
    sort | tail -n 1 | awk '{print $4}')

  if [ -z "$latest_backup" ]; then
    log_error "No backups found"
    exit 1
  fi

  log_info "Testing restore from: $latest_backup"

  # Capture the scratch database name once so the same database is
  # created, restored into, and dropped.
  local test_db="test_restore_$(date +%s)"
  psql -c "CREATE DATABASE ${test_db};"

  # Download and restore
  aws s3 cp "$BACKUP_BUCKET/$environment/databases/$latest_backup" - | \
    gunzip | psql "$test_db"

  psql -c "DROP DATABASE ${test_db};"
  log_info "Test restore successful"
}

# List backups
list_backups() {
  local environment=$1

  log_info "Available backups for $environment:"
  aws s3 ls "$BACKUP_BUCKET/$environment/" --recursive | \
    grep -E "\.sql\.gz|\.yaml\.gz|\.tar\.gz"
}

# Cleanup old backups
cleanup_old_backups() {
  local environment=$1

  log_info "Cleaning up backups older than $BACKUP_RETENTION_DAYS days"

  # find(1) cannot walk an S3 bucket; list the objects and delete those
  # past retention (GNU date). A server-side S3 lifecycle rule is the
  # sturdier option -- see the sketch after this script.
  local cutoff
  cutoff=$(date -d "-${BACKUP_RETENTION_DAYS} days" +%s)
  aws s3 ls "$BACKUP_BUCKET/$environment/" --recursive | \
    while read -r day time size key; do
      if [ "$(date -d "$day $time" +%s)" -lt "$cutoff" ]; then
        aws s3 rm "$BACKUP_BUCKET/$key"
      fi
    done

  log_info "Cleanup completed"
}

# Main
main() {
  case "${1:-}" in
    backup)
      backup_all "${2:-production}"
      ;;
    restore)
      restore_all "${2:-production}" "${3:-}"
      ;;
    test)
      test_restore "${2:-production}"
      ;;
    list)
      list_backups "${2:-production}"
      ;;
    cleanup)
      cleanup_old_backups "${2:-production}"
      ;;
    *)
      echo "Usage: $0 {backup|restore|test|list|cleanup} [environment] [backup-date]"
      exit 1
      ;;
  esac
}

main "$@"
```
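The `cleanup` subcommand deletes objects from the client side, so retention silently stops being enforced if the job stops running. A server-side S3 lifecycle rule avoids that failure mode; a minimal sketch, assuming the `my-backups` bucket from the script, with prefix and day count as illustrative values:

```bash
# backup-lifecycle.sh - delegate backup retention to S3 itself so expiry
# happens even when no cleanup job runs. Bucket, prefix, and day count are
# assumptions matching the examples above.
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-backups \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-old-backups",
        "Filter": { "Prefix": "production/" },
        "Status": "Enabled",
        "Expiration": { "Days": 30 }
      }
    ]
  }'
```

Pairing a rule like this with S3 Object Lock in compliance mode is one way to get the immutable backups recommended in the Best Practices below.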
### 4. **Cross-Region Failover**

```yaml
# route53-failover.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: failover-config
  namespace: operations
data:
  failover.sh: |
    #!/bin/bash
    set -euo pipefail

    PRIMARY_REGION="us-east-1"
    SECONDARY_REGION="us-west-2"
    DOMAIN="myapp.com"
    HOSTED_ZONE_ID="Z1234567890ABC"

    echo "Configuring failover records for $DOMAIN"

    # Get each endpoint and its canonical hosted zone. The alias target's
    # hosted zone ID belongs to the load balancer and differs per region,
    # so it is looked up rather than hard-coded.
    PRIMARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
      --region "$PRIMARY_REGION" \
      --query 'LoadBalancers[0].DNSName' \
      --output text)
    PRIMARY_LB_ZONE=$(aws elbv2 describe-load-balancers \
      --region "$PRIMARY_REGION" \
      --query 'LoadBalancers[0].CanonicalHostedZoneId' \
      --output text)

    SECONDARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
      --region "$SECONDARY_REGION" \
      --query 'LoadBalancers[0].DNSName' \
      --output text)
    SECONDARY_LB_ZONE=$(aws elbv2 describe-load-balancers \
      --region "$SECONDARY_REGION" \
      --query 'LoadBalancers[0].CanonicalHostedZoneId' \
      --output text)

    # Upsert the PRIMARY/SECONDARY failover pair. Alias records carry no
    # TTL; Route53 shifts traffic when the primary target becomes unhealthy.
    aws route53 change-resource-record-sets \
      --hosted-zone-id "$HOSTED_ZONE_ID" \
      --change-batch '{
        "Changes": [
          {
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "'$DOMAIN'",
              "Type": "A",
              "SetIdentifier": "Primary",
              "Failover": "PRIMARY",
              "AliasTarget": {
                "HostedZoneId": "'$PRIMARY_LB_ZONE'",
                "DNSName": "'$PRIMARY_ENDPOINT'",
                "EvaluateTargetHealth": true
              }
            }
          },
          {
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "'$DOMAIN'",
              "Type": "A",
              "SetIdentifier": "Secondary",
              "Failover": "SECONDARY",
              "AliasTarget": {
                "HostedZoneId": "'$SECONDARY_LB_ZONE'",
                "DNSName": "'$SECONDARY_ENDPOINT'",
                "EvaluateTargetHealth": false
              }
            }
          }
        ]
      }'

    echo "Failover records configured"
```

A sketch for provisioning an explicit health check to pair with these records follows the Resources section below.

## Best Practices

### ✅ DO

- Perform regular backup testing
- Use multiple backup locations
- Implement automated backups
- Document recovery procedures
- Test failover procedures regularly
- Monitor backup completion
- Use immutable backups
- Encrypt backups at rest and in transit

### ❌ DON'T

- Rely on a single backup location
- Ignore backup failures
- Store backups alongside production data
- Skip testing recovery procedures
- Over-compress backups at the expense of restore speed
- Forget to verify backup integrity
- Store encryption keys with backups
- Assume backups are working without checking

## Resources

- [AWS Backup Documentation](https://docs.aws.amazon.com/backup/)
- [Kubernetes Backup Best Practices](https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/)
- [PostgreSQL Backup and Recovery](https://www.postgresql.org/docs/current/backup.html)
- [Velero - Kubernetes Native Disaster Recovery](https://velero.io/)
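As referenced in the failover example above: with `EvaluateTargetHealth: true` Route53 relies on the load balancer's own target health, while an explicit health check attached via `HealthCheckId` on the primary record gives finer control over what "healthy" means. A minimal sketch, where the domain, `/health` path, and thresholds are illustrative placeholders:

```bash
# create-failover-healthcheck.sh - provision the health check that DNS
# failover evaluates, then verify resolution after a drill. The domain,
# /health path, and thresholds are illustrative placeholders.
aws route53 create-health-check \
  --caller-reference "myapp-primary-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "primary.myapp.com",
    "Port": 443,
    "ResourcePath": "/health",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'

# After a failover drill, confirm where the domain actually resolves:
dig +short myapp.com A
```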