---
name: devops-cloud
description: DevOps, cloud infrastructure, and platform engineering. Use when working with AWS, GCP, Azure, Kubernetes, Terraform, CI/CD pipelines, or infrastructure as code.
---

# DevOps & Cloud Infrastructure

Reference guide for cloud platforms, infrastructure as code, and DevOps practices.

## Cloud Platforms

### AWS (Amazon Web Services)

**Compute:**

```yaml
# EC2 Instance Types
General Purpose: t3, m6i, m7g (ARM)
Compute Optimized: c6i, c7g
Memory Optimized: r6i, x2idn
Storage Optimized: i3, d3
```

```bash
# Auto Scaling
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --launch-template LaunchTemplateId=lt-xxx \
  --min-size 1 --max-size 10 --desired-capacity 2 \
  --vpc-zone-identifier "subnet-xxx,subnet-yyy"
```

**Serverless:**

```typescript
// Lambda with TypeScript
import { APIGatewayProxyHandler } from "aws-lambda";

export const handler: APIGatewayProxyHandler = async (event) => {
  return {
    statusCode: 200,
    body: JSON.stringify({ message: "Success" }),
  };
};
```

**Storage:**

| Service | Use Case | Durability |
|---------|----------|------------|
| S3 | Object storage | 99.999999999% |
| EBS | Block storage (EC2) | 99.8% to 99.9% (gp2/gp3), 99.999% (io2) |
| EFS | Shared file system | 99.999999999% |
| FSx | Windows/Lustre | 99.999999999% |

### GCP (Google Cloud Platform)

**Key Services:**

```bash
# GKE cluster
gcloud container clusters create my-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type e2-medium \
  --enable-autoscaling --min-nodes 1 --max-nodes 10

# Cloud Run (serverless containers)
gcloud run deploy my-service \
  --image gcr.io/project/image:tag \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated
```

### Azure

**Key Services:**

```bash
# AKS cluster
az aks create \
  --resource-group myRG \
  --name myAKS \
  --node-count 3 \
  --enable-addons monitoring \
  --generate-ssh-keys

# Azure Functions
func init MyFunctionApp --typescript
func new --name HttpTrigger --template "HTTP trigger"
```

---

## Kubernetes

### Core Concepts

```yaml
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "128Mi"
              cpu: "250m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
```

### Service Types

```yaml
# ClusterIP (internal)
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: ClusterIP
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
---
# LoadBalancer (external)
apiVersion: v1
kind: Service
metadata:
  name: my-service-lb
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```

### Ingress with TLS

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  # Replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80
```

### Helm Charts

```yaml
# values.yaml
replicaCount: 3

image:
  repository: my-app
  tag: "1.0.0"
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: true
  hosts:
    - host: app.example.com
      paths: ["/"]

resources:
  limits:
    cpu: 500m
    memory: 256Mi
  requests:
    cpu: 250m
    memory: 128Mi
```

---

## Terraform

### AWS Infrastructure

```hcl
# main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" {
  region = var.aws_region
}

# VPC
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "my-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false
}

# EKS Cluster
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.0.0"

  cluster_name    = "my-cluster"
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    default = {
      min_size       = 1
      max_size       = 10
      desired_size   = 3
      instance_types = ["t3.medium"]
    }
  }
}
```

### Variables and Outputs

```hcl
# variables.tf
variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

# outputs.tf
output "cluster_endpoint" {
  description = "EKS cluster endpoint"
  value       = module.eks.cluster_endpoint
}

output "cluster_name" {
  description = "EKS cluster name"
  value       = module.eks.cluster_name
}
```

---

## CI/CD Pipelines

### GitHub Actions

```yaml
# .github/workflows/deploy.yml
name: Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
      - run: npm ci
      - run: npm test
      - run: npm run build

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    if: github.event_name == 'push'
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - name: Log in to registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      # k8s-deploy needs a kubectl context; KUBE_CONFIG is an assumed secret name
      - name: Set cluster context
        uses: azure/k8s-set-context@v4
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG }}
      - name: Deploy to Kubernetes
        uses: azure/k8s-deploy@v4
        with:
          manifests: k8s/
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
```

### GitLab CI

```yaml
# .gitlab-ci.yml
stages:
  - test
  - build
  - deploy

variables:
  DOCKER_IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

test:
  stage: test
  image: node:20
  script:
    - npm ci
    - npm test
    - npm run build
  cache:
    paths:
      - node_modules/

build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    # --password-stdin keeps the password out of the process list and job log
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker build -t $DOCKER_IMAGE .
    - docker push $DOCKER_IMAGE
  only:
    - main

deploy:
  stage: deploy
  image:
    name: bitnami/kubectl:latest
    # The image's entrypoint is kubectl; reset it so the script runs in a shell
    entrypoint: [""]
  script:
    - kubectl set image deployment/my-app my-app=$DOCKER_IMAGE
  only:
    - main
  environment:
    name: production
```

---

## Docker

### Multi-Stage Builds

```dockerfile
# Build stage
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Production stage
FROM node:20-alpine AS production
WORKDIR /app
# Create a non-root user that belongs to the nodejs group
RUN addgroup -g 1001 -S nodejs && \
    adduser -S -u 1001 -G nodejs nextjs
COPY --from=builder --chown=nextjs:nodejs /app/dist ./dist
COPY --from=builder --chown=nextjs:nodejs /app/node_modules ./node_modules
USER nextjs
EXPOSE 3000
CMD ["node", "dist/main.js"]
```

### Docker Compose

```yaml
# docker-compose.yml (local development; credentials are placeholders)
# The top-level "version" key is obsolete in Compose V2 and is omitted here.
services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgres://user:pass@db:5432/mydb
      - REDIS_URL=redis://redis:6379
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      # Alpine-based images ship BusyBox wget but not curl
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: mydb
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d mydb"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
```

---

## Monitoring & Observability

### Prometheus + Grafana

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

### Application Metrics

```typescript
// metrics.ts
import { Counter, Histogram, Registry } from "prom-client";

export const register = new Registry();

export const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "path", "status"],
  registers: [register],
});

export const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration",
  labelNames: ["method", "path"],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register],
});
```

---

## Security Best Practices

### Secrets Management

```bash
# AWS Secrets Manager
aws secretsmanager create-secret \
  --name prod/db-credentials \
  --secret-string '{"username":"admin","password":"xxx"}'
```

```yaml
# Kubernetes Secrets (external-secrets)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets
  target:
    name: db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: prod/db-credentials
        property: username
```

### Network Policies

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - protocol: TCP
          port: 5432
    # Allow DNS lookups; without this the api pods cannot resolve service names.
    # Assumes cluster DNS pods carry the common k8s-app: kube-dns label.
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

---

## Checklist

### Pre-Deployment

- [ ] Infrastructure as Code reviewed
- [ ] Secrets in secrets manager (not env files)
- [ ] Resource limits set
- [ ] Health checks configured
- [ ] Logging and monitoring enabled
- [ ] Network policies defined
- [ ] Backup strategy in place

### Production Readiness

- [ ] Multi-AZ deployment
- [ ] Auto-scaling configured
- [ ] SSL/TLS enabled
- [ ] WAF rules configured
- [ ] Disaster recovery tested
- [ ] Runbooks documented
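
Two of the pre-deployment items (resource limits set, health checks configured) can be checked mechanically before a manifest ships. The following is a minimal sketch, assuming the Deployment manifest has already been parsed into a plain object. The `Container` and `DeploymentManifest` types and the `auditDeployment` function are hypothetical names introduced for illustration; they are not part of any Kubernetes client library, and the types cover only the fields this check reads.

```typescript
// Hypothetical, minimal shapes: only the fields this check inspects.
interface Container {
  name: string;
  resources?: {
    limits?: { cpu?: string; memory?: string };
  };
  livenessProbe?: unknown;
  readinessProbe?: unknown;
}

interface DeploymentManifest {
  kind: string;
  spec: { template: { spec: { containers: Container[] } } };
}

// Returns one human-readable finding per missing checklist item, per container.
function auditDeployment(manifest: DeploymentManifest): string[] {
  const findings: string[] = [];
  for (const c of manifest.spec.template.spec.containers) {
    if (!c.resources?.limits?.cpu) findings.push(`${c.name}: missing CPU limit`);
    if (!c.resources?.limits?.memory) findings.push(`${c.name}: missing memory limit`);
    if (!c.livenessProbe) findings.push(`${c.name}: missing livenessProbe`);
    if (!c.readinessProbe) findings.push(`${c.name}: missing readinessProbe`);
  }
  return findings;
}

// Example: limits and a liveness probe are set, but the readiness probe is not.
const example: DeploymentManifest = {
  kind: "Deployment",
  spec: {
    template: {
      spec: {
        containers: [
          {
            name: "my-app",
            resources: { limits: { cpu: "500m", memory: "256Mi" } },
            livenessProbe: { httpGet: { path: "/health", port: 8080 } },
          },
        ],
      },
    },
  },
};

console.log(auditDeployment(example));
// prints [ 'my-app: missing readinessProbe' ]
```

A check like this could run as a CI step ahead of the deploy job and fail the pipeline when the returned list is non-empty; the remaining checklist items (backups, runbooks, disaster recovery tests) still need human review.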