---
name: devops-expert
version: 1.0.0
description: Expert-level DevOps practices, culture, automation, and continuous delivery
category: devops
tags: [devops, ci-cd, automation, infrastructure, culture]
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash(*)
---

# DevOps Expert

Expert guidance for DevOps practices, culture, CI/CD pipelines, infrastructure automation, and operational excellence.

## Core Concepts

### DevOps Culture
- Collaboration and communication
- Shared responsibility
- Continuous improvement
- Breaking down silos
- Blameless culture
- Measuring everything

### Automation
- Infrastructure as Code (IaC)
- Configuration management
- Deployment automation
- Testing automation
- Monitoring automation
- Self-service platforms

### CI/CD
- Continuous Integration
- Continuous Delivery
- Continuous Deployment
- Pipeline as Code
- Artifact management
- Release strategies

## CI/CD Pipeline

```yaml
# GitHub Actions Example
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linting
        run: npm run lint

      - name: Run tests
        run: npm test

      - name: Run security scan
        run: npm audit

      - name: Upload coverage
        uses: codecov/codecov-action@v3

  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v3

      - name: Log in to Container Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

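      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # The default tags do not include the commit SHA; add an unprefixed
          # full-SHA tag so the deploy jobs can pull the image by github.sha
          tags: |
            type=ref,event=branch
            type=ref,event=tag
            type=ref,event=pr
            type=sha,format=long,prefix=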

      - name: Build and push Docker image
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    environment: staging

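    steps:
      # The smoke tests below run from the repository, so check it out first
      - uses: actions/checkout@v3

      # Assumes the runner's kubectl is already authenticated against the cluster
      - name: Deploy to staging
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=staging

      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp -n staging

      - name: Run smoke tests
        run: |
          npm ci
          npm run test:smoke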

  deploy-production:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production

    steps:
      - name: Deploy to production
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=production

      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp -n production
```

## Infrastructure as Code

```python
# Pulumi Infrastructure as Code
import pulumi
import pulumi_aws as aws

# VPC
vpc = aws.ec2.Vpc("main-vpc",
    cidr_block="10.0.0.0/16",
    enable_dns_hostnames=True,
    enable_dns_support=True,
    tags={"Name": "main-vpc"})

# Subnets
public_subnet = aws.ec2.Subnet("public-subnet",
    vpc_id=vpc.id,
    cidr_block="10.0.1.0/24",
    availability_zone="us-east-1a",
    map_public_ip_on_launch=True,
    tags={"Name": "public-subnet"})

private_subnet = aws.ec2.Subnet("private-subnet",
    vpc_id=vpc.id,
    cidr_block="10.0.2.0/24",
    availability_zone="us-east-1b",
    tags={"Name": "private-subnet"})

# Internet Gateway
igw = aws.ec2.InternetGateway("igw",
    vpc_id=vpc.id,
    tags={"Name": "main-igw"})

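# Route Table
route_table = aws.ec2.RouteTable("public-rt",
    vpc_id=vpc.id,
    routes=[
        aws.ec2.RouteTableRouteArgs(
            cidr_block="0.0.0.0/0",
            gateway_id=igw.id,
        )
    ],
    tags={"Name": "public-rt"})

# Associate the route table with the public subnet; without this the subnet
# keeps the VPC's main route table and has no route to the internet gateway
route_table_association = aws.ec2.RouteTableAssociation("public-rta",
    subnet_id=public_subnet.id,
    route_table_id=route_table.id)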

# Security Group
security_group = aws.ec2.SecurityGroup("web-sg",
    vpc_id=vpc.id,
    description="Allow HTTP and HTTPS traffic",
    ingress=[
        aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp",
            from_port=80,
            to_port=80,
            cidr_blocks=["0.0.0.0/0"],
        ),
        aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp",
            from_port=443,
            to_port=443,
            cidr_blocks=["0.0.0.0/0"],
        ),
    ],
    egress=[
        aws.ec2.SecurityGroupEgressArgs(
            protocol="-1",
            from_port=0,
            to_port=0,
            cidr_blocks=["0.0.0.0/0"],
        )
    ])

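# IAM role assumed by the EKS control plane
cluster_role = aws.iam.Role("eks-cluster-role",
    assume_role_policy="""{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "eks.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }]
    }""")

aws.iam.RolePolicyAttachment("eks-cluster-policy",
    role=cluster_role.name,
    policy_arn="arn:aws:iam::aws:policy/AmazonEKSClusterPolicy")

# EKS Cluster (requires subnets in at least two availability zones)
cluster = aws.eks.Cluster("app-cluster",
    role_arn=cluster_role.arn,
    vpc_config=aws.eks.ClusterVpcConfigArgs(
        subnet_ids=[public_subnet.id, private_subnet.id],
        security_group_ids=[security_group.id],
    ))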

# Export outputs
pulumi.export("vpc_id", vpc.id)
pulumi.export("cluster_name", cluster.name)
pulumi.export("cluster_endpoint", cluster.endpoint)
```
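
The subnet CIDR blocks above are hardcoded. As a minimal sketch, the standard-library `ipaddress` module can derive non-overlapping blocks from the VPC range before they are passed to `aws.ec2.Subnet`. The helper name `plan_subnets` and its `skip` parameter are assumptions made for illustration and are not part of Pulumi.

```python
import ipaddress
from typing import List


def plan_subnets(vpc_cidr: str, new_prefix: int, count: int, skip: int = 0) -> List[str]:
    """Return `count` non-overlapping subnet CIDRs carved out of `vpc_cidr`.

    `skip` leaves the first N blocks unused (hypothetical convention: the
    example above starts at 10.0.1.0/24, so it skips 10.0.0.0/24).
    """
    network = ipaddress.ip_network(vpc_cidr)
    blocks: List[str] = []
    for index, subnet in enumerate(network.subnets(new_prefix=new_prefix)):
        if index < skip:
            continue
        if len(blocks) == count:
            break
        blocks.append(str(subnet))
    if len(blocks) < count:
        raise ValueError(f"{vpc_cidr} cannot hold {skip + count} /{new_prefix} subnets")
    return blocks


public_cidr, private_cidr = plan_subnets("10.0.0.0/16", new_prefix=24, count=2, skip=1)
print(public_cidr, private_cidr)  # 10.0.1.0/24 10.0.2.0/24
```

Deriving the blocks from one VPC range keeps them from overlapping when more subnets are added later.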

## Deployment Strategies

```python
from typing import List, Dict
import time

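class DeploymentStrategy:
    """Implement various deployment strategies.

    Sketch only: the helpers called below (deploy_environment, run_tests,
    switch_traffic, rollback, check_health, and so on) are not defined here
    and must be supplied by a subclass for the target platform.
    """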

    def __init__(self, service_name: str):
        self.service_name = service_name

    def blue_green_deployment(self, blue_version: str, green_version: str) -> bool:
        """Blue-Green deployment; returns True if traffic was switched to green"""
        # Deploy green environment alongside the live blue environment
        self.deploy_environment("green", green_version)

        # Run tests on green before it receives any traffic
        if self.run_tests("green"):
            # Switch traffic to green
            self.switch_traffic("green")

            # Keep blue for rollback
            print(f"Deployment successful. Blue ({blue_version}) kept for rollback.")
            return True

        # Traffic was never switched, so blue stays active
        print("Tests failed on green. Keeping blue active.")
        return False

    def canary_deployment(self, current_version: str, new_version: str,
                         canary_percentage: int = 10):
        """Canary deployment"""
        # Deploy canary with small percentage
        self.deploy_canary(new_version, canary_percentage)

        # Monitor metrics
        metrics = self.monitor_canary_metrics(duration_minutes=10)

        if metrics['error_rate'] < 0.1 and metrics['latency_p95'] < 500:
            # Gradually increase canary traffic
            for percentage in [25, 50, 75, 100]:
                self.update_canary_traffic(percentage)
                time.sleep(300)  # 5 minutes between increases

                if not self.check_health():
                    self.rollback(current_version)
                    return False

            print(f"Canary deployment successful: {new_version}")
            return True
        else:
            self.rollback(current_version)
            print("Canary deployment failed - rolled back")
            return False

    def rolling_deployment(self, version: str, batch_size: int = 1):
        """Rolling deployment"""
        instances = self.get_instances()

        for i in range(0, len(instances), batch_size):
            batch = instances[i:i + batch_size]

            # Update batch
            for instance in batch:
                self.update_instance(instance, version)
                self.wait_for_healthy(instance)

            # Verify batch health
            if not self.check_health():
                print(f"Rolling deployment failed at batch {i//batch_size + 1}")
                return False

        print(f"Rolling deployment successful: {version}")
        return True

    def feature_flag_deployment(self, feature_name: str, enabled: bool,
                               rollout_percentage: int = 100):
        """Feature flag based deployment"""
        return {
            'feature': feature_name,
            'enabled': enabled,
            'rollout_percentage': rollout_percentage,
            'targeting': {
                'user_segments': ['beta_users'] if rollout_percentage < 100 else ['all']
            }
        }
```
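
`feature_flag_deployment` returns a `rollout_percentage` but does not show how an individual user is matched against it. One common approach, sketched here under assumptions, is to hash the feature name and user ID into a stable bucket. The function name `in_rollout`, the `feature:user` key format, and the choice of SHA-256 are illustrative and not taken from any particular feature-flag service.

```python
import hashlib


def in_rollout(feature_name: str, user_id: str, rollout_percentage: int) -> bool:
    """Return True if `user_id` falls inside the rollout for `feature_name`.

    Hashing feature and user together gives each user a stable bucket in
    [0, 100), so a user who sees the feature at 10% still sees it at 25%.
    """
    if not 0 <= rollout_percentage <= 100:
        raise ValueError("rollout_percentage must be between 0 and 100")
    digest = hashlib.sha256(f"{feature_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percentage


# Hypothetical user IDs and feature name, used only to exercise the function
users = [f"user-{n}" for n in range(1000)]
enabled_at_10 = {u for u in users if in_rollout("new-checkout", u, 10)}
enabled_at_25 = {u for u in users if in_rollout("new-checkout", u, 25)}
print(len(enabled_at_10), len(enabled_at_25))
print(enabled_at_10 <= enabled_at_25)  # True: raising the percentage only adds users
```

Including the feature name in the hash means different features get different user subsets, rather than the same early adopters every time.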

## Configuration Management

```python
from typing import Any, Dict, List
import yaml
import json

class ConfigurationManager:
    """Manage application configuration"""

    def __init__(self, environment: str):
        self.environment = environment
        self.config = {}

    def load_config(self, config_file: str):
        """Load configuration from a YAML or JSON file"""
        with open(config_file, 'r') as f:
            if config_file.endswith(('.yaml', '.yml')):
                # safe_load returns None for an empty file
                self.config = yaml.safe_load(f) or {}
            elif config_file.endswith('.json'):
                self.config = json.load(f)
            else:
                raise ValueError(f"Unsupported config format: {config_file}")

    def get(self, key: str, default: Any = None) -> Any:
        """Get configuration value"""
        keys = key.split('.')
        value = self.config

        for k in keys:
            if isinstance(value, dict):
                value = value.get(k)
            else:
                return default

            if value is None:
                return default

        return value

    def merge_environment_config(self, env_config: Dict):
        """Merge environment-specific configuration"""
        self.config = self._deep_merge(self.config, env_config)

    def _deep_merge(self, base: Dict, override: Dict) -> Dict:
        """Deep merge two dictionaries"""
        result = base.copy()

        for key, value in override.items():
            if key in result and isinstance(result[key], dict) and isinstance(value, dict):
                result[key] = self._deep_merge(result[key], value)
            else:
                result[key] = value

        return result

    def validate_required_keys(self, required_keys: List[str]) -> List[str]:
        """Validate that required configuration keys exist"""
        missing = []

        for key in required_keys:
            if self.get(key) is None:
                missing.append(key)

        return missing
```
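
`merge_environment_config` expects a dictionary of overrides but leaves open where it comes from. A minimal sketch, assuming overrides are supplied as environment variables: the `APP_` prefix, the double-underscore separator, and the function name `overrides_from_env` are conventions invented for this example.

```python
from typing import Any, Dict, Mapping


def overrides_from_env(environ: Mapping[str, str], prefix: str = "APP_") -> Dict[str, Any]:
    """Turn variables such as APP_DATABASE__HOST=db into {"database": {"host": "db"}}.

    The prefix and the double-underscore separator are assumptions of this
    sketch. Values stay strings; no type conversion is attempted, and
    conflicting names (APP_DATABASE plus APP_DATABASE__HOST) are not handled.
    """
    overrides: Dict[str, Any] = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):].lower().split("__")
        node = overrides
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return overrides


# Sample mapping; in real use this would be os.environ
sample_env = {"APP_DATABASE__HOST": "db.internal", "APP_DEBUG": "false", "PATH": "/usr/bin"}
print(overrides_from_env(sample_env))
# {'database': {'host': 'db.internal'}, 'debug': 'false'}
```

The resulting dictionary has the nested shape that `merge_environment_config` deep-merges over the file-based configuration.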

## Monitoring and Observability

```python
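import json
import logging
from typing import Dict

from opencensus.ext.azure import metrics_exporter
from opencensus.stats import aggregation as aggregation_module
from opencensus.stats import measure as measure_module
from opencensus.stats import stats as stats_module
from opencensus.stats import view as view_module
from opencensus.tags import tag_map as tag_map_module

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including any context fields"""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        # json.dumps escapes quotes and newlines that would break a
        # hand-written JSON format string
        return json.dumps(payload, default=str)

class ObservabilityStack:
    """Implement observability best practices"""

    def __init__(self):
        self.logger = self._setup_logging()
        self.stats = stats_module.stats
        self.view_manager = self.stats.view_manager

    def _setup_logging(self) -> logging.Logger:
        """Setup structured logging"""
        logger = logging.getLogger(__name__)
        logger.setLevel(logging.INFO)

        # Attach the handler once, even if several stacks are created
        if not logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(JsonFormatter())
            logger.addHandler(handler)

        return logger

    def log_with_context(self, level: str, message: str, **context):
        """Log with additional context"""
        log_func = getattr(self.logger, level)
        # Nest the context under one key so it cannot clash with LogRecord
        # attributes such as "name" or "message"
        log_func(message, extra={"context": context})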

    def track_custom_metric(self, metric_name: str, value: float,
                           tags: Dict[str, str]):
        """Track custom application metric"""
        # Implementation would send to metrics backend
        pass

    def create_distributed_trace(self, operation_name: str):
        """Create distributed trace span"""
        # Implementation would use OpenTelemetry or similar
        pass
```
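
`create_distributed_trace` is left as a stub above. The sketch below shows only the in-process part of the idea, timing a unit of work and recording its outcome, using the standard library. It does not propagate trace IDs between services, which a real tracing library such as OpenTelemetry would handle. The names `span` and `recorded_spans` are assumptions for illustration.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

recorded_spans: List[Dict[str, object]] = []  # stand-in for a tracing backend


@contextmanager
def span(operation_name: str, **attributes: object) -> Iterator[Dict[str, object]]:
    """Time a block of work and record it, even if the block raises."""
    record: Dict[str, object] = {"operation": operation_name, "status": "ok", **attributes}
    start = time.perf_counter()
    try:
        yield record
    except Exception:
        record["status"] = "error"
        raise
    finally:
        # Runs on success and on failure, so every span gets a duration
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        recorded_spans.append(record)


with span("load-order", order_id="42") as current:
    current["items"] = 3  # attributes can be added while the span is open

print(recorded_spans[0]["operation"], recorded_spans[0]["status"])  # load-order ok
```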

## Best Practices

### Culture & Process
- Foster collaboration between Dev and Ops
- Automate everything possible
- Measure and monitor continuously
- Practice blameless post-mortems
- Share knowledge and documentation
- Encourage experimentation
- Celebrate successes and learn from failures
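
To make "measure and monitor continuously" concrete, the sketch below computes three delivery metrics of the kind tracked in State of DevOps research: deployment frequency, lead time for changes, and change failure rate. The `Deployment` record shape and the function name `delivery_metrics` are assumptions. A real pipeline would pull this data from its CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Deployment:
    deployed_at: datetime
    commit_at: datetime  # when the change was committed
    failed: bool         # True if it caused an incident or rollback


def delivery_metrics(deployments: List[Deployment], window_days: int) -> Dict[str, float]:
    """Summarize deployments that happened within a window of `window_days`."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_times = [(d.deployed_at - d.commit_at).total_seconds() / 3600 for d in deployments]
    failures = sum(1 for d in deployments if d.failed)
    return {
        "deploys_per_day": len(deployments) / window_days,
        "avg_lead_time_hours": sum(lead_times) / len(lead_times),
        "change_failure_rate": failures / len(deployments),
    }


# Made-up history for one week: four deployments, one of which failed
start = datetime(2024, 1, 1, 9, 0)
history = [
    Deployment(start + timedelta(hours=4), start, failed=False),
    Deployment(start + timedelta(days=1, hours=2), start + timedelta(days=1), failed=True),
    Deployment(start + timedelta(days=3, hours=6), start + timedelta(days=3), failed=False),
    Deployment(start + timedelta(days=5, hours=4), start + timedelta(days=5), failed=False),
]
print(delivery_metrics(history, window_days=7))
```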

### CI/CD
- Keep builds fast (<10 minutes)
- Run tests in parallel
- Use pipeline as code
- Implement automated rollbacks
- Require code review before merge
- Use trunk-based development
- Deploy small, frequent changes
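
An automated rollback needs an explicit, testable rule for when a release counts as unhealthy. A minimal sketch of such a gate follows. The metric names, limits, and units are assumptions chosen for illustration and should be replaced with values from the service's own objectives.

```python
from typing import Dict, List

# Hypothetical limits; the units (fraction of requests, milliseconds) are assumptions
DEFAULT_LIMITS: Dict[str, float] = {"error_rate": 0.01, "latency_p95_ms": 500.0}


def rollback_reasons(metrics: Dict[str, float],
                     limits: Dict[str, float] = DEFAULT_LIMITS) -> List[str]:
    """Return one reason per breached limit; an empty list means the release is healthy.

    A metric that is missing counts as a breach, so a broken metrics pipeline
    fails closed instead of silently approving the release.
    """
    reasons: List[str] = []
    for name, limit in limits.items():
        value = metrics.get(name)
        if value is None:
            reasons.append(f"{name} missing")
        elif value > limit:
            reasons.append(f"{name}={value} exceeds {limit}")
    return reasons


print(rollback_reasons({"error_rate": 0.002, "latency_p95_ms": 310.0}))
# []
print(rollback_reasons({"error_rate": 0.05, "latency_p95_ms": 310.0}))
# ['error_rate=0.05 exceeds 0.01']
```

Returning the reasons rather than a bare boolean lets the pipeline log why a rollback was triggered.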

### Infrastructure
- Use Infrastructure as Code
- Version everything (code, config, infrastructure)
- Implement disaster recovery
- Practice chaos engineering
- Use immutable infrastructure
- Automate security scanning
- Monitor cloud costs

## Anti-Patterns

- ❌ Manual deployments
- ❌ Configuration drift
- ❌ No automated testing
- ❌ Long-lived feature branches
- ❌ Blame culture
- ❌ Siloed teams
- ❌ Ignoring technical debt

## Resources

- The Phoenix Project: https://itrevolution.com/the-phoenix-project/
- DevOps Handbook: https://itrevolution.com/the-devops-handbook/
- State of DevOps Report: https://www.devops-research.com/research.html
- GitLab CI/CD: https://docs.gitlab.com/ee/ci/
- GitHub Actions: https://docs.github.com/en/actions