---
name: infrastructure-expert
description: Expert infrastructure design covering networking, compute, storage, operations, and disaster recovery
version: 1.0.0
author: USER
tags: [infrastructure, networking, compute, storage, operations]
---

# Infrastructure Expert

## Purpose
Design robust infrastructure covering networking, compute resources, storage systems, operational practices, and disaster recovery.

## Activation Keywords
- infrastructure, infra
- networking, VPC, subnet
- compute, servers, instances
- storage, disk, volume
- operations, SRE

## Core Capabilities

### 1. Networking
- VPC design
- Subnet planning
- Security groups
- Load balancers
- DNS/CDN

### 2. Compute
- Instance selection
- Container orchestration
- Serverless
- Spot/Preemptible
- Reserved capacity

### 3. Storage
- Block storage
- Object storage
- File storage
- Backup strategies
- Data lifecycle

### 4. Operations
- Monitoring
- Logging
- Alerting
- Incident response
- Runbooks

### 5. Disaster Recovery
- RPO/RTO definitions
- Backup verification
- Failover testing
- Multi-region design

## Network Architecture

```
VPC Design:
┌─────────────────────────────────────┐
│ VPC (10.0.0.0/16)                   │
│  ├─ Public Subnet (10.0.1.0/24)     │
│  │   └─ NAT Gateway, Bastion        │
│  ├─ Private Subnet (10.0.2.0/24)    │
│  │   └─ Application servers         │
│  └─ Data Subnet (10.0.3.0/24)       │
│      └─ Databases                   │
└─────────────────────────────────────┘
```
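
The addressing in the diagram can be derived from the VPC CIDR rather than hardcoded. A minimal sketch using Terraform's built-in `cidrsubnet` function; the local names (`vpc_cidr`, `tier_cidrs`) and the output name are illustrative assumptions, not part of any existing module.

```hcl
# Sketch only: local and output names are hypothetical.
locals {
  vpc_cidr = "10.0.0.0/16"

  # cidrsubnet(prefix, newbits, netnum): adding 8 bits to a /16 yields /24 networks,
  # and netnum selects which /24 inside the VPC range.
  tier_cidrs = {
    public  = cidrsubnet(local.vpc_cidr, 8, 1) # 10.0.1.0/24
    private = cidrsubnet(local.vpc_cidr, 8, 2) # 10.0.2.0/24
    data    = cidrsubnet(local.vpc_cidr, 8, 3) # 10.0.3.0/24
  }
}

output "tier_cidrs" {
  value = local.tier_cidrs
}
```

Deriving the ranges this way keeps every subnet inside the VPC block and makes overlaps harder to introduce when a tier is added later.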

## Infrastructure as Code

```hcl
# Terraform example
module "vpc" {
  source = "./modules/vpc"

  name            = "production"
  cidr            = "10.0.0.0/16"
  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  # One public subnet per AZ, so each AZ can host its own NAT gateway
  public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false # Avoid a single shared NAT gateway (high availability)

  tags = {
    Environment = "production"
    Terraform   = "true"
  }
}
```

## Storage Selection Guide

| Use Case | Storage Type | Service (AWS/GCP) |
|----------|--------------|-------------------|
| OS/App data | Block | EBS/Persistent Disk |
| Static files | Object | S3/Cloud Storage |
| Shared files | File | EFS/Filestore |
| Database | Block (high IOPS) | EBS io2/SSD Persistent Disk |
| Backup | Object (cold) | S3 Glacier/Cloud Storage Coldline |
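
Two rows of the table, sketched in Terraform for AWS: a provisioned-IOPS volume for a database and a backup bucket that moves objects to cold storage. This is an illustration under assumptions: the resource names, the `backup_bucket_name` variable, the volume size, the IOPS figure, and the 30-day threshold are all hypothetical and should be sized for the actual workload.

```hcl
# Sketch only: names, sizes, and day counts below are assumptions.
variable "backup_bucket_name" {
  type        = string
  description = "Hypothetical name for the backup bucket"
}

# Database row: block storage with provisioned IOPS (io2).
resource "aws_ebs_volume" "db_data" {
  availability_zone = "us-east-1a"
  size              = 100 # GiB
  type              = "io2"
  iops              = 3000
  encrypted         = true
}

# Backup row: object storage that transitions to a cold tier.
resource "aws_s3_bucket" "backups" {
  bucket = var.backup_bucket_name
}

resource "aws_s3_bucket_lifecycle_configuration" "backups" {
  bucket = aws_s3_bucket.backups.id

  rule {
    id     = "archive-old-backups"
    status = "Enabled"

    filter {} # Empty filter: the rule applies to every object in the bucket

    transition {
      days          = 30
      storage_class = "GLACIER"
    }
  }
}
```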

## Operational Checklist

```markdown
## Monitoring
- [ ] System metrics (CPU, Memory, Disk)
- [ ] Application metrics
- [ ] Business metrics
- [ ] Synthetic monitoring

## Logging
- [ ] Centralized logging
- [ ] Log retention policy
- [ ] Log analysis/search
- [ ] Audit logs

## Alerting
- [ ] Critical alerts → PagerDuty
- [ ] Warning alerts → Slack
- [ ] Alert runbooks linked
- [ ] On-call rotation

## Security
- [ ] Security groups reviewed
- [ ] Access logs enabled
- [ ] Patch management
- [ ] Vulnerability scanning

## Backup
- [ ] Automated backups
- [ ] Cross-region replication
- [ ] Restore testing (quarterly)
- [ ] Backup monitoring
```
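
The "Critical alerts → PagerDuty" item could be wired up on AWS roughly as follows: a CloudWatch alarm publishes to an SNS topic, and an HTTPS subscription forwards to the paging service. This is a sketch, not a reference setup. The variable names, the alarm name, and the 90% / 10-minute threshold are assumptions, and the integration endpoint is left as an input because it is specific to each account.

```hcl
# Sketch only: variable names, alarm name, and thresholds are assumptions.
variable "pagerduty_integration_url" {
  type        = string
  description = "HTTPS endpoint of the paging integration, supplied by the operator"
}

variable "instance_id" {
  type        = string
  description = "EC2 instance to watch"
}

resource "aws_sns_topic" "critical_alerts" {
  name = "critical-alerts"
}

# Forward everything published to the topic to the paging endpoint.
resource "aws_sns_topic_subscription" "pager" {
  topic_arn = aws_sns_topic.critical_alerts.arn
  protocol  = "https"
  endpoint  = var.pagerduty_integration_url
}

# Fires when average CPU stays above 90% for two consecutive 5-minute periods.
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "cpu-high"
  alarm_description   = "Average CPU above 90% for 10 minutes"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 90
  period              = 300
  evaluation_periods  = 2

  dimensions = {
    InstanceId = var.instance_id
  }

  alarm_actions = [aws_sns_topic.critical_alerts.arn]
}
```

Warning-level alerts would follow the same pattern with a second topic and a subscription pointing at the chat integration.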

## Disaster Recovery Tiers

| Tier | RPO | RTO | Strategy |
|------|-----|-----|----------|
| Tier 1 | Minutes | Minutes | Multi-region active-active |
| Tier 2 | Hours | Hours | Warm standby |
| Tier 3 | 24 hours | Days | Backup/restore |
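
As an illustration of the Tier 3 row, the sketch below uses AWS Backup to take a daily backup and copy it to a vault in a second region. The regions, vault names, plan name, schedule time, and 35-day retention are assumptions chosen for the example, not recommendations from this document.

```hcl
# Sketch only: regions, names, schedule, and retention are assumptions.
provider "aws" {
  region = "us-east-1"
}

# Second provider configuration for the recovery region.
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

resource "aws_backup_vault" "primary" {
  name = "primary-vault"
}

resource "aws_backup_vault" "dr" {
  provider = aws.dr
  name     = "dr-vault"
}

resource "aws_backup_plan" "tier3" {
  name = "tier3-daily"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 5 * * ? *)" # Daily at 05:00 UTC, in line with a 24-hour RPO

    lifecycle {
      delete_after = 35 # Days to keep each recovery point
    }

    # Copy every recovery point to the vault in the recovery region.
    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn
    }
  }
}
```

The plan alone protects nothing: an `aws_backup_selection` with a suitable IAM role is still needed to attach resources to it, and the "Days" RTO depends on how long a restore from the copied vault takes in practice.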

## Example Usage

```
User: "Design infrastructure for a new production environment"

Infrastructure Expert Response:
1. Networking
   - VPC with public/private subnets
   - Multi-AZ deployment
   - Security group design

2. Compute
   - EKS cluster sizing
   - Node pool configuration
   - Auto-scaling setup

3. Storage
   - EBS for databases
   - S3 for static assets
   - Backup to Glacier

4. Operations
   - CloudWatch + Prometheus
   - Centralized logging (Loki)
   - PagerDuty integration

5. DR Plan (Tier 2)
   - RPO: 1 hour
   - RTO: 4 hours
   - Warm standby with cross-region backups
```