---
name: self-service-infrastructure
description: Use when designing infrastructure self-service portals, IaC templates, or automated provisioning systems. Covers Terraform modules, Pulumi, environment provisioning, and infrastructure guardrails.
allowed-tools: Read, Glob, Grep
---

# Self-Service Infrastructure

Patterns for enabling developers to provision infrastructure without tickets, while maintaining governance and control.

## When to Use This Skill

- Designing infrastructure self-service capabilities
- Creating reusable Terraform/Pulumi modules
- Building environment provisioning systems
- Implementing infrastructure guardrails
- Reducing infrastructure request bottlenecks
- Balancing developer autonomy with governance

## Self-Service Fundamentals

### What is Self-Service Infrastructure?

```text
Self-Service Infrastructure:
Enabling developers to provision and manage infrastructure
directly, without filing tickets or waiting for ops teams.

Traditional Model:
┌─────────────────────────────────────────────────────────────┐
│ Developer → Ticket → Ops Review → Manual Provision → Done  │
│                                                              │
│ Timeline: Days to weeks                                      │
│ Bottleneck: Ops team capacity                               │
│ Result: Shadow IT, workarounds, frustration                 │
└─────────────────────────────────────────────────────────────┘

Self-Service Model:
┌─────────────────────────────────────────────────────────────┐
│ Developer → Portal/API → Automatic Provision → Done         │
│                                                              │
│ Timeline: Minutes to hours                                  │
│ Bottleneck: None (automated)                                │
│ Result: Speed, consistency, compliance                      │
└─────────────────────────────────────────────────────────────┘

Self-Service Spectrum:
├── Fully Managed: Click a button, get a database
├── Template-Based: Customize from approved templates
├── Policy-Constrained: Write IaC within guardrails
└── Full Freedom: Any infrastructure (risky)

Sweet Spot: Template-Based with Policy Guardrails
```

### Key Benefits

```text
Self-Service Benefits:

For Developers:
├── Speed: Minutes instead of days
├── Autonomy: Provision when needed
├── Consistency: Same infrastructure every time
├── Learning: Understand infrastructure better
└── Ownership: More responsibility, more control

For Operations:
├── Scale: Handle more requests without more people
├── Consistency: Enforce standards automatically
├── Focus: Work on platform, not tickets
├── Audit: Clear trail of who provisioned what
└── Compliance: Built-in policy enforcement

For Organization:
├── Velocity: Faster time to market
├── Cost: Reduced ops overhead
├── Governance: Better compliance posture
├── Security: Consistent security controls
└── Efficiency: Resources provisioned when needed
```

## Self-Service Architecture

### Component Architecture

```text
Self-Service Infrastructure Architecture:

┌─────────────────────────────────────────────────────────────┐
│                     USER INTERFACE                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Portal    │  │    CLI      │  │    API      │         │
│  │   (Web UI)  │  │ (Terraform) │  │  (REST/gRPC)│         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         └────────────────┼────────────────┘                 │
│                          │                                   │
├──────────────────────────┼───────────────────────────────────┤
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │               ORCHESTRATION LAYER                    │    │
│  │  ├── Request validation                              │    │
│  │  ├── Policy evaluation (OPA/Sentinel)               │    │
│  │  ├── Cost estimation                                 │    │
│  │  ├── Approval workflow (if needed)                  │    │
│  │  └── Execution orchestration                        │    │
│  └─────────────────────────────────────────────────────┘    │
│                          │                                   │
├──────────────────────────┼───────────────────────────────────┤
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │               TEMPLATE LIBRARY                       │    │
│  │  ├── Database modules (RDS, Cloud SQL)              │    │
│  │  ├── Compute modules (EKS, GKE, VMs)               │    │
│  │  ├── Storage modules (S3, GCS)                      │    │
│  │  ├── Network modules (VPC, subnets)                 │    │
│  │  └── Composite modules (full environments)          │    │
│  └─────────────────────────────────────────────────────┘    │
│                          │                                   │
├──────────────────────────┼───────────────────────────────────┤
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │               EXECUTION ENGINE                       │    │
│  │  ├── Terraform Cloud/Enterprise                     │    │
│  │  ├── Pulumi Service                                 │    │
│  │  ├── Crossplane                                     │    │
│  │  └── Cloud-native (CDK, ARM, Deployment Manager)   │    │
│  └─────────────────────────────────────────────────────┘    │
│                          │                                   │
├──────────────────────────┼───────────────────────────────────┤
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │               CLOUD PROVIDERS                        │    │
│  │  AWS  │  GCP  │  Azure  │  Kubernetes  │  Others    │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
```

### Request Flow

```text
Self-Service Request Flow:

┌─────────────────────────────────────────────────────────────┐
│ 1. REQUEST                                                   │
│    Developer: "I need a PostgreSQL database for staging"    │
│    └── Via portal, CLI, or API                              │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. VALIDATION                                                │
│    ├── User has permission?          ✓ Team member          │
│    ├── Request well-formed?          ✓ Valid config         │
│    ├── Within quotas?                ✓ Under team limit     │
│    └── Meets policy?                 ✓ Allowed instance type│
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. ENRICHMENT                                                │
│    ├── Apply defaults                 db.t3.medium          │
│    ├── Generate names                 myapp-staging-db      │
│    ├── Assign network                 staging-vpc           │
│    ├── Configure monitoring           Datadog integration   │
│    └── Estimate cost                  ~$50/month            │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. APPROVAL (if required)                                    │
│    ├── Auto-approve: staging, dev     ✓ Auto-approved       │
│    ├── Manual approve: production     (Would need approval) │
│    └── Cost threshold: >$500/month    (Would need approval) │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. EXECUTION                                                 │
│    ├── Generate Terraform             Based on template     │
│    ├── Plan                           Preview changes       │
│    ├── Apply                          Create resources      │
│    └── Verify                         Health checks         │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. DELIVERY                                                  │
│    ├── Connection string → Vault                            │
│    ├── Notification → Slack/email                           │
│    ├── Documentation → Auto-generated                       │
│    └── Registration → Service catalog                       │
└─────────────────────────────────────────────────────────────┘
```
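
The validation, enrichment, and approval steps above can be sketched in a few functions. This is a minimal illustration, not a reference implementation: the `Request` type, the allowed instance classes, the team quota, and the function names are all assumptions made for the example; only the defaults and the $500 threshold come from the flow above.

```python
from dataclasses import dataclass, field

# All names, quotas, and thresholds below are illustrative assumptions.
ALLOWED_INSTANCE_CLASSES = {"db.t3.medium", "db.r5.xlarge"}
AUTO_APPROVE_ENVIRONMENTS = {"dev", "staging"}
APPROVAL_COST_THRESHOLD = 500  # USD per month


@dataclass
class Request:
    team: str
    user: str
    app: str
    environment: str
    resource: str
    config: dict = field(default_factory=dict)


def validate(req: Request, team_members: set, team_usage: int, team_quota: int) -> list:
    """Step 2: return a list of validation errors (empty means valid)."""
    errors = []
    if req.user not in team_members:
        errors.append("user is not a member of the team")
    if team_usage >= team_quota:
        errors.append("team quota exhausted")
    instance = req.config.get("instance_class", "db.t3.medium")
    if instance not in ALLOWED_INSTANCE_CLASSES:
        errors.append(f"instance class {instance} is not allowed")
    return errors


def enrich(req: Request, estimated_cost: float) -> dict:
    """Step 3: apply defaults and derive names from the request."""
    return {
        "name": f"{req.app}-{req.environment}-db",
        "instance_class": req.config.get("instance_class", "db.t3.medium"),
        "network": f"{req.environment}-vpc",
        "estimated_monthly_cost": estimated_cost,
    }


def needs_approval(environment: str, estimated_monthly_cost: float) -> bool:
    """Step 4: only cheap, non-production requests are auto-approved."""
    return (
        environment not in AUTO_APPROVE_ENVIRONMENTS
        or estimated_monthly_cost > APPROVAL_COST_THRESHOLD
    )


req = Request(team="payments", user="alice", app="myapp",
              environment="staging", resource="postgres")
errors = validate(req, team_members={"alice", "bob"}, team_usage=3, team_quota=10)
spec = enrich(req, estimated_cost=50)
print(errors, spec["name"], needs_approval(req.environment, spec["estimated_monthly_cost"]))
```

Keeping each step a pure function makes the policy decisions easy to unit test before any real provisioning is wired in.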

## IaC Module Design

### Terraform Module Patterns

```text
Terraform Module Structure:

Organization-Wide Module Library:
terraform-modules/
├── databases/
│   ├── rds-postgres/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   ├── README.md
│   │   └── examples/
│   │       ├── simple/
│   │       └── production/
│   └── elasticache-redis/
├── compute/
│   ├── eks-cluster/
│   └── ecs-service/
├── storage/
│   └── s3-bucket/
└── network/
    └── vpc/

Module Design Principles:

1. Opinionated Defaults
   # variables.tf
   variable "instance_class" {
     type        = string
     default     = "db.t3.medium"  # Sensible default
     description = "RDS instance type"

     validation {
       condition = can(regex("^db\\.(t3|r5|m5)\\.", var.instance_class))
       error_message = "Only approved instance families allowed."
     }
   }

2. Minimal Required Inputs
   # Only require what can't be defaulted
   variable "name" {
     type        = string
     description = "Database identifier"
   }

   variable "environment" {
     type        = string
     description = "Environment (dev, staging, prod)"
   }

3. Complete Outputs
   # outputs.tf
   output "endpoint" {
     description = "Database connection endpoint"
     value       = aws_db_instance.main.endpoint
   }

   output "connection_secret_arn" {
     description = "ARN of secret with credentials"
     value       = aws_secretsmanager_secret.db_credentials.arn
   }

4. Built-in Best Practices
   # Security hardened by default
   resource "aws_db_instance" "main" {
     # Encryption always on
     storage_encrypted = true

     # No public access
     publicly_accessible = false

     # Automated backups
     backup_retention_period = var.environment == "prod" ? 30 : 7

     # Enhanced monitoring (a non-zero interval also requires monitoring_role_arn)
     monitoring_interval = 60
   }
```
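
One way a portal could turn a request into Terraform is to render a module call in Terraform's JSON syntax (`main.tf.json`), passing only the minimal inputs and leaving everything else to the module's defaults. The sketch below assumes a hypothetical registry address and version; `render_module_call` is an illustrative name, not part of any tool.

```python
import json

# Hypothetical registry address and version; substitute your own.
MODULE_SOURCE = "terraform.company.com/modules/rds-postgres/aws"
MODULE_VERSION = "~> 2.1.0"


def render_module_call(name: str, environment: str, **overrides) -> str:
    """Return a main.tf.json document that calls the module with minimal inputs."""
    inputs = {"name": name, "environment": environment, **overrides}
    document = {
        "module": {
            # source and version are set last so callers cannot override them
            "database": {**inputs, "source": MODULE_SOURCE, "version": MODULE_VERSION}
        }
    }
    return json.dumps(document, indent=2)


rendered = render_module_call("myapp-staging-db", "staging")
print(rendered)
```

Because the module validates its own inputs, the generator can stay thin: an unapproved `instance_class` override would be rejected at plan time by the module's `validation` block.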

### Module Versioning

```text
Module Versioning Strategy:

Semantic Versioning:
├── MAJOR: Breaking changes (new required inputs, removed outputs)
├── MINOR: New features (new optional inputs, new outputs)
└── PATCH: Bug fixes (no interface changes)

Version Constraints:
# Allow patch updates automatically
# Private registry sources take the form <HOSTNAME>/<NAMESPACE>/<NAME>/<PROVIDER>
module "database" {
  source  = "terraform.company.com/modules/rds-postgres/aws"
  version = "~> 2.1.0"  # >=2.1.0, <2.2.0
}

# Pin to exact version (production)
module "database" {
  source  = "terraform.company.com/modules/rds-postgres/aws"
  version = "= 2.1.3"
}

Deprecation Policy:
┌─────────────────────────────────────────────────────────────┐
│ Module Version Lifecycle                                     │
├─────────────────────────────────────────────────────────────┤
│ Current (v2.x):     Supported, new features                 │
│ Previous (v1.x):    Supported, security fixes only          │
│ Deprecated (v0.x):  Warning on use, no support              │
│ Removed:            Will not work                           │
│                                                              │
│ Notification:                                                │
│ ├── Slack announcement when version deprecated              │
│ ├── Warning in terraform plan output                        │
│ ├── Dashboard showing deprecated module usage               │
│ └── Migration guide provided                                │
└─────────────────────────────────────────────────────────────┘
```
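
The `~>` constraint is easy to misread, so a small model of its semantics can help when building dashboards or checks around module versions. This sketch assumes plain numeric versions such as `2.1.3` (no pre-release suffixes) and a constraint with at least two components; the function names are illustrative.

```python
def parse(version: str) -> tuple:
    """Turn '2.1.3' into (2, 1, 3)."""
    return tuple(int(part) for part in version.strip().split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check a version against '~> X.Y.Z' or '~> X.Y'.

    Only the rightmost component of the constraint is allowed to increase:
    '~> 2.1.0' means >=2.1.0, <2.2.0 and '~> 2.1' means >=2.1, <3.0.
    """
    base = parse(constraint.removeprefix("~>"))
    if len(base) < 2:
        raise ValueError("constraint needs at least two components")
    candidate = parse(version)
    # Upper bound: drop the last component and bump the one before it.
    upper = base[:-2] + (base[-2] + 1,)
    return candidate >= base and candidate[: len(upper)] < upper


print(satisfies_pessimistic("2.1.3", "~> 2.1.0"))
print(satisfies_pessimistic("2.2.0", "~> 2.1.0"))
```

The first call is accepted and the second rejected, matching the `>=2.1.0, <2.2.0` comment in the constraint example above.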

## Policy and Guardrails

### Policy as Code

```text
Policy as Code Options:

1. HashiCorp Sentinel (Terraform Enterprise)
   # Require encryption for all storage
   import "tfplan/v2" as tfplan

   s3_buckets = filter tfplan.resource_changes as _, rc {
     rc.type is "aws_s3_bucket" and
     rc.mode is "managed" and
     (rc.change.actions contains "create" or
      rc.change.actions contains "update")
   }

   encryption_enabled = rule {
     all s3_buckets as _, bucket {
       bucket.change.after.server_side_encryption_configuration
         is not null
     }
   }

   main = rule { encryption_enabled }

2. Open Policy Agent (OPA)
   # Rego policy for Kubernetes (OPA 1.0 syntax; older versions need `import rego.v1`)
   package kubernetes.admission

   deny contains msg if {
     input.request.kind.kind == "Pod"
     container := input.request.object.spec.containers[_]
     not container.securityContext.runAsNonRoot
     msg := "Containers must run as non-root"
   }

3. Cloud-Native Policies
   # AWS Service Control Policy: deny S3 uploads that do not request encryption
   {
     "Version": "2012-10-17",
     "Statement": [{
       "Sid": "RequireEncryption",
       "Effect": "Deny",
       "Action": ["s3:PutObject"],
       "Resource": "*",
       "Condition": {
         "StringNotEquals": {
           "s3:x-amz-server-side-encryption": "AES256"
         }
       }
     }]
   }
```
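
Where neither Sentinel nor OPA is available, the same kind of post-plan check can be approximated with a short script over the JSON plan produced by `terraform show -json <planfile>`. The sketch below inspects the `resource_changes` list of that format and flags RDS instances that are unencrypted or public; the specific rules and the function name are assumptions chosen to mirror the module defaults described earlier.

```python
def find_violations(plan: dict) -> list:
    """Return policy violations for RDS instances being created or updated."""
    violations = []
    for rc in plan.get("resource_changes", []):
        if rc.get("mode") != "managed" or rc.get("type") != "aws_db_instance":
            continue
        change = rc.get("change", {})
        if not {"create", "update"} & set(change.get("actions", [])):
            continue
        after = change.get("after") or {}
        if after.get("storage_encrypted") is not True:
            violations.append(f"{rc['address']}: storage_encrypted must be true")
        if after.get("publicly_accessible") is True:
            violations.append(f"{rc['address']}: publicly_accessible must be false")
    return violations


# Hand-written stand-in for a parsed plan file.
plan = {
    "resource_changes": [
        {
            "address": "aws_db_instance.main",
            "mode": "managed",
            "type": "aws_db_instance",
            "change": {
                "actions": ["create"],
                "after": {"storage_encrypted": False, "publicly_accessible": True},
            },
        }
    ]
}
violations = find_violations(plan)
for line in violations:
    print(line)
```

In a pipeline, the plan dictionary would come from `json.load` on the plan output, and a non-empty result would fail the run before `apply`.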

### Guardrail Categories

```text
Infrastructure Guardrails:

1. Security Guardrails
   ├── Encryption required (at-rest, in-transit)
   ├── No public access by default
   ├── Required security groups
   ├── IAM role requirements
   └── Vulnerability scanning

2. Cost Guardrails
   ├── Instance type restrictions
   ├── Storage size limits
   ├── Required cost tags
   ├── Budget thresholds
   └── Approval for large resources

3. Compliance Guardrails
   ├── Allowed regions (data residency)
   ├── Required logging
   ├── Backup requirements
   ├── Retention policies
   └── Audit trail requirements

4. Operational Guardrails
   ├── Naming conventions
   ├── Required tags (owner, cost-center)
   ├── Resource quotas per team
   ├── Monitoring requirements
   └── Deletion protection

Guardrail Implementation:
┌─────────────────────────────────────────────────────────────┐
│                    Guardrail Timing                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Pre-Plan (fastest feedback):                               │
│  ├── Validate terraform files                               │
│  ├── Static analysis (tfsec, checkov)                      │
│  └── Module version checks                                  │
│                                                              │
│  Post-Plan (resource-aware):                                │
│  ├── OPA/Sentinel policy evaluation                        │
│  ├── Cost estimation                                        │
│  └── Blast radius assessment                                │
│                                                              │
│  Post-Apply (verification):                                 │
│  ├── Configuration validation                               │
│  ├── Security scanning                                      │
│  └── Compliance audit                                       │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```
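
Operational guardrails such as naming conventions and required tags are cheap to check before a plan is even generated. The following sketch assumes a `<app>-<environment>-<suffix>` naming convention and the two required tags listed above; both the pattern and the function name are illustrative choices, not an established standard.

```python
import re

# Assumed convention: <app>-<environment>-<suffix>, lowercase alphanumerics.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*-(dev|staging|prod)-[a-z0-9]+")
REQUIRED_TAGS = {"owner", "cost-center"}


def check_operational_guardrails(name: str, tags: dict) -> list:
    """Return human-readable problems; an empty list means the request passes."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append(f"name '{name}' does not match <app>-<environment>-<suffix>")
    missing = sorted(REQUIRED_TAGS - set(tags))
    if missing:
        problems.append("missing required tags: " + ", ".join(missing))
    return problems


print(check_operational_guardrails(
    "myapp-staging-db", {"owner": "payments", "cost-center": "cc-42"}))
print(check_operational_guardrails("MyApp_DB", {"owner": "payments"}))
```

Returning every problem at once, rather than failing on the first, supports the "clear error messages" and "fast feedback" goals of the pre-plan stage.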

## Environment Provisioning

### Environment Templates

```text
Environment Provisioning:

Environment Types:
┌─────────────────────────────────────────────────────────────┐
│ Development Environment                                      │
│ ├── Purpose: Individual developer testing                   │
│ ├── Lifetime: Hours to days                                 │
│ ├── Resources: Minimal (smallest instances)                 │
│ ├── Data: Synthetic or anonymized                           │
│ └── Approval: None (within quota)                           │
├─────────────────────────────────────────────────────────────┤
│ Staging Environment                                          │
│ ├── Purpose: Integration testing, QA                        │
│ ├── Lifetime: Persistent per service                        │
│ ├── Resources: Production-like (scaled down)                │
│ ├── Data: Sanitized production subset                       │
│ └── Approval: None (within quota)                           │
├─────────────────────────────────────────────────────────────┤
│ Production Environment                                       │
│ ├── Purpose: Live customer traffic                          │
│ ├── Lifetime: Permanent                                      │
│ ├── Resources: Full capacity                                │
│ ├── Data: Real customer data                                │
│ └── Approval: Required (security review)                    │
└─────────────────────────────────────────────────────────────┘

Environment Template:
# environment/main.tf
module "network" {
  source      = "../modules/vpc"
  environment = var.environment
  cidr_block  = var.network_cidr
}

module "kubernetes" {
  source      = "../modules/eks"
  environment = var.environment
  vpc_id      = module.network.vpc_id
  node_count  = var.environment == "prod" ? 5 : 2
}

module "database" {
  source         = "../modules/rds"
  environment    = var.environment
  vpc_id         = module.network.vpc_id
  instance_class = var.environment == "prod" ? "db.r5.xlarge" : "db.t3.medium"
  multi_az       = var.environment == "prod"
}

module "cache" {
  source      = "../modules/elasticache"
  environment = var.environment
  vpc_id      = module.network.vpc_id
  node_type   = var.environment == "prod" ? "cache.r5.large" : "cache.t3.micro"
}
```
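
An alternative to inline `var.environment == "prod" ? ... : ...` conditionals is to keep sizing in one lookup table and generate a `terraform.tfvars.json` per environment. The sketch below reuses the sizes from the template above; it assumes the template is refactored to declare `node_count`, `instance_class`, `multi_az`, and `node_type` as input variables, which the original does not do.

```python
import json

# Sizes copied from the environment template; the variable names are assumed.
NON_PROD_SIZING = {
    "node_count": 2,
    "instance_class": "db.t3.medium",
    "multi_az": False,
    "node_type": "cache.t3.micro",
}
PROD_SIZING = {
    "node_count": 5,
    "instance_class": "db.r5.xlarge",
    "multi_az": True,
    "node_type": "cache.r5.large",
}
KNOWN_ENVIRONMENTS = {"dev", "staging", "prod"}


def environment_tfvars(environment: str, network_cidr: str) -> dict:
    """Build the variable values for one environment."""
    if environment not in KNOWN_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    sizing = PROD_SIZING if environment == "prod" else NON_PROD_SIZING
    return {"environment": environment, "network_cidr": network_cidr, **sizing}


staging_vars = environment_tfvars("staging", "10.20.0.0/16")
print(json.dumps(staging_vars, indent=2))
```

A single table makes it easier to review what differs between environments than conditionals scattered across modules.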

### Ephemeral Environments

```text
Ephemeral/Preview Environments:

Use Cases:
├── PR preview environments
├── Feature branch testing
├── Demo environments
├── Load testing environments
└── Incident reproduction

Lifecycle:
┌─────────────────────────────────────────────────────────────┐
│                                                              │
│  PR Created ──► Environment Created ──► Tests Run           │
│       │              │                      │               │
│       │              ▼                      ▼               │
│       │         Preview URL            PR Updated           │
│       │         Posted to PR              │                 │
│       │                                   │                 │
│       ▼                                   ▼                 │
│  PR Merged ───────────────────────► Environment Destroyed   │
│                                                              │
│  Timeout: Auto-destroy after 7 days of inactivity          │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Implementation:
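# .github/workflows/preview.yml
name: Preview Environment

on:
  pull_request:
    types: [opened, synchronize, reopened]

permissions:
  contents: read
  pull-requests: write  # Needed to comment on the PR

jobs:
  deploy-preview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3

      # Backend and cloud credentials are assumed to be configured elsewhere.
      # Teardown (terraform destroy on the `closed` event) belongs in a separate job.
      - name: Create/Update Environment
        run: |
          terraform init -input=false
          terraform workspace select pr-${{ github.event.pull_request.number }} || \
          terraform workspace new pr-${{ github.event.pull_request.number }}
          terraform apply -auto-approve -input=false

      - name: Comment Preview URL
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: '🚀 Preview: https://pr-${{ github.event.pull_request.number }}.preview.company.com'
            })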
```
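
The seven-day inactivity timeout needs something to enforce it. A scheduled job could list preview workspaces, compare their last activity against the cutoff, and destroy the stale ones. The selection logic is sketched below; where the `last_activity` timestamps come from (workspace metadata, PR events, deploy logs) is left as an assumption, as is the `pr-` naming taken from the workflow above.

```python
from datetime import datetime, timedelta, timezone

MAX_INACTIVITY = timedelta(days=7)


def stale_workspaces(last_activity: dict, now: datetime) -> list:
    """Return preview workspaces (named pr-<number>) idle for more than 7 days."""
    return sorted(
        name
        for name, seen in last_activity.items()
        if name.startswith("pr-") and now - seen > MAX_INACTIVITY
    )


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
activity = {
    "pr-101": now - timedelta(days=8),   # idle past the cutoff
    "pr-102": now - timedelta(days=2),   # still active
    "staging": now - timedelta(days=30), # not a preview workspace, never reaped
}
stale = stale_workspaces(activity, now)
print(stale)
```

Filtering on the `pr-` prefix is a deliberate safety measure so that a reaper bug cannot touch persistent environments.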

## Technology Options

### Self-Service Platforms

```text
Platform Comparison:

1. Terraform Cloud/Enterprise
   ├── Native Terraform experience
   ├── Policy as Code (Sentinel)
   ├── Private module registry
   ├── Cost estimation
   └── Enterprise features (SSO, audit)

2. Pulumi
   ├── Real programming languages
   ├── Strong typing and IDE support
   ├── Policy as Code (CrossGuard)
   └── Automation API

3. Crossplane
   ├── Kubernetes-native
   ├── GitOps workflow
   ├── Composition for modules
   └── Multi-cloud abstraction

4. Backstage + Terraform
   ├── Unified developer portal
   ├── Software templates
   ├── Plugin ecosystem
   └── Service catalog integration

5. Port/Cortex/OpsLevel
   ├── Commercial developer portals
   ├── Quick to implement
   ├── Built-in integrations
   └── Self-service workflows

Selection Criteria:
┌────────────────────────────────────────────────────────────┐
│ Factor               │ Best Fit                            │
├──────────────────────┼─────────────────────────────────────┤
│ Existing Terraform   │ Terraform Cloud/Enterprise         │
│ Kubernetes-first     │ Crossplane                         │
│ Developer portal     │ Backstage or commercial            │
│ Programming language │ Pulumi                             │
│ Quick start          │ Commercial (Port, OpsLevel)        │
│ Maximum control      │ Build custom                       │
└────────────────────────────────────────────────────────────┘
```

## Cost Management

### Cost Controls

```text
Cost Management in Self-Service:

1. Cost Visibility
   ├── Estimated cost shown before provisioning
   ├── Cost tags automatically applied
   ├── Per-team/project dashboards
   └── Anomaly detection and alerts

2. Cost Guardrails
   ├── Instance type restrictions
   ├── Budget thresholds by team
   ├── Approval required above threshold
   └── Auto-shutdown of unused resources

3. Cost Optimization
   ├── Right-sizing recommendations
   ├── Reserved instance suggestions
   ├── Spot instances for non-production
   └── Scheduled scaling

Cost Estimation Flow:
┌─────────────────────────────────────────────────────────────┐
│ Request: PostgreSQL database for staging                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Cost Estimate:                                             │
│  ├── Compute (db.t3.medium):        $30/month              │
│  ├── Storage (100GB gp3):           $10/month              │
│  ├── Backup storage:                ~$5/month              │
│  └── Data transfer:                 ~$5/month              │
│                                     ─────────               │
│  Estimated Total:                   ~$50/month             │
│                                                              │
│  ✓ Within team budget ($500/month quota)                   │
│  ✓ No approval required                                     │
│                                                              │
│  [Proceed] [Modify] [Cancel]                                │
└─────────────────────────────────────────────────────────────┘
```
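
The estimate-then-decide step shown above reduces to summing line items and comparing against two limits. The sketch below uses the illustrative figures from the box; the function name, the `team_spend` input, and treating the $500 figure as both budget and approval threshold are assumptions for the example.

```python
# Line items from the example estimate, in USD per month.
LINE_ITEMS = {"compute": 30, "storage": 10, "backup": 5, "data_transfer": 5}


def evaluate_cost(line_items: dict, team_spend: float, team_budget: float,
                  approval_threshold: float = 500) -> dict:
    """Sum the estimate and decide whether it fits the budget or needs approval."""
    total = sum(line_items.values())
    return {
        "total": total,
        "within_budget": team_spend + total <= team_budget,
        "needs_approval": total > approval_threshold,
    }


decision = evaluate_cost(LINE_ITEMS, team_spend=200, team_budget=500)
print(decision)
```

Checking the budget against current spend plus the new estimate, rather than the estimate alone, keeps many small requests from silently exceeding the quota.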

## Best Practices

```text
Self-Service Infrastructure Best Practices:

1. Start Small, Expand Gradually
   ├── Begin with 2-3 common resources
   ├── Add based on demand
   ├── Iterate on feedback
   └── Don't try to cover everything on day one

2. Balance Autonomy and Governance
   ├── Guardrails, not gates
   ├── Automate approvals where safe
   ├── Clear escalation paths
   └── Trust but verify

3. Optimize for Developer Experience
   ├── Minimal required inputs
   ├── Sensible defaults
   ├── Clear error messages
   └── Fast feedback loops

4. Maintain Module Quality
   ├── Automated testing
   ├── Documentation requirements
   ├── Versioning strategy
   └── Deprecation process

5. Monitor and Improve
   ├── Track provisioning success rate
   ├── Measure time to provision
   ├── Gather user feedback
   └── Identify automation opportunities

6. Handle Edge Cases
   ├── What if provisioning fails?
   ├── How to handle orphaned resources?
   ├── How to adopt (import) existing resources?
   └── How to migrate between versions?
```
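
The two metrics under "Monitor and Improve" can be computed from a simple log of provisioning runs. This is a sketch under the assumption that each run is recorded with a `status` and a `duration_seconds` field; those field names and the use of the median are choices made for the example.

```python
from statistics import median


def provisioning_metrics(runs: list) -> dict:
    """Compute success rate and median time to provision for successful runs."""
    succeeded = [run for run in runs if run["status"] == "succeeded"]
    return {
        "success_rate": len(succeeded) / len(runs) if runs else 0.0,
        "median_seconds": (
            median(run["duration_seconds"] for run in succeeded) if succeeded else None
        ),
    }


runs = [
    {"status": "succeeded", "duration_seconds": 100},
    {"status": "succeeded", "duration_seconds": 300},
    {"status": "failed", "duration_seconds": 45},
    {"status": "succeeded", "duration_seconds": 200},
]
metrics = provisioning_metrics(runs)
print(metrics)
```

The median is used instead of the mean so that a single stuck run does not distort the typical time to provision.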

## Anti-Patterns

```text
Self-Service Anti-Patterns:

1. "Self-Service Everything"
   ❌ Every possible configuration option
   ✓ Curated set of approved patterns

2. "Security Theater"
   ❌ Manual approvals that don't add value
   ✓ Automated policy enforcement

3. "Configuration Explosion"
   ❌ 50 parameters per resource
   ✓ Sensible defaults with few overrides

4. "Ignore Cost"
   ❌ No visibility into provisioned cost
   ✓ Cost estimation and budgets

5. "Build vs Buy Wrong"
   ❌ Building everything from scratch
   ✓ Use existing tools where appropriate

6. "No Escape Hatch"
   ❌ Blocking legitimate exceptions
   ✓ Process for justified deviations
```

## Related Skills

- `internal-developer-platform` - Platform engineering overview
- `golden-paths` - Standardized workflows
- `container-orchestration` - Kubernetes infrastructure
- `serverless-patterns` - Serverless infrastructure