---
name: devops-infrastructure
description: "Use when provisioning infrastructure, building containers, configuring CI/CD, or deploying services - ensures all infrastructure is codified, versioned, and reviewable with repeatable deployment strategies and proper secrets management | インフラのプロビジョニング、コンテナのビルド、CI/CDの構成、サービスのデプロイ時に使用 - すべてのインフラがコード化、バージョン管理、レビュー可能であることを保証し、再現可能なデプロイ戦略と適切なシークレット管理を実現"
---
# DevOps Infrastructure
## Overview
Manual infrastructure changes are incidents waiting to happen. Unversioned configs are undocumented debt.
**Core principle:** EVERY piece of infrastructure is defined in code, stored in version control, and deployed through automation.
**Violating the letter of this process is violating the spirit of infrastructure engineering.**
## The Iron Law
```
NO MANUAL INFRASTRUCTURE CHANGES - EVERYTHING IS CODE, VERSIONED, AND REVIEWABLE
```
If you clicked through a console UI to create it, it doesn't exist yet. Write the code.
## When to Use
Use for ANY infrastructure work:
- Provisioning cloud resources
- Building container images
- Configuring CI/CD pipelines
- Setting up deployment strategies
- Managing secrets and credentials
- Configuring monitoring and alerting
- Setting up networking and DNS
**Use this ESPECIALLY when:**
- Under deadline pressure ("just create it in the console for now")
- A quick manual fix seems faster
- Someone says "we'll codify it later"
- You're setting up a "temporary" environment
- Debugging a production issue by hand
**Don't skip when:**
- It's "just one resource" (one resource becomes twenty)
- It's a dev environment (dev environments become templates)
- It's a one-time setup (nothing is one-time)
## The Five Phases
You MUST complete each phase before proceeding to the next.
### Phase 1: Define Infrastructure as Code
**BEFORE creating ANY resource:**
1. **Choose Your IaC Tool**
- Terraform: Multi-cloud, mature ecosystem, declarative
- Pulumi: General-purpose languages, strong typing
- CloudFormation: AWS-native, deep integration
- Pick one per project and stick with it
2. **Structure Your Code**
```
infrastructure/
├── modules/ # Reusable components
│ ├── networking/
│ ├── compute/
│ └── database/
├── environments/ # Per-environment configs
│ ├── dev/
│ ├── staging/
│ └── production/
├── variables.tf # Input definitions
└── outputs.tf # Exported values
```
3. **State Management**
- Remote state (S3 + DynamoDB, GCS, Terraform Cloud)
- State locking enabled - always
- Never commit state files to git
- One state per environment minimum
4. **Plan Before Apply**
- Always run plan/preview first
- Review every change in the plan
- Automate plan output in PRs
- Never apply without reviewing the plan
### Phase 2: Container Best Practices
**Every container image follows these rules:**
1. **Multi-Stage Builds**
```dockerfile
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build
FROM node:20-alpine AS runtime
WORKDIR /app
RUN addgroup -g 1001 appgroup && adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
USER appuser
EXPOSE 3000
CMD ["node", "dist/index.js"]
```
Small image, non-root user, only production artifacts
```dockerfile
FROM node:20
WORKDIR /app
COPY . .
RUN npm install
CMD ["npm", "start"]
```
Bloated image, root user, dev dependencies included, source code exposed
2. **Minimal Base Images**
- Alpine or distroless
- Pin exact versions (not `latest`)
- Rebuild regularly for security patches
3. **Security Scanning**
- Scan images in CI (Trivy, Snyk, Grype)
- Block deployment on critical/high CVEs
- No secrets in images - ever
- No secrets in build args - they leak in layer history
4. **Image Hygiene**
- `.dockerignore` for every project
- One process per container
- Health checks defined
- Graceful shutdown handling (SIGTERM)
### Phase 3: Kubernetes Deployment Patterns
**Match the workload to the right abstraction:**
1. **Deployments** - Stateless services
- Rolling updates by default
- Resource limits set (CPU and memory)
- Readiness and liveness probes configured
- Pod Disruption Budgets defined
2. **StatefulSets** - Databases, caches, queues
- Stable network identities
- Persistent volume claims
- Ordered startup/shutdown
- Don't use for stateless workloads
3. **CronJobs** - Scheduled tasks
- Concurrency policy set
- Failure history limits
- Deadline seconds configured
- Idempotent by design
4. **Resource Definitions**
```yaml
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
```
Explicit requests and limits
```yaml
# No resource limits defined - hope the node has enough
```
Unbounded resource usage, noisy neighbor problems
### Phase 4: CI/CD Pipeline Design
**Every pipeline follows this structure:**
1. **Pipeline Stages**
```
Lint → Test → Build → Scan → Deploy(staging) → Verify → Deploy(production)
```
- No stage can be skipped
- Failure at any stage stops the pipeline
- Production deploy requires explicit approval
2. **Deployment Strategies**
| Strategy | When | Risk | Rollback |
|----------|------|------|----------|
| **Rolling** | Default for most services | Medium | Automatic |
| **Blue-Green** | Zero-downtime critical services | Low | Instant switch |
| **Canary** | High-traffic, risk-sensitive | Lowest | Route back to stable |
- Rolling: Good default, gradual replacement
- Blue-Green: Full parallel environment, instant cutover
- Canary: Percentage-based traffic shifting, metrics-driven promotion
3. **Rollback Plan**
- Every deployment MUST have a rollback plan
- Automated rollback on health check failure
- Database migrations must be backward-compatible
- "We'll figure it out" is not a rollback plan
4. **Pipeline Security**
- Secrets injected at runtime, never in code
- Least-privilege service accounts
- Signed artifacts and images
- Audit trail for every deployment
### Phase 5: Secrets Management and Monitoring
**Secrets:**
1. **Never in Code**
- Not in source files
- Not in Dockerfiles
- Not in CI config files
- Not in environment variable definitions committed to git
- Use vault (HashiCorp Vault, AWS Secrets Manager, 1Password)
- Inject at runtime via environment or mounted files
2. **Rotation**
- Secrets have expiration dates
- Automated rotation where possible
- Revoke immediately when compromised
**Monitoring and Alerting:**
3. **The Four Golden Signals**
- Latency: How long requests take
- Traffic: How many requests per second
- Errors: Rate of failed requests
- Saturation: How full your resources are
4. **Alert Design**
- Alert on symptoms, not causes
- Every alert has a runbook
- No alert fatigue - if you ignore it, delete it
- Page for user-facing impact only
## Red Flags - STOP and Follow Process
If you catch yourself thinking:
- "Just create it in the console, we'll codify later"
- "This is a one-time setup"
- "I'll hardcode the secret for now"
- "We don't need monitoring yet"
- "Skip the staging deploy, push straight to prod"
- "The rollback plan is to fix forward"
- "It's just a small config change, no PR needed"
- "We'll add resource limits later"
- "Root user is fine for now"
**ALL of these mean: STOP. Follow the process.**
## Common Rationalizations
| Excuse | Reality |
|--------|---------|
| "Console is faster for one resource" | One resource becomes twenty. Codify from the start. |
| "We'll codify it later" | You won't. "Later" means "never" in infrastructure. |
| "It's just a dev environment" | Dev environments are production templates. Treat them the same. |
| "Hardcode the secret for now" | Secrets in code get committed, pushed, leaked. Use a vault. |
| "We don't need monitoring yet" | You need monitoring BEFORE the first incident, not after. |
| "Skip staging, it works on my machine" | Your machine is not production. Deploy to staging first. |
| "Fix forward is our rollback" | Fix forward under pressure creates new incidents. Have a real rollback. |
| "Resource limits slow us down" | Unbounded containers slow everyone down when they consume the node. |
| "Manual change, just this once" | Snowflake servers start with "just this once." |
| "One-time setup doesn't need code" | Nothing is one-time. You'll rebuild, migrate, or recover. |
## Anti-Patterns
| Anti-Pattern | Consequence | Correct Approach |
|-------------|-------------|-----------------|
| **Snowflake servers** | Unreproducible, undocumented, irreplaceable | Everything in IaC, immutable infrastructure |
| **Secrets in code** | Credential leaks, security incidents | Vault/env injection, runtime secrets |
| **No rollback plan** | Extended outages, panic-driven fixes | Automated rollback, backward-compatible migrations |
| **Deploy without approval** | Unreviewed changes in production | PR-based deployments, required approvals |
| **No resource limits** | Noisy neighbors, node exhaustion, cascading failures | Explicit requests and limits on every workload |
| **`latest` tag** | Unreproducible builds, surprise breaking changes | Pin exact versions, rebuild intentionally |
## Quick Reference
| Phase | Key Activities | Success Criteria |
|-------|---------------|------------------|
| **1. IaC** | Define resources in code, remote state, plan before apply | All infrastructure in version control |
| **2. Containers** | Multi-stage builds, minimal images, security scanning | Small, secure, non-root images |
| **3. Kubernetes** | Right abstraction, resource limits, probes | Workloads are resilient and bounded |
| **4. CI/CD** | Pipeline stages, deployment strategy, rollback plan | Automated, gated, reversible deployments |
| **5. Secrets/Monitoring** | Vault injection, four golden signals, alert runbooks | No secrets in code, actionable alerts |
## Verification Checklist
Before marking infrastructure work complete:
- [ ] All resources defined in IaC (no console-created resources)
- [ ] State stored remotely with locking
- [ ] Container images are multi-stage, minimal, non-root
- [ ] Images scanned for vulnerabilities in CI
- [ ] No secrets in code, Dockerfiles, or CI configs
- [ ] Secrets injected from vault/secrets manager at runtime
- [ ] Resource limits set on all workloads
- [ ] Health checks and probes configured
- [ ] Deployment strategy chosen with rollback plan documented
- [ ] Production deploy requires approval
- [ ] Monitoring covers four golden signals
- [ ] Every alert has a runbook
Can't check all boxes? You're not done.
## Integration with Other Skills
**This skill requires using:**
- **test-driven-development** - REQUIRED for testing IaC changes (Terratest, infrastructure integration tests)
**Complementary skills:**
- **systematic-debugging** - Use when deployments fail or infrastructure behaves unexpectedly
- **documentation-generation** - Generate runbooks, architecture diagrams, and operational docs
## Final Rule
```
If it's not in code, it doesn't exist.
If it's not in version control, it's not real.
If it can't be reviewed, it can't be trusted.
```
No exceptions without your human partner's explicit approval.