---
name: architecture-review
description: |
  Architecture evaluation criteria and technology standards for the homelab.
  Preloaded into the designer agent to ground design decisions in established
  patterns and principles.

  Use when: (1) Evaluating a proposed technology addition, (2) Reviewing architecture decisions,
  (3) Assessing stack fit for a new component, (4) Comparing implementation approaches.

  Triggers: "architecture review", "evaluate technology", "stack fit", "should we use",
  "technology comparison", "design review", "architecture decision"
user-invocable: false
---

# Architecture Evaluation Framework

## Current Technology Stack

| Layer | Technology | Purpose |
|-------|-----------|---------|
| **OS** | Talos Linux | Immutable, API-driven Kubernetes OS |
| **GitOps** | Flux + ResourceSets | Declarative cluster state reconciliation |
| **CNI/Network** | Cilium | eBPF networking, network policies, Hubble observability |
| **Storage** | Longhorn | Distributed block storage with S3 backup |
| **Object Storage** | Garage | S3-compatible distributed object storage |
| **Database** | CNPG (CloudNativePG) | PostgreSQL operator with HA and backups |
| **Cache/KV** | Dragonfly | Redis-compatible in-memory store |
| **Monitoring** | kube-prometheus-stack | Prometheus + Grafana + Alertmanager |
| **Logging** | Alloy → Loki | Log collection pipeline |
| **Certificates** | cert-manager | Automated TLS certificate management |
| **Secrets** | ESO + AWS SSM | External Secrets Operator with Parameter Store |
| **Upgrades** | Tuppr | Declarative Talos/Kubernetes/Cilium upgrades |
| **Infrastructure** | Terragrunt + OpenTofu | Infrastructure as Code for bare-metal provisioning |
| **CI/CD** | GitHub Actions + OCI | Artifact-based promotion pipeline |

## Evaluation Criteria

When evaluating any proposed technology addition or architecture change, assess against these criteria:

### 1. Principle Alignment

Score the proposal against each core principle (Strong/Weak/Neutral):
- **Enterprise at Home**: Does it reflect production-grade patterns?
- **Everything as Code**: Can it be fully represented in git?
- **Automation is Key**: Does it reduce or increase manual toil?
- **Learning First**: Does it teach valuable enterprise skills?
- **DRY and Code Reuse**: Does it leverage existing patterns or create duplication?
- **Continuous Improvement**: Does it make the system more maintainable?

### 2. Stack Fit

- Does this overlap with existing tools? (e.g., adding Redis when Dragonfly exists)
- Does it integrate with the GitOps workflow? (Must be Flux-deployable)
- Does it work on bare-metal? (No cloud-only services)
- Does it support the multi-cluster model? (dev → integration → live)

### 3. Operational Cost

- How is it monitored? (Must integrate with kube-prometheus-stack)
- How is it backed up? (Must have a recovery story)
- How does it handle upgrades? (Must be declarative, ideally via Renovate)
- What's the failure blast radius? (Isolated > cluster-wide)

### 4. Complexity Budget

- Is the complexity justified by the learning value?
- Could a simpler existing tool solve the same problem?
- What's the maintenance burden over 12 months?

### 5. Alternative Analysis

- What existing stack components could solve this? (Always check first)
- What are the top 2-3 alternatives in the ecosystem?
- What do other production homelabs use? (kubesearch research)

### 6. Failure Modes

- What happens when this component is unavailable?
- How does it interact with network policies? (Default deny)
- What's the recovery procedure? (Must be documented in a runbook)
- Can it self-heal? (Strong preference for self-healing)

## Common Design Patterns

### New Application
1. HelmRelease via ResourceSet (flux-gitops pattern)
2. Namespace with network-policy profile label
3. ExternalSecret for credentials
4. ServiceMonitor + PrometheusRule for observability
5. GarageBucketClaim if S3 storage needed
6. CNPG Cluster if database needed

### New Infrastructure Component
1. OpenTofu module in `infrastructure/modules/`
2. Unit in appropriate stack under `infrastructure/units/`
3. Test coverage in `.tftest.hcl` files
4. Version pinned in `versions.env` if applicable

### New Secret
1. Store in AWS SSM Parameter Store
2. Reference via ExternalSecret CR
3. Never commit to git, not even encrypted

### New Storage
1. Longhorn PVC for block storage (default)
2. GarageBucketClaim for object storage (S3-compatible)
3. Never use hostPath or emptyDir for persistent data

### New Database
1. CNPG Cluster CR for PostgreSQL
2. Automated backups to Garage S3
3. Connection pooling via PgBouncer (CNPG-managed)

### New Network Exposure
1. HTTPRoute for HTTP/HTTPS traffic (Gateway API)
2. Appropriate network-policy profile label
3. cert-manager Certificate for TLS
4. Internal gateway for internal-only services

## Anti-Patterns to Challenge

| Anti-Pattern | Why It's Wrong | Correct Approach |
|-------------|---------------|------------------|
| "Just run a container" without monitoring | Invisible failures, no alerting | ServiceMonitor + PrometheusRule required |
| Adding a new tool when existing ones suffice | Stack bloat, maintenance burden | Evaluate existing stack first |
| Skipping observability "for now" | Technical debt that never gets paid | Monitoring is day-1, not day-2 |
| Manual operational steps | Drift, inconsistency, bus factor | Everything declarative via GitOps |
| Cloud-only services | Vendor lock-in, can't run on bare-metal | Self-hosted alternatives preferred |
| Single-instance without HA story | Single point of failure | At minimum, document recovery procedure |
| Storing state outside git | Shadow configuration, drift | Git is the source of truth |