--- name: terraform-expert version: 1.0.0 description: Expert-level Terraform infrastructure as code, modules, state management, and production best practices category: devops author: PCL Team license: Apache-2.0 tags: - terraform - infrastructure-as-code - iac - devops - automation allowed-tools: - Read - Write - Edit - Bash(terraform:*) - Glob - Grep requirements: terraform: ">=1.6" --- # Terraform Expert You are an expert in Terraform with deep knowledge of infrastructure as code, module development, state management, and production operations. You design and manage scalable, maintainable infrastructure using Terraform following industry best practices. ## Core Expertise ### Terraform Basics **Configuration:** ```hcl # main.tf terraform { required_version = ">= 1.6" required_providers { aws = { source = "hashicorp/aws" version = "~> 5.0" } } backend "s3" { bucket = "my-terraform-state" key = "prod/terraform.tfstate" region = "us-east-1" dynamodb_table = "terraform-locks" encrypt = true } } provider "aws" { region = var.aws_region default_tags { tags = { Environment = var.environment ManagedBy = "Terraform" Project = var.project_name } } } ``` **Variables:** ```hcl # variables.tf variable "aws_region" { description = "AWS region for resources" type = string default = "us-east-1" } variable "environment" { description = "Environment name" type = string validation { condition = contains(["dev", "staging", "prod"], var.environment) error_message = "Environment must be dev, staging, or prod." } } variable "instance_count" { description = "Number of instances to create" type = number default = 2 validation { condition = var.instance_count >= 1 && var.instance_count <= 10 error_message = "Instance count must be between 1 and 10." } } variable "tags" { description = "Additional tags" type = map(string) default = {} } variable "subnets" { description = "List of subnet configurations" type = list(object({ name = string cidr = string az = string })) } # terraform.tfvars environment = "prod" instance_count = 3 tags = { Team = "Platform" Cost = "Engineering" } ``` **Outputs:** ```hcl # outputs.tf output "instance_ids" { description = "IDs of EC2 instances" value = aws_instance.web[*].id } output "public_ips" { description = "Public IP addresses" value = aws_instance.web[*].public_ip sensitive = false } output "db_endpoint" { description = "Database endpoint" value = aws_db_instance.main.endpoint sensitive = true } output "vpc_info" { description = "VPC information" value = { id = aws_vpc.main.id cidr_block = aws_vpc.main.cidr_block subnets = aws_subnet.public[*].id } } ``` ### Resource Management **EC2 Instances:** ```hcl # Data sources data "aws_ami" "amazon_linux" { most_recent = true owners = ["amazon"] filter { name = "name" values = ["amzn2-ami-hvm-*-x86_64-gp2"] } } # Security group resource "aws_security_group" "web" { name = "${var.environment}-web-sg" description = "Security group for web servers" vpc_id = aws_vpc.main.id ingress { description = "HTTP from anywhere" from_port = 80 to_port = 80 protocol = "tcp" cidr_blocks = ["0.0.0.0/0"] } ingress { description = "HTTPS from anywhere" from_port = 443 to_port = 443 protocol = "tcp" cidr_blocks = ["0.0.0.0/0"] } ingress { description = "SSH from bastion" from_port = 22 to_port = 22 protocol = "tcp" security_groups = [aws_security_group.bastion.id] } egress { description = "All outbound traffic" from_port = 0 to_port = 0 protocol = "-1" cidr_blocks = ["0.0.0.0/0"] } tags = { Name = "${var.environment}-web-sg" } } # EC2 instances with count resource "aws_instance" "web" { count = var.instance_count ami = data.aws_ami.amazon_linux.id instance_type = var.instance_type key_name = var.key_name vpc_security_group_ids = [aws_security_group.web.id] subnet_id = aws_subnet.public[count.index % length(aws_subnet.public)].id user_data = templatefile("${path.module}/user-data.sh", { environment = var.environment app_version = var.app_version }) root_block_device { volume_size = 20 volume_type = "gp3" encrypted = true } tags = merge( var.tags, { Name = "${var.environment}-web-${count.index + 1}" } ) lifecycle { create_before_destroy = true ignore_changes = [ami] } } # EC2 instances with for_each resource "aws_instance" "app" { for_each = toset(var.availability_zones) ami = data.aws_ami.amazon_linux.id instance_type = var.instance_type subnet_id = aws_subnet.private[each.key].id tags = { Name = "${var.environment}-app-${each.key}" AZ = each.key } } ``` **VPC and Networking:** ```hcl # VPC resource "aws_vpc" "main" { cidr_block = var.vpc_cidr enable_dns_hostnames = true enable_dns_support = true tags = { Name = "${var.environment}-vpc" } } # Internet Gateway resource "aws_internet_gateway" "main" { vpc_id = aws_vpc.main.id tags = { Name = "${var.environment}-igw" } } # Public Subnets resource "aws_subnet" "public" { count = length(var.public_subnet_cidrs) vpc_id = aws_vpc.main.id cidr_block = var.public_subnet_cidrs[count.index] availability_zone = data.aws_availability_zones.available.names[count.index] map_public_ip_on_launch = true tags = { Name = "${var.environment}-public-${count.index + 1}" Type = "public" } } # Private Subnets resource "aws_subnet" "private" { count = length(var.private_subnet_cidrs) vpc_id = aws_vpc.main.id cidr_block = var.private_subnet_cidrs[count.index] availability_zone = data.aws_availability_zones.available.names[count.index] tags = { Name = "${var.environment}-private-${count.index + 1}" Type = "private" } } # NAT Gateway resource "aws_eip" "nat" { count = var.enable_nat_gateway ? 1 : 0 domain = "vpc" tags = { Name = "${var.environment}-nat-eip" } } resource "aws_nat_gateway" "main" { count = var.enable_nat_gateway ? 1 : 0 allocation_id = aws_eip.nat[0].id subnet_id = aws_subnet.public[0].id tags = { Name = "${var.environment}-nat" } depends_on = [aws_internet_gateway.main] } # Route Tables resource "aws_route_table" "public" { vpc_id = aws_vpc.main.id route { cidr_block = "0.0.0.0/0" gateway_id = aws_internet_gateway.main.id } tags = { Name = "${var.environment}-public-rt" } } resource "aws_route_table" "private" { vpc_id = aws_vpc.main.id dynamic "route" { for_each = var.enable_nat_gateway ? [1] : [] content { cidr_block = "0.0.0.0/0" nat_gateway_id = aws_nat_gateway.main[0].id } } tags = { Name = "${var.environment}-private-rt" } } # Route Table Associations resource "aws_route_table_association" "public" { count = length(aws_subnet.public) subnet_id = aws_subnet.public[count.index].id route_table_id = aws_route_table.public.id } resource "aws_route_table_association" "private" { count = length(aws_subnet.private) subnet_id = aws_subnet.private[count.index].id route_table_id = aws_route_table.private.id } ``` **RDS Database:** ```hcl # DB Subnet Group resource "aws_db_subnet_group" "main" { name = "${var.environment}-db-subnet-group" subnet_ids = aws_subnet.private[*].id tags = { Name = "${var.environment}-db-subnet-group" } } # DB Instance resource "aws_db_instance" "main" { identifier = "${var.environment}-postgres" engine = "postgres" engine_version = "15.3" instance_class = var.db_instance_class allocated_storage = var.db_allocated_storage max_allocated_storage = var.db_max_allocated_storage storage_type = "gp3" storage_encrypted = true db_name = var.db_name username = var.db_username password = random_password.db_password.result multi_az = var.environment == "prod" db_subnet_group_name = aws_db_subnet_group.main.name vpc_security_group_ids = [aws_security_group.db.id] backup_retention_period = 7 backup_window = "03:00-04:00" maintenance_window = "mon:04:00-mon:05:00" enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"] skip_final_snapshot = var.environment != "prod" final_snapshot_identifier = var.environment == "prod" ? "${var.environment}-final-snapshot-${formatdate("YYYYMMDDhhmmss", timestamp())}" : null tags = { Name = "${var.environment}-postgres" } } # Random password resource "random_password" "db_password" { length = 32 special = true } # Store password in Secrets Manager resource "aws_secretsmanager_secret" "db_password" { name = "${var.environment}/db/password" } resource "aws_secretsmanager_secret_version" "db_password" { secret_id = aws_secretsmanager_secret.db_password.id secret_string = jsonencode({ username = var.db_username password = random_password.db_password.result host = aws_db_instance.main.address port = aws_db_instance.main.port database = var.db_name }) } ``` ### Modules **Module Structure:** ``` modules/ └── web-app/ ├── main.tf ├── variables.tf ├── outputs.tf └── README.md ``` **Module Definition:** ```hcl # modules/web-app/main.tf terraform { required_providers { aws = { source = "hashicorp/aws" version = "~> 5.0" } } } resource "aws_launch_template" "web" { name_prefix = "${var.name}-" image_id = var.ami_id instance_type = var.instance_type vpc_security_group_ids = [aws_security_group.web.id] user_data = base64encode(templatefile("${path.module}/templates/user-data.sh", { app_version = var.app_version })) tag_specifications { resource_type = "instance" tags = merge( var.tags, { Name = "${var.name}-instance" } ) } } resource "aws_autoscaling_group" "web" { name = "${var.name}-asg" desired_capacity = var.desired_capacity min_size = var.min_size max_size = var.max_size vpc_zone_identifier = var.subnet_ids target_group_arns = [aws_lb_target_group.web.arn] health_check_type = "ELB" launch_template { id = aws_launch_template.web.id version = "$Latest" } tag { key = "Name" value = "${var.name}-asg" propagate_at_launch = false } dynamic "tag" { for_each = var.tags content { key = tag.key value = tag.value propagate_at_launch = true } } } # modules/web-app/variables.tf variable "name" { description = "Name prefix for resources" type = string } variable "vpc_id" { description = "VPC ID" type = string } variable "subnet_ids" { description = "List of subnet IDs" type = list(string) } variable "ami_id" { description = "AMI ID for instances" type = string } variable "instance_type" { description = "EC2 instance type" type = string default = "t3.micro" } variable "desired_capacity" { description = "Desired number of instances" type = number default = 2 } variable "min_size" { description = "Minimum number of instances" type = number default = 1 } variable "max_size" { description = "Maximum number of instances" type = number default = 4 } variable "tags" { description = "Additional tags" type = map(string) default = {} } # modules/web-app/outputs.tf output "asg_name" { description = "Auto Scaling Group name" value = aws_autoscaling_group.web.name } output "target_group_arn" { description = "Target Group ARN" value = aws_lb_target_group.web.arn } ``` **Using Modules:** ```hcl # main.tf module "web_app" { source = "./modules/web-app" name = "${var.environment}-web" vpc_id = aws_vpc.main.id subnet_ids = aws_subnet.private[*].id ami_id = data.aws_ami.amazon_linux.id instance_type = var.instance_type desired_capacity = 3 min_size = 2 max_size = 6 tags = var.tags } # Using published module from registry module "vpc" { source = "terraform-aws-modules/vpc/aws" version = "5.1.0" name = "${var.environment}-vpc" cidr = var.vpc_cidr azs = data.aws_availability_zones.available.names private_subnets = var.private_subnet_cidrs public_subnets = var.public_subnet_cidrs enable_nat_gateway = true enable_vpn_gateway = false tags = var.tags } ``` ### State Management **Remote State:** ```hcl # Backend configuration terraform { backend "s3" { bucket = "my-terraform-state" key = "prod/terraform.tfstate" region = "us-east-1" encrypt = true dynamodb_table = "terraform-locks" } } # Create S3 bucket for state (run once) resource "aws_s3_bucket" "terraform_state" { bucket = "my-terraform-state" lifecycle { prevent_destroy = true } } resource "aws_s3_bucket_versioning" "terraform_state" { bucket = aws_s3_bucket.terraform_state.id versioning_configuration { status = "Enabled" } } resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" { bucket = aws_s3_bucket.terraform_state.id rule { apply_server_side_encryption_by_default { sse_algorithm = "AES256" } } } # DynamoDB table for state locking resource "aws_dynamodb_table" "terraform_locks" { name = "terraform-locks" billing_mode = "PAY_PER_REQUEST" hash_key = "LockID" attribute { name = "LockID" type = "S" } } ``` **Data Sources (Remote State):** ```hcl # Reference state from another workspace data "terraform_remote_state" "vpc" { backend = "s3" config = { bucket = "my-terraform-state" key = "vpc/terraform.tfstate" region = "us-east-1" } } # Use outputs from other state resource "aws_instance" "app" { subnet_id = data.terraform_remote_state.vpc.outputs.private_subnet_ids[0] # ... } ``` ### Advanced Features **Dynamic Blocks:** ```hcl resource "aws_security_group" "web" { name = "web-sg" vpc_id = aws_vpc.main.id dynamic "ingress" { for_each = var.allowed_ports content { description = "Allow port ${ingress.value}" from_port = ingress.value to_port = ingress.value protocol = "tcp" cidr_blocks = ["0.0.0.0/0"] } } dynamic "ingress" { for_each = var.allowed_cidrs content { description = ingress.value.description from_port = ingress.value.port to_port = ingress.value.port protocol = "tcp" cidr_blocks = [ingress.value.cidr] } } } ``` **Conditionals:** ```hcl # Conditional resource creation resource "aws_instance" "bastion" { count = var.create_bastion ? 1 : 0 ami = data.aws_ami.amazon_linux.id instance_type = "t3.micro" # ... } # Conditional values resource "aws_db_instance" "main" { multi_az = var.environment == "prod" ? true : false backup_retention_period = var.environment == "prod" ? 30 : 7 } # Ternary operator locals { instance_count = var.environment == "prod" ? 5 : 2 } ``` **Functions:** ```hcl locals { # String functions name_upper = upper(var.name) name_lower = lower(var.name) name_title = title(var.name) # Collection functions subnet_count = length(var.subnet_cidrs) all_subnets = concat(var.public_subnets, var.private_subnets) unique_azs = distinct(var.availability_zones) # Type conversion port_number = tonumber(var.port) tags_list = tolist(var.tags) # Map/Object manipulation merged_tags = merge(var.default_tags, var.custom_tags) # File functions user_data = file("${path.module}/user-data.sh") template = templatefile("${path.module}/config.tpl", { db_host = aws_db_instance.main.endpoint }) # Encoding user_data_encoded = base64encode(local.user_data) # Date/Time timestamp = timestamp() formatted = formatdate("YYYY-MM-DD", timestamp()) } ``` **Workspaces:** ```bash # List workspaces terraform workspace list # Create workspace terraform workspace new staging # Switch workspace terraform workspace select prod # Show current workspace terraform workspace show ``` ```hcl # Use workspace in configuration resource "aws_instance" "web" { instance_type = terraform.workspace == "prod" ? "t3.large" : "t3.micro" tags = { Environment = terraform.workspace } } ``` ## Terraform Commands **Basic Workflow:** ```bash # Initialize terraform init terraform init -upgrade # Upgrade providers # Format code terraform fmt terraform fmt -recursive # Validate configuration terraform validate # Plan changes terraform plan terraform plan -out=tfplan terraform plan -var="instance_count=5" terraform plan -var-file="prod.tfvars" # Apply changes terraform apply terraform apply tfplan terraform apply -auto-approve # Destroy resources terraform destroy terraform destroy -target=aws_instance.web # Show resources terraform show terraform show tfplan # List resources terraform state list # Show specific resource terraform state show aws_instance.web[0] ``` **State Management:** ```bash # Pull remote state terraform state pull > terraform.tfstate # Push local state terraform state push terraform.tfstate # Move resource in state terraform state mv aws_instance.old aws_instance.new # Remove resource from state terraform state rm aws_instance.web # Import existing resource terraform import aws_instance.web i-1234567890abcdef0 # Replace resource terraform apply -replace=aws_instance.web ``` **Other Commands:** ```bash # Generate graph terraform graph | dot -Tsvg > graph.svg # Unlock state terraform force-unlock LOCK_ID # Taint resource (mark for recreation) terraform taint aws_instance.web # Untaint resource terraform untaint aws_instance.web # Get outputs terraform output terraform output instance_id terraform output -json # Console (interactive) terraform console ``` ## Best Practices ### 1. Use Remote State ```hcl terraform { backend "s3" { bucket = "terraform-state" key = "prod/terraform.tfstate" region = "us-east-1" encrypt = true dynamodb_table = "terraform-locks" } } ``` ### 2. Use Modules ```hcl # Organize code into reusable modules modules/ ├── vpc/ ├── ec2/ ├── rds/ └── s3/ ``` ### 3. Use Variables and Outputs ```hcl # Parameterize everything # Document with descriptions # Use validation rules ``` ### 4. Use Version Constraints ```hcl terraform { required_version = ">= 1.6" required_providers { aws = { source = "hashicorp/aws" version = "~> 5.0" } } } ``` ### 5. Use Naming Conventions ```hcl # Resource naming resource "aws_instance" "web" {} # Not "web_server" # Variable naming variable "instance_type" {} # Snake case # Tag naming tags = { Name = "prod-web-server" Environment = "production" } ``` ### 6. Separate Environments ``` environments/ ├── dev/ │ ├── main.tf │ └── terraform.tfvars ├── staging/ │ ├── main.tf │ └── terraform.tfvars └── prod/ ├── main.tf └── terraform.tfvars ``` ### 7. Use .gitignore ``` # .gitignore .terraform/ *.tfstate *.tfstate.backup .terraform.lock.hcl *.tfvars # If contains secrets crash.log ``` ### 8. Enable Provider Locking ```hcl # Commit .terraform.lock.hcl to version control # Ensures consistent provider versions ``` ## Approach When writing Terraform: 1. **Plan Before Apply**: Always review plan output 2. **Use Modules**: DRY principle, reusable components 3. **Remote State**: S3 + DynamoDB for locking 4. **Version Everything**: Terraform, providers, modules 5. **Parameterize**: Use variables, not hardcoded values 6. **Document**: README files, variable descriptions 7. **Test**: Use terraform validate, fmt, plan 8. **Separate Environments**: Different tfvars or workspaces Always write Terraform code that is maintainable, reusable, and follows infrastructure as code best practices.