--- name: castai-core-workflow-a description: 'Configure CAST AI autoscaler policies and node templates for cost optimization. Use when enabling Phase 2 automation, setting spot instance policies, or configuring node downscaler and evictor settings. Trigger with phrases like "cast ai autoscaler", "cast ai policies", "cast ai spot instances", "cast ai node optimization". ' allowed-tools: Read, Write, Edit, Bash(curl:*), Bash(kubectl:*), Grep version: 1.0.0 license: MIT author: Jeremy Longshore tags: - saas - kubernetes - cost-optimization - castai compatibility: Designed for Claude Code --- # CAST AI Core Workflow: Autoscaler & Policies ## Overview Primary workflow for CAST AI: configure autoscaler policies to optimize cluster costs. Covers enabling spot instances, configuring the node downscaler and evictor, setting cluster CPU/memory limits, and creating node templates for workload-specific requirements. ## Prerequisites - Completed `castai-install-auth` with Phase 2 (cluster controller + evictor) - `CASTAI_API_KEY` and `CASTAI_CLUSTER_ID` set - Cluster in "ready" status ## Instructions ### Step 1: Read Current Policies ```bash curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \ "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \ | jq . ``` ### Step 2: Enable Cost-Optimized Autoscaling ```bash curl -X PUT -H "X-API-Key: ${CASTAI_API_KEY}" \ -H "Content-Type: application/json" \ "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \ -d '{ "enabled": true, "unschedulablePods": { "enabled": true, "headroom": { "cpuPercentage": 10, "memoryPercentage": 10, "enabled": true } }, "nodeDownscaler": { "enabled": true, "emptyNodes": { "enabled": true, "delaySeconds": 180 } }, "spotInstances": { "enabled": true, "clouds": ["aws"], "spotDiversityEnabled": true, "spotDiversityPriceIncreaseLimitPercent": 20 }, "clusterLimits": { "enabled": true, "cpu": { "minCores": 4, "maxCores": 100 } } }' ``` ### Step 3: Configure Node Templates via Terraform ```hcl resource "castai_node_template" "spot_workers" { cluster_id = castai_eks_cluster.this.id name = "spot-workers" is_default = false is_enabled = true constraints { min_cpu = 2 max_cpu = 16 min_memory = 4096 max_memory = 65536 spot = true use_spot_fallbacks = true fallback_restore_rate_seconds = 600 instance_families { include = ["m5", "m6i", "c5", "c6i", "r5", "r6i"] } architectures = ["amd64"] } custom_labels = { "workload-type" = "batch" } } resource "castai_node_template" "gpu_ondemand" { cluster_id = castai_eks_cluster.this.id name = "gpu-ondemand" is_default = false is_enabled = true constraints { spot = false gpu_manufacturers = ["NVIDIA"] instance_families { include = ["p3", "p4d", "g4dn", "g5"] } } custom_labels = { "workload-type" = "gpu" } } ``` ### Step 4: Verify Autoscaler is Working ```bash # Check if the autoscaler is processing nodes curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \ "https://api.cast.ai/v1/kubernetes/external-clusters/${CASTAI_CLUSTER_ID}/nodes" \ | jq '[.items[] | {name, instanceType, lifecycle, castaiManaged: .castaiManaged}] | group_by(.lifecycle) | map({lifecycle: .[0].lifecycle, count: length})' # Expected: mix of spot and on-demand nodes ``` ## Error Handling | Error | Cause | Solution | |-------|-------|----------| | Policy update returns 400 | Invalid policy JSON | Validate with `jq` before sending | | Nodes not scaling | Policy not enabled | Verify `.enabled: true` in policy | | Spot instances not used | Provider not configured | Add cloud provider to `spotInstances.clouds` | | Evictor too aggressive | Low delay threshold | Increase `emptyNodes.delaySeconds` | | Cluster limit hit | `maxCores` too low | Increase `clusterLimits.cpu.maxCores` | ## Resources - [Autoscaler Policies](https://docs.cast.ai/docs/autoscaler-settings) - [Node Configuration](https://docs.cast.ai/docs/node-configuration) - [Terraform Node Templates](https://registry.terraform.io/providers/castai/castai/latest/docs/resources/node_template) ## Next Steps For workload-level autoscaling, see `castai-core-workflow-b`.