--- name: helm-charts-audit description: Audits Helm charts for anti-patterns, security issues, and best practice violations. Use when asked to audit, review, or check Helm chart quality. Generates a comprehensive report under reports/YYYY-MM-DD/helm-charts-audit.md. (project) --- # Purpose Enforce Helm chart quality and security standards across the `helm-charts/` directory through automated checks. **What it checks (13 checks):** 1. Image Tags (no latest, mutable tags) - HIGH 2. Security Context (runAsNonRoot, no privileged) - HIGH 3. Resource Limits (requests, limits, memory) - HIGH 4. RBAC Wildcards (no * permissions) - HIGH 5. Health Probes (liveness, readiness) - HIGH 6. Helm Lint (official helm validation) - HIGH 7. Chart Metadata (apiVersion, version, maintainers) - MEDIUM 8. Chart Structure (README, NOTES.txt, _helpers.tpl) - MEDIUM 9. Dependencies (pinned versions, Chart.lock) - MEDIUM 10. Deprecated APIs (no v1beta1, use stable APIs) - MEDIUM 11. Argo Rollouts (strategy, analysis, steps) - MEDIUM 12. Ingress TLS (certificates, annotations) - MEDIUM 13. GPU Resources (nvidia.com/gpu, tolerations) - LOW # Running Checks **Full audit (all checks):** ```bash node .claude/skills/helm-charts-audit/scripts/run_all_checks.mjs ``` **Generate report (all checks + markdown report):** ```bash node .claude/skills/helm-charts-audit/scripts/generate_report.mjs ``` Report saved to: `reports/YYYY-MM-DD/helm-charts-audit.md` **Individual checks:** ```bash node .claude/skills/helm-charts-audit/scripts/check_image_tags.mjs node .claude/skills/helm-charts-audit/scripts/check_security_context.mjs node .claude/skills/helm-charts-audit/scripts/check_resource_limits.mjs node .claude/skills/helm-charts-audit/scripts/check_rbac_wildcards.mjs node .claude/skills/helm-charts-audit/scripts/check_health_probes.mjs node .claude/skills/helm-charts-audit/scripts/check_helm_lint.mjs node .claude/skills/helm-charts-audit/scripts/check_chart_metadata.mjs node .claude/skills/helm-charts-audit/scripts/check_chart_structure.mjs node .claude/skills/helm-charts-audit/scripts/check_dependencies.mjs node .claude/skills/helm-charts-audit/scripts/check_deprecated_apis.mjs node .claude/skills/helm-charts-audit/scripts/check_argo_rollouts.mjs node .claude/skills/helm-charts-audit/scripts/check_ingress_tls.mjs node .claude/skills/helm-charts-audit/scripts/check_gpu_resources.mjs ``` # Quality Rules ## 1. Image Tags (HIGH) **RULE**: Never use mutable tags. `latest` tag = unpredictable deployments + rollback failures. **Violations**: - `image: nginx:latest` - mutable, changes without notice - `image: nginx` - defaults to :latest - `tag: ""` - empty tag in values.yaml - `tag: head`, `tag: canary`, `tag: dev` - mutable branch tags **Fix**: Use immutable tags like `v1.2.3`, SHA digests `sha256:abc123`, or SemVer `1.21.0`. --- ## 2. Security Context (HIGH) **RULE**: Containers must run with minimal privileges. Privileged containers = cluster takeover risk. **Violations**: - `privileged: true` - full host access, container escape trivial - `runAsNonRoot: false` - runs as root user UID 0 - `runAsUser: 0` - explicitly root - `allowPrivilegeEscalation: true` - can gain more privileges - `hostNetwork: true` - shares host network namespace - `hostPID: true` - can see/kill host processes - `hostIPC: true` - can access host shared memory - `readOnlyRootFilesystem: false` - malware can write anywhere - `capabilities.add: [SYS_ADMIN]` - near-root level access - `capabilities.add: [ALL]` - equivalent to privileged **Fix**: Add proper securityContext with `runAsNonRoot: true`, `allowPrivilegeEscalation: false`, `readOnlyRootFilesystem: true`, `capabilities.drop: [ALL]`. --- ## 3. Resource Limits (HIGH) **RULE**: All containers must have resource requests and limits. No limits = node OOM + noisy neighbor issues. **Violations**: - `resources: {}` - empty resources block - Missing `requests.cpu` - scheduler can't make decisions - Missing `requests.memory` - OOM killer may terminate unexpectedly - Missing `limits.memory` - container can consume all node memory - `requests > limits` - invalid configuration **Fix**: Define `resources.requests.cpu`, `resources.requests.memory`, `resources.limits.memory`. Note: CPU limits often intentionally omitted for better performance. --- ## 4. RBAC Wildcards (HIGH) **RULE**: Follow least-privilege principle. Wildcard permissions = privilege escalation path. **Violations**: - `verbs: ["*"]` - grants all actions - `resources: ["*"]` - access to all resource types - `apiGroups: ["*"]` - access across all API groups - `roleRef.name: cluster-admin` - full cluster access - `verbs: [impersonate]` - can act as other users - `verbs: [escalate, bind]` - can grant additional privileges - Access to `secrets` resource - can read all secrets **Fix**: Use explicit verbs like `[get, list, watch]`, explicit resources like `[pods, services]`, avoid cluster-admin bindings. --- ## 5. Health Probes (HIGH) **RULE**: All workloads must have health probes. No probes = stuck containers not restarted + traffic to unready pods. **Violations**: - Deployment without `livenessProbe` - stuck containers won't restart - Deployment without `readinessProbe` - traffic sent to unready pods - `initialDelaySeconds: 0` - probes start immediately, false failures - `timeoutSeconds: 1` - too short, may cause false failures - `successThreshold > 1` on livenessProbe - should always be 1 - `failureThreshold > 10` - delays detecting actual failures **Fix**: Add livenessProbe and readinessProbe with reasonable `initialDelaySeconds` (10-30s), `periodSeconds` (10s), `timeoutSeconds` (5s). --- ## 6. Helm Lint (HIGH) **RULE**: Charts must pass official helm lint validation. Lint failures = deployment failures. **Violations**: - Template syntax errors - Missing required fields in Chart.yaml - Invalid YAML structure - Broken template references **Fix**: Run `helm lint ` and fix reported issues. --- ## 7. Chart Metadata (MEDIUM) **RULE**: Chart.yaml must have complete metadata. Missing metadata = maintenance nightmare. **Violations**: - `apiVersion: v1` - Helm 2 format, upgrade to v2 - Missing or invalid `version` - must be SemVer - Missing `appVersion` - hard to track what's deployed - Missing `description` - unclear what chart does - Missing `maintainers` - no ownership - `name` doesn't match directory name - confusing **Fix**: Use `apiVersion: v2`, SemVer version, add description and maintainers with email. --- ## 8. Chart Structure (MEDIUM) **RULE**: Follow standard Helm chart structure. Non-standard = user confusion + missing features. **Violations**: - Missing `README.md` - no documentation - Missing `templates/NOTES.txt` - no post-install instructions - Missing `templates/_helpers.tpl` - no template helpers - Missing `.helmignore` - unnecessary files in package - Missing `values.schema.json` - no values validation - Empty `templates/` directory **Fix**: Create missing files following Helm chart best practices. --- ## 9. Dependencies (MEDIUM) **RULE**: Pin dependency versions. Floating versions = non-reproducible builds. **Violations**: - No `version` on dependency - unpinned - `version: "*"` or `version: "^1.0"` - floating version - Missing `Chart.lock` - dependency versions not locked - `repository: file://` - local reference, breaks when published - `repository: http://` - insecure, use HTTPS - Deprecated repository URLs (charts.helm.sh/stable) **Fix**: Pin exact versions, run `helm dependency update` to generate Chart.lock. --- ## 10. Deprecated APIs (MEDIUM) **RULE**: Use stable Kubernetes APIs. Deprecated APIs = upgrade failures. **Violations**: - `extensions/v1beta1` - removed in K8s 1.22 - `apps/v1beta1`, `apps/v1beta2` - removed in K8s 1.16 - `networking.k8s.io/v1beta1` Ingress - removed in K8s 1.22 - `batch/v1beta1` CronJob - removed in K8s 1.25 - `policy/v1beta1` PodSecurityPolicy - removed in K8s 1.25 **Fix**: Update to stable APIs: `apps/v1`, `networking.k8s.io/v1`, `batch/v1`. Run `kubectl convert` if needed. --- ## 11. Argo Rollouts (MEDIUM) **RULE**: Rollouts must have valid strategy configuration. Invalid config = failed deployments. **Violations**: - Rollout without `strategy` - no deployment strategy - Canary without `steps` - no gradual rollout - Canary without `analysis` - no automated validation - BlueGreen without `activeService` - no active service defined - BlueGreen without `previewService` - can't preview before promotion - Missing `revisionHistoryLimit` - old ReplicaSets accumulate - Missing `progressDeadlineSeconds` - stuck rollouts don't timeout **Fix**: Configure proper canary steps with analysis, or blueGreen with activeService/previewService. --- ## 12. Ingress TLS (MEDIUM) **RULE**: Ingress must have TLS configuration. No TLS = unencrypted traffic. **Violations**: - Ingress with hosts but no TLS - traffic unencrypted - TLS without `secretName` - certificate source unclear - No `ingressClassName` - may use wrong controller - Missing cert-manager annotations - no automated certificates - Deprecated `kubernetes.io/ingress.class` annotation - No SSL redirect annotation - HTTP doesn't redirect to HTTPS **Fix**: Add TLS section with secretName, use `cert-manager.io/cluster-issuer` annotation for automated certs. --- ## 13. GPU Resources (LOW) **RULE**: GPU workloads need proper configuration. Missing config = scheduling failures. **Violations**: - GPU limits without matching requests - should be equal - No GPU toleration - won't schedule on GPU nodes - No GPU nodeSelector/affinity - relies only on resource availability - No runtimeClassName - may need nvidia runtime **Fix**: Set `nvidia.com/gpu` in both requests and limits (equal values), add GPU tolerations and nodeSelector. --- # Detection Philosophy This skill uses **VALUE-BASED detection**: - Detects issues by actual values and patterns, not by variable/field names - Future-proof: new charts with issues are automatically detected - No need to update scripts when new charts are added # Parsing Strategy - **Chart.yaml, values.yaml**: YAML content parsed via regex patterns - **templates/*.yaml**: Regex-based parsing (Go template syntax breaks YAML parsers) - **Multi-document YAML**: Handles `---` separators # Safety - Read-only operation (except report generation) - No Helm releases modified - No cluster changes