--- name: devops-automation description: "DevOps and IT Ops automation - CI/CD, monitoring, incident management, and infrastructure workflows" version: "1.0.0" author: claude-office-skills license: MIT category: devops tags: - devops - ci-cd - monitoring - incident - automation department: Engineering models: recommended: - claude-sonnet-4 mcp: server: devops-mcp tools: - github_api - jenkins_trigger - aws_cli - kubernetes_api capabilities: - ci_cd_pipelines - monitoring_alerts - incident_management - infrastructure_automation - deployment_workflows languages: - en - zh related_skills: - slack-workflows - telegram-bot - ai-agent-builder --- # DevOps Automation Automate DevOps workflows including CI/CD pipelines, monitoring, incident management, and infrastructure operations. Based on n8n's IT Ops workflow templates. ## Overview This skill covers: - CI/CD pipeline automation - Monitoring and alerting - Incident management - Infrastructure automation - Deployment workflows --- ## CI/CD Automation ### GitHub Actions Integration ```yaml workflow: "GitHub CI/CD Notifications" triggers: - github_push - github_pull_request - github_workflow_run on_push: action: - trigger_ci: if_main_branch - notify_slack: channel: "#deployments" message: | 📦 *New Push to {branch}* Commit: `{commit_sha_short}` Author: {author} Message: {commit_message} [View Diff]({compare_url}) on_pr_opened: action: - notify_slack: channel: "#code-review" message: | 🔀 *New Pull Request* Title: {pr_title} Author: {author} Branch: {head} → {base} [Review PR]({pr_url}) - assign_reviewers: based_on_codeowners - run_ci_checks on_workflow_complete: action: - notify_slack: message: | {status_emoji} *Build {status}* Workflow: {workflow_name} Branch: {branch} Duration: {duration} {if_failed: [View Logs]({logs_url})} ``` ### Deployment Pipeline ```yaml deployment_pipeline: stages: build: trigger: push_to_main steps: - checkout_code - install_dependencies - run_tests - build_artifact - push_to_registry staging: trigger: build_success steps: - deploy_to_staging - run_integration_tests - notify_qa production: trigger: manual_approval steps: - create_backup - deploy_to_production - run_smoke_tests - notify_team rollback: trigger: deployment_failed OR manual steps: - revert_to_previous - notify_team - create_incident ``` --- ## Monitoring & Alerting ### Alert Routing ```yaml alert_routing: sources: - prometheus - datadog - cloudwatch - new_relic severity_levels: critical: response_time: 5_minutes channels: [pagerduty, slack_urgent, sms] escalation: immediate high: response_time: 15_minutes channels: [slack_alerts, email] escalation: after_15_minutes medium: response_time: 1_hour channels: [slack_alerts] low: response_time: 24_hours channels: [slack_logging] routing_rules: - if: service == "payments" team: payments_oncall severity_boost: +1 - if: service == "auth" team: security_oncall - default: team: platform_oncall ``` ### Alert Templates ```yaml alert_templates: infrastructure: cpu_high: title: "🔥 High CPU Usage" body: | Server: {host} CPU: {cpu_percent}% Duration: {duration} Threshold: {threshold}% [View Dashboard]({grafana_url}) memory_critical: title: "💾 Critical Memory" body: | Server: {host} Memory: {memory_percent}% Available: {available_mb}MB [SSH to Server]({ssh_link}) disk_full: title: "💿 Disk Space Critical" body: | Server: {host} Disk: {disk_percent}% Available: {available_gb}GB Suggestion: Clean logs or expand volume application: error_spike: title: "📈 Error Rate Spike" body: | Service: {service} Error Rate: {error_rate}% Normal: {baseline}% Top Errors: {top_errors} latency_high: title: "🐢 High Latency" body: | Service: {service} P99 Latency: {p99_ms}ms Threshold: {threshold_ms}ms ``` --- ## Incident Management ### Incident Workflow ```yaml incident_workflow: detection: sources: [monitoring, user_report, automated_check] triage: auto_severity: - if: affects_payments severity: critical - if: affects_auth severity: critical - if: affects_api AND error_rate > 10% severity: high response: critical: - create_incident_channel: "#inc-{timestamp}" - page_oncall: immediately - notify_stakeholders: [engineering_lead, product] - start_war_room: zoom_link - create_status_page: incident high: - create_incident_channel - notify_oncall: slack - create_ticket: jira communication: internal: frequency: every_30_minutes channel: incident_channel template: | 📊 *Incident Update* Status: {status} Impact: {impact} Next update: {next_update_time} Current actions: {action_items} external: channel: status_page template: customer_facing_update resolution: steps: - confirm_resolution - update_status_page: resolved - notify_stakeholders - schedule_postmortem - close_incident_channel: after_24h ``` ### Postmortem Template ```yaml postmortem_template: sections: summary: - incident_title - duration - severity - impact timeline: format: | | Time | Event | |------|-------| | {time} | {event} | root_cause: - what_happened - why_it_happened - contributing_factors impact: - users_affected - revenue_impact - sla_breach resolution: - how_it_was_fixed - time_to_detect - time_to_resolve action_items: format: | | Action | Owner | Due Date | Status | |--------|-------|----------|--------| lessons_learned: - what_went_well - what_went_poorly - lucky_breaks ``` --- ## Infrastructure Automation ### Server Provisioning ```yaml provisioning_workflow: trigger: jira_ticket OR slack_request steps: 1. validate_request: check: [budget_approval, security_review] 2. create_infrastructure: terraform: - vpc - security_groups - ec2_instances - load_balancer 3. configure_server: ansible: - base_configuration - security_hardening - monitoring_agent - application_setup 4. validate: - health_check - security_scan - performance_baseline 5. notify: slack: "✅ Server {hostname} is ready" include: [ssh_access, dashboard_link] ``` ### Scheduled Maintenance ```yaml maintenance_automation: tasks: certificate_renewal: schedule: "30 days before expiry" action: - request_new_cert: letsencrypt - deploy_cert - verify_ssl - notify: if_failure security_patching: schedule: "weekly" action: - check_updates - if_critical: immediate_patch - else: schedule_maintenance_window log_rotation: schedule: "daily" action: - rotate_logs - compress_old - upload_to_s3 - delete_local: older_than_7_days backup_verification: schedule: "weekly" action: - restore_to_test_env - run_integrity_checks - report_status ``` --- ## Kubernetes Automation ### K8s Workflows ```yaml kubernetes_automation: deployment: trigger: docker_image_pushed steps: - update_manifest: with_new_image_tag - apply_to_staging - run_tests - if_success: apply_to_production scaling: trigger: metric_threshold rules: - if: cpu > 80% action: scale_up max_replicas: 10 - if: cpu < 20% action: scale_down min_replicas: 2 rollback: trigger: health_check_failed action: - kubectl_rollout_undo - notify_team - create_incident ``` --- ## Output Example **Request**: "Set up deployment notifications for GitHub Actions" **Output**: ```markdown # GitHub Actions Deployment Notifications ## n8n Workflow ```yaml trigger: GitHub Webhook events: [workflow_run] ``` ## Notification Templates **Build Started:** ``` 🚀 *Deployment Started* Branch: main Commit: abc1234 Author: @developer Triggered by: Push [View Workflow](https://github.com/...) ``` **Build Success:** ``` ✅ *Deployment Successful* Environment: Production Duration: 3m 42s Version: v1.2.3 Changes: • Feature X • Bug fix Y [View Deployment](https://app.example.com) ``` **Build Failed:** ``` ❌ *Deployment Failed* Stage: Test Error: npm test failed [View Logs](https://github.com/...) [Retry](https://github.com/...) ``` ## Slack Integration ```yaml channel: "#deployments" mention_on_failure: "@oncall" thread_replies: true ``` ``` --- *DevOps Automation Skill - Part of Claude Office Skills*