---
name: backdoor-deployment
description: "Validate a container image change via backdoor deployment. Use when: deploying test image to a cluster, comparing data volume between deployments, comparing resource consumption, backdoor deploy, validate container image, image regression testing, build and deploy branch."
argument-hint: "Provide branch name, current production image, and YAML file path"
---

# Backdoor Deployment Automation

Validates a container image change by deploying the current production image, collecting baseline data, then deploying the test image (from a CI build) and comparing data volume and resource consumption. No regressions = pass.

## Required Inputs

Check with the user if they want to use the default values or provide new ones.

| Input | Description | Default |
|-------|-------------|---------|
| **Branch name** | Git branch to build | `suyadav/aiautomation` |
| **Current production image** | Production image tag (e.g. `ciprod:X.Y.Z`) | `ciprod:3.1.35` |
| **YAML file path** | Helm values file for backdoor deployment | `./../azuremonitor-containerinsights-for-prod-clusters/values.yaml` |

## Derived Values

Parse these automatically from the YAML file — do not ask the user.

| Value | Source |
|-------|--------|
| **Cluster Resource ID** | `OmsAgent.aksResourceID` |
| **Log Analytics Workspace ID** | `OmsAgent.workspaceID` (a GUID used with `az monitor log-analytics query -w`) |
| **Cluster Name** | Last segment of the cluster resource ID (for `kubectl config use-context`) |
| **Subscription ID** | Extracted from the cluster resource ID (the segment after `/subscriptions/`) |
| **Resource Group** | Extracted from the cluster resource ID (the segment after `/resourceGroups/`) |

## Build Pipeline

| Field | Value |
|-------|-------|
| Organization | `github-private` |
| Project | `microsoft` |
| Build Definition ID | `444` |

## General Rules

- Save the output of **each step** to `BackdoorDeploymentOutput.md` in the repo root. Always append new results at the end. Beautify for readability. Don't clear until explicitly asked.
- If asked **"what's the next step"**, read `BackdoorDeploymentOutput.md` and suggest the next step.
- Before executing any step, verify previous step data exists in `BackdoorDeploymentOutput.md`. If missing, confirm with the user before proceeding.
- If the build must be retriggered, **keep the existing production baseline data** — do not re-deploy the production image or re-collect baseline data.
- After the workflow completes, **restore the YAML file** to its original production image values.

## Procedures

### Update YAML Image Tags

1. Only update the image version — do NOT change any other part of the file.
2. Update exactly two fields: `imageTagLinux` and `imageTagWindows`.
3. **Windows naming convention**: prefix `win-` after the image type. Examples:
   - `cidev:3.1.27-2-abc123-20250520184627` → `cidev:win-3.1.27-2-abc123-20250520184627`
   - `ciprod:3.1.27` → `ciprod:win-3.1.27`

### Deploy with Helm

Always use `--install` to handle both fresh installs and upgrades:

```bash
helm upgrade --install ama-logs <chart-path> -n kube-system
```

where `<chart-path>` is the directory containing the YAML (e.g. `./../azuremonitor-containerinsights-for-prod-clusters/`).

### Collect Table Data

Run Kusto queries via `az monitor log-analytics query -w <workspace-id>` (or the `kusto-mcp` MCP server if available).
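For example, a minimal sketch of one invocation through the Azure CLI; the workspace GUID is a placeholder and the query shown is only illustrative:

```bash
# Sketch: run one KQL query against the workspace parsed from the YAML.
# <workspace-id> comes from OmsAgent.workspaceID; the query text is an example only.
az monitor log-analytics query \
  -w "<workspace-id>" \
  --analytics-query "KubePodInventory | where TimeGenerated > ago(30m) | summarize Count=count() by bin(TimeGenerated, 1m)" \
  -o table
```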
Collect aggregated row counts in **1-minute bins** from **(deployment time + 5 min)** to **(deployment time + 10 min)** for these tables:

- `ContainerInventory`
- `KubeNodeInventory`
- `KubePodInventory`
- `InsightsMetrics`
- `Perf`
- `ContainerLogV2`

**Query template** (run once per table, all 6 can run in parallel):

```kusto
<TableName>
| where TimeGenerated between(datetime('<start-time>') .. datetime('<end-time>'))
| where _ResourceId =~ '<cluster-resource-id>'
| summarize Count=count() by bin(TimeGenerated, 1m)
| order by TimeGenerated asc
```

> **Timing**: Wait at least **15 minutes** after deployment before running these queries — this accounts for pod startup (~5 min) plus Log Analytics ingestion latency (~5–10 min). The query window (deploy+5 to deploy+10) captures steady-state data only.

### Compare Data Volume

1. Compare production vs test counts **side by side** for each table.
2. For `ContainerInventory`, `KubeNodeInventory`, `KubePodInventory`, `InsightsMetrics`, `Perf`: counts must match **exactly** per minute, excluding first/last minute edge windows. If they differ by even 1, investigate.
3. For `ContainerLogV2`: exact match is not required, but check for sustained upward/downward trends indicating regression.

### Check Build Failure Reason

Query the build timeline to find which task(s) failed:

```bash
az devops invoke --organization "https://dev.azure.com/github-private" \
  --area build --resource timeline \
  --route-parameters project=microsoft buildId=<build-id> \
  --query "records[?result=='failed'].{name:name, type:type}" -o table
```

- If the **only** failed task name contains "Trivy" (vulnerability scan), the build images are valid — continue using this build. **Do NOT fall back to a previous build. Extract the image tag from this build's logs.**
- If any other task failed, the build is unusable — report the failure to the user.

### Extract Image Version from Build Logs

Use the ADO API to read the build log directly (no need to download zip files):

1. **Find the log ID** for the "Multi-arch Linux build" task:

   ```bash
   az devops invoke --organization "https://dev.azure.com/github-private" \
     --area build --resource timeline \
     --route-parameters project=microsoft buildId=<build-id> \
     --query "records[?name=='Multi-arch Linux build'].{name:name, logId:log.id}" -o json
   ```

2. **Read the log** and extract the image tag. The log contains a line like:

   ```
   ##[warning]Linux image built with tag: containerinsightsprod.azurecr.io/public/azuremonitor/containerinsights/cidev:3.1.34-17-g67321cf0d-20260323045331
   ```

   Use `grep -o 'cidev:[^ ]*'` or similar to extract the tag.

3. **Derive the Windows tag** from the Linux tag using the naming convention (prefix `win-`). Alternatively, find the "Docker windows build for multi-arc image" log for a line like:

   ```
   ##[warning]Windows image built with tag: ...cidev:win-3.1.34-17-g67321cf0d-20260323045331
   ```
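As a minimal sketch of steps 2–3 above, assuming the task log has already been saved to a local file (the file name `linux-build.log` is a placeholder):

```bash
# Sketch: pull the Linux tag out of a saved task log, then derive the Windows tag
# by inserting "win-" after the image type (see "Update YAML Image Tags").
LINUX_TAG=$(grep -o 'cidev:[^ ]*' linux-build.log | head -n 1)
WINDOWS_TAG=$(echo "$LINUX_TAG" | sed 's/^cidev:/cidev:win-/')
echo "Linux tag:   $LINUX_TAG"
echo "Windows tag: $WINDOWS_TAG"
```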
### Get PodUid

Query `KubePodInventory` scoped to the relevant deployment window:

```kusto
KubePodInventory
| where TimeGenerated between(datetime('<start-time>') .. datetime('<end-time>'))
| where _ResourceId =~ '<cluster-resource-id>'
| where Name in ('<pod-name-1>', '<pod-name-2>', ...)
| distinct PodUid, Name
```

### Compare Resource Consumption

Query per-minute resource consumption. You can batch multiple pods in one query using `or`:

```kusto
Perf
| where TimeGenerated between(datetime('<start-time>') .. datetime('<end-time>'))
| where _ResourceId =~ '<cluster-resource-id>'
| where CounterName =~ '<counter-name>'
| where InstanceName contains '<pod-uid-1>' or InstanceName contains '<pod-uid-2>' or ...
| extend Pod = case(
    InstanceName contains '<pod-uid-1>', '<pod-name-1>',
    InstanceName contains '<pod-uid-2>', '<pod-name-2>',
    'unknown')
| summarize MaxValue=max(CounterValue/1000/1000/1000) by bin(TimeGenerated, 1m), Pod
| order by Pod asc, TimeGenerated asc
```

Compare the two counter names:

- `memoryWorkingSetBytes` — memory in GB
- `cpuUsageNanoCores` — CPU in cores

Flag any regression (sustained increase in the test deployment).

### Investigate Data Volume Regression

When a table's counts differ between production and test (or ContainerLogV2 shows a sustained trend), investigate before marking it as a regression:

1. **Break down by ContainerName** in both windows to identify which container(s) are responsible:

   ```kusto
   <TableName>
   | where TimeGenerated between(datetime('<start-time>') .. datetime('<end-time>'))
   | where _ResourceId =~ '<cluster-resource-id>'
   | summarize Count=count() by ContainerName
   | sort by Count desc
   ```

2. **Compare the per-container breakdown** between production and test. Look for:
   - Containers present in one window but not the other (cluster workload change, not a code regression).
   - A specific container with significantly higher counts in the test window.

3. **If a container is only present in one window**, verify it was running independently of the deployment by checking a broader time range (e.g., 30 min before the deployment):

   ```kusto
   <TableName>
   | where TimeGenerated between(datetime('<start-time>') .. datetime('<end-time>'))
   | where _ResourceId =~ '<cluster-resource-id>'
   | where ContainerName == '<container-name>'
   | summarize Count=count() by bin(TimeGenerated, 1m)
   | order by TimeGenerated asc
   ```

4. **Classify the finding**:
   - If the difference is caused by a container that started/stopped independently of the deployment → **not a regression** (cluster workload difference). Note this in the output file and mark as PASS.
   - If the difference is caused by an ama-logs container or directly relates to the code change → **potential regression**. Flag it and ask the user to review.

### Investigate Resource Consumption Regression

When memory or CPU shows a sustained increase in the test deployment:

1. **Check per-container resource usage** within each pod to isolate which container is consuming more. The ama-logs pods run multiple containers (ama-logs, ama-logs-prometheus, addon-token-adapter). Use:

   ```kusto
   Perf
   | where TimeGenerated between(datetime('<start-time>') .. datetime('<end-time>'))
   | where _ResourceId =~ '<cluster-resource-id>'
   | where CounterName =~ '<counter-name>'
   | where InstanceName contains '<pod-uid>'
   | summarize MaxValue=max(CounterValue/1000/1000/1000) by bin(TimeGenerated, 1m), InstanceName
   | order by InstanceName asc, TimeGenerated asc
   ```

2. **Compare the per-container breakdown** between production and test to pinpoint the specific container causing the increase.

3. **Classify the finding**:
   - Increases < 10% within normal variance → **not a regression**. Note in output and mark as PASS.
   - Sustained increases ≥ 10% in an ama-logs container → **potential regression**. Flag and ask the user to review.
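Phase 1, step 3 below checks for an existing build on the branch. One possible way to do that with the Azure DevOps CLI is sketched here; the exact flags and output fields are an assumption, so verify them against your `az devops` extension version:

```bash
# Sketch: list recent builds for definition 444 on the branch, showing which commit
# each built so it can be compared against the branch's latest commit.
az pipelines build list \
  --organization "https://dev.azure.com/github-private" \
  --project microsoft \
  --definition-ids 444 \
  --branch "suyadav/aiautomation" \
  --top 5 \
  --query "[].{id:id, status:status, result:result, commit:sourceVersion}" -o table
```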
## Steps

The workflow has two parallel tracks that converge after the build completes.

### Phase 1: Obtain Build + Deploy Production Image (parallel)

1. **Parse derived values** from the YAML file (see Derived Values table). Save all values to the output file.
2. **Set kubectl context**: `kubectl config use-context <cluster-name>`.
3. **Check for an existing build** on the branch for the **latest commit** (definition ID 444, org: `github-private`, project: `microsoft`).
   - If a completed build exists on the latest commit → use it (even if it failed due to Trivy — see "Check Build Failure Reason").
   - **IMPORTANT: A build that failed ONLY due to Trivy is still usable.** Do NOT fall back to a previous build. The images are already built and pushed before Trivy runs. Always extract the image tag from the failed build's logs (see "Extract Image Version from Build Logs").
   - If no usable build exists → **trigger a new build**. Save the build ID.
4. **If the build is already complete**, skip to Phase 2 after finishing the production baseline steps. **If the build is still running**, proceed with steps 5–9 in parallel; periodically check build status during wait times.
5. **Update YAML** with the current production image and **deploy** (see "Update YAML Image Tags" and "Deploy with Helm"). Record the **production deployment time** (UTC).
6. **Wait 15 minutes**, then verify pods: `kubectl get pods -n kube-system | grep ama-logs`. Confirm all are Running with 0 restarts. Save pod names to the output file.
7. **Collect production baseline data** for all 6 tables (see "Collect Table Data"). Save results to the output file.

### Phase 2: Deploy Test Image (after build completes)

8. **Confirm the build** completed. Check the failure reason if needed (see "Check Build Failure Reason"). If it failed for a non-Trivy reason, ask the user whether to retrigger. **If it failed only due to Trivy, treat it as a successful build — the images are valid. Do NOT fall back to a previous build.**
9. **Extract the test image version** from the build logs (see "Extract Image Version from Build Logs"). Save to the output file.
10. **Update YAML** with the test image and **deploy**. Record the **test deployment time** (UTC).
11. **Wait 15 minutes**, then verify pods are Running. If any pod restarted, get the reason via `kubectl describe pod <pod-name> -n kube-system`. Save pod names to the output file.
12. **Collect test data** for all 6 tables (see "Collect Table Data"). Save results to the output file.

### Phase 3: Compare Results

13. **Compare data volume** between production and test for all tables (see "Compare Data Volume"). If any table shows a difference, **investigate** before reporting (see "Investigate Data Volume Regression").
14. **Get PodUid** for all pods in both deployments (see "Get PodUid").
15. **Compare resource consumption** for `memoryWorkingSetBytes` and `cpuUsageNanoCores` (see "Compare Resource Consumption"). If any metric shows a sustained increase, **investigate** before reporting (see "Investigate Resource Consumption Regression").
16. **Restore YAML** to its original production image values.
17. **Write summary** to the output file: pass/fail for each table and resource check. Include investigation findings for any anomalies — clearly distinguish between code regressions and cluster workload differences. One possible layout is sketched below.
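For step 17, a minimal sketch of appending the final summary to the output file; the image tags, verdicts, and notes are illustrative placeholders, not real results:

```bash
# Sketch: append a final summary block to BackdoorDeploymentOutput.md (step 17).
# Image tags and PASS/FAIL verdicts are placeholders - fill them in from the actual comparison.
cat >> BackdoorDeploymentOutput.md <<'EOF'

## Final Summary: <test-image-tag> vs <prod-image-tag>

| Check | Result | Notes |
|-------|--------|-------|
| ContainerInventory | PASS/FAIL | exact per-minute match |
| KubeNodeInventory | PASS/FAIL | exact per-minute match |
| KubePodInventory | PASS/FAIL | exact per-minute match |
| InsightsMetrics | PASS/FAIL | exact per-minute match |
| Perf | PASS/FAIL | exact per-minute match |
| ContainerLogV2 | PASS/FAIL | no sustained trend |
| memoryWorkingSetBytes | PASS/FAIL | < 10% sustained increase |
| cpuUsageNanoCores | PASS/FAIL | < 10% sustained increase |
EOF
```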