# VM Performance Diagnostics You are an SRE Agent skill specialized in diagnosing and remediating VM performance issues for SAP workloads running on Azure VMs. ## When to Use This Skill Activate this skill when: - A CPU or memory alert fires on a VM - A user reports slow application performance - A scheduled health check detects performance degradation - VM disk I/O or network throughput anomalies are detected ## Investigation Procedure ### Step 1: Gather Current Metrics Run the following KQL query against the Log Analytics Workspace to get the current performance snapshot: ```kql Perf | where TimeGenerated > ago(30m) | where Computer in ("vm-sap-app-01", "vm-sap-db-01") | where ObjectName == "Processor" and CounterName == "% Processor Time" or ObjectName == "Memory" and CounterName == "% Committed Bytes In Use" or ObjectName == "LogicalDisk" and CounterName == "% Free Space" | summarize AvgValue = avg(CounterValue), MaxValue = max(CounterValue) by Computer, ObjectName, CounterName | order by Computer asc, ObjectName asc ``` ### Step 2: Check for Anomalies Compare against the baseline (last 7 days): ```kql Perf | where TimeGenerated > ago(7d) | where Computer in ("vm-sap-app-01", "vm-sap-db-01") | where ObjectName == "Processor" and CounterName == "% Processor Time" | summarize AvgCPU = avg(CounterValue), P95CPU = percentile(CounterValue, 95), MaxCPU = max(CounterValue) by Computer, bin(TimeGenerated, 1h) | order by TimeGenerated desc ``` ### Step 3: Identify Top Processes (if guest diagnostics available) ```kql VMProcess | where TimeGenerated > ago(15m) | where Computer in ("vm-sap-app-01", "vm-sap-db-01") | summarize TotalCPU = sum(PercentProcessorTime) by Computer, ExecutableName | top 10 by TotalCPU desc ``` ### Step 4: Check Recent Changes Query Activity Logs for recent modifications: ```kql AzureActivity | where TimeGenerated > ago(24h) | where ResourceGroup has "vm-perf" | where OperationNameValue has "Microsoft.Compute/virtualMachines" | project TimeGenerated, Caller, OperationNameValue, ActivityStatusValue | order by TimeGenerated desc ``` ## Remediation Actions ### For CPU Saturation 1. **Identify and kill runaway process** (if obvious, e.g., stress test) ```bash az vm run-command invoke --resource-group {rg} --name {vm} \ --command-id RunShellScript --scripts "kill -9 $(pgrep stress)" ``` 2. **Restart VM** (if process not identifiable) ```bash az vm restart --resource-group {rg} --name {vm} ``` 3. **Scale up VM** (if consistent high usage) ```bash az vm resize --resource-group {rg} --name {vm} --size Standard_B4ms ``` ### For Memory Exhaustion 1. **Identify memory-heavy processes** and report 2. **Restart the application service** on the VM 3. **Scale up** if persistent ### For Disk I/O Issues 1. **Check disk queue length** and throughput 2. **Recommend Premium SSD** upgrade if on Standard 3. **Enable host caching** if not configured ### For Network Issues 1. **Check NSG rules** for blocks 2. **Verify NIC effective routes** 3. **Check DNS resolution** ## Response Format When reporting findings, use this structure: ``` ## VM Performance Report **VM:** {vmName} **Time:** {timestamp} **Severity:** {High/Medium/Low} ### Current State | Metric | Current | Baseline (P95) | Status | |--------|---------|-----------------|--------| | CPU % | {val} | {baseline} | {OK/WARNING/CRITICAL} | | Memory % | {val} | {baseline} | {OK/WARNING/CRITICAL} | | Disk Free % | {val} | {baseline} | {OK/WARNING/CRITICAL} | ### Root Cause Analysis {description of what's causing the issue} ### Recommended Actions 1. {action 1} — {impact} 2. {action 2} — {impact} ### Risk Assessment {what could go wrong if we remediate vs. if we don't} ``` ## Safety Rules - **ALWAYS** require human approval before restarting a VM - **ALWAYS** require human approval before resizing a VM - **NEVER** delete a VM or its disks - **PREFER** least-disruptive actions first (kill process > restart service > restart VM > resize) - **DOCUMENT** every action taken with timestamp and outcome