---
name: LVMS Analyzer
description: Analyzes LVMS must-gather data to diagnose storage issues
---

# LVMS Analyzer Skill

This skill provides detailed guidance for analyzing LVMS (Logical Volume Manager Storage) must-gather data to identify and troubleshoot storage issues.

## When to Use This Skill

Use this skill when:
- Analyzing LVMS must-gather data offline
- Diagnosing PVCs stuck in Pending state
- Investigating LVMCluster readiness issues
- Troubleshooting volume group creation failures
- Debugging TopoLVM CSI driver problems
- Checking operator health in LVMS namespace

This skill is automatically invoked by the `/lvms:analyze` command when working with must-gather data.

## Prerequisites

**Required:**
- LVMS must-gather directory extracted and accessible
- Must-gather contains LVMS namespace directory:
  - `namespaces/openshift-lvm-storage/` (newer versions)
  - OR `namespaces/openshift-storage/` (older versions)
- Python 3.6 or higher installed
- PyYAML library: `pip install pyyaml`

**Namespace Compatibility:**
- LVMS namespace changed from `openshift-storage` to `openshift-lvm-storage` in recent versions
- The analysis script automatically detects which namespace is present
- Both namespaces are fully supported for backward compatibility

**Must-Gather Structure:**
```
must-gather/
└── registry-{image-registry}-lvms-must-gather-{version}-sha256-{hash}/
    ├── cluster-scoped-resources/
    │   ├── core/
    │   │   └── persistentvolumes/
    │   │       └── pvc-*.yaml                # Individual PV files
    │   ├── storage.k8s.io/
    │   │   └── storageclasses/
    │   │       ├── lvms-vg1.yaml
    │   │       └── lvms-vg1-immediate.yaml
    │   └── security.openshift.io/
    │       └── securitycontextconstraints/
    │           └── lvms-vgmanager.yaml
    ├── namespaces/
    │   └── openshift-lvm-storage/            # or openshift-storage for older versions
    │       ├── oc_output/                    # IMPORTANT: Primary location for LVMS resources
    │       │   ├── lvmcluster.yaml           # Full LVMCluster resource with status
    │       │   ├── lvmcluster                # Text output (oc describe)
    │       │   ├── lvmvolumegroup            # Text output
    │       │   ├── lvmvolumegroupnodestatus  # Text output
    │       │   ├── logicalvolume             # Text output
    │       │   ├── pods                      # Text output (oc get pods)
    │       │   └── events                    # Text output
    │       ├── pods/
    │       │   ├── lvms-operator-{hash}/
    │       │   │   └── lvms-operator-{hash}.yaml
    │       │   └── vg-manager-{hash}/
    │       │       └── vg-manager-{hash}.yaml
    │       └── apps/                         # May contain deployments/daemonsets
    └── ...
```

**Key Note:** LVMS resources are primarily in the `oc_output/` directory, with `lvmcluster.yaml` being the most important file containing full cluster and node status.
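The path validation and namespace detection described above can also be expressed as a short standalone check. This is an illustrative sketch only, not the bundled `analyze_lvms.py`; the directory names come from the structure above, and the function name is hypothetical:

```python
#!/usr/bin/env python3
"""Pre-flight check for an extracted LVMS must-gather (illustrative sketch)."""
import sys
from pathlib import Path

# Both namespace names are valid; newer LVMS releases use openshift-lvm-storage.
LVMS_NAMESPACES = ("openshift-lvm-storage", "openshift-storage")


def find_lvms_namespace(must_gather: Path):
    """Return (root, namespace) for the first LVMS namespace directory found."""
    if not must_gather.is_dir():
        return None, None
    # Accept either the must-gather subdirectory itself or a parent containing it.
    candidates = [must_gather] + [p for p in must_gather.iterdir() if p.is_dir()]
    for root in candidates:
        for ns in LVMS_NAMESPACES:
            if (root / "namespaces" / ns).is_dir():
                return root, ns
    return None, None


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: check_mustgather.py <must-gather-path>")
    try:
        import yaml  # noqa: F401 -- PyYAML is required by the analysis script
    except ImportError:
        sys.exit("PyYAML is missing; install it with: pip install pyyaml")

    root, ns = find_lvms_namespace(Path(sys.argv[1]))
    if root is None:
        sys.exit("No LVMS namespace directory found; check the must-gather path")
    print(f"Must-gather root:        {root}")
    print(f"Detected LVMS namespace: {ns}")
```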
## Implementation Steps

### Step 1: Validate Must-Gather Path

Before running analysis, verify the must-gather directory structure:

```bash
# Check if LVMS namespace directory exists (try both namespaces)
ls {must-gather-path}/namespaces/openshift-lvm-storage 2>/dev/null || \
  ls {must-gather-path}/namespaces/openshift-storage

# Verify required resource directories
ls {must-gather-path}/cluster-scoped-resources/core/persistentvolumes
```

**Namespace Detection:**
The analysis script automatically detects which namespace is present:
- Newer LVMS versions use `openshift-lvm-storage`
- Older LVMS versions use `openshift-storage`
- The script will inform you which namespace was detected

**Common Issue:** User provides parent directory instead of subdirectory
- Must-gather extracts to a directory like `must-gather.local.12345/`
- Inside is a subdirectory like `registry-ci-openshift-org-origin-4-18.../`
- Always use the **subdirectory** (the one with `cluster-scoped-resources/` and `namespaces/`)

**Handling:**
```bash
# If user provides parent directory, try to find the correct subdirectory
if [ ! -d "{path}/namespaces/openshift-lvm-storage" ] && \
   [ ! -d "{path}/namespaces/openshift-storage" ]; then
  # Try to find either namespace
  find {path} -type d \( -name "openshift-lvm-storage" -o -name "openshift-storage" \) -path "*/namespaces/*"
  # Suggest the correct path to user
fi
```

### Step 2: Run Analysis Script

Use the Python analysis script for structured analysis:

```bash
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
  {must-gather-path}
```

**Script Location:**
- Always use: `plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py`
- Use relative path from repository root
- Script is part of the LVMS plugin

**Component-Specific Analysis:**
For focused analysis on specific components:

```bash
# Analyze only storage/PVC issues
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
  {must-gather-path} --component storage

# Analyze only operator health
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
  {must-gather-path} --component operator

# Analyze only volume groups
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
  {must-gather-path} --component volumes

# Analyze only pod logs
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
  {must-gather-path} --component logs
```

### Step 3: Interpret Analysis Results

The script provides structured output across several sections:

**1. LVMCluster Status**

Key fields to check:
- `state`: Should be "Ready"
- `ready`: Should be true
- `conditions`: All should have status "True"
  - ResourcesAvailable: Resources deployed successfully
  - VolumeGroupsReady: VGs created on all nodes

Example healthy output:
```
LVMCluster: lvmcluster-sample
  ✓ State: Ready
  ✓ Ready: true
  Conditions:
    ✓ ResourcesAvailable: True
    ✓ VolumeGroupsReady: True
```

Example unhealthy output (real case from must-gather):
```
LVMCluster: my-lvmcluster
  ❌ State: Degraded
  ❌ Ready: false
  Conditions:
    ✓ ResourcesAvailable: True
      Reason: ResourcesAvailable
      Message: Reconciliation is complete and all the resources are available
    ❌ VolumeGroupsReady: False
      Reason: VGsDegraded
      Message: One or more VGs are degraded
```
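If you need to spot-check these fields by hand (for example when the script is unavailable), a minimal PyYAML sketch along the following lines works against `oc_output/lvmcluster.yaml`. The field names (`state`, `ready`, `conditions`) are taken from the example output above; the handling of a possible `List` wrapper is an assumption and may not match every must-gather:

```python
import sys
from pathlib import Path

import yaml


def summarize_lvmcluster(must_gather: Path, namespace: str = "openshift-lvm-storage") -> None:
    """Print LVMCluster state, readiness, and conditions from oc_output/lvmcluster.yaml."""
    path = must_gather / "namespaces" / namespace / "oc_output" / "lvmcluster.yaml"
    doc = yaml.safe_load(path.read_text())
    # Some dumps wrap the resource in a List; take the first item if so (assumption).
    cluster = doc["items"][0] if isinstance(doc, dict) and doc.get("kind") == "List" else doc
    status = cluster.get("status", {})
    print(f"LVMCluster: {cluster['metadata']['name']}")
    print(f"  State: {status.get('state')}   Ready: {status.get('ready')}")
    for cond in status.get("conditions", []):
        mark = "✓" if cond.get("status") == "True" else "❌"
        print(f"  {mark} {cond.get('type')}: {cond.get('status')} ({cond.get('reason', '')})")
        if cond.get("message"):
            print(f"      Message: {cond['message']}")


if __name__ == "__main__":
    summarize_lvmcluster(Path(sys.argv[1]))
```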
**2. Volume Group Status**

Checks volume group creation per node and device availability:

Example output (real case from must-gather):
```
Volume Group/Device Class: vg1
  Nodes: 3

  Node: ocpnode1.ocpiopex.growipx.com
    ⚠ Status: Progressing
    Devices: /dev/mapper/3600a098038315048302b586c38397562, /dev/mapper/mpatha
    Excluded devices: 24 device(s)
      - /dev/sdb: /dev/sdb has children block devices and could not be considered
      - /dev/sdb4: /dev/sdb4 has an invalid filesystem signature (xfs) and cannot be used
      - /dev/mapper/3600a098038315047433f586c53477272: has an invalid filesystem signature (xfs)
      ... and 21 more excluded devices

  Node: ocpnode2.ocpiopex.growipx.com
    ❌ Status: Degraded
    Reason: failed to create/extend volume group vg1: failed to extend volume group vg1:
      WARNING: VG name vg0 is used by VGs VVnkhP-khYQ-blyc-2TNo-d3cv-b6di-4RbSyY and EUV3xv-ft6q-39xK-J3ki-rglf-9H44-rVIHIq.
      Fix duplicate VG names with vgrename uuid, a device filter, or system IDs.
      Physical volume '/dev/mapper/3600a098038315048302b586c38397578p3' is already in volume group 'vg0'
      Unable to add physical volume '/dev/mapper/3600a098038315048302b586c38397578p3' to volume group 'vg0'
      ... (truncated, see LVMCluster status for full details)
    Devices: /dev/mapper/mpatha
```

This real example shows a common LVMS issue: duplicate volume group names preventing VG extension.

**3. Storage (PVC/PV) Status**

Lists pending or failed PVCs:

Example output:
```
Pending PVCs:
  database/postgres-data
    ❌ Status: Pending (10m)
    Storage Class: lvms-vg1
    Requested: 100Gi
    Recent Events:
      ⚠ ProvisioningFailed: no node has enough free space
```

**4. Operator Health**

Checks LVMS operator pods, deployments, and daemonsets:

Example issues:
```
❌ vg-manager-abc123 (worker-0)
   Status: CrashLoopBackOff
   Restarts: 15
   Error: volume group "vg1" not found
```

**5. Pod Logs**

Extracts and analyzes error/warning messages from pod logs:

Example output (from real must-gather):
```
═══════════════════════════════════════════════════════════
POD LOGS ANALYSIS
═══════════════════════════════════════════════════════════

Pod: vg-manager-nz4pc
  Unique errors/warnings: 1
  ❌ 2025-10-28T10:47:28Z: Reconciler error
     Controller: lvmvolumegroup
     Error Details: failed to create/extend volume group vg1: failed to extend volume group vg1:
       WARNING: VG name vg0 is used by VGs WsNJwk-DK3q-tSHg-zvQJ-imF1-SdRv-8oh4e0 ...
       Cannot use /dev/dm-10: device is too small (pv_min_size)
       Command requires all devices to be found.

Pod: lvms-operator-65df9f4dbb-92jwl
  Unique errors/warnings: 1
  ❌ 2025-10-28T10:52:48Z: failed to validate device class setup
     Controller: lvmcluster
     Error: VG vg1 on node Degraded is not in ready state (ocpnode1.ocpiopex.growipx.com)
```

**Key Points:**
- Logs are parsed from JSON format
- Errors are deduplicated (same error repeated in reconciliation loops)
- Shows unique error messages with first occurrence timestamp
- Provides additional context not visible in resource status
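The deduplication described in the points above can be approximated offline with a few lines of Python. This is a rough sketch, not the script's actual implementation: the JSON field names (`level`, `msg`, `error`, `ts`) and the `current.log` path pattern are assumptions about typical controller logs and may need adjusting for your must-gather:

```python
import json
from pathlib import Path


def unique_errors(log_file: Path) -> dict:
    """Map each unique (message, error) pair to its first-seen timestamp in a JSON-lines log."""
    first_seen = {}
    for line in log_file.read_text().splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if entry.get("level") not in ("error", "warn", "warning"):
            continue
        key = (entry.get("msg"), entry.get("error"))
        first_seen.setdefault(key, entry.get("ts"))
    return first_seen


if __name__ == "__main__":
    # Walk vg-manager pod logs in the current must-gather directory (path pattern assumed).
    for log in Path(".").glob("namespaces/*/pods/vg-manager-*/**/logs/current.log"):
        print(f"\nPod log: {log}")
        for (msg, err), ts in sorted(unique_errors(log).items(), key=lambda item: str(item[1])):
            print(f"  {ts}: {msg}: {err}")
```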
### Step 4: Analyze Root Causes

Connect related issues to identify root causes:

**Common Pattern 1: Device Filesystem Conflict**
```
Chain of failures:
1. Device /dev/sdb has existing ext4 filesystem
2. vg-manager cannot create volume group
3. Volume group missing on node
4. PVCs stuck in Pending

Root cause: Device not properly wiped before LVMS use
```

**Common Pattern 2: Insufficient Capacity**
```
Chain of failures:
1. Thin pool at 95% capacity
2. No free space for new volumes
3. PVCs stuck in Pending

Root cause: Insufficient storage capacity or old volumes not cleaned up
```

**Common Pattern 3: Node-Specific Failures**
```
Chain of failures:
1. Volume group missing on specific node
2. TopoLVM CSI driver not functional on that node
3. PVCs with node affinity to that node stuck Pending

Root cause: Node-specific device configuration issue
```

### Step 5: Generate Remediation Plan

Based on analysis results, provide prioritized recommendations:

**CRITICAL Issues (Fix Immediately):**

1. **Device Conflicts:**
   ```bash
   # Clean device on affected node
   oc debug node/{node-name}
   chroot /host
   wipefs -a /dev/{device}

   # Restart vg-manager to recreate VG
   oc delete pod -n openshift-lvm-storage -l app.kubernetes.io/component=vg-manager
   ```

2. **Pod Crashes:**
   ```bash
   # After fixing underlying issue, restart failed pods
   oc delete pod -n openshift-lvm-storage {pod-name}
   ```

3. **LVMCluster Not Ready:**
   ```bash
   # Review and fix device configuration
   oc edit lvmcluster -n openshift-lvm-storage
   # Ensure devices match actual available devices
   ```

**WARNING Issues (Address Soon):**

1. **Capacity Issues:**
   ```bash
   # Check logical volume usage
   oc debug node/{node} -- chroot /host lvs --units g

   # Remove unused volumes or expand thin pool
   ```

2. **Partial Node Coverage:**
   ```bash
   # Investigate why daemonsets not on all nodes
   oc get nodes --show-labels
   oc describe daemonset -n openshift-lvm-storage
   ```

### Step 6: Provide Next Steps

Always provide clear next steps:

1. **Review logs** (if available in must-gather):
   - Operator logs: `namespaces/openshift-lvm-storage/pods/lvms-operator-*/logs/`
   - VG-manager logs: `namespaces/openshift-lvm-storage/pods/vg-manager-*/logs/`
   - TopoLVM logs: `namespaces/openshift-lvm-storage/pods/topolvm-*/logs/`

2. **Verify fixes** (if cluster is accessible):
   ```bash
   # After implementing fixes, verify:
   oc get lvmcluster -n openshift-lvm-storage
   oc get lvmvolumegroup -A
   oc get pvc -A | grep Pending
   ```

3. **Re-collect must-gather** (if making changes):
   ```bash
   oc adm must-gather --image=quay.io/lvms_dev/lvms-must-gather:latest
   ```

## Error Handling

### Script Execution Errors

**Script not found:**
```bash
# Verify script exists
ls plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py

# Ensure it's executable
chmod +x plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py
```

**Python dependencies missing:**
```bash
# Install PyYAML
pip install pyyaml

# Or use pip3
pip3 install pyyaml
```

**Invalid YAML in must-gather:**
- Script handles YAML parsing errors gracefully
- Reports which files failed to parse
- Continues analysis with available data

### Must-Gather Issues

**Missing directories:**
- Script validates required directories exist
- Reports missing components
- Provides guidance on what's missing

**Incomplete must-gather:**
- If critical resources missing, script reports what it can analyze
- Suggests re-collecting must-gather
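The "report and continue" behavior described above boils down to wrapping each parse in a try/except and collecting failures instead of aborting. A minimal illustration of that pattern (a sketch, not the script's actual code; the helper name is hypothetical):

```python
from pathlib import Path

import yaml


def load_all_yaml(directory: Path):
    """Parse every YAML file under a directory, reporting bad files instead of raising."""
    parsed, failed = {}, []
    for path in sorted(directory.rglob("*.yaml")):
        try:
            parsed[path] = yaml.safe_load(path.read_text())
        except (yaml.YAMLError, OSError) as exc:
            failed.append((path, exc))  # keep going; analyze whatever did parse
    for path, exc in failed:
        print(f"WARNING: could not parse {path}: {exc}")
    return parsed
```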
## Examples

### Example 1: Full Analysis

```bash
# Run comprehensive analysis
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
  ./must-gather/registry-ci-openshift-org-origin-4-18.../
```

Output:
```
═══════════════════════════════════════════════════════════
LVMCLUSTER STATUS
═══════════════════════════════════════════════════════════
LVMCluster: lvmcluster-sample
  ❌ State: Failed
  ❌ Ready: false
...

═══════════════════════════════════════════════════════════
LVMS ANALYSIS SUMMARY
═══════════════════════════════════════════════════════════
❌ CRITICAL ISSUES: 3
  - LVMCluster not Ready (state: Failed)
  - Volume group vg1 not created on worker-0
  - 3 PVCs stuck in Pending state
```

### Example 2: Storage-Only Analysis

```bash
# Focus on PVC issues
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
  ./must-gather/... --component storage
```

Analyzes only:
- PVC/PV status
- Storage class configuration
- Volume provisioning issues

### Example 3: Operator Health Check

```bash
# Check operator components
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
  ./must-gather/... --component operator
```

Analyzes only:
- LVMCluster resource
- Deployments and daemonsets
- Pod status and crashes

## Best Practices

1. **Always validate path first:**
   - Check for the `namespaces/openshift-lvm-storage/` (or `namespaces/openshift-storage/`) directory
   - Use the correct subdirectory, not the parent

2. **Run full analysis first:**
   - Get overall health picture
   - Then drill down with component-specific analysis if needed

3. **Correlate issues:**
   - Look for patterns across components
   - Connect pod failures to VG issues to PVC problems

4. **Check timestamps:**
   - Events and pod restarts have timestamps
   - Helps understand the sequence of failures

5. **Provide actionable output:**
   - Don't just list issues
   - Explain root causes
   - Give specific remediation steps
   - Include verification commands

6. **Reference documentation:**
   - Link to the LVMS troubleshooting guide
   - Point to relevant sections in must-gather logs

## Additional Resources

- [LVMS Troubleshooting Guide](https://github.com/openshift/lvm-operator/blob/main/docs/troubleshooting.md)
- [LVMS Architecture](https://github.com/openshift/lvm-operator/tree/main/docs)
- [TopoLVM Documentation](https://github.com/topolvm/topolvm)
- [Must-Gather Collection](https://github.com/openshift/lvm-operator/tree/main/must-gather)