---
name: investigate-ci-failure
description: Investigate CI/Prow job failures on a GitHub pull request. Use when the user pastes a PR URL and asks about CI failures, red checks, test failures, or wants to understand why a job failed.
disable-model-invocation: true
---

# Investigate CI Failure

Given a PR URL (e.g. `https://github.com/openshift/lightspeed-service/pull/2825`), diagnose why CI jobs failed.

## Workflow

### 1. Extract PR info

Parse org, repo, and PR number from the URL. Fetch metadata with `gh`:

```bash
# PR metadata
gh api repos/{org}/{repo}/pulls/{pr} --jq '{title, state, user: .user.login, head_sha: .head.sha}'

# Changed files
gh api repos/{org}/{repo}/pulls/{pr}/files --jq '.[].filename'
```

### 2. Get check statuses

```bash
# All checks at a glance
gh pr checks {pr} --repo {org}/{repo}

# Detailed statuses with Prow URLs (use head SHA from step 1)
gh api repos/{org}/{repo}/statuses/{head_sha} \
  --jq '.[] | select(.state == "failure" or .state == "error") | {context, state, target_url}'
```

This gives you the list of failed jobs and their Prow dashboard URLs.

### 3. Construct GCS artifact URLs

From a Prow `target_url` like:

```
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/{org}_{repo}/{pr}/{job_name}/{build_id}
```

Derive:

- **Directory browser** (for navigating the artifact tree):
  `https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/{org}_{repo}/{pr}/{job_name}/{build_id}/`
- **Raw file content** (for fetching logs and JSON):
  `https://storage.googleapis.com/test-platform-results/pr-logs/pull/{org}_{repo}/{pr}/{job_name}/{build_id}/{path}`

### 4. Triage the failure

For each failed job, fetch artifacts in this order:

#### 4a. Quick status

```
GET storage.googleapis.com/.../finished.json
```

Check for `"passed": false` and `"result": "FAILURE"`.

#### 4b. Build log (most useful)

```
GET storage.googleapis.com/.../build-log.txt
```

This is the main ci-operator build log. It can be large (200KB+). Search from the **end** for:

- `failed` / `FAILED` / `error` / `ERROR`
- `step .* failed`
- Python tracebacks (`Traceback`, `AssertionError`, `FAILED tests/`)
- Container crash indicators (`CrashLoopBackOff`, `OOMKilled`, `Error from server`)
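The step 3 URL derivation is mechanical, so it can be scripted. A minimal sketch, assuming `target_url` holds a real Prow URL matching the pattern above (the `{org}`-style placeholders here are illustrative):

```bash
# Assumes target_url follows the Prow viewer pattern from step 3.
target_url="https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/{org}_{repo}/{pr}/{job_name}/{build_id}"

# Strip the Prow viewer prefix to get the GCS object path.
gcs_path="${target_url#https://prow.ci.openshift.org/view/gs/}"

browse_url="https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/${gcs_path}/"
raw_base="https://storage.googleapis.com/${gcs_path}"

echo "$browse_url"               # directory browser
echo "$raw_base/build-log.txt"   # raw file content
```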
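Continuing that sketch, the quick checks in 4a and 4b can be run from the shell as well. This assumes `curl` and `jq` are installed; the grep pattern is only a starting list built from the markers above:

```bash
# 4a: quick pass/fail check
curl -s "${raw_base}/finished.json" | jq '{passed, result}'

# 4b: pull the build log and scan its tail for failure markers.
# Line numbers reported by grep are relative to the tail, not the full log.
curl -s "${raw_base}/build-log.txt" | tail -n 300 |
  grep -inE 'failed|error|traceback|assertionerror|crashloopbackoff|oomkilled' || true
```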
#### 4c. Artifact tree exploration

The build log alone often doesn't tell the full story. Browse the GCS artifact directory to find step-specific logs, cluster state, and pod logs:

```
GET gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/.../artifacts/
```

Full artifact tree for an e2e job:

```
{build_id}/
├── build-log.txt                      ← main ci-operator log (start here)
├── finished.json                      ← pass/fail + metadata
├── artifacts/
│   ├── ci-operator.log                ← detailed ci-operator log
│   ├── junit_operator.xml             ← top-level JUnit results
│   ├── ci-operator-step-graph.json    ← step dependency graph
│   ├── ci-operator-metrics.json
│   ├── metadata.json
│   ├── build-logs/                    ← container image build logs
│   │   ├── lightspeed-service-api-amd64.log
│   │   ├── root-amd64.log
│   │   └── src-amd64.log
│   ├── build-resources/               ← CI namespace state
│   │   ├── pods.json                  ← all pods in CI namespace
│   │   ├── events.json                ← k8s events (useful for crashes)
│   │   ├── builds.json
│   │   ├── imagestreams.json
│   │   └── clusterClaim.json
│   ├── release/                       ← cluster provisioning step
│   │   ├── build-log.txt
│   │   └── finished.json
│   └── e2e-ols-cluster/               ← test workflow steps
│       ├── ipi-install-rbac/          ← cluster RBAC setup
│       │   └── build-log.txt
│       ├── e2e/                       ← THE ACTUAL TEST STEP
│       │   ├── build-log.txt          ← test runner output (pytest)
│       │   ├── finished.json
│       │   └── artifacts/             ← per-provider test results
│       │       ├── junit_e2e_azure_openai.xml
│       │       ├── junit_e2e_openai.xml
│       │       ├── junit_e2e_watsonx.xml
│       │       ├── junit_e2e_rhelai_vllm.xml
│       │       ├── junit_e2e_rhoai_vllm.xml
│       │       ├── junit_e2e_*_tool_calling.xml
│       │       ├── junit_e2e_quota_limits.xml
│       │       └── {provider}/cluster/    ← cluster state per provider
│       │           ├── podlogs/
│       │           │   ├── lightspeed-app-server-*.log       ← OLS service logs
│       │           │   ├── lightspeed-postgres-server-*.log
│       │           │   └── lightspeed-console-plugin-*.log
│       │           ├── olsconfig.yaml     ← OLS config used
│       │           ├── pods.yaml
│       │           ├── deployments.yaml
│       │           ├── configmap.yaml
│       │           ├── services.yaml
│       │           └── routes.yaml
│       ├── gather-must-gather/        ← cluster diagnostics
│       │   └── artifacts/
│       │       ├── must-gather.tar    ← full must-gather (large, ~25MB)
│       │       ├── camgi.html         ← must-gather analysis report
│       │       └── event-filter.html
│       └── openshift-configure-cincinnati/
```

**Where to look by failure type:**

| Symptom | Check these artifacts |
|---|---|
| Test assertion failure | `e2e/build-log.txt` + `junit_e2e_*.xml` |
| OLS service error/crash | `{provider}/cluster/podlogs/lightspeed-app-server-*.log` |
| Postgres issues | `{provider}/cluster/podlogs/lightspeed-postgres-server-*.log` |
| Deployment failure | `{provider}/cluster/pods.yaml` + `deployments.yaml` |
| Image build failure | `build-logs/*.log` |
| Cluster infra issue | `gather-must-gather/artifacts/camgi.html` + `event-filter.html` |
| CI namespace issues | `build-resources/events.json` + `pods.json` |

#### 4d. Downloading artifacts locally

When you need to search across many files or the artifacts are too large for WebFetch, download them to a temp directory using `gsutil` or `gcloud storage`:

```bash
TMPDIR=$(mktemp -d)

# Download a specific subdirectory
gcloud storage cp -r \
  gs://test-platform-results/pr-logs/pull/{org}_{repo}/{pr}/{job_name}/{build_id}/artifacts/e2e-ols-cluster/e2e/artifacts/ \
  "$TMPDIR/"
```

The GCS bucket path mirrors the Prow URL: strip `https://prow.ci.openshift.org/view/gs/` and prepend `gs://`.

When multiple jobs have failed, investigate each in a separate subagent (Task tool) to keep build-log context isolated and run fetches in parallel.
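After a 4d download, a rough first pass over the artifacts can also be scripted. A sketch reusing `$TMPDIR` from above; the JUnit matching is deliberately crude text search, since the exact XML layout varies:

```bash
# List JUnit files that record at least one failure.
grep -rl '<failure' "$TMPDIR" --include='junit_*.xml'

# Show context around each failure element (testcase names usually appear nearby).
grep -rn -B 2 '<failure' "$TMPDIR" --include='junit_*.xml' | head -n 40

# Scan the tail of every pod log for error markers.
find "$TMPDIR" -path '*podlogs*' -name '*.log' -print0 |
  xargs -0 -I{} sh -c 'echo "== $1"; tail -n 50 "$1" | grep -iE "error|traceback|fatal" || true' _ {}
```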
### 5. Cross-reference with PR changes

Compare the failure with the files changed in the PR. Common patterns:

| Failure type | Likely cause |
|---|---|
| Unit/integration test failure | Direct code bug in changed files |
| e2e cluster test failure | Infrastructure issue OR deployment-breaking change |
| Verify/lint failure | Formatting, type errors, or import issues |
| Image build failure | Dependency or Dockerfile issue |
| Flaky (passes on retest) | Known flake, not PR-related |

Check whether the same job also fails on the `main` branch (likely a flaky test) by looking at the job history:

```
https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/{job_name}
```

### 6. Report findings

Summarize:

1. **Which jobs failed** and which passed
2. **Root cause** for each failure (with relevant log excerpts)
3. **Whether it's PR-related or infrastructure/flaky**
4. **Suggested fix** if the failure is caused by the PR changes

## Known CI jobs for this repo

| Context | What it tests |
|---|---|
| `ci/prow/unit` | `make test-unit` — pytest unit tests |
| `ci/prow/integration` | `make test-integration` — integration tests |
| `ci/prow/verify` | `make verify` — black, ruff, pylint, mypy, woke |
| `ci/prow/security` | `make security-check` — bandit |
| `ci/prow/images` | Container image build |
| `ci/prow/fips-image-scan-service` | FIPS compliance scan |
| `ci/prow/e2e-ols-cluster` | Full cluster e2e — deploys OLS + operator on OpenShift, runs `make test-e2e` |
| `tide` | Merge readiness (labels, approvals) — not a test |
| Konflux | Supply chain security pipeline (separate from Prow) |

## Tool usage notes

- Use the `gh` CLI for all GitHub API calls (PR metadata, statuses, checks, comments, files).
- Use `WebFetch` to browse GCS directories (`gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/...`).
- Use `WebFetch` to fetch raw log/JSON content (`storage.googleapis.com/test-platform-results/...`).
- The Prow dashboard URL itself is JS-rendered and not useful via WebFetch — always use GCS URLs instead.
- Build logs can be very large. When fetched via WebFetch, they're saved to a temp file — read from the end to find failures quickly.
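Putting steps 1 and 2 together: a minimal end-to-end sketch, assuming an authenticated `gh`. The org/repo/PR values reuse the example PR from the top of this document; substitute the real ones.

```bash
#!/usr/bin/env bash
# Sketch: print each failed or errored status context with its Prow URL.
set -euo pipefail

org=openshift
repo=lightspeed-service
pr=2825   # example PR from the intro

head_sha=$(gh api "repos/${org}/${repo}/pulls/${pr}" --jq .head.sha)

# --paginate: the statuses list can span multiple pages; note that a
# context may appear more than once if the job was retested.
gh api --paginate "repos/${org}/${repo}/statuses/${head_sha}" \
  --jq '.[] | select(.state == "failure" or .state == "error")
            | "\(.context)\t\(.target_url)"'
```

Each `target_url` printed here then feeds the step 3 derivation to reach the job's artifacts.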