--- name: spark-consumption-cli description: > Analyze lakehouse data interactively using Fabric Lakehouse Livy API sessions and PySpark/Spark SQL for advanced analytics, DataFrames, cross-lakehouse joins, Delta time-travel, and unstructured/JSON data. Use when the user explicitly asks for PySpark, Spark DataFrames, Livy sessions, or Python-based analysis — NOT for simple SQL queries. Triggers: "PySpark", "Spark SQL", "analyze with PySpark", "Spark DataFrame", "Livy session", "lakehouse with Python", "PySpark analysis", "PySpark data quality", "Delta time-travel with Spark". --- > **Update Check — ONCE PER SESSION (mandatory)** > The first time this skill is used in a session, run the **check-updates** skill before proceeding. > - **GitHub Copilot CLI / VS Code**: invoke the `check-updates` skill. > - **Claude Code / Cowork / Cursor / Windsurf / Codex**: compare local vs remote package.json version. > - Skip if the check was already performed earlier in this session. > **CRITICAL NOTES** > 1. To find the workspace details (including its ID) from workspace name: list all workspaces and, then, use JMESPath filtering > 2. To find the item details (including its ID) from workspace ID, item type, and item name: list all items of that type in that workspace and, then, use JMESPath filtering # Data Engineering Consumption — CLI Skill ## Table of Contents | Task | Reference | Notes | |---|---|---| | Fabric Topology & Key Concepts | [COMMON-CORE.md § Fabric Topology & Key Concepts](../../common/COMMON-CORE.md#fabric-topology--key-concepts) || | Environment URLs | [COMMON-CORE.md § Environment URLs](../../common/COMMON-CORE.md#environment-urls) || | Authentication & Token Acquisition | [COMMON-CORE.md § Authentication & Token Acquisition](../../common/COMMON-CORE.md#authentication--token-acquisition) | Wrong audience = 401; read before any auth issue | | Core Control-Plane REST APIs | [COMMON-CORE.md § Core Control-Plane REST APIs](../../common/COMMON-CORE.md#core-control-plane-rest-apis) || | Pagination | [COMMON-CORE.md § Pagination](../../common/COMMON-CORE.md#pagination) || | Long-Running Operations (LRO) | [COMMON-CORE.md § Long-Running Operations (LRO)](../../common/COMMON-CORE.md#long-running-operations-lro) || | Rate Limiting & Throttling | [COMMON-CORE.md § Rate Limiting & Throttling](../../common/COMMON-CORE.md#rate-limiting--throttling) || | OneLake Data Access | [COMMON-CORE.md § OneLake Data Access](../../common/COMMON-CORE.md#onelake-data-access) | Requires `storage.azure.com` token, not Fabric token | | Job Execution | [COMMON-CORE.md § Job Execution](../../common/COMMON-CORE.md#job-execution) || | Capacity Management | [COMMON-CORE.md § Capacity Management](../../common/COMMON-CORE.md#capacity-management) || | Gotchas & Troubleshooting | [COMMON-CORE.md § Gotchas & Troubleshooting](../../common/COMMON-CORE.md#gotchas--troubleshooting) || | Best Practices | [COMMON-CORE.md § Best Practices](../../common/COMMON-CORE.md#best-practices) || | Tool Selection Rationale | [COMMON-CLI.md § Tool Selection Rationale](../../common/COMMON-CLI.md#tool-selection-rationale) || | Finding Workspaces and Items in Fabric | [COMMON-CLI.md § Finding Workspaces and Items in Fabric](../../common/COMMON-CLI.md#finding-workspaces-and-items-in-fabric) | **Mandatory** — *READ link first* [needed for finding workspace id by its name or item id by its name, item type, and workspace id] | | Authentication Recipes | [COMMON-CLI.md § Authentication Recipes](../../common/COMMON-CLI.md#authentication-recipes) | `az login` flows and token acquisition | | Fabric Control-Plane API via `az rest` | [COMMON-CLI.md § Fabric Control-Plane API via az rest](../../common/COMMON-CLI.md#fabric-control-plane-api-via-az-rest) | **Always pass `--resource https://api.fabric.microsoft.com`** or `az rest` fails | | Pagination Pattern | [COMMON-CLI.md § Pagination Pattern](../../common/COMMON-CLI.md#pagination-pattern) || | Long-Running Operations (LRO) Pattern | [COMMON-CLI.md § Long-Running Operations (LRO) Pattern](../../common/COMMON-CLI.md#long-running-operations-lro-pattern) || | OneLake Data Access via `curl` | [COMMON-CLI.md § OneLake Data Access via curl](../../common/COMMON-CLI.md#onelake-data-access-via-curl) | Use `curl` not `az rest` (different token audience) | | SQL / TDS Data-Plane Access | [COMMON-CLI.md § SQL / TDS Data-Plane Access](../../common/COMMON-CLI.md#sql--tds-data-plane-access) | `sqlcmd` (Go) connect, query, CSV export | | Job Execution (CLI) | [COMMON-CLI.md § Job Execution](../../common/COMMON-CLI.md#job-execution) || | OneLake Shortcuts | [COMMON-CLI.md § OneLake Shortcuts](../../common/COMMON-CLI.md#onelake-shortcuts) || | Capacity Management (CLI) | [COMMON-CLI.md § Capacity Management](../../common/COMMON-CLI.md#capacity-management) || | Composite Recipes | [COMMON-CLI.md § Composite Recipes](../../common/COMMON-CLI.md#composite-recipes) || | Gotchas & Troubleshooting (CLI-Specific) | [COMMON-CLI.md § Gotchas & Troubleshooting (CLI-Specific)](../../common/COMMON-CLI.md#gotchas--troubleshooting-cli-specific) | `az rest` audience, shell escaping, token expiry | | Quick Reference: `az rest` Template | [COMMON-CLI.md § Quick Reference: az rest Template](../../common/COMMON-CLI.md#quick-reference-az-rest-template) || | Quick Reference: Token Audience / CLI Tool Matrix | [COMMON-CLI.md § Quick Reference: Token Audience ↔ CLI Tool Matrix](../../common/COMMON-CLI.md#quick-reference-token-audience--cli-tool-matrix) | Which `--resource` + tool for each service | | Relationship to SPARK-AUTHORING-CORE.md | [SPARK-CONSUMPTION-CORE.md § Relationship to SPARK-AUTHORING-CORE.md](../../common/SPARK-CONSUMPTION-CORE.md#relationship-to-spark-authoring-coremd) || | Data Engineering Consumption Capability Matrix | [SPARK-CONSUMPTION-CORE.md § Data Engineering Consumption Capability Matrix](../../common/SPARK-CONSUMPTION-CORE.md#data-engineering-consumption-capability-matrix) || | OneLake Table APIs (Schema-enabled Lakehouses) | [SPARK-CONSUMPTION-CORE.md § OneLake Table APIs (Schema-enabled Lakehouses)](../../common/SPARK-CONSUMPTION-CORE.md#onelake-table-apis-schema-enabled-lakehouses) | Unity Catalog-compatible metadata; requires `storage.azure.com` token | | Lakehouse Livy Session Management | [SPARK-CONSUMPTION-CORE.md § Livy Session Management](../../common/SPARK-CONSUMPTION-CORE.md#livy-session-management) | Lakehouse Livy API: session creation, states, lifecycle, termination | | Interactive Data Exploration | [SPARK-CONSUMPTION-CORE.md § Interactive Data Exploration](../../common/SPARK-CONSUMPTION-CORE.md#interactive-data-exploration) | Statement execution, output retrieval, data discovery | | PySpark Analytics Patterns | [SPARK-CONSUMPTION-CORE.md § PySpark Analytics Patterns](../../common/SPARK-CONSUMPTION-CORE.md#pyspark-analytics-patterns) | Cross-lakehouse 3-part naming, performance optimization | | Must/Prefer/Avoid | [SKILL.md § Must/Prefer/Avoid](#mustpreferavoid) | **MUST DO / AVOID / PREFER** checklists | | Quick Start | [SKILL.md § Quick Start](#quick-start) | CLI-specific Lakehouse Livy session setup and data exploration | | Key Fabric Patterns | [SKILL.md § Key Fabric Patterns](#key-fabric-patterns) | Spark pattern quick-reference table | | Session Cleanup | [SKILL.md § Session Cleanup](#session-cleanup) | Clean up idle Lakehouse Livy sessions via CLI | --- ## Must/Prefer/Avoid ### MUST DO - Check for existing idle sessions before creating new ones - Use dynamic workspace/lakehouse discovery - Follow API patterns from [COMMON-CLI.md](../../common/COMMON-CLI.md) ### PREFER - **sqldw-consumption-cli for simple lakehouse queries** — row counts, SELECT, schema exploration, filtering, and aggregation on lakehouse Delta tables should use the SQL Endpoint via `sqlcmd`, not Spark. Only use this skill when the user explicitly requests PySpark, DataFrames, or Spark-specific features. - SQL Endpoint for Delta tables - Livy for unstructured/JSON data or complex Python analytics - Session reuse over creation ### AVOID - Hardcoded workspace IDs - Creating unnecessary sessions - Large result sets without LIMIT - **Confusing Lakehouse Livy sessions with Notebook Spark sessions** — This skill covers **Lakehouse Livy sessions** (the public Livy API at `/lakehouses/{lhId}/livyapi/.../sessions`). Notebook Spark sessions are created internally when running a notebook via the Jobs API (`RunNotebook`) and are NOT managed through the Livy API. To run a notebook as a job, see SPARK-AUTHORING-CORE.md § Notebook Execution & Job Management --- ## Quick Start ### Environment Setup Apply environment detection from COMMON-CORE.md Environment Detection Pattern to set: - `$FABRIC_API_BASE` and `$FABRIC_RESOURCE_SCOPE` - `$FABRIC_API_URL` and `$LIVY_API_PATH` for Livy operations **Authentication**: Use token acquisition from [COMMON-CLI.md](../../common/COMMON-CLI.md) Environment Detection and API Configuration ### Workspace & Item Discovery **Preferred**: Use [COMMON-CLI.md](../../common/COMMON-CLI.md) item discovery patterns (Finding things in Fabric) to find workspaces and items by name. **Fallback** (when workspace is already known): ```bash # List workspaces az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" --url "$FABRIC_API_URL/workspaces" --query "value[].{name:displayName, id:id}" --output table read -p "Workspace ID: " workspaceId # List lakehouses in workspace az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" --url "$FABRIC_API_URL/workspaces/$workspaceId/items?type=Lakehouse" --query "value[].{name:displayName, id:id}" --output table read -p "Lakehouse ID: " lakehouseId ``` ### Lakehouse Livy Session Management > **Two types of Spark sessions in Fabric** — This skill manages **Lakehouse Livy sessions**, created via the public Livy API endpoint (`/lakehouses/{lhId}/livyapi/.../sessions`). These are ad-hoc interactive sessions for remote clients. **Notebook Spark sessions** are a separate mechanism — they are created internally when a Fabric Notebook is executed (via portal or Jobs API `RunNotebook`), and are managed through the notebook lifecycle, not the Livy API. ```bash # Check for existing idle Lakehouse Livy session (avoid resource waste) sessionId=$(az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions" --query "sessions[?state=='idle'][0].id" --output tsv) # Create if none available - FORCE STARTER POOL USAGE if [[ -z "$sessionId" ]]; then cat > /tmp/body.json << 'EOF' { "name":"analysis", "driverMemory":"56g", "driverCores":8, "executorMemory":"56g", "executorCores":8, "conf": { "spark.dynamicAllocation.enabled": "true", "spark.fabric.pool.name": "Starter Pool" } } EOF sessionId=$(az rest --method post --resource "$FABRIC_RESOURCE_SCOPE" --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions" --body @/tmp/body.json --query "id" --output tsv) echo "⏳ Waiting for starter pool session to be ready..." # With starter pools, this should be 3-5 seconds timeout=30 # Reduced from 90s since starter pools are fast while [ $timeout -gt 0 ]; do state=$(az rest --resource "$FABRIC_RESOURCE_SCOPE" --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions/$sessionId" --query "state" --output tsv) if [[ "$state" == "idle" ]]; then echo "✅ Session ready in starter pool!" break fi echo " Session state: $state (${timeout}s remaining)" sleep 3 timeout=$((timeout - 3)) done fi ``` ### Data Exploration (Fabric-Specific Patterns) ```bash # Execute statement (LLM knows Python/Spark syntax) cat > /tmp/body.json << 'EOF' { "code": "spark.sql(\"SHOW TABLES\").show(); df = spark.table(\"your_table\"); df.describe().show()", "kind": "pyspark" } EOF az rest --method post --resource "$FABRIC_RESOURCE_SCOPE" --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions/$sessionId/statements" --body @/tmp/body.json ``` ## Key Fabric Patterns | Pattern | Code | Use Case | |---|---|---| | **Table Discovery** | `spark.sql("SHOW TABLES")` | List available tables | | **Cross-Lakehouse** | `spark.sql("SELECT * FROM other_workspace.table")` | Query across workspaces | | **Delta Features** | `df.history()`, `df.readVersion(1)` | Time travel, versioning | | **Schema Evolution** | `df.printSchema()` | Understand structure | ## Lakehouse Livy Session Cleanup ```bash # Clean up idle Lakehouse Livy sessions (optional) az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions" --query "sessions[?state=='idle'].id" --output tsv | xargs -I {} az rest --method delete --resource "$FABRIC_RESOURCE_SCOPE" --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions/{}" ``` --- **Focus**: This skill provides Fabric-specific REST API patterns. LLM already knows Python/Spark syntax — we focus on Fabric integration, session management, and API endpoints.