--- name: querying-data-lake description: >- Execute and manage Athena SQL queries across default and federated catalogs (Glue, S3 Tables, Redshift). Triggers on phrases like: query data, run SQL, athena query, analyze table, SQL query, workgroup status, profile table, query Redshift catalog, query S3 Tables. Do NOT use for finding specific data assets (use finding-data-lake-assets), full catalog audits (use exploring-data-catalog), importing data (use ingesting-into-data-lake). version: 1 argument-hint: '[SQL-query|query-name|workgroup-name|catalog-name|''profile TABLE_NAME'']' --- # Query Data Lake Execute SQL queries on Amazon Athena across default and federated catalogs (Glue, S3 Tables, Redshift) with workgroup selection, statement classification, and error recovery. ## Overview Executes and manages Athena SQL queries across default and federated catalogs. Selects a workgroup, resolves target assets (delegating fuzzy references to `finding-data-lake-assets`), classifies statements for safety, and reports cost and data scanned. Use the AWS MCP server for sandboxed execution and audit logging; the same AWS CLI commands work directly when the MCP server is not available. **Constraints for parameter acquisition:** - You MUST accept a single optional argument: SQL text, a named-query name, a workgroup name, a catalog name, or `profile TABLE_NAME` - You MUST accept the argument as direct text or a pointer to a file containing SQL - You MUST ask the user for the target AWS region if not already set - You MUST confirm the output S3 location before executing any non-trivial query - You MUST respect the user's decision to abort at any step ## Common Tasks ### 1. Verify Dependencies Check for required tools and AWS access before running queries. **Constraints:** - You MUST verify AWS MCP server tools are available (`aws___call_aws`) and run queries through them when present; fall back to AWS CLI only if the MCP server is unavailable - You MUST NOT fall back to shell or Bash for query execution — results must be captured via the MCP tool or `aws athena` CLI so output location and cost are tracked - You MUST confirm credentials with `aws sts get-caller-identity` and inform the user about any missing tools ### 2. Resolve Workgroup Check caller identity, list workgroups, auto-select the best one (see [workgroup-selection.md](references/workgroup-selection.md)). **Constraints:** - You MUST select a workgroup before submitting any query (prevents output-location errors) - You MUST present the selected workgroup and its output location to the user - You MUST NOT auto-escalate to a different workgroup on failure without user confirmation ### 3. Resolve the Target Asset If the user refers to a table by name, by business concept ("our quarterly report", "the sales data"), by S3 path, or by catalog without specifying the table, delegate to `finding-data-lake-assets` to return the concrete `database.table` (and catalog if non-default). **Constraints:** - You MUST NOT attempt to resolve fuzzy asset references with `athena list-data-catalogs` or by iterating `get-tables` — those miss federated catalogs and waste tokens - You SHOULD skip this step only when the user provides a fully-qualified reference (exact `database.table`) or raw SQL they want executed as-is - You MUST state the resolved asset explicitly before building the query: "Found [table] in [catalog]. Using this for the query." - You SHOULD default to the default Glue catalog unless the user mentions "federated", "Redshift", "S3 Tables", or `finding-data-lake-assets` returns a different catalog ### 4. Discover Schema For analytical queries, You SHOULD profile the target table before building the final query. You MUST show sample rows (`SELECT ... LIMIT 5`) as part of profiling. ### 5. Build Query Table addressing depends on catalog type: - Default Glue catalog: `database.table` (omit the catalog prefix for single-catalog queries). In cross-catalog queries, qualify default-catalog tables with `"awsdatacatalog".database.table`. - Registered data source: `datasource.database.table` - Unregistered Glue catalog: `"catalog/subcatalog".database.table` ### 6. Classify and Execute Classify the SQL statement before executing: | Statement | Behavior | |---|---| | `SELECT`, `SHOW`, `DESCRIBE`, `EXPLAIN` | Safe — execute | | `INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `CREATE`, `TRUNCATE`, `MERGE` | Destructive — warn the user and require explicit confirmation | | Unsure | Treat as destructive; confirm | Example tool call (via AWS MCP server): ``` aws___call_aws(command="aws athena start-query-execution --work-group --query-string '' --query-execution-context Database=") ``` For federated or S3 Tables catalogs, also set `Catalog=` in the execution context (e.g. `Catalog=s3tablescatalog/`). **Constraints:** - You MUST warn the user before executing when the target is Redshift-federated ("No partition pruning — every query scans the full table") - You MUST warn the user before executing a cross-catalog join ("Cross-catalog joins incur network overhead and may be slow") - You MUST confirm the output S3 location before executing - You MUST explain which tool is being called before executing - You MUST respect the user's decision to abort ### 7. Present and Recover Present results with cost, data scanned, duration, and actionable insights. On failure, list available workgroups and let the user choose which to retry with. ### Argument Routing Resolve in this order; stop at the first match: 1. Contains SQL keywords (`SELECT`, `SHOW`, `DESCRIBE`, `INSERT`, etc.) — SQL text, execute directly 2. `profile TABLE_NAME` — run comprehensive table profiling (see [query-patterns.md](references/query-patterns.md)) 3. Matches a known named query — look up and execute 4. Matches a known workgroup — show workgroup status and recent queries 5. Matches a known catalog — delegate to `exploring-data-catalog` to enumerate databases and tables 6. No args — show recent query activity and available tables ### Principles - Always select workgroup before executing (prevents output-location errors) - Profile unfamiliar tables before running analytical queries - Present cost alongside results so users build cost awareness - Suggest `LIMIT` for exploratory queries on large tables - Never ask domain questions with obvious answers, but always confirm security-relevant actions (workgroup switches, output location changes, non-SELECT statements) ## Troubleshooting | Error | Cause | Fix | |---|---|---| | Redshift identifier error with mixed case | Redshift-federated names are lowercase only | Lowercase the identifier | | `CatalogId` validation failure | ARN passed instead of catalog name | Pass the catalog name, not the ARN | | Cross-catalog `information_schema` returns nothing | Missing catalog qualifier | Use catalog-qualified path: `"catalog".information_schema.tables` | | Query fails with output-location error | Workgroup has no output location configured | Select a different workgroup with an output location, or configure one | | Destructive statement executed without confirmation | Statement classification skipped | Always classify `INSERT`/`UPDATE`/`DELETE`/`DROP`/`ALTER`/`CREATE`/`TRUNCATE`/`MERGE` and confirm with the user | ## Additional Resources - [Workgroup selection logic](references/workgroup-selection.md) - [Common query patterns](references/query-patterns.md) - [Athena best practices](https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html) - [Athena federated query](https://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source.html)