---
name: data-wrangler
description: >
  Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases,
  apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and
  optionally write results to files. Use when the user wants to: (1) Clean, filter,
  or transform data, (2) Join multiple data sources, (3) Convert between formats
  (CSV→Parquet, etc.), (4) Create partitioned datasets, (5) Sample large datasets,
  (6) Export query results. Prefer this over in-context reasoning for datasets with
  thousands of rows or complex transformations.
---

# Data Wrangler

Transform and export data using DuckDB SQL.

## Contents

- [Usage](#usage) - Command syntax and Windows escaping
- [Explore Mode](#explore-mode) - Quick data profiling
- [Query Mode](#query-mode) - Return results to Claude
- [Write Mode](#write-mode) - Export to files
- [Request/Response Format](#requestresponse-format) - JSON structure
- [Source Types](#source-types) - File, database, and cloud sources
- [Transformations](#transformations) - SQL patterns reference
- [Secrets](#secrets) - Secure credential handling

## Usage

**IMPORTANT - Windows Shell Escaping:**

1. Always `cd` to the skill directory first
2. Use **double quotes** for echo with escaped inner quotes (`\"`)
3. Use **forward slashes** in file paths

```bash
cd "" && echo "{\"query\": \"SELECT * FROM 'D:/path/to/file.csv'\"}" | uv run scripts/query_duckdb.py
```

## Explore Mode

**Get schema, statistics, and sample in one call.** Use before writing queries to understand data structure.

```json
{"mode": "explore", "path": "D:/data/sales.csv"}
```

**Response:**

```json
{
  "file": "D:/data/sales.csv",
  "format": "csv",
  "row_count": 15234,
  "columns": [
    {"name": "order_id", "type": "BIGINT", "null_count": 0, "null_percent": 0.0},
    {"name": "customer", "type": "VARCHAR", "null_count": 45, "null_percent": 0.3}
  ],
  "sample": "| order_id | customer | ... |\n|----------|----------|-----|\n| 1001 | Alice | ... |"
}
```

**Options:**

- `sample_rows`: Number of sample rows (default: 10, max: 100)
- `sources`: For database tables (same as query mode)

## Query Mode

Return results directly to Claude for analysis.

### Direct File Queries

```json
{"query": "SELECT * FROM 'data.csv' LIMIT 10"}
```

### Multi-Source Joins

```json
{
  "query": "SELECT s.*, p.category FROM sales s JOIN products p ON s.product_id = p.id",
  "sources": [
    {"type": "file", "alias": "sales", "path": "/data/sales.parquet"},
    {"type": "file", "alias": "products", "path": "/data/products.csv"}
  ]
}
```
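Query-mode requests are sent with the same command pattern shown in [Usage](#usage). As an illustrative sketch (the path and column names are placeholders, not real data), a complete invocation that aggregates a CSV and returns the top customers as a markdown table:

```bash
cd "" && echo "{\"query\": \"SELECT customer, SUM(amount) AS total FROM 'D:/data/sales.csv' GROUP BY customer ORDER BY total DESC LIMIT 10\"}" | uv run scripts/query_duckdb.py
```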
## Write Mode

Export query results to files. Add an `output` object to write instead of returning data.

### Basic Write

```json
{
  "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
  "output": {
    "path": "D:/output/filtered.parquet",
    "format": "parquet"
  }
}
```

### Write with Options

```json
{
  "query": "SELECT *, YEAR(date) as year, MONTH(date) as month FROM 'events.csv'",
  "output": {
    "path": "D:/output/events/",
    "format": "parquet",
    "options": {
      "compression": "zstd",
      "partition_by": ["year", "month"],
      "overwrite": true
    }
  }
}
```

### Output Formats

| Format | Options |
|--------|---------|
| `parquet` | `compression` (zstd/snappy/gzip/lz4), `partition_by`, `row_group_size` |
| `csv` | `header` (default: true), `delimiter`, `compression`, `partition_by` |
| `json` | `array` (true=JSON array, false=newline-delimited) |

### Write Response

Response includes verification info - no need for follow-up queries:

```json
{
  "success": true,
  "output_path": "D:/output/events/",
  "format": "parquet",
  "rows_written": 15234,
  "files_created": ["D:/output/events/year=2023/data_0.parquet", "..."],
  "total_size_bytes": 5678901,
  "duration_ms": 1234
}
```

### Overwrite Protection

By default, existing files are **not** overwritten. Set `options.overwrite: true` to allow.

## Request/Response Format

### Request

```json
{
  "query": "SQL statement",
  "sources": [...],
  "output": {"path": "...", "format": "..."},
  "options": {"max_rows": 200, "format": "markdown"},
  "secrets_file": "path/to/secrets.yaml"
}
```

### Query Mode Options

- `max_rows`: Maximum rows to return (default: 200)
- `max_bytes`: Maximum response size (default: 200000)
- `format`: `markdown` (default), `json`, `records`, or `csv`

### Query Mode Response (markdown)

```
| column1 | column2 |
|---|---|
| value1 | value2 |
```

### Query Mode Response (json)

```json
{
  "schema": [{"name": "col1", "type": "INTEGER"}],
  "rows": [[1, "value"]],
  "truncated": false,
  "warnings": [],
  "error": null
}
```

## Source Types

### File (auto-detects CSV, Parquet, JSON, Excel)

```json
{"type": "file", "alias": "data", "path": "/path/to/file.csv"}
```

Glob patterns: `{"path": "/logs/**/*.parquet"}`

Custom delimiter: `{"path": "/data/file.csv", "delimiter": "|"}`

### PostgreSQL

```json
{
  "type": "postgres",
  "alias": "users",
  "host": "host",
  "port": 5432,
  "database": "db",
  "user": "user",
  "password": "pass",
  "schema": "public",
  "table": "users"
}
```

### MySQL

```json
{
  "type": "mysql",
  "alias": "orders",
  "host": "host",
  "port": 3306,
  "database": "db",
  "user": "user",
  "password": "pass",
  "table": "orders"
}
```

### SQLite

```json
{"type": "sqlite", "alias": "data", "path": "/path/to/db.sqlite", "table": "tablename"}
```

### S3

```json
{
  "type": "s3",
  "alias": "logs",
  "url": "s3://bucket/path/*.parquet",
  "aws_region": "us-east-1",
  "aws_access_key_id": "...",
  "aws_secret_access_key": "..."
}
```
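Source types can be mixed in a single request. As a sketch built from the source formats above (connection details, paths, and column names are placeholders), joining a PostgreSQL table against a local Parquet file looks like:

```json
{
  "query": "SELECT o.order_id, o.amount, u.name FROM orders o JOIN users u ON o.user_id = u.id",
  "sources": [
    {"type": "postgres", "alias": "users", "host": "host", "port": 5432, "database": "db", "user": "user", "password": "pass", "schema": "public", "table": "users"},
    {"type": "file", "alias": "orders", "path": "/data/orders.parquet"}
  ]
}
```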
## Transformations

See [TRANSFORMS.md](TRANSFORMS.md) for advanced patterns including:

- **PIVOT/UNPIVOT** - Reshape data between wide and long formats
- **Sampling** - Random subsets with `USING SAMPLE n ROWS` or `SAMPLE 10%`
- **Dynamic columns** - `EXCLUDE`, `REPLACE`, `COLUMNS('pattern')`
- **Window functions** - Running totals, rankings, moving averages
- **Date/time operations** - Extraction, arithmetic, formatting

### Quick Examples

```sql
-- PIVOT: Convert rows to columns
PIVOT sales ON quarter USING SUM(revenue) GROUP BY region

-- UNPIVOT: Convert columns to rows
UNPIVOT data ON q1, q2, q3, q4 INTO NAME quarter VALUE amount

-- Sampling: Random 10% with reproducible seed
SELECT * FROM large_table USING SAMPLE 10% REPEATABLE(42)

-- Dynamic columns: Exclude sensitive, transform email
SELECT * EXCLUDE (ssn) REPLACE (LOWER(email) AS email) FROM users
```

## Workflow

1. **Inspect schema**: `DESCRIBE SELECT * FROM 'file.csv'`
2. **Preview data**: `SELECT * FROM 'file.csv' LIMIT 5`
3. **Transform**: Apply filters, joins, aggregations
4. **Export** (optional): Add `output` to write results

## Error Handling

- If `error` is non-null: Check column names, verify paths
- If `truncated` is true: Use more aggregation or filters
- If write fails with "exists": Set `options.overwrite: true`

## Secrets

Store credentials securely in YAML. See [SECRETS.md](SECRETS.md) for complete documentation.

```json
{
  "query": "SELECT * FROM customers LIMIT 10",
  "secrets_file": "D:/path/to/secrets.yaml",
  "sources": [{
    "type": "postgres",
    "alias": "customers",
    "secret": "my_postgres",
    "table": "customers"
  }]
}
```

Supported: PostgreSQL, MySQL, S3, GCS, Azure, R2, HTTP, HuggingFace, Iceberg, DuckLake.
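The exact secrets file schema is defined in [SECRETS.md](SECRETS.md). As a rough, hypothetical sketch only, a named entry such as `my_postgres` above would map a secret name to connection fields along these lines:

```yaml
# Hypothetical layout, not the authoritative format - consult SECRETS.md
my_postgres:
  type: postgres
  host: host
  port: 5432
  database: db
  user: user
  password: pass
```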