--- name: data-processing description: "Use when working with structured data files (CSV, JSON, YAML, TOML, Parquet) — querying, transforming, filtering, aggregating, or converting between formats" allowed-tools: [Bash(jq*), Bash(yq*), Bash(gron*), Bash(mlr*), Bash(xsv*), Bash(duckdb*), Read, Glob, Grep] version: 1.0.0 author: ykotik license: MIT --- # Data Processing ## When to Use - Querying, filtering, or transforming JSON files - Reading or converting YAML, TOML, or XML config files - Analyzing, aggregating, or joining CSV/TSV/Parquet files - Running SQL queries against local data files without a database server - Converting between data formats (JSON to YAML, CSV to JSON, etc.) - Exploring deeply nested JSON structures ## Tools | Tool | Purpose | Structured output | |------|---------|-------------------| | **jq** | Query and transform JSON | Native JSON | | **yq** | Query and transform YAML, TOML, XML | `-o json` for JSON output | | **gron** | Flatten JSON into greppable `path.to.key = value` lines | `--ungron` to reverse back to JSON | | **miller (mlr)** | Transform CSV/JSON/TSV records with awk-like verbs | `--json` for JSON output | | **xsv** | Fast CSV slicing, searching, joining, statistics | CSV native (pipe to `xsv table` for display) | | **DuckDB** | SQL queries on CSV, JSON, Parquet files | `-json` flag for JSON output | ## Patterns ### JSON: Filter array elements by field value ```bash jq '.items[] | select(.status == "active")' data.json ``` ### JSON: Extract specific fields from array ```bash jq '[.users[] | {name: .name, email: .email}]' data.json ``` ### JSON: Count items grouped by field ```bash jq '[.events[] | .type] | group_by(.) | map({type: .[0], count: length})' data.json ``` ### YAML: Read a nested value ```bash yq '.spec.containers[0].image' deployment.yaml ``` ### YAML: Convert entire file to JSON ```bash yq -o json config.yaml ``` ### TOML: Read a value ```bash yq -p toml '.database.host' config.toml ``` ### JSON: Explore unknown structure by grepping paths ```bash gron data.json | grep -i "error" ``` ### CSV: Column statistics (min, max, mean, stddev) ```bash xsv stats data.csv | xsv table ``` ### CSV: Search rows matching a pattern in a column ```bash xsv search -s status "active" data.csv ``` ### CSV: Select specific columns ```bash xsv select name,email,created_at users.csv ``` ### CSV: Sort by column ```bash xsv sort -s revenue -R sales.csv ``` ### SQL: Query a CSV file ```bash duckdb -c "SELECT department, COUNT(*) as cnt, AVG(salary) as avg_sal FROM 'employees.csv' GROUP BY department ORDER BY avg_sal DESC" ``` ### SQL: Query a JSON file ```bash duckdb -c "SELECT * FROM read_json_auto('events.json') WHERE type = 'error' LIMIT 20" ``` ### SQL: Query Parquet files ```bash duckdb -c "SELECT * FROM 'data/*.parquet' WHERE created_at > '2026-01-01'" ``` ### CSV/JSON: Transform records with miller ```bash mlr --csv filter '$revenue > 1000' then sort-by -nr revenue sales.csv ``` ### Format conversion: CSV to JSON ```bash mlr --icsv --ojson cat data.csv ``` ## Pipelines ### YAML config → JSON → SQL query ```bash yq -o json config.yaml | duckdb -c "SELECT key, value FROM read_json_auto('/dev/stdin') WHERE env = 'production'" ``` Each stage: yq converts YAML to JSON, DuckDB runs SQL on the JSON stream. ### Grep nested JSON paths → reconstruct matching subset ```bash gron large.json | grep "\.errors\[" | gron --ungron ``` Each stage: gron flattens JSON to paths, grep filters, ungron reconstructs valid JSON from matches. ### CSV filter → aggregate with SQL ```bash xsv search -s region "EU" sales.csv | duckdb -c "SELECT product, SUM(revenue) as total FROM read_csv_auto('/dev/stdin') GROUP BY product ORDER BY total DESC" ``` Each stage: xsv filters rows by region, DuckDB aggregates the filtered stream. ### Join two CSV files with SQL ```bash duckdb -c "SELECT u.name, u.email, o.total, o.date FROM 'users.csv' u JOIN 'orders.csv' o ON u.id = o.user_id ORDER BY o.date DESC" ``` ### Multi-format pipeline: JSON → CSV → stats ```bash jq -r '.records[] | [.name, .score] | @csv' data.json | xsv stats ``` Each stage: jq extracts fields to CSV format, xsv computes statistics. ## Prefer Over - Prefer **DuckDB** over Python/pandas for ad-hoc SQL queries on files — single command, no script, handles large files - Prefer **jq** over Python `json` module for one-off JSON transforms — single pipeline vs. multi-line script - Prefer **xsv** over `awk`/`cut` for CSV operations — correct CSV parsing, handles quoted fields and escapes - Prefer **miller** over `awk` for format-aware record transformations — understands CSV/JSON headers natively - Prefer **yq** over custom parsers for config file reads — handles YAML, TOML, XML with consistent syntax ## Do NOT Use When - Data is already in a running database — query the database directly - File is under 10 lines — just use the Read tool and process in-context - Task requires complex multi-step logic with conditionals — write a Python script instead - JSON is simple enough to read by eye — use the Read tool, don't over-engineer - Working with binary or non-structured data formats