---
name: duckdb
description: "Use this skill when working with DuckDB for local data analysis, file ingestion, or data exploration. Covers reading CSV/Excel/Parquet/JSON files into DuckDB, SQL analytics on local data, data profiling, cleaning transformations, and export to various formats. Common phrases: \"analyze this CSV\", \"DuckDB query\", \"local data analysis\", \"read Excel in SQL\", \"profile this data\". Do NOT use for dbt model building (use dbt-transforms with DuckDB adapter) or cloud warehouse administration."
model:
  preferred: sonnet
  acceptable: [sonnet, opus]
  minimum: sonnet
  allow_downgrade: false
reasoning_demand: medium
version: 1.0.0
---

# DuckDB Local Skill for Claude

Local-first SQL analytics on files. Read, profile, clean, and export data without a warehouse.

## When to Use This Skill

**Activate when:** Reading local files (CSV, Parquet, JSON, Excel) into DuckDB, profiling data quality, running SQL analytics on local data, cleaning/transforming datasets, exporting results to various formats.

**Don't use for:** Building dbt models (use dbt-transforms with the DuckDB adapter), cloud warehouse administration, pipeline scheduling (use data-pipelines), Python DataFrame transformations (use python-data-engineering).

## Scope Constraints

- Local files only -- does not manage cloud warehouse connections or remote databases.
- No cloud warehouse administration (Snowflake, BigQuery, Redshift).
- Reference files are loaded on demand -- do not pre-load multiple references.
- File paths must be validated before use; never execute user-provided paths without checking that they exist.

## Model Routing

| reasoning_demand | preferred | acceptable | minimum |
|------------------|-----------|--------------|---------|
| medium | Sonnet | Sonnet, Opus | Sonnet |

## Core Principles

1. **Local-first** -- Process data on the user's machine; no cloud credentials required.
2. **SQL-native** -- Use SQL for all transformations; avoid unnecessary Python wrappers.
3. **Schema inference** -- Let DuckDB auto-detect types; override only when inference fails.
4. **Zero config** -- Start with `import duckdb` or the `duckdb` CLI; no server setup.
5. **Format-agnostic** -- Read CSV, Parquet, JSON, and Excel through a unified SQL interface.

## Default Engine Positioning

DuckDB is the default analytical engine for this skill suite. Assume DuckDB unless told otherwise.

**When to target a warehouse instead:**

- Data volume exceeds ~100 GB or requires concurrent multi-user access
- The client provides warehouse access (Snowflake, BigQuery, Databricks)
- A production workload needs scheduling, monitoring, and SLA guarantees
- Regulatory requirements mandate data residency in a managed platform

When a warehouse is needed, DuckDB still serves as the local dev and validation target: build and test locally, then deploy to the warehouse.
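A quick way to sanity-check the volume threshold above is to total the size of the input files before committing to a local run. A minimal sketch, assuming the inputs live under a local `data/` directory (an illustrative layout, not a convention of this skill):

```python
from pathlib import Path

# Sum the size of every file under data/ to gauge whether the ~100 GB
# local-comfort threshold above is in play. The data/ directory is an assumption.
total_bytes = sum(p.stat().st_size for p in Path("data").rglob("*") if p.is_file())
print(f"Input volume: {total_bytes / 1e9:.1f} GB")
```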
## Core Usage Patterns

| Pattern | Description | When to Use |
|---------|-------------|-------------|
| **DuckDB as dbt target** | `dbt-duckdb` adapter for local transforms | Prototyping, client demos, small datasets |
| **Standalone query engine** | Direct SQL on files, no framework | Ad-hoc analysis, data profiling, one-off cleaning |
| **DuckDB + DLT** | DLT loads files into DuckDB | Development pipelines with destination swapping |
| **Dev target, warehouse deploy** | DuckDB locally, Snowflake/BigQuery in prod | Full engagement lifecycle |

## Quick Start

```bash
pip install duckdb
```

### Python

```python
import duckdb

# Read a CSV and query it
result = duckdb.sql("SELECT * FROM read_csv_auto('data.csv') LIMIT 10")
result.show()
```

### CLI

```bash
duckdb -c "SELECT count(*) FROM read_csv_auto('data.csv')"
```

### Persistent Database

```python
con = duckdb.connect("my_analysis.duckdb")
con.sql("CREATE TABLE customers AS SELECT * FROM read_csv_auto('customers.csv')")
con.sql("SELECT count(*) FROM customers").show()
```

### Excel Files

```sql
INSTALL excel;
LOAD excel;
SELECT * FROM read_xlsx('data.xlsx', sheet='Sheet1');
```

## File Ingestion

| Format | Function | Auto-Detect | Glob Support | Notes |
|--------|----------|-------------|--------------|-------|
| CSV/TSV | `read_csv_auto()` | delimiter, header, types | Yes | Most common; handles most encodings |
| Parquet | `read_parquet()` | schema from metadata | Yes | Fastest; preserves types perfectly |
| JSON | `read_json_auto()` | structure, nesting | Yes | Handles NDJSON and arrays |
| Excel | `read_xlsx()` (excel extension) | sheet, headers | No | Requires `INSTALL excel; LOAD excel;` (DuckDB >= 1.2) |

### Basic Ingestion

```sql
-- CSV with auto-detection
SELECT * FROM read_csv_auto('customers.csv');

-- Parquet with glob
SELECT * FROM read_parquet('data/events/*.parquet');

-- JSON (newline-delimited)
SELECT * FROM read_json_auto('logs.ndjson', format='newline_delimited');

-- Multiple CSVs, union by column name
SELECT * FROM read_csv_auto('data/*.csv', union_by_name=true);
```

**For detailed ingestion options, see:** [File Ingestion Reference](references/file-ingestion.md)

**For Excel-specific patterns, see:** [Excel Specifics](references/excel-specifics.md)

## Data Profiling Patterns

### Column Statistics

```sql
SELECT column_name
     , column_type
     , count
     , null_percentage
     , approx_unique
     , min
     , max
FROM (SUMMARIZE SELECT * FROM read_csv_auto('data.csv'));
```

### NULL Rates and Cardinality

```sql
SELECT count(*) as total_rows
     , count(email) as email_present
     , round(100.0 * count(email) / count(*), 1) as email_pct
     , count(distinct email) as email_unique
     , count(email) - count(distinct email) as email_duplicates
FROM read_csv_auto('customers.csv');
```

### Value Distribution

```sql
-- Top 10 values for a column
SELECT status
     , count(*) as cnt
     , round(100.0 * count(*) / sum(count(*)) over (), 1) as pct
FROM read_csv_auto('orders.csv')
GROUP BY status
ORDER BY cnt DESC
LIMIT 10;
```

## Cleaning in SQL

Common transformations executed directly in DuckDB:

```sql
-- Dedup: keep the most recent row per email
CREATE TABLE customers_clean AS
SELECT * EXCLUDE (row_num)
FROM (
    SELECT *
         , row_number() OVER (PARTITION BY email ORDER BY updated_at DESC) as row_num
    FROM read_csv_auto('customers.csv')
)
WHERE row_num = 1;
```

```sql
-- Type casting and trimming
SELECT trim(name) as name
     , lower(trim(email)) as email
     , try_cast(revenue as decimal(12,2)) as revenue
     , strptime(date_str, '%m/%d/%Y')::date as order_date
     , regexp_replace(phone, '[^0-9]', '', 'g') as phone_digits
FROM read_csv_auto('messy_data.csv');
```
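The profiling and cleaning queries above can also be driven through the Python API when the results need to land in a DataFrame for inspection. A minimal sketch, assuming a local `customers.csv` with `email` and `updated_at` columns, and that pandas is installed (required by `.df()`):

```python
import duckdb

con = duckdb.connect()  # in-memory database

# Profile the raw file; SUMMARIZE returns one row per column with type,
# null percentage, approximate distinct count, min/max, and more.
profile = con.sql("SUMMARIZE SELECT * FROM read_csv_auto('customers.csv')").df()
print(profile[["column_name", "column_type", "null_percentage", "approx_unique"]])

# Apply the dedup pattern above and keep the result as a table for later steps.
con.sql("""
    CREATE TABLE customers_clean AS
    SELECT * EXCLUDE (row_num)
    FROM (
        SELECT *
             , row_number() OVER (PARTITION BY email ORDER BY updated_at DESC) AS row_num
        FROM read_csv_auto('customers.csv')
    )
    WHERE row_num = 1
""")
print(con.sql("SELECT count(*) FROM customers_clean").fetchone())
```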
## Export Patterns

```sql
-- CSV
COPY (SELECT * FROM customers_clean) TO 'output/customers.csv' (HEADER, DELIMITER ',');

-- Parquet (recommended for downstream analytics)
COPY (SELECT * FROM customers_clean) TO 'output/customers.parquet' (FORMAT PARQUET);

-- JSON (newline-delimited)
COPY (SELECT * FROM customers_clean) TO 'output/customers.json' (FORMAT JSON);
```

**For detailed export options, see:** [Export Patterns Reference](references/export-patterns.md)

## Performance Tips

| Setting | Command | Use When |
|---------|---------|----------|
| Memory limit | `SET memory_limit = '4GB';` | Large datasets on constrained machines |
| Threads | `SET threads = 4;` | Control parallelism |
| Temp directory | `SET temp_directory = '/tmp/duckdb';` | Spill to disk for large sorts |
| Progress bar | `SET enable_progress_bar = true;` | Long-running queries |

File format performance: **Parquet >> CSV >> JSON**. Convert to Parquet early for repeated queries (see the sketch at the end of this document).

**For detailed performance tuning, see:** [Performance Reference](references/performance.md)

## Security Posture

See [Security & Compliance Patterns](../shared-references/data-engineering/security-compliance-patterns.md) for the full framework.
See [Consulting Security Tier Model](../shared-references/data-engineering/security-tier-model.md) for tier definitions.

| Capability | Tier 1 (Schema-Only) | Tier 2 (Sampled) | Tier 3 (Full Access) |
|------------|----------------------|------------------|----------------------|
| Read local files | Metadata only | Anonymized samples | Full data |
| Write/export files | Schema DDL only | Sample outputs | Full exports |
| Profiling | Column types/names | Stats on samples | Full profiling |
| Cleaning transforms | Compile/validate SQL | Execute on samples | Execute on full data |

- **Never** hardcode file paths containing credentials or sensitive data in scripts.
- **Never** commit data files to version control; use `.gitignore`.
- Validate that file paths exist before reading; reject paths outside the working directory.

## Reference Files

Reference files are loaded on demand:

- **[File Ingestion](references/file-ingestion.md)** -- CSV, Parquet, and JSON parameters, glob patterns, schema inference, common issues
- **[Excel Specifics](references/excel-specifics.md)** -- Excel extension (read_xlsx), sheet selection, merged cells, date conversion, spatial fallback
- **[Export Patterns](references/export-patterns.md)** -- CSV/Parquet/JSON options, Python API export, partitioned writes
- **[Performance](references/performance.md)** -- Memory limits, threads, EXPLAIN ANALYZE, pragma settings, format comparison
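As a closing sketch tying the export and performance sections together, the settings from the Performance Tips table can be applied through the Python API before a one-time Parquet conversion. The file paths and resource values here are illustrative assumptions, not recommendations:

```python
import duckdb

con = duckdb.connect()

# Illustrative resource limits from the Performance Tips table; tune per machine.
con.sql("SET memory_limit = '4GB'")
con.sql("SET threads = 4")
con.sql("SET enable_progress_bar = true")

# Convert a large CSV to Parquet once, then point repeated queries at the
# faster format (Parquet >> CSV for repeated scans). Paths are assumptions.
con.sql("""
    COPY (SELECT * FROM read_csv_auto('data/events.csv'))
    TO 'data/events.parquet' (FORMAT PARQUET)
""")
print(con.sql("SELECT count(*) FROM read_parquet('data/events.parquet')").fetchone())
```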