--- name: motherduck-load-data description: Load data into MotherDuck from local files, object storage, HTTPS, dataframes, or external databases. Use when choosing a MotherDuck-specific ingestion path, especially CTAS and INSERT...SELECT, bulk loading, secrets, and Postgres-endpoint versus DuckDB-client tradeoffs. argument-hint: [source-and-target] license: MIT --- # Load Data into MotherDuck Use this skill when the job is getting data into MotherDuck correctly and efficiently, not just writing one ad hoc import query. ## Source Of Truth - Prefer current MotherDuck loading, cloud-storage, and Postgres-endpoint loading docs first. - Use `CREATE SECRET` and cloud-storage docs for protected-object-store workflows. - Use the DuckDB database upload docs when the source is an existing local `.duckdb`, `.ddb`, or attached DuckDB database. - Keep the loading advice aligned with MotherDuck's documented posture: - batch over streaming - Parquet over CSV when you control the format - dataframe, `COPY`, CTAS, or `INSERT ... SELECT` over row-by-row inserts - native MotherDuck storage first unless DuckLake is explicitly required ## Default Posture - Start by classifying the source: object storage or HTTPS, local file or local DuckDB, in-memory rows, or an external database. - Prefer `CREATE TABLE AS SELECT` for first loads and `INSERT INTO ... SELECT` for appends. - For whole local DuckDB databases, use `CREATE OR REPLACE DATABASE remote_name FROM CURRENT_DATABASE()`, an attached local database, or a file path from a native DuckDB client after attaching `md:`. - Use Parquet for durable bulk movement whenever you control the source format. - Treat the Postgres endpoint as a thin-client path for server-side remote reads, not for local-file or extension-driven ingestion. - Bootstrap the target MotherDuck database first when the ingestion tool does not create it automatically. - Keep raw landing minimally transformed; do typing, deduplication, and business logic in staging or modeling steps. - Keep source storage close to the MotherDuck region when you control placement. ## Workflow 1. Identify where the source data actually lives. 2. Choose the loading path: - object storage or HTTPS: remote read into MotherDuck - local file or local DuckDB: use a DuckDB client path - in-memory rows: Arrow or dataframe bulk load first, batched inserts only as a fallback - external database: use the appropriate scan or replication path from a DuckDB-capable environment 3. Land the data into a raw or staging table with minimal transformation. 4. Validate row counts, types, and a few business aggregates immediately after the load. 5. Promote into modeled tables only after the landing step is correct. ## Open Next - `references/INGESTION_PATTERNS.md` for format-specific options, cloud-storage secrets, Postgres-endpoint loading tradeoffs, Python dataframe paths, and advanced ingestion patterns ## Related Skills - `motherduck-connect` for choosing between the Postgres endpoint and a DuckDB client path - `motherduck-explore` for inspecting destination databases and validating landed tables - `motherduck-query` for writing CTAS, append, and validation SQL - `motherduck-model-data` for promoting landed data into staging and analytics tables - `motherduck-ducklake` only when object-storage-backed lakehouse storage is an explicit requirement