# pg_duckdb Syntax Guide & Gotchas

This guide provides a quick reference for the most common SQL patterns used with `pg_duckdb`, along with key behaviors and limitations to be aware of.

## Syntax Reference

### Create a table

```sql
-- This is a standard PostgreSQL table
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    product_name TEXT,
    amount NUMERIC,
    order_date DATE
);

INSERT INTO orders (product_name, amount, order_date) VALUES
    ('Laptop', 1200.00, '2024-07-01'),
    ('Keyboard', 75.50, '2024-07-01'),
    ('Mouse', 25.00, '2024-07-02');
```

### Querying standard PostgreSQL tables

For analytical queries on your existing PostgreSQL tables, use **standard SQL**. No special syntax is needed. `pg_duckdb` accelerates these queries once you set `duckdb.force_execution = true`.

```sql
SET duckdb.force_execution = true;

-- Standard SELECT on a PostgreSQL table
SELECT
    category,
    AVG(price) AS avg_price,
    COUNT(*) AS item_count
FROM products  -- This is a regular PostgreSQL table
GROUP BY category
ORDER BY avg_price DESC;
```

### Querying external files (Parquet, CSV, etc.)

To query files from a data lake (e.g., S3, local storage), use the `read_*` functions. You must use the `r['column_name']` syntax to access columns.

```sql
-- Query a single Parquet file
SELECT r['product_id'], r['review_text']
FROM read_parquet('s3://my-bucket/reviews.parquet') r  -- 'r' is a required alias
LIMIT 100;

-- Query multiple CSV files using a glob pattern
SELECT r['timestamp'], r['event_type'], COUNT(*) AS event_count
FROM read_csv('s3://my-datalake/logs/2024-*.csv') r
GROUP BY r['timestamp'], r['event_type'];
```

### Hybrid queries (joining PostgreSQL and external data)

You can seamlessly join PostgreSQL tables with external data sources in a single query.

```sql
-- Join a local PostgreSQL 'customers' table with a remote Parquet file of 'orders'
SELECT
    c.customer_name,
    c.signup_date,
    SUM(r['order_total']) AS total_spent
FROM customers c  -- This is a PostgreSQL table
JOIN read_parquet('s3://my-bucket/orders/*.parquet') r
    ON c.customer_id = r['customer_id']
WHERE c.status = 'active'
GROUP BY c.customer_name, c.signup_date
ORDER BY total_spent DESC;
```

### Creating MotherDuck-backed tables

To persist the results of an analytical query, you can create a new table that uses MotherDuck to store the data in columnar format. Use the `USING duckdb` clause.

```sql
-- Create a new table 'sales_summary' with DuckDB storage
CREATE TABLE sales_summary USING duckdb AS
SELECT
    r['region'],
    r['product_category'],
    SUM(r['sales_amount']) AS total_sales
FROM read_parquet('s3://my-datalake/sales_data/year=2024/**/*.parquet') r
GROUP BY r['region'], r['product_category'];

-- Query the newly created analytical table
SELECT * FROM sales_summary WHERE region = 'North America';
```

### When to use `duckdb.query()`

The `duckdb.query()` function is an **advanced feature** and is **not needed for most queries**. Use it only when you need to run a query that uses DuckDB-specific syntax that is not valid in standard PostgreSQL.

**Example (using DuckDB's `PIVOT` statement):**

```sql
-- This query uses DuckDB's PIVOT syntax, so it must be wrapped in duckdb.query()
SELECT * FROM duckdb.query($$
    PIVOT sales_summary
    ON product_category
    USING SUM(total_sales)
    GROUP BY region;
$$);
```

For all standard analytical queries, prefer the direct SQL syntax shown in the sections above.

---

## Things to Know

This section outlines behaviors and limitations of `pg_duckdb` that you should know to use the extension effectively.

### Querying

- **Standard SQL is the default**. For analytical queries on your PostgreSQL tables, use standard SQL. `pg_duckdb` accelerates them automatically as long as `duckdb.force_execution = true` is set. No special functions are needed.
- **Use `r['column']` for external files**. When using `read_parquet()`, `read_csv()`, etc., you must access columns with the `r['column_name']` syntax, where `r` is the alias for the function call.
- **`duckdb.query()` is for special cases**. Only use `duckdb.query()` when you need DuckDB-specific syntax that PostgreSQL does not support, like `PIVOT`. It is not needed for regular queries.
- **CTE and subquery behavior**. When using `read_parquet()` inside a CTE or subquery, column references can behave differently. You may need to explicitly alias your columns (`SELECT r['col'] AS col`) or continue using the `r['col']` syntax outside the CTE (see the sketch after this list).
- **Data type limitations**. [There are a number of limitations in how data types behave because of differences between PostgreSQL and DuckDB](docs/types.md#known-limitations).
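A minimal sketch of the aliasing pattern described above. The file path and the column names (`event_date`, `event_type`) are hypothetical:

```sql
-- Aliasing inside the CTE (r['col'] AS col) lets the outer query
-- refer to plain column names instead of the r['col'] syntax.
WITH daily_events AS (
    SELECT
        r['event_date'] AS event_date,
        r['event_type'] AS event_type
    FROM read_parquet('s3://my-bucket/events/*.parquet') r
)
SELECT event_date, event_type, COUNT(*) AS event_count
FROM daily_events
GROUP BY event_date, event_type;
```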
### Transactions

- **Writes must be separate**. You cannot write to both a PostgreSQL table and a DuckDB table in the same transaction; doing so causes an error. Perform these operations in separate transactions (see the sketch at the end of this guide).
- **DDL must be separate**. You cannot mix `CREATE TABLE` statements for PostgreSQL tables and `CREATE TABLE ... USING duckdb` statements in the same transaction.
- **Avoid mixed transactions**. The setting `duckdb.unsafe_allow_mixed_transactions` bypasses the separate-write rule. This is not recommended, as it can lead to data inconsistency if one part of the transaction fails.

### MotherDuck

- **Schema mapping is specific**. MotherDuck databases and schemas are mapped to PostgreSQL schemas with a special naming convention. Your default database's `main` schema maps to `public`. Other schemas are mapped to names prefixed with `ddb$` (see the sketch at the end of this guide).

### Configuration

- **Forcing DuckDB execution**. For a query that only involves PostgreSQL tables, `pg_duckdb` will not run it through DuckDB by default. To force it, set `duckdb.force_execution = true`.
- **File system access is restricted**. By default, non-superusers cannot access the local file system. This is a security measure.
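A minimal sketch of the separate-transactions rule, reusing the `orders` (PostgreSQL) and `sales_summary` (DuckDB-backed) tables from the syntax reference above; the inserted values are made up:

```sql
-- Don't: writing to a PostgreSQL table and a DuckDB table in the
-- same transaction is rejected by pg_duckdb.
BEGIN;
INSERT INTO orders (product_name, amount, order_date)
VALUES ('Webcam', 45.00, '2024-07-03');
INSERT INTO sales_summary VALUES ('North America', 'Electronics', 45.00);  -- error: mixed write
ROLLBACK;

-- Do: perform the two writes in separate transactions.
BEGIN;
INSERT INTO orders (product_name, amount, order_date)
VALUES ('Webcam', 45.00, '2024-07-03');
COMMIT;

BEGIN;
INSERT INTO sales_summary VALUES ('North America', 'Electronics', 45.00);
COMMIT;
```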
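A sketch of the MotherDuck schema mapping, assuming a `ddb$<db_name>$<schema_name>` naming pattern; the `my_db` database, `staging` schema, and `some_table` table are all hypothetical names used only for illustration:

```sql
-- The default database's 'main' schema is exposed as 'public'
SELECT * FROM public.sales_summary;

-- A non-default schema is exposed under a ddb$-prefixed name,
-- e.g. the 'staging' schema of a 'my_db' database:
SELECT * FROM ddb$my_db$staging.some_table;
```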