--- layout: docu redirect_from: - /docs/guides/sql_features/merge title: Merge Statement for SCD Type 2 --- This is a practical, step-by-step guide to using DuckDB’s `MERGE` statement (introduced in v1.4.0) to perform upserts and build [Slowly Changing Dimension Type 2 (SCD Type 2) tables](https://en.wikipedia.org/wiki/Slowly_changing_dimension). Type 2 SCDs let you keep full historical versions of records while clearly identifying the current version, perfect for audit trails, data warehousing, and analytical workloads. Type 2 SCDs are practical when you want to know previous values of your primary key data, when it changed and for how long it was in a particular state. ## Why Use MERGE in DuckDB? - Single SQL statement for `INSERT`, `UPDATE`, and soft `DELETE` (upsert and expire). - Much cleaner and faster than equivalent Python/Pandas logic. - Full history tracking without hard deletes. - Works directly on Parquet, CSV, databases, thanks to DuckDB's connectivity! ## Prerequisites - DuckDB ≥ 1.4.0 - Basic SQL knowledge ## Key Terminology | Term | Meaning | |-------------------------------|-------------------------------------------------------------------------------------------| | **Target table** | The main/master table you are updating (e.g., `master_ducks`) | | **Source table** | The incoming/new data (e.g., `incoming_ducks`) | | **MERGE INTO** | Specifies the target table | | **USING** | Specifies the source table/query | | **ON** | Join condition (usually primary/business key + current flag) | | **WHEN MATCHED** | Row exists in both → typically UPDATE (or DELETE) | | **WHEN NOT MATCHED BY TARGET**| New row (insert) | | **WHEN NOT MATCHED BY SOURCE**| Row disappeared → soft-delete/expire old version | | **RETURNING merge_action** | Optional: shows what happened to each row (INSERT/UPDATE/DELETE) | ## Build an SCD Type 2 Dimension Table We’ll track ducks and preserve history whenever their name, breed, or location changes. > DuckDB has a frontend notebook UI, this is great for managing several SQL statements and segmenting your code. > The UI ships with the DuckDB CLI, so if you have the CLI installed you can use the front end. > To start the notebook front end just run: `duckdb -ui` and you can navigate to [http://localhost:4213/](http://localhost:4213/) to start writing your SQL code inside of your notebooks. Just copy and paste the following code blocks to follow this guide. ### Step 1: Create the Incoming (source) Table This table represents today’s transactional data. ```sql CREATE TABLE IF NOT EXISTS incoming_ducks ( duck_id INTEGER, duck_name VARCHAR, breed VARCHAR, location VARCHAR, begin_date DATE, end_date DATE, is_current BOOLEAN ); INSERT INTO incoming_ducks VALUES (101, 'Quackers', 'Mallard', 'Pond B', CURRENT_DATE - INTERVAL '1 day', NULL, true), (102, 'Waddles', 'Pekin', 'Pond A', CURRENT_DATE - INTERVAL '1 day', NULL, true), (104, 'Splash', 'Muscovy', 'Pond C', CURRENT_DATE - INTERVAL '1 day', NULL, true), (105, 'Puddles', 'Indian Runner', 'Relocated', CURRENT_DATE - INTERVAL '1 day', NULL, true); ``` ### Step 2: Create the Master (target) Table This table represents the type 2 SCD data, (i.e. transaction data with history). ```sql CREATE TABLE IF NOT EXISTS master_ducks ( record_id INTEGER PRIMARY KEY, duck_id INTEGER NOT NULL, duck_name VARCHAR, breed VARCHAR, location VARCHAR, begin_date DATE NOT NULL, end_date DATE, is_current BOOLEAN NOT NULL DEFAULT true ); CREATE SEQUENCE IF NOT EXISTS duck_record_seq START 1; INSERT INTO master_ducks VALUES (nextval('duck_record_seq'), 101, 'Quackers', 'Mallard', 'Pond A', CURRENT_DATE - INTERVAL '2 days', NULL, true), (nextval('duck_record_seq'), 102, 'Waddles', 'Pekin', 'Pond A', CURRENT_DATE - INTERVAL '2 days', NULL, true), (nextval('duck_record_seq'), 103, 'Feathers', 'Rouen', 'Pond B', CURRENT_DATE - INTERVAL '2 days', NULL, true), (nextval('duck_record_seq'), 105, 'Puddles', 'Indian Runner', 'Pond A', CURRENT_DATE - INTERVAL '2 days', NULL, true); ``` ### Step 3: Perform the Merge Statement This statement will perform the merge, it will check for differences between the data of target and source and follow the `WHEN MATCHED` or `WHEN NOT MATCHED` logic specifed. ```sql MERGE INTO master_ducks AS target USING incoming_ducks AS source ON target.duck_id = source.duck_id AND target.is_current = true WHEN MATCHED AND ( target.duck_name <> source.duck_name OR target.breed <> source.breed OR target.location <> source.location ) THEN UPDATE SET end_date = CURRENT_DATE - INTERVAL '1 day', is_current = false WHEN NOT MATCHED BY SOURCE AND target.is_current = true THEN UPDATE SET end_date = CURRENT_DATE - INTERVAL '1 day', is_current = false WHEN NOT MATCHED BY TARGET THEN INSERT ( record_id, duck_id, duck_name, breed, location, begin_date, end_date, is_current ) VALUES ( nextval('duck_record_seq'), source.duck_id, source.duck_name, source.breed, source.location, source.begin_date, source.end_date, source.is_current ) RETURNING merge_action, *; ``` ### Step 4: Insert New Current Versions for Changed Records This statement inserts the new current records into the master table. While it's possible to achieve the same result using the `MERGE` statement's `RETURNING` clause, this two-step approach is more straightforward and easier to understand. ```sql INSERT INTO master_ducks ( record_id, duck_id, duck_name, breed, location, begin_date, end_date, is_current ) SELECT nextval('duck_record_seq'), source.duck_id, source.duck_name, source.breed, source.location, CURRENT_DATE AS begin_date, NULL AS end_date, true AS is_current FROM incoming_ducks AS source INNER JOIN master_ducks AS target ON source.duck_id = target.duck_id WHERE target.is_current = false AND target.end_date = CURRENT_DATE - INTERVAL '1 day'; ``` ### Step 5: Query The Results The following queries can be used to examine the data resulting from the `MERGE` statement. ```sql -- All history SELECT * FROM master_ducks ORDER BY duck_id, begin_date DESC; -- Only current records SELECT * FROM master_ducks WHERE is_current = true; -- Only expired historical records SELECT * FROM master_ducks WHERE is_current = false ORDER BY duck_id, begin_date DESC; ``` ### Step 6: Examine a Single Duck To better illustrate the concept lets examine a single duck, to drive home the value add for type 2 SCDs. If we select from the master table after running the merge statement and the post update insert statement, we can see the individual rows for `Quackers`. To view the original row of data that is historical: ```sql SELECT * FROM master_ducks where duck_name = 'Quackers' and is_current = false; ``` Returns: | record_id | duck_id | duck_name | breed | location | begin_date | end_date | is_current | |----------:|--------:|----------:|---------|----------|-------------|--------------|------------| | 1 | 101 | Quackers | Mallard | Pond A | 2025-11-24 | 2025-11-25 | false | **Note**: - The `end date` is NOT NULL, it has the date when this ducks data was updated. - The `is_current` is `false` indicating this is a historical record. - The field that will change is `location`, it is currently `Pond A` and will be updated to `Pond B`. To view the current row of data that is current: ```sql SELECT * FROM master_ducks where duck_name = 'Quackers' and is_current = true; ``` | record_id | duck_id | duck_name | breed | location | begin_date | end_date | is_current | |----------:|--------:|----------:|---------|----------|-------------|----------|------------| | 10 | 101 | Quackers | Mallard | Pond B | 2025-11-26 | NULL | true | **Note**: - The `end date` is NULL, the NULL in this context indicates this is the latest record for this `duck_id`. - The `is_current` is `true` also indicating this is a current record. - The `location` is now `Pond B`. To view all of `Quackers` data, which will contain both current and non-current rows: ```sql SELECT * FROM master_ducks where duck_name = 'Quackers'; ``` | record_id | duck_id | duck_name | breed | location | begin_date | end_date | is_current | | 1 | 101 | Quackers | Mallard | Pond A | 2025-11-24 | 2025-11-25 | false | | 10 | 101 | Quackers | Mallard | Pond B | 2025-11-26 | NULL | true | ## Common Patterns and Variations | Use Case | Clause to Use | |-----------------------------------|--------------------------------------------------------------------| | Simple upsert (no history) | `WHEN MATCHED THEN UPDATE` and `WHEN NOT MATCHED BY TARGET THEN INSERT` | | Upsert and delete missing rows | Add `WHEN NOT MATCHED BY SOURCE THEN DELETE` | | Only insert new, never update | Omit `WHEN MATCHED` | | Return affected rows | Add `RETURNING merge_action, *` | ## Best Practices - Remember that `TARGET` is the master table and `SOURCE` is the incoming table or query. - Keep end_date NULL for current rows (makes queries faster). - Wrap `MERGE` and `INSERT` statements in a transaction when needed. - Use a primary key or a surrogate key for uniqueness. - Test with RETURNING first.