--- title: How to compare values with a CSV file using WOQL nextjs: metadata: title: How to compare values with a CSV file using WOQL description: Learn how to efficiently compare a list of values against a CSV file to find additions, deletions, and unchanged items using WOQL alternates: canonical: https://terminusdb.org/docs/compare-csv-values-with-woql/ media: [] --- This guide shows you how to compare a list of values against a CSV file using WOQL. This is useful for synchronization tasks where you need to determine what has been added, deleted, or remains unchanged between two data sources. ## The Problem When synchronizing data, you often need to compare: - A list of values you have (e.g., current employee IDs) - A CSV file with reference data (e.g., an HR export) You want to categorize each value as: - **to_add**: Values in your list that are not in the CSV - **to_delete**: Values in the CSV that are not in your list - **no_change**: Values that exist in both ## The Solution: Using Set Operations The most efficient approach uses WOQL's native set operations (`set_difference`, `set_intersection`, `set_union`) which provide O(n log n) performance. These operations can handle 100,000+ elements in under a second. > **See also:** [Integration tests for these patterns](https://github.com/terminusdb/terminusdb/blob/main/tests/test/woql-csv-comparison.js) ### Example CSV File Create a file called `employee-list.csv`: ```csv EmployeeId E001 E002 E003 E005 E007 ``` ### Complete WOQL Query with Set Operations ```javascript WOQL.and( // Step 1: Define your source list using group_by WOQL.group_by( [], ["source_val"], "v:source_list", WOQL.or( WOQL.eq("v:source_val", WOQL.literal("E001", "xsd:string")), WOQL.eq("v:source_val", WOQL.literal("E002", "xsd:string")), WOQL.eq("v:source_val", WOQL.literal("E004", "xsd:string")), WOQL.eq("v:source_val", WOQL.literal("E006", "xsd:string")), ) ), // Step 2: Read CSV once into a list variable WOQL.group_by( [], ["csv_val"], "v:csv_list", WOQL.get(WOQL.as("EmployeeId", "v:csv_val")) .post("employee-list.csv", { type: "csv" }) ), // Step 3: Convert lists to sets (sorts and removes duplicates) WOQL.list_to_set("v:source_list", "v:source_set"), WOQL.list_to_set("v:csv_list", "v:csv_set"), // Step 4: Compute set differences and intersection WOQL.set_difference("v:source_set", "v:csv_set", "v:to_add"), // In source but not CSV WOQL.set_difference("v:csv_set", "v:source_set", "v:to_delete"), // In CSV but not source WOQL.set_intersection("v:source_set", "v:csv_set", "v:no_change"), // In both ) ``` ### Expected Results With the source list `[E001, E002, E004, E006]` and CSV containing `[E001, E002, E003, E005, E007]`: ```json { "to_add": ["E004", "E006"], "to_delete": ["E003", "E005", "E007"], "no_change": ["E001", "E002"] } ``` ### How Set Operations Work | Operation | Description | Complexity | |-----------|-------------|------------| | `list_to_set` | Sorts list and removes duplicates | O(n log n) | | `set_difference(A, B, Result)` | Elements in A but not in B | O(n log n) | | `set_intersection(A, B, Result)` | Elements in both A and B | O(n log n) | | `set_union(A, B, Result)` | All unique elements from A and B | O(n log n) | | `set_member(Element, Set)` | Check if element is in set | O(log n) | ### Performance Benchmarks (for only the set operation itself) | Elements | set_difference | set_intersection | |----------|----------------|------------------| | 1,000 | 7ms | 5ms | | 5,000 | 30ms | 17ms | | 10,000 | 64ms | 66ms | | 100,000 | 577ms | 445ms | --- ## Comparing Database Content with CSV (Large Datasets) When comparing large databases against CSV files, the key is to **stream** through one dataset while using efficient lookups against the other. Never load both datasets entirely into memory. ### Strategy: Stream the Larger Dataset, Index the Smaller | Scenario | Stream through | Index/Set | |----------|---------------|-----------| | Large DB, small CSV | Database triples | CSV as set | | Small DB, large CSV | CSV rows | Database index | ### Finding Database Records Not in CSV When you have a large database and a smaller CSV file, load the CSV into a set and stream through database triples: ```javascript WOQL.and( // Step 1: Load CSV into a set (small, fits in memory) WOQL.group_by( [], ["csv_val"], "v:csv_list", WOQL.get(WOQL.as("EmployeeId", "v:csv_val")) .post("employee-list.csv", { type: "csv" }) ), WOQL.list_to_set("v:csv_list", "v:csv_set"), // Step 2: Stream through database records via backtracking WOQL.triple("v:Employee", "rdf:type", "@schema:Employee"), WOQL.triple("v:Employee", "@schema:employee_id", "v:db_id"), // Step 3: For each DB record, check if NOT in CSV set - O(log n) WOQL.not(WOQL.set_member("v:db_id", "v:csv_set")), ) ``` This streams through database triples one at a time via backtracking. Each `set_member` check is O(log n), making this efficient even with millions of database records. ### Finding CSV Values Not in Database When checking CSV values against a large database, stream through CSV rows and use indexed triple lookups: ```javascript WOQL.and( // Step 1: Stream CSV rows via backtracking WOQL.get(WOQL.as("EmployeeId", "v:csv_id")) .post("employee-list.csv", { type: "csv" }), // Step 2: For each CSV row, check against database index WOQL.not( WOQL.triple("v:AnyEmployee", "@schema:employee_id", "v:csv_id") ), ) ``` The triple lookup uses database indexes, so each check is O(log n). This processes one CSV row at a time without loading the entire database. ### Complete Streaming Comparison To get both directions (what's in DB but not CSV, and what's in CSV but not DB) efficiently: ```javascript // Query 1: Database records not in CSV WOQL.and( WOQL.group_by([], ["csv_val"], "v:csv_list", WOQL.get(WOQL.as("EmployeeId", "v:csv_val")) .post("employee-list.csv", { type: "csv" }) ), WOQL.list_to_set("v:csv_list", "v:csv_set"), WOQL.triple("v:Employee", "rdf:type", "@schema:Employee"), WOQL.triple("v:Employee", "@schema:employee_id", "v:db_id"), WOQL.not(WOQL.set_member("v:db_id", "v:csv_set")), WOQL.eq("v:status", WOQL.literal("to_remove", "xsd:string")), ) // Query 2: CSV values not in database WOQL.and( WOQL.get(WOQL.as("EmployeeId", "v:csv_id")) .post("employee-list.csv", { type: "csv" }), WOQL.not(WOQL.triple("v:AnyEmployee", "@schema:employee_id", "v:csv_id")), WOQL.eq("v:status", WOQL.literal("to_add", "xsd:string")), ) ``` For very large datasets, running these as two separate queries is more memory-efficient than trying to compute both in a single query. --- ## Alternative Approach: Using Member and Not For smaller datasets or when you need row-by-row categorization in the results, the member-based approach provides clear per-value categorization: ### Complete WOQL Query ```javascript WOQL.and( // Step 1: Define your source list using group_by WOQL.group_by( [], ["source_val"], "v:source_list", WOQL.or( WOQL.eq("v:source_val", WOQL.literal("E001", "xsd:string")), WOQL.eq("v:source_val", WOQL.literal("E002", "xsd:string")), WOQL.eq("v:source_val", WOQL.literal("E004", "xsd:string")), WOQL.eq("v:source_val", WOQL.literal("E006", "xsd:string")), ) ), // Step 2: Read CSV once into a list variable WOQL.group_by( [], ["csv_val"], "v:csv_list", WOQL.get(WOQL.as("EmployeeId", "v:csv_val")) .post("employee-list.csv", { type: "csv" }) ), // Step 3: Get all unique values from both sources WOQL.distinct(["v:value"], WOQL.or( WOQL.member("v:value", "v:source_list"), WOQL.member("v:value", "v:csv_list"), ) ), // Step 4: Categorize each value WOQL.or( // to_add: in source_list but NOT in csv_list WOQL.and( WOQL.member("v:value", "v:source_list"), WOQL.not(WOQL.member("v:value", "v:csv_list")), WOQL.eq("v:category", WOQL.literal("to_add", "xsd:string")), ), // to_delete: in csv_list but NOT in source_list WOQL.and( WOQL.member("v:value", "v:csv_list"), WOQL.not(WOQL.member("v:value", "v:source_list")), WOQL.eq("v:category", WOQL.literal("to_delete", "xsd:string")), ), // no_change: in BOTH lists WOQL.and( WOQL.member("v:value", "v:source_list"), WOQL.member("v:value", "v:csv_list"), WOQL.eq("v:category", WOQL.literal("no_change", "xsd:string")), ), ), ) ``` ### Expected Results With the source list `[E001, E002, E004, E006]` and CSV containing `[E001, E002, E003, E005, E007]`: ```json [ { "value": "E004", "category": "to_add" }, { "value": "E006", "category": "to_add" }, { "value": "E003", "category": "to_delete" }, { "value": "E005", "category": "to_delete" }, { "value": "E007", "category": "to_delete" }, { "value": "E001", "category": "no_change" }, { "value": "E002", "category": "no_change" } ] ``` ## Using Set Operations with the JavaScript Client ```javascript const TerminusClient = require('@terminusdb/terminusdb-client') const client = new TerminusClient.WOQLClient('http://localhost:6363', { user: 'admin', key: 'root', }) async function compareWithCSV() { await client.connect() await client.db('mydb') const csvContent = `EmployeeId E001 E002 E003 E005 E007` const query = WOQL.and( WOQL.group_by( [], ["source_val"], "v:source_list", WOQL.or( WOQL.eq("v:source_val", WOQL.literal("E001", "xsd:string")), WOQL.eq("v:source_val", WOQL.literal("E002", "xsd:string")), WOQL.eq("v:source_val", WOQL.literal("E004", "xsd:string")), WOQL.eq("v:source_val", WOQL.literal("E006", "xsd:string")), ) ), WOQL.group_by( [], ["csv_val"], "v:csv_list", WOQL.get(WOQL.as("EmployeeId", "v:csv_val")) .post("employee-list.csv", { type: "csv" }) ), WOQL.list_to_set("v:source_list", "v:source_set"), WOQL.list_to_set("v:csv_list", "v:csv_set"), WOQL.set_difference("v:source_set", "v:csv_set", "v:to_add"), WOQL.set_difference("v:csv_set", "v:source_set", "v:to_delete"), WOQL.set_intersection("v:source_set", "v:csv_set", "v:no_change"), ) const result = await client.query(query, null, null, { files: { 'employee-list.csv': csvContent } }) console.log(result.bindings) } ``` ## Comparing Two CSV Files Compare two CSV files using set operations: ```javascript WOQL.and( // Read first CSV into set A WOQL.group_by( [], ["val_a"], "v:list_a", WOQL.get(WOQL.as("Id", "v:val_a")) .post("file-a.csv", { type: "csv" }) ), WOQL.list_to_set("v:list_a", "v:set_a"), // Read second CSV into set B WOQL.group_by( [], ["val_b"], "v:list_b", WOQL.get(WOQL.as("Id", "v:val_b")) .post("file-b.csv", { type: "csv" }) ), WOQL.list_to_set("v:list_b", "v:set_b"), // Compute differences and intersection WOQL.set_difference("v:set_a", "v:set_b", "v:only_in_a"), WOQL.set_difference("v:set_b", "v:set_a", "v:only_in_b"), WOQL.set_intersection("v:set_a", "v:set_b", "v:in_both"), ) ``` --- ## Performance Considerations ### Set Operations (Recommended) The set operations provide excellent performance for large datasets: | Dataset Size | Performance | Recommendation | |--------------|-------------|----------------| | Up to 10,000 | < 100ms | Use set operations directly | | 10,000-100,000 | 100-600ms | Use set operations directly | | 100,000+ | ~1 second | Consider chunked processing | ### Member-Based Approach The member-based approach has O(n²) complexity and works well for smaller datasets: | Dataset Size | Performance | Recommendation | |--------------|-------------|----------------| | Up to 1,000 | Fast | Works well | | 1,000-5,000 | Several seconds | Consider set operations | | 5,000+ | May timeout | Use set operations instead | ### Very Large Datasets (Millions of Records) For very large databases or CSV files: 1. **Use streaming patterns** - Process CSV rows one at a time with backtracking instead of collecting into lists 2. **Use indexed triple lookups** - Let the database use indexes for individual lookups 3. **Collect CSV into a set, iterate database** - Keep the smaller dataset in memory as a set, iterate the larger one 4. **Consider chunked processing** - Process data in batches of 50,000-100,000 records ## Summary The recommended techniques for CSV comparison in WOQL are: 1. **Use `group_by`** to read CSV data once into a list variable 2. **Use `list_to_set`** to convert lists to sorted sets 3. **Use `set_difference` and `set_intersection`** for O(n log n) comparison 4. **Use `set_member`** for O(log n) membership checks against large sets For comparing database content with CSV: 1. **Use triple patterns** to query database values efficiently 2. **Use backtracking** for memory-efficient processing of large datasets 3. **Collect the smaller dataset** into a set, iterate the larger one The set operations can handle 100,000+ elements in under a second, making them suitable for production synchronization tasks.