{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Splink data linking and deduplication demo \n", "\n", "In this demo we link and simultaneously dedupe two datasets which themselves contain duplicates.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Imports and setup\n", "\n", "The following is just boilerplate code that sets up the Spark session and some other non-essential configuration options." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RendererRegistry.enable('mimetype')" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "pd.options.display.max_columns = 500\n", "pd.options.display.max_rows = 100\n", "import altair as alt\n", "alt.renderers.enable('mimetype')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging \n", "logging.basicConfig() # Means logs will print in Jupyter Lab\n", "\n", "# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood\n", "logging.getLogger(\"splink\").setLevel(logging.INFO)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from utility_functions.demo_utils import get_spark\n", "spark = get_spark()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Read in data\n", "\n", "In this example, we read in two datasets, `df_1` and `df_2`. These will be provided to `Splink` as a list like `[df_1, df_2]`. 
By adding additional elements to this list, you can use `Splink` to link and deduplicate as many datasets as you like.\n", "\n", "⚠️ Note that `splink` makes the following assumptions about your data:\n", "\n", "- There is a field containing a unique record identifier in each dataset\n", "- There is a field containing a dataset name in each dataset, to disambiguate the `unique_id` column if the same id values occur in more than one dataset. By default, this column is called `source_dataset`, but you can change this in the settings.\n", "- The two datasets being linked have common column names - e.g. date of birth is represented in both datasets in a field of the same name. In many cases, this means that the user needs to rename columns prior to using `splink`.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+----------+-------+----------+------------+--------------------+-----+--------------+\n", "|unique_id|first_name|surname| dob| city| email|group|source_dataset|\n", "+---------+----------+-------+----------+------------+--------------------+-----+--------------+\n", "| 0| Julia | null|2015-10-29| London| hannah88@powers.com| 0| df_1|\n", "| 1| Julia | Taylor|2015-07-31| London| hannah88@powers.com| 0| df_1|\n", "| 2| oNah| Watson|2008-03-23| Bolton|matthew78@ballard...| 1| df_1|\n", "| 3| Noah | Watson|2008-03-23| Bolton|matthew78@ballard...| 1| df_1|\n", "| 4| Molly | Bell|2002-01-05|Peterborough| null| 2| df_1|\n", "+---------+----------+-------+----------+------------+--------------------+-----+--------------+\n", "only showing top 5 rows\n", "\n", "+---------+----------+-------+----------+------+--------------------+-----+--------------+\n", "|unique_id|first_name|surname| dob| city| email|group|source_dataset|\n", "+---------+----------+-------+----------+------+--------------------+-----+--------------+\n", "| 0| Julia | Taylor|2016-01-27|London| 
hannah88@powers.com| 0| df_2|\n", "| 1| Julia | Taylor|2015-10-29| null| hannah88opowersc@m| 0| df_2|\n", "| 2| Watson| Noah |2008-03-23| null|matthew78@ballard...| 1| df_2|\n", "| 3| Noah | Watson|2008-02-05| tolon|matthew78@ballard...| 1| df_2|\n", "| 4| Watson| Noah |2008-06-15|Bolton|matthew78@ballard...| 1| df_2|\n", "+---------+----------+-------+----------+------+--------------------+-----+--------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import lit \n", "from pyspark.sql import functions as f\n", "df_1 = spark.read.parquet(\"data/fake_df_l_link_dedupe_test.parquet\")\n", "df_1 = df_1.withColumn(\"source_dataset\", lit(\"df_1\"))\n", "df_2 = spark.read.parquet(\"data/fake_df_r_link_dedupe_test.parquet\")\n", "df_2 = df_2.withColumn(\"source_dataset\", lit(\"df_2\"))\n", "df_1.show(5)\n", "df_2.show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Configure splink using the `settings` object\n", "\n", "Most of `splink`'s configuration options are stored in a settings dictionary. This dictionary allows significant customisation, and can therefore get quite complex. \n", "\n", "💥 We provide a tool for authoring valid settings dictionaries, with tooltips and autocomplete, which you can find [here](http://robinlinacre.com/splink_settings_editor/).\n", "\n", "Customisation overrides default values built into splink. For the purposes of this demo, we will specify a simple settings dictionary, which means we will be relying on these sensible defaults.\n", "\n", "To help with authoring and validation of the settings dictionary, we have written a [json schema](https://json-schema.org/), which can be found [here](https://github.com/moj-analytical-services/splink/blob/master/splink/files/settings_jsonschema.json). 
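\n", "\n", "You can also check a candidate settings dictionary against a schema yourself using the `jsonschema` package. A minimal sketch, validating against a tiny illustrative subset of the schema inlined below rather than the real schema file (which is much richer):\n", "\n", "
```python
from jsonschema import validate  # third-party package, commonly installed alongside Jupyter

# A tiny illustrative subset of the settings schema, inlined for this sketch
mini_schema = {
    'type': 'object',
    'required': ['link_type', 'comparison_columns'],
    'properties': {
        'link_type': {'enum': ['link_only', 'dedupe_only', 'link_and_dedupe']}
    },
}

candidate = {
    'link_type': 'link_and_dedupe',
    'comparison_columns': [{'col_name': 'first_name'}],
}

# Passes silently; an invalid dictionary raises jsonschema.exceptions.ValidationError
validate(instance=candidate, schema=mini_schema)
```
\n", "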
\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "settings = {\n", " \"link_type\": \"link_and_dedupe\",\n", " \"blocking_rules\": [\n", " \"l.city = r.city\"\n", " ],\n", " \"comparison_columns\": [\n", " {\n", " \"col_name\": \"first_name\",\n", " \"num_levels\": 3,\n", " \"term_frequency_adjustments\": True\n", " },\n", " {\n", " \"col_name\": \"surname\",\n", " \"num_levels\": 3,\n", " \"term_frequency_adjustments\": True\n", " },\n", " {\n", " \"col_name\": \"dob\"\n", " },\n", " {\n", " \"col_name\": \"email\"\n", " }\n", " ],\n", " \"additional_columns_to_retain\": [\"group\"]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In words, this settings dictionary says:\n", "- We are performing a link and deduplication task (the other options are `link_only` and `dedupe_only`)\n", "- We are going to generate comparisons subject to the blocking rule `l.city = r.city`, i.e. all pairwise record comparisons (both within and between datasets) where the city field matches\n", "- When comparing records, we will use information from the `first_name`, `surname`, `dob` and `email` columns to compute a match score.\n", "- For `first_name` and `surname`, string comparisons will have three levels:\n", " - Level 2: Strings are (almost) exactly the same\n", " - Level 1: Strings are similar \n", " - Level 0: No match\n", "- We will make adjustments for term frequencies on the `first_name` and `surname` columns\n", "- We will retain the `group` column in the results even though this is not used as part of comparisons. This is a labelled dataset and `group` contains the true match - i.e. where group matches, the records pertain to the same person" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Estimate match scores using the Expectation Maximisation algorithm\n", "\n", "Columns are assumed to be strings by default. 
See the 'comparison vector settings' notebook for details of configuration options." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:splink.iterate:Iteration 0 complete\n", "INFO:splink.model:The maximum change in parameters was 0.28655259609222417 for key first_name, level 0\n", "INFO:splink.iterate:Iteration 1 complete\n", "INFO:splink.model:The maximum change in parameters was 0.07504022121429443 for key surname, level 2\n", "INFO:splink.iterate:Iteration 2 complete\n", "INFO:splink.model:The maximum change in parameters was 0.032610535621643066 for key surname, level 0\n", "INFO:splink.iterate:Iteration 3 complete\n", "INFO:splink.model:The maximum change in parameters was 0.018751978874206543 for key surname, level 0\n", "INFO:splink.iterate:Iteration 4 complete\n", "INFO:splink.model:The maximum change in parameters was 0.01141306757926941 for key surname, level 0\n", "INFO:splink.iterate:Iteration 5 complete\n", "INFO:splink.model:The maximum change in parameters was 0.007352173328399658 for key surname, level 0\n", "INFO:splink.iterate:Iteration 6 complete\n", "INFO:splink.model:The maximum change in parameters was 0.004936426877975464 for key surname, level 0\n", "INFO:splink.iterate:Iteration 7 complete\n", "INFO:splink.model:The maximum change in parameters was 0.003407001495361328 for key surname, level 0\n", "INFO:splink.iterate:Iteration 8 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0023947954177856445 for key surname, level 0\n", "INFO:splink.iterate:Iteration 9 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0017041563987731934 for key surname, level 0\n", "INFO:splink.iterate:Iteration 10 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0012229979038238525 for key surname, level 0\n", "INFO:splink.iterate:Iteration 11 complete\n", "INFO:splink.model:The maximum change in parameters was 
0.0008828639984130859 for key surname, level 0\n", "INFO:splink.iterate:Iteration 12 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0006399154663085938 for key surname, level 0\n", "INFO:splink.iterate:Iteration 13 complete\n", "INFO:splink.model:The maximum change in parameters was 0.00046521425247192383 for key surname, level 0\n", "INFO:splink.iterate:Iteration 14 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0003389120101928711 for key surname, level 0\n", "INFO:splink.iterate:Iteration 15 complete\n", "INFO:splink.model:The maximum change in parameters was 0.000247269868850708 for key surname, level 0\n", "INFO:splink.iterate:Iteration 16 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0001805722713470459 for key surname, level 0\n", "INFO:splink.iterate:Iteration 17 complete\n", "INFO:splink.model:The maximum change in parameters was 0.00013199448585510254 for key surname, level 0\n", "INFO:splink.iterate:Iteration 18 complete\n", "INFO:splink.model:The maximum change in parameters was 9.652972221374512e-05 for key surname, level 0\n", "INFO:splink.iterate:EM algorithm has converged\n" ] } ], "source": [ "from splink import Splink\n", "\n", "linker = Splink(settings, [df_1, df_2], spark)\n", "df_e = linker.get_scored_comparisons()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Inspect results \n", "\n", "Each row in the results table represents a comparison of two input records.\n", "\n", "Note that since this is a link and dedupe task, there will be record comparisons of three types:\n", "- Comparison between two records both originating from `df_1`. \n", "- Comparison between two records both originating from `df_2`. \n", "- Comparison between one record from `df_1` and one record from `df_2`.\n", "\n", "Information about the source dataset is included in the `source_dataset_l` and `source_dataset_r` fields."
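, "\n", "\n", "Before any blocking is applied, the number of comparisons of each type is simple to reason about. A quick sketch of the arithmetic for two datasets of sizes n1 and n2 (the sizes below are illustrative, not taken from this demo):\n", "\n", "
```python
n1, n2 = 1000, 1200  # illustrative dataset sizes
within_1 = n1 * (n1 - 1) // 2  # dedupe comparisons within the first dataset
within_2 = n2 * (n2 - 1) // 2  # dedupe comparisons within the second dataset
between = n1 * n2              # link comparisons across the two datasets

# Together the three types cover every pair drawn from the pooled records
total = (n1 + n2) * (n1 + n2 - 1) // 2
assert within_1 + within_2 + between == total
```
\n", "\n", "The blocking rule `l.city = r.city` shrinks all three counts to only the pairs that share a city."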
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | match_probability | \n", "unique_id_l | \n", "source_dataset_l | \n", "unique_id_r | \n", "source_dataset_r | \n", "group_l | \n", "group_r | \n", "first_name_l | \n", "first_name_r | \n", "surname_l | \n", "surname_r | \n", "dob_l | \n", "dob_r | \n", "email_l | \n", "email_r | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
163 | \n", "0.996844 | \n", "0 | \n", "df_1 | \n", "0 | \n", "df_2 | \n", "0 | \n", "0 | \n", "Julia | \n", "Julia | \n", "None | \n", "Taylor | \n", "2015-10-29 | \n", "2016-01-27 | \n", "hannah88@powers.com | \n", "hannah88@powers.com | \n", "
255 | \n", "0.996844 | \n", "0 | \n", "df_1 | \n", "1 | \n", "df_1 | \n", "0 | \n", "0 | \n", "Julia | \n", "Julia | \n", "None | \n", "Taylor | \n", "2015-10-29 | \n", "2015-07-31 | \n", "hannah88@powers.com | \n", "hannah88@powers.com | \n", "
254 | \n", "0.001500 | \n", "0 | \n", "df_1 | \n", "12 | \n", "df_1 | \n", "0 | \n", "6 | \n", "Julia | \n", "Matilda | \n", "None | \n", "Hsrir | \n", "2015-10-29 | \n", "1983-04-30 | \n", "hannah88@powers.com | \n", "patrcio47@davis.cam | \n", "
253 | \n", "0.001500 | \n", "0 | \n", "df_1 | \n", "13 | \n", "df_1 | \n", "0 | \n", "6 | \n", "Julia | \n", "Matilda | \n", "None | \n", "None | \n", "2015-10-29 | \n", "1983-04-30 | \n", "hannah88@powers.com | \n", "patricia47@davis.com | \n", "
252 | \n", "0.001500 | \n", "0 | \n", "df_1 | \n", "14 | \n", "df_1 | \n", "0 | \n", "7 | \n", "Julia | \n", "Baxter | \n", "None | \n", "Aria | \n", "2015-10-29 | \n", "1992-09-07 | \n", "hannah88@powers.com | \n", "christineshepherd@allen.com | \n", "
162 | \n", "0.001500 | \n", "0 | \n", "df_1 | \n", "15 | \n", "df_2 | \n", "0 | \n", "6 | \n", "Julia | \n", "Harris | \n", "None | \n", "Matilda | \n", "2015-10-29 | \n", "1983-04-30 | \n", "hannah88@powers.com | \n", "patricia47@davis.com | \n", "
251 | \n", "0.001500 | \n", "0 | \n", "df_1 | \n", "15 | \n", "df_1 | \n", "0 | \n", "7 | \n", "Julia | \n", "Aria | \n", "None | \n", "Baxter | \n", "2015-10-29 | \n", "1992-09-07 | \n", "hannah88@powers.com | \n", "christineshepherd@allen.com | \n", "
21847 | \n", "0.000525 | \n", "0 | \n", "df_2 | \n", "15 | \n", "df_2 | \n", "0 | \n", "6 | \n", "Julia | \n", "Harris | \n", "Taylor | \n", "Matilda | \n", "2016-01-27 | \n", "1983-04-30 | \n", "hannah88@powers.com | \n", "patricia47@davis.com | \n", "
161 | \n", "0.001500 | \n", "0 | \n", "df_1 | \n", "16 | \n", "df_2 | \n", "0 | \n", "6 | \n", "Julia | \n", "Matilda | \n", "None | \n", "Harris | \n", "2015-10-29 | \n", "1983-03-03 | \n", "hannah88@powers.com | \n", "patricia47@davis.com | \n", "
250 | \n", "0.001500 | \n", "0 | \n", "df_1 | \n", "16 | \n", "df_1 | \n", "0 | \n", "8 | \n", "Julia | \n", "Wilson | \n", "None | \n", "Charlie | \n", "2015-10-29 | \n", "1998-09-15 | \n", "hannah88@powers.com | \n", "samantha81@henry.com | \n", "