{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Splink data linking demo (link only)\n", "\n", "In this demo we link two small datasets. \n", "\n", "The larger table contains duplicates, but in this notebook we use the `link_only` setting, so `splink` makes no attempt to deduplicate these records. \n", "\n", "Note it is possible to simultaneously link and dedupe using the `link_and_dedupe` setting.\n", "\n", "**Important** Where deduplication is not required, `link_only` can provide an important performance boost by dramatically reducing the number of records which need to be compared.\n", "\n", "For example, if you wanted to link 10 records to 1,000, then the maximum number of comparisons that need to be made (i.e. with no blocking rules) is 10,000. If you need to dedupe as well, that number would be n(n-1)/2 = 509,545.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Imports and setup\n", "\n", "The following is just boilerplate code that sets up the Spark session and sets some other non-essential configuration options" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RendererRegistry.enable('mimetype')" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "pd.options.display.max_columns = 500\n", "pd.options.display.max_rows = 100\n", "import altair as alt\n", "alt.renderers.enable('mimetype')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging \n", "logging.basicConfig() # Means logs will print in Jupyter Lab\n", "\n", "# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood\n", "logging.getLogger(\"splink\").setLevel(logging.INFO)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "22/01/11 05:40:56 WARN 
NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", "22/01/11 05:40:57 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n", "22/01/11 05:40:57 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.\n" ] } ], "source": [ "from utility_functions.demo_utils import get_spark\n", "spark = get_spark()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Read in the data\n", "\n", "In this example, we link two datasets, but you can link as many as you like.\n", "\n", "⚠️ Note that `splink` makes the following assumptions about your data:\n", "\n", "- There is a field containing a unique record identifier in each dataset. By default, this should be called `unique_id`, but you can change this in the settings\n", "- There is a field containing a dataset name in each dataset, to disambiguate the `unique_id` column if the same id values occur in more than one dataset. By default, this column is called `source_dataset`, but you can change this in the settings.\n", "- The two datasets being linked have common column names - e.g. date of birth is represented in both datasets in a field of the same name. 
In many cases, this means that the user needs to rename columns prior to using `splink`\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "The count of rows in `df_1` is 181\n", "+---------+----------+-------+----------+------------+--------------------+-----+--------------+\n", "|unique_id|first_name|surname| dob| city| email|group|source_dataset|\n", "+---------+----------+-------+----------+------------+--------------------+-----+--------------+\n", "| 0| Julia | null|2015-10-29| London| hannah88@powers.com| 0| df_1|\n", "| 4| oNah| Watson|2008-03-23| Bolton|matthew78@ballard...| 1| df_1|\n", "| 13| Molly | Bell|2002-01-05|Peterborough| null| 2| df_1|\n", "| 15| Alexander|Amelia |1983-05-19| Glasgow|ic-mpbell@alleale...| 3| df_1|\n", "| 20| Ol vri|ynnollC|1972-03-08| Plymouth|derekwilliams@nor...| 4| df_1|\n", "+---------+----------+-------+----------+------------+--------------------+-----+--------------+\n", "only showing top 5 rows\n", "\n", "The count of rows in `df_2` is 819\n", "+---------+----------+-------+----------+------+--------------------+-----+--------------+\n", "|unique_id|first_name|surname| dob| city| email|group|source_dataset|\n", "+---------+----------+-------+----------+------+--------------------+-----+--------------+\n", "| 1| Julia | Taylor|2015-07-31|London| hannah88@powers.com| 0| df_2|\n", "| 2| Julia | Taylor|2016-01-27|London| hannah88@powers.com| 0| df_2|\n", "| 3| Julia | Taylor|2015-10-29| null| hannah88opowersc@m| 0| df_2|\n", "| 5| Noah | Watson|2008-03-23|Bolton|matthew78@ballard...| 1| df_2|\n", "| 6| Watson| Noah |2008-03-23| null|matthew78@ballard...| 1| df_2|\n", "+---------+----------+-------+----------+------+--------------------+-----+--------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import lit \n", "df_1 = 
spark.read.parquet(\"data/fake_df_l.parquet\")\n", "df_1 = df_1.withColumn(\"source_dataset\", lit(\"df_1\"))\n", "df_2 = spark.read.parquet(\"data/fake_df_r.parquet\")\n", "df_2 = df_2.withColumn(\"source_dataset\", lit(\"df_2\"))\n", "print(f\"The count of rows in `df_1` is {df_1.count()}\")\n", "df_1.show(5)\n", "print(f\"The count of rows in `df_2` is {df_2.count()}\")\n", "df_2.show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Configure splink using the `settings` object\n", "\n", "Most of `splink`'s configuration options are stored in a settings dictionary. This dictionary allows significant customisation, and can therefore get quite complex. \n", "\n", "💥 We provide a tool for authoring valid settings dictionaries, with tooltips and autocomplete, which you can find [here](http://robinlinacre.com/splink_settings_editor/).\n", "\n", "Customisation overrides default values built into splink. For the purposes of this demo, we will specify a simple settings dictionary, which means we will be relying on these sensible defaults.\n", "\n", "To help with authoring and validation of the settings dictionary, we have written a [json schema](https://json-schema.org/), which can be found [here](https://github.com/moj-analytical-services/splink/blob/master/splink/files/settings_jsonschema.json). 
\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# The comparison expression allows for the case where a first name and surname have been inverted \n", "sql_case_expression = \"\"\"\n", "CASE \n", "WHEN first_name_l = first_name_r AND surname_l = surname_r THEN 4 \n", "WHEN first_name_l = surname_r AND surname_l = first_name_r THEN 3\n", "WHEN first_name_l = first_name_r THEN 2\n", "WHEN surname_l = surname_r THEN 1\n", "ELSE 0 \n", "END\n", "\"\"\"\n", "\n", "settings = {\n", " \"link_type\": \"link_only\", \n", " \"blocking_rules\": [\n", " ],\n", " \"comparison_columns\": [\n", " {\n", " \"custom_name\": \"name_inversion\",\n", " \"custom_columns_used\": [\"first_name\", \"surname\"],\n", " \"case_expression\": sql_case_expression,\n", " \"num_levels\": 5\n", " },\n", " {\n", " \"col_name\": \"city\",\n", " \"num_levels\": 3\n", " },\n", " {\n", " \"col_name\": \"email\",\n", " \"num_levels\": 3\n", " },\n", " {\n", " \"col_name\": \"dob\"\n", " }\n", " ],\n", " \"additional_columns_to_retain\": [\"group\"],\n", " \"em_convergence\": 0.01,\n", " \"max_iterations\": 4,\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In words, this settings dictionary says:\n", "\n", "- We are performing a data linking task (the other options are `dedupe_only` and `link_and_dedupe`).\n", "- Since the input datasets are so small, we do not specify any blocking rules and instead generate all possible comparisons.\n", "- When comparing records, we will use information from the `first_name`, `surname`, `city`, `email` and `dob` columns to compute a match score.\n", "- For the comparisons on the `first_name` and `surname` columns we allow for the possibility that the names have been entered in the wrong order. 
\n", " - The highest level of similarity is that `first_name` and `surname` both match.\n", " - Lower levels of similarity cover the names being inverted, only the first name matching, or only the surname matching.\n", "- We will retain the `group` column in the results even though this is not used as part of comparisons. This is a labelled dataset and `group` contains the true match - i.e. where group matches, the records pertain to the same person." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Estimate match scores using the Expectation Maximisation algorithm" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/robinlinacre/anaconda3/lib/python3.8/site-packages/splink/default_settings.py:262: UserWarning: You have not specified any blocking rules, meaning all comparisons between the input dataset(s) will be generated and blocking will not be used.For large input datasets, this will generally be computationally intractable because it will generate comparisons equal to the number of rows squared.\n", " warnings.warn(\n", "/Users/robinlinacre/anaconda3/lib/python3.8/site-packages/splink/default_settings.py:185: UserWarning: No -1 level found in case statement. You usually want to use -1 as the level for the null value. e.g. 
WHEN col_l is null or col_r is null then -1 Case statement is:\n", " \n", "CASE \n", "WHEN first_name_l = first_name_r AND surname_l = surname_r THEN 4 \n", "WHEN first_name_l = surname_r AND surname_l = first_name_r THEN 3\n", "WHEN first_name_l = first_name_r THEN 2\n", "WHEN surname_l = surname_r THEN 1\n", "ELSE 0 \n", "END\n", ".\n", " warnings.warn(\n", "INFO:splink.iterate:Iteration 0 complete \n", "INFO:splink.model:The maximum change in parameters was 0.40458029469636825 for key name_inversion, level 4\n", "INFO:splink.iterate:Iteration 1 complete\n", "INFO:splink.model:The maximum change in parameters was 0.07434341748102571 for key email, level 1\n", "INFO:splink.iterate:Iteration 2 complete\n", "INFO:splink.model:The maximum change in parameters was 0.025150011310513642 for key dob, level 1\n", "INFO:splink.iterate:Iteration 3 complete\n", "INFO:splink.model:The maximum change in parameters was 0.009595088165527288 for key name_inversion, level 0\n", "INFO:splink.iterate:EM algorithm has converged\n" ] } ], "source": [ "from splink import Splink\n", "\n", "linker = Splink(settings, [df_1, df_2], spark)\n", "df_e = linker.get_scored_comparisons()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Inspect results \n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"| | match_weight | match_probability | source_dataset_l | unique_id_l | source_dataset_r | unique_id_r | surname_l | surname_r | first_name_l | first_name_r | gamma_name_inversion | city_l | city_r | gamma_city | email_l | email_r | gamma_email | dob_l | dob_r | gamma_dob | group_l | group_r |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 93101 | 37.816346 | 1.0 | df_1 | 664 | df_2 | 668 | Taylor | Ivy | Ivy | Taylor | 3 | Lonon | London | 1 | jonesjennmfer@pitt.coi | jonesjennifer@pitts.com | 1 | 1980-01-13 | 1980-01-13 | 1 | 113 | 113 |\n",
"| 93106 | 37.816346 | 1.0 | df_1 | 664 | df_2 | 673 | Taylor | Ivy | Ivy | Taylor | 3 | Lonon | London | 1 | jonesjennmfer@pitt.coi | jonesjennifer@pitts.com | 1 | 1980-01-13 | 1980-01-13 | 1 | 113 | 113 |\n",
"| 79930 | 37.816346 | 1.0 | df_1 | 581 | df_2 | 585 | Shaw | Eleanor | Eleanor | Shaw | 3 | Birmingham | Birmingha | 1 | stephaniewebbhart.net | stephaniewebb@hart.net | 1 | 1979-03-31 | 1979-03-31 | 1 | 97 | 97 |\n",
"| 73327 | 37.373057 | 1.0 | df_1 | 517 | df_2 | 526 | Martha | Brown | Brown | Martha | 3 | Southend-on-Sea | Southend-on-Sea | 2 | watsonthomas@jones-stuart.biz | watsonthomas@jones-s.urttbiz | 1 | 2002-09-01 | 2002-09-01 | 1 | 89 | 89 |\n",
"| 79105 | 37.373057 | 1.0 | df_1 | 574 | df_2 | 578 | Williams | George | George | Williams | 3 | London | London | 2 | desek58gibbr.biz | derek58@gibbs.biz | 1 | 1981-08-06 | 1981-08-06 | 1 | 96 | 96 |\n", "