{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Splink data linking demo (link only)\n", "\n", "In this demo we link two small datasets. \n", "\n", "The larger table contains duplicates, but in this notebook we use the `link_only` setting, so `splink` makes no attempt to deduplicate these records. \n", "\n", "Note it is possible to simultaneously link and dedupe using the `link_and_dedupe` setting.\n", "\n", "**Important** Where deduplication is not required, `link_only` can provide an important performance boost by dramatically reducing the number of record comparisons which need to be made.\n", "\n", "For example, if you wanted to link 10 records to 1,000, then the maximum number of comparisons that need to be made (i.e. with no blocking rules) is 10 × 1,000 = 10,000. If you need to dedupe as well, all n = 1,010 records are compared with one another, so that number would be n(n-1)/2 = 509,545.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Imports and setup\n", "\n", "The following is just boilerplate code that sets up the Spark session and sets some other non-essential configuration options." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd \n", "pd.options.display.max_columns = 500\n", "pd.options.display.max_rows = 100\n", "import altair as alt\n", "alt.renderers.enable('mimetype')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging \n", "logging.basicConfig() # Means logs will print in Jupyter Lab\n", "\n", "# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood\n", "logging.getLogger(\"splink\").setLevel(logging.INFO)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from utility_functions.demo_utils import get_spark\n", "spark = get_spark()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Read in the data\n", "\n", "In this example, we link two 
datasets, but you can link as many as you like.\n", "\n", "⚠️ Note that `splink` makes the following assumptions about your data:\n", "\n", "- There is a field containing a unique record identifier in each dataset. By default, this should be called `unique_id`, but you can change this in the settings.\n", "- There is a field containing a dataset name in each dataset, to disambiguate the `unique_id` column if the same id values occur in more than one dataset. By default, this column is called `source_dataset`, but you can change this in the settings.\n", "- The datasets being linked have common column names - e.g. date of birth is represented in both datasets in a field of the same name. In many cases, this means that the user needs to rename columns prior to using `splink`.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The count of rows in `df_1` is 181\n", "+---------+----------+-------+----------+------------+--------------------+-----+--------------+\n", "|unique_id|first_name|surname| dob| city| email|group|source_dataset|\n", "+---------+----------+-------+----------+------------+--------------------+-----+--------------+\n", "| 0| Julia | null|2015-10-29| London| hannah88@powers.com| 0| df_1|\n", "| 4| oNah| Watson|2008-03-23| Bolton|matthew78@ballard...| 1| df_1|\n", "| 13| Molly | Bell|2002-01-05|Peterborough| null| 2| df_1|\n", "| 15| Alexander|Amelia |1983-05-19| Glasgow|ic-mpbell@alleale...| 3| df_1|\n", "| 20| Ol vri|ynnollC|1972-03-08| Plymouth|derekwilliams@nor...| 4| df_1|\n", "+---------+----------+-------+----------+------------+--------------------+-----+--------------+\n", "only showing top 5 rows\n", "\n", "The count of rows in `df_2` is 819\n", "+---------+----------+-------+----------+------+--------------------+-----+--------------+\n", "|unique_id|first_name|surname| dob| city| email|group|source_dataset|\n", 
"+---------+----------+-------+----------+------+--------------------+-----+--------------+\n", "| 1| Julia | Taylor|2015-07-31|London| hannah88@powers.com| 0| df_2|\n", "| 2| Julia | Taylor|2016-01-27|London| hannah88@powers.com| 0| df_2|\n", "| 3| Julia | Taylor|2015-10-29| null| hannah88opowersc@m| 0| df_2|\n", "| 5| Noah | Watson|2008-03-23|Bolton|matthew78@ballard...| 1| df_2|\n", "| 6| Watson| Noah |2008-03-23| null|matthew78@ballard...| 1| df_2|\n", "+---------+----------+-------+----------+------+--------------------+-----+--------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import lit \n", "df_1 = spark.read.parquet(\"data/fake_df_l.parquet\")\n", "df_1 = df_1.withColumn(\"source_dataset\", lit(\"df_1\"))\n", "df_2 = spark.read.parquet(\"data/fake_df_r.parquet\")\n", "df_2 = df_2.withColumn(\"source_dataset\", lit(\"df_2\"))\n", "print(f\"The count of rows in `df_1` is {df_1.count()}\")\n", "df_1.show(5)\n", "print(f\"The count of rows in `df_2` is {df_2.count()}\")\n", "df_2.show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Configure splink using the `settings` object\n", "\n", "Most of `splink`'s configuration options are stored in a settings dictionary. This dictionary allows significant customisation, and can therefore get quite complex. \n", "\n", "💥 We provide a tool for authoring valid settings dictionaries, with tooltips and autocomplete, which you can find [here](http://robinlinacre.com/splink_settings_editor/).\n", "\n", "Customisation overrides default values built into splink. 
For the purposes of this demo, we will specify a simple settings dictionary, which means we will be relying on these sensible defaults.\n", "\n", "To help with authoring and validation of the settings dictionary, we have written a [json schema](https://json-schema.org/), which can be found [here](https://github.com/moj-analytical-services/splink/blob/master/splink/files/settings_jsonschema.json). \n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# The comparison expression allows for the case where a first name and surname have been inverted \n", "sql_case_expression = \"\"\"\n", "CASE \n", "WHEN first_name_l = first_name_r AND surname_l = surname_r THEN 4 \n", "WHEN first_name_l = surname_r AND surname_l = first_name_r THEN 3\n", "WHEN first_name_l = first_name_r THEN 2\n", "WHEN surname_l = surname_r THEN 1\n", "ELSE 0 \n", "END\n", "\"\"\"\n", "\n", "settings = {\n", " \"link_type\": \"link_only\", \n", " \"max_iterations\": 20,\n", " \"blocking_rules\": [\n", " ],\n", " \"comparison_columns\": [\n", " {\n", " \"custom_name\": \"name_inversion\",\n", " \"custom_columns_used\": [\"first_name\", \"surname\"],\n", " \"case_expression\": sql_case_expression,\n", " \"num_levels\": 5\n", " },\n", " {\n", " \"col_name\": \"city\",\n", " \"num_levels\": 3\n", " },\n", " {\n", " \"col_name\": \"email\",\n", " \"num_levels\": 3\n", " },\n", " {\n", " \"col_name\": \"dob\"\n", " }\n", " ],\n", " \"additional_columns_to_retain\": [\"group\"]\n", " \n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In words, this settings dictionary says:\n", "\n", "- We are performing a data linking task (the other options are `dedupe_only` or `link_and_dedupe`).\n", "- Since the input datasets are so small, we do not specify any blocking rules and instead generate all possible comparisons.\n", "- When comparing records, we will use information from the `first_name`, `surname`, `city`, `email` and `dob` columns to 
compute a match score.\n", "- For the comparison on the `first_name` and `surname` columns, we allow for the possibility that the names have been entered in the wrong order. \n", " - The highest level of similarity is that `first_name` and `surname` both match.\n", " - There are other levels of similarity for the names being inverted, and for just first name, or just surname, matching.\n", "- We will retain the `group` column in the results even though this is not used as part of comparisons. This is a labelled dataset and `group` contains the true match - i.e. where `group` matches, the records pertain to the same person." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Estimate match scores using the Expectation Maximisation algorithm" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/robinlinacre/anaconda3/lib/python3.8/site-packages/splink/default_settings.py:199: UserWarning: You have not specified any blocking rules, meaning all comparisons between the input dataset(s) will be generated and blocking will not be used.For large input datasets, this will generally be computationally intractable because it will generate comparisons equal to the number of rows squared.\n", " warnings.warn(\n", "INFO:splink.iterate:Iteration 0 complete\n", "INFO:splink.model:The maximum change in parameters was 0.40568520724773405 for key name_inversion, level 4\n", "INFO:splink.iterate:Iteration 1 complete\n", "INFO:splink.model:The maximum change in parameters was 0.06933289766311646 for key email, level 1\n", "INFO:splink.iterate:Iteration 2 complete\n", "INFO:splink.model:The maximum change in parameters was 0.02503591775894165 for key dob, level 0\n", "INFO:splink.iterate:Iteration 3 complete\n", "INFO:splink.model:The maximum change in parameters was 0.009511321783065796 for key dob, level 0\n", "INFO:splink.iterate:Iteration 4 complete\n", "INFO:splink.model:The 
maximum change in parameters was 0.004227638244628906 for key dob, level 0\n", "INFO:splink.iterate:Iteration 5 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0022344589233398438 for key dob, level 0\n", "INFO:splink.iterate:Iteration 6 complete\n", "INFO:splink.model:The maximum change in parameters was 0.001312553882598877 for key dob, level 1\n", "INFO:splink.iterate:Iteration 7 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0008212625980377197 for key dob, level 0\n", "INFO:splink.iterate:Iteration 8 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0005371570587158203 for key dob, level 0\n", "INFO:splink.iterate:Iteration 9 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0003641173243522644 for key city, level 0\n", "INFO:splink.iterate:Iteration 10 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0002571418881416321 for key city, level 0\n", "INFO:splink.iterate:Iteration 11 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0001854151487350464 for key city, level 0\n", "INFO:splink.iterate:Iteration 12 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0001360774040222168 for key city, level 0\n", "INFO:splink.iterate:Iteration 13 complete\n", "INFO:splink.model:The maximum change in parameters was 0.0001013725996017456 for key city, level 0\n", "INFO:splink.iterate:Iteration 14 complete\n", "INFO:splink.model:The maximum change in parameters was 7.649511098861694e-05 for key city, level 0\n", "INFO:splink.iterate:EM algorithm has converged\n" ] }, { "data": { "text/plain": [ "DataFrame[match_probability: double, source_dataset_l: string, unique_id_l: bigint, source_dataset_r: string, unique_id_r: bigint, first_name_l: string, first_name_r: string, surname_l: string, surname_r: string, gamma_name_inversion: int, city_l: string, city_r: string, gamma_city: int, email_l: string, email_r: string, gamma_email: 
int, dob_l: string, dob_r: string, gamma_dob: int, group_l: bigint, group_r: bigint]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from splink import Splink\n", "\n", "linker = Splink(settings, [df_1, df_2], spark)\n", "df_e = linker.get_scored_comparisons()\n", "\n", "# Later, we will make term frequency adjustments. \n", "# Persist caches these results in memory, preventing them having to be recomputed when we make these adjustments.\n", "df_e.persist() \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Inspect results \n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
HTML rendering of the results table was not recoverable; the same 20 rows appear in the text/plain output.\n"
 ], "text/plain": [ " match_probability source_dataset_l unique_id_l source_dataset_r \\\n", "58499 1.0 df_1 419 df_2 \n", "79930 1.0 df_1 581 df_2 \n", "93101 1.0 df_1 664 df_2 \n", "93106 1.0 df_1 664 df_2 \n", "2471 1.0 df_1 15 df_2 \n", "137531 1.0 df_1 924 df_2 \n", "79105 1.0 df_1 574 df_2 \n", "79104 1.0 df_1 574 df_2 \n", "142479 1.0 df_1 960 df_2 \n", "29657 1.0 df_1 209 df_2 \n", "73322 1.0 df_1 517 df_2 \n", "73327 1.0 df_1 517 df_2 \n", "102976 1.0 df_1 726 df_2 \n", "93102 1.0 df_1 664 df_2 \n", "79102 1.0 df_1 574 df_2 \n", "93100 1.0 df_1 664 df_2 \n", "102979 1.0 df_1 726 df_2 \n", "128490 1.0 df_1 879 df_2 \n", "79934 1.0 df_1 581 df_2 \n", "79103 1.0 df_1 574 df_2 \n", "\n", " unique_id_r first_name_l first_name_r surname_l surname_r \\\n", "58499 422 Emily Brown Brown Emily \n", "79930 585 Eleanor Shaw Shaw Eleanor \n", "93101 668 Ivy Taylor Taylor Ivy \n", "93106 673 Ivy Taylor Taylor Ivy \n", "2471 18 Alexander Amelia Amelia Alexander \n", "137531 926 Mills Thomas Thomas Mills \n", "79105 578 George Williams Williams George \n", "79104 577 George Williams Williams George \n", "142479 966 Gabriel Bartlett Bartlett Gabriel \n", "29657 210 Thompson Freddie Freddie Thompson \n", "73322 521 Brown Martha Martha Brown \n", "73327 526 Brown Martha Martha Brown \n", "102976 727 Harry Lawrence Lawrence Harry \n", "93102 669 Ivy Ivy Taylor Taylor \n", "79102 575 George George Williams Williams \n", "93100 667 Ivy Ivy Taylor Taylor \n", "102979 730 Harry Harry Lawrence Lawrence \n", "128490 883 Leo Leo Jones Jones \n", "79934 589 Eleanor Eleanor Shaw Shaw \n", "79103 576 George George Williams Williams \n", "\n", " gamma_name_inversion city_l city_r gamma_city \\\n", "58499 3 Lndon London 1 \n", "79930 3 Birmingham Birmingha 1 \n", "93101 3 Lonon London 1 \n", "93106 3 Lonon London 1 \n", "2471 3 Glasgow Glasgow 2 \n", "137531 3 London London 2 \n", "79105 3 London London 2 \n", "79104 3 London London 2 \n", "142479 3 Wolverhampton Wolverhampton 2 \n", 
"29657 3 Peterborough Peterborough 2 \n", "73322 3 Southend-on-Sea Southend-on-Sea 2 \n", "73327 3 Southend-on-Sea Southend-on-Sea 2 \n", "102976 3 Stoke-on-Trent Stoke-on-Trent 2 \n", "93102 4 Lonon Lodno 1 \n", "79102 4 London Lndon 1 \n", "93100 4 Lonon London 1 \n", "102979 4 Stoke-on-Trent Stoke-on-ernt 1 \n", "128490 4 Ldnon London 1 \n", "79934 4 Birmingham Birmingham 2 \n", "79103 4 London London 2 \n", "\n", " email_l email_r \\\n", "58499 sarahbrown@mckinney.com sarahnron@mckinbey.com \n", "79930 stephaniewebbhart.net stephaniewebb@hart.net \n", "93101 jonesjennmfer@pitt.coi jonesjennifer@pitts.com \n", "93106 jonesjennmfer@pitt.coi jonesjennifer@pitts.com \n", "2471 ic-mpbell@allealewis.org icampbell@allen-lewis.org \n", "137531 hensondebbie@garcia.com hensondrbbie@gaeia.com \n", "79105 desek58gibbr.biz derek58@gibbs.biz \n", "79104 desek58gibbr.biz derek58@gibbs.biz \n", "142479 ogomez@robinson-mckinney.com ogomez@rob-nsonimcknney.com \n", "29657 scottsalinas@hughes-lopez.com scottsalinah@ughes-lopez.com \n", "73322 watsonthomas@jones-stuart.biz watsonthomas@onesistuart.b-z \n", "73327 watsonthomas@jones-stuart.biz watsonthomas@jones-s.urttbiz \n", "102976 aarbarpace@mbnning.org barbarapace@manning.org \n", "93102 jonesjennmfer@pitt.coi jonesjennifer@pitts.com \n", "79102 desek58gibbr.biz derek58@gibbs.biz \n", "93100 jonesjennmfer@pitt.coi jonesjennifer@pitts.com \n", "102979 aarbarpace@mbnning.org barbarapace@manning.org \n", "128490 tcarr@lewis-kline.com tcarr@lweis-kine.com \n", "79934 stephaniewebbhart.net stephaniewebb@hart.net \n", "79103 desek58gibbr.biz derek58@gibbs.biz \n", "\n", " gamma_email dob_l dob_r gamma_dob group_l group_r \n", "58499 1 2005-07-15 2005-07-15 1 71 71 \n", "79930 1 1979-03-31 1979-03-31 1 97 97 \n", "93101 1 1980-01-13 1980-01-13 1 113 113 \n", "93106 1 1980-01-13 1980-01-13 1 113 113 \n", "2471 1 1983-05-19 1983-05-19 1 3 3 \n", "137531 1 1970-03-09 1970-03-09 1 167 167 \n", "79105 1 1981-08-06 1981-08-06 1 96 96 \n", 
"79104 1 1981-08-06 1981-08-06 1 96 96 \n", "142479 1 1973-12-09 1973-12-09 1 173 173 \n", "29657 1 1999-07-23 1999-07-23 1 36 36 \n", "73322 1 2002-09-01 2002-09-01 1 89 89 \n", "73327 1 2002-09-01 2002-09-01 1 89 89 \n", "102976 1 2016-12-25 2016-12-25 1 125 125 \n", "93102 1 1980-01-13 1980-01-13 1 113 113 \n", "79102 1 1981-08-06 1981-08-06 1 96 96 \n", "93100 1 1980-01-13 1980-01-13 1 113 113 \n", "102979 1 2016-12-25 2016-12-25 1 125 125 \n", "128490 1 2019-06-15 2019-06-15 1 156 156 \n", "79934 1 1979-03-31 1979-03-31 1 97 97 \n", "79103 1 1981-08-06 1981-08-06 1 96 96 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inspect main dataframe that contains the match scores\n", "df_e.toPandas().sort_values(\"match_probability\", ascending=False).head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `model` property of the `linker` is an object that contains a lot of diagnostic information about how the match probability was computed. 
The following cells demonstrate some of its functionality" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v4+json": { "$schema": "https://vega.github.io/schema/vega-lite/v4.json", "config": { "header": { "title": null }, "title": { "anchor": "middle", "offset": 10 }, "view": { "continuousHeight": 300, "continuousWidth": 400, "height": 300, "width": 400 } }, "data": { "name": "data-ceab06ca9a4c907808588fc81343c76f" }, "datasets": { "data-ceab06ca9a4c907808588fc81343c76f": [ { "bayes_factor": 0.33746388159569507, "column_name": "name_inversion", "gamma_column_name": "gamma_name_inversion", "gamma_index": 0, "level_name": "level_0", "level_proportion": 0.9899958998803806, "log2_bayes_factor": -1.5671949945941157, "m_probability": 0.3353385925292969, "max_gamma_index": 4, "num_levels": 5, "u_probability": 0.9937021732330322 }, { "bayes_factor": 43.04556551790893, "column_name": "name_inversion", "gamma_column_name": "gamma_name_inversion", "gamma_index": 1, "level_name": "level_1", "level_proportion": 0.005140347579447535, "log2_bayes_factor": 5.42779271613676, "m_probability": 0.17891953885555267, "max_gamma_index": 4, "num_levels": 5, "u_probability": 0.004156515002250671 }, { "bayes_factor": 82.16842152983283, "column_name": "name_inversion", "gamma_column_name": "gamma_name_inversion", "gamma_index": 2, "level_name": "level_2", "level_proportion": 0.003082859391495617, "log2_bayes_factor": 6.360512147486598, "m_probability": 0.17386698722839355, "max_gamma_index": 4, "num_levels": 5, "u_probability": 0.002115983050316572 }, { "bayes_factor": 20406.02818466379, "column_name": "name_inversion", "gamma_column_name": "gamma_name_inversion", "gamma_index": 3, "level_name": "level_3", "level_proportion": 0.0008027577041464385, "log2_bayes_factor": 14.3167077840707, "m_probability": 0.14137406647205353, "max_gamma_index": 4, "num_levels": 5, "u_probability": 6.928054062882438e-06 }, { "bayes_factor": 
9258.582412155205, "column_name": "name_inversion", "gamma_column_name": "gamma_name_inversion", "gamma_index": 4, "level_name": "level_4", "level_proportion": 0.0009781501351920944, "log2_bayes_factor": 13.176575603038243, "m_probability": 0.17050081491470337, "max_gamma_index": 4, "num_levels": 5, "u_probability": 1.8415434169583023e-05 }, { "bayes_factor": 0.09647031769622005, "column_name": "city", "gamma_column_name": "gamma_city", "gamma_index": 0, "level_name": "level_0", "level_proportion": 0.8747266775886633, "log2_bayes_factor": -3.37377107226111, "m_probability": 0.084816575050354, "max_gamma_index": 2, "num_levels": 3, "u_probability": 0.8791986703872681 }, { "bayes_factor": 7.848894447588318, "column_name": "city", "gamma_column_name": "gamma_city", "gamma_index": 1, "level_name": "level_1", "level_proportion": 0.03279609504851985, "log2_bayes_factor": 2.9724894581690755, "m_probability": 0.2478567212820053, "max_gamma_index": 2, "num_levels": 3, "u_probability": 0.03157855197787285 }, { "bayes_factor": 7.479329462081099, "column_name": "city", "gamma_column_name": "gamma_city", "gamma_index": 2, "level_name": "level_2", "level_proportion": 0.09247724580052386, "log2_bayes_factor": 2.902908935207437, "m_probability": 0.6673266887664795, "max_gamma_index": 2, "num_levels": 3, "u_probability": 0.08922279626131058 }, { "bayes_factor": 0.04449680027094667, "column_name": "email", "gamma_column_name": "gamma_email", "gamma_index": 0, "level_name": "level_0", "level_proportion": 0.9945923379461757, "log2_bayes_factor": -4.49015459300374, "m_probability": 0.04449551925063133, "max_gamma_index": 2, "num_levels": 3, "u_probability": 0.9999712109565735 }, { "bayes_factor": 10896992.968579952, "column_name": "email", "gamma_column_name": "gamma_email", "gamma_index": 1, "level_name": "level_1", "level_proportion": 0.0017372362928768847, "log2_bayes_factor": 23.377426741614574, "m_probability": 0.30858883261680603, "max_gamma_index": 2, "num_levels": 3, 
"u_probability": 2.831871448449874e-08 }, { "bayes_factor": 22489.264303527143, "column_name": "email", "gamma_column_name": "gamma_email", "gamma_index": 2, "level_name": "level_2", "level_proportion": 0.0036704306889567212, "log2_bayes_factor": 14.456948846222605, "m_probability": 0.6469156742095947, "max_gamma_index": 2, "num_levels": 3, "u_probability": 2.876553298847284e-05 }, { "bayes_factor": 0.4109166928436421, "column_name": "dob", "gamma_column_name": "gamma_dob", "gamma_index": 0, "level_name": "level_0", "level_proportion": 0.9966608104918419, "log2_bayes_factor": -1.2830821559768926, "m_probability": 0.41090723872184753, "max_gamma_index": 1, "num_levels": 2, "u_probability": 0.9999769926071167 }, { "bayes_factor": 25590.136849725804, "column_name": "dob", "gamma_column_name": "gamma_dob", "gamma_index": 1, "level_name": "level_1", "level_proportion": 0.003339202178888195, "log2_bayes_factor": 14.643300242123873, "m_probability": 0.5890927314758301, "max_gamma_index": 1, "num_levels": 2, "u_probability": 2.302030407008715e-05 } ] }, "hconcat": [ { "encoding": { "color": { "value": "red" }, "row": { "field": "column_name", "header": { "labelAlign": "left", "labelAnchor": "middle", "labelAngle": 0 }, "sort": { "field": "gamma_index" }, "type": "nominal" }, "tooltip": [ { "field": "column_name", "type": "nominal" }, { "field": "level_name", "type": "ordinal" }, { "field": "u_probability", "format": ".4f", "type": "quantitative" }, { "field": "bayes_factor", "format": ".4f", "type": "quantitative" }, { "field": "level_proportion", "format": ".2%", "title": "Percentage of record comparisons in this level", "type": "nominal" }, { "field": "log2_bayes_factor", "format": ".4f", "type": "quantitative" } ], "x": { "axis": { "title": "proportion" }, "field": "u_probability", "type": "quantitative" }, "y": { "axis": { "title": null }, "field": "level_name", "type": "nominal" } }, "height": 50, "mark": "bar", "resolve": { "scale": { "y": "independent" } }, "title": 
{ "fontWeight": "normal", "text": "Non-matches" }, "transform": [ { "filter": "(datum.bayes_factor != 'unnecessary filter2 due to vega lite issue 4680')" } ], "width": 150 }, { "encoding": { "color": { "value": "green" }, "row": { "field": "column_name", "header": { "labels": false }, "sort": { "field": "gamma_index" }, "type": "nominal" }, "tooltip": [ { "field": "column_name", "type": "nominal" }, { "field": "level_name", "type": "ordinal" }, { "field": "m_probability", "format": ".4f", "type": "quantitative" }, { "field": "bayes_factor", "format": ".4f", "type": "quantitative" }, { "field": "level_proportion", "format": ".2%", "title": "Percentage of record comparisons in this level", "type": "nominal" }, { "field": "log2_bayes_factor", "format": ".4f", "type": "quantitative" } ], "x": { "axis": { "title": "proportion" }, "field": "m_probability", "type": "quantitative" }, "y": { "axis": { "title": null }, "field": "level_name", "type": "nominal" } }, "height": 50, "mark": "bar", "resolve": { "scale": { "y": "independent" } }, "title": { "fontWeight": "normal", "text": "Matches" }, "transform": [ { "filter": "(datum.bayes_factor != 'unnecessary filter due to vega lite issue 4680')" } ], "width": 150 } ], "title": { "subtitle": "Estimated proportion of matches λ = 0.00563", "text": "Probability distributions of non-matches and matches " }, "transform": [] }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. 
For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = linker.model\n", "model.probability_distribution_chart()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An alternative representation of the parameters displays them in terms of the effect different values in the comparison vectors have on the match probability:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v4+json": { "$schema": "https://vega.github.io/schema/vega-lite/v4.json", "config": { "header": { "title": null }, "mark": { "tooltip": null }, "title": { "anchor": "middle" }, "view": { "continuousHeight": 300, "continuousWidth": 400, "height": 300, "width": 400 } }, "data": { "name": "data-ceab06ca9a4c907808588fc81343c76f" }, "datasets": { "data-ceab06ca9a4c907808588fc81343c76f": [ { "bayes_factor": 0.33746388159569507, "column_name": "name_inversion", "gamma_column_name": "gamma_name_inversion", "gamma_index": 0, "level_name": "level_0", "level_proportion": 0.9899958998803806, "log2_bayes_factor": -1.5671949945941157, "m_probability": 0.3353385925292969, "max_gamma_index": 4, "num_levels": 5, "u_probability": 0.9937021732330322 }, { "bayes_factor": 43.04556551790893, "column_name": "name_inversion", "gamma_column_name": "gamma_name_inversion", "gamma_index": 1, "level_name": "level_1", "level_proportion": 0.005140347579447535, "log2_bayes_factor": 5.42779271613676, "m_probability": 0.17891953885555267, "max_gamma_index": 4, "num_levels": 5, "u_probability": 0.004156515002250671 }, { "bayes_factor": 82.16842152983283, "column_name": "name_inversion", "gamma_column_name": "gamma_name_inversion", "gamma_index": 2, "level_name": "level_2", "level_proportion": 0.003082859391495617, "log2_bayes_factor": 6.360512147486598, "m_probability": 0.17386698722839355, "max_gamma_index": 4, 
"num_levels": 5, "u_probability": 0.002115983050316572 }, { "bayes_factor": 20406.02818466379, "column_name": "name_inversion", "gamma_column_name": "gamma_name_inversion", "gamma_index": 3, "level_name": "level_3", "level_proportion": 0.0008027577041464385, "log2_bayes_factor": 14.3167077840707, "m_probability": 0.14137406647205353, "max_gamma_index": 4, "num_levels": 5, "u_probability": 6.928054062882438e-06 }, { "bayes_factor": 9258.582412155205, "column_name": "name_inversion", "gamma_column_name": "gamma_name_inversion", "gamma_index": 4, "level_name": "level_4", "level_proportion": 0.0009781501351920944, "log2_bayes_factor": 13.176575603038243, "m_probability": 0.17050081491470337, "max_gamma_index": 4, "num_levels": 5, "u_probability": 1.8415434169583023e-05 }, { "bayes_factor": 0.09647031769622005, "column_name": "city", "gamma_column_name": "gamma_city", "gamma_index": 0, "level_name": "level_0", "level_proportion": 0.8747266775886633, "log2_bayes_factor": -3.37377107226111, "m_probability": 0.084816575050354, "max_gamma_index": 2, "num_levels": 3, "u_probability": 0.8791986703872681 }, { "bayes_factor": 7.848894447588318, "column_name": "city", "gamma_column_name": "gamma_city", "gamma_index": 1, "level_name": "level_1", "level_proportion": 0.03279609504851985, "log2_bayes_factor": 2.9724894581690755, "m_probability": 0.2478567212820053, "max_gamma_index": 2, "num_levels": 3, "u_probability": 0.03157855197787285 }, { "bayes_factor": 7.479329462081099, "column_name": "city", "gamma_column_name": "gamma_city", "gamma_index": 2, "level_name": "level_2", "level_proportion": 0.09247724580052386, "log2_bayes_factor": 2.902908935207437, "m_probability": 0.6673266887664795, "max_gamma_index": 2, "num_levels": 3, "u_probability": 0.08922279626131058 }, { "bayes_factor": 0.04449680027094667, "column_name": "email", "gamma_column_name": "gamma_email", "gamma_index": 0, "level_name": "level_0", "level_proportion": 0.9945923379461757, "log2_bayes_factor": 
-4.49015459300374, "m_probability": 0.04449551925063133, "max_gamma_index": 2, "num_levels": 3, "u_probability": 0.9999712109565735 }, { "bayes_factor": 10896992.968579952, "column_name": "email", "gamma_column_name": "gamma_email", "gamma_index": 1, "level_name": "level_1", "level_proportion": 0.0017372362928768847, "log2_bayes_factor": 23.377426741614574, "m_probability": 0.30858883261680603, "max_gamma_index": 2, "num_levels": 3, "u_probability": 2.831871448449874e-08 }, { "bayes_factor": 22489.264303527143, "column_name": "email", "gamma_column_name": "gamma_email", "gamma_index": 2, "level_name": "level_2", "level_proportion": 0.0036704306889567212, "log2_bayes_factor": 14.456948846222605, "m_probability": 0.6469156742095947, "max_gamma_index": 2, "num_levels": 3, "u_probability": 2.876553298847284e-05 }, { "bayes_factor": 0.4109166928436421, "column_name": "dob", "gamma_column_name": "gamma_dob", "gamma_index": 0, "level_name": "level_0", "level_proportion": 0.9966608104918419, "log2_bayes_factor": -1.2830821559768926, "m_probability": 0.41090723872184753, "max_gamma_index": 1, "num_levels": 2, "u_probability": 0.9999769926071167 }, { "bayes_factor": 25590.136849725804, "column_name": "dob", "gamma_column_name": "gamma_dob", "gamma_index": 1, "level_name": "level_1", "level_proportion": 0.003339202178888195, "log2_bayes_factor": 14.643300242123873, "m_probability": 0.5890927314758301, "max_gamma_index": 1, "num_levels": 2, "u_probability": 2.302030407008715e-05 } ] }, "encoding": { "color": { "field": "log2_bayes_factor", "scale": { "domain": [ -10, 0, 10 ], "range": [ "red", "orange", "green" ] }, "type": "quantitative" }, "row": { "field": "column_name", "header": { "labelAlign": "left", "labelAnchor": "middle", "labelAngle": 0 }, "sort": { "field": "gamma_index" }, "type": "nominal" }, "tooltip": [ { "field": "column_name", "type": "nominal" }, { "field": "level_name", "type": "ordinal" }, { "field": "m_probability", "format": ".4f", "type": "quantitative" 
}, { "field": "bayes_factor", "format": ".4f", "type": "quantitative" }, { "field": "level_proportion", "format": ".2%", "title": "Percentage of record comparisons in this level", "type": "nominal" }, { "field": "log2_bayes_factor", "format": ".4f", "title": "log2(Bayes factor, K = m/u)", "type": "quantitative" } ], "x": { "axis": { "title": "log2(Bayes factor, K = m/u)", "values": [ -10, -5, 0, 5, 10 ] }, "field": "log2_bayes_factor", "scale": { "domain": [ -10, 10 ] }, "type": "quantitative" }, "y": { "axis": { "title": null }, "field": "level_name", "type": "nominal" } }, "height": 50, "mark": { "clip": true, "type": "bar" }, "resolve": { "scale": { "y": "independent" } }, "title": "Influence of comparison vector values on match probability" }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.bayes_factor_chart()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# If charts aren't displaying correctly in your notebook, you can write them to a file (by default splink_charts.html)\n", "model.all_charts_write_html_file(\"splink_charts.html\", overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also generate a report which explains how the match probability was computed for an individual comparison row. \n", "\n", "Note that you need to convert the row to a dictionary for this to work" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Initial probability of match (prior) = λ = 0.00563\n", "------\n", "Comparison of name_inversion. 
Values are: \n", "name_inversion_l: Ja k, Kirk\n", "name_inversion_r: Leo , Jones\n", "Comparison has: 5 levels\n", "Level for this comparison: gamma_name_inversion = 0\n", "m probability = P(level|match): 0.3353\n", "u probability = P(level|non-match): 0.9937\n", "Bayes factor = m/u: 0.3375\n", "New probability of match (updated belief): 0.001907\n", "\n", "------\n", "Comparison of city. Values are: \n", "city_l: London\n", "city_r: Manchester\n", "Comparison has: 3 levels\n", "Level for this comparison: gamma_city = 0\n", "m probability = P(level|match): 0.08482\n", "u probability = P(level|non-match): 0.8792\n", "Bayes factor = m/u: 0.09647\n", "New probability of match (updated belief): 0.0001843\n", "\n", "------\n", "Comparison of email. Values are: \n", "email_l: fphillips@young-trner.info\n", "email_r: None\n", "Comparison has: 3 levels\n", "Level for this comparison: gamma_email = -1\n", "m probability = P(level|match): 1\n", "u probability = P(level|non-match): 1\n", "Bayes factor = m/u: 1\n", "New probability of match (updated belief): 0.0001843\n", "\n", "------\n", "Comparison of dob. 
Values are: \n", "dob_l: 2008-02-17\n", "dob_r: 1983-07-01\n", "Comparison has: 2 levels\n", "Level for this comparison: gamma_dob = 0\n", "m probability = P(level|match): 0.4109\n", "u probability = P(level|non-match): 1\n", "Bayes factor = m/u: 0.4109\n", "New probability of match (updated belief): 7.573e-05\n", "\n", "\n", "Final probability of match = 7.573e-05\n", "\n", "Reminder:\n", "\n", "The m probability for a given level is the proportion of matches which are in this level.\n", "We would generally expect the highest similarity level to have the largest proportion of matches.\n", "For example, we would expect first name field to match exactly amongst most matching records, except where nicknames, aliases or typos have occurred.\n", "For a comparison column that changes through time, like address, we may expect a lower proportion of comparisons to be in the highest similarity level.\n", "\n", "The u probability for a given level is the proportion of non-matches which are in this level.\n", "We would generally expect the lowest similarity level to have the highest proportion of non-matches, but the magnitude depends on the cardinality of the field.\n", "For example, we would expect that in the vast majority of non-matching records, the date of birth field would not match. 
However, we would expect it to be common for gender to match amongst non-matches.\n", "\n" ] } ], "source": [ "from splink.intuition import intuition_report\n", "row_dict = df_e.toPandas().sample(1).to_dict(orient=\"records\")[0]\n", "print(intuition_report(row_dict, model))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v4+json": { "$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json", "config": { "title": { "fontSize": 14 }, "view": { "continuousHeight": 300, "continuousWidth": 400 } }, "data": { "name": "data-6916d61d602750fe23c53d3e64120973" }, "datasets": { "data-6916d61d602750fe23c53d3e64120973": [ { "binwidth": 0.01, "count_rows": 2687, "freqdensity": 268700, "normalised": 0.7174899866488653, "splink_score_bin_high": 0.01, "splink_score_bin_low": 0 }, { "binwidth": 0.01, "count_rows": 54, "freqdensity": 5400, "normalised": 0.014419225634178908, "splink_score_bin_high": 0.02, "splink_score_bin_low": 0.01 }, { "binwidth": 0.009999999999999998, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.03, "splink_score_bin_low": 0.02 }, { "binwidth": 0.010000000000000002, "count_rows": 113, "freqdensity": 11299.999999999998, "normalised": 0.030173564753004006, "splink_score_bin_high": 0.04, "splink_score_bin_low": 0.03 }, { "binwidth": 0.010000000000000002, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.05, "splink_score_bin_low": 0.04 }, { "binwidth": 0.009999999999999995, "count_rows": 25, "freqdensity": 2500.0000000000014, "normalised": 0.006675567423230979, "splink_score_bin_high": 0.06, "splink_score_bin_low": 0.05 }, { "binwidth": 0.010000000000000009, "count_rows": 1, "freqdensity": 99.99999999999991, "normalised": 0.0002670226969292388, "splink_score_bin_high": 0.07, "splink_score_bin_low": 0.06 }, { "binwidth": 0.009999999999999995, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.08, 
"splink_score_bin_low": 0.07 }, { "binwidth": 0.009999999999999995, "count_rows": 1, "freqdensity": 100.00000000000006, "normalised": 0.0002670226969292392, "splink_score_bin_high": 0.09, "splink_score_bin_low": 0.08 }, { "binwidth": 0.010000000000000009, "count_rows": 27, "freqdensity": 2699.9999999999977, "normalised": 0.007209612817089448, "splink_score_bin_high": 0.1, "splink_score_bin_low": 0.09 }, { "binwidth": 0.009999999999999995, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.11, "splink_score_bin_low": 0.1 }, { "binwidth": 0.009999999999999995, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.12, "splink_score_bin_low": 0.11 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.13, "splink_score_bin_low": 0.12 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.14, "splink_score_bin_low": 0.13 }, { "binwidth": 0.009999999999999981, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.15, "splink_score_bin_low": 0.14 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.16, "splink_score_bin_low": 0.15 }, { "binwidth": 0.010000000000000009, "count_rows": 11, "freqdensity": 1099.999999999999, "normalised": 0.0029372496662216268, "splink_score_bin_high": 0.17, "splink_score_bin_low": 0.16 }, { "binwidth": 0.009999999999999981, "count_rows": 3, "freqdensity": 300.00000000000057, "normalised": 0.0008010680907877186, "splink_score_bin_high": 0.18, "splink_score_bin_low": 0.17 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.19, "splink_score_bin_low": 0.18 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.2, "splink_score_bin_low": 0.19 }, { 
"binwidth": 0.009999999999999981, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.21, "splink_score_bin_low": 0.2 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.22, "splink_score_bin_low": 0.21 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.23, "splink_score_bin_low": 0.22 }, { "binwidth": 0.009999999999999981, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.24, "splink_score_bin_low": 0.23 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.25, "splink_score_bin_low": 0.24 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.26, "splink_score_bin_low": 0.25 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.27, "splink_score_bin_low": 0.26 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.28, "splink_score_bin_low": 0.27 }, { "binwidth": 0.009999999999999953, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.29, "splink_score_bin_low": 0.28 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.3, "splink_score_bin_low": 0.29 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.31, "splink_score_bin_low": 0.3 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.32, "splink_score_bin_low": 0.31 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.33, "splink_score_bin_low": 0.32 }, { "binwidth": 0.010000000000000009, 
"count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.34, "splink_score_bin_low": 0.33 }, { "binwidth": 0.009999999999999953, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.35, "splink_score_bin_low": 0.34 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.36, "splink_score_bin_low": 0.35 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.37, "splink_score_bin_low": 0.36 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.38, "splink_score_bin_low": 0.37 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.39, "splink_score_bin_low": 0.38 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.4, "splink_score_bin_low": 0.39 }, { "binwidth": 0.009999999999999953, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.41, "splink_score_bin_low": 0.4 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.42, "splink_score_bin_low": 0.41 }, { "binwidth": 0.010000000000000009, "count_rows": 16, "freqdensity": 1599.9999999999986, "normalised": 0.004272363150867821, "splink_score_bin_high": 0.43, "splink_score_bin_low": 0.42 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.44, "splink_score_bin_low": 0.43 }, { "binwidth": 0.010000000000000009, "count_rows": 7, "freqdensity": 699.9999999999994, "normalised": 0.0018691588785046717, "splink_score_bin_high": 0.45, "splink_score_bin_low": 0.44 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.46, "splink_score_bin_low": 0.45 }, { 
"binwidth": 0.009999999999999953, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.47, "splink_score_bin_low": 0.46 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.48, "splink_score_bin_low": 0.47 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.49, "splink_score_bin_low": 0.48 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.5, "splink_score_bin_low": 0.49 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.51, "splink_score_bin_low": 0.5 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.52, "splink_score_bin_low": 0.51 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.53, "splink_score_bin_low": 0.52 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.54, "splink_score_bin_low": 0.53 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.55, "splink_score_bin_low": 0.54 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.56, "splink_score_bin_low": 0.55 }, { "binwidth": 0.009999999999999898, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.57, "splink_score_bin_low": 0.56 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.58, "splink_score_bin_low": 0.57 }, { "binwidth": 0.010000000000000009, "count_rows": 12, "freqdensity": 1199.9999999999989, "normalised": 0.0032042723631508655, "splink_score_bin_high": 0.59, "splink_score_bin_low": 0.58 }, { 
"binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.6, "splink_score_bin_low": 0.59 }, { "binwidth": 0.010000000000000009, "count_rows": 2, "freqdensity": 199.99999999999983, "normalised": 0.0005340453938584777, "splink_score_bin_high": 0.61, "splink_score_bin_low": 0.6 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.62, "splink_score_bin_low": 0.61 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.63, "splink_score_bin_low": 0.62 }, { "binwidth": 0.010000000000000009, "count_rows": 7, "freqdensity": 699.9999999999994, "normalised": 0.0018691588785046717, "splink_score_bin_high": 0.64, "splink_score_bin_low": 0.63 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.65, "splink_score_bin_low": 0.64 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.66, "splink_score_bin_low": 0.65 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.67, "splink_score_bin_low": 0.66 }, { "binwidth": 0.010000000000000009, "count_rows": 6, "freqdensity": 599.9999999999994, "normalised": 0.0016021361815754327, "splink_score_bin_high": 0.68, "splink_score_bin_low": 0.67 }, { "binwidth": 0.009999999999999898, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.69, "splink_score_bin_low": 0.68 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.7, "splink_score_bin_low": 0.69 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.71, "splink_score_bin_low": 0.7 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 
0, "splink_score_bin_high": 0.72, "splink_score_bin_low": 0.71 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.73, "splink_score_bin_low": 0.72 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.74, "splink_score_bin_low": 0.73 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.75, "splink_score_bin_low": 0.74 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.76, "splink_score_bin_low": 0.75 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.77, "splink_score_bin_low": 0.76 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.78, "splink_score_bin_low": 0.77 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.79, "splink_score_bin_low": 0.78 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.8, "splink_score_bin_low": 0.79 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.81, "splink_score_bin_low": 0.8 }, { "binwidth": 0.009999999999999898, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.82, "splink_score_bin_low": 0.81 }, { "binwidth": 0.010000000000000009, "count_rows": 5, "freqdensity": 499.99999999999955, "normalised": 0.001335113484646194, "splink_score_bin_high": 0.83, "splink_score_bin_low": 0.82 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.84, "splink_score_bin_low": 0.83 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, 
"splink_score_bin_high": 0.85, "splink_score_bin_low": 0.84 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.86, "splink_score_bin_low": 0.85 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.87, "splink_score_bin_low": 0.86 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.88, "splink_score_bin_low": 0.87 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.89, "splink_score_bin_low": 0.88 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.9, "splink_score_bin_low": 0.89 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.91, "splink_score_bin_low": 0.9 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.92, "splink_score_bin_low": 0.91 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.93, "splink_score_bin_low": 0.92 }, { "binwidth": 0.009999999999999898, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.94, "splink_score_bin_low": 0.93 }, { "binwidth": 0.010000000000000009, "count_rows": 13, "freqdensity": 1299.9999999999989, "normalised": 0.003471295060080104, "splink_score_bin_high": 0.95, "splink_score_bin_low": 0.94 }, { "binwidth": 0.010000000000000009, "count_rows": 3, "freqdensity": 299.9999999999997, "normalised": 0.0008010680907877164, "splink_score_bin_high": 0.96, "splink_score_bin_low": 0.95 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.97, "splink_score_bin_low": 0.96 }, { "binwidth": 0.010000000000000009, "count_rows": 9, 
"freqdensity": 899.9999999999992, "normalised": 0.0024032042723631493, "splink_score_bin_high": 0.98, "splink_score_bin_low": 0.97 }, { "binwidth": 0.010000000000000009, "count_rows": 0, "freqdensity": 0, "normalised": 0, "splink_score_bin_high": 0.99, "splink_score_bin_low": 0.98 }, { "binwidth": 0.010000000000000009, "count_rows": 743, "freqdensity": 74299.99999999993, "normalised": 0.1983978638184244, "splink_score_bin_high": 1, "splink_score_bin_low": 0.99 } ] }, "encoding": { "tooltip": [ { "field": "count_rows", "title": "count", "type": "quantitative" } ], "x": { "axis": { "title": "splink score" }, "bin": "binned", "field": "splink_score_bin_low", "type": "quantitative" }, "x2": { "field": "splink_score_bin_high" }, "y": { "axis": { "title": "probability density" }, "field": "normalised", "type": "quantitative" } }, "height": 200, "mark": "bar", "title": "Histogram of splink scores", "width": 700 }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from splink.diagnostics import splink_score_histogram\n", "from pyspark.sql.functions import expr \n", "splink_score_histogram(df_e.filter(expr('match_probability > 0.001')), spark)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }