{
"cells": [
{
"cell_type": "markdown",
"id": "84cca40c",
"metadata": {},
"source": [
"# Predicting which records match\n",
"\n",
"In the previous tutorial, we built and estimated a linkage model.\n",
"\n",
"In this tutorial, we will load the estimated model and use it to make predictions of which pairwise record comparisons match."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "48f57034",
"metadata": {},
"outputs": [],
"source": [
"from splink.duckdb.duckdb_linker import DuckDBLinker\n",
"import pandas as pd \n",
"pd.options.display.max_columns = 1000\n",
"df = pd.read_csv(\"./data/fake_1000.csv\")\n"
]
},
{
"cell_type": "markdown",
"id": "d77b6eb8",
"metadata": {},
"source": [
"## Load estimated model from previous tutorial"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "619553a5",
"metadata": {},
"outputs": [],
"source": [
"linker = DuckDBLinker(df)\n",
"linker.load_settings_from_json(\"./demo_settings/saved_model_from_demo.json\")"
]
},
{
"cell_type": "markdown",
"id": "c1d97518",
"metadata": {},
"source": [
"# Predicting match weights using the trained model\n",
"\n",
"We use `linker.predict()` to run the model. \n",
"\n",
"Under the hood this will:\n",
"\n",
"- Generate all pairwise record comparisons that match at least one of the `blocking_rules_to_generate_predictions`\n",
"\n",
"- Use the rules specified in the `Comparisons` to evaluate the similarity of the input data\n",
"\n",
"- Use the estimated match weights, applying term frequency adjustments where requested to produce the final `match_weight` and `match_probability` scores\n",
"\n",
"Optionally, a `threshold_match_probability` or `threshold_match_weight` can be provided, which will drop any row where the predicted score is below the threshold."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ead23f3e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" match_weight | \n",
" match_probability | \n",
" unique_id_l | \n",
" unique_id_r | \n",
" first_name_l | \n",
" first_name_r | \n",
" gamma_first_name | \n",
" bf_first_name | \n",
" surname_l | \n",
" surname_r | \n",
" gamma_surname | \n",
" bf_surname | \n",
" dob_l | \n",
" dob_r | \n",
" gamma_dob | \n",
" bf_dob | \n",
" city_l | \n",
" city_r | \n",
" gamma_city | \n",
" bf_city | \n",
" bf_tf_adj_city | \n",
" tf_city_l | \n",
" tf_city_r | \n",
" email_l | \n",
" email_r | \n",
" gamma_email | \n",
" bf_email | \n",
" cluster_l | \n",
" cluster_r | \n",
" match_key | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 14.295585 | \n",
" 0.999950 | \n",
" 4 | \n",
" 5 | \n",
" Grace | \n",
" Grace | \n",
" 2 | \n",
" 85.549242 | \n",
" NaN | \n",
" Kelly | \n",
" -1 | \n",
" 1.000000 | \n",
" 1997-04-26 | \n",
" 1991-04-26 | \n",
" 2 | \n",
" 93.584788 | \n",
" Hull | \n",
" NaN | \n",
" -1 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 0.001230 | \n",
" NaN | \n",
" grace.kelly52@jones.com | \n",
" grace.kelly52@jones.com | \n",
" 3 | \n",
" 255.419933 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 12.793439 | \n",
" 0.999859 | \n",
" 26 | \n",
" 29 | \n",
" Thomas | \n",
" Thomas | \n",
" 2 | \n",
" 85.549242 | \n",
" Gabriel | \n",
" Gabriel | \n",
" 3 | \n",
" 90.170377 | \n",
" 1976-09-15 | \n",
" 1976-08-15 | \n",
" 2 | \n",
" 93.584788 | \n",
" Loodon | \n",
" NaN | \n",
" -1 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 0.001230 | \n",
" NaN | \n",
" gabriel.t54@nnichls.info | \n",
" NaN | \n",
" -1 | \n",
" 1.000000 | \n",
" 11 | \n",
" 11 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 12.793439 | \n",
" 0.999859 | \n",
" 28 | \n",
" 29 | \n",
" Thomas | \n",
" Thomas | \n",
" 2 | \n",
" 85.549242 | \n",
" Gabriel | \n",
" Gabriel | \n",
" 3 | \n",
" 90.170377 | \n",
" 1976-09-15 | \n",
" 1976-08-15 | \n",
" 2 | \n",
" 93.584788 | \n",
" London | \n",
" NaN | \n",
" -1 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 0.212792 | \n",
" NaN | \n",
" gabriel.t54@nichols.info | \n",
" NaN | \n",
" -1 | \n",
" 1.000000 | \n",
" 11 | \n",
" 11 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" -0.626931 | \n",
" 0.393039 | \n",
" 37 | \n",
" 860 | \n",
" Theodore | \n",
" Theodore | \n",
" 2 | \n",
" 85.549242 | \n",
" Morris | \n",
" Marshall | \n",
" 0 | \n",
" 0.261908 | \n",
" 1978-08-19 | \n",
" 1972-07-25 | \n",
" 0 | \n",
" 0.255612 | \n",
" Birmingham | \n",
" Birmingham | \n",
" 1 | \n",
" 10.257653 | \n",
" 1.120874 | \n",
" 0.049200 | \n",
" 0.0492 | \n",
" t.m39@brooks-sawyer.com | \n",
" NaN | \n",
" -1 | \n",
" 1.000000 | \n",
" 13 | \n",
" 214 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" -0.626931 | \n",
" 0.393039 | \n",
" 39 | \n",
" 860 | \n",
" Theodore | \n",
" Theodore | \n",
" 2 | \n",
" 85.549242 | \n",
" Morris | \n",
" Marshall | \n",
" 0 | \n",
" 0.261908 | \n",
" 1978-08-19 | \n",
" 1972-07-25 | \n",
" 0 | \n",
" 0.255612 | \n",
" Birmingham | \n",
" Birmingham | \n",
" 1 | \n",
" 10.257653 | \n",
" 1.120874 | \n",
" 0.049200 | \n",
" 0.0492 | \n",
" t.m39@brooks-sawyer.com | \n",
" NaN | \n",
" -1 | \n",
" 1.000000 | \n",
" 13 | \n",
" 214 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" match_weight match_probability unique_id_l unique_id_r first_name_l \\\n",
"0 14.295585 0.999950 4 5 Grace \n",
"1 12.793439 0.999859 26 29 Thomas \n",
"2 12.793439 0.999859 28 29 Thomas \n",
"3 -0.626931 0.393039 37 860 Theodore \n",
"4 -0.626931 0.393039 39 860 Theodore \n",
"\n",
" first_name_r gamma_first_name bf_first_name surname_l surname_r \\\n",
"0 Grace 2 85.549242 NaN Kelly \n",
"1 Thomas 2 85.549242 Gabriel Gabriel \n",
"2 Thomas 2 85.549242 Gabriel Gabriel \n",
"3 Theodore 2 85.549242 Morris Marshall \n",
"4 Theodore 2 85.549242 Morris Marshall \n",
"\n",
" gamma_surname bf_surname dob_l dob_r gamma_dob bf_dob \\\n",
"0 -1 1.000000 1997-04-26 1991-04-26 2 93.584788 \n",
"1 3 90.170377 1976-09-15 1976-08-15 2 93.584788 \n",
"2 3 90.170377 1976-09-15 1976-08-15 2 93.584788 \n",
"3 0 0.261908 1978-08-19 1972-07-25 0 0.255612 \n",
"4 0 0.261908 1978-08-19 1972-07-25 0 0.255612 \n",
"\n",
" city_l city_r gamma_city bf_city bf_tf_adj_city tf_city_l \\\n",
"0 Hull NaN -1 1.000000 1.000000 0.001230 \n",
"1 Loodon NaN -1 1.000000 1.000000 0.001230 \n",
"2 London NaN -1 1.000000 1.000000 0.212792 \n",
"3 Birmingham Birmingham 1 10.257653 1.120874 0.049200 \n",
"4 Birmingham Birmingham 1 10.257653 1.120874 0.049200 \n",
"\n",
" tf_city_r email_l email_r gamma_email \\\n",
"0 NaN grace.kelly52@jones.com grace.kelly52@jones.com 3 \n",
"1 NaN gabriel.t54@nnichls.info NaN -1 \n",
"2 NaN gabriel.t54@nichols.info NaN -1 \n",
"3 0.0492 t.m39@brooks-sawyer.com NaN -1 \n",
"4 0.0492 t.m39@brooks-sawyer.com NaN -1 \n",
"\n",
" bf_email cluster_l cluster_r match_key \n",
"0 255.419933 1 1 0 \n",
"1 1.000000 11 11 0 \n",
"2 1.000000 11 11 0 \n",
"3 1.000000 13 214 0 \n",
"4 1.000000 13 214 0 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_predictions = linker.predict(threshold_match_probability=0.2)\n",
"df_predictions.as_pandas_dataframe(limit=5)"
]
},
{
"cell_type": "markdown",
"id": "f00370bb",
"metadata": {},
"source": [
"## Clustering\n",
"\n",
"The result of `linker.predict()` is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C and D, it could be represented conceptually as:\n",
"```\n",
"A -> B with score 0.9\n",
"B -> C with score 0.95\n",
"C -> D with score 0.1\n",
"D -> E with score 0.99\n",
"```\n",
"\n",
"Often, an alternative representation of this result is more useful, where each row is an input record, and where records link, they are assigned to the same cluster.\n",
"\n",
"With a score threshold of 0.5, the above data could be represented conceptually as:\n",
"\n",
"```\n",
"ID, Cluster ID\n",
"A, 1\n",
"B, 1\n",
"C, 1\n",
"D, 2\n",
"E, 2\n",
"```\n",
"\n",
"The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink. You can use it as follows:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "257ae717",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Completed iteration 1, root rows count 14\n",
"Completed iteration 2, root rows count 0\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" cluster_id | \n",
" unique_id | \n",
" first_name | \n",
" surname | \n",
" dob | \n",
" city | \n",
" email | \n",
" cluster | \n",
" tf_city | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" Robert | \n",
" Alan | \n",
" 1971-06-24 | \n",
" NaN | \n",
" robert255@smith.net | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" Robert | \n",
" Allen | \n",
" 1971-05-24 | \n",
" NaN | \n",
" roberta25@smith.net | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 2 | \n",
" Rob | \n",
" Allen | \n",
" 1971-06-24 | \n",
" London | \n",
" roberta25@smith.net | \n",
" 0 | \n",
" 0.212792 | \n",
"
\n",
" \n",
" 3 | \n",
" 0 | \n",
" 3 | \n",
" Robert | \n",
" Alen | \n",
" 1971-06-24 | \n",
" Lonon | \n",
" NaN | \n",
" 0 | \n",
" 0.007380 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" Grace | \n",
" NaN | \n",
" 1997-04-26 | \n",
" Hull | \n",
" grace.kelly52@jones.com | \n",
" 1 | \n",
" 0.001230 | \n",
"
\n",
" \n",
" 5 | \n",
" 4 | \n",
" 5 | \n",
" Grace | \n",
" Kelly | \n",
" 1991-04-26 | \n",
" NaN | \n",
" grace.kelly52@jones.com | \n",
" 1 | \n",
" NaN | \n",
"
\n",
" \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" Logan | \n",
" pMurphy | \n",
" 1973-08-01 | \n",
" NaN | \n",
" NaN | \n",
" 2 | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" NaN | \n",
" NaN | \n",
" 2015-03-03 | \n",
" Portsmouth | \n",
" evied56@harris-bailey.net | \n",
" 3 | \n",
" 0.017220 | \n",
"
\n",
" \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" NaN | \n",
" Dean | \n",
" 2015-03-03 | \n",
" NaN | \n",
" NaN | \n",
" 3 | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" 8 | \n",
" 9 | \n",
" Evie | \n",
" Dean | \n",
" 2015-03-03 | \n",
" Pootsmruth | \n",
" evihd56@earris-bailey.net | \n",
" 3 | \n",
" 0.001230 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cluster_id unique_id first_name surname dob city \\\n",
"0 0 0 Robert Alan 1971-06-24 NaN \n",
"1 0 1 Robert Allen 1971-05-24 NaN \n",
"2 0 2 Rob Allen 1971-06-24 London \n",
"3 0 3 Robert Alen 1971-06-24 Lonon \n",
"4 4 4 Grace NaN 1997-04-26 Hull \n",
"5 4 5 Grace Kelly 1991-04-26 NaN \n",
"6 6 6 Logan pMurphy 1973-08-01 NaN \n",
"7 7 7 NaN NaN 2015-03-03 Portsmouth \n",
"8 8 8 NaN Dean 2015-03-03 NaN \n",
"9 8 9 Evie Dean 2015-03-03 Pootsmruth \n",
"\n",
" email cluster tf_city \n",
"0 robert255@smith.net 0 NaN \n",
"1 roberta25@smith.net 0 NaN \n",
"2 roberta25@smith.net 0 0.212792 \n",
"3 NaN 0 0.007380 \n",
"4 grace.kelly52@jones.com 1 0.001230 \n",
"5 grace.kelly52@jones.com 1 NaN \n",
"6 NaN 2 NaN \n",
"7 evied56@harris-bailey.net 3 0.017220 \n",
"8 NaN 3 NaN \n",
"9 evihd56@earris-bailey.net 3 0.001230 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clusters = linker.cluster_pairwise_predictions_at_threshold(df_predictions, threshold_match_probability=0.5)\n",
"clusters.as_pandas_dataframe(limit=10)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"vscode": {
"interpreter": {
"hash": "3b53fa520a31e303a9636a08ff10a3bbc14893ee50cb37445791fa59628fc75b"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}