{ "cells": [ { "cell_type": "markdown", "id": "84cca40c", "metadata": {}, "source": [ "# Predicting which records match\n", "\n", "In the previous tutorial, we built and estimated a linkage model.\n", "\n", "In this tutorial, we will load the estimated model and use it to make predictions of which pairwise record comparisons match." ] }, { "cell_type": "code", "execution_count": 1, "id": "48f57034", "metadata": {}, "outputs": [], "source": [ "from splink.duckdb.duckdb_linker import DuckDBLinker\n", "import pandas as pd \n", "pd.options.display.max_columns = 1000\n", "df = pd.read_csv(\"./data/fake_1000.csv\")\n" ] }, { "cell_type": "markdown", "id": "d77b6eb8", "metadata": {}, "source": [ "## Load estimated model from previous tutorial" ] }, { "cell_type": "code", "execution_count": 2, "id": "619553a5", "metadata": {}, "outputs": [], "source": [ "linker = DuckDBLinker(df)\n", "linker.load_settings_from_json(\"./demo_settings/saved_model_from_demo.json\")" ] }, { "cell_type": "markdown", "id": "c1d97518", "metadata": {}, "source": [ "# Predicting match weights using the trained model\n", "\n", "We use `linker.predict()` to run the model. \n", "\n", "Under the hood this will:\n", "\n", "- Generate all pairwise record comparisons that match at least one of the `blocking_rules_to_generate_predictions`\n", "\n", "- Use the rules specified in the `Comparisons` to evaluate the similarity of the input data\n", "\n", "- Use the estimated match weights, applying term frequency adjustments where requested to produce the final `match_weight` and `match_probability` scores\n", "\n", "Optionally, a `threshold_match_probability` or `threshold_match_weight` can be provided, which will drop any row where the predicted score is below the threshold." ] }, { "cell_type": "code", "execution_count": 3, "id": "ead23f3e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
match_weightmatch_probabilityunique_id_lunique_id_rfirst_name_lfirst_name_rgamma_first_namebf_first_namesurname_lsurname_rgamma_surnamebf_surnamedob_ldob_rgamma_dobbf_dobcity_lcity_rgamma_citybf_citybf_tf_adj_citytf_city_ltf_city_remail_lemail_rgamma_emailbf_emailcluster_lcluster_rmatch_key
014.2955850.99995045GraceGrace285.549242NaNKelly-11.0000001997-04-261991-04-26293.584788HullNaN-11.0000001.0000000.001230NaNgrace.kelly52@jones.comgrace.kelly52@jones.com3255.419933110
112.7934390.9998592629ThomasThomas285.549242GabrielGabriel390.1703771976-09-151976-08-15293.584788LoodonNaN-11.0000001.0000000.001230NaNgabriel.t54@nnichls.infoNaN-11.00000011110
212.7934390.9998592829ThomasThomas285.549242GabrielGabriel390.1703771976-09-151976-08-15293.584788LondonNaN-11.0000001.0000000.212792NaNgabriel.t54@nichols.infoNaN-11.00000011110
3-0.6269310.39303937860TheodoreTheodore285.549242MorrisMarshall00.2619081978-08-191972-07-2500.255612BirminghamBirmingham110.2576531.1208740.0492000.0492t.m39@brooks-sawyer.comNaN-11.000000132140
4-0.6269310.39303939860TheodoreTheodore285.549242MorrisMarshall00.2619081978-08-191972-07-2500.255612BirminghamBirmingham110.2576531.1208740.0492000.0492t.m39@brooks-sawyer.comNaN-11.000000132140
\n", "
" ], "text/plain": [ " match_weight match_probability unique_id_l unique_id_r first_name_l \\\n", "0 14.295585 0.999950 4 5 Grace \n", "1 12.793439 0.999859 26 29 Thomas \n", "2 12.793439 0.999859 28 29 Thomas \n", "3 -0.626931 0.393039 37 860 Theodore \n", "4 -0.626931 0.393039 39 860 Theodore \n", "\n", " first_name_r gamma_first_name bf_first_name surname_l surname_r \\\n", "0 Grace 2 85.549242 NaN Kelly \n", "1 Thomas 2 85.549242 Gabriel Gabriel \n", "2 Thomas 2 85.549242 Gabriel Gabriel \n", "3 Theodore 2 85.549242 Morris Marshall \n", "4 Theodore 2 85.549242 Morris Marshall \n", "\n", " gamma_surname bf_surname dob_l dob_r gamma_dob bf_dob \\\n", "0 -1 1.000000 1997-04-26 1991-04-26 2 93.584788 \n", "1 3 90.170377 1976-09-15 1976-08-15 2 93.584788 \n", "2 3 90.170377 1976-09-15 1976-08-15 2 93.584788 \n", "3 0 0.261908 1978-08-19 1972-07-25 0 0.255612 \n", "4 0 0.261908 1978-08-19 1972-07-25 0 0.255612 \n", "\n", " city_l city_r gamma_city bf_city bf_tf_adj_city tf_city_l \\\n", "0 Hull NaN -1 1.000000 1.000000 0.001230 \n", "1 Loodon NaN -1 1.000000 1.000000 0.001230 \n", "2 London NaN -1 1.000000 1.000000 0.212792 \n", "3 Birmingham Birmingham 1 10.257653 1.120874 0.049200 \n", "4 Birmingham Birmingham 1 10.257653 1.120874 0.049200 \n", "\n", " tf_city_r email_l email_r gamma_email \\\n", "0 NaN grace.kelly52@jones.com grace.kelly52@jones.com 3 \n", "1 NaN gabriel.t54@nnichls.info NaN -1 \n", "2 NaN gabriel.t54@nichols.info NaN -1 \n", "3 0.0492 t.m39@brooks-sawyer.com NaN -1 \n", "4 0.0492 t.m39@brooks-sawyer.com NaN -1 \n", "\n", " bf_email cluster_l cluster_r match_key \n", "0 255.419933 1 1 0 \n", "1 1.000000 11 11 0 \n", "2 1.000000 11 11 0 \n", "3 1.000000 13 214 0 \n", "4 1.000000 13 214 0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_predictions = linker.predict(threshold_match_probability=0.2)\n", "df_predictions.as_pandas_dataframe(limit=5)" ] }, { "cell_type": "markdown", "id": "f00370bb", "metadata": {}, "source": [ "## Clustering\n", "\n", "The result of `linker.predict()` is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C and D, it could be represented conceptually as:\n", "```\n", "A -> B with score 0.9\n", "B -> C with score 0.95\n", "C -> D with score 0.1\n", "D -> E with score 0.99\n", "```\n", "\n", "Often, an alternative representation of this result is more useful, where each row is an input record, and where records link, they are assigned to the same cluster.\n", "\n", "With a score threshold of 0.5, the above data could be represented conceptually as:\n", "\n", "```\n", "ID, Cluster ID\n", "A, 1\n", "B, 1\n", "C, 1\n", "D, 2\n", "E, 2\n", "```\n", "\n", "The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink. You can use it as follows:" ] }, { "cell_type": "code", "execution_count": 4, "id": "257ae717", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Completed iteration 1, root rows count 14\n", "Completed iteration 2, root rows count 0\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cluster_idunique_idfirst_namesurnamedobcityemailclustertf_city
000RobertAlan1971-06-24NaNrobert255@smith.net0NaN
101RobertAllen1971-05-24NaNroberta25@smith.net0NaN
202RobAllen1971-06-24Londonroberta25@smith.net00.212792
303RobertAlen1971-06-24LononNaN00.007380
444GraceNaN1997-04-26Hullgrace.kelly52@jones.com10.001230
545GraceKelly1991-04-26NaNgrace.kelly52@jones.com1NaN
666LoganpMurphy1973-08-01NaNNaN2NaN
777NaNNaN2015-03-03Portsmouthevied56@harris-bailey.net30.017220
888NaNDean2015-03-03NaNNaN3NaN
989EvieDean2015-03-03Pootsmruthevihd56@earris-bailey.net30.001230
\n", "
" ], "text/plain": [ " cluster_id unique_id first_name surname dob city \\\n", "0 0 0 Robert Alan 1971-06-24 NaN \n", "1 0 1 Robert Allen 1971-05-24 NaN \n", "2 0 2 Rob Allen 1971-06-24 London \n", "3 0 3 Robert Alen 1971-06-24 Lonon \n", "4 4 4 Grace NaN 1997-04-26 Hull \n", "5 4 5 Grace Kelly 1991-04-26 NaN \n", "6 6 6 Logan pMurphy 1973-08-01 NaN \n", "7 7 7 NaN NaN 2015-03-03 Portsmouth \n", "8 8 8 NaN Dean 2015-03-03 NaN \n", "9 8 9 Evie Dean 2015-03-03 Pootsmruth \n", "\n", " email cluster tf_city \n", "0 robert255@smith.net 0 NaN \n", "1 roberta25@smith.net 0 NaN \n", "2 roberta25@smith.net 0 0.212792 \n", "3 NaN 0 0.007380 \n", "4 grace.kelly52@jones.com 1 0.001230 \n", "5 grace.kelly52@jones.com 1 NaN \n", "6 NaN 2 NaN \n", "7 evied56@harris-bailey.net 3 0.017220 \n", "8 NaN 3 NaN \n", "9 evihd56@earris-bailey.net 3 0.001230 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters = linker.cluster_pairwise_predictions_at_threshold(df_predictions, threshold_match_probability=0.5)\n", "clusters.as_pandas_dataframe(limit=10)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "vscode": { "interpreter": { "hash": "3b53fa520a31e303a9636a08ff10a3bbc14893ee50cb37445791fa59628fc75b" } } }, "nbformat": 4, "nbformat_minor": 5 }