{ "cells": [ { "cell_type": "markdown", "id": "84cca40c", "metadata": {}, "source": [ "# Predicting which records match\n", "\n", "In the previous tutorial, we built and estimated a linkage model.\n", "\n", "In this tutorial, we will load the estimated model and use it to predict which pairwise record comparisons match." ] }, { "cell_type": "code", "execution_count": 1, "id": "48f57034", "metadata": {}, "outputs": [], "source": [ "from splink.duckdb.duckdb_linker import DuckDBLinker\n", "import pandas as pd \n", "pd.options.display.max_columns = 1000\n", "df = pd.read_csv(\"./data/fake_1000.csv\")\n" ] }, { "cell_type": "markdown", "id": "d77b6eb8", "metadata": {}, "source": [ "## Load the estimated model from the previous tutorial" ] }, { "cell_type": "code", "execution_count": 2, "id": "619553a5", "metadata": {}, "outputs": [], "source": [ "linker = DuckDBLinker(df)\n", "linker.load_settings_from_json(\"./demo_settings/saved_model_from_demo.json\")" ] }, { "cell_type": "markdown", "id": "c1d97518", "metadata": {}, "source": [ "## Predicting match weights using the trained model\n", "\n", "We use `linker.predict()` to run the model.\n", "\n", "Under the hood, this will:\n", "\n", "- Generate all pairwise record comparisons that match at least one of the `blocking_rules_to_generate_predictions`\n", "\n", "- Use the rules specified in the `Comparisons` to evaluate the similarity of each pair of input records\n", "\n", "- Use the estimated match weights, applying term frequency adjustments where requested, to produce the final `match_weight` and `match_probability` scores\n", "\n", "Optionally, a `threshold_match_probability` or `threshold_match_weight` can be provided, which will drop any row where the predicted score is below the threshold." ] }, { "cell_type": "code", "execution_count": 3, "id": "ead23f3e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " match_weight match_probability unique_id_l unique_id_r first_name_l \\\n", "0 12.738985 0.999854 4 5 Grace \n", "1 11.236825 0.999586 26 29 Thomas \n", "2 11.236825 0.999586 28 29 Thomas \n", "3 8.236439 0.996695 47 50 Erin \n", "4 4.742257 0.963983 49 50 Erin \n", "\n", " first_name_r gamma_first_name bf_first_name surname_l surname_r \\\n", "0 Grace 2 85.548522 NaN Kelly \n", "1 Thomas 2 85.548522 Gabriel Gabriel \n", "2 Thomas 2 85.548522 Gabriel Gabriel \n", "3 Erin 2 85.548522 Rogers Roers \n", "4 Erin 2 85.548522 NaN Roers \n", "\n", " gamma_surname bf_surname dob_l dob_r gamma_dob bf_dob \\\n", "0 -1 1.000000 1997-04-26 1991-04-26 2 93.585230 \n", "1 3 90.169535 1976-09-15 1976-08-15 2 93.585230 \n", "2 3 90.169535 1976-09-15 1976-08-15 2 93.585230 \n", "3 2 79.149409 2010-01-02 2010-03-03 1 13.323346 \n", "4 -1 1.000000 2010-03-01 2010-03-03 2 93.585230 \n", "\n", " city_l city_r gamma_city tf_city_l tf_city_r bf_city bf_tf_adj_city \\\n", "0 Hull NaN -1 0.001230 NaN 1.0 1.0 \n", "1 Loodon NaN -1 0.001230 NaN 1.0 1.0 \n", "2 London NaN -1 0.212792 NaN 1.0 1.0 \n", "3 London NaN -1 0.212792 NaN 1.0 1.0 \n", "4 London NaN -1 0.212792 NaN 1.0 1.0 \n", "\n", " email_l email_r gamma_email bf_email \\\n", "0 grace.kelly52@jones.com grace.kelly52@jones.com 3 255.419953 \n", "1 gabriel.t54@nnichls.info NaN -1 1.000000 \n", "2 gabriel.t54@nichols.info NaN -1 1.000000 \n", "3 e.rogers3@hopkins.org NaN -1 1.000000 \n", "4 e.rogers3@honkips.org NaN -1 1.000000 \n", "\n", " match_key \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_predictions = linker.predict(threshold_match_probability=0.2)\n", "df_predictions.as_pandas_dataframe(limit=5)" ] }, { "cell_type": "markdown", "id": "f00370bb", "metadata": {}, "source": [ "## Clustering\n", "\n", "The result of `linker.predict()` is a list of pairwise record comparisons and their associated scores. 
For instance, if we have input records A, B, C, D and E, the result could be represented conceptually as:\n", "```\n", "A -> B with score 0.9\n", "B -> C with score 0.95\n", "C -> D with score 0.1\n", "D -> E with score 0.99\n", "```\n", "\n", "Often, an alternative representation of this result is more useful: each row is an input record, and records that link are assigned to the same cluster.\n", "\n", "With a score threshold of 0.5, the above data could be represented conceptually as:\n", "\n", "```\n", "ID, Cluster ID\n", "A, 1\n", "B, 1\n", "C, 1\n", "D, 2\n", "E, 2\n", "```\n", "\n", "The algorithm that converts the pairwise results into clusters is called connected components, and it is included in Splink. You can use it as follows:" ] }, { "cell_type": "code", "execution_count": 4, "id": "257ae717", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Completed iteration 1, root rows count 10\n", "Completed iteration 2, root rows count 0\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " cluster_id unique_id first_name surname dob city \\\n", "0 0 0 Robert Alan 1971-06-24 NaN \n", "1 0 1 Robert Allen 1971-05-24 NaN \n", "2 0 2 Rob Allen 1971-06-24 London \n", "3 0 3 Robert Alen 1971-06-24 Lonon \n", "4 4 4 Grace NaN 1997-04-26 Hull \n", "5 4 5 Grace Kelly 1991-04-26 NaN \n", "6 6 6 Logan pMurphy 1973-08-01 NaN \n", "7 7 7 NaN NaN 2015-03-03 Portsmouth \n", "8 8 8 NaN Dean 2015-03-03 NaN \n", "9 8 9 Evie Dean 2015-03-03 Pootsmruth \n", "\n", " email cluster tf_city \n", "0 robert255@smith.net 0 NaN \n", "1 roberta25@smith.net 0 NaN \n", "2 roberta25@smith.net 0 0.212792 \n", "3 NaN 0 0.007380 \n", "4 grace.kelly52@jones.com 1 0.001230 \n", "5 grace.kelly52@jones.com 1 NaN \n", "6 NaN 2 NaN \n", "7 evied56@harris-bailey.net 3 0.017220 \n", "8 NaN 3 NaN \n", "9 evihd56@earris-bailey.net 3 0.001230 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters = linker.cluster_pairwise_predictions_at_threshold(df_predictions, threshold_match_probability=0.5)\n", "clusters.as_pandas_dataframe(limit=10)" ] } ], "metadata": { "kernelspec": { "display_name": "splink_demos", "language": "python", "name": "splink_demos" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "vscode": { "interpreter": { "hash": "3b53fa520a31e303a9636a08ff10a3bbc14893ee50cb37445791fa59628fc75b" } } }, "nbformat": 4, "nbformat_minor": 5 }