{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimating m from a sample of pairwise labels\n", "\n", "In this example, we estimate the m probabilities of the model from a table containing pairwise record comparisons which we know are 'true' matches. For example, these may be the result of work by a clerical team who have manually labelled a sample of matches.\n", "\n", "The table must be in the following format:\n", "\n", "|source_dataset_l|unique_id_l|source_dataset_r|unique_id_r|\n", "|----------------|-----------|----------------|-----------|\n", "|df_1 |1 |df_2 |2 |\n", "|df_1 |1 |df_2 |3 |\n", "\n", "It is assumed that every record in the table represents a certain match.\n", "\n", "Note that the column names above are the defaults. They should correspond to the values you've set for [`unique_id_column_name`](https://moj-analytical-services.github.io/splink/settings_dict_guide.html#unique_id_column_name) and [`source_dataset_column_name`](https://moj-analytical-services.github.io/splink/settings_dict_guide.html#source_dataset_column_name), if you've chosen custom values.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RendererRegistry.enable('mimetype')" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "import altair as alt\n", "alt.renderers.enable(\"mimetype\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
      unique_id_l  source_dataset_l  unique_id_r  source_dataset_r
0     0            fake_1000         3            fake_1000
1     1            fake_1000         3            fake_1000
2     2            fake_1000         3            fake_1000
3     4            fake_1000         5            fake_1000
4     7            fake_1000         10           fake_1000
...   ...          ...               ...          ...
2026  978          fake_1000         979          fake_1000
2027  985          fake_1000         986          fake_1000
2028  624          fake_1000         626          fake_1000
2029  625          fake_1000         626          fake_1000
2030  624          fake_1000         625          fake_1000
\n", "

2031 rows × 4 columns

\n", "
" ], "text/plain": [ " unique_id_l source_dataset_l unique_id_r source_dataset_r\n", "0 0 fake_1000 3 fake_1000\n", "1 1 fake_1000 3 fake_1000\n", "2 2 fake_1000 3 fake_1000\n", "3 4 fake_1000 5 fake_1000\n", "4 7 fake_1000 10 fake_1000\n", "... ... ... ... ...\n", "2026 978 fake_1000 979 fake_1000\n", "2027 985 fake_1000 986 fake_1000\n", "2028 624 fake_1000 626 fake_1000\n", "2029 625 fake_1000 626 fake_1000\n", "2030 624 fake_1000 625 fake_1000\n", "\n", "[2031 rows x 4 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pairwise_labels = pd.read_csv(\"./data/pairwise_labels_to_estimate_m.csv\")\n", "pairwise_labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now proceed to estimate the Fellegi-Sunter model:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   unique_id  first_name  surname  dob         city  email                cluster
0  0          Robert      Alan     1971-06-24  NaN   robert255@smith.net  0
1  1          Robert      Allen    1971-05-24  NaN   roberta25@smith.net  0
\n", "
" ], "text/plain": [ " unique_id first_name surname dob city email cluster\n", "0 0 Robert Alan 1971-06-24 NaN robert255@smith.net 0\n", "1 1 Robert Allen 1971-05-24 NaN roberta25@smith.net 0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"./data/fake_1000.csv\")\n", "df.head(2)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from splink.duckdb.duckdb_linker import DuckDBLinker\n", "from splink.duckdb import duckdb_comparison_library as cl\n", "\n", "settings = {\n", " \"link_type\": \"dedupe_only\",\n", " \"blocking_rules_to_generate_predictions\": [\n", " \"l.first_name = r.first_name\",\n", " \"l.surname = r.surname\",\n", " ],\n", " \"comparisons\": [\n", " cl.levenshtein_at_thresholds(\"first_name\", 2),\n", " cl.levenshtein_at_thresholds(\"surname\", 2),\n", " cl.levenshtein_at_thresholds(\"dob\"),\n", " cl.exact_match(\"city\", term_frequency_adjustments=True),\n", " cl.levenshtein_at_thresholds(\"email\"),\n", " ],\n", " \"retain_matching_columns\": True,\n", " \"retain_intermediate_calculation_columns\": True,\n", "}" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "linker = DuckDBLinker(df, settings, set_up_basic_logging=False)\n", "deterministic_rules = [\n", " \"l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1\",\n", " \"l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1\",\n", " \"l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2\",\n", " \"l.email = r.email\"\n", "]\n", "\n", "linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "linker.estimate_u_using_random_sampling(max_pairs=1e6)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Register the pairwise labels table with the 
database, and then use it to estimate the m values\n", "labels_df = linker.register_labels_table(pairwise_labels, overwrite=True)\n", "linker.estimate_m_from_pairwise_labels(labels_df)\n", "\n", "\n", "# Note: if the labels table already exists in the database, you could instead run\n", "# linker.estimate_m_from_pairwise_labels(\"labels_tablename_here\")\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "training_blocking_rule = \"l.first_name = r.first_name\"\n", "linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v4+json": { "$schema": "https://vega.github.io/schema/vega-lite/v5.2.0.json", "config": { "title": { "anchor": "middle" }, "view": { "continuousHeight": 300, "continuousWidth": 400 } }, "data": { "values": [ { "comparison_level_label": "Exact match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 2, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.0057935713975033705, "estimated_probability_as_log_odds": -7.422948662194144, "m_or_u": "u", "sql_condition": "\"first_name_l\" = \"first_name_r\"" }, { "comparison_level_label": "Exact match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 2, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.4953998584571833, "estimated_probability_as_log_odds": -0.026547154611559317, "m_or_u": "m", "sql_condition": "\"first_name_l\" = \"first_name_r\"" }, { "comparison_level_label": "Levenshtein <= 2", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 1, "estimate_description": "estimate u by random sampling", "estimated_probability": 
0.010119901990634016, "estimated_probability_as_log_odds": -6.611986562330359, "m_or_u": "u", "sql_condition": "levenshtein(\"first_name_l\", \"first_name_r\") <= 2" }, { "comparison_level_label": "Levenshtein <= 2", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 1, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.27034677990092004, "estimated_probability_as_log_odds": -1.4323997893323595, "m_or_u": "m", "sql_condition": "levenshtein(\"first_name_l\", \"first_name_r\") <= 2" }, { "comparison_level_label": "All other comparisons", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 0, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.9840865266118626, "estimated_probability_as_log_odds": 5.950464503226166, "m_or_u": "u", "sql_condition": "ELSE" }, { "comparison_level_label": "All other comparisons", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 0, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.23425336164189667, "estimated_probability_as_log_odds": -1.7087973770195977, "m_or_u": "m", "sql_condition": "ELSE" }, { "comparison_level_label": "Exact match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 2, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.004889975550122249, "estimated_probability_as_log_odds": -7.668884984266247, "m_or_u": "u", "sql_condition": "\"surname_l\" = \"surname_r\"" }, { "comparison_level_label": "Exact match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 2, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.4485294117647059, "estimated_probability_as_log_odds": -0.2980813529329946, "m_or_u": "m", "sql_condition": "\"surname_l\" = \"surname_r\"" }, { 
"comparison_level_label": "Exact match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 2, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.6250962916703164, "estimated_probability_as_log_odds": 0.7375583478807848, "m_or_u": "m", "sql_condition": "\"surname_l\" = \"surname_r\"" }, { "comparison_level_label": "Levenshtein <= 2", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 1, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.007373772654946249, "estimated_probability_as_log_odds": -7.072703827504528, "m_or_u": "u", "sql_condition": "levenshtein(\"surname_l\", \"surname_r\") <= 2" }, { "comparison_level_label": "Levenshtein <= 2", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 1, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.2897058823529412, "estimated_probability_as_log_odds": -1.2938275593793966, "m_or_u": "m", "sql_condition": "levenshtein(\"surname_l\", \"surname_r\") <= 2" }, { "comparison_level_label": "Levenshtein <= 2", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 1, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.31577702531443674, "estimated_probability_as_log_odds": -1.115560337169851, "m_or_u": "m", "sql_condition": "levenshtein(\"surname_l\", \"surname_r\") <= 2" }, { "comparison_level_label": "All other comparisons", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 0, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.9877362517949315, "estimated_probability_as_log_odds": 6.331653973987579, "m_or_u": "u", "sql_condition": "ELSE" }, { "comparison_level_label": "All other comparisons", "comparison_name": "surname", "comparison_sort_order": 1, 
"comparison_vector_value": 0, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.26176470588235295, "estimated_probability_as_log_odds": -1.495810122984374, "m_or_u": "m", "sql_condition": "ELSE" }, { "comparison_level_label": "All other comparisons", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 0, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.059126683015246775, "estimated_probability_as_log_odds": -3.9921192358597515, "m_or_u": "m", "sql_condition": "ELSE" }, { "comparison_level_label": "Exact match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 3, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.0017477477477477479, "estimated_probability_as_log_odds": -9.157763635801777, "m_or_u": "u", "sql_condition": "\"dob_l\" = \"dob_r\"" }, { "comparison_level_label": "Exact match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 3, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.42146725750861647, "estimated_probability_as_log_odds": -0.4569780550512149, "m_or_u": "m", "sql_condition": "\"dob_l\" = \"dob_r\"" }, { "comparison_level_label": "Exact match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 3, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.3984390015055559, "estimated_probability_as_log_odds": -0.5943521609750629, "m_or_u": "m", "sql_condition": "\"dob_l\" = \"dob_r\"" }, { "comparison_level_label": "Levenshtein <= 1", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 2, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.0016016016016016017, "estimated_probability_as_log_odds": -9.283956487665113, "m_or_u": "u", "sql_condition": "levenshtein(\"dob_l\", 
\"dob_r\") <= 1" }, { "comparison_level_label": "Levenshtein <= 1", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 2, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.08862629246676514, "estimated_probability_as_log_odds": -3.3622360835408993, "m_or_u": "m", "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 1" }, { "comparison_level_label": "Levenshtein <= 1", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 2, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.13058655173623407, "estimated_probability_as_log_odds": -2.735036080978985, "m_or_u": "m", "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 1" }, { "comparison_level_label": "Levenshtein <= 2", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 1, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.015517517517517518, "estimated_probability_as_log_odds": -5.987395855901018, "m_or_u": "u", "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 2" }, { "comparison_level_label": "Levenshtein <= 2", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 1, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.19940915805022155, "estimated_probability_as_log_odds": -2.005333444303041, "m_or_u": "m", "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 2" }, { "comparison_level_label": "Levenshtein <= 2", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 1, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.18478518670418353, "estimated_probability_as_log_odds": -2.1413311591928785, "m_or_u": "m", "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 2" }, { "comparison_level_label": "All other comparisons", "comparison_name": "dob", "comparison_sort_order": 
2, "comparison_vector_value": 0, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.9811331331331331, "estimated_probability_as_log_odds": 5.700522147427907, "m_or_u": "u", "sql_condition": "ELSE" }, { "comparison_level_label": "All other comparisons", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 0, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.29049729197439683, "estimated_probability_as_log_odds": -1.288283475925544, "m_or_u": "m", "sql_condition": "ELSE" }, { "comparison_level_label": "All other comparisons", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 0, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.2861892600540266, "estimated_probability_as_log_odds": -1.318572075787331, "m_or_u": "m", "sql_condition": "ELSE" }, { "comparison_level_label": "Exact match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 1, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.0551475711801453, "estimated_probability_as_log_odds": -4.098719767635731, "m_or_u": "u", "sql_condition": "\"city_l\" = \"city_r\"" }, { "comparison_level_label": "Exact match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 1, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.5661492978566149, "estimated_probability_as_log_odds": 0.38398388881490503, "m_or_u": "m", "sql_condition": "\"city_l\" = \"city_r\"" }, { "comparison_level_label": "Exact match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 1, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.5617206210359919, "estimated_probability_as_log_odds": 0.3580019642355839, "m_or_u": "m", "sql_condition": "\"city_l\" = \"city_r\"" }, { 
"comparison_level_label": "All other comparisons", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 0, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.9448524288198547, "estimated_probability_as_log_odds": 4.09871976763573, "m_or_u": "u", "sql_condition": "ELSE" }, { "comparison_level_label": "All other comparisons", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 0, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.43385070214338506, "estimated_probability_as_log_odds": -0.3839838888149049, "m_or_u": "m", "sql_condition": "ELSE" }, { "comparison_level_label": "All other comparisons", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 0, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.43827937896400826, "estimated_probability_as_log_odds": -0.35800196423558334, "m_or_u": "m", "sql_condition": "ELSE" }, { "comparison_level_label": "Exact match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 3, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.0021938713143283602, "estimated_probability_as_log_odds": -8.829136816196753, "m_or_u": "u", "sql_condition": "\"email_l\" = \"email_r\"" }, { "comparison_level_label": "Exact match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 3, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.5340642129992169, "estimated_probability_as_log_odds": 0.1968820708288665, "m_or_u": "m", "sql_condition": "\"email_l\" = \"email_r\"" }, { "comparison_level_label": "Exact match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 3, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.5901618249897193, 
"estimated_probability_as_log_odds": 0.5260562294318846, "m_or_u": "m", "sql_condition": "\"email_l\" = \"email_r\"" }, { "comparison_level_label": "Levenshtein <= 1", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 2, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.0007334349848487773, "estimated_probability_as_log_odds": -10.411984784067583, "m_or_u": "u", "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 1" }, { "comparison_level_label": "Levenshtein <= 1", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 2, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.17854346123727485, "estimated_probability_as_log_odds": -2.201908948412526, "m_or_u": "m", "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 1" }, { "comparison_level_label": "Levenshtein <= 1", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 2, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.1765409703056437, "estimated_probability_as_log_odds": -2.2216938398112753, "m_or_u": "m", "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 1" }, { "comparison_level_label": "Levenshtein <= 2", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 1, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.000620846281034272, "estimated_probability_as_log_odds": -10.652580302192339, "m_or_u": "u", "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 2" }, { "comparison_level_label": "Levenshtein <= 2", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 1, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.1417384494909945, "estimated_probability_as_log_odds": -2.598186195877321, "m_or_u": "m", "sql_condition": "levenshtein(\"email_l\", 
\"email_r\") <= 2" }, { "comparison_level_label": "Levenshtein <= 2", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 1, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.1271138693200017, "estimated_probability_as_log_odds": -2.779672013828551, "m_or_u": "m", "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 2" }, { "comparison_level_label": "All other comparisons", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 0, "estimate_description": "estimate u by random sampling", "estimated_probability": 0.9964518474197885, "estimated_probability_as_log_odds": 8.133588228885133, "m_or_u": "u", "sql_condition": "ELSE" }, { "comparison_level_label": "All other comparisons", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 0, "estimate_description": "estimate m from pairwise labels", "estimated_probability": 0.1456538762725137, "estimated_probability_as_log_odds": -2.552276575215576, "m_or_u": "m", "sql_condition": "ELSE" }, { "comparison_level_label": "All other comparisons", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 0, "estimate_description": "EM, blocked on: l.first_name = r.first_name", "estimated_probability": 0.10618333538463538, "estimated_probability_as_log_odds": -3.073421578476505, "m_or_u": "m", "sql_condition": "ELSE" } ] }, "encoding": { "color": { "field": "estimate_description", "type": "nominal" }, "column": { "field": "m_or_u", "title": null, "type": "nominal" }, "row": { "field": "comparison_name", "header": { "labelAlign": "left", "labelAnchor": "middle", "labelAngle": 0 }, "sort": { "field": "comparison_sort_order" }, "title": null, "type": "nominal" }, "shape": { "field": "estimate_description", "scale": { "range": [ "circle", "square", "triangle", "diamond" ] }, "type": "nominal" }, "tooltip": [ { "field": "comparison_name", "type": "nominal" }, { 
"field": "estimate_description", "type": "nominal" }, { "field": "estimated_probability", "type": "quantitative" } ], "x": { "field": "estimated_probability_as_log_odds", "type": "quantitative" }, "y": { "axis": { "grid": true, "title": null }, "field": "comparison_level_label", "sort": { "field": "comparison_vector_value", "order": "descending" }, "type": "nominal" } }, "mark": { "filled": false, "opacity": 0.7, "size": 100, "type": "point" }, "resolve": { "scale": { "y": "independent" } }, "selection": { "selection_zoom": { "bind": "scales", "encodings": [ "x" ], "type": "interval" } }, "title": { "subtitle": "Use mousewheeel to zoom", "text": "Comparison of parameter estimates across training sessions" } }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linker.parameter_estimate_comparisons_chart()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v4+json": { "$schema": "https://vega.github.io/schema/vega-lite/v5.2.json", "config": { "header": { "title": null }, "mark": { "tooltip": null }, "title": { "anchor": "middle" }, "view": { "height": 60, "width": 400 } }, "data": { "values": [ { "bayes_factor": 0.0033430420247643373, "bayes_factor_description": "The probability that two random records drawn at random match is 0.003 or one in 300.1 records.This is equivalent to a starting match weight of -8.225.", "comparison_name": "probability_two_random_records_match", "comparison_sort_order": -1, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "", "log2_bayes_factor": -8.224622793739668, "m_probability": null, "m_probability_description": null, 
"max_comparison_vector_value": 0, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": null, "tf_adjustment_column": null, "tf_adjustment_weight": null, "u_probability": null, "u_probability_description": null }, { "bayes_factor": 85.50854463805632, "bayes_factor_description": "If comparison level is `exact match` then comparison is 85.51 times more likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 6.417996686710603, "m_probability": 0.4953998584571833, "m_probability_description": "Amongst matching record comparisons, 49.54% of records are in the exact match comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"first_name_l\" = \"first_name_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0057935713975033705, "u_probability_description": "Amongst non-matching record comparisons, 0.58% of records are in the exact match comparison level" }, { "bayes_factor": 26.71436740702888, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 26.71 times more likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Levenshtein <= 2", "log2_bayes_factor": 4.739543949609427, "m_probability": 0.27034677990092004, "m_probability_description": "Amongst matching record comparisons, 27.03% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"first_name_l\", \"first_name_r\") <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.010119901990634016, 
"u_probability_description": "Amongst non-matching record comparisons, 1.01% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.2380414275647221, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 4.20 times less likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -2.0707154199210716, "m_probability": 0.23425336164189667, "m_probability_description": "Amongst matching record comparisons, 23.43% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9840865266118626, "u_probability_description": "Amongst non-matching record comparisons, 98.41% of records are in the all other comparisons comparison level" }, { "bayes_factor": 109.77822817623102, "bayes_factor_description": "If comparison level is `exact match` then comparison is 109.78 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 6.778448149248172, "m_probability": 0.5368128517175111, "m_probability_description": "Amongst matching record comparisons, 53.68% of records are in the exact match comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"surname_l\" = \"surname_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.004889975550122249, "u_probability_description": "Amongst non-matching record comparisons, 0.49% of records are in the exact match comparison level" }, { 
"bayes_factor": 41.05652126806665, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 41.06 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Levenshtein <= 2", "log2_bayes_factor": 5.359539487508672, "m_probability": 0.302741453833689, "m_probability_description": "Amongst matching record comparisons, 30.27% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"surname_l\", \"surname_r\") <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.007373772654946249, "u_probability_description": "Amongst non-matching record comparisons, 0.74% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.16243779061185123, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 6.16 times less likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -2.622040785111616, "m_probability": 0.16044569444879986, "m_probability_description": "Amongst matching record comparisons, 16.04% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9877362517949315, "u_probability_description": "Amongst non-matching record comparisons, 98.77% of records are in the all other comparisons comparison level" }, { "bayes_factor": 234.56081121281733, "bayes_factor_description": "If comparison level is `exact match` then 
comparison is 234.56 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 7.873818187831652, "m_probability": 0.4099531295070862, "m_probability_description": "Amongst matching record comparisons, 41.00% of records are in the exact match comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"dob_l\" = \"dob_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0017477477477477479, "u_probability_description": "Amongst non-matching record comparisons, 0.17% of records are in the exact match comparison level" }, { "bayes_factor": 68.4355097996238, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 68.44 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Levenshtein <= 1", "log2_bayes_factor": 6.096673199508259, "m_probability": 0.1096064221014996, "m_probability_description": "Amongst matching record comparisons, 10.96% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0016016016016016017, "u_probability_description": "Amongst non-matching record comparisons, 0.16% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 12.379375255117102, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 12.38 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 1, 
"has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Levenshtein <= 2", "log2_bayes_factor": 3.6298666033743396, "m_probability": 0.19209717237720253, "m_probability_description": "Amongst matching record comparisons, 19.21% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.015517517517517518, "u_probability_description": "Amongst non-matching record comparisons, 1.55% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.2938880222028803, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 3.40 times less likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.766661533654638, "m_probability": 0.2883432760142117, "m_probability_description": "Amongst matching record comparisons, 28.83% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9811331331331331, "u_probability_description": "Amongst non-matching record comparisons, 98.11% of records are in the all other comparisons comparison level" }, { "bayes_factor": 10.225925591612203, "bayes_factor_description": "If comparison level is `exact match` then comparison is 10.23 times more likely to be a match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 1, "has_tf_adjustments": true, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 3.3541595283715577, 
"m_probability": 0.5639349594463035, "m_probability_description": "Amongst matching record comparisons, 56.39% of records are in the exact match comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"city_l\" = \"city_r\"", "tf_adjustment_column": "city", "tf_adjustment_weight": 1, "u_probability": 0.0551475711801453, "u_probability_description": "Amongst non-matching record comparisons, 5.51% of records are in the exact match comparison level" }, { "bayes_factor": 0.46151655777437467, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 2.17 times less likely to be a match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.1155456866902822, "m_probability": 0.43606504055369666, "m_probability_description": "Amongst matching record comparisons, 43.61% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9448524288198547, "u_probability_description": "Amongst non-matching record comparisons, 94.49% of records are in the all other comparisons comparison level" }, { "bayes_factor": 256.2196858691119, "bayes_factor_description": "If comparison level is `exact match` then comparison is 256.22 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 8.001237514848016, "m_probability": 0.5621130189944681, "m_probability_description": "Amongst matching record comparisons, 56.21% of records are in the exact match comparison 
level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"email_l\" = \"email_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0021938713143283602, "u_probability_description": "Amongst non-matching record comparisons, 0.22% of records are in the exact match comparison level" }, { "bayes_factor": 242.06946687723885, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 242.07 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Levenshtein <= 1", "log2_bayes_factor": 7.919277308092294, "m_probability": 0.1775422157714593, "m_probability_description": "Amongst matching record comparisons, 17.75% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0007334349848487773, "u_probability_description": "Amongst non-matching record comparisons, 0.07% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 216.52084181217398, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 216.52 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Levenshtein <= 2", "log2_bayes_factor": 7.758362092010155, "m_probability": 0.13442615940549812, "m_probability_description": "Amongst matching record comparisons, 13.44% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, 
"sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.000620846281034272, "u_probability_description": "Amongst non-matching record comparisons, 0.06% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.12636697513746203, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 7.91 times less likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -2.9843086173742144, "m_probability": 0.12591860582857453, "m_probability_description": "Amongst matching record comparisons, 12.59% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9964518474197885, "u_probability_description": "Amongst non-matching record comparisons, 99.65% of records are in the all other comparisons comparison level" } ] }, "resolve": { "axis": { "y": "independent" }, "scale": { "y": "independent" } }, "selection": { "zoom_selector": { "bind": "scales", "encodings": [ "x" ], "type": "interval" } }, "title": { "subtitle": "Use mousewheel to zoom", "text": "Model parameters (components of final match weight)" }, "vconcat": [ { "encoding": { "color": { "field": "log2_bayes_factor", "scale": { "domain": [ -10, 0, 10 ], "range": [ "red", "orange", "green" ] }, "title": "Match weight", "type": "quantitative" }, "tooltip": [ { "field": "comparison_name", "title": "Comparison name", "type": "nominal" }, { "field": "probability_two_random_records_match", "format": ".4f", "title": "Probability two random records match", "type": "nominal" }, { "field": "log2_bayes_factor", "format": 
",.4f", "title": "Equivalent match weight", "type": "quantitative" }, { "field": "bayes_factor_description", "title": "Match weight description", "type": "nominal" } ], "x": { "axis": { "domain": false, "labels": false, "ticks": false, "title": "" }, "field": "log2_bayes_factor", "scale": { "domain": [ -10, 10 ] }, "type": "quantitative" }, "y": { "axis": { "title": "Prior (starting) match weight", "titleAlign": "right", "titleAngle": 0, "titleFontWeight": "normal" }, "field": "label_for_charts", "sort": { "field": "comparison_vector_value", "order": "descending" }, "type": "nominal" } }, "height": 20, "mark": { "clip": true, "height": 15, "type": "bar" }, "selection": { "zoom_selector": { "bind": "scales", "encodings": [ "x" ], "type": "interval" } }, "transform": [ { "filter": "(datum.comparison_name == 'probability_two_random_records_match')" } ] }, { "encoding": { "color": { "field": "log2_bayes_factor", "scale": { "domain": [ -10, 0, 10 ], "range": [ "red", "orange", "green" ] }, "title": "Match weight", "type": "quantitative" }, "row": { "field": "comparison_name", "header": { "labelAlign": "left", "labelAnchor": "middle", "labelAngle": 0 }, "sort": { "field": "comparison_sort_order" }, "type": "nominal" }, "tooltip": [ { "field": "comparison_name", "title": "Comparison name", "type": "nominal" }, { "field": "label_for_charts", "title": "Label", "type": "ordinal" }, { "field": "sql_condition", "title": "SQL condition", "type": "nominal" }, { "field": "m_probability", "format": ".4f", "title": "M probability", "type": "quantitative" }, { "field": "u_probability", "format": ".4f", "title": "U probability", "type": "quantitative" }, { "field": "bayes_factor", "format": ",.4f", "title": "Bayes factor = m/u", "type": "quantitative" }, { "field": "log2_bayes_factor", "format": ",.4f", "title": "Match weight = log2(m/u)", "type": "quantitative" }, { "field": "bayes_factor_description", "title": "Match weight description", "type": "nominal" } ], "x": { "axis": { 
"title": "Comparison level match weight = log2(m/u)" }, "field": "log2_bayes_factor", "scale": { "domain": [ -10, 10 ] }, "type": "quantitative" }, "y": { "axis": { "title": null }, "field": "label_for_charts", "sort": { "field": "comparison_vector_value", "order": "descending" }, "type": "nominal" } }, "height": { "step": 12 }, "mark": { "clip": true, "type": "bar" }, "resolve": { "axis": { "y": "independent" }, "scale": { "y": "independent" } }, "selection": { "zoom_selector": { "bind": "scales", "encodings": [ "x" ], "type": "interval" } }, "transform": [ { "filter": "(datum.comparison_name != 'probability_two_random_records_match')" } ] } ] }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linker.match_weights_chart()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "vscode": { "interpreter": { "hash": "3b53fa520a31e303a9636a08ff10a3bbc14893ee50cb37445791fa59628fc75b" } } }, "nbformat": 4, "nbformat_minor": 4 }