{ "cells": [ { "cell_type": "markdown", "id": "d729e302", "metadata": {}, "source": [ "# Specifying and estimating a linkage model\n", "\n", "\n", "In the last tutorial we looked at how we can use blocking rules to generate pairwise record comparisons.\n", "\n", "Now it's time to estimate a probabilistic linkage model to score each of these comparisons. The resultant match score is a prediction of whether the two records represent the same entity (e.g. are the same person). \n", "\n", "The purpose of estimating the model is to learn the relative importance of different parts of your data for the purpose of data linking. \n", "\n", "For example, a match on date of birth is a much stronger indicator that two records refer to the same entity than a match on gender. A mismatch on gender may be a stronger indicator against two records referring to the same entity than a mismatch on name, since names are more likely to be entered differently.\n", "\n", "The relative importance of different information is captured in the (partial) 'match weights', which can be learned from your data. These match weights are then added up to compute the overall match score.\n", "\n", "The match weights are derived from the `m` and `u` parameters of the underlying Fellegi-Sunter model. Splink uses various statistical routines to estimate these parameters. Further details of the underlying theory can be found [here](https://www.robinlinacre.com/intro_to_probabilistic_linkage/), which will help you understand this part of the tutorial." 
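, "\n", "\n", "For reference, the standard Fellegi-Sunter relationship between these quantities can be sketched as follows. The symbols $\\omega$, $\\lambda$ and $K$ are notation assumed here for illustration, not names used by Splink. The partial match weight of a comparison level is the base-2 logarithm of its Bayes factor $m/u$:\n", "\n", "$$\\omega = \\log_2\\left(\\frac{m}{u}\\right)$$\n", "\n", "The overall match weight adds the partial weights of the $K$ comparisons to a prior weight derived from $\\lambda$, the probability that two random records match, and can then be converted into a match probability:\n", "\n", "$$\\omega_{\\text{total}} = \\log_2\\left(\\frac{\\lambda}{1-\\lambda}\\right) + \\sum_{k=1}^{K} \\omega_k, \\qquad P(\\text{match}) = \\frac{2^{\\omega_{\\text{total}}}}{1 + 2^{\\omega_{\\text{total}}}}$$\n", "\n", "For instance, a comparison level with $m \\approx 0.4971$ and $u \\approx 0.005794$ has a Bayes factor of about 85.8 and a partial match weight of about 6.42."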
] }, { "cell_type": "code", "execution_count": 1, "id": "aa6a9e30", "metadata": {}, "outputs": [], "source": [ "# Begin by reading in the tutorial data again\n", "from splink.duckdb.duckdb_linker import DuckDBLinker\n", "import pandas as pd\n", "import altair as alt\n", "alt.renderers.enable(\"mimetype\")\n", "df = pd.read_csv(\"./data/fake_1000.csv\")" ] }, { "cell_type": "markdown", "id": "0f104340", "metadata": {}, "source": [ "## Specifying a linkage model\n", "\n", "To build a linkage model, the user defines the partial match weights that `splink` needs to estimate. This is done by defining how the information in the input records should be compared.\n", "\n", "To be concrete, here is an example comparison:\n", "\n", "\n", "first_name_l|first_name_r|surname_l|surname_r|dob_l |dob_r |city_l|city_r|email_l |email_r |\n", "------------|------------|---------|---------|----------|----------|------|------|-------------------|-------------------|\n", "Robert |Rob |Allen |Allen |1971-05-24|1971-06-24|nan |London|roberta25@smith.net|roberta25@smith.net|\n", "\n", "What functions should we use to assess the similarity of `Rob` vs. `Robert` in the `first_name` field? \n", "\n", "Should similarity in the `dob` field be computed in the same way, or a different way?\n", "\n", "Your job as the developer of a linkage model is to decide what comparisons are most appropriate for the types of data you have. \n", "\n", "Splink can then estimate how much weight to place on a fuzzy match of `Rob` vs. `Robert`, relative to an exact match on `Robert`, or a non-match.\n", "\n", "Defining these scenarios is done using `Comparison`s." 
] }, { "cell_type": "markdown", "id": "8a520392", "metadata": {}, "source": [ "### Comparisons\n", "\n", "The concept of a `Comparison` has a specific definition within Splink: it defines how data from one or more input columns is compared, using SQL expressions to assess similarity.\n", "\n", "For example, one `Comparison` may represent how similarity is assessed for a person's date of birth. \n", "\n", "Another `Comparison` may represent the comparison of a person's name or location.\n", "\n", "A model is composed of many `Comparison`s, which between them assess the similarity of all of the columns being used for data linking. \n", "\n", "Each `Comparison` contains two or more `ComparisonLevels` which define _n_ discrete gradations of similarity between the input columns within the Comparison.\n", "\n", "As such, `ComparisonLevels` are nested within `Comparisons` as follows:\n", "\n", "```\n", "Data Linking Model\n", "├─-- Comparison: Date of birth\n", "│ ├─-- ComparisonLevel: Exact match\n", "│ ├─-- ComparisonLevel: One character difference\n", "│ ├─-- ComparisonLevel: All other\n", "├─-- Comparison: Surname\n", "│ ├─-- ComparisonLevel: Exact match on surname\n", "│ ├─-- ComparisonLevel: All other\n", "│ etc.\n", "```\n", "\n", "Our example data would therefore result in the following comparisons, for `dob` and `surname`:\n", "\n", "|dob_l |dob_r |comparison_level |interpretation |\n", "|----------|----------|------------------------|---------------|\n", "|1971-05-24|1971-05-24|Exact match |great match |\n", "|1971-05-24|1971-06-24|One character difference|ok match |\n", "|1971-05-24|2000-01-02|All other |bad match |\n", "\n", "\n", "\n", "surname_l|surname_r|comparison_level |interpretation |\n", "---------|---------|-----------------|-----------------------------------------------------|\n", "Rob |Rob |Exact match |great match |\n", "Rob |Jane |All other |bad match |\n", "Rob |Robert |All other |bad match, this comparison has no notion of nicknames|\n", "\n", 
"More information about comparisons can be found [here](https://moj-analytical-services.github.io/splink/comparison.html).\n", "\n", "\n", "We will now use these concepts to build a data linking model." ] }, { "cell_type": "markdown", "id": "02000a24", "metadata": {}, "source": [ "### Specifying the model using comparisons\n", "\n", "Splink includes libraries of comparison functions to make it simple to get started:\n", "\n", "Let's start by looking at a `Comparison` for `first_name`:" ] }, { "cell_type": "code", "execution_count": 2, "id": "bd6143e7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Comparison 'Exact match vs. levenshtein at threshold 2 vs. anything else' of \"first_name\".\n", "Similarity is assessed using the following ComparisonLevels:\n", " - 'Null' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n", " - 'Exact match' with SQL rule: \"first_name_l\" = \"first_name_r\"\n", " - 'levenshtein <= 2' with SQL rule: levenshtein(\"first_name_l\", \"first_name_r\") <= 2\n", " - 'All other comparisons' with SQL rule: ELSE\n", "\n" ] } ], "source": [ "import splink.duckdb.duckdb_comparison_library as cl\n", "\n", "first_name_comparison = cl.levenshtein_at_thresholds(\"first_name\", 2)\n", "print(first_name_comparison.human_readable_description)\n" ] }, { "cell_type": "markdown", "id": "47b7677a", "metadata": {}, "source": [ "## Specifying the full settings dictionary\n", "\n", "`Comparisons` are specified as part of the Splink `settings`, a Python dictionary which controls all of the configuration of a Splink model:" ] }, { "cell_type": "code", "execution_count": 3, "id": "0fa0611a", "metadata": {}, "outputs": [], "source": [ "settings = {\n", " \"link_type\": \"dedupe_only\",\n", " \"comparisons\": [\n", " cl.exact_match(\"first_name\"),\n", " cl.levenshtein_at_thresholds(\"surname\"),\n", " cl.levenshtein_at_thresholds(\"dob\", 1),\n", " cl.exact_match(\"city\", 
term_frequency_adjustments=True),\n", " cl.levenshtein_at_thresholds(\"email\"),\n", " ],\n", " \"blocking_rules_to_generate_predictions\": [\n", " \"l.first_name = r.first_name\",\n", " \"l.surname = r.surname\",\n", " ],\n", " \"retain_matching_columns\": True,\n", " \"retain_intermediate_calculation_columns\": True,\n", "}\n", "\n", "linker = DuckDBLinker(df, settings)" ] }, { "cell_type": "markdown", "id": "657a1fb8", "metadata": {}, "source": [ "In words, this settings dictionary says:\n", "\n", "\n", "* We are performing a `dedupe_only` link (the other options are `link_only`, or `link_and_dedupe`, which may be used if there are multiple input datasets).\n", "* When comparing records, we will use information from the `first_name`, `surname`, `dob`, `city` and `email` columns to compute a match score.\n", "* The `blocking_rules_to_generate_predictions` states that we will only check for duplicates amongst records where either the `first_name` or `surname` is identical.\n", "* We have enabled term frequency adjustments for the `city` column, because some values (e.g. `London`) appear much more frequently than others.\n", "* We have set `retain_matching_columns` and `retain_intermediate_calculation_columns` to `True` so that Splink outputs additional information that helps the user understand the calculations. If they were `False`, the computations would run faster." 
] }, { "cell_type": "markdown", "id": "afa31386", "metadata": {}, "source": [ "## Estimate the parameters of the model\n", "\n", "Now that we have specified our linkage model, we need to estimate the [`probability_two_random_records_match`](https://moj-analytical-services.github.io/splink/settings_dict_guide.html#probability_two_random_records_match), `u`, and `m` parameters.\n", "\n", "- The `probability_two_random_records_match` parameter is the probability that two records taken at random from your input data represent a match (typically a very small number).\n", "\n", "- The `u` values are the proportion of records falling into each `ComparisonLevel` amongst truly *non-matching* records.\n", "\n", "- The `m` values are the proportion of records falling into each `ComparisonLevel` amongst truly *matching* records\n", "\n", "You can read more about [the theory of what these mean](https://www.robinlinacre.com/maths_of_fellegi_sunter/).\n", "\n", "We can estimate these parameters using unlabeled data. If we have labels, then we can estimate them even more accurately." ] }, { "cell_type": "markdown", "id": "c2871ac6", "metadata": {}, "source": [ "### Estimation of `probability_two_random_records_match`\n", "\n", "In some cases, the `probability_two_random_records_match` will be known. For example, if you are linking two tables of 10,000 records and expect a one-to-one match, then you should set this value to `1/10_000` in your settings instead of estimating it.\n", "\n", "More generally, this parameter is unknown and needs to be estimated. \n", "\n", "It can be estimated accurately enough for most purposes by combining a series of deterministic matching rules and a guess of the recall corresponding to those rules. 
For further details of the rationale behind this approach, see [here](https://github.com/moj-analytical-services/splink/issues/462#issuecomment-1227027995).\n", "\n", "In this example, I guess that the following deterministic matching rules have a recall of about 70%:" ] }, { "cell_type": "code", "execution_count": 4, "id": "cbf92120", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Probability two random records match is estimated to be 0.00333.\n", "This means that amongst all possible pairwise record comparisons, one in 300.13 are expected to match. With 499,500 total possible comparisons, we expect a total of around 1,664.29 matching pairs\n" ] } ], "source": [ "deterministic_rules = [\n", " \"l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1\",\n", " \"l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1\",\n", " \"l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2\",\n", " \"l.email = r.email\"\n", "]\n", "\n", "linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)" ] }, { "cell_type": "markdown", "id": "7712860b", "metadata": {}, "source": [ "### Estimation of `u` probabilities\n", "\n", "Once we have the `probability_two_random_records_match` parameter, we can estimate the `u` probabilities.\n", "\n", "We estimate `u` using the `estimate_u_using_random_sampling` method, which doesn't require any labels.\n", "\n", "It works by sampling random pairs of records, since most of these pairs are going to be non-matches. Over these non-matches we compute the distribution of `ComparisonLevel`s for each `Comparison`.\n", "\n", "For instance, for `gender`, we would find that the gender matches 50% of the time, and mismatches 50% of the time. 
\n", "\n", "For `dob` on the other hand, we would find that the `dob` matches 1% of the time, has a \"one character difference\" 3% of the time, and everything else happens 96% of the time.\n", "\n", "The larger the random sample, the more accurate the predictions. You control this using the `max_pairs` parameter. For large datasets, we recommend using at least 10 million - but the higher the better and 1 billion is often appropriate for larger datasets." ] }, { "cell_type": "code", "execution_count": 5, "id": "b8d49e7a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "----- Estimating u probabilities using random sampling -----\n", "\n", "Estimated u probabilities using random sampling\n", "\n", "Your model is not yet fully trained. Missing estimates for:\n", " - first_name (no m values are trained).\n", " - surname (no m values are trained).\n", " - dob (no m values are trained).\n", " - city (no m values are trained).\n", " - email (no m values are trained).\n" ] } ], "source": [ "linker.estimate_u_using_random_sampling(max_pairs=1e6)" ] }, { "cell_type": "markdown", "id": "a73921b7", "metadata": {}, "source": [ "### Estimation of `m` probabilities\n", "\n", "`m` is the trickiest of the parameters to estimate, because we have to have some idea of what the true matches are.\n", "\n", "If we have labels, we can directly estimate it.\n", "\n", "If we do not have labelled data, the `m` parameters can be estimated using an iterative maximum likelihood approach called Expectation Maximisation. \n", "\n", "#### Estimating directly\n", "\n", "If we have labels, we can estimate `m` directly using the `estimate_m_from_label_column` method of the linker.\n", "\n", "For example, if the entity being matched is persons, and your input dataset(s) contain social security number, this could be used to estimate the m values for the model.\n", "\n", "Note that this column does not need to be fully populated. 
A common case is where a unique identifier such as a social security number is only partially populated.\n", "\n", "For example (in this tutorial we don't have labels, so we're not actually going to use this):\n", "\n", "```python\n", "linker.estimate_m_from_label_column(\"social_security_number\")\n", "```\n", "\n", "#### Estimating with Expectation Maximisation\n", "\n", "This algorithm estimates the `m` values by generating pairwise record comparisons, and using them to maximise a likelihood function. \n", "\n", "Each estimation pass requires the user to configure an estimation blocking rule to reduce the number of record comparisons generated to a manageable level.\n", "\n", "In our first estimation pass, we block on `first_name` and `surname`, meaning we will generate all record comparisons that have `first_name` and `surname` exactly equal. \n", "\n", "Recall we are trying to estimate the `m` values of the model, i.e. the proportion of records falling into each `ComparisonLevel` amongst truly matching records.\n", "\n", "This means that, in this training session, we cannot obtain parameter estimates for the `first_name` or `surname` columns, since they will be equal for all the comparisons we do.\n", "\n", "We can, however, obtain parameter estimates for all of the other columns. The output messages produced by Splink confirm this." 
] }, { "cell_type": "code", "execution_count": 6, "id": "098f0a40", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "----- Starting EM training session -----\n", "\n", "Estimating the m probabilities of the model by blocking on:\n", "l.first_name = r.first_name and l.surname = r.surname\n", "\n", "Parameter estimates will be made for the following comparison(s):\n", " - dob\n", " - city\n", " - email\n", "\n", "Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n", " - first_name\n", " - surname\n", "\n", "Iteration 1: Largest change in params was -0.514 in the m_probability of dob, level `Exact match`\n", "Iteration 2: Largest change in params was 0.0474 in probability_two_random_records_match\n", "Iteration 3: Largest change in params was 0.0212 in probability_two_random_records_match\n", "Iteration 4: Largest change in params was 0.0113 in probability_two_random_records_match\n", "Iteration 5: Largest change in params was 0.00694 in probability_two_random_records_match\n", "Iteration 6: Largest change in params was 0.00463 in probability_two_random_records_match\n", "Iteration 7: Largest change in params was 0.00328 in probability_two_random_records_match\n", "Iteration 8: Largest change in params was 0.00243 in probability_two_random_records_match\n", "Iteration 9: Largest change in params was 0.00186 in probability_two_random_records_match\n", "Iteration 10: Largest change in params was 0.00146 in probability_two_random_records_match\n", "Iteration 11: Largest change in params was 0.00117 in probability_two_random_records_match\n", "Iteration 12: Largest change in params was 0.000954 in probability_two_random_records_match\n", "Iteration 13: Largest change in params was 0.000787 in probability_two_random_records_match\n", "Iteration 14: Largest change in params was 0.000658 in probability_two_random_records_match\n", "Iteration 15: Largest change in params was 
0.000555 in probability_two_random_records_match\n", "Iteration 16: Largest change in params was 0.000471 in probability_two_random_records_match\n", "Iteration 17: Largest change in params was 0.000404 in probability_two_random_records_match\n", "Iteration 18: Largest change in params was 0.000347 in probability_two_random_records_match\n", "Iteration 19: Largest change in params was 0.000301 in probability_two_random_records_match\n", "Iteration 20: Largest change in params was 0.000261 in probability_two_random_records_match\n", "Iteration 21: Largest change in params was 0.000228 in probability_two_random_records_match\n", "Iteration 22: Largest change in params was 0.0002 in probability_two_random_records_match\n", "Iteration 23: Largest change in params was 0.000175 in probability_two_random_records_match\n", "Iteration 24: Largest change in params was 0.000154 in probability_two_random_records_match\n", "Iteration 25: Largest change in params was 0.000136 in probability_two_random_records_match\n", "\n", "EM converged after 25 iterations\n", "\n", "Your model is not yet fully trained. Missing estimates for:\n", " - first_name (no m values are trained).\n", " - surname (no m values are trained).\n" ] } ], "source": [ "training_blocking_rule = \"l.first_name = r.first_name and l.surname = r.surname\"\n", "training_session_fname_sname = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)" ] }, { "cell_type": "markdown", "id": "92bd4a31", "metadata": {}, "source": [ "In a second estimation pass, we block on dob. This allows us to estimate parameters for the `first_name` and `surname` comparisons.\n", "\n", "Between the two estimation passes, we now have parameter estimates for all comparisons." 
] }, { "cell_type": "code", "execution_count": 7, "id": "ac8d3264", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "----- Starting EM training session -----\n", "\n", "Estimating the m probabilities of the model by blocking on:\n", "l.dob = r.dob\n", "\n", "Parameter estimates will be made for the following comparison(s):\n", " - first_name\n", " - surname\n", " - city\n", " - email\n", "\n", "Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n", " - dob\n", "\n", "Iteration 1: Largest change in params was -0.392 in the m_probability of surname, level `Exact match`\n", "Iteration 2: Largest change in params was 0.137 in probability_two_random_records_match\n", "Iteration 3: Largest change in params was 0.0416 in probability_two_random_records_match\n", "Iteration 4: Largest change in params was 0.0171 in probability_two_random_records_match\n", "Iteration 5: Largest change in params was 0.00853 in probability_two_random_records_match\n", "Iteration 6: Largest change in params was 0.0047 in probability_two_random_records_match\n", "Iteration 7: Largest change in params was 0.00274 in probability_two_random_records_match\n", "Iteration 8: Largest change in params was 0.00165 in probability_two_random_records_match\n", "Iteration 9: Largest change in params was 0.00101 in probability_two_random_records_match\n", "Iteration 10: Largest change in params was 0.000629 in probability_two_random_records_match\n", "Iteration 11: Largest change in params was 0.000394 in probability_two_random_records_match\n", "Iteration 12: Largest change in params was 0.000247 in probability_two_random_records_match\n", "Iteration 13: Largest change in params was 0.000156 in probability_two_random_records_match\n", "Iteration 14: Largest change in params was 9.86e-05 in probability_two_random_records_match\n", "\n", "EM converged after 14 iterations\n", "\n", "Your model is fully trained. 
All comparisons have at least one estimate for their m and u values\n" ] } ], "source": [ "training_blocking_rule = \"l.dob = r.dob\"\n", "training_session_dob = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)" ] }, { "cell_type": "markdown", "id": "efdb0c5f", "metadata": {}, "source": [ "Note that Splink includes other algorithms for estimating `m` and `u` values, which are documented [here](https://moj-analytical-services.github.io/splink/linkerest.html)." ] }, { "cell_type": "markdown", "id": "38355535", "metadata": {}, "source": [ "## Visualising model parameters\n", "\n", "Splink can generate a number of charts to help you understand your model. For an introduction to these charts and how to interpret them, please see [this](https://www.youtube.com/watch?v=msz3T741KQI&t=507s) video.\n", "\n", "The final estimated match weights can be viewed in the match weights chart:" ] }, { "cell_type": "code", "execution_count": 8, "id": "3a1e15cc", "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v4+json": { "$schema": "https://vega.github.io/schema/vega-lite/v5.2.json", "config": { "header": { "title": null }, "mark": { "tooltip": null }, "title": { "anchor": "middle" }, "view": { "height": 60, "width": 400 } }, "data": { "values": [ { "bayes_factor": 0.0033430420247643373, "bayes_factor_description": "The probability that two random records drawn at random match is 0.003 or one in 300.1 records.This is equivalent to a starting match weight of -8.225.", "comparison_name": "probability_two_random_records_match", "comparison_sort_order": -1, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "", "log2_bayes_factor": -8.224622793739668, "m_probability": null, "m_probability_description": null, "max_comparison_vector_value": 0, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": null, "tf_adjustment_column": 
null, "tf_adjustment_weight": null, "u_probability": null, "u_probability_description": null }, { "bayes_factor": 85.80338234594069, "bayes_factor_description": "If comparison level is `exact match` then comparison is 85.80 times more likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 6.422962614378335, "m_probability": 0.4971080217684876, "m_probability_description": "Amongst matching record comparisons, 49.71% of records are in the exact match comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"first_name_l\" = \"first_name_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0057935713975033705, "u_probability_description": "Amongst non-matching record comparisons, 0.58% of records are in the exact match comparison level" }, { "bayes_factor": 0.5058224969822424, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 1.98 times less likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -0.9832968910951042, "m_probability": 0.5028919782315123, "m_probability_description": "Amongst matching record comparisons, 50.29% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9942064286024966, "u_probability_description": "Amongst non-matching record comparisons, 99.42% of records are in the all other comparisons comparison level" }, { "bayes_factor": 89.48089949369762, 
"bayes_factor_description": "If comparison level is `exact match` then comparison is 89.48 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 6.483507853838869, "m_probability": 0.43755941072712773, "m_probability_description": "Amongst matching record comparisons, 43.76% of records are in the exact match comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"surname_l\" = \"surname_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.004889975550122249, "u_probability_description": "Amongst non-matching record comparisons, 0.49% of records are in the exact match comparison level" }, { "bayes_factor": 78.52365983637108, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 78.52 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 6.295055510492454, "m_probability": 0.18566006785783215, "m_probability_description": "Amongst matching record comparisons, 18.57% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"surname_l\", \"surname_r\") <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.002364383782476692, "u_probability_description": "Amongst non-matching record comparisons, 0.24% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 22.372672328973053, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 22.37 times more likely to be a 
match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 4.483665685906399, "m_probability": 0.11207341581216516, "m_probability_description": "Amongst matching record comparisons, 11.21% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"surname_l\", \"surname_r\") <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.005009388872469557, "u_probability_description": "Amongst non-matching record comparisons, 0.50% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.2679937130198923, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 3.73 times less likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.8997289386125742, "m_probability": 0.2647071056028749, "m_probability_description": "Amongst matching record comparisons, 26.47% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9877362517949315, "u_probability_description": "Amongst non-matching record comparisons, 98.77% of records are in the all other comparisons comparison level" }, { "bayes_factor": 222.50382383655824, "bayes_factor_description": "If comparison level is `exact match` then comparison is 222.50 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 2, "has_tf_adjustments": 
false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 7.797686319483587, "m_probability": 0.3888805569756063, "m_probability_description": "Amongst matching record comparisons, 38.89% of records are in the exact match comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"dob_l\" = \"dob_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0017477477477477479, "u_probability_description": "Amongst non-matching record comparisons, 0.17% of records are in the exact match comparison level" }, { "bayes_factor": 92.7048728526088, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 92.70 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 6.534573268209908, "m_probability": 0.1484762728370111, "m_probability_description": "Amongst matching record comparisons, 14.85% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0016016016016016017, "u_probability_description": "Amongst non-matching record comparisons, 0.16% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 0.46419793122630476, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 2.15 times less likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.107188001796095, "m_probability": 
0.4626431701873826, "m_probability_description": "Amongst matching record comparisons, 46.26% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9966506506506506, "u_probability_description": "Amongst non-matching record comparisons, 99.67% of records are in the all other comparisons comparison level" }, { "bayes_factor": 10.264353562890939, "bayes_factor_description": "If comparison level is `exact match` then comparison is 10.26 times more likely to be a match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 1, "has_tf_adjustments": true, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 3.3595708659307335, "m_probability": 0.5660541687277061, "m_probability_description": "Amongst matching record comparisons, 56.61% of records are in the exact match comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"city_l\" = \"city_r\"", "tf_adjustment_column": "city", "tf_adjustment_weight": 1, "u_probability": 0.0551475711801453, "u_probability_description": "Amongst non-matching record comparisons, 5.51% of records are in the exact match comparison level" }, { "bayes_factor": 0.4592736580190661, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 2.18 times less likely to be a match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.1225740557985633, "m_probability": 0.4339458312722939, "m_probability_description": "Amongst matching record comparisons, 43.39% of records are in the all other comparisons comparison level", 
"max_comparison_vector_value": 1, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9448524288198547, "u_probability_description": "Amongst non-matching record comparisons, 94.49% of records are in the all other comparisons comparison level" }, { "bayes_factor": 255.301619971695, "bayes_factor_description": "If comparison level is `exact match` then comparison is 255.30 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 7.99605888191739, "m_probability": 0.560098900557462, "m_probability_description": "Amongst matching record comparisons, 56.01% of records are in the exact match comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"email_l\" = \"email_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0021938713143283602, "u_probability_description": "Amongst non-matching record comparisons, 0.22% of records are in the exact match comparison level" }, { "bayes_factor": 235.63777521191432, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 235.64 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 7.880427026661069, "m_probability": 0.17282498809234997, "m_probability_description": "Amongst matching record comparisons, 17.28% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 1", 
"tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0007334349848487773, "u_probability_description": "Amongst non-matching record comparisons, 0.07% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 206.57183856156232, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 206.57 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 7.690499778262131, "m_probability": 0.128249357737358, "m_probability_description": "Amongst matching record comparisons, 12.82% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.000620846281034272, "u_probability_description": "Amongst non-matching record comparisons, 0.06% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.13932108608389637, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 7.18 times less likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -2.8435144702111006, "m_probability": 0.13882675361282992, "m_probability_description": "Amongst matching record comparisons, 13.88% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9964518474197885, 
"u_probability_description": "Amongst non-matching record comparisons, 99.65% of records are in the all other comparisons comparison level" } ] }, "resolve": { "axis": { "y": "independent" }, "scale": { "y": "independent" } }, "selection": { "zoom_selector": { "bind": "scales", "encodings": [ "x" ], "type": "interval" } }, "title": { "subtitle": "Use mousewheel to zoom", "text": "Model parameters (components of final match weight)" }, "vconcat": [ { "encoding": { "color": { "field": "log2_bayes_factor", "scale": { "domain": [ -10, 0, 10 ], "range": [ "red", "orange", "green" ] }, "title": "Match weight", "type": "quantitative" }, "tooltip": [ { "field": "comparison_name", "title": "Comparison name", "type": "nominal" }, { "field": "probability_two_random_records_match", "format": ".4f", "title": "Probability two random records match", "type": "nominal" }, { "field": "log2_bayes_factor", "format": ",.4f", "title": "Equivalent match weight", "type": "quantitative" }, { "field": "bayes_factor_description", "title": "Match weight description", "type": "nominal" } ], "x": { "axis": { "domain": false, "labels": false, "ticks": false, "title": "" }, "field": "log2_bayes_factor", "scale": { "domain": [ -10, 10 ] }, "type": "quantitative" }, "y": { "axis": { "title": "Prior (starting) match weight", "titleAlign": "right", "titleAngle": 0, "titleFontWeight": "normal" }, "field": "label_for_charts", "sort": { "field": "comparison_vector_value", "order": "descending" }, "type": "nominal" } }, "height": 20, "mark": { "clip": true, "height": 15, "type": "bar" }, "selection": { "zoom_selector": { "bind": "scales", "encodings": [ "x" ], "type": "interval" } }, "transform": [ { "filter": "(datum.comparison_name == 'probability_two_random_records_match')" } ] }, { "encoding": { "color": { "field": "log2_bayes_factor", "scale": { "domain": [ -10, 0, 10 ], "range": [ "red", "orange", "green" ] }, "title": "Match weight", "type": "quantitative" }, "row": { "field": "comparison_name", 
"header": { "labelAlign": "left", "labelAnchor": "middle", "labelAngle": 0 }, "sort": { "field": "comparison_sort_order" }, "type": "nominal" }, "tooltip": [ { "field": "comparison_name", "title": "Comparison name", "type": "nominal" }, { "field": "label_for_charts", "title": "Label", "type": "ordinal" }, { "field": "sql_condition", "title": "SQL condition", "type": "nominal" }, { "field": "m_probability", "format": ".4f", "title": "M probability", "type": "quantitative" }, { "field": "u_probability", "format": ".4f", "title": "U probability", "type": "quantitative" }, { "field": "bayes_factor", "format": ",.4f", "title": "Bayes factor = m/u", "type": "quantitative" }, { "field": "log2_bayes_factor", "format": ",.4f", "title": "Match weight = log2(m/u)", "type": "quantitative" }, { "field": "bayes_factor_description", "title": "Match weight description", "type": "nominal" } ], "x": { "axis": { "title": "Comparison level match weight = log2(m/u)" }, "field": "log2_bayes_factor", "scale": { "domain": [ -10, 10 ] }, "type": "quantitative" }, "y": { "axis": { "title": null }, "field": "label_for_charts", "sort": { "field": "comparison_vector_value", "order": "descending" }, "type": "nominal" } }, "height": { "step": 12 }, "mark": { "clip": true, "type": "bar" }, "resolve": { "axis": { "y": "independent" }, "scale": { "y": "independent" } }, "selection": { "zoom_selector": { "bind": "scales", "encodings": [ "x" ], "type": "interval" } }, "transform": [ { "filter": "(datum.comparison_name != 'probability_two_random_records_match')" } ] } ] }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. 
For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linker.match_weights_chart()" ] }, { "cell_type": "code", "execution_count": 9, "id": "8576c042", "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v4+json": { "$schema": "https://vega.github.io/schema/vega-lite/v4.json", "config": { "header": { "title": null }, "title": { "anchor": "middle", "offset": 10 }, "view": { "height": 300, "width": 400 } }, "data": { "values": [ { "bayes_factor": 85.80338234594069, "bayes_factor_description": "If comparison level is `exact match` then comparison is 85.80 times more likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 6.422962614378335, "m_probability": 0.4971080217684876, "m_probability_description": "Amongst matching record comparisons, 49.71% of records are in the exact match comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"first_name_l\" = \"first_name_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0057935713975033705, "u_probability_description": "Amongst non-matching record comparisons, 0.58% of records are in the exact match comparison level" }, { "bayes_factor": 0.5058224969822424, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 1.98 times less likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -0.9832968910951042, "m_probability": 0.5028919782315123, "m_probability_description": "Amongst matching 
record comparisons, 50.29% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9942064286024966, "u_probability_description": "Amongst non-matching record comparisons, 99.42% of records are in the all other comparisons comparison level" }, { "bayes_factor": 89.48089949369762, "bayes_factor_description": "If comparison level is `exact match` then comparison is 89.48 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 6.483507853838869, "m_probability": 0.43755941072712773, "m_probability_description": "Amongst matching record comparisons, 43.76% of records are in the exact match comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"surname_l\" = \"surname_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.004889975550122249, "u_probability_description": "Amongst non-matching record comparisons, 0.49% of records are in the exact match comparison level" }, { "bayes_factor": 78.52365983637108, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 78.52 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 6.295055510492454, "m_probability": 0.18566006785783215, "m_probability_description": "Amongst matching record comparisons, 18.57% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 
0.0033319033319033323, "sql_condition": "levenshtein(\"surname_l\", \"surname_r\") <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.002364383782476692, "u_probability_description": "Amongst non-matching record comparisons, 0.24% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 22.372672328973053, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 22.37 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 4.483665685906399, "m_probability": 0.11207341581216516, "m_probability_description": "Amongst matching record comparisons, 11.21% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"surname_l\", \"surname_r\") <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.005009388872469557, "u_probability_description": "Amongst non-matching record comparisons, 0.50% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.2679937130198923, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 3.73 times less likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.8997289386125742, "m_probability": 0.2647071056028749, "m_probability_description": "Amongst matching record comparisons, 26.47% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": 
null, "tf_adjustment_weight": 1, "u_probability": 0.9877362517949315, "u_probability_description": "Amongst non-matching record comparisons, 98.77% of records are in the all other comparisons comparison level" }, { "bayes_factor": 222.50382383655824, "bayes_factor_description": "If comparison level is `exact match` then comparison is 222.50 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 7.797686319483587, "m_probability": 0.3888805569756063, "m_probability_description": "Amongst matching record comparisons, 38.89% of records are in the exact match comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"dob_l\" = \"dob_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0017477477477477479, "u_probability_description": "Amongst non-matching record comparisons, 0.17% of records are in the exact match comparison level" }, { "bayes_factor": 92.7048728526088, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 92.70 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 6.534573268209908, "m_probability": 0.1484762728370111, "m_probability_description": "Amongst matching record comparisons, 14.85% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0016016016016016017, "u_probability_description": "Amongst non-matching record comparisons, 
0.16% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 0.46419793122630476, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 2.15 times less likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.107188001796095, "m_probability": 0.4626431701873826, "m_probability_description": "Amongst matching record comparisons, 46.26% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9966506506506506, "u_probability_description": "Amongst non-matching record comparisons, 99.67% of records are in the all other comparisons comparison level" }, { "bayes_factor": 10.264353562890939, "bayes_factor_description": "If comparison level is `exact match` then comparison is 10.26 times more likely to be a match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 1, "has_tf_adjustments": true, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 3.3595708659307335, "m_probability": 0.5660541687277061, "m_probability_description": "Amongst matching record comparisons, 56.61% of records are in the exact match comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"city_l\" = \"city_r\"", "tf_adjustment_column": "city", "tf_adjustment_weight": 1, "u_probability": 0.0551475711801453, "u_probability_description": "Amongst non-matching record comparisons, 5.51% of records are in the exact match comparison level" }, { "bayes_factor": 0.4592736580190661, "bayes_factor_description": "If comparison level is `all 
other comparisons` then comparison is 2.18 times less likely to be a match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.1225740557985633, "m_probability": 0.4339458312722939, "m_probability_description": "Amongst matching record comparisons, 43.39% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9448524288198547, "u_probability_description": "Amongst non-matching record comparisons, 94.49% of records are in the all other comparisons comparison level" }, { "bayes_factor": 255.301619971695, "bayes_factor_description": "If comparison level is `exact match` then comparison is 255.30 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 7.99605888191739, "m_probability": 0.560098900557462, "m_probability_description": "Amongst matching record comparisons, 56.01% of records are in the exact match comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "\"email_l\" = \"email_r\"", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0021938713143283602, "u_probability_description": "Amongst non-matching record comparisons, 0.22% of records are in the exact match comparison level" }, { "bayes_factor": 235.63777521191432, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 235.64 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 2, 
"has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 7.880427026661069, "m_probability": 0.17282498809234997, "m_probability_description": "Amongst matching record comparisons, 17.28% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0007334349848487773, "u_probability_description": "Amongst non-matching record comparisons, 0.07% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 206.57183856156232, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 206.57 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 7.690499778262131, "m_probability": 0.128249357737358, "m_probability_description": "Amongst matching record comparisons, 12.82% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.000620846281034272, "u_probability_description": "Amongst non-matching record comparisons, 0.06% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.13932108608389637, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 7.18 times less likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other 
comparisons", "log2_bayes_factor": -2.8435144702111006, "m_probability": 0.13882675361282992, "m_probability_description": "Amongst matching record comparisons, 13.88% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.0033319033319033323, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9964518474197885, "u_probability_description": "Amongst non-matching record comparisons, 99.65% of records are in the all other comparisons comparison level" } ] }, "hconcat": [ { "encoding": { "color": { "value": "green" }, "row": { "field": "comparison_name", "header": { "labelAlign": "left", "labelAnchor": "middle", "labelAngle": 0 }, "sort": { "field": "comparison_sort_order" }, "type": "nominal" }, "tooltip": [ { "field": "m_probability_description", "title": "m probability description", "type": "nominal" }, { "field": "comparison_name", "title": "Comparison column name", "type": "nominal" }, { "field": "label_for_charts", "title": "Label", "type": "ordinal" }, { "field": "sql_condition", "title": "SQL condition", "type": "nominal" }, { "field": "m_probability", "format": ".4p", "title": "m probability", "type": "quantitative" }, { "field": "u_probability", "format": ".4p", "title": "u probability", "type": "quantitative" }, { "field": "bayes_factor", "format": ",.4f", "title": "Bayes factor = m/u", "type": "quantitative" }, { "field": "log2_bayes_factor", "format": ",.4f", "title": "Match weight = log2(m/u)", "type": "quantitative" } ], "x": { "axis": { "title": "Proportion of record comparisons" }, "field": "m_probability", "type": "quantitative" }, "y": { "axis": { "title": null }, "field": "label_for_charts", "sort": { "field": "comparison_vector_value", "order": "descending" }, "type": "nominal" } }, "height": { "step": 12 }, "mark": "bar", "resolve": { "scale": { "y": "independent" } }, "title": { "fontSize": 12, "fontWeight": 
"bold", "text": "Amongst matching record comparisons:" }, "transform": [ { "filter": "(datum.bayes_factor != 'no-op filter due to vega lite issue 4680')" } ], "width": 150 }, { "encoding": { "color": { "value": "red" }, "row": { "field": "comparison_name", "header": { "labels": false }, "sort": { "field": "comparison_sort_order" }, "type": "nominal" }, "tooltip": [ { "field": "u_probability_description", "title": "u probability description", "type": "nominal" }, { "field": "comparison_name", "title": "Comparison column name", "type": "nominal" }, { "field": "label_for_charts", "title": "Label", "type": "ordinal" }, { "field": "sql_condition", "title": "SQL condition", "type": "nominal" }, { "field": "m_probability", "format": ".4p", "title": "m probability", "type": "quantitative" }, { "field": "u_probability", "format": ".4p", "title": "u probability", "type": "quantitative" }, { "field": "bayes_factor", "format": ",.4f", "title": "Bayes factor = m/u", "type": "quantitative" }, { "field": "log2_bayes_factor", "format": ",.4f", "title": "Match weight = log2(m/u)", "type": "quantitative" } ], "x": { "axis": { "title": "Proportion of record comparisons" }, "field": "u_probability", "type": "quantitative" }, "y": { "axis": { "title": null }, "field": "label_for_charts", "sort": { "field": "comparison_vector_value", "order": "descending" }, "type": "nominal" } }, "height": { "step": 12 }, "mark": "bar", "resolve": { "scale": { "y": "independent" } }, "title": { "fontSize": 12, "fontWeight": "bold", "text": "Amongst non-matching record comparisons:" }, "transform": [ { "filter": "(datum.bayes_factor != 'no-op filter2 due to vega lite issue 4680')" } ], "width": 150 } ], "title": { "subtitle": "(m and u probabilities)", "text": "Proportion of record comparisons in each comparison level by match status" } }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. 
For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linker.m_u_parameters_chart()" ] }, { "cell_type": "markdown", "id": "c44fcc26", "metadata": {}, "source": [ "### Saving the model\n", "\n", "We can save the model, including our estimated parameters, to a `.json` file, so we can use it in the next tutorial." ] }, { "cell_type": "code", "execution_count": 10, "id": "992703a7", "metadata": {}, "outputs": [], "source": [ "settings = linker.save_settings_to_json(\"./demo_settings/saved_model_from_demo.json\", overwrite=True)" ] }, { "cell_type": "markdown", "id": "d07e6901-110f-449e-9534-8e24bbf1d5fb", "metadata": {}, "source": [ "## Detecting unlinkable records\n", "\n", "One useful application of our trained model, worth exploring before making any predictions, is detecting 'unlinkable' records.\n", "\n", "Unlinkable records are those which do not contain enough information to be linked. A simple example would be a record containing only 'John Smith', and null in all other fields. This record may link to other records, but we'll never know because there's not enough information to disambiguate any potential links. 
Unlinkable records can be found by linking records to themselves - if, even when matched to themselves, they don't meet the match threshold score, we can be sure they will never link to anything.\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "b0f17f7c-fa83-41b5-b2da-25ae18e11d81", "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v4+json": { "$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json", "config": { "view": { "continuousHeight": 300, "continuousWidth": 400 } }, "data": { "values": [ { "cum_prop": 0.0020000000949949026, "match_probability": 0.42655, "match_weight": -0.43, "prop": 0.0020000000949949026 }, { "cum_prop": 0.003000000142492354, "match_probability": 0.96071, "match_weight": 4.61, "prop": 0.0010000000474974513 }, { "cum_prop": 0.006000000168569386, "match_probability": 0.98457, "match_weight": 6, "prop": 0.003000000026077032 }, { "cum_prop": 0.013000000384636223, "match_probability": 0.9852, "match_weight": 6.06, "prop": 0.007000000216066837 }, { "cum_prop": 0.014000000432133675, "match_probability": 0.9856, "match_weight": 6.1, "prop": 0.0010000000474974513 }, { "cum_prop": 0.019000000320374966, "match_probability": 0.99414, "match_weight": 7.41, "prop": 0.004999999888241291 }, { "cum_prop": 0.02100000041536987, "match_probability": 0.99438, "match_weight": 7.47, "prop": 0.0020000000949949026 }, { "cum_prop": 0.027000000467523932, "match_probability": 0.99476, "match_weight": 7.57, "prop": 0.006000000052154064 }, { "cum_prop": 0.030000000493600965, "match_probability": 0.99709, "match_weight": 8.42, "prop": 0.003000000026077032 }, { "cum_prop": 0.038000000873580575, "match_probability": 0.99802, "match_weight": 8.98, "prop": 0.00800000037997961 }, { "cum_prop": 0.04100000089965761, "match_probability": 0.9987, "match_weight": 9.58, "prop": 0.003000000026077032 }, { "cum_prop": 0.04200000094715506, "match_probability": 0.99888, "match_weight": 9.8, "prop": 0.0010000000474974513 }, { "cum_prop": 
0.04300000099465251, "match_probability": 0.99931, "match_weight": 10.51, "prop": 0.0010000000474974513 }, { "cum_prop": 0.04400000104214996, "match_probability": 0.99939, "match_weight": 10.67, "prop": 0.0010000000474974513 }, { "cum_prop": 0.04500000108964741, "match_probability": 0.99942, "match_weight": 10.75, "prop": 0.0010000000474974513 }, { "cum_prop": 0.047000001184642315, "match_probability": 0.99945, "match_weight": 10.82, "prop": 0.0020000000949949026 }, { "cum_prop": 0.04900000127963722, "match_probability": 0.99946, "match_weight": 10.84, "prop": 0.0020000000949949026 }, { "cum_prop": 0.05000000132713467, "match_probability": 0.99948, "match_weight": 10.9, "prop": 0.0010000000474974513 }, { "cum_prop": 0.05100000137463212, "match_probability": 0.99952, "match_weight": 11.03, "prop": 0.0010000000474974513 }, { "cum_prop": 0.05200000142212957, "match_probability": 0.99954, "match_weight": 11.09, "prop": 0.0010000000474974513 }, { "cum_prop": 0.05300000146962702, "match_probability": 0.99958, "match_weight": 11.2, "prop": 0.0010000000474974513 }, { "cum_prop": 0.054000001517124474, "match_probability": 0.99959, "match_weight": 11.26, "prop": 0.0010000000474974513 }, { "cum_prop": 0.055000001564621925, "match_probability": 0.99969, "match_weight": 11.67, "prop": 0.0010000000474974513 }, { "cum_prop": 0.05800000159069896, "match_probability": 0.99971, "match_weight": 11.73, "prop": 0.003000000026077032 }, { "cum_prop": 0.05900000163819641, "match_probability": 0.99974, "match_weight": 11.9, "prop": 0.0010000000474974513 }, { "cum_prop": 0.06300000182818621, "match_probability": 0.99976, "match_weight": 12.03, "prop": 0.004000000189989805 }, { "cum_prop": 0.06400000187568367, "match_probability": 0.99977, "match_weight": 12.1, "prop": 0.0010000000474974513 }, { "cum_prop": 0.06500000192318112, "match_probability": 0.99979, "match_weight": 12.25, "prop": 0.0010000000474974513 }, { "cum_prop": 0.06600000197067857, "match_probability": 0.9998, "match_weight": 
12.32, "prop": 0.0010000000474974513 }, { "cum_prop": 0.06800000206567347, "match_probability": 0.99981, "match_weight": 12.33, "prop": 0.0020000000949949026 }, { "cum_prop": 0.09200000227428973, "match_probability": 0.99982, "match_weight": 12.48, "prop": 0.024000000208616257 }, { "cum_prop": 0.09500000230036676, "match_probability": 0.99984, "match_weight": 12.61, "prop": 0.003000000026077032 }, { "cum_prop": 0.09600000234786421, "match_probability": 0.99985, "match_weight": 12.71, "prop": 0.0010000000474974513 }, { "cum_prop": 0.09700000239536166, "match_probability": 0.99987, "match_weight": 12.9, "prop": 0.0010000000474974513 }, { "cum_prop": 0.09800000244285911, "match_probability": 0.99991, "match_weight": 13.42, "prop": 0.0010000000474974513 }, { "cum_prop": 0.1210000024875626, "match_probability": 0.99993, "match_weight": 13.9, "prop": 0.023000000044703484 }, { "cum_prop": 0.15500000433530658, "match_probability": 0.99994, "match_weight": 14.05, "prop": 0.03400000184774399 }, { "cum_prop": 0.15600000438280404, "match_probability": 0.99995, "match_weight": 14.42, "prop": 0.0010000000474974513 }, { "cum_prop": 0.16800000448711216, "match_probability": 0.99997, "match_weight": 14.9, "prop": 0.012000000104308128 }, { "cum_prop": 0.21600000490434468, "match_probability": 0.99998, "match_weight": 16, "prop": 0.04800000041723251 }, { "cum_prop": 0.24600000423379242, "match_probability": 0.99999, "match_weight": 17.58, "prop": 0.029999999329447746 } ] }, "height": 400, "layer": [ { "encoding": { "x": { "axis": { "format": "+", "title": "Threshold match weight" }, "field": "match_weight", "type": "quantitative" }, "y": { "axis": { "format": "%", "title": "Percentage of unlinkable records" }, "field": "cum_prop", "type": "quantitative" } }, "mark": "line" }, { "encoding": { "opacity": { "value": 0 }, "tooltip": [ { "field": "match_weight", "format": "+.5", "title": "Match weight", "type": "quantitative" }, { "field": "match_probability", "format": ".5", "title": 
"Match probability", "type": "quantitative" }, { "field": "cum_prop", "format": ".3%", "title": "Proportion of unlinkable records", "type": "quantitative" } ], "x": { "field": "match_weight", "type": "quantitative" }, "y": { "field": "cum_prop", "type": "quantitative" } }, "mark": "point", "selection": { "selector112": { "empty": "none", "fields": [ "match_weight", "cum_prop" ], "nearest": true, "on": "mouseover", "type": "single" } } }, { "encoding": { "opacity": { "condition": { "selection": "selector112", "value": 1 }, "value": 0 }, "x": { "axis": { "title": "Threshold match weight" }, "field": "match_weight", "type": "quantitative" }, "y": { "axis": { "format": "%", "title": "Percentage of unlinkable records" }, "field": "cum_prop", "type": "quantitative" } }, "mark": "point" }, { "encoding": { "x": { "field": "match_weight", "type": "quantitative" } }, "mark": { "color": "gray", "type": "rule" }, "transform": [ { "filter": { "selection": "selector112" } } ] }, { "encoding": { "y": { "field": "cum_prop", "type": "quantitative" } }, "mark": { "color": "gray", "type": "rule" }, "transform": [ { "filter": { "selection": "selector112" } } ] } ], "title": { "subtitle": "Records with insufficient information to exceed a given match threshold", "text": "Unlinkable records" }, "width": 400 }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. 
For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linker.unlinkables_chart()" ] }, { "cell_type": "markdown", "id": "ca2570b7-9f68-4fb2-b7a9-527233a8fcd4", "metadata": {}, "source": [ "In the chart above, we can see that about 1.3% of records in the input dataset are unlinkable at a threshold match weight of 6.11 (corresponding to a match probability of around 98.6%)." ] }, { "cell_type": "markdown", "id": "fd531cd2", "metadata": {}, "source": [ "## Next steps\n", "\n", "Now that we have trained a model, we can move on to using it to predict matching records." ] }, { "cell_type": "markdown", "id": "83fd8e7f", "metadata": {}, "source": [ "\n", "## Further reading\n", "\n", "Full documentation for all of the ways of estimating model parameters can be found [here](https://moj-analytical-services.github.io/splink/linkerest.html)." ] } ], "metadata": { "kernelspec": { "display_name": "splink_demos", "language": "python", "name": "splink_demos" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "vscode": { "interpreter": { "hash": "3b53fa520a31e303a9636a08ff10a3bbc14893ee50cb37445791fa59628fc75b" } } }, "nbformat": 4, "nbformat_minor": 5 }