{ "cells": [ { "cell_type": "markdown", "id": "d729e302", "metadata": {}, "source": [ "# Specifying and estimating a linkage model\n", "\n", "We've just seen how to use Splink's exploratory analysis tools to understand our data. \n", "\n", "Now it's time to build a linkage model. This model will make pairwise comparisons of input records and output a match score, which is a prediction of whether the two records represent the same entity (e.g. are the same person). You can read more about the theory behind probabilistic linkage models [here](https://www.robinlinacre.com/intro_to_probabilistic_linkage/).\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "aa6a9e30", "metadata": {}, "outputs": [], "source": [ "# Begin by reading in the tutorial data again\n", "from splink.duckdb.duckdb_linker import DuckDBLinker\n", "import pandas as pd \n", "import altair as alt\n", "alt.renderers.enable(\"mimetype\")\n", "df = pd.read_csv(\"./data/fake_1000.csv\")" ] }, { "cell_type": "markdown", "id": "0f104340", "metadata": {}, "source": [ "## Specifying a linkage model\n", "\n", "To produce a match score, `splink` needs to know how to compare the information in pairs of records from the input dataset.\n", "\n", "To be concrete, here is an example pairwise record comparison from our input dataset:\n", "\n", "\n", "| unique_id | first_name | surname | dob | city | email |\n", "|------------:|:-------------|:----------|:-----------|:-------|:--------------------|\n", "| 1 | Robert | Allen | 1971-05-24 | nan | roberta25@smith.net |\n", "| 2 | Rob | Allen | 1971-06-24 | London | roberta25@smith.net |\n", "\n", "What functions should we use to assess the similarity of `Rob` vs. `Robert` in the `first_name` field? Should similarity in the `dob` field be computed in the same way, or a different way?\n", "\n", "Your job as the developer of a linkage model is to decide what comparisons are most appropriate for the types of data you have. 
" ] }, { "cell_type": "markdown", "id": "8a520392", "metadata": {}, "source": [ "### Comparisons\n", "\n", "The concept of a `Comparison` has a specific definition within Splink: it defines how data from one or more input columns is compared, using SQL expressions to assess similarity.\n", "\n", "For example, one `Comparison` may represent how similarity is assessed for a person's date of birth. Another `Comparison` may represent the comparison of a person's name or location.\n", "\n", "A model will thereby be composed of many `Comparison`s, which between them assess the similarity of all of the columns being used for data linking. \n", "\n", "Each `Comparison` contains two or more `ComparisonLevels` which define _n_ discrete gradations of similarity between the input columns within the Comparison.\n", "\n", "For example, for the date of birth `Comparison` there may be a `ComparisonLevel` for an exact match, another for a one-character difference, and another for all other comparisons.\n", "\n", "To summarise:\n", "\n", "```\n", "Data Linking Model\n", "├─-- Comparison: Date of birth\n", "│ ├─-- ComparisonLevel: Exact match\n", "│ ├─-- ComparisonLevel: One character difference\n", "│ ├─-- ComparisonLevel: All other\n", "├─-- Comparison: City\n", "│ ├─-- ComparisonLevel: Exact match on city\n", "│ ├─-- ComparisonLevel: All other\n", "│ etc.\n", "```\n", "\n", "More information about comparisons can be found [here](https://moj-analytical-services.github.io/splink/comparison.html).\n", "\n", "\n", "We will now use these concepts to build a data linking model." ] }, { "cell_type": "markdown", "id": "02000a24", "metadata": {}, "source": [ "### Specifying the model using comparisons\n", "\n", "Splink provides utility functions to help formulate some of the most common comparison types, which we'll make use of in this introductory example.\n", "\n", "Let's start by looking at a single comparison:" ] }, { "cell_type": "code", "execution_count": 2, "id": "bd6143e7", "metadata": 
{}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Comparison 'Exact match vs. levenshtein at threshold 2 vs. anything else' of first_name.\n", "Similarity is assessed using the following ComparisonLevels:\n", " - 'Null' with SQL rule: first_name_l IS NULL OR first_name_r IS NULL\n", " - 'Exact match' with SQL rule: first_name_l = first_name_r\n", " - 'levenshtein <= 2' with SQL rule: levenshtein(first_name_l, first_name_r) <= 2\n", " - 'All other comparisons' with SQL rule: ELSE\n", "\n" ] } ], "source": [ "import splink.duckdb.duckdb_comparison_library as cl\n", "\n", "first_name_comparison = cl.levenshtein_at_thresholds(\"first_name\", 2)\n", "print(first_name_comparison.human_readable_description)\n" ] }, { "cell_type": "markdown", "id": "47b7677a", "metadata": {}, "source": [ "## Specifying the full settings dictionary\n", "\n", "`Comparisons` are specified as part of the Splink `settings`, a Python dictionary which controls all of the configuration of a Splink model.\n", "\n", "Let's take a look at a full settings dictionary:" ] }, { "cell_type": "code", "execution_count": 3, "id": "0fa0611a", "metadata": {}, "outputs": [], "source": [ "settings = {\n", " \"probability_two_random_records_match\": 4/1000,\n", " \"link_type\": \"dedupe_only\",\n", " \"comparisons\": [\n", " cl.levenshtein_at_thresholds(\"first_name\", 2),\n", " cl.levenshtein_at_thresholds(\"surname\"),\n", " cl.levenshtein_at_thresholds(\"dob\"),\n", " cl.exact_match(\"city\", term_frequency_adjustments=True),\n", " cl.levenshtein_at_thresholds(\"email\"),\n", " ],\n", " \"blocking_rules_to_generate_predictions\": [\n", " \"l.first_name = r.first_name\",\n", " \"l.surname = r.surname\",\n", " ],\n", " \"retain_matching_columns\": True,\n", " \"retain_intermediate_calculation_columns\": True,\n", " \"additional_columns_to_retain\": [\"cluster\"],\n", "}" ] }, { "cell_type": "markdown", "id": "657a1fb8", "metadata": {}, "source": [ "In words, this settings dictionary 
says:\n", "\n", "* We have set the starting value of `probability_two_random_records_match` to 4/1000. We will later estimate this parameter.\n", "* We are performing a `dedupe_only` link (the other options are `link_only` and `link_and_dedupe`, which may be used if there are multiple input datasets).\n", "* When comparing records, we will use information from the `first_name`, `surname`, `dob`, `city` and `email` columns to compute a match score.\n", "* The `blocking_rules_to_generate_predictions` state that we will only check for duplicates amongst records where either the `first_name` or `surname` is identical.\n", "* We have enabled term frequency adjustments for the `city` column, because some values (e.g. `London`) appear much more frequently than others.\n", "* We will retain the `cluster` column in the results even though it is not used as part of comparisons. Later we'll be able to use this to compare Splink scores to the ground truth.\n", "* We have set `retain_matching_columns` and `retain_intermediate_calculation_columns` to `True` so that Splink outputs additional information that helps the user understand the calculations. If they were `False`, the computations would run faster." ] }, { "cell_type": "markdown", "id": "afa31386", "metadata": {}, "source": [ "## Estimate the parameters of the model\n", "\n", "Now that we have specified our linkage model, we want to estimate its `m` and `u` parameters. \n", "\n", "- The `m` values are the proportion of record comparisons falling into each `ComparisonLevel` amongst truly *matching* records\n", "\n", "- The `u` values are the proportion of record comparisons falling into each `ComparisonLevel` amongst truly *non-matching* records\n", "\n", "You can read more about the theory of what these mean [here](https://www.robinlinacre.com/maths_of_fellegi_sunter/).\n", "\n", "We begin by using the `estimate_u_using_random_sampling` method to compute the `u` values of the model. 
This is a simple direct estimation algorithm. The larger the random sample, the more accurate the predictions. You control this using the `target_rows` parameter. For large datasets, we recommend using at least 10 million; the higher the better, and 1 billion is often appropriate for larger datasets." ] }, { "cell_type": "code", "execution_count": 4, "id": "b8d49e7a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "----- Estimating u probabilities using random sampling -----\n", "\n", "Estimated u probabilities using random sampling\n", "\n", "Your model is not yet fully trained. Missing estimates for:\n", " - first_name (no m values are trained).\n", " - surname (no m values are trained).\n", " - dob (no m values are trained).\n", " - city (no m values are trained).\n", " - email (no m values are trained).\n" ] } ], "source": [ "linker = DuckDBLinker(df, settings)\n", "linker.estimate_u_using_random_sampling(target_rows=1e6)" ] }, { "cell_type": "markdown", "id": "a73921b7", "metadata": {}, "source": [ "We then use the expectation maximisation algorithm to train the `m` values.\n", "\n", "This algorithm estimates the `m` values by generating pairwise record comparisons, and using them to maximise a likelihood function. \n", "\n", "Each estimation pass requires the user to configure an estimation blocking rule to reduce the number of record comparisons generated to a manageable level.\n", "\n", "\n", "In our first estimation pass, we block on `first_name` and `surname`, meaning we will generate all record comparisons that have `first_name` and `surname` exactly equal. \n", "\n", "Recall we are trying to estimate the `m` values of the model, i.e. 
the proportion of record comparisons falling into each `ComparisonLevel` amongst truly matching records.\n", "\n", "This means that, in this training session, we cannot produce parameter estimates for the `first_name` or `surname` columns, since we have forced them to be equal 100% of the time.\n", "\n", "We can, however, produce parameter estimates for all of the other columns. The output messages produced by Splink confirm this." ] }, { "cell_type": "code", "execution_count": 5, "id": "098f0a40", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "----- Starting EM training session -----\n", "\n", "Estimating the m probabilities of the model by blocking on:\n", "l.first_name = r.first_name and l.surname = r.surname\n", "\n", "Parameter estimates will be made for the following comparison(s):\n", " - dob\n", " - city\n", " - email\n", "\n", "Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n", " - first_name\n", " - surname\n", "\n", "Iteration 1: Largest change in params was -0.531 in the m_probability of dob, level `Exact match`\n", "Iteration 2: Largest change in params was 0.0331 in probability_two_random_records_match\n", "Iteration 3: Largest change in params was 0.0128 in probability_two_random_records_match\n", "Iteration 4: Largest change in params was 0.00635 in probability_two_random_records_match\n", "Iteration 5: Largest change in params was 0.00363 in probability_two_random_records_match\n", "Iteration 6: Largest change in params was 0.00225 in probability_two_random_records_match\n", "Iteration 7: Largest change in params was 0.00146 in probability_two_random_records_match\n", "Iteration 8: Largest change in params was 0.000987 in probability_two_random_records_match\n", "Iteration 9: Largest change in params was 0.000681 in probability_two_random_records_match\n", "Iteration 10: Largest change in params was 0.000478 in probability_two_random_records_match\n", 
"Iteration 11: Largest change in params was 0.000339 in probability_two_random_records_match\n", "Iteration 12: Largest change in params was 0.000242 in probability_two_random_records_match\n", "Iteration 13: Largest change in params was 0.000174 in probability_two_random_records_match\n", "Iteration 14: Largest change in params was 0.000126 in probability_two_random_records_match\n", "Iteration 15: Largest change in params was 9.12e-05 in probability_two_random_records_match\n", "\n", "EM converged after 15 iterations\n", "\n", "Your model is not yet fully trained. Missing estimates for:\n", " - first_name (no m values are trained).\n", " - surname (no m values are trained).\n" ] } ], "source": [ "training_blocking_rule = \"l.first_name = r.first_name and l.surname = r.surname\"\n", "training_session_fname_sname = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)" ] }, { "cell_type": "markdown", "id": "92bd4a31", "metadata": {}, "source": [ "In a second estimation pass, we block on `dob`. This allows us to estimate parameters for the `first_name` and `surname` comparisons.\n", "\n", "Between the two estimation passes, we now have parameter estimates for all comparisons." 
] }, { "cell_type": "code", "execution_count": 6, "id": "ac8d3264", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "----- Starting EM training session -----\n", "\n", "Estimating the m probabilities of the model by blocking on:\n", "l.dob = r.dob\n", "\n", "Parameter estimates will be made for the following comparison(s):\n", " - first_name\n", " - surname\n", " - city\n", " - email\n", "\n", "Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n", " - dob\n", "\n", "Iteration 1: Largest change in params was 0.48 in probability_two_random_records_match\n", "Iteration 2: Largest change in params was 0.151 in probability_two_random_records_match\n", "Iteration 3: Largest change in params was 0.0477 in probability_two_random_records_match\n", "Iteration 4: Largest change in params was 0.0177 in probability_two_random_records_match\n", "Iteration 5: Largest change in params was 0.00797 in probability_two_random_records_match\n", "Iteration 6: Largest change in params was 0.004 in probability_two_random_records_match\n", "Iteration 7: Largest change in params was 0.00213 in probability_two_random_records_match\n", "Iteration 8: Largest change in params was 0.00117 in probability_two_random_records_match\n", "Iteration 9: Largest change in params was 0.00065 in probability_two_random_records_match\n", "Iteration 10: Largest change in params was 0.000366 in probability_two_random_records_match\n", "Iteration 11: Largest change in params was 0.000207 in probability_two_random_records_match\n", "Iteration 12: Largest change in params was 0.000117 in probability_two_random_records_match\n", "Iteration 13: Largest change in params was 6.67e-05 in probability_two_random_records_match\n", "\n", "EM converged after 13 iterations\n", "\n", "Your model is fully trained. 
All comparisons have at least one estimate for their m and u values\n" ] } ], "source": [ "training_blocking_rule = \"l.dob = r.dob\"\n", "training_session_dob = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)" ] }, { "cell_type": "markdown", "id": "efdb0c5f", "metadata": {}, "source": [ "Note that Splink includes other algorithms for estimating m and u values, which are documented [here](https://moj-analytical-services.github.io/splink/linkerest.html)." ] }, { "cell_type": "markdown", "id": "38355535", "metadata": {}, "source": [ "## Visualising model parameters\n", "\n", "The final estimated match weights can be viewed in the match weights chart:" ] }, { "cell_type": "code", "execution_count": 7, "id": "3a1e15cc", "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v4+json": { "$schema": "https://vega.github.io/schema/vega-lite/v5.2.json", "config": { "header": { "title": null }, "mark": { "tooltip": null }, "title": { "anchor": "middle" }, "view": { "height": 60, "width": 400 } }, "data": { "values": [ { "bayes_factor": 0.009833850927123876, "bayes_factor_description": "The probability that two random records drawn at random match is 0.010 or one in 102.7 records.This is equivalent to a starting match weight of -6.668.", "comparison_name": "probability_two_random_records_match", "comparison_sort_order": -1, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "", "log2_bayes_factor": -6.66802779937637, "m_probability": null, "m_probability_description": null, "max_comparison_vector_value": 0, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": null, "tf_adjustment_column": null, "tf_adjustment_weight": null, "u_probability": null, "u_probability_description": null }, { "bayes_factor": 85.5492422084233, "bayes_factor_description": "If comparison level is `exact match` then comparison is 85.55 times more likely to be a match", 
"comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 6.41868317032687, "m_probability": 0.49563564273680927, "m_probability_description": "Amongst matching record comparisons, 49.56% of records are in the exact match comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "first_name_l = first_name_r", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0057935713975033705, "u_probability_description": "Amongst non-matching record comparisons, 0.58% of records are in the exact match comparison level" }, { "bayes_factor": 26.751309863975084, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 26.75 times more likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 4.741537628943031, "m_probability": 0.27072063394450885, "m_probability_description": "Amongst matching record comparisons, 27.07% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(first_name_l, first_name_r) <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.010119901990634016, "u_probability_description": "Amongst non-matching record comparisons, 1.01% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.23742193089778274, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 4.21 times less likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 0, "has_tf_adjustments": 
false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -2.074474890589354, "m_probability": 0.23364372331868066, "m_probability_description": "Amongst matching record comparisons, 23.36% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9840865266118626, "u_probability_description": "Amongst non-matching record comparisons, 98.41% of records are in the all other comparisons comparison level" }, { "bayes_factor": 90.1703772843795, "bayes_factor_description": "If comparison level is `exact match` then comparison is 90.17 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 6.494581652935134, "m_probability": 0.4409309402659144, "m_probability_description": "Amongst matching record comparisons, 44.09% of records are in the exact match comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "surname_l = surname_r", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.004889975550122249, "u_probability_description": "Amongst non-matching record comparisons, 0.49% of records are in the exact match comparison level" }, { "bayes_factor": 79.15014455822713, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 79.15 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 6.306520080161341, "m_probability": 0.187141318174158, 
"m_probability_description": "Amongst matching record comparisons, 18.71% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(surname_l, surname_r) <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.002364383782476692, "u_probability_description": "Amongst non-matching record comparisons, 0.24% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 22.60384847607212, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 22.60 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 4.498496518176822, "m_probability": 0.11323146703102362, "m_probability_description": "Amongst matching record comparisons, 11.32% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(surname_l, surname_r) <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.005009388872469557, "u_probability_description": "Amongst non-matching record comparisons, 0.50% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.26190825137661683, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 3.82 times less likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.9328665826115932, "m_probability": 0.2586962745289042, "m_probability_description": "Amongst matching record comparisons, 25.87% of records are in the 
all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9877362517949315, "u_probability_description": "Amongst non-matching record comparisons, 98.77% of records are in the all other comparisons comparison level" }, { "bayes_factor": 224.62100960803812, "bayes_factor_description": "If comparison level is `exact match` then comparison is 224.62 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 7.81134906426195, "m_probability": 0.39258086363927386, "m_probability_description": "Amongst matching record comparisons, 39.26% of records are in the exact match comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "dob_l = dob_r", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0017477477477477479, "u_probability_description": "Amongst non-matching record comparisons, 0.17% of records are in the exact match comparison level" }, { "bayes_factor": 93.58478813959559, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 93.58 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 6.5482021390219165, "m_probability": 0.14988554656992287, "m_probability_description": "Amongst matching record comparisons, 14.99% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(dob_l, dob_r) <= 1", 
"tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0016016016016016017, "u_probability_description": "Amongst non-matching record comparisons, 0.16% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 13.323287650611148, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 13.32 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 3.735878220238321, "m_probability": 0.2067443495092833, "m_probability_description": "Amongst matching record comparisons, 20.67% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(dob_l, dob_r) <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.015517517517517518, "u_probability_description": "Amongst non-matching record comparisons, 1.55% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.25561183473710014, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 3.91 times less likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.9679734607884474, "m_probability": 0.25078924028151967, "m_probability_description": "Amongst matching record comparisons, 25.08% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9811331331331331, "u_probability_description": "Amongst 
non-matching record comparisons, 98.11% of records are in the all other comparisons comparison level" }, { "bayes_factor": 10.257652575272894, "bayes_factor_description": "If comparison level is `exact match` then comparison is 10.26 times more likely to be a match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 1, "has_tf_adjustments": true, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 3.3586287083381365, "m_probability": 0.5656846255360627, "m_probability_description": "Amongst matching record comparisons, 56.57% of records are in the exact match comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "city_l = city_r", "tf_adjustment_column": "city", "tf_adjustment_weight": 1, "u_probability": 0.0551475711801453, "u_probability_description": "Amongst non-matching record comparisons, 5.51% of records are in the exact match comparison level" }, { "bayes_factor": 0.45966477009156764, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 2.18 times less likely to be a match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.1213459964112529, "m_probability": 0.43431537446393775, "m_probability_description": "Amongst matching record comparisons, 43.43% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9448524288198547, "u_probability_description": "Amongst non-matching record comparisons, 94.49% of records are in the all other comparisons comparison level" }, { "bayes_factor": 255.4199334194977, 
"bayes_factor_description": "If comparison level is `exact match` then comparison is 255.42 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 7.996727309659809, "m_probability": 0.5603584650366957, "m_probability_description": "Amongst matching record comparisons, 56.04% of records are in the exact match comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "email_l = email_r", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0021938713143283602, "u_probability_description": "Amongst non-matching record comparisons, 0.22% of records are in the exact match comparison level" }, { "bayes_factor": 235.45821520840934, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 235.46 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 7.879327249357033, "m_probability": 0.17269329250389986, "m_probability_description": "Amongst matching record comparisons, 17.27% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(email_l, email_r) <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0007334349848487773, "u_probability_description": "Amongst non-matching record comparisons, 0.07% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 206.63645011925308, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 206.64 times more likely to be a match", "comparison_name": 
"email", "comparison_sort_order": 4, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 7.690950953987196, "m_probability": 0.12828947158266213, "m_probability_description": "Amongst matching record comparisons, 12.83% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(email_l, email_r) <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.000620846281034272, "u_probability_description": "Amongst non-matching record comparisons, 0.06% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.13915250519710004, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 7.19 times less likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -2.845261212787023, "m_probability": 0.13865877087674205, "m_probability_description": "Amongst matching record comparisons, 13.87% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9964518474197885, "u_probability_description": "Amongst non-matching record comparisons, 99.65% of records are in the all other comparisons comparison level" } ] }, "resolve": { "axis": { "y": "independent" }, "scale": { "y": "independent" } }, "selection": { "zoom_selector": { "bind": "scales", "encodings": [ "x" ], "type": "interval" } }, "title": { "subtitle": "Use mousewheel to zoom", "text": "Model parameters (components of final match weight)" }, "vconcat": [ { 
"encoding": { "color": { "field": "log2_bayes_factor", "scale": { "domain": [ -10, 0, 10 ], "range": [ "red", "orange", "green" ] }, "title": "Match weight", "type": "quantitative" }, "tooltip": [ { "field": "comparison_name", "title": "Comparison name", "type": "nominal" }, { "field": "probability_two_random_records_match", "format": ".4f", "title": "Probability two random records match", "type": "nominal" }, { "field": "log2_bayes_factor", "format": ",.4f", "title": "Equivalent match weight", "type": "quantitative" }, { "field": "bayes_factor_description", "title": "Match weight description", "type": "nominal" } ], "x": { "axis": { "domain": false, "labels": false, "ticks": false, "title": "" }, "field": "log2_bayes_factor", "scale": { "domain": [ -10, 10 ] }, "type": "quantitative" }, "y": { "axis": { "title": "Prior (starting) match weight", "titleAlign": "right", "titleAngle": 0, "titleFontWeight": "normal" }, "field": "label_for_charts", "sort": { "field": "comparison_vector_value", "order": "descending" }, "type": "nominal" } }, "height": 30, "mark": { "clip": true, "height": 20, "type": "bar" }, "selection": { "zoom_selector": { "bind": "scales", "encodings": [ "x" ], "type": "interval" } }, "transform": [ { "filter": "(datum.comparison_name == 'probability_two_random_records_match')" } ] }, { "encoding": { "color": { "field": "log2_bayes_factor", "scale": { "domain": [ -10, 0, 10 ], "range": [ "red", "orange", "green" ] }, "title": "Match weight", "type": "quantitative" }, "row": { "field": "comparison_name", "header": { "labelAlign": "left", "labelAnchor": "middle", "labelAngle": 0 }, "sort": { "field": "comparison_sort_order" }, "type": "nominal" }, "tooltip": [ { "field": "comparison_name", "title": "Comparison name", "type": "nominal" }, { "field": "label_for_charts", "title": "Label", "type": "ordinal" }, { "field": "sql_condition", "title": "SQL condition", "type": "nominal" }, { "field": "m_probability", "format": ".4f", "title": "M probability", 
"type": "quantitative" }, { "field": "u_probability", "format": ".4f", "title": "U probability", "type": "quantitative" }, { "field": "bayes_factor", "format": ",.4f", "title": "Bayes factor = m/u", "type": "quantitative" }, { "field": "log2_bayes_factor", "format": ",.4f", "title": "Match weight = log2(m/u)", "type": "quantitative" }, { "field": "bayes_factor_description", "title": "Match weight description", "type": "nominal" } ], "x": { "axis": { "title": "Comparison level match weight = log2(m/u)" }, "field": "log2_bayes_factor", "scale": { "domain": [ -10, 10 ] }, "type": "quantitative" }, "y": { "axis": { "title": null }, "field": "label_for_charts", "sort": { "field": "comparison_vector_value", "order": "descending" }, "type": "nominal" } }, "mark": { "clip": true, "type": "bar" }, "resolve": { "axis": { "y": "independent" }, "scale": { "y": "independent" } }, "selection": { "zoom_selector": { "bind": "scales", "encodings": [ "x" ], "type": "interval" } }, "transform": [ { "filter": "(datum.comparison_name != 'probability_two_random_records_match')" } ] } ] }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. 
For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linker.match_weights_chart()" ] }, { "cell_type": "code", "execution_count": 8, "id": "8576c042", "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v4+json": { "$schema": "https://vega.github.io/schema/vega-lite/v4.json", "config": { "header": { "title": null }, "title": { "anchor": "middle", "offset": 10 }, "view": { "height": 300, "width": 400 } }, "data": { "values": [ { "bayes_factor": 85.5492422084233, "bayes_factor_description": "If comparison level is `exact match` then comparison is 85.55 times more likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 6.41868317032687, "m_probability": 0.49563564273680927, "m_probability_description": "Amongst matching record comparisons, 49.56% of records are in the exact match comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "first_name_l = first_name_r", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0057935713975033705, "u_probability_description": "Amongst non-matching record comparisons, 0.58% of records are in the exact match comparison level" }, { "bayes_factor": 26.751309863975084, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 26.75 times more likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 4.741537628943031, "m_probability": 0.27072063394450885, "m_probability_description": "Amongst matching record comparisons, 
27.07% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(first_name_l, first_name_r) <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.010119901990634016, "u_probability_description": "Amongst non-matching record comparisons, 1.01% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.23742193089778274, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 4.21 times less likely to be a match", "comparison_name": "first_name", "comparison_sort_order": 0, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -2.074474890589354, "m_probability": 0.23364372331868066, "m_probability_description": "Amongst matching record comparisons, 23.36% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 2, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9840865266118626, "u_probability_description": "Amongst non-matching record comparisons, 98.41% of records are in the all other comparisons comparison level" }, { "bayes_factor": 90.1703772843795, "bayes_factor_description": "If comparison level is `exact match` then comparison is 90.17 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 6.494581652935134, "m_probability": 0.4409309402659144, "m_probability_description": "Amongst matching record comparisons, 44.09% of records are in the exact match comparison level", "max_comparison_vector_value": 3, 
"probability_two_random_records_match": 0.009738088021208104, "sql_condition": "surname_l = surname_r", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.004889975550122249, "u_probability_description": "Amongst non-matching record comparisons, 0.49% of records are in the exact match comparison level" }, { "bayes_factor": 79.15014455822713, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 79.15 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 6.306520080161341, "m_probability": 0.187141318174158, "m_probability_description": "Amongst matching record comparisons, 18.71% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(surname_l, surname_r) <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.002364383782476692, "u_probability_description": "Amongst non-matching record comparisons, 0.24% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 22.60384847607212, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 22.60 times more likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 4.498496518176822, "m_probability": 0.11323146703102362, "m_probability_description": "Amongst matching record comparisons, 11.32% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(surname_l, surname_r) <= 2", 
"tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.005009388872469557, "u_probability_description": "Amongst non-matching record comparisons, 0.50% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.26190825137661683, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 3.82 times less likely to be a match", "comparison_name": "surname", "comparison_sort_order": 1, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.9328665826115932, "m_probability": 0.2586962745289042, "m_probability_description": "Amongst matching record comparisons, 25.87% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9877362517949315, "u_probability_description": "Amongst non-matching record comparisons, 98.77% of records are in the all other comparisons comparison level" }, { "bayes_factor": 224.62100960803812, "bayes_factor_description": "If comparison level is `exact match` then comparison is 224.62 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 7.81134906426195, "m_probability": 0.39258086363927386, "m_probability_description": "Amongst matching record comparisons, 39.26% of records are in the exact match comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "dob_l = dob_r", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0017477477477477479, "u_probability_description": "Amongst non-matching record 
comparisons, 0.17% of records are in the exact match comparison level" }, { "bayes_factor": 93.58478813959559, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 93.58 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 6.5482021390219165, "m_probability": 0.14988554656992287, "m_probability_description": "Amongst matching record comparisons, 14.99% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(dob_l, dob_r) <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0016016016016016017, "u_probability_description": "Amongst non-matching record comparisons, 0.16% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 13.323287650611148, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 13.32 times more likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 3.735878220238321, "m_probability": 0.2067443495092833, "m_probability_description": "Amongst matching record comparisons, 20.67% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(dob_l, dob_r) <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.015517517517517518, "u_probability_description": "Amongst non-matching record comparisons, 1.55% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.25561183473710014, 
"bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 3.91 times less likely to be a match", "comparison_name": "dob", "comparison_sort_order": 2, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.9679734607884474, "m_probability": 0.25078924028151967, "m_probability_description": "Amongst matching record comparisons, 25.08% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9811331331331331, "u_probability_description": "Amongst non-matching record comparisons, 98.11% of records are in the all other comparisons comparison level" }, { "bayes_factor": 10.257652575272894, "bayes_factor_description": "If comparison level is `exact match` then comparison is 10.26 times more likely to be a match", "comparison_name": "city", "comparison_sort_order": 3, "comparison_vector_value": 1, "has_tf_adjustments": true, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 3.3586287083381365, "m_probability": 0.5656846255360627, "m_probability_description": "Amongst matching record comparisons, 56.57% of records are in the exact match comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "city_l = city_r", "tf_adjustment_column": "city", "tf_adjustment_weight": 1, "u_probability": 0.0551475711801453, "u_probability_description": "Amongst non-matching record comparisons, 5.51% of records are in the exact match comparison level" }, { "bayes_factor": 0.45966477009156764, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 2.18 times less likely to be a match", "comparison_name": "city", 
"comparison_sort_order": 3, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -1.1213459964112529, "m_probability": 0.43431537446393775, "m_probability_description": "Amongst matching record comparisons, 43.43% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 1, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9448524288198547, "u_probability_description": "Amongst non-matching record comparisons, 94.49% of records are in the all other comparisons comparison level" }, { "bayes_factor": 255.4199334194977, "bayes_factor_description": "If comparison level is `exact match` then comparison is 255.42 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 3, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "Exact match", "log2_bayes_factor": 7.996727309659809, "m_probability": 0.5603584650366957, "m_probability_description": "Amongst matching record comparisons, 56.04% of records are in the exact match comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "email_l = email_r", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0021938713143283602, "u_probability_description": "Amongst non-matching record comparisons, 0.22% of records are in the exact match comparison level" }, { "bayes_factor": 235.45821520840934, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 235.46 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 2, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 1", 
"log2_bayes_factor": 7.879327249357033, "m_probability": 0.17269329250389986, "m_probability_description": "Amongst matching record comparisons, 17.27% of records are in the levenshtein <= 1 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(email_l, email_r) <= 1", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.0007334349848487773, "u_probability_description": "Amongst non-matching record comparisons, 0.07% of records are in the levenshtein <= 1 comparison level" }, { "bayes_factor": 206.63645011925308, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 206.64 times more likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 1, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 7.690950953987196, "m_probability": 0.12828947158266213, "m_probability_description": "Amongst matching record comparisons, 12.83% of records are in the levenshtein <= 2 comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "levenshtein(email_l, email_r) <= 2", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.000620846281034272, "u_probability_description": "Amongst non-matching record comparisons, 0.06% of records are in the levenshtein <= 2 comparison level" }, { "bayes_factor": 0.13915250519710004, "bayes_factor_description": "If comparison level is `all other comparisons` then comparison is 7.19 times less likely to be a match", "comparison_name": "email", "comparison_sort_order": 4, "comparison_vector_value": 0, "has_tf_adjustments": false, "is_null_level": false, "label_for_charts": "All other comparisons", "log2_bayes_factor": -2.845261212787023, "m_probability": 0.13865877087674205, 
"m_probability_description": "Amongst matching record comparisons, 13.87% of records are in the all other comparisons comparison level", "max_comparison_vector_value": 3, "probability_two_random_records_match": 0.009738088021208104, "sql_condition": "ELSE", "tf_adjustment_column": null, "tf_adjustment_weight": 1, "u_probability": 0.9964518474197885, "u_probability_description": "Amongst non-matching record comparisons, 99.65% of records are in the all other comparisons comparison level" } ] }, "hconcat": [ { "encoding": { "color": { "value": "green" }, "row": { "field": "comparison_name", "header": { "labelAlign": "left", "labelAnchor": "middle", "labelAngle": 0 }, "sort": { "field": "comparison_sort_order" }, "type": "nominal" }, "tooltip": [ { "field": "m_probability_description", "title": "m probability description", "type": "nominal" }, { "field": "comparison_name", "title": "Comparison column name", "type": "nominal" }, { "field": "label_for_charts", "title": "Label", "type": "ordinal" }, { "field": "sql_condition", "title": "SQL condition", "type": "nominal" }, { "field": "m_probability", "format": ".4p", "title": "m probability", "type": "quantitative" }, { "field": "u_probability", "format": ".4p", "title": "u probability", "type": "quantitative" }, { "field": "bayes_factor", "format": ",.4f", "title": "Bayes factor = m/u", "type": "quantitative" }, { "field": "log2_bayes_factor", "format": ",.4f", "title": "Match weight = log2(m/u)", "type": "quantitative" } ], "x": { "axis": { "title": "Proportion of record comparisons" }, "field": "m_probability", "type": "quantitative" }, "y": { "axis": { "title": null }, "field": "label_for_charts", "sort": { "field": "comparison_vector_value", "order": "descending" }, "type": "nominal" } }, "height": 50, "mark": "bar", "resolve": { "scale": { "y": "independent" } }, "title": { "fontSize": 12, "fontWeight": "bold", "text": "Amongst matching record comparisons:" }, "transform": [ { "filter": "(datum.bayes_factor != 
'no-op filter due to vega lite issue 4680')" } ], "width": 150 }, { "encoding": { "color": { "value": "red" }, "row": { "field": "comparison_name", "header": { "labels": false }, "sort": { "field": "comparison_sort_order" }, "type": "nominal" }, "tooltip": [ { "field": "u_probability_description", "title": "u probability description", "type": "nominal" }, { "field": "comparison_name", "title": "Comparison column name", "type": "nominal" }, { "field": "label_for_charts", "title": "Label", "type": "ordinal" }, { "field": "sql_condition", "title": "SQL condition", "type": "nominal" }, { "field": "m_probability", "format": ".4p", "title": "m probability", "type": "quantitative" }, { "field": "u_probability", "format": ".4p", "title": "u probability", "type": "quantitative" }, { "field": "bayes_factor", "format": ",.4f", "title": "Bayes factor = m/u", "type": "quantitative" }, { "field": "log2_bayes_factor", "format": ",.4f", "title": "Match weight = log2(m/u)", "type": "quantitative" } ], "x": { "axis": { "title": "Proportion of record comparisons" }, "field": "u_probability", "type": "quantitative" }, "y": { "axis": { "title": null }, "field": "label_for_charts", "sort": { "field": "comparison_vector_value", "order": "descending" }, "type": "nominal" } }, "height": 50, "mark": "bar", "resolve": { "scale": { "y": "independent" } }, "title": { "fontSize": 12, "fontWeight": "bold", "text": "Amongst non-matching record comparisons:" }, "transform": [ { "filter": "(datum.bayes_factor != 'no-op filter2 due to vega lite issue 4680')" } ], "width": 150 } ], "title": { "subtitle": "(m and u probabilities)", "text": "Proportion of record comparisons in each comparison level by match status" } }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. 
For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linker.m_u_parameters_chart()" ] }, { "cell_type": "markdown", "id": "c44fcc26", "metadata": {}, "source": [ "### Saving the model\n", "\n", "Finally, we can save the model, including the estimated parameters, to a `.json` file so that it can be loaded in the next tutorial." ] }, { "cell_type": "code", "execution_count": 9, "id": "992703a7", "metadata": {}, "outputs": [], "source": [ "linker.save_settings_to_json(\"./demo_settings/saved_model_from_demo.json\", overwrite=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "vscode": { "interpreter": { "hash": "3b53fa520a31e303a9636a08ff10a3bbc14893ee50cb37445791fa59628fc75b" } } }, "nbformat": 4, "nbformat_minor": 5 }