{ "cells": [ { "cell_type": "markdown", "id": "f110e018", "metadata": {}, "source": [ "# Visualising predictions\n", "\n", "Splink contains a variety of tools to help you visualise your predictions.\n", "\n", "By developing an understanding of how your model works, you can gain confidence that its predictions are sensible, or find examples where the model isn't working. Such examples may help you improve the model specification and fix these problems.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "fb29d421", "metadata": {}, "outputs": [], "source": [ "# Rerun our predictions so we're ready to view the charts\n", "from splink.duckdb.duckdb_linker import DuckDBLinker\n", "import pandas as pd\n", "import altair as alt\n", "alt.renderers.enable('mimetype')\n", "\n", "df = pd.read_csv(\"./data/fake_1000.csv\")\n", "linker = DuckDBLinker(df)\n", "linker.load_settings_from_json(\"./demo_settings/saved_model_from_demo.json\")\n", "df_predictions = linker.predict(threshold_match_probability=0.2)" ] }, { "cell_type": "markdown", "id": "7b0dedd9", "metadata": {}, "source": [ "## Waterfall chart\n", "\n", "The waterfall chart provides a means of visualising individual predictions to understand how Splink computed the final match weight for a particular pairwise record comparison.\n", "\n", "To plot a waterfall chart, choose one or more records from the results of `linker.predict()` and pass them to the [`linker.waterfall_chart()`](https://moj-analytical-services.github.io/splink/linkerqa.html#splink.linker.Linker.waterfall_chart) function.\n", "\n", "For an introduction to waterfall charts and how to interpret them, see [this video](https://www.youtube.com/watch?v=msz3T741KQI&t=507s)."
] }, { "cell_type": "code", "execution_count": 2, "id": "bbfdc70c", "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v4+json": { "$schema": "https://vega.github.io/schema/vega-lite/v5.2.0.json", "config": { "view": { "continuousHeight": 300, "continuousWidth": 400 } }, "data": { "values": [ { "bar_sort_order": 0, "bayes_factor": 0.0033430420247643373, "bayes_factor_description": null, "column_name": "Prior", "comparison_vector_value": null, "label_for_charts": "Starting match weight (prior)", "log2_bayes_factor": -8.224622793739668, "m_probability": null, "record_number": 0, "sql_condition": null, "term_frequency_adjustment": null, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 1, "bayes_factor": 85.5485223719195, "bayes_factor_description": "If comparison level is `exact match` then comparison is 85.55 times more likely to be a match", "column_name": "first_name", "comparison_vector_value": 2, "label_for_charts": "Exact match", "log2_bayes_factor": 6.4186710310150925, "m_probability": 0.49563147231263005, "record_number": 0, "sql_condition": "\"first_name_l\" = \"first_name_r\"", "term_frequency_adjustment": false, "u_probability": 0.0057935713975033705, "value_l": "Grace", "value_r": "Grace" }, { "bar_sort_order": 2, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "surname", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 0, "sql_condition": "\"surname_l\" IS NULL OR \"surname_r\" IS NULL", "term_frequency_adjustment": false, "u_probability": null, "value_l": "nan", "value_r": "Kelly" }, { "bar_sort_order": 3, "bayes_factor": 93.58522958762835, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 93.59 times more likely to be a match", "column_name": "dob", "comparison_vector_value": 2, "label_for_charts": 
"levenshtein <= 1", "log2_bayes_factor": 6.548208944330752, "m_probability": 0.14988625359379917, "record_number": 0, "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 1", "term_frequency_adjustment": false, "u_probability": 0.0016016016016016017, "value_l": "1997-04-26", "value_r": "1991-04-26" }, { "bar_sort_order": 4, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "city", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 0, "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL", "term_frequency_adjustment": false, "u_probability": null, "value_l": "Hull", "value_r": "nan" }, { "bar_sort_order": 5, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "tf_city", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 0, "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL", "term_frequency_adjustment": true, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 6, "bayes_factor": 255.41995286654844, "bayes_factor_description": "If comparison level is `exact match` then comparison is 255.42 times more likely to be a match", "column_name": "email", "comparison_vector_value": 3, "label_for_charts": "Exact match", "log2_bayes_factor": 7.9967274195030855, "m_probability": 0.5603585077010225, "record_number": 0, "sql_condition": "\"email_l\" = \"email_r\"", "term_frequency_adjustment": false, "u_probability": 0.0021938713143283602, "value_l": "grace.kelly52@jones.com", "value_r": "grace.kelly52@jones.com" }, { "bar_sort_order": 7, "bayes_factor": 6836.2270630146295, "bayes_factor_description": null, "column_name": "Final score", "comparison_vector_value": null, "label_for_charts": "Final score", 
"log2_bayes_factor": 12.738984601109264, "m_probability": null, "record_number": 0, "sql_condition": null, "term_frequency_adjustment": null, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 0, "bayes_factor": 0.0033430420247643373, "bayes_factor_description": null, "column_name": "Prior", "comparison_vector_value": null, "label_for_charts": "Starting match weight (prior)", "log2_bayes_factor": -8.224622793739668, "m_probability": null, "record_number": 1, "sql_condition": null, "term_frequency_adjustment": null, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 1, "bayes_factor": 85.5485223719195, "bayes_factor_description": "If comparison level is `exact match` then comparison is 85.55 times more likely to be a match", "column_name": "first_name", "comparison_vector_value": 2, "label_for_charts": "Exact match", "log2_bayes_factor": 6.4186710310150925, "m_probability": 0.49563147231263005, "record_number": 1, "sql_condition": "\"first_name_l\" = \"first_name_r\"", "term_frequency_adjustment": false, "u_probability": 0.0057935713975033705, "value_l": "Thomas", "value_r": "Thomas" }, { "bar_sort_order": 2, "bayes_factor": 90.16953460720651, "bayes_factor_description": "If comparison level is `exact match` then comparison is 90.17 times more likely to be a match", "column_name": "surname", "comparison_vector_value": 3, "label_for_charts": "Exact match", "log2_bayes_factor": 6.494568170327033, "m_probability": 0.4409268195951419, "record_number": 1, "sql_condition": "\"surname_l\" = \"surname_r\"", "term_frequency_adjustment": false, "u_probability": 0.004889975550122249, "value_l": "Gabriel", "value_r": "Gabriel" }, { "bar_sort_order": 3, "bayes_factor": 93.58522958762835, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 93.59 times more likely to be a match", "column_name": "dob", "comparison_vector_value": 2, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 
6.548208944330752, "m_probability": 0.14988625359379917, "record_number": 1, "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 1", "term_frequency_adjustment": false, "u_probability": 0.0016016016016016017, "value_l": "1976-09-15", "value_r": "1976-08-15" }, { "bar_sort_order": 4, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "city", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 1, "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL", "term_frequency_adjustment": false, "u_probability": null, "value_l": "Loodon", "value_r": "nan" }, { "bar_sort_order": 5, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "tf_city", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 1, "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL", "term_frequency_adjustment": true, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 6, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "email", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 1, "sql_condition": "\"email_l\" IS NULL OR \"email_r\" IS NULL", "term_frequency_adjustment": false, "u_probability": null, "value_l": "gabriel.t54@nnichls.info", "value_r": "nan" }, { "bar_sort_order": 7, "bayes_factor": 2413.356536258096, "bayes_factor_description": null, "column_name": "Final score", "comparison_vector_value": null, "label_for_charts": "Final score", "log2_bayes_factor": 11.236825351933211, "m_probability": null, "record_number": 1, "sql_condition": null, 
"term_frequency_adjustment": null, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 0, "bayes_factor": 0.0033430420247643373, "bayes_factor_description": null, "column_name": "Prior", "comparison_vector_value": null, "label_for_charts": "Starting match weight (prior)", "log2_bayes_factor": -8.224622793739668, "m_probability": null, "record_number": 2, "sql_condition": null, "term_frequency_adjustment": null, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 1, "bayes_factor": 85.5485223719195, "bayes_factor_description": "If comparison level is `exact match` then comparison is 85.55 times more likely to be a match", "column_name": "first_name", "comparison_vector_value": 2, "label_for_charts": "Exact match", "log2_bayes_factor": 6.4186710310150925, "m_probability": 0.49563147231263005, "record_number": 2, "sql_condition": "\"first_name_l\" = \"first_name_r\"", "term_frequency_adjustment": false, "u_probability": 0.0057935713975033705, "value_l": "Thomas", "value_r": "Thomas" }, { "bar_sort_order": 2, "bayes_factor": 90.16953460720651, "bayes_factor_description": "If comparison level is `exact match` then comparison is 90.17 times more likely to be a match", "column_name": "surname", "comparison_vector_value": 3, "label_for_charts": "Exact match", "log2_bayes_factor": 6.494568170327033, "m_probability": 0.4409268195951419, "record_number": 2, "sql_condition": "\"surname_l\" = \"surname_r\"", "term_frequency_adjustment": false, "u_probability": 0.004889975550122249, "value_l": "Gabriel", "value_r": "Gabriel" }, { "bar_sort_order": 3, "bayes_factor": 93.58522958762835, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 93.59 times more likely to be a match", "column_name": "dob", "comparison_vector_value": 2, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 6.548208944330752, "m_probability": 0.14988625359379917, "record_number": 2, "sql_condition": 
"levenshtein(\"dob_l\", \"dob_r\") <= 1", "term_frequency_adjustment": false, "u_probability": 0.0016016016016016017, "value_l": "1976-09-15", "value_r": "1976-08-15" }, { "bar_sort_order": 4, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "city", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 2, "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL", "term_frequency_adjustment": false, "u_probability": null, "value_l": "London", "value_r": "nan" }, { "bar_sort_order": 5, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "tf_city", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 2, "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL", "term_frequency_adjustment": true, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 6, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "email", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 2, "sql_condition": "\"email_l\" IS NULL OR \"email_r\" IS NULL", "term_frequency_adjustment": false, "u_probability": null, "value_l": "gabriel.t54@nichols.info", "value_r": "nan" }, { "bar_sort_order": 7, "bayes_factor": 2413.356536258096, "bayes_factor_description": null, "column_name": "Final score", "comparison_vector_value": null, "label_for_charts": "Final score", "log2_bayes_factor": 11.236825351933211, "m_probability": null, "record_number": 2, "sql_condition": null, "term_frequency_adjustment": null, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 
0, "bayes_factor": 0.0033430420247643373, "bayes_factor_description": null, "column_name": "Prior", "comparison_vector_value": null, "label_for_charts": "Starting match weight (prior)", "log2_bayes_factor": -8.224622793739668, "m_probability": null, "record_number": 3, "sql_condition": null, "term_frequency_adjustment": null, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 1, "bayes_factor": 85.5485223719195, "bayes_factor_description": "If comparison level is `exact match` then comparison is 85.55 times more likely to be a match", "column_name": "first_name", "comparison_vector_value": 2, "label_for_charts": "Exact match", "log2_bayes_factor": 6.4186710310150925, "m_probability": 0.49563147231263005, "record_number": 3, "sql_condition": "\"first_name_l\" = \"first_name_r\"", "term_frequency_adjustment": false, "u_probability": 0.0057935713975033705, "value_l": "Erin", "value_r": "Erin" }, { "bar_sort_order": 2, "bayes_factor": 79.14940905778622, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 79.15 times more likely to be a match", "column_name": "surname", "comparison_vector_value": 2, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 6.306506673896895, "m_probability": 0.1871395791688435, "record_number": 3, "sql_condition": "levenshtein(\"surname_l\", \"surname_r\") <= 1", "term_frequency_adjustment": false, "u_probability": 0.002364383782476692, "value_l": "Rogers", "value_r": "Roers" }, { "bar_sort_order": 3, "bayes_factor": 13.323346076010694, "bayes_factor_description": "If comparison level is `levenshtein <= 2` then comparison is 13.32 times more likely to be a match", "column_name": "dob", "comparison_vector_value": 1, "label_for_charts": "levenshtein <= 2", "log2_bayes_factor": 3.7358845467435784, "m_probability": 0.20674525612644423, "record_number": 3, "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 2", "term_frequency_adjustment": false, "u_probability": 
0.015517517517517518, "value_l": "2010-01-02", "value_r": "2010-03-03" }, { "bar_sort_order": 4, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "city", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 3, "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL", "term_frequency_adjustment": false, "u_probability": null, "value_l": "London", "value_r": "nan" }, { "bar_sort_order": 5, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "tf_city", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 3, "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL", "term_frequency_adjustment": true, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 6, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "email", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 3, "sql_condition": "\"email_l\" IS NULL OR \"email_r\" IS NULL", "term_frequency_adjustment": false, "u_probability": null, "value_l": "e.rogers3@hopkins.org", "value_r": "nan" }, { "bar_sort_order": 7, "bayes_factor": 301.5888868398943, "bayes_factor_description": null, "column_name": "Final score", "comparison_vector_value": null, "label_for_charts": "Final score", "log2_bayes_factor": 8.2364394579159, "m_probability": null, "record_number": 3, "sql_condition": null, "term_frequency_adjustment": null, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 0, "bayes_factor": 0.0033430420247643373, "bayes_factor_description": null, "column_name": "Prior", 
"comparison_vector_value": null, "label_for_charts": "Starting match weight (prior)", "log2_bayes_factor": -8.224622793739668, "m_probability": null, "record_number": 4, "sql_condition": null, "term_frequency_adjustment": null, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 1, "bayes_factor": 85.5485223719195, "bayes_factor_description": "If comparison level is `exact match` then comparison is 85.55 times more likely to be a match", "column_name": "first_name", "comparison_vector_value": 2, "label_for_charts": "Exact match", "log2_bayes_factor": 6.4186710310150925, "m_probability": 0.49563147231263005, "record_number": 4, "sql_condition": "\"first_name_l\" = \"first_name_r\"", "term_frequency_adjustment": false, "u_probability": 0.0057935713975033705, "value_l": "Erin", "value_r": "Erin" }, { "bar_sort_order": 2, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "surname", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 4, "sql_condition": "\"surname_l\" IS NULL OR \"surname_r\" IS NULL", "term_frequency_adjustment": false, "u_probability": null, "value_l": "nan", "value_r": "Roers" }, { "bar_sort_order": 3, "bayes_factor": 93.58522958762835, "bayes_factor_description": "If comparison level is `levenshtein <= 1` then comparison is 93.59 times more likely to be a match", "column_name": "dob", "comparison_vector_value": 2, "label_for_charts": "levenshtein <= 1", "log2_bayes_factor": 6.548208944330752, "m_probability": 0.14988625359379917, "record_number": 4, "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 1", "term_frequency_adjustment": false, "u_probability": 0.0016016016016016017, "value_l": "2010-03-01", "value_r": "2010-03-03" }, { "bar_sort_order": 4, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 
times more likely to be a match", "column_name": "city", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 4, "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL", "term_frequency_adjustment": false, "u_probability": null, "value_l": "London", "value_r": "nan" }, { "bar_sort_order": 5, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "tf_city", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 4, "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL", "term_frequency_adjustment": true, "u_probability": null, "value_l": "", "value_r": "" }, { "bar_sort_order": 6, "bayes_factor": 1, "bayes_factor_description": "If comparison level is `null` then comparison is 1.00 times more likely to be a match", "column_name": "email", "comparison_vector_value": -1, "label_for_charts": "Null", "log2_bayes_factor": 0, "m_probability": null, "record_number": 4, "sql_condition": "\"email_l\" IS NULL OR \"email_r\" IS NULL", "term_frequency_adjustment": false, "u_probability": null, "value_l": "e.rogers3@honkips.org", "value_r": "nan" }, { "bar_sort_order": 7, "bayes_factor": 26.76465556544211, "bayes_factor_description": null, "column_name": "Final score", "comparison_vector_value": null, "label_for_charts": "Final score", "log2_bayes_factor": 4.742257181606178, "m_probability": null, "record_number": 4, "sql_condition": null, "term_frequency_adjustment": null, "u_probability": null, "value_l": "", "value_r": "" } ] }, "height": 450, "layer": [ { "layer": [ { "encoding": { "color": { "value": "black" }, "size": { "value": 0.5 }, "y": { "field": "zero", "type": "quantitative" } }, "mark": "rule" }, { "encoding": { "color": { "condition": { "test": "(datum.log2_bayes_factor < 0)", "value": "red" }, "value": "green" }, "opacity": { 
"condition": { "test": "datum.column_name == 'Prior match weight' || datum.column_name == 'Final score'", "value": 1 }, "value": 0.5 }, "tooltip": [ { "field": "column_name", "title": "Comparison column", "type": "nominal" }, { "field": "value_l", "title": "Value (L)", "type": "nominal" }, { "field": "value_r", "title": "Value (R)", "type": "nominal" }, { "field": "label_for_charts", "title": "Label", "type": "ordinal" }, { "field": "sql_condition", "title": "SQL condition", "type": "nominal" }, { "field": "comparison_vector_value", "title": "Comparison vector value", "type": "nominal" }, { "field": "bayes_factor", "format": ",.4f", "title": "Bayes factor = m/u", "type": "quantitative" }, { "field": "log2_bayes_factor", "format": ",.4f", "title": "Match weight = log2(m/u)", "type": "quantitative" }, { "field": "prob", "format": ".4f", "title": "Adjusted match score", "type": "quantitative" }, { "field": "bayes_factor_description", "title": "Match weight description", "type": "nominal" } ], "x": { "axis": { "grid": true, "labelAlign": "center", "labelAngle": -20, "labelExpr": "datum.value == 'Prior' || datum.value == 'Final score' ? 
'' : datum.value", "labelPadding": 10, "tickBand": "extent", "title": "Column" }, "field": "column_name", "sort": { "field": "bar_sort_order", "order": "ascending" }, "type": "nominal" }, "y": { "axis": { "grid": false, "orient": "left", "title": "log2(Bayes factor)" }, "field": "previous_sum", "type": "quantitative" }, "y2": { "field": "sum" } }, "mark": { "type": "bar", "width": 60 } }, { "encoding": { "color": { "value": "white" }, "text": { "condition": { "field": "log2_bayes_factor", "format": ".2f", "test": "abs(datum.log2_bayes_factor) > 1", "type": "nominal" }, "value": "" }, "x": { "axis": { "labelAngle": 0, "title": "Column" }, "field": "column_name", "sort": { "field": "bar_sort_order", "order": "ascending" }, "type": "nominal" }, "y": { "axis": { "orient": "left" }, "field": "center", "type": "quantitative" } }, "mark": { "fontWeight": "bold", "type": "text" } }, { "encoding": { "color": { "value": "black" }, "text": { "field": "column_name", "type": "nominal" }, "x": { "axis": { "labelAngle": 0, "title": "Column" }, "field": "column_name", "sort": { "field": "bar_sort_order", "order": "ascending" }, "type": "nominal" }, "y": { "field": "sum_top", "type": "quantitative" } }, "mark": { "baseline": "bottom", "dy": -25, "fontWeight": "bold", "type": "text" } }, { "encoding": { "color": { "value": "grey" }, "text": { "field": "value_l", "type": "nominal" }, "x": { "axis": { "labelAngle": 0, "title": "Column" }, "field": "column_name", "sort": { "field": "bar_sort_order", "order": "ascending" }, "type": "nominal" }, "y": { "field": "sum_top", "type": "quantitative" } }, "mark": { "baseline": "bottom", "dy": -13, "fontSize": 8, "type": "text" } }, { "encoding": { "color": { "value": "grey" }, "text": { "field": "value_r", "type": "nominal" }, "x": { "axis": { "labelAngle": 0, "title": "Column" }, "field": "column_name", "sort": { "field": "bar_sort_order", "order": "ascending" }, "type": "nominal" }, "y": { "field": "sum_top", "type": "quantitative" } }, 
"mark": { "baseline": "bottom", "dy": -5, "fontSize": 8, "type": "text" } } ] }, { "encoding": { "x": { "axis": { "labelAngle": 0, "title": "Column" }, "field": "column_name", "sort": { "field": "bar_sort_order", "order": "ascending" }, "type": "nominal" }, "x2": { "field": "lead" }, "y": { "axis": { "labelExpr": "format(1 / (1 + pow(2, -1*datum.value)), '.2r')", "orient": "right", "title": "Probability" }, "field": "sum", "scale": { "zero": false }, "type": "quantitative" } }, "mark": { "color": "black", "strokeWidth": 2, "type": "rule", "x2Offset": 30, "xOffset": -30 } } ], "params": [ { "bind": { "input": "range", "max": 4, "min": 0, "step": 1 }, "description": "Filter by the interation number", "name": "record_number", "value": 0 } ], "resolve": { "axis": { "y": "independent" } }, "title": { "subtitle": "How each comparison contributes to the final match score", "text": "Match weights waterfall chart" }, "transform": [ { "filter": "(datum.record_number == record_number)" }, { "frame": [ null, 0 ], "window": [ { "as": "sum", "field": "log2_bayes_factor", "op": "sum" }, { "as": "lead", "field": "column_name", "op": "lead" } ] }, { "as": "sum", "calculate": "datum.column_name === \"Final score\" ? datum.sum - datum.log2_bayes_factor : datum.sum" }, { "as": "lead", "calculate": "datum.lead === null ? datum.column_name : datum.lead" }, { "as": "previous_sum", "calculate": "datum.column_name === \"Final score\" || datum.column_name === \"Prior match weight\" ? 0 : datum.sum - datum.log2_bayes_factor" }, { "as": "top_label", "calculate": "datum.sum > datum.previous_sum ? datum.column_name : \"\"" }, { "as": "bottom_label", "calculate": "datum.sum < datum.previous_sum ? datum.column_name : \"\"" }, { "as": "sum_top", "calculate": "datum.sum > datum.previous_sum ? datum.sum : datum.previous_sum" }, { "as": "sum_bottom", "calculate": "datum.sum < datum.previous_sum ? 
datum.sum : datum.previous_sum" }, { "as": "center", "calculate": "(datum.sum + datum.previous_sum) / 2" }, { "as": "text_log2_bayes_factor", "calculate": "(datum.log2_bayes_factor > 0 ? \"+\" : \"\") + datum.log2_bayes_factor" }, { "as": "dy", "calculate": "datum.sum < datum.previous_sum ? 4 : -4" }, { "as": "baseline", "calculate": "datum.sum < datum.previous_sum ? \"top\" : \"bottom\"" }, { "as": "prob", "calculate": "1. / (1 + pow(2, -1.*datum.sum))" }, { "as": "zero", "calculate": "0*datum.sum" } ], "width": { "step": 75 } }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "records_to_view = df_predictions.as_record_dict(limit=5)\n", "linker.waterfall_chart(records_to_view, filter_nulls=False)\n" ] }, { "cell_type": "markdown", "id": "48b76176", "metadata": {}, "source": [ "## Comparison viewer dashboard\n", "\n", "The [comparison viewer dashboard](https://moj-analytical-services.github.io/splink/linkerqa.html#splink.linker.Linker.comparison_viewer_dashboard) takes this one step further by producing an interactive dashboard that contains example predictions from across the spectrum of match scores.\n", "\n", "An in-depth video describing how to interpret the dashboard can be found [here](https://www.youtube.com/watch?v=DNvCMqjipis).\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "da85169c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linker.comparison_viewer_dashboard(df_predictions, \"scv.html\", overwrite=True)\n", "\n", "# You can view the scv.html file in your browser, or inline in a notebook as
follows\n", "from IPython.display import IFrame\n", "IFrame(\n", " src=\"./scv.html\", width=\"100%\", height=1200\n", ")" ] }, { "cell_type": "markdown", "id": "d34df82c", "metadata": {}, "source": [ "## Cluster studio dashboard\n", "\n", "Cluster studio is an interactive dashboard that visualises the results of clustering your predictions.\n", "\n", "It provides examples of clusters of different sizes. Because the shape and size of a cluster can indicate problems with record linkage, the dashboard helps you find potential false positive and false negative links." ] }, { "cell_type": "code", "execution_count": 4, "id": "e2153d91", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Completed iteration 1, root rows count 10\n", "Completed iteration 2, root rows count 0\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_clusters = linker.cluster_pairwise_predictions_at_threshold(df_predictions, threshold_match_probability=0.5)\n", "\n", "linker.cluster_studio_dashboard(df_predictions, df_clusters, \"cluster_studio.html\", sampling_method=\"by_cluster_size\", overwrite=True)\n", "\n", "# You can view the cluster_studio.html file in your browser, or inline in a notebook as follows\n", "from IPython.display import IFrame\n", "IFrame(\n", " src=\"./cluster_studio.html\", width=\"100%\", height=1200\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "splink_demos", "language": "python", "name": "splink_demos" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "vscode": { "interpreter": { "hash": "3b53fa520a31e303a9636a08ff10a3bbc14893ee50cb37445791fa59628fc75b" } } }, "nbformat": 4, "nbformat_minor": 5 }