{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimating m from a sample of pairwise labels\n", "\n", "In this example, we estimate the model's m probabilities from a table of pairwise record comparisons that we know to be 'true' matches. Such a table might, for example, be produced by a clerical team that has manually labelled a sample of matches.\n", "\n", "The table must be in the following format:\n", "\n", "|source_dataset_l|unique_id_l|source_dataset_r|unique_id_r|\n", "|----------------|-----------|----------------|-----------|\n", "|df_1 |1 |df_2 |2 |\n", "|df_1 |1 |df_2 |3 |\n", "\n", "Every row in the table is assumed to represent a definite match.\n", "\n", "The column names above are the defaults. If you have chosen custom values for [`unique_id_column_name`](https://moj-analytical-services.github.io/splink/settings_dict_guide.html#unique_id_column_name) and [`source_dataset_column_name`](https://moj-analytical-services.github.io/splink/settings_dict_guide.html#source_dataset_column_name), the column names should correspond to those values instead.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RendererRegistry.enable('mimetype')" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "import altair as alt\n", "alt.renderers.enable(\"mimetype\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | unique_id_l | source_dataset_l | unique_id_r | source_dataset_r |\n",
"|---|---|---|---|---|\n",
"| 0 | 0 | fake_1000 | 3 | fake_1000 |\n",
"| 1 | 1 | fake_1000 | 3 | fake_1000 |\n",
"| 2 | 2 | fake_1000 | 3 | fake_1000 |\n",
"| 3 | 4 | fake_1000 | 5 | fake_1000 |\n",
"| 4 | 7 | fake_1000 | 10 | fake_1000 |\n",
"| ... | ... | ... | ... | ... |\n",
"| 2026 | 978 | fake_1000 | 979 | fake_1000 |\n",
"| 2027 | 985 | fake_1000 | 986 | fake_1000 |\n",
"| 2028 | 624 | fake_1000 | 626 | fake_1000 |\n",
"| 2029 | 625 | fake_1000 | 626 | fake_1000 |\n",
"| 2030 | 624 | fake_1000 | 625 | fake_1000 |\n",
"\n",
"2031 rows × 4 columns
\n", "\n", " | unique_id | \n", "first_name | \n", "surname | \n", "dob | \n", "city | \n", "cluster | \n", "|
---|---|---|---|---|---|---|---|
0 | \n", "0 | \n", "Robert | \n", "Alan | \n", "1971-06-24 | \n", "NaN | \n", "robert255@smith.net | \n", "0 | \n", "
1 | \n", "1 | \n", "Robert | \n", "Allen | \n", "1971-05-24 | \n", "NaN | \n", "roberta25@smith.net | \n", "0 | \n", "