{ "cells": [ { "cell_type": "markdown", "id": "26e50a28", "metadata": {}, "source": [ "# Exploratory analysis\n", "\n", "The purpose of exploratory analysis is to understand your data and any idiosyncrasies which may be relevant to the task of data linking.\n", "\n", "Splink includes functionality to visualise and summarise your data, to identify characteristics most salient to data linking.\n", "\n", "In this notebook we perform some basic exploratory analysis, and interpret the results." ] }, { "cell_type": "markdown", "id": "96a3d08d", "metadata": {}, "source": [ "### Read in the data\n", "\n", "For the purpose of this tutorial we will use a 1,000 row synthetic dataset that contains duplicates.\n", "\n", "The first five rows of this dataset are printed below.\n", "\n", "Note that the cluster column represents the 'ground truth' - a column which tells us with which rows refer to the same person. In most real linkage scenarios, we wouldn't have this column (this is what Splink is trying to estimate.)" ] }, { "cell_type": "code", "execution_count": 13, "id": "ffceed65", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
unique_idfirst_namesurnamedobcityemailcluster
00RobertAlan1971-06-24NaNrobert255@smith.net0
11RobertAllen1971-05-24NaNroberta25@smith.net0
22RobAllen1971-06-24Londonroberta25@smith.net0
33RobertAlen1971-06-24LononNaN0
44GraceNaN1997-04-26Hullgrace.kelly52@jones.com1
\n", "
" ], "text/plain": [ " unique_id first_name surname dob city email \\\n", "0 0 Robert Alan 1971-06-24 NaN robert255@smith.net \n", "1 1 Robert Allen 1971-05-24 NaN roberta25@smith.net \n", "2 2 Rob Allen 1971-06-24 London roberta25@smith.net \n", "3 3 Robert Alen 1971-06-24 Lonon NaN \n", "4 4 Grace NaN 1997-04-26 Hull grace.kelly52@jones.com \n", "\n", " cluster \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 1 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "import altair as alt\n", "alt.renderers.enable('default')\n", "\n", "\n", "df = pd.read_csv(\"./data/fake_1000.csv\")\n", "df.head(5)" ] }, { "cell_type": "markdown", "id": "2d53e50b", "metadata": {}, "source": [ "### Instantiate the linker\n", "\n", "Most of Splink's core functionality can be accessed as methods on a linker object. For example, to make predictions, you would call `linker.predict()`.\n", "\n", "We therefore begin by instantiating the linker, passing in the data we wish to deduplicate." ] }, { "cell_type": "code", "execution_count": 14, "id": "8a1aa029", "metadata": {}, "outputs": [], "source": [ "# Initialise the linker, passing in the input dataset(s)\n", "from splink.duckdb.duckdb_linker import DuckDBLinker\n", "linker = DuckDBLinker(df)" ] }, { "cell_type": "markdown", "id": "3d982939", "metadata": {}, "source": [ "## Analyse missingness" ] }, { "cell_type": "markdown", "id": "bfab11e8", "metadata": {}, "source": [ "It's important to understand the level of missingness in your data, because columns with higher levels of missingness are less useful for data linking." ] }, { "cell_type": "code", "execution_count": 15, "id": "6dae307c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linker.missingness_chart()" ] }, { "cell_type": "markdown", "id": "2c35a4e0", "metadata": {}, "source": [ "The above summary chart shows that in this dataset, the `email`, `city`, `surname` and `forename` columns contain nulls, but the level of missingness is relatively low (less than 22%)." ] }, { "cell_type": "markdown", "id": "f11cc6b6", "metadata": {}, "source": [ "## Analyse the distribution of values in your data" ] }, { "cell_type": "markdown", "id": "973cb505", "metadata": {}, "source": [ "The distribution of values in your data is important for two main reasons:\n", "\n", "1. Columns with higher cardinality (number of distinct values) are usually more useful for data linking. For instance, date of birth is a much stronger linkage variable than gender.\n", "\n", "2. The skew of values is important. If you have a `city` column that has 1,000 distinct values, but 75% of them are `London`, this is much less useful for linkage than if the 1,000 values were equally distributed\n", "\n", "The `linker.profile_columns()` method creates summary charts to help you understand these aspects of your data. \n", "\n", "You may input column names (e.g. `first_name`), or arbitrary sql expressions like `concat(first_name, surname)`." ] }, { "cell_type": "code", "execution_count": 16, "id": "897d183c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linker.profile_columns([\"first_name\", \"city\", \"surname\", \"email\", \"substr(dob, 1,4)\"], top_n=10, bottom_n=5)" ] }, { "cell_type": "markdown", "id": "6c5d3a2d", "metadata": {}, "source": [ "This chart is very information-dense, but here are some key takehomes relevant to our linkage:\n", "\n", "- There is strong skew in the `city` field with around 20% of the values being `London`. We therefore will probably want to use `term_frequency_adjustments` in our linkage model, so that it can weight a match on London differently to a match on, say, `Norwich`.\n", "\n", "- Looking at the \"Bottom 5 values by value count\", we can see typos in the data in most fields. This tells us this information was possibly entered by hand, or using Optical Character Recognition, giving us an insight into the type of data entry errors we may see.\n", "\n", "- Email is a much more uniquely-identifying field than any others, with a maximum value count of 6. It's likely to be a strong linking variable." ] }, { "cell_type": "markdown", "id": "1f37cb1e", "metadata": {}, "source": [ "## Next steps\n", "\n", "At this point, we have begin to develop a strong understanding of our data. It's time to move on to estimating a linkage model\n" ] }, { "cell_type": "markdown", "id": "be935e71", "metadata": {}, "source": [ "## Further reading\n", "\n", "You can find the documentation for the exploratory analysis tools in Splink [here](https://moj-analytical-services.github.io/splink/linkerexp.html)" ] } ], "metadata": { "kernelspec": { "display_name": "splink_demos", "language": "python", "name": "splink_demos" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "vscode": { "interpreter": { "hash": "83cd1825940a26b927f4456d916a72166c792cbca23141876bf335b1893d7d4c" } } }, "nbformat": 4, "nbformat_minor": 5 }