{ "cells": [ { "cell_type": "markdown", "id": "26e50a28", "metadata": {}, "source": [ "# Exploratory analysis\n", "\n", "The purpose of exploratory analysis is to understand your data and any idiosyncrasies which may be relevant to the task of data linking.\n", "\n", "Splink includes functionality to visualise and summarise your data, to identify characteristics most salient to data linking.\n", "\n", "In this notebook we perform some basic exploratory analysis, and interpret the results." ] }, { "cell_type": "markdown", "id": "96a3d08d", "metadata": {}, "source": [ "### Read in the data\n", "\n", "For the purpose of this tutorial we will use a 1,000 row synthetic dataset that contains duplicates.\n", "\n", "The first five rows of this dataset are printed below.\n", "\n", "Note that the cluster column represents the 'ground truth' - a column which tells us which rows refer to the same person. In most real linkage scenarios, we wouldn't have this column (this is what Splink is trying to estimate)." ] }, { "cell_type": "code", "execution_count": 13, "id": "ffceed65", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | unique_id | first_name | surname | dob | city | email | cluster |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Robert | Alan | 1971-06-24 | NaN | robert255@smith.net | 0 |
| 1 | 1 | Robert | Allen | 1971-05-24 | NaN | roberta25@smith.net | 0 |
| 2 | 2 | Rob | Allen | 1971-06-24 | London | roberta25@smith.net | 0 |
| 3 | 3 | Robert | Alen | 1971-06-24 | Lonon | NaN | 0 |
| 4 | 4 | Grace | NaN | 1997-04-26 | Hull | grace.kelly52@jones.com | 1 |