{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# `fuzzymatcher` examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic usage - `link_table`\n", "\n", "In the most basic usage, the user provides `fuzzymatcher` with two pandas dataframes, indicating which columns to join on.\n", "\n", "The central output of `fuzzymatcher` is the `link_table`.\n", "\n", "For each record in the left table, the link table includes one or more possible matching records from the right table.\n", "\n", "The user can then inspect the link table and decide which matches to retain, e.g. by choosing a score threshold ( `match_score > chosen_threshold` ) or just choosing the best match ( `match_rank == 1` )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "logging.basicConfig(level=logging.DEBUG)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import fuzzymatcher\n", "import pandas as pd\n", "\n", "df_left = pd.read_csv(\"tests/data/left_1.csv\")\n", "df_left" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_right = pd.read_csv(\"tests/data/right_1.csv\")\n", "df_right" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Columns to match on from df_left\n", "left_on = [\"fname\", \"mname\", \"lname\", \"dob\"]\n", "\n", "# Columns to match on from df_right\n", "right_on = [\"name\", \"middlename\", \"surname\", \"date\"]\n", "\n", "# Note that if left_id_col or right_id_col are admitted a unique id will be autogenerated\n", "fuzzymatcher.link_table(df_left, df_right, left_on, right_on, left_id_col = \"id\", right_id_col = \"id\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic usage - `fuzzy_left_join`\n", "\n", "A second option is to use `fuzzy_left_join`, which automatically links the two dataframes based on the highest-scoring match." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import fuzzymatcher\n", "import pandas as pd\n", "\n", "df_left = pd.read_csv(\"tests/data/left_1.csv\")\n", "df_right = pd.read_csv(\"tests/data/right_1.csv\")\n", "left_on = [\"fname\", \"lname\", \"dob\"]\n", "right_on = [\"name\", \"surname\", \"date\"]\n", "\n", "fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic example with real data\n", "### Matching the names of Local Authorities provided by Office for National Statistics with the names provided by Ordnance Survey\n", "\n", "We would usually join this data on the Local Authority District (LAD) Codes (e.g. E06000001 = Hartlepool), but sometimes these are unavailable. In this example, we fuzzy match on the name, but provide the LAD code to demonstate it has worked." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ons = pd.read_csv(\"tests/data/las_ons.csv\")\n", "os = pd.read_csv(\"tests/data/las_os.csv\")\n", "\n", "df_joined = fuzzymatcher.fuzzy_left_join(ons, os, left_on = \"lad16nm\", right_on = \"name\")\n", "rename = {\"lad16cd\": \"ons_code\", \"code\": \"os_code\", \"lad16nm\": \"ons_name\", \"name\": \"os_name\"}\n", "df_joined = df_joined.rename(columns=rename)\n", "col_order = [\"best_match_score\", \"ons_name\", \"os_name\", \"ons_code\", \"os_code\"]\n", "df_joined[col_order].sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get a sense of match quality by measuring how often the fuzzy matcher got it right:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_records = len(df_joined)\n", "correct_binary = (df_joined[\"ons_code\"] == df_joined[\"os_code\"])\n", "perc_correct = correct_binary.sum()/num_records\n", "\n", "\"The percentage of codes which were correctly matched was {:,.1f}%\".format(perc_correct*100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Advanced usage - configuring the matcher\n", "\n", "`fuzzymatcher` uses a number of components, each one of which can be re-written or adapted by the user:\n", "\n", "* **`data_preprocessor`**: Responsible for normalising strings, removing punctuation etc.\n", "* **`datagetter`**: Responsible for finding a list of possible matches for each df_left record in df_right\n", "* **`scorer`**: Responsible for computing a match score, given a record from df_left and df_right respectively\n", "\n", "The main `link_table` and `fuzzy_left_join` convenience functions use these components under the hood. See [here](https://github.com/RobinL/fuzzymatcher/blob/master/fuzzymatcher/__init__.py) for how this work.\n", "\n", "This section provides a few examples of how an advanced user can compose these components to create a custom matcher" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1: Replacing the default sqlite datagetter with the cartesian datagetter" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fuzzymatcher.data_getter_cartesian import DataGetterCartesian\n", "from fuzzymatcher.matcher import Matcher\n", "\n", "dg = DataGetterCartesian()\n", "\n", "m = Matcher(data_getter = dg)\n", "\n", "df_left = pd.read_csv(\"tests/data/left_3.csv\")\n", "df_right = pd.read_csv(\"tests/data/right_3.csv\")\n", "\n", "on = [\"first_name\", \"surname\", \"dob\", \"city\"]\n", "\n", "m.add_data(df_left, df_right, on, on)\n", "\n", "m.match_all()\n", "lt = m.get_formatted_link_table()\n", "print(\"Length of Cartesian join table: {:,.0f}\".format(len(lt))) # Note, because df_left and df_right are 100 records each, this table is 10,000 records long\n", "lt.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Cartesian matcher considers more potential matches, but its performance is considerably worse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Performance\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_left = pd.read_csv(\"tests/data/left_4.csv\")\n", "# df_left = df_left[:1000]\n", "df_right = pd.read_csv(\"tests/data/right_4.csv\")\n", "# df_right = df_right[:1000]\n", "on = [\"first_name\", \"surname\", \"dob\", \"city\"]\n", "\n", "lt = fuzzymatcher.link_table(df_left, df_right, on, on)\n", "lt.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def link_table_percentage_correct(link_table):\n", " \"\"\"\n", " In this test dataset, we know what the link should be\n", " Therefore we can compute a measure of performance\n", " \"\"\"\n", " lt = link_table.copy()\n", " lt = lt[lt[\"match_rank\"] == 1]\n", " lt[\"__id_left\"] = lt[\"__id_left\"].str.replace(\"_left\", \"\")\n", " lt[\"__id_right\"] = lt[\"__id_right\"].str.replace(\"_right\", \"\")\n", " lt[\"link_correct\"] = (lt[\"__id_left\"] == lt[\"__id_right\"])\n", "\n", " return lt[\"link_correct\"].sum()/len(lt)\n", "\n", "\"Percent matches correct: {:,.1f}%\".format(link_table_percentage_correct(lt)*100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Note that in this particular case we can improve the match rate by including initials and allowing inversion of first name and surname \n", "\n", "(Within a field, the matcher pays no attention to token order)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_left[\"full_name\"] = df_left[\"first_name\"] + \" \" + df_left[\"surname\"]\n", "df_right[\"full_name\"] = df_right[\"first_name\"] + \" \" + df_right[\"surname\"]\n", "df_left[\"initials\"] = df_left[\"first_name\"].str[0] + df_left[\"surname\"].str[0]\n", "df_right[\"initials\"] = df_right[\"first_name\"].str[0] + df_right[\"surname\"].str[0]\n", "\n", "on = [\"full_name\", \"initials\", \"dob\", \"city\"]\n", "\n", "lt = fuzzymatcher.link_table(df_left, df_right, on, on)\n", "lt.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"Percent matches correct: {:,.1f}%\".format(link_table_percentage_correct(lt)*100)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }