{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "84cca40c",
   "metadata": {},
   "source": [
    "# Predicting which records match\n",
    "\n",
    "In the previous tutorial, we built and estimated a linkage model.\n",
    "\n",
    "In this tutorial, we will load the estimated model and use it to make predictions of which pairwise record comparisons match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "48f57034",
   "metadata": {},
   "outputs": [],
   "source": [
    "from splink.duckdb.duckdb_linker import DuckDBLinker\n",
    "import pandas as pd \n",
    "pd.options.display.max_columns = 1000\n",
    "df = pd.read_csv(\"./data/fake_1000.csv\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d77b6eb8",
   "metadata": {},
   "source": [
    "## Load estimated model from previous tutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "619553a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "linker = DuckDBLinker(df)\n",
    "linker.load_settings_from_json(\"./demo_settings/saved_model_from_demo.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1d97518",
   "metadata": {},
   "source": [
    "# Predicting match weights using the trained model\n",
    "\n",
    "We use `linker.predict()` to run the model.  \n",
    "\n",
    "Under the hood this will:\n",
    "\n",
    "- Generate all pairwise record comparisons that match at least one of the `blocking_rules_to_generate_predictions`\n",
    "\n",
    "- Use the rules specified in the `Comparisons` to evaluate the similarity of the input data\n",
    "\n",
    "- Use the estimated match weights, applying term frequency adjustments where requested to produce the final `match_weight` and `match_probability` scores\n",
    "\n",
    "Optionally, a `threshold_match_probability` or `threshold_match_weight` can be provided, which will drop any row where the predicted score is below the threshold."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ead23f3e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>match_weight</th>\n",
       "      <th>match_probability</th>\n",
       "      <th>unique_id_l</th>\n",
       "      <th>unique_id_r</th>\n",
       "      <th>first_name_l</th>\n",
       "      <th>first_name_r</th>\n",
       "      <th>gamma_first_name</th>\n",
       "      <th>bf_first_name</th>\n",
       "      <th>surname_l</th>\n",
       "      <th>surname_r</th>\n",
       "      <th>gamma_surname</th>\n",
       "      <th>bf_surname</th>\n",
       "      <th>dob_l</th>\n",
       "      <th>dob_r</th>\n",
       "      <th>gamma_dob</th>\n",
       "      <th>bf_dob</th>\n",
       "      <th>city_l</th>\n",
       "      <th>city_r</th>\n",
       "      <th>gamma_city</th>\n",
       "      <th>tf_city_l</th>\n",
       "      <th>tf_city_r</th>\n",
       "      <th>bf_city</th>\n",
       "      <th>bf_tf_adj_city</th>\n",
       "      <th>email_l</th>\n",
       "      <th>email_r</th>\n",
       "      <th>gamma_email</th>\n",
       "      <th>bf_email</th>\n",
       "      <th>match_key</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>12.738985</td>\n",
       "      <td>0.999854</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>Grace</td>\n",
       "      <td>Grace</td>\n",
       "      <td>2</td>\n",
       "      <td>85.548522</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Kelly</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1997-04-26</td>\n",
       "      <td>1991-04-26</td>\n",
       "      <td>2</td>\n",
       "      <td>93.585230</td>\n",
       "      <td>Hull</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>0.001230</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>grace.kelly52@jones.com</td>\n",
       "      <td>grace.kelly52@jones.com</td>\n",
       "      <td>3</td>\n",
       "      <td>255.419953</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>11.236825</td>\n",
       "      <td>0.999586</td>\n",
       "      <td>26</td>\n",
       "      <td>29</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>2</td>\n",
       "      <td>85.548522</td>\n",
       "      <td>Gabriel</td>\n",
       "      <td>Gabriel</td>\n",
       "      <td>3</td>\n",
       "      <td>90.169535</td>\n",
       "      <td>1976-09-15</td>\n",
       "      <td>1976-08-15</td>\n",
       "      <td>2</td>\n",
       "      <td>93.585230</td>\n",
       "      <td>Loodon</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>0.001230</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>gabriel.t54@nnichls.info</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>11.236825</td>\n",
       "      <td>0.999586</td>\n",
       "      <td>28</td>\n",
       "      <td>29</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>2</td>\n",
       "      <td>85.548522</td>\n",
       "      <td>Gabriel</td>\n",
       "      <td>Gabriel</td>\n",
       "      <td>3</td>\n",
       "      <td>90.169535</td>\n",
       "      <td>1976-09-15</td>\n",
       "      <td>1976-08-15</td>\n",
       "      <td>2</td>\n",
       "      <td>93.585230</td>\n",
       "      <td>London</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>0.212792</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>gabriel.t54@nichols.info</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.236439</td>\n",
       "      <td>0.996695</td>\n",
       "      <td>47</td>\n",
       "      <td>50</td>\n",
       "      <td>Erin</td>\n",
       "      <td>Erin</td>\n",
       "      <td>2</td>\n",
       "      <td>85.548522</td>\n",
       "      <td>Rogers</td>\n",
       "      <td>Roers</td>\n",
       "      <td>2</td>\n",
       "      <td>79.149409</td>\n",
       "      <td>2010-01-02</td>\n",
       "      <td>2010-03-03</td>\n",
       "      <td>1</td>\n",
       "      <td>13.323346</td>\n",
       "      <td>London</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>0.212792</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>e.rogers3@hopkins.org</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.742257</td>\n",
       "      <td>0.963983</td>\n",
       "      <td>49</td>\n",
       "      <td>50</td>\n",
       "      <td>Erin</td>\n",
       "      <td>Erin</td>\n",
       "      <td>2</td>\n",
       "      <td>85.548522</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Roers</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2010-03-01</td>\n",
       "      <td>2010-03-03</td>\n",
       "      <td>2</td>\n",
       "      <td>93.585230</td>\n",
       "      <td>London</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>0.212792</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>e.rogers3@honkips.org</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   match_weight  match_probability  unique_id_l  unique_id_r first_name_l  \\\n",
       "0     12.738985           0.999854            4            5        Grace   \n",
       "1     11.236825           0.999586           26           29       Thomas   \n",
       "2     11.236825           0.999586           28           29       Thomas   \n",
       "3      8.236439           0.996695           47           50         Erin   \n",
       "4      4.742257           0.963983           49           50         Erin   \n",
       "\n",
       "  first_name_r  gamma_first_name  bf_first_name surname_l surname_r  \\\n",
       "0        Grace                 2      85.548522       NaN     Kelly   \n",
       "1       Thomas                 2      85.548522   Gabriel   Gabriel   \n",
       "2       Thomas                 2      85.548522   Gabriel   Gabriel   \n",
       "3         Erin                 2      85.548522    Rogers     Roers   \n",
       "4         Erin                 2      85.548522       NaN     Roers   \n",
       "\n",
       "   gamma_surname  bf_surname       dob_l       dob_r  gamma_dob     bf_dob  \\\n",
       "0             -1    1.000000  1997-04-26  1991-04-26          2  93.585230   \n",
       "1              3   90.169535  1976-09-15  1976-08-15          2  93.585230   \n",
       "2              3   90.169535  1976-09-15  1976-08-15          2  93.585230   \n",
       "3              2   79.149409  2010-01-02  2010-03-03          1  13.323346   \n",
       "4             -1    1.000000  2010-03-01  2010-03-03          2  93.585230   \n",
       "\n",
       "   city_l city_r  gamma_city  tf_city_l  tf_city_r  bf_city  bf_tf_adj_city  \\\n",
       "0    Hull    NaN          -1   0.001230        NaN      1.0             1.0   \n",
       "1  Loodon    NaN          -1   0.001230        NaN      1.0             1.0   \n",
       "2  London    NaN          -1   0.212792        NaN      1.0             1.0   \n",
       "3  London    NaN          -1   0.212792        NaN      1.0             1.0   \n",
       "4  London    NaN          -1   0.212792        NaN      1.0             1.0   \n",
       "\n",
       "                    email_l                  email_r  gamma_email    bf_email  \\\n",
       "0   grace.kelly52@jones.com  grace.kelly52@jones.com            3  255.419953   \n",
       "1  gabriel.t54@nnichls.info                      NaN           -1    1.000000   \n",
       "2  gabriel.t54@nichols.info                      NaN           -1    1.000000   \n",
       "3     e.rogers3@hopkins.org                      NaN           -1    1.000000   \n",
       "4     e.rogers3@honkips.org                      NaN           -1    1.000000   \n",
       "\n",
       "  match_key  \n",
       "0         0  \n",
       "1         0  \n",
       "2         0  \n",
       "3         0  \n",
       "4         0  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_predictions = linker.predict(threshold_match_probability=0.2)\n",
    "df_predictions.as_pandas_dataframe(limit=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f00370bb",
   "metadata": {},
   "source": [
    "## Clustering\n",
    "\n",
    "The result of `linker.predict()` is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C and D, it could be represented conceptually as:\n",
    "```\n",
    "A -> B with score 0.9\n",
    "B -> C with score 0.95\n",
    "C -> D with score 0.1\n",
    "D -> E with score 0.99\n",
    "```\n",
    "\n",
    "Often, an alternative representation of this result is more useful, where each row is an input record, and where records link, they are assigned to the same cluster.\n",
    "\n",
    "With a score threshold of 0.5, the above data could be represented conceptually as:\n",
    "\n",
    "```\n",
    "ID, Cluster ID\n",
    "A,  1\n",
    "B,  1\n",
    "C,  1\n",
    "D,  2\n",
    "E,  2\n",
    "```\n",
    "\n",
    "The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink.  You can use it as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "257ae717",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Completed iteration 1, root rows count 10\n",
      "Completed iteration 2, root rows count 0\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cluster_id</th>\n",
       "      <th>unique_id</th>\n",
       "      <th>first_name</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>city</th>\n",
       "      <th>email</th>\n",
       "      <th>cluster</th>\n",
       "      <th>tf_city</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Alan</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>robert255@smith.net</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Allen</td>\n",
       "      <td>1971-05-24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>roberta25@smith.net</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Rob</td>\n",
       "      <td>Allen</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>London</td>\n",
       "      <td>roberta25@smith.net</td>\n",
       "      <td>0</td>\n",
       "      <td>0.212792</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Alen</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>Lonon</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0.007380</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>Grace</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1997-04-26</td>\n",
       "      <td>Hull</td>\n",
       "      <td>grace.kelly52@jones.com</td>\n",
       "      <td>1</td>\n",
       "      <td>0.001230</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>Grace</td>\n",
       "      <td>Kelly</td>\n",
       "      <td>1991-04-26</td>\n",
       "      <td>NaN</td>\n",
       "      <td>grace.kelly52@jones.com</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>Logan</td>\n",
       "      <td>pMurphy</td>\n",
       "      <td>1973-08-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2015-03-03</td>\n",
       "      <td>Portsmouth</td>\n",
       "      <td>evied56@harris-bailey.net</td>\n",
       "      <td>3</td>\n",
       "      <td>0.017220</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Dean</td>\n",
       "      <td>2015-03-03</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>Evie</td>\n",
       "      <td>Dean</td>\n",
       "      <td>2015-03-03</td>\n",
       "      <td>Pootsmruth</td>\n",
       "      <td>evihd56@earris-bailey.net</td>\n",
       "      <td>3</td>\n",
       "      <td>0.001230</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   cluster_id  unique_id first_name  surname         dob        city  \\\n",
       "0           0          0     Robert     Alan  1971-06-24         NaN   \n",
       "1           0          1     Robert    Allen  1971-05-24         NaN   \n",
       "2           0          2        Rob    Allen  1971-06-24      London   \n",
       "3           0          3     Robert     Alen  1971-06-24       Lonon   \n",
       "4           4          4      Grace      NaN  1997-04-26        Hull   \n",
       "5           4          5      Grace    Kelly  1991-04-26         NaN   \n",
       "6           6          6      Logan  pMurphy  1973-08-01         NaN   \n",
       "7           7          7        NaN      NaN  2015-03-03  Portsmouth   \n",
       "8           8          8        NaN     Dean  2015-03-03         NaN   \n",
       "9           8          9       Evie     Dean  2015-03-03  Pootsmruth   \n",
       "\n",
       "                       email  cluster   tf_city  \n",
       "0        robert255@smith.net        0       NaN  \n",
       "1        roberta25@smith.net        0       NaN  \n",
       "2        roberta25@smith.net        0  0.212792  \n",
       "3                        NaN        0  0.007380  \n",
       "4    grace.kelly52@jones.com        1  0.001230  \n",
       "5    grace.kelly52@jones.com        1       NaN  \n",
       "6                        NaN        2       NaN  \n",
       "7  evied56@harris-bailey.net        3  0.017220  \n",
       "8                        NaN        3       NaN  \n",
       "9  evihd56@earris-bailey.net        3  0.001230  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clusters = linker.cluster_pairwise_predictions_at_threshold(df_predictions, threshold_match_probability=0.5)\n",
    "clusters.as_pandas_dataframe(limit=10)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "splink_demos",
   "language": "python",
   "name": "splink_demos"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "vscode": {
   "interpreter": {
    "hash": "3b53fa520a31e303a9636a08ff10a3bbc14893ee50cb37445791fa59628fc75b"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}