{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "84cca40c",
   "metadata": {},
   "source": [
    "# Predicting which records match\n",
    "\n",
    "In the previous tutorial, we built and estimated a linkage model.\n",
    "\n",
    "In this tutorial, we will load the estimated model and use it to make predictions of which pairwise record comparisons match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "48f57034",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from splink.duckdb.duckdb_linker import DuckDBLinker\n",
    "\n",
    "# Show all columns when displaying wide DataFrames\n",
    "pd.options.display.max_columns = 1000\n",
    "# Load the input records\n",
    "df = pd.read_csv(\"./data/fake_1000.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d77b6eb8",
   "metadata": {},
   "source": [
    "## Load estimated model from previous tutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "619553a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "linker = DuckDBLinker(df)\n",
    "linker.load_settings_from_json(\"./demo_settings/saved_model_from_demo.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1d97518",
   "metadata": {},
   "source": [
    "## Predicting match weights using the trained model\n",
    "\n",
    "We use `linker.predict()` to run the model.  \n",
    "\n",
    "Under the hood, this will:\n",
    "\n",
    "- Generate all pairwise record comparisons that match at least one of the `blocking_rules_to_generate_predictions`\n",
    "\n",
    "- Use the rules specified in the `Comparisons` to evaluate the similarity of the input data\n",
    "\n",
    "- Use the estimated match weights, applying term frequency adjustments where requested, to produce the final `match_weight` and `match_probability` scores\n",
    "\n",
    "Optionally, a `threshold_match_probability` or `threshold_match_weight` can be provided, which will drop any row where the predicted score is below the threshold."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ead23f3e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>match_weight</th>\n",
       "      <th>match_probability</th>\n",
       "      <th>unique_id_l</th>\n",
       "      <th>unique_id_r</th>\n",
       "      <th>first_name_l</th>\n",
       "      <th>first_name_r</th>\n",
       "      <th>gamma_first_name</th>\n",
       "      <th>bf_first_name</th>\n",
       "      <th>surname_l</th>\n",
       "      <th>surname_r</th>\n",
       "      <th>gamma_surname</th>\n",
       "      <th>bf_surname</th>\n",
       "      <th>dob_l</th>\n",
       "      <th>dob_r</th>\n",
       "      <th>gamma_dob</th>\n",
       "      <th>bf_dob</th>\n",
       "      <th>city_l</th>\n",
       "      <th>city_r</th>\n",
       "      <th>gamma_city</th>\n",
       "      <th>bf_city</th>\n",
       "      <th>bf_tf_adj_city</th>\n",
       "      <th>tf_city_l</th>\n",
       "      <th>tf_city_r</th>\n",
       "      <th>email_l</th>\n",
       "      <th>email_r</th>\n",
       "      <th>gamma_email</th>\n",
       "      <th>bf_email</th>\n",
       "      <th>cluster_l</th>\n",
       "      <th>cluster_r</th>\n",
       "      <th>match_key</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>14.295585</td>\n",
       "      <td>0.999950</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>Grace</td>\n",
       "      <td>Grace</td>\n",
       "      <td>2</td>\n",
       "      <td>85.549242</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Kelly</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1997-04-26</td>\n",
       "      <td>1991-04-26</td>\n",
       "      <td>2</td>\n",
       "      <td>93.584788</td>\n",
       "      <td>Hull</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.001230</td>\n",
       "      <td>NaN</td>\n",
       "      <td>grace.kelly52@jones.com</td>\n",
       "      <td>grace.kelly52@jones.com</td>\n",
       "      <td>3</td>\n",
       "      <td>255.419933</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>12.793439</td>\n",
       "      <td>0.999859</td>\n",
       "      <td>26</td>\n",
       "      <td>29</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>2</td>\n",
       "      <td>85.549242</td>\n",
       "      <td>Gabriel</td>\n",
       "      <td>Gabriel</td>\n",
       "      <td>3</td>\n",
       "      <td>90.170377</td>\n",
       "      <td>1976-09-15</td>\n",
       "      <td>1976-08-15</td>\n",
       "      <td>2</td>\n",
       "      <td>93.584788</td>\n",
       "      <td>Loodon</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.001230</td>\n",
       "      <td>NaN</td>\n",
       "      <td>gabriel.t54@nnichls.info</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>11</td>\n",
       "      <td>11</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>12.793439</td>\n",
       "      <td>0.999859</td>\n",
       "      <td>28</td>\n",
       "      <td>29</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>2</td>\n",
       "      <td>85.549242</td>\n",
       "      <td>Gabriel</td>\n",
       "      <td>Gabriel</td>\n",
       "      <td>3</td>\n",
       "      <td>90.170377</td>\n",
       "      <td>1976-09-15</td>\n",
       "      <td>1976-08-15</td>\n",
       "      <td>2</td>\n",
       "      <td>93.584788</td>\n",
       "      <td>London</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.212792</td>\n",
       "      <td>NaN</td>\n",
       "      <td>gabriel.t54@nichols.info</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>11</td>\n",
       "      <td>11</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.626931</td>\n",
       "      <td>0.393039</td>\n",
       "      <td>37</td>\n",
       "      <td>860</td>\n",
       "      <td>Theodore</td>\n",
       "      <td>Theodore</td>\n",
       "      <td>2</td>\n",
       "      <td>85.549242</td>\n",
       "      <td>Morris</td>\n",
       "      <td>Marshall</td>\n",
       "      <td>0</td>\n",
       "      <td>0.261908</td>\n",
       "      <td>1978-08-19</td>\n",
       "      <td>1972-07-25</td>\n",
       "      <td>0</td>\n",
       "      <td>0.255612</td>\n",
       "      <td>Birmingham</td>\n",
       "      <td>Birmingham</td>\n",
       "      <td>1</td>\n",
       "      <td>10.257653</td>\n",
       "      <td>1.120874</td>\n",
       "      <td>0.049200</td>\n",
       "      <td>0.0492</td>\n",
       "      <td>t.m39@brooks-sawyer.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>13</td>\n",
       "      <td>214</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.626931</td>\n",
       "      <td>0.393039</td>\n",
       "      <td>39</td>\n",
       "      <td>860</td>\n",
       "      <td>Theodore</td>\n",
       "      <td>Theodore</td>\n",
       "      <td>2</td>\n",
       "      <td>85.549242</td>\n",
       "      <td>Morris</td>\n",
       "      <td>Marshall</td>\n",
       "      <td>0</td>\n",
       "      <td>0.261908</td>\n",
       "      <td>1978-08-19</td>\n",
       "      <td>1972-07-25</td>\n",
       "      <td>0</td>\n",
       "      <td>0.255612</td>\n",
       "      <td>Birmingham</td>\n",
       "      <td>Birmingham</td>\n",
       "      <td>1</td>\n",
       "      <td>10.257653</td>\n",
       "      <td>1.120874</td>\n",
       "      <td>0.049200</td>\n",
       "      <td>0.0492</td>\n",
       "      <td>t.m39@brooks-sawyer.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>13</td>\n",
       "      <td>214</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   match_weight  match_probability  unique_id_l  unique_id_r first_name_l  \\\n",
       "0     14.295585           0.999950            4            5        Grace   \n",
       "1     12.793439           0.999859           26           29       Thomas   \n",
       "2     12.793439           0.999859           28           29       Thomas   \n",
       "3     -0.626931           0.393039           37          860     Theodore   \n",
       "4     -0.626931           0.393039           39          860     Theodore   \n",
       "\n",
       "  first_name_r  gamma_first_name  bf_first_name surname_l surname_r  \\\n",
       "0        Grace                 2      85.549242       NaN     Kelly   \n",
       "1       Thomas                 2      85.549242   Gabriel   Gabriel   \n",
       "2       Thomas                 2      85.549242   Gabriel   Gabriel   \n",
       "3     Theodore                 2      85.549242    Morris  Marshall   \n",
       "4     Theodore                 2      85.549242    Morris  Marshall   \n",
       "\n",
       "   gamma_surname  bf_surname       dob_l       dob_r  gamma_dob     bf_dob  \\\n",
       "0             -1    1.000000  1997-04-26  1991-04-26          2  93.584788   \n",
       "1              3   90.170377  1976-09-15  1976-08-15          2  93.584788   \n",
       "2              3   90.170377  1976-09-15  1976-08-15          2  93.584788   \n",
       "3              0    0.261908  1978-08-19  1972-07-25          0   0.255612   \n",
       "4              0    0.261908  1978-08-19  1972-07-25          0   0.255612   \n",
       "\n",
       "       city_l      city_r  gamma_city    bf_city  bf_tf_adj_city  tf_city_l  \\\n",
       "0        Hull         NaN          -1   1.000000        1.000000   0.001230   \n",
       "1      Loodon         NaN          -1   1.000000        1.000000   0.001230   \n",
       "2      London         NaN          -1   1.000000        1.000000   0.212792   \n",
       "3  Birmingham  Birmingham           1  10.257653        1.120874   0.049200   \n",
       "4  Birmingham  Birmingham           1  10.257653        1.120874   0.049200   \n",
       "\n",
       "   tf_city_r                   email_l                  email_r  gamma_email  \\\n",
       "0        NaN   grace.kelly52@jones.com  grace.kelly52@jones.com            3   \n",
       "1        NaN  gabriel.t54@nnichls.info                      NaN           -1   \n",
       "2        NaN  gabriel.t54@nichols.info                      NaN           -1   \n",
       "3     0.0492   t.m39@brooks-sawyer.com                      NaN           -1   \n",
       "4     0.0492   t.m39@brooks-sawyer.com                      NaN           -1   \n",
       "\n",
       "     bf_email  cluster_l  cluster_r match_key  \n",
       "0  255.419933          1          1         0  \n",
       "1    1.000000         11         11         0  \n",
       "2    1.000000         11         11         0  \n",
       "3    1.000000         13        214         0  \n",
       "4    1.000000         13        214         0  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_predictions = linker.predict(threshold_match_probability=0.2)\n",
    "df_predictions.as_pandas_dataframe(limit=5)"
   ]
  },
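  {
   "cell_type": "markdown",
   "id": "5e7a1c3b",
   "metadata": {},
   "source": [
    "The `match_weight` and `match_probability` columns in the output above carry the same information on two different scales. The sketch below assumes that the match weight is the base-2 logarithm of the odds of a match, so that `match_probability = 2^match_weight / (1 + 2^match_weight)`. The helper names are hypothetical and are not part of the Splink API; they only illustrate the arithmetic.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def match_weight_to_probability(match_weight):\n",
    "    # Hypothetical helper: treat the match weight as log2(odds of a match)\n",
    "    odds = 2.0 ** match_weight\n",
    "    return odds / (1.0 + odds)\n",
    "\n",
    "\n",
    "def probability_to_match_weight(match_probability):\n",
    "    # Inverse of the helper above: log2 of the odds\n",
    "    return math.log2(match_probability / (1.0 - match_probability))\n",
    "\n",
    "\n",
    "# Match weights copied from the first and fourth rows of the output above\n",
    "print(match_weight_to_probability(14.295585))\n",
    "print(match_weight_to_probability(-0.626931))\n",
    "\n",
    "# The probability threshold of 0.2 passed to predict(), expressed as a weight\n",
    "print(round(probability_to_match_weight(0.2), 6))  # -2.0\n",
    "```\n",
    "\n",
    "Under this assumption, the two converted weights should agree with the `match_probability` values displayed for those rows (0.999950 and 0.393039) to about six decimal places, and a probability threshold of 0.2 corresponds to a match weight threshold of -2."
   ]
  },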
  {
   "cell_type": "markdown",
   "id": "f00370bb",
   "metadata": {},
   "source": [
    "## Clustering\n",
    "\n",
    "The result of `linker.predict()` is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C, D and E, it could be represented conceptually as:\n",
    "```\n",
    "A -> B with score 0.9\n",
    "B -> C with score 0.95\n",
    "C -> D with score 0.1\n",
    "D -> E with score 0.99\n",
    "```\n",
    "\n",
    "Often, an alternative representation of this result is more useful: each row is an input record, and records that link are assigned to the same cluster.\n",
    "\n",
    "With a score threshold of 0.5, the above data could be represented conceptually as:\n",
    "\n",
    "```\n",
    "ID, Cluster ID\n",
    "A,  1\n",
    "B,  1\n",
    "C,  1\n",
    "D,  2\n",
    "E,  2\n",
    "```\n",
    "\n",
    "The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink.  You can use it as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "257ae717",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Completed iteration 1, root rows count 14\n",
      "Completed iteration 2, root rows count 0\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cluster_id</th>\n",
       "      <th>unique_id</th>\n",
       "      <th>first_name</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>city</th>\n",
       "      <th>email</th>\n",
       "      <th>cluster</th>\n",
       "      <th>tf_city</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Alan</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>robert255@smith.net</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Allen</td>\n",
       "      <td>1971-05-24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>roberta25@smith.net</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Rob</td>\n",
       "      <td>Allen</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>London</td>\n",
       "      <td>roberta25@smith.net</td>\n",
       "      <td>0</td>\n",
       "      <td>0.212792</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Alen</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>Lonon</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0.007380</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>Grace</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1997-04-26</td>\n",
       "      <td>Hull</td>\n",
       "      <td>grace.kelly52@jones.com</td>\n",
       "      <td>1</td>\n",
       "      <td>0.001230</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>Grace</td>\n",
       "      <td>Kelly</td>\n",
       "      <td>1991-04-26</td>\n",
       "      <td>NaN</td>\n",
       "      <td>grace.kelly52@jones.com</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>Logan</td>\n",
       "      <td>pMurphy</td>\n",
       "      <td>1973-08-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2015-03-03</td>\n",
       "      <td>Portsmouth</td>\n",
       "      <td>evied56@harris-bailey.net</td>\n",
       "      <td>3</td>\n",
       "      <td>0.017220</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Dean</td>\n",
       "      <td>2015-03-03</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>Evie</td>\n",
       "      <td>Dean</td>\n",
       "      <td>2015-03-03</td>\n",
       "      <td>Pootsmruth</td>\n",
       "      <td>evihd56@earris-bailey.net</td>\n",
       "      <td>3</td>\n",
       "      <td>0.001230</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   cluster_id  unique_id first_name  surname         dob        city  \\\n",
       "0           0          0     Robert     Alan  1971-06-24         NaN   \n",
       "1           0          1     Robert    Allen  1971-05-24         NaN   \n",
       "2           0          2        Rob    Allen  1971-06-24      London   \n",
       "3           0          3     Robert     Alen  1971-06-24       Lonon   \n",
       "4           4          4      Grace      NaN  1997-04-26        Hull   \n",
       "5           4          5      Grace    Kelly  1991-04-26         NaN   \n",
       "6           6          6      Logan  pMurphy  1973-08-01         NaN   \n",
       "7           7          7        NaN      NaN  2015-03-03  Portsmouth   \n",
       "8           8          8        NaN     Dean  2015-03-03         NaN   \n",
       "9           8          9       Evie     Dean  2015-03-03  Pootsmruth   \n",
       "\n",
       "                       email  cluster   tf_city  \n",
       "0        robert255@smith.net        0       NaN  \n",
       "1        roberta25@smith.net        0       NaN  \n",
       "2        roberta25@smith.net        0  0.212792  \n",
       "3                        NaN        0  0.007380  \n",
       "4    grace.kelly52@jones.com        1  0.001230  \n",
       "5    grace.kelly52@jones.com        1       NaN  \n",
       "6                        NaN        2       NaN  \n",
       "7  evied56@harris-bailey.net        3  0.017220  \n",
       "8                        NaN        3       NaN  \n",
       "9  evihd56@earris-bailey.net        3  0.001230  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clusters = linker.cluster_pairwise_predictions_at_threshold(df_predictions, threshold_match_probability=0.5)\n",
    "clusters.as_pandas_dataframe(limit=10)"
   ]
  }
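  ,
  {
   "cell_type": "markdown",
   "id": "d04b96f2",
   "metadata": {},
   "source": [
    "To make the clustering step concrete, the sketch below applies connected components to the toy A to E example from the Clustering section, using a score threshold of 0.5. It uses a small union-find structure in plain Python, which is a swapped-in technique for illustration and not a description of how Splink computes clusters internally. The function name, the `>=` comparison against the threshold, and the choice to label each cluster by its smallest member are all assumptions made for this sketch.\n",
    "\n",
    "```python\n",
    "def connected_components(nodes, edges, threshold):\n",
    "    # Each record starts out as the root of its own cluster\n",
    "    parent = {node: node for node in nodes}\n",
    "\n",
    "    def find(node):\n",
    "        # Walk up to the cluster root, shortening the path along the way\n",
    "        while parent[node] != node:\n",
    "            parent[node] = parent[parent[node]]\n",
    "            node = parent[node]\n",
    "        return node\n",
    "\n",
    "    for left, right, score in edges:\n",
    "        # Assumption: links scoring at or above the threshold are kept\n",
    "        if score >= threshold:\n",
    "            root_l, root_r = find(left), find(right)\n",
    "            if root_l != root_r:\n",
    "                # Label the merged cluster by its smallest root\n",
    "                parent[max(root_l, root_r)] = min(root_l, root_r)\n",
    "\n",
    "    return {node: find(node) for node in nodes}\n",
    "\n",
    "\n",
    "edges = [\n",
    "    ('A', 'B', 0.9),\n",
    "    ('B', 'C', 0.95),\n",
    "    ('C', 'D', 0.1),\n",
    "    ('D', 'E', 0.99),\n",
    "]\n",
    "clusters = connected_components(['A', 'B', 'C', 'D', 'E'], edges, 0.5)\n",
    "print(clusters)\n",
    "# {'A': 'A', 'B': 'A', 'C': 'A', 'D': 'D', 'E': 'D'}\n",
    "```\n",
    "\n",
    "Records A, B and C share one label and D and E share another, which matches the conceptual cluster table shown earlier. The C -> D link (score 0.1) falls below the threshold, so it does not join the two groups."
   ]
  }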
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "vscode": {
   "interpreter": {
    "hash": "3b53fa520a31e303a9636a08ff10a3bbc14893ee50cb37445791fa59628fc75b"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}