{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Memory management in large jobs\n", "\n", "Users may experience out of memory errors when running large linkage and deduplication jobs with `Splink`'s default settings.\n", "\n", "This notebook demonstrates how `Splink` can be configured to reduce the amount of memory needed. \n", "\n", "However, before demoing these configuration settings it's important to understand why you might be running out of memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why might you run out of memory?\n", "\n", "---\n", "\n", "#### 1. Your blocking rules generate too many comparisons, or you are computing a large cartesian product\n", "\n", "\n", "⚠️ **Increasing the size of your Spark cluster or changing `splink` configuration options other than blocking rules is unlikely to solve this problem.**\n", "\n", "\n", "Imagine you are running a deduplication job on a large input dataset called `df`, and one of your blocking rules is:\n", "\n", "`\"l.first_name = r.first_name\"`\n", "\n", "Let's assume there are 10,000 people in your dataset with the first name 'John'.\n", "\n", "`splink` will perform a join that looks a bit like this:\n", "\n", "```sql\n", "select *\n", "from df as l\n", "inner join df as r\n", "on l.first_name = r.first_name\n", "```\n", "\n", "This will result in 10,000 * 10,000 = 100 million rows being generated for Johns _on a single node_ of your spark cluster. This happens because when Spark performs a join, it hash partitions on the join key. This means all records for each first_name go to the same node.\n", "\n", "\n", "\n", "#### Solution: Tighten your blocking rules so they generate fewer comparisons. \n", "\n", "---\n", "\n", "\n", "#### 2. You're getting memory errors at the blocking stage despite having tight blocking rules\n", "\n", "⚠️ **Increasing the size of your Spark cluster will sometimes mitigate these issues but there is a more efficient solution**\n", "\n", "This is often because Spark's default parallelism of 200 shuffle partitions is not really appropriate to the task of generating comparisons from blocking rules.\n", "\n", "When generating record comparisons from your blocking rules, your input dataset will be split into 200 parts (jobs). \n", "\n", "This is generally insufficient for large jobs because, whilst 200 parts may be sufficient for your input dataset, the dataset of comparisons generated by your blocking rules is typically 10 times as large more.\n", "\n", "Furthermore, blocking rules can often result in skew - e.g. the number of comparisons generated by 'John's is larger than those generated by 'Robin's - meaning that if a few 'John's happen to end up on the same node, it can run out of memory.\n", "\n", "\n", "#### Solution: Increase spark's parallelism\n", "\n", "Try increasing the spark's parallelism as follows (start with 1000, but try different values)\n", "```\n", "spark.conf.set(\"spark.sql.shuffle.partitions\", \"1000\")\n", "spark.conf.set(\"spark.default.parallelism\", \"1000\")\n", "```\n", "\n", "---\n", "\n", "#### 3. Failing to 'break the lineage'.\n", "\n", "⚠️ **Increasing the size of your Spark cluster will sometimes mitigate these issues but there is a more efficient solution**\n", "\n", "By default, `splink` attempts to compute its results as a single spark job. 
"---\n", "\n", "#### 3. Failing to 'break the lineage'\n", "\n", "⚠️ **Increasing the size of your Spark cluster will sometimes mitigate these issues but there is a more efficient solution**\n", "\n", "By default, `splink` attempts to compute its results as a single Spark job. However, Spark has a [known problem](https://www.pdl.cmu.edu/PDL-FTP/Storage/CMU-PDL-18-101.pdf) with long lineage, and iterative algorithms like the one used by `splink` are often cited as a culprit.\n", "\n", "Breaking the lineage means splitting the job up into smaller parts which can be computed independently. This can result in somewhat longer execution times (perhaps 30% longer), but often allows the same job to complete successfully on a smaller cluster.\n", "\n", "`splink` has some configuration options which allow you to break the lineage. This must be configured by the user because the best approach will depend on the specifics of your Spark cluster and/or cloud provider.\n", "\n", "In the remainder of this notebook we give a simple example of breaking the lineage, using the same job as in the [data deduplication quick start](quickstart_demo_deduplication.ipynb).\n", "\n", "---\n", "\n", "#### 4. You need a bigger cluster\n", "\n", "If you've tried all of the above and are still getting out of memory errors, you may need a bigger cluster or nodes with more memory.\n", "\n", "As a guide, we've successfully processed 15 million input records and 170 million comparisons using AWS Glue with a cluster of size 5." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Customising a job to break the lineage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Imports and setup\n", "\n", "The following is just boilerplate code that sets up the Spark session and some other non-essential configuration options." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "pd.options.display.max_columns = 500\n", "pd.options.display.max_rows = 100" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging\n", "logging.basicConfig()\n", "logging.getLogger(\"splink\").setLevel(logging.INFO)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from utility_functions.demo_utils import get_spark\n", "spark = get_spark() # See utility_functions/demo_utils.py for how to set up Spark" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Write two functions to break the lineage\n", "\n", "In the underlying `splink` algorithm, there are two sensible places to break the lineage:\n", "- After comparisons have been generated and the comparison vectors have been computed (these need to be computed once but are used many times during the EM algorithm).\n", "- After the EM algorithm has completed iterating, but before term frequency adjustments are computed.\n", "\n", "`splink` allows the user to provide two custom functions that are executed in these two places to break the lineage.\n", "\n", "Both functions take a dataframe as an argument, and it is up to the user to break its lineage and return the resulting dataframe. 
Usually this is done by:\n", "- writing the dataframe to disk, and loading it back from disk (as in this example)\n", "- writing the dataframe to cloud storage, and loading it back from cloud storage.\n", "- `persist`ing, `checkpoint`ing or `cache`ing the dataframe (although in our experience, this has been unreliable)\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def blocked_comparisons_to_disk(df, spark):\n", " df.write.mode(\"overwrite\").parquet(\"gammas_temp/\")\n", " df_new = spark.read.parquet(\"gammas_temp/\")\n", " return df_new\n", "\n", "def scored_comparisons_to_disk(df, spark):\n", " df.write.mode(\"overwrite\").parquet(\"df_e_temp/\")\n", " df_new = spark.read.parquet(\"df_e_temp/\")\n", " return df_new\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Set up a `splink` job, providing the custom functions as arguments\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:splink.iterate:Iteration 0 complete\n", "INFO:splink.model:The maximum change in parameters was 0.49493410587310793 for key surname, level 2\n", "INFO:splink.iterate:Iteration 1 complete\n", "INFO:splink.model:The maximum change in parameters was 0.12059885263442993 for key surname, level 2\n", "INFO:splink.iterate:Iteration 2 complete\n", "INFO:splink.model:The maximum change in parameters was 0.039048194885253906 for key surname, level 0\n", "INFO:splink.iterate:Iteration 3 complete\n", "INFO:splink.model:The maximum change in parameters was 0.017310582101345062 for key dob, level 1\n", "INFO:splink.iterate:Iteration 4 complete\n", "INFO:splink.model:The maximum change in parameters was 0.014871925115585327 for key email, level 0\n", "INFO:splink.iterate:Iteration 5 complete\n", "INFO:splink.model:The maximum change in parameters was 0.011928170919418335 for key email, level 0\n", "INFO:splink.iterate:Iteration 6 complete\n", "INFO:splink.model:The maximum change in parameters was 0.009333670139312744 for key email, level 1\n", "INFO:splink.iterate:EM algorithm has converged\n" ] } ], "source": [ "from splink import Splink\n", "\n", "df = spark.read.parquet(\"data/fake_1000.parquet\")\n", "\n", "settings = {\n", " \"link_type\": \"dedupe_only\",\n", " \"blocking_rules\": [\n", " \"l.first_name = r.first_name\",\n", " \"l.surname = r.surname\",\n", " \"l.dob = r.dob\"\n", " ],\n", " \"comparison_columns\": [\n", " {\n", " \"col_name\": \"first_name\",\n", " \"num_levels\": 3,\n", " \"term_frequency_adjustments\": True\n", " },\n", " {\n", " \"col_name\": \"surname\",\n", " \"num_levels\": 3,\n", " \"term_frequency_adjustments\": True\n", " },\n", " {\n", " \"col_name\": \"dob\"\n", " },\n", " {\n", " \"col_name\": \"city\"\n", " },\n", " {\n", " \"col_name\": \"email\"\n", " }\n", " ],\n", " \"additional_columns_to_retain\": [\"group\"],\n", " \"em_convergence\": 0.01\n", "}\n", "\n", "from splink import Splink\n", "\n", "linker = Splink(settings, \n", " df,\n", " spark, \n", " break_lineage_blocked_comparisons = blocked_comparisons_to_disk,\n", " break_lineage_scored_comparisons = scored_comparisons_to_disk\n", " \n", " )\n", "df_e = linker.get_scored_comparisons()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect results \n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
match_probabilityunique_id_lunique_id_rgroup_lgroup_rfirst_name_lfirst_name_rsurname_lsurname_rdob_ldob_rcity_lcity_remail_lemail_r
20.9555790100JuliaJuliaNoneTaylor2015-10-292015-07-31LondonLondonhannah88@powers.comhannah88@powers.com
10.9555790200JuliaJuliaNoneTaylor2015-10-292016-01-27LondonLondonhannah88@powers.comhannah88@powers.com
00.7713800300JuliaJuliaNoneTaylor2015-10-292015-10-29LondonNonehannah88@powers.comhannah88opowersc@m
40.9399621200JuliaJuliaTaylorTaylor2015-07-312016-01-27LondonLondonhannah88@powers.comhannah88@powers.com
30.0225501300JuliaJuliaTaylorTaylor2015-07-312015-10-29LondonNonehannah88@powers.comhannah88opowersc@m
\n", "
" ], "text/plain": [ " match_probability unique_id_l unique_id_r group_l group_r first_name_l \\\n", "2 0.955579 0 1 0 0 Julia \n", "1 0.955579 0 2 0 0 Julia \n", "0 0.771380 0 3 0 0 Julia \n", "4 0.939962 1 2 0 0 Julia \n", "3 0.022550 1 3 0 0 Julia \n", "\n", " first_name_r surname_l surname_r dob_l dob_r city_l city_r \\\n", "2 Julia None Taylor 2015-10-29 2015-07-31 London London \n", "1 Julia None Taylor 2015-10-29 2016-01-27 London London \n", "0 Julia None Taylor 2015-10-29 2015-10-29 London None \n", "4 Julia Taylor Taylor 2015-07-31 2016-01-27 London London \n", "3 Julia Taylor Taylor 2015-07-31 2015-10-29 London None \n", "\n", " email_l email_r \n", "2 hannah88@powers.com hannah88@powers.com \n", "1 hannah88@powers.com hannah88@powers.com \n", "0 hannah88@powers.com hannah88opowersc@m \n", "4 hannah88@powers.com hannah88@powers.com \n", "3 hannah88@powers.com hannah88opowersc@m " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inspect main dataframe that contains the match scores\n", "cols_to_inspect = [\"match_probability\",\"unique_id_l\",\"unique_id_r\",\"group_l\", \"group_r\", \"first_name_l\",\"first_name_r\",\"surname_l\",\"surname_r\",\"dob_l\",\"dob_r\",\"city_l\",\"city_r\",\"email_l\",\"email_r\",]\n", "\n", "df_e.toPandas()[cols_to_inspect].sort_values([\"unique_id_l\", \"unique_id_r\"]).head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }