{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Contents\n",
    "========\n",
    "- [Introduction](#Introduction)\n",
    "- [Block Using the Sorted Neighborhood Blocker](#Block-Using-the-Sorted-Neighborhood-Blocker)\n",
    "  - [Block Tables to Produce a Candidate Set of Tuple Pairs](#Block-Tables-to-Produce-a-Candidate-Set-of-Tuple-Pairs)\n",
    "  - [Handling Missing Values](#Handling-Missing-Values)\n",
    "  - [Window Size](#Window-Size)\n",
    "  - [Stable Sort Order](#Stable-Sort-Order)\n",
    "- [Sorted Neighborhood Blocker Limitations](#Sorted-Neighborhood-Blocker-limitations)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color='red'>WARNING: The sorted neighborhood blocker is still experimental and has not been fully tested yet. Use this blocker at your own risk.</font>\n",
    "\n",
    "Blocking is typically done to reduce the number of tuple pairs considered for matching. There are several blocking methods proposed. The *py_entitymatching* package supports a subset of such blocking methods (#ref to what is supported). One such supported blocker is the sorted neighborhood blocker. This IPython notebook illustrates how to perform blocking using the sorted neighborhood blocker.\n",
    "\n",
    "Note, often the sorted neighborhood blocking technique is used on a single table.  In this case we have implemented sorted neighborhood blocking between two tables.  We first enrich the tables with whether the table is the left table, or right table.  Then we merge the tables.  At this point we perform sorted neighborhood blocking, which is to pass a sliding window of `window_size` (default 2) across the merged dataset. Within the sliding window all tuple pairs that have one tuple from the left table and one tuple from the right table are returned."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need to import *py_entitymatching* package and other libraries as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "nbpresent": {
     "id": "9a89351b-e44f-47ad-afff-b148744173af"
    }
   },
   "outputs": [],
   "source": [
    "# Import py_entitymatching package\n",
    "import py_entitymatching as em\n",
    "import os\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "0f59e4ac-032a-4d59-9172-8ee653831acb"
    }
   },
   "source": [
    "Then, read the input tablse from the datasets directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "nbpresent": {
     "id": "2401accd-3160-4b07-aed4-de2f9a4dea35"
    }
   },
   "outputs": [],
   "source": [
    "# Get the datasets directory\n",
    "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n",
    "\n",
    "# Get the paths of the input tables\n",
    "path_A = datasets_dir + os.sep + 'person_table_A.csv'\n",
    "path_B = datasets_dir + os.sep + 'person_table_B.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "nbpresent": {
     "id": "e51b3877-75c8-431d-bdbd-77362bbf2191"
    }
   },
   "outputs": [],
   "source": [
    "# Read the CSV files and set 'ID' as the key attribute\n",
    "A = em.read_csv_metadata(path_A, key='ID')\n",
    "B = em.read_csv_metadata(path_B, key='ID')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "nbpresent": {
     "id": "ac2eb60b-bf26-4a6a-a453-120cc7f660c4"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>name</th>\n",
       "      <th>birth_year</th>\n",
       "      <th>hourly_wage</th>\n",
       "      <th>address</th>\n",
       "      <th>zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>1989</td>\n",
       "      <td>30.0</td>\n",
       "      <td>607 From St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a2</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a3</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a4</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987</td>\n",
       "      <td>32.5</td>\n",
       "      <td>423 Powell St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1702 Post Street, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID              name  birth_year  hourly_wage  \\\n",
       "0  a1       Kevin Smith        1989         30.0   \n",
       "1  a2  Michael Franklin        1988         27.5   \n",
       "2  a3    William Bridge        1986         32.0   \n",
       "3  a4      Binto George        1987         32.5   \n",
       "4  a5   Alphonse Kemper        1984         35.0   \n",
       "\n",
       "                           address  zipcode  \n",
       "0       607 From St, San Francisco    94107  \n",
       "1  1652 Stockton St, San Francisco    94122  \n",
       "2   3131 Webster St, San Francisco    94107  \n",
       "3     423 Powell St, San Francisco    94122  \n",
       "4  1702 Post Street, San Francisco    94122  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "nbpresent": {
     "id": "afadc046-692d-42ad-9c72-523493597682"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>name</th>\n",
       "      <th>birth_year</th>\n",
       "      <th>hourly_wage</th>\n",
       "      <th>address</th>\n",
       "      <th>zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>b1</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>29.5</td>\n",
       "      <td>108 Clement St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b2</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>b3</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>b4</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>26.0</td>\n",
       "      <td>108 South Park, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>b5</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>35.0</td>\n",
       "      <td>170 Post St, Apt 4,  San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID           name  birth_year  hourly_wage  \\\n",
       "0  b1    Mark Levene        1987         29.5   \n",
       "1  b2    Bill Bridge        1986         32.0   \n",
       "2  b3  Mike Franklin        1988         27.5   \n",
       "3  b4    Joseph Kuan        1982         26.0   \n",
       "4  b5  Alfons Kemper        1984         35.0   \n",
       "\n",
       "                              address  zipcode  \n",
       "0       108 Clement St, San Francisco    94107  \n",
       "1      3131 Webster St, San Francisco    94107  \n",
       "2     1652 Stockton St, San Francisco    94122  \n",
       "3       108 South Park, San Francisco    94122  \n",
       "4  170 Post St, Apt 4,  San Francisco    94122  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "B.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "ca7f0c34-9c21-4b6e-8010-eda1030df041"
    }
   },
   "source": [
    "# Block Using the Sorted Neighborhood Blocker\n",
    "\n",
    "Once the tables are read, we can do blocking using sorted neighborhood blocker."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "45585660-2ba9-4211-adee-ace14fe8745f"
    }
   },
   "source": [
    "With the sorted neighborhood blocker, you can only block between two tables to produce a candidate set of tuple pairs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "726bd6c9-23a5-4543-a201-f84864433f20"
    }
   },
   "source": [
    "## Block Tables to Produce a Candidate Set of Tuple Pairs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "nbpresent": {
     "id": "d2001a06-fe74-4ebd-896c-992803828753"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.\n"
     ]
    }
   ],
   "source": [
    "# Instantiate attribute equivalence blocker object\n",
    "sn = em.SortedNeighborhoodBlocker()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "06bcba06-ff85-43b7-bb04-ce1566a29dd9"
    }
   },
   "source": [
    "For the given two tables, we will assume that two persons with different `zipcode` values do not refer to the same real world person. So, we apply attribute equivalence blocking on `zipcode`. That is, we block all the tuple pairs that have different zipcodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "nbpresent": {
     "id": "2cdc68f4-5874-43d1-a378-b0bd31552ef8"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.\n"
     ]
    }
   ],
   "source": [
    "# Use block_tables to apply blocking over two input tables.\n",
    "C1 = sn.block_tables(A, B, \n",
    "                    l_block_attr='birth_year', r_block_attr='birth_year', \n",
    "                    l_output_attrs=['name', 'birth_year', 'zipcode'],\n",
    "                    r_output_attrs=['name', 'birth_year', 'zipcode'],\n",
    "                    l_output_prefix='l_', r_output_prefix='r_', window_size=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "nbpresent": {
     "id": "7b4967f5-2f99-4394-bfe9-29ff15334d39"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_ID</th>\n",
       "      <th>r_ID</th>\n",
       "      <th>l_name</th>\n",
       "      <th>l_birth_year</th>\n",
       "      <th>l_zipcode</th>\n",
       "      <th>r_name</th>\n",
       "      <th>r_birth_year</th>\n",
       "      <th>r_zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>a5</td>\n",
       "      <td>b4</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>a5</td>\n",
       "      <td>b5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>a3</td>\n",
       "      <td>b5</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>a3</td>\n",
       "      <td>b2</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>a4</td>\n",
       "      <td>b2</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987</td>\n",
       "      <td>94122</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   _id l_ID r_ID           l_name  l_birth_year  l_zipcode         r_name  \\\n",
       "0    0   a5   b4  Alphonse Kemper          1984      94122    Joseph Kuan   \n",
       "1    1   a5   b5  Alphonse Kemper          1984      94122  Alfons Kemper   \n",
       "2    2   a3   b5   William Bridge          1986      94107  Alfons Kemper   \n",
       "3    3   a3   b2   William Bridge          1986      94107    Bill Bridge   \n",
       "4    4   a4   b2     Binto George          1987      94122    Bill Bridge   \n",
       "\n",
       "   r_birth_year  r_zipcode  \n",
       "0          1982      94122  \n",
       "1          1984      94122  \n",
       "2          1984      94122  \n",
       "3          1986      94107  \n",
       "4          1986      94107  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the candidate set of tuple pairs\n",
    "C1.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "386dbb7d-085e-4946-b376-93e686eb62f5"
    }
   },
   "source": [
    "Note that the tuple pairs in the candidate set have the same zipcode. \n",
    "\n",
    "The attributes included in the candidate set are based on l_output_attrs and r_output_attrs mentioned in block_tables command (the key columns are included by default). Specifically, the list of attributes mentioned in l_output_attrs are picked from table A and the list of attributes mentioned in r_output_attrs are picked from table B. The attributes in the candidate set are prefixed based on l_output_prefix and r_ouptut_prefix parameter values mentioned in block_tables command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "nbpresent": {
     "id": "fa6af6e5-471b-4296-98f4-1f4d4ee2869a"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id: 139837734495736\n",
      "key: _id\n",
      "fk_ltable: l_ID\n",
      "fk_rtable: r_ID\n",
      "ltable(obj.id): 139837734692656\n",
      "rtable(obj.id): 139837734692520\n"
     ]
    }
   ],
   "source": [
    "# Show the metadata of C1\n",
    "em.show_properties(C1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "nbpresent": {
     "id": "6ec70bd1-adea-40af-9f30-304f6236c5ca"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(139837734759952, 139837734846816)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "id(A), id(B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "51679b6c-2667-4ed9-88aa-caff4c81edbf"
    }
   },
   "source": [
    "Note that the metadata of C1 includes key, foreign key to the left and right tables (i.e A and B) and pointers to left and right tables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Handling Missing Values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the input tuples have missing values in the blocking attribute, then they are ignored by default. This is because, including all possible tuple pairs with missing values can significantly increase the size of the candidate set. But if you want to include them, then you can set `allow_missing` paramater to be True."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/u/p/m/pmartinkus/Applications/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: \n",
      ".ix is deprecated. Please use\n",
      ".loc for label based indexing or\n",
      ".iloc for positional indexing\n",
      "\n",
      "See the documentation here:\n",
      "http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated\n",
      "  This is separate from the ipykernel package so we can avoid doing imports until\n"
     ]
    }
   ],
   "source": [
    "# Introduce some missing values\n",
    "A1 = em.read_csv_metadata(path_A, key='ID')\n",
    "A1.ix[0, 'zipcode'] = pd.np.NaN\n",
    "A1.ix[0, 'birth_year'] = pd.np.NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>name</th>\n",
       "      <th>birth_year</th>\n",
       "      <th>hourly_wage</th>\n",
       "      <th>address</th>\n",
       "      <th>zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>30.0</td>\n",
       "      <td>607 From St, San Francisco</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a2</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a3</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a4</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>32.5</td>\n",
       "      <td>423 Powell St, San Francisco</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1702 Post Street, San Francisco</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID              name  birth_year  hourly_wage  \\\n",
       "0  a1       Kevin Smith         NaN         30.0   \n",
       "1  a2  Michael Franklin      1988.0         27.5   \n",
       "2  a3    William Bridge      1986.0         32.0   \n",
       "3  a4      Binto George      1987.0         32.5   \n",
       "4  a5   Alphonse Kemper      1984.0         35.0   \n",
       "\n",
       "                           address  zipcode  \n",
       "0       607 From St, San Francisco      NaN  \n",
       "1  1652 Stockton St, San Francisco  94122.0  \n",
       "2   3131 Webster St, San Francisco  94107.0  \n",
       "3     423 Powell St, San Francisco  94122.0  \n",
       "4  1702 Post Street, San Francisco  94122.0  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.\n"
     ]
    }
   ],
   "source": [
    "# Use block_tables to apply blocking over two input tables.\n",
    "C2 = sn.block_tables(A1, B, \n",
    "                    l_block_attr='zipcode', r_block_attr='zipcode', \n",
    "                    l_output_attrs=['name', 'birth_year', 'zipcode'],\n",
    "                    r_output_attrs=['name', 'birth_year', 'zipcode'],\n",
    "                    l_output_prefix='l_', r_output_prefix='r_', \n",
    "                    allow_missing=True) # setting allow_missing parameter to True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(11, 9)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(C1), len(C2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_ID</th>\n",
       "      <th>r_ID</th>\n",
       "      <th>l_name</th>\n",
       "      <th>l_birth_year</th>\n",
       "      <th>l_zipcode</th>\n",
       "      <th>r_name</th>\n",
       "      <th>r_birth_year</th>\n",
       "      <th>r_zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>a3</td>\n",
       "      <td>b1</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>94107.0</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>a1</td>\n",
       "      <td>b1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>a1</td>\n",
       "      <td>b2</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>a1</td>\n",
       "      <td>b6</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Michael Brodie</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>a2</td>\n",
       "      <td>b6</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Michael Brodie</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>a5</td>\n",
       "      <td>b3</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>a1</td>\n",
       "      <td>b3</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>a1</td>\n",
       "      <td>b4</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982.0</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>a1</td>\n",
       "      <td>b5</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   _id l_ID r_ID            l_name  l_birth_year  l_zipcode          r_name  \\\n",
       "0    0   a3   b1    William Bridge        1986.0    94107.0     Mark Levene   \n",
       "1    1   a1   b1       Kevin Smith           NaN        NaN     Mark Levene   \n",
       "2    2   a1   b2       Kevin Smith           NaN        NaN     Bill Bridge   \n",
       "3    3   a1   b6       Kevin Smith           NaN        NaN  Michael Brodie   \n",
       "4    4   a2   b6  Michael Franklin        1988.0    94122.0  Michael Brodie   \n",
       "5    5   a5   b3   Alphonse Kemper        1984.0    94122.0   Mike Franklin   \n",
       "6    6   a1   b3       Kevin Smith           NaN        NaN   Mike Franklin   \n",
       "7    7   a1   b4       Kevin Smith           NaN        NaN     Joseph Kuan   \n",
       "8    8   a1   b5       Kevin Smith           NaN        NaN   Alfons Kemper   \n",
       "\n",
       "   r_birth_year  r_zipcode  \n",
       "0        1987.0    94107.0  \n",
       "1        1987.0    94107.0  \n",
       "2        1986.0    94107.0  \n",
       "3        1987.0    94107.0  \n",
       "4        1987.0    94107.0  \n",
       "5        1988.0    94122.0  \n",
       "6        1988.0    94122.0  \n",
       "7        1982.0    94122.0  \n",
       "8        1984.0    94122.0  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The candidate set C2 includes all possible tuple pairs with missing values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Window Size\n",
    "\n",
    "A tunable parameter to the Sorted Neighborhood Blocker is the Window size.  To perform the same result as above with a larger window size is via the `window_size` argument.  Note that it has more results than C1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.\n"
     ]
    }
   ],
   "source": [
    "C3 = sn.block_tables(A, B, \n",
    "                    l_block_attr='birth_year', r_block_attr='birth_year', \n",
    "                    l_output_attrs=['name', 'birth_year', 'zipcode'],\n",
    "                    r_output_attrs=['name', 'birth_year', 'zipcode'],\n",
    "                    l_output_prefix='l_', r_output_prefix='r_', window_size=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "11"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(C1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "20"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(C3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stable Sort Order\n",
    "\n",
    "One final challenge for the Sorted Neighborhood Blocker is making the sort order stable.  If the column being sorted on has multiple identical keys, and those keys are longer than the window size, then different results may occur between runs.  To always guarantee the same results for every run, make sure to make the sorting column unique.  One method to do so is to append the id of the tuple onto the end of the sorting column.  Here is an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.\n"
     ]
    }
   ],
   "source": [
    "A[\"birth_year_plus_id\"]=A[\"birth_year\"].map(str)+'-'+A[\"ID\"].map(str)\n",
    "B[\"birth_year_plus_id\"]=B[\"birth_year\"].map(str)+'-'+A[\"ID\"].map(str)\n",
    "C3 = sn.block_tables(A, B, \n",
    "                    l_block_attr='birth_year_plus_id', r_block_attr='birth_year_plus_id', \n",
    "                    l_output_attrs=['name', 'birth_year_plus_id', 'birth_year', 'zipcode'],\n",
    "                    r_output_attrs=['name', 'birth_year_plus_id', 'birth_year', 'zipcode'],\n",
    "                    l_output_prefix='l_', r_output_prefix='r_', window_size=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_ID</th>\n",
       "      <th>r_ID</th>\n",
       "      <th>l_name</th>\n",
       "      <th>l_birth_year_plus_id</th>\n",
       "      <th>l_birth_year</th>\n",
       "      <th>l_zipcode</th>\n",
       "      <th>r_name</th>\n",
       "      <th>r_birth_year_plus_id</th>\n",
       "      <th>r_birth_year</th>\n",
       "      <th>r_zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>a5</td>\n",
       "      <td>b4</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984-a5</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982-a4</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>a5</td>\n",
       "      <td>b5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984-a5</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984-a5</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>a5</td>\n",
       "      <td>b2</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984-a5</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986-a2</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>a3</td>\n",
       "      <td>b4</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986-a3</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982-a4</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>a3</td>\n",
       "      <td>b5</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986-a3</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984-a5</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   _id l_ID r_ID           l_name l_birth_year_plus_id  l_birth_year  \\\n",
       "0    0   a5   b4  Alphonse Kemper              1984-a5          1984   \n",
       "1    1   a5   b5  Alphonse Kemper              1984-a5          1984   \n",
       "2    2   a5   b2  Alphonse Kemper              1984-a5          1984   \n",
       "3    3   a3   b4   William Bridge              1986-a3          1986   \n",
       "4    4   a3   b5   William Bridge              1986-a3          1986   \n",
       "\n",
       "   l_zipcode         r_name r_birth_year_plus_id  r_birth_year  r_zipcode  \n",
       "0      94122    Joseph Kuan              1982-a4          1982      94122  \n",
       "1      94122  Alfons Kemper              1984-a5          1984      94122  \n",
       "2      94122    Bill Bridge              1986-a2          1986      94107  \n",
       "3      94107    Joseph Kuan              1982-a4          1982      94122  \n",
       "4      94107  Alfons Kemper              1984-a5          1984      94122  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C3.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sorted Neighborhood Blocker limitations\n",
    "\n",
    "Since the sorted neighborhood blocker requires position in sorted order, unlike other blockers, blocking on a candidate set or checking two tuples is not applicable.  Attempts to call `block_candset` or `block_tuples` will raise an assertion."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}