{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Blocking is typically done to reduce the number of tuple pairs considered for matching. There are several blocking methods proposed. The *py_entitymatching* package supports a subset of such blocking methods (#ref to what is supported). One such supported blocker is attribute equivalence blocker. This IPython notebook illustrates how to perform blocking using attribute equivalence blocker."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need to import *py_entitymatching* package and other libraries as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%load_ext autotime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "9a89351b-e44f-47ad-afff-b148744173af"
    }
   },
   "outputs": [],
   "source": [
    "# Import py_entitymatching package\n",
    "import py_entitymatching as em\n",
    "import os\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "0f59e4ac-032a-4d59-9172-8ee653831acb"
    }
   },
   "source": [
    "Then, read the input tablse from the datasets directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "2401accd-3160-4b07-aed4-de2f9a4dea35"
    }
   },
   "outputs": [],
   "source": [
    "# Get the datasets directory\n",
    "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n",
    "\n",
    "# Get the paths of the input tables\n",
    "path_A = datasets_dir + os.sep + 'person_table_A.csv'\n",
    "path_B = datasets_dir + os.sep + 'person_table_B.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true,
    "nbpresent": {
     "id": "e51b3877-75c8-431d-bdbd-77362bbf2191"
    }
   },
   "outputs": [],
   "source": [
    "# Read the CSV files and set 'ID' as the key attribute\n",
    "A = em.read_csv_metadata(path_A, key='ID')\n",
    "B = em.read_csv_metadata(path_B, key='ID')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "ac2eb60b-bf26-4a6a-a453-120cc7f660c4"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>name</th>\n",
       "      <th>birth_year</th>\n",
       "      <th>hourly_wage</th>\n",
       "      <th>address</th>\n",
       "      <th>zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>1989</td>\n",
       "      <td>30.0</td>\n",
       "      <td>607 From St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a2</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a3</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a4</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987</td>\n",
       "      <td>32.5</td>\n",
       "      <td>423 Powell St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1702 Post Street, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID              name  birth_year  hourly_wage  \\\n",
       "0  a1       Kevin Smith        1989         30.0   \n",
       "1  a2  Michael Franklin        1988         27.5   \n",
       "2  a3    William Bridge        1986         32.0   \n",
       "3  a4      Binto George        1987         32.5   \n",
       "4  a5   Alphonse Kemper        1984         35.0   \n",
       "\n",
       "                           address  zipcode  \n",
       "0       607 From St, San Francisco    94107  \n",
       "1  1652 Stockton St, San Francisco    94122  \n",
       "2   3131 Webster St, San Francisco    94107  \n",
       "3     423 Powell St, San Francisco    94122  \n",
       "4  1702 Post Street, San Francisco    94122  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "afadc046-692d-42ad-9c72-523493597682"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>name</th>\n",
       "      <th>birth_year</th>\n",
       "      <th>hourly_wage</th>\n",
       "      <th>address</th>\n",
       "      <th>zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>b1</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>29.5</td>\n",
       "      <td>108 Clement St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b2</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>b3</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>b4</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>26.0</td>\n",
       "      <td>108 South Park, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>b5</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>35.0</td>\n",
       "      <td>170 Post St, Apt 4,  San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID           name  birth_year  hourly_wage  \\\n",
       "0  b1    Mark Levene        1987         29.5   \n",
       "1  b2    Bill Bridge        1986         32.0   \n",
       "2  b3  Mike Franklin        1988         27.5   \n",
       "3  b4    Joseph Kuan        1982         26.0   \n",
       "4  b5  Alfons Kemper        1984         35.0   \n",
       "\n",
       "                              address  zipcode  \n",
       "0       108 Clement St, San Francisco    94107  \n",
       "1      3131 Webster St, San Francisco    94107  \n",
       "2     1652 Stockton St, San Francisco    94122  \n",
       "3       108 South Park, San Francisco    94122  \n",
       "4  170 Post St, Apt 4,  San Francisco    94122  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "B.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "ca7f0c34-9c21-4b6e-8010-eda1030df041"
    }
   },
   "source": [
    "# Different Ways to Block Using Attribute Equivalence Blocker\n",
    "\n",
    "Once the tables are read, we can do blocking using attribute equivalence blocker."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "45585660-2ba9-4211-adee-ace14fe8745f"
    }
   },
   "source": [
    "There are three different ways to do attribute equivalence blocking:\n",
    "\n",
    "1. Block two tables to produce a candidate set of tuple pairs.\n",
    "2. Block a candidate set of tuple pairs to typically produce a reduced candidate set of tuple pairs.\n",
    "3. Block two tuples to check if a tuple pair would get blocked."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "726bd6c9-23a5-4543-a201-f84864433f20"
    }
   },
   "source": [
    "## Block Tables to Produce a Candidate Set of Tuple Pairs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true,
    "nbpresent": {
     "id": "d2001a06-fe74-4ebd-896c-992803828753"
    }
   },
   "outputs": [],
   "source": [
    "# Instantiate attribute equivalence blocker object\n",
    "ab = em.AttrEquivalenceBlocker()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "06bcba06-ff85-43b7-bb04-ce1566a29dd9"
    }
   },
   "source": [
    "For the given two tables, we will assume that two persons with different `zipcode` values do not refer to the same real world person. So, we apply attribute equivalence blocking on `zipcode`. That is, we block all the tuple pairs that have different zipcodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true,
    "nbpresent": {
     "id": "2cdc68f4-5874-43d1-a378-b0bd31552ef8"
    }
   },
   "outputs": [],
   "source": [
    "# Use block_tables to apply blocking over two input tables.\n",
    "C1 = ab.block_tables(A, B, \n",
    "                    l_block_attr='zipcode', r_block_attr='zipcode', \n",
    "                    l_output_attrs=['name', 'birth_year', 'zipcode'],\n",
    "                    r_output_attrs=['name', 'birth_year', 'zipcode'],\n",
    "                    l_output_prefix='l_', r_output_prefix='r_')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "7b4967f5-2f99-4394-bfe9-29ff15334d39"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_ID</th>\n",
       "      <th>r_ID</th>\n",
       "      <th>l_name</th>\n",
       "      <th>l_birth_year</th>\n",
       "      <th>l_zipcode</th>\n",
       "      <th>r_name</th>\n",
       "      <th>r_birth_year</th>\n",
       "      <th>r_zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>a1</td>\n",
       "      <td>b1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>1989</td>\n",
       "      <td>94107</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>a1</td>\n",
       "      <td>b2</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>1989</td>\n",
       "      <td>94107</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>a1</td>\n",
       "      <td>b6</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>1989</td>\n",
       "      <td>94107</td>\n",
       "      <td>Michael Brodie</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>a3</td>\n",
       "      <td>b1</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>a3</td>\n",
       "      <td>b2</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   _id l_ID r_ID          l_name  l_birth_year  l_zipcode          r_name  \\\n",
       "0    0   a1   b1     Kevin Smith          1989      94107     Mark Levene   \n",
       "1    1   a1   b2     Kevin Smith          1989      94107     Bill Bridge   \n",
       "2    2   a1   b6     Kevin Smith          1989      94107  Michael Brodie   \n",
       "3    3   a3   b1  William Bridge          1986      94107     Mark Levene   \n",
       "4    4   a3   b2  William Bridge          1986      94107     Bill Bridge   \n",
       "\n",
       "   r_birth_year  r_zipcode  \n",
       "0          1987      94107  \n",
       "1          1986      94107  \n",
       "2          1987      94107  \n",
       "3          1987      94107  \n",
       "4          1986      94107  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the candidate set of tuple pairs\n",
    "C1.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "386dbb7d-085e-4946-b376-93e686eb62f5"
    }
   },
   "source": [
    "Note that the tuple pairs in the candidate set have the same zipcode. \n",
    "\n",
    "The attributes included in the candidate set are based on l_output_attrs and r_output_attrs mentioned in block_tables command (the key columns are included by default). Specifically, the list of attributes mentioned in l_output_attrs are picked from table A and the list of attributes mentioned in r_output_attrs are picked from table B. The attributes in the candidate set are prefixed based on l_output_prefix and r_ouptut_prefix parameter values mentioned in block_tables command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "fa6af6e5-471b-4296-98f4-1f4d4ee2869a"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id: 4565680872\n",
      "fk_rtable: r_ID\n",
      "rtable(obj.id): 4565204272\n",
      "fk_ltable: l_ID\n",
      "key: _id\n",
      "ltable(obj.id): 4565203432\n"
     ]
    }
   ],
   "source": [
    "# Show the metadata of C1\n",
    "em.show_properties(C1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "6ec70bd1-adea-40af-9f30-304f6236c5ca"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4565203432, 4565204272)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "id(A), id(B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "51679b6c-2667-4ed9-88aa-caff4c81edbf"
    }
   },
   "source": [
    "Note that the metadata of C1 includes key, foreign key to the left and right tables (i.e A and B) and pointers to left and right tables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Handling Missing Values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the input tuples have missing values in the blocking attribute, then they are ignored by default. This is because, including all possible tuple pairs with missing values can significantly increase the size of the candidate set. But if you want to include them, then you can set `allow_missing` paramater to be True."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Introduce some missing values\n",
    "A1 = em.read_csv_metadata(path_A, key='ID')\n",
    "A1.ix[0, 'zipcode'] = pd.np.NaN\n",
    "A1.ix[0, 'birth_year'] = pd.np.NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>name</th>\n",
       "      <th>birth_year</th>\n",
       "      <th>hourly_wage</th>\n",
       "      <th>address</th>\n",
       "      <th>zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>30.0</td>\n",
       "      <td>607 From St, San Francisco</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a2</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a3</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a4</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>32.5</td>\n",
       "      <td>423 Powell St, San Francisco</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1702 Post Street, San Francisco</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID              name  birth_year  hourly_wage  \\\n",
       "0  a1       Kevin Smith         NaN         30.0   \n",
       "1  a2  Michael Franklin      1988.0         27.5   \n",
       "2  a3    William Bridge      1986.0         32.0   \n",
       "3  a4      Binto George      1987.0         32.5   \n",
       "4  a5   Alphonse Kemper      1984.0         35.0   \n",
       "\n",
       "                           address  zipcode  \n",
       "0       607 From St, San Francisco      NaN  \n",
       "1  1652 Stockton St, San Francisco  94122.0  \n",
       "2   3131 Webster St, San Francisco  94107.0  \n",
       "3     423 Powell St, San Francisco  94122.0  \n",
       "4  1702 Post Street, San Francisco  94122.0  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Use block_tables to apply blocking over two input tables.\n",
    "C2 = ab.block_tables(A1, B, \n",
    "                    l_block_attr='zipcode', r_block_attr='zipcode', \n",
    "                    l_output_attrs=['name', 'birth_year', 'zipcode'],\n",
    "                    r_output_attrs=['name', 'birth_year', 'zipcode'],\n",
    "                    l_output_prefix='l_', r_output_prefix='r_', \n",
    "                    allow_missing=True) # setting allow_missing parameter to True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(15, 18)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(C1), len(C2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_ID</th>\n",
       "      <th>r_ID</th>\n",
       "      <th>l_name</th>\n",
       "      <th>l_birth_year</th>\n",
       "      <th>l_zipcode</th>\n",
       "      <th>r_name</th>\n",
       "      <th>r_birth_year</th>\n",
       "      <th>r_zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>a2</td>\n",
       "      <td>b3</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>a2</td>\n",
       "      <td>b4</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>a2</td>\n",
       "      <td>b5</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>a4</td>\n",
       "      <td>b3</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>a4</td>\n",
       "      <td>b4</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>a4</td>\n",
       "      <td>b5</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>a5</td>\n",
       "      <td>b3</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>a5</td>\n",
       "      <td>b4</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>a5</td>\n",
       "      <td>b5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>a3</td>\n",
       "      <td>b1</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>94107.0</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>a3</td>\n",
       "      <td>b2</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>94107.0</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>a3</td>\n",
       "      <td>b6</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>94107.0</td>\n",
       "      <td>Michael Brodie</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>a1</td>\n",
       "      <td>b1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>a1</td>\n",
       "      <td>b2</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>a1</td>\n",
       "      <td>b3</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>a1</td>\n",
       "      <td>b4</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>a1</td>\n",
       "      <td>b5</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>a1</td>\n",
       "      <td>b6</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Michael Brodie</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    _id l_ID r_ID            l_name  l_birth_year  l_zipcode          r_name  \\\n",
       "0     0   a2   b3  Michael Franklin        1988.0    94122.0   Mike Franklin   \n",
       "1     1   a2   b4  Michael Franklin        1988.0    94122.0     Joseph Kuan   \n",
       "2     2   a2   b5  Michael Franklin        1988.0    94122.0   Alfons Kemper   \n",
       "3     3   a4   b3      Binto George        1987.0    94122.0   Mike Franklin   \n",
       "4     4   a4   b4      Binto George        1987.0    94122.0     Joseph Kuan   \n",
       "5     5   a4   b5      Binto George        1987.0    94122.0   Alfons Kemper   \n",
       "6     6   a5   b3   Alphonse Kemper        1984.0    94122.0   Mike Franklin   \n",
       "7     7   a5   b4   Alphonse Kemper        1984.0    94122.0     Joseph Kuan   \n",
       "8     8   a5   b5   Alphonse Kemper        1984.0    94122.0   Alfons Kemper   \n",
       "9     9   a3   b1    William Bridge        1986.0    94107.0     Mark Levene   \n",
       "10   10   a3   b2    William Bridge        1986.0    94107.0     Bill Bridge   \n",
       "11   11   a3   b6    William Bridge        1986.0    94107.0  Michael Brodie   \n",
       "12   12   a1   b1       Kevin Smith           NaN        NaN     Mark Levene   \n",
       "13   13   a1   b2       Kevin Smith           NaN        NaN     Bill Bridge   \n",
       "14   14   a1   b3       Kevin Smith           NaN        NaN   Mike Franklin   \n",
       "15   15   a1   b4       Kevin Smith           NaN        NaN     Joseph Kuan   \n",
       "16   16   a1   b5       Kevin Smith           NaN        NaN   Alfons Kemper   \n",
       "17   17   a1   b6       Kevin Smith           NaN        NaN  Michael Brodie   \n",
       "\n",
       "    r_birth_year  r_zipcode  \n",
       "0           1988    94122.0  \n",
       "1           1982    94122.0  \n",
       "2           1984    94122.0  \n",
       "3           1988    94122.0  \n",
       "4           1982    94122.0  \n",
       "5           1984    94122.0  \n",
       "6           1988    94122.0  \n",
       "7           1982    94122.0  \n",
       "8           1984    94122.0  \n",
       "9           1987    94107.0  \n",
       "10          1986    94107.0  \n",
       "11          1987    94107.0  \n",
       "12          1987    94107.0  \n",
       "13          1986    94107.0  \n",
       "14          1988    94122.0  \n",
       "15          1982    94122.0  \n",
       "16          1984    94122.0  \n",
       "17          1987    94107.0  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The candidate set C2 includes all possible tuple pairs with missing values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "10a59253-c303-4420-a12b-e99a12ed3e2d"
    }
   },
   "source": [
    "## Block a Candidate Set of Tuple Pairs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "6e4a3e7e-3019-4597-8a85-4b3803c47623"
    }
   },
   "source": [
    "In the above, we see that the candidate set produced after blocking over input tables include tuple pairs that have different birth years. We will assume that two persons with different birth years cannot refer to the same person. So, we block the candidate set of tuple pairs on `birth_year`. That is, we block all the tuple pairs that have different birth years."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "1b494eac-63c8-4fe6-9a13-76c482dfccb7"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0%           100%\n",
      "[###############] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "time: 45.3 ms\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "Total time elapsed: 00:00:00\n"
     ]
    }
   ],
   "source": [
    "# Instantiate Attr. Equivalence Blocker\n",
    "ab = em.AttrEquivalenceBlocker()\n",
    "# Use block_tables to apply blocking over two input tables.\n",
    "C3 = ab.block_candset(C1, l_block_attr='birth_year', r_block_attr='birth_year')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "fb465204-b14c-4e45-b9cd-00306acbd93a"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_ID</th>\n",
       "      <th>r_ID</th>\n",
       "      <th>l_name</th>\n",
       "      <th>l_birth_year</th>\n",
       "      <th>l_zipcode</th>\n",
       "      <th>r_name</th>\n",
       "      <th>r_birth_year</th>\n",
       "      <th>r_zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>a3</td>\n",
       "      <td>b2</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>a2</td>\n",
       "      <td>b3</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>a5</td>\n",
       "      <td>b5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    _id l_ID r_ID            l_name  l_birth_year  l_zipcode         r_name  \\\n",
       "4     4   a3   b2    William Bridge          1986      94107    Bill Bridge   \n",
       "6     6   a2   b3  Michael Franklin          1988      94122  Mike Franklin   \n",
       "14   14   a5   b5   Alphonse Kemper          1984      94122  Alfons Kemper   \n",
       "\n",
       "    r_birth_year  r_zipcode  \n",
       "4           1986      94107  \n",
       "6           1988      94122  \n",
       "14          1984      94122  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C3.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "bb198ef6-9e92-40f9-a8f1-849b52283295"
    }
   },
   "source": [
    "Note that, the tuple pairs in the resulting candidate set have the same birth year. \n",
    "\n",
    "The attributes included in the resulting candidate set are based on the input candidate set (i.e the same attributes are retained)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "308a90fc-353a-416b-9553-33558a4ab44a"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id: 4565765144\n",
      "fk_rtable: r_ID\n",
      "rtable(obj.id): 4565204272\n",
      "fk_ltable: l_ID\n",
      "key: _id\n",
      "ltable(obj.id): 4565203432\n"
     ]
    }
   ],
   "source": [
    "# Show the metadata of C1\n",
    "em.show_properties(C3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "c0f515f6-4e0c-4389-a921-dfc3a9842d69"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4565203432, 4565204272)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "id(A), id(B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "fbeae903-9041-4540-9781-cbe79adbae7c"
    }
   },
   "source": [
    "As we saw earlier the metadata of C3 includes the same metadata as C1. That is, it includes key, foreign key to the left and right tables (i.e A and B) and pointers to left and right tables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Handling Missing Values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the tuple pairs included in the candidate set have missing values in the blocking attribute, then they are ignored by default. This is because, including all possible tuple pairs with missing values can significantly increase the size of the candidate set. But if you want to include them, then you can set `allow_missing` paramater to be True."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_ID</th>\n",
       "      <th>r_ID</th>\n",
       "      <th>l_name</th>\n",
       "      <th>l_birth_year</th>\n",
       "      <th>l_zipcode</th>\n",
       "      <th>r_name</th>\n",
       "      <th>r_birth_year</th>\n",
       "      <th>r_zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>a2</td>\n",
       "      <td>b3</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>a2</td>\n",
       "      <td>b4</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>a2</td>\n",
       "      <td>b5</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>a4</td>\n",
       "      <td>b3</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>a4</td>\n",
       "      <td>b4</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>a4</td>\n",
       "      <td>b5</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>a5</td>\n",
       "      <td>b3</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>a5</td>\n",
       "      <td>b4</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>a5</td>\n",
       "      <td>b5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>a3</td>\n",
       "      <td>b1</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>94107.0</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>a3</td>\n",
       "      <td>b2</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>94107.0</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>a3</td>\n",
       "      <td>b6</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>94107.0</td>\n",
       "      <td>Michael Brodie</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>a1</td>\n",
       "      <td>b1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>a1</td>\n",
       "      <td>b2</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>a1</td>\n",
       "      <td>b3</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>a1</td>\n",
       "      <td>b4</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>a1</td>\n",
       "      <td>b5</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>a1</td>\n",
       "      <td>b6</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Michael Brodie</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    _id l_ID r_ID            l_name  l_birth_year  l_zipcode          r_name  \\\n",
       "0     0   a2   b3  Michael Franklin        1988.0    94122.0   Mike Franklin   \n",
       "1     1   a2   b4  Michael Franklin        1988.0    94122.0     Joseph Kuan   \n",
       "2     2   a2   b5  Michael Franklin        1988.0    94122.0   Alfons Kemper   \n",
       "3     3   a4   b3      Binto George        1987.0    94122.0   Mike Franklin   \n",
       "4     4   a4   b4      Binto George        1987.0    94122.0     Joseph Kuan   \n",
       "5     5   a4   b5      Binto George        1987.0    94122.0   Alfons Kemper   \n",
       "6     6   a5   b3   Alphonse Kemper        1984.0    94122.0   Mike Franklin   \n",
       "7     7   a5   b4   Alphonse Kemper        1984.0    94122.0     Joseph Kuan   \n",
       "8     8   a5   b5   Alphonse Kemper        1984.0    94122.0   Alfons Kemper   \n",
       "9     9   a3   b1    William Bridge        1986.0    94107.0     Mark Levene   \n",
       "10   10   a3   b2    William Bridge        1986.0    94107.0     Bill Bridge   \n",
       "11   11   a3   b6    William Bridge        1986.0    94107.0  Michael Brodie   \n",
       "12   12   a1   b1       Kevin Smith           NaN        NaN     Mark Levene   \n",
       "13   13   a1   b2       Kevin Smith           NaN        NaN     Bill Bridge   \n",
       "14   14   a1   b3       Kevin Smith           NaN        NaN   Mike Franklin   \n",
       "15   15   a1   b4       Kevin Smith           NaN        NaN     Joseph Kuan   \n",
       "16   16   a1   b5       Kevin Smith           NaN        NaN   Alfons Kemper   \n",
       "17   17   a1   b6       Kevin Smith           NaN        NaN  Michael Brodie   \n",
       "\n",
       "    r_birth_year  r_zipcode  \n",
       "0           1988    94122.0  \n",
       "1           1982    94122.0  \n",
       "2           1984    94122.0  \n",
       "3           1988    94122.0  \n",
       "4           1982    94122.0  \n",
       "5           1984    94122.0  \n",
       "6           1988    94122.0  \n",
       "7           1982    94122.0  \n",
       "8           1984    94122.0  \n",
       "9           1987    94107.0  \n",
       "10          1986    94107.0  \n",
       "11          1987    94107.0  \n",
       "12          1987    94107.0  \n",
       "13          1986    94107.0  \n",
       "14          1988    94122.0  \n",
       "15          1982    94122.0  \n",
       "16          1984    94122.0  \n",
       "17          1987    94107.0  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display C2 (got by blocking over A1 and B)\n",
    "C2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id: 4565567248\n",
      "fk_rtable: r_ID\n",
      "rtable(obj.id): 4565204272\n",
      "fk_ltable: l_ID\n",
      "key: _id\n",
      "ltable(obj.id): 4565680928\n"
     ]
    }
   ],
   "source": [
    "em.show_properties(C2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id: 4565680928\n",
      "key: ID\n"
     ]
    }
   ],
   "source": [
    "em.show_properties(A1) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that `A1` is the left table to `C2`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>name</th>\n",
       "      <th>birth_year</th>\n",
       "      <th>hourly_wage</th>\n",
       "      <th>address</th>\n",
       "      <th>zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>30.0</td>\n",
       "      <td>607 From St, San Francisco</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a2</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a3</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a4</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987.0</td>\n",
       "      <td>32.5</td>\n",
       "      <td>423 Powell St, San Francisco</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1702 Post Street, San Francisco</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID              name  birth_year  hourly_wage  \\\n",
       "0  a1       Kevin Smith         NaN         30.0   \n",
       "1  a2  Michael Franklin      1988.0         27.5   \n",
       "2  a3    William Bridge      1986.0         32.0   \n",
       "3  a4      Binto George      1987.0         32.5   \n",
       "4  a5   Alphonse Kemper      1984.0         35.0   \n",
       "\n",
       "                           address  zipcode  \n",
       "0       607 From St, San Francisco      NaN  \n",
       "1  1652 Stockton St, San Francisco  94122.0  \n",
       "2   3131 Webster St, San Francisco  94107.0  \n",
       "3     423 Powell St, San Francisco  94122.0  \n",
       "4  1702 Post Street, San Francisco  94122.0  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0%              100%\n",
      "[##################] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    }
   ],
   "source": [
    "C4 = ab.block_candset(C2, l_block_attr='birth_year', r_block_attr='birth_year', allow_missing=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_ID</th>\n",
       "      <th>r_ID</th>\n",
       "      <th>l_name</th>\n",
       "      <th>l_birth_year</th>\n",
       "      <th>l_zipcode</th>\n",
       "      <th>r_name</th>\n",
       "      <th>r_birth_year</th>\n",
       "      <th>r_zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>a2</td>\n",
       "      <td>b3</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>a5</td>\n",
       "      <td>b5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>a3</td>\n",
       "      <td>b2</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>94107.0</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    _id l_ID r_ID            l_name  l_birth_year  l_zipcode         r_name  \\\n",
       "0     0   a2   b3  Michael Franklin        1988.0    94122.0  Mike Franklin   \n",
       "8     8   a5   b5   Alphonse Kemper        1984.0    94122.0  Alfons Kemper   \n",
       "10   10   a3   b2    William Bridge        1986.0    94107.0    Bill Bridge   \n",
       "\n",
       "    r_birth_year  r_zipcode  \n",
       "0           1988    94122.0  \n",
       "8           1984    94122.0  \n",
       "10          1986    94107.0  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0%              100%\n",
      "[##################] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    }
   ],
   "source": [
    "# Set allow_missing to True\n",
    "C5 = ab.block_candset(C2, l_block_attr='birth_year', r_block_attr='birth_year', allow_missing=True) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3, 9)"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(C4), len(C5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_ID</th>\n",
       "      <th>r_ID</th>\n",
       "      <th>l_name</th>\n",
       "      <th>l_birth_year</th>\n",
       "      <th>l_zipcode</th>\n",
       "      <th>r_name</th>\n",
       "      <th>r_birth_year</th>\n",
       "      <th>r_zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>a2</td>\n",
       "      <td>b3</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>a5</td>\n",
       "      <td>b5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>a3</td>\n",
       "      <td>b2</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986.0</td>\n",
       "      <td>94107.0</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>a1</td>\n",
       "      <td>b1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>a1</td>\n",
       "      <td>b2</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>a1</td>\n",
       "      <td>b3</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>a1</td>\n",
       "      <td>b4</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>a1</td>\n",
       "      <td>b5</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>94122.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>a1</td>\n",
       "      <td>b6</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Michael Brodie</td>\n",
       "      <td>1987</td>\n",
       "      <td>94107.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    _id l_ID r_ID            l_name  l_birth_year  l_zipcode          r_name  \\\n",
       "0     0   a2   b3  Michael Franklin        1988.0    94122.0   Mike Franklin   \n",
       "8     8   a5   b5   Alphonse Kemper        1984.0    94122.0   Alfons Kemper   \n",
       "10   10   a3   b2    William Bridge        1986.0    94107.0     Bill Bridge   \n",
       "12   12   a1   b1       Kevin Smith           NaN        NaN     Mark Levene   \n",
       "13   13   a1   b2       Kevin Smith           NaN        NaN     Bill Bridge   \n",
       "14   14   a1   b3       Kevin Smith           NaN        NaN   Mike Franklin   \n",
       "15   15   a1   b4       Kevin Smith           NaN        NaN     Joseph Kuan   \n",
       "16   16   a1   b5       Kevin Smith           NaN        NaN   Alfons Kemper   \n",
       "17   17   a1   b6       Kevin Smith           NaN        NaN  Michael Brodie   \n",
       "\n",
       "    r_birth_year  r_zipcode  \n",
       "0           1988    94122.0  \n",
       "8           1984    94122.0  \n",
       "10          1986    94107.0  \n",
       "12          1987    94107.0  \n",
       "13          1986    94107.0  \n",
       "14          1988    94122.0  \n",
       "15          1982    94122.0  \n",
       "16          1984    94122.0  \n",
       "17          1987    94107.0  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "186ec555-850c-41a3-a167-9a57d63c2b45"
    }
   },
   "source": [
    "## Block Two tuples To Check If a Tuple Pair Would Get Blocked"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "c516958b-85c4-47d1-8927-0dcbaf942b1c"
    }
   },
   "source": [
    "We can apply attribute equivalence blocking to a tuple pair to check if it is going to get blocked. For example, we can check if the first tuple from A and B will get blocked if we block on `zipcode`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "62731f40-a32c-44e0-958d-a4b5c07534cb"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>name</th>\n",
       "      <th>birth_year</th>\n",
       "      <th>hourly_wage</th>\n",
       "      <th>address</th>\n",
       "      <th>zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>1989</td>\n",
       "      <td>30.0</td>\n",
       "      <td>607 From St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID         name  birth_year  hourly_wage                     address  \\\n",
       "0  a1  Kevin Smith        1989         30.0  607 From St, San Francisco   \n",
       "\n",
       "   zipcode  \n",
       "0    94107  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the first tuple from table A\n",
    "A.ix[[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "af99c211-c53a-44c1-93b6-889fa962a0b7"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>name</th>\n",
       "      <th>birth_year</th>\n",
       "      <th>hourly_wage</th>\n",
       "      <th>address</th>\n",
       "      <th>zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>b1</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>29.5</td>\n",
       "      <td>108 Clement St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID         name  birth_year  hourly_wage                        address  \\\n",
       "0  b1  Mark Levene        1987         29.5  108 Clement St, San Francisco   \n",
       "\n",
       "   zipcode  \n",
       "0    94107  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the first tuple from table B\n",
    "B.ix[[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false,
    "nbpresent": {
     "id": "13947a8a-7ffe-49d0-b3ba-d6fe2c5e0d5b"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "False\n"
     ]
    }
   ],
   "source": [
    "# Instantiate Attr. Equivalence Blocker\n",
    "ab = em.AttrEquivalenceBlocker()\n",
    "\n",
    "# Apply blocking to a tuple pair from the input tables on zipcode and get blocking status\n",
    "status = ab.block_tuples(A.ix[0], B.ix[0], l_block_attr='zipcode', r_block_attr='zipcode')\n",
    "\n",
    "# Print the blocking status\n",
    "print(status)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "aa4dee9f-fab2-4fef-b3c5-f8fc03fba405"
    }
   },
   "source": [
    "The above result says that the tuple pair will not be blocked, i.e. this tuple pair will be included in the candidate set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Currently, block_tuples command does not handle missing values"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}