{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This IPython notebook illustrates how to perform blocking using Overlap blocker.\n",
    "\n",
    "First, we need to import *py_entitymatching* package and other libraries as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Import py_entitymatching package\n",
    "import py_entitymatching as em\n",
    "import os\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, read the (sample) input tables for blocking purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Get the datasets directory\n",
    "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n",
    "\n",
    "# Get the paths of the input tables\n",
    "path_A = datasets_dir + os.sep + 'person_table_A.csv'\n",
    "path_B = datasets_dir + os.sep + 'person_table_B.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Read the CSV files and set 'ID' as the key attribute\n",
    "A = em.read_csv_metadata(path_A, key='ID')\n",
    "B = em.read_csv_metadata(path_B, key='ID')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>name</th>\n",
       "      <th>birth_year</th>\n",
       "      <th>hourly_wage</th>\n",
       "      <th>address</th>\n",
       "      <th>zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>1989</td>\n",
       "      <td>30.0</td>\n",
       "      <td>607 From St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a2</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a3</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a4</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987</td>\n",
       "      <td>32.5</td>\n",
       "      <td>423 Powell St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1702 Post Street, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID              name  birth_year  hourly_wage  \\\n",
       "0  a1       Kevin Smith        1989         30.0   \n",
       "1  a2  Michael Franklin        1988         27.5   \n",
       "2  a3    William Bridge        1986         32.0   \n",
       "3  a4      Binto George        1987         32.5   \n",
       "4  a5   Alphonse Kemper        1984         35.0   \n",
       "\n",
       "                           address  zipcode  \n",
       "0       607 From St, San Francisco    94107  \n",
       "1  1652 Stockton St, San Francisco    94122  \n",
       "2   3131 Webster St, San Francisco    94107  \n",
       "3     423 Powell St, San Francisco    94122  \n",
       "4  1702 Post Street, San Francisco    94122  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Ways To Do Overlap Blocking"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are three different ways to do overlap blocking:\n",
    "\n",
    "1. Block two tables to produce a `candidate set` of tuple pairs.\n",
    "2. Block a `candidate set` of tuple pairs to typically produce a reduced candidate set of tuple pairs.\n",
    "3. Block two tuples to check if a tuple pair would get blocked."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Block Tables to Produce a Candidate Set of Tuple Pairs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Instantiate overlap blocker object\n",
    "ob = em.OverlapBlocker()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the given two tables, we will assume that two persons with no sufficient overlap between their addresses do not refer to the same real world person. So, we apply overlap blocking on `address`. Specifically, we tokenize the address by word and include the tuple pairs if the addresses have at least 3 overlapping tokens. That is, we block all the tuple pairs that do not share at least 3 tokens in `address`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "SyntaxError",
     "evalue": "invalid syntax (<ipython-input-6-fd7e6f608750>, line 5)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;36m  File \u001b[0;32m\"<ipython-input-6-fd7e6f608750>\"\u001b[0;36m, line \u001b[0;32m5\u001b[0m\n\u001b[0;31m    show_progress=False)\u001b[0m\n\u001b[0m                ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
     ]
    }
   ],
   "source": [
    "# Specify the tokenization to be 'word' level and set overlap_size to be 3.\n",
    "C1 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3, \n",
    "                    l_output_attrs=['name', 'birth_year', 'address'], \n",
    "                    r_output_attrs=['name', 'birth_year', 'address']\n",
    "                    show_progress=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Display first 5 tuple pairs in the candidate set.\n",
    "C1.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the above, we used word-level tokenizer. Overlap blocker also supports q-gram based tokenizer and it can be used as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Set the word_level to be False and set the value of q (using q_val)\n",
    "C2 = ob.block_tables(A, B, 'address', 'address', word_level=False, q_val=3, overlap_size=3, \n",
    "                    l_output_attrs=['name', 'birth_year', 'address'], \n",
    "                    r_output_attrs=['name', 'birth_year', 'address'],\n",
    "                    show_progress=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Display first 5 tuple pairs\n",
    "C2.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Updating Stopwords"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Commands in the Overlap Blocker removes some stop words by default. You can avoid this by specifying `rem_stop_words` parameter to False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Set the parameter to remove stop words to False\n",
    "C3 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3, rem_stop_words=False,\n",
    "                    l_output_attrs=['name', 'birth_year', 'address'], \n",
    "                    r_output_attrs=['name', 'birth_year', 'address'],\n",
    "                    show_progress=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Display first 5 tuple pairs\n",
    "C3.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can check what stop words are getting removed like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ob.stop_words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can update this stop word list (with some domain specific stop words) and do the blocking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Include Franciso as one of the stop words\n",
    "ob.stop_words.append('francisco')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ob.stop_words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Set the word level tokenizer to be True\n",
    "C4 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3, \n",
    "                    l_output_attrs=['name', 'birth_year', 'address'], \n",
    "                    r_output_attrs=['name', 'birth_year', 'address'],\n",
    "                    show_progress=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "C4.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Handling Missing Values "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the input tuples have missing values in the blocking attribute, then they are ignored by default. You can set `allow_missing_values` to be True to include all possible tuple pairs with missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Introduce some missing value\n",
    "A1 = em.read_csv_metadata(path_A, key='ID')\n",
    "A1.ix[0, 'address'] = pd.np.NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Set the word level tokenizer to be True\n",
    "C5 = ob.block_tables(A1, B, 'address', 'address', word_level=True, overlap_size=3, allow_missing=True,\n",
    "                    l_output_attrs=['name', 'birth_year', 'address'], \n",
    "                    r_output_attrs=['name', 'birth_year', 'address'],\n",
    "                    show_progress=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "len(C5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "C5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Block a Candidata Set  To Produce Reduced Set of Tuple Pairs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Instantiate the overlap blocker\n",
    "ob = em.OverlapBlocker()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the above, we see that the candidate set produced after blocking over input tables include tuple pairs that have at least three tokens in overlap. Adding to that, we will assume that two persons with no overlap of their names cannot refer to the same person. So, we block the candidate set of tuple pairs on `name`. That is, we block all the tuple pairs that have no overlap of tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Specify the tokenization to be 'word' level and set overlap_size to be 1.\n",
    "C6 = ob.block_candset(C1, 'name', 'name', word_level=True, overlap_size=1, show_progress=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "C6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the above, we saw that word level tokenization was used to tokenize the names. You can also use q-gram tokenization like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Specify the tokenization to be 'word' level and set overlap_size to be 1.\n",
    "C7 = ob.block_candset(C1, 'name', 'name', word_level=False, q_val= 3, overlap_size=1, show_progress=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "C7.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Handling Missing Values "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[As we saw with block_tables](#Handling-Missing-Values), you can include all the possible tuple pairs with the missing values using `allow_missing` parameter  block the candidate set with the updated set of stop words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Introduce some missing values\n",
    "A1.ix[2, 'name'] = pd.np.NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "C8 = ob.block_candset(C5, 'name', 'name', word_level=True, overlap_size=1, allow_missing=True, show_progress=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Block Two tuples To Check If a Tuple Pair Would Get Blocked"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can apply overlap blocking to a tuple pair to check if it is going to get blocked. For example, we can check if the first tuple from A and B will get blocked if we block on `address`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Display the first tuple from table A\n",
    "A.ix[[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Display the first tuple from table B\n",
    "B.ix[[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Instantiate Attr. Equivalence Blocker\n",
    "ob = em.OverlapBlocker()\n",
    "\n",
    "# Apply blocking to a tuple pair from the input tables on zipcode and get blocking status\n",
    "status = ob.block_tuples(A.ix[0], B.ix[0],'address', 'address', overlap_size=1, show_progress=False)\n",
    "\n",
    "# Print the blocking status\n",
    "print(status)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}