{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"This IPython notebook illustrates how to perform blocking using a black box blocker. A black box blocker applies a user-defined function to each tuple pair and drops the pair when the function returns `True`.\n",
"\n",
"First, import the *py_entitymatching* package and other libraries as follows:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Import py_entitymatching package\n",
"import py_entitymatching as em\n",
"import os\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, read the (sample) input tables for blocking purposes.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Get the datasets directory\n",
"datasets_dir = em.get_install_path() + os.sep + 'datasets'\n",
"\n",
"# Get the paths of the input tables\n",
"path_A = datasets_dir + os.sep + 'person_table_A.csv'\n",
"path_B = datasets_dir + os.sep + 'person_table_B.csv'"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Read the CSV files and set 'ID' as the key attribute\n",
"A = em.read_csv_metadata(path_A, key='ID')\n",
"B = em.read_csv_metadata(path_B, key='ID')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Different Ways to Block Using a Black Box Blocker\n",
"\n",
"There are three different ways to do black box blocking:\n",
"\n",
"1. Block two tables to produce a `candidate set` of tuple pairs.\n",
"2. Block a `candidate set` of tuple pairs to typically produce a reduced candidate set of tuple pairs.\n",
"3. Block two tuples to check if a tuple pair would get blocked."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Block Tables to Produce a Candidate Set of Tuple Pairs"
]
},
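{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, define a black box function. It takes two tuples and returns `True` if the pair should be blocked (dropped), and `False` otherwise."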
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def address_address_function(x, y):\n",
"    # x, y will be of type pandas Series\n",
"\n",
"    # get the address attribute\n",
"    x_address = x['address']\n",
"    y_address = y['address']\n",
"    # get the city (the last comma-separated field)\n",
"    x_city = x_address.split(',')[-1]\n",
"    y_city = y_address.split(',')[-1]\n",
"    # block the pair (return True) if the cities do not match\n",
"    return x_city != y_city"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Instantiate blackbox blocker\n",
"bb = em.BlackBoxBlocker()\n",
"# Set the black box function\n",
"bb.set_black_box_function(address_address_function)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"0% 100%\n",
"[##############################] | ETA: 00:00:00\n",
"Total time elapsed: 00:00:00\n"
]
}
],
"source": [
"C = bb.block_tables(A, B, l_output_attrs=['name', 'address'], r_output_attrs=['name', 'address'])"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" _id ltable_ID rtable_ID ltable_name \\\n",
"0 0 a1 b1 Kevin Smith \n",
"1 1 a1 b2 Kevin Smith \n",
"2 2 a1 b3 Kevin Smith \n",
"3 3 a1 b4 Kevin Smith \n",
"4 4 a1 b6 Kevin Smith \n",
"5 5 a2 b1 Michael Franklin \n",
"6 6 a2 b2 Michael Franklin \n",
"7 7 a2 b3 Michael Franklin \n",
"8 8 a2 b4 Michael Franklin \n",
"9 9 a2 b6 Michael Franklin \n",
"10 10 a3 b1 William Bridge \n",
"11 11 a3 b2 William Bridge \n",
"12 12 a3 b3 William Bridge \n",
"13 13 a3 b4 William Bridge \n",
"14 14 a3 b6 William Bridge \n",
"15 15 a4 b1 Binto George \n",
"16 16 a4 b2 Binto George \n",
"17 17 a4 b3 Binto George \n",
"18 18 a4 b4 Binto George \n",
"19 19 a4 b6 Binto George \n",
"20 20 a5 b1 Alphonse Kemper \n",
"21 21 a5 b2 Alphonse Kemper \n",
"22 22 a5 b3 Alphonse Kemper \n",
"23 23 a5 b4 Alphonse Kemper \n",
"24 24 a5 b6 Alphonse Kemper \n",
"\n",
" ltable_address rtable_name \\\n",
"0 607 From St, San Francisco Mark Levene \n",
"1 607 From St, San Francisco Bill Bridge \n",
"2 607 From St, San Francisco Mike Franklin \n",
"3 607 From St, San Francisco Joseph Kuan \n",
"4 607 From St, San Francisco Michael Brodie \n",
"5 1652 Stockton St, San Francisco Mark Levene \n",
"6 1652 Stockton St, San Francisco Bill Bridge \n",
"7 1652 Stockton St, San Francisco Mike Franklin \n",
"8 1652 Stockton St, San Francisco Joseph Kuan \n",
"9 1652 Stockton St, San Francisco Michael Brodie \n",
"10 3131 Webster St, San Francisco Mark Levene \n",
"11 3131 Webster St, San Francisco Bill Bridge \n",
"12 3131 Webster St, San Francisco Mike Franklin \n",
"13 3131 Webster St, San Francisco Joseph Kuan \n",
"14 3131 Webster St, San Francisco Michael Brodie \n",
"15 423 Powell St, San Francisco Mark Levene \n",
"16 423 Powell St, San Francisco Bill Bridge \n",
"17 423 Powell St, San Francisco Mike Franklin \n",
"18 423 Powell St, San Francisco Joseph Kuan \n",
"19 423 Powell St, San Francisco Michael Brodie \n",
"20 1702 Post Street, San Francisco Mark Levene \n",
"21 1702 Post Street, San Francisco Bill Bridge \n",
"22 1702 Post Street, San Francisco Mike Franklin \n",
"23 1702 Post Street, San Francisco Joseph Kuan \n",
"24 1702 Post Street, San Francisco Michael Brodie \n",
"\n",
" rtable_address \n",
"0 108 Clement St, San Francisco \n",
"1 3131 Webster St, San Francisco \n",
"2 1652 Stockton St, San Francisco \n",
"3 108 South Park, San Francisco \n",
"4 133 Clement Street, San Francisco \n",
"5 108 Clement St, San Francisco \n",
"6 3131 Webster St, San Francisco \n",
"7 1652 Stockton St, San Francisco \n",
"8 108 South Park, San Francisco \n",
"9 133 Clement Street, San Francisco \n",
"10 108 Clement St, San Francisco \n",
"11 3131 Webster St, San Francisco \n",
"12 1652 Stockton St, San Francisco \n",
"13 108 South Park, San Francisco \n",
"14 133 Clement Street, San Francisco \n",
"15 108 Clement St, San Francisco \n",
"16 3131 Webster St, San Francisco \n",
"17 1652 Stockton St, San Francisco \n",
"18 108 South Park, San Francisco \n",
"19 133 Clement Street, San Francisco \n",
"20 108 Clement St, San Francisco \n",
"21 3131 Webster St, San Francisco \n",
"22 1652 Stockton St, San Francisco \n",
"23 108 South Park, San Francisco \n",
"24 133 Clement Street, San Francisco "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"C"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Block Candidate Set\n",
"\n",
"First, define a black box function. Here it blocks (drops) a tuple pair when the last names differ."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def name_name_function(x, y):\n",
"    # x, y will be of type pandas Series\n",
"\n",
"    # get the name attribute\n",
"    x_name = x['name']\n",
"    y_name = y['name']\n",
"    # get the last names (assumes names of the form 'First Last')\n",
"    x_last = x_name.split(' ')[1]\n",
"    y_last = y_name.split(' ')[1]\n",
"    # block the pair (return True) if the last names do not match\n",
"    return x_last != y_last"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Instantiate blackbox blocker\n",
"bb = em.BlackBoxBlocker()\n",
"# Set the black box function\n",
"bb.set_black_box_function(name_name_function)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"0% 100%\n",
"[#########################] | ETA: 00:00:00\n",
"Total time elapsed: 00:00:00\n"
]
}
],
"source": [
"D = bb.block_candset(C)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" _id ltable_ID rtable_ID ltable_name \\\n",
"7 7 a2 b3 Michael Franklin \n",
"11 11 a3 b2 William Bridge \n",
"\n",
" ltable_address rtable_name \\\n",
"7 1652 Stockton St, San Francisco Mike Franklin \n",
"11 3131 Webster St, San Francisco Bill Bridge \n",
"\n",
" rtable_address \n",
"7 1652 Stockton St, San Francisco \n",
"11 3131 Webster St, San Francisco "
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Block Two Tuples to Check If a Tuple Pair Would Get Blocked"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, define the black box function"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def address_address_function(x, y):\n",
"    # x, y will be of type pandas Series\n",
"\n",
"    # get the address attribute\n",
"    x_address = x['address']\n",
"    y_address = y['address']\n",
"    # get the city (the last comma-separated field)\n",
"    x_city = x_address.split(',')[-1]\n",
"    y_city = y_address.split(',')[-1]\n",
"    # block the pair (return True) if the cities do not match\n",
"    return x_city != y_city"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Instantiate blackbox blocker\n",
"bb = em.BlackBoxBlocker()\n",
"# Set the black box function\n",
"bb.set_black_box_function(address_address_function)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" ID name birth_year hourly_wage address \\\n",
"0 a1 Kevin Smith 1989 30.0 607 From St, San Francisco \n",
"\n",
" zipcode \n",
"0 94107 "
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"A.iloc[[0]]"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" ID name birth_year hourly_wage address \\\n",
"0 b1 Mark Levene 1987 29.5 108 Clement St, San Francisco \n",
"\n",
" zipcode \n",
"0 94107 "
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"B.iloc[[0]]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False\n"
]
}
],
"source": [
"status = bb.block_tuples(A.iloc[0], B.iloc[0])\n",
"\n",
"print(status)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}