{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "This IPython notebook illustrates how to perform blocking using a blackbox blocker.\n", "\n", "First, we need to import the *py_entitymatching* package and other libraries as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/pradap/miniconda3/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n", " \"This module will be removed in 0.20.\", DeprecationWarning)\n" ] } ], "source": [ "# Import the py_entitymatching package\n", "import py_entitymatching as em\n", "import os\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, read the (sample) input tables to be blocked.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get the datasets directory\n", "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n", "\n", "# Get the paths of the input tables\n", "path_A = datasets_dir + os.sep + 'person_table_A.csv'\n", "path_B = datasets_dir + os.sep + 'person_table_B.csv'" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Read the CSV files and set 'ID' as the key attribute\n", "A = em.read_csv_metadata(path_A, key='ID')\n", "B = em.read_csv_metadata(path_B, key='ID')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Different Ways to Block Using the Blackbox Blocker\n", "\n", "There are three different ways to do blackbox blocking:\n", "\n", "1. 
Block two tables to produce a `candidate set` of tuple pairs.\n", "2. Block a `candidate set` of tuple pairs to produce a (typically smaller) candidate set of tuple pairs.\n", "3. Block two tuples to check if a tuple pair would get blocked." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Block Tables to Produce a Candidate Set of Tuple Pairs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, define a blackbox function. It takes two tuples and returns `True` if the pair should be blocked (dropped from the candidate set):" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def address_address_function(x, y):\n", "    # x and y are pandas Series, one tuple from each table\n", "\n", "    # get the address attribute\n", "    x_address = x['address']\n", "    y_address = y['address']\n", "    # the city is the last comma-separated field of the address\n", "    x_city = x_address.split(',')[-1]\n", "    y_city = y_address.split(',')[-1]\n", "    # block the pair (return True) if the cities differ\n", "    return x_city != y_city" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Instantiate blackbox blocker\n", "bb = em.BlackBoxBlocker()\n", "# Set the black box function\n", "bb.set_black_box_function(address_address_function)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "0% 100%\n", "[##############################] | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] } ], "source": [ "C = bb.block_tables(A, B, l_output_attrs=['name', 'address'], r_output_attrs=['name', 'address'])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>_id</th><th>ltable_ID</th><th>rtable_ID</th><th>ltable_name</th><th>ltable_address</th><th>rtable_name</th><th>rtable_address</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>a1</td><td>b1</td><td>Kevin Smith</td><td>607 From St, San Francisco</td><td>Mark Levene</td><td>108 Clement St, San Francisco</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>a1</td><td>b2</td><td>Kevin Smith</td><td>607 From St, San Francisco</td><td>Bill Bridge</td><td>3131 Webster St, San Francisco</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>a1</td><td>b3</td><td>Kevin Smith</td><td>607 From St, San Francisco</td><td>Mike Franklin</td><td>1652 Stockton St, San Francisco</td></tr>\n",
"    <tr><th>3</th><td>3</td><td>a1</td><td>b4</td><td>Kevin Smith</td><td>607 From St, San Francisco</td><td>Joseph Kuan</td><td>108 South Park, San Francisco</td></tr>\n",
"    <tr><th>4</th><td>4</td><td>a1</td><td>b6</td><td>Kevin Smith</td><td>607 From St, San Francisco</td><td>Michael Brodie</td><td>133 Clement Street, San Francisco</td></tr>\n",
"    <tr><th>5</th><td>5</td><td>a2</td><td>b1</td><td>Michael Franklin</td><td>1652 Stockton St, San Francisco</td><td>Mark Levene</td><td>108 Clement St, San Francisco</td></tr>\n",
"    <tr><th>6</th><td>6</td><td>a2</td><td>b2</td><td>Michael Franklin</td><td>1652 Stockton St, San Francisco</td><td>Bill Bridge</td><td>3131 Webster St, San Francisco</td></tr>\n",
"    <tr><th>7</th><td>7</td><td>a2</td><td>b3</td><td>Michael Franklin</td><td>1652 Stockton St, San Francisco</td><td>Mike Franklin</td><td>1652 Stockton St, San Francisco</td></tr>\n",
"    <tr><th>8</th><td>8</td><td>a2</td><td>b4</td><td>Michael Franklin</td><td>1652 Stockton St, San Francisco</td><td>Joseph Kuan</td><td>108 South Park, San Francisco</td></tr>\n",
"    <tr><th>9</th><td>9</td><td>a2</td><td>b6</td><td>Michael Franklin</td><td>1652 Stockton St, San Francisco</td><td>Michael Brodie</td><td>133 Clement Street, San Francisco</td></tr>\n",
"    <tr><th>10</th><td>10</td><td>a3</td><td>b1</td><td>William Bridge</td><td>3131 Webster St, San Francisco</td><td>Mark Levene</td><td>108 Clement St, San Francisco</td></tr>\n",
"    <tr><th>11</th><td>11</td><td>a3</td><td>b2</td><td>William Bridge</td><td>3131 Webster St, San Francisco</td><td>Bill Bridge</td><td>3131 Webster St, San Francisco</td></tr>\n",
"    <tr><th>12</th><td>12</td><td>a3</td><td>b3</td><td>William Bridge</td><td>3131 Webster St, San Francisco</td><td>Mike Franklin</td><td>1652 Stockton St, San Francisco</td></tr>\n",
"    <tr><th>13</th><td>13</td><td>a3</td><td>b4</td><td>William Bridge</td><td>3131 Webster St, San Francisco</td><td>Joseph Kuan</td><td>108 South Park, San Francisco</td></tr>\n",
"    <tr><th>14</th><td>14</td><td>a3</td><td>b6</td><td>William Bridge</td><td>3131 Webster St, San Francisco</td><td>Michael Brodie</td><td>133 Clement Street, San Francisco</td></tr>\n",
"    <tr><th>15</th><td>15</td><td>a4</td><td>b1</td><td>Binto George</td><td>423 Powell St, San Francisco</td><td>Mark Levene</td><td>108 Clement St, San Francisco</td></tr>\n",
"    <tr><th>16</th><td>16</td><td>a4</td><td>b2</td><td>Binto George</td><td>423 Powell St, San Francisco</td><td>Bill Bridge</td><td>3131 Webster St, San Francisco</td></tr>\n",
"    <tr><th>17</th><td>17</td><td>a4</td><td>b3</td><td>Binto George</td><td>423 Powell St, San Francisco</td><td>Mike Franklin</td><td>1652 Stockton St, San Francisco</td></tr>\n",
"    <tr><th>18</th><td>18</td><td>a4</td><td>b4</td><td>Binto George</td><td>423 Powell St, San Francisco</td><td>Joseph Kuan</td><td>108 South Park, San Francisco</td></tr>\n",
"    <tr><th>19</th><td>19</td><td>a4</td><td>b6</td><td>Binto George</td><td>423 Powell St, San Francisco</td><td>Michael Brodie</td><td>133 Clement Street, San Francisco</td></tr>\n",
"    <tr><th>20</th><td>20</td><td>a5</td><td>b1</td><td>Alphonse Kemper</td><td>1702 Post Street, San Francisco</td><td>Mark Levene</td><td>108 Clement St, San Francisco</td></tr>\n",
"    <tr><th>21</th><td>21</td><td>a5</td><td>b2</td><td>Alphonse Kemper</td><td>1702 Post Street, San Francisco</td><td>Bill Bridge</td><td>3131 Webster St, San Francisco</td></tr>\n",
"    <tr><th>22</th><td>22</td><td>a5</td><td>b3</td><td>Alphonse Kemper</td><td>1702 Post Street, San Francisco</td><td>Mike Franklin</td><td>1652 Stockton St, San Francisco</td></tr>\n",
"    <tr><th>23</th><td>23</td><td>a5</td><td>b4</td><td>Alphonse Kemper</td><td>1702 Post Street, San Francisco</td><td>Joseph Kuan</td><td>108 South Park, San Francisco</td></tr>\n",
"    <tr><th>24</th><td>24</td><td>a5</td><td>b6</td><td>Alphonse Kemper</td><td>1702 Post Street, San Francisco</td><td>Michael Brodie</td><td>133 Clement Street, San Francisco</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " _id ltable_ID rtable_ID ltable_name \\\n", "0 0 a1 b1 Kevin Smith \n", "1 1 a1 b2 Kevin Smith \n", "2 2 a1 b3 Kevin Smith \n", "3 3 a1 b4 Kevin Smith \n", "4 4 a1 b6 Kevin Smith \n", "5 5 a2 b1 Michael Franklin \n", "6 6 a2 b2 Michael Franklin \n", "7 7 a2 b3 Michael Franklin \n", "8 8 a2 b4 Michael Franklin \n", "9 9 a2 b6 Michael Franklin \n", "10 10 a3 b1 William Bridge \n", "11 11 a3 b2 William Bridge \n", "12 12 a3 b3 William Bridge \n", "13 13 a3 b4 William Bridge \n", "14 14 a3 b6 William Bridge \n", "15 15 a4 b1 Binto George \n", "16 16 a4 b2 Binto George \n", "17 17 a4 b3 Binto George \n", "18 18 a4 b4 Binto George \n", "19 19 a4 b6 Binto George \n", "20 20 a5 b1 Alphonse Kemper \n", "21 21 a5 b2 Alphonse Kemper \n", "22 22 a5 b3 Alphonse Kemper \n", "23 23 a5 b4 Alphonse Kemper \n", "24 24 a5 b6 Alphonse Kemper \n", "\n", " ltable_address rtable_name \\\n", "0 607 From St, San Francisco Mark Levene \n", "1 607 From St, San Francisco Bill Bridge \n", "2 607 From St, San Francisco Mike Franklin \n", "3 607 From St, San Francisco Joseph Kuan \n", "4 607 From St, San Francisco Michael Brodie \n", "5 1652 Stockton St, San Francisco Mark Levene \n", "6 1652 Stockton St, San Francisco Bill Bridge \n", "7 1652 Stockton St, San Francisco Mike Franklin \n", "8 1652 Stockton St, San Francisco Joseph Kuan \n", "9 1652 Stockton St, San Francisco Michael Brodie \n", "10 3131 Webster St, San Francisco Mark Levene \n", "11 3131 Webster St, San Francisco Bill Bridge \n", "12 3131 Webster St, San Francisco Mike Franklin \n", "13 3131 Webster St, San Francisco Joseph Kuan \n", "14 3131 Webster St, San Francisco Michael Brodie \n", "15 423 Powell St, San Francisco Mark Levene \n", "16 423 Powell St, San Francisco Bill Bridge \n", "17 423 Powell St, San Francisco Mike Franklin \n", "18 423 Powell St, San Francisco Joseph Kuan \n", "19 423 Powell St, San Francisco Michael Brodie \n", "20 1702 Post Street, San Francisco Mark Levene \n", "21 1702 Post 
Street, San Francisco Bill Bridge \n", "22 1702 Post Street, San Francisco Mike Franklin \n", "23 1702 Post Street, San Francisco Joseph Kuan \n", "24 1702 Post Street, San Francisco Michael Brodie \n", "\n", " rtable_address \n", "0 108 Clement St, San Francisco \n", "1 3131 Webster St, San Francisco \n", "2 1652 Stockton St, San Francisco \n", "3 108 South Park, San Francisco \n", "4 133 Clement Street, San Francisco \n", "5 108 Clement St, San Francisco \n", "6 3131 Webster St, San Francisco \n", "7 1652 Stockton St, San Francisco \n", "8 108 South Park, San Francisco \n", "9 133 Clement Street, San Francisco \n", "10 108 Clement St, San Francisco \n", "11 3131 Webster St, San Francisco \n", "12 1652 Stockton St, San Francisco \n", "13 108 South Park, San Francisco \n", "14 133 Clement Street, San Francisco \n", "15 108 Clement St, San Francisco \n", "16 3131 Webster St, San Francisco \n", "17 1652 Stockton St, San Francisco \n", "18 108 South Park, San Francisco \n", "19 133 Clement Street, San Francisco \n", "20 108 Clement St, San Francisco \n", "21 3131 Webster St, San Francisco \n", "22 1652 Stockton St, San Francisco \n", "23 108 South Park, San Francisco \n", "24 133 Clement Street, San Francisco " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Block Candidate Set\n", "\n", "First, define a blackbox function that blocks a tuple pair when the last names differ:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def name_name_function(x, y):\n", "    # x and y are pandas Series, one tuple from each table\n", "\n", "    # get the name attribute\n", "    x_name = x['name']\n", "    y_name = y['name']\n", "    # the last name is the last space-separated token\n", "    x_last = x_name.split(' ')[-1]\n", "    y_last = y_name.split(' ')[-1]\n", "    # block the pair (return True) if the last names differ\n", "    return x_last != y_last" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { 
"collapsed": true }, "outputs": [], "source": [ "# Instantiate blackbox blocker\n", "bb = em.BlackBoxBlocker()\n", "# Set the black box function\n", "bb.set_black_box_function(name_name_function)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "0% 100%\n", "[#########################] | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] } ], "source": [ "D = bb.block_candset(C)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>_id</th><th>ltable_ID</th><th>rtable_ID</th><th>ltable_name</th><th>ltable_address</th><th>rtable_name</th><th>rtable_address</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>7</th><td>7</td><td>a2</td><td>b3</td><td>Michael Franklin</td><td>1652 Stockton St, San Francisco</td><td>Mike Franklin</td><td>1652 Stockton St, San Francisco</td></tr>\n",
"    <tr><th>11</th><td>11</td><td>a3</td><td>b2</td><td>William Bridge</td><td>3131 Webster St, San Francisco</td><td>Bill Bridge</td><td>3131 Webster St, San Francisco</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " _id ltable_ID rtable_ID ltable_name \\\n", "7 7 a2 b3 Michael Franklin \n", "11 11 a3 b2 William Bridge \n", "\n", " ltable_address rtable_name \\\n", "7 1652 Stockton St, San Francisco Mike Franklin \n", "11 3131 Webster St, San Francisco Bill Bridge \n", "\n", " rtable_address \n", "7 1652 Stockton St, San Francisco \n", "11 3131 Webster St, San Francisco " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "D" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Block Two tuples To Check If a Tuple Pair Would Get Blocked" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, define the black box function first" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def address_address_function(x, y):\n", " # x, y will be of type pandas series\n", " \n", " # get name attribute\n", " x_address = x['address']\n", " y_address = y['address']\n", " # get the city\n", " x_split, y_split = x_address.split(','), y_address.split(',')\n", " x_city = x_split[len(x_split) - 1]\n", " y_city = y_split[len(y_split) - 1]\n", " # check if the cities match\n", " if x_city != y_city:\n", " return True\n", " else:\n", " return False" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Instantiate blackabox blocker\n", "bb = em.BlackBoxBlocker()\n", "# Set the blackbox function \n", "bb.set_black_box_function(address_address_function)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
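<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>ID</th><th>name</th><th>birth_year</th><th>hourly_wage</th><th>address</th><th>zipcode</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>a1</td><td>Kevin Smith</td><td>1989</td><td>30.0</td><td>607 From St, San Francisco</td><td>94107</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>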
" ], "text/plain": [ " ID name birth_year hourly_wage address \\\n", "0 a1 Kevin Smith 1989 30.0 607 From St, San Francisco \n", "\n", " zipcode \n", "0 94107 " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A.ix[[0]]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
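<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>ID</th><th>name</th><th>birth_year</th><th>hourly_wage</th><th>address</th><th>zipcode</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>b1</td><td>Mark Levene</td><td>1987</td><td>29.5</td><td>108 Clement St, San Francisco</td><td>94107</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>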
" ], "text/plain": [ " ID name birth_year hourly_wage address \\\n", "0 b1 Mark Levene 1987 29.5 108 Clement St, San Francisco \n", "\n", " zipcode \n", "0 94107 " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B.ix[[0]]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "False\n" ] } ], "source": [ "status = bb.block_tuples(A.ix[0], B.ix[0])\n", "\n", "print(status)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }