{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Blocking is typically done to reduce the number of tuple pairs considered for matching. There are several blocking methods proposed. The *py_entitymatching* package supports a subset of such blocking methods (#ref to what is supported). One such supported blocker is attribute equivalence blocker. This IPython notebook illustrates how to perform blocking using attribute equivalence blocker." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to import *py_entitymatching* package and other libraries as follows:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%load_ext autotime" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "nbpresent": { "id": "9a89351b-e44f-47ad-afff-b148744173af" } }, "outputs": [], "source": [ "# Import py_entitymatching package\n", "import py_entitymatching as em\n", "import os\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "0f59e4ac-032a-4d59-9172-8ee653831acb" } }, "source": [ "Then, read the input tablse from the datasets directory" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "nbpresent": { "id": "2401accd-3160-4b07-aed4-de2f9a4dea35" } }, "outputs": [], "source": [ "# Get the datasets directory\n", "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n", "\n", "# Get the paths of the input tables\n", "path_A = datasets_dir + os.sep + 'person_table_A.csv'\n", "path_B = datasets_dir + os.sep + 'person_table_B.csv'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true, "nbpresent": { "id": "e51b3877-75c8-431d-bdbd-77362bbf2191" } }, "outputs": [], "source": [ "# Read the CSV files and set 'ID' as the key attribute\n", "A = em.read_csv_metadata(path_A, key='ID')\n", "B = em.read_csv_metadata(path_B, key='ID')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "nbpresent": { "id": "ac2eb60b-bf26-4a6a-a453-120cc7f660c4" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDnamebirth_yearhourly_wageaddresszipcode
0a1Kevin Smith198930.0607 From St, San Francisco94107
1a2Michael Franklin198827.51652 Stockton St, San Francisco94122
2a3William Bridge198632.03131 Webster St, San Francisco94107
3a4Binto George198732.5423 Powell St, San Francisco94122
4a5Alphonse Kemper198435.01702 Post Street, San Francisco94122
\n", "
" ], "text/plain": [ " ID name birth_year hourly_wage \\\n", "0 a1 Kevin Smith 1989 30.0 \n", "1 a2 Michael Franklin 1988 27.5 \n", "2 a3 William Bridge 1986 32.0 \n", "3 a4 Binto George 1987 32.5 \n", "4 a5 Alphonse Kemper 1984 35.0 \n", "\n", " address zipcode \n", "0 607 From St, San Francisco 94107 \n", "1 1652 Stockton St, San Francisco 94122 \n", "2 3131 Webster St, San Francisco 94107 \n", "3 423 Powell St, San Francisco 94122 \n", "4 1702 Post Street, San Francisco 94122 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "nbpresent": { "id": "afadc046-692d-42ad-9c72-523493597682" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDnamebirth_yearhourly_wageaddresszipcode
0b1Mark Levene198729.5108 Clement St, San Francisco94107
1b2Bill Bridge198632.03131 Webster St, San Francisco94107
2b3Mike Franklin198827.51652 Stockton St, San Francisco94122
3b4Joseph Kuan198226.0108 South Park, San Francisco94122
4b5Alfons Kemper198435.0170 Post St, Apt 4, San Francisco94122
\n", "
" ], "text/plain": [ " ID name birth_year hourly_wage \\\n", "0 b1 Mark Levene 1987 29.5 \n", "1 b2 Bill Bridge 1986 32.0 \n", "2 b3 Mike Franklin 1988 27.5 \n", "3 b4 Joseph Kuan 1982 26.0 \n", "4 b5 Alfons Kemper 1984 35.0 \n", "\n", " address zipcode \n", "0 108 Clement St, San Francisco 94107 \n", "1 3131 Webster St, San Francisco 94107 \n", "2 1652 Stockton St, San Francisco 94122 \n", "3 108 South Park, San Francisco 94122 \n", "4 170 Post St, Apt 4, San Francisco 94122 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B.head()" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "ca7f0c34-9c21-4b6e-8010-eda1030df041" } }, "source": [ "# Different Ways to Block Using Attribute Equivalence Blocker\n", "\n", "Once the tables are read, we can do blocking using attribute equivalence blocker." ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "45585660-2ba9-4211-adee-ace14fe8745f" } }, "source": [ "There are three different ways to do attribute equivalence blocking:\n", "\n", "1. Block two tables to produce a candidate set of tuple pairs.\n", "2. Block a candidate set of tuple pairs to typically produce a reduced candidate set of tuple pairs.\n", "3. Block two tuples to check if a tuple pair would get blocked." ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "726bd6c9-23a5-4543-a201-f84864433f20" } }, "source": [ "## Block Tables to Produce a Candidate Set of Tuple Pairs" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true, "nbpresent": { "id": "d2001a06-fe74-4ebd-896c-992803828753" } }, "outputs": [], "source": [ "# Instantiate attribute equivalence blocker object\n", "ab = em.AttrEquivalenceBlocker()" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "06bcba06-ff85-43b7-bb04-ce1566a29dd9" } }, "source": [ "For the given two tables, we will assume that two persons with different `zipcode` values do not refer to the same real world person. So, we apply attribute equivalence blocking on `zipcode`. That is, we block all the tuple pairs that have different zipcodes." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true, "nbpresent": { "id": "2cdc68f4-5874-43d1-a378-b0bd31552ef8" } }, "outputs": [], "source": [ "# Use block_tables to apply blocking over two input tables.\n", "C1 = ab.block_tables(A, B, \n", " l_block_attr='zipcode', r_block_attr='zipcode', \n", " l_output_attrs=['name', 'birth_year', 'zipcode'],\n", " r_output_attrs=['name', 'birth_year', 'zipcode'],\n", " l_output_prefix='l_', r_output_prefix='r_')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "nbpresent": { "id": "7b4967f5-2f99-4394-bfe9-29ff15334d39" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idl_IDr_IDl_namel_birth_yearl_zipcoder_namer_birth_yearr_zipcode
00a1b1Kevin Smith198994107Mark Levene198794107
11a1b2Kevin Smith198994107Bill Bridge198694107
22a1b6Kevin Smith198994107Michael Brodie198794107
33a3b1William Bridge198694107Mark Levene198794107
44a3b2William Bridge198694107Bill Bridge198694107
\n", "
" ], "text/plain": [ " _id l_ID r_ID l_name l_birth_year l_zipcode r_name \\\n", "0 0 a1 b1 Kevin Smith 1989 94107 Mark Levene \n", "1 1 a1 b2 Kevin Smith 1989 94107 Bill Bridge \n", "2 2 a1 b6 Kevin Smith 1989 94107 Michael Brodie \n", "3 3 a3 b1 William Bridge 1986 94107 Mark Levene \n", "4 4 a3 b2 William Bridge 1986 94107 Bill Bridge \n", "\n", " r_birth_year r_zipcode \n", "0 1987 94107 \n", "1 1986 94107 \n", "2 1987 94107 \n", "3 1987 94107 \n", "4 1986 94107 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the candidate set of tuple pairs\n", "C1.head()" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "386dbb7d-085e-4946-b376-93e686eb62f5" } }, "source": [ "Note that the tuple pairs in the candidate set have the same zipcode. \n", "\n", "The attributes included in the candidate set are based on l_output_attrs and r_output_attrs mentioned in block_tables command (the key columns are included by default). Specifically, the list of attributes mentioned in l_output_attrs are picked from table A and the list of attributes mentioned in r_output_attrs are picked from table B. The attributes in the candidate set are prefixed based on l_output_prefix and r_ouptut_prefix parameter values mentioned in block_tables command." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "nbpresent": { "id": "fa6af6e5-471b-4296-98f4-1f4d4ee2869a" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id: 4565680872\n", "fk_rtable: r_ID\n", "rtable(obj.id): 4565204272\n", "fk_ltable: l_ID\n", "key: _id\n", "ltable(obj.id): 4565203432\n" ] } ], "source": [ "# Show the metadata of C1\n", "em.show_properties(C1)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "nbpresent": { "id": "6ec70bd1-adea-40af-9f30-304f6236c5ca" } }, "outputs": [ { "data": { "text/plain": [ "(4565203432, 4565204272)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id(A), id(B)" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "51679b6c-2667-4ed9-88aa-caff4c81edbf" } }, "source": [ "Note that the metadata of C1 includes key, foreign key to the left and right tables (i.e A and B) and pointers to left and right tables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling Missing Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the input tuples have missing values in the blocking attribute, then they are ignored by default. This is because, including all possible tuple pairs with missing values can significantly increase the size of the candidate set. But if you want to include them, then you can set `allow_missing` paramater to be True." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Introduce some missing values\n", "A1 = em.read_csv_metadata(path_A, key='ID')\n", "A1.ix[0, 'zipcode'] = pd.np.NaN\n", "A1.ix[0, 'birth_year'] = pd.np.NaN" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDnamebirth_yearhourly_wageaddresszipcode
0a1Kevin SmithNaN30.0607 From St, San FranciscoNaN
1a2Michael Franklin1988.027.51652 Stockton St, San Francisco94122.0
2a3William Bridge1986.032.03131 Webster St, San Francisco94107.0
3a4Binto George1987.032.5423 Powell St, San Francisco94122.0
4a5Alphonse Kemper1984.035.01702 Post Street, San Francisco94122.0
\n", "
" ], "text/plain": [ " ID name birth_year hourly_wage \\\n", "0 a1 Kevin Smith NaN 30.0 \n", "1 a2 Michael Franklin 1988.0 27.5 \n", "2 a3 William Bridge 1986.0 32.0 \n", "3 a4 Binto George 1987.0 32.5 \n", "4 a5 Alphonse Kemper 1984.0 35.0 \n", "\n", " address zipcode \n", "0 607 From St, San Francisco NaN \n", "1 1652 Stockton St, San Francisco 94122.0 \n", "2 3131 Webster St, San Francisco 94107.0 \n", "3 423 Powell St, San Francisco 94122.0 \n", "4 1702 Post Street, San Francisco 94122.0 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A1" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Use block_tables to apply blocking over two input tables.\n", "C2 = ab.block_tables(A1, B, \n", " l_block_attr='zipcode', r_block_attr='zipcode', \n", " l_output_attrs=['name', 'birth_year', 'zipcode'],\n", " r_output_attrs=['name', 'birth_year', 'zipcode'],\n", " l_output_prefix='l_', r_output_prefix='r_', \n", " allow_missing=True) # setting allow_missing parameter to True" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(15, 18)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(C1), len(C2)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idl_IDr_IDl_namel_birth_yearl_zipcoder_namer_birth_yearr_zipcode
00a2b3Michael Franklin1988.094122.0Mike Franklin198894122.0
11a2b4Michael Franklin1988.094122.0Joseph Kuan198294122.0
22a2b5Michael Franklin1988.094122.0Alfons Kemper198494122.0
33a4b3Binto George1987.094122.0Mike Franklin198894122.0
44a4b4Binto George1987.094122.0Joseph Kuan198294122.0
55a4b5Binto George1987.094122.0Alfons Kemper198494122.0
66a5b3Alphonse Kemper1984.094122.0Mike Franklin198894122.0
77a5b4Alphonse Kemper1984.094122.0Joseph Kuan198294122.0
88a5b5Alphonse Kemper1984.094122.0Alfons Kemper198494122.0
99a3b1William Bridge1986.094107.0Mark Levene198794107.0
1010a3b2William Bridge1986.094107.0Bill Bridge198694107.0
1111a3b6William Bridge1986.094107.0Michael Brodie198794107.0
1212a1b1Kevin SmithNaNNaNMark Levene198794107.0
1313a1b2Kevin SmithNaNNaNBill Bridge198694107.0
1414a1b3Kevin SmithNaNNaNMike Franklin198894122.0
1515a1b4Kevin SmithNaNNaNJoseph Kuan198294122.0
1616a1b5Kevin SmithNaNNaNAlfons Kemper198494122.0
1717a1b6Kevin SmithNaNNaNMichael Brodie198794107.0
\n", "
" ], "text/plain": [ " _id l_ID r_ID l_name l_birth_year l_zipcode r_name \\\n", "0 0 a2 b3 Michael Franklin 1988.0 94122.0 Mike Franklin \n", "1 1 a2 b4 Michael Franklin 1988.0 94122.0 Joseph Kuan \n", "2 2 a2 b5 Michael Franklin 1988.0 94122.0 Alfons Kemper \n", "3 3 a4 b3 Binto George 1987.0 94122.0 Mike Franklin \n", "4 4 a4 b4 Binto George 1987.0 94122.0 Joseph Kuan \n", "5 5 a4 b5 Binto George 1987.0 94122.0 Alfons Kemper \n", "6 6 a5 b3 Alphonse Kemper 1984.0 94122.0 Mike Franklin \n", "7 7 a5 b4 Alphonse Kemper 1984.0 94122.0 Joseph Kuan \n", "8 8 a5 b5 Alphonse Kemper 1984.0 94122.0 Alfons Kemper \n", "9 9 a3 b1 William Bridge 1986.0 94107.0 Mark Levene \n", "10 10 a3 b2 William Bridge 1986.0 94107.0 Bill Bridge \n", "11 11 a3 b6 William Bridge 1986.0 94107.0 Michael Brodie \n", "12 12 a1 b1 Kevin Smith NaN NaN Mark Levene \n", "13 13 a1 b2 Kevin Smith NaN NaN Bill Bridge \n", "14 14 a1 b3 Kevin Smith NaN NaN Mike Franklin \n", "15 15 a1 b4 Kevin Smith NaN NaN Joseph Kuan \n", "16 16 a1 b5 Kevin Smith NaN NaN Alfons Kemper \n", "17 17 a1 b6 Kevin Smith NaN NaN Michael Brodie \n", "\n", " r_birth_year r_zipcode \n", "0 1988 94122.0 \n", "1 1982 94122.0 \n", "2 1984 94122.0 \n", "3 1988 94122.0 \n", "4 1982 94122.0 \n", "5 1984 94122.0 \n", "6 1988 94122.0 \n", "7 1982 94122.0 \n", "8 1984 94122.0 \n", "9 1987 94107.0 \n", "10 1986 94107.0 \n", "11 1987 94107.0 \n", "12 1987 94107.0 \n", "13 1986 94107.0 \n", "14 1988 94122.0 \n", "15 1982 94122.0 \n", "16 1984 94122.0 \n", "17 1987 94107.0 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The candidate set C2 includes all possible tuple pairs with missing values." ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "10a59253-c303-4420-a12b-e99a12ed3e2d" } }, "source": [ "## Block a Candidate Set of Tuple Pairs" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "6e4a3e7e-3019-4597-8a85-4b3803c47623" } }, "source": [ "In the above, we see that the candidate set produced after blocking over input tables include tuple pairs that have different birth years. We will assume that two persons with different birth years cannot refer to the same person. So, we block the candidate set of tuple pairs on `birth_year`. That is, we block all the tuple pairs that have different birth years." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false, "nbpresent": { "id": "1b494eac-63c8-4fe6-9a13-76c482dfccb7" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "0% 100%\n", "[###############] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00" ] }, { "name": "stdout", "output_type": "stream", "text": [ "time: 45.3 ms\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n", "Total time elapsed: 00:00:00\n" ] } ], "source": [ "# Instantiate Attr. Equivalence Blocker\n", "ab = em.AttrEquivalenceBlocker()\n", "# Use block_tables to apply blocking over two input tables.\n", "C3 = ab.block_candset(C1, l_block_attr='birth_year', r_block_attr='birth_year')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "nbpresent": { "id": "fb465204-b14c-4e45-b9cd-00306acbd93a" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idl_IDr_IDl_namel_birth_yearl_zipcoder_namer_birth_yearr_zipcode
44a3b2William Bridge198694107Bill Bridge198694107
66a2b3Michael Franklin198894122Mike Franklin198894122
1414a5b5Alphonse Kemper198494122Alfons Kemper198494122
\n", "
" ], "text/plain": [ " _id l_ID r_ID l_name l_birth_year l_zipcode r_name \\\n", "4 4 a3 b2 William Bridge 1986 94107 Bill Bridge \n", "6 6 a2 b3 Michael Franklin 1988 94122 Mike Franklin \n", "14 14 a5 b5 Alphonse Kemper 1984 94122 Alfons Kemper \n", "\n", " r_birth_year r_zipcode \n", "4 1986 94107 \n", "6 1988 94122 \n", "14 1984 94122 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C3.head()" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "bb198ef6-9e92-40f9-a8f1-849b52283295" } }, "source": [ "Note that, the tuple pairs in the resulting candidate set have the same birth year. \n", "\n", "The attributes included in the resulting candidate set are based on the input candidate set (i.e the same attributes are retained)." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "nbpresent": { "id": "308a90fc-353a-416b-9553-33558a4ab44a" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id: 4565765144\n", "fk_rtable: r_ID\n", "rtable(obj.id): 4565204272\n", "fk_ltable: l_ID\n", "key: _id\n", "ltable(obj.id): 4565203432\n" ] } ], "source": [ "# Show the metadata of C1\n", "em.show_properties(C3)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "nbpresent": { "id": "c0f515f6-4e0c-4389-a921-dfc3a9842d69" } }, "outputs": [ { "data": { "text/plain": [ "(4565203432, 4565204272)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id(A), id(B)" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "fbeae903-9041-4540-9781-cbe79adbae7c" } }, "source": [ "As we saw earlier the metadata of C3 includes the same metadata as C1. That is, it includes key, foreign key to the left and right tables (i.e A and B) and pointers to left and right tables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling Missing Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the tuple pairs included in the candidate set have missing values in the blocking attribute, then they are ignored by default. This is because, including all possible tuple pairs with missing values can significantly increase the size of the candidate set. But if you want to include them, then you can set `allow_missing` paramater to be True." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idl_IDr_IDl_namel_birth_yearl_zipcoder_namer_birth_yearr_zipcode
00a2b3Michael Franklin1988.094122.0Mike Franklin198894122.0
11a2b4Michael Franklin1988.094122.0Joseph Kuan198294122.0
22a2b5Michael Franklin1988.094122.0Alfons Kemper198494122.0
33a4b3Binto George1987.094122.0Mike Franklin198894122.0
44a4b4Binto George1987.094122.0Joseph Kuan198294122.0
55a4b5Binto George1987.094122.0Alfons Kemper198494122.0
66a5b3Alphonse Kemper1984.094122.0Mike Franklin198894122.0
77a5b4Alphonse Kemper1984.094122.0Joseph Kuan198294122.0
88a5b5Alphonse Kemper1984.094122.0Alfons Kemper198494122.0
99a3b1William Bridge1986.094107.0Mark Levene198794107.0
1010a3b2William Bridge1986.094107.0Bill Bridge198694107.0
1111a3b6William Bridge1986.094107.0Michael Brodie198794107.0
1212a1b1Kevin SmithNaNNaNMark Levene198794107.0
1313a1b2Kevin SmithNaNNaNBill Bridge198694107.0
1414a1b3Kevin SmithNaNNaNMike Franklin198894122.0
1515a1b4Kevin SmithNaNNaNJoseph Kuan198294122.0
1616a1b5Kevin SmithNaNNaNAlfons Kemper198494122.0
1717a1b6Kevin SmithNaNNaNMichael Brodie198794107.0
\n", "
" ], "text/plain": [ " _id l_ID r_ID l_name l_birth_year l_zipcode r_name \\\n", "0 0 a2 b3 Michael Franklin 1988.0 94122.0 Mike Franklin \n", "1 1 a2 b4 Michael Franklin 1988.0 94122.0 Joseph Kuan \n", "2 2 a2 b5 Michael Franklin 1988.0 94122.0 Alfons Kemper \n", "3 3 a4 b3 Binto George 1987.0 94122.0 Mike Franklin \n", "4 4 a4 b4 Binto George 1987.0 94122.0 Joseph Kuan \n", "5 5 a4 b5 Binto George 1987.0 94122.0 Alfons Kemper \n", "6 6 a5 b3 Alphonse Kemper 1984.0 94122.0 Mike Franklin \n", "7 7 a5 b4 Alphonse Kemper 1984.0 94122.0 Joseph Kuan \n", "8 8 a5 b5 Alphonse Kemper 1984.0 94122.0 Alfons Kemper \n", "9 9 a3 b1 William Bridge 1986.0 94107.0 Mark Levene \n", "10 10 a3 b2 William Bridge 1986.0 94107.0 Bill Bridge \n", "11 11 a3 b6 William Bridge 1986.0 94107.0 Michael Brodie \n", "12 12 a1 b1 Kevin Smith NaN NaN Mark Levene \n", "13 13 a1 b2 Kevin Smith NaN NaN Bill Bridge \n", "14 14 a1 b3 Kevin Smith NaN NaN Mike Franklin \n", "15 15 a1 b4 Kevin Smith NaN NaN Joseph Kuan \n", "16 16 a1 b5 Kevin Smith NaN NaN Alfons Kemper \n", "17 17 a1 b6 Kevin Smith NaN NaN Michael Brodie \n", "\n", " r_birth_year r_zipcode \n", "0 1988 94122.0 \n", "1 1982 94122.0 \n", "2 1984 94122.0 \n", "3 1988 94122.0 \n", "4 1982 94122.0 \n", "5 1984 94122.0 \n", "6 1988 94122.0 \n", "7 1982 94122.0 \n", "8 1984 94122.0 \n", "9 1987 94107.0 \n", "10 1986 94107.0 \n", "11 1987 94107.0 \n", "12 1987 94107.0 \n", "13 1986 94107.0 \n", "14 1988 94122.0 \n", "15 1982 94122.0 \n", "16 1984 94122.0 \n", "17 1987 94107.0 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display C2 (got by blocking over A1 and B)\n", "C2" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id: 4565567248\n", "fk_rtable: r_ID\n", "rtable(obj.id): 4565204272\n", "fk_ltable: l_ID\n", "key: _id\n", "ltable(obj.id): 4565680928\n" ] } ], "source": [ "em.show_properties(C2)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id: 4565680928\n", "key: ID\n" ] } ], "source": [ "em.show_properties(A1) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that `A1` is the left table to `C2`." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDnamebirth_yearhourly_wageaddresszipcode
0a1Kevin SmithNaN30.0607 From St, San FranciscoNaN
1a2Michael Franklin1988.027.51652 Stockton St, San Francisco94122.0
2a3William Bridge1986.032.03131 Webster St, San Francisco94107.0
3a4Binto George1987.032.5423 Powell St, San Francisco94122.0
4a5Alphonse Kemper1984.035.01702 Post Street, San Francisco94122.0
\n", "
" ], "text/plain": [ " ID name birth_year hourly_wage \\\n", "0 a1 Kevin Smith NaN 30.0 \n", "1 a2 Michael Franklin 1988.0 27.5 \n", "2 a3 William Bridge 1986.0 32.0 \n", "3 a4 Binto George 1987.0 32.5 \n", "4 a5 Alphonse Kemper 1984.0 35.0 \n", "\n", " address zipcode \n", "0 607 From St, San Francisco NaN \n", "1 1652 Stockton St, San Francisco 94122.0 \n", "2 3131 Webster St, San Francisco 94107.0 \n", "3 423 Powell St, San Francisco 94122.0 \n", "4 1702 Post Street, San Francisco 94122.0 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A1.head()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "0% 100%\n", "[##################] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] } ], "source": [ "C4 = ab.block_candset(C2, l_block_attr='birth_year', r_block_attr='birth_year', allow_missing=False)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idl_IDr_IDl_namel_birth_yearl_zipcoder_namer_birth_yearr_zipcode
00a2b3Michael Franklin1988.094122.0Mike Franklin198894122.0
88a5b5Alphonse Kemper1984.094122.0Alfons Kemper198494122.0
1010a3b2William Bridge1986.094107.0Bill Bridge198694107.0
\n", "
" ], "text/plain": [ " _id l_ID r_ID l_name l_birth_year l_zipcode r_name \\\n", "0 0 a2 b3 Michael Franklin 1988.0 94122.0 Mike Franklin \n", "8 8 a5 b5 Alphonse Kemper 1984.0 94122.0 Alfons Kemper \n", "10 10 a3 b2 William Bridge 1986.0 94107.0 Bill Bridge \n", "\n", " r_birth_year r_zipcode \n", "0 1988 94122.0 \n", "8 1984 94122.0 \n", "10 1986 94107.0 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C4" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "0% 100%\n", "[##################] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] } ], "source": [ "# Set allow_missing to True\n", "C5 = ab.block_candset(C2, l_block_attr='birth_year', r_block_attr='birth_year', allow_missing=True) " ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(3, 9)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(C4), len(C5)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idl_IDr_IDl_namel_birth_yearl_zipcoder_namer_birth_yearr_zipcode
00a2b3Michael Franklin1988.094122.0Mike Franklin198894122.0
88a5b5Alphonse Kemper1984.094122.0Alfons Kemper198494122.0
1010a3b2William Bridge1986.094107.0Bill Bridge198694107.0
1212a1b1Kevin SmithNaNNaNMark Levene198794107.0
1313a1b2Kevin SmithNaNNaNBill Bridge198694107.0
1414a1b3Kevin SmithNaNNaNMike Franklin198894122.0
1515a1b4Kevin SmithNaNNaNJoseph Kuan198294122.0
1616a1b5Kevin SmithNaNNaNAlfons Kemper198494122.0
1717a1b6Kevin SmithNaNNaNMichael Brodie198794107.0
\n", "
" ], "text/plain": [ " _id l_ID r_ID l_name l_birth_year l_zipcode r_name \\\n", "0 0 a2 b3 Michael Franklin 1988.0 94122.0 Mike Franklin \n", "8 8 a5 b5 Alphonse Kemper 1984.0 94122.0 Alfons Kemper \n", "10 10 a3 b2 William Bridge 1986.0 94107.0 Bill Bridge \n", "12 12 a1 b1 Kevin Smith NaN NaN Mark Levene \n", "13 13 a1 b2 Kevin Smith NaN NaN Bill Bridge \n", "14 14 a1 b3 Kevin Smith NaN NaN Mike Franklin \n", "15 15 a1 b4 Kevin Smith NaN NaN Joseph Kuan \n", "16 16 a1 b5 Kevin Smith NaN NaN Alfons Kemper \n", "17 17 a1 b6 Kevin Smith NaN NaN Michael Brodie \n", "\n", " r_birth_year r_zipcode \n", "0 1988 94122.0 \n", "8 1984 94122.0 \n", "10 1986 94107.0 \n", "12 1987 94107.0 \n", "13 1986 94107.0 \n", "14 1988 94122.0 \n", "15 1982 94122.0 \n", "16 1984 94122.0 \n", "17 1987 94107.0 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C5" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "186ec555-850c-41a3-a167-9a57d63c2b45" } }, "source": [ "## Block Two tuples To Check If a Tuple Pair Would Get Blocked" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "c516958b-85c4-47d1-8927-0dcbaf942b1c" } }, "source": [ "We can apply attribute equivalence blocking to a tuple pair to check if it is going to get blocked. For example, we can check if the first tuple from A and B will get blocked if we block on `zipcode`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false, "nbpresent": { "id": "62731f40-a32c-44e0-958d-a4b5c07534cb" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDnamebirth_yearhourly_wageaddresszipcode
0a1Kevin Smith198930.0607 From St, San Francisco94107
\n", "
" ], "text/plain": [ " ID name birth_year hourly_wage address \\\n", "0 a1 Kevin Smith 1989 30.0 607 From St, San Francisco \n", "\n", " zipcode \n", "0 94107 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the first tuple from table A\n", "A.ix[[0]]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false, "nbpresent": { "id": "af99c211-c53a-44c1-93b6-889fa962a0b7" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDnamebirth_yearhourly_wageaddresszipcode
0b1Mark Levene198729.5108 Clement St, San Francisco94107
\n", "
" ], "text/plain": [ " ID name birth_year hourly_wage address \\\n", "0 b1 Mark Levene 1987 29.5 108 Clement St, San Francisco \n", "\n", " zipcode \n", "0 94107 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the first tuple from table B\n", "B.ix[[0]]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false, "nbpresent": { "id": "13947a8a-7ffe-49d0-b3ba-d6fe2c5e0d5b" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "False\n" ] } ], "source": [ "# Instantiate Attr. Equivalence Blocker\n", "ab = em.AttrEquivalenceBlocker()\n", "\n", "# Apply blocking to a tuple pair from the input tables on zipcode and get blocking status\n", "status = ab.block_tuples(A.ix[0], B.ix[0], l_block_attr='zipcode', r_block_attr='zipcode')\n", "\n", "# Print the blocking status\n", "print(status)" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "aa4dee9f-fab2-4fef-b3c5-f8fc03fba405" } }, "source": [ "The above result says that the tuple pair will not be blocked, i.e. this tuple pair will be included in the candidate set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Currently, block_tuples command does not handle missing values" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }