{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Contents\n", "========\n", "- [Introduction](#Introduction)\n", "- [Block Using the Sorted Neighborhood Blocker](#Block-Using-the-Sorted-Neighborhood-Blocker)\n", "    - [Block Tables to Produce a Candidate Set of Tuple Pairs](#Block-Tables-to-Produce-a-Candidate-Set-of-Tuple-Pairs)\n", "    - [Handling Missing Values](#Handling-Missing-Values)\n", "    - [Window Size](#Window-Size)\n", "    - [Stable Sort Order](#Stable-Sort-Order)\n", "- [Sorted Neighborhood Blocker Limitations](#Sorted-Neighborhood-Blocker-limitations)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "WARNING: The sorted neighborhood blocker is still experimental and has not been fully tested. Use this blocker at your own risk.\n", "\n", "Blocking is typically done to reduce the number of tuple pairs considered for matching. Several blocking methods have been proposed, and the *py_entitymatching* package supports a subset of them. One supported blocker is the sorted neighborhood blocker. This notebook illustrates how to perform blocking with it.\n", "\n", "Note that sorted neighborhood blocking is often applied to a single table. Here it is implemented between two tables. We first tag each tuple with whether it comes from the left table or the right table, and then merge the two tables. We then perform sorted neighborhood blocking: the merged dataset is sorted on the blocking attribute and a sliding window of `window_size` (default 2) is passed across it. Within each window position, all tuple pairs that have one tuple from the left table and one tuple from the right table are returned." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to import the *py_entitymatching* package and other libraries as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "nbpresent": { "id": "9a89351b-e44f-47ad-afff-b148744173af" } }, "outputs": [], "source": [ "# Import py_entitymatching and supporting libraries\n", "import py_entitymatching as em\n", "import os\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "0f59e4ac-032a-4d59-9172-8ee653831acb" } }, "source": [ "Then, read the input tables from the datasets directory:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "nbpresent": { "id": "2401accd-3160-4b07-aed4-de2f9a4dea35" } }, "outputs": [], "source": [ "# Get the datasets directory\n", "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n", "\n", "# Get the paths of the input tables\n", "path_A = datasets_dir + os.sep + 'person_table_A.csv'\n", "path_B = datasets_dir + os.sep + 'person_table_B.csv'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "nbpresent": { "id": "e51b3877-75c8-431d-bdbd-77362bbf2191" } }, "outputs": [], "source": [ "# Read the CSV files and set 'ID' as the key attribute\n", "A = em.read_csv_metadata(path_A, key='ID')\n", "B = em.read_csv_metadata(path_B, key='ID')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "nbpresent": { "id": "ac2eb60b-bf26-4a6a-a453-120cc7f660c4" } }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
 "<thead>\n",
 "<tr><th></th><th>ID</th><th>name</th><th>birth_year</th><th>hourly_wage</th><th>address</th><th>zipcode</th></tr>\n",
 "</thead>\n",
 "<tbody>\n",
 "<tr><th>0</th><td>a1</td><td>Kevin Smith</td><td>1989</td><td>30.0</td><td>607 From St, San Francisco</td><td>94107</td></tr>\n",
 "<tr><th>1</th><td>a2</td><td>Michael Franklin</td><td>1988</td><td>27.5</td><td>1652 Stockton St, San Francisco</td><td>94122</td></tr>\n",
 "<tr><th>2</th><td>a3</td><td>William Bridge</td><td>1986</td><td>32.0</td><td>3131 Webster St, San Francisco</td><td>94107</td></tr>\n",
 "<tr><th>3</th><td>a4</td><td>Binto George</td><td>1987</td><td>32.5</td><td>423 Powell St, San Francisco</td><td>94122</td></tr>\n",
 "<tr><th>4</th><td>a5</td><td>Alphonse Kemper</td><td>1984</td><td>35.0</td><td>1702 Post Street, San Francisco</td><td>94122</td></tr>\n",
 "</tbody>\n",
 "</table>
" ], "text/plain": [ " ID name birth_year hourly_wage \\\n", "0 a1 Kevin Smith 1989 30.0 \n", "1 a2 Michael Franklin 1988 27.5 \n", "2 a3 William Bridge 1986 32.0 \n", "3 a4 Binto George 1987 32.5 \n", "4 a5 Alphonse Kemper 1984 35.0 \n", "\n", " address zipcode \n", "0 607 From St, San Francisco 94107 \n", "1 1652 Stockton St, San Francisco 94122 \n", "2 3131 Webster St, San Francisco 94107 \n", "3 423 Powell St, San Francisco 94122 \n", "4 1702 Post Street, San Francisco 94122 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "nbpresent": { "id": "afadc046-692d-42ad-9c72-523493597682" } }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
 "<thead>\n",
 "<tr><th></th><th>ID</th><th>name</th><th>birth_year</th><th>hourly_wage</th><th>address</th><th>zipcode</th></tr>\n",
 "</thead>\n",
 "<tbody>\n",
 "<tr><th>0</th><td>b1</td><td>Mark Levene</td><td>1987</td><td>29.5</td><td>108 Clement St, San Francisco</td><td>94107</td></tr>\n",
 "<tr><th>1</th><td>b2</td><td>Bill Bridge</td><td>1986</td><td>32.0</td><td>3131 Webster St, San Francisco</td><td>94107</td></tr>\n",
 "<tr><th>2</th><td>b3</td><td>Mike Franklin</td><td>1988</td><td>27.5</td><td>1652 Stockton St, San Francisco</td><td>94122</td></tr>\n",
 "<tr><th>3</th><td>b4</td><td>Joseph Kuan</td><td>1982</td><td>26.0</td><td>108 South Park, San Francisco</td><td>94122</td></tr>\n",
 "<tr><th>4</th><td>b5</td><td>Alfons Kemper</td><td>1984</td><td>35.0</td><td>170 Post St, Apt 4, San Francisco</td><td>94122</td></tr>\n",
 "</tbody>\n",
 "</table>" ], "text/plain": [ " ID name birth_year hourly_wage \\\n", "0 b1 Mark Levene 1987 29.5 \n", "1 b2 Bill Bridge 1986 32.0 \n", "2 b3 Mike Franklin 1988 27.5 \n", "3 b4 Joseph Kuan 1982 26.0 \n", "4 b5 Alfons Kemper 1984 35.0 \n", "\n", " address zipcode \n", "0 108 Clement St, San Francisco 94107 \n", "1 3131 Webster St, San Francisco 94107 \n", "2 1652 Stockton St, San Francisco 94122 \n", "3 108 South Park, San Francisco 94122 \n", "4 170 Post St, Apt 4, San Francisco 94122 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B.head()" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "ca7f0c34-9c21-4b6e-8010-eda1030df041" } }, "source": [ "# Block Using the Sorted Neighborhood Blocker\n", "\n", "Once the tables are read, we can do blocking using the sorted neighborhood blocker." ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "45585660-2ba9-4211-adee-ace14fe8745f" } }, "source": [ "With the sorted neighborhood blocker, you can only block between two tables to produce a candidate set of tuple pairs." ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "726bd6c9-23a5-4543-a201-f84864433f20" } }, "source": [ "## Block Tables to Produce a Candidate Set of Tuple Pairs" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "nbpresent": { "id": "d2001a06-fe74-4ebd-896c-992803828753" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.\n" ] } ], "source": [ "# Instantiate sorted neighborhood blocker object\n", "sn = em.SortedNeighborhoodBlocker()" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "06bcba06-ff85-43b7-bb04-ce1566a29dd9" } }, "source": [ "For the given two tables, we will assume that tuples referring to the same real world person have the same or nearly the same `birth_year`. So, we apply sorted neighborhood blocking on `birth_year` with a window of size 3. That is, we keep only the tuple pairs (one tuple from each table) that fall within the same sliding window once the merged tables are sorted on `birth_year`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "nbpresent": { "id": "2cdc68f4-5874-43d1-a378-b0bd31552ef8" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.\n" ] } ], "source": [ "# Use block_tables to apply blocking over two input tables.\n", "C1 = sn.block_tables(A, B, \n", " l_block_attr='birth_year', r_block_attr='birth_year', \n", " l_output_attrs=['name', 'birth_year', 'zipcode'],\n", " r_output_attrs=['name', 'birth_year', 'zipcode'],\n", " l_output_prefix='l_', r_output_prefix='r_', window_size=3)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "nbpresent": { "id": "7b4967f5-2f99-4394-bfe9-29ff15334d39" } }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
 "<thead>\n",
 "<tr><th></th><th>_id</th><th>l_ID</th><th>r_ID</th><th>l_name</th><th>l_birth_year</th><th>l_zipcode</th><th>r_name</th><th>r_birth_year</th><th>r_zipcode</th></tr>\n",
 "</thead>\n",
 "<tbody>\n",
 "<tr><th>0</th><td>0</td><td>a5</td><td>b4</td><td>Alphonse Kemper</td><td>1984</td><td>94122</td><td>Joseph Kuan</td><td>1982</td><td>94122</td></tr>\n",
 "<tr><th>1</th><td>1</td><td>a5</td><td>b5</td><td>Alphonse Kemper</td><td>1984</td><td>94122</td><td>Alfons Kemper</td><td>1984</td><td>94122</td></tr>\n",
 "<tr><th>2</th><td>2</td><td>a3</td><td>b5</td><td>William Bridge</td><td>1986</td><td>94107</td><td>Alfons Kemper</td><td>1984</td><td>94122</td></tr>\n",
 "<tr><th>3</th><td>3</td><td>a3</td><td>b2</td><td>William Bridge</td><td>1986</td><td>94107</td><td>Bill Bridge</td><td>1986</td><td>94107</td></tr>\n",
 "<tr><th>4</th><td>4</td><td>a4</td><td>b2</td><td>Binto George</td><td>1987</td><td>94122</td><td>Bill Bridge</td><td>1986</td><td>94107</td></tr>\n",
 "</tbody>\n",
 "</table>" ], "text/plain": [ " _id l_ID r_ID l_name l_birth_year l_zipcode r_name \\\n", "0 0 a5 b4 Alphonse Kemper 1984 94122 Joseph Kuan \n", "1 1 a5 b5 Alphonse Kemper 1984 94122 Alfons Kemper \n", "2 2 a3 b5 William Bridge 1986 94107 Alfons Kemper \n", "3 3 a3 b2 William Bridge 1986 94107 Bill Bridge \n", "4 4 a4 b2 Binto George 1987 94122 Bill Bridge \n", "\n", " r_birth_year r_zipcode \n", "0 1982 94122 \n", "1 1984 94122 \n", "2 1984 94122 \n", "3 1986 94107 \n", "4 1986 94107 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the candidate set of tuple pairs\n", "C1.head()" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "386dbb7d-085e-4946-b376-93e686eb62f5" } }, "source": [ "Note that the tuple pairs in the candidate set have close `birth_year` values, since each pair fell within the same window of the sorted order. Their zipcodes may differ.\n", "\n", "The attributes included in the candidate set are determined by the `l_output_attrs` and `r_output_attrs` arguments of the `block_tables` command (the key columns are included by default). Specifically, the attributes listed in `l_output_attrs` are taken from table A and those listed in `r_output_attrs` are taken from table B. The attribute names in the candidate set are prefixed according to the `l_output_prefix` and `r_output_prefix` arguments of the `block_tables` command." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "nbpresent": { "id": "fa6af6e5-471b-4296-98f4-1f4d4ee2869a" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id: 139837734495736\n", "key: _id\n", "fk_ltable: l_ID\n", "fk_rtable: r_ID\n", "ltable(obj.id): 139837734692656\n", "rtable(obj.id): 139837734692520\n" ] } ], "source": [ "# Show the metadata of C1\n", "em.show_properties(C1)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "nbpresent": { "id": "6ec70bd1-adea-40af-9f30-304f6236c5ca" } }, "outputs": [ { "data": { "text/plain": [ "(139837734759952, 139837734846816)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id(A), id(B)" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "51679b6c-2667-4ed9-88aa-caff4c81edbf" } }, "source": [ "Note that the metadata of C1 includes the key, the foreign keys to the left and right tables (i.e., A and B), and pointers to the left and right tables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling Missing Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the input tuples have missing values in the blocking attribute, then they are ignored by default. This is because including all possible tuple pairs with missing values can significantly increase the size of the candidate set. If you want to include them, set the `allow_missing` parameter to True." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Introduce some missing values in the first row of a fresh copy of table A\n", "# (.loc replaces the deprecated .ix indexer; float('nan') replaces pd.np.NaN)\n", "A1 = em.read_csv_metadata(path_A, key='ID')\n", "A1.loc[0, 'zipcode'] = float('nan')\n", "A1.loc[0, 'birth_year'] = float('nan')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
 "<thead>\n",
 "<tr><th></th><th>ID</th><th>name</th><th>birth_year</th><th>hourly_wage</th><th>address</th><th>zipcode</th></tr>\n",
 "</thead>\n",
 "<tbody>\n",
 "<tr><th>0</th><td>a1</td><td>Kevin Smith</td><td>NaN</td><td>30.0</td><td>607 From St, San Francisco</td><td>NaN</td></tr>\n",
 "<tr><th>1</th><td>a2</td><td>Michael Franklin</td><td>1988.0</td><td>27.5</td><td>1652 Stockton St, San Francisco</td><td>94122.0</td></tr>\n",
 "<tr><th>2</th><td>a3</td><td>William Bridge</td><td>1986.0</td><td>32.0</td><td>3131 Webster St, San Francisco</td><td>94107.0</td></tr>\n",
 "<tr><th>3</th><td>a4</td><td>Binto George</td><td>1987.0</td><td>32.5</td><td>423 Powell St, San Francisco</td><td>94122.0</td></tr>\n",
 "<tr><th>4</th><td>a5</td><td>Alphonse Kemper</td><td>1984.0</td><td>35.0</td><td>1702 Post Street, San Francisco</td><td>94122.0</td></tr>\n",
 "</tbody>\n",
 "</table>
" ], "text/plain": [ " ID name birth_year hourly_wage \\\n", "0 a1 Kevin Smith NaN 30.0 \n", "1 a2 Michael Franklin 1988.0 27.5 \n", "2 a3 William Bridge 1986.0 32.0 \n", "3 a4 Binto George 1987.0 32.5 \n", "4 a5 Alphonse Kemper 1984.0 35.0 \n", "\n", " address zipcode \n", "0 607 From St, San Francisco NaN \n", "1 1652 Stockton St, San Francisco 94122.0 \n", "2 3131 Webster St, San Francisco 94107.0 \n", "3 423 Powell St, San Francisco 94122.0 \n", "4 1702 Post Street, San Francisco 94122.0 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A1" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.\n" ] } ], "source": [ "# Use block_tables to apply blocking over two input tables.\n", "C2 = sn.block_tables(A1, B, \n", " l_block_attr='zipcode', r_block_attr='zipcode', \n", " l_output_attrs=['name', 'birth_year', 'zipcode'],\n", " r_output_attrs=['name', 'birth_year', 'zipcode'],\n", " l_output_prefix='l_', r_output_prefix='r_', \n", " allow_missing=True) # setting allow_missing parameter to True" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(11, 9)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(C1), len(C2)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
 "<thead>\n",
 "<tr><th></th><th>_id</th><th>l_ID</th><th>r_ID</th><th>l_name</th><th>l_birth_year</th><th>l_zipcode</th><th>r_name</th><th>r_birth_year</th><th>r_zipcode</th></tr>\n",
 "</thead>\n",
 "<tbody>\n",
 "<tr><th>0</th><td>0</td><td>a3</td><td>b1</td><td>William Bridge</td><td>1986.0</td><td>94107.0</td><td>Mark Levene</td><td>1987.0</td><td>94107.0</td></tr>\n",
 "<tr><th>1</th><td>1</td><td>a1</td><td>b1</td><td>Kevin Smith</td><td>NaN</td><td>NaN</td><td>Mark Levene</td><td>1987.0</td><td>94107.0</td></tr>\n",
 "<tr><th>2</th><td>2</td><td>a1</td><td>b2</td><td>Kevin Smith</td><td>NaN</td><td>NaN</td><td>Bill Bridge</td><td>1986.0</td><td>94107.0</td></tr>\n",
 "<tr><th>3</th><td>3</td><td>a1</td><td>b6</td><td>Kevin Smith</td><td>NaN</td><td>NaN</td><td>Michael Brodie</td><td>1987.0</td><td>94107.0</td></tr>\n",
 "<tr><th>4</th><td>4</td><td>a2</td><td>b6</td><td>Michael Franklin</td><td>1988.0</td><td>94122.0</td><td>Michael Brodie</td><td>1987.0</td><td>94107.0</td></tr>\n",
 "<tr><th>5</th><td>5</td><td>a5</td><td>b3</td><td>Alphonse Kemper</td><td>1984.0</td><td>94122.0</td><td>Mike Franklin</td><td>1988.0</td><td>94122.0</td></tr>\n",
 "<tr><th>6</th><td>6</td><td>a1</td><td>b3</td><td>Kevin Smith</td><td>NaN</td><td>NaN</td><td>Mike Franklin</td><td>1988.0</td><td>94122.0</td></tr>\n",
 "<tr><th>7</th><td>7</td><td>a1</td><td>b4</td><td>Kevin Smith</td><td>NaN</td><td>NaN</td><td>Joseph Kuan</td><td>1982.0</td><td>94122.0</td></tr>\n",
 "<tr><th>8</th><td>8</td><td>a1</td><td>b5</td><td>Kevin Smith</td><td>NaN</td><td>NaN</td><td>Alfons Kemper</td><td>1984.0</td><td>94122.0</td></tr>\n",
 "</tbody>\n",
 "</table>" ], "text/plain": [ " _id l_ID r_ID l_name l_birth_year l_zipcode r_name \\\n", "0 0 a3 b1 William Bridge 1986.0 94107.0 Mark Levene \n", "1 1 a1 b1 Kevin Smith NaN NaN Mark Levene \n", "2 2 a1 b2 Kevin Smith NaN NaN Bill Bridge \n", "3 3 a1 b6 Kevin Smith NaN NaN Michael Brodie \n", "4 4 a2 b6 Michael Franklin 1988.0 94122.0 Michael Brodie \n", "5 5 a5 b3 Alphonse Kemper 1984.0 94122.0 Mike Franklin \n", "6 6 a1 b3 Kevin Smith NaN NaN Mike Franklin \n", "7 7 a1 b4 Kevin Smith NaN NaN Joseph Kuan \n", "8 8 a1 b5 Kevin Smith NaN NaN Alfons Kemper \n", "\n", " r_birth_year r_zipcode \n", "0 1987.0 94107.0 \n", "1 1987.0 94107.0 \n", "2 1986.0 94107.0 \n", "3 1987.0 94107.0 \n", "4 1987.0 94107.0 \n", "5 1988.0 94122.0 \n", "6 1988.0 94122.0 \n", "7 1982.0 94122.0 \n", "8 1984.0 94122.0 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The candidate set C2 now also includes every tuple pair in which the blocking attribute is missing (here, tuple a1 paired with b1 through b6)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Window Size\n", "\n", "The window size is a tunable parameter of the sorted neighborhood blocker, set via the `window_size` argument. Below we repeat the blocking that produced C1 (on `birth_year`, with `window_size=3`) using `window_size=5`. Note that the resulting candidate set, C3, is larger than C1." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. 
USE AT YOUR OWN RISK.\n" ] } ], "source": [ "# Block on birth_year again, this time with a larger window\n", "C3 = sn.block_tables(A, B, \n", " l_block_attr='birth_year', r_block_attr='birth_year', \n", " l_output_attrs=['name', 'birth_year', 'zipcode'],\n", " r_output_attrs=['name', 'birth_year', 'zipcode'],\n", " l_output_prefix='l_', r_output_prefix='r_', window_size=5)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "11" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(C1)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(C3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stable Sort Order\n", "\n", "One final challenge for the sorted neighborhood blocker is making the sort order stable. If the column being sorted on contains identical keys, and a run of identical keys is longer than the window size, then the results may differ between runs. To guarantee the same results on every run, make the sorting column unique. One way to do so is to append the ID of each tuple to the end of the sorting column. Here is an example." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. 
USE AT YOUR OWN RISK.\n" ] } ], "source": [ "# Make each sort key unique by appending the tuple's own ID to birth_year\n", "A[\"birth_year_plus_id\"] = A[\"birth_year\"].map(str) + '-' + A[\"ID\"].map(str)\n", "B[\"birth_year_plus_id\"] = B[\"birth_year\"].map(str) + '-' + B[\"ID\"].map(str)\n", "C3 = sn.block_tables(A, B, \n", " l_block_attr='birth_year_plus_id', r_block_attr='birth_year_plus_id', \n", " l_output_attrs=['name', 'birth_year_plus_id', 'birth_year', 'zipcode'],\n", " r_output_attrs=['name', 'birth_year_plus_id', 'birth_year', 'zipcode'],\n", " l_output_prefix='l_', r_output_prefix='r_', window_size=5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Display the first few tuple pairs of the candidate set\n", "C3.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Sorted Neighborhood Blocker limitations\n", "\n", "Unlike other blockers, the sorted neighborhood blocker depends on each tuple's position in the sorted order, so blocking a candidate set or checking a single pair of tuples is not applicable. Attempts to call `block_candset` or `block_tuples` will raise an assertion error." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }