{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This IPython notebook illustrates how to perform blocking using the overlap blocker.\n", "\n", "First, we need to import the *py_entitymatching* package and other libraries as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import py_entitymatching package\n", "import py_entitymatching as em\n", "import os\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, read the (sample) input tables for blocking purposes." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get the datasets directory\n", "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n", "\n", "# Get the paths of the input tables\n", "path_A = datasets_dir + os.sep + 'person_table_A.csv'\n", "path_B = datasets_dir + os.sep + 'person_table_B.csv'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Read the CSV files and set 'ID' as the key attribute\n", "A = em.read_csv_metadata(path_A, key='ID')\n", "B = em.read_csv_metadata(path_B, key='ID')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDnamebirth_yearhourly_wageaddresszipcode
0a1Kevin Smith198930.0607 From St, San Francisco94107
1a2Michael Franklin198827.51652 Stockton St, San Francisco94122
2a3William Bridge198632.03131 Webster St, San Francisco94107
3a4Binto George198732.5423 Powell St, San Francisco94122
4a5Alphonse Kemper198435.01702 Post Street, San Francisco94122
\n", "
" ], "text/plain": [ " ID name birth_year hourly_wage \\\n", "0 a1 Kevin Smith 1989 30.0 \n", "1 a2 Michael Franklin 1988 27.5 \n", "2 a3 William Bridge 1986 32.0 \n", "3 a4 Binto George 1987 32.5 \n", "4 a5 Alphonse Kemper 1984 35.0 \n", "\n", " address zipcode \n", "0 607 From St, San Francisco 94107 \n", "1 1652 Stockton St, San Francisco 94122 \n", "2 3131 Webster St, San Francisco 94107 \n", "3 423 Powell St, San Francisco 94122 \n", "4 1702 Post Street, San Francisco 94122 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Ways To Do Overlap Blocking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three different ways to do overlap blocking:\n", "\n", "1. Block two tables to produce a `candidate set` of tuple pairs.\n", "2. Block a `candidate set` of tuple pairs to typically produce a reduced candidate set of tuple pairs.\n", "3. Block two tuples to check if a tuple pair would get blocked." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Block Tables to Produce a Candidate Set of Tuple Pairs" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Instantiate overlap blocker object\n", "ob = em.OverlapBlocker()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the given two tables, we will assume that two persons with no sufficient overlap between their addresses do not refer to the same real world person. So, we apply overlap blocking on `address`. Specifically, we tokenize the address by word and include the tuple pairs if the addresses have at least 3 overlapping tokens. That is, we block all the tuple pairs that do not share at least 3 tokens in `address`." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Specify the tokenization to be 'word' level and set overlap_size to be 3.\n", "C1 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3, \n", " l_output_attrs=['name', 'birth_year', 'address'], \n", " r_output_attrs=['name', 'birth_year', 'address'],\n", " show_progress=False)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_IDrtable_IDltable_nameltable_birth_yearltable_addressrtable_namertable_birth_yearrtable_address
00a1b1Kevin Smith1989607 From St, San FranciscoMark Levene1987108 Clement St, San Francisco
11a2b1Michael Franklin19881652 Stockton St, San FranciscoMark Levene1987108 Clement St, San Francisco
22a3b1William Bridge19863131 Webster St, San FranciscoMark Levene1987108 Clement St, San Francisco
33a4b1Binto George1987423 Powell St, San FranciscoMark Levene1987108 Clement St, San Francisco
44a1b2Kevin Smith1989607 From St, San FranciscoBill Bridge19863131 Webster St, San Francisco
\n", "
" ], "text/plain": [ " _id ltable_ID rtable_ID ltable_name ltable_birth_year \\\n", "0 0 a1 b1 Kevin Smith 1989 \n", "1 1 a2 b1 Michael Franklin 1988 \n", "2 2 a3 b1 William Bridge 1986 \n", "3 3 a4 b1 Binto George 1987 \n", "4 4 a1 b2 Kevin Smith 1989 \n", "\n", " ltable_address rtable_name rtable_birth_year \\\n", "0 607 From St, San Francisco Mark Levene 1987 \n", "1 1652 Stockton St, San Francisco Mark Levene 1987 \n", "2 3131 Webster St, San Francisco Mark Levene 1987 \n", "3 423 Powell St, San Francisco Mark Levene 1987 \n", "4 607 From St, San Francisco Bill Bridge 1986 \n", "\n", " rtable_address \n", "0 108 Clement St, San Francisco \n", "1 108 Clement St, San Francisco \n", "2 108 Clement St, San Francisco \n", "3 108 Clement St, San Francisco \n", "4 3131 Webster St, San Francisco " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display first 5 tuple pairs in the candidate set.\n", "C1.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above, we used word-level tokenizer. Overlap blocker also supports q-gram based tokenizer and it can be used as follows:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Set the word_level to be False and set the value of q (using q_val)\n", "C2 = ob.block_tables(A, B, 'address', 'address', word_level=False, q_val=3, overlap_size=3, \n", " l_output_attrs=['name', 'birth_year', 'address'], \n", " r_output_attrs=['name', 'birth_year', 'address'],\n", " show_progress=False)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_IDrtable_IDltable_nameltable_birth_yearltable_addressrtable_namertable_birth_yearrtable_address
00a1b1Kevin Smith1989607 From St, San FranciscoMark Levene1987108 Clement St, San Francisco
11a2b1Michael Franklin19881652 Stockton St, San FranciscoMark Levene1987108 Clement St, San Francisco
22a3b1William Bridge19863131 Webster St, San FranciscoMark Levene1987108 Clement St, San Francisco
33a4b1Binto George1987423 Powell St, San FranciscoMark Levene1987108 Clement St, San Francisco
44a5b1Alphonse Kemper19841702 Post Street, San FranciscoMark Levene1987108 Clement St, San Francisco
\n", "
" ], "text/plain": [ " _id ltable_ID rtable_ID ltable_name ltable_birth_year \\\n", "0 0 a1 b1 Kevin Smith 1989 \n", "1 1 a2 b1 Michael Franklin 1988 \n", "2 2 a3 b1 William Bridge 1986 \n", "3 3 a4 b1 Binto George 1987 \n", "4 4 a5 b1 Alphonse Kemper 1984 \n", "\n", " ltable_address rtable_name rtable_birth_year \\\n", "0 607 From St, San Francisco Mark Levene 1987 \n", "1 1652 Stockton St, San Francisco Mark Levene 1987 \n", "2 3131 Webster St, San Francisco Mark Levene 1987 \n", "3 423 Powell St, San Francisco Mark Levene 1987 \n", "4 1702 Post Street, San Francisco Mark Levene 1987 \n", "\n", " rtable_address \n", "0 108 Clement St, San Francisco \n", "1 108 Clement St, San Francisco \n", "2 108 Clement St, San Francisco \n", "3 108 Clement St, San Francisco \n", "4 108 Clement St, San Francisco " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display first 5 tuple pairs\n", "C2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Updating Stopwords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Commands in the Overlap Blocker removes some stop words by default. You can avoid this by specifying `rem_stop_words` parameter to False" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Set the parameter to remove stop words to False\n", "C3 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3, rem_stop_words=False,\n", " l_output_attrs=['name', 'birth_year', 'address'], \n", " r_output_attrs=['name', 'birth_year', 'address'],\n", " show_progress=False)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_IDrtable_IDltable_nameltable_birth_yearltable_addressrtable_namertable_birth_yearrtable_address
00a1b1Kevin Smith1989607 From St, San FranciscoMark Levene1987108 Clement St, San Francisco
11a2b1Michael Franklin19881652 Stockton St, San FranciscoMark Levene1987108 Clement St, San Francisco
22a3b1William Bridge19863131 Webster St, San FranciscoMark Levene1987108 Clement St, San Francisco
33a4b1Binto George1987423 Powell St, San FranciscoMark Levene1987108 Clement St, San Francisco
44a1b2Kevin Smith1989607 From St, San FranciscoBill Bridge19863131 Webster St, San Francisco
\n", "
" ], "text/plain": [ " _id ltable_ID rtable_ID ltable_name ltable_birth_year \\\n", "0 0 a1 b1 Kevin Smith 1989 \n", "1 1 a2 b1 Michael Franklin 1988 \n", "2 2 a3 b1 William Bridge 1986 \n", "3 3 a4 b1 Binto George 1987 \n", "4 4 a1 b2 Kevin Smith 1989 \n", "\n", " ltable_address rtable_name rtable_birth_year \\\n", "0 607 From St, San Francisco Mark Levene 1987 \n", "1 1652 Stockton St, San Francisco Mark Levene 1987 \n", "2 3131 Webster St, San Francisco Mark Levene 1987 \n", "3 423 Powell St, San Francisco Mark Levene 1987 \n", "4 607 From St, San Francisco Bill Bridge 1986 \n", "\n", " rtable_address \n", "0 108 Clement St, San Francisco \n", "1 108 Clement St, San Francisco \n", "2 108 Clement St, San Francisco \n", "3 108 Clement St, San Francisco \n", "4 3131 Webster St, San Francisco " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display first 5 tuple pairs\n", "C3.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can check what stop words are getting removed like this:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['a',\n", " 'an',\n", " 'and',\n", " 'are',\n", " 'as',\n", " 'at',\n", " 'be',\n", " 'by',\n", " 'for',\n", " 'from',\n", " 'has',\n", " 'he',\n", " 'in',\n", " 'is',\n", " 'it',\n", " 'its',\n", " 'on',\n", " 'that',\n", " 'the',\n", " 'to',\n", " 'was',\n", " 'were',\n", " 'will',\n", " 'with']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ob.stop_words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can update this stop word list (with some domain specific stop words) and do the blocking." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Include 'francisco' as one of the stop words\n", "ob.stop_words.append('francisco')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['a',\n", " 'an',\n", " 'and',\n", " 'are',\n", " 'as',\n", " 'at',\n", " 'be',\n", " 'by',\n", " 'for',\n", " 'from',\n", " 'has',\n", " 'he',\n", " 'in',\n", " 'is',\n", " 'it',\n", " 'its',\n", " 'on',\n", " 'that',\n", " 'the',\n", " 'to',\n", " 'was',\n", " 'were',\n", " 'will',\n", " 'with',\n", " 'francisco']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ob.stop_words" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Set the word level tokenizer to be True\n", "C4 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3, \n", " l_output_attrs=['name', 'birth_year', 'address'], \n", " r_output_attrs=['name', 'birth_year', 'address'],\n", " show_progress=False)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_IDrtable_IDltable_nameltable_birth_yearltable_addressrtable_namertable_birth_yearrtable_address
00a1b1Kevin Smith1989607 From St, San FranciscoMark Levene1987108 Clement St, San Francisco
11a2b1Michael Franklin19881652 Stockton St, San FranciscoMark Levene1987108 Clement St, San Francisco
22a3b1William Bridge19863131 Webster St, San FranciscoMark Levene1987108 Clement St, San Francisco
33a4b1Binto George1987423 Powell St, San FranciscoMark Levene1987108 Clement St, San Francisco
44a1b2Kevin Smith1989607 From St, San FranciscoBill Bridge19863131 Webster St, San Francisco
\n", "
" ], "text/plain": [ " _id ltable_ID rtable_ID ltable_name ltable_birth_year \\\n", "0 0 a1 b1 Kevin Smith 1989 \n", "1 1 a2 b1 Michael Franklin 1988 \n", "2 2 a3 b1 William Bridge 1986 \n", "3 3 a4 b1 Binto George 1987 \n", "4 4 a1 b2 Kevin Smith 1989 \n", "\n", " ltable_address rtable_name rtable_birth_year \\\n", "0 607 From St, San Francisco Mark Levene 1987 \n", "1 1652 Stockton St, San Francisco Mark Levene 1987 \n", "2 3131 Webster St, San Francisco Mark Levene 1987 \n", "3 423 Powell St, San Francisco Mark Levene 1987 \n", "4 607 From St, San Francisco Bill Bridge 1986 \n", "\n", " rtable_address \n", "0 108 Clement St, San Francisco \n", "1 108 Clement St, San Francisco \n", "2 108 Clement St, San Francisco \n", "3 108 Clement St, San Francisco \n", "4 3131 Webster St, San Francisco " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C4.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling Missing Values " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the input tuples have missing values in the blocking attribute, then they are ignored by default. You can set `allow_missing_values` to be True to include all possible tuple pairs with missing values." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Introduce a missing value\n", "A1 = em.read_csv_metadata(path_A, key='ID')\n", "A1.loc[0, 'address'] = float('nan')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Set allow_missing to True to keep tuple pairs with missing values\n", "C5 = ob.block_tables(A1, B, 'address', 'address', word_level=True, overlap_size=3, allow_missing=True,\n", " l_output_attrs=['name', 'birth_year', 'address'], \n", " r_output_attrs=['name', 'birth_year', 'address'],\n", " show_progress=False)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(C5)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_IDrtable_IDltable_nameltable_birth_yearltable_addressrtable_namertable_birth_yearrtable_address
00a2b1Michael Franklin19881652 Stockton St, San FranciscoMark Levene1987108 Clement St, San Francisco
11a3b1William Bridge19863131 Webster St, San FranciscoMark Levene1987108 Clement St, San Francisco
22a4b1Binto George1987423 Powell St, San FranciscoMark Levene1987108 Clement St, San Francisco
33a2b2Michael Franklin19881652 Stockton St, San FranciscoBill Bridge19863131 Webster St, San Francisco
44a3b2William Bridge19863131 Webster St, San FranciscoBill Bridge19863131 Webster St, San Francisco
55a4b2Binto George1987423 Powell St, San FranciscoBill Bridge19863131 Webster St, San Francisco
66a2b3Michael Franklin19881652 Stockton St, San FranciscoMike Franklin19881652 Stockton St, San Francisco
77a3b3William Bridge19863131 Webster St, San FranciscoMike Franklin19881652 Stockton St, San Francisco
88a4b3Binto George1987423 Powell St, San FranciscoMike Franklin19881652 Stockton St, San Francisco
99a2b5Michael Franklin19881652 Stockton St, San FranciscoAlfons Kemper1984170 Post St, Apt 4, San Francisco
1010a3b5William Bridge19863131 Webster St, San FranciscoAlfons Kemper1984170 Post St, Apt 4, San Francisco
1111a4b5Binto George1987423 Powell St, San FranciscoAlfons Kemper1984170 Post St, Apt 4, San Francisco
1212a5b5Alphonse Kemper19841702 Post Street, San FranciscoAlfons Kemper1984170 Post St, Apt 4, San Francisco
1313a5b6Alphonse Kemper19841702 Post Street, San FranciscoMichael Brodie1987133 Clement Street, San Francisco
014a1b1Kevin Smith1989NaNMark Levene1987108 Clement St, San Francisco
115a1b2Kevin Smith1989NaNBill Bridge19863131 Webster St, San Francisco
216a1b3Kevin Smith1989NaNMike Franklin19881652 Stockton St, San Francisco
317a1b4Kevin Smith1989NaNJoseph Kuan1982108 South Park, San Francisco
418a1b5Kevin Smith1989NaNAlfons Kemper1984170 Post St, Apt 4, San Francisco
519a1b6Kevin Smith1989NaNMichael Brodie1987133 Clement Street, San Francisco
\n", "
" ], "text/plain": [ " _id ltable_ID rtable_ID ltable_name ltable_birth_year \\\n", "0 0 a2 b1 Michael Franklin 1988 \n", "1 1 a3 b1 William Bridge 1986 \n", "2 2 a4 b1 Binto George 1987 \n", "3 3 a2 b2 Michael Franklin 1988 \n", "4 4 a3 b2 William Bridge 1986 \n", "5 5 a4 b2 Binto George 1987 \n", "6 6 a2 b3 Michael Franklin 1988 \n", "7 7 a3 b3 William Bridge 1986 \n", "8 8 a4 b3 Binto George 1987 \n", "9 9 a2 b5 Michael Franklin 1988 \n", "10 10 a3 b5 William Bridge 1986 \n", "11 11 a4 b5 Binto George 1987 \n", "12 12 a5 b5 Alphonse Kemper 1984 \n", "13 13 a5 b6 Alphonse Kemper 1984 \n", "0 14 a1 b1 Kevin Smith 1989 \n", "1 15 a1 b2 Kevin Smith 1989 \n", "2 16 a1 b3 Kevin Smith 1989 \n", "3 17 a1 b4 Kevin Smith 1989 \n", "4 18 a1 b5 Kevin Smith 1989 \n", "5 19 a1 b6 Kevin Smith 1989 \n", "\n", " ltable_address rtable_name rtable_birth_year \\\n", "0 1652 Stockton St, San Francisco Mark Levene 1987 \n", "1 3131 Webster St, San Francisco Mark Levene 1987 \n", "2 423 Powell St, San Francisco Mark Levene 1987 \n", "3 1652 Stockton St, San Francisco Bill Bridge 1986 \n", "4 3131 Webster St, San Francisco Bill Bridge 1986 \n", "5 423 Powell St, San Francisco Bill Bridge 1986 \n", "6 1652 Stockton St, San Francisco Mike Franklin 1988 \n", "7 3131 Webster St, San Francisco Mike Franklin 1988 \n", "8 423 Powell St, San Francisco Mike Franklin 1988 \n", "9 1652 Stockton St, San Francisco Alfons Kemper 1984 \n", "10 3131 Webster St, San Francisco Alfons Kemper 1984 \n", "11 423 Powell St, San Francisco Alfons Kemper 1984 \n", "12 1702 Post Street, San Francisco Alfons Kemper 1984 \n", "13 1702 Post Street, San Francisco Michael Brodie 1987 \n", "0 NaN Mark Levene 1987 \n", "1 NaN Bill Bridge 1986 \n", "2 NaN Mike Franklin 1988 \n", "3 NaN Joseph Kuan 1982 \n", "4 NaN Alfons Kemper 1984 \n", "5 NaN Michael Brodie 1987 \n", "\n", " rtable_address \n", "0 108 Clement St, San Francisco \n", "1 108 Clement St, San Francisco \n", "2 108 Clement St, San Francisco \n", "3 3131 
Webster St, San Francisco \n", "4 3131 Webster St, San Francisco \n", "5 3131 Webster St, San Francisco \n", "6 1652 Stockton St, San Francisco \n", "7 1652 Stockton St, San Francisco \n", "8 1652 Stockton St, San Francisco \n", "9 170 Post St, Apt 4, San Francisco \n", "10 170 Post St, Apt 4, San Francisco \n", "11 170 Post St, Apt 4, San Francisco \n", "12 170 Post St, Apt 4, San Francisco \n", "13 133 Clement Street, San Francisco \n", "0 108 Clement St, San Francisco \n", "1 3131 Webster St, San Francisco \n", "2 1652 Stockton St, San Francisco \n", "3 108 South Park, San Francisco \n", "4 170 Post St, Apt 4, San Francisco \n", "5 133 Clement Street, San Francisco " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Block a Candidate Set To Produce a Reduced Set of Tuple Pairs" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Instantiate the overlap blocker\n", "ob = em.OverlapBlocker()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above, we see that the candidate set produced by blocking the input tables includes tuple pairs that share at least three tokens in `address`. In addition, we will assume that two persons with no overlap between their names cannot refer to the same person. So, we block the candidate set of tuple pairs on `name`. That is, we block all the tuple pairs whose names share no tokens." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Specify the tokenization to be 'word' level and set overlap_size to be 1.\n", "C6 = ob.block_candset(C1, 'name', 'name', word_level=True, overlap_size=1, show_progress=False)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_IDrtable_IDltable_nameltable_birth_yearltable_addressrtable_namertable_birth_yearrtable_address
66a3b2William Bridge19863131 Webster St, San FranciscoBill Bridge19863131 Webster St, San Francisco
99a2b3Michael Franklin19881652 Stockton St, San FranciscoMike Franklin19881652 Stockton St, San Francisco
1616a5b5Alphonse Kemper19841702 Post Street, San FranciscoAlfons Kemper1984170 Post St, Apt 4, San Francisco
\n", "
" ], "text/plain": [ " _id ltable_ID rtable_ID ltable_name ltable_birth_year \\\n", "6 6 a3 b2 William Bridge 1986 \n", "9 9 a2 b3 Michael Franklin 1988 \n", "16 16 a5 b5 Alphonse Kemper 1984 \n", "\n", " ltable_address rtable_name rtable_birth_year \\\n", "6 3131 Webster St, San Francisco Bill Bridge 1986 \n", "9 1652 Stockton St, San Francisco Mike Franklin 1988 \n", "16 1702 Post Street, San Francisco Alfons Kemper 1984 \n", "\n", " rtable_address \n", "6 3131 Webster St, San Francisco \n", "9 1652 Stockton St, San Francisco \n", "16 170 Post St, Apt 4, San Francisco " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above, we saw that word level tokenization was used to tokenize the names. You can also use q-gram tokenization like this:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Specify the tokenization to be 'word' level and set overlap_size to be 1.\n", "C7 = ob.block_candset(C1, 'name', 'name', word_level=False, q_val= 3, overlap_size=1, show_progress=False)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_IDrtable_IDltable_nameltable_birth_yearltable_addressrtable_namertable_birth_yearrtable_address
66a3b2William Bridge19863131 Webster St, San FranciscoBill Bridge19863131 Webster St, San Francisco
77a4b2Binto George1987423 Powell St, San FranciscoBill Bridge19863131 Webster St, San Francisco
88a1b3Kevin Smith1989607 From St, San FranciscoMike Franklin19881652 Stockton St, San Francisco
99a2b3Michael Franklin19881652 Stockton St, San FranciscoMike Franklin19881652 Stockton St, San Francisco
1616a5b5Alphonse Kemper19841702 Post Street, San FranciscoAlfons Kemper1984170 Post St, Apt 4, San Francisco
\n", "
" ], "text/plain": [ " _id ltable_ID rtable_ID ltable_name ltable_birth_year \\\n", "6 6 a3 b2 William Bridge 1986 \n", "7 7 a4 b2 Binto George 1987 \n", "8 8 a1 b3 Kevin Smith 1989 \n", "9 9 a2 b3 Michael Franklin 1988 \n", "16 16 a5 b5 Alphonse Kemper 1984 \n", "\n", " ltable_address rtable_name rtable_birth_year \\\n", "6 3131 Webster St, San Francisco Bill Bridge 1986 \n", "7 423 Powell St, San Francisco Bill Bridge 1986 \n", "8 607 From St, San Francisco Mike Franklin 1988 \n", "9 1652 Stockton St, San Francisco Mike Franklin 1988 \n", "16 1702 Post Street, San Francisco Alfons Kemper 1984 \n", "\n", " rtable_address \n", "6 3131 Webster St, San Francisco \n", "7 3131 Webster St, San Francisco \n", "8 1652 Stockton St, San Francisco \n", "9 1652 Stockton St, San Francisco \n", "16 170 Post St, Apt 4, San Francisco " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C7.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling Missing Values " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[As we saw with block_tables](#Handling-Missing-Values), you can include all the possible tuple pairs with the missing values using `allow_missing` parameter block the candidate set with the updated set of stop words." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# Introduce some missing values\n", "A1.loc[2, 'name'] = pd.np.NaN" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "C8 = ob.block_candset(C5, 'name', 'name', word_level=True, overlap_size=1, allow_missing=True, show_progress=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Block Two tuples To Check If a Tuple Pair Would Get Blocked" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can apply overlap blocking to a tuple pair to check if it is going to get blocked. 
For example, we can check if the first tuple from A and B will get blocked if we block on `address`." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDnamebirth_yearhourly_wageaddresszipcode
0a1Kevin Smith198930.0607 From St, San Francisco94107
\n", "
" ], "text/plain": [ " ID name birth_year hourly_wage address \\\n", "0 a1 Kevin Smith 1989 30.0 607 From St, San Francisco \n", "\n", " zipcode \n", "0 94107 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the first tuple from table A\n", "A.loc[[0]]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDnamebirth_yearhourly_wageaddresszipcode
0b1Mark Levene198729.5108 Clement St, San Francisco94107
\n", "
" ], "text/plain": [ " ID name birth_year hourly_wage address \\\n", "0 b1 Mark Levene 1987 29.5 108 Clement St, San Francisco \n", "\n", " zipcode \n", "0 94107 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the first tuple from table B\n", "B.loc[[0]]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "False\n" ] } ], "source": [ "# Instantiate Attr. Equivalence Blocker\n", "ob = em.OverlapBlocker()\n", "\n", "# Apply blocking to a tuple pair from the input tables on zipcode and get blocking status\n", "status = ob.block_tuples(A.loc[0], B.loc[0],'address', 'address', overlap_size=1)\n", "\n", "# Print the blocking status\n", "print(status)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2.0 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 0 }