{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This quickstart guide explains how to join two tables A and B using edit distance measure. First, you need to import the required packages as follows (if you have installed **py_stringsimjoin** it will automatically install the dependencies **py_stringmatching** and **pandas**):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "import py_stringsimjoin as ssj\n",
    "import py_stringmatching as sm\n",
    "import pandas as pd\n",
    "import os, sys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "python version: 2.7.12 |Anaconda 2.5.0 (x86_64)| (default, Jul  2 2016, 17:43:17) \n",
      "[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)]\n",
      "py_stringsimjoin version: 0.1.0\n",
      "py_stringmatching version: 0.2.0\n",
      "pandas version: 0.17.1\n"
     ]
    }
   ],
   "source": [
    "print('python version: ' + sys.version)\n",
    "print('py_stringsimjoin version: ' + ssj.__version__)\n",
    "print('py_stringmatching version: ' + sm.__version__)\n",
    "print('pandas version: ' + pd.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Joining two tables using edit distance measure typically consists of three steps:\n",
    "1. Loading the input tables\n",
    "2. Profiling the tables\n",
    "3. Performing the join"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Loading the input tables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We begin by loading the two tables. For the purpose of this guide, \n",
    "we use the sample dataset that comes with the package. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# construct the path of the tables to be loaded. Since we are loading a \n",
    "# dataset from the package, we need to access the data from the path \n",
    "# where the package is installed. If you need to load your own data, you can directly\n",
    "# provide your table path to the read_csv command.\n",
    "\n",
    "table_A_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'person_table_A.csv'])\n",
    "table_B_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'person_table_B.csv'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of records in A: 7\n",
      "Number of records in B: 8\n"
     ]
    }
   ],
   "source": [
    "# Load csv files as dataframes.\n",
    "A = pd.read_csv(table_A_path)\n",
    "B = pd.read_csv(table_B_path)\n",
    "print('Number of records in A: ' + str(len(A)))\n",
    "print('Number of records in B: ' + str(len(B)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A.id</th>\n",
       "      <th>A.name</th>\n",
       "      <th>A.birth_year</th>\n",
       "      <th>A.hourly_wage</th>\n",
       "      <th>A.address</th>\n",
       "      <th>A.zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>1989</td>\n",
       "      <td>30.0</td>\n",
       "      <td>607 From St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a2</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a3</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a4</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987</td>\n",
       "      <td>32.5</td>\n",
       "      <td>423 Powell St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1702 Post Street, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>a6</td>\n",
       "      <td></td>\n",
       "      <td>1990</td>\n",
       "      <td>40.0</td>\n",
       "      <td>24th Street, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>a7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1986</td>\n",
       "      <td>25.0</td>\n",
       "      <td>20th Street, San Francisco</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  A.id            A.name  A.birth_year  A.hourly_wage  \\\n",
       "0   a1       Kevin Smith          1989           30.0   \n",
       "1   a2  Michael Franklin          1988           27.5   \n",
       "2   a3    William Bridge          1986           32.0   \n",
       "3   a4      Binto George          1987           32.5   \n",
       "4   a5   Alphonse Kemper          1984           35.0   \n",
       "5   a6                            1990           40.0   \n",
       "6   a7               NaN          1986           25.0   \n",
       "\n",
       "                         A.address  A.zipcode  \n",
       "0       607 From St, San Francisco      94107  \n",
       "1  1652 Stockton St, San Francisco      94122  \n",
       "2   3131 Webster St, San Francisco      94107  \n",
       "3     423 Powell St, San Francisco      94122  \n",
       "4  1702 Post Street, San Francisco      94122  \n",
       "5       24th Street, San Francisco      94122  \n",
       "6       20th Street, San Francisco        NaN  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>B.id</th>\n",
       "      <th>B.name</th>\n",
       "      <th>B.birth_year</th>\n",
       "      <th>B.hourly_wage</th>\n",
       "      <th>B.address</th>\n",
       "      <th>B.zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>b1</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>29.5</td>\n",
       "      <td>108 Clement St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b2</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>b3</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>b4</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>26.0</td>\n",
       "      <td>108 South Park, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>b5</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>35.0</td>\n",
       "      <td>170 Post St, Apt 4,  San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>b6</td>\n",
       "      <td>Michael Brodie</td>\n",
       "      <td>1987</td>\n",
       "      <td>32.5</td>\n",
       "      <td>133 Clement Street, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>b7</td>\n",
       "      <td></td>\n",
       "      <td>1990</td>\n",
       "      <td>40.0</td>\n",
       "      <td>24th Street, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>b8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1986</td>\n",
       "      <td>25.0</td>\n",
       "      <td>20th Street, San Francisco</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  B.id          B.name  B.birth_year  B.hourly_wage  \\\n",
       "0   b1     Mark Levene          1987           29.5   \n",
       "1   b2     Bill Bridge          1986           32.0   \n",
       "2   b3   Mike Franklin          1988           27.5   \n",
       "3   b4     Joseph Kuan          1982           26.0   \n",
       "4   b5   Alfons Kemper          1984           35.0   \n",
       "5   b6  Michael Brodie          1987           32.5   \n",
       "6   b7                          1990           40.0   \n",
       "7   b8             NaN          1986           25.0   \n",
       "\n",
       "                            B.address  B.zipcode  \n",
       "0       108 Clement St, San Francisco      94107  \n",
       "1      3131 Webster St, San Francisco      94107  \n",
       "2     1652 Stockton St, San Francisco      94122  \n",
       "3       108 South Park, San Francisco      94122  \n",
       "4  170 Post St, Apt 4,  San Francisco      94122  \n",
       "5   133 Clement Street, San Francisco      94107  \n",
       "6          24th Street, San Francisco      94122  \n",
       "7          20th Street, San Francisco        NaN  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "B"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Profiling the tables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before performing the join, we may want to profile the tables to \n",
    "know about the characteristics of the attributes. This can help identify:\n",
    "\n",
    "a) unique attributes in the table which can be used as key attribute when performing \n",
    "   the join. A key attribute is needed to uniquely identify a tuple. \n",
    "   \n",
    "b) the number of missing values present in each attribute. This can \n",
    "   help you in deciding the attribute on which to perform the join. \n",
    "   For example, an attribute with a lot of missing values may not be a good \n",
    "   join attribute. Further, based on the missing value information you \n",
    "   need to decide on how to handle missing values when performing the join \n",
    "   (See the section below on 'Handling missing values' to know more about\n",
    "   the options available for handling missing values when performing the join).\n",
    "   \n",
    "You can profile the attributes in a table using the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unique values</th>\n",
       "      <th>Missing values</th>\n",
       "      <th>Comments</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Attribute</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A.id</th>\n",
       "      <td>7 (100.0%)</td>\n",
       "      <td>0 (0.0%)</td>\n",
       "      <td>This attribute can be used as a key attribute.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A.name</th>\n",
       "      <td>7 (100.0%)</td>\n",
       "      <td>1 (14.29%)</td>\n",
       "      <td>Joining on this attribute will ignore 1 (14.29%) rows.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A.birth_year</th>\n",
       "      <td>6 (85.71%)</td>\n",
       "      <td>0 (0.0%)</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A.hourly_wage</th>\n",
       "      <td>7 (100.0%)</td>\n",
       "      <td>0 (0.0%)</td>\n",
       "      <td>This attribute can be used as a key attribute.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A.address</th>\n",
       "      <td>7 (100.0%)</td>\n",
       "      <td>0 (0.0%)</td>\n",
       "      <td>This attribute can be used as a key attribute.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A.zipcode</th>\n",
       "      <td>3 (42.86%)</td>\n",
       "      <td>1 (14.29%)</td>\n",
       "      <td>Joining on this attribute will ignore 1 (14.29%) rows.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Unique values Missing values  \\\n",
       "Attribute                                    \n",
       "A.id             7 (100.0%)       0 (0.0%)   \n",
       "A.name           7 (100.0%)     1 (14.29%)   \n",
       "A.birth_year     6 (85.71%)       0 (0.0%)   \n",
       "A.hourly_wage    7 (100.0%)       0 (0.0%)   \n",
       "A.address        7 (100.0%)       0 (0.0%)   \n",
       "A.zipcode        3 (42.86%)     1 (14.29%)   \n",
       "\n",
       "                                                             Comments  \n",
       "Attribute                                                              \n",
       "A.id                   This attribute can be used as a key attribute.  \n",
       "A.name         Joining on this attribute will ignore 1 (14.29%) rows.  \n",
       "A.birth_year                                                           \n",
       "A.hourly_wage          This attribute can be used as a key attribute.  \n",
       "A.address              This attribute can be used as a key attribute.  \n",
       "A.zipcode      Joining on this attribute will ignore 1 (14.29%) rows.  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# profile attributes in table A\n",
    "ssj.profile_table_for_join(A)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unique values</th>\n",
       "      <th>Missing values</th>\n",
       "      <th>Comments</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Attribute</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>B.id</th>\n",
       "      <td>8 (100.0%)</td>\n",
       "      <td>0 (0.0%)</td>\n",
       "      <td>This attribute can be used as a key attribute.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B.name</th>\n",
       "      <td>8 (100.0%)</td>\n",
       "      <td>1 (12.5%)</td>\n",
       "      <td>Joining on this attribute will ignore 1 (12.5%) rows.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B.birth_year</th>\n",
       "      <td>6 (75.0%)</td>\n",
       "      <td>0 (0.0%)</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B.hourly_wage</th>\n",
       "      <td>8 (100.0%)</td>\n",
       "      <td>0 (0.0%)</td>\n",
       "      <td>This attribute can be used as a key attribute.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B.address</th>\n",
       "      <td>8 (100.0%)</td>\n",
       "      <td>0 (0.0%)</td>\n",
       "      <td>This attribute can be used as a key attribute.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B.zipcode</th>\n",
       "      <td>3 (37.5%)</td>\n",
       "      <td>1 (12.5%)</td>\n",
       "      <td>Joining on this attribute will ignore 1 (12.5%) rows.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Unique values Missing values  \\\n",
       "Attribute                                    \n",
       "B.id             8 (100.0%)       0 (0.0%)   \n",
       "B.name           8 (100.0%)      1 (12.5%)   \n",
       "B.birth_year      6 (75.0%)       0 (0.0%)   \n",
       "B.hourly_wage    8 (100.0%)       0 (0.0%)   \n",
       "B.address        8 (100.0%)       0 (0.0%)   \n",
       "B.zipcode         3 (37.5%)      1 (12.5%)   \n",
       "\n",
       "                                                            Comments  \n",
       "Attribute                                                             \n",
       "B.id                  This attribute can be used as a key attribute.  \n",
       "B.name         Joining on this attribute will ignore 1 (12.5%) rows.  \n",
       "B.birth_year                                                          \n",
       "B.hourly_wage         This attribute can be used as a key attribute.  \n",
       "B.address             This attribute can be used as a key attribute.  \n",
       "B.zipcode      Joining on this attribute will ignore 1 (12.5%) rows.  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# profile attributes in table B\n",
    "ssj.profile_table_for_join(B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the input tables does not contain any key attribute, then you need \n",
    "to create a key attribute. In the current example, both the input tables\n",
    "A and B have key attributes, and hence you can proceed to the next step.\n",
    "In the case the table does not have any key attribute, you can \n",
    "add a key attribute using the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>B.id</th>\n",
       "      <th>B.name</th>\n",
       "      <th>B.birth_year</th>\n",
       "      <th>B.hourly_wage</th>\n",
       "      <th>B.address</th>\n",
       "      <th>B.zipcode</th>\n",
       "      <th>new_key_attr</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>b1</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>1987</td>\n",
       "      <td>29.5</td>\n",
       "      <td>108 Clement St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b2</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>b3</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>b4</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>1982</td>\n",
       "      <td>26.0</td>\n",
       "      <td>108 South Park, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>b5</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>35.0</td>\n",
       "      <td>170 Post St, Apt 4,  San Francisco</td>\n",
       "      <td>94122</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>b6</td>\n",
       "      <td>Michael Brodie</td>\n",
       "      <td>1987</td>\n",
       "      <td>32.5</td>\n",
       "      <td>133 Clement Street, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>b7</td>\n",
       "      <td></td>\n",
       "      <td>1990</td>\n",
       "      <td>40.0</td>\n",
       "      <td>24th Street, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>b8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1986</td>\n",
       "      <td>25.0</td>\n",
       "      <td>20th Street, San Francisco</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  B.id          B.name  B.birth_year  B.hourly_wage  \\\n",
       "0   b1     Mark Levene          1987           29.5   \n",
       "1   b2     Bill Bridge          1986           32.0   \n",
       "2   b3   Mike Franklin          1988           27.5   \n",
       "3   b4     Joseph Kuan          1982           26.0   \n",
       "4   b5   Alfons Kemper          1984           35.0   \n",
       "5   b6  Michael Brodie          1987           32.5   \n",
       "6   b7                          1990           40.0   \n",
       "7   b8             NaN          1986           25.0   \n",
       "\n",
       "                            B.address  B.zipcode  new_key_attr  \n",
       "0       108 Clement St, San Francisco      94107             0  \n",
       "1      3131 Webster St, San Francisco      94107             1  \n",
       "2     1652 Stockton St, San Francisco      94122             2  \n",
       "3       108 South Park, San Francisco      94122             3  \n",
       "4  170 Post St, Apt 4,  San Francisco      94122             4  \n",
       "5   133 Clement Street, San Francisco      94107             5  \n",
       "6          24th Street, San Francisco      94122             6  \n",
       "7          20th Street, San Francisco        NaN             7  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "B['new_key_attr'] = range(0, len(B))\n",
    "B"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the purpose of this guide, we will now join tables A and B on \n",
    "'name' attribute using edit distance measure. Next, we need to decide on what \n",
    "threshold to use for the join. For this guide, we will use a threshold of 5. \n",
    "Specifically, the join will now find tuple pairs from A and B such that \n",
    "the edit distance over the 'name' attributes is at most 5."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Performing the join"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step is to perform the edit distance join using the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0%   100%\n",
      "[#######] | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    }
   ],
   "source": [
    "# find all pairs from A and B such that the edit distance\n",
    "# on 'name' is at most 5.\n",
    "# l_out_attrs and r_out_attrs denote the attributes from the \n",
    "# left table (A) and right table (B) that need to be included in the output.\n",
    "\n",
    "output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', 5, \n",
    "                                      l_out_attrs=['A.name'], r_out_attrs=['B.name'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(output_pairs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_A.id</th>\n",
       "      <th>r_B.id</th>\n",
       "      <th>l_A.name</th>\n",
       "      <th>r_B.name</th>\n",
       "      <th>_sim_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>a3</td>\n",
       "      <td>b2</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>a2</td>\n",
       "      <td>b3</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>a5</td>\n",
       "      <td>b5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>a6</td>\n",
       "      <td>b7</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   _id l_A.id r_B.id          l_A.name       r_B.name  _sim_score\n",
       "0    0     a3     b2    William Bridge    Bill Bridge           4\n",
       "1    1     a2     b3  Michael Franklin  Mike Franklin           4\n",
       "2    2     a5     b5   Alphonse Kemper  Alfons Kemper           3\n",
       "3    3     a6     b7                                            0"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# examine the output pairs\n",
    "output_pairs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Handling missing values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "By default, pairs with missing values are not included\n",
    "in the output. This is because a string with a missing value\n",
    "can potentially match with all strings in the other table and \n",
    "hence the number of output pairs can become huge. If you want \n",
    "to include pairs with missing value in the output, you need to \n",
    "set the **allow_missing** flag to True, as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0%   100%\n",
      "[#######] | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n",
      "0%  100%\n",
      "[##] | ETA: 00:00:00"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finding pairs with missing value...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "Total time elapsed: 00:00:00\n"
     ]
    }
   ],
   "source": [
    "output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', 5, allow_missing=True, \n",
    "                                      l_out_attrs=['A.name'], r_out_attrs=['B.name'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_A.id</th>\n",
       "      <th>r_B.id</th>\n",
       "      <th>l_A.name</th>\n",
       "      <th>r_B.name</th>\n",
       "      <th>_sim_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>a3</td>\n",
       "      <td>b2</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>a2</td>\n",
       "      <td>b3</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>a5</td>\n",
       "      <td>b5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>a6</td>\n",
       "      <td>b7</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4</td>\n",
       "      <td>a7</td>\n",
       "      <td>b1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mark Levene</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>a7</td>\n",
       "      <td>b2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bill Bridge</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6</td>\n",
       "      <td>a7</td>\n",
       "      <td>b3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mike Franklin</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>7</td>\n",
       "      <td>a7</td>\n",
       "      <td>b4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Joseph Kuan</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8</td>\n",
       "      <td>a7</td>\n",
       "      <td>b5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Alfons Kemper</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>9</td>\n",
       "      <td>a7</td>\n",
       "      <td>b6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Michael Brodie</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>10</td>\n",
       "      <td>a7</td>\n",
       "      <td>b7</td>\n",
       "      <td>NaN</td>\n",
       "      <td></td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>11</td>\n",
       "      <td>a7</td>\n",
       "      <td>b8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>12</td>\n",
       "      <td>a1</td>\n",
       "      <td>b8</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>13</td>\n",
       "      <td>a2</td>\n",
       "      <td>b8</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>14</td>\n",
       "      <td>a3</td>\n",
       "      <td>b8</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>15</td>\n",
       "      <td>a4</td>\n",
       "      <td>b8</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>16</td>\n",
       "      <td>a5</td>\n",
       "      <td>b8</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>17</td>\n",
       "      <td>a6</td>\n",
       "      <td>b8</td>\n",
       "      <td></td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    _id l_A.id r_B.id          l_A.name        r_B.name  _sim_score\n",
       "0     0     a3     b2    William Bridge     Bill Bridge           4\n",
       "1     1     a2     b3  Michael Franklin   Mike Franklin           4\n",
       "2     2     a5     b5   Alphonse Kemper   Alfons Kemper           3\n",
       "3     3     a6     b7                                             0\n",
       "0     4     a7     b1               NaN     Mark Levene         NaN\n",
       "1     5     a7     b2               NaN     Bill Bridge         NaN\n",
       "2     6     a7     b3               NaN   Mike Franklin         NaN\n",
       "3     7     a7     b4               NaN     Joseph Kuan         NaN\n",
       "4     8     a7     b5               NaN   Alfons Kemper         NaN\n",
       "5     9     a7     b6               NaN  Michael Brodie         NaN\n",
       "6    10     a7     b7               NaN                         NaN\n",
       "7    11     a7     b8               NaN             NaN         NaN\n",
       "8    12     a1     b8       Kevin Smith             NaN         NaN\n",
       "9    13     a2     b8  Michael Franklin             NaN         NaN\n",
       "10   14     a3     b8    William Bridge             NaN         NaN\n",
       "11   15     a4     b8      Binto George             NaN         NaN\n",
       "12   16     a5     b8   Alphonse Kemper             NaN         NaN\n",
       "13   17     a6     b8                               NaN         NaN"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output_pairs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Enabling parallel processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you have multiple cores which you want to exploit for performing the \n",
    "join, you need to use the **n_jobs** option. If n_jobs is -1, all CPUs \n",
    "are used. If 1 is given, no parallel computing code is used at all, \n",
    "which is useful for debugging and is the default option. For n_jobs below \n",
    "-1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of \n",
    "CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If \n",
    "(n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code \n",
    "will be used (i.e., equivalent to the default).\n",
    "\n",
    "The following command exploits all the cores available to perform the join:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0%  100%\n",
      "[  ]\r",
      "[# ] | ETA: 00:00:00\r",
      "[##] | ETA: 00:00:00\r",
      "[##] | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    }
   ],
   "source": [
    "output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', 5, \n",
    "                                      l_out_attrs=['A.name'], r_out_attrs=['B.name'], n_jobs=-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(output_pairs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You need to set n_jobs to 1 when you are debugging or you do not want \n",
    "to use any parallel computing code. If you want to execute the join as \n",
    "fast as possible, you need to set n_jobs to -1 which will exploit all \n",
    "the CPUs in your machine. In case there are other concurrent processes \n",
    "running in your machine and you do not want to halt them, then you may \n",
    "need to set n_jobs to a value below -1."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Performing join on numeric attributes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The join method expects the join attributes to be of string type. \n",
    "If you need to perform the join over numeric attributes, then you need \n",
    "to first convert the attributes to string type and then perform the join.\n",
    "For example, if you need to join 'A.zipcode' in table A with 'B.zipcode' in\n",
    "table B, you need to first convert the attributes to string type using \n",
    "the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ssj.dataframe_column_to_str(A, 'A.zipcode', inplace=True)\n",
    "ssj.dataframe_column_to_str(B, 'B.zipcode', inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the above command preserves the NaN values while converting the numeric column to string type. Next, you can perform the join as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0%   100%\n",
      "[#######] | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>l_A.id</th>\n",
       "      <th>r_B.id</th>\n",
       "      <th>l_A.zipcode</th>\n",
       "      <th>r_B.zipcode</th>\n",
       "      <th>_sim_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>a1</td>\n",
       "      <td>b1</td>\n",
       "      <td>94107</td>\n",
       "      <td>94107</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>a3</td>\n",
       "      <td>b1</td>\n",
       "      <td>94107</td>\n",
       "      <td>94107</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>a1</td>\n",
       "      <td>b2</td>\n",
       "      <td>94107</td>\n",
       "      <td>94107</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>a3</td>\n",
       "      <td>b2</td>\n",
       "      <td>94107</td>\n",
       "      <td>94107</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>a2</td>\n",
       "      <td>b3</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>a4</td>\n",
       "      <td>b3</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>a5</td>\n",
       "      <td>b3</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>a6</td>\n",
       "      <td>b3</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>a2</td>\n",
       "      <td>b4</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>a4</td>\n",
       "      <td>b4</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>a5</td>\n",
       "      <td>b4</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>a6</td>\n",
       "      <td>b4</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>a2</td>\n",
       "      <td>b5</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>a4</td>\n",
       "      <td>b5</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>a5</td>\n",
       "      <td>b5</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>a6</td>\n",
       "      <td>b5</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>a1</td>\n",
       "      <td>b6</td>\n",
       "      <td>94107</td>\n",
       "      <td>94107</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>a3</td>\n",
       "      <td>b6</td>\n",
       "      <td>94107</td>\n",
       "      <td>94107</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>a2</td>\n",
       "      <td>b7</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>a4</td>\n",
       "      <td>b7</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>20</td>\n",
       "      <td>a5</td>\n",
       "      <td>b7</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>21</td>\n",
       "      <td>a6</td>\n",
       "      <td>b7</td>\n",
       "      <td>94122</td>\n",
       "      <td>94122</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    _id l_A.id r_B.id l_A.zipcode r_B.zipcode  _sim_score\n",
       "0     0     a1     b1       94107       94107           0\n",
       "1     1     a3     b1       94107       94107           0\n",
       "2     2     a1     b2       94107       94107           0\n",
       "3     3     a3     b2       94107       94107           0\n",
       "4     4     a2     b3       94122       94122           0\n",
       "5     5     a4     b3       94122       94122           0\n",
       "6     6     a5     b3       94122       94122           0\n",
       "7     7     a6     b3       94122       94122           0\n",
       "8     8     a2     b4       94122       94122           0\n",
       "9     9     a4     b4       94122       94122           0\n",
       "10   10     a5     b4       94122       94122           0\n",
       "11   11     a6     b4       94122       94122           0\n",
       "12   12     a2     b5       94122       94122           0\n",
       "13   13     a4     b5       94122       94122           0\n",
       "14   14     a5     b5       94122       94122           0\n",
       "15   15     a6     b5       94122       94122           0\n",
       "16   16     a1     b6       94107       94107           0\n",
       "17   17     a3     b6       94107       94107           0\n",
       "18   18     a2     b7       94122       94122           0\n",
       "19   19     a4     b7       94122       94122           0\n",
       "20   20     a5     b7       94122       94122           0\n",
       "21   21     a6     b7       94122       94122           0"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.zipcode', 'B.zipcode', 1, \n",
    "                                      l_out_attrs=['A.zipcode'], r_out_attrs=['B.zipcode'])\n",
    "output_pairs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Additional options"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can find all the options available for the edit distance \n",
    "join function using the **help** command as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "help(ssj.edit_distance_join)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# More information"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to edit distance measure, you can use the package to perform \n",
    "join using other measures such as cosine, Dice, Jaccard, overlap and \n",
    "overlap coefficient. For measures such as TF-IDF which are not \n",
    "directly supported, you can perform the join using the filters provided \n",
    "in the package. To know more about other join methods as well as how to \n",
    "use filters, refer to the how-to guide (available from the \n",
    "[package homepage](https://sites.google.com/site/anhaidgroup/projects/py_stringsimjoin))."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}