{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic EM workflow 2 (Restaurants data set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This IPython notebook explains a basic workflow two tables using *py_entitymatching*. Our goal is to come up with a workflow to match restaurants from Fodors and Zagat sites. Specifically, we want to achieve precision and recall above 96%. The datasets contain information about the restaurants.\n",
    "\n",
    "First, we need to import *py_entitymatching* package and other libraries as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/py_entitymatching/')\n",
    "\n",
    "import py_entitymatching as em\n",
    "import pandas as pd\n",
    "import os"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "python version: 2.7.13 | packaged by conda-forge | (default, May  2 2017, 13:29:36) \n",
      "[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]\n",
      "pandas version: 0.20.3\n",
      "magellan version: 0.2.0\n"
     ]
    }
   ],
   "source": [
    "# Display the versions\n",
    "print('python version: ' + sys.version )\n",
    "print('pandas version: ' + pd.__version__ )\n",
    "print('magellan version: ' + em.__version__ )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Matching two tables typically consists of the following three steps:\n",
    "\n",
    "** 1. Reading the input tables **\n",
    "\n",
    "** 2. Blocking the input tables to get a candidate set **\n",
    "\n",
    "** 3. Matching the tuple pairs in the candidate set **"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Read input tables\n",
    "\n",
    "We begin by loading the input tables. For the purpose of this guide, we use the datasets that are included with the package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Get the paths\n",
    "path_A = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/fodors.csv'\n",
    "path_B = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/zagats.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "No handlers could be found for logger \"py_entitymatching.io.parsers\"\n"
     ]
    }
   ],
   "source": [
    "# Load csv files as dataframes and set the key attribute in the dataframe\n",
    "A = em.read_csv_metadata(path_A, key='id')\n",
    "B = em.read_csv_metadata(path_B, key='id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of tuples in A: 533\n",
      "Number of tuples in B: 331\n",
      "Number of tuples in A X B (i.e the cartesian product): 176423\n"
     ]
    }
   ],
   "source": [
    "print('Number of tuples in A: ' + str(len(A)))\n",
    "print('Number of tuples in B: ' + str(len(B)))\n",
    "print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>name</th>\n",
       "      <th>addr</th>\n",
       "      <th>city</th>\n",
       "      <th>phone</th>\n",
       "      <th>type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>534</td>\n",
       "      <td>arnie mortons of chicago</td>\n",
       "      <td>435 s. la cienega blv.</td>\n",
       "      <td>los angeles</td>\n",
       "      <td>310/246-1501</td>\n",
       "      <td>american</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>535</td>\n",
       "      <td>arts delicatessen</td>\n",
       "      <td>12224 ventura blvd.</td>\n",
       "      <td>studio city</td>\n",
       "      <td>818/762-1221</td>\n",
       "      <td>american</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    id                      name                    addr         city  \\\n",
       "0  534  arnie mortons of chicago  435 s. la cienega blv.  los angeles   \n",
       "1  535         arts delicatessen     12224 ventura blvd.  studio city   \n",
       "\n",
       "          phone      type  \n",
       "0  310/246-1501  american  \n",
       "1  818/762-1221  american  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>name</th>\n",
       "      <th>addr</th>\n",
       "      <th>city</th>\n",
       "      <th>phone</th>\n",
       "      <th>type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>apple pan  the</td>\n",
       "      <td>10801 w. pico blvd.</td>\n",
       "      <td>west la</td>\n",
       "      <td>310-475-3585</td>\n",
       "      <td>american</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>asahi ramen</td>\n",
       "      <td>2027 sawtelle blvd.</td>\n",
       "      <td>west la</td>\n",
       "      <td>310-479-2231</td>\n",
       "      <td>noodle shops</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id            name                 addr     city         phone  \\\n",
       "0   1  apple pan  the  10801 w. pico blvd.  west la  310-475-3585   \n",
       "1   2     asahi ramen  2027 sawtelle blvd.  west la  310-479-2231   \n",
       "\n",
       "           type  \n",
       "0      american  \n",
       "1  noodle shops  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "B.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('id', 'id')"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the keys of the input tables\n",
    "em.get_key(A), em.get_key(B)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(146, 200)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# If the tables are large we can downsample the tables like this\n",
    "A1, B1 = em.down_sample(A, B, 200, 1, show_progress=False)\n",
    "len(A1), len(B1)\n",
    "\n",
    "# But for the purposes of this notebook, we will use the entire table A and B"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Block Tables To Get Candidate Set\n",
    "\n",
    "Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching.\n",
    "*py_entitymatching* provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. The user can mix and match these blockers to form a blocking sequence applied to input tables.\n",
    "\n",
    "For the matching problem at hand, we know that two restaurants with no overlap between the names will not match. So we decide the apply blocking over names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Blocking plan\n",
    "\n",
    "# A, B -- Overlap blocker [name] --------------------|---> candidate set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2915"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create overlap blocker\n",
    "ob = em.OverlapBlocker()\n",
    "\n",
    "# Block tables using 'name' attribute \n",
    "C = ob.block_tables(A, B, 'name', 'name', \n",
    "                    l_output_attrs=['name', 'addr', 'city', 'phone'], \n",
    "                    r_output_attrs=['name', 'addr', 'city', 'phone'],\n",
    "                    overlap_size=1, show_progress=False\n",
    "                    )\n",
    "len(C)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Match tuple pairs in candidate set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this step, we would want to match the tuple pairs in the candidate set. Specifically, we use learning-based method for matching purposes.\n",
    "This typically involves the following four steps:\n",
    "\n",
    "1. Sampling and labeling the candidate set\n",
    "2. Splitting the labeled data into development and evaluation set\n",
    "3. Selecting the best learning based matcher using the development set\n",
    "4. Evaluating the selected matcher using the evaluation set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sampling and labeling the candidate set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we randomly sample 450 tuple pairs for labeling purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Sample  candidate set\n",
    "S = em.sample_table(C, 450)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we label the sampled candidate set. Specify we would enter 1 for a match and 0 for a non-match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Label S\n",
    "G = em.label_table(S, 'gold')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the purposes of this guide, we will load in a pre-labeled dataset (of 450 tuple pairs) included in this package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "450"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the pre-labeled data\n",
    "path_G = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/lbl_restnt_wf1.csv'\n",
    "G = em.read_csv_metadata(path_G, \n",
    "                         key='_id',\n",
    "                         ltable=A, rtable=B, \n",
    "                         fk_ltable='ltable_id', fk_rtable='rtable_id')\n",
    "len(G)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Splitting the labeled data into development and evaluation set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this step, we split the labeled data into two sets: development (I) and evaluation (J). Specifically, the development set is used to come up with the best learning-based matcher and the evaluation set used to evaluate the selected matcher on unseen data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Split S into development set (I) and evaluation set (J)\n",
    "IJ = em.split_train_test(G, train_proportion=0.7, random_state=0)\n",
    "I = IJ['train']\n",
    "J = IJ['test']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selecting the best learning-based matcher "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting the best learning-based matcher typically involves the following steps:\n",
    "\n",
    "1. Creating a set of learning-based matchers\n",
    "2. Creating features\n",
    "3. Converting the development set into feature vectors\n",
    "4. Selecting the best learning-based matcher using k-fold cross validation\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating a set of learning-based matchers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a set of ML-matchers\n",
    "dt = em.DTMatcher(name='DecisionTree', random_state=0)\n",
    "svm = em.SVMMatcher(name='SVM', random_state=0)\n",
    "rf = em.RFMatcher(name='RF', random_state=0)\n",
    "lg = em.LogRegMatcher(name='LogReg', random_state=0)\n",
    "ln = em.LinRegMatcher(name='LinReg')\n",
    "nb = em.NBMatcher(name='NaiveBayes')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we need to create a set of features for the development set. *py_entitymatching* provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Generate features\n",
    "feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Converting the development set to feature vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert the I into a set of feature vectors using F\n",
    "H = em.extract_feature_vecs(I, \n",
    "                            feature_table=feature_table, \n",
    "                            attrs_after='gold',\n",
    "                            show_progress=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>ltable_id</th>\n",
       "      <th>rtable_id</th>\n",
       "      <th>id_id_exm</th>\n",
       "      <th>id_id_anm</th>\n",
       "      <th>id_id_lev_dist</th>\n",
       "      <th>id_id_lev_sim</th>\n",
       "      <th>name_name_jac_qgm_3_qgm_3</th>\n",
       "      <th>name_name_cos_dlm_dc0_dlm_dc0</th>\n",
       "      <th>name_name_jac_dlm_dc0_dlm_dc0</th>\n",
       "      <th>...</th>\n",
       "      <th>city_city_sw</th>\n",
       "      <th>type_type_jac_qgm_3_qgm_3</th>\n",
       "      <th>type_type_cos_dlm_dc0_dlm_dc0</th>\n",
       "      <th>type_type_jac_dlm_dc0_dlm_dc0</th>\n",
       "      <th>type_type_mel</th>\n",
       "      <th>type_type_lev_dist</th>\n",
       "      <th>type_type_lev_sim</th>\n",
       "      <th>type_type_nmw</th>\n",
       "      <th>type_type_sw</th>\n",
       "      <th>gold</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>221</th>\n",
       "      <td>1790</td>\n",
       "      <td>563</td>\n",
       "      <td>248</td>\n",
       "      <td>0</td>\n",
       "      <td>0.440497</td>\n",
       "      <td>3</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.235294</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.883333</td>\n",
       "      <td>7.0</td>\n",
       "      <td>0.416667</td>\n",
       "      <td>-2.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>439</th>\n",
       "      <td>794</td>\n",
       "      <td>544</td>\n",
       "      <td>116</td>\n",
       "      <td>0</td>\n",
       "      <td>0.213235</td>\n",
       "      <td>3</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.258065</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.333333</td>\n",
       "      <td>...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.466667</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.200000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>191</th>\n",
       "      <td>2315</td>\n",
       "      <td>589</td>\n",
       "      <td>305</td>\n",
       "      <td>0</td>\n",
       "      <td>0.517827</td>\n",
       "      <td>3</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.172414</td>\n",
       "      <td>0.353553</td>\n",
       "      <td>0.200000</td>\n",
       "      <td>...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.451923</td>\n",
       "      <td>24.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-11.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 40 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      _id  ltable_id  rtable_id  id_id_exm  id_id_anm  id_id_lev_dist  \\\n",
       "221  1790        563        248          0   0.440497               3   \n",
       "439   794        544        116          0   0.213235               3   \n",
       "191  2315        589        305          0   0.517827               3   \n",
       "\n",
       "     id_id_lev_sim  name_name_jac_qgm_3_qgm_3  name_name_cos_dlm_dc0_dlm_dc0  \\\n",
       "221            0.0                   1.000000                       1.000000   \n",
       "439            0.0                   0.258065                       0.500000   \n",
       "191            0.0                   0.172414                       0.353553   \n",
       "\n",
       "     name_name_jac_dlm_dc0_dlm_dc0  ...   city_city_sw  \\\n",
       "221                       1.000000  ...            8.0   \n",
       "439                       0.333333  ...            1.0   \n",
       "191                       0.200000  ...            1.0   \n",
       "\n",
       "     type_type_jac_qgm_3_qgm_3  type_type_cos_dlm_dc0_dlm_dc0  \\\n",
       "221                   0.235294                            0.0   \n",
       "439                   0.000000                            0.0   \n",
       "191                   0.000000                            0.0   \n",
       "\n",
       "     type_type_jac_dlm_dc0_dlm_dc0  type_type_mel  type_type_lev_dist  \\\n",
       "221                            0.0       0.883333                 7.0   \n",
       "439                            0.0       0.466667                 4.0   \n",
       "191                            0.0       0.451923                24.0   \n",
       "\n",
       "     type_type_lev_sim  type_type_nmw  type_type_sw  gold  \n",
       "221           0.416667           -2.0           4.0     1  \n",
       "439           0.200000            0.0           1.0     0  \n",
       "191           0.000000          -11.0           2.0     0  \n",
       "\n",
       "[3 rows x 40 columns]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display first few rows\n",
    "H.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Selecting the best matcher using cross-validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five fold cross validation and use 'precision' and 'recall' metric to select the best matcher."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Matcher</th>\n",
       "      <th>Average precision</th>\n",
       "      <th>Average recall</th>\n",
       "      <th>Average f1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DecisionTree</td>\n",
       "      <td>0.976471</td>\n",
       "      <td>0.936450</td>\n",
       "      <td>0.954078</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>RF</td>\n",
       "      <td>0.975735</td>\n",
       "      <td>0.963235</td>\n",
       "      <td>0.968450</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>SVM</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.297437</td>\n",
       "      <td>0.440952</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>LinReg</td>\n",
       "      <td>0.944444</td>\n",
       "      <td>0.920497</td>\n",
       "      <td>0.929657</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>LogReg</td>\n",
       "      <td>0.974603</td>\n",
       "      <td>0.948214</td>\n",
       "      <td>0.960212</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaiveBayes</td>\n",
       "      <td>0.953947</td>\n",
       "      <td>0.973214</td>\n",
       "      <td>0.962530</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        Matcher  Average precision  Average recall  Average f1\n",
       "0  DecisionTree           0.976471        0.936450    0.954078\n",
       "1            RF           0.975735        0.963235    0.968450\n",
       "2           SVM           1.000000        0.297437    0.440952\n",
       "3        LinReg           0.944444        0.920497    0.929657\n",
       "4        LogReg           0.974603        0.948214    0.960212\n",
       "5    NaiveBayes           0.953947        0.973214    0.962530"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select the best ML matcher using CV\n",
    "result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, \n",
    "        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],\n",
    "        k=5,\n",
    "        target_attr='gold', metric_to_select_matcher='precision', random_state=0)\n",
    "result['cv_stats']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Matcher</th>\n",
       "      <th>Average precision</th>\n",
       "      <th>Average recall</th>\n",
       "      <th>Average f1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DecisionTree</td>\n",
       "      <td>0.976471</td>\n",
       "      <td>0.936450</td>\n",
       "      <td>0.954078</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>RF</td>\n",
       "      <td>0.975735</td>\n",
       "      <td>0.963235</td>\n",
       "      <td>0.968450</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>SVM</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.297437</td>\n",
       "      <td>0.440952</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>LinReg</td>\n",
       "      <td>0.944444</td>\n",
       "      <td>0.920497</td>\n",
       "      <td>0.929657</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>LogReg</td>\n",
       "      <td>0.974603</td>\n",
       "      <td>0.948214</td>\n",
       "      <td>0.960212</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaiveBayes</td>\n",
       "      <td>0.953947</td>\n",
       "      <td>0.973214</td>\n",
       "      <td>0.962530</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        Matcher  Average precision  Average recall  Average f1\n",
       "0  DecisionTree           0.976471        0.936450    0.954078\n",
       "1            RF           0.975735        0.963235    0.968450\n",
       "2           SVM           1.000000        0.297437    0.440952\n",
       "3        LinReg           0.944444        0.920497    0.929657\n",
       "4        LogReg           0.974603        0.948214    0.960212\n",
       "5    NaiveBayes           0.953947        0.973214    0.962530"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, \n",
    "        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],\n",
    "        k=5,\n",
    "        target_attr='gold', metric_to_select_matcher='recall', random_state=0)\n",
    "result['cv_stats']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that the best matcher (RF) is  getting us the best precision and recall. So, we select this matcher and now we can proceed on to evaluating the best matcher on the unseen data (the evaluation set)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Evaluating the matching output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Evaluating the matching outputs for the evaluation set typically involves the following four steps:\n",
    "1. Converting the evaluation set to feature vectors\n",
    "2. Training matcher using the feature vectors extracted from the development set\n",
    "3. Predicting the evaluation set using the trained matcher\n",
    "4. Evaluating the predicted matches"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Converting the evaluation set to  feature vectors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As before, we convert to the feature vectors (using the feature table and the evaluation set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert J into a set of feature vectors using feature table\n",
    "L = em.extract_feature_vecs(J, feature_table=feature_table,\n",
    "                            attrs_after='gold', show_progress=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training the selected matcher"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we train the matcher using all of the feature vectors from the development set. For the purposes of this guide we use random forest as the selected matcher."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train using feature vectors from I \n",
    "dt.fit(table=H, \n",
    "       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'], \n",
    "       target_attr='gold')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predicting the matches"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we predict the matches for the evaluation set (using the feature vectors extracted from it)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Predict on L \n",
    "predictions = dt.predict(table=L, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'], \n",
    "              append=True, target_attr='predicted', inplace=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluating the predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we evaluate the accuracy of predicted outputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Precision : 97.14% (34/35)\n",
      "Recall : 100.0% (34/34)\n",
      "F1 : 98.55%\n",
      "False positives : 1 (out of 35 positive predictions)\n",
      "False negatives : 0 (out of 100 negative predictions)\n"
     ]
    }
   ],
   "source": [
    "# Evaluate the predictions\n",
    "eval_result = em.eval_matches(predictions, 'gold', 'predicted')\n",
    "em.print_eval_summary(eval_result)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}