{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "This IPython notebook illustrates how to select the best learning based matcher. First, we need to import py_entitymatching package and other libraries as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import py_entitymatching package\n", "import py_entitymatching as em\n", "import os\n", "import pandas as pd\n", "\n", "# Set the seed value \n", "seed = 0" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Adding Features to Feature Table.ipynb\r\n", "Combining Multiple Blockers.ipynb\r\n", "Debugging Blocker Output.ipynb\r\n", "Down Sampling.ipynb\r\n", "Editing and Generating Features Manually.ipynb\r\n", "Evaluating the Selected Matcher.ipynb\r\n", "Generating Features Manually.ipynb\r\n", "Performing Blocking Using Blackbox Blocker.ipynb\r\n", "Performing Blocking Using Built-In Blockers (Attr. 
Equivalence Blocker).ipynb\r\n", "Performing Blocking Using Built-In Blockers (Overlap Blocker).ipynb\r\n", "Performing Blocking Using Rule-Based Blocking.ipynb\r\n", "Reading CSV Files from Disk.ipynb\r\n", "Removing Features From Feature Table.ipynb\r\n", "Sampling and Labeling.ipynb\r\n", "Selecting the Best Learning Matcher.ipynb\r\n" ] } ], "source": [ "!ls $datasets_dir" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get the datasets directory\n", "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n", "\n", "path_A = datasets_dir + os.sep + 'dblp_demo.csv'\n", "path_B = datasets_dir + os.sep + 'acm_demo.csv'\n", "path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Metadata file is not present in the given path; proceeding to read the csv file.\n", "Metadata file is not present in the given path; proceeding to read the csv file.\n" ] } ], "source": [ "A = em.read_csv_metadata(path_A, key='id')\n", "B = em.read_csv_metadata(path_B, key='id')\n", "# Load the pre-labeled data\n", "S = em.read_csv_metadata(path_labeled_data, \n", " key='_id',\n", " ltable=A, rtable=B, \n", " fk_ltable='ltable_id', fk_rtable='rtable_id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, split the labeled data into development set and evaluation set. 
Use the development set to select the best learning-based matcher." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Split S into I and J\n", "IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)\n", "I = IJ['train']\n", "J = IJ['test']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Selecting the Best Learning-Based Matcher " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This typically involves the following steps:\n", "1. Creating a set of learning-based matchers\n", "2. Creating features\n", "3. Extracting feature vectors\n", "4. Selecting the best learning-based matcher using k-fold cross-validation\n", "5. Debugging the matcher (and possibly repeating the above steps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a set of learning-based matchers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to create a set of learning-based matchers. The following matchers are supported in Magellan: (1) decision tree, (2) random forest, (3) Naive Bayes, (4) SVM, (5) logistic regression, and (6) linear regression." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create a set of ML-matchers\n", "dt = em.DTMatcher(name='DecisionTree', random_state=0)\n", "svm = em.SVMMatcher(name='SVM', random_state=0)\n", "rf = em.RFMatcher(name='RF', random_state=0)\n", "lg = em.LogRegMatcher(name='LogReg', random_state=0)\n", "ln = em.LinRegMatcher(name='LinReg')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating features\n", "\n", "Next, we need to create a set of features for the development set. Magellan provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features."
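] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before generating the features, it can help to glance at the schemas of the two tables, since Magellan generates features from corresponding attributes of A and B. The following quick check uses plain pandas and is an illustrative addition, not part of the original guide:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sanity check (not part of the original guide):\n", "# features are generated from corresponding attributes of A and B,\n", "# so the two tables should have comparable schemas.\n", "print(list(A.columns))\n", "print(list(B.columns))"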
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Generate a set of features\n", "F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that 20 features were generated. We can inspect the names of the generated features as follows:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 id_id_lev_dist\n", "1 id_id_lev_sim\n", "2 id_id_jar\n", "3 id_id_jwn\n", "4 id_id_exm\n", "5 id_id_jac_qgm_3_qgm_3\n", "6 title_title_jac_qgm_3_qgm_3\n", "7 title_title_cos_dlm_dc0_dlm_dc0\n", "8 title_title_mel\n", "9 title_title_lev_dist\n", "10 title_title_lev_sim\n", "11 authors_authors_jac_qgm_3_qgm_3\n", "12 authors_authors_cos_dlm_dc0_dlm_dc0\n", "13 authors_authors_mel\n", "14 authors_authors_lev_dist\n", "15 authors_authors_lev_sim\n", "16 year_year_exm\n", "17 year_year_anm\n", "18 year_year_lev_dist\n", "19 year_year_lev_sim\n", "Name: feature_name, dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.feature_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting feature vectors\n", "\n", "In this step, we extract feature vectors using the development set and the created features." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Convert I into a set of feature vectors using F\n", "H = em.extract_feature_vecs(I, \n", " feature_table=F, \n", " attrs_after='label',\n", " show_progress=False)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_idrtable_idid_id_lev_distid_id_lev_simid_id_jarid_id_jwnid_id_exmid_id_jac_qgm_3_qgm_3title_title_jac_qgm_3_qgm_3...authors_authors_jac_qgm_3_qgm_3authors_authors_cos_dlm_dc0_dlm_dc0authors_authors_melauthors_authors_lev_distauthors_authors_lev_simyear_year_exmyear_year_anmyear_year_lev_distyear_year_lev_simlabel
430430l1494r125740.200.4666670.46666700.0000000.000000...0.0000000.0000000.44570744.00.08333311.00.01.00
3535l1385r116040.200.4666670.46666700.0000000.025641...0.0000000.0000000.58941743.00.27118611.00.01.00
394394l1345r8540.200.0000000.00000000.0909091.000000...0.9511110.9459460.822080172.00.33846211.00.01.01
2929l611r14130.250.6666670.66666700.0909090.049383...0.0000000.0000000.53154326.00.27777811.00.01.00
181181l1164r116120.600.7333330.73333300.0769231.000000...0.5925930.6681530.68470034.00.24444411.00.01.01
\n", "

5 rows × 24 columns

\n", "
" ], "text/plain": [ " _id ltable_id rtable_id id_id_lev_dist id_id_lev_sim id_id_jar \\\n", "430 430 l1494 r1257 4 0.20 0.466667 \n", "35 35 l1385 r1160 4 0.20 0.466667 \n", "394 394 l1345 r85 4 0.20 0.000000 \n", "29 29 l611 r141 3 0.25 0.666667 \n", "181 181 l1164 r1161 2 0.60 0.733333 \n", "\n", " id_id_jwn id_id_exm id_id_jac_qgm_3_qgm_3 title_title_jac_qgm_3_qgm_3 \\\n", "430 0.466667 0 0.000000 0.000000 \n", "35 0.466667 0 0.000000 0.025641 \n", "394 0.000000 0 0.090909 1.000000 \n", "29 0.666667 0 0.090909 0.049383 \n", "181 0.733333 0 0.076923 1.000000 \n", "\n", " ... authors_authors_jac_qgm_3_qgm_3 \\\n", "430 ... 0.000000 \n", "35 ... 0.000000 \n", "394 ... 0.951111 \n", "29 ... 0.000000 \n", "181 ... 0.592593 \n", "\n", " authors_authors_cos_dlm_dc0_dlm_dc0 authors_authors_mel \\\n", "430 0.000000 0.445707 \n", "35 0.000000 0.589417 \n", "394 0.945946 0.822080 \n", "29 0.000000 0.531543 \n", "181 0.668153 0.684700 \n", "\n", " authors_authors_lev_dist authors_authors_lev_sim year_year_exm \\\n", "430 44.0 0.083333 1 \n", "35 43.0 0.271186 1 \n", "394 172.0 0.338462 1 \n", "29 26.0 0.277778 1 \n", "181 34.0 0.244444 1 \n", "\n", " year_year_anm year_year_lev_dist year_year_lev_sim label \n", "430 1.0 0.0 1.0 0 \n", "35 1.0 0.0 1.0 0 \n", "394 1.0 0.0 1.0 1 \n", "29 1.0 0.0 1.0 0 \n", "181 1.0 0.0 1.0 1 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display first few rows\n", "H.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check if the feature vectors contain missing values\n", "# A return value of True means that there are missing values\n", "any(pd.notnull(H))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that the extracted feature vectors contain missing values. 
We have to impute the missing values so that the learning-based matchers can fit their models correctly. For the purposes of this guide, we impute the missing values in a column with the mean of the values in that column." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Impute feature vectors with the mean of the column values.\n", "H = em.impute_table(H, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],\n", " strategy='mean')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting the best matcher using cross-validation\n", "\n", "Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five-fold cross-validation and the F1 metric to select the best matcher." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MatcherAverage precisionAverage recallAverage f1
0DecisionTree0.9153220.9507140.930980
1RF1.0000000.9507140.974131
2SVM0.9777780.8106320.883248
3LinReg1.0000000.9353300.966131
4LogReg0.9857140.9353300.958724
\n", "
" ], "text/plain": [ " Matcher Average precision Average recall Average f1\n", "0 DecisionTree 0.915322 0.950714 0.930980\n", "1 RF 1.000000 0.950714 0.974131\n", "2 SVM 0.977778 0.810632 0.883248\n", "3 LinReg 1.000000 0.935330 0.966131\n", "4 LogReg 0.985714 0.935330 0.958724" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the best ML matcher using CV\n", "result = em.select_matcher([dt, rf, svm, ln, lg], table=H, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],\n", " k=5,\n", " target_attr='label', metric_to_select_matcher='f1', random_state=0)\n", "result['cv_stats']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMatcherNum foldsFold 1Fold 2Fold 3Fold 4Fold 5Mean score
0DecisionTree<py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990>50.951.0000000.7647060.9333330.9285710.915322
1RF<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310>51.001.0000001.0000001.0000001.0000001.000000
2SVM<py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390>51.001.0000000.8888891.0000001.0000000.977778
3LinReg<py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0>51.001.0000001.0000001.0000001.0000001.000000
4LogReg<py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210>51.000.9285711.0000001.0000001.0000000.985714
\n", "
" ], "text/plain": [ " Name \\\n", "0 DecisionTree \n", "1 RF \n", "2 SVM \n", "3 LinReg \n", "4 LogReg \n", "\n", " Matcher \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score \n", "0 5 0.95 1.000000 0.764706 0.933333 0.928571 0.915322 \n", "1 5 1.00 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "2 5 1.00 1.000000 0.888889 1.000000 1.000000 0.977778 \n", "3 5 1.00 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "4 5 1.00 0.928571 1.000000 1.000000 1.000000 0.985714 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['drill_down_cv_stats']['precision']" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMatcherNum foldsFold 1Fold 2Fold 3Fold 4Fold 5Mean score
0DecisionTree<py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990>50.951.0000000.9285710.87501.0000000.950714
1RF<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310>50.951.0000000.9285710.87501.0000000.950714
2SVM<py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390>50.900.9230770.5714290.81250.8461540.810632
3LinReg<py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0>50.951.0000000.9285710.87500.9230770.935330
4LogReg<py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210>50.951.0000000.9285710.87500.9230770.935330
\n", "
" ], "text/plain": [ " Name \\\n", "0 DecisionTree \n", "1 RF \n", "2 SVM \n", "3 LinReg \n", "4 LogReg \n", "\n", " Matcher \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score \n", "0 5 0.95 1.000000 0.928571 0.8750 1.000000 0.950714 \n", "1 5 0.95 1.000000 0.928571 0.8750 1.000000 0.950714 \n", "2 5 0.90 0.923077 0.571429 0.8125 0.846154 0.810632 \n", "3 5 0.95 1.000000 0.928571 0.8750 0.923077 0.935330 \n", "4 5 0.95 1.000000 0.928571 0.8750 0.923077 0.935330 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['drill_down_cv_stats']['recall']" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMatcherNum foldsFold 1Fold 2Fold 3Fold 4Fold 5Mean score
0DecisionTree<py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990>50.9500001.0000000.8387100.9032260.9629630.930980
1RF<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310>50.9743591.0000000.9629630.9333331.0000000.974131
2SVM<py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390>50.9473680.9600000.6956520.8965520.9166670.883248
3LinReg<py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0>50.9743591.0000000.9629630.9333330.9600000.966131
4LogReg<py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210>50.9743590.9629630.9629630.9333330.9600000.958724
\n", "
" ], "text/plain": [ " Name \\\n", "0 DecisionTree \n", "1 RF \n", "2 SVM \n", "3 LinReg \n", "4 LogReg \n", "\n", " Matcher \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score \n", "0 5 0.950000 1.000000 0.838710 0.903226 0.962963 0.930980 \n", "1 5 0.974359 1.000000 0.962963 0.933333 1.000000 0.974131 \n", "2 5 0.947368 0.960000 0.695652 0.896552 0.916667 0.883248 \n", "3 5 0.974359 1.000000 0.962963 0.933333 0.960000 0.966131 \n", "4 5 0.974359 0.962963 0.962963 0.933333 0.960000 0.958724 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['drill_down_cv_stats']['f1']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Debug X (Random Forest)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Split H into P and Q\n", "PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)\n", "P = PQ['train']\n", "Q = PQ['test']" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Debug RF matcher using GUI\n", "em.vis_debug_rf(rf, P, Q, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],\n", " target_attr='label')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add a feature to do Jaccard on title + authors and add it to F\n", "\n", "# Create a feature declaratively\n", "sim = em.get_sim_funs_for_matching()\n", "tok = em.get_tokenizers_for_matching()\n", "feature_string = \"\"\"jaccard(wspace((ltuple['title'] + ' ' + ltuple['authors']).lower()), \n", " wspace((rtuple['title'] + ' ' + rtuple['authors']).lower()))\"\"\"\n", "feature = em.get_feature_fn(feature_string, sim, tok)\n", "\n", "# Add feature to F\n", "em.add_feature(F, 
'jac_ws_title_authors', feature)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Convert I into feature vectors using updated F\n", "H = em.extract_feature_vecs(I, \n", " feature_table=F, \n", " attrs_after='label',\n", " show_progress=False)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMatcherNum foldsFold 1Fold 2Fold 3Fold 4Fold 5Mean score
0RF<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310>50.9743591.00.9629630.9333331.00.974131
\n", "
" ], "text/plain": [ " Name Matcher \\\n", "0 RF \n", "\n", " Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score \n", "0 5 0.974359 1.0 0.962963 0.933333 1.0 0.974131 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check whether the updated F improves X (Random Forest)\n", "result = em.select_matcher([rf], table=H, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],\n", " k=5,\n", " target_attr='label', metric_to_select_matcher='f1', random_state=0)\n", "result['drill_down_cv_stats']['f1']" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MatcherAverage precisionAverage recallAverage f1
0DecisionTree1.0000001.0000001.000000
1RF1.0000000.9507140.974131
2SVM1.0000000.8374180.907995
3LinReg1.0000000.9703300.984593
4LogReg0.9857140.9353300.958724
\n", "
" ], "text/plain": [ " Matcher Average precision Average recall Average f1\n", "0 DecisionTree 1.000000 1.000000 1.000000\n", "1 RF 1.000000 0.950714 0.974131\n", "2 SVM 1.000000 0.837418 0.907995\n", "3 LinReg 1.000000 0.970330 0.984593\n", "4 LogReg 0.985714 0.935330 0.958724" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the best matcher again using CV\n", "result = em.select_matcher([dt, rf, svm, ln, lg], table=H, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],\n", " k=5,\n", " target_attr='label', metric_to_select_matcher='f1', random_state=0)\n", "result['cv_stats']" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMatcherNum foldsFold 1Fold 2Fold 3Fold 4Fold 5Mean score
0DecisionTree<py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990>51.0000001.0000001.0000001.0000001.0000001.000000
1RF<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310>50.9743591.0000000.9629630.9333331.0000000.974131
2SVM<py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390>50.9473680.9600000.7826090.9333330.9166670.907995
3LinReg<py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0>51.0000001.0000000.9629631.0000000.9600000.984593
4LogReg<py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210>50.9743590.9629630.9629630.9333330.9600000.958724
\n", "
" ], "text/plain": [ " Name \\\n", "0 DecisionTree \n", "1 RF \n", "2 SVM \n", "3 LinReg \n", "4 LogReg \n", "\n", " Matcher \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score \n", "0 5 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "1 5 0.974359 1.000000 0.962963 0.933333 1.000000 0.974131 \n", "2 5 0.947368 0.960000 0.782609 0.933333 0.916667 0.907995 \n", "3 5 1.000000 1.000000 0.962963 1.000000 0.960000 0.984593 \n", "4 5 0.974359 0.962963 0.962963 0.933333 0.960000 0.958724 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['drill_down_cv_stats']['f1']" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }