{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "This IPython notebook illustrates how to performing matching with a ML matcher. In particular we show examples with a decision tree matcher, but the same principles apply to all of the other ML matchers." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import py_entitymatching package\n", "import py_entitymatching as em\n", "import os\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in the orignal tables and a set of labeled data into py_entitymatching." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Get the datasets directory\n", "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n", "\n", "path_A = datasets_dir + os.sep + 'dblp_demo.csv'\n", "path_B = datasets_dir + os.sep + 'acm_demo.csv'\n", "path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Metadata file is not present in the given path; proceeding to read the csv file.\n", "Metadata file is not present in the given path; proceeding to read the csv file.\n" ] } ], "source": [ "A = em.read_csv_metadata(path_A, key='id')\n", "B = em.read_csv_metadata(path_B, key='id')\n", "# Load the pre-labeled data\n", "S = em.read_csv_metadata(path_labeled_data, \n", " key='_id',\n", " ltable=A, rtable=B, \n", " fk_ltable='ltable_id', fk_rtable='rtable_id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Training the ML Matcher\n", "\n", "Now, we can train our ML matcher. In this notebook we will demonstrate this process with a decision tree matcher. First, we need to split our labeled data into a training set and a test set. 
Then we will extract feature vectors from the training set and train our decision tree with the fit command." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Split S into I and J\n", "IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)\n", "I = IJ['train']\n", "J = IJ['test']" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Generate a set of features\n", "F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Convert I into feature vectors using F\n", "H = em.extract_feature_vecs(I, \n", " feature_table=F, \n", " attrs_after='label',\n", " show_progress=False)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Instantiate the matcher to evaluate.\n", "dt = em.DTMatcher(name='DecisionTree', random_state=0)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Train using feature vectors from I\n", "dt.fit(table=H, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], \n", " target_attr='label')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting Predictions with the ML Matcher\n", "\n", "Since we now have a trained decision tree, we can use our matcher to get predictions on the test set. Below, we will show four different ways to get the predictions with the predict command, each of which is useful in a different context.\n", "\n", "### Getting a List of Predictions\n", "\n", "First up, we will demonstrate how to get just a list of predictions using the predict command. This is the default method of getting predictions. As shown below, the resulting variable, predictions, is just an array containing the predictions for each of the feature vectors in the test set. 
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, 1, 1, 1, 0, 1, 0, 0])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert J into a set of feature vectors using F\n", "L1 = em.extract_feature_vecs(J, feature_table=F,\n", " attrs_after='label', show_progress=False)\n", "\n", "# Predict on L \n", "predictions = dt.predict(table=L1, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'])\n", "\n", "# Show the predictions\n", "predictions[0:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting a List of Predictions and a List of Probabilities\n", "\n", "Next we will demonstrate how to get both a list of prediction for the test set, as well as a list of the associated probabilities for the predictions. This is done by setting the 'return_probs' argument to true. Note that the probabilities shown are the probability for a match. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Predictions for first ten entries: [0 0 0 1 1 1 0 1 0 0]\n", "Probabilities of a match for first ten entries: [0. 0. 0. 1. 1. 1. 0. 1. 0. 
0.]\n" ] } ], "source": [ "# Convert J into a set of feature vectors using F\n", "L2 = em.extract_feature_vecs(J, feature_table=F,\n", " attrs_after='label', show_progress=False)\n", "\n", "# Predict on L \n", "predictions, probs = dt.predict(table=L2, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], return_probs=True)\n", "\n", "# Show the predictions and probabilities\n", "print('Predictions for first ten entries: {0}'.format(predictions[0:10]))\n", "print('Probabilities of a match for first ten entries: {0}'.format(probs[0:10]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Appending the Predictions to the Feature Vectors Table\n", "\n", "Often, we want to include the predictions with the feature vector table. We can return predictions appended to a copy of the feature vector table if we use the 'append' argument to true. We can choose the name of the new predictions column using the 'target_attr' argument. We can also append the probabilites by setting 'return_probs' to true and setting the new probabilities column name with the 'probs_attr'." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_idrtable_idlabelpredictionprobability
124124l1647r366000.0
5454l332r1463000.0
268268l1499r1725000.0
293293l759r1749111.0
230230l1580r1711111.0
\n", "
" ], "text/plain": [ " _id ltable_id rtable_id label prediction probability\n", "124 124 l1647 r366 0 0 0.0\n", "54 54 l332 r1463 0 0 0.0\n", "268 268 l1499 r1725 0 0 0.0\n", "293 293 l759 r1749 1 1 1.0\n", "230 230 l1580 r1711 1 1 1.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert J into a set of feature vectors using F\n", "L3 = em.extract_feature_vecs(J, feature_table=F,\n", " attrs_after='label', show_progress=False)\n", "\n", "# Predict on L \n", "predictions = dt.predict(table=L3, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], \n", " target_attr='prediction', append=True,\n", " return_probs=True, probs_attr='probability')\n", "\n", "# Show the predictions and probabilities\n", "predictions[['_id', 'ltable_id', 'rtable_id', 'label', 'prediction', 'probability']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Appending the Prediction to the Original Feature Vectors Table In-place\n", "\n", "Lastly, we will show how to append the predictions to the original feature vector dataframe. We can accomplish this by setting the 'append' argument to true, setting the name of the new column with the 'target_attr' argument and then setting the 'inplace' argument to true. Again, we can include the probabilites with the 'return_probs' and 'probs_attr' arguments. This will append the predictions and probabilities to the original feature vector dataframe as opposed to the mthod used above which will create a copy of the feature vectors and append the predictions to that copy." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_idrtable_idlabelpredictionprobabilities
124124l1647r366000.0
5454l332r1463000.0
268268l1499r1725000.0
293293l759r1749111.0
230230l1580r1711111.0
\n", "
" ], "text/plain": [ " _id ltable_id rtable_id label prediction probabilities\n", "124 124 l1647 r366 0 0 0.0\n", "54 54 l332 r1463 0 0 0.0\n", "268 268 l1499 r1725 0 0 0.0\n", "293 293 l759 r1749 1 1 1.0\n", "230 230 l1580 r1711 1 1 1.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert J into a set of feature vectors using F\n", "L4 = em.extract_feature_vecs(J, feature_table=F,\n", " attrs_after='label', show_progress=False)\n", "\n", "# Predict on L \n", "dt.predict(table=L4, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], \n", " target_attr='prediction', append=True,\n", " return_probs=True, probs_attr='probabilities',\n", " inplace=True)\n", "\n", "# Show the predictions and probabilities\n", "L4[['_id', 'ltable_id', 'rtable_id', 'label', 'prediction', 'probabilities']].head()" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }