{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a part of the ***Magellan*** ecosystem, in this tutorial we will show how to use `deepmatcher` together with `Magellan`, to perform an end-to-end Entity Matching (EM) task. Specifically, `Magellan` performs EM in a **two-stage** fashion where given two tables,\n",
    "1. Perform **blocking** on the two tables by removing obvious non-matching tuple pairs to get a candidate set K.\n",
    "2. Perform **matching** to predict each pair in K as match or non-match. This stage consists of the following substeps:\n",
    "    1. *Magellan* first helps the user take a sample S from K.\n",
    "    2. The user labels all pairs in S as the training data.\n",
    "    3. A classifer L will be learned on S.\n",
    "    4. The user applies L to K to predict each pair as match or non-match.\n",
    "\n",
    "Given the above workflow, `deepmatcher` fits in the step C, which is to take advantage of deep learning to learn a classifer. For the rest of the tutorial, we will use a real example, which is to match songs across two tables from iTunes and Amazon Music, to go through the workflow.\n",
    "\n",
    "For more information on `Magellan`, please go to the project website: https://sites.google.com/site/anhaidgroup/projects/magellan."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 0. Preparation\n",
    "In order to use ***Magellan***, we need to first install it. The easiest way is to use \"pip\" as follow (please consult the package website for other installation options)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    import py_entitymatching as em\n",
    "except:\n",
    "    !pip install py_entitymatching"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/deepmatcher')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import deepmatcher as dm\n",
    "import py_entitymatching as em\n",
    "\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 1. Load data\n",
    "We first load the two input tables in the csv format using `Magellan`, that contain songs from iTunes and Amazon Music. These two tables are in the \"example\" directory included in our package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The path to the two input tables.\n",
    "path_A = os.path.join('.', 'sample_data', 'itunes-amazon', 'tableA.csv')\n",
    "path_B = os.path.join('.', 'sample_data', 'itunes-amazon', 'tableB.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.\n",
      "WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.\n"
     ]
    }
   ],
   "source": [
    "# Load the two tables.\n",
    "A = em.read_csv_metadata(path_A, key='id')\n",
    "B = em.read_csv_metadata(path_B, key='id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of tuples in A: 6907\n",
      "Number of tuples in B: 55923\n",
      "Number of tuples in A X B (i.e the cartesian product): 386260161\n"
     ]
    }
   ],
   "source": [
    "# Basic information about the tables.\n",
    "print('Number of tuples in A: ' + str(len(A)))\n",
    "print('Number of tuples in B: ' + str(len(B)))\n",
    "print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>Song_Name</th>\n",
       "      <th>Artist_Name</th>\n",
       "      <th>Album_Name</th>\n",
       "      <th>Genre</th>\n",
       "      <th>Price</th>\n",
       "      <th>CopyRight</th>\n",
       "      <th>Time</th>\n",
       "      <th>Released</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>Runaway Train</td>\n",
       "      <td>Cam</td>\n",
       "      <td>Welcome to Cam Country - EP</td>\n",
       "      <td>Country , Music , Contemporary Country , Honky Tonk</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>2015 Sony Music Entertainment</td>\n",
       "      <td>3:01</td>\n",
       "      <td>31-Mar-15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Track 14</td>\n",
       "      <td>Omi</td>\n",
       "      <td>Me 4 U</td>\n",
       "      <td>Pop/Rock , Music , Pop , Dance , R&amp;B / Soul</td>\n",
       "      <td>Album Only</td>\n",
       "      <td>2015 Ultra Records , LLC under exclusive license to Columbia Records , a Division of Sony Music ...</td>\n",
       "      <td>3:41</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Lips Are Movin</td>\n",
       "      <td>Meghan Trainor</td>\n",
       "      <td>Title</td>\n",
       "      <td>Pop , Music , Pop/Rock , Dance , Rock , Teen Pop</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 , 2015 Epic Records , a division of Sony Music Entertainment</td>\n",
       "      <td>3:01</td>\n",
       "      <td>9-Jan-15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>I Want a Hippopotamus for Christmas</td>\n",
       "      <td>A Great Big World</td>\n",
       "      <td>I 'll Be Home For Christmas</td>\n",
       "      <td>Holiday , Music</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment</td>\n",
       "      <td>2:20</td>\n",
       "      <td>24-Nov-14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>Credit</td>\n",
       "      <td>Meghan Trainor</td>\n",
       "      <td>Title ( Deluxe )</td>\n",
       "      <td>Pop , Music , Rock , Pop/Rock , Dance , Teen Pop</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 , 2015 Epic Records , a division of Sony Music Entertainment</td>\n",
       "      <td>2:51</td>\n",
       "      <td>9-Jan-15</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id                            Song_Name        Artist_Name  \\\n",
       "0   0                        Runaway Train                Cam   \n",
       "1   1                             Track 14                Omi   \n",
       "2   2                       Lips Are Movin     Meghan Trainor   \n",
       "3   3  I Want a Hippopotamus for Christmas  A Great Big World   \n",
       "4   4                               Credit     Meghan Trainor   \n",
       "\n",
       "                    Album_Name  \\\n",
       "0  Welcome to Cam Country - EP   \n",
       "1                       Me 4 U   \n",
       "2                        Title   \n",
       "3  I 'll Be Home For Christmas   \n",
       "4             Title ( Deluxe )   \n",
       "\n",
       "                                                 Genre       Price  \\\n",
       "0  Country , Music , Contemporary Country , Honky Tonk      $ 0.99   \n",
       "1          Pop/Rock , Music , Pop , Dance , R&B / Soul  Album Only   \n",
       "2     Pop , Music , Pop/Rock , Dance , Rock , Teen Pop      $ 1.29   \n",
       "3                                      Holiday , Music      $ 1.29   \n",
       "4     Pop , Music , Rock , Pop/Rock , Dance , Teen Pop      $ 1.29   \n",
       "\n",
       "                                                                                             CopyRight  \\\n",
       "0                                                                        2015 Sony Music Entertainment   \n",
       "1  2015 Ultra Records , LLC under exclusive license to Columbia Records , a Division of Sony Music ...   \n",
       "2                                    2014 , 2015 Epic Records , a division of Sony Music Entertainment   \n",
       "3                         Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment   \n",
       "4                                    2014 , 2015 Epic Records , a division of Sony Music Entertainment   \n",
       "\n",
       "   Time   Released  \n",
       "0  3:01  31-Mar-15  \n",
       "1  3:41        NaN  \n",
       "2  3:01   9-Jan-15  \n",
       "3  2:20  24-Nov-14  \n",
       "4  2:51   9-Jan-15  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The first few tuples in table A.\n",
    "A.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>Song_Name</th>\n",
       "      <th>Artist_Name</th>\n",
       "      <th>Album_Name</th>\n",
       "      <th>Genre</th>\n",
       "      <th>Price</th>\n",
       "      <th>CopyRight</th>\n",
       "      <th>Time</th>\n",
       "      <th>Released</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>Saxophone Stomp [ Explicit ]</td>\n",
       "      <td>Rusko</td>\n",
       "      <td>! ( Volume 2 ) [ Explicit ]</td>\n",
       "      <td>Dance &amp; Electronic , Dubstep</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.</td>\n",
       "      <td>3:20</td>\n",
       "      <td>September 16 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>I Wan na Mingle [ feat . Pusher ]</td>\n",
       "      <td>Rusko</td>\n",
       "      <td>! ( Volume 2 ) [ Explicit ]</td>\n",
       "      <td>Dance &amp; Electronic , Dubstep</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.</td>\n",
       "      <td>2:36</td>\n",
       "      <td>September 16 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Lytah</td>\n",
       "      <td>Rusko</td>\n",
       "      <td>! ( Volume 2 ) [ Explicit ]</td>\n",
       "      <td>Dance &amp; Electronic , Dubstep</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.</td>\n",
       "      <td>3:48</td>\n",
       "      <td>September 16 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>Slappy Pappy</td>\n",
       "      <td>Rusko</td>\n",
       "      <td>! ( Volume 2 ) [ Explicit ]</td>\n",
       "      <td>Dance &amp; Electronic , Dubstep</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.</td>\n",
       "      <td>3:55</td>\n",
       "      <td>September 16 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>Fushion Jam</td>\n",
       "      <td>Rusko</td>\n",
       "      <td>! ( Volume 2 ) [ Explicit ]</td>\n",
       "      <td>Dance &amp; Electronic , Dubstep</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.</td>\n",
       "      <td>1:48</td>\n",
       "      <td>September 16 , 2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id                          Song_Name Artist_Name  \\\n",
       "0   0       Saxophone Stomp [ Explicit ]       Rusko   \n",
       "1   1  I Wan na Mingle [ feat . Pusher ]       Rusko   \n",
       "2   2                              Lytah       Rusko   \n",
       "3   3                       Slappy Pappy       Rusko   \n",
       "4   4                        Fushion Jam       Rusko   \n",
       "\n",
       "                    Album_Name                         Genre   Price  \\\n",
       "0  ! ( Volume 2 ) [ Explicit ]  Dance & Electronic , Dubstep  $ 1.29   \n",
       "1  ! ( Volume 2 ) [ Explicit ]  Dance & Electronic , Dubstep  $ 1.29   \n",
       "2  ! ( Volume 2 ) [ Explicit ]  Dance & Electronic , Dubstep  $ 1.29   \n",
       "3  ! ( Volume 2 ) [ Explicit ]  Dance & Electronic , Dubstep  $ 1.29   \n",
       "4  ! ( Volume 2 ) [ Explicit ]  Dance & Electronic , Dubstep  $ 1.29   \n",
       "\n",
       "                                                                CopyRight  \\\n",
       "0  ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.   \n",
       "1  ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.   \n",
       "2  ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.   \n",
       "3  ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.   \n",
       "4  ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.   \n",
       "\n",
       "   Time             Released  \n",
       "0  3:20  September 16 , 2014  \n",
       "1  2:36  September 16 , 2014  \n",
       "2  3:48  September 16 , 2014  \n",
       "3  3:55  September 16 , 2014  \n",
       "4  1:48  September 16 , 2014  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The first few tuples in table B.\n",
    "B.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 2. Block tables to get the candidate set\n",
    "We first perform blocking on A and B to get a candidate set K, by removing obvious non-matching tuple pairs. `Magellan` supports four different types of blocker: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. Typically, users need to mix and match these blockers with the debugging functionality provided in `Magellan` to get a good candidate set. (Developing a good blocker is not the focus of this tutorial. For more information on developing and debugging a blocker, please consult the user manual of `Magellan`.)\n",
    "\n",
    "Here we show an example of blocking. Observe that matching tuple pairs should have some common words in the album name, so we first create an overlap blocker on the attribute name \"Album_Name\" with threshold 2, to remove all pairs with word overlap less 2 in that attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [##############################] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:18\n"
     ]
    }
   ],
   "source": [
    "# Create an overlap blocker in Magellan and apply it to A and B to get the candidate set K1 which is in the format of \n",
    "# a dataframe. The \"l_out_attrs\" and \"r_out_attrs\" parameters indicate the columns that will be included in K1 from A\n",
    "# and B respectively.\n",
    "ob = em.OverlapBlocker()\n",
    "K1 = ob.block_tables(A, B, 'Album_Name', 'Album_Name',\n",
    "                    l_output_attrs=['Song_Name', 'Artist_Name', 'Album_Name', 'Genre', 'Price', 'CopyRight', 'Time', 'Released'], \n",
    "                    r_output_attrs=['Song_Name', 'Artist_Name', 'Album_Name', 'Genre', 'Price', 'CopyRight', 'Time', 'Released'],\n",
    "                    overlap_size=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3152021"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The number of tuple pairs in K1.\n",
    "len(K1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>ltable_id</th>\n",
       "      <th>rtable_id</th>\n",
       "      <th>ltable_Song_Name</th>\n",
       "      <th>ltable_Artist_Name</th>\n",
       "      <th>ltable_Album_Name</th>\n",
       "      <th>ltable_Genre</th>\n",
       "      <th>ltable_Price</th>\n",
       "      <th>ltable_CopyRight</th>\n",
       "      <th>ltable_Time</th>\n",
       "      <th>ltable_Released</th>\n",
       "      <th>rtable_Song_Name</th>\n",
       "      <th>rtable_Artist_Name</th>\n",
       "      <th>rtable_Album_Name</th>\n",
       "      <th>rtable_Genre</th>\n",
       "      <th>rtable_Price</th>\n",
       "      <th>rtable_CopyRight</th>\n",
       "      <th>rtable_Time</th>\n",
       "      <th>rtable_Released</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>3995</td>\n",
       "      <td>0</td>\n",
       "      <td>Rudolph Blue</td>\n",
       "      <td>Brian Fechino</td>\n",
       "      <td>A Rock By The Sea Christmas : : Volume 2</td>\n",
       "      <td>Holiday , Music , Rock</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>2011 Hubbub !</td>\n",
       "      <td>1:57</td>\n",
       "      <td>8-Nov-11</td>\n",
       "      <td>Saxophone Stomp [ Explicit ]</td>\n",
       "      <td>Rusko</td>\n",
       "      <td>! ( Volume 2 ) [ Explicit ]</td>\n",
       "      <td>Dance &amp; Electronic , Dubstep</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.</td>\n",
       "      <td>3:20</td>\n",
       "      <td>September 16 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>3995</td>\n",
       "      <td>1</td>\n",
       "      <td>Rudolph Blue</td>\n",
       "      <td>Brian Fechino</td>\n",
       "      <td>A Rock By The Sea Christmas : : Volume 2</td>\n",
       "      <td>Holiday , Music , Rock</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>2011 Hubbub !</td>\n",
       "      <td>1:57</td>\n",
       "      <td>8-Nov-11</td>\n",
       "      <td>I Wan na Mingle [ feat . Pusher ]</td>\n",
       "      <td>Rusko</td>\n",
       "      <td>! ( Volume 2 ) [ Explicit ]</td>\n",
       "      <td>Dance &amp; Electronic , Dubstep</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.</td>\n",
       "      <td>2:36</td>\n",
       "      <td>September 16 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>3995</td>\n",
       "      <td>2</td>\n",
       "      <td>Rudolph Blue</td>\n",
       "      <td>Brian Fechino</td>\n",
       "      <td>A Rock By The Sea Christmas : : Volume 2</td>\n",
       "      <td>Holiday , Music , Rock</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>2011 Hubbub !</td>\n",
       "      <td>1:57</td>\n",
       "      <td>8-Nov-11</td>\n",
       "      <td>Lytah</td>\n",
       "      <td>Rusko</td>\n",
       "      <td>! ( Volume 2 ) [ Explicit ]</td>\n",
       "      <td>Dance &amp; Electronic , Dubstep</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.</td>\n",
       "      <td>3:48</td>\n",
       "      <td>September 16 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>3995</td>\n",
       "      <td>3</td>\n",
       "      <td>Rudolph Blue</td>\n",
       "      <td>Brian Fechino</td>\n",
       "      <td>A Rock By The Sea Christmas : : Volume 2</td>\n",
       "      <td>Holiday , Music , Rock</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>2011 Hubbub !</td>\n",
       "      <td>1:57</td>\n",
       "      <td>8-Nov-11</td>\n",
       "      <td>Slappy Pappy</td>\n",
       "      <td>Rusko</td>\n",
       "      <td>! ( Volume 2 ) [ Explicit ]</td>\n",
       "      <td>Dance &amp; Electronic , Dubstep</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.</td>\n",
       "      <td>3:55</td>\n",
       "      <td>September 16 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>3995</td>\n",
       "      <td>4</td>\n",
       "      <td>Rudolph Blue</td>\n",
       "      <td>Brian Fechino</td>\n",
       "      <td>A Rock By The Sea Christmas : : Volume 2</td>\n",
       "      <td>Holiday , Music , Rock</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>2011 Hubbub !</td>\n",
       "      <td>1:57</td>\n",
       "      <td>8-Nov-11</td>\n",
       "      <td>Fushion Jam</td>\n",
       "      <td>Rusko</td>\n",
       "      <td>! ( Volume 2 ) [ Explicit ]</td>\n",
       "      <td>Dance &amp; Electronic , Dubstep</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.</td>\n",
       "      <td>1:48</td>\n",
       "      <td>September 16 , 2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   _id  ltable_id  rtable_id ltable_Song_Name ltable_Artist_Name  \\\n",
       "0    0       3995          0     Rudolph Blue      Brian Fechino   \n",
       "1    1       3995          1     Rudolph Blue      Brian Fechino   \n",
       "2    2       3995          2     Rudolph Blue      Brian Fechino   \n",
       "3    3       3995          3     Rudolph Blue      Brian Fechino   \n",
       "4    4       3995          4     Rudolph Blue      Brian Fechino   \n",
       "\n",
       "                          ltable_Album_Name            ltable_Genre  \\\n",
       "0  A Rock By The Sea Christmas : : Volume 2  Holiday , Music , Rock   \n",
       "1  A Rock By The Sea Christmas : : Volume 2  Holiday , Music , Rock   \n",
       "2  A Rock By The Sea Christmas : : Volume 2  Holiday , Music , Rock   \n",
       "3  A Rock By The Sea Christmas : : Volume 2  Holiday , Music , Rock   \n",
       "4  A Rock By The Sea Christmas : : Volume 2  Holiday , Music , Rock   \n",
       "\n",
       "  ltable_Price ltable_CopyRight ltable_Time ltable_Released  \\\n",
       "0       $ 0.99    2011 Hubbub !        1:57        8-Nov-11   \n",
       "1       $ 0.99    2011 Hubbub !        1:57        8-Nov-11   \n",
       "2       $ 0.99    2011 Hubbub !        1:57        8-Nov-11   \n",
       "3       $ 0.99    2011 Hubbub !        1:57        8-Nov-11   \n",
       "4       $ 0.99    2011 Hubbub !        1:57        8-Nov-11   \n",
       "\n",
       "                    rtable_Song_Name rtable_Artist_Name  \\\n",
       "0       Saxophone Stomp [ Explicit ]              Rusko   \n",
       "1  I Wan na Mingle [ feat . Pusher ]              Rusko   \n",
       "2                              Lytah              Rusko   \n",
       "3                       Slappy Pappy              Rusko   \n",
       "4                        Fushion Jam              Rusko   \n",
       "\n",
       "             rtable_Album_Name                  rtable_Genre rtable_Price  \\\n",
       "0  ! ( Volume 2 ) [ Explicit ]  Dance & Electronic , Dubstep       $ 1.29   \n",
       "1  ! ( Volume 2 ) [ Explicit ]  Dance & Electronic , Dubstep       $ 1.29   \n",
       "2  ! ( Volume 2 ) [ Explicit ]  Dance & Electronic , Dubstep       $ 1.29   \n",
       "3  ! ( Volume 2 ) [ Explicit ]  Dance & Electronic , Dubstep       $ 1.29   \n",
       "4  ! ( Volume 2 ) [ Explicit ]  Dance & Electronic , Dubstep       $ 1.29   \n",
       "\n",
       "                                                         rtable_CopyRight  \\\n",
       "0  ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.   \n",
       "1  ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.   \n",
       "2  ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.   \n",
       "3  ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.   \n",
       "4  ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.   \n",
       "\n",
       "  rtable_Time      rtable_Released  \n",
       "0        3:20  September 16 , 2014  \n",
       "1        2:36  September 16 , 2014  \n",
       "2        3:48  September 16 , 2014  \n",
       "3        3:55  September 16 , 2014  \n",
       "4        1:48  September 16 , 2014  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The first few tuple pairs in K1\n",
    "K1.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that K1 has more than 3 million pair in it, which is too large to considered for matching. So we create an overlap blocking with threshold 1 on the attribute \"Artist_Name\" for K1, to filter all pairs in K1 that don't share any word in \"Artist_Name\". And we get the candidate set K2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [##############################] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:29\n"
     ]
    }
   ],
   "source": [
    "# Create a new overlap blocker to remove pairs from K1 that have no common word in \"Artist_Name\".\n",
    "K2 = ob.block_candset(K1, 'Artist_Name', 'Artist_Name', overlap_size=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "167180"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The number of tuple pairs in K2.\n",
    "len(K2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the second blocking step, now we have a candidate set K2 with about 170K pairs. But we think it is still a bit larger to consider for matching. So we apply a third blocker, which is an overlap blocker on the attribute \"Song_Name\", to further reduce the size of the candidate set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [##############################] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:01\n"
     ]
    }
   ],
   "source": [
    "# Apply the third overlap blocker.\n",
    "K3 = ob.block_candset(K2, 'Song_Name', 'Song_Name', overlap_size=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "38015"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The number of tuples pairs in K3.\n",
    "len(K3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have a candidate set with 38K pairs which is reasonable, so we take K3 as the final candidate set. We save the candidate to the disk in the csv format for future reuse."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "path_K = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'candidate.csv')\n",
    "K3.to_csv(path_K, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 3. Match tuple pairs in the candidate set\n",
    "In this stage we will match tuple pairs in the candidate set to predict each of them as match or non-match. This is the part that `deepmatcher` will be involved. Specifically, it consists of the following steps:\n",
    "1. Take a sample S from the candidate set K, and label all pairs in S.\n",
    "2. Train a classifier L using S. Specifically, we will train a classifier using `deepmatcher`.\n",
    "3. Apply L to the candidiate set K."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sample and label the candidate set\n",
    "We first take a random sample S from the candidate set K using `Magellan`. Here for example, we sample 500 pairs for K. Then we label each pair as match (enter 1) or non-match (enter 0) and use S as the training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Take a sample of 500 pairs from the candidate set.\n",
    "S = em.sample_table(K3, 500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Label the sample S in a GUI. Enter 1 for match and 0 for non-match.\n",
    "G = em.label_table(S, 'gold')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the purposes of this tutorial, we will load in a pre-labeled dataset (of 539 tuple pairs) included in this package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The path to the labeled data file.\n",
    "path_G = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'gold.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of labeled pairs: 539\n"
     ]
    }
   ],
   "source": [
    "# Load the labeled data into a dataframe.\n",
    "G = em.read_csv_metadata(path_G, \n",
    "                         key='_id',\n",
    "                         ltable=A, rtable=B, \n",
    "                         fk_ltable='ltable_id', fk_rtable='rtable_id')\n",
    "print('Number of labeled pairs:', len(G))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train a classifier using labeled data\n",
    "Once we have the labeled data, we use `deepmatcher` to train a classifier. The first thing we need to do is to split the data for training purpose. In this example, we split the labeled data into three parts: training, validation and test data, with the ratio of 3:1:1. (For now we only support spliting the labeled data into three parts train/valid/test, where the validation set is used for selecting the best model during the training epochs.) For the purpose of caching data and progressive training, we will first save the split parts to disk in the format of csv files, then load them back in. The cache file will be saved during the loading procedure. For subsequent training runs, the cache file will be used to save preprocessing time on the raw csv files, unless the csv files are modified (in this case, new cache file will be generated)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The directory where the data splits will be saved.\n",
    "split_path = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split labeled data into train, valid, and test csv files to disk, with the split ratio of 3:1:1.\n",
    "dm.data.split(G, split_path, 'train.csv', 'valid.csv', 'test.csv',\n",
    "              [3, 1, 1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:deepmatcher.data.dataset:Rebuilding data cache because: ['One or more data files have been modified.']\n",
      "\n",
      "Reading and processing data from \"./sample_data/itunes-amazon/e2e-tutorial/train.csv\"\n",
      "0% [############################# ] 100% | ETA: 00:00:00\n",
      "Reading and processing data from \"./sample_data/itunes-amazon/e2e-tutorial/valid.csv\"\n",
      "0% [############################# ] 100% | ETA: 00:00:00\n",
      "Reading and processing data from \"./sample_data/itunes-amazon/e2e-tutorial/test.csv\"\n",
      "0% [############################# ] 100% | ETA: 00:00:00\n",
      "Building vocabulary\n",
      "0% [#] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n",
      "\n",
      "Computing principal components\n",
      "0% [#] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    }
   ],
   "source": [
    "# Load the training data files from the disk. Ignore the \"left_id\" and \"right_id\" \n",
    "# columns for data preprocessing.\n",
    "# The 'use_magellan_convention' parameter asks deepmatcher to use Magellan's \n",
    "# naming convention for the left and right table column name prefixes \n",
    "# (\"ltable_\", and \"rtable_\"), and also to consider \"_id\" as the ID column.\n",
    "train, validation, test = dm.data.process(\n",
    "    path=os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial'),\n",
    "    cache='train_cache.pth',\n",
    "    train='train.csv',\n",
    "    validation='valid.csv',\n",
    "    test='test.csv',\n",
    "    use_magellan_convention=True,\n",
    "    ignore_columns=('ltable_id', 'rtable_id'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we get the training data, we can use `deepmatcher` to train a classifier. Here we train a hybrid model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a hybrid model.\n",
    "model = dm.MatchingModel(attr_summarizer='hybrid')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "* Number of trainable parameters: 17757810\n",
      "===>  TRAIN Epoch 1\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [████] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:04\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 1 || Run Time:    4.2 | Load Time:    0.4 || F1:  39.60 | Prec:  31.25 | Rec:  54.05 || Ex/s:  69.89\n",
      "\n",
      "===>  EVAL Epoch 1\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [█] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 1 || Run Time:    0.6 | Load Time:    0.2 || F1:  60.61 | Prec:  50.00 | Rec:  76.92 || Ex/s: 146.30\n",
      "\n",
      "* Best F1: 60.60606060606061\n",
      "Saving best model...\n",
      "Done.\n",
      "---------------------\n",
      "\n",
      "===>  TRAIN Epoch 2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [████] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:04\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 2 || Run Time:    3.9 | Load Time:    0.4 || F1:  65.62 | Prec:  53.39 | Rec:  85.14 || Ex/s:  75.12\n",
      "\n",
      "===>  EVAL Epoch 2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [█] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 2 || Run Time:    0.6 | Load Time:    0.2 || F1:  76.00 | Prec:  79.17 | Rec:  73.08 || Ex/s: 146.34\n",
      "\n",
      "* Best F1: 76.0\n",
      "Saving best model...\n",
      "Done.\n",
      "---------------------\n",
      "\n",
      "===>  TRAIN Epoch 3\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [████] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:03\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 3 || Run Time:    3.8 | Load Time:    0.4 || F1:  87.27 | Prec:  79.12 | Rec:  97.30 || Ex/s:  76.32\n",
      "\n",
      "===>  EVAL Epoch 3\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [█] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 3 || Run Time:    0.6 | Load Time:    0.2 || F1:  84.21 | Prec:  77.42 | Rec:  92.31 || Ex/s: 143.25\n",
      "\n",
      "* Best F1: 84.21052631578948\n",
      "Saving best model...\n",
      "Done.\n",
      "---------------------\n",
      "\n",
      "===>  TRAIN Epoch 4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [████] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:03\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 4 || Run Time:    3.8 | Load Time:    0.4 || F1:  90.57 | Prec:  84.71 | Rec:  97.30 || Ex/s:  76.22\n",
      "\n",
      "===>  EVAL Epoch 4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [█] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 4 || Run Time:    0.6 | Load Time:    0.2 || F1:  82.76 | Prec:  75.00 | Rec:  92.31 || Ex/s: 146.11\n",
      "\n",
      "---------------------\n",
      "\n",
      "===>  TRAIN Epoch 5\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [████] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:04\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 5 || Run Time:    3.9 | Load Time:    0.4 || F1:  97.37 | Prec:  94.87 | Rec: 100.00 || Ex/s:  74.96\n",
      "\n",
      "===>  EVAL Epoch 5\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [█] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 5 || Run Time:    0.6 | Load Time:    0.2 || F1:  90.57 | Prec:  88.89 | Rec:  92.31 || Ex/s: 146.11\n",
      "\n",
      "* Best F1: 90.56603773584906\n",
      "Saving best model...\n",
      "Done.\n",
      "---------------------\n",
      "\n",
      "===>  TRAIN Epoch 6\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [████] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:04\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 6 || Run Time:    3.9 | Load Time:    0.4 || F1:  98.67 | Prec:  97.37 | Rec: 100.00 || Ex/s:  74.94\n",
      "\n",
      "===>  EVAL Epoch 6\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [█] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 6 || Run Time:    0.6 | Load Time:    0.2 || F1:  90.20 | Prec:  92.00 | Rec:  88.46 || Ex/s: 145.73\n",
      "\n",
      "---------------------\n",
      "\n",
      "===>  TRAIN Epoch 7\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [████] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:03\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 7 || Run Time:    3.8 | Load Time:    0.4 || F1:  99.33 | Prec:  98.67 | Rec: 100.00 || Ex/s:  75.88\n",
      "\n",
      "===>  EVAL Epoch 7\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [█] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 7 || Run Time:    0.6 | Load Time:    0.2 || F1:  90.20 | Prec:  92.00 | Rec:  88.46 || Ex/s: 145.56\n",
      "\n",
      "---------------------\n",
      "\n",
      "===>  TRAIN Epoch 8\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [████] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:03\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 8 || Run Time:    3.9 | Load Time:    0.4 || F1:  99.33 | Prec:  98.67 | Rec: 100.00 || Ex/s:  75.16\n",
      "\n",
      "===>  EVAL Epoch 8\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [█] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 8 || Run Time:    0.6 | Load Time:    0.2 || F1:  92.00 | Prec:  95.83 | Rec:  88.46 || Ex/s: 145.76\n",
      "\n",
      "* Best F1: 92.00000000000001\n",
      "Saving best model...\n",
      "Done.\n",
      "---------------------\n",
      "\n",
      "===>  TRAIN Epoch 9\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [████] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:03\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 9 || Run Time:    3.7 | Load Time:    0.4 || F1:  99.33 | Prec:  98.67 | Rec: 100.00 || Ex/s:  77.50\n",
      "\n",
      "===>  EVAL Epoch 9\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [█] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 9 || Run Time:    0.6 | Load Time:    0.2 || F1:  92.00 | Prec:  95.83 | Rec:  88.46 || Ex/s: 145.27\n",
      "\n",
      "---------------------\n",
      "\n",
      "===>  TRAIN Epoch 10\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [████] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:04\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 10 || Run Time:    3.9 | Load Time:    0.4 || F1:  99.33 | Prec:  98.67 | Rec: 100.00 || Ex/s:  75.28\n",
      "\n",
      "===>  EVAL Epoch 10\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [█] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:00:00\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 10 || Run Time:    0.6 | Load Time:    0.2 || F1:  92.00 | Prec:  95.83 | Rec:  88.46 || Ex/s: 145.95\n",
      "\n",
      "---------------------\n",
      "\n",
      "Loading best model...\n",
      "Training done.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "92.00000000000001"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Train the hybrid model with 10 training epochs, batch size of 16, positive-to-negative \n",
    "# ratio to be 3. We save the best model (with the \n",
    "# highest F1 score on the validation set) to 'hybrid_model.pth'.\n",
    "model.run_train(\n",
    "    train,\n",
    "    validation,\n",
    "    epochs=10,\n",
    "    batch_size=16,\n",
    "    best_save_path='hybrid_model.pth',\n",
    "    pos_neg_ratio=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have the trained classifier, we can evaluate the accuracy using the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "===>  EVAL Epoch 8\n",
      "Finished Epoch 8 || Run Time:    0.3 | Load Time:    0.2 || F1:  92.31 | Prec:  90.91 | Rec:  93.75 || Ex/s: 208.04\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "92.3076923076923"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Evaluate the accuracy on the test data.\n",
    "model.run_eval(test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Apply the trained classifier to the candidate set\n",
    "Now we have a trained classifer, we can apply it to the candidate set to predict each pair as match or non-match. To achieve this, we first load the trained model and the candidate set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the model.\n",
    "model = dm.MatchingModel(attr_summarizer='hybrid')\n",
    "model.load_state('hybrid_model.pth')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "Reading and processing data from \"./sample_data/itunes-amazon/e2e-tutorial/candidate.csv\"\n",
      "0% [##############################] 100% | ETA: 00:00:00"
     ]
    }
   ],
   "source": [
    "# Load the candidate set. Note that the trained model is an input parameter as we need to trained \n",
    "# model for candidate set preprocessing.\n",
    "candidate = dm.data.process_unlabeled(\n",
    "    path=os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'candidate.csv'),\n",
    "    trained_model=model,\n",
    "    ignore_columns=('ltable_id', 'rtable_id'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we load the candidate set, now we make the predictions using the `run_prediction` function. The *output_attribute* argument indicates the columns that will be included in the prediction table. Here \n",
    "we use all of the columns in the candidate set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "===>  PREDICT Epoch 8\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0% [██████████████████████████████] 100% | ETA: 00:00:00\n",
      "Total time elapsed: 00:02:18\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Epoch 8 || Run Time:   97.2 | Load Time:   41.5 || F1:   0.00 | Prec:   0.00 | Rec:   0.00 || Ex/s:   0.00\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Predict the pairs in the candidate set and return a dataframe containing the pair id with \n",
    "# the score of being a match.\n",
    "predictions = model.run_prediction(candidate, output_attributes=list(candidate.get_raw_table().columns))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below shows a few pairs in the prediction table. Note that this table is not sorted in the descending order of the \n",
    "match scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>match_score</th>\n",
       "      <th>ltable_id</th>\n",
       "      <th>rtable_id</th>\n",
       "      <th>ltable_Song_Name</th>\n",
       "      <th>ltable_Artist_Name</th>\n",
       "      <th>ltable_Album_Name</th>\n",
       "      <th>ltable_Genre</th>\n",
       "      <th>ltable_Price</th>\n",
       "      <th>ltable_CopyRight</th>\n",
       "      <th>ltable_Time</th>\n",
       "      <th>ltable_Released</th>\n",
       "      <th>rtable_Song_Name</th>\n",
       "      <th>rtable_Artist_Name</th>\n",
       "      <th>rtable_Album_Name</th>\n",
       "      <th>rtable_Genre</th>\n",
       "      <th>rtable_Price</th>\n",
       "      <th>rtable_CopyRight</th>\n",
       "      <th>rtable_Time</th>\n",
       "      <th>rtable_Released</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3237</th>\n",
       "      <td>0.960461</td>\n",
       "      <td>2583</td>\n",
       "      <td>228</td>\n",
       "      <td>#Selfie ( Instrumental Mix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#Selfie ( Instrumental Mix ) - Single</td>\n",
       "      <td>Dance , Music , Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc.</td>\n",
       "      <td>3:03</td>\n",
       "      <td>20-Mar-14</td>\n",
       "      <td>#SELFIE ( Instrumental Mix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#SELFIE ( Instrumental Mix )</td>\n",
       "      <td>Dance &amp; Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...</td>\n",
       "      <td>3:03</td>\n",
       "      <td>March 20 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3274</th>\n",
       "      <td>0.294473</td>\n",
       "      <td>2003</td>\n",
       "      <td>229</td>\n",
       "      <td>#SELFIE ( Caked Up Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#SELFIE ( The Remixes ) - Single</td>\n",
       "      <td>Electronic , Music , Dance</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...</td>\n",
       "      <td>3:15</td>\n",
       "      <td>17-Jun-14</td>\n",
       "      <td>#Selfie ( Will Sparks Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#Selfie ( The Remixes )</td>\n",
       "      <td>Electronica , Dance &amp; Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...</td>\n",
       "      <td>3:52</td>\n",
       "      <td>June 17 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3275</th>\n",
       "      <td>0.294473</td>\n",
       "      <td>2574</td>\n",
       "      <td>229</td>\n",
       "      <td>#SELFIE ( Caked Up Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#SELFIE ( The Remixes ) - Single</td>\n",
       "      <td>Electronic , Music , Dance</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...</td>\n",
       "      <td>3:15</td>\n",
       "      <td>17-Jun-14</td>\n",
       "      <td>#Selfie ( Will Sparks Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#Selfie ( The Remixes )</td>\n",
       "      <td>Electronica , Dance &amp; Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...</td>\n",
       "      <td>3:52</td>\n",
       "      <td>June 17 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3276</th>\n",
       "      <td>0.294473</td>\n",
       "      <td>2575</td>\n",
       "      <td>229</td>\n",
       "      <td>#SELFIE ( Caked Up Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#SELFIE ( The Remixes ) - Single</td>\n",
       "      <td>Electronic , Music , Dance</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...</td>\n",
       "      <td>3:15</td>\n",
       "      <td>17-Jun-14</td>\n",
       "      <td>#Selfie ( Will Sparks Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#Selfie ( The Remixes )</td>\n",
       "      <td>Electronica , Dance &amp; Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...</td>\n",
       "      <td>3:52</td>\n",
       "      <td>June 17 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3359</th>\n",
       "      <td>0.161323</td>\n",
       "      <td>2003</td>\n",
       "      <td>230</td>\n",
       "      <td>#SELFIE ( Caked Up Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#SELFIE ( The Remixes ) - Single</td>\n",
       "      <td>Electronic , Music , Dance</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...</td>\n",
       "      <td>3:15</td>\n",
       "      <td>17-Jun-14</td>\n",
       "      <td>#Selfie ( Botnek Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#Selfie ( The Remixes )</td>\n",
       "      <td>Electronica , Dance &amp; Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...</td>\n",
       "      <td>3:36</td>\n",
       "      <td>June 17 , 2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      match_score  ltable_id  rtable_id              ltable_Song_Name  \\\n",
       "_id                                                                     \n",
       "3237     0.960461       2583        228  #Selfie ( Instrumental Mix )   \n",
       "3274     0.294473       2003        229    #SELFIE ( Caked Up Remix )   \n",
       "3275     0.294473       2574        229    #SELFIE ( Caked Up Remix )   \n",
       "3276     0.294473       2575        229    #SELFIE ( Caked Up Remix )   \n",
       "3359     0.161323       2003        230    #SELFIE ( Caked Up Remix )   \n",
       "\n",
       "     ltable_Artist_Name                      ltable_Album_Name  \\\n",
       "_id                                                              \n",
       "3237   The Chainsmokers  #Selfie ( Instrumental Mix ) - Single   \n",
       "3274   The Chainsmokers       #SELFIE ( The Remixes ) - Single   \n",
       "3275   The Chainsmokers       #SELFIE ( The Remixes ) - Single   \n",
       "3276   The Chainsmokers       #SELFIE ( The Remixes ) - Single   \n",
       "3359   The Chainsmokers       #SELFIE ( The Remixes ) - Single   \n",
       "\n",
       "                    ltable_Genre ltable_Price  \\\n",
       "_id                                             \n",
       "3237  Dance , Music , Electronic       $ 1.29   \n",
       "3274  Electronic , Music , Dance       $ 1.29   \n",
       "3275  Electronic , Music , Dance       $ 1.29   \n",
       "3276  Electronic , Music , Dance       $ 1.29   \n",
       "3359  Electronic , Music , Dance       $ 1.29   \n",
       "\n",
       "                                                                                         ltable_CopyRight  \\\n",
       "_id                                                                                                         \n",
       "3237  2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc.   \n",
       "3274  2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...   \n",
       "3275  2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...   \n",
       "3276  2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...   \n",
       "3359  2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...   \n",
       "\n",
       "     ltable_Time ltable_Released               rtable_Song_Name  \\\n",
       "_id                                                               \n",
       "3237        3:03       20-Mar-14   #SELFIE ( Instrumental Mix )   \n",
       "3274        3:15       17-Jun-14  #Selfie ( Will Sparks Remix )   \n",
       "3275        3:15       17-Jun-14  #Selfie ( Will Sparks Remix )   \n",
       "3276        3:15       17-Jun-14  #Selfie ( Will Sparks Remix )   \n",
       "3359        3:15       17-Jun-14       #Selfie ( Botnek Remix )   \n",
       "\n",
       "     rtable_Artist_Name             rtable_Album_Name  \\\n",
       "_id                                                     \n",
       "3237   The Chainsmokers  #SELFIE ( Instrumental Mix )   \n",
       "3274   The Chainsmokers       #Selfie ( The Remixes )   \n",
       "3275   The Chainsmokers       #Selfie ( The Remixes )   \n",
       "3276   The Chainsmokers       #Selfie ( The Remixes )   \n",
       "3359   The Chainsmokers       #Selfie ( The Remixes )   \n",
       "\n",
       "                          rtable_Genre rtable_Price  \\\n",
       "_id                                                   \n",
       "3237                Dance & Electronic       $ 1.29   \n",
       "3274  Electronica , Dance & Electronic       $ 1.29   \n",
       "3275  Electronica , Dance & Electronic       $ 1.29   \n",
       "3276  Electronica , Dance & Electronic       $ 1.29   \n",
       "3359  Electronica , Dance & Electronic       $ 1.29   \n",
       "\n",
       "                                                                                         rtable_CopyRight  \\\n",
       "_id                                                                                                         \n",
       "3237  ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...   \n",
       "3274  ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...   \n",
       "3275  ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...   \n",
       "3276  ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...   \n",
       "3359  ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...   \n",
       "\n",
       "     rtable_Time  rtable_Released  \n",
       "_id                                \n",
       "3237        3:03  March 20 , 2014  \n",
       "3274        3:52   June 17 , 2014  \n",
       "3275        3:52   June 17 , 2014  \n",
       "3276        3:52   June 17 , 2014  \n",
       "3359        3:36   June 17 , 2014  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictions.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can simply manipulate the prediction table above to get the pairs that we want. Suppose we are only interested in pairs with the match score more over 0.9, and we want to display them sorting in the descending order on match score. This can be achieved as follow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>match_score</th>\n",
       "      <th>ltable_id</th>\n",
       "      <th>rtable_id</th>\n",
       "      <th>ltable_Song_Name</th>\n",
       "      <th>ltable_Artist_Name</th>\n",
       "      <th>ltable_Album_Name</th>\n",
       "      <th>ltable_Genre</th>\n",
       "      <th>ltable_Price</th>\n",
       "      <th>ltable_CopyRight</th>\n",
       "      <th>ltable_Time</th>\n",
       "      <th>ltable_Released</th>\n",
       "      <th>rtable_Song_Name</th>\n",
       "      <th>rtable_Artist_Name</th>\n",
       "      <th>rtable_Album_Name</th>\n",
       "      <th>rtable_Genre</th>\n",
       "      <th>rtable_Price</th>\n",
       "      <th>rtable_CopyRight</th>\n",
       "      <th>rtable_Time</th>\n",
       "      <th>rtable_Released</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1472466</th>\n",
       "      <td>0.995653</td>\n",
       "      <td>2462</td>\n",
       "      <td>30798</td>\n",
       "      <td>It 's Never Easy to Say Goodbye</td>\n",
       "      <td>Kenny Chesney</td>\n",
       "      <td>Me and You</td>\n",
       "      <td>Country , Music , Urban Cowboy , Contemporary Country</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>1996 BMG Entertainment</td>\n",
       "      <td>4:44</td>\n",
       "      <td>1-Jan-90</td>\n",
       "      <td>It 's Never Easy To Say Goodbye</td>\n",
       "      <td>Kenny Chesney</td>\n",
       "      <td>Me And You</td>\n",
       "      <td>Country</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC</td>\n",
       "      <td>4:44</td>\n",
       "      <td>January 1 , 1990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1472465</th>\n",
       "      <td>0.995653</td>\n",
       "      <td>2453</td>\n",
       "      <td>30798</td>\n",
       "      <td>It 's Never Easy to Say Goodbye</td>\n",
       "      <td>Kenny Chesney</td>\n",
       "      <td>Me and You</td>\n",
       "      <td>Country , Music , Urban Cowboy , Contemporary Country</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>1996 BMG Entertainment</td>\n",
       "      <td>4:44</td>\n",
       "      <td>1-Jan-90</td>\n",
       "      <td>It 's Never Easy To Say Goodbye</td>\n",
       "      <td>Kenny Chesney</td>\n",
       "      <td>Me And You</td>\n",
       "      <td>Country</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC</td>\n",
       "      <td>4:44</td>\n",
       "      <td>January 1 , 1990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1472464</th>\n",
       "      <td>0.995653</td>\n",
       "      <td>2438</td>\n",
       "      <td>30798</td>\n",
       "      <td>It 's Never Easy to Say Goodbye</td>\n",
       "      <td>Kenny Chesney</td>\n",
       "      <td>Me and You</td>\n",
       "      <td>Country , Music , Urban Cowboy , Contemporary Country</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>1996 BMG Entertainment</td>\n",
       "      <td>4:44</td>\n",
       "      <td>1-Jan-90</td>\n",
       "      <td>It 's Never Easy To Say Goodbye</td>\n",
       "      <td>Kenny Chesney</td>\n",
       "      <td>Me And You</td>\n",
       "      <td>Country</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC</td>\n",
       "      <td>4:44</td>\n",
       "      <td>January 1 , 1990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1472463</th>\n",
       "      <td>0.995653</td>\n",
       "      <td>2435</td>\n",
       "      <td>30798</td>\n",
       "      <td>It 's Never Easy to Say Goodbye</td>\n",
       "      <td>Kenny Chesney</td>\n",
       "      <td>Me and You</td>\n",
       "      <td>Country , Music , Urban Cowboy , Contemporary Country</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>1996 BMG Entertainment</td>\n",
       "      <td>4:44</td>\n",
       "      <td>1-Jan-90</td>\n",
       "      <td>It 's Never Easy To Say Goodbye</td>\n",
       "      <td>Kenny Chesney</td>\n",
       "      <td>Me And You</td>\n",
       "      <td>Country</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC</td>\n",
       "      <td>4:44</td>\n",
       "      <td>January 1 , 1990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1472462</th>\n",
       "      <td>0.995653</td>\n",
       "      <td>2430</td>\n",
       "      <td>30798</td>\n",
       "      <td>It 's Never Easy to Say Goodbye</td>\n",
       "      <td>Kenny Chesney</td>\n",
       "      <td>Me and You</td>\n",
       "      <td>Country , Music , Urban Cowboy , Contemporary Country</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>1996 BMG Entertainment</td>\n",
       "      <td>4:44</td>\n",
       "      <td>1-Jan-90</td>\n",
       "      <td>It 's Never Easy To Say Goodbye</td>\n",
       "      <td>Kenny Chesney</td>\n",
       "      <td>Me And You</td>\n",
       "      <td>Country</td>\n",
       "      <td>$ 0.99</td>\n",
       "      <td>( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC</td>\n",
       "      <td>4:44</td>\n",
       "      <td>January 1 , 1990</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         match_score  ltable_id  rtable_id                 ltable_Song_Name  \\\n",
       "_id                                                                           \n",
       "1472466     0.995653       2462      30798  It 's Never Easy to Say Goodbye   \n",
       "1472465     0.995653       2453      30798  It 's Never Easy to Say Goodbye   \n",
       "1472464     0.995653       2438      30798  It 's Never Easy to Say Goodbye   \n",
       "1472463     0.995653       2435      30798  It 's Never Easy to Say Goodbye   \n",
       "1472462     0.995653       2430      30798  It 's Never Easy to Say Goodbye   \n",
       "\n",
       "        ltable_Artist_Name ltable_Album_Name  \\\n",
       "_id                                            \n",
       "1472466      Kenny Chesney        Me and You   \n",
       "1472465      Kenny Chesney        Me and You   \n",
       "1472464      Kenny Chesney        Me and You   \n",
       "1472463      Kenny Chesney        Me and You   \n",
       "1472462      Kenny Chesney        Me and You   \n",
       "\n",
       "                                                  ltable_Genre ltable_Price  \\\n",
       "_id                                                                           \n",
       "1472466  Country , Music , Urban Cowboy , Contemporary Country       $ 0.99   \n",
       "1472465  Country , Music , Urban Cowboy , Contemporary Country       $ 0.99   \n",
       "1472464  Country , Music , Urban Cowboy , Contemporary Country       $ 0.99   \n",
       "1472463  Country , Music , Urban Cowboy , Contemporary Country       $ 0.99   \n",
       "1472462  Country , Music , Urban Cowboy , Contemporary Country       $ 0.99   \n",
       "\n",
       "               ltable_CopyRight ltable_Time ltable_Released  \\\n",
       "_id                                                           \n",
       "1472466  1996 BMG Entertainment        4:44        1-Jan-90   \n",
       "1472465  1996 BMG Entertainment        4:44        1-Jan-90   \n",
       "1472464  1996 BMG Entertainment        4:44        1-Jan-90   \n",
       "1472463  1996 BMG Entertainment        4:44        1-Jan-90   \n",
       "1472462  1996 BMG Entertainment        4:44        1-Jan-90   \n",
       "\n",
       "                        rtable_Song_Name rtable_Artist_Name rtable_Album_Name  \\\n",
       "_id                                                                             \n",
       "1472466  It 's Never Easy To Say Goodbye      Kenny Chesney        Me And You   \n",
       "1472465  It 's Never Easy To Say Goodbye      Kenny Chesney        Me And You   \n",
       "1472464  It 's Never Easy To Say Goodbye      Kenny Chesney        Me And You   \n",
       "1472463  It 's Never Easy To Say Goodbye      Kenny Chesney        Me And You   \n",
       "1472462  It 's Never Easy To Say Goodbye      Kenny Chesney        Me And You   \n",
       "\n",
       "        rtable_Genre rtable_Price  \\\n",
       "_id                                 \n",
       "1472466      Country       $ 0.99   \n",
       "1472465      Country       $ 0.99   \n",
       "1472464      Country       $ 0.99   \n",
       "1472463      Country       $ 0.99   \n",
       "1472462      Country       $ 0.99   \n",
       "\n",
       "                                                                     rtable_CopyRight  \\\n",
       "_id                                                                                     \n",
       "1472466  ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC   \n",
       "1472465  ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC   \n",
       "1472464  ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC   \n",
       "1472463  ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC   \n",
       "1472462  ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC   \n",
       "\n",
       "        rtable_Time   rtable_Released  \n",
       "_id                                    \n",
       "1472466        4:44  January 1 , 1990  \n",
       "1472465        4:44  January 1 , 1990  \n",
       "1472464        4:44  January 1 , 1990  \n",
       "1472463        4:44  January 1 , 1990  \n",
       "1472462        4:44  January 1 , 1990  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "high_score_pairs = predictions[predictions['match_score'] >= 0.9].sort_values(by=['match_score'], ascending=False)\n",
    "high_score_pairs.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you have seen, `deepmatcher` does not include a column indicating match / non-match predictions in the output table. You can easily create such a column by thresholding the `match_score` at 0.5 (which is what `deepmatcher` uses as the threshold to compute F1), or using a threshold value or your choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>match_score</th>\n",
       "      <th>match_prediction</th>\n",
       "      <th>ltable_id</th>\n",
       "      <th>rtable_id</th>\n",
       "      <th>ltable_Song_Name</th>\n",
       "      <th>ltable_Artist_Name</th>\n",
       "      <th>ltable_Album_Name</th>\n",
       "      <th>ltable_Genre</th>\n",
       "      <th>ltable_Price</th>\n",
       "      <th>ltable_CopyRight</th>\n",
       "      <th>ltable_Time</th>\n",
       "      <th>ltable_Released</th>\n",
       "      <th>rtable_Song_Name</th>\n",
       "      <th>rtable_Artist_Name</th>\n",
       "      <th>rtable_Album_Name</th>\n",
       "      <th>rtable_Genre</th>\n",
       "      <th>rtable_Price</th>\n",
       "      <th>rtable_CopyRight</th>\n",
       "      <th>rtable_Time</th>\n",
       "      <th>rtable_Released</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3237</th>\n",
       "      <td>0.960461</td>\n",
       "      <td>1</td>\n",
       "      <td>2583</td>\n",
       "      <td>228</td>\n",
       "      <td>#Selfie ( Instrumental Mix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#Selfie ( Instrumental Mix ) - Single</td>\n",
       "      <td>Dance , Music , Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc.</td>\n",
       "      <td>3:03</td>\n",
       "      <td>20-Mar-14</td>\n",
       "      <td>#SELFIE ( Instrumental Mix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#SELFIE ( Instrumental Mix )</td>\n",
       "      <td>Dance &amp; Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...</td>\n",
       "      <td>3:03</td>\n",
       "      <td>March 20 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3274</th>\n",
       "      <td>0.294473</td>\n",
       "      <td>0</td>\n",
       "      <td>2003</td>\n",
       "      <td>229</td>\n",
       "      <td>#SELFIE ( Caked Up Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#SELFIE ( The Remixes ) - Single</td>\n",
       "      <td>Electronic , Music , Dance</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...</td>\n",
       "      <td>3:15</td>\n",
       "      <td>17-Jun-14</td>\n",
       "      <td>#Selfie ( Will Sparks Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#Selfie ( The Remixes )</td>\n",
       "      <td>Electronica , Dance &amp; Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...</td>\n",
       "      <td>3:52</td>\n",
       "      <td>June 17 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3275</th>\n",
       "      <td>0.294473</td>\n",
       "      <td>0</td>\n",
       "      <td>2574</td>\n",
       "      <td>229</td>\n",
       "      <td>#SELFIE ( Caked Up Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#SELFIE ( The Remixes ) - Single</td>\n",
       "      <td>Electronic , Music , Dance</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...</td>\n",
       "      <td>3:15</td>\n",
       "      <td>17-Jun-14</td>\n",
       "      <td>#Selfie ( Will Sparks Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#Selfie ( The Remixes )</td>\n",
       "      <td>Electronica , Dance &amp; Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...</td>\n",
       "      <td>3:52</td>\n",
       "      <td>June 17 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3276</th>\n",
       "      <td>0.294473</td>\n",
       "      <td>0</td>\n",
       "      <td>2575</td>\n",
       "      <td>229</td>\n",
       "      <td>#SELFIE ( Caked Up Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#SELFIE ( The Remixes ) - Single</td>\n",
       "      <td>Electronic , Music , Dance</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...</td>\n",
       "      <td>3:15</td>\n",
       "      <td>17-Jun-14</td>\n",
       "      <td>#Selfie ( Will Sparks Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#Selfie ( The Remixes )</td>\n",
       "      <td>Electronica , Dance &amp; Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...</td>\n",
       "      <td>3:52</td>\n",
       "      <td>June 17 , 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3359</th>\n",
       "      <td>0.161323</td>\n",
       "      <td>0</td>\n",
       "      <td>2003</td>\n",
       "      <td>230</td>\n",
       "      <td>#SELFIE ( Caked Up Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#SELFIE ( The Remixes ) - Single</td>\n",
       "      <td>Electronic , Music , Dance</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...</td>\n",
       "      <td>3:15</td>\n",
       "      <td>17-Jun-14</td>\n",
       "      <td>#Selfie ( Botnek Remix )</td>\n",
       "      <td>The Chainsmokers</td>\n",
       "      <td>#Selfie ( The Remixes )</td>\n",
       "      <td>Electronica , Dance &amp; Electronic</td>\n",
       "      <td>$ 1.29</td>\n",
       "      <td>( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...</td>\n",
       "      <td>3:36</td>\n",
       "      <td>June 17 , 2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      match_score  match_prediction  ltable_id  rtable_id  \\\n",
       "_id                                                         \n",
       "3237     0.960461                 1       2583        228   \n",
       "3274     0.294473                 0       2003        229   \n",
       "3275     0.294473                 0       2574        229   \n",
       "3276     0.294473                 0       2575        229   \n",
       "3359     0.161323                 0       2003        230   \n",
       "\n",
       "                  ltable_Song_Name ltable_Artist_Name  \\\n",
       "_id                                                     \n",
       "3237  #Selfie ( Instrumental Mix )   The Chainsmokers   \n",
       "3274    #SELFIE ( Caked Up Remix )   The Chainsmokers   \n",
       "3275    #SELFIE ( Caked Up Remix )   The Chainsmokers   \n",
       "3276    #SELFIE ( Caked Up Remix )   The Chainsmokers   \n",
       "3359    #SELFIE ( Caked Up Remix )   The Chainsmokers   \n",
       "\n",
       "                          ltable_Album_Name                ltable_Genre  \\\n",
       "_id                                                                       \n",
       "3237  #Selfie ( Instrumental Mix ) - Single  Dance , Music , Electronic   \n",
       "3274       #SELFIE ( The Remixes ) - Single  Electronic , Music , Dance   \n",
       "3275       #SELFIE ( The Remixes ) - Single  Electronic , Music , Dance   \n",
       "3276       #SELFIE ( The Remixes ) - Single  Electronic , Music , Dance   \n",
       "3359       #SELFIE ( The Remixes ) - Single  Electronic , Music , Dance   \n",
       "\n",
       "     ltable_Price  \\\n",
       "_id                 \n",
       "3237       $ 1.29   \n",
       "3274       $ 1.29   \n",
       "3275       $ 1.29   \n",
       "3276       $ 1.29   \n",
       "3359       $ 1.29   \n",
       "\n",
       "                                                                                         ltable_CopyRight  \\\n",
       "_id                                                                                                         \n",
       "3237  2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc.   \n",
       "3274  2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...   \n",
       "3275  2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...   \n",
       "3276  2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...   \n",
       "3359  2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...   \n",
       "\n",
       "     ltable_Time ltable_Released               rtable_Song_Name  \\\n",
       "_id                                                               \n",
       "3237        3:03       20-Mar-14   #SELFIE ( Instrumental Mix )   \n",
       "3274        3:15       17-Jun-14  #Selfie ( Will Sparks Remix )   \n",
       "3275        3:15       17-Jun-14  #Selfie ( Will Sparks Remix )   \n",
       "3276        3:15       17-Jun-14  #Selfie ( Will Sparks Remix )   \n",
       "3359        3:15       17-Jun-14       #Selfie ( Botnek Remix )   \n",
       "\n",
       "     rtable_Artist_Name             rtable_Album_Name  \\\n",
       "_id                                                     \n",
       "3237   The Chainsmokers  #SELFIE ( Instrumental Mix )   \n",
       "3274   The Chainsmokers       #Selfie ( The Remixes )   \n",
       "3275   The Chainsmokers       #Selfie ( The Remixes )   \n",
       "3276   The Chainsmokers       #Selfie ( The Remixes )   \n",
       "3359   The Chainsmokers       #Selfie ( The Remixes )   \n",
       "\n",
       "                          rtable_Genre rtable_Price  \\\n",
       "_id                                                   \n",
       "3237                Dance & Electronic       $ 1.29   \n",
       "3274  Electronica , Dance & Electronic       $ 1.29   \n",
       "3275  Electronica , Dance & Electronic       $ 1.29   \n",
       "3276  Electronica , Dance & Electronic       $ 1.29   \n",
       "3359  Electronica , Dance & Electronic       $ 1.29   \n",
       "\n",
       "                                                                                         rtable_CopyRight  \\\n",
       "_id                                                                                                         \n",
       "3237  ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...   \n",
       "3274  ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...   \n",
       "3275  ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...   \n",
       "3276  ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...   \n",
       "3359  ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...   \n",
       "\n",
       "     rtable_Time  rtable_Released  \n",
       "_id                                \n",
       "3237        3:03  March 20 , 2014  \n",
       "3274        3:52   June 17 , 2014  \n",
       "3275        3:52   June 17 , 2014  \n",
       "3276        3:52   June 17 , 2014  \n",
       "3359        3:36   June 17 , 2014  "
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictions['match_prediction'] = predictions['match_score'].apply(lambda score: 1 if score >= 0.5 else 0)\n",
    "\n",
    "# Reorder columns to avoid scrolling...\n",
    "predictions =  predictions[['match_score', 'match_prediction'] + predictions.columns.values[1:-1].tolist()]\n",
    "predictions.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reset index as Magellan requires the key to be a column in the table\n",
    "predictions.reset_index(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Update metadata in the catalog. This information can later be used by triggers to modify the labels from \n",
    "# the learning-based matcher \n",
    "em.set_key(predictions, '_id')\n",
    "em.set_fk_ltable(predictions, 'ltable_id')\n",
    "em.set_fk_rtable(predictions, 'rtable_id')\n",
    "em.set_ltable(predictions, A)\n",
    "em.set_rtable(predictions, B)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}