{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "This IPython notebook illustrates how to performing matching with a ML matcher. In particular we show examples with a decision tree matcher, but the same principles apply to all of the other ML matchers." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import py_entitymatching package\n", "import py_entitymatching as em\n", "import os\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in the orignal tables and a set of labeled data into py_entitymatching." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Get the datasets directory\n", "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n", "\n", "path_A = datasets_dir + os.sep + 'dblp_demo.csv'\n", "path_B = datasets_dir + os.sep + 'acm_demo.csv'\n", "path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Metadata file is not present in the given path; proceeding to read the csv file.\n", "Metadata file is not present in the given path; proceeding to read the csv file.\n" ] } ], "source": [ "A = em.read_csv_metadata(path_A, key='id')\n", "B = em.read_csv_metadata(path_B, key='id')\n", "# Load the pre-labeled data\n", "S = em.read_csv_metadata(path_labeled_data, \n", " key='_id',\n", " ltable=A, rtable=B, \n", " fk_ltable='ltable_id', fk_rtable='rtable_id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Training the ML Matcher\n", "\n", "Now, we can train our ML matcher. In this notebook we will demonstrate this process with a decision tree matcher. First, we need to split our labeled data into a training set and a test set. 
Then we will extract feature vectors from the training set and train our decision tree with the fit command." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Split S into I and J\n", "IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)\n", "I = IJ['train']\n", "J = IJ['test']" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Generate a set of features\n", "F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Convert I into feature vectors using F\n", "H = em.extract_feature_vecs(I, \n", " feature_table=F, \n", " attrs_after='label',\n", " show_progress=False)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Instantiate the matcher to evaluate.\n", "dt = em.DTMatcher(name='DecisionTree', random_state=0)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Train using feature vectors from I\n", "dt.fit(table=H, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], \n", " target_attr='label')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting Predictions with the ML Matcher\n", "\n", "Since we now have a trained decision tree, we can use our matcher to get predictions on the test set. Below, we will show four different ways to get the predictions with the predict command, each of which is useful in a different context.\n", "\n", "### Getting a List of Predictions\n", "\n", "First up, we will demonstrate how to get just a list of predictions using the predict command. This is the default method of getting predictions. As shown below, the resulting variable, predictions, is just an array containing the predictions for each of the feature vectors in the test set. 
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, 1, 1, 1, 0, 1, 0, 0])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert J into a set of feature vectors using F\n", "L1 = em.extract_feature_vecs(J, feature_table=F,\n", " attrs_after='label', show_progress=False)\n", "\n", "# Predict on L \n", "predictions = dt.predict(table=L1, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'])\n", "\n", "# Show the predictions\n", "predictions[0:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting a List of Predictions and a List of Probabilities\n", "\n", "Next we will demonstrate how to get both a list of prediction for the test set, as well as a list of the associated probabilities for the predictions. This is done by setting the 'return_probs' argument to true. Note that the probabilities shown are the probability for a match. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Predictions for first ten entries: [0 0 0 1 1 1 0 1 0 0]\n", "Probabilities of a match for first ten entries: [0. 0. 0. 1. 1. 1. 0. 1. 0. 
0.]\n" ] } ], "source": [ "# Convert J into a set of feature vectors using F\n", "L2 = em.extract_feature_vecs(J, feature_table=F,\n", " attrs_after='label', show_progress=False)\n", "\n", "# Predict on L \n", "predictions, probs = dt.predict(table=L2, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], return_probs=True)\n", "\n", "# Show the predictions and probabilities\n", "print('Predictions for first ten entries: {0}'.format(predictions[0:10]))\n", "print('Probabilities of a match for first ten entries: {0}'.format(probs[0:10]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Appending the Predictions to the Feature Vectors Table\n", "\n", "Often, we want to include the predictions with the feature vector table. We can return predictions appended to a copy of the feature vector table if we use the 'append' argument to true. We can choose the name of the new predictions column using the 'target_attr' argument. We can also append the probabilites by setting 'return_probs' to true and setting the new probabilities column name with the 'probs_attr'." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_idrtable_idlabelpredictionprobability
124124l1647r366000.0
5454l332r1463000.0
268268l1499r1725000.0
293293l759r1749111.0
230230l1580r1711111.0
\n", "
" ], "text/plain": [ " _id ltable_id rtable_id label prediction probability\n", "124 124 l1647 r366 0 0 0.0\n", "54 54 l332 r1463 0 0 0.0\n", "268 268 l1499 r1725 0 0 0.0\n", "293 293 l759 r1749 1 1 1.0\n", "230 230 l1580 r1711 1 1 1.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert J into a set of feature vectors using F\n", "L3 = em.extract_feature_vecs(J, feature_table=F,\n", " attrs_after='label', show_progress=False)\n", "\n", "# Predict on L \n", "predictions = dt.predict(table=L3, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], \n", " target_attr='prediction', append=True,\n", " return_probs=True, probs_attr='probability')\n", "\n", "# Show the predictions and probabilities\n", "predictions[['_id', 'ltable_id', 'rtable_id', 'label', 'prediction', 'probability']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Appending the Prediction to the Original Feature Vectors Table In-place\n", "\n", "Lastly, we will show how to append the predictions to the original feature vector dataframe. We can accomplish this by setting the 'append' argument to true, setting the name of the new column with the 'target_attr' argument and then setting the 'inplace' argument to true. Again, we can include the probabilites with the 'return_probs' and 'probs_attr' arguments. This will append the predictions and probabilities to the original feature vector dataframe as opposed to the mthod used above which will create a copy of the feature vectors and append the predictions to that copy." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_idrtable_idlabelpredictionprobabilities
124124l1647r366000.0
5454l332r1463000.0
268268l1499r1725000.0
293293l759r1749111.0
230230l1580r1711111.0
\n", "
" ], "text/plain": [ " _id ltable_id rtable_id label prediction probabilities\n", "124 124 l1647 r366 0 0 0.0\n", "54 54 l332 r1463 0 0 0.0\n", "268 268 l1499 r1725 0 0 0.0\n", "293 293 l759 r1749 1 1 1.0\n", "230 230 l1580 r1711 1 1 1.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert J into a set of feature vectors using F\n", "L4 = em.extract_feature_vecs(J, feature_table=F,\n", " attrs_after='label', show_progress=False)\n", "\n", "# Predict on L \n", "dt.predict(table=L4, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], \n", " target_attr='prediction', append=True,\n", " return_probs=True, probs_attr='probabilities',\n", " inplace=True)\n", "\n", "# Show the predictions and probabilities\n", "L4[['_id', 'ltable_id', 'rtable_id', 'label', 'prediction', 'probabilities']].head()" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }