{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "This IPython notebook illustrates how to select the best learning based matcher. First, we need to import py_entitymatching package and other libraries as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import py_entitymatching package\n", "import py_entitymatching as em\n", "import os\n", "import pandas as pd\n", "\n", "# Set the seed value \n", "seed = 0" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Adding Features to Feature Table.ipynb\r\n", "Combining Multiple Blockers.ipynb\r\n", "Debugging Blocker Output.ipynb\r\n", "Down Sampling.ipynb\r\n", "Editing and Generating Features Manually.ipynb\r\n", "Evaluating the Selected Matcher.ipynb\r\n", "Generating Features Manually.ipynb\r\n", "Performing Blocking Using Blackbox Blocker.ipynb\r\n", "Performing Blocking Using Built-In Blockers (Attr. 
Equivalence Blocker).ipynb\r\n", "Performing Blocking Using Built-In Blockers (Overlap Blocker).ipynb\r\n", "Performing Blocking Using Rule-Based Blocking.ipynb\r\n", "Reading CSV Files from Disk.ipynb\r\n", "Removing Features From Feature Table.ipynb\r\n", "Sampling and Labeling.ipynb\r\n", "Selecting the Best Learning Matcher.ipynb\r\n" ] } ], "source": [ "!ls $datasets_dir" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get the datasets directory\n", "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n", "\n", "path_A = datasets_dir + os.sep + 'dblp_demo.csv'\n", "path_B = datasets_dir + os.sep + 'acm_demo.csv'\n", "path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Metadata file is not present in the given path; proceeding to read the csv file.\n", "Metadata file is not present in the given path; proceeding to read the csv file.\n" ] } ], "source": [ "A = em.read_csv_metadata(path_A, key='id')\n", "B = em.read_csv_metadata(path_B, key='id')\n", "# Load the pre-labeled data\n", "S = em.read_csv_metadata(path_labeled_data, \n", " key='_id',\n", " ltable=A, rtable=B, \n", " fk_ltable='ltable_id', fk_rtable='rtable_id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, split the labeled data into development set and evaluation set. 
Use the development set to select the best learning-based matcher." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Split S into I and J\n", "IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)\n", "I = IJ['train']\n", "J = IJ['test']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Selecting the Best Learning-Based Matcher " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This typically involves the following steps:\n", "1. Creating a set of learning-based matchers\n", "2. Creating features\n", "3. Extracting feature vectors\n", "4. Selecting the best learning-based matcher using k-fold cross-validation\n", "5. Debugging the matcher (and possibly repeating the above steps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a set of learning-based matchers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to create a set of learning-based matchers. The following matchers are supported in Magellan: (1) decision tree, (2) random forest, (3) Naive Bayes, (4) SVM, (5) logistic regression, and (6) linear regression." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create a set of ML-matchers\n", "dt = em.DTMatcher(name='DecisionTree', random_state=0)\n", "svm = em.SVMMatcher(name='SVM', random_state=0)\n", "rf = em.RFMatcher(name='RF', random_state=0)\n", "lg = em.LogRegMatcher(name='LogReg', random_state=0)\n", "ln = em.LinRegMatcher(name='LinReg')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating features\n", "\n", "Next, we need to create a set of features for the development set. Magellan provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features."
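] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before generating the features, it can help to glance at the schemas of the two tables, since Magellan generates features from corresponding attributes of A and B. The following quick check uses plain pandas and is an illustrative addition, not part of the original guide:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sanity check (not part of the original guide):\n", "# features are generated from corresponding attributes of A and B,\n", "# so the two tables should have comparable schemas.\n", "print(list(A.columns))\n", "print(list(B.columns))"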
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Generate a set of features\n", "F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that 20 features were generated. We can inspect the names of the generated features as follows:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 id_id_lev_dist\n", "1 id_id_lev_sim\n", "2 id_id_jar\n", "3 id_id_jwn\n", "4 id_id_exm\n", "5 id_id_jac_qgm_3_qgm_3\n", "6 title_title_jac_qgm_3_qgm_3\n", "7 title_title_cos_dlm_dc0_dlm_dc0\n", "8 title_title_mel\n", "9 title_title_lev_dist\n", "10 title_title_lev_sim\n", "11 authors_authors_jac_qgm_3_qgm_3\n", "12 authors_authors_cos_dlm_dc0_dlm_dc0\n", "13 authors_authors_mel\n", "14 authors_authors_lev_dist\n", "15 authors_authors_lev_sim\n", "16 year_year_exm\n", "17 year_year_anm\n", "18 year_year_lev_dist\n", "19 year_year_lev_sim\n", "Name: feature_name, dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.feature_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting feature vectors\n", "\n", "In this step, we extract feature vectors using the development set and the created features." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Convert I into a set of feature vectors using F\n", "H = em.extract_feature_vecs(I, \n", " feature_table=F, \n", " attrs_after='label',\n", " show_progress=False)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_idrtable_idid_id_lev_distid_id_lev_simid_id_jarid_id_jwnid_id_exmid_id_jac_qgm_3_qgm_3title_title_jac_qgm_3_qgm_3...authors_authors_jac_qgm_3_qgm_3authors_authors_cos_dlm_dc0_dlm_dc0authors_authors_melauthors_authors_lev_distauthors_authors_lev_simyear_year_exmyear_year_anmyear_year_lev_distyear_year_lev_simlabel
430430l1494r125740.200.4666670.46666700.0000000.000000...0.0000000.0000000.44570744.00.08333311.00.01.00
3535l1385r116040.200.4666670.46666700.0000000.025641...0.0000000.0000000.58941743.00.27118611.00.01.00
394394l1345r8540.200.0000000.00000000.0909091.000000...0.9511110.9459460.822080172.00.33846211.00.01.01
2929l611r14130.250.6666670.66666700.0909090.049383...0.0000000.0000000.53154326.00.27777811.00.01.00
181181l1164r116120.600.7333330.73333300.0769231.000000...0.5925930.6681530.68470034.00.24444411.00.01.01
\n", "

5 rows × 24 columns

\n", "
" ], "text/plain": [ " _id ltable_id rtable_id id_id_lev_dist id_id_lev_sim id_id_jar \\\n", "430 430 l1494 r1257 4 0.20 0.466667 \n", "35 35 l1385 r1160 4 0.20 0.466667 \n", "394 394 l1345 r85 4 0.20 0.000000 \n", "29 29 l611 r141 3 0.25 0.666667 \n", "181 181 l1164 r1161 2 0.60 0.733333 \n", "\n", " id_id_jwn id_id_exm id_id_jac_qgm_3_qgm_3 title_title_jac_qgm_3_qgm_3 \\\n", "430 0.466667 0 0.000000 0.000000 \n", "35 0.466667 0 0.000000 0.025641 \n", "394 0.000000 0 0.090909 1.000000 \n", "29 0.666667 0 0.090909 0.049383 \n", "181 0.733333 0 0.076923 1.000000 \n", "\n", " ... authors_authors_jac_qgm_3_qgm_3 \\\n", "430 ... 0.000000 \n", "35 ... 0.000000 \n", "394 ... 0.951111 \n", "29 ... 0.000000 \n", "181 ... 0.592593 \n", "\n", " authors_authors_cos_dlm_dc0_dlm_dc0 authors_authors_mel \\\n", "430 0.000000 0.445707 \n", "35 0.000000 0.589417 \n", "394 0.945946 0.822080 \n", "29 0.000000 0.531543 \n", "181 0.668153 0.684700 \n", "\n", " authors_authors_lev_dist authors_authors_lev_sim year_year_exm \\\n", "430 44.0 0.083333 1 \n", "35 43.0 0.271186 1 \n", "394 172.0 0.338462 1 \n", "29 26.0 0.277778 1 \n", "181 34.0 0.244444 1 \n", "\n", " year_year_anm year_year_lev_dist year_year_lev_sim label \n", "430 1.0 0.0 1.0 0 \n", "35 1.0 0.0 1.0 0 \n", "394 1.0 0.0 1.0 1 \n", "29 1.0 0.0 1.0 0 \n", "181 1.0 0.0 1.0 1 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display first few rows\n", "H.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check if the feature vectors contain missing values\n", "# A return value of True means that there are missing values\n", "any(pd.notnull(H))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that the extracted feature vectors contain missing values. 
We have to impute the missing values so that the learning-based matchers can fit their models correctly. For the purposes of this guide, we impute the missing values in a column with the mean of the values in that column." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Impute feature vectors with the mean of the column values.\n", "H = em.impute_table(H, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],\n", " strategy='mean')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting the best matcher using cross-validation\n", "\n", "Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five-fold cross-validation and the F1 metric to select the best matcher." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MatcherAverage precisionAverage recallAverage f1
0DecisionTree0.9153220.9507140.930980
1RF1.0000000.9507140.974131
2SVM0.9777780.8106320.883248
3LinReg1.0000000.9353300.966131
4LogReg0.9857140.9353300.958724
\n", "
" ], "text/plain": [ " Matcher Average precision Average recall Average f1\n", "0 DecisionTree 0.915322 0.950714 0.930980\n", "1 RF 1.000000 0.950714 0.974131\n", "2 SVM 0.977778 0.810632 0.883248\n", "3 LinReg 1.000000 0.935330 0.966131\n", "4 LogReg 0.985714 0.935330 0.958724" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the best ML matcher using CV\n", "result = em.select_matcher([dt, rf, svm, ln, lg], table=H, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],\n", " k=5,\n", " target_attr='label', metric_to_select_matcher='f1', random_state=0)\n", "result['cv_stats']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMatcherNum foldsFold 1Fold 2Fold 3Fold 4Fold 5Mean score
0DecisionTree<py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990>50.951.0000000.7647060.9333330.9285710.915322
1RF<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310>51.001.0000001.0000001.0000001.0000001.000000
2SVM<py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390>51.001.0000000.8888891.0000001.0000000.977778
3LinReg<py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0>51.001.0000001.0000001.0000001.0000001.000000
4LogReg<py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210>51.000.9285711.0000001.0000001.0000000.985714
\n", "
" ], "text/plain": [ " Name \\\n", "0 DecisionTree \n", "1 RF \n", "2 SVM \n", "3 LinReg \n", "4 LogReg \n", "\n", " Matcher \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score \n", "0 5 0.95 1.000000 0.764706 0.933333 0.928571 0.915322 \n", "1 5 1.00 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "2 5 1.00 1.000000 0.888889 1.000000 1.000000 0.977778 \n", "3 5 1.00 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "4 5 1.00 0.928571 1.000000 1.000000 1.000000 0.985714 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['drill_down_cv_stats']['precision']" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMatcherNum foldsFold 1Fold 2Fold 3Fold 4Fold 5Mean score
0DecisionTree<py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990>50.951.0000000.9285710.87501.0000000.950714
1RF<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310>50.951.0000000.9285710.87501.0000000.950714
2SVM<py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390>50.900.9230770.5714290.81250.8461540.810632
3LinReg<py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0>50.951.0000000.9285710.87500.9230770.935330
4LogReg<py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210>50.951.0000000.9285710.87500.9230770.935330
\n", "
" ], "text/plain": [ " Name \\\n", "0 DecisionTree \n", "1 RF \n", "2 SVM \n", "3 LinReg \n", "4 LogReg \n", "\n", " Matcher \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score \n", "0 5 0.95 1.000000 0.928571 0.8750 1.000000 0.950714 \n", "1 5 0.95 1.000000 0.928571 0.8750 1.000000 0.950714 \n", "2 5 0.90 0.923077 0.571429 0.8125 0.846154 0.810632 \n", "3 5 0.95 1.000000 0.928571 0.8750 0.923077 0.935330 \n", "4 5 0.95 1.000000 0.928571 0.8750 0.923077 0.935330 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['drill_down_cv_stats']['recall']" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMatcherNum foldsFold 1Fold 2Fold 3Fold 4Fold 5Mean score
0DecisionTree<py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990>50.9500001.0000000.8387100.9032260.9629630.930980
1RF<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310>50.9743591.0000000.9629630.9333331.0000000.974131
2SVM<py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390>50.9473680.9600000.6956520.8965520.9166670.883248
3LinReg<py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0>50.9743591.0000000.9629630.9333330.9600000.966131
4LogReg<py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210>50.9743590.9629630.9629630.9333330.9600000.958724
\n", "
" ], "text/plain": [ " Name \\\n", "0 DecisionTree \n", "1 RF \n", "2 SVM \n", "3 LinReg \n", "4 LogReg \n", "\n", " Matcher \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score \n", "0 5 0.950000 1.000000 0.838710 0.903226 0.962963 0.930980 \n", "1 5 0.974359 1.000000 0.962963 0.933333 1.000000 0.974131 \n", "2 5 0.947368 0.960000 0.695652 0.896552 0.916667 0.883248 \n", "3 5 0.974359 1.000000 0.962963 0.933333 0.960000 0.966131 \n", "4 5 0.974359 0.962963 0.962963 0.933333 0.960000 0.958724 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['drill_down_cv_stats']['f1']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Debug X (Random Forest)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Split H into P and Q\n", "PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)\n", "P = PQ['train']\n", "Q = PQ['test']" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Debug RF matcher using GUI\n", "em.vis_debug_rf(rf, P, Q, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],\n", " target_attr='label')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add a feature to do Jaccard on title + authors and add it to F\n", "\n", "# Create a feature declaratively\n", "sim = em.get_sim_funs_for_matching()\n", "tok = em.get_tokenizers_for_matching()\n", "feature_string = \"\"\"jaccard(wspace((ltuple['title'] + ' ' + ltuple['authors']).lower()), \n", " wspace((rtuple['title'] + ' ' + rtuple['authors']).lower()))\"\"\"\n", "feature = em.get_feature_fn(feature_string, sim, tok)\n", "\n", "# Add feature to F\n", "em.add_feature(F, 
'jac_ws_title_authors', feature)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Convert I into feature vectors using updated F\n", "H = em.extract_feature_vecs(I, \n", " feature_table=F, \n", " attrs_after='label',\n", " show_progress=False)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMatcherNum foldsFold 1Fold 2Fold 3Fold 4Fold 5Mean score
0RF<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310>50.9743591.00.9629630.9333331.00.974131
\n", "
" ], "text/plain": [ " Name Matcher \\\n", "0 RF \n", "\n", " Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score \n", "0 5 0.974359 1.0 0.962963 0.933333 1.0 0.974131 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check whether the updated F improves X (Random Forest)\n", "result = em.select_matcher([rf], table=H, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],\n", " k=5,\n", " target_attr='label', metric_to_select_matcher='f1', random_state=0)\n", "result['drill_down_cv_stats']['f1']" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MatcherAverage precisionAverage recallAverage f1
0DecisionTree1.0000001.0000001.000000
1RF1.0000000.9507140.974131
2SVM1.0000000.8374180.907995
3LinReg1.0000000.9703300.984593
4LogReg0.9857140.9353300.958724
\n", "
" ], "text/plain": [ " Matcher Average precision Average recall Average f1\n", "0 DecisionTree 1.000000 1.000000 1.000000\n", "1 RF 1.000000 0.950714 0.974131\n", "2 SVM 1.000000 0.837418 0.907995\n", "3 LinReg 1.000000 0.970330 0.984593\n", "4 LogReg 0.985714 0.935330 0.958724" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the best matcher again using CV\n", "result = em.select_matcher([dt, rf, svm, ln, lg], table=H, \n", " exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],\n", " k=5,\n", " target_attr='label', metric_to_select_matcher='f1', random_state=0)\n", "result['cv_stats']" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMatcherNum foldsFold 1Fold 2Fold 3Fold 4Fold 5Mean score
0DecisionTree<py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990>51.0000001.0000001.0000001.0000001.0000001.000000
1RF<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310>50.9743591.0000000.9629630.9333331.0000000.974131
2SVM<py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390>50.9473680.9600000.7826090.9333330.9166670.907995
3LinReg<py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0>51.0000001.0000000.9629631.0000000.9600000.984593
4LogReg<py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210>50.9743590.9629630.9629630.9333330.9600000.958724
\n", "
" ], "text/plain": [ " Name \\\n", "0 DecisionTree \n", "1 RF \n", "2 SVM \n", "3 LinReg \n", "4 LogReg \n", "\n", " Matcher \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score \n", "0 5 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "1 5 0.974359 1.000000 0.962963 0.933333 1.000000 0.974131 \n", "2 5 0.947368 0.960000 0.782609 0.933333 0.916667 0.907995 \n", "3 5 1.000000 1.000000 0.962963 1.000000 0.960000 0.984593 \n", "4 5 0.974359 0.962963 0.962963 0.933333 0.960000 0.958724 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['drill_down_cv_stats']['f1']" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }