# Introduction

This IPython notebook illustrates how to perform matching using the rule-based matcher.

First, we need to import the py_entitymatching package and other libraries as follows:

In [2]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

Then, read the (sample) input tables for matching purposes.

In [3]:
# Get the datasets directory
datasets_dir = os.path.join(em.get_install_path(), 'datasets')

# Paths to the two input tables and the labeled tuple pairs
path_A = os.path.join(datasets_dir, 'dblp_demo.csv')
path_B = os.path.join(datasets_dir, 'acm_demo.csv')
path_labeled_data = os.path.join(datasets_dir, 'labeled_data_demo.csv')

In [5]:
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')

# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
S.head()

Unnamed: 0,_id,ltable_id,rtable_id,ltable_title,ltable_authors,ltable_year,rtable_title,rtable_authors,rtable_year,label
0,0,l1223,r498,Dynamic Information Visualization,Yannis E. Ioannidis,1996,Dynamic information visualization,Yannis E. Ioannidis,1996,1
1,1,l1563,r1285,Dynamic Load Balancing in Hierarchical Parallel Database Systems,"Luc Bouganim, Daniela Florescu, Patrick Valduriez",1996,Dynamic Load Balancing in Hierarchical Parallel Database Systems,"Luc Bouganim, Daniela Florescu, Patrick Valduriez",1996,1
2,2,l1514,r1348,Query Processing and Optimization in Oracle Rdb,"Gennady Antoshenkov, Mohamed Ziauddin",1996,prospector: a content-based multimedia server for massively parallel architectures,"S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader",1996,0
3,3,l206,r1641,An Asymptotically Optimal Multiversion B-Tree,"Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger",1996,A complete temporal relational algebra,"Debabrata Dey, Terence M. Barron, Veda C. Storey",1996,0
4,4,l1589,r495,Evaluating Probabilistic Queries over Imprecise Data,"Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar",2003,Evaluating probabilistic queries over imprecise data,"Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar",2003,1


Then, split the labeled data into a development set and an evaluation set. Use the development set to develop the matching rules and the evaluation set to assess them.

In [6]:
# Split S into a development set I and an evaluation set J
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']
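
To make the idea of a seeded split concrete, here is a minimal pure-Python sketch that shuffles row indices with a fixed seed and cuts them in half. It only illustrates the concept and is not how `em.split_train_test` is implemented; the helper name `split_rows` is hypothetical.

```python
import random

def split_rows(rows, train_proportion=0.5, seed=0):
    """Hypothetical helper: shuffle rows with a fixed seed and cut them in two."""
    shuffled = list(rows)                    # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    cut = int(round(len(shuffled) * train_proportion))
    return shuffled[:cut], shuffled[cut:]

dev_rows, eval_rows = split_rows(range(10), train_proportion=0.5, seed=0)
print(len(dev_rows), len(eval_rows))
```

Reusing the same seed reproduces the same two sets, which is the role `random_state=0` plays in the cell above.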

# Creating and Using a Rule-Based Matcher

This typically involves the following steps:
1. Creating the rule-based matcher
2. Creating features
3. Adding rules
4. Using the matcher to predict results

## Creating the Rule-Based Matcher

In [7]:
brm = em.BooleanRuleMatcher()

## Creating Features

Next, we need to create a set of features for the development set. Magellan provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.

In [8]:
# Generate a set of features
F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)

Listing the feature names shows that 20 features were generated. The rules in this guide use only a few of them: the 'title', 'authors', and 'year' related features.

In [9]:
F.feature_name

0                          id_id_lev_dist
1                           id_id_lev_sim
2                               id_id_jar
3                               id_id_jwn
4                               id_id_exm
5                   id_id_jac_qgm_3_qgm_3
6             title_title_jac_qgm_3_qgm_3
7         title_title_cos_dlm_dc0_dlm_dc0
8                         title_title_mel
9                    title_title_lev_dist
10                    title_title_lev_sim
11        authors_authors_jac_qgm_3_qgm_3
12    authors_authors_cos_dlm_dc0_dlm_dc0
13                    authors_authors_mel
14               authors_authors_lev_dist
15                authors_authors_lev_sim
16                          year_year_exm
17                          year_year_anm
18                     year_year_lev_dist
19                      year_year_lev_sim
Name: feature_name, dtype: object
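
The feature names appear to follow the pattern `<left attribute>_<right attribute>_<similarity function>`, so `title_title_lev_sim` is a Levenshtein similarity between the two title values. The sketch below shows what such a score could look like in plain Python. It assumes the similarity is normalized as one minus the edit distance divided by the length of the longer string; the functions `lev_dist` and `lev_sim` here are illustrative stand-ins, not the library's own code.

```python
def lev_dist(s1, s2):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def lev_sim(s1, s2):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - lev_dist(s1, s2) / longest

# Titles taken from the first labeled pair shown earlier.
left = 'Dynamic Information Visualization'
right = 'Dynamic information visualization'
print(lev_dist(left, right))           # 2: only 'I' and 'V' differ in case
print(round(lev_sim(left, right), 3))  # 0.939
```

Under this assumption, a threshold such as 0.4 on a `lev_sim` feature tolerates a fairly large number of character edits.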

## Adding Rules

Before we can use the rule-based matcher, we need to create rules to evaluate tuple pairs. Each rule is a list of strings, and each string specifies one predicate; the rule is the conjunction of its predicates. Each predicate has three parts: (1) an expression, (2) a comparison operator, and (3) a value. The expression is evaluated over a tuple pair, producing a numeric value that is then compared against the given value.

In [10]:
# Add two rules to the rule-based matcher

# The first rule has two predicates, one comparing the titles and the other looking for an exact match of the years
brm.add_rule(['title_title_lev_sim(ltuple, rtuple) > 0.4', 'year_year_exm(ltuple, rtuple) == 1'], F)
# This second rule compares the authors
brm.add_rule(['authors_authors_lev_sim(ltuple, rtuple) > 0.4'], F)
brm.get_rule_names()

['_rule_0', '_rule_1']
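
Each predicate string in the rules above can be read as the three parts described earlier: an expression, a comparison operator, and a value. The following sketch pulls those parts out with a regular expression. The pattern, the list of operators it accepts, and the helper `parse_predicate` are assumptions made for illustration; the library parses rules in its own way.

```python
import re

# Hypothetical pattern: feature name, the "(ltuple, rtuple)" call, operator, number.
PREDICATE_RE = re.compile(
    r'^\s*(?P<feature>\w+)\(ltuple,\s*rtuple\)\s*'
    r'(?P<op>==|!=|<=|>=|<|>)\s*'
    r'(?P<value>-?\d+(?:\.\d+)?)\s*$'
)

def parse_predicate(text):
    """Split a predicate string into (feature name, operator, numeric value)."""
    m = PREDICATE_RE.match(text)
    if m is None:
        raise ValueError('not a predicate: %r' % text)
    return m.group('feature'), m.group('op'), float(m.group('value'))

print(parse_predicate('title_title_lev_sim(ltuple, rtuple) > 0.4'))
print(parse_predicate('year_year_exm(ltuple, rtuple) == 1'))
```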

In [11]:
# Rules can also be deleted from the rule-based matcher
brm.delete_rule('_rule_1')

True

## Using the Matcher to Predict Results

Now that our rule-based matcher has some rules, we can use it to predict whether a tuple pair is actually a match. Each rule is a conjunction of predicates and returns True only if all of its predicates return True. The matcher is a disjunction of rules: if any one of the rules returns True, the tuple pair is predicted to be a match.
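
The conjunction and disjunction semantics map directly onto Python's `all` and `any`. The sketch below applies them to the two rules added earlier, written as `(feature, operator, value)` triples. The helper names and the feature values for the sample pair are made up for illustration and do not come from the library or the dataset.

```python
import operator

OPS = {'>': operator.gt, '>=': operator.ge, '<': operator.lt,
       '<=': operator.le, '==': operator.eq, '!=': operator.ne}

def rule_holds(rule, feature_values):
    """A rule is a conjunction: every (feature, op, value) predicate must hold."""
    return all(OPS[op](feature_values[feat], val) for feat, op, val in rule)

def is_match(rules, feature_values):
    """The matcher is a disjunction: one satisfied rule is enough."""
    return any(rule_holds(rule, feature_values) for rule in rules)

rules = [
    [('title_title_lev_sim', '>', 0.4), ('year_year_exm', '==', 1)],
    [('authors_authors_lev_sim', '>', 0.4)],
]

# Made-up feature values for one tuple pair, for illustration only.
pair = {'title_title_lev_sim': 0.9, 'year_year_exm': 0,
        'authors_authors_lev_sim': 0.7}
print(rule_holds(rules[0], pair))  # False: the year predicate fails
print(is_match(rules, pair))       # True: the authors rule is satisfied
```

With only the first rule kept, as in this notebook after `_rule_1` is deleted, the same pair would not be predicted as a match.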

In [12]:
brm.predict(S, target_attr='pred_label', append=True)
S

Unnamed: 0,_id,ltable_id,rtable_id,ltable_title,ltable_authors,ltable_year,rtable_title,rtable_authors,rtable_year,label,pred_label
0,0,l1223,r498,Dynamic Information Visualization,Yannis E. Ioannidis,1996,Dynamic information visualization,Yannis E. Ioannidis,1996,1,1
1,1,l1563,r1285,Dynamic Load Balancing in Hierarchical Parallel Database Systems,"Luc Bouganim, Daniela Florescu, Patrick Valduriez",1996,Dynamic Load Balancing in Hierarchical Parallel Database Systems,"Luc Bouganim, Daniela Florescu, Patrick Valduriez",1996,1,1
2,2,l1514,r1348,Query Processing and Optimization in Oracle Rdb,"Gennady Antoshenkov, Mohamed Ziauddin",1996,prospector: a content-based multimedia server for massively parallel architectures,"S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader",1996,0,0
3,3,l206,r1641,An Asymptotically Optimal Multiversion B-Tree,"Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger",1996,A complete temporal relational algebra,"Debabrata Dey, Terence M. Barron, Veda C. Storey",1996,0,0
4,4,l1589,r495,Evaluating Probabilistic Queries over Imprecise Data,"Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar",2003,Evaluating probabilistic queries over imprecise data,"Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar",2003,1,1
5,5,l43,r1415,Optimization of Run-time Management of Data Intensive Web-sites,"Khaled Yagoub, Dan Suciu, Alon Y. Levy, Daniela Florescu",1999,On random sampling over joins,"Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya",1999,0,0
6,6,l1466,r1348,Access Path Support for Referential Integrity in SQL2,"Joachim Reinert, Theo Hrder",1996,prospector: a content-based multimedia server for massively parallel architectures,"S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader",1996,0,0
7,7,l1535,r1800,Mariposa: A Wide-Area Distributed Database System,"Carl Staelin, Paul M. Aoki, Witold Litwin, Michael Stonebraker, Adam Sah, Jeff Sidell, Andrew Yu...",1996,Further Improvements on Integrity Constraint Checking for Stratifiable Deductive Databases,"Sin Yeung Lee, Tok Wang Ling",1996,0,0
8,8,l1317,r1676,QuickStore: A High Performance Mapped Object Store,"David J. DeWitt, Seth J. White",1994,An Overview of Repository Technology,"Philip A. Bernstein, Umeshwar Dayal",1994,0,0
9,9,l621,r175,Communication Efficient Distributed Mining of Association Rules,"Ran Wolff, Assaf Schuster",2001,Editorial,Richard Snodgrass,2001,0,0
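
As a quick sanity check, the `label` and `pred_label` columns can be compared directly. The sketch below counts agreements by hand for the ten rows displayed above only, so the numbers say nothing about the rest of `S`.

```python
# (label, pred_label) pairs copied from the ten rows displayed above.
rows = [(1, 1), (1, 1), (0, 0), (0, 0), (1, 1),
        (0, 0), (0, 0), (0, 0), (0, 0), (0, 0)]

tp = sum(1 for gold, pred in rows if gold == 1 and pred == 1)  # true positives
fp = sum(1 for gold, pred in rows if gold == 0 and pred == 1)  # false positives
fn = sum(1 for gold, pred in rows if gold == 1 and pred == 0)  # false negatives

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(tp, fp, fn, precision, recall)  # 3 0 0 1.0 1.0
```

On these ten rows the single remaining rule agrees with every label; a fair assessment would run the same comparison on the held-out evaluation set J.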
