{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This IPython notebook illustrates how to sample and label a table (the candidate set).\n",
"First, we import the py_entitymatching package and the other libraries we need:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Import the py_entitymatching package and supporting libraries\n",
"import py_entitymatching as em\n",
"import os\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Get the datasets directory\n",
"datasets_dir = em.get_install_path() + os.sep + 'datasets'\n",
"\n",
"# Paths to the two input tables (A, B) and the candidate set (C)\n",
"path_A = datasets_dir + os.sep + 'DBLP.csv'\n",
"path_B = datasets_dir + os.sep + 'ACM.csv'\n",
"path_C = datasets_dir + os.sep + 'tableC.csv'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Metadata file is not present in the given path; proceeding to read the csv file.\n",
"Metadata file is not present in the given path; proceeding to read the csv file.\n",
"Metadata file is not present in the given path; proceeding to read the csv file.\n"
]
}
],
"source": [
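"# Read the input tables, declaring 'id' as the key of each\n",
"A = em.read_csv_metadata(path_A, key='id')\n",
"B = em.read_csv_metadata(path_B, key='id')\n",
"# Read the candidate set and link it to A and B through its foreign keys\n",
"C = em.read_csv_metadata(path_C, key='_id',\n",
" fk_ltable='ltable_id', fk_rtable='rtable_id',\n",
" ltable=A, rtable=B)"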
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" _id ltable_id rtable_id ltable_authors \\\n",
"0 0 conf/sigmod/AbadiC02 191915 Daniel J. Abadi, Mitch Cherniack \n",
"1 1 conf/sigmod/AbadiC02 191931 Daniel J. Abadi, Mitch Cherniack \n",
"2 2 conf/sigmod/AbadiC02 233356 Daniel J. Abadi, Mitch Cherniack \n",
"3 3 conf/sigmod/AbadiC02 276311 Daniel J. Abadi, Mitch Cherniack \n",
"4 4 conf/sigmod/AbadiC02 335432 Daniel J. Abadi, Mitch Cherniack \n",
"\n",
" ltable_title \\\n",
"0 Visual COKO: a debugger for query optimizer development \n",
"1 Visual COKO: a debugger for query optimizer development \n",
"2 Visual COKO: a debugger for query optimizer development \n",
"3 Visual COKO: a debugger for query optimizer development \n",
"4 Visual COKO: a debugger for query optimizer development \n",
"\n",
" rtable_authors \\\n",
"0 Michael J. Carey, David J. DeWitt, Michael J. Franklin, Nancy E. Hall, Mark L. McAuliffe, Jeffre... \n",
"1 Daniel J. Dietterich \n",
"2 Mitch Cherniack, Stanley B. Zdonik \n",
"3 Mitch Cherniack, Stan Zdonik \n",
"4 Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang \n",
"\n",
" rtable_title \n",
"0 Shoring up persistent applications \n",
"1 DEC data distributor: for data replication and data warehousing \n",
"2 Rule languages and internal algebras for rule-based optimizers \n",
"3 Changing the rules: transformations for rule-based optimizers \n",
"4 NiagaraCQ: a scalable continuous query system for Internet databases "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"C.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"14673"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(C)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sample Candidate Set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Labeling every pair in the candidate set is rarely practical, so we draw a sample from it for labeling:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Sample 450 tuple pairs from the candidate set C\n",
"S = em.sample_table(C, 450)"
]
},
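{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea of sampling concrete, here is a minimal sketch on a toy table using pandas' `DataFrame.sample`. The toy data and the names `toy_C` and `toy_S` are made up for illustration, and this is only an analogy: it is not necessarily how `em.sample_table` works internally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy candidate set (hypothetical data, not the tables loaded above)\n",
"toy_C = pd.DataFrame({\n",
"    '_id': range(10),\n",
"    'ltable_id': ['l%d' % i for i in range(10)],\n",
"    'rtable_id': ['r%d' % i for i in range(10)],\n",
"})\n",
"\n",
"# Draw 4 rows at random without replacement; random_state makes the draw repeatable\n",
"toy_S = toy_C.sample(n=4, random_state=0)\n",
"\n",
"# Every sampled row is a distinct row of the original table\n",
"print(len(toy_S), toy_S['_id'].is_unique)"
]
},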
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Label the Sampled Set"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Column name (gold_label) is not present in dataframe\n"
]
}
],
"source": [
"# Label the sampled set\n",
"# Specify the name for the label column\n",
"G = em.label_table(S, 'gold_label')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The user must specify 0 for a non-match and 1 for a match. Typically, the sampling and labeling steps are repeated over several iterations (until the labeled sample contains a sufficient density of matches). Once labeled, the data set will look like this:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Metadata file is not present in the given path; proceeding to read the csv file.\n"
]
}
],
"source": [
"# Assume that we have labeled the data and stored it in \n",
"# labeled_data_demo.csv\n",
"\n",
"path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'\n",
"G = em.read_csv_metadata(path_labeled_data, key='_id', \n",
" fk_ltable='ltable_id', fk_rtable='rtable_id',\n",
" ltable=A, rtable=B)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" _id ltable_id rtable_id \\\n",
"0 0 l1223 r498 \n",
"1 1 l1563 r1285 \n",
"2 2 l1514 r1348 \n",
"3 3 l206 r1641 \n",
"4 4 l1589 r495 \n",
"\n",
" ltable_title \\\n",
"0 Dynamic Information Visualization \n",
"1 Dynamic Load Balancing in Hierarchical Parallel Database Systems \n",
"2 Query Processing and Optimization in Oracle Rdb \n",
"3 An Asymptotically Optimal Multiversion B-Tree \n",
"4 Evaluating Probabilistic Queries over Imprecise Data \n",
"\n",
" ltable_authors \\\n",
"0 Yannis E. Ioannidis \n",
"1 Luc Bouganim, Daniela Florescu, Patrick Valduriez \n",
"2 Gennady Antoshenkov, Mohamed Ziauddin \n",
"3 Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger \n",
"4 Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar \n",
"\n",
" ltable_year \\\n",
"0 1996 \n",
"1 1996 \n",
"2 1996 \n",
"3 1996 \n",
"4 2003 \n",
"\n",
" rtable_title \\\n",
"0 Dynamic information visualization \n",
"1 Dynamic Load Balancing in Hierarchical Parallel Database Systems \n",
"2 prospector: a content-based multimedia server for massively parallel architectures \n",
"3 A complete temporal relational algebra \n",
"4 Evaluating probabilistic queries over imprecise data \n",
"\n",
" rtable_authors \\\n",
"0 Yannis E. Ioannidis \n",
"1 Luc Bouganim, Daniela Florescu, Patrick Valduriez \n",
"2 S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader \n",
"3 Debabrata Dey, Terence M. Barron, Veda C. Storey \n",
"4 Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar \n",
"\n",
" rtable_year label \n",
"0 1996 1 \n",
"1 1996 1 \n",
"2 1996 0 \n",
"3 1996 0 \n",
"4 2003 1 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"G.head()"
]
}
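,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The density of matches mentioned earlier is simply the fraction of labeled pairs marked 1. Below is a minimal sketch on toy labels. The data and the names `toy_G` and `match_density` are hypothetical, and the column name `gold_label` follows the `em.label_table` call above (the demo file uses `label` instead)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy labeled sample (hypothetical labels): 1 = match, 0 = non-match\n",
"toy_G = pd.DataFrame({'_id': [0, 1, 2, 3, 4],\n",
"                      'gold_label': [1, 1, 0, 0, 1]})\n",
"\n",
"# Sanity check: every label must be 0 or 1\n",
"assert toy_G['gold_label'].isin([0, 1]).all()\n",
"\n",
"# Density of matches = fraction of rows labeled 1\n",
"match_density = toy_G['gold_label'].mean()\n",
"print(match_density)"
]
}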
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}