{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This IPython notebook illustrates how to sample and label a table (candidate set).\n", "First, we need to import the py_entitymatching package and other libraries as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import the py_entitymatching package and supporting libraries\n", "import py_entitymatching as em\n", "import os\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Get the datasets directory\n", "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n", "\n", "# Paths to the two input tables and the candidate set\n", "path_A = datasets_dir + os.sep + 'DBLP.csv'\n", "path_B = datasets_dir + os.sep + 'ACM.csv'\n", "path_C = datasets_dir + os.sep + 'tableC.csv'\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Metadata file is not present in the given path; proceeding to read the csv file.\n", "Metadata file is not present in the given path; proceeding to read the csv file.\n", "Metadata file is not present in the given path; proceeding to read the csv file.\n" ] } ], "source": [ "# Read tables A and B, then the candidate set C with its foreign keys into A and B\n", "A = em.read_csv_metadata(path_A, key='id')\n", "B = em.read_csv_metadata(path_B, key='id')\n", "C = em.read_csv_metadata(path_C, key='_id',\n", " fk_ltable='ltable_id', fk_rtable='rtable_id',\n", " ltable=A, rtable=B)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n",
"      <th></th>\n", "      <th>_id</th>\n", "      <th>ltable_id</th>\n", "      <th>rtable_id</th>\n", "      <th>ltable_authors</th>\n", "      <th>ltable_title</th>\n", "      <th>rtable_authors</th>\n", "      <th>rtable_title</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>0</td>\n", "      <td>conf/sigmod/AbadiC02</td>\n", "      <td>191915</td>\n", "      <td>Daniel J. Abadi, Mitch Cherniack</td>\n", "      <td>Visual COKO: a debugger for query optimizer development</td>\n", "      <td>Michael J. Carey, David J. DeWitt, Michael J. Franklin, Nancy E. Hall, Mark L. McAuliffe, Jeffre...</td>\n", "      <td>Shoring up persistent applications</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>1</td>\n", "      <td>conf/sigmod/AbadiC02</td>\n", "      <td>191931</td>\n", "      <td>Daniel J. Abadi, Mitch Cherniack</td>\n", "      <td>Visual COKO: a debugger for query optimizer development</td>\n", "      <td>Daniel J. Dietterich</td>\n", "      <td>DEC data distributor: for data replication and data warehousing</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>2</td>\n", "      <td>conf/sigmod/AbadiC02</td>\n", "      <td>233356</td>\n", "      <td>Daniel J. Abadi, Mitch Cherniack</td>\n", "      <td>Visual COKO: a debugger for query optimizer development</td>\n", "      <td>Mitch Cherniack, Stanley B. Zdonik</td>\n", "      <td>Rule languages and internal algebras for rule-based optimizers</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>3</td>\n", "      <td>conf/sigmod/AbadiC02</td>\n", "      <td>276311</td>\n", "      <td>Daniel J. Abadi, Mitch Cherniack</td>\n", "      <td>Visual COKO: a debugger for query optimizer development</td>\n", "      <td>Mitch Cherniack, Stan Zdonik</td>\n", "      <td>Changing the rules: transformations for rule-based optimizers</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>4</td>\n", "      <td>conf/sigmod/AbadiC02</td>\n", "      <td>335432</td>\n", "      <td>Daniel J. Abadi, Mitch Cherniack</td>\n", "      <td>Visual COKO: a debugger for query optimizer development</td>\n", "      <td>Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang</td>\n", "      <td>NiagaraCQ: a scalable continuous query system for Internet databases</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " _id ltable_id rtable_id ltable_authors \\\n", "0 0 conf/sigmod/AbadiC02 191915 Daniel J. Abadi, Mitch Cherniack \n", "1 1 conf/sigmod/AbadiC02 191931 Daniel J. Abadi, Mitch Cherniack \n", "2 2 conf/sigmod/AbadiC02 233356 Daniel J. Abadi, Mitch Cherniack \n", "3 3 conf/sigmod/AbadiC02 276311 Daniel J. Abadi, Mitch Cherniack \n", "4 4 conf/sigmod/AbadiC02 335432 Daniel J. Abadi, Mitch Cherniack \n", "\n", " ltable_title \\\n", "0 Visual COKO: a debugger for query optimizer development \n", "1 Visual COKO: a debugger for query optimizer development \n", "2 Visual COKO: a debugger for query optimizer development \n", "3 Visual COKO: a debugger for query optimizer development \n", "4 Visual COKO: a debugger for query optimizer development \n", "\n", " rtable_authors \\\n", "0 Michael J. Carey, David J. DeWitt, Michael J. Franklin, Nancy E. Hall, Mark L. McAuliffe, Jeffre... \n", "1 Daniel J. Dietterich \n", "2 Mitch Cherniack, Stanley B. Zdonik \n", "3 Mitch Cherniack, Stan Zdonik \n", "4 Jianjun Chen, David J. 
DeWitt, Feng Tian, Yuan Wang \n", "\n", " rtable_title \n", "0 Shoring up persistent applications \n", "1 DEC data distributor: for data replication and data warehousing \n", "2 Rule languages and internal algebras for rule-based optimizers \n", "3 Changing the rules: transformations for rule-based optimizers \n", "4 NiagaraCQ: a scalable continuous query system for Internet databases " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "14673" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(C)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Sample Candidate Set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the candidate set, a sample (for labeling purposes) can be obtained like this:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Sample 450 tuple pairs from the candidate set C for labeling\n", "S = em.sample_table(C, 450)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Label the Sampled Set" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Column name (gold_label) is not present in dataframe\n" ] } ], "source": [ "# Label the sampled set\n", "# 'gold_label' is the name given to the new label column\n", "G = em.label_table(S, 'gold_label')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The user must specify 0 for a non-match and 1 for a match. Typically, the sampling and labeling steps are repeated in iterations (until the sample contains a sufficient density of matches). Once labeled, the data set will look like this:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Metadata file is not present in the given path; proceeding to read the csv file.\n" ] } ], "source": [ "# Assume that we have labeled the data and stored it in\n", "# labeled_data_demo.csv\n", "\n", "path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'\n", "G = em.read_csv_metadata(path_labeled_data, key='_id',\n", " fk_ltable='ltable_id', fk_rtable='rtable_id',\n", " ltable=A, rtable=B)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n",
"      <th></th>\n", "      <th>_id</th>\n", "      <th>ltable_id</th>\n", "      <th>rtable_id</th>\n", "      <th>ltable_title</th>\n", "      <th>ltable_authors</th>\n", "      <th>ltable_year</th>\n", "      <th>rtable_title</th>\n", "      <th>rtable_authors</th>\n", "      <th>rtable_year</th>\n", "      <th>label</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>0</td>\n", "      <td>l1223</td>\n", "      <td>r498</td>\n", "      <td>Dynamic Information Visualization</td>\n", "      <td>Yannis E. Ioannidis</td>\n", "      <td>1996</td>\n", "      <td>Dynamic information visualization</td>\n", "      <td>Yannis E. Ioannidis</td>\n", "      <td>1996</td>\n", "      <td>1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>1</td>\n", "      <td>l1563</td>\n", "      <td>r1285</td>\n", "      <td>Dynamic Load Balancing in Hierarchical Parallel Database Systems</td>\n", "      <td>Luc Bouganim, Daniela Florescu, Patrick Valduriez</td>\n", "      <td>1996</td>\n", "      <td>Dynamic Load Balancing in Hierarchical Parallel Database Systems</td>\n", "      <td>Luc Bouganim, Daniela Florescu, Patrick Valduriez</td>\n", "      <td>1996</td>\n", "      <td>1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>2</td>\n", "      <td>l1514</td>\n", "      <td>r1348</td>\n", "      <td>Query Processing and Optimization in Oracle Rdb</td>\n", "      <td>Gennady Antoshenkov, Mohamed Ziauddin</td>\n", "      <td>1996</td>\n", "      <td>prospector: a content-based multimedia server for massively parallel architectures</td>\n", "      <td>S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader</td>\n", "      <td>1996</td>\n", "      <td>0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>3</td>\n", "      <td>l206</td>\n", "      <td>r1641</td>\n", "      <td>An Asymptotically Optimal Multiversion B-Tree</td>\n", "      <td>Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger</td>\n", "      <td>1996</td>\n", "      <td>A complete temporal relational algebra</td>\n", "      <td>Debabrata Dey, Terence M. Barron, Veda C. Storey</td>\n", "      <td>1996</td>\n", "      <td>0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>4</td>\n", "      <td>l1589</td>\n", "      <td>r495</td>\n", "      <td>Evaluating Probabilistic Queries over Imprecise Data</td>\n", "      <td>Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar</td>\n", "      <td>2003</td>\n", "      <td>Evaluating probabilistic queries over imprecise data</td>\n", "      <td>Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar</td>\n", "      <td>2003</td>\n", "      <td>1</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " _id ltable_id rtable_id \\\n", "0 0 l1223 r498 \n", "1 1 l1563 r1285 \n", "2 2 l1514 r1348 \n", "3 3 l206 r1641 \n", "4 4 l1589 r495 \n", "\n", " ltable_title \\\n", "0 Dynamic Information Visualization \n", "1 Dynamic Load Balancing in Hierarchical Parallel Database Systems \n", "2 Query Processing and Optimization in Oracle Rdb \n", "3 An Asymptotically Optimal Multiversion B-Tree \n", "4 Evaluating Probabilistic Queries over Imprecise Data \n", "\n", " ltable_authors \\\n", "0 Yannis E. Ioannidis \n", "1 Luc Bouganim, Daniela Florescu, Patrick Valduriez \n", "2 Gennady Antoshenkov, Mohamed Ziauddin \n", "3 Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger \n", "4 Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar \n", "\n", " ltable_year \\\n", "0 1996 \n", "1 1996 \n", "2 1996 \n", "3 1996 \n", "4 2003 \n", "\n", " rtable_title \\\n", "0 Dynamic information visualization \n", "1 Dynamic Load Balancing in Hierarchical Parallel Database Systems \n", "2 prospector: a content-based multimedia server for massively parallel architectures \n", "3 A complete temporal relational algebra \n", "4 Evaluating probabilistic queries over imprecise data \n", "\n", " rtable_authors \\\n", "0 Yannis E. Ioannidis \n", "1 Luc Bouganim, Daniela Florescu, Patrick Valduriez \n", "2 S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader \n", "3 Debabrata Dey, Terence M. Barron, Veda C. Storey \n", "4 Reynold Cheng, Dmitri V. 
Kalashnikov, Sunil Prabhakar \n", "\n", " rtable_year label \n", "0 1996 1 \n", "1 1996 1 \n", "2 1996 0 \n", "3 1996 0 \n", "4 2003 1 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "G.head()" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }