{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This IPython notebook illustrates how to sample and label a table (candidate set).\n", "First, we need to import the py_entitymatching package and other libraries as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import the py_entitymatching package\n", "import py_entitymatching as em\n", "import os\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Get the datasets directory\n", "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n", "\n", "path_A = datasets_dir + os.sep + 'DBLP.csv'\n", "path_B = datasets_dir + os.sep + 'ACM.csv'\n", "path_C = datasets_dir + os.sep + 'tableC.csv'\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Metadata file is not present in the given path; proceeding to read the csv file.\n", "Metadata file is not present in the given path; proceeding to read the csv file.\n", "Metadata file is not present in the given path; proceeding to read the csv file.\n" ] } ], "source": [ "A = em.read_csv_metadata(path_A, key='id')\n", "B = em.read_csv_metadata(path_B, key='id')\n", "C = em.read_csv_metadata(path_C, key='_id', \n", " fk_ltable='ltable_id', fk_rtable='rtable_id',\n", " ltable=A, rtable=B)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
| | _id | ltable_id | rtable_id | ltable_authors | ltable_title | rtable_authors | rtable_title |
|---|---|---|---|---|---|---|---|
| 0 | 0 | conf/sigmod/AbadiC02 | 191915 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Michael J. Carey, David J. DeWitt, Michael J. Franklin, Nancy E. Hall, Mark L. McAuliffe, Jeffre... | Shoring up persistent applications |
| 1 | 1 | conf/sigmod/AbadiC02 | 191931 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Daniel J. Dietterich | DEC data distributor: for data replication and data warehousing |
| 2 | 2 | conf/sigmod/AbadiC02 | 233356 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Mitch Cherniack, Stanley B. Zdonik | Rule languages and internal algebras for rule-based optimizers |
| 3 | 3 | conf/sigmod/AbadiC02 | 276311 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Mitch Cherniack, Stan Zdonik | Changing the rules: transformations for rule-based optimizers |
| 4 | 4 | conf/sigmod/AbadiC02 | 335432 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang | NiagaraCQ: a scalable continuous query system for Internet databases |