{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This IPython notebook illustrates how to down sample two large tables that are loaded in memory." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/pradap/miniconda3/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n", " \"This module will be removed in 0.20.\", DeprecationWarning)\n" ] } ], "source": [ "import py_entitymatching as em" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Down sampling is typically done when the input tables are large (e.g., each containing more than 100K tuples). For the purposes of this notebook, we will use two large datasets: Citeseer and DBLP. You can download the Citeseer dataset from http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/citeseer.csv and the DBLP dataset from http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/dblp.csv. Once downloaded, save these datasets as 'citeseer.csv' and 'dblp.csv' in the current directory." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Read the CSV files; low_memory is set to False to speed up loading.\n", "A = em.read_csv_metadata('./citeseer.csv', low_memory=False)\n", "B = em.read_csv_metadata('./dblp.csv', low_memory=False)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(1823978, 2512927)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(A), len(B)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
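<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n", "      <th>id</th>\n", "      <th>title</th>\n", "      <th>authors</th>\n", "      <th>journal</th>\n", "      <th>month</th>\n", "      <th>year</th>\n", "      <th>publication_type</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n", "      <td>1</td>\n", "      <td>An Arithmetic Analogue of Bezouts Theorem</td>\n", "      <td>David Mckinnon</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n", "      <td>2</td>\n", "      <td>Thompsons Group F is Not Minimally Almost Convex</td>\n", "      <td>James Belk, Kai-uwe Bux</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>2002.0</td>\n", "      <td>NaN</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n", "      <td>3</td>\n", "      <td>Cognitive Dimensions Tradeoffs in Tangible User Interface Design</td>\n", "      <td>Darren Edge, Alan Blackwell</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n", "      <td>4</td>\n", "      <td>ACTIVITY NOUNS, UNACCUSATIVITY, AND ARGUMENT MARKING IN YUKATEKAN SSILA meeting; Special Session...</td>\n", "      <td>J. Bohnemeyer, Max Planck, I. Introduction</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>2002.0</td>\n", "      <td>NaN</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n", "      <td>5</td>\n", "      <td>PS1-6 A6 ULTRASOUND-GUIDED HIFU NEUROLYSIS OF PERIPHERAL NERVES TO TREAT SPASTICITY AND</td>\n", "      <td>J. L. Foley, J. W. Little, F. L. Starr Iii, C. Frantz</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>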
" ], "text/plain": [ " id \\\n", "0 1 \n", "1 2 \n", "2 3 \n", "3 4 \n", "4 5 \n", "\n", " title \\\n", "0 An Arithmetic Analogue of Bezouts Theorem \n", "1 Thompsons Group F is Not Minimally Almost Convex \n", "2 Cognitive Dimensions Tradeoffs in Tangible User Interface Design \n", "3 ACTIVITY NOUNS, UNACCUSATIVITY, AND ARGUMENT MARKING IN YUKATEKAN SSILA meeting; Special Session... \n", "4 PS1-6 A6 ULTRASOUND-GUIDED HIFU NEUROLYSIS OF PERIPHERAL NERVES TO TREAT SPASTICITY AND \n", "\n", " authors journal month \\\n", "0 David Mckinnon NaN NaN \n", "1 James Belk, Kai-uwe Bux NaN NaN \n", "2 Darren Edge, Alan Blackwell NaN NaN \n", "3 J. Bohnemeyer, Max Planck, I. Introduction NaN NaN \n", "4 J. L. Foley, J. W. Little, F. L. Starr Iii, C. Frantz NaN NaN \n", "\n", " year publication_type \n", "0 NaN NaN \n", "1 2002.0 NaN \n", "2 NaN NaN \n", "3 2002.0 NaN \n", "4 NaN NaN " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
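<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n", "      <th>id</th>\n", "      <th>title</th>\n", "      <th>authors</th>\n", "      <th>journal</th>\n", "      <th>month</th>\n", "      <th>year</th>\n", "      <th>publication_type</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n", "      <td>1</td>\n", "      <td>Klaus Tschira Stiftung gemeinntzige GmbH, KTS</td>\n", "      <td>Klaus Tschira</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>2012</td>\n", "      <td>www</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n", "      <td>2</td>\n", "      <td>The SGML/XML Web Page</td>\n", "      <td>Robin Cover</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>2006</td>\n", "      <td>www</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n", "      <td>3</td>\n", "      <td>The Future of Classic Data Administration: Objects + Databases + CASE</td>\n", "      <td>Arnon Rosenthal</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>1998</td>\n", "      <td>www</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n", "      <td>4</td>\n", "      <td>XML Query Data Model</td>\n", "      <td>Mary F. Fernandez, Jonathan Robie</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>2001</td>\n", "      <td>www</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n", "      <td>5</td>\n", "      <td>The XML Query Algebra</td>\n", "      <td>Peter Fankhauser, Mary F. Fernndez, Ashok Malhotra, Michael Rys, Jrme Simon, Philip Wadler</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>2001</td>\n", "      <td>www</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>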
" ], "text/plain": [ " id title \\\n", "0 1 Klaus Tschira Stiftung gemeinntzige GmbH, KTS \n", "1 2 The SGML/XML Web Page \n", "2 3 The Future of Classic Data Administration: Objects + Databases + CASE \n", "3 4 XML Query Data Model \n", "4 5 The XML Query Algebra \n", "\n", " authors \\\n", "0 Klaus Tschira \n", "1 Robin Cover \n", "2 Arnon Rosenthal \n", "3 Mary F. Fernandez, Jonathan Robie \n", "4 Peter Fankhauser, Mary F. Fernndez, Ashok Malhotra, Michael Rys, Jrme Simon, Philip Wadler \n", "\n", " journal month year publication_type \n", "0 NaN NaN 2012 www \n", "1 NaN NaN 2006 www \n", "2 NaN NaN 1998 www \n", "3 NaN NaN 2001 www \n", "4 NaN NaN 2001 www " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Set 'id' as the keys to the input tables\n", "em.set_key(A, 'id')\n", "em.set_key(B, 'id')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('id', 'id')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the keys\n", "em.get_key(A), em.get_key(B)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Downsample the datasets \n", "sample_A, sample_B = em.down_sample(A, B, size=1000, y_param=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the down_sample command, set the `size` to the number of tuples that should be sampled from B (this would be the size of sampled B table) and set the `y_param` to be the number of matching tuples to be picked from A.\n", "\n", "In the above, we set the number of tuples to be sampled from B to be 1000. 
We set `y_param` to 1, meaning that for each tuple sampled from B, one matching tuple is picked from A.\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Display the lengths of the sampled datasets\n", "len(sample_A), len(sample_B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the input tables `A` and `B` (with about 1.8M and 2.5M tuples, respectively) are down sampled to the smaller tables `sample_A` and `sample_B`; `sample_B` contains the 1000 tuples requested through `size`." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }