{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Down sampling is typically done when the input tables are large (e.g. each containing more than 100K tuples). This IPython notebook illustrates how to down sample two large tables that are loaded in memory."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we need to import *py_entitymatching* package as follows:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import py_entitymatching as em"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we need to read in the input tables. For the purposes of this notebook we will use two large datasets: Citeseer and DBLP. You can download Citeseer dataset from http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/citeseer.csv and DBLP dataset from http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/dblp.csv. Once downloaded, save these datasets as 'citeseer.csv' and 'dblp.csv' in the current directory."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Read the CSV files\n",
"A = em.read_csv_metadata('./citeseer.csv',low_memory=False) # setting the parameter low_memory to False to speed up loading.\n",
"B = em.read_csv_metadata('./dblp.csv', low_memory=False)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(1823978, 2512927)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(A), len(B)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" title | \n",
" authors | \n",
" journal | \n",
" month | \n",
" year | \n",
" publication_type | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" An Arithmetic Analogue of Bezouts Theorem | \n",
" David Mckinnon | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Thompsons Group F is Not Minimally Almost Convex | \n",
" James Belk, Kai-uwe Bux | \n",
" NaN | \n",
" NaN | \n",
" 2002.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Cognitive Dimensions Tradeoffs in Tangible User Interface Design | \n",
" Darren Edge, Alan Blackwell | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" ACTIVITY NOUNS, UNACCUSATIVITY, AND ARGUMENT MARKING IN YUKATEKAN SSILA meeting; Special Session... | \n",
" J. Bohnemeyer, Max Planck, I. Introduction | \n",
" NaN | \n",
" NaN | \n",
" 2002.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" PS1-6 A6 ULTRASOUND-GUIDED HIFU NEUROLYSIS OF PERIPHERAL NERVES TO TREAT SPASTICITY AND | \n",
" J. L. Foley, J. W. Little, F. L. Starr Iii, C. Frantz | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id \\\n",
"0 1 \n",
"1 2 \n",
"2 3 \n",
"3 4 \n",
"4 5 \n",
"\n",
" title \\\n",
"0 An Arithmetic Analogue of Bezouts Theorem \n",
"1 Thompsons Group F is Not Minimally Almost Convex \n",
"2 Cognitive Dimensions Tradeoffs in Tangible User Interface Design \n",
"3 ACTIVITY NOUNS, UNACCUSATIVITY, AND ARGUMENT MARKING IN YUKATEKAN SSILA meeting; Special Session... \n",
"4 PS1-6 A6 ULTRASOUND-GUIDED HIFU NEUROLYSIS OF PERIPHERAL NERVES TO TREAT SPASTICITY AND \n",
"\n",
" authors journal month \\\n",
"0 David Mckinnon NaN NaN \n",
"1 James Belk, Kai-uwe Bux NaN NaN \n",
"2 Darren Edge, Alan Blackwell NaN NaN \n",
"3 J. Bohnemeyer, Max Planck, I. Introduction NaN NaN \n",
"4 J. L. Foley, J. W. Little, F. L. Starr Iii, C. Frantz NaN NaN \n",
"\n",
" year publication_type \n",
"0 NaN NaN \n",
"1 2002.0 NaN \n",
"2 NaN NaN \n",
"3 2002.0 NaN \n",
"4 NaN NaN "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"A.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" title | \n",
" authors | \n",
" journal | \n",
" month | \n",
" year | \n",
" publication_type | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Klaus Tschira Stiftung gemeinntzige GmbH, KTS | \n",
" Klaus Tschira | \n",
" NaN | \n",
" NaN | \n",
" 2012 | \n",
" www | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" The SGML/XML Web Page | \n",
" Robin Cover | \n",
" NaN | \n",
" NaN | \n",
" 2006 | \n",
" www | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" The Future of Classic Data Administration: Objects + Databases + CASE | \n",
" Arnon Rosenthal | \n",
" NaN | \n",
" NaN | \n",
" 1998 | \n",
" www | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" XML Query Data Model | \n",
" Mary F. Fernandez, Jonathan Robie | \n",
" NaN | \n",
" NaN | \n",
" 2001 | \n",
" www | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" The XML Query Algebra | \n",
" Peter Fankhauser, Mary F. Fernndez, Ashok Malhotra, Michael Rys, Jrme Simon, Philip Wadler | \n",
" NaN | \n",
" NaN | \n",
" 2001 | \n",
" www | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id title \\\n",
"0 1 Klaus Tschira Stiftung gemeinntzige GmbH, KTS \n",
"1 2 The SGML/XML Web Page \n",
"2 3 The Future of Classic Data Administration: Objects + Databases + CASE \n",
"3 4 XML Query Data Model \n",
"4 5 The XML Query Algebra \n",
"\n",
" authors \\\n",
"0 Klaus Tschira \n",
"1 Robin Cover \n",
"2 Arnon Rosenthal \n",
"3 Mary F. Fernandez, Jonathan Robie \n",
"4 Peter Fankhauser, Mary F. Fernndez, Ashok Malhotra, Michael Rys, Jrme Simon, Philip Wadler \n",
"\n",
" journal month year publication_type \n",
"0 NaN NaN 2012 www \n",
"1 NaN NaN 2006 www \n",
"2 NaN NaN 1998 www \n",
"3 NaN NaN 2001 www \n",
"4 NaN NaN 2001 www "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"B.head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Set 'id' as the keys to the input tables\n",
"em.set_key(A, 'id')\n",
"em.set_key(B, 'id')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"('id', 'id')"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the keys\n",
"em.get_key(A), em.get_key(B)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Downsample the Input Tables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, the input tables can be down sampled as shown below:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Downsample the datasets \n",
"sample_A, sample_B = em.down_sample(A, B, size=1000, y_param=1, show_progress=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the down_sample command, set the `size` to the number of tuples that should be sampled from B (this would be the size of sampled B table) and set the `y_param` to be the number of tuples to be picked from A (for each tuple in sampled B table). In the above, we set the number of tuples to be sampled from B to be 1000. We set the `y_param` to 1."
]
},
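{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration, a larger sample can be drawn by increasing `size` and `y_param` (the parameter values below are arbitrary examples, not recommendations):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Example only: sample 5000 tuples from B, picking tuples from A\n",
"# for each sampled B tuple with y_param set to 2\n",
"sample_A_large, sample_B_large = em.down_sample(A, B, size=5000, y_param=2, show_progress=False)"
]
},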
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Display the number of tuples in the sampled datasets\n",
"len(sample_A), len(sample_B)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, the input tables `A` and `B` have been down sampled to smaller tables `sample_A` and `sample_B` ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Show the metadata of sample_A, sample_B\n",
"em.show_properties(sample_A)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"em.show_properties(sample_B)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the sampled tables retain the same properies as the input tables."
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}