{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This IPython notebook illustrates how to read a CSV file from disk as a table and set its metadata."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need to import *py_entitymatching* package and other libraries as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import py_entitymatching as em\n",
    "import pandas as pd\n",
    "import os, sys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Different Ways to Read a CSV File and Set Metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need to get the path of the CSV file in disk. For the convenience of the user, we have included some sample files in the package. The path of a sample CSV file can be obtained like this:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Get the datasets directory\n",
    "datasets_dir = em.get_install_path() + os.sep + 'datasets'\n",
    "\n",
    "# Get the path of the input table\n",
    "path_A = datasets_dir + os.sep + 'person_table_A.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ID,name,birth_year,hourly_wage,address,zipcode\r\n",
      "a1,Kevin Smith,1989,30,\"607 From St, San Francisco\",94107\r\n",
      "a2,Michael Franklin,1988,27.5,\"1652 Stockton St, San Francisco\",94122\r\n"
     ]
    }
   ],
   "source": [
    "# Display the contents of the file in path_A\n",
    "!cat $path_A | head -3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we get the CSV file path, we can use it read the contents and set metadata."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Different Ways to Read a CSV File and Set Metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are three different ways to read a CSV file and set  metadata:\n",
    "\n",
    "1. Read a CSV file first, and then set the metadata\n",
    "2. Read a CSV file and set the metadata together\n",
    "3. Read a CSV file and set the metadata from a file in disk "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Read the CSV file First and Then Set the Metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, read the CSV files as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "A = em.read_csv_metadata(path_A)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>name</th>\n",
       "      <th>birth_year</th>\n",
       "      <th>hourly_wage</th>\n",
       "      <th>address</th>\n",
       "      <th>zipcode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>Kevin Smith</td>\n",
       "      <td>1989</td>\n",
       "      <td>30.0</td>\n",
       "      <td>607 From St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a2</td>\n",
       "      <td>Michael Franklin</td>\n",
       "      <td>1988</td>\n",
       "      <td>27.5</td>\n",
       "      <td>1652 Stockton St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a3</td>\n",
       "      <td>William Bridge</td>\n",
       "      <td>1986</td>\n",
       "      <td>32.0</td>\n",
       "      <td>3131 Webster St, San Francisco</td>\n",
       "      <td>94107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a4</td>\n",
       "      <td>Binto George</td>\n",
       "      <td>1987</td>\n",
       "      <td>32.5</td>\n",
       "      <td>423 Powell St, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a5</td>\n",
       "      <td>Alphonse Kemper</td>\n",
       "      <td>1984</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1702 Post Street, San Francisco</td>\n",
       "      <td>94122</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID              name  birth_year  hourly_wage  \\\n",
       "0  a1       Kevin Smith        1989         30.0   \n",
       "1  a2  Michael Franklin        1988         27.5   \n",
       "2  a3    William Bridge        1986         32.0   \n",
       "3  a4      Binto George        1987         32.5   \n",
       "4  a5   Alphonse Kemper        1984         35.0   \n",
       "\n",
       "                           address  zipcode  \n",
       "0       607 From St, San Francisco    94107  \n",
       "1  1652 Stockton St, San Francisco    94122  \n",
       "2   3131 Webster St, San Francisco    94107  \n",
       "3     423 Powell St, San Francisco    94122  \n",
       "4  1702 Post Street, San Francisco    94122  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the 'type' of A \n",
    "type(A)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then set the metadata for the table. We see `ID` is the key attribute (since it contains unique values and no value is missing) for the table. We can set this metadata as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "em.set_key(A, 'ID')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ID'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get the metadata that were set for table A\n",
    "em.get_key(A)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the CSV file is read into the memory and the metadata (i.e. key) is set for the table. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read a CSV File and Set Metadata Together"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the above, we saw that we first read in the CSV file and then set the metadata. These two steps can be combined into a single step like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "A = em.read_csv_metadata(path_A, key='ID')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the 'type' of A\n",
    "type(A)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ID'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get the metadata that were set for the table A \n",
    "em.get_key(A)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read a CSV File and Set Metadata from a File in Disk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The user can specify the metadata in a file.\n",
    "\n",
    "This file *MUST* be in the same directory as the CSV file and the file name \n",
    "should be same, except the extension is set to '.metadata'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Specify the metadata for table A (stored in person_table_A.csv).\n",
    "\n",
    "# Get the file name (with full path) where the metadata file must be stored\n",
    "metadata_fname = 'person_table_A.metadata'\n",
    "metadata_file = datasets_dir + os.sep + metadata_fname\n",
    "\n",
    "# Specify the metadata for table A . Here we specify that 'ID' is the key attribute for the table. \n",
    "\n",
    "# Note that this step  requires write permission to the datasets directory.\n",
    "with open(metadata_file, 'w') as the_file:\n",
    "    the_file.write('#key=ID')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: In the above, we used Unix shell command `echo` to write the metadata contents. If you are on Windows, you can use `echo|set /p` instead of `echo` to acheive the same effect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# If you donot have write permissions to the datasets directory, first copy the file to the local directory and \n",
    "# then create a metadata file like this (you need to uncomment the following lines and then execute):\n",
    "\n",
    "# import shutil\n",
    "# shutil.copy2('path_A', './person_table_A.metadata')\n",
    "# metadata_local_file = 'person_table_A.metadata'\n",
    "# with open(metadata_local_file, 'w') as the_file:\n",
    "#    the_file.write('#key=ID'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Read the CSV file for table A\n",
    "A = em.read_csv_metadata(path_A)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ID'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get the key for table A\n",
    "em.get_key(A)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Remove the metadata file\n",
    "os.remove(metadata_file) if os.path.exists(metadata_file) else None\n",
    "os.remove('person_table_A.csv') if os.path.exists('person_table_A.csv') else None\n",
    "os.remove(metadata_fname) if os.path.exists(metadata_fname) else None"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}