{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This IPython notebook illustrates how to read CSV files from disk as tables and set their metadata." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, import the *py_entitymatching* package and the other required libraries:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/pradap/miniconda3/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n", " \"This module will be removed in 0.20.\", DeprecationWarning)\n" ] } ], "source": [ "import py_entitymatching as em\n", "import pandas as pd\n", "import os, sys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the Path of the CSV File" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, get the path of the CSV file on disk. For convenience, we have included some sample files in the package. 
The path of a sample CSV file can be obtained like this:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Get the datasets directory inside the installed package\n", "datasets_dir = os.path.join(em.get_install_path(), 'datasets')\n", "\n", "# Get the path of the input table\n", "path_A = os.path.join(datasets_dir, 'person_table_A.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ID,name,birth_year,hourly_wage,address,zipcode\r\n", "a1,Kevin Smith,1989,30,\"607 From St, San Francisco\",94107\r\n", "a2,Michael Franklin,1988,27.5,\"1652 Stockton St, San Francisco\",94122\r\n" ] } ], "source": [ "# Display the first three lines of the file at path_A\n", "!head -n 3 $path_A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ways to Read a CSV File and Set Metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three ways to read a CSV file and set its metadata:\n", "\n", "1. Read a CSV file first, and then set the metadata\n", "2. Read a CSV file and set the metadata together\n", "3. Read a CSV file and set the metadata from a file on disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read the CSV File First and Then Set the Metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, read the CSV file as follows:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "A = em.read_csv_metadata(path_A)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "|   | ID | name | birth_year | hourly_wage | address | zipcode |\n", "
|---|---|---|---|---|---|---|\n", "
| 0 | a1 | Kevin Smith | 1989 | 30.0 | 607 From St, San Francisco | 94107 |\n", "
| 1 | a2 | Michael Franklin | 1988 | 27.5 | 1652 Stockton St, San Francisco | 94122 |\n", "
| 2 | a3 | William Bridge | 1986 | 32.0 | 3131 Webster St, San Francisco | 94107 |\n", "
| 3 | a4 | Binto George | 1987 | 32.5 | 423 Powell St, San Francisco | 94122 |\n", "
| 4 | a5 | Alphonse Kemper | 1984 | 35.0 | 1702 Post Street, San Francisco | 94122 |\n", "