{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This IPython notebook illustrates how to read a CSV file from disk as a table and set its metadata." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to import the *py_entitymatching* package and other libraries as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import py_entitymatching as em\n", "import pandas as pd\n", "import os, sys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Different Ways to Read a CSV File and Set Metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to get the path of the CSV file on disk. For convenience, the package includes some sample files. The path of a sample CSV file can be obtained like this:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get the datasets directory inside the installed package\n", "datasets_dir = os.path.join(em.get_install_path(), 'datasets')\n", "\n", "# Get the path of the input table\n", "path_A = os.path.join(datasets_dir, 'person_table_A.csv')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ID,name,birth_year,hourly_wage,address,zipcode\r\n", "a1,Kevin Smith,1989,30,\"607 From St, San Francisco\",94107\r\n", "a2,Michael Franklin,1988,27.5,\"1652 Stockton St, San Francisco\",94122\r\n" ] } ], "source": [ "# Display the first three lines of the file at path_A\n", "!cat $path_A | head -3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have the CSV file path, we can use it to read the contents and set the metadata." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Different Ways to Read a CSV File and Set Metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three ways to read a CSV file and set its metadata:\n", "\n", "1. Read a CSV file first, and then set the metadata\n", "2. Read a CSV file and set the metadata together\n", "3. Read a CSV file and set the metadata from a file on disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Read the CSV File First and Then Set the Metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, read the CSV file as follows:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "A = em.read_csv_metadata(path_A)\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>ID</th><th>name</th><th>birth_year</th><th>hourly_wage</th><th>address</th><th>zipcode</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>a1</td><td>Kevin Smith</td><td>1989</td><td>30.0</td><td>607 From St, San Francisco</td><td>94107</td></tr>\n",
"<tr><th>1</th><td>a2</td><td>Michael Franklin</td><td>1988</td><td>27.5</td><td>1652 Stockton St, San Francisco</td><td>94122</td></tr>\n",
"<tr><th>2</th><td>a3</td><td>William Bridge</td><td>1986</td><td>32.0</td><td>3131 Webster St, San Francisco</td><td>94107</td></tr>\n",
"<tr><th>3</th><td>a4</td><td>Binto George</td><td>1987</td><td>32.5</td><td>423 Powell St, San Francisco</td><td>94122</td></tr>\n",
"<tr><th>4</th><td>a5</td><td>Alphonse Kemper</td><td>1984</td><td>35.0</td><td>1702 Post Street, San Francisco</td><td>94122</td></tr>\n",
"</tbody>\n", "</table>\n", "