{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Wrangling" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "
\n", "'Data Wrangling' generally refers to transforming raw data into a useable form for your analyses of interest, including loading, aggregating and formating. \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will focus on loading different types of data files. Other aspects of 'wrangling' such as combining different datasets will be covered in future tutorials, and are explored in the assignments.\n", "\n", "Note: Throughout this notebook, we will be using `!` to run the shell command `cat` to print out the contents of example data files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python I/O\n", "\n", "Let's start with basic Python utilities for reading and loading data files. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Official Python \n", "documentation \n", "on input / output.\n", "
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First line of data\r\n", "Second line of data" ] } ], "source": [ "# Check out an example data file\n", "!cat files/data.txt" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First line of data\n", "Second line of data\n" ] } ], "source": [ "# First, explicitly open the file object for reading\n", "file_obj = open('files/data.txt', 'r')\n", "\n", "# You can then loop through the file object, grabbing each line of data\n", "for line in file_obj:\n", " # Here we explicitly remove the new line marker at the end of each line (the '\\n')\n", " print(line.strip('\\n'))\n", "\n", "# File objects then have to closed when you are finished with them\n", "file_obj.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since opening and closing files basically always goes together, there is a shortcut to do both of them together, which is the `with` keyword. \n", "\n", "By using `with`, file objects will be opened, and then automatically closed at the end of the code block. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First line of data\n", "Second line of data\n" ] } ], "source": [ "# Use 'with' keyword to open, read, and then close a file\n", "with open('files/data.txt', 'r') as file_obj:\n", " for line in file_obj:\n", " print(line.strip('\\n'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using input / output functionality from standard library Python is a pretty 'low level' way to read data files. This strategy often takes a lot of work to organize and define the details of how files are organized and how to read them. For example, in the above simple example, we had to deal with the new line character explicitly. \n", "\n", "As long as you have reasonably well structured data files, using standardized file types, you can use higher-level functions that will take care of a lot of these details - loading data straight into `pandas` data objects, for example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas I/O" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Pandas has a range of functions that will automatically read in whole files of standard file types in pandas objects. \n", "
\n", "\n", "
\n", "Official Pandas\n", "documentation \n", "on input / output. \n", "
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Tab complete to check out all the read functions available\n", "pd.read_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## File types\n", "\n", "There are many different file types in which data may be stored. \n", "\n", "Here, we will start by examining CSV and JSON files. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CSV Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "'Comma Separated Value' files store data, separated by comma's. Think of them like lists.\n", "
\n", "\n", "
\n", "More information on CSV files from\n", "wikipedia. \n", "
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1, 2, 3, 4\r\n", "5, 6, 7, 8\r\n", "9, 10, 11, 12" ] } ], "source": [ "# Let's have a look at a csv file (printed out in plain text)\n", "!cat files/data.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### CSV Files with Python" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Python has a module devoted to working with csv's\n", "import csv" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1, 2, 3, 4\n", "5, 6, 7, 8\n", "9, 10, 11, 12\n" ] } ], "source": [ "# We can read through our file with the csv module\n", "with open('files/data.csv') as csv_file:\n", " csv_reader = csv.reader(csv_file, delimiter=',')\n", " for row in csv_reader:\n", " print(', '.join(row))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### CSV Files with Pandas" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Pandas also has functions to directly load csv data\n", "pd.read_csv?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01234
15678
29101112
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1 2 3 4\n", "1 5 6 7 8\n", "2 9 10 11 12" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's read in our csv file\n", "pd.read_csv(open('files/data.csv'), header=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, using `Pandas` save us from having to do more work (write more code) to use load the file. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### JSON Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "JavaScript Object Notation files can store hierachical key/value pairings. Think of them like dictionaries.\n", "
\n", "\n", "
\n", "More information on JSON files from\n", "wikipedia.\n", "
" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\r\n", " \"firstName\": \"John\",\r\n", " \"age\": 53\r\n", "}\r\n" ] } ], "source": [ "# Let's have a look at a json file (printed out in plain text)\n", "!cat files/data.json" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'firstName': 'John', 'age': '53'}\n" ] } ], "source": [ "# Think of json's as similar to dictionaries\n", "d = {'firstName': 'John', 'age': '53'}\n", "print(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### JSON Files with Python" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Python also has a module for dealing with json\n", "import json" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Load a json file\n", "with open('files/data.json') as dat_file: \n", " dat = json.load(dat_file)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Check what data type this gets loaded as\n", "print(type(dat))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### JSON Files with Pandas" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# Pandas also has support for reading in json files\n", "pd.read_json?" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "first Alan\n", "place Manchester\n", "dtype: object" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# You can read in json formatted strings with pandas\n", "# Note that here I am specifying to read it in as a pd.Series, as there is a single line of data\n", "pd.read_json('{ \"first\": \"Alan\", \"place\": \"Manchester\"}', typ='series')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "firstName John\n", "age 53\n", "dtype: object" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read in our json file with pandas\n", "pd.read_json(open('files/data.json'), typ='series')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "As a general guideline, for loading and wrangling data files, using standardized data files, and loading them with 'higher-level' tools such as `Pandas` makes it easier to work with data files. " ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 1 }