{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Lecture 15: Other file formats\n", "\n", "CSCI 1360: Foundations for Informatics and Analytics" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Overview and Objectives" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In the last lecture, we looked at some ways of interacting with the filesystem through Python and how to read data off files stored on the hard drive. We looked at raw text files; however, there are numerous structured formats that these files can take, and we'll explore some of those here. By the end of this lecture, you should be able to:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - Identify some of the primary data storage formats\n", " - Explain how to use other tools for some of the more exotic data types" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 1: Comma-separated value (CSV) files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We've discussed text formats: each line of text in a file can be treated as a string in a list of strings. What else might we encounter in our data science travels?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Easily the most common text file format is the CSV, or comma-separated values format. This is pretty much what it sounds like: if you have (semi-) structured data, you can delineate spaces between data using commas (or, to generalize, other characters like tabs)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As an example, we could represent a matrix very easily using the CSV format. The file storing a 3x3 matrix would look something like this:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n",
    "1,2,3\n",
    "4,5,6\n",
    "7,8,9\n",
    "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Each row is on one line by itself, and the columns are separated by commas." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "How can we read a CSV file? One way, potentially, is just do it yourself:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# File \"csv_file.txt\" contains the following:\n", "\n", "# 1,2,3,4\n", "# 5,6,7,8\n", "# 9,10,11,12" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]\n" ] } ], "source": [ "matrix = []\n", "\n", "with open(\"csv_file.txt\", \"r\") as f:\n", " full_file = f.read()\n", " \n", " # Split into lines.\n", " lines = full_file.strip().split(\"\\n\")\n", "\n", " for line in lines:\n", " # Split on commas.\n", " elements = line.strip().split(\",\")\n", " matrix.append([])\n", " \n", " # Convert to integers and store in the list.\n", " for e in elements:\n", " matrix[-1].append(int(e))\n", "\n", "print(matrix)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "If, however, we'd prefer to use something a little less `strip()`-y and `split()`-y, Python also has a core `csv` module built-in:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sunny-side up,Over easy,Scrambled\n", "Spam,Spam,More spam\n", "\n" ] } ], "source": [ "import csv\n", "\n", "with open(\"eggs.csv\", \"w\") as csv_file:\n", " file_writer = csv.writer(csv_file)\n", " row1 = [\"Sunny-side up\", \"Over easy\", \"Scrambled\"]\n", " row2 = [\"Spam\", \"Spam\", \"More spam\"]\n", " file_writer.writerow(row1)\n", " file_writer.writerow(row2)\n", " \n", "with open(\"eggs.csv\", \"r\") as csv_file:\n", " print(csv_file.read())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Notice that you first create a file reference, just like before. The one added step, though, is passing that reference to the `csv.writer()` function." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Once you've created the `file_writer` object, you can call its `writerow()` function and pass in a list to the function, and it is automatically written to the file in CSV format!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The CSV readers let you do the opposite: read a line of text from a CSV file directly into a list." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Sunny-side up', 'Over easy', 'Scrambled']\n", "['Spam', 'Spam', 'More spam']\n" ] } ], "source": [ "with open(\"eggs.csv\", \"r\") as csv_file:\n", " file_reader = csv.reader(csv_file)\n", " for csv_row in file_reader:\n", " print(csv_row)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You can use a `for` loop to iterate over the rows in the CSV file. In turn, each row is a list, where each element of the list was separated by a comma." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 2: JavaScript Object Notation (JSON) files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\"JSON\", short for \"JavaScript Object Notation\", has emerged as more or less the *de facto* standard format for interacting with online services. Like CSV, it's a text-based format, but is much more flexible than CSV." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here's an example: an object in JSON format that represents a person." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "person = \"\"\"\n", "{\"name\": \"Wes\",\n", "\"places_lived\": [\"United States\", \"Spain\", \"Germany\"],\n", "\"pet\": null,\n", "\"siblings\": [{\"name\": \"Scott\", \"age\": 25, \"pet\": \"Zuko\"},\n", " {\"name\": \"Katie\", \"age\": 33, \"pet\": \"Cisco\"}]\n", "}\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It looks kind of a like a Python dictionary, doesn't it? You have key-value pairs, and they can accommodate almost any data type. In fact, when JSON objects are converted into native Python data structures, they are represented using dictionaries." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "For reading and writing JSON objects, we can use the built-in `json` Python module." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import json" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "(*Aside*: with CSV files, it was fairly straightforward to eschew the built-in `csv` module and do it yourself. With JSON, it is **much harder**; in fact, there really isn't a case where it's advisable to roll your own over using the built-in `json` module)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "There are two functions of interest: `dumps()` and `loads()`. One of them takes a JSON string and converts it to a native Python object, while the other does the opposite." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "First, we'll take our JSON string and convert it into a Python dictionary:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'places_lived': ['United States', 'Spain', 'Germany'], 'pet': None, 'name': 'Wes', 'siblings': [{'age': 25, 'pet': 'Zuko', 'name': 'Scott'}, {'age': 33, 'pet': 'Cisco', 'name': 'Katie'}]}\n" ] } ], "source": [ "python_dict = json.loads(person)\n", "print(python_dict)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "And if you want to take a Python dictionary and convert it into a JSON string--perhaps you're about to save it to a file, or send it over the network to someone else--we can do that." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"places_lived\": [\"United States\", \"Spain\", \"Germany\"], \"pet\": null, \"name\": \"Wes\", \"siblings\": [{\"age\": 25, \"pet\": \"Zuko\", \"name\": \"Scott\"}, {\"age\": 33, \"pet\": \"Cisco\", \"name\": \"Katie\"}]}\n" ] } ], "source": [ "json_string = json.dumps(python_dict)\n", "print(json_string)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "At first glance, these two print-outs may look the same, but if you look closely you'll see some differences. Plus, if you tried to index `json_string[\"name\"]` you'd get some very strange errors. `python_dict[\"name\"]`, on the other hand, should nicely return `\"Wes\"`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 3: Extensible Markup Language (XML) files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**AVOID AT ALL COSTS**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "...but if you have to interact with XML data (e.g., you're manually parsing a web page!), Python has a built-in `xml` library." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "XML is about as general as it gets when it comes to representing data using structured text; you can represent pretty much anything. HTML is an example of XML in practice." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```\n", "\n", "\n", " Hello, world!\n", " Stop the planet, I want to get off!\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This is about the simplest excerpt of XML in existence. The basic idea is you have *tags* (delineated by `<` and `>` symbols) that identify where certain fields begin and end." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Each field has an opening tag, with the name of the field in angled brackets: ``. The closing tag is exactly the same, except with a backslash in front of the tag to indicate closing: ``" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "These tags can also have their own custom *attributes* that slightly tweak their behavior (e.g. the `standalone=\"yes\"` attribute in the opening `\n", "\n", " \n", " 1\n", " 2008\n", " 141100\n", " \n", " \n", " \n", " \n", " 4\n", " 2011\n", " 59900\n", " \n", " \n", " \n", " 68\n", " 2011\n", " 13600\n", " \n", " \n", " \n", "\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data\n" ] } ], "source": [ "import xml.etree.ElementTree as ET # See, even the import statement is stupid complicated.\n", "\n", "tree = ET.parse('xml_file.txt')\n", "root = tree.getroot()\n", "print(root.tag) # The root node is \"data\", so that's what we should see here." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "With the root node, we have access to all the \"child\" data beneath it, such as the various country names:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tag: \"country\" :: Name: \"Liechtenstein\"\n", "Tag: \"country\" :: Name: \"Singapore\"\n", "Tag: \"country\" :: Name: \"Panama\"\n" ] } ], "source": [ "for child in root:\n", " print(\"Tag: \\\"{}\\\" :: Name: \\\"{}\\\"\".format(child.tag, child.attrib[\"name\"]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 4: Binary files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "What happens when we're not dealing with *text*? After all, images and videos are most certainly not encoded using text. Furthermore, if memory is an issue, converting text into binary formats can help save space." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "There are two primary options for reading and writing binary files." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "1. `pickle`, or \"pickling\", is native in Python and very flexible.\n", "2. NumPy's binary format, which works very well for NumPy arrays but not much else." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Pickle has some similarities with JSON. In particular, it uses the same method names, `dumps()` and `loads()`, for converting between native Python objects and the raw data format. There are several differences, however." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - Most notably, JSON is text-based whereas pickle is binary. You could open up a JSON file and read the text yourself. Not the case with pickled files." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - While JSON is used widely outside of Python, pickle is specific to Python and its objects. Consequently, JSON only works on a subset of Python data structures; pickle, on the other hand, works on just about everything." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here's an example of saving (or \"serializing\") a dictionary using pickle instead of JSON:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'\\x80\\x03}q\\x00(X\\x0c\\x00\\x00\\x00places_livedq\\x01]q\\x02(X\\r\\x00\\x00\\x00United Statesq\\x03X\\x05\\x00\\x00\\x00Spainq\\x04X\\x07\\x00\\x00\\x00Germanyq\\x05eX\\x03\\x00\\x00\\x00petq\\x06NX\\x04\\x00\\x00\\x00nameq\\x07X\\x03\\x00\\x00\\x00Wesq\\x08X\\x08\\x00\\x00\\x00siblingsq\\t]q\\n(}q\\x0b(X\\x03\\x00\\x00\\x00ageq\\x0cK\\x19h\\x06X\\x04\\x00\\x00\\x00Zukoq\\rh\\x07X\\x05\\x00\\x00\\x00Scottq\\x0eu}q\\x0f(h\\x0cK!h\\x06X\\x05\\x00\\x00\\x00Ciscoq\\x10h\\x07X\\x05\\x00\\x00\\x00Katieq\\x11ueu.'\n" ] } ], "source": [ "import pickle\n", "\n", "# We'll use the `python_dict` object from before.\n", "binary_object = pickle.dumps(python_dict)\n", "print(binary_object)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You can kinda see some English in there--mainly, the string constants. But everything else has been encoded in binary. It's much more space-efficient, but complete gibberish until you convert it back into a text format (e.g. JSON) or native Python object (e.g. dictionary)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "If, on the other hand, you're using NumPy arrays, then you can use its own built-in binary format for saving and loading your arrays." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[6 8 1]\n", " [5 5 7]\n", " [9 2 0]]\n" ] } ], "source": [ "import numpy as np\n", "\n", "# Generate some data and save it.\n", "some_data = np.random.randint(10, size = (3, 3))\n", "print(some_data)\n", "np.save(\"my_data.npy\", some_data)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Now we can load it back:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[6 8 1]\n", " [5 5 7]\n", " [9 2 0]]\n" ] } ], "source": [ "my_data = np.load(\"my_data.npy\")\n", "print(my_data)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This is by far the easiest format to work with when you're dealing exclusively with NumPy arrays; don't bother with CSV or pickling. You don't even need to set up file descriptors with the NumPy interface." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "That said, there are limitations to NumPy serialization: namely, it can only serialize in binary format things that can be stored in NumPy arrays. This does *not* include dictionaries!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "`pickle`, on the other hand, can serialize dictionaries (in fact, it specializes in serializing dictionaries), but like NumPy serialization is also not terribly cross-platform capable." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "So basically, some core rules of thumb on what binary format to use:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - If it's a NumPy array, use NumPy serialization." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - If it's *not* a NumPy array, but *could* be (e.g. a list of numbers), use NumPy serialization." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - If it's a dictionary, or a structure that mixes string and numeric types, or uses wholesale objects, use `pickle`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Review Questions\n", "\n", "Some questions to discuss and consider:\n", "\n", "1: Dictionaries can be very complex; for a good example, just have a look at how big a dictionary representation of a single Tweet is [https://dev.twitter.com/overview/api/tweets](https://dev.twitter.com/overview/api/tweets): there's `\"created_at\"`, which is a string indicating the time the tweet was created; `\"contributors\"`, which is a dictionary unto itself identifying users participating in a thread; `\"entities\"`, a dictionary of lists that includes hashtags and URLs in the tweet; and `\"user\"`, which is another gargantuan dictionary containing all the information about the author of the tweet. What would be a good format to store these tweets in on the hard drive? What if we were sending these tweets somewhere, such as a smartphone app; would we use a different format? Explain.\n", "\n", "2: You can actually read raw bytes of a binary file using the standard Python `open()` function, provided you supply the special `\"b\"` flag to indicate a binary format. Can you imagine any circumstances under which you'd read a binary file this way?\n", "\n", "3: Is there any other format in which we could store the example XML data from this lecture such that we could avoid using XML entirely?\n", "\n", "4: NumPy itself has limited CSV-reading capabilities in `numpy.loadtxt`. Given its limitations in binary serialization as discussed in this lecture, do you imagine there are limitations on what kind of data it can read from CSV files?\n", "\n", "5: What kind of format (binary or text) is a .png image? Could it be stored as the other format? How?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Course Administrivia" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - A6 due today!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - Review session #3 next Wednesday (Oct 12) in preparation for the midterm!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - Midterm will be an **in-class, written exam** on Thursday, October 13 (one week from today)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - GII Symposium next Tuesday (Oct 11) at the GA Center. Come and check out the posters and chat with the presenters, then find me for extra credit!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Additional Resources\n", "\n", " 1. Matthes, Eric. *Python Crash Course*, Chapter 10. 2016. ISBN-13: 978-1593276034\n", " 2. McKinney, Wes. *Python for Data Analysis*, Chapter 6. 2013. ISBN-13: 978-1449319793" ] } ], "metadata": { "anaconda-cloud": {}, "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }