{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Lecture 12: Saving to and loading from the file system\n", "\n", "CSCI 1360E: Foundations for Informatics and Analytics" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Overview and Objectives\n", "\n", "So far, all the data we've worked with have either been manually instantiated as NumPy arrays or lists of strings, or randomly generated. Here we'll finally get to go over reading to and writing from the filesystem. By the end of this lecture, you should be able to:\n", "\n", " - Identify some of the primary data storage formats\n", " - Implement a basic file reader / writer using built-in Python tools\n", " - Explain how to use other tools for some of the more exotic data types\n", " - Use exception handlers to make your interactions with the filesystem robust to failure" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 1: Interacting with text files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Text files are probably the most common and pervasive format of data. They can contain almost anything: weather data, stock market activity, literary works, and raw web data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Text files are also convenient for your own work: once some kind of analysis has finished, it's nice to dump the results into a file you can inspect later." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Reading an entire file\n", "\n", "So let's jump into it! Let's start with something simple; say...the text version of Lewis Carroll's *Alice in Wonderland*?" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll\n" ] } ], "source": [ "file_object = open(\"alice.txt\", \"r\")\n", "contents = file_object.read()\n", "print(contents[:71])\n", "file_object.close()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Yep, I went there." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's walk through the code, line by line. First, we have a call to a function `open()` that accepts two arguments:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "file_object = open(\"alice.txt\", \"r\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - The first argument is the *file path*. It's like a URL, except to a file on your computer. It should be noted that, unless you specify a leading forward slash `\"/\"`, Python will interpret this path to be *relative* to wherever the Python script is that you're running with this command." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - The second argument is the *mode*. This tells Python whether you're reading from a file, writing to a file, or appending to a file. We'll come to each of these." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "These two arguments are part of the function `open()`, which then returns a *file descriptor*. You can think of this kind of like the reference / pointer discussion we had in our prior functions lecture: `file_object` is a reference to the file." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The next line is where the magic happens:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "contents = file_object.read()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In this line, we're calling the method `read()` on the file reference we got in the previous step. This method goes into the file, pulls out *everything* in it, and sticks it all in the variable `contents`. One big string!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll\n" ] } ], "source": [ "print(contents[:71])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "...of which I then print the first 71 characters, which contains the name of the book and the author. Feel free to print the entire string `contents`; it'll take a few seconds, as you're printing the whole book!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Finally, the last and possibly most important line:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "file_object.close()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This statement explicitly closes the file reference, effectively shutting the valve to the file. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Do not** underestimate the value of this statement. There are weird errors that can crop up when you forget to close file descriptors. It can be difficult to remember to do this, though; in other languages where you have to manually allocate and release any memory you use, it's a bit easier to remember. Since Python handles all that stuff for us, it's not a force of habit to explicitly shut off things we've turned on." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Fortunately, there's an alternative we can use!" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll\n" ] } ], "source": [ "with open(\"alice.txt\", \"r\") as file_object:\n", " contents = file_object.read()\n", " print(contents[:71])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This code works identically to the code before it. The difference is, by using a `with` block, Python intrinsically closes the file descriptor at the end of the block. Therefore, no need to remember to do it yourself! Hooray!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's say, instead of *Alice in Wonderland*, we had some behemoth of a piece of literature: something along the lines of *War and Peace* or even an entire encyclopedia. Essentially, not something we want to read into Python *all at once.* Fortunately, we have an alternative:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll\n", "\n", "\n", "\n", "This eBook is for the use of anyone anywhere at no cost and with\n", "\n", "almost no restrictions whatsoever. You may copy it, give it away or\n", "\n", "re-use it under the terms of the Project Gutenberg License included\n", "\n" ] } ], "source": [ "with open(\"alice.txt\", \"r\") as file_object:\n", " num_lines = 0\n", " for line_of_text in file_object:\n", " print(line_of_text)\n", "\n", " num_lines += 1\n", " if num_lines == 5: break" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can use a `for` loop just as we're used to doing with lists. In this case, at each iteration, Python will hand you **exactly 1** line of text from the file to handle it however you'd like." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Of course, if you still want to read in the entire file at once, but really like the idea of splitting up the file line by line, there's a function for that, too:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll\n", "\n" ] } ], "source": [ "with open(\"alice.txt\", \"r\") as file_object:\n", " lines_of_text = file_object.readlines()\n", " print(lines_of_text[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "By using `readlines()` instead of plain old `read()`, we'll get back a list of strings, where each element of the list is a single line in the text file. In the code snippet above, I've printed the first line of text from the file." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Writing to a file\n", "\n", "We've so far seen how to *read* data from a file. What if we've done some computations and want to save our results to a file?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "data_to_save = \"This is important data. Definitely worth saving.\"\n", "with open(\"outfile.txt\", \"w\") as file_object:\n", " file_object.write(data_to_save)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You'll notice two important changes from before:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " 1. Switch the `\"r\"` argument in the `open()` function to `\"w\"`. You guessed it: we've gone from **R**eading to **W**riting.\n", " 2. Call `write()` on your file descriptor, and pass in the data you want to write to the file (in this case, `data_to_save`)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If you try this using a new notebook on JupyterHub (or on your local machine), you should see a new text file named \"`outfile.txt`\" appear in the same directory as your script. Give it a shot!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Some notes about writing to a file:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - If the file you're writing to does NOT currently exist, Python will try to create it for you. In most cases this should be fine (but we'll get to outstanding cases in Part 3 of this lecture)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - If the file you're writing to DOES already exist, Python will overwrite everything in the file with the new content. As in, **everything that was in the file before will be erased**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "That second point seems a bit harsh, doesn't it? Luckily, there is recourse." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Appending to an existing file" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If you find yourself in the situation of writing to a file multiple times, and wanting to keep what you wrote to the file previously, then you're in the market for *appending* to a file." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This works *exactly* the same as writing to a file, with one small wrinkle:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "data_to_save = \"This is ALSO important data. BOTH DATA ARE IMPORTANT.\"\n", "with open(\"outfile.txt\", \"a\") as file_object:\n", " file_object.write(data_to_save)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The only change that was made was switching the `\"w\"` in the `open()` method to `\"a\"` for, you guessed it, **A**ppend. If you look in `outfile.txt`, you should see both lines of text we've written." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Some notes on appending to files:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - If the file does NOT already exist, then using \"a\" in `open()` is functionally identical to using \"w\"." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - You only need to use append mode if you *closed* the file descriptor to that file previously. If you have an open file descriptor, you can call `write()` multiple times; each call will append the text to the previous text. It's only when you *close* a descriptor, but then want to open up another one to the *same file*, that you'd need to switch to append mode." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's put together what we've seen by writing to a file, appending more to it, and then reading what we wrote." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "data_to_save = \"This is important data. Definitely worth saving.\\n\"\n", "with open(\"outfile.txt\", \"w\") as file_object:\n", " file_object.write(data_to_save)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "data_to_save = \"This is ALSO important data. BOTH DATA ARE IMPORTANT.\"\n", "with open(\"outfile.txt\", \"a\") as file_object:\n", " file_object.write(data_to_save)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LINE 1: This is important data. Definitely worth saving.\n", "\n", "LINE 2: This is ALSO important data. BOTH DATA ARE IMPORTANT.\n" ] } ], "source": [ "with open(\"outfile.txt\", \"r\") as file_object:\n", " contents = file_object.readlines()\n", " print(\"LINE 1: {}\".format(contents[0]))\n", " print(\"LINE 2: {}\".format(contents[1]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 2: Other file formats" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We've discussed text formats: each line of text in a file can be treated as a string in a list of strings. What else might we encounter in our data science travels?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### CSV files\n", "\n", "Easily the most common text file format is the CSV, or comma-separated values format. This is pretty much what it sounds like: if you have (semi-) structured data, you can delineate spaces between data using commas (or, to generalize, other characters like tabs)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As an example, we could represent a matrix very easily using the CSV format. The file storing a 3x3 matrix would look something like this:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n",
    "1,2,3\n",
    "4,5,6\n",
    "7,8,9\n",
    "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Each row is on one line by itself, and the columns are separated by commas." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "How do we read CSV files? Python has a core `csv` module built-in:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sunny-side up,Over easy,Scrambled\n", "Spam,Spam,More spam\n", "\n" ] } ], "source": [ "import csv\n", "\n", "with open(\"eggs.csv\", \"w\") as csv_file:\n", " file_writer = csv.writer(csv_file)\n", " row1 = [\"Sunny-side up\", \"Over easy\", \"Scrambled\"]\n", " row2 = [\"Spam\", \"Spam\", \"More spam\"]\n", " file_writer.writerow(row1)\n", " file_writer.writerow(row2)\n", " \n", "with open(\"eggs.csv\", \"r\") as csv_file:\n", " print(csv_file.read())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Notice that you first create a file reference, just like before. The one added step, though, is passing that reference to the `csv.writer()` function." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Once you've created the `file_writer` object, you can call its `writerow()` function and pass in a list to the function, and it is automatically written to the file in CSV format!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The CSV readers let you do the opposite: read a line of text from a CSV file directly into a list." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Sunny-side up', 'Over easy', 'Scrambled']\n", "['Spam', 'Spam', 'More spam']\n" ] } ], "source": [ "with open(\"eggs.csv\", \"r\") as csv_file:\n", " file_reader = csv.reader(csv_file)\n", " for csv_row in file_reader:\n", " print(csv_row)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You can use a `for` loop to iterate over the rows in the CSV file. In turn, each row is a list, where each element of the list was separated by a comma." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### JSON files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\"JSON\", short for \"JavaScript Object Notation\", has emerged as more or less the *de facto* standard format for interacting with online services. Like CSV, it's a text-based format, but is much more flexible than CSV." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here's an example: an object in JSON format that represents a person." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "person = \"\"\"\n", "{\"name\": \"Wes\",\n", "\"places_lived\": [\"United States\", \"Spain\", \"Germany\"],\n", "\"pet\": null,\n", "\"siblings\": [{\"name\": \"Scott\", \"age\": 25, \"pet\": \"Zuko\"},\n", " {\"name\": \"Katie\", \"age\": 33, \"pet\": \"Cisco\"}]\n", "}\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It looks kind of a like a Python dictionary, doesn't it? You have key-value pairs, and they can accommodate almost any data type. In fact, when JSON objects are converted into native Python data structures, they are usually represented using dictionaries." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "For reading and writing JSON objects, we can use the built-in `json` Python module." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import json" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "There are two functions of interest: `dumps()` and `loads()`. One of them takes a JSON string and converts it to a native Python object, while the other does the opposite." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "First, we'll take our JSON string and convert it into a Python dictionary:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'name': 'Wes', 'places_lived': ['United States', 'Spain', 'Germany'], 'pet': None, 'siblings': [{'name': 'Scott', 'pet': 'Zuko', 'age': 25}, {'name': 'Katie', 'pet': 'Cisco', 'age': 33}]}\n" ] } ], "source": [ "python_dict = json.loads(person)\n", "print(python_dict)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "And if you want to take a Python dictionary and convert it into a JSON string--perhaps you're about to save it to a file, or send it over the network to someone else--we can do that." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"name\": \"Wes\", \"places_lived\": [\"United States\", \"Spain\", \"Germany\"], \"pet\": null, \"siblings\": [{\"name\": \"Scott\", \"pet\": \"Zuko\", \"age\": 25}, {\"name\": \"Katie\", \"pet\": \"Cisco\", \"age\": 33}]}\n" ] } ], "source": [ "json_string = json.dumps(python_dict)\n", "print(json_string)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "At first glance, these two print-outs may look the same, but if you look closely you'll see some differences. Plus, if you tried to index `json_string[\"name\"]` you'd get some very strange errors. `python_dict[\"name\"]`, on the other hand, should nicely return `\"Wes\"`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### XML files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**AVOID AT ALL COSTS**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "...but if you have to interact with XML data (e.g., you're manually parsing a web page!), Python has a built-in `xml` library." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It's a bit beyond the scope of this course, but feel free to check it out." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Binary files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "What happens when we're not dealing with *text*? After all, images and videos are most certainly not encoded using text. Furthermore, if memory is an issue, converting text into binary formats can help save space." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "There are two primary options for reading and writing binary files." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "1. `pickle`, or \"pickling\", is native in Python and very flexible.\n", "2. NumPy's binary format, which works very well for NumPy arrays but not much else." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Pickle has some similarities with JSON. In particular, it uses the same method names, `dumps()` and `loads()`, for converting between native Python objects and the raw data format. There are several differences, however." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - Most notably, JSON is text-based whereas pickle is binary. You could open up a JSON file and read the text yourself. Not the case with pickled files." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - While JSON is used widely outside of Python, pickle is specific to Python and its objects. Consequently, JSON only works on a subset of Python data structures; pickle, on the other hand, works on just about everything." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here's an example of saving (or \"serializing\") a dictionary using pickle instead of JSON:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'\\x80\\x03}q\\x00(X\\x04\\x00\\x00\\x00nameq\\x01X\\x03\\x00\\x00\\x00Wesq\\x02X\\x0c\\x00\\x00\\x00places_livedq\\x03]q\\x04(X\\r\\x00\\x00\\x00United Statesq\\x05X\\x05\\x00\\x00\\x00Spainq\\x06X\\x07\\x00\\x00\\x00Germanyq\\x07eX\\x03\\x00\\x00\\x00petq\\x08NX\\x08\\x00\\x00\\x00siblingsq\\t]q\\n(}q\\x0b(h\\x01X\\x05\\x00\\x00\\x00Scottq\\x0ch\\x08X\\x04\\x00\\x00\\x00Zukoq\\rX\\x03\\x00\\x00\\x00ageq\\x0eK\\x19u}q\\x0f(h\\x01X\\x05\\x00\\x00\\x00Katieq\\x10h\\x08X\\x05\\x00\\x00\\x00Ciscoq\\x11h\\x0eK!ueu.'\n" ] } ], "source": [ "import pickle\n", "\n", "# We'll use the `python_dict` object from before.\n", "binary_object = pickle.dumps(python_dict)\n", "print(binary_object)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You can kinda see some English in there--mainly, the string constants. But everything else has been encoded in binary. It's much more space-efficient, but complete gibberish until you convert it back into a text format (e.g. JSON) or native Python object (e.g. dictionary)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "If, on the other hand, you're using NumPy arrays, then you can use its own built-in binary format for saving and loading your arrays." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[6 8 1]\n", " [5 5 7]\n", " [9 2 0]]\n" ] } ], "source": [ "import numpy as np\n", "\n", "# Generate some data and save it.\n", "some_data = np.random.randint(10, size = (3, 3))\n", "print(some_data)\n", "np.save(\"my_data.npy\", some_data)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Now we can load it back:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[6 8 1]\n", " [5 5 7]\n", " [9 2 0]]\n" ] } ], "source": [ "my_data = np.load(\"my_data.npy\")\n", "print(my_data)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This is by far the easiest format to work with when you're dealing exclusively with NumPy arrays; don't bother with CSV or pickling. You don't even need to set up file descriptors with the NumPy interface." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 3: Preventing errors" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This aspect of programming hasn't been very heavily emphasized--that of error handling--because for the most part, data science is about building models and performing computations so you can make inferences from your data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "...except, of course, from nearly every survey that says your average data scientist spends the *vast* majority of their time cleaning and organizing their data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![time spent](http://blogs-images.forbes.com/gilpress/files/2016/03/Time-1200x511.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Data is messy and computers are fickle. Just because that file was there yesterday doesn't mean it'll still be there tomorrow. When you're reading from and writing to files, you'll need to put in checks to make sure things are behaving the way you expect, and if they're not, that you're handling things gracefully." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We're going to become good friends with `try` and `except` whenever we're dealing with files. For example, let's say I want to read again from that *Alice in Wonderland* file I had:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "ename": "FileNotFoundError", "evalue": "[Errno 2] No such file or directory: 'alicee.txt'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"alicee.txt\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"r\"\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mfile_object\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mcontents\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfile_object\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreadlines\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcontents\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'alicee.txt'" ] } ], "source": [ "with open(\"alicee.txt\", \"r\") as file_object:\n", " contents = file_object.readlines()\n", " print(contents[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Whoops. In this example, I simply misnamed the file. In practice, maybe the file was moved; maybe it was renamed; maybe you're getting the file from the user and they incorrectly specified the name. Maybe the hard drive failed, or any number of other \"acts of God.\" Whatever the reason, **your program should be able to handle missing files.**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "You could probably code this up yourself:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sorry, the file 'alicee.txt' does not seem to exist.\n" ] } ], "source": [ "filename = \"alicee.txt\"\n", "try:\n", " with open(filename, \"r\") as file_object:\n", " contents = file_object.readlines()\n", " print(contents[0])\n", "except FileNotFoundError:\n", " print(\"Sorry, the file '{}' does not seem to exist.\".format(filename))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Pay attention to this: this will most likely show up on future assignments / exams, and you'll be expected to properly handle missing files or incorrect filenames." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Review Questions\n", "\n", "Some questions to discuss and consider:\n", "\n", "1: In Part 1 of the lecture when we read in and printed the first few lines of *Alice in Wonderland*, you'll notice each line of text has a blank line in between. Explain why, and how to fix it (so that printing each line shows text of the book the same way you'd see it in the file).\n", "\n", "2: Describe the circumstances under which append \"a\" mode and write \"w\" mode are identical.\n", "\n", "3: Dictionaries can be very complex; for a good example, just have a look at how big a dictionary representation of a single Tweet is [https://dev.twitter.com/overview/api/tweets](https://dev.twitter.com/overview/api/tweets): there's `\"created_at\"`, which is a string indicating the time the tweet was created; `\"contributors\"`, which is a dictionary unto itself identifying users participating in a thread; `\"entities\"`, a dictionary of lists that includes hashtags and URLs in the tweet; and `\"user\"`, which is another gargantuan dictionary containing all the information about the author of the tweet. What would be a good format to store these tweets in on the hard drive? What if we were sending these tweets somewhere, such as a smartphone app; would we use a different format? Explain." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Course Administrivia\n", "\n", "**How was the midterm?** Please feel free to discuss on the Slack chat!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We're pretty much on the regular schedule for the remainder of the semester: Mon / Wed / Fri lectures, and Tues / Thurs assignments." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "...with one wrinkle: as announced in L11, I'm holding bi-weekly review sessions, held via Google Hangouts. The next one will be **Friday, July 15 at 12pm EDT.** Plan to be there if you have any questions!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Additional Resources\n", "\n", " 1. Matthes, Eric. *Python Crash Course*, Chapter 10. 2016. ISBN-13: 978-1593276034\n", " 2. McKinney, Wes. *Python for Data Analysis*, Chapter 6. 2013. ISBN-13: 978-1449319793" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }