{ "cells": [ { "cell_type": "markdown", "id": "66d720a1", "metadata": {}, "source": [ "# Python input/output\n", "\n", "The definitive source on input/output in Python is the [Python documentation](https://docs.python.org/3/tutorial/inputoutput.html), which you can always look at to find details.\n", "\n", "--------\n", "\n", "## Opening files\n", "\n", "Often we'll need to interact with data files in the course of our work. In many cases, we can use a pre-existing function to read the file and parse the data for us. For instance, we can use pandas to read a CSV file:" ] }, { "cell_type": "code", "execution_count": null, "id": "0bdd40bf", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "pd.read_csv('2015_trip_data.csv').head()" ] }, { "cell_type": "markdown", "id": "0906b3d7", "metadata": {}, "source": [ "However, sometimes data files are not CSVs but are instead some custom format that we have to process ourselves. In these cases, we'll need to access a file from Python directly.\n", "\n", "The first step in working with a file is to \"open\" it (just like you would open a file in a text editor or preview application). We use the built-in `open` function for that:" ] }, { "cell_type": "code", "execution_count": null, "id": "7f928e56", "metadata": {}, "outputs": [], "source": [ "file = open('2015_trip_data.csv')" ] }, { "cell_type": "markdown", "id": "819a672f", "metadata": {}, "source": [ "After you're done with the file, it's important to close the file (just like you would close a text editor or preview application):" ] }, { "cell_type": "code", "execution_count": null, "id": "2311f936", "metadata": {}, "outputs": [], "source": [ "file.close()" ] }, { "cell_type": "markdown", "id": "4edf79c7", "metadata": {}, "source": [ "If you don't close the file, bad things could happen. For instance, if you're writing something to the file, the content may not get flushed (ie fully saved) if you don't close the file before you terminate your program.\n", "\n", "The open/close lifecycle of working with files is so common that Python has a special \"with\" construct to make it easier. This is usually the way I recommend working with files. Python will automatically close the file after it finishes the `with` block:" ] }, { "cell_type": "code", "execution_count": null, "id": "e7a0d54f", "metadata": {}, "outputs": [], "source": [ "with open('2015_trip_data.csv') as file:\n", " # do something with file\n", " pass\n", "\n", "# file is automatically closed after the with block" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1f21cccd", "metadata": {}, "source": [ "--------\n", "\n", "## Reading files\n", "\n", "You can do a lot of things with file objects. You can explore those with tab completion (or in the [Python documentation](https://docs.python.org/3/library/io.html#i-o-base-classes)):" ] }, { "cell_type": "code", "execution_count": null, "id": "eb33b890", "metadata": {}, "outputs": [], "source": [ "with open('2015_trip_data.csv') as file:\n", " # code here\n", " \n", " pass" ] }, { "cell_type": "markdown", "id": "707521b1", "metadata": {}, "source": [ "For reading files, there are a number of helpful reading functions, like `readline`, `readlines`, or the less-likely `read`. To read all the lines of the file into a variable:" ] }, { "cell_type": "code", "execution_count": null, "id": "9644f432", "metadata": {}, "outputs": [], "source": [ "with open('2015_trip_data.csv') as file:\n", " # code here\n", " \n", " pass" ] }, { "cell_type": "markdown", "id": "69755a0a", "metadata": {}, "source": [ "What happens if you try to read the lines again? What does this mean about how `readlines` works?\n", "\n", "How can we \"reset\" the file to read everything again?" ] }, { "cell_type": "markdown", "id": "cb7f8eb1", "metadata": {}, "source": [ "To read line by line, you can use `readline`:" ] }, { "cell_type": "code", "execution_count": null, "id": "145d1a08", "metadata": {}, "outputs": [], "source": [ "with open('2015_trip_data.csv') as file:\n", " # code here\n", " pass" ] }, { "cell_type": "markdown", "id": "21a7232a", "metadata": {}, "source": [ "Python has a slightly easier way to read text files line by line (automatically calling `readline`):" ] }, { "cell_type": "code", "execution_count": null, "id": "0ba3464e", "metadata": {}, "outputs": [], "source": [ "with open('2015_trip_data.csv') as file:\n", " # code here\n", " pass" ] }, { "cell_type": "markdown", "id": "ad908e27", "metadata": {}, "source": [ "--------\n", "\n", "## Writing files\n", "\n", "We can also edit or write to files with Python. You might start by trying what we did before, but using `writelines` instead of `readlines` to write some strings as lines of text:" ] }, { "cell_type": "code", "execution_count": null, "id": "8d8873e1", "metadata": {}, "outputs": [], "source": [ "with open('test.txt') as file:\n", " file.writelines(['first line', 'second_line'])" ] }, { "cell_type": "markdown", "id": "db3eb645", "metadata": {}, "source": [ "What went wrong?\n", "\n", "We need to tell Python that we're writing to this file instead of reading from it (reading, or `r` mode, is the default). We can set the mode to \"write\":" ] }, { "cell_type": "code", "execution_count": null, "id": "d1a1f02e", "metadata": {}, "outputs": [], "source": [ "# Open in write mode\n", "with open('test.txt', 'w') as file:\n", " file.writelines(['first line', 'second line'])" ] }, { "cell_type": "markdown", "id": "5fef1cec", "metadata": {}, "source": [ "Find the file in the Jupyter file viewer and open it up.\n", "\n", "Now wait a minute! It used `writelines` but it didn't actually write separate lines! This is a silly thing in Python - you can read in the [documentation](https://docs.python.org/3/library/io.html#io.IOBase.writelines) that `writelines` does not add line separators. We have to add those manually:" ] }, { "cell_type": "code", "execution_count": null, "id": "51f6e5ad", "metadata": {}, "outputs": [], "source": [ "with open('test.txt', 'w') as file:\n", " # Use line separators\n", " file.writelines(['first line\\n', 'second line\\n'])" ] }, { "cell_type": "markdown", "id": "e9122050", "metadata": {}, "source": [ "Notice that we *overwrote* the previous version of the file - when you open in write mode, you're going to replace the file contents with whatever you write. If instead you want to append content (ie add lines), you can open the file in \"append\" mode:" ] }, { "cell_type": "code", "execution_count": null, "id": "f453a9b2", "metadata": {}, "outputs": [], "source": [ "# Open in append mode\n", "with open('test.txt', 'a') as file:\n", " file.writelines(['third line\\n', 'fourth line\\n'])" ] }, { "cell_type": "markdown", "id": "44dd4169", "metadata": {}, "source": [ "You can also write an individual string:" ] }, { "cell_type": "code", "execution_count": null, "id": "f7045d89", "metadata": {}, "outputs": [], "source": [ "with open('test.txt', 'a') as file:\n", " # Write one string\n", " file.write('fifth line\\n')" ] }, { "cell_type": "markdown", "id": "9bb35177", "metadata": {}, "source": [ "You can also use the plain old `print` function by also providing a file as an argument to the `print` function:" ] }, { "cell_type": "code", "execution_count": null, "id": "95bc5116", "metadata": {}, "outputs": [], "source": [ "with open('test.txt', 'a') as test_file:\n", " # Use the print built-in\n", " print('sixth line', file=test_file)\n", " print('seventh line', file=test_file)" ] }, { "cell_type": "markdown", "id": "cdaa102e", "metadata": {}, "source": [ "Note that `print` DOES automatically include a newline character at the end of the printed content.\n", "\n", "### Formatting\n", "\n", "One thing that you'll often want to do is embed variables in what you print out. You can do this by writing each part of the line individually, adding variables where you need them:" ] }, { "cell_type": "code", "execution_count": null, "id": "cc79d0e9", "metadata": {}, "outputs": [], "source": [ "secret = 'abracadabra'\n", "\n", "with open('test.txt', 'w') as file:\n", " file.write('The secret password is ')\n", " file.write(secret)\n", " file.write('!\\n')" ] }, { "cell_type": "markdown", "id": "ce43ab6e", "metadata": {}, "source": [ "However, Python supports \"format strings\" which allow you to embed a variable inside of a string. This makes things much more readable!" ] }, { "cell_type": "code", "execution_count": null, "id": "695ecfaf", "metadata": {}, "outputs": [], "source": [ "secret = 'abracadabra'\n", "\n", "with open('test.txt', 'w') as file:\n", " # Precede the string with the character \"f\", and embed variables with curly braces {}\n", " file.write(f'The secret password is {secret}!')" ] }, { "cell_type": "markdown", "id": "5cc68097", "metadata": {}, "source": [ "We call these \"f-strings\" (short for \"formatted strings\").\n", "\n", "There's a lot more formatting magic you can do with f-strings. Check out the full Python [documentation](https://docs.python.org/3/tutorial/inputoutput.html#fancier-output-formatting) if you need to figure out how to format something in a particular way." ] }, { "cell_type": "markdown", "id": "08c895b2-1164-432f-b280-b64712a21587", "metadata": {}, "source": [ "--------\n", "\n", "## A note on Windows paths\n", "\n", "Throughout the examples in the course so far, if there are file paths that are more than just the file name (ie they include one or more directories), I've written them with a forward slash `/` to separate the directories/files, which is standard in Unix systems like Linux and MacOS. For example:\n", "\n", "```\n", "foo/bar/baz.py\n", "```\n", "\n", "However, if you on a Windows machine, you may be more familiar with seeing paths separated by backslashes `\\`:\n", "\n", "```\n", "foo\\bar\\baz.py\n", "```\n", "\n", "Windows WILL however accept the former forward-slash relative path; Unix will not accept the back-slashed version.\n", "\n", "What does this mean for you if you are on a Windows computer? In order to ensure maximum interoperability with Unix systems, you should provide paths in your programs with forward slashes as much as possible. If you DO need to do path operations, for instance splitting a path name into its pieces `foo`, `bar`, and `baz`, then instead of doing this:\n", "\n", "```python\n", "# If you have a path named `path` separated by backslashes (on your system),\n", "# DO NOT DO THIS to find the file name\n", "file_name = path.split(\"\\\\\")[-1]\n", "```\n", "\n", "You should instead do this, using `os.sep` to give you the default path separator for your system:\n", "\n", "```python\n", "# This is better and usable on any system! Use the built-in path separator from the os module\n", "import os\n", "file_name = path.split(os.sep)[-1]\n", "```\n", "\n", "For even better path manipulation functionality, check out the `os.path` module: https://docs.python.org/3/library/os.path.html" ] }, { "cell_type": "markdown", "id": "07c8b859", "metadata": {}, "source": [ "--------\n", "\n", "## User input\n", "\n", "One way to build an interactive program - one that involves the user in the input and output - is by prompting them for information, which you can then use to do something with.\n", "\n", "The way to do this is with the built-in `input` function. You provide a prompt as an argument:" ] }, { "cell_type": "code", "execution_count": null, "id": "a1d47c1f", "metadata": {}, "outputs": [], "source": [ "name = input('What is your name? ')\n", "\n", "print(f'Hello, {name}!')" ] }, { "cell_type": "markdown", "id": "b0c560a8", "metadata": {}, "source": [ "You can read more about the `input` function in the Python documentation [here](https://docs.python.org/3/library/functions.html#input)." ] }, { "attachments": {}, "cell_type": "markdown", "id": "b12696bb", "metadata": {}, "source": [ "--------\n", "\n", "## Exercise\n", "\n", "1. In a Python module, write a function called `read_csv` to read a CSV file (manually - without using pandas or the `csv` module).\n", "\n", " The function should take a file name as input and should return a list of dictionaries, with each dictionary containing the `column:value` data for a particular row. You should assume that the file contains a header row with the column names.\n", "\n", " For instance, if the CSV file contains these rows:\n", "\n", " ```\n", " id,name,age\n", " 1,sarah,30\n", " 2,steve,42\n", " ```\n", "\n", " Then the function should return the following list.:\n", "\n", " ```python\n", " [\n", " {id: \"1\", name: \"sarah\", age: \"30\"},\n", " {id: \"2\", name: \"steve\", age: \"42\"},\n", " ]\n", " ```\n", "\n", "2. In the same module, write a function called `write_csv` that takes the same list format from `read_csv`, a list of column names, and a file name, and writes the subset of columns to a file.\n", "\n", " For instance, if you call `write_csv(rows, [\"id\", \"age\"], \"output.csv\")` with the list from part 1, then the file `output.csv` should contain the following contents:\n", "\n", " ```\n", " id,age\n", " 1,30\n", " 2,42\n", " ```\n", "\n", " This is a way we might anonymize data.\n", "\n", "\n", "3. Write a main function that runs the `read_csv` function on the 2015_trip_data CSV file that we downloaded from https://s3.amazonaws.com/pronto-data/open_data_year_one.zip. Then write a new CSV without the personal columns `usertype`, `gender`, and `birthyear`.\n", "\n", "\n", "### Bonus exercises\n", "\n", "* Have the main function take the name of the csv file to read, the name of the csv file to output, and the columns to emit, as user input on the command line.\n", "* Make `read_csv` take another argument - a boolean `has_headers` that indicates whether the csv has a header row or not. If `has_headers` is true, the function should behave exactly as it did before. If `has_headers` is false, use numbers for the column names instead, ie `0`, `1`, `2`, etc." ] }, { "cell_type": "markdown", "id": "17739a56", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 5 }