{ "cells": [ { "cell_type": "markdown", "id": "42be6f0b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Reading and writing files\n", "\n", "In most data science related scripts and analysis workflows, data will enter via files. To be more precise: via text files. \n", "Fortunately, reading from file is really simple in Python. \n", "Unfortunately, you will need to knwo something about file paths in Linux, Mac and Window operating systems, so that is were we'll start. " ] }, { "cell_type": "markdown", "id": "0aabdd44", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## File paths\n" ] }, { "cell_type": "markdown", "id": "d40b3174", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Path separators\n", "\n", "There is a big distinction in file paths on Windows versus Linus (and MacOS): on Windows, file paths start with the drive (e.g. `C:`) and the directory (folder) separator symbol is a backslash `\\`. On unix-like systems (Linux, MacOS), all paths start at the root: `/` and separator are also the forward slash.\n", "If you want to work with paths in your programs, you should therefore never use a literal character for these separators, but ask the OS to provide it for you:" ] }, { "cell_type": "code", "execution_count": 1, "id": "5bceeb7e", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/\n", "/users/Michiel/projects/python\n" ] } ], "source": [ "import os.path\n", "print(os.path.sep)\n", "\n", "folders = [os.path.sep, 'users', 'Michiel', 'projects', 'python']\n", "print(os.path.join(*folders))\n" ] }, { "cell_type": "markdown", "id": "7b91f48e", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The `os.path.join()` function uses the separator as defined in `os.path.sep`. \n", "\n", "Note that when you type literal backslash characters, as in Windows paths, they need to be escaped because they give special meaning to the character that comes after. This gives an error:\n", "\n", "```python\n", "windows_path = 'C:\\Users\\Michiel\\Projects\\Python\\'\n", "```\n", "```\n", " Cell In[7], line 1\n", " windows_path = 'C:\\Users\\Michiel\\Projects\\Python\\'\n", " ^\n", "SyntaxError: EOL while scanning string literal\n", "```\n", "\n", "Below is the correct way to deal with that. " ] }, { "cell_type": "code", "execution_count": 2, "id": "88e14f63", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'C:\\\\Users\\\\Michiel\\\\Projects\\\\Python\\\\'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "windows_path = 'C:\\\\Users\\\\Michiel\\\\Projects\\\\Python\\\\'\n", "windows_path" ] }, { "cell_type": "markdown", "id": "25af178c", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Note that it is always better to use `os.path.sep`, above, or by always using the forward slash (works most most of the time), or like this:" ] }, { "cell_type": "code", "execution_count": 3, "id": "97ff5155", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'C:/Users/Michiel/Projects/Python'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "os.path.sep.join(['C:', 'Users', 'Michiel', 'Projects', 'Python'])" ] }, { "cell_type": "markdown", "id": "414a44ad", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Absolute and relative paths" ] }, { "cell_type": "markdown", "id": "932ead8b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "File system locations can be defined in two ways: absolute and relative. \n", "The above examples were all absolute because the paths were all defined starting at the **root** of the file system, which is a harddisk drive designation (Windows) or a path starting with a forward slash (Linux-like). \n" ] }, { "cell_type": "markdown", "id": "058bb880", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Relative paths, on the other hand, specify locations relative to the **current working directory**. This is either the location of the running piece of code (script, notebook), or the location you have specified using\n", "\n", "```python\n", "os.chdir('/users/michiel/projects/project1')\n", "```\n", "\n", "and which you can verify with\n", "\n", "```python\n", "os.getcwd()\n", "```\n" ] }, { "cell_type": "markdown", "id": "1b0e56b6", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Suppose I have a directory structure that looks like this.\n", "\n", "```\n", "users\n", " /michiel\n", " /projects\n", " /project1\n", " /script1.py\n", " /data\n", " /users.txt\n", " /project2\n", " /birds.csv\n", "\n", "```\n", "\n", "and my current working directory is `/users/michiel/python/project1` (because you are working on `script1.py`). " ] }, { "cell_type": "markdown", "id": "db892e48", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The different ways to specify the location of the data file `users.txt` are\n", "\n", "* **absolute**: `/users/michiel/projects/project1/data/users.txt`\n", "* **relative to current dir**: `data/users.txt`\n", "* **relative to current dir**: `./data/users.txt`\n", "\n", "The `./` makes it explicit that the start is the current directory." ] }, { "cell_type": "markdown", "id": "1e8fa6bc", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The different ways to specify the location of the data file `birds.csv` are\n", "\n", "* **absolute**: `/users/michiel/projects/project2/birds.csv`\n", "* **relative to current dir**: `../project2/birds.csv`\n", "\n", "The double dot `..` says \"go up one directory and work from there." ] }, { "cell_type": "markdown", "id": "6a4c6f3a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Linux users may know about the tilde `~` that specifies the current users' home folder: `~/projects/project2/birds.csv`. This does not work in python. Instead, you can use `os.path.expanduser('~')` to plug it into your file location:" ] }, { "cell_type": "code", "execution_count": 4, "id": "59da3b44", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('/Users/michielnoback/projects/project2/birds.csv',)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "os.path.join(os.path.expanduser('~'), 'projects', 'project2', 'birds.csv'), " ] }, { "cell_type": "markdown", "id": "5a0a889d", "metadata": {}, "source": [ "A last note on file and folder names. Although a wide range of characters is allowed in file paths, when working in datascience it is highly advisable to refrain from using \"funny\" characters in file paths. They cause errors, and are often hard to type in a string variable. So instead of this:\n", "\n", ">`/homes/michiel/projects/het 'tisser (best) koud! 😀`\n", "\n", "use this: \n", "\n", ">`/homes/michiel/projects/het_is_er_koud`" ] }, { "cell_type": "markdown", "id": "ae957cf9", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Reading from file" ] }, { "cell_type": "markdown", "id": "16514a2b", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "And if you have structured data in the form of csv, tsv, xml or Excel, the Python ecosystem prvides a wealth of dedicated data reading functions. If you are going to work with excel-style data (data organized in rows with examples and variables in columns) a lot, it is recommended to have a look at the Pandas library (we'll have a peek at that at the end of this chapter).\n", "In this chapter however we are going to check out the basics of file reading and writing, I/O in short.\n" ] }, { "cell_type": "markdown", "id": "ab67a893", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Suppose we have some data file named `lengths.csv` which contains the body lengths (in centimeters) of a sample of male and female subjects:\n", "\n", "```\n", "1,m,180\n", "2,m,188\n", "3,f,178\n", "4,f,182\n", "5,f,172\n", "6,m,189\n", "```\n", "\n", "This file in *csv format* (for Comma-Separated Values) can be found [here](./data/lengths.csv) (at `./data/lengths.csv`)" ] }, { "cell_type": "markdown", "id": "abf795b9", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "To read this data in the simplest way possible, we can read its contents in one operation:" ] }, { "cell_type": "code", "execution_count": 6, "id": "9ee56d94", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/michielnoback/git_projects/python_intro\n", "1,m,180\n", "2,m,188\n", "3,f,178\n", "4,f,182\n", "5,f,172\n", "6,m,189\n", "\n", "\n" ] } ], "source": [ "print(os.getcwd())\n", "file = open(\"data/lengths.csv\", \"r\")\n", "data = file.read()\n", "print(data)\n", "print(type(data))" ] }, { "cell_type": "markdown", "id": "77249a6b", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The statement \n", "```python\n", "file = open(\"data/lengths.csv\", \"r\")\n", "```\n", "opens the file in read mode (the second argument is the `mode` argument which defaults to `'r'`, so it could have been omitted). The functions returns a stream, or handle on the file. **Not the actual contents yet**. \n", "\n", "Reading the contents happens with the `file.read()` function call." ] }, { "cell_type": "markdown", "id": "3bbd4378", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Iterating contents\n", "\n", "Usually you want to iterate over file contents line by line without the need to store it all in memory as-is. This is done by applying the for-loop on the file stream:" ] }, { "cell_type": "code", "execution_count": null, "id": "426c0db5", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "file = open(\"data/lengths.csv\", \"r\")\n", "for line in file:\n", " print(line.strip().split(',')) # of course you want to split the data to separate values" ] }, { "cell_type": "markdown", "id": "8e2bc95c", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The file stream object returned by the `open()` function supports iteration. Note that line endings are data in the file and are included when reading the lines. To remove any leading and trailing whitespaces we use the `strip()` function. \n", "To only remove whitespace characters at the end, use `rstrip()` with an optional argument specifying which characters to strip off." ] }, { "cell_type": "markdown", "id": "845e24c0", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Closing files\n", "It is good custom to close streams to files that you open. In read mode this is not essential, but in write mode it is. You do this using the `close()` method. The above fragment is better like this:" ] }, { "cell_type": "code", "execution_count": null, "id": "fdd15413", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "file = open(\"data/lengths.csv\", \"r\")\n", "for line in file:\n", " print(line.strip())\n", "file.close() # explicitly closing resources is always a good idea" ] }, { "cell_type": "markdown", "id": "72c924a1", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The best way: using `with`\n", "\n", "Since programmers forgot to close their files all the time, the \"with open\" syntax was introduced. If you simply always use this form you will never go wrong." ] }, { "cell_type": "code", "execution_count": null, "id": "d7f6d2ac", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "with open(\"data/lengths.csv\", \"r\") as file:\n", " for line in file:\n", " print(line.strip())\n", "# no need to close since that is assured by using with" ] }, { "cell_type": "markdown", "id": "0c937b57", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Writing to file\n", "\n", "To open a stream for writing you need to set the mode to one of these: \n", "\n", "- \"w\" (open for writing, truncating the file first) \n", "- \"a\" (open for writing, appending to the end of the file if it exists).\n", "\n", "When writing to file, using the `with` syntax is the best way." ] }, { "cell_type": "code", "execution_count": null, "id": "d4366959", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "my_data = [\"Better safe\\n\", \"then sorry\\n\"] #note the newlines already present!\n", "\n", "with open(\"data/saying.txt\", \"w\") as sayings:\n", " for l in my_data:\n", " sayings.write(l)\n", "\n", "#or, in one operatition\n", "#with open(\"data/saying.txt\", \"w\") as sayings:\n", "# sayings.writelines(my_data)" ] }, { "cell_type": "markdown", "id": "67e0db76", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Both operations will result in a file with these contents:\n", "\n", "```\n", "Better safe\n", "then sorry\n", "```\n", "\n", "And no matter how often the code is run, the same file will be created.\n", "If the mode `\"a\"` had been used, the saying would be added to the file every time the code was run." ] } ], "metadata": { "celltoolbar": "Diavoorstelling", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }