{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Loops and Files" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/ElementsOfDataScience/blob/master/04_loops.ipynb) or\n", "[click here to download it](https://github.com/AllenDowney/ElementsOfDataScience/raw/master/04_loops.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This chapter presents loops, which are used to represent repeated computation, and files, which are used to store data. As an example, we will download the famous book *War and Peace* from Project Gutenberg and write a loop that reads the book and counts the words.\n", "This example presents some new computational tools; it is also an introduction to working with textual data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loops\n", "\n", "One of the most important elements of computation is repetition, and the most common way to represent repetition is a `for` loop.\n", "As a simple example, suppose we want to display the elements of a tuple. Here's a tuple of three integers:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "t = 1, 2, 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's a `for` loop that prints the elements." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "for x in t:\n", "    print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first line of the loop is a **header** that specifies the tuple, `t`, and a variable name, `x`. The tuple already exists, but `x` does not; the loop will create it. Note that the header ends with a colon, `:`.\n", "\n", "Inside the loop is a `print` statement, which displays the value of `x`.\n", "\n", "So here's what happens:\n", "\n", "1. 
When the loop starts, it gets the first element of `t`, which is `1`, and assigns it to `x`. It executes the `print` statement, which displays the value `1`.\n", "\n", "2. Then it gets the second element of `t`, which is `2`, and displays it.\n", "\n", "3. Then it gets the third element of `t`, which is `3`, and displays it.\n", "\n", "After printing the last element of the tuple, the loop ends.\n", "\n", "We can also loop through the letters in a string:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "word = 'Data'\n", "\n", "for letter in word:\n", "    print(letter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the loop begins, `word` already exists, but `letter` does not. Again, the loop creates `letter` and assigns values to it.\n", "\n", "The variable created by the loop is called the **loop variable**. You can give it any name you like; in this example, I chose `letter` to remind me what kind of value it contains.\n", "\n", "After the loop ends, the loop variable contains the last value." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "letter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Create a list called `sequence` with four elements of any type. Write a `for` loop that prints the elements. Call the loop variable `element`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might wonder why I didn't call the list `list`. I avoided it because Python has a function named `list` that makes new lists. 
For example, if you have a string, you can make a list of letters, like this:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "list('string')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you create a variable named `list`, you can't use the function any more." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Counting with Loops\n", "\n", "*War and Peace* is a famously long book; let's see how long it is. To count the words we need two elements: looping through the words in a text, and counting.\n", "We'll start with counting.\n", "\n", "We've already seen that you can create a variable and give it a value, like this:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "count = 0\n", "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you assign a different value to the same variable, the new value replaces the old one." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "count = 1\n", "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can increase the value of a variable by reading the old value, adding `1`, and assigning the result back to the original variable." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "count = count + 1\n", "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Increasing the value of a variable is called **incrementing**; decreasing the value is called **decrementing**. These operations are so common that there are special operators for them." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "count += 1\n", "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, the `+=` operator reads the value of `count`, adds `1`, and assigns the result back to `count`.\n", "Python also provides `-=` and other update operators like `*=` and `/=`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** The following is a number trick from *Learn With Math Games*:\n", "\n", "> *Finding Someone's Age*\n", ">\n", "> * Ask the person to multiply the first number of their age by 5.\n", ">\n", "> * Tell them to add 3.\n", ">\n", "> * Now tell them to double this figure.\n", ">\n", "> * Finally, have the person add the second number of their age to the figure and have them tell you the answer.\n", ">\n", "> * Deduct 6 and you will have their age.\n", "\n", "Test this algorithm using your age. Use a single variable and update it using `+=` and other update operators." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Files\n", "\n", "Now that we know how to count, let's see how we can read words from a file.\n", "We can download *War and Peace* from Project Gutenberg, which is a repository of free books." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "When you run the following cell, it checks to see whether you already have a file named `2600-0.txt`, which is the name of the file that contains the text of *War and Peace*.\n", "If not, it copies the file from Project Gutenberg to your computer. 
" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [] }, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "def download(url):\n", "    filename = basename(url)\n", "    if not exists(filename):\n", "        from urllib.request import urlretrieve\n", "        local, _ = urlretrieve(url, filename)\n", "        print('Downloaded ' + local)\n", "\n", "download('https://www.gutenberg.org/files/2600/2600-0.txt')" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "If you are running this notebook on Colab, it will copy the file to a \"virtual file system\" on Colab, which means it will disappear if you leave the notebook idle for a while. In that case, you can download it again later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to read the contents of the file, you have to **open** it, which you can do with the `open` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# On Windows, it might be necessary to specify the encoding \n", "\n", "# fp = open('2600-0.txt', encoding='utf-8')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "fp = open('2600-0.txt')\n", "fp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a `TextIOWrapper`, which is a type of **file pointer**.\n", "It contains the name of the file, the mode (which is `r` for \"reading\") and the encoding (which is `UTF` for \"Unicode Transformation Format\").\n", "A file pointer is like a bookmark; it keeps track of which parts of the file you have read.\n", "\n", "If you use a file pointer in a `for` loop, it loops through the lines in the file.\n", "So we can count the number of lines like this:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "fp = open('2600-0.txt')\n", "count = 0\n", "for line in fp:\n", "    count += 1" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "And then display the result." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are about 66,000 lines in this file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## if Statements\n", "\n", "We've already seen comparison operators, like `>` and `<`, which compare values and produce a Boolean result, `True` or `False`.\n", "For example, we can compare the final value of `count` to a number:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "count > 60000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use a comparison operator in an `if` statement to check for a condition and take action accordingly." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "if count > 60000:\n", "    print('Long book!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first line of the `if` statement specifies the condition we're checking for. Like the header of a `for` statement, the first line of an `if` statement has to end with a colon.\n", "\n", "If the condition is true, the indented statement runs; otherwise, it doesn't.\n", "In the previous example, the condition is true, so the `print` statement runs.\n", "In the following example, the condition is false, so the `print` statement doesn't run." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "if count < 1000:\n", "    print('Short book!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can put an `if` statement inside a `for` loop. In this example, we only print a line from the book when `count` is `1`.\n", "The other lines are read, but not displayed." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "fp = open('2600-0.txt')\n", "count = 0\n", "for line in fp:\n", "    if count == 1:\n", "        print(line)\n", "    count += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the indentation in this example:\n", "\n", "* Statements inside the `for` loop are indented.\n", "\n", "* The statement inside the `if` statement is indented.\n", "\n", "* The statement `count += 1` is **outdented** from the previous line, so it ends the `if` statement. But it is still inside the `for` loop.\n", "\n", "It is legal in Python to use spaces or tabs for indentation, but the most common convention is to use four spaces, never tabs. That's what I'll do in my code and I strongly suggest you follow the convention." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The break Statement\n", "\n", "If we display the final value of `count`, we see that the loop reads the entire file, but only prints one line:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can avoid reading the whole file by using a `break` statement, like this:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "fp = open('2600-0.txt')\n", "count = 0\n", "for line in fp:\n", "    if count == 1:\n", "        print(line)\n", "        break\n", "    count += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `break` statement ends the loop immediately, skipping the rest of the file. We can confirm that by checking the last value of `count`: " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Write a loop that prints the first 5 lines of the file and then breaks out of the loop." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Whitespace\n", "\n", "If we run the loop again and display the final value of `line`, we see the special sequence `\\n` at the end." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "fp = open('2600-0.txt')\n", "count = 0\n", "for line in fp:\n", "    if count == 1:\n", "        break\n", "    count += 1\n", "\n", "line" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This sequence represents a single character, called a **newline**, that puts vertical space between lines.\n", "If we use a `print` statement to display `line`, we don't see the special sequence, but we do see extra space after the line." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "print(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In other strings, you might see the sequence `\\t`, which represents a \"tab\" character.\n", "When you print a tab character, it adds enough space to make the next character appear in a column that is a multiple of 8." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "print('01234567' * 6)\n", "print('a\\tbc\\tdef\\tghij\\tklmno\\tpqrstu')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Newline characters, tabs, and spaces are called **whitespace** because when they are printed they leave white space on the page (assuming that the background color is white)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Counting Words\n", "\n", "So far we've managed to count the lines in a file, but each line contains several words.\n", "To split a line into words, we can use a function called `split` that returns a list of words.\n", "To be more precise, `split` doesn't actually know what a word is; it just splits the line wherever there's a space or other whitespace character." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "line.split()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the syntax for `split` is different from other functions we have seen. Normally when we call a function, we name the function and provide values in parentheses. So you might have expected to write `split(line)`. Sadly, that doesn't work." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Run this line of code in the next cell to see what happens.\n", "\n", "```\n", "split(line)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem is that the `split` function belongs to the string `line`; in a sense, the function is attached to the string, so we can only refer to it using the string and the **dot operator** (the period between `line` and `split`).\n", "For historical reasons, functions like this are called **methods**.\n", "\n", "Now that we can split a line into a list of words, we can use `len` to get the number of words in each list, and increment `count` accordingly." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "fp = open('2600-0.txt')\n", "count = 0\n", "for line in fp:\n", "    count += len(line.split())" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By this count, there are more than half a million words in *War and Peace*.\n", "\n", "Actually, there aren't quite that many, because the file we got from Project Gutenberg has some introductory text and a table of contents before the text. And it has some license information at the end.\n", "To skip this \"front matter\", we can use one loop to read lines until we get to `CHAPTER I`, and then a second loop to count the words in the remaining lines.\n", "\n", "The file pointer, `fp`, keeps track of where it is in the file, so the second loop picks up where the first loop leaves off.\n", "In the second loop, we check for the end of the book and stop, so we ignore the \"back matter\" at the end of the file." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "first_line = \"CHAPTER I\\n\"\n", "last_line = \"End of the Project Gutenberg EBook of War and Peace, by Leo Tolstoy\\n\"\n", "\n", "fp = open('2600-0.txt')\n", "for line in fp:\n", "    if line == first_line:\n", "        break\n", "\n", "count = 0\n", "for line in fp:\n", "    if line == last_line:\n", "        print(line)\n", "        break\n", "    count += len(line.split())" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two things to notice about this program:\n", "\n", "* When we compare two values to see if they are equal, we use the `==` operator, not to be confused with `=`, which is the assignment operator.\n", "\n", "* The string we compare `line` to has a newline at the end. If we leave that out, it doesn't work." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** \n", "\n", "1. In the previous program, replace `==` with `=` and see what happens. This is a common error, so it is good to see what the error message looks like. \n", "\n", "2. Correct the previous error, then remove the newline character after `CHAPTER I`, and see what happens.\n", "\n", "The first error is a **syntax error**, which means that the program violates the rules of Python. If your program has a syntax error, the Python interpreter prints an error message, and the program never runs.\n", "\n", "The second error is a **logic error**, which means that there is something wrong with the logic of the program. The syntax is legal, and the program runs, but it doesn't do what we wanted. Logic errors can be hard to find because we don't get any error messages. \n", "\n", "If you have a logic error, here are two strategies for debugging:\n", "\n", "1. Add print statements so the program displays additional information while it runs.\n", "\n", "2. Simplify the program until it does what you expect, and then gradually add more code, testing as you go." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "This chapter presents loops, `if` statements, and the `break` statement.\n", "It also introduces tools for working with letters and words, and a simple kind of textual analysis, word counting.\n", "\n", "In the next chapter we'll continue this example, counting the number of unique words in a text and the number of times each word appears.\n", "And we'll see one more way to represent a collection of values, a Python dictionary." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "*Elements of Data Science*\n", "\n", "Copyright 2021 [Allen B. 
Downey](https://allendowney.com)\n", "\n", "License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.16" } }, "nbformat": 4, "nbformat_minor": 2 }