{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lists and arrays\n", "\n", "Elements of Data Science\n", "\n", "by [Allen Downey](https://allendowney.com)\n", "\n", "[MIT License](https://opensource.org/licenses/MIT)\n", "\n", "### Goals\n", "\n", "In the previous notebook we used tuples to represent latitude and longitude. In this notebook, you'll see how to use tuples more generally to represent a sequence of values. And we'll see two more ways to represent sequences: lists and arrays.\n", "\n", "You might wonder why we need three ways to represent the same thing. Most of the time you don't, but each of them has different capabilities. For work with data, we will use arrays most of the time.\n", "\n", "As an example, we will use a small dataset from an article in *The Economist* about the price of sandwiches. It's a silly example, but we'll use it to discuss the idea of absolute and relative differences, and different ways to summarize a dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tuples\n", "\n", "A tuple is a sequence of elements. When we use a tuple to represent latitude and longitude, the sequence only contains two elements, and they are both floating-point numbers.\n", "\n", "But in general a tuple can contain any number of elements, and the elements can be values of any type.\n", "\n", "The following is a tuple of three integers:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "1, 2, 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that when Python displays a tuple, it puts the elements in parentheses.\n", "\n", "When you type a tuple, you can put it in parentheses if you think it is easier to read that way, but you don't have to." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "(1, 2, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The elements can be any type. 
Here's a tuple of strings:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "'Data', 'Science'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The elements don't have to be the same type. Here's a tuple with a string, an integer, and a floating-point number." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "'one', 2, 3.14159 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a string, you can convert it to a tuple using the `tuple` function:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "tuple('DataScience')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a tuple of single-character strings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you create a tuple, the parentheses are optional, but the commas are required. So how do you think you create a tuple with a single element? You might be tempted to write:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "x = (5)\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But you will find that the result is just a number, not a tuple." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "type(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make a tuple with a single element, you need a comma:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "t = 5,\n", "t" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "type(t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lists\n", "\n", "Python provides another way to store a sequence of elements: a list.\n", "\n", "To create a list, you put a sequence of elements in square brackets." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "[1, 2, 3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lists and tuples are very similar. They can contain any number of elements, the elements can be any type, and the elements don't have to be the same type.\n", "\n", "The most important difference is that you can modify a list, whereas tuples are immutable (cannot be modified). This difference will matter later, but for now we can ignore it.\n", "\n", "When you make a list, the brackets are required, but if there is a single element, you don't need a comma. So you can make a list like this:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "single = [5]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "type(single)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to make a list with no elements, like this:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "empty = []" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "type(empty)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `len` function returns the length (number of elements) of a list or tuple." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "len([1, 2, 3])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "len(single)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "len(empty)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Create a list with 4 elements; then use `type` to confirm that it's a list, and `len` to confirm that it has 4 elements."
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's a lot more we could do with lists, but that's enough to get started. In the next section, we'll use lists to store data about sandwich prices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sandwiches\n", "\n", "In September 2019, *The Economist* published an article comparing sandwich prices in Boston and London: \"[Why Americans pay more for lunch than Britons do](https://www.economist.com/finance-and-economics/2019/09/07/why-americans-pay-more-for-lunch-than-britons-do)\"\n", "\n", "It includes this graph showing prices of several sandwiches in the two cities:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the sandwich names from the graph, as a list of strings." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "name_list = ['Lobster roll',\n", " 'Chicken caesar',\n", " 'Bang bang chicken',\n", " 'Ham and cheese',\n", " 'Tuna and cucumber',\n", " 'Egg'\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I contacted *The Economist* to ask for the data they used to create that graph, and they were kind enough to share it with me.\n", "\n", "Here are the corresponding sandwich prices in Boston:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "boston_price_list = [9.99, 7.99, 7.49, 7, 6.29, 4.99]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the lobster roll is \\$9.99 in Boston.\n", "\n", "The egg sandwich is \\$4.99.\n", "\n", "Here are the prices in London, converted to dollars at \\$1.25 / £1." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "london_price_list = [7.5, 5, 4.4, 5, 3.75, 2.25]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lists provide some arithmetic operators, but they might not do what you want. For example, you can \"add\" two lists:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "boston_price_list + london_price_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But it concatenates the two lists, which is not very useful in this example.\n", "\n", "To compute differences between prices, you might try subtracting lists, but you would get an error." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "boston_price_list - london_price_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can solve this problem with a NumPy array." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NumPy arrays\n", "\n", "We've already seen that the NumPy library provides math functions. 
It also provides a type of sequence called an array.\n", "\n", "You can create a new array with the `np.array` function, starting with a list or tuple." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "boston_price_array = np.array(boston_price_list)\n", "london_price_array = np.array(london_price_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The type of the result is `numpy.ndarray`." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "type(boston_price_array)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"nd\" stands for \"n-dimensional\"; NumPy arrays can have any number of dimensions. But for now we will work with one-dimensional sequences.\n", "\n", "If you display an array, Python displays the elements:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "boston_price_array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also display the \"data type\" of the array, which is the type of the elements:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "boston_price_array.dtype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`float64` means that the elements are floating-point numbers that take up 64 bits each. You don't need to know about the storage format of these numbers, but if you are curious, [you can read about it here](https://en.wikipedia.org/wiki/Floating-point_arithmetic#Internal_representation).\n", "\n", "The elements of a NumPy array can be any type, but they all have to be the same type.\n", "\n", "Most often the elements are numbers, but you can also make an array of strings." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "name_array = np.array(name_list)\n", "name_array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, the `dtype` is `