{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Modeling and Simulation in Python\n", "\n", "Chapter 5\n", "\n", "Copyright 2017 Allen Downey\n", "\n", "License: [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Configure Jupyter so figures appear in the notebook\n", "%matplotlib inline\n", "\n", "# Configure Jupyter to display the assigned value after an assignment\n", "%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'\n", "\n", "# import functions from the modsim.py module\n", "from modsim import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data\n", "\n", "Pandas is a library that provides tools for reading and processing data. `read_html` reads a web page from a file or the Internet and creates one `DataFrame` for each table on the page." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from pandas import read_html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data directory contains a downloaded copy of https://en.wikipedia.org/wiki/World_population_estimates\n", "\n", "The arguments of `read_html` specify the file to read and how to interpret the tables in the file. The result, `tables`, is a sequence of `DataFrame` objects; `len(tables)` reports the length of the sequence." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "filename = 'data/World_population_estimates.html'\n", "tables = read_html(filename, header=0, index_col=0, decimal='M')\n", "len(tables)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can select the `DataFrame` we want using the bracket operator. The tables are numbered from 0, so `tables[2]` is actually the third table on the page.\n", "\n", "`head` selects the header and the first five rows." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [], "source": [ "table2 = tables[2]\n", "table2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`tail` selects the last five rows." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "table2.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Long column names are awkward to work with, but we can replace them with abbreviated names." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "table2.columns = ['census', 'prb', 'un', 'maddison', \n", " 'hyde', 'tanton', 'biraben', 'mj', \n", " 'thomlinson', 'durand', 'clark']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what the DataFrame looks like now. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "table2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first column, which is labeled `Year`, is special. It is the **index** for this `DataFrame`, which means it contains the labels for the rows.\n", "\n", "Some of the values use scientific notation; for example, `2.544000e+09` is shorthand for $2.544 \\cdot 10^9$ or 2.544 billion.\n", "\n", "`NaN` is a special value that indicates missing data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Series\n", "\n", "We can use dot notation to select a column from a `DataFrame`. The result is a `Series`, which is like a `DataFrame` with a single column." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "census = table2.census\n", "census.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "census.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like a `DataFrame`, a `Series` contains an index, which labels the rows.\n", "\n", "`1e9` is scientific notation for $1 \\cdot 10^9$ or 1 billion." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From here on, we will work in units of billions." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "un = table2.un / 1e9\n", "un.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "census = table2.census / 1e9\n", "census.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what these estimates look like." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": false }, "outputs": [], "source": [ "plot(census, ':', label='US Census')\n", "plot(un, '--', label='UN DESA')\n", " \n", "decorate(xlabel='Year',\n", " ylabel='World population (billion)')\n", "\n", "savefig('figs/chap05-fig01.pdf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following expression computes the elementwise differences between the two series, then divides through by the UN value to produce [relative errors](https://en.wikipedia.org/wiki/Approximation_error), then finds the largest element.\n", "\n", "So the largest relative error between the estimates is about 1.3%." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "max(abs(census - un) / un) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Break down that expression into smaller steps and display the intermediate results, to make sure you understand how it works.\n", "\n", "1. 
Compute the elementwise differences, `census - un`\n", "2. Compute the absolute differences, `abs(census - un)`\n", "3. Compute the relative differences, `abs(census - un) / un`\n", "4. Compute the percent differences, `abs(census - un) / un * 100`\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`max` and `abs` are built-in functions provided by Python, but NumPy also provides versions that are a little more general. When you import `modsim`, you get the NumPy versions of these functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Constant growth" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can select a value from a `Series` using bracket notation. Here's the first element:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "census[1950]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the last value." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "census[2016]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But rather than \"hard code\" those dates, we can get the first and last labels from the `Series`:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "t_0 = get_first_label(census)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "t_end = get_last_label(census)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "elapsed_time = t_end - t_0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can get the first and last values:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "p_0 = get_first_value(census)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "p_end = get_last_value(census)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can compute the average annual growth in billions of people per year." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "total_growth = p_end - p_0" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "annual_growth = total_growth / elapsed_time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TimeSeries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's create a `TimeSeries` to contain values generated by a linear growth model." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "results = TimeSeries()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initially the `TimeSeries` is empty, but we can initialize it so the starting value, in 1950, is the 1950 population estimated by the US Census." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "results[t_0] = census[t_0]\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After that, the population in the model grows by a constant amount each year." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "for t in linrange(t_0, t_end):\n", " results[t+1] = results[t] + annual_growth" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what the results look like, compared to the actual data." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "plot(census, ':', label='US Census')\n", "plot(un, '--', label='UN DESA')\n", "plot(results, color='gray', label='model')\n", "\n", "decorate(xlabel='Year', \n", " ylabel='World population (billion)',\n", " title='Constant growth')\n", "\n", "savefig('figs/chap05-fig02.pdf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model fits the data pretty well after 1990, but not so well before." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercises\n", "\n", "**Optional Exercise:** Try fitting the model using data from 1970 to the present, and see if that does a better job.\n", "\n", "Hint: \n", "\n", "1. Copy the code from above and make a few changes. Test your code after each small change.\n", "\n", "2. Make sure your `TimeSeries` starts in 1950, even though the estimated annual growth is based on later data.\n", "\n", "3. You might want to add a constant to the starting value to match the data better." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "census.loc[1960:1970]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }