{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This is an interactive jupyter notebook document.\n", "\n", "Page down through it, following the instructions…\n", "\n", "With what looks to be a permanent and long-run partial moving-online of the university, the already important topic of “data science” seems likely to become even more foundational. Hence this first problem set tries to provide you with an introduction—to “data science”, and to the framework we will be using for problem sets that we hope will make things much easier for you and for us…\n", "\n", "When you are finished, satisfied, or stuck, print your notebook to pdf, & upload the pdf to the appropriate assignment bCourses page: \n", "\n", "Also, please include with your submission all of your comments on and reactions to this assignment that you want us to know...\n", " \n", "----" ] }, { "cell_type": "markdown", "metadata": { "deletable": false }, "source": [ "# Jupyter Notebook 01. Data Manipulation & Visualization\n", "\n", "These computer programming problem set assignments are a required part of the course.\n", "\n", "Collaborating on the problem sets is more than okay—it is encouraged! Seek help from a classmate or an instructor or a roommate or a passerby when you get stuck! (Explaining things is beneficial, too—the best way to solidify your knowledge of a subject is to explain it.) \n", "\n", "But the work has to be your own: no cutting-&-pasting from others' problem sets, please! 
We want you to learn this stuff, and your fingers typing every keystroke is an important way of building muscle memory here.\n", "\n", "In fact, we strongly recommend that as you work through this notebook, whenever you come to a \"code\" cell—something intended not only for you to read but also to direct the computer—the python interpreter—to do calculations, you (1) click on the code cell to bring it into your browser's focus; (2) click on the `+` button in the toolbar above to create a new code cell just below the one you are now in; and then (3) retype, line-by-line, the computer code in the cell (not the comment lines beginning with `#`s, but the code lines) while trying to figure out what each line of code is intended to tell the python interpreter to do. \"Muscle\"—in this case, fingertip—memory is an important but undervalued part of \"active learning\" here at Berkeley. In Germany, however, they have a term for it: _das Fingerspitzengefühl_; it's the kind of understanding-through-the-fingertips that a true expert has.\n", "\n", "In this problem set, you will learn how to:\n", "\n", "1. navigate Jupyter notebooks (like this one);\n", "2. write and evaluate some basic *expressions* in Python, the computer language of the course;\n", "3. call *functions* to use code other people have written;\n", "4. break down python code into smaller parts to understand it; and\n", "5. do some initial data explorations with Python in a notebook.\n", "\n", "For reference, you might find it useful to read chapter 3 of the Data 8 textbook: <>. Chapters 1 <> and 2 <> are worth skimming as well..." ] }, { "cell_type": "markdown", "metadata": { "deletable": false }, "source": [ "## 1. Why Are We Doing This?\n", "\n", "First of all, we are doing this because our section leaders are overworked: teaching online takes more time and effort than teaching in person, and our section leaders were not overpaid before the 'rona arrived on these shores. 
Taking the bulk of the work of grading calculation assignments off of their backs is a plus—and it appears that the best way for us to do that is to distribute a number of the course assignments to you in this form: the form of a python computer language \"jupyter notebook\".\n", "\n", "Second, we are doing this because learning jupyter notebooks and python may well turn out to be the intellectual equivalent for you of \"eat your spinach\": something that may seem unpleasant and unappetizing now, but that makes you stronger and more capable. In 1999 Guido van Rossum, creator of the Python programming language, compared the coming of mass programming ability—the ability to read, write, and use software you had built or modified yourself to search and analyze collections of data and information—to the coming of literacy and writing itself. Guido speculated that mass programming literacy and competence, if it could be attained, would produce increases in societal power and changes in societal organization of roughly the same magnitude as mass literacy has had over the past several centuries. Guido may be right, and he may be wrong. But what is clear is that your lives may be richer, and you may have more options, if the data science and basic programming intellectual tools become a useful part of your intellectual panoplies.\n", "\n", "An analogy: Back in the medieval European university, people would learn the trivium—the “trivial” subjects of grammar (how to write), rhetoric (how to speak in public), and logic (how to think coherently)—then they would learn the quadrivium of arithmetic, geometry, music/harmony, and astronomy/astrology; and last they would learn the advanced and professional subjects: law or medicine or theology, and physics, metaphysics, and moral philosophy. 
But a student would also learn two more things: (1) how to learn by reading—how to take a book and get something useful out of it, without requiring a direct hands-on face-to-face teacher; and (2) how to write a _fine chancery hand_, so that they could prepare their own documents, for submission to secular courts or to religious bishops, or even just to put them in a form where they would be easily legible to any educated audience back in those days before screens-and-attachments, before screens-and-printers, before typewriters, before printing.\n", "\n", "The data science tools may well turn out to be in the first half of the 2000s the equivalent of a _fine chancery hand_, just as a facility with the document formats and commands of the Microsoft office suite was the equivalent of a _fine chancery hand_ at the end of the 1900s: practical, general skills that make you of immense value to most if not nearly all organizations. This—along with the ability to absorb useful knowledge without requiring hands-on person-to-person face-to-face training—will greatly boost your social power and your set of opportunities in your life. If we are right about its value.\n", "\n", "Third, why Jupyter and Python, rather than RStudio and R, or C++ and MATLAB, or even (horrors!) Microsoft Excel? Because Jupyter Project founder Fernando Perez has an office on the fourth floor of Evans. Because half of Berkeley undergraduates currently take Data 8 and so, taking account of other channels, more than half of Berkeley students are already going to graduate literate in python.\n", "\n", "Let us get started!\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Data Manipulation\n", "\n", "### 2.1. 
We need ways to manipulate groups of lists\n", "Here is a four-element list I call `at_the_dawn` that contains my current guesses of the overall global state of the human economy 70000 years ago:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "at_the_dawn = [-68000, 1, 1200, 0.1]\n", "\n", "at_the_dawn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This list contains four elements:\n", "\n", "* The first element in this list is an integer, the date by the current calendar: 70000 years ago is the year 68000 BCE, or 68000 BC, or -68000. (Back when Werner Rolevinck's _Little Bundles of Time_ was published in Cologne, Germany and became the best-selling book of the 1400s after the _Bible_, European intellectuals feared and did not understand negative numbers: hence the \"BC\" and the counting backwards stuff. But we do understand negative numbers.) \n", "* The second element is a real number (\"real\" in the sense of \"not imaginary\", i.e., not involving the square root of minus one), a number with a fractional or right-of-the-decimal-point part, my guess of the value of the stock of useful ideas about technology and organization that was then the common property of humanity.\n", "* The third element is another real number, my guess of the average standard of living back then: about \\$1200 _per capita_ per year (but relative prices were very different, so treat that number gingerly and with suspicion).\n", "* The fourth element is yet another real number, the human population in millions: about 0.1—that is, there were perhaps only 100,000 people alive on the globe.\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. Python Data Analysis Library\n", "\n", "One list by itself for the year -68000 is not very useful. It becomes useful only when we combine it with other lists in a larger set of data. 
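" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(An aside—an illustrative sketch, not part of the main analysis: you can pull individual elements out of a python list by position, counting from zero. A quick check like this is a good habit whenever you build a list:)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# python lists are indexed from zero: element 0 of at_the_dawn\n", "# is the date, and element 2 is the income level\n", "\n", "at_the_dawn[0], at_the_dawn[2]   # evaluates to (-68000, 1200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "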
But then we need infotech tools to work with groups of lists.\n", "\n", "So back in 2008 computer programmer Wes McKinney began developing the Python Data Analysis Library—which he called Pandas, short for \"panel data\", the econometricians' term for this kind of multidimensional structured dataset (you can see where the \"d\" and the second \"a\" come from). His bosses at AQR Capital Management up to and including Cliff Asness (with whom I have had some remarkable fights, on twitter and elsewhere) did a very good thing when they gave him the green light to open-source it. And because of its power, flexibility, and gentle learning curve, `pandas` has since become the standard tool for data manipulation and data cleaning.\n", "\n", "In order to invoke any pandas commands, you first need to call it from the vasty deep into your working environment. And because it is built on top of the Numerical Python Library, `numpy`, you need to call numpy into your environment as well. It is conventional to do it with the commands:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why the \"import...\"? Because you need to tell the python interpreter that the command you are looking for comes from the pandas (or numpy) library. \"as np\" and \"as pd\"? Because people got tired of typing \"pandas\" so many times in their code, and decided they would rather just type a quick \"pd\" instead.\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3. Pandas Dataframes\n", "\n", "Think of a pandas _dataframe_ as a table with rows and columns. The rows and columns of a pandas dataframe are best thought of as a collection of lists stacked on top/next to each other. 
For example, here is a collection—a list of lists—of eleven lists like our `at the dawn` list, each for a different date:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_list_of_lists = [\n", " [-68000, 1, 1200, 0.1],\n", " [-8000, 5, 1200, 2.5],\n", " [-6000, 6.3, 900, 7],\n", " [-3000, 9.2, 900, 15],\n", " [-1000, 16.8, 900, 50],\n", " [1, 30.9, 900, 170],\n", " [800, 41.1, 900, 300],\n", " [1500, 53, 900, 500],\n", " [1770, 79.4, 1100, 750],\n", " [1870, 123.5, 1300, 1300],\n", " [2020, 2720.5, 11842, 7600]\n", " ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will notice that the final two meaningful symbols in the code cell above are \"] ]\". The first \"]\" markes the end of the 'whatever comes next' list for the python interpreter. The second and final \"]\" marks the end of the list-of-lists. That we have here a list with eleven elements, and that each element is itself a list, should bend your brain a little bit if you have not seen this kind of thing before. Get used to it. You will see a lot of such tail-chasing and tail-swallowing in modern information technologies.\n", "\n", "The code cell above simply gives us a list with lists as its elements: it is not yet a pandas dataframe. Once we have imported the pandas library into our working environment, we can tell the python interpreter to turn it into a dataframe with the command to create the dataframe `long_run_growth_df` thus:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df = pd.DataFrame(\n", " data=np.array(long_run_growth_list_of_lists), columns = ['year', 'human_ideas', 'income_level', 'population'] \n", " )\n", "\n", "long_run_growth_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### 2.4. 
Good Programming Practices...\n", "\n", "As before, when the last line of the code cell is simply the name of some object, the python interpreter understands that to be a request that it evaluate that object and print it out. And we see that we have indeed created the data table we had hoped to construct.\n", "\n", "We have called it \"long_run_growth\" to remind us that it is made up of data about the long run growth of the human economy, and we have the \"\\_df\" at the end to remind us that it is a pandas _dataframe_. Note the \"pd.\" in front of \"DataFrame\" to tell the python interpreter that the command comes from pandas. Note the \"np.\" in front of \"array\" to tell the python interpreter that we want it to look into the numpy, the Numerical Python, library for the `array` command to apply to `long_run_growth_list_of_lists`. And note the \", columns = \" to tell the python interpreter that it should label each of the columns of the dataframe, and what those labels should be.\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5. Cleaning the Dataset\n", "\n", "Now we need to do a little housecleaning:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# first, see what the python interpreter thinks our \"year\" column is:\n", "\n", "long_run_growth_df['year']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make sure that the python interpreter understands that \n", "# the year is an integer—that is, a number without any\n", "# fractional after-the-decimal-point part:\n", "\n", "long_run_growth_df['year'] = long_run_growth_df['year'].apply(np.int64)\n", "\n", "long_run_growth_df['year']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yes! 
Before, the python interpreter had thought that the \"year\" column was a real number (i.e., not an imaginary number: not a number related to the square root of minus one) with a fractional after-the-decimal-point part. Before, it had set up the \"year\" column as a column of \"floating-point\" numbers: each element carried 52 binary digits (\"bits\") of precision, with the position of the decimal point \"floated\" according to the value of an 11-bit exponent, plus one extra bit to tell whether the number was positive or negative. Now the python interpreter knows that the year is an integer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make a new variable which is simply the year at which each\n", "# of the ten periods into which our dataframe divides human\n", "# history is taken to start:\n", "\n", "long_run_growth_df['initial_year'] = long_run_growth_df['year']\n", "initial_year = long_run_growth_df['initial_year'][0:10]\n", "\n", "initial_year" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Parenthetically, it is almost always good to end each code cell with the name of the object the cell asks the python interpreter to calculate. Then you can look at what the python interpreter evaluates the object to be—in this case, a ten-element pandas series called \"initial_year\", each element of which is a 64-bit integer—to check that the computer is doing what you expected and wanted it to do.)"
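 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(For instance—an illustrative aside, not a required step—you can also ask the interpreter directly what an object is and what type its elements are:)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# confirm that initial_year is a pandas series whose\n", "# elements are 64-bit integers:\n", "\n", "type(initial_year), initial_year.dtype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the dtype prints as something other than int64, the `.apply(np.int64)` conversion above did not take."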
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now we calculate era lengths—which we call \"span\"—and also\n", "# calculate the growth rates, over our different eras, g for\n", "# the proportional growth rate of real income per capita (and\n", "# also of the efficiency of labor), n for the proportional\n", "# rate of growth of the human population, & h for the\n", "# proportional growth rate of the value of useful human\n", "# ideas about technology and organization h:\n", "\n", "span = []\n", "g = []\n", "h = []\n", "n = []\n", "\n", "for t in range(10):\n", " span = span +[long_run_growth_df['year'][t+1]-long_run_growth_df['year'][t]]\n", " h = h + [np.log(long_run_growth_df['human_ideas'][t+1]/long_run_growth_df['human_ideas'][t])/span[t]]\n", " g = g + [np.log(long_run_growth_df['income_level'][t+1]/long_run_growth_df['income_level'][t])/span[t]]\n", " n = n + [np.log(long_run_growth_df['population'][t+1]/long_run_growth_df['population'][t])/span[t]]\n", " \n", "n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see that $ n $—the human population proportional growth rate for the ten eras -68000 to -8000, -8000 to -6000, -6000 to -3000, -3000 to -1000, -1000 to 1, 1 to 800, 800 to 1500, 1500 to 1770, 1770 to 1870, and 1870 to 2020—is a ten-element python list, which is what we hoped it would be, and that its last element, 0.0117718..., is in fact the annual average proportional growth rate of the human population over the modern economic growth era from 1870 to 2020, during which humanity's population grew from 1.3 to 7.6 billion. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# next, we tell the python interpreter that these data \n", "# are naturally indexed by the year:\n", "\n", "long_run_growth_df.set_index('year', inplace=True)\n", "\n", "# &, last, we check to make sure that nothing has gone wrong \n", "# yet with our computations. 
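\n", "\n", "# an aside—an illustrative sketch (the name n_alt is new,\n", "# not used elsewhere): the era population growth rates from\n", "# the loop above can also be computed in vectorized form;\n", "# since the year is now the index, np.diff of the index\n", "# gives the era spans:\n", "n_alt = np.diff(np.log(long_run_growth_df['population'].to_numpy())) / np.diff(long_run_growth_df.index.to_numpy())\n", "\n", "# 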
we check to see that we can\n", "# refer to columns of the dataframe by the labels we assigned\n", "# to them & that the python interpreter understands that the year\n", "# to which each observation corresponds is its natural index:\n", "\n", "long_run_growth_df['income_level']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# remember those growth rates we calculated? we can stuff them\n", "# into their own dataframe and take a look:\n", "\n", "long_run_growth_rates_df = pd.DataFrame(\n", " data=np.array([initial_year, span, h, g, n]).transpose(),\n", " columns = ['initial_year', 'span', 'h', 'g', 'n'])\n", "\n", "long_run_growth_rates_df['initial_year'] = long_run_growth_rates_df['initial_year'].apply(np.int64)\n", "long_run_growth_rates_df.set_index('initial_year', inplace=True)\n", "\n", "long_run_growth_rates_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And, last, to make it easier to understand what we have done, let us label the eras into which we have divided human history:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eras = ['at the dawn', 'agriculture & herding', 'proto-agrarian age',\n", " 'writing', 'axial age', 'late-antiquity pause', 'middle age', 'commercial revolution',\n", " 'industrial revolution', 'modern economic growth', 'what the 21st century brings']\n", "\n", "long_run_growth_df['eras'] = eras\n", "\n", "eras = eras[0:10]\n", "\n", "long_run_growth_rates_df['eras'] = eras\n", "\n", "long_run_growth_rates_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the work involved in being an analyst involves this kind of \"data cleaning\" exercise. It is boring. It is remarkably difficult, and time consuming: \n", "\n", "\n", "\n", "It is essential to get right, or you will fall victim to the programming saying: garbage in, garbage out. 
And it is essential to do so in a way that leaves a trail that even a huge idiot can follow should one in the future attempt to understand what you have done to create the dataset you are analyzing—and there is no bigger idiot who needs a lot of help to understand where the dataset came from and what it is than yourself next year, or next month, or next week, or tomorrow.\n", "\n", "This is why we jump through all these hoops: so that somebody coming across this notebook file one or five or ten years from now will not be hopelessly lost in trying to figure out what is supposed to be going on here.\n", "\n", "To make the tasks of whoever may look at this in the future—which may well be you—easier, I recommend an additional data cleaning step. Take the name of your dataframe, replace the \"\\_df\" at its end with \"\\_dict\", thus creating what python calls a _dictionary object_, then stuff the dataframe into the _dictionary_, and finally add information about the sources of the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_dict = {}\n", "\n", "long_run_growth_dict['dataframe'] = long_run_growth_df\n", "long_run_growth_dict['source'] = 'Brad DeLong\\'s guesses about the shape of long-run human economic history'\n", "long_run_growth_dict['source_date'] = '2020-05-24'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you follow this convention, then anyone running across one of your dataframes in the future will be able to quickly get up to speed on where the data came from by simply typing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_dict['source']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### 2.6. The psychology of programming\n", "\n", "Do you feel bewildered? 
As if I have been issuing incomprehensible and arcane commands to some deep and mysterious entities that may or may not respond in the expected way? All who program feel this way some of the time, and most of those who program feel this way most of the time. For example, here is cartoonist Randall Munroe on how he does not understand his own programming. It is all just magic:\n", "\n", "
\n", "\n", "\n", "And at times you are sure to feel worse than bewildered. You will feel like python newbee Gandalf feels at this moment:\n", "\n", "
\n", "\n", "\n", "\n", "This \"sorcerer's apprentice\" <> feeling is remarkably common among programmers. It is explicitly referenced in the introduction to the classic computer science textbook, Abelson, Sussman, & Sussman: _Structure and Interpretation of Computer Programs_ <>:\n", "\n", ">In effect, we conjure the spirits of the computer with our spells.\n", "A computational process is indeed much like a sorcerer’s idea of a spirit. It cannot be seen or touched. It is not composed of matter at all. However, it is very real. It can perform intellectual work. It can answer questions. It can affect the world by disbursing money at a bank or by controlling a robot arm in a factory. The programs we use to conjure processes are like a sorcerer’s spells. They are carefully composed from symbolic expressions in arcane and esoteric programming languages that prescribe the tasks we want our processes to perform. A computational process, in a correctly working computer, executes programs precisely and accurately. Thus, like the sorcerer’s apprentice, novice programmers must learn to understand and to anticipate the consequences of their conjuring.... Master software engineers have the ability to organize programs so that they can be reasonably sure that the resulting processes will perform the tasks intended... " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### 2.7. Comment Your Code!\n", "You may recall these lines in one of the cells far above:\n", "\n", " # Change the next line \n", " # so that it computes the number of seconds in a decade \n", " # and assigns that number the name, seconds_in_a_decade.\n", " \n", "This is called a *comment*. It doesn't make anything happen in Python; Python ignores anything on a line after a `#`. Instead, it's there to communicate something about the code to you, the human reader. Comments are extremely useful. 
\n", "\n", "\n", "Source: http://imgs.xkcd.com/comics/future_self.png\n", "\n", "Why are comments useful? Because anyone who will read and try to understand your code in the future is guaranteed to be an idiot. You need to explain things to them very simply, as if they were a small child.\n", "\n", "They are not really an idiot, of course. It is just that they are not in-the-moment, and do not have the context in their minds that you have when you write your code.\n", "\n", "And always keep in mind that the biggest idiot of all is also the one who will be most desperate to understand what you have written: it is yourself, a month or more from now, especially near the end of the semester." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### 2.8. Maintain Your Machines!\n", "\n", "These assignments will be very difficult to do on a smartphone.\n", "\n", "Understand and keep your laptop running—or understand, keep running, and get really really good at using your tablet. Machines do not need to be expensive: around 150 dollars should do it for a Chromebook. People I know like the Samsung Exynos 5 <> or the Lenovo 3 11\" <>.\n", "\n", "And have a backup plan: what will you do if your machine breaks and has to go into the shop, or gets stolen?\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.9. Why jump through all these hoops here?\n", "\n", "You may feel that we have gone through a lot of extra and unnecessary work to create this dataframe. If you are familiar with a spreadsheet program like Microsoft Excel, you may wonder why we don't just use a spreadsheet to hold and then do calculations with the data in this small table that is `long_run_growth_df`. Indeed, Bob Frankston and Dan Bricklin, who implemented and designed the original Visicalc, were geniuses. Visicalc was a tremendously useful advance over earlier mainframe-based report generators, such as ITS's Business Planning Language <>. 
And today's Microsoft Excel is not that great an advance over Jonathan Sachs's Lotus 1-2-3, which was itself close to being merely a knockoff of Visicalc. Why not follow the line of least resistance? Why not do our data analysis and visualization in a spreadsheet?\n", "\n", "I do not recommend using spreadsheet programs. In fact, I greatly disrecommend using spreadsheet programs.\n", "\n", "Why?\n", "\n", "This is why:\n", "\n", "\n", "\n", "If you do your work in a spreadsheet, it rapidly becomes impossible to check or understand. A spreadsheet is a uniquely easy framework to work in. A spreadsheet is a uniquely opaque and incomprehensible framework to assess for its correctness.\n", "\n", "Since we all make errors, frequently, the ability to look back and assess whether one's calculations are correct is absolutely essential. With spreadsheets, such checking is impossible. And sooner or later with very high probability you will make a large and consequential mistake that you will not catch.\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.10. Selecting and sorting information\n", "\n", "From our `long_run_growth_rates_df` dataframe, suppose we wanted to select only those eras in which the average rate of human population growth was less than a tenth of a percent per year. We would do so by:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_rates_df[long_run_growth_rates_df['n'] < 0.001]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We find six such eras. 
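" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(As an aside—an equivalent, illustrative alternative—the same selection can also be written with the dataframe's `query` method, which takes the condition as a string:)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the same selection, written with the query method:\n", "\n", "long_run_growth_rates_df.query('n < 0.001')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "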
If we wanted to look at only the rates of growth of income per capita in those eras, we would write:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_rates_df[long_run_growth_rates_df['n'] < 0.001]['g']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From our `long_run_growth_df` dataframe, suppose we wanted to sort the rows of the dataframe in decreasing order of how rich humanity was at each point. We would then write:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df.sort_values(\"income_level\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that humanity has recently been richer—very recently, much richer—than ever before. But we also see that there was a long stretch of history, from the year -6000 up through the start of the industrial revolution era in 1770, when humanity was poorer than it had been back before agriculture.\n", "\n", "Suppose we wanted to just look at data before the year 1:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df[long_run_growth_df['initial_year'] < 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or suppose we just wanted to look at income level and population data after 1500:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df[long_run_growth_df['initial_year'] > 1500][['income_level', 'population']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or suppose we wanted to look at the population for the year 1500:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df['population'][1500] " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can write our dataframes to \"disk\"—a word we still use because for a generation computer data could be 
stored in one of three places: in the computer's volatile memory, where it disappeared when the power went off or when the computer crashed; on spinning disks with little magnets on them; or archived, offsite, on reels of tape. Why would we want to do so? So that we can easily reuse the data someplace else, or find it again later.\n", "\n", "Pandas attaches methods to its dataframe to make this easy:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# write our two dataframes to \"disk\"\n", "\n", "long_run_growth_df.to_csv('long_run_growth.csv')\n", "long_run_growth_rates_df.to_csv('long_run_growth_rates.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where the \".csv\" extension signals that the data is stored as **c**omma-**s**eparated **v**alues, a plain-text format, so that you can actually read it and understand it with your eyes if necessary.\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.11. Reading in dataframes\n", "\n", "Luckily for you, all of this data cleaning above will be largely irrelevant for this course. Almost all datatables in this course will be premade and given to you in a form that is easily read in by a pandas method, which creates the table for you. \n", "\n", "A common file type that is used for economic data is a comma-separated values (.csv) file. 
If you know the url or the file name and location (the \"file path\"), you use the \"read_csv\" method from pandas, which requires a single parameter: the path to the csv file you are reading in:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# run this cell to read in the datatables and set the\n", "# indexes:\n", "\n", "long_run_growth_df = pd.read_csv(\"https://delong.typepad.com/files/long_run_growth.csv\")\n", "long_run_growth_rates_df = pd.read_csv(\"https://delong.typepad.com/files/long_run_growth_rates.csv\")\n", "\n", "long_run_growth_df.set_index('year', inplace=True)\n", "long_run_growth_rates_df.set_index('initial_year', inplace=True)\n", "\n", "long_run_growth_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load in the data using `pd.read_csv()`, set your indices using `name_df.set_index()`, and you are ready to go with your data analysis, with no further data cleaning required.\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Data Visualization " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you can read in and manipulate data, you are ready to learn about how to visualize data. 
To begin, run the cell below to import the required libraries, to make graphs appear inside the notebook rather than in a separate window, and to load and set the indexes for the dataframes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import matplotlib as mpl\n", "\n", "long_run_growth_df = pd.read_csv(\"https://delong.typepad.com/files/long_run_growth.csv\")\n", "long_run_growth_rates_df = pd.read_csv(\"https://delong.typepad.com/files/long_run_growth_rates.csv\")\n", "\n", "long_run_growth_df.set_index('year', inplace=True)\n", "long_run_growth_rates_df.set_index('initial_year', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check to make sure that the `long_run_growth_df` is what it was supposed to be:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check to make sure that the `long_run_growth_rates_df` is what it was supposed to be:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_rates_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and we are ready to roll...\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1. It Is Very Easy\n", "There is very little left to do in order to visualize our data. \n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2. 
The Most Eagle's Eye View\n", "\n", "To graph how wealth _per capita_ has evolved over human history, we simply write:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df['income_level'].plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is it.\n", "\n", "One of the great advantages of pandas is its built-in plotting methods. We can simply call `.plot()` on a dataframe, or on a single column, and pandas plots the values against the dataframe's index, which in our case is the year. Pandas will also attempt to parse dates automatically and order them sequentially." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We probably want to pretty-up the graph a little bit, adding labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df['income_level'].plot()\n", "plt.title('Human Economic History: Wealth per Capita', size=20)\n", "plt.xlabel('Year')\n", "plt.ylabel('Annual Income per Capita, 2020 Dollars')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Freaky, no?\n", "\n", "This is why U.C. Davis economic historian Greg Clark says that there is really only one graph that is important in economic history.\n", "\n", "How do the other variables in our dataframe look?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df['population'].plot()\n", "plt.title('Human Economic History: Population', size=20)\n", "plt.xlabel('Year')\n", "plt.ylabel('Millions')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df['human_ideas'].plot()\n", "plt.title('Human Economic History: Ideas', size=20)\n", "plt.xlabel('Year')\n", "plt.ylabel('Index of Useful Ideas Stock')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### 3.3. Looking at Logarithmic Scales\n", "\n", "After the spring of coronavirus, we are used to exponential growth processes—things that explode, but only after a time in which they gather force, and which look like straight-line growth when plotted on a logarithmic scale. Let us plot income levels, populations, and ideas stock values on log scales and see what we see:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.log(long_run_growth_df['income_level']).plot()\n", "plt.title('Human Economic History: Wealth', size=20)\n", "plt.xlabel('Year')\n", "plt.ylabel('Log Annual Income per Capita, 2020 Dollars')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.log(long_run_growth_df['population']).plot()\n", "plt.title('Human Economic History: Population', size=20)\n", "plt.xlabel('Year')\n", "plt.ylabel('Log Millions')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.log(long_run_growth_df['human_ideas']).plot()\n", "plt.title('Human Economic History: Ideas', size=20)\n", "plt.xlabel('Year')\n", "plt.ylabel('Log Index of Useful Ideas Stock')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### 3.4. 
Slicing the Data to Look at Subsamples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we have here is not an exponential growth process—at least, not recently, for very large values of \"recently\". But perhaps we really do not care about what went on in the gatherer-hunter age before 10,000 years ago, or even in the early agrarian age before 5,000 years ago. We know how to take slices out of a dataframe, and in Python those slices act like dataframes themselves, so we can simply call the `plot()` method on a slice to look at subsamples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df['income_level'][3:].plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Still definitely not an exponential...\n", "\n", "Insert some code cells below, and in them run some plot commands both to pretty-up your figures with labels and to examine how the time series behave over the past five thousand years, roughly since the invention of writing. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose you want to gain a sense of how the variables behaved back in the long agrarian age, after the gatherer-hunter age but before the Industrial Revolution that started in 1770. 
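Positional slices like `[3:]` pick rows by their position in the table. Because our index is the year, we can instead slice by label with `.loc`, which includes both endpoints. Here is a minimal sketch using a toy dataframe (the numbers are invented for illustration, not taken from our data):\n", "\n", "```python\n", "import pandas as pd\n", "\n", "# a tiny stand-in for long_run_growth_df, indexed by year\n", "toy_df = pd.DataFrame({'income_level': [1200, 1250, 1300, 9000]},\n", "                      index=pd.Index([-6000, -1000, 1500, 2020], name='year'))\n", "\n", "# label-based slice: all rows with year between -6000 and 1770\n", "agrarian = toy_df.loc[-6000:1770, 'income_level']\n", "\n", "# agrarian.plot() would then graph just that slice\n", "```\n", "\n", "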
Then we once again use the same commands we used for lists to slice the dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df['income_level'][1:9].plot()\n", "plt.title('Human Economic History: Wealth', size=20)\n", "plt.xlabel('Year')\n", "plt.ylabel('Annual Income per Capita, 2020 Dollars')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df['population'][1:9].plot()\n", "plt.title('Human Economic History: Population', size=20)\n", "plt.xlabel('Year')\n", "plt.ylabel('Millions')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df['human_ideas'][1:9].plot()\n", "plt.title('Human Economic History: Ideas', size=20)\n", "plt.xlabel('Year')\n", "plt.ylabel('Index of Useful Ideas Stock')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### 3.5. More Attractive Tables\n", "\n", "When we simply print out our dataframe of growth rates across eras, it is relatively unattractive and hard to read:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_rates_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here making the printing prettier really matters—for us humans, at least, as we try to read what is going on.\n", "\n", "We can make the printing prettier by defining a format dictionary `format_dict` (or whatever other name we choose) that maps each column name to a format string, and then handing it to the `.format()` method of the dataframe's `.style` accessor:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "format_dict = { 'span': '{0:.0f}', 'h': '{0:,.3%}', \n", " 'g': '{0:,.2%}', 'n': '{0:,.2%}'}\n", 
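"# these format strings use Python's format-spec mini-language, e.g.:\n", "#   '{0:.0f}'  rounds to a whole number, and\n", "#   '{0:,.2%}' renders a percentage with two decimal places,\n", "#   so that '{0:,.2%}'.format(0.01234) gives '1.23%'\n", 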
"\n", "long_run_growth_rates_df.style.format(format_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a much more human-readable presentation of the dataframe of calculated growth rates of the ideas stock, of the living standards of the typical human, and of the human population over the eras that make up human history...\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. You Are Done!\n", "\n", "You're done with Problem Set 01! Be sure to submit it!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Thanks to Umar Maniku, Eric van Dusen, Anaise Jean-Philippe, Marc Dordal i Carrerras, & others for helpful comments. Very substantial elements of this were borrowed from the Berkeley Data 8 <> teaching materials, specifically Lab 01: <>\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", " \n", "\n", "## Data Manipulation & Visualization \n", "\n", "\n", "\n", "### Catch Our Breath—Further Notes:\n", "
\n", "\n", "----\n", "\n", "**Data Manipulation & Visualization: Links:** <> <> <>\n", "\n", "\n", " \n", "\n", "----" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }