{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "Most of the work involved in being an analyst involves this kind of \"data cleaning\" exercise. It is boring. It is remarkably difficult, and time consuming: \n", "\n", "\n", "\n", "It is essential to get right, or you will fall victim to the programming saying: garbage in, garbage out. And it is essential to do so in a way that leaves a trail that even a huge idiot can follow should one in the future attempt to understand what you have done to create the dataset you are analyzing—and there is no bigger idiot who needs a lot of help to understand where the dataset came from and what it is than yourself next year, or next month, or next week, or tomorrow.\n", "\n", "This is why we jump through all these hoops: so that somebody coming across this notebook file one or five or ten years from now will not be hopelessly lost in trying to figure out what is supposed to be going on here.\n", "\n", "To make the tasks of whoever may look at this in the future—which may well be you—I recommend an additional data cleaning step. Take the name of your dataframe, replace the \"\\_df\" at its end with \"\\_dict\", thus creating what python calls a _dictionary object_, then stuff the dataframe into the _dictionary_, and finally add information about the sources of the data:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, when the last line of the code cell is simply the name of some object, the python interpreter understands that to be a request that it evaluate that object and print it out. And we see that we have indeed created the data table we had hoped to construct.\n", "\n", "We have called it \"long_run_growth\" to remind us that it is made up of data about the long run growth of the human economy, and we have the \"\\_df\" at the end to remind us that it is a pandas _dataframe_. Note the \"pd.\" in front of \"DataFrame\" to tell the python interpreter that the command comes from pandas. Note the \"np.\" in front of \"array\" to tell the python interpreter that we want it to look into the numpy, the Numerical Python, library for the `array` command to apply to `long_run_growth_list_of_lists`. And note the \", columns = \" to tell the python interpreter that it should label each of the columns of the dataframe, and what those labels should be." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_dict = {}\n", "\n", "long_run_growth_dict['dataframe'] = long_run_growth_df\n", "long_run_growth_dict['source'] = 'Brad DeLong\\'s guesses about the shape of long-run human economic history'\n", "long_run_growth_dict['source_date'] = '2020-05-24'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you follow this convention, then anyone running across one of your dataframes in the future will be able to quickly get up to speed on where the data came from by simply typing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_dict['source']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.10. Selecting and sorting information\n", "\n", "From our `long_run_growth_rates_df` dataframe, suppose we wanted to select only those eras in which the average rate of human population growth was less than a quarter of a percent per year. We would do so by:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_rates_df[long_run_growth_rates_df['n'] < 0.001]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We find seven such eras. If we wanted to look at only the rates of growth of income per capita in those eras, we would write:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_rates_df[long_run_growth_rates_df['n'] < 0.001]['g']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From our `long_run_growth_df` dataframe, suppose we wanted to sort the rows of the dataframe in decreasing order of how rich humanity was at each point. We would then write:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df.sort_values(\"income_level\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that humanity has recently been richer—very recently very richer—than ever before. But we also see that there was a long stretch of history, from the year -6000 up to 1770, when humanity was poorer than it had been before in the industrial revolution era.\n", "\n", "Suppose we wanted to just look at data before the year 1:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df[long_run_growth_df['initial_year'] < 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or suppose we just wanted to look at income level and population data after 1500:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df[long_run_growth_df['initial_year'] > 1500][['income_level', 'population']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or suppose we to look at the population for the year 1500:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "long_run_growth_df['population'][1500] " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And if we want to write our dataframes to \"disk\"—a word we still use because for a generation computer data could be stored in one of three places: in the computer's volatile memory where it disappeared when the power went off or when the computer crashed; on spinning disks with little magnets on them; or archived, offsite, on reels of tape. Why would we want to do so? So that we can easily reuse the data someplace else, or find it again later.\n", "\n", "Pandas attaches methods to its dataframe to make this easy:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# write our two dataframes to \"disk\"\n", "\n", "long_run_growth_df.to_csv('long_run_growth.csv')\n", "long_run_growth_rates_df.to_csv('long_run_growth_rates.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where \".csv\" tells the python interpreter that the data is in the form of **c**omma-**s**eparated **v**alues, so that you can actually read it and understand it with your eyes if necessary." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }