{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PS 00 Scratch. Data Manipulation Pieces Dropped from the Problem Set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Data Manipulation\n", "### 4.1. Pandas datafames\n", "Pandas is one of the most widely used Python libraries in data science. It is commonly used for data cleaning, and with good reason: it’s very powerful and flexible. Think of it as a table. The rows and columns of a pandas dataframe thought of as a table are a collection of lists stacked on top/next to each other. For example, here is a dataframe made up of ten lists, with each list consisting of a rating and a movie title:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "top_10_movies_list = [\n", " [9.2, 'The Shawshank Redemption', '1994'],\n", " [9.2, 'The Godfather', '1972'],\n", " [9.0, 'The Godfather: Part II', '1974'],\n", " [8.9, 'Pulp Fiction', '1994'],\n", " [8.9, \"Schindler's List\", '1993'],\n", " [8.9, 'The Lord of the Rings: The Return of the King', '2003'],\n", " [8.9, '12 Angry Men', '1957'],\n", " [8.9, 'The Dark Knight', '2008'],\n", " [8.9, 'Il buono, il brutto, il cattivo', '1966'],\n", " [8.8, 'The Lord of the Rings: The Fellowship of the Ring', '2001']]\n", "\n", "top_10_movies_df = pd.DataFrame(data=np.array(top_10_movies_list), columns=[\"Rating\", \"Movie\", \"Year\"])\n", " \n", "top_10_movies_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Reading in dataframes\n", "\n", "Luckily for you, most datatables in this course will be premade and given to you in a form that is easily read into a pandas method, which creates the table for you. A common file type that is used for economic data is a Comma-Separated Values (.csv) file, which stores tabular data. It is not necessary for you to know exactly how .csv files store data, but you should know how to read a file in as a pandas dataframe. You can use the \"read_csv\" method from pandas, which takes in one parameter which is the path to the csv file you are reading in." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will read in a .csv file that contains quarterly real GDI, real GDP, and nominal GDP data in the U.S. from 1947 to the present." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run this cell to read in the table\n", "accounts = pd.read_csv(\"data/Quarterly_Accounts.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pd.read_csv` function expects a path to a .csv file as its input, and will return a data table created from the data contained in the csv.\n", "We have provided `Quarterly_Accouunts.csv` in the data directory, which is all contained in the current working directory (aka the folder this assignment is contained in). For this reason, we must specify to the `read_csv` function that it should look for the csv in the data directory, and the `/` indicates that `Quarterly_Accounts.csv` can be found there. \n", "\n", "Here is a sample of some of the rows in this datatable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accounts.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 Indexing dataframes\n", "\n", "Oftentimes, tables will contain a lot of extraneous data that muddles our data tables, making it more difficult to quickly and accurately obtain the data we need. To correct for this, we can select out columns or rows that we need by indexing our dataframes. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The easiest way to index into a table is with square bracket notation. Suppose you wanted to obtain all of the Real GDP data from the data. Using a single pair of square brackets, you could index the table for `\"Real GDP\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run this cell and see what it outputs\n", "accounts[\"Real GDP\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the above cell returns an array of all the real GDP values in their original order.\n", "Now, if you wanted to get the first real GDP value from this array, you could index it with another pair of square brackets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accounts[\"Real GDP\"][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas columns have many of the same properties as numpy arrays. Keep in mind that pandas dataframes, as well as many other data structures, are zero-indexed, meaning indexes start at 0 and end at the number of elements minus one. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you wanted to create a new datatable with select columns from the original table, you can index with double brackets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Note: .head() returns the first five rows of the table\n", "accounts[[\"Year\", \"Quarter\", \"Real GDP\", \"Real GDI\"]].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can also get rid of columns you dont need using `.drop()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accounts.drop(\"Nominal GDP\", axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, you can use square bracket notation to index rows by their indices with a single set of brackets. You must specify a range of values for which you want to index. For example, if I wanted the 20th to 30th rows of `accounts`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accounts[20:31]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.4. Filtering data\n", "\n", "As you can tell from the previous, indexing rows based on indices is only useful when you know the specific set of rows that you need, and you can only really get a range of entries. Working with data often involves huge datasets, making it inefficient and sometimes impossible to know exactly what indices to be looking at. On top of that, most data analysis concerns itself with looking for patterns or specific conditions in the data, which is impossible to look for with simple index based sorting. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thankfully, you can also use square bracket notation to filter out data based on a condition. Suppose we only wanted real GDP and nominal GDP data from the 21st century:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accounts[accounts[\"Year\"] >= 2000][[\"Real GDP\", \"Nominal GDP\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `accounts` table is being indexed by the condition `accounts[\"Year\"] >= 2000`, which returns a table where only rows that have a \"Year\" greater than $2000$ is returned. We then index this table with the double bracket notation from the previous section to only get the real GDP and nominal GDP columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose now we wanted a table with data from the first quarter, and where the real GDP was less than 5000 or nominal GDP is greater than 15,000." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accounts[(accounts[\"Quarter\"] == \"Q1\") & ((accounts[\"Real GDP\"] < 5000) | (accounts[\"Nominal GDP\"] > 15000))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many different conditions can be included to filter, and you can use `&` and `|` operators to connect them together. Make sure to include parantheses for each condition!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to reorganize data to make it more convenient is to sort the data by the values in a specific column. For example, if we wanted to find the highest real GDP since 1947, we could sort the table for real GDP:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accounts.sort_values(\"Real GDP\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But wait! The table looks like it's sorted in increasing order. This is because `sort_values` defaults to ordering the column in ascending order. To correct this, add in the extra optional parameter" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accounts.sort_values(\"Real GDP\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can clearly see that the highest real GDP was attained in the first quarter of this year, and had a value of 16903.2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.5. Useful methods for numeric data\n", "\n", "Here are a few useful functions when dealing with numeric data columns.\n", "To find the minimum value in a column, call `min()` on a column of the table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accounts[\"Real GDP\"].min()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find the maximum value, call `max()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accounts[\"Nominal GDP\"].max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And to find the average value of a column, use `mean()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true } }, "outputs": [], "source": [ "accounts[\"Real GDI\"].mean()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }