{ "cells": [ { "cell_type": "raw", "metadata": {}, "source": [ "---\n", "reference-location: margin\n", "citation-location: margin\n", "execute:\n", " freeze: auto\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas\n", "\n", "[Jupyter Notebook](https://lancejnelson.github.io/PH135/jupyter/pandas.ipynb)\n", "\n", "\n", "Pandas is a Python package built to work with spreadsheet-like data, and it is very good at its job. Pandas stores data in something called a \"dataframe\". A dataframe is simply data stored in rows and columns. As an example, here is some sample data taken by an accelerometer sitting on an elevator floor:\n", "\n", "time,gFx,gFy,gFz \n", "0.007,-0.0056,-0.0046,1.012 \n", "0.008,0.0007,0.0024,1.0022 \n", "0.008,0,0.0059,1.0039 \n", "0.009,0.0054,-0.0022,1.0032 \n", "0.009,-0.0015,-0.0056,1.0042 \n", "0.009,0.0037,-0.002,0.9951 \n", "0.01,-0.002,-0.002,1.002 \n", "0.014,0.009,-0.0024,1.0159 \n", "0.015,0.0012,-0.0037,1.01 \n", "0.017,-0.0115,-0.002,1.0012 \n", "0.019,-0.0022,-0.0015,1.001 \n", "0.021,0.0024,-0.0022,1.0166 \n", "\n", "and here is the same data loaded into a dataframe.\n", "\n", "```\n", " time gFx gFy gFz \n", "0 0.007 -0.0056 -0.0046 1.0120 \n", "1 0.008 0.0007 0.0024 1.0022 \n", "2 0.008 0.0000 0.0059 1.0039 \n", "3 0.009 0.0054 -0.0022 1.0032 \n", "4 0.009 -0.0015 -0.0056 1.0042 \n", "5 0.009 0.0037 -0.0020 0.9951 \n", "6 0.010 -0.0020 -0.0020 1.0020 \n", "7 0.014 0.0090 -0.0024 1.0159 \n", "8 0.015 0.0012 -0.0037 1.0100 \n", "9 0.017 -0.0115 -0.0020 1.0012 \n", "``` \n", "Every dataframe has labels attached to its columns and rows. In this example, the row labels are just the first 10 integers and the column labels are \"time\", \"gFx\", \"gFy\", and , \"gFz\". Labeling the rows and columns is nice because you can access data using the row and column labels instead of indices. \n", "\n", "## Creating dataframes\n", "A dataframe can be initiated in many ways but the most common are: from file, from a dictionary, and from an array or list. We'll discuss each way separately.\n", "\n", "### Reading from a .csv file.\n", "The most-used function from the pandas module is `read_csv` which is used to read a csv-formatted file into a data frame. To use it, simply call the `pandas.read_csv` function and pass in the path to the .csv file. ([Here](https://lancejnelson.github.io/PH135/files/elevator.csv) is a link to the file.)" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import read_csv\n", "\n", "elevator_data = read_csv(\"elevator.csv\",index_col = 0)\n", "print(elevator_data)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The keyword argument `index_col = 0` indicates that the row labels should be taken from the first column in the csv file. There are many, many possible keyword arguments that can be used to customize the way `read_csv` reads a file into a dataframe. I'll highlight just a few and refer you to the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv) for the rest:\n", "\n", "1. `delimiter`- use this to specify the character that separates the data from each other. The default is \",\" for .csv files.\n", "2. `header` - use this to specify which row contains the column names. Usually this occurs on the first row (`header = 0`) but not always.\n", "3. `usecols` - use this to specify which columns from the file should be included in the dataframe.\n", "4. `skiprows` - line numbers to skip when building the dataframe. Can be either a single integer (skip the first `n` lines) or a list of integers and it will skip all rows in the list.\n", "\n", "\n", "The `rename()` function can be used if you don't like the default index values (integers if you read from file) and want to reassign them. " ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import read_csv\n", "\n", "elevator_data = read_csv(\"elevator.csv\",index_col = 0)\n", "elevator_data = elevator_data.rename(index = {0:\"A\",1:\"B\",2:\"C\",3:\"D\",4:\"E\",5:\"F\",6:\"G\",7:\"H\",8:\"I\",9:\"J\"})\n", "print(elevator_data)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create from a dictionary\n", "If your data is in a dictionary you can use the `DataFrame` function (case sensitive) to initialize the dataframe. Key values in the dictionary correspond to columns in the dataframe. " ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "elevator = {\"time\":[0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017],\"gFx\":[-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115],\"gFy\":[-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002],\"gFz\":[1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]}\n", "elevator_data = DataFrame(elevator)\n", "print(elevator_data)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The row indices will default to the a set of integers starting at 0. If you want to index the rows with other labels, use the `index` keyword argument." ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "elevator = {\"time\":[0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017],\"gFx\":[-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115],\"gFy\":[-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002],\"gFz\":[1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]}\n", "elevator_data = DataFrame(elevator,index = [\"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\",\"H\",\"J\",\"K\"])\n", "print(elevator_data)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create from a list\n", "Sometimes you have data in a list or multiple list and would like to combine all of that data and form a dataframe. This can also be done using the `DataFrame` function (case sensitive remember!). When initializing with lists, you have to also use the `columns` keyword argument to specify what you want the column labels to be. " ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "from numpy import transpose\n", "\n", "time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]\n", "gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]\n", "gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]\n", "gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]\n", "elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = [\"time\", \"gFx\",\"gFy\", \"gFz\"])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The row labels will default to integers unless you specify otherwise with the `index` keyword.\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "from numpy import transpose\n", "\n", "time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]\n", "gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]\n", "gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]\n", "gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]\n", "elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = [\"time\", \"gFx\",\"gFy\", \"gFz\"],index = [\"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\",\"H\",\"J\",\"K\"])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting data\n", "Extracting data from the dataframe could mean several things. The most common possibilities include:\n", "\n", "1. accessing a single number using row and column labels or row and column indices.\n", "2. accessing one or several columns of the dataframe.\n", "3. accessing one or several rows of the dataframe.\n", "4. slicing from the \"middle\" of the dataframe. (i.e. Not entire columns or entire rows.)\n", "5. accessing only elements in the dataframe that meet a certain criteria.\n", "\n", "We'll cover each of these tasks one at a time.\n", "\n", "### Extracting General Information\n", "Sometimes the dataframe is quite large and you'd like to inspect just a small portion of it. You can use `dataframe.head(n)` to look at the **first** `n` rows in the dataframe and `dataframe.tail(n)` to look at the **last** `n` rows.\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "elevator_data.head(3)\n", "elevator_data.tail(4)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get a list of the column or row labels, use the variables `dataframe.columns` and `dataframe.index`." ] }, { "cell_type": "code", "metadata": {}, "source": [ "elevator_data.columns\n", "elevator_data.index" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting a single number\n", "To extract a single number from a dataframe, use the `dataframe.at[]` or `dataframe.iat[]` objects. The `at` object should be used if you want to locate the number using row and column **labels** and `iat` should be used when you want to access the number using row and column **indices**. Let's see an example:\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "from numpy import transpose\n", "\n", "time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]\n", "gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]\n", "gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]\n", "gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]\n", "elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = [\"time\", \"gFx\",\"gFy\", \"gFz\"],index = [\"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\",\"H\",\"J\",\"K\"])\n", "\n", "elevator_data.at[\"A\",\"time\"]\n", "elevator_data.iat[0,0]\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `loc` and `iloc` dictionaries (see below) can also be used to extract a single entry in the dataframe.\n", "\n", "### Extracting entire columns\n", "An entire column of the dataframe can be extract by indexing it like a dictionary, using the column name as the key value\n", "\n", "```python\n", "elevator_data[\"gFz\"]\n", "elevator_data[\"time\"]\n", "elevator_data[\"time\"][0:3]\n", "\n", "```\n", "\n", "You can bundle the column names into a list and extract multiple columns at once.\n", "\n", "```python\n", "elevator_data[[\"gFx\", \"gFy\", \"gFz\"]]\n", "\n", "```\n", "\n", "### Accessing entire rows\n", "Accessing rows in a dataframe is done with the help of the `dataframe.loc[n]` or `dataframe.iloc[n]` dictionary. `loc` should be used if you want to slice using row and column labels whereas `iloc` should be used if you want to slice out using indices." ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "from numpy import transpose\n", "\n", "time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]\n", "gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]\n", "gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]\n", "gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]\n", "elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = [\"time\", \"gFx\",\"gFy\", \"gFz\"],index = [\"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\",\"H\",\"J\",\"K\"])\n", "\n", "elevator_data.loc[\"A\"]\n", "elevator_data.iloc[0] " ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Multiple rows can be accessed using the `A:B` syntax to slice out the desired rows.\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "from numpy import transpose\n", "\n", "time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]\n", "gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]\n", "gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]\n", "gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]\n", "elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = [\"time\", \"gFx\",\"gFy\", \"gFz\"],index = [\"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\",\"H\",\"J\",\"K\"])\n", "\n", "elevator_data.loc[\"A\":\"D\"]\n", "elevator_data.iloc[0:4] " ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slicing rows and columns\n", "\n", "The `loc` and `iloc` dictionaries can be used to slice across rows and columns. Simple separate the two slices with a comma:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "from numpy import transpose\n", "\n", "time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]\n", "gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]\n", "gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]\n", "gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]\n", "elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = [\"time\", \"gFx\",\"gFy\", \"gFz\"],index = [\"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\",\"H\",\"J\",\"K\"])\n", "\n", "elevator_data.loc[\"A\":\"D\",\"time\":\"gFy\"]\n", "elevator_data.iloc[0:4,0:2] " ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Boolean Slicing\n", "\n", "We have already seen boolean slicing in Numpy Arrays and we can use something similar on Pandas. Let's see an example:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "from numpy import transpose\n", "\n", "time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]\n", "gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]\n", "gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]\n", "gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]\n", "elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = [\"time\", \"gFx\",\"gFy\", \"gFz\"],index = [\"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\",\"H\",\"J\",\"K\"])\n", "\n", "elevator_data[elevator_data[\"time\"] < 0.01] # Slice only the rows where time < 0.01" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The statement `elevator_data[\"time\"] < 0.01` produces a boolean sequence which can be used as a set of indices to access only those entries where time is less than 0.01.\n", "\n", "More complex boolean slicing can be done with the use of the `query` function which allows you to be more specific about your boolean conditions. The example below will extract all rows in the dataframe where time > 0.008 and gFz is greater than 1.\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "from numpy import transpose\n", "\n", "time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]\n", "gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]\n", "gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]\n", "gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]\n", "elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = [\"time\", \"gFx\",\"gFy\", \"gFz\"],index = [\"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\",\"H\",\"J\",\"K\"])\n", "\n", "elevator_data.query(\"time > 0.008 and gFz > 1\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Special Characters in Column Name\n", "Often you will find that your column labels will contain spaces and/or other special characters in them. To correctly specify the column label in these cases, you must enclose the name in graves accents (It's located above the tab key on your keyboard.). For example, if your column label was \" time-s\" instead of just \"time\" like in the example above, you could use the query function like this:\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "from numpy import transpose\n", "\n", "time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]\n", "gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]\n", "gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]\n", "gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]\n", "elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = [\" time-s\", \"gFx\",\"gFy\", \"gFz\"],index = [\"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\",\"H\",\"J\",\"K\"])\n", "\n", "elevator_data.query(\"` time-s` > 0.008 and gFz > 1\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Query to match strings\n", "\n", "Often your dataframe will contain strings as the data instead of numbers. If you are trying to use `query` to find dataframe entries that contain a specified string, you'll have enclose the boolean expression in single quotes ('') and enclose the string you are trying to match in double quotes (\"\"). As an example, imagine a dataframe very similar to the one used in the previous example, but with an additional column that specifies whether the object\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame\n", "from numpy import transpose\n", "\n", "dataDict = {\" time-s\":[0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017], \"gFx\":[-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115],\"gFy\":[-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002],\"gFz\":[1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012], \"Here\":[\"Yes\",\"Yes\",\"No\",\"No\",\"Yes\",\"No\",\"Yes\",\"No\",\"Yes\",\"No\"]}\n", "\n", "elevator_data = DataFrame(dataDict,index = [\"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\",\"H\",\"J\",\"K\"])\n", "\n", "elevator_data.query('` time-s` > 0.008 and Here == \"No\"')" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">**_To Do:_**\n", ">\n", ">1. Add print statements to the cell above until you understand what each line of code does.\n", ">2. Add comments next to the line of code to help you remember.\n", "\n", "\n", "## Performing Calculations\n", "Dataframes are similar to numpy arrays in that you can do math across an entire dataset. This makes mathematical calculations very easy. Look at the example below and try to guess what calculation is being performed\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from numpy import sqrt\n", "\n", "elevator_data[[\"gFx\", \"gFy\", \"gFz\"]] *= 9.8\n", "a_magnitude = sqrt(elevator_data[[\"gFx\", \"gFy\", \"gFz\"]]**2).sum(axis = 1)\n", "\n", "max_accel = a_magnitude.max()\n", "\n", "elevator_data" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may have noticed that `sum` method that was used to sum up each column. The keyword argument `axis = 1` indicates that you want to sum over rows, not over columns. (`axis = 0` would result in summing over columns.) There are a few other handy methods for common mathematical operations.\n", "\n", "1. `min()` - Find the minimum value.\n", "2. `max()` - Find the maximum value.\n", "3. `cumsum()` - Cumulative sum, just like in numpy.\n", "4. `std()` - Standard deviation.\n", "5. `mean()` - Mean or average.\n", "6. `quantile(q)` - Find value of a give quantile `q`. \n", "\n", "\n", "There are a multitude of other useful math functions available. See [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html) for a more comprehensive list. Basically, any function that numpy has, pandas will also have.\n", "\n", "\n", "## Modifying the dataframe\n", "\n", "### Adding new columns\n", "\n", "A new column can be added to a dataframe by typing `dataframe[columnname] =` followed by a list,tuple, or array containing the new entries. For example:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "elevator_data[\"a_mag\"] = sqrt( (elevator_data[[\"gFx\", \"gFy\", \"gFz\"]]**2).sum(axis = 1))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding a new row\n", "\n", "To add a single (or multiple) rows to a dataframe you should use the `concat()` function (short for concatenate). This function will join multiple dataframes into one. Below is an example of the usage." ] }, { "cell_type": "code", "metadata": {}, "source": [ "from pandas import DataFrame,concat\n", "\n", "to_add = DataFrame({\" time-s\":0.02,\"gFx\":-0.028,\"gFy\":0.018,\"gFz\":1.028},index = [\"L\"]) # Build the dataframe to be added.\n", "final = concat([elevator_data,to_add]) # Append using a dictionary.\n", "\n", "elevator_data2 = DataFrame({\"time\":[0.02,0.025],\"gFx\":[-0.028,-0.022],\"gFy\":[0.018,-0.012],\"gFz\":[1.028,1.042]})\n", "\n", "final2 = concat([elevator_data,to_add,elevator_data2]) #Combine all three data frames into one." ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other useful methods\n", "\n", "### Getting a summary of your dataframe.\n", "\n", "The `describe()` function will calculate several useful statistical quantities and display them in a dataframe." ] }, { "cell_type": "code", "metadata": {}, "source": [ "elevator_data.describe()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting your dataframe.\n", "\n", "A histogram of each column can be easily generate with the `dataframe.hist()` function. " ] }, { "cell_type": "code", "metadata": {}, "source": [ "elevator_data.hist()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make a scatter plot of two columns in your dataframe, use `dataframe.plot.scatter()`. The `x` and `y` keyword arguments should be used to specify which columns to plot. " ] }, { "cell_type": "code", "metadata": {}, "source": [ "elevator_data.plot.scatter(x = \" time-s\", y = \"gFz\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also plot a single column vs. the row labels using the `plot` function" ] }, { "cell_type": "code", "metadata": {}, "source": [ "elevator_data[\"gFz\"].plot()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use this function to plot a single row vs the column labels." ] }, { "cell_type": "code", "metadata": {}, "source": [ "elevator_data.iloc[2][[\" time-s\",\"gFx\",\"gFy\",\"gFz\"]].plot()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A wealth of other functions exist for pandas and I will not exemplify or explain them here because it is beyond the scope of the class. A nice summary sheet for pandas can be found [here](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)\n", "\n", "\n", "### Writing your dataframe to file\n", "You have already used `savetxt` to save an entire array of numbers to a file in one step. If your data is in a pandas dataframe, saving that dataframe to file couldn't be easier; just use `dataframe.to_csv(\"filename\")`." ] }, { "cell_type": "code", "metadata": {}, "source": [ "elevator_data.to_csv(\"myelevatorData.csv\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Several helpful keyword arguments are available when writing to a file. I'll list a few of them below.\n", "\n", "1. `sep` - Delimiter or character used to separate the data as a length-one string. Default is a comma (\",\").\n", "2. `columns` - specify which columns in your dataframe to write as a list of column labels.\n", "3. `index` - `True` if you want the row labels written and `False` if you don't\n", "4. `compression` - specify the compression scheme as a string. Options are 'zip','gzip', 'bz2','zstd', and 'tar'. \n", "\n", "\n", "Depending on the type of data you are working with, there is a good argument for always using `to_csv` to write data to file, never needing `savetxt`.\n", "\n", "\n", "\n", "## Flashcards\n", "1. How does `read_csv` work? Give a short example.\n", "2. What is the keyword argument `usecols` used for when reading a dataframe from file?\n", "3. How does the `DataFrame` function work and what type of arguments can you pass it? Give a few short examples.\n", "4. How do you access/extract multiple columns in their entirety from a dataframe.\n", "5. How do you access/extract multiple rows in their entirety from a dataframe.\n", "6. How do you access/extract the middle part of a dataframe?\n", "7. How do you access/extract only the numbers in a dataframe that meet a given criteria?\n", "8. How do you add a new column to a dataframe?\n", "9. How do you add a new row to a dataframe?\n", "10. Recite Mosiah 4:27.\n", "\n", "## Exercises\n", "1. In homework problem 3 from the chapter on dictionaries and sets, you built a dictionary containing planetary information for your top three favorite planets. \n", " 1. Create a dataframe from this dictionary.\n", " 2. Make the name of the planet be the row label.\n", " 3. Write this dataframe to a csv file.\n", "\n", "2. In homework problem 2 from the chapter on numpy, you calculated the moment of inertia for 1000 different objects. \n", " 1. Create a dataframe out of the mass, radius, length, and inertia data. The dataframe should have four columns and 1000 rows.\n", " 2. Use `describe` to get a summary of the data.\n", " 3. Write the data to a .csv file and inspect the file to ensure that you did it correctly.\n", "\n", "3. In homework problem 3 from the chapter on I/O, you read planetary data into a .csv file using `genfromtxt`.\n", " 1. Read the same file into a dataframe using `read_csv`. Let the name of the planets be the row label and the physical property be the column label.\n", " 2. The sixth column contains the acceleration due to gravity for all of the planets in units of m/s$^2$. Recalculate these values to be in terms of earth's value ($g = 9.78$ m/s$^2$) and modify the sixth column.\n", " 3. The seventh column contains the orbital period for all of the planets in units of earth years. Recalculate these values to be in earth days and modify that column accordingly. \n", " 4. Add a new column labeled \"Surface Temperature\" that contains the surface temperature for each planet in Kelvins as given [here](https://solarsystem.nasa.gov/resources/681/solar-system-temperatures/).\n", " 5. Add a new row for a hypothetical planet (\"planet X\") and set its physical properties equal to the average of the other 9 planets that are already in the dataframe.\n", " 6. Determine which planets have an acceleration due to gravity that is less than earth's value ($9.8 $ m/s$^2$)\n", " 7. Determine which planets have an eccentricity that is greater than $0.04$.\n", " 8. Determine which planet(s) have an inclination greater than $1.5^\\circ$ **and** have 10 or more satellites.\n", " 9. Write the a .csv file containing the following columns in the dataframe: \"Name\", \"SurfaceGravity\", \"OrbitalPeriod\", and \"Satellites\".\n", " 10. Write the entire dataframe to a separate file. \n", "\n", " Hints: \n", " 1. For part 5 you will want to use `describe` to generate a dataframe of summary statistics. Then use `append` to add the row labeled \"mean\" to your original planetary database. Then use `rename` to rename that row from `mean` to `Planet X`. \n", " 2. The column label for the acceleration due to gravity start with a space: \" SurfaceGravity\" not \"SurfaceGravity\".\n", " 3. For part 8 use `query`. To include the space in the column names, enclose the name in backticks \"\\`\". (i.e. \" ` Inclination` > 0.04\")" ] } ], "metadata": { "kernelspec": { "name": "python3", "language": "python", "display_name": "Python 3" } }, "nbformat": 4, "nbformat_minor": 4 }