{
"cells": [
{
"cell_type": "markdown",
"id": "654d989b-ca3b-4815-94f8-fe640ebbe9d2",
"metadata": {
"slideshow": {
"slide_type": "skip"
},
"tags": []
},
"source": [
"[](https://mybinder.org/v2/gh/stefmolin/pandas-workshop/main?urlpath=lab/tree/notebooks/1-getting_started_with_pandas.ipynb) [](https://stefaniemolin.com/pandas-workshop/#/section-1)\n",
"\n",
"---\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "4eedaf6c-296c-442c-92ec-60b8a4595781",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"# Section 1: Getting Started With Pandas\n",
"\n",
"We will begin by introducing the `Series`, `DataFrame`, and `Index` classes, which are the basic building blocks of the pandas library, and showing how to work with them. By the end of this section, you will be able to create DataFrames and perform operations on them to inspect and filter the data."
]
},
{
"cell_type": "markdown",
"id": "002b1f0a-a96e-424b-993a-8126452d31a1",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Anatomy of a DataFrame\n",
"\n",
"A **DataFrame** is composed of one or more **Series**. The names of the **Series** form the column names, and the row labels form the **Index**."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c8c4e00d-8bbb-47ee-9c84-dba7b6c19636",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" id | \n",
" nametype | \n",
" recclass | \n",
" mass (g) | \n",
" fall | \n",
" year | \n",
" reclat | \n",
" reclong | \n",
" GeoLocation | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Aachen | \n",
" 1 | \n",
" Valid | \n",
" L5 | \n",
" 21 | \n",
" Fell | \n",
" 01/01/1880 12:00:00 AM | \n",
" 50.77500 | \n",
" 6.08333 | \n",
" (50.775, 6.08333) | \n",
"
\n",
" \n",
" | 1 | \n",
" Aarhus | \n",
" 2 | \n",
" Valid | \n",
" H6 | \n",
" 720 | \n",
" Fell | \n",
" 01/01/1951 12:00:00 AM | \n",
" 56.18333 | \n",
" 10.23333 | \n",
" (56.18333, 10.23333) | \n",
"
\n",
" \n",
" | 2 | \n",
" Abee | \n",
" 6 | \n",
" Valid | \n",
" EH4 | \n",
" 107000 | \n",
" Fell | \n",
" 01/01/1952 12:00:00 AM | \n",
" 54.21667 | \n",
" -113.00000 | \n",
" (54.21667, -113.0) | \n",
"
\n",
" \n",
" | 3 | \n",
" Acapulco | \n",
" 10 | \n",
" Valid | \n",
" Acapulcoite | \n",
" 1914 | \n",
" Fell | \n",
" 01/01/1976 12:00:00 AM | \n",
" 16.88333 | \n",
" -99.90000 | \n",
" (16.88333, -99.9) | \n",
"
\n",
" \n",
" | 4 | \n",
" Achiras | \n",
" 370 | \n",
" Valid | \n",
" L6 | \n",
" 780 | \n",
" Fell | \n",
" 01/01/1902 12:00:00 AM | \n",
" -33.16667 | \n",
" -64.95000 | \n",
" (-33.16667, -64.95) | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name id nametype recclass mass (g) fall \\\n",
"0 Aachen 1 Valid L5 21 Fell \n",
"1 Aarhus 2 Valid H6 720 Fell \n",
"2 Abee 6 Valid EH4 107000 Fell \n",
"3 Acapulco 10 Valid Acapulcoite 1914 Fell \n",
"4 Achiras 370 Valid L6 780 Fell \n",
"\n",
" year reclat reclong GeoLocation \n",
"0 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333) \n",
"1 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333) \n",
"2 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0) \n",
"3 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9) \n",
"4 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95) "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"meteorites = pd.read_csv('../data/Meteorite_Landings.csv', nrows=5)\n",
"meteorites"
]
},
{
"cell_type": "markdown",
"id": "0f71bf2b-8ff6-4352-b764-ca323a4bcb5e",
"metadata": {},
"source": [
"*Source: [NASA's Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh)*"
]
},
{
"cell_type": "markdown",
"id": "08f9c73e-20be-40d1-b5e9-1bfd24b5be25",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### Series:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e9481da8-a3b5-4f6e-b9fc-cdb049090dbe",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 Aachen\n",
"1 Aarhus\n",
"2 Abee\n",
"3 Acapulco\n",
"4 Achiras\n",
"Name: name, dtype: object"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.name"
]
},
{
"cell_type": "markdown",
"id": "5ad03f4e-5f74-4440-a825-cd8e964006d2",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### Columns:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2b8723c5-3e4d-4038-82c7-5c0e9651f01c",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year',\n",
" 'reclat', 'reclong', 'GeoLocation'],\n",
" dtype='object')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.columns"
]
},
{
"cell_type": "markdown",
"id": "649a8b50-2ff5-403b-8752-cbcf992de79f",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"#### Index:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "43a06370-f735-4a9d-a64f-493e03b2b2c7",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=5, step=1)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.index"
]
},
{
"cell_type": "markdown",
"id": "49080335-73c4-4b21-a181-70370544f3aa",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Creating DataFrames\n",
"\n",
"We can create DataFrames from a variety of sources such as other Python objects, flat files, webscraping, and API requests. Here, we will see just a couple of examples, but be sure to check out [this page](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) in the documentation for a complete list."
]
},
{
"cell_type": "markdown",
"id": "a655f25f-c5ce-4286-a671-eee6c1eef663",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### Using a flat file"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "96da3638-773f-4a4a-a766-b5f34da67f80",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"meteorites = pd.read_csv('../data/Meteorite_Landings.csv')"
]
},
{
"cell_type": "markdown",
"id": "0ac0a6c5-9366-481f-87cc-5d84742cebf8",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"*Tip: There are many parameters to this function to handle some initial processing while reading in the file – be sure check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).*"
]
},
{
"cell_type": "markdown",
"id": "8097800d-f20a-4a28-ab25-ec2218903cb3",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### Using data from an API\n",
"\n",
"Collect the data from [NASA's Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh) using the Socrata Open Data API (SODA) with the `requests` library:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "45bf1785-4f25-461b-b600-543ccb0fccff",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import requests\n",
"\n",
"response = requests.get(\n",
" 'https://data.nasa.gov/resource/gh4g-9sfh.json',\n",
" params={'$limit': 50_000}\n",
")\n",
"\n",
"if response.ok:\n",
" payload = response.json()\n",
"else:\n",
" print(f'Request was not successful and returned code: {response.status_code}.')\n",
" payload = None"
]
},
{
"cell_type": "markdown",
"id": "575a0059-5f81-46cf-9a1c-7dc06d874a72",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"Create the DataFrame with the resulting payload:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "be644eb4-193e-4820-9855-3e959c899319",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" id | \n",
" nametype | \n",
" recclass | \n",
" mass | \n",
" fall | \n",
" year | \n",
" reclat | \n",
" reclong | \n",
" geolocation | \n",
" :@computed_region_cbhk_fwbd | \n",
" :@computed_region_nnqa_25f4 | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Aachen | \n",
" 1 | \n",
" Valid | \n",
" L5 | \n",
" 21 | \n",
" Fell | \n",
" 1880-01-01T00:00:00.000 | \n",
" 50.775000 | \n",
" 6.083330 | \n",
" {'latitude': '50.775', 'longitude': '6.08333'} | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 1 | \n",
" Aarhus | \n",
" 2 | \n",
" Valid | \n",
" H6 | \n",
" 720 | \n",
" Fell | \n",
" 1951-01-01T00:00:00.000 | \n",
" 56.183330 | \n",
" 10.233330 | \n",
" {'latitude': '56.18333', 'longitude': '10.23333'} | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 2 | \n",
" Abee | \n",
" 6 | \n",
" Valid | \n",
" EH4 | \n",
" 107000 | \n",
" Fell | \n",
" 1952-01-01T00:00:00.000 | \n",
" 54.216670 | \n",
" -113.000000 | \n",
" {'latitude': '54.21667', 'longitude': '-113.0'} | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name id nametype recclass mass fall year \\\n",
"0 Aachen 1 Valid L5 21 Fell 1880-01-01T00:00:00.000 \n",
"1 Aarhus 2 Valid H6 720 Fell 1951-01-01T00:00:00.000 \n",
"2 Abee 6 Valid EH4 107000 Fell 1952-01-01T00:00:00.000 \n",
"\n",
" reclat reclong geolocation \\\n",
"0 50.775000 6.083330 {'latitude': '50.775', 'longitude': '6.08333'} \n",
"1 56.183330 10.233330 {'latitude': '56.18333', 'longitude': '10.23333'} \n",
"2 54.216670 -113.000000 {'latitude': '54.21667', 'longitude': '-113.0'} \n",
"\n",
" :@computed_region_cbhk_fwbd :@computed_region_nnqa_25f4 \n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(payload)\n",
"df.head(3)"
]
},
{
"cell_type": "markdown",
"id": "09f139e3-ae7a-4f53-96ae-5f8de4e641d2",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"*Tip: `df.to_csv('data.csv')` writes this data to a new file called `data.csv`.*"
]
},
{
"cell_type": "markdown",
"id": "0a263d5a-d2c5-46c2-990c-3894b3ed39bf",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Inspecting the data\n",
"Now that we have some data, we need to perform an initial inspection of it. This gives us information on what the data looks like, how many rows/columns there are, and how much data we have. "
]
},
{
"cell_type": "markdown",
"id": "f3e541dd-73d2-4f40-82e0-b8a59aa48b04",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"Let's inspect the `meteorites` data."
]
},
{
"cell_type": "markdown",
"id": "0246c277-5638-49a9-a7e1-6e82b1c15265",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### How many rows and columns are there?"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d968d99d-dbf8-4402-a556-4200e13afe72",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(45716, 10)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.shape"
]
},
{
"cell_type": "markdown",
"id": "8ee994e7-25e0-4062-ae29-fadb0f698c88",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"#### What are the column names?"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "24045c3d-085f-485d-b5a7-422037b24272",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year',\n",
" 'reclat', 'reclong', 'GeoLocation'],\n",
" dtype='object')"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.columns"
]
},
{
"cell_type": "markdown",
"id": "3b138718-9c73-40f0-b4d2-7280a688fa89",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### What type of data does each column currently hold?"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "408c70cc-f900-45d2-828d-827ba6dd4123",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"name object\n",
"id int64\n",
"nametype object\n",
"recclass object\n",
"mass (g) float64\n",
"fall object\n",
"year object\n",
"reclat float64\n",
"reclong float64\n",
"GeoLocation object\n",
"dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.dtypes"
]
},
{
"cell_type": "markdown",
"id": "d6011183-5228-4f1c-a7c4-3f4e1cc9a4db",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### What does the data look like?"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "3de2f605-ca0a-4e5e-9add-76309296adf5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" id | \n",
" nametype | \n",
" recclass | \n",
" mass (g) | \n",
" fall | \n",
" year | \n",
" reclat | \n",
" reclong | \n",
" GeoLocation | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Aachen | \n",
" 1 | \n",
" Valid | \n",
" L5 | \n",
" 21.0 | \n",
" Fell | \n",
" 01/01/1880 12:00:00 AM | \n",
" 50.77500 | \n",
" 6.08333 | \n",
" (50.775, 6.08333) | \n",
"
\n",
" \n",
" | 1 | \n",
" Aarhus | \n",
" 2 | \n",
" Valid | \n",
" H6 | \n",
" 720.0 | \n",
" Fell | \n",
" 01/01/1951 12:00:00 AM | \n",
" 56.18333 | \n",
" 10.23333 | \n",
" (56.18333, 10.23333) | \n",
"
\n",
" \n",
" | 2 | \n",
" Abee | \n",
" 6 | \n",
" Valid | \n",
" EH4 | \n",
" 107000.0 | \n",
" Fell | \n",
" 01/01/1952 12:00:00 AM | \n",
" 54.21667 | \n",
" -113.00000 | \n",
" (54.21667, -113.0) | \n",
"
\n",
" \n",
" | 3 | \n",
" Acapulco | \n",
" 10 | \n",
" Valid | \n",
" Acapulcoite | \n",
" 1914.0 | \n",
" Fell | \n",
" 01/01/1976 12:00:00 AM | \n",
" 16.88333 | \n",
" -99.90000 | \n",
" (16.88333, -99.9) | \n",
"
\n",
" \n",
" | 4 | \n",
" Achiras | \n",
" 370 | \n",
" Valid | \n",
" L6 | \n",
" 780.0 | \n",
" Fell | \n",
" 01/01/1902 12:00:00 AM | \n",
" -33.16667 | \n",
" -64.95000 | \n",
" (-33.16667, -64.95) | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name id nametype recclass mass (g) fall \\\n",
"0 Aachen 1 Valid L5 21.0 Fell \n",
"1 Aarhus 2 Valid H6 720.0 Fell \n",
"2 Abee 6 Valid EH4 107000.0 Fell \n",
"3 Acapulco 10 Valid Acapulcoite 1914.0 Fell \n",
"4 Achiras 370 Valid L6 780.0 Fell \n",
"\n",
" year reclat reclong GeoLocation \n",
"0 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333) \n",
"1 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333) \n",
"2 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0) \n",
"3 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9) \n",
"4 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95) "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.head()"
]
},
{
"cell_type": "markdown",
"id": "54e767f9-adb9-4921-86be-75426fb13846",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"Sometimes there may be extraneous data at the end of the file, so checking the bottom few rows is also important:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "3b7f694d-c34c-4042-a051-07dc8759a11c",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" id | \n",
" nametype | \n",
" recclass | \n",
" mass (g) | \n",
" fall | \n",
" year | \n",
" reclat | \n",
" reclong | \n",
" GeoLocation | \n",
"
\n",
" \n",
" \n",
" \n",
" | 45711 | \n",
" Zillah 002 | \n",
" 31356 | \n",
" Valid | \n",
" Eucrite | \n",
" 172.0 | \n",
" Found | \n",
" 01/01/1990 12:00:00 AM | \n",
" 29.03700 | \n",
" 17.01850 | \n",
" (29.037, 17.0185) | \n",
"
\n",
" \n",
" | 45712 | \n",
" Zinder | \n",
" 30409 | \n",
" Valid | \n",
" Pallasite, ungrouped | \n",
" 46.0 | \n",
" Found | \n",
" 01/01/1999 12:00:00 AM | \n",
" 13.78333 | \n",
" 8.96667 | \n",
" (13.78333, 8.96667) | \n",
"
\n",
" \n",
" | 45713 | \n",
" Zlin | \n",
" 30410 | \n",
" Valid | \n",
" H4 | \n",
" 3.3 | \n",
" Found | \n",
" 01/01/1939 12:00:00 AM | \n",
" 49.25000 | \n",
" 17.66667 | \n",
" (49.25, 17.66667) | \n",
"
\n",
" \n",
" | 45714 | \n",
" Zubkovsky | \n",
" 31357 | \n",
" Valid | \n",
" L6 | \n",
" 2167.0 | \n",
" Found | \n",
" 01/01/2003 12:00:00 AM | \n",
" 49.78917 | \n",
" 41.50460 | \n",
" (49.78917, 41.5046) | \n",
"
\n",
" \n",
" | 45715 | \n",
" Zulu Queen | \n",
" 30414 | \n",
" Valid | \n",
" L3.7 | \n",
" 200.0 | \n",
" Found | \n",
" 01/01/1976 12:00:00 AM | \n",
" 33.98333 | \n",
" -115.68333 | \n",
" (33.98333, -115.68333) | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name id nametype recclass mass (g) fall \\\n",
"45711 Zillah 002 31356 Valid Eucrite 172.0 Found \n",
"45712 Zinder 30409 Valid Pallasite, ungrouped 46.0 Found \n",
"45713 Zlin 30410 Valid H4 3.3 Found \n",
"45714 Zubkovsky 31357 Valid L6 2167.0 Found \n",
"45715 Zulu Queen 30414 Valid L3.7 200.0 Found \n",
"\n",
" year reclat reclong GeoLocation \n",
"45711 01/01/1990 12:00:00 AM 29.03700 17.01850 (29.037, 17.0185) \n",
"45712 01/01/1999 12:00:00 AM 13.78333 8.96667 (13.78333, 8.96667) \n",
"45713 01/01/1939 12:00:00 AM 49.25000 17.66667 (49.25, 17.66667) \n",
"45714 01/01/2003 12:00:00 AM 49.78917 41.50460 (49.78917, 41.5046) \n",
"45715 01/01/1976 12:00:00 AM 33.98333 -115.68333 (33.98333, -115.68333) "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.tail()"
]
},
{
"cell_type": "markdown",
"id": "3f4989da-09bd-4f22-bb5a-b69d57e07b00",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### Get some information about the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "acbaa48e-0ebc-4412-b5ff-bded4f43e806",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 45716 entries, 0 to 45715\n",
"Data columns (total 10 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 name 45716 non-null object \n",
" 1 id 45716 non-null int64 \n",
" 2 nametype 45716 non-null object \n",
" 3 recclass 45716 non-null object \n",
" 4 mass (g) 45585 non-null float64\n",
" 5 fall 45716 non-null object \n",
" 6 year 45425 non-null object \n",
" 7 reclat 38401 non-null float64\n",
" 8 reclong 38401 non-null float64\n",
" 9 GeoLocation 38401 non-null object \n",
"dtypes: float64(3), int64(1), object(6)\n",
"memory usage: 3.5+ MB\n"
]
}
],
"source": [
"meteorites.info()"
]
},
{
"cell_type": "markdown",
"id": "462f9df6-c1fc-4eaa-9c43-e810badb2257",
"metadata": {},
"source": [
"### [Exercise 1.1](./workbook.ipynb#Exercise-1.1)\n",
"\n",
"##### Create a DataFrame by reading in the `2019_Yellow_Taxi_Trip_Data.csv` file. Examine the first 5 rows."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "317e4c45-7afe-425f-a7a2-cad55f20a91e",
"metadata": {},
"outputs": [],
"source": [
"# Complete this exercise in the workbook.ipynb file\n",
"# Click on `Exercise 1.1` above to open the workbook.ipynb file\n",
"\n",
"# WARNING: if you complete the exercise here, your cell numbers\n",
"# for the rest of the training might not match the slides"
]
},
{
"cell_type": "markdown",
"id": "230a3c7f-8907-43d1-b7e0-9fe678c04d17",
"metadata": {},
"source": [
"### [Exercise 1.2](./workbook.ipynb#Exercise-1.2)\n",
"\n",
"##### Find the dimensions (number of rows and number of columns) in the data."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "ceab7259-c412-4f8b-b0be-e3cc801836b0",
"metadata": {},
"outputs": [],
"source": [
"# Complete this exercise in the workbook.ipynb file\n",
"# Click on `Exercise 1.2` above to open the workbook.ipynb file\n",
"\n",
"# WARNING: if you complete the exercise here, your cell numbers\n",
"# for the rest of the training might not match the slides"
]
},
{
"cell_type": "markdown",
"id": "0d63f1c9-2b23-4585-8df0-358159b26801",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Extracting subsets\n",
"\n",
"A crucial part of working with DataFrames is extracting subsets of the data: finding rows that meet a certain set of criteria, isolating columns/rows of interest, etc. After narrowing down our data, we are closer to discovering insights. This section will be the backbone of many analysis tasks."
]
},
{
"cell_type": "markdown",
"id": "9c1a3793-29f7-4c0e-90d3-230475c611f7",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### Selecting columns\n",
"\n",
"We can select columns as attributes if their names would be valid Python variables:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "2a084a5d-3d1c-4bef-9704-7c7ec69a73b1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Aachen\n",
"1 Aarhus\n",
"2 Abee\n",
"3 Acapulco\n",
"4 Achiras\n",
" ... \n",
"45711 Zillah 002\n",
"45712 Zinder\n",
"45713 Zlin\n",
"45714 Zubkovsky\n",
"45715 Zulu Queen\n",
"Name: name, Length: 45716, dtype: object"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.name"
]
},
{
"cell_type": "markdown",
"id": "ac335459-7a87-443f-b349-7a7b826a5ffa",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"If they aren't, we have to select them as keys. However, we can select multiple columns at once this way:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "df5c98ef-0227-44e4-8038-de151a9d8717",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" mass (g) | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Aachen | \n",
" 21.0 | \n",
"
\n",
" \n",
" | 1 | \n",
" Aarhus | \n",
" 720.0 | \n",
"
\n",
" \n",
" | 2 | \n",
" Abee | \n",
" 107000.0 | \n",
"
\n",
" \n",
" | 3 | \n",
" Acapulco | \n",
" 1914.0 | \n",
"
\n",
" \n",
" | 4 | \n",
" Achiras | \n",
" 780.0 | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | 45711 | \n",
" Zillah 002 | \n",
" 172.0 | \n",
"
\n",
" \n",
" | 45712 | \n",
" Zinder | \n",
" 46.0 | \n",
"
\n",
" \n",
" | 45713 | \n",
" Zlin | \n",
" 3.3 | \n",
"
\n",
" \n",
" | 45714 | \n",
" Zubkovsky | \n",
" 2167.0 | \n",
"
\n",
" \n",
" | 45715 | \n",
" Zulu Queen | \n",
" 200.0 | \n",
"
\n",
" \n",
"
\n",
"
45716 rows × 2 columns
\n",
"
"
],
"text/plain": [
" name mass (g)\n",
"0 Aachen 21.0\n",
"1 Aarhus 720.0\n",
"2 Abee 107000.0\n",
"3 Acapulco 1914.0\n",
"4 Achiras 780.0\n",
"... ... ...\n",
"45711 Zillah 002 172.0\n",
"45712 Zinder 46.0\n",
"45713 Zlin 3.3\n",
"45714 Zubkovsky 2167.0\n",
"45715 Zulu Queen 200.0\n",
"\n",
"[45716 rows x 2 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites[['name', 'mass (g)']]"
]
},
{
"cell_type": "markdown",
"id": "c3c390e8-1199-4cfe-a8f9-6998ddfa9f14",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### Selecting rows"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "c1a63e64-8a01-4762-b1c2-79b2f1068ea9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" id | \n",
" nametype | \n",
" recclass | \n",
" mass (g) | \n",
" fall | \n",
" year | \n",
" reclat | \n",
" reclong | \n",
" GeoLocation | \n",
"
\n",
" \n",
" \n",
" \n",
" | 100 | \n",
" Benton | \n",
" 5026 | \n",
" Valid | \n",
" LL6 | \n",
" 2840.0 | \n",
" Fell | \n",
" 01/01/1949 12:00:00 AM | \n",
" 45.95000 | \n",
" -67.55000 | \n",
" (45.95, -67.55) | \n",
"
\n",
" \n",
" | 101 | \n",
" Berduc | \n",
" 48975 | \n",
" Valid | \n",
" L6 | \n",
" 270.0 | \n",
" Fell | \n",
" 01/01/2008 12:00:00 AM | \n",
" -31.91000 | \n",
" -58.32833 | \n",
" (-31.91, -58.32833) | \n",
"
\n",
" \n",
" | 102 | \n",
" Béréba | \n",
" 5028 | \n",
" Valid | \n",
" Eucrite-mmict | \n",
" 18000.0 | \n",
" Fell | \n",
" 01/01/1924 12:00:00 AM | \n",
" 11.65000 | \n",
" -3.65000 | \n",
" (11.65, -3.65) | \n",
"
\n",
" \n",
" | 103 | \n",
" Berlanguillas | \n",
" 5029 | \n",
" Valid | \n",
" L6 | \n",
" 1440.0 | \n",
" Fell | \n",
" 01/01/1811 12:00:00 AM | \n",
" 41.68333 | \n",
" -3.80000 | \n",
" (41.68333, -3.8) | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name id nametype recclass mass (g) fall \\\n",
"100 Benton 5026 Valid LL6 2840.0 Fell \n",
"101 Berduc 48975 Valid L6 270.0 Fell \n",
"102 Béréba 5028 Valid Eucrite-mmict 18000.0 Fell \n",
"103 Berlanguillas 5029 Valid L6 1440.0 Fell \n",
"\n",
" year reclat reclong GeoLocation \n",
"100 01/01/1949 12:00:00 AM 45.95000 -67.55000 (45.95, -67.55) \n",
"101 01/01/2008 12:00:00 AM -31.91000 -58.32833 (-31.91, -58.32833) \n",
"102 01/01/1924 12:00:00 AM 11.65000 -3.65000 (11.65, -3.65) \n",
"103 01/01/1811 12:00:00 AM 41.68333 -3.80000 (41.68333, -3.8) "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites[100:104]"
]
},
{
"cell_type": "markdown",
"id": "84a61a58-50ca-4f10-ad26-36db5022ee9f",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### Indexing\n",
"\n",
"We use `iloc[]` to select rows and columns by their position:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "50087ab8-2a1d-4312-a7c7-791bc7c35e9b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" recclass | \n",
" mass (g) | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" | 100 | \n",
" Benton | \n",
" LL6 | \n",
" 2840.0 | \n",
" 01/01/1949 12:00:00 AM | \n",
"
\n",
" \n",
" | 101 | \n",
" Berduc | \n",
" L6 | \n",
" 270.0 | \n",
" 01/01/2008 12:00:00 AM | \n",
"
\n",
" \n",
" | 102 | \n",
" Béréba | \n",
" Eucrite-mmict | \n",
" 18000.0 | \n",
" 01/01/1924 12:00:00 AM | \n",
"
\n",
" \n",
" | 103 | \n",
" Berlanguillas | \n",
" L6 | \n",
" 1440.0 | \n",
" 01/01/1811 12:00:00 AM | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name recclass mass (g) year\n",
"100 Benton LL6 2840.0 01/01/1949 12:00:00 AM\n",
"101 Berduc L6 270.0 01/01/2008 12:00:00 AM\n",
"102 Béréba Eucrite-mmict 18000.0 01/01/1924 12:00:00 AM\n",
"103 Berlanguillas L6 1440.0 01/01/1811 12:00:00 AM"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.iloc[100:104, [0, 3, 4, 6]]"
]
},
{
"cell_type": "markdown",
"id": "1798f9cc-8f05-4147-a351-cb3f9f6b765c",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"We use `loc[]` to select by name:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "eb7a1cdb-e0cc-4c58-ae96-1dabe4d5fad7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mass (g) | \n",
" fall | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" | 100 | \n",
" 2840.0 | \n",
" Fell | \n",
" 01/01/1949 12:00:00 AM | \n",
"
\n",
" \n",
" | 101 | \n",
" 270.0 | \n",
" Fell | \n",
" 01/01/2008 12:00:00 AM | \n",
"
\n",
" \n",
" | 102 | \n",
" 18000.0 | \n",
" Fell | \n",
" 01/01/1924 12:00:00 AM | \n",
"
\n",
" \n",
" | 103 | \n",
" 1440.0 | \n",
" Fell | \n",
" 01/01/1811 12:00:00 AM | \n",
"
\n",
" \n",
" | 104 | \n",
" 960.0 | \n",
" Fell | \n",
" 01/01/2004 12:00:00 AM | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mass (g) fall year\n",
"100 2840.0 Fell 01/01/1949 12:00:00 AM\n",
"101 270.0 Fell 01/01/2008 12:00:00 AM\n",
"102 18000.0 Fell 01/01/1924 12:00:00 AM\n",
"103 1440.0 Fell 01/01/1811 12:00:00 AM\n",
"104 960.0 Fell 01/01/2004 12:00:00 AM"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.loc[100:104, 'mass (g)':'year']"
]
},
{
"cell_type": "markdown",
"id": "d601db18-6091-4234-916c-f5d5e6f8fd86",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### Filtering with Boolean masks\n",
"\n",
"A **Boolean mask** is a array-like structure of Boolean values – it's a way to specify which rows/columns we want to select (`True`) and which we don't (`False`)."
]
},
{
"cell_type": "markdown",
"id": "753004e2-c9ef-4dbf-86cc-fd5d48babeb2",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"Here's an example of a Boolean mask for meteorites weighing more than 50 grams that were found on Earth (i.e., they were not observed falling):"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "1f59f4cb-9f82-42ef-8c5f-c42d0f48447d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
" ... \n",
"45711 True\n",
"45712 False\n",
"45713 False\n",
"45714 True\n",
"45715 True\n",
"Length: 45716, dtype: bool"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(meteorites['mass (g)'] > 50) & (meteorites.fall == 'Found')"
]
},
{
"cell_type": "markdown",
"id": "26a1b2dd-08fc-40bc-a260-6445edc31382",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"**Important**: Take note of the syntax here. We surround each condition with parentheses, and we use bitwise operators (`&`, `|`, `~`) instead of logical operators (`and`, `or`, `not`)."
]
},
{
"cell_type": "markdown",
"id": "0086f863-cd77-4e0d-bea7-364b18d4ed6a",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"We can use a Boolean mask to select the subset of meteorites weighing more than 1 million grams (1,000 kilograms or roughly 2,205 pounds) that were observed falling:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "46727a7f-d5e7-4de9-b69a-762413de9567",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" id | \n",
" nametype | \n",
" recclass | \n",
" mass (g) | \n",
" fall | \n",
" year | \n",
" reclat | \n",
" reclong | \n",
" GeoLocation | \n",
"
\n",
" \n",
" \n",
" \n",
" | 29 | \n",
" Allende | \n",
" 2278 | \n",
" Valid | \n",
" CV3 | \n",
" 2000000.0 | \n",
" Fell | \n",
" 01/01/1969 12:00:00 AM | \n",
" 26.96667 | \n",
" -105.31667 | \n",
" (26.96667, -105.31667) | \n",
"
\n",
" \n",
" | 419 | \n",
" Jilin | \n",
" 12171 | \n",
" Valid | \n",
" H5 | \n",
" 4000000.0 | \n",
" Fell | \n",
" 01/01/1976 12:00:00 AM | \n",
" 44.05000 | \n",
" 126.16667 | \n",
" (44.05, 126.16667) | \n",
"
\n",
" \n",
" | 506 | \n",
" Kunya-Urgench | \n",
" 12379 | \n",
" Valid | \n",
" H5 | \n",
" 1100000.0 | \n",
" Fell | \n",
" 01/01/1998 12:00:00 AM | \n",
" 42.25000 | \n",
" 59.20000 | \n",
" (42.25, 59.2) | \n",
"
\n",
" \n",
" | 707 | \n",
" Norton County | \n",
" 17922 | \n",
" Valid | \n",
" Aubrite | \n",
" 1100000.0 | \n",
" Fell | \n",
" 01/01/1948 12:00:00 AM | \n",
" 39.68333 | \n",
" -99.86667 | \n",
" (39.68333, -99.86667) | \n",
"
\n",
" \n",
" | 920 | \n",
" Sikhote-Alin | \n",
" 23593 | \n",
" Valid | \n",
" Iron, IIAB | \n",
" 23000000.0 | \n",
" Fell | \n",
" 01/01/1947 12:00:00 AM | \n",
" 46.16000 | \n",
" 134.65333 | \n",
" (46.16, 134.65333) | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name id nametype recclass mass (g) fall \\\n",
"29 Allende 2278 Valid CV3 2000000.0 Fell \n",
"419 Jilin 12171 Valid H5 4000000.0 Fell \n",
"506 Kunya-Urgench 12379 Valid H5 1100000.0 Fell \n",
"707 Norton County 17922 Valid Aubrite 1100000.0 Fell \n",
"920 Sikhote-Alin 23593 Valid Iron, IIAB 23000000.0 Fell \n",
"\n",
" year reclat reclong GeoLocation \n",
"29 01/01/1969 12:00:00 AM 26.96667 -105.31667 (26.96667, -105.31667) \n",
"419 01/01/1976 12:00:00 AM 44.05000 126.16667 (44.05, 126.16667) \n",
"506 01/01/1998 12:00:00 AM 42.25000 59.20000 (42.25, 59.2) \n",
"707 01/01/1948 12:00:00 AM 39.68333 -99.86667 (39.68333, -99.86667) \n",
"920 01/01/1947 12:00:00 AM 46.16000 134.65333 (46.16, 134.65333) "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites[(meteorites['mass (g)'] > 1e6) & (meteorites.fall == 'Fell')]"
]
},
{
"cell_type": "markdown",
"id": "055cc350-e9e0-42b2-839b-4ad45312e172",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"*Tip: Boolean masks can be used with `loc[]` and `iloc[]`.*"
]
},
{
"cell_type": "markdown",
"id": "b4e791ea-92e5-4f30-974c-63b9195af1c2",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"An alternative to this is the `query()` method:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "bde8231a-3f4a-43e1-a90e-051a125a00af",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" id | \n",
" nametype | \n",
" recclass | \n",
" mass (g) | \n",
" fall | \n",
" year | \n",
" reclat | \n",
" reclong | \n",
" GeoLocation | \n",
"
\n",
" \n",
" \n",
" \n",
" | 29 | \n",
" Allende | \n",
" 2278 | \n",
" Valid | \n",
" CV3 | \n",
" 2000000.0 | \n",
" Fell | \n",
" 01/01/1969 12:00:00 AM | \n",
" 26.96667 | \n",
" -105.31667 | \n",
" (26.96667, -105.31667) | \n",
"
\n",
" \n",
" | 419 | \n",
" Jilin | \n",
" 12171 | \n",
" Valid | \n",
" H5 | \n",
" 4000000.0 | \n",
" Fell | \n",
" 01/01/1976 12:00:00 AM | \n",
" 44.05000 | \n",
" 126.16667 | \n",
" (44.05, 126.16667) | \n",
"
\n",
" \n",
" | 506 | \n",
" Kunya-Urgench | \n",
" 12379 | \n",
" Valid | \n",
" H5 | \n",
" 1100000.0 | \n",
" Fell | \n",
" 01/01/1998 12:00:00 AM | \n",
" 42.25000 | \n",
" 59.20000 | \n",
" (42.25, 59.2) | \n",
"
\n",
" \n",
" | 707 | \n",
" Norton County | \n",
" 17922 | \n",
" Valid | \n",
" Aubrite | \n",
" 1100000.0 | \n",
" Fell | \n",
" 01/01/1948 12:00:00 AM | \n",
" 39.68333 | \n",
" -99.86667 | \n",
" (39.68333, -99.86667) | \n",
"
\n",
" \n",
" | 920 | \n",
" Sikhote-Alin | \n",
" 23593 | \n",
" Valid | \n",
" Iron, IIAB | \n",
" 23000000.0 | \n",
" Fell | \n",
" 01/01/1947 12:00:00 AM | \n",
" 46.16000 | \n",
" 134.65333 | \n",
" (46.16, 134.65333) | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name id nametype recclass mass (g) fall \\\n",
"29 Allende 2278 Valid CV3 2000000.0 Fell \n",
"419 Jilin 12171 Valid H5 4000000.0 Fell \n",
"506 Kunya-Urgench 12379 Valid H5 1100000.0 Fell \n",
"707 Norton County 17922 Valid Aubrite 1100000.0 Fell \n",
"920 Sikhote-Alin 23593 Valid Iron, IIAB 23000000.0 Fell \n",
"\n",
" year reclat reclong GeoLocation \n",
"29 01/01/1969 12:00:00 AM 26.96667 -105.31667 (26.96667, -105.31667) \n",
"419 01/01/1976 12:00:00 AM 44.05000 126.16667 (44.05, 126.16667) \n",
"506 01/01/1998 12:00:00 AM 42.25000 59.20000 (42.25, 59.2) \n",
"707 01/01/1948 12:00:00 AM 39.68333 -99.86667 (39.68333, -99.86667) \n",
"920 01/01/1947 12:00:00 AM 46.16000 134.65333 (46.16, 134.65333) "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.query(\"`mass (g)` > 1e6 and fall == 'Fell'\")"
]
},
{
"cell_type": "markdown",
"id": "869e853c-9073-49fb-beb6-05a84c882d54",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"*Tip: Here, we can use both logical operators and bitwise operators.*"
]
},
{
"cell_type": "markdown",
"id": "619cd5be-db0e-4ce6-99fb-1ebe6328e693",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Calculating summary statistics\n",
"\n",
"In the next section of this workshop, we will discuss data cleaning for a more meaningful analysis of our datasets; however, we can already extract some interesting insights from the `meteorites` data by calculating summary statistics."
]
},
{
"cell_type": "markdown",
"id": "2e45f672-cb84-4718-a9f1-6e7abf16252e",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### How many of the meteorites were found versus observed falling?"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "01433b12-c3f2-4680-a654-bb20bc8c2157",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"fall\n",
"Found 44609\n",
"Fell 1107\n",
"Name: count, dtype: int64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.fall.value_counts()"
]
},
{
"cell_type": "markdown",
"id": "9fb5584e-92f2-4b30-a7bd-1ed9edd28db2",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"*Tip: Pass in `normalize=True` to see this result as percentages. Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) for additional functionality.*"
]
},
{
"cell_type": "markdown",
"id": "ef8f5c35-57e4-4347-845e-39fa4de28075",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### What was the mass of the average meterorite?"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "b998653e-02dd-4540-ba66-dbc4b458b558",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(13278.078548601512)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites['mass (g)'].mean()"
]
},
{
"cell_type": "markdown",
"id": "a398ecbe-10cc-4498-a7f1-91ea0bc736d2",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"**Important**: The mean isn't always the best measure of central tendency. If there are outliers in the distribution, the mean will be skewed. Here, the mean is being pulled higher by some very heavy meteorites – the distribution is [right-skewed](https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/)."
]
},
{
"cell_type": "markdown",
"id": "7b0162c6-f48f-4687-9902-72325ebecc0d",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"Taking a look at some quantiles at the extremes of the distribution shows that the mean is between the 95th and 99th percentile of the distribution, so it isn't a good measure of central tendency here:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "b7379492-da17-4358-b357-2ae6e1a26e67",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.01 0.44\n",
"0.05 1.10\n",
"0.50 32.60\n",
"0.95 4000.00\n",
"0.99 50600.00\n",
"Name: mass (g), dtype: float64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])"
]
},
{
"cell_type": "markdown",
"id": "2ca1c739-cf2b-4000-bedb-b66a3d11f071",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"A better measure in this case is the median (50th percentile), since it is robust to outliers:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "bc2e62f3-899d-4a50-a2f4-8b2e73e1bc2f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(32.6)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites['mass (g)'].median()"
]
},
{
"cell_type": "markdown",
"id": "eebd8f14-f25b-453e-8d74-6e4b4d57aea2",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### What was the mass of the heaviest meteorite?"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "585af605-e601-49b6-bd1f-4838ab993302",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(60000000.0)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites['mass (g)'].max()"
]
},
{
"cell_type": "markdown",
"id": "6cb28eaf-0850-48f6-87c9-db3d8327e759",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"Let's extract the information on this meteorite:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "29720ccc-3855-42f7-a0d0-e41a83cf1bef",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"name Hoba\n",
"id 11890\n",
"nametype Valid\n",
"recclass Iron, IVB\n",
"mass (g) 60000000.0\n",
"fall Found\n",
"year 01/01/1920 12:00:00 AM\n",
"reclat -19.58333\n",
"reclong 17.91667\n",
"GeoLocation (-19.58333, 17.91667)\n",
"Name: 16392, dtype: object"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.loc[meteorites['mass (g)'].idxmax()]"
]
},
{
"cell_type": "markdown",
"id": "f2deea52-a1f9-4b5f-80bb-5074862783ec",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### How many different types of meteorite classes are represented in this dataset?"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "79c2a1db-0eeb-4173-964a-a38741c059ba",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"466"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.recclass.nunique()"
]
},
{
"cell_type": "markdown",
"id": "0c0929d1-d296-4e11-89fb-7f3e707be732",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"Some examples:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "3ac57de5-7734-478a-9772-feb82890d5ef",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['L5', 'H6', 'EH4', 'Acapulcoite', 'L6', 'LL3-6', 'H5', 'L',\n",
" 'Diogenite-pm', 'Unknown', 'H4', 'H', 'Iron, IVA', 'CR2-an'],\n",
" dtype=object)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.recclass.unique()[:14]"
]
},
{
"cell_type": "markdown",
"id": "1e0eecb1-a224-44b4-bde2-79d794a7eca8",
"metadata": {},
"source": [
"*Note: All fields preceded with \"rec\" are the values recommended by The Meteoritical Society. Check out [this Wikipedia article](https://en.wikipedia.org/wiki/Meteorite_classification) for some information on meteorite classes.*"
]
},
{
"cell_type": "markdown",
"id": "a0755e36-3761-4eb1-a91d-0d6abe9e672f",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### Get some summary statistics on the data itself\n",
"We can get common summary statistics for all columns at once. By default, this will only be numeric columns, but here, we will summarize everything together:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "f0297d45-1d86-411f-ad8e-74cfaa3b2389",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" id | \n",
" nametype | \n",
" recclass | \n",
" mass (g) | \n",
" fall | \n",
" year | \n",
" reclat | \n",
" reclong | \n",
" GeoLocation | \n",
"
\n",
" \n",
" \n",
" \n",
" | count | \n",
" 45716 | \n",
" 45716.000000 | \n",
" 45716 | \n",
" 45716 | \n",
" 4.558500e+04 | \n",
" 45716 | \n",
" 45425 | \n",
" 38401.000000 | \n",
" 38401.000000 | \n",
" 38401 | \n",
"
\n",
" \n",
" | unique | \n",
" 45716 | \n",
" NaN | \n",
" 2 | \n",
" 466 | \n",
" NaN | \n",
" 2 | \n",
" 266 | \n",
" NaN | \n",
" NaN | \n",
" 17100 | \n",
"
\n",
" \n",
" | top | \n",
" Aachen | \n",
" NaN | \n",
" Valid | \n",
" L6 | \n",
" NaN | \n",
" Found | \n",
" 01/01/2003 12:00:00 AM | \n",
" NaN | \n",
" NaN | \n",
" (0.0, 0.0) | \n",
"
\n",
" \n",
" | freq | \n",
" 1 | \n",
" NaN | \n",
" 45641 | \n",
" 8285 | \n",
" NaN | \n",
" 44609 | \n",
" 3323 | \n",
" NaN | \n",
" NaN | \n",
" 6214 | \n",
"
\n",
" \n",
" | mean | \n",
" NaN | \n",
" 26889.735104 | \n",
" NaN | \n",
" NaN | \n",
" 1.327808e+04 | \n",
" NaN | \n",
" NaN | \n",
" -39.122580 | \n",
" 61.074319 | \n",
" NaN | \n",
"
\n",
" \n",
" | std | \n",
" NaN | \n",
" 16860.683030 | \n",
" NaN | \n",
" NaN | \n",
" 5.749889e+05 | \n",
" NaN | \n",
" NaN | \n",
" 46.378511 | \n",
" 80.647298 | \n",
" NaN | \n",
"
\n",
" \n",
" | min | \n",
" NaN | \n",
" 1.000000 | \n",
" NaN | \n",
" NaN | \n",
" 0.000000e+00 | \n",
" NaN | \n",
" NaN | \n",
" -87.366670 | \n",
" -165.433330 | \n",
" NaN | \n",
"
\n",
" \n",
" | 25% | \n",
" NaN | \n",
" 12688.750000 | \n",
" NaN | \n",
" NaN | \n",
" 7.200000e+00 | \n",
" NaN | \n",
" NaN | \n",
" -76.714240 | \n",
" 0.000000 | \n",
" NaN | \n",
"
\n",
" \n",
" | 50% | \n",
" NaN | \n",
" 24261.500000 | \n",
" NaN | \n",
" NaN | \n",
" 3.260000e+01 | \n",
" NaN | \n",
" NaN | \n",
" -71.500000 | \n",
" 35.666670 | \n",
" NaN | \n",
"
\n",
" \n",
" | 75% | \n",
" NaN | \n",
" 40656.750000 | \n",
" NaN | \n",
" NaN | \n",
" 2.026000e+02 | \n",
" NaN | \n",
" NaN | \n",
" 0.000000 | \n",
" 157.166670 | \n",
" NaN | \n",
"
\n",
" \n",
" | max | \n",
" NaN | \n",
" 57458.000000 | \n",
" NaN | \n",
" NaN | \n",
" 6.000000e+07 | \n",
" NaN | \n",
" NaN | \n",
" 81.166670 | \n",
" 354.473330 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name id nametype recclass mass (g) fall \\\n",
"count 45716 45716.000000 45716 45716 4.558500e+04 45716 \n",
"unique 45716 NaN 2 466 NaN 2 \n",
"top Aachen NaN Valid L6 NaN Found \n",
"freq 1 NaN 45641 8285 NaN 44609 \n",
"mean NaN 26889.735104 NaN NaN 1.327808e+04 NaN \n",
"std NaN 16860.683030 NaN NaN 5.749889e+05 NaN \n",
"min NaN 1.000000 NaN NaN 0.000000e+00 NaN \n",
"25% NaN 12688.750000 NaN NaN 7.200000e+00 NaN \n",
"50% NaN 24261.500000 NaN NaN 3.260000e+01 NaN \n",
"75% NaN 40656.750000 NaN NaN 2.026000e+02 NaN \n",
"max NaN 57458.000000 NaN NaN 6.000000e+07 NaN \n",
"\n",
" year reclat reclong GeoLocation \n",
"count 45425 38401.000000 38401.000000 38401 \n",
"unique 266 NaN NaN 17100 \n",
"top 01/01/2003 12:00:00 AM NaN NaN (0.0, 0.0) \n",
"freq 3323 NaN NaN 6214 \n",
"mean NaN -39.122580 61.074319 NaN \n",
"std NaN 46.378511 80.647298 NaN \n",
"min NaN -87.366670 -165.433330 NaN \n",
"25% NaN -76.714240 0.000000 NaN \n",
"50% NaN -71.500000 35.666670 NaN \n",
"75% NaN 0.000000 157.166670 NaN \n",
"max NaN 81.166670 354.473330 NaN "
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meteorites.describe(include='all')"
]
},
{
"cell_type": "markdown",
"id": "96e7cd90-e515-4068-ae66-f4d2bcab6712",
"metadata": {},
"source": [
"**Important**: `NaN` values signify missing data. For instance, the `fall` column contains strings, so there is no value for `mean`; likewise, `mass (g)` is numeric, so we don't have entries for the categorical summary statistics (`unique`, `top`, `freq`)."
]
},
{
"cell_type": "markdown",
"id": "7b290818-09d8-4646-94e3-b14f88d25b43",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### Check out the documentation for more descriptive statistics:\n",
"\n",
"- [Series](https://pandas.pydata.org/docs/reference/series.html#computations-descriptive-stats)\n",
"- [DataFrame](https://pandas.pydata.org/docs/reference/frame.html#computations-descriptive-stats)"
]
},
{
"cell_type": "markdown",
"id": "5f26b130-6bc5-47ca-81f2-b977cfd8ef85",
"metadata": {},
"source": [
"### [Exercise 1.3](./workbook.ipynb#Exercise-1.3)\n",
"\n",
"##### Using the data in the `2019_Yellow_Taxi_Trip_Data.csv` file, calculate summary statistics for the `fare_amount`, `tip_amount`, `tolls_amount`, and `total_amount` columns."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "876cafcb-00ab-4f5a-8b3c-bfead4f0b14c",
"metadata": {},
"outputs": [],
"source": [
"# Complete this exercise in the workbook.ipynb file\n",
"# Click on `Exercise 1.3` above to open the workbook.ipynb file"
]
},
{
"cell_type": "markdown",
"id": "19fa9e6a-d1e5-4202-ba94-6fa5c8bbe5f7",
"metadata": {},
"source": [
"### [Exercise 1.4](./workbook.ipynb#Exercise-1.4)\n",
"\n",
"##### Isolate the `fare_amount`, `tip_amount`, `tolls_amount`, and `total_amount` for the longest trip by distance (`trip_distance`)."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "6402bb24-3da9-48e5-bde1-0b8a9576f00d",
"metadata": {},
"outputs": [],
"source": [
"# Complete this exercise in the workbook.ipynb file\n",
"# Click on `Exercise 1.4` above to open the workbook.ipynb file"
]
},
{
"cell_type": "markdown",
"id": "68c8ccd0-3a3d-4fab-86c1-fd5a136f0540",
"metadata": {},
"source": [
"## Up Next: [Data Wrangling](./2-data_wrangling.ipynb)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}