{ "cells": [ { "cell_type": "markdown", "id": "b4a46963", "metadata": {}, "source": [ "# Lesson 13 activity: probability distributions\n", "\n", "## Learning objectives\n", "\n", "This activity will help you to:\n", "\n", "1. Understand and apply binomial distributions to model discrete events\n", "2. Demonstrate the Central Limit Theorem through sampling distributions\n", "3. Visualize theoretical and empirical probability distributions\n", "4. Connect statistical theory to real-world data analysis" ] }, { "cell_type": "markdown", "id": "86e3eac2", "metadata": {}, "source": [ "## Setup\n", "\n", "Import the required libraries and load the weather dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "117792af", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import stats" ] }, { "cell_type": "code", "execution_count": null, "id": "05634da2", "metadata": {}, "outputs": [], "source": [ "# Load the weather dataset\n", "url = 'https://media.githubusercontent.com/media/gperdrizet/fullstack-2605/refs/heads/main/data/weather.csv'\n", "df = pd.read_csv(url)\n", "df.head()" ] }, { "cell_type": "markdown", "id": "809f038a", "metadata": {}, "source": [ "## Exercise 1: binomial distribution - modeling rainy days\n", "\n", "**Objective**: Understand and visualize binomial distributions using real weather data.\n", "\n", "The binomial distribution models the number of successes in a fixed number of independent trials, each with the same probability of success. In weather forecasting, we can use it to model the number of rainy days over a period of time.\n", "\n", "**Tasks**:\n", "\n", "1. 
**Calculate the probability of rain**:\n", " - Count how many days in the dataset have `rainfall_inches > 0`\n", " - Calculate the proportion of rainy days (this is your probability `p`)\n", " - Print this probability with an interpretation (e.g., \"Based on our data, there's an X% chance of rain on any given day\")\n", "\n", "2. **Create a theoretical binomial distribution**:\n", " - Assume you're looking at a 30-day period (like a month)\n", " - Using the probability from step 1, calculate the theoretical probability of getting exactly k rainy days for k = 0, 1, 2, ..., 30\n", " - Use `scipy.stats.binom.pmf()`\n", "\n", "3. **Visualize the distribution**:\n", " - Create a bar plot showing the probability of each possible number of rainy days (0 to 30)\n", " - Add a vertical line showing the expected value (mean = n × p)\n", " - Label the axes appropriately\n", " - Include a title with the probability of rain\n", "\n", "4. **Interpret** your findings:\n", " - What is the most likely number of rainy days in a 30-day period?\n", " - What is the expected (mean) number of rainy days?\n", " - What's the probability of having 15 or more rainy days in a month?\n", " - How does this distribution help weather forecasters make predictions?\n", " - **Bonus**: Calculate the standard deviation and explain what it tells you about the variability in monthly rainfall patterns" ] }, { "cell_type": "code", "execution_count": null, "id": "2c5fd71d", "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "id": "714bd825", "metadata": {}, "source": [ "## Exercise 2: central limit theorem - sampling distribution of rainfall\n", "\n", "**Objective**: Demonstrate the Central Limit Theorem by creating and analyzing a sampling distribution.\n", "\n", "The Central Limit Theorem (CLT) states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population's distribution, provided the population has finite variance. 
This is fundamental to statistical inference.\n", "\n", "**Tasks**:\n", "\n", "1. **Examine the population distribution**:\n", " - Create a histogram of all `rainfall_inches` values in the dataset\n", " - Calculate and print the population mean and standard deviation\n", " - Note the shape of this distribution (is it normal, skewed, etc.?)\n", "\n", "2. **Create a sampling distribution**:\n", " - Take 1000 random samples from the rainfall data, each of size n=30\n", " - For each sample, calculate the mean rainfall\n", " - Store all 1000 sample means in a list or array\n", " - Hint: Use `df['rainfall_inches'].sample(n=30, replace=True)` for each sample\n", "\n", "3. **Visualize the sampling distribution**:\n", " - Create a histogram of the 1000 sample means\n", " - Overlay a normal distribution curve using the theoretical mean (μ) and standard error (σ/√n)\n", " - Add a vertical line at the population mean\n", " - You can use `scipy.stats.norm.pdf()` to create the normal curve\n", " - Label axes and add a descriptive title\n", "\n", "4. **Compare distributions**:\n", " - Create two side-by-side histograms:\n", " - Left: Original rainfall distribution (from step 1)\n", " - Right: Sampling distribution of means (from step 3)\n", " - Make sure both use the same y-axis scale for comparison\n", " - Include the mean and standard deviation in each subplot title\n", "\n", "5. **Interpret** your findings:\n", " - How does the shape of the sampling distribution compare to the original distribution?\n", " - Is the sampling distribution approximately normal? (This demonstrates the CLT!)\n", " - Calculate the standard error: population σ divided by √30. How does this compare to the standard deviation of your sample means?\n", " - What does the CLT tell us about why we can use normal-based methods (like confidence intervals) even when our data isn't normally distributed?\n", " - **Bonus**: Repeat the experiment with different sample sizes (n=5, n=10, n=50). 
How does sample size affect the spread and normality of the sampling distribution?" ] }, { "cell_type": "code", "execution_count": null, "id": "a183c001", "metadata": {}, "outputs": [], "source": [ "# Your code here" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }