{ "cells": [ { "cell_type": "markdown", "id": "a92ddba5", "metadata": {}, "source": [ "# Lesson 15 activity: data wrangling\n", "\n", "## Learning objectives\n", "\n", "This activity will help you to:\n", "\n", "1. Merge and aggregate data from multiple sources\n", "2. Apply groupby operations to summarize data\n", "3. Calculate meaningful statistics from aggregated data\n", "4. Visualize aggregated results" ] }, { "cell_type": "markdown", "id": "28c8b549", "metadata": {}, "source": [ "## Setup\n", "\n", "Import the required libraries." ] }, { "cell_type": "code", "execution_count": 1, "id": "7248ac92", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "de1e8cc9", "metadata": {}, "source": [ "## Exercise 1: Merge Data and Calculate Monthly Sales\n", "\n", "**Tasks**:\n", "\n", "1. Merge `df_sales` and `df_dates` on the `transaction_id` column. Store the result in a dataframe called `df_transactions`.\n", "\n", "2. Extract the month from the `sale_date` column and add it as a new column called `month` in `df_transactions`.\n", "\n", "3. Calculate the total sales per month using groupby and sum. Store the result in a dataframe called `monthly_sales`.\n", "\n", "4. Create a bar chart showing the total sales for each month. Include appropriate labels and title.\n", "\n", "**Hints**:\n", "\n", "- To merge two dataframes, use `pd.merge()` or the `.merge()` method. You can specify the column to join on with the `on` parameter.\n", " - Example: `df_merged = pd.merge(df1, df2, on='key_column')`\n", " - Or: `df_merged = df1.merge(df2, on='key_column')`\n", "\n", "- To extract the month from a datetime column, use the `.dt.month` accessor.\n", " - Example: `df['month'] = df['date_column'].dt.month`\n", "\n", "- To group by a column and calculate the sum, use `.groupby()` followed by `.sum()`.\n", " - Example: `grouped_df = df.groupby('group_column')['value_column'].sum()`\n", " - You can also use `.reset_index()` to convert the result back to a regular dataframe." ] }, { "cell_type": "code", "execution_count": 2, "id": "d6a058a0", "metadata": {}, "outputs": [], "source": [ "df_sales = pd.read_csv('https://media.githubusercontent.com/media/gperdrizet/fullstack-2605/refs/heads/main/data/sale_amounts.csv')\n", "df_dates = pd.read_csv('https://media.githubusercontent.com/media/gperdrizet/fullstack-2605/refs/heads/main/data/sales_by_date.csv')" ] }, { "cell_type": "code", "execution_count": 3, "id": "ed7fac23", "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "id": "8a4abc24", "metadata": {}, "source": [ "## Exercise 2: Analyze Missing Data Patterns\n", "\n", "Load the California Housing dataset with missing values and investigate whether the pattern of missing data in the `HouseAge` feature is related to the `Population` feature.\n", "\n", "**Background**: \n", "Sometimes data is not missing completely at random. In real-world scenarios, certain values might be more likely to be missing based on other features in the dataset. This is called Missing Not At Random (MNAR). Understanding these patterns is important for choosing appropriate imputation strategies.\n", "\n", "**Tasks**:\n", "\n", "1. Load the dataset from `salted_housing_data.csv` and examine the first few rows.\n", "\n", "2. Create a new boolean column called `HouseAge_Missing` that indicates whether `HouseAge` is missing (True) or not (False).\n", "\n", "3. Use `pd.qcut()` to bin the `Population` feature into 4 quartiles. Store the result in a new column called `Population_Quartile`. Use labels like `['Q1 (Lowest)', 'Q2', 'Q3', 'Q4 (Highest)']`.\n", "\n", "4. Calculate the percentage of missing `HouseAge` values for each population quartile. Use `groupby()` to group by `Population_Quartile` and calculate:\n", " - The count of missing values (sum of `HouseAge_Missing`)\n", " - The total count in each quartile\n", " - The percentage missing\n", "\n", "5. Create a bar chart showing the percentage of missing `HouseAge` values by population quartile. Add value labels on top of each bar.\n", "\n", "6. Conduct a statistical test to determine if there is a relationship between `Population_Quartile` and the amount of `HouseAge` data missing.\n", "\n", "7. Based on your analysis, answer: Is `HouseAge` more likely to be missing in certain population quartile groups? What does this suggest about the missing data mechanism?\n", "\n", "**Hints**:\n", "\n", "- To check for missing values, use `.isnull()` or `.isna()`.\n", " - Example: `df['HouseAge_Missing'] = df['HouseAge'].isnull()`\n", "\n", "- To create quartile bins, use `pd.qcut()` with `q=4`.\n", " - Example: `df['quartile'] = pd.qcut(df['column'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])`\n", "\n", "- To calculate percentage missing by group:\n", " - Group by the quartile column\n", " - Use `.agg()` with multiple functions: `{'HouseAge_Missing': ['sum', 'count']}`\n", " - Calculate percentage: `(sum / count) * 100`" ] }, { "cell_type": "code", "execution_count": null, "id": "62448d99", "metadata": {}, "outputs": [], "source": [ "housing_df = pd.read_csv('https://media.githubusercontent.com/media/gperdrizet/fullstack-2605/refs/heads/main/data/salted_housing_data.csv')" ] }, { "cell_type": "code", "execution_count": 6, "id": "1f061fe1", "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "id": "3dc46951", "metadata": {}, "source": [ "## Exercise 3: Apply Imputation to Missing Data\n", "\n", "Now that you've analyzed the missing data pattern, select and apply an appropriate imputation method to fill in the missing `HouseAge` values.\n", "\n", "**Background**:\n", "Based on your findings from Exercise 2, you discovered that `HouseAge` is more likely to be missing in areas with lower population (MNAR pattern). This information should guide your choice of imputation method. Advanced methods like KNN or Iterative Imputation can leverage relationships between features, which is particularly useful when data is not missing completely at random.\n", "\n", "**Tasks**:\n", "\n", "1. Choose an imputation method. Consider:\n", " - Simple methods: Mean or Median imputation\n", " - Advanced methods: KNN Imputation or Iterative Imputation\n", " - Justify your choice based on the missing data pattern you discovered\n", "2. Apply your chosen imputation method to fill the missing `HouseAge` values. Store the result in a new column called `HouseAge_imputed`.\n", "\n", "3. Compare the distribution of the original `HouseAge` (non-missing values only) with the imputed values:\n", " - Create a histogram showing both distributions\n", " - Calculate summary statistics (mean, median, std) for both\n", "\n", "4. Evaluate your imputation:\n", " - Does the imputed distribution look reasonable compared to the original?\n", " - Are there any obvious problems or artifacts introduced by the imputation?\n", " - Would a different method have been better? Why or why not?" ] }, { "cell_type": "code", "execution_count": 7, "id": "b1f4da2e", "metadata": {}, "outputs": [], "source": [ "# Your code here" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }