{ "cells": [ { "cell_type": "markdown", "id": "fe33bd68", "metadata": {}, "source": [ "---\n", "title: \"Intro to EDA\"\n", "date: last-modified\n", "toc: true\n", "format:\n", " html: default\n", " ipynb: default\n", "---\n" ] }, { "cell_type": "markdown", "id": "5e61f04c-922a-4270-a117-61b2edea3b4a", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "Exploratory data analysis, or EDA, is a standard practice prior to any data manipulation and analysis.\n", "\n", "Recall that data engineering is primarily about data preparation to *serve* smooth and effective data analysis. Exploratory data analysis generally refers to the step of understanding the data: \n", "\n", "- **summarizing characteristics of raw data**\n", "- **visualizing data (single and multiple variables)**\n", "- identifying missing data\n", "- identifying outliers\n", "\n", "This document primarily deals with the first two items. " ] }, { "cell_type": "markdown", "id": "cba716e0-7bd5-4102-b479-c8e592dd0234", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Goals\n", "In the **exploratory** phase, these are for people behind the scenes to see. 
" ] }, { "cell_type": "markdown", "id": "4dfb16c5-6f8d-417d-b960-c30a0e68edda", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "The main goals here are:\n", "\n", "- capture main message\n", "- (relatively) quick exploration across many summaries (including plots)\n", "- *not* intended for a client or presentation" ] }, { "cell_type": "markdown", "id": "e7635605-3b90-481f-ab2d-63f4a3c69a20", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "What does this translate to, technically?\n", "\n", "- each summary should have meaningful information\n", "- **label** your plots" ] }, { "cell_type": "markdown", "id": "1e9333af-5a68-43bf-88d9-3bbda655fb68", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Data summary\n", "As a starting point, simply looking at the data is worth the while. Some common questions to consider are the following: " ] }, { "cell_type": "markdown", "id": "4c65c64c-1290-488c-b345-3d703b0b12ed", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "\n", "1. General dataset info: size, dtypes \n", "2. Missing values? \n", "3. Duplicate data? \n", "4. Continuous variables \n", "5. Categorical variables \n", "6. Bivariate relationships \n", "7. 
Potential data quality issues, e.g., inconsistency, special NA characters" ] }, { "cell_type": "code", "execution_count": 1, "id": "82a89846-bdfa-404d-83d8-89d6d4974c40", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "1aa08068-0d06-4e54-885b-91736599fd93", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "|![sns](../img/sns.jpg)|\n", "|:---:|\n", "|[The origin of sns.](https://seaborn.pydata.org/faq.html#why-is-seaborn-imported-as-sns)|" ] }, { "cell_type": "markdown", "id": "fd68df4d-207e-4aad-9382-8c51df12c2e8", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Earthquake dataset\n", "\n", "[Source Link](https://open.canada.ca/data/en/dataset/2c3672b6-4c17-4ff5-9861-29e2dd6d03b3/resource/9cfea46f-561a-440f-9d17-fed3557fc7b5)" ] }, { "cell_type": "code", "execution_count": 3, "id": "10f6ca64-3a40-44ab-8de2-19f0408b1c90", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# load the earthquake dataset from the course repository into a DataFrame\n", "earthquake = pd.read_csv('https://raw.githubusercontent.com/mosesyhc/de300-2026wi/refs/heads/main/datasets/Canadian-Earthquakes-2010-2019.csv')" ] }, { "cell_type": "code", "execution_count": null, "id": "d51255a4-9ca2-42e1-bfe3-73bef38541fd", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# take a glimpse of the data" ] }, { "cell_type": "code", "execution_count": null, "id": "86f129d4-4d81-42db-9769-1b8d02d60291", "metadata": { "editable": true, "scrolled": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# view a summary of the full data\n" ] }, { "cell_type": "code", 
"execution_count": null, "id": "d82f93a5-351a-48cf-a979-516893c1aff0", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# checks for duplicates (also ask if duplicates make sense)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "68b170c2-9fbd-4742-80e9-2e272895983b", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# duplicates\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a289329c-3672-4182-ac50-3ef60fcb5a34", "metadata": { "editable": true, "scrolled": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# a quick numerical summary \n" ] }, { "cell_type": "code", "execution_count": null, "id": "00afe82f-09eb-4803-9bfe-4d3e6fca0a48", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "# checks for possible statistical assumption(s)\n", "import scipy.stats as sps" ] }, { "cell_type": "code", "execution_count": null, "id": "29725215-c4f8-406c-af9e-82592efdac6f", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# extract only numeric variables\n" ] }, { "cell_type": "code", "execution_count": null, "id": "aee3c83b-451e-4cd9-a403-0c966a33bd3f", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "83527a55-fe9f-46f1-a30e-e552fdb0fe57", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# for example, normality test\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e55e4a50-9ec3-401f-87df-95cb6715dc5d", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# for example, another 
normality test\n" ] }, { "cell_type": "code", "execution_count": null, "id": "dac905eb-c0f5-4f35-b772-385e822f5465", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# pairwise correlation\n" ] }, { "cell_type": "markdown", "id": "cdbc1087-1085-4b98-9eb6-76f8fd347a23", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Data visualization" ] }, { "cell_type": "code", "execution_count": null, "id": "fc415625-04a8-41d9-8bc3-c73a19144026", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "sns.set(context='talk', style='ticks') # simply for aesthetics\n", "sns.set_palette('magma')\n", "%matplotlib inline \n", "\n", "# earthquake = earthquake.sample(n=500) # (if too slow) for illustration purposes" ] }, { "cell_type": "code", "execution_count": null, "id": "3f63be58-3bdb-416f-ad86-158ba7d0eab2", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# histogram for continuous variables using pandas built-in plots \n" ] }, { "cell_type": "code", "execution_count": null, "id": "dac34314-b559-47e2-9274-1f5df09d31ce", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# relative frequency? 
...\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a4dbf50a-df4d-4b52-9f19-b038872f775a", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "# histogram of masses by group\n" ] }, { "cell_type": "code", "execution_count": null, "id": "81457e61-b8d2-4e50-ac1b-cfa8c81de15e", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# other types of plots\n" ] }, { "cell_type": "code", "execution_count": null, "id": "95be9d60-1f2b-41a0-b1a6-1e3b2d3ef407", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "# counts for categorical variables\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bd1504e9-ad84-40e4-b668-97c792ddaad2", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "# barplots by group" ] }, { "cell_type": "code", "execution_count": null, "id": "0c006fb4-daaa-4cdd-8135-133382bc9ace", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "# bivariate plots" ] }, { "cell_type": "code", "execution_count": null, "id": "fe7741b0-6132-473c-b6b8-a76325d15cf1", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# bivariate plots (log-log)" ] }, { "cell_type": "code", "execution_count": null, "id": "8e1fe0c8-3c3e-46fc-904d-fc87deb61dac", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "# pairwise plots (time-consuming)" ] }, { "cell_type": "code", "execution_count": null, "id": "6680e377-c999-4f4d-8f05-576a0f1ac7ae", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "# another pairwise plot by group" ] }, { "cell_type": 
"markdown", "id": "28d61bdb", "metadata": {}, "source": [ "## In-class activity\n", "Refer to the following figure, choose two subfigures to reproduce with the earthquake dataset.\n", "\n", "![](../datasets/earthquake_analysis-2010-2019.png)" ] }, { "cell_type": "markdown", "id": "6712f540-11b0-49a4-bfdf-a741a4aab9d1", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## (In case you need this) Jupyter notebook setup\n", "\n", "Visit https://docs.jupyter.org/en/latest/install/notebook-classic.html for some guidance to set up jupyter notebook." ] }, { "cell_type": "markdown", "id": "b874c045-bb8d-4cb9-8f27-47444a59e64b", "metadata": { "editable": true, "slideshow": { "slide_type": "notes" }, "tags": [] }, "source": [ "\n", "---\n", "\n", "*Note:* These notes are adapted from a blog post on [Tom's Blog](https://tomaugspurger.net/posts/modern-6-visualization/).\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }