{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Visualizing Data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "> To find signals in data, we must learn to reduce the noise - not just the noise that resides in the data, but also the noise that resides in us. It is nearly impossible for noisy minds to perceive anything but noise in data.\n", ">\n", "> \\- Stephen Few, Data Visualization Consultant and Author" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Applied Review" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Joining Data" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "- *Joining* is the process of combining two DataFrames to form a new DataFrame that incorporates data from both of the combined tables." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "- The most common type of join is an *inner join*, but full-outer, left-outer, and right-outer joins are all useful as well." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "- Joins are done in Pandas with the `pd.merge` function." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Exporting Data" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "- Most types of data in Python, including DataFrames and trained models, can be saved in files." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "- The most common types of files for this are CSVs, JSON, and pickle files." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Matplotlib and Seaborn" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Matplotlib" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "The most tried-and-true, mature plotting library in Python is called [Matplotlib](https://matplotlib.org/).\n", "It began with a mission of replicating MATLAB's plotting functionality in Python, so if you're familiar with MATLAB you may notice some syntactic similarities." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Matplotlib is traditionally imported like this:\n", "```python\n", "import matplotlib.pyplot as plt\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This means *import matplotlib's pyplot submodule under the name `plt`*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Seaborn" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "While Matplotlib is powerful and stable, the rise of Python's use within data science led to the development of a more data scientist-friendly library, called [Seaborn](https://seaborn.pydata.org/).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Seaborn allows the user to describe graphics using clearer and less verbose function calls, but uses Matplotlib to generate the plots." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This approach has the added benefit of allowing the user to \"drop down\" to Matplotlib to make fine adjustments to their plots if needed." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Seaborn is traditionally imported like this:\n", "```python\n", "import seaborn as sns\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
Fun Fact!
\n", "Seaborn is allegedly named after the *West Wing* character Sam Seaborn, whose full name is Samuel Norman Seaborn (S.N.S., hence the import nickname `sns`).
\n", "Note
\n", "Note the stray `<AxesSubplot:xlabel='seats', ylabel='Count'>` text at the top of the previous plot. We can suppress this output by adding a semicolon (`;`) after the plotting function call.
\n", "Note
\n", "`histplot`, like all Seaborn plotting functions, supports a wide variety of customizations through its arguments. We won't cover them here, but refer to the Seaborn docs to learn more.\n", "
\n", "Note
\n", "This plot may take a while to render when you run it -- there are a lot of flights in our data, and it takes Python a while to assign them all coordinates and colors.\n", "
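
If rendering time becomes a problem, one common workaround (not part of the original walkthrough) is to plot a random sample of the rows. The sketch below assumes a `flights` DataFrame with `dep_time` and `dep_delay` columns; the data is synthetic and the sample size is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for a large flights table; the values are random, not real.
rng = np.random.default_rng(0)
n_rows = 300_000
flights = pd.DataFrame({
    'dep_time': rng.integers(500, 2400, size=n_rows),
    'dep_delay': rng.normal(10, 30, size=n_rows).round(),
})

# Plot a fixed-size random sample instead of every row to cut rendering time.
sample = flights.sample(n=5_000, random_state=0)
ax = sns.scatterplot(data=sample, x='dep_time', y='dep_delay', s=5, alpha=0.3)
```

Sampling trades completeness for speed, so rare extreme points may be missing from the picture.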
\n", "\n",
"|   | dep_time | dep_delay |\n",
"|---|---|---|\n",
"| 0 | 517.0 | 2.0 |\n",
"| 1 | 533.0 | 4.0 |\n",
"| 2 | 542.0 | 2.0 |\n",
"| 3 | 544.0 | -1.0 |\n",
"| 4 | 554.0 | -5.0 |\n", "
\n", "\n",
"|   | origin | n_flights |\n",
"|---|---|---|\n",
"| 0 | EWR | 120835 |\n",
"| 1 | JFK | 111279 |\n",
"| 2 | LGA | 104662 |\n", "
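
A summary table of this shape can be built by grouping and counting, then handed to a Seaborn bar plot. This is a sketch, not necessarily how the table above was produced: it assumes a `flights` DataFrame with one row per flight and an `origin` column, and the few rows here are made up, so the counts differ from those shown.

```python
import pandas as pd
import seaborn as sns

# Made-up rows, one per flight; only the column name mirrors the table above.
flights = pd.DataFrame({'origin': ['EWR', 'EWR', 'EWR', 'JFK', 'JFK', 'LGA']})

# Count rows per origin and turn the result back into a two-column DataFrame.
flights_by_origin = (
    flights.groupby('origin')
    .size()
    .reset_index(name='n_flights')
)

# One bar per origin airport, with bar height equal to the flight count.
ax = sns.barplot(data=flights_by_origin, x='origin', y='n_flights')
```

Aggregating first keeps the plotting call simple, since each row of the summary maps to exactly one bar.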
\n", "\n",
"|   | origin | carrier | n_flights |\n",
"|---|---|---|---|\n",
"| 0 | EWR | 9E | 1268 |\n",
"| 1 | EWR | AA | 3487 |\n",
"| 3 | EWR | B6 | 6557 |\n",
"| 12 | JFK | 9E | 14651 |\n",
"| 13 | JFK | AA | 13783 |\n",
"| 14 | JFK | B6 | 42076 |\n",
"| 22 | LGA | 9E | 2541 |\n",
"| 23 | LGA | AA | 15459 |\n",
"| 24 | LGA | B6 | 6002 |\n", "
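
The gaps in the row labels suggest the table was filtered after counting. One way to get a table like this, sketched with made-up rows and assuming a `flights` DataFrame with `origin` and `carrier` columns, is to group on both columns and then keep the carriers of interest:

```python
import pandas as pd

# Made-up rows, one per flight; only the column names mirror the table above.
flights = pd.DataFrame({
    'origin': ['EWR', 'EWR', 'JFK', 'JFK', 'JFK', 'LGA', 'LGA', 'EWR'],
    'carrier': ['AA', 'UA', 'B6', 'B6', '9E', 'AA', 'DL', 'AA'],
})

# Count flights for every (origin, carrier) pair.
counts = (
    flights.groupby(['origin', 'carrier'])
    .size()
    .reset_index(name='n_flights')
)

# Keep three carriers; the surviving rows retain their original labels,
# which is why the index of a filtered table can have gaps.
subset = counts[counts['carrier'].isin(['9E', 'AA', 'B6'])]
```

A table in this long format can be passed to `sns.barplot` with `hue='carrier'` to draw grouped bars per origin.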