{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2zIplrFX62ud"
},
"source": [
"# A Picture is Worth a Thousand Data Points: Introduction to Visualization\n",
"## by Andy Berres\n",
"\n",
"This notebook is intended for the hands-on portion of my workshop \"A Picture is Worth a Thousand Data Points: Introduction to Visualization\".\n",
"\n",
"### [GitHub Repository](https://github.com/q-rai/VisWorkshop/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9yAXVg5ybC6z"
},
"source": [
"## Setup\n",
"\n",
"Run the setup script to make local installs and download the files this notebook needs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JrnSsRPl6vGO"
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"!curl -s -o setup.sh https://raw.githubusercontent.com/q-rai/VisWorkshop/main/setup.sh\n",
"!bash setup.sh"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EUf6Pi7obOOv"
},
"source": [
"## Raw Data\n",
"Let's see what the data looks like.\n",
"\n",
"### Reading CSV Data in Python/Pandas\n",
"Pandas makes this very easy. The file `Data/table.csv` was downloaded by the setup script above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kVQd8myxCFlJ"
},
"outputs": [],
"source": [
"# read data and put it in a dataframe\n",
"df = pd.read_csv('Data/table.csv')\n",
"print(df)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IhZGwr--cDev"
},
"source": [
"### Simple Statistics\n",
"\n",
"I wonder if the statistical properties are of any use...\n",
"Pandas provides the `DataFrame.describe()` method, which reports summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Kg8TyOU9cIkc"
},
"outputs": [],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CBMDqty2bT7F"
},
"source": [
"That wasn't very helpful either. Looking at the numbers, I still have no idea what this could be. So let's try plotting it.\n",
"\n",
"### Simple Plots in Pandas\n",
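"Pandas has some handy plotting functions. Calling `DataFrame.plot()` with no arguments draws a line plot of every numeric column against the index."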
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YDMgfeMOCwNF"
},
"outputs": [],
"source": [
"df.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4wvMXPiJbtu6"
},
"source": [
"Well, that wasn't very helpful. Let's try to be more specific.\n",
"\n",
"### Scatter Plots in Pandas\n",
"Pandas lets you specify the plot type with the `kind` argument. You can find all the supported plot types in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html). Under the hood, pandas plotting uses `matplotlib`, so you can use `matplotlib` settings to modify your plot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OYjp4fOBbdxx"
},
"outputs": [],
"source": [
"df.plot(x='x', y='y', kind='scatter')"
]
},
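{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because pandas plotting is built on `matplotlib`, `DataFrame.plot` returns a `matplotlib` `Axes` object that you can keep adjusting. The cell below is a minimal sketch of that idea, reusing the `x` and `y` columns from the cell above; the title text is made up for illustration, so adapt it to your data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Keep the Axes that pandas returns, then adjust it with matplotlib methods\n",
"ax = df.plot(x='x', y='y', kind='scatter')\n",
"ax.set_title('y versus x')  # illustrative title, not taken from the dataset\n",
"ax.grid(True)  # draw a background grid"
]
},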
{
"cell_type": "markdown",
"metadata": {
"id": "Qvz54rIdb7SH"
},
"source": [
"Much better, isn't it?\n",
"\n",
"### Let's look at some more interesting data\n",
"We'll be using Seaborn. It's based on Matplotlib, and it makes plotting easy, especially for pandas dataframes.\n",
"Seaborn comes with a number of example datasets you can play with. You can get a full list using `sns.get_dataset_names()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tFXXFf3rWXlf"
},
"outputs": [],
"source": [
"cars = sns.load_dataset('mpg')\n",
"print(cars)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BDpcdocyWXlf"
},
"source": [
"#### Simple scatter plot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Mx41hzAfWXlf"
},
"outputs": [],
"source": [
"sns.scatterplot(data=cars, x='horsepower', y='mpg')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-sih3S8QWXlf"
},
"source": [
"### Change some colors\n",
"\n",
"#### By origin\n",
"Origin is a discrete category, so we're using a categorical colormap. Seaborn will do this for you automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LlFfR0c9WXlf"
},
"outputs": [],
"source": [
"sns.scatterplot(data=cars, x='horsepower', y='mpg', hue='origin')\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OdX9uqGzWXlf"
},
"source": [
"#### By model year\n",
"Model year is numeric, so Seaborn switches to a sequential colormap by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1nrwWoCkWXlf"
},
"outputs": [],
"source": [
"sns.scatterplot(data=cars, x='horsepower', y='mpg', hue='model_year')\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yL2gbOoWWXlf"
},
"source": [
"#### Try a different colormap\n",
"\n",
"Seaborn has different colormap (or color palette) options. You can see them here: https://seaborn.pydata.org/tutorial/color_palettes.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CsUy8ugdWXlf"
},
"outputs": [],
"source": [
"sns.set(style='white')\n",
"sns.scatterplot(data=cars, x='horsepower', y='mpg', hue='model_year', palette='viridis')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CtTjfpfBWXlf"
},
"source": [
"#### Change background\n",
"Options include `white` (blank background), `dark`, `whitegrid`, and `darkgrid`. Go ahead and try these out to see which one you like best."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "t8ethb3PWXlf"
},
"outputs": [],
"source": [
"sns.set(style='whitegrid')\n",
"sns.scatterplot(data=cars, x='horsepower', y='mpg', hue='model_year', palette='viridis')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aWkkTRsvWXlg"
},
"source": [
"#### Change marker sizes\n",
"\n",
"If you have a multidimensional dataset, marker size is another way to get more information about your data into the same plot. This variable should be numerical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QIVXUwRcWXlg"
},
"outputs": [],
"source": [
"sns.scatterplot(data=cars, x='horsepower', y='mpg', hue='model_year', palette='viridis', size='weight', sizes=(5,50))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ES4YA9s8WXlg"
},
"source": [
"#### The legend is blocking my view!\n",
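"You can move it around fairly easily. Here, `bbox_to_anchor` together with `loc='upper left'` pins the legend's upper-left corner just outside the right edge of the plot."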
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5ap7y2paWXlg"
},
"outputs": [],
"source": [
"sns.scatterplot(data=cars, x='horsepower', y='mpg', hue='model_year', palette='viridis', size='weight', sizes=(5,100))\n",
"plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "00EPF16xWXlg"
},
"source": [
"#### Split by different variables\n",
"You can also make the same plot multiple times for different categorical values. Let's make one scatterplot per region of origin.\n",
"One important thing to know about `relplot` is that, by default, it uses the same x-axis and y-axis ranges for every panel it draws, which keeps the panels comparable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MM_ba_tkWXlg"
},
"outputs": [],
"source": [
"sns.relplot(data=cars, x='horsepower', y='mpg', hue='model_year', palette='viridis', size='weight', sizes=(5,100), col='origin')\n"
]
},
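{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would rather let each panel pick its own axis range, `relplot` accepts a `facet_kws` dictionary that is passed on to the underlying `FacetGrid`. The cell below is a small sketch of that option, not part of the original workshop flow. Keep in mind that the panels are then no longer directly comparable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Give every panel independent x and y axes instead of shared ones\n",
"sns.relplot(data=cars, x='horsepower', y='mpg', col='origin',\n",
"            facet_kws={'sharex': False, 'sharey': False})"
]
},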
{
"cell_type": "markdown",
"metadata": {
"id": "kbnyGPyrWXlg"
},
"source": [
"### Plotting options\n",
"\n",
"Depending on what you use your plot for, you may want different label sizes to keep everything legible. `sns.set_context` scales labels and lines for the contexts `paper`, `notebook` (the default), `talk`, and `poster`. For more styling options: https://seaborn.pydata.org/tutorial/aesthetics.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MXw1-99rWXlg"
},
"outputs": [],
"source": [
"sns.set_context(\"talk\")\n",
"sns.relplot(data=cars, x='horsepower', y='mpg', hue='model_year', palette='viridis', size='weight', sizes=(5,100), col='origin')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "m6gaDnPEWXlg"
},
"outputs": [],
"source": [
"sns.set_context(\"poster\")\n",
"sns.relplot(data=cars, x='horsepower', y='mpg', hue='model_year', palette='viridis', size='weight', sizes=(5,100), col='origin')\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pmjudV46WXlg"
},
"source": [
"## More visualization\n",
"Seaborn alone has many more plot types. Check out their example gallery here: https://seaborn.pydata.org/examples/index.html"
]
}
],
"metadata": {
"colab": {
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}