{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(vis-common-plots-one)=\n", "# Common Plots I\n", "\n", "## Introduction\n", "\n", "In this chapter and the next, we'll look at some of the most common plots that you might want to make, and how to create them using the most popular data visualisation libraries, including [**matplotlib**](https://matplotlib.org/), [**lets-plot**](https://lets-plot.org/), [**seaborn**](https://seaborn.pydata.org/), [**altair**](https://altair-viz.github.io/), and [**plotly**](https://plotly.com/python/). If you need an introduction to these libraries, check out the other data visualisation chapters.\n", "\n", "This chapter has benefited from the phenomenal **matplotlib** documentation, the **lets-plot** documentation, [**viztech**](https://github.com/cstorm125/viztech) (a repository that aimed to recreate the entire Financial Times Visual Vocabulary using **plotnine**), the **seaborn** documentation, the **altair** documentation, the **plotly** documentation, and examples posted around the web on forums and in blog posts. You may be wondering why **plotnine** isn't featured here: its functions have almost exactly the same names as those in **lets-plot**, and we have opted to include the latter as it is currently the more mature plotting package. However, most of the code below for **lets-plot** also works in **plotnine**, and you can read more about **plotnine** in {ref}`vis-plotnine`.\n", "\n", "Bear in mind that for many of the **matplotlib** examples, using the `df.plot.*` syntax can get the plot you want more quickly! To be more comprehensive, the solution for any kind of data is shown in the examples below.\n", "\n", "Throughout, we'll assume that the data are in a tidy format (one row per observation, one variable per column). Remember that all Altair plots can be made interactive by adding `.interactive()` at the end.\n", "\n", "First, though, let's import the libraries we'll need." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "from itertools import cycle\n", "\n", "import altair as alt\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import plotly.express as px\n", "import seaborn as sns\n", "import seaborn.objects as so\n", "from lets_plot import *\n", "from lets_plot.mapping import as_discrete\n", "from vega_datasets import data\n", "\n", "# Set seed for random numbers (for reproducibility)\n", "seed_for_prng = 78557\n", "prng = np.random.default_rng(\n", " seed_for_prng\n", ") # prng = pseudo-random number generator\n", "\n", "# Turn off warnings\n", "warnings.filterwarnings(\"ignore\")\n", "# Set up lets-plot charts\n", "LetsPlot.setup_html()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import matplotlib_inline.backend_inline\n", "\n", "# Plot settings\n", "plt.style.use(\n", " \"https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt\"\n", ")\n", "matplotlib_inline.backend_inline.set_matplotlib_formats(\"svg\")\n", "# some faffing here to try and get seaborn not to change theme in object API\n", "# sns.set_theme(rc=plt.rcParams)\n", "# Set max rows displayed for readability\n", "pd.set_option(\"display.max_rows\", 6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scatter plot\n", "\n", "In this example, we will see a simple scatter plot with several categories using the \"cars\" data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cars = data.cars()\n", "cars.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "for origin in cars[\"Origin\"].unique():\n", " cars_sub = 
cars[cars[\"Origin\"] == origin]\n", " ax.scatter(cars_sub[\"Horsepower\"], cars_sub[\"Miles_per_Gallon\"], label=origin)\n", "ax.set_ylabel(\"Miles per Gallon\")\n", "ax.set_xlabel(\"Horsepower\")\n", "ax.legend()\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn\n", "\n", "Note that this uses the seaborn objects API." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(so.Plot(cars, x=\"Horsepower\", y=\"Miles_per_Gallon\", color=\"Origin\").add(so.Dot()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " ggplot(cars, aes(x=\"Horsepower\", y=\"Miles_per_Gallon\", color=\"Origin\"))\n", " + geom_point()\n", " + ylab(\"Miles per Gallon\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair\n", "\n", "For this first example, we'll also show how to make the altair plot interactive with movable axes and a tooltip that reveals more info when you hover your mouse over points." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(cars).mark_circle(size=60).encode(\n", " x=\"Horsepower\",\n", " y=\"Miles_per_Gallon\",\n", " color=\"Origin\",\n", " tooltip=[\"Name\", \"Origin\", \"Horsepower\", \"Miles_per_Gallon\"],\n", ").interactive()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly\n", "\n", "Plotly is another declarative plotting library, at least sometimes (!), but one that is interactive by default." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.scatter(\n", " cars,\n", " x=\"Horsepower\",\n", " y=\"Miles_per_Gallon\",\n", " color=\"Origin\",\n", " hover_data=[\"Name\", \"Origin\", \"Horsepower\", \"Miles_per_Gallon\"],\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Facets\n", "\n", "This applies to all plots, so in some sense is common! Facets, aka panels or small multiples, are ways of showing the same type of chart multiple times, once for each subset of the data. Let's see how to achieve them in a few of the most popular plotting libraries.\n", "\n", "We'll use the \"tips\" dataset for this." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = sns.load_dataset(\"tips\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib\n", "\n", "There are many ways to create facets using Matplotlib, and you can get facets in any shape or size you like.\n", "\n", "The easiest way, though, is to specify the number of rows and columns. This is achieved by specifying `nrows` and `ncols` when calling `plt.subplots()`. It returns a figure and an array of `Axes` objects of shape `(nrows, ncols)` (or a one-dimensional array if there is only one row or one column). For most purposes, you'll want to flatten this array to a vector before iterating over it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(nrows=1, ncols=4, sharex=True, sharey=True)\n", "flat_axes = axes.flatten() # Not needed with 1 row or 1 col, but good to be aware of\n", "\n", "facet_grp = list(df[\"day\"].unique())\n", "# This part just to get some colours from the default color cycle\n", "colour_list = plt.rcParams[\"axes.prop_cycle\"].by_key()[\"color\"]\n", "iter_cycle = cycle(colour_list)\n", "\n", "for i, ax in enumerate(flat_axes):\n", " sub_df = df.loc[df[\"day\"] == facet_grp[i]]\n", " ax.scatter(\n", " sub_df[\"tip\"],\n", " sub_df[\"total_bill\"],\n", " s=30,\n", " edgecolor=\"k\",\n", " color=next(iter_cycle),\n", " )\n", " ax.set_title(facet_grp[i])\n", "fig.text(0.5, 0.01, \"Tip\", ha=\"center\")\n", "fig.text(0.0, 0.5, \"Total bill\", va=\"center\", rotation=\"vertical\")\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Different facet sizes are possible in numerous ways. In practice, it's often better to have evenly sized facets laid out in a grid, especially when each facet shares the same x and y axes. 
But, just to show it's possible, here's an example that gives more space to the weekend than to weekdays using the tips dataset: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This part just to get some colours\n", "colormap = plt.cm.Dark2\n", "\n", "fig = plt.figure(constrained_layout=True)\n", "ax_dict = fig.subplot_mosaic([[\"Thur\", \"Fri\", \"Sat\", \"Sat\", \"Sun\", \"Sun\"]])\n", "facet_grp = list(ax_dict.keys())\n", "colorst = [colormap(i) for i in np.linspace(0, 0.9, len(facet_grp))]\n", "for i, grp in enumerate(facet_grp):\n", "    sub_df = df.loc[df[\"day\"] == facet_grp[i]]\n", "    ax_dict[grp].scatter(\n", "        sub_df[\"tip\"],\n", "        sub_df[\"total_bill\"],\n", "        s=30,\n", "        edgecolor=\"k\",\n", "        color=colorst[i],\n", "    )\n", "    ax_dict[grp].set_title(facet_grp[i])\n", "    # Only keep y-axis tick labels on the left-most (Thur) panel\n", "    if grp != \"Thur\":\n", "        ax_dict[grp].set_yticklabels([])\n", "plt.tight_layout()\n", "fig.text(0.5, 0, \"Tip\", ha=\"center\")\n", "fig.text(0, 0.5, \"Total bill\", va=\"center\", rotation=\"vertical\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As well as using lists, you can also specify the layout using an array or using text, e.g.:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "axd = plt.figure(constrained_layout=True).subplot_mosaic(\n", " \"\"\"\n", " ABD\n", " CCD\n", " CC.\n", " \"\"\"\n", ")\n", "kw = dict(ha=\"center\", va=\"center\", fontsize=60, color=\"darkgrey\")\n", "for k, ax in axd.items():\n", "    ax.text(0.5, 0.5, k, transform=ax.transAxes, **kw)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn\n", "\n", "Seaborn makes it easy to quickly create facet plots. Note the use of `wrap` to limit the number of facets per row." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " so.Plot(df, x=\"tip\", y=\"total_bill\", color=\"day\")\n", " .facet(col=\"day\", wrap=2)\n", " .add(so.Dot())\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A nice feature of seaborn that is much more fiddly in (base) matplotlib is the ability to specify rows and columns separately (here, `smoker` for the rows and `day` for the columns):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " so.Plot(df, x=\"tip\", y=\"total_bill\", color=\"day\")\n", " .facet(col=\"day\", row=\"smoker\")\n", " .add(so.Dot())\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " ggplot(df, aes(x=\"tip\", y=\"total_bill\", color=\"smoker\"))\n", " + geom_point(size=3)\n", " + facet_wrap([\"smoker\", \"day\"])\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(df).mark_point().encode(\n", " x=\"tip:Q\",\n", " y=\"total_bill:Q\",\n", " color=\"smoker:N\",\n", " facet=alt.Facet(\"day:N\", columns=2),\n", ").properties(\n", " width=200,\n", " height=100,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.scatter(\n", " df, x=\"tip\", y=\"total_bill\", color=\"smoker\", facet_row=\"smoker\", facet_col=\"day\"\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connected scatter plot\n", "\n", "A simple variation on the scatter plot designed to show an ordering, usually of time. We'll trace out a Beveridge curve (the relationship between the unemployment rate and the vacancy rate) based on US data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "\n", "import pandas_datareader.data as web\n", "\n", "start = datetime.datetime(2000, 1, 1)\n", "end = datetime.datetime(datetime.datetime.now().year, 1, 1)\n", "code_dict = {\n", " \"Vacancies\": \"LMJVTTUVUSA647N\",\n", " \"Unemployment\": \"UNRATE\",\n", " \"LabourForce\": \"CLF16OV\",\n", "}\n", "# Download each series and take annual means (\"YS\" = year-start frequency)\n", "list_dfs = [\n", " web.DataReader(value, \"fred\", start, end)\n", " .rename(columns={value: key})\n", " .groupby(pd.Grouper(freq=\"YS\"))\n", " .mean()\n", " for key, value in code_dict.items()\n", "]\n", "df = pd.concat(list_dfs, axis=1)\n", "df = df.assign(Vacancies=100 * df[\"Vacancies\"] / (df[\"LabourForce\"] * 1e3)).dropna()\n", "df[\"Year\"] = df.index.year\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.close(\"all\")\n", "fig, ax = plt.subplots()\n", "quivx = -df[\"Unemployment\"].diff(-1)\n", "quivy = -df[\"Vacancies\"].diff(-1)\n", "# This connects the points\n", "ax.quiver(\n", " df[\"Unemployment\"],\n", " df[\"Vacancies\"],\n", " quivx,\n", " quivy,\n", " scale_units=\"xy\",\n", " angles=\"xy\",\n", " scale=1,\n", " width=0.006,\n", " alpha=0.3,\n", ")\n", "ax.scatter(\n", " df[\"Unemployment\"],\n", " df[\"Vacancies\"],\n", " marker=\"o\",\n", " s=35,\n", " edgecolor=\"black\",\n", " linewidth=0.2,\n", " alpha=0.9,\n", ")\n", "for j in [0, -1]:\n", " ax.annotate(\n", " df[\"Year\"].iloc[j],\n", " xy=(df[[\"Unemployment\", \"Vacancies\"]].iloc[j].tolist()),\n", " xycoords=\"data\",\n", " xytext=(-20, -40),\n", " textcoords=\"offset points\",\n", " arrowprops=dict(arrowstyle=\"->\", connectionstyle=\"angle3,angleA=0,angleB=-90\"),\n", " )\n", "ax.set_xlabel(\"Unemployment rate, %\")\n", "ax.set_ylabel(\"Vacancy rate, %\")\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "### Seaborn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " so.Plot(df, x=\"Unemployment\", y=\"Vacancies\")\n", " .add(so.Dots())\n", " .add(so.Path(marker=\"o\"))\n", " .label(\n", " x=\"Unemployment rate, %\",\n", " y=\"Vacancy rate, %\",\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot\n", "\n", "You can also use `geom_curve()` in place of `geom_segment()` below to get curved lines instead of straight lines." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is a convenience: it creates a dataframe of the form\n", "# Vacancies_from\tUnemployment_from\tLabourForce_from\tYear_from\tVacancies_to\tUnemployment_to\tLabourForce_to\tYear_to\n", "# 0\t3.028239\t4.741667\t143768.916667\t2001\t2.387254\t5.783333\t144856.083333\t2002\n", "# 1\t 2.387254\t5.783333\t144856.083333\t2002\t2.212237\t5.991667\t146499.500000\t2003\n", "# so that we have both years (from and to) in each row\n", "path_df = (\n", " df.iloc[:-1]\n", " .reset_index(drop=True)\n", " .join(df.iloc[1:].reset_index(drop=True), lsuffix=\"_from\", rsuffix=\"_to\")\n", ")\n", "\n", "min_yr = df[\"Year\"].min()\n", "max_yr = df[\"Year\"].max()\n", "\n", "(\n", " ggplot(df, aes(\"Unemployment\", \"Vacancies\"))\n", " + geom_segment(\n", " aes(\n", " x=\"Unemployment_from\",\n", " y=\"Vacancies_from\",\n", " xend=\"Unemployment_to\",\n", " yend=\"Vacancies_to\",\n", " ),\n", " data=path_df,\n", " size=1,\n", " color=\"gray\",\n", " arrow=arrow(type=\"closed\", length=15, angle=15),\n", " spacer=5\n", " + 1, # Avoids arrowheads being sunk into points (+1 as circles are size 1)\n", " )\n", " + geom_point(shape=21, color=\"gray\", fill=\"#c28dc3\", size=5)\n", " + geom_text(\n", " aes(label=\"Year\"),\n", " data=df[df[\"Year\"].isin([min_yr, max_yr])],\n", " position=position_nudge(y=0.3),\n", " )\n", " + 
labs(x=\"Unemployment rate, %\", y=\"Vacancy rate, %\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bubble plot\n", "\n", "This is a scatter plot where the size of the point carries an extra dimension of information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "scat = ax.scatter(\n", " cars[\"Horsepower\"], cars[\"Miles_per_Gallon\"], s=cars[\"Displacement\"], alpha=0.4\n", ")\n", "ax.set_ylabel(\"Miles per Gallon\")\n", "ax.set_xlabel(\"Horsepower\")\n", "ax.legend(\n", " *scat.legend_elements(prop=\"sizes\", num=4),\n", " loc=\"upper right\",\n", " title=\"Displacement\",\n", " frameon=False,\n", ")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " so.Plot(cars, x=\"Horsepower\", y=\"Miles_per_Gallon\", pointsize=\"Displacement\").add(\n", " so.Dot()\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " ggplot(cars, aes(x=\"Horsepower\", y=\"Miles_per_Gallon\", size=\"Displacement\"))\n", " + geom_point()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(cars).mark_circle().encode(\n", " x=\"Horsepower\", y=\"Miles_per_Gallon\", size=\"Displacement\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Adding a new col is easiest way to get displacement into legend with plotly:\n", "cars[\"Displacement_Size\"] = 
pd.cut(cars[\"Displacement\"], bins=4)\n", "fig = px.scatter(\n", " cars,\n", " x=\"Horsepower\",\n", " y=\"Miles_per_Gallon\",\n", " size=\"Displacement\",\n", " color=\"Displacement_Size\",\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Line plot\n", "\n", "First, let's get some data on GDP growth:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "todays_date = datetime.datetime.now().strftime(\"%Y-%m-%d\")\n", "fred_df = web.DataReader([\"GDPC1\", \"NGDPRSAXDCGBQ\"], \"fred\", \"1999-01-01\", \"2021-12-31\")\n", "fred_df.columns = [\"US\", \"UK\"]\n", "fred_df.index.name = \"Date\"\n", "# Year-on-year % change (the data are quarterly, so compare with 4 periods earlier)\n", "fred_df = 100 * fred_df.pct_change(4)\n", "df = pd.melt(\n", " fred_df.reset_index(),\n", " id_vars=[\"Date\"],\n", " value_vars=fred_df.columns,\n", " value_name=\"Real GDP growth, %\",\n", " var_name=\"Country\",\n", ")\n", "df = df.set_index(\"Date\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib\n", "\n", "Note that **Matplotlib** prefers data in wide format, with one series per column (like `fred_df` above), in which case we could have just run\n", "\n", "```python\n", "fig, ax = plt.subplots()\n", "fred_df.plot(ax=ax)\n", "ax.set_title('Real GDP growth, %', loc='right')\n", "ax.yaxis.tick_right()\n", "```\n", "\n", "but we are working with tidy data here, so we'll do the plotting slightly differently." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "for i, country in enumerate(df[\"Country\"].unique()):\n", " df_sub = df[df[\"Country\"] == country]\n", " ax.plot(df_sub.index, df_sub[\"Real GDP growth, %\"], label=country, lw=2)\n", "ax.set_title(\"Real GDP growth, %\", loc=\"right\")\n", "ax.yaxis.tick_right()\n", "ax.spines[\"right\"].set_visible(True)\n", "ax.spines[\"left\"].set_visible(False)\n", "ax.legend(loc=\"lower left\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn\n", "\n", "Note that [only *some* **seaborn** commands currently support the use of named indexes](https://seaborn.pydata.org/tutorial/data_structure.html), so we use `df.reset_index()` to make the 'Date' index into a regular column in the snippet below (although in recent versions of **seaborn**, `lineplot()` would actually work fine with `data=df`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "y_var = \"Real GDP growth, %\"\n", "sns.lineplot(x=\"Date\", y=y_var, hue=\"Country\", data=df.reset_index(), ax=ax)\n", "ax.yaxis.tick_right()\n", "ax.spines[\"right\"].set_visible(True)\n", "ax.spines[\"left\"].set_visible(False)\n", "ax.set_ylabel(\"\")\n", "ax.set_title(y_var)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " so.Plot(df.reset_index(), x=\"Date\", y=\"Real GDP growth, %\", color=\"Country\").add(\n", " so.Line()\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " ggplot(df.reset_index(), aes(x=\"Date\", y=\"Real GDP growth, %\", color=\"Country\"))\n", " + geom_line(size=1)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair" ] 
}, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(df.reset_index()).mark_line().encode(\n", " x=\"Date:T\",\n", " y=\"Real GDP growth, %\",\n", " color=\"Country\",\n", " strokeDash=\"Country\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.line(\n", " df.reset_index(),\n", " x=\"Date\",\n", " y=\"Real GDP growth, %\",\n", " color=\"Country\",\n", " line_dash=\"Country\",\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bar chart\n", "\n", "Let's see a bar chart, using the 'barley' dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "barley = data.barley()\n", "barley = pd.DataFrame(barley.groupby([\"site\"])[\"yield\"].sum())\n", "barley.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib\n", "\n", "Just remove the 'h' in `ax.barh()` to get a vertical plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.barh(barley[\"yield\"].index, barley[\"yield\"], 0.35)\n", "ax.set_xlabel(\"Yield\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn\n", "\n", "Just switch x and y variables to get a vertical plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " so.Plot(barley.reset_index(), x=\"yield\", y=\"site\", color=\"site\").add(\n", " so.Bar(), so.Agg()\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot\n", "\n", "Just omit `coord_flip()` to get a vertical plot." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " ggplot(barley.reset_index(), aes(x=\"site\", y=\"yield\", fill=\"site\"))\n", " + geom_bar(stat=\"identity\")\n", " + coord_flip()\n", " + theme(legend_position=\"none\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair\n", "\n", "Just switch x and y to get a vertical plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(barley.reset_index()).mark_bar().encode(\n", " y=\"site\",\n", " x=\"yield\",\n", ").properties(\n", " width=alt.Step(40) # controls width of bar.\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.bar(barley.reset_index(), y=\"site\", x=\"yield\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouped bar chart\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "barley = data.barley()\n", "barley = pd.DataFrame(barley.groupby([\"site\", \"year\"])[\"yield\"].sum()).reset_index()\n", "barley.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels = barley[\"site\"].unique()\n", "y = np.arange(len(labels)) # the label locations\n", "width = 0.35 # the width of the bars\n", "\n", "fig, ax = plt.subplots()\n", "ax.barh(y - width / 2, barley.loc[barley[\"year\"] == 1931, \"yield\"], width, label=\"1931\")\n", "ax.barh(y + width / 2, barley.loc[barley[\"year\"] == 1932, \"yield\"], width, label=\"1932\")\n", "\n", "# Add some text for labels, title and custom x-axis tick labels, etc.\n", "ax.set_xlabel(\"Yield\")\n", "ax.set_yticks(y)\n", "ax.set_yticklabels(labels)\n", "ax.legend(frameon=False)\n", 
"plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "barley[\"year\"] = barley[\"year\"].astype(\"category\") # to force category\n", "\n", "(\n", " so.Plot(barley.reset_index(), x=\"yield\", y=\"site\", color=\"year\").add(\n", " so.Bar(), so.Dodge()\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " ggplot(barley, aes(x=\"site\", y=\"yield\", group=\"year\", fill=as_discrete(\"year\")))\n", " + geom_bar(position=\"dodge\", stat=\"identity\")\n", " + coord_flip()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(barley.reset_index()).mark_bar().encode(\n", " y=\"year:O\", x=\"yield\", color=\"year:N\", row=\"site:N\"\n", ").properties(\n", " width=alt.Step(40) # controls width of bar.\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "px_barley = barley.reset_index()\n", "# This prevents plotly from using a continuous scale for year\n", "px_barley[\"year\"] = px_barley[\"year\"].astype(\"category\")\n", "fig = px.bar(px_barley, y=\"site\", x=\"yield\", barmode=\"group\", color=\"year\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stacked bar chart" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels = barley[\"site\"].unique()\n", "y = np.arange(len(labels)) # the label locations\n", "width = 0.35 # the width (or height) of the bars\n", "\n", "fig, ax = 
plt.subplots()\n", "ax.barh(y, barley.loc[barley[\"year\"] == 1931, \"yield\"], width, label=\"1931\")\n", "ax.barh(\n", " y,\n", " barley.loc[barley[\"year\"] == 1932, \"yield\"],\n", " width,\n", " label=\"1932\",\n", " left=barley.loc[barley[\"year\"] == 1931, \"yield\"],\n", ")\n", "\n", "# Add some text for labels, title and custom x-axis tick labels, etc.\n", "ax.set_xlabel(\"Yield\")\n", "ax.set_yticks(y)\n", "ax.set_yticklabels(labels)\n", "ax.legend(frameon=False)\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "barley[\"year\"] = barley[\"year\"].astype(\"category\") # to force category\n", "(\n", " so.Plot(barley.reset_index(), x=\"yield\", y=\"site\", color=\"year\").add(\n", " so.Bar(), so.Stack()\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " ggplot(barley, aes(x=\"site\", y=\"yield\", fill=as_discrete(\"year\")))\n", " + geom_bar(stat=\"identity\")\n", " + coord_flip()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(barley.reset_index()).mark_bar().encode(\n", " y=\"site\",\n", " x=\"yield\",\n", " color=\"year:N\",\n", ").properties(\n", " width=alt.Step(40) # controls width of bar.\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.bar(px_barley, y=\"site\", x=\"yield\", barmode=\"relative\", color=\"year\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Diverging stacked bar chart\n", "\n", "First, let's create some data to use in 
our examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "category_names = [\n", " \"Strongly disagree\",\n", " \"Disagree\",\n", " \"Neither agree nor disagree\",\n", " \"Agree\",\n", " \"Strongly agree\",\n", "]\n", "results = [\n", " [10, 15, 17, 32, 26],\n", " [26, 22, 29, 10, 13],\n", " [35, 37, 7, 2, 19],\n", " [32, 11, 9, 15, 33],\n", " [21, 29, 5, 5, 40],\n", " [8, 19, 5, 30, 38],\n", "]\n", "\n", "likert_df = pd.DataFrame(\n", " results, columns=category_names, index=[f\"Question {i}\" for i in range(1, 7)]\n", ")\n", "likert_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "middle_index = likert_df.shape[1] // 2\n", "offsets = (\n", " likert_df.iloc[:, range(middle_index)].sum(axis=1)\n", " + likert_df.iloc[:, middle_index] / 2\n", ")\n", "category_colors = plt.get_cmap(\"coolwarm_r\")(\n", " np.linspace(0.15, 0.85, likert_df.shape[1])\n", ")\n", "\n", "fig, ax = plt.subplots(figsize=(10, 5))\n", "\n", "# Plot Bars\n", "for i, (colname, color) in enumerate(zip(likert_df.columns, category_colors)):\n", " widths = likert_df.iloc[:, i]\n", " starts = likert_df.cumsum(axis=1).iloc[:, i] - widths - offsets\n", " rects = ax.barh(\n", " likert_df.index, widths, left=starts, height=0.5, label=colname, color=color\n", " )\n", "\n", "# Add Zero Reference Line\n", "ax.axvline(0, linestyle=\"--\", color=\"black\", alpha=1, zorder=0, lw=0.3)\n", "\n", "# X Axis\n", "ax.set_xlim(-90, 90)\n", "ax.set_xticks(np.arange(-90, 91, 10))\n", "ax.xaxis.set_major_formatter(lambda x, pos: str(abs(int(x))))\n", "\n", "# Y Axis\n", "ax.invert_yaxis()\n", "\n", "# Remove spines\n", "ax.spines[\"right\"].set_visible(False)\n", "ax.spines[\"top\"].set_visible(False)\n", "ax.spines[\"left\"].set_visible(False)\n", "\n", "# Legend\n", "ax.legend(\n", " ncol=len(category_names),\n", " bbox_to_anchor=(0, 
1),\n", " loc=\"lower left\",\n", " fontsize=\"small\",\n", " frameon=False,\n", ")\n", "\n", "# Set Background Color\n", "fig.set_facecolor(\"#FFFFFF\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Kernel density estimate\n", "\n", "We'll use the diamonds dataset to demonstrate this." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "diamonds = sns.load_dataset(\"diamonds\").sample(1000)\n", "diamonds.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib\n", "\n", "Technically, there is a way to do this but it's pretty inelegant if you want a quick plot. That's because **matplotlib** doesn't do the density estimation itself. [Jake Vanderplas](https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html) has a nice example but as it relies on a few extra libraries, we won't reproduce it here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note that there isn't a clear way to do this in the seaborn objects API yet\n", "sns.displot(diamonds, x=\"carat\", kind=\"kde\", hue=\"cut\", fill=True);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(ggplot(diamonds, aes(x=\"carat\", fill=\"cut\", colour=\"cut\")) + geom_density(alpha=0.5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(diamonds).transform_density(\n", " density=\"carat\", as_=[\"carat\", \"density\"], groupby=[\"cut\"]\n", ").mark_area(fillOpacity=0.5).encode(\n", " x=\"carat:Q\",\n", " y=\"density:Q\",\n", " color=\"cut:N\",\n", ")" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "### Plotly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.figure_factory as ff\n", "\n", "px_di = diamonds.pivot(columns=\"cut\", values=\"carat\")\n", "ff.create_distplot(\n", " [px_di[c].dropna() for c in px_di.columns],\n", " group_labels=px_di.columns,\n", " show_rug=False,\n", " show_hist=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histogram or probability density function\n", "\n", "For this, let's go back to the penguins dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "penguins = sns.load_dataset(\"penguins\")\n", "penguins.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib\n", "\n", "The `density=` keyword parameter decides whether to create counts or a probability density function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.hist(penguins[\"flipper_length_mm\"], bins=30, density=True, edgecolor=\"k\")\n", "ax.set_xlabel(\"Flipper length (mm)\")\n", "ax.set_ylabel(\"Probability density\")\n", "fig.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " so.Plot(penguins, x=\"flipper_length_mm\").add(\n", " so.Bars(), so.Hist(bins=30, stat=\"density\")\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " ggplot(penguins, aes(x=\"flipper_length_mm\"))\n", " + geom_histogram(bins=30) # specify the binwidth\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": 
{}, "outputs": [], "source": [ "alt.Chart(penguins).mark_bar().encode(\n", "    alt.X(\"flipper_length_mm:Q\", bin=True),\n", "    y=\"count()\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.histogram(penguins, x=\"flipper_length_mm\", nbins=30)\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Marginal histograms\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib\n", "\n", "[Jake Vanderplas's excellent notes](https://jakevdp.github.io/PythonDataScienceHandbook/04.08-multiple-subplots.html) have a great example of this, but there's now an easier way to do it using Matplotlib's `subplot_mosaic` together with its `constrained_layout` option." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plt.figure(constrained_layout=True)\n", "# Create a layout with 3 panels in the given ratios\n", "axes_dict = fig.subplot_mosaic(\n", "    [[\".\", \"histx\"], [\"histy\", \"scat\"]],\n", "    gridspec_kw={\"width_ratios\": [1, 7], \"height_ratios\": [2, 7]},\n", ")\n", "# Glue all the relevant axes together\n", "axes_dict[\"histy\"].invert_xaxis()\n", "axes_dict[\"histx\"].sharex(axes_dict[\"scat\"])\n", "axes_dict[\"histy\"].sharey(axes_dict[\"scat\"])\n", "# Plot the data\n", "axes_dict[\"scat\"].scatter(penguins[\"bill_length_mm\"], penguins[\"bill_depth_mm\"])\n", "axes_dict[\"histx\"].hist(penguins[\"bill_length_mm\"])\n", "axes_dict[\"histy\"].hist(penguins[\"bill_depth_mm\"], orientation=\"horizontal\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.jointplot(data=penguins, x=\"bill_length_mm\", y=\"bill_depth_mm\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot\n" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "from lets_plot.bistro.joint import *\n", "\n", "(\n", " joint_plot(penguins, x=\"bill_length_mm\", y=\"bill_depth_mm\", reg_line=False)\n", " + labs(x=\"Bill length (mm)\", y=\"Bill depth (mm)\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair\n", "\n", "This is a bit fiddly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base = alt.Chart(penguins)\n", "\n", "xscale = alt.Scale(domain=(20, 60))\n", "yscale = alt.Scale(domain=(10, 30))\n", "\n", "area_args = {\"opacity\": 0.5, \"interpolate\": \"step\"}\n", "\n", "points = base.mark_circle().encode(\n", " alt.X(\"bill_length_mm\", scale=xscale), alt.Y(\"bill_depth_mm\", scale=yscale)\n", ")\n", "\n", "top_hist = (\n", " base.mark_area(**area_args)\n", " .encode(\n", " alt.X(\n", " \"bill_length_mm:Q\",\n", " # when using bins, the axis scale is set through\n", " # the bin extent, so we do not specify the scale here\n", " # (which would be ignored anyway)\n", " bin=alt.Bin(maxbins=30, extent=xscale.domain),\n", " stack=None,\n", " title=\"\",\n", " ),\n", " alt.Y(\"count()\", stack=None, title=\"\"),\n", " )\n", " .properties(height=60)\n", ")\n", "\n", "right_hist = (\n", " base.mark_area(**area_args)\n", " .encode(\n", " alt.Y(\n", " \"bill_depth_mm:Q\",\n", " bin=alt.Bin(maxbins=30, extent=yscale.domain),\n", " stack=None,\n", " title=\"\",\n", " ),\n", " alt.X(\"count()\", stack=None, title=\"\"),\n", " )\n", " .properties(width=60)\n", ")\n", "\n", "top_hist & (points | right_hist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.scatter(\n", " penguins,\n", " x=\"bill_length_mm\",\n", " y=\"bill_depth_mm\",\n", " marginal_x=\"histogram\",\n", " marginal_y=\"histogram\",\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "## Heatmap\n", "\n", "Heatmaps, or sometimes known as correlation maps, represent data in 3 dimensions by having two axes that forms a grid showing colour that corresponds to (usually) continuous values.\n", "\n", "We'll use the flights data to show the number of passengers by month-year:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "flights = sns.load_dataset(\"flights\")\n", "flights = flights.pivot(index=\"month\", columns=\"year\", values=\"passengers\").T\n", "flights.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "im = ax.imshow(flights.values, cmap=\"inferno\")\n", "cbar = ax.figure.colorbar(im, ax=ax)\n", "ax.set_xticks(np.arange(len(flights.columns)))\n", "ax.set_yticks(np.arange(len(flights.index)))\n", "# Labels\n", "ax.set_xticklabels(flights.columns, rotation=90)\n", "ax.set_yticklabels(flights.index)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(flights);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot\n", "\n", "Lets-Plot uses tidy data, rather than the wide data preferred by **matplotlib**, so we need to first get the original format of the flights data back:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "flights = sns.load_dataset(\"flights\")\n", "(\n", " ggplot(flights, aes(\"month\", as_discrete(\"year\"), fill=\"passengers\"))\n", " + geom_tile()\n", " + scale_y_reverse()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"alt.Chart(flights).mark_rect().encode(\n", " x=alt.X(\"month\", type=\"nominal\", sort=None), y=\"year:O\", color=\"passengers:Q\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calendar heatmap\n", "\n", "Okay the previous heatmap was technically a calendar heatmap. But there are some nifty tools for making day-of-week by month heatmaps.\n", "\n", "### Matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dayplot as dp\n", "\n", "df = dp.load_dataset()\n", "\n", "fig, ax = plt.subplots(figsize=(15, 6))\n", "dp.calendar(\n", " dates=df[\"dates\"],\n", " values=df[\"values\"],\n", " cmap=\"inferno\", # any matplotlib colormap\n", " start_date=\"2024-01-01\",\n", " end_date=\"2024-12-31\",\n", " ax=ax,\n", ")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Boxplot\n", "\n", "Let's use the tips dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tips = sns.load_dataset(\"tips\")\n", "tips.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib\n", "\n", "There isn't a very direct way to create multiple box plots of different data in matplotlib in the case where the groups are unbalanced, so we create several different boxplot objects.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "colormap = plt.cm.Set1\n", "colorst = [colormap(i) for i in np.linspace(0, 0.9, len(tips[\"time\"].unique()))]\n", "\n", "fig, ax = plt.subplots()\n", "for i, grp in enumerate(tips[\"time\"].unique()):\n", " bplot = ax.boxplot(\n", " tips.loc[tips[\"time\"] == grp, \"tip\"],\n", " positions=[i],\n", " vert=True, # vertical box alignment\n", " patch_artist=True, # fill with color\n", " labels=[grp],\n", " ) # X label\n", " for patch in bplot[\"boxes\"]:\n", " patch.set_facecolor(colorst[i])\n", "\n", "ax.set_ylabel(\"Tip\")\n", "plt.show()" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.boxplot(data=tips, x=\"time\", y=\"tip\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(ggplot(tips) + geom_boxplot(aes(y=\"tip\", x=\"time\", fill=\"time\")))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(tips).mark_boxplot(size=50).encode(\n", " x=\"time:N\", y=\"tip:Q\", color=\"time:N\"\n", ").properties(width=300)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.box(tips, x=\"time\", y=\"tip\", color=\"time\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Violin plot\n", "\n", "We'll use the same data as before, the tips dataset." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "colormap = plt.cm.Set1\n", "colorst = [colormap(i) for i in np.linspace(0, 0.9, len(tips[\"time\"].unique()))]\n", "\n", "fig, ax = plt.subplots()\n", "for i, grp in enumerate(tips[\"time\"].unique()):\n", " vplot = ax.violinplot(\n", " tips.loc[tips[\"time\"] == grp, \"tip\"], positions=[i], vert=True\n", " )\n", "labels = list(tips[\"time\"].unique())\n", "ax.set_xticks(np.arange(len(labels)))\n", "ax.set_xticklabels(labels)\n", "ax.set_ylabel(\"Tip\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.violinplot(data=tips, x=\"time\", y=\"tip\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(ggplot(tips, aes(x=\"time\", y=\"tip\", fill=\"time\")) + geom_violin())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Altair" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(tips).transform_density(\n", " \"tip\", as_=[\"tip\", \"density\"], groupby=[\"time\"]\n", ").mark_area(orient=\"horizontal\").encode(\n", " y=\"tip:Q\",\n", " color=\"time:N\",\n", " x=alt.X(\n", " \"density:Q\",\n", " stack=\"center\",\n", " impute=None,\n", " title=None,\n", " axis=alt.Axis(labels=False, values=[0], grid=False, ticks=True),\n", " ),\n", " column=alt.Column(\n", " \"time:N\",\n", " header=alt.Header(\n", " titleOrient=\"bottom\",\n", " labelOrient=\"bottom\",\n", " labelPadding=0,\n", " ),\n", " ),\n", ").properties(width=100).configure_facet(spacing=0).configure_view(stroke=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.violin(\n", " tips,\n", " y=\"tip\",\n", " x=\"time\",\n", " color=\"time\",\n", " box=True,\n", " points=\"all\",\n", " hover_data=tips.columns,\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lollipop" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "planets = sns.load_dataset(\"planets\").groupby(\"year\")[\"number\"].count()\n", "planets.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.stem(planets.index, planets, basefmt=\"\")\n", "ax.yaxis.tick_right()\n", "ax.spines[\"left\"].set_visible(False)\n", "ax.set_ylim(0, 200)\n", "ax.set_title(\"Number of exoplanets discovered per year\")\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " so.Plot(planets.reset_index(), x=\"year\", y=\"number\")\n", " .add(so.Dot(), so.Agg(\"sum\"))\n", " .add(so.Bar(width=0.1), so.Agg(\"sum\"))\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets-Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " ggplot(planets.reset_index(), aes(x=\"year\", y=\"number\"))\n", " + geom_lollipop()\n", " + ggtitle(\"Number of exoplanets discovered per year\")\n", " + scale_x_continuous(format=\"d\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.graph_objects as go\n", "\n", "px_df = planets.reset_index()\n", "\n", "fig1 = go.Figure()\n", "# Draw points\n", 
"fig1.add_trace(\n", " go.Scatter(\n", " x=px_df[\"year\"],\n", " y=px_df[\"number\"],\n", " mode=\"markers\",\n", " marker_color=\"darkblue\",\n", " marker_size=10,\n", " )\n", ")\n", "# Draw lines\n", "for index, row in px_df.iterrows():\n", " fig1.add_shape(type=\"line\", x0=row[\"year\"], y0=0, x1=row[\"year\"], y1=row[\"number\"])\n", "fig1.show()" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "codeforecon", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 4 }