{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Essentials of GGplotting with R\n",
    "In this notebook you'll learn about using ggplot2 to make publication-quality figures.\n",
    "\n",
    "## Some useful notes\n",
    "\n",
    "In Jupyter Notebook you can get a popup with a function's documentation, just as you can in RStudio. Navigate to a cell or start a new one and enter `?function` as you normally would; a popup will appear.\n",
    "\n",
    "The Insert dropdown menu and Run button at the top let you add cells and run code or render Markdown in them. The following keyboard shortcuts do the same things:\n",
    "\n",
    "- Shift+Enter: Run code or render Markdown in the current cell you're on\n",
    "- Esc+a: Add a cell above\n",
    "- Esc+b: Add a cell below\n",
    "- Esc+dd: Delete a cell"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prerequisites"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(tidyverse)\n",
    "library(gridExtra)\n",
    "library(ggrepel)\n",
    "library(maps)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualizing Data\n",
    "\n",
    "A core feature of exploratory data analysis is asking questions about data and searching for answers by visualizing and modeling the data. Most questions concern what type of variation or covariation occurs between variables.\n",
    "\n",
    "Base R comes with some functions to visualize your data -- base R plots might look something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "options(repr.plot.width=10, repr.plot.height=7)\n",
    "# regular plot functions in R\n",
    "plot(x=mpg$displ,y=mpg$hwy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also use ggplot2 for your visualizations -- here's an example of default parameters in ggplot2:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ggplot!\n",
    "ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also make publication-quality visualizations using ggplot2:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(color = class)) +\n",
    "  geom_smooth(se = FALSE) +\n",
    "  labs(x=\"Engine displacement (L)\",y=\"Highway fuel economy (mpg)\",\n",
    "    title = \"Fuel efficiency generally decreases with engine size\",\n",
    "    caption = \"Data from fueleconomy.gov\",\n",
    "    subtitle = \"Two seaters (sports cars) are an exception because of their light weight\",\n",
    "    colour = \"Car type\"\n",
    "  ) + theme_classic()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualizing Data\n",
    "\n",
    "All plots in ggplot follow the same syntax:\n",
    "\n",
    "```\n",
    "ggplot(data=<DATA>) +\n",
    "    <GEOM_FUNCTION>(mapping=aes(<MAPPINGS>))\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use the `head()` function to look at the data we plotted in the above examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "head(mpg) # automatically loaded when you load tidyverse"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's break down the components of ggplot. First, note that `ggplot(data=<DATA>)` on its own will not actually plot anything."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is because we need the `<GEOM_FUNCTION>(mapping=aes(<MAPPINGS>))` to tell ggplot2 what exactly to plot using our data. However, `ggplot(data=<DATA>) + <GEOM_FUNCTION>()` on its own doesn't work either: `geom_point()` stops with an error because the required `x` and `y` aesthetics are missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg) + geom_point()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So in fact we need *all* of the components described in the ggplot syntax."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg) + geom_point(mapping=aes(x=displ,y=hwy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `<MAPPINGS>`\n",
    "\n",
    "```\n",
    "ggplot(data=<DATA>) +\n",
    "    <GEOM_FUNCTION>(mapping=aes(<MAPPINGS>))\n",
    "```\n",
    "\n",
    "Mappings connect variables in the data to the visual properties (aesthetics) of objects in the plot, e.g. size, shape, or color. We can display a third variable (in this case `class`) in different ways by mapping it to different aesthetics. The values an aesthetic can take are known as **levels**, a term used to distinguish aesthetic values from data values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use `geom_point` to make some scatter plots and modify the mappings to change how we represent the `class` categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p1 <- ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,color=class))\n",
    "p2 <- ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,shape=class))\n",
    "p3 <- ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,size=class))\n",
    "p4 <- ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,alpha=class))\n",
    "grid.arrange(p1,p2,p3,p4,nrow=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, we can represent the `class` data using `color`, `shape`, `size`, or `alpha` (transparency). As you can see, not all mappings lend themselves to all data: by default only 6 `shape` values are available (we would need 7), and `alpha` and `size` aren't recommended for discrete data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Levels\n",
    "\n",
    "**ggplot2** automatically assigns a unique level of an aesthetic to a unique value of the variable. This process is known as scaling. It will also automatically select a scale to use with the aesthetic (i.e. continuous or discrete) as well as add a legend explaining the mapping between levels and values. That's why in the shape mapping there's no shape for suv, and why the following two pieces of code do different things:"
   ]
  },
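  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here `'blue'` is inside `aes()`, so ggplot2 treats it as a data value: every point gets the same constant value `'blue'`, which is assigned a single level, and that level is drawn in the first default color (a reddish hue) rather than in blue."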
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,color='blue'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, `color` is placed outside the aesthetic mapping, so ggplot2 understands that we want the points themselves to be blue:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy),color='blue')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`cty` is a numeric variable, so when it is mapped to `color` we get a continuous color gradient instead of distinct colors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,color=cty))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Continuous vs discrete scales\n",
    "\n",
    "Generally continuous scales get chosen for numerical data and discrete scales are chosen for categorical data. If your data is numeric but in discrete categories you may have to use `as.factor()` in order to get proper levels."
   ]
  },
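  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of what `as.factor()` does, the code below converts a small numeric vector into a factor. The vector `cyl_values` is a made-up stand-in for a column like `mpg$cyl`, not data from the workshop.\n",
    "\n",
    "```r\n",
    "# Hypothetical numeric vector standing in for a column like mpg$cyl\n",
    "cyl_values <- c(4, 6, 8, 4, 6)\n",
    "\n",
    "# as.factor() turns the numbers into a categorical variable\n",
    "cyl_factor <- as.factor(cyl_values)\n",
    "\n",
    "is.numeric(cyl_values)  # TRUE: ggplot2 would pick a continuous scale\n",
    "is.factor(cyl_factor)   # TRUE: ggplot2 would pick a discrete scale\n",
    "levels(cyl_factor)      # one level per distinct value\n",
    "```\n",
    "\n",
    "Each distinct number becomes one level, which is what lets ggplot2 choose a discrete scale."
   ]
  },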
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we try to map `cyl` to `shape` we get an error, because `shape` only works with discrete variables and `cyl` is stored as a number, even though it only takes the values 4, 5, 6, or 8:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,shape=cyl))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can transform `cyl` into a categorical variable with levels using the `as.factor` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "as.factor(mpg$cyl)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can try plotting again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,shape=as.factor(cyl)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `x` and `y` are aesthetic mappings as well. In fact, without them you will get an error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) + geom_point()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `<GEOM_FUNCTION>`\n",
    "\n",
    "```\n",
    "ggplot(data=<DATA>) +\n",
    "    <GEOM_FUNCTION>(mapping=aes(<MAPPINGS>))\n",
    "```\n",
    "\n",
    "A **geom** is the geometrical object that a plot uses to represent data. Bar charts use bar geoms, line charts use line geoms, scatterplots use point geoms, etc. The full list of geoms provided with **ggplot2** can be seen in the [ggplot2 reference](https://ggplot2.tidyverse.org/reference/#section-layer-geoms). There are also geoms created by [other packages](http://www.ggplot2-exts.org/gallery/).\n",
    "\n",
    "Every geom function in ggplot2 takes a `mapping` argument, but each geom supports only a specific set of aesthetics, so not every aesthetic will work with every geom. For example, you can set the shape of a point but not the shape of a line. You can, however, set the linetype of a line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data = mpg) +\n",
    "  geom_smooth(mapping = aes(x = displ, y = hwy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also map `linetype` to `drv` and see that the data has been separated into three lines based on their drivetrain: 4 (4wd), f (front), r (rear)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data = mpg) +\n",
    "  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can display multiple geoms on the same plot just by adding them -- let's add `geom_smooth` to `geom_point`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data = mpg) + \n",
    "  geom_point(mapping = aes(x = displ, y = hwy, color=drv)) +\n",
    "  geom_smooth(mapping = aes(x = displ, y = hwy, color=drv, linetype=drv))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Geoms like `geom_smooth()` use a single geometric object to display multiple rows of data. If you don't want to add other distinguishing features such as color to the geom, you can use the `group` aesthetic (with a categorical variable) to draw multiple objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) +\n",
    "    geom_smooth(mapping=aes(x=displ,y=hwy,group=drv))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use `?geom_smooth` to see a full list of which aesthetics `geom_smooth` will understand."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Global mappings vs local mappings\n",
    "\n",
    "Mappings passed to the `ggplot()` function are *global* and apply to every geom, while mappings passed to a geom are *local* to that layer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Global mapping of `displ` and `hwy` creates the x and y axes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg, mapping=aes(x=displ,y=hwy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mapping `color` to `class` for point geom while using global x and y mappings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) + geom_point(mapping=aes(color=class))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`geom_smooth` doesn't need any mapping arguments if using global mapping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) +\n",
    "    geom_point(mapping=aes(color=class))+\n",
    "    geom_smooth()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The second `geom_smooth` uses the same x and y mappings, but its data come from `no_2seaters` (from the Tidyverse section of the workshop) instead of `mpg`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "no_2seaters <- filter(mpg, class != \"2seater\")\n",
    "\n",
    "ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + \n",
    "  geom_point(mapping = aes(color = class)) + \n",
    "  geom_smooth() +\n",
    "  geom_smooth(data = no_2seaters)"
   ]
  },
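  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the difference between global and local mappings concrete, here is a minimal sketch that places the same `colour = drv` mapping in two different spots. The object names `p_global` and `p_local` are only illustrative.\n",
    "\n",
    "```r\n",
    "library(ggplot2)\n",
    "\n",
    "# colour = drv is global, so both layers inherit it:\n",
    "# points are coloured by drv and one smooth line is drawn per drv value\n",
    "p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +\n",
    "  geom_point() +\n",
    "  geom_smooth(se = FALSE)\n",
    "\n",
    "# colour = drv is local to geom_point(), so geom_smooth() sees only x and y\n",
    "# and draws a single line through all of the points\n",
    "p_local <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +\n",
    "  geom_point(mapping = aes(colour = drv)) +\n",
    "  geom_smooth(se = FALSE)\n",
    "\n",
    "p_global\n",
    "p_local\n",
    "```\n",
    "\n",
    "Putting a mapping in `ggplot()` saves typing when every layer should use it; keeping it local is the safer choice when only one layer should."
   ]
  },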
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## More syntax\n",
    "\n",
    "We have gone over the minimum required syntax for ggplot, but there are additional options that can be specified to further customize your plots, such as `<FACET_FUNCTION>`:\n",
    "\n",
    "```{r}\n",
    "ggplot(data = <DATA>) + \n",
    "  <GEOM_FUNCTION>(\n",
    "     mapping = aes(<MAPPINGS>),\n",
    "     stat = <STAT>, \n",
    "     position = <POSITION>\n",
    "  ) +\n",
    "  <COORDINATE_FUNCTION> +\n",
    "  <FACET_FUNCTION>\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Facets\n",
    "\n",
    "Facets can be used to create subplots that each display one subset of the data.\n",
    "\n",
    " * `facet_wrap()` for a single variable.\n",
    " * `facet_grid()` for the combination of 2 variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "ggplot(data=mpg) +\n",
    "    geom_point(mapping=aes(x=displ,y=hwy)) +\n",
    "    facet_wrap(~ class, nrow=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use the `nrow` or `ncol` argument to change the arrangement of the subplots:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) +\n",
    "    geom_point(mapping=aes(x=displ,y=hwy)) +\n",
    "    facet_wrap(~ class, nrow=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) +\n",
    "    geom_point(mapping=aes(x=displ,y=hwy)) +\n",
    "    facet_wrap(~ class, ncol=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When using `facet_grid`, some facets might be empty because no observations have those combinations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "ggplot(data = mpg) + \n",
    "  geom_point(mapping = aes(x = displ, y = hwy)) + \n",
    "  facet_grid(drv ~ cyl)"
   ]
  },
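  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`facet_grid()` can also facet along just one dimension. In the sketch below, which assumes the same `mpg` data, a `.` takes the place of the row variable so the panels are laid out in columns only. The name `p_cols` is just an illustrative choice.\n",
    "\n",
    "```r\n",
    "library(ggplot2)\n",
    "\n",
    "# '.' means 'no variable' for the rows, so we get one column of panels per cyl value\n",
    "p_cols <- ggplot(data = mpg) +\n",
    "  geom_point(mapping = aes(x = displ, y = hwy)) +\n",
    "  facet_grid(. ~ cyl)\n",
    "\n",
    "p_cols\n",
    "```\n",
    "\n",
    "Swapping the formula to `cyl ~ .` would stack the panels in rows instead."
   ]
  },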
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stats\n",
    "\n",
    "```{r}\n",
    "ggplot(data = <DATA>) + \n",
    "  <GEOM_FUNCTION>(\n",
    "     mapping = aes(<MAPPINGS>),\n",
    "     stat = <STAT>, \n",
    "     position = <POSITION>\n",
    "  ) +\n",
    "  <COORDINATE_FUNCTION> +\n",
    "  <FACET_FUNCTION>\n",
    "```\n",
    "\n",
    "The `stat` argument specifies the algorithm used to calculate new values for a graph. Each geom has a default stat, and each stat has a default geom. Geoms like `geom_point()` leave the data as is, which is known as `stat_identity()`. Bar charts count the number of observations in each category, known as `stat_count()`, while histograms bin the data and count the observations in each bin (`stat_bin()`). You can see the full list of stats at the [ggplot2 reference](https://ggplot2.tidyverse.org/reference/) under both Layer: geoms and Layer: stats."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) +\n",
    "    geom_bar(mapping=aes(x=class))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since each stat comes with a default geom, we can also use a stat function to create geoms on plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) +\n",
    "    stat_count(mapping=aes(x=class))"
   ]
  },
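  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The default stat can also be overridden. As a sketch, suppose the counts have already been computed (here with `dplyr::count()`; the names `class_counts` and `n_cars` are made up for this example). Setting `stat = 'identity'` asks `geom_bar()` to use those values as the bar heights instead of counting rows itself:\n",
    "\n",
    "```r\n",
    "library(ggplot2)\n",
    "library(dplyr)\n",
    "\n",
    "# Pre-compute the number of cars per class (one row per class)\n",
    "class_counts <- count(mpg, class, name = 'n_cars')\n",
    "\n",
    "# stat = 'identity' uses n_cars directly as the height of each bar\n",
    "p_identity <- ggplot(data = class_counts) +\n",
    "  geom_bar(mapping = aes(x = class, y = n_cars), stat = 'identity')\n",
    "\n",
    "p_identity\n",
    "```\n",
    "\n",
    "The result should look like the default `geom_bar()` chart of `mpg`, since the counting has simply been moved out of ggplot2."
   ]
  },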
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because `stat_count()` computes `count` and `prop`, we can use those computed variables in mappings as well:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data=mpg) + geom_bar(mapping=aes(x=class, y=after_stat(prop),group=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`stat_summary()` uses `geom_pointrange()` as its default geom, and by default it computes the mean and standard error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data = mpg) + \n",
    "  stat_summary(mapping = aes(x=class,y=hwy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can change `stat_summary` to compute the median and min/max instead:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data = mpg) +\n",
    "  stat_summary(\n",
    "    mapping = aes(x = class, y = hwy),\n",
    "    fun.min = min,\n",
    "    fun.max = max,\n",
    "    fun = median\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Position adjustments\n",
    "\n",
    "```{r}\n",
    "ggplot(data = <DATA>) + \n",
    "  <GEOM_FUNCTION>(\n",
    "     mapping = aes(<MAPPINGS>),\n",
    "     stat = <STAT>, \n",
    "     position = <POSITION>\n",
    "  ) +\n",
    "  <COORDINATE_FUNCTION> +\n",
    "  <FACET_FUNCTION>\n",
    "```\n",
    "\n",
    "Each geom also comes with a default **position adjustment**, specified by the `position` argument. For geoms like `geom_point()` it is \"identity\", which leaves each position as is.\n",
    "\n",
    "Bar charts have a `fill` aesthetic. If `fill` is mapped to another variable, the bars are automatically stacked by the \"stack\" position. You can see the [list of positions](https://ggplot2.tidyverse.org/reference/#section-layer-position-adjustment) at the ggplot2 reference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p1 <- ggplot(data = mpg, mapping=aes(x=class,fill=as.factor(cyl)))\n",
    "p1 + geom_bar()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`position = \"identity\"` will place each object exactly where it falls in the context of the graph. This isn't very useful for bar charts, because the bars overlap (hence the transparency below); it is better suited to scatterplots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p1 + geom_bar(position=\"identity\", alpha=0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`position = \"fill\"` will make the stacked bars the same height:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p1 + geom_bar(position=\"fill\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`position = \"dodge\"` places objects directly beside one another, which can make it easier to compare individual values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p1 + geom_bar(position=\"dodge\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For `geom_point` one possible position is \"jitter\", which adds a small amount of random noise to each point. This spreads the points out so that they are unlikely to overlap and get plotted on top of each other. For example, many observations might share the same combination of `hwy` and `displ`, but because they are all plotted at exactly the same location you can't tell. For very large datasets, jittering can help prevent overplotting so that you can better see where the mass of the data is and what the trends are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + \n",
    "  geom_point()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This plot makes the data look quite regular -- maybe there are multiple observations with the same values of `cty` and `hwy`, creating overlapping points. Let's check:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + \n",
    "  geom_point(position=\"jitter\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`position = \"jitter\"` has spread out the overlapping points for us."
   ]
  },
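  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The amount of noise can be controlled too. The sketch below uses `position_jitter()` with explicit `width` and `height` values and a `seed`, so the same jittered plot is produced on every run. The specific numbers and the name `p_jitter` are arbitrary choices for illustration.\n",
    "\n",
    "```r\n",
    "library(ggplot2)\n",
    "\n",
    "# width/height set the maximum shift in each direction (in data units);\n",
    "# seed makes the random noise reproducible\n",
    "p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +\n",
    "  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 1))\n",
    "\n",
    "p_jitter\n",
    "```\n",
    "\n",
    "Smaller values keep points closer to their true positions, at the cost of more overlap."
   ]
  },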
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Coordinate systems\n",
    "\n",
    "```{r}\n",
    "ggplot(data = <DATA>) + \n",
    "  <GEOM_FUNCTION>(\n",
    "     mapping = aes(<MAPPINGS>),\n",
    "     stat = <STAT>, \n",
    "     position = <POSITION>\n",
    "  ) +\n",
    "  <COORDINATE_FUNCTION> +\n",
    "  <FACET_FUNCTION>\n",
    "```\n",
    "\n",
    "The default coordinate system is Cartesian.\n",
    "\n",
    " * `coord_flip()` switches x and y axes.\n",
    " * `coord_quickmap()` sets aspect ratio for maps.\n",
    " * `coord_polar()` sets polar coordinates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p <- ggplot(data = mpg, mapping = aes(x = class, y = hwy))\n",
    "p + geom_boxplot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use `coord_flip()` to flip the coordinates:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p + geom_boxplot() + coord_flip()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can also reorder x axis by lowest to highest median hwy mileage, which might allow easier comparisons"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(data = mpg, mapping = aes(x = reorder(class,hwy,FUN=median), y = hwy)) + \n",
    "  geom_boxplot() +\n",
    "  coord_flip()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also use `geom_polygon` to make some maps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nz <- map_data(\"nz\")\n",
    "\n",
    "ggplot(nz, aes(long, lat, group = group)) +\n",
    "  geom_polygon(fill = \"white\", colour = \"black\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`coord_quickmap()` sets the aspect ratio correctly for maps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(nz, aes(long, lat, group = group)) +\n",
    "  geom_polygon(fill = \"white\", colour = \"black\") +\n",
    "  coord_quickmap()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use polar coordinates:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bar <- ggplot(data = mpg) + \n",
    "  geom_bar(\n",
    "    mapping = aes(x = class, fill = as.factor(cyl)), \n",
    "    show.legend = FALSE,\n",
    "    width = 1\n",
    "  ) + \n",
    "  theme(aspect.ratio = 1) +\n",
    "  labs(x = NULL, y = NULL)\n",
    "\n",
    "p1 <- bar + coord_flip()\n",
    "p2 <- bar + coord_polar()\n",
    "grid.arrange(p1,p2, nrow=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Publication Quality Graphs\n",
    "\n",
    "The last piece covers some additional functions for polishing your plots.\n",
    "\n",
    "## Labels\n",
    "\n",
    "Use `labs()` to add most kinds of labels: title, subtitle, caption, x-axis, y-axis, legend, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(color = class)) +\n",
    "  geom_smooth(se = FALSE) +\n",
    "  labs(\n",
    "    title = \"Fuel efficiency generally\\n decreases with engine size\",\n",
    "    subtitle = \"Two seaters (sports cars) are an exception because of their light weight\",\n",
    "    caption = \"Data from fueleconomy.gov\",\n",
    "    x = \"Engine displacement (L)\",\n",
    "    y = \"Highway fuel economy (mpg)\",\n",
    "    color = \"Car type\"\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Annotations\n",
    "\n",
    "We can use `geom_text()` to add text labels on the plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_in_class <- mpg %>%\n",
    "  group_by(class) %>%\n",
    "  filter(row_number(desc(hwy)) == 1)\n",
    "\n",
    "ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(colour = class)) +\n",
    "  geom_text(aes(label = model), data = best_in_class)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use `ggrepel` to add labels that are automatically moved so they don't overlap:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(colour = class)) +\n",
    "  ggrepel::geom_label_repel(aes(label = model), data = best_in_class) +\n",
    "  labs(\n",
    "    caption = \"Data from fueleconomy.gov\",\n",
    "    x = \"Engine displacement (L)\",\n",
    "    y = \"Highway fuel economy (mpg)\",\n",
    "    colour = \"Car type\"\n",
    "  ) +\n",
    "  geom_point(size = 3, shape = 1, data = best_in_class)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scales\n",
    "\n",
    " * `breaks`: For the position of ticks.\n",
    " * `labels`: For the text label associated with each tick.\n",
    " * The default scales are continuous for both x and y, but you can also use logarithmic x and y scales or change the color scales."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Specify the y-scale breaks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point() +\n",
    "  scale_y_continuous(breaks = seq(15, 40, by = 5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remove axis tick labels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point() +\n",
    "  scale_x_continuous(labels = NULL) +\n",
    "  scale_y_continuous(labels = NULL)"
   ]
  },
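  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`breaks` and `labels` can also be combined to give each tick its own text. This is a small sketch; the tick positions and label strings below (`hwy_breaks`, `hwy_labels`) are made-up values chosen for illustration, and the two vectors need to have the same length.\n",
    "\n",
    "```r\n",
    "library(ggplot2)\n",
    "\n",
    "# Hypothetical tick positions and the text to show at each one\n",
    "hwy_breaks <- c(20, 30, 40)\n",
    "hwy_labels <- c('20 mpg', '30 mpg', '40 mpg')\n",
    "\n",
    "# Ticks are drawn at hwy_breaks and labelled with hwy_labels\n",
    "p_labels <- ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point() +\n",
    "  scale_y_continuous(breaks = hwy_breaks, labels = hwy_labels)\n",
    "\n",
    "p_labels\n",
    "```"
   ]
  },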
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also log-scale the axes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p1 <- ggplot(diamonds, aes(carat, price)) +\n",
    "  geom_bin2d()\n",
    "ggplot(diamonds, aes(carat, price)) +\n",
    "  geom_bin2d() + \n",
    "  scale_x_log10() + \n",
    "  scale_y_log10()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could get a similar plot by specifying `log10(carat)` and `log10(price)` in the aesthetic mapping, but then the axes are labeled with the transformed values rather than the original ones:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(diamonds, aes(log10(carat), log10(price))) +\n",
    "  geom_bin2d()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use ColorBrewer palettes to change the colors -- let's compare the default to the `Set1` palette:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(color = drv))\n",
    "\n",
    "ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(color = drv)) +\n",
    "  scale_colour_brewer(palette = \"Set1\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `?scale_colour_brewer()` to see a list of palettes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also manually specify which colors to use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(color = drv)) +\n",
    "  scale_colour_manual(values=c(`4`=\"red\",f=\"blue\",r=\"blue\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Legend positioning\n",
    "\n",
    "Use `theme(legend.position)` to control the legend position, and `guides()` with `guide_legend()` or `guide_colourbar()` to control how the legend is displayed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "base <- ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(colour = class))\n",
    "\n",
    "p_left <- base + theme(legend.position = \"left\")\n",
    "p_top <- base + theme(legend.position = \"top\")\n",
    "p_bottom <- base + theme(legend.position = \"bottom\")\n",
    "p_right <- base + theme(legend.position = \"right\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use `grid.arrange` to look at our plots:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grid.arrange(p_left, p_right, nrow = 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grid.arrange(p_top, p_bottom, nrow = 1)"
   ]
  },
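  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides the four sides, `legend.position` also accepts `\"none\"`, which drops the legend entirely. A short sketch (the object names `legend_base` and `p_none` are only for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(ggplot2)\n",
    "\n",
    "legend_base <- ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(colour = class))\n",
    "\n",
    "# The points are still coloured by class, but no legend is drawn\n",
    "p_none <- legend_base + theme(legend.position = \"none\")\n",
    "p_none"
   ]
  },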
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's pull a few of these pieces together to start making our publication quality visualization:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(colour = class)) +\n",
    "  geom_smooth(se = FALSE) +\n",
    "  theme(legend.position = \"bottom\") +\n",
    "  guides(colour = guide_legend(nrow = 1, override.aes = list(size = 4)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Zooming\n",
    "\n",
    "Three ways to control plot limits:\n",
    " * Adjusting what data are plotted\n",
    " * Setting limits in each scale\n",
    " * Setting `xlim` and `ylim` in `coord_cartesian()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can set `xlim` and `ylim` in `coord_cartesian`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(mpg, mapping = aes(displ, hwy)) +\n",
    "  geom_point(aes(color = class)) +\n",
    "  geom_smooth() +\n",
    "  coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can adjust what data are plotted, but note that `geom_smooth` will plot its regression over the subsetted data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filter(mpg, displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%\n",
    "  ggplot(aes(displ, hwy)) +\n",
    "  geom_point(aes(color = class)) +\n",
    "  geom_smooth()"
   ]
  },
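  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the difference side by side, here is a rough sketch comparing `coord_cartesian()` with limits set on the scales. Setting limits on a scale behaves like subsetting: values outside the limits are turned into `NA` before the smoother is fitted, so expect warnings about removed rows. The object names (`zoom_base`, `p_coord`, `p_scale`) are just for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(ggplot2)\n",
    "library(gridExtra)\n",
    "\n",
    "zoom_base <- ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(colour = class)) +\n",
    "  geom_smooth(se = FALSE)\n",
    "\n",
    "# Zoom only: the smoother is fitted to all of mpg, then the view is cropped\n",
    "p_coord <- zoom_base +\n",
    "  coord_cartesian(xlim = c(5, 7), ylim = c(10, 30)) +\n",
    "  ggtitle(\"coord_cartesian()\")\n",
    "\n",
    "# Scale limits: out-of-range values are dropped before the smoother is fitted\n",
    "p_scale <- zoom_base +\n",
    "  scale_x_continuous(limits = c(5, 7)) +\n",
    "  scale_y_continuous(limits = c(10, 30)) +\n",
    "  ggtitle(\"Scale limits\")\n",
    "\n",
    "grid.arrange(p_coord, p_scale, nrow = 1)"
   ]
  },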
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also have different scales along `hwy` and `displ` if you subet the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "suv <- mpg %>% filter(class == \"suv\")\n",
    "compact <- mpg %>% filter(class == \"compact\")\n",
    "ggplot(suv, aes(displ, hwy, colour = drv)) +\n",
    "  geom_point()\n",
    "\n",
    "ggplot(compact, aes(displ, hwy, colour = drv)) +\n",
    "  geom_point()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the first plot is showing 4 and r for `drv`, while the second is showing 4 and f for `drv`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can set limits in each scale"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_scale <- scale_x_continuous(limits = range(mpg$displ))\n",
    "y_scale <- scale_y_continuous(limits = range(mpg$hwy))\n",
    "col_scale <- scale_colour_discrete(limits = unique(mpg$drv))\n",
    "\n",
    "ggplot(suv, aes(displ, hwy, colour = drv)) +\n",
    "  geom_point() +\n",
    "  x_scale +\n",
    "  y_scale +\n",
    "  col_scale\n",
    "\n",
    "ggplot(compact, aes(displ, hwy, colour = drv)) +\n",
    "  geom_point() +\n",
    "  x_scale +\n",
    "  y_scale +\n",
    "  col_scale"
   ]
  },
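  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this particular case, faceting is a simpler route to the same result, because facets share their x, y, and colour scales by default. A minimal sketch (the name `p_facet` is only for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(dplyr)\n",
    "library(ggplot2)\n",
    "\n",
    "# Keep both classes in one data frame and let facet_wrap() split the panels\n",
    "p_facet <- mpg %>%\n",
    "  filter(class %in% c(\"suv\", \"compact\")) %>%\n",
    "  ggplot(aes(displ, hwy, colour = drv)) +\n",
    "  geom_point() +\n",
    "  facet_wrap(~ class)\n",
    "p_facet"
   ]
  },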
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Themes\n",
    "\n",
    "**ggplot2** has 8 themes by default, can get more in other packages like **ggthemes**. Generally prefer `theme_classic()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "base <- ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(color = class)) +\n",
    "  geom_smooth(se = FALSE)\n",
    "\n",
    "p1 <- base + theme_bw()\n",
    "p2 <- base + theme_light()\n",
    "p3 <- base + theme_classic()\n",
    "p4 <- base + theme_linedraw()\n",
    "p5 <- base + theme_dark()\n",
    "p6 <- base + theme_minimal()\n",
    "p7 <- base + theme_void()\n",
    "\n",
    "grid.arrange(base,p1,p2,p3,p4,p5,p6,p7,nrow=3)"
   ]
  },
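  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you settle on one theme, adding it to every plot gets repetitive. One option, sketched here, is `theme_set()`, which changes the default theme for all subsequent plots in the session and returns the previous default so it can be restored. The names `old_theme` and `p_default` are just for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(ggplot2)\n",
    "\n",
    "# theme_set() returns the previous default theme; keep it so it can be restored\n",
    "old_theme <- theme_set(theme_classic())\n",
    "\n",
    "# No theme is added here, but the plot is drawn with theme_classic()\n",
    "p_default <- ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(colour = class))\n",
    "print(p_default)\n",
    "\n",
    "# Restore the original default theme\n",
    "theme_set(old_theme)"
   ]
  },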
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving your plots\n",
    "\n",
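    " * `ggsave()` saves the most recent plot to disk by default; to save a specific plot, store it as an object and pass it via the `plot` argument.\n",
    " * `tiff()` opens a TIFF graphics device; everything plotted until you call `dev.off()` is written to the file.\n",
    " * Other device functions work the same way, e.g. `postscript()` for eps files.\n",
    " * All take `width` and `height`; the other arguments vary by function, e.g. `ggsave()` sets resolution with `dpi`, `tiff()` takes `pointsize` and `res` (resolution), and `postscript()` takes `fonts` and `pointsize`."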
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p1 <- ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(color = class)) +\n",
    "  geom_smooth(se = FALSE) +\n",
    "  labs(x=\"Engine displacement (L)\",y=\"Heighway fuel economy (mpg)\",\n",
    "    title = \"Fuel efficiency generally decreases with engine size\",\n",
    "    caption = \"Data from fueleconomy.gov\",\n",
    "    subtitle = \"Two seaters (sports cars) are an exception because of their light weight\",\n",
    "    colour = \"Car type\"\n",
    "  ) + theme_classic()\n",
    "p1\n",
    "ggsave(\"my_plot.pdf\")\n",
    "\n",
    "tiff(\"my_plot.tiff\",width=7,height=5,units=\"in\",pointsize=8,res=350)\n",
    "p1\n",
    "dev.off()"
   ]
  },
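  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For raster formats, `ggsave()` picks the device from the file extension and takes the size and resolution directly. The following is a sketch only: the plot object `p_save` and the output path in a temporary directory are made up for this example, and the size and `dpi` values are arbitrary and should be adjusted to your journal's requirements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(ggplot2)\n",
    "\n",
    "p_save <- ggplot(mpg, aes(displ, hwy)) +\n",
    "  geom_point(aes(colour = class)) +\n",
    "  theme_classic()\n",
    "\n",
    "# Hypothetical output path in a temporary directory\n",
    "out_file <- file.path(tempdir(), \"my_plot.png\")\n",
    "\n",
    "# width and height are given in `units`; dpi sets the raster resolution\n",
    "ggsave(out_file, plot = p_save, width = 7, height = 5, units = \"in\", dpi = 300)\n",
    "\n",
    "# Check that the file was written\n",
    "file.exists(out_file)"
   ]
  },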
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Some other useful visualization packages\n",
    "\n",
    "We don't have time in this workshop to get in depth, but here are some more useful visualization packages that may be helpful for your research.\n",
    "\n",
    "## ggtree for phylogenetics\n",
    "\n",
    "Resources and associated packages:\n",
    " * [Data Integration, Manipulation and Visualization of Phylogenetic Trees](https://yulab-smu.github.io/treedata-book/index.html)\n",
    " * [treeio](https://bioconductor.org/packages/release/bioc/html/treeio.html)\n",
    " * [tidytree](https://cran.r-project.org/web/packages/tidytree/index.html)\n",
    " \n",
    "## cowplot\n",
    "\n",
    "Meant to provide publication-ready theme for **gplot2** that requires minimum amount of fiddling with sizes of axis labels, plot backgrounds, etc. Auto-sets `theme_classic()` for all plots.\n",
    "\n",
    "## Gviz for plotting data along genomic coordinates\n",
    "\n",
    "Can be installed from [Bioconductor](https://bioconductor.org/packages/release/bioc/html/Gviz.html).\n",
    "\n",
    "## phyloseq for metagenomics\n",
    "\n",
    "Website is [very comprehensive](http://joey711.github.io/phyloseq/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Summary\n",
    "\n",
    "Now that we've gone through tidying, transforming, and visualizing data let's review all of the different functions we've used and in some cases learned the inner workings of:\n",
    "\n",
    "## Tidying\n",
    "\n",
    " * `gather()`\n",
    " * `spread()`\n",
    " * `separate()`\n",
    " * `unite()`\n",
    " * `%>%` passes the output of one function as the first argument of the next, e.g. `x %>% f(y)` becomes `f(x, y)`, and `x %>% f(y) %>% g(z)` becomes `g(f(x, y), z)`.\n",
    " \n",
    "## Transforming\n",
    "\n",
    " * `filter()` to pick observations (rows) by their values\n",
    " * `arrange()` to reorder rows, default is by ascending value\n",
    " * `select()` to pick variables (columns) by their names\n",
    " * `mutate()` to create new variables with functions of existing variables\n",
    " * `summarise()` to collapse many values down to a single summary\n",
    " * `group_by()` to set up functions to operate on groups rather than the whole data set\n",
    " \n",
    "## Visualizing\n",
    "\n",
    " * `ggplot` - global data and mappings\n",
    " * `geom_point` - geom for scatterplots\n",
    " * `geom_smooth` - geom for smoothed trend lines (e.g. loess or linear regression fits)\n",
    " * `geom_pointrange` - geom for vertical intervals defined by `x`, `y`, `ymin`, and `ymax`\n",
    " * `geom_bar` - geom for barcharts\n",
    " * `geom_boxplot` - geom for boxplots\n",
    " * `geom_polygon` - geom for polygons\n",
    " * `aes(color)` - color mapping\n",
    " * `aes(shape)` - shape mapping\n",
    " * `aes(size)` - size mapping\n",
    " * `aes(alpha)` - transparency mapping\n",
    " * `as.factor()` - transforming numerical values to categorical values with levels\n",
    " * `facet_grid`\n",
    " * `facet_wrap`\n",
    " * `stat_count` - default stat for barcharts, bins by x and counts\n",
    " * `stat_identity` - default stat for scatterplots, leaves data as is\n",
    " * `stat_summary` - summarises y at each x; drawn with `geom_pointrange` by default, and by default computes mean and se of y by x\n",
    " * `position=\"identity\"`\n",
    " * `position=\"stack\"`\n",
    " * `position=\"fill\"`\n",
    " * `position=\"dodge\"`\n",
    " * `position=\"jitter\"`\n",
    " * `coord_flip`\n",
    " * `coord_map`\n",
    " * `coord_polar`"
   ]
  },
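  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick recap, here is a small sketch that chains several of the functions listed above into one pipeline, from transforming to visualizing. The choice of summary (mean highway mileage per class for 2008 models) and the name `class_means` are only for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(tidyverse)\n",
    "\n",
    "class_means <- mpg %>%\n",
    "  filter(year == 2008) %>%             # keep rows for one model year\n",
    "  group_by(class) %>%                  # operate per vehicle class\n",
    "  summarise(mean_hwy = mean(hwy))      # collapse to one row per class\n",
    "\n",
    "ggplot(class_means, aes(class, mean_hwy)) +\n",
    "  geom_bar(stat = \"identity\") +        # bar heights are already computed\n",
    "  coord_flip() +\n",
    "  labs(x = \"Class\", y = \"Mean highway mpg (2008 models)\") +\n",
    "  theme_classic()"
   ]
  },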
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sessionInfo()"
   ]
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "4.2.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}