# Plots with categorical variables

[Data set download](https://s3.amazonaws.com/bebi103.caltech.edu/data/gfmt_sleep.csv)

<hr />

In [1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"

In [2]:
import pandas as pd

import bokeh.models
import bokeh.plotting
import bokeh.io
bokeh.io.output_notebook()

<hr/>

## Types of data for plots

Let us first consider the different kinds of data we may encounter as we think about constructing a plot.

- **Quantitative** data may have continuously varying (and therefore ordered) values.
- **Categorical** data has discrete, unordered values that a variable can take.
- **Ordinal** data has discrete, ordered values. Integers are a classic example.
- **Temporal** data refers to time, which can be represented as dates.

In practice, ordinal data can be cast as quantitative or treated as categorical with an ordering enforced on the categories (e.g., ordinal data `[1, 2, 3]` becomes the categories `['1', '2', '3']`). Temporal data can also be cast as quantitative (e.g., seconds from the start time). We will therefore focus our attention on quantitative and categorical data.
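
As a minimal sketch of these casts using pandas (the `ratings` and `times` values are made up for illustration and are not part of the sleep data set), ordinal integers can be converted to an ordered categorical, and time stamps to seconds elapsed since the first time point:

```python
import pandas as pd

# Hypothetical ordinal data: integer ratings from 1 to 3
ratings = pd.Series([3, 1, 2, 3, 1])

# Treat as categorical, enforcing the ordering "1" < "2" < "3"
ordinal = pd.Categorical(
    ratings.astype(str), categories=["1", "2", "3"], ordered=True
)

# Hypothetical temporal data, cast as quantitative (seconds from the start)
times = pd.to_datetime(
    ["2023-01-01 00:00:00", "2023-01-01 00:00:30", "2023-01-01 00:02:00"]
)
seconds = (times - times[0]).total_seconds()

print(ordinal)
print(list(seconds))
```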

When we made scatter plots, the data on both axes were quantitative. We did actually incorporate categorical information in the form of colors of the glyph (insomniacs and normal sleepers being colored differently) and in tooltips.

But what if we wanted to plot a single type of measurement, such as percent correct in the facial identification, and were interested in comparing the performance of insomniacs and normal sleepers? Here, we have the quantitative percent correct data and the categorical sleeper type. One of our axes is now categorical.

Note that this kind of plot is commonly encountered in the biological sciences. We repeat a measurement many times for given test conditions and wish to compare the results. The different conditions are the categories, and the axis along which the conditions are represented is called a categorical axis. The quantitative axis contains the result of the measurements from each condition.

<hr />

_The rest of this lesson is mostly for reference so you can see how to handle categorical axes with Bokeh. In practice, we will mostly be using [iqplot](http://iqplot.github.io/) to do this, and it is done for you automatically. You may therefore skip the rest of this notebook if you like._

_That said, for some plotting applications, you may need to adjust details or do things outside of iqplot's capabilities, so the contents of this lesson can be useful._

## Making a bar graph with Bokeh

To demonstrate how to set up a categorical axis with Bokeh, I will make a bar graph of the mean percent correct for insomniacs and normal sleepers. But before I even begin this, I will give you the following piece of advice: *Don't make bar graphs.* More on that in a moment.

### Setting up a data frame for plotting

Before making a plot, we need to set up a data frame amenable for the type of plot we want. We start by reading in the data set and computing the `'insomnia'` column, which gives `True`s and `False`s, as we've done in the preceding parts of this lesson.

In [3]:
fname = os.path.join(data_path, "gfmt_sleep.csv")
df = pd.read_csv(fname, na_values="*")
df["insomnia"] = df["sci"] <= 16

For convenience in plotting the categorical axis, we would rather not have the values on the axis be `True` or `False`, but something more descriptive, like _insomniac_ and _normal_. So, let's make a column in the data frame, `'sleeper'`, that has that for us. We use the `apply()` method of the `'insomnia'` column to apply a function that returns the string `'insomniac'` if the entry in the `'insomnia'` column is `True` and `'normal'` otherwise.

In [4]:
df["sleeper"] = df["insomnia"].apply(lambda x: "insomniac" if x else "normal")
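
As an aside, the same relabeling can be sketched with `Series.map()` and a dictionary instead of a lambda function. The toy data frame `df_toy` below is hypothetical and stands in for the real data set; the approach assumes the Boolean column has no missing values.

```python
import pandas as pd

# Hypothetical stand-in for the sleep data: only the SCI scores
df_toy = pd.DataFrame({"sci": [10, 16, 17, 30]})
df_toy["insomnia"] = df_toy["sci"] <= 16

# Map each Boolean value to a descriptive label
df_toy["sleeper"] = df_toy["insomnia"].map({True: "insomniac", False: "normal"})

print(df_toy)
```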

Next, we need to make a data frame that has the mean percent correct for each of the two categories of sleeper. We have decided that it is the mean of the respective measurements that will set the height of the bars.

In [5]:
df_mean = df.groupby("sleeper")["percent correct"].mean().reset_index()

# Take a look
df_mean

|   | sleeper | percent correct |
|---|---|---|
| 0 | insomniac | 76.1 |
| 1 | normal | 81.461039 |


Now we're ready to make the bar graph. Note that we now have only two data points that we are showing on the plot. We have decided to throw out a **lot** of information from the data we collected to display only two values. Does this strike you as a terrible idea? It should. **Don't do this.** We're just doing it to show how categorical axes are set up using Bokeh.
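
To get a feel for how much a bar graph of means leaves out, here is a small sketch that summarizes each group with its count, mean, minimum, and maximum. The numbers in `df_toy` are invented for illustration; with the real data set you would group `df` in the same way.

```python
import pandas as pd

# Hypothetical measurements; the real data set has many more rows
df_toy = pd.DataFrame(
    {
        "sleeper": ["insomniac", "insomniac", "normal", "normal", "normal"],
        "percent correct": [70.0, 80.0, 60.0, 80.0, 100.0],
    }
)

# A bar graph of means shows only the "mean" column of this summary
summary = df_toy.groupby("sleeper")["percent correct"].agg(
    ["count", "mean", "min", "max"]
)

print(summary)
```

The bar heights would encode only the `mean` column, hiding both the number of measurements and their spread.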

### Setting up categorical axes

To set up a categorical axis, you need to specify the `x_range` (or `y_range` if you want the y-axis to be categorical) as a list with the categories you want on the axis when you instantiate the figure. I will make a horizontal bar graph, so I will specify `y_range`. I also want my quantitative axis (x in this case) to go from zero to 100, since it signifies a percent. Finally, because the figure is not very tall and I do not want the reset tool cut off, I will explicitly set the tools I want in the toolbar.

In [6]:
p = bokeh.plotting.figure(
    height=200,
    width=400,
    x_axis_label="percent correct",
    x_range=[0, 100],
    y_range=df_mean["sleeper"].unique(),
    tools="save",
)
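
The order of the list sets the order of the categories on the axis. The sketch below, which uses a hand-written list of categories rather than the data frame, suggests how to check this: Bokeh converts the list into a factor range, and for a categorical y-axis the first factor is placed at the bottom.

```python
import bokeh.plotting

# Hand-written categories; "normal" is first, so it sits at the bottom
categories = ["normal", "insomniac"]

p_demo = bokeh.plotting.figure(
    height=200,
    width=400,
    x_range=[0, 100],
    y_range=categories,
    tools="save",
)

# The list has been converted to a factor range holding the categories
print(type(p_demo.y_range).__name__)
print(list(p_demo.y_range.factors))
```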

Now that we have the figure, we can put the bars on. The `p.hbar()` method populates the figure with horizontal bar glyphs. The `right` kwarg says what column of the data source dictates how far to the right to show the bar, while the `height` kwarg says how thick the bars are.

The quantitative axis already starts at zero because we set `x_range` when instantiating the figure. I will also turn off the grid lines on the categorical axis, which is commonly done.

In [7]:
p.hbar(
    source=df_mean,
    y="sleeper",
    right="percent correct",
    height=0.6,
)

# Turn off gridlines on categorical axis
p.ygrid.grid_line_color = None

bokeh.io.show(p)

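We similarly make vertical bar graphs by specifying `x_range` and using `p.vbar()`. Here, I reverse the order of the categories with `[::-1]` so that normal sleepers appear on the left.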

In [8]:
p = bokeh.plotting.figure(
    height=250,
    width=250,
    x_range=df_mean["sleeper"].unique()[::-1],
    y_range=[0, 100],
    y_axis_label="average percent correct",
)

p.vbar(
    source=df_mean,
    x="sleeper",
    top="percent correct",
    width=0.6,
)

p.xgrid.grid_line_color = None

bokeh.io.show(p)

## Nested categorical axes

We may wish to make a bar graph where we have four bars, normal and insomniac for males and also normal and insomniac for females. To start, we will have to re-make the `df_mean` data frame, now grouping by gender and sleeper. Furthermore, it will be nicer to label the categories as "female" and "male" instead of "f" and "m".

In [9]:
df["gender"] = df["gender"].apply(lambda x: "female" if x == "f" else "male")

df_mean = df.groupby(["gender", "sleeper"])["percent correct"].mean().reset_index()

# Take a look
df_mean

|   | gender | sleeper | percent correct |
|---|---|---|---|
| 0 | female | insomniac | 73.947368 |
| 1 | female | normal | 82.045455 |
| 2 | male | insomniac | 82.916667 |
| 3 | male | normal | 80.0 |


Because of the way Bokeh handles nested categories, we need to create a new column that has a tuple corresponding to the nested category. To make the tuple, we can again apply a function, this time to each entire row of the data frame (which requires the `axis=1` kwarg of `df_mean.apply()`).

In [10]:
df_mean["cats"] = df_mean.apply(lambda x: (x["gender"], x["sleeper"]), axis=1)

# Take a look
df_mean

|   | gender | sleeper | percent correct | cats |
|---|---|---|---|---|
| 0 | female | insomniac | 73.947368 | (female, insomniac) |
| 1 | female | normal | 82.045455 | (female, normal) |
| 2 | male | insomniac | 82.916667 | (male, insomniac) |
| 3 | male | normal | 80.0 | (male, normal) |
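
A row-wise `apply()` is not the only way to build these tuples. As a sketch on a hypothetical toy data frame (the values are placeholders, not results from the data set), zipping the two columns together gives the same kind of column:

```python
import pandas as pd

# Hypothetical summary data frame with two categorical columns
df_toy = pd.DataFrame(
    {
        "gender": ["female", "female", "male", "male"],
        "sleeper": ["insomniac", "normal", "insomniac", "normal"],
        "percent correct": [70.0, 80.0, 75.0, 85.0],
    }
)

# Pair up the entries of the two columns, one tuple per row
df_toy["cats"] = list(zip(df_toy["gender"], df_toy["sleeper"]))

print(df_toy)
```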


Next, we need to set up **factors**, which give the nested categories. We could extract them from the `'cats'` column of the data frame as

```python
factors = list(df_mean.cats)
```

Instead, we will specify them by hand to ensure they are ordered as we would like.

In [11]:
factors = [
    ("female", "normal"),
    ("female", "insomniac"),
    ("male", "normal"),
    ("male", "insomniac"),
]
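
With more categories, writing the factors by hand gets tedious. One possible alternative, sketched here with the standard library's `itertools.product()`, is to list each level in the order you want and generate every combination. The names `genders`, `sleepers`, and `factors_auto` are introduced only for this illustration.

```python
import itertools

# Desired order of each level of the nested categories
genders = ["female", "male"]
sleepers = ["normal", "insomniac"]

# All (gender, sleeper) pairs, with the outer level varying slowest
factors_auto = list(itertools.product(genders, sleepers))

print(factors_auto)
```

This reproduces the ordering of the hand-written list above, but it assumes every combination of levels is present in the data.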

Finally, to use these factors in a `y_range` (or `x_range`), we need to convert them to a **factor range** using `bokeh.models.FactorRange()`.

In [12]:
p = bokeh.plotting.figure(
    height=200,
    width=400,
    x_axis_label="average percent correct",
    x_range=[0, 100],
    y_range=bokeh.models.FactorRange(*factors),
    tools="save",
)
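
As a quick check of what the factor range holds, the sketch below builds one from a short, hand-written list of tuples and inspects its `factors` attribute. Passing the tuples as positional arguments, as done above, or as the `factors` keyword argument should give the same result. The names `demo_factors`, `range_positional`, and `range_keyword` are used only for this illustration.

```python
import bokeh.models

# A short, hand-written list of nested categories
demo_factors = [("female", "normal"), ("female", "insomniac")]

# Two ways of constructing the factor range
range_positional = bokeh.models.FactorRange(*demo_factors)
range_keyword = bokeh.models.FactorRange(factors=demo_factors)

print(list(range_positional.factors))
print(list(range_keyword.factors))
```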

Now we are ready to add the bars, taking care to specify the `'cats'` column for our y-values.

In [13]:
p.hbar(
    source=df_mean,
    y="cats",
    right="percent correct",
    height=0.6,
)

p.ygrid.grid_line_color = None

bokeh.io.show(p)

## Computing environment

In [14]:
%load_ext watermark
%watermark -v -p pandas,bokeh,jupyterlab

Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.15.0

pandas    : 2.0.3
bokeh     : 3.2.1
jupyterlab: 4.0.6

