# Styling data frames

[Data set download](https://s3.amazonaws.com/bebi103.caltech.edu/data/gfmt_sleep.csv)

<hr>

In [1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"

In [2]:
import pandas as pd

<hr>

It is sometimes useful to highlight features in a data frame when viewing them. (Note that this is generally far less useful than making informative plots, which we will come to shortly.) Pandas offers some convenient ways to style the display of a data frame.

To demonstrate, we will again use a data set from [Beattie, et al.](https://doi.org/10.1098/rsos.160321) containing results from a study of the effects of sleep quality on performance in the [Glasgow Facial Matching Test](https://doi.org/10.3758/BRM.42.1.286) (GFMT).

In [3]:
df = pd.read_csv(os.path.join(data_path, 'gfmt_sleep.csv'), na_values='*')

As our first example demonstrating styling, let's say we wanted to highlight rows corresponding to women who scored at or above 75% correct. We can write a function that takes a row of the data frame as its argument, checks the values in the `'gender'` and `'percent correct'` columns, and returns a background color for every cell in that row: green if the row meets both criteria and light gray otherwise. We then use `df.style.apply()` with the `axis=1` kwarg to apply that function to each row.

In [4]:
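def highlight_high_scoring_females(s):
    """Return one CSS string per cell of row `s`: green for women
    scoring at or above 75% correct, light gray otherwise."""
    if s["gender"] == "f" and s["percent correct"] >= 75:
        return ["background-color: #7fc97f"] * len(s)
    else:
        return ["background-color: lightgray"] * len(s)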

df.head(10).style.apply(highlight_high_scoring_females, axis=1)

Unnamed: 0,participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
0,8,f,39,65,80,72.5,91.0,90.0,93.0,83.5,93.0,90.0,9,13,2
1,16,m,42,90,90,90.0,75.5,55.5,70.5,50.0,75.0,50.0,4,11,7
2,18,f,31,90,95,92.5,89.5,90.0,86.0,81.0,89.0,88.0,10,9,3
3,22,f,35,100,75,87.5,89.5,,71.0,80.0,88.0,80.0,13,8,20
4,27,f,74,60,65,62.5,68.5,49.0,61.0,49.0,65.0,49.0,13,9,12
5,28,f,61,80,20,50.0,71.0,63.0,31.0,72.5,64.5,70.5,15,14,2
6,30,m,32,90,75,82.5,67.0,56.5,66.0,65.0,66.0,64.0,16,9,3
7,33,m,62,45,90,67.5,54.0,37.0,65.0,81.5,62.0,61.0,14,9,9
8,34,f,33,80,100,90.0,70.5,76.5,64.5,,68.0,76.5,14,12,10
9,35,f,53,100,50,75.0,74.5,,60.5,65.0,71.0,65.0,14,8,7
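
To see what `df.style.apply()` hands to the styling function and what it expects back, it can help to call such a function directly on individual rows. The sketch below uses a small made-up data frame (the values are illustrative, not read from the data set) and a hypothetical function `highlight_row()` with the same logic as the one above. With `axis=1`, each row arrives as a Series indexed by column name, and the function must return one CSS string per cell in that row.

```python
import pandas as pd

# Toy data frame with made-up values (not read from the GFMT data set)
toy = pd.DataFrame(
    {
        "gender": ["f", "m", "f"],
        "percent correct": [72.5, 90.0, 92.5],
    }
)


def highlight_row(s):
    """Return one CSS string per cell in row `s` (hypothetical stand-in)."""
    if s["gender"] == "f" and s["percent correct"] >= 75:
        return ["background-color: #7fc97f"] * len(s)
    else:
        return ["background-color: lightgray"] * len(s)


# Call the function on each row directly, much as the Styler does with axis=1
css_by_row = [highlight_row(row) for _, row in toy.iterrows()]

for css in css_by_row:
    print(css)
```

Only the third row satisfies both conditions, so it is the only one that receives the green background; each returned list has as many entries as the row has columns.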


We can get fancier. Let's say we want to shade the `'percent correct'` column with a bar corresponding to the value in the column. We use the `df.style.bar()` method to do so. The `subset` kwarg specifies which columns are to have bars.

In [5]:
df.head(10).style.bar(subset=["percent correct"], vmin=0, vmax=100)

Unnamed: 0,participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
0,8,f,39,65,80,72.5,91.0,90.0,93.0,83.5,93.0,90.0,9,13,2
1,16,m,42,90,90,90.0,75.5,55.5,70.5,50.0,75.0,50.0,4,11,7
2,18,f,31,90,95,92.5,89.5,90.0,86.0,81.0,89.0,88.0,10,9,3
3,22,f,35,100,75,87.5,89.5,,71.0,80.0,88.0,80.0,13,8,20
4,27,f,74,60,65,62.5,68.5,49.0,61.0,49.0,65.0,49.0,13,9,12
5,28,f,61,80,20,50.0,71.0,63.0,31.0,72.5,64.5,70.5,15,14,2
6,30,m,32,90,75,82.5,67.0,56.5,66.0,65.0,66.0,64.0,16,9,3
7,33,m,62,45,90,67.5,54.0,37.0,65.0,81.5,62.0,61.0,14,9,9
8,34,f,33,80,100,90.0,70.5,76.5,64.5,,68.0,76.5,14,12,10
9,35,f,53,100,50,75.0,74.5,,60.5,65.0,71.0,65.0,14,8,7


Note that I have used the `vmin=0` and `vmax=100` kwargs to set the base of the bar to be at zero and the maximum to be 100.
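
As a rough illustration of why fixing the limits matters, the sketch below computes the fraction of a cell that a bar would fill under a simple linear scaling between `vmin` and `vmax`. The helper `bar_fraction()` is hypothetical and is not how Pandas draws bars internally; it only illustrates the idea. According to the Pandas documentation, when `vmax` is not given the maximum of the data is used, so the largest value would fill the whole cell regardless of its magnitude.

```python
def bar_fraction(value, vmin, vmax):
    """Fraction of the cell filled, assuming linear scaling from vmin to vmax.

    Hypothetical helper for illustration; not part of Pandas.
    """
    clipped = min(max(value, vmin), vmax)  # keep the value within the limits
    return (clipped - vmin) / (vmax - vmin)


# A few 'percent correct' values from the table above
scores = [72.5, 90.0, 92.5, 50.0]

# Limits fixed at 0 and 100: bar length reads directly as percent correct
fixed = [bar_fraction(x, 0, 100) for x in scores]

# Upper limit taken from the data instead: the top score fills the cell
from_data = [bar_fraction(x, 0, max(scores)) for x in scores]

print(fixed)
print(from_data)
```

With the limits fixed at 0 and 100, a bar's length can be read as the score itself, which makes bars comparable across different subsets of the data.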

Alternatively, I could shade the cells of the `'percent correct'` column with a color gradient according to their values.

In [6]:
df.head(10).style.background_gradient(subset=["percent correct"], cmap="Reds")

Unnamed: 0,participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
0,8,f,39,65,80,72.5,91.0,90.0,93.0,83.5,93.0,90.0,9,13,2
1,16,m,42,90,90,90.0,75.5,55.5,70.5,50.0,75.0,50.0,4,11,7
2,18,f,31,90,95,92.5,89.5,90.0,86.0,81.0,89.0,88.0,10,9,3
3,22,f,35,100,75,87.5,89.5,,71.0,80.0,88.0,80.0,13,8,20
4,27,f,74,60,65,62.5,68.5,49.0,61.0,49.0,65.0,49.0,13,9,12
5,28,f,61,80,20,50.0,71.0,63.0,31.0,72.5,64.5,70.5,15,14,2
6,30,m,32,90,75,82.5,67.0,56.5,66.0,65.0,66.0,64.0,16,9,3
7,33,m,62,45,90,67.5,54.0,37.0,65.0,81.5,62.0,61.0,14,9,9
8,34,f,33,80,100,90.0,70.5,76.5,64.5,,68.0,76.5,14,12,10
9,35,f,53,100,50,75.0,74.5,,60.5,65.0,71.0,65.0,14,8,7


We can also combine multiple effects by chaining the styling methods.

In [7]:
df.head(10).style.bar(
    subset=["percent correct"], vmin=0, vmax=100
).apply(
    highlight_high_scoring_females, axis=1
)

Unnamed: 0,participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
0,8,f,39,65,80,72.5,91.0,90.0,93.0,83.5,93.0,90.0,9,13,2
1,16,m,42,90,90,90.0,75.5,55.5,70.5,50.0,75.0,50.0,4,11,7
2,18,f,31,90,95,92.5,89.5,90.0,86.0,81.0,89.0,88.0,10,9,3
3,22,f,35,100,75,87.5,89.5,,71.0,80.0,88.0,80.0,13,8,20
4,27,f,74,60,65,62.5,68.5,49.0,61.0,49.0,65.0,49.0,13,9,12
5,28,f,61,80,20,50.0,71.0,63.0,31.0,72.5,64.5,70.5,15,14,2
6,30,m,32,90,75,82.5,67.0,56.5,66.0,65.0,66.0,64.0,16,9,3
7,33,m,62,45,90,67.5,54.0,37.0,65.0,81.5,62.0,61.0,14,9,9
8,34,f,33,80,100,90.0,70.5,76.5,64.5,,68.0,76.5,14,12,10
9,35,f,53,100,50,75.0,74.5,,60.5,65.0,71.0,65.0,14,8,7


In practice, I almost never use these features because it is almost always better to display results as a plot rather than in tabular form. Still, highlighting certain aspects can be useful when exploring data sets in tabular form.

## Computing environment

In [8]:
%load_ext watermark
%watermark -v -p pandas,jupyterlab

Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.15.0

pandas    : 2.0.3
jupyterlab: 4.0.6

