# Selecting Rows

- [Download the lecture notes](https://philchodrow.github.io/PIC16A/content/pd/pd_2.ipynb). 

In the last lecture, we saw how to extract specific columns from a data frame. In many cases, we also need to extract specific rows. This operation is often called "filtering" -- we are filtering out the rows that we don't want, leaving the ones that we do. 

In [1]:
import pandas as pd
import numpy as np

In [48]:
# you'll need to run the first block in pd_2.ipynb 
# to download the data if you have not already done so

penguins = pd.read_csv("palmer_penguins.csv")
# keep only selected columns; head() then shows the first five rows
penguins = penguins[["Species", "Region", "Island", "Culmen Length (mm)"]]
penguins.head()

Unnamed: 0,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7


The simplest way to select rows of data is by explicitly naming the value(s) of the index for the rows you want. Remember that the index is the set of bold numbers at the far left. To do this, you should use the `df.loc` attribute of the data frame, like this: 

In [49]:
penguins.loc[1:3] # rows with index values 1 through 3, inclusive

Unnamed: 0,Species,Region,Island,Culmen Length (mm)
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
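
Unlike an ordinary Python slice, a `.loc` slice includes both endpoints. The sketch below rebuilds the five rows shown above by hand (the `toy` data frame is a stand-in for illustration, not the full data set) and compares `.loc` with the position-based `.iloc`.

```python
import pandas as pd

# hand-built stand-in for the first five rows shown above
toy = pd.DataFrame({
    "Island": ["Torgersen"] * 5,
    "Culmen Length (mm)": [39.1, 39.5, 40.3, None, 36.7],
})

by_label = toy.loc[1:3]      # labels 1, 2, and 3 (both endpoints included)
by_position = toy.iloc[1:3]  # positions 1 and 2 only (right endpoint excluded)

print(list(by_label.index))     # [1, 2, 3]
print(list(by_position.index))  # [1, 2]
```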


In [50]:
# passing an explicit list can change the order of the rows. 
s = penguins.loc[[1, 4, 0]]
s

Unnamed: 0,Species,Region,Island,Culmen Length (mm)
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1


In [51]:
# note that this works, even though s has only three rows, 
# because .loc looks up index values, and s has a row with index value 4
s.loc[4]

Species               Adelie Penguin (Pygoscelis adeliae)
Region                                             Anvers
Island                                          Torgersen
Culmen Length (mm)                                   36.7
Name: 4, dtype: object

In [52]:
# on the other hand, this raises a KeyError, 
# because no row of s has index value 2
s.loc[2]

KeyError: 2
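
The difference between index values and positions can be made explicit with `.iloc`, which selects by position. Here is a minimal sketch, again using a hand-built data frame `toy` as a stand-in for `penguins`. It also shows one way to check whether an index value exists before asking for it.

```python
import pandas as pd

# hand-built stand-in for the culmen lengths of the first five rows
toy = pd.DataFrame({"Culmen Length (mm)": [39.1, 39.5, 40.3, None, 36.7]})
s = toy.loc[[1, 4, 0]]   # index values are now 1, 4, 0, in that order

# .loc looks up the index value 4; .iloc looks up the row in position 1
by_label = s.loc[4, "Culmen Length (mm)"]
by_position = s.iloc[1, 0]
print(by_label, by_position)   # 36.7 36.7

# checking membership in the index avoids the KeyError
has_label_2 = 2 in s.index
print(has_label_2)             # False

try:
    s.loc[2]
except KeyError:
    print("no row with index value 2")
```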

## Boolean Indexing

While it's good to know how to refer to rows by index, this is not the most useful way to filter data frames. Boolean indexing instead allows us to filter the rows of a data set based on one or more conditions. Boolean indexing in data frames is very similar to Boolean indexing in `numpy` arrays. 
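
As a reminder of the `numpy` behavior that data frames mirror, here is a small sketch with an illustrative array of lengths. A comparison produces an array of Booleans, and indexing with that array keeps only the entries where it is `True`.

```python
import numpy as np

lengths = np.array([39.1, 39.5, 40.3, 36.7])  # illustrative values

mask = lengths < 40    # one Boolean per entry
short = lengths[mask]  # keeps entries where mask is True

print(mask.tolist())   # [True, True, False, True]
print(short.tolist())  # [39.1, 39.5, 36.7]
```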

In [53]:
penguins

Unnamed: 0,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7
...,...,...,...,...
339,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,
340,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,46.8
341,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,50.4
342,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,45.2


In [54]:
penguins['Culmen Length (mm)'] < 40

0       True
1       True
2      False
3      False
4       True
       ...  
339    False
340    False
341    False
342    False
343    False
Name: Culmen Length (mm), Length: 344, dtype: bool

In [56]:
penguins[penguins['Culmen Length (mm)'] < 40]

Unnamed: 0,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7
5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.3
6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,38.9
...,...,...,...,...
146,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,39.2
147,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,36.6
148,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,36.0
149,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,37.8


In [61]:
# rows for penguins encountered on Torgersen island
torg = penguins['Island'] == "Torgersen"
penguins[torg].head()

Unnamed: 0,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7


In [66]:
# penguins encountered on Torgersen with culmen shorter than 40 mm
# using the bitwise and operator &
culm = penguins['Culmen Length (mm)'] < 40
penguins[torg & culm].head()

Unnamed: 0,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7
5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.3
6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,38.9


In [67]:
# using bitwise or instead of and
penguins[torg | culm].head()

Unnamed: 0,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7
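
When the conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` and `<`. The keywords `and` and `or` do not work elementwise on a column and raise a `ValueError`. Here is a sketch on a hand-built data frame; `toy` and its values are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# made-up values for illustration only
toy = pd.DataFrame({
    "Island": ["Torgersen", "Torgersen", "Biscoe", "Dream"],
    "Culmen Length (mm)": [39.1, 40.3, 37.8, 46.8],
})

# parenthesize each comparison before combining
both = toy[(toy["Island"] == "Torgersen") & (toy["Culmen Length (mm)"] < 40)]
either = toy[(toy["Island"] == "Torgersen") | (toy["Culmen Length (mm)"] < 40)]

print(list(both.index))    # [0]
print(list(either.index))  # [0, 1, 2]

# `and` asks for a single truth value of a whole column, which is ambiguous
try:
    toy[(toy["Island"] == "Torgersen") and (toy["Culmen Length (mm)"] < 40)]
    used_and_ok = True
except ValueError:
    used_and_ok = False
    print("`and` raised a ValueError; use & instead")
```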


An especially useful example of Boolean indexing is picking out the rows with missing (`nan`) values in a column. 

In [71]:
nas = penguins["Culmen Length (mm)"].isna()
penguins[nas]

Unnamed: 0,Species,Region,Island,Culmen Length (mm)
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,
339,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,


In [76]:
# np.invert flips the entries of a boolean array; ~nas does the same
penguins[np.invert(nas)]

Unnamed: 0,Species,Region,Island,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7
5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.3
...,...,...,...,...
338,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,47.2
340,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,46.8
341,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,50.4
342,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,45.2
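
Missing values matter for the earlier filters too. Any comparison with `nan` evaluates to `False`, so a row with a missing culmen length is dropped by both `< 40` and `>= 40`. The sketch below uses the culmen lengths of the first five rows, rebuilt by hand as a stand-in data frame `toy`. It demonstrates this and shows that `~nas` and `.notna()` give the same mask as `np.invert(nas)`.

```python
import numpy as np
import pandas as pd

# hand-built stand-in for the first five rows of the data
toy = pd.DataFrame({"Culmen Length (mm)": [39.1, 39.5, 40.3, np.nan, 36.7]})
col = toy["Culmen Length (mm)"]

nas = col.isna()

# three equivalent masks for the non-missing rows
same_as_notna = (~nas).equals(col.notna())
same_as_invert = (~nas).equals(np.invert(nas))
print(same_as_notna, same_as_invert)  # True True

# the row with index value 3 fails both comparisons
short = toy[col < 40]
not_short = toy[col >= 40]
print(list(short.index))      # [0, 1, 4]
print(list(not_short.index))  # [2]
```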


Boolean indexing is by far the most useful form of filtering, and is the better choice in most practical contexts. It is especially powerful when combined with functions that operate on columns, as we'll see shortly. 