This notebook is part of my [Python data science curriculum](http://www.terran.us/articles/python_curriculum.html). It demonstrates some Pandas functions which I thought were not adequately explained in the Jake VanderPlas book.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from plotnine.data import diamonds

In [2]:
# This is a standard Python demo dataset. You can also load it from your Python packages
# dir with pd.read_csv if you don't want to import seaborn.
import seaborn as sns
tips = sns.load_dataset('tips')

# Loading Data

For no apparent reason, the VanderPlas book doesn't document `pd.read_csv`! This is definitely functionality that you need. If you have the Wes McKinney book available, he has a description that you can read in Chapter 6; otherwise just read the online docs.

In [3]:
pd.read_csv?

In [4]:
pd.read_excel?
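
As a rough sketch of the options I reach for most often, here is `pd.read_csv` reading from an in-memory string. The CSV text and column names are made up for illustration; in practice you would pass a path or URL instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a file on disk.
csv_text = """total_bill,tip,day
16.99,1.01,Sun
10.34,NA,Sun
21.01,3.50,Sat
"""

# read_csv accepts a path, a URL, or any file-like object.
demo = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"day": "category"},  # parse this column as a categorical
    na_values=["NA"],           # treat the string "NA" as missing
)
print(demo.dtypes)
```

The `dtype` and `na_values` arguments are worth learning early, since fixing types after loading is more work than declaring them up front.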

For DB connections: 
http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries 
https://www.sqlalchemy.org/
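
For a quick feel of the SQL route without installing anything, this sketch uses the standard library's `sqlite3` with an in-memory database. The table name and rows are invented for the example; a real project would more likely hand pandas a SQLAlchemy engine.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tips_demo (total_bill REAL, tip REAL)")
conn.executemany(
    "INSERT INTO tips_demo VALUES (?, ?)",
    [(16.99, 1.01), (10.34, 1.66)],
)
conn.commit()

# The query result comes back as an ordinary DataFrame.
from_db = pd.read_sql_query("SELECT * FROM tips_demo WHERE tip > 1.5", conn)
conn.close()
print(from_db)
```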

# Summary Tools

Be aware that `describe()` ignores all non-numeric columns by default, which might not be what you want:

In [5]:
tips.describe()

Unnamed: 0,total_bill,tip,size
count,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672
std,8.902412,1.383638,0.9511
min,3.07,1.0,1.0
25%,13.3475,2.0,2.0
50%,17.795,2.9,2.0
75%,24.1275,3.5625,3.0
max,50.81,10.0,6.0


You can force it to include them with `include='all'`, but numeric and non-numeric columns get different statistics, so the result is an ugly table that is mostly empty cells:

In [6]:
tips.describe(include='all')

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
count,244.0,244.0,244,244,244,244,244.0
unique,,,2,2,4,2,
top,,,Male,No,Sat,Dinner,
freq,,,157,151,87,176,
mean,19.785943,2.998279,,,,,2.569672
std,8.902412,1.383638,,,,,0.9511
min,3.07,1.0,,,,,1.0
25%,13.3475,2.0,,,,,2.0
50%,17.795,2.9,,,,,2.0
75%,24.1275,3.5625,,,,,3.0
max,50.81,10.0,,,,,6.0


You probably want to do the numeric and string types separately. Note that columns which look like strings might have dtype `object` or `category` (a `pd.Categorical`).

In [7]:
tips.describe(include='category')

Unnamed: 0,sex,smoker,day,time
count,244,244,244,244
unique,2,2,4,2
top,Male,No,Sat,Dinner
freq,157,151,87,176
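
One way to split the summary without naming each dtype is to select on "numeric or not". This is a sketch on a small made-up frame (the column names are mine, not from `tips`), using `np.number` with the `include` and `exclude` arguments of `describe()`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "amount": [10.0, 12.5, 9.0],
    "label": ["a", "b", "a"],                  # plain strings
    "group": pd.Categorical(["x", "y", "x"]),  # categorical
})

# Numeric columns get mean, std, quantiles, etc.
numeric_summary = demo.describe(include=[np.number])

# Everything else gets count, unique, top, freq.
other_summary = demo.describe(exclude=[np.number])

print(numeric_summary)
print(other_summary)
```

Using `exclude` catches both the string and the categorical column in one call, so you don't have to know which of the two a given column is.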


`.value_counts()` and `.unique()` work on any type of column, but only one column at a time, not an entire dataframe.

In [8]:
tips['sex'].value_counts()

Male 157
Female 87
Name: sex, dtype: int64

In [9]:
tips['sex'].unique()

[Female, Male]
Categories (2, object): [Female, Male]

These specialized functions are quite a bit faster than the general approach with `groupby`:

In [10]:
%timeit -n100 diamonds.color.unique()
%timeit -n100 diamonds.color.value_counts()
%timeit -n100 diamonds.groupby('color').count()

905 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.4 ms ± 7.71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.7 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
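
To see that the fast and slow routes agree, here is a small sketch on an invented Series. I use `groupby(...).size()` rather than `.count()` because it returns one number per group, which lines up directly with `value_counts()`.

```python
import pandas as pd

colors = pd.Series(["D", "E", "D", "F", "E", "D"], name="color")

# Sort by label so both results are in the same order.
via_value_counts = colors.value_counts().sort_index()
via_groupby = colors.groupby(colors).size()

# Same counts either way; value_counts just gets there with less machinery.
print(via_value_counts.tolist() == via_groupby.tolist())
```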


# Stacking, unstacking, melting, and pivoting

## Stacking and unstacking

To understand stacking and unstacking, let's start by creating a MultiIndex on the rows.

In [11]:
tss = tips.groupby(['sex','smoker']).aggregate('mean')
tss

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Male,Yes,22.2845,3.051167,2.5
Male,No,19.791237,3.113402,2.71134
Female,Yes,17.977879,2.931515,2.242424
Female,No,18.105185,2.773519,2.592593


`unstack` will then move the INNER level of the index from the ROWS to the COLUMNS:


In [12]:
tss.unstack()

Unnamed: 0_level_0,total_bill,total_bill,tip,tip,size,size
smoker,Yes,No,Yes,No,Yes,No
sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Male,22.2845,19.791237,3.051167,3.113402,2.5,2.71134
Female,17.977879,18.105185,2.931515,2.773519,2.242424,2.592593


`stack` moves the innermost level from the columns to the rows:

In [13]:
tss.unstack().stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Male,Yes,22.2845,3.051167,2.5
Male,No,19.791237,3.113402,2.71134
Female,Yes,17.977879,2.931515,2.242424
Female,No,18.105185,2.773519,2.592593


If you remove the _last_ (only remaining) level from either the rows or the columns, you then get a one-dimensional Series instead of a DataFrame:

In [14]:
tss.unstack().unstack()

 smoker sex 
total_bill Yes Male 22.284500
 Female 17.977879
 No Male 19.791237
 Female 18.105185
tip Yes Male 3.051167
 Female 2.931515
 No Male 3.113402
 Female 2.773519
size Yes Male 2.500000
 Female 2.242424
 No Male 2.711340
 Female 2.592593
dtype: float64

Once you have a series, it can only be `unstack`ed, not `stack`ed. You can use `level=` to control which part of the index gets turned back into columns.

In [15]:
tss.unstack().unstack().unstack()

Unnamed: 0_level_0,sex,Male,Female
Unnamed: 0_level_1,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
total_bill,Yes,22.2845,17.977879
total_bill,No,19.791237,18.105185
tip,Yes,3.051167,2.931515
tip,No,3.113402,2.773519
size,Yes,2.5,2.242424
size,No,2.71134,2.592593


In [16]:
tss.unstack().unstack().unstack(level=1)

Unnamed: 0_level_0,smoker,Yes,No
Unnamed: 0_level_1,sex,Unnamed: 2_level_1,Unnamed: 3_level_1
total_bill,Male,22.2845,19.791237
total_bill,Female,17.977879,18.105185
tip,Male,3.051167,3.113402
tip,Female,2.931515,2.773519
size,Male,2.5,2.71134
size,Female,2.242424,2.592593


In [17]:
tss.unstack().unstack().unstack(level=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size
smoker,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Yes,Male,22.2845,3.051167,2.5
Yes,Female,17.977879,2.931515,2.242424
No,Male,19.791237,3.113402,2.71134
No,Female,18.105185,2.773519,2.592593
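
Counting levels by position gets confusing quickly, so it may help to know that `level=` also accepts a level name. A minimal sketch on a hand-built Series (the values are arbitrary):

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["Male", "Female"], ["Yes", "No"]], names=["sex", "smoker"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx, name="tip")

by_smoker = s.unstack()          # default: innermost level ("smoker") -> columns
by_sex = s.unstack(level="sex")  # pick a level by name instead of position

# Stacking the wide frame moves "smoker" back into the row index.
round_trip = by_smoker.stack()

print(by_sex)
```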


## Melting and pivoting

`melt` is very similar to `stack`, except that it applies to all columns and not just the innermost level, and it then converts them into a normal column instead of an index level.

In [18]:
tips.head(2)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3


In [19]:
tips.head(2).stack()

0 total_bill 16.99
 tip 1.01
 sex Female
 smoker No
 day Sun
 time Dinner
 size 2
1 total_bill 10.34
 tip 1.66
 sex Male
 smoker No
 day Sun
 time Dinner
 size 3
dtype: object

In [20]:
# Note that I have to move the row index into a column, which is called "index", to
# preserve the association of the data in the original rows through the melt.
tips_melted=tips.reset_index().head(2).melt(id_vars='index')
tips_melted

Unnamed: 0,index,variable,value
0,0,total_bill,16.99
1,1,total_bill,10.34
2,0,tip,1.01
3,1,tip,1.66
4,0,sex,Female
5,1,sex,Male
6,0,smoker,No
7,1,smoker,No
8,0,day,Sun
9,1,day,Sun
10,0,time,Dinner
11,1,time,Dinner
12,0,size,2
13,1,size,3


Pandas has no direct counterpart to R's `cast`. Instead, use `.pivot`:

In [21]:
tips_melted.pivot(index='index',columns='variable')

Unnamed: 0_level_0,value,value,value,value,value,value,value
variable,day,sex,size,smoker,time,tip,total_bill
index,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
0,Sun,Female,2,No,Dinner,1.01,16.99
1,Sun,Male,3,No,Dinner,1.66,10.34
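
If the extra `value` level in the column header is unwanted, passing `values=` to `.pivot` should give flat column names. A sketch of the round trip on a tiny made-up frame:

```python
import pandas as pd

wide = pd.DataFrame({"id": [0, 1], "tip": [1.01, 1.66], "size": [2, 3]})

long = wide.melt(id_vars="id")  # columns: id, variable, value
back = long.pivot(index="id", columns="variable", values="value")

# Passing values= avoids the extra "value" level in the column index.
print(back)
```

Note that the pivoted columns come back in sorted order, not the original order, and that the melted `value` column has a single dtype for everything.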


The distinction between `.pivot` and `.pivot_table` is that the latter does aggregation:

In [22]:
tips.pivot_table(values='tip',index='sex',columns='time',aggfunc='max')

time,Dinner,Lunch
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,6.5,5.17
Male,10.0,6.7


Note the gotcha that the arguments are not in the same order if you specify them positionally:

tips.pivot_table(__values__=None, __index__=None, __columns__=None, ... 
tips.pivot(__index__=None, __columns__=None, __values__=None)
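
Passing everything by keyword sidesteps the ordering gotcha, and it also makes the other difference easy to see: `.pivot` refuses to reshape when an index/column pair appears more than once, while `.pivot_table` aggregates the duplicates. A sketch on invented data:

```python
import pandas as pd

sales = pd.DataFrame({
    "sex": ["F", "F", "M"],
    "time": ["Dinner", "Dinner", "Lunch"],
    "tip": [2.0, 4.0, 3.0],
})

# Two rows share (sex="F", time="Dinner"), so pivot cannot pick one value.
try:
    sales.pivot(index="sex", columns="time", values="tip")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves the duplicates by aggregating them.
table = sales.pivot_table(index="sex", columns="time", values="tip", aggfunc="max")
print(pivot_failed)
print(table)
```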

# Aggregate

## Full Aggregate Syntax

The full syntax of arguments to `aggregate()` is fairly complex. You can have:

- A dict where the keys are columns in your source data, and the values are:
    - A list of functions to apply, where each element is:
        - A 2-tuple, where the first element is the name to give the output column and the second element is the function

In [23]:
tips.groupby(['sex','smoker']).aggregate(
 {'tip':[('mean',np.mean),('50pct',np.median)],
 'time':[('pct_dinner', lambda x: 100*np.mean(x=='Dinner'))]
})

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,tip,time
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,50pct,pct_dinner
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Male,Yes,3.051167,3.0,78.333333
Male,No,3.113402,2.74,79.381443
Female,Yes,2.931515,2.88,69.69697
Female,No,2.773519,2.68,53.703704


But note that if you specify a tuple for one function, you had better specify it for all, or you get bad column names for the ones you didn't specify:

In [24]:
tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,('50pct',np.median)]})

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,tip
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,50pct
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2
Male,Yes,3.051167,3.0
Male,No,3.113402,2.74
Female,Yes,2.931515,2.88
Female,No,2.773519,2.68


Whereas if you don't specify _any_ names, you get sane defaults. I dunno.

In [25]:
tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,np.median]})

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,tip
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,median
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2
Male,Yes,3.051167,3.0
Male,No,3.113402,2.74
Female,Yes,2.931515,2.88
Female,No,2.773519,2.68
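
If the naming rules above feel fragile, "named aggregation" is an alternative worth knowing. This sketch assumes pandas 0.25 or later and uses a made-up frame; each keyword becomes an output column, and its value is a `(source column, function)` pair.

```python
import pandas as pd

demo = pd.DataFrame({
    "sex": ["M", "M", "F", "F"],
    "tip": [1.0, 3.0, 2.0, 6.0],
    "time": ["Dinner", "Lunch", "Dinner", "Dinner"],
})

summary = demo.groupby("sex").agg(
    tip_mean=("tip", "mean"),
    tip_median=("tip", "median"),
    pct_dinner=("time", lambda x: 100 * (x == "Dinner").mean()),
)
print(summary)
```

The result has flat column names, so there is no MultiIndex to deal with afterwards.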


## Multi-level Column Names

When we aggregate multiple columns with multiple functions, we get hierarchical column names:

In [26]:
tm = tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,np.median],
 'total_bill':[np.mean,np.median]})

# See that we have a MultiIndex:
tm.columns

MultiIndex(levels=[['tip', 'total_bill'], ['mean', 'median']],
 labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

If we want to flip which is the first and which is the second level of the index, we can do it with `.swaplevel`:

In [27]:
tm

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,tip,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,median,mean,median
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Male,Yes,3.051167,3.0,22.2845,20.39
Male,No,3.113402,2.74,19.791237,18.24
Female,Yes,2.931515,2.88,17.977879,16.27
Female,No,2.773519,2.68,18.105185,16.69


In [28]:
tm.swaplevel(axis=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,median,mean,median
Unnamed: 0_level_1,Unnamed: 1_level_1,tip,tip,total_bill,total_bill
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Male,Yes,3.051167,3.0,22.2845,20.39
Male,No,3.113402,2.74,19.791237,18.24
Female,Yes,2.931515,2.88,17.977879,16.27
Female,No,2.773519,2.68,18.105185,16.69


In [29]:
# If we then want the columns sorted by the new index, we can do that explicitly:
tm.swaplevel(axis=1).sort_index(axis=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,mean,median,median
Unnamed: 0_level_1,Unnamed: 1_level_1,tip,total_bill,tip,total_bill
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Male,Yes,3.051167,22.2845,3.0,20.39
Male,No,3.113402,19.791237,2.74,18.24
Female,Yes,2.931515,17.977879,2.88,16.27
Female,No,2.773519,18.105185,2.68,16.69


In [30]:
# The same thing works on the rows:
tm.swaplevel(axis=0).sort_index(axis=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,tip,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,median,mean,median
smoker,sex,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Yes,Male,3.051167,3.0,22.2845,20.39
Yes,Female,2.931515,2.88,17.977879,16.27
No,Male,3.113402,2.74,19.791237,18.24
No,Female,2.773519,2.68,18.105185,16.69


If we had more than two levels, we could specify which two we wanted to swap with additional arguments.

Some tools (including Altair) can't use data with hierarchical column names at all, so they have to be flattened. There's no built-in function for doing this, but the following idiom seems standard:

In [31]:
tm.columns = [c[0] + "." + c[1] for c in tm.columns]
tm

Unnamed: 0_level_0,Unnamed: 1_level_0,tip.mean,tip.median,total_bill.mean,total_bill.median
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Male,Yes,3.051167,3.0,22.2845,20.39
Male,No,3.113402,2.74,19.791237,18.24
Female,Yes,2.931515,2.88,17.977879,16.27
Female,No,2.773519,2.68,18.105185,16.69
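
A slight variation that does not hard-code two levels is to join each label tuple. This is a sketch on a one-row invented frame and assumes every level label is a string.

```python
import pandas as pd

cols = pd.MultiIndex.from_product([["tip", "total_bill"], ["mean", "median"]])
demo = pd.DataFrame([[3.0, 2.9, 19.8, 17.8]], columns=cols)

# Each column label is a tuple of strings, so str.join handles any depth.
demo.columns = [".".join(c) for c in demo.columns]
print(list(demo.columns))
```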


## Set Membership

There is an `.isin` function for quickly checking set membership.

In [32]:
tips['weekend'] = tips.day.isin(['Sat','Sun'])
tips.tail()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,weekend
239,29.03,5.92,Male,No,Sat,Dinner,3,True
240,27.18,2.0,Female,Yes,Sat,Dinner,2,True
241,22.67,2.0,Male,Yes,Sat,Dinner,2,True
242,17.82,1.75,Male,No,Sat,Dinner,2,True
243,18.78,3.0,Female,No,Thur,Dinner,2,False


Performance of `.isin` is good compared to the alternatives:

In [33]:
%timeit -n100 diamonds.color.isin(['D','E','F'])
%timeit -n100 diamonds.eval('color in ["D","E","F"]')

1.32 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.92 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
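
There is no `.notin`; the usual approach is to negate the boolean mask with `~`. A minimal sketch on an invented Series:

```python
import pandas as pd

days = pd.Series(["Thur", "Fri", "Sat", "Sun"])

is_weekend = days.isin(["Sat", "Sun"])
weekdays = days[~is_weekend]  # ~ negates the boolean mask

print(weekdays.tolist())
```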


# Method Chaining Helpers

To make it easier to chain manipulation functions, there is an `assign()` method which creates new columns. Both `assign()` and `[]` filtering accept lambdas, which receive the intermediate DataFrame as their argument, so you can refer to a result that doesn't have a name.

In [34]:
tips.assign(tip_pct=lambda x: 100*x.tip/x.total_bill) \
 [lambda x: x.tip_pct > 70]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,weekend,tip_pct
172,7.25,5.15,Male,Yes,Sun,Dinner,2,True,71.034483


There is also a "rename" for changing column names without needing to assign to the `.columns` or `.index` property of a named variable.

In [35]:
tips.assign(tip_pct=lambda x: 100*x.tip/x.total_bill) \
 [lambda x: x.tip_pct > 70].rename({'day':'dayofweek'},axis='columns')

Unnamed: 0,total_bill,tip,sex,smoker,dayofweek,time,size,weekend,tip_pct
172,7.25,5.15,Male,Yes,Sun,Dinner,2,True,71.034483
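
Two more helpers fit the same chaining style: `.pipe()` for calling your own function mid-chain, and `.query()` for filtering by expression. The sketch below uses a made-up frame and a hypothetical helper, `add_tip_pct`, which is mine and not part of pandas.

```python
import pandas as pd

def add_tip_pct(df):
    # Hypothetical helper: returns a copy with tip as a percentage of the bill.
    return df.assign(tip_pct=100 * df.tip / df.total_bill)

demo = pd.DataFrame({"total_bill": [10.0, 20.0], "tip": [1.0, 5.0]})

result = (
    demo
    .pipe(add_tip_pct)                     # call a plain function in the chain
    .query("tip_pct > 20")                 # filter on the new column
    .rename(columns={"tip": "gratuity"})
)
print(result)
```

Wrapping the chain in parentheses avoids the trailing backslashes.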


# Sorting and Ranking

## sort_values and sort_index

`sort_values()` sorts by the values in one or more columns, and complements `sort_index()`, which sorts by the index:

In [36]:
tips.set_index('total_bill').sort_index().head()

Unnamed: 0_level_0,tip,sex,smoker,day,time,size,weekend
total_bill,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
3.07,1.0,Female,Yes,Sat,Dinner,1,True
5.75,1.0,Female,Yes,Fri,Dinner,2,False
7.25,1.0,Female,No,Sat,Dinner,1,True
7.25,5.15,Male,Yes,Sun,Dinner,2,True
7.51,2.0,Male,No,Thur,Lunch,2,False


In [37]:
tips.set_index('total_bill').sort_values('tip').head()

Unnamed: 0_level_0,tip,sex,smoker,day,time,size,weekend
total_bill,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
3.07,1.0,Female,Yes,Sat,Dinner,1,True
12.6,1.0,Male,Yes,Sat,Dinner,2,True
5.75,1.0,Female,Yes,Fri,Dinner,2,False
7.25,1.0,Female,No,Sat,Dinner,1,True
16.99,1.01,Female,No,Sun,Dinner,2,True


## Ranking

Ranking is done with the `.rank()` method. The usual tie-breaking options (`min`, `max`, `dense`, etc.) are available through the `method` argument to `rank()`.

Note that `pct` actually gives numbers between 0 and 1, not 0 and 100. Pandas is very sloppy generally about the meaning of "percent".

In [38]:
tips.assign(tip_rank=tips.tip.rank(), tip_pct = tips.tip.rank(pct=True)).head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,weekend,tip_rank,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,True,5.0,0.020492
1,10.34,1.66,Male,No,Sun,Dinner,3,True,33.0,0.135246
2,21.01,3.5,Male,No,Sun,Dinner,3,True,177.0,0.72541
3,23.68,3.31,Male,No,Sun,Dinner,2,True,165.0,0.67623
4,24.59,3.61,Female,No,Sun,Dinner,4,True,185.0,0.758197
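
The tie-breaking methods are easiest to compare side by side. Here is a small sketch on an invented Series with one tie:

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])

ranks = pd.DataFrame({
    "average": s.rank(),              # default: ties share the mean rank
    "min": s.rank(method="min"),      # ties all get the lowest rank
    "max": s.rank(method="max"),      # ties all get the highest rank
    "dense": s.rank(method="dense"),  # like min, but no gaps after ties
    "pct": s.rank(pct=True),          # a fraction up to 1, not up to 100
})
print(ranks)
```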


# Replacing Values

In [39]:
# Note that column b gets promoted from integer to float because NaN cannot be stored in an integer type in NumPy
d = pd.DataFrame([{'a':1, 'b':2}, {'a':3, 'b':np.nan}, {'a':5, 'b': 6}])
d

Unnamed: 0,a,b
0,1,2.0
1,3,
2,5,6.0


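You can replace individual values with `map`, which takes a dict or a lambda. It operates on only one column at a time. Note that with a dict, any value that is not one of the keys comes back as NaN, whereas a lambda can pass unmatched values through unchanged.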

In [40]:
d.assign(b=d.b.map({2:99}))

Unnamed: 0,a,b
0,1,99.0
1,3,
2,5,


In [41]:
d.assign(b=d.b.map(lambda x: 99 if x==2 else x))

Unnamed: 0,a,b
0,1,99.0
1,3,
2,5,6.0
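
If you want the dict form but without losing the unmatched values, `.replace` is probably what you are after. A sketch comparing the two on a stand-in for column `b`:

```python
import numpy as np
import pandas as pd

b = pd.Series([2.0, np.nan, 6.0])

mapped = b.map({2: 99})        # unmatched 6.0 becomes NaN
replaced = b.replace({2: 99})  # unmatched 6.0 is left alone

print(pd.DataFrame({"mapped": mapped, "replaced": replaced}))
```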


You can run a map on all columns with `applymap` (pandas 2.1 deprecated this name in favor of `DataFrame.map`, which does the same thing):

In [42]:
d.applymap(lambda x: 99 if x==2 else x)

Unnamed: 0,a,b
0,1,99.0
1,3,
2,5,6.0


You can fill NAs with `fillna`, which optionally takes column-specific defaults:

In [43]:
d.fillna({'b':-99})

Unnamed: 0,a,b
0,1,2.0
1,3,-99.0
2,5,6.0


That makes it especially convenient to do something like this:

In [44]:
d.fillna(d.mean())

Unnamed: 0,a,b
0,1,2.0
1,3,4.0
2,5,6.0


`.combine_first` is like a version of coalesce which works at a full column or dataframe level.

In [45]:
# This fills in the value of column a into column b where there is a missing value:
d.assign(b=d.b.combine_first(d.a))

Unnamed: 0,a,b
0,1,2.0
1,3,3.0
2,5,6.0


In [46]:
e=pd.DataFrame([{'a':-99, 'b':-98}]*3)
e

Unnamed: 0,a,b
0,-99,-98
1,-99,-98
2,-99,-98


In [47]:
# This does the same thing at the full dataframe level instead of a single column:
d.combine_first(e)

Unnamed: 0,a,b
0,1,2.0
1,3,-98.0
2,5,6.0


# Categories

Just like `.str` exposes special functions for strings, `.cat` exposes special functions for categorical variables. 

Let's make some categorical variables with `cut`. (There is also a variant, `qcut`, which bins by equal quantiles instead of equal width.)
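
To make the difference between the two concrete, here is a small sketch on made-up values with one large outlier. Equal-width bins put almost everything in the first bin, while equal-quantile bins split the rows evenly.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

equal_width = pd.cut(values, bins=2)  # two bins spanning equal ranges
equal_count = pd.qcut(values, q=2)    # two bins holding equal numbers of rows

width_counts = equal_width.value_counts().sort_index().tolist()
count_counts = equal_count.value_counts().sort_index().tolist()

print(width_counts)  # [7, 1]
print(count_counts)  # [4, 4]
```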

In [48]:
dc = diamonds.groupby(pd.cut(diamonds['carat'],np.arange(0,5,.5))).\
 aggregate({'price':'mean'})
dc

Unnamed: 0_level_0,price
carat,Unnamed: 1_level_1
"(0.0, 0.5]",839.718149
"(0.5, 1.0]",2811.342683
"(1.0, 1.5]",6513.526534
"(1.5, 2.0]",11321.774838
"(2.0, 2.5]",14918.141237
"(2.5, 3.0]",15472.904255
"(3.0, 3.5]",14822.0
"(3.5, 4.0]",15636.5
"(4.0, 4.5]",16576.5


In [49]:
# We can move the categorical index back into a column and see that it has type Categorical
dc=dc.reset_index()
dc.carat.dtype

CategoricalDtype(categories=[(0.0, 0.5], (0.5, 1.0], (1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5]]
 ordered=True)

In [50]:
# Try tab-completing on dc.carat.cat.

# This gives us the integer values
dc.carat.cat.codes

0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
dtype: int8

In [51]:
# This gives us the labels
dc.carat.cat.categories

IntervalIndex([(0.0, 0.5], (0.5, 1.0], (1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5]]
 closed='right',
 dtype='interval[float64]')

In [52]:
# This is a metadata flag indicating whether the category order is semantically meaningful
dc.carat.cat.ordered

True

__There is a function to reorder categories, but it appears to have some bugs. Look at these examples:__

In [53]:
# You can reorder categories:
dc.carat.cat.reorder_categories(dc.carat.cat.categories[[0,2,1,3,4,6,5,7,8]])

0 (0.0, 0.5]
1 (1.0, 1.5]
2 (0.5, 1.0]
3 (0.5, 1.0]
4 (1.5, 2.0]
5 (2.0, 2.5]
6 (3.0, 3.5]
7 (2.5, 3.0]
8 (2.5, 3.0]
Name: carat, dtype: category
Categories (9, interval[float64]): [(0.0, 0.5] < (1.0, 1.5] < (0.5, 1.0] < (1.5, 2.0] ... (3.0, 3.5] < (2.5, 3.0] < (3.5, 4.0] < (4.0, 4.5]]

In [54]:
dc.carat.cat.reorder_categories(dc.carat.cat.categories[::-1])

0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
Name: carat, dtype: category
Categories (9, interval[float64]): [(4.0, 4.5] < (3.5, 4.0] < (3.0, 3.5] < (2.5, 3.0] ... (1.5, 2.0] < (1.0, 1.5] < (0.5, 1.0] < (0.0, 0.5]]

I submitted a bug:
https://github.com/pandas-dev/pandas/issues/23452
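
For contrast, here is a sketch of what I would expect `reorder_categories` to do, using plain string categories on an invented Series. It does not exercise the interval case from the bug report; it only shows the intended behaviour, where the values stay put and only the category order (and hence the codes) changes.

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium"], dtype="category")

reordered = sizes.cat.reorder_categories(
    ["small", "medium", "large"], ordered=True
)

print(reordered.tolist())            # values are unchanged
print(reordered.cat.codes.tolist())  # codes follow the new category order
```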

# What does "percent" mean?

Python libraries are disappointingly sloppy about using the word "percent" correctly. "Cent" is 100 and percents are supposed to be on a scale of 100, but often the word is used very shoddily on the scale of 1 instead. Here's an example:

In [55]:
pd.DataFrame({'x':[1,1,2,1]}).pct_change()

Unnamed: 0,x
0,
1,0.0
2,1.0
3,-0.5


This should be a 100% increase from 1 to 2, and then a 50% decrease from 2 back to 1, but it's actually 1.0 and -0.5. It's _not a percent_.
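
If you want actual percents, the fix is simply to multiply by 100 yourself. A minimal sketch with the same numbers:

```python
import pandas as pd

x = pd.Series([1, 1, 2, 1])

fraction = x.pct_change()  # NaN, 0.0, 1.0, -0.5
percent = fraction * 100   # NaN, 0.0, 100.0, -50.0

print(percent)
```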

The same thing is true in `scipy.stats`, for example with the "percent point function", which actually takes inputs from 0 to 1, not 0 to 100:

In [56]:
stats.norm.ppf([0.025,0.975]).round(2)

array([-1.96, 1.96])

Please do not follow these bad examples. The word "percent" does have a meaning.

# ToDo

These are some things I intend to write about but haven't gotten to yet:

.corr, .cov, .corrwith 
.duplicated, .drop_duplicates 
.sample(replace=), .take 
.get_dummies