{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4 new time-saving tricks in pandas ([video](https://www.youtube.com/watch?v=-NbY7E9hKxk&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=32))\n",
"\n",
"- [My pandas video series (30 videos)](http://www.dataschool.io/easier-data-analysis-with-pandas/)\n",
"- [GitHub repository](https://github.com/justmarkham/pandas-videos)\n",
"- [pandas release notes](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0.22.0'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"pd.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Create a datetime column from a DataFrame\n",
"\n",
"*New in 0.18.1*"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" month | \n",
" day | \n",
" year | \n",
" hour | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 12 | \n",
" 25 | \n",
" 2017 | \n",
" 10 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 15 | \n",
" 2018 | \n",
" 11 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" month day year hour\n",
"0 12 25 2017 10\n",
"1 1 15 2018 11"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create an example DataFrame\n",
"df = pd.DataFrame([[12, 25, 2017, 10], [1, 15, 2018, 11]],\n",
" columns=['month', 'day', 'year', 'hour'])\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 2017-12-25 10:00:00\n",
"1 2018-01-15 11:00:00\n",
"dtype: datetime64[ns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# new: create a datetime column from the entire DataFrame\n",
"pd.to_datetime(df)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 2017-12-25\n",
"1 2018-01-15\n",
"dtype: datetime64[ns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# new: create a datetime column from a subset of columns\n",
"pd.to_datetime(df[['month', 'day', 'year']])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" month | \n",
" day | \n",
" year | \n",
" hour | \n",
"
\n",
" \n",
" \n",
" \n",
" 2017-12-25 | \n",
" 12 | \n",
" 25 | \n",
" 2017 | \n",
" 10 | \n",
"
\n",
" \n",
" 2018-01-15 | \n",
" 1 | \n",
" 15 | \n",
" 2018 | \n",
" 11 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" month day year hour\n",
"2017-12-25 12 25 2017 10\n",
"2018-01-15 1 15 2018 11"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# overwrite the index\n",
"df.index = pd.to_datetime(df[['month', 'day', 'year']])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [More information](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#assembling-datetimes)\n",
"- [Video: How do I work with dates and times in pandas?](https://www.youtube.com/watch?v=yCgJGsg0Xa4&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=25)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Create a category column during file reading\n",
"\n",
"*New in 0.19.0*"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" beer_servings | \n",
" spirit_servings | \n",
" wine_servings | \n",
" total_litres_of_pure_alcohol | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Afghanistan | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" Asia | \n",
"
\n",
" \n",
" 1 | \n",
" Albania | \n",
" 89 | \n",
" 132 | \n",
" 54 | \n",
" 4.9 | \n",
" Europe | \n",
"
\n",
" \n",
" 2 | \n",
" Algeria | \n",
" 25 | \n",
" 0 | \n",
" 14 | \n",
" 0.7 | \n",
" Africa | \n",
"
\n",
" \n",
" 3 | \n",
" Andorra | \n",
" 245 | \n",
" 138 | \n",
" 312 | \n",
" 12.4 | \n",
" Europe | \n",
"
\n",
" \n",
" 4 | \n",
" Angola | \n",
" 217 | \n",
" 57 | \n",
" 45 | \n",
" 5.9 | \n",
" Africa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country beer_servings spirit_servings wine_servings \\\n",
"0 Afghanistan 0 0 0 \n",
"1 Albania 89 132 54 \n",
"2 Algeria 25 0 14 \n",
"3 Andorra 245 138 312 \n",
"4 Angola 217 57 45 \n",
"\n",
" total_litres_of_pure_alcohol continent \n",
"0 0.0 Asia \n",
"1 4.9 Europe \n",
"2 0.7 Africa \n",
"3 12.4 Europe \n",
"4 5.9 Africa "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# read the drinks dataset into a DataFrame\n",
"drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n",
"drinks.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"country object\n",
"beer_servings int64\n",
"spirit_servings int64\n",
"wine_servings int64\n",
"total_litres_of_pure_alcohol float64\n",
"continent object\n",
"dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# data types are automatically detected\n",
"drinks.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"country object\n",
"beer_servings int64\n",
"spirit_servings int64\n",
"wine_servings int64\n",
"total_litres_of_pure_alcohol float64\n",
"continent category\n",
"dtype: object"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# old way to create a category (after file reading)\n",
"drinks['continent'] = drinks.continent.astype('category')\n",
"drinks.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"country object\n",
"beer_servings int64\n",
"spirit_servings int64\n",
"wine_servings int64\n",
"total_litres_of_pure_alcohol float64\n",
"continent category\n",
"dtype: object"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# new way to create a category (during file reading)\n",
"drinks = pd.read_csv('http://bit.ly/drinksbycountry', dtype={'continent':'category'})\n",
"drinks.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [More information](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#read-csv-supports-parsing-categorical-directly)\n",
"- [Video: How do I make my pandas DataFrame smaller and faster?](https://www.youtube.com/watch?v=wDYDYGyN_cw&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=21)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Convert the data type of multiple columns at once\n",
"\n",
"*New in 0.19.0*"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"country object\n",
"beer_servings int64\n",
"spirit_servings int64\n",
"wine_servings int64\n",
"total_litres_of_pure_alcohol float64\n",
"continent object\n",
"dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# read the drinks dataset into a DataFrame\n",
"drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n",
"drinks.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"country object\n",
"beer_servings float64\n",
"spirit_servings float64\n",
"wine_servings int64\n",
"total_litres_of_pure_alcohol float64\n",
"continent object\n",
"dtype: object"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# old way to convert data types (one at a time)\n",
"drinks['beer_servings'] = drinks.beer_servings.astype('float')\n",
"drinks['spirit_servings'] = drinks.spirit_servings.astype('float')\n",
"drinks.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"country object\n",
"beer_servings float64\n",
"spirit_servings float64\n",
"wine_servings int64\n",
"total_litres_of_pure_alcohol float64\n",
"continent object\n",
"dtype: object"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# new way to convert data types (all at once)\n",
"drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n",
"drinks = drinks.astype({'beer_servings':'float', 'spirit_servings':'float'})\n",
"drinks.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [More information](http://pandas.pydata.org/pandas-docs/stable/basics.html#astype)\n",
"- [Video: How do I change the data type of a pandas Series?](https://www.youtube.com/watch?v=V0AWyzVMf54&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=13)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Apply multiple aggregations on a Series or DataFrame\n",
"\n",
"*New in 0.20.0*"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"continent\n",
"Africa 61.471698\n",
"Asia 37.045455\n",
"Europe 193.777778\n",
"North America 145.434783\n",
"Oceania 89.687500\n",
"South America 175.083333\n",
"Name: beer_servings, dtype: float64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# example of a single aggregation function after a groupby\n",
"drinks.groupby('continent').beer_servings.mean()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mean | \n",
" min | \n",
" max | \n",
"
\n",
" \n",
" continent | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Africa | \n",
" 61.471698 | \n",
" 0.0 | \n",
" 376.0 | \n",
"
\n",
" \n",
" Asia | \n",
" 37.045455 | \n",
" 0.0 | \n",
" 247.0 | \n",
"
\n",
" \n",
" Europe | \n",
" 193.777778 | \n",
" 0.0 | \n",
" 361.0 | \n",
"
\n",
" \n",
" North America | \n",
" 145.434783 | \n",
" 1.0 | \n",
" 285.0 | \n",
"
\n",
" \n",
" Oceania | \n",
" 89.687500 | \n",
" 0.0 | \n",
" 306.0 | \n",
"
\n",
" \n",
" South America | \n",
" 175.083333 | \n",
" 93.0 | \n",
" 333.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mean min max\n",
"continent \n",
"Africa 61.471698 0.0 376.0\n",
"Asia 37.045455 0.0 247.0\n",
"Europe 193.777778 0.0 361.0\n",
"North America 145.434783 1.0 285.0\n",
"Oceania 89.687500 0.0 306.0\n",
"South America 175.083333 93.0 333.0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# multiple aggregation functions can be applied simultaneously\n",
"drinks.groupby('continent').beer_servings.agg(['mean', 'min', 'max'])"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"mean 106.160622\n",
"min 0.000000\n",
"max 376.000000\n",
"Name: beer_servings, dtype: float64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# new: apply the same aggregations to a Series\n",
"drinks.beer_servings.agg(['mean', 'min', 'max'])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" beer_servings | \n",
" spirit_servings | \n",
" wine_servings | \n",
" total_litres_of_pure_alcohol | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" max | \n",
" Zimbabwe | \n",
" 376.000000 | \n",
" 438.000000 | \n",
" 370.000000 | \n",
" 14.400000 | \n",
" South America | \n",
"
\n",
" \n",
" mean | \n",
" NaN | \n",
" 106.160622 | \n",
" 80.994819 | \n",
" 49.450777 | \n",
" 4.717098 | \n",
" NaN | \n",
"
\n",
" \n",
" min | \n",
" Afghanistan | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" Africa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country beer_servings spirit_servings wine_servings \\\n",
"max Zimbabwe 376.000000 438.000000 370.000000 \n",
"mean NaN 106.160622 80.994819 49.450777 \n",
"min Afghanistan 0.000000 0.000000 0.000000 \n",
"\n",
" total_litres_of_pure_alcohol continent \n",
"max 14.400000 South America \n",
"mean 4.717098 NaN \n",
"min 0.000000 Africa "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# new: apply the same aggregations to a DataFrame\n",
"drinks.agg(['mean', 'min', 'max'])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" beer_servings | \n",
" spirit_servings | \n",
" wine_servings | \n",
" total_litres_of_pure_alcohol | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 193.000000 | \n",
" 193.000000 | \n",
" 193.000000 | \n",
" 193.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 106.160622 | \n",
" 80.994819 | \n",
" 49.450777 | \n",
" 4.717098 | \n",
"
\n",
" \n",
" std | \n",
" 101.143103 | \n",
" 88.284312 | \n",
" 79.697598 | \n",
" 3.773298 | \n",
"
\n",
" \n",
" min | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 20.000000 | \n",
" 4.000000 | \n",
" 1.000000 | \n",
" 1.300000 | \n",
"
\n",
" \n",
" 50% | \n",
" 76.000000 | \n",
" 56.000000 | \n",
" 8.000000 | \n",
" 4.200000 | \n",
"
\n",
" \n",
" 75% | \n",
" 188.000000 | \n",
" 128.000000 | \n",
" 59.000000 | \n",
" 7.200000 | \n",
"
\n",
" \n",
" max | \n",
" 376.000000 | \n",
" 438.000000 | \n",
" 370.000000 | \n",
" 14.400000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" beer_servings spirit_servings wine_servings \\\n",
"count 193.000000 193.000000 193.000000 \n",
"mean 106.160622 80.994819 49.450777 \n",
"std 101.143103 88.284312 79.697598 \n",
"min 0.000000 0.000000 0.000000 \n",
"25% 20.000000 4.000000 1.000000 \n",
"50% 76.000000 56.000000 8.000000 \n",
"75% 188.000000 128.000000 59.000000 \n",
"max 376.000000 438.000000 370.000000 \n",
"\n",
" total_litres_of_pure_alcohol \n",
"count 193.000000 \n",
"mean 4.717098 \n",
"std 3.773298 \n",
"min 0.000000 \n",
"25% 1.300000 \n",
"50% 4.200000 \n",
"75% 7.200000 \n",
"max 14.400000 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DataFrame describe method provides similar functionality but is less flexible\n",
"drinks.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [More information](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-aggregate)\n",
"- [Video: When should I use a \"groupby\" in pandas?](https://www.youtube.com/watch?v=qy0fDqoMJx8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=14)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bonus: Download the official pandas cheat sheet\n",
"\n",
"*New in 0.19.2*\n",
"\n",
"[Cheat sheet (PDF)](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}