{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data School: My top 25 pandas tricks ([video](https://www.youtube.com/watch?v=RlIiVeig3hc&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=35))\n",
"\n",
"**See also:** \"21 more pandas tricks\" [video](https://www.youtube.com/watch?v=tWFQqaRtSQA&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=36) and [notebook](https://nbviewer.org/github/justmarkham/pandas-videos/blob/master/21_more_pandas_tricks.ipynb)\n",
"\n",
"- Watch the [complete pandas video series](https://www.dataschool.io/easier-data-analysis-with-pandas/)\n",
"- Connect on [Twitter](https://twitter.com/justmarkham), [Facebook](https://www.facebook.com/DataScienceSchool/), and [LinkedIn](https://www.linkedin.com/in/justmarkham/)\n",
"- Subscribe on [YouTube](https://www.youtube.com/dataschool?sub_confirmation=1)\n",
"- Join the [email newsletter](https://www.dataschool.io/subscribe/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of contents\n",
"\n",
"1. Show installed versions\n",
"2. Create an example DataFrame\n",
"3. Rename columns\n",
"4. Reverse row order\n",
"5. Reverse column order\n",
"6. Select columns by data type\n",
"7. Convert strings to numbers\n",
"8. Reduce DataFrame size\n",
"9. Build a DataFrame from multiple files (row-wise)\n",
"10. Build a DataFrame from multiple files (column-wise)\n",
"11. Create a DataFrame from the clipboard\n",
"12. Split a DataFrame into two random subsets\n",
"13. Filter a DataFrame by multiple categories\n",
"14. Filter a DataFrame by largest categories\n",
"15. Handle missing values\n",
"16. Split a string into multiple columns\n",
"17. Expand a Series of lists into a DataFrame\n",
"18. Aggregate by multiple functions\n",
"19. Combine the output of an aggregation with a DataFrame\n",
"20. Select a slice of rows and columns\n",
"21. Reshape a MultiIndexed Series\n",
"22. Create a pivot table\n",
"23. Convert continuous data into categorical data\n",
"24. Change display options\n",
"25. Style a DataFrame\n",
"26. Bonus trick: Profile a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load example datasets"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n",
"movies = pd.read_csv('http://bit.ly/imdbratings')\n",
"orders = pd.read_csv('http://bit.ly/chiporders', sep='\\t')\n",
"orders['item_price'] = orders.item_price.str.replace('$', '').astype('float')\n",
"stocks = pd.read_csv('http://bit.ly/smallstocks', parse_dates=['Date'])\n",
"titanic = pd.read_csv('http://bit.ly/kaggletrain')\n",
"ufo = pd.read_csv('http://bit.ly/uforeports', parse_dates=['Time'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Show installed versions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes you need to know the pandas version you're using, especially when reading the pandas documentation. You can show the pandas version by typing:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0.24.2'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But if you also need to know the versions of pandas' dependencies, you can use the `show_versions()` function:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"INSTALLED VERSIONS\n",
"------------------\n",
"commit: None\n",
"python: 3.7.3.final.0\n",
"python-bits: 64\n",
"OS: Darwin\n",
"OS-release: 18.6.0\n",
"machine: x86_64\n",
"processor: i386\n",
"byteorder: little\n",
"LC_ALL: None\n",
"LANG: en_US.UTF-8\n",
"LOCALE: en_US.UTF-8\n",
"\n",
"pandas: 0.24.2\n",
"pytest: None\n",
"pip: 19.1.1\n",
"setuptools: 41.0.1\n",
"Cython: None\n",
"numpy: 1.16.4\n",
"scipy: None\n",
"pyarrow: None\n",
"xarray: None\n",
"IPython: 7.5.0\n",
"sphinx: None\n",
"patsy: None\n",
"dateutil: 2.8.0\n",
"pytz: 2019.1\n",
"blosc: None\n",
"bottleneck: None\n",
"tables: None\n",
"numexpr: None\n",
"feather: None\n",
"matplotlib: 3.1.0\n",
"openpyxl: None\n",
"xlrd: None\n",
"xlwt: None\n",
"xlsxwriter: None\n",
"lxml.etree: None\n",
"bs4: None\n",
"html5lib: None\n",
"sqlalchemy: None\n",
"pymysql: None\n",
"psycopg2: None\n",
"jinja2: 2.10.1\n",
"s3fs: None\n",
"fastparquet: None\n",
"pandas_gbq: None\n",
"pandas_datareader: None\n",
"gcsfs: None\n"
]
}
],
"source": [
"pd.show_versions()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see the versions of Python, pandas, NumPy, matplotlib, and more."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Create an example DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say that you want to demonstrate some pandas code. You need an example DataFrame to work with.\n",
"\n",
"There are many ways to do this, but my favorite way is to pass a dictionary to the DataFrame constructor, in which the dictionary keys are the column names and the dictionary values are lists of column values:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col one | \n",
" col two | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 100 | \n",
" 300 | \n",
"
\n",
" \n",
" 1 | \n",
" 200 | \n",
" 400 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col one col two\n",
"0 100 300\n",
"1 200 400"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({'col one':[100, 200], 'col two':[300, 400]})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now if you need a much larger DataFrame, the above method will require way too much typing. In that case, you can use NumPy's `random.rand()` function, tell it the number of rows and columns, and pass that to the DataFrame constructor:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.765050 | \n",
" 0.672438 | \n",
" 0.658516 | \n",
" 0.515231 | \n",
" 0.314563 | \n",
" 0.759657 | \n",
" 0.838804 | \n",
" 0.154178 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.526786 | \n",
" 0.258871 | \n",
" 0.032577 | \n",
" 0.635255 | \n",
" 0.008315 | \n",
" 0.827765 | \n",
" 0.574318 | \n",
" 0.781200 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.114055 | \n",
" 0.795156 | \n",
" 0.144248 | \n",
" 0.161738 | \n",
" 0.624836 | \n",
" 0.223252 | \n",
" 0.492255 | \n",
" 0.274132 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.014080 | \n",
" 0.097308 | \n",
" 0.422632 | \n",
" 0.098952 | \n",
" 0.471007 | \n",
" 0.307562 | \n",
" 0.503040 | \n",
" 0.317663 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3 4 5 6 \\\n",
"0 0.765050 0.672438 0.658516 0.515231 0.314563 0.759657 0.838804 \n",
"1 0.526786 0.258871 0.032577 0.635255 0.008315 0.827765 0.574318 \n",
"2 0.114055 0.795156 0.144248 0.161738 0.624836 0.223252 0.492255 \n",
"3 0.014080 0.097308 0.422632 0.098952 0.471007 0.307562 0.503040 \n",
"\n",
" 7 \n",
"0 0.154178 \n",
"1 0.781200 \n",
"2 0.274132 \n",
"3 0.317663 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(np.random.rand(4, 8))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's pretty good, but if you also want non-numeric column names, you can coerce a string of letters to a list and then pass that list to the columns parameter:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" e | \n",
" f | \n",
" g | \n",
" h | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.929156 | \n",
" 0.665603 | \n",
" 0.934804 | \n",
" 0.498339 | \n",
" 0.598148 | \n",
" 0.717280 | \n",
" 0.304452 | \n",
" 0.311813 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.308736 | \n",
" 0.418361 | \n",
" 0.758243 | \n",
" 0.733521 | \n",
" 0.145216 | \n",
" 0.822932 | \n",
" 0.369632 | \n",
" 0.470175 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.964671 | \n",
" 0.439196 | \n",
" 0.377538 | \n",
" 0.547604 | \n",
" 0.138113 | \n",
" 0.789990 | \n",
" 0.615333 | \n",
" 0.540587 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.108064 | \n",
" 0.834134 | \n",
" 0.367098 | \n",
" 0.132073 | \n",
" 0.608710 | \n",
" 0.783628 | \n",
" 0.347594 | \n",
" 0.836521 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c d e f g \\\n",
"0 0.929156 0.665603 0.934804 0.498339 0.598148 0.717280 0.304452 \n",
"1 0.308736 0.418361 0.758243 0.733521 0.145216 0.822932 0.369632 \n",
"2 0.964671 0.439196 0.377538 0.547604 0.138113 0.789990 0.615333 \n",
"3 0.108064 0.834134 0.367098 0.132073 0.608710 0.783628 0.347594 \n",
"\n",
" h \n",
"0 0.311813 \n",
"1 0.470175 \n",
"2 0.540587 \n",
"3 0.836521 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you might guess, your string will need to have the same number of characters as there are columns."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Rename columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the example DataFrame we created in the last trick:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col one | \n",
" col two | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 100 | \n",
" 300 | \n",
"
\n",
" \n",
" 1 | \n",
" 200 | \n",
" 400 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col one col two\n",
"0 100 300\n",
"1 200 400"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I prefer to use dot notation to select pandas columns, but that won't work since the column names have spaces. Let's fix this.\n",
"\n",
"The most flexible method for renaming columns is the `rename()` method. You pass it a dictionary in which the keys are the old names and the values are the new names, and you also specify the axis:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"df = df.rename({'col one':'col_one', 'col two':'col_two'}, axis='columns')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The best thing about this method is that you can use it to rename any number of columns, whether it be just one column or all columns.\n",
"\n",
"Now if you're going to rename all of the columns at once, a simpler method is just to overwrite the columns attribute of the DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"df.columns = ['col_one', 'col_two']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now if the only thing you're doing is replacing spaces with underscores, an even better method is to use the `str.replace()` method, since you don't have to type out all of the column names:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"df.columns = df.columns.str.replace(' ', '_')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All three of these methods have the same result, which is to rename the columns so that they don't have any spaces:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col_one | \n",
" col_two | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 100 | \n",
" 300 | \n",
"
\n",
" \n",
" 1 | \n",
" 200 | \n",
" 400 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col_one col_two\n",
"0 100 300\n",
"1 200 400"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, if you just need to add a prefix or suffix to all of your column names, you can use the `add_prefix()` method..."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" X_col_one | \n",
" X_col_two | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 100 | \n",
" 300 | \n",
"
\n",
" \n",
" 1 | \n",
" 200 | \n",
" 400 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" X_col_one X_col_two\n",
"0 100 300\n",
"1 200 400"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.add_prefix('X_')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...or the `add_suffix()` method:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col_one_Y | \n",
" col_two_Y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 100 | \n",
" 300 | \n",
"
\n",
" \n",
" 1 | \n",
" 200 | \n",
" 400 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col_one_Y col_two_Y\n",
"0 100 300\n",
"1 200 400"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.add_suffix('_Y')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Reverse row order"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the drinks DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" beer_servings | \n",
" spirit_servings | \n",
" wine_servings | \n",
" total_litres_of_pure_alcohol | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Afghanistan | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" Asia | \n",
"
\n",
" \n",
" 1 | \n",
" Albania | \n",
" 89 | \n",
" 132 | \n",
" 54 | \n",
" 4.9 | \n",
" Europe | \n",
"
\n",
" \n",
" 2 | \n",
" Algeria | \n",
" 25 | \n",
" 0 | \n",
" 14 | \n",
" 0.7 | \n",
" Africa | \n",
"
\n",
" \n",
" 3 | \n",
" Andorra | \n",
" 245 | \n",
" 138 | \n",
" 312 | \n",
" 12.4 | \n",
" Europe | \n",
"
\n",
" \n",
" 4 | \n",
" Angola | \n",
" 217 | \n",
" 57 | \n",
" 45 | \n",
" 5.9 | \n",
" Africa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country beer_servings spirit_servings wine_servings \\\n",
"0 Afghanistan 0 0 0 \n",
"1 Albania 89 132 54 \n",
"2 Algeria 25 0 14 \n",
"3 Andorra 245 138 312 \n",
"4 Angola 217 57 45 \n",
"\n",
" total_litres_of_pure_alcohol continent \n",
"0 0.0 Asia \n",
"1 4.9 Europe \n",
"2 0.7 Africa \n",
"3 12.4 Europe \n",
"4 5.9 Africa "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drinks.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a dataset of average alcohol consumption by country. What if you wanted to reverse the order of the rows?\n",
"\n",
"The most straightforward method is to use the `loc` accessor and pass it `::-1`, which is the same slicing notation used to reverse a Python list:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" beer_servings | \n",
" spirit_servings | \n",
" wine_servings | \n",
" total_litres_of_pure_alcohol | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" 192 | \n",
" Zimbabwe | \n",
" 64 | \n",
" 18 | \n",
" 4 | \n",
" 4.7 | \n",
" Africa | \n",
"
\n",
" \n",
" 191 | \n",
" Zambia | \n",
" 32 | \n",
" 19 | \n",
" 4 | \n",
" 2.5 | \n",
" Africa | \n",
"
\n",
" \n",
" 190 | \n",
" Yemen | \n",
" 6 | \n",
" 0 | \n",
" 0 | \n",
" 0.1 | \n",
" Asia | \n",
"
\n",
" \n",
" 189 | \n",
" Vietnam | \n",
" 111 | \n",
" 2 | \n",
" 1 | \n",
" 2.0 | \n",
" Asia | \n",
"
\n",
" \n",
" 188 | \n",
" Venezuela | \n",
" 333 | \n",
" 100 | \n",
" 3 | \n",
" 7.7 | \n",
" South America | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country beer_servings spirit_servings wine_servings \\\n",
"192 Zimbabwe 64 18 4 \n",
"191 Zambia 32 19 4 \n",
"190 Yemen 6 0 0 \n",
"189 Vietnam 111 2 1 \n",
"188 Venezuela 333 100 3 \n",
"\n",
" total_litres_of_pure_alcohol continent \n",
"192 4.7 Africa \n",
"191 2.5 Africa \n",
"190 0.1 Asia \n",
"189 2.0 Asia \n",
"188 7.7 South America "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drinks.loc[::-1].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if you also wanted to reset the index so that it starts at zero?\n",
"\n",
"You would use the `reset_index()` method and tell it to drop the old index entirely:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" beer_servings | \n",
" spirit_servings | \n",
" wine_servings | \n",
" total_litres_of_pure_alcohol | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Zimbabwe | \n",
" 64 | \n",
" 18 | \n",
" 4 | \n",
" 4.7 | \n",
" Africa | \n",
"
\n",
" \n",
" 1 | \n",
" Zambia | \n",
" 32 | \n",
" 19 | \n",
" 4 | \n",
" 2.5 | \n",
" Africa | \n",
"
\n",
" \n",
" 2 | \n",
" Yemen | \n",
" 6 | \n",
" 0 | \n",
" 0 | \n",
" 0.1 | \n",
" Asia | \n",
"
\n",
" \n",
" 3 | \n",
" Vietnam | \n",
" 111 | \n",
" 2 | \n",
" 1 | \n",
" 2.0 | \n",
" Asia | \n",
"
\n",
" \n",
" 4 | \n",
" Venezuela | \n",
" 333 | \n",
" 100 | \n",
" 3 | \n",
" 7.7 | \n",
" South America | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country beer_servings spirit_servings wine_servings \\\n",
"0 Zimbabwe 64 18 4 \n",
"1 Zambia 32 19 4 \n",
"2 Yemen 6 0 0 \n",
"3 Vietnam 111 2 1 \n",
"4 Venezuela 333 100 3 \n",
"\n",
" total_litres_of_pure_alcohol continent \n",
"0 4.7 Africa \n",
"1 2.5 Africa \n",
"2 0.1 Asia \n",
"3 2.0 Asia \n",
"4 7.7 South America "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drinks.loc[::-1].reset_index(drop=True).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the rows are in reverse order but the index has been reset to the default integer index."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Reverse column order"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to the previous trick, you can also use `loc` to reverse the left-to-right order of your columns:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" continent | \n",
" total_litres_of_pure_alcohol | \n",
" wine_servings | \n",
" spirit_servings | \n",
" beer_servings | \n",
" country | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Asia | \n",
" 0.0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" Afghanistan | \n",
"
\n",
" \n",
" 1 | \n",
" Europe | \n",
" 4.9 | \n",
" 54 | \n",
" 132 | \n",
" 89 | \n",
" Albania | \n",
"
\n",
" \n",
" 2 | \n",
" Africa | \n",
" 0.7 | \n",
" 14 | \n",
" 0 | \n",
" 25 | \n",
" Algeria | \n",
"
\n",
" \n",
" 3 | \n",
" Europe | \n",
" 12.4 | \n",
" 312 | \n",
" 138 | \n",
" 245 | \n",
" Andorra | \n",
"
\n",
" \n",
" 4 | \n",
" Africa | \n",
" 5.9 | \n",
" 45 | \n",
" 57 | \n",
" 217 | \n",
" Angola | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" continent total_litres_of_pure_alcohol wine_servings spirit_servings \\\n",
"0 Asia 0.0 0 0 \n",
"1 Europe 4.9 54 132 \n",
"2 Africa 0.7 14 0 \n",
"3 Europe 12.4 312 138 \n",
"4 Africa 5.9 45 57 \n",
"\n",
" beer_servings country \n",
"0 0 Afghanistan \n",
"1 89 Albania \n",
"2 25 Algeria \n",
"3 245 Andorra \n",
"4 217 Angola "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drinks.loc[:, ::-1].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The colon before the comma means \"select all rows\", and the `::-1` after the comma means \"reverse the columns\", which is why \"country\" is now on the right side."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Select columns by data type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are the data types of the drinks DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"country object\n",
"beer_servings int64\n",
"spirit_servings int64\n",
"wine_servings int64\n",
"total_litres_of_pure_alcohol float64\n",
"continent object\n",
"dtype: object"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drinks.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say you need to select only the numeric columns. You can use the `select_dtypes()` method:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" beer_servings | \n",
" spirit_servings | \n",
" wine_servings | \n",
" total_litres_of_pure_alcohol | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 89 | \n",
" 132 | \n",
" 54 | \n",
" 4.9 | \n",
"
\n",
" \n",
" 2 | \n",
" 25 | \n",
" 0 | \n",
" 14 | \n",
" 0.7 | \n",
"
\n",
" \n",
" 3 | \n",
" 245 | \n",
" 138 | \n",
" 312 | \n",
" 12.4 | \n",
"
\n",
" \n",
" 4 | \n",
" 217 | \n",
" 57 | \n",
" 45 | \n",
" 5.9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol\n",
"0 0 0 0 0.0\n",
"1 89 132 54 4.9\n",
"2 25 0 14 0.7\n",
"3 245 138 312 12.4\n",
"4 217 57 45 5.9"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drinks.select_dtypes(include='number').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This includes both int and float columns.\n",
"\n",
"You could also use this method to select just the object columns:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Afghanistan | \n",
" Asia | \n",
"
\n",
" \n",
" 1 | \n",
" Albania | \n",
" Europe | \n",
"
\n",
" \n",
" 2 | \n",
" Algeria | \n",
" Africa | \n",
"
\n",
" \n",
" 3 | \n",
" Andorra | \n",
" Europe | \n",
"
\n",
" \n",
" 4 | \n",
" Angola | \n",
" Africa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country continent\n",
"0 Afghanistan Asia\n",
"1 Albania Europe\n",
"2 Algeria Africa\n",
"3 Andorra Europe\n",
"4 Angola Africa"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drinks.select_dtypes(include='object').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can tell it to include multiple data types by passing a list:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" beer_servings | \n",
" spirit_servings | \n",
" wine_servings | \n",
" total_litres_of_pure_alcohol | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Afghanistan | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" Asia | \n",
"
\n",
" \n",
" 1 | \n",
" Albania | \n",
" 89 | \n",
" 132 | \n",
" 54 | \n",
" 4.9 | \n",
" Europe | \n",
"
\n",
" \n",
" 2 | \n",
" Algeria | \n",
" 25 | \n",
" 0 | \n",
" 14 | \n",
" 0.7 | \n",
" Africa | \n",
"
\n",
" \n",
" 3 | \n",
" Andorra | \n",
" 245 | \n",
" 138 | \n",
" 312 | \n",
" 12.4 | \n",
" Europe | \n",
"
\n",
" \n",
" 4 | \n",
" Angola | \n",
" 217 | \n",
" 57 | \n",
" 45 | \n",
" 5.9 | \n",
" Africa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country beer_servings spirit_servings wine_servings \\\n",
"0 Afghanistan 0 0 0 \n",
"1 Albania 89 132 54 \n",
"2 Algeria 25 0 14 \n",
"3 Andorra 245 138 312 \n",
"4 Angola 217 57 45 \n",
"\n",
" total_litres_of_pure_alcohol continent \n",
"0 0.0 Asia \n",
"1 4.9 Europe \n",
"2 0.7 Africa \n",
"3 12.4 Europe \n",
"4 5.9 Africa "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drinks.select_dtypes(include=['number', 'object', 'category', 'datetime']).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also tell it to exclude certain data types:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Afghanistan | \n",
" Asia | \n",
"
\n",
" \n",
" 1 | \n",
" Albania | \n",
" Europe | \n",
"
\n",
" \n",
" 2 | \n",
" Algeria | \n",
" Africa | \n",
"
\n",
" \n",
" 3 | \n",
" Andorra | \n",
" Europe | \n",
"
\n",
" \n",
" 4 | \n",
" Angola | \n",
" Africa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country continent\n",
"0 Afghanistan Asia\n",
"1 Albania Europe\n",
"2 Algeria Africa\n",
"3 Andorra Europe\n",
"4 Angola Africa"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drinks.select_dtypes(exclude='number').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Convert strings to numbers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create another example DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col_one | \n",
" col_two | \n",
" col_three | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.1 | \n",
" 4.4 | \n",
" 7.7 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.2 | \n",
" 5.5 | \n",
" 8.8 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.3 | \n",
" 6.6 | \n",
" - | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col_one col_two col_three\n",
"0 1.1 4.4 7.7\n",
"1 2.2 5.5 8.8\n",
"2 3.3 6.6 -"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({'col_one':['1.1', '2.2', '3.3'],\n",
" 'col_two':['4.4', '5.5', '6.6'],\n",
" 'col_three':['7.7', '8.8', '-']})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These numbers are actually stored as strings, which results in object columns:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"col_one object\n",
"col_two object\n",
"col_three object\n",
"dtype: object"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to do mathematical operations on these columns, we need to convert the data types to numeric. You can use the `astype()` method on the first two columns:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"col_one float64\n",
"col_two float64\n",
"col_three object\n",
"dtype: object"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.astype({'col_one':'float', 'col_two':'float'}).dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, this would have resulted in an error if you tried to use it on the third column, because that column contains a dash to represent zero and pandas doesn't understand how to handle it.\n",
"\n",
"Instead, you can use the `to_numeric()` function on the third column and tell it to convert any invalid input into `NaN` values:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 7.7\n",
"1 8.8\n",
"2 NaN\n",
"Name: col_three, dtype: float64"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.to_numeric(df.col_three, errors='coerce')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you know that the `NaN` values actually represent zeros, you can fill them with zeros using the `fillna()` method:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 7.7\n",
"1 8.8\n",
"2 0.0\n",
"Name: col_three, dtype: float64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.to_numeric(df.col_three, errors='coerce').fillna(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, you can apply this function to the entire DataFrame all at once by using the `apply()` method:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col_one | \n",
" col_two | \n",
" col_three | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.1 | \n",
" 4.4 | \n",
" 7.7 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.2 | \n",
" 5.5 | \n",
" 8.8 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.3 | \n",
" 6.6 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col_one col_two col_three\n",
"0 1.1 4.4 7.7\n",
"1 2.2 5.5 8.8\n",
"2 3.3 6.6 0.0"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df.apply(pd.to_numeric, errors='coerce').fillna(0)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This one line of code accomplishes our goal, because all of the data types have now been converted to float:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"col_one float64\n",
"col_two float64\n",
"col_three float64\n",
"dtype: object"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Reduce DataFrame size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas DataFrames are designed to fit into memory, and so sometimes you need to reduce the DataFrame size in order to work with it on your system.\n",
"\n",
"Here's the size of the drinks DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 193 entries, 0 to 192\n",
"Data columns (total 6 columns):\n",
"country 193 non-null object\n",
"beer_servings 193 non-null int64\n",
"spirit_servings 193 non-null int64\n",
"wine_servings 193 non-null int64\n",
"total_litres_of_pure_alcohol 193 non-null float64\n",
"continent 193 non-null object\n",
"dtypes: float64(1), int64(3), object(2)\n",
"memory usage: 30.4 KB\n"
]
}
],
"source": [
"drinks.info(memory_usage='deep')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that it currently uses 30.4 KB.\n",
"\n",
"If you're having performance problems with your DataFrame, or you can't even read it into memory, there are two easy steps you can take during the file reading process to reduce the DataFrame size.\n",
"\n",
"The first step is to only read in the columns that you actually need, which we specify with the \"usecols\" parameter:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 193 entries, 0 to 192\n",
"Data columns (total 2 columns):\n",
"beer_servings 193 non-null int64\n",
"continent 193 non-null object\n",
"dtypes: int64(1), object(1)\n",
"memory usage: 13.6 KB\n"
]
}
],
"source": [
"cols = ['beer_servings', 'continent']\n",
"small_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols)\n",
"small_drinks.info(memory_usage='deep')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By only reading in these two columns, we've reduced the DataFrame size to 13.6 KB.\n",
"\n",
"The second step is to convert any object columns containing categorical data to the category data type, which we specify with the \"dtype\" parameter:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 193 entries, 0 to 192\n",
"Data columns (total 2 columns):\n",
"beer_servings 193 non-null int64\n",
"continent 193 non-null category\n",
"dtypes: category(1), int64(1)\n",
"memory usage: 2.3 KB\n"
]
}
],
"source": [
"dtypes = {'continent':'category'}\n",
"smaller_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols, dtype=dtypes)\n",
"smaller_drinks.info(memory_usage='deep')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By reading in the continent column as the category data type, we've further reduced the DataFrame size to 2.3 KB.\n",
"\n",
"Keep in mind that the category data type will only reduce memory usage if you have a small number of categories relative to the number of rows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Build a DataFrame from multiple files (row-wise)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say that your dataset is spread across multiple files, but you want to read the dataset into a single DataFrame.\n",
"\n",
"For example, I have a small dataset of stock data in which each CSV file only includes a single day. Here's the first day:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Close | \n",
" Volume | \n",
" Symbol | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2016-10-03 | \n",
" 31.50 | \n",
" 14070500 | \n",
" CSCO | \n",
"
\n",
" \n",
" 1 | \n",
" 2016-10-03 | \n",
" 112.52 | \n",
" 21701800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 2 | \n",
" 2016-10-03 | \n",
" 57.42 | \n",
" 19189500 | \n",
" MSFT | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Close Volume Symbol\n",
"0 2016-10-03 31.50 14070500 CSCO\n",
"1 2016-10-03 112.52 21701800 AAPL\n",
"2 2016-10-03 57.42 19189500 MSFT"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv('data/stocks1.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's the second day:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Close | \n",
" Volume | \n",
" Symbol | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2016-10-04 | \n",
" 113.00 | \n",
" 29736800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 1 | \n",
" 2016-10-04 | \n",
" 57.24 | \n",
" 20085900 | \n",
" MSFT | \n",
"
\n",
" \n",
" 2 | \n",
" 2016-10-04 | \n",
" 31.35 | \n",
" 18460400 | \n",
" CSCO | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Close Volume Symbol\n",
"0 2016-10-04 113.00 29736800 AAPL\n",
"1 2016-10-04 57.24 20085900 MSFT\n",
"2 2016-10-04 31.35 18460400 CSCO"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv('data/stocks2.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And here's the third day:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Close | \n",
" Volume | \n",
" Symbol | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2016-10-05 | \n",
" 57.64 | \n",
" 16726400 | \n",
" MSFT | \n",
"
\n",
" \n",
" 1 | \n",
" 2016-10-05 | \n",
" 31.59 | \n",
" 11808600 | \n",
" CSCO | \n",
"
\n",
" \n",
" 2 | \n",
" 2016-10-05 | \n",
" 113.05 | \n",
" 21453100 | \n",
" AAPL | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Close Volume Symbol\n",
"0 2016-10-05 57.64 16726400 MSFT\n",
"1 2016-10-05 31.59 11808600 CSCO\n",
"2 2016-10-05 113.05 21453100 AAPL"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv('data/stocks3.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You could read each CSV file into its own DataFrame, combine them together, and then delete the original DataFrames, but that would be memory inefficient and require a lot of code.\n",
"\n",
"A better solution is to use the built-in glob module:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"from glob import glob"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pass a pattern to `glob()`, including wildcard characters, and it will return a list of all files that match that pattern.\n",
"\n",
"In this case, glob is looking in the \"data\" subdirectory for all CSV files that start with the word \"stocks\":"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['data/stocks1.csv', 'data/stocks2.csv', 'data/stocks3.csv']"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stock_files = sorted(glob('data/stocks*.csv'))\n",
"stock_files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"glob returns filenames in an arbitrary order, which is why we sorted the list using Python's built-in `sorted()` function.\n",
"\n",
"We can then use a generator expression to read each of the files using `read_csv()` and pass the results to the `concat()` function, which will concatenate the rows into a single DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Close | \n",
" Volume | \n",
" Symbol | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2016-10-03 | \n",
" 31.50 | \n",
" 14070500 | \n",
" CSCO | \n",
"
\n",
" \n",
" 1 | \n",
" 2016-10-03 | \n",
" 112.52 | \n",
" 21701800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 2 | \n",
" 2016-10-03 | \n",
" 57.42 | \n",
" 19189500 | \n",
" MSFT | \n",
"
\n",
" \n",
" 0 | \n",
" 2016-10-04 | \n",
" 113.00 | \n",
" 29736800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 1 | \n",
" 2016-10-04 | \n",
" 57.24 | \n",
" 20085900 | \n",
" MSFT | \n",
"
\n",
" \n",
" 2 | \n",
" 2016-10-04 | \n",
" 31.35 | \n",
" 18460400 | \n",
" CSCO | \n",
"
\n",
" \n",
" 0 | \n",
" 2016-10-05 | \n",
" 57.64 | \n",
" 16726400 | \n",
" MSFT | \n",
"
\n",
" \n",
" 1 | \n",
" 2016-10-05 | \n",
" 31.59 | \n",
" 11808600 | \n",
" CSCO | \n",
"
\n",
" \n",
" 2 | \n",
" 2016-10-05 | \n",
" 113.05 | \n",
" 21453100 | \n",
" AAPL | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Close Volume Symbol\n",
"0 2016-10-03 31.50 14070500 CSCO\n",
"1 2016-10-03 112.52 21701800 AAPL\n",
"2 2016-10-03 57.42 19189500 MSFT\n",
"0 2016-10-04 113.00 29736800 AAPL\n",
"1 2016-10-04 57.24 20085900 MSFT\n",
"2 2016-10-04 31.35 18460400 CSCO\n",
"0 2016-10-05 57.64 16726400 MSFT\n",
"1 2016-10-05 31.59 11808600 CSCO\n",
"2 2016-10-05 113.05 21453100 AAPL"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat((pd.read_csv(file) for file in stock_files))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, there are now duplicate values in the index. To avoid that, we can tell the `concat()` function to ignore the index and instead use the default integer index:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Close | \n",
" Volume | \n",
" Symbol | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2016-10-03 | \n",
" 31.50 | \n",
" 14070500 | \n",
" CSCO | \n",
"
\n",
" \n",
" 1 | \n",
" 2016-10-03 | \n",
" 112.52 | \n",
" 21701800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 2 | \n",
" 2016-10-03 | \n",
" 57.42 | \n",
" 19189500 | \n",
" MSFT | \n",
"
\n",
" \n",
" 3 | \n",
" 2016-10-04 | \n",
" 113.00 | \n",
" 29736800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 4 | \n",
" 2016-10-04 | \n",
" 57.24 | \n",
" 20085900 | \n",
" MSFT | \n",
"
\n",
" \n",
" 5 | \n",
" 2016-10-04 | \n",
" 31.35 | \n",
" 18460400 | \n",
" CSCO | \n",
"
\n",
" \n",
" 6 | \n",
" 2016-10-05 | \n",
" 57.64 | \n",
" 16726400 | \n",
" MSFT | \n",
"
\n",
" \n",
" 7 | \n",
" 2016-10-05 | \n",
" 31.59 | \n",
" 11808600 | \n",
" CSCO | \n",
"
\n",
" \n",
" 8 | \n",
" 2016-10-05 | \n",
" 113.05 | \n",
" 21453100 | \n",
" AAPL | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Close Volume Symbol\n",
"0 2016-10-03 31.50 14070500 CSCO\n",
"1 2016-10-03 112.52 21701800 AAPL\n",
"2 2016-10-03 57.42 19189500 MSFT\n",
"3 2016-10-04 113.00 29736800 AAPL\n",
"4 2016-10-04 57.24 20085900 MSFT\n",
"5 2016-10-04 31.35 18460400 CSCO\n",
"6 2016-10-05 57.64 16726400 MSFT\n",
"7 2016-10-05 31.59 11808600 CSCO\n",
"8 2016-10-05 113.05 21453100 AAPL"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Build a DataFrame from multiple files (column-wise)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The previous trick is useful when each file contains rows from your dataset. But what if each file instead contains columns from your dataset?\n",
"\n",
"Here's an example in which the drinks dataset has been split into two CSV files, and each file contains three columns:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" beer_servings | \n",
" spirit_servings | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Afghanistan | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" Albania | \n",
" 89 | \n",
" 132 | \n",
"
\n",
" \n",
" 2 | \n",
" Algeria | \n",
" 25 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" Andorra | \n",
" 245 | \n",
" 138 | \n",
"
\n",
" \n",
" 4 | \n",
" Angola | \n",
" 217 | \n",
" 57 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country beer_servings spirit_servings\n",
"0 Afghanistan 0 0\n",
"1 Albania 89 132\n",
"2 Algeria 25 0\n",
"3 Andorra 245 138\n",
"4 Angola 217 57"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv('data/drinks1.csv').head()"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wine_servings | \n",
" total_litres_of_pure_alcohol | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" Asia | \n",
"
\n",
" \n",
" 1 | \n",
" 54 | \n",
" 4.9 | \n",
" Europe | \n",
"
\n",
" \n",
" 2 | \n",
" 14 | \n",
" 0.7 | \n",
" Africa | \n",
"
\n",
" \n",
" 3 | \n",
" 312 | \n",
" 12.4 | \n",
" Europe | \n",
"
\n",
" \n",
" 4 | \n",
" 45 | \n",
" 5.9 | \n",
" Africa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" wine_servings total_litres_of_pure_alcohol continent\n",
"0 0 0.0 Asia\n",
"1 54 4.9 Europe\n",
"2 14 0.7 Africa\n",
"3 312 12.4 Europe\n",
"4 45 5.9 Africa"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv('data/drinks2.csv').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to the previous trick, we'll start by using `glob()`:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"drink_files = sorted(glob('data/drinks*.csv'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And this time, we'll tell the `concat()` function to concatenate along the columns axis:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" beer_servings | \n",
" spirit_servings | \n",
" wine_servings | \n",
" total_litres_of_pure_alcohol | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Afghanistan | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" Asia | \n",
"
\n",
" \n",
" 1 | \n",
" Albania | \n",
" 89 | \n",
" 132 | \n",
" 54 | \n",
" 4.9 | \n",
" Europe | \n",
"
\n",
" \n",
" 2 | \n",
" Algeria | \n",
" 25 | \n",
" 0 | \n",
" 14 | \n",
" 0.7 | \n",
" Africa | \n",
"
\n",
" \n",
" 3 | \n",
" Andorra | \n",
" 245 | \n",
" 138 | \n",
" 312 | \n",
" 12.4 | \n",
" Europe | \n",
"
\n",
" \n",
" 4 | \n",
" Angola | \n",
" 217 | \n",
" 57 | \n",
" 45 | \n",
" 5.9 | \n",
" Africa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country beer_servings spirit_servings wine_servings \\\n",
"0 Afghanistan 0 0 0 \n",
"1 Albania 89 132 54 \n",
"2 Algeria 25 0 14 \n",
"3 Andorra 245 138 312 \n",
"4 Angola 217 57 45 \n",
"\n",
" total_litres_of_pure_alcohol continent \n",
"0 0.0 Asia \n",
"1 4.9 Europe \n",
"2 0.7 Africa \n",
"3 12.4 Europe \n",
"4 5.9 Africa "
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat((pd.read_csv(file) for file in drink_files), axis='columns').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now our DataFrame has all six columns."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Create a DataFrame from the clipboard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say that you have some data stored in an Excel spreadsheet or a [Google Sheet](https://docs.google.com/spreadsheets/d/1ipv_HAykbky8OXUubs9eLL-LQ1rAkexXG61-B4jd0Rc/edit?usp=sharing), and you want to get it into a DataFrame as quickly as possible.\n",
"\n",
"Just select the data and copy it to the clipboard. Then, you can use the `read_clipboard()` function to read it into a DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Column A | \n",
" Column B | \n",
" Column C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 4.4 | \n",
" seven | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 5.5 | \n",
" eight | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 6.6 | \n",
" nine | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Column A Column B Column C\n",
"0 1 4.4 seven\n",
"1 2 5.5 eight\n",
"2 3 6.6 nine"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_clipboard()\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like the `read_csv()` function, `read_clipboard()` automatically detects the correct data type for each column:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Column A int64\n",
"Column B float64\n",
"Column C object\n",
"dtype: object"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's copy one other dataset to the clipboard:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Left | \n",
" Right | \n",
"
\n",
" \n",
" \n",
" \n",
" Alice | \n",
" 10 | \n",
" 40 | \n",
"
\n",
" \n",
" Bob | \n",
" 20 | \n",
" 50 | \n",
"
\n",
" \n",
" Charlie | \n",
" 30 | \n",
" 60 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Left Right\n",
"Alice 10 40\n",
"Bob 20 50\n",
"Charlie 30 60"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_clipboard()\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Amazingly, pandas has even identified the first column as the index:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Alice', 'Bob', 'Charlie'], dtype='object')"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Keep in mind that if you want your work to be reproducible in the future, `read_clipboard()` is not the recommended approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Split a DataFrame into two random subsets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say that you want to split a DataFrame into two parts, randomly assigning 75% of the rows to one DataFrame and the other 25% to a second DataFrame.\n",
"\n",
"For example, we have a DataFrame of movie ratings with 979 rows:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"979"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(movies)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the `sample()` method to randomly select 75% of the rows and assign them to the \"movies_1\" DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"movies_1 = movies.sample(frac=0.75, random_state=1234)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can use the `drop()` method to drop all rows that are in \"movies_1\" and assign the remaining rows to \"movies_2\":"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"movies_2 = movies.drop(movies_1.index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that the total number of rows is correct:"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"979"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(movies_1) + len(movies_2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And you can see from the index that every movie is in either \"movies_1\":"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Int64Index([ 0, 2, 5, 6, 7, 8, 9, 11, 13, 16,\n",
" ...\n",
" 966, 967, 969, 971, 972, 974, 975, 976, 977, 978],\n",
" dtype='int64', length=734)"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies_1.index.sort_values()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...or \"movies_2\":"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Int64Index([ 1, 3, 4, 10, 12, 14, 15, 18, 26, 30,\n",
" ...\n",
" 931, 934, 937, 941, 950, 954, 960, 968, 970, 973],\n",
" dtype='int64', length=245)"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies_2.index.sort_values()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Keep in mind that this approach will not work if your index values are not unique."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. Filter a DataFrame by multiple categories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the movies DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" star_rating | \n",
" title | \n",
" content_rating | \n",
" genre | \n",
" duration | \n",
" actors_list | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 9.3 | \n",
" The Shawshank Redemption | \n",
" R | \n",
" Crime | \n",
" 142 | \n",
" [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... | \n",
"
\n",
" \n",
" 1 | \n",
" 9.2 | \n",
" The Godfather | \n",
" R | \n",
" Crime | \n",
" 175 | \n",
" [u'Marlon Brando', u'Al Pacino', u'James Caan'] | \n",
"
\n",
" \n",
" 2 | \n",
" 9.1 | \n",
" The Godfather: Part II | \n",
" R | \n",
" Crime | \n",
" 200 | \n",
" [u'Al Pacino', u'Robert De Niro', u'Robert Duv... | \n",
"
\n",
" \n",
" 3 | \n",
" 9.0 | \n",
" The Dark Knight | \n",
" PG-13 | \n",
" Action | \n",
" 152 | \n",
" [u'Christian Bale', u'Heath Ledger', u'Aaron E... | \n",
"
\n",
" \n",
" 4 | \n",
" 8.9 | \n",
" Pulp Fiction | \n",
" R | \n",
" Crime | \n",
" 154 | \n",
" [u'John Travolta', u'Uma Thurman', u'Samuel L.... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" star_rating title content_rating genre duration \\\n",
"0 9.3 The Shawshank Redemption R Crime 142 \n",
"1 9.2 The Godfather R Crime 175 \n",
"2 9.1 The Godfather: Part II R Crime 200 \n",
"3 9.0 The Dark Knight PG-13 Action 152 \n",
"4 8.9 Pulp Fiction R Crime 154 \n",
"\n",
" actors_list \n",
"0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n",
"1 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n",
"2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n",
"3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... \n",
"4 [u'John Travolta', u'Uma Thurman', u'Samuel L.... "
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the columns is genre:"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography',\n",
" 'Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi',\n",
" 'History', 'Thriller', 'Family', 'Fantasy'], dtype=object)"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.genre.unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we wanted to filter the DataFrame to only show movies with the genre Action or Drama or Western, we could use multiple conditions separated by the \"or\" operator:"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" star_rating | \n",
" title | \n",
" content_rating | \n",
" genre | \n",
" duration | \n",
" actors_list | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" 9.0 | \n",
" The Dark Knight | \n",
" PG-13 | \n",
" Action | \n",
" 152 | \n",
" [u'Christian Bale', u'Heath Ledger', u'Aaron E... | \n",
"
\n",
" \n",
" 5 | \n",
" 8.9 | \n",
" 12 Angry Men | \n",
" NOT RATED | \n",
" Drama | \n",
" 96 | \n",
" [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... | \n",
"
\n",
" \n",
" 6 | \n",
" 8.9 | \n",
" The Good, the Bad and the Ugly | \n",
" NOT RATED | \n",
" Western | \n",
" 161 | \n",
" [u'Clint Eastwood', u'Eli Wallach', u'Lee Van ... | \n",
"
\n",
" \n",
" 9 | \n",
" 8.9 | \n",
" Fight Club | \n",
" R | \n",
" Drama | \n",
" 139 | \n",
" [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... | \n",
"
\n",
" \n",
" 11 | \n",
" 8.8 | \n",
" Inception | \n",
" PG-13 | \n",
" Action | \n",
" 148 | \n",
" [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" star_rating title content_rating genre \\\n",
"3 9.0 The Dark Knight PG-13 Action \n",
"5 8.9 12 Angry Men NOT RATED Drama \n",
"6 8.9 The Good, the Bad and the Ugly NOT RATED Western \n",
"9 8.9 Fight Club R Drama \n",
"11 8.8 Inception PG-13 Action \n",
"\n",
" duration actors_list \n",
"3 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E... \n",
"5 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... \n",
"6 161 [u'Clint Eastwood', u'Eli Wallach', u'Lee Van ... \n",
"9 139 [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... \n",
"11 148 [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'... "
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies[(movies.genre == 'Action') |\n",
" (movies.genre == 'Drama') |\n",
" (movies.genre == 'Western')].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, you can actually rewrite this code more clearly by using the `isin()` method and passing it a list of genres:"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" star_rating | \n",
" title | \n",
" content_rating | \n",
" genre | \n",
" duration | \n",
" actors_list | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" 9.0 | \n",
" The Dark Knight | \n",
" PG-13 | \n",
" Action | \n",
" 152 | \n",
" [u'Christian Bale', u'Heath Ledger', u'Aaron E... | \n",
"
\n",
" \n",
" 5 | \n",
" 8.9 | \n",
" 12 Angry Men | \n",
" NOT RATED | \n",
" Drama | \n",
" 96 | \n",
" [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... | \n",
"
\n",
" \n",
" 6 | \n",
" 8.9 | \n",
" The Good, the Bad and the Ugly | \n",
" NOT RATED | \n",
" Western | \n",
" 161 | \n",
" [u'Clint Eastwood', u'Eli Wallach', u'Lee Van ... | \n",
"
\n",
" \n",
" 9 | \n",
" 8.9 | \n",
" Fight Club | \n",
" R | \n",
" Drama | \n",
" 139 | \n",
" [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... | \n",
"
\n",
" \n",
" 11 | \n",
" 8.8 | \n",
" Inception | \n",
" PG-13 | \n",
" Action | \n",
" 148 | \n",
" [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" star_rating title content_rating genre \\\n",
"3 9.0 The Dark Knight PG-13 Action \n",
"5 8.9 12 Angry Men NOT RATED Drama \n",
"6 8.9 The Good, the Bad and the Ugly NOT RATED Western \n",
"9 8.9 Fight Club R Drama \n",
"11 8.8 Inception PG-13 Action \n",
"\n",
" duration actors_list \n",
"3 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E... \n",
"5 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... \n",
"6 161 [u'Clint Eastwood', u'Eli Wallach', u'Lee Van ... \n",
"9 139 [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... \n",
"11 148 [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'... "
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies[movies.genre.isin(['Action', 'Drama', 'Western'])].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And if you want to reverse this filter, so that you are excluding (rather than including) those three genres, you can put a tilde in front of the condition:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" star_rating | \n",
" title | \n",
" content_rating | \n",
" genre | \n",
" duration | \n",
" actors_list | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 9.3 | \n",
" The Shawshank Redemption | \n",
" R | \n",
" Crime | \n",
" 142 | \n",
" [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... | \n",
"
\n",
" \n",
" 1 | \n",
" 9.2 | \n",
" The Godfather | \n",
" R | \n",
" Crime | \n",
" 175 | \n",
" [u'Marlon Brando', u'Al Pacino', u'James Caan'] | \n",
"
\n",
" \n",
" 2 | \n",
" 9.1 | \n",
" The Godfather: Part II | \n",
" R | \n",
" Crime | \n",
" 200 | \n",
" [u'Al Pacino', u'Robert De Niro', u'Robert Duv... | \n",
"
\n",
" \n",
" 4 | \n",
" 8.9 | \n",
" Pulp Fiction | \n",
" R | \n",
" Crime | \n",
" 154 | \n",
" [u'John Travolta', u'Uma Thurman', u'Samuel L.... | \n",
"
\n",
" \n",
" 7 | \n",
" 8.9 | \n",
" The Lord of the Rings: The Return of the King | \n",
" PG-13 | \n",
" Adventure | \n",
" 201 | \n",
" [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" star_rating title content_rating \\\n",
"0 9.3 The Shawshank Redemption R \n",
"1 9.2 The Godfather R \n",
"2 9.1 The Godfather: Part II R \n",
"4 8.9 Pulp Fiction R \n",
"7 8.9 The Lord of the Rings: The Return of the King PG-13 \n",
"\n",
" genre duration actors_list \n",
"0 Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n",
"1 Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n",
"2 Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n",
"4 Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L.... \n",
"7 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... "
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies[~movies.genre.isin(['Action', 'Drama', 'Western'])].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This works because tilde is the \"not\" operator in Python."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Filter a DataFrame by largest categories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say that you needed to filter the movies DataFrame by genre, but only include the 3 largest genres.\n",
"\n",
"We'll start by taking the `value_counts()` of genre and saving it as a Series called counts:"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Drama 278\n",
"Comedy 156\n",
"Action 136\n",
"Crime 124\n",
"Biography 77\n",
"Adventure 75\n",
"Animation 62\n",
"Horror 29\n",
"Mystery 16\n",
"Western 9\n",
"Sci-Fi 5\n",
"Thriller 5\n",
"Film-Noir 3\n",
"Family 2\n",
"Fantasy 1\n",
"History 1\n",
"Name: genre, dtype: int64"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts = movies.genre.value_counts()\n",
"counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Series method `nlargest()` makes it easy to select the 3 largest values in this Series:"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Drama 278\n",
"Comedy 156\n",
"Action 136\n",
"Name: genre, dtype: int64"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts.nlargest(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And all we actually need from this Series is the index:"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Drama', 'Comedy', 'Action'], dtype='object')"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts.nlargest(3).index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can pass the index object to `isin()`, and it will be treated like a list of genres:"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" star_rating | \n",
" title | \n",
" content_rating | \n",
" genre | \n",
" duration | \n",
" actors_list | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" 9.0 | \n",
" The Dark Knight | \n",
" PG-13 | \n",
" Action | \n",
" 152 | \n",
" [u'Christian Bale', u'Heath Ledger', u'Aaron E... | \n",
"
\n",
" \n",
" 5 | \n",
" 8.9 | \n",
" 12 Angry Men | \n",
" NOT RATED | \n",
" Drama | \n",
" 96 | \n",
" [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... | \n",
"
\n",
" \n",
" 9 | \n",
" 8.9 | \n",
" Fight Club | \n",
" R | \n",
" Drama | \n",
" 139 | \n",
" [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... | \n",
"
\n",
" \n",
" 11 | \n",
" 8.8 | \n",
" Inception | \n",
" PG-13 | \n",
" Action | \n",
" 148 | \n",
" [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'... | \n",
"
\n",
" \n",
" 12 | \n",
" 8.8 | \n",
" Star Wars: Episode V - The Empire Strikes Back | \n",
" PG | \n",
" Action | \n",
" 124 | \n",
" [u'Mark Hamill', u'Harrison Ford', u'Carrie Fi... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" star_rating title \\\n",
"3 9.0 The Dark Knight \n",
"5 8.9 12 Angry Men \n",
"9 8.9 Fight Club \n",
"11 8.8 Inception \n",
"12 8.8 Star Wars: Episode V - The Empire Strikes Back \n",
"\n",
" content_rating genre duration \\\n",
"3 PG-13 Action 152 \n",
"5 NOT RATED Drama 96 \n",
"9 R Drama 139 \n",
"11 PG-13 Action 148 \n",
"12 PG Action 124 \n",
"\n",
" actors_list \n",
"3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... \n",
"5 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... \n",
"9 [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... \n",
"11 [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'... \n",
"12 [u'Mark Hamill', u'Harrison Ford', u'Carrie Fi... "
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies[movies.genre.isin(counts.nlargest(3).index)].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thus, only Drama and Comedy and Action movies remain in the DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Handle missing values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at a dataset of UFO sightings:"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" City | \n",
" Colors Reported | \n",
" Shape Reported | \n",
" State | \n",
" Time | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Ithaca | \n",
" NaN | \n",
" TRIANGLE | \n",
" NY | \n",
" 1930-06-01 22:00:00 | \n",
"
\n",
" \n",
" 1 | \n",
" Willingboro | \n",
" NaN | \n",
" OTHER | \n",
" NJ | \n",
" 1930-06-30 20:00:00 | \n",
"
\n",
" \n",
" 2 | \n",
" Holyoke | \n",
" NaN | \n",
" OVAL | \n",
" CO | \n",
" 1931-02-15 14:00:00 | \n",
"
\n",
" \n",
" 3 | \n",
" Abilene | \n",
" NaN | \n",
" DISK | \n",
" KS | \n",
" 1931-06-01 13:00:00 | \n",
"
\n",
" \n",
" 4 | \n",
" New York Worlds Fair | \n",
" NaN | \n",
" LIGHT | \n",
" NY | \n",
" 1933-04-18 19:00:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" City Colors Reported Shape Reported State \\\n",
"0 Ithaca NaN TRIANGLE NY \n",
"1 Willingboro NaN OTHER NJ \n",
"2 Holyoke NaN OVAL CO \n",
"3 Abilene NaN DISK KS \n",
"4 New York Worlds Fair NaN LIGHT NY \n",
"\n",
" Time \n",
"0 1930-06-01 22:00:00 \n",
"1 1930-06-30 20:00:00 \n",
"2 1931-02-15 14:00:00 \n",
"3 1931-06-01 13:00:00 \n",
"4 1933-04-18 19:00:00 "
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ufo.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll notice that some of the values are missing.\n",
"\n",
"To find out how many values are missing in each column, you can use the `isna()` method and then take the `sum()`:"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"City 25\n",
"Colors Reported 15359\n",
"Shape Reported 2644\n",
"State 0\n",
"Time 0\n",
"dtype: int64"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ufo.isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`isna()` generated a DataFrame of True and False values, and `sum()` converted all of the True values to 1 and added them up.\n",
"\n",
"Similarly, you can find out the percentage of values that are missing by taking the `mean()` of `isna()`:"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"City 0.001371\n",
"Colors Reported 0.842004\n",
"Shape Reported 0.144948\n",
"State 0.000000\n",
"Time 0.000000\n",
"dtype: float64"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ufo.isna().mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to drop the columns that have any missing values, you can use the `dropna()` method:"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" State | \n",
" Time | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NY | \n",
" 1930-06-01 22:00:00 | \n",
"
\n",
" \n",
" 1 | \n",
" NJ | \n",
" 1930-06-30 20:00:00 | \n",
"
\n",
" \n",
" 2 | \n",
" CO | \n",
" 1931-02-15 14:00:00 | \n",
"
\n",
" \n",
" 3 | \n",
" KS | \n",
" 1931-06-01 13:00:00 | \n",
"
\n",
" \n",
" 4 | \n",
" NY | \n",
" 1933-04-18 19:00:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" State Time\n",
"0 NY 1930-06-01 22:00:00\n",
"1 NJ 1930-06-30 20:00:00\n",
"2 CO 1931-02-15 14:00:00\n",
"3 KS 1931-06-01 13:00:00\n",
"4 NY 1933-04-18 19:00:00"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ufo.dropna(axis='columns').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or if you want to drop columns in which more than 10% of the values are missing, you can set a threshold for `dropna()`:"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" City | \n",
" State | \n",
" Time | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Ithaca | \n",
" NY | \n",
" 1930-06-01 22:00:00 | \n",
"
\n",
" \n",
" 1 | \n",
" Willingboro | \n",
" NJ | \n",
" 1930-06-30 20:00:00 | \n",
"
\n",
" \n",
" 2 | \n",
" Holyoke | \n",
" CO | \n",
" 1931-02-15 14:00:00 | \n",
"
\n",
" \n",
" 3 | \n",
" Abilene | \n",
" KS | \n",
" 1931-06-01 13:00:00 | \n",
"
\n",
" \n",
" 4 | \n",
" New York Worlds Fair | \n",
" NY | \n",
" 1933-04-18 19:00:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" City State Time\n",
"0 Ithaca NY 1930-06-01 22:00:00\n",
"1 Willingboro NJ 1930-06-30 20:00:00\n",
"2 Holyoke CO 1931-02-15 14:00:00\n",
"3 Abilene KS 1931-06-01 13:00:00\n",
"4 New York Worlds Fair NY 1933-04-18 19:00:00"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ufo.dropna(thresh=len(ufo)*0.9, axis='columns').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`len(ufo)` returns the total number of rows, and then we multiply that by 0.9 to tell pandas to only keep columns in which at least 90% of the values are not missing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 16. Split a string into multiple columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create another example DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" location | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" John Arthur Doe | \n",
" Los Angeles, CA | \n",
"
\n",
" \n",
" 1 | \n",
" Jane Ann Smith | \n",
" Washington, DC | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name location\n",
"0 John Arthur Doe Los Angeles, CA\n",
"1 Jane Ann Smith Washington, DC"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({'name':['John Arthur Doe', 'Jane Ann Smith'],\n",
" 'location':['Los Angeles, CA', 'Washington, DC']})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if we wanted to split the \"name\" column into three separate columns, for first, middle, and last name? We would use the `str.split()` method and tell it to split on a space character and expand the results into a DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" John | \n",
" Arthur | \n",
" Doe | \n",
"
\n",
" \n",
" 1 | \n",
" Jane | \n",
" Ann | \n",
" Smith | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 John Arthur Doe\n",
"1 Jane Ann Smith"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.name.str.split(' ', expand=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These three columns can actually be saved to the original DataFrame in a single assignment statement:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" location | \n",
" first | \n",
" middle | \n",
" last | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" John Arthur Doe | \n",
" Los Angeles, CA | \n",
" John | \n",
" Arthur | \n",
" Doe | \n",
"
\n",
" \n",
" 1 | \n",
" Jane Ann Smith | \n",
" Washington, DC | \n",
" Jane | \n",
" Ann | \n",
" Smith | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name location first middle last\n",
"0 John Arthur Doe Los Angeles, CA John Arthur Doe\n",
"1 Jane Ann Smith Washington, DC Jane Ann Smith"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['first', 'middle', 'last']] = df.name.str.split(' ', expand=True)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if we wanted to split a string, but only keep one of the resulting columns? For example, let's split the location column on \"comma space\":"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Los Angeles | \n",
" CA | \n",
"
\n",
" \n",
" 1 | \n",
" Washington | \n",
" DC | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1\n",
"0 Los Angeles CA\n",
"1 Washington DC"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.location.str.split(', ', expand=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we only cared about saving the city name in column 0, we can just select that column and save it to the DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" location | \n",
" first | \n",
" middle | \n",
" last | \n",
" city | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" John Arthur Doe | \n",
" Los Angeles, CA | \n",
" John | \n",
" Arthur | \n",
" Doe | \n",
" Los Angeles | \n",
"
\n",
" \n",
" 1 | \n",
" Jane Ann Smith | \n",
" Washington, DC | \n",
" Jane | \n",
" Ann | \n",
" Smith | \n",
" Washington | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name location first middle last city\n",
"0 John Arthur Doe Los Angeles, CA John Arthur Doe Los Angeles\n",
"1 Jane Ann Smith Washington, DC Jane Ann Smith Washington"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['city'] = df.location.str.split(', ', expand=True)[0]\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 17. Expand a Series of lists into a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create another example DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col_one | \n",
" col_two | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" a | \n",
" [10, 40] | \n",
"
\n",
" \n",
" 1 | \n",
" b | \n",
" [20, 50] | \n",
"
\n",
" \n",
" 2 | \n",
" c | \n",
" [30, 60] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col_one col_two\n",
"0 a [10, 40]\n",
"1 b [20, 50]\n",
"2 c [30, 60]"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({'col_one':['a', 'b', 'c'], 'col_two':[[10, 40], [20, 50], [30, 60]]})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two columns, and the second column contains regular Python lists of integers.\n",
"\n",
"If we wanted to expand the second column into its own DataFrame, we can use the `apply()` method on that column and pass it the Series constructor:"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10 | \n",
" 40 | \n",
"
\n",
" \n",
" 1 | \n",
" 20 | \n",
" 50 | \n",
"
\n",
" \n",
" 2 | \n",
" 30 | \n",
" 60 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1\n",
"0 10 40\n",
"1 20 50\n",
"2 30 60"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_new = df.col_two.apply(pd.Series)\n",
"df_new"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And by using the `concat()` function, you can combine the original DataFrame with the new DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col_one | \n",
" col_two | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" a | \n",
" [10, 40] | \n",
" 10 | \n",
" 40 | \n",
"
\n",
" \n",
" 1 | \n",
" b | \n",
" [20, 50] | \n",
" 20 | \n",
" 50 | \n",
"
\n",
" \n",
" 2 | \n",
" c | \n",
" [30, 60] | \n",
" 30 | \n",
" 60 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col_one col_two 0 1\n",
"0 a [10, 40] 10 40\n",
"1 b [20, 50] 20 50\n",
"2 c [30, 60] 30 60"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df, df_new], axis='columns')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 18. Aggregate by multiple functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at a DataFrame of orders from the Chipotle restaurant chain:"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" order_id | \n",
" quantity | \n",
" item_name | \n",
" choice_description | \n",
" item_price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" Chips and Fresh Tomato Salsa | \n",
" NaN | \n",
" 2.39 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" Izze | \n",
" [Clementine] | \n",
" 3.39 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Nantucket Nectar | \n",
" [Apple] | \n",
" 3.39 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" Chips and Tomatillo-Green Chili Salsa | \n",
" NaN | \n",
" 2.39 | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 2 | \n",
" Chicken Bowl | \n",
" [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | \n",
" 16.98 | \n",
"
\n",
" \n",
" 5 | \n",
" 3 | \n",
" 1 | \n",
" Chicken Bowl | \n",
" [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | \n",
" 10.98 | \n",
"
\n",
" \n",
" 6 | \n",
" 3 | \n",
" 1 | \n",
" Side of Chips | \n",
" NaN | \n",
" 1.69 | \n",
"
\n",
" \n",
" 7 | \n",
" 4 | \n",
" 1 | \n",
" Steak Burrito | \n",
" [Tomatillo Red Chili Salsa, [Fajita Vegetables... | \n",
" 11.75 | \n",
"
\n",
" \n",
" 8 | \n",
" 4 | \n",
" 1 | \n",
" Steak Soft Tacos | \n",
" [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | \n",
" 9.25 | \n",
"
\n",
" \n",
" 9 | \n",
" 5 | \n",
" 1 | \n",
" Steak Burrito | \n",
" [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | \n",
" 9.25 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" order_id quantity item_name \\\n",
"0 1 1 Chips and Fresh Tomato Salsa \n",
"1 1 1 Izze \n",
"2 1 1 Nantucket Nectar \n",
"3 1 1 Chips and Tomatillo-Green Chili Salsa \n",
"4 2 2 Chicken Bowl \n",
"5 3 1 Chicken Bowl \n",
"6 3 1 Side of Chips \n",
"7 4 1 Steak Burrito \n",
"8 4 1 Steak Soft Tacos \n",
"9 5 1 Steak Burrito \n",
"\n",
" choice_description item_price \n",
"0 NaN 2.39 \n",
"1 [Clementine] 3.39 \n",
"2 [Apple] 3.39 \n",
"3 NaN 2.39 \n",
"4 [Tomatillo-Red Chili Salsa (Hot), [Black Beans... 16.98 \n",
"5 [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... 10.98 \n",
"6 NaN 1.69 \n",
"7 [Tomatillo Red Chili Salsa, [Fajita Vegetables... 11.75 \n",
"8 [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... 9.25 \n",
"9 [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... 9.25 "
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"orders.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each order has an order_id and consists of one or more rows. To figure out the total price of an order, you sum the item_price for that order_id. For example, here's the total price of order number 1:"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11.56"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"orders[orders.order_id == 1].item_price.sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you wanted to calculate the total price of every order, you would `groupby()` order_id and then take the sum of item_price for each group:"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"order_id\n",
"1 11.56\n",
"2 16.98\n",
"3 12.67\n",
"4 21.00\n",
"5 13.70\n",
"Name: item_price, dtype: float64"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"orders.groupby('order_id').item_price.sum().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, you're not actually limited to aggregating by a single function such as `sum()`. To aggregate by multiple functions, you use the `agg()` method and pass it a list of functions such as `sum()` and `count()`:"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sum | \n",
" count | \n",
"
\n",
" \n",
" order_id | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 11.56 | \n",
" 4 | \n",
"
\n",
" \n",
" 2 | \n",
" 16.98 | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" 12.67 | \n",
" 2 | \n",
"
\n",
" \n",
" 4 | \n",
" 21.00 | \n",
" 2 | \n",
"
\n",
" \n",
" 5 | \n",
" 13.70 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sum count\n",
"order_id \n",
"1 11.56 4\n",
"2 16.98 1\n",
"3 12.67 2\n",
"4 21.00 2\n",
"5 13.70 2"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"orders.groupby('order_id').item_price.agg(['sum', 'count']).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That gives us the total price of each order as well as the number of items in each order."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 19. Combine the output of an aggregation with a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take another look at the orders DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" order_id | \n",
" quantity | \n",
" item_name | \n",
" choice_description | \n",
" item_price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" Chips and Fresh Tomato Salsa | \n",
" NaN | \n",
" 2.39 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" Izze | \n",
" [Clementine] | \n",
" 3.39 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Nantucket Nectar | \n",
" [Apple] | \n",
" 3.39 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" Chips and Tomatillo-Green Chili Salsa | \n",
" NaN | \n",
" 2.39 | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 2 | \n",
" Chicken Bowl | \n",
" [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | \n",
" 16.98 | \n",
"
\n",
" \n",
" 5 | \n",
" 3 | \n",
" 1 | \n",
" Chicken Bowl | \n",
" [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | \n",
" 10.98 | \n",
"
\n",
" \n",
" 6 | \n",
" 3 | \n",
" 1 | \n",
" Side of Chips | \n",
" NaN | \n",
" 1.69 | \n",
"
\n",
" \n",
" 7 | \n",
" 4 | \n",
" 1 | \n",
" Steak Burrito | \n",
" [Tomatillo Red Chili Salsa, [Fajita Vegetables... | \n",
" 11.75 | \n",
"
\n",
" \n",
" 8 | \n",
" 4 | \n",
" 1 | \n",
" Steak Soft Tacos | \n",
" [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | \n",
" 9.25 | \n",
"
\n",
" \n",
" 9 | \n",
" 5 | \n",
" 1 | \n",
" Steak Burrito | \n",
" [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | \n",
" 9.25 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" order_id quantity item_name \\\n",
"0 1 1 Chips and Fresh Tomato Salsa \n",
"1 1 1 Izze \n",
"2 1 1 Nantucket Nectar \n",
"3 1 1 Chips and Tomatillo-Green Chili Salsa \n",
"4 2 2 Chicken Bowl \n",
"5 3 1 Chicken Bowl \n",
"6 3 1 Side of Chips \n",
"7 4 1 Steak Burrito \n",
"8 4 1 Steak Soft Tacos \n",
"9 5 1 Steak Burrito \n",
"\n",
" choice_description item_price \n",
"0 NaN 2.39 \n",
"1 [Clementine] 3.39 \n",
"2 [Apple] 3.39 \n",
"3 NaN 2.39 \n",
"4 [Tomatillo-Red Chili Salsa (Hot), [Black Beans... 16.98 \n",
"5 [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... 10.98 \n",
"6 NaN 1.69 \n",
"7 [Tomatillo Red Chili Salsa, [Fajita Vegetables... 11.75 \n",
"8 [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... 9.25 \n",
"9 [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... 9.25 "
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"orders.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if we wanted to create a new column listing the total price of each order? Recall that we calculated the total price using the `sum()` method:"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"order_id\n",
"1 11.56\n",
"2 16.98\n",
"3 12.67\n",
"4 21.00\n",
"5 13.70\n",
"Name: item_price, dtype: float64"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"orders.groupby('order_id').item_price.sum().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`sum()` is an aggregation function, which means that it returns a reduced version of the input data.\n",
"\n",
"In other words, the output of the `sum()` function:"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1834"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(orders.groupby('order_id').item_price.sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...is smaller than the input to the function:"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"4622"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(orders.item_price)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The solution is to use the `transform()` method, which performs the same calculation but returns output data that is the same shape as the input data:"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4622"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"total_price = orders.groupby('order_id').item_price.transform('sum')\n",
"len(total_price)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll store the results in a new DataFrame column called total_price:"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" order_id | \n",
" quantity | \n",
" item_name | \n",
" choice_description | \n",
" item_price | \n",
" total_price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" Chips and Fresh Tomato Salsa | \n",
" NaN | \n",
" 2.39 | \n",
" 11.56 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" Izze | \n",
" [Clementine] | \n",
" 3.39 | \n",
" 11.56 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Nantucket Nectar | \n",
" [Apple] | \n",
" 3.39 | \n",
" 11.56 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" Chips and Tomatillo-Green Chili Salsa | \n",
" NaN | \n",
" 2.39 | \n",
" 11.56 | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 2 | \n",
" Chicken Bowl | \n",
" [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | \n",
" 16.98 | \n",
" 16.98 | \n",
"
\n",
" \n",
" 5 | \n",
" 3 | \n",
" 1 | \n",
" Chicken Bowl | \n",
" [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | \n",
" 10.98 | \n",
" 12.67 | \n",
"
\n",
" \n",
" 6 | \n",
" 3 | \n",
" 1 | \n",
" Side of Chips | \n",
" NaN | \n",
" 1.69 | \n",
" 12.67 | \n",
"
\n",
" \n",
" 7 | \n",
" 4 | \n",
" 1 | \n",
" Steak Burrito | \n",
" [Tomatillo Red Chili Salsa, [Fajita Vegetables... | \n",
" 11.75 | \n",
" 21.00 | \n",
"
\n",
" \n",
" 8 | \n",
" 4 | \n",
" 1 | \n",
" Steak Soft Tacos | \n",
" [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | \n",
" 9.25 | \n",
" 21.00 | \n",
"
\n",
" \n",
" 9 | \n",
" 5 | \n",
" 1 | \n",
" Steak Burrito | \n",
" [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | \n",
" 9.25 | \n",
" 13.70 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" order_id quantity item_name \\\n",
"0 1 1 Chips and Fresh Tomato Salsa \n",
"1 1 1 Izze \n",
"2 1 1 Nantucket Nectar \n",
"3 1 1 Chips and Tomatillo-Green Chili Salsa \n",
"4 2 2 Chicken Bowl \n",
"5 3 1 Chicken Bowl \n",
"6 3 1 Side of Chips \n",
"7 4 1 Steak Burrito \n",
"8 4 1 Steak Soft Tacos \n",
"9 5 1 Steak Burrito \n",
"\n",
" choice_description item_price total_price \n",
"0 NaN 2.39 11.56 \n",
"1 [Clementine] 3.39 11.56 \n",
"2 [Apple] 3.39 11.56 \n",
"3 NaN 2.39 11.56 \n",
"4 [Tomatillo-Red Chili Salsa (Hot), [Black Beans... 16.98 16.98 \n",
"5 [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... 10.98 12.67 \n",
"6 NaN 1.69 12.67 \n",
"7 [Tomatillo Red Chili Salsa, [Fajita Vegetables... 11.75 21.00 \n",
"8 [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... 9.25 21.00 \n",
"9 [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... 9.25 13.70 "
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"orders['total_price'] = total_price\n",
"orders.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the total price of each order is now listed on every single line.\n",
"\n",
"That makes it easy to calculate the percentage of the total order price that each line represents:"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" order_id | \n",
" quantity | \n",
" item_name | \n",
" choice_description | \n",
" item_price | \n",
" total_price | \n",
" percent_of_total | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" Chips and Fresh Tomato Salsa | \n",
" NaN | \n",
" 2.39 | \n",
" 11.56 | \n",
" 0.206747 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" Izze | \n",
" [Clementine] | \n",
" 3.39 | \n",
" 11.56 | \n",
" 0.293253 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Nantucket Nectar | \n",
" [Apple] | \n",
" 3.39 | \n",
" 11.56 | \n",
" 0.293253 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" Chips and Tomatillo-Green Chili Salsa | \n",
" NaN | \n",
" 2.39 | \n",
" 11.56 | \n",
" 0.206747 | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 2 | \n",
" Chicken Bowl | \n",
" [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | \n",
" 16.98 | \n",
" 16.98 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 5 | \n",
" 3 | \n",
" 1 | \n",
" Chicken Bowl | \n",
" [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | \n",
" 10.98 | \n",
" 12.67 | \n",
" 0.866614 | \n",
"
\n",
" \n",
" 6 | \n",
" 3 | \n",
" 1 | \n",
" Side of Chips | \n",
" NaN | \n",
" 1.69 | \n",
" 12.67 | \n",
" 0.133386 | \n",
"
\n",
" \n",
" 7 | \n",
" 4 | \n",
" 1 | \n",
" Steak Burrito | \n",
" [Tomatillo Red Chili Salsa, [Fajita Vegetables... | \n",
" 11.75 | \n",
" 21.00 | \n",
" 0.559524 | \n",
"
\n",
" \n",
" 8 | \n",
" 4 | \n",
" 1 | \n",
" Steak Soft Tacos | \n",
" [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | \n",
" 9.25 | \n",
" 21.00 | \n",
" 0.440476 | \n",
"
\n",
" \n",
" 9 | \n",
" 5 | \n",
" 1 | \n",
" Steak Burrito | \n",
" [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | \n",
" 9.25 | \n",
" 13.70 | \n",
" 0.675182 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" order_id quantity item_name \\\n",
"0 1 1 Chips and Fresh Tomato Salsa \n",
"1 1 1 Izze \n",
"2 1 1 Nantucket Nectar \n",
"3 1 1 Chips and Tomatillo-Green Chili Salsa \n",
"4 2 2 Chicken Bowl \n",
"5 3 1 Chicken Bowl \n",
"6 3 1 Side of Chips \n",
"7 4 1 Steak Burrito \n",
"8 4 1 Steak Soft Tacos \n",
"9 5 1 Steak Burrito \n",
"\n",
" choice_description item_price total_price \\\n",
"0 NaN 2.39 11.56 \n",
"1 [Clementine] 3.39 11.56 \n",
"2 [Apple] 3.39 11.56 \n",
"3 NaN 2.39 11.56 \n",
"4 [Tomatillo-Red Chili Salsa (Hot), [Black Beans... 16.98 16.98 \n",
"5 [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... 10.98 12.67 \n",
"6 NaN 1.69 12.67 \n",
"7 [Tomatillo Red Chili Salsa, [Fajita Vegetables... 11.75 21.00 \n",
"8 [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... 9.25 21.00 \n",
"9 [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... 9.25 13.70 \n",
"\n",
" percent_of_total \n",
"0 0.206747 \n",
"1 0.293253 \n",
"2 0.293253 \n",
"3 0.206747 \n",
"4 1.000000 \n",
"5 0.866614 \n",
"6 0.133386 \n",
"7 0.559524 \n",
"8 0.440476 \n",
"9 0.675182 "
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"orders['percent_of_total'] = orders.item_price / orders.total_price\n",
"orders.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 20. Select a slice of rows and columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at another dataset:"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PassengerId | \n",
" Survived | \n",
" Pclass | \n",
" Name | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Ticket | \n",
" Fare | \n",
" Cabin | \n",
" Embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" Braund, Mr. Owen Harris | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" A/5 21171 | \n",
" 7.2500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Cumings, Mrs. John Bradley (Florence Briggs Th... | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" PC 17599 | \n",
" 71.2833 | \n",
" C85 | \n",
" C | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" Heikkinen, Miss. Laina | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" STON/O2. 3101282 | \n",
" 7.9250 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" Futrelle, Mrs. Jacques Heath (Lily May Peel) | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 113803 | \n",
" 53.1000 | \n",
" C123 | \n",
" S | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 0 | \n",
" 3 | \n",
" Allen, Mr. William Henry | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 373450 | \n",
" 8.0500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the famous Titanic dataset, which shows information about passengers on the Titanic and whether or not they survived.\n",
"\n",
"If you wanted a numerical summary of the dataset, you would use the `describe()` method:"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PassengerId | \n",
" Survived | \n",
" Pclass | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Fare | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 891.000000 | \n",
" 891.000000 | \n",
" 891.000000 | \n",
" 714.000000 | \n",
" 891.000000 | \n",
" 891.000000 | \n",
" 891.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 446.000000 | \n",
" 0.383838 | \n",
" 2.308642 | \n",
" 29.699118 | \n",
" 0.523008 | \n",
" 0.381594 | \n",
" 32.204208 | \n",
"
\n",
" \n",
" std | \n",
" 257.353842 | \n",
" 0.486592 | \n",
" 0.836071 | \n",
" 14.526497 | \n",
" 1.102743 | \n",
" 0.806057 | \n",
" 49.693429 | \n",
"
\n",
" \n",
" min | \n",
" 1.000000 | \n",
" 0.000000 | \n",
" 1.000000 | \n",
" 0.420000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 223.500000 | \n",
" 0.000000 | \n",
" 2.000000 | \n",
" 20.125000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 7.910400 | \n",
"
\n",
" \n",
" 50% | \n",
" 446.000000 | \n",
" 0.000000 | \n",
" 3.000000 | \n",
" 28.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 14.454200 | \n",
"
\n",
" \n",
" 75% | \n",
" 668.500000 | \n",
" 1.000000 | \n",
" 3.000000 | \n",
" 38.000000 | \n",
" 1.000000 | \n",
" 0.000000 | \n",
" 31.000000 | \n",
"
\n",
" \n",
" max | \n",
" 891.000000 | \n",
" 1.000000 | \n",
" 3.000000 | \n",
" 80.000000 | \n",
" 8.000000 | \n",
" 6.000000 | \n",
" 512.329200 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PassengerId Survived Pclass Age SibSp \\\n",
"count 891.000000 891.000000 891.000000 714.000000 891.000000 \n",
"mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n",
"std 257.353842 0.486592 0.836071 14.526497 1.102743 \n",
"min 1.000000 0.000000 1.000000 0.420000 0.000000 \n",
"25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n",
"50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n",
"75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n",
"max 891.000000 1.000000 3.000000 80.000000 8.000000 \n",
"\n",
" Parch Fare \n",
"count 891.000000 891.000000 \n",
"mean 0.381594 32.204208 \n",
"std 0.806057 49.693429 \n",
"min 0.000000 0.000000 \n",
"25% 0.000000 7.910400 \n",
"50% 0.000000 14.454200 \n",
"75% 0.000000 31.000000 \n",
"max 6.000000 512.329200 "
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, the resulting DataFrame might be displaying more information than you need.\n",
"\n",
"If you wanted to filter it to only show the \"five-number summary\", you can use the `loc` accessor and pass it a slice of the \"min\" through the \"max\" row labels:"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PassengerId | \n",
" Survived | \n",
" Pclass | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Fare | \n",
"
\n",
" \n",
" \n",
" \n",
" min | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.420 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0000 | \n",
"
\n",
" \n",
" 25% | \n",
" 223.5 | \n",
" 0.0 | \n",
" 2.0 | \n",
" 20.125 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 7.9104 | \n",
"
\n",
" \n",
" 50% | \n",
" 446.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
" 28.000 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 14.4542 | \n",
"
\n",
" \n",
" 75% | \n",
" 668.5 | \n",
" 1.0 | \n",
" 3.0 | \n",
" 38.000 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 31.0000 | \n",
"
\n",
" \n",
" max | \n",
" 891.0 | \n",
" 1.0 | \n",
" 3.0 | \n",
" 80.000 | \n",
" 8.0 | \n",
" 6.0 | \n",
" 512.3292 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PassengerId Survived Pclass Age SibSp Parch Fare\n",
"min 1.0 0.0 1.0 0.420 0.0 0.0 0.0000\n",
"25% 223.5 0.0 2.0 20.125 0.0 0.0 7.9104\n",
"50% 446.0 0.0 3.0 28.000 0.0 0.0 14.4542\n",
"75% 668.5 1.0 3.0 38.000 1.0 0.0 31.0000\n",
"max 891.0 1.0 3.0 80.000 8.0 6.0 512.3292"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.describe().loc['min':'max']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And if you're not interested in all of the columns, you can also pass it a slice of column labels:"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Pclass | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
"
\n",
" \n",
" \n",
" \n",
" min | \n",
" 1.0 | \n",
" 0.420 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 25% | \n",
" 2.0 | \n",
" 20.125 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 50% | \n",
" 3.0 | \n",
" 28.000 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 75% | \n",
" 3.0 | \n",
" 38.000 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" max | \n",
" 3.0 | \n",
" 80.000 | \n",
" 8.0 | \n",
" 6.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Pclass Age SibSp Parch\n",
"min 1.0 0.420 0.0 0.0\n",
"25% 2.0 20.125 0.0 0.0\n",
"50% 3.0 28.000 0.0 0.0\n",
"75% 3.0 38.000 1.0 0.0\n",
"max 3.0 80.000 8.0 6.0"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.describe().loc['min':'max', 'Pclass':'Parch']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 21. Reshape a MultiIndexed Series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Titanic dataset has a \"Survived\" column made up of ones and zeros, so you can calculate the overall survival rate by taking a mean of that column:"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.3838383838383838"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.Survived.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you wanted to calculate the survival rate by a single category such as \"Sex\", you would use a `groupby()`:"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Sex\n",
"female 0.742038\n",
"male 0.188908\n",
"Name: Survived, dtype: float64"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.groupby('Sex').Survived.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And if you wanted to calculate the survival rate across two different categories at once, you would `groupby()` both of those categories:"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Sex Pclass\n",
"female 1 0.968085\n",
" 2 0.921053\n",
" 3 0.500000\n",
"male 1 0.368852\n",
" 2 0.157407\n",
" 3 0.135447\n",
"Name: Survived, dtype: float64"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.groupby(['Sex', 'Pclass']).Survived.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This shows the survival rate for every combination of Sex and Passenger Class. It's stored as a MultiIndexed Series, meaning that it has multiple index levels to the left of the actual data.\n",
"\n",
"It can be hard to read and interact with data in this format, so it's often more convenient to reshape a MultiIndexed Series into a DataFrame by using the `unstack()` method:"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" Pclass | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Sex | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" female | \n",
" 0.968085 | \n",
" 0.921053 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" male | \n",
" 0.368852 | \n",
" 0.157407 | \n",
" 0.135447 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Pclass 1 2 3\n",
"Sex \n",
"female 0.968085 0.921053 0.500000\n",
"male 0.368852 0.157407 0.135447"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.groupby(['Sex', 'Pclass']).Survived.mean().unstack()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This DataFrame contains the same exact data as the MultiIndexed Series, except that now you can interact with it using familiar DataFrame methods."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 22. Create a pivot table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you often create DataFrames like the one above, you might find it more convenient to use the `pivot_table()` method instead:"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" Pclass | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Sex | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" female | \n",
" 0.968085 | \n",
" 0.921053 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" male | \n",
" 0.368852 | \n",
" 0.157407 | \n",
" 0.135447 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Pclass 1 2 3\n",
"Sex \n",
"female 0.968085 0.921053 0.500000\n",
"male 0.368852 0.157407 0.135447"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a pivot table, you directly specify the index, the columns, the values, and the aggregation function.\n",
"\n",
"An added benefit of a pivot table is that you can easily add row and column totals by setting `margins=True`:"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" Pclass | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" All | \n",
"
\n",
" \n",
" Sex | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" female | \n",
" 0.968085 | \n",
" 0.921053 | \n",
" 0.500000 | \n",
" 0.742038 | \n",
"
\n",
" \n",
" male | \n",
" 0.368852 | \n",
" 0.157407 | \n",
" 0.135447 | \n",
" 0.188908 | \n",
"
\n",
" \n",
" All | \n",
" 0.629630 | \n",
" 0.472826 | \n",
" 0.242363 | \n",
" 0.383838 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Pclass 1 2 3 All\n",
"Sex \n",
"female 0.968085 0.921053 0.500000 0.742038\n",
"male 0.368852 0.157407 0.135447 0.188908\n",
"All 0.629630 0.472826 0.242363 0.383838"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean',\n",
" margins=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This shows the overall survival rate as well as the survival rate by Sex and Passenger Class.\n",
"\n",
"Finally, you can create a cross-tabulation just by changing the aggregation function from \"mean\" to \"count\":"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" Pclass | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" All | \n",
"
\n",
" \n",
" Sex | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" female | \n",
" 94 | \n",
" 76 | \n",
" 144 | \n",
" 314 | \n",
"
\n",
" \n",
" male | \n",
" 122 | \n",
" 108 | \n",
" 347 | \n",
" 577 | \n",
"
\n",
" \n",
" All | \n",
" 216 | \n",
" 184 | \n",
" 491 | \n",
" 891 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Pclass 1 2 3 All\n",
"Sex \n",
"female 94 76 144 314\n",
"male 122 108 347 577\n",
"All 216 184 491 891"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='count',\n",
" margins=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This shows the number of records that appear in each combination of categories."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 23. Convert continuous data into categorical data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the Age column from the Titanic dataset:"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 22.0\n",
"1 38.0\n",
"2 26.0\n",
"3 35.0\n",
"4 35.0\n",
"5 NaN\n",
"6 54.0\n",
"7 2.0\n",
"8 27.0\n",
"9 14.0\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.Age.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's currently continuous data, but what if you wanted to convert it into categorical data?\n",
"\n",
"One solution would be to label the age ranges, such as \"child\", \"young adult\", and \"adult\". The best way to do this is by using the `cut()` function:"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 young adult\n",
"1 adult\n",
"2 adult\n",
"3 adult\n",
"4 adult\n",
"5 NaN\n",
"6 adult\n",
"7 child\n",
"8 adult\n",
"9 child\n",
"Name: Age, dtype: category\n",
"Categories (3, object): [child < young adult < adult]"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.cut(titanic.Age, bins=[0, 18, 25, 99], labels=['child', 'young adult', 'adult']).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This assigned each value to a bin with a label. Ages 0 to 18 were assigned the label \"child\", ages 18 to 25 were assigned the label \"young adult\", and ages 25 to 99 were assigned the label \"adult\".\n",
"\n",
"Notice that the data type is now \"category\", and the categories are automatically ordered."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 24. Change display options"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take another look at the Titanic dataset:"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PassengerId | \n",
" Survived | \n",
" Pclass | \n",
" Name | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Ticket | \n",
" Fare | \n",
" Cabin | \n",
" Embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" Braund, Mr. Owen Harris | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" A/5 21171 | \n",
" 7.2500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Cumings, Mrs. John Bradley (Florence Briggs Th... | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" PC 17599 | \n",
" 71.2833 | \n",
" C85 | \n",
" C | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" Heikkinen, Miss. Laina | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" STON/O2. 3101282 | \n",
" 7.9250 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" Futrelle, Mrs. Jacques Heath (Lily May Peel) | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 113803 | \n",
" 53.1000 | \n",
" C123 | \n",
" S | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 0 | \n",
" 3 | \n",
" Allen, Mr. William Henry | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 373450 | \n",
" 8.0500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the Age column has 1 decimal place and the Fare column has 4 decimal places. What if you wanted to standardize the display to use 2 decimal places?\n",
"\n",
"You can use the `set_option()` function:"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.float_format', '{:.2f}'.format)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first argument is the name of the option, and the second argument is a Python format string."
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PassengerId | \n",
" Survived | \n",
" Pclass | \n",
" Name | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Ticket | \n",
" Fare | \n",
" Cabin | \n",
" Embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" Braund, Mr. Owen Harris | \n",
" male | \n",
" 22.00 | \n",
" 1 | \n",
" 0 | \n",
" A/5 21171 | \n",
" 7.25 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Cumings, Mrs. John Bradley (Florence Briggs Th... | \n",
" female | \n",
" 38.00 | \n",
" 1 | \n",
" 0 | \n",
" PC 17599 | \n",
" 71.28 | \n",
" C85 | \n",
" C | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" Heikkinen, Miss. Laina | \n",
" female | \n",
" 26.00 | \n",
" 0 | \n",
" 0 | \n",
" STON/O2. 3101282 | \n",
" 7.92 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" Futrelle, Mrs. Jacques Heath (Lily May Peel) | \n",
" female | \n",
" 35.00 | \n",
" 1 | \n",
" 0 | \n",
" 113803 | \n",
" 53.10 | \n",
" C123 | \n",
" S | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 0 | \n",
" 3 | \n",
" Allen, Mr. William Henry | \n",
" male | \n",
" 35.00 | \n",
" 0 | \n",
" 0 | \n",
" 373450 | \n",
" 8.05 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.00 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.00 1 \n",
"2 Heikkinen, Miss. Laina female 26.00 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.00 1 \n",
"4 Allen, Mr. William Henry male 35.00 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.25 NaN S \n",
"1 0 PC 17599 71.28 C85 C \n",
"2 0 STON/O2. 3101282 7.92 NaN S \n",
"3 0 113803 53.10 C123 S \n",
"4 0 373450 8.05 NaN S "
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that Age and Fare are now using 2 decimal places. Note that this did not change the underlying data, only the display of the data.\n",
"\n",
"You can also reset any option back to its default:"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [],
"source": [
"pd.reset_option('display.float_format')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many more options you can specify is a similar way."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 25. Style a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The previous trick is useful if you want to change the display of your entire notebook. However, a more flexible and powerful approach is to define the style of a particular DataFrame.\n",
"\n",
"Let's return to the stocks DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Close | \n",
" Volume | \n",
" Symbol | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2016-10-03 | \n",
" 31.50 | \n",
" 14070500 | \n",
" CSCO | \n",
"
\n",
" \n",
" 1 | \n",
" 2016-10-03 | \n",
" 112.52 | \n",
" 21701800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 2 | \n",
" 2016-10-03 | \n",
" 57.42 | \n",
" 19189500 | \n",
" MSFT | \n",
"
\n",
" \n",
" 3 | \n",
" 2016-10-04 | \n",
" 113.00 | \n",
" 29736800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 4 | \n",
" 2016-10-04 | \n",
" 57.24 | \n",
" 20085900 | \n",
" MSFT | \n",
"
\n",
" \n",
" 5 | \n",
" 2016-10-04 | \n",
" 31.35 | \n",
" 18460400 | \n",
" CSCO | \n",
"
\n",
" \n",
" 6 | \n",
" 2016-10-05 | \n",
" 57.64 | \n",
" 16726400 | \n",
" MSFT | \n",
"
\n",
" \n",
" 7 | \n",
" 2016-10-05 | \n",
" 31.59 | \n",
" 11808600 | \n",
" CSCO | \n",
"
\n",
" \n",
" 8 | \n",
" 2016-10-05 | \n",
" 113.05 | \n",
" 21453100 | \n",
" AAPL | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Close Volume Symbol\n",
"0 2016-10-03 31.50 14070500 CSCO\n",
"1 2016-10-03 112.52 21701800 AAPL\n",
"2 2016-10-03 57.42 19189500 MSFT\n",
"3 2016-10-04 113.00 29736800 AAPL\n",
"4 2016-10-04 57.24 20085900 MSFT\n",
"5 2016-10-04 31.35 18460400 CSCO\n",
"6 2016-10-05 57.64 16726400 MSFT\n",
"7 2016-10-05 31.59 11808600 CSCO\n",
"8 2016-10-05 113.05 21453100 AAPL"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stocks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can create a dictionary of format strings that specifies how each column should be formatted:"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [],
"source": [
"format_dict = {'Date':'{:%m/%d/%y}', 'Close':'${:.2f}', 'Volume':'{:,}'}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then we can pass it to the DataFrame's `style.format()` method:"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
" | Date | Close | Volume | Symbol |
\n",
" \n",
" 0 | \n",
" 10/03/16 | \n",
" $31.50 | \n",
" 14,070,500 | \n",
" CSCO | \n",
"
\n",
" \n",
" 1 | \n",
" 10/03/16 | \n",
" $112.52 | \n",
" 21,701,800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 2 | \n",
" 10/03/16 | \n",
" $57.42 | \n",
" 19,189,500 | \n",
" MSFT | \n",
"
\n",
" \n",
" 3 | \n",
" 10/04/16 | \n",
" $113.00 | \n",
" 29,736,800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 4 | \n",
" 10/04/16 | \n",
" $57.24 | \n",
" 20,085,900 | \n",
" MSFT | \n",
"
\n",
" \n",
" 5 | \n",
" 10/04/16 | \n",
" $31.35 | \n",
" 18,460,400 | \n",
" CSCO | \n",
"
\n",
" \n",
" 6 | \n",
" 10/05/16 | \n",
" $57.64 | \n",
" 16,726,400 | \n",
" MSFT | \n",
"
\n",
" \n",
" 7 | \n",
" 10/05/16 | \n",
" $31.59 | \n",
" 11,808,600 | \n",
" CSCO | \n",
"
\n",
" \n",
" 8 | \n",
" 10/05/16 | \n",
" $113.05 | \n",
" 21,453,100 | \n",
" AAPL | \n",
"
\n",
"
"
],
"text/plain": [
""
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stocks.style.format(format_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the Date is now in month-day-year format, the closing price has a dollar sign, and the Volume has commas.\n",
"\n",
"We can apply more styling by chaining additional methods:"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
" Date | Close | Volume | Symbol |
\n",
" \n",
" 10/03/16 | \n",
" $31.50 | \n",
" 14,070,500 | \n",
" CSCO | \n",
"
\n",
" \n",
" 10/03/16 | \n",
" $112.52 | \n",
" 21,701,800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 10/03/16 | \n",
" $57.42 | \n",
" 19,189,500 | \n",
" MSFT | \n",
"
\n",
" \n",
" 10/04/16 | \n",
" $113.00 | \n",
" 29,736,800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 10/04/16 | \n",
" $57.24 | \n",
" 20,085,900 | \n",
" MSFT | \n",
"
\n",
" \n",
" 10/04/16 | \n",
" $31.35 | \n",
" 18,460,400 | \n",
" CSCO | \n",
"
\n",
" \n",
" 10/05/16 | \n",
" $57.64 | \n",
" 16,726,400 | \n",
" MSFT | \n",
"
\n",
" \n",
" 10/05/16 | \n",
" $31.59 | \n",
" 11,808,600 | \n",
" CSCO | \n",
"
\n",
" \n",
" 10/05/16 | \n",
" $113.05 | \n",
" 21,453,100 | \n",
" AAPL | \n",
"
\n",
"
"
],
"text/plain": [
""
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(stocks.style.format(format_dict)\n",
" .hide_index()\n",
" .highlight_min('Close', color='red')\n",
" .highlight_max('Close', color='lightgreen')\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've now hidden the index, highlighted the minimum Close value in red, and highlighted the maximum Close value in green.\n",
"\n",
"Here's another example of DataFrame styling:"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
" Date | Close | Volume | Symbol |
\n",
" \n",
" 10/03/16 | \n",
" $31.50 | \n",
" 14,070,500 | \n",
" CSCO | \n",
"
\n",
" \n",
" 10/03/16 | \n",
" $112.52 | \n",
" 21,701,800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 10/03/16 | \n",
" $57.42 | \n",
" 19,189,500 | \n",
" MSFT | \n",
"
\n",
" \n",
" 10/04/16 | \n",
" $113.00 | \n",
" 29,736,800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 10/04/16 | \n",
" $57.24 | \n",
" 20,085,900 | \n",
" MSFT | \n",
"
\n",
" \n",
" 10/04/16 | \n",
" $31.35 | \n",
" 18,460,400 | \n",
" CSCO | \n",
"
\n",
" \n",
" 10/05/16 | \n",
" $57.64 | \n",
" 16,726,400 | \n",
" MSFT | \n",
"
\n",
" \n",
" 10/05/16 | \n",
" $31.59 | \n",
" 11,808,600 | \n",
" CSCO | \n",
"
\n",
" \n",
" 10/05/16 | \n",
" $113.05 | \n",
" 21,453,100 | \n",
" AAPL | \n",
"
\n",
"
"
],
"text/plain": [
""
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(stocks.style.format(format_dict)\n",
" .hide_index()\n",
" .background_gradient(subset='Volume', cmap='Blues')\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Volume column now has a background gradient to help you easily identify high and low values.\n",
"\n",
"And here's one final example:"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Stock Prices from October 2016 Date | Close | Volume | Symbol |
\n",
" \n",
" 10/03/16 | \n",
" $31.50 | \n",
" 14,070,500 | \n",
" CSCO | \n",
"
\n",
" \n",
" 10/03/16 | \n",
" $112.52 | \n",
" 21,701,800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 10/03/16 | \n",
" $57.42 | \n",
" 19,189,500 | \n",
" MSFT | \n",
"
\n",
" \n",
" 10/04/16 | \n",
" $113.00 | \n",
" 29,736,800 | \n",
" AAPL | \n",
"
\n",
" \n",
" 10/04/16 | \n",
" $57.24 | \n",
" 20,085,900 | \n",
" MSFT | \n",
"
\n",
" \n",
" 10/04/16 | \n",
" $31.35 | \n",
" 18,460,400 | \n",
" CSCO | \n",
"
\n",
" \n",
" 10/05/16 | \n",
" $57.64 | \n",
" 16,726,400 | \n",
" MSFT | \n",
"
\n",
" \n",
" 10/05/16 | \n",
" $31.59 | \n",
" 11,808,600 | \n",
" CSCO | \n",
"
\n",
" \n",
" 10/05/16 | \n",
" $113.05 | \n",
" 21,453,100 | \n",
" AAPL | \n",
"
\n",
"
"
],
"text/plain": [
""
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(stocks.style.format(format_dict)\n",
" .hide_index()\n",
" .bar('Volume', color='lightblue', align='zero')\n",
" .set_caption('Stock Prices from October 2016')\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's now a bar chart within the Volume column and a caption above the DataFrame.\n",
"\n",
"Note that there are many more options for how you can style your DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bonus: Profile a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say that you've got a new dataset, and you want to quickly explore it without too much work. There's a separate package called [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling) that is designed for this purpose.\n",
"\n",
"First you have to install it using conda or pip. Once that's done, you import `pandas_profiling`:"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [],
"source": [
"import pandas_profiling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, simply run the `ProfileReport()` function and pass it any DataFrame. It returns an interactive HTML report:\n",
"\n",
"- The first section is an overview of the dataset and a list of possible issues with the data.\n",
"- The next section gives a summary of each column. You can click \"toggle details\" for even more information.\n",
"- The third section shows a heatmap of the correlation between columns.\n",
"- And the fourth section shows the head of the dataset."
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
"\n",
"\n",
"
\n",
"
Overview
\n",
" \n",
"
\n",
"
\n",
"
Dataset info
\n",
"
\n",
" \n",
" \n",
" Number of variables | \n",
" 12 | \n",
"
\n",
" \n",
" Number of observations | \n",
" 891 | \n",
"
\n",
" \n",
" Total Missing (%) | \n",
" 8.1% | \n",
"
\n",
" \n",
" Total size in memory | \n",
" 83.6 KiB | \n",
"
\n",
" \n",
" Average record size in memory | \n",
" 96.1 B | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
Variables types
\n",
"
\n",
" \n",
" \n",
" Numeric | \n",
" 6 | \n",
"
\n",
" \n",
" Categorical | \n",
" 4 | \n",
"
\n",
" \n",
" Boolean | \n",
" 1 | \n",
"
\n",
" \n",
" Date | \n",
" 0 | \n",
"
\n",
" \n",
" Text (Unique) | \n",
" 1 | \n",
"
\n",
" \n",
" Rejected | \n",
" 0 | \n",
"
\n",
" \n",
" Unsupported | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
"
Warnings
\n",
"
Age
has 177 / 19.9% missing values MissingCabin
has 687 / 77.1% missing values MissingCabin
has a high cardinality: 148 distinct values WarningFare
has 15 / 1.7% zeros ZerosParch
has 678 / 76.1% zeros ZerosSibSp
has 608 / 68.2% zeros ZerosTicket
has a high cardinality: 681 distinct values Warning
\n",
"
\n",
"
\n",
"
\n",
"
Variables
\n",
" \n",
"
\n",
"
\n",
"
Age
\n",
" Numeric\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" Distinct count | \n",
" 89 | \n",
"
\n",
" \n",
" Unique (%) | \n",
" 10.0% | \n",
"
\n",
" \n",
" Missing (%) | \n",
" 19.9% | \n",
"
\n",
" \n",
" Missing (n) | \n",
" 177 | \n",
"
\n",
" \n",
" Infinite (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Infinite (n) | \n",
" 0 | \n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"\n",
" \n",
" Mean | \n",
" 29.699 | \n",
"
\n",
" \n",
" Minimum | \n",
" 0.42 | \n",
"
\n",
" \n",
" Maximum | \n",
" 80 | \n",
"
\n",
" \n",
" Zeros (%) | \n",
" 0.0% | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
Quantile statistics
\n",
"
\n",
" \n",
" Minimum | \n",
" 0.42 | \n",
"
\n",
" \n",
" 5-th percentile | \n",
" 4 | \n",
"
\n",
" \n",
" Q1 | \n",
" 20.125 | \n",
"
\n",
" \n",
" Median | \n",
" 28 | \n",
"
\n",
" \n",
" Q3 | \n",
" 38 | \n",
"
\n",
" \n",
" 95-th percentile | \n",
" 56 | \n",
"
\n",
" \n",
" Maximum | \n",
" 80 | \n",
"
\n",
" \n",
" Range | \n",
" 79.58 | \n",
"
\n",
" \n",
" Interquartile range | \n",
" 17.875 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Descriptive statistics
\n",
"
\n",
" \n",
" Standard deviation | \n",
" 14.526 | \n",
"
\n",
" \n",
" Coef of variation | \n",
" 0.48912 | \n",
"
\n",
" \n",
" Kurtosis | \n",
" 0.17827 | \n",
"
\n",
" \n",
" Mean | \n",
" 29.699 | \n",
"
\n",
" \n",
" MAD | \n",
" 11.323 | \n",
"
\n",
" \n",
" Skewness | \n",
" 0.38911 | \n",
"
\n",
" \n",
" Sum | \n",
" 21205 | \n",
"
\n",
" \n",
" Variance | \n",
" 211.02 | \n",
"
\n",
" \n",
" Memory size | \n",
" 7.0 KiB | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 24.0 | \n",
" 30 | \n",
" 3.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 22.0 | \n",
" 27 | \n",
" 3.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 18.0 | \n",
" 26 | \n",
" 2.9% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 28.0 | \n",
" 25 | \n",
" 2.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 19.0 | \n",
" 25 | \n",
" 2.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 30.0 | \n",
" 25 | \n",
" 2.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 21.0 | \n",
" 24 | \n",
" 2.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 25.0 | \n",
" 23 | \n",
" 2.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 36.0 | \n",
" 22 | \n",
" 2.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 29.0 | \n",
" 20 | \n",
" 2.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" Other values (78) | \n",
" 467 | \n",
" 52.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" (Missing) | \n",
" 177 | \n",
" 19.9% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Minimum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 0.42 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 0.67 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 0.75 | \n",
" 2 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 0.83 | \n",
" 2 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 0.92 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
Maximum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 70.0 | \n",
" 2 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 70.5 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 71.0 | \n",
" 2 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 74.0 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 80.0 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Cabin
\n",
" Categorical\n",
"
\n",
"
\n",
"
\n",
" \n",
" Distinct count | \n",
" 148 | \n",
"
\n",
" \n",
" Unique (%) | \n",
" 16.6% | \n",
"
\n",
" \n",
" Missing (%) | \n",
" 77.1% | \n",
"
\n",
" \n",
" Missing (n) | \n",
" 687 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" G6 | \n",
" \n",
" \n",
" \n",
" \n",
" 4\n",
" | \n",
"
\n",
" C23 C25 C27 | \n",
" \n",
" \n",
" \n",
" \n",
" 4\n",
" | \n",
"
\n",
" B96 B98 | \n",
" \n",
" \n",
" \n",
" \n",
" 4\n",
" | \n",
"
\n",
" Other values (144) | \n",
" \n",
" \n",
" 192\n",
" \n",
" \n",
" | \n",
"
\n",
" (Missing) | \n",
" \n",
" \n",
" 687\n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" G6 | \n",
" 4 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" C23 C25 C27 | \n",
" 4 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" B96 B98 | \n",
" 4 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" D | \n",
" 3 | \n",
" 0.3% | \n",
" \n",
" \n",
" | \n",
"
\n",
" F2 | \n",
" 3 | \n",
" 0.3% | \n",
" \n",
" \n",
" | \n",
"
\n",
" F33 | \n",
" 3 | \n",
" 0.3% | \n",
" \n",
" \n",
" | \n",
"
\n",
" E101 | \n",
" 3 | \n",
" 0.3% | \n",
" \n",
" \n",
" | \n",
"
\n",
" C22 C26 | \n",
" 3 | \n",
" 0.3% | \n",
" \n",
" \n",
" | \n",
"
\n",
" C124 | \n",
" 2 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" D35 | \n",
" 2 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" Other values (137) | \n",
" 173 | \n",
" 19.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" (Missing) | \n",
" 687 | \n",
" 77.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Embarked
\n",
" Categorical\n",
"
\n",
"
\n",
"
\n",
" \n",
" Distinct count | \n",
" 4 | \n",
"
\n",
" \n",
" Unique (%) | \n",
" 0.4% | \n",
"
\n",
" \n",
" Missing (%) | \n",
" 0.2% | \n",
"
\n",
" \n",
" Missing (n) | \n",
" 2 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" S | \n",
" \n",
" \n",
" 644\n",
" \n",
" \n",
" | \n",
"
\n",
" C | \n",
" \n",
" \n",
" 168\n",
" \n",
" \n",
" | \n",
"
\n",
" Q | \n",
" \n",
" \n",
" \n",
" \n",
" 77\n",
" | \n",
"
\n",
" (Missing) | \n",
" \n",
" \n",
" \n",
" \n",
" 2\n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" S | \n",
" 644 | \n",
" 72.3% | \n",
" \n",
" \n",
" | \n",
"
\n",
" C | \n",
" 168 | \n",
" 18.9% | \n",
" \n",
" \n",
" | \n",
"
\n",
" Q | \n",
" 77 | \n",
" 8.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
" (Missing) | \n",
" 2 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Fare
\n",
" Numeric\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" Distinct count | \n",
" 248 | \n",
"
\n",
" \n",
" Unique (%) | \n",
" 27.8% | \n",
"
\n",
" \n",
" Missing (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Missing (n) | \n",
" 0 | \n",
"
\n",
" \n",
" Infinite (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Infinite (n) | \n",
" 0 | \n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"\n",
" \n",
" Mean | \n",
" 32.204 | \n",
"
\n",
" \n",
" Minimum | \n",
" 0 | \n",
"
\n",
" \n",
" Maximum | \n",
" 512.33 | \n",
"
\n",
" \n",
" Zeros (%) | \n",
" 1.7% | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
Quantile statistics
\n",
"
\n",
" \n",
" Minimum | \n",
" 0 | \n",
"
\n",
" \n",
" 5-th percentile | \n",
" 7.225 | \n",
"
\n",
" \n",
" Q1 | \n",
" 7.9104 | \n",
"
\n",
" \n",
" Median | \n",
" 14.454 | \n",
"
\n",
" \n",
" Q3 | \n",
" 31 | \n",
"
\n",
" \n",
" 95-th percentile | \n",
" 112.08 | \n",
"
\n",
" \n",
" Maximum | \n",
" 512.33 | \n",
"
\n",
" \n",
" Range | \n",
" 512.33 | \n",
"
\n",
" \n",
" Interquartile range | \n",
" 23.09 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Descriptive statistics
\n",
"
\n",
" \n",
" Standard deviation | \n",
" 49.693 | \n",
"
\n",
" \n",
" Coef of variation | \n",
" 1.5431 | \n",
"
\n",
" \n",
" Kurtosis | \n",
" 33.398 | \n",
"
\n",
" \n",
" Mean | \n",
" 32.204 | \n",
"
\n",
" \n",
" MAD | \n",
" 28.164 | \n",
"
\n",
" \n",
" Skewness | \n",
" 4.7873 | \n",
"
\n",
" \n",
" Sum | \n",
" 28694 | \n",
"
\n",
" \n",
" Variance | \n",
" 2469.4 | \n",
"
\n",
" \n",
" Memory size | \n",
" 7.0 KiB | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 8.05 | \n",
" 43 | \n",
" 4.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 13.0 | \n",
" 42 | \n",
" 4.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 7.8958 | \n",
" 38 | \n",
" 4.3% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 7.75 | \n",
" 34 | \n",
" 3.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 26.0 | \n",
" 31 | \n",
" 3.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 10.5 | \n",
" 24 | \n",
" 2.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 7.925 | \n",
" 18 | \n",
" 2.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 7.775 | \n",
" 16 | \n",
" 1.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 26.55 | \n",
" 15 | \n",
" 1.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 0.0 | \n",
" 15 | \n",
" 1.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
" Other values (238) | \n",
" 615 | \n",
" 69.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Minimum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 0.0 | \n",
" 15 | \n",
" 1.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 4.0125 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 5.0 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 6.2375 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 6.4375 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
Maximum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 227.525 | \n",
" 4 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 247.5208 | \n",
" 2 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 262.375 | \n",
" 2 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 263.0 | \n",
" 4 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 512.3292 | \n",
" 3 | \n",
" 0.3% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Name
\n",
" Categorical, Unique\n",
"
\n",
"
\n",
" \n",
" \n",
" First 3 values | \n",
"
\n",
" \n",
" \n",
" \n",
" Mockler, Miss. Helen Mary \"Ellie\" | \n",
"
\n",
" \n",
" Baclini, Miss. Eugenie | \n",
"
\n",
" \n",
" Mayne, Mlle. Berthe Antonine (\"Mrs de Villiers\") | \n",
"
\n",
" \n",
"
\n",
"
\n",
" \n",
" \n",
" Last 3 values | \n",
"
\n",
" \n",
" \n",
" \n",
" Hoyt, Mrs. Frederick Maxfield (Jane Anne Forby) | \n",
"
\n",
" \n",
" Gustafsson, Mr. Karl Gideon | \n",
"
\n",
" \n",
" Dowdell, Miss. Elizabeth | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
First 10 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" Abbing, Mr. Anthony | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" Abbott, Mr. Rossmore Edward | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" Abbott, Mrs. Stanton (Rosa Hunt) | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" Abelson, Mr. Samuel | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" Abelson, Mrs. Samuel (Hannah Wizosky) | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
Last 10 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" de Mulder, Mr. Theodore | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" de Pelsmaeker, Mr. Alfons | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" del Carlo, Mr. Sebastiano | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" van Billiard, Mr. Austin Blyler | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" van Melkebeke, Mr. Philemon | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Parch
\n",
" Numeric\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" Distinct count | \n",
" 7 | \n",
"
\n",
" \n",
" Unique (%) | \n",
" 0.8% | \n",
"
\n",
" \n",
" Missing (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Missing (n) | \n",
" 0 | \n",
"
\n",
" \n",
" Infinite (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Infinite (n) | \n",
" 0 | \n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"\n",
" \n",
" Mean | \n",
" 0.38159 | \n",
"
\n",
" \n",
" Minimum | \n",
" 0 | \n",
"
\n",
" \n",
" Maximum | \n",
" 6 | \n",
"
\n",
" \n",
" Zeros (%) | \n",
" 76.1% | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
Quantile statistics
\n",
"
\n",
" \n",
" Minimum | \n",
" 0 | \n",
"
\n",
" \n",
" 5-th percentile | \n",
" 0 | \n",
"
\n",
" \n",
" Q1 | \n",
" 0 | \n",
"
\n",
" \n",
" Median | \n",
" 0 | \n",
"
\n",
" \n",
" Q3 | \n",
" 0 | \n",
"
\n",
" \n",
" 95-th percentile | \n",
" 2 | \n",
"
\n",
" \n",
" Maximum | \n",
" 6 | \n",
"
\n",
" \n",
" Range | \n",
" 6 | \n",
"
\n",
" \n",
" Interquartile range | \n",
" 0 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Descriptive statistics
\n",
"
\n",
" \n",
" Standard deviation | \n",
" 0.80606 | \n",
"
\n",
" \n",
" Coef of variation | \n",
" 2.1123 | \n",
"
\n",
" \n",
" Kurtosis | \n",
" 9.7781 | \n",
"
\n",
" \n",
" Mean | \n",
" 0.38159 | \n",
"
\n",
" \n",
" MAD | \n",
" 0.58074 | \n",
"
\n",
" \n",
" Skewness | \n",
" 2.7491 | \n",
"
\n",
" \n",
" Sum | \n",
" 340 | \n",
"
\n",
" \n",
" Variance | \n",
" 0.64973 | \n",
"
\n",
" \n",
" Memory size | \n",
" 7.0 KiB | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 0 | \n",
" 678 | \n",
" 76.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 1 | \n",
" 118 | \n",
" 13.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 2 | \n",
" 80 | \n",
" 9.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 5 | \n",
" 5 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 3 | \n",
" 5 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 4 | \n",
" 4 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 6 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Minimum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 0 | \n",
" 678 | \n",
" 76.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 1 | \n",
" 118 | \n",
" 13.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 2 | \n",
" 80 | \n",
" 9.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 3 | \n",
" 5 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 4 | \n",
" 4 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
Maximum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 2 | \n",
" 80 | \n",
" 9.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 3 | \n",
" 5 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 4 | \n",
" 4 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 5 | \n",
" 5 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 6 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
PassengerId
\n",
" Numeric\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" Distinct count | \n",
" 891 | \n",
"
\n",
" \n",
" Unique (%) | \n",
" 100.0% | \n",
"
\n",
" \n",
" Missing (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Missing (n) | \n",
" 0 | \n",
"
\n",
" \n",
" Infinite (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Infinite (n) | \n",
" 0 | \n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"\n",
" \n",
" Mean | \n",
" 446 | \n",
"
\n",
" \n",
" Minimum | \n",
" 1 | \n",
"
\n",
" \n",
" Maximum | \n",
" 891 | \n",
"
\n",
" \n",
" Zeros (%) | \n",
" 0.0% | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
Quantile statistics
\n",
"
\n",
" \n",
" Minimum | \n",
" 1 | \n",
"
\n",
" \n",
" 5-th percentile | \n",
" 45.5 | \n",
"
\n",
" \n",
" Q1 | \n",
" 223.5 | \n",
"
\n",
" \n",
" Median | \n",
" 446 | \n",
"
\n",
" \n",
" Q3 | \n",
" 668.5 | \n",
"
\n",
" \n",
" 95-th percentile | \n",
" 846.5 | \n",
"
\n",
" \n",
" Maximum | \n",
" 891 | \n",
"
\n",
" \n",
" Range | \n",
" 890 | \n",
"
\n",
" \n",
" Interquartile range | \n",
" 445 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Descriptive statistics
\n",
"
\n",
" \n",
" Standard deviation | \n",
" 257.35 | \n",
"
\n",
" \n",
" Coef of variation | \n",
" 0.57703 | \n",
"
\n",
" \n",
" Kurtosis | \n",
" -1.2 | \n",
"
\n",
" \n",
" Mean | \n",
" 446 | \n",
"
\n",
" \n",
" MAD | \n",
" 222.75 | \n",
"
\n",
" \n",
" Skewness | \n",
" 0 | \n",
"
\n",
" \n",
" Sum | \n",
" 397386 | \n",
"
\n",
" \n",
" Variance | \n",
" 66231 | \n",
"
\n",
" \n",
" Memory size | \n",
" 7.0 KiB | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 891 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 293 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 304 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 303 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 302 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 301 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 300 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 299 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 298 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 297 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" Other values (881) | \n",
" 881 | \n",
" 98.9% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Minimum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 1 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 2 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 3 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 4 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 5 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
Maximum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 887 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 888 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 889 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 890 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 891 | \n",
" 1 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Pclass
\n",
" Numeric\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" Distinct count | \n",
" 3 | \n",
"
\n",
" \n",
" Unique (%) | \n",
" 0.3% | \n",
"
\n",
" \n",
" Missing (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Missing (n) | \n",
" 0 | \n",
"
\n",
" \n",
" Infinite (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Infinite (n) | \n",
" 0 | \n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"\n",
" \n",
" Mean | \n",
" 2.3086 | \n",
"
\n",
" \n",
" Minimum | \n",
" 1 | \n",
"
\n",
" \n",
" Maximum | \n",
" 3 | \n",
"
\n",
" \n",
" Zeros (%) | \n",
" 0.0% | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
Quantile statistics
\n",
"
\n",
" \n",
" Minimum | \n",
" 1 | \n",
"
\n",
" \n",
" 5-th percentile | \n",
" 1 | \n",
"
\n",
" \n",
" Q1 | \n",
" 2 | \n",
"
\n",
" \n",
" Median | \n",
" 3 | \n",
"
\n",
" \n",
" Q3 | \n",
" 3 | \n",
"
\n",
" \n",
" 95-th percentile | \n",
" 3 | \n",
"
\n",
" \n",
" Maximum | \n",
" 3 | \n",
"
\n",
" \n",
" Range | \n",
" 2 | \n",
"
\n",
" \n",
" Interquartile range | \n",
" 1 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Descriptive statistics
\n",
"
\n",
" \n",
" Standard deviation | \n",
" 0.83607 | \n",
"
\n",
" \n",
" Coef of variation | \n",
" 0.36215 | \n",
"
\n",
" \n",
" Kurtosis | \n",
" -1.28 | \n",
"
\n",
" \n",
" Mean | \n",
" 2.3086 | \n",
"
\n",
" \n",
" MAD | \n",
" 0.76197 | \n",
"
\n",
" \n",
" Skewness | \n",
" -0.63055 | \n",
"
\n",
" \n",
" Sum | \n",
" 2057 | \n",
"
\n",
" \n",
" Variance | \n",
" 0.69902 | \n",
"
\n",
" \n",
" Memory size | \n",
" 7.0 KiB | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 3 | \n",
" 491 | \n",
" 55.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 1 | \n",
" 216 | \n",
" 24.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 2 | \n",
" 184 | \n",
" 20.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Minimum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 1 | \n",
" 216 | \n",
" 24.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 2 | \n",
" 184 | \n",
" 20.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 3 | \n",
" 491 | \n",
" 55.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
Maximum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 1 | \n",
" 216 | \n",
" 24.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 2 | \n",
" 184 | \n",
" 20.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 3 | \n",
" 491 | \n",
" 55.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Sex
\n",
" Categorical\n",
"
\n",
"
\n",
"
\n",
" \n",
" Distinct count | \n",
" 2 | \n",
"
\n",
" \n",
" Unique (%) | \n",
" 0.2% | \n",
"
\n",
" \n",
" Missing (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Missing (n) | \n",
" 0 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" male | \n",
" \n",
" \n",
" 577\n",
" \n",
" \n",
" | \n",
"
\n",
" female | \n",
" \n",
" \n",
" 314\n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" male | \n",
" 577 | \n",
" 64.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" female | \n",
" 314 | \n",
" 35.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
SibSp
\n",
" Numeric\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" Distinct count | \n",
" 7 | \n",
"
\n",
" \n",
" Unique (%) | \n",
" 0.8% | \n",
"
\n",
" \n",
" Missing (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Missing (n) | \n",
" 0 | \n",
"
\n",
" \n",
" Infinite (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Infinite (n) | \n",
" 0 | \n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"\n",
" \n",
" Mean | \n",
" 0.52301 | \n",
"
\n",
" \n",
" Minimum | \n",
" 0 | \n",
"
\n",
" \n",
" Maximum | \n",
" 8 | \n",
"
\n",
" \n",
" Zeros (%) | \n",
" 68.2% | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
Quantile statistics
\n",
"
\n",
" \n",
" Minimum | \n",
" 0 | \n",
"
\n",
" \n",
" 5-th percentile | \n",
" 0 | \n",
"
\n",
" \n",
" Q1 | \n",
" 0 | \n",
"
\n",
" \n",
" Median | \n",
" 0 | \n",
"
\n",
" \n",
" Q3 | \n",
" 1 | \n",
"
\n",
" \n",
" 95-th percentile | \n",
" 3 | \n",
"
\n",
" \n",
" Maximum | \n",
" 8 | \n",
"
\n",
" \n",
" Range | \n",
" 8 | \n",
"
\n",
" \n",
" Interquartile range | \n",
" 1 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Descriptive statistics
\n",
"
\n",
" \n",
" Standard deviation | \n",
" 1.1027 | \n",
"
\n",
" \n",
" Coef of variation | \n",
" 2.1085 | \n",
"
\n",
" \n",
" Kurtosis | \n",
" 17.88 | \n",
"
\n",
" \n",
" Mean | \n",
" 0.52301 | \n",
"
\n",
" \n",
" MAD | \n",
" 0.71378 | \n",
"
\n",
" \n",
" Skewness | \n",
" 3.6954 | \n",
"
\n",
" \n",
" Sum | \n",
" 466 | \n",
"
\n",
" \n",
" Variance | \n",
" 1.216 | \n",
"
\n",
" \n",
" Memory size | \n",
" 7.0 KiB | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 0 | \n",
" 608 | \n",
" 68.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 1 | \n",
" 209 | \n",
" 23.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 2 | \n",
" 28 | \n",
" 3.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 4 | \n",
" 18 | \n",
" 2.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 3 | \n",
" 16 | \n",
" 1.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 8 | \n",
" 7 | \n",
" 0.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 5 | \n",
" 5 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Minimum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 0 | \n",
" 608 | \n",
" 68.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 1 | \n",
" 209 | \n",
" 23.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 2 | \n",
" 28 | \n",
" 3.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 3 | \n",
" 16 | \n",
" 1.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 4 | \n",
" 18 | \n",
" 2.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
Maximum 5 values
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 2 | \n",
" 28 | \n",
" 3.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 3 | \n",
" 16 | \n",
" 1.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 4 | \n",
" 18 | \n",
" 2.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 5 | \n",
" 5 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 8 | \n",
" 7 | \n",
" 0.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Survived
\n",
" Boolean\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" Distinct count | \n",
" 2 | \n",
"
\n",
" \n",
" Unique (%) | \n",
" 0.2% | \n",
"
\n",
" \n",
" Missing (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Missing (n) | \n",
" 0 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" Mean | \n",
" 0.38384 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" 0 | \n",
" \n",
" \n",
" 549\n",
" \n",
" \n",
" | \n",
"
\n",
" 1 | \n",
" \n",
" \n",
" 342\n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 0 | \n",
" 549 | \n",
" 61.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 1 | \n",
" 342 | \n",
" 38.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Ticket
\n",
" Categorical\n",
"
\n",
"
\n",
"
\n",
" \n",
" Distinct count | \n",
" 681 | \n",
"
\n",
" \n",
" Unique (%) | \n",
" 76.4% | \n",
"
\n",
" \n",
" Missing (%) | \n",
" 0.0% | \n",
"
\n",
" \n",
" Missing (n) | \n",
" 0 | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" 347082 | \n",
" \n",
" \n",
" \n",
" \n",
" 7\n",
" | \n",
"
\n",
" 1601 | \n",
" \n",
" \n",
" \n",
" \n",
" 7\n",
" | \n",
"
\n",
" CA. 2343 | \n",
" \n",
" \n",
" \n",
" \n",
" 7\n",
" | \n",
"
\n",
" Other values (678) | \n",
" \n",
" \n",
" 870\n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
" \n",
" \n",
" 347082 | \n",
" 7 | \n",
" 0.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 1601 | \n",
" 7 | \n",
" 0.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" CA. 2343 | \n",
" 7 | \n",
" 0.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
" CA 2144 | \n",
" 6 | \n",
" 0.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 347088 | \n",
" 6 | \n",
" 0.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 3101295 | \n",
" 6 | \n",
" 0.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
" S.O.C. 14879 | \n",
" 5 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 382652 | \n",
" 5 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
" 349909 | \n",
" 4 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" LINE | \n",
" 4 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
" Other values (671) | \n",
" 834 | \n",
" 93.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Correlations
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Sample
\n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" PassengerId | \n",
" Survived | \n",
" Pclass | \n",
" Name | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Ticket | \n",
" Fare | \n",
" Cabin | \n",
" Embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" Braund, Mr. Owen Harris | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" A/5 21171 | \n",
" 7.2500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Cumings, Mrs. John Bradley (Florence Briggs Th... | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" PC 17599 | \n",
" 71.2833 | \n",
" C85 | \n",
" C | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" Heikkinen, Miss. Laina | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" STON/O2. 3101282 | \n",
" 7.9250 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" Futrelle, Mrs. Jacques Heath (Lily May Peel) | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 113803 | \n",
" 53.1000 | \n",
" C123 | \n",
" S | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 0 | \n",
" 3 | \n",
" Allen, Mr. William Henry | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 373450 | \n",
" 8.0500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
"
],
"text/plain": [
""
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pandas_profiling.ProfileReport(titanic)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Want more tricks? Watch [21 more pandas tricks](https://www.youtube.com/watch?v=tWFQqaRtSQA&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=36) or [Read the notebook](https://nbviewer.org/github/justmarkham/pandas-videos/blob/master/21_more_pandas_tricks.ipynb)\n",
"\n",
"© 2019 [Data School](https://www.dataschool.io). All rights reserved."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}