{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas Fundamentals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section introduces the new user to the key functionality of Pandas that is required to use the software effectively.\n", "\n", "For some variety, we will leave our digestive tract bacteria behind and employ some baseball data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "pd.set_option('display.max_rows', 10)\n", "\n", "baseball = pd.read_csv(\"../data/baseball.csv\", index_col='id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we specified the `id` column as the index, since it appears to be a unique identifier. We could try to create a unique index ourselves by combining `player` and `year`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
womacto012006womacto0120062CHNNL19506141...2.01.01.044.00.00.03.00.00.0
schilcu012006schilcu0120061BOSAL312010...0.00.00.001.00.00.00.00.00.0
myersmi012006myersmi0120061NYAAL620000...0.00.00.000.00.00.00.00.00.0
helliri012006helliri0120061MILNL203000...0.00.00.002.00.00.00.00.00.0
johnsra052006johnsra0520061NYAAL336010...0.00.00.004.00.00.00.00.00.0
..................................................................
benitar012007benitar0120072FLONL340000...0.00.00.000.00.00.00.00.00.0
benitar012007benitar0120071SFNNL190000...0.00.00.000.00.00.00.00.00.0
ausmubr012007ausmubr0120071HOUNL117349388216...25.06.01.03774.03.06.04.01.011.0
aloumo012007aloumo0120071NYNNL873285111219...49.03.00.02730.05.02.00.03.013.0
alomasa022007alomasa0220071NYNNL822131...0.00.00.003.00.00.00.00.00.0
\n", "

100 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... \\\n", "womacto012006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... \n", "schilcu012006 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... \n", "myersmi012006 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... \n", "helliri012006 helliri01 2006 1 MIL NL 20 3 0 0 0 ... \n", "johnsra052006 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... \n", "... ... ... ... ... .. ... ... .. ... ... ... \n", "benitar012007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... \n", "benitar012007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... \n", "ausmubr012007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... \n", "aloumo012007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... \n", "alomasa022007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... \n", "\n", " rbi sb cs bb so ibb hbp sh sf gidp \n", "womacto012006 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "schilcu012006 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "myersmi012006 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "helliri012006 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", "johnsra052006 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... .. ... ... ... ... ... ... \n", "benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "ausmubr012007 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n", "aloumo012007 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n", "alomasa022007 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[100 rows x 22 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "player_id = baseball.player + baseball.year.astype(str)\n", "baseball_newind = baseball.copy()\n", "baseball_newind.index = player_id\n", "baseball_newind" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks okay, but let's check:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.index.is_unique" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, indices need not be unique. Our choice is not unique because some players change teams within years." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "wellsda012007 2\n", "gomezch022007 2\n", "wickmbo012007 2\n", "sweenma012007 2\n", "loftoke012007 2\n", " ..\n", "thomafr042007 1\n", "mesajo012007 1\n", "thomeji012007 1\n", "mabryjo012007 1\n", "johnsra052006 1\n", "Length: 88, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(baseball_newind.index).value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most important consequence of a non-unique index is that indexing by label will return multiple values for some labels:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
wickmbo012007wickmbo0120072ARINL80000...0.00.00.000.00.00.00.00.00.0
wickmbo012007wickmbo0120071ATLNL470000...0.00.00.000.00.00.00.00.00.0
\n", "

2 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... rbi \\\n", "wickmbo012007 wickmbo01 2007 2 ARI NL 8 0 0 0 0 ... 0.0 \n", "wickmbo012007 wickmbo01 2007 1 ATL NL 47 0 0 0 0 ... 0.0 \n", "\n", " sb cs bb so ibb hbp sh sf gidp \n", "wickmbo012007 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "wickmbo012007 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[2 rows x 22 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.loc['wickmbo012007']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will learn more about indexing below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a truly unique index by combining `player`, `team` and `year`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
womacto01CHN2006womacto0120062CHNNL19506141...2.01.01.044.00.00.03.00.00.0
schilcu01BOS2006schilcu0120061BOSAL312010...0.00.00.001.00.00.00.00.00.0
myersmi01NYA2006myersmi0120061NYAAL620000...0.00.00.000.00.00.00.00.00.0
helliri01MIL2006helliri0120061MILNL203000...0.00.00.002.00.00.00.00.00.0
johnsra05NYA2006johnsra0520061NYAAL336010...0.00.00.004.00.00.00.00.00.0
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... \\\n", "womacto01CHN2006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... \n", "schilcu01BOS2006 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... \n", "myersmi01NYA2006 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... \n", "helliri01MIL2006 helliri01 2006 1 MIL NL 20 3 0 0 0 ... \n", "johnsra05NYA2006 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... \n", "\n", " rbi sb cs bb so ibb hbp sh sf gidp \n", "womacto01CHN2006 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "schilcu01BOS2006 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "myersmi01NYA2006 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "helliri01MIL2006 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", "johnsra05NYA2006 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "player_unique = baseball.player + baseball.team + baseball.year.astype(str)\n", "baseball_newind = baseball.copy()\n", "baseball_newind.index = player_unique\n", "baseball_newind.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.index.is_unique" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create meaningful indices more easily using a hierarchical index; for now, we will stick with the numeric `id` field as our index." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Manipulating indices\n", "\n", "**Reindexing** allows users to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index, and optionally, fill in missing data if requested.\n", "\n", "A simple use of `reindex` is to alter the order of the rows:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
id
89534alomasa0220071NYNNL822131...0.00.00.003.00.00.00.00.00.0
89533aloumo0120071NYNNL873285111219...49.03.00.02730.05.02.00.03.013.0
89530ausmubr0120071HOUNL117349388216...25.06.01.03774.03.06.04.01.011.0
89526benitar0120071SFNNL190000...0.00.00.000.00.00.00.00.00.0
89525benitar0120072FLONL340000...0.00.00.000.00.00.00.00.00.0
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... rbi \\\n", "id ... \n", "89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 \n", "89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 \n", "89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 \n", "89526 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 \n", "89525 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 \n", "\n", " sb cs bb so ibb hbp sh sf gidp \n", "id \n", "89534 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", "89533 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n", "89530 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n", "89526 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "89525 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.reindex(baseball.index[::-1]).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the `id` index is not sequential. Say we wanted to populate the table with every `id` value. We could specify and index that is a sequence from the first to the last `id` numbers in the database, and Pandas would fill in the missing data with `NaN` values:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
id
88641womacto012006.02.0CHNNL19.050.06.014.01.0...2.01.01.04.04.00.00.03.00.00.0
88642NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
88643schilcu012006.01.0BOSAL31.02.00.01.00.0...0.00.00.00.01.00.00.00.00.00.0
88644NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
88645myersmi012006.01.0NYAAL62.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... \\\n", "id ... \n", "88641 womacto01 2006.0 2.0 CHN NL 19.0 50.0 6.0 14.0 1.0 ... \n", "88642 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... \n", "88643 schilcu01 2006.0 1.0 BOS AL 31.0 2.0 0.0 1.0 0.0 ... \n", "88644 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... \n", "88645 myersmi01 2006.0 1.0 NYA AL 62.0 0.0 0.0 0.0 0.0 ... \n", "\n", " rbi sb cs bb so ibb hbp sh sf gidp \n", "id \n", "88641 2.0 1.0 1.0 4.0 4.0 0.0 0.0 3.0 0.0 0.0 \n", "88642 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "88643 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "88644 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "88645 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id_range = range(baseball.index.values.min(), baseball.index.values.max())\n", "baseball.reindex(id_range).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing values can be filled as desired, either with selected values, or by rule:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
id
88641womacto0120062CHNNL19506141...2.01.01.044.00.00.03.00.00.0
88642womacto0120062CHNNL19506141...2.01.01.044.00.00.03.00.00.0
88643schilcu0120061BOSAL312010...0.00.00.001.00.00.00.00.00.0
88644schilcu0120061BOSAL312010...0.00.00.001.00.00.00.00.00.0
88645myersmi0120061NYAAL620000...0.00.00.000.00.00.00.00.00.0
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... rbi sb \\\n", "id ... \n", "88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 \n", "88642 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 \n", "88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 \n", "88644 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 \n", "88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 0.0 \n", "\n", " cs bb so ibb hbp sh sf gidp \n", "id \n", "88641 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "88642 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "88643 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "88644 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "88645 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.reindex(id_range, method='ffill').head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
player
id
88641womacto01
88642charliebrown
88643schilcu01
88644charliebrown
88645myersmi01
\n", "
" ], "text/plain": [ " player\n", "id \n", "88641 womacto01\n", "88642 charliebrown\n", "88643 schilcu01\n", "88644 charliebrown\n", "88645 myersmi01" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.reindex(id_range, fill_value='charliebrown', columns=['player']).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that `reindex` does not work if we pass a non-unique index series." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can remove rows or columns via the `drop` method:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(100, 22)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.shape" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
id
88641womacto0120062CHNNL19506141...2.01.01.044.00.00.03.00.00.0
88643schilcu0120061BOSAL312010...0.00.00.001.00.00.00.00.00.0
88645myersmi0120061NYAAL620000...0.00.00.000.00.00.00.00.00.0
88649helliri0120061MILNL203000...0.00.00.002.00.00.00.00.00.0
88650johnsra0520061NYAAL336010...0.00.00.004.00.00.00.00.00.0
..................................................................
89521bondsba0120071SFNNL126340759414...66.05.00.013254.043.03.00.02.013.0
89523biggicr0120071HOUNL1415176813031...50.04.03.023112.00.03.07.05.05.0
89530ausmubr0120071HOUNL117349388216...25.06.01.03774.03.06.04.01.011.0
89533aloumo0120071NYNNL873285111219...49.03.00.02730.05.02.00.03.013.0
89534alomasa0220071NYNNL822131...0.00.00.003.00.00.00.00.00.0
\n", "

98 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... rbi \\\n", "id ... \n", "88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 \n", "88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 \n", "88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 \n", "88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 \n", "88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 \n", "... ... ... ... ... .. ... ... .. ... ... ... ... \n", "89521 bondsba01 2007 1 SFN NL 126 340 75 94 14 ... 66.0 \n", "89523 biggicr01 2007 1 HOU NL 141 517 68 130 31 ... 50.0 \n", "89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 \n", "89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 \n", "89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 \n", "\n", " sb cs bb so ibb hbp sh sf gidp \n", "id \n", "88641 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "88643 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "88645 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "88649 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", "88650 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... ... ... ... ... ... \n", "89521 5.0 0.0 132 54.0 43.0 3.0 0.0 2.0 13.0 \n", "89523 4.0 3.0 23 112.0 0.0 3.0 7.0 5.0 5.0 \n", "89530 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n", "89533 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n", "89534 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[98 rows x 22 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.drop([89525, 89526])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2bX3bhrrbisbcsbbsoshsfgidp
id
88641womacto0120062CHNNL19506141012.01.01.044.03.00.00.0
88643schilcu0120061BOSAL312010000.00.00.001.00.00.00.0
88645myersmi0120061NYAAL620000000.00.00.000.00.00.00.0
88649helliri0120061MILNL203000000.00.00.002.00.00.00.0
88650johnsra0520061NYAAL336010000.00.00.004.00.00.00.0
...............................................................
89525benitar0120072FLONL340000000.00.00.000.00.00.00.0
89526benitar0120071SFNNL190000000.00.00.000.00.00.00.0
89530ausmubr0120071HOUNL1173493882163325.06.01.03774.04.01.011.0
89533aloumo0120071NYNNL87328511121911349.03.00.02730.00.03.013.0
89534alomasa0220071NYNNL822131000.00.00.003.00.00.00.0
\n", "

100 rows × 20 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b X3b hr rbi \\\n", "id \n", "88641 womacto01 2006 2 CHN NL 19 50 6 14 1 0 1 2.0 \n", "88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 0 0 0.0 \n", "88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 0 0 0.0 \n", "88649 helliri01 2006 1 MIL NL 20 3 0 0 0 0 0 0.0 \n", "88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 0 0 0.0 \n", "... ... ... ... ... .. ... ... .. ... ... ... .. ... \n", "89525 benitar01 2007 2 FLO NL 34 0 0 0 0 0 0 0.0 \n", "89526 benitar01 2007 1 SFN NL 19 0 0 0 0 0 0 0.0 \n", "89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 3 3 25.0 \n", "89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1 13 49.0 \n", "89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0 0 0.0 \n", "\n", " sb cs bb so sh sf gidp \n", "id \n", "88641 1.0 1.0 4 4.0 3.0 0.0 0.0 \n", "88643 0.0 0.0 0 1.0 0.0 0.0 0.0 \n", "88645 0.0 0.0 0 0.0 0.0 0.0 0.0 \n", "88649 0.0 0.0 0 2.0 0.0 0.0 0.0 \n", "88650 0.0 0.0 0 4.0 0.0 0.0 0.0 \n", "... ... ... .. ... ... ... ... \n", "89525 0.0 0.0 0 0.0 0.0 0.0 0.0 \n", "89526 0.0 0.0 0 0.0 0.0 0.0 0.0 \n", "89530 6.0 1.0 37 74.0 4.0 1.0 11.0 \n", "89533 3.0 0.0 27 30.0 0.0 3.0 13.0 \n", "89534 0.0 0.0 0 3.0 0.0 0.0 0.0 \n", "\n", "[100 rows x 20 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.drop(['ibb','hbp'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing and Selection\n", "\n", "Indexing works analogously to indexing in NumPy arrays, except we can use the labels in the `Index` object to extract values in addition to arrays of integers." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "womacto01CHN2006 14\n", "schilcu01BOS2006 1\n", "myersmi01NYA2006 0\n", "helliri01MIL2006 0\n", "johnsra05NYA2006 1\n", " ... \n", "benitar01FLO2007 0\n", "benitar01SFN2007 0\n", "ausmubr01HOU2007 82\n", "aloumo01NYN2007 112\n", "alomasa02NYN2007 3\n", "Name: h, Length: 100, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sample Series object\n", "hits = baseball_newind.h\n", "hits" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "womacto01CHN2006 14\n", "schilcu01BOS2006 1\n", "myersmi01NYA2006 0\n", "Name: h, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Numpy-style indexing\n", "hits[:3]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "womacto01CHN2006 14\n", "schilcu01BOS2006 1\n", "Name: h, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Indexing by label\n", "hits[['womacto01CHN2006','schilcu01BOS2006']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also slice with data labels, since they have an intrinsic order within the Index:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "womacto01CHN2006 14\n", "schilcu01BOS2006 1\n", "myersmi01NYA2006 0\n", "helliri01MIL2006 0\n", "johnsra05NYA2006 1\n", "finlest01SFN2006 105\n", "gonzalu01ARI2006 159\n", "Name: h, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hits['womacto01CHN2006':'gonzalu01ARI2006']" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/fonnesbeck/anaconda3/envs/dev/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] }, { "data": { "text/plain": [ "womacto01CHN2006 5\n", "schilcu01BOS2006 5\n", "myersmi01NYA2006 5\n", "helliri01MIL2006 5\n", "johnsra05NYA2006 5\n", " ... \n", "benitar01FLO2007 0\n", "benitar01SFN2007 0\n", "ausmubr01HOU2007 82\n", "aloumo01NYN2007 112\n", "alomasa02NYN2007 3\n", "Name: h, Length: 100, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hits['womacto01CHN2006':'gonzalu01ARI2006'] = 5\n", "hits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a `DataFrame` we can slice along either or both axes:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hab
womacto01CHN2006550
schilcu01BOS200652
myersmi01NYA200650
helliri01MIL200653
johnsra05NYA200656
.........
benitar01FLO200700
benitar01SFN200700
ausmubr01HOU200782349
aloumo01NYN2007112328
alomasa02NYN2007322
\n", "

100 rows × 2 columns

\n", "
" ], "text/plain": [ " h ab\n", "womacto01CHN2006 5 50\n", "schilcu01BOS2006 5 2\n", "myersmi01NYA2006 5 0\n", "helliri01MIL2006 5 3\n", "johnsra05NYA2006 5 6\n", "... ... ...\n", "benitar01FLO2007 0 0\n", "benitar01SFN2007 0 0\n", "ausmubr01HOU2007 82 349\n", "aloumo01NYN2007 112 328\n", "alomasa02NYN2007 3 22\n", "\n", "[100 rows x 2 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind[['h','ab']]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
gonzalu01ARI2006gonzalu0120061ARINL15358693552...73.00.01.06958.010.07.00.06.014.0
vizquom01SFN2007vizquom0120071SFNNL1455135412618...51.014.06.04448.06.01.014.03.014.0
thomafr04TOR2007thomafr0420071TORAL1555316314730...95.00.00.08194.03.07.00.05.014.0
rodriiv01DET2007rodriiv0120071DETAL1295025014131...63.02.02.0996.01.01.01.02.016.0
griffke02CIN2007griffke0220071CINNL1445287814624...93.06.01.08599.014.01.00.09.014.0
delgaca01NYN2007delgaca0120071NYNNL1395387113930...87.04.00.052118.08.011.00.06.012.0
biggicr01HOU2007biggicr0120071HOUNL1415176813031...50.04.03.023112.00.03.07.05.05.0
\n", "

7 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b \\\n", "gonzalu01ARI2006 gonzalu01 2006 1 ARI NL 153 586 93 5 52 \n", "vizquom01SFN2007 vizquom01 2007 1 SFN NL 145 513 54 126 18 \n", "thomafr04TOR2007 thomafr04 2007 1 TOR AL 155 531 63 147 30 \n", "rodriiv01DET2007 rodriiv01 2007 1 DET AL 129 502 50 141 31 \n", "griffke02CIN2007 griffke02 2007 1 CIN NL 144 528 78 146 24 \n", "delgaca01NYN2007 delgaca01 2007 1 NYN NL 139 538 71 139 30 \n", "biggicr01HOU2007 biggicr01 2007 1 HOU NL 141 517 68 130 31 \n", "\n", " ... rbi sb cs bb so ibb hbp sh sf \\\n", "gonzalu01ARI2006 ... 73.0 0.0 1.0 69 58.0 10.0 7.0 0.0 6.0 \n", "vizquom01SFN2007 ... 51.0 14.0 6.0 44 48.0 6.0 1.0 14.0 3.0 \n", "thomafr04TOR2007 ... 95.0 0.0 0.0 81 94.0 3.0 7.0 0.0 5.0 \n", "rodriiv01DET2007 ... 63.0 2.0 2.0 9 96.0 1.0 1.0 1.0 2.0 \n", "griffke02CIN2007 ... 93.0 6.0 1.0 85 99.0 14.0 1.0 0.0 9.0 \n", "delgaca01NYN2007 ... 87.0 4.0 0.0 52 118.0 8.0 11.0 0.0 6.0 \n", "biggicr01HOU2007 ... 50.0 4.0 3.0 23 112.0 0.0 3.0 7.0 5.0 \n", "\n", " gidp \n", "gonzalu01ARI2006 14.0 \n", "vizquom01SFN2007 14.0 \n", "thomafr04TOR2007 14.0 \n", "rodriiv01DET2007 16.0 \n", "griffke02CIN2007 14.0 \n", "delgaca01NYN2007 12.0 \n", "biggicr01HOU2007 5.0 \n", "\n", "[7 rows x 22 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind[baseball_newind.ab>500]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a more concise (and readable) syntax, we can use the new `query` method to perform selection on a `DataFrame`. Instead of having to type the fully-specified column, we can simply pass a string that describes what to select. The query above is then simply:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
gonzalu01ARI2006gonzalu0120061ARINL15358693552...73.00.01.06958.010.07.00.06.014.0
vizquom01SFN2007vizquom0120071SFNNL1455135412618...51.014.06.04448.06.01.014.03.014.0
thomafr04TOR2007thomafr0420071TORAL1555316314730...95.00.00.08194.03.07.00.05.014.0
rodriiv01DET2007rodriiv0120071DETAL1295025014131...63.02.02.0996.01.01.01.02.016.0
griffke02CIN2007griffke0220071CINNL1445287814624...93.06.01.08599.014.01.00.09.014.0
delgaca01NYN2007delgaca0120071NYNNL1395387113930...87.04.00.052118.08.011.00.06.012.0
biggicr01HOU2007biggicr0120071HOUNL1415176813031...50.04.03.023112.00.03.07.05.05.0
\n", "

7 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b \\\n", "gonzalu01ARI2006 gonzalu01 2006 1 ARI NL 153 586 93 5 52 \n", "vizquom01SFN2007 vizquom01 2007 1 SFN NL 145 513 54 126 18 \n", "thomafr04TOR2007 thomafr04 2007 1 TOR AL 155 531 63 147 30 \n", "rodriiv01DET2007 rodriiv01 2007 1 DET AL 129 502 50 141 31 \n", "griffke02CIN2007 griffke02 2007 1 CIN NL 144 528 78 146 24 \n", "delgaca01NYN2007 delgaca01 2007 1 NYN NL 139 538 71 139 30 \n", "biggicr01HOU2007 biggicr01 2007 1 HOU NL 141 517 68 130 31 \n", "\n", " ... rbi sb cs bb so ibb hbp sh sf \\\n", "gonzalu01ARI2006 ... 73.0 0.0 1.0 69 58.0 10.0 7.0 0.0 6.0 \n", "vizquom01SFN2007 ... 51.0 14.0 6.0 44 48.0 6.0 1.0 14.0 3.0 \n", "thomafr04TOR2007 ... 95.0 0.0 0.0 81 94.0 3.0 7.0 0.0 5.0 \n", "rodriiv01DET2007 ... 63.0 2.0 2.0 9 96.0 1.0 1.0 1.0 2.0 \n", "griffke02CIN2007 ... 93.0 6.0 1.0 85 99.0 14.0 1.0 0.0 9.0 \n", "delgaca01NYN2007 ... 87.0 4.0 0.0 52 118.0 8.0 11.0 0.0 6.0 \n", "biggicr01HOU2007 ... 50.0 4.0 3.0 23 112.0 0.0 3.0 7.0 5.0 \n", "\n", " gidp \n", "gonzalu01ARI2006 14.0 \n", "vizquom01SFN2007 14.0 \n", "thomafr04TOR2007 14.0 \n", "rodriiv01DET2007 16.0 \n", "griffke02CIN2007 14.0 \n", "delgaca01NYN2007 12.0 \n", "biggicr01HOU2007 5.0 \n", "\n", "[7 rows x 22 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.query('ab > 500')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `DataFrame.index` and `DataFrame.columns` are placed in the query namespace by default. If you want to refer to a variable in the current namespace, you can prefix the variable with `@`:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "min_ab = 450" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
gonzalu01ARI2006gonzalu0120061ARINL15358693552...73.00.01.06958.010.07.00.06.014.0
vizquom01SFN2007vizquom0120071SFNNL1455135412618...51.014.06.04448.06.01.014.03.014.0
thomafr04TOR2007thomafr0420071TORAL1555316314730...95.00.00.08194.03.07.00.05.014.0
sheffga01DET2007sheffga0120071DETAL13349410713120...75.022.05.08471.02.09.00.06.010.0
rodriiv01DET2007rodriiv0120071DETAL1295025014131...63.02.02.0996.01.01.01.02.016.0
..................................................................
kentje01LAN2007kentje0120071LANNL1364947814936...79.01.03.05761.04.05.00.06.017.0
griffke02CIN2007griffke0220071CINNL1445287814624...93.06.01.08599.014.01.00.09.014.0
gonzalu01LAN2007gonzalu0120071LANNL1394647012923...68.06.02.05656.04.04.00.02.011.0
delgaca01NYN2007delgaca0120071NYNNL1395387113930...87.04.00.052118.08.011.00.06.012.0
biggicr01HOU2007biggicr0120071HOUNL1415176813031...50.04.03.023112.00.03.07.05.05.0
\n", "

11 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b \\\n", "gonzalu01ARI2006 gonzalu01 2006 1 ARI NL 153 586 93 5 52 \n", "vizquom01SFN2007 vizquom01 2007 1 SFN NL 145 513 54 126 18 \n", "thomafr04TOR2007 thomafr04 2007 1 TOR AL 155 531 63 147 30 \n", "sheffga01DET2007 sheffga01 2007 1 DET AL 133 494 107 131 20 \n", "rodriiv01DET2007 rodriiv01 2007 1 DET AL 129 502 50 141 31 \n", "... ... ... ... ... .. ... ... ... ... ... \n", "kentje01LAN2007 kentje01 2007 1 LAN NL 136 494 78 149 36 \n", "griffke02CIN2007 griffke02 2007 1 CIN NL 144 528 78 146 24 \n", "gonzalu01LAN2007 gonzalu01 2007 1 LAN NL 139 464 70 129 23 \n", "delgaca01NYN2007 delgaca01 2007 1 NYN NL 139 538 71 139 30 \n", "biggicr01HOU2007 biggicr01 2007 1 HOU NL 141 517 68 130 31 \n", "\n", " ... rbi sb cs bb so ibb hbp sh sf \\\n", "gonzalu01ARI2006 ... 73.0 0.0 1.0 69 58.0 10.0 7.0 0.0 6.0 \n", "vizquom01SFN2007 ... 51.0 14.0 6.0 44 48.0 6.0 1.0 14.0 3.0 \n", "thomafr04TOR2007 ... 95.0 0.0 0.0 81 94.0 3.0 7.0 0.0 5.0 \n", "sheffga01DET2007 ... 75.0 22.0 5.0 84 71.0 2.0 9.0 0.0 6.0 \n", "rodriiv01DET2007 ... 63.0 2.0 2.0 9 96.0 1.0 1.0 1.0 2.0 \n", "... ... ... ... ... .. ... ... ... ... ... \n", "kentje01LAN2007 ... 79.0 1.0 3.0 57 61.0 4.0 5.0 0.0 6.0 \n", "griffke02CIN2007 ... 93.0 6.0 1.0 85 99.0 14.0 1.0 0.0 9.0 \n", "gonzalu01LAN2007 ... 68.0 6.0 2.0 56 56.0 4.0 4.0 0.0 2.0 \n", "delgaca01NYN2007 ... 87.0 4.0 0.0 52 118.0 8.0 11.0 0.0 6.0 \n", "biggicr01HOU2007 ... 50.0 4.0 3.0 23 112.0 0.0 3.0 7.0 5.0 \n", "\n", " gidp \n", "gonzalu01ARI2006 14.0 \n", "vizquom01SFN2007 14.0 \n", "thomafr04TOR2007 14.0 \n", "sheffga01DET2007 10.0 \n", "rodriiv01DET2007 16.0 \n", "... ... \n", "kentje01LAN2007 17.0 \n", "griffke02CIN2007 14.0 \n", "gonzalu01LAN2007 11.0 \n", "delgaca01NYN2007 12.0 \n", "biggicr01HOU2007 5.0 \n", "\n", "[11 rows x 22 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.query('ab > @min_ab')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The indexing field `loc` allows us to select subsets of rows and columns in an intuitive way:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "h 5\n", "X2b 52\n", "X3b 2\n", "hr 15\n", "Name: gonzalu01ARI2006, dtype: object" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.loc['gonzalu01ARI2006', ['h','X2b', 'X3b', 'hr']]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "womacto01CHN2006 1\n", "schilcu01BOS2006 0\n", "myersmi01NYA2006 0\n", "Name: hr, dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.loc[:'myersmi01NYA2006', 'hr']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to using `loc` to select rows and columns by **label**, pandas also allows indexing by **position** using the `iloc` attribute.\n", "\n", "So, we can query rows and columns by absolute position, rather than by name:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gabr
womacto01CHN200619506
schilcu01BOS20063120
myersmi01NYA20066200
helliri01MIL20062030
johnsra05NYA20063360
\n", "
" ], "text/plain": [ " g ab r\n", "womacto01CHN2006 19 50 6\n", "schilcu01BOS2006 31 2 0\n", "myersmi01NYA2006 62 0 0\n", "helliri01MIL2006 20 3 0\n", "johnsra05NYA2006 33 6 0" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.iloc[:5, 5:8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "You can use the `isin` method query a DataFrame based upon a list of values as follows: \n", "\n", " data['phylum'].isin(['Firmacutes', 'Bacteroidetes'])\n", "\n", "Use `isin` to find all players that played for the Los Angeles Dodgers (LAN) or the San Francisco Giants (SFN). How many records contain these values?" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# Write your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operations\n", "\n", "`DataFrame` and `Series` objects allow for several operations to take place either on a single object, or between two or more objects.\n", "\n", "For example, we can perform arithmetic on the elements of two objects, such as combining baseball statistics across years. First, let's (artificially) construct two Series, consisting of home runs hit in years 2006 and 2007, respectively:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "hr2006 = baseball.loc[baseball.year==2006, 'hr']\n", "hr2006.index = baseball.player[baseball.year==2006]\n", "\n", "hr2007 = baseball.loc[baseball.year==2007, 'hr']\n", "hr2007.index = baseball.player[baseball.year==2007]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "player\n", "francju01 0\n", "francju01 1\n", "zaungr01 10\n", "witasja01 0\n", "williwo02 1\n", " ..\n", "benitar01 0\n", "benitar01 0\n", "ausmubr01 3\n", "aloumo01 13\n", "alomasa02 0\n", "Name: hr, Length: 92, dtype: int64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hr2007" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's add them together, in hopes of getting 2-year home run totals:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "player\n", "alomasa02 NaN\n", "aloumo01 NaN\n", "ausmubr01 NaN\n", "benitar01 NaN\n", "benitar01 NaN\n", " ..\n", "wickmbo01 NaN\n", "williwo02 NaN\n", "witasja01 NaN\n", "womacto01 NaN\n", "zaungr01 NaN\n", "Name: hr, Length: 94, dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hr_total = hr2006 + hr2007\n", "hr_total" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas' data alignment places `NaN` values for labels that do not overlap in the two Series. In fact, there are only 6 players that occur in both years." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "player\n", "finlest01 7.0\n", "gonzalu01 30.0\n", "johnsra05 0.0\n", "myersmi01 0.0\n", "schilcu01 0.0\n", "seleaa01 0.0\n", "Name: hr, dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hr_total[hr_total.notnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While we do want the operation to honor the data labels in this way, we probably do not want the missing values to be filled with `NaN`. We can use the `add` method to calculate player home run totals by using the `fill_value` argument to insert a zero for home runs where labels do not overlap:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "player\n", "alomasa02 0.0\n", "aloumo01 13.0\n", "ausmubr01 3.0\n", "benitar01 0.0\n", "benitar01 0.0\n", " ... \n", "wickmbo01 0.0\n", "williwo02 1.0\n", "witasja01 0.0\n", "womacto01 1.0\n", "zaungr01 10.0\n", "Name: hr, Length: 94, dtype: float64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hr2007.add(hr2006, fill_value=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Operations can also be **broadcast** between rows or columns.\n", "\n", "For example, if we subtract the maximum number of home runs hit from the `hr` column, we get how many fewer than the maximum were hit by each player:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id\n", "88641 -34\n", "88643 -35\n", "88645 -35\n", "88649 -35\n", "88650 -35\n", " ..\n", "89525 -35\n", "89526 -35\n", "89530 -32\n", "89533 -22\n", "89534 -35\n", "Name: hr, Length: 100, dtype: int64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.hr - baseball.hr.max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, looking at things row-wise, we can see how a particular player compares with the rest of the group with respect to important statistics" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'bondsba01'" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.loc[89521, \"player\"]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hX2bX3bhr
id
88641-80-130-27
88643-93-140-28
88645-94-140-28
88649-94-140-28
88650-93-140-28
8865211712-22
8865365382-13
88662-89-130-28
89177-84-110-28
89178-84-140-27
\n", "
" ], "text/plain": [ " h X2b X3b hr\n", "id \n", "88641 -80 -13 0 -27\n", "88643 -93 -14 0 -28\n", "88645 -94 -14 0 -28\n", "88649 -94 -14 0 -28\n", "88650 -93 -14 0 -28\n", "88652 11 7 12 -22\n", "88653 65 38 2 -13\n", "88662 -89 -13 0 -28\n", "89177 -84 -11 0 -28\n", "89178 -84 -14 0 -27" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats = baseball[['h','X2b', 'X3b', 'hr']]\n", "diff = stats - stats.loc[89521]\n", "diff[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also apply functions to each column or row of a `DataFrame`" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "h 8.0\n", "X2b 1.0\n", "X3b 0.0\n", "hr 0.0\n", "dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "stats.apply(np.median)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def range_calc(x):\n", " return x.max() - x.min()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "h 159\n", "X2b 52\n", "X3b 12\n", "hr 35\n", "dtype: int64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stat_range = lambda x: x.max() - x.min()\n", "stats.apply(stat_range)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets use apply to calculate a meaningful baseball statistics, slugging percentage:\n", "\n", "$$SLG = \\frac{1B + (2 \\times 2B) + (3 \\times 3B) + (4 \\times HR)}{AB}$$\n", "\n", "And just for fun, we will format the resulting estimate." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id\n", "88641 0.360\n", "88643 0.500\n", "88645 0.000\n", "88649 0.000\n", "88650 0.167\n", " ... \n", "89525 0.000\n", "89526 0.000\n", "89530 0.324\n", "89533 0.524\n", "89534 0.182\n", "Length: 100, dtype: float64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def slugging(x): \n", " bases = x['h']-x['X2b']-x['X3b']-x['hr'] + 2*x['X2b'] + 3*x['X3b'] + 4*x['hr']\n", " ab = x['ab']+1e-6\n", " \n", " return bases/ab\n", "\n", "baseball.apply(slugging, axis=1).round(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sorting and Ranking\n", "\n", "Pandas objects include methods for re-ordering data." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
alomasa02NYN2007alomasa0220071NYNNL822131...0.00.00.003.00.00.00.00.00.0
aloumo01NYN2007aloumo0120071NYNNL873285111219...49.03.00.02730.05.02.00.03.013.0
ausmubr01HOU2007ausmubr0120071HOUNL117349388216...25.06.01.03774.03.06.04.01.011.0
benitar01FLO2007benitar0120072FLONL340000...0.00.00.000.00.00.00.00.00.0
benitar01SFN2007benitar0120071SFNNL190000...0.00.00.000.00.00.00.00.00.0
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b \\\n", "alomasa02NYN2007 alomasa02 2007 1 NYN NL 8 22 1 3 1 \n", "aloumo01NYN2007 aloumo01 2007 1 NYN NL 87 328 51 112 19 \n", "ausmubr01HOU2007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 \n", "benitar01FLO2007 benitar01 2007 2 FLO NL 34 0 0 0 0 \n", "benitar01SFN2007 benitar01 2007 1 SFN NL 19 0 0 0 0 \n", "\n", " ... rbi sb cs bb so ibb hbp sh sf gidp \n", "alomasa02NYN2007 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", "aloumo01NYN2007 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n", "ausmubr01HOU2007 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n", "benitar01FLO2007 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "benitar01SFN2007 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.sort_index().head()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
zaungr01TOR2007zaungr0120071TORAL110331438024...52.00.00.05155.08.02.01.06.09.0
womacto01CHN2006womacto0120062CHNNL1950651...2.01.01.044.00.00.03.00.00.0
witasja01TBA2007witasja0120071TBAAL30000...0.00.00.000.00.00.00.00.00.0
williwo02HOU2007williwo0220071HOUNL3359360...2.00.00.0025.00.00.05.00.01.0
wickmbo01ATL2007wickmbo0120071ATLNL470000...0.00.00.000.00.00.00.00.00.0
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b \\\n", "zaungr01TOR2007 zaungr01 2007 1 TOR AL 110 331 43 80 24 \n", "womacto01CHN2006 womacto01 2006 2 CHN NL 19 50 6 5 1 \n", "witasja01TBA2007 witasja01 2007 1 TBA AL 3 0 0 0 0 \n", "williwo02HOU2007 williwo02 2007 1 HOU NL 33 59 3 6 0 \n", "wickmbo01ATL2007 wickmbo01 2007 1 ATL NL 47 0 0 0 0 \n", "\n", " ... rbi sb cs bb so ibb hbp sh sf gidp \n", "zaungr01TOR2007 ... 52.0 0.0 0.0 51 55.0 8.0 2.0 1.0 6.0 9.0 \n", "womacto01CHN2006 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "witasja01TBA2007 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "williwo02HOU2007 ... 2.0 0.0 0.0 0 25.0 0.0 0.0 5.0 0.0 1.0 \n", "wickmbo01ATL2007 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.sort_index(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try sorting the **columns** instead of the rows, in ascending order:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X2bX3babbbcsggidphhbphr...playerrrbisbsfshsostintteamyear
womacto01CHN2006105041.0190.050.01...womacto0162.01.00.03.04.02CHN2006
schilcu01BOS200600200.0310.050.00...schilcu0100.00.00.00.01.01BOS2006
myersmi01NYA200600000.0620.050.00...myersmi0100.00.00.00.00.01NYA2006
helliri01MIL200600300.0200.050.00...helliri0100.00.00.00.02.01MIL2006
johnsra05NYA200600600.0330.050.00...johnsra0500.00.00.00.04.01NYA2006
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " X2b X3b ab bb cs g gidp h hbp hr ... \\\n", "womacto01CHN2006 1 0 50 4 1.0 19 0.0 5 0.0 1 ... \n", "schilcu01BOS2006 0 0 2 0 0.0 31 0.0 5 0.0 0 ... \n", "myersmi01NYA2006 0 0 0 0 0.0 62 0.0 5 0.0 0 ... \n", "helliri01MIL2006 0 0 3 0 0.0 20 0.0 5 0.0 0 ... \n", "johnsra05NYA2006 0 0 6 0 0.0 33 0.0 5 0.0 0 ... \n", "\n", " player r rbi sb sf sh so stint team year \n", "womacto01CHN2006 womacto01 6 2.0 1.0 0.0 3.0 4.0 2 CHN 2006 \n", "schilcu01BOS2006 schilcu01 0 0.0 0.0 0.0 0.0 1.0 1 BOS 2006 \n", "myersmi01NYA2006 myersmi01 0 0.0 0.0 0.0 0.0 0.0 1 NYA 2006 \n", "helliri01MIL2006 helliri01 0 0.0 0.0 0.0 0.0 2.0 1 MIL 2006 \n", "johnsra05NYA2006 johnsra05 0 0.0 0.0 0.0 0.0 4.0 1 NYA 2006 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.sort_index(axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use `sort_values` to sort a `Series` by value, rather than by label." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id\n", "89534 0\n", "89365 0\n", "89469 0\n", "89367 0\n", "89370 0\n", " ..\n", "89378 25\n", "89361 26\n", "89521 28\n", "89462 30\n", "89360 35\n", "Name: hr, Length: 100, dtype: int64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.hr.sort_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a `DataFrame`, we can sort according to the values of one or more columns using the `by` argument of `sort_values`:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playersbcs
id
89378sheffga0122.05.0
89430loftoke0121.04.0
89347vizquom0114.06.0
89463greensh0111.01.0
88652finlest017.00.0
89462griffke026.01.0
89530ausmubr016.01.0
89466gonzalu016.02.0
89521bondsba015.00.0
89438kleskry015.01.0
\n", "
" ], "text/plain": [ " player sb cs\n", "id \n", "89378 sheffga01 22.0 5.0\n", "89430 loftoke01 21.0 4.0\n", "89347 vizquom01 14.0 6.0\n", "89463 greensh01 11.0 1.0\n", "88652 finlest01 7.0 0.0\n", "89462 griffke02 6.0 1.0\n", "89530 ausmubr01 6.0 1.0\n", "89466 gonzalu01 6.0 2.0\n", "89521 bondsba01 5.0 0.0\n", "89438 kleskry01 5.0 1.0" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball[['player','sb','cs']].sort_values(ascending=[False,True], \n", " by=['sb', 'cs']).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ranking** does not re-arrange data, but instead returns an index that ranks each value relative to others in the Series." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id\n", "88641 62.5\n", "88643 29.0\n", "88645 29.0\n", "88649 29.0\n", "88650 29.0\n", " ... \n", "89525 29.0\n", "89526 29.0\n", "89530 71.5\n", "89533 88.0\n", "89534 29.0\n", "Name: hr, Length: 100, dtype: float64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.hr.rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ties are assigned the mean value of the tied ranks, which may result in decimal values." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.5\n", "1 1.5\n", "dtype: float64" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series([100,100]).rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can break ties via one of several methods, such as by the order in which they occur in the dataset:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id\n", "88641 58.0\n", "88643 1.0\n", "88645 2.0\n", "88649 3.0\n", "88650 4.0\n", " ... \n", "89525 55.0\n", "89526 56.0\n", "89530 72.0\n", "89533 88.0\n", "89534 57.0\n", "Name: hr, Length: 100, dtype: float64" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.hr.rank(method='first')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling the `DataFrame`'s `rank` method results in the ranks of all columns:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
id
886412.096.57.082.031.570.047.540.539.050.5...51.024.517.544.559.066.065.516.070.076.5
8864337.596.557.088.081.555.573.081.063.578.0...78.563.562.579.073.066.065.567.570.076.5
8864547.596.557.040.581.536.091.081.084.578.0...78.563.562.579.089.066.065.567.570.076.5
8864966.096.557.047.031.567.569.081.084.578.0...78.563.562.579.067.066.065.567.570.076.5
8865061.596.557.040.581.551.064.581.063.578.0...78.563.562.579.059.066.065.567.570.076.5
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... \\\n", "id ... \n", "88641 2.0 96.5 7.0 82.0 31.5 70.0 47.5 40.5 39.0 50.5 ... \n", "88643 37.5 96.5 57.0 88.0 81.5 55.5 73.0 81.0 63.5 78.0 ... \n", "88645 47.5 96.5 57.0 40.5 81.5 36.0 91.0 81.0 84.5 78.0 ... \n", "88649 66.0 96.5 57.0 47.0 31.5 67.5 69.0 81.0 84.5 78.0 ... \n", "88650 61.5 96.5 57.0 40.5 81.5 51.0 64.5 81.0 63.5 78.0 ... \n", "\n", " rbi sb cs bb so ibb hbp sh sf gidp \n", "id \n", "88641 51.0 24.5 17.5 44.5 59.0 66.0 65.5 16.0 70.0 76.5 \n", "88643 78.5 63.5 62.5 79.0 73.0 66.0 65.5 67.5 70.0 76.5 \n", "88645 78.5 63.5 62.5 79.0 89.0 66.0 65.5 67.5 70.0 76.5 \n", "88649 78.5 63.5 62.5 79.0 67.0 66.0 65.5 67.5 70.0 76.5 \n", "88650 78.5 63.5 62.5 79.0 59.0 66.0 65.5 67.5 70.0 76.5 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.rank(ascending=False).head()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rhhr
id
8864140.539.038.5
8864381.063.572.0
8864581.084.572.0
8864981.084.572.0
8865081.063.572.0
\n", "
" ], "text/plain": [ " r h hr\n", "id \n", "88641 40.5 39.0 38.5\n", "88643 81.0 63.5 72.0\n", "88645 81.0 84.5 72.0\n", "88649 81.0 84.5 72.0\n", "88650 81.0 63.5 72.0" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball[['r','h','hr']].rank(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Calculate **on base percentage** for each player, and return the ordered series of estimates.\n", "\n", "$$OBP = \\frac{H + BB + HBP}{AB + BB + HBP + SF}$$" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "# Write your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hierarchical indexing\n", "\n", "In the baseball example, I was forced to combine 3 fields to obtain a unique index that was not simply an integer value. A more elegant way to have done this would be to create a hierarchical index from the three fields." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stintlggabrhX2bX3bhrrbisbcsbbsoibbhbpshsfgidp
yearteamplayer
2006CHNwomacto012NL19506141012.01.01.044.00.00.03.00.00.0
BOSschilcu011AL312010000.00.00.001.00.00.00.00.00.0
NYAmyersmi011AL620000000.00.00.000.00.00.00.00.00.0
MILhelliri011NL203000000.00.00.002.00.00.00.00.00.0
NYAjohnsra051AL336010000.00.00.004.00.00.00.00.00.0
SFNfinlest011NL139426661052112640.07.00.04655.02.02.03.04.06.0
ARIgonzalu011NL153586931595221573.00.01.06958.010.07.00.06.014.0
LANseleaa011NL2826251000.00.00.017.00.00.06.00.01.0
2007ATLfrancju012NL15401103008.00.00.0410.01.00.00.01.01.0
NYNfrancju011NL40507100018.02.01.01013.00.00.00.01.01.0
\n", "
" ], "text/plain": [ " stint lg g ab r h X2b X3b hr rbi sb \\\n", "year team player \n", "2006 CHN womacto01 2 NL 19 50 6 14 1 0 1 2.0 1.0 \n", " BOS schilcu01 1 AL 31 2 0 1 0 0 0 0.0 0.0 \n", " NYA myersmi01 1 AL 62 0 0 0 0 0 0 0.0 0.0 \n", " MIL helliri01 1 NL 20 3 0 0 0 0 0 0.0 0.0 \n", " NYA johnsra05 1 AL 33 6 0 1 0 0 0 0.0 0.0 \n", " SFN finlest01 1 NL 139 426 66 105 21 12 6 40.0 7.0 \n", " ARI gonzalu01 1 NL 153 586 93 159 52 2 15 73.0 0.0 \n", " LAN seleaa01 1 NL 28 26 2 5 1 0 0 0.0 0.0 \n", "2007 ATL francju01 2 NL 15 40 1 10 3 0 0 8.0 0.0 \n", " NYN francju01 1 NL 40 50 7 10 0 0 1 8.0 2.0 \n", "\n", " cs bb so ibb hbp sh sf gidp \n", "year team player \n", "2006 CHN womacto01 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", " BOS schilcu01 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", " NYA myersmi01 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " MIL helliri01 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", " NYA johnsra05 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n", " SFN finlest01 0.0 46 55.0 2.0 2.0 3.0 4.0 6.0 \n", " ARI gonzalu01 1.0 69 58.0 10.0 7.0 0.0 6.0 14.0 \n", " LAN seleaa01 0.0 1 7.0 0.0 0.0 6.0 0.0 1.0 \n", "2007 ATL francju01 0.0 4 10.0 1.0 0.0 0.0 1.0 1.0 \n", " NYN francju01 1.0 10 13.0 0.0 0.0 0.0 1.0 1.0 " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_h = baseball.set_index(['year', 'team', 'player'])\n", "baseball_h.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This index is a `MultiIndex` object that consists of a sequence of tuples, the elements of which is some combination of the three columns used to create the index. Where there are multiple repeated values, Pandas does not print the repeats, making it easy to identify groups of values." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultiIndex(levels=[[2006, 2007], ['ARI', 'ATL', 'BAL', 'BOS', 'CHA', 'CHN', 'CIN', 'CLE', 'COL', 'DET', 'FLO', 'HOU', 'KCA', 'LAA', 'LAN', 'MIL', 'MIN', 'NYA', 'NYN', 'OAK', 'PHI', 'SDN', 'SFN', 'SLN', 'TBA', 'TEX', 'TOR'], ['alomasa02', 'aloumo01', 'ausmubr01', 'benitar01', 'biggicr01', 'bondsba01', 'cirilje01', 'claytro01', 'clemero02', 'coninje01', 'cormirh01', 'delgaca01', 'easleda01', 'edmonji01', 'embreal01', 'finlest01', 'floydcl01', 'francju01', 'glavito02', 'gomezch02', 'gonzalu01', 'gordoto01', 'graffto01', 'greensh01', 'griffke02', 'guarded01', 'helliri01', 'hernaro01', 'hoffmtr01', 'johnsra05', 'jonesto02', 'kentje01', 'kleskry01', 'loaizes01', 'loftoke01', 'mabryjo01', 'maddugr01', 'martipe02', 'mesajo01', 'moyerja01', 'mussimi01', 'myersmi01', 'oliveda02', 'parkch01', 'perezne01', 'piazzmi01', 'ramirma02', 'rodriiv01', 'rogerke01', 'sandere02', 'schilcu01', 'schmija01', 'seaneru01', 'seleaa01', 'sheffga01', 'smoltjo01', 'sosasa01', 'sprinru01', 'stairma01', 'stantmi02', 'stinnke01', 'suppaje01', 'sweenma01', 'tavarju01', 'thomafr04', 'thomeji01', 'timlimi01', 'trachst01', 'valenjo03', 'villoro01', 'vizquom01', 'wakefti01', 'walketo04', 'weathda01', 'wellsda01', 'whiteri01', 'whitero02', 'wickmbo01', 'williwo02', 'witasja01', 'womacto01', 'zaungr01']],\n", " labels=[[0, 0, 0, 0, 0, 0, 0, 0, 1, 1], [5, 3, 17, 15, 17, 22, 0, 14, 1, 18], [80, 50, 41, 26, 29, 15, 20, 53, 17, 17]],\n", " names=['year', 'team', 'player'])" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_h.index[:10]" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_h.index.is_unique" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try using this hierarchical index to retrieve Julio Franco (`francju01`), who played for the Atlanta Braves (`ATL`) in 2007:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "stint 2\n", "lg NL\n", "g 15\n", "ab 40\n", "r 1\n", " ..\n", "ibb 1\n", "hbp 0\n", "sh 0\n", "sf 1\n", "gidp 1\n", "Name: (2007, ATL, francju01), Length: 19, dtype: object" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_h.loc[(2007, 'ATL', 'francju01')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall earlier we imported some microbiome data using two index columns. This created a 2-level hierarchical index:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "mb = pd.read_csv(\"../data/microbiome.csv\", index_col=['Taxon','Patient'])" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GroupTissueStool
TaxonPatient
Firmicutes101364182
211174703
304083946
418318605
5069350
61718717
7017333
8122880
901623196
10137232
\n", "
" ], "text/plain": [ " Group Tissue Stool\n", "Taxon Patient \n", "Firmicutes 1 0 136 4182\n", " 2 1 1174 703\n", " 3 0 408 3946\n", " 4 1 831 8605\n", " 5 0 693 50\n", " 6 1 718 717\n", " 7 0 173 33\n", " 8 1 228 80\n", " 9 0 162 3196\n", " 10 1 372 32" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a hierachical index, we can select subsets of the data based on a *partial* index:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GroupTissueStool
Patient
1024691821
21839661
30441418
411204483
50231012
............
1016857795
110483666
12129503994
1301541816
141130753
\n", "

14 rows × 3 columns

\n", "
" ], "text/plain": [ " Group Tissue Stool\n", "Patient \n", "1 0 2469 1821\n", "2 1 839 661\n", "3 0 4414 18\n", "4 1 12044 83\n", "5 0 2310 12\n", "... ... ... ...\n", "10 1 6857 795\n", "11 0 483 666\n", "12 1 2950 3994\n", "13 0 1541 816\n", "14 1 1307 53\n", "\n", "[14 rows x 3 columns]" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb.loc['Proteobacteria']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hierarchical indices can be created on either or both axes. Here is a trivial example:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioColorado
GreenRedGreen
a1012
2345
b1678
291011
\n", "
" ], "text/plain": [ " Ohio Colorado\n", " Green Red Green\n", "a 1 0 1 2\n", " 2 3 4 5\n", "b 1 6 7 8\n", " 2 9 10 11" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = pd.DataFrame(np.arange(12).reshape(( 4, 3)), \n", " index =[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], \n", " columns =[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])\n", "\n", "frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to get fancy, both the row and column indices themselves can be given names:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stateOhioColorado
colorGreenRedGreen
key1key2
a1012
2345
b1678
291011
\n", "
" ], "text/plain": [ "state Ohio Colorado\n", "color Green Red Green\n", "key1 key2 \n", "a 1 0 1 2\n", " 2 3 4 5\n", "b 1 6 7 8\n", " 2 9 10 11" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.index.names = ['key1', 'key2']\n", "frame.columns.names = ['state', 'color']\n", "frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this, we can do all sorts of custom indexing:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorGreenRed
key2
101
234
\n", "
" ], "text/plain": [ "color Green Red\n", "key2 \n", "1 0 1\n", "2 3 4" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.loc['a', 'Ohio']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try retrieving the value corresponding to `b2` in `Colorado`:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "# Write your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Additionally, the order of the set of indices in a hierarchical `MultiIndex` can be changed by swapping them pairwise:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GroupTissueStool
PatientTaxon
1Firmicutes01364182
2Firmicutes11174703
3Firmicutes04083946
4Firmicutes18318605
5Firmicutes069350
\n", "
" ], "text/plain": [ " Group Tissue Stool\n", "Patient Taxon \n", "1 Firmicutes 0 136 4182\n", "2 Firmicutes 1 1174 703\n", "3 Firmicutes 0 408 3946\n", "4 Firmicutes 1 831 8605\n", "5 Firmicutes 0 693 50" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb.swaplevel('Patient', 'Taxon').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data can also be sorted by any index level, using `sortlevel`:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/fonnesbeck/anaconda3/envs/dev/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: sortlevel is deprecated, use sort_index(level= ...)\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GroupTissueStool
TaxonPatient
Proteobacteria141130753
Other14130532
Firmicutes1412812377
Bacteroidetes14110233
Actinobacteria141310204
\n", "
" ], "text/plain": [ " Group Tissue Stool\n", "Taxon Patient \n", "Proteobacteria 14 1 1307 53\n", "Other 14 1 305 32\n", "Firmicutes 14 1 281 2377\n", "Bacteroidetes 14 1 102 33\n", "Actinobacteria 14 1 310 204" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb.sortlevel('Patient', ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing data\n", "\n", "The occurence of missing data is so prevalent that it pays to use tools like Pandas, which seamlessly integrates missing data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.\n", "\n", "Missing data are represented in `Series` and `DataFrame` objects by the `NaN` floating point value. However, `None` is also treated as missing, since it is commonly used as such in other contexts (*e.g.* NumPy)." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 -3\n", "2 None\n", "3 foobar\n", "dtype: object" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foo = pd.Series([np.nan, -3, None, 'foobar'])\n", "foo" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foo.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing values may be dropped or indexed out:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scoremalesiblingsfamily_invnon_englishprev_disabage_testnon_severe_hlmother_hsearly_identnon_white
04002.02.0FalseNaN551.0NaNFalseFalse
13110.0NaNFalse0.0530.00.0FalseFalse
28311.01.0True0.0521.0NaNFalseTrue
37503.0NaNFalse0.0550.01.0FalseFalse
56200.04.0False1.0500.0NaNFalseFalse
....................................
526702.02.0False0.0490.01.0TrueTrue
538911.00.0False0.0481.01.0FalseTrue
547611.0NaNFalse1.0561.01.0FalseTrue
559811.01.0True0.0511.01.0TrueFalse
569310.00.0FalseNaN581.01.0TrueFalse
\n", "

50 rows × 11 columns

\n", "
" ], "text/plain": [ " score male siblings family_inv non_english prev_disab age_test \\\n", "0 40 0 2.0 2.0 False NaN 55 \n", "1 31 1 0.0 NaN False 0.0 53 \n", "2 83 1 1.0 1.0 True 0.0 52 \n", "3 75 0 3.0 NaN False 0.0 55 \n", "5 62 0 0.0 4.0 False 1.0 50 \n", ".. ... ... ... ... ... ... ... \n", "52 67 0 2.0 2.0 False 0.0 49 \n", "53 89 1 1.0 0.0 False 0.0 48 \n", "54 76 1 1.0 NaN False 1.0 56 \n", "55 98 1 1.0 1.0 True 0.0 51 \n", "56 93 1 0.0 0.0 False NaN 58 \n", "\n", " non_severe_hl mother_hs early_ident non_white \n", "0 1.0 NaN False False \n", "1 0.0 0.0 False False \n", "2 1.0 NaN False True \n", "3 0.0 1.0 False False \n", "5 0.0 NaN False False \n", ".. ... ... ... ... \n", "52 0.0 1.0 True True \n", "53 1.0 1.0 False True \n", "54 1.0 1.0 False True \n", "55 1.0 1.0 True False \n", "56 1.0 1.0 True False \n", "\n", "[50 rows x 11 columns]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores = pd.read_csv('../data/test_scores.csv', index_col=0, nrows=50)\n", "test_scores" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scoremalesiblingsfamily_invnon_englishprev_disabage_testnon_severe_hlmother_hsearly_identnon_white
73902.03.0True1.0501.01.0FalseTrue
1210911.00.0False0.0481.01.0TrueFalse
149102.00.0False0.0581.01.0TrueTrue
196510.02.0False0.0501.01.0FalseTrue
217111.02.0False1.0580.00.0TrueTrue
....................................
506111.01.0True0.0570.01.0FalseTrue
518402.00.0False0.0520.01.0FalseFalse
526702.02.0False0.0490.01.0TrueTrue
538911.00.0False0.0481.01.0FalseTrue
559811.01.0True0.0511.01.0TrueFalse
\n", "

21 rows × 11 columns

\n", "
" ], "text/plain": [ " score male siblings family_inv non_english prev_disab age_test \\\n", "7 39 0 2.0 3.0 True 1.0 50 \n", "12 109 1 1.0 0.0 False 0.0 48 \n", "14 91 0 2.0 0.0 False 0.0 58 \n", "19 65 1 0.0 2.0 False 0.0 50 \n", "21 71 1 1.0 2.0 False 1.0 58 \n", ".. ... ... ... ... ... ... ... \n", "50 61 1 1.0 1.0 True 0.0 57 \n", "51 84 0 2.0 0.0 False 0.0 52 \n", "52 67 0 2.0 2.0 False 0.0 49 \n", "53 89 1 1.0 0.0 False 0.0 48 \n", "55 98 1 1.0 1.0 True 0.0 51 \n", "\n", " non_severe_hl mother_hs early_ident non_white \n", "7 1.0 1.0 False True \n", "12 1.0 1.0 True False \n", "14 1.0 1.0 True True \n", "19 1.0 1.0 False True \n", "21 0.0 0.0 True True \n", ".. ... ... ... ... \n", "50 0.0 1.0 False True \n", "51 0.0 1.0 False False \n", "52 0.0 1.0 True True \n", "53 1.0 1.0 False True \n", "55 1.0 1.0 True False \n", "\n", "[21 rows x 11 columns]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores.dropna()" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scoremalesiblingsfamily_invnon_englishprev_disabage_testnon_severe_hlmother_hsearly_identnon_white
0FalseFalseFalseFalseFalseTrueFalseFalseTrueFalseFalse
1FalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalseFalseFalseTrueFalseFalse
3FalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalse
5FalseFalseFalseFalseFalseFalseFalseFalseTrueFalseFalse
....................................
52FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
53FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
54FalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalse
55FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
56FalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalse
\n", "

50 rows × 11 columns

\n", "
" ], "text/plain": [ " score male siblings family_inv non_english prev_disab age_test \\\n", "0 False False False False False True False \n", "1 False False False True False False False \n", "2 False False False False False False False \n", "3 False False False True False False False \n", "5 False False False False False False False \n", ".. ... ... ... ... ... ... ... \n", "52 False False False False False False False \n", "53 False False False False False False False \n", "54 False False False True False False False \n", "55 False False False False False False False \n", "56 False False False False False True False \n", "\n", " non_severe_hl mother_hs early_ident non_white \n", "0 False True False False \n", "1 False False False False \n", "2 False True False False \n", "3 False False False False \n", "5 False True False False \n", ".. ... ... ... ... \n", "52 False False False False \n", "53 False False False False \n", "54 False False False False \n", "55 False False False False \n", "56 False False False False \n", "\n", "[50 rows x 11 columns]" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores.isnull()" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scoremalesiblingsfamily_invnon_englishprev_disabage_testnon_severe_hlmother_hsearly_identnon_white
04002.02.0FalseNaN551.0NaNFalseFalse
13110.0NaNFalse0.0530.00.0FalseFalse
28311.01.0True0.0521.0NaNFalseTrue
37503.0NaNFalse0.0550.01.0FalseFalse
56200.04.0False1.0500.0NaNFalseFalse
....................................
526702.02.0False0.0490.01.0TrueTrue
538911.00.0False0.0481.01.0FalseTrue
547611.0NaNFalse1.0561.01.0FalseTrue
559811.01.0True0.0511.01.0TrueFalse
569310.00.0FalseNaN581.01.0TrueFalse
\n", "

50 rows × 11 columns

\n", "
" ], "text/plain": [ " score male siblings family_inv non_english prev_disab age_test \\\n", "0 40 0 2.0 2.0 False NaN 55 \n", "1 31 1 0.0 NaN False 0.0 53 \n", "2 83 1 1.0 1.0 True 0.0 52 \n", "3 75 0 3.0 NaN False 0.0 55 \n", "5 62 0 0.0 4.0 False 1.0 50 \n", ".. ... ... ... ... ... ... ... \n", "52 67 0 2.0 2.0 False 0.0 49 \n", "53 89 1 1.0 0.0 False 0.0 48 \n", "54 76 1 1.0 NaN False 1.0 56 \n", "55 98 1 1.0 1.0 True 0.0 51 \n", "56 93 1 0.0 0.0 False NaN 58 \n", "\n", " non_severe_hl mother_hs early_ident non_white \n", "0 1.0 NaN False False \n", "1 0.0 0.0 False False \n", "2 1.0 NaN False True \n", "3 0.0 1.0 False False \n", "5 0.0 NaN False False \n", ".. ... ... ... ... \n", "52 0.0 1.0 True True \n", "53 1.0 1.0 False True \n", "54 1.0 1.0 False True \n", "55 1.0 1.0 True False \n", "56 1.0 1.0 True False \n", "\n", "[50 rows x 11 columns]" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores[test_scores.notnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, `dropna` drops entire rows in which one or more values are missing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be overridden by passing the `how='all'` argument, which only drops a row when every field is a missing value." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scoremalesiblingsfamily_invnon_englishprev_disabage_testnon_severe_hlmother_hsearly_identnon_white
04002.02.0FalseNaN551.0NaNFalseFalse
13110.0NaNFalse0.0530.00.0FalseFalse
28311.01.0True0.0521.0NaNFalseTrue
37503.0NaNFalse0.0550.01.0FalseFalse
56200.04.0False1.0500.0NaNFalseFalse
....................................
526702.02.0False0.0490.01.0TrueTrue
538911.00.0False0.0481.01.0FalseTrue
547611.0NaNFalse1.0561.01.0FalseTrue
559811.01.0True0.0511.01.0TrueFalse
569310.00.0FalseNaN581.01.0TrueFalse
\n", "

50 rows × 11 columns

\n", "
" ], "text/plain": [ " score male siblings family_inv non_english prev_disab age_test \\\n", "0 40 0 2.0 2.0 False NaN 55 \n", "1 31 1 0.0 NaN False 0.0 53 \n", "2 83 1 1.0 1.0 True 0.0 52 \n", "3 75 0 3.0 NaN False 0.0 55 \n", "5 62 0 0.0 4.0 False 1.0 50 \n", ".. ... ... ... ... ... ... ... \n", "52 67 0 2.0 2.0 False 0.0 49 \n", "53 89 1 1.0 0.0 False 0.0 48 \n", "54 76 1 1.0 NaN False 1.0 56 \n", "55 98 1 1.0 1.0 True 0.0 51 \n", "56 93 1 0.0 0.0 False NaN 58 \n", "\n", " non_severe_hl mother_hs early_ident non_white \n", "0 1.0 NaN False False \n", "1 0.0 0.0 False False \n", "2 1.0 NaN False True \n", "3 0.0 1.0 False False \n", "5 0.0 NaN False False \n", ".. ... ... ... ... \n", "52 0.0 1.0 True True \n", "53 1.0 1.0 False True \n", "54 1.0 1.0 False True \n", "55 1.0 1.0 True False \n", "56 1.0 1.0 True False \n", "\n", "[50 rows x 11 columns]" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores.dropna(how='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be customized further by specifying how many values need to be present before a row is dropped via the `thresh` argument." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scoremalesiblingsfamily_invnon_englishprev_disabage_testnon_severe_hlmother_hsearly_identnon_white
13110.0NaNFalse0.0530.00.0FalseFalse
28311.01.0True0.0521.0NaNFalseTrue
37503.0NaNFalse0.0550.01.0FalseFalse
56200.04.0False1.0500.0NaNFalseFalse
68511.01.0False0.0530.0NaNFalseFalse
....................................
526702.02.0False0.0490.01.0TrueTrue
538911.00.0False0.0481.01.0FalseTrue
547611.0NaNFalse1.0561.01.0FalseTrue
559811.01.0True0.0511.01.0TrueFalse
569310.00.0FalseNaN581.01.0TrueFalse
\n", "

42 rows × 11 columns

\n", "
" ], "text/plain": [ " score male siblings family_inv non_english prev_disab age_test \\\n", "1 31 1 0.0 NaN False 0.0 53 \n", "2 83 1 1.0 1.0 True 0.0 52 \n", "3 75 0 3.0 NaN False 0.0 55 \n", "5 62 0 0.0 4.0 False 1.0 50 \n", "6 85 1 1.0 1.0 False 0.0 53 \n", ".. ... ... ... ... ... ... ... \n", "52 67 0 2.0 2.0 False 0.0 49 \n", "53 89 1 1.0 0.0 False 0.0 48 \n", "54 76 1 1.0 NaN False 1.0 56 \n", "55 98 1 1.0 1.0 True 0.0 51 \n", "56 93 1 0.0 0.0 False NaN 58 \n", "\n", " non_severe_hl mother_hs early_ident non_white \n", "1 0.0 0.0 False False \n", "2 1.0 NaN False True \n", "3 0.0 1.0 False False \n", "5 0.0 NaN False False \n", "6 0.0 NaN False False \n", ".. ... ... ... ... \n", "52 0.0 1.0 True True \n", "53 1.0 1.0 False True \n", "54 1.0 1.0 False True \n", "55 1.0 1.0 True False \n", "56 1.0 1.0 True False \n", "\n", "[42 rows x 11 columns]" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores.dropna(thresh=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is typically used in time series applications, where there are repeated measurements that are incomplete for some subjects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Try using the `axis` argument to drop columns with missing values:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "# Write your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or a value that is either imputed or carried forward/backward from similar data points. We can do this programmatically in Pandas with the `fillna` argument." ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scoremalesiblingsfamily_invnon_englishprev_disabage_testnon_severe_hlmother_hsearly_identnon_white
04002.02.0False-999.0551.0-999.0FalseFalse
13110.0-999.0False0.0530.00.0FalseFalse
28311.01.0True0.0521.0-999.0FalseTrue
37503.0-999.0False0.0550.01.0FalseFalse
56200.04.0False1.0500.0-999.0FalseFalse
....................................
526702.02.0False0.0490.01.0TrueTrue
538911.00.0False0.0481.01.0FalseTrue
547611.0-999.0False1.0561.01.0FalseTrue
559811.01.0True0.0511.01.0TrueFalse
569310.00.0False-999.0581.01.0TrueFalse
\n", "

50 rows × 11 columns

\n", "
" ], "text/plain": [ " score male siblings family_inv non_english prev_disab age_test \\\n", "0 40 0 2.0 2.0 False -999.0 55 \n", "1 31 1 0.0 -999.0 False 0.0 53 \n", "2 83 1 1.0 1.0 True 0.0 52 \n", "3 75 0 3.0 -999.0 False 0.0 55 \n", "5 62 0 0.0 4.0 False 1.0 50 \n", ".. ... ... ... ... ... ... ... \n", "52 67 0 2.0 2.0 False 0.0 49 \n", "53 89 1 1.0 0.0 False 0.0 48 \n", "54 76 1 1.0 -999.0 False 1.0 56 \n", "55 98 1 1.0 1.0 True 0.0 51 \n", "56 93 1 0.0 0.0 False -999.0 58 \n", "\n", " non_severe_hl mother_hs early_ident non_white \n", "0 1.0 -999.0 False False \n", "1 0.0 0.0 False False \n", "2 1.0 -999.0 False True \n", "3 0.0 1.0 False False \n", "5 0.0 -999.0 False False \n", ".. ... ... ... ... \n", "52 0.0 1.0 True True \n", "53 1.0 1.0 False True \n", "54 1.0 1.0 False True \n", "55 1.0 1.0 True False \n", "56 1.0 1.0 True False \n", "\n", "[50 rows x 11 columns]" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores.fillna(-999)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scoremalesiblingsfamily_invnon_englishprev_disabage_testnon_severe_hlmother_hsearly_identnon_white
04002.02.0False1.0551.0NaNFalseFalse
13110.00.0False0.0530.00.0FalseFalse
28311.01.0True0.0521.0NaNFalseTrue
37503.00.0False0.0550.01.0FalseFalse
56200.04.0False1.0500.0NaNFalseFalse
....................................
526702.02.0False0.0490.01.0TrueTrue
538911.00.0False0.0481.01.0FalseTrue
547611.00.0False1.0561.01.0FalseTrue
559811.01.0True0.0511.01.0TrueFalse
569310.00.0False1.0581.01.0TrueFalse
\n", "

50 rows × 11 columns

\n", "
" ], "text/plain": [ " score male siblings family_inv non_english prev_disab age_test \\\n", "0 40 0 2.0 2.0 False 1.0 55 \n", "1 31 1 0.0 0.0 False 0.0 53 \n", "2 83 1 1.0 1.0 True 0.0 52 \n", "3 75 0 3.0 0.0 False 0.0 55 \n", "5 62 0 0.0 4.0 False 1.0 50 \n", ".. ... ... ... ... ... ... ... \n", "52 67 0 2.0 2.0 False 0.0 49 \n", "53 89 1 1.0 0.0 False 0.0 48 \n", "54 76 1 1.0 0.0 False 1.0 56 \n", "55 98 1 1.0 1.0 True 0.0 51 \n", "56 93 1 0.0 0.0 False 1.0 58 \n", "\n", " non_severe_hl mother_hs early_ident non_white \n", "0 1.0 NaN False False \n", "1 0.0 0.0 False False \n", "2 1.0 NaN False True \n", "3 0.0 1.0 False False \n", "5 0.0 NaN False False \n", ".. ... ... ... ... \n", "52 0.0 1.0 True True \n", "53 1.0 1.0 False True \n", "54 1.0 1.0 False True \n", "55 1.0 1.0 True False \n", "56 1.0 1.0 True False \n", "\n", "[50 rows x 11 columns]" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores.fillna({'family_inv': 0, 'prev_disab': 1})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `fillna` by default returns a new object with the desired filling behavior, rather than changing the `Series` or `DataFrame` in place (**in general, we like to do this, by the way!**)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can alter values in-place using `inplace=True`." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scoremalesiblingsfamily_invnon_englishprev_disabage_testnon_severe_hlmother_hsearly_identnon_white
04002.02.0False0.0551.0NaNFalseFalse
13110.0NaNFalse0.0530.00.0FalseFalse
28311.01.0True0.0521.0NaNFalseTrue
37503.0NaNFalse0.0550.01.0FalseFalse
56200.04.0False1.0500.0NaNFalseFalse
....................................
526702.02.0False0.0490.01.0TrueTrue
538911.00.0False0.0481.01.0FalseTrue
547611.0NaNFalse1.0561.01.0FalseTrue
559811.01.0True0.0511.01.0TrueFalse
569310.00.0False0.0581.01.0TrueFalse
\n", "

50 rows × 11 columns

\n", "
" ], "text/plain": [ " score male siblings family_inv non_english prev_disab age_test \\\n", "0 40 0 2.0 2.0 False 0.0 55 \n", "1 31 1 0.0 NaN False 0.0 53 \n", "2 83 1 1.0 1.0 True 0.0 52 \n", "3 75 0 3.0 NaN False 0.0 55 \n", "5 62 0 0.0 4.0 False 1.0 50 \n", ".. ... ... ... ... ... ... ... \n", "52 67 0 2.0 2.0 False 0.0 49 \n", "53 89 1 1.0 0.0 False 0.0 48 \n", "54 76 1 1.0 NaN False 1.0 56 \n", "55 98 1 1.0 1.0 True 0.0 51 \n", "56 93 1 0.0 0.0 False 0.0 58 \n", "\n", " non_severe_hl mother_hs early_ident non_white \n", "0 1.0 NaN False False \n", "1 0.0 0.0 False False \n", "2 1.0 NaN False True \n", "3 0.0 1.0 False False \n", "5 0.0 NaN False False \n", ".. ... ... ... ... \n", "52 0.0 1.0 True True \n", "53 1.0 1.0 False True \n", "54 1.0 1.0 False True \n", "55 1.0 1.0 True False \n", "56 1.0 1.0 True False \n", "\n", "[50 rows x 11 columns]" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores.prev_disab.fillna(0, inplace=True)\n", "test_scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing values can also be interpolated, using any one of a variety of methods:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scoremalesiblingsfamily_invnon_englishprev_disabage_testnon_severe_hlmother_hsearly_identnon_white
04002.02.0False0.0551.00.0FalseFalse
13110.01.0False0.0530.00.0FalseFalse
28311.01.0True0.0521.01.0FalseTrue
37503.04.0False0.0550.01.0FalseFalse
56200.04.0False1.0500.01.0FalseFalse
....................................
526702.02.0False0.0490.01.0TrueTrue
538911.00.0False0.0481.01.0FalseTrue
547611.01.0False1.0561.01.0FalseTrue
559811.01.0True0.0511.01.0TrueFalse
569310.00.0False0.0581.01.0TrueFalse
\n", "

50 rows × 11 columns

\n", "
" ], "text/plain": [ " score male siblings family_inv non_english prev_disab age_test \\\n", "0 40 0 2.0 2.0 False 0.0 55 \n", "1 31 1 0.0 1.0 False 0.0 53 \n", "2 83 1 1.0 1.0 True 0.0 52 \n", "3 75 0 3.0 4.0 False 0.0 55 \n", "5 62 0 0.0 4.0 False 1.0 50 \n", ".. ... ... ... ... ... ... ... \n", "52 67 0 2.0 2.0 False 0.0 49 \n", "53 89 1 1.0 0.0 False 0.0 48 \n", "54 76 1 1.0 1.0 False 1.0 56 \n", "55 98 1 1.0 1.0 True 0.0 51 \n", "56 93 1 0.0 0.0 False 0.0 58 \n", "\n", " non_severe_hl mother_hs early_ident non_white \n", "0 1.0 0.0 False False \n", "1 0.0 0.0 False False \n", "2 1.0 1.0 False True \n", "3 0.0 1.0 False False \n", "5 0.0 1.0 False False \n", ".. ... ... ... ... \n", "52 0.0 1.0 True True \n", "53 1.0 1.0 False True \n", "54 1.0 1.0 False True \n", "55 1.0 1.0 True False \n", "56 1.0 1.0 True False \n", "\n", "[50 rows x 11 columns]" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores.fillna(method='bfill')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data summarization\n", "\n", "We often wish to summarize data in `Series` or `DataFrame` objects, so that they can more easily be understood or compared with similar data. The NumPy package contains several functions that are useful here, but several summarization or reduction methods are built into Pandas data structures." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "player womacto01schilcu01myersmi01helliri01johnsra05f...\n", "year 200692\n", "stint 113\n", "team CHNBOSNYAMILNYASFNARILANATLNYNTORTBAHOUARIATLM...\n", "lg NLALALNLALNLNLNLNLNLALALNLNLNLALNLNLNLNLALALNL...\n", " ... \n", "ibb 177\n", "hbp 112\n", "sh 138\n", "sf 120\n", "gidp 354\n", "Length: 22, dtype: object" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clearly, `sum` is more meaningful for some columns than others. For methods like `mean` for which application to string variables is not just meaningless, but impossible, these columns are automatically exculded:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "year 2006.92\n", "stint 1.13\n", "g 52.38\n", "ab 136.54\n", "r 18.69\n", " ... \n", "ibb 1.77\n", "hbp 1.12\n", "sh 1.38\n", "sf 1.20\n", "gidp 3.54\n", "Length: 19, dtype: float64" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The important difference between NumPy's functions and Pandas' methods is that the latter have built-in support for handling missing data." ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "score 78.500000\n", "male 0.480000\n", "siblings 1.020000\n", "family_inv 0.972973\n", "non_english 0.320000\n", " ... \n", "age_test 53.120000\n", "non_severe_hl 0.420000\n", "mother_hs 0.787879\n", "early_ident 0.280000\n", "non_white 0.460000\n", "Length: 11, dtype: float64" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes we may not want to ignore missing values, and allow the `nan` to propagate." ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "score 78.50\n", "male 0.48\n", "siblings 1.02\n", "family_inv NaN\n", "non_english 0.32\n", " ... \n", "age_test 53.12\n", "non_severe_hl 0.42\n", "mother_hs NaN\n", "early_ident 0.28\n", "non_white 0.46\n", "Length: 11, dtype: float64" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_scores.mean(skipna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passing `axis=1` will summarize over rows instead of columns, which only makes sense in certain situations." ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id\n", "88653 69\n", "89439 57\n", "89361 56\n", "89462 55\n", "89396 54\n", " ..\n", "89355 0\n", "89354 0\n", "89480 0\n", "89348 0\n", "89420 0\n", "Length: 100, dtype: int64" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extra_bases = baseball[['X2b','X3b','hr']].sum(axis=1)\n", "extra_bases.sort_values(ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A useful summarization that gives a quick snapshot of multiple statistics for a `Series` or `DataFrame` is `describe`:" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearstintgabrhX2bX3bhrrbisbcsbbsoibbhbpshsfgidp
count100.00000100.000000100.000000100.000000100.00000100.000000100.000000100.000000100.000000100.00000100.000000100.000000100.000000100.000000100.000000100.00000100.000000100.000000100.000000
mean2006.920001.13000052.380000136.54000018.6900035.8200007.3900000.5500004.37000018.470001.3800000.46000015.49000024.0800001.7700001.120001.3800001.2000003.540000
std0.272660.33799848.031299181.93685327.7749650.22180711.1172771.4451247.97553728.347933.6948781.06761325.81264932.8044965.0429572.230552.9190422.0350465.201826
min2006.000001.0000001.0000000.0000000.000000.0000000.0000000.0000000.0000000.000000.0000000.0000000.0000000.0000000.0000000.000000.0000000.0000000.000000
25%2007.000001.0000009.5000002.0000000.000000.0000000.0000000.0000000.0000000.000000.0000000.0000000.0000001.0000000.0000000.000000.0000000.0000000.000000
50%2007.000001.00000033.00000040.5000002.000008.0000001.0000000.0000000.0000002.000000.0000000.0000001.0000007.0000000.0000000.000000.0000000.0000001.000000
75%2007.000001.00000083.250000243.75000033.2500062.75000011.7500001.0000006.00000027.000001.0000000.00000019.25000037.2500001.2500001.000001.0000002.0000006.000000
max2007.000002.000000155.000000586.000000107.00000159.00000052.00000012.00000035.00000096.0000022.0000006.000000132.000000134.00000043.00000011.0000014.0000009.00000021.000000
\n", "
" ], "text/plain": [ " year stint g ab r h \\\n", "count 100.00000 100.000000 100.000000 100.000000 100.00000 100.000000 \n", "mean 2006.92000 1.130000 52.380000 136.540000 18.69000 35.820000 \n", "std 0.27266 0.337998 48.031299 181.936853 27.77496 50.221807 \n", "min 2006.00000 1.000000 1.000000 0.000000 0.00000 0.000000 \n", "25% 2007.00000 1.000000 9.500000 2.000000 0.00000 0.000000 \n", "50% 2007.00000 1.000000 33.000000 40.500000 2.00000 8.000000 \n", "75% 2007.00000 1.000000 83.250000 243.750000 33.25000 62.750000 \n", "max 2007.00000 2.000000 155.000000 586.000000 107.00000 159.000000 \n", "\n", " X2b X3b hr rbi sb cs \\\n", "count 100.000000 100.000000 100.000000 100.00000 100.000000 100.000000 \n", "mean 7.390000 0.550000 4.370000 18.47000 1.380000 0.460000 \n", "std 11.117277 1.445124 7.975537 28.34793 3.694878 1.067613 \n", "min 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 \n", "50% 1.000000 0.000000 0.000000 2.00000 0.000000 0.000000 \n", "75% 11.750000 1.000000 6.000000 27.00000 1.000000 0.000000 \n", "max 52.000000 12.000000 35.000000 96.00000 22.000000 6.000000 \n", "\n", " bb so ibb hbp sh sf \\\n", "count 100.000000 100.000000 100.000000 100.00000 100.000000 100.000000 \n", "mean 15.490000 24.080000 1.770000 1.12000 1.380000 1.200000 \n", "std 25.812649 32.804496 5.042957 2.23055 2.919042 2.035046 \n", "min 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 \n", "25% 0.000000 1.000000 0.000000 0.00000 0.000000 0.000000 \n", "50% 1.000000 7.000000 0.000000 0.00000 0.000000 0.000000 \n", "75% 19.250000 37.250000 1.250000 1.00000 1.000000 2.000000 \n", "max 132.000000 134.000000 43.000000 11.00000 14.000000 9.000000 \n", "\n", " gidp \n", "count 100.000000 \n", "mean 3.540000 \n", "std 5.201826 \n", "min 0.000000 \n", "25% 0.000000 \n", "50% 1.000000 \n", "75% 6.000000 \n", "max 21.000000 " ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`describe` can detect non-numeric data and sometimes yield useful information about it." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 100\n", "unique 82\n", "top gomezch02\n", "freq 2\n", "Name: player, dtype: object" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.player.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also calculate summary statistics *across* multiple columns, for example, correlation and covariance.\n", "\n", "$$cov(x,y) = \\sum_i (x_i - \\bar{x})(y_i - \\bar{y})$$" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "69.07646464646454" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.hr.cov(baseball.X2b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$corr(x,y) = \\frac{cov(x,y)}{(n-1)s_x s_y} = \\frac{\\sum_i (x_i - \\bar{x})(y_i - \\bar{y})}{\\sqrt{\\sum_i (x_i - \\bar{x})^2 \\sum_i (y_i - \\bar{y})^2}}$$" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7790615182539742" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.hr.corr(baseball.X2b)" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9942174036272377" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.ab.corr(baseball.h)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try running `corr` on the entire `baseball` DataFrame to see what is returned:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "# Write answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we have a `DataFrame` with a hierarchical index (or indices), summary statistics can be applied with respect to any of the index levels:" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GroupTissueStool
TaxonPatient
Firmicutes101364182
211174703
304083946
418318605
5069350
\n", "
" ], "text/plain": [ " Group Tissue Stool\n", "Taxon Patient \n", "Firmicutes 1 0 136 4182\n", " 2 1 1174 703\n", " 3 0 408 3946\n", " 4 1 831 8605\n", " 5 0 693 50" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb.head()" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GroupTissueStool
Taxon
Firmicutes7963430172
Proteobacteria74250812483
Actinobacteria761671615
Bacteroidetes788804276
Other72868242
\n", "
" ], "text/plain": [ " Group Tissue Stool\n", "Taxon \n", "Firmicutes 7 9634 30172\n", "Proteobacteria 7 42508 12483\n", "Actinobacteria 7 6167 1615\n", "Bacteroidetes 7 8880 4276\n", "Other 7 2868 242" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb.sum(level='Taxon')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing Data to Files\n", "\n", "As well as being able to read several data input formats, Pandas can also export data to a variety of storage formats. We will bring your attention to just a couple of these." ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "mb.to_csv(\"mb.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `to_csv` method writes a `DataFrame` to a comma-separated values (csv) file. You can specify custom delimiters (via `sep` argument), how missing values are written (via `na_rep` argument), whether the index is writen (via `index` argument), whether the header is included (via `header` argument), among other options." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An efficient way of storing data to disk is in binary format. Pandas supports this using Python’s built-in pickle serialization." ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "baseball.to_pickle(\"baseball_pickle\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The complement to `to_pickle` is the `read_pickle` function, which restores the pickle to a `DataFrame` or `Series`:" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
id
88641womacto0120062CHNNL19506141...2.01.01.044.00.00.03.00.00.0
88643schilcu0120061BOSAL312010...0.00.00.001.00.00.00.00.00.0
88645myersmi0120061NYAAL620000...0.00.00.000.00.00.00.00.00.0
88649helliri0120061MILNL203000...0.00.00.002.00.00.00.00.00.0
88650johnsra0520061NYAAL336010...0.00.00.004.00.00.00.00.00.0
..................................................................
89525benitar0120072FLONL340000...0.00.00.000.00.00.00.00.00.0
89526benitar0120071SFNNL190000...0.00.00.000.00.00.00.00.00.0
89530ausmubr0120071HOUNL117349388216...25.06.01.03774.03.06.04.01.011.0
89533aloumo0120071NYNNL873285111219...49.03.00.02730.05.02.00.03.013.0
89534alomasa0220071NYNNL822131...0.00.00.003.00.00.00.00.00.0
\n", "

100 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... rbi \\\n", "id ... \n", "88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 \n", "88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 \n", "88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 \n", "88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 \n", "88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 \n", "... ... ... ... ... .. ... ... .. ... ... ... ... \n", "89525 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 \n", "89526 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 \n", "89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 \n", "89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 \n", "89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 \n", "\n", " sb cs bb so ibb hbp sh sf gidp \n", "id \n", "88641 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "88643 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "88645 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "88649 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", "88650 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... .. ... ... ... ... ... ... \n", "89525 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "89526 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "89530 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n", "89533 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n", "89534 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[100 rows x 22 columns]" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_pickle(\"baseball_pickle\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As Wes warns in his book, it is recommended that binary storage of data via pickle only be used as a temporary storage format, in situations where speed is relevant. This is because there is no guarantee that the pickle format will not change with future versions of Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## References\n", "\n", "[Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) Wes McKinney" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" }, "latex_envs": { "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0 }, "nav_menu": {}, "toc": { "navigate_menu": true, "number_sections": false, "sideBar": false, "threshold": "3", "toc_cell": true, "toc_section_display": "none", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }