{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python pandas Q&A video series by [Data School](http://www.dataschool.io/)\n", "\n", "### [YouTube playlist](https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y) and [GitHub repository](https://github.com/justmarkham/pandas-videos)\n", "\n", "## Table of contents\n", "\n", "1. What is pandas?\n", "2. How do I read a tabular data file into pandas?\n", "3. How do I select a pandas Series from a DataFrame?\n", "4. Why do some pandas commands end with parentheses (and others don't)?\n", "5. How do I rename columns in a pandas DataFrame?\n", "6. How do I remove columns from a pandas DataFrame?\n", "7. How do I sort a pandas DataFrame or a Series?\n", "8. How do I filter rows of a pandas DataFrame by column value?\n", "9. How do I apply multiple filter criteria to a pandas DataFrame?\n", "10. Your pandas questions answered!\n", "11. How do I use the \"axis\" parameter in pandas?\n", "12. How do I use string methods in pandas?\n", "13. How do I change the data type of a pandas Series?\n", "14. When should I use a \"groupby\" in pandas?\n", "15. How do I explore a pandas Series?\n", "16. How do I handle missing values in pandas?\n", "17. What do I need to know about the pandas index? (Part 1)\n", "18. What do I need to know about the pandas index? (Part 2)\n", "19. How do I select multiple rows and columns from a pandas DataFrame?\n", "20. When should I use the \"inplace\" parameter in pandas?\n", "21. How do I make my pandas DataFrame smaller and faster?\n", "22. How do I use pandas with scikit-learn to create Kaggle submissions?\n", "23. More of your pandas questions answered!\n", "24. How do I create dummy variables in pandas?\n", "25. How do I work with dates and times in pandas?\n", "26. How do I find and remove duplicate rows in pandas?\n", "27. How do I avoid a SettingWithCopyWarning in pandas?\n", "28. How do I change display options in pandas?\n", "29. How do I create a pandas DataFrame from another object?\n", "30. How do I apply a function to a pandas Series or DataFrame?" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# conventional way to import pandas\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. What is pandas? ([video](https://www.youtube.com/watch?v=yzIMircGU5I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=1))\n", "\n", "- [pandas main page](http://pandas.pydata.org/)\n", "- [pandas installation instructions](http://pandas.pydata.org/pandas-docs/stable/install.html)\n", "- [Anaconda distribution of Python](https://www.continuum.io/downloads) (includes pandas)\n", "- [How to use the IPython/Jupyter notebook](https://youtu.be/IsXXlYVBt1M?t=5m17s) (video)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. How do I read a tabular data file into pandas? ([video](https://www.youtube.com/watch?v=5_QXMwezPJE&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=2))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read a dataset of Chipotle orders directly from a URL and store the results in a DataFrame\n", "orders = pd.read_table('http://bit.ly/chiporders')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " order_id quantity item_name \\\n", "0 1 1 Chips and Fresh Tomato Salsa \n", "1 1 1 Izze \n", "2 1 1 Nantucket Nectar \n", "3 1 1 Chips and Tomatillo-Green Chili Salsa \n", "4 2 2 Chicken Bowl \n", "\n", " choice_description item_price \n", "0 NaN $2.39 \n", "1 [Clementine] $3.39 \n", "2 [Apple] $3.39 \n", "3 NaN $2.39 \n", "4 [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the first 5 rows\n", "orders.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`read_table`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read a dataset of movie reviewers (modifying the default parameter values for read_table)\n", "user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']\n", "users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user_id age gender occupation zip_code\n", "0 1 24 M technician 85711\n", "1 2 53 F other 94043\n", "2 3 23 M writer 32067\n", "3 4 24 M technician 43537\n", "4 5 33 F other 15213" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the first 5 rows\n", "users.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. How do I select a pandas Series from a DataFrame? ([video](https://www.youtube.com/watch?v=zxqjeyKP2Tk&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=3))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read a dataset of UFO reports into a DataFrame\n", "ufo = pd.read_table('http://bit.ly/uforeports', sep=',')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read_csv is equivalent to read_table, except it assumes a comma separator\n", "ufo = pd.read_csv('http://bit.ly/uforeports')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00\n", "3 Abilene NaN DISK KS 6/1/1931 13:00\n", "4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the first 5 rows\n", "ufo.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Ithaca\n", "1 Willingboro\n", "2 Holyoke\n", "3 Abilene\n", "4 New York Worlds Fair\n", "5 Valley City\n", "6 Crater Lake\n", "7 Alma\n", "8 Eklutna\n", "9 Hubbard\n", "10 Fontana\n", "11 Waterloo\n", "12 Belton\n", "13 Keokuk\n", "14 Ludington\n", "15 Forest Home\n", "16 Los Angeles\n", "17 Hapeville\n", "18 Oneida\n", "19 Bering Sea\n", "20 Nebraska\n", "21 NaN\n", "22 NaN\n", "23 Owensboro\n", "24 Wilderness\n", "25 San Diego\n", "26 Wilderness\n", "27 Clovis\n", "28 Los Alamos\n", "29 Ft. Duschene\n", " ... \n", "18211 Holyoke\n", "18212 Carson\n", "18213 Pasadena\n", "18214 Austin\n", "18215 El Campo\n", "18216 Garden Grove\n", "18217 Berthoud Pass\n", "18218 Sisterdale\n", "18219 Garden Grove\n", "18220 Shasta Lake\n", "18221 Franklin\n", "18222 Albrightsville\n", "18223 Greenville\n", "18224 Eufaula\n", "18225 Simi Valley\n", "18226 San Francisco\n", "18227 San Francisco\n", "18228 Kingsville\n", "18229 Chicago\n", "18230 Pismo Beach\n", "18231 Pismo Beach\n", "18232 Lodi\n", "18233 Anchorage\n", "18234 Capitola\n", "18235 Fountain Hills\n", "18236 Grant Park\n", "18237 Spirit Lake\n", "18238 Eagle River\n", "18239 Eagle River\n", "18240 Ybor\n", "Name: City, dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select the 'City' Series using bracket notation\n", "ufo['City']\n", "\n", "# or equivalently, use dot notation\n", "ufo.City" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Bracket notation** will always work, whereas **dot notation** has limitations:\n", "\n", "- Dot notation doesn't work if there are **spaces** in the Series name\n", "- Dot notation doesn't work if the Series has the same name as a **DataFrame method or attribute** (like 'head' or 'shape')\n", "- Dot notation can't be used to define the name of a **new Series** (see below)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time \\\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00 \n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00 \n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00 \n", "3 Abilene NaN DISK KS 6/1/1931 13:00 \n", "4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00 \n", "\n", " Location \n", "0 Ithaca, NY \n", "1 Willingboro, NJ \n", "2 Holyoke, CO \n", "3 Abilene, KS \n", "4 New York Worlds Fair, NY " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a new 'Location' Series (must use bracket notation to define the Series name)\n", "ufo['Location'] = ufo.City + ', ' + ufo.State\n", "ufo.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Why do some pandas commands end with parentheses (and others don't)? ([video](https://www.youtube.com/watch?v=hSrDViyKWVk&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=4))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read a dataset of top-rated IMDb movies into a DataFrame\n", "movies = pd.read_csv('http://bit.ly/imdbratings')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Methods** end with parentheses, while **attributes** don't:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "0 9.3 The Shawshank Redemption R Crime 142 \n", "1 9.2 The Godfather R Crime 175 \n", "2 9.1 The Godfather: Part II R Crime 200 \n", "3 9.0 The Dark Knight PG-13 Action 152 \n", "4 8.9 Pulp Fiction R Crime 154 \n", "\n", " actors_list \n", "0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n", "1 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... \n", "4 [u'John Travolta', u'Uma Thurman', u'Samuel L.... " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# example method: show the first 5 rows\n", "movies.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating duration\n", "count 979.000000 979.000000\n", "mean 7.889785 120.979571\n", "std 0.336069 26.218010\n", "min 7.400000 64.000000\n", "25% 7.600000 102.000000\n", "50% 7.800000 117.000000\n", "75% 8.100000 134.000000\n", "max 9.300000 242.000000" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# example method: calculate summary statistics\n", "movies.describe()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(979, 6)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# example attribute: number of rows and columns\n", "movies.shape" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "star_rating float64\n", "title object\n", "content_rating object\n", "genre object\n", "duration int64\n", "actors_list object\n", "dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# example attribute: data type of each column\n", "movies.dtypes" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title content_rating genre \\\n", "count 979 976 979 \n", "unique 975 12 16 \n", "top The Girl with the Dragon Tattoo R Drama \n", "freq 2 460 278 \n", "\n", " actors_list \n", "count 979 \n", "unique 969 \n", "top [u'Daniel Radcliffe', u'Emma Watson', u'Rupert... \n", "freq 6 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use an optional parameter to the describe method to summarize only 'object' columns\n", "movies.describe(include=['object'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`describe`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. How do I rename columns in a pandas DataFrame? ([video](https://www.youtube.com/watch?v=0uBirYFhizE&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=5))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read a dataset of UFO reports into a DataFrame\n", "ufo = pd.read_csv('http://bit.ly/uforeports')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'City', u'Colors Reported', u'Shape Reported', u'State', u'Time'], dtype='object')" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the column names\n", "ufo.columns" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'City', u'Colors_Reported', u'Shape_Reported', u'State', u'Time'], dtype='object')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rename two of the columns by using the 'rename' method\n", "ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)\n", "ufo.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`rename`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'city', u'colors reported', u'shape reported', u'state', u'time'], dtype='object')" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# replace all of the column names by overwriting the 'columns' attribute\n", "ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']\n", "ufo.columns = ufo_cols\n", "ufo.columns" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'city', u'colors reported', u'shape reported', u'state', u'time'], dtype='object')" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# replace the column names during the file reading process by using the 'names' parameter\n", "ufo = pd.read_csv('http://bit.ly/uforeports', header=0, names=ufo_cols)\n", "ufo.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`read_csv`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'city', 
u'colors_reported', u'shape_reported', u'state', u'time'], dtype='object')" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# replace all spaces with underscores in the column names by using the 'str.replace' method\n", "ufo.columns = ufo.columns.str.replace(' ', '_')\n", "ufo.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`str.replace`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. How do I remove columns from a pandas DataFrame? ([video](https://www.youtube.com/watch?v=gnUKkS964WQ&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=6))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00\n", "3 Abilene NaN DISK KS 6/1/1931 13:00\n", "4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of UFO reports into a DataFrame\n", "ufo = pd.read_csv('http://bit.ly/uforeports')\n", "ufo.head()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Shape Reported State Time\n", "0 Ithaca TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro OTHER NJ 6/30/1930 20:00\n", "2 Holyoke OVAL CO 2/15/1931 14:00\n", "3 Abilene DISK KS 6/1/1931 13:00\n", "4 New York Worlds Fair LIGHT NY 4/18/1933 19:00" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# remove a single column (axis=1 refers to columns)\n", "ufo.drop('Colors Reported', axis=1, inplace=True)\n", "ufo.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`drop`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Shape Reported Time\n", "0 TRIANGLE 6/1/1930 22:00\n", "1 OTHER 6/30/1930 20:00\n", "2 OVAL 2/15/1931 14:00\n", "3 DISK 6/1/1931 13:00\n", "4 LIGHT 4/18/1933 19:00" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# remove multiple columns at once\n", "ufo.drop(['City', 'State'], axis=1, inplace=True)\n", "ufo.head()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Shape Reported Time\n", "2 OVAL 2/15/1931 14:00\n", "3 DISK 6/1/1931 13:00\n", "4 LIGHT 4/18/1933 19:00\n", "5 DISK 9/15/1934 15:30\n", "6 CIRCLE 6/15/1935 0:00" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# remove multiple rows at once (axis=0 refers to rows)\n", "ufo.drop([0, 1], axis=0, inplace=True)\n", "ufo.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. How do I sort a pandas DataFrame or a Series? ([video](https://www.youtube.com/watch?v=zY4doF6xSxY&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=7))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "0 9.3 The Shawshank Redemption R Crime 142 \n", "1 9.2 The Godfather R Crime 175 \n", "2 9.1 The Godfather: Part II R Crime 200 \n", "3 9.0 The Dark Knight PG-13 Action 152 \n", "4 8.9 Pulp Fiction R Crime 154 \n", "\n", " actors_list \n", "0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n", "1 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... \n", "4 [u'John Travolta', u'Uma Thurman', u'Samuel L.... " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of top-rated IMDb movies into a DataFrame\n", "movies = pd.read_csv('http://bit.ly/imdbratings')\n", "movies.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** None of the sorting methods below affect the underlying data. (In other words, the sorting is temporary)." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "542 (500) Days of Summer\n", "5 12 Angry Men\n", "201 12 Years a Slave\n", "698 127 Hours\n", "110 2001: A Space Odyssey\n", "Name: title, dtype: object" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sort the 'title' Series in ascending order (returns a Series)\n", "movies.title.sort_values().head()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "864 [Rec]\n", "526 Zulu\n", "615 Zombieland\n", "677 Zodiac\n", "955 Zero Dark Thirty\n", "Name: title, dtype: object" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sort in descending order instead\n", "movies.title.sort_values(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`sort_values`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html) for a **Series**. (Prior to version 0.17, use [**`order`**](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.order.html) instead.)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "542 7.8 (500) Days of Summer PG-13 Comedy 95 \n", "5 8.9 12 Angry Men NOT RATED Drama 96 \n", "201 8.1 12 Years a Slave R Biography 134 \n", "698 7.6 127 Hours R Adventure 94 \n", "110 8.3 2001: A Space Odyssey G Mystery 160 \n", "\n", " actors_list \n", "542 [u'Zooey Deschanel', u'Joseph Gordon-Levitt', ... \n", "5 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... \n", "201 [u'Chiwetel Ejiofor', u'Michael Kenneth Willia... \n", "698 [u'James Franco', u'Amber Tamblyn', u'Kate Mara'] \n", "110 [u'Keir Dullea', u'Gary Lockwood', u'William S... " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sort the entire DataFrame by the 'title' Series (returns a DataFrame)\n", "movies.sort_values('title').head()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "864 7.5 [Rec] R Horror 78 \n", "526 7.8 Zulu UNRATED Drama 138 \n", "615 7.7 Zombieland R Comedy 88 \n", "677 7.7 Zodiac R Crime 157 \n", "955 7.4 Zero Dark Thirty R Drama 157 \n", "\n", " actors_list \n", "864 [u'Manuela Velasco', u'Ferran Terraza', u'Jorg... \n", "526 [u'Stanley Baker', u'Jack Hawkins', u'Ulla Jac... \n", "615 [u'Jesse Eisenberg', u'Emma Stone', u'Woody Ha... \n", "677 [u'Jake Gyllenhaal', u'Robert Downey Jr.', u'M... \n", "955 [u'Jessica Chastain', u'Joel Edgerton', u'Chri... " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sort in descending order instead\n", "movies.sort_values('title', ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`sort_values`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) for a **DataFrame**. (Prior to version 0.17, use [**`sort`**](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.sort.html) instead.)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title content_rating genre \\\n", "713 7.6 The Jungle Book APPROVED Animation \n", "513 7.8 Invasion of the Body Snatchers APPROVED Horror \n", "272 8.1 The Killing APPROVED Crime \n", "703 7.6 Dracula APPROVED Horror \n", "612 7.7 A Hard Day's Night APPROVED Comedy \n", "\n", " duration actors_list \n", "713 78 [u'Phil Harris', u'Sebastian Cabot', u'Louis P... \n", "513 80 [u'Kevin McCarthy', u'Dana Wynter', u'Larry Ga... \n", "272 85 [u'Sterling Hayden', u'Coleen Gray', u'Vince E... \n", "703 85 [u'Bela Lugosi', u'Helen Chandler', u'David Ma... \n", "612 87 [u'John Lennon', u'Paul McCartney', u'George H... " ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sort the DataFrame first by 'content_rating', then by 'duration'\n", "movies.sort_values(['content_rating', 'duration']).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Summary of changes to the sorting API](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#changes-to-sorting-api) in pandas 0.17\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. How do I filter rows of a pandas DataFrame by column value? ([video](https://www.youtube.com/watch?v=2AFGPdNn4FM&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=8))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "0 9.3 The Shawshank Redemption R Crime 142 \n", "1 9.2 The Godfather R Crime 175 \n", "2 9.1 The Godfather: Part II R Crime 200 \n", "3 9.0 The Dark Knight PG-13 Action 152 \n", "4 8.9 Pulp Fiction R Crime 154 \n", "\n", " actors_list \n", "0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n", "1 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... \n", "4 [u'John Travolta', u'Uma Thurman', u'Samuel L.... " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of top-rated IMDb movies into a DataFrame\n", "movies = pd.read_csv('http://bit.ly/imdbratings')\n", "movies.head()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(979, 6)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the number of rows and columns\n", "movies.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create a list in which each element refers to a DataFrame row: True if the row satisfies the condition, False otherwise\n", "booleans = []\n", "for length in movies.duration:\n", " if length >= 200:\n", " booleans.append(True)\n", " else:\n", " booleans.append(False)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "979" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# confirm that the list has the same length as the DataFrame\n", "len(booleans)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[False, False, True, False, False]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the first five list elements\n", "booleans[0:5]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "4 False\n", "dtype: bool" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert the list to a Series\n", "is_long = pd.Series(booleans)\n", "is_long.head()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title \\\n", "2 9.1 The Godfather: Part II \n", "7 8.9 The Lord of the Rings: The Return of the King \n", "17 8.7 Seven Samurai \n", "78 8.4 Once Upon a Time in America \n", "85 8.4 Lawrence of Arabia \n", "142 8.3 Lagaan: Once Upon a Time in India \n", "157 8.2 Gone with the Wind \n", "204 8.1 Ben-Hur \n", "445 7.9 The Ten Commandments \n", "476 7.8 Hamlet \n", "630 7.7 Malcolm X \n", "767 7.6 It's a Mad, Mad, Mad, Mad World \n", "\n", " content_rating genre duration \\\n", "2 R Crime 200 \n", "7 PG-13 Adventure 201 \n", "17 UNRATED Drama 207 \n", "78 R Crime 229 \n", "85 PG Adventure 216 \n", "142 PG Adventure 224 \n", "157 G Drama 238 \n", "204 G Adventure 212 \n", "445 APPROVED Adventure 220 \n", "476 PG-13 Drama 242 \n", "630 PG-13 Biography 202 \n", "767 APPROVED Action 205 \n", "\n", " actors_list \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "7 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... \n", "17 [u'Toshir\\xf4 Mifune', u'Takashi Shimura', u'K... \n", "78 [u'Robert De Niro', u'James Woods', u'Elizabet... \n", "85 [u\"Peter O'Toole\", u'Alec Guinness', u'Anthony... \n", "142 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell... \n", "157 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit... \n", "204 [u'Charlton Heston', u'Jack Hawkins', u'Stephe... \n", "445 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba... \n", "476 [u'Kenneth Branagh', u'Julie Christie', u'Dere... \n", "630 [u'Denzel Washington', u'Angela Bassett', u'De... \n", "767 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me... " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use bracket notation with the boolean Series to tell the DataFrame which rows to display\n", "movies[is_long]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title \\\n", "2 9.1 The Godfather: Part II \n", "7 8.9 The Lord of the Rings: The Return of the King \n", "17 8.7 Seven Samurai \n", "78 8.4 Once Upon a Time in America \n", "85 8.4 Lawrence of Arabia \n", "142 8.3 Lagaan: Once Upon a Time in India \n", "157 8.2 Gone with the Wind \n", "204 8.1 Ben-Hur \n", "445 7.9 The Ten Commandments \n", "476 7.8 Hamlet \n", "630 7.7 Malcolm X \n", "767 7.6 It's a Mad, Mad, Mad, Mad World \n", "\n", " content_rating genre duration \\\n", "2 R Crime 200 \n", "7 PG-13 Adventure 201 \n", "17 UNRATED Drama 207 \n", "78 R Crime 229 \n", "85 PG Adventure 216 \n", "142 PG Adventure 224 \n", "157 G Drama 238 \n", "204 G Adventure 212 \n", "445 APPROVED Adventure 220 \n", "476 PG-13 Drama 242 \n", "630 PG-13 Biography 202 \n", "767 APPROVED Action 205 \n", "\n", " actors_list \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "7 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... \n", "17 [u'Toshir\\xf4 Mifune', u'Takashi Shimura', u'K... \n", "78 [u'Robert De Niro', u'James Woods', u'Elizabet... \n", "85 [u\"Peter O'Toole\", u'Alec Guinness', u'Anthony... \n", "142 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell... \n", "157 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit... \n", "204 [u'Charlton Heston', u'Jack Hawkins', u'Stephe... \n", "445 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba... \n", "476 [u'Kenneth Branagh', u'Julie Christie', u'Dere... \n", "630 [u'Denzel Washington', u'Angela Bassett', u'De... \n", "767 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me... " ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# simplify the steps above: no need to write a for loop to create 'is_long' since pandas will broadcast the comparison\n", "is_long = movies.duration >= 200\n", "movies[is_long]\n", "\n", "# or equivalently, write it in one line (no need to create the 'is_long' object)\n", "movies[movies.duration >= 200]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2 Crime\n", "7 Adventure\n", "17 Drama\n", "78 Crime\n", "85 Adventure\n", "142 Adventure\n", "157 Drama\n", "204 Adventure\n", "445 Adventure\n", "476 Drama\n", "630 Biography\n", "767 Action\n", "Name: genre, dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select the 'genre' Series from the filtered DataFrame\n", "movies[movies.duration >= 200].genre\n", "\n", "# or equivalently, use the 'loc' method\n", "movies.loc[movies.duration >= 200, 'genre']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`loc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. How do I apply multiple filter criteria to a pandas DataFrame? ([video](https://www.youtube.com/watch?v=YPItfQ87qjM&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=9))" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "0 9.3 The Shawshank Redemption R Crime 142 \n", "1 9.2 The Godfather R Crime 175 \n", "2 9.1 The Godfather: Part II R Crime 200 \n", "3 9.0 The Dark Knight PG-13 Action 152 \n", "4 8.9 Pulp Fiction R Crime 154 \n", "\n", " actors_list \n", "0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n", "1 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... \n", "4 [u'John Travolta', u'Uma Thurman', u'Samuel L.... " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of top-rated IMDb movies into a DataFrame\n", "movies = pd.read_csv('http://bit.ly/imdbratings')\n", "movies.head()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title \\\n", "2 9.1 The Godfather: Part II \n", "7 8.9 The Lord of the Rings: The Return of the King \n", "17 8.7 Seven Samurai \n", "78 8.4 Once Upon a Time in America \n", "85 8.4 Lawrence of Arabia \n", "142 8.3 Lagaan: Once Upon a Time in India \n", "157 8.2 Gone with the Wind \n", "204 8.1 Ben-Hur \n", "445 7.9 The Ten Commandments \n", "476 7.8 Hamlet \n", "630 7.7 Malcolm X \n", "767 7.6 It's a Mad, Mad, Mad, Mad World \n", "\n", " content_rating genre duration \\\n", "2 R Crime 200 \n", "7 PG-13 Adventure 201 \n", "17 UNRATED Drama 207 \n", "78 R Crime 229 \n", "85 PG Adventure 216 \n", "142 PG Adventure 224 \n", "157 G Drama 238 \n", "204 G Adventure 212 \n", "445 APPROVED Adventure 220 \n", "476 PG-13 Drama 242 \n", "630 PG-13 Biography 202 \n", "767 APPROVED Action 205 \n", "\n", " actors_list \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "7 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... \n", "17 [u'Toshir\\xf4 Mifune', u'Takashi Shimura', u'K... \n", "78 [u'Robert De Niro', u'James Woods', u'Elizabet... \n", "85 [u\"Peter O'Toole\", u'Alec Guinness', u'Anthony... \n", "142 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell... \n", "157 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit... \n", "204 [u'Charlton Heston', u'Jack Hawkins', u'Stephe... \n", "445 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba... \n", "476 [u'Kenneth Branagh', u'Julie Christie', u'Dere... \n", "630 [u'Denzel Washington', u'Angela Bassett', u'De... \n", "767 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me... " ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# filter the DataFrame to only show movies with a 'duration' of at least 200 minutes\n", "movies[movies.duration >= 200]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Understanding **logical operators:**\n", "\n", "- **`and`**: True only if **both sides** of the operator are True\n", "- **`or`**: True if **either side** of the operator is True" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "False\n", "False\n" ] } ], "source": [ "# demonstration of the 'and' operator\n", "print(True and True)\n", "print(True and False)\n", "print(False and False)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "True\n", "False\n" ] } ], "source": [ "# demonstration of the 'or' operator\n", "print(True or True)\n", "print(True or False)\n", "print(False or False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rules for specifying **multiple filter criteria** in pandas:\n", "\n", "- use **`&`** instead of **`and`**\n", "- use **`|`** instead of **`or`**\n", "- add **parentheses** around each condition to specify evaluation order" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Further filter the DataFrame of long movies (duration >= 200) to only show movies which also have a 'genre' of 'Drama'" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "17 8.7 Seven Samurai UNRATED Drama 207 \n", "157 8.2 Gone with the Wind G Drama 238 \n", "476 7.8 Hamlet PG-13 Drama 242 \n", "\n", " actors_list \n", "17 [u'Toshir\\xf4 Mifune', u'Takashi Shimura', u'K... \n", "157 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit... \n", "476 [u'Kenneth Branagh', u'Julie Christie', u'Dere... " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# CORRECT: use the '&' operator to specify that both conditions are required\n", "movies[(movies.duration >=200) & (movies.genre == 'Drama')]" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " star_rating title content_rating \\\n", "2 9.1 The Godfather: Part II R \n", "5 8.9 12 Angry Men NOT RATED \n", "7 8.9 The Lord of the Rings: The Return of the King PG-13 \n", "9 8.9 Fight Club R \n", "13 8.8 Forrest Gump PG-13 \n", "\n", " genre duration actors_list \n", "2 Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "5 Drama 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... \n", "7 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... \n", "9 Drama 139 [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... \n", "13 Drama 142 [u'Tom Hanks', u'Robin Wright', u'Gary Sinise'] " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# INCORRECT: using the '|' operator would have shown movies that are either long or dramas (or both)\n", "movies[(movies.duration >=200) | (movies.genre == 'Drama')].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Filter the original DataFrame to show movies with a 'genre' of 'Crime' or 'Drama' or 'Action'" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
star_ratingtitlecontent_ratinggenredurationactors_list
09.3The Shawshank RedemptionRCrime142[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
19.2The GodfatherRCrime175[u'Marlon Brando', u'Al Pacino', u'James Caan']
29.1The Godfather: Part IIRCrime200[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
39.0The Dark KnightPG-13Action152[u'Christian Bale', u'Heath Ledger', u'Aaron E...
48.9Pulp FictionRCrime154[u'John Travolta', u'Uma Thurman', u'Samuel L....
58.912 Angry MenNOT RATEDDrama96[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
98.9Fight ClubRDrama139[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
118.8InceptionPG-13Action148[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...
128.8Star Wars: Episode V - The Empire Strikes BackPGAction124[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi...
138.8Forrest GumpPG-13Drama142[u'Tom Hanks', u'Robin Wright', u'Gary Sinise']
\n", "
" ], "text/plain": [ " star_rating title \\\n", "0 9.3 The Shawshank Redemption \n", "1 9.2 The Godfather \n", "2 9.1 The Godfather: Part II \n", "3 9.0 The Dark Knight \n", "4 8.9 Pulp Fiction \n", "5 8.9 12 Angry Men \n", "9 8.9 Fight Club \n", "11 8.8 Inception \n", "12 8.8 Star Wars: Episode V - The Empire Strikes Back \n", "13 8.8 Forrest Gump \n", "\n", " content_rating genre duration \\\n", "0 R Crime 142 \n", "1 R Crime 175 \n", "2 R Crime 200 \n", "3 PG-13 Action 152 \n", "4 R Crime 154 \n", "5 NOT RATED Drama 96 \n", "9 R Drama 139 \n", "11 PG-13 Action 148 \n", "12 PG Action 124 \n", "13 PG-13 Drama 142 \n", "\n", " actors_list \n", "0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n", "1 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... \n", "4 [u'John Travolta', u'Uma Thurman', u'Samuel L.... \n", "5 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... \n", "9 [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... \n", "11 [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'... \n", "12 [u'Mark Hamill', u'Harrison Ford', u'Carrie Fi... \n", "13 [u'Tom Hanks', u'Robin Wright', u'Gary Sinise'] " ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use the '|' operator to specify that a row can match any of the three criteria\n", "movies[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action')].head(10)\n", "\n", "# or equivalently, use the 'isin' method\n", "movies[movies.genre.isin(['Crime', 'Drama', 'Action'])].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`isin`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10. Your pandas questions answered! ([video](https://www.youtube.com/watch?v=B-r9VuK80dk&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** When reading from a file, how do I read in only a subset of the columns?" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'City', u'Colors Reported', u'Shape Reported', u'State', u'Time'], dtype='object')" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of UFO reports into a DataFrame, and check the columns\n", "ufo = pd.read_csv('http://bit.ly/uforeports')\n", "ufo.columns" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'City', u'Time'], dtype='object')" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# specify which columns to include by name\n", "ufo = pd.read_csv('http://bit.ly/uforeports', usecols=['City', 'State'])\n", "\n", "# or equivalently, specify columns by position\n", "ufo = pd.read_csv('http://bit.ly/uforeports', usecols=[0, 4])\n", "ufo.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** When reading from a file, how do I read in only a subset of the rows?" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedStateTime
0IthacaNaNTRIANGLENY6/1/1930 22:00
1WillingboroNaNOTHERNJ6/30/1930 20:00
2HolyokeNaNOVALCO2/15/1931 14:00
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# specify how many rows to read\n", "ufo = pd.read_csv('http://bit.ly/uforeports', nrows=3)\n", "ufo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`read_csv`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** How do I iterate through a Series?" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ithaca\n", "Willingboro\n", "Holyoke\n" ] } ], "source": [ "# Series are directly iterable (like a list)\n", "for c in ufo.City:\n", " print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** How do I iterate through a DataFrame?" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(0, 'Ithaca', 'NY')\n", "(1, 'Willingboro', 'NJ')\n", "(2, 'Holyoke', 'CO')\n" ] } ], "source": [ "# various methods are available to iterate through a DataFrame\n", "for index, row in ufo.iterrows():\n", " print(index, row.City, row.State)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`iterrows`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** How do I drop all non-numeric columns from a DataFrame?" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "country object\n", "beer_servings int64\n", "spirit_servings int64\n", "wine_servings int64\n", "total_litres_of_pure_alcohol float64\n", "continent object\n", "dtype: object" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of alcohol consumption into a DataFrame, and check the data types\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n", "drinks.dtypes" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "beer_servings int64\n", "spirit_servings int64\n", "wine_servings int64\n", "total_litres_of_pure_alcohol float64\n", "dtype: object" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# only include numeric columns in the DataFrame\n", "import numpy as np\n", "drinks.select_dtypes(include=[np.number]).dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`select_dtypes`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** How do I know whether I should pass an argument as a string or a list?" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
beer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcohol
count193.000000193.000000193.000000193.000000
mean106.16062280.99481949.4507774.717098
std101.14310388.28431279.6975983.773298
min0.0000000.0000000.0000000.000000
25%20.0000004.0000001.0000001.300000
50%76.00000056.0000008.0000004.200000
75%188.000000128.00000059.0000007.200000
max376.000000438.000000370.00000014.400000
\n", "
" ], "text/plain": [ " beer_servings spirit_servings wine_servings \\\n", "count 193.000000 193.000000 193.000000 \n", "mean 106.160622 80.994819 49.450777 \n", "std 101.143103 88.284312 79.697598 \n", "min 0.000000 0.000000 0.000000 \n", "25% 20.000000 4.000000 1.000000 \n", "50% 76.000000 56.000000 8.000000 \n", "75% 188.000000 128.000000 59.000000 \n", "max 376.000000 438.000000 370.000000 \n", "\n", " total_litres_of_pure_alcohol \n", "count 193.000000 \n", "mean 4.717098 \n", "std 3.773298 \n", "min 0.000000 \n", "25% 1.300000 \n", "50% 4.200000 \n", "75% 7.200000 \n", "max 14.400000 " ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# describe all of the numeric columns\n", "drinks.describe()" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
count193193.000000193.000000193.000000193.000000193
unique193NaNNaNNaNNaN6
topLesothoNaNNaNNaNNaNAfrica
freq1NaNNaNNaNNaN53
meanNaN106.16062280.99481949.4507774.717098NaN
stdNaN101.14310388.28431279.6975983.773298NaN
minNaN0.0000000.0000000.0000000.000000NaN
25%NaN20.0000004.0000001.0000001.300000NaN
50%NaN76.00000056.0000008.0000004.200000NaN
75%NaN188.000000128.00000059.0000007.200000NaN
maxNaN376.000000438.000000370.00000014.400000NaN
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "count 193 193.000000 193.000000 193.000000 \n", "unique 193 NaN NaN NaN \n", "top Lesotho NaN NaN NaN \n", "freq 1 NaN NaN NaN \n", "mean NaN 106.160622 80.994819 49.450777 \n", "std NaN 101.143103 88.284312 79.697598 \n", "min NaN 0.000000 0.000000 0.000000 \n", "25% NaN 20.000000 4.000000 1.000000 \n", "50% NaN 76.000000 56.000000 8.000000 \n", "75% NaN 188.000000 128.000000 59.000000 \n", "max NaN 376.000000 438.000000 370.000000 \n", "\n", " total_litres_of_pure_alcohol continent \n", "count 193.000000 193 \n", "unique NaN 6 \n", "top NaN Africa \n", "freq NaN 53 \n", "mean 4.717098 NaN \n", "std 3.773298 NaN \n", "min 0.000000 NaN \n", "25% 1.300000 NaN \n", "50% 4.200000 NaN \n", "75% 7.200000 NaN \n", "max 14.400000 NaN " ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pass the string 'all' to describe all columns\n", "drinks.describe(include='all')" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrytotal_litres_of_pure_alcoholcontinent
count193193.000000193
unique193NaN6
topLesothoNaNAfrica
freq1NaN53
meanNaN4.717098NaN
stdNaN3.773298NaN
minNaN0.000000NaN
25%NaN1.300000NaN
50%NaN4.200000NaN
75%NaN7.200000NaN
maxNaN14.400000NaN
\n", "
" ], "text/plain": [ " country total_litres_of_pure_alcohol continent\n", "count 193 193.000000 193\n", "unique 193 NaN 6\n", "top Lesotho NaN Africa\n", "freq 1 NaN 53\n", "mean NaN 4.717098 NaN\n", "std NaN 3.773298 NaN\n", "min NaN 0.000000 NaN\n", "25% NaN 1.300000 NaN\n", "50% NaN 4.200000 NaN\n", "75% NaN 7.200000 NaN\n", "max NaN 14.400000 NaN" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pass a list of data types to only describe certain types\n", "drinks.describe(include=['object', 'float64'])" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrycontinent
count193193
unique1936
topLesothoAfrica
freq153
\n", "
" ], "text/plain": [ " country continent\n", "count 193 193\n", "unique 193 6\n", "top Lesotho Africa\n", "freq 1 53" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pass a list even if you only want to describe a single data type\n", "drinks.describe(include=['object'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`describe`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 11. How do I use the \"axis\" parameter in pandas? ([video](https://www.youtube.com/watch?v=PtO3t6ynH-8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=11))" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0Afghanistan0000.0Asia
1Albania89132544.9Europe
2Algeria250140.7Africa
3Andorra24513831212.4Europe
4Angola21757455.9Africa
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "2 0.7 Africa \n", "3 12.4 Europe \n", "4 5.9 Africa " ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of alcohol consumption into a DataFrame\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n", "drinks.head()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcohol
0Afghanistan0000.0
1Albania89132544.9
2Algeria250140.7
3Andorra24513831212.4
4Angola21757455.9
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol \n", "0 0.0 \n", "1 4.9 \n", "2 0.7 \n", "3 12.4 \n", "4 5.9 " ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop a column (temporarily)\n", "drinks.drop('continent', axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`drop`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0Afghanistan0000.0Asia
1Albania89132544.9Europe
3Andorra24513831212.4Europe
4Angola21757455.9Africa
5Antigua & Barbuda102128454.9North America
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "5 Antigua & Barbuda 102 128 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "3 12.4 Europe \n", "4 5.9 Africa \n", "5 4.9 North America " ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop a row (temporarily)\n", "drinks.drop(2, axis=0).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When **referring to rows or columns** with the axis parameter:\n", "\n", "- **axis 0** refers to rows\n", "- **axis 1** refers to columns" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "beer_servings 106.160622\n", "spirit_servings 80.994819\n", "wine_servings 49.450777\n", "total_litres_of_pure_alcohol 4.717098\n", "dtype: float64" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the mean of each numeric column\n", "drinks.mean()\n", "\n", "# or equivalently, specify the axis explicitly\n", "drinks.mean(axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`mean`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.000\n", "1 69.975\n", "2 9.925\n", "3 176.850\n", "4 81.225\n", "dtype: float64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the mean of each row\n", "drinks.mean(axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When performing a **mathematical operation** with the axis parameter:\n", "\n", "- **axis 0** means the operation should \"move down\" the row axis\n", "- **axis 1** means the operation should \"move across\" the column axis" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "beer_servings 106.160622\n", "spirit_servings 80.994819\n", "wine_servings 49.450777\n", "total_litres_of_pure_alcohol 4.717098\n", "dtype: float64" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'index' is an alias for axis 0\n", "drinks.mean(axis='index')" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.000\n", "1 69.975\n", "2 9.925\n", "3 176.850\n", "4 81.225\n", "dtype: float64" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'columns' is an alias for axis 1\n", "drinks.mean(axis='columns').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 12. How do I use string methods in pandas? ([video](https://www.youtube.com/watch?v=bofaC0IckHo&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=12))" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
order_idquantityitem_namechoice_descriptionitem_price
011Chips and Fresh Tomato SalsaNaN$2.39
111Izze[Clementine]$3.39
211Nantucket Nectar[Apple]$3.39
311Chips and Tomatillo-Green Chili SalsaNaN$2.39
422Chicken Bowl[Tomatillo-Red Chili Salsa (Hot), [Black Beans...$16.98
\n", "
" ], "text/plain": [ " order_id quantity item_name \\\n", "0 1 1 Chips and Fresh Tomato Salsa \n", "1 1 1 Izze \n", "2 1 1 Nantucket Nectar \n", "3 1 1 Chips and Tomatillo-Green Chili Salsa \n", "4 2 2 Chicken Bowl \n", "\n", " choice_description item_price \n", "0 NaN $2.39 \n", "1 [Clementine] $3.39 \n", "2 [Apple] $3.39 \n", "3 NaN $2.39 \n", "4 [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98 " ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of Chipotle orders into a DataFrame\n", "orders = pd.read_table('http://bit.ly/chiporders')\n", "orders.head()" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'HELLO'" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# normal way to access string methods in Python\n", "'hello'.upper()" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 CHIPS AND FRESH TOMATO SALSA\n", "1 IZZE\n", "2 NANTUCKET NECTAR\n", "3 CHIPS AND TOMATILLO-GREEN CHILI SALSA\n", "4 CHICKEN BOWL\n", "Name: item_name, dtype: object" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# string methods for pandas Series are accessed via 'str'\n", "orders.item_name.str.upper().head()" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 True\n", "Name: item_name, dtype: bool" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# string method 'contains' checks for a substring and returns a boolean Series\n", "orders.item_name.str.contains('Chicken').head()" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
order_idquantityitem_namechoice_descriptionitem_price
422Chicken Bowl[Tomatillo-Red Chili Salsa (Hot), [Black Beans...$16.98
531Chicken Bowl[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...$10.98
1161Chicken Crispy Tacos[Roasted Chili Corn Salsa, [Fajita Vegetables,...$8.75
1261Chicken Soft Tacos[Roasted Chili Corn Salsa, [Rice, Black Beans,...$8.75
1371Chicken Bowl[Fresh Tomato Salsa, [Fajita Vegetables, Rice,...$11.25
\n", "
" ], "text/plain": [ " order_id quantity item_name \\\n", "4 2 2 Chicken Bowl \n", "5 3 1 Chicken Bowl \n", "11 6 1 Chicken Crispy Tacos \n", "12 6 1 Chicken Soft Tacos \n", "13 7 1 Chicken Bowl \n", "\n", " choice_description item_price \n", "4 [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98 \n", "5 [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... $10.98 \n", "11 [Roasted Chili Corn Salsa, [Fajita Vegetables,... $8.75 \n", "12 [Roasted Chili Corn Salsa, [Rice, Black Beans,... $8.75 \n", "13 [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... $11.25 " ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use the boolean Series to filter the DataFrame\n", "orders[orders.item_name.str.contains('Chicken')].head()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 Clementine\n", "2 Apple\n", "3 NaN\n", "4 Tomatillo-Red Chili Salsa (Hot), Black Beans, ...\n", "Name: choice_description, dtype: object" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# string methods can be chained together\n", "orders.choice_description.str.replace('[', '').str.replace(']', '').head()" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 Clementine\n", "2 Apple\n", "3 NaN\n", "4 Tomatillo-Red Chili Salsa (Hot), Black Beans, ...\n", "Name: choice_description, dtype: object" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# many pandas string methods support regular expressions (regex)\n", "orders.choice_description.str.replace('[\\[\\]]', '').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[String handling section](http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling) of the pandas API reference\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 13. How do I change the data type of a pandas Series? ([video](https://www.youtube.com/watch?v=V0AWyzVMf54&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=13))" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0Afghanistan0000.0Asia
1Albania89132544.9Europe
2Algeria250140.7Africa
3Andorra24513831212.4Europe
4Angola21757455.9Africa
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "2 0.7 Africa \n", "3 12.4 Europe \n", "4 5.9 Africa " ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of alcohol consumption into a DataFrame\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n", "drinks.head()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "country object\n", "beer_servings int64\n", "spirit_servings int64\n", "wine_servings int64\n", "total_litres_of_pure_alcohol float64\n", "continent object\n", "dtype: object" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the data type of each Series\n", "drinks.dtypes" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "country object\n", "beer_servings float64\n", "spirit_servings int64\n", "wine_servings int64\n", "total_litres_of_pure_alcohol float64\n", "continent object\n", "dtype: object" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# change the data type of an existing Series\n", "drinks['beer_servings'] = drinks.beer_servings.astype(float)\n", "drinks.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`astype`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.astype.html)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "country object\n", "beer_servings float64\n", "spirit_servings int64\n", "wine_servings int64\n", "total_litres_of_pure_alcohol float64\n", "continent object\n", "dtype: object" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# alternatively, change the data type of a Series while reading in a file\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry', dtype={'beer_servings':float})\n", "drinks.dtypes" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
order_idquantityitem_namechoice_descriptionitem_price
011Chips and Fresh Tomato SalsaNaN$2.39
111Izze[Clementine]$3.39
211Nantucket Nectar[Apple]$3.39
311Chips and Tomatillo-Green Chili SalsaNaN$2.39
422Chicken Bowl[Tomatillo-Red Chili Salsa (Hot), [Black Beans...$16.98
\n", "
" ], "text/plain": [ " order_id quantity item_name \\\n", "0 1 1 Chips and Fresh Tomato Salsa \n", "1 1 1 Izze \n", "2 1 1 Nantucket Nectar \n", "3 1 1 Chips and Tomatillo-Green Chili Salsa \n", "4 2 2 Chicken Bowl \n", "\n", " choice_description item_price \n", "0 NaN $2.39 \n", "1 [Clementine] $3.39 \n", "2 [Apple] $3.39 \n", "3 NaN $2.39 \n", "4 [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98 " ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of Chipotle orders into a DataFrame\n", "orders = pd.read_table('http://bit.ly/chiporders')\n", "orders.head()" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "order_id int64\n", "quantity int64\n", "item_name object\n", "choice_description object\n", "item_price object\n", "dtype: object" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the data type of each Series\n", "orders.dtypes" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "7.464335785374397" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert a string to a number in order to do math\n", "orders.item_price.str.replace('$', '').astype(float).mean()" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 True\n", "Name: item_name, dtype: bool" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# string method 'contains' checks for a substring and returns a boolean Series\n", "orders.item_name.str.contains('Chicken').head()" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 1\n", "Name: item_name, dtype: int32" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert a boolean Series to an integer (False = 0, True = 1)\n", "orders.item_name.str.contains('Chicken').astype(int).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 14. When should I use a \"groupby\" in pandas? ([video](https://www.youtube.com/watch?v=qy0fDqoMJx8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=14))" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0Afghanistan0000.0Asia
1Albania89132544.9Europe
2Algeria250140.7Africa
3Andorra24513831212.4Europe
4Angola21757455.9Africa
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "2 0.7 Africa \n", "3 12.4 Europe \n", "4 5.9 Africa " ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of alcohol consumption into a DataFrame\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n", "drinks.head()" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "106.16062176165804" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the mean beer servings across the entire dataset\n", "drinks.beer_servings.mean()" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "61.471698113207545" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the mean beer servings just for countries in Africa\n", "drinks[drinks.continent=='Africa'].beer_servings.mean()" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "continent\n", "Africa 61.471698\n", "Asia 37.045455\n", "Europe 193.777778\n", "North America 145.434783\n", "Oceania 89.687500\n", "South America 175.083333\n", "Name: beer_servings, dtype: float64" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the mean beer servings for each continent\n", "drinks.groupby('continent').beer_servings.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`groupby`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "continent\n", "Africa 376\n", "Asia 247\n", "Europe 361\n", "North America 285\n", "Oceania 306\n", "South America 333\n", "Name: beer_servings, dtype: int64" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# other aggregation functions (such as 'max') can also be used with groupby\n", "drinks.groupby('continent').beer_servings.max()" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanminmax
continent
Africa5361.4716980376
Asia4437.0454550247
Europe45193.7777780361
North America23145.4347831285
Oceania1689.6875000306
South America12175.08333393333
\n", "
" ], "text/plain": [ " count mean min max\n", "continent \n", "Africa 53 61.471698 0 376\n", "Asia 44 37.045455 0 247\n", "Europe 45 193.777778 0 361\n", "North America 23 145.434783 1 285\n", "Oceania 16 89.687500 0 306\n", "South America 12 175.083333 93 333" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# multiple aggregation functions can be applied simultaneously\n", "drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`agg`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
beer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcohol
continent
Africa61.47169816.33962316.2641513.007547
Asia37.04545560.8409099.0681822.170455
Europe193.777778132.555556142.2222228.617778
North America145.434783165.73913024.5217395.995652
Oceania89.68750058.43750035.6250003.381250
South America175.083333114.75000062.4166676.308333
\n", "
" ], "text/plain": [ " beer_servings spirit_servings wine_servings \\\n", "continent \n", "Africa 61.471698 16.339623 16.264151 \n", "Asia 37.045455 60.840909 9.068182 \n", "Europe 193.777778 132.555556 142.222222 \n", "North America 145.434783 165.739130 24.521739 \n", "Oceania 89.687500 58.437500 35.625000 \n", "South America 175.083333 114.750000 62.416667 \n", "\n", " total_litres_of_pure_alcohol \n", "continent \n", "Africa 3.007547 \n", "Asia 2.170455 \n", "Europe 8.617778 \n", "North America 5.995652 \n", "Oceania 3.381250 \n", "South America 6.308333 " ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# specifying a column to which the aggregation function should be applied is not required\n", "drinks.groupby('continent').mean()" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# allow plots to appear in the notebook\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAFOCAYAAACWguaYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl8VNX5x/HPE2RRIZAASQiEkOKO/iiCuKFGrVoRRa3s\ni6C1dcECVqsWkdC6r63+Wi1KFVQEbdWC8kNaFbWoRVS0iqCCBAgSZU2gIkue3x8zmSaYkIRMcieX\n7/v1mhczd33uvcyTM+eee465OyIi0vAlBR2AiIjEhxK6iEhIKKGLiISEErqISEgooYuIhIQSuohI\nSFSZ0M2sg5m9amafmNm/zewX0ekpZjbXzJaa2ctm1rLMOjea2edm9qmZnVmXByAiIhFWVTt0M8sA\nMtx9kZk1B94D+gIjgfXufpeZXQ+kuPsNZnYE8BRwDNAB+AdwsKvBu4hInaqyhO7ua919UfT9FuBT\nIom6LzAlutgU4Pzo+/OA6e6+091XAJ8DPeMct4iI7KZGdehm1gn4IfAOkO7uhRBJ+kBadLH2wKoy\nqxVEp4mISB3ar7oLRqtb/gKMdvctZrZ7FUqNqlQqWF9ERKrB3a2i6dUqoZvZfkSS+RPu/rfo5EIz\nS4/OzwC+jk4vALLKrN4hOq2ioOrtNWHChHrdX32/dHwN+xXm4wvzsQVxfHtS3SqXPwOL3f33ZabN\nBEZE318M/K3M9IFm1sTMcoCDgAXV3I+IiOylKqtczOxEYAjwbzP7gEjVyq+BO4FnzOwSIB/oD+Du\ni83sGWAxsAO40qv6syIiIrVWZUJ39/lAo0pm/6iSdW4Hbq9FXHGXm5sbdAh1SsfXsIX5+MJ8bJBY\nx1dlO/Q627GZCu4iIjVkZnglN0Wr3cqlvnTq1In8/PygwxBpcLKzs1mxYkXQYUiAEq6EHv3rE0BE\nIg2bvjv7hj2V0NU5l4hISCihi4iEhBK6iEhIKKGLiISEEnoN5eTk8OqrrwYdRp274ooruPXWW4MO\nQ0RqIOGaLVYkI6MThYV115QxPT2btWtX1Nn2G6KHHnoo6BBEpIYaREKPJPO6a45VWFhhC6DA7Nq1\ni0aNKns4Nz7cHbPEOm4RqR1VueyFBQsW0KVLF1q3bs2ll17K9u3bAXjxxRfp1q0bKSkp9OrVi3//\n+9+xdb766isuuugi0tLS6Ny5Mw8++GBs3sSJE+nXrx/Dhg2jVatWTJky5Xv7LPXuu+9yzDHH0LJl\nS9q1a8e1114bm/fOO+9w4oknkpKSQrdu3Xj99ddj80499VRuuukmevXqxYEHHsjdd9/NMcccU27b\n999/P+efHxmnZOTIkdx8880AvP7662RlZXHfffeRnp5O+/btefzxx2PrbdiwgXPPPZeWLVty7LHH\nMn78eE466aTY/LFjx5Kenk7Lli3p2rUrixcvrsnpFpHqCqrLyciuv6+i6YCD1+Gr4lgq0qlTJz/q\nqKO8oKDAN27c6CeeeKKPHz/eP/jgA09LS/N3333XS0pKfOrUqd6pUyffvn27l5SUePfu3f2WW27x\nnTt3+pdffumdO3f2uXPnurt7Xl6eN2nSxGfOnOnu7tu2bat0/8cff7w/+eST7u6+detW/9e//uXu\n7gUFBd66dWufM2eOu7v/4x//8NatW/u6devc3T03N9ezs7P9008/9V27dvnmzZs9OTnZv/jii9i2\njznmGH/mmWfc3X3EiBE+fvx4d3efN2+e77fffp6Xl+c7d+702bNn+wEHHOCbNm1yd/cBAwb4oEGD\nfNu2bb548WLPysryk046yd3dX375Ze/Ro4cXFRW5u/uSJUt87dq11T7fUn01+X8sDVf0OleYV1VC\n3wtXX301mZmZtGrVinHjxjFt2jQmTZrE5ZdfTo8ePTAzhg0bRtOmTXnnnXd49913WbduHePGjaNR\no0Z06tSJn/70p0yfPj22zeOPP55zzz0XgKZNm1a67yZNmvDFF1+wfv16DjjgAHr2jIzu9+STT3LO\nOedw1llnAXD66afTo0cPZs+eHVt3xIgRHHbYYSQlJZGcnEzfvn15+umnAfj8889ZunRpLIaK9jt+\n/HgaNWrE2WefTfPmzVm6dCklJSU899xz/OY3v6Fp06YcfvjhXHzxxbH1GjduTHFxMYsXL8bdOfTQ\nQ0lPT9/LMy8ie6KEvhc6dOgQe5+dnc2aNWtYuXIl99xzD6mpqaSmppKSksLq1atZs2YN+fn5FBQU\nlJt3++238/XXX8e2k5WVVdGuvmfy5MksXbqUww47jGOPPZaXXnoJgPz8fJ555ply+5g/fz5r166t\ndB+DBg2K
JfRp06Zx/vnn06xZswr327p1a5KS/vvf5YADDmDLli1888037Nq1q9w5KbufU089lVGj\nRnHVVVeRnp7O5ZdfzpYtW6p1rCJSM0roe2HVqv8Ombpy5Urat29PVlYWN910Exs2bGDDhg1s3LiR\nLVu2MGDAALKysvjBD35Qbt7mzZuZNWtWbDvVvUHZuXNnpk2bxjfffMOvfvUrLrroIr799luysrIY\nPnx4uX0UFxdz3XXXVbqPM844g2+++YYPP/yQ6dOnM3jw4Bqfi7Zt27LffvuxevXqCs8PwKhRo1i4\ncCGLFy9m6dKl3H333TXej4hUTQl9L/zhD3+goKCADRs2cOuttzJw4EB++tOf8tBDD7FgQWRwpq1b\ntzJ79my2bt1Kz549adGiBXfddRfbtm1j165dfPLJJyxcuLDG+37qqadYt24dAC1btsTMSEpKYujQ\nocyaNYu5c+dSUlLCtm3beP3111mzZk2l29pvv/3o168f1113HRs3buSMM86ocTxJSUlceOGF5OXl\n8e2337JkyRKmTp0am79w4UIWLFjAzp072X///WnWrFm5kr6IxE+D+Galp2cDVmevyParx8wYPHgw\nZ555JgcddBAHH3ww48aNo3v37jz66KOMGjWK1NRUDjnkkFhrlaSkJF588UUWLVpETk4OaWlpXHbZ\nZRQVFdX4XMyZM4cuXbqQnJzM2LFjmTFjBk2bNqVDhw787W9/47bbbqNt27ZkZ2dzzz33UFJSEou7\nIoMGDeKVV16hf//+NUq0Zbf34IMPsmnTJtq1a8fFF1/M4MGDY/cBioqKuOyyy0hNTSUnJ4c2bdqU\n+9UgIvGj7nMl7m644QYKCwt57LHHgg5ln6Lvzr5B3edKnVq6dGmszf2CBQuYPHkyF154YcBRiex7\nlNATUO/evWnRogXJyckkJyfH3t9xxx1Bh1ah4uJiLrzwQpo3b86gQYO47rrrKm3+KCJ1R1UuIiGh\n786+QVUuIiL7ACV0EZGQUEIXESHSTbeZVfjKyOgUdHjVojp0kZDQd6d2Is9WVHb+Eufcqg5dRGQf\noIReD4488kjeeOONSufffvvt/OxnP6vHiGom0eMTkYgGUeWS0SGDwoLCOoslvX06a1evrXrBepCf\nn09OTg47d+5UnydSI6pyqZ0wVLk0jCHoCgohrw63n1d3fyyqsvtwcx4dGq4+//OUlJToj4dICOhb\nXEN33nknHTp0IDk5mcMPP5zXXnstNoTcwIEDSU5OpkePHnz00UexdXJycnj11VeBioebmzhxIsOH\nDwfglFNOAaBVq1YkJyfzr3/9q9JYli1bRm5uLq1atSItLY1BgwbF5i1ZsoQzzzyT1q1bc/jhh/Ps\ns8/G5o0cOZIrr7ySc845hxYtWnDPPffQrl27cn9Enn/+eX74wx/GYh42bBgQ+QWRlJTE1KlTyc7O\nJi0tjdtuuy223rZt27j44otJTU2lS5cu3H333eX6R6/o/IlIfCih18Bnn33GH/7wB9577z2Kiop4\n+eWX6dSpEwAzZ85kwIABbNy4kUGDBnH++eeza9euCrczc+ZM+vfvz6ZNm77XB3lpXXtRURFFRUUc\ne+yxlcYzfvx4zjrrLDZt2sTq1au5+uqrAfjPf/7DmWeeydChQ1m3bh3Tp0/nyiuvZMmSJbF1n376\nacaPH09xcTGjR4+mefPmsT86pfOHDBkS+7x7b43z58/n888/5x//+Ae/+c1vWLp0KQB5eXmsXLmS\nFStW8Pe//50nn3wytu6ezp+I1J4Seg00atSI7du38/HHH7Nz5046duxITk4OAN27d+eCCy6gUaNG\nXHPNNWzbto133nmnwu2UHW6ushGCqlPl0rhx49hoSE2aNOGEE04AIoNV5+TkMHz4cMyMrl278pOf\n/KRcKb1v374cd9xxQGTIu4EDBzJt2jQg0jfL7Nmzy5X4yzIz8vLyaNKkCf/zP/9D165d+fDDDwF4\n9tlnGTduHMnJyWRmZvKLX/yiWudPRGpPCb0GOnfuzO9+9zvy8vJIS0tj8ODBfPXVV0D5YdfMjA4d\nOlQ6uER1h5uryt13301JSQk9e/bkqKOOinVXm5+fzzvvvFNuOLpp06ZRWPjfewW7xzB48GCef/55\nduzYwXPPPUf37t3LDSu3u7LjgpYORwewZs2aSoejK3v+0tPTy50/Eak9JfQaGjhwIG+++SYrV64E\n4PrrrwfKD7vm7qxevZr27dtXuI09DTdX3aHoANLS0pg0aRIFBQU8/PDDXHnllSxfvpysrCxyc3PL\nDUdXVFTE//7v/1a6n8MPP5zs7Gxmz57N008/vVfD0QG0a9eu3HB0peepVOn5y8/PByJ9p4tIfCih\n18Bnn33Ga6+9xvbt22nSpAn7779/rIXKe++9xwsvvMCuXbu4//77adas2R7rvyvTtm1bkpKSWLZs\nWZXL/uUvf6GgoACI3ERNSkoiKSmJPn368Nlnn/Hkk0+yc+dOduzYwcKFC2P13JUZPHgwv//973nz\nzTfp169fpcvtqTqof//+3H777WzatImCggL+8Ic/xOZVdP7UukYkfhpEs8X09ul12rQwvX161QsB\n3333HTfccANLliyhcePGnHDCCUyaNIk//elP9O3blxkzZjB8+HAOPvhgnnvuuViyr0mpe//992fc\nuHGceOKJ7Ny5kzlz5tCzZ88Kl3333XcZM2YMRUVFpKen88ADD8RuMs6dO5exY8dyzTXX4O507dqV\n++67b4/7HjhwIDfeeCO9e/cmNTW10uV2P56yn2+++WYuv/xycnJyyMzMZMiQIbGqoMrOn4jER4N4\nsCjRTZw4kWXLlpUbHFkiHn74YWbMmKHmifWgIX53EkkYHizS712Jq7Vr1/LWW2/h7ixdupR7771X\nw9GJ1BMl9AR3xRVXVDgc3ZVXXhl0aBXavn07P//5z0lOTuZHP/oRF1xwAVdccUXQYYnsE1TlIhIS\n+u7UjqpcREQkYSihi4iEhBK6iEhIKKGLiISEErqISEgoodeBFi1asGLFiqDDqLXevXvzxBNPBB2G\niFRTg2i22Ckjg/zCunv0Pzs9nRVrE2MIOpG9pWaLtbNPNFs0s8lmVmhmH5WZNsHMVpvZ+9HXj8vM\nu9HMPjezT83szHgcQH5hIQ519qrLPxaJqrLBN0Sk4apOlctjwFkVTL/P3Y+OvuYAmNnhQH/gcOBs\n4I9Wk56pEtzjjz/OeeedF/t88MEHM2DAgNjnjh078uGHH5KUlMTy5cuByHBvo0aNok+fPiQnJ3P8\n8cfz5ZdfxtbZ01BxlZk9ezZdunQhOTmZrKyscp1uvfjii3Tr1o2UlBR69erFv//979i8nJwc7rrr\nLrp27Urz5s256667vter4ujRoxkzZgwAp556Kn/+858BmDJlCieddBLXXXcdqampdO7cmTlz5sTW\nW7FiBaeccgotW7bkzDPPZNSoUbFh67777juGDRtGmzZtSElJ4dhjj
+Wbb76p+oSLSM24e5UvIBv4\nqMznCcAvK1juBuD6Mp//Dzi2km16RSqaDrjX4auyWHa3fPlyT0lJcXf3NWvWeHZ2tmdlZbm7+7Jl\nyzw1NdXd3c3Mly1b5u7uI0aM8DZt2vjChQt9165dPmTIEB80aJC7u2/dutWzsrJ8ypQpXlJS4osW\nLfK2bdv6p59+usc42rVr5/Pnz3d3902bNvkHH3zg7u7vv/++p6Wl+bvvvuslJSU+depU79Spk2/f\nvt3d3Tt16uTdunXzgoIC37Ztm+fn5/uBBx7oW7ZscXf3Xbt2ebt27XzBggXu7p6bm+uTJ092d/fH\nH3/cmzRp4pMnT/aSkhJ/6KGHPDMzMxbT8ccf77/61a98x44d/s9//tOTk5N92LBh7u7+pz/9yc87\n7zzftm2bl5SU+Pvvv+/FxcXVOudSfdX9fywVA/aQJhLn3EZjqTBX1+am6CgzW2Rmj5pZy+i09sCq\nMssURKeFQk5ODi1atGDRokW88cYbnHXWWWRmZvLZZ5/xxhtvcNJJJ1W43gUXXED37t1JSkpiyJAh\nLFq0CKh4qLgLL7ywylJ6kyZN+OSTTyguLqZly5axwZwfeeQRLr/8cnr06IGZMWzYMJo2bVpuKLzR\no0eTmZlJ06ZN6dixI0cffTTPP/88AK+88goHHnggxxxzTIX7zc7O5pJLLsHMuPjii/nqq6/4+uuv\nWbVqFQsXLmTixInst99+nHjiieV+yTRu3Jj169fz2WefYWZ069aN5s2bV//Ei0i17G1/6H8EfuPu\nbma3APcCP63pRvLy8mLvc3Nzyc3N3ctw6s8pp5zCa6+9xhdffEFubi4pKSnMmzePt99+m1NOOaXC\ndTIyMmLvyw7XVnaoOIj8Wtq1a1esqqIyf/3rX/ntb3/L9ddfT9euXbn99ts57rjjyM/PZ+rUqTz4\n4IOx7e3YsaPcUHi7Dys3aNAgnn76aYYOHVrlSEVlj2P//fcHYMuWLXzzzTekpqaWGx81KysrNnLR\nsGHDWL16NQMHDmTz5s0MHTqUW2+9NdZfvIhUbt68ecybN69ay+5VQnf3shWgjwCzou8LgLKDVXaI\nTqtQ2YTeUJx88snMmjWLFStWMG7cOFq2bMlTTz3FO++8U25A5OooHSru5ZdfrtF63bt3j42O9OCD\nD9K/f39WrlxJVlYW48aN48Ybb6x03d1vafTr149rr72WgoICnn/++UoHtt6Tdu3asWHDBrZt2xZL\n6qtWrYrta7/99mP8+PGMHz+elStXcvbZZ3PooYcycuTIGu9LZF+ze2F34sSJlS5b3SoXi74iH8wy\nysy7EPg4+n4mMNDMmphZDnAQsKCa+2gQSkvo3377LZmZmZx00knMmTOH9evXx6o+qquyoeKWLFlS\n6To7duxg2rRpFBUV0ahRI1q0aBEr6V522WU8/PDDLFgQOeVbt25l9uzZbN26tdLttWnThlNOOYWR\nI0fygx/8gEMPPbRGxwCRm8E9evQgLy+PHTt28PbbbzNr1qzY/Hnz5vHxxx9TUlJC8+bNady4sYae\nE6kD1Wm2OA14CzjEzFaa2UjgLjP7yMwWAacAYwHcfTHwDLAYmA1cGa3Er5Xs9PTYX5S6eGWnV28I\nOoi0bGnRogUnn3wyEHmIqHPnzvTq1StWIq1uw57mzZszd+5cpk+fTmZmJpmZmdxwww1s3759j+s9\n8cQT5OTk0KpVKyZNmsS0adOASMn9kUceYdSoUaSmpnLIIYcwZcqU2HqVxTV48GBeeeUVhgwZUm56\nVcdRdv5TTz3FW2+9RZs2bbj55psZOHAgTZs2BSKDXlx00UW0bNmSLl26cOqpp1ZZrSQiNdcgHiyS\nhmfgwIEcfvjhTJgwIehQ9hn67tTOPvFgkUh1LFy4kOXLl+PuzJkzh5kzZ3L++ecHHZbIPkUJPUEd\neeSRsWHnyg499/TTTwcdWoXWrl1Lbm4uLVq0YMyYMTz88MN07do16LBE9imqchEJCX13akdVLiIi\nkjCU0EVEQkIJXUQkJJTQRURCQgldRCQklNATyMiRI7n55pv3uMzrr79OVtZ/u8s58sgjeeONN+o6\ntL329ddfc/LJJ9OyZUuuu+66oMOpldI+4Wtj9+tX3/uXcGsQCT2jY0fMrM5eGR07VjuWnJwcXn31\n1bgvWxNlH7n/+OOPY90QTJw4keHDh8d9f7UxadIk0tLS2Lx5M3fffXfQ4dRaPMZrqc02QjRejNSB\nve0+t14VrloFr71Wd9s/9dQ623aicfd6TQr5+fkcccQR9bY/iAyvp655ZV/UIEroiWL48OGsXLmS\nc889l+TkZO655x5mzZrFkUceSWpqKqeddhpLly6tdFmA/v37065dO1JSUsjNzWXx4sW1iqn0V8DL\nL7/MbbfdxowZM2jRogXdunUDIsPI3XTTTfTq1YsDDzyQL7/8kqKiIi699FIyMzPJyspi/PjxsYcm\nli1bRm5uLq1atSItLY1BgwZVGcNbb71Fz549Y8PLvf3220CkCmnKlCnceeedJCcn7/HXysSJE+nX\nrx8DBw4kOTmZHj168NFHsWFsyw3rV7rt0uqp0mqMu+66i3bt2nHJJZcAex6OrzJ33nknBx10EMnJ\nyRx55JG88MILlS77ySefxIYPbNeuHXfccQcA27dvZ8yYMbRv354OHTowduxYduzYEVvP3bnvvvtI\nT0+nffv2PP7447F5RUVFDB8+nLS0NHJycrj11lurjFmklBJ6DUydOpWOHTvy4osvUlRURN++fRk0\naBAPPPAA33zzDWeffTZ9+vRh586d31v22muvBaB3794sW7aMr7/+mqOPPvp7PRzurbPOOotf//rX\nDBgwgOLiYj744IPYvCeffJJHH32U4uJiOnbsyMUXX0zTpk1Zvnw5H3zwAX//+9959NFHARg/fjxn\nnXUWmzZtYvXq1Vx99dV73O/GjRvp06cPY8aMYf369YwdO5ZzzjmHjRs38thjjzFkyBCuv/56ioqK\nOO200/a4rZkzZzJgwAA2btzIoEGDOP/882ODWVf1q2Lt2rVs2rSJlStXMmnSJD744AMuvfRSHnnk\nETZs2MDPf/5zzjvvvHKJtSIHHXQQ8+fPp6ioiAkTJjB06FAKKxhEfMuWLZxxxhn07t2br776ii++\n+ILTTz8dgFtuuYUFCxbw0Ucf8eGHH7JgwQJuueWWcrEWFxezZs0aHn30Ua666io2b94MwKhRoygu\nLmbFihXMmzePqVOn8thjj+0xZpFSSuh7obQ0O2PGDPr06cNpp51Go0aNuPbaa/n222956623vrds\nqREjRnDAAQfQuHFjbr75Zj788EOKi4vrNN4RI0Zw2GGHkZSUxIYNG/i///s/7r//fpo1a0abNm0Y\nM2YM06dPByLDxeXn51NQUECTJk044YQT9rjtl156iUMOOYTBgweTlJTEwIEDOeyww8r1h15d3bt3\n54ILLqBRo0Zcc801bNu2
LTbgRlWPXTdq1IiJEyfSuHFjmjZtWq3h+Cryk5/8hPRod8r9+vXj4IMP\njvUvX9aLL75Iu3btGDNmDE2aNCk3dN+0adOYMGECrVu3pnXr1kyYMIEnnngitm6TJk0YP348jRo1\n4uyzz6Z58+YsXbqUkpISZsyYwR133MEBBxxAdnY2v/zlL8utK7InSui1sGbNGrKzs2OfzYysrCwK\nCioepKmkpIQbbriBgw46iFatWpGTk4OZsW7dujqNs2yrivz8fHbs2EG7du1ITU0lJSWFyy+/nG++\niQxCdffdd1NSUkLPnj056qijqiwd7n4OIDL2aGXnoLpxmhkdOnQoN3zenrRt25bGjRvHPufn53Pv\nvfeSmpoaO87Vq1dXub2pU6fGqmlSUlL45JNPKrw+q1atonPnzhVuY82aNXQsc6M9Ozu73H5bt25d\nboCP0mEJ161bx86dO7+37t6cS9k3KaHXUNmf/pmZmeTn55ebv2rVqti4nbtXE0ybNo1Zs2bx6quv\nsmnTJlasWBEbrTvesVU2PSsri2bNmrF+/Xo2bNjAxo0b2bRpU6y+Oi0tjUmTJlFQUMDDDz/MlVde\nWa7ueneZmZmsWLGi3LSVK1fSvn3NxwZfteq/44u7O6tXr45t54ADDuA///lPbP7atWsrPcbS4xw3\nbhwbNmyIHeeWLVsYMGBApftfuXIlP/vZz/jjH//Ixo0b2bhxI126dKnw+mRlZbFs2bIKt9O+ffty\n/y/y8/PJzMzcw5FHtGnTJvYLqey6e3MuZd+khF5D6enpsQTXv39/XnrpJV577TV27tzJPffcQ7Nm\nzTj++OOByKDKZZNhcXExTZs2JSUlha1bt3LjjTfGtcVJenp67I9EZTIyMjjzzDMZO3YsxcXFuDvL\nly+PtWX/y1/+EisRtmrViqSkpD0OF9e7d28+//xzpk+fzq5du5gxYwaffvopffr0qXH87733Xmys\n1NIqoWOPPRaAbt26MW3aNEpKSpgzZw6vv/76Hre1N8Pxbd26laSkJNq0aUNJSQmPPfYYH3/8cYXL\n9unTh7Vr1/LAAw+wfft2tmzZEtvXwIEDueWWW1i3bh3r1q3jt7/9bbVGaEpKSqJfv36MGzeOLVu2\nkJ+fz/3336/RnaT6SkuI9f2K7Pr7KpqenpXlRPq1rJNXelZWhbFU5G9/+5t37NjRU1JS/N577/UX\nXnjBjzjiCG/VqpXn5ub64sWLK11269at3rdvX2/RooV36tTJn3jiCU9KSvJly5a5u/uIESN8/Pjx\ne9z/vHnzPKtMvDk5Of7KK6+4u/v69eu9V69enpKS4t27d3d391NPPdUnT55cbhtFRUV+xRVXeIcO\nHbxVq1Z+9NFH+4wZM9zd/Ve/+pW3b9/eW7Ro4QcddJA/+uijVZ6T+fPne/fu3b1Vq1beo0cPf+ut\nt2LzRo4cWeUxubvn5eV5v379fODAgd6iRQs/+uijfdGiRbH5Cxcu9C5dunhycrIPHz7cBw8eHNvu\n7uek1Msvv+zHHHOMp6SkeGZmpvfv39+3bNmyxzhuuukmT01N9bZt2/ovf/lLz83NjZ2/xx9/3E86\n6aTYsp988omffvrpnpKS4u3atfM777zT3d23bdvmo0eP9nbt2nlmZqaPGTPGv/vuu0pjLXsNN27c\n6EOHDvW2bdt6x44d/ZZbboktt/v+d1fZd0qqJ5IPvJJX4pzbaCwV5lX1hy4JYeLEiSxbtoypU6cG\nHUqDpe9O7ag/dBERSRhK6Ano9ttvjw05V/Z1zjnnBBLPP//5z+/FU/q5Jnr37l1uO6Xv77jjjnp7\nenXVqlV/p8bXAAAWIUlEQVSVHsvq1avrJQaRuqIqF5GQ0HendlTlIiIiCUMJXUQkJJTQRURCIuG6\nz83OzlafzyJ7YfcuGGTfk3A3RSXcqrrxRF4ls/L2uFZC3bDa6+NLkGPYV+mmqIiIJAwldBGRkFBC\nFxEJCSV0EZGQUEIXEQkJJXQRkZBQQhcRCQkldBGRqjSKtP+u6JXRISPo6GIS7klREZGEs4tKHwor\nzCusz0j2SCV0EZGQUEIXEQkJJXQRkZBQQhcRCQkldBGRkFBCFxEJCSV0EZGQUEIXEQkJJXQRkZBQ\nQhcRCQkldBGRkKgyoZvZZDMrNLOPykxLMbO5ZrbUzF42s5Zl5t1oZp+b2admdmZdBS4iIuVVp4T+\nGHDWbtNuAP7h7ocCrwI3ApjZEUB/4HDgbOCPFhlKW0RE6liVCd3d/wls3G1yX2BK9P0U4Pzo+/OA\n6e6+091XAJ8DPeMTqoiI7Mne1qGnuXshgLuvBdKi09sDq8osVxCdJiIidSxe/aH73qyUl5cXe5+b\nm0tubm6cwhERCYd58+Yxb968ai27twm90MzS3b3QzDKAr6PTC4CsMst1iE6rUNmELiIi37d7YXfi\nxImVLlvdKheLvkrNBEZE318M/K3M9IFm1sTMcoCDgAXV3IeIiNRClSV0M5sG5AKtzWwlMAG4A3jW\nzC4B8om0bMHdF5vZM8BiYAdwpbvvVXWMiIjUTJUJ3d0HVzLrR5Usfztwe22CEhGRmtOToiIiIaGE\nLiISEkroIiIhoYQuIhISSugiIiGhhC4iEhJK6CJSLRkZnTCzCl8ZGZ2CDk+IX18uIhJyhYX5VNZt\nU2GheslOBCqhi4iEhBK6iEhIKKGLiISEErqISEgooYuIhIQSuohISCihi4iEhBK6iEhIKKGLiISE\nErqISEgooYuIhIQSuohISCihi4iEhBK6iEhIKKGLiISEErqISEgooYuIhIQSuohISCihi4iEhBK6\niEhIKKGLiISEErqISEgooYuIhIQSuohISCihi4iEhBK6iEhIKKGLiISEErqISEgooYuIhIQSuohI\nLTQFzKzCV6eMjHqNpcEl9IyMTpWevIyMTkGHJyL7mO8Ar+SVX1hYr7E0uIReWJhPZacvMi+8Mjpk\nVP7HrEP9lgREJPHsF3QAUn2FBYWQV8m8vPotCYhI4mlwJXQRSUCNKq9H1q/H+qMSuojU3i706zEB\nqIQuIhISSugiIiGhhC4iEhJK6CIiIVGrm6JmtgLYDJQAO9y9p5mlADOAbGAF0N/dN9cyThERqUJt\nS+glQK67d3P3ntFpNwD/cPdDgVeBG2u5DxERqYbaJnSrYBt9gSnR91OA82u5DxERqYbaJnQH/m5m\n75rZT6PT0t29EMDd1wJptdyHiIhUQ20fLDrR3b8ys7bAXDNbSiTJl7X755i8vLzY+9zcXHJzc2sZ\njohIuMybN4958+ZVa9laJXR3/yr67zdm9gLQEyg0s3R3LzSzDODrytYvm9BFROT7di/sTpw4sdJl\n97rKxcwOMLPm0fcHAmcC/wZmAiOii10M/G1v9yEiItVXmxJ6OvC8mXl0O0+5+1wzWwg8Y2aXAPlA\n/zjEKSIiVdjrhO7uXwI/rGD6BuBHtQlKRERqTk+KioiEhBK6iEhIKKGLiISEErqIS
EgooYuIhIQS\nuohISCihi4iEhBK6iEhIKKGLiISEEnqCycjohJlV+BIR2ZPadp8rcVZYmE/lPQ4rqYtI5VRCF5E6\n1RQq/dXZKSMj6PBCRSV0EalT37GH35yFhfUZSuiphC4iEhJK6CIiIaGELiISEkroIiIhoYQuIhIS\nSugiIiGhhC4iEhJK6CIiIaGELiISEkroIiIhoYQukgDU34nEg/pyEUkA6u9E4kEldBGRkFBCFxEJ\nCSV0EZGQUEIXEQkJJXQRkZBQQhcRCQkldBGRkAhXQm9U+cMZBzZqpAc3RCTUwvVg0S4gr+JZ/8kr\n0YMbIhJq4Sqhi4jsw5TQRURCQgldRCQklNBFREJCCV1EJCSU0EVEQkIJPSQ0QIKIKKGHROkACRW9\n8sPezr5x40r/mGV07Bh0dCL1JlwPFsm+accOeO21CmcVnnpqPQcjEhyV0EVEQkIJXSTRqUqp4arn\na6cqF5FEpyqlhquer12dldDN7MdmtsTMPjOz6+tqPyIiElEnCd3MkoD/Bc4CugCDzOywuthXXOgn\nrUgw9N2Lq7qqcukJfO7u+QBmNh3oCyypo/3VTth/0ka/NBVJz8pi7cqV9RyQSFTYv3v1rK4Sentg\nVZnPq4kkeQmCvjQi+wS1chERCQlzr2wcn1ps1Ow4IM/dfxz9fAPg7n5nmWXiv2MRkX2Au1dYh1pX\nCb0RsBQ4HfgKWAAMcvdP474zEREB6qgO3d13mdkoYC6Rap3JSuYiInWrTkroIiJS/3RTVEQkJJTQ\nRURCQn25iATEzI4EjgCalU5z96nBRSTVlajXLtR16GaWAhxM+ZP+RnARxdc+cHwGDAF+4O6/MbOO\nQIa7Lwg4tFozswlALpGkMBs4G/inu18UZFzxEm26/CBwONAEaARsdffkQAOLg0S+dqFN6Gb2U2A0\n0AFYBBwHvO3upwUaWJyE/fgAzOwhoAQ4zd0Pj/4Bm+vuxwQcWq2Z2b+BrsAH7t7VzNKBJ939jIBD\niwszWwgMBJ4FegDDgUPc/cZAA4uDRL52Ya5DHw0cA+S7+6lAN2BTsCHFVdiPD+BYd78K2Abg7huJ\nlPbC4Ft3LwF2mlky8DWQFXBMceXuXwCN3H2Xuz8G/DjomOIkYa9dmOvQt7n7tmjPbU3dfYmZHRp0\nUHEU9uMD2BF9SM0BzKwtkRJ7GCw0s1bAI8B7wBbg7WBDiqv/mFkTYJGZ3UXkAcOwFCAT9tqFucrl\neWAkMAY4DdgINHb33oEGFidhPz4AMxsCDAC6A48DFwE3ufuzQcYVb2bWCUh2948CDiVuzCybSMm1\nMTAWaAn8MVpqD41Eu3ahTehlmdkpRP5DzXH37UHHE29hPr5oP/qnRz++GpYnjs3sAiLHszn6uRWQ\n6+4vBBuZVCWRr11oE3r0Lvsn7l4c/ZwMHO7u/wo2stoxs2R3LzKz1Irmu/uG+o6pLpnZ0UAvItUu\n8939/YBDigszW+TuP9xt2gfu3i2omOLBzJ5x9/7RG4ffSy7u/j8BhBVXiXztwlyH/hBwdJnPWyqY\n1hBNA/oQqbtzoGyvaw78IIig6oKZ3Qz0A/5K5DgfM7Nn3f2WYCOLi4rqk8PwfRwd/bdPoFHUrYS9\ndmEuoVf0V/SjMJQQ9hVmthTo6u7bop/3Bxa5e4O/+WtmfybSKukP0UlXAanuPiKwoKRaEvnaheWu\nc0WWm9kvzKxx9DUaWB50UPFiZiea2YHR90PN7L7ogzdhsoYyD00BTYGCgGKJt6uB7cCM6Os7Iokh\nFMzsQjP73Mw2m1mRmRWbWVHQccVJwl67MJfQ04AHiLQAceAVYIy7fx1oYHFiZh8Rebjhf4i0AHkU\n6O/upwQZVzyZ2QtE2tr/ncg1PINI3/qrAdz9F8FFJ3tiZl8A54blJnZDEdqEHnZm9r67Hx2tZy5w\n98ml04KOLV7M7OI9zXf3KfUVS7yY2e/cfYyZzaLim4bnBRBW3JnZfHc/Meg44qkhXLuEqMiPJzP7\nlbvfZWYPUvFJD0uprtjMbgSGASeZWRIhu57uPiX6cMoh0UlL3X1HkDHFwRPRf+8JNIq6t9DMZgAv\nEKmSAMDdnwsupFpL+GsXqgQQVfoTb2GgUdS9AcBgYKS7rzWzk4EDA44prswsF5gCrCDSyiXLzC5u\nyB2Quft70adff+buQ4KOpw4lA/8BziwzzYEGm9AbwrULXUJ391nRk36Uu18bdDx1JZrEXwMGm9mT\nwJfA7wIOK97uBc5096UAZnYI8DSRJ0cbrOgQjdlm1iRsD4KVcveRQcdQFxL92oUuoUPspIeq/q5U\nNKkNir7WEbnLbtEOusKmcWkyB3D3z8yscZABxdFyYL6ZzQS2lk509/uCCyl+zKwZcCnQhfLdO18S\nWFDxk7DXLpQJPWpR9IQ/S/mT3mB/8kUtAd4E+pT2i2FmY4MNqc4sNLNHgSejn4cQnqq0ZdFXEtAi\n4FjqwhNE/q+eBfyGyLULS4uXhL12oW3lYmaPVTDZG3oJwczOJ9LP9InAHGA68Ki75wQaWB0ws6ZE\n2vf2ik56k0gHT99VvlbDYmYHuPt/go4j3kofhS99mC/6y+pNdz8u6NjiJRGvXehK6GZ2p7tfD8wO\nW698ANEOgF6IPlTUl0hvi2nRwSCed/e5gQYYJ9H7IH+O3nwK/KdsvJnZ8cBkoDnQ0cy6Aj939yuD\njSxuSlsjbbLIcG1rgbQA44mbRL52YXxStLeZGdDgR0bZE3ff6u7T3P1cIqMWfQBcH3BYcePuu4Ds\naLPFMPodkeqI9QDu/iFwcqARxdcki4wwNR6YCSwG7go2pLhJ2GsXuhI6kWqIjUDz6KPGZTuvKnH3\nlsGEVXeiI/lMir7CJGFvPsWDu6+KlD1idgUVS7y5+6PRt68Tog7jSiXqtQtdCd3dr3P3VsBL7p7s\n7i3cvQXQG3gq4PCkZpYBL/Lfm0+lrzBYZWYnAB7ta+hawnPTEDNLN7PJZvZ/0c9HmNmlQccVJwl7\n7UJ7UxTAzLoRad7Xn0g77b+6+/8GG5UImFkb4PfAj4j8ipwLjHb39YEGFifRRP4YMM4jAynvR2RQ\n5aMCDq3WEvnahS6hV9JO+1p3zw40MKmx6INTFXXfcFoA4UgNmNm77n5M2YEfKurSWuIrjHXo+1I7\n7bAr+6RvM+AnwM6AYokrM8sh0g1rJ8p8DxOhg6c42WpmrfnvAN/HAZuDDSk+EvnahTGhX0iknfZr\nZlbaTtv2vIokInd/b7dJ881sQSDBxN8LRJq+zQJKAo6lLlxDpHVLZzObD7QlMsh3GCTstQtdlUup\nMu20BxHpE30qIWqnvS/YbdzUJCJ9uDwQkhGL/uXuxwYdR12K1psfSqRAFYaeMoHEvnahTehlRdvD\n9gMGuPvpVS0vicHMvuS/46buJHJj+zfu/s9AA4sDMxsMHEzkhlrZ7mXDMgj2VcBT7r4p+jkFGOTu\nfww2stpL5Gu3TyR0kURjZrcT6ct+Gf/9
2e5hueFb0Q3QsjdIG7JEvnZhrEOXBq50kJLo+35lu3Aw\ns9vc/dfBRRc3/YAfJGIXrHHSyMzMoyXGaFcOYXnqN2GvXegeLJJQGFjm/e5dOPy4PgOpQx8DrYIO\nog69DMwws9PN7HQijRPmBBxTvCTstVMJXRKRVfK+os8NVStgiZm9S/l62MCbvsXJeOAyoLTDqpeJ\ntAwJg4S9dkrokoi8kvcVfW6oJgQdQF2Itmy5DRgJrIpO7kikX54kEqTPk1pK2Gunm6KScMxsF5HO\nuAzYn8jYlEQ/N3P3sIxaFGNmvYi0Arkq6Fhqw8zuJ9Lfzlh3L45Oa0FkOMFv3X10kPHVhUS6dkro\nIgGJ9jU0mMhNtlD0NWRmnwOH+G6JJXpTdIm7HxxMZPGVqNdOVS4i9WgfGBPWd0/m0Ym7zKxBlx4b\nwrVTKxeR+rWEyJPLfdy9l7s/SDjqlUstNrPhu080s6FEjr0hS/hrpxK6SP0Ke19DVwHPmdklQGlf\nPD2I3Au5ILCo4iPhr53q0EUCEPa+hszsNKBL9ONid38lyHjiKZGvnRK6SMDU11DDlWjXTgldRCQk\ndFNURCQklNBFREJCCV1EJCSU0EUCYGYXmtnnZrbZzIrMrNjMioKOS6qWyNdON0VFAmBmXwDnuvun\nQcciNZPI104ldJFgFCZiQpBqSdhrpxK6SD0yswujb08BMoiMIF+2T+3ngohLqtYQrp0Sukg9MrPH\n9jDb3f2SegtGaqQhXDsldJEAmNmJ7j6/qmmSeBL52imhiwTAzN5396OrmiaJJ5GvnXpbFKlHZnY8\ncALQ1syuKTMrGWgUTFRSHQ3h2imhi9SvJkBzIt+9FmWmFwEXBRKRVFfCXztVuYjUs+hwbM+4+0+C\njkVqzsyy3T0/6DgqohK6SD2LDseWGXQcstcer2g4PXc/LYhgylJCFwnGIjObCTwLbC2dmAhtmaVK\n15Z53wz4CbAzoFjKUZWLSAAqadOcEG2ZpebMbIG79ww6DpXQRQLg7iODjkH2jpmllvmYBHQHWgYU\nTjlK6CIBMLMOwIPAidFJbwKj3X11cFFJNb0HOJEBoncCXwKXBhpRlKpcRAJgZn8HpgFPRCcNBYa4\n+xnBRSUNnRK6SADMbJG7/7CqaZJ4zKwxcAVwcnTSPOBP7r4jsKCi1H2uSDDWm9lQM2sUfQ0F1gcd\nlFTLQ0Tqzf8YfXWPTgucSugiATCzbCJ16McTqY99C/iFu68MNDCpkpl96O5dq5oWBN0UFQlA9EnD\n84KOQ/bKLjPr7O7LAMzsB8CugGMClNBF6pWZ3byH2e7uv623YGRvXQe8ZmbLibR0yQYSohmqqlxE\n6pGZ/bKCyQcSafbW2t2b13NIshfMrClwaPTjUnf/bk/L1xcldJGAmFkLYDSRZP4McK+7fx1sVFIZ\nMzsGWOXua6OfhxN57D8fyHP3DUHGB2rlIlLvzCzVzG4BPiJS7Xm0u1+vZJ7w/gRsBzCzk4E7gKnA\nZmBSgHHFqA5dpB6Z2d3AhUQSwFHuviXgkKT6GpUphQ8AJrn7X4G/mtmiAOOKUZWLSD0ysxIiI8Xv\nJNJcMTaLyE3R5EACkyqZ2cfAD919p5ktAX7m7m+UznP3I4ONUCV0kXrl7qrmbLieBl43s3XAt0T6\n38HMDiJS7RI4ldBFRKrJzI4D2gFz3X1rdNohQHN3fz/Q4FBCFxEJDf38ExEJCSV0EZGQUEIXEQkJ\nJXTZp5lZtpkNKvO5u5n9rg7209fMDov3dkXKUkKXfV0OMLj0g7u/5+5j6mA/5wNd6mC7IjFK6NKg\nmdlwM/vQzD4wsynREvcrZrbIzP4eHbsTM3vMzH5vZvPN7AszuzC6iduBXmb2vpmNNrNTzGxWdJ0J\nZjbZzF6LrnN1mf0OMbN/Rdd7yMwsOr3YzG6J7v8tM2trZscT6Sr3rujyOfV7lmRfoYQuDZaZHQH8\nGsh1927AGCKDRjwWHcptWvRzqQx3PxE4F7gzOu0G4E13P9rdfx+dVrYt76HAGcCxwITo6EKHEXn0\n+wR3PxooAYZElz8QeCu6/zeBy9z9bWAmcF10P1/G8TSIxOhJUWnITgOedfeNAO6+MVoaviA6/wn+\nm7gBXogu96mZpVVzHy+5+04iQ8YVAunA6cDRwLvRknkzYG10+e3uPjv6/j3gR3t3aCI1p4QuYbOn\nJ+XK9llt1dxe2XV2EfnOGDDF3cdVsPz2CpYXqReqcpGG7FWgn5mlQqRbWiJjc5a2WhlKtL+NCpQm\n9GKgRTX3V7rOK8BFZtY2ut8UM8vabZndFQPqeEvqlBK6NFjuvhi4lUiHSR8A9wBXAyOj3ZkOITKA\nBHy/5F76+SOgJHpTdTR75tH9fgrcBMw1sw+BuUT696hoP6WmA9eZ2Xu6KSp1RX25iIiEhEroIiIh\noYQuIhISSugiIiGhhC4iEhJK6CIiIaGELiISEkroIiIh8f/G5eTz8NI/TAAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# side-by-side bar plot of the DataFrame directly above\n", "drinks.groupby('continent').mean().plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`plot`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 15. How do I explore a pandas Series? ([video](https://www.youtube.com/watch?v=QTVTq8SPzxM&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=15))" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
star_rating  title  content_rating  genre  duration  actors_list
09.3The Shawshank RedemptionRCrime142[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
19.2The GodfatherRCrime175[u'Marlon Brando', u'Al Pacino', u'James Caan']
29.1The Godfather: Part IIRCrime200[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
39.0The Dark KnightPG-13Action152[u'Christian Bale', u'Heath Ledger', u'Aaron E...
48.9Pulp FictionRCrime154[u'John Travolta', u'Uma Thurman', u'Samuel L....
\n", "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "0 9.3 The Shawshank Redemption R Crime 142 \n", "1 9.2 The Godfather R Crime 175 \n", "2 9.1 The Godfather: Part II R Crime 200 \n", "3 9.0 The Dark Knight PG-13 Action 152 \n", "4 8.9 Pulp Fiction R Crime 154 \n", "\n", " actors_list \n", "0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n", "1 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... \n", "4 [u'John Travolta', u'Uma Thurman', u'Samuel L.... " ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of top-rated IMDb movies into a DataFrame\n", "movies = pd.read_csv('http://bit.ly/imdbratings')\n", "movies.head()" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "star_rating float64\n", "title object\n", "content_rating object\n", "genre object\n", "duration int64\n", "actors_list object\n", "dtype: object" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the data type of each Series\n", "movies.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exploring a non-numeric Series:**" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 979\n", "unique 16\n", "top Drama\n", "freq 278\n", "Name: genre, dtype: object" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the non-null values, unique values, and frequency of the most common value\n", "movies.genre.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`describe`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Drama 278\n", "Comedy 156\n", "Action 136\n", "Crime 124\n", "Biography 77\n", "Adventure 75\n", "Animation 62\n", "Horror 29\n", "Mystery 16\n", "Western 9\n", "Thriller 5\n", "Sci-Fi 5\n", "Film-Noir 3\n", "Family 2\n", "Fantasy 1\n", "History 1\n", "Name: genre, dtype: int64" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count how many times each value in the Series occurs\n", "movies.genre.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`value_counts`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Drama 0.283963\n", "Comedy 0.159346\n", "Action 0.138917\n", "Crime 0.126660\n", "Biography 0.078652\n", "Adventure 0.076609\n", "Animation 0.063330\n", "Horror 0.029622\n", "Mystery 0.016343\n", "Western 0.009193\n", "Thriller 0.005107\n", "Sci-Fi 0.005107\n", "Film-Noir 0.003064\n", "Family 0.002043\n", "Fantasy 0.001021\n", "History 0.001021\n", "Name: genre, dtype: float64" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# display percentages instead of raw counts\n", "movies.genre.value_counts(normalize=True)" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false }, "outputs": [ { "data": { 
"text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'value_counts' (like many pandas methods) outputs a Series\n", "type(movies.genre.value_counts())" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Drama 278\n", "Comedy 156\n", "Action 136\n", "Crime 124\n", "Biography 77\n", "Name: genre, dtype: int64" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# thus, you can add another Series method on the end\n", "movies.genre.value_counts().head()" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography',\n", " 'Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi',\n", " 'History', 'Thriller', 'Family', 'Fantasy'], dtype=object)" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# display the unique values in the Series\n", "movies.genre.unique()" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "16" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the number of unique values in the Series\n", "movies.genre.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`unique`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) and [**`nunique`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nunique.html)" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
content_rating  APPROVED  G  GP  NC-17  NOT RATED  PASSED  PG  PG-13  R  TV-MA  UNRATED  X
genre
Action311041114467030
Adventure320051212317020
Animation32000302555010
Biography12101062936000
Comedy9211163232373041
Crime60017164870111
Drama123042412555143191
Family010000100000
Fantasy000000001000
Film-Noir100010000010
History000000000010
Horror2001101216051
Mystery410010126010
Sci-Fi100000013000
Thriller100000103000
Western100020213000
\n", "
" ], "text/plain": [ "content_rating APPROVED G GP NC-17 NOT RATED PASSED PG PG-13 R \\\n", "genre \n", "Action 3 1 1 0 4 1 11 44 67 \n", "Adventure 3 2 0 0 5 1 21 23 17 \n", "Animation 3 20 0 0 3 0 25 5 5 \n", "Biography 1 2 1 0 1 0 6 29 36 \n", "Comedy 9 2 1 1 16 3 23 23 73 \n", "Crime 6 0 0 1 7 1 6 4 87 \n", "Drama 12 3 0 4 24 1 25 55 143 \n", "Family 0 1 0 0 0 0 1 0 0 \n", "Fantasy 0 0 0 0 0 0 0 0 1 \n", "Film-Noir 1 0 0 0 1 0 0 0 0 \n", "History 0 0 0 0 0 0 0 0 0 \n", "Horror 2 0 0 1 1 0 1 2 16 \n", "Mystery 4 1 0 0 1 0 1 2 6 \n", "Sci-Fi 1 0 0 0 0 0 0 1 3 \n", "Thriller 1 0 0 0 0 0 1 0 3 \n", "Western 1 0 0 0 2 0 2 1 3 \n", "\n", "content_rating TV-MA UNRATED X \n", "genre \n", "Action 0 3 0 \n", "Adventure 0 2 0 \n", "Animation 0 1 0 \n", "Biography 0 0 0 \n", "Comedy 0 4 1 \n", "Crime 0 11 1 \n", "Drama 1 9 1 \n", "Family 0 0 0 \n", "Fantasy 0 0 0 \n", "Film-Noir 0 1 0 \n", "History 0 1 0 \n", "Horror 0 5 1 \n", "Mystery 0 1 0 \n", "Sci-Fi 0 0 0 \n", "Thriller 0 0 0 \n", "Western 0 0 0 " ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compute a cross-tabulation of two Series\n", "pd.crosstab(movies.genre, movies.content_rating)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`crosstab`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exploring a numeric Series:**" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 979.000000\n", "mean 120.979571\n", "std 26.218010\n", "min 64.000000\n", "25% 102.000000\n", "50% 117.000000\n", "75% 134.000000\n", "max 242.000000\n", "Name: duration, dtype: float64" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate various summary statistics\n", "movies.duration.describe()" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "120.97957099080695" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# many statistics are implemented as Series methods\n", "movies.duration.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`mean`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html)" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "112 23\n", "113 22\n", "102 20\n", "101 20\n", "129 19\n", "Name: duration, dtype: int64" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'value_counts' is primarily useful for categorical data, not numerical data\n", "movies.duration.value_counts().head()" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# allow plots to appear in the notebook\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYwAAAEACAYAAACgS0HpAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFQFJREFUeJzt3X+sX/V93/HnCzv8Ci2iybArOzNkoNR0i5xkWNtYp2/W\nxSGriinTGG20QRhTJErIVE3DZtpsVd0SqiVdtgppKknrZEHEoUkw/cGvkW+7RApOA15M7DJPmwmw\n+K6qsiSELTHhvT++x3C53Gt/7vX31733+ZC+4nw/33PO53MP5/p1P5/POeebqkKSpFM5Y9INkCQt\nDwaGJKmJgSFJamJgSJKaGBiSpCYGhiSpyUgDI8lZSR5L8kSSg0l2deW7kjyb5PHudeWsbXYmOZLk\ncJJto2yfJKldRn0fRpJzq+qFJGuALwO3Au8BvldVH52z7mbgbuByYCPwCHBpebOIJE3cyIekquqF\nbvEsYC1w4h//zLP6duCeqnqxqo4CR4Cto26jJOnURh4YSc5I8gRwDHi4qr7afXRLkgNJ7kpyfle2\nAXhm1ubPdWWSpAkbRw/jpap6G4Mhpq1JLgPuBN5cVVsYBMlHRt0OSdLpWTuuiqrqu0n6wJVz5i5+\nC7i/W34OeNOszzZ2Za+SxDkNSVqCqppvOqDJqK+SeuOJ4aYk5wDvAv40yfpZq10DPNkt7wOuS3Jm\nkouBS4D98+27qnwN6bVr166Jt2ElvTyeHstpfZ2uUfcwfhLYk+QMBuH0mar6gySfTLIFeAk4Crwf\noKoOJdkLHAKOAzfXMH5KSdJpG2lgVNVB4O3zlP+jk2zzIeBDo2yXJGnxvNNb9Hq9STdhRfF4Do/H\ncrqM/Ma9UUjiSJUkLVISalonvSVJK4eBIUlqYmBIkpoYGJKkJgaGJKmJgSFJamJgSJKaGBiSpCYG\nhiSpiYEhSWpiYEiSmhgYkqQmBoYkqYmBIUlqYmBIkpoYGJKkJgaGJKmJgSFJamJgSJKaGBiSpCYG\nhiSpyUgDI8lZSR5L8kSSg0l2deUXJHkoyVNJHkxy/qxtdiY5kuRwkm2jbJ8kqV2qarQVJOdW1QtJ\n1gBfBm4F/h7w51X160luAy6oqh1JLgM+DVwObAQeAS6tOY1MMrdIknQKSaiqLHX7kQ9JVdUL3eJZ\nwFqggO3Anq58D3B1t3wVcE9VvVhVR4EjwNZRt1GSdGojD4wkZyR5AjgGPFxVXwXWVdUMQFUdAy7s\nVt8APDNr8+e6MknShK0ddQVV9RLwtiQ/Dnw+yU8z6GW8arXF7nf37t0vL/d6PXq93mm0cvVav/4i\nZmaeHnu969Zt4tixo2OvV1pN+v0+/X5/aPsb+RzGqypL/iXwAnAT0KuqmSTrgS9W1eYkO4Cqqju6\n9R8AdlXVY3P24xzGkCRhCXk9jJrx/6E0XlM9h5HkjSeugEpyDvAu4DCwD7ihW+164L5ueR9wXZIz\nk1wMXALsH2UbJUltRj0k9ZPAniRnMAinz1TVHyT5CrA3yY3A08C1AFV1KMle4BBwHLjZroQkTYex\nDkkNi0NSw+OQlLR6TPWQlCRp5TAwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1\nMTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1\nMTAkSU3WTroBWq3OIsnYa123bhPHjh0de73SSjDSHkaSjUkeTfKNJAeTfKAr35Xk2SSPd68rZ22z\nM8mRJIeTbBtl+zRJPwBq7K+ZmafH8tNJK1GqanQ7T9YD66vqQJLzgK8B24F/AHyvqj46Z/3NwN3A\n5cBG4BHg0prTyCRzi7REg7/yJ3EsJ1ev545WqyRU1ZK79iPtYVTVsao60C0/DxwGNnQfz9fo7cA9\nVfViVR0FjgBbR9lGSVKbsU16J7kI2AI81hXdkuRAkruSnN+VbQCembXZc7wSMJKkCRrLpHc3HHUv\n8MGqej7JncCvVlUl+TXgI8BNi9nn7t27X17u9Xr0er3hNViSVoB+v0+/3x/a/kY6hwGQZC3we8Af\nVtXH5vl8E3B/Vb01yQ6gquqO7rMHgF1V9dicbZzDGBLnMKTVY6rnMDqfAA7NDotuMvyEa4Anu+V9\nwHVJzkxyMXAJsH8MbZQkncJIh6SSXAG8FziY5AkGf1LeDvxSki3AS8BR4P0AVXUoyV7gEHAcuNmu\nhCRNh5EPSY2CQ1LD45CUtHoshyEpSdIKYGBIkpoYGJKkJgaGJKmJgSFJamJgSJKaGBiSpCYGhiSp\niYEhSWpiYEiSmhgYkqQmBoYkqYmBIUlqYmBIkpoYGJKkJgaGJKmJgSFJamJgSJKaGBiSpCZNgZHk\nr4y6IZKk6dbaw7gzyf4kNyc5f6QtkiRNpabAqKqfAd4LvAn4WpK7k7xrpC2TJE2VVFX7yska4Grg\n3wPfBQLcXlWfG03zFmxHLabdWlgSYBLHcnL1eu5otUpCVWWp27fOYbw1yW8Ah4G/Dfx8VW3uln/j\nJNttTPJokm8kOZjk1q78giQPJXkqyYOzh7mS7ExyJMnhJNuW+oNJkoarqYeR5I+Au4B7q+r/zvns\nH1bVpxbYbj2wvqoOJDkP+BqwHXgf8OdV9etJbgMuqKodSS4DPg1cDmwEHgEundudsIcxPPYwpNVj\nLD0M4OeAu0+ERZIzkpwLsFBYdJ8dq6oD3fLzDHooGxmExp5utT0MhrkArgLuqaoXq+oocATYuqif\nSJI0Eq2B8Qhwzqz353ZlzZJcBGwBvgKsq6oZGIQKcGG32gbgmVmbPdeVSZImbG3jemd3PQRg0Fs4\n0cNo0Q1H3Qt8sNt27pjAoscIdu/e/fJyr9ej1+stdheStKL1+336/f7Q9tc6h/Fl4ANV9Xj3/h3A\nb1bVX2/Ydi3we8AfVtXHurLDQK+qZrp5ji9W1eYkO4Cqqju69R4AdlXVY3P26RzGkDiHIa0e45rD\n+KfAZ5P8lyRfAj4D3NK47SeAQyfCorMPuKFbvh64b1b5dUnOTHIxcAmwv7EeSdIINd+HkeR1wFu6\nt09V1fGGba4A/hg4yODPyQJuZxACexncCPg0cG1V/Z9um53APwaOMxjCemie/drDGBJ7GNLqcbo9\njMUExt8ALmLWvEdVfXKpFZ8OA2N4DAxp9TjdwGia9E7yKeAvAQeAH3XFBUwkMCRJ49d6ldRfBS7z\nz3pJWr1aJ72fBNaPsiGSpOnW2sN4I3AoyX7gBycKq+qqkbRKkjR1WgNj9ygbIUmafou5SmoTgwcB\nPtLd5b2mqr430tYt3BanU4bEq6Sk1WNcjzf/Jwwe7fEfu6INwBeWWqkkaflpnfT+ZeAKBl+aRFUd\n4ZUHBkqSVoHWwPhBVf3wxJvu+VD26yVpFWkNjD9KcjtwTvdd3p8F7h9dsyRJ06b1abVnMHi+0zYG\ns5UPAndNaubZSe/hcdJbWj3G9iypaWJgDI+BIa0e43qW
1P9knt/uqnrzUiuWJC0vi3mW1AlnA38f\n+InhN0eSNK2WPCSV5GtV9Y4ht6e1boekhsQhKWn1GNeQ1NtnvT2DQY+jtXciSVoBWv/R/8is5ReB\no8C1Q2/NKrd+/UXMzDw96WZI0ry8SmqKTGZ4yCEpabUY15DUr5zs86r66FIbIElaHhZzldTlwL7u\n/c8D+4Ejo2iUJGn6tN7p/cfAz514nHmSHwN+v6r+1ojbt1B7HJIaXq0TqHOy9a7Ec0dqMZbHmwPr\ngB/Oev/DrkyStEq0Dkl9Etif5PPd+6uBPaNpkiRpGjX1MKrqXwPvA77dvd5XVf/mVNsl+XiSmSRf\nn1W2K8mzSR7vXlfO+mxnkiNJDifZtvgfR5I0Kq1DUgDnAt+tqo8Bzya5uGGb3wbePU/5R6vq7d3r\nAYAkmxnc27EZeA9wZwaD+pKkKdD6Fa27gNuAnV3R64D/dKrtqupLDHokr9nlPGXbgXuq6sWqOsrg\nCqytLe2TJI1eaw/jF4CrgO8DVNX/An7sNOq9JcmBJHclOb8r2wA8M2ud57oySdIUaJ30/mFVVZIC\nSPL606jzTuBXu/39GoPHjty02J3s3r375eVer0ev1zuNJknSytPv9+n3+0PbX+t9GP8MuBR4F/Ah\n4Ebg7qr6Dw3bbgLur6q3nuyzJDuAqqo7us8eAHZV1WPzbOd9GMOrdQJ1TrbelXjuSC3Gch9GVf1b\n4F7gd4G3AP+qJSw6YdacRZL1sz67BniyW94HXJfkzG5C/RIGd5NLkqbAKYekkqwBHqmqdwIPL2bn\nSe4GesAbknwT2AW8M8kW4CUGT719P0BVHUqyFzgEHAduXpHdCElaplqHpP4zcE1VfWf0TTo1h6SG\nWusE6pxsvSvx3JFajOVptcDzwMEkD9NdKQVQVbcutWJJ0vLSGhif616SpFXqpENSSf5iVX1zjO1p\n4pDUUGudQJ2TrXclnjtSi1FfJfWFWRX97lIrkSQtf6cKjNlJ9OZRNkSSNN1OFRi1wLIkaZU51RzG\njxhcFRXgHOCFEx8xuCv7x0fewvnb5RzG8GqdQJ2TrXclnjtSi5FeVltVa5a6Y0nSyrKY78OQJK1i\nBoYkqYmBIUlqYmBIkpoYGJKkJgaGJKmJgSFJamJgSJKaGBiSpCYGhiSpiYEhSWpiYEiSmrR+Rau0\nQpzVPRV4vNat28SxY0fHXq80TCd9vPm08vHmQ611AnWuznpX4jmr5WXUX9EqSRIw4sBI8vEkM0m+\nPqvsgiQPJXkqyYNJzp/12c4kR5IcTrJtlG2TJC3OqHsYvw28e07ZDuCRqnoL8CiwEyDJZcC1wGbg\nPcCdmcRgsyRpXiMNjKr6EvDtOcXbgT3d8h7g6m75KuCeqnqxqo4CR4Cto2yfJKndJOYwLqyqGYCq\nOgZc2JVvAJ6Ztd5zXZkkaQpMw2W1S7p0ZPfu3S8v93o9er3ekJojSStDv9+n3+8PbX8jv6w2ySbg\n/qp6a/f+MNCrqpkk64EvVtXmJDuAqqo7uvUeAHZV1WPz7NPLaodX6wTqXJ31rsRzVsvLcrisNt3r\nhH3ADd3y9cB9s8qvS3JmkouBS4D9Y2ifJKnBSIekktwN9IA3JPkmsAv4MPDZJDcCTzO4MoqqOpRk\nL3AIOA7cvCK7EZK0THmn9xRxSGpl17sSz1ktL8thSEqStAIYGJKkJgaGJKmJgSFJamJgSJKaGBiS\npCYGhiSpiYEhSWpiYEiSmhgYkqQmBoYkqYmBIUlqYmBIkpoYGJKkJgaGJKmJgSFJamJgSJKajPQr\nWiWdcFb3jYrjtW7dJo4dOzr2erUy+RWtU8SvaLXeUdS7En9XtDR+RaskaSwMDElSEwNDktTEwJAk\nNZnYVVJJjgLfAV4CjlfV1iQXAJ8BNgFHgWur6juTaqMk6RWT7GG8BPSq6m1VtbUr2wE8UlVvAR4F\ndk6sdZKkV5lkYGSe+rcDe7rlPcDVY22RJGlBkwyMAh5O8tUkN3Vl66pqBqCqjgEXTqx1kqRXmeSd\n3ldU1beS/AXgoSRP8do7mxa842j37t0vL/d6PXq93ijaKEnLVr/fp9/vD21/U3Gnd5JdwPPATQzm\nNWaSrAe+WFWb51nfO72HV+sE6rTecda7En9XtDTL8k7vJOcmOa9bfj2wDTgI7ANu6Fa7HrhvEu2T\nJL3WpIak1gGfT1JdGz5dVQ8l+RNgb5IbgaeBayfUPknSHFMxJLVYDkkNtdYJ1Gm946x3Jf6uaGmW\n5ZCUJGn5MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTSb5ePOp9Du/\nczef+9zvj73ejRvXj71OrQZndY+cGZ916zZx7NjRsdap8fBZUnO84x0/y+OPbwX+8kj2v5A1a27i\nRz/6f/gsKetd/vX6/KppdbrPkrKHMa+/A/zsWGtcs+bmLjAkaTo5hyFJamJgSJKaGBiSpCYGhiSp\niYEhSWriVVKShmz8936A93+Mg4Ehach+wCTuOZmZGX9IrTYOSUmSmhgYkqQmUxkYSa5M8qdJ/luS\n2ybdHknSFAZGkjOA3wTeDfw08ItJfmqyrVrp+pNuwArTn3QDVpD+pBugWaYuMICtwJGqerqqjgP3\nANsn3KYVrj/pBqww/Uk3YAXpL2LdwdVZ436tX3/RiH726TONV0ltAJ6Z9f5ZBiEiSScxqauzzl41\nlxFPY2BM1Nlnv45zz/0XrF3778Za7/e/75NqpeVp9VxGPHXfh5HkrwG7q+rK7v0OoKrqjlnrTFej\nJWmZOJ3vw5jGwFgDPMXgCym+BewHfrGqDk+0YZK0yk3dkFRV/SjJLcBDDCblP25YSNLkTV0PQ5I0\nnabxstrXSHI0yX9N8kSS/V3ZBUkeSvJUkgeTnD/pdk6rJB9PMpPk67PKFjx+SXYmOZLkcJJtk2n1\ndFrgWO5K8mySx7vXlbM+81ieRJKNSR5N8o0kB5Pc2pV7fi7SPMfyA1358M7Pqpr6F/A/gAvmlN0B\n/PNu+Tbgw5Nu57S+gL8JbAG+fqrjB1wGPMFguPIi4L/T9UR9LXgsdwG/Ms+6mz2Wpzye64Et3fJ5\nDOYvf8rzc6jHcmjn57LoYQDhtb2h7cCebnkPcPVYW7SMVNWXgG/PKV7o+F0F3FNVL1bVUeAI3gfz\nsgWOJQzO0bm247E8qao6VlUHuuXngcPARjw/F22BY7mh+3go5+dyCYwCHk7y1SQ3dWXrqmoGBgcK\nuHBirVueLlzg+M29cfI5XjnptLBbkhxIctes4ROP5SIkuYhB7+0rLPz77TFtMOtYPtYVDeX8XC6B\ncUVVvR34u8AvJ/kZXnunjLP3p8fjt3R3Am+uqi3AMeAjE27PspPkPOBe4IPdX8f+fi/RPMdyaOfn\nsgiMqvpW998/A77AoNs0k2QdQJL1wP+eXAuXpYWO33PAm2att7Er0wKq6s+qGxQGfotXuvUeywZJ\n1jL4B+5TVXV
fV+z5uQTzHcthnp9THxhJzu0SkySvB7YBB4F9wA3datcD9827A50QXj2OudDx2wdc\nl+TMJBcDlzC4eVKveNWx7P5BO+Ea4Mlu2WPZ5hPAoar62Kwyz8+lec2xHOb5OXU37s1jHfD57nEg\na4FPV9VDSf4E2JvkRuBp4NpJNnKaJbkb6AFvSPJNBldNfBj47NzjV1WHkuwFDgHHgZtn/XWy6i1w\nLN+ZZAvwEnAUeD94LFskuQJ4L3AwyRMMhp5uZ3CV1Gt+vz2mCzvJsfylYZ2f3rgnSWoy9UNSkqTp\nYGBIkpoYGJKkJgaGJKmJgSFJamJgSJKaGBiSpCYGhiSpyf8HHxttW4V88GMAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# histogram of the 'duration' Series (shows the distribution of a numerical variable)\n", "movies.duration.plot(kind='hist')" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAEqCAYAAAAF56vUAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XmYHGW5/vHvnQRkk4hsIwQICgJRFBQiCuqgggsquLHp\nEcTtHA6bHBdwS1DPUURUlJ8LshgUxKAiARVZR0Rk3yIBjEsQUYKgsrgG8vz+eN9mKp1OpruqmnRq\n7s919ZVOTc8zNTM9T731vJsiAjMzW/lNWNEnYGZm9XBCNzNrCCd0M7OGcEI3M2sIJ3Qzs4ZwQjcz\na4gxE7qkJ0i6WtKNkuZKmpGPryPpQkl3SPqxpMmFzzla0nxJt0navZ/fgJmZJepmHLqkNSLi75Im\nAj8DDgPeANwfEZ+W9AFgnYg4StI04AxgR2AKcDGwZXjAu5lZX3VVcomIv+enTwAmAQHsCczKx2cB\ne+XnrwXOiohHImIBMB+YXtcJm5lZZ10ldEkTJN0I3ANcFBHXAhtGxEKAiLgH2CC/fGPgrsKn352P\nmZlZH03q5kURsRjYXtLawDmSnkFqpS/xsl6+sCSXYMzMSogIdTre0yiXiHgQGAFeASyUtCGApCHg\n3vyyu4FNCp82JR/rFK+rx4wZM7p+rWOu/OfomI7pmMt+LE83o1zWa41gkbQ6sBtwGzAHODC/7ADg\n3Px8DrCvpFUlbQ5sAVwz1tcxM7Nquim5PAWYJWkC6QLw7Yj4oaSrgNmSDgLuBPYGiIh5kmYD84BF\nwMEx1mXFzMwqGzOhR8Rc4Dkdjv8ZeNkyPueTwCcrn102PDxcV6hxH3NlOEfHdEzHLKercej9IMkN\ndzOzHkki6ugUNTOzweWEbmbWEE7oZmYN4YRuZtYQTuhmZg3hhG5m1hBO6GZmDeGEbmbWEE7oZmYN\n4YRuZtYQTuhmZg0xcAl9aGgqksZ8DA1NXdGnamY2UAZucS5JdLf5kcZc7N3MrGm8OJeZ2TjghG5m\n1hBO6GZmDeGEbmbWEE7oZmYN4YRuZtYQTuhmZg3hhG5m1hBO6GZmDeGEbmbWEE7oZmYN4YRuZtYQ\nTuhmZg3hhG5m1hBjJnRJUyRdKulWSXMlHZqPz5D0e0k35McrCp9ztKT5km6TtHs/vwEzM0vGXA9d\n0hAwFBE3SVoLuB7YE9gHeCgiPtv2+m2AM4EdgSnAxcCW7Yufez10M7PeVVoPPSLuiYib8vOHgduA\njVuxO3zKnsBZEfFIRCwA5gPTy5y4mZl1r6cauqSpwHbA1fnQIZJuknSypMn52MbAXYVPu5vRC4CZ\nmfVJ1wk9l1u+AxyeW+pfAp4aEdsB9wDH9+cUzcysG5O6eZGkSaRk/o2IOBcgIv5UeMnXgPPy87uB\nTQofm5KPLWXmzJmPPR8eHmZ4eLjL0zYzGx9GRkYYGRnp6rVdbRIt6XTgvog4snBsKCLuyc/fA+wY\nEftLmgacATyPVGq5CHeKmpnVYnmdomO20CXtDLwZmCvpRlK2/SCwv6TtgMXAAuDdABExT9JsYB6w\nCDi4Y+Y2M7NaddVC78sXdgvdzKxnlYYtmpnZysEJ3cysIZzQzcwawgndzKwhnNDNzBrCCd3MrCGc\n0M3MGsIJ3cysIZzQzcwawgndzKwhnNDNzBrCCd3MrCGc0M3MGsIJ3cysIZzQzcwawgndzKwhnNDN\nzBrCCd3MrCGc0M3MGsIJ3cysIZzQzcwawgndzKwhnNDNzBrCCd3MrCGc0M3MGsIJ3cysIZzQzcwa\nwgndzKwhnNDNzBpizIQuaYqkSyXdKmmupMPy8XUkXSjpDkk/ljS58DlHS5ov6TZJu/fzGzAzs0QR\nsfwXSEPAUETcJGkt4HpgT+BtwP0R8WlJHwDWiYijJE0DzgB2BKYAFwNbRtsXktR+qHUcWP455Vcy\n1rmbmTWNJCJCnT42Zgs9Iu6JiJvy84eB20iJek9gVn7ZLGCv/Py1wFkR8UhELADmA9MrfQdmZjam\nnmrokqYC2wFXARtGxEJISR/YIL9sY+CuwqfdnY+ZmVkfTer2hbnc8h3g8Ih4WFJ7vaPn+sfMmTMf\nez48PMzw8HCvIczMGm1kZISRkZGuXjtmDR1A0iTgfOBHEXFCPnYbMBwRC3Od/bKI2EbSUUBExLH5\ndRcAMyLi6raYrqGbmfWoUg09OxWY10rm2RzgwPz8AODcwvF9Ja0qaXNgC+Cans/azMx60s0ol52B\ny4G5pKZzAB8kJenZwCbAncDeEfHX/DlHA28HFpFKNBd2iOsWuplZj5bXQu+q5NIPTuhmZr2ro+Ri\nZmYDzgndzKwhnNDNzBrCCd3MrCGc0M3MGsIJ3cysIZzQzcwawgndzKwhnNDNzBrCCd3MrCGc0M3M\nGsIJ3cysIZzQzcwawgndzKwhnNDNzBrCCd3MrCGc0M3MGsIJ3cysIZzQzcwawgndzKwhnNDNzBrC\nCd3MrCGc0M3MGsIJ3cysIZzQzcwawgndzKwhnNDNzBrCCd3MrCHGTOiSTpG0UNIthWMzJP1e0g35\n8YrCx46WNF/SbZJ279eJd2toaCqSunoMDU1d0adrZlaaImL5L5B2AR4GTo+IZ+VjM4CHIuKzba/d\nBjgT2BGYAlwMbBkdvoikToeRBCz/nPIrGevce4vXfUwzsxVFEhGhTh8bs4UeEVcAf+kUt
8OxPYGz\nIuKRiFgAzAem93CuZmZWUpUa+iGSbpJ0sqTJ+djGwF2F19ydj5mZWZ9NKvl5XwI+FhEh6RPA8cA7\neg0yc+bMx54PDw8zPDxc8nTMzJppZGSEkZGRrl47Zg0dQNJmwHmtGvqyPibpKCAi4tj8sQuAGRFx\ndYfPcw3dzKxHlWrorRgUauaShgofez3wi/x8DrCvpFUlbQ5sAVzT+ymbmVmvxiy5SDoTGAbWlfQ7\nYAawq6TtgMXAAuDdABExT9JsYB6wCDi4YzPczMxq11XJpS9f2CUXM7Oe1VFyMTOzAeeEbmbWEE7o\nZmYN4YRuZtYQTuhmZg3hhF6CV3A0s0HkYYsDEtPMrBsetmhmNg44oZuZNYQTuplZQzihm5k1hBO6\nmVlDOKGbmTWEE7qZWUM4oQ8IT1Yys6o8sajBMc2seTyxyMxsHHBCNzNrCCd0M7OGcEI3M2sIJ3Qz\ns4ZwQjczawgndDOzhnBCNzNrCCd0M7OGcEI3M2sIJ3Qzs4ZwQjcza4gxE7qkUyQtlHRL4dg6ki6U\ndIekH0uaXPjY0ZLmS7pN0u79OnEzM1tSNy3004CXtx07Crg4IrYCLgWOBpA0Ddgb2AZ4JfAlpWUE\nzcysz8ZM6BFxBfCXtsN7ArPy81nAXvn5a4GzIuKRiFgAzAem13OqZma2PGVr6BtExEKAiLgH2CAf\n3xi4q/C6u/MxMzPrs0k1xSm128LMmTMfez48PMzw8HBNp2Nm1gwjIyOMjIx09dqudiyStBlwXkQ8\nK///NmA4IhZKGgIui4htJB0FREQcm193ATAjIq7uENM7FvU5ppk1Tx07Fik/WuYAB+bnBwDnFo7v\nK2lVSZsDWwDX9HzGZmbWszFLLpLOBIaBdSX9DpgBfAo4W9JBwJ2kkS1ExDxJs4F5wCLg4I7NcDMz\nq503iW5wTDNrHm8SbWY2Djihm5k1hBO6mVlDOKGbmTWEE7qZWUM4oZuZNYQTuplZQzihm5k1hBO6\nmVlDOKGbmTWEE7qZWUM4oZuZNYQTeoMNDU1FUlePoaGpK/p0zawir7bomD3FNLMVy6stWm3c6jcb\nXG6hO+YKj2lm3XML3cxsHHBCNzNrCCd0M7OGcEI3M2sIJ3Qzs4ZwQjczawgndDOzhnBCNzNrCCd0\nW+E8+9SsHp4p6piNjGnWVJ4pamY2Djihm5k1xKQqnyxpAfAAsBhYFBHTJa0DfBvYDFgA7B0RD1Q8\nTzMzG0PVFvpiYDgito+I6fnYUcDFEbEVcClwdMWvYWZmXaia0NUhxp7ArPx8FrBXxa9hZmZdqJrQ\nA7hI0rWS3pGPbRgRCwEi4h5gg4pfw8zMulCphg7sHBF/lLQ+cKGkO1h6/Nkyx5jNnDnzsefDw8MM\nDw9XPB0zs2YZGRlhZGSkq9fWNg5d0gzgYeAdpLr6QklDwGURsU2H13scumP2LaZZU/VlHLqkNSSt\nlZ+vCewOzAXmAAfmlx0AnFv2a5iZWfeqlFw2BM6RFDnOGRFxoaTrgNmSDgLuBPau4TzNzGwMnvrv\nmI2MadZUnvpvZjYOOKGbmTWEE7qZWUM4oZuZNYQTuplZQzihm5k1hBO6mVlDOKGbmTWEE7qZWUM4\noZuZNYQTuplZQzihWyMNDU1F0piPoaGpK/pUzWrjxbkcc5zH9GJftnLx4lxmZuOAE7qZWUM4oZuZ\nNYQTuplZQzihm5k1hBO6mVlDOKGbmTWEE7qZWUM4oZt1ybNPbdB5pqhjjvOY3c8U9exTGwSeKWpm\nNg44oZuZNYQTuplZQzihm61A7mi1OrlT1DHHecwV2ynqjlbr1QrpFJX0Ckm3S/qlpA/06+uYmVnS\nl4QuaQJwIvBy4BnAfpK2Lh9xpJbzcsx+xHPMQYvZjzLO41UaGhkZqfT54z1mv1ro04H5EXFnRCwC\nzgL2LB9upJ6zcsw+xHPMQYu5cOGdpDJO8TFjqWPpdSsuZqeLxK677uqLRAX9SugbA3cV/v/7fMzM\nDHj8LhLHHHPMuOlk9igXM2uMleEisazyVR0Xnr6McpG0EzAzIl6R/38UEBFxbOE17rI3MythWaNc\n+pXQJwJ3AC8F/ghcA+wXEbfV/sXMzAyASf0IGhGPSjoEuJBU1jnFydzMrL9W2MQiMzOrlztFzcwa\nwgndrGEkTZC094o+D3v8jZuELulQSeus6PPolqQ1aoy1UnzvkiZK2kjSpq1HDfEuq+v8CnFfk2dD\nD6SIWAy8f0WfRzfyxecFK/o8upEHe9Qdc9064w3sm1LSlpK+I2mepN+0HhVCbghcK2l2Xmem47Cf\nHs9xfUkflHSSpFNbj4oxXyBpHnB7/v+zJX2p4qnW+r0reYukj+b/byppesWYhwILgYuAH+TH+VVi\nRsSjwGJJk6vE6WAfYL6kT1db0iKRtLOki/K6R7+R9NuK73WAiyW9V9Imkp7celQ4xyvyvw9JerDw\neEjSg2Xj5ovP/yv7+csi6XuS9qj5wjtf0nGSptUY8ypJZ0t6VS05aVA7RfMbaAbwOeA1wNuACRHx\n0QoxBeyeY+0AzCaNwPl1yXhXAj8FrgcebR2PiO9WOMergTcCcyJi+3zsFxHxzLIxc4zavndJXwYW\nAy+JiG1y6//CiNixwvn9CnheRNxfNsYy4p4LbE+6UPytdTwiDqsYd21gP9LPM4DTgG9FxEMlYt0O\nvIel30elfxaSftvhcETEU8vG7BdJnwF+Dnyv4xKs5WK+jPS72Qk4GzgtIu6oGPOJwL457gTgVOCs\niCh9Qct/ly8DDgJ2JP1dfj0iflkqYEQM5AO4Pv87t/1YxbjPBj5PagF/GbgR+HTJWDf14fu+Ov97\nY+HYzTXFruV7B26o+xyBy4BJffh5HtDpUVPsdYEjgAXAj4D5wKFlf+eD/ABeX3i+Ts2xHyI1EP4N\nPJj//2BNsScD/0laiuRKUjJepYa4LwbuJjUSZgFb1BBz1xzzr8BPgOf3GqMv49Br8q98uzQ/j2m/\nG1irbDBJhwNvBe4DTgbeFxGLWl+DcjXH8yW9KiJ+WPa8Orgr1xRD0irA4UClMfx9+N4X5Xpi5Pjr\nk/4gq/gNMCLpB8C/Wgcj4rNVgkbELEmrAk/Ph+6ItGBcaZL2BA4EtgBOB6ZHxL2532Me8MUeQ14m\n6Tjgeyz5vd9Q4RzXAI4ENo2Id0naEtgqIsqWsT6czw/gEuA5Zc+tXUQ8sa5YRbk+/RbgP0iNlzOA\nXUgX9eES8SYCe5AuClOB43PMFwI/ZPQ9VvYcFwKHAnOA7Uh3Fpv3Em+QE/rhwBrAYcDHgZeQfhFl\nPZnUylhiEYeIWCzp1RXO8YOS/g20kkRExNoVzvM/gRNIi5ndTZqc9d8V4kH93/sXgHOADST9L6lE\n9OGK5/i7/Fg1P2ohaZjUgloACNhE0gERcXmFsK8DPtceIyL+LuntJeI9L/+7QzEc6T1f1mmkEk6r\nw/FuUoIom9C1jOelSdo6Im6X1PHiUPGCdg6w
FfAN4DUR8cf8oW9Luq5k2PmkO8njIuLKwvHvSHpR\nyZg/z+e4V0T8vnD8Oklf6TXYwNbQ6zJWR1BE/PnxOpcVKbcuNqRwEY+I31WItzVpaQcBl0RNM4El\nrZXP7eGa4l0P7B+5firp6aRa93NLxpsIXBwRu9Z0fhOAN0bE7DriFeJeFxE7SLoxRvtibo6IZ5eM\ndzupz2AC8E1gfwqJvUzylXRSvnvoNBIpIqL0BU3SrhFR6wgnSWvV9b7M8SaSSp7/U1fMgW2hS9oB\n+BCwGUsmoWf1GOp6UmtHwKbAX/LzJ5FahD3d0nQ4z9cCravzSIVb2la8zUm3XVNZ8vt+bYWYhwAz\nSbd0rdJIAL3+LFtvwlsjYmvySJw6SHomqaXy5Pz/+4C3RsStFUOvEoXOsIj4ZS5llRJpWYvFkiZH\nxAMVz611l/R+UmdYnf4taXVGy2JPo1DOKeGPQKv8dU/hOZS8m4iId+V/a7k4Akh6fafnha/5vfZj\nPfiopE8A/wAuIP39vCcivlkmWH4v1Tpkc2ATOqk29T5gLhXqsxGxOYCkrwHntOrdkl4J7FXlBCV9\nitQzfUY+dLiknSPi6Aphvw+cApxH9bp0yxGk+mnlEST5TXiHpE2rtPA7OAk4stWqyqWSrzFaMijr\nOkknk1qVAG8Gyt5ytzwMzJVU18iZiyW9F/h2W7wqd48zSElnE0lnADuT6v6l1Jl02+UL7H9RaBgB\nXy3Z1/Ga5XwsGO0HKGP3iHi/pNeRSnivBy5n9L1Vxk2S5pDKYcXffanzHNiSi6QrImKXGuPNjYht\nxzrWY8xbgO0ijaVttV5vLHEXUYx5dUQ8b+xX9hTzMmC3iHikpniXk4YCXsOSb8IqdxFLlQOqlAgK\nMZ5A6oNovZd+CnwpIkq3ViV17MuJiFkl4/VliGHucNuJdEd6VUTcVyVeh/gntVrZFeOcDKxC6uuA\n1EH4aES8o2rsOkm6NSKekc/3OxFxQdX3qKTTOhyOiDioVLwBTugvJdXsLmHJnv9yVy7px6Q/5mJL\n7UUR8fIK53gLMNxqSeV6/UjFhL4/sCWpM7TSiAdJR+anzyB1ENUygkTSizsdj4iflImXY54D3EAq\nu0Dq+X9uRLyuQsyJwOkR8eayMZYTe3XSCJJKY5vrtqwOxpYqHY0dvtYNEVF5tEudF3NJb4mIbxbe\n+0uoMmoq35HvRSq5TCeVbc+vuwFWxSCXXN4GbE26chfrvmVvmfYj3Yaek+Ncno9V8UngxtwCFumW\n8aiKMbcltVBewpLfd5kOotZwsFpHkFRJ3MtxEHAMo7/fn+ZjpeXy0GaSVo2If1c9wRZJrwE+Q/pZ\nbi5pO+BjZe9Qah5iePxyPlZ15Ey7e2uK86ikp0We5CbpqRQmWPVozfxv7UMhI+IoSZ8GHsjvrb9R\naa9kkDSFNMx153zop8DhbSNeuo83wC30OyJiqz7EXTMi/jb2K7uO9xRSHR3gmoi4p2K8XwHT6kxA\nhdhrk27nep7N2BbnIXJnGymprQL8reJwzb6QdDqwDWlsb7E8VKWldj0pMY5EDbN5JX2b1Hn/1oh4\nZk7wV0bEdmXPcWWS78ZPI81FEGkgxNvqHqVSh9x5Pw1YrXUsIk6vEO8i4EyWvDN9c0TsVibeILfQ\nr5Q0LSLm1REs9yafTJqctKmkZwPvjoiDS8RqHz/buppuJGmjire1vyDdytXV+mmNGDqN3GqR9ABw\nUERcXyZeFCaCSBKplbJTyXP7fEQcIek8Ri8Sxa9Vui6f/To/JlBfq21RRDygJZfeqNKB/bSI2EfS\nfvDYePZSY707jewoqlCy7NvvKSIuad2V5EN3VOnjgL6NFptBmpA0jTSR6JXAFaTJZWWtHxHFOvrX\nJR1RNtggJ/SdSD3AvyXVfUVqXZatT38OeDmppUZE3KzykwGOBN5F59vbqre1TwJul3QtS9a7qyS2\nU4GDI+KnAJJ2ISX40rX+wnkF8P38Zi9Tbmq1TD5T9Vza5Rr6EyPivTWHvjX3dUzMiegw0rTysuoc\nYtivUR61/56W8/f3PElEtclf/Rgt9kbS8hk3RsTbJG1ItREuAPdLegvwrfz//YDSo9EGOaG/ou6A\nEXFXW8OnVJ2u0LP/yoj4Z/Fjklbr8Cm9mFHx8zt5tJXMASLiCkmlR7y0tQInkGY4/nMZL1+uwl3C\ndhFxQtvXOZy0pkUpuc6589iv7NmhpDkS/yLdLv+YNJu5rJksPcTwbWUCRUSpz+sibuv3dB3wj7aR\nXU8oGfZ9nb4UqaGxCVBludp/RsQXKnx+J//I8wYeyeXLe0nnWcVBpBr650jf+5VUGF46sAk98jR1\nSRtQqFdVUPsaKaQffnsvf6djXetTh+NPJH2V1AoI0vKvI62SUYkSUbEV+AhpTG6lziHSsg4ntB07\nsMOxXtU6zjfbIyI+RErqAEh6U/4aPYuIC3NdvjXE8PCyQwz7Ocoju4S0OmBrxuTqpBFZPc8XiIgl\n7ibyxffDpIlLh1Y7TU7Id42VR4sVXCfpSaT5EdeTfgY/r3SWMKX97jv/HO4qE2xgE7rSDMzjgY1I\nV8LNSAn4GSVD1rZGiqShHGd1SdszOgV6bdL6M2ViXhERu7R1OMJoqalKh2Nr+Fd76397ypWITo6I\nnxUP5Ddhz3X/XDfenzRaZE7hQ08E6liWYTXSLWzxe6w6weRolk7enY51RdIlEfFS0rDS9mO96tso\nj2y1KEx/j4iHVXEzltwp+hHS7+X/IuKiiucI9Y4WS5882t/2FUkXAGtHxC2VzjK1ztsbgJ2OdWVg\nEzrpFnYn0roZ20valdQDXEpu8dQ1HvnlpNbjFNJFp5XQHwQ+WCZg5ElU0YeV56L+WX51vgmvJE0r\nX48l+yQeAqr+sdRaglCaXfwqYGNJxdv5tUl3Kr3GW43UAFhPaU35YsNg4zLnGBFfzWWQByPic2Vi\njOFvkp7TaunmDvd/lAkkaQ/SXc4DwIcj4or6TpM3AU+tebjqYxfZiFjQfqzHWM8n3dWs33Y3tTYV\nSk2DnNAXRcT9SltUTYiIyyR9vmywOnu9I80InCXpDVFhM4sO51hcJ6VW+Y/nGSw53OpjPcao/U2Y\nS2t3As8v8/ljqXmc7x9INeTXkm65Wx4ibVDRq3eTlmXYKMcrNgxOLBEPeKzvYD9SXbZuRwBnS/pD\n/v9TSCW8Ms4jjRC7H3i/0po2j6k4EKC20WL9uPCShvuuRcpFxUbcg6TO11IGOaH/VWnlvcuBMyTd\nS6EGWkI/er2fm6/QfwXIv+z/iYhSS8lGn9ZJUVqGcw3SAvonk94w15QI1Zc3YT7HnUiJd5v8dSZS\nz9j200gdl2/K/39LPtbzON+IuBm4WdKZkdcZyb/zTSLiLyXinUCq9R4aEb2uoT6Wn0k6kaXXhylV\nQ5a0I3BXRFyrtNLmu0lrmVwAdFq6oBt9Wx+GekeL1X7hzX1lP5H09UJ/4QRgraiyA9IATyxak3Qr\nN4F
UKpkMnBElF5hSf9ZIeWxp0sKxStOh1Z91Um6JiGcV/l0L+FFEvLBkvM2ibW31qpTWqN6XVIfe\ngbQhx9Oj2kJnSLqpfYJOp2M9xhwhtdInkf7A7yVNBCrTSm91qF4QEQ9J+jCpdPWJKh14qnlJWkk3\nAC+LiD/n4YZnke54twO2iYhKF/TC13lOle+7EKcfy1PUfuGVdCapf+9R4FpSq/+EiDiuTLyBbKHn\n0sP5ufa7mNFFe6roR6/3RElPaE2CyGOJSw3hkrQFab3yj7R96IWkGnMVrSGFf5e0EekW9ykV4j1B\n0kksXb6qNK08In4laWKkzZ1Pk3QjqbOxilrH+WaTI+JBSe8grRUzQ2ldn7I+EhFnK80PeBlwHGmL\nwNINkD70m0yM0dUf9wFOyuXG70q6qcavczI17IbUj9FiEfHFPFJuKku+76tMLJqW30tvJm1jeBSp\nkdCchB41rzmd1d7rTVo29xKlFdNE6igte/H5PHB0RMwtHpT0Z+D/SOWiss7Lw62OIy2AFaShV2Wd\nDXyF9MdXds2Ndn9X2iruJqX1Mv5IujurqtM436odpZOUlnzYm8LQxQpaP8M9SInyB0rrbpemtMrk\nG1g6+fTUb1IwUdKkSCt2vpQ0sa6lzjxS125ItZfwJH0DeBpwE6O/s6DaTNFV8jDqvYATI20NWbps\nMpAJPat7zenae70j4lhJN5NaVUGaYLJZyXAbtifz/DXmSppa9hxzXa5V5/+upPNJQ8+qXCgfiYgv\nV/j8Tv6D9Ed3CKmDcRNSQqokl4aqLh/Q7mOk3/UVuab8VNL2ZGXdrTRPYDfg2JyMq17MziWNHrme\nahtbtHyLVPO9j1QKbc063iJ/nbocU1OcE+lQwqsYcwdSi7rOOvVXSfM4bgYul7QZqTZfyiDX0Ote\nc/r7wLsiorY1UnLc7UnjqN9E6hz6bkT03FEiaX5EbLmMj/0qIraocI5L1fqrkDSTVDc+hyXLVwOz\nnZ+kL9JhzZGWCg0DJK1bti9nGfHWIM2MnhsR83Prf9uIuLBCzNKLhS0n5k6kUt2FkRe4U9rSb62q\ndW9Jz2Lpu4nScwU0ugXfLZGXC6n6dyDpbOCwGN2ftC8Kd0I9G9gWeqTd2tfPz/9UQ8jaer3zm3i/\n/LiPNJJAFeuW10l6Z0QsUQrJddpSi2gVXCLpDcD3ampdtC62xanbAZTekEFps+qPM7rlYNUJVcVd\niY6h3iUVrsp149NIncuVfqaRFuO6l7QJx3zSmPYqLX5Ii9tt2+mur6yIuKrDsV9WjSvpVNJ0/1up\nZ6ls6E8Jbz1gnqRrqJ5DljujlyW3+Os+7qC10CWJ9Md3COkXINIb/IsV6n+19npLWky65Xx7RPwq\nH/tNVNhhRmmhn3OAfzOawHcg1f9eFxWW5VWafbom6ef4T+qZfVorpWWDX09qpdb6puzDHYpIZbaD\nSEsnzwaJDO4OAAAHnUlEQVS+Xja55c76HUhroD89d1yfHRE9r0Mj6RekpDiJtFHKb6hncbu+kTQv\nIqbVHHMz0h66q5JKeJNJO1X9qkLMOnPIuyNNAuvY0IiIUqWnQUzoR5KWpXxXRPw2H3sqqdf/gqgw\n+y0nzeLa5aXKL5L2ItXndiaNwz2LNB2+0obTOfauQOtW+daIuLRqzLqp3g0ZWjEvA14aedGnOlUd\nSjpG7F1JK+6tSaqDHhURPa3vkVv72wM3xOj66reUSb6S/kIaSthR3cNN6yDpFOD4qGGpbNW/1+1K\nZRAT+o2k/S/vazu+Pql2V6qlJWlv0iiPEVJr5YXA+yLiOxXOdU3SolT7kUbLnE7aiLp07bMf1Hlb\nsgeAO8vU6tSHDRmUJq58nLS6YuVt8tpi15rQlfbqfAupI3chaQTSHFIiPbvXC7ukayJieus88/vq\n5yUTet8uXv2SW75zSItyVbqbKH7/kr4bEZU71guxaxs5oyWXjlhK2T6eQayhr9KezCHV0fPwnrI+\nBOzYapXnC8TFQOmEnjuGzgTOVJox+CbgA6Sx7oPkS6Sxva166rakqdGTJf1XiQtQbRsyFPwvaWTT\natSwTZ6WXORsDUmtkQN1lJt+TloffK9YcgmB65Rm5fZqdh7l8iRJ7ySVcsoOK91gOXXZWi6QfXAK\n6eI4l+qzuIvvw0qbbHdQ58iZYr9YbX08g5jQlzessMqQwwltJZb7qWecMwCRpn6flB+D5g+kev+t\nAJKmkYbevZ/U8dRrQq9zQ4aWjeoclRF9WOSsYKtl1fkj4thugyjtTHMlaQ7CrqThalsBH43yKw5O\nJC3PUMt47sfJnyJiztgv60os43k9wWua/FYcrSfpiLKj99oNYkJ/dqE1VSSqrYt+gaQfMzpjcB/S\nNlLjwdNbyRwgIuYpbaP3m5IN65ksvSHDgRXP8YeSdh+0clWRCsv7dvq5lRjtMIWUzLcmtU5/Rkrw\nVUY1/bHK4IEV5EalKfDnsWS5rcwol1b+EGl56zrvzPo1+a22C8/A1dDrlic+bBgRP1PaaWeX/KG/\nktaG+fWKO7vHR655/5nUeQvpYrYe6Tb3iojYcVmfu5yY6zK6IcNVncpkPcZrjcT5F7CIwRyJ8yfS\nxgPfAq6mrRVcZrRDjrsq6Rb+BaRVJ58P/LXMyI+6R/Q8HpRmWreLiDjocT+Z5ejHyJkct7Z+j/GQ\n0M+n85T6bUmL6S9vD8ZGyOWRgxm9mP2MVFf/J7BGFDYs6DLeeaS+gzm5H2FcUFpjaDdSJ/izSBtS\nfKt491My7mRSEt85//sk0vDNnpcokPTkGKAJXk3Qj5Ez7X08wN9bH6JCQ2Y8JPRrl9UClTQ3IrZ9\nvM9pRcitwK1Ib6I7Ii//WjLWi0mt/D1IK8SdRVpMred9RXPp5/ZljMSpunha3yhNz9+PNHLqmCg3\nO/gk0hr1D5Fa/FeR7nZ6Xop3ZaQ+zuatUz9HztRtEGvodXvScj62+uN2FiuQpGHSomELSC2ATSQd\nECV3VY/RtZwnkoZrvhM4lbT0Z6+OJC30dHyHj1VdPK12OZHvQUrmU4EvkCaElbEpaXXO+aRtEX9P\nKgWOF9eN/ZKB0M+RM7UaDy30bwGXRucp9btFRNndVlYaShsQ7x8Rd+T/P51UKnhuhZirkzaL3oc0\nJPL8iKi6se9Ak3Q6adLXD4GzIuIXNcQUqZX+gvx4Jqm/4+cRUedyBVZSWwt9oMf5j4eE3rcp9SuL\nTrMOy85EzJ87G5hOGunybeAndczwVP1rTddKacmHVp9BrRt5K22VtzMpqb8aWDcilnd3udKT9PmI\nOCL3ySyViEqMGuoLSY+Sfu8i3dXXUu/uh8Yn9JaVYUp9vygtfrSYNEUd0g5QE8uOIpD0ctLm3XWt\nhY6Wsdb0oNRR+0HSYYy2zBeRhiy2HnPruEgOMknPjYjr1YfdhcarcZPQx7Nc9/1vRke5/JQ03Kqn\nyUB52OcylRw33Ip9G/WvNT3QJH2WPPY8+rwk6yAa7+uu9IMT
unWtMF54A1KrsnWXsyspKb26QuzH\nZa1pGxwr0+iRlcV4GOUybkmaHRF7S5pL5xplTzX01rhopV2kprWSr9KGDF+veLqd1pqOiNizYlwb\nXCvN6JGVhRN6sx2e/y3dcl6GKW0t6YWU33qvZWbheWs1zH0rxrTB1td1V8Yjl1zGGUnrAfdXqVVL\nOpG0eUJxXZz5VTswtfR2ft+LiC9WiWmDa2UaPbKycAu9wfL6zZ8ijWv+OGnJ1/WACZLeGhEXlIkb\nEYdIeh3wonzoSmCo5Dn2Yzs/WwlExMQVfQ5N44TebCcCHyQtInQp8MqIuErS1qTWdamEni0gdYw+\ntjl2yTi3k0bdvDpGt/N7T4XzMhu3nNCbbVJrOVpJH4u8yW9eO6XnYH1qTb+eVCu/TFJrO7+VaS1v\ns4FR2wYPNpCKE1P+0faxMjX020lrq7w6InbJ9e1Kk4si4vsRsS9pTfDLgCNIu+58WdLuVWKbjTfu\nFG2wMTqdVouInrb0Ux83x277Oq3t/PaJiJfWGdusyZzQrWdaSTbHNhtvnNCtEremzQaHE7qZWUO4\nU9TMrCGc0M3MGsIJ3cysIZzQzcwa4v8DeCLSWObwHAgAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# bar plot of the 'value_counts' for the 'genre' Series\n", "movies.genre.value_counts().plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`plot`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 16. How do I handle missing values in pandas? ([video](https://www.youtube.com/watch?v=fCMrO_VzeL8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=16))" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
City  Colors Reported  Shape Reported  State  Time
18236Grant ParkNaNTRIANGLEIL12/31/2000 23:00
18237Spirit LakeNaNDISKIA12/31/2000 23:00
18238Eagle RiverNaNNaNWI12/31/2000 23:45
18239Eagle RiverREDLIGHTWI12/31/2000 23:45
18240YborNaNOVALFL12/31/2000 23:59
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "18236 Grant Park NaN TRIANGLE IL 12/31/2000 23:00\n", "18237 Spirit Lake NaN DISK IA 12/31/2000 23:00\n", "18238 Eagle River NaN NaN WI 12/31/2000 23:45\n", "18239 Eagle River RED LIGHT WI 12/31/2000 23:45\n", "18240 Ybor NaN OVAL FL 12/31/2000 23:59" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of UFO reports into a DataFrame\n", "ufo = pd.read_csv('http://bit.ly/uforeports')\n", "ufo.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What does \"NaN\" mean?**\n", "\n", "- \"NaN\" is not a string, rather it's a special value: **`numpy.nan`**.\n", "- It stands for \"Not a Number\" and indicates a **missing value**.\n", "- **`read_csv`** detects missing values (by default) when reading the file, and replaces them with this special value.\n", "\n", "Documentation for [**`read_csv`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
City  Colors Reported  Shape Reported  State  Time
18236FalseTrueFalseFalseFalse
18237FalseTrueFalseFalseFalse
18238FalseTrueTrueFalseFalse
18239FalseFalseFalseFalseFalse
18240FalseTrueFalseFalseFalse
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "18236 False True False False False\n", "18237 False True False False False\n", "18238 False True True False False\n", "18239 False False False False False\n", "18240 False True False False False" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'isnull' returns a DataFrame of booleans (True if missing, False if not missing)\n", "ufo.isnull().tail()" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
City  Colors Reported  Shape Reported  State  Time
18236TrueFalseTrueTrueTrue
18237TrueFalseTrueTrueTrue
18238TrueFalseFalseTrueTrue
18239TrueTrueTrueTrueTrue
18240TrueFalseTrueTrueTrue
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "18236 True False True True True\n", "18237 True False True True True\n", "18238 True False False True True\n", "18239 True True True True True\n", "18240 True False True True True" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'nonnull' returns the opposite of 'isnull' (True if not missing, False if missing)\n", "ufo.notnull().tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`isnull`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html) and [**`notnull`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.notnull.html)" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "City 25\n", "Colors Reported 15359\n", "Shape Reported 2644\n", "State 0\n", "Time 0\n", "dtype: int64" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the number of missing values in each Series\n", "ufo.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This calculation works because:\n", "\n", "1. The **`sum`** method for a DataFrame operates on **`axis=0`** by default (and thus produces column sums).\n", "2. In order to add boolean values, pandas converts **`True`** to **1** and **`False`** to **0**." ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
City  Colors Reported  Shape Reported  State  Time
21NaNNaNNaNLA8/15/1943 0:00
22NaNNaNLIGHTLA8/15/1943 0:00
204NaNNaNDISKCA7/15/1952 12:30
241NaNBLUEDISKMT7/4/1953 14:00
613NaNNaNDISKNV7/1/1960 12:00
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "21 NaN NaN NaN LA 8/15/1943 0:00\n", "22 NaN NaN LIGHT LA 8/15/1943 0:00\n", "204 NaN NaN DISK CA 7/15/1952 12:30\n", "241 NaN BLUE DISK MT 7/4/1953 14:00\n", "613 NaN NaN DISK NV 7/1/1960 12:00" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use the 'isnull' Series method to filter the DataFrame rows\n", "ufo[ufo.City.isnull()].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How to handle missing values** depends on the dataset as well as the nature of your analysis. Here are some options:" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(18241, 5)" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the number of rows and columns\n", "ufo.shape" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(2486, 5)" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# if 'any' values are missing in a row, then drop that row\n", "ufo.dropna(how='any').shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`dropna`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(18241, 5)" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'inplace' parameter for 'dropna' is False by default, thus rows were only dropped temporarily\n", "ufo.shape" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(18241, 5)" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# if 'all' values are missing in a row, then drop that row (none are dropped in this case)\n", "ufo.dropna(how='all').shape" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(15576, 5)" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# if 'any' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row\n", "ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(18237, 5)" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# if 'all' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row\n", "ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LIGHT 2803\n", "DISK 2122\n", "TRIANGLE 1889\n", "OTHER 1402\n", "CIRCLE 1365\n", "Name: Shape Reported, dtype: int64" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'value_counts' does not include missing values by default\n", "ufo['Shape Reported'].value_counts().head()" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "collapsed": false }, "outputs": [ { "data": { 
"text/plain": [ "LIGHT 2803\n", "NaN 2644\n", "DISK 2122\n", "TRIANGLE 1889\n", "OTHER 1402\n", "Name: Shape Reported, dtype: int64" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# explicitly include missing values\n", "ufo['Shape Reported'].value_counts(dropna=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`value_counts`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)" ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# fill in missing values with a specified value\n", "ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`fillna`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html)" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "VARIOUS 2977\n", "LIGHT 2803\n", "DISK 2122\n", "TRIANGLE 1889\n", "OTHER 1402\n", "Name: Shape Reported, dtype: int64" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# confirm that the missing values were filled in\n", "ufo['Shape Reported'].value_counts().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Working with missing data in pandas](http://pandas.pydata.org/pandas-docs/stable/missing_data.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 17. What do I need to know about the pandas index? (Part 1) ([video](https://www.youtube.com/watch?v=OYZNk7Z9s6I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=17))" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0Afghanistan0000.0Asia
1Albania89132544.9Europe
2Algeria250140.7Africa
3Andorra24513831212.4Europe
4Angola21757455.9Africa
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "2 0.7 Africa \n", "3 12.4 Europe \n", "4 5.9 Africa " ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of alcohol consumption into a DataFrame\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n", "drinks.head()" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=193, step=1)" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# every DataFrame has an index (sometimes called the \"row labels\")\n", "drinks.index" ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'country', u'beer_servings', u'spirit_servings', u'wine_servings',\n", " u'total_litres_of_pure_alcohol', u'continent'],\n", " dtype='object')" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# column names are also stored in a special \"index\" object\n", "drinks.columns" ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(193, 6)" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# neither the index nor the columns are included in the shape\n", "drinks.shape" ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0  1  2  3  4
0124Mtechnician85711
1253Fother94043
2323Mwriter32067
3424Mtechnician43537
4533Fother15213
\n", "
" ], "text/plain": [ " 0 1 2 3 4\n", "0 1 24 M technician 85711\n", "1 2 53 F other 94043\n", "2 3 23 M writer 32067\n", "3 4 24 M technician 43537\n", "4 5 33 F other 15213" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# index and columns both default to integers if you don't define them\n", "pd.read_table('http://bit.ly/movieusers', header=None, sep='|').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What is the index used for?**\n", "\n", "1. identification\n", "2. selection\n", "3. alignment (covered in the next video)" ] }, { "cell_type": "code", "execution_count": 128, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
6Argentina193252218.3South America
20Bolivia1674183.8South America
23Brazil245145167.2South America
35Chile1301241727.6South America
37Colombia1597634.2South America
52Ecuador1627434.2South America
72Guyana9330217.1South America
132Paraguay213117747.3South America
133Peru163160216.1South America
163Suriname12817875.6South America
185Uruguay115352206.6South America
188Venezuela33310037.7South America
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "6 Argentina 193 25 221 \n", "20 Bolivia 167 41 8 \n", "23 Brazil 245 145 16 \n", "35 Chile 130 124 172 \n", "37 Colombia 159 76 3 \n", "52 Ecuador 162 74 3 \n", "72 Guyana 93 302 1 \n", "132 Paraguay 213 117 74 \n", "133 Peru 163 160 21 \n", "163 Suriname 128 178 7 \n", "185 Uruguay 115 35 220 \n", "188 Venezuela 333 100 3 \n", "\n", " total_litres_of_pure_alcohol continent \n", "6 8.3 South America \n", "20 3.8 South America \n", "23 7.2 South America \n", "35 7.6 South America \n", "37 4.2 South America \n", "52 4.2 South America \n", "72 7.1 South America \n", "132 7.3 South America \n", "133 6.1 South America \n", "163 5.6 South America \n", "185 6.6 South America \n", "188 7.7 South America " ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# identification: index remains with each row when filtering the DataFrame\n", "drinks[drinks.continent=='South America']" ] }, { "cell_type": "code", "execution_count": 129, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "245" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# selection: select a portion of the DataFrame using the index\n", "drinks.loc[23, 'beer_servings']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`loc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)" ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
country
Afghanistan0000.0Asia
Albania89132544.9Europe
Algeria250140.7Africa
Andorra24513831212.4Europe
Angola21757455.9Africa
\n", "
" ], "text/plain": [ " beer_servings spirit_servings wine_servings \\\n", "country \n", "Afghanistan 0 0 0 \n", "Albania 89 132 54 \n", "Algeria 25 0 14 \n", "Andorra 245 138 312 \n", "Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "country \n", "Afghanistan 0.0 Asia \n", "Albania 4.9 Europe \n", "Algeria 0.7 Africa \n", "Andorra 12.4 Europe \n", "Angola 5.9 Africa " ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# set an existing column as the index\n", "drinks.set_index('country', inplace=True)\n", "drinks.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`set_index`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html)" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'Afghanistan', u'Albania', u'Algeria', u'Andorra', u'Angola',\n", " u'Antigua & Barbuda', u'Argentina', u'Armenia', u'Australia',\n", " u'Austria',\n", " ...\n", " u'Tanzania', u'USA', u'Uruguay', u'Uzbekistan', u'Vanuatu',\n", " u'Venezuela', u'Vietnam', u'Yemen', u'Zambia', u'Zimbabwe'],\n", " dtype='object', name=u'country', length=193)" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'country' is now the index\n", "drinks.index" ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'beer_servings', u'spirit_servings', u'wine_servings',\n", " u'total_litres_of_pure_alcohol', u'continent'],\n", " dtype='object')" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'country' is no longer a column\n", "drinks.columns" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(193, 5)" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'country' data is no longer part of the DataFrame contents\n", "drinks.shape" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "245" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# country name can now be used for selection\n", "drinks.loc['Brazil', 'beer_servings']" ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " beer_servings spirit_servings wine_servings \\\n", "Afghanistan 0 0 0 \n", "Albania 89 132 54 \n", "Algeria 25 0 14 \n", "Andorra 245 138 312 \n", "Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "Afghanistan 0.0 Asia \n", "Albania 4.9 Europe \n", "Algeria 0.7 Africa \n", "Andorra 12.4 Europe \n", "Angola 5.9 Africa " ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# index name is optional\n", "drinks.index.name = None\n", "drinks.head()" ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "2 0.7 Africa \n", "3 12.4 Europe \n", "4 5.9 Africa " ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# restore the index name, and move the index back to a column\n", "drinks.index.name = 'country'\n", "drinks.reset_index(inplace=True)\n", "drinks.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`reset_index`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html)" ] }, { "cell_type": "code", "execution_count": 137, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " beer_servings spirit_servings wine_servings \\\n", "count 193.000000 193.000000 193.000000 \n", "mean 106.160622 80.994819 49.450777 \n", "std 101.143103 88.284312 79.697598 \n", "min 0.000000 0.000000 0.000000 \n", "25% 20.000000 4.000000 1.000000 \n", "50% 76.000000 56.000000 8.000000 \n", "75% 188.000000 128.000000 59.000000 \n", "max 376.000000 438.000000 370.000000 \n", "\n", " total_litres_of_pure_alcohol \n", "count 193.000000 \n", "mean 4.717098 \n", "std 3.773298 \n", "min 0.000000 \n", "25% 1.300000 \n", "50% 4.200000 \n", "75% 7.200000 \n", "max 14.400000 " ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# many DataFrame methods output a DataFrame\n", "drinks.describe()" ] }, { "cell_type": "code", "execution_count": 138, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "20.0" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# you can interact with any DataFrame using its index and columns\n", "drinks.describe().loc['25%', 'beer_servings']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Indexing and selecting data](http://pandas.pydata.org/pandas-docs/stable/indexing.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18. What do I need to know about the pandas index? (Part 2) ([video](https://www.youtube.com/watch?v=15q-is8P_H4&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=18))" ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "2 0.7 Africa \n", "3 12.4 Europe \n", "4 5.9 Africa " ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of alcohol consumption into a DataFrame\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n", "drinks.head()" ] }, { "cell_type": "code", "execution_count": 140, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=193, step=1)" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# every DataFrame has an index\n", "drinks.index" ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Asia\n", "1 Europe\n", "2 Africa\n", "3 Europe\n", "4 Africa\n", "Name: continent, dtype: object" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# every Series also has an index (which carries over from the DataFrame)\n", "drinks.continent.head()" ] }, { "cell_type": "code", "execution_count": 142, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# set 'country' as the index\n", "drinks.set_index('country', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`set_index`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html)" ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "country\n", "Afghanistan Asia\n", "Albania Europe\n", "Algeria Africa\n", "Andorra Europe\n", "Angola Africa\n", "Name: continent, dtype: object" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Series index is on the left, values are on the right\n", "drinks.continent.head()" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Africa 53\n", "Europe 45\n", "Asia 44\n", "North America 23\n", "Oceania 16\n", "South America 12\n", "Name: continent, dtype: int64" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# another example of a Series (output from the 'value_counts' method)\n", "drinks.continent.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`value_counts`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)" ] }, { "cell_type": "code", "execution_count": 145, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'Africa', u'Europe', u'Asia', u'North America', u'Oceania',\n", " u'South America'],\n", " dtype='object')" ] }, "execution_count": 145, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access the Series index\n", "drinks.continent.value_counts().index" ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([53, 45, 44, 23, 16, 12], dtype=int64)" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access the Series values\n", 
"drinks.continent.value_counts().values" ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "53" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# elements in a Series can be selected by index (using bracket notation)\n", "drinks.continent.value_counts()['Africa']" ] }, { "cell_type": "code", "execution_count": 148, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "South America 12\n", "Oceania 16\n", "North America 23\n", "Asia 44\n", "Europe 45\n", "Africa 53\n", "Name: continent, dtype: int64" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# any Series can be sorted by its values\n", "drinks.continent.value_counts().sort_values()" ] }, { "cell_type": "code", "execution_count": 149, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Africa 53\n", "Asia 44\n", "Europe 45\n", "North America 23\n", "Oceania 16\n", "South America 12\n", "Name: continent, dtype: int64" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# any Series can also be sorted by its index\n", "drinks.continent.value_counts().sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`sort_values`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html) and [**`sort_index`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_index.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What is the index used for?**\n", "\n", "1. identification (covered in the previous video)\n", "2. selection (covered in the previous video)\n", "3. 
alignment" ] }, { "cell_type": "code", "execution_count": 150, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "country\n", "Afghanistan 0\n", "Albania 89\n", "Algeria 25\n", "Andorra 245\n", "Angola 217\n", "Name: beer_servings, dtype: int64" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'beer_servings' Series contains the average annual beer servings per person\n", "drinks.beer_servings.head()" ] }, { "cell_type": "code", "execution_count": 151, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Albania 3000000\n", "Andorra 85000\n", "Name: population, dtype: int64" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a Series containing the population of two countries\n", "people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')\n", "people" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`Series`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)" ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Afghanistan NaN\n", "Albania 267000000.0\n", "Algeria NaN\n", "Andorra 20825000.0\n", "Angola NaN\n", "dtype: float64" ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the total annual beer servings for each country\n", "(drinks.beer_servings * people).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The two Series were **aligned** by their indexes.\n", "- If a value is missing in either Series, the result is marked as **NaN**.\n", "- Alignment enables us to easily work with **incomplete data**." ] }, { "cell_type": "code", "execution_count": 153, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
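{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** If you would rather substitute a value for the missing entries instead of getting **NaN**, the arithmetic methods accept a `fill_value` argument. A minimal sketch (it assumes the `drinks` and `people` objects defined above):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# same aligned multiplication, but treat a missing value as 0\n", "drinks.beer_servings.mul(people, fill_value=0).head()" ] },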
" ], "text/plain": [ " beer_servings spirit_servings wine_servings \\\n", "Afghanistan 0 0 0 \n", "Albania 89 132 54 \n", "Algeria 25 0 14 \n", "Andorra 245 138 312 \n", "Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent population \n", "Afghanistan 0.0 Asia NaN \n", "Albania 4.9 Europe 3000000.0 \n", "Algeria 0.7 Africa NaN \n", "Andorra 12.4 Europe 85000.0 \n", "Angola 5.9 Africa NaN " ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# concatenate the 'drinks' DataFrame with the 'population' Series (aligns by the index)\n", "pd.concat([drinks, people], axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`concat`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html)\n", "\n", "[Indexing and selecting data](http://pandas.pydata.org/pandas-docs/stable/indexing.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 19. How do I select multiple rows and columns from a pandas DataFrame? ([video](https://www.youtube.com/watch?v=xvpNA7bC8cs&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=19))" ] }, { "cell_type": "code", "execution_count": 154, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of UFO reports into a DataFrame\n", "ufo = pd.read_csv('http://bit.ly/uforeports')\n", "ufo.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [**`loc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) method is used to select rows and columns by **label**. You can pass it:\n", "\n", "- A single label\n", "- A list of labels\n", "- A slice of labels\n", "- A boolean Series\n", "- A colon (which indicates \"all labels\")" ] }, { "cell_type": "code", "execution_count": 155, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "City Ithaca\n", "Colors Reported NaN\n", "Shape Reported TRIANGLE\n", "State NY\n", "Time 6/1/1930 22:00\n", "Name: 0, dtype: object" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# row 0, all columns\n", "ufo.loc[0, :]" ] }, { "cell_type": "code", "execution_count": 156, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
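{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** Passing a single row label to `loc` returns a Series (as shown above), whereas passing a one-item list of labels returns a one-row DataFrame. A minimal sketch (assumes the `ufo` DataFrame from above):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# list of one label: returns a DataFrame rather than a Series\n", "ufo.loc[[0], :]" ] },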
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows 0 and 1 and 2, all columns\n", "ufo.loc[[0, 1, 2], :]" ] }, { "cell_type": "code", "execution_count": 157, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00" ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows 0 through 2 (inclusive), all columns\n", "ufo.loc[0:2, :]" ] }, { "cell_type": "code", "execution_count": 158, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00" ] }, "execution_count": 158, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this implies \"all columns\", but explicitly stating \"all columns\" is better\n", "ufo.loc[0:2]" ] }, { "cell_type": "code", "execution_count": 159, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Ithaca\n", "1 Willingboro\n", "2 Holyoke\n", "Name: City, dtype: object" ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows 0 through 2 (inclusive), column 'City'\n", "ufo.loc[0:2, 'City']" ] }, { "cell_type": "code", "execution_count": 160, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City State\n", "0 Ithaca NY\n", "1 Willingboro NJ\n", "2 Holyoke CO" ] }, "execution_count": 160, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows 0 through 2 (inclusive), columns 'City' and 'State'\n", "ufo.loc[0:2, ['City', 'State']]" ] }, { "cell_type": "code", "execution_count": 161, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City State\n", "0 Ithaca NY\n", "1 Willingboro NJ\n", "2 Holyoke CO" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# accomplish the same thing using double brackets - but using 'loc' is preferred since it's more explicit\n", "ufo[['City', 'State']].head(3)" ] }, { "cell_type": "code", "execution_count": 162, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State\n", "0 Ithaca NaN TRIANGLE NY\n", "1 Willingboro NaN OTHER NJ\n", "2 Holyoke NaN OVAL CO" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows 0 through 2 (inclusive), columns 'City' through 'State' (inclusive)\n", "ufo.loc[0:2, 'City':'State']" ] }, { "cell_type": "code", "execution_count": 163, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
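{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** The row and column selections shown above can be mixed freely. Here is a minimal sketch that pairs a boolean Series (for the rows) with a list of labels (for the columns); it assumes the `ufo` DataFrame from above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# rows in which the 'State' is 'NY', columns 'City' and 'Time'\n", "ufo.loc[ufo.State=='NY', ['City', 'Time']].head()" ] },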
" ], "text/plain": [ " City Colors Reported Shape Reported State\n", "0 Ithaca NaN TRIANGLE NY\n", "1 Willingboro NaN OTHER NJ\n", "2 Holyoke NaN OVAL CO" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# accomplish the same thing using 'head' and 'drop'\n", "ufo.head(3).drop('Time', axis=1)" ] }, { "cell_type": "code", "execution_count": 164, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1694 CA\n", "2144 CA\n", "4686 MD\n", "7293 CA\n", "8488 CA\n", "8768 CA\n", "10816 OR\n", "10948 CA\n", "11045 CA\n", "12322 CA\n", "12941 CA\n", "16803 MD\n", "17322 CA\n", "Name: State, dtype: object" ] }, "execution_count": 164, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows in which the 'City' is 'Oakland', column 'State'\n", "ufo.loc[ufo.City=='Oakland', 'State']" ] }, { "cell_type": "code", "execution_count": 165, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1694 CA\n", "2144 CA\n", "4686 MD\n", "7293 CA\n", "8488 CA\n", "8768 CA\n", "10816 OR\n", "10948 CA\n", "11045 CA\n", "12322 CA\n", "12941 CA\n", "16803 MD\n", "17322 CA\n", "Name: State, dtype: object" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# accomplish the same thing using \"chained indexing\" - but using 'loc' is preferred since chained indexing can cause problems\n", "ufo[ufo.City=='Oakland'].State" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [**`iloc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) method is used to select rows and columns by **integer position**. You can pass it:\n", "\n", "- A single integer position\n", "- A list of integer positions\n", "- A slice of integer positions\n", "- A colon (which indicates \"all integer positions\")" ] }, { "cell_type": "code", "execution_count": 166, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
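{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** Because `iloc` works with integer positions, negative positions count from the end, just like regular Python slicing. A minimal sketch (assumes the `ufo` DataFrame from above):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# rows in the last 3 positions, all columns\n", "ufo.iloc[-3:, :]" ] },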
" ], "text/plain": [ " City State\n", "0 Ithaca NY\n", "1 Willingboro NJ" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows in positions 0 and 1, columns in positions 0 and 3\n", "ufo.iloc[[0, 1], [0, 3]]" ] }, { "cell_type": "code", "execution_count": 167, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State\n", "0 Ithaca NaN TRIANGLE NY\n", "1 Willingboro NaN OTHER NJ" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows in positions 0 through 2 (exclusive), columns in positions 0 through 4 (exclusive)\n", "ufo.iloc[0:2, 0:4]" ] }, { "cell_type": "code", "execution_count": 168, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00" ] }, "execution_count": 168, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows in positions 0 through 2 (exclusive), all columns\n", "ufo.iloc[0:2, :]" ] }, { "cell_type": "code", "execution_count": 169, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# accomplish the same thing - but using 'iloc' is preferred since it's more explicit\n", "ufo[0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [**`ix`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ix.html) method is used to select rows and columns by **label or integer position**, and should only be used when you need to mix label-based and integer-based selection in the same call." ] }, { "cell_type": "code", "execution_count": 170, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " beer_servings spirit_servings wine_servings \\\n", "country \n", "Afghanistan 0 0 0 \n", "Albania 89 132 54 \n", "Algeria 25 0 14 \n", "Andorra 245 138 312 \n", "Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "country \n", "Afghanistan 0.0 Asia \n", "Albania 4.9 Europe \n", "Algeria 0.7 Africa \n", "Andorra 12.4 Europe \n", "Angola 5.9 Africa " ] }, "execution_count": 170, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of alcohol consumption into a DataFrame and set 'country' as the index\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry', index_col='country')\n", "drinks.head()" ] }, { "cell_type": "code", "execution_count": 171, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "89" ] }, "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# row with label 'Albania', column in position 0\n", "drinks.ix['Albania', 0]" ] }, { "cell_type": "code", "execution_count": 172, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "89" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# row in position 1, column with label 'beer_servings'\n", "drinks.ix[1, 'beer_servings']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Rules for using numbers with `ix`:**\n", "\n", "- If the index is **strings**, numbers are treated as **integer positions**, and thus slices are **exclusive** on the right.\n", "- If the index is **integers**, numbers are treated as **labels**, and thus slices are **inclusive**." ] }, { "cell_type": "code", "execution_count": 173, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
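{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** `ix` has since been deprecated (and removed in newer versions of pandas). The same mixed selections can be written with `loc` and `iloc` by translating between labels and positions first. A minimal sketch (assumes the `drinks` DataFrame indexed by country, as above):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# label-based row, positional column: convert the position to a label\n", "drinks.loc['Albania', drinks.columns[0]]\n", "\n", "# positional row, label-based column: convert the label to a position\n", "drinks.iloc[1, drinks.columns.get_loc('beer_servings')]" ] },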
" ], "text/plain": [ " beer_servings spirit_servings\n", "country \n", "Albania 89 132\n", "Algeria 25 0\n", "Andorra 245 138" ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows 'Albania' through 'Andorra' (inclusive), columns in positions 0 through 2 (exclusive)\n", "drinks.ix['Albania':'Andorra', 0:2]" ] }, { "cell_type": "code", "execution_count": 174, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported\n", "0 Ithaca NaN\n", "1 Willingboro NaN\n", "2 Holyoke NaN" ] }, "execution_count": 174, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows 0 through 2 (inclusive), columns in positions 0 through 2 (exclusive)\n", "ufo.ix[0:2, 0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Summary of the pandas API for selection](https://github.com/pydata/pandas/issues/9595)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 20. When should I use the \"inplace\" parameter in pandas? ([video](https://www.youtube.com/watch?v=XaCSdr7pPmY&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=20))" ] }, { "cell_type": "code", "execution_count": 175, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00\n", "3 Abilene NaN DISK KS 6/1/1931 13:00\n", "4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of UFO reports into a DataFrame\n", "ufo = pd.read_csv('http://bit.ly/uforeports')\n", "ufo.head()" ] }, { "cell_type": "code", "execution_count": 176, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(18241, 5)" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo.shape" ] }, { "cell_type": "code", "execution_count": 177, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Colors Reported Shape Reported State Time\n", "0 NaN TRIANGLE NY 6/1/1930 22:00\n", "1 NaN OTHER NJ 6/30/1930 20:00\n", "2 NaN OVAL CO 2/15/1931 14:00\n", "3 NaN DISK KS 6/1/1931 13:00\n", "4 NaN LIGHT NY 4/18/1933 19:00" ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# remove the 'City' column (doesn't affect the DataFrame since inplace=False)\n", "ufo.drop('City', axis=1).head()" ] }, { "cell_type": "code", "execution_count": 178, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00\n", "3 Abilene NaN DISK KS 6/1/1931 13:00\n", "4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# confirm that the 'City' column was not actually removed\n", "ufo.head()" ] }, { "cell_type": "code", "execution_count": 179, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# remove the 'City' column (does affect the DataFrame since inplace=True)\n", "ufo.drop('City', axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 180, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Colors Reported Shape Reported State Time\n", "0 NaN TRIANGLE NY 6/1/1930 22:00\n", "1 NaN OTHER NJ 6/30/1930 20:00\n", "2 NaN OVAL CO 2/15/1931 14:00\n", "3 NaN DISK KS 6/1/1931 13:00\n", "4 NaN LIGHT NY 4/18/1933 19:00" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# confirm that the 'City' column was actually removed\n", "ufo.head()" ] }, { "cell_type": "code", "execution_count": 181, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(2490, 4)" ] }, "execution_count": 181, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop a row if any value is missing from that row (doesn't affect the DataFrame since inplace=False)\n", "ufo.dropna(how='any').shape" ] }, { "cell_type": "code", "execution_count": 182, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(18241, 4)" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# confirm that no rows were actually removed\n", "ufo.shape" ] }, { "cell_type": "code", "execution_count": 183, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Colors Reported Shape Reported State\n", "Time \n", "12/31/2000 23:00 NaN TRIANGLE IL\n", "12/31/2000 23:00 NaN DISK IA\n", "12/31/2000 23:45 NaN NaN WI\n", "12/31/2000 23:45 RED LIGHT WI\n", "12/31/2000 23:59 NaN OVAL FL" ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use an assignment statement instead of the 'inplace' parameter\n", "ufo = ufo.set_index('Time')\n", "ufo.tail()" ] }, { "cell_type": "code", "execution_count": 184, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
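{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** A common gotcha is that a method called with `inplace=True` returns `None`, so its result should not be assigned back to a variable. A minimal sketch (it works on a copy so the `ufo` DataFrame above is left unchanged):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# dropna with inplace=True modifies the copy and returns None\n", "ufo_copy = ufo.copy()\n", "print(ufo_copy.dropna(how='any', inplace=True))" ] },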
" ], "text/plain": [ " Colors Reported Shape Reported State\n", "Time \n", "12/31/2000 23:00 RED TRIANGLE IL\n", "12/31/2000 23:00 RED DISK IA\n", "12/31/2000 23:45 RED LIGHT WI\n", "12/31/2000 23:45 RED LIGHT WI\n", "12/31/2000 23:59 NaN OVAL FL" ] }, "execution_count": 184, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fill missing values using \"backward fill\" strategy (doesn't affect the DataFrame since inplace=False)\n", "ufo.fillna(method='bfill').tail()" ] }, { "cell_type": "code", "execution_count": 185, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
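{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** Newer versions of pandas also provide **`bfill`** and **`ffill`** methods as shortcuts for these fill strategies (and deprecate `fillna(method=...)`). A minimal sketch (assumes the `ufo` DataFrame from above):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# equivalent to fillna(method='bfill') - doesn't affect the DataFrame\n", "ufo.bfill().tail()" ] },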
" ], "text/plain": [ " Colors Reported Shape Reported State\n", "Time \n", "12/31/2000 23:00 RED TRIANGLE IL\n", "12/31/2000 23:00 RED DISK IA\n", "12/31/2000 23:45 RED DISK WI\n", "12/31/2000 23:45 RED LIGHT WI\n", "12/31/2000 23:59 RED OVAL FL" ] }, "execution_count": 185, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compare with \"forward fill\" strategy (doesn't affect the DataFrame since inplace=False)\n", "ufo.fillna(method='ffill').tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 21. How do I make my pandas DataFrame smaller and faster? ([video](https://www.youtube.com/watch?v=wDYDYGyN_cw&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=21))" ] }, { "cell_type": "code", "execution_count": 186, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "2 0.7 Africa \n", "3 12.4 Europe \n", "4 5.9 Africa " ] }, "execution_count": 186, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of alcohol consumption into a DataFrame\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n", "drinks.head()" ] }, { "cell_type": "code", "execution_count": 187, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 193 entries, 0 to 192\n", "Data columns (total 6 columns):\n", "country 193 non-null object\n", "beer_servings 193 non-null int64\n", "spirit_servings 193 non-null int64\n", "wine_servings 193 non-null int64\n", "total_litres_of_pure_alcohol 193 non-null float64\n", "continent 193 non-null object\n", "dtypes: float64(1), int64(3), object(2)\n", "memory usage: 9.1+ KB\n" ] } ], "source": [ "# exact memory usage is unknown because object columns are references elsewhere\n", "drinks.info()" ] }, { "cell_type": "code", "execution_count": 188, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 193 entries, 0 to 192\n", "Data columns (total 6 columns):\n", "country 193 non-null object\n", "beer_servings 193 non-null int64\n", "spirit_servings 193 non-null int64\n", "wine_servings 193 non-null int64\n", "total_litres_of_pure_alcohol 193 non-null float64\n", "continent 193 non-null object\n", "dtypes: float64(1), int64(3), object(2)\n", "memory usage: 24.4 KB\n" ] } ], "source": [ "# force pandas to calculate the true memory usage\n", "drinks.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 189, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index 72\n", "country 9500\n", "beer_servings 1544\n", "spirit_servings 1544\n", "wine_servings 1544\n", "total_litres_of_pure_alcohol 1544\n", "continent 9244\n", "dtype: int64" ] }, "execution_count": 189, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the memory usage for each Series (in bytes)\n", "drinks.memory_usage(deep=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`info`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) and [**`memory_usage`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.memory_usage.html)" ] }, { "cell_type": "code", "execution_count": 190, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "country object\n", "beer_servings int64\n", "spirit_servings int64\n", "wine_servings int64\n", "total_litres_of_pure_alcohol float64\n", "continent category\n", "dtype: object" ] }, "execution_count": 190, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use the 'category' data type (new in pandas 0.15) to store the 'continent' strings as integers\n", "drinks['continent'] = drinks.continent.astype('category')\n", "drinks.dtypes" ] }, { "cell_type": "code", "execution_count": 191, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Asia\n", "1 Europe\n", "2 Africa\n", "3 Europe\n", "4 Africa\n", "Name: continent, dtype: category\n", "Categories (6, object): [Africa, Asia, 
Europe, North America, Oceania, South America]" ] }, "execution_count": 191, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'continent' Series appears to be unchanged\n", "drinks.continent.head()" ] }, { "cell_type": "code", "execution_count": 192, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 0\n", "3 2\n", "4 0\n", "dtype: int8" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# strings are now encoded (0 means 'Africa', 1 means 'Asia', 2 means 'Europe', etc.)\n", "drinks.continent.cat.codes.head()" ] }, { "cell_type": "code", "execution_count": 193, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index 72\n", "country 9500\n", "beer_servings 1544\n", "spirit_servings 1544\n", "wine_servings 1544\n", "total_litres_of_pure_alcohol 1544\n", "continent 488\n", "dtype: int64" ] }, "execution_count": 193, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# memory usage has been drastically reduced\n", "drinks.memory_usage(deep=True)" ] }, { "cell_type": "code", "execution_count": 194, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index 72\n", "country 9886\n", "beer_servings 1544\n", "spirit_servings 1544\n", "wine_servings 1544\n", "total_litres_of_pure_alcohol 1544\n", "continent 488\n", "dtype: int64" ] }, "execution_count": 194, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# repeat this process for the 'country' Series\n", "drinks['country'] = drinks.country.astype('category')\n", "drinks.memory_usage(deep=True)" ] }, { "cell_type": "code", "execution_count": 195, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'Afghanistan', u'Albania', u'Algeria', u'Andorra', u'Angola',\n", " u'Antigua & Barbuda', u'Argentina', u'Armenia', u'Australia',\n", " u'Austria',\n", " ...\n", " u'United Arab Emirates', u'United Kingdom', u'Uruguay', u'Uzbekistan',\n", " u'Vanuatu', u'Venezuela', u'Vietnam', u'Yemen', u'Zambia', u'Zimbabwe'],\n", " dtype='object', length=193)" ] }, "execution_count": 195, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# memory usage increased because we created 193 categories\n", "drinks.country.cat.categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **category** data type should only be used with a string Series that has a **small number of possible values**." ] }, { "cell_type": "code", "execution_count": 196, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
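{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** A quick way to judge whether a string Series is a good candidate for the category type is to compare its number of unique values with its length. A minimal sketch (assumes the `drinks` DataFrame from above):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# few unique values relative to the number of rows: a good candidate for 'category'\n", "drinks.continent.nunique(), drinks.continent.size" ] },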
" ], "text/plain": [ " ID quality\n", "0 100 good\n", "1 101 very good\n", "2 102 good\n", "3 103 excellent" ] }, "execution_count": 196, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a small DataFrame from a dictionary\n", "df = pd.DataFrame({'ID':[100, 101, 102, 103], 'quality':['good', 'very good', 'good', 'excellent']})\n", "df" ] }, { "cell_type": "code", "execution_count": 197, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID quality\n", "3 103 excellent\n", "0 100 good\n", "2 102 good\n", "1 101 very good" ] }, "execution_count": 197, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sort the DataFrame by the 'quality' Series (alphabetical order)\n", "df.sort_values('quality')" ] }, { "cell_type": "code", "execution_count": 198, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 good\n", "1 very good\n", "2 good\n", "3 excellent\n", "Name: quality, dtype: category\n", "Categories (3, object): [good < very good < excellent]" ] }, "execution_count": 198, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# define a logical ordering for the categories\n", "df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)\n", "df.quality" ] }, { "cell_type": "code", "execution_count": 199, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
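{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** In newer versions of pandas, `astype` no longer accepts `categories` and `ordered` directly; instead, you build a `CategoricalDtype` and pass that to `astype`. A minimal sketch of the equivalent conversion (assumes the `df` DataFrame from above):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# define the same logical ordering using a CategoricalDtype\n", "from pandas.api.types import CategoricalDtype\n", "quality_dtype = CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)\n", "df['quality'] = df.quality.astype(quality_dtype)\n", "df.quality" ] },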
" ], "text/plain": [ " ID quality\n", "0 100 good\n", "2 102 good\n", "1 101 very good\n", "3 103 excellent" ] }, "execution_count": 199, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sort the DataFrame by the 'quality' Series (logical order)\n", "df.sort_values('quality')" ] }, { "cell_type": "code", "execution_count": 200, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID quality\n", "1 101 very good\n", "3 103 excellent" ] }, "execution_count": 200, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# comparison operators work with ordered categories\n", "df.loc[df.quality > 'good', :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Overview of categorical data in pandas](http://pandas.pydata.org/pandas-docs/stable/categorical.html)\n", "\n", "[API reference for categorical methods](http://pandas.pydata.org/pandas-docs/stable/api.html#categorical)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 22. How do I use pandas with scikit-learn to create Kaggle submissions? ([video](https://www.youtube.com/watch?v=ylRlGCtAtiE&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=22))" ] }, { "cell_type": "code", "execution_count": 201, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 201, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read the training dataset from Kaggle's Titanic competition into a DataFrame\n", "train = pd.read_csv('http://bit.ly/kaggletrain')\n", "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Predict passenger survival aboard the Titanic based on [passenger attributes](https://www.kaggle.com/c/titanic/data)\n", "\n", "**Video:** [What is machine learning, and how does it work?](https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=1)" ] }, { "cell_type": "code", "execution_count": 202, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(891, 2)" ] }, "execution_count": 202, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a feature matrix 'X' by selecting two DataFrame columns\n", "feature_cols = ['Pclass', 'Parch']\n", "X = train.loc[:, feature_cols]\n", "X.shape" ] }, { "cell_type": "code", "execution_count": 203, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(891L,)" ] }, "execution_count": 203, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a response vector 'y' by selecting a Series\n", "y = train.Survived\n", "y.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** There is no need to convert these pandas objects to NumPy arrays. scikit-learn will understand these objects as long as they are entirely numeric and the proper shapes." ] }, { "cell_type": "code", "execution_count": 204, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", " verbose=0, warm_start=False)" ] }, "execution_count": 204, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fit a classification model to the training data\n", "from sklearn.linear_model import LogisticRegression\n", "logreg = LogisticRegression()\n", "logreg.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Video series:** [Introduction to machine learning with scikit-learn](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A)" ] }, { "cell_type": "code", "execution_count": 205, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Pclass Name Sex \\\n", "0 892 3 Kelly, Mr. James male \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n", "2 894 2 Myles, Mr. Thomas Francis male \n", "3 895 3 Wirz, Mr. Albert male \n", "4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked \n", "0 34.5 0 0 330911 7.8292 NaN Q \n", "1 47.0 1 0 363272 7.0000 NaN S \n", "2 62.0 0 0 240276 9.6875 NaN Q \n", "3 27.0 0 0 315154 8.6625 NaN S \n", "4 22.0 1 1 3101298 12.2875 NaN S " ] }, "execution_count": 205, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read the testing dataset from Kaggle's Titanic competition into a DataFrame\n", "test = pd.read_csv('http://bit.ly/kaggletest')\n", "test.head()" ] }, { "cell_type": "code", "execution_count": 206, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(418, 2)" ] }, "execution_count": 206, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a feature matrix from the testing data that matches the training data\n", "X_new = test.loc[:, feature_cols]\n", "X_new.shape" ] }, { "cell_type": "code", "execution_count": 207, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# use the fitted model to make predictions for the testing set observations\n", "new_pred_class = logreg.predict(X_new)" ] }, { "cell_type": "code", "execution_count": 208, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18930
28940
38950
48960
\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 0\n", "3 895 0\n", "4 896 0" ] }, "execution_count": 208, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a DataFrame of passenger IDs and testing set predictions\n", "pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for the [**`DataFrame`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) constructor" ] }, { "cell_type": "code", "execution_count": 209, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Survived
PassengerId
8920
8930
8940
8950
8960
\n", "
" ], "text/plain": [ " Survived\n", "PassengerId \n", "892 0\n", "893 0\n", "894 0\n", "895 0\n", "896 0" ] }, "execution_count": 209, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ensure that PassengerID is the first column by setting it as the index\n", "pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').head()" ] }, { "cell_type": "code", "execution_count": 210, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# write the DataFrame to a CSV file that can be submitted to Kaggle\n", "pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').to_csv('sub.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`to_csv`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html)" ] }, { "cell_type": "code", "execution_count": 211, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# save a DataFrame to disk (\"pickle it\")\n", "train.to_pickle('train.pkl')" ] }, { "cell_type": "code", "execution_count": 212, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 212, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a pickled object from disk (\"unpickle it\")\n", "pd.read_pickle('train.pkl').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`to_pickle`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_pickle.html) and [**`read_pickle`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_pickle.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 23. More of your pandas questions answered! ([video](https://www.youtube.com/watch?v=oH3wYKvwpJ8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=23))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** Could you explain how to read the pandas documentation?\n", "\n", "[pandas API reference](http://pandas.pydata.org/pandas-docs/stable/api.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** What is the difference between **`ufo.isnull()`** and **`pd.isnull(ufo)`**?" ] }, { "cell_type": "code", "execution_count": 213, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedStateTime
0IthacaNaNTRIANGLENY6/1/1930 22:00
1WillingboroNaNOTHERNJ6/30/1930 20:00
2HolyokeNaNOVALCO2/15/1931 14:00
3AbileneNaNDISKKS6/1/1931 13:00
4New York Worlds FairNaNLIGHTNY4/18/1933 19:00
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00\n", "3 Abilene NaN DISK KS 6/1/1931 13:00\n", "4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00" ] }, "execution_count": 213, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of UFO reports into a DataFrame\n", "ufo = pd.read_csv('http://bit.ly/uforeports')\n", "ufo.head()" ] }, { "cell_type": "code", "execution_count": 214, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedStateTime
0FalseTrueFalseFalseFalse
1FalseTrueFalseFalseFalse
2FalseTrueFalseFalseFalse
3FalseTrueFalseFalseFalse
4FalseTrueFalseFalseFalse
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 False True False False False\n", "1 False True False False False\n", "2 False True False False False\n", "3 False True False False False\n", "4 False True False False False" ] }, "execution_count": 214, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use 'isnull' as a top-level function\n", "pd.isnull(ufo).head()" ] }, { "cell_type": "code", "execution_count": 215, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedStateTime
0FalseTrueFalseFalseFalse
1FalseTrueFalseFalseFalse
2FalseTrueFalseFalseFalse
3FalseTrueFalseFalseFalse
4FalseTrueFalseFalseFalse
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 False True False False False\n", "1 False True False False False\n", "2 False True False False False\n", "3 False True False False False\n", "4 False True False False False" ] }, "execution_count": 215, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# equivalent: use 'isnull' as a DataFrame method\n", "ufo.isnull().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`isnull`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** Why are DataFrame slices inclusive when using **`.loc`**, but exclusive when using **`.iloc`**?" ] }, { "cell_type": "code", "execution_count": 216, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedStateTime
0IthacaNaNTRIANGLENY6/1/1930 22:00
1WillingboroNaNOTHERNJ6/30/1930 20:00
2HolyokeNaNOVALCO2/15/1931 14:00
3AbileneNaNDISKKS6/1/1931 13:00
4New York Worlds FairNaNLIGHTNY4/18/1933 19:00
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00\n", "3 Abilene NaN DISK KS 6/1/1931 13:00\n", "4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00" ] }, "execution_count": 216, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# label-based slicing is inclusive of the start and stop\n", "ufo.loc[0:4, :]" ] }, { "cell_type": "code", "execution_count": 217, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedStateTime
0IthacaNaNTRIANGLENY6/1/1930 22:00
1WillingboroNaNOTHERNJ6/30/1930 20:00
2HolyokeNaNOVALCO2/15/1931 14:00
3AbileneNaNDISKKS6/1/1931 13:00
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00\n", "3 Abilene NaN DISK KS 6/1/1931 13:00" ] }, "execution_count": 217, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# position-based slicing is inclusive of the start and exclusive of the stop\n", "ufo.iloc[0:4, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`loc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) and [**`iloc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)" ] }, { "cell_type": "code", "execution_count": 218, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([['Ithaca', nan, 'TRIANGLE', 'NY', '6/1/1930 22:00'],\n", " ['Willingboro', nan, 'OTHER', 'NJ', '6/30/1930 20:00'],\n", " ['Holyoke', nan, 'OVAL', 'CO', '2/15/1931 14:00'],\n", " ['Abilene', nan, 'DISK', 'KS', '6/1/1931 13:00']], dtype=object)" ] }, "execution_count": 218, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'iloc' is simply following NumPy's slicing convention...\n", "ufo.values[0:4, :]" ] }, { "cell_type": "code", "execution_count": 219, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'pyth'" ] }, "execution_count": 219, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ...and NumPy is simply following Python's slicing convention\n", "'python'[0:4]" ] }, { "cell_type": "code", "execution_count": 220, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedState
0IthacaNaNTRIANGLENY
1WillingboroNaNOTHERNJ
2HolyokeNaNOVALCO
3AbileneNaNDISKKS
4New York Worlds FairNaNLIGHTNY
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State\n", "0 Ithaca NaN TRIANGLE NY\n", "1 Willingboro NaN OTHER NJ\n", "2 Holyoke NaN OVAL CO\n", "3 Abilene NaN DISK KS\n", "4 New York Worlds Fair NaN LIGHT NY" ] }, "execution_count": 220, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'loc' is inclusive of the stopping label because you don't necessarily know what label will come after it\n", "ufo.loc[0:4, 'City':'State']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** How do I randomly sample rows from a DataFrame?" ] }, { "cell_type": "code", "execution_count": 221, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedStateTime
12192WinstonGREENLIGHTOR9/23/1998 21:00
1775Lake WalesNaNDISKFL1/20/1969 19:00
3141Cannon AFBNaNDISKNM1/6/1976 1:00
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "12192 Winston GREEN LIGHT OR 9/23/1998 21:00\n", "1775 Lake Wales NaN DISK FL 1/20/1969 19:00\n", "3141 Cannon AFB NaN DISK NM 1/6/1976 1:00" ] }, "execution_count": 221, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sample 3 rows from the DataFrame without replacement (new in pandas 0.16.1)\n", "ufo.sample(n=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`sample`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html)" ] }, { "cell_type": "code", "execution_count": 222, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedStateTime
217NorridgewockNaNDISKME9/15/1952 14:00
12282IpavaNaNTRIANGLEIL10/1/1998 21:15
17933EllinwoodNaNFIREBALLKS11/13/2000 22:00
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "217 Norridgewock NaN DISK ME 9/15/1952 14:00\n", "12282 Ipava NaN TRIANGLE IL 10/1/1998 21:15\n", "17933 Ellinwood NaN FIREBALL KS 11/13/2000 22:00" ] }, "execution_count": 222, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use the 'random_state' parameter for reproducibility\n", "ufo.sample(n=3, random_state=42)" ] }, { "cell_type": "code", "execution_count": 223, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# sample 75% of the DataFrame's rows without replacement\n", "train = ufo.sample(frac=0.75, random_state=99)" ] }, { "cell_type": "code", "execution_count": 224, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# store the remaining 25% of the rows in another DataFrame\n", "test = ufo.loc[~ufo.index.isin(train.index), :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`isin`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.isin.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 24. How do I create dummy variables in pandas? ([video](https://www.youtube.com/watch?v=0s_1IsROgDc&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=24))" ] }, { "cell_type": "code", "execution_count": 225, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 225, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read the training dataset from Kaggle's Titanic competition\n", "train = pd.read_csv('http://bit.ly/kaggletrain')\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 226, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedSex_male
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS1
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C0
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS0
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S0
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS1
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Sex_male \n", "0 0 A/5 21171 7.2500 NaN S 1 \n", "1 0 PC 17599 71.2833 C85 C 0 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 0 \n", "3 0 113803 53.1000 C123 S 0 \n", "4 0 373450 8.0500 NaN S 1 " ] }, "execution_count": 226, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create the 'Sex_male' dummy variable using the 'map' method\n", "train['Sex_male'] = train.Sex.map({'female':0, 'male':1})\n", "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`map`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html)" ] }, { "cell_type": "code", "execution_count": 227, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
femalemale
00.01.0
11.00.0
21.00.0
31.00.0
40.01.0
\n", "
" ], "text/plain": [ " female male\n", "0 0.0 1.0\n", "1 1.0 0.0\n", "2 1.0 0.0\n", "3 1.0 0.0\n", "4 0.0 1.0" ] }, "execution_count": 227, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# alternative: use 'get_dummies' to create one column for every possible value\n", "pd.get_dummies(train.Sex).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generally speaking:\n", "\n", "- If you have **\"K\" possible values** for a categorical feature, you only need **\"K-1\" dummy variables** to capture all of the information about that feature.\n", "- One convention is to **drop the first dummy variable**, which defines that level as the \"baseline\"." ] }, { "cell_type": "code", "execution_count": 228, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
male
01.0
10.0
20.0
30.0
41.0
\n", "
" ], "text/plain": [ " male\n", "0 1.0\n", "1 0.0\n", "2 0.0\n", "3 0.0\n", "4 1.0" ] }, "execution_count": 228, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop the first dummy variable ('female') using the 'iloc' method\n", "pd.get_dummies(train.Sex).iloc[:, 1:].head()" ] }, { "cell_type": "code", "execution_count": 229, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sex_male
01.0
10.0
20.0
30.0
41.0
\n", "
" ], "text/plain": [ " Sex_male\n", "0 1.0\n", "1 0.0\n", "2 0.0\n", "3 0.0\n", "4 1.0" ] }, "execution_count": 229, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# add a prefix to identify the source of the dummy variables\n", "pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:].head()" ] }, { "cell_type": "code", "execution_count": 230, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Embarked_CEmbarked_QEmbarked_S
00.00.01.0
11.00.00.0
20.00.01.0
30.00.01.0
40.00.01.0
50.01.00.0
60.00.01.0
70.00.01.0
80.00.01.0
91.00.00.0
\n", "
" ], "text/plain": [ " Embarked_C Embarked_Q Embarked_S\n", "0 0.0 0.0 1.0\n", "1 1.0 0.0 0.0\n", "2 0.0 0.0 1.0\n", "3 0.0 0.0 1.0\n", "4 0.0 0.0 1.0\n", "5 0.0 1.0 0.0\n", "6 0.0 0.0 1.0\n", "7 0.0 0.0 1.0\n", "8 0.0 0.0 1.0\n", "9 1.0 0.0 0.0" ] }, "execution_count": 230, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use 'get_dummies' with a feature that has 3 possible values\n", "pd.get_dummies(train.Embarked, prefix='Embarked').head(10)" ] }, { "cell_type": "code", "execution_count": 231, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Embarked_QEmbarked_S
00.01.0
10.00.0
20.01.0
30.01.0
40.01.0
51.00.0
60.01.0
70.01.0
80.01.0
90.00.0
\n", "
" ], "text/plain": [ " Embarked_Q Embarked_S\n", "0 0.0 1.0\n", "1 0.0 0.0\n", "2 0.0 1.0\n", "3 0.0 1.0\n", "4 0.0 1.0\n", "5 1.0 0.0\n", "6 0.0 1.0\n", "7 0.0 1.0\n", "8 0.0 1.0\n", "9 0.0 0.0" ] }, "execution_count": 231, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop the first dummy variable ('C')\n", "pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How to translate these values back to the original 'Embarked' value:\n", "\n", "- **0, 0** means **C**\n", "- **1, 0** means **Q**\n", "- **0, 1** means **S**" ] }, { "cell_type": "code", "execution_count": 232, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedSex_maleEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS10.01.0
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C00.00.0
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS00.01.0
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S00.01.0
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS10.01.0
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Sex_male Embarked_Q \\\n", "0 0 A/5 21171 7.2500 NaN S 1 0.0 \n", "1 0 PC 17599 71.2833 C85 C 0 0.0 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 0 0.0 \n", "3 0 113803 53.1000 C123 S 0 0.0 \n", "4 0 373450 8.0500 NaN S 1 0.0 \n", "\n", " Embarked_S \n", "0 1.0 \n", "1 0.0 \n", "2 1.0 \n", "3 1.0 \n", "4 1.0 " ] }, "execution_count": 232, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# save the DataFrame of dummy variables and concatenate them to the original DataFrame\n", "embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]\n", "train = pd.concat([train, embarked_dummies], axis=1)\n", "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`concat`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html)" ] }, { "cell_type": "code", "execution_count": 233, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 233, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# reset the DataFrame\n", "train = pd.read_csv('http://bit.ly/kaggletrain')\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 234, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameAgeSibSpParchTicketFareCabinSex_femaleSex_maleEmbarked_CEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harris22.010A/5 211717.2500NaN0.01.00.00.01.0
1211Cumings, Mrs. John Bradley (Florence Briggs Th...38.010PC 1759971.2833C851.00.01.00.00.0
2313Heikkinen, Miss. Laina26.000STON/O2. 31012827.9250NaN1.00.00.00.01.0
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01011380353.1000C1231.00.00.00.01.0
4503Allen, Mr. William Henry35.0003734508.0500NaN0.01.00.00.01.0
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Age SibSp Parch \\\n", "0 Braund, Mr. Owen Harris 22.0 1 0 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 \n", "2 Heikkinen, Miss. Laina 26.0 0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 \n", "4 Allen, Mr. William Henry 35.0 0 0 \n", "\n", " Ticket Fare Cabin Sex_female Sex_male Embarked_C \\\n", "0 A/5 21171 7.2500 NaN 0.0 1.0 0.0 \n", "1 PC 17599 71.2833 C85 1.0 0.0 1.0 \n", "2 STON/O2. 3101282 7.9250 NaN 1.0 0.0 0.0 \n", "3 113803 53.1000 C123 1.0 0.0 0.0 \n", "4 373450 8.0500 NaN 0.0 1.0 0.0 \n", "\n", " Embarked_Q Embarked_S \n", "0 0.0 1.0 \n", "1 0.0 0.0 \n", "2 0.0 1.0 \n", "3 0.0 1.0 \n", "4 0.0 1.0 " ] }, "execution_count": 234, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pass the DataFrame to 'get_dummies' and specify which columns to dummy (it drops the original columns)\n", "pd.get_dummies(train, columns=['Sex', 'Embarked']).head()" ] }, { "cell_type": "code", "execution_count": 235, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameAgeSibSpParchTicketFareCabinSex_maleEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harris22.010A/5 211717.2500NaN1.00.01.0
1211Cumings, Mrs. John Bradley (Florence Briggs Th...38.010PC 1759971.2833C850.00.00.0
2313Heikkinen, Miss. Laina26.000STON/O2. 31012827.9250NaN0.00.01.0
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01011380353.1000C1230.00.01.0
4503Allen, Mr. William Henry35.0003734508.0500NaN1.00.01.0
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Age SibSp Parch \\\n", "0 Braund, Mr. Owen Harris 22.0 1 0 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 \n", "2 Heikkinen, Miss. Laina 26.0 0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 \n", "4 Allen, Mr. William Henry 35.0 0 0 \n", "\n", " Ticket Fare Cabin Sex_male Embarked_Q Embarked_S \n", "0 A/5 21171 7.2500 NaN 1.0 0.0 1.0 \n", "1 PC 17599 71.2833 C85 0.0 0.0 0.0 \n", "2 STON/O2. 3101282 7.9250 NaN 0.0 0.0 1.0 \n", "3 113803 53.1000 C123 0.0 0.0 1.0 \n", "4 373450 8.0500 NaN 1.0 0.0 1.0 " ] }, "execution_count": 235, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use the 'drop_first' parameter (new in pandas 0.18) to drop the first dummy variable for each feature\n", "pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`get_dummies`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 25. How do I work with dates and times in pandas? ([video](https://www.youtube.com/watch?v=yCgJGsg0Xa4&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=25))" ] }, { "cell_type": "code", "execution_count": 236, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedStateTime
0IthacaNaNTRIANGLENY6/1/1930 22:00
1WillingboroNaNOTHERNJ6/30/1930 20:00
2HolyokeNaNOVALCO2/15/1931 14:00
3AbileneNaNDISKKS6/1/1931 13:00
4New York Worlds FairNaNLIGHTNY4/18/1933 19:00
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State Time\n", "0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00\n", "1 Willingboro NaN OTHER NJ 6/30/1930 20:00\n", "2 Holyoke NaN OVAL CO 2/15/1931 14:00\n", "3 Abilene NaN DISK KS 6/1/1931 13:00\n", "4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00" ] }, "execution_count": 236, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of UFO reports into a DataFrame\n", "ufo = pd.read_csv('http://bit.ly/uforeports')\n", "ufo.head()" ] }, { "cell_type": "code", "execution_count": 237, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "City object\n", "Colors Reported object\n", "Shape Reported object\n", "State object\n", "Time object\n", "dtype: object" ] }, "execution_count": 237, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'Time' is currently stored as a string\n", "ufo.dtypes" ] }, { "cell_type": "code", "execution_count": 238, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 22\n", "1 20\n", "2 14\n", "3 13\n", "4 19\n", "Name: Time, dtype: int32" ] }, "execution_count": 238, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# hour could be accessed using string slicing, but this approach breaks too easily\n", "ufo.Time.str.slice(-5, -3).astype(int).head()" ] }, { "cell_type": "code", "execution_count": 239, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedStateTime
0IthacaNaNTRIANGLENY1930-06-01 22:00:00
1WillingboroNaNOTHERNJ1930-06-30 20:00:00
2HolyokeNaNOVALCO1931-02-15 14:00:00
3AbileneNaNDISKKS1931-06-01 13:00:00
4New York Worlds FairNaNLIGHTNY1933-04-18 19:00:00
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State \\\n", "0 Ithaca NaN TRIANGLE NY \n", "1 Willingboro NaN OTHER NJ \n", "2 Holyoke NaN OVAL CO \n", "3 Abilene NaN DISK KS \n", "4 New York Worlds Fair NaN LIGHT NY \n", "\n", " Time \n", "0 1930-06-01 22:00:00 \n", "1 1930-06-30 20:00:00 \n", "2 1931-02-15 14:00:00 \n", "3 1931-06-01 13:00:00 \n", "4 1933-04-18 19:00:00 " ] }, "execution_count": 239, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert 'Time' to datetime format\n", "ufo['Time'] = pd.to_datetime(ufo.Time)\n", "ufo.head()" ] }, { "cell_type": "code", "execution_count": 240, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "City object\n", "Colors Reported object\n", "Shape Reported object\n", "State object\n", "Time datetime64[ns]\n", "dtype: object" ] }, "execution_count": 240, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`to_datetime`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)" ] }, { "cell_type": "code", "execution_count": 241, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 22\n", "1 20\n", "2 14\n", "3 13\n", "4 19\n", "Name: Time, dtype: int64" ] }, "execution_count": 241, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convenient Series attributes are now available\n", "ufo.Time.dt.hour.head()" ] }, { "cell_type": "code", "execution_count": 242, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Sunday\n", "1 Monday\n", "2 Sunday\n", "3 Monday\n", "4 Tuesday\n", "Name: Time, dtype: object" ] }, "execution_count": 242, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo.Time.dt.weekday_name.head()" ] }, { "cell_type": "code", "execution_count": 243, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 152\n", "1 181\n", "2 46\n", "3 152\n", "4 108\n", "Name: Time, dtype: int64" ] }, "execution_count": 243, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo.Time.dt.dayofyear.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "API reference for [datetime properties and methods](http://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties)" ] }, { "cell_type": "code", "execution_count": 244, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Timestamp('1999-01-01 00:00:00')" ] }, "execution_count": 244, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert a single string to datetime format (outputs a timestamp object)\n", "ts = pd.to_datetime('1/1/1999')\n", "ts" ] }, { "cell_type": "code", "execution_count": 245, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityColors ReportedShape ReportedStateTime
12832Loma RicaNaNLIGHTCA1999-01-01 02:30:00
12833BauxiteNaNNaNAR1999-01-01 03:00:00
12834FlorenceNaNCYLINDERSC1999-01-01 14:00:00
12835Lake HenshawNaNCIGARCA1999-01-01 15:00:00
12836Wilmington IslandNaNLIGHTGA1999-01-01 17:15:00
\n", "
" ], "text/plain": [ " City Colors Reported Shape Reported State \\\n", "12832 Loma Rica NaN LIGHT CA \n", "12833 Bauxite NaN NaN AR \n", "12834 Florence NaN CYLINDER SC \n", "12835 Lake Henshaw NaN CIGAR CA \n", "12836 Wilmington Island NaN LIGHT GA \n", "\n", " Time \n", "12832 1999-01-01 02:30:00 \n", "12833 1999-01-01 03:00:00 \n", "12834 1999-01-01 14:00:00 \n", "12835 1999-01-01 15:00:00 \n", "12836 1999-01-01 17:15:00 " ] }, "execution_count": 245, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compare a datetime Series with a timestamp\n", "ufo.loc[ufo.Time >= ts, :].head()" ] }, { "cell_type": "code", "execution_count": 246, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Timedelta('25781 days 01:59:00')" ] }, "execution_count": 246, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# perform mathematical operations with timestamps (outputs a timedelta object)\n", "ufo.Time.max() - ufo.Time.min()" ] }, { "cell_type": "code", "execution_count": 247, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "25781L" ] }, "execution_count": 247, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# timedelta objects also have attributes you can access\n", "(ufo.Time.max() - ufo.Time.min()).days" ] }, { "cell_type": "code", "execution_count": 248, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# allow plots to appear in the notebook\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 249, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1930 2\n", "1931 2\n", "1933 1\n", "1934 1\n", "1935 1\n", "Name: Year, dtype: int64" ] }, "execution_count": 249, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the number of UFO reports per year\n", "ufo['Year'] = ufo.Time.dt.year\n", "ufo.Year.value_counts().sort_index().head()" ] }, { "cell_type": "code", "execution_count": 250, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 250, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYcAAAEACAYAAABYq7oeAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xm8VXW9//HXGxEHUNRUSHCAiyA4k2Kl9+exzKFBKW84\nVFra4/b7qdlNs8T7K7DJ23BLu/dqj35hoqmIDU6hIOG5pVwTc0ABgRIUSEBjEgEFzuf3x3dt2Zx9\nJg57WPvwfj4e63HW+e619v6szWF91ndY36WIwMzMrFi3WgdgZmb54+RgZmYlnBzMzKyEk4OZmZVw\ncjAzsxJODmZmVqLd5CBpF0l/kvSMpOcljcnK95Y0RdJcSZMl9S7aZ7Sk+ZLmSDqtqHy4pJmS5km6\noTKHZGZm26vd5BARbwGnRMSxwDHAmZJGANcAUyNiCDANGA0gaRgwChgKnAncJEnZ290MXBIRg4HB\nkk4v9wGZmdn261CzUkSsy1Z3AboDAZwNjM/KxwMjs/WzgAkRsSkiFgLzgRGS+gJ7RMSMbLvbivYx\nM7Mc6VBykNRN0jPAUuCR7ATfJyKWAUTEUmD/bPN+wKKi3ZdkZf2AxUXli7MyMzPLmY7WHJqyZqX+\npFrA4aTaw1ablTs4MzOrje7bsnFErJHUCJwBLJPUJyKWZU1Gy7PNlgAHFu3WPytrrbyEJCcaM7NO\niAi1v1X7OjJaad/CSCRJuwEfAuYA9wOfzTa7CLgvW78fOE9SD0kDgEHAk1nT02pJI7IO6guL9ikR\nEXW7jBkzpuYx7IixO/7aL46/tks5daTm8G5gvKRupGRyd0RMkvQEMFHSxcDLpBFKRMRsSROB2cBG\n4NLYEvVlwK3ArsCkiHi4rEdjZmZl0W5yiIjngeEtlK8ATm1ln+uB61so/zNw5LaHaWZm1eQ7pCug\noaGh1iF0Wj3HDo6/1hx/16Fyt1OVg6TIY1xmZnkmiahWh7SZme14nBzMzKyEk4OZmZVwcjAzsxJO\nDmZmVsLJwczMSjg5mJlZCScHMzMr4eRgZmYlnBzMzOrMxo2V/wwnBzOzOnPqqfDgg5X9DM+tZGZW\nZwYPhv32g8cf37rccyuZme3AVq2C+fPhsccq9xlODmZmdSQiJYd//Vf4/vcr9zluVjIzqyPr1sG+\n+8Lf/w4DBsC0aTBsWHrNzUpmZjuoVatgr71gt93g8svhBz+ozOd05BnSZmaWE4XkAHDppTBoECxe\nDP37l/dzXHMwM6sjxclhn33goovgxhvL/zlODmZmdaQ4OQB8+ctwyy2pvJycHMzM6kjz5HDQQfCR\nj8BPf1rez3Gfg5lZHVm1Cnr33rrs6qvhtNPK+zmuOZiZ1ZHmNQeAI4+EX/+6vJ/j5GBmVkdaSg4A\n739/eT/HycHMrI60lhzKzcnBzKyOODmYmVmJ3CQHSf0lTZM0S9Lzkr6YlY+RtFjS09lyRtE+oyXN\nlzRH0mlF5cMlzZQ0T9INlTkkM7Ouq1rJoSNDWTcBV0bEs5J6AX+W9Ej22o8i4kfFG0saCowChgL9\ngamSDs1m0rsZuCQiZkiaJOn0iJhcvsMxM+vaclNziIilEfFstr4WmAP0y15uafa/s4EJEbEpIhYC\n84ERkvoCe0TEjGy724CR2xm/mdkOJTfJoZikQ4BjgD9lRZdLelbSzyUVbsvoBywq2m1JVtYPWFxU\nvpgtScbMzNpReJZD85vgKqHDySFrUvoV8KWsBnETMDAijgGWAv9emRDNzAxg/XrYaSfYddfKf1aH\nps+Q1J2UGG6PiPsAIuK1ok3+H/BAtr4EOLDotf5ZWWvlLRo7duw76w0NDTQ0NHQkVDOzLqt5k1Jj\nYyONjY0V+awOPQlO0m3A6xFxZVFZ34hYmq1/GTg+Ii6QNAy4AziB1Gz0CHBoRISkJ4ArgBnA74Cf\nRMTDLXyenwRnZtbM7NlwzjkwZ07Lr5fzSXDt1hwknQh8Cnhe0jNAANcCF0g6BmgCFgJfAIiI2ZIm\nArOBjcClRWf6y4BbgV2BSS0lBjMza9nq1dXpjAY/Q9rMrG489BD85CfpZ0v8DGkzsx1QtYaxgpOD\nmVndcHIwM7MSTg5mZlbCycHMzEo4OZiZWQknBzMzK+HkYGZmJZwczMyshJODmZmVcHIwM7OtVPNZ\nDuDkYGZWFzZsAKk6z3IAJwczs7pQzSYlcHIwM6sLTg5mZlbCycHMzEo4OZiZWQknBzMzK+HkYGZm\nJZwczMyshJODmZmVcHIwM7MSTg5mZlbCycHMzEqsXu3kYGZmzbjmYGZmJZwczMyshJODmZltZcOG\n9LNaz3IAJwczs9yrdq0BOpAcJPWXNE3SLEnPS7oiK99b0hRJcyVNltS7aJ/RkuZLmiPptKLy4ZJm\nSpon6YbKHJKZWdeSy+QAbAKujIjDgfcBl0k6DLgGmBoRQ4BpwGgAScOAUcBQ4EzgJknK3utm4JKI\nGAwMlnR6WY/GzKwLymVyiIilEfFstr4WmAP0B84GxmebjQdGZutnARMiYlNELATmAyMk9QX2iIgZ\n2Xa3Fe1jZmatyGVyKCbpEOAY4AmgT0Qsg5RAgP2zzfoBi4p2W5KV9QMWF5UvzsrMzKwNtUgO3Tu6\noaRewK+AL0XEWknRbJPmv2+XsWPHvrPe0NBAQ0NDOd/ezKxutJYcGhsbaWxsrMhndig5SOpOSgy3\nR8R9WfEySX0iYlnWZLQ8K18CHFi0e/+srLXyFhUnBzOzHVlryaH5hfN1111Xts/saLPSLcDsiLix\nqOx+4LPZ+kXAfUXl50nqIWkAMAh4Mmt6Wi1pRNZBfWHRPmZm1opVq6B37/a3K6d2aw6STgQ+BTwv\n6RlS89G1wPeAiZIuBl4mjVAiImZLmgjMBjYCl0ZEocnpMuBWYFdgUkQ8XN7DMTPrelatgoMPru5n\nast5Oz8kRR7jMjOrhfPOg7PPhvPPb3s7SUSE2t6qY3yHtJlZzuV+KKuZmVWfk4OZmZVwcjAzsxJO\nDmZmVsLJwczMtrJhA0RU91kO4ORgZpZrhVqDyjJAteOcHMzMcqwWTUrg5GBmlmtODmZmVsLJwczM\nSqxe7eRgZmbNuOZgZmYlnBzMzKyEk4OZmZVYuhT226/6n+vkYGaWYwsXwoAB1f9cJwczsxxbsAAO\nOaT6n+snwZmZ5dSmTdCzJ7zxBvTo0f72fhKcmdkOYNEi6NOnY4mh3JwczMxyqlb9DeDkYGaWW7Xq\nbwAnBzOz3HLNwczMSrjmYGZmJVxzMDOzErWsOfg+BzOzHHrrLdhzT1i3DnbaqWP7+D4HM7Mu7pVX\noH//jieGcnNyMDPLoVo2KUEHkoOkcZKWSZpZVDZG0mJJT2fLGUWvjZY0X9IcSacVlQ+XNFPSPEk3\nlP9QzMy6jlp2RkPHag6/AE5vofxHETE8Wx4GkDQUGAUMBc4EbpJUaP+6GbgkIgYDgyW19J5mZkYd\n1Bwi4jFgZQsvtdTpcTYwISI2RcRCYD4wQlJfYI+ImJFt
dxswsnMhm5l1ffVQc2jN5ZKelfRzSb2z\nsn7AoqJtlmRl/YDFReWLszIzM2tBrWsO3Tu5303ANyMiJH0b+Hfg8+ULC8aOHfvOekNDAw0NDeV8\nezOzXOtIzaGxsZHGxsaKfH6H7nOQdDDwQEQc1dZrkq4BIiK+l732MDAGeBl4NCKGZuXnASdHxP9p\n5fN8n4OZ7bDWrYN3vQvefBO6bUP7Ti3ucxBFfQxZH0LBJ4AXsvX7gfMk9ZA0ABgEPBkRS4HVkkZk\nHdQXAvdtd/RmZl3Qyy/DQQdtW2Iot3ablSTdCTQA75L0CqkmcIqkY4AmYCHwBYCImC1pIjAb2Ahc\nWlQFuAy4FdgVmFQY4WRmZlurdX8DePoMM7PcuekmmDkTfvrTbdvP02eYmXVhCxbUdhgrODmYmeXO\nwoW1b1ZycjAzyxnXHMzMrEQeOqSdHMzMcmTNGtiwAfbbr7ZxODmYmeVIob9BZRlz1HlODmZmVfCN\nb6QaQXtqPeFegZODmVkV/OAHMG9e+9vlob8BnBzMzCpu/fpUa1iwoP1tXXMwM9tBrFiRfr70Uvvb\nuuZgZraDWJk9Ls01BzMze0dHaw4R+bgBDpwczMwqbuXK1FTUXs1h1ar0c6+9Kh5Su5wczMwqbMUK\nGD48NRm1NeF0odZQ63scwMnBzKziVq5MD+/ZfXdYtqz17fLSGQ1ODmZmFbdiBey9Nwwc2HbT0osv\nwpAh1YurLU4OZmYVtnIl7LNPajJqq1N61iw4/PDqxdUWJwczswor1BwGDGi75jB7tpODmdkOo1Bz\naKtZafPmNL3GYYdVN7bWODmYmVVYcc2htWalv/4V+vaFnj2rG1trutc6ADOzrq5Qc9hvv9ZrDrNn\nw7Bh1Y2rLU4OZmYVtmJFSg69e8Orr8LGjbDzzltvk6fOaHCzkplZRTU1pTuf99orJYR3vxteeaV0\nuzx1RoOTg5lZRa1ZA716Qfesnaa1EUuzZuWrWcnJwcysggqd0QUDB5Z2ShdGKg0dWt3Y2uLkYGZW\nQYXO6IKWag4vvQR9+uRnpBI4OZiZVVTzmkNLySFvndHg5GBmVlHNaw4tNSvVZXKQNE7SMkkzi8r2\nljRF0lxJkyX1LnpttKT5kuZIOq2ofLikmZLmSbqh/IdiZpY/Hak55O0eB+hYzeEXwOnNyq4BpkbE\nEGAaMBpA0jBgFDAUOBO4SXpnZvKbgUsiYjAwWFLz9zQz63Ka1xz69IF16+CNN7aU1WXNISIeA1Y2\nKz4bGJ+tjwdGZutnARMiYlNELATmAyMk9QX2iIgZ2Xa3Fe1jZtZlFW6AK5C2fipc3uZUKuhsn8P+\nEbEMICKWAvtn5f2ARUXbLcnK+gGLi8oXZ2VmZl3aypVbNyvB1k1LhZFKvXpVP7a2lGv6jDYefNc5\nY8eOfWe9oaGBhoaGcn+EmVnFNa85wNad0tvTpNTY2EhjY+N2xdeaziaHZZL6RMSyrMloeVa+BDiw\naLv+WVlr5a0qTg5mZvWqvZrD9nRGN79wvu666zr3Ri3oaLOSsqXgfuCz2fpFwH1F5edJ6iFpADAI\neDJrelotaUTWQX1h0T5mZl1WSzWH4uSQx85o6NhQ1juB6aQRRq9I+hzwb8CHJM0FPpj9TkTMBiYC\ns4FJwKURUWhyugwYB8wD5kfEw+U+GDOzvGk+lBVKm5XyNowVQFvO3fkhKfIYl5nZturVK03Tvcce\nW8reeCN1Qq9ZA3vuCcuXl6dDWhIRofa3bJ/vkDYzq5C334a33io98e+xB+y+OzzxRD5HKoGTg5lZ\nxRQ6o9XCtfzAgfDgg/lsUgInBzOzimmpv6FgwICUHPLYGQ1ODmZmFdN86oxiAwbkd6QSODmYmVVM\nS8NYCwYOTD/drGRmtoNp6Qa4ggED0s88Pf2tmJODmVmFtFVzGDYM3v/+fI5UAicHM7OKaavm0K8f\nPP54dePZFk4OZmYV0lbNIe+cHMzMKqStoax55+RgZlYhbQ1lzTsnBzOzCnHNwczMSrjmYGZmJeq5\nQ9pTdpuZVUAE9OgBb76ZflaDp+w2M8u5tWthl12qlxjKzcnBzKwC2roBrh44OZiZVUA99zeAk4OZ\nWUW45mBmZiVcczAzsxKuOZiZWQnXHMzMrISTg5mZlXCzkpmZlXDNwczMSrjmYGZmJVxzMDOzEjt0\nzUHSQknPSXpG0pNZ2d6SpkiaK2mypN5F24+WNF/SHEmnbW/wZmZ5taPXHJqAhog4NiJGZGXXAFMj\nYggwDRgNIGkYMAoYCpwJ3CSpLFPLmpnlyaZNaaruPfesdSSdt73JQS28x9nA+Gx9PDAyWz8LmBAR\nmyJiITAfGIGZWRezahX07g3d6rjhfntDD+ARSTMkfT4r6xMRywAiYimwf1beD1hUtO+SrMzMrEup\n58eDFnTfzv1PjIhXJe0HTJE0l5QwivmRbma2Q1mxor47o2E7k0NEvJr9fE3SvaRmomWS+kTEMkl9\ngeXZ5kuAA4t275+VtWjs2LHvrDc0NNDQ0LA9oZqZVU21ag6NjY00NjZW5L07/QxpSbsD3SJiraSe\nwBTgOuCDwIqI+J6krwF7R8Q1WYf0HcAJpOakR4BDW3pYtJ8hbWb17M474YEH4K67qvu55XyG9PbU\nHPoAv5UU2fvcERFTJD0FTJR0MfAyaYQSETFb0kRgNrARuNQZwMy6onofxgrbkRwiYgFwTAvlK4BT\nW9nneuD6zn6mmVk9qPcb4MB3SJuZlV1XqDk4OZiZlZlrDmZmtpUImD8f9t+//W3zzMnBzKyMHn4Y\nXn8dTqvz2eOcHMzMymTTJrjqKvjhD2HnnWsdzfZxcjAzK5Of/Qze/W746EdrHcn26/RNcJXkm+DM\nrN6sWgVDhsCUKXD00bWJoZw3wTk5mJmVwdVXp1FKP/957WIoZ3Jws5KZtWn8ePjHf4T162sdSX79\n9a9wyy3wrW/VOpLycXIw2wGsXAmjRsHIkfCnP3V8v3vugdGjoVev1NFqLbvmGrjyytTf0FU4OZh1\ncc8/D8cfDwcckIZXjhoFH/oQ/Pd/pzH5rZk0CS6/HB56CCZMSD/vvbd6cdeLqVNTwr3yylpHUmYR\nkbslhWVm2+vuuyP23Tfi9tu3lL31VsS4cRGDBkWceGLEL38ZsXbt1vs1Nqb9pk/fUjZ9esT++0cs\nWlSd2POuqSnixz9O39PkybWOJsnOnWU5D7tD2qwL2rQpNQf9+tfwm9/AMSVTZMLmzem18ePhscfS\n8MtPfSo93nLkyDTd9Ac/uPU+3/42/P736Wp5p52qcyx59Npr8LnPpZ933QUDB9Y6osQd0mbWogj4\n7W/hqKPghRdgxoyWEwOkk/snPwkPPgjz5sF73wvf/CY0NMC4caWJAVLCaWqC732voodRcW+9lRJo\nZzz6KBx7LBx+OPzxj/lJDOXmmoNZF/H738O116YT3/XXwxlngDpxDfn229CjR+uvL1oExx2Xhm6+\n+SYsXpyWV1+
FL30pXVHnzeuvw/Tp8PjjqZb07LMpyQ0YAIcdtmU56aSWT/abNsHkySlpPvEE/OIX\ncPrp1T+O9vg+BzN7x9Kl8JnPwIIFqdln1CjoVuE2galT4Y47oH//tPTrB7vtBhdeCDfckGoktbZk\nCUycmDrTX3wRTjghnfxPOglGjEg1p/nz02tz58KsWamTftdd4dRT03LYYWnE1q23pmP8/Ofh3HNh\nzz1rfXQtc3IwqwMbN6aT6N13w8yZcPvtqSminJYtS81An/wkfP3rtZ/P57nn0oio8eNTzaVcmprg\nkUfSlf7gwa1v9/rrWxLCrFlw9tlw3nnwgQ9A9w482iwC5sxJ/25Tp6amubPOgksugSOPLN/xVIqT\ng1kNbNiQmiUWLtx62XnnLVfQ/funh7xMm5ba/gcNSienXXZJ7fmPPFK+BLF8OZxySrqS/cY3yvOe\n5TB9ejop33svnHji9r1XRPrOrr02NXctX56SwyWXwD/9E/TsmZLwpEkpIU2bBh/+MJx/fkpSu+xS\nnmOqF04OZjVw/vnpqvLYY+GQQ9Jy8MGpPbrQ7r5kSWrmed/7UvPOIYds2f/OO+ErX0lz7xxxxPbF\n8tpr6Wr4E5+A667bvveqhMmTUxNTYZ6hNWvSlfwLL6Qb8v75n2Gvvdp+jyeeSB3gf/tbai4755w0\nwurBB1Pb//Tp6Tv44x9TwrjoolSD6t27OseYR04OZlU2cWJqtnnmGdh9986/z113pZulpkzpfDPF\n66+nkUQf+1iarqEznc7VcM89cOml6ft6/XUYNiwlxc2b0/F/97vw2c9u3T8SAY2NacrrmTNhzJi0\nTUtNQkuWpBvzGhpSDc2cHMyq6tVX03DQBx5IHZnb6+674V/+BX73Oxg+fNv2ffHFVCP56EfhO9/J\nb2IomDUrdfAOGLB1Evjzn+GLX0yJ4j/+I9XG7rknJYV169JUHZ/5TNrXOs7JwaxKItIV+rHHlndS\ntcJVde/e6cq3oQFOPhkOPLDl7Zua4Cc/SQnhW9+CL3wh/4mhPU1N8MtfpnmJNm9ONYurrkp9BpUe\nbdVVOTmYVcm4cfBf/5Xav9sa+98ZTU0we3ZqRmlsTMMo3/WudHfyyJGpltKtW+r0/tznUsfr+PHw\nD/9Q3jhqbc2a1EQ0dGitI6l/Tg5mVbBwYZqw7tFHt78DuSOamlJzy733pmXFijTW/uGH0w1nV121\nY09ZYe1zcjCrsLVrU3PSmWfCV79amxjmz0+J4ZRTqpOcrP45OZiV2RtvpHsYCs07zz+fOn3vuMNX\n61Y/nBysy3r77TSaZ999001MlTgxr1+f7uR96qkty4IFqQmp0Dl8wglpOgizeuLkYF3O22+nycyu\nvx4OPRRWr05DSC++OHXGFt9MVtDUlCaBK8yN8+KL6X1OOSXdB9C375Zt169PY+Lvvjs11QwalCaP\nO+44eM97UrNNuTuczaqtrpODpDOAG0jThY+LiJLJf50c8mv9+jRiZv36dIV91FFbX91v3pxG9vzu\nd/Dkk+kO1nPPbX2Ezbp1ac6h7343jVYZMybdXQzp6n7cuHRn8RFHwB57wKpVW5a//x323nvrWTW7\ndUtTKDz6aHry2amnpo7dBx5I9xSce266q3jffSv+VZlVXd0mB0ndgHnAB4G/ATOA8yLixWbb1XVy\naGxspKGhoSafvXw5/OAH6ST9la+kE+S2aC32N96Am2+GH/84DbE84IDUPr90aXr4/Pvel4ZlPvRQ\nmr3yIx9JzTRTp8KvfgUHHZTmGBoxIrXnF5pz/vKXdKX/9a+n5wm0ZMOG9D4RacqFwrLPPmlunZbi\n37wZnn46TWPds2eah6cenu9by7+dcnD8tVXO5NCBeQrLagQwPyJeBpA0ATgbeLHNvepMLf7A/v73\ndHfpz36WnubVvXu62v70p+FrX0sn7GIRaS6g555Lc9sXfi5Y0MjgwQ0cdhgMGZKuxhcuhP/8z3QV\n3nzah6VL4Q9/gP/5nzTJ2ne+kxJBwcc/DjfemBLJhAmpWefoo1Ob/mWXpRjbmxxt111T53BHFL77\nnXZKyen44zu2X17U+8nJ8Xcd1U4O/YBFRb8vJiWMHVZEejjLhg2piWXt2nSVvnZtWjZuTBO7FZbN\nm9OJv3h5+ul0o9Y556QTfOEu2699LdUijjwSLrgA+vTZ0j4/dy706pVO1EcfnZpavvnN1MRz/vlb\ntnvkkXTl/fjjLU+V3Ldvms5h1KjWj7F79y3z45tZfah2cuiwIUO2/r1btzRdQLdurS9S+1MKRKQT\nbPMTbkvbNTW1v7QU05o1qV2+8DuUfubbb6eEUHjq1i67pJN1YenZMy277LJ1IujWrfS9Djggte83\nf4JVnz6pNnH11enKf/369PSqK65I329Ls2L26JGSST3MXW9mlVPtPof3AmMj4ozs92uAaN4pLal+\nOxzMzGqoXjukdwLmkjqkXwWeBM6PiDlVC8LMzNpV1WaliNgs6XJgCluGsjoxmJnlTC5vgjMzs9qq\nyqzpksZJWiZpZlHZUZKmS3pO0n2SemXlx0t6pmgZWbTPcEkzJc2TdEM1Yt/W+IteP0jSG5KurKf4\nJR0saZ2kp7PlpnqKv9lrL2Sv96hV/Nv43V+Q/c0/nf3cLOmo7LX35P27l9Rd0q1ZnLOyPsXCPrn/\n25G0s6RbsjifkXRyDuLvL2la9n0+L+mKrHxvSVMkzZU0WVLvon1GS5ovaY6k0zp9DBFR8QU4CTgG\nmFlU9iRwUrb+WeCb2fquQLdsvS+wrOj3PwHHZ+uTgNPzFn/R6/cAdwNXFpXlPn7g4OLtmr1PPcS/\nE/AccET2+95sqSFXPf7O/O1k5UeQ7gmqp+/+fODObH03YAFwUB3FfympqRtgP+CpHHz/fYFjsvVe\npD7bw4DvAV/Nyr8G/Fu2Pgx4htRlcAjwl87+/Vel5hARjwErmxUfmpUDTAXOybbdEBFNWfluQBOA\npL7AHhExI3vtNmAkVbAt8QNIOht4CZhVVFY38QMlox3qKP7TgOci4oVs35UREbWKvxPffcH5wASo\nq+8+gJ5KA092B94C1tRB/J/I1ocB07L9XgNWSTquxvEvjYhns/W1wBygP+nm4fHZZuOL4jkLmBAR\nmyJiITAfGNGZY6jlw/hmSTorWx9FOmAAJI2Q9ALpCvB/Z8miH+mmuYLFWVmttBh/VkX9KnAdW59k\n6yL+zCFZ08ajkk7Kyuol/sEAkh6W9JSkq7PyPMXf1ndfcC5wV7aep9ih9fh/BawjjURcCPwwIlaR\n//gLD2d9DjhL0k6SBgDvyV7LRfySDiHVgp4A+kTEMkgJBNg/26z5jcZLsrJtPoZaJoeLgcskzQB6\nAm8XXoiIJyPiCOB44NpCm3HOtBb/GODHEbGuZpF1TGvxv0pqChgOXAXcqWb9KTnRWvzdgRNJV97/\nCHxc0im1CbFVrf7tQ7o4At6MiNm1CK4DWov/BGATqSlkIPCV7ISWN63F
fwvpZDoD+BHwONDCLbLV\nl/0f/BXwpawG0XwkUdlHFtXsDumImAecDiDpUOAjLWwzV9JaUvvrErZkeEhXK0uqEGqL2oj/BOAc\nSd8ntXdvlrQB+A11EH9EvE32nyUinpb0V9LVeL18/4uBP0TEyuy1ScBw4A5yEn8H/vbPY0utAern\nuz8feDir6b8m6XHgOOAx6iD+iNgMFA8geZw0Uegqahi/pO6kxHB7RNyXFS+T1CcilmVNRsuz8tb+\nVrb5b6iaNQdR1Mwiab/sZzfg/wI/zX4/JGuzRNLBwBBgYVZ1Wp01OQm4ELiP6ulQ/BHxvyJiYEQM\nJE1N/t2IuKle4pe0b1aGpIHAIOCleokfmAwcKWnX7D/VycCsGsff0djJYhtF1t8A7zQb5Pm7vzl7\n6RXgA9lrPYH3AnPqIP7C3/5uknbP1j8EbIyIF3MQ/y3A7Ii4sajsflJnOsBFRfHcD5wnqUfWNDYI\neLJTx1ClHvc7SVN0v0X6A/occAWp5/1F0gm0sO2ngReAp4GngI8VvfYe4HlSJ8uN1Yh9W+Nvtt8Y\nth6tlPv4SZ1zxd//h+sp/mz7C7JjmAlcX8v4OxH7ycD0Ft4n9989qYlmYvbdv1CHf/sHZ2WzSDfq\nHpiD+E/PriEPAAAAW0lEQVQkNW09SxqF9DRwBrAPqTN9bhbrXkX7jCaNUpoDnNbZY/BNcGZmVqKW\nHdJmZpZTTg5mZlbCycHMzEo4OZiZWQknBzMzK+HkYGZmJZwczMyshJODmZmV+P/o527arFlzVQAA\nAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot the number of UFO reports per year (line plot is the default)\n", "ufo.Year.value_counts().sort_index().plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 26. How do I find and remove duplicate rows in pandas? ([video](https://www.youtube.com/watch?v=ht5buXUMqkQ&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=26))" ] }, { "cell_type": "code", "execution_count": 251, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegenderoccupationzip_code
user_id
124Mtechnician85711
253Fother94043
323Mwriter32067
424Mtechnician43537
533Fother15213
\n", "
" ], "text/plain": [ " age gender occupation zip_code\n", "user_id \n", "1 24 M technician 85711\n", "2 53 F other 94043\n", "3 23 M writer 32067\n", "4 24 M technician 43537\n", "5 33 F other 15213" ] }, "execution_count": 251, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of movie reviewers into a DataFrame\n", "user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']\n", "users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols, index_col='user_id')\n", "users.head()" ] }, { "cell_type": "code", "execution_count": 252, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(943, 4)" ] }, "execution_count": 252, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users.shape" ] }, { "cell_type": "code", "execution_count": 253, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "user_id\n", "939 False\n", "940 True\n", "941 False\n", "942 False\n", "943 False\n", "Name: zip_code, dtype: bool" ] }, "execution_count": 253, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# detect duplicate zip codes: True if an item is identical to a previous item\n", "users.zip_code.duplicated().tail()" ] }, { "cell_type": "code", "execution_count": 254, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "148" ] }, "execution_count": 254, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the duplicate items (True becomes 1, False becomes 0)\n", "users.zip_code.duplicated().sum()" ] }, { "cell_type": "code", "execution_count": 255, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "user_id\n", "939 False\n", "940 False\n", "941 False\n", "942 False\n", "943 False\n", "dtype: bool" ] }, "execution_count": 255, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# detect duplicate DataFrame rows: True if an entire row is identical to a previous row\n", "users.duplicated().tail()" ] }, { "cell_type": "code", "execution_count": 256, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 256, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the duplicate rows\n", "users.duplicated().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Logic for [**`duplicated`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html):\n", "\n", "- **`keep='first'`** (default): Mark duplicates as True except for the first occurrence.\n", "- **`keep='last'`**: Mark duplicates as True except for the last occurrence.\n", "- **`keep=False`**: Mark all duplicates as True." ] }, { "cell_type": "code", "execution_count": 257, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegenderoccupationzip_code
user_id
49621Fstudent55414
57251Meducator20003
62117Mstudent60402
68428Mstudent55414
73344Fother60630
80527Fother20009
89032Mstudent97301
\n", "
" ], "text/plain": [ " age gender occupation zip_code\n", "user_id \n", "496 21 F student 55414\n", "572 51 M educator 20003\n", "621 17 M student 60402\n", "684 28 M student 55414\n", "733 44 F other 60630\n", "805 27 F other 20009\n", "890 32 M student 97301" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the duplicate rows (ignoring the first occurrence)\n", "users.loc[users.duplicated(keep='first'), :]" ] }, { "cell_type": "code", "execution_count": 258, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegenderoccupationzip_code
user_id
6717Mstudent60402
8551Meducator20003
19821Fstudent55414
35032Mstudent97301
42828Mstudent55414
43727Fother20009
46044Fother60630
\n", "
" ], "text/plain": [ " age gender occupation zip_code\n", "user_id \n", "67 17 M student 60402\n", "85 51 M educator 20003\n", "198 21 F student 55414\n", "350 32 M student 97301\n", "428 28 M student 55414\n", "437 27 F other 20009\n", "460 44 F other 60630" ] }, "execution_count": 258, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the duplicate rows (ignoring the last occurrence)\n", "users.loc[users.duplicated(keep='last'), :]" ] }, { "cell_type": "code", "execution_count": 259, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegenderoccupationzip_code
user_id
6717Mstudent60402
8551Meducator20003
19821Fstudent55414
35032Mstudent97301
42828Mstudent55414
43727Fother20009
46044Fother60630
49621Fstudent55414
57251Meducator20003
62117Mstudent60402
68428Mstudent55414
73344Fother60630
80527Fother20009
89032Mstudent97301
\n", "
" ], "text/plain": [ " age gender occupation zip_code\n", "user_id \n", "67 17 M student 60402\n", "85 51 M educator 20003\n", "198 21 F student 55414\n", "350 32 M student 97301\n", "428 28 M student 55414\n", "437 27 F other 20009\n", "460 44 F other 60630\n", "496 21 F student 55414\n", "572 51 M educator 20003\n", "621 17 M student 60402\n", "684 28 M student 55414\n", "733 44 F other 60630\n", "805 27 F other 20009\n", "890 32 M student 97301" ] }, "execution_count": 259, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the duplicate rows (including all duplicates)\n", "users.loc[users.duplicated(keep=False), :]" ] }, { "cell_type": "code", "execution_count": 260, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(936, 4)" ] }, "execution_count": 260, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop the duplicate rows (inplace=False by default)\n", "users.drop_duplicates(keep='first').shape" ] }, { "cell_type": "code", "execution_count": 261, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(936, 4)" ] }, "execution_count": 261, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users.drop_duplicates(keep='last').shape" ] }, { "cell_type": "code", "execution_count": 262, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(929, 4)" ] }, "execution_count": 262, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users.drop_duplicates(keep=False).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`drop_duplicates`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html)" ] }, { "cell_type": "code", "execution_count": 263, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "16" ] }, "execution_count": 263, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# only consider a subset of columns when identifying duplicates\n", "users.duplicated(subset=['age', 'zip_code']).sum()" ] }, { "cell_type": "code", "execution_count": 264, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(927, 4)" ] }, "execution_count": 264, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users.drop_duplicates(subset=['age', 'zip_code']).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 27. How do I avoid a SettingWithCopyWarning in pandas? ([video](https://www.youtube.com/watch?v=4R4WsDJ-KVc&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=27))" ] }, { "cell_type": "code", "execution_count": 265, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
star_ratingtitlecontent_ratinggenredurationactors_list
09.3The Shawshank RedemptionRCrime142[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
19.2The GodfatherRCrime175[u'Marlon Brando', u'Al Pacino', u'James Caan']
29.1The Godfather: Part IIRCrime200[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
39.0The Dark KnightPG-13Action152[u'Christian Bale', u'Heath Ledger', u'Aaron E...
48.9Pulp FictionRCrime154[u'John Travolta', u'Uma Thurman', u'Samuel L....
\n", "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "0 9.3 The Shawshank Redemption R Crime 142 \n", "1 9.2 The Godfather R Crime 175 \n", "2 9.1 The Godfather: Part II R Crime 200 \n", "3 9.0 The Dark Knight PG-13 Action 152 \n", "4 8.9 Pulp Fiction R Crime 154 \n", "\n", " actors_list \n", "0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n", "1 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... \n", "4 [u'John Travolta', u'Uma Thurman', u'Samuel L.... " ] }, "execution_count": 265, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of top-rated IMDb movies into a DataFrame\n", "movies = pd.read_csv('http://bit.ly/imdbratings')\n", "movies.head()" ] }, { "cell_type": "code", "execution_count": 266, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 266, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the missing values in the 'content_rating' Series\n", "movies.content_rating.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 267, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
star_ratingtitlecontent_ratinggenredurationactors_list
1878.2Butch Cassidy and the Sundance KidNaNBiography110[u'Paul Newman', u'Robert Redford', u'Katharin...
6497.7Where Eagles DareNaNAction158[u'Richard Burton', u'Clint Eastwood', u'Mary ...
9367.4True GritNaNAdventure128[u'John Wayne', u'Kim Darby', u'Glen Campbell']
\n", "
" ], "text/plain": [ " star_rating title content_rating \\\n", "187 8.2 Butch Cassidy and the Sundance Kid NaN \n", "649 7.7 Where Eagles Dare NaN \n", "936 7.4 True Grit NaN \n", "\n", " genre duration actors_list \n", "187 Biography 110 [u'Paul Newman', u'Robert Redford', u'Katharin... \n", "649 Action 158 [u'Richard Burton', u'Clint Eastwood', u'Mary ... \n", "936 Adventure 128 [u'John Wayne', u'Kim Darby', u'Glen Campbell'] " ] }, "execution_count": 267, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the DataFrame rows that contain those missing values\n", "movies[movies.content_rating.isnull()]" ] }, { "cell_type": "code", "execution_count": 268, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "R 460\n", "PG-13 189\n", "PG 123\n", "NOT RATED 65\n", "APPROVED 47\n", "UNRATED 38\n", "G 32\n", "PASSED 7\n", "NC-17 7\n", "X 4\n", "GP 3\n", "TV-MA 1\n", "Name: content_rating, dtype: int64" ] }, "execution_count": 268, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the unique values in the 'content_rating' Series\n", "movies.content_rating.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Mark the 'NOT RATED' values as missing values, represented by 'NaN'." ] }, { "cell_type": "code", "execution_count": 269, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
star_ratingtitlecontent_ratinggenredurationactors_list
58.912 Angry MenNOT RATEDDrama96[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
68.9The Good, the Bad and the UglyNOT RATEDWestern161[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...
418.5Sunset Blvd.NOT RATEDDrama110[u'William Holden', u'Gloria Swanson', u'Erich...
638.4MNOT RATEDCrime99[u'Peter Lorre', u'Ellen Widmann', u'Inge Land...
668.4Munna Bhai M.B.B.S.NOT RATEDComedy156[u'Sunil Dutt', u'Sanjay Dutt', u'Arshad Warsi']
\n", "
" ], "text/plain": [ " star_rating title content_rating genre \\\n", "5 8.9 12 Angry Men NOT RATED Drama \n", "6 8.9 The Good, the Bad and the Ugly NOT RATED Western \n", "41 8.5 Sunset Blvd. NOT RATED Drama \n", "63 8.4 M NOT RATED Crime \n", "66 8.4 Munna Bhai M.B.B.S. NOT RATED Comedy \n", "\n", " duration actors_list \n", "5 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... \n", "6 161 [u'Clint Eastwood', u'Eli Wallach', u'Lee Van ... \n", "41 110 [u'William Holden', u'Gloria Swanson', u'Erich... \n", "63 99 [u'Peter Lorre', u'Ellen Widmann', u'Inge Land... \n", "66 156 [u'Sunil Dutt', u'Sanjay Dutt', u'Arshad Warsi'] " ] }, "execution_count": 269, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# first, locate the relevant rows\n", "movies[movies.content_rating=='NOT RATED'].head()" ] }, { "cell_type": "code", "execution_count": 270, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "5 NOT RATED\n", "6 NOT RATED\n", "41 NOT RATED\n", "63 NOT RATED\n", "66 NOT RATED\n", "Name: content_rating, dtype: object" ] }, "execution_count": 270, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# then, select the 'content_rating' Series from those rows\n", "movies[movies.content_rating=='NOT RATED'].content_rating.head()" ] }, { "cell_type": "code", "execution_count": 271, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\Kevin\\Anaconda\\lib\\site-packages\\pandas\\core\\generic.py:2701: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " self[name] = value\n" ] } ], "source": [ "# finally, replace the 'NOT RATED' values with 'NaN' (imported from NumPy)\n", "import numpy as np\n", "movies[movies.content_rating=='NOT RATED'].content_rating = np.nan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Problem:** That statement involves two operations, a **`__getitem__`** and a **`__setitem__`**. pandas can't guarantee whether the **`__getitem__`** operation returns a view or a copy of the data.\n", "\n", "- If **`__getitem__`** returns a view of the data, **`__setitem__`** will affect the 'movies' DataFrame.\n", "- But if **`__getitem__`** returns a copy of the data, **`__setitem__`** will not affect the 'movies' DataFrame." ] }, { "cell_type": "code", "execution_count": 272, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 272, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the 'content_rating' Series has not changed\n", "movies.content_rating.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:** Use the **`loc`** method, which replaces the 'NOT RATED' values in a single **`__setitem__`** operation." 
] }, { "cell_type": "code", "execution_count": 273, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# replace the 'NOT RATED' values with 'NaN' (does not cause a SettingWithCopyWarning)\n", "movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan" ] }, { "cell_type": "code", "execution_count": 274, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "68" ] }, "execution_count": 274, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this time, the 'content_rating' Series has changed\n", "movies.content_rating.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Summary:** Use the **`loc`** method any time you are selecting rows and columns in the same statement.\n", "\n", "**More information:** [Modern Pandas (Part 1)](http://tomaugspurger.github.io/modern-1.html)" ] }, { "cell_type": "code", "execution_count": 275, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
star_ratingtitlecontent_ratinggenredurationactors_list
09.3The Shawshank RedemptionRCrime142[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
19.2The GodfatherRCrime175[u'Marlon Brando', u'Al Pacino', u'James Caan']
29.1The Godfather: Part IIRCrime200[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
39.0The Dark KnightPG-13Action152[u'Christian Bale', u'Heath Ledger', u'Aaron E...
\n", "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "0 9.3 The Shawshank Redemption R Crime 142 \n", "1 9.2 The Godfather R Crime 175 \n", "2 9.1 The Godfather: Part II R Crime 200 \n", "3 9.0 The Dark Knight PG-13 Action 152 \n", "\n", " actors_list \n", "0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n", "1 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... " ] }, "execution_count": 275, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a DataFrame only containing movies with a high 'star_rating'\n", "top_movies = movies.loc[movies.star_rating >= 9, :]\n", "top_movies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Fix the 'duration' for 'The Shawshank Redemption'." ] }, { "cell_type": "code", "execution_count": 276, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\Kevin\\Anaconda\\lib\\site-packages\\pandas\\core\\indexing.py:465: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " self.obj[item] = s\n" ] } ], "source": [ "# overwrite the relevant cell with the correct duration\n", "top_movies.loc[0, 'duration'] = 150" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Problem:** pandas isn't sure whether 'top_movies' is a view or a copy of 'movies'." ] }, { "cell_type": "code", "execution_count": 277, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
star_ratingtitlecontent_ratinggenredurationactors_list
09.3The Shawshank RedemptionRCrime150[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
19.2The GodfatherRCrime175[u'Marlon Brando', u'Al Pacino', u'James Caan']
29.1The Godfather: Part IIRCrime200[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
39.0The Dark KnightPG-13Action152[u'Christian Bale', u'Heath Ledger', u'Aaron E...
\n", "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "0 9.3 The Shawshank Redemption R Crime 150 \n", "1 9.2 The Godfather R Crime 175 \n", "2 9.1 The Godfather: Part II R Crime 200 \n", "3 9.0 The Dark Knight PG-13 Action 152 \n", "\n", " actors_list \n", "0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n", "1 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... " ] }, "execution_count": 277, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'top_movies' DataFrame has been updated\n", "top_movies" ] }, { "cell_type": "code", "execution_count": 278, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
star_ratingtitlecontent_ratinggenredurationactors_list
09.3The Shawshank RedemptionRCrime142[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
\n", "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "0 9.3 The Shawshank Redemption R Crime 142 \n", "\n", " actors_list \n", "0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... " ] }, "execution_count": 278, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'movies' DataFrame has not been updated\n", "movies.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:** Any time you are attempting to create a DataFrame copy, use the [**`copy`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html) method." ] }, { "cell_type": "code", "execution_count": 279, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# explicitly create a copy of 'movies'\n", "top_movies = movies.loc[movies.star_rating >= 9, :].copy()" ] }, { "cell_type": "code", "execution_count": 280, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# pandas now knows that you are updating a copy instead of a view (does not cause a SettingWithCopyWarning)\n", "top_movies.loc[0, 'duration'] = 150" ] }, { "cell_type": "code", "execution_count": 281, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
star_ratingtitlecontent_ratinggenredurationactors_list
09.3The Shawshank RedemptionRCrime150[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
19.2The GodfatherRCrime175[u'Marlon Brando', u'Al Pacino', u'James Caan']
29.1The Godfather: Part IIRCrime200[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
39.0The Dark KnightPG-13Action152[u'Christian Bale', u'Heath Ledger', u'Aaron E...
\n", "
" ], "text/plain": [ " star_rating title content_rating genre duration \\\n", "0 9.3 The Shawshank Redemption R Crime 150 \n", "1 9.2 The Godfather R Crime 175 \n", "2 9.1 The Godfather: Part II R Crime 200 \n", "3 9.0 The Dark Knight PG-13 Action 152 \n", "\n", " actors_list \n", "0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... \n", "1 [u'Marlon Brando', u'Al Pacino', u'James Caan'] \n", "2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv... \n", "3 [u'Christian Bale', u'Heath Ledger', u'Aaron E... " ] }, "execution_count": 281, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'top_movies' DataFrame has been updated\n", "top_movies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation on indexing and selection: [Returning a view versus a copy](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy)\n", "\n", "Stack Overflow: [What is the point of views in pandas if it is undefined whether an indexing operation returns a view or a copy?](http://stackoverflow.com/questions/34884536/what-is-the-point-of-views-in-pandas-if-it-is-undefined-whether-an-indexing-oper)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 28. How do I change display options in pandas? ([video](https://www.youtube.com/watch?v=yiO43TQ4xvc&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=28))" ] }, { "cell_type": "code", "execution_count": 282, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# read a dataset of alcohol consumption into a DataFrame\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry')" ] }, { "cell_type": "code", "execution_count": 283, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0Afghanistan0000.0Asia
1Albania89132544.9Europe
2Algeria250140.7Africa
3Andorra24513831212.4Europe
4Angola21757455.9Africa
5Antigua & Barbuda102128454.9North America
6Argentina193252218.3South America
7Armenia21179113.8Europe
8Australia2617221210.4Oceania
9Austria279751919.7Europe
10Azerbaijan214651.3Europe
11Bahamas122176516.3North America
12Bahrain426372.0Asia
13Bangladesh0000.0Asia
14Barbados143173366.3North America
15Belarus1423734214.4Europe
16Belgium2958421210.5Europe
17Belize26311486.8North America
18Benin344131.1Africa
19Bhutan23000.4Asia
20Bolivia1674183.8South America
21Bosnia-Herzegovina7617384.6Europe
22Botswana17335355.4Africa
23Brazil245145167.2South America
24Brunei31210.6Asia
25Bulgaria2312529410.3Europe
26Burkina Faso25774.3Africa
27Burundi88006.3Africa
28Cote d'Ivoire37174.0Africa
29Cabo Verde14456164.0Africa
.....................
163Suriname12817875.6South America
164Swaziland90224.7Africa
165Sweden152601867.2Europe
166Switzerland18510028010.2Europe
167Syria535161.0Asia
168Tajikistan21500.3Asia
169Thailand9925816.4Asia
170Macedonia10627863.9Europe
171Timor-Leste1140.1Asia
172Togo362191.3Africa
173Tonga362151.1Oceania
174Trinidad & Tobago19715676.4North America
175Tunisia513201.3Africa
176Turkey512271.4Asia
177Turkmenistan1971322.2Asia
178Tuvalu64191.0Oceania
179Uganda45908.3Africa
180Ukraine206237458.9Europe
181United Arab Emirates1613552.8Asia
182United Kingdom21912619510.4Europe
183Tanzania36615.7Africa
184USA249158848.7North America
185Uruguay115352206.6South America
186Uzbekistan2510182.4Asia
187Vanuatu2118110.9Oceania
188Venezuela33310037.7South America
189Vietnam111212.0Asia
190Yemen6000.1Asia
191Zambia321942.5Africa
192Zimbabwe641844.7Africa
\n", "

193 rows × 6 columns

\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "5 Antigua & Barbuda 102 128 45 \n", "6 Argentina 193 25 221 \n", "7 Armenia 21 179 11 \n", "8 Australia 261 72 212 \n", "9 Austria 279 75 191 \n", "10 Azerbaijan 21 46 5 \n", "11 Bahamas 122 176 51 \n", "12 Bahrain 42 63 7 \n", "13 Bangladesh 0 0 0 \n", "14 Barbados 143 173 36 \n", "15 Belarus 142 373 42 \n", "16 Belgium 295 84 212 \n", "17 Belize 263 114 8 \n", "18 Benin 34 4 13 \n", "19 Bhutan 23 0 0 \n", "20 Bolivia 167 41 8 \n", "21 Bosnia-Herzegovina 76 173 8 \n", "22 Botswana 173 35 35 \n", "23 Brazil 245 145 16 \n", "24 Brunei 31 2 1 \n", "25 Bulgaria 231 252 94 \n", "26 Burkina Faso 25 7 7 \n", "27 Burundi 88 0 0 \n", "28 Cote d'Ivoire 37 1 7 \n", "29 Cabo Verde 144 56 16 \n", ".. ... ... ... ... \n", "163 Suriname 128 178 7 \n", "164 Swaziland 90 2 2 \n", "165 Sweden 152 60 186 \n", "166 Switzerland 185 100 280 \n", "167 Syria 5 35 16 \n", "168 Tajikistan 2 15 0 \n", "169 Thailand 99 258 1 \n", "170 Macedonia 106 27 86 \n", "171 Timor-Leste 1 1 4 \n", "172 Togo 36 2 19 \n", "173 Tonga 36 21 5 \n", "174 Trinidad & Tobago 197 156 7 \n", "175 Tunisia 51 3 20 \n", "176 Turkey 51 22 7 \n", "177 Turkmenistan 19 71 32 \n", "178 Tuvalu 6 41 9 \n", "179 Uganda 45 9 0 \n", "180 Ukraine 206 237 45 \n", "181 United Arab Emirates 16 135 5 \n", "182 United Kingdom 219 126 195 \n", "183 Tanzania 36 6 1 \n", "184 USA 249 158 84 \n", "185 Uruguay 115 35 220 \n", "186 Uzbekistan 25 101 8 \n", "187 Vanuatu 21 18 11 \n", "188 Venezuela 333 100 3 \n", "189 Vietnam 111 2 1 \n", "190 Yemen 6 0 0 \n", "191 Zambia 32 19 4 \n", "192 Zimbabwe 64 18 4 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "2 0.7 Africa \n", "3 12.4 Europe \n", "4 5.9 Africa \n", "5 4.9 North America \n", "6 8.3 South America \n", "7 3.8 Europe \n", "8 10.4 Oceania \n", "9 9.7 Europe \n", "10 1.3 Europe \n", "11 6.3 North America \n", "12 2.0 Asia \n", "13 0.0 Asia \n", "14 6.3 North America \n", "15 14.4 Europe \n", "16 10.5 Europe \n", "17 6.8 North America \n", "18 1.1 Africa \n", "19 0.4 Asia \n", "20 3.8 South America \n", "21 4.6 Europe \n", "22 5.4 Africa \n", "23 7.2 South America \n", "24 0.6 Asia \n", "25 10.3 Europe \n", "26 4.3 Africa \n", "27 6.3 Africa \n", "28 4.0 Africa \n", "29 4.0 Africa \n", ".. ... ... 
\n", "163 5.6 South America \n", "164 4.7 Africa \n", "165 7.2 Europe \n", "166 10.2 Europe \n", "167 1.0 Asia \n", "168 0.3 Asia \n", "169 6.4 Asia \n", "170 3.9 Europe \n", "171 0.1 Asia \n", "172 1.3 Africa \n", "173 1.1 Oceania \n", "174 6.4 North America \n", "175 1.3 Africa \n", "176 1.4 Asia \n", "177 2.2 Asia \n", "178 1.0 Oceania \n", "179 8.3 Africa \n", "180 8.9 Europe \n", "181 2.8 Asia \n", "182 10.4 Europe \n", "183 5.7 Africa \n", "184 8.7 North America \n", "185 6.6 South America \n", "186 2.4 Asia \n", "187 0.9 Oceania \n", "188 7.7 South America \n", "189 2.0 Asia \n", "190 0.1 Asia \n", "191 2.5 Africa \n", "192 4.7 Africa \n", "\n", "[193 rows x 6 columns]" ] }, "execution_count": 283, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# only 60 rows will be displayed when printing\n", "drinks" ] }, { "cell_type": "code", "execution_count": 284, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "60" ] }, "execution_count": 284, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check the current setting for the 'max_rows' option\n", "pd.get_option('display.max_rows')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`get_option`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_option.html)" ] }, { "cell_type": "code", "execution_count": 285, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0Afghanistan0000.0Asia
1Albania89132544.9Europe
2Algeria250140.7Africa
3Andorra24513831212.4Europe
4Angola21757455.9Africa
5Antigua & Barbuda102128454.9North America
6Argentina193252218.3South America
7Armenia21179113.8Europe
8Australia2617221210.4Oceania
9Austria279751919.7Europe
10Azerbaijan214651.3Europe
11Bahamas122176516.3North America
12Bahrain426372.0Asia
13Bangladesh0000.0Asia
14Barbados143173366.3North America
15Belarus1423734214.4Europe
16Belgium2958421210.5Europe
17Belize26311486.8North America
18Benin344131.1Africa
19Bhutan23000.4Asia
20Bolivia1674183.8South America
21Bosnia-Herzegovina7617384.6Europe
22Botswana17335355.4Africa
23Brazil245145167.2South America
24Brunei31210.6Asia
25Bulgaria2312529410.3Europe
26Burkina Faso25774.3Africa
27Burundi88006.3Africa
28Cote d'Ivoire37174.0Africa
29Cabo Verde14456164.0Africa
30Cambodia576512.2Asia
31Cameroon147145.8Africa
32Canada2401221008.2North America
33Central African Republic17211.8Africa
34Chad15110.4Africa
35Chile1301241727.6South America
36China7919285.0Asia
37Colombia1597634.2South America
38Comoros1310.1Africa
39Congo76191.7Africa
40Cook Islands0254745.9Oceania
41Costa Rica14987114.4North America
42Croatia2308725410.2Europe
43Cuba9313754.2North America
44Cyprus1921541138.2Europe
45Czech Republic36117013411.8Europe
46North Korea0000.0Asia
47DR Congo32312.3Africa
48Denmark2248127810.4Europe
49Djibouti154431.1Africa
50Dominica52286266.6North America
51Dominican Republic19314796.2North America
52Ecuador1627434.2South America
53Egypt6410.2Africa
54El Salvador526922.2North America
55Equatorial Guinea9202335.8Africa
56Eritrea18000.5Africa
57Estonia224194599.5Europe
58Ethiopia20300.7Africa
59Fiji773512.0Oceania
60Finland2631339710.0Europe
61France12715137011.8Europe
62Gabon34798598.9Africa
63Gambia8012.4Africa
64Georgia521001495.4Europe
65Germany34611717511.3Europe
66Ghana313101.8Africa
67Greece1331122188.3Europe
68Grenada1994382811.9North America
69Guatemala536922.2North America
70Guinea9020.2Africa
71Guinea-Bissau2831212.5Africa
72Guyana9330217.1South America
73Haiti132615.9North America
74Honduras699823.0North America
75Hungary23421518511.3Europe
76Iceland23361786.6Europe
77India911402.2Asia
78Indonesia5100.1Asia
79Iran0000.0Asia
80Iraq9300.2Asia
81Ireland31311816511.4Europe
82Israel636992.5Asia
83Italy85422376.5Europe
84Jamaica829793.4North America
85Japan77202167.0Asia
86Jordan62110.5Asia
87Kazakhstan124246126.8Asia
88Kenya582221.8Africa
89Kiribati213411.0Oceania
90Kuwait0000.0Asia
91Kyrgyzstan319762.4Asia
92Laos6201236.2Asia
93Latvia2812166210.5Europe
94Lebanon2055311.9Asia
95Lesotho822902.8Africa
96Liberia1915223.1Africa
97Libya0000.0Africa
98Lithuania3432445612.9Europe
99Luxembourg23613327111.4Europe
100Madagascar261540.8Africa
101Malawi81111.5Africa
102Malaysia13400.3Asia
103Maldives0000.0Asia
104Mali5110.6Africa
105Malta1491001206.6Europe
106Marshall Islands0000.0Oceania
107Mauritania0000.0Africa
108Mauritius9831182.6Africa
109Mexico2386855.5North America
110Micronesia6250182.3Oceania
111Monaco0000.0Europe
112Mongolia7718984.9Asia
113Montenegro311141284.9Europe
114Morocco126100.5Africa
115Mozambique471851.3Africa
116Myanmar5100.1Asia
117Namibia376316.8Africa
118Nauru49081.0Oceania
119Nepal5600.2Asia
120Netherlands251881909.4Europe
121New Zealand203791759.3Oceania
122Nicaragua7811813.5North America
123Niger3210.1Africa
124Nigeria42529.1Africa
125Niue18820077.0Oceania
126Norway169711296.7Europe
127Oman221610.7Asia
128Pakistan0000.0Asia
129Palau30663236.9Oceania
130Panama285104187.2North America
131Papua New Guinea443911.5Oceania
132Paraguay213117747.3South America
133Peru163160216.1South America
134Philippines7118614.6Asia
135Poland3432155610.9Europe
136Portugal1946733911.0Europe
137Qatar14270.9Asia
138South Korea1401699.8Asia
139Moldova109226186.3Europe
140Romania29712216710.4Europe
141Russian Federation2473267311.5Asia
142Rwanda43206.8Africa
143St. Kitts & Nevis194205327.7North America
144St. Lucia1713157110.1North America
145St. Vincent & the Grenadines120221116.3North America
146Samoa10518242.6Oceania
147San Marino0000.0Europe
148Sao Tome & Principe56381404.2Africa
149Saudi Arabia0500.1Asia
150Senegal9170.3Africa
151Serbia2831311279.6Europe
152Seychelles15725514.1Africa
153Sierra Leone25326.7Africa
154Singapore6012111.5Asia
155Slovakia19629311611.4Europe
156Slovenia2705127610.6Europe
157Solomon Islands561111.2Oceania
158Somalia0000.0Africa
159South Africa22576818.2Africa
160Spain28415711210.0Europe
161Sri Lanka1610402.2Asia
162Sudan81301.7Africa
163Suriname12817875.6South America
164Swaziland90224.7Africa
165Sweden152601867.2Europe
166Switzerland18510028010.2Europe
167Syria535161.0Asia
168Tajikistan21500.3Asia
169Thailand9925816.4Asia
170Macedonia10627863.9Europe
171Timor-Leste1140.1Asia
172Togo362191.3Africa
173Tonga362151.1Oceania
174Trinidad & Tobago19715676.4North America
175Tunisia513201.3Africa
176Turkey512271.4Asia
177Turkmenistan1971322.2Asia
178Tuvalu64191.0Oceania
179Uganda45908.3Africa
180Ukraine206237458.9Europe
181United Arab Emirates1613552.8Asia
182United Kingdom21912619510.4Europe
183Tanzania36615.7Africa
184USA249158848.7North America
185Uruguay115352206.6South America
186Uzbekistan2510182.4Asia
187Vanuatu2118110.9Oceania
188Venezuela33310037.7South America
189Vietnam111212.0Asia
190Yemen6000.1Asia
191Zambia321942.5Africa
192Zimbabwe641844.7Africa
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings \\\n", "0 Afghanistan 0 0 \n", "1 Albania 89 132 \n", "2 Algeria 25 0 \n", "3 Andorra 245 138 \n", "4 Angola 217 57 \n", "5 Antigua & Barbuda 102 128 \n", "6 Argentina 193 25 \n", "7 Armenia 21 179 \n", "8 Australia 261 72 \n", "9 Austria 279 75 \n", "10 Azerbaijan 21 46 \n", "11 Bahamas 122 176 \n", "12 Bahrain 42 63 \n", "13 Bangladesh 0 0 \n", "14 Barbados 143 173 \n", "15 Belarus 142 373 \n", "16 Belgium 295 84 \n", "17 Belize 263 114 \n", "18 Benin 34 4 \n", "19 Bhutan 23 0 \n", "20 Bolivia 167 41 \n", "21 Bosnia-Herzegovina 76 173 \n", "22 Botswana 173 35 \n", "23 Brazil 245 145 \n", "24 Brunei 31 2 \n", "25 Bulgaria 231 252 \n", "26 Burkina Faso 25 7 \n", "27 Burundi 88 0 \n", "28 Cote d'Ivoire 37 1 \n", "29 Cabo Verde 144 56 \n", "30 Cambodia 57 65 \n", "31 Cameroon 147 1 \n", "32 Canada 240 122 \n", "33 Central African Republic 17 2 \n", "34 Chad 15 1 \n", "35 Chile 130 124 \n", "36 China 79 192 \n", "37 Colombia 159 76 \n", "38 Comoros 1 3 \n", "39 Congo 76 1 \n", "40 Cook Islands 0 254 \n", "41 Costa Rica 149 87 \n", "42 Croatia 230 87 \n", "43 Cuba 93 137 \n", "44 Cyprus 192 154 \n", "45 Czech Republic 361 170 \n", "46 North Korea 0 0 \n", "47 DR Congo 32 3 \n", "48 Denmark 224 81 \n", "49 Djibouti 15 44 \n", "50 Dominica 52 286 \n", "51 Dominican Republic 193 147 \n", "52 Ecuador 162 74 \n", "53 Egypt 6 4 \n", "54 El Salvador 52 69 \n", "55 Equatorial Guinea 92 0 \n", "56 Eritrea 18 0 \n", "57 Estonia 224 194 \n", "58 Ethiopia 20 3 \n", "59 Fiji 77 35 \n", "60 Finland 263 133 \n", "61 France 127 151 \n", "62 Gabon 347 98 \n", "63 Gambia 8 0 \n", "64 Georgia 52 100 \n", "65 Germany 346 117 \n", "66 Ghana 31 3 \n", "67 Greece 133 112 \n", "68 Grenada 199 438 \n", "69 Guatemala 53 69 \n", "70 Guinea 9 0 \n", "71 Guinea-Bissau 28 31 \n", "72 Guyana 93 302 \n", "73 Haiti 1 326 \n", "74 Honduras 69 98 \n", "75 Hungary 234 215 \n", "76 Iceland 233 61 \n", "77 India 9 114 \n", "78 Indonesia 5 1 \n", "79 Iran 0 0 \n", "80 Iraq 9 3 \n", "81 Ireland 313 118 \n", "82 Israel 63 69 \n", "83 Italy 85 42 \n", "84 Jamaica 82 97 \n", "85 Japan 77 202 \n", "86 Jordan 6 21 \n", "87 Kazakhstan 124 246 \n", "88 Kenya 58 22 \n", "89 Kiribati 21 34 \n", "90 Kuwait 0 0 \n", "91 Kyrgyzstan 31 97 \n", "92 Laos 62 0 \n", "93 Latvia 281 216 \n", "94 Lebanon 20 55 \n", "95 Lesotho 82 29 \n", "96 Liberia 19 152 \n", "97 Libya 0 0 \n", "98 Lithuania 343 244 \n", "99 Luxembourg 236 133 \n", "100 Madagascar 26 15 \n", "101 Malawi 8 11 \n", "102 Malaysia 13 4 \n", "103 Maldives 0 0 \n", "104 Mali 5 1 \n", "105 Malta 149 100 \n", "106 Marshall Islands 0 0 \n", "107 Mauritania 0 0 \n", "108 Mauritius 98 31 \n", "109 Mexico 238 68 \n", "110 Micronesia 62 50 \n", "111 Monaco 0 0 \n", "112 Mongolia 77 189 \n", "113 Montenegro 31 114 \n", "114 Morocco 12 6 \n", "115 Mozambique 47 18 \n", "116 Myanmar 5 1 \n", "117 Namibia 376 3 \n", "118 Nauru 49 0 \n", "119 Nepal 5 6 \n", "120 Netherlands 251 88 \n", "121 New Zealand 203 79 \n", "122 Nicaragua 78 118 \n", "123 Niger 3 2 \n", "124 Nigeria 42 5 \n", "125 Niue 188 200 \n", "126 Norway 169 71 \n", "127 Oman 22 16 \n", "128 Pakistan 0 0 \n", "129 Palau 306 63 \n", "130 Panama 285 104 \n", "131 Papua New Guinea 44 39 \n", "132 Paraguay 213 117 \n", "133 Peru 163 160 \n", "134 Philippines 71 186 \n", "135 Poland 343 215 \n", "136 Portugal 194 67 \n", "137 Qatar 1 42 \n", "138 South Korea 140 16 \n", "139 Moldova 109 226 \n", "140 Romania 297 122 \n", "141 Russian Federation 247 326 \n", "142 Rwanda 43 2 \n", "143 
St. Kitts & Nevis 194 205 \n", "144 St. Lucia 171 315 \n", "145 St. Vincent & the Grenadines 120 221 \n", "146 Samoa 105 18 \n", "147 San Marino 0 0 \n", "148 Sao Tome & Principe 56 38 \n", "149 Saudi Arabia 0 5 \n", "150 Senegal 9 1 \n", "151 Serbia 283 131 \n", "152 Seychelles 157 25 \n", "153 Sierra Leone 25 3 \n", "154 Singapore 60 12 \n", "155 Slovakia 196 293 \n", "156 Slovenia 270 51 \n", "157 Solomon Islands 56 11 \n", "158 Somalia 0 0 \n", "159 South Africa 225 76 \n", "160 Spain 284 157 \n", "161 Sri Lanka 16 104 \n", "162 Sudan 8 13 \n", "163 Suriname 128 178 \n", "164 Swaziland 90 2 \n", "165 Sweden 152 60 \n", "166 Switzerland 185 100 \n", "167 Syria 5 35 \n", "168 Tajikistan 2 15 \n", "169 Thailand 99 258 \n", "170 Macedonia 106 27 \n", "171 Timor-Leste 1 1 \n", "172 Togo 36 2 \n", "173 Tonga 36 21 \n", "174 Trinidad & Tobago 197 156 \n", "175 Tunisia 51 3 \n", "176 Turkey 51 22 \n", "177 Turkmenistan 19 71 \n", "178 Tuvalu 6 41 \n", "179 Uganda 45 9 \n", "180 Ukraine 206 237 \n", "181 United Arab Emirates 16 135 \n", "182 United Kingdom 219 126 \n", "183 Tanzania 36 6 \n", "184 USA 249 158 \n", "185 Uruguay 115 35 \n", "186 Uzbekistan 25 101 \n", "187 Vanuatu 21 18 \n", "188 Venezuela 333 100 \n", "189 Vietnam 111 2 \n", "190 Yemen 6 0 \n", "191 Zambia 32 19 \n", "192 Zimbabwe 64 18 \n", "\n", " wine_servings total_litres_of_pure_alcohol continent \n", "0 0 0.0 Asia \n", "1 54 4.9 Europe \n", "2 14 0.7 Africa \n", "3 312 12.4 Europe \n", "4 45 5.9 Africa \n", "5 45 4.9 North America \n", "6 221 8.3 South America \n", "7 11 3.8 Europe \n", "8 212 10.4 Oceania \n", "9 191 9.7 Europe \n", "10 5 1.3 Europe \n", "11 51 6.3 North America \n", "12 7 2.0 Asia \n", "13 0 0.0 Asia \n", "14 36 6.3 North America \n", "15 42 14.4 Europe \n", "16 212 10.5 Europe \n", "17 8 6.8 North America \n", "18 13 1.1 Africa \n", "19 0 0.4 Asia \n", "20 8 3.8 South America \n", "21 8 4.6 Europe \n", "22 35 5.4 Africa \n", "23 16 7.2 South America \n", "24 1 0.6 Asia \n", "25 94 10.3 Europe \n", "26 7 4.3 Africa \n", "27 0 6.3 Africa \n", "28 7 4.0 Africa \n", "29 16 4.0 Africa \n", "30 1 2.2 Asia \n", "31 4 5.8 Africa \n", "32 100 8.2 North America \n", "33 1 1.8 Africa \n", "34 1 0.4 Africa \n", "35 172 7.6 South America \n", "36 8 5.0 Asia \n", "37 3 4.2 South America \n", "38 1 0.1 Africa \n", "39 9 1.7 Africa \n", "40 74 5.9 Oceania \n", "41 11 4.4 North America \n", "42 254 10.2 Europe \n", "43 5 4.2 North America \n", "44 113 8.2 Europe \n", "45 134 11.8 Europe \n", "46 0 0.0 Asia \n", "47 1 2.3 Africa \n", "48 278 10.4 Europe \n", "49 3 1.1 Africa \n", "50 26 6.6 North America \n", "51 9 6.2 North America \n", "52 3 4.2 South America \n", "53 1 0.2 Africa \n", "54 2 2.2 North America \n", "55 233 5.8 Africa \n", "56 0 0.5 Africa \n", "57 59 9.5 Europe \n", "58 0 0.7 Africa \n", "59 1 2.0 Oceania \n", "60 97 10.0 Europe \n", "61 370 11.8 Europe \n", "62 59 8.9 Africa \n", "63 1 2.4 Africa \n", "64 149 5.4 Europe \n", "65 175 11.3 Europe \n", "66 10 1.8 Africa \n", "67 218 8.3 Europe \n", "68 28 11.9 North America \n", "69 2 2.2 North America \n", "70 2 0.2 Africa \n", "71 21 2.5 Africa \n", "72 1 7.1 South America \n", "73 1 5.9 North America \n", "74 2 3.0 North America \n", "75 185 11.3 Europe \n", "76 78 6.6 Europe \n", "77 0 2.2 Asia \n", "78 0 0.1 Asia \n", "79 0 0.0 Asia \n", "80 0 0.2 Asia \n", "81 165 11.4 Europe \n", "82 9 2.5 Asia \n", "83 237 6.5 Europe \n", "84 9 3.4 North America \n", "85 16 7.0 Asia \n", "86 1 0.5 Asia \n", "87 12 6.8 Asia \n", "88 2 1.8 Africa \n", "89 1 1.0 
Oceania \n", "90 0 0.0 Asia \n", "91 6 2.4 Asia \n", "92 123 6.2 Asia \n", "93 62 10.5 Europe \n", "94 31 1.9 Asia \n", "95 0 2.8 Africa \n", "96 2 3.1 Africa \n", "97 0 0.0 Africa \n", "98 56 12.9 Europe \n", "99 271 11.4 Europe \n", "100 4 0.8 Africa \n", "101 1 1.5 Africa \n", "102 0 0.3 Asia \n", "103 0 0.0 Asia \n", "104 1 0.6 Africa \n", "105 120 6.6 Europe \n", "106 0 0.0 Oceania \n", "107 0 0.0 Africa \n", "108 18 2.6 Africa \n", "109 5 5.5 North America \n", "110 18 2.3 Oceania \n", "111 0 0.0 Europe \n", "112 8 4.9 Asia \n", "113 128 4.9 Europe \n", "114 10 0.5 Africa \n", "115 5 1.3 Africa \n", "116 0 0.1 Asia \n", "117 1 6.8 Africa \n", "118 8 1.0 Oceania \n", "119 0 0.2 Asia \n", "120 190 9.4 Europe \n", "121 175 9.3 Oceania \n", "122 1 3.5 North America \n", "123 1 0.1 Africa \n", "124 2 9.1 Africa \n", "125 7 7.0 Oceania \n", "126 129 6.7 Europe \n", "127 1 0.7 Asia \n", "128 0 0.0 Asia \n", "129 23 6.9 Oceania \n", "130 18 7.2 North America \n", "131 1 1.5 Oceania \n", "132 74 7.3 South America \n", "133 21 6.1 South America \n", "134 1 4.6 Asia \n", "135 56 10.9 Europe \n", "136 339 11.0 Europe \n", "137 7 0.9 Asia \n", "138 9 9.8 Asia \n", "139 18 6.3 Europe \n", "140 167 10.4 Europe \n", "141 73 11.5 Asia \n", "142 0 6.8 Africa \n", "143 32 7.7 North America \n", "144 71 10.1 North America \n", "145 11 6.3 North America \n", "146 24 2.6 Oceania \n", "147 0 0.0 Europe \n", "148 140 4.2 Africa \n", "149 0 0.1 Asia \n", "150 7 0.3 Africa \n", "151 127 9.6 Europe \n", "152 51 4.1 Africa \n", "153 2 6.7 Africa \n", "154 11 1.5 Asia \n", "155 116 11.4 Europe \n", "156 276 10.6 Europe \n", "157 1 1.2 Oceania \n", "158 0 0.0 Africa \n", "159 81 8.2 Africa \n", "160 112 10.0 Europe \n", "161 0 2.2 Asia \n", "162 0 1.7 Africa \n", "163 7 5.6 South America \n", "164 2 4.7 Africa \n", "165 186 7.2 Europe \n", "166 280 10.2 Europe \n", "167 16 1.0 Asia \n", "168 0 0.3 Asia \n", "169 1 6.4 Asia \n", "170 86 3.9 Europe \n", "171 4 0.1 Asia \n", "172 19 1.3 Africa \n", "173 5 1.1 Oceania \n", "174 7 6.4 North America \n", "175 20 1.3 Africa \n", "176 7 1.4 Asia \n", "177 32 2.2 Asia \n", "178 9 1.0 Oceania \n", "179 0 8.3 Africa \n", "180 45 8.9 Europe \n", "181 5 2.8 Asia \n", "182 195 10.4 Europe \n", "183 1 5.7 Africa \n", "184 84 8.7 North America \n", "185 220 6.6 South America \n", "186 8 2.4 Asia \n", "187 11 0.9 Oceania \n", "188 3 7.7 South America \n", "189 1 2.0 Asia \n", "190 0 0.1 Asia \n", "191 4 2.5 Africa \n", "192 4 4.7 Africa " ] }, "execution_count": 285, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# overwrite the current setting so that all rows will be displayed\n", "pd.set_option('display.max_rows', None)\n", "drinks" ] }, { "cell_type": "code", "execution_count": 286, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# reset the 'max_rows' option to its default\n", "pd.reset_option('display.max_rows')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`set_option`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.set_option.html) and [**`reset_option`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.reset_option.html)" ] }, { "cell_type": "code", "execution_count": 287, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 287, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the 'max_columns' option is similar to 'max_rows'\n", "pd.get_option('display.max_columns')" ] }, { "cell_type": "code", 
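"execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# minimal sketch (added example, not executed above): 'max_columns' can be overwritten\n", "# and reset with the same set_option/reset_option calls used for 'max_rows'\n", "pd.set_option('display.max_columns', None)\n", "pd.reset_option('display.max_columns')" ] }, { "cell_type": "code",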
"execution_count": 288, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 288, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read the training dataset from Kaggle's Titanic competition into a DataFrame\n", "train = pd.read_csv('http://bit.ly/kaggletrain')\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 289, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "50" ] }, "execution_count": 289, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# an ellipsis is displayed in the 'Name' cell of row 1 because of the 'max_colwidth' option\n", "pd.get_option('display.max_colwidth')" ] }, { "cell_type": "code", "execution_count": 290, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 290, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# overwrite the current setting so that more characters will be displayed\n", "pd.set_option('display.max_colwidth', 1000)\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 291, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.25NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female38.010PC 1759971.28C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.92NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.10C123S
4503Allen, Mr. William Henrymale35.0003734508.05NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.25 NaN S \n", "1 0 PC 17599 71.28 C85 C \n", "2 0 STON/O2. 3101282 7.92 NaN S \n", "3 0 113803 53.10 C123 S \n", "4 0 373450 8.05 NaN S " ] }, "execution_count": 291, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# overwrite the 'precision' setting to display 2 digits after the decimal point of 'Fare'\n", "pd.set_option('display.precision', 2)\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 292, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinentxy
0Afghanistan0000.0Asia00.0
1Albania89132544.9Europe540004900.0
2Algeria250140.7Africa14000700.0
3Andorra24513831212.4Europe31200012400.0
4Angola21757455.9Africa450005900.0
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent x y \n", "0 0.0 Asia 0 0.0 \n", "1 4.9 Europe 54000 4900.0 \n", "2 0.7 Africa 14000 700.0 \n", "3 12.4 Europe 312000 12400.0 \n", "4 5.9 Africa 45000 5900.0 " ] }, "execution_count": 292, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# add two meaningless columns to the drinks DataFrame\n", "drinks['x'] = drinks.wine_servings * 1000\n", "drinks['y'] = drinks.total_litres_of_pure_alcohol * 1000\n", "drinks.head()" ] }, { "cell_type": "code", "execution_count": 293, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinentxy
0Afghanistan0000.0Asia00.0
1Albania89132544.9Europe540004,900.0
2Algeria250140.7Africa14000700.0
3Andorra24513831212.4Europe31200012,400.0
4Angola21757455.9Africa450005,900.0
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent x y \n", "0 0.0 Asia 0 0.0 \n", "1 4.9 Europe 54000 4,900.0 \n", "2 0.7 Africa 14000 700.0 \n", "3 12.4 Europe 312000 12,400.0 \n", "4 5.9 Africa 45000 5,900.0 " ] }, "execution_count": 293, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use a Python format string to specify a comma as the thousands separator\n", "pd.set_option('display.float_format', '{:,}'.format)\n", "drinks.head()" ] }, { "cell_type": "code", "execution_count": 294, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "country object\n", "beer_servings int64\n", "spirit_servings int64\n", "wine_servings int64\n", "total_litres_of_pure_alcohol float64\n", "continent object\n", "x int64\n", "y float64\n", "dtype: object" ] }, "execution_count": 294, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'y' was affected (but not 'x') because the 'float_format' option only affects floats (not ints)\n", "drinks.dtypes" ] }, { "cell_type": "code", "execution_count": 295, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "display.chop_threshold : float or None\n", " if set to a float value, all float values smaller then the given threshold\n", " will be displayed as exactly 0 by repr and friends.\n", " [default: None] [currently: None]\n", "\n", "display.colheader_justify : 'left'/'right'\n", " Controls the justification of column headers. used by DataFrameFormatter.\n", " [default: right] [currently: right]\n", "\n", "display.column_space No description available.\n", " [default: 12] [currently: 12]\n", "\n", "display.date_dayfirst : boolean\n", " When True, prints and parses dates with the day first, eg 20/01/2005\n", " [default: False] [currently: False]\n", "\n", "display.date_yearfirst : boolean\n", " When True, prints and parses dates with the year first, eg 2005/01/20\n", " [default: False] [currently: False]\n", "\n", "display.encoding : str/unicode\n", " Defaults to the detected encoding of the console.\n", " Specifies the encoding to be used for strings returned by to_string,\n", " these are generally strings meant to be displayed on the console.\n", " [default: UTF-8] [currently: UTF-8]\n", "\n", "display.expand_frame_repr : boolean\n", " Whether to print out the full DataFrame repr for wide DataFrames across\n", " multiple lines, `max_columns` is still respected, but the output will\n", " wrap-around across multiple \"pages\" if its width exceeds `display.width`.\n", " [default: True] [currently: True]\n", "\n", "display.float_format : callable\n", " The callable should accept a floating point number and return\n", " a string with the desired format of the number. 
This is used\n", " in some places like SeriesFormatter.\n", " See formats.format.EngFormatter for an example.\n", " [default: None] [currently: ]\n", "\n", "display.height : int\n", " Deprecated.\n", " [default: 60] [currently: 60]\n", " (Deprecated, use `display.max_rows` instead.)\n", "\n", "display.large_repr : 'truncate'/'info'\n", " For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can\n", " show a truncated table (the default from 0.13), or switch to the view from\n", " df.info() (the behaviour in earlier versions of pandas).\n", " [default: truncate] [currently: truncate]\n", "\n", "display.latex.escape : bool\n", " This specifies if the to_latex method of a Dataframe uses escapes special\n", " characters.\n", " method. Valid values: False,True\n", " [default: True] [currently: True]\n", "\n", "display.latex.longtable :bool\n", " This specifies if the to_latex method of a Dataframe uses the longtable\n", " format.\n", " method. Valid values: False,True\n", " [default: False] [currently: False]\n", "\n", "display.latex.repr : boolean\n", " Whether to produce a latex DataFrame representation for jupyter\n", " environments that support it.\n", " (default: False)\n", " [default: False] [currently: False]\n", "\n", "display.line_width : int\n", " Deprecated.\n", " [default: 80] [currently: 80]\n", " (Deprecated, use `display.width` instead.)\n", "\n", "display.max_categories : int\n", " This sets the maximum number of categories pandas should output when\n", " printing out a `Categorical` or a Series of dtype \"category\".\n", " [default: 8] [currently: 8]\n", "\n", "display.max_columns : int\n", " If max_cols is exceeded, switch to truncate view. Depending on\n", " `large_repr`, objects are either centrally truncated or printed as\n", " a summary view. 'None' value means unlimited.\n", "\n", " In case python/IPython is running in a terminal and `large_repr`\n", " equals 'truncate' this can be set to 0 and pandas will auto-detect\n", " the width of the terminal and print a truncated object which fits\n", " the screen width. The IPython notebook, IPython qtconsole, or IDLE\n", " do not run in a terminal and hence it is not possible to do\n", " correct auto-detection.\n", " [default: 20] [currently: 20]\n", "\n", "display.max_colwidth : int\n", " The maximum width in characters of a column in the repr of\n", " a pandas data structure. When the column overflows, a \"...\"\n", " placeholder is embedded in the output.\n", " [default: 50] [currently: 1000]\n", "\n", "display.max_info_columns : int\n", " max_info_columns is used in DataFrame.info method to decide if\n", " per column information will be printed.\n", " [default: 100] [currently: 100]\n", "\n", "display.max_info_rows : int or None\n", " df.info() will usually show null-counts for each column.\n", " For large frames this can be quite slow. max_info_rows and max_info_cols\n", " limit this null check only to frames with smaller dimensions than\n", " specified.\n", " [default: 1690785] [currently: 1690785]\n", "\n", "display.max_rows : int\n", " If max_rows is exceeded, switch to truncate view. Depending on\n", " `large_repr`, objects are either centrally truncated or printed as\n", " a summary view. 'None' value means unlimited.\n", "\n", " In case python/IPython is running in a terminal and `large_repr`\n", " equals 'truncate' this can be set to 0 and pandas will auto-detect\n", " the height of the terminal and print a truncated object which fits\n", " the screen height. 
The IPython notebook, IPython qtconsole, or\n", " IDLE do not run in a terminal and hence it is not possible to do\n", " correct auto-detection.\n", " [default: 60] [currently: 60]\n", "\n", "display.max_seq_items : int or None\n", " when pretty-printing a long sequence, no more then `max_seq_items`\n", " will be printed. If items are omitted, they will be denoted by the\n", " addition of \"...\" to the resulting string.\n", "\n", " If set to None, the number of items to be printed is unlimited.\n", " [default: 100] [currently: 100]\n", "\n", "display.memory_usage : bool, string or None\n", " This specifies if the memory usage of a DataFrame should be displayed when\n", " df.info() is called. Valid values True,False,'deep'\n", " [default: True] [currently: True]\n", "\n", "display.mpl_style : bool\n", " Setting this to 'default' will modify the rcParams used by matplotlib\n", " to give plots a more pleasing visual style by default.\n", " Setting this to None/False restores the values to their initial value.\n", " [default: None] [currently: None]\n", "\n", "display.multi_sparse : boolean\n", " \"sparsify\" MultiIndex display (don't display repeated\n", " elements in outer levels within groups)\n", " [default: True] [currently: True]\n", "\n", "display.notebook_repr_html : boolean\n", " When True, IPython notebook will use html representation for\n", " pandas objects (if it is available).\n", " [default: True] [currently: True]\n", "\n", "display.pprint_nest_depth : int\n", " Controls the number of nested levels to process when pretty-printing\n", " [default: 3] [currently: 3]\n", "\n", "display.precision : int\n", " Floating point output precision (number of significant digits). This is\n", " only a suggestion\n", " [default: 6] [currently: 2]\n", "\n", "display.show_dimensions : boolean or 'truncate'\n", " Whether to print out dimensions at the end of DataFrame repr.\n", " If 'truncate' is specified, only print out the dimensions if the\n", " frame is truncated (e.g. not display all rows and/or columns)\n", " [default: truncate] [currently: truncate]\n", "\n", "display.unicode.ambiguous_as_wide : boolean\n", " Whether to use the Unicode East Asian Width to calculate the display text\n", " width.\n", " Enabling this may affect to the performance (default: False)\n", " [default: False] [currently: False]\n", "\n", "display.unicode.east_asian_width : boolean\n", " Whether to use the Unicode East Asian Width to calculate the display text\n", " width.\n", " Enabling this may affect to the performance (default: False)\n", " [default: False] [currently: False]\n", "\n", "display.width : int\n", " Width of the display in characters. In case python/IPython is running in\n", " a terminal this can be set to None and pandas will correctly auto-detect\n", " the width.\n", " Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a\n", " terminal and hence it is not possible to correctly detect the width.\n", " [default: 80] [currently: 80]\n", "\n", "io.excel.xls.writer : string\n", " The default Excel writer engine for 'xls' files. Available options:\n", " 'xlwt' (the default).\n", " [default: xlwt] [currently: xlwt]\n", "\n", "io.excel.xlsm.writer : string\n", " The default Excel writer engine for 'xlsm' files. Available options:\n", " 'openpyxl' (the default).\n", " [default: openpyxl] [currently: openpyxl]\n", "\n", "io.excel.xlsx.writer : string\n", " The default Excel writer engine for 'xlsx' files. 
Available options:\n", " 'xlsxwriter' (the default), 'openpyxl'.\n", " [default: xlsxwriter] [currently: xlsxwriter]\n", "\n", "io.hdf.default_format : format\n", " default format writing format, if None, then\n", " put will default to 'fixed' and append will default to 'table'\n", " [default: None] [currently: None]\n", "\n", "io.hdf.dropna_table : boolean\n", " drop ALL nan rows when appending to a table\n", " [default: False] [currently: False]\n", "\n", "mode.chained_assignment : string\n", " Raise an exception, warn, or no action if trying to use chained assignment,\n", " The default is warn\n", " [default: warn] [currently: warn]\n", "\n", "mode.sim_interactive : boolean\n", " Whether to simulate interactive mode for purposes of testing\n", " [default: False] [currently: False]\n", "\n", "mode.use_inf_as_null : boolean\n", " True means treat None, NaN, INF, -INF as null (old way),\n", " False means None and NaN are null, but INF, -INF are not null\n", " (new way).\n", " [default: False] [currently: False]\n", "\n", "\n" ] } ], "source": [ "# view the option descriptions (including the default and current values)\n", "pd.describe_option()" ] }, { "cell_type": "code", "execution_count": 296, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "display.max_info_rows : int or None\n", " df.info() will usually show null-counts for each column.\n", " For large frames this can be quite slow. max_info_rows and max_info_cols\n", " limit this null check only to frames with smaller dimensions than\n", " specified.\n", " [default: 1690785] [currently: 1690785]\n", "\n", "display.max_rows : int\n", " If max_rows is exceeded, switch to truncate view. Depending on\n", " `large_repr`, objects are either centrally truncated or printed as\n", " a summary view. 'None' value means unlimited.\n", "\n", " In case python/IPython is running in a terminal and `large_repr`\n", " equals 'truncate' this can be set to 0 and pandas will auto-detect\n", " the height of the terminal and print a truncated object which fits\n", " the screen height. The IPython notebook, IPython qtconsole, or\n", " IDLE do not run in a terminal and hence it is not possible to do\n", " correct auto-detection.\n", " [default: 60] [currently: 60]\n", "\n", "\n" ] } ], "source": [ "# search for specific options by name\n", "pd.describe_option('rows')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`describe_option`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.describe_option.html)" ] }, { "cell_type": "code", "execution_count": 297, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "height has been deprecated.\n", "\n", "line_width has been deprecated, use display.width instead (currently both are\n", "identical)\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\Kevin\\Anaconda\\lib\\site-packages\\ipykernel\\__main__.py:2: FutureWarning: \n", "mpl_style had been deprecated and will be removed in a future version.\n", "Use `matplotlib.pyplot.style.use` instead.\n", "\n", " from ipykernel import kernelapp as app\n" ] } ], "source": [ "# reset all of the options to their default values\n", "pd.reset_option('all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Options and Settings](http://pandas.pydata.org/pandas-docs/stable/options.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 29. 
How do I create a pandas DataFrame from another object? ([video](https://www.youtube.com/watch?v=-Ov1N1_FbP8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=29))" ] }, { "cell_type": "code", "execution_count": 298, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorid
0red100
1blue101
2red102
\n", "
" ], "text/plain": [ " color id\n", "0 red 100\n", "1 blue 101\n", "2 red 102" ] }, "execution_count": 298, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a DataFrame from a dictionary (keys become column names, values become data)\n", "pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']})" ] }, { "cell_type": "code", "execution_count": 299, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcolor
a100red
b101blue
c102red
\n", "
" ], "text/plain": [ " id color\n", "a 100 red\n", "b 101 blue\n", "c 102 red" ] }, "execution_count": 299, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# optionally specify the order of columns and define the index\n", "df = pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']}, columns=['id', 'color'], index=['a', 'b', 'c'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`DataFrame`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)" ] }, { "cell_type": "code", "execution_count": 300, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcolor
0100red
1101blue
2102red
\n", "
" ], "text/plain": [ " id color\n", "0 100 red\n", "1 101 blue\n", "2 102 red" ] }, "execution_count": 300, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a DataFrame from a list of lists (each inner list becomes a row)\n", "pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']], columns=['id', 'color'])" ] }, { "cell_type": "code", "execution_count": 301, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0.9325265 , 0.48261452],\n", " [ 0.03239681, 0.94908844],\n", " [ 0.17615564, 0.80045853],\n", " [ 0.36113859, 0.95982213]])" ] }, "execution_count": 301, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a NumPy array (with shape 4 by 2) and fill it with random numbers between 0 and 1\n", "import numpy as np\n", "arr = np.random.rand(4, 2)\n", "arr" ] }, { "cell_type": "code", "execution_count": 302, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
00.9325270.482615
10.0323970.949088
20.1761560.800459
30.3611390.959822
\n", "
" ], "text/plain": [ " one two\n", "0 0.932527 0.482615\n", "1 0.032397 0.949088\n", "2 0.176156 0.800459\n", "3 0.361139 0.959822" ] }, "execution_count": 302, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a DataFrame from the NumPy array\n", "pd.DataFrame(arr, columns=['one', 'two'])" ] }, { "cell_type": "code", "execution_count": 303, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
studenttest
010078
110183
210279
310385
410473
510591
610681
710786
810888
910996
\n", "
" ], "text/plain": [ " student test\n", "0 100 78\n", "1 101 83\n", "2 102 79\n", "3 103 85\n", "4 104 73\n", "5 105 91\n", "6 106 81\n", "7 107 86\n", "8 108 88\n", "9 109 96" ] }, "execution_count": 303, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a DataFrame of student IDs (100 through 109) and test scores (random integers between 60 and 100)\n", "pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`np.arange`**](http://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html) and [**`np.random`**](http://docs.scipy.org/doc/numpy/reference/routines.random.html)" ] }, { "cell_type": "code", "execution_count": 304, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
test
student
10078
10171
10290
10363
10483
10592
10697
10767
10871
10979
\n", "
" ], "text/plain": [ " test\n", "student \n", "100 78\n", "101 71\n", "102 90\n", "103 63\n", "104 83\n", "105 92\n", "106 97\n", "107 67\n", "108 71\n", "109 79" ] }, "execution_count": 304, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'set_index' can be chained with the DataFrame constructor to select an index\n", "pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)}).set_index('student')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`set_index`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html)" ] }, { "cell_type": "code", "execution_count": 305, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "c round\n", "b square\n", "Name: shape, dtype: object" ] }, "execution_count": 305, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a new Series using the Series constructor\n", "s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape')\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documentation for [**`Series`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)" ] }, { "cell_type": "code", "execution_count": 306, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcolorshape
a100redNaN
b101bluesquare
c102redround
\n", "
" ], "text/plain": [ " id color shape\n", "a 100 red NaN\n", "b 101 blue square\n", "c 102 red round" ] }, "execution_count": 306, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# concatenate the DataFrame and the Series (use axis=1 to concatenate columns)\n", "pd.concat([df, s], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes:**\n", "\n", "- The Series name became the column name in the DataFrame.\n", "- The Series data was aligned to the DataFrame by its index.\n", "- The 'shape' for row 'a' was marked as a missing value (NaN) because that index was not present in the Series.\n", "\n", "Documentation for [**`concat`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html)\n", "\n", "[Back to top]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 30. How do I apply a function to a pandas Series or DataFrame? ([video](https://www.youtube.com/watch?v=P_q0tkYqvSk&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=30))" ] }, { "cell_type": "code", "execution_count": 307, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 307, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read the training dataset from Kaggle's Titanic competition into a DataFrame\n", "train = pd.read_csv('http://bit.ly/kaggletrain')\n", "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Map the existing values of a Series to a different set of values\n", "\n", "**Method:** [**`map`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) (Series method)" ] }, { "cell_type": "code", "execution_count": 308, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SexSex_num
0male1
1female0
2female0
3female0
4male1
\n", "
" ], "text/plain": [ " Sex Sex_num\n", "0 male 1\n", "1 female 0\n", "2 female 0\n", "3 female 0\n", "4 male 1" ] }, "execution_count": 308, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# map 'female' to 0 and 'male' to 1\n", "train['Sex_num'] = train.Sex.map({'female':0, 'male':1})\n", "train.loc[0:4, ['Sex', 'Sex_num']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Apply a function to each element in a Series\n", "\n", "**Method:** [**`apply`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html) (Series method)\n", "\n", "**Note:** **`map`** can be substituted for **`apply`** in many cases, but **`apply`** is more flexible and thus is recommended" ] }, { "cell_type": "code", "execution_count": 309, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameName_length
0Braund, Mr. Owen Harris23
1Cumings, Mrs. John Bradley (Florence Briggs Th...51
2Heikkinen, Miss. Laina22
3Futrelle, Mrs. Jacques Heath (Lily May Peel)44
4Allen, Mr. William Henry24
\n", "
" ], "text/plain": [ " Name Name_length\n", "0 Braund, Mr. Owen Harris 23\n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... 51\n", "2 Heikkinen, Miss. Laina 22\n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 44\n", "4 Allen, Mr. William Henry 24" ] }, "execution_count": 309, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the length of each string in the 'Name' Series\n", "train['Name_length'] = train.Name.apply(len)\n", "train.loc[0:4, ['Name', 'Name_length']]" ] }, { "cell_type": "code", "execution_count": 310, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FareFare_ceil
07.25008.0
171.283372.0
27.92508.0
353.100054.0
48.05009.0
\n", "
" ], "text/plain": [ " Fare Fare_ceil\n", "0 7.2500 8.0\n", "1 71.2833 72.0\n", "2 7.9250 8.0\n", "3 53.1000 54.0\n", "4 8.0500 9.0" ] }, "execution_count": 310, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# round up each element in the 'Fare' Series to the next integer\n", "import numpy as np\n", "train['Fare_ceil'] = train.Fare.apply(np.ceil)\n", "train.loc[0:4, ['Fare', 'Fare_ceil']]" ] }, { "cell_type": "code", "execution_count": 311, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Braund, Mr. Owen Harris\n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th...\n", "2 Heikkinen, Miss. Laina\n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel)\n", "4 Allen, Mr. William Henry\n", "Name: Name, dtype: object" ] }, "execution_count": 311, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we want to extract the last name of each person\n", "train.Name.head()" ] }, { "cell_type": "code", "execution_count": 312, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 [Braund, Mr. Owen Harris]\n", "1 [Cumings, Mrs. John Bradley (Florence Briggs ...\n", "2 [Heikkinen, Miss. Laina]\n", "3 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]\n", "4 [Allen, Mr. William Henry]\n", "Name: Name, dtype: object" ] }, "execution_count": 312, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use a string method to split the 'Name' Series at commas (returns a Series of lists)\n", "train.Name.str.split(',').head()" ] }, { "cell_type": "code", "execution_count": 313, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# define a function that returns an element from a list based on position\n", "def get_element(my_list, position):\n", " return my_list[position]" ] }, { "cell_type": "code", "execution_count": 314, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Braund\n", "1 Cumings\n", "2 Heikkinen\n", "3 Futrelle\n", "4 Allen\n", "Name: Name, dtype: object" ] }, "execution_count": 314, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply the 'get_element' function and pass 'position' as a keyword argument\n", "train.Name.str.split(',').apply(get_element, position=0).head()" ] }, { "cell_type": "code", "execution_count": 315, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Braund\n", "1 Cumings\n", "2 Heikkinen\n", "3 Futrelle\n", "4 Allen\n", "Name: Name, dtype: object" ] }, "execution_count": 315, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# alternatively, use a lambda function\n", "train.Name.str.split(',').apply(lambda x: x[0]).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Apply a function along either axis of a DataFrame\n", "\n", "**Method:** [**`apply`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) (DataFrame method)" ] }, { "cell_type": "code", "execution_count": 316, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0Afghanistan0000.0Asia
1Albania89132544.9Europe
2Algeria250140.7Africa
3Andorra24513831212.4Europe
4Angola21757455.9Africa
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "2 0.7 Africa \n", "3 12.4 Europe \n", "4 5.9 Africa " ] }, "execution_count": 316, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read a dataset of alcohol consumption into a DataFrame\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n", "drinks.head()" ] }, { "cell_type": "code", "execution_count": 317, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
beer_servingsspirit_servingswine_servings
0000
18913254
225014
3245138312
42175745
\n", "
" ], "text/plain": [ " beer_servings spirit_servings wine_servings\n", "0 0 0 0\n", "1 89 132 54\n", "2 25 0 14\n", "3 245 138 312\n", "4 217 57 45" ] }, "execution_count": 317, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select a subset of the DataFrame to work with\n", "drinks.loc[:, 'beer_servings':'wine_servings'].head()" ] }, { "cell_type": "code", "execution_count": 318, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "beer_servings 376\n", "spirit_servings 438\n", "wine_servings 370\n", "dtype: int64" ] }, "execution_count": 318, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply the 'max' function along axis 0 to calculate the maximum value in each column\n", "drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=0)" ] }, { "cell_type": "code", "execution_count": 319, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 132\n", "2 25\n", "3 312\n", "4 217\n", "dtype: int64" ] }, "execution_count": 319, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply the 'max' function along axis 1 to calculate the maximum value in each row\n", "drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=1).head()" ] }, { "cell_type": "code", "execution_count": 320, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 beer_servings\n", "1 spirit_servings\n", "2 beer_servings\n", "3 wine_servings\n", "4 beer_servings\n", "dtype: object" ] }, "execution_count": 320, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use 'np.argmax' to calculate which column has the maximum value for each row\n", "drinks.loc[:, 'beer_servings':'wine_servings'].apply(np.argmax, axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Apply a function to every element in a DataFrame\n", "\n", "**Method:** [**`applymap`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html) (DataFrame method)" ] }, { "cell_type": "code", "execution_count": 321, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
beer_servingsspirit_servingswine_servings
00.00.00.0
189.0132.054.0
225.00.014.0
3245.0138.0312.0
4217.057.045.0
\n", "
" ], "text/plain": [ " beer_servings spirit_servings wine_servings\n", "0 0.0 0.0 0.0\n", "1 89.0 132.0 54.0\n", "2 25.0 0.0 14.0\n", "3 245.0 138.0 312.0\n", "4 217.0 57.0 45.0" ] }, "execution_count": 321, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert every DataFrame element into a float\n", "drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float).head()" ] }, { "cell_type": "code", "execution_count": 322, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0Afghanistan0.00.00.00.0Asia
1Albania89.0132.054.04.9Europe
2Algeria25.00.014.00.7Africa
3Andorra245.0138.0312.012.4Europe
4Angola217.057.045.05.9Africa
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0.0 0.0 0.0 \n", "1 Albania 89.0 132.0 54.0 \n", "2 Algeria 25.0 0.0 14.0 \n", "3 Andorra 245.0 138.0 312.0 \n", "4 Angola 217.0 57.0 45.0 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "2 0.7 Africa \n", "3 12.4 Europe \n", "4 5.9 Africa " ] }, "execution_count": 322, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# overwrite the existing DataFrame columns\n", "drinks.loc[:, 'beer_servings':'wine_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)\n", "drinks.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to top]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }