{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 05 -- Understanding Indexes\n", "## Topics Covered\n", "\n", " Indicies \n", "\n", " .iloc indexer \n", "\n", "Setting and Reseting Indicies\n", "\n", " .loc indexer \n", "\n", " Mixing .loc indexer with Boolean Operations\n", "\n", " Altering DataFrame values using the .loc indexer \n", "\n", "Conditionally Apply Values Based on Another Column Value\n", "\n", " .ix indexer \n", "\n", " Indexing Issues \n", "\n", " Resources \n", "\n", "SAS users tend to think of indexing SAS data sets as a means to improve query performance. Another use case for using SAS indices is in performing non-sequential reads for table look-ups.\n", "\n", "Indexing for DataFrames is used to provide direct access to data. Many analytical techniques take advantage of indexes. Indexing is also used as key values to selecting and subseting data. \n", "\n", "It took me a bit of time to understand how indexing for Series and DataFrames actually work. \n", "\n", "Many of the working examples I found, while mostly useful, used synthetic data which is typically ordered neatly. Here we examine issues like overlapping date ranges or index values in non-alphabetical or in non-sequential order and so on.\n", "\n", "That is why you will find some errors in the examples below. By examining the pits I have fallen into, hopefully you can avoid them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from numpy.random import randn\n", "from pandas import Series, DataFrame, Index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the creation of the DataFrame df in the cell below. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3.], \n", " ['b', 'warm', 'medium', 4, 5, 7, 9],\n", " ['c', 'hot', 'fast', 9, 4, np.nan, 6],\n", " ['d', 'cool', None, np.nan, np.nan, 17, 89],\n", " ['e', 'cool', 'medium', 16, 44, 21, 13],\n", " ['f', 'cold', 'slow', np.nan, 29, 33, 17]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the 'df' DataFrame created above we did not specify a row index or column names resulting in the RangeIndex object used for row labels. Default column labels are created as well using another RangeIndex object." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456
0acoldslowNaN2.06.03.0
1bwarmmedium4.05.07.09.0
2chotfast9.04.0NaN6.0
3dcoolNoneNaNNaN17.089.0
4ecoolmedium16.044.021.013.0
5fcoldslowNaN29.033.017.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6\n", "0 a cold slow NaN 2.0 6.0 3.0\n", "1 b warm medium 4.0 5.0 7.0 9.0\n", "2 c hot fast 9.0 4.0 NaN 6.0\n", "3 d cool None NaN NaN 17.0 89.0\n", "4 e cool medium 16.0 44.0 21.0 13.0\n", "5 f cold slow NaN 29.0 33.0 17.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indices\n", "\n", "The .index attribute returns the DataFrame's index structure. We did not explicitly set an index. As a result a default index object is created. The RangeIndex object has a start position at 0 and the end position set to len(df) - 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return the row index." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=6, step=1)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return the column labels. Since no labels were specificed, a RangeIndex object is used to identify columns." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=7, step=1)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For observations SAS uses the automatic variable \\_N\\_ in the Data Step and FIRSTOBS and OBS in PROC step for its row indexing. \n", "\n", "SAS also has a similar construct (SAS variable list) allowing the use of column name 'ranges' (col1-colN) described here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SAS example below creates a data set with same data used to create the DataFrame df in cell #2 above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````\n", " /******************************************************/\n", " /* c05_create_analog_df_dataframe.sas */\n", " /******************************************************/\n", " 11 data df;\n", " 12 infile cards dlm=',';\n", " 13 \n", " 14 input id $\n", " 15 col1 $\n", " 16 col2 $\n", " 17 col3\n", " 18 col4\n", " 19 col5\n", " 20 col6 ;\n", " 21 \n", " 22 datalines;\n", " 23 a, cold, slow, ., 2, 6, 3\n", " 24 b, warm, medium, 4, 5, 7, 9\n", " 25 c, hot, fast, 9, 4, ., 6\n", " 26 d, cool, , ., ., 17, 89\n", " 27 e, cool, medium, 16, 44, 21, 13\n", " 28 f, cold, slow, . ,29, 33, 17\n", " 29 ;;;;\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Data Step with the SET options NOBS= is an example of implicit indexing used by SAS. The END= parameter on the SET statement is initialized to 0 and set to one when the last observation is read. The automatic variable \\_N\\_ can be used as the observation (row) index." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````\n", " /******************************************************/\n", " /* c05_sas_row_index1.sas */\n", " /******************************************************/\n", " 31 data _null_;\n", " 32 set df nobs=obs end=end_of_file;\n", " 33 \n", " 34 put _n_ = ;\n", " 35 \n", " 36 if end_of_file then\n", " 37 put 'Data set contains: ' obs ' observations' /;\n", "\n", " _N_=1\n", " _N_=2\n", " _N_=3\n", " _N_=4\n", " _N_=5\n", " _N_=6\n", " Data set contains: 6 observations\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the DataFrame indices we can select specific rows and columns. DataFrames provide indexers to accomplish these tasks. They are:\n", "\n", " 1. .iloc() method which is mainly an integer-based method\n", " 2. .loc() method used to select ranges by labels (either column or row)\n", " 3. .ix() method which supports a combination of the loc() and iloc() methods\n", " \n", "We also illsutrate altering values in a DataFrame using the .loc(). This is equivalent to the SAS update method." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## .iloc Indexer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .iloc indexer uses an integer-based method for locating row and column positions. It is a very robust indexer, however, it is limited by the fact that humans are better at remembering labels than numbers. This is analogous to locating an observations by \\_n\\_ or locating a SAS variable using 'colN' from a variable list. \n", "\n", "The syntax for the .iloc indexer is:\n", "\n", " df.iloc[row selection, column selection]\n", " \n", "A comma (,) is used to separate the request of rows for the request for columns. A colon (:) is used to request a range of cells.\n", "\n", "The absence of a either a column or row selection is an implicit request for all columns or rows, respectively. \n", "\n", "Details for the .iloc indexer are located here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ".iloc[0] is useful since it returns the first scalar from a Series of the first column from a DataFrame." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 cold\n", "2 slow\n", "3 NaN\n", "4 2\n", "5 6\n", "6 3\n", "Name: 0, dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0] " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The point= option on the SET statement behaves similarly to return the first row in the data set. Note the SET statement inside a DO loop and the STOP statement. STOP is needed because the POINT= option indicates a non-sequnetial access pattern and thus the end of data set indicator is not available." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````\n", " /******************************************************/\n", " /* c05_sas_row_index2.sas */\n", " /******************************************************/\n", " 52 data _null_;\n", " 53 \n", " 54 do iloc = 1 to 1;\n", " 55 set df point=iloc;\n", " 56 \n", " 57 put _all_ ;\n", " 58 end;\n", " 59 stop;\n", "\n", " _N_=1 _ERROR_=0 iloc=1 id=a col1=cold col2=slow col3=. col4=2 col5=6 col6=3\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example below, you might expect three rows returned, rather than two. The range request for .iloc[] includes the start of the range and does not include the last item in the range value. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456
2chotfast9.04.0NaN6.0
3dcoolNoneNaNNaN17.089.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6\n", "2 c hot fast 9.0 4.0 NaN 6.0\n", "3 d cool None NaN NaN 17.0 89.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SAS analog example for cell #6 is below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````\n", " /******************************************************/\n", " /* c05_sas_row_index3.sas */\n", " /******************************************************/\n", " 94 data _null_ ;\n", " 95 \n", " 96 do iloc = 3 to 4;\n", " 97 set df point=iloc;\n", " 98 put _all_ ;\n", " 99 end;\n", " 100 stop;\n", "\n", " _N_=1 _ERROR_=0 iloc=3 id=c col1=hot col2=fast col3=9 col4=4 col5=. col6=6\n", " _N_=1 _ERROR_=0 iloc=4 id=d col1=cool col2= col3=. col4=. col5=17 col6=89\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to the indexer for string slicing, the index position \n", "\n", " iloc[0]\n", " \n", "returns the first row in a DataFrame and \n", "\n", " iloc[-1]\n", " \n", "returns the last row in the DataFrame. This is analogous to the END= option for the SET statement (assuming a sequential access pattern)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .iloc indexer is mainly used to locate first or last row in the DataFrame. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 f\n", "1 cold\n", "2 slow\n", "3 NaN\n", "4 29\n", "5 33\n", "6 17\n", "Name: 5, dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .iloc indexer in cell #8 below returns rows 2 and 3 using (2:4) as the row selector and columns 0 to 6 using (0:6) as the column selctor." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012345
2chotfast9.04.0NaN
3dcoolNoneNaNNaN17.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5\n", "2 c hot fast 9.0 4.0 NaN\n", "3 d cool None NaN NaN 17.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2:4, 0:6]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The analog SAS program for returning the same sub-set is below. FIRSTOBS=3 OBS=4 is the equivalent row selector and keep = id -- col5 is the equivalent column selector." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````\n", " /******************************************************/\n", " /* c05_firstobs_3_obs_4.sas */\n", " /******************************************************/\n", " 60 data df;\n", " 61 set df(keep = id -- col5\n", " 62 firstobs=3 obs=4);\n", " 63 put _all_ ;\n", "\n", " _N_=1 _ERROR_=0 id=c col1=hot col2=fast col3=9 col4=4 col5=.\n", " _N_=2 _ERROR_=0 id=d col1=cool col2= col3=. col4=. col5=17\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .iloc idexer illustrating multi-row and multi-column requests. Note the double square brackets ([]) syntax." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
246
1medium5.09.0
3NoneNaN89.0
5slow29.017.0
\n", "
" ], "text/plain": [ " 2 4 6\n", "1 medium 5.0 9.0\n", "3 None NaN 89.0\n", "5 slow 29.0 17.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[[1,3,5], [2, 4, 6]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## .loc Indexer " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .loc indexer is similar to .iloc and allows access to rows and columns by labels. A good analogy is a cell reference in Excel, eg. C:31.\n", "\n", "The syntax for the .loc indexer is:\n", "\n", " df.loc[row selection, column selection]\n", " \n", "For both the row and column selection, a comma (,) is used to request a list of multiple cells. A colon (:) is used to request a range of cells.\n", " \n", "Similiar to the .iloc indexer you can select combinations of rows and columns. The doc details are located here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the DataFrame df2 created below in cell #10. It contains the new columns 'id' and 'date'." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df2 = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3., '08/01/16'], \n", " ['b', 'warm', 'medium', 4, 5, 7, 9, '03/15/16'],\n", " ['c', 'hot', 'fast', 9, 4, np.nan, 6, '04/30/16'],\n", " ['d', 'None', 'fast', np.nan, np.nan, 17, 89, '05/31/16'],\n", " ['e', 'cool', 'medium', 16, 44, 21, 13, '07/04/16'],\n", " ['f', 'cold', 'slow', np.nan, 29, 33, 17, '01/01/16']],\n", " columns=['id', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'date'])\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Executing just the name for the DataFrame is the equivalent of:\n", " \n", " print(df2)\n", " \n", "The print() method for a DataFrame returns the output without the cell outlines, however." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcol1col2col3col4col5col6date
0acoldslowNaN2.06.03.008/01/16
1bwarmmedium4.05.07.09.003/15/16
2chotfast9.04.0NaN6.004/30/16
3dNonefastNaNNaN17.089.005/31/16
4ecoolmedium16.044.021.013.007/04/16
5fcoldslowNaN29.033.017.001/01/16
\n", "
" ], "text/plain": [ " id col1 col2 col3 col4 col5 col6 date\n", "0 a cold slow NaN 2.0 6.0 3.0 08/01/16\n", "1 b warm medium 4.0 5.0 7.0 9.0 03/15/16\n", "2 c hot fast 9.0 4.0 NaN 6.0 04/30/16\n", "3 d None fast NaN NaN 17.0 89.0 05/31/16\n", "4 e cool medium 16.0 44.0 21.0 13.0 07/04/16\n", "5 f cold slow NaN 29.0 33.0 17.0 01/01/16" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting and resetting Indicies\n", "\n", "In SAS the construction of an index for a data set creates an external file used in a deterministic fashion by SAS. In contrast, the construction of a panda index physically alters either the DataFrame, or a copy of it, depending on argument values to the set_index method. The doc is located here ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start by setting the index to 'id', to access rows by a row value or a range of rows values. By default, the column is dropped when it becomes the index. You may not want this behavior, in which case you set the drop= argument value to False. \n", " \n", " drop=False\n", " \n", "This is useful if you have multiple occasions setting and resetting the index. Otherwise, a re-read of the DataFrame is required. Below, use the set_index() method to set the index to the 'id' column." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df2.set_index('id', inplace=True, drop=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .set_index attribute execution is silent when the inplace=argument value is utilized. Validate the index using the .index attribute." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object', name='id')" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To reset the index, use the reset_index() method." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df2 = df2.reset_index(drop=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to have the remainder of the examples for the .loc indexer to work, we set the index again." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df2.set_index('id', inplace=True, drop=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return the row labeled 'e'." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "id e\n", "col1 cool\n", "col2 medium\n", "col3 16\n", "col4 44\n", "col5 21\n", "col6 13\n", "date 07/04/16\n", "Name: e, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc['e', ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return rows in the range of 'b' to 'f' inclusive. 'b':'f' denotes a row range. The absence of a column request is an implicit request for all of them." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcol1col2col3col4col5col6date
id
bbwarmmedium4.05.07.09.003/15/16
cchotfast9.04.0NaN6.004/30/16
ddNonefastNaNNaN17.089.005/31/16
eecoolmedium16.044.021.013.007/04/16
ffcoldslowNaN29.033.017.001/01/16
\n", "
" ], "text/plain": [ " id col1 col2 col3 col4 col5 col6 date\n", "id \n", "b b warm medium 4.0 5.0 7.0 9.0 03/15/16\n", "c c hot fast 9.0 4.0 NaN 6.0 04/30/16\n", "d d None fast NaN NaN 17.0 89.0 05/31/16\n", "e e cool medium 16.0 44.0 21.0 13.0 07/04/16\n", "f f cold slow NaN 29.0 33.0 17.0 01/01/16" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc['b':'f' ,]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return the rows between range of 'd' to 'f' inclusive. 'col6' and 'col2' is a request for columns by label." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col6col2
id
d89.0fast
e13.0medium
f17.0slow
\n", "
" ], "text/plain": [ " col6 col2\n", "id \n", "d 89.0 fast\n", "e 13.0 medium\n", "f 17.0 slow" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc['d':'f',['col6','col2']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return the 'date' column by label." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "id\n", "a 08/01/16\n", "b 03/15/16\n", "c 04/30/16\n", "d 05/31/16\n", "e 07/04/16\n", "f 01/01/16\n", "Name: date, dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc[: , 'date']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Change the DataFrame df2 index from 'id' to 'date'. The inplace=True argument does not make a copy of the DataFrame." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df2.set_index('date', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Validate the index." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['08/01/16', '03/15/16', '04/30/16', '05/31/16', '07/04/16', '01/01/16'], dtype='object', name='date')" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Request a row by label." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "id d\n", "col1 None\n", "col2 fast\n", "col3 NaN\n", "col4 NaN\n", "col5 17\n", "col6 89\n", "Name: 05/31/16, dtype: object" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc['05/31/16']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Request arbitrary rows. Notice the double square braces ([])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcol1col2col3col4col5col6
date
03/15/16bwarmmedium4.05.07.09.0
07/04/16ecoolmedium16.044.021.013.0
\n", "
" ], "text/plain": [ " id col1 col2 col3 col4 col5 col6\n", "date \n", "03/15/16 b warm medium 4.0 5.0 7.0 9.0\n", "07/04/16 e cool medium 16.0 44.0 21.0 13.0" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc[['03/15/16', '07/04/16']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Request a range of rows. " ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col2col1
date
04/30/16fasthot
05/31/16fastNone
07/04/16mediumcool
\n", "
" ], "text/plain": [ " col2 col1\n", "date \n", "04/30/16 fast hot\n", "05/31/16 fast None\n", "07/04/16 medium cool" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc['04/30/16':'07/04/16',['col2','col1']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SAS program below is equivalent to cell #25 above. It uses character values representing dates." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````\n", " /******************************************************/\n", " /* c05_select_dates_01Feb16_and_31Jul16.sas */\n", " /******************************************************/\n", " 25 data df2(keep = col2 col1);\n", " 26 set df(where=(date between '04/30/16' and '07/04/16'));\n", " 27 put _all_;\n", "\n", " _N_=1 _ERROR_=0 id=c col1=hot col2=fast col3=9 col4=4 col5=. col6=6 date=04/30/16\n", " _N_=2 _ERROR_=0 id=d col1=cool col2= col3=. col4=. col5=17 col6=89 date=05/31/16\n", " _N_=3 _ERROR_=0 id=e col1=cool col2=medium col3=16 col4=44 col5=21 col6=13 date=07/04/16\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In cell #26 below we hit a snag. The issue begins with cell #21 above using the set_index attribute for the 'df2' DataFrame. Examine cell #22 above to observe how the df2['date'] column dtype is 'object'. \n", "\n", "This indicates we are working with string literals and not datetime objects. Cells #24 and #25 work because these specific labels are values found in the df2['date'] index.\n", "\n", "Cell #26, below does not work, since the range request contains the value '07/31/16 as the range end-point which is not a value in the index. \n", "\n", "The remedy, shown below in cell #29 is to use the pd.to_datetime() method to convert the df2['date'] string values into a datetime object. The obvious analogy for SAS users is converting a string variable to a numeric variable which has an associated date format.\n", "\n", "Converting string literals to datetime values is covered in the section, String Literal Mapped to datetime timestamp located here." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "ename": "KeyError", "evalue": "'07/31/16'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mget_slice_bound\u001b[1;34m(self, label, side, kind)\u001b[0m\n\u001b[0;32m 2909\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2910\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_searchsorted_monotonic\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlabel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mside\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2911\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36m_searchsorted_monotonic\u001b[1;34m(self, label, side)\u001b[0m\n\u001b[0;32m 2875\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2876\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'index must be monotonic increasing or decreasing'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2877\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mValueError\u001b[0m: index must be monotonic increasing or decreasing", "\nDuring handling of the above exception, another exception occurred:\n", "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mdf2\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'01/01/16'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;34m'07/31/16'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py\u001b[0m in \u001b[0;36m__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 1294\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_getitem_tuple\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1295\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1296\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_getitem_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1297\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1298\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m_getitem_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py\u001b[0m in \u001b[0;36m_getitem_axis\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 1435\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mslice\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1436\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_has_valid_type\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1437\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_get_slice_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1438\u001b[0m \u001b[1;32melif\u001b[0m \u001b[0mis_bool_indexer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1439\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_getbool_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py\u001b[0m in \u001b[0;36m_get_slice_axis\u001b[1;34m(self, slice_obj, axis)\u001b[0m\n\u001b[0;32m 1316\u001b[0m \u001b[0mlabels\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mobj\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_get_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1317\u001b[0m indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,\n\u001b[1;32m-> 1318\u001b[1;33m slice_obj.step, kind=self.name)\n\u001b[0m\u001b[0;32m 1319\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1320\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mslice\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mslice_indexer\u001b[1;34m(self, start, end, step, kind)\u001b[0m\n\u001b[0;32m 2783\u001b[0m \"\"\"\n\u001b[0;32m 2784\u001b[0m start_slice, end_slice = self.slice_locs(start, end, step=step,\n\u001b[1;32m-> 2785\u001b[1;33m kind=kind)\n\u001b[0m\u001b[0;32m 2786\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2787\u001b[0m \u001b[1;31m# return a slice\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mslice_locs\u001b[1;34m(self, start, end, step, kind)\u001b[0m\n\u001b[0;32m 2968\u001b[0m \u001b[0mend_slice\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2969\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mend\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2970\u001b[1;33m \u001b[0mend_slice\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_slice_bound\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mend\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'right'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkind\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2971\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mend_slice\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2972\u001b[0m \u001b[0mend_slice\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mget_slice_bound\u001b[1;34m(self, label, side, kind)\u001b[0m\n\u001b[0;32m 2911\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2912\u001b[0m \u001b[1;31m# raise the original KeyError\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2913\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0merr\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2914\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2915\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mslc\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mget_slice_bound\u001b[1;34m(self, label, side, kind)\u001b[0m\n\u001b[0;32m 2905\u001b[0m \u001b[1;31m# we need to look up the label\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2906\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2907\u001b[1;33m \u001b[0mslc\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlabel\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2908\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2909\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mget_loc\u001b[1;34m(self, key, method, tolerance)\u001b[0m\n\u001b[0;32m 1945\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1946\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1947\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_maybe_cast_indexer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1948\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1949\u001b[0m \u001b[0mindexer\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_indexer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtolerance\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mtolerance\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mpandas\\index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_loc (pandas\\index.c:4154)\u001b[1;34m()\u001b[0m\n", "\u001b[1;32mpandas\\index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_loc (pandas\\index.c:4018)\u001b[1;34m()\u001b[0m\n", "\u001b[1;32mpandas\\hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.PyObjectHashTable.get_item (pandas\\hashtable.c:12368)\u001b[1;34m()\u001b[0m\n", "\u001b[1;32mpandas\\hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.PyObjectHashTable.get_item (pandas\\hashtable.c:12322)\u001b[1;34m()\u001b[0m\n", "\u001b[1;31mKeyError\u001b[0m: '07/31/16'" ] } ], "source": [ "df2.loc['01/01/16':'07/31/16']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reset the index for the DataFrame 'df2' to the default RangeIndex object. " ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df2.reset_index(inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Validate the index." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=6, step=1)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cast the df2['date'] column from dtype='object' (strings) to dtype=datetime. " ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df2['date'] = pd.to_datetime(df2.date)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the df2['date'] column as the index." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df2.set_index('date', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Validate the index. Observe the dytpe is now datetime64--a datetime stamp. See Chapter 8--Date, Time, and Timedelta Objects for more details on datetime arithmetic, shifting time intervals, and determining durations." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2016-08-01', '2016-03-15', '2016-04-30', '2016-05-31',\n", " '2016-07-04', '2016-01-01'],\n", " dtype='datetime64[ns]', name='date', freq=None)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the 'ds2['date'] column values converted to a datetime values, re-do the statement in cell #26 above." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcol1col2col3col4col5col6
date
2016-03-15bwarmmedium4.05.07.09.0
2016-04-30chotfast9.04.0NaN6.0
2016-05-31dNonefastNaNNaN17.089.0
2016-07-04ecoolmedium16.044.021.013.0
\n", "
" ], "text/plain": [ " id col1 col2 col3 col4 col5 col6\n", "date \n", "2016-03-15 b warm medium 4.0 5.0 7.0 9.0\n", "2016-04-30 c hot fast 9.0 4.0 NaN 6.0\n", "2016-05-31 d None fast NaN NaN 17.0 89.0\n", "2016-07-04 e cool medium 16.0 44.0 21.0 13.0" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc['02/01/16':'07/31/16']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SAS example below illustrates a similiar set of steps:\n", " \n", " 1. Read the original 'date' variable which is character and rename it to 'str_date'\n", " 2. Use the input funtion to 'read' the 'str_date' values and assign them to the 'date' variable using the mmddyy10. informat\n", " 3. Print the date value without any formatting showing it is now a SAS datetime value\n", " 4. Print the SAS datetime value using the mmddyy10. date format\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````\n", " /******************************************************/\n", " /* c05_alter_string_to_datetime_variable.sas */\n", " /******************************************************/\n", " 4 data df2(drop = str_date);\n", " 5 set df(rename=(date=str_date));\n", " 6 date=input(str_date,mmddyy10.);\n", " 7 \n", " 8 if _n_ = 1 then put date= /\n", " 9 date mmddyy10.;\n", "\n", " date=20667\n", " 08/01/2016\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mixing .loc Indexer with Boolean Operators\n", "\n", "This approach works by creating either a Series or array of boolean values (True or False). This Series or array is then used by the .loc indexer to return all of the values that evaluate to True. Using the DataFrame df2 created in the cell above, consider the following.\n", "\n", "We want to return all rows where 'col2' is not equal to 'fast. This is expressed as:\n", "\n", " df2['col2'] != 'fast'\n", " \n", "A Series is returned with the True/False values not equal to 'fast' for fd2['col2'] shown below. The df2['date'] column is returned since it remains as the index for the DataFrame. The second print() method displays this object as being derived from the class: Series." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "date\n", "2016-08-01 True\n", "2016-03-15 True\n", "2016-04-30 False\n", "2016-05-31 False\n", "2016-07-04 True\n", "2016-01-01 True\n", "Name: col2, dtype: bool\n", "\n" ] } ], "source": [ "print(df2['col2'] != 'fast')\n", "print(type(df2['col2'] != 'fast'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passing the boolean Series:\n", " \n", " df2['col2'] != 'fast' \n", " \n", "to the .loc indexer to retrieve those rows with a boolean value of True. Also request 'col1' and 'col2', which a request by label." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2
date
2016-08-01coldslow
2016-03-15warmmedium
2016-07-04coolmedium
2016-01-01coldslow
\n", "
" ], "text/plain": [ " col1 col2\n", "date \n", "2016-08-01 cold slow\n", "2016-03-15 warm medium\n", "2016-07-04 cool medium\n", "2016-01-01 cold slow" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc[df2['col2'] != 'fast', 'col1':'col2']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can combine any number of boolean operation together. Boolean value comparison operators are documented here." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcol1col2col3col4col5col6
date
2016-07-04ecoolmedium16.044.021.013.0
\n", "
" ], "text/plain": [ " id col1 col2 col3 col4 col5 col6\n", "date \n", "2016-07-04 e cool medium 16.0 44.0 21.0 13.0" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc[(df2.col3 >= 9) & (df2.col1 == 'cool'), ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .isin() method returns a boolean vector similar to the behavior described in cell #34 above. In this example, the .isin list of elements evaluates True if the listed elements are found by the .loc indexer in 'col6'." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcol1col2col3col4col5col6
date
2016-03-15bwarmmedium4.05.07.09.0
2016-04-30chotfast9.04.0NaN6.0
2016-07-04ecoolmedium16.044.021.013.0
\n", "
" ], "text/plain": [ " id col1 col2 col3 col4 col5 col6\n", "date \n", "2016-03-15 b warm medium 4.0 5.0 7.0 9.0\n", "2016-04-30 c hot fast 9.0 4.0 NaN 6.0\n", "2016-07-04 e cool medium 16.0 44.0 21.0 13.0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc[df2.col6.isin([6, 9, 13])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far, the .loc indexers have resulted in an output stream. All of the indexers can be used to sub-set a DataFrame using assignment syntax shown in the cell #33 below." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcol1col2col3col4col5col6
date
2016-03-15bwarmmedium4.05.07.09.0
2016-04-30chotfast9.04.0NaN6.0
2016-07-04ecoolmedium16.044.021.013.0
\n", "
" ], "text/plain": [ " id col1 col2 col3 col4 col5 col6\n", "date \n", "2016-03-15 b warm medium 4.0 5.0 7.0 9.0\n", "2016-04-30 c hot fast 9.0 4.0 NaN 6.0\n", "2016-07-04 e cool medium 16.0 44.0 21.0 13.0" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = df2.loc[df2.col6.isin([6, 9, 13])]\n", "df3" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generally, in these types of sub-setting operations, the shape of the extracted DataFrame will be smaller than the input DataFrame.\n", "\n", "To return a DataFrame of the same shape as the original, use the where() method. Details for the pandas where() method is described here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape for df2 is (6, 7)\n", "Shape for df3 is (3, 7)\n" ] } ], "source": [ "print('Shape for df2 is', df2.shape)\n", "print('Shape for df3 is', df3.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The example SAS program below uses the WHERE IN (list) syntax to subset a data set analogous to the example in cell #32 above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````\n", " /******************************************************/\n", " /* c05_where_in.sas */\n", " /******************************************************/\n", " NOTE: Data set \"WORK.df2\" has 6 observation(s) and 8 variable(s)\n", "\n", " 27 data df3;\n", " 28 set df2(where=(col6 in (6 9 13)));\n", "\n", " NOTE: 3 observations were read from \"WORK.df2\"\n", " NOTE: Data set \"WORK.df3\" has 3 observation(s) and 8 variable(s)\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the SAS variable count and DataFrame column count differ by 1. That is because the DataFrame .shape() method does\n", "not include the index as part of its column count. By reseting the index, the SAS variable count and DataFrame columns count agree." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape for df3 is (3, 8)\n" ] } ], "source": [ "df3.reset_index(inplace=True)\n", "print('Shape for df3 is', df3.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Altering DataFrame values using the .loc indexer\n", "The .loc indexer can also be used to do an in-place update of values. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the values for df2['col2'] column before updating." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "date\n", "2016-08-01 slow\n", "2016-03-15 medium\n", "2016-04-30 fast\n", "2016-05-31 fast\n", "2016-07-04 medium\n", "2016-01-01 slow\n", "Name: col2, dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc[: , 'col2']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conditionaly Apply Values Based on Another Column Value" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df2.loc[df2['col6'] > 50, \"col2\"] = \"VERY FAST\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values for df2['col2'] column after the update." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "date\n", "2016-08-01 slow\n", "2016-03-15 medium\n", "2016-04-30 fast\n", "2016-05-31 VERY FAST\n", "2016-07-04 medium\n", "2016-01-01 slow\n", "Name: col2, dtype: object" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc[: , 'col2']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## .ix Indexer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .ix indexer combines characteristics of the .loc and .iloc indexers. This means you can select rows and columns by labels and integers. \n", "\n", "The syntax for the .ix indexer is:\n", "\n", " df.ix[row selection, column selection]\n", " \n", "For both the row and column selection, a comma (,) is used to request a list of multiple cells. A colon (:) is used to request a range of cells.\n", " \n", "Similar to the .loc indexer you can select combinations of rows and columns. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .ix indexer is sometimes tricky to use. A good rule of thumb is if you are indexing using labels or indexing using integers, use the .loc and .iloc to avoid unexpected results. The documentation details are found here.\n", "\n", "Consider the creation of the DataFrame 'df4' constructed in the cell below. It is similar to DataFrame df2 created in cell #10 above. The differences are the addition of another column and columns being identified with labels as well as integers." ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df4 = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3., 17, '08/01/16'], \n", " ['b', 'warm', 'medium', 4, 5, 7, 9, 21, '03/15/16'],\n", " ['c', 'hot', 'fast', 9, 4, np.nan, 6, 10, '04/30/16'],\n", " ['d', 'None', 'fast', np.nan, np.nan, 17, 89, 44, '05/31/16'],\n", " ['e', 'cool', 'medium', 16, 44, 21, 13, 99, '07/04/16'],\n", " ['f', 'cold', 'slow', np.nan, 29, 33, 17, 11,'01/01/16']],\n", " columns=['id', 'col1', 'col2', 'col3', 'col4', 4, 5, 6, 'date'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcol1col2col3col4456date
0acoldslowNaN2.06.03.01708/01/16
1bwarmmedium4.05.07.09.02103/15/16
2chotfast9.04.0NaN6.01004/30/16
3dNonefastNaNNaN17.089.04405/31/16
4ecoolmedium16.044.021.013.09907/04/16
5fcoldslowNaN29.033.017.01101/01/16
\n", "
" ], "text/plain": [ " id col1 col2 col3 col4 4 5 6 date\n", "0 a cold slow NaN 2.0 6.0 3.0 17 08/01/16\n", "1 b warm medium 4.0 5.0 7.0 9.0 21 03/15/16\n", "2 c hot fast 9.0 4.0 NaN 6.0 10 04/30/16\n", "3 d None fast NaN NaN 17.0 89.0 44 05/31/16\n", "4 e cool medium 16.0 44.0 21.0 13.0 99 07/04/16\n", "5 f cold slow NaN 29.0 33.0 17.0 11 01/01/16" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the index to column df4['id']." ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df4.set_index('id', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .ix indexer allows slicing by labels and integer poitions along the index. And look closely at the results since the request for columns is based on the integer-position of the column. The column requestor of 6 is not a column label, but its position." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
6date
id
b2103/15/16
c1004/30/16
d4405/31/16
e9907/04/16
\n", "
" ], "text/plain": [ " 6 date\n", "id \n", "b 21 03/15/16\n", "c 10 04/30/16\n", "d 44 05/31/16\n", "e 99 07/04/16" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4.ix['b':'e', 6:8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A Review:\n", "\n", " .iloc uses the integer position in the index and only accepts integers\n", " .loc uses the labels in the index\n", " .ix generally behaves like the .loc\n", "\n", "Finally, to appreciate the differences, consider the following DataFrame. Also notice the un-Pythonic style of using a semi-colon at the end of the DataFrame definition." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456
6acoldslowNaN2.06.03.0
8bwarmmedium4.05.07.09.0
2chotfast9.04.0NaN6.0
3dcoolNoneNaNNaN17.089.0
4ecoolmedium16.044.021.013.0
5fcoldslowNaN29.033.017.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6\n", "6 a cold slow NaN 2.0 6.0 3.0\n", "8 b warm medium 4.0 5.0 7.0 9.0\n", "2 c hot fast 9.0 4.0 NaN 6.0\n", "3 d cool None NaN NaN 17.0 89.0\n", "4 e cool medium 16.0 44.0 21.0 13.0\n", "5 f cold slow NaN 29.0 33.0 17.0" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5 = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3.], \n", " ['b', 'warm', 'medium', 4, 5, 7, 9],\n", " ['c', 'hot', 'fast', 9, 4, np.nan, 6],\n", " ['d', 'cool', None, np.nan, np.nan, 17, 89],\n", " ['e', 'cool', 'medium', 16, 44, 21, 13],\n", " ['f', 'cold', 'slow', np.nan, 29, 33, 17]],\n", " index = [6, 8, 2, 3, 4, 5,]);\n", "df5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .iloc indexer returns the first two rows since it looks at positions." ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456
6acoldslowNaN2.06.03.0
8bwarmmedium4.05.07.09.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6\n", "6 a cold slow NaN 2.0 6.0 3.0\n", "8 b warm medium 4.0 5.0 7.0 9.0" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5.iloc[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .loc indexer returns 3 rows since it looks at the labels." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456
6acoldslowNaN2.06.03.0
8bwarmmedium4.05.07.09.0
2chotfast9.04.0NaN6.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6\n", "6 a cold slow NaN 2.0 6.0 3.0\n", "8 b warm medium 4.0 5.0 7.0 9.0\n", "2 c hot fast 9.0 4.0 NaN 6.0" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5.loc[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .ix indexer returns the same number of rows as the .loc indexer since its behavior is to first use labels before looking by position. Looking by position with an integer-based index can lead to unexpected results. This illustrated in cell #49 below." ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456
6acoldslowNaN2.06.03.0
8bwarmmedium4.05.07.09.0
2chotfast9.04.0NaN6.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6\n", "6 a cold slow NaN 2.0 6.0 3.0\n", "8 b warm medium 4.0 5.0 7.0 9.0\n", "2 c hot fast 9.0 4.0 NaN 6.0" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5.ix[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the next two examples, review the DataFrame df5 index structure in the cell below. " ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Int64Index([6, 8, 2, 3, 4, 5], dtype='int64')" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .iloc example in the cell below returns the first row. That's because it is looking by position." ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456
6acoldslowNaN2.06.03.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6\n", "6 a cold slow NaN 2.0 6.0 3.0" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5.iloc[:1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .ix example in the cell below raises a KeyError since 1 is not found in the index." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "ename": "KeyError", "evalue": "1", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mget_slice_bound\u001b[1;34m(self, label, side, kind)\u001b[0m\n\u001b[0;32m 2909\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2910\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_searchsorted_monotonic\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlabel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mside\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2911\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36m_searchsorted_monotonic\u001b[1;34m(self, label, side)\u001b[0m\n\u001b[0;32m 2875\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2876\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'index must be monotonic increasing or decreasing'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2877\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mValueError\u001b[0m: index must be monotonic increasing or decreasing", "\nDuring handling of the above exception, another exception occurred:\n", "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mdf5\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mix\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py\u001b[0m in \u001b[0;36m__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 76\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 77\u001b[0m \u001b[0mkey\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcom\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_apply_if_callable\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mobj\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 78\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_getitem_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 79\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 80\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m_get_label\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlabel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py\u001b[0m in \u001b[0;36m_getitem_axis\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 990\u001b[0m \u001b[0mlabels\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mobj\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_get_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 991\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mslice\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 992\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_get_slice_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 993\u001b[0m elif (is_list_like_indexer(key) and\n\u001b[0;32m 994\u001b[0m not (isinstance(key, tuple) and\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py\u001b[0m in \u001b[0;36m_get_slice_axis\u001b[1;34m(self, slice_obj, axis)\u001b[0m\n\u001b[0;32m 1235\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mneed_slice\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mslice_obj\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1236\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mobj\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1237\u001b[1;33m \u001b[0mindexer\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_convert_slice_indexer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mslice_obj\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1238\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1239\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mslice\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py\u001b[0m in \u001b[0;36m_convert_slice_indexer\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 190\u001b[0m \u001b[1;31m# if we are accessing via lowered dim, use the last dim\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 191\u001b[0m \u001b[0max\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mobj\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_get_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmin\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mndim\u001b[0m \u001b[1;33m-\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 192\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0max\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_convert_slice_indexer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkind\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mname\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 193\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 194\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m_has_valid_setitem_indexer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mindexer\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36m_convert_slice_indexer\u001b[1;34m(self, key, kind)\u001b[0m\n\u001b[0;32m 1077\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1078\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1079\u001b[1;33m \u001b[0mindexer\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mslice_indexer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mstart\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstop\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstep\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkind\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mkind\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1080\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mException\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1081\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mis_index_slice\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mslice_indexer\u001b[1;34m(self, start, end, step, kind)\u001b[0m\n\u001b[0;32m 2783\u001b[0m \"\"\"\n\u001b[0;32m 2784\u001b[0m start_slice, end_slice = self.slice_locs(start, end, step=step,\n\u001b[1;32m-> 2785\u001b[1;33m kind=kind)\n\u001b[0m\u001b[0;32m 2786\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2787\u001b[0m \u001b[1;31m# return a slice\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mslice_locs\u001b[1;34m(self, start, end, step, kind)\u001b[0m\n\u001b[0;32m 2968\u001b[0m \u001b[0mend_slice\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2969\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mend\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2970\u001b[1;33m \u001b[0mend_slice\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_slice_bound\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mend\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'right'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkind\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2971\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mend_slice\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2972\u001b[0m \u001b[0mend_slice\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mget_slice_bound\u001b[1;34m(self, label, side, kind)\u001b[0m\n\u001b[0;32m 2911\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2912\u001b[0m \u001b[1;31m# raise the original KeyError\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2913\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0merr\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2914\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2915\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mslc\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mget_slice_bound\u001b[1;34m(self, label, side, kind)\u001b[0m\n\u001b[0;32m 2905\u001b[0m \u001b[1;31m# we need to look up the label\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2906\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2907\u001b[1;33m \u001b[0mslc\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlabel\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2908\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2909\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mget_loc\u001b[1;34m(self, key, method, tolerance)\u001b[0m\n\u001b[0;32m 1945\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1946\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1947\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_maybe_cast_indexer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1948\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1949\u001b[0m \u001b[0mindexer\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_indexer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtolerance\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mtolerance\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mpandas\\index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_loc (pandas\\index.c:4154)\u001b[1;34m()\u001b[0m\n", "\u001b[1;32mpandas\\index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_loc (pandas\\index.c:4018)\u001b[1;34m()\u001b[0m\n", "\u001b[1;32mpandas\\hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.Int64HashTable.get_item (pandas\\hashtable.c:6610)\u001b[1;34m()\u001b[0m\n", "\u001b[1;32mpandas\\hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.Int64HashTable.get_item (pandas\\hashtable.c:6554)\u001b[1;34m()\u001b[0m\n", "\u001b[1;31mKeyError\u001b[0m: 1" ] } ], "source": [ "df5.ix[:1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing Issues\n", "So far, so good. We have a basic understanding of how indexes can be established, utilized, and reset. We can use the .iloc, .loc, and .ix indexers to retrieve subsets of columns and rows. But what about real-world scenarios where data is rarely, if ever tidy?\n", "\n", "The synthetic examples above work (except the intentional errors of course) since they rely on constructing the DataFrames in an orderly manner, like having 'id' columns in alphabetical order or dates in chronological order. \n", "\n", "Consider the DataFrame 'df5' created below. It is similar to DataFrame 'df2' in cell #10 above with the exception of the df5['id'] column containing non-unique values." ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcol1col2col3col4col5col6date
0bcoldslowNaN2.06.03.001/01/16
1cwarmmedium4.05.07.09.003/15/16
2ahotfast9.04.0NaN6.004/30/16
3dcoolNoneNaNNaN17.089.005/31/16
4ccoolmedium16.044.021.013.007/04/16
5ecoldslowNaN29.033.017.008/30/16
\n", "
" ], "text/plain": [ " id col1 col2 col3 col4 col5 col6 date\n", "0 b cold slow NaN 2.0 6.0 3.0 01/01/16\n", "1 c warm medium 4.0 5.0 7.0 9.0 03/15/16\n", "2 a hot fast 9.0 4.0 NaN 6.0 04/30/16\n", "3 d cool None NaN NaN 17.0 89.0 05/31/16\n", "4 c cool medium 16.0 44.0 21.0 13.0 07/04/16\n", "5 e cold slow NaN 29.0 33.0 17.0 08/30/16" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5 = pd.DataFrame([['b', 'cold','slow', np.nan, 2., 6., 3., '01/01/16'], \n", " ['c', 'warm', 'medium', 4, 5, 7, 9, '03/15/16'],\n", " ['a', 'hot', 'fast', 9, 4, np.nan, 6, '04/30/16'],\n", " ['d', 'cool', None, np.nan, np.nan, 17, 89, '05/31/16'],\n", " ['c', 'cool', 'medium', 16, 44, 21, 13, '07/04/16'],\n", " ['e', 'cold', 'slow', np.nan, 29, 33, 17, '08/30/16']],\n", " columns=['id', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'date']);\n", "df5" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the index for DataFrame df5 to the df5['id'] column. " ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df5.set_index('id', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Validate the index for DataFrame df5." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['b', 'c', 'a', 'd', 'c', 'e'], dtype='object', name='id')" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the .loc indexer to request the rows in the range of 'b' through 'd'." ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col5col6date
id
bcoldslowNaN2.06.03.001/01/16
cwarmmedium4.05.07.09.003/15/16
ahotfast9.04.0NaN6.004/30/16
dcoolNoneNaNNaN17.089.005/31/16
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col5 col6 date\n", "id \n", "b cold slow NaN 2.0 6.0 3.0 01/01/16\n", "c warm medium 4.0 5.0 7.0 9.0 03/15/16\n", "a hot fast 9.0 4.0 NaN 6.0 04/30/16\n", "d cool None NaN NaN 17.0 89.0 05/31/16" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5.loc['b':'d', :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you look closely at the results from the example above, you will find the first occurence of the row 'id' label 'c' was returned, but not the second row labeled 'c'. 'id' label 'c' is obviously non-unique. And that is only part of the issue. Consider futher the use of a non-unique label for the row range selection in the example below. \n", "\n", "This issue is described in sparse prose here.\n", "\n", "If we should want the row label range of 'b' to 'c' with all the columns we raise the error:\n", "\n", " \"Cannot get right slice bound for non-unique label: 'c'\"\n" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "ename": "KeyError", "evalue": "\"Cannot get right slice bound for non-unique label: 'c'\"", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mdf5\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'b'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;34m'c'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m:\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py\u001b[0m in \u001b[0;36m__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 1292\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1293\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mtype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mis\u001b[0m \u001b[0mtuple\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1294\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_getitem_tuple\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1295\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1296\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_getitem_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py\u001b[0m in \u001b[0;36m_getitem_tuple\u001b[1;34m(self, tup)\u001b[0m\n\u001b[0;32m 802\u001b[0m \u001b[1;32mcontinue\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 803\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 804\u001b[1;33m \u001b[0mretval\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mretval\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mname\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_getitem_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 805\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 806\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mretval\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py\u001b[0m in \u001b[0;36m_getitem_axis\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 1435\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mslice\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1436\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_has_valid_type\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1437\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_get_slice_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1438\u001b[0m \u001b[1;32melif\u001b[0m \u001b[0mis_bool_indexer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1439\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_getbool_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py\u001b[0m in \u001b[0;36m_get_slice_axis\u001b[1;34m(self, slice_obj, axis)\u001b[0m\n\u001b[0;32m 1316\u001b[0m \u001b[0mlabels\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mobj\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_get_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1317\u001b[0m indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,\n\u001b[1;32m-> 1318\u001b[1;33m slice_obj.step, kind=self.name)\n\u001b[0m\u001b[0;32m 1319\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1320\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mslice\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mslice_indexer\u001b[1;34m(self, start, end, step, kind)\u001b[0m\n\u001b[0;32m 2783\u001b[0m \"\"\"\n\u001b[0;32m 2784\u001b[0m start_slice, end_slice = self.slice_locs(start, end, step=step,\n\u001b[1;32m-> 2785\u001b[1;33m kind=kind)\n\u001b[0m\u001b[0;32m 2786\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2787\u001b[0m \u001b[1;31m# return a slice\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mslice_locs\u001b[1;34m(self, start, end, step, kind)\u001b[0m\n\u001b[0;32m 2968\u001b[0m \u001b[0mend_slice\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2969\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mend\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2970\u001b[1;33m \u001b[0mend_slice\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_slice_bound\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mend\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'right'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkind\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2971\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mend_slice\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2972\u001b[0m \u001b[0mend_slice\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Users\\randy\\Anaconda3\\lib\\site-packages\\pandas\\indexes\\base.py\u001b[0m in \u001b[0;36mget_slice_bound\u001b[1;34m(self, label, side, kind)\u001b[0m\n\u001b[0;32m 2922\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mslc\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2923\u001b[0m raise KeyError(\"Cannot get %s slice bound for non-unique \"\n\u001b[1;32m-> 2924\u001b[1;33m \"label: %r\" % (side, original_label))\n\u001b[0m\u001b[0;32m 2925\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2926\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mslc\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mslice\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mKeyError\u001b[0m: \"Cannot get right slice bound for non-unique label: 'c'\"" ] } ], "source": [ "df5.loc['b':'c', :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is going on here is the values in 'id' column are non-unique. To enable detection of this condition the attribute .index.is_montonic_increasing and .index_montonic_decreasing return a boolean to test for this non-uniqueness property." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applied to the df5 DataFrame created in cell #54 above returns False." ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5.index.is_monotonic_increasing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The .is_monotonic_increasing attribute applied to the first DataFrame created above, 'df' returns true." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index.is_monotonic_increasing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While not spelled out in any documentation I found, the moral of the story is when using indices with non-unique values, be wary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resources\n", "\n", " Indexing and Selecting Data pandas 0.19.0 doucmentation\n", "\n", " Selecting DataFrame rows and columns using iloc, loc, and ix in Pandas by Shane Lynn\n", "\n", " Indexing, Slicing and Subsetting DataFrames in Python by Chris Friedline.\n", "\n", " Non-monotonic indexes require exact matches pandas 0.19.0 documentation\n", "\n", " Useful pandas Snippets by Becky Swegler, Computers are for People." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Navigation\n", "\n", " Return to Chapter List " ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }