{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Software Engineering for Data Scientists\n", "\n", "## *Sophisticated Data Manipulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Python's Data Science Ecosystem\n", "\n", "With this simple Python computation experience under our belt, we can now move to doing some more interesting analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python's Data Science Ecosystem\n", "\n", "In addition to Python's built-in modules like the ``math`` module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python.\n", "Some of the most important ones are:\n", "\n", "#### [``numpy``](http://numpy.org/): Numerical Python\n", "\n", "Numpy is short for \"Numerical Python\", and contains tools for efficient manipulation of arrays of data.\n", "If you have used other computational tools like IDL or MatLab, Numpy should feel very familiar.\n", "\n", "#### [``scipy``](http://scipy.org/): Scientific Python\n", "\n", "Scipy is short for \"Scientific Python\", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more.\n", "We will not look closely at Scipy today, but we will use its functionality later in the course.\n", "\n", "#### [``pandas``](http://pandas.pydata.org/): Labeled Data Manipulation in Python\n", "\n", "Pandas is short for \"Panel Data\", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a *Data Frame*.\n", "If you've used the [R](http://rstats.org) statistical language (and in particular the so-called \"Hadley Stack\"), much of the functionality in Pandas should feel very familiar.\n", "\n", "#### [``matplotlib``](http://matplotlib.org): Visualization in Python\n", "\n", "Matplotlib started out as a Matlab plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Installation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Installing Pandas & friends\n", "\n", "Because the above packages are not included in Python itself, you need to install them separately. While it is possible to install these from source (compiling the C and/or Fortran code that does the heavy lifting under the hood) it is much easier to use a package manager like ``conda``. All it takes is to run\n", "\n", "```\n", "$ conda install numpy scipy pandas matplotlib\n", "```\n", "\n", "and (so long as your conda setup is working) the packages will be downloaded and installed on your system." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Arrays and slicing in Numpy" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lists in native Python\n", "\n", "Let's create a **list**, a native Python object that we've used earlier today." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 5, 7, 8]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_list = [2, 5, 7, 8]\n", "my_list" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(my_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This list is one-dimensional, let's make it multidimensional!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "multi_list = [[1, 2, 3], [4, 5, 6]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we access the *6* element in the second row, third column for native Python list?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting to numpy Arrays" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "my_array = np.array(my_list)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "numpy.ndarray" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(my_array)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(my_array)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dtype('int64')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_array.dtype" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4,)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_array.shape" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "my_array = np.array([my_list])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 4)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_array.shape" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[2, 5, 7, 8]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_array" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "multi_array = np.array([[1, 2, 3], [4, 5, 6]], np.int32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we access the *6* element in the second row, third column for numpy array?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we retrieve a slice of the array, `array([[1, 2], [4, 5]])`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we retrieve the second column of the array?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Introduction to Pandas DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the elements of a table?" 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Pandas DataFrames as table elements\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What operations do we perform on tables?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABccc
x121.0
y2433.0
z364.0
\n", "
" ], "text/plain": [ " A B ccc\n", "x 1 2 1.0\n", "y 2 4 33.0\n", "z 3 6 4.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'A': [1,2,3], 'B': [2, 4, 6], 'ccc': [1.0, 33, 4]})\n", "df.index = ['x', 'y', 'z']\n", "df" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABccc
x242.0
y4866.0
z6128.0
\n", "
" ], "text/plain": [ " A B ccc\n", "x 2 4 2.0\n", "y 4 8 66.0\n", "z 6 12 8.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df+df" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
x12
z36
y24
\n", "
" ], "text/plain": [ " A B\n", "x 1 2\n", "z 3 6\n", "y 2 4" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.DataFrame({'A': [1,3, 2], 'B': [2, 6, 4]})\n", "df2.index = ['x', 'z', 'y']\n", "df2" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABccc
x24NaN
y48NaN
z612NaN
\n", "
" ], "text/plain": [ " A B ccc\n", "x 2 4 NaN\n", "y 4 8 NaN\n", "z 6 12 NaN" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df+df2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sub_df = df[['A', 'ccc']]\n", "sub_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['A'] + 2*df['B']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Operations on a Pandas DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Manipulating Data with DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading the data\n", "\n", "Shell commands can be run from the notebook by preceding them with an exclamation point:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "uncomment this to download the data:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 42.7M 0 42.7M 0 0 1669k 0 --:--:-- 0:00:26 --:--:-- 1644k\n" ] } ], "source": [ "!curl -o pronto.csv https://data.seattle.gov/api/views/tw7j-dfaw/rows.csv?accessType=DOWNLOAD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading Data into a DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we'll use it so much, we often import under a shortened name using the ``import ... as ...`` pattern:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('pronto.csv')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "275091" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trip_idstarttimestoptimebikeidtripdurationfrom_station_nameto_station_namefrom_station_idto_station_idusertypegenderbirthyear
043110/13/2014 10:31:00 AM10/13/2014 10:48:00 AMSEA00298985.9352nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1960.0
143210/13/2014 10:32:00 AM10/13/2014 10:48:00 AMSEA00195926.3752nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1970.0
243310/13/2014 10:33:00 AM10/13/2014 10:48:00 AMSEA00486883.8312nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberFemale1988.0
343410/13/2014 10:34:00 AM10/13/2014 10:48:00 AMSEA00333865.9372nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberFemale1977.0
443510/13/2014 10:34:00 AM10/13/2014 10:49:00 AMSEA00202923.9232nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1971.0
\n", "
" ], "text/plain": [ " trip_id starttime stoptime bikeid \\\n", "0 431 10/13/2014 10:31:00 AM 10/13/2014 10:48:00 AM SEA00298 \n", "1 432 10/13/2014 10:32:00 AM 10/13/2014 10:48:00 AM SEA00195 \n", "2 433 10/13/2014 10:33:00 AM 10/13/2014 10:48:00 AM SEA00486 \n", "3 434 10/13/2014 10:34:00 AM 10/13/2014 10:48:00 AM SEA00333 \n", "4 435 10/13/2014 10:34:00 AM 10/13/2014 10:49:00 AM SEA00202 \n", "\n", " tripduration from_station_name \\\n", "0 985.935 2nd Ave & Spring St \n", "1 926.375 2nd Ave & Spring St \n", "2 883.831 2nd Ave & Spring St \n", "3 865.937 2nd Ave & Spring St \n", "4 923.923 2nd Ave & Spring St \n", "\n", " to_station_name from_station_id \\\n", "0 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "1 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "2 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "3 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "4 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "\n", " to_station_id usertype gender birthyear \n", "0 PS-04 Member Male 1960.0 \n", "1 PS-04 Member Male 1970.0 \n", "2 PS-04 Member Female 1988.0 \n", "3 PS-04 Member Female 1977.0 \n", "4 PS-04 Member Male 1971.0 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use the ``read_csv`` command to read the comma-separated-value data:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note: strings in Python can be defined either with double quotes or single quotes*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Viewing Pandas Dataframes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``head()`` and ``tail()`` methods show us the first and last rows of the data" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trip_idstarttimestoptimebikeidtripdurationfrom_station_nameto_station_namefrom_station_idto_station_idusertypegenderbirthyear
043110/13/2014 10:31:00 AM10/13/2014 10:48:00 AMSEA00298985.9352nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1960.0
143210/13/2014 10:32:00 AM10/13/2014 10:48:00 AMSEA00195926.3752nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1970.0
243310/13/2014 10:33:00 AM10/13/2014 10:48:00 AMSEA00486883.8312nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberFemale1988.0
343410/13/2014 10:34:00 AM10/13/2014 10:48:00 AMSEA00333865.9372nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberFemale1977.0
443510/13/2014 10:34:00 AM10/13/2014 10:49:00 AMSEA00202923.9232nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1971.0
\n", "
" ], "text/plain": [ " trip_id starttime stoptime bikeid \\\n", "0 431 10/13/2014 10:31:00 AM 10/13/2014 10:48:00 AM SEA00298 \n", "1 432 10/13/2014 10:32:00 AM 10/13/2014 10:48:00 AM SEA00195 \n", "2 433 10/13/2014 10:33:00 AM 10/13/2014 10:48:00 AM SEA00486 \n", "3 434 10/13/2014 10:34:00 AM 10/13/2014 10:48:00 AM SEA00333 \n", "4 435 10/13/2014 10:34:00 AM 10/13/2014 10:49:00 AM SEA00202 \n", "\n", " tripduration from_station_name \\\n", "0 985.935 2nd Ave & Spring St \n", "1 926.375 2nd Ave & Spring St \n", "2 883.831 2nd Ave & Spring St \n", "3 865.937 2nd Ave & Spring St \n", "4 923.923 2nd Ave & Spring St \n", "\n", " to_station_name from_station_id \\\n", "0 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "1 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "2 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "3 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "4 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "\n", " to_station_id usertype gender birthyear \n", "0 PS-04 Member Male 1960.0 \n", "1 PS-04 Member Male 1970.0 \n", "2 PS-04 Member Female 1988.0 \n", "3 PS-04 Member Female 1977.0 \n", "4 PS-04 Member Male 1971.0 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['trip_id',\n", " 'starttime',\n", " 'stoptime',\n", " 'bikeid',\n", " 'tripduration',\n", " 'from_station_name',\n", " 'to_station_name',\n", " 'from_station_id',\n", " 'to_station_id',\n", " 'usertype',\n", " 'gender',\n", " 'birthyear']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns.tolist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smaller_df = df.loc[[1,4,6,7,9,34],:]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smaller_df.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``shape`` attribute shows us the number of elements:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(275091, 12)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``columns`` attribute gives us the column names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``index`` attribute gives us the index names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``dtypes`` attribute gives the data types of each column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sophisticated Data Manipulation\n", "\n", "Here we'll cover some key features of manipulating data with pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Access columns by name using square-bracket indexing:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trip_idstarttimestoptimebikeidtripdurationfrom_station_nameto_station_namefrom_station_idto_station_idusertypegenderbirthyear
043110/13/2014 10:31:00 AM10/13/2014 10:48:00 AMSEA00298985.9352nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1960.0
143210/13/2014 10:32:00 AM10/13/2014 10:48:00 AMSEA00195926.3752nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1970.0
243310/13/2014 10:33:00 AM10/13/2014 10:48:00 AMSEA00486883.8312nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberFemale1988.0
343410/13/2014 10:34:00 AM10/13/2014 10:48:00 AMSEA00333865.9372nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberFemale1977.0
443510/13/2014 10:34:00 AM10/13/2014 10:49:00 AMSEA00202923.9232nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1971.0
\n", "
" ], "text/plain": [ " trip_id starttime stoptime bikeid \\\n", "0 431 10/13/2014 10:31:00 AM 10/13/2014 10:48:00 AM SEA00298 \n", "1 432 10/13/2014 10:32:00 AM 10/13/2014 10:48:00 AM SEA00195 \n", "2 433 10/13/2014 10:33:00 AM 10/13/2014 10:48:00 AM SEA00486 \n", "3 434 10/13/2014 10:34:00 AM 10/13/2014 10:48:00 AM SEA00333 \n", "4 435 10/13/2014 10:34:00 AM 10/13/2014 10:49:00 AM SEA00202 \n", "\n", " tripduration from_station_name \\\n", "0 985.935 2nd Ave & Spring St \n", "1 926.375 2nd Ave & Spring St \n", "2 883.831 2nd Ave & Spring St \n", "3 865.937 2nd Ave & Spring St \n", "4 923.923 2nd Ave & Spring St \n", "\n", " to_station_name from_station_id \\\n", "0 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "1 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "2 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "3 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "4 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "\n", " to_station_id usertype gender birthyear \n", "0 PS-04 Member Male 1960.0 \n", "1 PS-04 Member Male 1970.0 \n", "2 PS-04 Member Female 1988.0 \n", "3 PS-04 Member Female 1977.0 \n", "4 PS-04 Member Male 1971.0 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "df_small = df['stoptime']" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df_small)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10/13/2014 10:48:00 AM\n", "1 10/13/2014 10:48:00 AM\n", "2 10/13/2014 10:48:00 AM\n", "3 10/13/2014 10:48:00 AM\n", "4 10/13/2014 10:49:00 AM\n", "Name: stoptime, dtype: object" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_small.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(df_small)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_small.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mathematical operations on columns happen *element-wise*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trip_duration_hours = df['tripduration']/3600\n", "trip_duration_hours[:2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trip_duration_hours.head()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "df['trip_duration_hours'] = df['tripduration']/3600" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "del df['trip_duration_hours']" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trip_idstarttimestoptimebikeidtripdurationfrom_station_nameto_station_namefrom_station_idto_station_idusertypegenderbirthyear
043110/13/2014 10:31:00 AM10/13/2014 10:48:00 AMSEA00298985.9352nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1960.0
143210/13/2014 10:32:00 AM10/13/2014 10:48:00 AMSEA00195926.3752nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1970.0
243310/13/2014 10:33:00 AM10/13/2014 10:48:00 AMSEA00486883.8312nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberFemale1988.0
343410/13/2014 10:34:00 AM10/13/2014 10:48:00 AMSEA00333865.9372nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberFemale1977.0
443510/13/2014 10:34:00 AM10/13/2014 10:49:00 AMSEA00202923.9232nd Ave & Spring StOccidental Park / Occidental Ave S & S Washing...CBD-06PS-04MemberMale1971.0
\n", "
" ], "text/plain": [ " trip_id starttime stoptime bikeid \\\n", "0 431 10/13/2014 10:31:00 AM 10/13/2014 10:48:00 AM SEA00298 \n", "1 432 10/13/2014 10:32:00 AM 10/13/2014 10:48:00 AM SEA00195 \n", "2 433 10/13/2014 10:33:00 AM 10/13/2014 10:48:00 AM SEA00486 \n", "3 434 10/13/2014 10:34:00 AM 10/13/2014 10:48:00 AM SEA00333 \n", "4 435 10/13/2014 10:34:00 AM 10/13/2014 10:49:00 AM SEA00202 \n", "\n", " tripduration from_station_name \\\n", "0 985.935 2nd Ave & Spring St \n", "1 926.375 2nd Ave & Spring St \n", "2 883.831 2nd Ave & Spring St \n", "3 865.937 2nd Ave & Spring St \n", "4 923.923 2nd Ave & Spring St \n", "\n", " to_station_name from_station_id \\\n", "0 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "1 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "2 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "3 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "4 Occidental Park / Occidental Ave S & S Washing... CBD-06 \n", "\n", " to_station_id usertype gender birthyear \n", "0 PS-04 Member Male 1960.0 \n", "1 PS-04 Member Male 1970.0 \n", "2 PS-04 Member Female 1988.0 \n", "3 PS-04 Member Female 1977.0 \n", "4 PS-04 Member Male 1971.0 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
starttimestoptime
010/13/2014 10:31:00 AM10/13/2014 10:48:00 AM
410/13/2014 10:34:00 AM10/13/2014 10:49:00 AM
\n", "
" ], "text/plain": [ " starttime stoptime\n", "0 10/13/2014 10:31:00 AM 10/13/2014 10:48:00 AM\n", "4 10/13/2014 10:34:00 AM 10/13/2014 10:49:00 AM" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[[0,4],['starttime', 'stoptime']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_long_trips = df[df['tripduration'] >10000]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sel = df['tripduration'] > 10000\n", "df_long_trips = df[sel]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_long_trips" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[sel].shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make a copy of a slice\n", "df_subset = df[['starttime', 'stoptime']].copy()\n", "df_subset['trip_hours'] = df['tripduration']/3600" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Columns can be created (or overwritten) with the assignment operator.\n", "Let's create a *tripminutes* column with the number of minutes for each trip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More complicated mathematical operations can be done with tools in the ``numpy`` package:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with Times" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One trick to know when working with columns of times is that Pandas ``DateTimeIndex`` provides a nice interface for working with columns of times.\n", "\n", "For a dataset of this size, using ``pd.to_datetime`` and specifying the date format can make things much faster (from the [strftime reference](http://strftime.org/), we see that the pronto data has format ``\"%m/%d/%Y %I:%M:%S %p\"``" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Note: you can also use ``infer_datetime_format=True`` in most cases to automatically infer the correct format, though due to a bug it doesn't work when AM/PM are present)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With it, we can extract, the hour of the day, the day of the week, the month, and a wide range of other views of the time:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simple Grouping of Data\n", "\n", "The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at *value counts* and the basics of *group-by* operations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Value Counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas includes an array of useful functionality for manipulating and analyzing tabular data.\n", "We'll take a look at two of these here." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``pandas.value_counts`` returns statistics on the unique values within each column.\n", "\n", "We can use it, for example, to break down rides by gender:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.value_counts(df[\"gender\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or to break down rides by age:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.value_counts(2019 - df[\"birthyear\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the values rather than the index are sorted. Use ``sort=False`` to turn this behavior off:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.value_counts(df[\"birthyear\"], sort=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can explore other things as well: day of week, hour of day, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Group-by Operation\n", "\n", "One of the killer features of the Pandas dataframe is the ability to do group-by operations.\n", "You can visualize the group-by like this (image borrowed from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count
from_station_id
BT-0110463
BT-037334
BT-044666
BT-055699
BT-06150
\n", "
" ], "text/plain": [ " count\n", "from_station_id \n", "BT-01 10463\n", "BT-03 7334\n", "BT-04 4666\n", "BT-05 5699\n", "BT-06 150" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sub = df[['from_station_id', 'trip_id']]\n", "df_count = df_sub.groupby(['from_station_id']).count()\n", "df_count.columns = ['count']\n", "df_count.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_mean = df.groupby(['from_station_id']).mean()\n", "df_mean.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfgroup = df.groupby(['from_station_id'])\n", "dfgroup.groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)\n", "\n", "```\n", ".groupby().()\n", "```\n", "\n", "for example, we can group by gender and find the average of all numerical columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby(gender).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's also possible to index the grouped object like it is a dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can even group by multiple values: for example we can look at the trip duration by time of day and by gender:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``unstack()`` operation can help make sense of this type of multiply-grouped data. What this technically does is split a multiple-valued index into an index plus columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing data with ``pandas``\n", "\n", "Of course, looking at tables of data is not very intuitive.\n", "Fortunately Pandas has many useful plotting functions built-in, all of which make use of the ``matplotlib`` library to generate plots.\n", "\n", "Whenever you do plotting in the IPython notebook, you will want to first run this *magic command* which configures the notebook to work well with plots:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can simply call the ``plot()`` method of any series or dataframe to get a reasonable view of the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "df['tripduration'].hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adjusting the Plot Style\n", "\n", "Matplotlib has a number of plot styles you can use. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing data with ``pandas``\n", "\n", "Of course, looking at tables of data is not very intuitive.\n", "Fortunately Pandas has many useful plotting functions built in, all of which make use of the ``matplotlib`` library to generate plots.\n", "\n", "Whenever you do plotting in the Jupyter notebook, you will want to first run this *magic command*, which configures the notebook to work well with plots:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can simply call the ``plot()`` (or, as here, ``hist()``) method of any series or dataframe to get a reasonable view of the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "df['tripduration'].hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adjusting the Plot Style\n", "\n", "Matplotlib has a number of plot styles you can use. For example, if you like R, you might use the ggplot style:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.style.use(\"ggplot\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other plot types\n", "\n", "Pandas supports a range of other plotting types; you can find these by using tab-completion on the ``plot`` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.plot." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, we can create a histogram of trip durations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you'd like to adjust the x and y limits of the plot, you can use the ``set_xlim()`` and ``set_ylim()`` methods of the resulting object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Breakout: Exploring the Data\n", "\n", "Make a plot of the total number of rides as a function of month of the year (you'll need to extract the month, use a ``groupby``, and find the appropriate aggregation to count the number of rides in each group)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split this plot by gender. Do you see any seasonal ridership patterns by gender?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split this plot by user type. Do you see any seasonal ridership patterns by user type?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Repeat the above three steps, counting the number of rides by time of day rather than by month." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are there any other interesting insights you can discover in the data using these tools?"
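, "\n", "\n", "*Hint:* one possible shape for the first breakout plot, as a sketch rather than the definitive answer (it assumes the ``starttime`` parsing shown earlier and counts trips via ``trip_id``):\n", "\n", "```python\n", "month = pd.to_datetime(df['starttime'], format='%m/%d/%Y %I:%M:%S %p').dt.month\n", "df.groupby(month)['trip_id'].count().plot(kind='bar')\n", "```"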
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Files\n", "- Writing and running python modules\n", "- Using python modules in your Jupyter Notebook" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A script for creating a dataframe with counts of the occurrence of a columns' values\n", "df_count = df.groupby('from_station_id').count()\n", "df_count1 = df_count[['trip_id']]\n", "df_count2 = df_count1.rename(columns={'trip_id': 'count'})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_count2.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def make_table_count(df_arg, groupby_column):\n", " df_count = df_arg.groupby(groupby_column).count()\n", " column_name = df.columns[0]\n", " df_count1 = df_count[[column_name]]\n", " df_count2 = df_count1.rename(columns={column_name: 'count'})\n", " return df_count2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dff = make_table_count(df, 'from_station_id')\n", "dff.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 1 }