{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the most common tasks for pandas and python is to automate the process to aggregate data from multiple spreadsheets and files.\n", "\n", "This article will walk through the basic flow required to parse multiple excel files, combine some data, clean it up and analyze it.\n", "\n", "Please refer to [this post](http://pbpython.com/excel-file-combine.html) for the full post." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Collecting the Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import pandas and numpy" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the files in our input directory, using the convenient shell commands in ipython." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All-Web-Site-Data-Audience-Overview.xlsx sales-jan-2014.xlsx\r\n", "customer-status.xlsx\t\t\t sales-mar-2014.xlsx\r\n", "March-2017-forecast-article.xlsx\t sales_transactions.xlsx\r\n", "mn-budget-detail-2014.csv\t\t sample-sales-reps.xlsx\r\n", "sales-estimate.xlsx\t\t\t sample-sales-tax.csv\r\n", "sales-feb-2014.xlsx\t\t\t sample-salesv3.xlsx\r\n", "salesfunnel.xlsx\r\n" ] } ], "source": [ "!ls ../data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a lot of files, but we only want to look at the sales .xlsx files." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "../data/sales-feb-2014.xlsx ../data/sales-mar-2014.xlsx\r\n", "../data/sales-jan-2014.xlsx\r\n" ] } ], "source": [ "!ls ../data/sales-*-2014.xlsx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the python glob module to easily list out the files we need" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import glob" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['../data/sales-feb-2014.xlsx',\n", " '../data/sales-jan-2014.xlsx',\n", " '../data/sales-mar-2014.xlsx']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glob.glob(\"../data/sales-*-2014.xlsx\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us what we need, let's import each of our files and combine them into one file. \n", "\n", "Panda's concat and append can do this for us. I'm going to use append in this example.\n", "\n", "The code snippet below will initialize a blank DataFrame then append all of the individual files into the all_data DataFrame." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "all_data = pd.DataFrame()\n", "for f in glob.glob(\"../data/sales-*-2014.xlsx\"):\n", " df = pd.read_excel(f)\n", " all_data = all_data.append(df,ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have all the data in our all_data DataFrame. You can use describe to look at it and make sure you data looks good." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numberquantityunit priceext price
count384.000000384.000000384.000000384.000000
mean478125.98958324.37239656.6514061394.517344
std220902.94740114.37321927.0758831117.809743
min141962.000000-1.00000010.210000-97.160000
25%257198.00000012.00000032.612500482.745000
50%424914.00000023.50000058.1600001098.710000
75%714466.00000037.00000080.9650002132.260000
max786968.00000049.00000099.7300004590.810000
\n", "
" ], "text/plain": [ " account number quantity unit price ext price\n", "count 384.000000 384.000000 384.000000 384.000000\n", "mean 478125.989583 24.372396 56.651406 1394.517344\n", "std 220902.947401 14.373219 27.075883 1117.809743\n", "min 141962.000000 -1.000000 10.210000 -97.160000\n", "25% 257198.000000 12.000000 32.612500 482.745000\n", "50% 424914.000000 23.500000 58.160000 1098.710000\n", "75% 714466.000000 37.000000 80.965000 2132.260000\n", "max 786968.000000 49.000000 99.730000 4590.810000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alot of this data may not make much sense for this data set but I'm most interested in the count row to make sure the number of data elements makes sense." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedate
0383080Will LLCB1-20000733.69235.832014-02-01 09:04:59
1412290Jerde-HilpertS1-277221121.12232.322014-02-01 11:51:46
2412290Jerde-HilpertB1-86481335.99107.972014-02-01 17:24:32
3412290Jerde-HilpertB1-200002378.901814.702014-02-01 19:56:48
4672390Kuhn-GusikowskiS1-065324855.822679.362014-02-02 03:45:20
\n", "
" ], "text/plain": [ " account number name sku quantity unit price ext price \\\n", "0 383080 Will LLC B1-20000 7 33.69 235.83 \n", "1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 \n", "2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 \n", "3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 \n", "4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 \n", "\n", " date \n", "0 2014-02-01 09:04:59 \n", "1 2014-02-01 11:51:46 \n", "2 2014-02-01 17:24:32 \n", "3 2014-02-01 19:56:48 \n", "4 2014-02-02 03:45:20 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is not critical in this example but the best practice is to convert the date column to a date time object." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "all_data['date'] = pd.to_datetime(all_data['date'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Combining Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have all of the data into one DataFrame, we can do any manipulations the DataFrame supports. In this case, the next thing we want to do is read in another file that contains the customer status by account. You can think of this as a company's customer segmentation strategy or some other mechanism for identifying their customers.\n", "\n", "First, we read in the data." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernamestatus
0740150Barton LLCgold
1714466Trantow-Barrowssilver
2218895Kulas Incbronze
3307599Kassulke, Ondricka and Metzbronze
4412290Jerde-Hilpertbronze
5729833Koepp Ltdsilver
6146832Kiehn-Spinkasilver
7688981Keeling LLCsilver
8786968Frami, Hills and Schmidtsilver
9239344Stokes LLCgold
10672390Kuhn-Gusikowskisilver
11141962Herman LLCgold
12424914White-Trantowsilver
13527099Sanford and Sonsbronze
14642753Pollich LLCbronze
15257198Cronin, Oberbrunner and Spencergold
\n", "
" ], "text/plain": [ " account number name status\n", "0 740150 Barton LLC gold\n", "1 714466 Trantow-Barrows silver\n", "2 218895 Kulas Inc bronze\n", "3 307599 Kassulke, Ondricka and Metz bronze\n", "4 412290 Jerde-Hilpert bronze\n", "5 729833 Koepp Ltd silver\n", "6 146832 Kiehn-Spinka silver\n", "7 688981 Keeling LLC silver\n", "8 786968 Frami, Hills and Schmidt silver\n", "9 239344 Stokes LLC gold\n", "10 672390 Kuhn-Gusikowski silver\n", "11 141962 Herman LLC gold\n", "12 424914 White-Trantow silver\n", "13 527099 Sanford and Sons bronze\n", "14 642753 Pollich LLC bronze\n", "15 257198 Cronin, Oberbrunner and Spencer gold" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "status = pd.read_excel(\"../data/customer-status.xlsx\")\n", "status" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to merge this data with our concatenated data set of sales. We use panda's merge function and tell it to do a left join which is similar to Excel's vlookup function." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
0383080Will LLCB1-20000733.69235.832014-02-01 09:04:59NaN
1412290Jerde-HilpertS1-277221121.12232.322014-02-01 11:51:46bronze
2412290Jerde-HilpertB1-86481335.99107.972014-02-01 17:24:32bronze
3412290Jerde-HilpertB1-200002378.901814.702014-02-01 19:56:48bronze
4672390Kuhn-GusikowskiS1-065324855.822679.362014-02-02 03:45:20silver
\n", "
" ], "text/plain": [ " account number name sku quantity unit price ext price \\\n", "0 383080 Will LLC B1-20000 7 33.69 235.83 \n", "1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 \n", "2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 \n", "3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 \n", "4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 \n", "\n", " date status \n", "0 2014-02-01 09:04:59 NaN \n", "1 2014-02-01 11:51:46 bronze \n", "2 2014-02-01 17:24:32 bronze \n", "3 2014-02-01 19:56:48 bronze \n", "4 2014-02-02 03:45:20 silver " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st = pd.merge(all_data, status, how='left')\n", "all_data_st.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks pretty good but let's look at a specific account." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
15737550Fritsch, Russel and AndersonS1-474124051.012040.402014-02-05 01:20:40NaN
25737550Fritsch, Russel and AndersonS1-065323418.69635.462014-02-07 09:22:02NaN
66737550Fritsch, Russel and AndersonS1-277221570.231053.452014-02-16 18:24:42NaN
78737550Fritsch, Russel and AndersonS2-340772693.352427.102014-02-20 18:45:43NaN
80737550Fritsch, Russel and AndersonS1-936833110.52326.122014-02-21 13:55:45NaN
\n", "
" ], "text/plain": [ " account number name sku quantity \\\n", "15 737550 Fritsch, Russel and Anderson S1-47412 40 \n", "25 737550 Fritsch, Russel and Anderson S1-06532 34 \n", "66 737550 Fritsch, Russel and Anderson S1-27722 15 \n", "78 737550 Fritsch, Russel and Anderson S2-34077 26 \n", "80 737550 Fritsch, Russel and Anderson S1-93683 31 \n", "\n", " unit price ext price date status \n", "15 51.01 2040.40 2014-02-05 01:20:40 NaN \n", "25 18.69 635.46 2014-02-07 09:22:02 NaN \n", "66 70.23 1053.45 2014-02-16 18:24:42 NaN \n", "78 93.35 2427.10 2014-02-20 18:45:43 NaN \n", "80 10.52 326.12 2014-02-21 13:55:45 NaN " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st[all_data_st[\"account number\"]==737550].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This account number was not in our status file, so we have a bunch of NaN's. We can decide how we want to handle this situation. For this specific case, let's label all missing accounts as bronze. Use the fillna function to easily accomplish this on the status column." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
0383080Will LLCB1-20000733.69235.832014-02-01 09:04:59bronze
1412290Jerde-HilpertS1-277221121.12232.322014-02-01 11:51:46bronze
2412290Jerde-HilpertB1-86481335.99107.972014-02-01 17:24:32bronze
3412290Jerde-HilpertB1-200002378.901814.702014-02-01 19:56:48bronze
4672390Kuhn-GusikowskiS1-065324855.822679.362014-02-02 03:45:20silver
\n", "
" ], "text/plain": [ " account number name sku quantity unit price ext price \\\n", "0 383080 Will LLC B1-20000 7 33.69 235.83 \n", "1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 \n", "2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 \n", "3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 \n", "4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 \n", "\n", " date status \n", "0 2014-02-01 09:04:59 bronze \n", "1 2014-02-01 11:51:46 bronze \n", "2 2014-02-01 17:24:32 bronze \n", "3 2014-02-01 19:56:48 bronze \n", "4 2014-02-02 03:45:20 silver " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st['status'].fillna('bronze',inplace=True)\n", "all_data_st.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the data just to make sure we're all good." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
15737550Fritsch, Russel and AndersonS1-474124051.012040.402014-02-05 01:20:40bronze
25737550Fritsch, Russel and AndersonS1-065323418.69635.462014-02-07 09:22:02bronze
66737550Fritsch, Russel and AndersonS1-277221570.231053.452014-02-16 18:24:42bronze
78737550Fritsch, Russel and AndersonS2-340772693.352427.102014-02-20 18:45:43bronze
80737550Fritsch, Russel and AndersonS1-936833110.52326.122014-02-21 13:55:45bronze
\n", "
" ], "text/plain": [ " account number name sku quantity \\\n", "15 737550 Fritsch, Russel and Anderson S1-47412 40 \n", "25 737550 Fritsch, Russel and Anderson S1-06532 34 \n", "66 737550 Fritsch, Russel and Anderson S1-27722 15 \n", "78 737550 Fritsch, Russel and Anderson S2-34077 26 \n", "80 737550 Fritsch, Russel and Anderson S1-93683 31 \n", "\n", " unit price ext price date status \n", "15 51.01 2040.40 2014-02-05 01:20:40 bronze \n", "25 18.69 635.46 2014-02-07 09:22:02 bronze \n", "66 70.23 1053.45 2014-02-16 18:24:42 bronze \n", "78 93.35 2427.10 2014-02-20 18:45:43 bronze \n", "80 10.52 326.12 2014-02-21 13:55:45 bronze " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st[all_data_st[\"account number\"]==737550].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have all of the data along with the status column filled in. We can do our normal data manipulations using the full suite of pandas capability." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using Categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the relatively new functions in pandas is support for categorical data. From the pandas, documentation -\n", "\n", "\"Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.\"\n", "\n", "For our purposes, the status field is a good candidate for a category type.\n", "\n", "You must make sure you have a recent version of pandas installed for this example to work." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.20.2'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we typecast it to a category using astype." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "all_data_st[\"status\"] = all_data_st[\"status\"].astype(\"category\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This doesn't immediately appear to change anything yet." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
0383080Will LLCB1-20000733.69235.832014-02-01 09:04:59bronze
1412290Jerde-HilpertS1-277221121.12232.322014-02-01 11:51:46bronze
2412290Jerde-HilpertB1-86481335.99107.972014-02-01 17:24:32bronze
3412290Jerde-HilpertB1-200002378.901814.702014-02-01 19:56:48bronze
4672390Kuhn-GusikowskiS1-065324855.822679.362014-02-02 03:45:20silver
\n", "
" ], "text/plain": [ " account number name sku quantity unit price ext price \\\n", "0 383080 Will LLC B1-20000 7 33.69 235.83 \n", "1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 \n", "2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 \n", "3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 \n", "4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 \n", "\n", " date status \n", "0 2014-02-01 09:04:59 bronze \n", "1 2014-02-01 11:51:46 bronze \n", "2 2014-02-01 17:24:32 bronze \n", "3 2014-02-01 19:56:48 bronze \n", "4 2014-02-02 03:45:20 silver " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Buy you can see that it is a new data type." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "account number int64\n", "name object\n", "sku object\n", "quantity int64\n", "unit price float64\n", "ext price float64\n", "date datetime64[ns]\n", "status category\n", "dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Categories get more interesting when you assign order to the categories. Right now, if we call sort on the column, it will sort alphabetically. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
0383080Will LLCB1-20000733.69235.832014-02-01 09:04:59bronze
196218895Kulas IncS2-838814178.273209.072014-01-20 09:37:58bronze
197383080Will LLCB1-333642690.192344.942014-01-20 09:39:59bronze
198604255Halvorson, Crona and ChamplinS2-114813796.713578.272014-01-20 13:07:28bronze
200527099Sanford and SonsB1-059141864.321157.762014-01-20 21:40:58bronze
\n", "
" ], "text/plain": [ " account number name sku quantity \\\n", "0 383080 Will LLC B1-20000 7 \n", "196 218895 Kulas Inc S2-83881 41 \n", "197 383080 Will LLC B1-33364 26 \n", "198 604255 Halvorson, Crona and Champlin S2-11481 37 \n", "200 527099 Sanford and Sons B1-05914 18 \n", "\n", " unit price ext price date status \n", "0 33.69 235.83 2014-02-01 09:04:59 bronze \n", "196 78.27 3209.07 2014-01-20 09:37:58 bronze \n", "197 90.19 2344.94 2014-01-20 09:39:59 bronze \n", "198 96.71 3578.27 2014-01-20 13:07:28 bronze \n", "200 64.32 1157.76 2014-01-20 21:40:58 bronze " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st.sort_values(by=[\"status\"]).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use set_categories to tell it the order we want to use for this category object. In this case, we use the Olympic medal ordering." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ " all_data_st[\"status\"].cat.set_categories([ \"gold\",\"silver\",\"bronze\"],inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can sort it so that gold shows on top." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
68740150Barton LLCB1-388511781.221380.742014-02-17 17:12:16gold
63257198Cronin, Oberbrunner and SpencerS1-277222810.21285.882014-02-15 17:27:44gold
207740150Barton LLCB1-864812030.41608.202014-01-22 16:33:51gold
61740150Barton LLCB1-200002881.392278.922014-02-15 07:45:16gold
60239344Stokes LLCS2-838813043.001290.002014-02-15 02:13:23gold
\n", "
" ], "text/plain": [ " account number name sku quantity \\\n", "68 740150 Barton LLC B1-38851 17 \n", "63 257198 Cronin, Oberbrunner and Spencer S1-27722 28 \n", "207 740150 Barton LLC B1-86481 20 \n", "61 740150 Barton LLC B1-20000 28 \n", "60 239344 Stokes LLC S2-83881 30 \n", "\n", " unit price ext price date status \n", "68 81.22 1380.74 2014-02-17 17:12:16 gold \n", "63 10.21 285.88 2014-02-15 17:27:44 gold \n", "207 30.41 608.20 2014-01-22 16:33:51 gold \n", "61 81.39 2278.92 2014-02-15 07:45:16 gold \n", "60 43.00 1290.00 2014-02-15 02:13:23 gold " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st.sort_values(by=[\"status\"]).head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 384\n", "unique 3\n", "top bronze\n", "freq 172\n", "Name: status, dtype: object" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st[\"status\"].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, if you want to take a quick look at how your top tier customers are performaing compared to the bottom. Use groupby to give us the average of the values." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
quantityunit priceext price
status
gold24.37500053.7238891351.944583
silver22.84285757.2727141320.032214
bronze25.61627957.3711631472.965930
\n", "
" ], "text/plain": [ " quantity unit price ext price\n", "status \n", "gold 24.375000 53.723889 1351.944583\n", "silver 22.842857 57.272714 1320.032214\n", "bronze 25.616279 57.371163 1472.965930" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st.groupby([\"status\"])[\"quantity\",\"unit price\",\"ext price\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, you can run multiple aggregation functions on the data to get really useful information " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
quantityunit priceext price
summeanstdsummeanstdsummeanstd
status
gold175524.37500014.5751453868.1253.72388928.74008097340.011351.9445831182.657312
silver319822.84285714.5128438018.1857.27271426.556242184804.511320.0322141086.384051
bronze440625.61627914.1360719867.8457.37116326.857370253350.141472.9659301116.683843
\n", "
" ], "text/plain": [ " quantity unit price \\\n", " sum mean std sum mean std \n", "status \n", "gold 1755 24.375000 14.575145 3868.12 53.723889 28.740080 \n", "silver 3198 22.842857 14.512843 8018.18 57.272714 26.556242 \n", "bronze 4406 25.616279 14.136071 9867.84 57.371163 26.857370 \n", "\n", " ext price \n", " sum mean std \n", "status \n", "gold 97340.01 1351.944583 1182.657312 \n", "silver 184804.51 1320.032214 1086.384051 \n", "bronze 253350.14 1472.965930 1116.683843 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st.groupby([\"status\"])[\"quantity\",\"unit price\",\"ext price\"].agg([np.sum,np.mean, np.std])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, what does this tell you? Well, the data is completely random but my first observation is that we sell more units to our bronze customers than gold. Even when you look at the total dollar value associated with bronze vs. gold, it looks backwards.\n", "\n", "Maybe we should look at how many bronze customers we have and see what is going on.\n", "\n", "What I plan to do is filter out the unique accounts and see how many gold, silver and bronze customers there are.\n", "\n", "I'm purposely stringing a lot of commands together which is not necessarily best practice but does show how powerful pandas can be. Feel free to review my previous articles and play with this command yourself to understand what all these commands mean." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "status\n", "gold 4\n", "silver 7\n", "bronze 9\n", "Name: name, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_st.drop_duplicates(subset=[\"account number\",\"name\"]).iloc[:,[0,1,7]].groupby([\"status\"])[\"name\"].count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok. This makes a little more sense. We see that we have 9 bronze customers and only 4 customers. That is probably why the volumes are so skewed towards our bronze customers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }