{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the most common tasks for pandas and python is to automate the process to aggregate data from multiple spreadsheets and files.\n",
    "\n",
    "This article will walk through the basic flow required to parse multiple excel files, combine some data, clean it up and analyze it.\n",
    "\n",
    "Please refer to [this post](http://pbpython.com/excel-file-combine.html) for the full post."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Collecting the Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import pandas and numpy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the files in our input directory, using the convenient shell commands in ipython."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "All-Web-Site-Data-Audience-Overview.xlsx  sales-jan-2014.xlsx\r\n",
      "customer-status.xlsx\t\t\t  sales-mar-2014.xlsx\r\n",
      "March-2017-forecast-article.xlsx\t  sales_transactions.xlsx\r\n",
      "mn-budget-detail-2014.csv\t\t  sample-sales-reps.xlsx\r\n",
      "sales-estimate.xlsx\t\t\t  sample-sales-tax.csv\r\n",
      "sales-feb-2014.xlsx\t\t\t  sample-salesv3.xlsx\r\n",
      "salesfunnel.xlsx\r\n"
     ]
    }
   ],
   "source": [
    "!ls ../data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a lot of files, but we only want to look at the sales .xlsx files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "../data/sales-feb-2014.xlsx  ../data/sales-mar-2014.xlsx\r\n",
      "../data/sales-jan-2014.xlsx\r\n"
     ]
    }
   ],
   "source": [
    "!ls ../data/sales-*-2014.xlsx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the python glob module to easily list out the files we need"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['../data/sales-feb-2014.xlsx',\n",
       " '../data/sales-jan-2014.xlsx',\n",
       " '../data/sales-mar-2014.xlsx']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "glob.glob(\"../data/sales-*-2014.xlsx\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This gives us what we need, let's import each of our files and combine them into one file. \n",
    "\n",
    "Panda's concat and append can do this for us. I'm going to use append in this example.\n",
    "\n",
    "The code snippet below will initialize a blank DataFrame then append all of the individual files into the all_data DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_data = pd.DataFrame()\n",
    "for f in glob.glob(\"../data/sales-*-2014.xlsx\"):\n",
    "    df = pd.read_excel(f)\n",
    "    all_data = all_data.append(df,ignore_index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have all the data in our all_data DataFrame. You can use describe to look at it and make sure you data looks good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>account number</th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit price</th>\n",
       "      <th>ext price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>384.000000</td>\n",
       "      <td>384.000000</td>\n",
       "      <td>384.000000</td>\n",
       "      <td>384.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>478125.989583</td>\n",
       "      <td>24.372396</td>\n",
       "      <td>56.651406</td>\n",
       "      <td>1394.517344</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>220902.947401</td>\n",
       "      <td>14.373219</td>\n",
       "      <td>27.075883</td>\n",
       "      <td>1117.809743</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>141962.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>10.210000</td>\n",
       "      <td>-97.160000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>257198.000000</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>32.612500</td>\n",
       "      <td>482.745000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>424914.000000</td>\n",
       "      <td>23.500000</td>\n",
       "      <td>58.160000</td>\n",
       "      <td>1098.710000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>714466.000000</td>\n",
       "      <td>37.000000</td>\n",
       "      <td>80.965000</td>\n",
       "      <td>2132.260000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>786968.000000</td>\n",
       "      <td>49.000000</td>\n",
       "      <td>99.730000</td>\n",
       "      <td>4590.810000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       account number    quantity  unit price    ext price\n",
       "count      384.000000  384.000000  384.000000   384.000000\n",
       "mean    478125.989583   24.372396   56.651406  1394.517344\n",
       "std     220902.947401   14.373219   27.075883  1117.809743\n",
       "min     141962.000000   -1.000000   10.210000   -97.160000\n",
       "25%     257198.000000   12.000000   32.612500   482.745000\n",
       "50%     424914.000000   23.500000   58.160000  1098.710000\n",
       "75%     714466.000000   37.000000   80.965000  2132.260000\n",
       "max     786968.000000   49.000000   99.730000  4590.810000"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alot of this data may not make much sense for this data set but I'm most interested in the count row to make sure the number of data elements makes sense."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>account number</th>\n",
       "      <th>name</th>\n",
       "      <th>sku</th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit price</th>\n",
       "      <th>ext price</th>\n",
       "      <th>date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>383080</td>\n",
       "      <td>Will LLC</td>\n",
       "      <td>B1-20000</td>\n",
       "      <td>7</td>\n",
       "      <td>33.69</td>\n",
       "      <td>235.83</td>\n",
       "      <td>2014-02-01 09:04:59</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>S1-27722</td>\n",
       "      <td>11</td>\n",
       "      <td>21.12</td>\n",
       "      <td>232.32</td>\n",
       "      <td>2014-02-01 11:51:46</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>B1-86481</td>\n",
       "      <td>3</td>\n",
       "      <td>35.99</td>\n",
       "      <td>107.97</td>\n",
       "      <td>2014-02-01 17:24:32</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>B1-20000</td>\n",
       "      <td>23</td>\n",
       "      <td>78.90</td>\n",
       "      <td>1814.70</td>\n",
       "      <td>2014-02-01 19:56:48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>672390</td>\n",
       "      <td>Kuhn-Gusikowski</td>\n",
       "      <td>S1-06532</td>\n",
       "      <td>48</td>\n",
       "      <td>55.82</td>\n",
       "      <td>2679.36</td>\n",
       "      <td>2014-02-02 03:45:20</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   account number             name       sku  quantity  unit price  ext price  \\\n",
       "0          383080         Will LLC  B1-20000         7       33.69     235.83   \n",
       "1          412290    Jerde-Hilpert  S1-27722        11       21.12     232.32   \n",
       "2          412290    Jerde-Hilpert  B1-86481         3       35.99     107.97   \n",
       "3          412290    Jerde-Hilpert  B1-20000        23       78.90    1814.70   \n",
       "4          672390  Kuhn-Gusikowski  S1-06532        48       55.82    2679.36   \n",
       "\n",
       "                  date  \n",
       "0  2014-02-01 09:04:59  \n",
       "1  2014-02-01 11:51:46  \n",
       "2  2014-02-01 17:24:32  \n",
       "3  2014-02-01 19:56:48  \n",
       "4  2014-02-02 03:45:20  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is not critical in this example but the best practice is to convert the date column to a date time object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_data['date'] = pd.to_datetime(all_data['date'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Combining Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have all of the data into one DataFrame, we can do any manipulations the DataFrame supports. In this case, the next thing we want to do is read in another file that contains the customer status by account. You can think of this as a company's customer segmentation strategy or some other mechanism for identifying their customers.\n",
    "\n",
    "First, we read in the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>account number</th>\n",
       "      <th>name</th>\n",
       "      <th>status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>740150</td>\n",
       "      <td>Barton LLC</td>\n",
       "      <td>gold</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>714466</td>\n",
       "      <td>Trantow-Barrows</td>\n",
       "      <td>silver</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>218895</td>\n",
       "      <td>Kulas Inc</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>307599</td>\n",
       "      <td>Kassulke, Ondricka and Metz</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>729833</td>\n",
       "      <td>Koepp Ltd</td>\n",
       "      <td>silver</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>146832</td>\n",
       "      <td>Kiehn-Spinka</td>\n",
       "      <td>silver</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>688981</td>\n",
       "      <td>Keeling LLC</td>\n",
       "      <td>silver</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>786968</td>\n",
       "      <td>Frami, Hills and Schmidt</td>\n",
       "      <td>silver</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>239344</td>\n",
       "      <td>Stokes LLC</td>\n",
       "      <td>gold</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>672390</td>\n",
       "      <td>Kuhn-Gusikowski</td>\n",
       "      <td>silver</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>141962</td>\n",
       "      <td>Herman LLC</td>\n",
       "      <td>gold</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>424914</td>\n",
       "      <td>White-Trantow</td>\n",
       "      <td>silver</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>527099</td>\n",
       "      <td>Sanford and Sons</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>642753</td>\n",
       "      <td>Pollich LLC</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>257198</td>\n",
       "      <td>Cronin, Oberbrunner and Spencer</td>\n",
       "      <td>gold</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    account number                             name  status\n",
       "0           740150                       Barton LLC    gold\n",
       "1           714466                  Trantow-Barrows  silver\n",
       "2           218895                        Kulas Inc  bronze\n",
       "3           307599      Kassulke, Ondricka and Metz  bronze\n",
       "4           412290                    Jerde-Hilpert  bronze\n",
       "5           729833                        Koepp Ltd  silver\n",
       "6           146832                     Kiehn-Spinka  silver\n",
       "7           688981                      Keeling LLC  silver\n",
       "8           786968         Frami, Hills and Schmidt  silver\n",
       "9           239344                       Stokes LLC    gold\n",
       "10          672390                  Kuhn-Gusikowski  silver\n",
       "11          141962                       Herman LLC    gold\n",
       "12          424914                    White-Trantow  silver\n",
       "13          527099                 Sanford and Sons  bronze\n",
       "14          642753                      Pollich LLC  bronze\n",
       "15          257198  Cronin, Oberbrunner and Spencer    gold"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "status = pd.read_excel(\"../data/customer-status.xlsx\")\n",
    "status"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to merge this data with our concatenated data set of sales. We use panda's merge function and tell it to do a left join which is similar to Excel's vlookup function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>account number</th>\n",
       "      <th>name</th>\n",
       "      <th>sku</th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit price</th>\n",
       "      <th>ext price</th>\n",
       "      <th>date</th>\n",
       "      <th>status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>383080</td>\n",
       "      <td>Will LLC</td>\n",
       "      <td>B1-20000</td>\n",
       "      <td>7</td>\n",
       "      <td>33.69</td>\n",
       "      <td>235.83</td>\n",
       "      <td>2014-02-01 09:04:59</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>S1-27722</td>\n",
       "      <td>11</td>\n",
       "      <td>21.12</td>\n",
       "      <td>232.32</td>\n",
       "      <td>2014-02-01 11:51:46</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>B1-86481</td>\n",
       "      <td>3</td>\n",
       "      <td>35.99</td>\n",
       "      <td>107.97</td>\n",
       "      <td>2014-02-01 17:24:32</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>B1-20000</td>\n",
       "      <td>23</td>\n",
       "      <td>78.90</td>\n",
       "      <td>1814.70</td>\n",
       "      <td>2014-02-01 19:56:48</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>672390</td>\n",
       "      <td>Kuhn-Gusikowski</td>\n",
       "      <td>S1-06532</td>\n",
       "      <td>48</td>\n",
       "      <td>55.82</td>\n",
       "      <td>2679.36</td>\n",
       "      <td>2014-02-02 03:45:20</td>\n",
       "      <td>silver</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   account number             name       sku  quantity  unit price  ext price  \\\n",
       "0          383080         Will LLC  B1-20000         7       33.69     235.83   \n",
       "1          412290    Jerde-Hilpert  S1-27722        11       21.12     232.32   \n",
       "2          412290    Jerde-Hilpert  B1-86481         3       35.99     107.97   \n",
       "3          412290    Jerde-Hilpert  B1-20000        23       78.90    1814.70   \n",
       "4          672390  Kuhn-Gusikowski  S1-06532        48       55.82    2679.36   \n",
       "\n",
       "                 date  status  \n",
       "0 2014-02-01 09:04:59     NaN  \n",
       "1 2014-02-01 11:51:46  bronze  \n",
       "2 2014-02-01 17:24:32  bronze  \n",
       "3 2014-02-01 19:56:48  bronze  \n",
       "4 2014-02-02 03:45:20  silver  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st = pd.merge(all_data, status, how='left')\n",
    "all_data_st.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This looks pretty good but let's look at a specific account."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>account number</th>\n",
       "      <th>name</th>\n",
       "      <th>sku</th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit price</th>\n",
       "      <th>ext price</th>\n",
       "      <th>date</th>\n",
       "      <th>status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>737550</td>\n",
       "      <td>Fritsch, Russel and Anderson</td>\n",
       "      <td>S1-47412</td>\n",
       "      <td>40</td>\n",
       "      <td>51.01</td>\n",
       "      <td>2040.40</td>\n",
       "      <td>2014-02-05 01:20:40</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>737550</td>\n",
       "      <td>Fritsch, Russel and Anderson</td>\n",
       "      <td>S1-06532</td>\n",
       "      <td>34</td>\n",
       "      <td>18.69</td>\n",
       "      <td>635.46</td>\n",
       "      <td>2014-02-07 09:22:02</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>66</th>\n",
       "      <td>737550</td>\n",
       "      <td>Fritsch, Russel and Anderson</td>\n",
       "      <td>S1-27722</td>\n",
       "      <td>15</td>\n",
       "      <td>70.23</td>\n",
       "      <td>1053.45</td>\n",
       "      <td>2014-02-16 18:24:42</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>78</th>\n",
       "      <td>737550</td>\n",
       "      <td>Fritsch, Russel and Anderson</td>\n",
       "      <td>S2-34077</td>\n",
       "      <td>26</td>\n",
       "      <td>93.35</td>\n",
       "      <td>2427.10</td>\n",
       "      <td>2014-02-20 18:45:43</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>80</th>\n",
       "      <td>737550</td>\n",
       "      <td>Fritsch, Russel and Anderson</td>\n",
       "      <td>S1-93683</td>\n",
       "      <td>31</td>\n",
       "      <td>10.52</td>\n",
       "      <td>326.12</td>\n",
       "      <td>2014-02-21 13:55:45</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    account number                          name       sku  quantity  \\\n",
       "15          737550  Fritsch, Russel and Anderson  S1-47412        40   \n",
       "25          737550  Fritsch, Russel and Anderson  S1-06532        34   \n",
       "66          737550  Fritsch, Russel and Anderson  S1-27722        15   \n",
       "78          737550  Fritsch, Russel and Anderson  S2-34077        26   \n",
       "80          737550  Fritsch, Russel and Anderson  S1-93683        31   \n",
       "\n",
       "    unit price  ext price                date status  \n",
       "15       51.01    2040.40 2014-02-05 01:20:40    NaN  \n",
       "25       18.69     635.46 2014-02-07 09:22:02    NaN  \n",
       "66       70.23    1053.45 2014-02-16 18:24:42    NaN  \n",
       "78       93.35    2427.10 2014-02-20 18:45:43    NaN  \n",
       "80       10.52     326.12 2014-02-21 13:55:45    NaN  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st[all_data_st[\"account number\"]==737550].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This account number was not in our status file, so we have a bunch of NaN's. We can decide how we want to handle this situation. For this specific case, let's label all missing accounts as bronze. Use the fillna function to easily accomplish this on the status column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>account number</th>\n",
       "      <th>name</th>\n",
       "      <th>sku</th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit price</th>\n",
       "      <th>ext price</th>\n",
       "      <th>date</th>\n",
       "      <th>status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>383080</td>\n",
       "      <td>Will LLC</td>\n",
       "      <td>B1-20000</td>\n",
       "      <td>7</td>\n",
       "      <td>33.69</td>\n",
       "      <td>235.83</td>\n",
       "      <td>2014-02-01 09:04:59</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>S1-27722</td>\n",
       "      <td>11</td>\n",
       "      <td>21.12</td>\n",
       "      <td>232.32</td>\n",
       "      <td>2014-02-01 11:51:46</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>B1-86481</td>\n",
       "      <td>3</td>\n",
       "      <td>35.99</td>\n",
       "      <td>107.97</td>\n",
       "      <td>2014-02-01 17:24:32</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>B1-20000</td>\n",
       "      <td>23</td>\n",
       "      <td>78.90</td>\n",
       "      <td>1814.70</td>\n",
       "      <td>2014-02-01 19:56:48</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>672390</td>\n",
       "      <td>Kuhn-Gusikowski</td>\n",
       "      <td>S1-06532</td>\n",
       "      <td>48</td>\n",
       "      <td>55.82</td>\n",
       "      <td>2679.36</td>\n",
       "      <td>2014-02-02 03:45:20</td>\n",
       "      <td>silver</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   account number             name       sku  quantity  unit price  ext price  \\\n",
       "0          383080         Will LLC  B1-20000         7       33.69     235.83   \n",
       "1          412290    Jerde-Hilpert  S1-27722        11       21.12     232.32   \n",
       "2          412290    Jerde-Hilpert  B1-86481         3       35.99     107.97   \n",
       "3          412290    Jerde-Hilpert  B1-20000        23       78.90    1814.70   \n",
       "4          672390  Kuhn-Gusikowski  S1-06532        48       55.82    2679.36   \n",
       "\n",
       "                 date  status  \n",
       "0 2014-02-01 09:04:59  bronze  \n",
       "1 2014-02-01 11:51:46  bronze  \n",
       "2 2014-02-01 17:24:32  bronze  \n",
       "3 2014-02-01 19:56:48  bronze  \n",
       "4 2014-02-02 03:45:20  silver  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st['status'].fillna('bronze',inplace=True)\n",
    "all_data_st.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the data just to make sure we're all good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>account number</th>\n",
       "      <th>name</th>\n",
       "      <th>sku</th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit price</th>\n",
       "      <th>ext price</th>\n",
       "      <th>date</th>\n",
       "      <th>status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>737550</td>\n",
       "      <td>Fritsch, Russel and Anderson</td>\n",
       "      <td>S1-47412</td>\n",
       "      <td>40</td>\n",
       "      <td>51.01</td>\n",
       "      <td>2040.40</td>\n",
       "      <td>2014-02-05 01:20:40</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>737550</td>\n",
       "      <td>Fritsch, Russel and Anderson</td>\n",
       "      <td>S1-06532</td>\n",
       "      <td>34</td>\n",
       "      <td>18.69</td>\n",
       "      <td>635.46</td>\n",
       "      <td>2014-02-07 09:22:02</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>66</th>\n",
       "      <td>737550</td>\n",
       "      <td>Fritsch, Russel and Anderson</td>\n",
       "      <td>S1-27722</td>\n",
       "      <td>15</td>\n",
       "      <td>70.23</td>\n",
       "      <td>1053.45</td>\n",
       "      <td>2014-02-16 18:24:42</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>78</th>\n",
       "      <td>737550</td>\n",
       "      <td>Fritsch, Russel and Anderson</td>\n",
       "      <td>S2-34077</td>\n",
       "      <td>26</td>\n",
       "      <td>93.35</td>\n",
       "      <td>2427.10</td>\n",
       "      <td>2014-02-20 18:45:43</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>80</th>\n",
       "      <td>737550</td>\n",
       "      <td>Fritsch, Russel and Anderson</td>\n",
       "      <td>S1-93683</td>\n",
       "      <td>31</td>\n",
       "      <td>10.52</td>\n",
       "      <td>326.12</td>\n",
       "      <td>2014-02-21 13:55:45</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    account number                          name       sku  quantity  \\\n",
       "15          737550  Fritsch, Russel and Anderson  S1-47412        40   \n",
       "25          737550  Fritsch, Russel and Anderson  S1-06532        34   \n",
       "66          737550  Fritsch, Russel and Anderson  S1-27722        15   \n",
       "78          737550  Fritsch, Russel and Anderson  S2-34077        26   \n",
       "80          737550  Fritsch, Russel and Anderson  S1-93683        31   \n",
       "\n",
       "    unit price  ext price                date  status  \n",
       "15       51.01    2040.40 2014-02-05 01:20:40  bronze  \n",
       "25       18.69     635.46 2014-02-07 09:22:02  bronze  \n",
       "66       70.23    1053.45 2014-02-16 18:24:42  bronze  \n",
       "78       93.35    2427.10 2014-02-20 18:45:43  bronze  \n",
       "80       10.52     326.12 2014-02-21 13:55:45  bronze  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st[all_data_st[\"account number\"]==737550].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have all of the data along with the status column filled in. We can do our normal data manipulations using the full suite of pandas capability."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using Categories"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the relatively new functions in pandas is support for categorical data. From the pandas, documentation -\n",
    "\n",
    "\"Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.\"\n",
    "\n",
    "For our purposes, the status field is a good candidate for a category type.\n",
    "\n",
    "You must make sure you have a recent version of pandas installed for this example to work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'0.20.2'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we typecast it to a category using astype."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_data_st[\"status\"] = all_data_st[\"status\"].astype(\"category\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This doesn't immediately appear to change anything yet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>account number</th>\n",
       "      <th>name</th>\n",
       "      <th>sku</th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit price</th>\n",
       "      <th>ext price</th>\n",
       "      <th>date</th>\n",
       "      <th>status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>383080</td>\n",
       "      <td>Will LLC</td>\n",
       "      <td>B1-20000</td>\n",
       "      <td>7</td>\n",
       "      <td>33.69</td>\n",
       "      <td>235.83</td>\n",
       "      <td>2014-02-01 09:04:59</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>S1-27722</td>\n",
       "      <td>11</td>\n",
       "      <td>21.12</td>\n",
       "      <td>232.32</td>\n",
       "      <td>2014-02-01 11:51:46</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>B1-86481</td>\n",
       "      <td>3</td>\n",
       "      <td>35.99</td>\n",
       "      <td>107.97</td>\n",
       "      <td>2014-02-01 17:24:32</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>412290</td>\n",
       "      <td>Jerde-Hilpert</td>\n",
       "      <td>B1-20000</td>\n",
       "      <td>23</td>\n",
       "      <td>78.90</td>\n",
       "      <td>1814.70</td>\n",
       "      <td>2014-02-01 19:56:48</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>672390</td>\n",
       "      <td>Kuhn-Gusikowski</td>\n",
       "      <td>S1-06532</td>\n",
       "      <td>48</td>\n",
       "      <td>55.82</td>\n",
       "      <td>2679.36</td>\n",
       "      <td>2014-02-02 03:45:20</td>\n",
       "      <td>silver</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   account number             name       sku  quantity  unit price  ext price  \\\n",
       "0          383080         Will LLC  B1-20000         7       33.69     235.83   \n",
       "1          412290    Jerde-Hilpert  S1-27722        11       21.12     232.32   \n",
       "2          412290    Jerde-Hilpert  B1-86481         3       35.99     107.97   \n",
       "3          412290    Jerde-Hilpert  B1-20000        23       78.90    1814.70   \n",
       "4          672390  Kuhn-Gusikowski  S1-06532        48       55.82    2679.36   \n",
       "\n",
       "                 date  status  \n",
       "0 2014-02-01 09:04:59  bronze  \n",
       "1 2014-02-01 11:51:46  bronze  \n",
       "2 2014-02-01 17:24:32  bronze  \n",
       "3 2014-02-01 19:56:48  bronze  \n",
       "4 2014-02-02 03:45:20  silver  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Buy you can see that it is a new data type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "account number             int64\n",
       "name                      object\n",
       "sku                       object\n",
       "quantity                   int64\n",
       "unit price               float64\n",
       "ext price                float64\n",
       "date              datetime64[ns]\n",
       "status                  category\n",
       "dtype: object"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Categories get more interesting when you assign order to the categories. Right now, if we call sort on the column, it will sort alphabetically. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>account number</th>\n",
       "      <th>name</th>\n",
       "      <th>sku</th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit price</th>\n",
       "      <th>ext price</th>\n",
       "      <th>date</th>\n",
       "      <th>status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>383080</td>\n",
       "      <td>Will LLC</td>\n",
       "      <td>B1-20000</td>\n",
       "      <td>7</td>\n",
       "      <td>33.69</td>\n",
       "      <td>235.83</td>\n",
       "      <td>2014-02-01 09:04:59</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>218895</td>\n",
       "      <td>Kulas Inc</td>\n",
       "      <td>S2-83881</td>\n",
       "      <td>41</td>\n",
       "      <td>78.27</td>\n",
       "      <td>3209.07</td>\n",
       "      <td>2014-01-20 09:37:58</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>383080</td>\n",
       "      <td>Will LLC</td>\n",
       "      <td>B1-33364</td>\n",
       "      <td>26</td>\n",
       "      <td>90.19</td>\n",
       "      <td>2344.94</td>\n",
       "      <td>2014-01-20 09:39:59</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>604255</td>\n",
       "      <td>Halvorson, Crona and Champlin</td>\n",
       "      <td>S2-11481</td>\n",
       "      <td>37</td>\n",
       "      <td>96.71</td>\n",
       "      <td>3578.27</td>\n",
       "      <td>2014-01-20 13:07:28</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>200</th>\n",
       "      <td>527099</td>\n",
       "      <td>Sanford and Sons</td>\n",
       "      <td>B1-05914</td>\n",
       "      <td>18</td>\n",
       "      <td>64.32</td>\n",
       "      <td>1157.76</td>\n",
       "      <td>2014-01-20 21:40:58</td>\n",
       "      <td>bronze</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     account number                           name       sku  quantity  \\\n",
       "0            383080                       Will LLC  B1-20000         7   \n",
       "196          218895                      Kulas Inc  S2-83881        41   \n",
       "197          383080                       Will LLC  B1-33364        26   \n",
       "198          604255  Halvorson, Crona and Champlin  S2-11481        37   \n",
       "200          527099               Sanford and Sons  B1-05914        18   \n",
       "\n",
       "     unit price  ext price                date  status  \n",
       "0         33.69     235.83 2014-02-01 09:04:59  bronze  \n",
       "196       78.27    3209.07 2014-01-20 09:37:58  bronze  \n",
       "197       90.19    2344.94 2014-01-20 09:39:59  bronze  \n",
       "198       96.71    3578.27 2014-01-20 13:07:28  bronze  \n",
       "200       64.32    1157.76 2014-01-20 21:40:58  bronze  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st.sort_values(by=[\"status\"]).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use set_categories to tell it the order we want to use for this category object. In this case, we use the Olympic medal ordering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    " all_data_st[\"status\"].cat.set_categories([ \"gold\",\"silver\",\"bronze\"],inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can sort it so that gold shows on top."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>account number</th>\n",
       "      <th>name</th>\n",
       "      <th>sku</th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit price</th>\n",
       "      <th>ext price</th>\n",
       "      <th>date</th>\n",
       "      <th>status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>68</th>\n",
       "      <td>740150</td>\n",
       "      <td>Barton LLC</td>\n",
       "      <td>B1-38851</td>\n",
       "      <td>17</td>\n",
       "      <td>81.22</td>\n",
       "      <td>1380.74</td>\n",
       "      <td>2014-02-17 17:12:16</td>\n",
       "      <td>gold</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>63</th>\n",
       "      <td>257198</td>\n",
       "      <td>Cronin, Oberbrunner and Spencer</td>\n",
       "      <td>S1-27722</td>\n",
       "      <td>28</td>\n",
       "      <td>10.21</td>\n",
       "      <td>285.88</td>\n",
       "      <td>2014-02-15 17:27:44</td>\n",
       "      <td>gold</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>207</th>\n",
       "      <td>740150</td>\n",
       "      <td>Barton LLC</td>\n",
       "      <td>B1-86481</td>\n",
       "      <td>20</td>\n",
       "      <td>30.41</td>\n",
       "      <td>608.20</td>\n",
       "      <td>2014-01-22 16:33:51</td>\n",
       "      <td>gold</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61</th>\n",
       "      <td>740150</td>\n",
       "      <td>Barton LLC</td>\n",
       "      <td>B1-20000</td>\n",
       "      <td>28</td>\n",
       "      <td>81.39</td>\n",
       "      <td>2278.92</td>\n",
       "      <td>2014-02-15 07:45:16</td>\n",
       "      <td>gold</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>60</th>\n",
       "      <td>239344</td>\n",
       "      <td>Stokes LLC</td>\n",
       "      <td>S2-83881</td>\n",
       "      <td>30</td>\n",
       "      <td>43.00</td>\n",
       "      <td>1290.00</td>\n",
       "      <td>2014-02-15 02:13:23</td>\n",
       "      <td>gold</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     account number                             name       sku  quantity  \\\n",
       "68           740150                       Barton LLC  B1-38851        17   \n",
       "63           257198  Cronin, Oberbrunner and Spencer  S1-27722        28   \n",
       "207          740150                       Barton LLC  B1-86481        20   \n",
       "61           740150                       Barton LLC  B1-20000        28   \n",
       "60           239344                       Stokes LLC  S2-83881        30   \n",
       "\n",
       "     unit price  ext price                date status  \n",
       "68        81.22    1380.74 2014-02-17 17:12:16   gold  \n",
       "63        10.21     285.88 2014-02-15 17:27:44   gold  \n",
       "207       30.41     608.20 2014-01-22 16:33:51   gold  \n",
       "61        81.39    2278.92 2014-02-15 07:45:16   gold  \n",
       "60        43.00    1290.00 2014-02-15 02:13:23   gold  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st.sort_values(by=[\"status\"]).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count        384\n",
       "unique         3\n",
       "top       bronze\n",
       "freq         172\n",
       "Name: status, dtype: object"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st[\"status\"].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For instance, if you want to take a quick look at how your top tier customers are performaing compared to the bottom. Use groupby to give us the average of the values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit price</th>\n",
       "      <th>ext price</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>status</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>gold</th>\n",
       "      <td>24.375000</td>\n",
       "      <td>53.723889</td>\n",
       "      <td>1351.944583</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>silver</th>\n",
       "      <td>22.842857</td>\n",
       "      <td>57.272714</td>\n",
       "      <td>1320.032214</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bronze</th>\n",
       "      <td>25.616279</td>\n",
       "      <td>57.371163</td>\n",
       "      <td>1472.965930</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         quantity  unit price    ext price\n",
       "status                                    \n",
       "gold    24.375000   53.723889  1351.944583\n",
       "silver  22.842857   57.272714  1320.032214\n",
       "bronze  25.616279   57.371163  1472.965930"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st.groupby([\"status\"])[\"quantity\",\"unit price\",\"ext price\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, you can run multiple aggregation functions on the data to get really useful information "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"3\" halign=\"left\">quantity</th>\n",
       "      <th colspan=\"3\" halign=\"left\">unit price</th>\n",
       "      <th colspan=\"3\" halign=\"left\">ext price</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>sum</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>sum</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>sum</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>status</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>gold</th>\n",
       "      <td>1755</td>\n",
       "      <td>24.375000</td>\n",
       "      <td>14.575145</td>\n",
       "      <td>3868.12</td>\n",
       "      <td>53.723889</td>\n",
       "      <td>28.740080</td>\n",
       "      <td>97340.01</td>\n",
       "      <td>1351.944583</td>\n",
       "      <td>1182.657312</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>silver</th>\n",
       "      <td>3198</td>\n",
       "      <td>22.842857</td>\n",
       "      <td>14.512843</td>\n",
       "      <td>8018.18</td>\n",
       "      <td>57.272714</td>\n",
       "      <td>26.556242</td>\n",
       "      <td>184804.51</td>\n",
       "      <td>1320.032214</td>\n",
       "      <td>1086.384051</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bronze</th>\n",
       "      <td>4406</td>\n",
       "      <td>25.616279</td>\n",
       "      <td>14.136071</td>\n",
       "      <td>9867.84</td>\n",
       "      <td>57.371163</td>\n",
       "      <td>26.857370</td>\n",
       "      <td>253350.14</td>\n",
       "      <td>1472.965930</td>\n",
       "      <td>1116.683843</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       quantity                       unit price                        \\\n",
       "            sum       mean        std        sum       mean        std   \n",
       "status                                                                   \n",
       "gold       1755  24.375000  14.575145    3868.12  53.723889  28.740080   \n",
       "silver     3198  22.842857  14.512843    8018.18  57.272714  26.556242   \n",
       "bronze     4406  25.616279  14.136071    9867.84  57.371163  26.857370   \n",
       "\n",
       "        ext price                            \n",
       "              sum         mean          std  \n",
       "status                                       \n",
       "gold     97340.01  1351.944583  1182.657312  \n",
       "silver  184804.51  1320.032214  1086.384051  \n",
       "bronze  253350.14  1472.965930  1116.683843  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st.groupby([\"status\"])[\"quantity\",\"unit price\",\"ext price\"].agg([np.sum,np.mean, np.std])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, what does this tell you? Well, the data is completely random but my first observation is that we sell more units to our bronze customers than gold. Even when you look at the total dollar value associated with bronze vs. gold, it looks backwards.\n",
    "\n",
    "Maybe we should look at how many bronze customers we have and see what is going on.\n",
    "\n",
    "What I plan to do is filter out the unique accounts and see how many gold, silver and bronze customers there are.\n",
    "\n",
    "I'm purposely stringing a lot of commands together which is not necessarily best practice but does show how powerful pandas can be. Feel free to review my previous articles and play with this command yourself to understand what all these commands mean."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "status\n",
       "gold      4\n",
       "silver    7\n",
       "bronze    9\n",
       "Name: name, dtype: int64"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_st.drop_duplicates(subset=[\"account number\",\"name\"]).iloc[:,[0,1,7]].groupby([\"status\"])[\"name\"].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok. This makes a little more sense. We see that we have 9 bronze customers and only 4 customers. That is probably why the volumes are so skewed towards our bronze customers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}