{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# School Budgeting with Machine Learning in Python\n", "> A Summary of lecture \"Case Study- School Budgeting with Machine Learning in Python\", via datacamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introducing the challenge\n", "- Budgets for schools are huge, complex, and not standardize.\n", " - Hundreds of hours each year are spent manually labelling\n", "- Goal: Build a machine learning algorithm that can automate the process\n", "- Supervised Learning problem\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note: Due to the size of dataset, it is not included in this repository, however, you can download it through [kaggle repo](https://www.kaggle.com/jeromeblanchet/drivendatas-boxplots-for-education-dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the data\n", "Now it's time to check out the dataset! You'll use pandas (which has been pre-imported as pd) to load your data into a DataFrame and then do some Exploratory Data Analysis (EDA) of it.\n", "\n", "Some of the column names correspond to **features** - descriptions of the budget items - such as the ```Job_Title_Description``` column. The values in this column tell us if a budget item is for a teacher, custodian, or other employee.\n", "\n", "Some columns correspond to the budget item **labels** you will be trying to predict with your model. For example, the ```Object_Type``` column describes whether the budget item is related classroom supplies, salary, travel expenses, etc." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FunctionUseSharingReportingStudent_TypePosition_TypeObject_TypePre_KOperating_StatusObject_Description...Sub_Object_DescriptionLocation_DescriptionFTEFunction_DescriptionFacility_or_DepartmentPosition_ExtraTotalProgram_DescriptionFund_DescriptionText_1
134338Teacher CompensationInstructionSchool ReportedSchoolNO_LABELTeacherNO_LABELNO_LABELPreK-12 OperatingNaN...NaNNaN1.0NaNNaNKINDERGARTEN50471.810KINDERGARTENGeneral FundNaN
206341NO_LABELNO_LABELNO_LABELNO_LABELNO_LABELNO_LABELNO_LABELNO_LABELNon-OperatingCONTRACTOR SERVICES...NaNNaNNaNRGN GOBNaNUNDESIGNATED3477.860BUILDING IMPROVEMENT SERVICESNaNBUILDING IMPROVEMENT SERVICES
326408Teacher CompensationInstructionSchool ReportedSchoolUnspecifiedTeacherBase Salary/CompensationNon PreKPreK-12 OperatingPersonal Services - Teachers...NaNNaN1.0NaNNaNTEACHER62237.130Instruction - RegularGeneral Purpose SchoolNaN
364634Substitute CompensationInstructionSchool ReportedSchoolUnspecifiedSubstituteBenefitsNO_LABELPreK-12 OperatingEMPLOYEE BENEFITS...NaNNaNNaNUNALLOC BUDGETS/SCHOOLSNaNPROFESSIONAL-INSTRUCTIONAL22.300GENERAL MIDDLE/JUNIOR HIGH SCHNaNREGULAR INSTRUCTION
47683Substitute CompensationInstructionSchool ReportedSchoolUnspecifiedTeacherSubstitute CompensationNO_LABELPreK-12 OperatingTEACHER COVERAGE FOR TEACHER...NaNNaNNaNNON-PROJECTNaNPROFESSIONAL-INSTRUCTIONAL54.166GENERAL HIGH SCHOOL EDUCATIONNaNREGULAR INSTRUCTION
\n", "

5 rows × 25 columns

\n", "
" ], "text/plain": [ " Function Use Sharing Reporting \\\n", "134338 Teacher Compensation Instruction School Reported School \n", "206341 NO_LABEL NO_LABEL NO_LABEL NO_LABEL \n", "326408 Teacher Compensation Instruction School Reported School \n", "364634 Substitute Compensation Instruction School Reported School \n", "47683 Substitute Compensation Instruction School Reported School \n", "\n", " Student_Type Position_Type Object_Type Pre_K \\\n", "134338 NO_LABEL Teacher NO_LABEL NO_LABEL \n", "206341 NO_LABEL NO_LABEL NO_LABEL NO_LABEL \n", "326408 Unspecified Teacher Base Salary/Compensation Non PreK \n", "364634 Unspecified Substitute Benefits NO_LABEL \n", "47683 Unspecified Teacher Substitute Compensation NO_LABEL \n", "\n", " Operating_Status Object_Description ... \\\n", "134338 PreK-12 Operating NaN ... \n", "206341 Non-Operating CONTRACTOR SERVICES ... \n", "326408 PreK-12 Operating Personal Services - Teachers ... \n", "364634 PreK-12 Operating EMPLOYEE BENEFITS ... \n", "47683 PreK-12 Operating TEACHER COVERAGE FOR TEACHER ... \n", "\n", " Sub_Object_Description Location_Description FTE \\\n", "134338 NaN NaN 1.0 \n", "206341 NaN NaN NaN \n", "326408 NaN NaN 1.0 \n", "364634 NaN NaN NaN \n", "47683 NaN NaN NaN \n", "\n", " Function_Description Facility_or_Department \\\n", "134338 NaN NaN \n", "206341 RGN GOB NaN \n", "326408 NaN NaN \n", "364634 UNALLOC BUDGETS/SCHOOLS NaN \n", "47683 NON-PROJECT NaN \n", "\n", " Position_Extra Total Program_Description \\\n", "134338 KINDERGARTEN 50471.810 KINDERGARTEN \n", "206341 UNDESIGNATED 3477.860 BUILDING IMPROVEMENT SERVICES \n", "326408 TEACHER 62237.130 Instruction - Regular \n", "364634 PROFESSIONAL-INSTRUCTIONAL 22.300 GENERAL MIDDLE/JUNIOR HIGH SCH \n", "47683 PROFESSIONAL-INSTRUCTIONAL 54.166 GENERAL HIGH SCHOOL EDUCATION \n", "\n", " Fund_Description Text_1 \n", "134338 General Fund NaN \n", "206341 NaN BUILDING IMPROVEMENT SERVICES \n", "326408 General Purpose School NaN \n", "364634 NaN REGULAR INSTRUCTION \n", "47683 NaN REGULAR INSTRUCTION \n", "\n", "[5 rows x 25 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./dataset/TrainingData.csv', index_col=0)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FunctionUseSharingReportingStudent_TypePosition_TypeObject_TypePre_KOperating_StatusObject_Description...Sub_Object_DescriptionLocation_DescriptionFTEFunction_DescriptionFacility_or_DepartmentPosition_ExtraTotalProgram_DescriptionFund_DescriptionText_1
109283Professional DevelopmentISPDShared ServicesNon-SchoolUnspecifiedInstructional CoachOther Compensation/StipendNO_LABELPreK-12 OperatingWORKSHOP PARTICIPANT...NaNSTAFF DEV AND INSTR MEDIANaNINST STAFF TRAINING SVCSNaNNaN48.620000NaNGENERAL FUNDSTAFF DEV AND INSTR MEDIA
102430Substitute CompensationInstructionSchool ReportedSchoolUnspecifiedSubstituteBase Salary/CompensationNO_LABELPreK-12 OperatingSALARIES OF PART TIME EMPLOYEE...NaNNaN0.00431TITLE II,DNaNPROFESSIONAL-INSTRUCTIONAL128.824985INSTRUCTIONAL STAFF TRAININGNaNINSTRUCTIONAL STAFF
413949Parent & Community RelationsNO_LABELSchool ReportedSchoolNO_LABELOtherNO_LABELNO_LABELPreK-12 OperatingNaN...NaNNaN1.00000NaNNaNPARENT/TITLE I4902.290000MiscSchoolwide SchoolsNaN
433672Library & MediaInstructionSchool on Central BudgetsNon-SchoolUnspecifiedLibrarianBenefitsNO_LABELPreK-12 OperatingEMPLOYEE BENEFITS...NaNED RESOURCE SERVICESNaNNON-PROJECTNaNOFFICE/ADMINISTRATIVE SUPPORT4020.290000MEDIA SUPPORT SERVICESNaNINSTRUCTIONAL STAFF
415831Substitute CompensationInstructionSchool ReportedSchoolPovertySubstituteSubstitute CompensationNon PreKPreK-12 OperatingSalaries And Wages For Substitute Professionals...Inservice Substitute Teachers Grant FundedSchoolNaNInstructionInstruction And CurriculumCERTIFIED SUBSTITUTE46.530000Accelerated Education\"Title Part A Improving Basic Programs\"MISCELLANEOUS
\n", "

5 rows × 25 columns

\n", "
" ], "text/plain": [ " Function Use Sharing \\\n", "109283 Professional Development ISPD Shared Services \n", "102430 Substitute Compensation Instruction School Reported \n", "413949 Parent & Community Relations NO_LABEL School Reported \n", "433672 Library & Media Instruction School on Central Budgets \n", "415831 Substitute Compensation Instruction School Reported \n", "\n", " Reporting Student_Type Position_Type \\\n", "109283 Non-School Unspecified Instructional Coach \n", "102430 School Unspecified Substitute \n", "413949 School NO_LABEL Other \n", "433672 Non-School Unspecified Librarian \n", "415831 School Poverty Substitute \n", "\n", " Object_Type Pre_K Operating_Status \\\n", "109283 Other Compensation/Stipend NO_LABEL PreK-12 Operating \n", "102430 Base Salary/Compensation NO_LABEL PreK-12 Operating \n", "413949 NO_LABEL NO_LABEL PreK-12 Operating \n", "433672 Benefits NO_LABEL PreK-12 Operating \n", "415831 Substitute Compensation Non PreK PreK-12 Operating \n", "\n", " Object_Description ... \\\n", "109283 WORKSHOP PARTICIPANT ... \n", "102430 SALARIES OF PART TIME EMPLOYEE ... \n", "413949 NaN ... \n", "433672 EMPLOYEE BENEFITS ... \n", "415831 Salaries And Wages For Substitute Professionals ... \n", "\n", " Sub_Object_Description \\\n", "109283 NaN \n", "102430 NaN \n", "413949 NaN \n", "433672 NaN \n", "415831 Inservice Substitute Teachers Grant Funded \n", "\n", " Location_Description FTE \\\n", "109283 STAFF DEV AND INSTR MEDIA NaN \n", "102430 NaN 0.00431 \n", "413949 NaN 1.00000 \n", "433672 ED RESOURCE SERVICES NaN \n", "415831 School NaN \n", "\n", " Function_Description Facility_or_Department \\\n", "109283 INST STAFF TRAINING SVCS NaN \n", "102430 TITLE II,D NaN \n", "413949 NaN NaN \n", "433672 NON-PROJECT NaN \n", "415831 Instruction Instruction And Curriculum \n", "\n", " Position_Extra Total \\\n", "109283 NaN 48.620000 \n", "102430 PROFESSIONAL-INSTRUCTIONAL 128.824985 \n", "413949 PARENT/TITLE I 4902.290000 \n", "433672 OFFICE/ADMINISTRATIVE SUPPORT 4020.290000 \n", "415831 CERTIFIED SUBSTITUTE 46.530000 \n", "\n", " Program_Description \\\n", "109283 NaN \n", "102430 INSTRUCTIONAL STAFF TRAINING \n", "413949 Misc \n", "433672 MEDIA SUPPORT SERVICES \n", "415831 Accelerated Education \n", "\n", " Fund_Description \\\n", "109283 GENERAL FUND \n", "102430 NaN \n", "413949 Schoolwide Schools \n", "433672 NaN \n", "415831 \"Title Part A Improving Basic Programs\" \n", "\n", " Text_1 \n", "109283 STAFF DEV AND INSTR MEDIA \n", "102430 INSTRUCTIONAL STAFF \n", "413949 NaN \n", "433672 INSTRUCTIONAL STAFF \n", "415831 MISCELLANEOUS \n", "\n", "[5 rows x 25 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 400277 entries, 134338 to 415831\n", "Data columns (total 25 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Function 400277 non-null object \n", " 1 Use 400277 non-null object \n", " 2 Sharing 400277 non-null object \n", " 3 Reporting 400277 non-null object \n", " 4 Student_Type 400277 non-null object \n", " 5 Position_Type 400277 non-null object \n", " 6 Object_Type 400277 non-null object \n", " 7 Pre_K 400277 non-null object \n", " 8 Operating_Status 400277 non-null object \n", " 9 Object_Description 375493 non-null object \n", " 10 Text_2 88217 non-null object \n", " 11 SubFund_Description 306855 non-null 
object \n", " 12 Job_Title_Description 292743 non-null object \n", " 13 Text_3 109152 non-null object \n", " 14 Text_4 53746 non-null object \n", " 15 Sub_Object_Description 91603 non-null object \n", " 16 Location_Description 162054 non-null object \n", " 17 FTE 126071 non-null float64\n", " 18 Function_Description 342195 non-null object \n", " 19 Facility_or_Department 53886 non-null object \n", " 20 Position_Extra 264764 non-null object \n", " 21 Total 395722 non-null float64\n", " 22 Program_Description 304660 non-null object \n", " 23 Fund_Description 202877 non-null object \n", " 24 Text_1 292285 non-null object \n", "dtypes: float64(2), object(23)\n", "memory usage: 79.4+ MB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summarizing the data\n", "You'll continue your EDA in this exercise by computing summary statistics for the numeric data in the dataset.\n", "\n", "You can use df.info() in the IPython Shell to determine which columns of the data are numeric, specifically type float64. You'll notice that there are two numeric columns, called FTE and Total.\n", "\n", "- FTE: Stands for \"full-time equivalent\". If the budget item is associated to an employee, this number tells us the percentage of full-time that the employee works. A value of 1 means the associated employee works for the school full-time. A value close to 0 means the item is associated to a part-time or contracted employee.\n", "- Total: Stands for the total cost of the expenditure. This number tells us how much the budget item cost." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FTETotal
count126071.0000003.957220e+05
mean0.4267941.310586e+04
std0.5735763.682254e+05
min-0.087551-8.746631e+07
25%0.0007927.379770e+01
50%0.1309274.612300e+02
75%1.0000003.652662e+03
max46.8000001.297000e+08
\n", "
" ], "text/plain": [ " FTE Total\n", "count 126071.000000 3.957220e+05\n", "mean 0.426794 1.310586e+04\n", "std 0.573576 3.682254e+05\n", "min -0.087551 -8.746631e+07\n", "25% 0.000792 7.379770e+01\n", "50% 0.130927 4.612300e+02\n", "75% 1.000000 3.652662e+03\n", "max 46.800000 1.297000e+08" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the summary statistics\n", "df.describe()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'num employee')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZcAAAElCAYAAAAoZK9zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3deZwdVZ338c/XhB3ZAwMJEpCoBAZRI+I2IjgQFg2jrA9LQJyIg9uIg0EdUYRncBRBH0WHMZGArIM6BAUxsig6soRFICASA5IQBgIhEBCBwPf5o07LTed253a67m26+/t+ve6rb/3q1Klzq5P763Oq6pRsExERUadXDHQDIiJi6ElyiYiI2iW5RERE7ZJcIiKidkkuERFRuySXiIioXZJLvCxJ+q6kf62prldJekrSiLJ8raQP1VF3qe8KSZPrqq8P+z1Z0qOS/reGul4r6VZJSyV9vIXylrRteX+2pJP7sK/lfh8xNCW5RMdJul/SM+WLbImk/5F0jKS//nu0fYztL7dY13t6K2P7Advr2n6hhrZ/UdIPutW/l+0Z/a27j+3YEjgOGG/7b5qtl3S9pMWSTuu27meSJnTb5HjgWtuvtP3Nmtu63O+ozt9HvHwlucRAea/tVwJbAacCnwGm1b0TSSPrrvNlYivgMduP9LD+BGAGsDWwX1cykXQQMM/27Cb1zWlXY2P4SXKJAWX7CdszgYOAyZJ2gOWHWiRtIuknpZezWNJ1kl4h6VzgVcBlZZjleEljy5DN0ZIeAK5uiDUmmldLulHSE5IulbRR2deukhY0trHrL29JE4HPAgeV/f2urP/rMFtp1+cl/UnSI5LOkbR+WdfVjsmSHihDWp/r6dhIWr9sv6jU9/lS/3uAWcAWpR1nN9l8a+Bq208ANwHbSFoPmFo+Q+N+rgbeDXyr1Pea7kOHko6U9OtefpU9fYbefkcjG47fyaUH+5SkyyRtLOk8SU9KuknS2IY6XydpVvm3cI+kA/varmi/JJd4WbB9I7AAeGeT1ceVdaOAzai+HG37cOABql7Qurb/vWGbdwHbAXv2sMsjgA8CWwDLgJUOBdn+GfB/gYvK/l7fpNiR5fVuYBtgXeBb3cq8A3gtsDvwBUnb9bDL/wesX+p5V2nzUbZ/AewFLCztOLLJtncCfy9pA2ACcBfwZeAM20u6fa7dgOuAj5b6/tDjQeijlfyOGh0MHA6MBl4N/Bb4PrARcDdwIoCkdagS6/nApsAhwJmStq+rzVGPJJd4OVlI9WXS3fPA5sBWtp+3fZ1XPineF20/bfuZHtafa/tO208D/wocWNMJ5kOBr9ueZ/spquGpg7v1mr5k+xnbvwN+B6yQpEpbDgJOsL3U9v3AaVRfwK34N6pE/Uvg28BqwI5UPYjzJf1K0kdX7SO2xfdt/7H0tK4A/mj7F7aXAf8FvKGU2xe43/b3bS+zfQvwQ2D/gWl29CTJJV5ORgOLm8S/CswFfi5pnqSpLdQ1vw/r/0T15btJS63s3Ralvsa6R1L1uLo0Xt31Z6reTXebAKs3qWt0K42wvdj2QaV39Q2qXtDHqIbF7gTeAxwjaXwr9bVK1ZVzT5XXoX3Y9OGG9880We46RlsBbylDpEskLaFK6Ctc1BADa6ie7IxBRtKbqb44VxjXt72UamjsuDL8cY2km2xfBfTUg1lZz2bLhvevouodPQo8Dazd0K4RVMNxrda7kOoLsLHuZVRflmNWsm2jR0ubtqIa0uqq68E+1NFlCnC97Tsl/S1wuu3nJN0B7NBQf6PljgMtfnnb3qtZuK8N7sV84Je2/77GOqMN0nOJASVpPUn7AhcCP7B9R5My+0raVpKAJ4EXyguqL+1tVmHXh0kaL2lt4CTgknJp7B+ANSXtI2k14PPAGg3bPQyMVcNl091cAPyzpK0lrctL52iW9aVxpS0XA6dIeqWkrYBPAT/ofcvlSdoUOBb4YgndB7y7tG0CMK+HTW8D3i9pbVX3sxzdl/12s6q/o2Z+ArxG0uGSViuvN/dy3ioGSJJLDJTLJC2l+kv0c8DXgaN6KDsO+AXwFNWJ3jNtX1vW/Rvw+TJE8uk+7P9c4GyqIao1gY9DdfUa8E/A96h6CU9TXUzQ5b/Kz8ck3dKk3uml7l9RfZH/hWo4alV8rOx/HlWP7vxSf198DTipnP+B6njtRnXcZza5JLnL6cBzVIlhBnBeH/fbaFV/Rysovdg9qC4AWEj1+/sKy/8BEC8DysPCIiKibum5RERE7ZJcIiKidkkuERFRuySXiIioXZJLRM2azU82VHWfgyyiS5JLRETULnfoR0SflRtaNdDtiJev9Fxi2OhtqnZVU/yf2TA31m8k/Y2kMyQ9Lun3kt7QUP5+SSdIuqus/76kNXvY73Zl+GiJpDmS3lfib5b0cOOklpI+IOm28v4VkqZK+qOkxyRdrPJogLJ+lzJN/RJJv5O0aw/7P0rSZQ3LcyVd3LA8X9JO5f3byhT3T5Sfb2sod62kUyT9hmpOtG267WdzSbd33Sipapr+eaoeCndfH+cai8HOdl55DfkXsA7VXelHUfXY30g1f9f2Zf3ZZflNVHfsX011h/0RwAjgZOCahvrup5oAckuqmZx/A5xc1u0KLCjvV6OadPOzVBNR7gYsBV5b1t8F7NVQ74+B48r7TwLXU81JtgbwH8AFZd1o4DFgb6o/Ev++LI9q8tm3AZaUcptTTYD5YMO6x8u6jcr7w8sxOqQsb1zKXks1ff72Zf1qJfYhYCzV1DlTGo73kw2fc/OuY53X8Hil5xLDRStTtf/Y9s22/0L1Jf8X2+e4mufrIl6a9r3Lt2zPt70YOIXqy7i7Xahm9D3V9nO2r6aaH
6ur7AzgMIDSK9mTapoXgA8Dn7O9wPazVPOD7V96OocBl9u+3PaLtmcBs6mSzXJsz6NKaDtRPRfmSuBBSa8ry9fZfhHYB7jX9rnlGF0A/B54b0N1Z9ueU9Y/X2LjqZLMibbPaij7IrCDpLVsP2Q7T7ocRnLOJYaLv07V3hAbSTUPWJdWp33v0n3a/i2a7HcLYH758m4s2zV1/g+Au8tEkgdSfdE/1NDmH0tq3PYFqun7twIOkNT4xb8acE2TNkD1XJddgW3L+yVUieWtZbmrrX/qtl33af6bPcrgUKre2SVdAdtPq3qk8qeBaWUo7Tjbv++hfTHEpOcSw0XXVO0bNLzWtf2RftTZfdr+hU3KLAS27DaL8l+nzrf9INVknP9ANRzVmOzmUw2ZNbZ5zbLNfKoHnjWuW8f2qT20tSu5dD1A7JdUyeVdvJRcuj8uYLm2Fs0mI/wi1ZDi+Wp44JrtK11Njb85VQ/oP3toWwxBSS4xXLRjqvZjJY0pw1mfpRo66+4GqpmNjy/73JVqmOnChjLnAMcDf0s1HNflu1RT7m8FIGmUpEll3Q+A90raU9IISWuW+2t6embML6kevbyW7QVUjzWeCGwM3FrKXE51jP6PpJGl5zGe6tj15nngAKrzLOeWCxE2k/Q+VY8lfpZqRusXeqskhpYklxgW3J6p2s8Hfk41Jf48qpP+3ff7HPA+qmfePwqcCRzRbXjox5QhMFePXe7yDWAm1RM4l1Kd3H9LqXc+MIkqqS2i6sn8Cz38n7b9B6ov+OvK8pOlzb8p55Sw/RjVuanjqC4OOB7Y1/ajKzsQ5XO+n+q59tOphhyPozrWi6l6SP+0snpi6MiU+xGrQNL9wIds/6Km+v4IfLiu+iIGWnouEQNM0geozmVcPdBtiahLrhaLGECSrqU6r3F4tyvKIga1DItFRETtMiwWERG1y7BYsckmm3js2LED3YyIiEHl5ptvftT2qO7xJJdi7NixzJ49e6CbERExqEjqPqsDkGGxiIhogySXiIioXZJLRETULsklIiJql+QSERG1S3KJiIjaJblERETtklwiIqJ2SS4REVG73KFfg7FTfzpg+77/1H0GbN8RET1JzyUiImqX5BIREbVLcomIiNoluURERO2SXCIionZtSy6Spkt6RNKdDbGvSvq9pNsl/VjSBg3rTpA0V9I9kvZsiE8ssbmSpjbEt5Z0g6R7JV0kafUSX6Mszy3rx7brM0ZERHPt7LmcDUzsFpsF7GB7R+APwAkAksYDBwPbl23OlDRC0gjg28BewHjgkFIW4CvA6bbHAY8DR5f40cDjtrcFTi/lIiKig9qWXGz/CljcLfZz28vK4vXAmPJ+EnCh7Wdt3wfMBXYur7m259l+DrgQmCRJwG7AJWX7GcB+DXXNKO8vAXYv5SMiokMG8pzLB4EryvvRwPyGdQtKrKf4xsCShkTVFV+urrL+iVI+IiI6ZECSi6TPAcuA87pCTYp5FeK91dWsHVMkzZY0e9GiRb03OiIiWtbx5CJpMrAvcKjtri/9BcCWDcXGAAt7iT8KbCBpZLf4cnWV9evTbXiui+2zbE+wPWHUqFH9/WgREVF0NLlImgh8Bnif7T83rJoJHFyu9NoaGAfcCNwEjCtXhq1OddJ/ZklK1wD7l+0nA5c21DW5vN8fuLohiUVERAe0beJKSRcAuwKbSFoAnEh1ddgawKxyjv1628fYniPpYuAuquGyY22/UOr5KHAlMAKYbntO2cVngAslnQzcCkwr8WnAuZLmUvVYDm7XZ4yIiOballxsH9IkPK1JrKv8KcApTeKXA5c3ic+jupqse/wvwAF9amxERNQqd+hHRETtklwiIqJ2SS4REVG7JJeIiKhdkktERNQuySUiImqX5BIREbVLcomIiNoluURERO2SXCIionZJLhERUbskl4iIqF2SS0RE1C7JJSIiapfkEhERtUtyiYiI2iW5RERE7ZJcIiKidkkuERFRuySXiIioXZJLRETULsklIiJql+QSERG1S3KJiIjatS25SJou6RFJdzbENpI0S9K95eeGJS5J35Q0V9Ltkt7YsM3kUv5eSZMb4m+SdEfZ5puS1Ns+IiKic9rZczkbmNgtNhW4yvY44KqyDLAXMK68pgDfgSpRACcCbwF2Bk5sSBbfKWW7tpu4kn1ERESHtC252P4VsLhbeBIwo7yfAezXED/HleuBDSRtDuwJzLK92PbjwCxgYlm3nu3f2jZwTre6mu0jIiI6pNPnXDaz/RBA+blpiY8G5jeUW1BivcUXNIn3to8VSJoiabak2YsWLVrlDxUREct7uZzQV5OYVyHeJ7bPsj3B9oRRo0b1dfOIiOhBp5PLw2VIi/LzkRJfAGzZUG4MsHAl8TFN4r3tIyIiOqTTyWUm0HXF12Tg0ob4EeWqsV2AJ8qQ1pXAHpI2LCfy9wCuLOuWStqlXCV2RLe6mu0jIiI6ZGS7KpZ0AbArsImkBVRXfZ0KXCzpaOAB4IBS/HJgb2Au8GfgKADbiyV9GbiplDvJdtdFAh+huiJtLeCK8qKXfURERIe0LbnYPqSHVbs3KWvg2B7qmQ5MbxKfDezQJP5Ys31ERETnvFxO6EdExBCS5BIREbVLcomIiNoluURERO2SXCIionZJLhERUbskl4iIqF2SS0RE1C7JJSIiapfkEhERtUtyiYiI2iW5RERE7ZJcIiKidkkuERFRuySXiIio3UqTi6TXSLpK0p1leUdJn29/0yIiYrBqpefyn8AJwPMAtm8HDm5noyIiYnBrJbmsbfvGbrFl7WhMREQMDa0kl0clvRowgKT9gYfa2qqIiBjURrZQ5ljgLOB1kh4E7gMOa2urIiJiUFtpcrE9D3iPpHWAV9he2v5mRUTEYNbK1WKbSZoGXGJ7qaTxko7uQNsiImKQauWcy9nAlcAWZfkPwCfb1aCIiBj8Wkkum9i+GHgRwPYy4IW2tioiIga1VpLL05I25qWrxXYBnujPTiX9s6Q5ku6UdIGkNSVtLekGSfdKukjS6qXsGmV5blk/tqGeE0r8Hkl7NsQnlthcSVP709aIiOi7VpLLccBM4NWSfgOcA3xsVXcoaTTwcWCC7R2AEVQ3ZX4FON32OOBxoOu8ztHA47a3BU4v5ZA0vmy3PTAROFPSCEkjgG8DewHjgUNK2YiI6JCVJhfbNwPvAt4GfBjYvtyl3x8jgbUkjQTWprpvZjfgkrJ+BrBfeT+pLFPW7y5JJX6h7Wdt3wfMBXYur7m259l+DriwlI2IiA5p5Wqx2cAUYKHtO20/358d2n4Q+BrwAFVSeQK4GVhSzucALABGl/ejgfll22Wl/MaN8W7b9BRv9tmmSJotafaiRYv687EiIqJBK8NiB1N9Od8k6UJJe5aewyqRtCFVT2JrqivQ1qEawurOXZv0sK6v8RWD9lm2J9ieMGrUqJU1PSIiWtTKsNhc258DXgOcD0wHHpD0JUkbrcI+3wPcZ3tR6QX9iGrIbYMyTAYwBlhY3i8AtgQo69cHFjfGu23TUzwiIjqkpee5SNoROA34KvBDYH/gSeDq
VdjnA8AuktYuPaDdgbuAa0q9AJOBS8v7mWWZsv5q2y7xg8vVZFsD44AbgZuAceXqs9Wpel4zV6GdERGxilY6/Yukm4ElwDRgqu1ny6obJL29rzu0fYOkS4BbqGZXvpVq7rKfAhdKOrnEppVNpgHnSppL1WM5uNQzR9LFVIlpGXCs7RdKmz9KdePnCGC67Tl9bWdERKw6VZ2AXgpI25T5xYa0CRMmePbs2au07dipP625Na27/9R9BmzfERGSbrY9oXu8lWGxxyR9veuqKkmnSVq/DW2MiIghopXkMh1YChxYXk8C329noyIiYnBr5Xkur7b9gYblL0m6rV0NioiIwa+Vnsszkt7RtVBO4j/TviZFRMRg10rP5SPAjHKeRVRXbB3ZzkZFRMTg1sqTKG8DXi9pvbL8ZNtbFRERg1qPyUXSp3qIA2D7621qU0REDHK99Vxe2bFWRETEkNJjcrH9pU42JCIiho5WptzfRtJlkhZJekTSpZK26UTjIiJicGrlUuTzgYuBzammyP8v4IJ2NioiIga3VpKLbJ9re1l5/YAeno8SEREBrd3nco2kqVSPCzZwEPDTrme52F7cxvZFRMQg1EpyOaj8/HC3+Aepkk3Ov0RExHJauYly6040JCIiho5WHhY2AtgHGNtYPjdRRkRET1oZFrsM+AtwB/Bie5sTERFDQSvJZYztHdvekoiIGDJauRT5Ckl7tL0lERExZLTSc7ke+LGkVwDPU027b9vrtbVlERExaLWSXE4D3grcYTs3T0ZExEq1Mix2L3BnEktERLSqlZ7LQ8C1kq4Anu0K5lLkiIjoSSvJ5b7yWr28IiIietXKHfpfApC0ju2n69ippA2A7wE7UE0h80HgHuAiqps17wcOtP24qkdffgPYG/gzcKTtW0o9k4HPl2pPtj2jxN8EnA2sBVwOfCLDehERndPK81zeKuku4O6y/HpJZ/Zzv98Afmb7dcDrS91TgatsjwOuKssAewHjymsK8J3Sjo2AE4G3ADsDJ0rasGzznVK2a7uJ/WxvRET0QSsn9M8A9gQeA7D9O+DvVnWHktYr208r9T1newkwCZhRis0A9ivvJwHnuHI9sIGkzUubZtlebPtxYBYwsaxbz/ZvS2/lnIa6IiKiA1pJLtie3y30Qj/2uQ2wCPi+pFslfU/SOsBmth8q+3sI2LSUHw007n9BifUWX9AkvgJJUyTNljR70aJF/fhIERHRqJXkMl/S2wBLWl3SpylDZKtoJPBG4Du23wA8zUtDYM2oScyrEF8xaJ9le4LtCaNGjeq91RER0bJWkssxwLG81CPYqSyvqgXAAts3lOVLqJLNw2VIi/LzkYbyWzZsPwZYuJL4mCbxiIjokJUmF9uP2j7U9ma2N7V9mO3HVnWHtv+Xqjf02hLaHbgLmAlMLrHJwKXl/UzgCFV2AZ4ow2ZXAntI2rCcyN8DuLKsWyppl3Kl2RENdUVERAe0cp9LO3wMOE/S6sA84CiqRHexpKOBB4ADStnLqS5Dnkt1KfJRUD1eWdKXgZtKuZMaHrn8EV66FPmK8oqIiA4ZkORi+zZgQpNVuzcpa3oYhrM9HZjeJD6b6h6aiIgYAC1dLRYREdEXrTzmeAOq8xZjWf4xxx9vX7MiImIwa2VY7HKqZ7rkMccREdGSVpLLmrY/1faWRETEkNHKOZdzJf2jpM0lbdT1anvLIiJi0Gql5/Ic8FXgc7x0p7uppnGJiIhYQSvJ5VPAtrYfbXdjIiJiaGhlWGwO1c2LERERLWml5/ICcJuka1j+Mce5FDkiIppqJbn8d3lFRES0pJXHHM9YWZmIiIhGrdyhfx9NnodiO1eLRUREU60MizVOMLkm1WzFuc8lIiJ61MrzXB5reD1o+wxgtw60LSIiBqlWhsXe2LD4CqqezCvb1qKIiBj0WhkWO63h/TLgfuDAtrQmIiKGhFauFnt3JxoSERFDRyvDYmsAH2DF57mc1L5mRUTEYNbKsNilwBPAzTTcoR8REdGTVpLLGNsT296SiIgYMlqZuPJ/JP1t21sSERFDRis9l3cAR5Y79Z8FBNj2jm1tWUREDFqtJJe92t6KiIgYUlq5FPlPnWhIREQMHa2cc2kLSSMk3SrpJ2V5a0k3SLpX0kWSVi/xNcry3LJ+bEMdJ5T4PZL2bIhPLLG5kqZ2+rNFRAx3A5ZcgE8AdzcsfwU43fY44HHg6BI/Gnjc9rbA6aUcksYDBwPbAxOBM0vCGgF8m2o4bzxwSCkbEREdMiDJRdIYYB/ge2VZVJNhXlKKzAD2K+8nlWXK+t1L+UnAhbaftX0fMBfYubzm2p5n+zngwlI2IiI6ZKB6LmcAxwMvluWNgSW2l5XlBcDo8n40MB+grH+ilP9rvNs2PcVXIGmKpNmSZi9atKi/nykiIoqOJxdJ+wKP2L65MdykqFeyrq/xFYP2WbYn2J4watSoXlodERF90cqlyHV7O/A+SXtTPXxsPaqezAaSRpbeyRhgYSm/ANgSWCBpJLA+sLgh3qVxm57iERHRAR3vudg+wfYY22OpTshfbftQ4Bpg/1JsMtWcZgAzyzJl/dW2XeIHl6vJtgbGATcCNwHjytVnq5d9zOzAR4uIiGIgei49+QxwoaSTgVuBaSU+DThX0lyqHsvBALbnSLoYuIvqOTPH2n4BQNJHgSuBEcB023M6+kkiIoa5AU0utq8Fri3v51Fd6dW9zF+AA3rY/hTglCbxy4HLa2xqRET0wUDe5xIREUNUkktERNQuySUiImqX5BIREbVLcomIiNoluURERO2SXCIionZJLhERUbskl4iIqF2SS0RE1C7JJSIiapfkEhERtUtyiYiI2iW5RERE7ZJcIiKidkkuERFRuySXiIioXZJLRETULsklIiJql+QSERG1S3KJiIjaJblERETtklwiIqJ2SS4REVG7jicXSVtKukbS3ZLmSPpEiW8kaZake8vPDUtckr4paa6k2yW9saGuyaX8vZImN8TfJOmOss03JanTnzMiYjgbiJ7LMuA429sBuwDHShoPTAWusj0OuKosA+wFjCuvKcB3oEpGwInAW4CdgRO7ElIpM6Vhu4kd+FwREVF0PLnYfsj2LeX9UuBuYDQwCZhRis0A9ivvJwHnuHI9sIGkzYE9gVm2F9t+HJgFTCzr1rP9W9sGzmmoKyIiOmBAz7lIGgu8AbgB2Mz2Q1AlIGDTUmw0ML9hswUl1lt8QZN4s/1PkTRb0uxFixb19+NEREQxYMlF0rrAD4FP2n6yt6JNYl6F+IpB+yzbE2xPGDVq1MqaHBERLRqQ5CJpNarEcp7tH5Xww2VIi/LzkRJfAGzZsPkYYOFK4mOaxCMiokMG4moxAdOAu21/vWHVTKDriq/JwKUN8SPKVWO7AE+UYbMrgT0kbVhO5O8BXFnWLZW0S9nXEQ11RUREB4wcgH2+HTgcuEPSbSX2WeBU4GJJRwMPAAeUdZcDewNzgT8DRwHYXizpy8BNpdxJtheX9x8BzgbWAq4or4iI6JCOJxfbv6b5eRGA3ZuUN3BsD3VNB6Y3ic8GduhHMyMioh9
yh35ERNQuySUiImqX5BIREbVLcomIiNoluURERO2SXCIionZJLhERUbskl4iIqF2SS0RE1C7JJSIiapfkEhERtUtyiYiI2iW5RERE7ZJcIiKidkkuERFRuySXiIioXZJLRETULsklIiJql+QSERG1S3KJiIjaJblERETtklwiIqJ2SS4REVG7JJeIiKjdkE0ukiZKukfSXElTB7o9ERHDyZBMLpJGAN8G9gLGA4dIGj+wrYqIGD6GZHIBdgbm2p5n+zngQmDSALcpImLYGDnQDWiT0cD8huUFwFu6F5I0BZhSFp+SdE8/9rkJ8Gg/tl8l+kqn99irATkGLzPD/RgM988Pw+8YbNUsOFSTi5rEvELAPgs4q5YdSrNtT6ijrsEqxyDHYLh/fsgx6DJUh8UWAFs2LI8BFg5QWyIihp2hmlxuAsZJ2lrS6sDBwMwBblNExLAxJIfFbC+T9FHgSmAEMN32nDbvtpbhtUEuxyDHYLh/fsgxAED2CqciIiIi+mWoDotFRMQASnKJiIjaJbn003CcZkbSdEmPSLqzIbaRpFmS7i0/NxzINrabpC0lXSPpbklzJH2ixIfNcZC0pqQbJf2uHIMvlfjWkm4ox+CiclHNkCZphKRbJf2kLA+7Y9Bdkks/DONpZs4GJnaLTQWusj0OuKosD2XLgONsbwfsAhxbfvfD6Tg8C+xm+/XATsBESbsAXwFOL8fgceDoAWxjp3wCuLtheTgeg+UkufTPsJxmxvavgMXdwpOAGeX9DGC/jjaqw2w/ZPuW8n4p1RfLaIbRcXDlqbK4WnkZ2A24pMSH9DEAkDQG2Af4XlkWw+wYNJPk0j/NppkZPUBtGWib2X4Iqi9eYNMBbk/HSBoLvAG4gWF2HMpw0G3AI8As4I/AEtvLSpHh8H/iDOB44MWyvDHD7xisIMmlf1qaZiaGLknrAj8EPmn7yYFuT6fZfsH2TlSzYOwMbNesWGdb1TmS9gUesX1zY7hJ0SF7DHoyJG+i7KBMM/OShyVtbvshSZtT/SU7pElajSqxnGf7RyU87I4DgO0lkq6lOv+0gaSR5S/3of5/4u3A+yTtDawJrEfVkxlOx6Cp9Fz6J9PMvGQmMLm8nwxcOoBtabsyrj4NuNv2157pDrIAAANZSURBVBtWDZvjIGmUpA3K+7WA91Cde7oG2L8UG9LHwPYJtsfYHkv1//9q24cyjI5BT3KHfj+Vv1jO4KVpZk4Z4Ca1naQLgF2pphZ/GDgR+G/gYuBVwAPAAba7n/QfMiS9A7gOuIOXxto/S3XeZVgcB0k7Up2sHkH1h+rFtk+StA3VxS0bAbcCh9l+duBa2hmSdgU+bXvf4XoMGiW5RERE7TIsFhERtUtyiYiI2iW5RERE7ZJcIiKidkkuERFRuySXiF6Uezl+LelOSfs1xC+VtMUq1HVDmT33nd3WvbPMLHxbuWekpzqulTShvL9f0iZNyuwq6W0Ny8dIOqIvbY3orySXiN4dQnUvx1uBfwGQ9F7gFtt9vet6d+D3tt9g+7pu6w4FvmZ7J9vP9LPNuwJ/TS62v2v7nH7WGdEnSS4RvXseWAtYA3hR0kjgk8BXe9pA0laSrpJ0e/n5Kkk7Af8O7N29dyLpQ8CBwBcknVd6Hj9pWP8tSUe20tgyieYxwD+X/bxT0hclfbqsv1bS6ZJ+VZ5F82ZJPyrPHTm5oZ7DyrNabpP0H+XxEhEtS3KJ6N35wJ7Az4AvAv8EnGP7z71s861SZkfgPOCbtm8DvgBc1L13Yvt7VNPG/EuZOmSV2b4f+C7Vs0R2atJDAnjO9t+VcpcCxwI7AEdK2ljSdsBBwNvLpJQvUPWsIlqWiSsjemH7CapndVCeKvkZ4P2S/hPYEDjN9m+7bfZW4P3l/blUPZaXk6757+4A5nQ9IkDSPKqJWN8BvAm4qZpCjbUYJhNwRn2SXCJa9wXgFKrzMDdT9WouBd69ku36OsfSMpYfVVizt8KSjgX+sSzu3UL9XXNcvdjwvmt5JNWU8TNsn9BSayOayLBYRAskjQO2sP1LYG2qL2LT/Iv/f6hmyIVqOOnXfdzdn4DxktaQtD7VhQA9sv3tMgS2U7nIYCnwyj7us9FVwP6SNgWQtJGkrfpRXwxDSS4RrTkF+Hx5fwFwJHA98LUmZT8OHCXpduBwquert8z2fKqZlW+nOmdzax/behnwD10n9Pu4LbbvovqsPy+fYRaweV/rieEtsyJHRETt0nOJiIjaJblERETtklwiIqJ2SS4REVG7JJeIiKhdkktERNQuySUiImr3/wEP0YLwiYTahQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Create the histogram\n", "plt.hist(df['FTE'].dropna(), bins=10)\n", "\n", "# Add title and labels\n", "plt.title('Distribution of %full-time \\n employee works')\n", "plt.xlabel('% of full-time')\n", "plt.ylabel('num employee')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'whiskers': [,\n", " ],\n", " 'caps': [,\n", " ],\n", " 'boxes': [],\n", " 'medians': [],\n", " 'fliers': [],\n", " 'means': []}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAATMElEQVR4nO3df2zU933H8df7zj+OH3WwhwuImNA/UHXoFA3JYp3qP+qWaHXWrPyxqYBSJcqFSNN26tRNQHZ/tJXmqPtnXUS2Vajnlj/GhW5MacSYUEUuqk5U3UzDMhL/QVfAQPlhgpOas893tj/7A+NhY8PXP7/+8H0+pOjuPr7Lvf6wX3z0+f74mHNOAAD/xMIOAACYGwocADxFgQOApyhwAPAUBQ4AnqpZyi9bu3at27x581J+JQB478yZM7ecc81Tx5e0wDdv3qzu7u6l/EoA8J6ZXZpunCUUAPAUBQ4AnqLAAcBTFDgAeIoCBwBPUeCItHw+r1QqpXg8rlQqpXw+H3YkILAlPY0QWE7y+byy2axyuZza2tpULBaVTqclSbt37w45HfBotpS3k21tbXWcB47lIpVK6eDBg2pvb58YKxQKymQyOnfuXIjJgMnM7IxzrvWBcQocURWPx1Uul1VbWzsxVq1WlUgkNDo6GmIyYLKZCpw1cERWMplUsVicNFYsFpVMJkNKBMwOa+CIrGw2q6997WtatWqVent7tWnTJpVKJb3++uthRwMCYQYOSGJrQfiIAkdkdXZ26ujRo7pw4YLGxsZ04cIFHT16VJ2dnWFHAwLhICYii4OY8AUHMYEpOIgJ31HgiKxsNqt0Oq1CoaBqtapCoaB0Oq1sNht2NCAQzkJBZN272jKTyainp0fJZFKdnZ1chQlvsAYOAMsca+AA8JihwAHAUxQ4AHiKAgcAT1HgAOApChwAPEWBA4CnKHAA8BQFDgCeosABwFMUOAB4igIHAE9R4ADgKQocADxFgQOApyhwRFo+n1cqlVI8HlcqlVI+nw87EhBY4B15zCwuqVvSVefcV8zsM5LelNQk6ZeSvu6cqyxOTGDh5fN5ZbNZ5XI5tbW1qVgsKp1OSxK78sALs5mBf0NSz32v/1bS95xzWyT1S0ovZDBgsXV2diqXy6m9vV21tbVqb29XLpdTZ2dn2NGAQAIVuJk9KekPJf1g/LVJ+qKkfx1/y2FJOxcjILBYenp61NbWNmmsra1NPT09M3wCWF6CzsD/XtI+SWPjr39H0sfOuZHx11ckbZzug2b2ipl1m1l3X1/fvMICCymZTKpYLE4aKxaLSiaTISUCZueRBW5mX5F00zl35v7had467e7IzrlDzrlW51xrc3PzHGMCCy+bzSqdTqtQKKharapQKCidTiubzYYdDQgkyEHMz0v6IzN7VlJCUoPuzsjXmFnN+Cz8SUm/WbyYwMK7d6Ayk8mop6dHyWRSnZ2dHMCEN8y5aSfO07/Z7AuS/mr8LJR/kXTMOfemmX1f0vvOuX982OdbW1tdd3f3vAIDQNSY2RnnXOvU8fmcB75f0jfN7Fe6uyaem8f/CwAwS4HPA5ck59y7kt4df/5rSdsXPhIAIAiuxAQAT1HgAOApChwAPEWBA4CnKHAA8BQFDgCeosABwFMUOAB4igIHAE9R4ADgKQocADxFgQOApyhwAPAUBQ4AnqLAAcBTFDgiLZ/PK5VKKR6PK5VKKZ/Phx0JCGxWGzoAj5N8Pq9sNqtcLqe2tjYVi0Wl02lJYl9MeGFWe2LOF3tiYjlJpVI6ePCg2tvbJ8YKhYIymYzOnTsXYjJgspn2xKTAEVnxeFzlclm1tbUTY9VqVYlEQqOjoyEmAyZbjE2NAa8lk0kVi8VJY8ViUclkMqREwOxQ4IisbDardDqtQqGgarWqQqGgdDqtbDYbdjQgEA5iIrLuHajMZDLq6elRMplUZ2cnBzDhDdbAAWCZYw0cAB4zFDgAeIoCBwBPUeAA4CkKHAA8RYEj0riZFXzGeeCILG5mBd9xHjgii5tZwRfczAqYgptZwRdcyANMwc2s4LtHFriZJczsP83sv83sAzP7zvj4Z8zsF2Z23syOmlnd4scFFg43s4LvghzEHJb0RefcHTOrlVQ0s/+Q9E1J33POvWlm35eUlvRPi5gVWFDczAq+m9UauJmtlFSU9KeS/l3SeufciJn9vqRvO+f+4GGfZw0cAGZvXmvgZhY3s7OSbkr6qaT/lfSxc25k/C1XJG2c4bOvmFm3mXX39fXNLT0A4AGBCtw5N+qc+11JT0raLmm6ozzTTuWdc4ecc63Oudbm5ua5JwUATDKrs1Cccx9LelfS5yStMbN7a+hPSvrNwkYDADxMkLNQms1szfjzFZJ2SOqRVJD0x+Nve0HSTxYrJADgQUHOQtkg6bCZxXW38H/snDtuZh9KetPM/kbSe5Jyi5gTADDFIwvcOfe+pG3TjP9ad9fDAQAh4EpMAPAUBQ4AnqLAAcBTFDgAeIoCR6SxIw98xo48iCx25IHv2NABkcWOPPAFO/IAU7AjD3zBjjzAFOzIA99R4IgsduSB7ziIichiRx74jjVwAFjmWAMHgMcMBQ4AnqLAEWlciQmfcRATkcWVmPAdBzERWVyJCV9wJSYwBVdiwhechQJMwZWY8B0FjsjiSkz4joOYiKzdu3fr9OnT6ujo0PDwsOrr67V3714OYMIbzMARWfl8XkePHtWGDRsUi8W0YcMGHT16lFMJ4Q0KHJG1b98+1dTUqKurS+VyWV1dXaqpqdG+ffvCjgYEQoEjsq5cuaIXX3xRmUxGiURCmUxGL774oq5cuRJ2NCAQ1sARaT/84Q915MiRiQt59uzZE3YkIDAKHJFVU1OjgYE
BvfTSS+rt7dWmTZs0MDCgmhr+LOAHflMRWaOjoxocHNTQ0JDGxsY0NDSkwcHBsGMBgbEGjsiqq6vTnj17tHbtWsViMa1du1Z79uxRXV1d2NGAQChwRFalUtHp06d18OBBlctlHTx4UKdPn1alUgk7GhAISyiIrK1bt2rnzp2TtlTbs2eP3nrrrbCjAYEwA0dkZbNZHTlyZNIM/MiRI1xKD28wA0dksakxfMftZAFgmZvz7WTNrMXMCmbWY2YfmNk3xsebzOynZnZ+/LFxMYIDAKYXZA18RNJfOueSkj4n6c/MbKukA5JOOee2SDo1/hoAsEQeWeDOuWvOuV+OPx+Q1CNpo6SvSjo8/rbDknYuVkgAwINmdRaKmW2WtE3SLyStc85dk+6WvKRPz/CZV8ys28y6+/r65pcWADAhcIGb2WpJxyT9hXPut0E/55w75Jxrdc61Njc3zyUjAGAagQrczGp1t7z/2Tn3b+PDN8xsw/jPN0i6uTgRAQDTCXIWiknKSepxzv3dfT96W9IL489fkPSThY8HAJhJkAt5Pi/p65L+x8zOjo/9taTvSvqxmaUl9Ur6k8WJCACYziML3DlXlGQz/PhLCxsHABAU90IBAE9R4Ii0fD6vVCqleDyuVCrFjvTwCjezQmTl83lls1nlcrmJPTHT6bQkcUMreIGbWSGyUqmUdu7cqbfeemviboT3Xp87dy7seMCEmW5mxQwckfXhhx/q5s2bWrVqlZxzKpVKOnTokG7duhV2NCAQ1sARWfF4fGIT47uXO0iDg4OKx+NhxgICo8ARWSMjIxoaGlImk9HAwIAymYyGhoY0MjISdjQgEAockbZr1y51dXXpU5/6lLq6urRr166wIwGBUeCItBMnTqhUKk2sgZ84cSLsSEBgFDgiq6mpSZ988onK5bLMTOVyWZ988omamprCjgYEwlkoiKyVK1eqXC7ro48+0tjYmD766COtWLFCK1euDDsaEAgzcETW1atXtXLlSm3cuFGxWEwbN27UypUrdfXq1bCjAYFQ4Iisuro6vfrqq7pw4YJGR0d14cIFvfrqq6qrqws7GhAIBY7IqlQqeuONN1QoFFStVlUoFPTGG2+oUqmEHQ0IhDVwRNbWrVu1ZcsWdXR0aHh4WPX19ero6GANHN5gBo7Iam9v1/Hjx/Xaa6+pVCrptdde0/Hjx9Xe3h52NCAQChyRVSgUtH///kkX8uzfv1+FQiHsaEAg3I0QkRWPx1Uul1VbWzsxVq1WlUgkNDo6GmIyYLKZ7kbIDByRlUwmVSwWJ40Vi0Ulk8mQEgGzQ4EjsrLZrNLp9KSzUNLptLLZbNjRgEA4CwWRtXv3bp0+fXrSWSh79+5lNx54gxk4IiufzyuXy2l4eFiSNDw8rFwux76Y8AYHMRFZq1evVqlUUmNjoz7++GOtWbNG/f39WrVqle7cuRN2PGACBzGBKUqlklavXq1jx45peHhYx44dmyh1wAcUOCLtwIEDam9vV21trdrb23XgwIGwIwGBsYSCyDIzJRIJrV+/XpcuXdJTTz2l69evq1wuayn/LoBHYVd6YIr6+nqVy2VdvHhRkiYe6+vrwwsFzAJLKIismprp5y8zjQPLDQWOyLp3sHL9+vWKxWJav379pHFguaPAEWkvv/yyrl27ptHRUV27dk0vv/xy2JGAwDiIicgyM9XX12tsbEzValW1tbWKxWIaHh7mICaWFQ5iAtMYHh6WmUmSRkZGKG54hSUURNa94p7pEVjuHlngZtZlZjfN7Nx9Y01m9lMzOz/+2Li4MYGF55zTtm3bJmbdU18Dy12QGfiPJH15ytgBSaecc1sknRp/DXjn4sWLOnXqlCqVik6dOjVxLjjgg0eugTvnfmZmm6cMf1XSF8afH5b0rqT9C5gLWHSxWEz9/f165plnNDo6qng8rtHRUcVirCzCD3P9TV3nnLsmSeOPn57pjWb2ipl1m1l3X1/fHL8OWHhjY2OSNGkJ5f5xYLlb9KmGc+6Qc67VOdfa3Ny82F8HBGZm2rFjh5LJpGKxmJLJpHbs2MFBTHhjrgV+w8w2SNL4482FiwQsDeeczp49O3HlZalU0tmzZzmICW/MtcDflvTC+PMXJP1kYeIAS6empkblclnS/y+flMtl7oUCbwQ5jTAv6eeSPmtmV8wsLem7kp4xs/OSnhl/DXiloaFBpVJJ5XJZZqZyuaxSqaSGhoawowGBBDkLZaYdXr+0wFmAJdXf369EIqHr169Lkq5fv64VK1aov78/5GRAMJwvhciKx+NKJBJ65513VKlU9M477yiRSCgej4cdDQiEAkdkjYyMPLB5Q319vUZGRkJKBMwOBY5I2759uzo6OlRXV6eOjg5t37497EhAYBQ4IqupqUnHjx/XmjVrJElr1qzR8ePH1dTUFHIyIBgKHJE2NjamW7duSZJu3brFVZjwCgWOyLp9+7YaGhrU0tKiWCymlpYWNTQ06Pbt22FHAwKhwBFp69at06VLlzQ2NqZLly5p3bp1YUcCAqPAEWnnz5/Xc889p76+Pj333HM6f/582JGAwLhmGJEWi8X09ttv696N1mKxGOvg8AYzcETa2NiYGhsbZWZqbGykvOEVChyRtm7dOg0ODso5p8HBQdbA4RUKHJF248YN1dbWSpJqa2t148aNkBMBwVHgiLw7d+5MegR8QYEj8u7dvIqbWME3FDgi794mxmxmDN/wG4tIe+KJJ3Ty5ElVKhWdPHlSTzzxRNiRgMA4DxyRNjAwoJdeekm9vb3atGmTBgYGwo4EBEaBI7LuXbRz8eJFSZp4ZCkFvuA3FZG1cePGWY0Dyw0zcETW5cuXtW3bNlUqFfX09CiZTKqurk7vvfde2NGAQJiBI9L27t370NfAcmbOuSX7stbWVtfd3b1k3wc8jJnN+LOl/LsAHsXMzjjnWqeOMwNH5DU2Nur9999XY2Nj2FGAWWENHJG2YsUK9ff36+mnn554PTQ0FHIqIBhm4Ii0w4cPyzk38d/hw4fDjgQERoEj0p5//nkVCgVVq1UVCgU9//zzYUcCAmMJBZHV0tKiy5cv69lnn1W5XFYikVClUlFLS0vY0YBAmIEjsnp7e9XS0qJyuSxJKpfLamlpUW9vb8jJgGCYgeOx9LBTBB/m8uXLs/ospxsiTBQ4HkuzLVYzo4zhHZZQAMBTFDgAeIolFCx7TU1N6u/vX/Tvmeu6eVCNjY26ffv2on4HomVeBW5mX5b0uqS4pB845767IKmA+/T39z8W69OL/Q8EomfOSyhmFpf0D5I6JG2VtNvMti5UMADAw81nDXy7pF85537tnKtIelPSVxcmFgDgUeazhLJR0uX7Xl+R9HvziwM8yH2rQfq2/5sNu281hB0Bj5n5FPh0C3oPLFSa2SuSXpGkTZs2zePrEFX2nd+GHWFBNDY26va3w06Bx8l8CvyKpPtvGvGkpN9MfZNz7pCkQ9LdDR3m8X2IqMfhACawGOazBv5fkraY2WfMrE7SLklvL0wsAMCjzHkG7pwbMbM/l3RSd08j7HLOfbBgyQAADzWv88CdcycknV
igLACAWeBSegDwFAUOAJ6iwAHAUxQ4AHiKAgcAT9lSXiRhZn2SLi3ZFwLBrZV0K+wQwAyecs41Tx1c0gIHlisz63bOtYadA5gNllAAwFMUOAB4igIH7joUdgBgtlgDBwBPMQMHAE9R4ADgKQockWZmXWZ208zOhZ0FmC0KHFH3I0lfDjsEMBcUOCLNOfczSbfDzgHMBQUOAJ6iwAHAUxQ4AHiKAgcAT1HgiDQzy0v6uaTPmtkVM0uHnQkIikvpAcBTzMABwFMUOAB4igIHAE9R4ADgKQocADxFgQOApyhwAPDU/wF4nwKXs/V53QAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.boxplot(df['FTE'].dropna())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at the datatypes\n", "- ML algorithms work on numbers, not strings\n", " - Need a numeric representation of these strings\n", "- Strings can be slow compared to numbers\n", "- In pandas, ```category``` dtype encodes categorical data numerically,\n", " - Can speed up code\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploring datatypes in pandas\n", "It's always good to know what datatypes you're working with, especially when the inefficient pandas type object may be involved. Towards that end, let's explore what we have." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "object 23\n", "float64 2\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encode the labels as categorical variables\n", "Remember, your ultimate goal is to predict the probability that a certain label is attached to a budget line item. You just saw that many columns in your data are the inefficient object type. Does this include the labels you're trying to predict? Let's find out!\n", "\n", "There are 9 columns of labels in the dataset. Each of these columns is a category that has many possible values it can take. \n", "\n", "You will notice that every label is encoded as an object datatype. Because category datatypes are much more efficient your task is to convert the labels to category types using the ```.astype()``` method." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "LABELS = ['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type', 'Position_Type',\n", " 'Object_Type', 'Pre_K', 'Operating_Status']" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Function category\n", "Use category\n", "Sharing category\n", "Reporting category\n", "Student_Type category\n", "Position_Type category\n", "Object_Type category\n", "Pre_K category\n", "Operating_Status category\n", "dtype: object\n" ] } ], "source": [ "# Define the lambda function: categorize_label\n", "categorize_label = lambda x: x.astype('category')\n", "\n", "# Convert df[LABELS] to a category type\n", "df[LABELS] = categorize_label(df[LABELS])\n", "\n", "# Print the converted dtypes\n", "print(df[LABELS].dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counting unique labels\n", "As Peter mentioned in the video, there are over 100 unique labels. In this exercise, you will explore this fact by counting and plotting the number of unique values for each category of label." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX4AAAFTCAYAAAA+6GcUAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3de9yl9bz/8de7KU3SUVNGTKUD5dCU6UBtOiHaVA45pJy24bcdshGxHUpsx9iENHTajqUoFSpJiXSeCkVoIqIiGqKaev/++F6r1tzdh2vu7mtd657r/Xw81uNe17UO3093c3/Wd32v7/fzlW0iIqI7Vmg7gIiIGKwk/oiIjknij4jomCT+iIiOSeKPiOiYJP6IiI5Zse0A6lhnnXW84YYbth1GRMS0cumll95ie9bI89Mi8W+44YZccsklbYcRETGtSLp+tPMZ6omI6Jgk/oiIjknij4jomCT+iIiOSeKPiOiYJP6IiI5J4o+I6Jgk/oiIjpkWC7jq2PCg06fsvRZ9aI8pe6+IiGGTHn9ERMck8UdEdEwSf0RExyTxR0R0TBJ/RETHJPFHRHRMEn9ERMc0lvglzZR0kaQrJP1M0iHV+WMlXSdpYXWb21QMERFxf00u4LoD2MX23yWtBJwv6TvVYwfaPrHBtiMiYgyNJX7bBv5eHa5U3dxUexERUU+jY/ySZkhaCNwEnGX7wuqhD0i6UtInJK08xmvnS7pE0iU333xzk2FGRHRKo4nf9t225wKPALaV9DjgHcBjgG2AtYG3j/HaBbbn2Z43a9b9NomPiIhJGsisHtt/BX4A7G77Rhd3AMcA2w4ihoiIKJqc1TNL0prV/VWA3YBrJM2uzgnYC/hpUzFERMT9NTmrZzZwnKQZlA+YE2yfJun7kmYBAhYCr20whoiIGKHJWT1XAluNcn6XptqMiIiJZeVuRETHJPFHRHRMEn9ERMck8UdEdEwSf0RExyTxR0R0TBJ/RETHJPFHRHRMEn9ERMck8UdEdEwSf0RExyTxR0R0TBJ/RETHJPFHRHRMEn9ERMck8UdEdEwSf0RExyTxR0R0TBJ/RETHNJb4Jc2UdJGkKyT9TNIh1fmNJF0o6VpJx0t6UFMxRETE/U2Y+CXtIGnV6v5LJX1c0gY13vsOYBfbWwJzgd0lbQ98GPiE7U2BW4FXTT78iIhYVnV6/EcAt0vaEngbcD3wfxO9yMXfq8OVqpuBXYATq/PHAXsta9ARETF5dRL/EtsG9gQ+afuTwGp13lzSDEkLgZuAs4BfA3+1vaR6yg3A+mO8dr6kSyRdcvPNN9dpLiIiaqiT+BdLegewH3C6pBmU3vuEbN9tey7wCGBbYPPRnjbGaxfYnmd73qxZs+o0FxERNdRJ/C+kjNe/0vYfKT30jy5LI7b/CvwA2B5YU9KK1UOPAP6wLO8VEREPzISJv0r2JwErV6duAb450eskzZK0ZnV/FWA34GrgHOD51dNeBpyy7GFHRMRk1ZnV82rKxdgjq1PrAyfXeO/ZwDmSrgQuBs6yfRrwduDNkn4FPBQ4ajKBR0TE5Kw48VN4HWV8/kIA29dKWneiF9m+EthqlPO/qd4vIiJaUGeM/w7bd/YOqvH5US/IRkTE8KuT+M+V9E5gFUlPA74OnNpsWBER0ZQ6if8g4GbgKuA1wLeBdzUZVERENGfCMX7b9wCfr24RETHNTZj4JV3HKGP6th/VSEQREdGoOrN65vXdnwm8AFi7mXAiIqJpdRZw/bnv9nvb/0sptBYREdNQnaGerfsOV6B8A6hVpC0iIoZPnaGew/ruLwEWAfs0Ek1ERDSuzqyenQcRSEREDMaYiV/Sm8d7oe2PT304ERHRtPF6/BnHj4hYDo2Z+G0fMshAIiJiMOrM6plJ2RD9sZR5/ADYfmWDcUVEREPq1Or5IvAw4BnAuZRdsxY3GVRERDSnTuLfxPa7gX/YPg7YA3h8s2FFRERT6iT+u6qff5X0OGANYMPGIoqIiEbVWcC1QNJawLuBbwEPqe5HRMQ0VCfxH2P7bsr4fipyRkRMc3WGeq6TtEDSrpJU940lPVLSOZKulvQzSQdU5w+W9HtJC6vbsyYdfURELLM6if/RwPcom64vkvRpSTvWeN0S4C22Nwe2B14naYvqsU/Ynlvdvj2pyCMiYlLqlGX+p+0TbD8XmAusThn2meh1N9q+rLq/GLgaWP8BxhsREQ9QnTF+JD0VeCHwTOBilrE6p6QNga2AC4EdgNdL2h+4hPKt4NZRXjMfmA8wZ86cZWkuYrm14UGnT9l7LfrQHlP2XjG9TNjjr7ZefBPwQ+BxtvexfVLdBiQ9BDgJeJPt24AjgI0p3x5uZOmyz/eyvcD2PNvzZs2aVbe5iIiYQJ0e/5ZVwl5mklaiJP0v2/4GgO0/9T3+eeC0ybx3RERMTp0x/skmfQFHAVf3l3CWNLvvaXsDP53M+0dExOTUGuOfpB2A/YCrJC2szr0TeLGkuYApu3m9psEYIiJihMYSv+3zgdHm/Wf6ZkREi+pc3F1P0lGSvlMdbyHpVc2HFhERTaizgOtY4Azg4dXxLymzfCIiYhqqk/jXsX0CcA+A7SXA3Y1GFRERjamT+P8h6aGUi7FI2h74W6NRRUREY+pc3H0zpRzzxpJ+BMwCnt9oVBER0ZgJE7/ty6qSDY+mzNL5he27JnhZREQMqTqbre8/4tTWkrD9fw3FFBERDaoz1LNN3/2ZwK7AZUASf0TENFRnqOcN/ceS1gC+2FhEERHRqDqzeka6Hdh0qgOJiIjBqDPGfyrVVE7KB8UWwAlNBhUREc2pM8b/sb77S4Drbd/QUDwREdGwOmP8E26zGBER00edoZ7F3DfUs9RDgG2vPuVRRUREY+oM9XwC+CNlJo+AfYHVbH+kycAiIqIZdWb1PMP2Z20vtn2b7SOA5zUdWERENKNO4r9b0r6SZkhaQdK+pDpnRMS0VSfxvwTYB/hTdXtBdS4iIqahOrN6FgF7Nh9KREQMwpiJX9LbbH9E0uGMMqvH9hvHe2NJj6TU83kYZROXBbY/KWlt4HhgQ8pm6/vYvnXS/wUREbFMxuvxX139vGSS770EeEtV1nk14FJJZwEvB862/SFJBwEHAW+fZBsREbGMxkz8tk+tfh43mTe2fSNwY3V/saSrgfUpw0Y7VU87DvgBSfwREQNTZwHXZsBbKUMz9z7f9i51G5G0IbAVcCGwXvWhgO0bJa07xmvmA/MB5syZU7epiIiYQJ0FXF8HPgd8gUlM45T0EOAk4E22b5NU63W2FwALAObNmzfayuGIiJiEOol/SbVoa5lJWomS9L9s+xvV6T9Jml319mcDN03mvSMiYnLqzOM/VdJ/Spotae3ebaIXqXTtjwKutv3xvoe+Bbysuv8y4JRljjoiIiatTo+/l6QP7Dtn4FETvG4HYD/gKkkLq
3PvBD4EnCDpVcBvKQvCIiJiQOos4NpoMm9s+3xKUbfR7DqZ94yIiAeuzqye/Uc7bzubrUdETEN1hnq26bs/k9Jbv4yyKjciIqaZOkM9b+g/lrQGpTZ/RERMQ3Vm9Yx0O7DpVAcSERGDUWeM/1TuK9K2ArAFcEKTQUVERHPqjPF/rO/+EuB62zc0FE9ERDSszhj/uYMIJCIiBmMyY/wRETGNJfFHRHTMmIlf0tnVzw8PLpyIiGjaeGP8syU9FXiOpK8xovyC7csajSwiIhoxXuJ/D2VbxEcAHx/xmIHaG7FERMTwGG/rxROBEyW92/ahA4wpIiIaVGc656GSngM8pTr1A9unNRtWREQ0ZcJZPZI+CBwA/Ly6HVCdi4iIaajOyt09gLm27wGQdBxwOfCOJgOLiIhm1J3Hv2bf/TWaCCQiIgajTo//g8Dlks6hTOl8CuntR0RMW3Uu7n5V0g8oG7IIeLvtPzYdWERENKPWUI/tG21/y/YpdZO+pKMl3STpp33nDpb0e0kLq9uzJht4RERMTpO1eo4Fdh/l/Cdsz61u326w/YiIGEVjid/2ecBfmnr/iIiYnHETv6QV+odqpsjrJV1ZDQWtNU7b8yVdIumSm2++eYpDiIjornETfzV3/wpJc6aovSOAjYG5wI3AYeO0vcD2PNvzZs2aNUXNR0REnemcs4GfSboI+EfvpO3nLGtjtv/Uuy/p80BKP0REDFidxH/IVDUmabbtG6vDvYGpHkaKiIgJ1NpzV9IGwKa2vyfpwcCMiV4n6avATsA6km4A3gvsJGkupazzIuA1DyD2iIiYhAkTv6RXA/OBtSnj8+sDnwN2He91tl88yumjJhFjRERMoTrTOV8H7ADcBmD7WmDdJoOKiIjm1En8d9i+s3cgaUXKUE1ERExDdRL/uZLeCawi6WnA14FTmw0rIiKaUifxHwTcDFxFuRj7beBdTQYVERHNqTOr555q85ULKUM8v7CdoZ6IiGmqzqyePSizeH5NKcu8kaTX2P5O08FFRMTUq7OA6zBgZ9u/ApC0MXA6kMQfETEN1Rnjv6mX9Cu/AW5qKJ6IiGjYmD1+Sc+t7v5M0reBEyhj/C8ALh5AbBER0YDxhnqe3Xf/T8BTq/s3A2OWU46IiOE2ZuK3/YpBBhIREYNRZ1bPRsAbgA37nz+ZsswREdG+OrN6TqYUVzsVuKfZcCIioml1Ev+/bH+q8UgiImIg6iT+T0p6L3AmcEfvpO3LGosqIiIaUyfxPx7YD9iF+4Z6XB1HRMQ0Uyfx7w08qr80c0RETF91Ev8VwJpktW5EjGLDg06fsvda9KE9puy9Ymx1Ev96wDWSLmbpMf5M54yImIbqJP73TuaNJR0N/Dul1s/jqnNrA8dT1gQsAvaxfetk3j8iIiZnwiJtts8d7VbjvY8Fdh9x7iDgbNubAmdXxxERMUATJn5JiyXdVt3+JeluSbdN9Drb5wF/GXF6T+C46v5xwF7LHHFERDwgdXbgWq3/WNJewLaTbG892zdW73ujpHXHeqKk+cB8gDlz5kyyuYiIGKlOPf6l2D6ZAczht73A9jzb82bNmtV0cxERnVGnSNtz+w5XAOZRFnBNxp8kza56+7PJFNGIiIGrM6unvy7/EspsnD0n2d63gJcBH6p+njLJ94mIiEmqM8Y/qbr8kr4K7ASsI+kGyrTQDwEnSHoV8FvKbl4RETFA4229+J5xXmfbh473xrZfPMZDu9YJLCIimjFej/8fo5xbFXgV8FBg3MQfMd2lFEEsr8bbevGw3n1JqwEHAK8AvgYcNtbrIiJiuI07xl+VWHgzsC9lwdXWKbEQETG9jTfG/1HgucAC4PG2/z6wqCIiojHjLeB6C/Bw4F3AH/rKNiyuU7IhIiKG03hj/Mu8qjeWlouDETGMktwjIjomiT8iomOS+CMiOiaJPyKiY5L4IyI6Jok/IqJjkvgjIjomiT8iomOS+CMiOiaJPyKiY5L4IyI6Jok/IqJjkvgjIjpmws3WmyBpEbAYuBtYYnteG3FERHRRK4m/srPtW1psPyKikzLUExHRMW31+A2cKcnAkbYXjHyCpPnAfIA5c+YMOLzl21RtEJPNYSKmp7Z6/DvY3hp4JvA6SU8Z+QTbC2zPsz1v1qxZg48wImI51Urit/2H6udNwDeBbduIIyKiiwae+CWtKmm13n3g6cBPBx1HRERXtTHGvx7wTUm99r9i+7stxBER0UkDT/y2fwNsOeh2IyKiyHTOiIiOSeKPiOiYJP6IiI5J4o+I6Jgk/oiIjmmzSFvEvVJGIqbaMP6bGpaY0uOPiOiYJP6IiI5J4o+I6Jgk/oiIjknij4jomCT+iIiOSeKPiOiYJP6IiI5J4o+I6Jgk/oiIjknij4jomCT+iIiOSeKPiOiYVhK/pN0l/ULSryQd1EYMERFdNfDEL2kG8BngmcAWwIslbTHoOCIiuqqNHv+2wK9s/8b2ncDXgD1biCMiopNke7ANSs8Hdrf9H9XxfsB2tl8/4nnzgfnV4aOBX0xRCOsAt0zRe02VxFRPYqpvGONKTPVMZUwb2J418mQbO3BplHP3+/SxvQBYMOWNS5fYnjfV7/tAJKZ6ElN9wxhXYqpnEDG1MdRzA/DIvuNHAH9oIY6IiE5qI/FfDGwqaSNJDwJeBHyrhTgiIjpp4EM9tpdIej1wBjADONr2zwYYwpQPH02BxFRPYqpvGONKTPU0HtPAL+5GRES7snI3IqJjkvgjIjomiT8iomOS+FskaRVJj247joimSVp5CGJ42jiPfXiQsYxH0gqSVm+yjc4kfknrS3qypKf0bi3H82xgIfDd6niupFantUr61Ci3QyWlpEYfSZtIOkPSFdXxEyS9I3GNGtO2kq4Crq2Ot5R0eEvhfEbSHv0nqiR7LLBlOyHdG8dXJK0uaVXg58AvJB3YVHudSPzVp/mPgHcBB1a3t7YaFBxMqVv0VwDbC4ENW4wHYCYwl/JHei3wBGBt4FWS/reNgCQtlnTbiNvvJH1T0qPaiAn4AnAIcE91fBXw0pZi6TeMcX0K+HfgzwC2rwB2bimWpwOHSXougKSZlDVEKwHPbimmni1s3wbsBXwbmAPs11RjbZRsaMNewKNt39F2IH2W2P6bNFoFi9ZsAuxiewmApCOAM4GnUZJIGz5OWdn9FUq5jxcBD6PUbjoa2KmFmFa1/ePe/zvblnRXC3GMNIxxrWD7+hH/zu9uIxDbiyTtBpwhaV1KYr3Q9pvbiGeElSStRMlVn7Z9l6TG5tp3oscP/IbyqT5MfirpJcAMSZtWX39/3HJM6wOr9h2vCjzc9t1AWx+au9s+0vZi27dVNZyeZft4YK2WYvqzpI2oakxJ2gv4Y0ux9BvGuH4naVvAkmZIehPwyzYCkbQ1sC7wNuADwO+AL0naunqsTUcCiyh/c+dJ2gC4ranGutLjvx1YKOls+hKY7Te2FxJvAP6bEs9XKSuZD20xHoCPUH5PP6D0rp8C/E817vi9lmK6R9I+wInV8fP7Hmtr9eHrgaOAx0i6HriR8k2kbcMY1/+jDPfMAW4CzqrO
teGwvvtXAuv1nTOwy8Aj6jVuf4rye+q5XlJjQ2KdWLkr6WWjnbd93KBjGU21Oc2q1Rhf27HMplx7EHCR7VYL6FXj+J8EnkT54/wJ8F/A74En2j6/xdjWoPwN/bWtGEYzrHFNF5KeZvusAbf5ntHO235fI+11IfEDVAXhNqsOf2G71bFPSV8BXksZ77wUWAP4uO2PthzX+sAG9H0btH1eexENH0lrAe8GdqR8GJ0PvN/2rYnrfjFtCHyC8sENZZLFW2wvaimkCUm6zPZAh34kvaXvcCblgvjVtl/ZSHtdSPySdgKOo4yhiVIW+mVtJjRJC23PlbQv8ETg7cCltp/QYkwfBl4I/Iz7ZobY9nNajGkW8GrKjKf+D6NG/iBqxnQG5ZvHl6pTLwF2sP30tmKC4YxL0gWUomNf7ovpNbafNPar2iXpcttbtRzDysC3bD+jiffvyhj/YcDTbf8CQNJmlHH1J7YY00Cv4tc0jLOfTgF+SLnG0MpskFGsY/u9fceHSLq0tWjuM4xxrWD7mL7jYyW1NcZfV9t/hwAPBhqbrtyVxL9SL+kD2P5llXTb9DngOspFpsav4tfUm/00TIn/wbbf3nYQI5wr6fm2TwSo5oV/p+WYYDjj+r6kt1L21jblG+WpvZWpw3BdaxhUi9x6HzgzgFk0ONmjK0M9R1N+qV+sTu0LrGj7FS3E0j9nWFVcN1PGY3/Xm0PfBkknUVYwDs3sJ0nvB35s+9ttxTCSpFsp12Tuovz/exDwt+ph2147cd0b0+/Gedi25wwsmJokfcP2cwfc5gZ9h0uAPzWZC7qS+FcGXke56CXgPOCzbQxpSHrvKKfXBp4BHGz7awMO6V7DOPtJ0mLK3OY7KAlNJSQ3WstkgphmjPd4te5h4IY1rmEj6cHAW4A5tl8taVPKEOdpLcb0Rdv7TXRuytrrQuKfDiStDXxv0LMJYtlJ+hpl1fBZHqI/oGGMS9JPKDF91fbituMBkHQ8ZSbd/rYfJ2kV4ALbc1uMaamZRJJWBK60vUUT7S3XK3clnVD9vErSlSNvbcfXz/ZfKL3ZgRvG35Okx1Q/tx7t1kZMfY4FXgX8UtL7JW3Scjw9xzJ8cb0c2Bi4QtKXJO3acjwAG9v+COUbJLb/SXt/e++ovtU+QffVoloM/IkysaGZdoekY9AISbNt3zhi/Oxetq8fdExjkbQL8C7bA189OIy/J0kLbM+XdM7oIQ3+9zRSNW9+X8pU3OuAz1N6tq1dpxnWuKphqOcAnwbupHwLOLyNRWaSfgzsCvzI9taSNqb8frYddCx9MX3Q9sAqqS7Xib9H0odHzgwZ7dyAYum/et+zNqUQ2f62rxl0THDvH+YZtndro/2xSJpp+18TnRu0Krm+BNgfuIVSRG5HYNM2f4fDGJekLYBXUCpgfp8yp39H4IVtDG2q1OV/F7AFpQjhDsDLbf9g0LGMiGstYFPKAi6gucWTXUn891uJJ+nKNhZLjdKrNvBn2/8YdCwjqewHsJ/tv0345AEZ4//dwFdWjmj/BODxlKR6jO0b+h5rbfHPMMYl6ULgn5Qe/terYZXeY98a9OJASQIeQanftT1liOcntm8ZZByjxPUfwAFVbAur2C5o6pvtcj2Pv1oo8p/AxiPGqlejpUqYwzS8NIp/AVdJOgu494Oojemckh5GqRa6iqStuG8MdnXK4paBk7S97Z9Q6t6PegG1peQ6dHFJeq7tb1A6EqNW42xjRbhtSzrZ9hOB0wfd/jgOALahfAjtXF3jOqSpxpbrHr9Ksaq1gA8CB/U9tLi6mBp9hmk6ZxXLy4F5wMXcl/hvA46rksqgY2r1m8ZYhjGuYYypR9JngGNtX9x2LD2SLra9jaSFwHa271BV1qWJ9pbrHn81ZPE3SZ8E/tKbTiZpNUnb2b6w3QiHS5vz9UeyfZykLwIvtv3lCV8QUd/OwGslLaJ8s+2tDWmtThZwg6Q1gZOBs6rFeI1Vxl2ue/w9ki4Htu59BZa0AnDJsPZI2lItZPkg5aJX/wWmtrY4RNJ5tlvdH7lH0l8pi/9G1cbQBQxnXJJuB3412kO0nGSHafbaaCQ9lbIC+ztuqIrwct3j76P+cU/b91QLJGJpxwDvpZTR3ZkyE6PtvSHPUqn1cjxLX3doY6juZpbezGNYDGNc19H+PrZLUdlj97WULUavAo5qe+ptT/8qXdvn9s7R0L67XUl+v5H0RuCI6vg/KQXJYmmr2D5bkqrez8GSfkj5MGhLr/zy6/rOmQYrF45jce+PcsgMY1x3DksPus9xlEVbPwSeSflme0CrEd3nsf0H1fTqxqoHL9crd/u8FngyZdemG4DtgPmtRjSc/lUNg10r6fWS9qbsUdoa2xuNcmtr6GlRnSdV88QHaVGdJw04rh/VedJYEwoasoXtl9o+krKF578NsO1RjbNy9yaycjcGQdI2wNXAmpSSsGsAH6mmCrYV00qUPVp74/w/AI5sauxzKgzrjJZhjGuQMY1sa5h+H1m52wAN4S5OUY+kL1D2COjNONoPuNv2f7QX1fjaXMQ1nmGMa5AxSbqb+64TCViFspCrtYqv1YXmv/YWTapssL4X5VvcZ2zf2US7XRnjH8ZdnIaOys5kB3L/PXfbrIuzje0t+46/L+mK1qKpZ1h7U8MY18Bisj1u2eqWnADsTZl2Phf4OmVm3Vzgs0AjHZyuJP5h3MVpGH2dsjPY5xmeD8i7JW1s+9cAkh7F8MQWD1zbs8batort3nz9lwJH2z6suta2sKlGu5L4T5P0LA/RLk5DaontIyZ+2kAdCJwj6TeUJLEBZZppaySt7BGb+Iw4t2jwUdWyaNANStrI9nXjnKt1EXg51v/BtwvwDrh3ynlzjXZkjH/odnEaJiqbwAC8kTKb4JssvfViq+UtVHZQezTl/9s1I5NuC/EMXeG4vjiezP2vZf1fi/GM9ru6tKqV03lVVYHZwI2UstWb2b5L0mzgVNvzmmi3Ez1+26u1HcOQu5Qy1trrYrx1xONtrtydSVl3sSMlxh9K+pxbKMs8jIXj+lULfjamDBH0hsMMDDzxV0XGHgusobLpe8/q9K0KD95E2YB+NrBj32y1hwH/3VSjXenxj7rk3w3Vup5uJG1L2ej9xur4ZcDzKEMDB7fZ469KDS8GvlSdejGwlu0XtBBLf+G4S/oeWkwp+jXwwnH9JF1Nmave+h+1pD0ps1OeA3yr76HFwNdst1Idd7qSdIHtJ03Z+w3Bv5HGSTq173AmsC1wacuzVYaGpMuA3Wz/pfqQ/BrwBsrMgs1tP7/F2K4YMatn1HMDjul5tk9qq/2xSPo68MbeB/gwkPQk2xe0Hcd0N9XTXrsy1LNUzRBJjwQ+0lI4w2hGX6/+hcCCKrGdVJWJbdPlffXmkbQd7V8QPE3SS7j/WPr7WouoWAf4uaSLWPoaTSvF4yqvlXS1qy0WVXaZOixraJbZlPbQO5H4R3ED8Li2gxgiMyStWBWs2pWly1m0/W9kO2B/Sb+tjucAV6vawrKlKo+nAH+jXBtp9ULzCAe3HcAonuC+fXVt31pdH4kWtf1HPRC
SDue+T8wVKEMYw74IaJC+Cpwr6RbKNnk/BJC0CSXBtWn3ltsfzSNsD11cts+VtB5lJyeAi2zf1GZMwAqS1rJ9K9w7g6wTeWeKTenczq78D+i/ELcE+KrttocLhobtD0g6mzKz4My+i4MrUMb6W2P7ekm9zcKPkbQOsNrIueED9mNJj7d9VYsx3I+kfYCPUuoZCThc0oG2T2wxrMMov68TKZ2vfYAPtBjPdDWl5ZmX64u7kubY/u3Ez4xhJem9lFk0j7a9maSHUzbt3qHFmH5Oqel+HWWop/XNRaq4rgCe1uvlVzWqvtfmhfAqji0oi5MEnG37523GM4yqtUYjk/HfKJ3Wt9ie0jLyy3uP/2RgawBJJ9l+XsvxxLLbG9gKuAzA9h8ktb0u45kttz+WFUYM7fyZ4Si9vjbwj+ob26zRVvMGH6dstfgVygfkiyhz+X8BHA3sNJWNDcM/iib1j4u1tggpHpA7q6Gn3raZq7YcT2+LvkcCu1T3b2c4/pa+K+kMSS+X9HLgdKDVMiXVN7a3U5UioFRa/dLYr+is3W0faXux7aoHts8AAAk5SURBVNtsLwCeZft4YK2pbmwY/rE2yWPcj+njBElHAmtKejWlwuoX2gxoWJOZ7QOBBcATgC0p03LbLk64N2UR1z+gfGMD2v7GNozukbSPpBWq2z59j0157lrex/h79bf7a29DavVMK9XOUU+n/H87w/ZZLcezkGr4qbeoRtKVbY/xDyNJF9netlezp/rGdkF+V0urqs5+EngSJdH/BPgvyq6BT7R9/lS2t1yP8Q9p/e1YRlWiPwvKXqSS9rX95RZDutO2JQ3F8JOk823vOMoFwmHo4Iz8xvZKStnv6FNdvB1rc/opTfqwnPf4Y/qStDplg/X1KbVezqqODwQW2t6zxdjeCmwKPI2yacYrga/YPrytmIbZsH1jG0aD3iUwiT+GkqRTgFuBCyiridcCHgQcYLvtMhJDmcwkfdH2fhOdi+Ej6ceUhZOX0rfRUFM1oZL4YyhJusr246v7M4BbgDm2F7cb2fAaWfte0orAlba3aCGWsYafev4MfNT2Zwcc2lCStND23EG1t1yP8ce01qtLju27JV3XdtIfJ4kB0NZYuqR3AO+k7BNwW+80cCdlls/A2d6x+jnqDB5JDwV+TNlXNga8S2B6/DGU+mZkwdKzslq/YCnpfcAfgS9W8exLKSPRasVXSR+0/Y6JnzlYkrbmvo10zrd9eXV+9jCVkG6TBrxLYBJ/xDKSdKHt7SY6N8B4HmP7mirB3o/tywYdU4+k9wAvAHqb1OxFKbnx/rZiiiT+iGVWXYj7DGXDGlN2BXud7Se3FM8C2/MlnTPKw25zw6FqV7CtXG2VKWkVyvqHzduKaZi09aGdMf6IZfcSymKbT1IS/4+qc62wPb/6uXNbMYxjEWXXu94eySsDv24tmuHzZsr+F4eN8pgpxe2mXHr8EcsJSS8Avmt7saR3UQoUHtobUx9wLL09MOZQ9gfoTXfdjTLO/6JBxzTMJM3sfSsa79yUtZfEH7FsJB3DKLN72t5OsFc2otq/4IPAx4B3tnHtQWVjeigX5VcC7qHMT/8ngO3jBh3TMBs5FXesc1MlQz0Ry+60vvszKYXI/tBSLP16C3/2AI6wfYqkg1uK5SuUDVdeCVxPKQj5SOAYytTTACQ9jLI6fZVqS8peReHVgQc31m56/BEPjKQVKBuetHYRtYrjNEpRr92AJ1J61xe1sRGLpE8ADwHe3Ft/UZXh+Bhwu+03DTqmYVR9M3o5ZbOh/p0CFwPH2v7GaK97wO0m8Uc8MJIeDZxue5OW43gwZY/iq2xfK2k28HjbZ7YQy7XAZh6RYKpV2NfY3nTQMQ0zSc9rqjzDaDLUE7GMRlnB+0dKff5W2b5d0q+BZ0h6BvDDNpL+feHcv1dZrcJOb3ME2ydJ2gN4LGX4sHf+fU20t7xvxBIx5WyvZnv1vttmg+ytjUXSAcCXgXWr25ckvaGlcH4uaf+RJyW9FLimhXiGmqTPAS8E3kAZ538BsEFj7WWoJ2LZSDrb9q4TnRs0SVcCT7L9j+q4tU1PJK1PWa37T0rFSVOmda4C7G3794OOaZj1zcjq/XwI8A3bT2+ivQz1RNQkaSZlpsU6ktZi6RkYD28tsPuIvpK+1X2N8dxGVYl9O0m7UIYvBHzH9tltxDMN9Obr3y7p4ZTqpRs11VgSf0R9rwHeREnyl/adX0wp4dC2Y4ALJX2zOt4LOKrFeLD9feD7bcYwTZwqaU3go8BllG9Ije1UlqGeiJokbQPcADzf9uHVVLznUcoSHGz7L23GB0tVwhRwXhurdmPZVNOBt7f94+p4ZWCm7b811mYSf0Q9ki4DdrP9F0lPoRRpewMwF9jc9vNbimsm8FpgE+Aq4CjbS9qIJSZH0gW2nzSo9jKrJ6K+GX29+hcCC2yfZPvdlKTbluMoC4CuAp5JWSQV08uZkp4naSDXZDLGH1HfDEkrVr3pXSlVFXva/Fvaom+byqOAi1qMJSbnzZSNWO6W9E8a3ogliT+ivq8C50q6hTJN8YcAkjYBGhuPraF/m8olA+o0xhQaa4vKpmSMP2IZSNoemA2c2TdffjPgIW3tdDXM21RGPdUQz77ARrYPlfRIYLbtRr69JfFHRLRM0hGU0tW72N68Widypu1tmmgvQz0REe3bzvbWki4HsH2rpAc11Vhm9UREtO+uqnKpASTNonwDaEQSf0RE+z4FfBNYT9IHgPOB/2mqsYzxR0QMAUmPoUwTBvi+7aubaitj/BERw+HBQG+4Z5UmG8pQT0REyyS9h7ICe21gHeAYSe9qrL0M9UREtEvS1cBWtv9VHa8CXGZ78ybaS48/IqJ9i+jbchFYGfh1U42lxx8R0TJJJ1N2KDurOrUbZWbPTQC23ziV7eXibkRE+84AzqbM3b8bOKfJxpL4IyJaImlFynz9VwLXU4bfH0nZTe2dtu8a5+WTljH+iIj2fJQyk2cj20+0vRXwKGCN6rFGZIw/IqIlkq4FNvOIRFyVb7jG9qZNtJsef0REezwy6Vcn76aq29OEJP6IiPb8XNL+I09KeilwTVONZqgnIqIlktYHvkHZ0e1SSi9/G0rJhr1t/76RdpP4IyLaJWkX4LGUXdN+ZvvsRttL4o+I6JaM8UdEdEwSf0RExyTxR+dJ+vsyPPdgSW9t6v0jBiGJPyKiY5L4I0Yh6dmSLpR0uaTvSVqv7+EtJX1f0rWSXt33mgMlXSzpSkmHjPKesyWdJ2mhpJ9K+reB/MdEjJDEHzG684Htq9opXwPe1vfYE4A9gCcB75H0cElPBzYFtgXmAk+U9JQR7/kS4Azbc4EtgYUN/zdEjCrVOSNG9wjgeEmzgQcB1/U9dortfwL/lHQOJdnvCDwduLx6zkMoHwTn9b3uYuBoSSsBJ9tO4o9WpMcfMbrDgU/bfjzwGpbeHWnk4hdTFt580Pbc6raJ7aOWepJ9HvAU4PfAF0dbqh8xCEn8EaNbg5KgAV424rE9Jc2U9FBgJ0pP/gzglZIeAmUpvqR1+18kaQPgJtufB4
4Ctm4w/ogxZagnAh4s6Ya+448DBwNfl/R74CfARn2PXwScDswBDrX9B+APkjYHLpAE8HfgpVRb51V2Ag6UdFf1eHr80YqUbIiI6JgM9UREdEwSf0RExyTxR0R0TBJ/RETHJPFHRHRMEn9ERMck8UdEdEwSf0REx/x/APMQbAFMg1EAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Calculate number of unique values for each labels: num_unique_labels\n", "num_unique_labels = df[LABELS].apply(pd.Series.nunique)\n", "\n", "# Plot number of unique values for each label\n", "num_unique_labels.plot(kind='bar')\n", "\n", "# Label the axes\n", "plt.xlabel('Labels')\n", "plt.ylabel('Number of unique values');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How do we measure success?\n", "- Accuracy can be misleading when classes are imbalanced\n", " - Legitmate email: 99%, Spam: 1%\n", " - Model that never predicts spam will be 99% accurate!\n", "- Metric used in this problem: log loss\n", " - Loss function\n", " - Measure of error\n", " - Want to minimize the error (unlike accuracy)\n", "- Log loss binary classification\n", "$$ log loss = -\\frac{1}{N} \\sum^{N}_{i=1}(y_i \\log(p_i)) + (1- y_i)\\log(1-p_i)) $$\n", " - Actual value: $y: {1=\\text{yes}, 0=\\text{no}}$\n", " - Prediction (probability that the value is 1): $p$\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing log loss with NumPy\n", "To see how the log loss metric handles the trade-off between accuracy and confidence, we will use some sample data generated with NumPy and compute the log loss using the provided function ```compute_log_loss()```, which Peter showed you in the video.\n", "```python\n", "def compute_log_loss(predicted, actual, eps=1e-14):\n", " \"\"\"Compute the logarithmic loss between predicted and\n", " actual when these are 1D arrays\n", " \n", " :param predicted: The predicted probabilties as floats between 0-1\n", " :param actual: The actual binary labels. Either 0 or 1\n", " :param eps (optional): log(0) is inf, so we need to offset our\n", " predicted values slightly by eps from 0 or 1.\n", " \"\"\"\n", " predicted = np.clip(predicted, eps, 1-eps)\n", " loss = -1 * np.mean(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))\n", " return loss\n", "```" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def compute_log_loss(predicted, actual, eps=1e-14):\n", " \"\"\"Compute the logarithmic loss between predicted and\n", " actual when these are 1D arrays\n", "\n", " :param predicted: The predicted probabilties as floats between 0-1\n", " :param actual: The actual binary labels. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Computing log loss with NumPy\n", "To see how the log loss metric handles the trade-off between accuracy and confidence, we will use some sample data generated with NumPy and compute the log loss using the provided function ```compute_log_loss()```, which Peter showed you in the video.\n", "```python\n", "def compute_log_loss(predicted, actual, eps=1e-14):\n", "    \"\"\"Compute the logarithmic loss between predicted and\n", "    actual when these are 1D arrays\n", "\n", "    :param predicted: The predicted probabilities as floats between 0-1\n", "    :param actual: The actual binary labels. Either 0 or 1\n", "    :param eps (optional): log(0) is inf, so we need to offset our\n", "                           predicted values slightly by eps from 0 or 1.\n", "    \"\"\"\n", "    predicted = np.clip(predicted, eps, 1 - eps)\n", "    loss = -1 * np.mean(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))\n", "    return loss\n", "```" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def compute_log_loss(predicted, actual, eps=1e-14):\n", "    \"\"\"Compute the logarithmic loss between predicted and\n", "    actual when these are 1D arrays\n", "\n", "    :param predicted: The predicted probabilities as floats between 0-1\n", "    :param actual: The actual binary labels. Either 0 or 1\n", "    :param eps (optional): log(0) is inf, so we need to offset our\n", "                           predicted values slightly by eps from 0 or 1.\n", "    \"\"\"\n", "    predicted = np.clip(predicted, eps, 1 - eps)\n", "    loss = -1 * np.mean(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))\n", "    return loss" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "correct_confident = np.array([0.95, 0.95, 0.95, 0.95, 0.95, 0.05, 0.05, 0.05, 0.05, 0.05])\n", "correct_not_confident = np.array([0.65, 0.65, 0.65, 0.65, 0.65, 0.35, 0.35, 0.35, 0.35, 0.35])\n", "wrong_not_confident = np.array([0.35, 0.35, 0.35, 0.35, 0.35, 0.65, 0.65, 0.65, 0.65, 0.65])\n", "wrong_confident = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.95, 0.95, 0.95, 0.95, 0.95])\n", "actual_labels = np.array([1., 1., 1., 1., 1., 0., 0., 0., 0., 0.])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Log loss, correct and confident: 0.05129329438755058\n", "Log loss, correct and not confident: 0.4307829160924542\n", "Log loss, wrong and not confident: 1.049822124498678\n", "Log loss, wrong and confident: 2.9957322735539904\n", "Log loss, actual labels: 9.99200722162646e-15\n" ] } ], "source": [ "# Compute and print log loss for 1st case\n", "correct_confident_loss = compute_log_loss(correct_confident, actual_labels)\n", "print(\"Log loss, correct and confident: {}\".format(correct_confident_loss))\n", "\n", "# Compute and print log loss for 2nd case\n", "correct_not_confident_loss = compute_log_loss(correct_not_confident, actual_labels)\n", "print(\"Log loss, correct and not confident: {}\".format(correct_not_confident_loss))\n", "\n", "# Compute and print log loss for 3rd case\n", "wrong_not_confident_loss = compute_log_loss(wrong_not_confident, actual_labels)\n", "print(\"Log loss, wrong and not confident: {}\".format(wrong_not_confident_loss))\n", "\n", "# Compute and print log loss for 4th case\n", "wrong_confident_loss = compute_log_loss(wrong_confident, actual_labels)\n", "print(\"Log loss, wrong and confident: {}\".format(wrong_confident_loss))\n", "\n", "# Compute and print log loss for actual labels\n", "actual_labels_loss = compute_log_loss(actual_labels, actual_labels)\n", "print(\"Log loss, actual labels: {}\".format(actual_labels_loss))" ] },
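{ "cell_type": "markdown", "metadata": {}, "source": [ "As a sanity check (a sketch, not part of the original exercise), scikit-learn's ```log_loss``` should agree with ```compute_log_loss``` on these binary cases, up to clipping details:\n", "\n", "```python\n", "from sklearn.metrics import log_loss\n", "\n", "# Score the 'correct and confident' case both ways; the numbers should match\n", "print(log_loss(actual_labels, correct_confident, labels=[0, 1]))\n", "print(compute_log_loss(correct_confident, actual_labels))\n", "```" ] },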
Some labels don't occur very often, but we want to make sure that they appear in both the training and the test sets. We provide a function that will make sure at least ```min_count``` examples of each label appear in each split: ```multilabel_train_test_split```." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from warnings import warn\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "def multilabel_sample(y, size=1000, min_count=5, seed=None):\n", "    \"\"\" Takes a matrix of binary labels `y` and returns\n", "        the indices for a sample of size `size` if\n", "        `size` > 1, or of size `size` * len(y) if `size` <= 1.\n", "        The sample is guaranteed to have at least `min_count` of\n", "        each label.\n", "    \"\"\"\n", "    try:\n", "        if (np.unique(y).astype(int) != np.array([0, 1])).any():\n", "            raise ValueError()\n", "    except (TypeError, ValueError):\n", "        raise ValueError('multilabel_sample only works with binary indicator matrices')\n", "\n", "    if (y.sum(axis=0) < min_count).any():\n", "        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')\n", "\n", "    if size <= 1:\n", "        size = np.floor(y.shape[0] * size)\n", "\n", "    if y.shape[1] * min_count > size:\n", "        msg = \"Size less than number of columns * min_count, returning {} items instead of {}.\"\n", "        warn(msg.format(y.shape[1] * min_count, size))\n", "        size = y.shape[1] * min_count\n", "\n", "    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))\n", "\n", "    if isinstance(y, pd.DataFrame):\n", "        choices = y.index\n", "        y = y.values\n", "    else:\n", "        choices = np.arange(y.shape[0])\n", "\n", "    sample_idxs = np.array([], dtype=choices.dtype)\n", "\n", "    # first, guarantee at least min_count of each label\n", "    for j in range(y.shape[1]):\n", "        label_choices = choices[y[:, j] == 1]\n", "        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)\n", "        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])\n", "\n", "    sample_idxs = np.unique(sample_idxs)\n", "\n", "    # now that we have at least min_count of each, we can just random sample\n", "    sample_count = int(size - sample_idxs.shape[0])\n", "\n", "    # get sample_count indices from remaining choices\n", "    remaining_choices = np.setdiff1d(choices, sample_idxs)\n", "    remaining_sampled = rng.choice(remaining_choices,\n", "                                   size=sample_count,\n", "                                   replace=False)\n", "\n", "    return np.concatenate([sample_idxs, remaining_sampled])\n", "\n", "\n", "def multilabel_sample_dataframe(df, labels, size, min_count=5, seed=None):\n", "    \"\"\" Takes a dataframe `df` and returns a sample of size `size` where all\n", "        classes in the binary matrix `labels` are represented at\n", "        least `min_count` times.\n", "    \"\"\"\n", "    idxs = multilabel_sample(labels, size=size, min_count=min_count, seed=seed)\n", "    return df.loc[idxs]\n", "\n", "\n", "def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):\n", "    \"\"\" Takes a features matrix `X` and a label matrix `Y` and\n", "        returns (X_train, X_test, Y_train, Y_test) where all\n", "        classes in Y are represented at least `min_count` times.\n", "    \"\"\"\n", "    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])\n", "\n", "    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)\n", "    train_set_idxs = np.setdiff1d(index, test_set_idxs)\n", "\n", "    test_set_mask = index.isin(test_set_idxs)\n", "    train_set_mask = ~test_set_mask\n", "\n", "    return (X[train_set_mask], X[test_set_mask], 
Y[train_set_mask], Y[test_set_mask])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll start with a simple model that uses just the numeric columns of your DataFrame when calling ```multilabel_train_test_split```. " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "NUMERIC_COLUMNS = ['FTE', 'Total']" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train info:\n", "\n", "Int64Index: 320222 entries, 134338 to 415831\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 FTE 320222 non-null float64\n", " 1 Total 320222 non-null float64\n", "dtypes: float64(2)\n", "memory usage: 7.3 MB\n", "None\n", "\n", "X_test info:\n", "\n", "Int64Index: 80055 entries, 206341 to 72072\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 FTE 80055 non-null float64\n", " 1 Total 80055 non-null float64\n", "dtypes: float64(2)\n", "memory usage: 1.8 MB\n", "None\n", "\n", "y_train info:\n", "\n", "Int64Index: 320222 entries, 134338 to 415831\n", "Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating\n", "dtypes: uint8(104)\n", "memory usage: 34.2 MB\n", "None\n", "\n", "y_test info:\n", "\n", "Int64Index: 80055 entries, 206341 to 72072\n", "Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating\n", "dtypes: uint8(104)\n", "memory usage: 8.6 MB\n", "None\n" ] } ], "source": [ "# Create the new DataFrame: numeric_data_only\n", "numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000).copy()\n", "\n", "# Get labels and convert to dummy variables: label_dummies\n", "label_dummies = pd.get_dummies(df[LABELS])\n", "\n", "# Create training and test sets\n", "X_train, X_test, y_train, y_test = multilabel_train_test_split(numeric_data_only, label_dummies,\n", " size=0.2, seed=123)\n", "\n", "# Print the info\n", "print(\"X_train info:\")\n", "print(X_train.info())\n", "print(\"\\nX_test info:\")\n", "print(X_test.info())\n", "print(\"\\ny_train info:\")\n", "print(y_train.info())\n", "print(\"\\ny_test info:\")\n", "print(y_test.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training a model\n", "With split data in hand, you're only a few lines away from training a model.\n", "\n", "In this exercise, you will import the logistic regression and one versus rest classifiers in order to fit a multi-class logistic regression model to the ```NUMERIC_COLUMNS``` of your feature data.\n", "\n", "Then you'll test and print the accuracy with the ```.score()``` method to see the results of training.\n", "\n", "**Before you train!** Remember, we're ultimately going to be using logloss to score our model, so don't worry too much about the accuracy here. Keep in mind that you're throwing away all of the text data in the dataset - that's by far most of the data! So don't get your hopes up for a killer performance just yet. We're just interested in getting things up and running at the moment." 
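] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note: For a multi-label target like this one, scikit-learn's ```.score()``` reports *subset accuracy*: a row only counts as correct if every one of the 104 dummy label columns is predicted correctly, so very low numbers are expected here. A minimal sketch of that metric (the helper name ```subset_accuracy``` is ours, for illustration only):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def subset_accuracy(y_pred, y_true):\n", "    \"\"\"Fraction of rows where every label column matches exactly\"\"\"\n", "    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)\n", "    return (y_pred == y_true).all(axis=1).mean()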
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.0\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.multiclass import OneVsRestClassifier\n", "\n", "# Instantiate the classifier: clf\n", "clf = OneVsRestClassifier(LogisticRegression())\n", "\n", "# Fit the classifier to the training data\n", "clf.fit(X_train, y_train)\n", "\n", "# Print the accuracy\n", "print(\"Accuracy: {}\".format(clf.score(X_test, y_test)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok! The good news is that your workflow didn't cause any errors. The bad news is that your model scored the lowest possible accuracy: 0.0! But hey, you just threw away ALL of the text data in the budget. Later, you won't. Before you add the text data, let's see how the model does when scored by log loss." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making predictions\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use your model to predict values on holdout data\n", "You're ready to make some predictions! Remember, the train-test-split you've carried out so far is for model development. The original competition provides an additional test set, for which you'll never actually see the correct labels. This is called the \"holdout data.\"\n", "\n", "The point of the holdout data is to provide a fair test for machine learning competitions. If the labels aren't known by anyone but DataCamp, DrivenData, or whoever is hosting the competition, you can be sure that no one submits a mere copy of labels to artificially pump up the performance on their model.\n", "\n", "Remember that the original goal is to predict the probability of each label. In this exercise you'll do just that by using the .predict_proba() method on your trained model." ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "# Load the holdout data: holdout\n", "holdout = pd.read_csv('./dataset/HoldoutData.csv', index_col=0)\n", "\n", "# Generate predictions: predictions\n", "predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing out your results to a csv for submission\n", "At last, you're ready to submit some predictions for scoring. In this exercise, you'll write your predictions to a .csv using the ```.to_csv()``` method on a pandas DataFrame. Then you'll evaluate your performance according to the LogLoss metric discussed earlier!\n", "\n", "You'll need to make sure your submission obeys the correct format.\n", "\n", "To do this, you'll use your predictions values to create a new DataFrame, ```prediction_df```.\n", "\n", "**Interpreting LogLoss & Beating the Benchmark**:\n", "\n", "When interpreting your log loss score, keep in mind that the score will change based on the number of samples tested. To get a sense of how this very basic model performs, compare your score to the DrivenData benchmark model performance: 2.0455, which merely submitted uniform probabilities for each class.\n", "\n", "Remember, the lower the log loss the better. Is your model's log loss lower than 2.0455?" 
] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "BOX_PLOTS_COLUMN_INDICES = [range(0, 37),\n", " range(37, 48),\n", " range(48, 51),\n", " range(51, 76),\n", " range(76, 79),\n", " range(79, 82),\n", " range(82, 87),\n", " range(87, 96),\n", " range(96, 104)]" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "def _multi_multi_log_loss(predicted,\n", " actual,\n", " class_column_indices=BOX_PLOTS_COLUMN_INDICES,\n", " eps=1e-15):\n", " \"\"\" Multi class version of Logarithmic Loss metric as implemented on\n", " DrivenData.org\n", " \"\"\"\n", " class_scores = np.ones(len(class_column_indices), dtype=np.float64)\n", " \n", " # calculate log loss for each set of columns that belong to a class:\n", " for k, this_class_indices in enumerate(class_column_indices):\n", " # get just the columns for this class\n", " preds_k = predicted[:, this_class_indices].astype(np.float64)\n", " \n", " # normalize so probabilities sum to one (unless sum is zero, then we clip)\n", " preds_k /= np.clip(preds_k.sum(axis=1).reshape(-1, 1), eps, np.inf)\n", "\n", " actual_k = actual[:, this_class_indices]\n", "\n", " # shrink predictions so\n", " y_hats = np.clip(preds_k, eps, 1 - eps)\n", " sum_logs = np.sum(actual_k * np.log(y_hats))\n", " class_scores[k] = (-1.0 / actual.shape[0]) * sum_logs\n", " \n", " return np.average(class_scores)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "def score_submission(pred_path='./', holdout_path='https://s3.amazonaws.com/assets.datacamp.com/production/course_2826/datasets/TestSetLabelsSample.csv'):\n", " # this happens on the backend to get the score\n", " holdout_labels = pd.get_dummies(\n", " pd.read_csv(holdout_path, index_col=0)\n", " .apply(lambda x: x.astype('category'), axis=0)\n", " )\n", " \n", " preds = pd.read_csv(pred_path, index_col=0)\n", " \n", " # make sure that format is correct\n", " assert (preds.columns == holdout_labels.columns).all()\n", " assert (preds.index == holdout_labels.index).all()\n", " \n", " return _multi_multi_log_loss(preds.values, holdout_labels.values)" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Your model, trained with numeric data only, yields logloss score: 1.9587992012561084\n" ] } ], "source": [ "# Format predictions in DataFrame: prediction_df\n", "prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS]).columns,\n", " index=holdout.index,\n", " data=predictions)\n", "\n", "# Save prediction_df to csv\n", "prediction_df.to_csv('./dataset/predictions.csv')\n", "\n", "# Submit the predictions for scoring: score\n", "score = score_submission(pred_path='./dataset/predictions.csv')\n", "\n", "# Print score\n", "print('Your model, trained with numeric data only, yields logloss score: {}'.format(score))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A very brief introduction to NLP\n", "- A very brief introduction to NLP\n", " - Data fpr NLP:\n", " - Text, documents, speech,...\n", " - Tokenization\n", " - Spliting a string into segments\n", " - Store segments as list\n", " - Example: \"Natural Langauge Processing\" -> [\"Natural\", \"Language\", \"Processing\"]\n", "- Bag of words representation\n", " - Count the number of times a particular token appears\n", " - \"Bag of words\"\n", " - Count the number of times a word was pulled out of the bag\n", " - This approach discards information 
about word order\n", " - \"Red, not blue\" is the same as \"blue, not red\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Representing text numerically\n", "- Bag-of-words\n", " - Simple way to represent text in machine learning\n", " - Discards information about grammar and word order\n", " - Computes frequency of occurrence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a bag-of-words in scikit-learn\n", "In this exercise, you'll study the effects of tokenizing in different ways by comparing the bag-of-words representations resulting from different token patterns.\n", "\n", "You will focus on one feature only, the ```Position_Extra``` column, which describes any additional information not captured by the ```Position_Type``` label.\n", "\n", "For example, in the Shell you can check out the budget item in row 8960 of the data using ```df.loc[8960]```. Looking at the output reveals that this ```Object_Description``` is overtime pay. For whom? The Position Type is merely \"other\", but the Position Extra elaborates: \"BUS DRIVER\". Explore the column further to see more instances. It has a lot of NaN values.\n", "\n", "Your task is to turn the raw text in this column into a bag-of-words representation by creating tokens that contain only alphanumeric characters.\n", "\n", "For comparison purposes, the first 15 tokens of ```vec_basic```, which splits ```df.Position_Extra``` into tokens when it encounters only whitespace characters, have been printed along with the length of the representation." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 385 tokens in Position_Extra if we split on non-alpha numeric\n", "['1st', '2nd', '3rd', '4th', '56', '5th', '9th', 'a', 'ab', 'accountability', 'adaptive', 'addit', 'additional', 'adm', 'admin']\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# Create the token pattern: TOKENS_ALPHANUMERIC\n", "TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\\\s+)'\n", "\n", "# Fill missing values in df.Position_Extra\n", "df.Position_Extra.fillna('', inplace=True)\n", "\n", "# Instantiate the CountVectorizer: vec_alphanumeric\n", "vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)\n", "\n", "# Fit to the data\n", "vec_alphanumeric.fit(df.Position_Extra)\n", "\n", "# Print the number of tokens and first 15 tokens\n", "msg = \"There are {} tokens in Position_Extra if we split on non-alpha numeric\"\n", "print(msg.format(len(vec_alphanumeric.get_feature_names())))\n", "print(vec_alphanumeric.get_feature_names()[:15])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining text columns for tokenization\n", "In order to get a bag-of-words representation for all of the text data in your DataFrame, you must first convert the text data in each row of the DataFrame into a single string.\n", "\n", "In the previous exercise, this wasn't necessary because you only looked at one column of data, so each row was already just a single string. ```CountVectorizer``` expects each row to just be a single string, so in order to use all of the text columns, you'll need a method to turn a list of strings into a single string.\n", "\n", "In this exercise, you'll complete the function definition ```combine_text_columns()```. 
When completed, this function will convert all training text data in your DataFrame to a single string per row that can be passed to the vectorizer object and made into a bag-of-words using the ```.fit_transform()``` method." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "# Define combine_text_columns()\n", "def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):\n", "    \"\"\" converts all text in each row of data_frame to a single string \"\"\"\n", "    \n", "    # Drop non-text columns that are in the df\n", "    to_drop = set(to_drop) & set(data_frame.columns.tolist())\n", "    text_data = data_frame.drop(to_drop, axis='columns')\n", "    \n", "    # Replace nans with blanks\n", "    text_data.fillna(\"\", inplace=True)\n", "    \n", "    # Join all text items in a row, separated by a space\n", "    return text_data.apply(lambda x: \" \".join(x), axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What's in a token?\n", "Now you will use ```combine_text_columns``` to convert all training text data in your DataFrame to a single vector that can be passed to the vectorizer object and made into a bag-of-words using the ```.fit_transform()``` method.\n", "\n", "You'll compare the effect of tokenizing using any non-whitespace characters as a token and using only alphanumeric characters as a token.\n", "\n" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 4757 tokens in the dataset\n", "There are 3284 alpha-numeric tokens in the dataset\n" ] } ], "source": [ "# Create the basic token pattern\n", "TOKENS_BASIC = '\\\\S+(?=\\\\s+)'\n", "\n", "# Create the alphanumeric token pattern\n", "TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\\\s+)'\n", "\n", "# Instantiate basic CountVectorizer: vec_basic\n", "vec_basic = CountVectorizer(token_pattern=TOKENS_BASIC)\n", "\n", "# Instantiate alphanumeric CountVectorizer: vec_alphanumeric\n", "vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)\n", "\n", "# Create the text vector\n", "text_vector = combine_text_columns(df)\n", "\n", "# Fit and transform vec_basic\n", "vec_basic.fit_transform(text_vector)\n", "\n", "# Print number of tokens of vec_basic\n", "print(\"There are {} tokens in the dataset\".format(len(vec_basic.get_feature_names())))\n", "\n", "# Fit and transform vec_alphanumeric\n", "vec_alphanumeric.fit_transform(text_vector)\n", "\n", "# Print number of tokens of vec_alphanumeric\n", "print(\"There are {} alpha-numeric tokens in the dataset\".format(len(vec_alphanumeric.get_feature_names())))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipelines, feature & text preprocessing\n", "- The pipeline workflow\n", " - Repeatable way to go from raw data to trained model\n", " - Pipeline object takes sequential list of steps\n", " - Output of one step is input to next step\n", " - Each step is a tuple with two elements\n", " - Name: String\n", " - Transform: obj implementing ```.fit()``` and ```.transform()```\n", " - Flexible: a step can itself be another pipeline! (see the sketch below)" 
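] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the step convention, with one pipeline nested inside another (the step names here are toy names, chosen just for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import MaxAbsScaler\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "# each step is a (name, transform) tuple; the last step may be an estimator\n", "numeric_steps = Pipeline([('imp', SimpleImputer()), ('scale', MaxAbsScaler())])\n", "\n", "# a step can itself be another pipeline\n", "toy_pl = Pipeline([('pre', numeric_steps), ('clf', LogisticRegression())])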
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Instantiate pipeline\n", "In order to make your life easier as you start to work with all of the data in your original DataFrame, df, it's time to turn to one of scikit-learn's most useful objects: the Pipeline.\n", "\n", "For the next few exercises, you'll reacquaint yourself with pipelines and train a classifier on some synthetic (sample) data of multiple datatypes before using the same techniques on the main dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Preprocess" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "sample_df = pd.read_csv('./dataset/sample_data.csv')" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Accuracy on sample data - numeric, no nans: 0.62\n" ] } ], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.multiclass import OneVsRestClassifier\n", "\n", "# Split and select numeric data only, no nans\n", "X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']], \n", " pd.get_dummies(sample_df['label']),\n", " random_state=22)\n", "\n", "# Instantiate Pipeline object: pl\n", "pl = Pipeline([\n", " ('clf', OneVsRestClassifier(LogisticRegression()))\n", "])\n", "\n", "# Fit the pipeline to the training data\n", "pl.fit(X_train, y_train)\n", "\n", "# Compute and print accuracy\n", "accuracy = pl.score(X_test, y_test)\n", "print(\"\\nAccuracy on sample data - numeric, no nans: \", accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preprocessing numeric features\n", "What would have happened if you had included the with ```'with_missing'``` column in the last exercise? Without imputing missing values, the pipeline would not be happy (try it and see). So, in this exercise you'll improve your pipeline a bit by using the Imputer() imputation transformer from scikit-learn to fill in missing values in your sample data.\n", "\n", "By default, the [imputer transformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html) replaces NaNs with the mean value of the column. That's a good enough imputation strategy for the sample data, so you won't need to pass anything extra to the imputer.\n", "\n", "After importing the transformer, you will edit the steps list used in the previous exercise by inserting a ```(name, transform)``` tuple. Recall that steps are processed sequentially, so make sure the new tuple encoding your preprocessing step is put in the right place." 
] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Accuracy on sample data - all numeric, incl nans: 0.636\n" ] } ], "source": [ "from sklearn.impute import SimpleImputer\n", "\n", "# Create training and test sets using only numeric data\n", "X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing']],\n", " pd.get_dummies(sample_df['label']),\n", " random_state=456)\n", "\n", "# Instantiate Pipeline object: pl\n", "pl = Pipeline([\n", " ('imp', SimpleImputer()),\n", " ('clf', OneVsRestClassifier(LogisticRegression()))\n", "])\n", "\n", "# fit the pipeline to the training data\n", "pl.fit(X_train, y_train)\n", "\n", "# Compute and print accuracy\n", "accuracy = pl.score(X_test, y_test)\n", "print(\"\\nAccuracy on sample data - all numeric, incl nans: \", accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text features and feature unions\n", "- Preprocessing multiple dtypes\n", " - Want to use all available features in one pipeline\n", " - Problem\n", " - Pipeline steps for numeric and text processing can't follow each other\n", " - E.g., output of ```CountVectorizer``` can`t be input to ```Imputer```\n", " - Solution\n", " - ```FunctionTransformer()``` & ```FeatureUnion()```\n", "- FunctionTransformer\n", " - Turns a Python function into an object that a scikit-learn pipeline can understand\n", " - Need to write two functions for pipeline preprocessing\n", " - Take entire DataFrame, return numeric columns\n", " - Take entire DataFrame, return text columns\n", " - Can then preprocess numeric and text data in separate pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preprocessing text features\n", "Here, you'll perform a similar preprocessing pipeline step, only this time you'll use the ```text``` column from the sample data.\n", "\n", "To preprocess the text, you'll turn to ```CountVectorizer()``` to generate a bag-of-words representation of the data. Using the [default](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) arguments, add a ```(step, transform)``` tuple to the steps list in your pipeline.\n", "\n", "Make sure you select only the text column for splitting your training and test sets." 
] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "sample_df['text'] = sample_df['text'].fillna(\"\")" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Accuracy on sample data - just text data: 0.808\n" ] } ], "source": [ "# Split out only the text data\n", "X_train, X_test, y_train, y_test = train_test_split(sample_df['text'],\n", " pd.get_dummies(sample_df['label']), \n", " random_state=456)\n", "\n", "# Instantiate Pipeline object: pl\n", "pl = Pipeline([\n", " ('vec', CountVectorizer()),\n", " ('clf', OneVsRestClassifier(LogisticRegression()))\n", " ])\n", "\n", "# Fit to the training data\n", "pl.fit(X_train, y_train)\n", "\n", "# Compute and print accuracy\n", "accuracy = pl.score(X_test, y_test)\n", "print(\"\\nAccuracy on sample data - just text data: \", accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple types of processing: FunctionTransformer\n", "The next two exercises will introduce new topics you'll need to make your pipeline truly excel.\n", "\n", "Any step in the pipeline must be an object that implements the ```fit``` and ```transform``` methods. The [FunctionTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) creates an object with these methods out of any Python function that you pass to it. We'll use it to help select subsets of data in a way that plays nicely with pipelines.\n", "\n", "You are working with numeric data that needs imputation, and text data that needs to be converted into a bag-of-words. You'll create functions that separate the text from the numeric variables and see how the ```.fit()``` and ```.transform()``` methods work." ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Text Data\n", "0 \n", "1 foo\n", "2 foo bar\n", "3 \n", "4 foo bar\n", "Name: text, dtype: object\n", "\n", "Numeric Data\n", " numeric with_missing\n", "0 -10.856306 4.433240\n", "1 9.973454 4.310229\n", "2 2.829785 2.469828\n", "3 -15.062947 2.852981\n", "4 -5.786003 1.826475\n" ] } ], "source": [ "from sklearn.preprocessing import FunctionTransformer\n", "\n", "# Obtain the text data: get_text_data\n", "get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)\n", "\n", "# Obtain the numberic data: get_numeric_data\n", "get_numeric_data = FunctionTransformer(lambda x: x[['numeric', 'with_missing']], validate=False)\n", "\n", "# Fit and transform the text data: just_text_data\n", "just_text_data = get_text_data.fit_transform(sample_df)\n", "\n", "# Fit and transform the numeric data: just_numeric_data\n", "just_numeric_data = get_numeric_data.fit_transform(sample_df)\n", "\n", "# Print head to check results\n", "print('Text Data')\n", "print(just_text_data.head())\n", "print('\\nNumeric Data')\n", "print(just_numeric_data.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple types of processing: FeatureUnion\n", "Now that you can separate text and numeric data in your pipeline, you're ready to perform separate steps on each by nesting pipelines and using ```FeatureUnion()```.\n", "\n", "These tools will allow you to streamline all preprocessing steps for your model, even when multiple datatypes are involved. 
Here, for example, you don't want to impute your text data, and you don't want to create a bag-of-words with your numeric data. Instead, you want to deal with these separately and then join the results together using ```FeatureUnion()```.\n", "\n", "In the end, you'll still have only two high-level steps in your pipeline: preprocessing and model instantiation. The difference is that the first preprocessing step actually consists of a pipeline for numeric data and a pipeline for text data. The results of those pipelines are joined using ```FeatureUnion()```." ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Accuracy on sample data - all data: 0.928\n" ] } ], "source": [ "from sklearn.pipeline import FeatureUnion\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing', 'text']],\n", "                                                    pd.get_dummies(sample_df['label']),\n", "                                                    random_state=22)\n", "\n", "# Create a FeatureUnion with nested pipeline: process_and_join_features\n", "process_and_join_features = FeatureUnion(\n", "    transformer_list=[\n", "        ('numeric_features', Pipeline([\n", "            ('selector', get_numeric_data),\n", "            ('imputer', SimpleImputer())\n", "        ])),\n", "        ('text_features', Pipeline([\n", "            ('selector', get_text_data),\n", "            ('vectorizer', CountVectorizer())\n", "        ]))\n", "    ]\n", ")\n", "\n", "# Instantiate nested pipeline: pl\n", "pl = Pipeline([\n", "    ('union', process_and_join_features),\n", "    ('clf', OneVsRestClassifier(LogisticRegression()))\n", "])\n", "\n", "# Fit pl to the training data\n", "pl.fit(X_train, y_train)\n", "\n", "# Compute and print accuracy\n", "accuracy = pl.score(X_test, y_test)\n", "print(\"\\nAccuracy on sample data - all data: \", accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Choosing a classification model\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using FunctionTransformer on the main dataset\n", "In this exercise you're going to use ```FunctionTransformer``` on the primary budget data, before instantiating a multiple-datatype pipeline in the next exercise.\n", "\n", "Recall from Chapter 2 that you used a custom function ```combine_text_columns``` to select and properly format text data for tokenization; it is loaded into the workspace and ready to be put to work in a function transformer!" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [], "source": [ "# Get the dummy encoding of the labels\n", "dummy_labels = pd.get_dummies(df[LABELS])\n", "\n", "# Get the columns that are features in the original df\n", "NON_LABELS = [c for c in df.columns if c not in LABELS]\n", "\n", "# Split into training and test sets\n", "X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],\n", "                                                               dummy_labels,\n", "                                                               size=0.2,\n", "                                                               seed=123)\n", "\n", "# Preprocess the text data: get_text_data\n", "get_text_data = FunctionTransformer(combine_text_columns, validate=False)\n", "\n", "# Preprocess the numeric data: get_numeric_data\n", "get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add a model to the pipeline\n", "You're about to take everything you've learned so far and implement it in a Pipeline that works with the real, [DrivenData](https://www.drivendata.org/) budget line item data you've been exploring.\n", "\n", "Surprise! 
The structure of the pipeline is exactly the same as earlier in this chapter:\n", "\n", "- the preprocessing step uses FeatureUnion to join the results of nested pipelines that each rely on FunctionTransformer to select multiple datatypes\n", "- the model step stores the model object" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Accuracy on budget dataset: 0.0\n" ] } ], "source": [ "# Complete the pipeline: pl\n", "pl = Pipeline([\n", " ('union', FeatureUnion(\n", " transformer_list=[\n", " ('numeric_features', Pipeline([\n", " ('selector', get_numeric_data),\n", " ('imputer', SimpleImputer())\n", " ])),\n", " ('text_features', Pipeline([\n", " ('selector', get_text_data),\n", " ('vectorizer', CountVectorizer())\n", " ]))\n", " ]\n", " )),\n", " ('clf', OneVsRestClassifier(LogisticRegression(max_iter=1000), n_jobs=-1))\n", "])\n", "\n", "# Fit to the training data\n", "pl.fit(X_train, y_train)\n", "\n", "# Compute and print accuracy\n", "accuracy = pl.score(X_test, y_test)\n", "print(\"\\nAccuracy on budget dataset: \", accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Try a different class of model\n", "Now you're cruising. One of the great strengths of pipelines is how easy they make the process of testing different models.\n", "\n", "Until now, you've been using the model step ```('clf', OneVsRestClassifier(LogisticRegression()))``` in your pipeline.\n", "\n", "But what if you want to try a different model? Do you need to build an entirely new pipeline? New nests? New FeatureUnions? Nope! You just have a simple one-line change, as you'll see in this exercise.\n", "\n", "In particular, you'll swap out the logistic-regression model and replace it with a [random forest](https://en.wikipedia.org/wiki/Random_forest) classifier, which uses the statistics of an ensemble of decision trees to generate predictions." ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Accuracy on budget dataset: 0.9132096683530073\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "# Edit model step in pipeline\n", "pl = Pipeline([\n", " ('union', FeatureUnion(\n", " transformer_list = [\n", " ('numeric_features', Pipeline([\n", " ('selector', get_numeric_data),\n", " ('imputer', SimpleImputer())\n", " ])),\n", " ('text_features', Pipeline([\n", " ('selector', get_text_data),\n", " ('vectorizer', CountVectorizer())\n", " ]))\n", " ]\n", " )),\n", " ('clf', RandomForestClassifier(n_jobs=-1))\n", "])\n", "\n", "# Fit to the training data\n", "pl.fit(X_train, y_train)\n", "\n", "# Compute and print accuracy\n", "accuracy = pl.score(X_test, y_test)\n", "print(\"\\nAccuracy on budget dataset: \", accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Can you adjust the model or parameters to improve accuracy?\n", "You just saw a substantial improvement in accuracy by swapping out the model. Pipelines are amazing!" 
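] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One convenient way to tweak a parameter without rebuilding the whole pipeline is ```set_params()```, which addresses a parameter inside a named step with the double-underscore ```step__parameter``` convention. A sketch that should be equivalent in effect to the cell below, assuming ```pl``` still holds the random forest pipeline:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 'clf' is the step name, 'n_estimators' the parameter inside that step\n", "pl.set_params(clf__n_estimators=15)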
] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Accuracy on budget dataset: 0.9125601149209919\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "# Edit model step in pipeline\n", "pl = Pipeline([\n", " ('union', FeatureUnion(\n", " transformer_list = [\n", " ('numeric_features', Pipeline([\n", " ('selector', get_numeric_data),\n", " ('imputer', SimpleImputer())\n", " ])),\n", " ('text_features', Pipeline([\n", " ('selector', get_text_data),\n", " ('vectorizer', CountVectorizer())\n", " ]))\n", " ]\n", " )),\n", " ('clf', RandomForestClassifier(n_estimators=15, n_jobs=-1))\n", "])\n", "\n", "# Fit to the training data\n", "pl.fit(X_train, y_train)\n", "\n", "# Compute and print accuracy\n", "accuracy = pl.score(X_test, y_test)\n", "print(\"\\nAccuracy on budget dataset: \", accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning from the expert: processing\n", "- Text preprocessing\n", " - NLP tricks for text data\n", " - Tokenize on punctuation to avoid hyphens, underscores, etc.\n", " - Includes unigrams and bi-grams in the model to capture important information involving multiple tokens - e.g. \"middle school\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deciding what's a word\n", "Before you build up to the winning pipeline, it will be useful to look a little deeper into how the text features will be processed.\n", "\n", "In this exercise, you will use ```CountVectorizer``` on the training data ```X_train``` to see the effect of tokenization on punctuation.\n", "\n", "Remember, since CountVectorizer expects a vector, you'll need to use the preloaded function, combine_text_columns before fitting to the training data." ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['00a', '12', '1st', '2nd', '3rd', '4th', '5', '56', '5th', '6']\n" ] } ], "source": [ "# Create the text vector\n", "text_vector = combine_text_columns(X_train)\n", "\n", "# Create the token pattern: TOKENS_ALPHANUMERIC\n", "TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\\\s+)'\n", "\n", "# Instantiate the CountVectorizer: text_features\n", "text_features = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)\n", "\n", "# Fit text_features to the text vector\n", "text_features.fit(text_vector)\n", "\n", "# Print the first 10 tokens\n", "print(text_features.get_feature_names()[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### N-gram range in scikit-learn\n", "In this exercise you'll insert a ```CountVectorizer``` instance into your pipeline for the main dataset, and compute multiple n-gram features to be used in the model.\n", "\n", "In order to look for ngram relationships at multiple scales, you will use the ```ngram_range``` parameter as Peter discussed in the video.\n", "\n", "**Special functions**: You'll notice a couple of new steps provided in the pipeline in this and many of the remaining exercises. Specifically, the ```dim_red``` step following the ```vectorizer``` step , and the ```scale``` step preceeding the ```clf``` (classification) step.\n", "\n", "These have been added in order to account for the fact that you're using a reduced-size sample of the full dataset in this course. 
To make sure the models perform as the expert competition winner intended, we have to apply a [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) technique, which is what the ```dim_red``` step does, and we have to [scale the features](https://en.wikipedia.org/wiki/Feature_scaling) to lie between -1 and 1, which is what the scale step does.\n", "\n", "The ```dim_red``` step uses a scikit-learn function called ```SelectKBest()```, applying something called the [chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test) to select the K \"best\" features. The ```scale``` step uses a scikit-learn function called ```MaxAbsScaler()``` in order to squash the relevant features into the interval -1 to 1.\n", "\n", "You won't need to do anything extra with these functions here, just complete the vectorizing pipeline steps below. However, notice how easy it was to add more processing steps to our pipeline!" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import chi2, SelectKBest\n", "from sklearn.preprocessing import MaxAbsScaler\n", "\n", "# Select 300 best features\n", "chi_k = 300\n", "\n", "# Perform preprocessing\n", "get_text_data = FunctionTransformer(combine_text_columns, validate=False)\n", "get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)\n", "\n", "# Create the token pattern: TOKENS_ALPHANUMERIC\n", "TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\\\s+)'\n", "\n", "# Instantiate pipeline: pl\n", "pl = Pipeline([\n", "        ('union', FeatureUnion(\n", "            transformer_list = [\n", "                ('numeric_features', Pipeline([\n", "                    ('selector', get_numeric_data),\n", "                    ('imputer', SimpleImputer())\n", "                ])),\n", "                ('text_features', Pipeline([\n", "                    ('selector', get_text_data),\n", "                    ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,\n", "                                                   ngram_range=(1, 2))),\n", "                    ('dim_red', SelectKBest(chi2, chi_k))\n", "                ]))\n", "            ]\n", "        )),\n", "        ('scale', MaxAbsScaler()),\n", "        ('clf', OneVsRestClassifier(LogisticRegression(max_iter=1000)))\n", "    ])" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Accuracy on budget dataset: 0.5466491786896509\n" ] } ], "source": [ "# Fit to the training data\n", "pl.fit(X_train, y_train)\n", "\n", "# Compute and print accuracy\n", "accuracy = pl.score(X_test, y_test)\n", "print(\"\\nAccuracy on budget dataset: \", accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning from the expert: a stats trick\n", "- Interaction terms\n", " - Example: two budget lines built from overlapping tokens\n", " - \"English teacher for 2nd grade\"\n", " - \"2nd grade - budget for English teacher\"\n", " - Interaction terms mathematically describe when tokens appear together\n", " - The math:\n", "    $$ \\beta_1 x_1 + \\beta_2 x_2 + \\beta_3 (x_1 \\times x_2) $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implement interaction modeling in scikit-learn\n", "It's time to add interaction features to your model. The ```PolynomialFeatures``` object in scikit-learn does just that, but here you're going to use a custom interaction object, ```SparseInteractions```. Interaction terms are a statistical tool that lets your model express what happens if two features appear together in the same row.\n", "\n", "```SparseInteractions``` does the same thing as ```PolynomialFeatures```, but it uses sparse matrices to do so. 
You can get the code for ```SparseInteractions``` at [this GitHub repository](https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py).\n", "\n", "```PolynomialFeatures``` and ```SparseInteractions``` both take the argument ```degree```, which tells them what polynomial degree of interactions to compute.\n", "\n", "You're going to consider interaction terms of ```degree=2``` in your pipeline. You will insert these steps after the preprocessing steps you've built out so far, but before the classifier steps.\n", "\n", "Pipelines with interaction terms take a while to train (since you're making n features into n-squared features!), so as long as you set it up right, we'll do the heavy lifting and tell you what your score is!" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [], "source": [ "from itertools import combinations\n", "\n", "import numpy as np\n", "from scipy import sparse\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "\n", "class SparseInteractions(BaseEstimator, TransformerMixin):\n", "    def __init__(self, degree=2, feature_name_separator=\"_\"):\n", "        self.degree = degree\n", "        self.feature_name_separator = feature_name_separator\n", "\n", "    def fit(self, X, y=None):\n", "        return self\n", "\n", "    def transform(self, X):\n", "        if not sparse.isspmatrix_csc(X):\n", "            X = sparse.csc_matrix(X)\n", "\n", "        if hasattr(X, \"columns\"):\n", "            self.orig_col_names = X.columns\n", "        else:\n", "            self.orig_col_names = np.array([str(i) for i in range(X.shape[1])])\n", "\n", "        spi = self._create_sparse_interactions(X)\n", "        return spi\n", "\n", "    def get_feature_names(self):\n", "        return self.feature_names\n", "\n", "    def _create_sparse_interactions(self, X):\n", "        out_mat = []\n", "        self.feature_names = self.orig_col_names.tolist()\n", "\n", "        for sub_degree in range(2, self.degree + 1):\n", "            for col_ixs in combinations(range(X.shape[1]), sub_degree):\n", "                # add name for new column\n", "                name = self.feature_name_separator.join(self.orig_col_names[list(col_ixs)])\n", "                self.feature_names.append(name)\n", "\n", "                # multiply the selected columns together, elementwise\n", "                out = X[:, col_ixs[0]]\n", "                for j in col_ixs[1:]:\n", "                    out = out.multiply(X[:, j])\n", "\n", "                out_mat.append(out)\n", "\n", "        return sparse.hstack([X] + out_mat)" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [], "source": [ "# Instantiate pipeline: pl\n", "pl = Pipeline([\n", "        ('union', FeatureUnion(\n", "            transformer_list = [\n", "                ('numeric_features', Pipeline([\n", "                    ('selector', get_numeric_data),\n", "                    ('imputer', SimpleImputer())\n", "                ])),\n", "                ('text_features', Pipeline([\n", "                    ('selector', get_text_data),\n", "                    ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,\n", "                                                   ngram_range=(1, 2))), \n", "                    ('dim_red', SelectKBest(chi2, chi_k))\n", "                ]))\n", "            ]\n", "        )),\n", "        ('int', SparseInteractions(degree=2)),\n", "        ('scale', MaxAbsScaler()),\n", "        ('clf', OneVsRestClassifier(LogisticRegression(max_iter=1000)))\n", "    ])" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Accuracy on sparse interaction: 0.7826369371057398\n" ] } ], "source": [ "# Fit to the training data\n", "pl.fit(X_train, y_train)\n", "\n", "# Compute and print accuracy\n", "accuracy = pl.score(X_test, y_test)\n", "print(\"\\nAccuracy on sparse interaction: \", accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 
Learning from the expert: the winning model\n", "- The hashing trick\n", " - Adding new features may cause an enormous increase in array size\n", " - Hashing is a way of increasing memory efficiency\n", " - Hash function limits possible outputs, fixing array size\n", "- When to use the hashing trick\n", " - Want to make array of features as small as possible\n", " - Dimensionality reduction\n", " - Particularly useful on large datasets\n", " - E.g., lots of text data!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why is hashing a useful trick?\n", "In the video, Peter explained that a [hash](https://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick) function takes an input, in your case a token, and outputs a hash value. For example, the input may be a string and the hash value may be an integer.\n", "\n", "By explicitly stating how many possible outputs the hashing function may have, we limit the size of the objects that need to be processed. With these limits known, computation can be made more efficient and we can get results faster, even on large datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implementing the hashing trick in scikit-learn\n", "In this exercise you will check out the scikit-learn implementation of ```HashingVectorizer``` before adding it to your pipeline later.\n", "\n", "As you saw in the video, ```HashingVectorizer``` acts just like ```CountVectorizer``` in that it can accept ```token_pattern``` and ```ngram_range``` parameters. The important difference is that it creates hash values from the text, so that we get all the computational advantages of hashing!" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "          0\n", "0  0.377964\n", "1  0.755929\n", "2  0.377964\n", "3  0.377964\n", "4  0.235702\n" ] } ], "source": [ "from sklearn.feature_extraction.text import HashingVectorizer\n", "\n", "# Get text data: text_data\n", "text_data = combine_text_columns(X_train)\n", "\n", "# Create the token pattern: TOKENS_ALPHANUMERIC\n", "TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\\\s+)' \n", "\n", "# Instantiate the HashingVectorizer: hashing_vec\n", "hashing_vec = HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC)\n", "\n", "# Fit and transform the Hashing Vectorizer\n", "hashed_text = hashing_vec.fit_transform(text_data)\n", "\n", "# Create DataFrame and print the head\n", "hashed_df = pd.DataFrame(hashed_text.data)\n", "print(hashed_df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build the winning model\n", "You have arrived! This is where all of your hard work pays off. It's time to build the model that won DrivenData's competition.\n", "\n", "You've constructed a robust, powerful pipeline capable of processing training and testing data. Now that you understand the data and know all of the tools you need, you can essentially solve the whole problem in a relatively small number of lines of code. Wow!" 
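] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before fitting the winning pipeline, one property of ```HashingVectorizer``` is worth seeing once: however large the vocabulary grows, the hashed representation has a fixed number of columns, set by ```n_features``` (the tiny value below is just for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "small_hash = HashingVectorizer(n_features=16)\n", "\n", "# no fitting needed: hashing is stateless, and the width is fixed at (2, 16)\n", "print(small_hash.transform(['foo bar', 'a totally different document']).shape)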
] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [], "source": [ "# Instantiate the winning model pipeline: pl\n", "pl = Pipeline([\n", " ('union', FeatureUnion(\n", " transformer_list = [\n", " ('numeric_features', Pipeline([\n", " ('selector', get_numeric_data),\n", " ('imputer', SimpleImputer())\n", " ])),\n", " ('text_features', Pipeline([\n", " ('selector', get_text_data),\n", " ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,\n", " norm=None, binary=False,\n", " ngram_range=(1, 2))),\n", " ('dim_red', SelectKBest(chi2, chi_k))\n", " ]))\n", " ]\n", " )),\n", " ('int', SparseInteractions(degree=2)),\n", " ('scale', MaxAbsScaler()),\n", " ('clf', OneVsRestClassifier(LogisticRegression()))\n", " ])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }