{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Engineering\n", "> In this section you'll learn about feature engineering. You'll explore different ways to create new, more useful, features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features. This is the Summary of lecture \"Preprocessing for Machine Learning in Python\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature engineering\n", "- Creation of new features based on existing features\n", "- Insight into relationships between features\n", "- Extract and expand data\n", "- Dataset-dependent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Identifying areas for feature engineering\n", "Take an exploratory look at the `volunteer` dataset, using the variable of that name. Which of the following columns would you want to perform a feature engineering task on?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
opportunity_idcontent_idvol_requestsevent_timetitlehitssummaryis_prioritycategory_idcategory_desc...end_date_datestatusLatitudeLongitudeCommunity BoardCommunity CouncilCensus TractBINBBLNTA
0499637004500Volunteers Needed For Rise Up & Stay Put! Home...737Building on successful events last summer and ...NaNNaNNaN...July 30 2011approvedNaNNaNNaNNaNNaNNaNNaNNaN
150083703620Web designer22Build a website for an Afghan businessNaN1.0Strengthening Communities...February 01 2011approvedNaNNaNNaNNaNNaNNaNNaNNaN
2501637143200Urban Adventures - Ice Skating at Lasker Rink62Please join us and the students from Mott Hall...NaN1.0Strengthening Communities...January 29 2011approvedNaNNaNNaNNaNNaNNaNNaNNaN
35022372375000Fight global hunger and support women farmers ...14The Oxfam Action Corps is a group of dedicated...NaN1.0Strengthening Communities...March 31 2012approvedNaNNaNNaNNaNNaNNaNNaNNaN
4505537425150Stop 'N' Swap31Stop 'N' Swap reduces NYC's waste by finding n...NaN4.0Environment...February 05 2011approvedNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 35 columns

\n", "
" ], "text/plain": [ " opportunity_id content_id vol_requests event_time \\\n", "0 4996 37004 50 0 \n", "1 5008 37036 2 0 \n", "2 5016 37143 20 0 \n", "3 5022 37237 500 0 \n", "4 5055 37425 15 0 \n", "\n", " title hits \\\n", "0 Volunteers Needed For Rise Up & Stay Put! Home... 737 \n", "1 Web designer 22 \n", "2 Urban Adventures - Ice Skating at Lasker Rink 62 \n", "3 Fight global hunger and support women farmers ... 14 \n", "4 Stop 'N' Swap 31 \n", "\n", " summary is_priority category_id \\\n", "0 Building on successful events last summer and ... NaN NaN \n", "1 Build a website for an Afghan business NaN 1.0 \n", "2 Please join us and the students from Mott Hall... NaN 1.0 \n", "3 The Oxfam Action Corps is a group of dedicated... NaN 1.0 \n", "4 Stop 'N' Swap reduces NYC's waste by finding n... NaN 4.0 \n", "\n", " category_desc ... end_date_date status Latitude \\\n", "0 NaN ... July 30 2011 approved NaN \n", "1 Strengthening Communities ... February 01 2011 approved NaN \n", "2 Strengthening Communities ... January 29 2011 approved NaN \n", "3 Strengthening Communities ... March 31 2012 approved NaN \n", "4 Environment ... February 05 2011 approved NaN \n", "\n", " Longitude Community Board Community Council Census Tract BIN BBL NTA \n", "0 NaN NaN NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN NaN NaN NaN \n", "\n", "[5 rows x 35 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "volunteer = pd.read_csv('./dataset/volunteer_opportunities.csv')\n", "volunteer.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encoding categorical variables\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encoding categorical variables - binary\n", "Take a look at the `hiking` dataset. 
There are several columns here that need encoding; one of them is the `Accessible` column, which must be encoded numerically before it can be used in a model. `Accessible` is a binary feature with only two values, `Y` and `N`, so it can be encoded as 1s and 0s. Use scikit-learn's `LabelEncoder` to do that transformation.\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Prop_IDNameLocationPark_NameLengthDifficultyOther_DetailsAccessibleLimited_Accesslatlon
0B057Salt Marsh Nature TrailEnter behind the Salt Marsh Nature Center, loc...Marine Park0.8 milesNone<p>The first half of this mile-long trail foll...YNNaNNaN
1B073LullwaterEnter Park at Lincoln Road and Ocean Avenue en...Prospect Park1.0 mileEasyExplore the Lullwater to see how nature thrive...NNNaNNaN
2B073MidwoodEnter Park at Lincoln Road and Ocean Avenue en...Prospect Park0.75 milesEasyStep back in time with a walk through Brooklyn...NNNaNNaN
3B073PeninsulaEnter Park at Lincoln Road and Ocean Avenue en...Prospect Park0.5 milesEasyDiscover how the Peninsula has changed over th...NNNaNNaN
4B073WaterfallEnter Park at Lincoln Road and Ocean Avenue en...Prospect Park0.5 milesEasyTrace the source of the Lake on the Waterfall ...NNNaNNaN
\n", "
" ], "text/plain": [ " Prop_ID Name \\\n", "0 B057 Salt Marsh Nature Trail \n", "1 B073 Lullwater \n", "2 B073 Midwood \n", "3 B073 Peninsula \n", "4 B073 Waterfall \n", "\n", " Location Park_Name \\\n", "0 Enter behind the Salt Marsh Nature Center, loc... Marine Park \n", "1 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n", "2 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n", "3 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n", "4 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n", "\n", " Length Difficulty Other_Details \\\n", "0 0.8 miles None

The first half of this mile-long trail foll... \n", "1 1.0 mile Easy Explore the Lullwater to see how nature thrive... \n", "2 0.75 miles Easy Step back in time with a walk through Brooklyn... \n", "3 0.5 miles Easy Discover how the Peninsula has changed over th... \n", "4 0.5 miles Easy Trace the source of the Lake on the Waterfall ... \n", "\n", " Accessible Limited_Access lat lon \n", "0 Y N NaN NaN \n", "1 N N NaN NaN \n", "2 N N NaN NaN \n", "3 N N NaN NaN \n", "4 N N NaN NaN " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hiking = pd.read_json('./dataset/hiking.json')\n", "hiking.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AccessibleAccessible_enc
0Y1
1N0
2N0
3N0
4N0
\n", "
" ], "text/plain": [ " Accessible Accessible_enc\n", "0 Y 1\n", "1 N 0\n", "2 N 0\n", "3 N 0\n", "4 N 0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "# Set up the LabelEncoder object\n", "enc = LabelEncoder()\n", "\n", "# Apply the encoding to the \"Accessible\" column\n", "hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])\n", "\n", "# Compare the two columns\n", "hiking[['Accessible', 'Accessible_enc']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encoding categorical variables - one-hot\n", "One of the columns in the `volunteer` dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use Pandas' `get_dummies()` function to do so." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Education Emergency Preparedness Environment Health \\\n", "0 0 0 0 0 \n", "1 0 0 0 0 \n", "2 0 0 0 0 \n", "3 0 0 0 0 \n", "4 0 0 1 0 \n", "\n", " Helping Neighbors in Need Strengthening Communities \n", "0 0 0 \n", "1 0 1 \n", "2 0 1 \n", "3 0 1 \n", "4 0 0 \n" ] } ], "source": [ "# Transform the category_desc column\n", "category_enc = pd.get_dummies(volunteer['category_desc'])\n", "\n", "# Take a look at the encoded columns\n", "print(category_enc.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Engineering numerical features\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Engineering numerical features - taking an average\n", "A good use case for taking an aggregate statistic to create a new feature is to take the mean of columns. Here, you have a DataFrame of running times named `running_times_5k`. 
For each `name` in the dataset, take the mean of their 5 run times.\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namerun1run2run3run4run5
0Sue20.118.519.620.318.3
1Mark16.517.116.917.617.3
2Sean23.525.125.224.623.9
3Erin21.721.120.922.122.2
4Jenny25.827.126.126.726.9
5Russell30.929.631.430.429.9
\n", "
" ], "text/plain": [ " name run1 run2 run3 run4 run5\n", "0 Sue 20.1 18.5 19.6 20.3 18.3\n", "1 Mark 16.5 17.1 16.9 17.6 17.3\n", "2 Sean 23.5 25.1 25.2 24.6 23.9\n", "3 Erin 21.7 21.1 20.9 22.1 22.2\n", "4 Jenny 25.8 27.1 26.1 26.7 26.9\n", "5 Russell 30.9 29.6 31.4 30.4 29.9" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "running_times_5k = pd.read_csv('./dataset/running_times_5k.csv')\n", "running_times_5k" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " name run1 run2 run3 run4 run5 mean\n", "0 Sue 20.1 18.5 19.6 20.3 18.3 19.36\n", "1 Mark 16.5 17.1 16.9 17.6 17.3 17.08\n", "2 Sean 23.5 25.1 25.2 24.6 23.9 24.46\n", "3 Erin 21.7 21.1 20.9 22.1 22.2 21.60\n", "4 Jenny 25.8 27.1 26.1 26.7 26.9 26.52\n", "5 Russell 30.9 29.6 31.4 30.4 29.9 30.44\n" ] } ], "source": [ "# Create a list of the columns to average\n", "run_columns = ['run1', 'run2', 'run3', 'run4', 'run5']\n", "\n", "# Use apply to create a mean column\n", "running_times_5k['mean'] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)\n", "\n", "# Take a look at the results\n", "print(running_times_5k)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Engineering numerical features - datetime\n", "There are several columns in the `volunteer` dataset comprised of datetimes. Let's take a look at the `start_date_date` column and extract just the month to use as a feature for modeling.\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
start_date_convertedstart_date_month
02011-07-307
12011-02-012
22011-01-291
32011-02-142
42011-02-052
\n", "
" ], "text/plain": [ " start_date_converted start_date_month\n", "0 2011-07-30 7\n", "1 2011-02-01 2\n", "2 2011-01-29 1\n", "3 2011-02-14 2\n", "4 2011-02-05 2" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First, convert string column to date column\n", "volunteer['start_date_converted'] = pd.to_datetime(volunteer['start_date_date'])\n", "\n", "# Extract just the month from the converted column\n", "volunteer['start_date_month'] = volunteer['start_date_converted'].apply(lambda row: row.month)\n", "\n", "# Take a look at the converted and new month columns\n", "volunteer[['start_date_converted', 'start_date_month']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text classification\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Engineering features from strings - extraction\n", "The `Length` column in the `hiking` dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in Pandas to apply the extraction to the DataFrame.\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LengthLength_num
00.8 miles0.80
11.0 mile1.00
20.75 miles0.75
30.5 miles0.50
40.5 miles0.50
\n", "
" ], "text/plain": [ " Length Length_num\n", "0 0.8 miles 0.80\n", "1 1.0 mile 1.00\n", "2 0.75 miles 0.75\n", "3 0.5 miles 0.50\n", "4 0.5 miles 0.50" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "# Write a pattern to extract numbers and decimals\n", "def return_mileage(length):\n", " pattern = re.compile(r'\\d+\\.\\d+')\n", " \n", " if length == None:\n", " return\n", " \n", " # Search the text for matches\n", " mile = re.match(pattern, length)\n", " \n", " # If a value is returned, use group(0) to return the found value\n", " if mile is not None:\n", " return float(mile.group(0))\n", " \n", "# Apply the function to the Length column and take a look at both columns\n", "hiking['Length_num'] = hiking['Length'].apply(lambda row: return_mileage(row))\n", "hiking[['Length', 'Length_num']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Engineering features from strings - tf/idf\n", "Let's transform the `volunteer` dataset's `title` column into a text vector, to use in a prediction task in the next exercise.\n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "# Need to drop NaN for train_test_split\n", "volunteer = pd.read_csv('./dataset/volunteer_opportunities.csv')\n", "volunteer = volunteer.dropna(subset=['category_desc'], axis=0)\n", "\n", "# Take the title text\n", "title_text = volunteer['title']\n", "\n", "# Create the vectorizer method\n", "tfidf_vec = TfidfVectorizer()\n", "\n", "# Transform the text into tf-idf vectors\n", "text_tfidf = tfidf_vec.fit_transform(title_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text classification using tf/idf vectors\n", "Now that we've encoded the `volunteer` dataset's `title` column into tf/idf vectors, let's use those vectors to try to predict the `category_desc` column.\n", "\n" ] 
}, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.5225806451612903\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.naive_bayes import GaussianNB\n", "\n", "nb = GaussianNB()\n", "\n", "# Split the dataset according to the class distribution of category_desc\n", "y = volunteer['category_desc']\n", "X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)\n", "\n", "# Fit the model to the training data\n", "nb.fit(X_train, y_train)\n", "\n", "# Print out the model's accuracy\n", "print(nb.score(X_test, y_test))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }