{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Selecting features for modeling\n",
"> This chapter covers a few techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA). This is a summary of the lecture \"Preprocessing for Machine Learning in Python\" on DataCamp.\n",
"\n",
"- toc: true \n",
"- badges: true\n",
"- comments: true\n",
"- author: Chanseok Kang\n",
"- categories: [Python, Datacamp, Machine_Learning]\n",
"- image: "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature selection\n",
"- Selecting features to be used for modeling\n",
"- Doesn't create new features\n",
"- Can improve the model's performance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Identifying areas for feature selection\n",
"Take an exploratory look at the post-feature engineering `hiking` dataset. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Prop_ID Name \\\n",
"0 B057 Salt Marsh Nature Trail \n",
"1 B073 Lullwater \n",
"2 B073 Midwood \n",
"3 B073 Peninsula \n",
"4 B073 Waterfall \n",
"\n",
" Location Park_Name \\\n",
"0 Enter behind the Salt Marsh Nature Center, loc... Marine Park \n",
"1 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n",
"2 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n",
"3 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n",
"4 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n",
"\n",
" Length Difficulty Other_Details \\\n",
"0 0.8 miles None The first half of this mile-long trail foll... \n",
"1 1.0 mile Easy Explore the Lullwater to see how nature thrive... \n",
"2 0.75 miles Easy Step back in time with a walk through Brooklyn... \n",
"3 0.5 miles Easy Discover how the Peninsula has changed over th... \n",
"4 0.5 miles Easy Trace the source of the Lake on the Waterfall ... \n",
"\n",
" Accessible Limited_Access lat lon \n",
"0 Y N NaN NaN \n",
"1 N N NaN NaN \n",
"2 N N NaN NaN \n",
"3 N N NaN NaN \n",
"4 N N NaN NaN "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hiking = pd.read_json('./dataset/hiking.json')\n",
"hiking.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Removing redundant features\n",
"- Remove noisy features\n",
"- Remove correlated features\n",
" - Statistically correlated: features move together directionally\n",
" - Linear models assume feature independence\n",
" - Pearson correlation coefficient\n",
"- Remove duplicated features"
]
},
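To make the Pearson check concrete, here is a minimal sketch on a synthetic DataFrame (toy data, not the course dataset); the 0.9 threshold and the upper-triangle scan are illustrative choices, not part of the course:

```python
import numpy as np
import pandas as pd

# Toy frame: b is (nearly) a linear function of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({'a': a,
                   'b': 2 * a + rng.normal(scale=0.1, size=100),
                   'c': rng.normal(size=100)})

# Pearson correlation matrix; |r| near 1 flags redundant pairs
corr = df.corr()

# Scan only the upper triangle so each pair is checked once
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(redundant)  # 'b' is flagged as redundant with 'a'
```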
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Selecting relevant features\n",
"Now let's identify the redundant columns in the `volunteer` dataset and perform feature selection on the dataset to return a DataFrame of the relevant features.\n",
"\n",
"For example, if you explore the `volunteer` dataset in the console, you'll see three features which are related to location: `locality`, `region`, and `postalcode`. They contain repeated information, so it would make sense to keep only one of the features.\n",
"\n",
"There are also features that have gone through the feature engineering process: columns like `Education` and `Emergency Preparedness` are a product of encoding the categorical variable `category_desc`, so `category_desc` itself is redundant now.\n",
"\n",
"Take a moment to examine the features of `volunteer` and try to identify the redundant ones."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" vol_requests title hits \\\n",
"0 2 Web designer 22 \n",
"1 20 Urban Adventures - Ice Skating at Lasker Rink 62 \n",
"2 500 Fight global hunger and support women farmers ... 14 \n",
"3 15 Stop 'N' Swap 31 \n",
"4 15 Queens Stop 'N' Swap 135 \n",
"\n",
" category_desc \\\n",
"0 Strengthening Communities \n",
"1 Strengthening Communities \n",
"2 Strengthening Communities \n",
"3 Environment \n",
"4 Environment \n",
"\n",
" locality region postalcode \\\n",
"0 5 22nd St\\nNew York, NY 10010\\n(40.74053152272... NY 10010.0 \n",
"1 NaN NY 10026.0 \n",
"2 NaN NY 2114.0 \n",
"3 NaN NY 10455.0 \n",
"4 NaN NY 11372.0 \n",
"\n",
" created_date vol_requests_lognorm created_month Education \\\n",
"0 2011-01-14 0.693147 1 0 \n",
"1 2011-01-19 2.995732 1 0 \n",
"2 2011-01-21 6.214608 1 0 \n",
"3 2011-01-28 2.708050 1 0 \n",
"4 2011-01-28 2.708050 1 0 \n",
"\n",
" Emergency Preparedness Environment Health Helping Neighbors in Need \\\n",
"0 0 0 0 0 \n",
"1 0 0 0 0 \n",
"2 0 0 0 0 \n",
"3 0 1 0 0 \n",
"4 0 1 0 0 \n",
"\n",
" Strengthening Communities \n",
"0 1 \n",
"1 1 \n",
"2 1 \n",
"3 0 \n",
"4 0 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"volunteer = pd.read_csv('./dataset/volunteer_sample.csv')\n",
"volunteer.dropna(subset=['category_desc'], axis=0, inplace=True)\n",
"volunteer.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['vol_requests', 'title', 'hits', 'category_desc', 'locality', 'region',\n",
" 'postalcode', 'created_date', 'vol_requests_lognorm', 'created_month',\n",
" 'Education', 'Emergency Preparedness', 'Environment', 'Health',\n",
" 'Helping Neighbors in Need', 'Strengthening Communities'],\n",
" dtype='object')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"volunteer.columns"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" title hits postalcode \\\n",
"0 Web designer 22 10010.0 \n",
"1 Urban Adventures - Ice Skating at Lasker Rink 62 10026.0 \n",
"2 Fight global hunger and support women farmers ... 14 2114.0 \n",
"3 Stop 'N' Swap 31 10455.0 \n",
"4 Queens Stop 'N' Swap 135 11372.0 \n",
"\n",
" vol_requests_lognorm created_month Education Emergency Preparedness \\\n",
"0 0.693147 1 0 0 \n",
"1 2.995732 1 0 0 \n",
"2 6.214608 1 0 0 \n",
"3 2.708050 1 0 0 \n",
"4 2.708050 1 0 0 \n",
"\n",
" Environment Health Helping Neighbors in Need Strengthening Communities \n",
"0 0 0 0 1 \n",
"1 0 0 0 1 \n",
"2 0 0 0 1 \n",
"3 1 0 0 0 \n",
"4 1 0 0 0 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a list of redundant column names to drop\n",
"to_drop = [\"locality\", \"region\", \"category_desc\", \"created_date\", \"vol_requests\"]\n",
"\n",
"# Drop those columns from the dataset\n",
"volunteer_subset = volunteer.drop(to_drop, axis=1)\n",
"\n",
"# Print out the head of the new dataset\n",
"volunteer_subset.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Checking for correlated features\n",
"Let's take a look at the `wine` dataset again, which is made up of continuous, numerical features. Run Pearson's correlation coefficient on the dataset to determine which columns are good candidates for elimination, then remove those columns from the DataFrame.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Flavanoids Total phenols Malic acid OD280/OD315 of diluted wines Hue\n",
"0 3.06 2.80 1.71 3.92 1.04\n",
"1 2.76 2.65 1.78 3.40 1.05\n",
"2 3.24 2.80 2.36 3.17 1.03\n",
"3 3.49 3.85 1.95 3.45 0.86\n",
"4 2.69 2.80 2.59 2.93 1.04"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wine = pd.read_csv('./dataset/wine_sample.csv')\n",
"wine.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Flavanoids Total phenols Malic acid \\\n",
"Flavanoids 1.000000 0.864564 -0.411007 \n",
"Total phenols 0.864564 1.000000 -0.335167 \n",
"Malic acid -0.411007 -0.335167 1.000000 \n",
"OD280/OD315 of diluted wines 0.787194 0.699949 -0.368710 \n",
"Hue 0.543479 0.433681 -0.561296 \n",
"\n",
" OD280/OD315 of diluted wines Hue \n",
"Flavanoids 0.787194 0.543479 \n",
"Total phenols 0.699949 0.433681 \n",
"Malic acid -0.368710 -0.561296 \n",
"OD280/OD315 of diluted wines 1.000000 0.565468 \n",
"Hue 0.565468 1.000000 \n"
]
}
],
"source": [
"# Print out the column correlations of the wine dataset\n",
"print(wine.corr())"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Take a minute to find the column where the correlation value is greater than 0.75 at least twice\n",
"to_drop = \"Flavanoids\"\n",
"\n",
"# Drop that column from the DataFrame\n",
"wine = wine.drop(to_drop, axis=1)"
]
},
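The manual scan above can be automated. A sketch with a hypothetical helper, `overly_correlated` (not from the course), demonstrated on a small hand-built matrix echoing the correlation values printed earlier; the 0.75 threshold and the "at least twice" rule come from the exercise:

```python
import pandas as pd

def overly_correlated(corr, threshold=0.75):
    """Hypothetical helper: columns whose |r| exceeds the threshold
    with at least two other columns."""
    hits = (corr.abs() > threshold).sum() - 1  # subtract self-correlation
    return list(hits[hits >= 2].index)

# Hand-built matrix echoing the Flavanoids / Total phenols / OD280 values above
cols = ['Flavanoids', 'Total phenols', 'OD280/OD315 of diluted wines']
corr = pd.DataFrame([[1.00, 0.86, 0.79],
                     [0.86, 1.00, 0.70],
                     [0.79, 0.70, 1.00]], index=cols, columns=cols)

print(overly_correlated(corr))  # ['Flavanoids']
```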
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selecting features using text vectors\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exploring text vectors, part 1\n",
"Let's expand on the text vector exploration method we just learned about, using the `volunteer` dataset's title tf-idf vectors. In this first part, we'll extend the function from the slides so that it returns the indices of the top-weighted words. In the next exercise, we'll write another function to collect the top words across all documents, extract them, and use that list to filter down our `text_tfidf` vector."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"vocab_csv = pd.read_csv('./dataset/vocab_volunteer.csv', index_col=0).to_dict()\n",
"vocab = vocab_csv['0']"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"volunteer = volunteer[['category_desc', 'title']]\n",
"volunteer = volunteer.dropna(subset=['category_desc'], axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# Take the title text\n",
"title_text = volunteer['title']\n",
"\n",
"# Create the vectorizer method\n",
"tfidf_vec = TfidfVectorizer()\n",
"\n",
"# Transform the text into tf-idf vectors\n",
"text_tfidf = tfidf_vec.fit_transform(title_text)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[189, 942, 466]\n"
]
}
],
"source": [
"# Add in the rest of the parameters\n",
"def return_weights(vocab, original_vocab, vector, vector_index, top_n):\n",
" zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))\n",
" \n",
" # Let's transform that zipped dict into a series\n",
" zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})\n",
" \n",
" # Let's sort the series to pull out the top n weighted words\n",
" zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index\n",
" return [original_vocab[i] for i in zipped_index]\n",
"\n",
"# Print out the weighted words\n",
"print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, vector_index=8, top_n=3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exploring text vectors, part 2\n",
"Using the function we wrote in the previous exercise, we're going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def words_to_filter(vocab, original_vocab, vector, top_n):\n",
" filter_list = []\n",
" for i in range(0, vector.shape[0]):\n",
" # here we'll call the function from the previous exercise, \n",
" # and extend the list we're creating\n",
" filtered = return_weights(vocab, original_vocab, vector, i, top_n)\n",
" filter_list.extend(filtered)\n",
" # Return the list in a set, so we don't get duplicate word indices\n",
" return set(filter_list)\n",
"\n",
"# Call the function to get the list of word indices\n",
"filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, top_n=3)\n",
"\n",
"# By converting filtered_words back to a list, \n",
"# we can use it to filter the columns in the text vector\n",
"filtered_text = text_tfidf[:, list(filtered_words)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training Naive Bayes with feature selection\n",
"Let's re-run the Naive Bayes text classification model we ran at the end of chapter 3, with our selection choices from the previous exercise, on the `volunteer` dataset's `title` and `category_desc` columns.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5483870967741935\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"from sklearn.naive_bayes import GaussianNB\n",
"\n",
"nb = GaussianNB()\n",
"y = volunteer['category_desc']\n",
"\n",
"# Split the dataset according to the class distribution of category_desc,\n",
"# using the filtered_text vector\n",
"X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y)\n",
"\n",
"# Fit the model to the training data\n",
"nb.fit(X_train, y_train)\n",
"\n",
"# Print out the model's accuracy\n",
"print(nb.score(X_test, y_test))"
]
},
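As an aside, `GaussianNB` on a densified tf-idf matrix works, but `MultinomialNB` is the more conventional choice for term-weight features and keeps the matrix sparse. A sketch on toy documents (synthetic data and labels, not the `volunteer` dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["park cleanup volunteers", "teach kids math",
        "river park cleanup", "tutor students in reading"]
labels = ["Environment", "Education", "Environment", "Education"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # stays sparse; no .toarray() needed

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["help with a park cleanup"])))
```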
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that our accuracy score wasn't that different from the score at the end of chapter 3. That's okay; the `title` field is a very small text field, appropriate for demonstrating how filtering vectors works."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dimensionality reduction\n",
"- Unsupervised learning method\n",
"- Combines or decomposes a feature space\n",
"- Feature extraction\n",
"- Principal component analysis\n",
" - Linear transformation to uncorrelated space\n",
" - Captures as much variance as possible in each component\n",
"- PCA caveats\n",
" - Difficult to interpret components\n",
" - End of preprocessing journey"
]
},
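The "linear transformation to uncorrelated space" claim is easy to verify on synthetic data (a sketch, not the course dataset): after `fit_transform`, the components have (numerically) zero correlation, and the first one soaks up most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Correlated 2-D cloud: the second coordinate tracks the first
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
X = np.hstack([x, x + 0.3 * rng.normal(size=(300, 1))])

pca = PCA()
Z = pca.fit_transform(X)

# Components are uncorrelated, and variance concentrates in the first one
print(np.corrcoef(Z.T).round(3))      # off-diagonals ~ 0
print(pca.explained_variance_ratio_)  # first ratio dominates
```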
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using PCA\n",
"Let's apply PCA to the `wine` dataset, to see if we can get an increase in our model's accuracy."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Type Alcohol Malic acid Ash Alcalinity of ash Magnesium \\\n",
"0 1 14.23 1.71 2.43 15.6 127 \n",
"1 1 13.20 1.78 2.14 11.2 100 \n",
"2 1 13.16 2.36 2.67 18.6 101 \n",
"3 1 14.37 1.95 2.50 16.8 113 \n",
"4 1 13.24 2.59 2.87 21.0 118 \n",
"\n",
" Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins \\\n",
"0 2.80 3.06 0.28 2.29 \n",
"1 2.65 2.76 0.26 1.28 \n",
"2 2.80 3.24 0.30 2.81 \n",
"3 3.85 3.49 0.24 2.18 \n",
"4 2.80 2.69 0.39 1.82 \n",
"\n",
" Color intensity Hue OD280/OD315 of diluted wines Proline \n",
"0 5.64 1.04 3.92 1065 \n",
"1 4.38 1.05 3.40 1050 \n",
"2 5.68 1.03 3.17 1185 \n",
"3 7.80 0.86 3.45 1480 \n",
"4 4.32 1.04 2.93 735 "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wine = pd.read_csv('./dataset/wine_types.csv')\n",
"wine.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05\n",
" 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06\n",
" 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07\n",
" 8.25392788e-08]\n"
]
}
],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"# Set up PCA and the X vector for dimensionality reduction\n",
"pca = PCA()\n",
"wine_X = wine.drop('Type', axis=1)\n",
"\n",
"# Apply PCA to the wine dataset X vector\n",
"transformed_X = pca.fit_transform(wine_X)\n",
"\n",
"# Look at the percentage of variance explained by the different components\n",
"print(pca.explained_variance_ratio_)"
]
},
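A caveat worth noting: the first component captures 99.8% of the variance here largely because the features sit on very different scales (`Proline` is in the thousands), so unscaled PCA is dominated by the largest-magnitude column. A sketch on synthetic data showing how standardizing first changes the picture (using `StandardScaler` is my assumption; the course leaves the wine data unscaled):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two independent features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(scale=1.0, size=500),
                     rng.normal(scale=1000.0, size=500)])

raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print(raw_ratio.round(4))     # the large-scale feature swamps the first component
print(scaled_ratio.round(4))  # variance is shared once features are standardized
```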
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training a model with PCA\n",
"Now that we have run PCA on the `wine` dataset, let's try training a model with it."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.7555555555555555\n"
]
}
],
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"y = wine['Type']\n",
"\n",
"knn = KNeighborsClassifier()\n",
"\n",
"# Split the transformed X and the y labels into training and test sets\n",
"X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)\n",
"\n",
"# Fit knn to the training data\n",
"knn.fit(X_wine_train, y_wine_train)\n",
"\n",
"# Score knn on the test data and print it out\n",
"print(knn.score(X_wine_test, y_wine_test))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}