{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Engineering\n", "> In this section you'll learn about feature engineering. You'll explore different ways to create new, more useful, features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features. This is the Summary of lecture \"Preprocessing for Machine Learning in Python\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature engineering\n", "- Creation of new features based on existing features\n", "- Insight into relationships between features\n", "- Extract and expand data\n", "- Dataset-dependent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Identifying areas for feature engineering\n", "Take an exploratory look at the `volunteer` dataset, using the variable of that name. Which of the following columns would you want to perform a feature engineering task on?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
opportunity_idcontent_idvol_requestsevent_timetitlehitssummaryis_prioritycategory_idcategory_desc...end_date_datestatusLatitudeLongitudeCommunity BoardCommunity CouncilCensus TractBINBBLNTA
0499637004500Volunteers Needed For Rise Up & Stay Put! Home...737Building on successful events last summer and ...NaNNaNNaN...July 30 2011approvedNaNNaNNaNNaNNaNNaNNaNNaN
150083703620Web designer22Build a website for an Afghan businessNaN1.0Strengthening Communities...February 01 2011approvedNaNNaNNaNNaNNaNNaNNaNNaN
2501637143200Urban Adventures - Ice Skating at Lasker Rink62Please join us and the students from Mott Hall...NaN1.0Strengthening Communities...January 29 2011approvedNaNNaNNaNNaNNaNNaNNaNNaN
35022372375000Fight global hunger and support women farmers ...14The Oxfam Action Corps is a group of dedicated...NaN1.0Strengthening Communities...March 31 2012approvedNaNNaNNaNNaNNaNNaNNaNNaN
4505537425150Stop 'N' Swap31Stop 'N' Swap reduces NYC's waste by finding n...NaN4.0Environment...February 05 2011approvedNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 35 columns

\n", "
" ], "text/plain": [ " opportunity_id content_id vol_requests event_time \\\n", "0 4996 37004 50 0 \n", "1 5008 37036 2 0 \n", "2 5016 37143 20 0 \n", "3 5022 37237 500 0 \n", "4 5055 37425 15 0 \n", "\n", " title hits \\\n", "0 Volunteers Needed For Rise Up & Stay Put! Home... 737 \n", "1 Web designer 22 \n", "2 Urban Adventures - Ice Skating at Lasker Rink 62 \n", "3 Fight global hunger and support women farmers ... 14 \n", "4 Stop 'N' Swap 31 \n", "\n", " summary is_priority category_id \\\n", "0 Building on successful events last summer and ... NaN NaN \n", "1 Build a website for an Afghan business NaN 1.0 \n", "2 Please join us and the students from Mott Hall... NaN 1.0 \n", "3 The Oxfam Action Corps is a group of dedicated... NaN 1.0 \n", "4 Stop 'N' Swap reduces NYC's waste by finding n... NaN 4.0 \n", "\n", " category_desc ... end_date_date status Latitude \\\n", "0 NaN ... July 30 2011 approved NaN \n", "1 Strengthening Communities ... February 01 2011 approved NaN \n", "2 Strengthening Communities ... January 29 2011 approved NaN \n", "3 Strengthening Communities ... March 31 2012 approved NaN \n", "4 Environment ... February 05 2011 approved NaN \n", "\n", " Longitude Community Board Community Council Census Tract BIN BBL NTA \n", "0 NaN NaN NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN NaN NaN NaN \n", "\n", "[5 rows x 35 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "volunteer = pd.read_csv('./dataset/volunteer_opportunities.csv')\n", "volunteer.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encoding categorical variables\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encoding categorical variables - binary\n", "Take a look at the `hiking` dataset. 
There are several columns here that need encoding; one of them is the `Accessible` column, which must be encoded numerically before it can be used in a model. `Accessible` is a binary feature with only two values, `Y` and `N`, so it can be encoded as 1s and 0s. Use scikit-learn's `LabelEncoder` to do that transformation.\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Prop_IDNameLocationPark_NameLengthDifficultyOther_DetailsAccessibleLimited_Accesslatlon
0B057Salt Marsh Nature TrailEnter behind the Salt Marsh Nature Center, loc...Marine Park0.8 milesNone<p>The first half of this mile-long trail foll...YNNaNNaN
1B073LullwaterEnter Park at Lincoln Road and Ocean Avenue en...Prospect Park1.0 mileEasyExplore the Lullwater to see how nature thrive...NNNaNNaN
2B073MidwoodEnter Park at Lincoln Road and Ocean Avenue en...Prospect Park0.75 milesEasyStep back in time with a walk through Brooklyn...NNNaNNaN
3B073PeninsulaEnter Park at Lincoln Road and Ocean Avenue en...Prospect Park0.5 milesEasyDiscover how the Peninsula has changed over th...NNNaNNaN
4B073WaterfallEnter Park at Lincoln Road and Ocean Avenue en...Prospect Park0.5 milesEasyTrace the source of the Lake on the Waterfall ...NNNaNNaN
\n", "
" ], "text/plain": [ " Prop_ID Name \\\n", "0 B057 Salt Marsh Nature Trail \n", "1 B073 Lullwater \n", "2 B073 Midwood \n", "3 B073 Peninsula \n", "4 B073 Waterfall \n", "\n", " Location Park_Name \\\n", "0 Enter behind the Salt Marsh Nature Center, loc... Marine Park \n", "1 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n", "2 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n", "3 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n", "4 Enter Park at Lincoln Road and Ocean Avenue en... Prospect Park \n", "\n", " Length Difficulty Other_Details \\\n", "0 0.8 miles None

The first half of this mile-long trail foll... \n", "1 1.0 mile Easy Explore the Lullwater to see how nature thrive... \n", "2 0.75 miles Easy Step back in time with a walk through Brooklyn... \n", "3 0.5 miles Easy Discover how the Peninsula has changed over th... \n", "4 0.5 miles Easy Trace the source of the Lake on the Waterfall ... \n", "\n", " Accessible Limited_Access lat lon \n", "0 Y N NaN NaN \n", "1 N N NaN NaN \n", "2 N N NaN NaN \n", "3 N N NaN NaN \n", "4 N N NaN NaN " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hiking = pd.read_json('./dataset/hiking.json')\n", "hiking.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AccessibleAccessible_enc
0Y1
1N0
2N0
3N0
4N0
\n", "
" ], "text/plain": [ " Accessible Accessible_enc\n", "0 Y 1\n", "1 N 0\n", "2 N 0\n", "3 N 0\n", "4 N 0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "# Set up the LabelEncoder object\n", "enc = LabelEncoder()\n", "\n", "# Apply the encoding to the \"Accessible\" column\n", "hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])\n", "\n", "# Compare the two columns\n", "hiking[['Accessible', 'Accessible_enc']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encoding categorical variables - one-hot\n", "One of the columns in the `volunteer` dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use Pandas' `get_dummies()` function to do so." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Education Emergency Preparedness Environment Health \\\n", "0 0 0 0 0 \n", "1 0 0 0 0 \n", "2 0 0 0 0 \n", "3 0 0 0 0 \n", "4 0 0 1 0 \n", "\n", " Helping Neighbors in Need Strengthening Communities \n", "0 0 0 \n", "1 0 1 \n", "2 0 1 \n", "3 0 1 \n", "4 0 0 \n" ] } ], "source": [ "# Transform the category_desc column\n", "category_enc = pd.get_dummies(volunteer['category_desc'])\n", "\n", "# Take a look at the encoded columns\n", "print(category_enc.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Engineering numerical features\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Engineering numerical features - taking an average\n", "A good use case for taking an aggregate statistic to create a new feature is to take the mean of columns. Here, you have a DataFrame of running times named `running_times_5k`. 
For each `name` in the dataset, take the mean of their 5 run times.\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namerun1run2run3run4run5
0Sue20.118.519.620.318.3
1Mark16.517.116.917.617.3
2Sean23.525.125.224.623.9
3Erin21.721.120.922.122.2
4Jenny25.827.126.126.726.9
5Russell30.929.631.430.429.9
\n", "
" ], "text/plain": [ " name run1 run2 run3 run4 run5\n", "0 Sue 20.1 18.5 19.6 20.3 18.3\n", "1 Mark 16.5 17.1 16.9 17.6 17.3\n", "2 Sean 23.5 25.1 25.2 24.6 23.9\n", "3 Erin 21.7 21.1 20.9 22.1 22.2\n", "4 Jenny 25.8 27.1 26.1 26.7 26.9\n", "5 Russell 30.9 29.6 31.4 30.4 29.9" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "running_times_5k = pd.read_csv('./dataset/running_times_5k.csv')\n", "running_times_5k" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " name run1 run2 run3 run4 run5 mean\n", "0 Sue 20.1 18.5 19.6 20.3 18.3 19.36\n", "1 Mark 16.5 17.1 16.9 17.6 17.3 17.08\n", "2 Sean 23.5 25.1 25.2 24.6 23.9 24.46\n", "3 Erin 21.7 21.1 20.9 22.1 22.2 21.60\n", "4 Jenny 25.8 27.1 26.1 26.7 26.9 26.52\n", "5 Russell 30.9 29.6 31.4 30.4 29.9 30.44\n" ] } ], "source": [ "# Create a list of the columns to average\n", "run_columns = ['run1', 'run2', 'run3', 'run4', 'run5']\n", "\n", "# Use apply to create a mean column\n", "running_times_5k['mean'] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)\n", "\n", "# Take a look at the results\n", "print(running_times_5k)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Engineering numerical features - datetime\n", "There are several columns in the `volunteer` dataset comprised of datetimes. Let's take a look at the `start_date_date` column and extract just the month to use as a feature for modeling.\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
start_date_convertedstart_date_month
02011-07-307
12011-02-012
22011-01-291
32011-02-142
42011-02-052
\n", "
" ], "text/plain": [ " start_date_converted start_date_month\n", "0 2011-07-30 7\n", "1 2011-02-01 2\n", "2 2011-01-29 1\n", "3 2011-02-14 2\n", "4 2011-02-05 2" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First, convert string column to date column\n", "volunteer['start_date_converted'] = pd.to_datetime(volunteer['start_date_date'])\n", "\n", "# Extract just the month from the converted column\n", "volunteer['start_date_month'] = volunteer['start_date_converted'].apply(lambda row: row.month)\n", "\n", "# Take a look at the converted and new month columns\n", "volunteer[['start_date_converted', 'start_date_month']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text classification\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Engineering features from strings - extraction\n", "The `Length` column in the `hiking` dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in Pandas to apply the extraction to the DataFrame.\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LengthLength_num
00.8 miles0.80
11.0 mile1.00
20.75 miles0.75
30.5 miles0.50
40.5 miles0.50
\n", "
" ], "text/plain": [ " Length Length_num\n", "0 0.8 miles 0.80\n", "1 1.0 mile 1.00\n", "2 0.75 miles 0.75\n", "3 0.5 miles 0.50\n", "4 0.5 miles 0.50" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "# Write a pattern to extract numbers and decimals\n", "def return_mileage(length):\n", " pattern = re.compile(r'\\d+\\.\\d+')\n", " \n", " if length == None:\n", " return\n", " \n", " # Search the text for matches\n", " mile = re.match(pattern, length)\n", " \n", " # If a value is returned, use group(0) to return the found value\n", " if mile is not None:\n", " return float(mile.group(0))\n", " \n", "# Apply the function to the Length column and take a look at both columns\n", "hiking['Length_num'] = hiking['Length'].apply(lambda row: return_mileage(row))\n", "hiking[['Length', 'Length_num']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Engineering features from strings - tf/idf\n", "Let's transform the `volunteer` dataset's `title` column into a text vector, to use in a prediction task in the next exercise.\n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "# Need to drop NaN for train_test_split\n", "volunteer = pd.read_csv('./dataset/volunteer_opportunities.csv')\n", "volunteer = volunteer.dropna(subset=['category_desc'], axis=0)\n", "\n", "# Take the title text\n", "title_text = volunteer['title']\n", "\n", "# Create the vectorizer method\n", "tfidf_vec = TfidfVectorizer()\n", "\n", "# Transform the text into tf-idf vectors\n", "text_tfidf = tfidf_vec.fit_transform(title_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text classification using tf/idf vectors\n", "Now that we've encoded the `volunteer` dataset's `title` column into tf/idf vectors, let's use those vectors to try to predict the `category_desc` column.\n", "\n" ] 
}, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.5225806451612903\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.naive_bayes import GaussianNB\n", "\n", "nb = GaussianNB()\n", "\n", "# Split the dataset according to the class distribution of category_desc\n", "y = volunteer['category_desc']\n", "X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)\n", "\n", "# Fit the model to the training data\n", "nb.fit(X_train, y_train)\n", "\n", "# Print out the model's accuracy\n", "print(nb.score(X_test, y_test))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }