{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating Features\n", "> In this chapter, you will explore what feature engineering is and how to get started with applying it to real-world data. You will load, explore and visualize a survey response dataset, and in doing so you will learn about its underlying data types and why they have an influence on how you should engineer your features. Using the pandas package you will create new features from both categorical and continuous columns. This is the Summary of lecture \"Feature Engineering for Machine Learning in Python\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why generate features?\n", "- Different types of data\n", " - Continuous: either integers (or whole numbers) or floats (decimals)\n", " - Categorical: one of a limited set of values, e.g., gender, country of birth\n", " - Ordinal: ranked values often with no details of distance between them\n", " - Boolean: True/False values\n", " - Datetime: dates and times" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting to know your data\n", "You will be working with a modified subset of the [Stackoverflow survey response data](https://insights.stackoverflow.com/survey/2018/#overview) in the first three chapters of this course. This data set records the details, and preferences of thousands of users of the StackOverflow website." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurveyDateFormalEducationConvertedSalaryHobbyCountryStackOverflowJobsRecommendVersionControlAgeYears ExperienceGenderRawSalary
02/28/18 20:20Bachelor's degree (BA. BS. B.Eng.. etc.)NaNYesSouth AfricaNaNGit2113MaleNaN
16/28/18 13:26Bachelor's degree (BA. BS. B.Eng.. etc.)70841.0YesSweeden7.0Git;Subversion389Male70,841.00
26/6/18 3:37Bachelor's degree (BA. BS. B.Eng.. etc.)NaNNoSweeden8.0Git4511NaNNaN
35/9/18 1:06Some college/university study without earning ...21426.0YesSweedenNaNZip file back-ups4612Male21,426.00
44/12/18 22:41Bachelor's degree (BA. BS. B.Eng.. etc.)41671.0YesUK8.0Git397Male£41,671.00
\n", "
" ], "text/plain": [ " SurveyDate FormalEducation \\\n", "0 2/28/18 20:20 Bachelor's degree (BA. BS. B.Eng.. etc.) \n", "1 6/28/18 13:26 Bachelor's degree (BA. BS. B.Eng.. etc.) \n", "2 6/6/18 3:37 Bachelor's degree (BA. BS. B.Eng.. etc.) \n", "3 5/9/18 1:06 Some college/university study without earning ... \n", "4 4/12/18 22:41 Bachelor's degree (BA. BS. B.Eng.. etc.) \n", "\n", " ConvertedSalary Hobby Country StackOverflowJobsRecommend \\\n", "0 NaN Yes South Africa NaN \n", "1 70841.0 Yes Sweeden 7.0 \n", "2 NaN No Sweeden 8.0 \n", "3 21426.0 Yes Sweeden NaN \n", "4 41671.0 Yes UK 8.0 \n", "\n", " VersionControl Age Years Experience Gender RawSalary \n", "0 Git 21 13 Male NaN \n", "1 Git;Subversion 38 9 Male 70,841.00 \n", "2 Git 45 11 NaN NaN \n", "3 Zip file back-ups 46 12 Male 21,426.00 \n", "4 Git 39 7 Male £41,671.00 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import the data\n", "so_survey_df = pd.read_csv('./dataset/Combined_DS_v10.csv')\n", "\n", "# Print the first five rows of the DataFrame\n", "so_survey_df.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SurveyDate object\n", "FormalEducation object\n", "ConvertedSalary float64\n", "Hobby object\n", "Country object\n", "StackOverflowJobsRecommend float64\n", "VersionControl object\n", "Age int64\n", "Years Experience int64\n", "Gender object\n", "RawSalary object\n", "dtype: object\n" ] } ], "source": [ "# Print the data type of each column\n", "print(so_survey_df.dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting specific data types\n", "Often a data set will contain columns with several different data types (like the one you are working with). The majority of machine learning models require you to have a consistent data type across features. 
Similarly, most feature engineering techniques are applicable to only one type of data at a time. For these reasons, among others, you will often want to be able to access just the columns of certain types when working with a DataFrame." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['ConvertedSalary', 'StackOverflowJobsRecommend'], dtype='object')\n" ] } ], "source": [ "# Create a subset of only the numeric columns\n", "so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])\n", "\n", "# Print the column names contained in so_numeric_df\n", "print(so_numeric_df.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dealing with categorical features\n", "- Encoding categorical features\n", " - One-hot encoding\n", " - Dummy encoding\n", "- One-hot vs. dummies\n", " - One-hot encoding: Explainable features\n", " - Dummy encoding: Necessary information without duplication" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One-hot encoding and dummy variables\n", "To use categorical variables in a machine learning model, you first need to represent them in a quantitative way. The two most common approaches are to one-hot encode the variables or to use dummy variables. In this exercise, you will create both types of encoding and compare the created column sets. 
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',\n", " 'StackOverflowJobsRecommend', 'VersionControl', 'Age',\n", " 'Years Experience', 'Gender', 'RawSalary', 'OH_France', 'OH_India',\n", " 'OH_Ireland', 'OH_Russia', 'OH_South Africa', 'OH_Spain', 'OH_Sweeden',\n", " 'OH_UK', 'OH_USA', 'OH_Ukraine'],\n", " dtype='object')\n" ] } ], "source": [ "# Convert the Country column to a one hot encoded DataFrame\n", "one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'], prefix='OH')\n", "\n", "# Print the columns names\n", "print(one_hot_encoded.columns)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',\n", " 'StackOverflowJobsRecommend', 'VersionControl', 'Age',\n", " 'Years Experience', 'Gender', 'RawSalary', 'DM_India', 'DM_Ireland',\n", " 'DM_Russia', 'DM_South Africa', 'DM_Spain', 'DM_Sweeden', 'DM_UK',\n", " 'DM_USA', 'DM_Ukraine'],\n", " dtype='object')\n" ] } ], "source": [ "# Convert the Country column to a one hot encoded DataFrame\n", "dummy = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='DM')\n", "\n", "# Print the columns names\n", "print(dummy.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dealing with uncommon categories\n", "Some features can have many different categories but a very uneven distribution of their occurrences. Take for example Data Science's favorite languages to code in, some common choices are Python, R, and Julia, but there can be individuals with bespoke choices, like FORTRAN, C etc. In these cases, you may not want to create a feature for each value, but only the more common occurrences." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "South Africa 166\n", "USA 164\n", "Spain 134\n", "Sweeden 119\n", "France 115\n", "Russia 97\n", "India 95\n", "UK 95\n", "Ukraine 9\n", "Ireland 5\n", "Name: Country, dtype: int64\n" ] } ], "source": [ "# Create a series out of the Country columns\n", "countries = so_survey_df.Country\n", "\n", "# Get the counts of each category\n", "country_counts = countries.value_counts()\n", "\n", "# Print the count values for each category\n", "print(country_counts)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "Name: Country, dtype: bool\n" ] } ], "source": [ "# Create a mask for only categories that occur less than 10 times\n", "mask = countries.isin(country_counts[country_counts < 10].index)\n", "\n", "# Print the top 5 rows in the mask series\n", "print(mask.head())" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "South Africa 166\n", "USA 164\n", "Spain 134\n", "Sweeden 119\n", "France 115\n", "Russia 97\n", "India 95\n", "UK 95\n", "Other 14\n", "Name: Country, dtype: int64\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\kcsgo\\anaconda3\\lib\\site-packages\\ipykernel_launcher.py:2: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " \n" ] } ], "source": [ "# Label all other categories as Other\n", "countries[mask] = 'Other'\n", "\n", "# Print the updated category counts\n", "print(countries.value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 
Numeric variables\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Binarizing columns\n", "While numeric values can often be used without any feature engineering, there will be cases when some form of manipulation can be useful. For example on some occasions, you might not care about the magnitude of a value but only care about its direction, or if it exists at all. In these situations, you will want to binarize a column. In the `so_survey_df` data, you have a large number of survey respondents that are working voluntarily (without pay). You will create a new column titled `Paid_Job` indicating whether each person is paid (their salary is greater than zero)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Paid_JobConvertedSalary
00NaN
1170841.0
20NaN
3121426.0
4141671.0
\n", "
" ], "text/plain": [ " Paid_Job ConvertedSalary\n", "0 0 NaN\n", "1 1 70841.0\n", "2 0 NaN\n", "3 1 21426.0\n", "4 1 41671.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create the Paid_Job column filled with zeros\n", "so_survey_df['Paid_Job'] = 0\n", "\n", "# Replace all the Paid_Job values where ConvertedSalary is > 0\n", "so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1\n", "\n", "# Print the first five rows of the columns\n", "so_survey_df[['Paid_Job', 'ConvertedSalary']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Binning values\n", "For many continuous values you will care less about the exact value of a numeric column, but instead care about the bucket it falls into. This can be useful when plotting values, or simplifying your machine learning models. It is mostly used on continuous variables where accuracy is not the biggest concern e.g. age, height, wages.\n", "\n", "Bins are created using `pd.cut(df['column_name'], bins)` where `bins` can be an integer specifying the number of evenly spaced bins, or a list of bin boundaries." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
equal_binnedConvertedSalary
0NaNNaN
1(-2000.0, 400000.0]70841.0
2NaNNaN
3(-2000.0, 400000.0]21426.0
4(-2000.0, 400000.0]41671.0
\n", "
" ], "text/plain": [ " equal_binned ConvertedSalary\n", "0 NaN NaN\n", "1 (-2000.0, 400000.0] 70841.0\n", "2 NaN NaN\n", "3 (-2000.0, 400000.0] 21426.0\n", "4 (-2000.0, 400000.0] 41671.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Bin the continouos variable ConvertedSalary into 5 bins\n", "so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=5)\n", "\n", "# Print the first 5 rows of the equal_binned column\n", "so_survey_df[['equal_binned', 'ConvertedSalary']].head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
boundary_binnedConvertedSalary
0NaNNaN
1Medium70841.0
2NaNNaN
3Low21426.0
4Low41671.0
\n", "
" ], "text/plain": [ " boundary_binned ConvertedSalary\n", "0 NaN NaN\n", "1 Medium 70841.0\n", "2 NaN NaN\n", "3 Low 21426.0\n", "4 Low 41671.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Specify the boundaries of the bins\n", "bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]\n", "\n", "# Bin labels\n", "labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']\n", "\n", "# Bin the continous variable ConvertedSalary using these boundaries\n", "so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=bins, labels=labels)\n", "\n", "# Print the first 5 rows of the boundary_binned column\n", "so_survey_df[['boundary_binned', 'ConvertedSalary']].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }