{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Engineering\n", "> You will now get exposure to different types of features. You will modify existing features, create new ones, and treat missing data accordingly. This is a summary of the lecture \"Winning a Kaggle Competition in Python\" on DataCamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Kaggle, Machine_Learning]\n", "- image: images/feature_engineering.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "plt.style.use('ggplot')\n", "plt.rcParams['figure.figsize'] = (10, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Engineering\n", "- Solution workflow\n", "![solution](image/solution_workflow.png)\n", "- Modeling Stage\n", "![modeling](image/modeling_stage.png)\n", "- Feature Engineering\n", "![fe](image/feature_engineering.png)\n", "- Feature types\n", "    - Numerical\n", "    - Categorical\n", "    - Datetime\n", "    - Coordinates\n", "    - Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Arithmetical features\n", "To practice creating new features, you will be working with a subsample from the Kaggle competition called \"House Prices: Advanced Regression Techniques\". The goal of this competition is to predict the price of a house based on its properties. It's a regression problem with Root Mean Squared Error as the evaluation metric.\n", "\n", "Your goal is to create new features and determine whether they improve your validation score. To get the validation score from 5-fold cross-validation, you're given the `get_kfold_rmse()` function." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import KFold\n", "from sklearn.metrics import mean_squared_error\n", "\n", "kf = KFold(n_splits=5, shuffle=True, random_state=123)\n", "\n", "def get_kfold_rmse(train):\n", "    mse_scores = []\n", "\n", "    # Fill missing values and select features once, outside the fold loop\n", "    train = train.fillna(0)\n", "    feats = [x for x in train.columns if x not in ['Id', 'SalePrice', 'RoofStyle', 'CentralAir']]\n", "\n", "    for train_index, test_index in kf.split(train):\n", "        # kf.split() yields positional indices, so select rows with iloc\n", "        fold_train, fold_test = train.iloc[train_index], train.iloc[test_index]\n", "\n", "        # Create a Random Forest object\n", "        rf = RandomForestRegressor(n_estimators=10, min_samples_split=10, random_state=123)\n", "\n", "        # Train a model\n", "        rf.fit(X=fold_train[feats], y=fold_train['SalePrice'])\n", "\n", "        # Get predictions for the test set\n", "        pred = rf.predict(fold_test[feats])\n", "\n", "        # Store the RMSE for this fold\n", "        fold_score = mean_squared_error(fold_test['SalePrice'], pred)\n", "        mse_scores.append(np.sqrt(fold_score))\n", "\n", "    # Report the mean RMSE plus one standard deviation across folds\n", "    return round(np.mean(mse_scores) + np.std(mse_scores), 2)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv('./dataset/house_prices_train.csv')\n", "test = pd.read_csv('./dataset/house_prices_test.csv')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE before feature engineering: 36029.39\n", "RMSE with total area: 35073.2\n", "RMSE with garden area: 34413.55\n", "RMSE with number of bathrooms: 34506.78\n" ] } ], "source": [ "# Look at the initial RMSE\n", "print('RMSE before feature engineering:', get_kfold_rmse(train))\n", "\n", "# Find the total area of the house\n", "train['totalArea'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']\n", "\n", "# Look at the updated RMSE\n", 
"print('RMSE with total area:', get_kfold_rmse(train))\n", "\n", "# Find the area of the garden\n", "train['GardenArea'] = train['LotArea'] - train['1stFlrSF']\n", "print('RMSE with garden area:', get_kfold_rmse(train))\n", "\n", "# Find total number of bathrooms\n", "train['TotalBath'] = train['FullBath'] + train['HalfBath']\n", "print('RMSE with number of bathrooms:', get_kfold_rmse(train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You've created three new features. Here you see that house area improved the RMSE by almost `$1,000`. Adding garden area improved the RMSE by roughly another `$660`. However, adding the total number of bathrooms increased the RMSE slightly (from 34413.55 to 34506.78). So keep the new area features, but do not add \"TotalBath\" as a new feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Date features\n", "You've built some basic features using numerical variables. Now, it's time to create features based on date and time. You will practice on a subsample from the Taxi Fare Prediction Kaggle competition data. The data represents information about taxi rides, and the goal is to predict the price of each ride.\n", "\n", "Your objective is to generate date features from the pickup datetime. Recall that it's better to create new features for train and test data simultaneously. After the features are created, split the data back into the train and test DataFrames. Here it's done using pandas' `isin()` method." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv('./dataset/taxi_train_chapter_4.csv')\n", "test = pd.read_csv('./dataset/taxi_test_chapter_4.csv')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Concatenate train and test together\n", "taxi = pd.concat([train, test])\n", "\n", "# Convert pickup date to datetime object\n", "taxi['pickup_datetime'] = pd.to_datetime(taxi['pickup_datetime'])\n", "\n", "# Create a day of week feature\n", "taxi['dayofweek'] = taxi['pickup_datetime'].dt.dayofweek\n", "\n", "# Create an hour feature\n", "taxi['hour'] = taxi['pickup_datetime'].dt.hour\n", "\n", "# Split back into train and test\n", "new_train = taxi[taxi['id'].isin(train['id'])]\n", "new_test = taxi[taxi['id'].isin(test['id'])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical features\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Label encoding\n", "Let's work on categorical variables encoding. You will again work with a subsample from the House Prices Kaggle competition.\n", "\n", "Your objective is to encode categorical features \"RoofStyle\" and \"CentralAir\" using label encoding." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>RoofStyle</th><th>RoofStyle_enc</th><th>CentralAir</th><th>CentralAir_enc</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>Gable</td><td>1</td><td>Y</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>Gable</td><td>1</td><td>Y</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>Gable</td><td>1</td><td>Y</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>Gable</td><td>1</td><td>Y</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>Gable</td><td>1</td><td>Y</td><td>1</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " RoofStyle RoofStyle_enc CentralAir CentralAir_enc\n", "0 Gable 1 Y 1\n", "1 Gable 1 Y 1\n", "2 Gable 1 Y 1\n", "3 Gable 1 Y 1\n", "4 Gable 1 Y 1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "train = pd.read_csv('./dataset/house_prices_train.csv')\n", "test = pd.read_csv('./dataset/house_prices_test.csv')\n", "\n", "# Concatenate train and test together\n", "houses = pd.concat([train, test])\n", "\n", "# Label encoder\n", "le = LabelEncoder()\n", "\n", "# Create new features\n", "houses['RoofStyle_enc'] = le.fit_transform(houses['RoofStyle'])\n", "houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])\n", "\n", "# Look at new features\n", "houses[['RoofStyle', 'RoofStyle_enc', 'CentralAir', 'CentralAir_enc']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One-Hot encoding\n", "The problem with label encoding is that it implicitly assumes that there is a ranking dependency between the categories. So, let's change the encoding method for the features \"RoofStyle\" and \"CentralAir\" to one-hot encoding. \n", "\n", "Recall that if you're dealing with binary features (categorical features with only two categories) it is suggested to apply label encoder only.\n", "\n", "Your goal is to determine which of the mentioned features is not binary, and to apply one-hot encoding only to this one." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gable 2310\n", "Hip 551\n", "Gambrel 22\n", "Flat 20\n", "Mansard 11\n", "Shed 5\n", "Name: RoofStyle, dtype: int64 \n", "\n", "Y 2723\n", "N 196\n", "Name: CentralAir, dtype: int64\n" ] } ], "source": [ "# Concatenate train and test together\n", "houses = pd.concat([train, test])\n", "\n", "# Look at feature distributions\n", "print(houses['RoofStyle'].value_counts(), '\\n')\n", "print(houses['CentralAir'].value_counts())" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>RoofStyle</th><th>RoofStyle_Flat</th><th>RoofStyle_Gable</th><th>RoofStyle_Gambrel</th><th>RoofStyle_Hip</th><th>RoofStyle_Mansard</th><th>RoofStyle_Shed</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>Gable</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>Gable</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>Gable</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>Gable</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>Gable</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " RoofStyle RoofStyle_Flat RoofStyle_Gable RoofStyle_Gambrel \\\n", "0 Gable 0 1 0 \n", "1 Gable 0 1 0 \n", "2 Gable 0 1 0 \n", "3 Gable 0 1 0 \n", "4 Gable 0 1 0 \n", "\n", " RoofStyle_Hip RoofStyle_Mansard RoofStyle_Shed \n", "0 0 0 0 \n", "1 0 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 0 0 0 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Label encode binary 'CentralAir' feature\n", "le = LabelEncoder()\n", "houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])\n", "\n", "# Create One-Hot encoded features\n", "ohe = pd.get_dummies(houses['RoofStyle'], prefix='RoofStyle')\n", "\n", "# Concatenate OHE features to houses\n", "houses = pd.concat([houses, ohe], axis=1)\n", "\n", "# Look at OHE features\n", "houses[[col for col in houses.columns if 'RoofStyle' in col]].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Target Encoding\n", "- Mean target encoding\n", " 1. Calculate mean on the train, apply to the test\n", " 2. Split train into K folds, Calculate mean on (K-1) folds, apply to the K-th fold\n", " 3. Add mean target encoded feature to the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mean target encoding\n", "First of all, you will create a function that implements mean target encoding. Remember that you need to develop the two following steps:\n", "\n", "1. Calculate the mean on the train, apply to the test\n", "2. Split train into K folds. 
Calculate the out-of-fold mean for each fold, and apply it to that particular fold" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def test_mean_target_encoding(train, test, target, categorical, alpha=5):\n", "    # Calculate global mean on the train data\n", "    global_mean = train[target].mean()\n", "\n", "    # Group by the categorical feature and calculate its properties\n", "    train_groups = train.groupby(categorical)\n", "    category_sum = train_groups[target].sum()\n", "    category_size = train_groups.size()\n", "\n", "    # Calculate smoothed mean target statistics\n", "    train_statistics = (category_sum + global_mean * alpha) / (category_size + alpha)\n", "\n", "    # Apply statistics to the test data and fill new categories\n", "    test_feature = test[categorical].map(train_statistics).fillna(global_mean)\n", "    return test_feature.values\n", "\n", "def train_mean_target_encoding(train, target, categorical, alpha=5):\n", "    # Create 5-fold cross-validation\n", "    kf = KFold(n_splits=5, random_state=123, shuffle=True)\n", "    train_feature = pd.Series(index=train.index, dtype='float')\n", "\n", "    # For each fold split\n", "    for train_index, test_index in kf.split(train):\n", "        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]\n", "\n", "        # Calculate out-of-fold statistics and apply to cv_test\n", "        cv_test_feature = test_mean_target_encoding(cv_train, cv_test, target,\n", "                                                    categorical, alpha)\n", "\n", "        # Save new feature for this particular fold\n", "        train_feature.iloc[test_index] = cv_test_feature\n", "    return train_feature.values\n", "\n", "def mean_target_encoding(train, test, target, categorical, alpha=5):\n", "    # Get the train feature\n", "    train_feature = train_mean_target_encoding(train, target, categorical, alpha)\n", "\n", "    # Get the test feature\n", "    test_feature = test_mean_target_encoding(train, test, target, categorical, alpha)\n", "\n", "    # Return new features to add to the model\n", "    return train_feature, test_feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### K-fold cross-validation\n", "You will work with a binary classification problem on a subsample from a Kaggle playground competition. The objective of this competition is to predict whether the famous basketball player Kobe Bryant scored a basket or missed on a particular shot.\n", "\n", "Train data is available in your workspace as the `bryant_shots` DataFrame. It contains data on 10,000 shots with their properties and a target variable `\"shot_made_flag\"` -- whether the shot was scored or not.\n", "\n", "One of the features in the data is `\"game_id\"` -- the particular game in which the shot was made. There are 541 distinct games, so you are dealing with a high-cardinality categorical feature. Let's encode it using a target mean!\n", "\n", "Suppose you're using 5-fold cross-validation and want to evaluate a mean target encoded feature on the local validation." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " game_id shot_made_flag game_id_enc\n", "118 20000108 1.0 0.444531\n", " game_id shot_made_flag game_id_enc\n", "3249 20200471 0.0 0.562617\n", " game_id shot_made_flag game_id_enc\n", "6048 20400930 0.0 0.276686\n", " game_id shot_made_flag game_id_enc\n", "3199 20200425 0.0 0.485156\n", " game_id shot_made_flag game_id_enc\n", "7808 20500988 0.0 0.392894\n" ] } ], "source": [ "bryant_shots = pd.read_csv('./dataset/bryant_shots.csv')\n", "\n", "# Create 5-fold cross-validation\n", "kf = KFold(n_splits=5, random_state=123, shuffle=True)\n", "\n", "# For each fold split\n", "for train_index, test_index in kf.split(bryant_shots):\n", "    cv_train, cv_test = bryant_shots.iloc[train_index].copy(), bryant_shots.iloc[test_index].copy()\n", "\n", "    # Create mean target encoded feature\n", "    cv_train['game_id_enc'], cv_test['game_id_enc'] = mean_target_encoding(train=cv_train,\n", "                                                                           test=cv_test,\n", 
"                                                                           target='shot_made_flag',\n", "                                                                           categorical='game_id',\n", "                                                                           alpha=5)\n", "\n", "    # Look at the encoding\n", "    print(cv_train[['game_id', 'shot_made_flag', 'game_id_enc']].sample(n=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see different game encodings for each validation split in the output. The main conclusion: when using local cross-validation, you need to repeat the mean target encoding procedure inside each fold split separately." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Beyond binary classification\n", "Of course, binary classification is just a single special case. Target encoding can be applied to any target variable type:\n", "\n", "- For **binary classification**, mean target encoding is usually used\n", "- For **regression**, the mean can be replaced with the median, quartiles, etc.\n", "- For **multi-class classification** with N classes, we create N features with the target mean for each category in a one-vs-all fashion\n", "\n", "The `mean_target_encoding()` function you've created could be used for any target type specified above. Let's apply it to a regression problem, using the House Prices Kaggle competition as an example.\n", "\n", "Your goal is to encode the categorical feature `\"RoofStyle\"` using mean target encoding. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>RoofStyle</th><th>RoofStyle_enc</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>Gable</td><td>171565.947836</td></tr>\n",
"    <tr><th>1</th><td>Hip</td><td>217594.645131</td></tr>\n",
"    <tr><th>98</th><td>Gambrel</td><td>164152.950424</td></tr>\n",
"    <tr><th>133</th><td>Flat</td><td>188703.563431</td></tr>\n",
"    <tr><th>362</th><td>Mansard</td><td>180775.938759</td></tr>\n",
"    <tr><th>1053</th><td>Shed</td><td>188267.663242</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " RoofStyle RoofStyle_enc\n", "0 Gable 171565.947836\n", "1 Hip 217594.645131\n", "98 Gambrel 164152.950424\n", "133 Flat 188703.563431\n", "362 Mansard 180775.938759\n", "1053 Shed 188267.663242" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = pd.read_csv('./dataset/house_prices_train.csv')\n", "test = pd.read_csv('./dataset/house_prices_test.csv')\n", "\n", "# Create mean target encoded feature\n", "train['RoofStyle_enc'], test['RoofStyle_enc'] = mean_target_encoding(train=train,\n", "                                                                     test=test,\n", "                                                                     target='SalePrice',\n", "                                                                     categorical='RoofStyle',\n", "                                                                     alpha=10)\n", "# Look at the encoding\n", "test[['RoofStyle', 'RoofStyle_enc']].drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that houses with a `Hip` roof are the most expensive, while houses with a `Gambrel` roof are the cheapest. That is exactly the goal of target encoding: you've encoded the categorical feature in such a way that the category values now correlate with the target variable. We're done with categorical encoders." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Data\n", "- Impute missing data\n", "    - Numerical data\n", "        - Mean/median imputation\n", "        - Constant value imputation\n", "    - Categorical data\n", "        - Most frequent category imputation\n", "        - New category imputation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find missing data\n", "Let's impute missing data on a real Kaggle dataset. For this purpose, you will be using a data subsample from the Kaggle \"Two Sigma Connect: Rental Listing Inquiries\" competition.\n", "\n", "Before proceeding with any imputation, you need to know the number of missing values for each feature. Moreover, if a feature has missing values, you should explore the type of this feature." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id 0\n", "bathrooms 0\n", "bedrooms 0\n", "building_id 13\n", "latitude 0\n", "longitude 0\n", "manager_id 0\n", "price 32\n", "interest_level 0\n", "dtype: int64\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>building_id</th><th>price</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>53a5b119ba8f7b61d4e010512e0dfc85</td><td>3000.0</td></tr>\n",
"    <tr><th>1</th><td>c5c8a357cba207596b04d1afd1e4f130</td><td>5465.0</td></tr>\n",
"    <tr><th>2</th><td>c3ba40552e2120b0acfc3cb5730bb2aa</td><td>2850.0</td></tr>\n",
"    <tr><th>3</th><td>28d9ad350afeaab8027513a3e52ac8d5</td><td>3275.0</td></tr>\n",
"    <tr><th>4</th><td>NaN</td><td>3350.0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " building_id price\n", "0 53a5b119ba8f7b61d4e010512e0dfc85 3000.0\n", "1 c5c8a357cba207596b04d1afd1e4f130 5465.0\n", "2 c3ba40552e2120b0acfc3cb5730bb2aa 2850.0\n", "3 28d9ad350afeaab8027513a3e52ac8d5 3275.0\n", "4 NaN 3350.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read dataframe\n", "twosigma = pd.read_csv('./dataset/twosigma_rental_train_null.csv')\n", "\n", "# Find the number of missing values in each column\n", "print(twosigma.isnull().sum())\n", "\n", "twosigma[['building_id', 'price']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Impute missing data\n", "You've found that the `\"price\"` and `\"building_id\"` columns have missing values in the Rental Listing Inquiries dataset. So, before passing the data to the models, you need to impute these values.\n", "\n", "The numerical feature `\"price\"` will be imputed with the mean of the non-missing prices.\n", "\n", "Imputing the categorical feature `\"building_id\"` with the most frequent category is a bad idea, because it would mean that all the apartments with a missing `\"building_id\"` are located in the most popular building. A better idea is to impute it with a new category." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from sklearn.impute import SimpleImputer\n", "\n", "# Create mean imputer\n", "mean_imputer = SimpleImputer(strategy='mean')\n", "\n", "# Price imputation\n", "twosigma[['price']] = mean_imputer.fit_transform(twosigma[['price']])\n", "\n", "# Create constant imputer\n", "constant_imputer = SimpleImputer(strategy='constant', fill_value='MISSING')\n", "\n", "# building_id imputation\n", "twosigma[['building_id']] = constant_imputer.fit_transform(twosigma[['building_id']])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id 0\n", "bathrooms 0\n", "bedrooms 0\n", "building_id 0\n", "latitude 0\n", "longitude 0\n", "manager_id 0\n", "price 0\n", "interest_level 0\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "twosigma.isnull().sum()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }