{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dive into the Competition\n", "> Now that you know the basics of Kaggle competitions, you will learn how to study the specific problem at hand. You will practice EDA and establish correct local validation strategies. You will also learn about data leakage. This is a summary of the lecture \"Winning a Kaggle Competition in Python\" from DataCamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Kaggle, Machine_Learning]\n", "- image: images/stratified_kfold.png" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "plt.style.use('ggplot')\n", "plt.rcParams['figure.figsize']=(10, 8)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Understand the problem\n", "- Solution workflow\n", "![sw](image/solution_workflow.png)\n", "- Custom metric: Root Mean Squared Error on a logarithmic scale (RMSLE)\n", "$$ RMSLE = \\sqrt{\\frac{1}{N}\\sum_{i=1}^N (\\log(y_i + 1) - \\log(\\hat{y_i} + 1))^2} $$" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Define a competition metric\n", "The competition metric is used by Kaggle to evaluate your submissions. 
Moreover, you need the same metric to measure the performance of different models on a local validation set.\n", "\n", "For now, your goal is to manually implement a couple of competition metrics in case they are not available in `sklearn.metrics`.\n", "\n", "In particular, you will define:\n", "\n", "- Mean Squared Error (MSE) for the regression problem:\n", "\n", "$$ MSE = \\frac{1}{N} \\sum_{i=1}^{N}(y_i - \\hat{y_i})^2 $$\n", "\n", "- Logarithmic Loss (LogLoss) for the binary classification problem:\n", "\n", "$$ LogLoss = -\\frac{1}{N} \\sum_{i = 1}^N (y_i \\ln p_i + (1 - y_i) \\ln (1 - p_i)) $$" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "sample = pd.read_csv('./dataset/sample_reg_true_pred.csv')\n", "y_regression_true, y_regression_pred = sample['true'].to_numpy(), sample['pred'].to_numpy()" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sklearn MSE: 0.15418. \n", "Your MSE: 0.15418. \n" ] } ], "source": [ "from sklearn.metrics import mean_squared_error\n", "\n", "# Define your own MSE function\n", "def own_mse(y_true, y_pred):\n", "    # Raise differences to the power of 2\n", "    squares = np.power(y_true - y_pred, 2)\n", "    # Find mean over all observations\n", "    err = np.mean(squares)\n", "    return err\n", "\n", "print('Sklearn MSE: {:.5f}. '.format(mean_squared_error(y_regression_true, y_regression_pred)))\n", "print('Your MSE: {:.5f}. '.format(own_mse(y_regression_true, y_regression_pred)))" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sample_class = pd.read_csv('./dataset/sample_class_true_pred.csv')\n", "y_classification_true, y_classification_pred = sample_class['true'].to_numpy(), sample_class['pred'].to_numpy()" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sklearn LogLoss: 1.10801\n", "Your LogLoss: 1.10801\n" ] } ], "source": [ "from sklearn.metrics import log_loss\n", "\n", "# Define your own LogLoss function\n", "def own_logloss(y_true, prob_pred):\n", "    # Find loss for each observation\n", "    terms = y_true * np.log(prob_pred) + (1 - y_true) * np.log(1 - prob_pred)\n", "    # Find mean over all observations\n", "    err = np.mean(terms)\n", "    return -err\n", "\n", "print('Sklearn LogLoss: {:.5f}'.format(log_loss(y_classification_true, y_classification_pred)))\n", "print('Your LogLoss: {:.5f}'.format(own_logloss(y_classification_true, y_classification_pred)))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Initial EDA\n", "- Goal of EDA\n", "  - Size of the data\n", "  - Properties of the target variable\n", "  - Properties of the features\n", "  - Generate ideas for feature engineering" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### EDA statistics\n", "As mentioned in the slides, you'll work with New York City taxi fare prediction data. You'll start by finding some basic statistics about the data. Then you'll move on to plotting some dependencies and generating hypotheses about them." ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train shape: (20000, 8)\n", "Test shape: (9914, 7)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>id</th><th>fare_amount</th><th>pickup_datetime</th><th>pickup_longitude</th><th>pickup_latitude</th><th>dropoff_longitude</th><th>dropoff_latitude</th><th>passenger_count</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>4.5</td><td>2009-06-15 17:26:21 UTC</td><td>-73.844311</td><td>40.721319</td><td>-73.841610</td><td>40.712278</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>16.9</td><td>2010-01-05 16:52:16 UTC</td><td>-74.016048</td><td>40.711303</td><td>-73.979268</td><td>40.782004</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>5.7</td><td>2011-08-18 00:35:00 UTC</td><td>-73.982738</td><td>40.761270</td><td>-73.991242</td><td>40.750562</td><td>2</td></tr>\n",
"    <tr><th>3</th><td>3</td><td>7.7</td><td>2012-04-21 04:30:42 UTC</td><td>-73.987130</td><td>40.733143</td><td>-73.991567</td><td>40.758092</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>4</td><td>5.3</td><td>2010-03-09 07:51:00 UTC</td><td>-73.968095</td><td>40.768008</td><td>-73.956655</td><td>40.783762</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " id fare_amount pickup_datetime pickup_longitude \\\n", "0 0 4.5 2009-06-15 17:26:21 UTC -73.844311 \n", "1 1 16.9 2010-01-05 16:52:16 UTC -74.016048 \n", "2 2 5.7 2011-08-18 00:35:00 UTC -73.982738 \n", "3 3 7.7 2012-04-21 04:30:42 UTC -73.987130 \n", "4 4 5.3 2010-03-09 07:51:00 UTC -73.968095 \n", "\n", " pickup_latitude dropoff_longitude dropoff_latitude passenger_count \n", "0 40.721319 -73.841610 40.712278 1 \n", "1 40.711303 -73.979268 40.782004 1 \n", "2 40.761270 -73.991242 40.750562 2 \n", "3 40.733143 -73.991567 40.758092 1 \n", "4 40.768008 -73.956655 40.783762 1 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = pd.read_csv('./dataset/taxi_train_chapter_4.csv')\n", "test = pd.read_csv('./dataset/taxi_test_chapter_4.csv')\n", "\n", "# Shapes of train and test data\n", "print('Train shape:', train.shape)\n", "print('Test shape:', test.shape)\n", "\n", "# train head()\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 20000.000000\n", "mean 11.303321\n", "std 9.541637\n", "min -3.000000\n", "25% 6.000000\n", "50% 8.500000\n", "75% 12.500000\n", "max 180.000000\n", "Name: fare_amount, dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Describe the target variable\n", "train.fare_amount.describe()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 13999\n", "2 2912\n", "5 1327\n", "3 860\n", "4 420\n", "6 407\n", "0 75\n", "Name: passenger_count, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Train distribution of passengers within rides\n", "train.passenger_count.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EDA plots I\n", "After generating a couple of basic statistics, it's time to come 
up with and validate some ideas about the data dependencies. Again, the train DataFrame from the taxi competition is already available in your workspace.\n", "\n", "To begin with, let's make a scatterplot plotting the relationship between the fare amount and the distance of the ride. Intuitively, the longer the ride, the higher its price." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def haversine_distance(train):\n", "    # Column names holding the pickup and drop-off coordinates\n", "    lat1, long1, lat2, long2 = 'pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'\n", "\n", "    R = 6371  # radius of the Earth in kilometers (use 3959 for miles)\n", "    phi1 = np.radians(train[lat1])\n", "    phi2 = np.radians(train[lat2])\n", "\n", "    delta_phi = np.radians(train[lat2] - train[lat1])\n", "    delta_lambda = np.radians(train[long2] - train[long1])\n", "\n", "    # a = sin²((φB - φA)/2) + cos φA . cos φB . sin²((λB - λA)/2)\n", "    a = np.sin(delta_phi / 2.0) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2.0) ** 2\n", "\n", "    # c = 2 * atan2( √a, √(1−a) )\n", "    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))\n", "\n", "    # d = R * c, in kilometers\n", "    d = R * c\n", "\n", "    return d" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Calculate the ride distance\n", "train['distance_km'] = haversine_distance(train)\n", "\n", "# Draw a scatterplot\n", "plt.scatter(x=train['fare_amount'], y=train['distance_km'], alpha=0.5);\n", "plt.xlabel('Fare amount');\n", "plt.ylabel('Distance, km');\n", "plt.title('Fare amount based on the distance');\n", "\n", "# Limit on the distance\n", "plt.ylim(0, 50);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EDA plots II\n", "Another idea that comes to mind is that the price of a ride could change during the day.\n", "\n", "Your goal is to plot the median fare amount for each hour of the day as a simple line plot. The hour feature is calculated for you. Don't worry if you do not know how to work with the date features. We will explore them in the chapter on Feature Engineering." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create hour feature\n", "train['pickup_datetime'] = pd.to_datetime(train.pickup_datetime)\n", "train['hour'] = train.pickup_datetime.dt.hour\n", "\n", "# Find median fare_amount for each hour\n", "hour_price = train.groupby('hour', as_index=False)['fare_amount'].median()\n", "\n", "# Plot the line plot\n", "plt.plot(hour_price['hour'], hour_price['fare_amount'], marker='o');\n", "plt.xlabel('Hour of the day');\n", "plt.ylabel('Median fare amount');\n", "plt.title('Fare amount based on day time');\n", "plt.xticks(range(24));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that prices are a bit higher during the night. It is a good indicator that we should include the `\"hour\"` feature in the final model, or at least add a binary feature `\"is_night\"`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Local Validation\n", "- Holdout set\n", "![holdout](image/holdout_set.png)\n", "- K-fold cross-validation\n", "![kfold_cv](image/kfold_cv.png)\n", "- Stratified K-fold\n", "![stratified](image/stratified_kfold.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### K-fold cross-validation\n", "You will start by getting hands-on experience in the most commonly used K-fold cross-validation.\n", "\n", "The data you'll be working with is from the \"Two sigma connect: rental listing inquiries\" Kaggle competition. The competition problem is a multi-class classification of the rental listings into 3 classes: low interest, medium interest and high interest. For faster performance, you will work with a subsample consisting of 1,000 observations.\n", "\n", "You need to implement a K-fold validation strategy and look at the sizes of each fold obtained." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv('./dataset/twosigma_rental_train.csv')" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fold: 0\n", "CV train shape: (666, 9)\n", "Medium interest listings in CV train: 175\n", "\n", "Fold: 1\n", "CV train shape: (667, 9)\n", "Medium interest listings in CV train: 165\n", "\n", "Fold: 2\n", "CV train shape: (667, 9)\n", "Medium interest listings in CV train: 162\n", "\n" ] } ], "source": [ "from sklearn.model_selection import KFold\n", "\n", "# Create a KFold object\n", "kf = KFold(n_splits=3, shuffle=True, random_state=123)\n", "\n", "# Loop through each split\n", "fold = 0\n", "for train_index, test_index in kf.split(train):\n", "    # Obtain training and test folds\n", "    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]\n", "    print(\"Fold: {}\".format(fold))\n", "    print(\"CV train shape: {}\".format(cv_train.shape))\n", "    print(\"Medium interest listings in CV train: {}\\n\".format(\n", "        sum(cv_train.interest_level == 'medium')\n", "    ))\n", "    fold += 1" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Stratified K-fold\n", "As you've just noticed, the target variable distribution differs noticeably among the folds because of the random splits. This is not crucial for this particular competition, but it could be an issue for classification competitions with a highly imbalanced target variable.\n", "\n", "To overcome this, let's implement the stratified K-fold strategy, stratifying on the target variable."
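,
"\n",
"\n",
"A minimal sketch of why stratification matters, assuming hypothetical toy data (90 negatives and 10 positives; the arrays and the `splitters` dictionary are illustrative and not part of the rental dataset). With 5 folds, the stratified splitter should place 2 positives in every validation fold, whereas plain K-fold usually spreads them unevenly:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import KFold, StratifiedKFold\n",
"\n",
"# Hypothetical toy data: 90 negatives followed by 10 positives\n",
"X = np.zeros((100, 1))\n",
"y = np.array([0] * 90 + [1] * 10)\n",
"\n",
"splitters = {\n",
"    'KFold': KFold(n_splits=5, shuffle=True, random_state=123),\n",
"    'StratifiedKFold': StratifiedKFold(n_splits=5, shuffle=True, random_state=123),\n",
"}\n",
"\n",
"for name, splitter in splitters.items():\n",
"    # Count the positives that land in each validation fold\n",
"    positives = [int(y[test_index].sum()) for _, test_index in splitter.split(X, y)]\n",
"    print(name, positives)\n",
"```"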
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fold: 0\n", "CV train shape: (666, 9)\n", "Medium interest listings in CV train: 167\n", "\n", "Fold: 1\n", "CV train shape: (667, 9)\n", "Medium interest listings in CV train: 167\n", "\n", "Fold: 2\n", "CV train shape: (667, 9)\n", "Medium interest listings in CV train: 168\n", "\n" ] } ], "source": [ "from sklearn.model_selection import StratifiedKFold\n", "\n", "# Create a StratifiedKFold object\n", "str_kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)\n", "\n", "# Loop through each split\n", "fold = 0\n", "for train_index, test_index in str_kf.split(train, train['interest_level']):\n", "    # Obtain training and test folds\n", "    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]\n", "    print('Fold: {}'.format(fold))\n", "    print('CV train shape: {}'.format(cv_train.shape))\n", "    print('Medium interest listings in CV train: {}\\n'.format(\n", "        sum(cv_train.interest_level == 'medium')\n", "    ))\n", "    fold += 1" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Validation usage\n", "- Data leakage\n", "  - Leak in **features** - using data that will not be available in the real setting\n", "  - Leak in **validation strategy** - the validation strategy differs from the real-world situation\n", "- Time K-fold cross-validation\n", "  - Ordinary K-fold cross-validation should not be used with time-series data, because shuffled folds let the model train on the future and validate on the past\n", "![time kfold](image/time_kfold.png)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Time K-fold\n", "Remember the \"Store Item Demand Forecasting Challenge\", where you are given store-item sales data and have to predict future sales?\n", "\n", "It's a competition with time-series data, so time K-fold cross-validation should be applied. Your goal is to create this cross-validation strategy and make sure that it works as expected."
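,
"\n",
"\n",
"A minimal sketch of how `TimeSeriesSplit` orders its folds, assuming a hypothetical toy array of 8 observations that is already sorted by time (the array is illustrative and not part of the sales data). In each split the training indices should all precede the test indices, and the training window grows from fold to fold:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import TimeSeriesSplit\n",
"\n",
"# Hypothetical toy data: 8 observations already sorted by time\n",
"X = np.arange(8).reshape(-1, 1)\n",
"\n",
"for train_index, test_index in TimeSeriesSplit(n_splits=3).split(X):\n",
"    # Positional indices: the training block always comes before the test block\n",
"    print('train:', train_index, 'test:', test_index)\n",
"```"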
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv('./dataset/demand_forecasting_train_1_month.csv')" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fold : 0\n", "Train date range: from 2017-12-01 to 2017-12-08\n", "Test date range: from 2017-12-08 to 2017-12-16\n", "\n", "Fold : 1\n", "Train date range: from 2017-12-01 to 2017-12-16\n", "Test date range: from 2017-12-16 to 2017-12-24\n", "\n", "Fold : 2\n", "Train date range: from 2017-12-01 to 2017-12-24\n", "Test date range: from 2017-12-24 to 2017-12-31\n", "\n" ] } ], "source": [ "from sklearn.model_selection import TimeSeriesSplit\n", "\n", "# Create TimeSeriesSplit object\n", "time_kfold = TimeSeriesSplit(n_splits=3)\n", "\n", "# Sort train data by date\n", "train = train.sort_values('date')\n", "\n", "# Iterate through each split\n", "fold = 0\n", "for train_index, test_index in time_kfold.split(train):\n", "    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]\n", "\n", "    print('Fold :', fold)\n", "    print('Train date range: from {} to {}'.format(cv_train.date.min(), cv_train.date.max()))\n", "    print('Test date range: from {} to {}\\n'.format(cv_test.date.min(), cv_test.date.max()))\n", "    fold += 1" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Overall validation score\n", "Now it's time to get the actual model performance using cross-validation! How does our store item demand prediction model perform?\n", "\n", "Your task is to take the Mean Squared Error (MSE) for each fold separately, and then combine these results into a single number.\n", "\n", "For simplicity, you're given a `get_fold_mse()` function that fits a Random Forest model on each cross-validation split and returns a list of MSE scores by fold." ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "def get_fold_mse(train, kf):\n", "    mse_scores = []\n", "\n", "    for train_index, test_index in kf.split(train):\n", "        # kf.split() yields positional indices, so select rows with .iloc\n", "        fold_train, fold_test = train.iloc[train_index], train.iloc[test_index]\n", "\n", "        # Fit the data and make predictions\n", "        # Create a Random Forest object\n", "        rf = RandomForestRegressor(n_estimators=10, random_state=123)\n", "\n", "        # Train a model\n", "        rf.fit(X=fold_train[['store', 'item']], y=fold_train['sales'])\n", "\n", "        # Get predictions for the test set\n", "        pred = rf.predict(fold_test[['store', 'item']])\n", "\n", "        fold_score = round(mean_squared_error(fold_test['sales'], pred), 5)\n", "        mse_scores.append(fold_score)\n", "\n", "    return mse_scores" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean validation MSE: 955.49186\n", "MSE by fold: [890.30336, 961.65797, 1014.51424]\n", "Overall Validation MSE: 1006.38784\n" ] } ], "source": [ "# Initialize 3-fold time cross-validation\n", "kf = TimeSeriesSplit(n_splits=3)\n", "\n", "# Get MSE scores for each cross-validation split\n", "mse_scores = get_fold_mse(train, kf)\n", "\n", "print('Mean validation MSE: {:.5f}'.format(np.mean(mse_scores)))\n", "print('MSE by fold: {}'.format(mse_scores))\n", "# Overall score: mean MSE plus one standard deviation across folds\n", "print('Overall Validation MSE: {:.5f}'.format(np.mean(mse_scores) + np.std(mse_scores)))" ] } ],
"metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }