{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dive into the Competition\n", "> Now that you know the basics of Kaggle competitions, you will learn how to study the specific problem at hand. You will practice EDA and get to establish correct local validation strategies. You will also learn about data leakage. This is the Summary of lecture \"Winning a Kaggle Competition in Python\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Kaggle, Machine_Learning]\n", "- image: images/stratified_kfold.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "plt.style.use('ggplot')\n", "plt.rcParams['figure.figsize']=(10, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Understand the problem\n", "- Solution workflow\n", "![sw](image/solution_workflow.png)\n", "- Custom Metric (Root Mean Squared Error in a Logarithmic scale)\n", "$$ RMSLE = \\sqrt{\\frac{1}{N}\\sum_{i=1}^N (\\log(y_i + 1) - \\log(\\hat{y_i} + 1))^2} $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define a competition metric\n", "Competition metric is used by Kaggle to evaluate your submissions. 
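As a warm-up, the RMSLE metric defined in the previous section can be computed directly from its formula. This is a minimal sketch (the function name `rmsle` is our own choice, not from the competition starter code), assuming `y_true` and `y_pred` are non-negative NumPy arrays:

```python
import numpy as np

def rmsle(y_true, y_pred):
    # log1p(x) computes log(x + 1), matching the +1 inside the RMSLE definition
    log_diff = np.log1p(y_true) - np.log1p(y_pred)
    # Root of the mean squared difference of the logs
    return np.sqrt(np.mean(log_diff ** 2))
```

Because the error is taken on a logarithmic scale, RMSLE penalizes relative errors rather than absolute ones, which suits targets like fares that span several orders of magnitude.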
Moreover, you also need to measure the performance of different models on a local validation set.\n", "\n", "For now, your goal is to manually develop a couple of competition metrics in case they are not available in `sklearn.metrics`.\n", "\n", "In particular, you will define:\n", "\n", "- Mean Squared Error (MSE) for the regression problem:\n", "\n", "$$ MSE = \\frac{1}{N} \\sum_{i=1}^{N}(y_i - \\hat{y_i})^2 $$\n", "\n", "- Logarithmic Loss (LogLoss) for the binary classification problem:\n", "\n", "$$ LogLoss = -\\frac{1}{N} \\sum_{i = 1}^N (y_i \\ln p_i + (1 - y_i) \\ln (1 - p_i)) $$" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "sample = pd.read_csv('./dataset/sample_reg_true_pred.csv')\n", "y_regression_true, y_regression_pred = sample['true'].to_numpy(), sample['pred'].to_numpy()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sklearn MSE: 0.15418. \n", "Your MSE: 0.15418. \n" ] } ], "source": [ "from sklearn.metrics import mean_squared_error\n", "\n", "# Define your own MSE function\n", "def own_mse(y_true, y_pred):\n", " # Square the differences between true and predicted values\n", " squares = np.power(y_true - y_pred, 2)\n", " # Find mean over all observations\n", " err = np.mean(squares)\n", " return err\n", "\n", "print('Sklearn MSE: {:.5f}. '.format(mean_squared_error(y_regression_true, y_regression_pred)))\n", "print('Your MSE: {:.5f}. 
'.format(own_mse(y_regression_true, y_regression_pred)))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sample_class = pd.read_csv('./dataset/sample_class_true_pred.csv')\n", "y_classification_true, y_classification_pred = sample_class['true'].to_numpy(), sample_class['pred'].to_numpy()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sklearn LogLoss: 1.10801\n", "Your LogLoss: 1.10801\n" ] } ], "source": [ "from sklearn.metrics import log_loss\n", "\n", "# Define your own LogLoss function\n", "def own_logloss(y_true, prob_pred):\n", " # Find loss for each observation\n", " terms = y_true * np.log(prob_pred) + (1 - y_true) * np.log(1 - prob_pred)\n", " # Find mean over all observations\n", " err = np.mean(terms)\n", " return -err\n", "\n", "print('Sklearn LogLoss: {:.5f}'.format(log_loss(y_classification_true, y_classification_pred)))\n", "print('Your LogLoss: {:.5f}'.format(own_logloss(y_classification_true, y_classification_pred)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial EDA\n", "- Goal of EDA\n", " - Size of the data\n", " - Properties of the target variable\n", " - Properties of the features\n", " - Generate ideas for feature engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EDA statistics\n", "As mentioned in the slides, you'll work with New York City taxi fare prediction data. You'll start by finding some basic statistics about the data. Then you'll move on to plotting some dependencies and generating hypotheses about them." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train shape: (20000, 8)\n", "Test shape: (9914, 7)\n" ] }, { "data": { "text/html": [ "
|   | id | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4.5 | 2009-06-15 17:26:21 UTC | -73.844311 | 40.721319 | -73.841610 | 40.712278 | 1 |
| 1 | 1 | 16.9 | 2010-01-05 16:52:16 UTC | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 |
| 2 | 2 | 5.7 | 2011-08-18 00:35:00 UTC | -73.982738 | 40.761270 | -73.991242 | 40.750562 | 2 |
| 3 | 3 | 7.7 | 2012-04-21 04:30:42 UTC | -73.987130 | 40.733143 | -73.991567 | 40.758092 | 1 |
| 4 | 4 | 5.3 | 2010-03-09 07:51:00 UTC | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 |