{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Kaggle competitions process\n", "> In this first chapter, you will get exposure to the Kaggle competition process. You will train a model and prepare a CSV file ready for submission. You will learn the difference between Public and Private test splits, and how to prevent overfitting. This is a summary of the lecture \"Winning a Kaggle Competition in Python\" from DataCamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Kaggle, Machine_Learning]\n", "- image: images/Kaggle_logo.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Competitions overview\n", "- Kaggle benefits\n", "    - Get practical experience with real-world data\n", "    - Develop portfolio projects\n", "    - Meet a great Data Science community\n", "    - Try a new domain or model type\n", "    - Keep up to date with the best-performing methods\n", "- Process\n", "![process](image/competition_process.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore train data\n", "You will work with another Kaggle competition called \"Store Item Demand Forecasting Challenge\". In this competition, you are given 5 years of store-item sales data and asked to predict 3 months of sales for 50 different items in 10 different stores.\n", "\n", "To begin, let's explore the train data for this competition. For faster performance, you will work with a subset of the train data containing only a single month of history.\n", "\n", "Your initial goal is to read the input data and take a first look at it." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train shape: (15500, 5)\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>id</th><th>date</th><th>store</th><th>item</th><th>sales</th></tr>\n",
 "  </thead>\n", "  <tbody>\n",
 "    <tr><th>0</th><td>100000</td><td>2017-12-01</td><td>1</td><td>1</td><td>19</td></tr>\n",
 "    <tr><th>1</th><td>100001</td><td>2017-12-02</td><td>1</td><td>1</td><td>16</td></tr>\n",
 "    <tr><th>2</th><td>100002</td><td>2017-12-03</td><td>1</td><td>1</td><td>31</td></tr>\n",
 "    <tr><th>3</th><td>100003</td><td>2017-12-04</td><td>1</td><td>1</td><td>7</td></tr>\n",
 "    <tr><th>4</th><td>100004</td><td>2017-12-05</td><td>1</td><td>1</td><td>20</td></tr>\n",
 "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "       id        date  store  item  sales\n", "0  100000  2017-12-01      1     1     19\n", "1  100001  2017-12-02      1     1     16\n", "2  100002  2017-12-03      1     1     31\n", "3  100003  2017-12-04      1     1      7\n", "4  100004  2017-12-05      1     1     20" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read train data\n", "train = pd.read_csv('./dataset/demand_forecasting_train_1_month.csv')\n", "\n", "# Look at the shape of the data\n", "print('Train shape:', train.shape)\n", "\n", "# Look at the head() of the data\n", "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore test data\n", "Having looked at the train data, let's explore the test data in the \"Store Item Demand Forecasting Challenge\". Remember that the test dataset generally contains one column fewer than the train one.\n", "\n", "The missing column (the target), together with the output format, is presented in the sample submission file. Before making any progress in the competition, you should get familiar with the expected output.\n", "\n", "So let's look at the columns of the test dataset and compare them to the train columns. Additionally, let's explore the format of the sample submission." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train columns: ['id', 'date', 'store', 'item', 'sales']\n", "Test columns: ['id', 'date', 'store', 'item']\n" ] } ], "source": [ "# Read the test data\n", "test = pd.read_csv('./dataset/demand_forecasting_test.csv')\n", "\n", "# Print train and test columns\n", "print('Train columns:', train.columns.tolist())\n", "print('Test columns:', test.columns.tolist())" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>id</th><th>sales</th></tr>\n",
 "  </thead>\n", "  <tbody>\n",
 "    <tr><th>0</th><td>0</td><td>52</td></tr>\n",
 "    <tr><th>1</th><td>1</td><td>52</td></tr>\n",
 "    <tr><th>2</th><td>2</td><td>52</td></tr>\n",
 "    <tr><th>3</th><td>3</td><td>52</td></tr>\n",
 "    <tr><th>4</th><td>4</td><td>52</td></tr>\n",
 "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "   id  sales\n", "0   0     52\n", "1   1     52\n", "2   2     52\n", "3   3     52\n", "4   4     52" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read the sample submission file\n", "sample_submission = pd.read_csv('./dataset/sample_submission.csv')\n", "\n", "# Look at the head() of the sample submission\n", "sample_submission.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sample submission file consists of two columns: the `id` of the observation and a `sales` column for your predictions. Kaggle will evaluate your predictions against the true `sales` data for the corresponding `id`. So, it’s important to keep track of the predictions by `id` before submitting them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare your first submission\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Determine a problem type\n", "You will keep working on the Store Item Demand Forecasting Challenge. Recall that you are given a history of store-item sales data and asked to predict 3 months of future sales.\n", "\n", "Before building a model, you should determine the problem type you are addressing. The goal of this exercise is to look at the distribution of the target variable and select the correct problem type you will be building a model for." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Plot the distribution of the target variable\n", "fig, ax = plt.subplots()\n", "train.sales.hist(ax=ax);\n", "ax.set_title('histogram of sales');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train a simple model\n", "As you determined, you are dealing with a regression problem, so you're now ready to build a model for a subsequent submission. This time, instead of building the simplest Linear Regression model as in the slides, let's build an out-of-the-box Random Forest model.\n", "\n", "You will use the `RandomForestRegressor` class from the scikit-learn library.\n", "\n", "Your objective is to train a Random Forest model with default parameters on the \"store\" and \"item\" features.\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor()" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "# Create a Random Forest object\n", "rf = RandomForestRegressor()\n", "\n", "# Train a model\n", "rf.fit(X=train[['store', 'item']], y=train['sales'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare a submission\n", "You've already built a model on the training data from the Kaggle Store Item Demand Forecasting Challenge. Now it's time to make predictions on the test data and create a submission file in the specified format.\n", "\n", "Your goal is to read the test data, make predictions, and save these in the format specified in the \"sample_submission.csv\" file." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>id</th><th>sales</th></tr>\n",
 "  </thead>\n", "  <tbody>\n",
 "    <tr><th>0</th><td>0</td><td>52</td></tr>\n",
 "    <tr><th>1</th><td>1</td><td>52</td></tr>\n",
 "    <tr><th>2</th><td>2</td><td>52</td></tr>\n",
 "    <tr><th>3</th><td>3</td><td>52</td></tr>\n",
 "    <tr><th>4</th><td>4</td><td>52</td></tr>\n",
 "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "   id  sales\n", "0   0     52\n", "1   1     52\n", "2   2     52\n", "3   3     52\n", "4   4     52" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the head() of the sample_submission\n", "sample_submission.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Get predictions for the test set\n", "test['sales'] = rf.predict(test[['store', 'item']])\n", "\n", "# Write test predictions using the sample_submission format\n", "test[['id', 'sales']].to_csv('kaggle_submission.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id,sales\r\n", "0,17.143270686120914\r\n", "1,17.143270686120914\r\n", "2,17.143270686120914\r\n", "3,17.143270686120914\r\n", "4,17.143270686120914\r\n", "5,17.143270686120914\r\n", "6,17.143270686120914\r\n", "7,17.143270686120914\r\n", "8,17.143270686120914\r\n" ] } ], "source": [ "!head kaggle_submission.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Public vs Private leaderboard\n", "- Competition metric\n", "\n", "| Evaluation metric | Type of problem |\n", "| ----------------- | --------------- | \n", "| Area Under the ROC Curve (AUC) | Classification |\n", "| F1 score (F1) | Classification |\n", "| Mean Log Loss (LogLoss) | Classification |\n", "| Mean Absolute Error (MAE) | Regression |\n", "| Mean Squared Error (MSE) | Regression |\n", "| Mean Average Precision at K (MAPK, MAP@K) | Ranking |\n", "\n", "- Overfitting in Kaggle\n", "![of](image/overfitting_kaggle.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train XGBoost models\n", "Every Machine Learning method can potentially overfit. You will see this in the following example with XGBoost. Again, you are working with the Store Item Demand Forecasting Challenge. 
\n", "\n", "First, let's train multiple XGBoost models with different sets of hyperparameters using XGBoost's learning API. The single hyperparameter you will change is:\n", "\n", "- `max_depth` - maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import xgboost as xgb\n", "\n", "# Create DMatrix on train data\n", "dtrain = xgb.DMatrix(data=train[['store', 'item']],\n", "                     label=train['sales'])\n", "\n", "# Define xgboost parameters\n", "params = {'objective': 'reg:squarederror',\n", "          'max_depth': 2,\n", "          'verbosity': 1}\n", "\n", "# Train xgboost model\n", "xg_depth_2 = xgb.train(params=params, dtrain=dtrain)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Define xgboost parameters\n", "params = {'objective': 'reg:squarederror',\n", "          'max_depth': 8,\n", "          'verbosity': 1}\n", "\n", "# Train xgboost model\n", "xg_depth_8 = xgb.train(params=params, dtrain=dtrain)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Define xgboost parameters\n", "params = {'objective': 'reg:squarederror',\n", "          'max_depth': 15,\n", "          'verbosity': 1}\n", "\n", "# Train xgboost model\n", "xg_depth_15 = xgb.train(params=params, dtrain=dtrain)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore overfitting XGBoost\n", "Having trained 3 XGBoost models with different maximum depths, you will now evaluate their quality. For this purpose, you will measure the quality of each model on both the train data and the test data. As you know by now, the train data is the data the models were trained on. The test data is the next month's sales data, which the models have never seen.\n", "\n", "The goal of this exercise is to determine whether any of the trained models is overfitting. 
To measure the quality of the models, you will use Mean Squared Error (MSE). It's available in `sklearn.metrics` as the `mean_squared_error()` function, which takes two arguments: true values and predicted values." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE Train: 331.064. MSE Test: 249.849\n", "MSE Train: 112.057. MSE Test: 30.751\n", "MSE Train: 84.952. MSE Test: 3.536\n" ] } ], "source": [ "from sklearn.metrics import mean_squared_error\n", "\n", "# Feature-only DMatrix objects for prediction (no labels needed)\n", "dtrain = xgb.DMatrix(data=train[['store', 'item']])\n", "dtest = xgb.DMatrix(data=test[['store', 'item']])\n", "\n", "# For each of 3 trained models\n", "for model in [xg_depth_2, xg_depth_8, xg_depth_15]:\n", "    # Make predictions\n", "    train_pred = model.predict(dtrain)\n", "    test_pred = model.predict(dtest)\n", "\n", "    # Calculate metrics. Note: this assumes test['sales'] holds the true sales.\n", "    # In this notebook that column was filled with Random Forest predictions in an\n", "    # earlier cell, so the MSE Test values are not measured against ground truth.\n", "    mse_train = mean_squared_error(train['sales'], train_pred)\n", "    mse_test = mean_squared_error(test['sales'], test_pred)\n", "    print('MSE Train: {:.3f}. MSE Test: {:.3f}'.format(mse_train, mse_test))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }