{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**[Intermediate Machine Learning Home Page](https://www.kaggle.com/learn/intermediate-machine-learning)**\n", "\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, you will use your new knowledge to train a model with **gradient boosting**.\n", "\n", "# Setup\n", "\n", "The questions below will give you feedback on your work. Run the following cell to set up the feedback system." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setup Complete\n" ] } ], "source": [ "# Set up code checking\n", "from learntools.core import binder\n", "binder.bind(globals())\n", "from learntools.ml_intermediate.ex6 import *\n", "print(\"Setup Complete\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will work with the [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course) dataset from the previous exercise. \n", "\n", "![Ames Housing dataset image](https://i.imgur.com/lTJVG4e.png)\n", "\n", "Run the next code cell without changes to load the training and validation sets in `X_train`, `X_valid`, `y_train`, and `y_valid`. The test set is loaded in `X_test`." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Read the data\n", "X = pd.read_csv('../input/train.csv', index_col='Id')\n", "X_test_full = pd.read_csv('../input/test.csv', index_col='Id')\n", "\n", "# Remove rows with missing target, separate target from predictors\n", "X.dropna(axis=0, subset=['SalePrice'], inplace=True)\n", "y = X.SalePrice \n", "X.drop(['SalePrice'], axis=1, inplace=True)\n", "\n", "# Break off validation set from training data\n", "X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,\n", " random_state=0)\n", "\n", "# \"Cardinality\" means the number of unique values in a column\n", "# Select categorical columns with relatively low cardinality (convenient but arbitrary)\n", "low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and \n", " X_train_full[cname].dtype == \"object\"]\n", "\n", "# Select numeric columns\n", "numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]\n", "\n", "# Keep selected columns only\n", "my_cols = low_cardinality_cols + numeric_cols\n", "X_train = X_train_full[my_cols].copy()\n", "X_valid = X_valid_full[my_cols].copy()\n", "X_test = X_test_full[my_cols].copy()\n", "\n", "# One-hot encode the data (to shorten the code, we use pandas)\n", "X_train = pd.get_dummies(X_train)\n", "X_valid = pd.get_dummies(X_valid)\n", "X_test = pd.get_dummies(X_test)\n", "X_train, X_valid = X_train.align(X_valid, join='left', axis=1)\n", "X_train, X_test = X_train.align(X_test, join='left', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 1: Build model\n", "\n", "In this step, you'll build and train your first model with gradient boosting.\n", "\n", "- Begin by setting `my_model_1` to an XGBoost model. 
Use the [XGBRegressor](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor) class, and set the random seed to 0 (`random_state=0`). **Leave all other parameters as default.**\n", "- Then, fit the model to the training data in `X_train` and `y_train`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.5, \"interactionType\": 1, \"questionType\": 2, \"questionId\": \"1.1_Model1A\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Correct" ], "text/plain": [ "Correct" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from xgboost import XGBRegressor\n", "\n", "# Define the model\n", "my_model_1 = XGBRegressor(random_state=0) # Your code here\n", "\n", "# Fit the model\n", "my_model_1.fit(X_train, y_train) # Your code here\n", "\n", "# Check your answer\n", "step_1.a.check()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Lines below will give you a hint or solution code\n", "#step_1.a.hint()\n", "#step_1.a.solution()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set `predictions_1` to the model's predictions for the validation data. Recall that the validation features are stored in `X_valid`." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.5, \"interactionType\": 1, \"questionType\": 2, \"questionId\": \"1.2_Model1B\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Correct" ], "text/plain": [ "Correct" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import mean_absolute_error\n", "\n", "# Get predictions\n", "predictions_1 = my_model_1.predict(X_valid) # Your code here\n", "\n", "# Check your answer\n", "step_1.b.check()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Lines below will give you a hint or solution code\n", "#step_1.b.hint()\n", "#step_1.b.solution()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, use the `mean_absolute_error()` function to calculate the mean absolute error (MAE) corresponding to the predictions for the validation set. Recall that the labels for the validation data are stored in `y_valid`." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Absolute Error: 17662.736729452055\n" ] }, { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.5, \"interactionType\": 1, \"questionType\": 2, \"questionId\": \"1.3_Model1C\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Correct" ], "text/plain": [ "Correct" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Calculate MAE\n", "mae_1 = mean_absolute_error(y_valid, predictions_1) # Your code here\n", "\n", "# Print MAE\n", "print(\"Mean Absolute Error:\", mae_1)\n", "\n", "# Check your answer\n", "step_1.c.check()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Lines below will give you a hint or solution code\n", "#step_1.c.hint()\n", "#step_1.c.solution()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 2: Improve the model\n", "\n", "Now that you've trained a default model as a baseline, it's time to tinker with the parameters to see if you can get better performance!\n", "- Begin by setting `my_model_2` to an XGBoost model, using the [XGBRegressor](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor) class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like `n_estimators` and `learning_rate`) to get better results.\n", "- Then, fit the model to the training data in `X_train` and `y_train`.\n", "- Set `predictions_2` to the model's predictions for the validation data. 
Recall that the validation features are stored in `X_valid`.\n", "- Finally, use the `mean_absolute_error()` function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in `y_valid`.\n", "\n", "For this step to be marked correct, your model in `my_model_2` must attain a lower MAE than the model in `my_model_1`. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Absolute Error: 16688.691513270547\n" ] }, { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.5, \"interactionType\": 1, \"questionType\": 2, \"questionId\": \"2_Model2\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Correct" ], "text/plain": [ "Correct" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Define the model\n", "my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.05) # Your code here\n", "\n", "# Fit the model\n", "my_model_2.fit(X_train, y_train) # Your code here\n", "\n", "# Get predictions\n", "predictions_2 = my_model_2.predict(X_valid) # Your code here\n", "\n", "# Calculate MAE\n", "mae_2 = mean_absolute_error(y_valid, predictions_2) # Your code here\n", "\n", "# Print MAE\n", "print(\"Mean Absolute Error:\", mae_2)\n", "\n", "# Check your answer\n", "step_2.check()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Lines below will give you a hint or solution code\n", "#step_2.hint()\n", "#step_2.solution()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 3: Break the model\n", "\n", "In 
this step, you will create a model that performs worse than the original model in Step 1. This will help you develop your intuition for how to set parameters. You might even find that you accidentally get better performance, which is a nice problem to have and a valuable learning experience!\n", "- Begin by setting `my_model_3` to an XGBoost model, using the [XGBRegressor](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor) class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like `n_estimators` and `learning_rate`) to design a model with a high MAE.\n", "- Then, fit the model to the training data in `X_train` and `y_train`.\n", "- Set `predictions_3` to the model's predictions for the validation data. Recall that the validation features are stored in `X_valid`.\n", "- Finally, use the `mean_absolute_error()` function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in `y_valid`.\n", "\n", "For this step to be marked correct, your model in `my_model_3` must attain a higher MAE than the model in `my_model_1`. 
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Absolute Error: 20930.964656464042\n" ] }, { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.5, \"interactionType\": 1, \"questionType\": 2, \"questionId\": \"3_Model3\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Correct" ], "text/plain": [ "Correct" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Define the model\n", "my_model_3 = XGBRegressor(n_estimators=100, learning_rate=0.5)\n", "\n", "# Fit the model\n", "my_model_3.fit(X_train, y_train) # Your code here\n", "\n", "# Get predictions\n", "predictions_3 = my_model_3.predict(X_valid)\n", "\n", "# Calculate MAE\n", "mae_3 = mean_absolute_error(y_valid, predictions_3)\n", "\n", "# Uncomment to print MAE\n", "print(\"Mean Absolute Error:\" , mae_3)\n", "\n", "# Check your answer\n", "step_3.check()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Lines below will give you a hint or solution code\n", "#step_3.hint()\n", "#step_3.solution()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Absolute Error: 16719.397795376713\n" ] } ], "source": [ "# Define the model\n", "my_model_4 = XGBRegressor(n_estimators=1024, learning_rate=0.05)\n", "\n", "# Fit the model\n", "my_model_4.fit(X_train, y_train, \n", " early_stopping_rounds=32, \n", " eval_set=[(X_valid, y_valid)], \n", " verbose=False)\n", "\n", "# Get predictions\n", "predictions_4 = my_model_4.predict(X_valid)\n", "\n", "# Calculate MAE\n", "mae_4 = 
mean_absolute_error(y_valid, predictions_4)\n", "\n", "print(\"Mean Absolute Error:\" , mae_4)\n", "\n", "# Get test predictions \n", "preds_test = my_model_4.predict(X_test)\n", "\n", "# Save test predictions to file\n", "output = pd.DataFrame({'Id': X_test.index,\n", " 'SalePrice': preds_test})\n", "output.to_csv('submission.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Keep going\n", "\n", "Continue to learn about **[data leakage](https://www.kaggle.com/alexisbcook/data-leakage)**. This is an important issue for a data scientist to understand, and it has the potential to ruin your models in subtle and dangerous ways!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "**[Intermediate Machine Learning Home Page](https://www.kaggle.com/learn/intermediate-machine-learning)**\n", "\n", "\n", "\n", "\n", "\n", "*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 4 }