{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Open Machine Learning Course\n", "
Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/), Data Scientist @ Mail.Ru Group <br>\n", "All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#
Assignment #10 (demo)\n", "##
 Gradient boosting\n", "\n", "Your task is to beat at least 2 benchmarks in this [Kaggle Inclass competition](https://www.kaggle.com/c/flight-delays-spring-2018). You won't be given detailed instructions here; we only briefly describe how the second benchmark was achieved with Xgboost. Hopefully, at this stage of the course, a quick look at the data is enough for you to see that this is the kind of task where gradient boosting performs well. Most likely it will be Xgboost; note, however, that we have plenty of categorical features here." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import roc_auc_score\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "from xgboost import XGBClassifier" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv(\"../../data/flight_delays_train.csv\")\n", "test = pd.read_csv(\"../../data/flight_delays_test.csv\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train.head()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Given the flight departure time, carrier code, departure airport, destination airport, and flight distance, you have to predict whether the departure will be delayed by more than 15 minutes. As the simplest benchmark, let's take an Xgboost classifier and the two features that are easiest to use: `DepTime` and `Distance`. Such a model scores 0.68202 on the LB." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train = train[[\"Distance\", \"DepTime\"]].values\n", "y_train = train[\"dep_delayed_15min\"].map({\"Y\": 1, \"N\": 0}).values\n", "X_test = test[[\"Distance\", \"DepTime\"]].values\n", "\n", "X_train_part, X_valid, y_train_part, y_valid = train_test_split(\n", "    X_train, y_train, test_size=0.3, random_state=17\n", ")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We'll train Xgboost with default parameters on part of the data and estimate the holdout ROC AUC." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# `random_state` replaces the deprecated `seed` argument in recent xgboost versions\n", "xgb_model = XGBClassifier(random_state=17)\n", "\n", "xgb_model.fit(X_train_part, y_train_part)\n", "xgb_valid_pred = xgb_model.predict_proba(X_valid)[:, 1]\n", "\n", "roc_auc_score(y_valid, xgb_valid_pred)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we do the same with the whole training set, make predictions for the test set, and form a submission file. This is how you beat the first benchmark.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgb_model.fit(X_train, y_train)\n", "xgb_test_pred = xgb_model.predict_proba(X_test)[:, 1]\n", "\n", "pd.Series(xgb_test_pred, name=\"dep_delayed_15min\").to_csv(\n", " \"xgb_2feat.csv\", index_label=\"id\", header=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second benchmark in the leaderboard was achieved as follows:\n", "\n", "- Features `Distance` and `DepTime` were taken unchanged\n", "- A feature `Flight` was created from features `Origin` and `Dest`\n", "- Features `Month`, `DayofMonth`, `DayOfWeek`, `UniqueCarrier` and `Flight` were transformed with OHE (`LabelBinarizer`)\n", "- Logistic regression and gradient boosting (xgboost) were trained. Xgboost hyperparameters were tuned via cross-validation. First, the hyperparameters responsible for model complexity were optimized, then the number of trees was fixed at 500 and learning step was tuned.\n", "- Predicted probabilities were made via cross-validation using `cross_val_predict`. A linear mixture of logistic regression and gradient boosting predictions was set in the form $w_1 * p_{logit} + (1 - w_1) * p_{xgb}$, where $p_{logit}$ is a probability of class 1, predicted by logistic regression, and $p_{xgb}$ – the same for xgboost. $w_1$ weight was selected manually.\n", "- A similar combination of predictions was made for test set. \n", "\n", "Following the same steps is not mandatory. That’s just a description of how the result was achieved by the author of this assignment. Perhaps you might not want to follow the same steps, and instead, let’s say, add a couple of good features and train a random forest of a thousand trees.\n", "\n", "Good luck!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }