{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Open Machine Learning Course\n", "Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/), Data Scientist @ Mail.Ru Group\n", "\n", "Assignment #10 (demo)\n", "## Gradient boosting\n", "\n", "Your task is to beat at least two benchmarks in this [Kaggle Inclass competition](https://www.kaggle.com/c/flight-delays-spring-2018). You won’t be given detailed instructions here, only a brief description of how the second benchmark was achieved using XGBoost. By this stage of the course, a quick look at the data should be enough to see that this is the type of task where gradient boosting performs well. XGBoost is the most likely choice; note, however, that there are plenty of categorical features here.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "from xgboost import XGBClassifier\n", "from sklearn.metrics import roc_auc_score" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv('../../data/flight_delays_train.csv')\n", "test = pd.read_csv('../../data/flight_delays_test.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Month</th><th>DayofMonth</th><th>DayOfWeek</th><th>DepTime</th><th>UniqueCarrier</th><th>Origin</th><th>Dest</th><th>Distance</th><th>dep_delayed_15min</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>c-8</td><td>c-21</td><td>c-7</td><td>1934</td><td>AA</td><td>ATL</td><td>DFW</td><td>732</td><td>N</td></tr>\n",
"    <tr><th>1</th><td>c-4</td><td>c-20</td><td>c-3</td><td>1548</td><td>US</td><td>PIT</td><td>MCO</td><td>834</td><td>N</td></tr>\n",
"    <tr><th>2</th><td>c-9</td><td>c-2</td><td>c-5</td><td>1422</td><td>XE</td><td>RDU</td><td>CLE</td><td>416</td><td>N</td></tr>\n",
"    <tr><th>3</th><td>c-11</td><td>c-25</td><td>c-6</td><td>1015</td><td>OO</td><td>DEN</td><td>MEM</td><td>872</td><td>N</td></tr>\n",
"    <tr><th>4</th><td>c-10</td><td>c-7</td><td>c-6</td><td>1828</td><td>WN</td><td>MDW</td><td>OMA</td><td>423</td><td>Y</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Month DayofMonth DayOfWeek DepTime UniqueCarrier Origin Dest Distance \\\n", "0 c-8 c-21 c-7 1934 AA ATL DFW 732 \n", "1 c-4 c-20 c-3 1548 US PIT MCO 834 \n", "2 c-9 c-2 c-5 1422 XE RDU CLE 416 \n", "3 c-11 c-25 c-6 1015 OO DEN MEM 872 \n", "4 c-10 c-7 c-6 1828 WN MDW OMA 423 \n", "\n", " dep_delayed_15min \n", "0 N \n", "1 N \n", "2 N \n", "3 N \n", "4 Y " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Month</th><th>DayofMonth</th><th>DayOfWeek</th><th>DepTime</th><th>UniqueCarrier</th><th>Origin</th><th>Dest</th><th>Distance</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>c-7</td><td>c-25</td><td>c-3</td><td>615</td><td>YV</td><td>MRY</td><td>PHX</td><td>598</td></tr>\n",
"    <tr><th>1</th><td>c-4</td><td>c-17</td><td>c-2</td><td>739</td><td>WN</td><td>LAS</td><td>HOU</td><td>1235</td></tr>\n",
"    <tr><th>2</th><td>c-12</td><td>c-2</td><td>c-7</td><td>651</td><td>MQ</td><td>GSP</td><td>ORD</td><td>577</td></tr>\n",
"    <tr><th>3</th><td>c-3</td><td>c-25</td><td>c-7</td><td>1614</td><td>WN</td><td>BWI</td><td>MHT</td><td>377</td></tr>\n",
"    <tr><th>4</th><td>c-6</td><td>c-6</td><td>c-3</td><td>1505</td><td>UA</td><td>ORD</td><td>STL</td><td>258</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>