{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](mlcourse.ai) – Open Machine Learning Course \n", "###
Author: Irina Knyazeva, ODS Slack nickname: iknyazeva\n", " \n", "##
Solar flares forecasting (ML approach)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Research plan**\n", " - Dataset and features description\n", " - Exploratory data analysis\n", " - Visual analysis of the features\n", " - Patterns, insights, peculiarities of the data\n", " - Data preprocessing\n", " - Feature engineering and description\n", " - Cross-validation, hyperparameter tuning\n", " - Validation and learning curves\n", " - Prediction for hold-out and test samples\n", " - Model evaluation with metrics description\n", " - Conclusions" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1. What are solar flares and why do we need to forecast them?\n", "\n", "\n", "\n", "The Sun produces solar flares: great bursts of electromagnetic energy and particles that affect the Earth and the near-Earth environment. These flares can blow out transformers on power grids and disrupt satellite systems, so there is a long-standing task of predicting such events in order to minimize their negative impact. It is a difficult task because of the rarity of these events, and the success rate has not changed significantly over the last 60 years. Actually, this was the topic of my Ph.D. research, and I did it without any machine learning. Even in the era of big data, there has been no big breakthrough in this task. The most common approach is described in the paper [Bobra et al., 2014](http://link.springer.com/article/10.1007%2Fs11207-014-0529-3). The main drawback of that approach is that it ignores the time dependence of the features. Here I tried to use knowledge about working with time series in the features. All data for this project could be downloaded from the [link] " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Picture of the Sun\n", "Below is a picture of our Sun in one of its spectral lines, $ H_\alpha $, the most beautiful one. \n", "You can see bright regions on the solar surface; these regions are called active regions, and in most cases solar flares erupt from such regions. There is a great site (https://solarmonitor.org/) where information about the Sun is aggregated. Let's look at the Sun and its active regions. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.image as mpimg\n", "import wget\n", "import os\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "# look at the annotated H-alpha image from solarmonitor\n", "file_url = 'https://solarmonitor.org/data/2014/05/14/pngs/bbso/bbso_halph_fd_20140514_053834.png'\n", "DOWNLOAD = True\n", "IMG_PATH = '../../img/'\n", "file_name = file_url.split(sep='/')[-1]\n", "\n", "if DOWNLOAD:\n", " file_name = wget.download(file_url, out = os.path.join(IMG_PATH, file_name))\n", " img=mpimg.imread(file_name)\n", "else: \n", " img=mpimg.imread(os.path.join(IMG_PATH, file_name))\n", "plt.figure(figsize = (12,12))\n", "imgplot = plt.imshow(img)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Active regions\n", "Active regions are called active because of the strength of the magnetic field in these regions. The strength of the magnetic field can be taken from so-called solar magnetograms. A magnetogram of the full solar disk at the same time is shown in the next picture."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# look at the annotated magnetogram from solarmonitor\n", "file_url = 'https://solarmonitor.org/data/2014/05/14/pngs/shmi/shmi_maglc_fd_20140514_224622.png'\n", "DOWNLOAD = True\n", "IMG_PATH = '../../img/'\n", "file_name = file_url.split(sep='/')[-1]\n", "\n", "if DOWNLOAD:\n", " file_name = wget.download(file_url, out = os.path.join(IMG_PATH, file_name))\n", " img=mpimg.imread(file_name)\n", "else: \n", " img=mpimg.imread(os.path.join(IMG_PATH, file_name))\n", "plt.figure(figsize = (12,12))\n", "imgplot = plt.imshow(img)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A solar flare occurs when magnetic energy that has built up in the solar atmosphere is suddenly released. Solar flares are a frequent occurrence when the Sun is active, in the years around solar maximum. Many solar flares can occur on just one day during this period! Around solar minimum, solar flares might occur less than once per week.\n", "#### The classification of solar flares\n", "Solar flares are classified as A (smallest), B, C, M or X (strongest) according to the peak flux (in watts per square metre, W/m2) of 1 to 8 Ångström X-rays near Earth, as measured by the XRS instrument on board the GOES-15 satellite, which is in a geostationary orbit over the Pacific Ocean. Some (mostly stronger) solar flares can launch huge clouds of solar plasma into space, which we call a coronal mass ejection. When a coronal mass ejection arrives at Earth, it can cause a geomagnetic storm and intense auroral displays.\n", "\n", "#### Solar flare prediction \n", "Most techniques are based on the complexity of the photospheric magnetic field of the Sun's active regions. There are a large number of different characteristics that can be used to describe the magnetic field complexity. Due to many empirical assumptions made during their calculation, they are hardly reproducible. For HMI/SDO vector magnetograms an automated active region tracking system exists, called the Spaceweather HMI Active Region Patch [SHARP](http://jsoc.stanford.edu/doc/data/hmi/sharp/sharp.htm). For each active region, key features called SHARP parameters were calculated and are available online. Computation of these features is based on [SDO vector magnetograms](https://sdo.gsfc.nasa.gov/data/aiahmi/) . \n", "\n", "One example of an active region is shown in the picture below:\n", "\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Data description\n", "\n", "To define a solar flare, we only consider flares with a Geostationary Operational Environmental Satellite (GOES) X-ray flux peak magnitude above the M1.0 level. This allows us to focus only on major flares. For the purposes of this study, we defined a positive event to be an active region that flares with a peak magnitude above the M1.0 level, as defined by the GOES database. A negative event would be an active region that does not have such an event within a 24-hour time span. To collect active regions for the negative class, we also gather information about all regions where the X-ray flux peak magnitude is above the C1.0 level. So in our training set the same active region can be positive at one time and negative at another. At each time moment, our target value will be 1 if there is an event with flux above the M1.0 level in the next 24 hours, and 0 if not.\n"
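, "\n", "Below is a minimal, hedged sketch of this labelling rule (not the notebook's final pipeline); the column names follow the goes_events table built later, and the flux is assumed to be in the same 1e-6 W/m2 units, so the M1.0 level corresponds to 10.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hedged sketch of the target definition: label a (timestamp, region) pair as 1\n", "# if that region produces a flare at or above M1.0 within the next 24 hours.\n", "# Column names ('start_time', 'harp_number', 'flux') mirror the goes_events table.\n", "import numpy as np\n", "\n", "M1_LEVEL = 10.0                    # M1.0 = 1e-5 W/m^2 = 10 in 1e-6 W/m^2 units\n", "HORIZON = np.timedelta64(24, 'h')\n", "\n", "def label_moment(t, harp, flares):\n", "    \"\"\"Return 1 if region `harp` flares at >= M1.0 within 24 h after time `t`.\"\"\"\n", "    region = flares[flares['harp_number'] == harp]\n", "    upcoming = (region['start_time'] > t) & (region['start_time'] <= t + HORIZON)\n", "    return int((region.loc[upcoming, 'flux'] >= M1_LEVEL).any())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "To forecast this, we need to describe the complexity of each active region with features. 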
I (and most other researchers) did it with so-called SHARP features.\n", "\n", "The Solar Dynamics Observatory's Helioseismic and Magnetic Imager is the first instrument to continuously map the vector magnetic field of the Sun. The SDO takes the most data of any NASA satellite in history, approximately 2 terabytes per day, making it an ideal dataset for such a problem. Using this data, we can characterize active regions on the Sun. For the time frame of 2010 May to 2018 December, we focused on 18 parameters calculated from the SHARP vector magnetic field data. They characterize various physical and geometrical qualities of the active region.\n", "\n", "\n", "1. **USFLUX** is the total unsigned flux. \n", "2. **MEANGAM** is the mean angle of the field from radial. \n", "3. **MEANGBT** is the mean gradient of the total field.\n", "4. **MEANGBZ** is the mean gradient of the vertical field.\n", "5. **MEANGBH** is the mean gradient of the horizontal field. \n", "6. **MEANJZD** is the mean vertical current density. \n", "7. **TOTUSJZ** is the total unsigned vertical current. \n", "8. **MEANALP** is the mean characteristic twist parameter.\n", "9. **MEANJZH** is the mean current helicity.\n", "10. **TOTUSJH** is the total unsigned current helicity.\n", "11. **ABSNJZH** is the absolute value of the net current helicity.\n", "12. **SAVNCPP** is the sum of the modulus of the net current per polarity. \n", "13. **MEANPOT** is the mean photospheric magnetic free energy.\n", "14. **TOTPOT** is the total photospheric magnetic free energy density.\n", "15. **MEANSHR** is the mean shear angle. \n", "16. **SHRGT45** is the fraction of area with shear greater than 45 degrees. \n", "17. **R_VALUE** is the sum of flux near the polarity inversion line. \n", "18. **AREA_ACR** is the area of strong-field pixels in the active region. \n", "\n", "\n", "The following section of code initializes the start and end dates of the data set used in this study and also fetches the set of possible positive events and the mapper from NOAA active region numbers to the HARPNUMs used in our database." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get all info about solar flares from GOES\n", "This part contains code for gathering solar data. Here data from two sources are collected: GOES and SDO. There is a special package, sunpy, for handling solar data. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#pip install sunpy\n", "#pip install suds-py3\n", "#pip install drms\n", "from datetime import timedelta\n", "import datetime\n", "import sunpy\n", "from sunpy.time import TimeRange\n", "from sunpy.instr import goes\n", "import numpy as np \n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DOWNLOAD = False\n", "DATA_PATH = '../../data/solar_flares'\n", "if DOWNLOAD:\n", " time_range = TimeRange('2010/06/01 00:10', '2018/12/01 00:20')\n", " #time_range = TimeRange(t_start,t_end)\n", " goes_events = goes.get_goes_event_list(time_range,'C1')\n", " goes_events = pd.DataFrame(goes_events)\n", "else:\n", " goes_events = pd.read_csv(os.path.join(DATA_PATH,'goes_events_2010_2018.csv'), index_col=[0])\n", "goes_events['noaa_active_region'] = goes_events['noaa_active_region'].replace(0,np.nan)\n", "goes_events.dropna(inplace=True) \n", "goes_events.drop(['goes_location','event_date','end_time','peak_time'], axis=1, inplace=True)\n", "goes_events.start_time = goes_events.start_time.astype('datetime64[ns]')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "goes_events.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Active regions detections\n", "There are different approaches to active regions detections. One of them with manual correction and done each day in NOAA. Active regions in this catalog have NOAA numbers. The team of SDO has own fully automated system of AR detections, and their regions called HARPs. Also, they compute plenty of parameters of magnetic field complexity. So I used harp regions with features, but information about goes flux there is only for NOAA regions. HARP and NOAA regions are not coinciding, but there is the mapping between this two catalogs. Below the code for mapping between the HARP and NOAA regions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#download mapper NOAA\n", "if os.path.isfile(os.path.join(DATA_PATH,'all_harps_with_noaa_ars.txt')):\n", " num_mapper = pd.read_csv(os.path.join(DATA_PATH,'all_harps_with_noaa_ars.txt'), sep=' ',index_col=[0])\n", "else:\n", " num_mapper = pd.read_csv('http://jsoc.stanford.edu/doc/data/hmi/harpnum_to_noaa/all_harps_with_noaa_ars.txt',sep=' ')\n", " num_mapper.to_csv(os.path.join(DATA_PATH,'all_harps_with_noaa_ars.txt'), sep=' ')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_mapper.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def convert_noaa_to_harpnum(noaa_ar):\n", " \"\"\"\n", " Converts from a NOAA Active Region to a HARPNUM\n", " Returns harpnum if present, else None if there are no matching harpnums\n", " \n", " Args:\n", " \"\"\"\n", " idx = num_mapper[num_mapper['NOAA_ARS'].str.contains(str(int(noaa_ar)))]\n", " return None if idx.empty else int(idx.HARPNUM.values[0])\n", "goes_events['harp_number'] = goes_events['noaa_active_region'].apply(convert_noaa_to_harpnum)\n", "goes_events.dropna(inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Events class could be mapped to flux, which is continuous. 
It can be done with the flareclass_to_flux method from the goes module. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# GOES flare classes are better converted to a flux value, with the flareclass_to_flux method from goes\n", "goes_events['flux'] = goes_events['goes_class'].apply(lambda x: 1e06*goes.flareclass_to_flux(x).value)\n", "goes_events.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "One region can produce many flares of different classes. We have more than 1300 events. Let's look at the countplot over harp_number." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "sns.countplot(x='harp_number', data = goes_events)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Loading data\n", "\n", "Data with the main features of active regions can be taken from the SDO database. There is a special package for accessing the data, drms. We will download the meta information with all the keywords using drms." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# here is the list of keywords we want to download; keywords are computed for HARP regions \n", "# we walk through all the HARP regions, download the features and save them to disk (it is very time consuming)\n", "import drms\n", "c = drms.Client()\n", "list_keywords = ['T_REC,CRVAL1,CRLN_OBS,USFLUX,MEANGBT,MEANJZH,MEANPOT,SHRGT45,TOTUSJH,MEANGBH,MEANALP,MEANGAM,MEANGBZ,MEANJZD,TOTUSJZ,SAVNCPP,TOTPOT,MEANSHR,AREA_ACR,R_VALUE,ABSNJZH']\n", "harp_list = pd.unique(goes_events.harp_number)\n", "for harp in harp_list:\n", " str_query = f'hmi.sharp_cea_720s[{str(int(harp))}]'\n", "\n", " if os.path.isfile(os.path.join(DATA_PATH+'/keys_regions',str_query+'.csv')):\n", " print(f'Harp number {harp} already exists\\n')\n", " else:\n", " print(f'load region with Harp number {harp}')\n", " keys = c.query(str_query, key=list_keywords)\n", " keys.to_csv(os.path.join(DATA_PATH+'/keys_regions',str_query+'.csv'))\n", " \n", " " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Part 2. Exploratory data analysis\n", "Now there are many CSV files, one for each HARP region, and we can analyze the evolution of the different parameters with time. It is believed that the complexity of the magnetic field changes before flares and that there are special patterns in the feature evolution. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_harp_features_flares(harp, goes_events = goes_events, DATA_PATH=DATA_PATH, feature_key = 'R_VALUE'):\n", " str_query = f'hmi.sharp_cea_720s[{str(int(harp))}]'\n", " df = pd.read_csv(os.path.join(DATA_PATH+'/keys_regions',str_query+'.csv'), index_col=[0])\n", " df['T_REC'] = drms.to_datetime(df.T_REC)\n", " df.set_index('T_REC', inplace=True)\n", " first_date = df.index.get_values()[0]\n", "\n", " is_visible = abs(df['CRVAL1']-df['CRLN_OBS'])<60\n", " df = df[is_visible]\n", "\n", " flux = goes_events[goes_events['harp_number']==harp][['start_time','flux']].set_index('start_time')\n", " #plt.figure(figsize = (10,14))\n", " fig, ax1 = plt.subplots(figsize=(15,5))\n", " #ax1.figure(figsize = (10,14))\n", " first_date = flux.index.get_values()[0]\n", " first_date = df.index.get_values()[0]\n", " #t2 = flux.index.get_values()[0]\n", " #first_data = min(t1,t2)\n", " dates_to_show = pd.date_range(pd.Timestamp(first_date).strftime('%m/%d/%Y'), periods=14, freq='d')\n", " labels = dates_to_show.strftime('%b %d')\n", " color = 'tab:green'\n", " ax1.plot(df.index, df[feature_key], color=color)\n", " ax2 = ax1.twinx()\n", " ax2.bar(flux.index, flux.flux, width=0.05, facecolor='indianred')\n", " plt.setp(ax1, xticks=dates_to_show, xticklabels=labels);\n", " #ax2.set_ylim(0,10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "harp = 6327\n", "plot_harp_features_flares(harp, feature_key = 'R_VALUE')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Features from csv\n", "\n", "Now we need to generate features from time series evolutions. It is believed that active region \"prepared for giving a flare\". So I decided to take complexity features at the considered moment add aggregates from the past\n", " 1. Features at the time moment, not all moments each 2 hours\n", " 2. Mean and range (max-min) for features at the last 2 hours\n", " 3. Mean and range (max-min) at the last 6 hours\n", " 4. Mean and range (max-min) at the last 12 hours\n", " 5. Mean and range (max-min) at the last 24 hours\n", " 6. Mean and range (max-min) at the last 48 hours\n", " \n", "Also there are instrumental distortion if angle between observer and the point at the sun big, usually records where latituted more than 70 is droped from estimation. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def extract_features_from_csv(harp, time_stamp = 2, delay_hours = [2,6,12,24,48], long=70):\n", " \"\"\"\n", " process downloaded csv files with time series features and create features from them\n", " harp: harp number \n", " time_stamp: frequence \n", " delay_hour:\n", " return: data_frame with features\n", " \"\"\"\n", " \n", " str_query = f'hmi.sharp_cea_720s[{str(int(harp))}]'\n", " df = pd.read_csv(os.path.join(DATA_PATH+'/keys_regions',str_query+'.csv'), index_col=[0])\n", " df['T_REC'] = drms.to_datetime(df.T_REC)\n", " df.set_index('T_REC', inplace=True)\n", " is_visible = abs(df['CRVAL1']-df['CRLN_OBS'])level)]\n", " target = pd.Series(full_df.index.map(lambda x: np.sum((x>big_events.start_time - np.timedelta64(horizont,'h'))\n", " & (x0.5), index=class_names, columns=class_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred = logit_pipe.predict_proba(test_part[key_cols])\n", "class_names = ['No Flare', 'Flare']\n", "pd.DataFrame(confusion_matrix(test_part['bin_target'],y_pred[:,1]>0.5), index=class_names, columns=class_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we missed 11 flares at the test set, and 432 in train set.\n", "Let's try Random Forest from the box" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "rf = RandomForestClassifier(n_estimators = 100, max_depth=3, class_weight='balanced')\n", "score = cross_val_score(rf, train_part[key_cols], train_part['bin_target'], cv=tcv, scoring = 'roc_auc')\n", "print('Validation score:', score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "rf = RandomForestClassifier(n_estimators = 500, max_depth=3, class_weight='balanced')\n", "rf.fit(train_part[key_cols], train_part['bin_target'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf_pred = rf.predict_proba(train_part[key_cols])\n", "class_names = ['No Flare', 'Flare']\n", "pd.DataFrame(confusion_matrix(train_part['bin_target'],rf_pred[:,1]>0.5), index=class_names, columns=class_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf_pred = rf.predict_proba(test_part[key_cols])\n", "class_names = ['No Flare', 'Flare']\n", "pd.DataFrame(confusion_matrix(test_part['bin_target'],rf_pred[:,1]>0.5), index=class_names, columns=class_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import lightgbm as lgb" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "lg = lgb.LGBMClassifier(n_estimators=300, max_depth=5, num_leaves=10)\n", "tcv = TimeSeriesSplit(n_splits=10)\n", "score = cross_val_score(lg, train_part[key_cols], train_part['bin_target'], cv=tcv, scoring = 'roc_auc')\n", "print('Validation score:', score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lg = lgb.LGBMClassifier(n_estimators=500, max_depth=5, num_leaves=10)\n", "lg.fit(train_part[key_cols], train_part['bin_target'])\n", "lg_pred = lg.predict_proba(train_part[key_cols])\n", "class_names = ['No Flare', 'Flare']\n", "pd.DataFrame(confusion_matrix(train_part['bin_target'],lg_pred[:,1]>0.5), index=class_names, columns=class_names)" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "lg_pred = lg.predict_proba(test_part[key_cols])\n", "class_names = ['No Flare', 'Flare']\n", "pd.DataFrame(confusion_matrix(test_part['bin_target'],lg_pred[:,1]>0.5), index=class_names, columns=class_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lightgbm with this parameters give us less false positive alarms, but very low recall" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 6. Feature engineering and description \n", "\n", "Actually all features enineering was done before, it is historical features, previous flux, here we try to use them and add Year as a feature, because there is a 11 year cycle in solar activity. Number of flares changes significantly from year to year." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#number of flares by years\n", "train_df.groupby('Year')['bin_target'].count()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.compose import ColumnTransformer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#ohe = OneHotEncoder(sparse=False,handle_unknown='ignore')\n", "cat_cols = ['Year']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "idx_split = train_part.shape[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "tcv = TimeSeriesSplit(n_splits=10)\n", "num_cols = key_cols+['prev_flux']\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], prefix = 'Year'), train_df[num_cols]], axis=1)\n", "transformers = [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "logit_pipe = Pipeline([('transform', ct), ('logit', LogisticRegression(C=1, class_weight='balanced', random_state=17))])\n", "score = cross_val_score(logit_pipe, train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'], cv=tcv, scoring = 'roc_auc')\n", "print('Validation score:', score)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets try to add delay features for 24 hours, range for example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "tcv = TimeSeriesSplit(n_splits=10)\n", "delay=24\n", "col_range=[col+'_range_'+str(delay) for col in key_cols]\n", "num_cols = key_cols+['prev_flux']+col_range\n", "transformers = [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], prefix = 'Year'), train_df[num_cols]], axis=1)\n", "logit_pipe = Pipeline([('transform', ct), ('logit', LogisticRegression(C=1, class_weight='balanced', random_state=17))])\n", "score = cross_val_score(logit_pipe, train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'], cv=tcv, scoring = 'roc_auc')\n", "print('Validation score:', score)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't see any improvement, so I try will tune logistic with base features and lightgbm on the whole dataset. Random Forest will be later" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 7. 
Cross-validation, hyperparameter tuning" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "tcv = TimeSeriesSplit(n_splits=5)\n", "num_cols = key_cols+['prev_flux']\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], prefix = 'Year'), train_df[num_cols]], axis=1)\n", "transformers = [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "logit_pipe = Pipeline([('transform', ct), ('logit', LogisticRegression(C=1, class_weight='balanced', random_state=17))])\n", "\n", "param_grid = {\n", " 'logit__C': [.5, 1.0, 2.0, 10.0]\n", " }\n", "gs = GridSearchCV(logit_pipe, param_grid, cv=tcv, scoring = 'roc_auc', n_jobs = -1)\n", "gs.fit(train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(gs.cv_results_)[['params','mean_test_score', 'std_test_score', 'mean_train_score']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "tcv = TimeSeriesSplit(n_splits=5)\n", "num_cols = key_cols+['prev_flux']\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], prefix = 'Year'), train_df[num_cols]], axis=1)\n", "transformers = [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "logit_pipe = Pipeline([('transform', ct), ('logit', LogisticRegression(C=1, class_weight='balanced', random_state=17))])\n", "\n", "param_grid = {\n", " 'logit__C': [0.01, 0.025,0.05,.1,0.15,0.2,0.3,0.4]\n", " }\n", "gs = GridSearchCV(logit_pipe, param_grid, cv=tcv, scoring = 'roc_auc', n_jobs = -1)\n", "gs.fit(train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(gs.cv_results_)[['params','mean_test_score', 'std_test_score', 'mean_train_score']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "tcv = TimeSeriesSplit(n_splits=10)\n", "num_cols = key_cols+['prev_flux']\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], prefix = 'Year'), train_df[num_cols]], axis=1)\n", "transformers = [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "logit_pipe = Pipeline([('transform', ct), ('logit', LogisticRegression(C=0.01, class_weight='balanced', random_state=17))])\n", "score = cross_val_score(logit_pipe, train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'], cv=tcv, scoring = 'roc_auc', n_jobs=-1)\n", "print('Validation score:', score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning lightgbm (ohe features)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import lightgbm as lgb" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tcv = TimeSeriesSplit(n_splits=10)\n", "delay=24\n", "col_range=[col+'_range_'+str(delay) for col in key_cols]\n", "col_mean = [col+'_mean_'+str(delay) for col in key_cols]\n", "num_cols = key_cols+['prev_flux']+col_range+col_mean\n", "transformers = [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], 
prefix = 'Year'), train_df[num_cols]], axis=1)\n", "#score = cross_val_score(logit_pipe, train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'], cv=tcv, scoring = 'roc_auc')\n", "#print('Validation score:', score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "lg = lgb.LGBMClassifier(silent=False)\n", "param_dist = {\"max_depth\": [5,10,15,20],\n", " \"num_leaves\": [5,10,20,30],\n", " \"n_estimators\": [300],\n", " #\"learning_rate\" : [0.1,0.05]\n", " }\n", "grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv = tcv, scoring=\"roc_auc\", verbose=5)\n", "grid_search.fit(train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target']) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(grid_search.cv_results_)[['params','mean_test_score', 'std_test_score', 'mean_train_score']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(grid_search.cv_results_)[['params','mean_test_score', 'std_test_score', 'mean_train_score']].iloc[1,0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lg = lgb.LGBMClassifier(silent=False)\n", "param_dist = {\"max_depth\": [5],\n", " \"num_leaves\": [10],\n", " \"n_estimators\": [500],\n", " \"learning_rate\" : [0.025,0.015, 0.01]\n", " }\n", "grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv = 3, scoring=\"roc_auc\", verbose=5)\n", "grid_search.fit(train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target']) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(grid_search.cv_results_)[['params','mean_test_score', 'std_test_score', 'mean_train_score']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So for lightgbm final set of parameters param_dist = {\"max_depth\": [5],\n", " \"num_leaves\": [10],\n", " \"n_estimators\": [500],\n", " \"learning_rate\" : [0.01]\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 8. 
Validation and learning curves\n", "\n", "Let's plot the learning curves for both models with the best parameters." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import learning_curve" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tcv = TimeSeriesSplit(n_splits=5)\n", "num_cols = key_cols+['prev_flux']\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], prefix = 'Year'), train_df[num_cols]], axis=1)\n", "transformers = [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "logit_pipe = Pipeline([('transform', ct), ('logit', LogisticRegression(C=0.01, class_weight='balanced', random_state=17))])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_sizes, train_scores, test_scores = learning_curve(logit_pipe, train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'], \n", " train_sizes=np.arange(0.1,1, 0.2), \n", " cv=tcv, scoring='roc_auc', n_jobs=-1)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.grid(True)\n", "plt.plot(train_sizes, train_scores.mean(axis = 1), 'g-', marker='o', label='train')\n", "plt.plot(train_sizes, test_scores.mean(axis = 1), 'r-', marker='o', label='test')\n", "plt.ylim((0.0, 1.05))\n", "plt.legend(loc='lower right')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "delay=24\n", "col_range=[col+'_range_'+str(delay) for col in key_cols]\n", "col_mean = [col+'_mean_'+str(delay) for col in key_cols]\n", "num_cols = key_cols+['prev_flux']+col_range+col_mean\n", "transformers = [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], prefix = 'Year'), train_df[num_cols]], axis=1)\n", "lg = lgb.LGBMClassifier(silent=False,max_depth=5, num_leaves=10,n_estimators=500,learning_rate=0.01)\n", "train_sizes, train_scores, test_scores = learning_curve(lg, train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'], \n", " train_sizes=np.arange(0.1,1, 0.2), \n", " cv=tcv, scoring='roc_auc', n_jobs=-1)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.grid(True)\n", "plt.plot(train_sizes, train_scores.mean(axis = 1), 'g-', marker='o', label='train')\n", "plt.plot(train_sizes, test_scores.mean(axis = 1), 'r-', marker='o', label='test')\n", "plt.ylim((0.0, 1.05))\n", "plt.legend(loc='lower right')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The learning curves look OK!" ] }
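, { "cell_type": "markdown", "metadata": {}, "source": [ "Before the hold-out evaluation, here is a small helper (our own naming, not part of the original pipeline) that summarizes a confusion matrix into the two numbers we care about here: the fraction of flares we catch (recall) and the fraction of issued alarms that are false.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hedged helper: summarize a 2x2 confusion matrix [[TN, FP], [FN, TP]]\n", "# (sklearn's layout for labels 0/1) into flare recall and false-alarm ratio.\n", "import numpy as np\n", "\n", "def flare_report(cm):\n", "    tn, fp, fn, tp = np.asarray(cm).ravel()\n", "    recall = tp / (tp + fn) if (tp + fn) else float('nan')\n", "    false_alarm_ratio = fp / (fp + tp) if (fp + tp) else float('nan')\n", "    return {'recall': recall, 'false_alarm_ratio': false_alarm_ratio}\n", "\n", "# example with made-up numbers\n", "flare_report([[900, 100], [5, 95]])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Part 9. 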
Prediction for hold-out and test samples and model evaluation\n", "\n", "Now time to fit model on the whole dataset and test it on the hold out" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_cols = key_cols+['prev_flux']\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], prefix = 'Year'), train_df[num_cols]], axis=1)\n", "transformers = [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "logit_pipe = Pipeline([('transform', ct), ('logit', LogisticRegression(C=0.01, class_weight='balanced', random_state=17))])\n", "logit_pipe.fit(train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'])\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logit_pred = logit_pipe.predict_proba(train_ohe.loc[idx_split:,:])\n", "class_names = ['No Flare', 'Flare']\n", "print(roc_auc_score(train_df.loc[idx_split:,'bin_target'],logit_pred[:,1]))\n", "pd.DataFrame(confusion_matrix(train_df.loc[idx_split:,'bin_target'],logit_pred[:,1]>0.5), index=class_names, columns=class_names)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "delay=24\n", "col_range=[col+'_range_'+str(delay) for col in key_cols]\n", "col_mean = [col+'_mean_'+str(delay) for col in key_cols]\n", "num_cols = key_cols+['prev_flux']+col_range+col_mean\n", "transformers = [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], prefix = 'Year'), train_df[num_cols]], axis=1)\n", "lg = lgb.LGBMClassifier(silent=False,max_depth=5, num_leaves=10,n_estimators=800,learning_rate=0.01)\n", "lg.fit(train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#lg_pred = lg.predict_proba(train_ohe.loc[idx_split:,:])\n", "class_names = ['No Flare', 'Flare']\n", "lg_pred = lg.predict_proba(train_ohe.loc[idx_split:,:])\n", "pd.DataFrame(confusion_matrix(train_df.loc[idx_split:,'bin_target'],lg_pred[:,1]>0.5), index=class_names, columns=class_names)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see no false alarm, let's mix them" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_predict" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tcv = TimeSeriesSplit(n_splits=5)\n", "num_cols = key_cols+['prev_flux']\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], prefix = 'Year'), train_df[num_cols]], axis=1)\n", "transformers = [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "logit_pipe = Pipeline([('transform', ct), ('logit', LogisticRegression(C=0.01, class_weight='balanced', random_state=17))])\n", "y_pred_logit = cross_val_predict(logit_pipe, train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'], cv=3, method='predict_proba')[:,1]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "delay=24\n", "col_range=[col+'_range_'+str(delay) for col in key_cols]\n", "col_mean = [col+'_mean_'+str(delay) for col in key_cols]\n", "num_cols = key_cols+['prev_flux']+col_range+col_mean\n", "transformers 
= [('num', StandardScaler(), num_cols)]\n", "ct = ColumnTransformer(transformers=transformers, remainder = 'passthrough' )\n", "train_ohe = pd.concat([pd.get_dummies(train_df['Year'], prefix = 'Year'), train_df[num_cols]], axis=1)\n", "lg = lgb.LGBMClassifier(silent=False,max_depth=5, num_leaves=10,n_estimators=800,learning_rate=0.01)\n", "y_pred_lg = cross_val_predict(lg, train_ohe.loc[:idx_split,:], train_df.loc[:idx_split,'bin_target'], cv=3, method='predict_proba')[:,1]\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "answ = pd.DataFrame(data = np.vstack([y_pred_logit, y_pred_lg,train_df.loc[:idx_split,'bin_target'].values]).T, columns=['logit','lgb','real'])\n", "answ.loc[answ['real']==1].sample(5)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logit_2lev = LogisticRegression()\n", "cross_val_score(logit_2lev, answ.drop('real',axis=1), answ['real'], cv=5, scoring = 'roc_auc')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logit_2lev.fit(answ.drop('real',axis=1), answ['real'])\n", "logit_2lev.coef_/np.sum(logit_2lev.coef_)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_mixed = 0.73*logit_pred[:,1]+0.27*lg_pred[:,1]\n", "pd.DataFrame(confusion_matrix(train_df.loc[idx_split:,'bin_target'],y_mixed>0.5), index=class_names, columns=class_names)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Part 11. Conclusions\n", "\n", "In this task there is plenty of room for improvement, but the final result is good. Almost all flares were revealed, and a 10% false alarm rate is not a big problem." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "PyEnv 3.6", "language": "python", "name": "pyenv3.6" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }