{ "cells": [ { "cell_type": "markdown", "metadata": { "_uuid": "04018cedf8962b1b2da83298c3a0173026f15ae0" }, "source": [ "# Introduction\n", "\n", "This notebook builds on my [previous attempt with the GTD](GTD_1st_round.ipynb), in which I explored the dataset and started running ML models on it. In this round, I first [tried to reconstruct two recent papers 'by hand'](gtd_pred_papers.ipynb) written on the same problem, that is, the prediction of terrorist groups based on the characteristics of previous incidents. While I managed to rebuild their preprocessing and modeling environment, I could not do it perfectly because\n", "1. The papers do not describe every single step in their workflow\n", "2. Both groups of authors used WEKA with its built-in algorithms (and did not publish their code). For instance, one of them used the J48 model, which is a Java implementation of C4.5 (while sklearn uses CART).\n", "\n", "After going through the papers, I started to [build out a pipeline](Y_trans_ppl.ipynb) which I hoped would allow me to turn various features on and off across the whole data processing/modeling workflow. One of my main reasons for that was to examine how the way the initial problem is framed relates to the modelling.\n", "\n", "## Problem discussion\n", "To give an example, the problem of predicting perpetrator groups is understood in a number of different ways, mirroring the specific issue the modelers want to solve and the means they have for that. Both papers (and most of the earlier ones) narrow the problem down to predicting Indian cases, usually with the general idea of helping to find out the groups behind an incident and eventually to arrest the perpetrators. This can imply many things (perhaps most importantly that there is an agent who can relatively quickly collect the same information the model is based on). Accordingly, the particular model setting needs to mirror this situation (i.e. 
what features do we allow into the model, how do we clean the data, where do we split, etc.). Finally, this is also related to the kind of data leakage where models take into account 'information from the future'.\n", "\n", "To examine this, however, I wanted to create a workflow which allows me to look over the whole process and to turn particular features on/off in it. Unfortunately, scikit-learn's current Pipeline (as far as I understand) does not allow\n", "1. transformations based on label values,\n", "2. changing the sample size within the pipeline,\n", "3. building the train-test split into the pipeline (or maybe I just did not find out how to do it).\n", "\n", "To tackle this issue I created a CustomTransformation wrapper around the normal scikit-learn entities and hoped that I could pass label and other information between the steps so that they would also allow me to include the above-mentioned features. This, however, proved to be harder than I expected (e.g. I had not used classes before), so after a couple of days I had to give up this project and focus on the actual prediction problem instead.\n", "\n", "## This notebook\n", "In this notebook I connected the aims and some of the functions of the previous ones. I continued to recreate the papers, but now I also tried to build, if not a pipeline, at least a workflow to see what affects what. Even so, it was not entirely easy to understand which step should come before or after another, so at some point I summarized it both in text and in a graph, which you can find at the end of this notebook.\n", "\n", "Eventually, I ran a number of modelling scenarios which gave me an insight into the issues around the papers, although I did not go into hyperparameter selection, metric interpretation (or even the use of multiple metrics) or multilabel classification (which I had hoped to do in this round).\n", "\n", "## The papers\n", "### CHRIST\n", "> D. Talreja, J. Nagaraj, N. J. 
Varsha and K. Mahesh, \"Terrorism analytics: Learning to predict the perpetrator,\" 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Udupi, 2017, pp. 1723-1726. doi: 10.1109/ICACCI.2017.8126092\n", "\n", "### KAnOE\n", "> Varun Teja Gundabathula and V. Vaidhehi, An Efficient Modelling of Terrorist Groups in India using Machine Learning Algorithms, Indian Journal of Science and Technology, Vol 11(15), DOI: 10.17485/ijst/2018/v11i15/121766, April 2018" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "83836c143f355975927e87794ec4623269ecaca7" }, "source": [ "# Libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Versions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* pandas: 0.23.3\n", "* scikit-learn: 0.19.1\n", "* numpy: 1.15.0\n", "* scipy: 1.1.0\n", "* imblearn 0.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Warning control" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "warnings.filterwarnings(action='ignore', category=DeprecationWarning)\n", "warnings.filterwarnings(action='ignore', category=FutureWarning)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "036573d721c9a4b43850b3944165a1a1ebc03a96" }, "source": [ "## Data handling" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "_uuid": "c7f324db07bccfdbfd5c2f0022d31644e0ca7f4c", "scrolled": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "aa7b376e9a37065ae0b520e8955d5fec867d20d0" }, "source": [ "## Preprocessing" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "_uuid": "e3db1d2dd0e97a9f9b86a3815af976a4fcaec092" }, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder, FunctionTransformer, Imputer, StandardScaler, Normalizer\n", "from sklearn.feature_selection import SelectKBest, 
chi2, RFE\n", "from imblearn.over_sampling import SMOTE, RandomOverSampler\n", "from scipy.sparse import csr_matrix" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "3b0c71acff1f801439272bd69d640a99601033c4" }, "source": [ "## Pipeline" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "_uuid": "180fa7c8896feba5c6381cb2fd378cff5b96b023", "scrolled": true }, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_validate, \\\n", " StratifiedKFold" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "9a73192d0be85a86a0ebd3ee0caf51dc854eb98d" }, "source": [ "## Models" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "_uuid": "7f9e91269e4cc16d9a946d9770601127d8c2c138", "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. 
It will be removed in a future NumPy release.\n", " from numpy.core.umath_tests import inner1d\n" ] } ], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.svm import LinearSVC\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier\n", "from sklearn.naive_bayes import GaussianNB" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "95d3b74d5b7be112f3383dd7f32a252c85c59028" }, "source": [ "## Metrics" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "_uuid": "3bbbc2e9a6fa8c5422a1acad20c2e1faaa21c843" }, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score, precision_score, classification_report, roc_auc_score" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "3ba62c1eefd97e81969091a3b7a978285f09b7f9" }, "source": [ "## Utilities" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import time" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "_uuid": "d85a1e3f02039ff73e726b555f34e42beac11d1c", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: JOBLIB_TEMP_FOLDER=/tmp\n" ] } ], "source": [ "# This sets the temp folder and is required for \"n_jobs=-1\" to work on Kaggle in some cases\n", "# (see https://www.kaggle.com/getting-started/45288#292143)\n", "%env JOBLIB_TEMP_FOLDER=/tmp" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "86765176fbbaec74c331fac5e78d309086470267" }, "source": [ "# Loading the data" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "6715920cb093a258c58e2bed3586b984edbf13a4" }, "source": [ "Previously we loaded the data and created a sample out of it. From now on we are going to use it for our analysis and modeling." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "_uuid": "501699cf01c63eca9d29d0f84e878d94cee9d5cc" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (4,6,31,33,53,61,62,63,76,79,90,92,94,96,114,115,121) have mixed types. Specify dtype option on import or set low_memory=False.\n", " interactivity=interactivity, compiler=compiler, result=result)\n" ] } ], "source": [ "# Instead of the excel from their homepage, I use the csv version they uploaded to Kaggle\n", "# gtd = pd.read_excel(\"globalterrorismdb_0617dist.xlsx\")\n", "gtd = pd.read_csv(\"../input/globalterrorismdb_0617dist.csv\", encoding='ISO-8859-1')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# In case we want to use a sample\n", "# gtd_ori = gtd\n", "gtd = gtd.sample(frac=0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Recreating the CHRIST paper" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "a84ffd108fd274d6808b339ff33b321368526ce2" }, "source": [ "The authors of the paper trained and tested the model on a subsample of the whole dataset, namely:\n", "* Between 1970-2015\n", "* Incidents related to India\n", "* Cases to which the dataset attributes a perpetrator group (i.e. no 'Unknown' labels)\n", "\n", "They also preselected a number of features for the model which they used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining the steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling missing values\n", "There are a number of features where missing values are noted with a special code." 
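, "\n", "\n", "As a minimal sketch of the idea on a toy frame (the frame `toy` and the mapping `codes` are made up for illustration and are not part of the GTD workflow), each column's special codes can be swapped for NaN like this:\n", "\n", "
```python
import numpy as np
import pandas as pd

# Hypothetical toy data: -9 and -99 stand in for missing-value codes
toy = pd.DataFrame({'nperps': [3, -99, 2], 'claimed': [1, -9, 0]})

# Assumed mapping of column name -> codes that mean 'missing'
codes = {'nperps': [-99, -9], 'claimed': [-9]}

cleaned = toy.copy()
for col, vals in codes.items():
    # Keep values that are not a missing code; the rest become NaN
    cleaned[col] = cleaned[col].where(~cleaned[col].isin(vals))

# Number of missing values per column after the replacement
n_missing = cleaned.isna().sum()
```
", "\n", "The notebook's own `mistonan` function below does the same on the real columns, with the codes taken from `miscodes`."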
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "miscodes = {\"0\": ['imonth', 'iday'],\n", "            \"-9\": ['claim2', 'claimed', 'compclaim', 'doubtterr', 'INT_ANY', 'INT_IDEO', 'INT_LOG', 'INT_MISC',\n", "                   'ishostkid', 'ndays', 'nhostkid', 'nhostkidus', 'nhours', 'nperpcap', 'nperps', 'nreleased',\n", "                   'property', 'ransom', 'ransomamt', 'ransomamtus', 'ransompaid', 'ransompaidus', 'vicinity'],\n", "            \"-99\": ['ransompaid', 'nperpcap', 'compclaim', 'nreleased', 'nperps', 'nhostkidus', 'ransomamtus',\n", "                    'ransomamt', 'nhours', 'ndays', 'propvalue', 'ransompaidus', 'nhostkid']}" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def mistonan(data, nancode):\n", "    \"\"\"Replaces columns' missing value code with numpy.NaN.\n", "\n", "    Parameters:\n", "    `data`: dataframe\n", "\n", "    `nancode`: dict mapping each missing value code (as a string)\n", "               to the list of columns which use that code\n", "    \"\"\"\n", "    data = data.copy()\n", "\n", "    for code in nancode.keys():\n", "        for col in nancode[code]:\n", "            if col in data.columns:\n", "                # Explicit assignment instead of an in-place `where` on a slice\n", "                data[col] = data[col].where(data[col] != float(code), np.nan)\n", "\n", "    return data" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "tdat = gtd.copy()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "29645" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat[tdat == -9].count().sum()  # The number of '-9' values in the data" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5674261052951983" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat.isna().mean().mean()  # The mean ratio of missing values over the columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function replaces the 
codes with NaNs" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "tdat = mistonan(tdat, miscodes)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat[tdat == -9].count().sum()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5851838806813858" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat.isna().mean().mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Imputing missing values" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def nanimputer(dat):\n", " \"\"\"Imputes missing values with the column's mean\n", " Takes:\n", " 'dat', dataframe\n", " \n", " Returns:\n", " 'tdat': the dataframe with the imputed values\n", " \"\"\"\n", " tdat = dat.copy()\n", " numcols = tdat.select_dtypes(exclude=object).columns # numeric columns\n", " hasna = tdat[numcols].isna().any() # query about missing values\n", " hasnacols = hasna[hasna].index # columns w/ missing values\n", "\n", " # A simple conditional imputer converted to df\n", " tdat[hasnacols] = pd.DataFrame(np.where(tdat[hasnacols].isna(), \n", " tdat[hasnacols].mean(), \n", " tdat[hasnacols]), \n", " columns=tdat[hasnacols].columns, \n", " index=tdat[hasnacols].index)\n", " \n", " return tdat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The missing value ratio in columns before and after the transformation" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "weapsubtype4 0.999765\n", "weaptype4 0.999706\n", "claimmode3 0.999061\n", "guncertain3 0.998004\n", "claim3 0.997945\n", "ransompaid 0.997887\n", "attacktype3 0.997652\n", 
"ransompaidus 0.997593\n", "ransomamtus 0.997417\n", "claimmode2 0.996947\n", "ransomamt 0.994071\n", "targsubtype3 0.993425\n", "natlty3 0.993073\n", "targtype3 0.993014\n", "compclaim 0.992427\n", "weapsubtype3 0.991136\n", "weaptype3 0.990431\n", "claim2 0.989727\n", "guncertain2 0.989257\n", "nhours 0.988612\n", "ndays 0.981626\n", "nreleased 0.966481\n", "attacktype2 0.966246\n", "hostkidoutcome 0.944232\n", "targsubtype2 0.940475\n", "propvalue 0.940006\n", "natlty2 0.940006\n", "targtype2 0.937951\n", "weapsubtype2 0.936190\n", "nhostkid 0.935721\n", " ... \n", "weapsubtype1 0.109657\n", "nwound 0.090167\n", "doubtterr 0.082301\n", "nkill 0.056355\n", "targsubtype1 0.053009\n", "latitude 0.025125\n", "longitude 0.025125\n", "natlty1 0.007455\n", "iday 0.004755\n", "ishostkid 0.003053\n", "INT_MISC 0.002818\n", "guncertain1 0.002172\n", "vicinity 0.000176\n", "imonth 0.000117\n", "extended 0.000000\n", "country 0.000000\n", "region 0.000000\n", "iyear 0.000000\n", "specificity 0.000000\n", "suicide 0.000000\n", "crit1 0.000000\n", "crit2 0.000000\n", "crit3 0.000000\n", "multiple 0.000000\n", "success 0.000000\n", "attacktype1 0.000000\n", "targtype1 0.000000\n", "weaptype1 0.000000\n", "individual 0.000000\n", "eventid 0.000000\n", "Length: 77, dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat.select_dtypes(exclude=object).isna().mean().sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "INT_ANY 0.0\n", "natlty2 0.0\n", "attacktype2 0.0\n", "attacktype3 0.0\n", "targtype1 0.0\n", "targsubtype1 0.0\n", "natlty1 0.0\n", "targtype2 0.0\n", "targsubtype2 0.0\n", "targtype3 0.0\n", "claimed 0.0\n", "targsubtype3 0.0\n", "natlty3 0.0\n", "guncertain1 0.0\n", "guncertain2 0.0\n", "guncertain3 0.0\n", "individual 0.0\n", "nperps 0.0\n", "attacktype1 0.0\n", "suicide 0.0\n", "success 0.0\n", 
"multiple 0.0\n", "iyear 0.0\n", "imonth 0.0\n", "iday 0.0\n", "extended 0.0\n", "country 0.0\n", "region 0.0\n", "latitude 0.0\n", "longitude 0.0\n", " ... \n", "ndays 0.0\n", "ransomamt 0.0\n", "claim2 0.0\n", "ransomamtus 0.0\n", "ransompaid 0.0\n", "ransompaidus 0.0\n", "hostkidoutcome 0.0\n", "nreleased 0.0\n", "INT_LOG 0.0\n", "INT_IDEO 0.0\n", "property 0.0\n", "nwoundte 0.0\n", "nwoundus 0.0\n", "nwound 0.0\n", "claimmode2 0.0\n", "claim3 0.0\n", "claimmode3 0.0\n", "compclaim 0.0\n", "weaptype1 0.0\n", "weapsubtype1 0.0\n", "weaptype2 0.0\n", "weapsubtype2 0.0\n", "weaptype3 0.0\n", "weapsubtype3 0.0\n", "weaptype4 0.0\n", "weapsubtype4 0.0\n", "nkill 0.0\n", "nkillus 0.0\n", "nkillter 0.0\n", "eventid 0.0\n", "Length: 77, dtype: float64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat = nanimputer(tdat)\n", "tdat.select_dtypes(exclude=object).isna().mean().sort_values(ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The imputed mean shows up as a new value in the column. I just realized that this interferes with the datatype converter function (see below)." 
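, "\n", "\n", "A possible way around this, sketched on toy data (the Series `counts` and its values are assumptions for illustration, not GTD data): filling a count-like column with its rounded median keeps every value whole, so a whole-number check of the `% 1 == 0` kind still passes, whereas the mean introduces a fractional value.\n", "\n", "
```python
import numpy as np
import pandas as pd

# Hypothetical count column with two missing entries
counts = pd.Series([0, 1, 1, 5, np.nan, np.nan])

# The mean of 0, 1, 1, 5 is 1.75, so the filled column gets a fractional value
mean_filled = counts.fillna(counts.mean())

# The median of 0, 1, 1, 5 is 1.0; round() guards against a fractional median
median_filled = counts.fillna(counts.median()).round()

# Whole-number check: True only if no value has a fractional part
mean_is_whole = bool((mean_filled % 1 == 0).all())
median_is_whole = bool((median_filled % 1 == 0).all())
```
", "\n", "Whether the median is a sensible fill value for the models is a separate question; the sketch only shows the effect on the whole-number check."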
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0.000000 9915\n", "0.100563 6902\n", "1.000000 90\n", "2.000000 33\n", "3.000000 21\n", "4.000000 13\n", "9.000000 8\n", "10.000000 7\n", "6.000000 7\n", "7.000000 6\n", "8.000000 6\n", "5.000000 5\n", "13.000000 3\n", "15.000000 3\n", "50.000000 2\n", "14.000000 2\n", "24.000000 2\n", "20.000000 2\n", "11.000000 2\n", "18.000000 1\n", "12.000000 1\n", "17.000000 1\n", "19.000000 1\n", "38.000000 1\n", "23.000000 1\n", "Name: nwoundte, dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat.nwoundte.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I tried to use this built-in sklearn version but it provided me a numpy array without indexes which I found hard to transfer back." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "#def imputenans(data, misvals=None, strategy='median'):\n", "# if len(misvals) == 0:\n", "# return data\n", "# \n", "# misdat = data.copy()\n", "# impute = Imputer(missing_values=misvals[0],\n", "# strategy=strategy,\n", "# verbose=0)\n", "# \n", "# numcols = misdat.select_dtypes(exclude=[object]).columns\n", "# #numeric = misdat.copy().loc[:, numcols]\n", "# \n", "# misdat.loc[:, numcols] = pd.DataFrame(impute.fit_transform(misdat.loc[:, numcols]))\n", "# \n", "# return imputenans(misdat, misvals=misvals[1:], strategy=strategy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling based on attribute values" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def attribute_sampler(X, startdate, enddate, country=None):\n", " \"\"\"Filters data samples based on the values of the attributes\n", " Here I tried to separate those transformations which are not related\n", " to the label.\n", " \n", " Takes\n", " - X: dataframe\n", " - 
startdate and enddate: first and last year to include\n", "    - country: the country to include (None keeps all countries)\n", "    \"\"\"\n", "    if 'country_txt' in X.columns and country is not None:\n", "        X = X[X.country_txt == country]\n", "\n", "    filcrit = (X.iyear >= max(startdate, X.iyear.min())) \\\n", "            & (X.iyear <= min(enddate, X.iyear.max()))\n", "\n", "    return X[filcrit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initially (within this notebook) I also wrapped the functions with FunctionTransformer to turn them into possible pipeline steps, but in the end I did not use them in a pipeline." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "attsamp = FunctionTransformer(attribute_sampler,\n", "                              validate=False,\n", "                              kw_args={\n", "                                  'startdate': 1970,\n", "                                  'enddate': 2015,\n", "                                  'country': 'India'\n", "                              })" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "tX0 = attsamp.fit_transform(tdat).copy()\n", "ty0 = tdat.gname[tX0.index]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1979 2015\n", "India\n" ] } ], "source": [ "# Checks\n", "print(tX0.iyear.min(), tX0.iyear.max())\n", "print(tdat.reindex(tX0.index).country_txt.unique()[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting columns" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "def col_sel(X, columns=None):\n", "    \"\"\"Selects columns from a DataFrame based on a list of column names.\n", "    Takes\n", "    - X: DataFrame\n", "    - columns: list of column names (None keeps all columns)\n", "\n", "    Returns\n", "    - X: The DataFrame with the filtered columns\n", "    \"\"\"\n", "    if columns is not None:\n", "        X = X.loc[:, columns]\n", "\n", "    return X" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# The two sets of columns the two 
papers used in their model\n", "kanoe_cols = ['iyear', 'attacktype1', 'targtype1', 'targsubtype1', 'weaptype1',\n", "              'latitude', 'longitude', 'natlty1', 'property', 'INT_ANY', 'multiple', 'crit3']\n", "\n", "christ_cols = ['iyear', 'imonth', 'iday', 'extended', 'provstate', 'city', 'attacktype1_txt',\n", "               'targtype1_txt', 'nperps', 'weaptype1_txt', 'nkill', 'nwound', 'nhostkid']" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "colsamp = FunctionTransformer(\n", "    col_sel,\n", "    validate=False,\n", "    kw_args={'columns': christ_cols})  # the other option: kanoe_cols" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "tX0 = colsamp.fit_transform(tX0)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['iyear' 'imonth' 'iday' 'extended' 'provstate' 'city' 'attacktype1_txt'\n", " 'targtype1_txt' 'nperps' 'weaptype1_txt' 'nkill' 'nwound' 'nhostkid']\n" ] } ], "source": [ "# Check\n", "print(tX0.columns.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling based on target values" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def target_sampler(X, y, mininc, onlyknown=True, ukcode='Unknown'):\n", "    \"\"\"Samples X (and, internally, y) based on information from y\n", "    (as opposed to for instance the attribute sampler\n", "    which relies only on information from X)\n", "\n", "    Takes:\n", "    - X, y: The attributes and their labels\n", "    - mininc: The minimum occurrence of a label to include\n", "      (i.e. 
for how many incidents a group is responsible)\n", "    - onlyknown: whether to drop samples with unknown target values\n", "    - ukcode: the code for unknown targets (this can be different if we\n", "      encode the labels)\n", "\n", "    Returns\n", "    - X: the filtered attributes (y is not returned, the caller has to\n", "      realign it through the index of X)\n", "    \"\"\"\n", "\n", "    if onlyknown:  # Check for known labels and filter the data\n", "        nukcrit = y != ukcode\n", "        X = X[nukcrit]\n", "        y = y[nukcrit]\n", "\n", "    # Marks the samples whose label has the minimum required occurrences\n", "    counts = y.value_counts()\n", "    keep = y.isin(counts[counts >= mininc].index)\n", "\n", "    return X[keep]  # y is missing from here!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the transformation the data has no 'Unknown' values and the minimum frequency of included labels is 2." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "tX1 = target_sampler(tX0, ty0, mininc=2, onlyknown=True, ukcode='Unknown')\n", "ty1 = ty0[tX1.index]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "True\n", "True\n", "2\n" ] } ], "source": [ "# Checks\n", "print(tX1.shape[0] == ty1.shape[0])\n", "print(('Communist Party of India - Maoist (CPI-Maoist)' == ty1).any())\n", "print(('Unknown' != ty1).all())\n", "print(ty1.value_counts().min())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### New feature creation" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "def naff(X, crits, zeroval, newcol):\n", "    \"\"\" A function which creates a new feature based on the number\n", "    of killed, wounded and hostages taken in an incident. 
\n", "    It also fills zero values at some of the places.\n", "    It basically bins the columns into categories based on effect size\n", "    and provides a column of the combinations of these categories.\n", "\n", "    Takes\n", "    - X: The DataFrame to transform\n", "    - crits, dict: the criteria based on which the new feature is created.\n", "\n", "        It works with the following structure:\n", "        {'nkill': (62, 124, 'abc')}\n", "        - the key is the column name\n", "        - the first two values are the numeric thresholds based on which\n", "          it bins the feature\n", "        - the three-letter string holds the codes of the three bins\n", "          (from the highest bin to the lowest)\n", "\n", "    - zeroval: the code zero values receive\n", "    - newcol: the name of the new column\n", "\n", "    \"\"\"\n", "    # Technical column\n", "    n = pd.Series('_')\n", "\n", "    # The loop goes over the columns and bins them\n", "    for key in crits.keys():\n", "\n", "        # I start with creating a series of 'n'-s with the\n", "        # indexes of the dataframe\n", "        i = pd.Series('n', name=key).repeat(X.shape[0])\n", "        i.index = X.index\n", "\n", "        # Binning the column based on the criteria from the dict\n", "        i[(X.loc[:,key] > 0)\n", "          & (X.loc[:,key] < crits[key][0])] = crits[key][2][2]\n", "        i[(X.loc[:,key] <= crits[key][1])\n", "          & (X.loc[:,key] >= crits[key][0])] = crits[key][2][1]\n", "        i[X.loc[:,key] > crits[key][1]] = crits[key][2][0]\n", "\n", "        # Concatenating the new column to the technical/other columns\n", "        n = pd.concat((n, i), axis=1)\n", "\n", "    # Cleaning missing values\n", "    n.nhostkid[n.nhostkid == -99] = zeroval\n", "    n = n.replace(np.NaN, zeroval)\n", "\n", "    # Dropping the technical column\n", "    n = n.reindex(X.index).drop(columns=0)\n", "\n", "    # Merging the values of the three columns together into strings\n", "    n = n.iloc[:,0] + n.iloc[:,1] + n.iloc[:,2]\n", "\n", "    # Replacing the old columns with the new one\n", "    X = X.drop(columns=list(crits.keys()))\n", "    X[newcol] = n\n", "\n", "    # Further data cleaning\n", "    
X.nperps = X.nperps.where(X.nperps != -99, 0)\n", "    X.nperps = X.nperps.fillna(0)\n", "\n", "    return X" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "80db91a3bb00d1d32a5fa408dfe1e379e0f4f8da" }, "source": [ "The feature creation criteria\n", "\n", "```python\n", "'crits' : {'nkill': (62, 124, 'abc'),\n", "           'nwound': (272, 544, 'def'),\n", "           'nhostkid': (400, 800, 'ghi')},\n", "'zeroval': 'n',\n", "'newcol': 'naffect'\n", "```" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "naff_col = FunctionTransformer(naff,\n", "                               validate=False,\n", "                               kw_args={'crits' : {'nkill': (62, 124, 'abc'),\n", "                                                   'nwound': (272, 544, 'def'),\n", "                                                   'nhostkid': (400, 800, 'ghi')},\n", "                                        'zeroval': 'n',\n", "                                        'newcol': 'naffect'\n", "                                       })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new column after the transformation" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "tX2 = naff_col.fit_transform(tX1).copy()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>nkill</th><th>nwound</th><th>nhostkid</th><th>naffect</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>123631</th><td>1.0</td><td>0.0</td><td>9.881279</td><td>cni</td></tr>\n",
"    <tr><th>34826</th><td>7.0</td><td>60.0</td><td>9.881279</td><td>cfi</td></tr>\n",
"    <tr><th>100820</th><td>3.0</td><td>4.0</td><td>9.881279</td><td>cfi</td></tr>\n",
"    <tr><th>75164</th><td>0.0</td><td>0.0</td><td>9.881279</td><td>nni</td></tr>\n",
"    <tr><th>59111</th><td>0.0</td><td>0.0</td><td>9.881279</td><td>nni</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "        nkill  nwound  nhostkid naffect\n", "123631    1.0     0.0  9.881279     cni\n", "34826     7.0    60.0  9.881279     cfi\n", "100820    3.0     4.0  9.881279     cfi\n", "75164     0.0     0.0  9.881279     nni\n", "59111     0.0     0.0  9.881279     nni" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat((tX1.loc[:, ['nkill', 'nwound', 'nhostkid']], tX2.naffect), axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The checks below were written for an earlier version and do not work at the moment" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "# crits : {'nkill': (62, 124, 'abc'),\n", "# 'nwound': (272, 544, 'def'),\n", "# 'nhostkid': (400, 800, 'ghi')}\n", "# \n", "# print((((X.loc[:, crits.keys()] == 0)\n", "# | (X.loc[:, crits.keys()] == -99)\n", "# ) == (naffect == 'n')).all().all())\n", "# \n", "# print(((X.loc[:, crits.keys()] > crits[key][1]) == (nc.isin(('a', 'd', 'g')))).all().all())\n", "# \n", "# print((((X.loc[:, crits.keys()] <= crits[key][1])\n", "# & (X.loc[:, crits.keys()] >= crits[key][0])\n", "# ) == (nc.isin(('b', 'e', 'h')))).all().all())\n", "# \n", "# print((((X.loc[:, crits.keys()] < crits[key][0])\n", "# & (X.loc[:, crits.keys()] > 0)\n", "# ) == (nc.isin(('c', 'f', 'i')))).all().all())\n", "# \n", "# print((tX2 != -99).all().all())\n", "# print(~tX2.isna().any().any())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting the column datatypes" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "df = gtd.copy()" ] }, { "cell_type": "code", "execution_count": 159, "metadata": {}, "outputs": [], "source": [ "def dtypeconv(df, col_dt_dict=None, auto=True):\n", "    \"\"\"Converts a DataFrame's datatypes based on either a dictionary\n", "    or an automatic logic. 
\n", "    ** The automatic mode is unfinished and probably interferes with the data imputer **\n", "\n", "    Takes\n", "    - df: DataFrame to transform\n", "    - col_dt_dict: the dict of {'intended datatype': columns} pairs; a type\n", "      (e.g. object) can be given instead of column names\n", "    - auto, Boolean: if True, the function ignores the dict and follows the\n", "      automatic logic.\n", "\n", "    Returns the transformed dataframe\n", "\n", "    The automatic converter logic\n", "    This is currently limited and probably has some dangers I am not aware of but,\n", "    I think, it helps with the speed of the models. It can do the following things:\n", "    - Converts objects to categoricals for those columns where the ratio\n", "      of unique labels is less than 0.5.\n", "    - Downcasts unsigned integer values into the lowest possible category\n", "      (e.g. uint8 for columns with only binary values)\n", "    \"\"\"\n", "\n", "    df = df.copy()\n", "\n", "    if auto:\n", "        objcidx = df.select_dtypes(object).columns  # Indexes of object type columns\n", "        numcidx = df.select_dtypes(exclude=object).columns  # Indexes of numerical columns\n", "        nnegcidx = numcidx[(df[numcidx] >= 0).all()]  # Indexes of non-negative columns\n", "\n", "        # Converting objects to categoricals\n", "        if len(objcidx) != 0:\n", "            unqrat = (df.nunique() / df.count())  # Unique value ratio\n", "            lehalfun = unqrat[unqrat < 0.5].index  # Columns with less than 0.5 unique ratio\n", "            tocatidx = objcidx.intersection(lehalfun)  # Object columns to convert\n", "            df[tocatidx] = df[tocatidx].astype('category')\n", "\n", "        # Downcasting integers\n", "        # Checks for actual whole values\n", "        # (While writing this up I realized that it interferes with the\n", "        # imputer function)\n", "\n", "        # Indexes of columns whose values have no fractional part (x % 1 == 0)\n", "        tointidx = nnegcidx[(df[nnegcidx] % 1).max() == 0]\n", "        icolmaxes = df[tointidx].max()  # Their maximum values\n", "\n", "        # Distributing the column names\n", "        ui8idx = tointidx[(icolmaxes <= 255)]\n", "        ui16idx = tointidx[(icolmaxes > 255) & (icolmaxes <= 65535)]\n", "        ui32idx = 
tointidx[(icolmaxes > 65535) & (icolmaxes <= 4294967295)]\n", " ui64idx = tointidx[(icolmaxes > 4294967295)]\n", " \n", " df[ui8idx] = df[ui8idx].astype('uint8')\n", " df[ui16idx] = df[ui16idx].astype('uint16')\n", " df[ui32idx] = df[ui32idx].astype('uint32')\n", " df[ui64idx] = df[ui64idx].astype('uint64')\n", " \n", " return df\n", " \n", " else:\n", " # Dictionary-based dtype conversion\n", " dks = col_dt_dict.keys() # Keys\n", " \n", " if len(tuple(dks)) == 0:\n", " return df \n", " \n", " dtype = tuple(dks)[0] # Type stored as key\n", " cols = col_dt_dict[dtype] # Columns\n", "\n", " # We can also provide a datatype instead of a column name\n", " # so it will transform all the columns of that type\n", " if type(cols) == type:\n", " cols = df.select_dtypes(cols).columns.values\n", "\n", " df.loc[:, cols]= df.loc[:, cols].astype(dtype)\n", " \n", " # Temporary dict for the recursion\n", " tempdict = col_dt_dict.copy()\n", " del tempdict[dtype]\n", " \n", " # Note: `auto` is not passed on here, so the recursive call falls back to\n", " # the automatic mode and the remaining dict entries are not applied\n", " return dtypeconv(df, tempdict) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dtype transformations used:\n", "\n", "```python\n", "'col_dt_dict' : {'int8': ('imonth', 'iday', 'extended'),\n", " 'int16': ('iyear',),\n", " 'category': object}\n", "```" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "dtconv = FunctionTransformer(dtypeconv, \n", " validate=False,\n", " kw_args={'auto' : False,\n", " 'col_dt_dict' : {'int8': ('imonth', 'iday', 'extended'),\n", " 'int16': ('iyear',),\n", " 'category': object}\n", " })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Old and new datatypes" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "iyear int64\n", "imonth float64\n", "iday float64\n", "extended int64\n", "provstate object\n", "city object\n", "attacktype1_txt object\n", "targtype1_txt object\n", "nperps float64\n", "weaptype1_txt object\n", "naffect 
object\n", "dtype: object" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tX2.dtypes" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "iyear uint16\n", "imonth uint8\n", "iday uint8\n", "extended uint8\n", "provstate category\n", "city object\n", "attacktype1_txt category\n", "targtype1_txt category\n", "nperps float64\n", "weaptype1_txt category\n", "naffect category\n", "dtype: object" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tX3 = dtconv.fit_transform(tX2)\n", "ty3 = ty0[tX3.index]\n", "tX3.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Automatic conversion" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "iyear uint16\n", "imonth uint8\n", "iday float64\n", "extended uint8\n", "provstate category\n", "city object\n", "attacktype1_txt category\n", "targtype1_txt category\n", "nperps float64\n", "weaptype1_txt category\n", "naffect category\n", "dtype: object" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dtypeconv(tX2, auto=True).dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As it turns out, the two do not give the same result, because there can be fractional values where there should be none (here in the calendar day of the incident, perhaps because of the imputation above)." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "iyear True\n", "imonth True\n", "iday False\n", "extended True\n", "provstate True\n", "city True\n", "attacktype1_txt True\n", "targtype1_txt True\n", "nperps True\n", "weaptype1_txt True\n", "naffect True\n", "dtype: bool" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tX3.dtypes == dtypeconv(tX2, 
auto=True).dtypes" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "103627 0.634246\n", "102826 0.634246\n", "Name: iday, dtype: float64" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(tX2.iday % 1)[(tX2.iday % 1) > 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encoding categoricals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "X values" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "def catenc(X, sparse=False):\n", " \"\"\"Uses pandas.get_dummies to create binary columns \n", " from the values of the object or category type attributes.\n", " \n", " Takes\n", " X: the dataframe to transform\n", " sparse: if True, transforms the dataframe to sparse at the end\n", " \n", " Returns the dataframe\n", " \"\"\"\n", " \n", " X = pd.get_dummies(X)\n", " #cols = X.columns\n", " \n", " if sparse:\n", " X = csr_matrix(X)\n", "\n", " return X #, cols" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "codeX = FunctionTransformer(catenc, validate=False, kw_args={'sparse': False})\n", "tX4 = codeX.fit_transform(tX3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "y values" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "def lenc(labels, labelencoder=None, inv=False):\n", " \"\"\"\n", " Encoding label values with scikit-learn's LabelEncoder.\n", " \n", " Takes\n", " - labels, array or Series of labels to encode\n", " - labelencoder, the encoder function (needed when\n", " doing recoding)\n", " - inv, boolean: If True, does inverse encoding (recoding). 
It requires the \n", " encoder function as input to work.\n", " \n", " Returns\n", " - l: the encoded labels as pandas.Series\n", " - lec: the fitted encoder (for later recoding)\n", " - miscode: the code for the 'Unknown' value (None if it does not occur)\n", " \"\"\"\n", " l = labels.copy()\n", " \n", " # Inverse encoding\n", " if inv == True:\n", " l = labelencoder.inverse_transform(l)\n", " l = pd.Series(l) # Transform to series\n", " l.index = labels.index\n", " return l\n", " \n", " else:\n", " lec = LabelEncoder()\n", " l = lec.fit_transform(l)\n", " \n", " # Convert the labels back to a Series so that they can be sampled\n", " # by index together with the X transformations\n", " l = pd.Series(l)\n", " l.index = labels.index\n", " \n", " # Returning the code of the \"Unknown\" label value\n", " if 'Unknown' in labels.values:\n", " miscode = np.unique(l[labels == 'Unknown'])[0]\n", " return l, lec, miscode\n", " else:\n", " return l, lec, None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. The new codes\n", "2. The code of the 'Unknown' label\n", "3. 
The result of recoding" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "123631 38\n", "34826 49\n", "100820 18\n", "75164 49\n", "59111 27\n", "133710 24\n", "111255 33\n", "94076 4\n", "93803 4\n", "96486 4\n", "dtype: int64\n", "None\n", "123631 People's Liberation Front of India\n", "34826 United Liberation Front of Assam (ULFA)\n", "100820 Karbi People's Liberation Tigers (KPLT)\n", "75164 United Liberation Front of Assam (ULFA)\n", "59111 Muslim Separatists\n", "133710 Maoists\n", "111255 National Socialist Council of Nagaland-Isak-Mu...\n", "94076 Communist Party of India - Maoist (CPI-Maoist)\n", "93803 Communist Party of India - Maoist (CPI-Maoist)\n", "96486 Communist Party of India - Maoist (CPI-Maoist)\n", "dtype: object\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. 
Use `array.size > 0` to check that an array is not empty.\n", " if diff:\n" ] } ], "source": [ "l, lec, unknown_code = lenc(ty3)\n", "print(l[:10])\n", "print(unknown_code)\n", "print(lenc(labels=l, labelencoder=lec, inv=True)[:10])" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "ency = FunctionTransformer(lenc, validate=False)\n", "ty4, tlec, tukcode = ency.fit_transform(ty3)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n" ] } ], "source": [ "# Checks\n", "print(np.unique(ty4).size == ty3.nunique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transforming X into sparse form" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a common step in the workflow, so I put it here." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "tX4 = csr_matrix(tX4)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(tX4.toarray()[:,0] == tX3.iloc[:,0]).all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rebalancing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The authors of the CHRIST paper used SMOTE oversampling and mixed resampling. Because resampling was slow at first, I focused only on oversampling (which also gave the best results for them).\n", "\n", "EDIT: I actually use `ratio='all'`, which resamples every class, because `'minority'` only oversamples the single least frequent label, and in a multi-class case that is not the only infrequent one." 
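, "\n",
"\n",
"A minimal sketch of how I understand the difference between the two settings, written in plain Python rather than with imbalanced-learn: for oversampling, `'minority'` tops up only the single rarest class, while `'all'` tops up every class to the size of the largest one. The helper `oversampling_targets` and the toy labels are hypothetical; they only illustrate the bookkeeping, not SMOTE itself.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"\n",
"def oversampling_targets(y, strategy='all'):\n",
"    # Number of extra samples each class needs to reach the largest class\n",
"    counts = Counter(y)\n",
"    largest = max(counts.values())\n",
"    if strategy == 'minority':\n",
"        # Only the single rarest class is topped up\n",
"        rarest = min(counts, key=counts.get)\n",
"        return {rarest: largest - counts[rarest]}\n",
"    # 'all': every class below the largest one is topped up\n",
"    return {label: largest - n for label, n in counts.items() if n < largest}\n",
"\n",
"\n",
"# Hypothetical toy labels: one frequent and two rare classes\n",
"y_toy = ['a'] * 6 + ['b'] * 3 + ['c'] * 1\n",
"print(oversampling_targets(y_toy, 'minority'))  # {'c': 5}\n",
"print(oversampling_targets(y_toy, 'all'))       # {'b': 3, 'c': 5}\n",
"```\n",
"\n",
"With more than two classes, `'minority'` leaves the other rare labels (here `'b'`) as they are, which is why `'all'` fits this problem better."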
] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "smote = SMOTE(ratio='all', k_neighbors=1, n_jobs=-1)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 10.8 s, sys: 2.42 s, total: 13.2 s\n", "Wall time: 20 s\n" ] } ], "source": [ "%%time\n", "tX5, ty5 = smote.fit_sample(tX4, ty4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new number of samples" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(7072, 445)\n", "(7072,)\n", "True\n" ] } ], "source": [ "# Checks\n", "print(tX5.shape)\n", "print(ty5.shape)\n", "print((pd.Series(ty5).value_counts() == pd.Series(ty4).value_counts().max()).all())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Splitting the dataset" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "_uuid": "cd062f51c417ffa8e0238426622ebdde19fca480" }, "outputs": [], "source": [ "validation_size = 0.2\n", "seed = 17" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "# The authors focused only on incidents from India.\n", "atts = gtd[gtd.country_txt == 'India'].copy()" ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [], "source": [ "labels, LEC, ukcode = lenc(labels=atts.gname)" ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "_uuid": "c503023a4cbc311c811fad15181a0ac3312fbff9", "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(821, 134)\n", "(206, 134)\n", "(821,)\n", "(206,)\n" ] } ], "source": [ "X_train, X_test, y_train, y_test = train_test_split(atts.drop(columns='gname'),\n", " labels,\n", " test_size=validation_size, \n", " random_state=seed)\n", "\n", "# 
Storing the original values for technical checks\n", "X_train0, X_test0, y_train0, y_test0 = X_train.copy(), X_test.copy(), y_train.copy(), y_test.copy()\n", "\n", "print(X_train.shape)\n", "print(X_test.shape)\n", "print(y_train.shape)\n", "print(y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function gathers together the previous steps. " ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [], "source": [ "def preproc(X_train, y_train, \n", " startdate=1970, enddate=2016, \n", " country=None, \n", " ukcode='Unknown', onlyknown=False, mininc=1,\n", " misctona=True, maxna=0, impute=False, nadrop=False,\n", " dropspec=True, \n", " onlyprim=True):\n", " \"\"\"\n", " A wrapper function collecting many of the data processing steps.\n", " \n", " Takes\n", " - X_train, y_train\n", " - startdate, enddate\n", " \n", " - country\n", " \n", " - ukcode, string or numeric: the code of the unknown target label\n", " \n", " - onlyknown: if True, samples with 'unknown' (or what is defined by `ukcode`) labels are excluded\n", " \n", " - mininc, int: the minimum number of occurrences a label needs to be included\n", " \n", " - misctona, bool: if True, transforms the missing codes into nans\n", " \n", " - maxna, float: the maximum ratio (between 0 and 1) of missing values allowed in a column,\n", " e.g. 
0.05 means that columns with more than 5% missing values are excluded\n", " \n", " - impute, bool: if True, imputes mean values into the missing ones\n", " \n", " - nadrop, bool: if True, drops samples with missing values \n", " (after dropping the columns and imputation)\n", " \n", " - dropspec, bool: if True, drops special columns\n", " \n", " - onlyprim, bool: if True, drops secondary and tertiary groups and all subnames.\n", " \n", " Returns\n", " X_train, y_train\n", " \"\"\"\n", " \n", " # drops special columns\n", " if dropspec == True:\n", " X_train = X_train.drop([\n", " 'eventid', \n", " 'addnotes', \n", " 'scite1', \n", " 'scite2', \n", " 'scite3', \n", " 'dbsource'],\n", " axis=1,\n", " errors='ignore')\n", "\n", " # leaving only the primary label values\n", " if onlyprim == True:\n", " X_train = X_train.drop(columns=['gsubname',\n", " 'gname2',\n", " 'gsubname2',\n", " 'gname3',\n", " 'gsubname3'], \n", " axis=1,\n", " errors='ignore')\n", "\n", " # Sampling based on attribute values only\n", " X_train = attribute_sampler(X_train, \n", " startdate=startdate, \n", " enddate=enddate, \n", " country=country)\n", "\n", " # Original missing codes to replace with np.NaN\n", " if misctona == True:\n", " miscodes = {\"0\": ['imonth', 'iday'], \n", " \"-9\": ['claim2', 'claimed', 'compclaim', 'doubtterr', 'INT_ANY', 'INT_IDEO', 'INT_LOG', 'INT_MISC', \n", " 'ishostkid', 'ndays', 'nhostkid', 'nhostkidus', 'nhours', 'nperpcap', 'nperps', 'nreleased', \n", " 'property', 'ransom', 'ransomamt', 'ransomamtus', 'ransompaid', 'ransompaidus', 'vicinity'], \n", " \"-99\": ['ransompaid', 'nperpcap', 'compclaim', 'nreleased', 'nperps', 'nhostkidus', 'ransomamtus', \n", " 'ransomamt', 'nhours', 'ndays', 'propvalue', 'ransompaidus', 'nhostkid']}\n", "\n", " X_train = mistonan(X_train, miscodes)\n", "\n", " # Dropping columns whose ratio of missing values exceeds `maxna`\n", " if maxna != 0:\n", " X_train = X_train.dropna(axis=1, thresh=X_train.shape[0] * (1 - maxna))\n", " 
\n", " # Imputing missing values\n", " if impute == True:\n", " X_train = nanimputer(X_train)\n", " \n", " # Dropping samples with missing values\n", " if nadrop == True:\n", " X_train = X_train.dropna()\n", " \n", " \n", " # We need to sample `y_train` because the next transformation\n", " # requires both x and y to be of the same sample size\n", " # -- possibly this was required for an earlier version\n", " y_train = y_train[X_train.index]\n", " X_train = target_sampler(X_train, y_train,\n", " mininc=mininc,\n", " onlyknown=onlyknown,\n", " ukcode=ukcode)\n", " \n", " # We sample `y_train` again to mirror the previous X transformations\n", " y_train = y_train[X_train.index]\n", " \n", " return X_train, y_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameters I set (based on the paper):\n", "- Only samples with known labels\n", "- Only labels with at least 2 occurrences" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [], "source": [ "X_train, y_train = preproc(\n", " X_train, y_train, \n", " onlyknown=True,\n", " ukcode=ukcode,\n", " mininc=2,\n", " maxna=0.05, \n", " nadrop=True # They do not mention handling missing values, so I simply drop the samples\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting the proposed columns" ] }, { "cell_type": "code", "execution_count": 151, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['iyear', 'imonth', 'iday', 'extended', 'provstate', 'city',\n", " 'attacktype1_txt', 'targtype1_txt', 'nperps', 'weaptype1_txt', 'nkill',\n", " 'nwound', 'nhostkid'],\n", " dtype='object')\n" ] } ], "source": [ "X_train = colsamp.fit_transform(X_train)\n", "print(X_train.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here I connected some of the steps into a pipeline:\n", "- selecting columns\n", "- feature extraction\n", "- datatype conversion\n", "- encoding the categorical 
attributes" ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "scrolled": true }, "outputs": [], "source": [ "preppl = Pipeline([\n", " ('selcol', colsamp),\n", " ('nafcol', naff_col),\n", " ('dtypeconv', dtconv),\n", " ('attencode', codeX),\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running the pipeline" ] }, { "cell_type": "code", "execution_count": 153, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "objcidx: Index(['provstate', 'city', 'attacktype1_txt', 'targtype1_txt',\n", " 'weaptype1_txt', 'naffect'],\n", " dtype='object')\n", "unqrat: iyear 0.076923\n", "imonth 0.028846\n", "iday 0.074519\n", "extended 0.004808\n", "provstate 0.064904\n", "city 0.737981\n", "attacktype1_txt 0.019231\n", "targtype1_txt 0.033654\n", "nperps 0.002404\n", "weaptype1_txt 0.014423\n", "naffect 0.009615\n", "dtype: float64\n", "lehalfun: Index(['iyear', 'imonth', 'iday', 'extended', 'provstate', 'attacktype1_txt',\n", " 'targtype1_txt', 'nperps', 'weaptype1_txt', 'naffect'],\n", " dtype='object')\n" ] } ], "source": [ "X_train = preppl.fit_transform(X_train) #, y_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have almost as many features as samples." ] }, { "cell_type": "code", "execution_count": 154, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(416, 371) \n", "(416,) \n" ] } ], "source": [ "print(X_train.shape, type(X_train))\n", "print(y_train.shape, type(y_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need a copy of the column names so that we can select the same columns for `X_test`." ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [], "source": [ "Xtrcols = X_train.columns\n", "X_train = csr_matrix(X_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We save a version of the labels before the rebalancing" ] }, 
{ "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [], "source": [ "y_train_ = y_train.copy()" ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.09 s, sys: 1.7 s, total: 8.79 s\n", "Wall time: 13.8 s\n" ] } ], "source": [ "%%time\n", "X_train, y_train = smote.fit_sample(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(4305, 371) \n", "(4305,) \n" ] } ], "source": [ "print(X_train.shape, type(X_train))\n", "print(y_train.shape, type(y_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Trial cross validation" ] }, { "cell_type": "code", "execution_count": 160, "metadata": { "_uuid": "e37ca379fad0cb8cf46233624145356175cc310b" }, "outputs": [], "source": [ "def eval_models(models, X, y, kfold):\n", " \"\"\"Evaluates the selected models' predictive power with cross-validation on the training data.\n", " \n", " (This is a function I created earlier so it is not fully adapted for this particular project.)\n", " \n", " Takes\n", " models: Dictionary of \"model_name\": model() pairs.\n", " X: predictor attributes\n", " y: target attribute\n", " kfold: the cross-validation splitter\n", " \n", " Does not return anything. It computes the metrics set in the function, currently:\n", " - accuracy (printed)\n", " - micro precision\n", " - micro recall\n", " - f1 micro (printed)\n", " \"\"\"\n", " \n", " results = []\n", " \n", " # Iterates over the dictionary of models and runs each of them\n", " for model in models:\n", " print(\"Running {}...\".format(model))\n", " #start = time.time()\n", " if model == \"Naive Bayes\":\n", " X = X.toarray()\n", "\n", " \n", " # Metrics it calculates\n", " model_score = cross_validate(\n", " models[model], X, y,\n", " scoring=['accuracy',\n", " 'precision_micro',\n", " 'recall_micro',\n", " 'f1_micro'],\n", " cv=kfold, n_jobs=-1, verbose=0,\n", " 
return_train_score=False\n", " )\n", "\n", " acc_mean = model_score['test_accuracy'].mean()\n", " acc_std = model_score['test_accuracy'].std()\n", " #auc_mean = model_score['test_roc_auc'].mean()\n", " #auc_std = model_score['test_roc_auc'].std()\n", "\n", " print(\"\\n{}:\\n\\tAccuracy: {} ({})\".format(model, acc_mean, acc_std)) #auc_std\n", " #print(\"\\tROC AUC: {} ({})\".format(auc_mean, auc_std))\n", "\n", " precision_micro_mean = model_score['test_precision_micro'].mean()\n", " precision_micro_std = model_score['test_precision_micro'].std()\n", " recall_micro_mean = model_score['test_recall_micro'].mean()\n", " recall_micro_std = model_score['test_recall_micro'].std()\n", "\n", " f1_micro_mean = model_score['test_f1_micro'].mean()\n", " f1_micro_std = model_score['test_f1_micro'].std()\n", " print(\"\\tF1 micro: {} ({})\".format(f1_micro_mean, f1_micro_std))\n", "\n", " dur = model_score['fit_time'].sum() + model_score['score_time'].sum()\n", "\n", " print(\"\\tduration: {}\\n\".format(dur))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I used\n", "1. the models this paper reports using (Decision Tree, KNN, Naive Bayes),\n", "2. the models of the other paper (Linear SVC, Random Forest),\n", "3. and others that I expected to run relatively fast and to give good results without much parameter tuning (Extra Trees and Logistic Regression)." 
] }, { "cell_type": "code", "execution_count": 161, "metadata": { "_uuid": "4f15cac0b3a6ddd2094b6e689bdb84deaa2c5e95" }, "outputs": [], "source": [ "n_jobs=-1\n", "kfold = StratifiedKFold(n_splits=3, random_state=seed)\n", "\n", "models = {\n", " \"Naive Bayes\": GaussianNB(),\n", " \"Decisiong Tree Classifier\": DecisionTreeClassifier(),\n", " \"Random Forest\": RandomForestClassifier(),\n", " \"K-Neighbors Classifier\": KNeighborsClassifier(),\n", " \"Logistic Regression\": LogisticRegression(),\n", " \"Extra Tree\": ExtraTreesClassifier(),\n", " \"Linear SVC\": LinearSVC(),\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The authors reported above 98% accuracy and ROC AUC for the J48 decision tree, around 88%/98% for Naive Bayes, and 95%/98% for IBk, a KNN-based algorithm, for a similarly oversampled case.\n", "\n", "Here I roughly recreated this result, and got similar values for the Random Forest and the Extra Trees classifiers as well. I also wanted to compute ROC AUC values, but scikit-learn's `roc_auc` scorer does not support multi-class problems by default, so for now plain accuracy did the job.\n", "\n", "Nonetheless, the cross-validation happened only on the training data, and we still need to evaluate the models on the test samples." 
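, "\n",
"\n",
"A minimal sketch of one way to get a multi-class ROC AUC anyway, assuming a classifier with `predict_proba`: binarize the labels one-vs-rest and average the per-class AUC values. The toy data from `make_classification` is hypothetical, and every class has to occur in the evaluated sample, which is not guaranteed for the rare groups in this dataset.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import roc_auc_score\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import label_binarize\n",
"\n",
"# Hypothetical three-class toy data standing in for the encoded incidents\n",
"X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=6,\n",
"                                   n_classes=3, random_state=17)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,\n",
"                                          stratify=y_toy, random_state=17)\n",
"\n",
"clf = RandomForestClassifier(n_estimators=50, random_state=17).fit(X_tr, y_tr)\n",
"proba = clf.predict_proba(X_te)                      # one probability column per class\n",
"y_bin = label_binarize(y_te, classes=clf.classes_)   # one indicator column per class\n",
"\n",
"# Macro average of the one-vs-rest AUC values\n",
"auc_macro = roc_auc_score(y_bin, proba, average='macro')\n",
"print(round(auc_macro, 3))\n",
"```\n",
"\n",
"Whether a macro or a weighted average is more appropriate depends on how much the rare groups should count."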
] }, { "cell_type": "code", "execution_count": 162, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running Naive Bayes...\n", "\n", "Naive Bayes:\n", "\tAccuracy: 0.8977932636469221 (0.0617589197970132)\n", "\tF1 micro: 0.8977932636469221 (0.061758919797013256)\n", "\tduration: 8.364598989486694\n", "\n", "Running Decisiong Tree Classifier...\n", "\n", "Decisiong Tree Classifier:\n", "\tAccuracy: 0.9463414634146341 (0.025145112965391927)\n", "\tF1 micro: 0.9463414634146341 (0.025145112965391927)\n", "\tduration: 0.6489729881286621\n", "\n", "Running Random Forest...\n", "\n", "Random Forest:\n", "\tAccuracy: 0.9572590011614402 (0.030900421181539423)\n", "\tF1 micro: 0.9572590011614402 (0.030900421181539375)\n", "\tduration: 0.6815705299377441\n", "\n", "Running K-Neighbors Classifier...\n", "\n", "K-Neighbors Classifier:\n", "\tAccuracy: 0.8292682926829268 (0.02645025436250077)\n", "\tF1 micro: 0.8292682926829268 (0.02645025436250077)\n", "\tduration: 18.56847333908081\n", "\n", "Running Logistic Regression...\n", "\n", "Logistic Regression:\n", "\tAccuracy: 0.856678281068525 (0.0656590807745004)\n", "\tF1 micro: 0.856678281068525 (0.0656590807745004)\n", "\tduration: 17.121886253356934\n", "\n", "Running Extra Tree...\n", "\n", "Extra Tree:\n", "\tAccuracy: 0.9384436701509872 (0.03236898583618218)\n", "\tF1 micro: 0.9384436701509872 (0.03236898583618218)\n", "\tduration: 1.4884719848632812\n", "\n", "Running Linear SVC...\n", "\n", "Linear SVC:\n", "\tAccuracy: 0.6957026713124274 (0.06748277323213307)\n", "\tF1 micro: 0.6957026713124274 (0.06748277323213307)\n", "\tduration: 12.885974407196045\n", "\n", "CPU times: user 1.2 s, sys: 642 ms, total: 1.84 s\n", "Wall time: 38.8 s\n" ] } ], "source": [ "%%time\n", "eval_models(models, X_train, y_train, kfold)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transforming the test sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"We will apply the following transformations on `X_test`:\n", "- Sampling and transforming columns:\n", "    - The model needs the same features for prediction as for training, so the test set must have the same features as the training set\n", "    - The feature selection criteria were independent of the training itself\n", "- Data type conversion: does not affect values, but might be good for performance\n", "- Encoding: also required so that the test set has the same features" ] }, { "cell_type": "code", "execution_count": 163, "metadata": { "scrolled": true }, "outputs": [], "source": [ "X_test = colsamp.fit_transform(X_test)" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [], "source": [ "X_test = preppl.fit_transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we did not filter `X_test` as strictly as the training set, it contains values that the training set does not have and, therefore, after the category encoding, columns that are missing from the training set. Only 40% of the test columns also appear among the training columns." ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.4" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum([col in Xtrcols.values for col in X_test.columns]) / len(X_test.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Furthermore, less than 30% of the training columns are present in the test set." 
] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.2587601078167116" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum([col in X_test.columns.values for col in Xtrcols.values]) / len(Xtrcols.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We check whether reindexing really gives the intended result: the training columns missing from the test set should show up as all-NaN columns." ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "275\n", "275\n" ] } ], "source": [ "# Original number of columns consisting only of np.NaN values\n", "print(X_test.isna().all().sum())\n", "\n", "# The number of all-NaN columns after reindexing\n", "print(X_test.reindex(columns=Xtrcols).isna().all().sum())\n", "\n", "# This is the same as the number of training columns missing from the test set\n", "print(sum([col not in X_test.columns.values for col in Xtrcols.values]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We reindex the test sample based on the training sample's column indexes." 
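, "\n",
"\n",
"A minimal sketch of this alignment on hypothetical toy data. It uses the `fill_value` argument of `reindex`, so that dummy columns missing from the test set are created as 0 directly, and columns that only exist in the test set are dropped.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy samples: the test set has a city the training set never saw\n",
"train_raw = pd.DataFrame({'city': ['Delhi', 'Imphal'], 'nkill': [1, 0]})\n",
"test_raw = pd.DataFrame({'city': ['Imphal', 'Srinagar'], 'nkill': [2, 3]})\n",
"\n",
"train_enc = pd.get_dummies(train_raw)\n",
"test_enc = pd.get_dummies(test_raw)\n",
"\n",
"# Keep exactly the training columns, in the training order\n",
"test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)\n",
"print(list(test_aligned.columns))\n",
"```\n",
"\n",
"The unseen city simply disappears from the features, so such a sample carries no location information for the model."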
] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(206, 371) \n" ] } ], "source": [ "X_test = X_test.reindex(columns=Xtrcols)\n", "print(X_test.shape, type(X_test))" ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "False\n" ] } ], "source": [ "# X_train does not contain np.NaN values\n", "print(np.isnan(X_train.toarray()).any())" ] }, { "cell_type": "code", "execution_count": 170, "metadata": {}, "outputs": [], "source": [ "# Because we did not have any NaN values before reindexing,\n", "# we can assume that the new columns stand for attribute values\n", "# the test samples do not have and, therefore, we can set them to 0.\n", "X_test[np.isnan(X_test)] = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I summarized the `X_test` transformation steps in a function." ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [], "source": [ "def X_test_trans(\n", " X_test,\n", " colidx,\n", " prep=True,\n", " traincols=True):\n", " \"\"\"Transforms the test samples.\n", " \n", " Takes\n", " - X_test: the sample to transform\n", " - colidx: the training set's column index\n", " - prep, boolean: if True, the test sample goes through the\n", " preprocessing pipeline (just like the training sample)\n", " - traincols, boolean: if True, it reindexes the columns \n", " based on the training sample\n", " \n", " Returns the transformed test sample.\n", " \n", " \"\"\"\n", " \n", " if prep:\n", " X_test = preppl.fit_transform(X_test)\n", " \n", " \n", " if traincols:\n", " X_test = X_test.reindex(columns=colidx)\n", " X_test[np.isnan(X_test)] = 0\n", " \n", " \n", " return X_test" ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [], "source": [ "feat_trans = FunctionTransformer(X_test_trans,\n", " validate=False,\n", " kw_args={'prep': True,\n", " 
'traincols': True})" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "9aaddc8e0e8501c39452e21bc78834990f0cac05" }, "source": [ "## Modeling on the held-out test sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can check whether we get similarly good results as in the cross-validation case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We make sure the indexes of the test datasets are still the same." ] }, { "cell_type": "code", "execution_count": 174, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 174, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(X_test.index == y_test.index).all()" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4305, 371)" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape" ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(206,)" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test.shape" ] }, { "cell_type": "code", "execution_count": 177, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "(206, 371)" ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I created another function to run the models on the test sample as well. Although neither paper mentions normalization or standardization, I built them into this function for some of the models." 
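, "\n",
"\n",
"A minimal sketch of the scaling step on hypothetical toy arrays: the scaler is fitted on the training sample only and then applied to both samples, so no information from the test sample enters the fitted parameters.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical toy features; the first test row equals the training mean\n",
"X_tr_toy = np.array([[0., 10.], [2., 30.], [4., 50.]])\n",
"X_te_toy = np.array([[2., 30.], [6., 70.]])\n",
"\n",
"scaler = StandardScaler().fit(X_tr_toy)   # the mean and std come from the training rows only\n",
"X_tr_scaled = scaler.transform(X_tr_toy)\n",
"X_te_scaled = scaler.transform(X_te_toy)  # the test rows reuse the training statistics\n",
"\n",
"print(X_te_scaled)\n",
"```\n",
"\n",
"The scaled training columns have zero mean, while the test columns generally do not, which is expected."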
] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [], "source": [ "def mtest(models, X_train, y_train, X_test, y_test):\n", " \"\"\"\n", " Runs the models on the test sample and provides their accuracy.\n", " \n", " Takes\n", " - models: a dict of {'model name': model function} pairs\n", " - X_train, y_train\n", " - X_test, y_test\n", " \n", " Does not return anything but prints the accuracy of the predictions.\n", " \n", " \"\"\"\n", " # There will be lots of silly back-and-forth transformations\n", " xtr = csr_matrix(X_train).copy()\n", " xte = csr_matrix(X_test).copy()\n", " \n", " for model in models:\n", " X_train = xtr.copy()\n", " X_test = xte.copy()\n", " \n", " if model in [\"Linear SVC\",\n", " \"K-Neighbors Classifier\",\n", " \"Logistic Regression\"]:\n", " X_train = X_train.toarray()\n", " X_test = X_test.toarray()\n", " scaler = StandardScaler().fit(X_train)\n", " X_train = scaler.transform(X_train)\n", " X_test = scaler.transform(X_test)\n", " \n", "\n", " if model in [\"Naive Bayes\"]:\n", " X_train = X_train.toarray()\n", " X_test = X_test.toarray()\n", " \n", " norm = Normalizer().fit(X_train)\n", " X_train = norm.transform(X_train)\n", " X_test = norm.transform(X_test)\n", " \n", " models[model].fit(X_train, y_train)\n", "\n", " preds = models[model].predict(X_test)\n", " print(\"\\n{}: {}\".format(model, (preds == y_test).mean()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compared to the cross-validation case, only a fraction of the predictions were correct on the test sample, around 30% at best. Although my recreation of the model is probably not entirely exact, I arrived at very similar results in the cross-validation case, which suggests that the original authors did not test their model on a held-out sample." 
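, "\n",
"\n",
"A toy sketch of one effect that might contribute to this gap (an assumption on my side, the papers do not discuss it): when the rebalancing happens before the cross-validation, near-copies of the same sample can land in both the training and the validation folds. Here row duplication is a crude stand-in for SMOTE's synthetic neighbours, and the data is hypothetical pure noise, so any high score is an artefact.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"rng = np.random.RandomState(17)\n",
"# Hypothetical pure-noise data: the labels are unrelated to the features\n",
"X_noise = rng.normal(size=(200, 5))\n",
"y_noise = rng.randint(0, 4, size=200)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_noise, y_noise, test_size=0.25,\n",
"                                          random_state=17)\n",
"\n",
"# Crude oversampling before the cross-validation: every training row is copied five times\n",
"X_over = np.repeat(X_tr, 5, axis=0)\n",
"y_over = np.repeat(y_tr, 5)\n",
"\n",
"clf = KNeighborsClassifier(n_neighbors=1)\n",
"cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)\n",
"cv_acc = cross_val_score(clf, X_over, y_over, cv=cv).mean()\n",
"test_acc = clf.fit(X_over, y_over).score(X_te, y_te)\n",
"print(round(cv_acc, 3), round(test_acc, 3))\n",
"```\n",
"\n",
"The cross-validated accuracy comes out high although there is nothing to learn, while the held-out accuracy stays near chance. Resampling inside each training fold would avoid this."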
] }, { "cell_type": "code", "execution_count": 180, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Naive Bayes: 0.18932038834951456\n", "\n", "Decisiong Tree Classifier: 0.30097087378640774\n", "\n", "Random Forest: 0.2766990291262136\n", "\n", "K-Neighbors Classifier: 0.12135922330097088\n", "\n", "Logistic Regression: 0.04854368932038835\n", "\n", "Extra Tree: 0.27184466019417475\n", "\n", "Linear SVC: 0.043689320388349516\n", "CPU times: user 2min 20s, sys: 93.1 ms, total: 2min 20s\n", "Wall time: 2min 20s\n" ] } ], "source": [ "%%time\n", "mtest(models, X_train, y_train, X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perhaps the most important reason for this low performance is that the included labels are very different in the training and the test sets because the testing dataset \n", "- also includes 'Unknown' labels, which the training dataset did not and\n", "- it also includes labels with single occurrences\n", "\n", "(By the way, in my previous attempt on this problem, I also filtered my data in a very similar way)." ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Communist Party of India - Maoist (CPI-Maoist) 105\n", "Maoists 70\n", "Sikh Extremists 56\n", "United Liberation Front of Assam (ULFA) 27\n", "Lashkar-e-Taiba (LeT) 13\n", "dtype: int64\n", "\n", "\n", "Unknown 82\n", "Communist Party of India - Maoist (CPI-Maoist) 28\n", "Maoists 21\n", "National Democratic Front of Bodoland (NDFB) 10\n", "Sikh Extremists 10\n", "dtype: int64\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. 
Use `array.size > 0` to check that an array is not empty.\n", " if diff:\n", "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.\n", " if diff:\n" ] } ], "source": [ "y_tr = lenc(y_train_, LEC, inv=True)\n", "y_te = lenc(y_test, LEC, inv=True)\n", "\n", "print(y_tr.value_counts().head()) # Training labels\n", "print(\"\\n\")\n", "print(y_te.value_counts().head()) # Testing labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the point where concrete knowledge of the actual issue a modeling project tries to address is crucial, because this is what tells us whether we should include this or that type of information in the model.\n", "\n", "In my understanding, one-off incidents and incidents whose perpetrator remains unknown will stay a relevant issue in the future and, therefore, models should be tested on them.\n", "\n", "Rethinking the problem might also involve not focusing on accuracy, but rather on specificity, precision or another metric.\n", "\n", "Now, however, I move on to the other paper." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Recreating the KAnOE paper" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this paper the authors used Factor Analysis to select attributes for the model, imputed missing values, and used the data between 1990 and 2014 to predict perpetrators of the 2015 incidents. They reported a 73.2% accuracy with SVM. Here I mostly follow them, although I do not use Factor Analysis but simply train the models on the features they found to be useful.\n", "\n", "They did not mention any rebalancing method, but I needed more samples of the rarest labels, so I used random oversampling to increase the number of the less frequent labels."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point I was relatively confused about how the different parts belong together, and because I continually introduced new elements into the workflow, I needed relatively easy access to the 'whole picture'. For this reason (and quite contrary to how I approached the problem before), I did not try to create a 'mega-function' (besides `preproc`) but put most of the elements into a single cell and edited them 'in plain sight'. The cell splitting and merging functions were particularly useful in this respect.\n", "\n", "I think that at least in this kind of exploratory stage this way of writing can work relatively well." ] }, { "cell_type": "code", "execution_count": 189, "metadata": { "raw_mimetype": "text/x-python", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1077, 12) \n", "(1077,) \n", "(94, 12) \n", "(94,) \n" ] } ], "source": [ "# I went back to simple functions because at the building stage they\n", "# gave me a better overview of what I was doing\n", "\n", "\n", "# Data transformations before the test-train split\n", "data, label = preproc(\n", "    gtd.drop(columns='gname').copy(),\n", "    gtd.gname.copy(),\n", "\n", "    startdate=1970,\n", "    enddate=2015,  # They used data only up to 2015\n", "    country='India',\n", ")\n", "\n", "\n", "labels, LEC, ukcode = lenc(label)\n", "\n", "##############################\n", "\n", "# They did not use a random split, but trained on data up to 2014 and tested on the 2015 values\n", "X_train = data[data.iyear <= 2014]\n", "y_train = labels.reindex(X_train.index)\n", "X_test = data[data.iyear == 2015]\n", "y_test = labels.reindex(X_test.index)\n", "\n", "##############################\n", "\n", "# Feature selection\n", "# They used a slightly different column list\n", "X_train = col_sel(X_train, kanoe_cols) # Column selection\n", "X_test = col_sel(X_test, kanoe_cols)\n", "\n",
"##############################\n", "\n", "# Transformation involving both the samples and labels of the training set.\n", "# Here I dropped some columns with many missing values and imputed the rest with the mean.\n", "X_train, y_train = preproc(\n", "    X_train,\n", "    y_train,\n", "    impute=True,\n", "    # They do not mention a minimum label frequency (or rebalancing),\n", "    # but I set it to 3 because the cross-validation needs at least\n", "    # 3 samples per label.\n", "    mininc=3\n", ")\n", "\n", "##############################\n", "\n", "# X_train transformation\n", "Xtrcols = X_train.columns # Saving X_train columns\n", "X_train = dtypeconv(X_train, auto=True) # Datatype conversion\n", "\n", "X_train = csr_matrix(X_train) # X_train into sparse\n", "\n", "##############################\n", "\n", "# X_test transformations\n", "# Very similar to the X_train transformations, or they take information from them.\n", "# I could probably have wrapped them together.\n", "\n", "X_test = nanimputer(X_test) # Imputing values\n", "X_test = X_test.reindex(columns=Xtrcols) # Selecting the same columns as X_train\n", "\n", "X_test = dtypeconv(X_test, auto=True) # Datatype conversion\n", "X_test = csr_matrix(X_test) # Transformation to sparse\n", "\n", "##############################\n", "\n", "# Rebalancing\n", "# While they do not mention rebalancing, we need to have more than one sample\n", "# from each label for the models to run so we do a 'minority' rebalancing\n", "y_train_ = y_train.copy() # Saving label freqs before rebalancing\n", "\n", "ros = RandomOverSampler(ratio='minority') # Applying Random Oversampler\n", "X_train, y_train = ros.fit_sample(X_train, y_train)\n", "\n", "print(X_train.shape, type(X_train))\n", "print(y_train.shape, type(y_train))\n", "print(X_test.shape, type(X_test))\n", "print(y_test.shape, type(y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case our cross-validated results are not as high as in the previous case, but the setup still
achieves relatively good results. Because there are multiple differences compared to the previous setting, it is not straightforward to tell what causes the gap, but the introduction of 'Unknown' labels is possibly a very important factor." ] }, { "cell_type": "code", "execution_count": 190, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running Naive Bayes...\n", "\n", "Naive Bayes:\n", "\tAccuracy: 0.4608877056964489 (0.03236125888437632)\n", "\tF1 micro: 0.4608877056964489 (0.03236125888437632)\n", "\tduration: 0.10384821891784668\n", "\n", "Running Decisiong Tree Classifier...\n", "\n", "Decisiong Tree Classifier:\n", "\tAccuracy: 0.8293019164603864 (0.008553084909194698)\n", "\tF1 micro: 0.8293019164603864 (0.008553084909194698)\n", "\tduration: 0.027477741241455078\n", "\n", "Running Random Forest...\n", "\n", "Random Forest:\n", "\tAccuracy: 0.8496933040648887 (0.00654365178762912)\n", "\tF1 micro: 0.8496933040648887 (0.006543651787629167)\n", "\tduration: 0.13256359100341797\n", "\n", "Running K-Neighbors Classifier...\n", "\n", "K-Neighbors Classifier:\n", "\tAccuracy: 0.6612753569037721 (0.0150192672781113)\n", "\tF1 micro: 0.6612753569037721 (0.0150192672781113)\n", "\tduration: 0.12248086929321289\n", "\n", "Running Logistic Regression...\n", "\n", "Logistic Regression:\n", "\tAccuracy: 0.5935161832702817 (0.016656479603549018)\n", "\tF1 micro: 0.5935161832702817 (0.016656479603549018)\n", "\tduration: 1.51766037940979\n", "\n", "Running Extra Tree...\n", "\n", "Extra Tree:\n", "\tAccuracy: 0.832999003627419 (0.007615728717504401)\n", "\tF1 micro: 0.832999003627419 (0.007615728717504439)\n", "\tduration: 0.19862985610961914\n", "\n", "Running Linear SVC...\n", "\n", "Linear SVC:\n", "\tAccuracy: 0.40767323649837306 (0.1150378406716749)\n", "\tF1 micro: 0.40767323649837306 (0.1150378406716749)\n", "\tduration: 6.948620557785034\n", "\n", "CPU times: user 624 ms, sys: 391 ms, total: 1.01 s\n",
"Wall time: 7.39 s\n" ] } ], "source": [ "%%time\n", "eval_models(models, X_train, y_train, kfold)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally I also ran the models on the test data. This time the test accuracies are much closer to the cross-validated results and, in the case of the Linear SVC, even surpass them. Interestingly, the paper reports lower accuracy for its tree-based models and found SVC to be the most effective predictor, which is the opposite of what I got here.\n", "Despite this, these results give a more consistent impression of the performance of this model/preprocessing combination (compared to the previous one). But, again, these findings are based on accuracy alone and should therefore be complemented with other metrics, and we should examine more closely 'where the predictions go'." ] }, { "cell_type": "code", "execution_count": 188, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Naive Bayes: 0.35106382978723405\n", "\n", "Decisiong Tree Classifier: 0.6914893617021277\n", "\n", "Random Forest: 0.7659574468085106\n", "\n", "K-Neighbors Classifier: 0.5531914893617021\n", "\n", "Logistic Regression: 0.6276595744680851\n", "\n", "Extra Tree: 0.6702127659574468\n", "\n", "Linear SVC: 0.6276595744680851\n", "CPU times: user 1.98 s, sys: 19.4 ms, total: 2 s\n", "Wall time: 1.86 s\n" ] } ], "source": [ "%%time\n", "mtest(models, X_train, y_train, X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Automatic feature selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the last round I wanted to quickly try out whether using automatic feature selection and also leaving in all the unknown and one-off terrorist incidents would improve the results."
] }, { "cell_type": "code", "execution_count": 104, "metadata": { "raw_mimetype": "text/x-python", "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(30906, 1429) \n", "(30906,) \n", "(216, 1429) \n", "(216,) \n" ] } ], "source": [ "# I went back to simple functions because at the building stage they\n", "# gave me a better overview of what I was doing\n", "\n", "\n", "# Data transformations before the test-train split\n", "# No transformation here, I only filter for Indian cases.\n", "\n", "data, label = preproc(\n", "    gtd.drop(columns='gname').copy(),\n", "    gtd.gname.copy(),\n", "    country='India',\n", ")\n", "\n", "labels, LEC, ukcode = lenc(label)\n", "\n", "##############################\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    data,\n", "    labels,\n", "    test_size=validation_size,\n", "    random_state=seed)\n", "\n", "##############################\n", "\n", "# Transformation involving both the training samples and labels\n", "X_train, y_train = preproc(\n", "    X_train, y_train,\n", "    maxna=0.1, # I drop columns with 90% or more missing values.\n", "    impute=True # Impute the rest with the mean.\n", ")\n", "\n", "##############################\n", "\n", "# X_train transformation\n", "X_train = catenc(X_train, sparse=False) # Encoding categoricals\n", "X_train = dtypeconv(X_train, auto=True) # Automatic datatype conversion\n", "\n", "Xtrcols = X_train.columns # Saving X_train columns\n", "X_train = csr_matrix(X_train) # X_train into sparse\n", "\n", "##############################\n", "\n", "# X_test transformations\n", "\n", "X_test = nanimputer(X_test) # Imputing nan values\n", "X_test = catenc(X_test, sparse=False) # Categorical encoding\n", "\n", "X_test = X_test.reindex(columns=Xtrcols) # Selecting the same columns as X_train\n", "X_test[X_test.isna()] = 0 # Filling the NaN values that appear because of the reindexing with zeros\n", "\n", "X_test = dtypeconv(X_test, auto=True) # Automatic datatype
conversion\n", "\n", "Xtecols = X_test.columns # Saving X_test columns\n", "X_test = csr_matrix(X_test) # Transforming into sparse form\n", "\n", "##############################\n", "\n", "# Rebalancing\n", "# I will use random oversampling for every class ('all')\n", "y_train_ = y_train.copy() # Saving label freqs before rebalancing\n", "\n", "ros = RandomOverSampler(ratio='all') # Applying Random Oversampler\n", "X_train, y_train = ros.fit_sample(X_train, y_train)\n", "\n", "print(X_train.shape, type(X_train))\n", "print(y_train.shape, type(y_train))\n", "print(X_test.shape, type(X_test))\n", "print(y_test.shape, type(y_test))" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [], "source": [ "def fsel(X, y, test_data, model, method='univar', k=5):\n", "    \"\"\"Selects features from the dataset based on various methods.\n", "    Parameters\n", "    X: DataFrame\n", "        The predictor attributes\n", "\n", "    y: DataFrame or Series\n", "        The label attribute to predict\n", "\n", "    test_data: DataFrame\n", "        The test data\n", "\n", "    model\n", "        Predictor model class, e.g. RandomForestClassifier\n", "        (applicable for 'rfe' and 'fimp')\n", "\n", "    method='univar'\n", "        The method to identify the selected features:\n", "        'univar': Univariate feature selection based on chi-squared test.\n", "        'rfe': Recursive Feature Elimination\n", "        'pca': Principal Component Analysis\n", "        'fimp': Feature importance\n", "\n", "    k=5\n", "        Depending on the chosen method:\n", "        'univar', 'rfe', 'fimp': The number of best features selected.\n", "        'pca': The number of components to create from the attributes.\n", "\n", "    Returns\n", "        The transformed training (X) and test datasets (with the best attributes or with the new components).\n", "    \"\"\"\n", "    start = time.time()\n", "\n", "    # Univariate chi2\n", "    if method == 'univar':\n", "        fsel_mod = SelectKBest(score_func=chi2, k=k)\n", "        fsel_test = fsel_mod.fit(X, y)\n", "\n", "        #print(\"Feature selection test
scores:{}\".format(fsel_test.scores_))\n", "        features = fsel_test.transform(X)\n", "        X_train = features\n", "        test = fsel_test.transform(test_data)\n", "\n", "        fnames = pd.DataFrame(data={'attribute': X.columns,\n", "                                    'chi2': fsel_test.scores_}) \\\n", "            .sort_values(by='chi2', ascending=False) \\\n", "            .head(k).attribute.values\n", "\n", "    # Recursive feature elimination\n", "    elif method == 'rfe':\n", "        fsel_mod = RFE(model(verbose=1, n_jobs=-1), n_features_to_select=k)\n", "        fsel_test = fsel_mod.fit(X, y)\n", "        fnames = X.columns[fsel_test.support_]\n", "        X_train = X.loc[:, fnames].values\n", "        test = fsel_test.transform(test_data)\n", "\n", "    # Principal component analysis\n", "    elif method == 'pca':\n", "        fsel_mod = PCA(n_components=k)\n", "        fsel_test = fsel_mod.fit(X)\n", "        print(\"Explained Variance:{}\".format(fsel_test.explained_variance_ratio_))\n", "\n", "        X_train = fsel_test.transform(X)\n", "        test = fsel_test.transform(test_data)\n", "\n", "    # Feature importance\n", "    elif method == 'fimp':\n", "        fsel_mod = model()\n", "        fsel_test = fsel_mod.fit(X, y)\n", "        fnames = pd.DataFrame(data={'attribute': X.columns,\n", "                                    'fimp': fsel_test.feature_importances_}) \\\n", "            .sort_values(by='fimp', ascending=False) \\\n", "            .head(k).attribute.values\n", "\n", "        X_train = X.loc[:, fnames].values\n", "        test = test_data.loc[:, fnames].values\n", "\n", "    else:\n", "        # Without this the function would fail later on the undefined return values\n", "        raise ValueError(\"The {} method does not exist!\".format(method))\n", "\n", "    end = time.time()\n", "    #print(\"{} seconds\".format(end - start))\n", "\n", "    return X_train, test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here I loop over the models to try out the feature extraction methods on each."
] }, { "cell_type": "code", "execution_count": 106, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "**********\n", "\n", "Model: Naive Bayes\n", "\n", "Method: univar\n", "Before: 0.4212962962962963\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/feature_selection/univariate_selection.py:166: RuntimeWarning: divide by zero encountered in true_divide\n", " chisq /= f_exp\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "After: 0.41203703703703703\n", "\n", "Method: rfe\n", "Before: 0.4212962962962963\n", "After: 0.41203703703703703\n", "\n", "Method: pca\n", "Before: 0.4212962962962963\n", "After: 0.41203703703703703\n", "\n", "Method: fimp\n", "Before: 0.4212962962962963\n", "After: 0.41203703703703703\n", "**********\n", "\n", "Model: Decisiong Tree Classifier\n", "\n", "Method: univar\n", "Before: 0.4861111111111111\n", "After: 0.4722222222222222\n", "\n", "Method: rfe\n", "Before: 0.4861111111111111\n", "After: 0.49074074074074076\n", "\n", "Method: pca\n", "Before: 0.5092592592592593\n", "After: 0.5\n", "\n", "Method: fimp\n", "Before: 0.49537037037037035\n", "After: 0.48148148148148145\n", "**********\n", "\n", "Model: Random Forest\n", "\n", "Method: univar\n", "Before: 0.5462962962962963\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/feature_selection/univariate_selection.py:166: RuntimeWarning: divide by zero encountered in true_divide\n", " chisq /= f_exp\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "After: 0.5370370370370371\n", "\n", "Method: rfe\n", "Before: 0.5277777777777778\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/feature_selection/univariate_selection.py:166: RuntimeWarning: divide by zero encountered in true_divide\n", " chisq /= 
f_exp\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "After: 0.5648148148148148\n", "\n", "Method: pca\n", "Before: 0.5416666666666666\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/feature_selection/univariate_selection.py:166: RuntimeWarning: divide by zero encountered in true_divide\n", " chisq /= f_exp\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "After: 0.5370370370370371\n", "\n", "Method: fimp\n", "Before: 0.5370370370370371\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/feature_selection/univariate_selection.py:166: RuntimeWarning: divide by zero encountered in true_divide\n", " chisq /= f_exp\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "After: 0.5462962962962963\n", "**********\n", "\n", "Model: K-Neighbors Classifier\n", "\n", "Method: univar\n", "Before: 0.3333333333333333\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/feature_selection/univariate_selection.py:166: RuntimeWarning: divide by zero encountered in true_divide\n", " chisq /= f_exp\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "After: 0.3472222222222222\n", "\n", "Method: rfe\n", "Before: 0.3333333333333333\n", "After: 0.3472222222222222\n", "\n", "Method: pca\n", "Before: 0.3333333333333333\n", "After: 0.3472222222222222\n", "\n", "Method: fimp\n", "Before: 0.3333333333333333\n", "After: 0.3472222222222222\n", "**********\n", "\n", "Model: Logistic Regression\n", "\n", "Method: univar\n", "Before: 0.42592592592592593\n", "After: 0.44907407407407407\n", "\n", "Method: rfe\n", "Before: 0.42592592592592593\n", "After: 0.44907407407407407\n", "\n", "Method: pca\n", "Before: 0.42592592592592593\n", "After: 0.44907407407407407\n", "\n", "Method: fimp\n", "Before: 0.42592592592592593\n", "After: 
0.44907407407407407\n", "**********\n", "\n", "Model: Extra Tree\n", "\n", "Method: univar\n", "Before: 0.5324074074074074\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/feature_selection/univariate_selection.py:166: RuntimeWarning: divide by zero encountered in true_divide\n", " chisq /= f_exp\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "After: 0.5601851851851852\n", "\n", "Method: rfe\n", "Before: 0.5509259259259259\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/feature_selection/univariate_selection.py:166: RuntimeWarning: divide by zero encountered in true_divide\n", " chisq /= f_exp\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "After: 0.5462962962962963\n", "\n", "Method: pca\n", "Before: 0.5185185185185185\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/feature_selection/univariate_selection.py:166: RuntimeWarning: divide by zero encountered in true_divide\n", " chisq /= f_exp\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "After: 0.5462962962962963\n", "\n", "Method: fimp\n", "Before: 0.5601851851851852\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/feature_selection/univariate_selection.py:166: RuntimeWarning: divide by zero encountered in true_divide\n", " chisq /= f_exp\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "After: 0.5740740740740741\n", "**********\n", "\n", "Model: Linear SVC\n", "\n", "Method: univar\n", "Before: 0.06944444444444445\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/feature_selection/univariate_selection.py:166: RuntimeWarning: divide by zero encountered in true_divide\n", " chisq /= 
f_exp\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "After: 0.05092592592592592\n", "\n", "Method: rfe\n", "Before: 0.046296296296296294\n", "After: 0.046296296296296294\n", "\n", "Method: pca\n", "Before: 0.06944444444444445\n", "After: 0.023148148148148147\n", "\n", "Method: fimp\n", "Before: 0.018518518518518517\n", "After: 0.05555555555555555\n", "CPU times: user 1h 18min 12s, sys: 26.7 s, total: 1h 18min 39s\n", "Wall time: 1h 19min 9s\n" ] } ], "source": [ "%%time\n", "\n", "# In a not too sophisticated way this 'k' parameter intends to control the behavior of\n", "# all the feature extractors\n", "k = round(X_train.shape[1] * 0.5)\n", "fsemethods = ('univar', 'rfe', 'pca', 'fimp')\n", "\n", "\n", "for model in models: # Looping over models\n", "    print(\"*\" * 10)\n", "    print(\"\\nModel: {}\".format(model))\n", "\n", "    # For each model we try out all the feature extraction methods\n", "    for method in fsemethods:\n", "        time1 = time.time()\n", "        print(\"\\nMethod: {}\".format(method))\n", "\n", "        # Silly transformations all the way down..\n", "        fsXtr = pd.DataFrame(X_train.toarray(), columns=Xtrcols)\n", "        fsXte = pd.DataFrame(X_test.toarray(), columns=Xtecols)\n", "\n", "        if model != \"Naive Bayes\":\n", "            fsXtr = csr_matrix(fsXtr)\n", "            fsXte = csr_matrix(fsXte)\n", "\n", "        # Fitting the model on training set\n", "        models[model].fit(fsXtr, y_train)\n", "        preds = models[model].predict(fsXte)\n", "        print(\"Before: {}\".format((preds == y_test).mean()))\n", "\n", "        fsXtr = pd.DataFrame(X_train.toarray(), columns=Xtrcols)\n", "        fsXte = pd.DataFrame(X_test.toarray(), columns=Xtecols)\n", "\n", "        # Feature extraction\n", "        # NOTE: `method` is hard-coded to 'univar' in this call, so the loop\n", "        # variable is not used and every pass runs the univariate selector.\n", "        fsXtr, fsXte = fsel(fsXtr, y_train, fsXte, model, method='univar', k=k)\n", "\n", "        if model != \"Naive Bayes\":\n", "            fsXtr = csr_matrix(fsXtr)\n", "\n", "        models[model].fit(fsXtr, y_train)\n", "        preds = models[model].predict(fsXte)\n", "        print(\"After: {}\".format((preds == y_test).mean()))\n", "\n", "        #print(\"Duration: {} seconds\".format(time.time()-time1))\n", "        #eval_models(models, fsXtr, y_train, kfold)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I did not expect a miracle, but so far the selectors have worked relatively badly, even with very different parameters. The cell also produces several warning messages, so it still needs more work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook I tried to recreate two papers predicting terrorist groups in India with the data from the GTD.\n", "I managed to recreate their results only partially. For the CHRIST paper I found that they perhaps did not validate their model on test data; for the KAnOE paper I could not repeat their result with the SVC model but got better results with Decision Trees and Random Forest. I also tried to introduce automatic feature selection methods, but they were very ineffective so far." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Things I achieved\n", "- Almost recreated two papers and pointed out possible improvements to them\n", "- Thought through the workflow process more thoroughly and outlined it in text and graph\n", "- Started to implement a custom wrapper around scikit-learn pipeline steps\n", "- Fun fact: I read a book on multi-label classification for this challenge in the hope of putting it into practice\n", "\n", "\n", "## Things I learned\n", "(For this project, since my last submission)\n", "- Classes\n", "- sklearn CustomTransformation, Pipelines, FunctionTransformer\n", "- Rebalancing\n", "- Standardization and normalization\n", "- Data imputation\n", "- Datatypes and their conversion\n", "- LabelEncoder\n", "\n", "\n", "## Things I found hard or simply would need to learn more\n", "- Array mapping\n", "- Classes\n", "- Pipelines\n", "- Model interpretation and evaluation\n", "- Logical operations (e.g.
in conditionals)\n", "\n", "\n", "## Possible improvements of this notebook\n", "- Running the models with further metrics\n", "- Examining the places where the models do not fit\n", "- Hyper-parameter tuning\n", "- Actually building the pipeline and running grid searches\n", "\n", "\n", "## Possible bugs found while writing up\n", "- Imputer - dtypetrans interference (the mean produces fractions)\n", "- Target sampler: missing y from return\n", "- dtypetrans: the actual datatypes are not what they normally should be (e.g. fractional calendar day)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Appendix 1 - Data transformation steps [<](#ref1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Describing the steps of the workflow\n", "### 1. Filtering the whole dataset\n", "To what degree we want to do this really depends on the actual problem we want to solve, and it also leads to the question of who, in what context, and with what kind of available data will try to predict the group responsible for an incident.\n", "\n", "### 2. Encoding the target label\n", "We need to do this early on because the codes need to be consistent between the training and test labels, and we cannot transfer the codes from one to the other either, because they do not share all their values with each other.\n", "\n", "- We need to store the encoder model\n", "- Sometimes 'Unknown' value codes need to be stored before the transformation\n", "\n", "### 3. Train-test split\n", "### 4. Training set\n", "#### 4.1. Sampling based on attribute values\n", "This simply means that we train our model on only a subset of examples based on their values:\n", "- If a subset of samples mirrors the future instances 'better', relying only on it might produce a better model. We can also give more weight to the important attributes. For the GTD, using only data from the last round of data collection (e.g.
from the year 2015) might create better predictions because it reflects better data collection methods. This is the same logic as behind feature selection and extraction. However, when we do this based on actual values (e.g. year period, location, etc.) we change the number of samples (as opposed to column-wise transformations).\n", "\n", "Because this transformation changes the sample size, we also need to apply it to the training labels.\n", "\n", "#### 4.2. Selecting columns\n", "This is more straightforward than the previous case because dropping columns does not affect the y labels, although we might want to reindex the test samples for performance purposes.\n", "\n", "#### 4.3. Selecting samples based on label information\n", "This is more tricky than the previous sample selection.\n", "\n", "- Focusing only on groups responsible for more than one incident might be a viable option to emphasize their weight. On the other hand, we should not take the one-off groups out of the test data because 'in real life' we cannot know whether the next incident will be a one-off or not.\n", "- Similarly, taking out the unknown groups might help the model to recognise groups in cross-validation, but we definitely should not leave them out of the test set.\n", "\n", "These transformations require\n", "- information about the labels (frequency, the 'unknown' label), and\n", "- applying the transformation also to the training labels (or doing it there first, and then applying it to train_X).\n", "\n", "#### 4.4. Feature extraction\n", "In this context this is basically the same category as simple feature selection. It does not influence the labels, but we need to apply it also to the test attributes.\n", "\n", "#### 4.5. Datatype conversion\n", "This affects neither the labels nor the training data (if done well).\n", "\n", "#### 4.6. Encoding categorical attributes\n", "This does not affect the labels but we need\n", "1. to apply this also to the test data, and\n", "2. optionally (e.g.
for performance reasons), to select from the test data only those columns (now also representing actual values) which are in the training set.\n", "\n", "#### 4.7. Transforming attributes into sparse form\n", "This can help us to use different algorithms. In our case, without it we would not be able to run SMOTE because, by default, it eats up all the memory on a standard Kaggle notebook.\n", "\n", "#### 4.8. Rebalancing the data\n", "This both requires and transforms both the training attributes and labels, but should not affect test data.\n", "\n", "### 5. Training labels\n", "A summary of the already mentioned train_y transformations\n", "1. Sampling based on X_train attribute values\n", "2. Selecting samples based on label information\n", "3. Rebalancing together with X\n", "\n", "### 6. Test set\n", "While we have already mentioned the necessary steps, I also summarize them here:\n", "#### 6.1. Test attributes\n", "1. Selecting columns (requires information from X_train)\n", "2. Feature extraction\n", "3. Datatype conversion\n", "4. Encoding categoricals\n", "5. Filtering on the training attributes (requires information from X_train) and transforming NaN values into zeros\n", "\n", "#### 6.2 Test labels\n", "We do not need to do anything with test_y; only after the test might we inverse-transform it with the encoder we fitted on the whole dataset at the start." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The image below shows an outline of the transformations as I understood them at a particular time.
I marked the `train_X` transformations in bold because of their relatively central role in the workflow.\n", "\n", "![graph](img/graph.jpg)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }