{ "cells": [ { "cell_type": "markdown", "metadata": { "_uuid": "04018cedf8962b1b2da83298c3a0173026f15ae0" }, "source": [ "# Introduction\n", "\n", "This notebook is based on my [previous attempt with the GTD](GTD_1st_round.ipynb). At that time I explored the dataset and tried to start running ML models on it. In this round, by contrast, [I first tried to reconstruct two recent papers 'by hand'](gtd_pred_papers.ipynb) written on the same problem, that is, the prediction of terrorist groups based on the characteristics of previous incidents. While I managed to build out their preprocessing and modeling environment, I could not do it perfectly because\n", "1. The papers do not describe every single step in their workflow.\n", "2. Both groups of authors used WEKA with its own built-in algorithms (and did not publish their code). For instance, one of them used the J48 model, which is a Java implementation of C4.5 (while sklearn uses CART).\n", "\n", "After going through the papers, I started to [build out a pipeline](Y_trans_ppl.ipynb) which I hoped would allow me to turn various features on/off across the whole data processing/modeling workflow. One of my main reasons for that was to examine how the way the initial problem is framed relates to modelling.\n", "\n", "## Problem discussion\n", "To give an example, the problem of predicting perpetrator groups is understood in a number of different ways, mirroring the specific issue the modelers want to solve and the means they have for it. Both papers (and most of the earlier ones) describe cases where the problem was narrowed down to predicting Indian cases, usually with the general idea of helping to identify the groups behind an incident and eventually to arrest the perpetrators. This can imply many things (perhaps most importantly that there is an agent who can relatively quickly collect the same information a model is based on). Accordingly, the particular model setting needs to mirror this situation (i.e. 
what features do we allow into the model, how do we clean the data, where do we split, etc.). Finally, this is also related to those data leakage problems where models take into account 'information from the future'.\n", "\n", "To examine this, however, I wanted to create a workflow which allows me to look over the whole process and to turn particular features on/off in it. Unfortunately, scikit-learn's current Pipeline (as far as I understand) does not allow\n", "1. transformations based on label values,\n", "2. changing the sample size within the pipeline,\n", "3. building the train-test split into the pipeline (or maybe I just did not find out how to do it).\n", "\n", "To tackle this issue I created a CustomTransformation wrapper around the normal scikit-learn entities and hoped that I could pass label and other information between the steps, so that they would also allow me to include the above-mentioned features. This, however, proved to be harder than I expected (e.g. I had not used classes before), so after a couple of days I had to give up this project and focus on the actual prediction problem instead.\n", "\n", "## This notebook\n", "In this notebook I then connected the aims and some of the functions of the previous ones. Here, I continued to recreate the papers, but now I also tried to build out, if not a pipeline, at least a workflow to see what affects what. Even so, it was not entirely easy to understand which step should come before/after another, so at some point I summarized it both in text and in a graph, which you can find at the end of this notebook.\n", "\n", "Eventually, I ran a number of modelling scenarios which gave me some insight into the issues around the papers, although I did not go into hyperparameter selection, metric interpretation (or even the use of multiple metrics) or multilabel classification (which I had hoped to do in this round).\n", "\n", "## The papers\n", "### CHRIST\n", "> D. Talreja, J. Nagaraj, N. J. 
Varsha and K. Mahesh, \"Terrorism analytics: Learning to predict the perpetrator,\" 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Udupi, 2017, pp. 1723-1726. doi: 10.1109/ICACCI.2017.8126092\n", "\n", "### KAnOE\n", "> Varun Teja Gundabathula and V. Vaidhehi, An Efficient Modelling of Terrorist Groups in India using Machine Learning Algorithms, Indian Journal of Science and Technology, Vol 11(15), DOI: 10.17485/ijst/2018/v11i15/121766, April 2018" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "83836c143f355975927e87794ec4623269ecaca7" }, "source": [ "# Libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Versions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* pandas: 0.23.3\n", "* scikit-learn: 0.19.1\n", "* numpy: 1.15.0\n", "* scipy: 1.1.0\n", "* imblearn 0.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Warning control" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "warnings.filterwarnings(action='ignore', category=DeprecationWarning)\n", "warnings.filterwarnings(action='ignore', category=FutureWarning)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "036573d721c9a4b43850b3944165a1a1ebc03a96" }, "source": [ "## Data handling" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "_uuid": "c7f324db07bccfdbfd5c2f0022d31644e0ca7f4c", "scrolled": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "aa7b376e9a37065ae0b520e8955d5fec867d20d0" }, "source": [ "## Preprocessing" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "_uuid": "e3db1d2dd0e97a9f9b86a3815af976a4fcaec092" }, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder, FunctionTransformer, Imputer, StandardScaler, Normalizer\n", "from sklearn.feature_selection import SelectKBest, 
chi2, RFE\n", "from imblearn.over_sampling import SMOTE, RandomOverSampler\n", "from scipy.sparse import csr_matrix" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "3b0c71acff1f801439272bd69d640a99601033c4" }, "source": [ "## Pipeline" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "_uuid": "180fa7c8896feba5c6381cb2fd378cff5b96b023", "scrolled": true }, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_validate, \\\n", " StratifiedKFold" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "9a73192d0be85a86a0ebd3ee0caf51dc854eb98d" }, "source": [ "## Models" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "_uuid": "7f9e91269e4cc16d9a946d9770601127d8c2c138", "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. 
It will be removed in a future NumPy release.\n", " from numpy.core.umath_tests import inner1d\n" ] } ], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.svm import LinearSVC\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier\n", "from sklearn.naive_bayes import GaussianNB" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "95d3b74d5b7be112f3383dd7f32a252c85c59028" }, "source": [ "## Metrics" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "_uuid": "3bbbc2e9a6fa8c5422a1acad20c2e1faaa21c843" }, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score, precision_score, classification_report, roc_auc_score" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "3ba62c1eefd97e81969091a3b7a978285f09b7f9" }, "source": [ "## Utilities" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import time" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "_uuid": "d85a1e3f02039ff73e726b555f34e42beac11d1c", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: JOBLIB_TEMP_FOLDER=/tmp\n" ] } ], "source": [ "# This sets the temp folder and is required for \"n_jobs=-1\" to work on Kaggle in some cases\n", "# (see https://www.kaggle.com/getting-started/45288#292143)\n", "%env JOBLIB_TEMP_FOLDER=/tmp" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "86765176fbbaec74c331fac5e78d309086470267" }, "source": [ "# Loading the data" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "6715920cb093a258c58e2bed3586b984edbf13a4" }, "source": [ "Previously we loaded the data and created a sample out of it. From now on we are going to use it for our analysis and modeling."
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "_uuid": "501699cf01c63eca9d29d0f84e878d94cee9d5cc" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/andras/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (4,6,31,33,53,61,62,63,76,79,90,92,94,96,114,115,121) have mixed types. Specify dtype option on import or set low_memory=False.\n", " interactivity=interactivity, compiler=compiler, result=result)\n" ] } ], "source": [ "# Instead of the excel from their homepage, I use the csv version they uploaded to Kaggle\n", "# gtd = pd.read_excel(\"globalterrorismdb_0617dist.xlsx\")\n", "gtd = pd.read_csv(\"../input/globalterrorismdb_0617dist.csv\", encoding='ISO-8859-1')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# In case we want to use a sample\n", "# gtd_ori = gtd\n", "gtd = gtd.sample(frac=0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Recreating the CHRIST paper" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "a84ffd108fd274d6808b339ff33b321368526ce2" }, "source": [ "The authors of the paper trained and tested the model on a subsample of the whole dataset, namely:\n", "* Between 1970-2015\n", "* Incidents related to India\n", "* Cases to which the dataset attributes a perpetrator group (i.e. no 'Unknown' labels)\n", "\n", "They also preselected a number of features for the model which they used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining the steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling missing values\n", "There are a number of features where missing values are noted with a special code." 
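, "\n", "\n", "A minimal sketch of the idea on a made-up toy frame (the name `toy` and its values are assumptions for illustration, not GTD records): sentinel codes such as `-9` and `-99` are swapped for `NaN` so that pandas counts them as missing.\n", "\n", "```python\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Hypothetical toy data: -9 and -99 stand for 'unknown'\n", "toy = pd.DataFrame({'property': [1, -9, 0], 'nperps': [3, -99, 2]})\n", "\n", "# Swap the sentinel codes for NaN, column by column\n", "toy['property'] = toy['property'].replace(-9, np.nan)\n", "toy['nperps'] = toy['nperps'].replace(-99, np.nan)\n", "\n", "print(toy.isna().sum())  # each column now reports one missing value\n", "```"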
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "miscodes = {\"0\": ['imonth', 'iday'], \n", "            \"-9\": ['claim2', 'claimed', 'compclaim', 'doubtterr', 'INT_ANY', 'INT_IDEO', 'INT_LOG', 'INT_MISC', \n", "                   'ishostkid', 'ndays', 'nhostkid', 'nhostkidus', 'nhours', 'nperpcap', 'nperps', 'nreleased', \n", "                   'property', 'ransom', 'ransomamt', 'ransomamtus', 'ransompaid', 'ransompaidus', 'vicinity'], \n", "            \"-99\": ['ransompaid', 'nperpcap', 'compclaim', 'nreleased', 'nperps', 'nhostkidus', 'ransomamtus', \n", "                    'ransomamt', 'nhours', 'ndays', 'propvalue', 'ransompaidus', 'nhostkid']}" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def mistonan(data, nancode):\n", "    \"\"\"Replaces the columns' missing value codes with numpy.nan.\n", "\n", "    Parameters:\n", "    `data`: dataframe\n", "\n", "    `nancode`: dict mapping each missing value code (as a string)\n", "    to the list of columns in which it is used\n", "    \"\"\"\n", "    data = data.copy()\n", "\n", "    for code in nancode.keys():\n", "        for col in nancode[code]:\n", "            if col in data.columns:\n", "                # Assign back instead of relying on an in-place `where` on a slice\n", "                data[col] = data[col].where(data[col] != float(code), np.nan)\n", "\n", "    return data" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "tdat = gtd.copy()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "29645" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat[tdat == -9].count().sum()  # The number of '-9' values in the test data" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5674261052951983" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat.isna().mean().mean()  # The mean ratio of missing values over the columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function replaces the 
codes with NaNs" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "tdat = mistonan(tdat, miscodes)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat[tdat == -9].count().sum()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5851838806813858" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat.isna().mean().mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Imputing missing values" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def nanimputer(dat):\n", "    \"\"\"Imputes missing values with the column's mean\n", "    Takes:\n", "    'dat', dataframe\n", "\n", "    Returns:\n", "    'tdat': the dataframe with the imputed values\n", "    \"\"\"\n", "    tdat = dat.copy()\n", "    numcols = tdat.select_dtypes(exclude=object).columns  # numeric columns\n", "    hasna = tdat[numcols].isna().any()  # query about missing values\n", "    hasnacols = hasna[hasna].index  # columns w/ missing values\n", "\n", "    # A simple imputer: fill each column's gaps with that column's mean\n", "    tdat[hasnacols] = tdat[hasnacols].fillna(tdat[hasnacols].mean())\n", "\n", "    return tdat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The missing value ratio in columns before and after the transformation" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "weapsubtype4 0.999765\n", "weaptype4 0.999706\n", "claimmode3 0.999061\n", "guncertain3 0.998004\n", "claim3 0.997945\n", "ransompaid 0.997887\n", "attacktype3 0.997652\n", 
"ransompaidus 0.997593\n", "ransomamtus 0.997417\n", "claimmode2 0.996947\n", "ransomamt 0.994071\n", "targsubtype3 0.993425\n", "natlty3 0.993073\n", "targtype3 0.993014\n", "compclaim 0.992427\n", "weapsubtype3 0.991136\n", "weaptype3 0.990431\n", "claim2 0.989727\n", "guncertain2 0.989257\n", "nhours 0.988612\n", "ndays 0.981626\n", "nreleased 0.966481\n", "attacktype2 0.966246\n", "hostkidoutcome 0.944232\n", "targsubtype2 0.940475\n", "propvalue 0.940006\n", "natlty2 0.940006\n", "targtype2 0.937951\n", "weapsubtype2 0.936190\n", "nhostkid 0.935721\n", " ... \n", "weapsubtype1 0.109657\n", "nwound 0.090167\n", "doubtterr 0.082301\n", "nkill 0.056355\n", "targsubtype1 0.053009\n", "latitude 0.025125\n", "longitude 0.025125\n", "natlty1 0.007455\n", "iday 0.004755\n", "ishostkid 0.003053\n", "INT_MISC 0.002818\n", "guncertain1 0.002172\n", "vicinity 0.000176\n", "imonth 0.000117\n", "extended 0.000000\n", "country 0.000000\n", "region 0.000000\n", "iyear 0.000000\n", "specificity 0.000000\n", "suicide 0.000000\n", "crit1 0.000000\n", "crit2 0.000000\n", "crit3 0.000000\n", "multiple 0.000000\n", "success 0.000000\n", "attacktype1 0.000000\n", "targtype1 0.000000\n", "weaptype1 0.000000\n", "individual 0.000000\n", "eventid 0.000000\n", "Length: 77, dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat.select_dtypes(exclude=object).isna().mean().sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "INT_ANY 0.0\n", "natlty2 0.0\n", "attacktype2 0.0\n", "attacktype3 0.0\n", "targtype1 0.0\n", "targsubtype1 0.0\n", "natlty1 0.0\n", "targtype2 0.0\n", "targsubtype2 0.0\n", "targtype3 0.0\n", "claimed 0.0\n", "targsubtype3 0.0\n", "natlty3 0.0\n", "guncertain1 0.0\n", "guncertain2 0.0\n", "guncertain3 0.0\n", "individual 0.0\n", "nperps 0.0\n", "attacktype1 0.0\n", "suicide 0.0\n", "success 0.0\n", 
"multiple 0.0\n", "iyear 0.0\n", "imonth 0.0\n", "iday 0.0\n", "extended 0.0\n", "country 0.0\n", "region 0.0\n", "latitude 0.0\n", "longitude 0.0\n", " ... \n", "ndays 0.0\n", "ransomamt 0.0\n", "claim2 0.0\n", "ransomamtus 0.0\n", "ransompaid 0.0\n", "ransompaidus 0.0\n", "hostkidoutcome 0.0\n", "nreleased 0.0\n", "INT_LOG 0.0\n", "INT_IDEO 0.0\n", "property 0.0\n", "nwoundte 0.0\n", "nwoundus 0.0\n", "nwound 0.0\n", "claimmode2 0.0\n", "claim3 0.0\n", "claimmode3 0.0\n", "compclaim 0.0\n", "weaptype1 0.0\n", "weapsubtype1 0.0\n", "weaptype2 0.0\n", "weapsubtype2 0.0\n", "weaptype3 0.0\n", "weapsubtype3 0.0\n", "weaptype4 0.0\n", "weapsubtype4 0.0\n", "nkill 0.0\n", "nkillus 0.0\n", "nkillter 0.0\n", "eventid 0.0\n", "Length: 77, dtype: float64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat = nanimputer(tdat)\n", "tdat.select_dtypes(exclude=object).isna().mean().sort_values(ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean now appears as a new value in the column. I just realized that this interferes with the datatype converter function (see below)."
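, "\n", "\n", "A small illustration of the problem on made-up numbers (the series `counts` is hypothetical, and I am assuming here that the converter checks whether a column holds only whole numbers): after mean imputation a count column contains a fractional value, so it no longer looks like an integer column.\n", "\n", "```python\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Hypothetical count column with one missing value\n", "counts = pd.Series([0, 0, 1, np.nan])\n", "\n", "# Mean imputation: the NaN becomes the mean of the observed values (1/3)\n", "filled = counts.fillna(counts.mean())\n", "\n", "# Whole-number check before and after the imputation\n", "print((counts.dropna() % 1 == 0).all())  # True\n", "print((filled % 1 == 0).all())           # False\n", "```"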
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0.000000 9915\n", "0.100563 6902\n", "1.000000 90\n", "2.000000 33\n", "3.000000 21\n", "4.000000 13\n", "9.000000 8\n", "10.000000 7\n", "6.000000 7\n", "7.000000 6\n", "8.000000 6\n", "5.000000 5\n", "13.000000 3\n", "15.000000 3\n", "50.000000 2\n", "14.000000 2\n", "24.000000 2\n", "20.000000 2\n", "11.000000 2\n", "18.000000 1\n", "12.000000 1\n", "17.000000 1\n", "19.000000 1\n", "38.000000 1\n", "23.000000 1\n", "Name: nwoundte, dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdat.nwoundte.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I also tried the built-in sklearn imputer (commented out below), but it returned a numpy array without the index, which I found hard to map back onto the dataframe." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "#def imputenans(data, misvals=None, strategy='median'):\n", "#    if len(misvals) == 0:\n", "#        return data\n", "#\n", "#    misdat = data.copy()\n", "#    impute = Imputer(missing_values=misvals[0],\n", "#                     strategy=strategy,\n", "#                     verbose=0)\n", "#\n", "#    numcols = misdat.select_dtypes(exclude=[object]).columns\n", "#    #numeric = misdat.copy().loc[:, numcols]\n", "#\n", "#    misdat.loc[:, numcols] = pd.DataFrame(impute.fit_transform(misdat.loc[:, numcols]))\n", "#\n", "#    return imputenans(misdat, misvals=misvals[1:], strategy=strategy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling based on attribute values" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def attribute_sampler(X, startdate, enddate, country=None):\n", "    \"\"\"Filters data samples based on the values of the attributes\n", "    Here I tried to separate those transformations which are not related\n", "    to the label.\n", "\n", "    Takes\n", "    - X: dataframe\n", "    - 
startdate and enddate: first and last year to include\n", "    - country: the country to include (all countries if None)\n", "    \"\"\"\n", "    # Filter by country only if one is given and the column is available\n", "    if country is not None and 'country_txt' in X.columns:\n", "        X = X[X.country_txt == country]\n", "\n", "    filcrit = (X.iyear >= max(startdate, X.iyear.min())) \\\n", "        & (X.iyear <= min(enddate, X.iyear.max()))\n", "\n", "    return X[filcrit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initially (within this notebook) I also wrapped the functions with FunctionTransformer to turn them into possible pipeline steps, but in the end I did not use them as such." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "attsamp = FunctionTransformer(attribute_sampler,\n", "                              validate=False,\n", "                              kw_args={\n", "                                  'startdate': 1970,\n", "                                  'enddate': 2015,\n", "                                  'country': 'India'\n", "                              })" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "tX0 = attsamp.fit_transform(tdat).copy()\n", "ty0 = tdat.gname[tX0.index]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1979 2015\n", "India\n" ] } ], "source": [ "# Checks\n", "print(tX0.iyear.min(), tX0.iyear.max())\n", "print(tdat.reindex(tX0.index).country_txt.unique()[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting columns" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "def col_sel(X, columns=None):\n", "    \"\"\"Selects columns from a DataFrame based on a list of column names.\n", "    Takes\n", "    - X: DataFrame\n", "    - columns: list of column names (all columns are kept if None)\n", "\n", "    Returns\n", "    - X: The DataFrame with the filtered columns\n", "    \"\"\"\n", "    if columns is not None:\n", "        X = X.loc[:, columns]\n", "\n", "    return X" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# The two sets of columns the two 
papers used in their model\n", "kanoe_cols = ['iyear', 'attacktype1', 'targtype1', 'targsubtype1', 'weaptype1', \n", "              'latitude', 'longitude', 'natlty1', 'property', 'INT_ANY', 'multiple', 'crit3']\n", "\n", "christ_cols = ['iyear', 'imonth', 'iday', 'extended', 'provstate', 'city', 'attacktype1_txt',\n", "               'targtype1_txt', 'nperps', 'weaptype1_txt', 'nkill', 'nwound', 'nhostkid']" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "colsamp = FunctionTransformer(\n", "    col_sel,\n", "    validate=False,\n", "    kw_args={'columns': christ_cols})  # the other option: kanoe_cols" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "tX0 = colsamp.fit_transform(tX0)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['iyear' 'imonth' 'iday' 'extended' 'provstate' 'city' 'attacktype1_txt'\n", " 'targtype1_txt' 'nperps' 'weaptype1_txt' 'nkill' 'nwound' 'nhostkid']\n" ] } ], "source": [ "# Check\n", "print(tX0.columns.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling based on target values" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def target_sampler(X, y, mininc, onlyknown=True, ukcode='Unknown'):\n", "    \"\"\"Filters X based on information from y\n", "    (as opposed to, for instance, the attribute sampler,\n", "    which relies only on information from X)\n", "\n", "    Takes:\n", "    - X, y: The attributes and their labels\n", "    - mininc: The minimum occurrence of a label to include\n", "      (i.e. 
for how many incidents a group is responsible)\n", "    - onlyknown: whether to drop samples with unknown target values\n", "    - ukcode: the code for unknown targets (this can be different if we\n", "      encode the labels)\n", "\n", "    Returns\n", "    - X: the filtered attributes (the matching labels are y[X.index])\n", "    \"\"\"\n", "\n", "    if onlyknown:  # Check for known labels and filter the data\n", "        nukcrit = y != ukcode\n", "        X = X[nukcrit]\n", "        y = y[nukcrit]\n", "\n", "    # Keep only the labels with the minimum required number of occurrences\n", "    counts = y.value_counts()\n", "    keep = y.isin(counts[counts >= mininc].index)\n", "\n", "    return X[keep]  # y is not returned; select it afterwards with y[X.index]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the transformation the data has no 'Unknown' values and the minimum frequency of included labels is 2." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "tX1 = target_sampler(tX0, ty0, mininc=2, onlyknown=True, ukcode='Unknown')\n", "ty1 = ty0[tX1.index]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "True\n", "True\n", "2\n" ] } ], "source": [ "# Checks\n", "print(tX1.shape[0] == ty1.shape[0])\n", "print(('Communist Party of India - Maoist (CPI-Maoist)' == ty1).any())\n", "print(('Unknown' != ty1).all())\n", "print(ty1.value_counts().min())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### New feature creation" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "def naff(X, crits, zeroval, newcol):\n", "    \"\"\" A function which creates a new feature based on the number \n", "    of killed, wounded and hostages taken in an incident. 
\n", "    Zero and missing values receive the `zeroval` code.\n", "    It basically bins the columns into categories based on effect size\n", "    and provides a column with the combination of their codes.\n", "\n", "    Takes\n", "    -X: The Dataframe to transform\n", "    -crits, dict: the criteria based on which the new feature is created.\n", "\n", "        It works with the following structure:\n", "        {'nkill': (62, 124, 'abc')}\n", "        - the key is the column name\n", "        - the first two values are numeric thresholds based on which\n", "          it bins the feature\n", "        - the three-letter string holds the codes of the three bins\n", "          (highest bin first)\n", "\n", "    -zeroval: the code zero values receive\n", "    -newcol: the name of the new column\n", "\n", "    \"\"\"\n", "    # Technical column\n", "    n = pd.Series('_')\n", "\n", "    # The loop goes over the columns and bins them\n", "    for key in crits:\n", "\n", "        # Start with a series of zero codes on the index of the dataframe\n", "        binned = pd.Series(zeroval, index=X.index, name=key)\n", "\n", "        # Binning the column based on the criteria from the dict\n", "        binned[(X.loc[:, key] > 0)\n", "               & (X.loc[:, key] < crits[key][0])] = crits[key][2][2]\n", "        binned[(X.loc[:, key] <= crits[key][1])\n", "               & (X.loc[:, key] >= crits[key][0])] = crits[key][2][1]\n", "        binned[X.loc[:, key] > crits[key][1]] = crits[key][2][0]\n", "\n", "        # Concatenating the new column to the technical/other columns\n", "        n = pd.concat((n, binned), axis=1)\n", "\n", "    # Cleaning missing values (assumes 'nhostkid' is one of the binned columns)\n", "    n.loc[n.nhostkid == -99, 'nhostkid'] = zeroval\n", "    n = n.replace(np.nan, zeroval)\n", "\n", "    # Dropping the technical column\n", "    n = n.reindex(X.index).drop(columns=0)\n", "\n", "    # Merging the values of the three columns together into strings\n", "    n = n.iloc[:, 0] + n.iloc[:, 1] + n.iloc[:, 2]\n", "\n", "    # Replacing the old columns with the new one\n", "    X = X.drop(columns=list(crits.keys()))\n", "    X[newcol] = n\n", "\n", "    # Further data cleaning\n", "    
X.nperps = X.nperps.where(X.nperps != -99, 0)\n", "    X.nperps = X.nperps.fillna(0)\n", "\n", "    return X" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "80db91a3bb00d1d32a5fa408dfe1e379e0f4f8da" }, "source": [ "The feature creation criteria\n", "\n", "```python\n", "'crits' : {'nkill': (62, 124, 'abc'),\n", "           'nwound': (272, 544, 'def'),\n", "           'nhostkid': (400, 800, 'ghi')},\n", "'zeroval': 'n',\n", "'newcol': 'naffect'\n", "```" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "naff_col = FunctionTransformer(naff, \n", "                               validate=False,\n", "                               kw_args={'crits' : {'nkill': (62, 124, 'abc'),\n", "                                                   'nwound': (272, 544, 'def'),\n", "                                                   'nhostkid': (400, 800, 'ghi')},\n", "                                        'zeroval': 'n',\n", "                                        'newcol': 'naffect'\n", "                                        })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new column after the transformation" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "tX2 = naff_col.fit_transform(tX1).copy()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | nkill | \n", "nwound | \n", "nhostkid | \n", "naffect | \n", "
---|---|---|---|---|
123631 | \n", "1.0 | \n", "0.0 | \n", "9.881279 | \n", "cni | \n", "
34826 | \n", "7.0 | \n", "60.0 | \n", "9.881279 | \n", "cfi | \n", "
100820 | \n", "3.0 | \n", "4.0 | \n", "9.881279 | \n", "cfi | \n", "
75164 | \n", "0.0 | \n", "0.0 | \n", "9.881279 | \n", "nni | \n", "
59111 | \n", "0.0 | \n", "0.0 | \n", "9.881279 | \n", "nni | \n", "