{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<h1 style=\"text-align: center\">Extracting Features From the History of Crimes Recently Committed and Within the Area of Influence: A Feature Engineering Strategy</h1>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This document presents a feature engineering strategy that supports the predictive power of an ensemble-based approach to crime classification. First, for each crime in the dataset, several features are extracted from the history of recent incidents within its area of influence. Next, a methodology to develop a stacked generalization ensemble is proposed. In particular, several base models are trained on distinct stratified subsets of the whole dataset. Then, the first-level predictions, i.e., the outputs from the base models, are combined according to the stacking technique that the meta-model implements to make the second-level predictions. Experimental results show that training a classifier to combine the predictions from the base models outperforms the straightforward stacking technique of soft voting. Nevertheless, the lowest multi-class logarithmic loss obtained, i.e., 2.56044, is the result of combining the second-level predictions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<span style=\"font-size: 128.6%;\"><b>Contents</b></span>\n", "\n", "<ul style=\"list-style-type: none; padding-left: 0 !important;\">\n", " <li><a href=\"#Project-Description\">Project Description</a></li>\n", " <li><a href=\"#0.-Requirements\">0. Requirements</a></li>\n", " <li><a href=\"#1.-Data-Preprocessing\">1. Data Preprocessing</a></li>\n", " <li><a href=\"#2.-Feature-Engineering\">2. Feature Engineering</a></li>\n", " <li>\n", " <a href=\"#3.-Crime-Classification\">3. 
Crime Classification</a>\n", " <ul style=\"list-style-type: none; padding-left: 1em !important;\">\n", " <li><a href=\"#3.1-Train-Test-Split\">3.1 Train-Test Split</a></li>\n", " <li><a href=\"#3.2-Baseline-Model\">3.2 Baseline Model</a></li>\n", " <li><a href=\"#3.3-Discriminative-Features\">3.3 Discriminative Features</a></li>\n", " <li>\n", " <a href=\"#3.4-Stacking-Ensemble\">3.4 Stacking Ensemble</a>\n", " <ul style=\"list-style-type: none; padding-left: 1em !important;\">\n", " <li><a href=\"#3.4.1-Results\">3.4.1 Results</a></li>\n", " <li><a href=\"#3.4.2-Late-Submission\">3.4.2 Late Submission</a></li>\n", " </ul>\n", " </li>\n", " </ul>\n", " </li>\n", " <li><a href=\"#4.-Conclusion\">4. Conclusion</a></li>\n", "</ul>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Project Description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[San Francisco Crime Classification](https://www.kaggle.com/c/sf-crime/) is a Kaggle challenge on multi-class classification. In particular, each record in the dataset must be classified into one of 39 crime categories. There are 12 years of criminal records on which predictive models must be trained and evaluated. To get a sense of the dataset, an overview of its attributes is presented below:\n", "\n", "* **Dates**: the date and time at which the crime was committed\n", "* **Category**: the category of the committed crime. 
(This is the target variable.)\n", "* **Descript**: a further description of the crime\n", "* **DayOfWeek**: the day of the week on which the crime was committed\n", "* **PdDistrict**: the Police Department District that attended the crime incident\n", "* **Resolution**: an (optional) explanation of how the crime was resolved\n", "* **Address**: \"the approximate street address of the crime incident\"\n", "* **X**: the longitude of the location where the crime was committed\n", "* **Y**: the latitude of the location where the crime was committed\n", "\n", "In total, the dataset comprises 1,762,311 crime records, 878,049 of which have a category (i.e., the training set). These data range from January 2003 to May 2015.\n", "\n", "Lastly, the performance of predictive models is evaluated on the test set using the multi-class logarithmic loss metric." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Requirements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To encourage reproducibility, the technologies used, along with their respective versions, are listed below:\n", "\n", "1. Python **3.7.2**.\n", "2. NumPy **1.17.2**.\n", "3. SciPy **1.3.1**.\n", "4. Scikit-learn **0.21.3**.\n", "5. pandas **0.25.1**.\n", "6. Matplotlib **3.1.1**.\n", "7. Numba **0.48.0**.\n", "8. tabulate **0.8.6**."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import copy\n", "import os\n", "import re\n", "\n", "import joblib\n", "import matplotlib.pyplot as plt\n", "import numba\n", "import numpy as np\n", "import pandas as pd\n", "import tabulate\n", "from IPython.display import display\n", "from IPython.display import HTML\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import log_loss\n", "from sklearn.model_selection import cross_val_predict\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.model_selection import ParameterGrid\n", "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "CURRENT_PATH = os.path.abspath(os.getcwd())\n", "DATA_PATH = os.path.join(CURRENT_PATH, 'data')\n", "\n", "DATE_FORMAT = '%Y-%m-%d %H:%M:%S'\n", "RANDOM_STATE = 91\n", "\n", "ALGORITHMS = {\n", "    'logit': {\n", "        'estimator': LogisticRegression(\n", "            penalty='l2',\n", "            solver='liblinear',\n", "            multi_class='ovr'\n", "        ),\n", "        'predict_proba': True,\n", "        'param_grid': None\n", "    },\n", "    'gb': {\n", "        'estimator': GradientBoostingClassifier(),\n", "        'predict_proba': True,\n", "        'param_grid': None\n", "    },\n", "    'rf': {\n", "        'estimator': RandomForestClassifier(),\n", "        'predict_proba': True,\n", "        'param_grid': None\n", "    }\n", "}" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "FNAMES = {\n", "    'original-train': 'train.csv',\n", "    'train': 'projected-train.csv',\n", "    'original-test': 'test.csv',\n", "    'test': 'projected-test.csv',\n", "    'sample-submission': 'sampleSubmission.csv',\n", "    'crime-dataset': 'crime-dataset.csv',\n", "    'baseline': 'baseline-models.csv'\n", "}\n", 
"\n", "FNAMES = {key: os.path.join(DATA_PATH, fname) for key, fname in FNAMES.items()}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Data Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's map each crime category to its numerical representation." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "with open(FNAMES['sample-submission']) as f:\n", "    for i, row in enumerate(f):\n", "        row = row.rstrip('\\n')\n", "        CRIME_CATEGORY = {category: j-1 for j, category in enumerate(row.split(',')) if j > 0}\n", "\n", "        break" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "PROJECTION = ['Id', 'Dates', 'Category', 'DayOfWeek', 'PdDistrict', 'X', 'Y']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As specified by the variable `PROJECTION`, let's project (or filter) the datasets." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def dataset_projection(in_fname, out_fname, projection=PROJECTION):\n", "    valid_columns = []\n", "\n", "    insert_id = False\n", "    header, data = [], []\n", "    with open(in_fname) as f:\n", "        for i, row in enumerate(f):\n", "            row = row.rstrip('\\n')\n", "\n", "            if i == 0:\n", "                for j, col in enumerate(row.split(',')):\n", "                    if col in projection:\n", "                        header.append(col)\n", "                        valid_columns.append(j)\n", "\n", "                insert_id = True if 'Id' not in header else False\n", "\n", "                continue\n", "\n", "            for old in re.findall(r'\"[^\"]+\"', row):\n", "                new = re.sub(r',', '|', old)\n", "                row = row.replace(old, new, 1)\n", "\n", "            record = [\n", "                re.sub(r'\\|', ',', col).strip()\n", "                for j, col in enumerate(row.split(',')) if j in valid_columns\n", "            ]\n", "\n", "            if len(record) != len(valid_columns):\n", "                print('({}), Malformed columns at line {}'.format(in_fname, i+1))\n", "                continue\n", "            elif insert_id:\n", "                record.insert(0, i-1)\n", "\n", 
"            data.append(record)\n", "\n", "    if insert_id:\n", "        header.insert(0, 'Id')\n", "\n", "    return header, data" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "datasets = [\n", "    [FNAMES['original-train'], FNAMES['train']],\n", "    [FNAMES['original-test'], FNAMES['test']]\n", "]\n", "for in_fname, out_fname in datasets:\n", "    if os.path.isfile(out_fname):\n", "        continue\n", "\n", "    header, data = dataset_projection(in_fname, out_fname)\n", "    df = pd.DataFrame(data, columns=header)\n", "\n", "    df['Dates'] = pd.to_datetime(df['Dates'], format=DATE_FORMAT)\n", "    df = df.sort_values(by=['Dates'])\n", "    df['Dates'] = df['Dates'].dt.strftime(DATE_FORMAT)\n", "\n", "    df.to_csv(out_fname, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's append the test set to the training one." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "if not os.path.isfile(FNAMES['crime-dataset']):\n", "    columns = copy.deepcopy(PROJECTION)\n", "    columns.remove('Category')\n", "    columns.insert(0, 'Dataset')\n", "\n", "    df = None\n", "    for in_fname, dataset_type in [[FNAMES['train'], 'train'], [FNAMES['test'], 'test']]:\n", "        dataset = pd.read_csv(in_fname)\n", "        dataset['Dataset'] = dataset_type\n", "        dataset = dataset[columns]\n", "\n", "        df = dataset.copy(deep=True) if df is None else df.append(dataset)\n", "\n", "    df['Dates'] = pd.to_datetime(df['Dates'], format=DATE_FORMAT)\n", "    df = df.sort_values(by=['Dates', 'Dataset', 'Id'])\n", "    df['Dates'] = df['Dates'].dt.strftime(DATE_FORMAT)\n", "\n", "    df = df[columns]\n", "\n", "    df.to_csv(FNAMES['crime-dataset'], index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Feature Engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This process explores the ability to extract valuable information from the history of crimes recently committed and relatively close to each other. In other words, for each crime $c$, whose position on the map is depicted by the white filled circle and whose area of influence corresponds to the red filled circle, the history of crimes comprises all those committed in the hours preceding $c$ and within its area of influence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "<div align=\"center\" style=\"margin-top: 10px;\"><b>Figure 1</b>: A crime instance in San Francisco, which is depicted by the white filled circle and whose area of influence corresponds to the red filled circle. Credits: <a href=\"https://www.google.com/maps\">Google Maps</a> and <a href=\"https://www.mapdevelopers.com/draw-circle-tool.php\">Map Developers</a></div>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Formally, let $x_{i}$ be a crime in the dataset and $S_{i}$ be the set comprising all crimes committed within the $t_{w}$ hours before $x_{i}$. Then:\n", "\n", "\\begin{equation*}\n", "S_{i} = \\{ x_{j} | (x_{j}^{Dates} < x_{i}^{Dates}) \\wedge (hours(x_{j}^{Dates}, x_{i}^{Dates}) < t_{w}) \\}_{_{j \\neq i}^{j = 1}}^{N}\n", "\\end{equation*}\n", "\n", "\\begin{equation*}\n", "S_{i}^{PdDistrict} = \\{ x_{j} | (x_{j}^{Dates} < x_{i}^{Dates}) \\wedge (hours(x_{j}^{Dates}, x_{i}^{Dates}) < t_{w}) \\wedge (x_{j}^{PdDistrict} = x_{i}^{PdDistrict}) \\}_{_{j \\neq i}^{j = 1}}^{N}\n", "\\end{equation*}\n", "\n", "\\begin{equation*}\n", "S_{i}^{AreaOfInfluence} = \\{ x_{j} | (x_{j}^{Dates} < x_{i}^{Dates}) \\wedge (hours(x_{j}^{Dates}, x_{i}^{Dates}) < t_{w}) \\wedge (distance(x_{i}^{Y}, x_{i}^{X}, x_{j}^{Y}, x_{j}^{X}) \\leq r) \\}_{_{j \\neq i}^{j = 1}}^{N}\n", "\\end{equation*}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Afterward, several features are created by 
applying the count aggregation function to each set of crimes as follows:\n", "\n", "\\begin{equation*}\n", "x_{i}^{Agg_{1}} = |S_{i}|\n", "\\end{equation*}\n", "\n", "\\begin{equation*}\n", "x_{i}^{Agg_{2}} = |S_{i}^{PdDistrict}|\n", "\\end{equation*}\n", "\n", "\\begin{equation*}\n", "x_{i}^{Agg_{3}} = |S_{i}^{AreaOfInfluence}|\n", "\\end{equation*}\n", "\n", "where,\n", "\n", "* $N$ is the total number of crimes in the dataset,\n", "* $| \\cdot |$ is the cardinality of a set,\n", "* $x_{i}^{Dates}$ is the DateTime when crime $x_{i}$ was committed,\n", "* $hours(x_{j}^{Dates}, x_{i}^{Dates})$ is a function calculating the number of hours between the DateTimes $x_{j}^{Dates}$ and $x_{i}^{Dates}$,\n", "* $t_{w}$ is the maximum number of hours between previous crime $x_{j}$ and crime $x_{i}$ in order to include the former in the set $S_{i}$,\n", "* $x_{i}^{PdDistrict}$ is the Police Department District that attended the crime incident,\n", "* $S_{i}^{PdDistrict}$ is the set of crimes attended by the same Police Department District,\n", "* $x_{i}^{Y}$ and $x_{i}^{X}$ correspond to latitude and longitude, respectively,\n", "* $distance(x_{i}^{Y}, x_{i}^{X}, x_{j}^{Y}, x_{j}^{X})$ is a function calculating the distance, in kilometers, between previous crime $x_{j}$ and crime $x_{i}$,\n", "* $r$ is the maximum distance in kilometers between previous crime $x_{j}$ and crime $x_{i}$ in order to include the former in the set $S_{i}^{AreaOfInfluence}$,\n", "* Note that $r$ can also be seen as the radius of the circle representing the area of influence of $x_{i}$,\n", "* $S_{i}^{AreaOfInfluence}$ is the set of crimes within the area of influence of $x_{i}$."
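, "\n", "\n", "As a hypothetical worked example (the times, districts, and distances below are made up for illustration and do not come from the dataset), let $t_{w} = 12$ and $r = 1$, and suppose crime $x_{i}$ was committed at 20:00 in district $A$. Consider four earlier crimes: $x_{1}$ (10.5 hours earlier, district $A$, 0.4 km away), $x_{2}$ (5 hours earlier, district $B$, 0.8 km away), $x_{3}$ (1 hour earlier, district $A$, 3.2 km away), and $x_{4}$ (13 hours earlier, district $A$, 0.1 km away). Crime $x_{4}$ falls outside the time window, so:\n", "\n", "\\begin{equation*}\n", "S_{i} = \\{ x_{1}, x_{2}, x_{3} \\}, \\quad S_{i}^{PdDistrict} = \\{ x_{1}, x_{3} \\}, \\quad S_{i}^{AreaOfInfluence} = \\{ x_{1}, x_{2} \\}\n", "\\end{equation*}\n", "\n", "\\begin{equation*}\n", "x_{i}^{Agg_{1}} = 3, \\quad x_{i}^{Agg_{2}} = 2, \\quad x_{i}^{Agg_{3}} = 2\n", "\\end{equation*}"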
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before continuing, some clarifications must be provided, namely:\n", "\n", "* $t_{w}$ takes a value from the set $\\{ 12, 24, 72, 168, 336 \\}$.\n", "* $r$ takes a value from the set $\\{ 1, 2, 4, 8, 16 \\}$.\n", "* The three aggregated features described above result from assigning one value to $t_{w}$ and one to $r$ from their corresponding sets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, the following list comprises all attributes derived from the DateTime $x_{i}^{Dates}$:\n", "\n", "1. Year,\n", "2. Month,\n", "3. Quarter,\n", "4. Triannual,\n", "5. Semester,\n", "6. Day,\n", "7. Day of week,\n", "8. Whether or not the day of the week is a working day,\n", "9. Fortnight,\n", "10. Hour,\n", "11. Four-hour period the hour of the crime belongs to,\n", "12. Six-hour period the hour of the crime belongs to,\n", "13. Twelve-hour period the hour of the crime belongs to." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In summary, the feature engineering process creates 13 date-derived features plus three aggregated features for each combination of values that $t_{w}$ and $r$ take. To create these features, four raw features are used, namely: $Dates$, $PdDistrict$, $X$, and $Y$. Lastly, bear in mind that the terms *feature* and *attribute* are used interchangeably throughout this document."
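, "\n", "\n", "As a rough count of distinct features, assuming the column layout produced by the code in this section (where $|S_{i}|$ and $|S_{i}^{PdDistrict}|$ depend only on $t_{w}$, whereas $|S_{i}^{AreaOfInfluence}|$ depends on both $t_{w}$ and $r$), the five time windows and five radii yield:\n", "\n", "\\begin{equation*}\n", "\\underbrace{5}_{Agg_{1}} + \\underbrace{5}_{Agg_{2}} + \\underbrace{5 \\times 5}_{Agg_{3}} = 35 \\text{ aggregated features}, \\qquad 35 + 13 = 48 \\text{ engineered features in total}\n", "\\end{equation*}"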
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "@numba.jit(nopython=True)\n", "def compute_distance(\n", "        lat_1, lon_1,\n", "        lat_2, lon_2):\n", "    \"\"\"Compute distance between two locations.\n", "\n", "    Returns\n", "    -------\n", "    float\n", "        Distance in KM.\n", "\n", "    Source: <https://stackoverflow.com/questions/19412462/>\n", "    \"\"\"\n", "    # Approximate radius of earth in KM\n", "    earth_radius = 6373.0\n", "\n", "    lat_1 = np.radians(lat_1)\n", "    lon_1 = np.radians(lon_1)\n", "\n", "    lat_2 = np.radians(lat_2)\n", "    lon_2 = np.radians(lon_2)\n", "\n", "    lon_dist = lon_2 - lon_1\n", "    lat_dist = lat_2 - lat_1\n", "\n", "    a = (np.square(np.sin(lat_dist/2))\n", "         + np.cos(lat_1)\n", "         * np.cos(lat_2)\n", "         * np.square(np.sin(lon_dist/2)))\n", "\n", "    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))\n", "\n", "    return earth_radius * c" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "@numba.jit(nopython=True)\n", "def compute_aggregated_features(data, time_window, crime_radius):\n", "    \"\"\"Compute aggregated features.\n", "\n", "    Parameters\n", "    ----------\n", "    data : np.ndarray, dtype('float64')\n", "        A Numpy-like array of shape \"(n, m)\", where \"n\" is the number\n", "        of records and \"m\" is the number of columns (or attributes).\n", "        The strict order of the columns is presented below:\n", "            Dataset,\n", "            Id,\n", "            Dates (Unix timestamp, in seconds),\n", "            PdDistrict,\n", "            X - Longitude,\n", "            Y - Latitude\n", "    time_window : int\n", "        Time window (in hours).\n", "    crime_radius : np.ndarray of int\n", "        Array of integers, each representing a radius in kilometers.\n", "    \"\"\"\n", "    n = len(data)\n", "\n", "    # Let's transform the time window into seconds\n", "    time_window = time_window * 60 * 60\n", "\n", "    aggregated_features = []\n", "    for i in range(n):\n", "        ts = data[i,2]\n", "\n", "        lower_ts = ts - time_window\n", "\n", "        mask = ((lower_ts < data[:,2])\n", "                & (data[:,2] < ts))\n", " 
\n", "        historical_data = data[mask]\n", "        m = len(historical_data)\n", "\n", "        police_district = data[i,3]\n", "\n", "        feature_vector = [\n", "            int(data[i,0]),\n", "            int(data[i,1]),\n", "            m, # number of crimes within the time window\n", "            0 # number of crimes attended by the same police department district\n", "        ]\n", "        feature_vector = feature_vector + [0 for j in crime_radius]\n", "\n", "        lat_1 = data[i,5]\n", "        lon_1 = data[i,4]\n", "\n", "        for j in range(m):\n", "            feature_vector[3] += 1 if police_district == historical_data[j,3] else 0\n", "\n", "            lat_2 = historical_data[j,5]\n", "            lon_2 = historical_data[j,4]\n", "\n", "            # Let's compute the number of crimes within each given radius\n", "            distance = compute_distance(lat_1, lon_1, lat_2, lon_2)\n", "\n", "            for k, rad in enumerate(crime_radius):\n", "                feature_vector[4+k] += 1 if distance <= rad else 0\n", "\n", "        aggregated_features.append(feature_vector)\n", "\n", "    return aggregated_features" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def derive_date_attributes(df, column, drop_column=True):\n", "    df['Year'] = df[column].dt.year\n", "\n", "    df['Month'] = df[column].dt.month\n", "    df['Quarter'] = df[column].dt.quarter\n", "\n", "    df['Triannual'] = 0\n", "    for i, (min_m, max_m) in enumerate([[1, 4], [5, 8], [9, 12]]):\n", "        df.loc[((min_m <= df['Month']) & (df['Month'] <= max_m)), 'Triannual'] = i + 1\n", "\n", "    df['Semester'] = 1\n", "    df.loc[(df['Quarter'] > 2), 'Semester'] = 2\n", "\n", "    df['Day'] = df[column].dt.day\n", "    df['DayOfWeek'] = df[column].dt.dayofweek\n", "\n", "    df['WorkingDay'] = 1\n", "    df.loc[(df['DayOfWeek'] > 4), 'WorkingDay'] = 0\n", "\n", "    df['Fortnight'] = 1\n", "    df.loc[(df['Day'] > 15), 'Fortnight'] = 2\n", "\n", "    df['Hour'] = df[column].dt.hour\n", "\n", "    hourly_periods = {\n", "        'four': [[i, i+3] for i in range(0, 24, 4)],\n", "        'six': [[i, i+5] for i in range(0, 24, 6)],\n", "        'twelve': [[i, i+11] for i in 
range(0, 24, 12)],\n", "    }\n", "\n", "    for str_period, period in hourly_periods.items():\n", "        period_column = '{}HourPeriod'.format(str_period.title())\n", "        df[period_column] = 0\n", "\n", "        for i, (min_hr, max_hr) in enumerate(period):\n", "            df.loc[((min_hr <= df['Hour']) & (df['Hour'] <= max_hr)), period_column] = i + 1\n", "\n", "    if drop_column:\n", "        df = df.drop(columns=[column])\n", "\n", "    return df" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def feature_engineering(\n", "        df, data, time_windows,\n", "        idx_to_dataset, idx_to_district,\n", "        crime_radius=[1, 2, 4, 8, 16]):\n", "    \"\"\"Compute the process of feature engineering.\"\"\"\n", "    df = df[['Dataset', 'Id', 'PdDistrict', 'Dates']]\n", "    df = df.replace({'Dataset': idx_to_dataset, 'PdDistrict': idx_to_district})\n", "\n", "    df = derive_date_attributes(df, 'Dates')\n", "\n", "    crime_radius = np.array(crime_radius, dtype=int)\n", "\n", "    agg_ds_fname = os.path.join(DATA_PATH, 'agg-dataset-{}H.csv')\n", "    cum_ds_fname = os.path.join(DATA_PATH, 'agg-dataset-{}H-cumulative.csv')\n", "\n", "    for i, time_window in enumerate(time_windows):\n", "        agg_ds_fname_1 = agg_ds_fname.format(time_window)\n", "        cum_ds_fname_1 = cum_ds_fname.format(time_window)\n", "\n", "        if (os.path.isfile(cum_ds_fname_1)\n", "                or (i == 0 and os.path.isfile(agg_ds_fname_1))):\n", "            continue\n", "        elif os.path.isfile(agg_ds_fname_1):\n", "            agg_ds = pd.read_csv(agg_ds_fname_1)\n", "\n", "            for col in agg_ds.columns:\n", "                if col in ['Dataset']:\n", "                    continue\n", "\n", "                agg_ds[col] = pd.to_numeric(agg_ds[col])\n", "        else:\n", "            agg_ds = compute_aggregated_features(data, time_window, crime_radius)\n", "\n", "            prefix = '{}H_'.format(time_window)\n", "            agg_ds_columns = (['Dataset', 'Id']\n", "                              + [(prefix + col) for col in ['Crimes', 'CrimesAttendedByPdDistrict']]\n", "                              + [(prefix + 'CrimesWithin{}KMRad'.format(rad)) for rad in crime_radius])\n", "\n", "            agg_ds = 
pd.DataFrame(agg_ds, columns=agg_ds_columns)\n", "            agg_ds = agg_ds.astype({col: 'int32' for col in agg_ds_columns})\n", "\n", "            agg_ds['Dataset'] = agg_ds['Dataset'].map(idx_to_dataset)\n", "\n", "            agg_ds.to_csv(agg_ds_fname_1, index=False)\n", "\n", "        if i == 0:\n", "            continue\n", "\n", "        cum_ds = agg_ds.copy(deep=True)\n", "        agg_ds = None\n", "\n", "        for j in range(i):\n", "            agg_ds_fname_2 = agg_ds_fname.format(time_windows[j])\n", "\n", "            agg_ds = pd.read_csv(agg_ds_fname_2)\n", "\n", "            for col in agg_ds.columns:\n", "                if col in ['Dataset']:\n", "                    continue\n", "\n", "                agg_ds[col] = pd.to_numeric(agg_ds[col])\n", "\n", "            cum_ds = pd.merge(agg_ds, cum_ds, on=['Dataset', 'Id'], how='inner')\n", "\n", "        cum_ds = pd.merge(df, cum_ds, on=['Dataset', 'Id'], how='inner')\n", "\n", "        cum_ds.to_csv(cum_ds_fname_1, index=False)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(FNAMES['crime-dataset'])\n", "\n", "df['Dates'] = pd.to_datetime(df['Dates'], format=DATE_FORMAT)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "DATASET_TO_IDX = {dataset: i for i, dataset in enumerate(df['Dataset'].unique())}\n", "IDX_TO_DATASET = {i: dataset for dataset, i in DATASET_TO_IDX.items()}\n", "\n", "df['Dataset'] = df['Dataset'].map(DATASET_TO_IDX)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "DISTRICT_TO_IDX = {district: i for i, district in enumerate(df['PdDistrict'].unique())}\n", "IDX_TO_DISTRICT = {i: district for district, i in DISTRICT_TO_IDX.items()}\n", "\n", "df['PdDistrict'] = df['PdDistrict'].map(DISTRICT_TO_IDX)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "PROJECTION = (['Dataset']\n", "              + [col for col in PROJECTION if col not in ['Category', 'DayOfWeek']])\n", "\n", "df = df[PROJECTION]\n", "df = df.sort_values(by=['Dates', 'Dataset', 'Id'])\n", "\n", 
"crimes = df.copy(deep=True)\n", "crimes['ts'] = crimes['Dates'].values.astype(np.int64) // 10 ** 9\n", "crimes = crimes[[('ts' if col == 'Dates' else col) for col in PROJECTION]].to_numpy().astype(float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bear in mind that running the feature engineering process takes a long time; approximately 7.3 hours." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "TIME_WINDOWS = [12, 24, 72, 168, 336]\n", "\n", "feature_engineering(df, crimes, TIME_WINDOWS, IDX_TO_DATASET, IDX_TO_DISTRICT)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Crime Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first subsection presents how the whole dataset is split into training, validation, and test sets. Then, the baseline model and the discriminative power of each set of features are discussed in subsections 3.2 and 3.3, respectively. Finally, the methodology to build stacked generalization ensembles is developed in subsection 3.4." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Train-Test Split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's split the whole dataset of crimes into training, validation and test datasets. Please recall that the original training and test datasets, as downloaded from Kaggle, were merged to run the feature engineering process. Lastly, the training dataset will be split into two folds: the first one, approximately 80% of the data, is intended to train prediction models; the second one is aimed at validating each prediction model on an independent dataset." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "VALIDATION_SIZE = 0.2\n", "\n", "for time_window in TIME_WINDOWS:\n", "    in_fname = os.path.join(DATA_PATH, 'agg-dataset-{}H-cumulative.csv'.format(time_window))\n", "    if not os.path.isfile(in_fname):\n", "        continue\n", "\n", "    window_path = {\n", "        '/': os.path.join(DATA_PATH, '{}H'.format(time_window))\n", "    }\n", "    window_path['data'] = os.path.join(window_path['/'], 'data')\n", "\n", "    for path in window_path.values():\n", "        if not os.path.isdir(path):\n", "            os.makedirs(path)\n", "\n", "    window_fnames = {\n", "        ds: os.path.join(window_path['data'], '{}.csv'.format(ds))\n", "        for ds in ['train', 'validation', 'test']\n", "    }\n", "\n", "    split_data = False\n", "    for fname in window_fnames.values():\n", "        if not os.path.isfile(fname):\n", "            split_data = True\n", "            break\n", "\n", "    if not split_data:\n", "        continue\n", "\n", "    df = pd.read_csv(in_fname)\n", "\n", "    df['Id'] = pd.to_numeric(df['Id'])\n", "\n", "    columns = df.columns\n", "\n", "    numerical_columns = [col for col in columns if re.match(r'[0-9]{2,3}H_', col)]\n", "\n", "    categorical_columns = ['PdDistrict']\n", "    for col in columns:\n", "        if (col not in numerical_columns\n", "                and col not in ['Dataset', 'Id', 'PdDistrict']):\n", "            categorical_columns.append(col)\n", "\n", "    if not os.path.isfile(window_fnames['test']):\n", "        test = df.loc[df['Dataset']=='test'].drop(columns=['Dataset'])\n", "\n", "        test = test.sort_values(by=['Id'])\n", "\n", "        columns = ['Id'] + categorical_columns + numerical_columns\n", "        test = test[columns]\n", "\n", "        test.to_csv(window_fnames['test'], index=False)\n", "        test = None\n", "\n", "    df = df.loc[df['Dataset']=='train'].drop(columns=['Dataset'])\n", "\n", "    train = pd.read_csv(FNAMES['train'])\n", "\n", "    train['Id'] = pd.to_numeric(train['Id'])\n", "    train = train[['Id', 'Category']]\n", "\n", "    assert len(df) == len(train)\n", "\n", "    df = pd.merge(df, 
train, on=['Id'], how='inner')\n", "    train = None\n", "\n", "    columns = ['Id'] + categorical_columns + numerical_columns + ['Category']\n", "    df = df[columns]\n", "\n", "    # The validation dataset is a stratified random sample\n", "\n", "    validation = None\n", "    for category in CRIME_CATEGORY.keys():\n", "        category_samples = df.loc[df['Category']==category]\n", "\n", "        sample_size = len(category_samples) * VALIDATION_SIZE\n", "        sample_size = int(np.round(sample_size, 0))\n", "\n", "        sample = category_samples.sample(n=sample_size, replace=False, random_state=RANDOM_STATE)\n", "\n", "        validation = (\n", "            validation.append(sample).reset_index(drop=True)\n", "            if validation is not None\n", "            else sample.copy(deep=True)\n", "        )\n", "\n", "    validation.to_csv(window_fnames['validation'], index=False)\n", "\n", "    df = df.loc[~df['Id'].isin(validation['Id'].values)]\n", "    df.to_csv(window_fnames['train'], index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Baseline Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's build a strong baseline model according to the following criteria:\n", "\n", "1. The training dataset will be used to learn prediction models.\n", "2. The validation dataset will be used to rank prediction models, as they haven't seen these data during their training process.\n", "3. The entire set of features will be used.\n", "4. A set of several machine learning algorithms will be used to learn prediction models. Thus, a prediction model will be learned for each combination of algorithm and time window.\n", "5. There will be no hyperparameter optimization. Machine learning algorithms will be used with their default hyperparameter values."
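, "\n", "\n", "For reference, the multi-class logarithmic loss used to rank the models is shown below in its standard form; the symbols $n$, $M$, $y_{ij}$, and $p_{ij}$ are notation introduced here for illustration:\n", "\n", "\\begin{equation*}\n", "logloss = -\\frac{1}{n} \\sum_{i=1}^{n} \\sum_{j=1}^{M} y_{ij} \\log(p_{ij})\n", "\\end{equation*}\n", "\n", "where $n$ is the number of records being evaluated, $M = 39$ is the number of crime categories, $y_{ij}$ is 1 if record $i$ belongs to category $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that record $i$ belongs to category $j$. As a sanity check, a model that assigns a uniform probability of $1/39$ to every category scores $\\ln(39) \\approx 3.6636$, so a useful baseline should fall below that value."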
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def identify_attributes(\n", "        df, attributes, time_window, window_type,\n", "        attributes_to_exclude=['Id', 'PdDistrict', 'Category']\n", "    ):\n", "    \"\"\"Identify the attribute names the sets of numerical and categorical attributes consist of.\n", "\n", "    Returns\n", "    -------\n", "    list\n", "        List of attribute names the set of numerical features consists of.\n", "    list\n", "        List of attribute names the set of categorical features consists of.\n", "    \"\"\"\n", "    columns = df.columns\n", "\n", "    num_attr_re = re.compile(\n", "        r'{}H_'.format('[0-9]{2,3}' if window_type == 'cumulative' else time_window)\n", "    )\n", "    num_attr = (\n", "        [col for col in columns if num_attr_re.match(col)]\n", "        if attributes in ['num', 'all']\n", "        else []\n", "    )\n", "\n", "    cat_attr = []\n", "    for col in columns:\n", "        if attributes == 'num':\n", "            break\n", "\n", "        if (col not in attributes_to_exclude\n", "                and not re.match(r'[0-9]{2,3}H_', col)):\n", "            cat_attr.append(col)\n", "\n", "    return num_attr, cat_attr" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def build_model(\n", "        train, validation,\n", "        attributes, y_column,\n", "        time_window, window_type,\n", "        estimator, param_grid,\n", "        predict_proba,\n", "        return_pred=False,\n", "        return_clf=False\n", "    ):\n", "    \"\"\"Build a prediction model.\n", "\n", "    Parameters\n", "    ----------\n", "    train : pd.DataFrame\n", "\n", "    validation : pd.DataFrame\n", "\n", "    attributes : str\n", "        The set of attributes to be used as input by the estimator.\n", "\n", "        - If \"cat\", then the date-derived attributes are used.\n", "        - If \"num\", then the aggregated attributes computed using the time window are used.\n", "        - If \"all\", then the union of the attributes \"cat\" and \"num\" is used.\n", "\n", "        Bear in mind that, whatever the set of attributes,\n", "        the attribute \"PdDistrict\" 
is always used.\n", "\n", "    y_column : str\n", "        Which is the target variable? This variable must\n", "        be in both training and validation datasets.\n", "\n", "    time_window : int\n", "        Time window (in hours).\n", "\n", "    window_type : str\n", "        Whether to use other windows whose time in hours is less than the specified time window value.\n", "\n", "        - If \"exact\", then aggregated attributes that were computed using only the specified time\n", "          window, are used.\n", "        - If \"cumulative\", then aggregated attributes that were computed using windows whose time\n", "          in hours is less than the specified time window, are also used.\n", "\n", "    estimator\n", "        A scikit-learn classification estimator.\n", "\n", "    param_grid : dict\n", "        Dictionary of parameters, as well as their corresponding values, for the estimator.\n", "        This enables an exhaustive search over the specified parameter values through cross-validation.\n", "\n", "    predict_proba : bool\n", "        Whether or not the estimator supports the \"predict_proba()\" method.\n", "\n", "    return_pred : bool\n", "\n", "    return_clf : bool\n", "    \"\"\"\n", "    num_attr, cat_attr = identify_attributes(train, attributes, time_window, window_type)\n", "\n", "    if ('PdDistrict' in train.columns\n", "            and 'PdDistrict' not in cat_attr):\n", "        cat_attr.insert(0, 'PdDistrict')\n", "\n", "    X_train_cat = train[cat_attr].to_numpy()\n", "    X_valid_cat = validation[cat_attr].to_numpy()\n", "\n", "    X_train_num = None\n", "    X_valid_num = None\n", "\n", "    scaler = StandardScaler()\n", "    if attributes in ['num', 'all']:\n", "        X_train_num = scaler.fit_transform(train[num_attr].to_numpy().astype(float))\n", "        X_valid_num = scaler.transform(validation[num_attr].to_numpy().astype(float))\n", "\n", "    X_train = (\n", "        np.hstack([X_train_cat, X_train_num])\n", "        if X_train_num is not None\n", "        else X_train_cat\n", "    )\n", "\n", "    X_valid = (\n", "        np.hstack([X_valid_cat, X_valid_num])\n", "        if X_valid_num is not None\n", "        else 
X_valid_cat\n", " )\n", " \n", " y_train = train[y_column].values.astype(int)\n", " y_valid = validation[y_column].values.astype(int)\n", " \n", " clf = estimator\n", " \n", " if isinstance(param_grid, dict):\n", " cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)\n", " \n", " scoring = 'neg_log_loss' if predict_proba else 'f1_macro'\n", " clf = GridSearchCV(\n", " estimator=estimator, param_grid=param_grid, cv=cv,\n", " n_jobs=5, scoring=scoring, iid=False, refit=True\n", " )\n", " \n", " clf.fit(X_train, y_train)\n", " \n", " y_pred = clf.predict_proba(X_valid) if predict_proba else clf.predict(X_valid) \n", " \n", " score = log_loss(y_valid, y_pred)\n", " \n", " if not return_pred and not return_clf:\n", " return score\n", " elif return_pred and not return_clf:\n", " return score, y_pred\n", " elif not return_pred and return_clf:\n", " return score, clf, scaler\n", " else:\n", " return score, y_pred, clf, scaler" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def find_baseline_model(\n", " time_windows, out_fname,\n", " district_to_idx=DISTRICT_TO_IDX,\n", " crime_category=CRIME_CATEGORY,\n", " algorithms=ALGORITHMS):\n", " \"\"\"Find the best baseline model, as specified above.\"\"\"\n", " results = []\n", " results_header = ['TimeWindow', 'Algorithm', 'LogLoss']\n", " \n", " for time_window in time_windows:\n", " window_path = {\n", " '/': os.path.join(DATA_PATH, '{}H'.format(time_window))\n", " }\n", " \n", " window_path['data'] = os.path.join(window_path['/'], 'data')\n", " \n", " train = pd.read_csv(os.path.join(window_path['data'], 'train.csv'))\n", " train = train.replace({'PdDistrict': district_to_idx, 'Category': crime_category})\n", " \n", " validation = pd.read_csv(os.path.join(window_path['data'], 'validation.csv'))\n", " validation = validation.replace({'PdDistrict': district_to_idx, 'Category': crime_category})\n", " \n", " for algo, settings in algorithms.items():\n", " score 
= build_model(\n", " train, validation,\n", " 'all', 'Category',\n", " time_window, 'cumulative',\n", " settings['estimator'], None,\n", " settings['predict_proba']\n", " )\n", " results.append([str(time_window), algo, '{:.5f}'.format(score)])\n", " \n", " pd.DataFrame(results, columns=results_header).to_csv(out_fname, index=False, mode='w')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finding a strong baseline model takes a long time. Therefore, the set of time windows is reduced to 24, 72, and 168 hours." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "TIME_WINDOWS.pop(0)\n", "\n", "if not os.path.isfile(FNAMES['baseline']):\n", " find_baseline_model(TIME_WINDOWS[:-1], FNAMES['baseline'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's analyze these results." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "baseline_results = pd.read_csv(FNAMES['baseline'])\n", "\n", "for col in baseline_results.columns:\n", " if col in ['Algorithm']:\n", " continue\n", " \n", " baseline_results[col] = pd.to_numeric(baseline_results[col])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Algorithm gb logit rf\n", "TimeWindow \n", "24 8.13907 2.60024 14.50282\n", "72 2.58249 2.59720 14.36502\n", "168 10.42041 2.59499 14.36123\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "chart_data = baseline_results.pivot(index='TimeWindow', columns='Algorithm', values='LogLoss')\n", "\n", "print(chart_data)\n", "\n", "chart_data.plot(kind='bar', grid=False, legend=True, x=None)\n", "\n", "plt.xlabel('Time Window')\n", "plt.ylabel('Log Loss')\n", "\n", "plt.yscale(\"log\")\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "Notably, Logistic Regression is chosen as the baseline algorithm, even though the best result, i.e., **2.58249**, was achieved by the Gradient Boosting algorithm on data aggregated using a time window of 72 hours. The reason is that Logistic Regression is the most computationally efficient algorithm, and the difference between its best result and the overall best result is negligible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's analyze how having a larger time window contributes to crime classification. The Logistic Regression algorithm will be used to conduct this analysis." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "FNAMES['logit-baseline'] = os.path.join(DATA_PATH, 'baseline-logit-models.csv')\n", "\n", "if not os.path.isfile(FNAMES['logit-baseline']):\n", " logit_baseline = baseline_results.loc[baseline_results['Algorithm']=='logit']\n", " logit_baseline = logit_baseline.drop(columns=['Algorithm'])\n", " \n", " time_window = 336\n", " \n", " window_data_path = os.path.join(DATA_PATH, '{}H'.format(time_window), 'data')\n", " \n", " train = pd.read_csv(os.path.join(window_data_path, 'train.csv'))\n", " train = train.replace({'PdDistrict': DISTRICT_TO_IDX, 'Category': CRIME_CATEGORY})\n", " \n", " validation = pd.read_csv(os.path.join(window_data_path, 'validation.csv'))\n", " validation = validation.replace({'PdDistrict': DISTRICT_TO_IDX, 'Category': CRIME_CATEGORY})\n", " \n", " algo = ALGORITHMS['logit']\n", " \n", " score = build_model(\n", " train, validation,\n", " 'all', 'Category',\n", " time_window, 'cumulative',\n", " algo['estimator'], None,\n", " algo['predict_proba']\n", " )\n", " \n", " logit_baseline = logit_baseline.append({'TimeWindow': time_window, 'LogLoss': score}, ignore_index=True)\n", " \n", " logit_baseline.to_csv(FNAMES['logit-baseline'], index=False, mode='w')" ] }, { 
"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " TimeWindow LogLoss\n", "0 24 2.600240\n", "1 72 2.597200\n", "2 168 2.594990\n", "3 336 2.594697\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "logit_baseline = pd.read_csv(FNAMES['logit-baseline'])\n", "\n", "logit_baseline = logit_baseline.astype({'TimeWindow': 'int32', 'LogLoss': 'float64'})\n", "\n", "print(logit_baseline)\n", "\n", "logit_baseline.plot(x='TimeWindow', y='LogLoss', kind='bar', grid=False, legend=False)\n", "\n", "plt.xlabel('Time Window')\n", "plt.ylabel('Log Loss')\n", "\n", "plt.yscale(\"log\")\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To sum up, the **baseline result** is **2.5947**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Discriminative Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal is to quantify how discriminative each set of features is, as well as each window type." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def discriminative_feature(\n", " time_window, algorithm, out_fname,\n", " district_to_idx=DISTRICT_TO_IDX,\n", " crime_category=CRIME_CATEGORY):\n", " \"\"\"Quantify how discriminative each set of features is, as well as each window type.\"\"\"\n", " window_data_path = os.path.join(DATA_PATH, '{}H'.format(time_window), 'data')\n", " \n", " train = pd.read_csv(os.path.join(window_data_path, 'train.csv'))\n", " train = train.replace({'PdDistrict': district_to_idx, 'Category': crime_category})\n", " \n", " validation = pd.read_csv(os.path.join(window_data_path, 'validation.csv'))\n", " validation = validation.replace({'PdDistrict': district_to_idx, 'Category': crime_category})\n", " \n", " grid = {\n", " 'attributes': ['all', 'num', 'cat'],\n", " 'window_type': ['exact', 'cumulative']\n", " }\n", " \n", " results = []\n", " results_header = ['Attributes', 'WindowType', 'LogLoss']\n", " \n", " for params in ParameterGrid(grid):\n", " score = build_model(\n", " train, validation,\n", " params['attributes'], 'Category',\n", " time_window, params['window_type'],\n", " algorithm['estimator'], None,\n", " algorithm['predict_proba']\n", " )\n", " \n", " results.append([params['attributes'], params['window_type'], '{:.5f}'.format(score)])\n", " \n", " pd.DataFrame(results, columns=results_header).to_csv(out_fname, index=False, mode='w')" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "FNAMES['discriminative-features'] = os.path.join(DATA_PATH, 'discriminative-features.csv')\n", "\n", "if not os.path.isfile(FNAMES['discriminative-features']):\n", " discriminative_feature(\n", " 336, ALGORITHMS['logit'], FNAMES['discriminative-features']\n", " )" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "discriminative_features = pd.read_csv(FNAMES['discriminative-features'])\n", "\n", 
"discriminative_features['LogLoss'] = pd.to_numeric(discriminative_features['LogLoss'])" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WindowType cumulative exact\n", "Attributes \n", "all 2.59470 2.60675\n", "cat 2.65614 2.65614\n", "num 2.60562 2.61889\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "chart_data = discriminative_features.pivot(index='Attributes', columns='WindowType', values='LogLoss')\n", "\n", "print(chart_data)\n", "\n", "chart_data.plot(kind='bar', grid=False, legend=True)\n", "\n", "plt.xlabel('Set of Attributes')\n", "plt.ylabel('Log Loss')\n", "\n", "plt.yscale(\"log\")\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results show that the union of the numerical and categorical attributes contributes the most to discriminative power. Likewise, aggregated attributes computed using cumulative time windows are more discriminative than those computed using an exact time window." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.4 Stacking Ensemble" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The methodology to develop a stacked generalization ensemble goes from splitting the initial dataset to producing the second-level predictions, as depicted in Figure 2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "<div align=\"center\" style=\"margin-top: 10px;\"><b>Figure 2</b>: Methodology to develop a stacked generalization ensemble</div>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Firstly, the initial dataset is split into $k$ folds through a stratified sampling method. In particular, the resulting folds are disjoint, i.e., a crime instance belongs to only one fold. 
Secondly, $k$ base models are trained, one per fold. Thirdly, the first-level predictions, i.e., the outputs from the base models, are combined depending on the stacking technique the meta-model implements. Finally, the meta-model outputs the second-level predictions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Furthermore, there are other considerations regarding the methodology described above, namely:\n", "\n", "1. The way the first-level predictions are stacked depends on the stacking technique. If the stacking technique is model averaging, then such predictions are summed element-wise and divided by the number of base models. Otherwise, matrices representing first-level predictions are stacked horizontally.\n", "2. Accordingly, there are two stacking techniques a meta-model might implement, namely: training a classifier or model averaging.\n", "3. Potentially, combining the predictions from several meta-models might be used to produce third-level predictions." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def write_in_file(fname, content, mode='w', insert_new_line=True):\n", " with open(fname, mode) as f:\n", " content = (content\n", " + ('\\n' if insert_new_line else ''))\n", " \n", " f.write(content)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "@numba.jit(nopython=True)\n", "def is_leap_year(year):\n", " return (True\n", " if (year % 4 == 0 and (year % 100 != 0 or year % 400 == 0))\n", " else False)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "@numba.jit(nopython=True)\n", "def get_number_of_days_in_month(data):\n", " n = len(data)\n", " \n", " days_in_month = [31, None, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]\n", " \n", " result = []\n", " for i in range(n):\n", " year = data[i,0]\n", " month = data[i,1]\n", " \n", " result.append(\n", " (29 if is_leap_year(year) else 28)\n", " if month == 2\n", " else 
days_in_month[month-1]\n", " )\n", " \n", " return result" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "def encode_cyclical_attributes(\n", " df,\n", " attributes=['Month', 'Day', 'DayOfWeek', 'Hour']):\n", " \"\"\"Encoding of cyclical attributes.\n", " \n", " Source: <http://blog.davidkaleko.com/feature-engineering-cyclical-features.html>\n", " \"\"\"\n", " df['days_month'] = get_number_of_days_in_month(df[['Year', 'Month']].to_numpy().astype(int))\n", " \n", " for attr in attributes:\n", " max_val = df[attr].max() if attr != 'Day' else df['days_month']\n", " min_val = df[attr].min()\n", " \n", " if min_val == 1:\n", " df[attr] -= 1\n", " elif min_val == 0:\n", " max_val += 1\n", " \n", " df['{}_Sin'.format(attr)] = np.sin(df[attr] * 2 * np.pi / max_val)\n", " df['{}_Cos'.format(attr)] = np.cos(df[attr] * 2 * np.pi / max_val)\n", " \n", " # Build a new list so that the mutable default argument is not modified across calls.\n", " df = df.drop(columns=list(attributes) + ['days_month'])\n", " \n", " return df" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def encode_categorical_attributes(df, attributes):\n", " \"\"\"One-Hot encoding of categorical attributes.\"\"\"\n", " return pd.get_dummies(df, prefix=attributes, columns=attributes)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "def split_training_data(df, n_splits, split_criterion='Category'):\n", " \"\"\"Split training data in a stratified way without replacement.\n", " \n", " Please note that if the number of class instances is less than the\n", " number of splits, a stratified random sample with replacement will\n", " be performed.\n", " \"\"\"\n", " criterion_values = {\n", " value: df.loc[df[split_criterion]==value].shape[0]\n", " for value in df[split_criterion].unique()\n", " }\n", " \n", " sampled_instances = []\n", " \n", " for i in range(n_splits):\n", " last_split = True if i == (n_splits-1) else False\n", " \n", " sample = None\n", 
" for value, size in criterion_values.items():\n", " mask = (df[split_criterion] == value) & (~df['Id'].isin(sampled_instances))\n", " criterion_sample = df.loc[mask]\n", " \n", " sample_size = size * (1/n_splits)\n", " sample_size = int(np.round(sample_size, 0))\n", " \n", " if sample_size < n_splits:\n", " criterion_sample = df.loc[(df[split_criterion] == value)]\n", " sample_size = np.min([n_splits, criterion_sample.shape[0]])\n", " \n", " if not last_split:\n", " criterion_sample = criterion_sample.sample(\n", " n=sample_size, replace=False, random_state=RANDOM_STATE\n", " )\n", " \n", " sample = (\n", " sample.append(criterion_sample).reset_index(drop=True)\n", " if sample is not None\n", " else criterion_sample.copy(deep=True)\n", " )\n", " \n", " sampled_instances += criterion_sample['Id'].values.tolist()\n", " \n", " yield sample" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "def build_meta_classifier(\n", " prediction_fnames,\n", " y_true,\n", " stack_method,\n", " estimator):\n", " \"\"\"Build a meta classifier on the basis of several learners and a stacking method.\"\"\"\n", " X = None\n", " \n", " for pred_fname in prediction_fnames:\n", " clf_pred = np.loadtxt(pred_fname, dtype=float, delimiter=',', skiprows=1)\n", " clf_pred = clf_pred[:,1:]\n", " \n", " if X is None:\n", " X = copy.deepcopy(clf_pred)\n", " continue\n", " \n", " X = np.add(X, clf_pred) if stack_method == 'soft_voting' else np.hstack([X, clf_pred])\n", " \n", " if stack_method == 'soft_voting':\n", " y_pred = X / len(prediction_fnames)\n", " else:\n", " cv = StratifiedKFold(n_splits=5, random_state=RANDOM_STATE)\n", " \n", " y_pred = cross_val_predict(\n", " estimator=estimator, X=X, y=y_true, cv=cv, n_jobs=5, method='predict_proba')\n", " \n", " return log_loss(y_true, y_pred)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def build_classifier_ensembles(\n", " out_fname,\n", " 
base_algorithm,\n", " meta_algorithm,\n", " crime_category=CRIME_CATEGORY,\n", " time_windows=TIME_WINDOWS):\n", " \"\"\"Build classifier ensembles.\"\"\"\n", " if not os.path.isfile(out_fname):\n", " content = [\n", " 'Id',\n", " 'TimeWindow',\n", " 'NumberOfEstimators',\n", " 'StackingMethod',\n", " 'EncodingOfCyclicalAttributes',\n", " 'LogLoss'\n", " ] \n", " write_in_file(out_fname, ','.join(content))\n", " \n", " prediction_header = ['' for j in range(len(crime_category))]\n", " for crime, idx in crime_category.items():\n", " prediction_header[idx] = crime\n", " \n", " ensemble_grid = {\n", " 'encode_cyclical_attr': ['cyclical', 'onehot'],\n", " 'n_estimators': [3, 5, 7],\n", " 'stack_method': ['soft_voting', 'clf'],\n", " 'time_window': time_windows\n", " }\n", " ensemble_grid = ParameterGrid(ensemble_grid)\n", " \n", " for setting_id, settings in enumerate(ensemble_grid):\n", " time_window = settings['time_window']\n", " \n", " window_path = {\n", " '/': os.path.join(DATA_PATH, '{}H'.format(time_window))\n", " } \n", " window_path['/data'] = os.path.join(window_path['/'], 'data')\n", " \n", " train = pd.read_csv(os.path.join(window_path['/data'], 'train.csv'))\n", " validation = pd.read_csv(os.path.join(window_path['/data'], 'validation.csv'))\n", " \n", " if settings['encode_cyclical_attr'] == 'cyclical':\n", " train = encode_cyclical_attributes(train)\n", " validation = encode_cyclical_attributes(validation)\n", " else:\n", " cyclical_attr = ['Month', 'Day', 'DayOfWeek', 'Hour']\n", " train = encode_categorical_attributes(train, cyclical_attr)\n", " validation = encode_categorical_attributes(validation, cyclical_attr)\n", " \n", " date_attr = [\n", " 'Quarter', 'Triannual', 'Semester', 'Fortnight',\n", " 'FourHourPeriod', 'SixHourPeriod', 'TwelveHourPeriod'\n", " ] \n", " train = encode_categorical_attributes(train, date_attr)\n", " validation = encode_categorical_attributes(validation, date_attr)\n", " \n", " train = encode_categorical_attributes(train, 
['PdDistrict'])\n", " validation = encode_categorical_attributes(validation, ['PdDistrict'])\n", " \n", " train = train.replace({'Category': crime_category})\n", " validation = validation.replace({'Category': crime_category})\n", " \n", " window_path['/prediction'] = os.path.join(\n", " window_path['/'],\n", " 'prediction',\n", " 'validation',\n", " 'n_estimators={}'.format(settings['n_estimators']),\n", " 'cyclical_attr={}'.format(settings['encode_cyclical_attr'])\n", " )\n", " \n", " if not os.path.isdir(window_path['/prediction']):\n", " os.makedirs(window_path['/prediction'])\n", " \n", " learners_result_fname = os.path.join(window_path['/prediction'], 'learners-result.csv')\n", " if not os.path.isfile(learners_result_fname):\n", " write_in_file(learners_result_fname, ','.join(['Clf', 'LogLoss']))\n", " \n", " prediction_fnames = []\n", " for i, train_sample in enumerate(split_training_data(train, settings['n_estimators'])):\n", " pred_fname = os.path.join(window_path['/prediction'], 'clf-{}-pred.csv'.format(i))\n", " prediction_fnames.append(pred_fname)\n", " \n", " if os.path.isfile(pred_fname):\n", " continue\n", " \n", " score, predictions = build_model(\n", " train_sample, validation,\n", " 'all', 'Category',\n", " time_window, 'cumulative',\n", " base_algorithm['estimator'], None,\n", " base_algorithm['predict_proba'],\n", " return_pred=True\n", " )\n", " \n", " write_in_file(\n", " learners_result_fname, ','.join([str(i), '{:.5f}'.format(score)]), mode='a'\n", " )\n", " \n", " predictions = pd.DataFrame(predictions, columns=prediction_header)\n", " \n", " predictions['Id'] = validation['Id'].values\n", " predictions = predictions[['Id']+prediction_header]\n", " \n", " predictions.to_csv(pred_fname, float_format='%.5f', index=False)\n", " \n", " score = build_meta_classifier(\n", " prediction_fnames,\n", " validation['Category'].values.astype(int),\n", " settings['stack_method'],\n", " meta_algorithm['estimator']\n", " )\n", " \n", " content = [\n", " 
str(setting_id),\n", " str(time_window),\n", " str(settings['n_estimators']),\n", " settings['stack_method'],\n", " settings['encode_cyclical_attr'],\n", " '{:.5f}'.format(score)\n", " ]\n", " write_in_file(out_fname, ','.join(content), mode='a')" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "FNAMES['classifier-ensembles'] = os.path.join(DATA_PATH, 'classifier-ensembles.csv')\n", "\n", "if not os.path.isfile(FNAMES['classifier-ensembles']):\n", " build_classifier_ensembles(\n", " FNAMES['classifier-ensembles'], ALGORITHMS['logit'], ALGORITHMS['logit']\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.4.1 Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before starting the analysis of results, some clarifications must be provided, namely:\n", "\n", "1. The Logistic Regression algorithm was used to train both the base (or first-level) models and the meta-model (or stacking model).\n", "2. There was no hyperparameter optimization. Hence, default hyperparameter values were used.\n", "3. The meta-model learns how to best combine the predictions from the base models.\n", "4. However, training a meta-model is not the only technique to combine predictions. Model averaging, or soft voting, is another technique used." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "stacking_results = pd.read_csv(FNAMES['classifier-ensembles']).drop(columns='Id')\n", "\n", "stacking_results = stacking_results.astype(\n", " {'TimeWindow': 'int32', 'NumberOfEstimators': 'int32', 'LogLoss': 'float64'}\n", " )\n", "\n", "stacking_results = stacking_results.sort_values(by='LogLoss')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First of all, let's catch a glimpse of the top ten meta-models, as shown by the following table:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<center><table>\n", "<thead>\n", "<tr><th style=\"text-align: right;\"> Rank</th><th style=\"text-align: right;\"> Time Window</th><th style=\"text-align: right;\"> Number of Estimators</th><th>Stacking Technique </th><th>Cyclical Attributes Encoding Technique </th><th style=\"text-align: right;\"> Log Loss</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><td style=\"text-align: right;\"> 1</td><td style=\"text-align: right;\"> 336</td><td style=\"text-align: right;\"> 7</td><td>Classifier </td><td>One-hot </td><td style=\"text-align: right;\"> 2.56077</td></tr>\n", "<tr><td style=\"text-align: right;\"> 2</td><td style=\"text-align: right;\"> 336</td><td style=\"text-align: right;\"> 7</td><td>Soft voting </td><td>One-hot </td><td style=\"text-align: right;\"> 2.56126</td></tr>\n", "<tr><td style=\"text-align: right;\"> 3</td><td style=\"text-align: right;\"> 168</td><td style=\"text-align: right;\"> 7</td><td>Classifier </td><td>One-hot </td><td style=\"text-align: right;\"> 2.56178</td></tr>\n", "<tr><td style=\"text-align: right;\"> 4</td><td style=\"text-align: right;\"> 168</td><td style=\"text-align: right;\"> 7</td><td>Soft voting </td><td>One-hot </td><td style=\"text-align: right;\"> 2.56212</td></tr>\n", "<tr><td style=\"text-align: right;\"> 5</td><td style=\"text-align: right;\"> 168</td><td 
style=\"text-align: right;\"> 5</td><td>Soft voting </td><td>One-hot </td><td style=\"text-align: right;\"> 2.56214</td></tr>\n", "<tr><td style=\"text-align: right;\"> 6</td><td style=\"text-align: right;\"> 72</td><td style=\"text-align: right;\"> 5</td><td>Soft voting </td><td>One-hot </td><td style=\"text-align: right;\"> 2.56239</td></tr>\n", "<tr><td style=\"text-align: right;\"> 7</td><td style=\"text-align: right;\"> 336</td><td style=\"text-align: right;\"> 5</td><td>Classifier </td><td>One-hot </td><td style=\"text-align: right;\"> 2.56241</td></tr>\n", "<tr><td style=\"text-align: right;\"> 8</td><td style=\"text-align: right;\"> 336</td><td style=\"text-align: right;\"> 5</td><td>Soft voting </td><td>One-hot </td><td style=\"text-align: right;\"> 2.56258</td></tr>\n", "<tr><td style=\"text-align: right;\"> 9</td><td style=\"text-align: right;\"> 72</td><td style=\"text-align: right;\"> 7</td><td>Soft voting </td><td>One-hot </td><td style=\"text-align: right;\"> 2.56261</td></tr>\n", "<tr><td style=\"text-align: right;\"> 10</td><td style=\"text-align: right;\"> 168</td><td style=\"text-align: right;\"> 3</td><td>Soft voting </td><td>One-hot </td><td style=\"text-align: right;\"> 2.56267</td></tr>\n", "</tbody>\n", "</table></center>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "header = [\n", " 'Rank',\n", " 'Time Window',\n", " 'Number of Estimators',\n", " 'Stacking Technique',\n", " 'Cyclical Attributes Encoding Technique',\n", " 'Log Loss'\n", " ]\n", "\n", "table = []\n", "\n", "for i, (idx, row) in enumerate(stacking_results.iloc[:10].iterrows()):\n", " stack_technique = row['StackingMethod']\n", " stack_technique = (\n", " 'Classifier' if stack_technique == 'clf' else stack_technique.replace('_', ' ')\n", " )\n", " stack_technique = stack_technique[0].upper() + stack_technique[1:]\n", " \n", " encoding_technique = row['EncodingOfCyclicalAttributes']\n", " 
encoding_technique = (\n", " 'one-hot' if encoding_technique == 'onehot' else encoding_technique\n", " )\n", " encoding_technique = encoding_technique[0].upper() + encoding_technique[1:]\n", " \n", " table.append([\n", " str(i+1),\n", " str(row['TimeWindow']),\n", " str(row['NumberOfEstimators']),\n", " stack_technique,\n", " encoding_technique,\n", " '{:.5f}'.format(row['LogLoss'])\n", " ])\n", "\n", "table = ('<center>'\n", " + tabulate.tabulate(table, header, tablefmt='html')\n", " + '</center>')\n", "\n", "display(HTML(table))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above results let us draw the following conclusions:\n", "\n", "1. Surprisingly, at least to me, the technique that best handles cyclical attributes is one-hot encoding. Mapping each cyclical attribute onto a circle, through the trigonometric functions sine and cosine, doesn't outperform the basic one-hot encoding.\n", "2. Model averaging, or soft voting, is a simple but powerful technique to combine predictions, as seven out of the top ten meta-models are built on it.\n", "3. The larger the number of estimators, the better the discriminative power of the meta-model, as the four best meta-models each combine seven base models.\n", "4. Likewise, the larger the time window, the better the discriminative power of the meta-model, as eight out of the top ten meta-models use a time window of at least 168 hours to aggregate features.\n", "5. Finally, it is worth mentioning that the best two meta-models combine the predictions from the same set of seven base models, but differ in the stacking technique." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "In total, <b>120</b> different base models were trained." 
], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "total_estimators = np.sum(stacking_results['NumberOfEstimators'].unique()\n", " * stacking_results['TimeWindow'].unique().shape[0]\n", " * stacking_results['EncodingOfCyclicalAttributes'].unique().shape[0])\n", "\n", "display(HTML('In total, <b>{}</b> different base models were trained.'.format(total_estimators)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second of all, let's plot the performance of the <b>48</b> meta-models having as splitting criterion the cyclical attributes encoding technique." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<Figure size 1296x648 with 2 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "stacking_results = stacking_results.replace(\n", " {'StackingMethod': {'clf': 'Classifier', 'soft_voting': 'Soft voting'}}\n", " )\n", "\n", "fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(18,9))\n", "\n", "alpha = 0.001\n", "y_range = [\n", " stacking_results['LogLoss'].min() - alpha,\n", " stacking_results['LogLoss'].max() + alpha\n", " ]\n", "\n", "for i, technique in enumerate(stacking_results['EncodingOfCyclicalAttributes'].unique()):\n", " mask = stacking_results['EncodingOfCyclicalAttributes'] == technique\n", " technique_results = stacking_results.loc[mask].drop(columns='EncodingOfCyclicalAttributes')\n", " \n", " technique_results = technique_results.pivot_table(\n", " index='TimeWindow', columns=['NumberOfEstimators', 'StackingMethod'], values='LogLoss'\n", " )\n", " \n", " technique_results.plot(kind='bar', ax=ax[i])\n", " \n", " technique = 'one-hot' if technique == 'onehot' else technique\n", " ax[i].set_title('{} Encoding Technique'.format(technique.title()))\n", " ax[i].set_xlabel('Time Window')\n", "\n", " ax[i].set_ylim(*y_range)\n", " 
ax[i].set_yscale(\"log\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overall, the above figures support the finding that the technique that best handles cyclical attributes is one-hot encoding. Likewise, findings with respect to the time window and the stacking technique might also be generalized." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Third of all, let's get insights into how much the meta-model strengthens the discriminative power of the base models. To accomplish this, a comparison between the two best meta-models and the set of seven base models they are built on is conducted." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<center><table>\n", "<thead>\n", "<tr><th style=\"text-align: right;\"> Base Model</th><th style=\"text-align: right;\"> Log Loss</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><td style=\"text-align: right;\"> 1</td><td style=\"text-align: right;\"> 2.5726 </td></tr>\n", "<tr><td style=\"text-align: right;\"> 2</td><td style=\"text-align: right;\"> 2.56885</td></tr>\n", "<tr><td style=\"text-align: right;\"> 3</td><td style=\"text-align: right;\"> 2.57124</td></tr>\n", "<tr><td style=\"text-align: right;\"> 4</td><td style=\"text-align: right;\"> 2.57141</td></tr>\n", "<tr><td style=\"text-align: right;\"> 5</td><td style=\"text-align: right;\"> 2.5707 </td></tr>\n", "<tr><td style=\"text-align: right;\"> 6</td><td style=\"text-align: right;\"> 2.56979</td></tr>\n", "<tr><td style=\"text-align: right;\"> 7</td><td style=\"text-align: right;\"> 2.57093</td></tr>\n", "</tbody>\n", "</table></center>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "best_setting = stacking_results.iloc[0]\n", "\n", "predictions_path = os.path.join(\n", " DATA_PATH,\n", " '{}H'.format(best_setting['TimeWindow']),\n", " 'prediction',\n", " 'validation',\n", " 
'n_estimators={}'.format(best_setting['NumberOfEstimators']),\n", " 'cyclical_attr={}'.format(best_setting['EncodingOfCyclicalAttributes'])\n", " )\n", "\n", "base_models_performance = pd.read_csv(os.path.join(predictions_path, 'learners-result.csv'))\n", "\n", "header = ['Base Model', 'Log Loss']\n", "\n", "table = [\n", " [str(i+1), '{:.5f}'.format(row['LogLoss'])]\n", " for i, (idx, row) in enumerate(base_models_performance.iterrows())\n", " ]\n", "\n", "table = ('<center>'\n", " + tabulate.tabulate(table, header, tablefmt='html')\n", " + '</center>')\n", "\n", "display(HTML(table))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this in mind, let's quantify how much the two best meta-models strengthen the discriminative power w.r.t. the best, the worst, and the average base model." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<center><table>\n", "<thead>\n", "<tr><th>Stacking Technique </th><th>Best Base Model </th><th>Worst Base Model </th><th>Average Base Model </th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><td>Classifier </td><td>0.315 % </td><td>0.460 % </td><td>0.390 % </td></tr>\n", "<tr><td>Soft voting </td><td>0.295 % </td><td>0.441 % </td><td>0.371 % </td></tr>\n", "</tbody>\n", "</table></center>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "header = [\n", " 'Stacking Technique',\n", " 'Best Base Model',\n", " 'Worst Base Model',\n", " 'Average Base Model'\n", " ]\n", "\n", "table = []\n", "\n", "comparison_values = [\n", " base_models_performance['LogLoss'].min(),\n", " base_models_performance['LogLoss'].max(),\n", " base_models_performance['LogLoss'].mean()\n", " ]\n", "\n", "for i in range(2):\n", " stack_technique = stacking_results.iloc[i]['StackingMethod']\n", " \n", " score = stacking_results.iloc[i]['LogLoss']\n", " \n", " comparison = ['{:.3f} %'.format((np.abs(score-val)/val)*100) 
for val in comparison_values]\n", "    comparison.insert(0, stack_technique)\n", "    \n", "    table.append(comparison)\n", "\n", "table = ('<center>'\n", "         + tabulate.tabulate(table, header, tablefmt='html')\n", "         + '</center>')\n", "\n", "display(HTML(table))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To conclude, let's make the following final remarks:\n", "\n", "1. The contributions of the two best meta-models to the discriminative power are almost negligible; all of them are below 0.5%. Moreover, deploying any of these meta-models together with its base models to a production environment might be infeasible.\n", "2. However, any gain, no matter how small, is worthwhile in the context of a Kaggle competition.\n", "3. The predictions from the two best meta-models on the test set will be used as late submissions to this Kaggle competition.\n", "4. The best meta-model outperforms the baseline, i.e., a multi-class logarithmic loss of 2.5947. The decrease (or gain) is around <b>1.31%</b>." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.4.2 Late Submission" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As pointed out above, the final predictions to submit are those from the two best meta-models. These meta-models are built on the predictions from the same set of seven base models, but differ from each other in the stacking technique. 
Therefore, let's recall how the two best meta-models are built, as shown by the following table:\n", "\n", "| Rank | Time Window | Number of Estimators | Stacking Technique | Cyclical Attributes Encoding Technique |\n", "|------|-------------|----------------------|--------------------|----------------------------------------|\n", "| 1 | 336 | 7 | Classifier | One-hot |\n", "| 2 | 336 | 7 | Soft voting | One-hot |" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "def train_and_persist_base_models(\n", " train_fname,\n", " validation_fname,\n", " time_window,\n", " n_estimators,\n", " base_algorithm,\n", " meta_algorithm,\n", " models_path\n", " ):\n", " \"\"\"Fit the base models on the training dataset and persist them.\"\"\"\n", " if not os.path.isdir(models_path):\n", " os.makedirs(models_path)\n", " \n", " results_fname = os.path.join(models_path, 'base-learners-result.csv')\n", " if not os.path.isfile(results_fname):\n", " write_in_file(results_fname, ','.join(['Clf', 'LogLoss']))\n", " \n", " train = pd.read_csv(train_fname)\n", " validation = pd.read_csv(validation_fname)\n", " \n", " cyclical_attr = [\n", " 'Month', 'Day', 'DayOfWeek', 'Hour'\n", " ]\n", " train = encode_categorical_attributes(train, cyclical_attr)\n", " validation = encode_categorical_attributes(validation, cyclical_attr)\n", " \n", " date_attr = [\n", " 'Quarter', 'Triannual', 'Semester', 'Fortnight',\n", " 'FourHourPeriod', 'SixHourPeriod', 'TwelveHourPeriod'\n", " ]\n", " train = encode_categorical_attributes(train, date_attr)\n", " validation = encode_categorical_attributes(validation, date_attr)\n", " \n", " train = encode_categorical_attributes(train, ['PdDistrict'])\n", " validation = encode_categorical_attributes(validation, ['PdDistrict'])\n", " \n", " train = train.replace({'Category': CRIME_CATEGORY})\n", " validation = validation.replace({'Category': CRIME_CATEGORY})\n", " \n", " X = None\n", " \n", " for i, train_sample in 
enumerate(split_training_data(train, n_estimators)):\n", " clf_fname = os.path.join(models_path, 'clf-{}.joblib'.format(i))\n", " scaler_fname = os.path.join(models_path, 'clf-{}-scaler.joblib'.format(i))\n", " \n", " if os.path.isfile(clf_fname):\n", " continue\n", " \n", " score, predictions, clf, scaler = build_model(\n", " train_sample, validation,\n", " 'all', 'Category',\n", " time_window, 'cumulative',\n", " base_algorithm['estimator'], None,\n", " base_algorithm['predict_proba'],\n", " return_pred=True,\n", " return_clf=True\n", " )\n", " \n", " X = copy.deepcopy(predictions) if X is None else np.hstack([X, predictions])\n", " \n", " write_in_file(results_fname, ','.join([str(i), '{:.5f}'.format(score)]), mode='a')\n", " \n", " joblib.dump(clf, clf_fname)\n", " joblib.dump(scaler, scaler_fname)\n", " \n", " clf_ensemble_fname = os.path.join(models_path, 'clf-ensemble.joblib')\n", " if os.path.isfile(clf_ensemble_fname):\n", " return\n", " \n", " n_cols = len(CRIME_CATEGORY) * n_estimators\n", " \n", " assert n_cols == X.shape[1]\n", " \n", " clf_ensemble = meta_algorithm['estimator']\n", " clf_ensemble.fit(X, validation['Category'].values.astype(int))\n", " \n", " joblib.dump(clf_ensemble, clf_ensemble_fname)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "def first_level_predictions(\n", " test_fname,\n", " time_window,\n", " n_estimators,\n", " models_path,\n", " predictions_path):\n", " \"\"\"Make predictions from the base models on the test set.\"\"\"\n", " prediction_header = ['' for j in range(len(CRIME_CATEGORY))]\n", " for crime, idx in CRIME_CATEGORY.items():\n", " prediction_header[idx] = crime\n", " \n", " if not os.path.isdir(predictions_path):\n", " os.makedirs(predictions_path)\n", " \n", " test = pd.read_csv(test_fname)\n", " \n", " cyclical_attr = [\n", " 'Month', 'Day', 'DayOfWeek', 'Hour'\n", " ]\n", " test = encode_categorical_attributes(test, cyclical_attr)\n", " \n", " date_attr = [\n", " 
'Quarter', 'Triannual', 'Semester', 'Fortnight',\n", " 'FourHourPeriod', 'SixHourPeriod', 'TwelveHourPeriod'\n", " ]\n", " test = encode_categorical_attributes(test, date_attr)\n", " \n", " test = encode_categorical_attributes(test, ['PdDistrict'])\n", " \n", " num_attr, cat_attr = identify_attributes(test, 'all', time_window, 'cumulative')\n", " \n", " X_cat = test[cat_attr].to_numpy()\n", " \n", " for i in range(n_estimators):\n", " clf_pred_fname = os.path.join(predictions_path, 'clf-{}-pred.csv'.format(i))\n", " if os.path.isfile(clf_pred_fname):\n", " continue\n", " \n", " scaler_fname = os.path.join(models_path, 'clf-{}-scaler.joblib'.format(i))\n", " scaler = joblib.load(scaler_fname)\n", " \n", " X_num = scaler.transform(test[num_attr].to_numpy().astype(float))\n", " \n", " X = np.hstack([X_cat, X_num])\n", " \n", " clf_fname = os.path.join(models_path, 'clf-{}.joblib'.format(i))\n", " clf = joblib.load(clf_fname)\n", " \n", " predictions = pd.DataFrame(clf.predict_proba(X), columns=prediction_header)\n", " \n", " predictions['Id'] = test['Id'].values\n", " predictions = predictions[['Id']+prediction_header]\n", " \n", " predictions.to_csv(clf_pred_fname, float_format='%.5f', index=False)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "def second_level_predictions(\n", " n_estimators, stack_method,\n", " models_path, predictions_path,\n", " out_fname\n", " ):\n", " \"\"\"Combine the predictions from the base models.\"\"\"\n", " prediction_header = ['' for j in range(len(CRIME_CATEGORY))]\n", " for crime, idx in CRIME_CATEGORY.items():\n", " prediction_header[idx] = crime\n", " \n", " X = None\n", " \n", " identifiers = None\n", " \n", " for i in range(n_estimators):\n", " pred_fname = os.path.join(predictions_path, 'clf-{}-pred.csv'.format(i))\n", " clf_pred = np.loadtxt(pred_fname, dtype=float, delimiter=',', skiprows=1)\n", " \n", " if identifiers is None:\n", " identifiers = 
clf_pred[:,0].astype(int).tolist()\n", " \n", " clf_pred = clf_pred[:,1:]\n", " \n", " if X is None:\n", " X = copy.deepcopy(clf_pred)\n", " continue\n", " \n", " X = np.add(X, clf_pred) if stack_method == 'soft_voting' else np.hstack([X, clf_pred])\n", " \n", " predictions = None\n", " \n", " if stack_method == 'soft_voting':\n", " predictions = X / n_estimators\n", " else:\n", " clf_ensemble_fname = os.path.join(models_path, 'clf-ensemble.joblib')\n", " clf_ensemble = joblib.load(clf_ensemble_fname)\n", " \n", " predictions = clf_ensemble.predict_proba(X)\n", " \n", " predictions = pd.DataFrame(predictions, columns=prediction_header)\n", " \n", " predictions['Id'] = identifiers\n", " predictions = predictions[['Id']+prediction_header]\n", " \n", " predictions.to_csv(out_fname, float_format='%.5f', index=False)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "def late_submission(\n", " time_window,\n", " n_estimators,\n", " stack_method,\n", " base_algorithm=ALGORITHMS['logit'],\n", " meta_algorithm=ALGORITHMS['logit']\n", " ):\n", " \"\"\"Make final predictions.\n", " \n", " These predictions are made on the test set.\n", " \"\"\"\n", " window_path = {'/': os.path.join(DATA_PATH, '{}H'.format(time_window))}\n", " \n", " window_path['/data'] = os.path.join(window_path['/'], 'data')\n", " \n", " window_path['/model'] = os.path.join(\n", " window_path['/'],\n", " 'model',\n", " 'n_estimators={}'.format(n_estimators),\n", " 'cyclical_attr=onehot'\n", " )\n", " \n", " window_path['/prediction'] = os.path.join(\n", " window_path['/'],\n", " 'prediction',\n", " 'test',\n", " 'n_estimators={}'.format(n_estimators),\n", " 'cyclical_attr=onehot'\n", " )\n", " \n", " train_base_models = False\n", " make_first_level_pred = False\n", " \n", " for i in range(n_estimators):\n", " clf_fname = os.path.join(window_path['/model'], 'clf-{}.joblib'.format(i))\n", " clf_pred_fname = os.path.join(window_path['/prediction'], 
'clf-{}-pred.csv'.format(i))\n", " \n", " if not os.path.isfile(clf_fname):\n", " train_base_models = True\n", " make_first_level_pred = True\n", " elif not os.path.isfile(clf_pred_fname):\n", " make_first_level_pred = True\n", " \n", " if train_base_models:\n", " break\n", " \n", " if train_base_models:\n", " train_and_persist_base_models(\n", " os.path.join(window_path['/data'], 'train.csv'),\n", " os.path.join(window_path['/data'], 'validation.csv'),\n", " time_window,\n", " n_estimators,\n", " base_algorithm,\n", " meta_algorithm,\n", " window_path['/model']\n", " )\n", " \n", " if make_first_level_pred:\n", " first_level_predictions(\n", " os.path.join(window_path['/data'], 'test.csv'),\n", " time_window,\n", " n_estimators,\n", " window_path['/model'],\n", " window_path['/prediction']\n", " )\n", " \n", " out_fname = os.path.join(window_path['/prediction'], '{}-ensemble-pred.csv'.format(stack_method))\n", " if not os.path.isfile(out_fname):\n", " second_level_predictions(\n", " n_estimators, stack_method,\n", " window_path['/model'], window_path['/prediction'],\n", " out_fname\n", " )" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "time_window = 336\n", "n_estimators = 7\n", "\n", "late_submission(time_window, n_estimators, 'clf')\n", "late_submission(time_window, n_estimators, 'soft_voting')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's combine the predictions from the two best meta-models and produce third-level ones. To accomplish this, second-level predictions are combined through soft voting." 
] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "def third_level_predictions(\n", "        time_window,\n", "        n_estimators,\n", "        stack_methods=('clf', 'soft_voting')\n", "    ):\n", "    \"\"\"Produce third-level predictions by combining those from the two best meta-models.\"\"\"\n", "    window_path = {'/': os.path.join(DATA_PATH, '{}H'.format(time_window))}\n", "    \n", "    window_path['/prediction'] = os.path.join(\n", "        window_path['/'],\n", "        'prediction',\n", "        'test',\n", "        'n_estimators={}'.format(n_estimators),\n", "        'cyclical_attr=onehot'\n", "        )\n", "    \n", "    X = None\n", "    \n", "    identifiers = None\n", "\n", "    # Sum the class probabilities of every meta-model listed in stack_methods.\n", "    for stack_method in stack_methods:\n", "        pred_fname = os.path.join(window_path['/prediction'], '{}-ensemble-pred.csv'.format(stack_method))\n", "        clf_pred = np.loadtxt(pred_fname, dtype=float, delimiter=',', skiprows=1)\n", "        \n", "        if identifiers is None:\n", "            identifiers = clf_pred[:,0].astype(int).tolist()\n", "        \n", "        clf_pred = clf_pred[:,1:]\n", "        \n", "        X = copy.deepcopy(clf_pred) if X is None else np.add(X, clf_pred)\n", "    \n", "    # Soft voting: average the summed probabilities over the combined meta-models.\n", "    predictions = X / len(stack_methods)\n", "    \n", "    prediction_header = ['' for j in range(len(CRIME_CATEGORY))]\n", "    for crime, idx in CRIME_CATEGORY.items():\n", "        prediction_header[idx] = crime\n", "\n", "    predictions = pd.DataFrame(predictions, columns=prediction_header)\n", "    \n", "    predictions['Id'] = identifiers\n", "    predictions = predictions[['Id']+prediction_header]\n", "\n", "    out_fname = os.path.join(window_path['/prediction'], 'third-level-pred.csv')\n", "    predictions.to_csv(out_fname, float_format='%.5f', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Meta-model Description | Log Loss |\n", "|------------------------------|----------|\n", "| The first-ranked meta-model | 2.56294 |\n", "| The second-ranked meta-model | 2.56539 |\n", "| Third-level predictions | 2.56044 |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above table shows 
the multi-class logarithmic loss of the predictions from the three meta-models on the test set. First, the two best meta-models generalize well to unseen data (i.e., the test set), since their test scores differ only slightly from their validation scores.\n", "\n", "Second, training a classifier to combine predictions from base models outperforms the straightforward technique of soft voting. The final decrease in *log loss* is around 0.1% when a classifier is used to produce second-level predictions.\n", "\n", "More importantly, the third-level predictions outperform those from the two best meta-models, albeit by a small margin. In particular, the decreases in *log loss* are 0.19% and 0.098% when compared to the performance of the second- and first-ranked meta-models, respectively.\n", "\n", "Overall, these results are satisfactory, although the best log loss achieved remains well above the best reported result for the competition, i.e., 1.95936." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Throughout this document, a feature engineering strategy has been developed to support the predictive power of an ensemble-based approach to crime classification.\n", "2. For each record in the dataset, this strategy derived several features from the history of crimes committed recently and within its area of influence.\n", "3. Then, several base models were trained to make first-level predictions, and the latter were combined to produce second-level predictions.\n", "4. Model averaging, or soft voting, is a simple but powerful stacking technique, as most of the top meta-models were built on it.\n", "5. However, training a classifier to combine the predictions from the base models outperformed the straightforward stacking technique of soft voting.\n", "6. 
In addition, the larger the number of base models, the better the predictive power of the meta-model.\n", "7. Likewise, the larger the time window, the better the predictive power of the meta-model, as most of the top meta-models used a time window of at least 168 hours to set the recency, i.e., the history of recently committed crimes.\n", "8. The best setting to build meta-models was to train seven base models, use a time window of 336 hours, and one-hot encode cyclical attributes.\n", "9. Finally, the best score obtained, i.e., a multi-class logarithmic loss of 2.56044, was the result of combining the second-level predictions." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }