{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "import seaborn as sns\n", "\n", "from mpl_toolkits.basemap import Basemap\n", "from catboost import CatBoostRegressor, Pool\n", "from sklearn.model_selection import RandomizedSearchCV\n", "from sklearn.metrics import SCORERS\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import Ridge, RidgeCV, LassoCV\n", "from scipy import sparse\n", "from sklearn.metrics import mean_squared_error, mean_absolute_error\n", "import random\n", "PATH = './data/'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from jupyterthemes import jtplot\n", "jtplot.style(theme='onedork')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file_name_cr = 'crime.csv'\n", "df_crime = pd.read_csv(PATH+file_name_cr, encoding = \"ISO-8859-1\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1. Dataset and features description" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.head()" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "\"crime.csv\" contains an array of data on crimes in the city of Boston for the years 2015-2018. The data array contains the following data:\n", "'INCIDENT_NUMBER' - incident number \n", "'OFFENSE_CODE' - offense code\n", "'OFFENSE_CODE_GROUP' - offense group code\n", "'OFFENSE_DESCRIPTION' - brief description of the incident\n", "'DISTRICT' - district\n", "'REPORTING_AREA' - reporting site\n", "'SHOOTING' - whether weapons were used\n", "'OCCURRED_ON_DATE' - date of offense\n", "'YEAR' - year\n", "'MONTH' - month\n", "'DAY_OF_WEEK' \n", "'HOUR' \n", "'UCR_PART' - Uniform Crime Reports\n", "'STREET' \n", "'Lat' - latitude\n", "'Long' - longitude\n", "'Location' - coordinate\n", "\n", "The target variables that we will predict are the crime scene, i.e. coordinates - longitude and latitude. Also try to predict the number of crimes per day.\n", "The task is relevant due to the fact that it will help prevent crimes and reduce the number of victims." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file_name_weather = 'Boston weather_clean.csv'\n", "df_weather = pd.read_csv(PATH+file_name_weather, encoding = \"ISO-8859-1\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_weather.tail()\n" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "\"Boston weather_clean.csv\" contains daily weather conditions for the city of Boston over the same period as \"crime.csv\":\n", "'Year' - year\n", "'Month' - month\n", "'Day' - day\n", "'High Temp (F)' - maximum temperature\n", "'Avg Temp (F)' - average temperature\n", "'Low Temp (F)' - minimum temperature\n", "'High Dew Point (F)' - maximum dew point\n", "'Avg Dew Point (F)' - average dew point\n", "'Low Dew Point (F)' - minimum dew point\n", "'High Humidity (%)' - maximum humidity\n", "'Avg Humidity (%)'\n", "'Low Humidity (%)'\n", "'High Sea Level Press (in)'\n", "'Avg Sea Level Press (in)'\n", "'Low Sea Level Press (in)'\n", "'High Visibility (mi)' - maximum visibility\n", "'Avg Visibility (mi)'\n", "'Low Visibility (mi)'\n", "'High Wind (mph)'\n", "'Avg Wind (mph)'\n", "'High Wind Gust (mph)'\n", "'Snowfall (in)' - snowfall\n", "'Precip (in)' - precipitation\n", "'Events' - weather events\n", "\n", "The weather information will hopefully help build a more accurate model for the target variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2. Exploratory data analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see which districts appear in the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.DISTRICT.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.DISTRICT = df_crime.DISTRICT.fillna('NON')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.SHOOTING = df_crime.SHOOTING.fillna('NON')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.UCR_PART.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.UCR_PART = df_crime.UCR_PART.fillna('Other')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(df_crime.STREET.unique())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.STREET = df_crime.STREET.fillna('Other')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime[np.isnan(df_crime.Lat)]['Location'].unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime = df_crime[~np.isnan(df_crime.Lat)]" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Check for missing values again: none remain." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.isnull().sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_weather.isnull().sum()" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Let's examine the data with simple tabular analysis,\n", "starting with the most frequent crimes."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.OFFENSE_CODE_GROUP.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.OFFENSE_CODE.value_counts()" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Now the coordinates. There are data-entry problems: the placeholder value -1 appears." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "min(df_crime.Lat), max(df_crime.Lat)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "min(df_crime.Long), max(df_crime.Long)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are outliers in the coordinates." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3. Visual analysis of the features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.OCCURRED_ON_DATE = df_crime.OCCURRED_ON_DATE.map(pd.to_datetime)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime['test_one'] = 1  # helper column for counting incidents" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g = df_crime.groupby(['YEAR','MONTH'])['test_one'].sum()\n", "fig = plt.figure(1, (12, 8))\n", "ax1 = fig.add_subplot(211)\n", "g.unstack(level=0).plot(kind='bar', grid = True, ax = ax1);\n", "ax2 = fig.add_subplot(212)\n", "\n", "g.plot(grid = True, ax = ax2);\n", "ax2.set_xticks(range(len(g)));\n", "ax2.set_xticklabels([\"%s-%02d\" % item for item in g.index.tolist()], rotation=90);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most crimes are committed in the summer and the fewest in the winter; 
the average monthly level stays roughly constant." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime.groupby('HOUR')['test_one'].sum().plot(kind='bar',figsize=(15,6), grid = True);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most crimes are committed in the afternoon and evening, the fewest in the early morning hours." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "order = ['Monday', 'Tuesday', 'Wednesday','Thursday','Friday','Saturday','Sunday']\n", "df_crime.groupby('DAY_OF_WEEK')['test_one'].sum().loc[order].plot(kind = 'pie', figsize=(7, 7), autopct='%.2f');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of crimes decreases toward the weekend." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g = df_crime[df_crime.SHOOTING != 'NON'].groupby(['YEAR','MONTH'])['test_one'].sum()\n", "fig = plt.figure(1, (12, 12))\n", "ax1 = fig.add_subplot(311)\n", "g.unstack(level=0).plot(kind='bar', grid = True, ax = ax1, stacked=True);\n", "ax2 = fig.add_subplot(312)\n", "g.plot(grid = True, ax = ax2);\n", "ax2.set_xticks(range(len(g)));\n", "ax2.set_xticklabels([\"%s-%02d\" % item for item in g.index.tolist()], rotation=90);\n", "\n", "g = df_crime[df_crime.SHOOTING != 'NON'].groupby('DAY_OF_WEEK')['test_one'].sum().loc[order]\n", "ax3 = fig.add_subplot(313)\n", "g.plot(kind='bar', grid = True, ax = ax3, stacked=True);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g = df_crime.groupby('STREET')['test_one'].sum().sort_values(ascending = False).head(30)\n", "fig = plt.figure(1, (12, 8))\n", "g.plot(kind='bar', grid = True);" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "The relationship between street and number of crimes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g = df_crime.groupby('OFFENSE_CODE_GROUP')['test_one'].sum().sort_values(ascending = False).head(20)\n", "fig = plt.figure(1, (12, 8))\n", "g.plot(kind='barh', grid = True);" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Distribution by district." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g = df_crime.groupby('DISTRICT')['test_one'].sum().sort_values(ascending = False).head(20)\n", "fig = plt.figure(1, (12, 8))\n", "g.plot(kind='pie', grid = True);" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Remove the coordinate outliers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime_pre = df_crime[(df_crime.Lat > 20) & (df_crime['Long']<-20)].copy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g = df_crime_pre.groupby('OFFENSE_CODE_GROUP')[['Lat', 'Long', 'test_one']].aggregate([np.sum, np.median]).sort_values(by = ('test_one','sum'), ascending = False)\n", "\n", "fig = plt.figure(1, (8, 8))\n", "plt.scatter(x = g[('Lat','median')].values, y = g[('Long','median')].values, marker = '*');" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "The plot shows the center of gravity of each crime type. 
Different crime types are concentrated in different parts of the city.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plt.figure(1, (8, 8))\n", "sns.scatterplot(x='Long', y='Lat', hue='DISTRICT', data=df_crime_pre)\n", "plt.legend(bbox_to_anchor=(1.05, 1), loc=2);" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Distribution of crimes by coordinates." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime_pre['Lat'].plot(kind='kde');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime_pre['Long'].plot(kind='kde');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_crime_pre['DAY'] = df_crime_pre['OCCURRED_ON_DATE'].dt.day" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Now merge in the weather data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_total = df_crime_pre.merge(df_weather, left_on=['YEAR', 'MONTH', 'DAY'], right_on=['Year', 'Month', 'Day'])" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Crime counts and temperature." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure();\n", "df_total['Low Temp (F)'].hist(alpha = 0.5, bins = 30);\n", "df_total['High Temp (F)'].hist(alpha = 0.5, bins = 30);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the actual temperature distribution over all days, for comparison\n", "plt.figure();\n", "df_weather['Low Temp (F)'].hist(alpha = 0.5, bins = 30);\n", "df_weather['High Temp (F)'].hist(alpha = 0.5, bins = 30);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_total.groupby('Events')['test_one'].aggregate(np.sum).plot.pie();\n", "# which weather events accompanied the crimes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the correlation between variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_corr(df,size=10):\n", "    corr = df.corr()\n", "    fig, ax = plt.subplots(figsize=(size, size))\n", "    cs = ax.matshow(corr, interpolation='none', cmap='plasma')\n", "    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90);\n", "    plt.yticks(range(len(corr.columns)), corr.columns);\n", "    cbar = fig.colorbar(cs, ax=ax, shrink=0.7) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_dummies_OFFENSE_CODE_GROUP = pd.get_dummies(df_total['OFFENSE_CODE_GROUP'])\n", "df_dummies_Events = pd.get_dummies(df_total['Events'])\n", "df_dummies_DISTRICT = pd.get_dummies(df_total['DISTRICT'])\n", "#df_dummies_SHOOTING = pd.get_dummies(df_total['SHOOTING'])\n", "df_dummies_UCR_PART = pd.get_dummies(df_total['UCR_PART'])\n", "df_dummies_STREET = pd.get_dummies(df_total['STREET'])\n", "\n", "df_new = pd.concat([df_dummies_OFFENSE_CODE_GROUP, df_dummies_Events, df_dummies_DISTRICT, df_dummies_UCR_PART, df_total[['Lat', 'Long']], df_total.iloc[:, 23:42]], axis=1)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "plot_corr(df_new, 25)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "It can be seen that there is no correlation between the weather and the type of crime. There is a connection between weather clones with each other." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12,12))\n", "map = Basemap(llcrnrlon=-71.2, llcrnrlat=42.2, urcrnrlon=-70.95, urcrnrlat=42.46, epsg=4269)\n", " #http://server.arcgisonline.com/arcgis/rest/services\n", " #EPSG Number of America is 4269\n", "#World_Street_Map\n", "#World_Topo_Map\n", "map.arcgisimage(service='World_Street_Map', xpixels = 1000, verbose= True)\n", "for i in range(100):\n", " xpt,ypt = map(df_crime.Long[i], df_crime.Lat[i])\n", " map.scatter(xpt,ypt,c = 'b')\n", "\n", "# airport coorinates\n", "coo_aero = (-71.01, 42.36)\n", "xpt,ypt = map(coo_aero[0],coo_aero[1])\n", "map.scatter(xpt,ypt,c = 'r')\n", "# south station\n", "coo_v1 = (-71.06, 42.34)\n", "xpt,ypt = map(coo_v1[0],coo_v1[1])\n", "map.scatter(xpt,ypt,c = 'r')\n", "plt.show()" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "And now let's look at the map a few points with crimes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4. Patterns, insights, pecularities of data" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "based on the lack of correlation and almost uniform distribution of the number of crimes, the probability of predicting the crime scene is small." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "raw", "metadata": {}, "source": [ "As a metric we will use MAE. MAE because it is less sensitive to outliers, and our prediction area is small, the median is better.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 5. Data preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 6. Feature engineering and description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 7. Cross-validation, hyperparameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 8. Validation and learning curves" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 9. Prediction for hold-out and test samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 10. Model evaluation with metrics description" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "raw", "metadata": {}, "source": [ "Let's try several regression methods, starting with simple, linear ones and ending with a library special for analyzing time rads." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = df_total.copy()\n", "X = X[X.OFFENSE_CODE == 3006]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X.drop(['INCIDENT_NUMBER', 'Location', 'test_one', 'Day', 'Month', 'Year'], axis = 1, inplace=True)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Since we predict coordinates, we have to remove everything that indirectly encodes location: street, district, reporting area.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X['OFFENSE_CODE'] = X.OFFENSE_CODE.astype('category')\n", "X['YEAR'] = X.YEAR.astype('category')\n", "X['MONTH'] = X.MONTH.astype('category')\n", "X['HOUR'] = X.HOUR.astype('category')\n", "X['DAY'] = X.DAY.astype('category')\n", "X.drop(['STREET'], axis = 1, inplace=True)\n", "X.drop(['DISTRICT'], axis = 1, inplace=True)\n", "X.drop(['REPORTING_AREA'], axis = 1, inplace=True)\n", "X.drop(['OFFENSE_DESCRIPTION'], axis = 1, inplace=True)\n", "\n", "X_hot = pd.get_dummies(X)\n", "X_hot.drop(['OCCURRED_ON_DATE'], axis = 1, inplace=True)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "For linear models we standardize the numerical features.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ss = StandardScaler()\n", "\n", "col = [x for x in X_hot.columns if ((X_hot[x].dtype == np.float64 or X_hot[x].dtype == np.int64) and (x !='Lat') and (x != 'Long'))]\n", "col" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_ss = pd.DataFrame(ss.fit_transform(X_hot[col].values), columns=col)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_hot.drop(col, axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_hot_ss = X_hot.reset_index(drop=True).join(X_ss.reset_index(drop=True))\n", "X_hot_ss.shape\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y1 = X_hot_ss['Lat']\n", "y2 = X_hot_ss['Long']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_hot_ss.drop(['Lat', 'Long'], axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_val, y1_train, y1_val, y2_train, y2_val = train_test_split(X_hot_ss, y1, y2, random_state = 42)" ] }, { 
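"cell_type": "markdown", "metadata": {}, "source": [ "Before fitting any model, a hedged baseline sketch: predict the training median for every validation point and measure the MAE. Any useful model should beat these numbers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Baseline sketch: constant (median) prediction for each coordinate.\n", "base_Lat = np.full(len(y1_val), y1_train.median())\n", "base_Long = np.full(len(y2_val), y2_train.median())\n", "mean_absolute_error(y1_val, base_Lat), mean_absolute_error(y2_val, base_Long)" ] }, { 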
"cell_type": "markdown", "metadata": {}, "source": [ "## RidgeCV" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "model1_cv = RidgeCV(alphas= (0.01, 0.05,0.1,0.7,1,2), cv=7)\n", "model2_cv = RidgeCV(alphas= (0.01, 0.05,0.1,0.7,1,2), cv=7)\n", "#model = LinearRegression()\n", "\n", "rid1_cv = model1_cv.fit(X_train, y1_train)\n", "rid2_cv = model2_cv.fit(X_train, y2_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pred_Lat_cv = rid1_cv.predict(X_val)\n", "mse_Lat_cv = mean_absolute_error(y1_val, pred_Lat_cv)\n", "pred_Long_cv = rid2_cv.predict(X_val)\n", "mse_Long_cv = mean_absolute_error(y2_val, pred_Long_cv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mse_Lat_cv, mse_Long_cv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dd_Lat_cv = pd.DataFrame({'tr':y1_val, 'pred': pred_Lat_cv, 'del': (y1_val - pred_Lat_cv)}).sort_values(by=['del'])\n", "dd_Long_cv = pd.DataFrame({'tr':y2_val, 'pred': pred_Long_cv, 'del': (y2_val - pred_Long_cv)}).sort_values(by=['del'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plt.figure(1, (12, 12))\n", "ax1 = fig.add_subplot(211)\n", "\n", "dd_Lat_cv['del'].hist(bins = 100, ax = ax1, range = (-0.2,0.2), log = True)\n", "ax2 = fig.add_subplot(212)\n", "\n", "dd_Long_cv['del'].hist(bins = 100, ax = ax2, range = (-0.2,0.2), log= True) " ] }, { "cell_type": "raw", "metadata": {}, "source": [ "A very big mistake in determining the coordinates" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12,12))\n", "map = Basemap(llcrnrlon=-71.2, llcrnrlat=42.2, urcrnrlon=-70.95, urcrnrlat=42.46, epsg=4269)\n", " #http://server.arcgisonline.com/arcgis/rest/services\n", " #EPSG Number of America is 4269\n", "#World_Street_Map\n", "#World_Topo_Map\n", "map.arcgisimage(service='World_Topo_Map', xpixels = 1000, verbose= True)\n", "\n", "random.seed(3)\n", "for rr in range(100):\n", " rnd_int = random.randint(0,len(dd_Lat_cv))\n", " coordinat_true_Lat = y1[rnd_int] \n", " coordinat_true_Long = y2[rnd_int]\n", " coordinat_pred_Lat = pred_Lat_cv[rnd_int] \n", " coordinat_pred_Long = pred_Long_cv[rnd_int] \n", " \n", " xpt1,ypt1 = map(coordinat_true_Long,coordinat_true_Lat)\n", " map.scatter(xpt1,ypt1,c = 'r')\n", "\n", " xpt2,ypt2 = map(coordinat_pred_Long,coordinat_pred_Lat)\n", " map.scatter(xpt2,ypt2,c = 'b')\n", " \n" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "All predictions are grouped in median.\n", "Let's look at the important signs." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({'column': X_train.columns, 'coef':np.abs(model1_cv.coef_)}).sort_values(['coef'], ascending=False).head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({'column': X_train.columns, 'coef':np.abs(model2_cv.coef_)}).sort_values(['coef'], ascending=False).head(10)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "For the linear model the most important features were the hour and the average temperature, but the predictive accuracy is very low.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## RandomForestRegressor" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import GridSearchCV" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "#params = {'n_estimators':[160],'max_depth':[20]}\n", "rfc1=RandomForestRegressor(random_state=42, n_jobs= 3, n_estimators = 200)\n", "#rfc1_grid = GridSearchCV(rfc1,param_grid=params,cv=3,n_jobs= 3)\n", "rfc1.fit(X_train, y1_train)\n", "#rfc1=rfc1_grid.best_estimator_ # assign best model to rfc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rfc1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(pd.Series(rfc1.feature_importances_, index=X_train.columns)\n", "   .nlargest(20)\n", "   .plot(kind='barh'))\n", "plt.title('Feature Importance') # top 20 features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "rfc2=RandomForestRegressor(random_state=42, n_jobs= 3, n_estimators = 200)\n", "#rfc2_grid = GridSearchCV(rfc2,param_grid=params,cv=5, n_jobs= 3)\n", "rfc2.fit(X_train, y2_train)\n", "#rfc2=rfc2_grid.best_estimator_ # assign best model to rfc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rfc2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(pd.Series(rfc2.feature_importances_, index=X_train.columns)\n", "   .nlargest(20)\n", "   .plot(kind='barh'))\n", "plt.title('Feature Importance') # top 20 features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pred_Lat_rfr = rfc1.predict(X_val)\n", "mse_Lat_rfr = mean_absolute_error(y1_val, pred_Lat_rfr)\n", "pred_Long_rfr = rfc2.predict(X_val)\n", "mse_Long_rfr = mean_absolute_error(y2_val, pred_Long_rfr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mse_Lat_rfr, mse_Long_rfr" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mse_Lat_cv, mse_Long_cv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12,12))\n", "map = Basemap(llcrnrlon=-71.2, llcrnrlat=42.2, urcrnrlon=-70.95, urcrnrlat=42.46, epsg=4269)\n", "map.arcgisimage(service='World_Topo_Map', xpixels = 1000, verbose= True)\n", "\n", "random.seed(3)\n", "for rr in range(100):\n", "    rnd_int = random.randint(0, len(pred_Lat_rfr) - 1)\n", "    coordinat_true_Lat = y1_val.values[rnd_int]\n", "    
coordinat_true_Long = y2_val.values[rnd_int]\n", "    coordinat_pred_Lat = pred_Lat_rfr[rnd_int]\n", "    coordinat_pred_Long = pred_Long_rfr[rnd_int]\n", "\n", "    xpt1,ypt1 = map(coordinat_true_Long,coordinat_true_Lat)\n", "    map.scatter(xpt1,ypt1,c = 'r')\n", "\n", "    xpt2,ypt2 = map(coordinat_pred_Long,coordinat_pred_Lat)\n", "    map.scatter(xpt2,ypt2,c = 'b')\n", "\n", "    map.plot((xpt1,xpt2),(ypt1,ypt2),c = 'y')" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "A similar result. For the forests the most important features were the time features and the sea-level pressure. The accuracy is still low, even slightly worse than the linear model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LassoCV" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model3 = LassoCV(random_state = 42)\n", "model4 = LassoCV(random_state = 42)\n", "\n", "model3.fit(X_train, y1_train)\n", "model4.fit(X_train, y2_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pred_Lat_3 = model3.predict(X_val)\n", "mse_Lat_3 = mean_absolute_error(y1_val, pred_Lat_3)\n", "\n", "pred_Long_3 = model4.predict(X_val)\n", "mse_Long_3 = mean_absolute_error(y2_val, pred_Long_3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mse_Lat_3, mse_Long_3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dd_Lat_lasso = pd.DataFrame({'tr':y1_val, 'pred': pred_Lat_3, 'del': (y1_val - pred_Lat_3)}).sort_values(by=['del'])\n", "dd_Long_lasso = pd.DataFrame({'tr':y2_val, 'pred': pred_Long_3, 'del': (y2_val - pred_Long_3)}).sort_values(by=['del'])\n", "\n", "fig = plt.figure(1, (12, 12))\n", "ax1 = fig.add_subplot(211)\n", "\n", "dd_Lat_lasso['del'].hist(bins = 100, ax = ax1, range = (-0.2,0.2), log = True)\n", "ax2 = fig.add_subplot(212)\n", "\n", "dd_Long_lasso['del'].hist(bins = 100, ax = ax2, range = (-0.2,0.2), log= True) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12,12))\n", "map = Basemap(llcrnrlon=-71.2, llcrnrlat=42.2, urcrnrlon=-70.95, urcrnrlat=42.46, epsg=4269)\n", "map.arcgisimage(service='World_Topo_Map', xpixels = 1000, verbose= True)\n", "\n", "random.seed(3)\n", "\n", "for rr in range(100):\n", "\n", "    rnd_int = random.randint(0, len(pred_Lat_3) - 1)\n", "    coordinat_true_Lat = y1_val.values[rnd_int]\n", "    coordinat_true_Long = y2_val.values[rnd_int]\n", "    coordinat_pred_Lat = pred_Lat_3[rnd_int]\n", "    coordinat_pred_Long = pred_Long_3[rnd_int]\n", "\n", "    xpt1,ypt1 = map(coordinat_true_Long,coordinat_true_Lat)\n", "    map.scatter(xpt1,ypt1,c = 'r')\n", "\n", "    xpt2,ypt2 = map(coordinat_pred_Long,coordinat_pred_Lat)\n", "    map.scatter(xpt2,ypt2,c = 'b')\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model3.intercept_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({'column': X_train.columns, 'coef':np.abs(model3.coef_)}).sort_values(['coef'], ascending=False).head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({'column': X_train.columns, 'coef':np.abs(model4.coef_)}).sort_values(['coef'], ascending=False).head(10)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "The second regularized linear model, Lasso, showed the worst result of all.\n" ] }, { "cell_type": "markdown", 
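"metadata": {}, "source": [ "Before moving on to boosting, a small summary sketch that collects the validation MAEs computed above into one table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Validation MAE per model, from the variables computed above.\n", "pd.DataFrame({'Lat MAE': [mse_Lat_cv, mse_Lat_rfr, mse_Lat_3],\n", "              'Long MAE': [mse_Long_cv, mse_Long_rfr, mse_Long_3]},\n", "             index=['RidgeCV', 'RandomForest', 'LassoCV'])" ] }, { "cell_type": "markdown", 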
"metadata": {}, "source": [ "## catboost" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_total.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for c in df_total.columns:\n", " col_type = df_total[c].dtype\n", " if col_type == 'object' or col_type.name == 'category':\n", " df_total[c] = df_total[c].astype('category')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_total = df_total.copy()\n", "X_total.drop(['STREET','DISTRICT','REPORTING_AREA','OFFENSE_DESCRIPTION','Lat', 'Long','Location','INCIDENT_NUMBER','OCCURRED_ON_DATE','test_one','Year', 'Month', 'Day'], axis = 1, inplace=True) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_total = X_total[X_total['OFFENSE_CODE']==3006]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_val, y1_train, y1_val, y2_train, y2_val = train_test_split(X_total, y1, y2, random_state = 42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i, c in enumerate(X_total.columns):\n", " col_type = df_total[c].dtype\n", " if col_type == 'object' or col_type.name == 'category':\n", " print(i)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SCORERS.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param = {'iterations':3000, 'depth':10, 'learning_rate':0.03, 'loss_function':'RMSE', 'random_seed' : 42, \n", " 'task_type' : \"GPU\", 'cat_features' : [1,2,5,7,29],'verbose':0}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_y1 = CatBoostRegressor(**param)\n", "#catboost_pool = Pool(X_train, y1_train, cat_features = [1,2,5,7,29])\n", "model_y1.fit(X_train, y1_train)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plt.figure(1, (12, 12))\n", "plt.plot(np.log(model_y1.evals_result_['learn']['RMSE']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "#rsCV_y1 = RandomizedSearchCV(model_y1, param_distributions={'iterations':[100,500,1000,2000], 'depth':[i for i in range(2,15)]}, n_iter=20, n_jobs=1, random_state=42, scoring='neg_median_absolute_error')\n", "#rsCV_y1.fit(X_train, y1_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_y2 = CatBoostRegressor(**param )\n", "model_y2.fit(X_train, y2_train)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "#rsCV_y2 = RandomizedSearchCV(model_y2, param_distributions={'iterations':[100,500,1000,2000], 'depth':[i for i in range(2,15)]}, n_iter=20, n_jobs=1, random_state=42, scoring='neg_median_absolute_error')\n", "#rsCV_y2.fit(X_train, y2_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"#rsCV_y1.best_params_, rsCV_y1.best_params_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#y1_pred = rsCV_y1.best_estimator_.predict(X_val)\n", "#y2_pred = rsCV_y2.best_estimator_.predict(X_val)\n", "y1_pred = model_y1.predict(X_val)\n", "y2_pred = model_y2.predict(X_val)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y1_pred, y2_pred" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mse_Lat_boost = mean_absolute_error(y1_val, y1_pred)\n", "mse_Long_boost = mean_absolute_error(y2_val, y2_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mse_Lat_boost, mse_Long_boost" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12,12))\n", "map = Basemap(llcrnrlon=-71.2, llcrnrlat=42.2, urcrnrlon=-70.95, urcrnrlat=42.46, epsg=4269)\n", " #http://server.arcgisonline.com/arcgis/rest/services\n", " #EPSG Number of America is 4269\n", "#World_Street_Map\n", "#World_Topo_Map\n", "map.arcgisimage(service='World_Topo_Map', xpixels = 1000, verbose= True)\n", "\n", "random.seed(3)\n", "for rr in range(100):\n", " rnd_int = random.randint(0,len(y1_pred))\n", " coordinat_true_Lat = y1[rnd_int] \n", " coordinat_true_Long = y2[rnd_int]\n", " coordinat_pred_Lat = y1_pred[rnd_int] \n", " coordinat_pred_Long = y2_pred[rnd_int] \n", " \n", " xpt1,ypt1 = map(coordinat_true_Long,coordinat_true_Lat)\n", " map.scatter(xpt1,ypt1,c = 'r')\n", "\n", " xpt2,ypt2 = map(coordinat_pred_Long,coordinat_pred_Lat)\n", " map.scatter(xpt2,ypt2,c = 'b')\n", "\n", " # map.plot((xpt1,xpt2),(ypt1,ypt2),c = 'y')\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feat_imp = pd.Series(model_y2.feature_importances_, index=X_train.columns)\n", "feat_imp.nlargest(20).plot(kind='barh', figsize=(8,10))" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "It seems that the ensemble of trees gives a better result, but all the same, the accuracy is very bad. Predicting where a crime will happen is impossible.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "raw", "metadata": {}, "source": [ "An analysis of the predictions and size of the loss function in a deferred sample tells us that it is not possible to build a general model for perdogazaniya all types of crimes. Now we will try to predict only one type of crime using the theory of time series." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X.columns\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = X  # work on the single-offense subset\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "aggr_df = df.groupby('OCCURRED_ON_DATE')[['Lat']].count()\n", "aggr_df.columns = ['Lat_count']\n", "aggr_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "daily_df = aggr_df.resample('D').sum()\n", "daily_df.head(n=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plt.figure(figsize=(30,8))\n", "plt.plot(daily_df.index, daily_df.Lat_count)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weekly_df = daily_df.resample('W').sum() " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plt.figure(figsize=(30,8))\n", "plt.plot(weekly_df.index, weekly_df.Lat_count)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fbprophet import Prophet\n", "\n", "import logging\n", "logging.getLogger().setLevel(logging.ERROR)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = daily_df\n", "df.columns = ['y']\n", "df.index.name = 'ds'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prediction_size = 90\n", "train_df = df[:-prediction_size].copy()\n", "train_df['ds'] = train_df.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = Prophet()\n", "m.fit(train_df);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "future = m.make_future_dataframe(periods=prediction_size)\n", "future.tail(n=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecast = m.predict(future)\n", "forecast.tail(n=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plt.figure(figsize=(30,8))\n", "plt.plot(df.y.tail(200));\n", "plt.plot(forecast.ds.tail(prediction_size), forecast.yhat.tail(prediction_size));\n", "plt.plot(forecast.ds.tail(prediction_size), forecast.yhat_lower.tail(prediction_size));\n", "plt.plot(forecast.ds.tail(prediction_size), forecast.yhat_upper.tail(prediction_size));" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "The time-series forecast did not lead to a good result either: the predicted mean differs considerably from the actual values."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def make_comparison_dataframe(historical, forecast):\n", "    \"\"\"Join the history with the forecast.\n", "    \n", "    The resulting dataset will contain columns 'yhat', 'yhat_lower', 'yhat_upper' and 'y'.\n", "    \"\"\"\n", "    return forecast.set_index('ds')[['yhat', 'yhat_lower', 'yhat_upper']].join(historical)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cmp_df = make_comparison_dataframe(df, forecast)\n", "cmp_df.tail(n=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def calculate_forecast_errors(df, prediction_size):\n", "    df = df.copy()\n", "    df['e'] = df['y'] - df['yhat']\n", "    predicted_part = df[-prediction_size:]\n", "    error_mean = lambda error_name: np.mean(np.abs(predicted_part[error_name]))\n", "    return {'MAE': error_mean('e')}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for err_name, err_value in calculate_forecast_errors(cmp_df, prediction_size).items():\n", "    print(err_name, err_value)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Finally, let's approach the series the way recurrent neural networks do: by feeding in its own past values as features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.DataFrame(df.y.copy())\n", "data.columns = [\"y\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Adding lags of the target variable, from 6 steps back up to 44\n", "for i in range(6, 45):\n", "    data[\"lag_{}\".format(i)] = data.y.shift(i)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.tail(7)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.model_selection import TimeSeriesSplit\n", "\n", "# for time-series cross-validation set 5 folds \n", "tscv = TimeSeriesSplit(n_splits=5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error\n", "from sklearn.metrics import median_absolute_error, mean_squared_error, mean_squared_log_error\n", "\n", "def mean_absolute_percentage_error(y_true, y_pred): \n", "    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def timeseries_train_test_split(X, y, test_size):\n", "    \"\"\"\n", "    Perform train-test split with respect to time series structure\n", "    \"\"\"\n", "    # get the index after which the test set starts\n", "    test_index = int(len(X)*(1-test_size))\n", "    \n", "    X_train = X.iloc[:test_index]\n", "    y_train = y.iloc[:test_index]\n", "    X_test = X.iloc[test_index:]\n", "    y_test = y.iloc[test_index:]\n", "    \n", "    return X_train, X_test, y_train, y_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_ = data.dropna().y\n", "X_ = data.dropna().drop(['y'], axis=1)\n", "\n", "# 
reserve 30% of data for testing\n", "X_train, X_test, y_train, y_test = timeseries_train_test_split(X_, y_, test_size=0.3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lr = LinearRegression()\n", "lr.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plotModelResults(model, X_train=X_train, X_test=X_test, plot_intervals=False, plot_anomalies=False):\n", "    \n", "    prediction = model.predict(X_test)\n", "    \n", "    plt.figure(figsize=(15, 7))\n", "    plt.plot(prediction, \"g\", label=\"prediction\", linewidth=2.0)\n", "    plt.plot(y_test.values, label=\"actual\", linewidth=2.0)\n", "    \n", "    if plot_intervals:\n", "        cv = cross_val_score(model, X_train, y_train, \n", "                             cv=tscv, \n", "                             scoring=\"neg_mean_absolute_error\")\n", "        mae = cv.mean() * (-1)\n", "        deviation = cv.std()\n", "        \n", "        scale = 1.96\n", "        lower = prediction - (mae + scale * deviation)\n", "        upper = prediction + (mae + scale * deviation)\n", "        \n", "        plt.plot(lower, \"r--\", label=\"upper bound / lower bound\", alpha=0.5)\n", "        plt.plot(upper, \"r--\", alpha=0.5)\n", "        \n", "        if plot_anomalies:\n", "            anomalies = np.array([np.NaN]*len(y_test))\n", "            anomalies[y_test<lower] = y_test[y_test<lower]\n", "            anomalies[y_test>upper] = y_test[y_test>upper]\n", "            plt.plot(anomalies, \"o\", markersize=10, label = \"Anomalies\")\n", "    \n", "    error = mean_absolute_percentage_error(y_test, prediction)\n", "    plt.title(\"Mean absolute percentage error {0:.2f}%\".format(error))\n", "    plt.legend(loc=\"best\")\n", "    plt.tight_layout()\n", "    plt.grid(True);\n", "    \n", "def plotCoefficients(model):\n", "    \"\"\"\n", "    Plots sorted coefficient values of the model\n", "    \"\"\"\n", "    coefs = pd.DataFrame(model.coef_, X_train.columns)\n", "    coefs.columns = [\"coef\"]\n", "    coefs[\"abs\"] = coefs.coef.apply(np.abs)\n", "    coefs = coefs.sort_values(by=\"abs\", ascending=False).drop([\"abs\"], axis=1)\n", "    \n", "    plt.figure(figsize=(15, 7))\n", "    coefs.coef.plot(kind='bar')\n", "    plt.grid(True, axis='y')\n", "    plt.hlines(y=0, xmin=0, xmax=len(coefs), linestyles='dashed');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotModelResults(lr, plot_intervals=True)\n", "plotCoefficients(lr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.index = pd.to_datetime(data.index)\n", "data[\"day\"] = data.index.day\n", "data[\"weekday\"] = data.index.weekday\n", "data['is_weekend'] = data.weekday.isin([5,6])*1\n", "data.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "scaler = StandardScaler()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_ = data.dropna().y\n", "X_ = data.dropna().drop(['y'], axis=1)\n", "\n", "X_train, X_test, y_train, y_test = timeseries_train_test_split(X_, y_, test_size=0.3)\n", "\n", "X_train_scaled = scaler.fit_transform(X_train)\n", "X_test_scaled = scaler.transform(X_test)\n", "\n", "lr = LinearRegression()\n", "lr.fit(X_train_scaled, y_train)\n", "\n", "plotModelResults(lr, X_train=X_train_scaled, X_test=X_test_scaled, plot_intervals=True)\n", "plotCoefficients(lr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def code_mean(data, cat_feature, real_feature):\n", "    return dict(data.groupby(cat_feature)[real_feature].mean())\n", "\n", "def prepareData(series, 
lag_start, lag_end, test_size, target_encoding=False):\n", "\n", "    # copy of the initial dataset\n", "    data = pd.DataFrame(series.copy())\n", "\n", "    # lags of series\n", "    for i in range(lag_start, lag_end):\n", "        data[\"lag_{}\".format(i)] = data.y.shift(i)\n", "\n", "    # datetime features\n", "    data[\"day\"] = data.index.day\n", "    data[\"weekday\"] = data.index.weekday\n", "    data['is_weekend'] = data.weekday.isin([5,6])*1\n", "\n", "    if target_encoding:\n", "        # calculate averages on the train set only\n", "        test_index = int(len(data.dropna())*(1-test_size))\n", "        data['weekday_average'] = list(map(code_mean(data[:test_index], 'weekday', \"y\").get, data.weekday))\n", "        data[\"day_average\"] = list(map(code_mean(data[:test_index], 'day', \"y\").get, data.day))\n", "\n", "        # drop encoded variables \n", "        data.drop([\"day\", \"weekday\"], axis=1, inplace=True)\n", "\n", "    # train-test split\n", "    y = data.dropna().y\n", "    X = data.dropna().drop(['y'], axis=1)\n", "    X_train, X_test, y_train, y_test = timeseries_train_test_split(X, y, test_size=test_size)\n", "\n", "    return X_train, X_test, y_train, y_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "average_weekday = code_mean(data, 'weekday', \"y\")\n", "plt.figure(figsize=(7, 5))\n", "plt.title(\"Weekday averages\")\n", "pd.DataFrame.from_dict(average_weekday, orient='index')[0].plot()\n", "plt.grid(True);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = prepareData(daily_df, lag_start=6, lag_end=25, test_size=0.3, target_encoding=False)\n", "\n", "X_train_scaled = scaler.fit_transform(X_train)\n", "X_test_scaled = scaler.transform(X_test)\n", "\n", "lr = LinearRegression()\n", "lr.fit(X_train_scaled, y_train)\n", "\n", "plotModelResults(lr, X_train=X_train_scaled, X_test=X_test_scaled, plot_intervals=True, plot_anomalies=True)\n", "plotCoefficients(lr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 11. Conclusions" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "In the end we can say that predicting the location of a crime from the day's weather and the description of the crime is not feasible: the predicted coordinates collapse toward the mean, and the number of crimes is predicted with a large error. More data from other domains, such as demographics and economics, would probably be needed."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }