---
title: "Homework #5: Probability and Classification"
author: "**Your Name Here**"
format: ds6030hw-html
---

```{r config, include=FALSE}
# Set global configurations and settings here
knitr::opts_chunk$set()                  # set global chunk options
ggplot2::theme_set(ggplot2::theme_bw())  # set ggplot2 theme
```

# Required R packages and Directories {.unnumbered .unlisted}

```{r packages, message=FALSE, warning=FALSE}
dir_data = 'https://mdporter.github.io/teaching/data/' # data directory
library(glmnet)    # penalized regression (elastic net)
library(tidyverse) # functions for data manipulation
```

# Crime Linkage

Crime linkage attempts to determine if a set of unsolved crimes share a common offender. *Pairwise* crime linkage is the simpler task of deciding whether two crimes share a common offender; it can be treated as a binary classification problem. The linkage training data has 8 evidence variables that measure the similarity between a pair of crimes:

- `spatial` is the spatial distance between the crimes
- `temporal` is the fractional time (in days) between the crimes
- `tod` and `dow` are the differences in time of day and day of week between the crimes
- `LOC`, `POA`, and `MOA` are binary with a 1 corresponding to a match (type of property, point of entry, method of entry)
- `TIMERANGE` is the time between the earliest and latest possible times the crime could have occurred (because the victim was away from the house during the crime).
- The response variable indicates if the crimes are linked ($y=1$) or unlinked ($y=0$).

These problems use the [linkage-train](`r file.path(dir_data, "linkage_train.csv") `) and [linkage-test](`r file.path(dir_data, "linkage_test.csv") `) datasets (click on links for data).

## Load Crime Linkage Data

::: {.callout-note title="Solution"}
Add solution here
:::

# Problem 1: Penalized Regression for Crime Linkage

## a. Fit a penalized *linear regression* model to predict linkage.

Use an elastic net penalty; lasso and ridge are included as special cases, and the choice of mixing parameter is yours.

- Report the value of $\alpha \in [0, 1]$ used.
- Report the value of $\lambda$ used.
- Report the estimated coefficients.

::: {.callout-note title="Solution"}
Add solution here
:::

## b. Fit a penalized *logistic regression* model to predict linkage.

Use an elastic net penalty; lasso and ridge are included as special cases, and the choice of mixing parameter is yours.

- Report the value of $\alpha \in [0, 1]$ used.
- Report the value of $\lambda$ used.
- Report the estimated coefficients.

::: {.callout-note title="Solution"}
Add solution here
:::

# Problem 2: Random Forest for Crime Linkage

Fit a random forest model to predict crime linkage.

- Report the loss function (or splitting rule) used.
- Report any non-default tuning parameters.
- Report the variable importance (indicate which importance method was used).

::: {.callout-note title="Solution"}
Add solution here
:::

# Problem 3: ROC Curves

## a. ROC curve: training data

Produce one plot that has the ROC curves, using the *training data*, for all three models (linear, logistic, and random forest). Use color and/or linetype to distinguish between models and include a legend. Also report the AUC (area under the ROC curve) for each model. Again, use the *training data*.

- Note: you should be wary of evaluating predictive performance on the same data used to estimate the tuning and model parameters. The next problem will walk you through a more appropriate way of evaluating predictive performance with resampling.

::: {.callout-note title="Solution"}
Add solution here
:::
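For part a, the ROC curves can be assembled by hand from each model's training-set scores. Below is a minimal sketch, assuming the training response is stored in `y_train` and that `p_lm`, `p_logreg`, and `p_rf` hold the three models' predicted scores (all placeholder names; ties between scores are ignored for simplicity):

```{r roc-sketch, eval=FALSE}
# Empirical ROC: sort by score (descending) and accumulate the
# true/false positive rates as the cutoff is lowered
roc_points <- function(y, score) {
  tibble(y, score) %>%
    arrange(desc(score)) %>%
    mutate(
      TPR = cumsum(y) / sum(y),         # true positive rate at each cutoff
      FPR = cumsum(1 - y) / sum(1 - y)  # false positive rate at each cutoff
    )
}

bind_rows(
  linear = roc_points(y_train, p_lm),
  logistic = roc_points(y_train, p_logreg),
  `random forest` = roc_points(y_train, p_rf),
  .id = "model"
) %>%
  ggplot(aes(FPR, TPR, color = model)) +
  geom_line() +
  geom_abline(linetype = "dashed") +    # reference line: random classifier
  labs(x = "False Positive Rate", y = "True Positive Rate")
```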
## b. ROC curve: resampling estimate

Recreate the ROC curve from the penalized logistic regression (logreg) and random forest (rf) models using repeated hold-out data. The following steps will guide you:

- For logreg, use $\alpha=.75$. For rf, use *mtry = 2*, *num.trees = 1000*, and fix any other tuning parameters at your choice.
- Run the following steps 25 times:
    i. Hold out 500 observations.
    ii. Use the remaining observations to estimate $\lambda$ using 10-fold CV for the logreg model. Don't tune any rf parameters.
    iii. Predict the probability of linkage for the 500 hold-out observations.
    iv. Store the predictions and hold-out labels.
    v. Calculate the AUC.
- Report the mean AUC and standard error for both models. Compare to the results from part a.
- Produce two plots showing the 25 ROC curves for each model.
- Note: by estimating $\lambda$ in each iteration, we incorporate the uncertainty involved in estimating that tuning parameter.

::: {.callout-note title="Solution"}
Add solution here
:::

# Problem 4: Contest

## a. Contest Part 1: Predict the estimated *probability* of linkage.

Predict the estimated *probability* of linkage for the test data (using any model).

- Submit a .csv file (ensure comma separated format) named `lastname_firstname_1.csv` that includes the column named **p** that is your estimated posterior probability. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points.
- Your probabilities will be evaluated with respect to the mean negative Bernoulli log-likelihood (known as the average *log-loss* metric):
$$
L = - \frac{1}{M} \sum_{i=1}^{M} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]
$$
where $M$ is the number of test observations, $\hat{p}_i$ is the prediction for the $i$th test observation, and $y_i \in \{0,1\}$ are the true test set labels.

::: {.callout-note title="Solution"}
Add solution here
:::

## b. Contest Part 2: Predict the *linkage label*.

Predict the linkages for the test data (using any model).

- Submit a .csv file (ensure comma separated format) named `lastname_firstname_2.csv` that includes the column named **linkage** that takes the value of 1 for linked pairs and 0 for unlinked pairs. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- Your labels will be evaluated based on total cost, where cost is equal to `1*FP + 8*FN`. This implies that False Negatives (FN) are 8 times as costly as False Positives (FP).
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points. Note: you will only get bonus credit for one of the two contests.

::: {.callout-note title="Solution"}
Add solution here
:::
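The log-loss in part a can be computed directly from its definition. A minimal sketch, assuming `y` holds the true labels and `p_hat` the predicted probabilities (placeholder names):

```{r logloss-sketch, eval=FALSE}
# Mean negative Bernoulli log-likelihood (average log-loss)
log_loss <- function(y, p_hat, eps = 1e-15) {
  p_hat = pmin(pmax(p_hat, eps), 1 - eps)  # clip so log() stays finite
  -mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}
```

The clipping of `p_hat` is a numerical safeguard against $\log 0$, not part of the metric itself.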
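For part b, the stated cost function suggests a cutoff well below 0.5: predicting linkage is cheaper in expectation whenever $8\hat{p}_i \geq 1 \cdot (1 - \hat{p}_i)$, i.e., $\hat{p}_i \geq 1/9$. A minimal sketch, assuming `p_test` holds test-set probabilities from part a (placeholder name) and that those probabilities are reasonably well calibrated:

```{r labels-sketch, eval=FALSE}
# Cost-based labeling: a false negative costs 8x a false positive,
# so label a pair as linked when p >= 1/(1+8)
cutoff = 1 / (1 + 8)
tibble(linkage = as.integer(p_test >= cutoff)) %>%
  write_csv("lastname_firstname_2.csv")
```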