---
title: "Homework #4: Probability and Classification"
author: "**Your Name Here**"
format: sys6018hw-html
---

::: {style="background-color:yellow; color:red; display: block; border-color: black; padding:1em"}
This is an **independent assignment**. Do not discuss or work with classmates.
:::

```{r config, include=FALSE}
# Set global configurations and settings here
knitr::opts_chunk$set()                 # set global chunk options
ggplot2::theme_set(ggplot2::theme_bw()) # set ggplot2 theme
```

# Required R packages and Directories {.unnumbered .unlisted}

```{r packages, message=FALSE, warning=FALSE}
dir_data = 'https://mdporter.github.io/SYS6018/data/' # data directory
library(glmnet)    # for glmnet() functions
library(yardstick) # for evaluation metrics
library(tidyverse) # functions for data manipulation
```

# Crime Linkage

Crime linkage attempts to determine if a set of unsolved crimes share a common offender. *Pairwise* crime linkage is the simpler task of deciding if two crimes share a common offender; it can be considered a binary classification problem. The linkage training data has 8 evidence variables that measure the similarity between a pair of crimes:

- `spatial` is the spatial distance between the crimes
- `temporal` is the fractional time (in days) between the crimes
- `tod` and `dow` are the differences in time of day and day of week between the crimes
- `LOC`, `POA`, and `MOA` are binary with a 1 corresponding to a match (type of property, point of entry, method of entry)
- `TIMERANGE` is the time between the earliest and latest possible times the crime could have occurred (because the victim was away from the house during the crime).
- The response variable indicates if the crimes are linked ($y=1$) or unlinked ($y=0$).

These problems use the [linkage-train](https://mdporter.github.io/DS6030/data/linkage_train.csv) and [linkage-test](https://mdporter.github.io/DS6030/data/linkage_test.csv) datasets (click on links for data).

## Load Crime Linkage Data

::: {.callout-note title="Solution"}
Add solution here.
:::

# Problem 1: Penalized Regression for Crime Linkage

## a. Fit a penalized *linear regression* model to predict linkage.

Use an elastic net penalty (including lasso and ridge) (your choice).

- Report the value of $\alpha \in [0, 1]$ used
- Report the value of $\lambda$ used
- Report the estimated coefficients

::: {.callout-note title="Solution"}
Add solution here.
:::

## b. Fit a penalized *logistic regression* model to predict linkage.

Use an elastic net penalty (including lasso and ridge) (your choice).

- Report the value of $\alpha \in [0, 1]$ used
- Report the value of $\lambda$ used
- Report the estimated coefficients

::: {.callout-note title="Solution"}
Add solution here.
:::

## c. ROC curve: training data

Produce one plot that has the ROC curves, using the *training data*, for both models (from part a and b). Use color and/or linetype to distinguish between models and include a legend.

::: {.callout-note title="Solution"}
Add solution here.
:::

## d. ROC curve: resampling estimate

Recreate the ROC curve from the penalized logistic regression model using repeated hold-out data. The following steps will guide you:

- Fix $\alpha=.75$
- Run the following steps 25 times:
    i. Hold out 500 observations
    ii. Use the remaining observations to estimate $\lambda$ using 10-fold CV
    iii. Predict the probability of linkage for the 500 hold-out observations
    iv. Store the predictions and hold-out labels
- Combine the results and produce the hold-out based ROC curve from all of the hold-out data. I'm looking for a single ROC curve using the predictions for all 12,500 (25 x 500) observations rather than 25 different curves.
- Note: by estimating $\lambda$ each iteration, we are incorporating the uncertainty present in estimating that tuning parameter.

::: {.callout-note title="Solution"}
Add solution here.
:::

## e. Contest Part 1: Predict the estimated *probability* of linkage.

Predict the estimated *probability* of linkage for the test data (using any model).

- Submit a .csv file (ensure comma-separated format) named `lastname_firstname_1.csv` that includes a column named **p** containing your estimated posterior probability. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points.
- Your probabilities will be evaluated with respect to the mean negative Bernoulli log-likelihood (known as the average *log-loss* metric):
$$
L = - \frac{1}{M} \sum_{i=1}^{M} [y_i \log \, \hat{p}_i + (1 - y_i) \log \, (1 - \hat{p}_i)]
$$
where $M$ is the number of test observations, $\hat{p}_i$ is the prediction for the $i$th test observation, and $y_i \in \{0,1\}$ are the true test set labels.

::: {.callout-note title="Solution"}
Add solution here.
:::

## f. Contest Part 2: Predict the *linkage label*.

Predict the linkages for the test data (using any model).

- Submit a .csv file (ensure comma-separated format) named `lastname_firstname_2.csv` that includes a column named **linkage** that takes the value of 1 for linked pairs and 0 for unlinked pairs. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- Your labels will be evaluated based on total cost, where cost is equal to `1*FP + 8*FN`. This implies that False Negatives (FN) are 8 times as costly as False Positives (FP).
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points. Note: you will only get bonus credit for one of the two contests.

::: {.callout-note title="Solution"}
Add solution here.
:::
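
Both contest metrics can be computed directly from their definitions, which is useful for checking a submission on held-out training data before sending it in. The following is a minimal sketch, not the official scoring code: the function names `log_loss` and `total_cost`, the clipping constant `eps`, and the toy vectors are all assumptions made for illustration. The threshold of $1/9$ follows from comparing expected costs ($8\hat{p}$ for predicting 0 versus $1 - \hat{p}$ for predicting 1) and is only sensible if the probabilities are reasonably well calibrated.

```r
# Average log-loss: mean negative Bernoulli log-likelihood.
# `eps` is an assumed clipping constant (not part of the assignment) that
# keeps log() finite when a prediction is exactly 0 or 1.
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Total cost for hard 0/1 labels: 1*FP + 8*FN by default.
total_cost <- function(y, y_hat, cost_fp = 1, cost_fn = 8) {
  fp <- sum(y_hat == 1 & y == 0)  # false positives
  fn <- sum(y_hat == 0 & y == 1)  # false negatives
  cost_fp * fp + cost_fn * fn
}

# Toy labels and predicted probabilities (made up for illustration)
y <- c(1, 0, 0, 1, 0)
p <- c(0.9, 0.2, 0.6, 0.3, 0.1)

log_loss(y, p)

# Predicting 1 costs (1 - p) in expectation, predicting 0 costs 8 * p,
# so predicting 1 is cheaper whenever p > 1 / (1 + 8).
threshold <- 1 / (1 + 8)
y_hat <- as.integer(p > threshold)

total_cost(y, y_hat)                  # cost at the 1/9 threshold
total_cost(y, as.integer(p > 0.5))    # cost at the usual 0.5 threshold
```

On this toy example the lower threshold trades one expensive false negative for an extra cheap false positive, which is the behavior the `1*FP + 8*FN` cost rewards.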