---
title: "Homework #4: Probability and Classification" 
author: "**Your Name Here**"
format: sys6018hw-html
---


::: {style="background-color:yellow; color:red; display: block; border-color: black; padding:1em"}
This is an **independent assignment**. Do not discuss or work with classmates.
:::

```{r config, include=FALSE}
# Set global configurations and settings here
knitr::opts_chunk$set()                 # set global chunk options
ggplot2::theme_set(ggplot2::theme_bw()) # set ggplot2 theme
```


# Required R packages and Directories {.unnumbered .unlisted}

```{r packages, message=FALSE, warning=FALSE}
dir_data = 'https://mdporter.github.io/SYS6018/data/' # data directory
library(glmnet)    # for glmnet() functions
library(yardstick) # for evaluation metrics
library(tidyverse) # functions for data manipulation  
```


# Crime Linkage

Crime linkage attempts to determine if a set of unsolved crimes share a common offender. *Pairwise* crime linkage is the simpler task of deciding if two crimes share a common offender; it can be considered a binary classification problem. The linkage training data has 8 evidence variables that measure the similarity between a pair of crimes:

- `spatial` is the spatial distance between the crimes
- `temporal` is the fractional time (in days) between the crimes
- `tod` and `dow` are the differences in time of day and day of week between the crimes
- `LOC`, `POA`, and `MOA` are binary with a 1 corresponding to a match (type of property, point of entry, method of entry)
- `TIMERANGE` is the time between the earliest and latest possible times the crime could have occurred (because the victim was away from the house during the crime).
- The response variable indicates if the crimes are linked ($y=1$) or unlinked ($y=0$).


These problems use the [linkage-train](https://mdporter.github.io/DS6030/data/linkage_train.csv) and [linkage-test](https://mdporter.github.io/DS6030/data/linkage_test.csv) datasets (click on links for data). 


## Load Crime Linkage Data

::: {.callout-note title="Solution"}
Add solution here.
:::

# Problem 1: Penalized Regression for Crime Linkage

## a. Fit a penalized *linear regression* model to predict linkage. 

Use an elastic net penalty of your choice (lasso and ridge are allowed as special cases).

- Report the value of $\alpha \in [0, 1]$ used 
- Report the value of $\lambda$ used
- Report the estimated coefficients


::: {.callout-note title="Solution"}
Add solution here.
:::


## b. Fit a penalized *logistic regression* model to predict linkage. 

Use an elastic net penalty of your choice (lasso and ridge are allowed as special cases).

- Report the value of $\alpha \in [0, 1]$ used 
- Report the value of $\lambda$ used
- Report the estimated coefficients

::: {.callout-note title="Solution"}
Add solution here.
:::

## c. ROC curve: training data

Produce one plot that has the ROC curves, using the *training data*, for both models (from parts a and b). Use color and/or linetype to distinguish between models and include a legend.

::: {.callout-note title="Solution"}
Add solution here.
:::


## d. ROC curve: resampling estimate

Recreate the ROC curve from the penalized logistic regression model using repeated hold-out data. The following steps will guide you:

- Fix $\alpha=.75$ 
- Run the following steps 25 times:
    i. Hold out 500 observations
    ii. Use the remaining observations to estimate $\lambda$ using 10-fold CV
    iii. Predict the probability of linkage for the 500 hold-out observations
    iv. Store the predictions and hold-out labels
- Combine the results and produce the hold-out based ROC curve from all of the hold-out data. I'm looking for a single ROC curve using the predictions for all 12,500 (25 x 500) observations rather than 25 different curves. 
- Note: by estimating $\lambda$ each iteration, we are incorporating the uncertainty present in estimating that tuning parameter. 
    
::: {.callout-note title="Solution"} 
Add solution here.
:::

## e. Contest Part 1: Predict the estimated *probability* of linkage. 

Predict the estimated *probability* of linkage for the test data (using any model). 

- Submit a .csv file (ensure comma separated format) named `lastname_firstname_1.csv` that includes a column named **p** containing your estimated posterior probabilities. We will use automated evaluation, so the format must be exact. 
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points.     
- Your probabilities will be evaluated with respect to the mean negative Bernoulli log-likelihood (known as the average *log-loss* metric):
$$ 
L = - \frac{1}{M} \sum_{i=1}^M [y_i \log \, \hat{p}_i + (1 - y_i) \log \, (1 - \hat{p}_i)]
$$
where $M$ is the number of test observations, $\hat{p}_i$ is the prediction for the $i$th test observation, and $y_i \in \{0,1\}$ are the true test set labels. 
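
As a rough illustration of the metric, the average log-loss can be computed in a few lines of R. This is a minimal sketch, not the official scoring code: the function name `log_loss`, the clipping constant `eps`, and the toy values are assumptions introduced here for illustration. Clipping is included only so that predictions of exactly 0 or 1 do not produce an infinite loss; whether the automated evaluation clips is not stated.

```r
# Hypothetical helper (not part of the assignment): average log-loss.
# y:   true labels in {0, 1}
# p:   predicted probabilities of linkage
# eps: assumed clipping constant to avoid log(0)
log_loss <- function(y, p, eps = 1e-15) {
  stopifnot(length(y) == length(p))
  p <- pmin(pmax(p, eps), 1 - eps)          # keep p strictly inside (0, 1)
  -mean(y * log(p) + (1 - y) * log(1 - p))  # mean negative Bernoulli log-likelihood
}

# Toy check with made-up labels and probabilities
y_toy <- c(1, 0, 1, 0)
p_toy <- c(0.9, 0.2, 0.6, 0.4)
log_loss(y_toy, p_toy)
```

Lower values are better; confident but wrong probabilities are penalized heavily.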

::: {.callout-note title="Solution"}
Add solution here.
:::

## f. Contest Part 2: Predict the *linkage label*. 

Predict the linkages for the test data (using any model). 

- Submit a .csv file (ensure comma separated format) named `lastname_firstname_2.csv` that includes a column named **linkage** that takes the value of 1 for linked pairs and 0 for unlinked pairs. We will use automated evaluation, so the format must be exact. 
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- Your labels will be evaluated based on total cost, where cost is equal to `1*FP + 8*FN`. This implies that False Negatives (FN) are 8 times as costly as False Positives (FP).    
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points. Note: you will only get bonus credit for one of the two contests. 
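
The cost criterion above can be sketched in R as follows. This is an illustrative helper only, not the official scoring code: the function name `total_cost`, its argument names, and the toy labels are assumptions made here.

```r
# Hypothetical helper (not part of the assignment): total cost = 1*FP + 8*FN.
# y:    true labels in {0, 1}
# yhat: predicted labels in {0, 1}
total_cost <- function(y, yhat, cost_fp = 1, cost_fn = 8) {
  stopifnot(length(y) == length(yhat))
  fp <- sum(yhat == 1 & y == 0)  # false positives: predicted linked, truly unlinked
  fn <- sum(yhat == 0 & y == 1)  # false negatives: predicted unlinked, truly linked
  cost_fp * fp + cost_fn * fn
}

# Toy check with made-up labels: 2 false positives and 1 false negative
y_toy    <- c(1, 1, 0, 0, 0)
yhat_toy <- c(1, 0, 1, 1, 0)
total_cost(y_toy, yhat_toy)  # 1*2 + 8*1 = 10
```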

::: {.callout-note title="Solution"}
Add solution here.
:::