---
title: "Homework #5: Probability and Classification"
author: "**Your Name Here**"
format: ds6030hw-html
---

```{r config, include=FALSE}
# Set global configurations and settings here
knitr::opts_chunk$set()                  # set global chunk options
ggplot2::theme_set(ggplot2::theme_bw())  # set ggplot2 theme
```

# Required R packages and Directories {.unnumbered .unlisted}

```{r packages, message=FALSE, warning=FALSE}
dir_data = 'https://mdporter.github.io/teaching/data/' # data directory
library(glmnet)    # penalized regression (elastic net)
library(tidyverse) # functions for data manipulation
```

# Crime Linkage

Crime linkage attempts to determine if a set of unsolved crimes share a common offender. *Pairwise* crime linkage is the simpler task of deciding whether two crimes share a common offender; it can be treated as a binary classification problem. The linkage training data has 8 evidence variables that measure the similarity between a pair of crimes:

- `spatial` is the spatial distance between the crimes
- `temporal` is the fractional time (in days) between the crimes
- `tod` and `dow` are the differences in time of day and day of week between the crimes
- `LOC`, `POA`, and `MOA` are binary with a 1 corresponding to a match (type of property, point of entry, method of entry)
- `TIMERANGE` is the time between the earliest and latest possible times the crime could have occurred (because the victim was away from the house during the crime).
- The response variable indicates if the crimes are linked ($y=1$) or unlinked ($y=0$).

These problems use the [linkage-train](`r file.path(dir_data, "linkage_train.csv") `) and [linkage-test](`r file.path(dir_data, "linkage_test.csv") `) datasets (click on links for data).

## Load Crime Linkage Data

::: {.callout-note title="Solution"}
Add solution here
:::

# Problem 1: Penalized Regression for Crime Linkage

## a. Fit a penalized *linear regression* model to predict linkage.

Use an elastic net penalty; lasso and ridge are included as special cases, and the choice of mixing parameter is yours.

- Report the value of $\alpha \in [0, 1]$ used.
- Report the value of $\lambda$ used.
- Report the estimated coefficients.

::: {.callout-note title="Solution"}
Add solution here
:::

## b. Fit a penalized *logistic regression* model to predict linkage.

Use an elastic net penalty; lasso and ridge are included as special cases, and the choice of mixing parameter is yours.

- Report the value of $\alpha \in [0, 1]$ used.
- Report the value of $\lambda$ used.
- Report the estimated coefficients.

::: {.callout-note title="Solution"}
Add solution here
:::

# Problem 2: Random Forest for Crime Linkage

Fit a random forest model to predict crime linkage.

- Report the loss function (or splitting rule) used.
- Report any non-default tuning parameters.
- Report the variable importance (indicate which importance method was used).

::: {.callout-note title="Solution"}
Add solution here
:::

# Problem 3: ROC Curves

## a. ROC curve: training data

Produce one plot that has the ROC curves, using the *training data*, for all three models (linear, logistic, and random forest). Use color and/or linetype to distinguish between models and include a legend. Also report the AUC (area under the ROC curve) for each model. Again, use the *training data*.

- Note: you should be wary of evaluating predictive performance on the same data used to estimate the tuning and model parameters. The next problem will walk you through a more appropriate way of evaluating predictive performance with resampling.

::: {.callout-note title="Solution"}
Add solution here
:::
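For part a, the ROC curves can be assembled by hand from each model's training-set scores. Below is a minimal sketch, assuming the training response is stored in `y_train` and that `p_lm`, `p_logreg`, and `p_rf` hold the three models' predicted scores (all placeholder names; ties between scores are ignored for simplicity):

```{r roc-sketch, eval=FALSE}
# Empirical ROC: sort by score (descending) and accumulate the
# true/false positive rates as the cutoff is lowered
roc_points <- function(y, score) {
  tibble(y, score) %>%
    arrange(desc(score)) %>%
    mutate(
      TPR = cumsum(y) / sum(y),         # true positive rate at each cutoff
      FPR = cumsum(1 - y) / sum(1 - y)  # false positive rate at each cutoff
    )
}

bind_rows(
  linear = roc_points(y_train, p_lm),
  logistic = roc_points(y_train, p_logreg),
  `random forest` = roc_points(y_train, p_rf),
  .id = "model"
) %>%
  ggplot(aes(FPR, TPR, color = model)) +
  geom_line() +
  geom_abline(linetype = "dashed") +    # reference line: random classifier
  labs(x = "False Positive Rate", y = "True Positive Rate")
```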
## b. ROC curve: resampling estimate

Recreate the ROC curve from the penalized logistic regression (logreg) and random forest (rf) models using repeated hold-out data. The following steps will guide you:

- For logreg, use $\alpha=.75$. For rf, use *mtry = 2*, *num.trees = 1000*, and fix any other tuning parameters at your choice.
- Run the following steps 25 times:
    i. Hold out 500 observations.
    ii. Use the remaining observations to estimate $\lambda$ using 10-fold CV for the logreg model. Don't tune any rf parameters.
    iii. Predict the probability of linkage for the 500 hold-out observations.
    iv. Store the predictions and hold-out labels.
    v. Calculate the AUC.
- Report the mean AUC and standard error for both models. Compare to the results from part a.
- Produce two plots showing the 25 ROC curves for each model.
- Note: by estimating $\lambda$ in each iteration, we incorporate the uncertainty involved in estimating that tuning parameter.

::: {.callout-note title="Solution"}
Add solution here
:::

# Problem 4: Contest

## a. Contest Part 1: Predict the estimated *probability* of linkage.

Predict the estimated *probability* of linkage for the test data (using any model).

- Submit a .csv file (ensure comma separated format) named `lastname_firstname_1.csv` that includes the column named **p** that is your estimated posterior probability. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points.
- Your probabilities will be evaluated with respect to the mean negative Bernoulli log-likelihood (known as the average *log-loss* metric):
$$
L = - \frac{1}{M} \sum_{i=1}^{M} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]
$$
where $M$ is the number of test observations, $\hat{p}_i$ is the prediction for the $i$th test observation, and $y_i \in \{0,1\}$ are the true test set labels.

::: {.callout-note title="Solution"}
Add solution here
:::

## b. Contest Part 2: Predict the *linkage label*.

Predict the linkages for the test data (using any model).

- Submit a .csv file (ensure comma separated format) named `lastname_firstname_2.csv` that includes the column named **linkage** that takes the value of 1 for linked pairs and 0 for unlinked pairs. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- Your labels will be evaluated based on total cost, where cost is equal to `1*FP + 8*FN`. This implies that False Negatives (FN) are 8 times as costly as False Positives (FP).
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points. Note: you will only get bonus credit for one of the two contests.

::: {.callout-note title="Solution"}
Add solution here
:::
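The log-loss in part a can be computed directly from its definition. A minimal sketch, assuming `y` holds the true labels and `p_hat` the predicted probabilities (placeholder names):

```{r logloss-sketch, eval=FALSE}
# Mean negative Bernoulli log-likelihood (average log-loss)
log_loss <- function(y, p_hat, eps = 1e-15) {
  p_hat = pmin(pmax(p_hat, eps), 1 - eps)  # clip so log() stays finite
  -mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}
```

The clipping of `p_hat` is a numerical safeguard against $\log 0$, not part of the metric itself.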
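For part b, the stated cost function suggests a cutoff well below 0.5: predicting linkage is cheaper in expectation whenever $8\hat{p}_i \geq 1 \cdot (1 - \hat{p}_i)$, i.e., $\hat{p}_i \geq 1/9$. A minimal sketch, assuming `p_test` holds test-set probabilities from part a (placeholder name) and that those probabilities are reasonably well calibrated:

```{r labels-sketch, eval=FALSE}
# Cost-based labeling: a false negative costs 8x a false positive,
# so label a pair as linked when p >= 1/(1+8)
cutoff = 1 / (1 + 8)
tibble(linkage = as.integer(p_test >= cutoff)) %>%
  write_csv("lastname_firstname_2.csv")
```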