---
title: "Homework #10: Clustering" 
author: "**Your Name Here**"
format: sys6018hw-html
---

```{r config, include=FALSE}
# Set global configurations and settings here
knitr::opts_chunk$set()                 # set global chunk options
ggplot2::theme_set(ggplot2::theme_bw()) # set ggplot2 theme
```

# Required R packages and Directories {.unnumbered .unlisted}

```{r packages, message=FALSE, warning=FALSE}
data_dir = 'https://mdporter.github.io/SYS6018/data/' # data directory
library(mclust)    # for model-based clustering
library(mixtools)  # for Poisson mixture model
library(tidyverse) # functions for data manipulation   
```

# Problem 1: Customer Segmentation with RFM (Recency, Frequency, and Monetary Value)

RFM analysis is an approach that some businesses use to understand their customers' activities. At any point in time, a company can measure how recently a customer purchased a product (Recency), how many times they purchased a product (Frequency), and how much they have spent (Monetary Value). There are many ad-hoc attempts to segment/cluster customers based on the RFM scores (e.g., here is one based on using the customers' rank of each dimension independently: <https://joaocorreia.io/blog/rfm-analysis-increase-sales-by-segmenting-your-customers.html>). In this problem you will use the clustering methods we covered in class to segment the customers. 


The data for this problem can be found here: <`r paste0(data_dir, "RFM.csv")`>. Cluster based on the Recency, Frequency, and Monetary Value columns.


::: {.callout-note title="Solution"}
Load Data Here.
:::


## a. Implement hierarchical clustering. 

- Describe any pre-processing steps you took (e.g., scaling, distance metric)
- State the linkage method you used with justification. 
- Show the resulting dendrogram
- State the number of segments/clusters you used with justification. 
- Using your segmentation, are customers 1 and 100 in the same cluster?     
    
::: {.callout-note title="Solution"}
Add solution here
:::
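
As an illustration of the steps listed above (scaling, distance, linkage, dendrogram, cutting the tree), here is a minimal sketch on simulated data. It is not the RFM data; the column names, the toy values, the Ward linkage, and `k = 2` are all assumptions chosen for the example.

```r
# Toy data standing in for three RFM-like columns (hypothetical values)
set.seed(1)
toy <- data.frame(
  Recency   = c(rnorm(30, 10, 2),   rnorm(30, 60, 5)),
  Frequency = c(rnorm(30, 20, 3),   rnorm(30, 5, 1)),
  Monetary  = c(rnorm(30, 500, 50), rnorm(30, 100, 20))
)

toy_scaled <- scale(toy)                          # center and scale each column
d_toy  <- dist(toy_scaled, method = "euclidean")  # pairwise Euclidean distances
hc_toy <- hclust(d_toy, method = "ward.D2")       # Ward linkage (one of several options)

plot(hc_toy, labels = FALSE, main = "Toy dendrogram")
grp_toy <- cutree(hc_toy, k = 2)                  # cut the tree into k = 2 clusters

# Are observations 1 and 31 in the same cluster?
unname(grp_toy[1] == grp_toy[31])
```

The same `cutree()` labels can be indexed by any pair of row positions to compare cluster membership.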


## b. Implement k-means.  

- Describe any pre-processing steps you took (e.g., scaling)
- State the number of segments/clusters you used with justification. 
- Using your segmentation, are customers 1 and 100 in the same cluster?     
    
::: {.callout-note title="Solution"}
Add solution here
:::
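
A minimal sketch of k-means with an elbow-style check, again on simulated data rather than the RFM data. The toy matrix, the grid `1:6`, and `nstart = 25` are assumptions for illustration; other ways of choosing the number of clusters are equally valid.

```r
# Two well-separated toy groups in three dimensions (hypothetical values)
set.seed(2)
toy_km <- scale(rbind(
  matrix(rnorm(90, mean = 0), ncol = 3),
  matrix(rnorm(90, mean = 5), ncol = 3)
))

# Total within-cluster sum of squares for K = 1..6 (elbow heuristic)
K_grid <- 1:6
wss <- sapply(K_grid, function(K) {
  kmeans(toy_km, centers = K, nstart = 25)$tot.withinss
})
plot(K_grid, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")

# Fit with the chosen K and compare two observations
km_toy <- kmeans(toy_km, centers = 2, nstart = 25)
unname(km_toy$cluster[1] == km_toy$cluster[31])
```

Using several random starts (`nstart`) reduces the chance of reporting a poor local optimum.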

## c. Implement model-based clustering

- Describe any pre-processing steps you took (e.g., scaling)
- State the number of segments/clusters you used with justification. 
- Describe the best model. What restrictions are on the shape of the components?
- Using your segmentation, are customers 1 and 100 in the same cluster?     

::: {.callout-note title="Solution"}
Add solution here
:::
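
A minimal sketch of model-based clustering with `mclust::Mclust()` on simulated data (not the RFM data). The toy matrix and the range `G = 1:5` are assumptions; the sketch only shows where the selected number of components and the covariance model name can be read from the fitted object.

```r
library(mclust)

# Two toy Gaussian groups in two dimensions (hypothetical values)
set.seed(3)
toy_gmm <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 6), ncol = 2)
)

# BIC is used to select both G and the covariance structure
fit_toy <- Mclust(toy_gmm, G = 1:5, verbose = FALSE)

summary(fit_toy)
fit_toy$G           # number of components selected
fit_toy$modelName   # covariance model code, e.g. "EII" or "VVV"
plot(fit_toy, what = "BIC")

# Hard cluster labels; compare two observations
fit_toy$classification[1] == fit_toy$classification[101]
```

The model name encodes the restrictions on volume, shape, and orientation of the components; `?mclustModelNames` lists the codes.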

## d. Discussion of results

Discuss how you would cluster the customers if you had to do this for your job. Do you think one model would do better than the others? 

::: {.callout-note title="Solution"}
Add solution here
:::


# Problem 2: Poisson Mixture Model

The pmf of a Poisson random variable with rate $\lambda_k$ is, for $x = 0, 1, 2, \ldots$:
\begin{align*}
f_k(x; \lambda_k) = \frac{\lambda_k^x e^{-\lambda_k}}{x!}
\end{align*}

A two-component Poisson mixture model can be written:
\begin{align*}
f(x; \theta) = \pi \frac{\lambda_1^x e^{-\lambda_1}}{x!} + (1-\pi) \frac{\lambda_2^x e^{-\lambda_2}}{x!}
\end{align*}
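
The mixture pmf above can be evaluated directly with `dpois()`. The sketch below is one way to do it; the function name `dpoismix`, the argument name `w` (standing in for $\pi$, to avoid masking R's built-in constant `pi`), and the parameter values are hypothetical.

```r
# Two-component Poisson mixture pmf; `w` plays the role of pi in the formula
dpoismix <- function(x, w, lambda1, lambda2) {
  w * dpois(x, lambda1) + (1 - w) * dpois(x, lambda2)
}

x_grid <- 0:40
p_grid <- dpoismix(x_grid, w = 0.3, lambda1 = 4, lambda2 = 15)  # hypothetical values

# A valid pmf sums to 1 over its support (checked here on a wide grid)
sum(dpoismix(0:200, w = 0.3, lambda1 = 4, lambda2 = 15))

plot(x_grid, p_grid, type = "h", xlab = "x", ylab = "f(x)")
```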


## a. Model parameters

What are the parameters of the model? 

::: {.callout-note title="Solution"}
Add solution here
:::

## b. Log-likelihood

Write down the log-likelihood for $n$ independent observations ($x_1, x_2, \ldots, x_n$). 

::: {.callout-note title="Solution"}
$$
\log L(\theta) = \sum_{i=1}^n  \log \left(\pi \frac{\lambda_1^{x_i} e^{-\lambda_1}}{x_i!} + (1-\pi) \frac{\lambda_2^{x_i} e^{-\lambda_2}}{x_i!}
\right)
$$
:::
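
The log-likelihood expression translates almost line for line into R. This is a minimal sketch; the function name `loglik_poismix`, the argument name `w` (standing in for $\pi$), and the toy sample are hypothetical.

```r
# Log-likelihood of a two-component Poisson mixture; `w` plays the role of pi
loglik_poismix <- function(x, w, lambda1, lambda2) {
  sum(log(w * dpois(x, lambda1) + (1 - w) * dpois(x, lambda2)))
}

set.seed(4)
x_toy <- c(rpois(50, 3), rpois(150, 12))   # hypothetical sample

ll_good <- loglik_poismix(x_toy, w = 0.25, lambda1 = 3, lambda2 = 12)
ll_poor <- loglik_poismix(x_toy, w = 0.25, lambda1 = 7, lambda2 = 8)
c(good = ll_good, poor = ll_poor)          # parameters near the truth score higher
```

Setting `w = 1` reduces the function to the ordinary single-Poisson log-likelihood, which is a quick sanity check.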

## c. Updating the responsibilities

Suppose we have initial values of the parameters. Write down the equation for updating the *responsibilities*. 

::: {.callout-note title="Solution"}
Add solution here
:::

## d. Updating the model parameters

Suppose we have responsibilities, $r_{ik}$ for all $i=1, 2, \ldots, n$ and $k=1,2$. Write down the equations for updating the parameters. 

::: {.callout-note title="Solution"}
Add solution here
:::

## e. Fit a two-component Poisson mixture model 

Fit a two-component Poisson mixture model to the data generated below. Report the estimated parameter values and show a plot of the estimated mixture pmf.

```{r, echo=TRUE}
#-- Run this code to generate the data
set.seed(123)             # set seed for reproducibility
n = 200                   # sample size
z = sample(1:2, size=n, replace=TRUE, prob=c(.25, .75)) # sample the latent class
theta = c(8, 16)          # true parameters
y = ifelse(z==1, rpois(n, lambda=theta[1]), rpois(n, lambda=theta[2]))
```

- Note: The function `poisregmixEM()` in the R package `mixtools` is designed to estimate a mixture of *Poisson regression* models. We can still use it for pmf estimation by recasting the problem as an intercept-only regression: set the `x` argument (predictors) to `x = rep(1, length(y))` and set `addintercept = FALSE`.
    - Look carefully at the output from this model. The outputs use different names/symbols than what we used in the course notes. The `beta` values (regression coefficients) are on the log scale, so exponentiate them to recover the Poisson means.
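
The call pattern described in the note might look like the sketch below. It uses a separate simulated sample (`y_toy`, hypothetical values) rather than the homework data, and the object names are assumptions. Because the EM algorithm starts from random values, results can vary slightly across seeds.

```r
library(mixtools)

set.seed(5)
y_toy <- c(rpois(100, 3), rpois(200, 20))   # hypothetical data, not the homework sample

# Intercept-only "regression": a column of ones, with no extra intercept added
fit_pm <- poisregmixEM(y_toy, x = rep(1, length(y_toy)),
                       addintercept = FALSE, k = 2)

fit_pm$lambda      # mixing proportions (mixtools' name; not the Poisson means)
exp(fit_pm$beta)   # component means, back-transformed from the log scale
fit_pm$loglik      # log-likelihood at convergence
```

Note that `lambda` in the `mixtools` output refers to the mixing proportions, which is the role $\pi$ and $1-\pi$ play in the notation above.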

::: {.callout-note title="Solution"}
Add solution here
:::

## f. **2 pts Extra Credit** EM from scratch

Write a function that estimates this two-component Poisson mixture model using the EM approach. Show that it gives the same result as part *e*. 

- Note: You are not permitted to copy code. Write everything from scratch and use comments to indicate how the code works (e.g., the initialization strategy, E-step, M-step, and convergence check should be clear).
- Cite any resources you consulted to help with the coding. 


::: {.callout-note title="Solution"}
Add solution here
:::