Notice, in the above problem, how we created the random_data
variable using both the dataframe the runif
functions. This latter function (runif
) draws random values from a uniform\((0,1)\) distribution. As mentioned in our first few Computing Assignments, R
can simulate a number of distributions, including the normal distributions
\[ \mathbb{W} \sim \mathcal{N}(-2, 1) ~~~~~\&~~~~~ \mathbb{V} \sim \mathcal{N}(2, 1). \]
For this next exercise, we are going to simulate 300 observations from \(\mathbb{W}\) and 200 observations from \(\mathbb{V}\), and create another variable \(\mathbb{Y}\) that classifies from which normal distribution each observation came. Indeed,
W_obs = rnorm(300, mean = -2)
V_obs = rnorm(200, mean = 2)
Y_class = c(rep(0, times = length(W_obs)), rep(1, times = length(V_obs)))
train_data = data.frame(X = c(W_obs, V_obs), Y = Y_class)
Question: Why did we not specify the value for standard deviation in the rnorm
function?
YOUR ANSWER HERE
Since we have specified ourselves the model for our data (here: two different normals), we can assess the performance of any classification technique we use on this data. We will illustrate this by calculating the Bayes rule for our train_data
. To do so, fill in the following quantities
pi_0 = YOUR CODE HERE (Probability of W_obs)
pi_1 = YOUR CODE HERE (Probability of V_obs)
To start, let’s just calculate the Bayes’ rule for the first observation of train_data
. To find the conditional probability, you may use the density
function in R
to estimate the PDF and get the appropriate probability mass for \(x\). That is, let \(P(x | Y = 0)\) be the probability mass of \(x\) in the sub-population for which \(Y=0\). The following is an example of how to get a probability mass from an estimated density function using the full training_data
. For your purposes, you will need to use a subset of the training_data
instead of the full.
# Example of PDF estimation
x_obs = train_data[1,1]
full_density = density(train_data$X)
index_of_density = sum(full_density$x <= x_obs)
pdf_value_of_x_obs = full_density$y[index_of_density]
pdf_value_of_x_obs
x_obs = train_data[1,1]
prob_x_given_0 = YOUR CODE HERE
prob_x_given_1 = YOUR CODE HERE
Bayes_rule_for_x = prob_x_given_1 * pi_1 / (prob_x_given_1 * pi_1 + prob_x_given_0 * pi_0)
What hypothesis does this Bayes Rule test? Based on our calculation, in which distribution should we classify x_obs
?
YOUR ANSWER HERE
Now, calculate the Bayes Rule for every value, and use them to compute a classifier for every observation in train_data
. (DO NOT do this exhaustively. You should be using built-in features in R
and/or a for loop.) Compare these classifiers with the true classifiers. Calculate, and report, the Bayes’ Risk.
YOUR CODE, AND ANALYSIS, HERE
Using the same train_data
from the last exercise, fit a \(k\)-nearest neighbors model for \(k \in \{1,3,11\}\). The code for \(k = 1\) is provided.
YOUR CODE, AND ANALYSIS, HERE.
# Take special care to seperate the classifier from the rest of the data when fitting a
# knn model. Consider the following code to give you an idea as to how one does this.
# Note, however, that this code is based off of my naming practices, and may require editing
# depending on your previous code.
train_data_classifiers = as.factor(train_data$Y)
train_data_observations = data.frame(train_data$X)
knn.1 <- knn(train_data_observations, train_data_observations, cl = train_data_classifiers, k=1)
R_knn_1 = 100 * sum(train_data_classifiers == knn.1)/length(knn.1)
R_knn_1
Comment on the performance for the different values of \(k\). Why does k = 1
do so well? What is it doing that gives it such great performance?
YOUR ANSWER HERE
Now let’s do the same thing with Fisher’s Linear Discriminate Analysis (LDA). We have provided the follow code as an example. Assess the risk using the derived predictions (you will need to grab the class
attribute from this variable).
library(MASS)
library(dplyr)
# Fit the model
model <- lda(Y~X, data = train_data)
# Make predictions
predictions <- model %>% predict(train_data)
YOUR CODE, AND ANALYSIS, HERE
Now that we have explored Bayes’ Rule, k-nearest neighbors, and LDA, we will see how each method performs on data that was NOT used to originally set them up. Consider the following new data set drawn from the same random variables \(\mathbb{W}~ \&~ \mathbb{V}\).
W_obs = rnorm(150, mean = -2)
V_obs = rnorm(50, mean = 2)
Y_class_test = c(rep(0, times = length(W_obs)), rep(1, times = length(V_obs)))
test_data = data.frame(X = c(W_obs, V_obs), Y = Y_class_test)
Use the information from train_data
to classify values in the test_data
, then compare these calculated classes with the true classes found in Y_class_test
. Report the Bayes’ Risk, and compare it to the same metric calculated from k-nearest neighbors and LDA.
Hints:
In the case of Bayes rule, you will use the same \(\pi_0 ~\&~ \pi_1\) and calculate the conditional probabilities in the same way, except this time your x_obs
will be from test_data
. DO NOT use the class labels from test_data
ANYWHERE in this calculation (until you evaluate at the end).
Use the manual page ?knn
to see how one inputs a different dataset for the “test” parameter. You may have to separate the observations \(X\) from the class labels \(Y\).
In a similar fashion, you are expected to read the manual pages for the functions used in the LDA process.
YOUR CODE, AND ANALYSIS, HERE
Now, let’s do this 1000 more times! During each iteration of the following for loop, use the models you have created to calculate classifiers for the test_data
and calculate the risk for each.
set.seed(13)
all_bayes_risks = c()
all_knn_risks = c()
all_lda_risks = c()
for(iteration in 1:1000){
W_obs = rnorm(150, mean = -2)
V_obs = rnorm(50, mean = 2)
Y_class_test = c(rep(0, times = length(W_obs)), rep(1, times = length(V_obs)))
test_data = data.frame(X = c(W_obs, V_obs), Y = Y_class_test)
YOUR MODEL CODE HERE
bayes_risk = YOUR CODE HERE
knn_risk = YOUR CODE HERE
lda_risk = YOUR CODE HERE
all_bayes_risks = c(all_bayes_risks, bayes_risk)
all_knn_risks = c(all_knn_risks, knn_risk)
all_lda_risks = c(all_lda_risks, lda_risk)
}
If done correctly, you should have three vectors of risk values, each from a different classification method. Create an intuitive plot that compares these three values. Make sure this plot compares the values AT THE SAME INTERATION. Our suggestion would be a line plot with the \(x\)-axis as the iteration number and the \(y\)-axis as the risk value, colored by classification method.
YOUR CODE HERE
Comment on your model
YOUR ANSWER HERE