Greetings everyone! With the final project looming overhead, I have attempted to make this a lighter computer assignment. Since the amount that I need to cover does not really align with my intent here, I am going to provide some examples in this assignment and then simply require you to use the built-in R functions on some data. Please don't be intimidated by the number of questions; this assignment is mostly about reading examples and using pre-written functions from R. -Murph
Note: You’ll need to install the e1071 and stringr packages before this document will compile.
Recall that both LASSO and Ridge have a penalty parameter \(\lambda\) that controls the amount of regularization in our procedure. As you might imagine, a good choice of \(\lambda\) is crucial to finding a good procedure fit. With our goal being to maximize our prediction accuracy, we can actually use what we have learned about cross-validation to choose a good \(\lambda\). Recall that we use cross-validation to create an estimate for the actual test error of a procedure (the test error we would get for a general procedure, not of a particular classification rule we have created on the training data).
Each distinct value of the penalty parameter \(\lambda\) determines a distinct model we could use to fit our data, and our aim is to pick the best one. The following algorithm uses cross-validation to choose \(\lambda\) by calculating estimates of prediction accuracy for a set of candidate \(\lambda\)s; the \(\lambda\) chosen is the one with the best estimated accuracy:
\[CV(\lambda) = \frac{1}{K} \sum_{k = 1}^K e_k(\lambda),\]
where \(e_k(\lambda)\) is the prediction error on the \(k\)th held-out fold when the model is fit with penalty \(\lambda\) on the remaining \(K - 1\) folds.
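As a concrete illustration, here is a minimal sketch of this algorithm on simulated data. It assumes the glmnet package is installed (glmnet is not required elsewhere in this assignment; the built-in functions we use below automate exactly this loop):

```r
# Sketch of CV(lambda): for each candidate lambda, average the held-out
# prediction error e_k(lambda) over K folds, then keep the best lambda.
library(glmnet)

set.seed(1)
X = matrix(rnorm(100 * 5), ncol = 5)
y = X %*% c(2, -1, 0, 0, 1) + rnorm(100)

lambdas = 10^seq(-3, 1, length.out = 20)
K = 10
folds = sample(rep(1:K, length.out = nrow(X)))

cv_err = sapply(lambdas, function(lam) {
  e_k = sapply(1:K, function(k) {
    fit = glmnet(X[folds != k, ], y[folds != k], alpha = 1, lambda = lam)
    pred = predict(fit, X[folds == k, ])
    mean((y[folds == k] - pred)^2)  # e_k(lambda): error on held-out fold k
  })
  mean(e_k)                         # CV(lambda): average over the K folds
})

best_lambda = lambdas[which.min(cv_err)]
```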
Further discussion of this process can be found here. Luckily for us, R has built-in functionality to perform this entire process for us! We will practice using these built-in functions on the following data.
We will attempt to identify trees based on image data of their leaves. This is a tough problem, though apps such as iNaturalist now do a pretty good job identifying plants from images taken on your phone.
The data set is from here.
Images have been pre-processed, so the dataset includes vectors of margin, shape, and texture attributes for each of almost 1000 images.
We will start by loading the leaves dataset and dropping the id and species variables.
leaf = read.csv("leaves.csv")
leaf$id = NULL
leaf$species = NULL
Use the Lasso function from the last homework. Make your response the margin1 variable and the rest of the variables your predictors. Set the parameter fix.lambda to FALSE.

library(MASS)
YOUR CODE HERE
When fix.lambda is set to FALSE, the Lasso function tunes \(\lambda\) using cross-validation. According to the manual page, what is the default number of folds the Lasso function uses?

YOUR ANSWER HERE
YOUR CODE HERE
Cross-validated ridge regression in R is done with the ridgereg.cv function in the MXM package. ridgereg.cv cross-validates on every value of \(\lambda\) you provide and plots the Mean Square Prediction Error (MSPE) for each \(\lambda\). Use ridgereg.cv for \(\lambda \in \{0.5, 1.0, 1.5, \dots, 3.5, 4.0\}\).

library(MXM)
YOUR CODE HERE
According to the output of ridgereg.cv, which value of \(\lambda\) should we choose?

YOUR ANSWER HERE
YOUR CODE HERE
For this section we will use the svm function in package e1071. Let us walk through a simple example (originally found here) to see how the svm function works:
set.seed(13)
x = matrix(rnorm(200 * 2), ncol = 2)
x[1:100, ] = x[1:100, ] + 2
x[101:150, ] = x[101:150, ] - 2
y = c(rep(1, 150), rep(2, 50))
dat = data.frame(x = x, y = as.factor(y))
plot(x, col = y)
Note here that our data are NOT linearly separable! It does not appear that any linear separation rule will work well here. Let us verify this using a linear kernel with our SVM:
library(e1071)
train = sample(200, 100)
svmfit = svm(y ~ ., data = dat[train, ], kernel = "linear", gamma = 1, cost = 1)
plot(svmfit, dat)
It would appear that all observations are classified as 1s, confirming our suspicions. Luckily, SVMs are not limited to linear kernels:
library(e1071)
train = sample(200, 100)
svmfit = svm(y ~ ., data = dat[train, ], kernel = "radial", gamma = 1, cost = 1)
plot(svmfit, dat)
The above is a special form of SVM where we used a radial kernel. While the use of non-linear kernels is an interesting topic to explore, we merely introduce it here. For the following example and exercise, we will use a linear kernel.
Recall from class that SVM requires a tuning parameter \(C\) (which, if you check the manual page, the svm function calls cost). Like the ridgereg.cv function, the e1071 library has a built-in cross-validation function for choosing a good value of \(C\). Observe the following:
set.seed(13)
tune.out = tune(svm, y ~ ., data = dat, kernel = "linear",
                ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.001
##
## - best performance: 0.25
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.25 0.0745356
## 2 1e-02 0.25 0.0745356
## 3 1e-01 0.25 0.0745356
## 4 1e+00 0.25 0.0745356
## 5 5e+00 0.25 0.0745356
## 6 1e+01 0.25 0.0745356
## 7 1e+02 0.25 0.0745356
According to the output, our best choice of the cost parameter would be 0.001. The tune() function stores the best model obtained, which can be accessed as follows:
bestmod = tune.out$best.model
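As a quick illustration (not part of the exercises below), we might check how the selected model classifies our simulated data using the predict function:

```r
# Classify the simulated observations with the chosen model and
# compute the misclassification rate.
pred = predict(bestmod, dat)
table(predicted = pred, truth = dat$y)
mean(pred != dat$y)
```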
We begin by loading the leaves dataset again and this time only dropping the id variable. We will further extract the genus of each observation using the stringr package in R.
library(stringr)
leaf = read.csv("leaves.csv", stringsAsFactors = FALSE)
leaf$id = NULL
leaf$genus <- str_split(leaf$species, "_", simplify = TRUE)[, 1]
leaf$genus = as.factor(leaf$genus)
leaf$species = NULL
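To see what str_split is doing here, consider a toy example (these species strings are just for illustration):

```r
library(stringr)

species = c("Acer_Capillipes", "Quercus_Rubra")
# simplify = TRUE returns a character matrix with one column per piece;
# taking column 1 keeps only the genus.
str_split(species, "_", simplify = TRUE)[, 1]
# -> "Acer" "Quercus"
```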
Plot shape1 by shape50, coloring by genus label. Does this data look linearly separable?

YOUR CODE AND ANALYSIS HERE
YOUR CODE HERE
Report the cost value of the best model. Use the predict function to classify the data from the test set and the training set. Report both the testing and training errors.

YOUR CODE HERE