Greetings everyone! With the final project looming overhead, I have attempted to make this a lighter computer assignment. Since the amount that I need to cover does not really align with that intent, I am going to provide some examples in this assignment and then simply require you to use the built-in R functions on some data. Please don’t be intimidated by the number of questions; this assignment is mostly about reading examples and using pre-written functions from R. -Murph
Note: You’ll need to install the e1071 and stringr packages before this document will compile.
Recall that both LASSO and Ridge have a penalty parameter \(\lambda\) that controls the amount of regularization in our procedure. As you might imagine, a good choice of \(\lambda\) is crucial to obtaining a good fit. Since our goal is to maximize prediction accuracy, we can use what we have learned about cross-validation to choose a good \(\lambda\). Recall that we use cross-validation to estimate the actual test error of a procedure (the test error we would get for the procedure in general, not of a particular classification rule we have created on the training data).
Each distinct value of the penalty parameter \(\lambda\) determines a distinct model we could use to fit our data, and our aim is to pick the best one. The following algorithm uses cross-validation to choose \(\lambda\) by estimating the prediction error for a set of candidate \(\lambda\)s: split the data into \(K\) folds and let \(e_k(\lambda)\) denote the prediction error on the \(k\)-th fold when the model with penalty \(\lambda\) is fit on the remaining folds. The \(\lambda\) chosen is the one that minimizes
\[CV(\lambda) = \frac{1}{K} \sum_{k = 1}^K e_k(\lambda).\]
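Below is a minimal sketch of this K-fold search written out by hand. It uses the glmnet package and simulated data purely for illustration; in the exercises below you will use the Lasso function from the last homework instead.

library(glmnet)

set.seed(1)
n = 100; p = 10
X = matrix(rnorm(n * p), ncol = p)               # toy predictors
y = X[, 1] - 2 * X[, 2] + rnorm(n)               # toy response

K = 10
lambdas = c(0.01, 0.05, 0.1, 0.5, 1)             # candidate penalties
folds = sample(rep(1:K, length.out = n))         # random fold labels

cv_error = sapply(lambdas, function(lam) {
  fold_mse = sapply(1:K, function(k) {
    fit = glmnet(X[folds != k, ], y[folds != k], alpha = 1, lambda = lam)
    pred = predict(fit, newx = X[folds == k, , drop = FALSE])
    mean((y[folds == k] - pred)^2)               # e_k(lambda)
  })
  mean(fold_mse)                                 # CV(lambda)
})

lambdas[which.min(cv_error)]                     # lambda with the smallest estimated error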
Further discussion of this process can be found here. Luckily, R has built-in functionality to perform this entire process for us! We will practice using these built-in functions on the following data.
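As a quick illustration of such built-in functionality (again using glmnet, which is not the package required in this assignment), cv.glmnet() carries out the whole K-fold search in one call; this snippet reuses the toy X and y from the sketch above.

cv_fit = cv.glmnet(X, y, alpha = 1, nfolds = 10)
cv_fit$lambda.min   # the lambda with the smallest estimated CV error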
We will attempt to identify trees based on image data of their leaves. This is a tough problem, though apps such as iNaturalist now do a pretty good job identifying plants from images taken on your phone.
The data set is from here.
Images have been pre-processed, so the dataset includes vectors for margin, shape and texture attributes for each of almost 1000 images.
We will start by loading the leaves dataset and dropping the id and species variables.
leaf = read.csv("leaves.csv")   # read in the leaf image features
leaf$id = NULL                  # drop the identifier column
leaf$species = NULL             # drop the species label
Use the Lasso function from the last homework. Make your response the margin1 variable and the rest of the variables your predictors. Set the parameter fix.lambda to FALSE.
library(MASS)
YOUR CODE HERE
When fix.lambda is set to FALSE, the Lasso function tunes \(\lambda\) using cross-validation. According to the manual page, what is the default number of folds the Lasso function uses?
YOUR ANSWER HERE
YOUR CODE HERE
Cross-validation for ridge regression in R is done with the ridgereg.cv function in the MXM package. ridgereg.cv cross-validates on every value of \(\lambda\) you provide and plots the Mean Square Prediction Error (MSPE) for each \(\lambda\). Use ridgereg.cv for \(\lambda \in \{0.5, 1.0, 1.5, \dots, 3.5, 4.0\}\).
library(MXM)
YOUR CODE HERE
Based on the output of ridgereg.cv, which value of \(\lambda\) should we choose?
YOUR ANSWER HERE
YOUR CODE HERE
For this section we will use the svm function in package e1071. Let us walk through a simple example (originally found here) to see how the svm function works:
set.seed(13)
x = matrix(rnorm(200 * 2), ncol = 2)   # 200 points in two dimensions
x[1:100, ] = x[1:100, ] + 2            # shift the first 100 points up and to the right
x[101:150, ] = x[101:150, ] - 2        # shift the next 50 points down and to the left
y = c(rep(1, 150), rep(2, 50))         # class labels: 1 for the first 150 points, 2 for the last 50
dat = data.frame(x = x, y = as.factor(y))
plot(x, col = y)
Note here that our data are NOT linearly separable! In fact, it does not appear that a linear separation rule will work here. Let us verify this using a linear kernel with our SVM:
library(e1071)
train = sample(200, 100)
svmfit = svm(y ~ ., data = dat[train, ], kernel = "linear", gamma = 1, cost = 1)
plot(svmfit, dat)
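Alongside the plot, we can tabulate the predicted classes on the training half against the true labels (this quick check is an addition to the walkthrough and uses predict() from e1071):

table(predicted = predict(svmfit, newdata = dat[train, ]), truth = dat[train, "y"])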
It would appear that all observations are classified as 1s, confirming our suspicions. Luckily, SVMs are not limited to linear kernels:
library(e1071)
train = sample(200, 100)
svmfit = svm(y ~ ., data = dat[train, ], kernel = "radial", gamma = 1, cost = 1)
plot(svmfit, dat)
The above is a special form of SVM where we used a radial kernel. While the use of non-linear kernels is an interesting topic to explore, we merely introduce it here. For the following example and exercise, we will use a linear kernel.
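To get a sense of how the radial-kernel fit generalizes, one option (not part of the original example) is to predict on the half of the data that was held out of training:

pred_test = predict(svmfit, newdata = dat[-train, ])
table(predicted = pred_test, truth = dat[-train, "y"])
mean(pred_test != dat[-train, "y"])   # estimated test error rate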
Recall from class that SVM requires a tuning parameter \(C\) (which, if you check the manual page, the svm function calls cost). Like the ridgereg.cv function, the e1071 library has a built-in cross-validation function for choosing a good value of \(C\). Observe the following:
set.seed(13)
tune.out = tune(svm, y ~ ., data = dat, kernel = "linear",
                ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.001
##
## - best performance: 0.25
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.25 0.0745356
## 2 1e-02 0.25 0.0745356
## 3 1e-01 0.25 0.0745356
## 4 1e+00 0.25 0.0745356
## 5 5e+00 0.25 0.0745356
## 6 1e+01 0.25 0.0745356
## 7 1e+02 0.25 0.0745356
According to the output, our best choice of the cost parameter would be 0.001. The tune() function stores the best model obtained, which can be accessed as follows:
bestmod = tune.out$best.model
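Once extracted, best.model behaves like any other svm fit. As a usage sketch (an addition to the example), we can summarize it and compute its training error rate on dat:

summary(bestmod)
ypred = predict(bestmod, newdata = dat)
mean(ypred != dat$y)   # training error rate of the tuned model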
We begin by loading the leaves dataset again, this time dropping only the id variable. We will further extract the genus of each observation using the stringr package in R.
library(stringr)
leaf = read.csv("leaves.csv", stringsAsFactors = FALSE)
leaf$id = NULL
leaf$genus <- str_split(leaf$species, "_", simplify = TRUE)[, 1]   # genus = text before the first underscore
leaf$genus = as.factor(leaf$genus)
leaf$species = NULL
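To see what the str_split() call above returns, here is a small illustration on a made-up species-style string (the string itself is just for demonstration): with simplify = TRUE the result is a character matrix, and its first column is the genus.

str_split("Acer_Opalus", "_", simplify = TRUE)        # 1 x 2 character matrix: "Acer" "Opalus"
str_split("Acer_Opalus", "_", simplify = TRUE)[, 1]   # first column gives the genus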
Plot shape1 against shape50, coloring by the genus label. Does this data look linearly separable?
YOUR CODE AND ANALYSIS HERE
YOUR CODE HERE
Report the cost value of the best model. Use the predict function to classify the data from the test set and the training set. Report both the testing and training errors.
YOUR CODE HERE