Let’s walk through an example done during lecture. First, load the adults.csv
data (downloaded originally from here).
adult = read.csv("adults.csv")
adult$X = NULL
Next, we will run the logistic regression model to predict the classifier income
, which marks whether a given adult makes \(\leq \$50k\) (coded as a 0) or \(\geq \$50k\) (coded as a 1). To assess the performance of our model, we will only build the model on 75% of our data so that we can later use the remaining 25% as a testing data set.
set.seed(13)
training_size <- round(.75 * nrow(adult)) # training set size
indices = sample(1:nrow(adult), training_size)
training_set <- adult[indices,]
testing_set <- adult[-(indices),]
m1 <- glm(income ~ ., data = training_set, family = binomial('logit'))
summary(m1)
Examine the summary
of our logistic regression model. Comment on the significance of each of our predictors. Do any of the significant predictors surprise you? Provide an interpretation of what the Estimate
value is for the predictor age
. Your answer should say something about this value’s relation to the log-odds.
YOUR ANALYSIS HERE
Now that you have created a model on the training_data
, use the predict
function in R to use your model to classify the data in the testing_data
. What proportion of values were classified correctly?
YOUR CODE, AND ANALYSIS, HERE
Build a LDA model on the training_data
, and see how well it performs classifying the observations in the testing_data
. Compare the proportion of values classified correctly to this same metric we just calculated for the logistic regression model.
YOUR CODE, AND ANALYSIS, HERE
Now, let us apply these same methods on some new data, the given data kids.csv
. Each observation of this data set records the demographics of a person as well as whether or not they bought a magazine.
The variables given are as follows:
as well as a binary classifier Buy
that marks whether or not a given person bought a magazine.
Randomly assign 75% of your data as the training data and the other 25% as your testing data. Build a logistic regression model on the training, discuss the model summary and significance of the parameter values, and assess the model’s performance in the testing data. Then build a LDA model on the training data and see how well it performs classifying the observations in the testing data. Compare the proportion of values classified correctly to this same metric we just calculated for the logistic regression model.
This last exercise leaves much of the process up to you! Go through previous code, google issues, and read manual pages if you get lost.
set.seed(13)
YOUR CODE, AND ANALYSIS, HERE
This data, and the information about it, was gotten from here.