---
title: "Internship Report in R"
output: pdf_document
---

## Internship Report

```{r echo = FALSE}
# Start from a clean workspace
rm(list = ls())
```

## Importing Packages

```{r echo = FALSE, include = FALSE, warning = FALSE, message = FALSE}
# Including Packages: install any that are missing, then load them
installIfAbsentAndLoad <- function(neededVector) {
  for (thispackage in neededVector) {
    if (!require(thispackage, character.only = TRUE)) {
      install.packages(thispackage)
    }
    require(thispackage, character.only = TRUE)
  }
}

needed <- c('purrr', 'randomForest', 'caret')
installIfAbsentAndLoad(needed)
```

## Data Preprocessing Steps

```{r echo = TRUE}
###### Read the data in ######
data <- read.csv(file = 'insurance.csv')

###### Print the first rows ######
print(head(data, 5))

###### Print the columns' names ######
print(colnames(data))

###### Print number of rows ######
print(nrow(data))

###### Converting to Numeric Variables ######
sex <- ifelse(data$sex == "female", 0, 1)
smoker <- ifelse(data$smoker == "yes", 1, 0)
# Go through factor() first: since R 4.0, read.csv() keeps strings as
# character vectors, and as.numeric() on those would return NA
region <- as.numeric(factor(data$region))

###### Replacing columns in the Data ######
data$sex <- sex
data$smoker <- smoker
data$region <- region
```

## Linear Models - using the `purrr` package to get individual models

```{r echo = TRUE}
###### Linear Regression ######
vars <- c('age', 'sex', 'bmi', 'children', 'smoker', 'region')

# Using the purrr package to run all the models corresponding to the predictors
models <- vars %>% paste('charges ~', .)
# Turn each formula string into a formula, then fit one linear model per predictor
models <- models %>% map(as.formula) %>% map(lm, data = data)
```

## Summaries of the Models

Age

```{r echo = TRUE}
# age summary
summary(models[[1]])
```

Sex

```{r echo = TRUE}
# sex summary
summary(models[[2]])
```

BMI

```{r echo = TRUE}
# bmi summary
summary(models[[3]])
```

Children

```{r echo = TRUE}
# children summary
summary(models[[4]])
```

Smoker

```{r echo = TRUE}
# smoker summary
summary(models[[5]])
```

Region

```{r echo = TRUE}
# region summary
summary(models[[6]])
```

******

## Linear Model with All Predictors

```{r echo = TRUE}
###### Model with all the predictors ######
allpreds <- lm(charges ~ ., data = data)
```

## Summary of the Model

```{r echo = TRUE}
###### Summary ######
summary(allpreds)
```

******

## Linear Model with the Most Relevant Predictors

```{r echo = TRUE}
most_rel <- lm(charges ~ age + bmi + children + smoker, data = data)
```

## Summary of the Model

```{r echo = TRUE}
summary(most_rel)
```

## Random Forest Model

```{r echo = TRUE}
###### Random Forest Model ######
set.seed(100)

# Split the data into a training set (80%) and a test set (20%)
train <- sample(nrow(data), floor(0.8 * nrow(data)), replace = FALSE)
trainset <- data[train, ]
testset <- data[-train, ]

random.forest1 <- randomForest(charges ~ ., data = trainset,
                               ntree = 500, mtry = 6, importance = TRUE)
random.forest1
```

******

## Generating the plot

```{r echo = TRUE}
plot(random.forest1, main = "Random Forest Error vs. Number of Trees")
```

## Generating a Confusion Matrix

To evaluate the model with a confusion matrix, I used the `ifelse()` function in R to split charges into two classes at a cutoff. The cutoff, chosen with the mean and median of charges in mind, is **10,000 USD**. Charges **less than or equal** to **10,000** are coded 0, and charges **greater than 10,000** are coded 1.

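
A minimal sketch of this coding rule on a few made-up values may help; it follows the `ifelse(... <= 10000, 0, 1)` call used in the confusion matrix code of this report. The names `toy_charges`, `cutoff`, and `toy_class` are hypothetical and are not part of the analysis.

```r
# Hypothetical charges in USD; these values are made up for illustration only
toy_charges <- c(2500, 10000, 10000.01, 48000)

# Same rule as in the report: <= cutoff is coded 0, > cutoff is coded 1
cutoff <- 10000
toy_class <- ifelse(toy_charges <= cutoff, 0, 1)

print(toy_class)
```

A charge exactly at the cutoff falls in class 0, because the comparison is `<=`.
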
Summary of the `testset$charges` variable

```{r echo = TRUE}
summary(testset$charges)
```

### Confusion Matrix - using the `caret` Package

```{r echo = TRUE}
###### Testing the model ######
prediction <- predict(random.forest1, newdata = testset)

# Binarize the predicted and the actual charges at the 10,000 USD cutoff
prediction <- ifelse(prediction <= 10000, 0, 1)
testing <- ifelse(testset$charges <= 10000, 0, 1)

# Give both factors the same levels so the table is always 2 x 2
confusionMatrix(factor(prediction, levels = c(0, 1)),
                factor(testing, levels = c(0, 1)))
```

## Tuning the Random Forest Model

The `tuneRF()` function comes from the `randomForest` package. According to the documentation, this function starts from the given value of `mtry` (3 in this example) and searches for the **optimal value of mtry** *with respect to the Out-of-Bag error estimate*.

```{r echo = TRUE, warning = FALSE}
set.seed(100)

# Pass the predictors only: the response `charges` must not be a column of x
tuning.model <- tuneRF(
  x = testset[, names(testset) != "charges"],
  y = testset$charges,
  ntreeTry = 600,
  mtryStart = 3,
  stepFactor = 0.5,
  improve = 0.03,
  trace = FALSE
)
```

### Benefits of Random Forest

- Easy to interpret the models
- Could be used for regression or classification
- Could be used in large datasets

### Pitfalls of Random Forest

- Are prone to overfitting
- Accuracy tends to be lower than other Machine Learning techniques
- High Variance

*[Citation: Towards AI](https://towardsai.net/p/machine-learning/why-choose-random-forest-and-not-decision-trees)*

## For Comparison with Python Models (Links)

[GitHub Pages for the Internship](https://arcelioeperez.github.io/dash-app/) | [GitHub Repository](https://github.com/arcelioeperez/dash-app/tree/gh-pages)

[Heroku App - using Dash and Plotly](https://my-internship-app.herokuapp.com/)