---
title: "Internship Report in R"
output: pdf_document
---

## Internship Report   
```{r echo = F}  
rm(list=ls())
```  

## Importing Packages    
```{r echo = FALSE, include=FALSE, warning = FALSE, message = FALSE}  
#Including Packages 


installIfAbsentAndLoad <- function(neededVector) {
  for(thispackage in neededVector) {
    if( ! require(thispackage, character.only = T) )
    { install.packages(thispackage)}
    require(thispackage, character.only = T)
  }
}

needed <- c('purrr', 'randomForest', 'caret')  
installIfAbsentAndLoad(needed) 
```  

## Data Preprocessing Steps  
```{r echo = TRUE}  

###### Read the data in ######
data <- read.csv(file='insurance.csv')

###### Print the first rows ######
print(head(data, 5))

###### Print the columns' names ######
print(colnames(data))

###### Print number of rows ######
print(nrow(data))

###### Converting to Numeric Variables ###### 
sex <- ifelse(data["sex"] == "female", 0, 1)
smoker  <- ifelse(data["smoker"] == "yes", 1, 0)
region <- as.numeric(data$region)

##### Replacing columns in the Data ###### 
data["sex"] <-  sex
data["smoker"] <-  smoker
data["region"] <- region
```  

## Linear Models - using the `purrr` package to get individual models  
```{r echo = TRUE}  
###### Linear Regression ######
vars = c('age', 'sex', 'bmi', 'children', 'smoker', 'region')
#Using the purrr package to run all the models corresponding to the predictors
models <- vars %>% paste ('charges ~', .) %>% map(as.formula) %>% map(lm, data = data)
```  

## Summaries of the Models  

Age  
```{r echo = T}
# age summary
summary(models[[1]])
```  
Sex 

```{r echo = T}
# sex summary
summary(models[[2]])
```  
BMI 

```{r echo = T}
# bmi summary
summary(models[[3]])
```  
Children  

```{r echo = T}
# children summary
summary(models[[4]])
```  
Smoker  

```{r echo = T}
# smoker summary
summary(models[[5]])
```  
Region  

```{r echo = T}
# region summary
summary(models[[6]])
```     
  
******  


## Linear Model with All Predictors     
```{r echo = TRUE} 
###### Model with all the predictors  ###### 
allpreds <- lm(charges ~ ., data = data)
```    

## Summary of the Model     
```{r echo = T}
###### Summary ###### 
summary(allpreds)
```  

******  

## Linear Model with the Most Relevant Predictors  
``` {r echo = T}
most_rel <- lm(charges ~ age + bmi + children + smoker, data = data)
```  





## Summary of the Model  
```{r echo = T}
summary(most_rel)
```

## Random Forest Model  
```{r echo = T}
###### Random Forest Model ######
set.seed(100)

#setting a train and test set 
train <- sample(nrow(data), 0.8*nrow(data), replace = FALSE) 
trainset <- data[train,]
testset <- data[-train,]

random.forest1 <- randomForest(charges ~ ., data = trainset, ntree = 500, mtry = 6, 
                               importance = TRUE)

random.forest1
```  
******
## Generating the plot  
```{r echo = T}
plot(main = "Random Forest Error vs. Number of Trees", random.forest1)
```  

## Generating a Confusion Matrix  
In order to get a better model, I decided to use the `ifelse()` function in R and get a cutoff of the data i.e. using the Mean and Median in this case **10,000 USD** to predict charges. **Less than or equal** to **10,000** is 0, and **more than or equal** is a 1.  

Summary of the testset$charges variable  
```{r echo = T}
summary(testset$charges)
```  

### Confusion Matrix - using the `Caret` Package  
```{r echo = T}
###### Testing the model ######
prediction <- predict(random.forest1, newdata = testset)

prediction <- ifelse(prediction <= 10000, 0, 1)
testing <- ifelse(testset$charges <= 10000, 0, 1)

confusionMatrix(factor(prediction, levels = min(testing):max(testing)), 
      factor(testing, levels = min(testing):max(testing)))

```  

## Tuning the Random Forest Model  
The tuneRF() function comes from the `randomForest` package.  

According to the documentation, this function starts from the given parameter of `mtry` - 3 in this example - and searches for the **optimal value of mtry**.  

*With respect to Out-of-Bag error estimate*  
```{r echo = T, warning = F}
set.seed(100)
tuning.model <- tuneRF(
  x = testset, 
  y = testset$charges, 
  ntreeTry = 600, 
  mtryStart = 3, 
  stepFactor = 0.5, 
  improve = 0.03, 
  trace = FALSE
)

```

### Benefits of Random Forest   

-Easy to interpret the models  
-Could be used for regression or classification  
-Could be used in large datasets   

### Pitfalls of Random Forest    

-Are prone to overfitting    
-Accuraccy tends to be lower than other Machine Learning techniques    
-High Variance

*[Citation: Towards AI](https://towardsai.net/p/machine-learning/why-choose-random-forest-and-not-decision-trees)*  

## For Comparison with Python Models (Links)
[GitHub Pages for the Internship](https://arcelioeperez.github.io/dash-app/) | [GitHub Repository](https://github.com/arcelioeperez/dash-app/tree/gh-pages)  
[Heroku App - using Dash and Plotly](https://my-internship-app.herokuapp.com/)