---
title: "Dataframe Analysis"
author: "Barry"
date: "2/2/2022"
output: html_document
---

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient:

Attribute Information

1) gender: "Male", "Female"
2) age: age of the patient
3) hypertension: No if the patient doesn't have hypertension, Yes if the patient has hypertension
4) heart_disease: No if the patient doesn't have any heart diseases, Yes if the patient has a heart disease
5) ever_married: "No" or "Yes"
6) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
7) Residence_type: "Rural" or "Urban"
8) avg_glucose_level: average glucose level in blood
9) bmi: body mass index
10) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
11) stroke: Yes if the patient had a stroke and No if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.

Please do not edit the code block below. For those interested, it performs some basic cleaning on the dataset: (1) removing the observation whose gender is recorded as "Other", (2) setting the 'id' column as the row names, and (3) converting the categorical columns to factors.

Run the code block to load the dataset.

```{R}
library(dplyr)

# Read the comma-separated stroke dataset
dataset <- read.csv("https://raw.githubusercontent.com/BarryDigby/Youth-Academy/master/data/stroke-data.csv")

# Remove observations whose gender is recorded as "Other"
dataset <- dataset[!(dataset$gender %in% "Other"), ]

# Use the patient id as row names, then drop the id column (the first column)
rownames(dataset) <- dataset$id
dataset <- dataset[, -1]

# Convert the categorical columns to factors
factor_features <- c("gender", "hypertension", "heart_disease", "ever_married", "work_type", "Residence_type", "smoking_status", "stroke")
dataset <- dataset %>%
  mutate(across(all_of(factor_features), as.factor))
```


# Part 1

First, answer some basic questions about the dataset. 

Create a new code block and record your answers below the code block:

Q1: How many patients are in the dataset? 

Q2: How many males and females are in the dataset? (Hint: use table() on the gender column).

Q3: What is the average (mean) age of patients in the dataset? (Hint: use mean() on the age column).

Q4: What are the ages of the youngest and oldest patients in the dataset? (Hint: use min() and max() on the age column).

Q5: How many patients in the dataset have had a stroke? (similar to Q2)
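
The hinted functions can be tried out on a small made-up data frame first. The sketch below uses a hypothetical data frame called `toy` (not the stroke dataset); its column names mirror the real data, but every value is invented for illustration.

```{R}
# Hypothetical toy data, invented for illustration only (not the stroke dataset)
toy <- data.frame(
  gender = factor(c("Female", "Male", "Female", "Male", "Female")),
  age    = c(34, 61, 47, 72, 55),
  stroke = factor(c("No", "Yes", "No", "No", "Yes"))
)

nrow(toy)          # number of rows, i.e. one per patient
table(toy$gender)  # how many rows fall into each category of a factor column
mean(toy$age)      # average of a numeric column
min(toy$age)       # smallest value in a numeric column
max(toy$age)       # largest value in a numeric column
```

The same calls should work on `dataset` once the loading block above has been run.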


< delete this line and create a code block to answer the questions >

A1:

A2:

A3:

A4:

A5: 

# Part 2

We will investigate the effect of gender on the risk of having a stroke.

Create a code block below and split the dataframe by gender, creating two new dataframes called 'male_df' and 'female_df', respectively. (Hint: use the subset() function as shown on the website)
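
As a reminder of how `subset()` behaves, here is a minimal sketch on a hypothetical data frame called `toy_smoke`. The data and the variable names are invented for illustration; the example filters on a different column than the one this exercise asks for.

```{R}
# Hypothetical toy data, invented for illustration only
toy_smoke <- data.frame(
  age            = c(34, 61, 47, 72),
  smoking_status = factor(c("smokes", "never smoked", "smokes", "Unknown"))
)

# subset() keeps only the rows for which the condition is TRUE
toy_smokers <- subset(toy_smoke, smoking_status == "smokes")

toy_smokers        # the filtered data frame
nrow(toy_smokers)  # number of rows that matched the condition
```

Note that the comparison uses `==` (two equals signs) and that the category is written in quotes exactly as it appears in the data.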

< delete this line and create a code block splitting the dataset into males and females >

Based on the two dataframes:

Q1: Report the number of females that have had strokes and the number of males that have had strokes. (Hint: use table() on the stroke column for each dataframe).

Q2: What is the average age for females and males in the dataset?

< delete this line and create a code block to answer the question >

A1: 

A2: 

# Part 3

The counts reported in Part 2 could be misleading. If there are more males than females or vice versa, a difference in the number of strokes could be due to the unequal group sizes rather than to gender.

To account for the unequal group sizes, we will generate a new statistic called 'stroke rate' for females and males in a three-step process:

1. Calculate the number of males that have had a stroke and save it as a variable called 'n_stroke_males'

2. Calculate the number of males in the dataset and save it as a variable called 'n_males'

3. Divide n_stroke_males by n_males to compute the rate of strokes in men. Save the result to a variable. 

Repeat the above process for females and report your answer for each gender below the code block.
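
The three steps can be sketched on a hypothetical data frame called `toy_group`. The values and the `toy_` variable names are invented for illustration, and the sketch assumes the stroke column is coded "Yes"/"No" as described in the attribute list.

```{R}
# Hypothetical toy data, invented for illustration only
# Assumption: stroke is coded "Yes"/"No", as in the attribute list
toy_group <- data.frame(
  stroke = factor(c("No", "Yes", "No", "No", "Yes", "No", "No", "No"))
)

# Step 1: count the rows where stroke is "Yes"
toy_n_stroke <- sum(toy_group$stroke == "Yes")

# Step 2: count all rows in the group
toy_n_total <- nrow(toy_group)

# Step 3: divide the two counts to get the rate
toy_rate <- toy_n_stroke / toy_n_total
toy_rate
```

A rate is a proportion between 0 and 1, so it can be compared between groups of different sizes.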

Q1: Is the rate of strokes higher in males or females? 

< delete this line and create a code block to calculate the rate of strokes in males and females >

A1:

# Part 4

A tabloid newspaper has contacted you and wants to run the headline "Marriage causes strokes!". Investigate the hypothesis below (or hang up the phone) and report your findings to the reporter.

Split the original dataset into two new dataframes called 'married' and 'single' using the subset() function, just like you did previously for males and females.

< delete this line and split the dataset, saving the two new variables > 

Q1: How many married patients have had a stroke? 

Q2: How many single patients have had a stroke? 

< delete this line and create a new code block to answer the above questions >

A1: 

A2:

Just like the analysis of females and males in Part 3, calculate the rate of strokes in married and single patients. 
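
As an optional cross-check on the step-by-step calculation, base R's `prop.table()` can turn a two-way table of counts into row-wise proportions in one go. The sketch below uses a hypothetical data frame called `toy_pets` with invented columns and values; it is one possible approach under those assumptions, not the required method.

```{R}
# Hypothetical toy data, invented for illustration only
toy_pets <- data.frame(
  owns_pet = factor(c("Yes", "Yes", "Yes", "Yes", "No", "No")),
  outcome  = factor(c("Yes", "No", "No", "No", "Yes", "No"))
)

# Two-way table of counts: rows are the groups, columns are the outcome
toy_counts <- table(toy_pets$owns_pet, toy_pets$outcome)

# margin = 1 divides each count by its row total,
# so every row shows the proportions within that group
toy_rates <- prop.table(toy_counts, margin = 1)
toy_rates
```

Each row of `toy_rates` sums to 1, so the "Yes" column can be read directly as the rate of the outcome within each group.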

Q3: The reporter wants to know: are strokes more likely in married patients?

< create a code block to calculate the rate of strokes in married and single patients >

A3: 


Bonus: Please include any comments you feel you should mention before the reporter tries to publish the story!

# Part 5

Below are examples of plots that convey information about more than one column at a time.

This is a teaser for future weeks, where you will make these plots yourself to visualise datasets.

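```{R}
library(ggpubr)
# Histogram of age, split by gender and coloured by stroke status, with group means marked
gghistogram(dataset, x="age", y="..count..", facet.by = "gender", color = "stroke", add = "mean", fill = "stroke", palette = "lancet")
# Boxplot of age by stroke status, split by gender
ggboxplot(dataset, x="stroke", y="age", facet.by = "gender", fill = "stroke", palette = "lancet", bxp.errorbar = TRUE)
```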


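```{R}
library(biomaRt)

# Connect to the Ensembl BioMart human gene dataset
hsa_mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# List the attributes that can be retrieved from this mart
listAttributes(hsa_mart)

# Retrieve miRBase accessions and IDs alongside Ensembl gene IDs
x <- getBM(attributes = c("mirbase_accession", "mirbase_id", "ensembl_gene_id"), mart = hsa_mart)

# Keep only rows with a non-missing, non-empty miRBase accession
x <- x[!(is.na(x$mirbase_accession) | x$mirbase_accession == ""), ]
```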