---
title: "Estimation"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Samples and populations

Biologists want to make inferences about a population based on subsamples of that population.

- the collection of observations drawn from the population is the **sample**
- the number of observations is the **sample size**

The basic method of sampling is simple random sampling, in which all observations have the same probability of being sampled.

- this rarely happens in practice (Why is this a concern?)

Random sampling is important because we want to use sample statistics to estimate population parameters. We cannot measure population parameters directly because the population is too large.

A good estimator of a population parameter should be:

- **unbiased**: repeated samples should produce estimates that neither under- nor overestimate the population parameter
- **consistent**: increasing the sample size should bring the sample estimate closer to the population parameter
- **efficient**: it should have the lowest variance among competing estimators

There are two broad types of estimates:

1. point estimates (a single value)
2. interval estimates (a range of values)

## Common parameters and statistics

### Center (location of distribution)

To explore these statistics, we will generate a large sample of data.

```{r}
set.seed(9222016)

rand_data <- data.frame(obs = 1:999, concentration = rnorm(999, 113, 16))
head(rand_data)

# minimum
min(rand_data$concentration)

# maximum
max(rand_data$concentration)
```

Let's visualize this using a histogram. There are two approaches: one is to generate the binned data with `dplyr` and the other is to use `geom_histogram` in ggplot.

```{r}
library(ggplot2)
library(dplyr)

# We can set the number of bins to divide the data into
ggplot(data = rand_data) +
  geom_histogram(aes(x = concentration), fill = "dodgerblue", colour = "black", bins = 50) +
  theme_bw()

# or we can set how wide we want the bins to be
ggplot(data = rand_data) +
  geom_histogram(aes(x = concentration), fill = "dodgerblue", colour = "black", binwidth = 5) +
  theme_bw()

# use dplyr to create bins and then geom_bar to create the plot
summ_data <- rand_data %>%
  mutate(bin = cut(concentration, seq(50, 175, by = 5), labels = seq(50, 170, by = 5))) %>%
  group_by(bin) %>%
  summarise(N = n()) %>%
  ungroup() %>%
  mutate(bin = as.numeric(as.character(bin)))

ggplot(data = summ_data) +
  geom_bar(aes(x = bin, y = N), stat = "identity", fill = "dodgerblue", colour = "black", width = 5) +
  theme_bw()
```

#### L-estimators

L-estimators are based on ordering the data from smallest to largest and then using a linear combination of weighted order statistics.

##### Mean

The mean is an unbiased estimator of the population mean in which each observation is weighted by 1/n.

```{r}
sum(rand_data$concentration) / length(rand_data$concentration)

mean(rand_data$concentration)
```

##### Median

The median is an unbiased estimator of the population mean for a normal distribution and is a better estimator for skewed distributions.

```{r}
# First order the data from smallest to largest
conc_ordered <- rand_data$concentration[order(rand_data$concentration)]

length(conc_ordered)  # find the number of values

conc_ordered[500]  # 500 is the midpoint of 999

median(rand_data$concentration)
```

##### Trimmed mean

The trimmed mean is the mean calculated after trimming off a proportion of the data from the highest and lowest observations. It can help deal with outliers.
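As a sketch of the usual definition, for a trim proportion $p$ the $k = \lfloor np \rfloor$ smallest and largest values of the ordered sample $y_{(1)} \le \dots \le y_{(n)}$ are dropped and the rest are averaged:

$$\bar{y}_{\text{trim}} = \frac{1}{n - 2k} \sum_{i = k + 1}^{n - k} y_{(i)}$$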
```{r}
concentration <- c(2, 4, 6, 7, 11, 21, 81, 90, 105, 121)

# First order the data from smallest to largest
conc_ordered <- concentration[order(concentration)]

length(conc_ordered) * 0.10  # 10% of 10 values = 1, so trim the single smallest and largest value

conc_ordered[-c(1, 10)]  # remove those values
mean(conc_ordered[-c(1, 10)])

mean(conc_ordered, trim = 0.1)  # or use the trim argument in mean()
```

##### Winsorized mean

The Winsorized mean is similar to the trimmed mean, but the values that would be excluded are replaced by the neighbouring value (substituted rather than dropped).

```{r}
# First order the data from smallest to largest
conc_ordered <- concentration[order(concentration)]

length(conc_ordered) * 0.10  # 10% of 10 values = 1, so substitute the single smallest and largest value

conc_ordered[-c(1, 10)]  # the values that would otherwise be trimmed
conc_ordered[c(2, 2, 3, 4, 5, 6, 7, 8, 9, 9)]  # substitute them with their nearest neighbours
mean(conc_ordered[c(2, 2, 3, 4, 5, 6, 7, 8, 9, 9)])
```

```{r eval = FALSE}
install.packages('psych')
```

```{r}
library(psych)

winsor.mean(concentration, trim = 0.1)
```

Notice that these numbers are slightly different. That is because instead of substituting the nearest observed value, `winsor.mean` substitutes the quantiles at the trim proportions (here the 10% and 90% quantiles).

#### M-estimators

M-estimators give gradually decreasing weight to observations further from the middle of the sample and incorporate a measure of variability in the estimation procedure. They use an iterative approach and are useful when there are extreme outliers. They are not commonly used on their own but do have a role in robust regression and ANOVA techniques for analyzing linear models.

```{r}
library(MASS)  # requires the MASS package

data(chem)  # load the chem dataset
chem_data <- data.frame(ind = 1:length(chem), chem = chem)

ggplot(data = chem_data) +
  geom_point(aes(x = ind, y = chem), size = 2, colour = "firebrick") +
  theme_bw()
```

We can see from these data that there is one HUGE outlier. Running `huber` and `mean` gives us different results.

```{r}
mean(chem)
huber(chem)
```

#### R-estimators

R-estimators are based on the ranks of the observations rather than the observations themselves and form the basis for many rank-based "non-parametric" tests.

##### Hodges–Lehmann estimator

The Hodges–Lehmann estimator is the median of the averages of all possible pairs of observations.

```{r}
fake_data <- rpois(20, 10)

# all possible pairs of observations
all_combos <- expand.grid(x = fake_data, y = fake_data)
all_combos$mean <- apply(all_combos, 1, mean)

median(all_combos$mean)

# We can get the same answer from the Wilcoxon test
wilcox.test(fake_data, conf.int = TRUE)$estimate
```

### Spread or variability of your sample

As with estimators of central tendency, there are numerous ways to assess the spread of your sample. Going back to the concentration data we created earlier, we will look at some of these estimates.

#### Range

The range is perhaps the simplest estimate of the spread of your data.

```{r}
# range
range(rand_data$concentration)  # minimum and maximum
```

#### Variance

Variance is the average squared deviation from the mean.

```{r}
sum_squares <- sum((rand_data$concentration - mean(rand_data$concentration))^2)

sum_squares / (length(rand_data$concentration) - 1)  # variance

var(rand_data$concentration)
```

#### Standard deviation

The standard deviation is the square root of the variance.

```{r}
sqrt(var(rand_data$concentration))

sd(rand_data$concentration)
```

#### Coefficient of variation

The coefficient of variation is used to compare standard deviations across populations with different means because it is independent of the measurement units.
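It is simply the standard deviation expressed relative to the mean (and is often multiplied by 100 to report it as a percentage):

$$CV = \frac{s}{\bar{y}}$$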
```{r}
sd(rand_data$concentration) / mean(rand_data$concentration)
```

#### Median absolute deviation

The median absolute deviation is less sensitive to outliers than the measures above and is presented in association with medians.

```{r}
abs_deviation <- abs(rand_data$concentration - median(rand_data$concentration))

median(abs_deviation) * 1.4826  # 1.4826 is the scaling constant used by mad(), making the result comparable to the standard deviation for normal data
mad(rand_data$concentration)
```

#### Interquartile range

The interquartile range is the difference between the third quartile (75th percentile) and the first quartile (25th percentile). It is used in ggplot's `geom_boxplot`.

```{r}
quantz <- quantile(rand_data$concentration, c(0.75, 0.25))

quantz[1] - quantz[2]

IQR(rand_data$concentration)
```

#### Degrees of freedom

Degrees of freedom is simply the number of observations in our sample that are "free to vary" when we are estimating the variance. Because the variance is calculated from deviations around the sample mean, once the mean is known only n - 1 of those deviations are free to vary; thus df = n - 1.

#### Standard error

The standard error of the mean describes the variation in our sample mean. It is termed an error because it is the error about $\bar{y}$: if the error is large, then repeated sampling would produce very different means. The standard error is calculated as the standard deviation divided by the square root of the number of observations, $s / \sqrt{n}$. There is no function in `stats` that calculates the standard error.

```{r}
se <- sd(rand_data$concentration) / sqrt(length(rand_data$concentration))
se
```

#### Confidence intervals

NOTE: all of this assumes normality. If you are not working with a normal distribution, then other methods may be necessary (in particular for calculating the variance).

In frequentist terms, a confidence interval is not a probability statement about the parameter. A 95% confidence interval can be thought of, in this context, as one interval generated by a procedure that will give correct intervals 95% of the time.

Confidence intervals can be calculated using the critical value of the t-statistic and the standard error. The critical value for a given confidence level can be found using the `qt()` function in R.

```{r}
critical.vals <- data.frame(ConfidenceInterval = c('99%', '95%', '90%', '80%'),
                            CriticalValue = c(abs(qt(0.005, 10000)),  # two-sided: alpha/2 in each tail
                                              abs(qt(0.025, 10000)),
                                              abs(qt(0.050, 10000)),
                                              abs(qt(0.100, 10000))))
critical.vals
```

We can illustrate this using `ggplot2`.

```{r}
set.seed(12345)

x <- data.frame(obs = 1:100, val = rnorm(100))
head(x)

se_conc <- sd(x$val) / sqrt(length(x$val))
sd_conc <- sd(x$val)
mean_conc <- mean(x$val)

ggplot(data = x) +
  geom_histogram(aes(x = val), fill = "dodgerblue", colour = "black") +
  geom_errorbarh(aes(xmin = mean_conc - 1.96 * se_conc, xmax = mean_conc + 1.96 * se_conc,
                     y = 15, x = mean_conc)) +
  geom_point(aes(x = mean_conc, y = 15), size = 1) +
  geom_errorbarh(aes(xmin = mean_conc - 1.96 * sd_conc, xmax = mean_conc + 1.96 * sd_conc,
                     y = 13, x = mean_conc), colour = "red") +
  geom_point(aes(x = mean_conc, y = 13), size = 1, colour = "red") +
  annotate("text", x = c(0.75, 2.7), y = c(15, 13), label = c("se", "sd")) +
  theme_bw()
```

You can see the difference in the plot above. The standard deviation describes the spread in the data, whereas the standard error describes the uncertainty in where the mean (or predicted value) falls. What happens to the standard error as we increase the sample size to 200, 500, or 1,000?

#### Resampling methods

There are a couple of different ways to calculate the spread or a confidence interval when the sampling distribution is unknown or is definitively not normal. These methods involve resampling your data over many iterations to build a distribution of the statistic of interest (e.g., means, medians), from which standard errors and confidence intervals can be calculated (see the sketch below, and the jackknife and bootstrap examples that follow).
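As a minimal sketch, here is how such a distribution of resampled statistics is typically summarized. The `resampled_means` vector below is a hypothetical placeholder; the bootstrap loop in the next section builds the real thing (`stor_vals_bs$bs_mean`).

```{r}
# Hypothetical placeholder for a vector of resampled statistics
resampled_means <- rnorm(1000, mean = 10, sd = 0.35)

mean(resampled_means)                               # resampling-based estimate
sd(resampled_means)                                 # resampling-based standard error
quantile(resampled_means, probs = c(0.025, 0.975))  # percentile-style 95% interval
```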
##### Jackknifing

Jackknifing is a predecessor of bootstrapping and is less computationally expensive. Jackknifing is done by going through the data, leaving one observation out at a time, and calculating the sample statistic of interest on the remaining values. The term 'jackknifing' was coined by John Tukey, referring to its robustness as a tool. We can explore this below:

```{r}
rand_values <- data.frame(x = 1:75, val = rnorm(75, mean = 10, sd = 3))  # Generate random values
head(rand_values)

stor_vals <- data.frame(iter = 1:nrow(rand_values), j_mean = NA)  # create a data.frame to store values
head(stor_vals)

# to do this we will need to run a loop
for(i in 1:nrow(rand_values)){
  sub_rand_values <- rand_values[-i, ]  # leave out observation i
  mean_val <- mean(sub_rand_values$val)
  stor_vals$j_mean[i] <- mean_val  # store the mean for this iteration
}

ggplot(data = stor_vals) +
  geom_histogram(aes(x = j_mean), bins = 30) +
  geom_vline(aes(xintercept = mean(rand_values$val)), colour = "red") +
  geom_vline(aes(xintercept = mean(stor_vals$j_mean)), colour = "blue", linetype = "dashed") +
  theme_bw()
```

##### Bootstrapping

Bootstrapping is another resampling technique. Instead of "leaving one out", we resample the data. There are two options: take a random subset of the data (typically without replacement) or resample the full number of observations in the data set (with replacement). Like the jackknife, we calculate our sample statistic on each resample. Unlike the jackknife, the general rule is to resample many times, typically ~1,000.

```{r}
head(rand_values)

iter <- 1000  # number of iterations

stor_vals_bs <- data.frame(iter = 1:iter, bs_mean = NA)  # create a data.frame to store values
head(stor_vals_bs)

# to do this we will need to run a loop
for(i in 1:iter){
  sub_rand_values <- rand_values[sample(1:nrow(rand_values), nrow(rand_values), replace = TRUE), ]  # resample the rows with replacement
  mean_val <- mean(sub_rand_values$val)
  stor_vals_bs$bs_mean[i] <- mean_val  # store the mean for this iteration
}

ggplot(data = stor_vals_bs) +
  geom_histogram(aes(x = bs_mean), bins = 30) +
  geom_vline(aes(xintercept = mean(rand_values$val)), colour = "red") +
  geom_vline(aes(xintercept = mean(stor_vals_bs$bs_mean)), colour = "blue", linetype = "dashed") +
  theme_bw()
```

#### Now it is your turn

1. Load the `lovett` data from our github website (data/ExperimentalDesignData/chpt2/) using the appropriate means
2. Calculate these statistics for SO4, SO4MOD, and CL
    - Mean
    - Median
    - 5% trimmed mean
    - Huber's M-estimate
    - Standard deviation
    - Interquartile range
    - Median absolute deviation
    - Standard error of mean
    - 95% confidence interval for mean
3. Combine these in a single data.frame

[Answers here](https://chrischizinski.github.io/SNR_R_Group/answers/answer2.html)

Use the bootstrapping approach to calculate for SO4:

1. Mean
2. Standard error
3. 95% confidence interval
4. Median
5. Standard error
6. Combine these in a data.frame with the values from the population