In a study at the Upper Flat Creek study area in the University of Idaho Experimental Forest, a random sample of \(n=336\) trees was selected from the forest, with measurements recorded on Douglas Fir, Grand Fir, Western Red Cedar, and Western Larch trees. The data set called ufc is available from the spuRs package (Jones et al. 2018) and contains dbh.cm (tree diameter at 1.37 m from the ground, measured in cm) and height.m (tree height in meters). The relationship displayed in Figure 2.110 is positive, moderately strong with some curvature and increasing variability as the diameter increases. There do not appear to be groups in the data set but since this contains four different types of trees, we would want to revisit this plot by type of tree.
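library(spuRs) #install.packages("spuRs")
library(tibble) # provides as_tibble()
library(car) # provides scatterplot()
library(mosaic) # provides formula-interface cor(), resample(), and qdata() used below
data(ufc)
ufc <- as_tibble(ufc)
scatterplot(height.m~dbh.cm, data=ufc, smooth=FALSE, regLine=TRUE)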
Of particular interest is an observation with a diameter around 58 cm and a height of less than 5 m. Observing a tree with a diameter around 60 cm is not unusual in the data set, but none of the other trees with this diameter had heights under 15 m. It turns out that the likely outlier is observation number 168 and, because it is so unusual, it likely corresponds to either a damaged tree or a recording error.
ufc[168,]
## # A tibble: 1 x 5
## plot tree species dbh.cm height.m
## <int> <int> <fct> <dbl> <dbl>
## 1 67 6 WL 57.5 3.4
With the outlier in the data set, the correlation is 0.77 and without it, the correlation increases to 0.79. The removal does not create a big change because the data set is relatively large and the diameter value is close to the mean of the \(x\text{'s}\), but it has some impact on the strength of the correlation.
cor(dbh.cm~height.m, data=ufc)
## [1] 0.7699552
cor(dbh.cm~height.m, data=ufc[-168,])
## [1] 0.7912053
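The role of the outlier's position along the \(x\)-axis can be illustrated with a small sketch. It uses made-up data, not the ufc trees, and only base R; the variable names are hypothetical. The same unusually low response value is added once at the mean of the \(x\text{'s}\) and once at the largest \(x\) value.

```r
# Made-up data (not the ufc trees): a clean, positive linear trend
x <- 1:21                              # mean(x) is 11
y <- x + rep(c(-1, 0, 1), times = 7)   # small deterministic "noise"
r_clean <- cor(x, y)

# Add one unusually low y value at the mean of x (x = 11)
r_mid <- cor(c(x, 11), c(y, 0))

# Add the same low y value at the largest x (x = 21)
r_far <- cor(c(x, 21), c(y, 0))

round(c(clean = r_clean, at_mean_x = r_mid, at_max_x = r_far), 3)
```

In this toy example, the low point placed at the mean of \(x\) lowers the correlation less than the same point placed at the largest \(x\) value, which matches the reasoning given for the tree data.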
With the outlier included, the bootstrap 95% confidence interval goes from 0.708 to 0.819: we are 95% confident that the true correlation between diameter and height in the population of trees is between 0.708 and 0.819. When the outlier is dropped from the data set, the 95% bootstrap CI is 0.753 to 0.826, which shifts the lower endpoint of the interval up, reducing the width of the interval from 0.111 to 0.073. In other words, the uncertainty regarding the value of the population correlation coefficient is reduced. The reason to remove the observation is that it is unusual based on the observed pattern, which suggests an error in data collection or sampling from a population other than the one used for the other observations; if the removal is justified, it helps us refine our inferences for the population parameter. But measuring the linear relationship in these data, where there is a clear curve, violates one of the assumptions of these methods. We’ll see some other ways of detecting this issue in Section 6.10 and we’ll try to “fix” this example using transformations in Chapter ??.
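The interval widths quoted above can be checked directly from the quantiles reported in the bootstrap output. This is only a small arithmetic sketch; the vector names are hypothetical and the endpoints are copied from the printed results.

```r
# Endpoints copied from the printed bootstrap quantiles
ci_all  <- c(lower = 0.7075771, upper = 0.8190283)  # all observations
ci_drop <- c(lower = 0.7532338, upper = 0.8259416)  # observation 168 removed

# Width of each interval: upper endpoint minus lower endpoint
width_all  <- unname(ci_all["upper"] - ci_all["lower"])
width_drop <- unname(ci_drop["upper"] - ci_drop["lower"])

round(c(all = width_all, dropped = width_drop), 3)
```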
(ref:fig6-12) Bootstrap distributions of the correlation coefficient for the full data set (top) and without the potential outlier (bottom), with the observed correlation (bold line) and bounds for the 95% confidence interval (dashed lines). Notice the change in spread of the bootstrap distributions as well as the different centers.
Tobs <- cor(dbh.cm~height.m, data=ufc); Tobs
## [1] 0.7699552
set.seed(208)
par(mfrow=c(2,1))
B <- 1000
Tstar <- matrix(NA, nrow=B)
for (b in (1:B)){
  Tstar[b] <- cor(dbh.cm~height.m, data=resample(ufc))
}
quantiles <- qdata(Tstar, c(.025,.975)) #95% Confidence Interval
quantiles
## quantile p
## 2.5% 0.7075771 0.025
## 97.5% 0.8190283 0.975
hist(Tstar, labels=T, xlim=c(0.6,0.9), ylim=c(0,275),
main="Bootstrap distribution of correlation with all data")
abline(v=Tobs, col="red", lwd=3)
abline(v=quantiles$quantile, col="blue", lty=2, lwd=3)
Tobs <- cor(dbh.cm~height.m, data=ufc[-168,]); Tobs
## [1] 0.7912053
Tstar <- matrix(NA, nrow=B)
for (b in (1:B)){
  Tstar[b] <- cor(dbh.cm~height.m, data=resample(ufc[-168,]))
}
quantiles <- qdata(Tstar, c(.025,.975)) #95% Confidence Interval
quantiles
## quantile p
## 2.5% 0.7532338 0.025
## 97.5% 0.8259416 0.975
hist(Tstar, labels=T, xlim=c(0.6,0.9), ylim=c(0,275),
main= "Bootstrap distribution of correlation without outlier")
abline(v=Tobs, col="red", lwd=3)
abline(v=quantiles$quantile, col="blue", lty=2, lwd=3)
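Because the same bootstrap steps are repeated for the full and reduced data sets, they could be wrapped in a function. The following is a minimal sketch using only base R: sample() and quantile() stand in for the resample() and qdata() calls used above, boot_cor_ci() is a hypothetical helper name, and the data are simulated rather than taken from ufc.

```r
# Percentile bootstrap CI for a correlation using only base R.
# boot_cor_ci() is a hypothetical helper, not a function from mosaic or spuRs.
boot_cor_ci <- function(x, y, B = 1000, level = 0.95) {
  n <- length(x)
  Tstar <- numeric(B)
  for (b in seq_len(B)) {
    idx <- sample(n, size = n, replace = TRUE)  # resample rows with replacement
    Tstar[b] <- cor(x[idx], y[idx])             # correlation in the bootstrap sample
  }
  alpha <- 1 - level
  quantile(Tstar, probs = c(alpha / 2, 1 - alpha / 2))
}

# Simulated data standing in for diameter and height (not the ufc data)
set.seed(208)
diam <- runif(100, min = 10, max = 100)
ht <- 5 + 0.3 * diam + rnorm(100, sd = 5)

ci <- boot_cor_ci(diam, ht)
ci
```

With the tree data loaded, a call along the lines of boot_cor_ci(ufc$dbh.cm, ufc$height.m) would play the same role as the first loop above, although the endpoints will not match the printed results exactly because the random resampling differs.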
Jones, Owen, Robert Maillardet, Andrew Robinson, Olga Borovkova, and Steven Carnie. 2018. SpuRs: Functions and Datasets for "Introduction to Scientific Programming and Simulation Using R". https://CRAN.R-project.org/package=spuRs.