6.5 Are tree diameters related to tree heights?

In a study at the Upper Flat Creek study area in the University of Idaho Experimental Forest, a random sample of \(n=336\) trees was selected from the forest, with measurements recorded on Douglas Fir, Grand Fir, Western Red Cedar, and Western Larch trees. The data set called ufc is available from the spuRs package (Jones et al. 2018) and contains dbh.cm (tree diameter at 1.37 m from the ground, measured in cm) and height.m (tree height in meters). The relationship displayed in Figure 2.110 is positive and moderately strong, with some curvature and increasing variability as the diameter increases. There do not appear to be groups in the data set, but since it contains four different types of trees, we would want to revisit this plot by type of tree.

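library(spuRs) #install.packages("spuRs")
library(tibble) # as_tibble()
library(car) # scatterplot()
library(mosaic) # formula interface for cor(), plus resample() and qdata()
data(ufc)
ufc <- as_tibble(ufc)
scatterplot(height.m~dbh.cm, data=ufc, smooth=FALSE, regLine=TRUE)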

Figure 2.110: Scatterplot of tree heights (m) vs tree diameters (cm).

Of particular interest is an observation with a diameter around 58 cm and a height of less than 5 m. Observing a tree with a diameter around 60 cm is not unusual in the data set, but none of the other trees with this diameter had heights under 15 m. It turns out that the likely outlier is observation number 168 and, because it is so unusual, it likely corresponds to either a damaged tree or a recording error.

ufc[168,]

## # A tibble: 1 x 5
##    plot  tree species dbh.cm height.m
##   <int> <int> <fct>    <dbl>    <dbl>
## 1    67     6 WL        57.5      3.4
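One way to locate such a point by row number is to filter on the two features described above (a large diameter paired with a very small height). The sketch below uses a small made-up data frame, `toy`, whose third row copies the values printed for the suspect tree; the other rows and the cutoffs of 50 cm and 5 m are illustrative assumptions, not values taken from the ufc data.

```r
# Toy data: row 3 copies the suspect tree (57.5 cm, 3.4 m); all other rows are invented.
toy <- data.frame(
  dbh.cm   = c(20.1, 35.4, 57.5, 60.2, 44.0),
  height.m = c(14.5, 24.0,  3.4, 33.0, 27.5)
)

# Row numbers with a large diameter but a very small height (cutoffs are assumptions).
suspect <- which(toy$dbh.cm > 50 & toy$height.m < 5)
suspect

# Inspect the flagged row(s).
toy[suspect, ]
```

Applying the same kind of `which()` condition to `ufc` is one plausible way to arrive at a row index such as the 168 used above.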

With the outlier in the data set, the correlation is 0.77; without it, the correlation increases to 0.79. The removal does not create a big change because the data set is relatively large and the diameter value is close to the mean of the \(x\text{'s}\)⁹⁷, but it does have some impact on the strength of the correlation.

cor(dbh.cm~height.m, data=ufc)

## [1] 0.7699552

cor(dbh.cm~height.m, data=ufc[-168,])

## [1] 0.7912053
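The small size of this change can be read off the usual formula for the sample correlation coefficient, written here with \(s_x\) and \(s_y\) for the sample standard deviations of the diameters and heights (this notation is assumed for illustration rather than taken from this section):

```latex
\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}
\]
```

For observation 168, \(x_i - \bar{x}\) is close to 0, so its term contributes little to the numerator, while its height sits well below \(\bar{y}\) and so inflates \(s_y\) in the denominator. Dropping the point therefore nudges \(r\) upward rather than changing it dramatically.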

With the outlier included, the bootstrap 95% confidence interval goes from 0.708 to 0.819: we are 95% confident that the true correlation between diameter and height in the population of trees is between 0.708 and 0.819. When the outlier is dropped from the data set, the 95% bootstrap CI is 0.753 to 0.826, which shifts the lower endpoint of the interval up, reducing the width of the interval from 0.111 to 0.073. In other words, the uncertainty regarding the value of the population correlation coefficient is reduced. The reason to remove the observation is that it is unusual based on the observed pattern, which suggests an error in data collection or sampling from a population other than the one used for the other observations. If the removal is justified, it helps us refine our inferences for the population parameter. But measuring the linear relationship in these data, where there is a clear curve, violates one of the assumptions of these methods. We’ll see some other ways of detecting this issue in Section 6.10 and we’ll try to “fix” this example using transformations in Chapter ??.
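The interval widths quoted above are simply the upper endpoint minus the lower endpoint. As a quick check, the sketch below redoes that arithmetic using the endpoints printed in the quantile output later in this section; the object names are hypothetical.

```r
# Percentile endpoints copied from the printed quantile output in this section.
ci_all  <- c(lower = 0.7075771, upper = 0.8190283)  # outlier included
ci_drop <- c(lower = 0.7532338, upper = 0.8259416)  # observation 168 removed

# Width = upper endpoint minus lower endpoint.
width_all  <- unname(ci_all["upper"]  - ci_all["lower"])
width_drop <- unname(ci_drop["upper"] - ci_drop["lower"])

# Rounded to three decimals for comparison with the text.
round(c(all = width_all, dropped = width_drop), 3)
```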

(ref:fig6-12) Bootstrap distributions of the correlation coefficient for the full data set (top) and without the potential outlier (bottom), with the observed correlation (bold line) and bounds for the 95% confidence interval (dashed lines). Notice the change in spread of the bootstrap distributions as well as the different centers.

Tobs <- cor(dbh.cm~height.m, data=ufc); Tobs

## [1] 0.7699552

set.seed(208)
par(mfrow=c(2,1))
B <- 1000
Tstar <- matrix(NA, nrow=B)
for (b in (1:B)){
  Tstar[b] <- cor(dbh.cm~height.m, data=resample(ufc))
}
quantiles <- qdata(Tstar, c(.025,.975)) #95% Confidence Interval
quantiles

##        quantile     p
## 2.5%  0.7075771 0.025
## 97.5% 0.8190283 0.975

hist(Tstar, labels=TRUE, xlim=c(0.6,0.9), ylim=c(0,275),
     main="Bootstrap distribution of correlation with all data")
abline(v=Tobs, col="red", lwd=3)
abline(v=quantiles$quantile, col="blue", lty=2, lwd=3)

Tobs <- cor(dbh.cm~height.m, data=ufc[-168,]); Tobs

## [1] 0.7912053

Tstar <- matrix(NA, nrow=B)
for (b in (1:B)){
  Tstar[b] <- cor(dbh.cm~height.m, data=resample(ufc[-168,]))
}
quantiles <- qdata(Tstar, c(.025,.975)) #95% Confidence Interval
quantiles

##        quantile     p
## 2.5%  0.7532338 0.025
## 97.5% 0.8259416 0.975

hist(Tstar, labels=TRUE, xlim=c(0.6,0.9), ylim=c(0,275),
     main="Bootstrap distribution of correlation without outlier")
abline(v=Tobs, col="red", lwd=3)
abline(v=quantiles$quantile, col="blue", lty=2, lwd=3)

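Figure 2.111: Bootstrap distributions of the correlation coefficient for the full data set (top) and without the potential outlier (bottom), with the observed correlation (bold line) and bounds for the 95% confidence interval (dashed lines). Notice the change in spread of the bootstrap distributions as well as the different centers.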
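The loops above rely on `resample()` and `qdata()`, which come from the mosaic package. For readers who want to see what those helpers are doing, here is a minimal base R sketch of the same percentile bootstrap. It runs on simulated data (the data frame `sim` and its columns `x` and `y` are made up for illustration), so its numbers will not match the tree results.

```r
set.seed(1)

# Simulated stand-in data: a positive, noisy linear relationship (not the ufc trees).
n <- 100
sim <- data.frame(x = runif(n, 10, 100))
sim$y <- 5 + 0.3 * sim$x + rnorm(n, sd = 6)

Tobs <- cor(sim$x, sim$y)  # observed correlation in the simulated sample

B <- 1000
Tstar <- numeric(B)
for (b in 1:B) {
  # Resample row numbers with replacement so each x stays paired with its y.
  rows <- sample(nrow(sim), replace = TRUE)
  Tstar[b] <- cor(sim$x[rows], sim$y[rows])
}

# Percentile 95% confidence interval: the middle 95% of the bootstrap correlations.
ci <- quantile(Tstar, c(0.025, 0.975))
Tobs
ci
```

Resampling whole rows is the important design choice here: drawing the x and y values separately would break the pairing and destroy the very correlation being estimated.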

References

Jones, Owen, Robert Maillardet, Andrew Robinson, Olga Borovkova, and Steven Carnie. 2018. spuRs: Functions and Datasets for "Introduction to Scientific Programming and Simulation Using R". https://CRAN.R-project.org/package=spuRs.

⁹⁷ Observations at the edge of the \(x\text{'s}\) will be called high leverage points in Section 6.9; this point is a low leverage point because it is close to the mean of the \(x\text{'s}\).