--- title: "Cross and Auto-Correlation" description: | Summaries of relationships between and within time series. author: - name: Kris Sankaran affiliation: UW Madison date: 03-04-2021 output: distill::distill_article: self_contained: false --- _[Reading](https://otexts.com/fpp3/scatterplots.html), [Recording](https://mediaspace.wisc.edu/media/Week%206%20%5B4%5D/1_0pbritzj), [Rmarkdown](https://raw.githubusercontent.com/krisrs1128/stat479/master/_posts/2021-02-23-week6-4/week6-4.Rmd)_ ```{r setup, include=FALSE} knitr::opts_chunk$set(cache = FALSE, message = FALSE, warning = FALSE, echo = TRUE) ``` ```{r} library("dplyr") library("tsibbledata") library("feasts") library("tidyr") library("lubridate") library("ggplot2") library("stringr") theme_set(theme_minimal()) ``` 1. There is often interest in seeing how two time series relate to one another. A scatterplot can be useful in gauging this relationship. For example, the plot below suggests that there is a relationship between electricity demand and temperature, but it's hard to make out exactly the nature of the relationship. ```{r} vic_2014 <- vic_elec %>% filter(year(Time) == 2014) vic_2014 %>% pivot_longer(Demand:Temperature) %>% ggplot(aes(x = Time, y = value)) + geom_line() + facet_wrap(~ name, scales = "free_y") ``` A scatterplot clarifies that, while electricity demand generally goes up in the cooler months, the very highest demand happens during high heat days. ```{r} ggplot(vic_2014, aes(x = Temperature, y = Demand)) + geom_point(alpha = 0.6, size = 0.7) ``` 2. Note that the timepoints are far from independent. Points tend to drift gradually across the scatterplot, rather than jumping to completely different regions in short time intervals. This is just the 2D consequence of the time series varying smoothly. ```{r} lagged <- rbind(vic_2014[-1, ], NA) %>% setNames(str_c("lagged_", colnames(vic_2014))) ggplot(bind_cols(vic_2014, lagged), aes(x = Temperature, y = Demand)) + geom_point(alpha = 0.6, size = 0.7) + geom_segment( aes(xend = lagged_Temperature, yend = lagged_Demand), size = .4, alpha = 0.5 ) ``` 3. To formally measure the linear relationship between two time series, we can use the cross-correlation, \begin{align} \frac{\sum_{t}\left(x_{t} - \hat{\mu}_{X}\right)\left(y_{t} - \hat{\mu}_{Y}\right)}{\hat{\sigma}_{X}\hat{\sigma}_{Y}} \end{align} which for the data above is, ```{r} cor(vic_2014$Temperature, vic_2014$Demand) ``` 4. Cross-correlation can be extended to autocorrelation — the correlation between a time series and a lagged version of itself. This measure is useful for quantifying the strength of seasonality within a time series. A daily time series with strong weekly seasonality will have high autocorrelation at lag 7, for example. The example below shows the lag-plots for Australian beer production after 2000. The plot makes clear that there is high autocorrelation at lags 4 and 8, suggesting high quarterly seasonality. ```{r} recent_production <- aus_production %>% filter(year(Quarter) > 2000) gg_lag(recent_production, Beer, geom = "point") ``` Indeed, we can confirm this by looking at the original data. ```{r} ggplot(recent_production) + geom_line(aes(x = time(Quarter), y = Beer)) ``` 5. These lag plots take up a bit of space. A more compact summary is to compute the autocorrelation function (ACF). Peaks and valleys in an ACF suggest seasonality at the frequency indicated by the lag value. ```{r} acf_data <- ACF(recent_production, Beer) autoplot(acf_data) ``` 7. Gradually decreasing slopes in the ACF suggest trends. This is because if there is a trend, the current value tends to be very correlated with the recent past. It's possible to have both seasonality within a trend, in which case the ACF function has bumps where the seasonal peaks align. ```{r} ggplot(as_tsibble(a10), aes(x = time(index), y = value)) + geom_line() acf_data <- ACF(as_tsibble(a10), lag_max = 100) autoplot(acf_data) ```