--- title: "Collections of Time Series" description: | Navigating across related time series. author: - name: Kris Sankaran affiliation: UW Madison date: 03-05-2021 output: distill::distill_article: self_contained: false --- ```{r setup, include=FALSE} knitr::opts_chunk$set(cache = FALSE, message = FALSE, warning = FALSE, echo = TRUE) ``` _[Reading](https://otexts.com/fpp3/exploring-australian-tourism-data.html), [Recording](https://mediaspace.wisc.edu/media/Week%206%20%5B5%5D%20Collections%20of%20Time%20Series/1_413ogf8e), [Rmarkdown](https://raw.githubusercontent.com/krisrs1128/stat479/master/_posts/2021-02-23-week6-5/week6-5.Rmd)_ ```{r} library("broom") library("dplyr") library("feasts") library("fpp2") library("ggplot2") library("ggrepel") library("lubridate") library("stringr") library("tidyr") library("tsibble") library("tsibbledata") theme_set(theme_minimal()) ``` 1. We have seen ways of visualizing a single time series (seasonal plots, ACF) and small numbers of time series (Cross Correlation). In practice, it’s also common to encounter large collections of time series. These datasets tend to require more sophisticated analysis techniques, but we will review one useful approach, based on extracted features. 2. The high-level idea is to represent each time series by a vector of summary statistics, like the maximum value, the slope, and so on. These vector summaries can then be used to create an overview of variation seen across all time series. For example, just looking at the first few regions in the Australian tourism dataset, we can see that there might be useful features related to the overall level (Coral Coast is larger than Barkly), recent trends (increased business in North West), and seasonality (South West is especially seasonal). ```{r} tourism <- as_tsibble(tourism, index = Quarter) %>% mutate(key = str_c(Region, Purpose, sep="-")) %>% update_tsibble(key = c("Region", "State", "Purpose", "key")) regions <- tourism %>% distinct(Region) %>% pull(Region) ggplot(tourism %>% filter(Region %in% regions[1:9])) + geom_line(aes(x = time(Quarter), y = Trips, col = Purpose)) + scale_color_brewer(palette = "Set2") + facet_wrap(~Region, scale = "free_y") + theme(legend.position = "bottom") ``` 3. Computing these kinds of summary statistics by hand would be tedious. Fortunately, the feasts package makes it easy to extract a variety of statistics for tsibble objects. ```{r} tourism_features <- tourism %>% features(Trips, feature_set(pkgs = "feasts")) tourism_features ``` 5. Once you have a data.frame summarizing these time series, you can run any clustering or dimensionality reduction procedure on the summary. For example, this is 2D representation from PCA. We will get into much more depth about dimensionality reduction later in this course — for now, just think of this as an abstract map relating all the time series. ```{r} pcs <- tourism_features %>% select(-State, -Region, -Purpose, -key) %>% prcomp(scale = TRUE) %>% augment(tourism_features) outliers <- pcs %>% filter(.fittedPC1 ^ 2 + .fittedPC2 ^ 2 > 120) ``` This PCA makes it very clear that the different travel purposes have different time series, likely due to the heavy seasonality of holiday travel (Melbourne seems to be an interesting exception). ```{r} ggplot(pcs, aes(x = .fittedPC1, y = .fittedPC2)) + geom_point(aes(col = Purpose)) + geom_text_repel( data = outliers, aes(label = Region), size = 2.5 ) + scale_color_brewer(palette = "Set2") + labs(x = "PC1", y = "PC2") + coord_fixed() + theme(legend.position = "bottom") ``` 6. We can look at the series that are outlying in the PCA. The reading has some stories for why these should be considered outliers. They seem to be series with substantial increasing trends or which have exceptionally high overall counts. ```{r} outlier_series <- tourism %>% filter(key %in% outliers$key) ggplot(outlier_series) + geom_line(aes(x = time(Quarter), y = Trips, col = Purpose)) + scale_color_brewer(palette = "Set2") + facet_wrap(~Region, scale = "free_y") + theme(legend.position = "bottom") ``` 7. This featurization approach is especially powerful when combined with coordinated views. It is possible to link the points in the PCA plot with the time series display, so that selecting points in the PCA shows the corresponding time series.