---
title: "COVID-19 and R Part II"
author: "Drew Tyre"
date: '2020-03-14'
output: 
  html_document:
    code_folding: hide
draft: no
slug: COVID-19_and_R_II
tags_include:
- R
- OPD
categories: Research
---

Yesterday I got into [the COVID-19 forecasting business](/post/covid-19_and_R/). 
Today I want to see how my predictions held up. 

EDIT 2020-03-16: DISCLAIMER: These unofficial results are not peer reviewed, and should be
treated as such. My goal is to learn about forecasting in real time, 
how simple models compare with more complex models, and even how to compare
different models.

# Munging the data ... again

My least favorite part of data analysis is munging the data into a usable form. 
Especially #otherpeoplesdata. John Hopkins University is doing a great service
making their data under their COVID-19 dashboard open, but they've made some
interesting choices. I can only assume these choices work well for what they are
doing; certainly isn't making my work easier! From the issues posted on github
sounds like the JHU team might be getting overwhelmed. So I'm going to try
the data from [Our World In Data](https://ourworldindata.org/coronavirus-source-data).


```{r, warning=FALSE, message=FALSE}
library("tidyverse")
library("lubridate")
owid_url <- "https://covid.ourworldindata.org/data/total_cases.csv"

owid_wide <- read_csv(owid_url) # just grab it once 

us_confirmed_total <- owid_wide %>% 
  pivot_longer(-1, 
               names_to = "country", values_to = "cumulative_cases") %>% 
  filter(country == "United States") %>% 
  arrange(date)  


p1 <- ggplot(data = us_confirmed_total,
       mapping = aes(x = date)) + 
  geom_line(mapping = aes(y = cumulative_cases)) + 
  geom_point(mapping = aes(y = cumulative_cases))  
p1
```

Well, fooey. [CDC reports over 1600 cases]() for March 13. The
[WorldOMeter](https://www.worldometers.info/coronavirus/country/us/) reports 
over 2000! However, the JHU data seems to have corrected in the 
last 12 hours, so I'll go with that again. I'll keep looking for
new data sources. Another lesson from the past 24 hours -- download
*and then save* the data! Because the next time it is downloaded it
might break everything ... 

```{r}
load("data/jhu_wide_2020-03-14.Rda")
us_confirmed_total <- jhu_wide %>% 
  rename(province = "Province/State", 
         country_region = "Country/Region") %>% 
  pivot_longer(-c(province, country_region, Lat, Long), 
               names_to = "Date", values_to = "cumulative_cases") %>% 
  filter(country_region == "US") %>% 
  mutate(Date = mdy(Date)) %>% 
#  mutate(Date = lubridate::mdy(Date) - lubridate::days(1)) %>% 
  arrange(province, Date)  %>% 
  # filter out state rows prior to march 8, and county rows after that. 
  filter(str_detect(province, ", ") & Date <= "2020-03-9" |
           str_detect(province, ", ", negate = TRUE) & Date > "2020-03-9") %>% 
  group_by(Date) %>% # then group by Date and sum
  summarize(cumulative_cases = sum(cumulative_cases)) %>% 
  ungroup() %>% 
  mutate(incident_cases = c(0, diff(cumulative_cases)))

p1 <- ggplot(data = filter(us_confirmed_total, Date > "2020-02-28", Date <= "2020-03-11"),
       mapping = aes(x = Date)) + 
#  geom_line(mapping = aes(y = incident_cases)) + 
  geom_point(mapping = aes(y = incident_cases)) + 
  scale_y_log10() + 
  geom_smooth(mapping = aes(y = incident_cases), method = "lm")

us_exp_model <- us_confirmed_total %>% 
  mutate(log_incident_cases = log10(incident_cases),    # transform the data
         day = as.numeric(Date - ymd("2020-02-27"))) %>% # get day relative to Feb 27
  filter(Date > "2020-02-28", Date <= "2020-03-11") %>% 
  lm(data = .,
     formula = log_incident_cases ~ day) 

predicted <- tibble(day = 14:20)
predicted_list <- predict(us_exp_model, newdata = predicted, se.fit = TRUE)
predicted$fit <- predicted_list$fit # this is klutzy, but I want to see the answer! 
predicted$se.fit <- predicted_list$se.fit
predicted <- predicted %>% 
  mutate(Date = ymd("2020-2-27") + day,
         pred_var = se.fit^2 + predicted_list$residual.scale^2,
         lpl = 10^(fit - sqrt(pred_var)*qt(0.975, df = 11)),
         upl = 10^(fit + sqrt(pred_var)*qt(0.975, df = 11)),
         fit = 10^fit)

p1 + 
  geom_line(data = predicted,
            mapping = aes(y = fit),
            linetype = 2) +
  geom_ribbon(data = predicted,
            mapping = aes(ymin = lpl, ymax = upl),
            alpha = 0.2) +
  geom_point(data = filter(us_confirmed_total, Date > "2020-03-11"),
             mapping = aes(y = incident_cases))
```

So not bad! Although if the observations continue to fall below the expected value that would
indicate some lack of fit. If you take the time to carefully compare these predictions with
yesterday's predictions you'll notice some small discrepancies. As I was wrestling with
data inconsistencies, I realized that correcting the UTC dates by a whole day was throwing everything
off. So everything is now shifted over by one day. The estimated coefficients of the model don't change very much; the rate of growth is still 0.15 day^-1^ and the intercept dropped a little bit. 
Still forecasting `r paste(format(predicted$lpl[7], digits = 0), " - ", format(predicted$upl[7], digits=0, scientific=FALSE))` new cases on March 18. That's bigger than the range I mentioned
in yesterday's post, but I forgot about the effect of the logarithmic scale on y when I eyeballed it.

So, things to do: run different countries that I care about, like Canada, Germany, Australia etc etc. 
Find some additional models to try, and develop a good measure of comparison.