--- title: "COVID-19 and R Part VII" author: "Drew Tyre" date: "2020-03-19" output: word_document: default html_document: code_folding: hide draft: no slug: COVID-19_and_R_VII tags_include: - R - OPD - COVID19 categories: Research og_image: /post/covid-19_in_r_vii_files/figure-html/featured_image-1.png --- Wednesday's number of new cases in the USA was still inside the prediction interval, and still below expectations. Italy's numbers appear to have leveled off from exponential to linear growth. Nonetheless, Italy continues to add more severe cases per day than their ICU bed capacity. Unless things change drastically, I predict the USA will start to exceed local ICU bed capacity by March 23, in places like New York, Washington state, and the Bay Area. Certainly by the end of March. Places like Nebraska may not experience these problems, because measures to reduce transmission started while case numbers were still low. That outcome is not set in stone. The time to act is now. Even a one day delay in reducing transmission rate dramatically increases the peak number of sick people. If you're lucky to be in a low risk group, you can still spread the virus. Although your symptoms are mild, the person you hand the virus off to could experience much worse symptoms. You don't have to hermetically seal your house. I like this graphic: ![social distancing suggestions](https://rutherfordcountytn.gov/img/2.png) [This all started a week ago](/post/covid-19_and_r/), in case you're just joining. # The bottom line DISCLAIMER: These unofficial results are not peer reviewed, and should not be treated as such. My goal is to learn about forecasting in real time, how simple models compare with more complex models, and even how to compare different models. The figure below is complex, so I'll highlight some features first. The y axis is now cases / 100,000 population to make it easier to compare Italy and the USA. As before, the BLUE LINES are exponential growth models that assume the rate of increase is *constant*. The dashed black lines are predictions from that model. To make comparison easier, the estimated model and predictions for both countries are on both panels. The lines and intervals for the other country are faded out. The RED LINES are 20 times the number of ICU beds per 100,000 population in each country. Note that the y axis is logarithmic. ```{r featured_image, warning=FALSE, message=FALSE, echo=FALSE} library("drc") library("tidyverse") library("lubridate") library("broom") savefilename <- "data/jhu_wide_2020-03-19.Rda" # # jhu_url <- paste("https://raw.githubusercontent.com/CSSEGISandData/", # "COVID-19/master/csse_covid_19_data/", # "csse_covid_19_time_series/", # "time_series_19-covid-Confirmed.csv", sep = "") # # jhu_wide <- read_csv(jhu_url) # just grab it once from github # # archive it! # save(jhu_wide, file = savefilename) load(savefilename) country_data <- read_csv("data/countries_covid-19.csv") %>% mutate(start_date = mdy(start_date))%>% filter(is.na(Region)) %>% mutate(Country = case_when(Country == "United States" ~ "USA", TRUE ~ Country)) source("R/jhu_helpers.R") us_confirmed_total <- us_wide2long_old(jhu_wide, "USA") %>% group_by(Date) %>% # then group by Date and sum summarize(cumulative_cases = sum(cumulative_cases)) %>% ungroup() %>% mutate(incident_cases = c(0, diff(cumulative_cases)), country_region = "USA", province = NA_character_) %>% filter(Date > "2020-02-28") # trim to first day after which all incident_cases > 0 us_exp_model <- us_confirmed_total %>% mutate(log_incident_cases = log10(incident_cases), # transform the data day = as.numeric(Date - ymd("2020-02-27"))) %>% # get day relative to Feb 27 filter(Date > "2020-02-28", Date <= "2020-03-11") %>% lm(data = ., formula = log_incident_cases ~ day) us_fits <- augment(us_exp_model, se_fit = TRUE) %>% mutate(country_region = "USA", province = NA_character_, Date = ymd("2020-2-27") + day, lcl = 10^(.fitted - .se.fit*qt(0.975, df = 10)), ucl = 10^(.fitted + .se.fit*qt(0.975, df = 10)), fit = 10^.fitted) predicted <- tibble(day = 14:27) predicted_list <- predict(us_exp_model, newdata = predicted, se.fit = TRUE) predicted$fit <- predicted_list$fit # this is klutzy, but I want to see the answer! predicted$se.fit <- predicted_list$se.fit us_predicted <- predicted %>% mutate(country_region = "USA", province = NA_character_, Date = ymd("2020-2-27") + day, pred_var = se.fit^2 + predicted_list$residual.scale^2, lpl = 10^(fit - sqrt(pred_var)*qt(0.975, df = 10)), upl = 10^(fit + sqrt(pred_var)*qt(0.975, df = 10)), fit = 10^fit) # Take the wide format data and make it long italy_confirmed <- other_wide2long_old(jhu_wide, countries = "Italy") %>% # Calculate the number of new cases per day mutate(incident_cases = c(0, diff(cumulative_cases)))%>% # only keep data after feb 21 when the outbreak started, and remove row with no reported cases filter(Date >= "2020-02-21", incident_cases > 0) # Fit the model to the first 12 days of data (to match what I did with USA) italy_exp_model <- italy_confirmed %>% filter(Date <= "2020-03-03") %>% mutate(log_incident_cases = log10(incident_cases), day = as.numeric(Date - ymd("2020-02-21"))) %>% lm(data = ., formula = log_incident_cases ~ day) italy_fits <- augment(italy_exp_model, se_fit = TRUE) %>% mutate(country_region = "Italy", province = NA_character_, Date = ymd("2020-2-21") + day, lcl = 10^(.fitted - .se.fit*qt(0.975, df = 10)), ucl = 10^(.fitted + .se.fit*qt(0.975, df = 10)), fit = 10^.fitted) # Make predictions out to March 16 predicted <- tibble(day = 12:26) predicted_list <- predict(italy_exp_model, newdata = predicted, se.fit = TRUE) italy_predicted <- predicted %>% mutate(country_region = "Italy", province = NA_character_, Date = ymd("2020-02-21") + day, fit = predicted_list$fit, se.fit = predicted_list$se.fit, pred_var = se.fit^2 + predicted_list$residual.scale^2, lpl = 10^(fit - sqrt(pred_var)*qt(0.975, df = 10)), upl = 10^(fit + sqrt(pred_var)*qt(0.975, df = 10)), fit = 10^fit) observed <- bind_rows(us_confirmed_total, italy_confirmed) %>% left_join(country_data, by = c("country_region" = "Country") ) %>% mutate(icpc = incident_cases * 10 / popn, group = as.numeric(factor(country_region))) fits <- bind_rows(USA = us_fits, Italy = italy_fits)%>% left_join(country_data, by = c("country_region" = "Country"))%>% mutate(fitpc = fit * 10 / popn, lclpc = lcl * 10 / popn, uclpc = ucl * 10 / popn, group = as.numeric(factor(country_region))) predicted <- bind_rows(USA = us_predicted, Italy = italy_predicted)%>% left_join(country_data, by = c("country_region" = "Country"))%>% mutate(fitpc = fit * 10 / popn, lplpc = lpl * 10 / popn, uplpc = upl * 10 / popn, group = as.numeric(factor(country_region))) ggplot(data = observed, mapping = aes(x = Date)) + geom_point(mapping = aes(y = icpc)) + facet_wrap(~country_region, scales = "free_x") + scale_y_log10() + geom_line(data = select(fits, -country_region), mapping = aes(y = fitpc, group = group), color = "blue", alpha = 0.5) + geom_ribbon(data = select(fits, -country_region), mapping = aes(ymin = lclpc, ymax = uclpc, group = group), alpha = 0.1) + geom_line(data = select(predicted, -country_region), mapping = aes(y = fitpc, group = group), linetype = 2, alpha = 0.5) + geom_ribbon(data = select(predicted, -country_region), mapping = aes(ymin = lplpc, ymax = uplpc, group = group), alpha = 0.1) + geom_line(data = fits, mapping = aes(y = fitpc), color = "blue", size = 1.25) + geom_ribbon(data = fits, mapping = aes(ymin = lclpc, ymax = uclpc), alpha = 0.2) + geom_line(data = predicted, mapping = aes(y = fitpc), linetype = 2) + geom_ribbon(data = predicted, mapping = aes(ymin = lplpc, ymax = uplpc), alpha = 0.2) + geom_hline(mapping = aes(yintercept = icu_beds/0.05), color = "red", linetype = 2) + labs(y = "Incident cases per 100,000 people", title = "Total new reported cases per day", subtitle = "Exponential model in blue; predicted values with 95% prediction intervals with dashed line.", caption = paste("Source data from https://github.com/CSSEGISandData/COVID-19 downloaded ", format(file.mtime(savefilename), usetz = TRUE), "\n Unofficial, not peer reviewed results.", "\n Copyright Andrew Tyre 2020. Distributed with MIT License.") ) + geom_text(mapping = aes(y = 1.3*icu_beds/0.05), x = ymd("2020-02-21"), label = "20 x # ICU beds / 100K", size = 2.5, hjust = "left") ``` [Much as yesterday](/post/covid-19_and_r_vi/), new cases in the USA are inside the prediction interval, and still below expectations[^1]. Italy's numbers appear to have leveled off from exponential to linear growth. Look at the new case numbers starting around March 12 -- they are going along at a constant level, a flat line on this graph. Each day Italy adds about the same number of cases as the day before. In exponential growth, the number of cases per day *increases* at the same rate. On a log axis exponential growth looks like an increasing straight line. Nonetheless, Italy continues to add more severe cases per day than their ICU bed capacity. > Unless things change drastically, I predict the USA will start to exceed local ICU bed capacity by March 23, in places like New York, Washington state, and the Bay Area. Certainly by the end of March. Places like Nebraska may not experience these problems, because measures to reduce transmission started while case numbers were still low. That is a bit more nuanced than yesterday's prediction. I was looking at maps of the USA outbreak last night and realized that the country level totals are obscuring a lot of variation between states. This matters because *when* a locale starts reducing transmission by social distancing relative to the size of the outbreak makes a big difference. Consider this figure[^2] comparing two Italian provinces: ![barplot of COVID-19 cases in two Italian provinces](https://mfr.osf.io/export?url=https://osf.io/wqnga/?view_only=c2f00dfe3677493faa421fc2ea38e295%26direct%26mode=render%26action=download%26public_file=True&initialWidth=848&childId=mfrIframe&parentTitle=OSF+%7C+BergamoLodi.jpg&parentUrl=https://osf.io/wqnga/?view_only=c2f00dfe3677493faa421fc2ea38e295&format=2400x2400.jpeg) The first confirmed case in Italy was in Lodi province on February 21. The Italian government prohibited movement in or out of 10 municipalities in Lodi on February 23, only two days later. Although cases began to increase in Bergamo province on February 28, movement prohibitions were not extended there until March 8, more than a week after the first cases were reported. Cases increased in both provinces, but in Lodi the increase is linear, while Bergamo the increase is exponential. I fear New York and friends are like Bergamo. I'm hoping Nebraska and other small states are like Lodi. # Full data nerd past here Remember -- coding while thinking, no coherent story, ymmv. I want to break down Canada and the USA to state / province, and see if there is anything useful I can get at that level. I'll stick to the simple exponential model as a "null model" for now. UGH, I just remembered why I haven't done that before ... needed to be able to map the county rows with 2 letter state abbreviations to the state names in the state rows. So that took an hour. I took the opportunity to follow a suggestion from brilliant quantitative ecologist JB and put the data munging code that isn't changing into a separate script file. And then that fancy new code broke everything in my base plot, so spent another hour figuring out how I broke everything. ```{r, message=FALSE} us_by_state <- us_wide2long_old(jhu_wide,"USA") # see, fancy! canada_by_prov <- other_wide2long_old(jhu_wide, countries = "Canada") # reload the country data because I messed it up above country_data <- read_csv("data/countries_covid-19.csv") %>% mutate(start_date = mdy(start_date)) all_by_province <- bind_rows(us_by_state, canada_by_prov) %>% left_join(country_data, by = c("province" ="Region")) %>% group_by(country_region, province) %>% mutate(incident_cases = c(0, diff(cumulative_cases)), max_cases = max(cumulative_cases)) %>% ungroup() %>% mutate(icpc = incident_cases * 10 / popn, group = as.numeric(factor(country_region))) %>% # remove early rows from each state/province # remove 0 rows in the middle that are probably reporting errors filter(incident_cases > 0, Date >= start_date, province %in% c("California", "Washington", "New York", "British Columbia", "Alberta", "Ontario")) %>% mutate(province = factor(province, levels = c("California", "Washington", "New York", "British Columbia", "Alberta", "Ontario"))) base_plot <- ggplot(data = all_by_province, mapping = aes(x = Date)) + geom_point(mapping = aes(y = icpc, color = province)) + facet_wrap(~province, dir="v") + scale_y_log10() + theme(legend.position = "none") + labs(y = "Incident cases / 100K people") # fit the models all_models <- all_by_province %>% mutate(log10_ic = log10(incident_cases), day = as.numeric(Date - start_date)) %>% filter(day <= 12) %>% group_by(province) %>% nest() %>% mutate(model = map(data, ~lm(log10_ic~day, data = .))) all_fit <- all_models %>% mutate(fit = map(model, augment, se_fit = TRUE), fit = map(fit, select, -c("log10_ic","day"))) %>% select(-model) %>% unnest(cols = c("data","fit")) %>% mutate(fit = 10^.fitted, lcl = 10^(.fitted - .se.fit * qt(0.975, df = 10)), ucl = 10^(.fitted + .se.fit * qt(0.975, df = 10)), fitpc = fit * 10 / popn, lclpc = lcl * 10 / popn, uclpc = ucl * 10 / popn) all_predicted <- all_by_province %>% mutate(day = as.numeric(Date - start_date)) %>% filter(day > 12) %>% #these are the rows we want to predict with group_by(province) %>% nest() %>% left_join(select(all_models, province, model), by="province") %>% mutate(predicted = map2(model, data, ~augment(.x, newdata = .y, se_fit = TRUE)), sigma = map_dbl(model, ~summary(.x)$sigma)) %>% select(-c(model,data)) %>% unnest(cols = predicted) %>% mutate( fit = 10^.fitted, pred_var = sigma^2 + .se.fit^2, lpl = 10^(.fitted - sqrt(pred_var)*qt(0.975, df = 10)), upl = 10^(.fitted + sqrt(pred_var)*qt(0.975, df = 10)), fitpc = fit * 10 / popn, lplpc = lpl * 10 / popn, uplpc = upl * 10 / popn) base_plot + geom_line(data = all_fit, mapping = aes(y = fitpc, color = province), size = 1.25) + geom_ribbon(data = all_fit, mapping = aes(ymin = lclpc, ymax = uclpc), alpha = 0.2) + geom_line(data = all_predicted, mapping = aes(y = fitpc, color = province), linetype = 2) + geom_ribbon(data = all_predicted, mapping = aes(ymin = lplpc, ymax = uplpc), alpha = 0.2) ``` That figure is all per capita, so by that metric the two places that are worst off are New York and British Columbia, both adding around 0.5% of the population per day and accelerating. The worst news is that both New York and California continue to grow consistent with the exponential model. It's a bit harder to tell with the Canadian provinces because the outbreaks are shorter there, so not much left after using 12 days to fit the model. Ontario is interesting because the total number of cases is highest there (for Canada), but per capita they are lower. # What's next? So tomorrow I should be able to fiddle the state/province graphs a bit more and get localized predictions for when ICU capacity might become a problem. I guess it's getting time to find an alternative model to the exponential as well. The logistic curves from a few days ago were terrible at forecasting, so not sure what else to try. Suggestions in the comments please! I'd also like to refine the prediction of ICU capacity. [I finally found someone making predictions that include confidence limits](https://medium.com/@megan.higgie/without-serious-action-australia-will-run-out-of-intensive-care-beds-between-7-and-10-april-59f83b52756e) and Dr. Higgie predicts the time at which Australia will start to experience issues with ICU capacity (early April). She makes a great point that cases occupy an ICU bed for some weeks, so just looking at the new cases is optimistic. Instead there needs to be a kind of sliding window sum of the new cases. [^1]: Code not shown in the post can be found [on Github](https://raw.githubusercontent.com/rbind/cheap_shots/master/content/post/covid-19_in_R_VII.Rmd). This post benefited from comments by Ramona Sky, Kelly Helm-Smith, and Jessica Burnett. [^2]: [This preprint is the source of the figure, and discusses age structure effects](https://mfr.osf.io/render?url=https://osf.io/fd4rh/?view_only=c2f00dfe3677493faa421fc2ea38e295%26fbclid=IwAR2XQCAUzqpOWtdkLKmaIiorv5uih9RBg0BreAgSnYd2Qq9BnjWxyIf0zAk%26direct%26mode=render%26action=download%26mode=render)