{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Covid-19 Timeseries Forecasting\n", "\n", "The aim of this notebook is to identify the optimal model from a given simple\n", "parametric family that best characterizes the growth curve of COVID-19 cases without \n", "overfitting the data. \n", "The first part of the dataset (up to 20 March) will be used for training, along with the [Bayesian Information Criterion](https://en.wikipedia.org/wiki/Bayesian_information_criterion) for selecting the degree of our polynomial. \n", "The chosen model will be tested on the rest of the dataset. \n", " \n", "Intuitively, is a time series forecasting model able to accurately predict the future number of cases? \n", "I believe that, for most countries, it is not. Many factors lie outside what a time series model can forecast: a country may impose lockdowns, increase the number of tests, promote self-isolation, etc.\n", " \n", " \n", ">This is most obvious with weather forecasting where we have excellent models based on the physics of the atmosphere. No time series model will perform as well as a good atmospheric model for forecasting short-term weather. That is why meteorologists do not use time series models [(ref)](https://robjhyndman.com/hyndsight/forecasting-covid19/). 
\n", " \n", "However, it is an interesting exercise in the field of forecasting, in a domain that is highly interpretable (at least at the depth I am reaching).\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2020-04-01T16:00:57.737730Z", "start_time": "2020-04-01T16:00:57.163348Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import matplotlib.dates as mdates\n", "\n", "import datetime\n", "import logging\n", "\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error, r2_score\n", "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.preprocessing import normalize\n", "\n", "from statsmodels.tsa.seasonal import seasonal_decompose\n", "from statsmodels.tsa.stattools import adfuller\n", "from statsmodels.tsa.stattools import acf,pacf\n", "from statsmodels.tsa.arima_model import ARIMA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data gathering\n", "\n", "As in the [previous notebook](https://github.com/MikeXydas/Weekend-EDAs/blob/master/Covid19_Testing_Importance.ipynb), the number of confirmed cases will be taken from the [Johns Hopkins](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv) dataset, which is updated daily. \n", "We will group by country name and sum over provinces/states to obtain the cases per country." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-04-01T16:01:04.573835Z", "start_time": "2020-04-01T16:00:59.063815Z" } }, "outputs": [ { "data": { "text/html": [ "
| Country/Region | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | 1/29/20 | 1/30/20 | 1/31/20 | ... | 3/22/20 | 3/23/20 | 3/24/20 | 3/25/20 | 3/26/20 | 3/27/20 | 3/28/20 | 3/29/20 | 3/30/20 | 3/31/20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 40 | 40 | 74 | 84 | 94 | 110 | 110 | 120 | 170 | 174 |
| Albania | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 89 | 104 | 123 | 146 | 174 | 186 | 197 | 212 | 223 | 243 |
| Algeria | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 201 | 230 | 264 | 302 | 367 | 409 | 454 | 511 | 584 | 716 |
| Andorra | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 113 | 133 | 164 | 188 | 224 | 267 | 308 | 334 | 370 | 376 |
| Angola | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 3 | 3 | 3 | 4 | 4 | 5 | 7 | 7 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Venezuela | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 70 | 77 | 84 | 91 | 107 | 107 | 119 | 119 | 135 | 135 |
| Vietnam | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ... | 113 | 123 | 134 | 141 | 153 | 163 | 174 | 188 | 203 | 212 |
| West Bank and Gaza | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 52 | 59 | 59 | 59 | 84 | 91 | 98 | 109 | 116 | 119 |
| Zambia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 3 | 3 | 12 | 16 | 22 | 28 | 29 | 35 | 35 |
| Zimbabwe | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 3 | 3 | 3 | 3 | 5 | 7 | 7 | 7 | 8 |
161 rows × 70 columns
\n", "