{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example: CO$_2$ at Mauna Loa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the late 1950's Charles Keeling invented a accurate way to measure atmospheric CO$_2$ concentration and began taking regular measurements at the Mauna Loa observatory. Today, measurements are continuously recorded. Check out last hours measurement result [here](https://www.co2.earth/daily-co2). \n", "\n", "![Mauna Loa observatory](http://sites.gsu.edu/geog1112/files/2014/07/MaunaLoaObservatory_small-2g29jvt.png)\n", "\n", "Not much was known about how fossil fuel burning influences the climate in the late 1950s. The first couple years of data collection showed that CO$_2$ levels rose and fell following summer and winter, tracking the growth and decay of vegetation in the northern hemisphere. As multiple years passed, the steady upward trend increasingly grew into focus. With over 70 years of collected data, the Keeling curve is one of the most important climate indicators.\n", "\n", "The history behind these measurements and their influence on climatology today and other interesting reading:\n", "\n", "- http://scrippsco2.ucsd.edu/history_legacy/early_keeling_curve#\n", "- https://scripps.ucsd.edu/programs/keelingcurve/2016/05/23/why-has-a-drop-in-global-co2-emissions-not-caused-co2-levels-in-the-atmosphere-to-stabilize/#more-1412\n", "- http://cdiac.ornl.gov/\n", "\n", "Let's load in the data, tidy it up, and have a look. The [raw data set is located here](http://scrippsco2.ucsd.edu/data/atmospheric_co2/mlo). This notebook uses the [Bokeh package](http://bokeh.pydata.org/en/latest/) for plots that benefit from interactivity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the data" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:07:29.657111Z", "start_time": "2017-09-18T04:07:29.636254Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " Loading BokehJS ...\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(global) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " var force = true;\n", "\n", " if (typeof (window._bokeh_onload_callbacks) === \"undefined\" || force === true) {\n", " window._bokeh_onload_callbacks = [];\n", " window._bokeh_is_loading = undefined;\n", " }\n", "\n", "\n", " \n", " if (typeof (window._bokeh_timeout) === \"undefined\" || force === true) {\n", " window._bokeh_timeout = Date.now() + 5000;\n", " window._bokeh_failed_load = false;\n", " }\n", "\n", " var NB_LOAD_WARNING = {'data': {'text/html':\n", " \"
\\n\"+\n", " \"

\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"

\\n\"+\n", " \"\\n\"+\n", " \"\\n\"+\n", " \"from bokeh.resources import INLINE\\n\"+\n", " \"output_notebook(resources=INLINE)\\n\"+\n", " \"\\n\"+\n", " \"
\"}};\n", "\n", " function display_loaded() {\n", " if (window.Bokeh !== undefined) {\n", " var el = document.getElementById(\"68a18d41-84a0-4d71-877d-ea28ee518052\");\n", " el.textContent = \"BokehJS \" + Bokeh.version + \" successfully loaded.\";\n", " } else if (Date.now() < window._bokeh_timeout) {\n", " setTimeout(display_loaded, 100)\n", " }\n", " }\n", "\n", " function run_callbacks() {\n", " try {\n", " window._bokeh_onload_callbacks.forEach(function(callback) { callback() });\n", " }\n", " finally {\n", " delete window._bokeh_onload_callbacks\n", " }\n", " console.info(\"Bokeh: all callbacks have finished\");\n", " }\n", "\n", " function load_libs(js_urls, callback) {\n", " window._bokeh_onload_callbacks.push(callback);\n", " if (window._bokeh_is_loading > 0) {\n", " console.log(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", " return null;\n", " }\n", " if (js_urls == null || js_urls.length === 0) {\n", " run_callbacks();\n", " return null;\n", " }\n", " console.log(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", " window._bokeh_is_loading = js_urls.length;\n", " for (var i = 0; i < js_urls.length; i++) {\n", " var url = js_urls[i];\n", " var s = document.createElement('script');\n", " s.src = url;\n", " s.async = false;\n", " s.onreadystatechange = s.onload = function() {\n", " window._bokeh_is_loading--;\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: all BokehJS libraries loaded\");\n", " run_callbacks()\n", " }\n", " };\n", " s.onerror = function() {\n", " console.warn(\"failed to load library \" + url);\n", " };\n", " console.log(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", " document.getElementsByTagName(\"head\")[0].appendChild(s);\n", " }\n", " };var element = document.getElementById(\"68a18d41-84a0-4d71-877d-ea28ee518052\");\n", " if (element == null) {\n", " console.log(\"Bokeh: ERROR: autoload.js configured with elementid '68a18d41-84a0-4d71-877d-ea28ee518052' but no matching script tag was found. 
\")\n", " return false;\n", " }\n", "\n", " var js_urls = [\"https://cdn.pydata.org/bokeh/release/bokeh-0.12.6.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.6.min.js\"];\n", "\n", " var inline_js = [\n", " function(Bokeh) {\n", " Bokeh.set_log_level(\"info\");\n", " },\n", " \n", " function(Bokeh) {\n", " \n", " },\n", " \n", " function(Bokeh) {\n", " \n", " document.getElementById(\"68a18d41-84a0-4d71-877d-ea28ee518052\").textContent = \"BokehJS is loading...\";\n", " },\n", " function(Bokeh) {\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-0.12.6.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-0.12.6.min.css\");\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.6.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.6.min.css\");\n", " }\n", " ];\n", "\n", " function run_inline_js() {\n", " \n", " if ((window.Bokeh !== undefined) || (force === true)) {\n", " for (var i = 0; i < inline_js.length; i++) {\n", " inline_js[i](window.Bokeh);\n", " }if (force === true) {\n", " display_loaded();\n", " }} else if (Date.now() < window._bokeh_timeout) {\n", " setTimeout(run_inline_js, 100);\n", " } else if (!window._bokeh_failed_load) {\n", " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", " window._bokeh_failed_load = true;\n", " } else if (force !== true) {\n", " var cell = $(document.getElementById(\"68a18d41-84a0-4d71-877d-ea28ee518052\")).parents('.cell').data().cell;\n", " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", " }\n", "\n", " }\n", "\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", " run_inline_js();\n", " } else {\n", " load_libs(js_urls, function() {\n", " console.log(\"Bokeh: BokehJS plotting callback run at\", now());\n", " run_inline_js();\n", " });\n", " }\n", "}(this));" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pymc3 as pm\n", "import pandas as pd\n", "import numpy as np\n", "import theano.tensor as tt\n", "\n", "from bokeh.plotting import figure, show\n", "from bokeh.models import BoxAnnotation, Span, Label, Legend\n", "from bokeh.io import output_notebook\n", "from bokeh.palettes import brewer\n", "output_notebook()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:07:01.648339Z", "start_time": "2017-09-18T04:07:01.508100Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CO2seasonaly_adjustedfitseasonally_adjusted_fitCO2_filledseasonally_adjusted_filled
1958-03-15315.69314.43316.18314.90315.69314.43
1958-04-15317.46315.15317.30314.98317.46315.15
1958-05-15317.50314.73317.84315.06317.50314.73
1958-07-15315.86315.17315.87315.22315.86315.17
1958-08-15314.93316.17314.01315.29314.93316.17
\n", "
" ], "text/plain": [ " CO2 seasonaly_adjusted fit seasonally_adjusted_fit \\\n", "1958-03-15 315.69 314.43 316.18 314.90 \n", "1958-04-15 317.46 315.15 317.30 314.98 \n", "1958-05-15 317.50 314.73 317.84 315.06 \n", "1958-07-15 315.86 315.17 315.87 315.22 \n", "1958-08-15 314.93 316.17 314.01 315.29 \n", "\n", " CO2_filled seasonally_adjusted_filled \n", "1958-03-15 315.69 314.43 \n", "1958-04-15 317.46 315.15 \n", "1958-05-15 317.50 314.73 \n", "1958-07-15 315.86 315.17 \n", "1958-08-15 314.93 316.17 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_monthly = pd.read_csv(pm.get_data(\"monthly_in_situ_co2_mlo.csv\"), header=56)\n", " \n", "# - replace -99.99 with NaN\n", "data_monthly.replace(to_replace=-99.99, value=np.nan, inplace=True)\n", "\n", "# fix column names\n", "cols = [\"year\", \"month\", \"--\", \"--\", \"CO2\", \"seasonaly_adjusted\", \"fit\",\n", " \"seasonally_adjusted_fit\", \"CO2_filled\", \"seasonally_adjusted_filled\"]\n", "data_monthly.columns = cols\n", "cols.remove(\"--\"); cols.remove(\"--\")\n", "data_monthly = data_monthly[cols]\n", "\n", "# drop rows with nan\n", "data_monthly.dropna(inplace=True)\n", "\n", "# fix time index\n", "data_monthly[\"day\"] = 15\n", "data_monthly.index = pd.to_datetime(data_monthly[[\"year\", \"month\", \"day\"]])\n", "cols.remove(\"year\"); cols.remove(\"month\")\n", "data_monthly = data_monthly[cols]\n", "\n", "data_monthly.head(5)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:07:02.863693Z", "start_time": "2017-09-18T04:07:02.832873Z" }, "collapsed": true }, "outputs": [], "source": [ "# function to convert datetimes to numbers that are useful to algorithms\n", "# this will be useful later when doing prediction\n", "\n", "def dates_to_idx(timelist):\n", " reference_time = pd.to_datetime('1958-03-15')\n", " t = (timelist - reference_time) / pd.Timedelta(1, \"Y\")\n", " return np.asarray(t)\n", "\n", "t = dates_to_idx(data_monthly.index)\n", "\n", "# normalize CO2 levels\n", "y = data_monthly[\"CO2\"].values\n", "first_co2 = y[0]\n", "std_co2 = np.std(y)\n", "y_n = (y - first_co2) / std_co2\n", "\n", "data_monthly = data_monthly.assign(t = t)\n", "data_monthly = data_monthly.assign(y_n = y_n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data might be familiar to you, since it was used as an example in the [Gaussian Processes for Machine Learning](http://www.gaussianprocess.org/gpml/) book by Rasmussen & Williams. The version of the data set they use starts in the late 1950's, but stops at the end of 2003. So that our PyMC3 example is somewhat comparable to their example, we use the stretch of data from before 2004 as the \"training\" set. The data from 2004 to current we'll use to test our predictions. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:07:05.078659Z", "start_time": "2017-09-18T04:07:05.070764Z" }, "collapsed": true }, "outputs": [], "source": [ "# split into training and test set\n", "sep_idx = data_monthly.index.searchsorted(pd.to_datetime(\"2003-12-15\"))\n", "data_prior = data_monthly.iloc[:sep_idx+1, :]\n", "data_after = data_monthly.iloc[sep_idx:, :]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:07:34.021576Z", "start_time": "2017-09-18T04:07:33.851673Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# make plot\n", "\n", "p = figure(x_axis_type='datetime', title='Monthly CO2 Readings from Mauna Loa',\n", " plot_width=700, plot_height=400)\n", "p.yaxis.axis_label = 'CO2 [ppm]'\n", "p.xaxis.axis_label = 'Date'\n", "predict_region = BoxAnnotation(left=pd.to_datetime(\"2003-12-15\"), \n", " fill_alpha=0.1, fill_color=\"firebrick\")\n", "p.add_layout(predict_region)\n", "ppm400 = Span(location=400,\n", " dimension='width', line_color='red',\n", " line_dash='dashed', line_width=2)\n", "p.add_layout(ppm400)\n", "\n", "p.line(data_monthly.index, data_monthly['CO2'], \n", " line_width=2, line_color=\"black\", alpha=0.5)\n", "p.circle(data_monthly.index, data_monthly['CO2'], \n", " line_color=\"black\", alpha=0.1, size=2)\n", "\n", "train_label = Label(x=100, y=165, x_units='screen', y_units='screen',\n", " text='Training Set', render_mode='css', border_line_alpha=0.0,\n", " background_fill_alpha=0.0)\n", "test_label = Label(x=585, y=80, x_units='screen', y_units='screen',\n", " text='Test Set', render_mode='css', border_line_alpha=0.0,\n", " background_fill_alpha=0.0)\n", "\n", "p.add_layout(train_label)\n", "p.add_layout(test_label)\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bokeh plots are interactive, so panning and zooming can be done with the sidebar on the right hand side. The seasonal rise and fall is plainly apparent, as is the upward trend. Here is a link to an plots of [this curve at different time scales, and in the context of historical ice core data](https://scripps.ucsd.edu/programs/keelingcurve/).\n", "\n", "The 400 ppm level is highlighted with a dashed line. In addition to fitting a descriptive model, our goal will be to predict the first month the 400 ppm threshold is crossed, which was [May, 2013](https://scripps.ucsd.edu/programs/keelingcurve/2013/05/20/now-what/#more-741). In the data set above, the CO$_2$ average reading for May, 2013 was about 399.98, close enough to be our correct target date. " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Modeling the Keeling Curve using GPs\n", "\n", "As a starting point, we use the GP model described in Rasmussen & Williams. Instead of using flat priors on covariance function hyperparameters and then maximizing the marginal likelihood like is done in the textbook, we place somewhat informative priors on the hyperparameters and use the NUTS sampler for inference. We use the `gp.Marginal` since Gaussian noise is assumed.\n", "\n", "The R+W model is a sum of three GPs for the signal, and one GP for the noise.\n", "\n", "1. A long term smooth rising trend represented by an exponentiated quadratic kernel.\n", "2. A periodic term that decays away from exact periodicity. This is represented by the product of a `Periodic` covariance function and an exponentiated quadratic.\n", "3. Small and medium term irregularities with a rational quadratic kernel.\n", "4. 
The noise is modeled as the sum of a Matern 3/2 and a white noise kernel (as implemented in the code below).\n", "\n", "The prior on CO$_2$ as a function of time is,\n", "\n", "$$\n", "f(t) \\sim \\mathcal{GP}_{\\text{slow}}(0,\\, k_1(t, t')) + \n", " \\mathcal{GP}_{\\text{med}}(0,\\, k_2(t, t')) + \n", " \\mathcal{GP}_{\\text{per}}(0,\\, k_3(t, t')) +\n", " \\mathcal{GP}_{\\text{noise}}(0,\\, k_n(t, t'))\n", "$$\n", "\n", "## Hyperparameter priors\n", "We use fairly uninformative priors for the scale hyperparameters of the covariance functions, and informative Gamma priors for the lengthscales. The PDFs used for the lengthscale priors are shown below:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:07:45.674064Z", "start_time": "2017-09-18T04:07:40.669818Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "x = np.linspace(0, 150, 5000)\n", "priors = [\n", " (\"ℓ_pdecay\", pm.Gamma.dist(alpha=10, beta=0.075)),\n", " (\"ℓ_psmooth\", pm.Gamma.dist(alpha=4, beta=3)),\n", " (\"period\", pm.Normal.dist(mu=1.0, sd=0.05)),\n", " (\"ℓ_med\", pm.Gamma.dist(alpha=2, beta=0.75)),\n", " (\"α\", pm.Gamma.dist(alpha=5, beta=2)),\n", " (\"ℓ_trend\", pm.Gamma.dist(alpha=4, beta=0.1)),\n", " (\"ℓ_noise\", pm.Gamma.dist(alpha=2, beta=4))]\n", "\n", "colors = brewer['Paired'][7]\n", "\n", "p = figure(title='Lengthscale and period priors',\n", " plot_width=700, plot_height=400)\n", "p.yaxis.axis_label = 'Probability'\n", "p.xaxis.axis_label = 'Years'\n", "\n", "for i, prior in enumerate(priors):\n", " p.line(x, np.exp(prior[1].logp(x).eval()), legend=prior[0], \n", " line_width=3, line_color=colors[i])\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- `ℓ_pdecay`: The periodic decay. The smaller this parameter is, the faster the periodicity goes away. I doubt that the seasonality of the CO$_2$ will be going away any time soon (hopefully), and there's no evidence for that in the data. Most of the prior mass is from 60 to >140 years.\n", "\n", "- `ℓ_psmooth`: The smoothness of the periodic component. It controls how \"sinusoidal\" the periodicity is. The plot of the data shows that seasonality is not an exact sine wave, but its not terribly different from one. We use a Gamma whose mode is at one, and doesn't have too large of a variance, with most of the prior mass from around 0.5 and 2.\n", "\n", "- `period`: The period. We put a very strong prior on $p$, the period that is centered at one. R+W fix $p=1$, since the period is annual. \n", "\n", "- `ℓ_med`: This is the lengthscale for the short to medium long variations. This prior has most of its mass below 6 years.\n", "\n", "- `α`: This is the shape parameter. This prior is centered at 3, since we're expecting there to be some more variation than could be explained by an exponentiated quadratic. \n", "\n", "- `ℓ_trend`: The lengthscale of the long term trend. It has a wide prior with mass on a decade scale. Most of the mass is between 10 to 60 years.\n", "\n", "- `ℓ_noise`: The lengthscale of the noise covariance. This noise should be very rapid, in the scale of several months to at most a year or two." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We know beforehand which GP components should have a larger magnitude, so we try to include this information in the scale parameters. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:08:20.468320Z", "start_time": "2017-09-18T04:08:19.951636Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "x = np.linspace(0, 4, 5000)\n", "priors = [\n", " (\"η_per\", pm.HalfCauchy.dist(beta=2)),\n", " (\"η_med\", pm.HalfCauchy.dist(beta=1.0)),\n", " (\"η_trend\", pm.HalfCauchy.dist(beta=3)), # will use beta=2, but 2.2 is visible on plot\n", " (\"σ\", pm.HalfNormal.dist(sd=0.25)),\n", " (\"η_noise\", pm.HalfNormal.dist(sd=0.5))]\n", "\n", "colors = brewer['Paired'][5]\n", "\n", "p = figure(title='Scale priors',\n", " plot_width=700, plot_height=400)\n", "p.yaxis.axis_label = 'Probability'\n", "p.xaxis.axis_label = 'Years'\n", "\n", "for i, prior in enumerate(priors):\n", " p.line(x, np.exp(prior[1].logp(x).eval()), legend=prior[0], \n", " line_width=3, line_color=colors[i])\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For all of the scale priors we use distributions that shrink the scale towards zero. The seasonal component and the long term trend have the least mass near zero, since they are the largest influences in the data. \n", "\n", "- `η_per`: Scale of the periodic or seasonal component.\n", "- `η_med`: Scale of the short to medium term component.\n", "- `η_trend`: Scale of the long term trend.\n", "- `σ`: Scale of the white noise.\n", "- `η_noise`: Scale of correlated, short term noise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The model in PyMC3\n", "\n", "Below is the actual model. Each of the three component GPs is constructed separately. Since we are doing MAP, we use `Marginal` GPs and lastly call the `.marginal_likelihood` method to specify the marginal posterior. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:08:23.795652Z", "start_time": "2017-09-18T04:08:23.790701Z" }, "collapsed": true }, "outputs": [], "source": [ "# pull out normalized data\n", "t = data_prior[\"t\"].values[:,None]\n", "y = data_prior[\"y_n\"].values" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:10:04.890150Z", "start_time": "2017-09-18T04:08:24.515940Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "logp = 1,667.4, ||grad|| = 1.7276: 100%|██████████| 263/263 [01:27<00:00, 3.02it/s] \n" ] } ], "source": [ "with pm.Model() as model:\n", " # yearly periodic component x long term trend\n", " η_per = pm.HalfCauchy(\"η_per\", beta=2, testval=1.0)\n", " ℓ_pdecay = pm.Gamma(\"ℓ_pdecay\", alpha=10, beta=0.075)\n", " period = pm.Normal(\"period\", mu=1, sd=0.05)\n", " ℓ_psmooth = pm.Gamma(\"ℓ_psmooth \", alpha=4, beta=3)\n", " cov_seasonal = η_per**2 * pm.gp.cov.Periodic(1, period, ℓ_psmooth) \\\n", " * pm.gp.cov.Matern52(1, ℓ_pdecay)\n", " gp_seasonal = pm.gp.Marginal(cov_func=cov_seasonal)\n", " \n", " # small/medium term irregularities\n", " η_med = pm.HalfCauchy(\"η_med\", beta=0.5, testval=0.1)\n", " ℓ_med = pm.Gamma(\"ℓ_med\", alpha=2, beta=0.75)\n", " α = pm.Gamma(\"α\", alpha=5, beta=2) \n", " cov_medium = η_med**2 * pm.gp.cov.RatQuad(1, ℓ_med, α)\n", " gp_medium = pm.gp.Marginal(cov_func=cov_medium)\n", " \n", " # long term trend\n", " η_trend = pm.HalfCauchy(\"η_trend\", beta=2, testval=2.0)\n", " ℓ_trend = pm.Gamma(\"ℓ_trend\", alpha=4, beta=0.1)\n", " cov_trend = η_trend**2 * pm.gp.cov.ExpQuad(1, ℓ_trend)\n", " gp_trend = pm.gp.Marginal(cov_func=cov_trend) \n", "\n", " # noise model\n", " η_noise = pm.HalfNormal(\"η_noise\", sd=0.5, testval=0.05)\n", " ℓ_noise = pm.Gamma(\"ℓ_noise\", alpha=2, beta=4)\n", " σ = 
pm.HalfNormal(\"σ\", sd=0.25, testval=0.05)\n", " cov_noise = η_noise**2 * pm.gp.cov.Matern32(1, ℓ_noise) +\\\n", " pm.gp.cov.WhiteNoise(σ)\n", "\n", " gp = gp_seasonal + gp_medium + gp_trend\n", " \n", " y_ = gp.marginal_likelihood(\"y\", X=t, y=y, noise=cov_noise)\n", " mp = pm.find_MAP(include_transformed=True)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:10:04.906335Z", "start_time": "2017-09-18T04:10:04.898404Z" } }, "outputs": [ { "data": { "text/plain": [ "['period:0.999726976161941',\n", " 'α:1.170964081505256',\n", " 'η_med:0.024147872518165272',\n", " 'η_noise:0.007809337970705424',\n", " 'η_per:0.10222380093634102',\n", " 'η_trend:1.8313633958127935',\n", " 'σ:0.0067451696343060665',\n", " 'ℓ_med:1.5439691919986576',\n", " 'ℓ_noise:0.17300634495608952',\n", " 'ℓ_pdecay:126.01206470514794',\n", " 'ℓ_psmooth :0.7362483610055942',\n", " 'ℓ_trend:51.56586203609078']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# display the results, dont show transformed parameter values\n", "sorted([name+\":\"+str(mp[name]) for name in mp.keys() if not name.endswith(\"_\")])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At first glance the results look reasonable. The lengthscale that determines how fast the seasonality varies is about 126 years. This means that given the data, we wouldn't expect such strong periodicity to vanish until centuries have passed. The trend lengthscale is also long, about 50 years. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examining the fit\n", "\n", "The code below looks at the fit of the total GP, and each component individually. The total fit and its $2\\sigma$ uncertainty is shown in red. 
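Each additive component is pulled out by calling that component's `.predict` method, passing the full model through the `given` argument. A minimal sketch of the pattern used in the next cell (all names are the ones defined above; `tnew` is the grid of prediction points):\n",
    "\n",
    "```python\n",
    "# conditional prediction of just the long term trend component\n",
    "mu_trend, var_trend = gp_trend.predict(tnew, point=mp, diag=True,\n",
    "                                       given={\"gp\": gp, \"X\": t, \"y\": y, \"noise\": cov_noise})\n",
    "```\n",
    "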
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:10:13.646878Z", "start_time": "2017-09-18T04:10:04.912178Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Predicting with gp ...\n", "Predicting with gp_trend ...\n", "Predicting with gp_medium ...\n", "Predicting with gp_seasonal ...\n", "Done\n" ] } ], "source": [ "# predict at a 15 day granularity\n", "dates = pd.date_range(start='3/15/1958', end=\"12/15/2003\", freq=\"15D\")\n", "tnew = dates_to_idx(dates)[:,None]\n", "\n", "print(\"Predicting with gp ...\")\n", "mu, var = gp.predict(tnew, point=mp, diag=True)\n", "mean_pred = mu*std_co2 + first_co2\n", "var_pred = var*std_co2**2\n", "\n", "# make dataframe to store fit results\n", "fit = pd.DataFrame({\"t\": tnew.flatten(), \n", " \"mu_total\": mean_pred, \n", " \"sd_total\": np.sqrt(var_pred)},\n", " index=dates)\n", "\n", "print(\"Predicting with gp_trend ...\")\n", "mu, var = gp_trend.predict(tnew, point=mp,\n", " given={\"gp\": gp, \"X\": t, \"y\": y, \"noise\": cov_noise}, \n", " diag=True)\n", "fit = fit.assign(mu_trend = mu*std_co2 + first_co2, \n", " sd_trend = np.sqrt(var*std_co2**2))\n", "\n", "print(\"Predicting with gp_medium ...\")\n", "mu, var = gp_medium.predict(tnew, point=mp, \n", " given={\"gp\": gp, \"X\": t, \"y\": y, \"noise\": cov_noise}, \n", " diag=True)\n", "fit = fit.assign(mu_medium = mu*std_co2 + first_co2, \n", " sd_medium = np.sqrt(var*std_co2**2))\n", "\n", "print(\"Predicting with gp_seasonal ...\")\n", "mu, var = gp_seasonal.predict(tnew, point=mp,\n", " given={\"gp\": gp, \"X\": t, \"y\": y, \"noise\": cov_noise}, \n", " diag=True)\n", "fit = fit.assign(mu_seasonal = mu*std_co2 + first_co2, \n", " sd_seasonal = np.sqrt(var*std_co2**2))\n", "print(\"Done\")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:10:14.020872Z", "start_time": "2017-09-18T04:10:13.672558Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "## make plot\n", "p = figure(title=\"Fit to the Mauna Loa Data\",\n", " x_axis_type='datetime', plot_width=700, plot_height=400)\n", "p.yaxis.axis_label = 'CO2 [ppm]'\n", "p.xaxis.axis_label = 'Date'\n", "\n", "# plot mean and 2σ region of total prediction\n", "upper = fit.mu_total + 2*fit.sd_total\n", "lower = fit.mu_total - 2*fit.sd_total\n", "band_x = np.append(fit.index.values, fit.index.values[::-1])\n", "band_y = np.append(lower, upper[::-1])\n", "\n", "# total fit\n", "p.line(fit.index, fit.mu_total, \n", " line_width=1, line_color=\"firebrick\", legend=\"Total fit\")\n", "p.patch(band_x, band_y, \n", " color=\"firebrick\", alpha=0.6, line_color=\"white\")\n", "\n", "# trend\n", "p.line(fit.index, fit.mu_trend, \n", " line_width=1, line_color=\"blue\", legend=\"Long term trend\")\n", "\n", "# medium\n", "p.line(fit.index, fit.mu_medium, \n", " line_width=1, line_color=\"green\", legend=\"Medium range variation\")\n", "\n", "# seasonal\n", "p.line(fit.index, fit.mu_seasonal, \n", " line_width=1, line_color=\"orange\", legend=\"Seasonal process\")\n", "\n", "# true value\n", "p.circle(data_prior.index, data_prior['CO2'], \n", " color=\"black\", legend=\"Observed data\")\n", "p.legend.location = \"top_left\"\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The fit matches the observed data very well. The trend, seasonality, and short/medium term effects also are cleanly separated out. Looking closer at the seasonality, it appears to be widening as time goes on. Lets plot a couple years and take a look:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:10:23.758053Z", "start_time": "2017-09-18T04:10:14.022808Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Predicting year 1960\n", "Predicting year 1970\n", "Predicting year 1980\n", "Predicting year 1990\n", "Predicting year 2000\n" ] }, { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot several years\n", "\n", "p = figure(title=\"Several years of the seasonal component\",\n", " plot_width=700, plot_height=400)\n", "p.yaxis.axis_label = 'Δ CO2 [ppm]'\n", "p.xaxis.axis_label = 'Month'\n", "\n", "colors = brewer['Paired'][5]\n", "years = [\"1960\", \"1970\", \"1980\", \"1990\", \"2000\"]\n", "\n", "for i, year in enumerate(years):\n", " dates = pd.date_range(start=\"1/1/\"+year, end=\"12/31/\"+year, freq=\"10D\")\n", " tnew = dates_to_idx(dates)[:,None]\n", " \n", " print(\"Predicting year\", year)\n", " mu, var = gp_seasonal.predict(tnew, point=mp, diag=True,\n", " given={\"gp\": gp, \"X\": t, \"y\": y, \"noise\": cov_noise})\n", " mu_pred = mu*std_co2\n", " \n", " # plot mean\n", " x = np.asarray((dates - dates[0])/pd.Timedelta(1, \"M\")) + 1\n", " p.line(x, mu_pred, \n", " line_width=1, line_color=colors[i], legend=year)\n", "\n", "p.legend.location = \"bottom_left\"\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This plot makes it clear that there is a broadening over time. So it would seem that as there is more CO$_2$ in the atmosphere, [the absorbtion/release cycle due to the growth and decay of vegetation in the northern hemisphere](https://scripps.ucsd.edu/programs/keelingcurve/2013/06/04/why-does-atmospheric-co2-peak-in-may/) becomes more slightly more pronounced. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction\n", "\n", "How well do our forecasts look? Clearly the observed data trends up and the seasonal effect is very pronounced. Does our GP model capture this well enough to make reasonable extrapolations? Our \"training\" set went up until the end of 2003, so we are going to predict from January 2014 out to the end of 2017. \n", "\n", "Although there isn't any particular significance to this event other than it being a nice round number, our side goal was to see how well we could predict the date when the 400 ppm mark is first crossed. [This event first occurred during May, 2013](https://scripps.ucsd.edu/programs/keelingcurve/2013/05/20/now-what/#more-741) and there were a few [news articles about other significant milestones](https://www.usatoday.com/story/tech/sciencefair/2016/09/29/carbon-dioxide-levels-400-ppm-scripps-mauna-loa-global-warming/91279952/)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:10:28.197680Z", "start_time": "2017-09-18T04:10:23.759873Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sampling gp predictions ...\n" ] } ], "source": [ "dates = pd.date_range(start=\"11/15/2003\", end=\"12/15/2020\", freq=\"10D\")\n", "tnew = dates_to_idx(dates)[:,None]\n", "\n", "print(\"Sampling gp predictions ...\")\n", "mu_pred, cov_pred = gp.predict(tnew, point=mp)\n", "\n", "# draw samples, and rescale\n", "n_samples = 20000\n", "samples = pm.MvNormal.dist(mu=mu_pred, cov=cov_pred).random(size=n_samples)\n", "samples = samples * std_co2 + first_co2" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:10:28.448029Z", "start_time": "2017-09-18T04:10:28.199690Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "### make plot\n", "p = figure(x_axis_type='datetime', plot_width=700, plot_height=300)\n", "p.yaxis.axis_label = 'CO2 [ppm]'\n", "p.xaxis.axis_label = 'Date'\n", " \n", "### plot mean and 2σ region of total prediction\n", "# scale mean and var\n", "mu_pred_sc = mu_pred * std_co2 + first_co2\n", "sd_pred_sc = np.sqrt(np.diag(cov_pred) * std_co2**2 )\n", "\n", "upper = mu_pred_sc + 2*sd_pred_sc\n", "lower = mu_pred_sc - 2*sd_pred_sc\n", "band_x = np.append(dates, dates[::-1])\n", "band_y = np.append(lower, upper[::-1])\n", " \n", "p.line(dates, mu_pred_sc, \n", " line_width=2, line_color=\"firebrick\", legend=\"Total fit\")\n", "p.patch(band_x, band_y, \n", " color=\"firebrick\", alpha=0.6, line_color=\"white\")\n", " \n", "# some predictions\n", "idx = np.random.randint(0, samples.shape[0], 10)\n", "p.multi_line([dates]*len(idx), [samples[i,:] for i in idx],\n", " color=\"firebrick\", alpha=0.5, line_width=0.5)\n", "# true value\n", "p.line(data_after.index, data_after['CO2'], \n", " line_width=2, line_color=\"black\", legend=\"Observed data\")\n", "\n", "ppm400 = Span(location=400,\n", " dimension='width', line_color='black',\n", " line_dash='dashed', line_width=1)\n", "p.add_layout(ppm400)\n", "p.legend.location = \"bottom_right\"\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean prediction and the $2\\sigma$ uncertainty is in red. A couple samples from the marginal posterior are also shown on there. It looks like our model was a little optimistic about how much CO2 is being released. The first time the $2\\sigma$ uncertainty crosses the 400 ppm threshold is in May 2015, two years late. \n", "\n", "One reason this is occuring is because our GP prior had zero mean. This means we encoded prior information that says that the function should go to zero as we move away from our observed data. This assumption probably isn't justified. It's also possible that the CO$_2$ trend is increasing faster than linearly -- important knowledge for accurate predictions. Another possibility is the MAP estimate. Without looking at the full posterior, the uncertainty in our estimates is underestimated. How badly is unknown. \n", "\n", "The following code looks at each of the posterior samples, and collects the date when a measurement higher than 400 ppm is predicted. One of the great things about Bayesian approaches is that samples from the posterior can be used to create any arbitrary statistic." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:10:28.475456Z", "start_time": "2017-09-18T04:10:28.449916Z" }, "collapsed": true }, "outputs": [], "source": [ "from collections import Counter\n", "def cross400(preds, dates, start=pd.to_datetime(\"2003-12-15\")):\n", " \"\"\" loop over each prediction, finding first entry that is greater than 400 \"\"\"\n", " times = Counter()\n", " times[start] = 0\n", " N = preds.shape[0]\n", " for i in range(N):\n", " p = preds[i,:]\n", " idx = np.searchsorted(p, 400, side='left')\n", " if idx >= len(dates): # prediction never actually crosses 400\n", " times[\"never\"] += 1\n", " else: #it does cross 400, when?\n", " times[dates[idx]] += 1\n", " never = times.pop('never', 0)\n", " def resampler(x):\n", " if len(x) == 0:\n", " return 0\n", " else:\n", " return x\n", " times = pd.Series(times).resample(\"D\").apply(resampler)\n", " return times, never\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets look at the posterior probability by year: " ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2017-09-18T04:10:29.059926Z", "start_time": "2017-09-18T04:10:28.477414Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First time 400 ppm of CO2 hit: May, 2013\n", "2003: 0.0\n", "2004: 0.0\n", "2005: 0.0\n", "2006: 0.0\n", "2007: 0.0\n", "2008: 0.0\n", "2009: 0.0\n", "2010: 0.0\n", "2011: 0.0\n", "2012: 0.0\n", "2013: 5e-05\n", "2014: 0.0001\n", "2015: 0.00025\n", "2016: 0.0047\n", "2017: 0.0579\n", "2018: 0.1604\n", "2019: 0.1705\n", "2020: 0.2884\n", "Never, or later than 2020: 0.3177\n" ] } ], "source": [ "print(\"First time 400 ppm of CO2 hit: May, 2013\") \n", "times, never = cross400(samples, dates)\n", "for year in range(2003,2021):\n", " val = times[str(year)].sum()\n", " print(str(year)+\": \", val/float(n_samples))\n", "print(\"Never, or later than 2020:\", never/float(n_samples))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having a zero mean GP prior is causing the prediction to be pretty far off. Some possibilities for fixing this is to use a constant mean function, whose value could maybe be assigned the historical, or pre-industrial revolution, CO$_2$ average. This may not be the best indicator for future CO$_2$ levels though. \n", "\n", "Also, using only historical CO$_2$ data may not be the best predictor. In addition to looking at the underlying behavior of what determines CO$_2$ levels using a GP fit, we could also incorporate other information, such as the amount of CO$_2$ that is released by fossil fuel burning. \n", "\n", "Next, we'll see about using PyMC3's GP functionality to improve the model, look at full posteriors, and incorporate other sources of data on drivers of CO$_2$ levels." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 2 }