{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Brier Score\n", "\n", "The Brier score is the most commonly used verification metric for evaluating a probability of a binary outcome forecast, such as a \"chance of rainfall\" forecast.\n", "\n", "Probabilistic forecasts can be expressed as values between 0 and 1, or they could be expressed as an ensemble. We can use `scores` to evaluate both kinds of forecasts.\n", "\n", "## Using the Brier score to evaluate probabilistic forecasts (expressed as values between 0 and 1) of binary events\n", "\n", "Probabilistic forecasts of binary events can be expressed with values between 0 and 1, and observations are exactly 0 (event did not occur), or 1 (event occurred).\n", "\n", "\n", "The metric is then calculated the same way as MSE and is defined as\n", "\n", "$$ s(x,y) = (x - y)^2$$\n", "\n", "Where\n", "\n", "- $x$ is the forecast between 0 and 1, and \n", "- $y$ is the observation which is either 0 or 1.\n", "\n", "The Brier score is a [strictly proper scoring rule](https://sites.stat.washington.edu/people/raftery/Research/PDF/Gneiting2007jasa.pdf) where lower values are better (it is negatively oriented), where a perfect score is 0 and the worst score is 1.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from scores.probability import brier_score\n", "from scipy.stats import beta, binom\n", "\n", "import numpy as np\n", "import xarray as xr\n", "\n", "np.random.seed(100)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# To learn more about the implementation of the Brier score, uncomment the following\n", "# help(brier_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We generate two synthetic forecasts. By design, `fcst1` is a good forecast, while `fcst2` is a poor forecast. We measure the difference in skill by calculating and comparing their Brier scores." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "fcst1 = beta.rvs(2, 1, size=1000)\n", "obs = binom.rvs(1, fcst1)\n", "fcst2 = beta.rvs(0.5, 1, size=1000)\n", "fcst1 = xr.DataArray(data=fcst1, dims=\"time\", coords={\"time\": np.arange(0, 1000)})\n", "fcst2 = xr.DataArray(data=fcst2, dims=\"time\", coords={\"time\": np.arange(0, 1000)})\n", "obs = xr.DataArray(data=obs, dims=\"time\", coords={\"time\": np.arange(0, 1000)})" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Brier score for fcst1 = 0.17\n", "Brier score for fcst2 = 0.43\n" ] } ], "source": [ "brier_fcst1 = brier_score(fcst1, obs)\n", "brier_fcst2 = brier_score(fcst2, obs)\n", "\n", "print(f\"Brier score for fcst1 = {brier_fcst1.item():.2f}\")\n", "print(f\"Brier score for fcst2 = {brier_fcst2.item():.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, `fcst1` has the lower Brier score quantifying the degree to which it is better than `fcst2`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notes\n", "- If you are using the Brier score on large data with Dask, consider setting `check_args` arg to `False` in `brier_score`. \n", "- In the future, the Brier score components calculation will be added.\n", "- You may be interested in working through the Murphy Diagram tutorial which allows you to break down the performance of the Brier score based on each threshold probability.\n", "\n", "## Using the Brier score to evaluate ensemble forecasts\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`scores` can also be used to evaluate ensembles using the Brier score. By default, it calculates the fair Brier score which is defined as\n", "\n", "$$s_{i,y} = \\left(\\frac{i}{m} - y\\right)^2 - \\frac{i(m-i)}{m^2(m-1)}$$\n", "\n", "Where:\n", "\n", "- $i$ is the number of ensemble members that predict the event\n", "- $m$ is the total number of ensemble members\n", "- $y$ is the observed event\n", "\n", "It is possible to calculate the Brier score without the fair correction (by setting ``fair_correction=False``). In this case, the Brier score is calculated as\n", "\n", "\n", "$$s_{i,y} = \\left(\\frac{i}{m} - y\\right)^2$$\n", "\n", "The use of the Brier score without the fair adjustment may favour ensembles that have members that have the behaviour of being sampled from a different distribution than the observations. For more information, see [Ferro (2014)](https://doi.org/10.1002/qj.2270).\n", "\n", "Let's first do a synthetic experiment with 10,000 timesteps to demonstrate the impact of the fair correction using the `scores.probability.brier_score_for_ensemble` function. Suppose the observations are randomly drawn from the distribution $\\mathrm{Pr}(y=1) = 0.5$. Let's now create two ensemble forecasts. The first (we will call `fcst_a`) has 2 ensemble members where the value for each member is randomly drawn from the same distribution that the observations are drawn from and is an ideal forecast. The second ensemble (called `fcst_b`) has 20 ensemble members but is biased. The value for each member is randomly drawn from the distribution $\\mathrm{Pr}(y=1) = 0.4$.\n", "\n", "Let's now calculate the Brier score with and without the fair adjustment." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from scores.probability import brier_score_for_ensemble\n", "\n", "# Uncomment the following line to learn more about the ensemble Brier score\n", "# help(brier_score_for_ensemble)\n", "\n", "N = 10000\n", "M1 = 2\n", "M2 = 20\n", "obs = np.random.choice([0, 1], size=N, p=[0.5, 0.5])\n", "time = np.arange(N)\n", "obs = xr.DataArray(obs, dims=[\"time\"], coords={\"time\": time})\n", "\n", "fcst_a = np.random.choice([0, 1], size=(M1, N), p=[0.5, 0.5])\n", "ensemble = np.arange(M1)\n", "fcst_a = xr.DataArray(fcst_a, dims=[\"ensemble\", \"time\"], coords={\"ensemble\": ensemble, \"time\": time})\n", "\n", "fcst_b = np.random.choice([0, 1], size=(M2, N), p=[0.6, 0.4])\n", "ensemble = np.arange(M2)\n", "fcst_b = xr.DataArray(fcst_b, dims=[\"ensemble\", \"time\"], coords={\"ensemble\": ensemble, \"time\": time})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we calculate the fair Brier score for both ensembles. The `thresholds` arg defines the threshold that an event occurs. By default, events are inclusive of the exact value of the threshold.\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.DataArray (threshold: 1)>\n",
"array([0.2532])\n",
"Coordinates:\n",
" * threshold (threshold) float64 0.5<xarray.DataArray (threshold: 1)>\n",
"array([0.26047579])\n",
"Coordinates:\n",
" * threshold (threshold) float64 0.5<xarray.DataArray (threshold: 1)>\n",
"array([0.37765])\n",
"Coordinates:\n",
" * threshold (threshold) float64 0.5<xarray.DataArray (threshold: 1)>\n",
"array([0.27247375])\n",
"Coordinates:\n",
" * threshold (threshold) float64 0.5<xarray.DataArray (threshold: 4)>\n",
"array([0. , 0.2532, 0.2532, 0. ])\n",
"Coordinates:\n",
" * threshold (threshold) float64 -1.0 0.5 1.0 3.0