# Forecast run parsing

The Solar Forecast Arbiter is designed to analyze *non-overlapping* forecasts. The [documentation](https://solarforecastarbiter.org/definitions/) details the motivation for this choice. Briefly, this forces forecast users and providers to think carefully about what kinds of forecasts they want to analyze. Ideally, these choices will be informed by one or more decision-making processes. 

Forecast providers often design systems that create many overlapping forecasts. For example, new forecasts may be issued once per hour and extend for 48 hours. In some systems, the raw forecasts may also have a higher temporal resolution than the end-user application requires. Providers or end users need to parse these forecasts by parameters such as lead time before they can be analyzed. Similarly, these forecast runs must be parsed into one or more non-overlapping "forecast evaluation time series" before they can be uploaded to the Solar Forecast Arbiter. The figure below illustrates this situation. It shows three forecast runs (green) issued 1 hour apart, each with 15 minute intervals and a length of 3 hours. These forecast runs are sliced, resampled, and concatenated into two different forecast evaluation time series: a 1 hour ahead, 15 minute interval forecast (blue); and a 2 hour ahead, 1 hour interval forecast (red).

![timeline](https://solarforecastarbiter.org/images/timeline_merged.svg)
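
To make the overlap concrete, the short sketch below counts how many forecast runs cover each timestamp for the three runs described above (issued 1 hour apart, 15 minute intervals, 3 hours long). The variable names are illustrative only and are not part of the Solar Forecast Arbiter.

```python
import pandas as pd

# three runs issued 1 hour apart, as in the figure
issue_times = pd.date_range('2020-01-01', periods=3, freq=pd.Timedelta('1h'), tz='UTC')

# each run holds twelve 15 minute intervals (3 hours), labeled by interval beginning
runs = [pd.date_range(t, periods=12, freq='15min') for t in issue_times]

# stack all timestamps and count how many runs contain each one
all_times = runs[0].append(runs[1:])
coverage = pd.Series(1, index=all_times).groupby(level=0).sum()

print(coverage.max())  # largest number of runs covering a single timestamp
```

Any timestamp covered by more than one run has more than one candidate forecast value, which is why the runs must be parsed before analysis.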

This tutorial will demonstrate how to prepare many forecast runs for analysis in the Solar Forecast Arbiter.

In [1]:
import datetime
from pathlib import Path
import os
import shutil

import numpy as np
import pandas as pd

Let's generate forecast data from a simulated operational system. The system will create a new forecast every hour and save each forecast to a CSV file. The forecast run attributes will be similar to those shown in the figure above:

* 1 hour issue frequency
* 3 hour run length
* 0 lead time
* 15 minute interval length

Each forecast value will equal the hour of the issue time plus the hour of the forecast timestamp. For example, the run issued at 01:00 has a value of 1 + 2 = 3 at 02:00.

In [2]:
# make a new directory to store the output. 
# a cell at the end of the notebook will delete this directory using shutil.rmtree(fx_path)
fx_path = Path('generated_forecasts')
fx_path.mkdir(exist_ok=True)  # exist_ok avoids an error if this cell is re-run

In [3]:
issue_start = pd.Timestamp('2020-01-01', tz='UTC')
issue_end = issue_start + pd.Timedelta('12h')
issue_freq = pd.Timedelta('1h')
closed = 'left'  # consistent with interval averages labeled by the beginning of the interval
# pandas >= 1.4 names this keyword `inclusive`; older versions call it `closed`
issue_times = pd.date_range(start=issue_start, end=issue_end, freq=issue_freq, inclusive=closed)

lead_time = pd.Timedelta('0min')
run_length = pd.Timedelta('3h')
interval_length = pd.Timedelta('15min')

time_format = '%Y%m%dT%H%M%SZ'

for issue_time in issue_times:
    fx_start = issue_time + lead_time
    fx_end = fx_start + run_length 
    # pandas >= 1.4 names this keyword `inclusive`; older versions call it `closed`
    fx_index = pd.date_range(start=fx_start, end=fx_end, freq=interval_length, inclusive=closed, name='timestamp')
    fx_values = issue_time.hour + fx_index.hour
    fx = pd.Series(fx_values, index=fx_index, name='value')
    issue_time_str = issue_time.strftime(time_format)
    header = f'# forecast issued at {issue_time_str}\n'
    file_name = f'fx_{issue_time_str}.csv'
    with open(fx_path / file_name, 'w') as f:
        f.write(header)
        fx.to_csv(f)

Here's a list of all of the files we created:

In [4]:
fx_files = sorted(fx_path.iterdir())
fx_files

[PosixPath('generated_forecasts/fx_20200101T000000Z.csv'),
 PosixPath('generated_forecasts/fx_20200101T010000Z.csv'),
 PosixPath('generated_forecasts/fx_20200101T020000Z.csv'),
 PosixPath('generated_forecasts/fx_20200101T030000Z.csv'),
 PosixPath('generated_forecasts/fx_20200101T040000Z.csv'),
 PosixPath('generated_forecasts/fx_20200101T050000Z.csv'),
 PosixPath('generated_forecasts/fx_20200101T060000Z.csv'),
 PosixPath('generated_forecasts/fx_20200101T070000Z.csv'),
 PosixPath('generated_forecasts/fx_20200101T080000Z.csv'),
 PosixPath('generated_forecasts/fx_20200101T090000Z.csv'),
 PosixPath('generated_forecasts/fx_20200101T100000Z.csv'),
 PosixPath('generated_forecasts/fx_20200101T110000Z.csv')]

Now inspect a couple of the files to confirm the data is as we expected.

In [5]:
with open(fx_files[0]) as f:
    print(f.read())

# forecast issued at 20200101T000000Z
timestamp,value
2020-01-01 00:00:00+00:00,0
2020-01-01 00:15:00+00:00,0
2020-01-01 00:30:00+00:00,0
2020-01-01 00:45:00+00:00,0
2020-01-01 01:00:00+00:00,1
2020-01-01 01:15:00+00:00,1
2020-01-01 01:30:00+00:00,1
2020-01-01 01:45:00+00:00,1
2020-01-01 02:00:00+00:00,2
2020-01-01 02:15:00+00:00,2
2020-01-01 02:30:00+00:00,2
2020-01-01 02:45:00+00:00,2



In [6]:
with open(fx_files[1]) as f:
    print(f.read())

# forecast issued at 20200101T010000Z
timestamp,value
2020-01-01 01:00:00+00:00,2
2020-01-01 01:15:00+00:00,2
2020-01-01 01:30:00+00:00,2
2020-01-01 01:45:00+00:00,2
2020-01-01 02:00:00+00:00,3
2020-01-01 02:15:00+00:00,3
2020-01-01 02:30:00+00:00,3
2020-01-01 02:45:00+00:00,3
2020-01-01 03:00:00+00:00,4
2020-01-01 03:15:00+00:00,4
2020-01-01 03:30:00+00:00,4
2020-01-01 03:45:00+00:00,4



We now have a series of overlapping forecast runs that must be sliced and reassembled into something the Solar Forecast Arbiter can use. The for loop below does this for the 1 hour ahead, 15 minute interval forecast shown in blue in the figure.

In [7]:
lead_time = pd.Timedelta('1h')
run_length = pd.Timedelta('1h')

fx_parsed = []
for fx_run in fx_files:
    fx = pd.read_csv(fx_run, comment='#', index_col=0, parse_dates=True)
    # get issue time from filename. remove .csv suffix, then remove fx_ prefix
    issue_time = fx_run.name.split('.')[0].split('_')[1]
    issue_time = pd.Timestamp(issue_time)
    fx_start = issue_time + lead_time
    # -1ns to account for interval_label=beginning when slicing
    fx_end = fx_start + run_length - pd.Timedelta('1ns')  
    fx_sliced = fx.loc[fx_start:fx_end]
    fx_parsed.append(fx_sliced)
    
blue_fx_concat = pd.concat(fx_parsed)
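
The loop above reads the issue time from each file name. As a hedged alternative, the sketch below parses the same file names with the explicit `time_format` used when the files were written, so a malformed name raises an error instead of being interpreted loosely. The helper name `issue_time_from_path` is hypothetical.

```python
from pathlib import Path

import pandas as pd

time_format = '%Y%m%dT%H%M%SZ'  # same format used when the files were written


def issue_time_from_path(path):
    """Parse the issue time from a name like fx_20200101T010000Z.csv."""
    # Path.stem drops the .csv suffix; the split removes the fx_ prefix
    stamp = Path(path).stem.split('_', 1)[1]
    # an explicit format raises ValueError if the name does not match
    return pd.to_datetime(stamp, format=time_format, utc=True)


issue_time = issue_time_from_path('generated_forecasts/fx_20200101T010000Z.csv')
print(issue_time)
```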

In [8]:
blue_fx_concat

timestamp,value
2020-01-01 01:00:00+00:00,1
2020-01-01 01:15:00+00:00,1
2020-01-01 01:30:00+00:00,1
2020-01-01 01:45:00+00:00,1
2020-01-01 02:00:00+00:00,3
2020-01-01 02:15:00+00:00,3
2020-01-01 02:30:00+00:00,3
2020-01-01 02:45:00+00:00,3
2020-01-01 03:00:00+00:00,5
2020-01-01 03:15:00+00:00,5
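
Before uploading, it may be worth confirming that a concatenated series really is non-overlapping. The sketch below is one simple way to check, assuming that "non-overlapping" means every timestamp appears once and in increasing order. The function name `is_non_overlapping` is hypothetical and not part of the Solar Forecast Arbiter.

```python
import pandas as pd


def is_non_overlapping(index):
    """Return True if timestamps are unique and sorted in increasing order."""
    return bool(index.is_unique and index.is_monotonic_increasing)


# two consecutive 1 hour slices of 15 minute data
first = pd.date_range('2020-01-01 01:00', periods=4, freq='15min', tz='UTC')
second = pd.date_range('2020-01-01 02:00', periods=4, freq='15min', tz='UTC')
good = first.append(second)

# a second slice that starts 30 minutes before the first one ends
bad = first.append(first + pd.Timedelta('30min'))

print(is_non_overlapping(good))  # True
print(is_non_overlapping(bad))   # False
```

The same check could be applied to `blue_fx_concat.index` from the cell above.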


Next we repeat the process for the 2 hour ahead, 1 hour interval red forecast.

In [9]:
lead_time = pd.Timedelta('2h')
interval_length = pd.Timedelta('1h')
run_length = pd.Timedelta('1h')

fx_parsed = []
for fx_run in fx_files:
    fx = pd.read_csv(fx_run, comment='#', index_col=0, parse_dates=True)
    # get issue time from filename. remove .csv suffix, then remove fx_ prefix
    issue_time = fx_run.name.split('.')[0].split('_')[1]
    issue_time = pd.Timestamp(issue_time)
    fx_start = issue_time + lead_time
    # -1ns to account for interval_label=beginning when slicing
    fx_end = fx_start + run_length - pd.Timedelta('1ns')  
    fx_sliced = fx.loc[fx_start:fx_end].resample(interval_length).mean()
    fx_parsed.append(fx_sliced)
    
red_fx_concat = pd.concat(fx_parsed)
red_fx_concat

timestamp,value
2020-01-01 02:00:00+00:00,2
2020-01-01 03:00:00+00:00,4
2020-01-01 04:00:00+00:00,6
2020-01-01 05:00:00+00:00,8
2020-01-01 06:00:00+00:00,10
2020-01-01 07:00:00+00:00,12
2020-01-01 08:00:00+00:00,14
2020-01-01 09:00:00+00:00,16
2020-01-01 10:00:00+00:00,18
2020-01-01 11:00:00+00:00,20
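
The resampling step is easier to see with values that differ within the hour. In this small sketch (made-up values, not taken from the files above), four 15 minute values are averaged into a single 1 hour value that keeps the beginning-of-interval label.

```python
import pandas as pd

# four 15 minute values covering 02:00 to 03:00, labeled by interval beginning
index = pd.date_range('2020-01-01 02:00', periods=4, freq='15min', tz='UTC', name='timestamp')
fx = pd.Series([1.0, 2.0, 3.0, 4.0], index=index, name='value')

# average into 1 hour intervals; hourly bins are left-closed and left-labeled by default
hourly = fx.resample(pd.Timedelta('1h')).mean()
print(hourly)
```

The result has one row, labeled 02:00, with the mean of the four values (2.5).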


These forecasts could be uploaded to the Solar Forecast Arbiter if the corresponding metadata exists. Users may choose to create the metadata using the [Dashboard](https://dashboard.solarforecastarbiter.org) or programmatically. See the [Data Upload and Download in Python](data_upload_download.ipynb) tutorial for an example of how to programmatically create the metadata.

A more sophisticated workflow might start by formally defining the metadata in the Solar Forecast Arbiter and then use that metadata to parse the forecast runs. This is left as an exercise for the reader!
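
As one possible starting point for that exercise, the sketch below drives the parsing from a small parameter object. `EvaluationSpec` and `parse_runs` are hypothetical names that stand in for real forecast metadata; they are not Solar Forecast Arbiter classes or functions, and the runs are generated in memory rather than read from files.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class EvaluationSpec:
    """Hypothetical stand-in for forecast metadata."""
    lead_time: pd.Timedelta
    interval_length: pd.Timedelta
    run_length: pd.Timedelta


def parse_runs(runs, spec):
    """Slice each run by lead time, resample it, and concatenate the pieces.

    runs maps issue time (pd.Timestamp) to a beginning-labeled pd.Series.
    """
    parsed = []
    for issue_time in sorted(runs):
        start = issue_time + spec.lead_time
        # -1ns makes the slice end exclusive for beginning-labeled intervals
        end = start + spec.run_length - pd.Timedelta('1ns')
        sliced = runs[issue_time].loc[start:end]
        if sliced.empty:
            continue  # this run does not reach the requested lead time
        parsed.append(sliced.resample(spec.interval_length).mean())
    return pd.concat(parsed)


# simulate three hourly runs with the same value rule used earlier in the tutorial
runs = {}
for hour in range(3):
    issue_time = pd.Timestamp('2020-01-01', tz='UTC') + pd.Timedelta(hours=hour)
    index = pd.date_range(issue_time, periods=12, freq='15min', name='timestamp')
    runs[issue_time] = pd.Series(issue_time.hour + index.hour, index=index, name='value')

red_spec = EvaluationSpec(pd.Timedelta('2h'), pd.Timedelta('1h'), pd.Timedelta('1h'))
red = parse_runs(runs, red_spec)
print(red)
```

Changing only the `EvaluationSpec` values (for example a 1 hour lead time and a 15 minute interval length) would produce the blue series from the same runs.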

Finally, let's be tidy and delete the directory and files that we created.

In [10]:
shutil.rmtree(fx_path)