{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Open Power System Data: time series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Part of the project [Open Power System Data](http://open-power-system-data.org/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "* [1. Introductory Notes](#1.-Introductory-Notes)\n", "* [2. Settings](#2.-Settings)\n", "\t* [2.1 Preparations](#2.1-Preparations)\n", "\t* [2.2 Specify parameters](#2.2-Specify-parameters)\n", "* [3. Download](#3.-Download)\n", "* [4. Read](#4.-Read)\n", "* [5. Processing](#5.-Processing)\n", "\t* [5.1 Missing Data Handling](#5.1-Missing-Data-Handling)\n", "\t* [5.2 Aggregate German data from individual TSOs](#5.2-Aggregate-German-data-from-individual-TSOs)\n", "\t* [5.3 Create hourly data from 15' data](#5.3-Create-hourly-data-from-15'-data)\n", "* [6. Create metadata](#6.-Create-metadata)\n", "\t* [6.1 General metadata](#6.1-General-metadata)\n", "\t* [6.2 Columns-specific metadata](#6.2-Columns-specific-metadata)\n", "* [7. Write the data to disk](#7.-Write-the-data-to-disk)\n", "\t* [7.1 Write to SQL-database](#7.1-Write-to-SQL-database)\n", "\t* [7.2 Write to Excel](#7.2-Write-to-Excel)\n", "\t* [7.3 Write to CSV](#7.3-Write-to-CSV)\n", "* [8. Plausibility checks](#8.-Plausibility-checks)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Introductory Notes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This Jupyter notebook python script downloads and processes time-series data from European power systems. The notebook has been used to create the [timeseries-datapackage](http://data.open-power-system-data.org/datapackage_timeseries/) that is available on the [Open Power System Data plattform](http://data.open-power-system-data.org/).\n", "\n", "A Jupyter notebook is a file that combines executable programming code with visualizations and comments in markdown format, allowing for an intuitive documentation of the code.\n", "\n", "The notebook is hosted in a [GitHub repository](https://github.com/Open-Power-System-Data/datapackage_timeseries) that can be [downloaded](https://github.com/Open-Power-System-Data/datapackage_timeseries/archive/master.zip) for execution on your local computer (You need a running python installation to do this, for example [Anaconda](https://www.continuum.io/downloads)).\n", "\n", "The download and read functions are implemented as distinct modules that are imported to this notebook. Click below to inspect the code (The link to the local copy will only work if you are running this notebook on your yomputer):\n", "\n", "- **The sources file** ([GitHub](https://github.com/Open-Power-System-Data/datapackage_timeseries/blob/2016-07-14/config/sources.yml) / [local copy](config/sources.yml))\n", "\n", "- **The download script** ([GitHub](https://github.com/Open-Power-System-Data/datapackage_timeseries/blob/2016-07-14/timeseries_scripts/download.py) / [local copy](timeseries_scripts\\download.py)) downloads the data from our [sources](http://open-power-system-data.org/opsd-sources#time-series) to your hard drive.\n", "- **The read script** ([GitHub](https://github.com/Open-Power-System-Data/datapackage_timeseries/blob/2016-07-14/timeseries_scripts/read.py) / [local copy](timeseries_scripts\\read.py)) reads each downloaded file into a pandas-DataFrame and merges data from different sources but with the same time resolution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. 
Settings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.1 Preparations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load libraries, set up a log etc.\n", "This notebook makes use of the [pycountry](https://pypi.python.org/pypi/pycountry) library that is not part of Anaconda. Install it with with `pip install pycountry` from your command line." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from datetime import datetime, date, timedelta\n", "import pandas as pd\n", "import numpy as np\n", "import logging\n", "import pycountry\n", "import json\n", "import sqlite3\n", "import yaml\n", "from itertools import chain\n", "\n", "from timeseries_scripts.read import read\n", "from timeseries_scripts.download import download \n", "# reload modules with execution of any code, to avoid having to restart \n", "# the kernel after editing timeseries_scripts\n", "%load_ext autoreload\n", "%autoreload 2 \n", "\n", "logger = logging.getLogger('log')\n", "logger.setLevel('INFO')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.2 Specify parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Optionally, specify the beginning and end of the interval for which to attempt\n", "# The download. If None, all available data will be downloadet\n", "start_date = None #date(2016, 1, 1)\n", "end_date = None #date(2016, 6, 30)\n", "\n", "sources_yaml_path = 'config/sources.yml'\n", "out_path = 'original_data'\n", "\n", "# Optionally, specify a subset to download/read, e.g. subset=['TenneT', '50Hertz']\n", "include_sources = ['50Hertz', 'Amprion', 'TenneT', 'TransnetBW', 'Elia', 'ENTSO-E', 'OPSD', 'Svenska Kraftnaet']\n", "#include_sources = ['Svenska Kraftnaet']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Download" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download sources are in `config/sources.yml`, which specifies, for each source, the variables (such as wind and solar generation) alongside all the parameters necessary to execute the downloads.\n", "\n", "First, a data directory is created on your local computer. Then, download parameters for each data source are defined, including the URL. These parameters are then turned into a YAML-string. Finally, the download is executed one by one. If all data need to be downloaded, this usually takes several hours.\n", "\n", "\n", "Each file is saved under it's original filename. Note that the original file names are often not self-explanatory (called \"data\" or \"January\"). The files content is revealed by its place in the directory structure." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "download(sources_yaml_path, out_path, end_date=end_date, start_date=start_date, subset=include_sources)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Energinet.dk data needs to be downloadet manually from http://www.energinet.dk/en/el/engrosmarked/udtraek-af-markedsdata/Sider/default.aspx\n", "Check The following Boxes Nand then press the \"Get extract\"-button at the end of the page:\n", "\n", "Period\n", "Get data from: 01-01-2000 To: Today\n", "all months\n", "\n", "Data columns\n", "Elspot Price, Currency Code/MWh\n", "\n", "- DK-West\n", "- DK-East\n", "- Norway\n", "- Sweden (SE)\n", "- Sweden (SE3)\n", "- Sweden (SE4)\n", "- DE European Power Exchange\n", "\n", "Production and consumption, MWh/h\n", "\n", "- DK-West: Wind power production\n", "- DK-West: Solar cell production (estimated)\n", "- DK-East: Wind power production\n", "- DK-East: Solar cell production (estimated)\n", "- DK: Wind power production (onshore)\n", "- DK: Wind power production (offshore)\n", "\n", "Data format\n", "Currency code EUR\n", "Decimal format English number Format (period as decimal separator)\n", "Date format Other date format(YYYY-MM-DD)\n", "Recieve to Excel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Read" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section reads each downloaded file into a pandas-DataFrame and merges data from different sources but with the same time resolution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the names of the rows at the top of the data used to store metadata internally. In the output data, this information will be moved to the datapackage.json File." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "headers = ['variable', 'region', 'attribute', 'source', 'web']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "data_sets = read(sources_yaml_path, out_path, headers, subset=include_sources)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "data_sets['60min']#['2015-12-31 12':]#.xs('CH', axis=1, level='region', drop_level=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save/load the data already parsed for faster access if you need to restart the skript" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data_sets['15min'].to_pickle('data_sets_15.pickle')\n", "data_sets['60min'].to_pickle('data_sets_60.pickle')\n", "#data_sets = {}\n", "#data_sets['15min'] = pd.read_pickle('data_sets_15_Elia.pickle')\n", "##data_sets['60min'] = pd.read_pickle('data_sets_60.pickle')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section performs some aggregations and transforms the data to the [tabular data package format](http://data.okfn.org/doc/tabular-data-package), where actual data is saved in a CSV file, while metadata (information on format, units, sources, and descriptions) is stored in a JSON file." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.1 Missing Data Handling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Patch missing data. At this stage, only implemented for 15 minute resolution solar/wind in-feed data from german TSOs. Small gaps (up to 2 hours) are filled by linear interpolation. For the generation timeseries, larger gaps are guessed by up-/down scaling the data from other balancing areas to fit the expected magnitude of the missing data.\n", "\n", "The locations of missing data are stored in the nan_table DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def interpolator(i, j, row, col, col_name, nan_regs, one_period):\n", " '''interpolate missing value spans up to 2 hours'''\n", " if i + 1 == len(nan_regs):\n", " logger.info(\n", " '%s : \\n '\n", " 'interpolated %s up-to-2-hour-spans of NaNs',\n", " col_name[0:3], i + 1 - j\n", " )\n", " to_fill = slice(row['start_idx'] - one_period, row['till_idx'] + one_period)\n", " col.iloc[:,0].loc[to_fill] = col.iloc[:,0].loc[to_fill].interpolate()\n", " return col\n", "\n", "def guesser(row, col, col_name, nan_regs, frame, one_period):\n", " '''guess missing value spans longer than one hour based on other tsos'''\n", " #logger.info('guessed %s entries after %s', row['count'], row['start_idx'])\n", " day_before = pd.DatetimeIndex(\n", " freq='15min',\n", " start=row['start_idx'] - timedelta(hours=24),\n", " end=row['start_idx'] - one_period\n", " )\n", "\n", " to_fill = pd.DatetimeIndex(\n", " freq='15min',\n", " start=row['start_idx'],\n", " end=row['till_idx']\n", " )\n", "\n", " # other_tsos = [c[1] for c in compact.drop(col_name, axis=1).loc[:,(col_name[0],slice(None),col_name[2])].columns.tolist()]\n", " other_tsos = [\n", " tso\n", " for tso\n", " in ['DE50hertz', 'DEamprion', 'DEtennet', 'DEtransnetbw']\n", " if tso != col_name[1]\n", " ]\n", "\n", " # select columns with data for same technology (wind/solar) but from other TSOs\n", " similar = frame.loc[:,(col_name[0],other_tsos,col_name[2])]\n", " # calculate the sum using columns without NaNs the day \n", " # before or during the period to be guessed\n", " similar = similar.dropna(\n", " axis=1,\n", " how='any',\n", " subset=day_before.append(to_fill)\n", " ).sum(axis=1)\n", " # calculate scaling factor for other TSO data\n", " factor = similar.loc[day_before].sum(axis=0) / col.loc[day_before,:].sum(axis=0)\n", "\n", " guess = similar.loc[to_fill] / float(factor)\n", " col.iloc[:,0].loc[to_fill] = guess\n", " a = float(col.iloc[:,0].loc[row['start_idx'] - one_period])\n", " b = float(col.iloc[:,0].loc[row['start_idx']])\n", " if a == 0:\n", " deviation = '{} absolut'.format(a - b)\n", " else:\n", " deviation = '{:.2f} %'.format((a - b) / a * 100)\n", " logger.info(\n", " '%s : \\n '\n", " 'guessed %s entries after %s \\n '\n", " 'last non-missing: %s \\n '\n", " 'first guessed: %s \\n '\n", " 'deviation of first guess from last known value: %s',\n", " col_name[0:3], row['count'], row['start_idx'], a, b, deviation\n", " ) \n", " return col\n", "\n", "def chooser(col, col_name, nan_regs, frame, one_period):\n", " for i, row in nan_regs.iterrows():\n", " j = 0\n", " # interpolate missing value spans up to 2 hours\n", " if row['span'] <= timedelta(hours=2):\n", " col = interpolator(i, j, row, col, col_name, nan_regs, one_period)\n", " # guess missing value spans longer than one hour based on other tsos\n", " elif col_name[1][:2] == 'DE' and col_name[2] == 
'generation':\n", " j += 1\n", " col = guesser(row, col, col_name, nan_regs, frame, one_period)\n", " return col\n", "\n", "def nan_finder(frame, patch=False):\n", " '''Search for missing values in a DataFrame and apply custom patching.'''\n", " nan_table = pd.DataFrame()\n", " patched = pd.DataFrame()\n", " one_period = frame.index[1] - frame.index[0]\n", " for col_name, col in frame.iteritems():\n", " col = col.to_frame() # kann man colname wieder an df drankleben? df sollte col heißen\n", "\n", " # tag all occurences of NaN in the data (but not before first actual entry or after last one)\n", " col['tag'] = (\n", " (col.index >= col.first_valid_index()) &\n", " (col.index <= col.last_valid_index()) &\n", " col.isnull().transpose().as_matrix()\n", " ).transpose()\n", "\n", " # make another DF to hold info about each region\n", " nan_regs = pd.DataFrame()\n", "\n", " # first row of consecutive region is a True preceded by a False in tags\n", " nan_regs['start_idx'] = col.index[\n", " col['tag'] & ~ \n", " col['tag'].shift(1).fillna(False)]\n", "\n", " # last row of consecutive region is a False preceded by a True \n", " nan_regs['till_idx'] = col.index[\n", " col['tag'] & ~ \n", " col['tag'].shift(-1).fillna(False)] \n", " \n", " if not col['tag'].any():\n", " logger.info('%s : nothing to patch in this column', col_name[0:3])\n", " col.drop('tag', axis=1, inplace=True)\n", " nan_idx = pd.MultiIndex.from_arrays([\n", " [0, 0, 0, 0],\n", " ['count', 'span', 'start_idx', 'till_idx']\n", " ]\n", " )\n", " nan_list = pd.DataFrame(index=nan_idx, columns=col.columns)\n", " else:\n", " # how long is each region\n", " nan_regs['span'] = nan_regs['till_idx'] - nan_regs['start_idx'] + one_period\n", " nan_regs['count'] = (nan_regs['span'] / one_period)\n", " # sort the info DF to put longest missing region on top\n", " nan_regs = nan_regs.sort_values(\n", " 'count',\n", " ascending=False\n", " ).reset_index(drop=True)\n", " \n", " col.drop('tag', axis=1, inplace=True)\n", " nan_list = nan_regs.stack().to_frame()\n", " nan_list.columns = col.columns\n", " \n", " if patch:\n", " col = chooser(col, col_name, nan_regs, frame, one_period)\n", " if len(patched) == 0:\n", " patched = col\n", " else:\n", " patched = patched.combine_first(col)\n", "\n", " if len(nan_table) == 0:\n", " nan_table = nan_list\n", " else:\n", " nan_table = nan_table.combine_first(nan_list)\n", "\n", " nan_table.columns.names = headers\n", " patched.columns.names = headers\n", "\n", " return patched, nan_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Patch the 15 minutes dataset and display the location of missing Data in the original data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "patched, nan_table = nan_finder(data_sets['15min'], patch=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "nan_table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "nan_table.to_excel('nan_table2.xlsx')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Execute this to see whether there is still missing data. This is the case for some of the forecast columns." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "patched2, nan_table2 = nan_finder(patched)\n", "nan_table2#.loc[(slice(None),['count','start_idx']),:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Execute this to see an example of where the data has been patched." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data_sets['15min'].loc['2015-10-24 23:00:00':'2015-10-25 03:00:00', 'wind']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "patched.loc['2015-10-24 23:00:00':'2015-10-25 03:00:00', 'wind']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace the untreated data set with the patched one." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data_sets['15min'] = patched" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.2 Aggregate German data from individual TSOs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The wind and solar in-feed data for the 4 German balancing areas is summed up and stored in in new columns, which are then used to calculate profiles, that is, the share of wind/solar capacity producing at a given time. The column headers are created in the fashion introduced in the read script." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "web = 'http://data.open-power-system-data.org/datapackage_timeseries'\n", "for tech in ['wind', 'solar']:\n", " for attribute in ['generation', 'forecast']:\n", " sum_col = pd.Series()\n", " for tso in ['DE50hertz', 'DEamprion', 'DEtennet', 'DEtransnetbw']:\n", " try:\n", " add_col = data_sets['15min'][tech, tso, attribute]\n", " if len(sum_col) == 0:\n", " sum_col = add_col\n", " else:\n", " sum_col = sum_col + add_col.values\n", " except KeyError:\n", " pass\n", " \n", " # Create a new MultiIndex\n", " tuples = [(tech, 'DE', attribute, 'own calculation', web)]\n", " columns = pd.MultiIndex.from_tuples(tuples, names=headers)\n", " sum_col.columns = columns\n", " data_sets['15min'] = data_sets['15min'].combine_first(sum_col)\n", " \n", " # Calculate the profile column\n", " try:\n", " if attribute == 'generation':\n", " profile_col = sum_col.values / data_sets['15min'][tech, 'DE', 'capacity']\n", " tuples = [(tech, 'DE', 'profile', 'own calculation', web)]\n", " columns = pd.MultiIndex.from_tuples(tuples, names=headers)\n", " profile_col.columns = columns\n", " data_sets['15min'] = data_sets['15min'].combine_first(profile_col)\n", " except KeyError:\n", " pass # FIXME" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "New columns for the aggregated data have been added to the 15 minutes dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data_sets['15min']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.3 Create hourly data from 15' data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The German renewables in-feed data comes in 15-minute intervals. We resample it to hourly intervals in order to match the load data from ENTSO-E." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "resampled = data_sets['15min'].resample('H').mean()\n", "try:\n", " data_sets['60min'] = data_sets['60min'].combine_first(resampled)\n", "except KeyError:\n", " data_sets['60min'] = resampled" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "New columns for the resampled data have been added to the 60 minutes dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "data_sets['60min']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Insert a column with Central European (Summer-)time.\n", "# Still causes some problems, not recommended\n", "#for res_key, df in data_sets.items():\n", "# if not df.empty:\n", "# df.insert(0, 'cet-timestamp', df.index.tz_localize('UTC').tz_convert('Europe/Brussels'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6. Create metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this part, we create the metadata that will document the data output in CSV format. The metadata we be stored in JSON format, which is very much like a python dictionary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6.1 General metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we define the general metadata for the timeseries datapackage" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": false }, "outputs": [], "source": [ "metadata_head = '''\n", "name: opsd-timeseries\n", "\n", "title: 'Time-series data: load, wind and solar, prices'\n", "\n", "description: This data package contains different kinds of timeseries\n", " data relevant for power system modelling, namely electricity consumption \n", " (load) for 36 European countries as well as wind and solar power generation\n", " and capacities and prices for a growing subset of countries. \n", " The timeseries become available at different points in time depending on the\n", " sources. The full dataset is only available from 2012 onwards. The\n", " data has been downloaded from the sources, resampled and merged in\n", " a large CSV file with hourly resolution. Additionally, the data\n", " available at a higher resolution (Some renewables in-feed, 15\n", " minutes) is provided in a separate file. 
All data processing is\n", " conducted in python and pandas and has been documented in the\n", " Jupyter notebooks linked below.\n", "opsd-jupyter-notebook-url: https://github.com/Open-Power-System-Data/\n", " datapackage_timeseries/blob/2016-07-14/main.ipynb\n", "\n", "version: '2016-07-14'\n", "\n", "opsd-changes-to-last-version: Included data from Energinet.DK, Elia and \n", " Svenska Kraftnaet\n", "\n", "keywords:\n", " - timeseries\n", " - electricity\n", " - in-feed\n", " - capacity\n", " - renewables\n", " - wind\n", " - solar\n", " - load\n", " - tso\n", " - europe\n", " - germany\n", "\n", "geographical-scope: Europe\n", "\n", "licenses: \n", " - url: http://example.com/license/url/here\n", " version: 1.0\n", " name: License Name Here\n", " id: license-id-from-open\n", "\n", "views: \n", " - {}\n", "\n", "maintainers:\n", " - web: http://example.com/\n", " name: Jonathan Muehlenpfordt\n", " email: muehlenpfordt@neon-energie.de\n", "\n", "resources:\n", "'''\n", "\n", "source_template = '''\n", " - name: {source}\n", " web: {web}\n", "'''\n", "\n", "resource_template = '''\n", " - path: timeseries{res_key}.csv\n", " format: csv\n", " mediatype: text/csv\n", " alternative_formats:\n", " - path: timeseries{res_key}.csv\n", " stacking: Singleindex\n", " format: csv\n", " - path: timeseries{res_key}.xlsx\n", " stacking: Singleindex\n", " format: xlsx\n", " - path: timeseries{res_key}_multiindex.xlsx\n", " stacking: Multiindex\n", " format: xlsx\n", " - path: timeseries{res_key}_multiindex.csv\n", " stacking: Multiindex\n", " format: csv\n", " - path: timeseries{res_key}_stacked.csv\n", " stacking: Stacked\n", " format: csv\n", " schema:\n", " fields:\n", "'''\n", "\n", "indexfield = '''\n", " - name: timestamp\n", " description: Start of timeperiod in UTC\n", " type: datetime\n", " format: YYYY-MM-DDThh:mm:ssZ\n", "'''\n", "\n", "field_template = '''\n", " - name: {variable}_{region}_{attribute}\n", " description: {description}\n", " type: number\n", " source:\n", " name: {source}\n", " web: {web}\n", " opsd-properties: \n", " Region: {region}\n", " Variable: {variable}\n", " Attribute: {attribute}\n", "'''\n", "\n", "descriptions_template = '''\n", "load: Consumption in {geo} in MW\n", "generation: Actual {tech} generation in {geo} in MW\n", "actual: Actual {tech} generation in {geo} in MW\n", "forecast: {tech} day-ahead generation forecast in {geo} in MW\n", "capacity: {tech} capacity in {geo} in MW\n", "profile: Share of {tech} capacity producing in {geo}\n", "offshoreshare: {tech} actual offshore generation in {geo} in MW\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6.2 Columns-specific metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each dataset/outputfile, the metadata has an entry in the \"resources\" list that describes the file/dataset. The main part of each entry is the \"schema\" dictionary, consisting of a list of \"fields\", meaning the columns in the dataset. The first field is the timestamp index of the dataset. For the other fields, we iterate over the columns of the MultiIndex index of the datasets to contruct the corresponding metadata." 
] }, { "cell_type": "code", "execution_count": 103, "metadata": { "collapsed": false }, "outputs": [], "source": [ "resource_list = '' # list of files included in the datapackage\n", "source_list = '' # list of sources were data comes from\n", "for res_key, df in data_sets.items():\n", " field_list = indexfield # list of of columns in a file, starting with the index field\n", " for col in df.columns: # create \n", " h = {k: v for k, v in zip(headers, col)}\n", " if len(h['region']) > 2:\n", " geo = h['region'] + ' control area'\n", " elif h['region'] == 'NI':\n", " geo = 'Northern Ireland'\n", " elif h['region'] == 'CS':\n", " geo = 'Serbia and Montenegro'\n", " else:\n", " geo = pycountry.countries.get(alpha2=h['region']).name\n", "\n", " descriptions = yaml.load(\n", " descriptions_template.format(tech=h['variable'], geo=geo)\n", " )\n", " h['description'] = descriptions[h['attribute']]\n", " field_list = field_list + field_template.format(**h)\n", " source_list = source_list + source_template.format(**h)\n", " resource_list = resource_list + resource_template.format(res_key=res_key) + field_list\n", "source_list = [dict(tupleized) #remove duplicates from sources_list\n", " for tupleized\n", " in set(tuple(entry.items())\n", " for entry\n", " in yaml.load(source_list)\n", " )\n", " ] \n", "\n", "metadata = yaml.load(metadata_head)\n", "metadata['sources'] = source_list\n", "metadata['resources'] = yaml.load(resource_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Execute this to write the metadata to disk" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false }, "outputs": [], "source": [ "datapackage_json = json.dumps(metadata, indent=2, separators=(',', ': '))\n", "with open('datapackage.json', 'w') as f:\n", " f.write(datapackage_json)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7. Write the data to disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we want to write the data to the output files and save it in the directory of this notebook. First, we prepare different shapes of the dataset." ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "data_sets_singleindex = {}\n", "data_sets_multiindex = {}\n", "data_sets_stacked = {}\n", "for res_key, df in data_sets.items():\n", " if not df.empty:\n", " df_singleindex = df.copy()\n", " # use first 3 levels of multiindex to create singleindex\n", " df_singleindex.columns = ['_'.join(col[0:3])\n", " for col\n", " in df.columns.values\n", " ]\n", " data_sets_singleindex[res_key] = df_singleindex\n", "\n", " data_sets_multiindex[res_key + '_multiindex'] = df\n", "\n", " stacked = df.copy()\n", " stacked.columns = stacked.columns.droplevel(['source', 'web'])\n", " stacked = stacked.transpose().stack(dropna=True).to_frame(name='data')\n", " data_sets_stacked[res_key + '_stacked'] = stacked" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.1 Write to SQL-database" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This file format is required for the filtering function on the OPSD website. This takes about 30 seconds to complete." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%time \n", "for res_key, df in data_sets_singleindex.items():\n", " table = 'timeseries' + res_key\n", " df = df.copy()\n", " df.index = df.index.strftime('%Y-%m-%dT%H:%M:%SZ')\n", " df.to_sql(table, sqlite3.connect('data.sqlite'),\n", " if_exists='replace', index_label='timestamp')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.2 Write to Excel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This takes about 15 Minutes to complete." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%time \n", "for res_key, df in chain(\n", " data_sets_singleindex.items(),\n", " data_sets_multiindex.items()\n", " ):\n", " f = 'timeseries' + res_key\n", " df.to_excel(f+ '.xlsx', float_format='%.2f', merge_cells=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.3 Write to CSV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This takes about 10 minutes to complete." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%time\n", "for res_key, df in chain(\n", " data_sets_singleindex.items(),\n", " data_sets_multiindex.items(),\n", " data_sets_stacked.items()\n", " ):\n", " f = 'timeseries' + res_key\n", " df.to_csv(f + '.csv', float_format='%.2f',\n", " date_format='%Y-%m-%dT%H:%M:%SZ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8. Plausibility checks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "work in progress" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# pv = compact.xs(('solar'), level=('variable'), axis=1, drop_level=False)\n", "# pv.index = pd.MultiIndex.from_arrays([pv.index.date, pv.index.time], names=['date','time'])\n", "# pv" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# pv.groupby(level='time').max()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# pv.unstack().idxmax().to_frame().unstack().transpose()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }