{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"width:100%; background-color: #D9EDF7; border: 1px solid #CFCFCF; text-align: left; padding: 10px;\">\n",
    "      <b>Renewable power plants: Validation and output notebook</b>\n",
    "      <ul>\n",
    "        <li><a href=\"main.ipynb\">Main notebook</a></li>\n",
    "        <li><a href=\"download_and_process.ipynb\">Download and process notebook</a></li>\n",
    "        <li>Validation and output notebook</li>\n",
    "      </ul>\n",
    "      <br>This notebook is part of the <a href=\"http://data.open-power-system-data.org/renewable_power_plants\"> Renewable power plants Data Package</a> of <a href=\"http://open-power-system-data.org\">Open Power System Data</a>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Part 1 of the script (<a href=\"download_and_process.ipynb\">Download and process Notebook</a>) has downloaded and merged the original data. This Notebook subsequently checks, validates the list of renewable power plants and creates CSV/XLSX/SQLite files. It also generates a daily time series of cumulated installed capacities by energy source.\n",
    "\n",
    "*(Before running this script make sure you ran Part 1, so that the renewables.pickle files for each country exist in the same folder as the scripts)*\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Initialization\" data-toc-modified-id=\"Initialization-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Initialization</a></span><ul class=\"toc-item\"><li><span><a href=\"#Script-setup\" data-toc-modified-id=\"Script-setup-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Script setup</a></span></li><li><span><a href=\"#Load-data\" data-toc-modified-id=\"Load-data-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href=\"#Download-coastline-data\" data-toc-modified-id=\"Download-coastline-data-1.3\"><span class=\"toc-item-num\">1.3&nbsp;&nbsp;</span>Download coastline data</a></span><ul class=\"toc-item\"><li><span><a href=\"#Load-the-list-of-sources\" data-toc-modified-id=\"Load-the-list-of-sources-1.3.1\"><span class=\"toc-item-num\">1.3.1&nbsp;&nbsp;</span>Load the list of sources</a></span></li></ul></li></ul></li><li><span><a href=\"#Validation-Markers\" data-toc-modified-id=\"Validation-Markers-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Validation Markers</a></span><ul class=\"toc-item\"><li><span><a href=\"#Germany-DE\" data-toc-modified-id=\"Germany-DE-2.1\"><span class=\"toc-item-num\">2.1&nbsp;&nbsp;</span>Germany DE</a></span></li><li><span><a href=\"#France-FR\" data-toc-modified-id=\"France-FR-2.2\"><span class=\"toc-item-num\">2.2&nbsp;&nbsp;</span>France FR</a></span></li><li><span><a href=\"#United-Kingdom-UK\" data-toc-modified-id=\"United-Kingdom-UK-2.3\"><span class=\"toc-item-num\">2.3&nbsp;&nbsp;</span>United Kingdom UK</a></span></li></ul></li><li><span><a href=\"#Harmonization\" data-toc-modified-id=\"Harmonization-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Harmonization</a></span><ul class=\"toc-item\"><li><span><a href=\"#Harmonizing-column-order\" data-toc-modified-id=\"Harmonizing-column-order-3.1\"><span class=\"toc-item-num\">3.1&nbsp;&nbsp;</span>Harmonizing column 
order</a></span></li><li><span><a href=\"#Cleaning-fields\" data-toc-modified-id=\"Cleaning-fields-3.2\"><span class=\"toc-item-num\">3.2&nbsp;&nbsp;</span>Cleaning fields</a></span></li><li><span><a href=\"#Sort\" data-toc-modified-id=\"Sort-3.3\"><span class=\"toc-item-num\">3.3&nbsp;&nbsp;</span>Sort</a></span></li><li><span><a href=\"#Leave-unspecified-cells-blank\" data-toc-modified-id=\"Leave-unspecified-cells-blank-3.4\"><span class=\"toc-item-num\">3.4&nbsp;&nbsp;</span>Leave unspecified cells blank</a></span></li><li><span><a href=\"#Separate-dirty-from-clean\" data-toc-modified-id=\"Separate-dirty-from-clean-3.5\"><span class=\"toc-item-num\">3.5&nbsp;&nbsp;</span>Separate dirty from clean</a></span></li></ul></li><li><span><a href=\"#Capacity-time-series\" data-toc-modified-id=\"Capacity-time-series-4\"><span class=\"toc-item-num\">4&nbsp;&nbsp;</span>Capacity time series</a></span><ul class=\"toc-item\"><li><span><a href=\"#Make-separate-series-for-Great-Britain-and-Northern-Ireland\" data-toc-modified-id=\"Make-separate-series-for-Great-Britain-and-Northern-Ireland-4.1\"><span class=\"toc-item-num\">4.1&nbsp;&nbsp;</span>Make separate series for Great Britain and Northern Ireland</a></span></li><li><span><a href=\"#Create-total-wind-columns\" data-toc-modified-id=\"Create-total-wind-columns-4.2\"><span class=\"toc-item-num\">4.2&nbsp;&nbsp;</span>Create total wind columns</a></span></li><li><span><a href=\"#Create-one-time-series-file-containing-al-countries\" data-toc-modified-id=\"Create-one-time-series-file-containing-al-countries-4.3\"><span class=\"toc-item-num\">4.3&nbsp;&nbsp;</span>Create one time series file containing al countries</a></span></li></ul></li><li><span><a href=\"#Make-the-normalized-dataframe-for-all-the-countries\" data-toc-modified-id=\"Make-the-normalized-dataframe-for-all-the-countries-5\"><span class=\"toc-item-num\">5&nbsp;&nbsp;</span>Make the normalized dataframe for all the countries</a></span></li><li><span><a 
href=\"#Output\" data-toc-modified-id=\"Output-6\"><span class=\"toc-item-num\">6&nbsp;&nbsp;</span>Output</a></span><ul class=\"toc-item\"><li><span><a href=\"#Write-data-files\" data-toc-modified-id=\"Write-data-files-6.1\"><span class=\"toc-item-num\">6.1&nbsp;&nbsp;</span>Write data files</a></span><ul class=\"toc-item\"><li><span><a href=\"#Write-CSV-files\" data-toc-modified-id=\"Write-CSV-files-6.1.1\"><span class=\"toc-item-num\">6.1.1&nbsp;&nbsp;</span>Write CSV-files</a></span></li><li><span><a href=\"#Write-XLSX-files\" data-toc-modified-id=\"Write-XLSX-files-6.1.2\"><span class=\"toc-item-num\">6.1.2&nbsp;&nbsp;</span>Write XLSX-files</a></span></li><li><span><a href=\"#Write-SQLite\" data-toc-modified-id=\"Write-SQLite-6.1.3\"><span class=\"toc-item-num\">6.1.3&nbsp;&nbsp;</span>Write SQLite</a></span></li></ul></li><li><span><a href=\"#Write-meta-data\" data-toc-modified-id=\"Write-meta-data-6.2\"><span class=\"toc-item-num\">6.2&nbsp;&nbsp;</span>Write meta data</a></span></li><li><span><a href=\"#Generate-checksums\" data-toc-modified-id=\"Generate-checksums-6.3\"><span class=\"toc-item-num\">6.3&nbsp;&nbsp;</span>Generate checksums</a></span></li></ul></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Initialization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "settings = {\n",
    "    'version': '2019-04-05',\n",
    "    'changes': 'Updated all countries with new data available (DE, FR, CH, DK), added data for UK and expanded renewable capacity timeseries to more countries (DK, UK, CH in addition to DE).'\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Script setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import logging\n",
    "import os\n",
    "import urllib.parse\n",
    "import re\n",
    "import zipfile\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import requests\n",
    "import sqlalchemy\n",
    "import yaml\n",
    "import hashlib\n",
    "import os\n",
    "import fiona\n",
    "import cartopy.io.shapereader as shpreader\n",
    "import shapely.geometry as sgeom\n",
    "from shapely.prepared import prep\n",
    "from shapely.ops import unary_union\n",
    "import fake_useragent\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "# Option to make pandas display 40 columns max per dataframe (default is 20)\n",
    "pd.options.display.max_columns = 40\n",
    "\n",
    "# Create input and output folders if they don't exist\n",
    "os.makedirs(os.path.join('input', 'original_data'), exist_ok=True)\n",
    "\n",
    "os.makedirs('output', exist_ok=True)\n",
    "os.makedirs(os.path.join('output', 'renewable_power_plants'), exist_ok=True)\n",
    "package_path = os.path.join('output', 'renewable_power_plants',settings['version'])\n",
    "os.makedirs(package_path, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries = set(['DE', 'DK','FR','PL','CH', 'UK'])\n",
    "countries_non_DE = countries - set(['DE'])\n",
    "countries_dirty = set(['DE_outvalidated_plants', 'FR_overseas_territories'])\n",
    "countries_including_dirty = countries | countries_dirty\n",
    "\n",
    "# Read data from script Part 1 download_and_process\n",
    "dfs = {}\n",
    "for country in countries:\n",
    "    dfs[country] = pd.read_pickle('intermediate/'+country+'_renewables.pickle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download coastline data\n",
    "\n",
    "The coastline shapefile is needed to check if the geocoordinates of the land powerplants point to a land location, and conversely, if the geocoordinates of the onshore facilities point to a location not on land."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coastline_url = 'https://www.ngdc.noaa.gov/mgg/shorelines/data/gshhg/latest/gshhg-shp-2.3.7.zip'\n",
    "\n",
    "user_agent = fake_useragent.UserAgent()\n",
    "\n",
    "directory_path = os.path.join('input', 'maps', 'coastline')\n",
    "os.makedirs(directory_path, exist_ok=True)\n",
    "filepath = os.path.join(directory_path, 'gshhg-shp-2.3.7.zip')\n",
    "\n",
    "# check if the file exists; if not, download it\n",
    "if not os.path.exists(filepath):\n",
    "    session = requests.session()\n",
    "    print(coastline_url)\n",
    "    print('Downloading...')\n",
    "    headers = {'User-Agent' : user_agent.random}\n",
    "    r = session.get(coastline_url, headers=headers, stream=True)\n",
    "    total_size = r.headers.get('content-length')\n",
    "    total_size = int(total_size)\n",
    "    chuncksize = 4096\n",
    "    with open(filepath, 'wb') as file:\n",
    "        downloaded = 0\n",
    "        for chunck in r.iter_content(chuncksize):\n",
    "            file.write(chunck)\n",
    "            downloaded += chuncksize\n",
    "            print('\\rProgress: {:.2f}%'.format(100 * downloaded / float(total_size)), end='')\n",
    "    print(' Done.')\n",
    "    zip_ref = zipfile.ZipFile(filepath, 'r')\n",
    "    zip_ref.extractall(directory_path)\n",
    "    zip_ref.close()\n",
    "else:\n",
    "    print('The file is already there:', filepath)\n",
    "    filepath = '' + filepath\n",
    "\n",
    "coastline_shapefile_path = os.path.join('input', 'maps', 'coastline', 'GSHHS_shp', 'f', 'GSHHS_f_L1.shp')\n",
    "print(\"Shapefile path: \", coastline_shapefile_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the list of sources"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "source_df = pd.read_csv(os.path.join('input', 'sources.csv'))\n",
    "source_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Validation Markers\n",
    "\n",
    "This section checks the DataFrame for a set of pre-defined criteria and adds markers to the entries in an additional column. The marked data will be included in the output files, but marked, so that they can be easiliy filtered out. For creating the validation plots and the time series, suspect data is skipped."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "validation_marker = {}\n",
    "mark_rows = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Germany DE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Add marker to data according to criteria (see validation_marker above)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# It seems that some DE.commissioning_date values are integers, which causes\n",
    "# the parts of the code dealing with dates to break.\n",
    "integer_dates_mask = dfs['DE'].apply(lambda row: type(row['commissioning_date']) is int, axis=1).values\n",
    "\n",
    "print(\"Integer dates\")\n",
    "display(dfs['DE'][integer_dates_mask])\n",
    "\n",
    "dfs['DE']=dfs['DE'][~integer_dates_mask]\n",
    "dfs['DE'].reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "key = 'R_1'\n",
    "cutoff_date_bnetza = '2017-12-31'\n",
    "cutoff_date_bnetza = pd.Timestamp(2017, 12, 31)\n",
    "\n",
    "mark_rows[key] = (dfs['DE']['commissioning_date'] <= cutoff_date_bnetza) &\\\n",
    "                 (dfs['DE']['data_source'].isin(['BNetzA', 'BNetzA_PV', 'BNetzA_PV_historic']))\n",
    "\n",
    "validation_marker[key] = {\n",
    "    \"Short explanation\": \"data_source = BNetzA and commissioning_date < \" + str(cutoff_date_bnetza.date()),\n",
    "    \"Long explanation\": \"This powerplant is probably also represented by an entry from the TSO data and should therefore be filtered out.\"\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "key = 'R_2'\n",
    "mark_rows[key] = ((dfs['DE']['notification_reason'] != 'Inbetriebnahme') &\n",
    "                 (dfs['DE']['data_source'] == 'BNetzA'))\n",
    "validation_marker[key] = {\n",
    "    \"Short explanation\": \"notification_reason other than commissioning (Inbetriebnahme)\",\n",
    "    \"Long explanation\": \"This powerplant is probably represented by an earlier entry already (possibly also from the TSO data) and should therefore be filtered out.\"\n",
    "}\n",
    "\n",
    "key = 'R_3'\n",
    "mark_rows[key] = (dfs['DE']['commissioning_date'].isnull())\n",
    "validation_marker[key] = {\n",
    "    \"Short explanation\": \"commissioning_date not specified\",\n",
    "    \"Long explanation\": \"\"\n",
    "}\n",
    "\n",
    "key = 'R_4'\n",
    "mark_rows[key] = dfs['DE'].electrical_capacity <= 0.0\n",
    "validation_marker[key] = {\n",
    "    \"Short explanation\": \"electrical_capacity not specified\",\n",
    "    \"Long explanation\": \"\"\n",
    "}\n",
    "\n",
    "key = 'R_5'\n",
    "mark_rows[key] = dfs['DE']['grid_decommissioning_date'].isnull() == False # Just the entry which is not double should be kept, thus the other one is marked\n",
    "validation_marker[key] = {\n",
    "    \"Short explanation\": \"decommissioned from the grid\",\n",
    "    \"Long explanation\": \"This powerplant is probably commissioned again to the grid of another grid operator and therefore this doubled entry should be filtered out.\"\n",
    "}\n",
    "\n",
    "key = 'R_6'\n",
    "mark_rows[key] = dfs['DE']['decommissioning_date'].isnull() == False\n",
    "validation_marker[key] = {\n",
    "    \"Short explanation\": \"decommissioned\",\n",
    "    \"Long explanation\": \"This powerplant is completely decommissioned.\"\n",
    "}\n",
    "\n",
    "key = 'R_8' # note that we skip R7 here as R7 is used for frech oversees power plants below (we never change meanings of R markers, so R7 stays reserved for that)\n",
    "mark_rows[key] = (dfs['DE'].duplicated(['eeg_id'],keep='first') # note that this depends on BNetzA items to be last in list, because we want to keep the TSO items\n",
    "                  & (dfs['DE']['eeg_id'].isnull() == False)) \n",
    "validation_marker[key] = {\n",
    "    \"Short explanation\": \"duplicate_eeg_id\",\n",
    "    \"Long explanation\": \"This power plant is twice in the data (e.g. through BNetzA and TSOs).\"\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dfs['DE']['comment'] = ''\n",
    "for key, rows_to_mark in mark_rows.items():\n",
    "    dfs['DE'].loc[rows_to_mark, 'comment'] += key+\"|\"\n",
    "\n",
    "del mark_rows, key, rows_to_mark # free variables no longer needed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Summarize capacity of suspect data by data_source\n",
    "display(dfs['DE'].groupby(['comment', 'data_source'])['electrical_capacity'].sum().to_frame())\n",
    "\n",
    "# Summarize capacity of suspect data by energy source\n",
    "dfs['DE'].groupby(['comment', 'energy_source_level_2'])['electrical_capacity'].sum().to_frame()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Create cleaned DataFrame**\n",
    "\n",
    "All marked entries are deleted for the cleaned version of the DataFrame that is utilized for creating time series of installation and for the validation plots."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## France FR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create empty marker column\n",
    "dfs['FR']['comment'] = \"\"\n",
    "\n",
    "key = 'R_7'\n",
    "mark_rows_FR_not_in_Europe = dfs['FR'][((dfs['FR']['lat'] < 41) |\n",
    "                       (dfs['FR']['lon'] < -6) |\n",
    "                       (dfs['FR']['lon'] > 10))].index\n",
    "validation_marker[key] = {\n",
    "    \"Short explanation\": \"not connected to the European grid\",\n",
    "    \"Long explanation\": \"This powerplant is located in regions belonging to France but not located in Europe (e.g. Guadeloupe).\"\n",
    "}\n",
    "\n",
    "dfs['FR'].loc[mark_rows_FR_not_in_Europe, 'comment'] += key+\"|\"\n",
    "\n",
    "del mark_rows_FR_not_in_Europe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## United Kingdom UK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an empty marker column\n",
    "dfs['UK']['comment'] = \"\"\n",
    "\n",
    "# Create a function to check if an offshore powerplant is not wind\n",
    "geoms = fiona.open(coastline_shapefile_path)\n",
    "land_geom = sgeom.MultiPolygon([sgeom.shape(geom['geometry']) for geom in geoms])\n",
    "land = prep(land_geom)\n",
    "\n",
    "def not_on_land_but_should_be(powerplant_data):\n",
    "    longitude = powerplant_data['lon']\n",
    "    latitude = powerplant_data['lat']\n",
    "    if pd.isnull(longitude) or pd.isnull(latitude):\n",
    "        return False\n",
    "    not_on_land = not land.contains(sgeom.Point(longitude, latitude))\n",
    "    offshore_ok =  'Offshore' in [powerplant_data['region'], powerplant_data['municipality']] or \\\n",
    "                  (powerplant_data['energy_source_level_2'] in ['Wind', 'Marine'])\n",
    "    return not_on_land and not offshore_ok\n",
    "    \n",
    "\n",
    "key = 'R_9'\n",
    "validation_marker[key] = {\n",
    "    \"Short explanation\": \"Not on land, but should be.\",\n",
    "    \"Long explanation\": \"The geocoordinates of this powerplant indicate that it is not on the UK mainland, but the facility is not an offshore wind farm.\"\n",
    "}\n",
    "\n",
    "mark_rows_UK_not_on_land = dfs['UK'].apply(lambda row: not_on_land_but_should_be(row), axis=1)\n",
    "dfs['UK'].loc[mark_rows_UK_not_on_land, 'comment'] += key+\"|\"\n",
    "\n",
    "del mark_rows_UK_not_on_land"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Harmonization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Harmonizing column order"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "field_lists = {\n",
    "    'DE': ['commissioning_date', 'decommissioning_date', 'energy_source_level_1', 'energy_source_level_2','energy_source_level_3', 'technology',\n",
    "           'electrical_capacity', 'voltage_level', 'tso', 'dso', 'dso_id', 'eeg_id',\n",
    "           'federal_state', 'postcode', 'municipality_code', 'municipality', 'address',\n",
    "           'lat', 'lon', 'data_source', 'comment'],\n",
    "    'DK': ['commissioning_date', 'energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology', 'electrical_capacity',\n",
    "           'dso', 'gsrn_id', 'postcode', 'municipality_code', 'municipality', 'address', \n",
    "           'lat', 'lon', 'hub_height', 'rotor_diameter', 'manufacturer', 'model', 'data_source'],\n",
    "    'FR': ['municipality_code', 'municipality', 'energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', \n",
    "           'technology', 'electrical_capacity', 'number_of_installations', 'lat', 'lon', 'data_source', 'as_of_year', 'comment'],\n",
    "    'PL': ['district', 'energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology',\n",
    "           'electrical_capacity', 'number_of_installations', 'data_source', 'as_of_year'],\n",
    "    'CH': ['commissioning_date', 'municipality', 'energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3',\n",
    "           'technology','electrical_capacity', 'municipality_code', 'project_name', 'production', 'tariff',\n",
    "           'contract_period_end', 'street', 'canton', 'company', 'lat', 'lon', 'data_source'],\n",
    "    'UK': ['commissioning_date', 'uk_beis_id', 'site_name', 'operator',\n",
    "           'energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology',\n",
    "           'electrical_capacity', 'chp', 'capacity_individual_turbine',\n",
    "           'number_of_turbines', 'solar_mounting_type', 'address', 'municipality', 'region', 'country', 'postcode',\n",
    "           'lat', 'lon', 'data_source', 'comment']\n",
    "}\n",
    "\n",
    "for country in field_lists:\n",
    "    for field in field_lists[country]:\n",
    "        if field not in dfs[country].columns:\n",
    "            print(country, field)\n",
    "    dfs[country] = dfs[country].loc[:, field_lists[country]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleaning fields\n",
    "\n",
    "Five digits behind the decimal point for decimal fields. Dates should be without timestamp."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dfs['DE']['address'][~dfs['DE']['address'].isnull()]\n",
    "dfs['DK'].columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cleaning_specs = {\n",
    "    'decimal' : {\n",
    "        'DE': ['electrical_capacity','lat','lon'],\n",
    "        'DK': ['electrical_capacity','lat','lon'],\n",
    "        'CH': ['electrical_capacity','lat','lon'],\n",
    "        'FR': ['electrical_capacity','lat','lon'],\n",
    "        'PL': ['electrical_capacity'],\n",
    "        'UK': ['electrical_capacity', 'lat', 'lon']\n",
    "    },\n",
    "    'integer': {\n",
    "        'DE': ['municipality_code'],\n",
    "        'UK': ['uk_beis_id']\n",
    "    },\n",
    "    'date': {\n",
    "        'DE': ['commissioning_date', 'decommissioning_date'],\n",
    "        'DK': ['commissioning_date'],\n",
    "        'CH': ['commissioning_date'],\n",
    "        'UK': ['commissioning_date']\n",
    "    },\n",
    "    'one-line string': {\n",
    "        'DE' : ['federal_state', 'municipality', 'address'],\n",
    "        'DK' : ['municipality', 'address', 'manufacturer', 'model'],\n",
    "        'FR' : ['municipality'],\n",
    "        'PL' : ['district'],\n",
    "        'CH' : ['municipality', 'project_name', 'canton', 'street', 'company'],\n",
    "        'UK' : ['address', 'municipality', 'site_name', 'region']\n",
    "    }\n",
    "}\n",
    "\n",
    "def to_1_line(string):\n",
    "    if pd.isnull(string) or not isinstance(string, str):\n",
    "        return string\n",
    "    return string.replace('\\r', '').replace('\\n', '')\n",
    "\n",
    "for cleaning_type, cleaning_spec in cleaning_specs.items():\n",
    "    for country, fields in cleaning_spec.items():\n",
    "        for field in fields:\n",
    "            print('Cleaning ' + country + '.' + field +' to ' + cleaning_type + '.')\n",
    "            if cleaning_type == 'decimal':\n",
    "                dfs[country][field] = dfs[country][field].map(lambda x: round(x, 8))\n",
    "            elif cleaning_type == 'integer':\n",
    "                dfs[country][field] = pd.to_numeric(dfs[country][field], errors='coerce')\n",
    "                dfs[country][field] = dfs[country][field].map(lambda x: '%.0f' % x)  \n",
    "            elif cleaning_type == 'date':\n",
    "                dfs[country][field] = dfs[country][field].apply(lambda x: x.date())\n",
    "            elif cleaning_type == 'one-line string':\n",
    "                dfs[country][field] = dfs[country][field].apply(lambda x: to_1_line(x))\n",
    "\n",
    "print('Done!')\n",
    "\n",
    "del cleaning_specs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sort"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sort_by = {\n",
    "    'DE': 'commissioning_date',\n",
    "    'DK': 'commissioning_date',\n",
    "    'CH': 'commissioning_date',\n",
    "    'FR': 'municipality_code',\n",
    "    'PL': 'district',\n",
    "    'UK': 'commissioning_date'\n",
    "}\n",
    "\n",
    "for country, sort_by in sort_by.items():\n",
    "    print('Sorting', country)\n",
    "    try:\n",
    "        dfs[country] = dfs[country].iloc[dfs[country][sort_by].sort_values().index]\n",
    "    except Exception as e:\n",
    "        print('\\tException:',e)\n",
    "print('Done!')\n",
    "del sort_by"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Leave unspecified cells blank\n",
    "\n",
    "This step may take some time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for country in countries:\n",
    "    print(country)\n",
    "    dfs[country].fillna('', inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Separate dirty from clean\n",
    "\n",
    "We separate all plants which have a validation marker in the comments column into a separate DataFrame and eventually also in a separate CSV file, so the main country files only contain \"clean\" plants, i.e. those without any special comment. This is useful since all our comments denote that most people would probably not like to include them in their calculations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "dirty_keys = {\n",
    "    'DE' : 'DE_outvalidated_plants',\n",
    "    'FR' : 'FR_overseas_territories',\n",
    "} \n",
    "\n",
    "for country in dirty_keys.keys():\n",
    "    print(country)\n",
    "    idx_dirty = dfs[country][dfs[country].comment.str.len() > 1].index\n",
    "    dirty_key = dirty_keys[country]\n",
    "    dfs[dirty_key] = dfs[country].loc[idx_dirty]\n",
    "    dfs[country] = dfs[country].drop(idx_dirty)\n",
    "\n",
    "del idx_dirty, dirty_key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Capacity time series\n",
    "\n",
    "This section creates a daily and yearly time series of the cumulated installed capacity by energy source for the United Kingdom, Germany, Denmark, and Switzerland. Three time series will be created for the UK: one for the whole country (GB-UKM), one for Northern Ireland (GB-NIR), and one for the Great Britain (GB-GBN). This data will be part of the output and will be compared in a plot for validation in the next section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily_timeseries = {}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def to_new_level(row):\n",
    "    if(row['energy_source_level_2'] == 'Wind'):\n",
    "        energy_type_label = (row['energy_source_level_2']+'_'+row['technology']).lower()\n",
    "    else:\n",
    "        energy_type_label = row['energy_source_level_2'].lower()\n",
    "\n",
    "    return energy_type_label\n",
    "\n",
    "def to_daily_timeseries(df, start_date, end_date):  \n",
    "    # Combine energy levels to new standardized values\n",
    "    df['energy_type'] = df[['energy_source_level_2', 'energy_source_level_3', 'technology']].apply(to_new_level, axis=1)\n",
    "    \n",
    "    # Set range of time series as index\n",
    "    daily_timeseries = pd.DataFrame(index=pd.date_range(start=start_date, end=end_date, freq='D'))\n",
    "    \n",
    "    # Create cumulated time series per energy source for both yearly and daily time series\n",
    "    for energy_type in df['energy_type'].unique():\n",
    "        temp = (df[['commissioning_date', 'electrical_capacity']]\n",
    "            .loc[df['energy_type'] == energy_type])\n",
    "        temp_timeseries = temp.set_index('commissioning_date')\n",
    "        temp_timeseries.index = pd.DatetimeIndex(temp_timeseries.index)\n",
    "\n",
    "        # Create cumulated time series per energy_source and day\n",
    "        daily_timeseries[energy_type] = temp_timeseries.resample('D').sum().cumsum().fillna(method='ffill')\n",
    "        \n",
    "        # Make sure that the columns are properly filled\n",
    "        daily_timeseries[energy_type]= daily_timeseries[energy_type].fillna(method='ffill').fillna(value=0)\n",
    "    \n",
    "    # Reset the time index\n",
    "    daily_timeseries.reset_index(inplace=True)\n",
    "\n",
    "    # Set the index name\n",
    "    daily_timeseries.rename(columns={'index': 'day'}, inplace=True)\n",
    "    \n",
    "    # Drop the temporary column \"energy_type\"\n",
    "    df.drop('energy_type', axis=1, inplace=True)\n",
    "    return daily_timeseries\n",
    "\n",
    "eligible_for_timeseries = [country for country in countries if 'commissioning_date' in dfs[country].columns]\n",
    "#eligible_for_timeseries = ['CH', 'DK', 'UK', 'DE']\n",
    "possible_start_dates = [dfs[country]['commissioning_date'].min() for country in eligible_for_timeseries]\n",
    "possible_end_dates = [dfs[country]['commissioning_date'].max() for country in eligible_for_timeseries]\n",
    "\n",
    "#print(\"Possible start and end dates:\")\n",
    "#for country in eligible_for_timeseries:\n",
    "#    print(country, dfs[country]['commissioning_date'].min(), dfs[country]['commissioning_date'].max())\n",
    "\n",
    "start_date = min(possible_start_dates)\n",
    "end_date = max(possible_end_dates)\n",
    "\n",
    "for country in eligible_for_timeseries:\n",
    "    print(\"Timeseries for\", country)\n",
    "    try:\n",
    "        daily_timeseries[country] = to_daily_timeseries(dfs[country], start_date, end_date)\n",
    "        print('\\t Done!')\n",
    "    except Exception as e:\n",
    "        print('\\t', e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Make separate series for Great Britain and Northern Ireland"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the mask for Northern Ireland\n",
    "ni_mask = dfs['UK']['country'] == 'Northern Ireland'\n",
    "\n",
    "# Split the UK data\n",
    "ni_df = dfs['UK'][ni_mask].copy()\n",
    "gb_df = dfs['UK'][~ni_mask].copy()\n",
    "\n",
    "# Make the timeseries for Northern Ireland\n",
    "daily_timeseries['GB-NIR'] = to_daily_timeseries(ni_df, start_date, end_date)\n",
    "\n",
    "# Make the timeseries for Great Briatin (England, Wales, Scotland)\n",
    "daily_timeseries['GB-GBN'] = to_daily_timeseries(gb_df, start_date, end_date)\n",
    "\n",
    "# Renaming the entry for UK to conform to the ISO codes\n",
    "daily_timeseries['GB-UKM'] = daily_timeseries.pop('UK')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create total wind columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create column \"wind\" as sum of onshore and offshore\n",
    "for country in daily_timeseries:\n",
    "    onshore = 'wind_onshore' in daily_timeseries[country].columns\n",
    "    offshore = 'wind_offshore' in daily_timeseries[country].columns\n",
    "    if onshore and offshore:\n",
    "        daily_timeseries[country]['wind'] = daily_timeseries[country]['wind_onshore'] + daily_timeseries[country]['wind_offshore']\n",
    "    elif onshore and not offshore:\n",
    "        daily_timeseries[country]['wind'] = daily_timeseries[country]['wind_onshore']\n",
    "    elif (not onshore) and offshore:\n",
    "        daily_timeseries[country]['wind'] = daily_timeseries[country]['wind_offshore']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create one time series file containing al countries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unified_daily_timeseries = pd.DataFrame(index=pd.date_range(start=start_date, end=end_date, freq='D'))\n",
    "# Append the country name to capacity columns' names\n",
    "for c in daily_timeseries:\n",
    "    new_columns = [c + \"_\" + col + \"_capacity\" if col != 'day' else 'day' for col in daily_timeseries[c].columns]\n",
    "    daily_timeseries[c].columns = new_columns\n",
    "    \n",
    "# Unify separate series\n",
    "unified_daily_timeseries = pd.concat(daily_timeseries.values(), axis=1, sort=False)\n",
    "\n",
    "# Make sure the day column appears only one\n",
    "days = unified_daily_timeseries['day']\n",
    "unified_daily_timeseries.drop('day', axis=1, inplace=True)\n",
    "unified_daily_timeseries['day'] = days.iloc[:, 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sort columns alphabetically\n",
    "unified_daily_timeseries = unified_daily_timeseries.reindex(sorted(unified_daily_timeseries.columns), axis=1)\n",
    "unified_daily_timeseries = unified_daily_timeseries.set_index('day').reset_index() # move day column to first position\n",
    "\n",
    "# drop column DE_Other fossil fuels (we don't want fossil fuels in here as they don't belong into renewables)\n",
    "# and hydro is not all of hydro but only subsidised hydro, which could be misleading\n",
    "unified_daily_timeseries.drop(columns='DE_other fossil fuels_capacity', inplace=True)\n",
    "unified_daily_timeseries.drop(columns='DE_hydro_capacity', inplace=True)\n",
    "\n",
    "# Show some rows\n",
    "unified_daily_timeseries.tail(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Make the normalized dataframe for all the countries\n",
    "\n",
    "Here, we create a dataframe containing the following data for all the countries:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "geographical_resolution = {\n",
    "    'PL' : 'district',\n",
    "    'FR' : 'municipality',\n",
    "    'CH' : 'municipality',\n",
    "    'DE' : 'power plant',\n",
    "    'DK' : 'power plant',\n",
    "    'UK' : 'power plant'\n",
    "}\n",
    "\n",
    "dfs_to_concat = []\n",
    "\n",
    "columns = ['energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'electrical_capacity',\n",
    "           'data_source', 'municipality', 'lon', 'lat', 'commissioning_date', 'geographical_resolution', \n",
    "           'as_of_year'\n",
    "          ]\n",
    "\n",
    "for country in countries:\n",
    "    country_df = dfs[country].loc[:, columns].copy()\n",
    "    country_df['country'] = country\n",
    "    country_df['geographical_resolution'] = geographical_resolution[country]\n",
    "    if country == 'PL':\n",
    "        country_df['as_of_year'] = 2016\n",
    "    elif country == 'FR':\n",
    "        country_df['as_of_year'] = 2017\n",
    "    dfs_to_concat.append(country_df)\n",
    "\n",
    "european_df = pd.concat(dfs_to_concat)\n",
    "european_df.reset_index(inplace=True, drop=True)\n",
    "\n",
    "european_df.sample(n=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Output\n",
    "This section finally writes the Data Package:\n",
    "* CSV + XLSX + SQLite\n",
    "* Meta data (JSON)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs(package_path, exist_ok=True)\n",
    "\n",
    "# Make sure the daily timeseries has only the date part, not the full datetime with time information\n",
    "unified_daily_timeseries['day'] = unified_daily_timeseries['day'].dt.date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Write data files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Write CSV-files\n",
    "\n",
    "One csv-file for each country. This process will take some time depending on you hardware."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write each country's dataset as a separate csv file\n",
    "table_names = {}\n",
    "for country in countries_including_dirty:\n",
    "    print(country)\n",
    "    table_names[country] = 'renewable_power_plants_'+country if country not in countries_dirty else 'res_plants_separated_'+country\n",
    "    dfs[country].to_csv(os.path.join(package_path, table_names[country]+'.csv'),\n",
    "            sep=',',\n",
    "            decimal='.',\n",
    "            date_format='%Y-%m-%d',\n",
    "            line_terminator='\\n',\n",
    "            encoding='utf-8',\n",
    "            index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write daily cumulated time series as csv\n",
    "unified_daily_timeseries.to_csv(os.path.join(package_path, 'renewable_capacity_timeseries.csv'),\n",
    "        sep=',',\n",
    "        float_format='%.3f',\n",
    "        decimal='.',\n",
    "        date_format='%Y-%m-%d',\n",
    "        encoding='utf-8',\n",
    "        index=False)\n",
    "print('Done!')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "european_df.to_csv(os.path.join(package_path, 'renewable_power_plants_EU.csv'),\n",
    "            sep=',',\n",
    "            decimal='.',\n",
    "            date_format='%Y-%m-%d',\n",
    "            line_terminator='\\n',\n",
    "            encoding='utf-8',\n",
    "            index=False)\n",
    "print('Done!')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write csv of Marker Explanations\n",
    "validation_marker_df = pd.DataFrame(validation_marker).transpose()\n",
    "validation_marker_df = validation_marker_df.iloc[:, ::-1] # Reverse column order\n",
    "validation_marker_df.index.name = 'Validation marker'\n",
    "validation_marker_df.reset_index(inplace=True)\n",
    "validation_marker_df.to_csv(os.path.join(package_path, 'validation_marker.csv'), \n",
    "        sep=',',\n",
    "        decimal='.',\n",
    "        date_format='%Y-%m-%d',\n",
    "        line_terminator='\\n',\n",
    "        encoding='utf-8',\n",
    "        index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Write XLSX-files\n",
    "\n",
    "This process will take some time depending on your hardware.\n",
    "\n",
    "All country power plant list will be written in one xlsx-file. Each country power plant list is written in a separate sheet. As the German power plant list has too many entries for one sheet, it will be split in two. An additional sheet includes the explanations of the marker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write the results as xlsx file\n",
    "%time writer = pd.ExcelWriter(os.path.join(package_path, 'renewable_power_plants.xlsx'), engine='xlsxwriter', date_format='yyyy-mm-dd')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Writing DE part 1')\n",
    "%time dfs['DE'][:1000000].to_excel(writer, index=False, sheet_name='DE part-1')\n",
    "\n",
    "print('Writing DE part 2')\n",
    "%time dfs['DE'][1000000:].to_excel(writer, index=False, sheet_name='DE part-2')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "display(dfs.keys())\n",
    "\n",
    "for country in (countries_non_DE | countries_dirty):\n",
    "    print('Writing ' + country)\n",
    "    %time dfs[country].to_excel(writer, index=False, sheet_name=country)\n",
    "    \n",
    "print('Writing validation marker sheet')\n",
    "%time validation_marker_df.to_excel(writer, index=False, sheet_name='validation_marker')\n",
    "\n",
    "# Save timeseries as Excel\n",
    "%time unified_daily_timeseries.to_excel(writer, index=False, sheet_name='capacity_timeseries')\n",
    "\n",
    "print('Saving...')\n",
    "%time writer.save()\n",
    "print('...done!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Write SQLite"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Some date columns are giving the engine some trouble, therefore cast to string: \n",
    "#if 'DE' in dfs:\n",
    "#    dfs['DE'].decommissioning_date = dfs['DE'].decommissioning_date.astype(str)\n",
    "#    dfs['DE'].commissioning_date = dfs['DE'].commissioning_date.astype(str)\n",
    "#    dfs['DE_outvalidated_plants'].commissioning_date = dfs['DE_outvalidated_plants'].commissioning_date.astype(str)\n",
    "\n",
    "# Using chunksize parameter is for lower\n",
    "# memory computers. Removing it might speed things up.\n",
    "engine = sqlalchemy.create_engine('sqlite:///' + package_path + '/renewable_power_plants.sqlite')  \n",
    "\n",
    "for country in countries_including_dirty:\n",
    "    if country=='DE_outvalidated_plants':\n",
    "        continue # The DE_outvalidated_plants file gives a strange error message. Therefore do not put it into SQLite.\n",
    "    %time dfs[country].to_sql(table_names[country], engine, if_exists=\"replace\", chunksize=100000, index=False)\n",
    "\n",
    "validation_marker_df.to_sql('validation_marker', engine, if_exists=\"replace\", chunksize=100000, index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save timeseries as sqlite\n",
    "%time european_df.to_sql('renewable_power_plants_EU', engine, if_exists=\"replace\", chunksize=100000, index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save timeseries as sqlite\n",
    "%time unified_daily_timeseries.to_sql('renewable_capacity_timeseries', engine, if_exists=\"replace\", chunksize=100000, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Write meta data\n",
    "\n",
    "The Data Packages meta data are created in the specific JSON format as proposed by the Open Knowledge Foundation. Please see the Frictionless Data project by OKFN (http://data.okfn.org/) and the Data Package specifications (http://dataprotocols.org/data-packages/) for more details.\n",
    "\n",
    "In order to keep the Jupyter Notebook more readable the metadata is written in the human-readable YAML format using a multi-line string and then parse the string into a Python dictionary and save it as a JSON file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metadata = \"\"\"\n",
    "hide: yes\n",
    "profile: tabular-data-package\n",
    "_metadataVersion: 1.2\n",
    "name: opsd_renewable_power_plants\n",
    "title: Renewable power plants\n",
    "description: List of renewable energy power stations\n",
    "longDescription: >-\n",
    "    This Data Package contains a list of renewable energy power plants in lists of \n",
    "    renewable energy-based power plants of Germany, Denmark, France, Switzerland, the\n",
    "    United Kingdom and Poland. \n",
    "    Germany: More than 1.7 million renewable power plant entries, eligible under the \n",
    "    renewable support scheme (EEG). \n",
    "    Denmark: Wind and phovoltaic power plants with a high level of detail. \n",
    "    France: Aggregated capacity and number of installations per energy source per \n",
    "    municipality (Commune). \n",
    "    Poland: Summed capacity and number of installations per energy source \n",
    "    per municipality (Powiat).\n",
    "    Switzerland: Renewable power plants eligible under the Swiss feed in tariff KEV \n",
    "    (Kostendeckende Einspeisevergütung).\n",
    "    United Kingdom: Renewable power plants in the United Kingdom.\n",
    "    Due to different data availability, the power plant lists are of different \n",
    "    accurancy and partly provide different power plant parameter. Due to that, the \n",
    "    lists are provided as seperate csv-files per country and as separate sheets in the\n",
    "    excel file. Suspect data or entries with high probability of duplication are marked\n",
    "    in the column 'comment'. Theses validation markers are explained in the file\n",
    "    validation_marker.csv.\n",
    "    Additionally, the Data Package includes daily time series of cumulated\n",
    "    installed capacity per energy source type for Germany, Denmark, Switzerland and the United Kingdom. All data processing is \n",
    "    conducted in Python and pandas and has been documented in the Jupyter Notebooks linked below. \n",
    "keywords: [master data register,power plants,renewables,germany,denmark,france,poland,switzerland,united kingdom,open power system data]\n",
    "spatial: \n",
    "    location: Germany, Denmark, France, Poland, Switzerland, United Kingdom\n",
    "    resolution: Power plants, municipalities\n",
    "resources:\n",
    "    - path: renewable_power_plants_DE.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      missingValue: \"\"\n",
    "      schema:\n",
    "          fields:\n",
    "            - name: commissioning_date\n",
    "              type: date\n",
    "              format: YYYY-MM-DD\n",
    "              description: Date of commissioning of specific unit\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: decommissioning_date\n",
    "              type: date\n",
    "              format: YYYY-MM-DD\n",
    "              description: Date of decommissioning of specific unit\n",
    "            - name: energy_source_level_1\n",
    "              description: Type of energy source (e.g. Renewable energy)\n",
    "              type: string\n",
    "            - name: energy_source_level_2\n",
    "              description: Type of energy source (e.g. Wind, Solar)\n",
    "              type: string\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: energy_source_level_3\n",
    "              description: Subtype of energy source (e.g. Biomass and biogas)\n",
    "              type: string\n",
    "            - name: technology\n",
    "              description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)\n",
    "              type: string\n",
    "            - name: electrical_capacity\n",
    "              unit: MW\n",
    "              description: Installed electrical capacity in MW\n",
    "              type: number\n",
    "              unit: MW\n",
    "            - name: voltage_level\n",
    "              description: Voltage level of grid connection\n",
    "              type: string\n",
    "            - name: tso\n",
    "              description: Name of transmission system operator of the area the plant is located\n",
    "              type: string\n",
    "            - name: dso\n",
    "              description: Name of distribution system operator of the region the plant is located in\n",
    "              type: string\n",
    "            - name: dso_id\n",
    "              description: Company number of German distribution grid operator\n",
    "              type: string\n",
    "            - name: eeg_id\n",
    "              description: Power plant EEG (German feed-in tariff law) remuneration number\n",
    "              type: string\n",
    "            - name: federal_state\n",
    "              description: Name of German administrative level 'Bundesland'\n",
    "              type: string\n",
    "            - name: postcode\n",
    "              description: German zip-code\n",
    "              type: string\n",
    "            - name: municipality_code\n",
    "              description: German Gemeindenummer (municipalitiy number)\n",
    "              type: string\n",
    "            - name: municipality\n",
    "              description: Name of German Gemeinde (municipality)\n",
    "              type: string\n",
    "            - name: address\n",
    "              description: Street name or name of land parcel\n",
    "              type: string\n",
    "            - name: lat\n",
    "              description: Latitude coordinates\n",
    "              type: geopoint\n",
    "            - name: lon\n",
    "              description: Longitude coordinates \n",
    "              type: geopoint\n",
    "            - name: data_source\n",
    "              description: Source of database entry\n",
    "              type: string\n",
    "            - name: comment\n",
    "              description: Shortcodes for comments related to this entry, explanation can be looked up in validation_marker.csv\n",
    "              type: string\n",
    "    - path: renewable_power_plants_DK.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      missingValue: \"\"\n",
    "      schema:\n",
    "          fields:\n",
    "            - name: commissioning_date\n",
    "              type: date\n",
    "              format: YYYY-MM-DD\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: energy_source_level_1\n",
    "              description: Type of energy source (e.g. Renewable energy)\n",
    "              type: string\n",
    "            - name: energy_source_level_2\n",
    "              description: Type of energy source (e.g. Wind, Solar)\n",
    "              type: string\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: technology\n",
    "              description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)\n",
    "              type: string\n",
    "            - name: electrical_capacity\n",
    "              unit: MW\n",
    "              description: Installed electrical capacity in MW\n",
    "              type: number\n",
    "            - name: dso\n",
    "              description: Name of distribution system operator of the region the plant is located in\n",
    "              type: string\n",
    "            - name: gsrn_id\n",
    "              description: Danish wind turbine identifier number (GSRN)\n",
    "              type: integer\n",
    "            - name: postcode\n",
    "              description: Danish zip-code\n",
    "              type: string\n",
    "            - name: municipality_code\n",
    "              description: Danish 3-digit Kommune-Nr\n",
    "              type: string\n",
    "            - name: municipality\n",
    "              description: Name of Danish Kommune\n",
    "              type: string\n",
    "            - name: address\n",
    "              description: Street name or name of land parcel\n",
    "              type: string\n",
    "            - name: lat\n",
    "              description: Latitude coordinates\n",
    "              type: geopoint\n",
    "            - name: lon\n",
    "              description: Longitude coordinates \n",
    "              type: geopoint\n",
    "            - name: hub_height\n",
    "              description: Wind turbine hub heigth in m\n",
    "              type: number\n",
    "            - name: rotor_diameter\n",
    "              description: Wind turbine rotor diameter in m\n",
    "              type: number\n",
    "            - name: manufacturer\n",
    "              description: Company that has built the wind turbine\n",
    "              type: string\n",
    "            - name: model\n",
    "              description: Wind turbine model type\n",
    "              type: string\n",
    "            - name: data_source\n",
    "              description: Source of database entry\n",
    "              type: string\n",
    "    - path: renewable_power_plants_FR.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      missingValue: \"\"\n",
    "      schema:\n",
    "          fields:\n",
    "            - name: municipality_code\n",
    "              description: French 5-digit INSEE code for Communes\n",
    "              type: string\n",
    "            - name: municipality\n",
    "              description: Name of French Commune\n",
    "              type: string\n",
    "            - name: energy_source_level_1\n",
    "              description: Type of energy source (e.g. Renewable energy)\n",
    "              type: string\n",
    "            - name: energy_source_level_2\n",
    "              description: Type of energy source (e.g. Wind, Solar)\n",
    "              type: string\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: energy_source_level_3\n",
    "              description: Subtype of energy source (e.g. Biomass and biogas)\n",
    "              type: string\n",
    "            - name: technology\n",
    "              description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)\n",
    "              type: string\n",
    "            - name: electrical_capacity\n",
    "              unit: MW\n",
    "              description: Installed electrical capacity in MW\n",
    "              type: number\n",
    "            - name: number_of_installations\n",
    "              description: Number of installations of the energy source subtype in the municipality. Due to confidentiality reasons, the values smaller than 3 are published as ''<3'' (as in the source).\n",
    "              type: integer\n",
    "              bareNumber: false\n",
    "            - name: lat\n",
    "              description: Latitude coordinates\n",
    "              type: geopoint\n",
    "            - name: lon\n",
    "              description: Longitude coordinates \n",
    "              type: geopoint\n",
    "            - name: data_source\n",
    "              description: Source of database entry\n",
    "              type: string\n",
    "            - name: as_of_year\n",
    "              description: Year for which the data source compiled the original dataset.\n",
    "              type: integer\n",
    "    - path: renewable_power_plants_PL.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      missingValue: \"\"\n",
    "      schema:\n",
    "          fields:\n",
    "            - name: district\n",
    "              description: Name of the Polish powiat\n",
    "              type: string\n",
    "            - name: energy_source_level_1\n",
    "              description: Type of energy source (e.g. Renewable energy)\n",
    "              type: string\n",
    "            - name: energy_source_level_2\n",
    "              description: Type of energy source (e.g. Wind, Solar)\n",
    "              opsdContentfilter: \"true\"\n",
    "              type: string\n",
    "            - name: energy_source_level_3\n",
    "              description: Subtype of energy source (e.g. Biomass and biogas)\n",
    "              type: string\n",
    "            - name: technology\n",
    "              description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)\n",
    "              type: string\n",
    "            - name: electrical_capacity\n",
    "              unit: MW\n",
    "              description: Installed electrical capacity in MW\n",
    "              type: number\n",
    "            - name: number_of_installations\n",
    "              description: Number of installations of the energy source subtype in the district\n",
    "              type: integer\n",
    "            - name: data_source\n",
    "              description: Source of database entry\n",
    "              type: string\n",
    "            - name: as_of_year\n",
    "              description: Year for which the data source compiled the original dataset.\n",
    "              type: integer\n",
    "    - path: renewable_power_plants_UK.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      missingValues: \"\"\n",
    "      schema:\n",
    "          fields:\n",
    "            - name: commissioning_date\n",
    "              description: Date of commissioning of specific unit\n",
    "              type: date\n",
    "              format: YYYY-MM-DD\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: uk_beis_id\n",
    "              description: ID for the plant as assigned by UK BEIS.\n",
    "              type: integer\n",
    "            - name: site_name\n",
    "              description: Name of site\n",
    "              type: string\n",
    "            - name: operator\n",
    "              description: Name of operator\n",
    "              type: string\n",
    "            - name: energy_source_level_1\n",
    "              description: Type of energy source (e.g. Renewable energy)\n",
    "              type: string\n",
    "            - name: energy_source_level_2\n",
    "              description: Type of energy source (e.g. Wind, Solar)\n",
    "              opsdContentfilter: \"true\"\n",
    "              type: string\n",
    "            - name: energy_source_level_3\n",
    "              description: Type of energy source (e.g. Biomass and biogas)\n",
    "              type: string\n",
    "            - name: technology\n",
    "              description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)\n",
    "              type: string\n",
    "            - name: electrical_capacity\n",
    "              description: Installed electrical capacity in MW\n",
    "              unit: MW\n",
    "              type: number\n",
    "            - name: chp\n",
    "              description: Is the project capable of combined heat and power output\n",
    "              type: string\n",
    "            - name: capacity_individual_turbine\n",
    "              description: For windfarms, the individual capacity of each wind turbine in megawatts (MW)\n",
    "              type: number\n",
    "            - name: number_of_turbines\n",
    "              description: For windfarms, the number of wind turbines located on the site\n",
    "              type: integer\n",
    "            - name: solar_mounting_type\n",
    "              description: For solar PV developments, whether the PV panels are ground or roof mounted\n",
    "              type: string\n",
    "            - name: address\n",
    "              description: Address\n",
    "              type: string\n",
    "            - name: municipality\n",
    "              description: Municipality\n",
    "              type: string\n",
    "            - name: region\n",
    "              description: Region\n",
    "              type: string\n",
    "            - name: country\n",
    "              description: The UK's constituent country in which the facility is located.\n",
    "              type: string\n",
    "            - name: postcode\n",
    "              description: Postcode\n",
    "              type: string\n",
    "            - name: lat\n",
    "              description: Latitude coordinates\n",
    "              type: string\n",
    "            - name: lon\n",
    "              description: Longitude coordinates\n",
    "              type: string\n",
    "            - name: data_source\n",
    "              description: The source of database entries\n",
    "              type: string\n",
    "            - name: comment\n",
    "              description: Shortcodes for comments related to this entry, explanation can be looked up in validation_marker.csv\n",
    "              type: string\n",
    "    - path: renewable_power_plants_CH.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      missingValue: \"\"\n",
    "      schema:\n",
    "          fields:\n",
    "            - name: commissioning_date\n",
    "              description: Commissioning date\n",
    "              type: date\n",
    "              format: YYYY-MM-DD\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: municipality\n",
    "              description: Municipality\n",
    "              type: string\n",
    "            - name: energy_source_level_1\n",
    "              description: Type of energy source (e.g. Renewable energy)\n",
    "              type: string\n",
    "            - name: energy_source_level_2\n",
    "              description: Type of energy source (e.g. Wind, Solar)\n",
    "              type: string\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: energy_source_level_3\n",
    "              description: Type of energy source (e.g. Biomass and biogas)\n",
    "              type: string\n",
    "            - name: technology\n",
    "              description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)\n",
    "              type: string\n",
    "            - name: electrical_capacity\n",
    "              unit: MW\n",
    "              description: Installed electrical capacity in MW\n",
    "              type: number\n",
    "            - name: municipality_code\n",
    "              description: Municipality code\n",
    "              type: integer\n",
    "            - name: project_name\n",
    "              description: Name of the project\n",
    "              type: string\n",
    "            - name: production\n",
    "              description: Yearly production in MWh\n",
    "              type: number\n",
    "            - name: tariff\n",
    "              description: Tariff in CHF for 2016\n",
    "              type: number\n",
    "            - name: contract_period_end\n",
    "              description: End year of subsidy contract\n",
    "              type: number\n",
    "            - name: street\n",
    "              description: Street name\n",
    "              type: string\n",
    "            - name: canton\n",
    "              description: Name of the cantones/ member states of the Swiss confederation\n",
    "              type: string\n",
    "            - name: company\n",
    "              description: Name of the company\n",
    "              type: string\n",
    "            - name: lat\n",
    "              description: Latitude coordinates\n",
    "              type: geopoint\n",
    "            - name: lon\n",
    "              description: Longitude coordinates \n",
    "              type: geopoint\n",
    "            - name: data_source\n",
    "              description: Source of database entry\n",
    "              type: string\n",
    "    - path: res_plants_separated_DE_outvalidated_plants.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      missingValue: \"\"\n",
    "      schema:         \n",
    "          fields:\n",
    "            - name: commissioning_date\n",
    "              type: date\n",
    "              format: YYYY-MM-DD\n",
    "              description: Date of commissioning of specific unit\n",
    "            - name: decommissioning_date\n",
    "              type: date\n",
    "              format: YYYY-MM-DD\n",
    "              description: Date of decommissioning of specific unit\n",
    "            - name: energy_source_level_1\n",
    "              description: Type of energy source (e.g. Renewable energy)\n",
    "              type: string\n",
    "            - name: energy_source_level_2\n",
    "              description: Type of energy source (e.g. Wind, Solar)\n",
    "              type: string\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: energy_source_level_3\n",
    "              description: Subtype of energy source (e.g. Biomass and biogas)\n",
    "              type: string\n",
    "            - name: technology\n",
    "              description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)\n",
    "              type: string\n",
    "            - name: electrical_capacity\n",
    "              unit: MW\n",
    "              description: Installed electrical capacity in MW\n",
    "              type: number\n",
    "              unit: MW\n",
    "            - name: thermal_capacity\n",
    "              description: Installed thermal capacity in MW\n",
    "              type: number\n",
    "              unit: MW\n",
    "            - name: voltage_level\n",
    "              description: Voltage level of grid connection\n",
    "              type: string\n",
    "            - name: tso\n",
    "              description: Name of transmission system operator of the area the plant is located\n",
    "              type: string\n",
    "            - name: dso\n",
    "              description: Name of distribution system operator of the region the plant is located in\n",
    "              type: string\n",
    "            - name: dso_id\n",
    "              description: Company number of German distribution grid operator\n",
    "              type: string\n",
    "            - name: eeg_id\n",
    "              description: Power plant EEG (German feed-in tariff law) remuneration number\n",
    "              type: string\n",
    "            - name: federal_state\n",
    "              description: Name of German administrative level 'Bundesland'\n",
    "              type: string\n",
    "            - name: postcode\n",
    "              description: German zip-code\n",
    "              type: string\n",
    "            - name: municipality_code\n",
    "              description: German Gemeindenummer (municipalitiy number)\n",
    "              type: string\n",
    "            - name: municipality\n",
    "              description: Name of German Gemeinde (municipality)\n",
    "              type: string\n",
    "            - name: address\n",
    "              description: Street name or name of land parcel\n",
    "              type: string\n",
    "            - name: lat\n",
    "              description: Latitude coordinates\n",
    "              type: geopoint\n",
    "            - name: lon\n",
    "              description: Longitude coordinates \n",
    "              type: geopoint\n",
    "            - name: data_source\n",
    "              description: Source of database entry\n",
    "              type: string\n",
    "            - name: comment\n",
    "              description: Shortcodes for comments related to this entry, explanation can be looked up in validation_marker.csv\n",
    "              type: string\n",
    "    - path: res_plants_separated_FR_overseas_territories.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      missingValue: \"\"\n",
    "      schema:\n",
    "          fields:\n",
    "            - name: municipality_code\n",
    "              description: French 5-digit INSEE code for Communes\n",
    "              type: string\n",
    "            - name: municipality\n",
    "              description: Name of French Commune\n",
    "              type: string\n",
    "            - name: energy_source_level_1\n",
    "              description: Type of energy source (e.g. Renewable energy)\n",
    "              type: string\n",
    "            - name: energy_source_level_2\n",
    "              description: Type of energy source (e.g. Wind, Solar)\n",
    "              type: string\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: energy_source_level_3\n",
    "              description: Subtype of energy source (e.g. Biomass and biogas)\n",
    "              type: string\n",
    "            - name: technology\n",
    "              description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)\n",
    "              type: string\n",
    "            - name: electrical_capacity\n",
    "              unit: MW\n",
    "              description: Installed electrical capacity in MW\n",
    "              type: number\n",
    "            - name: number_of_installations\n",
    "              description: Number of installations of the energy source subtype in the municipality\n",
    "              type: integer\n",
    "            - name: lat\n",
    "              description: Latitude coordinates\n",
    "              type: geopoint\n",
    "            - name: lon\n",
    "              description: Longitude coordinates \n",
    "              type: geopoint\n",
    "            - name: data_source\n",
    "              description: Source of database entry\n",
    "              type: string\n",
    "    - path: renewable_power_plants.xlsx\n",
    "      format: xlsx\n",
    "    - path: validation_marker.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      mediatype: text/csv\n",
    "      missingValue: \"\"\n",
    "      schema:         \n",
    "          fields:\n",
    "            - name: Validation_Marker\n",
    "              description: Name of validation marker utilized in column comment in the renewable_power_plant_germany.csv\n",
    "              type: string\n",
    "            - name: Explanation\n",
    "              description: Comment explaining meaning of validation marker\n",
    "              type: string\n",
    "    - path: renewable_power_plants_EU.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      mediatype: text/csv\n",
    "      missingValue: \"\"\n",
    "      schema:\n",
    "          fields:\n",
    "            - name: energy_source_level_1\n",
    "              description: Type of energy source (e.g. Renewable energy)\n",
    "              type: string\n",
    "            - name: energy_source_level_2\n",
    "              description: Type of energy source (e.g. Wind, Solar)\n",
    "              type: string\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: energy_source_level_3\n",
    "              description: Type of energy source (e.g. Biomass and biogas)\n",
    "              type: string\n",
    "            - name: electrical_capacity\n",
    "              description: Installed electrical capacity in MW\n",
    "              unit: MW\n",
    "              type: number\n",
    "            - name: data_source\n",
    "              description: Source of database entry\n",
    "              type: string\n",
    "            - name: municipality\n",
    "              description: The name of the municipality in which the facility is located\n",
    "              type: string\n",
    "            - name: lon\n",
    "              description: Geographical longitude\n",
    "              type: number\n",
    "            - name: lat\n",
    "              description: Geographical latitude\n",
    "              type: number\n",
    "            - name: commissioning_date\n",
    "              type: date\n",
    "              format: YYYY-MM-DD\n",
    "              description: Date of commissioning of specific unit\n",
    "            - name: geographical_resolution\n",
    "              description: Precision of geographical information (exact power plant location, municipality, district)\n",
    "              type: \n",
    "            - name: as_of_year\n",
    "              description: Year for which the data source compiled the corresponding dataset\n",
    "              type: integer\n",
    "            - name: country\n",
    "              description: The country in which the facility is located\n",
    "              type: number\n",
    "    - path: renewable_capacity_timeseries.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      mediatype: text/csv\n",
    "      missingValue: \"\"\n",
    "      schema: \n",
    "          fields:\n",
    "            - name: day\n",
    "              type: date\n",
    "              description: The day of the timeseries entry\n",
    "              opsdContentfilter: \"true\"\n",
    "            - name: CH_bioenergy_capacity\n",
    "              description: Cumulative bioenergy electrical capacity for Switzerland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Switzerland\n",
    "                Variable: Bioenergy\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from Swiss Federal Office of Energy\n",
    "            - name: CH_hydro_capacity\n",
    "              description: Cumulative hydro electrical capacity for Switzerland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Switzerland\n",
    "                Variable: Hydro\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from Swiss Federal Office of Energy\n",
    "            - name: CH_solar_capacity\n",
    "              description: Cumulative solar electrical capacity for Switzerland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Switzerland\n",
    "                Variable: Solar\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from Swiss Federal Office of Energy\n",
    "            - name: CH_wind_capacity \n",
    "              ription: Cumulative total wind electrical capacity for Switzerland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Switzerland\n",
    "                Variable: Wind\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from Swiss Federal Office of Energy\n",
    "            - name: CH_wind_onshore_capacity\n",
    "              description: Cumulative onshore wind electrical capacity for Switzerland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Switzerland\n",
    "                Variable: Wind onshore\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from Swiss Federal Office of Energy\n",
    "            - name: DE_bioenergy_capacity\n",
    "              description: Cumulative bioenergy electrical capacity for Germany in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Germany\n",
    "                Variable: Bioenergy\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BNetzA and Netztransparenz.de\n",
    "            - name: DE_geothermal_capacity\n",
    "              description: Cumulative geothermal electrical capacity for Germany in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Germany\n",
    "                Variable: Geothermal\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BNetzA and Netztransparenz.de\n",
    "            - name: DE_solar_capacity\n",
    "              description: Cumulative solar electrical capacity for Germany in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Germany\n",
    "                Variable: Solar\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BNetzA and Netztransparenz.de\n",
    "            - name: DE_storage_capacity\n",
    "              description: Cumulative storage electrical capacity for Germany in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Germany\n",
    "                Variable: Storage\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BNetzA and Netztransparenz.de\n",
    "            - name: DE_wind_offshore_capacity\n",
    "              description: Cumulative offshore wind electrical capacity for Germany in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Germany\n",
    "                Variable: Wind offshore\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BNetzA and Netztransparenz.de\n",
    "            - name: DE_wind_capacity \n",
    "              ription: Cumulative total wind electrical capacity for Germany in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Germany\n",
    "                Variable: Wind\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BNetzA and Netztransparenz.de\n",
    "            - name: DE_wind_onshore_capacity\n",
    "              description: Cumulative onshore wind electrical capacity for Germany in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Germany\n",
    "                Variable: Wind onshore\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BNetzA and Netztransparenz.de\n",
    "            - name: DK_solar_capacity\n",
    "              description: Cumulative solar electrical capacity for Denmark in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Denmark\n",
    "                Variable: Solar\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from Energinet.dk\n",
    "            - name: DK_wind_offshore_capacity\n",
    "              description: Cumulative offshore wind electrical capacity for Denmark in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Denmark\n",
    "                Variable: Wind offshore\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from Danish Energy Agency\n",
    "            - name: DK_wind_capacity \n",
    "              ription: Cumulative total wind electrical capacity for Denmark in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Denmark\n",
    "                Variable: Wind\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from Danish Energy Agency\n",
    "            - name: DK_wind_onshore_capacity\n",
    "              description: Cumulative onshore wind electrical capacity for Denmark in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Denmark\n",
    "                Variable: Wind onshore\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from Danish Energy Agency\n",
    "            - name: GB-GBN_bioenergy_capacity\n",
    "              description: Cumulative bioenergy electrical capacity for Great Britain (England, Scotland, Wales) in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Great Britain (England, Scotland, Wales)\n",
    "                Variable: Bioenergy\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-GBN_hydro_capacity\n",
    "              description: Cumulative hydro electrical capacity for Great Britain (England, Scotland, Wales) in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Great Britain (England, Scotland, Wales)\n",
    "                Variable: Hydro\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-GBN_marine_capacity\n",
    "              description: Cumulative marine electrical capacity for Great Britain (England, Scotland, Wales) in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Great Britain (England, Scotland, Wales)\n",
    "                Variable: Marine\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-GBN_solar_capacity\n",
    "              description: Cumulative solar electrical capacity for Great Britain (England, Scotland, Wales) in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Great Britain (England, Scotland, Wales)\n",
    "                Variable: Solar\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-GBN_wind_offshore_capacity\n",
    "              description: Cumulative offshore wind electrical capacity for Great Britain (England, Scotland, Wales) in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Great Britain (England, Scotland, Wales)\n",
    "                Variable: Wind offshore\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-GBN_wind_capacity\n",
    "              description: Cumulative total wind electrical capacity for Great Britain (England, Scotland, Wales) in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Great Britain (England, Scotland, Wales)\n",
    "                Variable: Wind\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-GBN_wind_onshore_capacity\n",
    "              description: Cumulative onshore wind electrical capacity for Great Britain (England, Scotland, Wales) in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Great Britain (England, Scotland, Wales)\n",
    "                Variable: Wind onshore\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-NIR_bioenergy_capacity\n",
    "              description: Cumulative bioenergy electrical capacity for Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Northern Ireland\n",
    "                Variable: Bioenergy\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-NIR_marine_capacity\n",
    "              description: Cumulative marine electrical capacity for Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Northern Ireland\n",
    "                Variable: Marine\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-NIR_solar_capacity\n",
    "              description: Cumulative solar electrical capacity for Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Northern Ireland\n",
    "                Variable: Solar\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-NIR_wind_capacity\n",
    "              description: Cumulative total wind electrical capacity for Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Northern Ireland\n",
    "                Variable: Wind\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-NIR_wind_onshore_capacity\n",
    "              description: Cumulative onshore wind electrical capacity for Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: Northern Ireland\n",
    "                Variable: Wind onshore\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-UKM_bioenergy_capacity\n",
    "              description: Cumulative bioenergy electrical capacity for the United Kingdom of Great Britain and Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: United Kingdom of Great Britain and Northern Ireland\n",
    "                Variable: Bioenergy\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-UKM_hydro_capacity\n",
    "              description: Cumulative hydro electrical capacity for the United Kingdom of Great Britain and Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: United Kingdom of Great Britain and Northern Ireland\n",
    "                Variable: Hydro\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-UKM_marine_capacity\n",
    "              description: Cumulative marine electrical capacity for the United Kingdom of Great Britain and Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: United Kingdom of Great Britain and Northern Ireland\n",
    "                Variable: Marine\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-UKM_solar_capacity\n",
    "              description: Cumulative solar electrical capacity for the United Kingdom of Great Britain and Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: United Kingdom of Great Britain and Northern Ireland\n",
    "                Variable: Solar\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-UKM_wind_offshore_capacity\n",
    "              description: Cumulative offshore wind electrical capacity for the United Kingdom of Great Britain and Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: United Kingdom of Great Britain and Northern Ireland\n",
    "                Variable: Wind offshore\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-UKM_wind_capacity\n",
    "              description: Cumulative total wind electrical capacity for the United Kingdom of Great Britain and Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: United Kingdom of Great Britain and Northern Ireland\n",
    "                Variable: Wind\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "            - name: GB-UKM_wind_onshore_capacity\n",
    "              description: Cumulative onshore wind electrical capacity for the United Kingdom of Great Britain and Northern Ireland in MW\n",
    "              unit: MW\n",
    "              opsdProperties:\n",
    "                Region: United Kingdom of Great Britain and Northern Ireland\n",
    "                Variable: Wind onshore\n",
    "              type: number\n",
    "              source:\n",
    "                name: Own calculation based on plant-level data from BEIS\n",
    "sources:\n",
    "    - title: BNetzA\n",
    "      path: https://www.bundesnetzagentur.de/SharedDocs/Downloads/DE/Sachgebiete/Energie/Unternehmen_Institutionen/ErneuerbareEnergien/ZahlenDatenInformationen/VOeFF_Registerdaten/2018_12_Veroeff_RegDaten.xlsx?__blob=publicationFile&v=2\n",
    "      description: Bundesnetzagentur register of renewable power plants (excl. PV)\n",
    "    - title: BNetzA_PV\n",
    "      path: https://www.bundesnetzagentur.de/SharedDocs/Downloads/DE/Sachgebiete/Energie/Unternehmen_Institutionen/ErneuerbareEnergien/ZahlenDatenInformationen/PV_Datenmeldungen/Meldungen_Juli17-Dez18.xlsx?__blob=publicationFile&v=2\n",
    "      description: Bundesnetzagentur register of PV power plants\n",
    "    - title: BNetzA_PV_historic\n",
    "      path: https://www.bundesnetzagentur.de/SharedDocs/Downloads/DE/Sachgebiete/Energie/Unternehmen_Institutionen/ErneuerbareEnergien/ZahlenDatenInformationen/PV_Datenmeldungen/Archiv_PV/Meldungen_Aug-Juni2017.xlsx?__blob=publicationFile&v=2\n",
    "      description: Bundesnetzagentur register of PV power plants\n",
    "    - title: TransnetBW, TenneT, Amprion, 50Hertz, Netztransparenz.de\n",
    "      path: https://www.netztransparenz.de/de/Anlagenstammdaten.htm\n",
    "      description: Netztransparenz.de - information platform of German TSOs (register of renewable power plants in their control area)\n",
    "    - title: Postleitzahlen Deutschland\n",
    "      path: http://www.suche-postleitzahl.org/downloads\n",
    "      description: Zip codes of Germany linked to geo-information\n",
    "    - title: Energinet.dk\n",
    "      path: http://www.energinet.dk/SiteCollectionDocuments/Danske%20dokumenter/El/SolcelleGraf.xlsx\n",
    "      description: register of Danish wind power plants\n",
    "    - title: Energistyrelsen\n",
    "      path: https://ens.dk/sites/ens.dk/files/Statistik/anlaegprodtilnettet.xls\n",
    "      description: ens.dk - register of Danish Wind power plants\n",
    "    - title: GeoNames\n",
    "      path: http://download.geonames.org/export/zip/\n",
    "      description: geonames.org\n",
    "    - title: Ministry for the Ecological and Inclusive Transition\n",
    "      path: https://www.statistiques.developpement-durable.gouv.fr/donnees-locales-relatives-aux-installations-de-production-delectricite-renouvelable-beneficiant-0?rubrique=23&dossier=189\n",
    "    - title: OpenDataSoft\n",
    "      path: http://public.opendatasoft.com/explore/dataset/correspondance-code-insee-code-postal/download/'\\\n",
    "           '?format=csv&refine.statut=Commune%20simple&timezone=Europe/Berlin&use_labels_for_header=true\n",
    "      description: Code Postal - Code INSEE\n",
    "    - title: Urzad Regulacji Energetyki (URE)\n",
    "      path: http://www.ure.gov.pl/uremapoze/mapa.html\n",
    "      description: Energy Regulatory Office of Poland\n",
    "    - title: Bundesamt für Energie (BFE)\n",
    "      path: https://www.bfe.admin.ch/bfe/de/home/foerderung/erneuerbare-energien/einspeiseverguetung/_jcr_content/par/tabs/items/tab/tabpar/externalcontent.external.exturl.xlsx/aHR0cHM6Ly9wdWJkYi5iZmUuYWRtaW4uY2gvZGUvcHVibGljYX/Rpb24vZG93bmxvYWQvOTMxMC54bHN4.xlsx\n",
    "      description: Swiss Federal Office of Energy\n",
    "    - title: UK Government Department of Business, Energy & Industrial Strategy (BEIS)\n",
    "      path: https://www.gov.uk/government/publications/renewable-energy-planning-database-monthly-extract\n",
    "      description: Renewable Energy Planning Database quarterly extract\n",
    "contributors:\n",
    "    - title: Ingmar Schlecht\n",
    "      role: Maintainer, developer\n",
    "      organization: Neon GmbH\n",
    "      email: schlecht@neon-energie.de\n",
    "    - title: Milos Simic\n",
    "      role: Developer\n",
    "      email: milos.simic.ms@gmail.com\n",
    "\"\"\"\n",
    "\n",
    "metadata = yaml.load(metadata)\n",
    "\n",
    "metadata['homepage'] = 'https://data.open-power-system-data.org/renewable_power_plants/'+settings['version']\n",
    "metadata['id'] = 'https://doi.org/10.25832/renewable_power_plants/'+settings['version']\n",
    "metadata['last_changes'] = settings['changes']\n",
    "metadata['version'] = settings['version']\n",
    "\n",
    "lastYear = int(settings['version'][0:4])-1\n",
    "\n",
    "metadata['temporal'] = {\n",
    "    'referenceDate': str(lastYear)+'-12-31'\n",
    "}\n",
    "\n",
    "metadata['documentation'] = 'https://github.com/Open-Power-System-Data/renewable_power_plants/blob/'+settings['version']+'/main.ipynb'\n",
    "\n",
    "datapackage_json = json.dumps(metadata, indent=4, separators=(',', ': '), ensure_ascii=False)\n",
    "\n",
    "# Write the information of the metadata\n",
    "with open(os.path.join(package_path, 'datapackage.json'), 'w', encoding='utf-8') as f:\n",
    "    f.write(datapackage_json)\n",
    "    f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate checksums\n",
    "\n",
    "Generates checksums.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_sha_hash(path, blocksize=65536):\n",
    "    sha_hasher = hashlib.sha256()\n",
    "    with open(path, 'rb') as f:\n",
    "        buffer = f.read(blocksize)\n",
    "        while len(buffer) > 0:\n",
    "            sha_hasher.update(buffer)\n",
    "            buffer = f.read(blocksize)\n",
    "        return sha_hasher.hexdigest()\n",
    "\n",
    "files = [\n",
    "    'validation_marker.csv', \n",
    "    'renewable_power_plants.sqlite', 'renewable_power_plants.xlsx',\n",
    "]\n",
    "\n",
    "for country in countries_including_dirty:\n",
    "    files.append(table_names[country]+'.csv')\n",
    "\n",
    "files.append('renewable_capacity_timeseries.csv')\n",
    "\n",
    "files.append('renewable_power_plants_EU.csv')\n",
    "    \n",
    "with open('checksums.txt', 'w') as f:\n",
    "    for file_name in sorted(files):\n",
    "        print(file_name)\n",
    "        file_hash = get_sha_hash(os.path.join(package_path, file_name))\n",
    "        f.write('{},{}\\n'.format(file_name, file_hash))\n",
    "        print('Done!')"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "p3",
   "language": "python",
   "name": "p3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  },
  "latex_envs": {
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 0
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {
    "height": "716px",
    "left": "104px",
    "top": "280px",
    "width": "231px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "position": {
    "height": "531px",
    "left": "1530px",
    "right": "40px",
    "top": "273px",
    "width": "350px"
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}