{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"width:100%; background-color: #D9EDF7; border: 1px solid #CFCFCF; text-align: left; padding: 10px;\">\n",
    "      <b>Renewable power plants: Download and process notebook</b>\n",
    "      <ul>\n",
    "        <li><a href=\"main.ipynb\">Main notebook</a></li>\n",
    "        <li>Download and process notebook</li>\n",
    "        <li><a href=\"validation_and_output.ipynb\">Validation and output notebook</a></li>\n",
    "      </ul>\n",
    "      <br>This notebook is part of the <a href=\"http://data.open-power-system-data.org/renewable_power_plants\"> Renewable power plants Data Package</a> of <a href=\"http://open-power-system-data.org\">Open Power System Data</a>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "This script downlads and extracts the original data of renewable power plant lists from the data sources, processes and merges them. It subsequently adds the geolocation for each power plant. Finally it saves the DataFrames as pickle-files. Make sure you run the download and process Notebook before the validation and output Notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Script-setup\" data-toc-modified-id=\"Script-setup-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Script setup</a></span></li><li><span><a href=\"#Settings\" data-toc-modified-id=\"Settings-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Settings</a></span><ul class=\"toc-item\"><li><span><a href=\"#Choose-download-option\" data-toc-modified-id=\"Choose-download-option-2.1\"><span class=\"toc-item-num\">2.1&nbsp;&nbsp;</span>Choose download option</a></span></li><li><span><a href=\"#Download-function\" data-toc-modified-id=\"Download-function-2.2\"><span class=\"toc-item-num\">2.2&nbsp;&nbsp;</span>Download function</a></span></li><li><span><a href=\"#Setup-translation-dictionaries\" data-toc-modified-id=\"Setup-translation-dictionaries-2.3\"><span class=\"toc-item-num\">2.3&nbsp;&nbsp;</span>Setup translation dictionaries</a></span></li></ul></li><li><span><a href=\"#Download-and-process-per-country\" data-toc-modified-id=\"Download-and-process-per-country-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Download and process per country</a></span><ul class=\"toc-item\"><li><span><a href=\"#Germany-DE\" data-toc-modified-id=\"Germany-DE-3.1\"><span class=\"toc-item-num\">3.1&nbsp;&nbsp;</span>Germany DE</a></span></li><li><span><a href=\"#Denmark-DK\" data-toc-modified-id=\"Denmark-DK-3.2\"><span class=\"toc-item-num\">3.2&nbsp;&nbsp;</span>Denmark DK</a></span></li><li><span><a href=\"#France-FR\" data-toc-modified-id=\"France-FR-3.3\"><span class=\"toc-item-num\">3.3&nbsp;&nbsp;</span>France FR</a></span></li><li><span><a href=\"#Poland-PL\" data-toc-modified-id=\"Poland-PL-3.4\"><span class=\"toc-item-num\">3.4&nbsp;&nbsp;</span>Poland PL</a></span></li><li><span><a href=\"#Switzerland-CH\" data-toc-modified-id=\"Switzerland-CH-3.5\"><span class=\"toc-item-num\">3.5&nbsp;&nbsp;</span>Switzerland CH</a></span></li></ul></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Script setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "import logging\n",
    "import os\n",
    "import posixpath\n",
    "import urllib.parse\n",
    "import urllib.request\n",
    "import re\n",
    "import zipfile\n",
    "import pickle\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import utm  # for transforming geoinformation in the utm format\n",
    "import requests\n",
    "from string import Template\n",
    "from IPython.display import display\n",
    "\n",
    "logging.basicConfig(\n",
    "    level=logging.INFO,\n",
    "    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',\n",
    "    datefmt='%d %b %Y %H:%M:%S'\n",
    ")\n",
    "\n",
    "logger = logging.getLogger()\n",
    "\n",
    "# Create input and output folders if they don't exist\n",
    "os.makedirs(os.path.join('input', 'original_data'), exist_ok=True)\n",
    "os.makedirs('output', exist_ok=True)\n",
    "os.makedirs(os.path.join('output', 'renewable_power_plants'), exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Settings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Choose download option\n",
    "The original data can either be downloaded from the original data sources as specified below or from the opsd-Server. Default option is to download from the original sources as the aim of the project is to stay as close to original sources as possible. However, if problems with downloads e.g. due to changing urls occur, you can still run the script with the original data from the opsd_server."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "download_from = 'original_sources'\n",
    "#download_from = 'opsd_server' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def download_and_cache(url, session=None):\n",
    "    \"\"\"This function downloads a file into a folder called \n",
    "    original_data and returns the local filepath.\"\"\"\n",
    "    path = urllib.parse.urlsplit(url).path\n",
    "    filename = posixpath.basename(path)\n",
    "    filepath = os.path.join('input', 'original_data', filename)\n",
    "    print(url)\n",
    "\n",
    "    # check if file exists, if not download it\n",
    "    print(filepath)\n",
    "    if not os.path.exists(filepath):\n",
    "        if not session:\n",
    "            print('No session')\n",
    "            session = requests.session()\n",
    "\n",
    "        print(\"Downloading file: \", filename)\n",
    "        r = session.get(url, stream=True)\n",
    "\n",
    "        chuncksize = 1024\n",
    "        with open(filepath, 'wb') as file:\n",
    "            for chunck in r.iter_content(chuncksize):\n",
    "                file.write(chunck)\n",
    "    else:\n",
    "        print(\"Using local file from\", filepath)\n",
    "    filepath = '' + filepath\n",
    "    return filepath"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup translation dictionaries\n",
    "\n",
    "Column and value names of the original data sources will be translated to English and standardized across different sources. Standardized column names, e.g. \"electrical_capacity\" are required to merge data in one DataFrame.<br>\n",
    "The column and the value translation lists are provided in the input folder of the Data Package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Get column translation list\n",
    "columnnames = pd.read_csv(os.path.join('input', 'column_translation_list.csv'))\n",
    "columnnames.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get value translation list\n",
    "valuenames = pd.read_csv(os.path.join('input', 'value_translation_list.csv'))\n",
    "valuenames.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Download and process per country\n",
    "\n",
    "For one country after the other, the original data is downloaded, read, processed, translated, eventually georeferenced and saved. If respective files are already in the local folder, these will be utilized.\n",
    "To process the provided data [pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is applied.<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Germany DE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download and read\n",
    "The data which will be processed below is provided by the following data sources:\n",
    "\n",
    "**[Netztransparenz.de](https://www.netztransparenz.de/de/Anlagenstammdaten.htm)** - Official grid transparency platform from the German TSOs (50Hertz, Amprion, TenneT and TransnetBW).\n",
    "\n",
    "**[Bundesnetzagentur (BNetzA)](https://www.bundesnetzagentur.de/DE/Sachgebiete/ElektrizitaetundGas/Unternehmen_Institutionen/ErneuerbareEnergien/ZahlenDatenInformationen/EEG_Registerdaten/EEG_Registerdaten_node.html)** - German Federal Network Agency for Electricity, Gas, Telecommunications, Posts and Railway (In separate files for **data for roof-mounted PV power plants** and for **all other renewable energy power plants**.)\n",
    "\n",
    "Data URL for BNetzA gets updated every few month. To be sure, always check if the links (url_bnetza; url_bnetza_pv) are up to date."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# point URLs to original data depending on the chosen download option\n",
    "if download_from == 'original_sources':\n",
    "    tsos = ['50Hertz', 'Amprion', 'TenneT', 'TransnetBW']\n",
    "    datasets = ['50Hertz', 'Amprion', 'TenneT', 'TransnetBW','bnetza','bnetza_pv','bnetza_pv_historic']\n",
    "    url_pattern_netztransparenz = 'https://www.netztransparenz.de/portals/1/Content/Erneuerbare-Energien-Gesetz/EEG-Anlagestammdaten/${tso}_EEG-Anlagenstammdaten_2016.zip'\n",
    "    urls = {\n",
    "        'bnetza': 'https://www.bundesnetzagentur.de/SharedDocs/Downloads/DE/Sachgebiete/Energie/Unternehmen_Institutionen/ErneuerbareEnergien/ZahlenDatenInformationen/VOeFF_Registerdaten/2017_12_Veroeff_RegDaten.xlsx?__blob=publicationFile&v=2',\n",
    "        'bnetza_pv': 'https://www.bundesnetzagentur.de/SharedDocs/Downloads/DE/Sachgebiete/Energie/Unternehmen_Institutionen/ErneuerbareEnergien/ZahlenDatenInformationen/VOeFF_Registerdaten/PV_Datenmeldungen/Meldungen_Juli-Dez2017.xlsx?__blob=publicationFile&v=2',\n",
    "        'bnetza_pv_historic': 'https://www.bundesnetzagentur.de/SharedDocs/Downloads/DE/Sachgebiete/Energie/Unternehmen_Institutionen/ErneuerbareEnergien/ZahlenDatenInformationen/VOeFF_Registerdaten/PV_Datenmeldungen/Meldungen_Aug-Juni2017.xlsx?__blob=publicationFile&v=2'\n",
    "    }\n",
    "\n",
    "elif download_from == 'opsd_server':\n",
    "    print('Download from opsd_server is currently not implemented.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Note: Some BNetzA files fail to download automatically, so put them manually to the input/original_data/ folder\n",
    "\n",
    "# Download all data sets before processing.\n",
    "if download_from == 'original_sources':\n",
    "    filenames = {}\n",
    "    for tso in tsos:\n",
    "        url = Template(url_pattern_netztransparenz).substitute(tso=tso)\n",
    "        %time filenames[tso] = zipfile.ZipFile(download_and_cache(url))\n",
    "\n",
    "    for dataset in ['bnetza','bnetza_pv','bnetza_pv_historic']:\n",
    "        %time filenames[dataset] = download_and_cache(urls[dataset])\n",
    "\n",
    "elif download_from == 'opsd_server':\n",
    "    print('Download from opsd_server is currently not implemented.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read TSO data from zip file\n",
    "\n",
    "dfs = {}\n",
    "\n",
    "for tso in tsos:\n",
    "    print('Reading '+tso+'_EEG-Anlagenstammdaten_2016.csv')\n",
    "    dfs[tso] = pd.read_csv(\n",
    "        filenames[tso].open(tso+'_EEG-Anlagenstammdaten_2016.csv'),\n",
    "        sep=';',\n",
    "        thousands='.',\n",
    "        decimal=',',\n",
    "        header=0,\n",
    "        parse_dates=[11, 12, 13, 14],\n",
    "        encoding='iso-8859-1',\n",
    "        dayfirst=True,\n",
    "        low_memory=False\n",
    "    )\n",
    "    print('Done reading '+tso+'_EEG-Anlagenstammdaten_2016.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "key = 'Installierte Leistung'\n",
    "dfs['Amprion'][key] = [x.replace('.', '').replace(',', '.') for x in dfs['Amprion'][key]]\n",
    "dfs['Amprion'][key] = dfs['Amprion'][key].astype(float)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read BNetzA register\n",
    "print('Reading bnetza: '+filenames['bnetza'])\n",
    "dfs['bnetza'] = pd.read_excel(filenames['bnetza'],\n",
    "                          sheetname='Gesamtübersicht',\n",
    "                          header=0,\n",
    "                          converters={'4.9 Postleit-zahl': str, 'Gemeinde-Schlüssel': str}\n",
    ")\n",
    "\n",
    "skiprows = {'bnetza_pv_historic': 10, 'bnetza_pv': 9}\n",
    "\n",
    "for dataset in ['bnetza_pv', 'bnetza_pv_historic']:\n",
    "    print('Reading '+dataset+': '+filenames[dataset])\n",
    "    xls_handle = pd.ExcelFile(filenames[dataset])\n",
    "    print('Concatenating all '+dataset+' sheets into one dataframe')\n",
    "    dfs[dataset] = pd.concat(\n",
    "        xls_handle.parse(\n",
    "            sheet,\n",
    "            skiprows=skiprows[dataset],\n",
    "            converters={'Anlage \\nPLZ': str}\n",
    "        ) for sheet in xls_handle.sheet_names\n",
    "    )\n",
    "    dfs[dataset].tail(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop empty column called \"Unnamed: 4\"\n",
    "dfs['TransnetBW'] = dfs['TransnetBW'].drop(['Unnamed: 4'], axis=1)\n",
    "dfs['bnetza_pv_historic'] = dfs['bnetza_pv_historic'].drop(['Unnamed: 7'], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pickle.dump( dfs, open( \"intermediate/temp_dfs_1.p\", \"wb\" ) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dfs = pickle.load( open( \"intermediate/temp_dfs_1.p\", \"rb\" ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate column names\n",
    "To standardise the DataFrame the original column names from the German TSOs and the BNetzA wil be translated and new english column names wil be assigned to the DataFrame. The unique column names are required to merge the DataFrame.<br>\n",
    "The column_translation_list is provided here as csv in the input folder. It is loaded in _2.3 Setup of translation dictionaries_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for Germany, create dictionary and show dictionary\n",
    "columnnames = pd.read_csv(os.path.join('input', 'column_translation_list.csv'))\n",
    "idx_DE = columnnames[columnnames['country'] == 'DE'].index\n",
    "column_dict_DE = columnnames.loc[idx_DE].set_index('original_name')['opsd_name'].to_dict()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Start the column translation process for each original data source\n",
    "print('Translation')\n",
    "for dataset in datasets:\n",
    "    # Remove newlines and any other duplicate whitespaces in column names:\n",
    "    dfs[dataset] = dfs[dataset].rename(columns={col: re.sub(r\"\\s+\", ' ', col) for col in dfs[dataset].columns})\n",
    "    # Do column name translations\n",
    "    dfs[dataset].rename(columns=column_dict_DE, inplace=True)\n",
    "\n",
    "print('done.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add information and choose columns\n",
    "All data source names and for the BNetzA-PV data the energy source level 2 will added."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add data source names to the DataFrames\n",
    "for tso in tsos:\n",
    "    dfs[tso]['data_source'] = tso\n",
    "\n",
    "dfs['bnetza']['data_source'] = 'BNetzA'\n",
    "dfs['bnetza_pv']['data_source'] = 'BNetzA_PV'\n",
    "dfs['bnetza_pv_historic']['data_source'] = 'BNetzA_PV_historic'\n",
    "\n",
    "# Add for the BNetzA PV data the energy source level 2\n",
    "dfs['bnetza_pv']['energy_source_level_2'] = 'Photovoltaics'\n",
    "dfs['bnetza_pv_historic']['energy_source_level_2'] = 'Photovoltaics'\n",
    "\n",
    "# Correct datetime-format\n",
    "def decom_fkt(x):\n",
    "    x = str(x)\n",
    "    if x == 'nan':\n",
    "        x = ''\n",
    "    else:\n",
    "        x = x[0:10]\n",
    "    return x\n",
    "\n",
    "dfs['bnetza']['decommissioning_date'] = dfs['bnetza']['decommissioning_date'].apply(decom_fkt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select those columns of the orignal data which are utilised further\n",
    "dfs['bnetza'] = dfs['bnetza'].loc[:, ('commissioning_date', 'decommissioning_date',\n",
    "                              'notification_reason', 'energy_source_level_2',\n",
    "                              'electrical_capacity_kW', 'thermal_capacity_kW',\n",
    "                              'voltage_level', 'dso', 'eeg_id', 'bnetza_id',\n",
    "                              'federal_state', 'postcode', 'municipality_code',\n",
    "                              'municipality', 'address', 'address_number',\n",
    "                              'utm_zone', 'utm_east', 'utm_north',\n",
    "                              'data_source')]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for dataset in datasets: print(dataset+':'); display(dfs[dataset].tail(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Merge DataFrames\n",
    "The individual DataFrames from the TSOs (Netztransparenz.de) and BNetzA are merged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Merge DataFrames of each original source into a common DataFrame DE_renewables\n",
    "dfs_list = []\n",
    "\n",
    "for dataset in datasets:\n",
    "    dfs_list.append(dfs[dataset])\n",
    "\n",
    "DE_renewables = pd.concat(dfs_list)\n",
    "DE_renewables.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make sure the decommissioning_column has the right dtype\n",
    "%time DE_renewables['decommissioning_date'] = DE_renewables['decommissioning_date'].apply(pd.to_datetime)\n",
    "DE_renewables['decommissioning_date'] = pd.to_datetime(DE_renewables['decommissioning_date'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DE_renewables.reset_index(drop=True, inplace=True)\n",
    "DE_renewables.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate values and harmonize energy source level 2\n",
    "Different German terms for energy source level 2, energy source level 3, technology and voltage levels are translated and harmonized across the individual data sources. The value_translation_list is provided here as csv in the input folder. It is loaded in _2.3 Setup of translation dictionaries_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for Germany, create dictionary and show dictionary\n",
    "valuenames = pd.read_csv(os.path.join('input', 'value_translation_list.csv'))\n",
    "idx_DE = valuenames[valuenames['country'] == 'DE'].index\n",
    "value_dict_DE = valuenames.loc[idx_DE].set_index('original_name')['opsd_name'].to_dict()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('replacing..')\n",
    "# Replace all original value names by the OPSD value names. \n",
    "# Running time: some minutes. %time prints the time your computer required for this step\n",
    "%time DE_renewables.replace(value_dict_DE, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DE_renewables['postcode'] = DE_renewables['postcode'].apply(pd.to_numeric, errors='ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Separate and assign energy source level 1 - 3 and technology **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create dictionary in order to assign energy_source to its subtype\n",
    "energy_source_dict_DE = valuenames.loc[idx_DE].set_index(\n",
    "    'opsd_name')['energy_source_level_2'].to_dict()\n",
    "\n",
    "# Column energy_source partly contains energy source level 3 and technology information,\n",
    "# thus this column is copied to new column technology...\n",
    "DE_renewables['technology'] = DE_renewables['energy_source_level_2']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ...and the energy source level 2 values are replaced by the higher level classification\n",
    "DE_renewables['energy_source_level_2'].replace(\n",
    "    energy_source_dict_DE, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose energy source level 2 entries where energy_source is \"Bioenergy\" in order to \n",
    "# seperate Bioenergy subtypes to \"energy_source_level_3\" and subtypes for the rest to \"technology\"\n",
    "idx_DE_Bioenergy = DE_renewables[DE_renewables['energy_source_level_2'] == 'Bioenergy'].index\n",
    "\n",
    "# Assign technology to energy source level 3 for all entries where energy source level 2 is \n",
    "# Bioenergy and delete those entries from technology\n",
    "DE_renewables[['energy_source_level_3']] = DE_renewables.iloc[idx_DE_Bioenergy][['technology']]\n",
    "DE_renewables.loc[idx_DE_Bioenergy]['technology'] = np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign energy source level 1 to the dataframe\n",
    "DE_renewables['energy_source_level_1'] = 'Renewable energy'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Summary of DataFrame**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Electrical capacity per energy source level 2 (in MW)\n",
    "DE_renewables.groupby(['energy_source_level_2'])['electrical_capacity_kW'].sum() / 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transform electrical capacity from kW to MW"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# kW to MW\n",
    "DE_renewables[['electrical_capacity_kW', 'thermal_capacity_kW']] /= 1000\n",
    "\n",
    "# adapt column name\n",
    "DE_renewables.rename(columns={'electrical_capacity_kW': 'electrical_capacity',\n",
    "                              'thermal_capacity_kW': 'thermal_capacity'}, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Georeferencing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Get coordinates by postcode\n",
    "*(for data with no existing geocoordinates)*\n",
    "\n",
    "The available post code in the original data provides a first approximation for the geocoordinates of the RE power plants.<br>\n",
    "The BNetzA data provides the full zip code whereas due to data privacy the TSOs only report the first three digits of the power plant's post code (e.g. 024xx) and no address. Subsequently a centroid of the post code region polygon is used to find the coordinates.\n",
    "\n",
    "With data from\n",
    "*  http://www.suche-postleitzahl.org/downloads?download=plz-gebiete.shp.zip\n",
    "*  http://www.suche-postleitzahl.org/downloads?download_file=plz-3stellig.shp.zip\n",
    "*  http://www.suche-postleitzahl.org/downloads\n",
    "\n",
    "a CSV-file for all existing German post codes with matching geocoordinates has been compiled. The latitude and longitude coordinates were generated by running a PostgreSQL + PostGIS database. Additionally the respective TSO has been added to each post code. *(A Link to the SQL script will follow here later)*\n",
    "\n",
    "*(License: http://www.suche-postleitzahl.org/downloads, Open Database Licence for free use. Source of data: © OpenStreetMap contributors)*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read generated postcode/location file\n",
    "postcode = pd.read_csv(os.path.join('input', 'de_tso_postcode_full.csv'))\n",
    "\n",
    "# Drop possible duplicates in postcodes\n",
    "postcode.drop_duplicates('postcode', keep='last', inplace=True)\n",
    "\n",
    "# Show first entries\n",
    "postcode.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** Merge geometry information by using the postcode**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Take postcode and longitude/latitude information\n",
    "postcode = postcode[['postcode', 'lon', 'lat']]\n",
    "\n",
    "DE_renewables = DE_renewables.merge(postcode, on=['postcode'],  how='left')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Transform geoinformation\n",
    "*(for data with already existing geoinformation)*\n",
    "\n",
    "In this section the existing geoinformation (in UTM-format) will be transformed into latidude and longitude coordiates as a uniform standard for geoinformation. \n",
    "\n",
    "The BNetzA data set offers UTM Geoinformation with the columns *utm_zone (UTM-Zonenwert)*, *utm_east* and *utm_north*. Most of utm_east-values include the utm_zone-value **32** at the beginning of the number. In order to properly standardize and transform this geoinformation into latitude and longitude it is necessary to remove this utm_zone value. For all UTM entries the utm_zone 32 is used by the BNetzA.\n",
    "\n",
    "\n",
    "|utm_zone|\t utm_east|\t utm_north| comment|\n",
    "|---|---|---| ----|\n",
    "|32|\t413151.72|\t6027467.73| proper coordinates|\n",
    "|32|\t**32**912159.6008|\t5692423.9664| caused error by 32|\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**How many different utm_zone values are in the data set?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DE_renewables.groupby(['utm_zone'])['utm_zone'].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Remove the utm_zone \"32\" from the utm_east value**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find entries with 32 value at the beginning\n",
    "idx_32 = (DE_renewables['utm_east'].astype(str).str[:2] == '32')\n",
    "idx_notnull = DE_renewables['utm_east'].notnull()\n",
    "\n",
    "# Remove 32 from utm_east entries\n",
    "DE_renewables.loc[idx_32, 'utm_east'] = DE_renewables.loc[idx_32,\n",
    "                                                          'utm_east'].astype(str).str[2:].astype(float)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_to_latlon(utm_east, utm_north, utm_zone):\n",
    "    try:\n",
    "        return utm.to_latlon(utm_east, utm_north, utm_zone, 'U')\n",
    "    except:\n",
    "        return ''\n",
    "\n",
    "DE_renewables['latlon'] = DE_renewables.loc[idx_notnull, ['utm_east', 'utm_north', 'utm_zone']].apply(\n",
    "        lambda x: convert_to_latlon(x[0], x[1], x[2]), axis=1).astype(str)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Conversion UTM to latitude and longitude**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lat = []\n",
    "lon = []\n",
    "\n",
    "for row in DE_renewables['latlon']:\n",
    "    try:\n",
    "        # Split tuple format into the column lat and lon\n",
    "        row = row.lstrip('(').rstrip(')')\n",
    "        parts = row.split(',')\n",
    "        if(len(parts)<2):\n",
    "            raise Exception('This is not a proper tuple. So go to exception block.')\n",
    "        lat.append(parts[0])\n",
    "        lon.append(parts[1])\n",
    "    except:\n",
    "        # set NaN\n",
    "        lat.append(np.NaN)\n",
    "        lon.append(np.NaN)\n",
    "\n",
    "DE_renewables['latitude'] = pd.to_numeric(lat)\n",
    "DE_renewables['longitude'] = pd.to_numeric(lon)\n",
    "\n",
    "# Add new values to DataFrame lon and lat\n",
    "DE_renewables['lat'] = DE_renewables[['lat', 'latitude']].apply(\n",
    "    lambda x: x[1] if pd.isnull(x[0]) else x[0],\n",
    "    axis=1)\n",
    "\n",
    "DE_renewables['lon'] = DE_renewables[['lon', 'longitude']].apply(\n",
    "    lambda x: x[1] if pd.isnull(x[0]) else x[0],\n",
    "    axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Check: missing coordinates by data source and type**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#DE_renewables[DE_renewables['data_source'] == '50Hertz'].to_excel('test.xlsx')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Missing coordinates ', DE_renewables.lat.isnull().sum())\n",
    "\n",
    "display(\n",
    "    DE_renewables[DE_renewables.lat.isnull()].groupby(\n",
    "        ['energy_source_level_2','data_source']\n",
    "    )['data_source'].count()\n",
    ")\n",
    "\n",
    "print('Share of missing coordinates (note that NaN can mean it\\'s all fine):')\n",
    "\n",
    "DE_renewables[DE_renewables.lat.isnull()].groupby(\n",
    "        ['energy_source_level_2','data_source']\n",
    "    )['data_source'].count() / DE_renewables.groupby(\n",
    "        ['energy_source_level_2','data_source']\n",
    "    )['data_source'].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Remove temporary columns**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# drop lonlat column that contains both, latitute and longitude\n",
    "DE_renewables.drop(['latlon', 'longitude', 'latitude'], axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(\n",
    "    DE_renewables[DE_renewables['data_source'] == 'TenneT'].head(3)[['lon','lat','data_source']]\n",
    ")\n",
    "display(\n",
    "    DE_renewables[DE_renewables['data_source'] == 'BNetzA'].head(3)[['lon','lat','data_source']]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save\n",
    " \n",
    "The merged, translated, cleaned, DataFrame will be saved temporily as a pickle file, which stores a Python object fast."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DE_renewables.to_pickle('intermediate/DE_renewables.pickle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Denmark DK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download and read\n",
    "The data which will be processed below is provided by the following data sources:\n",
    "\n",
    "** [Energistyrelsen (ens) / Danish Energy Agency](http://www.ens.dk/info/tal-kort/statistik-noegletal/oversigt-energisektoren/stamdataregister-vindmoller)** - The wind turbines register is released by the Danish Energy Agency. \n",
    "\n",
    "** [Energinet.dk](http://www.energinet.dk/DA/El/Engrosmarked/Udtraek-af-markedsdata/Sider/Statistik.aspx)** - The data of solar power plants are released by the leading transmission network operator Denmark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# point URLs to original data depending on the chosen download option\n",
    "if download_from == 'original_sources':\n",
    "    \n",
    "    # The link from ens.dk is currently unavailable. We are in contact with ens in order to be able\n",
    "    # to provide an alternative source. Until then choose download from 'opsd-server' where the\n",
    "    # last available data can be cached.\n",
    "    url_DK_ens = 'https://ens.dk/sites/ens.dk/files/Statistik/anlaegprodtilnettet.xls'\n",
    "    url_DK_energinet = 'https://www.energinet.dk/-/media/Energinet/El-CSI/Dokumenter/Data/SolcellerGraf-2016-11.xlsx'\n",
    "    url_DK_geo = 'http://download.geonames.org/export/zip/DK.zip'\n",
    "\n",
    "elif download_from == 'opsd_server':\n",
    "    \n",
    "    url_DK_ens = (url_opsd + version + folder + '/DK/anlaegprodtilnettet.xls')\n",
    "    url_DK_energinet = (url_opsd + version + folder + '/DK/SolcellerGraf%202016-11.xlsx')\n",
    "    url_DK_geo = (url_opsd + version + folder + 'DK/DK.zip')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get wind turbines data\n",
    "DK_wind_df = pd.read_excel(download_and_cache(url_DK_ens),\n",
    "                           sheetname='IkkeAfmeldte-Existing turbines',\n",
    "                           thousands='.',\n",
    "                           header=17,\n",
    "                           skipfooter=3,\n",
    "                           parse_cols=16,\n",
    "                           converters={'Møllenummer (GSRN)': str,\n",
    "                                       'Kommune-nr': str,\n",
    "                                       'Postnr': str}\n",
    "                           )\n",
    "\n",
    "# Get photovoltaic data\n",
    "DK_solar_df = pd.read_excel(download_and_cache(url_DK_energinet),\n",
    "                            sheetname='Data',\n",
    "                            converters={'Postnr': str}\n",
    "                            )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Choose the translation terms for Denmark, create dictionary and show dictionary\n",
    "idx_DK = columnnames[columnnames['country'] == 'DK'].index\n",
    "column_dict_DK = columnnames.loc[idx_DK].set_index('original_name')['opsd_name'].to_dict()\n",
    "\n",
    "# Windows has problems reading the csv entry for east and north (DK).\n",
    "# The reason might be the difference when opening the csv between linux and\n",
    "# windows.\n",
    "column_dict_DK_temp = {}\n",
    "for k, v in column_dict_DK.items():\n",
    "    column_dict_DK_temp[k] = v\n",
    "    if v == 'utm_east' or v == 'utm_north':\n",
    "        # merge 2 lines to 1\n",
    "        new_key = ''.join(k.splitlines())\n",
    "        column_dict_DK_temp[new_key] = v\n",
    "\n",
    "column_dict_DK = column_dict_DK_temp\n",
    "\n",
    "column_dict_DK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Translate columns by list\n",
    "DK_wind_df['X (øst) koordinat UTM 32 Euref89'] = DK_wind_df['X (øst) koordinat \\nUTM 32 Euref89']\n",
    "DK_wind_df['Y (nord) koordinat UTM 32 Euref89'] = DK_wind_df['Y (nord) koordinat \\nUTM 32 Euref89']\n",
    "\n",
    "#and 13 are the keys that make problems\n",
    "DK_wind_df.drop(DK_wind_df.columns[[12, 13]], axis=1, inplace=True)\n",
    "\n",
    "# Replace column names based on column_dict_DK\n",
    "DK_wind_df.rename(columns=column_dict_DK, inplace=True)\n",
    "DK_solar_df.rename(columns=column_dict_DK, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add data source and missing information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add names of the data sources to the DataFrames\n",
    "DK_wind_df['data_source'] = 'Energistyrelsen'\n",
    "DK_solar_df['data_source'] = 'Energinet.dk'\n",
    "\n",
    "# Add energy source level 2 and technology for each of the two DataFrames\n",
    "DK_wind_df['energy_source_level_2'] = 'Wind'\n",
    "DK_solar_df['energy_source_level_2'] = 'Solar'\n",
    "DK_solar_df['technology'] = 'Photovoltaics'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate values and harmonize energy source level 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for Denmark, create dictionary and show dictionary\n",
    "idx_DK = valuenames[valuenames['country'] == 'DK'].index\n",
    "value_dict_DK = valuenames.loc[idx_DK].set_index('original_name')['opsd_name'].to_dict()\n",
    "value_dict_DK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace all original value names by the OPSD value names\n",
    "DK_wind_df.replace(value_dict_DK, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Georeferencing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**UTM32 to latitude and longitude** *(Data from Energistyrelsen)*\n",
    "\n",
    "The Energistyrelsen data set offers UTM Geoinformation with the columns utm_east and utm_north belonging to the UTM zone 32. In this section the existing geoinformation (in UTM-format) will be transformed into latidude and longitude coordiates as a uniform standard for geoinformation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Index for all values with utm information\n",
    "idx_notnull = DK_wind_df['utm_east'].notnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert from UTM values to latitude and longitude coordinates\n",
    "DK_wind_df['lonlat'] = DK_wind_df.loc[idx_notnull, ['utm_east', 'utm_north']\n",
    "                                      ].apply(lambda x: utm.to_latlon(x[0],\n",
    "                                                                      x[1],\n",
    "                                                                      32,\n",
    "                                                                      'U'), axis=1).astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split latitude and longitude in two columns\n",
    "lat = []\n",
    "lon = []\n",
    "\n",
    "for row in DK_wind_df['lonlat']:\n",
    "    try:\n",
    "        # Split tuple format\n",
    "        # into the column lat and lon\n",
    "        row = row.lstrip('(').rstrip(')')\n",
    "        lat.append(row.split(',')[0])\n",
    "        lon.append(row.split(',')[1])\n",
    "    except:\n",
    "        # set NAN\n",
    "        lat.append(np.NaN)\n",
    "        lon.append(np.NaN)\n",
    "\n",
    "DK_wind_df['lat'] = pd.to_numeric(lat)\n",
    "DK_wind_df['lon'] = pd.to_numeric(lon)\n",
    "\n",
    "# drop lonlat column that contains both, latitute and longitude\n",
    "DK_wind_df.drop('lonlat', axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Postcode to lat/lon (WGS84)**\n",
    "*(for data from Energinet.dk)*\n",
    "\n",
    "The available post code in the original data provides an approximation for the geocoordinates of the solar power plants.<br>\n",
    "The postcode will be assigned to latitude and longitude coordinates with the help of the postcode table.\n",
    "\n",
    "** [geonames.org](http://download.geonames.org/export/zip/?C=N;O=D)** The postcode  data from Denmark is provided by Geonames and licensed under a [Creative Commons Attribution 3.0 license](http://creativecommons.org/licenses/by/3.0/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get geo-information\n",
    "zip_DK_geo = zipfile.ZipFile(download_and_cache(url_DK_geo))\n",
    "\n",
    "# Read generated postcode/location file\n",
    "DK_geo = pd.read_csv(zip_DK_geo.open('DK.txt'), sep='\\t', header=-1)\n",
    "\n",
    "# add column names as defined in associated readme file\n",
    "DK_geo.columns = ['country_code', 'postcode', 'place_name', 'admin_name1',\n",
    "                  'admin_code1', 'admin_name2', 'admin_code2', 'admin_name3',\n",
    "                  'admin_code3', 'lat', 'lon', 'accuracy']\n",
    "\n",
    "# Drop rows of possible duplicate postal_code\n",
    "DK_geo.drop_duplicates('postcode', keep='last', inplace=True)\n",
    "DK_geo['postcode'] = DK_geo['postcode'].astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add longitude/latitude infomation assigned by postcode (for Energinet.dk data)\n",
    "DK_solar_df = DK_solar_df.merge(DK_geo[['postcode', 'lon', 'lat']],\n",
    "                                on=['postcode'],\n",
    "                                how='left')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show number of units with missing coordinates seperated by wind and solar\n",
    "print('Missing Coordinates DK_wind ', DK_wind_df.lat.isnull().sum())\n",
    "print('Missing Coordinates DK_solar ', DK_solar_df.lat.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Merge DataFrames and choose columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Merge DataFrames for wind and solar into DK_renewables\n",
    "dataframes = [DK_wind_df, DK_solar_df]\n",
    "DK_renewables = pd.concat(dataframes)\n",
    "DK_renewables = DK_renewables.reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign energy source level 1 to the dataframe\n",
    "DK_renewables['energy_source_level_1'] = 'Renewable energy'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select those columns of the orignal data which are utilised further\n",
    "column_interest = ['commissioning_date', 'energy_source_level_1', 'energy_source_level_2',\n",
    "                   'technology', 'electrical_capacity_kW', 'dso', 'gsrn_id', 'postcode',\n",
    "                   'municipality_code', 'municipality', 'address', 'address_number',\n",
    "                   'utm_east', 'utm_north', 'lon', 'lat', 'hub_height',\n",
    "                   'rotor_diameter', 'manufacturer', 'model', 'data_source']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clean DataFrame from columns other than specified above\n",
    "DK_renewables = DK_renewables.loc[:, column_interest]\n",
    "DK_renewables.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transform electrical_capacity from kW to MW"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# kW to MW\n",
    "DK_renewables['electrical_capacity_kW'] /= 1000\n",
    "\n",
    "# adapt column name\n",
    "DK_renewables.rename(columns={'electrical_capacity_kW': 'electrical_capacity'},\n",
    "                     inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DK_renewables.to_pickle('intermediate/DK_renewables.pickle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## France FR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download and read\n",
    "The data which will be processed below is provided by the following data source:\n",
    "\n",
    "** [Ministry for the Ecological and Inclusive Transition](http://www.statistiques.developpement-durable.gouv.fr/energie-climat/r/energies-renouvelables.html?tx_ttnews%5Btt_news%5D=24638&cHash=d237bf9985fdca39d7d8c5dc84fb95f9)** - Number of installations and installed capacity of the different renewable source for every municipality in France. Service of observation and statistics, survey, date of last update: 15/12/2016. Data until 31/12/2015."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# point URLs to original data depending on the chosen download option\n",
    "if download_from == 'original_sources':\n",
    "\n",
    "    url_FR_gouv = \"http://www.statistiques.developpement-durable.gouv.fr/fileadmin/documents/Themes/Energies_et_climat/Les_differentes_energies/Energies_renouvelables/donnees_locales/2015/electricite-renouvelable-par-commune-2015.xls\"\n",
    "    url_FR_geo = 'http://public.opendatasoft.com/explore/dataset/code-postal-code-insee-2015/download/?format=csv&timezone=Europe/Berlin&use_labels_for_header=true'\n",
    "\n",
    "else:\n",
    "    url_FR_gouv = (url_opsd + version + folder +\n",
    "                   '/FR/electricite-renouvelable-par-commune-2015.xls')\n",
    "    url_FR_geo = (url_opsd + version + folder +\n",
    "                  'FR/code-postal-code-insee-2015.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get data of renewables per municipality\n",
    "FR_re_df = pd.read_excel(download_and_cache(url_FR_gouv),\n",
    "                         sheetname='Commune',\n",
    "                         encoding='UTF8',\n",
    "                         thousands='.',\n",
    "                         decimals=',',\n",
    "                         header=[3, 4],\n",
    "                         skipfooter=8,  # contains summarized values\n",
    "                         index_col=[0, 1],  # required for MultiIndex\n",
    "                         converters={'Code officiel géographique': str})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rearrange columns and translate column names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The French data source contains number of installations and sum of installed capacity per energy source per municipality. The structure is adapted to the power plant list of other countries. The list is limited to the plants which are covered by article 10 of february 2000 by an agreement to a purchase commitment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rearrange data\n",
    "FR_re_df.index.rename(['insee_com', 'municipality'], inplace=True)\n",
    "FR_re_df.columns.rename(['energy_source_level_2', None], inplace=True)\n",
    "FR_re_df = (FR_re_df\n",
    "            .stack(level='energy_source_level_2', dropna=False)\n",
    "            .reset_index(drop=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for France, create dictionary and show dictionary\n",
    "idx_FR = columnnames[columnnames['country'] == 'FR'].index\n",
    "column_dict_FR = columnnames.loc[idx_FR].set_index('original_name')['opsd_name'].to_dict()\n",
    "column_dict_FR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Translate columnnames\n",
    "FR_re_df.rename(columns=column_dict_FR, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop all rows that just contain NA\n",
    "FR_re_df = FR_re_df.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add data source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "FR_re_df['data_source'] = 'Ministry for the Ecological and Inclusive Transition'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate values and harmonize energy source level 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** Kept secret if number of installations < 3**\n",
    "\n",
    "If the number of installations is less than 3, it is marked with an _s_ instead of the number 1 or 2 due to statistical confidentiality ([further explanation by the data provider](http://www.statistiques.developpement-durable.gouv.fr/fileadmin/documents/Themes/Energies_et_climat/Les_differentes_energies/Energies_renouvelables/donnees_locales/2014/methodo-donnees-locales-electricte-renouvelable-12-2015-b.pdf)). Here, the _s_ is changed to _< 3_. This is done in the same step as the other value translations of the energy sources."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for France, create dictionary and show dictionary\n",
    "idx_FR = valuenames[valuenames['country'] == 'FR'].index\n",
    "value_dict_FR = valuenames.loc[idx_FR].set_index('original_name')['opsd_name'].to_dict()\n",
    "value_dict_FR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace all original value names by the OPSD value names\n",
    "FR_re_df.replace(value_dict_FR, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Separate and assign energy source level 1-3 and technology**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create dictionnary in order to assign energy_source to its subtype\n",
    "energy_source_dict_FR = valuenames.loc[idx_FR].set_index(\n",
    "    'opsd_name')['energy_source_level_2'].to_dict()\n",
    "\n",
    "# Column energy_source partly contains subtype information, thus this column is copied\n",
    "# to new column for energy_source_subtype...\n",
    "FR_re_df['technology'] = FR_re_df['energy_source_level_2']\n",
    "\n",
    "# ...and the energy source subtype values in the energy_source column are replaced by\n",
    "# the higher level classification\n",
    "FR_re_df['energy_source_level_2'].replace(energy_source_dict_FR, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign energy_source_level_1 to the dataframe\n",
    "FR_re_df['energy_source_level_1'] = 'Renewable energy'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "FR_re_df.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose energy source level 2 entries where energy source level 2 is Bioenergy in order to  \n",
    "# seperate Bioenergy subtypes to energy source level 3 and subtypes for the rest to technology\n",
    "idx_FR_Bioenergy = FR_re_df[FR_re_df['energy_source_level_2'] == 'Bioenergy'].index\n",
    "\n",
    "# Assign technology to energy source level 3  for all entries where energy source level 2 is  \n",
    "# Bioenergy and delete those entries from  technology\n",
    "FR_re_df[['energy_source_level_3']] = FR_re_df.iloc[idx_FR_Bioenergy][['technology']]\n",
    "FR_re_df.ix[idx_FR_Bioenergy,'technology'] = np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Georeferencing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Municipality (INSEE) code to lon/lat\n",
    "The available INSEE code in the original data provides a first approximation for the geocoordinates of the renewable power plants. The following data source is utilized for assigning INSEE code to coordinates of the municipalities:\n",
    "\n",
    "** [OpenDataSoft](http://public.opendatasoft.com/explore/dataset/code-postal-code-insee-2015/information/)** publishes a list of French INSEE codes and corresponding coordinates is published under the [Licence Ouverte (Etalab)](https://www.etalab.gouv.fr/licence-ouverte-open-licence)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Downlad French geo-information. As download_and_cache_function is not working\n",
    "# properly yet, thus other way of downloading\n",
    "filename = 'code-postal-insee-2015.csv'\n",
    "filepath = os.path.join('input', 'original_data', filename)\n",
    "if not os.path.exists(filepath):\n",
    "    print(\"Downloading file: \", filename)\n",
    "    FR_geo_csv = urllib.request.urlretrieve(url_FR_geo, filepath)\n",
    "else:\n",
    "    print(\"Using local file from\", filepath)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read INSEE Code Data\n",
    "FR_geo = pd.read_csv(filepath,\n",
    "                     sep=';',\n",
    "                     header=0,\n",
    "                     converters={'Code_postal': str})\n",
    "\n",
    "# Drop possible duplicates of the same INSEE code\n",
    "FR_geo.drop_duplicates('INSEE_COM', keep='last', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create columns for latitude/longitude\n",
    "lat = []\n",
    "lon = []\n",
    "\n",
    "# split in latitude/longitude\n",
    "for row in FR_geo['Geo Point']:\n",
    "    try:\n",
    "        # Split tuple format\n",
    "        # into the column lat and lon\n",
    "        row = row.lstrip('(').rstrip(')')\n",
    "        lat.append(row.split(',')[0])\n",
    "        lon.append(row.split(',')[1])\n",
    "    except:\n",
    "        # set NAN\n",
    "        lat.append(np.NaN)\n",
    "        lon.append(np.NaN)\n",
    "\n",
    "# add these columns to the INSEE DataFrame\n",
    "FR_geo['lat'] = pd.to_numeric(lat)\n",
    "FR_geo['lon'] = pd.to_numeric(lon)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column names of merge key have to be named identically\n",
    "FR_re_df.rename(columns={'municipality_code': 'INSEE_COM'}, inplace=True)\n",
    "\n",
    "# Merge longitude and latitude columns by the Code INSEE\n",
    "FR_re_df = FR_re_df.merge(FR_geo[['INSEE_COM', 'lat', 'lon']],\n",
    "                          on=['INSEE_COM'],\n",
    "                          how='left')\n",
    "\n",
    "# Translate Code INSEE column back to municipality_code\n",
    "FR_re_df.rename(columns={'INSEE_COM': 'municipality_code'}, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "FR_re_df.to_pickle('intermediate/FR_renewables.pickle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Poland PL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download and read\n",
    "The data which will be processed below is provided by the following data source:\n",
    "\n",
    "** [Urzad Regulacji Energetyki (URE) / Energy Regulatory Office](http://www.ure.gov.pl/uremapoze/mapa.html)** - Number of installations and installed capacity per energy source of renewable energy. Summed per powiat (districts) ."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The Polish data has to be downloaded manually \n",
    "if you have not chosen download_from = opsd_server.\n",
    "- Go to http://www.ure.gov.pl/uremapoze/mapa.html (requires Flash Player)\n",
    "- Click on the British flag in the lower right corner for Englisch version\n",
    "- Set detail to highest (to the right) in the upper right corner\n",
    "- Click on the printer symbol in the lower left corner\n",
    "- 'Generate', then the rtf-file simple.rtf will be downloaded\n",
    "- Put it in the folder input/original_data on your computer\n",
    "- If the download is temporarily not working choose download_from == 'opsd_server'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if download_from == 'opsd_server':\n",
    "    url_PL_ure = (url_opsd + version + folder + '/PL/simple.rtf')\n",
    "    download_and_cache(url_PL_ure)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read rtf-file to string with the correct encoding\n",
    "with open(os.path.join('input', 'original_data', 'simple.rtf'), 'r') as rtf:\n",
    "    file_content = rtf.read()\n",
    "\n",
    "file_content = file_content.encode('utf-8').decode('iso-8859-2')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rearrange data from rft-file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The rtf file has one table for each district in the rtf-file which needs to be separated from each and other and restructured to get all plants in one DataFrame with the information: district, energy_source, number_of_installations, installed_capacity. Thus in the following, the separating items are defined, the district tables split in parts, all put in one list and afterwards transferred to a pandas DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# a new line is separating all parts\n",
    "sep_split_into_parts = r'{\\fs12 \\f1 \\line }'\n",
    "# separates the table rows of each table\n",
    "sep_data_parts = r'\\trql'\n",
    "\n",
    "reg_exp_district = r'(?<=Powiat:).*(?=})'\n",
    "\n",
    "reg_exp_installation_type = (\n",
    "    r'(?<=\\\\fs12 \\\\f1 \\\\pard \\\\intbl \\\\ql \\\\cbpat[2|3|4] \\{\\\\fs12 \\\\f1  ).*(?=\\})')\n",
    "reg_exp_installation_value = (\n",
    "    r'(?<=\\\\fs12 \\\\f1 \\\\pard \\\\intbl \\\\qr \\\\cbpat[3|4] \\{\\\\fs12 \\\\f1 ).*(?=})')\n",
    "\n",
    "# split file into parts\n",
    "parts = file_content.split(sep_split_into_parts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# list containing the data\n",
    "data_set = []\n",
    "for part in parts:\n",
    "    # match district\n",
    "    district = re.findall(reg_exp_district, part)\n",
    "    if len(district) == 0:\n",
    "        pass\n",
    "    else:\n",
    "        district = district[0].lstrip()\n",
    "        # separate each part\n",
    "        data_parts = part.split(sep_data_parts)\n",
    "        # data structure:\n",
    "        # data_row = {'district': '', 'install_type': '', 'quantity': '', 'power': ''}\n",
    "        for data_rows in data_parts:\n",
    "            wrapper_list = []\n",
    "            # match each installation type\n",
    "            installation_type = re.findall(reg_exp_installation_type, data_rows)\n",
    "            for inst_type in installation_type:\n",
    "                wrapper_list.append({'district': district, 'technology': inst_type})\n",
    "            # match data - contains twice as many entries as\n",
    "            # installation type (quantity, power vs. install type)\n",
    "            data_values = re.findall(reg_exp_installation_value, data_rows)\n",
    "            if len(data_values) == 0:\n",
    "                #log.debug('data values empty')\n",
    "                pass\n",
    "            else:\n",
    "                # connect data\n",
    "                for i, _ in enumerate(wrapper_list):\n",
    "                    wrapper_list[i]['number_of_installations'] = data_values[(\n",
    "                        i * 2)]\n",
    "                    wrapper_list[i]['electrical_capacity'] = data_values[(\n",
    "                        i * 2) + 1]\n",
    "\n",
    "                # prepare to write to file\n",
    "                for data in wrapper_list:\n",
    "                    data_set.append(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# mapping of malformed unicode which appear in the Polish district names\n",
    "polish_truncated_unicode_map = {\n",
    "    r'\\uc0\\u322': 'ł',\n",
    "    r'\\uc0\\u380': 'ż',\n",
    "    r'\\uc0\\u243': 'ó',\n",
    "    r'\\uc0\\u347': 'ś',\n",
    "    r'\\uc0\\u324': 'ń',\n",
    "    r'\\uc0\\u261': 'ą',\n",
    "    r'\\uc0\\u281': 'ę',\n",
    "    r'\\uc0\\u263': 'ć',\n",
    "    r'\\uc0\\u321': 'Ł',\n",
    "    r'\\uc0\\u378': 'ź',\n",
    "    r'\\uc0\\u346': 'Ś',\n",
    "    r'\\uc0\\u379': 'Ż'\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# changing malformed unicode\n",
    "for entry in data_set:\n",
    "    while r'\\u' in entry['district']:\n",
    "        index = entry['district'].index(r'\\u')\n",
    "        offset = index + 9\n",
    "        to_be_replaced = entry['district'][index:offset]\n",
    "        if to_be_replaced in polish_truncated_unicode_map.keys():\n",
    "            # offset + 1 because there is a trailing whitespace\n",
    "            entry['district'] = entry['district'].replace(entry['district'][index:offset + 1],\n",
    "                                                          polish_truncated_unicode_map[to_be_replaced])\n",
    "        else:\n",
    "            break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create pandas DataFrame with similar structure as the other countries\n",
    "PL_re_df = pd.DataFrame(data_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add data source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PL_re_df['data_source'] = 'Urzad Regulacji Energetyki'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate values and harmonize energy source level 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for France, create dictionary and show dictionary\n",
    "idx_PL = valuenames[valuenames['country'] == 'PL'].index\n",
    "value_dict_PL = valuenames.loc[idx_PL].set_index('original_name')['opsd_name'].to_dict()\n",
    "value_dict_PL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace all original value names by the OPSD value names\n",
    "PL_re_df.technology.replace(value_dict_PL, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Assign energy source level 1-3 and technology**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Assign energy_source_level_1 to the dataframe\n",
    "PL_re_df['energy_source_level_1'] = 'Renewable energy'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create dictionnary in order to assign energy_source to its subtype\n",
    "energy_source_dict_PL = valuenames.loc[idx_PL].set_index(\n",
    "    'opsd_name')['energy_source_level_2'].to_dict()\n",
    "\n",
    "# Create new column for energy_source\n",
    "PL_re_df['energy_source_level_2'] = PL_re_df.technology\n",
    "\n",
    "# Fill this with the energy source instead of subtype information\n",
    "PL_re_df.energy_source_level_2.replace(energy_source_dict_PL, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose energy_source entries where energy_source is \"Bioenergy\" in order to  \n",
    "# seperate Bioenergy subtypes to \"energy_source_level_3\" and subtypes for the rest to \"technology\"\n",
    "idx_PL_Bioenergy = PL_re_df[PL_re_df['energy_source_level_2'] == 'Bioenergy'].index\n",
    "\n",
    "# Assign technology to \"energy_source_level_3\" for all entries where energy_source_level_2 is\n",
    "# \"Bioenergy\" and delete those entries from \"technology\"\n",
    "PL_re_df[['energy_source_level_3']] = PL_re_df.iloc[idx_PL_Bioenergy][['technology']]\n",
    "PL_re_df.ix[idx_PL_Bioenergy, 'technology'] = np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** Adjust datatype of numeric columns**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# change type to numeric\n",
    "PL_re_df['electrical_capacity'] = pd.to_numeric(PL_re_df['electrical_capacity'])\n",
    "# Additionally commas are deleted\n",
    "PL_re_df['number_of_installations'] = pd.to_numeric(\n",
    "    PL_re_df['number_of_installations'].str.replace(',', ''))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Aggregate**\n",
    "\n",
    "For entries/rows of the same district and energy source level 2, electrical capacity and number of installations are aggregaated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PL_re_df = PL_re_df.groupby(['district', 'energy_source_level_2', 'technology'],\n",
    "                            as_index=False\n",
    "                            ).agg({'electrical_capacity': sum,\n",
    "                                   'number_of_installations': sum,\n",
    "                                   'data_source': 'first'})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PL_re_df.to_pickle('intermediate/PL_renewables.pickle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Switzerland CH"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download and read\n",
    "The data which will be processed below is provided by the following data source:\n",
    "\n",
    "** [Swiss Federal Office of Energy](http://www.bfe.admin.ch/themen/00612/02073/index.html?dossier_id=02166&lang=de)** - Data of all renewable power plants receiving \"Kostendeckende Einspeisevergütung\" (KEV) which is the Swiss feed in tarif for renewable power plants. \n",
    "Geodata is based on municipality codes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# point URLs to original data depending on the chosen download option\n",
    "if download_from == 'original_sources':\n",
    "    \n",
    "    url_CH_KEV= \"http://www.bfe.admin.ch/php/modules/publikationen/stream.php?extlang=de&name=de_620208649.xlsx&endung=Liste%20aller%20KEV-Bez%FCger%20im%20Jahr%202016\"\n",
    "    url_CH_KEV='KEV-Bezüger_2016_Publikation.xlsx'\n",
    "    url_CH_geo = 'http://download.geonames.org/export/zip/CH.zip'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get data of renewables per municipality\n",
    "CH_re_df = pd.read_excel(download_and_cache(url_CH_KEV),\n",
    "                         sheetname='KEV-Bezüger 2016',\n",
    "                         encoding='UTF8',\n",
    "                         thousands='.',\n",
    "                         decimals=','\n",
    "                         #header=[0]\n",
    "                         #skipfooter=9,  # contains summarized values\n",
    "                         #index_col=[0, 1], # required for MultiIndex\n",
    "                         #converters={'Code officiel géographique':str}\n",
    "                         )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for Switzerland, create dictionary and show dictionary\n",
    "idx_CH = columnnames[columnnames['country'] == 'CH'].index\n",
    "column_dict_CH = columnnames.loc[idx_CH].set_index('original_name')['opsd_name'].to_dict()\n",
    "column_dict_CH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Translate columnnames\n",
    "CH_re_df.rename(columns=column_dict_CH, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop all rows that just contain NA\n",
    "CH_re_df = CH_re_df.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add data source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CH_re_df['data_source'] = 'BFE'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate values and harmonize energy source level 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for France, create dictionary and show dictionary\n",
    "idx_CH = valuenames[valuenames['country'] == 'CH'].index\n",
    "value_dict_CH = valuenames.loc[idx_CH].set_index('original_name')['opsd_name'].to_dict()\n",
    "value_dict_CH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace all original value names by the OPSD value names\n",
    "CH_re_df.replace(value_dict_CH, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Separate and assign energy source level 1-3 and technology**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create dictionnary in order to assign energy_source to its subtype\n",
    "energy_source_dict_CH = valuenames.loc[idx_CH].set_index('opsd_name')['energy_source_level_2'].to_dict()\n",
    "\n",
    "# ...and the energy source subtype values in the energy_source column are replaced by \n",
    "# the higher level classification\n",
    "CH_re_df['energy_source_level_2'].replace(energy_source_dict_CH, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign energy_source_level_1 to the dataframe\n",
    "CH_re_df['energy_source_level_1'] = 'Renewable energy'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CH_re_df.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Georeferencing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Postcode to lat/lon (WGS84)**\n",
    "\n",
    "The available municipality code in the original data provides an approximation for the geocoordinates of the renewable power plants.<br>\n",
    "The postcode will be assigned to latitude and longitude coordinates with the help of the postcode table.\n",
    "\n",
    "** [geonames.org](http://download.geonames.org/export/zip/?C=N;O=D)** The postcode  data from Switzerland is provided by Geonames and licensed under a [Creative Commons Attribution 3.0 license](http://creativecommons.org/licenses/by/3.0/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get geo-information\n",
    "zip_CH_geo = zipfile.ZipFile(download_and_cache(url_CH_geo))\n",
    "\n",
    "# Read generated postcode/location file\n",
    "CH_geo = pd.read_csv(zip_CH_geo.open('CH.txt'), sep='\\t', header=-1)\n",
    "\n",
    "# add column names as defined in associated readme file\n",
    "CH_geo.columns = ['country_code', 'postcode', 'place_name', 'admin_name1',\n",
    "                  'admin_code1', 'admin_name2', 'admin_code2', 'admin_name3',\n",
    "                  'admin_code3', 'lat', 'lon', 'accuracy']\n",
    "\n",
    "# Drop rows of possible duplicate postal_code\n",
    "CH_geo.drop_duplicates('postcode', keep='last', inplace=True)\n",
    "CH_geo['postcode'] = CH_geo['postcode'].astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# harmonise data class \n",
    "CH_geo.postcode = CH_geo.postcode.astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add longitude/latitude infomation assigned by municipality code\n",
    "CH_re_df = pd.merge(CH_re_df,\n",
    "                    CH_geo[['lat', 'lon', 'postcode']],\n",
    "                    left_on='municipality_code',\n",
    "                    right_on='postcode',\n",
    "                    how='left'\n",
    "                    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transform electrical_capacity from kW to MW"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# kW to MW\n",
    "CH_re_df['electrical_capacity'] /= 1000\n",
    "\n",
    "# kWh to MWh\n",
    "CH_re_df['production'] /= 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CH_re_df.to_pickle('intermediate/CH_renewables.pickle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check and validation of the renewable power plants list as well as the creation of CSV/XLSX/SQLite files can be found in Part 2 of this script. It also generates a daily time series of cumulated installed capacities by energy source."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "latex_envs": {
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 0
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {
    "height": "716px",
    "left": "104px",
    "top": "280px",
    "width": "231px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}