{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"width:100%; background-color: #D9EDF7; border: 1px solid #CFCFCF; text-align: left; padding: 10px;\">\n",
    "      <b>Renewable power plants: Download and process notebook</b>\n",
    "      <ul>\n",
    "        <li><a href=\"main.ipynb\">Main notebook</a></li>\n",
    "        <li>Download and process notebook</li>\n",
    "        <li><a href=\"validation_and_output.ipynb\">Validation and output notebook</a></li>\n",
    "      </ul>\n",
    "      <br>This notebook is part of the <a href=\"http://data.open-power-system-data.org/renewable_power_plants\"> Renewable power plants Data Package</a> of <a href=\"http://open-power-system-data.org\">Open Power System Data</a>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "This script downlads and extracts the original data of renewable power plant lists from the data sources, processes and merges them. It subsequently adds the geolocation for each power plant. Finally it saves the DataFrames as pickle-files. Make sure you run the download and process Notebook before the validation and output Notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Script-setup\" data-toc-modified-id=\"Script-setup-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Script setup</a></span></li><li><span><a href=\"#Settings\" data-toc-modified-id=\"Settings-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Settings</a></span><ul class=\"toc-item\"><li><span><a href=\"#Choose-download-option\" data-toc-modified-id=\"Choose-download-option-2.1\"><span class=\"toc-item-num\">2.1&nbsp;&nbsp;</span>Choose download option</a></span></li><li><span><a href=\"#Update-the-download-links\" data-toc-modified-id=\"Update-the-download-links-2.2\"><span class=\"toc-item-num\">2.2&nbsp;&nbsp;</span>Update the download links</a></span></li><li><span><a href=\"#Set-up-the-downloader-for-data-sources\" data-toc-modified-id=\"Set-up-the-downloader-for-data-sources-2.3\"><span class=\"toc-item-num\">2.3&nbsp;&nbsp;</span>Set up the downloader for data sources</a></span></li><li><span><a href=\"#Set-up-the-NUTS-converter\" data-toc-modified-id=\"Set-up-the-NUTS-converter-2.4\"><span class=\"toc-item-num\">2.4&nbsp;&nbsp;</span>Set up the NUTS converter</a></span></li><li><span><a href=\"#Setup-translation-dictionaries\" data-toc-modified-id=\"Setup-translation-dictionaries-2.5\"><span class=\"toc-item-num\">2.5&nbsp;&nbsp;</span>Setup translation dictionaries</a></span></li></ul></li><li><span><a href=\"#Download-and-process-per-country\" data-toc-modified-id=\"Download-and-process-per-country-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Download and process per country</a></span><ul class=\"toc-item\"><li><span><a href=\"#Germany-DE\" data-toc-modified-id=\"Germany-DE-3.1\"><span class=\"toc-item-num\">3.1&nbsp;&nbsp;</span>Germany DE</a></span><ul class=\"toc-item\"><li><span><a href=\"#Download-and-read\" data-toc-modified-id=\"Download-and-read-3.1.1\"><span class=\"toc-item-num\">3.1.1&nbsp;&nbsp;</span>Download and read</a></span></li><li><span><a href=\"#Translate-column-names\" data-toc-modified-id=\"Translate-column-names-3.1.2\"><span class=\"toc-item-num\">3.1.2&nbsp;&nbsp;</span>Translate column names</a></span></li><li><span><a href=\"#Add-information-and-choose-columns\" data-toc-modified-id=\"Add-information-and-choose-columns-3.1.3\"><span class=\"toc-item-num\">3.1.3&nbsp;&nbsp;</span>Add information and choose columns</a></span></li><li><span><a href=\"#Merge-DataFrames\" data-toc-modified-id=\"Merge-DataFrames-3.1.4\"><span class=\"toc-item-num\">3.1.4&nbsp;&nbsp;</span>Merge DataFrames</a></span></li><li><span><a href=\"#Translate-values-and-harmonize-energy-source-level-2\" data-toc-modified-id=\"Translate-values-and-harmonize-energy-source-level-2-3.1.5\"><span class=\"toc-item-num\">3.1.5&nbsp;&nbsp;</span>Translate values and harmonize energy source level 2</a></span></li><li><span><a href=\"#Transform-electrical-capacity-from-kW-to-MW\" data-toc-modified-id=\"Transform-electrical-capacity-from-kW-to-MW-3.1.6\"><span class=\"toc-item-num\">3.1.6&nbsp;&nbsp;</span>Transform electrical capacity from kW to MW</a></span></li><li><span><a href=\"#Georeferencing\" data-toc-modified-id=\"Georeferencing-3.1.7\"><span class=\"toc-item-num\">3.1.7&nbsp;&nbsp;</span>Georeferencing</a></span><ul class=\"toc-item\"><li><span><a href=\"#Get-coordinates-by-postcode\" data-toc-modified-id=\"Get-coordinates-by-postcode-3.1.7.1\"><span class=\"toc-item-num\">3.1.7.1&nbsp;&nbsp;</span>Get coordinates by postcode</a></span></li><li><span><a href=\"#Transform-geoinformation\" data-toc-modified-id=\"Transform-geoinformation-3.1.7.2\"><span class=\"toc-item-num\">3.1.7.2&nbsp;&nbsp;</span>Transform geoinformation</a></span></li></ul></li><li><span><a href=\"#Clean-data\" data-toc-modified-id=\"Clean-data-3.1.8\"><span class=\"toc-item-num\">3.1.8&nbsp;&nbsp;</span>Clean data</a></span></li><li><span><a href=\"#Assign-NUTS-codes\" data-toc-modified-id=\"Assign-NUTS-codes-3.1.9\"><span class=\"toc-item-num\">3.1.9&nbsp;&nbsp;</span>Assign NUTS codes</a></span></li><li><span><a href=\"#Visualize\" data-toc-modified-id=\"Visualize-3.1.10\"><span class=\"toc-item-num\">3.1.10&nbsp;&nbsp;</span>Visualize</a></span></li><li><span><a href=\"#Save\" data-toc-modified-id=\"Save-3.1.11\"><span class=\"toc-item-num\">3.1.11&nbsp;&nbsp;</span>Save</a></span></li></ul></li><li><span><a href=\"#Denmark-DK\" data-toc-modified-id=\"Denmark-DK-3.2\"><span class=\"toc-item-num\">3.2&nbsp;&nbsp;</span>Denmark DK</a></span><ul class=\"toc-item\"><li><span><a href=\"#Download-and-read\" data-toc-modified-id=\"Download-and-read-3.2.1\"><span class=\"toc-item-num\">3.2.1&nbsp;&nbsp;</span>Download and read</a></span></li><li><span><a href=\"#Translate-column-names\" data-toc-modified-id=\"Translate-column-names-3.2.2\"><span class=\"toc-item-num\">3.2.2&nbsp;&nbsp;</span>Translate column names</a></span></li><li><span><a href=\"#Add-data-source-and-missing-information\" data-toc-modified-id=\"Add-data-source-and-missing-information-3.2.3\"><span class=\"toc-item-num\">3.2.3&nbsp;&nbsp;</span>Add data source and missing information</a></span></li><li><span><a href=\"#Correct-the-dates\" data-toc-modified-id=\"Correct-the-dates-3.2.4\"><span class=\"toc-item-num\">3.2.4&nbsp;&nbsp;</span>Correct the dates</a></span></li><li><span><a href=\"#Translate-values-and-harmonize-energy-source-level-2\" data-toc-modified-id=\"Translate-values-and-harmonize-energy-source-level-2-3.2.5\"><span class=\"toc-item-num\">3.2.5&nbsp;&nbsp;</span>Translate values and harmonize energy source level 2</a></span></li><li><span><a href=\"#Georeferencing\" data-toc-modified-id=\"Georeferencing-3.2.6\"><span class=\"toc-item-num\">3.2.6&nbsp;&nbsp;</span>Georeferencing</a></span></li><li><span><a href=\"#Merge-DataFrames,-add-NUTS-information-and-choose-columns\" data-toc-modified-id=\"Merge-DataFrames,-add-NUTS-information-and-choose-columns-3.2.7\"><span class=\"toc-item-num\">3.2.7&nbsp;&nbsp;</span>Merge DataFrames, add NUTS information and choose columns</a></span></li><li><span><a href=\"#Select-columns\" data-toc-modified-id=\"Select-columns-3.2.8\"><span class=\"toc-item-num\">3.2.8&nbsp;&nbsp;</span>Select columns</a></span></li><li><span><a href=\"#Remove-duplicate-rows\" data-toc-modified-id=\"Remove-duplicate-rows-3.2.9\"><span class=\"toc-item-num\">3.2.9&nbsp;&nbsp;</span>Remove duplicate rows</a></span></li><li><span><a href=\"#Transform-electrical_capacity-from-kW-to-MW\" data-toc-modified-id=\"Transform-electrical_capacity-from-kW-to-MW-3.2.10\"><span class=\"toc-item-num\">3.2.10&nbsp;&nbsp;</span>Transform electrical_capacity from kW to MW</a></span></li><li><span><a href=\"#Visualize\" data-toc-modified-id=\"Visualize-3.2.11\"><span class=\"toc-item-num\">3.2.11&nbsp;&nbsp;</span>Visualize</a></span></li><li><span><a href=\"#Save\" data-toc-modified-id=\"Save-3.2.12\"><span class=\"toc-item-num\">3.2.12&nbsp;&nbsp;</span>Save</a></span></li></ul></li><li><span><a href=\"#France-FR\" data-toc-modified-id=\"France-FR-3.3\"><span class=\"toc-item-num\">3.3&nbsp;&nbsp;</span>France FR</a></span><ul class=\"toc-item\"><li><span><a href=\"#ODRE-data\" data-toc-modified-id=\"ODRE-data-3.3.1\"><span class=\"toc-item-num\">3.3.1&nbsp;&nbsp;</span>ODRE data</a></span><ul class=\"toc-item\"><li><span><a href=\"#Load-the-data\" data-toc-modified-id=\"Load-the-data-3.3.1.1\"><span class=\"toc-item-num\">3.3.1.1&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href=\"#Translate-column-names\" data-toc-modified-id=\"Translate-column-names-3.3.1.2\"><span class=\"toc-item-num\">3.3.1.2&nbsp;&nbsp;</span>Translate column names</a></span></li><li><span><a href=\"#Add-data-source\" data-toc-modified-id=\"Add-data-source-3.3.1.3\"><span class=\"toc-item-num\">3.3.1.3&nbsp;&nbsp;</span>Add data source</a></span></li><li><span><a href=\"#Translate-values\" data-toc-modified-id=\"Translate-values-3.3.1.4\"><span class=\"toc-item-num\">3.3.1.4&nbsp;&nbsp;</span>Translate values</a></span></li><li><span><a href=\"#Correct-site-names\" data-toc-modified-id=\"Correct-site-names-3.3.1.5\"><span class=\"toc-item-num\">3.3.1.5&nbsp;&nbsp;</span>Correct site names</a></span></li><li><span><a href=\"#Replace-suspicious-dates-with-N/A\" data-toc-modified-id=\"Replace-suspicious-dates-with-N/A-3.3.1.6\"><span class=\"toc-item-num\">3.3.1.6&nbsp;&nbsp;</span>Replace suspicious dates with N/A</a></span></li><li><span><a href=\"#Check-missing-values\" data-toc-modified-id=\"Check-missing-values-3.3.1.7\"><span class=\"toc-item-num\">3.3.1.7&nbsp;&nbsp;</span>Check missing values</a></span></li><li><span><a href=\"#Standardize-the-energy-types-and-technologies\" data-toc-modified-id=\"Standardize-the-energy-types-and-technologies-3.3.1.8\"><span class=\"toc-item-num\">3.3.1.8&nbsp;&nbsp;</span>Standardize the energy types and technologies</a></span></li><li><span><a href=\"#Standardize-source-levels-1-3-and-technology\" data-toc-modified-id=\"Standardize-source-levels-1-3-and-technology-3.3.1.9\"><span class=\"toc-item-num\">3.3.1.9&nbsp;&nbsp;</span>Standardize source levels 1-3 and technology</a></span></li><li><span><a href=\"#Georeferencing\" data-toc-modified-id=\"Georeferencing-3.3.1.10\"><span class=\"toc-item-num\">3.3.1.10&nbsp;&nbsp;</span>Georeferencing</a></span></li><li><span><a href=\"#Convert-electrical-capacity-to-MW\" data-toc-modified-id=\"Convert-electrical-capacity-to-MW-3.3.1.11\"><span class=\"toc-item-num\">3.3.1.11&nbsp;&nbsp;</span>Convert electrical capacity to MW</a></span></li></ul></li><li><span><a href=\"#Old-data\" data-toc-modified-id=\"Old-data-3.3.2\"><span class=\"toc-item-num\">3.3.2&nbsp;&nbsp;</span>Old data</a></span><ul class=\"toc-item\"><li><span><a href=\"#Add-data-source\" data-toc-modified-id=\"Add-data-source-3.3.2.1\"><span class=\"toc-item-num\">3.3.2.1&nbsp;&nbsp;</span>Add data source</a></span></li><li><span><a href=\"#Translate-values-and-harmonize-energy-source-level-2\" data-toc-modified-id=\"Translate-values-and-harmonize-energy-source-level-2-3.3.2.2\"><span class=\"toc-item-num\">3.3.2.2&nbsp;&nbsp;</span>Translate values and harmonize energy source level 2</a></span></li><li><span><a href=\"#Georeferencing\" data-toc-modified-id=\"Georeferencing-3.3.2.3\"><span class=\"toc-item-num\">3.3.2.3&nbsp;&nbsp;</span>Georeferencing</a></span></li></ul></li><li><span><a href=\"#Integrate-old-and-new-data\" data-toc-modified-id=\"Integrate-old-and-new-data-3.3.3\"><span class=\"toc-item-num\">3.3.3&nbsp;&nbsp;</span>Integrate old and new data</a></span></li><li><span><a href=\"#Select-the-columns\" data-toc-modified-id=\"Select-the-columns-3.3.4\"><span class=\"toc-item-num\">3.3.4&nbsp;&nbsp;</span>Select the columns</a></span></li><li><span><a href=\"#Visualize\" data-toc-modified-id=\"Visualize-3.3.5\"><span class=\"toc-item-num\">3.3.5&nbsp;&nbsp;</span>Visualize</a></span></li><li><span><a href=\"#Save\" data-toc-modified-id=\"Save-3.3.6\"><span class=\"toc-item-num\">3.3.6&nbsp;&nbsp;</span>Save</a></span></li></ul></li><li><span><a href=\"#Poland-PL\" data-toc-modified-id=\"Poland-PL-3.4\"><span class=\"toc-item-num\">3.4&nbsp;&nbsp;</span>Poland PL</a></span><ul class=\"toc-item\"><li><span><a href=\"#Download\" data-toc-modified-id=\"Download-3.4.1\"><span class=\"toc-item-num\">3.4.1&nbsp;&nbsp;</span>Download</a></span></li><li><span><a href=\"#Load-and-explore-the-data\" data-toc-modified-id=\"Load-and-explore-the-data-3.4.2\"><span class=\"toc-item-num\">3.4.2&nbsp;&nbsp;</span>Load and explore the data</a></span></li><li><span><a href=\"#Inspect-the-data\" data-toc-modified-id=\"Inspect-the-data-3.4.3\"><span class=\"toc-item-num\">3.4.3&nbsp;&nbsp;</span>Inspect the data</a></span></li><li><span><a href=\"#Harmonising-energy-levels\" data-toc-modified-id=\"Harmonising-energy-levels-3.4.4\"><span class=\"toc-item-num\">3.4.4&nbsp;&nbsp;</span>Harmonising energy levels</a></span></li><li><span><a href=\"#Georeferencing-(NUTS-classification)\" data-toc-modified-id=\"Georeferencing-(NUTS-classification)-3.4.5\"><span class=\"toc-item-num\">3.4.5&nbsp;&nbsp;</span>Georeferencing (NUTS classification)</a></span><ul class=\"toc-item\"><li><span><a href=\"#Add-NUTS-information\" data-toc-modified-id=\"Add-NUTS-information-3.4.5.1\"><span class=\"toc-item-num\">3.4.5.1&nbsp;&nbsp;</span>Add NUTS information</a></span></li></ul></li><li><span><a href=\"#Add-data-source-and-year\" data-toc-modified-id=\"Add-data-source-and-year-3.4.6\"><span class=\"toc-item-num\">3.4.6&nbsp;&nbsp;</span>Add data source and year</a></span></li><li><span><a href=\"#Select-columns\" data-toc-modified-id=\"Select-columns-3.4.7\"><span class=\"toc-item-num\">3.4.7&nbsp;&nbsp;</span>Select columns</a></span></li><li><span><a href=\"#Save\" data-toc-modified-id=\"Save-3.4.8\"><span class=\"toc-item-num\">3.4.8&nbsp;&nbsp;</span>Save</a></span></li></ul></li><li><span><a href=\"#Switzerland-CH\" data-toc-modified-id=\"Switzerland-CH-3.5\"><span class=\"toc-item-num\">3.5&nbsp;&nbsp;</span>Switzerland CH</a></span><ul class=\"toc-item\"><li><span><a href=\"#Download-and-read\" data-toc-modified-id=\"Download-and-read-3.5.1\"><span class=\"toc-item-num\">3.5.1&nbsp;&nbsp;</span>Download and read</a></span></li><li><span><a href=\"#Translate-column-names\" data-toc-modified-id=\"Translate-column-names-3.5.2\"><span class=\"toc-item-num\">3.5.2&nbsp;&nbsp;</span>Translate column names</a></span></li><li><span><a href=\"#Add-data-source\" data-toc-modified-id=\"Add-data-source-3.5.3\"><span class=\"toc-item-num\">3.5.3&nbsp;&nbsp;</span>Add data source</a></span></li><li><span><a href=\"#Harmonize-energy-source-hierarchy-and-translate-values\" data-toc-modified-id=\"Harmonize-energy-source-hierarchy-and-translate-values-3.5.4\"><span class=\"toc-item-num\">3.5.4&nbsp;&nbsp;</span>Harmonize energy source hierarchy and translate values</a></span></li><li><span><a href=\"#Georeferencing\" data-toc-modified-id=\"Georeferencing-3.5.5\"><span class=\"toc-item-num\">3.5.5&nbsp;&nbsp;</span>Georeferencing</a></span><ul class=\"toc-item\"><li><span><a href=\"#Postcode-to-lat/lon-(WGS84)\" data-toc-modified-id=\"Postcode-to-lat/lon-(WGS84)-3.5.5.1\"><span class=\"toc-item-num\">3.5.5.1&nbsp;&nbsp;</span>Postcode to lat/lon (WGS84)</a></span></li><li><span><a href=\"#Add-NUTS-information\" data-toc-modified-id=\"Add-NUTS-information-3.5.5.2\"><span class=\"toc-item-num\">3.5.5.2&nbsp;&nbsp;</span>Add NUTS information</a></span></li></ul></li><li><span><a href=\"#Transform-electrical_capacity-from-kW-to-MW\" data-toc-modified-id=\"Transform-electrical_capacity-from-kW-to-MW-3.5.6\"><span class=\"toc-item-num\">3.5.6&nbsp;&nbsp;</span>Transform electrical_capacity from kW to MW</a></span></li><li><span><a href=\"#Select-columns-to-keep\" data-toc-modified-id=\"Select-columns-to-keep-3.5.7\"><span class=\"toc-item-num\">3.5.7&nbsp;&nbsp;</span>Select columns to keep</a></span></li><li><span><a href=\"#Visualize\" data-toc-modified-id=\"Visualize-3.5.8\"><span class=\"toc-item-num\">3.5.8&nbsp;&nbsp;</span>Visualize</a></span></li><li><span><a href=\"#Save\" data-toc-modified-id=\"Save-3.5.9\"><span class=\"toc-item-num\">3.5.9&nbsp;&nbsp;</span>Save</a></span></li></ul></li><li><span><a href=\"#United-Kingdom-UK\" data-toc-modified-id=\"United-Kingdom-UK-3.6\"><span class=\"toc-item-num\">3.6&nbsp;&nbsp;</span>United Kingdom UK</a></span><ul class=\"toc-item\"><li><span><a href=\"#Download-and-Read\" data-toc-modified-id=\"Download-and-Read-3.6.1\"><span class=\"toc-item-num\">3.6.1&nbsp;&nbsp;</span>Download and Read</a></span></li><li><span><a href=\"#Clean-the-data\" data-toc-modified-id=\"Clean-the-data-3.6.2\"><span class=\"toc-item-num\">3.6.2&nbsp;&nbsp;</span>Clean the data</a></span></li><li><span><a href=\"#Translate-column-names\" data-toc-modified-id=\"Translate-column-names-3.6.3\"><span class=\"toc-item-num\">3.6.3&nbsp;&nbsp;</span>Translate column names</a></span></li><li><span><a href=\"#Add-data-source\" data-toc-modified-id=\"Add-data-source-3.6.4\"><span class=\"toc-item-num\">3.6.4&nbsp;&nbsp;</span>Add data source</a></span></li><li><span><a href=\"#Translate-values-and-harmonise-energy-source-levels-1-3-and-technology\" data-toc-modified-id=\"Translate-values-and-harmonise-energy-source-levels-1-3-and-technology-3.6.5\"><span class=\"toc-item-num\">3.6.5&nbsp;&nbsp;</span>Translate values and harmonise energy source levels 1-3 and technology</a></span></li><li><span><a href=\"#Georeferencing\" data-toc-modified-id=\"Georeferencing-3.6.6\"><span class=\"toc-item-num\">3.6.6&nbsp;&nbsp;</span>Georeferencing</a></span><ul class=\"toc-item\"><li><span><a href=\"#Cases-with-unknown-Easting-and-Northing-coordinates\" data-toc-modified-id=\"Cases-with-unknown-Easting-and-Northing-coordinates-3.6.6.1\"><span class=\"toc-item-num\">3.6.6.1&nbsp;&nbsp;</span>Cases with unknown Easting and Northing coordinates</a></span></li><li><span><a href=\"#Cases-for-approximation\" data-toc-modified-id=\"Cases-for-approximation-3.6.6.2\"><span class=\"toc-item-num\">3.6.6.2&nbsp;&nbsp;</span>Cases for approximation</a></span></li><li><span><a href=\"#Add-NUTS-information\" data-toc-modified-id=\"Add-NUTS-information-3.6.6.3\"><span class=\"toc-item-num\">3.6.6.3&nbsp;&nbsp;</span>Add NUTS information</a></span></li><li><span><a href=\"#Visualize-the-data\" data-toc-modified-id=\"Visualize-the-data-3.6.6.4\"><span class=\"toc-item-num\">3.6.6.4&nbsp;&nbsp;</span>Visualize the data</a></span></li></ul></li><li><span><a href=\"#Keep-only-the-columns-of-interest\" data-toc-modified-id=\"Keep-only-the-columns-of-interest-3.6.7\"><span class=\"toc-item-num\">3.6.7&nbsp;&nbsp;</span>Keep only the columns of interest</a></span></li><li><span><a href=\"#Save\" data-toc-modified-id=\"Save-3.6.8\"><span class=\"toc-item-num\">3.6.8&nbsp;&nbsp;</span>Save</a></span></li></ul></li><li><span><a href=\"#Sweden\" data-toc-modified-id=\"Sweden-3.7\"><span class=\"toc-item-num\">3.7&nbsp;&nbsp;</span>Sweden</a></span><ul class=\"toc-item\"><li><span><a href=\"#Load-the-data\" data-toc-modified-id=\"Load-the-data-3.7.1\"><span class=\"toc-item-num\">3.7.1&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href=\"#Clean-the-data\" data-toc-modified-id=\"Clean-the-data-3.7.2\"><span class=\"toc-item-num\">3.7.2&nbsp;&nbsp;</span>Clean the data</a></span></li><li><span><a href=\"#Translate-column-names\" data-toc-modified-id=\"Translate-column-names-3.7.3\"><span class=\"toc-item-num\">3.7.3&nbsp;&nbsp;</span>Translate column names</a></span></li><li><span><a href=\"#Correct-the-dates\" data-toc-modified-id=\"Correct-the-dates-3.7.4\"><span class=\"toc-item-num\">3.7.4&nbsp;&nbsp;</span>Correct the dates</a></span></li><li><span><a href=\"#Add-source\" data-toc-modified-id=\"Add-source-3.7.5\"><span class=\"toc-item-num\">3.7.5&nbsp;&nbsp;</span>Add source</a></span></li><li><span><a href=\"#Translate-values-and-harmonize-energy-source-levels\" data-toc-modified-id=\"Translate-values-and-harmonize-energy-source-levels-3.7.6\"><span class=\"toc-item-num\">3.7.6&nbsp;&nbsp;</span>Translate values and harmonize energy source levels</a></span></li><li><span><a href=\"#Georeferencing\" data-toc-modified-id=\"Georeferencing-3.7.7\"><span class=\"toc-item-num\">3.7.7&nbsp;&nbsp;</span>Georeferencing</a></span></li><li><span><a href=\"#Assigning-NUTS-codes\" data-toc-modified-id=\"Assigning-NUTS-codes-3.7.8\"><span class=\"toc-item-num\">3.7.8&nbsp;&nbsp;</span>Assigning NUTS codes</a></span></li><li><span><a href=\"#Select-the-columns-to-keep\" data-toc-modified-id=\"Select-the-columns-to-keep-3.7.9\"><span class=\"toc-item-num\">3.7.9&nbsp;&nbsp;</span>Select the columns to keep</a></span></li><li><span><a href=\"#Visualize\" data-toc-modified-id=\"Visualize-3.7.10\"><span class=\"toc-item-num\">3.7.10&nbsp;&nbsp;</span>Visualize</a></span></li><li><span><a href=\"#Save\" data-toc-modified-id=\"Save-3.7.11\"><span class=\"toc-item-num\">3.7.11&nbsp;&nbsp;</span>Save</a></span></li></ul></li><li><span><a href=\"#Czech-Republic\" data-toc-modified-id=\"Czech-Republic-3.8\"><span class=\"toc-item-num\">3.8&nbsp;&nbsp;</span>Czech Republic</a></span><ul class=\"toc-item\"><li><span><a href=\"#Download-and-read-the-data\" data-toc-modified-id=\"Download-and-read-the-data-3.8.1\"><span class=\"toc-item-num\">3.8.1&nbsp;&nbsp;</span>Download and read the data</a></span></li><li><span><a href=\"#Clean-the-data\" data-toc-modified-id=\"Clean-the-data-3.8.2\"><span class=\"toc-item-num\">3.8.2&nbsp;&nbsp;</span>Clean the data</a></span></li><li><span><a href=\"#Reformat-the-data\" data-toc-modified-id=\"Reformat-the-data-3.8.3\"><span class=\"toc-item-num\">3.8.3&nbsp;&nbsp;</span>Reformat the data</a></span></li><li><span><a href=\"#Translate-column-names\" data-toc-modified-id=\"Translate-column-names-3.8.4\"><span class=\"toc-item-num\">3.8.4&nbsp;&nbsp;</span>Translate column names</a></span></li><li><span><a href=\"#Translate-values-and-harmonize-energy-levels\" data-toc-modified-id=\"Translate-values-and-harmonize-energy-levels-3.8.5\"><span class=\"toc-item-num\">3.8.5&nbsp;&nbsp;</span>Translate values and harmonize energy levels</a></span></li><li><span><a href=\"#Add-data-source\" data-toc-modified-id=\"Add-data-source-3.8.6\"><span class=\"toc-item-num\">3.8.6&nbsp;&nbsp;</span>Add data source</a></span></li><li><span><a href=\"#Georeferencing\" data-toc-modified-id=\"Georeferencing-3.8.7\"><span class=\"toc-item-num\">3.8.7&nbsp;&nbsp;</span>Georeferencing</a></span></li><li><span><a href=\"#Assign-NUTS-codes\" data-toc-modified-id=\"Assign-NUTS-codes-3.8.8\"><span class=\"toc-item-num\">3.8.8&nbsp;&nbsp;</span>Assign NUTS codes</a></span></li><li><span><a href=\"#Select-the-columns-to-keep\" data-toc-modified-id=\"Select-the-columns-to-keep-3.8.9\"><span class=\"toc-item-num\">3.8.9&nbsp;&nbsp;</span>Select the columns to keep</a></span></li><li><span><a href=\"#Drop-duplicates\" data-toc-modified-id=\"Drop-duplicates-3.8.10\"><span class=\"toc-item-num\">3.8.10&nbsp;&nbsp;</span>Drop duplicates</a></span></li><li><span><a href=\"#Visualuze\" data-toc-modified-id=\"Visualuze-3.8.11\"><span class=\"toc-item-num\">3.8.11&nbsp;&nbsp;</span>Visualuze</a></span></li><li><span><a href=\"#Save\" data-toc-modified-id=\"Save-3.8.12\"><span class=\"toc-item-num\">3.8.12&nbsp;&nbsp;</span>Save</a></span></li></ul></li></ul></li><li><span><a href=\"#Zip-the-raw-data\" data-toc-modified-id=\"Zip-the-raw-data-4\"><span class=\"toc-item-num\">4&nbsp;&nbsp;</span>Zip the raw data</a></span></li></ul></div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:37:09.182538Z",
     "start_time": "2020-08-20T08:37:09.179340Z"
    }
   },
   "outputs": [],
   "source": [
    "version = '2020-08-25'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Script setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:37:11.262056Z",
     "start_time": "2020-08-20T08:37:10.477305Z"
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "import logging\n",
    "import os\n",
    "import posixpath\n",
    "import urllib.parse\n",
    "import urllib.request\n",
    "import re\n",
    "import zipfile\n",
    "import pickle\n",
    "import urllib\n",
    "import shutil\n",
    "import datetime\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import utm  # for transforming geoinformation in the utm format\n",
    "import requests\n",
    "import fake_useragent\n",
    "from string import Template\n",
    "from IPython.display import display\n",
    "import xlrd\n",
    "import bs4\n",
    "import bng_to_latlon\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.pyplot import figure\n",
    "\n",
    "# for visualizing locations on maps\n",
    "import cartopy.crs as ccrs \n",
    "import cartopy.feature as cfeature\n",
    "from cartopy.io import shapereader\n",
    "import geopandas\n",
    "import shapely\n",
    "\n",
    "logging.basicConfig(\n",
    "    level=logging.INFO,\n",
    "    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',\n",
    "    datefmt='%d %b %Y %H:%M:%S'\n",
    ")\n",
    "\n",
    "logger = logging.getLogger()\n",
    "\n",
    "# Create input, intermediate and output folders if they don't exist.\n",
    "# If the paths are relative, the correspoding folders will be created\n",
    "# inside the current working directory.\n",
    "input_directory_path = os.path.join('input', 'original_data')\n",
    "intermediate_directory_path = 'intermediate'\n",
    "output_directory_path = os.path.join('output', 'renewable_power_plants')\n",
    "\n",
    "os.makedirs(input_directory_path, exist_ok=True)\n",
    "os.makedirs(intermediate_directory_path, exist_ok=True)\n",
    "os.makedirs(output_directory_path, exist_ok=True)\n",
    "\n",
    "# Create the folder to which the Eurostat files with data at the level of the whole EU/Europe\n",
    "#are going to be downloaded\n",
    "eurostat_eu_directory_path = os.path.join('input', 'eurostat_eu')\n",
    "os.makedirs(eurostat_eu_directory_path, exist_ok=True)\n",
    "\n",
    "# Define the path of the file with the list of sources.\n",
    "source_list_filepath = os.path.join('input', 'sources.csv')\n",
    "\n",
    "# Import the utility functions and classes from the util package\n",
    "import util.helper\n",
    "from util.visualizer import visualize_points"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Settings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Choose download option\n",
    "The original data can either be downloaded from the original data sources as specified below or from the opsd-Server. Default option is to download from the original sources as the aim of the project is to stay as close to original sources as possible. However, if problems with downloads e.g. due to changing urls occur, you can still run the script with the original data from the opsd_server."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:37:13.238775Z",
     "start_time": "2020-08-20T08:37:13.236123Z"
    }
   },
   "outputs": [],
   "source": [
    "download_from = 'original_sources'\n",
    "#download_from = 'opsd_server' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Update the download links\n",
    "\n",
    "The download link for the UK is updated at the end of each quarter by the source provider, BEIS. We keep up with those changes by extracting the download link automatically from the web page it is on. That way, the link does not have to be updated manually.\n",
    "\n",
    "**Note**: you must be connected to the Internet if you want to execute this step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:37:15.645996Z",
     "start_time": "2020-08-20T08:37:14.803805Z"
    }
   },
   "outputs": [],
   "source": [
    "source_df = pd.read_csv(source_list_filepath)\n",
    "uk_main_page = 'https://www.gov.uk/government/publications/renewable-energy-planning-database-monthly-extract'\n",
    "current_link = util.helper.get_beis_link(uk_main_page)\n",
    "current_filename = current_link.split('/')[-1]\n",
    "\n",
    "source_df.loc[(source_df['country'] == 'UK') & (source_df['source'] == 'BEIS'), 'url'] = current_link\n",
    "source_df.loc[(source_df['country'] == 'UK') & (source_df['source'] == 'BEIS'), 'filename'] = current_filename\n",
    "source_df.to_csv(source_list_filepath, index=False, header=True)\n",
    "\n",
    "source_df.fillna('')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that, as of August 25, 2020, the following sources are available only from the OPSD server and the data will be downloaded from it even if `download_from` is set to `'original_sources'`:\n",
    "- Energinet (DK)\n",
    "- Eurostat files which contain correspondence tables between postal codes and NUTS.\n",
    "\n",
    "The original links which should be downloaded from OPSD are marked as inactive in the column `active` in the above dataframe."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up the downloader for data sources\n",
    "The `Downloader` class in the `util` package is responsible for downloading the original files to appropriate folders. In order to access its functionality, we have to instantiate it first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:37:43.673267Z",
     "start_time": "2020-08-20T08:37:43.531504Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import util.downloader\n",
    "from util.downloader import Downloader\n",
    "downloader = Downloader(version, input_directory_path, source_list_filepath, download_from)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up the NUTS converter\n",
    "The `NUTSConverter` class in the `util` package uses the information on each facility's  postcode, municipalty name, municipality code, longitude, and latitude to assign it correct [NUTS 2016](https://ec.europa.eu/eurostat/web/nuts/history) level 1, 2, and 3 codes.\n",
    "\n",
    "Here, we instantiate the converter so that we can use it later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:37:54.286427Z",
     "start_time": "2020-08-20T08:37:45.438835Z"
    }
   },
   "outputs": [],
   "source": [
    "#import importlib\n",
    "#importlib.reload(util.nuts_converter)\n",
    "#importlib.reload(util.downloader)\n",
    "#from util.downloader import Downloader\n",
    "#downloader = Downloader(version, input_directory_path, source_list_filepath, download_from)\n",
    "from util.nuts_converter import NUTSConverter\n",
    "nuts_converter = NUTSConverter(downloader, eurostat_eu_directory_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup translation dictionaries\n",
    "\n",
    "Column and value names of the original data sources will be translated to English and standardized across different sources. Standardized column names, e.g. \"electrical_capacity\" are required to merge data in one DataFrame.<br>\n",
    "The column and the value translation lists are provided in the input folder of the Data Package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:38:00.361893Z",
     "start_time": "2020-08-20T08:38:00.345827Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Get column translation list\n",
    "columnnames = pd.read_csv(os.path.join('input', 'column_translation_list.csv'))\n",
    "columnnames.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:38:01.128965Z",
     "start_time": "2020-08-20T08:38:01.119335Z"
    }
   },
   "outputs": [],
   "source": [
    "# Get value translation list\n",
    "valuenames = pd.read_csv(os.path.join('input', 'value_translation_list.csv'))\n",
    "valuenames.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Download and process per country\n",
    "\n",
    "For one country after the other, the original data is downloaded, read, processed, translated, eventually georeferenced and saved. If respective files are already in the local folder, these will be utilized.\n",
    "To process the provided data [pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is applied.<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Germany DE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download and read\n",
    "The data which will be processed below is provided by the following data sources:\n",
    "\n",
    "**[Netztransparenz.de](https://www.netztransparenz.de/de/Anlagenstammdaten.htm)** - Official grid transparency platform from the German Transmission System Operators (TSOs): 50Hertz, Amprion, TenneT and TransnetBW.\n",
    "\n",
    "**[Bundesnetzagentur (BNetzA)](https://www.bundesnetzagentur.de/DE/Sachgebiete/ElektrizitaetundGas/Unternehmen_Institutionen/ErneuerbareEnergien/ZahlenDatenInformationen/EEG_Registerdaten/EEG_Registerdaten_node.html)** - German Federal Network Agency for Electricity, Gas, Telecommunications, Posts and Railway (In separate files for **data for roof-mounted PV power plants** and for **all other renewable energy power plants**.)\n",
    "\n",
    "Data URL for BNetzA gets updated every few month. To be sure, always check if the links (url_bnetza; url_bnetza_pv) are up to date."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:39:09.377864Z",
     "start_time": "2020-08-20T08:38:07.377399Z"
    }
   },
   "outputs": [],
   "source": [
    "# Define the lists of source names\n",
    "downloader = Downloader(version, input_directory_path, source_list_filepath, download_from)\n",
    "\n",
    "tsos = ['50Hertz', 'Amprion', 'TenneT', 'TransnetBW']\n",
    "datasets = ['50Hertz', 'Amprion', 'TenneT', 'TransnetBW','bnetza','bnetza_pv','bnetza_pv_historic']\n",
    "\n",
    "# Download the files and get the local file paths indexed by source names\n",
    "filepaths = downloader.download_data_for_country('DE')\n",
    "\n",
    "# Remove the Eurostat NUTS file as it's a geoinformation source\n",
    "DE_postcode2nuts_filepath = filepaths.pop('Eurostat')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:47:02.683965Z",
     "start_time": "2020-08-20T08:47:02.678173Z"
    }
   },
   "outputs": [],
   "source": [
    "# Open all data sets before processing.\n",
    "filenames = {}\n",
    "\n",
    "for source in filepaths:\n",
    "    filepath = filepaths[source]\n",
    "    print(source, filepath)\n",
    "    if os.path.splitext(filepath)[1] != '.xlsx' and zipfile.is_zipfile(filepath):\n",
    "        filenames[source] = zipfile.ZipFile(filepath)\n",
    "    else:\n",
    "        filenames[source] = filepath"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:47:15.074639Z",
     "start_time": "2020-08-20T08:47:03.747126Z"
    }
   },
   "outputs": [],
   "source": [
    "# Read TSO data from the zip files\n",
    "dfs = {}\n",
    "\n",
    "basenames_by_tso = {\n",
    "    '50Hertz': '50Hertz Transmission GmbH EEG-Zahlungen Stammdaten 2019',\n",
    "    'Amprion': 'Amprion GmbH EEG-Zahlungen Anlagenstammdaten 2019',\n",
    "    'TenneT': 'TenneT TSO GmbH Anlagenstammdaten 2019',\n",
    "    'TransnetBW': 'TransnetBW GmbH Anlagenstammdaten 2019',\n",
    "}\n",
    "        \n",
    "for tso in tsos:\n",
    "    filename = basenames_by_tso[tso]+'.csv'\n",
    "    print('Reading', filename)\n",
    "    #print(filenames[tso].namelist())\n",
    "    dfs[tso] = pd.read_csv(\n",
    "        filenames[tso].open(filename),\n",
    "        sep=';',\n",
    "        thousands='.',\n",
    "        decimal=',',\n",
    "\n",
    "        # Headers have to have the same order for all TSOs. Therefore just define headers here.\n",
    "        # Remove the following three lines if for next version, headers should be read out initially \n",
    "        # to then check if order is the same everywhere.\n",
    "        names=['EEG-Anlagenschlüssel', 'MASTR_Nr_EEG','Netzbetreiber Betriebsnummer','Netzbetreiber Name',\n",
    "               'Strasse_flurstueck','PLZ','Ort / Gemarkung','Gemeindeschlüssel','Bundesland',\n",
    "               'Installierte Leistung','Energieträger','Spannungsebene','Leistungsmessung','Regelbarkeit',\n",
    "               'Inbetriebnahme','Außerbetriebnahme','Netzzugang','Netzabgang'],\n",
    "        header=None,\n",
    "        skiprows=1,\n",
    "        parse_dates=[14, 15, 16, 17], #[11, 12, 13, 14]\n",
    "        #infer_datetime_format=True,\n",
    "        date_parser = lambda x: pd.to_datetime(x, errors='coerce', format='%d.%m.%Y'),\n",
    "        encoding='iso-8859-1',\n",
    "        dayfirst=True,\n",
    "        low_memory=False\n",
    "    )\n",
    "    print('Done reading ' + filename)\n",
    "\n",
    "for filename in filenames.values():\n",
    "    if(isinstance(filename, zipfile.ZipFile)):\n",
    "        #print(filename)\n",
    "        filename.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:47:18.726061Z",
     "start_time": "2020-08-20T08:47:18.721881Z"
    }
   },
   "outputs": [],
   "source": [
    "# define the date parser\n",
    "def date_parser(x):\n",
    "    if type(x) == str:\n",
    "        return datetime.datetime.strptime(x, '%D.%M.%Y')\n",
    "    elif type(x) == float and pd.isnull(x):\n",
    "        return pd.NaT\n",
    "    \n",
    "def inspect(x):\n",
    "    try:\n",
    "        converted = datetime.datetime.strptime(x, '%d.%m.%Y')\n",
    "        return False\n",
    "    except:\n",
    "        return True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:04.058406Z",
     "start_time": "2020-08-20T08:50:37.190652Z"
    }
   },
   "outputs": [],
   "source": [
    "# Read BNetzA register\n",
    "print('Reading bnetza: '+filenames['bnetza'])\n",
    "dfs['bnetza'] = pd.read_excel(filenames['bnetza'],\n",
    "                          sheet_name='Gesamtübersicht',\n",
    "                          header=0,\n",
    "                          converters={'4.9 Postleit-zahl': str, 'Gemeinde-Schlüssel': str}\n",
    ")\n",
    "\n",
    "skiprows = {'bnetza_pv_historic': 10, 'bnetza_pv': 9}\n",
    "\n",
    "for dataset in ['bnetza_pv', 'bnetza_pv_historic']:\n",
    "    print(dataset)\n",
    "    print('Reading ' + dataset + ': ' + filenames[dataset])\n",
    "    xls_handle = pd.ExcelFile(filenames[dataset])\n",
    "    print('Concatenating all '+dataset+' sheets into one dataframe')\n",
    "    dfs[dataset] = pd.concat(\n",
    "        (xls_handle.parse(\n",
    "            sheet,\n",
    "            skiprows=skiprows[dataset],\n",
    "            converters={'Anlage \\nPLZ': str}\n",
    "        ) for sheet in xls_handle.sheet_names),\n",
    "        sort=True\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:11.135764Z",
     "start_time": "2020-08-20T08:51:10.260086Z"
    }
   },
   "outputs": [],
   "source": [
    "# Make sure that the column `Inbetriebnahme-datum *)` (commissioning date) in the bnetza_pv set is datetime.\n",
    "mask = dfs['bnetza_pv']['Inbetriebnahme-datum *)'].apply(lambda x: type(x) == int)\n",
    "\n",
    "dfs['bnetza_pv']['Inbetriebnahme-datum *)'] = pd.to_datetime(dfs['bnetza_pv']['Inbetriebnahme-datum *)'],\n",
    "                                                             errors='coerce',\n",
    "                                                             dayfirst=True,\n",
    "                                                             infer_datetime_format=True)\n",
    "dfs['bnetza_pv']['Inbetriebnahme-datum *)'] = dfs['bnetza_pv']['Inbetriebnahme-datum *)'].apply(\n",
    "    lambda x: x.to_datetime64()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:11.995843Z",
     "start_time": "2020-08-20T08:51:11.980776Z"
    }
   },
   "outputs": [],
   "source": [
    "dfs['bnetza_pv_historic'] = dfs['bnetza_pv_historic'].drop(['Unnamed: 7'], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:18.638352Z",
     "start_time": "2020-08-20T08:51:14.084291Z"
    }
   },
   "outputs": [],
   "source": [
    "pickle.dump( dfs, open( \"intermediate/temp_dfs_DE_after_reading.pickle\", \"wb\" ) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:20.285012Z",
     "start_time": "2020-08-20T08:51:18.640439Z"
    }
   },
   "outputs": [],
   "source": [
    "dfs = pickle.load( open( \"intermediate/temp_dfs_DE_after_reading.pickle\", \"rb\" ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate column names\n",
    "To standardise the DataFrame the original column names from the German TSOs and the BNetzA wil be translated and new English column names wil be assigned to the DataFrame. The unique column names are required to merge the DataFrame.<br>\n",
    "The column_translation_list is provided here as csv in the input folder. It is loaded in _2.3 Setup of translation dictionaries_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:20.560184Z",
     "start_time": "2020-08-20T08:51:20.287298Z"
    }
   },
   "outputs": [],
   "source": [
    "# Choose the translation terms for Germany, create dictionary and show dictionary\n",
    "columnnames = pd.read_csv(os.path.join('input', 'column_translation_list.csv'))\n",
    "idx_DE = columnnames[columnnames['country'] == 'DE'].index\n",
    "column_dict_DE = columnnames.loc[idx_DE].set_index('original_name')['opsd_name'].to_dict()\n",
    "column_dict_DE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:21.340203Z",
     "start_time": "2020-08-20T08:51:20.562521Z"
    }
   },
   "outputs": [],
   "source": [
    "# Start the column translation process for each original data source\n",
    "print('Translation...')\n",
    "for dataset in dfs:\n",
    "    # Remove newlines and any other duplicate whitespaces in column names:\n",
    "    dfs[dataset] = dfs[dataset].rename(columns={col: re.sub(r\"\\s+\", ' ', col) for col in dfs[dataset].columns})\n",
    "    # Do column name translations\n",
    "    print(dataset)\n",
    "    #print(list(dfs[dataset].columns))\n",
    "    dfs[dataset].rename(columns=column_dict_DE, inplace=True)\n",
    "    #print(list(dfs[dataset].columns).index('decommissioning_date'))\n",
    "    #print('--------------------------------------------')\n",
    "\n",
    "print('done.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add information and choose columns\n",
    "All data source names and for the BNetzA-PV data the energy source level 2 will added."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:23.238665Z",
     "start_time": "2020-08-20T08:51:23.212163Z"
    }
   },
   "outputs": [],
   "source": [
    "# Add data source names to the DataFrames\n",
    "for tso in tsos:\n",
    "    dfs[tso]['data_source'] = tso\n",
    "    dfs[tso]['tso'] = tso\n",
    "\n",
    "dfs['bnetza']['data_source'] = 'BNetzA'\n",
    "dfs['bnetza_pv']['data_source'] = 'BNetzA_PV'\n",
    "dfs['bnetza_pv_historic']['data_source'] = 'BNetzA_PV_historic'\n",
    "\n",
    "# Add for the BNetzA PV data the energy source level 2\n",
    "dfs['bnetza_pv']['energy_source_level_2'] = 'Photovoltaics'\n",
    "dfs['bnetza_pv_historic']['energy_source_level_2'] = 'Photovoltaics'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:25.692690Z",
     "start_time": "2020-08-20T08:51:24.344624Z"
    }
   },
   "outputs": [],
   "source": [
    "# Select those columns of the original data which are utilised further\n",
    "dfs['bnetza'] = dfs['bnetza'].loc[:, ('commissioning_date', 'decommissioning_date',\n",
    "                              'notification_reason', 'energy_source_level_2',\n",
    "                              'electrical_capacity_kW', 'thermal_capacity_kW',\n",
    "                              'voltage_level', 'dso', 'eeg_id', 'bnetza_id',\n",
    "                              'federal_state', 'postcode', 'municipality_code',\n",
    "                              'municipality', 'address', 'address_number',\n",
    "                              'utm_zone', 'utm_east', 'utm_north',\n",
    "                              'data_source')]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:26.483268Z",
     "start_time": "2020-08-20T08:51:26.361326Z"
    }
   },
   "outputs": [],
   "source": [
    "for dataset in datasets: print(dataset+':'); display(dfs[dataset].tail(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Merge DataFrames\n",
    "The individual DataFrames from the TSOs (Netztransparenz.de) and BNetzA are merged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:30.265881Z",
     "start_time": "2020-08-20T08:51:28.398046Z"
    }
   },
   "outputs": [],
   "source": [
    "# Merge DataFrames of each original source into a common DataFrame DE_renewables\n",
    "dfs_list = []\n",
    "\n",
    "for dataset in datasets:\n",
    "    dfs_list.append(dfs[dataset])\n",
    "\n",
    "DE_renewables = pd.concat(dfs_list, sort=True)\n",
    "DE_renewables.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:43.431262Z",
     "start_time": "2020-08-20T08:51:43.409614Z"
    }
   },
   "outputs": [],
   "source": [
    "DE_renewables.reset_index(drop=True, inplace=True)\n",
    "DE_renewables.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate values and harmonize energy source level 2\n",
    "Different German terms for energy source level 2, energy source level 3, technology and voltage levels are translated and harmonized across the individual data sources. The value_translation_list is provided here as csv in the input folder. It is loaded in _2.3 Setup of translation dictionaries_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T08:51:48.923796Z",
     "start_time": "2020-08-20T08:51:48.863116Z"
    }
   },
   "outputs": [],
   "source": [
    "# Choose the translation terms for Germany, create dictionary and show dictionary\n",
    "valuenames = pd.read_csv(os.path.join('input', 'value_translation_list.csv'))\n",
    "idx_DE = valuenames[valuenames['country'] == 'DE'].index\n",
    "value_dict_DE = valuenames.loc[idx_DE].set_index('original_name')['opsd_name'].to_dict()\n",
    "value_dict_DE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:01:09.491744Z",
     "start_time": "2020-08-20T08:51:50.015603Z"
    }
   },
   "outputs": [],
   "source": [
    "print('replacing...')\n",
    "# Replace all original value names by the OPSD value names. \n",
    "# Running time: some minutes.\n",
    "DE_renewables.replace(value_dict_DE, inplace=True)\n",
    "print('Done!')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:13:36.942501Z",
     "start_time": "2020-08-20T09:13:10.108038Z"
    }
   },
   "outputs": [],
   "source": [
    "DE_renewables['postcode'] = DE_renewables['postcode'].apply(pd.to_numeric, errors='ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Separate and assign energy source level 1 - 3 and technology**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:13:43.691189Z",
     "start_time": "2020-08-20T09:13:41.334416Z"
    }
   },
   "outputs": [],
   "source": [
    "# Create dictionary in order to assign energy_source to its subtype\n",
    "energy_source_dict_DE = valuenames.loc[idx_DE].set_index(\n",
    "    'opsd_name')['energy_source_level_2'].to_dict()\n",
    "\n",
    "# Column energy_source partly contains energy source level 3 and technology information,\n",
    "# thus this column is copied to new column technology...\n",
    "DE_renewables['technology'] = DE_renewables['energy_source_level_2']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:13:59.125779Z",
     "start_time": "2020-08-20T09:13:49.209700Z"
    }
   },
   "outputs": [],
   "source": [
    "# ...and the energy source level 2 values are replaced by the higher level classification\n",
    "DE_renewables['energy_source_level_2'].replace(energy_source_dict_DE, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:14:07.790626Z",
     "start_time": "2020-08-20T09:13:59.194274Z"
    }
   },
   "outputs": [],
   "source": [
    "# Choose energy source level 2 entries where energy_source is \"Bioenergy\" in order to \n",
    "# separate Bioenergy subtypes to \"energy_source_level_3\" and subtypes for the rest to \"technology\"\n",
    "idx_DE_Bioenergy = DE_renewables[DE_renewables['energy_source_level_2'] == 'Bioenergy'].index\n",
    "\n",
    "# Assign technology to energy source level 3 for all entries where energy source level 2 is \n",
    "# Bioenergy and delete those entries from technology\n",
    "DE_renewables[['energy_source_level_3']] = DE_renewables.iloc[idx_DE_Bioenergy][['technology']]\n",
    "DE_renewables.loc[idx_DE_Bioenergy]['technology'] = np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:14:08.328088Z",
     "start_time": "2020-08-20T09:14:07.840440Z"
    }
   },
   "outputs": [],
   "source": [
    "# Assign energy source level 1 to the dataframe\n",
    "DE_renewables['energy_source_level_1'] = 'Renewable energy'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:14:15.638638Z",
     "start_time": "2020-08-20T09:14:09.388966Z"
    }
   },
   "outputs": [],
   "source": [
    "# Show the hierarchy of the energy types present in the frame\n",
    "energy_columns = ['energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology']\n",
    "DE_renewables[energy_columns].drop_duplicates().sort_values(by='energy_source_level_2')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "According to the **[OPSD energy hierarchy](http://open-power-system-data.org/2016-10-25-opsd_tree.svg)**, the power plants whose `energy_source_level_2` is either `Storage` or `Other fossil fuels` do not belong to the class of renewable-energy facilities. Therefore, we can remove them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:14:27.043264Z",
     "start_time": "2020-08-20T09:14:22.716611Z"
    }
   },
   "outputs": [],
   "source": [
    "drop_mask = DE_renewables['energy_source_level_2'].isin(['Other fossil fuels', 'Storage'])\n",
    "DE_renewables.drop(DE_renewables.index[drop_mask], axis=0, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Summary of DataFrame**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:14:38.573268Z",
     "start_time": "2020-08-20T09:14:36.275141Z"
    }
   },
   "outputs": [],
   "source": [
    "# Electrical capacity per energy source level 2 (in MW)\n",
    "DE_renewables.groupby(['energy_source_level_2'])['electrical_capacity_kW'].sum() / 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transform electrical capacity from kW to MW"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:14:39.726670Z",
     "start_time": "2020-08-20T09:14:38.588615Z"
    }
   },
   "outputs": [],
   "source": [
    "# kW to MW\n",
    "DE_renewables[['electrical_capacity_kW', 'thermal_capacity_kW']] /= 1000\n",
    "\n",
    "# adapt column name\n",
    "DE_renewables.rename(columns={'electrical_capacity_kW': 'electrical_capacity',\n",
    "                              'thermal_capacity_kW': 'thermal_capacity'}, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Georeferencing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Get coordinates by postcode\n",
    "*(for data with no existing geocoordinates)*\n",
    "\n",
    "The available post code in the original data provides a first approximation for the geocoordinates of the RE power plants.<br>\n",
    "The BNetzA data provides the full zip code whereas due to data privacy the TSOs only report the first three digits of the power plant's post code (e.g. 024xx) and no address. Subsequently a centroid of the post code region polygon is used to find the coordinates.\n",
    "\n",
    "With data from\n",
    "*  http://www.suche-postleitzahl.org/downloads?download=plz-gebiete.shp.zip\n",
    "*  http://www.suche-postleitzahl.org/downloads?download_file=plz-3stellig.shp.zip\n",
    "*  http://www.suche-postleitzahl.org/downloads\n",
    "\n",
    "a CSV-file for all existing German post codes with matching geocoordinates has been compiled. The latitude and longitude coordinates were generated by running a PostgreSQL + PostGIS database. Additionally the respective TSO has been added to each post code. *(A Link to the SQL script will follow here later)*\n",
    "\n",
    "*(License: http://www.suche-postleitzahl.org/downloads, Open Database Licence for free use. Source of data: © OpenStreetMap contributors)*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:14:55.643626Z",
     "start_time": "2020-08-20T09:14:49.237431Z"
    }
   },
   "outputs": [],
   "source": [
    "# Read generated postcode/location file\n",
    "postcode = pd.read_csv(os.path.join('input', 'de_tso_postcode_full.csv'))\n",
    "\n",
    "# Drop possible duplicates in postcodes\n",
    "postcode.drop_duplicates('postcode', keep='last', inplace=True)\n",
    "\n",
    "# Show first entries\n",
    "postcode.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Merge geometry information by using the postcode**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:15:05.115867Z",
     "start_time": "2020-08-20T09:14:55.649185Z"
    }
   },
   "outputs": [],
   "source": [
    "# Take postcode and longitude/latitude information\n",
    "postcode = postcode[['postcode', 'lon', 'lat']]\n",
    "\n",
    "# Cast DE_renewables['postcode'] to int64 in order to do the natural join of the dataframes\n",
    "DE_renewables['postcode'] = pd.to_numeric(DE_renewables['postcode'], errors='coerce')\n",
    "\n",
    "# Join two dataframes\n",
    "DE_renewables = DE_renewables.merge(postcode, on=['postcode'],  how='left')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Transform geoinformation\n",
    "*(for data with already existing geoinformation)*\n",
    "\n",
    "In this section the existing geoinformation (in UTM-format) will be transformed into latidude and longitude coordiates as a uniform standard for geoinformation. \n",
    "\n",
    "The BNetzA data set offers UTM Geoinformation with the columns *utm_zone (UTM-Zonenwert)*, *utm_east* and *utm_north*. Most of utm_east-values include the utm_zone-value **32** at the beginning of the number. In order to properly standardize and transform this geoinformation into latitude and longitude it is necessary to remove this utm_zone value. For all UTM entries the utm_zone 32 is used by the BNetzA.\n",
    "\n",
    "\n",
    "|utm_zone|\t utm_east|\t utm_north| comment|\n",
    "|---|---|---| ----|\n",
    "|32|\t413151.72|\t6027467.73| proper coordinates|\n",
    "|32|\t**32**912159.6008|\t5692423.9664| caused error by 32|\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**How many different utm_zone values are in the data set?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:15:06.246441Z",
     "start_time": "2020-08-20T09:15:05.126784Z"
    }
   },
   "outputs": [],
   "source": [
    "DE_renewables.groupby(['utm_zone'])['utm_zone'].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Remove the utm_zone \"32\" from the utm_east value**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:15:09.870246Z",
     "start_time": "2020-08-20T09:15:06.252552Z"
    }
   },
   "outputs": [],
   "source": [
    "# Find entries with 32 value at the beginning\n",
    "idx_32 = (DE_renewables['utm_east'].astype(str).str[:2] == '32')\n",
    "idx_notnull = DE_renewables['utm_east'].notnull()\n",
    "\n",
    "# Remove 32 from utm_east entries\n",
    "DE_renewables.loc[idx_32, 'utm_east'] = DE_renewables.loc[idx_32,\n",
    "                                                          'utm_east'].astype(str).str[2:].astype(float)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:15:14.575174Z",
     "start_time": "2020-08-20T09:15:09.872161Z"
    }
   },
   "outputs": [],
   "source": [
    "def convert_to_latlon(utm_east, utm_north, utm_zone):\n",
    "    try:\n",
    "        return utm.to_latlon(utm_east, utm_north, utm_zone, 'U')\n",
    "    except:\n",
    "        return ''\n",
    "\n",
    "DE_renewables['latlon'] = DE_renewables.loc[idx_notnull, ['utm_east', 'utm_north', 'utm_zone']].apply(\n",
    "        lambda x: convert_to_latlon(x[0], x[1], x[2]), axis=1).astype(str)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Conversion UTM to latitude and longitude**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:16:55.784990Z",
     "start_time": "2020-08-20T09:15:14.603343Z"
    }
   },
   "outputs": [],
   "source": [
    "lat = []\n",
    "lon = []\n",
    "\n",
    "for row in DE_renewables['latlon']:\n",
    "    try:\n",
    "        # Split tuple format into the column lat and lon\n",
    "        row = row.lstrip('(').rstrip(')')\n",
    "        parts = row.split(',')\n",
    "        if(len(parts)<2):\n",
    "            raise Exception('This is not a proper tuple. So go to exception block.')\n",
    "        lat.append(parts[0])\n",
    "        lon.append(parts[1])\n",
    "    except:\n",
    "        # set NaN\n",
    "        lat.append(np.NaN)\n",
    "        lon.append(np.NaN)\n",
    "\n",
    "DE_renewables['latitude'] = pd.to_numeric(lat)\n",
    "DE_renewables['longitude'] = pd.to_numeric(lon)\n",
    "\n",
    "# Add new values to DataFrame lon and lat\n",
    "DE_renewables['lat'] = DE_renewables[['lat', 'latitude']].apply(\n",
    "    lambda x: x[1] if pd.isnull(x[0]) else x[0],\n",
    "    axis=1)\n",
    "\n",
    "DE_renewables['lon'] = DE_renewables[['lon', 'longitude']].apply(\n",
    "    lambda x: x[1] if pd.isnull(x[0]) else x[0],\n",
    "    axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Check: missing coordinates by data source and type**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:16:55.819121Z",
     "start_time": "2020-08-20T09:16:55.786207Z"
    }
   },
   "outputs": [],
   "source": [
    "#DE_renewables[DE_renewables['data_source'] == '50Hertz'].to_excel('test.xlsx')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:16:57.587923Z",
     "start_time": "2020-08-20T09:16:55.821195Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print('Missing coordinates ', DE_renewables.lat.isnull().sum())\n",
    "\n",
    "display(\n",
    "    DE_renewables[DE_renewables.lat.isnull()].groupby(\n",
    "        ['energy_source_level_2','data_source']\n",
    "    )['data_source'].count()\n",
    ")\n",
    "\n",
    "print('Share of missing coordinates (note that NaN can mean it\\'s all fine):')\n",
    "\n",
    "DE_renewables[DE_renewables.lat.isnull()].groupby(\n",
    "        ['energy_source_level_2','data_source']\n",
    "    )['data_source'].count() / DE_renewables.groupby(\n",
    "        ['energy_source_level_2','data_source']\n",
    "    )['data_source'].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Remove temporary columns**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:16:58.141561Z",
     "start_time": "2020-08-20T09:16:57.593478Z"
    }
   },
   "outputs": [],
   "source": [
    "# drop lonlat column that contains both, latitute and longitude\n",
    "DE_renewables.drop(['latlon', 'longitude', 'latitude'], axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save temporary Pickle (to have a point to quickly return to if things break after this point):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:17:09.070868Z",
     "start_time": "2020-08-20T09:16:58.142849Z"
    }
   },
   "outputs": [],
   "source": [
    "pickle.dump(DE_renewables, open( \"intermediate/temp_dfs_DE_before_cleaning.pickle\", \"wb\" ) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:17:15.633872Z",
     "start_time": "2020-08-20T09:17:09.076347Z"
    }
   },
   "outputs": [],
   "source": [
    "DE_renewables = pickle.load( open( \"intermediate/temp_dfs_DE_before_cleaning.pickle\", \"rb\" ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:17:16.796534Z",
     "start_time": "2020-08-20T09:17:15.635265Z"
    }
   },
   "outputs": [],
   "source": [
    "# Remove out-of-range dates\n",
    "# Keep only values between 1900 and 2100 to rule out outliers / wrong values. \n",
    "# Also, Excel doesn't support dates before 1900..\n",
    "\n",
    "mask = ((DE_renewables['commissioning_date']>pd.Timestamp('1900')) & \n",
    "        (DE_renewables['commissioning_date']<pd.Timestamp('2100')))\n",
    "DE_renewables = DE_renewables[mask]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:17:17.454679Z",
     "start_time": "2020-08-20T09:17:16.798023Z"
    }
   },
   "outputs": [],
   "source": [
    "DE_renewables['municipality_code'] = DE_renewables['municipality_code'].astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:17:18.452186Z",
     "start_time": "2020-08-20T09:17:17.458790Z"
    }
   },
   "outputs": [],
   "source": [
    "# Remove spaces from municipality code\n",
    "DE_renewables['municipality_code'] = DE_renewables['municipality_code'].str.replace(' ', '', regex=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:17:20.628562Z",
     "start_time": "2020-08-20T09:17:18.454521Z"
    }
   },
   "outputs": [],
   "source": [
    "DE_renewables['municipality_code'] = pd.to_numeric(DE_renewables['municipality_code'], errors='coerce', downcast='integer')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:17:24.096164Z",
     "start_time": "2020-08-20T09:17:20.629828Z"
    }
   },
   "outputs": [],
   "source": [
    "# Merge address and address_number\n",
    "to_string = lambda x: str(x) if not pd.isnull(x) else ''\n",
    "DE_renewables['address'] = DE_renewables['address'].map(to_string) + ' ' + DE_renewables['address_number'].map(to_string)\n",
    "\n",
    "# Make sure that the column has no whitespaces at the beginning and the end\n",
    "DE_renewables['address'] = DE_renewables['address'].str.strip()\n",
    "\n",
    "# Remove the column with address numbers as it is not needed anymore\n",
    "del DE_renewables['address_number']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assign NUTS codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:21:44.000524Z",
     "start_time": "2020-08-20T09:17:30.768672Z"
    }
   },
   "outputs": [],
   "source": [
    "# Set up a temporary postcode column as a string column for joining with the appropriate NUTS correspondence table\n",
    "DE_renewables['postcode_str'] = DE_renewables['postcode'].astype(str).str[:-2]\n",
    "\n",
    "DE_renewables = nuts_converter.add_nuts_information(DE_renewables, 'DE', DE_postcode2nuts_filepath,\n",
    "                                                     postcode_column='postcode_str',\n",
    "                                                     how=['postcode', 'municipality_code', 'municipality', 'latlon'])\n",
    "\n",
    "# Drop the temporary column\n",
    "DE_renewables.drop('postcode_str', axis='columns', inplace=True)\n",
    "\n",
    "# Report the number of facilites whose NUTS codes were successfully sudetermined\n",
    "determined = DE_renewables['nuts_1_region'].notnull().sum()\n",
    "print('NUTS successfully determined for', determined, 'out of', DE_renewables.shape[0], 'facilities in DE.')\n",
    "\n",
    "# Report the number of facilites whose NUTS codes could not be determined\n",
    "not_determined = DE_renewables['nuts_1_region'].isnull().sum()\n",
    "print('NUTS could not be determined for', not_determined, 'out of', DE_renewables.shape[0], 'facilities in DE.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:22:59.162636Z",
     "start_time": "2020-08-20T09:21:44.079927Z"
    }
   },
   "outputs": [],
   "source": [
    "visualize_points(DE_renewables['lat'],\n",
    "                 DE_renewables['lon'],\n",
    "                'Germany',\n",
    "                 categories=DE_renewables['energy_source_level_2']\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save\n",
    " \n",
    "The merged, translated, cleaned, DataFrame will be saved temporily as a pickle file, which stores a Python object fast."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:23:11.105540Z",
     "start_time": "2020-08-20T09:22:59.165253Z"
    }
   },
   "outputs": [],
   "source": [
    "DE_renewables.to_pickle('intermediate/DE_renewables.pickle')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:23:11.404793Z",
     "start_time": "2020-08-20T09:23:11.110714Z"
    }
   },
   "outputs": [],
   "source": [
    "del DE_renewables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Denmark DK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download and read\n",
    "The data which will be processed below is provided by the following data sources:\n",
    "\n",
    "**[Energistyrelsen (ens) / Danish Energy Agency](http://www.ens.dk/info/tal-kort/statistik-noegletal/oversigt-energisektoren/stamdataregister-vindmoller)** - The wind turbines register is released by the Danish Energy Agency. \n",
    "\n",
    "**[Energinet.dk](http://www.energinet.dk/DA/El/Engrosmarked/Udtraek-af-markedsdata/Sider/Statistik.aspx)** - The data of solar power plants are released by the leading transmission network operator Denmark.\n",
    "\n",
    "**[geonames.org](http://download.geonames.org/export/zip/?C=N;O=D)** - The postcode  data from Denmark is provided by Geonames and licensed under a [Creative Commons Attribution 3.0 license](http://creativecommons.org/licenses/by/3.0/).\n",
    "\n",
    "**[Eurostat](https://ec.europa.eu/eurostat/about/overview)** - The data for converting information on municipalities, postcodes and geographic coordinates to NUTS 2016 classification codes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the data for Denmark\n",
    "filepaths = downloader.download_data_for_country('DK')\n",
    "print(filepaths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function for reading the data on the wind turbines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_dk_wind_turbines(filepath, sheet_name):\n",
    "    # Reads the data on Danish wind turbines\n",
    "    # from the sheet of the given name\n",
    "    # in the file with the path.\n",
    "    # Returns the data as a Pandas dataframe.\n",
    "    \n",
    "    book = xlrd.open_workbook(filepath)\n",
    "    sheet = book.sheet_by_name(sheet_name)\n",
    "    \n",
    "    # Since the column names are in two rows, not one,\n",
    "    # collect them in two parts. The first part is\n",
    "    # fixed and contains column names.\n",
    "    header = []\n",
    "    for i in range(0, 16):\n",
    "        # Make sure that strings 1) do not contain the newline sign\n",
    "        # and 2) have no trailing blank spaces.\n",
    "        column_name = sheet.cell_value(17, i).replace(\"\\n\", \"\").strip()\n",
    "        header = header + [column_name]\n",
    "    # The second part is variable. It consists of two subparts:\n",
    "    # 1) previous years (type float)\n",
    "    # 2) the past months of the current year (type date)\n",
    "    \n",
    "    # Reading previous years as column names\n",
    "    i = 16\n",
    "    cell = sheet.cell(16, i)\n",
    "\n",
    "    while cell.ctype == xlrd.XL_CELL_NUMBER:\n",
    "        column_name = str(int(cell.value))\n",
    "        header = header + [column_name]\n",
    "        i = i + 1\n",
    "        cell = sheet.cell(16, i)\n",
    "    \n",
    "    # Reading the months of the current year as column names\n",
    "    while cell.ctype == xlrd.XL_CELL_DATE:\n",
    "        year, month, _, _, _, _ = xlrd.xldate_as_tuple(cell.value, book.datemode)\n",
    "        column_name = str(\"{}-{}\".format(year, month))\n",
    "        header = header + [column_name]\n",
    "        i = i + 1\n",
    "        cell = sheet.cell(16, i)\n",
    "        \n",
    "    # Add the final column for the total of the current year\n",
    "    header += ['{}-total'.format(header[-1].split('-')[0])]\n",
    "        \n",
    "        \n",
    "    # Skip the first 17 rows in the sheet. The rest contains the data.\n",
    "    df = pd.read_excel(filepath,\n",
    "                       sheet_name=sheet_name,\n",
    "                       skiprows=17,\n",
    "                       skipfooter=3\n",
    "                    )\n",
    "    \n",
    "    # \n",
    "    #df.drop(df.columns[len(df.columns)-1], axis=1, inplace=True)\n",
    "    \n",
    "    # Set the column names.\n",
    "    df.columns = header\n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get wind turbines data\n",
    "wind_turbines_sheet_name = 'IkkeAfmeldte-Existing turbines'\n",
    "DK_wind_filepath = filepaths['Energistyrelsen']\n",
    "DK_wind_df = read_dk_wind_turbines(DK_wind_filepath,\n",
    "                                   wind_turbines_sheet_name\n",
    "                                  )\n",
    "\n",
    "# Get photovoltaic data\n",
    "DK_solar_filepath = filepaths['Energinet']\n",
    "DK_solar_df = pd.read_excel(DK_solar_filepath,\n",
    "                            sheet_name='Data',\n",
    "                            skiprows=[0],\n",
    "                            converters={'Postnr': str}\n",
    "                           )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove duplicates\n",
    "DK_wind_df.drop_duplicates(inplace=True)\n",
    "DK_solar_df.drop_duplicates(inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Choose the translation terms for Denmark, create dictionary and show dictionary\n",
    "idx_DK = columnnames[columnnames['country'] == 'DK'].index\n",
    "column_dict_DK = columnnames.loc[idx_DK].set_index('original_name')['opsd_name'].to_dict()\n",
    "\n",
    "# Windows has problems reading the csv entry for east and north (DK).\n",
    "# The reason might be the difference when opening the csv between linux and\n",
    "# windows.\n",
    "column_dict_DK_temp = {}\n",
    "for k, v in column_dict_DK.items():\n",
    "    column_dict_DK_temp[k] = v\n",
    "    if v == 'utm_east' or v == 'utm_north':\n",
    "        # merge 2 lines to 1\n",
    "        new_key = ''.join(k.splitlines())\n",
    "        column_dict_DK_temp[new_key] = v\n",
    "\n",
    "column_dict_DK = column_dict_DK_temp\n",
    "\n",
    "column_dict_DK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace column names based on column_dict_DK\n",
    "DK_wind_df.rename(columns=column_dict_DK, inplace=True)\n",
    "DK_solar_df.rename(columns=column_dict_DK, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add data source and missing information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add names of the data sources to the DataFrames\n",
    "DK_wind_df['data_source'] = 'Energistyrelsen'\n",
    "DK_solar_df['data_source'] = 'Energinet.dk'\n",
    "\n",
    "# Add energy source level 2 and technology for each of the two DataFrames\n",
    "DK_wind_df['energy_source_level_2'] = 'Wind'\n",
    "DK_solar_df['energy_source_level_2'] = 'Solar'\n",
    "DK_solar_df['technology'] = 'Photovoltaics'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Correct the dates\n",
    "\n",
    "Some dates in the Energinet dataset are equal to `1970-01-01`, which should be `NaN` instead"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mask=DK_solar_df['commissioning_date'] == '1970-01-01'\n",
    "DK_solar_df.loc[mask, 'commissioning_date'] = np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate values and harmonize energy source level 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for Denmark, create dictionary and show dictionary\n",
    "idx_DK = valuenames[valuenames['country'] == 'DK'].index\n",
    "value_dict_DK = valuenames.loc[idx_DK].set_index('original_name')['opsd_name'].to_dict()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace all original value names by the OPSD value names\n",
    "DK_wind_df.replace(value_dict_DK, inplace=True)\n",
    "DK_solar_df.replace(value_dict_DK, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Georeferencing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**UTM32 to latitude and longitude** *(Data from Energistyrelsen)*\n",
    "\n",
    "The Energistyrelsen data set offers UTM Geoinformation with the columns utm_east and utm_north belonging to the UTM zone 32. In this section the existing geoinformation (in UTM-format) will be transformed into latidude and longitude coordiates as a uniform standard for geoinformation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Index for all values with utm information\n",
    "idx_notnull = DK_wind_df['utm_east'].notnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert from UTM values to latitude and longitude coordinates\n",
    "DK_wind_df['lonlat'] = DK_wind_df.loc[idx_notnull, ['utm_east', 'utm_north']\n",
    "                                      ].apply(lambda x: utm.to_latlon(x[0],\n",
    "                                                                      x[1],\n",
    "                                                                      32,\n",
    "                                                                      'U'), axis=1).astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split latitude and longitude in two columns\n",
    "lat = []\n",
    "lon = []\n",
    "\n",
    "for row in DK_wind_df['lonlat']:\n",
    "    try:\n",
    "        # Split tuple format\n",
    "        # into the column lat and lon\n",
    "        row = row.lstrip('(').rstrip(')')\n",
    "        lat.append(row.split(',')[0])\n",
    "        lon.append(row.split(',')[1])\n",
    "    except:\n",
    "        # set NAN\n",
    "        lat.append(np.NaN)\n",
    "        lon.append(np.NaN)\n",
    "\n",
    "DK_wind_df['lat'] = pd.to_numeric(lat)\n",
    "DK_wind_df['lon'] = pd.to_numeric(lon)\n",
    "\n",
    "# drop lonlat column that contains both, latitute and longitude\n",
    "DK_wind_df.drop('lonlat', axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Postcode to lat/lon (WGS84)**\n",
    "*(for data from Energinet.dk)*\n",
    "\n",
    "The available post code in the original data provides an approximation for the geocoordinates of the solar power plants.<br>\n",
    "The postcode will be assigned to latitude and longitude coordinates with the help of the postcode table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get geo-information\n",
    "zip_DK_geo = zipfile.ZipFile(filepaths['Geonames'])\n",
    "\n",
    "# Read generated postcode/location file\n",
    "DK_geo = pd.read_csv(zip_DK_geo.open('DK.txt'), sep='\\t', header=None)\n",
    "\n",
    "# add column names as defined in associated readme file\n",
    "DK_geo.columns = ['country_code', 'postcode', 'place_name', 'admin_name1',\n",
    "                  'admin_code1', 'admin_name2', 'admin_code2', 'admin_name3',\n",
    "                  'admin_code3', 'lat', 'lon', 'accuracy']\n",
    "\n",
    "# Drop rows of possible duplicate postal_code\n",
    "DK_geo.drop_duplicates('postcode', keep='last', inplace=True)\n",
    "DK_geo['postcode'] = DK_geo['postcode'].astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add longitude/latitude infomation assigned by postcode (for Energinet.dk data)\n",
    "DK_solar_df = DK_solar_df.merge(DK_geo[['postcode', 'lon', 'lat']],\n",
    "                                on=['postcode'],\n",
    "                                how='left')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show number of units with missing coordinates separated by wind and solar\n",
    "print('Missing Coordinates DK_wind', DK_wind_df.lat.isnull().sum(), 'out of', len(DK_wind_df.index))\n",
    "print('Missing Coordinates DK_solar', DK_solar_df.lat.isnull().sum(), 'out of', len(DK_solar_df.index))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "zip_DK_geo.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Merge DataFrames, add NUTS information and choose columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Merge DataFrames for wind and solar into DK_renewables\n",
    "dataframes = [DK_wind_df, DK_solar_df]\n",
    "DK_renewables = pd.concat(dataframes, sort=False)\n",
    "DK_renewables = DK_renewables.reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign energy source level 1 to the dataframe\n",
    "DK_renewables['energy_source_level_1'] = 'Renewable energy'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Merge the address and address-number columns into one\n",
    "to_string = lambda x: str(x) if not pd.isnull(x) else \"\"\n",
    "DK_renewables['address'] = DK_renewables['address'].map(to_string) + \" \" + DK_renewables['address_number'].map(to_string)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make sure that the column has no whitespaces at the beginning or the end\n",
    "DK_renewables['address'] = DK_renewables['address'].str.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign NUTS codes\n",
    "DK_postcode2nuts = filepaths['Eurostat']\n",
    "DK_renewables = nuts_converter.add_nuts_information(DK_renewables, 'DK', DK_postcode2nuts,\n",
    "                                                    how=['latlon', 'postcode', 'municipality_code', 'municipality_name'])\n",
    "\n",
    "\n",
    "# Report the number of facilites whose NUTS codes were successfully sudetermined\n",
    "determined = DK_renewables['nuts_1_region'].notnull().sum()\n",
    "print('NUTS successfully determined for', determined, 'out of', DK_renewables.shape[0], 'facilities in DK.')\n",
    "\n",
    "# Report the number of facilites whose NUTS codes could not be determined\n",
    "not_determined = DK_renewables['nuts_1_region'].isnull().sum()\n",
    "print('NUTS could not be determined for', not_determined, 'out of', DK_renewables.shape[0], 'facilities in DK.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us check geoinformation on the facilities for which NUTS codes could not be determined."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DK_renewables[DK_renewables['nuts_1_region'].isnull()][['municipality', 'municipality_code', 'lat', 'lon']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we see, no information on municipality and latitude/longitude coordinates are present for those power plants, so there was no possibility to assign them their NUTS codes. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select those columns of the orignal data which are utilised further\n",
    "columns_of_interest = ['commissioning_date', 'energy_source_level_1', 'energy_source_level_2',\n",
    "                   'technology', 'electrical_capacity_kW', 'dso', 'gsrn_id', 'postcode',\n",
    "                   'municipality_code', 'municipality', 'address',\n",
    "                   'utm_east', 'utm_north', 'lon', 'lat', 'nuts_1_region', 'nuts_2_region', 'nuts_3_region',\n",
    "                   'hub_height', 'rotor_diameter', 'manufacturer', 'model', 'data_source']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clean DataFrame from columns other than specified above\n",
    "DK_renewables = DK_renewables.loc[:, columns_of_interest]\n",
    "DK_renewables.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Remove duplicate rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove duplicates\n",
    "DK_renewables.drop_duplicates(inplace=True)\n",
    "DK_renewables.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transform electrical_capacity from kW to MW"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# kW to MW\n",
    "DK_renewables['electrical_capacity_kW'] /= 1000\n",
    "\n",
    "# adapt column name\n",
    "DK_renewables.rename(columns={'electrical_capacity_kW': 'electrical_capacity'},\n",
    "                     inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "visualize_points(DK_renewables['lat'],\n",
    "                 DK_renewables['lon'],\n",
    "                 'Denmark',\n",
    "                 categories=DK_renewables['energy_source_level_2']\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DK_renewables.to_pickle('intermediate/DK_renewables.pickle')\n",
    "del DK_renewables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## France FR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data which will be processed below is provided by the following data sources:\n",
    "\n",
    "**[Ministry for Ecological and Inclusive Transition](https://www.statistiques.developpement-durable.gouv.fr/donnees-locales-relatives-aux-installations-de-production-delectricite-renouvelable-beneficiant-0?rubrique=23&dossier=189)** - Number of installations and installed capacity of the different renewable source for every municipality in France. Data until 31/12/2017. As of 2020, this dataset is no longer maintained by the ministry and we refer to it as the **old dataset**.\n",
    "\n",
    "**[ODRÉ](https://opendata.reseaux-energies.fr/pages/accueil/)** - The Open Data Réseaux Énergies (ODRÉ, Open Data Networks for Energy) platform provides stakeholders with data around the themes of Production, Multi-energy Consumption, Storage, Mobility, Territories and Regions, Infrastructure, Markets and Meteorology. As of 2020, we refer to this dataset as the **new dataset**. It contains the data up to 31/12/2018.\n",
    "\n",
    "**[OpenDataSoft](http://public.opendatasoft.com/explore/dataset/code-postal-code-insee-2015/information/)** - a list of French INSEE codes and corresponding coordinates, published under the [Licence Ouverte (Etalab)](https://www.etalab.gouv.fr/licence-ouverte-open-licence)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:29.969395Z",
     "start_time": "2020-08-20T09:24:29.645906Z"
    }
   },
   "outputs": [],
   "source": [
    "# Download the data\n",
    "filepaths = downloader.download_data_for_country('FR')\n",
    "\n",
    "# Show the local paths\n",
    "filepaths"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ODRE data\n",
    "\n",
    "#### Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:35.211764Z",
     "start_time": "2020-08-20T09:24:34.053168Z"
    }
   },
   "outputs": [],
   "source": [
    "# Load the data\n",
    "FR_re_filepath = filepaths['ODRE']\n",
    "FR_re_df = pd.read_csv(FR_re_filepath,\n",
    "                       sep=';',\n",
    "                       parse_dates=['dateRaccordement', 'dateDeraccordement',\n",
    "                                   'dateMiseEnService', 'dateDebutVersion'],\n",
    "                      infer_datetime_format=True)\n",
    "\n",
    "# Make sure that the column dateDeraccordement is datetime\n",
    "FR_re_df['dateDeraccordement'] = pd.to_datetime(FR_re_df['dateDeraccordement'], errors='coerce')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Translate column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:35.457034Z",
     "start_time": "2020-08-20T09:24:35.225213Z"
    }
   },
   "outputs": [],
   "source": [
    "# Choose the translation terms for France, create dictionary and show it\n",
    "columnnames = pd.read_csv(os.path.join('input', 'column_translation_list.csv'))\n",
    "idx_FR = columnnames[(columnnames['country'] == 'FR') & (columnnames['data_source'] == 'ODRE')].index\n",
    "column_dict_FR = columnnames.loc[idx_FR].set_index('original_name')['opsd_name'].to_dict()\n",
    "column_dict_FR\n",
    "\n",
    "# Translate column names\n",
    "FR_re_df.rename(columns=column_dict_FR, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:35.523218Z",
     "start_time": "2020-08-20T09:24:35.459853Z"
    }
   },
   "outputs": [],
   "source": [
    "# Keep only the columns specified in the translation dictionary as we'll need only them\n",
    "columns_to_keep = list(column_dict_FR.values())\n",
    "FR_re_df = FR_re_df.loc[:, columns_to_keep]\n",
    "FR_re_df.reset_index(drop=True, inplace=True)\n",
    "\n",
    "# Show a pair of rows\n",
    "FR_re_df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Add data source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:35.813485Z",
     "start_time": "2020-08-20T09:24:35.809710Z"
    }
   },
   "outputs": [],
   "source": [
    "FR_re_df['data_source'] = 'Open Data Réseaux Énergies'\n",
    "FR_re_df['as_of_year'] = 2018 # Year for which the dataset has been compiled by the data source"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Translate values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:36.570439Z",
     "start_time": "2020-08-20T09:24:36.436457Z"
    }
   },
   "outputs": [],
   "source": [
    "# Choose the translation terms for France, create a dictionary and show it\n",
    "valuenames = pd.read_csv(os.path.join('input', 'value_translation_list.csv'))\n",
    "\n",
    "idx_FR = valuenames[(valuenames['country'] == 'FR') & (valuenames['data_source'] == 'ODRE')].index\n",
    "value_dict_FR = valuenames.loc[idx_FR].set_index('original_name')['opsd_name'].to_dict()\n",
    "value_dict_FR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:39.099402Z",
     "start_time": "2020-08-20T09:24:36.837742Z"
    }
   },
   "outputs": [],
   "source": [
    "# Replace all original value names by the OPSD value names\n",
    "FR_re_df.replace(value_dict_FR, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Correct site names\n",
    "\n",
    "Some facilites do not come with their names. Instead, strings such as `Agrégation des installations de moins de 36KW`, `Confidentiel` and `confidentiel` are used. Here, we correct this by setting all such names to `np.nan`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:39.159600Z",
     "start_time": "2020-08-20T09:24:39.104150Z"
    }
   },
   "outputs": [],
   "source": [
    "no_name_aliases = ['Agrégation des installations de moins de 36KW', 'Confidentiel', 'confidentiel']\n",
    "no_name_mask = FR_re_df['site_name'].isin(no_name_aliases)\n",
    "FR_re_df.loc[no_name_mask, 'site_name'] = np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Replace suspicious dates with N/A\n",
    "\n",
    "The commissioning dates of some solar and wind plants are set in the early 20th and late 19th centuries. We replace those dates with N/A since they do not make sense."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:39.950404Z",
     "start_time": "2020-08-20T09:24:39.162479Z"
    }
   },
   "outputs": [],
   "source": [
    "mask = (FR_re_df['commissioning_date'].dt.year <= 1900) &\\\n",
    "       ((FR_re_df['technology'].isin(['Photovoltaics', 'Onshore']) |\\\n",
    "        (FR_re_df['energy_source_level_2'] == 'Solar')))\n",
    "FR_re_df.loc[mask, 'commissioning_date'] = np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:40.817418Z",
     "start_time": "2020-08-20T09:24:39.952564Z"
    }
   },
   "outputs": [],
   "source": [
    "#for x in FR_re_df[FR_re_df['commissioning_date'].dt.year <= 1980]['technology']:\n",
    "#    print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Check missing values\n",
    "Now, we will drop all the columns and all the rows which contain only null values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:41.098070Z",
     "start_time": "2020-08-20T09:24:40.822441Z"
    }
   },
   "outputs": [],
   "source": [
    "# Check the columns\n",
    "FR_re_df.isnull().all()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we see above, no column contains only the null value, so we do not need to drop any."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:41.191573Z",
     "start_time": "2020-08-20T09:24:41.168054Z"
    }
   },
   "outputs": [],
   "source": [
    "# Check the rows\n",
    "print('There is a row containing all the null values?')\n",
    "FR_re_df.isnull().all(axis=1).any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No row contains only the null values, so no need to for filtering on that basis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Standardize the energy types and technologies\n",
    "\n",
    "Now, we proceed with standardizing the energy types and technologies present in the data according to the [**OPSD energy hierarchy**](http://open-power-system-data.org/2016-10-25-opsd_tree.svg)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:41.655001Z",
     "start_time": "2020-08-20T09:24:41.639979Z"
    }
   },
   "outputs": [],
   "source": [
    "FR_re_df[['energy_source_level_2', 'technology']].drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to facilitate further processing, we can remove the rows that we know for sure we won't need.\n",
    "Those are the rows satisfying either of the following conditions:\n",
    "* `electrical_capacity` is `0` or `NaN`,\n",
    "* `energy_source_level_2` corresponds to a non-renewable energy type (`Non-renewable thermal`, `Non-hydraulic storage`, `Nuclear`),\n",
    "* `technology` indicates that a non-renewable technology is used at the facility (`Steam turbine`, `Combustion cogeneration`, `Combustion engine`, `Combined cycle`, `Pumped storage`, `Piston motor`, `Nuclear fission`).\n",
    "* `energy_source_level_2` is `Other` and `technology` is `Other` or `NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:42.583174Z",
     "start_time": "2020-08-20T09:24:42.323292Z"
    }
   },
   "outputs": [],
   "source": [
    "# Define the mask for selecting rows with unusable info on electrical capacity\n",
    "ec_mask = (FR_re_df['electrical_capacity'] == 0) | (FR_re_df['electrical_capacity'].isna())\n",
    "\n",
    "# Define the mask for selecting the rows with non-renewable energy_source_level_2\n",
    "non_renewable_esl2 = ['Non-renewable thermal', 'Non-hydraulic storage', 'Nuclear']\n",
    "non_renewable_esl2_mask = FR_re_df['energy_source_level_2'].isin(non_renewable_esl2)\n",
    "\n",
    "# Define the mask to select the rows with non-renewable technology\n",
    "non_renewable_technologies = ['Steam turbine', 'Combustion cogeneration', 'Combustion engine',\n",
    "                              'Combined cycle', 'Pumped storage', 'Piston motor', 'Nuclear fission']\n",
    "non_renewable_technology_mask = FR_re_df['technology'].isin(non_renewable_technologies)\n",
    "\n",
    "# Define the mask to select the rows without specified energy type and technology\n",
    "other_mask = (FR_re_df['energy_source_level_2'] == 'Other') & \\\n",
    "            ((FR_re_df['technology'] == 'Other') | (pd.isnull(FR_re_df['technology'])))\n",
    "\n",
    "# Combine the masks\n",
    "drop_mask = ec_mask | non_renewable_esl2_mask | non_renewable_technology_mask | other_mask\n",
    "\n",
    "# Show how many rows are going to be dropped\n",
    "print('Dropping', drop_mask.sum(), 'rows out of', FR_re_df.shape[0])\n",
    "\n",
    "# Keep all the rows not selected by the drop mask\n",
    "keep_mask = ~drop_mask\n",
    "FR_re_df = FR_re_df[keep_mask].reindex()\n",
    "\n",
    "# Show some rows\n",
    "print(\"A sample of the kept data:\")\n",
    "FR_re_df.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Standardize source levels 1-3 and technology"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us see the energy types and technologies present in the filtered data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:43.047071Z",
     "start_time": "2020-08-20T09:24:43.033438Z"
    }
   },
   "outputs": [],
   "source": [
    "FR_re_df[['energy_source_level_2', 'technology']].drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let us standardize the values for energy source level 2 and technology.\n",
    "1. We will use `np.nan` to indicate that technology should not be specified for the respective kind of sources according to the [**OPSD hierarchy**](http://open-power-system-data.org/2016-10-25-opsd_tree.svg).\n",
    "2. `'Other or unspecified technology'` will mean that technology should be specified but it was unclear or missing in the original dataset.\n",
    "\n",
    "That means that we need to apply the following correction rules to the current data:\n",
    "- All occurences of `Other` in the column `technology` should be replaced with `Other or unspecified technology`.\n",
    "- If `energy_source_level_2` is `Marine`, `Geothermal`, or `Bioenergy`, then `technology` should be set to `np.nan` regardless of what is specified in the data set.\n",
    "- If `energy_source_level_2` is `Solar` or `Hydro`, and `technology` is `NaN`, then `technology` should be set to `Other or unspecified technology`.\n",
    "- If `energy_source_level_2` is `Wind` and `technology` is `NaN`, then `technology` should be set to `Onshore` since France has no offshore wind farms.\n",
    "- If `energy_source_level_2` is `Hydro` and `technology` is `Lake` or `Closed`, then `technology` should be set to `Other or unspecified technology`.\n",
    "- If `energy_source_level_2` is `Solar` and `technology` is `Thermodynamic`, then `technology` should be set to `Other or unspecified technology`.\n",
    "- If `energy_source_level_2` is `Other` and `technology` is `Photovoltaics`, then `energy_source_level_2` should be set to `Solar`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:50.244184Z",
     "start_time": "2020-08-20T09:24:43.501242Z"
    }
   },
   "outputs": [],
   "source": [
    "# Make sure that the proper string is used to indicate other or unspecified technology\n",
    "FR_re_df['technology'].replace('Other', 'Other or unspecified technology', inplace=True)\n",
    "\n",
    "# Define a function that will deal with other cases\n",
    "def standardize(row):  \n",
    "    level_2 = row['energy_source_level_2']\n",
    "    technology = row['technology']\n",
    "    \n",
    "    if level_2 in ['Marine', 'Geothermal', 'Bioenergy']:\n",
    "        technology = np.nan\n",
    "    elif level_2 in ['Solar', 'Hydro', 'Other'] and pd.isna(technology):\n",
    "        technology = 'Other or unspecified technology'\n",
    "    elif level_2 == 'Wind' and (pd.isna(technology) or technology == 'Other or unspecified technology'):\n",
    "        technology = 'Onshore'\n",
    "    \n",
    "    if level_2 == 'Hydro' and technology in ['Lake', 'Closed']:\n",
    "        technology = 'Other or unspecified technology'\n",
    "    elif level_2 == 'Solar' and technology == 'Thermodynamic':\n",
    "        technology = 'Other or unspecified technology'\n",
    "    elif level_2 == 'Other' and technology == 'Photovoltaics':\n",
    "        level_2 = 'Solar'\n",
    "        \n",
    "    \n",
    "    return [level_2, technology]\n",
    "\n",
    "# Apply the rules coded in function standardize\n",
    "FR_re_df[['energy_source_level_2', 'technology']] = FR_re_df.apply(standardize, axis=1, result_type='expand')\n",
    "\n",
    "# Show the existing level 2 types and technologies\n",
    "FR_re_df[['energy_source_level_2', 'technology']].drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us now deal with the third level of the energy hierarchy. Only `Bioenergy` has the third level. Information on it can be found in the column `energy_source_level_3` (whose original name was `combustible`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:50.257008Z",
     "start_time": "2020-08-20T09:24:50.245667Z"
    }
   },
   "outputs": [],
   "source": [
    "FR_re_df[['energy_source_level_2', 'energy_source_level_3']].drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that only the following two corrections are needed:\n",
    "- If `energy_source_level_3` is `Wood`, set `energy_source_level_3` to `Biomass and biogas`.\n",
    "- If `energy_source_level_3` is `NaN`, and `energy_source_level_2` is `Bioenergy`, set `energy_source_level_3` to `Other or unspecified`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:50.410614Z",
     "start_time": "2020-08-20T09:24:50.258452Z"
    }
   },
   "outputs": [],
   "source": [
    "index = (pd.isna(FR_re_df['energy_source_level_3']) & \\\n",
    "        (FR_re_df['energy_source_level_2'] == 'Bioenergy'))\n",
    "FR_re_df.loc[index, 'energy_source_level_3'] = 'Other or unspecified'\n",
    "         \n",
    "index = FR_re_df['energy_source_level_3'] == 'Wood'\n",
    "FR_re_df.loc[index, 'energy_source_level_3'] = 'Biomass and biogas'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we declare all the plants as renewable and show the final hierarchy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:50.712964Z",
     "start_time": "2020-08-20T09:24:50.415378Z"
    }
   },
   "outputs": [],
   "source": [
    "# Assign energy_source_level_1 to the dataframe\n",
    "FR_re_df['energy_source_level_1'] = 'Renewable energy'\n",
    "\n",
    "# Show the hierarchy\n",
    "energy_columns = ['energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology']\n",
    "FR_re_df[energy_columns].drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Georeferencing\n",
    "\n",
    "First, we will determine the plants' longitude and latitude coordinates, and then assign them their NUTS codes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Municipality (INSEE) code to lon/lat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:53.435056Z",
     "start_time": "2020-08-20T09:24:50.715144Z"
    }
   },
   "outputs": [],
   "source": [
    "# Get the local path of the downloaded georeferencing data\n",
    "FR_geo_filepath = filepaths['Opendatasoft']\n",
    "\n",
    "# Read INSEE Code Data\n",
    "FR_geo = pd.read_csv(FR_geo_filepath,\n",
    "                     sep=';',\n",
    "                     header=0,\n",
    "                     converters={'Code_postal': str})\n",
    "\n",
    "# Drop possible duplicates of the same INSEE code\n",
    "FR_geo.drop_duplicates('INSEE_COM', keep='last', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:53.568592Z",
     "start_time": "2020-08-20T09:24:53.436686Z"
    }
   },
   "outputs": [],
   "source": [
    "# create columns for latitude/longitude\n",
    "lat = []\n",
    "lon = []\n",
    "\n",
    "# split in latitude/longitude\n",
    "for row in FR_geo['Geo Point']:\n",
    "    try:\n",
    "        # Split tuple format\n",
    "        # into the column lat and lon\n",
    "        row = row.lstrip('(').rstrip(')')\n",
    "        lat.append(row.split(',')[0])\n",
    "        lon.append(row.split(',')[1])\n",
    "    except:\n",
    "        # set NAN\n",
    "        lat.append(np.NaN)\n",
    "        lon.append(np.NaN)\n",
    "\n",
    "# add these columns to the INSEE DataFrame\n",
    "FR_geo['lat'] = pd.to_numeric(lat)\n",
    "FR_geo['lon'] = pd.to_numeric(lon)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:24:54.071497Z",
     "start_time": "2020-08-20T09:24:53.569941Z"
    }
   },
   "outputs": [],
   "source": [
    "# Column names of merge key have to be named identically\n",
    "FR_re_df.rename(columns={'municipality_code': 'INSEE_COM'}, inplace=True)\n",
    "\n",
    "# Merge longitude and latitude columns by the Code INSEE\n",
    "FR_re_df = FR_re_df.merge(FR_geo[['INSEE_COM', 'lat', 'lon']],\n",
    "                          on=['INSEE_COM'],\n",
    "                          how='left')\n",
    "\n",
    "# Translate Code INSEE column back to municipality_code\n",
    "FR_re_df.rename(columns={'INSEE_COM': 'municipality_code'}, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Determine NUTS codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:34.781141Z",
     "start_time": "2020-08-20T09:24:54.073968Z"
    }
   },
   "outputs": [],
   "source": [
    "#import importlib\n",
    "#importlib.reload(util.nuts_converter)\n",
    "#from util.nuts_converter import NUTSConverter\n",
    "#nuts_converter = NUTSConverter(downloader, eurostat_eu_directory_path)\n",
    "\n",
    "FR_postcode2nuts_path = filepaths['Eurostat']\n",
    "FR_re_df = nuts_converter.add_nuts_information(FR_re_df, 'FR', FR_postcode2nuts_path,\n",
    "                                               lau_name_type='NATIONAL',\n",
    "                                               closest_approximation=True,\n",
    "                                               how=['municipality_code', 'latlon'])\n",
    "\n",
    "# Report the number of facilites whose NUTS codes were successfully determined\n",
    "determined = FR_re_df['nuts_1_region'].notnull().sum()\n",
    "print('NUTS successfully determined for', determined, 'out of', FR_re_df.shape[0], 'facilities in FR.')\n",
    "\n",
    "# Report the number of facilites whose NUTS codes could not be determined\n",
    "not_determined = FR_re_df['nuts_1_region'].isnull().sum()\n",
    "print('NUTS could not be determined for', not_determined, 'out of', FR_re_df.shape[0], 'facilities in FR.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us now check the facilities without NUTS classification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:35.022997Z",
     "start_time": "2020-08-20T09:25:34.782663Z"
    }
   },
   "outputs": [],
   "source": [
    "# Check the facilities without NUTS classification\n",
    "no_nuts = FR_re_df['nuts_1_region'].isnull()\n",
    "\n",
    "# Find the masks where some information for finding the proper NUTS code is present\n",
    "lat_or_lon_present = ~(FR_re_df['lat'].isna() & FR_re_df['lon'].isna())\n",
    "municipality_code_present = ~(FR_re_df['municipality_code'].isnull())\n",
    "municipality_name_present = ~(FR_re_df['municipality'].isnull())\n",
    "\n",
    "# Show the cases where NUTS classification failed even though it shouldn't have\n",
    "print('1. No NUTS code but latitude/longitude info present')\n",
    "problematic_lat_lon = FR_re_df[no_nuts & lat_or_lon_present][['lat', 'lon']]\n",
    "display(problematic_lat_lon)\n",
    "\n",
    "print('2. No NUTS code but municipality code info present')\n",
    "problematic_municipality_codes = FR_re_df[no_nuts & municipality_code_present]['municipality_code'].unique()\n",
    "display(problematic_municipality_codes)\n",
    "\n",
    "print('3. No NUTS code but municipality name info present')\n",
    "problematic_municipality_names = FR_re_df[no_nuts & municipality_name_present]['municipality'].unique()\n",
    "display(problematic_municipality_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that no row with known longitude and latitude was left unclassified. \n",
    "\n",
    "What we also see is that some municipality codes did not translate to the corresponding NUTS codes. Further inspection shows that those codes are not present in the official NUTS translation tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:35.478370Z",
     "start_time": "2020-08-20T09:25:35.092510Z"
    }
   },
   "outputs": [],
   "source": [
    "# Check if the any problematic code is actually present in the translation table\n",
    "present_any = False\n",
    "for code in problematic_municipality_codes:\n",
    "    mask = nuts_converter.municipality2nuts_df['municipality_code'].str.match(code)\n",
    "    present_any = present_any or mask.any()\n",
    "    \n",
    "print(present_any)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also see that problematic municipality names are either not present in the official translation tables or more than one municipality in the tables bears them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:35.753523Z",
     "start_time": "2020-08-20T09:25:35.479601Z"
    }
   },
   "outputs": [],
   "source": [
    "# Print only the names of those problematic municipalities, which appear in the translation table only once.\n",
    "for name in problematic_municipality_names:\n",
    "    mask = nuts_converter.municipality2nuts_df['municipality'].str.match(name)\n",
    "    if mask.sum() == 1:\n",
    "        print(name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Therefore, we can confirm that NUTS classification codes were determined with the highest precision possible."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Convert electrical capacity to MW"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:35.821296Z",
     "start_time": "2020-08-20T09:25:35.754713Z"
    }
   },
   "outputs": [],
   "source": [
    "FR_re_df['electrical_capacity'] = FR_re_df['electrical_capacity'] / 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Old data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:37.943469Z",
     "start_time": "2020-08-20T09:25:35.827910Z"
    }
   },
   "outputs": [],
   "source": [
    "# Load the data\n",
    "FR_re_filepath = filepaths['gouv.fr']\n",
    "\n",
    "FR_re_df_old = pd.read_excel(FR_re_filepath,\n",
    "                         sheet_name='Commune',\n",
    "                         encoding='UTF8',\n",
    "                         thousands='.',\n",
    "                         decimals=',',\n",
    "                         header=[3, 4],\n",
    "                         skipfooter=9,      # skip the summary rows \n",
    "                         index_col=[0, 1],  # required for MultiIndex\n",
    "                         converters={'Code officiel géographique': str})\n",
    "FR_re_df_old.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This French data source contains number of installations and sum of installed capacity per energy source per municipality. The list is limited to the plants which are covered by article 10 of february 2000 by an agreement to a purchase commitment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:38.458650Z",
     "start_time": "2020-08-20T09:25:37.946652Z"
    }
   },
   "outputs": [],
   "source": [
    "# Rearrange data\n",
    "FR_re_df_old.index.rename(['insee_com', 'municipality'], inplace=True)\n",
    "FR_re_df_old.columns.rename(['energy_source_level_2', None], inplace=True)\n",
    "FR_re_df_old = (FR_re_df_old\n",
    "                .stack(level='energy_source_level_2', dropna=False)\n",
    "                .reset_index(drop=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:38.587560Z",
     "start_time": "2020-08-20T09:25:38.461226Z"
    }
   },
   "outputs": [],
   "source": [
    "# Choose the translation terms for France, create dictionary and show dictionary\n",
    "idx_FR = columnnames[(columnnames['country'] == 'FR') & (columnnames['data_source'] == 'gouv.fr')].index\n",
    "column_dict_FR = columnnames.loc[idx_FR].set_index('original_name')['opsd_name'].to_dict()\n",
    "column_dict_FR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:40.102888Z",
     "start_time": "2020-08-20T09:25:38.594130Z"
    }
   },
   "outputs": [],
   "source": [
    "# Translate columnnames\n",
    "FR_re_df_old.rename(columns=column_dict_FR, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:40.441036Z",
     "start_time": "2020-08-20T09:25:40.104510Z"
    }
   },
   "outputs": [],
   "source": [
    "# Drop all rows that contain NA \n",
    "FR_re_df_old = FR_re_df_old.dropna()\n",
    "FR_re_df_old.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Add data source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:40.662943Z",
     "start_time": "2020-08-20T09:25:40.442217Z"
    }
   },
   "outputs": [],
   "source": [
    "FR_re_df_old['data_source'] = 'Ministry for the Ecological and Inclusive Transition'\n",
    "FR_re_df_old['as_of_year'] = 2017 # Year for which the dataset has been compiled by the data source"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Translate values and harmonize energy source level 2\n",
    "\n",
    "##### Kept secret if number of installations < 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the number of installations is less than 3, it is marked with an s instead of the number 1 or 2 due to statistical confidentiality (as explained by the data provider). Here, the s is changed to < 3. This is done in the same step as the other value translations of the energy sources."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:41.017556Z",
     "start_time": "2020-08-20T09:25:40.668571Z"
    }
   },
   "outputs": [],
   "source": [
    "# Choose the translation terms for France, create dictionary and show dictionary\n",
    "idx_FR = valuenames[(valuenames['country'] == 'FR') & (valuenames['data_source'] == 'gouv.fr')].index\n",
    "value_dict_FR = valuenames.loc[idx_FR].set_index('original_name')['opsd_name'].to_dict()\n",
    "value_dict_FR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:41.319429Z",
     "start_time": "2020-08-20T09:25:41.019427Z"
    }
   },
   "outputs": [],
   "source": [
    "# Replace all original value names by the OPSD value names\n",
    "FR_re_df_old.replace(value_dict_FR, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Separate and assign energy source level 1-3 and technology"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:41.686562Z",
     "start_time": "2020-08-20T09:25:41.390406Z"
    }
   },
   "outputs": [],
   "source": [
    "energy_source_dict_FR = valuenames.loc[idx_FR].set_index(\n",
    "    'opsd_name')['energy_source_level_2'].to_dict()\n",
    "display(energy_source_dict_FR)\n",
    "display(FR_re_df_old[['energy_source_level_2']].drop_duplicates())\n",
    "(FR_re_df_old['energy_source_level_2'].replace(energy_source_dict_FR).unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:41.939362Z",
     "start_time": "2020-08-20T09:25:41.688107Z"
    }
   },
   "outputs": [],
   "source": [
    "# Create dictionnary in order to assign energy_source to its subtype\n",
    "energy_source_dict_FR = valuenames.loc[idx_FR].set_index(\n",
    "    'opsd_name')['energy_source_level_2'].to_dict()\n",
    "\n",
    "# Column energy_source partly contains subtype information, thus this column is copied\n",
    "# to new column for energy_source_subtype.\n",
    "FR_re_df_old['technology'] = FR_re_df_old['energy_source_level_2']\n",
    "\n",
    "# Only Photovoltaics should be kept as technology. Hydro should be changed to 'Other or unspecified technology',\n",
    "# Geothermal to NaN, and Wind to Onshore.\n",
    "# 1. np.nan means that technology should not be specified for the respective kind of sources\n",
    "#    according to the hierarchy (http://open-power-system-data.org/2016-10-25-opsd_tree.svg)\n",
    "# 2. 'Other or unspecified technology' means that technology should be specified\n",
    "#    but it was unclear or missing in the original dataset.\n",
    "technology_translation_dictionary = {\n",
    "    'Solar' : 'Photovoltaics',\n",
    "    'Wind': 'Onshore',\n",
    "    'Hydro': 'Other or unspecified technology',\n",
    "    'Geothermal': np.nan\n",
    "}\n",
    "FR_re_df_old['technology'].replace(technology_translation_dictionary, inplace=True)\n",
    "\n",
    "# The energy source subtype values in the energy_source column are replaced by\n",
    "# the higher level classification\n",
    "FR_re_df_old['energy_source_level_2'].replace(energy_source_dict_FR, inplace=True)\n",
    "\n",
    "# Assign energy_source_level_1 to the dataframe\n",
    "FR_re_df_old['energy_source_level_1'] = 'Renewable energy'\n",
    "\n",
    "FR_re_df_old.reset_index(drop=True, inplace=True)\n",
    "\n",
    "# Choose energy source level 2 entries where energy source level 2 is Bioenergy in order to  \n",
    "# seperate Bioenergy subtypes to energy source level 3 and subtypes for the rest to technology\n",
    "idx_FR_Bioenergy = FR_re_df_old[FR_re_df_old['energy_source_level_2'] == 'Bioenergy'].index\n",
    "\n",
    "# Assign technology to energy source level 3  for all entries where energy source level 2 is  \n",
    "# Bioenergy and delete those entries from  technology\n",
    "FR_re_df_old[['energy_source_level_3']] = FR_re_df_old.iloc[idx_FR_Bioenergy][['technology']]\n",
    "FR_re_df_old.loc[idx_FR_Bioenergy,'technology'] = np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show the hierarchy of the energy types present in the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:42.324969Z",
     "start_time": "2020-08-20T09:25:41.940764Z"
    }
   },
   "outputs": [],
   "source": [
    "FR_re_df_old[['energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology']].drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Georeferencing\n",
    "\n",
    "##### Municipality (INSEE) code to lat/lon"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:25:42.664065Z",
     "start_time": "2020-08-20T09:25:42.326952Z"
    }
   },
   "outputs": [],
   "source": [
    "# Column names of merge key have to be named identically\n",
    "FR_re_df_old.rename(columns={'municipality_code': 'INSEE_COM'}, inplace=True)\n",
    "\n",
    "# Merge longitude and latitude columns by the Code INSEE\n",
    "FR_re_df_old = FR_re_df_old.merge(FR_geo[['INSEE_COM', 'lat', 'lon']],\n",
    "                          on=['INSEE_COM'],\n",
    "                          how='left')\n",
    "\n",
    "# Translate Code INSEE column back to municipality_code\n",
    "FR_re_df_old.rename(columns={'INSEE_COM': 'municipality_code'}, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Determine NUTS codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:26:19.346740Z",
     "start_time": "2020-08-20T09:25:42.665466Z"
    }
   },
   "outputs": [],
   "source": [
    "FR_postcode2nuts_path = filepaths['Eurostat']\n",
    "\n",
    "FR_re_df_old = nuts_converter.add_nuts_information(FR_re_df_old, 'FR', FR_postcode2nuts_path,\n",
    "                                               how=['municipality_code', 'latlon'])\n",
    "# how=['municipality', 'municipality_code', 'latlon']\n",
    "# Report the number of facilites whose NUTS codes were successfully determined\n",
    "determined = FR_re_df_old['nuts_1_region'].notnull().sum()\n",
    "print('NUTS successfully determined for', determined, 'out of', FR_re_df_old.shape[0], 'facilities in FR.')\n",
    "\n",
    "# Report the number of facilites whose NUTS codes could not be determined\n",
    "not_determined = FR_re_df_old['nuts_1_region'].isnull().sum()\n",
    "print('NUTS could not be determined for', not_determined, 'out of', FR_re_df_old.shape[0], 'facilities in FR.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:26:19.357609Z",
     "start_time": "2020-08-20T09:26:19.348150Z"
    }
   },
   "outputs": [],
   "source": [
    "# Show the facilities without NUTS classification\n",
    "FR_re_df_old[FR_re_df_old['nuts_1_region'].isnull()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the NUTS codes were determined successfully for all the facilities in the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Integrate old and new data\n",
    "\n",
    "Some municipalities are not covered by the new data set, provided by ODRE. Now, we find those municipalities and integrate them with the new data.\n",
    "\n",
    "The only column present in the old data, but not in the new, is `number_of_installations`. Since the old data\n",
    "were aggregated on the municipality level, the column in question refers to the numbers of power plants in the \n",
    "municipalitis. Since the new data covers individual plants, if we set the column `number_of_installations` to 1\n",
    "for all the plants in the the new data, we will make the two sets consistent with one another and be able\n",
    "to concatenate them. \n",
    "\n",
    "We will set `site_name` to 'Aggregated data for `municipality`' for all the rows from the old data, where `municipality` refers to the name of the municipality for which the row has been compiled.\n",
    "\n",
    "**Note**: the electrical capacity in the old data is already in MW, so conversion is not needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:26:19.624974Z",
     "start_time": "2020-08-20T09:26:19.358554Z"
    }
   },
   "outputs": [],
   "source": [
    "# For each column present in the new data's column space, but not the old,\n",
    "# add an empty column to the old data.\n",
    "\n",
    "for new_column in FR_re_df.columns:\n",
    "    if new_column not in FR_re_df.columns:\n",
    "        FR_re_df_old[new_column] = np.nan\n",
    "\n",
    "# Define the mask to select the municipalities from the old data, that are not covered\n",
    "# by the new.\n",
    "not_included = ~(FR_re_df_old['municipality_code'].isin(FR_re_df['municipality_code']))\n",
    "FR_re_df_old[not_included]\n",
    "\n",
    "# Add a dummy column to the new data frame\n",
    "# representing the number of power plants (always 1)\n",
    "FR_re_df['number_of_installations'] = 1\n",
    "\n",
    "# Mark the old data rows as aggregations on municipality level.\n",
    "FR_re_df_old['site_name'] = 'Aggregated data for ' + FR_re_df_old['municipality']\n",
    "\n",
    "# Concatenate the new data with the old rows referring to the municipalities\n",
    "# not covered by the new.\n",
    "FR_re_df = pd.concat([FR_re_df, FR_re_df_old[not_included]], ignore_index=True, axis='index', sort=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select the columns\n",
    "Now, we select the columns we want to keep."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:26:19.846159Z",
     "start_time": "2020-08-20T09:26:19.626329Z"
    }
   },
   "outputs": [],
   "source": [
    "columns_to_keep = ['EIC_code', 'municipality_group_code', 'IRIS_code', 'as_of_year',\n",
    "       'commissioning_date', 'connection_date', 'data_source', 'departement',\n",
    "       'departement_code', 'disconnection_date',\n",
    "       'electrical_capacity', 'energy_source_level_1', 'energy_source_level_2',\n",
    "       'energy_source_level_3', 'lat', 'lon',\n",
    "       'municipality', 'municipality_code',\n",
    "       'municipality_group', 'number_of_installations', 'nuts_1_region',\n",
    "       'nuts_2_region', 'nuts_3_region', 'region', 'region_code', 'site_name',\n",
    "       'source_station_code', 'technology']\n",
    "FR_re_df = FR_re_df[columns_to_keep]\n",
    "FR_re_df.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:26:21.974925Z",
     "start_time": "2020-08-20T09:26:19.848102Z"
    }
   },
   "outputs": [],
   "source": [
    "visualize_points(FR_re_df['lat'],\n",
    "                 FR_re_df['lon'],\n",
    "                 'France',\n",
    "                 categories=FR_re_df['energy_source_level_2']\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:26:22.139639Z",
     "start_time": "2020-08-20T09:26:21.976125Z"
    }
   },
   "outputs": [],
   "source": [
    "FR_re_df.to_pickle('intermediate/FR_renewables.pickle')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-20T09:26:22.467234Z",
     "start_time": "2020-08-20T09:26:22.195330Z"
    }
   },
   "outputs": [],
   "source": [
    "del FR_re_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Poland PL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download\n",
    "The data which will be processed below is provided by the following data source:\n",
    "\n",
    "**[Urzad Regulacji Energetyki (URE) / Energy Regulatory Office](https://www.ure.gov.pl/pl/oze/potencjal-krajowy-oze/8108,Instalacje-odnawialnych-zrodel-energii-wg-stanu-na-dzien-31-grudnia-2019-r.html)** - Installed capacities of renewable-energy power plants in Poland. The plants are anonymized in the sense that no names, post codes or geographical coordinates are present. They are described by: the energy type their use, installed capacity, województwo (province) and powiat (district) that they are located in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the data\n",
    "filepaths = downloader.download_data_for_country('PL')\n",
    "\n",
    "# Get the local paths to the data files\n",
    "PL_re_filepath = filepaths['Urzad Regulacji Energetyki']\n",
    "PL_postcode2nuts_filepath = filepaths['Eurostat']\n",
    "PL_geo_filepath = filepaths['Geonames']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load and explore the data\n",
    "\n",
    "The dataset comes in the `csv` format. Let us open it, inspect its columns and clean it a bit before processing it further."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the data into a pandas dataframe\n",
    "PL_re_df = pd.read_excel(PL_re_filepath,\n",
    "                       encoding='latin',\n",
    "                       header=2,\n",
    "                       skipfooter=14\n",
    "                       )\n",
    "# Show 5 random rows\n",
    "PL_re_df.sample(n=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are only five columns:\n",
    "- `Lp.`: the ordinal number of the entry (power plant), effectively serving as its identification number.\n",
    "- `Województwo`: the province (voivodeship) where the plant is located\n",
    "- `Powiat`: the district where the plant is located\n",
    "- `Rodzaj_OZE`: the code of the energy the plants uses. According to the legend in the .xlsx file, the codes are as follows:\n",
    "  - `BG`: biogas\n",
    "  - `BM`: biomass\n",
    "  - `PVA`: solar energy\n",
    "  - `WIL`: wind energy\n",
    "  - `WO`: hydroenergy\n",
    "  - `WS`: using the technology of co-firing biomass, biogas or bioliquids with other fuels (fossil fuels and biomass / biogas / bioliquids)\n",
    "- `Moc zainstalowana [MW]`: installed capacity (in MWs).\n",
    "\n",
    "The type corresponding to `WS` does not fit into the [**OPSD energy hiearchy**](http://open-power-system-data.org/2016-10-25-opsd_tree.svg), so we can drop such plants."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the mask for selecting the WS plants\n",
    "ws_mask = PL_re_df['Rodzaj_OZE'] == 'WS'\n",
    "\n",
    "# Drop them\n",
    "print('Dropping', ws_mask.sum(), 'out of', PL_re_df.shape[0], 'power plants.')\n",
    "PL_re_df.drop(PL_re_df.index[ws_mask], axis=0, inplace=True)\n",
    "PL_re_df.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To ease the work, we can translate the columns' names to English using the OPSD translation tables. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for Poland, create and show the dictionary\n",
    "columnnames = pd.read_csv(os.path.join('input', 'column_translation_list.csv'))\n",
    "idx_PL = columnnames[(columnnames['country'] == 'PL') &\n",
    "                     (columnnames['data_source'] == 'Urzad Regulacji Energetyki')].index\n",
    "column_dict_PL = columnnames.loc[idx_PL].set_index('original_name')['opsd_name'].to_dict()\n",
    "column_dict_PL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Translate column names\n",
    "PL_re_df.rename(columns=column_dict_PL, inplace=True)\n",
    "\n",
    "# Show a couple of rows\n",
    "PL_re_df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspect the data\n",
    "\n",
    "Let us do few quick checks to see state of the data:\n",
    "- Are there any `NA` values?\n",
    "- Are all the values in the column `electrical_capacity` proper numbers?\n",
    "- Are all the values in the column `energy_type` (codes of energy types) consistent strings? Here we check if all the codes appear in one and only one form. For example, `PVA` is the code for solar energy and we want to make sure that only `PVA` appears in the column, not other variations such as `pva`, `Pva` etc.\n",
    "- What is the form of the geographical data? Are some districts represented by different strings in different rows? What about the regions (provinces, województwa, voivodeships)?\n",
    "\n",
    "We will need the answers to those questions to know how to proceed with processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The number of missing values in the data:', PL_re_df.isna().sum().sum())\n",
    "\n",
    "print('Are all capacities proper numbers?', PL_re_df['electrical_capacity'].dtype == 'float64')\n",
    "\n",
    "print('What about the energy codes?', PL_re_df['energy_type'].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the voivodeships\n",
    "print('Show the names of the voivodeships.')\n",
    "PL_re_df['region'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that each name comes in two forms: (1) with the first letter capital and (2) with the first letter lowercase. One province is referred to by three different strings: `'Śląskie'`, `'śląskie'`, and `'śląskie '` (the last with a trailing white space). In order to standardize this column, we trim and capitalize all the strings appearing in it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PL_re_df['region'] = PL_re_df['region'].str.strip().str.capitalize()\n",
    "PL_re_df['region'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let us check the strings for districts (powiats)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "districts = PL_re_df['district'].unique()\n",
    "districts.sort()\n",
    "districts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we see in the list, the same district can be referred to by more than one string. We identify the following ways a district is referred to in the dataset:\n",
    "1. by using the noun in the nominative case, capitalized (e.g. `Kraków`),\n",
    "2. by prepending `m. ` or `m. st. ` to the form 1 (e.g. `m. Kraków` or `m. st. Warszawy`) and\n",
    "3. by the possesive adjective, lowercase (e.g. `krakowski`).\n",
    "\n",
    "Some districts, such as Krakow, appear in all the three forms, but there are those which appear in two (e.g. `Bytom` and `m. Bytom`). This will pose a problem when we later try to assign the plants their NUTS codes. Furthermore, the NUTS translation tables do not map districts to the codes, but lower administrative units (municipalities) and postcodes to NUTS. We solve this issue at a later point in the notebook, Section *Georeferencing (NUTS classification)*, and not here as it requires heavier processing than warranted during initial explorative analysis and lightweight cleaning of the data.\n",
    "\n",
    "We note that the districts `lipowski` and `hojnowski` are misspelled, as they should actually be `lipnowski` and `hajnowski`, so we can correct the typos now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Correct the typos\n",
    "PL_re_df.loc[PL_re_df['district'] == 'lipowski', 'district'] = 'lipnowski'\n",
    "PL_re_df.loc[PL_re_df['district'] == 'hojnowski', 'district'] = 'hajnowski'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Harmonising energy levels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for Poland, create dictionary \n",
    "idx_PL = valuenames[valuenames['country'] == 'PL'].index\n",
    "value_dict_PL = valuenames.loc[idx_PL].set_index('original_name')['opsd_name'].to_dict()\n",
    "\n",
    "# Set energy source level 3\n",
    "PL_re_df['energy_source_level_3'] = PL_re_df['energy_type'].replace(value_dict_PL)\n",
    "\n",
    "# Create dictionnary in order to assign energy_source_level_2 to its subtype\n",
    "idx_PL = valuenames[valuenames['country'] == 'PL'].index\n",
    "energy_source_dict_PL = valuenames.loc[idx_PL].set_index('original_name')['energy_source_level_2'].to_dict()\n",
    "\n",
    "# Add energy_source_level_2\n",
    "PL_re_df['energy_source_level_2'] = PL_re_df['energy_type'].replace(energy_source_dict_PL)\n",
    "\n",
    "\n",
    "# Standardize the values for technology\n",
    "# 1. np.nan means that technology should not be specified for the respective kind of sources\n",
    "#    according to the hierarchy (http://open-power-system-data.org/2016-10-25-opsd_tree.svg)\n",
    "# 2. 'Other or unspecified technology' means that technology should be specified\n",
    "#    but it was unclear or missing in the original dataset.\n",
    "technology_translation_dictionary = {\n",
    "    'BG': np.nan,\n",
    "    'BM': np.nan,\n",
    "    'PVA': 'Other or unspecified technology', # Photovoltaics?\n",
    "    'WIL': 'Other or unspecified technology', # Onshore?\n",
    "    'WO': 'Other or unspecified technology', # Run-of-river\n",
    "}\n",
    "\n",
    "PL_re_df['technology'] = PL_re_df['energy_type'].replace(technology_translation_dictionary)\n",
    "\n",
    "# Add energy_source_level_1\n",
    "PL_re_df['energy_source_level_1'] = 'Renewable energy'\n",
    "\n",
    "# Show the hierarchy of sources present in the dataset\n",
    "PL_re_df[['energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology']].drop_duplicates().sort_values(by='energy_source_level_2')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Georeferencing (NUTS classification)\n",
    "\n",
    "We have already seen that the district names are not standardized and observed that we cannot use them directly to get the corresponding NUTS codes.\n",
    "\n",
    "There is a way to get around this issue. We can do it as folows:\n",
    "1. First, we find a postcode in the GeoNames zip for Poland that corresponds to each district in the URE data. To do so, we must standardize all the district names to the forms that appear in the GeoNames zip file.\n",
    "2. Then, we can easily map a postcode to the appropriate NUTS codes using `nuts_converter`.\n",
    "\n",
    "By inspection, we observe that all the district names in the zip have one of the following two forms:\n",
    "- Noun in the nominative case, capitalized.\n",
    "- `Powiat *` where `*` is a possessive adjective.\n",
    "\n",
    "So, we standardize all the strings in the `district` column as follows:\n",
    "- Remove all the trailing whitespaces and characters other than letters.\n",
    "- If the string starts with `m. ` or `m. st. `, remove `m. ` (or `m. st. `) from the beginning of the string.\n",
    "- If the string ends with a possessive suffix *ski*, *cki* or *zki*, prepend the string `Powiat ` (note the ending whitespace) to it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the function to standardize district names from the original data\n",
    "def standardize_districts(original_string):\n",
    "    if original_string[-1] == ',': # there is one district whose name ends with ','; that's a typo in the data\n",
    "        original_string = original_string[:-1]\n",
    "        \n",
    "    if original_string.startswith('m. st. '):\n",
    "        return original_string[7:]\n",
    "    elif original_string.startswith('m. '):\n",
    "        return original_string[3:]\n",
    "    elif any([original_string.endswith(suffix) for suffix in ['ski', 'cki', 'zki']]):\n",
    "        return 'Powiat ' + original_string\n",
    "    else:\n",
    "        return original_string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get geo-information\n",
    "zip_PL_geo = zipfile.ZipFile(PL_geo_filepath)\n",
    "\n",
    "# Read generated postcode/location file\n",
    "PL_geo = pd.read_csv(zip_PL_geo.open('PL.txt'), sep='\\t', header=None)\n",
    "\n",
    "# add column names as defined in associated readme file\n",
    "PL_geo.columns = ['country_code', 'postcode', 'place_name', 'admin_name1',\n",
    "                  'admin_code1', 'admin_name2', 'admin_code2', 'admin_name3',\n",
    "                  'admin_code3', 'lat', 'lon', 'accuracy']\n",
    "\n",
    "# Drop rows of possible duplicate postal_code\n",
    "PL_geo.drop_duplicates('postcode', keep='last', inplace=True)\n",
    "PL_geo['postcode'] = PL_geo['postcode'].astype(str)\n",
    "\n",
    "# Get the names\n",
    "geonames_districts = PL_geo['admin_name2'].unique()\n",
    "\n",
    "# Show them\n",
    "geonames_districts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standardize the district names from the original data\n",
    "PL_re_df['standardized_district'] = PL_re_df['district'].apply(standardize_districts)\n",
    "standardized_districts = PL_re_df['standardized_district'].unique()\n",
    "\n",
    "# Check which districts could not be found in the GeoNames data\n",
    "#print(len([x for x in semi if x in geopowiats]), len([x for x in semi if x not in geopowiats]))\n",
    "not_found = set(standardized_districts).difference(set(geonames_districts))\n",
    "number_of_not_found = len(not_found)\n",
    "total = len(standardized_districts)\n",
    "print('{}/{} names could not be found. Those are:'.format(number_of_not_found, total))\n",
    "print(not_found)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now apply a heuristic method for finding the corresponding name in the GeoNames data. It is based on similarity between strings. It turns out that it works fine, except for a couple of cases, which we deal with manually."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We define the similarity between two strings, string1 and string2,\n",
    "# as the length of the longest prefix of string1 that appears in string2.\n",
    "# Note 1: this measure of similarity is not necessarily symmetrical.\n",
    "# Note 2: a prefix of a string is its substring that starts from the beginning of the string.\n",
    "def calculate_similarity(string1, string2):\n",
    "    for n in range(len(string1), 1, -1):\n",
    "        prefix = string1[0:(n-1)] \n",
    "        if prefix in string2:\n",
    "            return len(prefix)\n",
    "    return 0\n",
    "\n",
    "# Define a function to find, among a group of candidate strings,\n",
    "# the most similar string to the one given as the reference string.\n",
    "def find_the_most_similar(reference_string, candidate_strings):\n",
    "    the_most_similar = None\n",
    "    maximal_similarity = 0\n",
    "    for candidate_string in candidate_strings:\n",
    "        similarity = calculate_similarity(reference_string, candidate_string)\n",
    "        if similarity > maximal_similarity:\n",
    "            maximal_similarity = similarity\n",
    "            the_most_similar = candidate_string\n",
    "    return the_most_similar, maximal_similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "already_mapped = PL_re_df[['district', 'standardized_district']].drop_duplicates().to_dict(orient='records')\n",
    "already_mapped = {mapping['district'] : mapping['standardized_district'] for mapping in already_mapped \n",
    "                 if mapping['standardized_district'] in geonames_districts}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make a dictionary to map each district from the original data to its GeoNames equivalent.\n",
    "# The districts whose standardized versions have been found in the GeoNames data to their standardizations.\n",
    "# The mappings for other districts will be found using the previously defined similarity measures.\n",
    "districts_map = PL_re_df[['district', 'standardized_district']].drop_duplicates().to_dict(orient='records')\n",
    "districts_map = {mapping['district'] : mapping['standardized_district'] for mapping in districts_map}\n",
    "\n",
    "\n",
    "# Override the mappings for the 49 districts whose standardized names have not been found in the GeoNames data.\n",
    "for district, standardized_district in districts_map.items():\n",
    "    #standardized_district = ['standardized_district']\n",
    "    if standardized_district not in geonames_districts:\n",
    "        #print('---------')\n",
    "        if standardized_district.startswith('Powiat'):\n",
    "            standardized_district = standardized_district[7:]\n",
    "        #print(district)\n",
    "        capitalized = standardized_district.capitalize()\n",
    "        lowercase = standardized_district.lower()\n",
    "        candidate1, similarity1 = find_the_most_similar(capitalized, geonames_districts)\n",
    "        candidate2, similarity2 = find_the_most_similar(lowercase, geonames_districts)\n",
    "        if similarity1 > similarity2:\n",
    "            districts_map[district] = candidate1\n",
    "            #print('\\t', candidate1, similarity1)\n",
    "        elif similarity2 > similarity1:\n",
    "            districts_map[district] = candidate2\n",
    "            #print('\\t', candidate2, similarity2)\n",
    "        else:\n",
    "            # Break the ties by mapping to the shorter string\n",
    "            if len(candidate1) < len(candidate2):\n",
    "                districts_map[district] = candidate1\n",
    "                #print('\\t', candidate1, '|', candidate2, similarity1)\n",
    "            else:\n",
    "                districts_map[district] = candidate2\n",
    "                #print('\\t', candidate2, '|', candidate1, similarity2)\n",
    "\n",
    "# Apply the override to PL_re_df\n",
    "PL_re_df['standardized_district'] = PL_re_df['district'].apply(lambda district: districts_map[district])\n",
    "\n",
    "# Show the results\n",
    "PL_re_df[['district', 'standardized_district']].drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following districts have not been mapped correctly: `wołowski`, `m. Nowy Sącz` and `rzeszowski`. Let us clear their mappings so that we can assign them their NUTS codes manually later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clear the mappings for wołowski, Nowy Sącz, rzeszowski, hojnowski.\n",
    "for district in ['wołowski', 'm. Nowy Sącz', 'rzeszowski', 'hojnowski']:\n",
    "    districts_map[district] = ''\n",
    "    PL_re_df.loc[PL_re_df['district'] == district, 'standardized_district'] = ''\n",
    "\n",
    "# For each mapping, select a postcode from the GeoNames data\n",
    "df_dict = {'original' : [], 'geonames' : []}\n",
    "\n",
    "for original_name in districts_map:\n",
    "    geonames_name = districts_map[original_name]\n",
    "    df_dict['original'].append(original_name)\n",
    "    df_dict['geonames'].append(geonames_name)\n",
    "    \n",
    "mapping_df = pd.DataFrame.from_dict(df_dict)\n",
    "\n",
    "# To make sure that the selected postcodes do appear in the NUTS table,\n",
    "# we drop, from PL_geo, all rows with the postcodes not in the postcode-to-NUTS table for Poland.\n",
    "PL_table = nuts_converter.open_postcode2nuts(filepaths['Eurostat'])['CODE']\n",
    "\n",
    "PL_geo = pd.merge(PL_geo, PL_table, how='inner', left_on='postcode', right_on='CODE')\n",
    "PL_geo.drop(['CODE'], axis='columns', inplace=True)\n",
    "\n",
    "#\n",
    "merged = pd.merge(mapping_df,\n",
    "                  PL_geo[['admin_name2', 'postcode']],\n",
    "                  how='left',\n",
    "                  left_on='geonames',\n",
    "                  right_on='admin_name2')\n",
    "\n",
    "# Rename the column postcode to make its meaning straightforward\n",
    "merged.rename(columns={'postcode' : 'random_postcode'}, inplace=True)\n",
    "merged = merged.drop_duplicates(['geonames'])\n",
    "\n",
    "print(PL_re_df.shape)\n",
    "PL_re_df = pd.merge(PL_re_df,\n",
    "                     merged[['geonames', 'random_postcode']],\n",
    "                     how='left',\n",
    "                     left_on='standardized_district',\n",
    "                     right_on='geonames')\n",
    "\n",
    "# Show results\n",
    "PL_re_df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show the rows for which we could not find postcodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(PL_re_df[PL_re_df['random_postcode'].isnull()])\n",
    "PL_re_df['random_postcode'].isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are only 17 such power plants and all of them are placed in the districts which we deliberately left out for manual classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Add NUTS information\n",
    "\n",
    "We add the NUTS information as usual, using the converter. After that, we manually add the codes for the left-out districts as follows:\n",
    "\n",
    "| District | NUTS_1 | NUTS_2 | NUTS_3 |\n",
    "|----------|--------|--------|--------|\n",
    "| wołowski | PL5 | PL51 | PL518 |\n",
    "| m. Nowy Sącz | PL2 | PL21 | PL218 |\n",
    "| rzeszowski | PL3 | PL32 | PL325 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PL_postcode2nuts_path = filepaths['Eurostat']\n",
    "\n",
    "PL_re_df = nuts_converter.add_nuts_information(PL_re_df, 'PL', PL_postcode2nuts_path,\n",
    "                                               postcode_column='random_postcode', how=['postcode'])\n",
    "\n",
    "# Report the number of facilites whose NUTS codes were successfully sudetermined\n",
    "determined = PL_re_df['nuts_1_region'].notnull().sum()\n",
    "print('NUTS successfully determined for', determined, 'out of', PL_re_df.shape[0], 'facilities in PL.')\n",
    "\n",
    "# Manual assignments\n",
    "manual_nuts3_map = {\n",
    "    'wołowski' : 'PL518',\n",
    "    'm. Nowy Sącz' : 'PL218',\n",
    "    'rzeszowski' : 'PL325'\n",
    "}\n",
    "\n",
    "for district in manual_nuts3_map:\n",
    "    nuts3 = manual_nuts3_map[district]\n",
    "    nuts2 = nuts3[:-1]\n",
    "    nuts1 = nuts3[:-2]\n",
    "    mask = (PL_re_df['district'] == district)\n",
    "    PL_re_df.loc[mask, ['nuts_1_region', 'nuts_2_region', 'nuts_3_region']] = [nuts1, nuts2, nuts3]\n",
    "\n",
    "# Report the number of facilites whose NUTS codes could not be determined\n",
    "not_determined = PL_re_df['nuts_1_region'].isnull().sum()\n",
    "print('NUTS could not be determined for', not_determined, 'out of', PL_re_df.shape[0], 'facilities in PL.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add data source and year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PL_re_df['data_source'] = 'Urzad Regulacji Energetyki'\n",
    "PL_re_df['as_of_year'] = 2019 # The year for which the dataset has been compiled by the data source"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose which column to keep\n",
    "PL_re_df = PL_re_df.loc[:, [ 'URE_id', 'region', 'district', \n",
    "                             'nuts_1_region', 'nuts_2_region', 'nuts_3_region',\n",
    "                             'electrical_capacity', \n",
    "                             'energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', \n",
    "                             'technology',\n",
    "                             'data_source', 'as_of_year']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PL_re_df.to_pickle('intermediate/PL_renewables.pickle')\n",
    "del PL_re_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Switzerland CH"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download and read\n",
    "The data which will be processed below is provided by the following data sources:\n",
    "\n",
    "**[Swiss Federal Office of Energy](http://www.bfe.admin.ch/themen/00612/02073/index.html?dossier_id=02166&lang=de)** - Data of all renewable power plants receiving \"Kostendeckende Einspeisevergütung\" (KEV) which is the Swiss feed in tarif for renewable power plants. \n",
    "Geodata is based on municipality codes.\n",
    "\n",
    "The available municipality code in the original data provides an approximation for the geocoordinates of the renewable power plants. The postcode will be assigned to latitude and longitude coordinates with the help of the postcode table.\n",
    "\n",
    "**[geonames.org](http://download.geonames.org/export/zip/?C=N;O=D)** - The postcode  data from Switzerland is provided by Geonames and licensed under a [Creative Commons Attribution 3.0 license](http://creativecommons.org/licenses/by/3.0/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the data and get the local paths of the downloaded files\n",
    "filepaths = downloader.download_data_for_country('CH')\n",
    "CH_re_filepath = filepaths['BFE']\n",
    "CH_geo_filepath = filepaths['Geonames']   \n",
    "CH_postcode2nuts_filepath = filepaths['Eurostat']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get data of renewables per municipality\n",
    "CH_re_df = pd.read_excel(CH_re_filepath,\n",
    "                         sheet_name='KEV Bezüger 2018',\n",
    "                         encoding='UTF8',\n",
    "                         thousands='.',\n",
    "                         decimals=','\n",
    "                         #header=[0]\n",
    "                         #skipfooter=9,  # contains summarized values\n",
    "                         #index_col=[0, 1], # required for MultiIndex\n",
    "                         #converters={'Code officiel géographique':str}\n",
    "                         )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for Switzerland, create dictionary and show dictionary\n",
    "idx_CH = columnnames[columnnames['country'] == 'CH'].index\n",
    "column_dict_CH = columnnames.loc[idx_CH].set_index('original_name')['opsd_name'].to_dict()\n",
    "column_dict_CH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Translate columnnames\n",
    "CH_re_df.columns = [column_name.replace(\"\\n\", \"\") for column_name in CH_re_df.columns]\n",
    "CH_re_df.rename(columns=column_dict_CH, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add data source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CH_re_df['data_source'] = 'BFE'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Harmonize energy source hierarchy and translate values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for Switzerland, create dictionary \n",
    "idx_CH = valuenames[valuenames['country'] == 'CH'].index\n",
    "value_dict_CH = valuenames.loc[idx_CH].set_index('original_name')['opsd_name'].to_dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Separate and assign energy source level 1-3 and technology**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign energy_source_level_1 to the dataframe\n",
    "CH_re_df['energy_source_level_1'] = 'Renewable energy'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create dictionnary in order to assign energy_source to its subtype\n",
    "#energy_source_dict_CH = valuenames.loc[idx_CH].set_index('opsd_name')['energy_source_level_2'].to_dict()\n",
    "#\n",
    "# ...and the energy source subtype values in the energy_source column are replaced by \n",
    "# the higher level classification\n",
    "#CH_re_df['energy_source_level_2'].replace(energy_source_dict_CH, inplace=True)\n",
    "\n",
    "CH_re_df['energy_source_level_3'] = CH_re_df['technology']\n",
    "\n",
    "# Create dictionnary in order to assign energy_source_level_2 to its subtype\n",
    "idx_CH = valuenames[valuenames['country'] == 'CH'].index\n",
    "energy_source_dict_CH = valuenames.loc[idx_CH].set_index('original_name')['energy_source_level_2'].to_dict()\n",
    "\n",
    "# Add energy_source_level_2\n",
    "CH_re_df['energy_source_level_2'] = CH_re_df['energy_source_level_2'].replace(energy_source_dict_CH)\n",
    "\n",
    "# Translate values in order to standardize energy_source_level_3\n",
    "value_dict_CH = valuenames.loc[idx_CH].set_index('original_name')['opsd_name'].to_dict()\n",
    "\n",
    "\n",
    "CH_re_df['energy_source_level_3'].replace(value_dict_CH, inplace=True)\n",
    "\n",
    "# Standardize the values for technology\n",
    "# 1. np.nan means that technology should not be specified for the respective kind of sources\n",
    "#    according to the hierarchy (http://open-power-system-data.org/2016-10-25-opsd_tree.svg)\n",
    "# 2. 'Other or unspecified technology' means that technology should be specified\n",
    "#    but it was unclear or missing in the original dataset.\n",
    "technology_translation_dictionary = {\n",
    "    'Klärgasanlage': np.nan,\n",
    "    'Dampfprozess': 'Steam turbine',\n",
    "    'übrige Biomasse - WKK-Anlage': 'Other or unspecified technology',\n",
    "    'übrige Biomasse - Dampfprozess': 'Steam turbine',\n",
    "    'Schlammverbrennungsanlage': 'Combustion engine',\n",
    "    'WKK-Prozess': 'Other or unspecified technology',\n",
    "    'Kehrrichtverbrennungsanlage': 'Combustion engine',\n",
    "    'Integrierte Anlage': 'Photovoltaics',\n",
    "    'Angebaute Anlage': 'Photovoltaics',\n",
    "    'Freistehende Anlage': 'Photovoltaics',\n",
    "    'Trinkwasserkraftwerk': 'Other or unspecified technology',\n",
    "    'Durchlaufkraftwerk': 'Run-of-river',\n",
    "    'Dotierwasserkraftwerk': 'Other or unspecified technology',\n",
    "    'Ausleitkraftwerk': 'Other or unspecified technology',\n",
    "    'Wind Offshore': 'Other or unspecified technology',\n",
    "    'Abwasserkraftwerk': 'Other or unspecified technology',\n",
    "    'Unbekannt': 'Other or unspecified technology',\n",
    "    np.nan: 'Onshore',\n",
    "    None: 'Onshore'\n",
    "}\n",
    "\n",
    "CH_re_df['technology'].replace(technology_translation_dictionary, inplace=True)\n",
    "\n",
    "# Add energy_source_level_1\n",
    "CH_re_df['energy_source_level_1'] = 'Renewable energy'\n",
    "\n",
    "# Show the hierarchy of sources present in the dataset\n",
    "energy_columns = ['energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology']\n",
    "CH_re_df[energy_columns].drop_duplicates().sort_values(by='energy_source_level_2')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The power plants with `energy_source_level_3=Biomass and biogas` and `technology=Steam turbine` do not belong to the renewable energy power plants, so we can remove them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "drop_mask = (CH_re_df['energy_source_level_3'] == 'Biomass and biogas') & \\\n",
    "            (CH_re_df['technology'] == 'Steam turbine')\n",
    "drop_indices = drop_mask[drop_mask].index\n",
    "CH_re_df.drop(drop_indices, axis='index', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CH_re_df.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Replace the rest of the original terms with their OPSD equivalents**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CH_re_df.replace(value_dict_CH, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Georeferencing\n",
    "\n",
    "#### Postcode to lat/lon (WGS84)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get geo-information\n",
    "zip_CH_geo = zipfile.ZipFile(CH_geo_filepath)\n",
    "\n",
    "# Read generated postcode/location file\n",
    "CH_geo = pd.read_csv(zip_CH_geo.open('CH.txt'), sep='\\t', header=None)\n",
    "\n",
    "# add column names as defined in associated readme file\n",
    "CH_geo.columns = ['country_code', 'postcode', 'place_name', 'admin_name1',\n",
    "                  'admin_code1', 'admin_name2', 'admin_code2', 'admin_name3',\n",
    "                  'admin_code3', 'lat', 'lon', 'accuracy']\n",
    "\n",
    "# Drop rows of possible duplicate postal_code\n",
    "CH_geo.drop_duplicates('postcode', keep='last', inplace=True)\n",
    "CH_geo['postcode'] = CH_geo['postcode'].astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# harmonise data class \n",
    "CH_geo.postcode = CH_geo.postcode.astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add longitude/latitude infomation assigned by municipality code\n",
    "CH_re_df = pd.merge(CH_re_df,\n",
    "                    CH_geo[['lat', 'lon', 'postcode']],\n",
    "                    left_on='municipality_code',\n",
    "                    right_on='postcode',\n",
    "                    how='left'\n",
    "                   )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "zip_CH_geo.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Add NUTS information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CH_postcode2nuts_path = filepaths['Eurostat']\n",
    "\n",
    "# Use the string versions of postcode and municipality code columns\n",
    "CH_re_df['postcode_str'] = CH_re_df['postcode'].astype(str).str[:-2]\n",
    "CH_re_df['municipality_code_str'] = CH_re_df['municipality_code'].astype(str)\n",
    "\n",
    "CH_re_df = nuts_converter.add_nuts_information(CH_re_df, 'CH', CH_postcode2nuts_path, \n",
    "                                                postcode_column='postcode_str',\n",
    "                                                municipality_code_column='municipality_code_str',\n",
    "                                                lau_name_type='NATIONAL', how=['postcode', 'municipality'])\n",
    "\n",
    "# Report the number of facilites whose NUTS codes were successfully sudetermined\n",
    "determined = CH_re_df['nuts_1_region'].notnull().sum()\n",
    "print('NUTS successfully determined for', determined, 'out of', CH_re_df.shape[0], 'facilities in CH.')\n",
    "\n",
    "# Report the number of facilites whose NUTS codes could not be determined\n",
    "not_determined = CH_re_df['nuts_1_region'].isnull().sum()\n",
    "print('NUTS could not be determined for', not_determined, 'out of', CH_re_df.shape[0], 'facilities in CH.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us check the stations for which NUTS codes could not be determined."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CH_re_df[CH_re_df['nuts_1_region'].isnull()][['postcode', 'municipality']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the facilities without NUTS classification\n",
    "no_nuts = CH_re_df['nuts_1_region'].isnull()\n",
    "\n",
    "# Find the masks where some information for finding the proper NUTS code is present\n",
    "municipality_name_present = ~(CH_re_df['municipality'].isnull())\n",
    "\n",
    "# Show the cases where NUTS classification failed even though it shouldn't have\n",
    "problematic_municipality_names = CH_re_df[no_nuts & municipality_name_present]['municipality'].unique()\n",
    "print('Problematic municipalities:', ', '.join(list(problematic_municipality_names)) + '.')\n",
    "\n",
    "print('Are those names present in the official NUTS tables for CH?')\n",
    "if nuts_converter.municipality2nuts_df['municipality'].isin(problematic_municipality_names).any():\n",
    "    print('At least one is.')\n",
    "else:\n",
    "    print('No, none is.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the municipalities of only plants for which we could not determine the NUTS codes cannot be found in the official translation tables, so there was no possibility to assign them their NUTS classification codes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transform electrical_capacity from kW to MW"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# kW to MW\n",
    "CH_re_df['electrical_capacity'] /= 1000\n",
    "\n",
    "# kWh to MWh\n",
    "CH_re_df['production'] /= 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select columns to keep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "columns_to_keep = ['project_name', 'energy_source_level_2','energy_source_level_3', 'technology', \n",
    "                   'electrical_capacity', 'production', 'tariff', 'commissioning_date', 'contract_period_end',\n",
    "                   'address', 'municipality_code', 'municipality', 'nuts_1_region', 'nuts_2_region',\n",
    "                   'nuts_3_region', 'canton', 'company', 'title', 'surname', 'first_name', 'data_source',\n",
    "                   'energy_source_level_1', 'lat', 'lon', 'postcode']\n",
    "CH_re_df = CH_re_df.loc[:, columns_to_keep]\n",
    "CH_re_df.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "visualize_points(CH_re_df['lat'],\n",
    "                 CH_re_df['lon'],\n",
    "                 'Switzerland',\n",
    "                 categories=CH_re_df['energy_source_level_2']\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CH_re_df.to_pickle('intermediate/CH_renewables.pickle')\n",
    "del CH_re_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check and validation of the renewable power plants list as well as the creation of CSV/XLSX/SQLite files can be found in Part 2 of this script. It also generates a daily time series of cumulated installed capacities by energy source."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## United Kingdom UK\n",
    "\n",
    "The data for the UK are provided by the following sources:\n",
    "\n",
    "**[UK Government Department of Business, Energy & Industrial Strategy (BEIS)](https://www.gov.uk/government/publications/renewable-energy-planning-database-monthly-extract)** - the data contain information on the UK renewable energy sources and are updated at the end of each quarter.\n",
    "\n",
    "**[geonames.org](http://download.geonames.org/export/zip/?C=N;O=D)** - the data about latitued and longitudes of  the UK postcodes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download and Read"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the data and get the local paths to the corresponding files\n",
    "filepaths = downloader.download_data_for_country('UK')\n",
    "UK_re_filepath = filepaths['BEIS']\n",
    "UK_geo_filepath = filepaths['Geonames']\n",
    "UK_postcode2nuts_filepath = filepaths['Eurostat']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the renewable powerplants data into a dataframe\n",
    "UK_re_df = pd.read_csv(UK_re_filepath,\n",
    "                       header=2,\n",
    "                       encoding='latin1',\n",
    "                       parse_dates=['Record Last Updated (dd/mm/yyyy)','Operational'],\n",
    "                       infer_datetime_format=True,\n",
    "                       thousands=','\n",
    "                       )\n",
    "\n",
    "# Drop empty columns and rows\n",
    "UK_re_df.dropna(axis='index', how='all', inplace=True)\n",
    "UK_re_df.dropna(axis='columns', how='all', inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean the data\n",
    "\n",
    "The downloaded dataset has to be cleaned:\n",
    "  - Both operational and nonoperational facilities are present in the set. However, only operational facilities are of the interest, so the dataset has to be filtered on this condition. \n",
    "  - Some columns don't have standardized values. For example, `CHP Enabled` contains five different strings: \"`No`\", \"`Yes`\", \"`no`\", \"`yes`\", and \"`No` \" with a trailing white space, even though they represent only two distinct values. So, we have to ensure a 1-to-1 mapping between the true values of a feature and their representations for all the features present in the set.\n",
    "  - The technologies `Battery`, `Flywheels` and `Liquid Air Energy Storage` are of no interest, so the facilities using them should be omitted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep only operational facilities in the dataset\n",
    "UK_re_df = UK_re_df.loc[UK_re_df[\"Development Status\"] == \"Operational\"]\n",
    "UK_re_df.reset_index(inplace=True, drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standardize string columns\n",
    "strip_and_lower = ['CHP Enabled']\n",
    "strip_only = ['Country', 'County', 'Operator (or Applicant)', 'Mounting Type for Solar']\n",
    "\n",
    "for column in strip_and_lower:\n",
    "    util.helper.standardize_column(UK_re_df, column, lower=True)\n",
    "\n",
    "for column in strip_only:\n",
    "    util.helper.standardize_column(UK_re_df, column, lower=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop Flywheels, Battery and Liquid Air Energy Storage\n",
    "UK_re_df = UK_re_df[~UK_re_df['Technology Type'].isin(['Flywheels', 'Battery', 'Liquid Air Energy Storage'])]\n",
    "UK_re_df.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy the column \"Technology Type\" to a new column named \"technology\"\n",
    "UK_re_df['technology'] = UK_re_df['Technology Type']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for the UK and create the translation dictionary\n",
    "idx_UK = columnnames[columnnames['country'] == 'UK'].index\n",
    "column_dict_UK = columnnames.loc[idx_UK].set_index('original_name')['opsd_name'].to_dict()\n",
    "\n",
    "# Show the dictionary\n",
    "column_dict_UK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Translate column names\n",
    "UK_re_df.rename(columns=column_dict_UK, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add data source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "UK_re_df['data_source'] = 'BEIS'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate values and harmonise energy source levels 1-3 and technology"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create dictionnary in order to assign energy_source_level_2 to its subtype\n",
    "idx_UK = valuenames[valuenames['country'] == 'UK'].index\n",
    "energy_source_dict_UK = valuenames.loc[idx_UK].set_index('original_name')['energy_source_level_2'].to_dict()\n",
    "\n",
    "# Add energy_source_level_2\n",
    "UK_re_df['energy_source_level_2'] = UK_re_df['energy_source_level_3'].replace(energy_source_dict_UK)\n",
    "\n",
    "# Translate values in order to standardize energy_source_level_3\n",
    "value_dict_UK = valuenames.loc[idx_UK].set_index('original_name')['opsd_name'].to_dict()\n",
    "UK_re_df['energy_source_level_3'].replace(value_dict_UK, inplace=True)\n",
    "\n",
    "# Standardize the values for technology\n",
    "# 1. np.nan means that technology should not be specified for the respective kind of sources\n",
    "#    according to the hierarchy (http://open-power-system-data.org/2016-10-25-opsd_tree.svg)\n",
    "# 2. 'Other or unspecified technology' means that technology should be specified\n",
    "#    but it was unclear or missing in the original dataset.\n",
    "technology_translation_dictionary = {\n",
    "    'Biomass (co-firing)': 'Other or unspecified technology',\n",
    "    'Biomass (dedicated)': 'Other or unspecified technology',\n",
    "    'Advanced Conversion Technologies': 'Other or unspecified technology',\n",
    "    'Anaerobic Digestion': 'Other or unspecified technology',\n",
    "    'EfW Incineration': np.nan,\n",
    "    'Large Hydro': 'Other or unspecified technology',\n",
    "    'Small Hydro': 'Other or unspecified technology',\n",
    "    'Landfill Gas': np.nan,\n",
    "    'Solar Photovoltaics': 'Photovoltaics',\n",
    "    'Sewage Sludge Digestion': np.nan,\n",
    "    'Tidal Barrage and Tidal Stream': np.nan,\n",
    "    'Shoreline Wave': np.nan,\n",
    "    'Wind Offshore': 'Offshore',\n",
    "    'Wind Onshore': 'Onshore',\n",
    "    'Pumped Storage Hydroelectricity': 'Pumped storage'\n",
    "}\n",
    "UK_re_df['technology'].replace(technology_translation_dictionary, inplace=True)\n",
    "\n",
    "# Add energy_source_level_1\n",
    "UK_re_df['energy_source_level_1'] = 'Renewable energy'\n",
    "\n",
    "# Show the hierarchy of sources present in the dataset\n",
    "UK_re_df[['energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology']].drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Georeferencing\n",
    "\n",
    "The facilities' location details comprise of the information on the address, county, region, country (England, Scotland, Wales, Northern Ireland), post code, and Easting (X) and Northing (Y) coordinates of each facility in the OSGB georeferencing system. To convert the easting and northing cordinates to standard WG84 latitude and longitude, we use package `bng_latlon`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define a wrapper for bng_to_latlon for handling None values \n",
    "def to_lat_lon(easting, northing):\n",
    "    if pd.isnull(easting) or pd.isnull(northing):\n",
    "        return (None, None)\n",
    "    else:\n",
    "        return bng_to_latlon.OSGB36toWGS84(easting, northing)\n",
    "\n",
    "# Convert easting and northing columns to numbers\n",
    "UK_re_df['X-coordinate'] = pd.to_numeric(\n",
    "                             UK_re_df['X-coordinate'].astype(str).str.replace(',', ''),\n",
    "                             errors='coerce'\n",
    "                           )\n",
    "UK_re_df['Y-coordinate'] = pd.to_numeric(\n",
    "                             UK_re_df['Y-coordinate'].astype(str).str.replace(',', ''),\n",
    "                             errors='coerce'\n",
    "                           )\n",
    "\n",
    "# Convert easting and northing coordinates to standard latitude and longitude\n",
    "latlon = UK_re_df.apply(lambda row: to_lat_lon(row[\"X-coordinate\"], row[\"Y-coordinate\"]),\n",
    "                        axis=1\n",
    "                       )\n",
    "\n",
    "# Split a column of (latitude, longitude) pairs into two separate coordinate columns\n",
    "latitude = latlon.apply(lambda x: x[0])\n",
    "longitude = latlon.apply(lambda x: x[1])\n",
    "\n",
    "# Add them to the dataframe\n",
    "UK_re_df['latitude'] = latitude\n",
    "UK_re_df['longitude'] = longitude"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Cases with unknown Easting and Northing coordinates\n",
    "\n",
    "If the Easting and Northing coordinates of a facility are not provided, its latitude and longitude cannot be determined. For such sources, we look up the WGS84 coordinates in the geodataset provided by **[geonames.org](http://download.geonames.org/export/zip/?C=N;O=D)**, where the UK postcodes are paired with their latitudes and longitudes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get geo-information\n",
    "zip_UK_geo = zipfile.ZipFile(UK_geo_filepath)\n",
    "\n",
    "# Read generated postcode/location file\n",
    "UK_geo = pd.read_csv(zip_UK_geo.open('GB_full.txt'), sep='\\t', header=None)\n",
    "\n",
    "# add column names as defined in associated readme file\n",
    "UK_geo.columns = ['country_code', 'postcode', 'place_name', 'admin_name1',\n",
    "                  'admin_code1', 'admin_name2', 'admin_code2', 'admin_name3',\n",
    "                  'admin_code3', 'lat', 'lon', 'accuracy']\n",
    "\n",
    "# Drop rows of possible duplicate postal_code\n",
    "UK_geo.drop_duplicates('postcode', keep='last', inplace=True)\n",
    "UK_geo['postcode'] = UK_geo['postcode'].astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find the rows where latitude and longitude are unknown\n",
    "missing_latlon_mask = UK_re_df['latitude'].isna() | UK_re_df['longitude'].isna()\n",
    "missing_latlon = UK_re_df[missing_latlon_mask]\n",
    "\n",
    "# Add longitude/latitude infomation assigned by post code\n",
    "updated_latlon = pd.merge(missing_latlon,\n",
    "                  UK_geo[['lat', 'lon', 'postcode']],\n",
    "                  left_on='postcode',\n",
    "                  right_on='postcode',\n",
    "                  how='left'\n",
    "                 )\n",
    "\n",
    "# Return the updated rows to the original frame\n",
    "UK_re_df = pd.merge(UK_re_df,\n",
    "             updated_latlon[['uk_beis_id', 'lat', 'lon']],\n",
    "             on='uk_beis_id',\n",
    "             how='left'\n",
    ")\n",
    "\n",
    "# Use the bng_to_latlon coordinates (columns: 'latitude' and 'longitude') if present, \n",
    "# otherwise, use those obtained with UK_geo (columns: 'lat' and 'lon').\n",
    "UK_re_df['longitude'] = UK_re_df.apply(lambda row: row['longitude'] if not pd.isnull(row['longitude']) \n",
    "                                                                    else row['lon'],\n",
    "                         axis=1\n",
    "                        )\n",
    "UK_re_df['latitude'] = UK_re_df.apply(lambda row: row['latitude'] if not pd.isnull(row['latitude']) \n",
    "                                                                  else row['lat'],\n",
    "                         axis=1\n",
    "                        )\n",
    "\n",
    "# Drop the UK_geo columns (lat/lon)\n",
    "# as the information was moved to the 'latitude' and 'longitude' columns.\n",
    "UK_re_df.drop(['lat', 'lon'], axis='columns', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "zip_UK_geo.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Cases for approximation\n",
    "\n",
    "In the cases where the full post code was not present in **[geonames.org](http://download.geonames.org/export/zip/?C=N;O=D)**, use its prefix to find the latitude / longitude pairs of locations covered by that prefix. Then, approximate those facilities' locations by the centroids of their prefix areas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find the rows where latitude and longitude are unknown\n",
    "missing_latlon_mask = UK_re_df['latitude'].isna() | UK_re_df['longitude'].isna()\n",
    "missing_latlon = UK_re_df[missing_latlon_mask].copy()\n",
    "missing_latlon = missing_latlon.reset_index()\n",
    "\n",
    "# Determine their post code prefixes\n",
    "prefixes = missing_latlon.apply(lambda row: str(row['postcode']).split(' ')[0],\n",
    "                                axis=1\n",
    "                               )\n",
    "missing_latlon['Prefix'] = prefixes\n",
    "\n",
    "# Determine the centroids of the areas covered by the prefixes\n",
    "grouped_UK_geo=UK_geo.groupby(by=lambda i: str(UK_geo['postcode'].loc[i]).split(' ')[0])\n",
    "\n",
    "# Assing the centroid coordinates to the facilities with unknown coordinates\n",
    "updated_latlon = pd.merge(missing_latlon,\n",
    "                    grouped_UK_geo.mean(),\n",
    "                    left_on=\"Prefix\",\n",
    "                    right_index=True,\n",
    "                    how=\"left\"\n",
    "                   )\n",
    "\n",
    "# Return the updated rows to the original frame\n",
    "UK_re_df = pd.merge(UK_re_df,\n",
    "             updated_latlon[['uk_beis_id', 'lat', 'lon']],\n",
    "             on='uk_beis_id',\n",
    "             how='left'\n",
    ")\n",
    "\n",
    "# Keep the already known coordinates (columns: 'latitude' and 'longitude') if present, \n",
    "# otherwise, use those obtained by approximation (columns: 'lat' and 'lon').\n",
    "UK_re_df['longitude'] = UK_re_df.apply(lambda row: row['longitude'] if not pd.isnull(row['longitude']) \n",
    "                                                                    else row['lon'],\n",
    "                         axis=1\n",
    "                        )\n",
    "UK_re_df['latitude'] = UK_re_df.apply(lambda row: row['latitude'] if not pd.isnull(row['latitude']) \n",
    "                                                                  else row['lat'],\n",
    "                         axis=1\n",
    "                        )\n",
    "\n",
    "# Drop the UK_geo columns (lat/lon)\n",
    "# as the information was moved to the 'latitude' and 'longitude' columns.\n",
    "UK_re_df.drop(['lat', 'lon'], axis='columns', inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Add NUTS information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "UK_postcode2nuts_filepath = filepaths['Eurostat']\n",
    "\n",
    "UK_re_df = nuts_converter.add_nuts_information(UK_re_df, 'UK', UK_postcode2nuts_filepath, \n",
    "                                                latitude_column='latitude',\n",
    "                                                longitude_column='longitude', closest_approximation=True,\n",
    "                                                lau_name_type='NATIONAL', how=['latlon', 'municipality'])\n",
    "\n",
    "# Report the number of facilites whose NUTS codes were successfully sudetermined\n",
    "determined = UK_re_df['nuts_1_region'].notnull().sum()\n",
    "print('NUTS successfully determined for', determined, 'out of', UK_re_df.shape[0], 'facilities in UK.')\n",
    "\n",
    "# Report the number of facilites whose NUTS codes could not be determined\n",
    "not_determined = UK_re_df['nuts_1_region'].isnull().sum()\n",
    "print('NUTS could not be determined for', not_determined, 'out of', UK_re_df.shape[0], 'facilities in UK.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us see the facilities for which the NUTS codes could not be determined."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "UK_re_df[UK_re_df['nuts_1_region'].isnull()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are two such rows only. The langitude and longitude coordinates, as well as municipality codes, are missing from the data set, so NUTS codes could not have been determined."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Visualize the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "visualize_points(UK_re_df['latitude'],\n",
    "                 UK_re_df['longitude'],\n",
    "                 'United Kingdom',\n",
    "                 categories=UK_re_df['energy_source_level_2']\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that some facilities appear to be located in the sea. Let us plot the original OSGB coordinates to see if translation to the standard longitude and latitude coordinates failed for some locations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_X = UK_re_df['X-coordinate'].max()\n",
    "min_X = UK_re_df['X-coordinate'].min()\n",
    "\n",
    "max_Y = UK_re_df['Y-coordinate'].max()\n",
    "min_Y = UK_re_df['Y-coordinate'].min()\n",
    "\n",
    "figure(num=None, figsize=(8, 6), dpi=100, facecolor='w', edgecolor='k')\n",
    "ax = plt.axes(projection=ccrs.OSGB())\n",
    "ax.coastlines('10m')\n",
    "\n",
    "ax.scatter(UK_re_df['X-coordinate'], UK_re_df['Y-coordinate'],s=0.5)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the maps are basically the same, which confirms that translation to the longitude and latitude coordinates is done correctly and that they reflect the positions specified by the original X and Y OSGB coordinates. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Keep only the columns of interest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rename 'longitude' and 'latitude' to 'lon' and 'lat' to conform to the naming convention\n",
    "# used for other countries.\n",
    "UK_re_df.rename(columns={'longitude': 'lon', 'latitude': 'lat'}, inplace=True)\n",
    "\n",
    "\n",
    "# Define the columns to keep\n",
    "columns_of_interest = ['commissioning_date', 'uk_beis_id', 'operator', 'site_name',\n",
    "                       'energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology',\n",
    "                       'electrical_capacity', 'chp', 'support_robranding', 'support_fit', 'support_cfd',\n",
    "                       'capacity_individual_turbine', 'number_of_turbines', 'solar_mounting_type',\n",
    "                       'status', 'address', 'municipality', 'nuts_1_region', 'nuts_2_region', 'nuts_3_region',\n",
    "                       'region', 'country', 'postcode', 'lon', 'lat', 'data_source'\n",
    "                      ]\n",
    "\n",
    "for col in columns_of_interest:\n",
    "    if col not in UK_re_df.columns:\n",
    "        print(col)\n",
    "\n",
    "# Clean the dataframe from columns other than those specified above\n",
    "UK_re_df = UK_re_df.loc[:, columns_of_interest]\n",
    "UK_re_df.reset_index(drop=True, inplace=True)\n",
    "UK_re_df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "UK_re_df.to_pickle('intermediate/UK_renewables.pickle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sweden\n",
    "\n",
    "The data for Sweden are provided by the following sources:\n",
    "\n",
    "- **[Vindbrukskollen](http://ext-dokument.lansstyrelsen.se/Gemensamt/Geodata/Externa%20dokument/VBK/VBK_export_allman_prod.xlsx)** - Wind farms in Sweden."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the data and get the local paths to the corresponding files\n",
    "filepaths = downloader.download_data_for_country('SE')\n",
    "print(filepaths)\n",
    "\n",
    "SE_re_filepath = filepaths['Vindbrukskollen']\n",
    "SE_geo_filepath = filepaths['Geonames']\n",
    "SE_postcode2nuts_filepath = filepaths['Eurostat']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the function for converting the column \"Senast sparads\" to date type\n",
    "#def from_int_to_date(int_date):\n",
    "#    print(int_date)\n",
    "#    str_date =str(int_date)\n",
    "#    year = str_date[:4]\n",
    "#    month = str_date[4:6]\n",
    "#    day = str_date[6:8]\n",
    "#    str_date = '{}/{}/{}'.format(year, month, day)\n",
    "#    return pd.to_datetime(str_date, format='%Y/%m/%d')\n",
    "    \n",
    "# Read the data\n",
    "SE_re_df = pd.read_excel(SE_re_filepath,\n",
    "                        sheet_name='Vindkraftverk',\n",
    "                        na_values='-',\n",
    "                        parse_dates=['Uppfört', 'Senast sparad'],\n",
    "                        infer_datetime_format=True,\n",
    "                        #converters={'Senast sparad' : from_int_to_date}\n",
    "                        )\n",
    "\n",
    "# Show 5 rows from the beginning\n",
    "SE_re_df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean the data\n",
    " - Drop empty rows and columns.\n",
    " - Make sure that the column `Uppfört` is of the date type.\n",
    " - Keep only operational wind farms (`Status` is `Beviljat` (permission granted) or `Uppfört` (the farm exists)).\n",
    " - Remove the farms whose capacity is not known.\n",
    " - Standardize string columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop empty rows and columns \n",
    "SE_re_df.dropna(axis='index', how='all', inplace=True)\n",
    "SE_re_df.dropna(axis='columns', how='all', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make sure that the column Uppfört is of the date type and correctly formatted\n",
    "SE_re_df['Uppfört'] = pd.to_datetime(SE_re_df['Uppfört'], format='%Y-%m-%d')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep only operational wind farms\n",
    "subset_mask = SE_re_df['Status'].isin(['Beviljat', 'Uppfört'])\n",
    "SE_re_df.drop(SE_re_df[~subset_mask].index, axis='index', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove the farms whose capacity is not known.\n",
    "subset_mask = SE_re_df['Maxeffekt (MW)'].isna()\n",
    "SE_re_df.drop(SE_re_df[subset_mask].index, axis='index', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standardize string columns\n",
    "string_columns = ['Modell', 'Fabrikat', 'Elområde', 'Kommun', 'Län', 'Handlingstyp', 'Placering']\n",
    "for col in string_columns:\n",
    "    util.helper.standardize_column(SE_re_df, col, lower=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for the UK and create the translation dictionary\n",
    "idx_SE = columnnames[columnnames['country'] == 'SE'].index\n",
    "column_dict_SE = columnnames.loc[idx_SE].set_index('original_name')['opsd_name'].to_dict()\n",
    "\n",
    "# Show the dictionary\n",
    "display(column_dict_SE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Translate column names\n",
    "SE_re_df.rename(columns=column_dict_SE, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Correct the dates\n",
    "\n",
    "Some wind farms are declared to be commissioned in the year 1900. We set those dates to `np.nan`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SE_re_df.loc[(SE_re_df['commissioning_date'].dt.year == 1900), 'commissioning_date'] = np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SE_re_df['data_source'] = 'Vindbrukskollen'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate values and harmonize energy source levels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for Sweden\n",
    "idx_SE = valuenames[valuenames['country'] == 'SE'].index\n",
    "value_dict_SE = valuenames.loc[idx_SE].set_index('original_name')['opsd_name'].to_dict()\n",
    "value_dict_SE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace all original value names by the OPSD value names\n",
    "SE_re_df.replace(value_dict_SE, inplace=True)\n",
    "\n",
    "# Set nans in the technology column to 'Unknown or unspecified technology'\n",
    "SE_re_df['technology'].fillna('Unknown or unspecified technology', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add energy level 2\n",
    "SE_re_df['energy_source_level_2'] = 'Wind'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add energy_source_level_1\n",
    "SE_re_df['energy_source_level_1'] = 'Renewable energy'\n",
    "\n",
    "# Show the hierarchy of sources present in the dataset\n",
    "SE_re_df[['energy_source_level_1', 'energy_source_level_2', 'technology']].drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Georeferencing\n",
    "\n",
    "The coordinates in the columns `sweref99tm_north` and `sweref99tm_east` are specified in the [SWEREF 99 TM](https://sv.wikipedia.org/wiki/SWEREF_99_TM) coordinate system, used in Sweden. To convert those coordinates to the usual [WGS84](https://en.wikipedia.org/wiki/World_Geodetic_System) latitudes and longitudes, we use the function `sweref99tm_latlon_transform` from the module `util.helper`, provided by Jon Olauson."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get latitude and longitude columns\n",
    "lat, lon = util.helper.sweref99tm_latlon_transform(SE_re_df['sweref99tm_north'], SE_re_df['sweref99tm_east'])\n",
    "\n",
    "# Include them in the dataframe\n",
    "SE_re_df['lat'] = lat\n",
    "SE_re_df['lon'] = lon"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assigning NUTS codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SE_postcode2nuts_filepath = filepaths['Eurostat']\n",
    "\n",
    "SE_re_df = nuts_converter.add_nuts_information(SE_re_df, 'SE', SE_postcode2nuts_filepath, \n",
    "                                                lau_name_type='NATIONAL', how=['municipality', 'latlon'])\n",
    "\n",
    "# Report the number of facilites whose NUTS codes were successfully sudetermined\n",
    "determined = SE_re_df['nuts_1_region'].notnull().sum()\n",
    "print('NUTS successfully determined for', determined, 'out of', SE_re_df.shape[0], 'facilities in SE.')\n",
    "\n",
    "# Report the number of facilites whose NUTS codes could not be determined\n",
    "not_determined = SE_re_df['nuts_1_region'].isnull().sum()\n",
    "print('NUTS could not be determined for', not_determined, 'out of', SE_re_df.shape[0], 'facilities in SE.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select the columns to keep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define which columns should be kept\n",
    "columns_to_keep = ['municipality', 'county', 'nuts_1_region', 'nuts_2_region', 'nuts_3_region', 'lat', 'lon',\n",
    "           'energy_source_level_1', 'energy_source_level_2', 'technology', 'se_vindbrukskollen_id',\n",
    "            'site_name', 'manufacturer',\n",
    "           'electrical_capacity', 'commissioning_date', 'data_source']\n",
    "\n",
    "# Keep only the selected columns\n",
    "SE_re_df = SE_re_df.loc[:, columns_to_keep]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "visualize_points(SE_re_df['lat'],\n",
    "                 SE_re_df['lon'],\n",
    "                 'Sweden',\n",
    "                 categories=SE_re_df['technology']\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SE_re_df.reset_index(inplace=True, drop=True)\n",
    "SE_re_df.to_pickle('intermediate/SE_renewables.pickle')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "del SE_re_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Czech Republic\n",
    "\n",
    "The data for Czech Republic are provided by the following source:\n",
    "-  **[ERU (Energetický regulační úřad, Energy Regulatory Office)](http://www.eru.cz/en/)** - Administrative authority responsible for regulation in the energy sector. Provides the data on renewable energy plants in Czech Republic."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download and read the data\n",
    "\n",
    "Downloading the data from the original source may take 1-2 hours because it's done by scraping the information from HTML pages. \n",
    "\n",
    "If downloading fails because of the ERU's server refusing connections:\n",
    "- pause and wait for some time;\n",
    "- delete the file `eru.csv` in the CZ input directory;\n",
    "- try downloading again.\n",
    "\n",
    "Alternatively, you can download the data from the OPSD server."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the data and get the local paths to the corresponding files\n",
    "print('Start:', datetime.datetime.now())\n",
    "downloader = Downloader(version, input_directory_path, source_list_filepath, download_from)\n",
    "filepaths = downloader.download_data_for_country('CZ')\n",
    "print('End:', datetime.datetime.now())\n",
    "\n",
    "CZ_re_filepath = filepaths['ERU']\n",
    "CZ_geo_filepath = filepaths['Geonames']\n",
    "CZ_postcode2nuts_filepath = filepaths['Eurostat']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define a converter for CZ postcode strings\n",
    "def to_cz_postcode_format(postcode_str):\n",
    "    return postcode_str[:3] + ' ' + postcode_str[3:]\n",
    "\n",
    "# Read the data from the csv file\n",
    "CZ_re_df = pd.read_csv(CZ_re_filepath,\n",
    "                       escapechar='\\\\',\n",
    "                       dtype = {\n",
    "                           'number_of_sources' : int,\n",
    "                       },\n",
    "                       parse_dates=['licence_approval_date'],\n",
    "                       infer_datetime_format=True,\n",
    "                       converters = {\n",
    "                           'site_postcode' : to_cz_postcode_format,\n",
    "                           'holder_postcode' : to_cz_postcode_format\n",
    "                       }\n",
    "                      )\n",
    "# Show a few rows\n",
    "CZ_re_df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's inspect the dataframe's columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CZ_re_df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It contains 30 columns:\n",
    "- `site_name`, `site_region`, `site_postcode`, `site_locality`, `site_district` give us basic information on the site;\n",
    "- `megawatts_electric_total` shows us the total electric capacity of the site;\n",
    "- Since each site can use different types of energy, `megawatts_electric_hydro`, `megawatts_electric_solar`, `megawatts_electric_biogas_and_biomass`, `megawatts_electric_wind`, `megawatts_electric_unspecified` show us how total capacity breaks down to those renewable types from the OPSD energy hierarchy;\n",
    "- The columns beginning with `megawatts_thermal_` represent the amiunt of input energy required (and will be equal to zero in most cases);\n",
    "- `watercourse` and `watercourse_length_km` represent the name and length of the watercourse used by the site (if any);\n",
    "- `holder_name`, `holder_region`, `holder_address`, `holder_postcode`, `holder_locality`, `holder_district`, `holder_representative` give us basic information on the site's owner;\n",
    "- `licence_number` and `licence_approval_date` show us the licence number given to the holder and its approval date.\n",
    "- `link` points to the ERU page with the site's data in HTML.\n",
    "\n",
    "Since some sites use conventional types of energy, it is possible that `megawatts_electric_total > megawatts_electric_hydro + megawatts_electric_solar + megawatts_electric_biogas_and_biomass + megawatts_electric_wind + megawatts_electric_unspecified`. If the sum of renewable-energy capacities is equal to zero, that means that the correspoding row actually represents a conventional powerplant, so it should be excluded.\n",
    "\n",
    "Let us now check how many sites use how many types of renewable energy sources."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mwe_columns = [col for col in CZ_re_df.columns if 'megawatts_electric' in col and col != 'megawatts_electric_total']\n",
    "mwt_columns = [col for col in CZ_re_df.columns if 'megawatts_thermal' in col and col != 'megawatts_thermal_total']\n",
    "\n",
    "def count_types(row):\n",
    "    global mwe_columns\n",
    "    different_types = sum([row[col] > 0 for col in mwe_columns])\n",
    "    return different_types\n",
    "\n",
    "CZ_re_df.apply(count_types, axis=1).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As of April 2020, as we can see in the output above, there are only 4 sites which use more than one type of renewable energy, and there are 193 sites which do not use renewable energy at all."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop empty columns and rows\n",
    "CZ_re_df.dropna(axis='index', how='all', inplace=True)\n",
    "CZ_re_df.dropna(axis='columns', how='all', inplace=True)\n",
    "\n",
    "# Drop rows with no data on electrical capacity and the rows where total electrical capacity is 0\n",
    "empty_mask = (CZ_re_df['megawatts_electric_total'] == 0) | (CZ_re_df['megawatts_electric_total'].isnull())\n",
    "CZ_re_df = CZ_re_df.loc[~empty_mask]\n",
    "CZ_re_df.reset_index(inplace=True, drop=True)\n",
    "\n",
    "# Replace NANs with zeroes in mwe and mwt columns\n",
    "replacement_dict = {col : 0 for col in mwe_columns + mwt_columns}\n",
    "CZ_re_df.fillna(replacement_dict, inplace=True)\n",
    "\n",
    "# Drop the rows where renewable-energy share of the total capacity is equal to zero\n",
    "conventional_mask = (CZ_re_df['megawatts_electric_hydro'] +\n",
    "                     CZ_re_df['megawatts_electric_solar'] +\n",
    "                     CZ_re_df['megawatts_electric_biogas_and_biomass'] + \n",
    "                     CZ_re_df['megawatts_electric_wind'] + \n",
    "                     CZ_re_df['megawatts_electric_unspecified']) == 0\n",
    "\n",
    "CZ_re_df = CZ_re_df.loc[~conventional_mask]\n",
    "CZ_re_df.reset_index(inplace=True, drop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reformat the data\n",
    "\n",
    "There are sites which use different types of renewable source to produce electric energy. Those are the sites where at least two of the following columns are not equal to zero: `megawatts_electric_hydro`, `megawatts_electric_solar`, `megawatts_electric_biogas_and_biomass`, `megawatts_electric_wind`, `megawatts_electric_unspecified`. The data that come in this shape are said to be in the so called *wide format*. For the purpose of our later processing, it would be more convenient to have the data where each row is associated to one and only one type of energy (the so called *long format*). Therefore, we must first restructure our data from the wide to long format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the function which will extract the data about the type of energy specified by the given column\n",
    "# and return it as a dataframe in the \"long format\"\n",
    "def select_and_reformat(df, column):\n",
    "    # Use the mwe and mwt columns defined above\n",
    "    global mwe_columns\n",
    "    global mwt_columns\n",
    "    \n",
    "    # Declare the given column and its mwt counterpart as exceptions\n",
    "    mwt_exception = column.replace('electric', 'thermal')\n",
    "    exceptions = [column, mwt_exception]\n",
    "\n",
    "    # Exclude all the mwe and mwt columns which do not correspond to the given energy type\n",
    "    columns_to_skip = [col for col in mwe_columns + mwt_columns if col not in exceptions]\n",
    "    # Keep all the other columns\n",
    "    columns_to_keep = [col for col in df.columns if col not in columns_to_skip]\n",
    "    \n",
    "    # Find the stations which use the given type of energy\n",
    "    selection_mask = (df[column] > 0)\n",
    "    \n",
    "    # Keep them and select the columns we decided to keep\n",
    "    selection_df = df[selection_mask][columns_to_keep]\n",
    "    \n",
    "    # Create a new column which will indicate the energy type\n",
    "    selection_df['energy_type'] = \" \".join(column.split('_')[2:])\n",
    "    \n",
    "    # Remove the energy type name from the columns representing electrical capacity\n",
    "    # and megawatts thermal\n",
    "    selection_df.rename(columns = {column : 'electrical_capacity',\n",
    "                                   mwt_exception : 'megawatts_thermal'},\n",
    "                        inplace=True)\n",
    "    selection_df.drop(columns=['megawatts_electric_total', 'megawatts_thermal_total'],\n",
    "                     inplace=True)\n",
    "    \n",
    "    # Ensure the rows are properly indexed as 0,1,2,...\n",
    "    selection_df.reset_index(inplace=True, drop=True)\n",
    "    \n",
    "    return selection_df\n",
    "\n",
    "# Create a dataframe for each energy type\n",
    "dataframes = []\n",
    "for column in mwe_columns:\n",
    "    selection = select_and_reformat(CZ_re_df, column)\n",
    "    energy_type = selection['energy_type'].unique()[0]\n",
    "    dataframes.append(selection)\n",
    "\n",
    "# Concatenate the dataframes\n",
    "CZ_re_df = pd.concat(dataframes, ignore_index=False)\n",
    "CZ_re_df.reset_index(inplace=True, drop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us see what is this restructured dataframe like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CZ_re_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of columns has been reduced as we have transformed the data to the long format. The rows representning conventional power plants have been excluded. Since only few sites use multiple types of energy, the total number of rows has not increased."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for CZ and create the translation dictionary\n",
    "idx_CZ = columnnames[columnnames['country'] == 'CZ'].index\n",
    "column_dict_CZ = columnnames.loc[idx_CZ].set_index('original_name')['opsd_name'].to_dict()\n",
    "\n",
    "# Show the dictionary\n",
    "column_dict_CZ"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Translate column names\n",
    "CZ_re_df.rename(columns=column_dict_CZ, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Translate values and harmonize energy levels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for Czech Republic\n",
    "idx_CZ = valuenames[valuenames['country'] == 'CZ'].index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the translation terms for energy source level 3\n",
    "energy3_dict_CZ = valuenames.loc[idx_CZ].set_index('original_name')['opsd_name'].to_dict()\n",
    "energy3_dict_CZ\n",
    "\n",
    "# Add energy source level 3\n",
    "CZ_re_df['energy_source_level_3'] = CZ_re_df['technology'].replace(energy3_dict_CZ)\n",
    "\n",
    "# Choose the terms for energy source level 2\n",
    "energy2_dict_CZ = valuenames.loc[idx_CZ].set_index('original_name')['energy_source_level_2'].to_dict()\n",
    "CZ_re_df['energy_source_level_2'] = CZ_re_df['technology'].replace(energy2_dict_CZ)\n",
    "\n",
    "# Standardize the values for technology\n",
    "# 1. np.nan means that technology should not be specified for the respective kind of sources\n",
    "#    according to the hierarchy (http://open-power-system-data.org/2016-10-25-opsd_tree.svg)\n",
    "# 2. 'Other or unspecified technology' means that technology should be specified\n",
    "#    but it was unclear or missing in the original dataset.\n",
    "technology_dict = {\n",
    "    'biogas and biomass' : np.nan,\n",
    "    'wind' : 'Onshore',\n",
    "    'solar' : 'Other or unspecified technology',\n",
    "    'hydro' : 'Run-of-river',\n",
    "    'unspecified' : np.nan\n",
    "}\n",
    "CZ_re_df['technology'] = CZ_re_df['technology'].replace(technology_dict)\n",
    "\n",
    "# Add energy_source_level_1\n",
    "CZ_re_df['energy_source_level_1'] = 'Renewable energy'\n",
    "\n",
    "# Show the hierarchy of sources present in the dataset\n",
    "CZ_re_df[['energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology']].drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add data source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CZ_re_df['data_source'] = 'ERU'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Georeferencing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get geo-information\n",
    "zip_CZ_geo = zipfile.ZipFile(CZ_geo_filepath)\n",
    "\n",
    "# Read generated postcode/location file\n",
    "CZ_geo = pd.read_csv(zip_CZ_geo.open('CZ.txt'), sep='\\t', header=None)\n",
    "\n",
    "# add column names as defined in associated readme file\n",
    "CZ_geo.columns = ['country_code', 'postcode', 'place_name', 'admin_name1',\n",
    "                  'admin_code1', 'admin_name2', 'admin_code2', 'admin_name3',\n",
    "                  'admin_code3', 'lat', 'lon', 'accuracy']\n",
    "\n",
    "# Drop rows of possible duplicate postal_code\n",
    "CZ_geo.drop_duplicates('postcode', keep='last', inplace=True)\n",
    "\n",
    "# Add longitude/latitude infomation assigned by postcode\n",
    "CZ_re_df = pd.merge(CZ_re_df,\n",
    "                    CZ_geo[['lat', 'lon', 'postcode']],\n",
    "                    left_on='postcode',\n",
    "                    right_on='postcode',\n",
    "                    how='left'\n",
    "                   )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assign NUTS codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CZ_postcode2nuts_filepath = filepaths['Eurostat']\n",
    "\n",
    "CZ_re_df = nuts_converter.add_nuts_information(CZ_re_df, 'CZ', CZ_postcode2nuts_filepath, how=['postcode'])\n",
    "\n",
    "# Report the number of facilites whose NUTS codes were successfully determined\n",
    "determined = CZ_re_df['nuts_1_region'].notnull().sum()\n",
    "print('NUTS successfully determined for', determined, 'out of', CZ_re_df.shape[0], 'facilities in CZ.')\n",
    "\n",
    "# Report the number of facilites whose NUTS codes could not be determined\n",
    "not_determined = CZ_re_df['nuts_1_region'].isnull().sum()\n",
    "print('NUTS could not be determined for', not_determined, 'out of', CZ_re_df.shape[0], 'facilities in CZ.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select the columns to keep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define which columns should be kept\n",
    "columns_to_keep = ['site_name', 'region', 'municipality', 'locality', 'postcode',\n",
    "                   'nuts_1_region', 'nuts_2_region', 'nuts_3_region', 'lat', 'lon',\n",
    "                   'energy_source_level_1', 'energy_source_level_2', 'energy_source_level_3', 'technology', \n",
    "                   'owner', 'electrical_capacity',  'data_source']\n",
    "\n",
    "# Keep only the selected columns\n",
    "CZ_re_df = CZ_re_df.loc[:, columns_to_keep]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Drop duplicates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CZ_re_df.drop_duplicates(inplace=True)\n",
    "CZ_re_df.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualuze"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "visualize_points(CZ_re_df['lat'],\n",
    "                 CZ_re_df['lon'],\n",
    "                 'Czechia',\n",
    "                 categories=CZ_re_df['energy_source_level_2']\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CZ_re_df.reset_index(inplace=True, drop=True)\n",
    "CZ_re_df.to_pickle('intermediate/CZ_renewables.pickle')\n",
    "del CZ_re_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Zip the raw data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "zip_archive = zipfile.ZipFile(input_directory_path + '.zip', 'w', zipfile.ZIP_DEFLATED)\n",
    "print(\"Zipping the raw files...\")\n",
    "for filename in os.listdir(input_directory_path):\n",
    "    print(\"Adding\", filename, \"to the zip.\")\n",
    "    filepath = os.path.join(input_directory_path, filename)\n",
    "    zip_archive.write(filepath)\n",
    "zip_archive.close()\n",
    "print(\"Done!\")\n",
    "#shutil.rmtree(input_directory_path)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "p33",
   "language": "python",
   "name": "p33"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  },
  "latex_envs": {
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 0
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {
    "height": "716px",
    "left": "104px",
    "top": "280px",
    "width": "231px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}