{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table style=\"width:100%; background-color: #D9EDF7\">\n",
    "  <tr>\n",
    "    <td style=\"border: 1px solid #CFCFCF\">\n",
    "      <b>Weather data: Main notebook</b>\n",
    "      <ul>\n",
    "        <li><a href=\"main.ipynb\">Main Notebook</a></li>\n",
    "        <li>Downloading Notebook</li>\n",
    "      </ul>\n",
    "      <br>This Notebook is part of the <a href=\"http://data.open-power-system-data.org/weather_data\">Weather data Datapackage</a> of <a href=\"http://open-power-system-data.org\">Open Power System Data</a>.\n",
    "    </td>\n",
    "  </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": "true"
   },
   "source": [
    "# Table of Contents\n",
    "* [1. Introductory Notes](#1.-Introductory-Notes)\n",
    "\t* [1.1 State of the script:](#1.1-State-of-the-script:)\n",
    "\t* [1.2 How to use the script:](#1.2-How-to-use-the-script:)\n",
    "* [2. Script Setup](#2.-Script-Setup)\n",
    "* [3. Download raw data](#3.-Download-raw-data)\n",
    "\t* [3.1 Input](#3.1-Input)\n",
    "\t\t* [3.1.1 Parameter selection](#3.1.1-Parameter-selection)\n",
    "\t\t* [3.1.2 Timeframe](#3.1.2-Timeframe)\n",
    "\t\t* [3.1.3 Geography coordinates](#3.1.3-Geography-coordinates)\n",
    "\t* [3.2 Subsetting data](#3.2-Subsetting-data)\n",
    "* [4. Downloading data](#4.-Downloading-data)\n",
    "\t* [4.1 Get wind data](#4.1-Get-wind-data)\n",
    "\t* [4.2 Get roughness](#4.2-Get-roughness)\n",
    "\t* [4.3 Get lat and lon dimensions](#4.3-Get-lat-and-lon-dimensions)\n",
    "\t* [4.4 Check the precision of the downloaded data](#4.4-Check-the-precision-of-the-downloaded-data)\n",
    "* [5. Setting up DataFrame](#5.-Setting-up-DataFrame)\n",
    "\t* [5.1 Converting the timeformat to ISO 8601](#5.1-Converting-the-timeformat-to-ISO-8601)\n",
    "* [6. Setting up Roughness dataframe](#6.-Setting-up-Roughness-dataframe)\n",
    "\t* [6.1 Combining the wind and roughness dataframe](#6.1-Combining-the-wind-and-roughness-dataframe)\n",
    "* [7. Structure the dataframe, add and remove columns](#7.-Structure-the-dataframe,-add-and-remove-columns)\n",
    "\t* [7.1 Calculating the height with displacement height](#7.1-Calculating-the-height-with-displacement-height)\n",
    "\t* [7.2 Adding needed and removing not needed columns](#7.2-Adding-needed-and-removing-not-needed-columns)\n",
    "\t* [7.3 Renaming columns](#7.3-Renaming-columns)\n",
    "\t* [7.4 First look at the final data frame structure and format](#7.4-First-look-at-the-final-data-frame-structure-and-format)\n",
    "\t* [7.5 Saving dataframe](#7.5-Saving-dataframe)\n",
    "* [8. Create metadata](#8.-Create-metadata)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Introductory Notes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This script contains code that allows the download, subset and processing of [MERRA-2](http://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/) datasets (provided by NASA Goddard Space Flight Center) and export them as CSV.\n",
    "\n",
    "**Weather data differ significantly from the other data types used resp. provided by OPSD** in that the sheer size of the data packages greatly exceeds OPSD's capacity to host them in a similar way as feed-in timeseries, power plant data etc. While the other data packages also offer a complete one-klick download of the bundled data packages with all relevant data this is impossible for weather datasets like MERRA-2 due to their size (variety of variables, very long timespan, huge geographical coverage etc.). It would make no sense to mirror the data from the NASA servers.\n",
    "\n",
    "Instead we choose to provide only a **documented methodological script**. The  method describes one way to automatically obtain the desired weather data from the MERRA-2 database and simplifies resp. unifies alternative manual data obtaining methods in a single script.\n",
    "\n",
    "**More detailed background information** on weather data can be found in the <a href=\"main.ipynb\">Main notebook</a> and the [OPSD Wiki](https://github.com/Open-Power-System-Data/common/wiki/Information-on-weather-data) on Github."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.1 State of the script:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The script is still in development. It is currently working. The next step will be adding support for more datasets and more weather variables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.2 How to use the script:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To download MERRA-2 data, you have to register at NASA earth data portal\n",
    "1. Register an account at [https://urs.earthdata.nasa.gov/](https://urs.earthdata.nasa.gov/)\n",
    "2. Go to my applications and add NASA GESDISC DATA ARCHIVE\n",
    "3. Input your username and password when requested by the script"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Script Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import xarray as xr\n",
    "import numpy as np\n",
    "import requests\n",
    "import logging\n",
    "import yaml\n",
    "import json\n",
    "import os\n",
    "from datetime import datetime\n",
    "from calendar import monthrange\n",
    "from opendap_download.multi_processing_download import DownloadManager\n",
    "import math\n",
    "from functools import partial\n",
    "import re\n",
    "import getpass\n",
    "from datetime import datetime, timedelta\n",
    "import dateutil.parser\n",
    "\n",
    "# Set up a log\n",
    "logging.basicConfig(level=logging.INFO)\n",
    "log = logging.getLogger('notebook')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Download raw data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.1 Input"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This part defines the input parameters according to the user and creates an URL that can download the desired MERRA-2 data via the OPeNDAP interface ([more information)](https://github.com/Open-Power-System-Data/common/wiki/Information-on-weather-data#4-what-is-opendap)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1.1 Parameter selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are the possible parameters (i.e. weather data)\n",
    "* wind\n",
    "* solar radiation\n",
    "* temperature\n",
    "\n",
    "If you want to select more than one parameter, separate them with commas. For example: `wind, solar radiation, temperature`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2016-07-12T17:02:07.640325",
     "start_time": "2016-07-12T17:02:07.584319"
    },
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Getting user input\n",
    "# This version only supports wind so far. This line does not do anything at \n",
    "# this point.\n",
    "possible_params = ['wind', 'solar radiation', 'temperature']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1.2 Timeframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Definition of desired timespan the data is needed for. (currently only full years possible)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# User input of timespan\n",
    "download_year = 2014\n",
    "# Create the start date 2014-01-01\n",
    "download_start_date = str(download_year) + '-01-01'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1.3 Geography coordinates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Definition of desired coordinates. The user has to input two corner coordinates of a rectangular area (Format WGS84, decimal system).\n",
    "* Northeast coordinate: lat_1, lon_1\n",
    "* Southwest coordinate: lat_2, lon_2\n",
    "\n",
    "The area/coordinates will be converted from lat/lon to the MERRA-2 grid coordinates.\n",
    "Since the resolution of the MERRA-2 grid is 0.5 x 0.625°, the given coordinates will \n",
    "matched as close as possible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2016-07-12T17:24:14.539011",
     "start_time": "2016-07-12T17:24:14.526010"
    },
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# User input of coordinates\n",
    "# ------\n",
    "# Example: Schleswig-Holstein (lat/lon)\n",
    "# Northeastern point: 55,036823°N, 11,349297°E\n",
    "# Southwestern point: 53,366266°N, 7,887088°E\n",
    "\n",
    "# Northeastern coordinate \n",
    "lat_1, lon_1 = 53.366266, 7.887088\n",
    "# Southwestern coordinate\n",
    "lat_2, lon_2 = 55.036823, 11.349297\n",
    "\n",
    "def translate_lat_to_geos5_native(latitude):\n",
    "    \"\"\"\n",
    "    The source for this formula is in the MERRA2 \n",
    "    Variable Details - File specifications for GEOS pdf file.\n",
    "    \n",
    "    latitude: float Needs +/- instead of N/S\n",
    "    \"\"\"\n",
    "    return 1 + ((latitude + 90) / 0.5)\n",
    "\n",
    "def translate_lon_to_geos5_native(longitude):\n",
    "    \"\"\"See function above\"\"\"\n",
    "    return (1 + ((longitude) + 180) / 0.625)\n",
    "\n",
    "def find_closest_coordinate(calc_coord, coord_array):\n",
    "    \"\"\"\n",
    "    Since the resolution of the grid is 0.5 x 0.625, the 'real world'\n",
    "    coordinates will not be matched 100% correctly. This function matches \n",
    "    the coordinates as close as possible. \n",
    "    \"\"\"\n",
    "    # np.argmin() finds the smallest value in an array and returns its\n",
    "    # index. np.abs() returns the absolute value of each item of an array.\n",
    "    # To summarize, the function finds the difference closest to 0 and returns \n",
    "    # its index. \n",
    "    index = np.abs(coord_array-calc_coord).argmin()\n",
    "    return coord_array[index]\n",
    "\n",
    "# The arrays contain the coordinates of the grid used by the API.\n",
    "lat_coords = np.arange(0, 360, dtype=int)\n",
    "lon_coords = np.arange(0, 575, dtype=int)\n",
    "\n",
    "# Translate the coordinates that define your area to grid coordinates.\n",
    "lat_coord_1 = translate_lat_to_geos5_native(lat_1)\n",
    "lon_coord_1 = translate_lon_to_geos5_native(lon_1)\n",
    "lat_coord_2 = translate_lat_to_geos5_native(lat_2)\n",
    "lon_coord_2 = translate_lon_to_geos5_native(lon_2)\n",
    "\n",
    "\n",
    "# Find the closest coordinate in the grid.\n",
    "lat_co_1_closest = find_closest_coordinate(lat_coord_1, lat_coords)\n",
    "lon_co_1_closest = find_closest_coordinate(lon_coord_1, lon_coords)\n",
    "lat_co_2_closest = find_closest_coordinate(lat_coord_2, lat_coords)\n",
    "lon_co_2_closest = find_closest_coordinate(lon_coord_2, lon_coords)\n",
    "\n",
    "# Check the precision of the grid coordinates. These coordinates are not lat/lon. \n",
    "# They are coordinates on the MERRA-2 grid. \n",
    "log.info('Calculated coordinates for point 1: ' + str((lat_coord_1, lon_coord_1)))\n",
    "log.info('Closest coordinates for point 1: ' + str((lat_co_1_closest, lon_co_1_closest)))\n",
    "log.info('Calculated coordinates for point 2: ' + str((lat_coord_2, lon_coord_2)))\n",
    "log.info('Closest coordinates for point 2: ' + str((lat_co_2_closest, lon_co_2_closest)))\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 3.2 Subsetting data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Combining parameter choices above/translation according to OPenDAP guidelines into URL-appendix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def translate_year_to_file_number(year):\n",
    "    \"\"\"\n",
    "    The file names consist of a number and a meta data string. \n",
    "    The number changes over the years. 1980 until 1991 it is 100, \n",
    "    1992 until 2000 it is 200, 2001 until 2010 it is  300 \n",
    "    and from 2011 until now it is 400.\n",
    "    \"\"\"\n",
    "    file_number = ''\n",
    "    \n",
    "    if year >= 1980 and year < 1992:\n",
    "        file_number = '100'\n",
    "    elif year >= 1992 and year < 2001:\n",
    "        file_number = '200'\n",
    "    elif year >= 2001 and year < 2011:\n",
    "        file_number = '300'\n",
    "    elif year >= 2011:\n",
    "        file_number = '400'\n",
    "    else:\n",
    "        raise Exception('The specified year is out of range.')\n",
    "    \n",
    "    return file_number\n",
    "    \n",
    "\n",
    "\n",
    "def generate_url_params(parameter, time_para, lat_para, lon_para):\n",
    "    \"\"\"Creates a string containing all the parameters in query form\"\"\"\n",
    "    parameter = map(lambda x: x + time_para, parameter)\n",
    "    parameter = map(lambda x: x + lat_para, parameter)\n",
    "    parameter = map(lambda x: x + lon_para, parameter)\n",
    "    \n",
    "    return ','.join(parameter)\n",
    "    \n",
    "    \n",
    "\n",
    "def generate_download_links(download_years, base_url, dataset_name, url_params):\n",
    "    \"\"\"\n",
    "    Generates the links for the download. \n",
    "    download_years: The years you want to download as array. \n",
    "    dataset_name: The name of the data set. For example tavg1_2d_slv_Nx\n",
    "    \"\"\"\n",
    "    urls = []\n",
    "    for y in download_years: \n",
    "    # build the file_number\n",
    "        y_str = str(y)\n",
    "        file_num = translate_year_to_file_number(download_year)\n",
    "        for m in range(1,13):\n",
    "            # build the month string: for the month 1 - 9 it starts with a leading 0. \n",
    "            # zfill solves that problem\n",
    "            m_str = str(m).zfill(2)\n",
    "            # monthrange returns the first weekday and the number of days in a \n",
    "            # month. Also works for leap years.\n",
    "            _, nr_of_days = monthrange(y, m)\n",
    "            for d in range(1,nr_of_days+1):\n",
    "                d_str = str(d).zfill(2)\n",
    "                # Create the file name string\n",
    "                file_name = 'MERRA2_{num}.{name}.{y}{m}{d}.nc4'.format(\n",
    "                    num=file_num, name=dataset_name, \n",
    "                    y=y_str, m=m_str, d=d_str)\n",
    "                # Create the query\n",
    "                query = '{base}{y}/{m}/{name}.nc4?{params}'.format(\n",
    "                    base=base_url, y=y_str, m=m_str, \n",
    "                    name=file_name, params=url_params)\n",
    "                urls.append(query)\n",
    "    return urls\n",
    "\n",
    "requested_params = ['U2M', 'U10M', 'U50M', 'V2M', 'V10M', 'V50M', 'DISPH']\n",
    "requested_time = '[0:1:23]'\n",
    "# Creates a string that looks like [start:1:end]. start and end are the lat or\n",
    "# lon coordinates define your area.\n",
    "requested_lat = '[{lat_1}:1:{lat_2}]'.format(\n",
    "                lat_1=lat_co_1_closest, lat_2=lat_co_2_closest)\n",
    "requested_lon = '[{lon_1}:1:{lon_2}]'.format(\n",
    "                lon_1=lon_co_1_closest, lon_2=lon_co_2_closest)\n",
    "\n",
    "\n",
    "\n",
    "parameter = generate_url_params(requested_params, requested_time,\n",
    "                                requested_lat, requested_lon)\n",
    "\n",
    "BASE_URL = 'http://goldsmr4.sci.gsfc.nasa.gov:80/opendap/MERRA2/M2T1NXSLV.5.12.4/'\n",
    "generated_URL = generate_download_links([download_year], BASE_URL, \n",
    "                                        'tavg1_2d_slv_Nx', \n",
    "                                        parameter)\n",
    "            \n",
    "# See what a query to the MERRA-2 portal looks like.        \n",
    "log.info(generated_URL[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Downloading data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This part subsequently downloads the subsetted raw data from the MERRA-2-datasets. \n",
    "The download process is outsourced from the notebook, because it is a standard and repetitive process. If you are interested in the the code, see the [opendap_download module](opendap_download/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1 Get wind data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "from the dataset [tavg1_2d_slv_Nx (M2T1NXSLV)](http://goldsmr4.sci.gsfc.nasa.gov/opendap/MERRA2/M2T1NXSLV.5.12.4/contents.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Download data (one file per day and dataset) with links to local directory.\n",
    "# Username and password for MERRA-2 (NASA earthdata portal)\n",
    "username = input('Username: ')\n",
    "password = getpass.getpass('Password:')\n",
    "# The DownloadManager is able to download files. If you have a fast internet \n",
    "# connection, setting this to 2 is enough. If you have slow wifi, try setting\n",
    "# it to 4 or 5. If you download too fast, the data portal might ban you for a \n",
    "# day. \n",
    "NUMBER_OF_CONNECTIONS = 2\n",
    "\n",
    "# The DownloadManager class is defined in the opendap_download module. \n",
    "download_manager = DownloadManager()\n",
    "download_manager.set_username_and_password(username, password)\n",
    "download_manager.download_path = 'download'\n",
    "download_manager.download_urls = generated_URL\n",
    "# If you want to see the download progress, check the download folder you \n",
    "# specified\n",
    "%time download_manager.start_download(NUMBER_OF_CONNECTIONS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2 Get roughness"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "from the dataset [tavg1_2d_flx_Nx (M2T1NXFLX)](http://goldsmr4.sci.gsfc.nasa.gov/opendap/MERRA2/M2T1NXFLX.5.12.4/contents.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Roughness data is in a different data set. The parameter is called Z0M. \n",
    "roughness_para = generate_url_params(['Z0M'], requested_time, \n",
    "                                     requested_lat, requested_lon)\n",
    "ROUGHNESS_BASE_URL = 'http://goldsmr4.sci.gsfc.nasa.gov/opendap/MERRA2/M2T1NXFLX.5.12.4/'\n",
    "roughness_links = generate_download_links([download_year], ROUGHNESS_BASE_URL,\n",
    "                                          'tavg1_2d_flx_Nx', roughness_para)\n",
    "            \n",
    "download_manager.download_path = 'roughness_download'\n",
    "download_manager.download_urls = roughness_links\n",
    "\n",
    "# If you want to see the download progress, check the download folder you \n",
    "# specified.\n",
    "%time download_manager.start_download(NUMBER_OF_CONNECTIONS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3 Get lat and lon dimensions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For now, the dataset only has MERRA-2 grid coordinates. To translate the points\n",
    "back to \"real world\" coordinates, the data portal offers a dimension scale file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# The dimensions map the MERRA2 grid coordinates to lat/lon. The coordinates \n",
    "# to request are 0:360 wheare as the other coordinates are 1:361\n",
    "requested_lat_dim = '[{lat_1}:1:{lat_2}]'.format(\n",
    "                    lat_1=lat_co_1_closest - 1, lat_2=lat_co_2_closest - 1)\n",
    "requested_lon_dim = '[{lon_1}:1:{lon_2}]'.format(\n",
    "                    lon_1=lon_co_1_closest - 1, lon_2=lon_co_2_closest - 1)\n",
    "\n",
    "lat_lon_dimension_para = 'lat' + requested_lat_dim + ',lon' + requested_lon_dim\n",
    "# Creating the download url.\n",
    "dimension_url = 'http://goldsmr4.sci.gsfc.nasa.gov:80/opendap/MERRA2/M2T1NXSLV.5.12.4/2014/01/MERRA2_400.tavg1_2d_slv_Nx.20140101.nc4.nc4?'\n",
    "dimension_url = dimension_url + lat_lon_dimension_para\n",
    "download_manager.download_path = 'dimension_scale'\n",
    "download_manager.download_urls = [dimension_url]\n",
    "# Since the dimension is only one file, we only need one connection. \n",
    "%time download_manager.start_download(1)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.4 Check the precision of the downloaded data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Due to the back and forth conversion from \"real world\" coordinates to MERRA-2 grid points,\n",
    "this part helps you to check if the conversion was precise enough."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "file_path = os.path.join('dimension_scale', DownloadManager.get_filename(\n",
    "        dimension_url))\n",
    "\n",
    "with xr.open_dataset(file_path) as ds_dim:\n",
    "    df_dim = ds_dim.to_dataframe()\n",
    "\n",
    "lat_array = ds_dim['lat'].data.tolist()\n",
    "lon_array = ds_dim['lon'].data.tolist()\n",
    "\n",
    "# The log output helps evaluating the precision of the received data.\n",
    "log.info('Requested lat: ' + str((lat_1, lat_2)))\n",
    "log.info('Received lat: ' + str(lat_array))\n",
    "log.info('Requested lon: ' + str((lon_1, lon_2)))\n",
    "log.info('Received lon: ' + str(lon_array))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Setting up DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This part sets up a DataFrame and reads the raw data into it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def extract_date(data_set):\n",
    "    \"\"\"\n",
    "    Extracts the date from the filename before merging the datasets. \n",
    "    \"\"\"\n",
    "    try:\n",
    "        # The attribute name changed during the development of this script\n",
    "        # from HDF5_Global.Filename to Filename. \n",
    "        if 'HDF5_GLOBAL.Filename' in data_set.attrs:\n",
    "            f_name = data_set.attrs['HDF5_GLOBAL.Filename']\n",
    "        elif 'Filename' in data_set.attrs:\n",
    "            f_name = data_set.attrs['Filename']\n",
    "        else: \n",
    "            raise AttributeError('The attribute name has changed again!')\n",
    "        \n",
    "        # find a match between \".\" and \".nc4\" that does not have \".\" .\n",
    "        exp = r'(?<=\\.)[^\\.]*(?=\\.nc4)'\n",
    "        res = re.search(exp, f_name).group(0)\n",
    "        # Extract the date. \n",
    "        y, m, d = res[0:4], res[4:6], res[6:8]\n",
    "        date_str = ('%s-%s-%s' % (y, m, d))\n",
    "        data_set = data_set.assign(date=date_str)\n",
    "        return data_set\n",
    "\n",
    "    except KeyError:\n",
    "        # The last dataset is the one all the other sets will be merged into. \n",
    "        # Therefore, no date can be extracted.\n",
    "        return data_set\n",
    "        \n",
    "\n",
    "file_path = os.path.join('download', '*.nc4')\n",
    "\n",
    "try:\n",
    "    with xr.open_mfdataset(file_path, concat_dim='date',\n",
    "                           preprocess=extract_date) as ds_wind:\n",
    "        print(ds_wind)\n",
    "        df_wind = ds_wind.to_dataframe()\n",
    "        \n",
    "except xr.MergeError as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df_wind.reset_index(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "start_date = datetime.strptime(download_start_date, '%Y-%m-%d')\n",
    "\n",
    "def calculate_datetime(d_frame):\n",
    "    \"\"\"\n",
    "    Calculates the accumulated hour based on the date.\n",
    "    \"\"\"\n",
    "    cur_date = datetime.strptime(d_frame['date'], '%Y-%m-%d')\n",
    "    hour = int(d_frame['time'])\n",
    "    delta = abs(cur_date - start_date).days\n",
    "    date_time_value = (delta * 24) + (hour)\n",
    "    return date_time_value\n",
    "\n",
    "\n",
    "df_wind['date_time_hours'] = df_wind.apply(calculate_datetime, axis=1)\n",
    "df_wind"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1 Converting the timeformat to ISO 8601"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def converting_timeformat_to_ISO8601(row):\n",
    "    \"\"\"Generates datetime according to ISO 8601 (UTC)\"\"\"\n",
    "    date = dateutil.parser.parse(row['date'])\n",
    "    hour = int(row['time'])\n",
    "    # timedelta from the datetime module enables the programmer \n",
    "    # to add time to a date. \n",
    "    date_time = date + timedelta(hours = hour)\n",
    "    return str(date_time.isoformat()) + 'Z'  # MERRA2 datasets have UTC time zone.\n",
    "df_wind['date_utc'] = df_wind.apply(converting_timeformat_to_ISO8601, axis=1)\n",
    "\n",
    "\n",
    "df_wind['date_utc']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def calculate_windspeed(d_frame, idx_u, idx_v):\n",
    "    \"\"\"\n",
    "    Calculates the windspeed. The returned unit is m/s\n",
    "    \"\"\"\n",
    "    um = int(d_frame[idx_u])\n",
    "    vm = int(d_frame[idx_v])\n",
    "    speed = math.sqrt((um ** 2) + (vm ** 2))\n",
    "    return round(speed, 2)\n",
    "\n",
    "# partial is used to create a function with pre-set arguments. \n",
    "calc_windspeed_2m = partial(calculate_windspeed, idx_u='U2M', idx_v='V2M')\n",
    "calc_windspeed_10m = partial(calculate_windspeed, idx_u='U10M', idx_v='V10M')\n",
    "calc_windspeed_50m = partial(calculate_windspeed, idx_u='U50M', idx_v='V50M')\n",
    "\n",
    "df_wind['v_2m'] = df_wind.apply(calc_windspeed_2m, axis=1)\n",
    "df_wind['v_10m']= df_wind.apply(calc_windspeed_10m, axis=1)\n",
    "df_wind['v_50m'] = df_wind.apply(calc_windspeed_50m, axis=1)\n",
    "df_wind"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6. Setting up Roughness dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "file_path = os.path.join('roughness_download', '*.nc4')\n",
    "with xr.open_mfdataset(file_path, concat_dim='date', \n",
    "                       preprocess=extract_date) as ds_rough:\n",
    "    df_rough = ds_rough.to_dataframe()\n",
    "\n",
    "df_rough.reset_index(inplace=True)\n",
    "\n",
    "df_rough"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 6.1 Combining the wind and roughness dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "df = pd.merge(df_wind, df_rough, on=['date', 'lat', 'lon', 'time'])\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7. Structure the dataframe, add and remove columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7.1 Calculating the height with displacement height"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Calculate height for v_2m and v_10m (2 + DISPH or 10 + DISPH).\n",
    "df['h_v1'] = df.apply((lambda x:int(x['DISPH']) + 2), axis=1)\n",
    "df['h_v2'] = df.apply((lambda x:int(x['DISPH']) + 10), axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7.2 Adding needed and removing not needed columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df['v_100m'] = np.nan\n",
    "df.drop('DISPH', axis=1, inplace=True)\n",
    "df.drop(['time', 'date'], axis=1, inplace=True)\n",
    "df.drop(['U2M', 'U10M', 'U50M', 'V2M', 'V10M', 'V50M'], axis=1, inplace=True)\n",
    "\n",
    "df['lat'] = df['lat'].apply(lambda x: lat_array[int(x)])\n",
    "df['lon'] = df['lon'].apply(lambda x: lon_array[int(x)])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7.3 Renaming columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "rename_map = {'date_time_hours': 'cumulated hours', \n",
    "              'date_utc': 'timestamp',\n",
    "              'v_2m': 'v1', \n",
    "              'v_10m': 'v2', \n",
    "              'Z0M': 'z0'\n",
    "             }\n",
    "\n",
    "df.rename(columns=rename_map, inplace=True)\n",
    "\n",
    "# Change order of the columns\n",
    "columns = ['timestamp', 'cumulated hours', 'lat', 'lon',\n",
    "        'v1', 'v2', 'v_50m', 'v_100m',\n",
    "        'h_v1', 'h_v2', 'z0']\n",
    "df = df[columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7.4 First look at the final data frame structure and format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "Take a look at the resulting dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7.5 Saving dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the final dataframe locally"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.to_csv('weather_data_result.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8. Create metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "metadata = \"\"\"\n",
    "name: opsd-weather-data\n",
    "title: Weather data\n",
    "description: Script for the download of MERRA-2 weather data\n",
    "long_description: >-\n",
    "    Weather data differ significantly from the other data types used resp. \n",
    "    provided by OPSD in that the sheer size of the data packages greatly \n",
    "    exceeds OPSD's capacity to host them in a similar way as feed-in \n",
    "    timeseries, power plant data etc. While the other data packages also\n",
    "    offer a complete one-klick download of the bundled data packages with \n",
    "    all relevant data this is impossible for weather datasets like MERRA-2 due \n",
    "    to their size (variety of variables, very long timespan, huge geographical\n",
    "    coverage etc.). It would make no sense to mirror the data from the NASA \n",
    "    servers.\n",
    "    Instead we choose to provide a documented methodological script \n",
    "    (as a kind of tutorial). The method describes one way to automatically \n",
    "    obtain the desired weather data from the MERRA-2 database and simplifies \n",
    "    resp. unifies alternative manual data obtaining methods in a single \n",
    "    script.\n",
    "version: \"2016-10-21\"\n",
    "resources:\n",
    "    - path: renewable_power_plants_DE.csv\n",
    "      format: csv\n",
    "      encoding: UTF-8\n",
    "      missingValue: \"\"\n",
    "      schema:         \n",
    "          fields:\n",
    "            - name: timestamp\n",
    "              type: timestamp\n",
    "              format: YYYY-MM-DDThh:mm:ssZ\n",
    "              description: Timestap \n",
    "            - name:  cumulated hours\n",
    "              type: int\n",
    "              description: Cumulated hours over the defined timerange\n",
    "            - name: lat\n",
    "              description: Latitude\n",
    "              type: string\n",
    "            - name: lon\n",
    "              description: Longitude\n",
    "              type: float\n",
    "            - name: v1\n",
    "              description: Windspeed at displacement height + 2m\n",
    "              type: float\n",
    "              unit: m\n",
    "            - name:  v2\n",
    "              description: Windspeed at displacement height + 10m\n",
    "              type: float\n",
    "              unit: m\n",
    "            - name: v_50m\n",
    "              description: Windspeed at 50m height\n",
    "              type: float\n",
    "              unit: m\n",
    "            - name: v_100m\n",
    "              description: Windspeed at 100m height\n",
    "              type: float\n",
    "              unit: m\n",
    "            - name: h_v1 \n",
    "              description: Height of v1\n",
    "              type: float\n",
    "              unit: m\n",
    "            - name:  h_v2\n",
    "              description: Height of v2\n",
    "              type: float\n",
    "              unit: m\n",
    "            - name: z0\n",
    "              description: Roughness length\n",
    "              type: string\n",
    "keywords: [Open Power System Data, MERRA-2, wind, solar, ]\n",
    "geographical-scope: Worldwide\n",
    "licenses:\n",
    "    - type: MIT license\n",
    "      url: http://www.opensource.org/licenses/MIT\n",
    "sources:\n",
    "    - name: MERRA-2 \n",
    "      web: https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/\n",
    "      source: National Aeronautics and Space Administration - Goddard Space Flight Center\n",
    "contributors:\n",
    "    - name: Jan Urbansky\n",
    "      email: jan.urbansky@uni-flensburg.de\n",
    "      web: \n",
    "views: True\n",
    "openpowersystemdata-enable-listing: True\n",
    "documentation: https://github.com/Open-Power-System-Data/weather_data/blob/2016-10-21/main.ipynb\n",
    "last_changes: Published on the main repository\n",
    "\"\"\"\n",
    "\n",
    "metadata = yaml.load(metadata)\n",
    "\n",
    "datapackage_json = json.dumps(metadata, indent=4, separators=(',', ': '))\n",
    "\n",
    "with open('datapackage.json', 'w') as f:\n",
    "    f.write(datapackage_json)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  },
  "toc": {
   "toc_cell": true,
   "toc_number_sections": true,
   "toc_threshold": 6,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}