{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Import temperature data from the DWD and process it\n",
    "\n",
    "This notebook pulls historical temperature data from the DWD server and formats it for future use in other projects. The data is delivered in a hourly frequencs in a .zip file for each of the available weather stations. To use the data, we need everythin in a single .csv-file, all stations side-by-side. Also, we need the daily average.\n",
    "\n",
    "To reduce computing time, we also crop all data earlier than 2007. \n",
    "\n",
    "Files should be executed in the following pipeline:\n",
    "* 1-dwd_konverter_download\n",
    "* 2-dwd_konverter_extract\n",
    "* 3-dwd_konverter_build_df\n",
    "* 4-dwd_konverter_final_processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.) Import the .csv files into pandas and concat into a single df\n",
    "Now we need to import everything that we have extracted. This operation is going to take some time (aprox 20 mins). If you want to save time, you can just delete a few of the .csv-files in the 'import' folder. The script works as well with only a few files. \n",
    "\n",
    "### Process individual files\n",
    "The files are imported into a single df, stripped of unnecessary columns and filtered by date. Then we set a DateTimeIndex and concatenate them into the main_df. Because the loop takes a long time, we output some status messages, to ensure the process is still running. \n",
    "### Process the concatenated main_df\n",
    "Then we display some infos of the main_df so we can ensure that there are no errors, mainly to ensure all data-types are recognized correctly. Also, we drop duplicate entries, in case some of the .csv files were copied.\n",
    "### Unstack and export\n",
    "For the final step, we unstack the main_df and save it to a .csv and a .pkl file for the next step. Also, we display some output to get a grasp of what is going on. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Finished file: import/produkt_tu_stunde_20041101_20191231_00078.txt'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'This is file 10'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'Shape of the main_df is: (771356, 1)'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "float    771356\n",
       "Name: TT_TU, dtype: int64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'Shape of the main_df is: (113952, 9)'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"9\" halign=\"left\">TT_TU</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>STATIONS_ID</th>\n",
       "      <th>3</th>\n",
       "      <th>44</th>\n",
       "      <th>71</th>\n",
       "      <th>73</th>\n",
       "      <th>78</th>\n",
       "      <th>91</th>\n",
       "      <th>96</th>\n",
       "      <th>102</th>\n",
       "      <th>125</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MESS_DATUM</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2007-01-01 00:00:00</th>\n",
       "      <td>11.4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>11.0</td>\n",
       "      <td>9.4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9.7</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007-01-01 01:00:00</th>\n",
       "      <td>12.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>11.4</td>\n",
       "      <td>9.6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.4</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007-01-01 02:00:00</th>\n",
       "      <td>12.3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9.4</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9.9</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007-01-01 03:00:00</th>\n",
       "      <td>11.5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9.3</td>\n",
       "      <td>9.7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9.5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007-01-01 04:00:00</th>\n",
       "      <td>9.6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8.6</td>\n",
       "      <td>10.2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8.9</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    TT_TU                                      \n",
       "STATIONS_ID           3   44  71  73    78    91  96    102 125\n",
       "MESS_DATUM                                                     \n",
       "2007-01-01 00:00:00  11.4 NaN NaN NaN  11.0   9.4 NaN   9.7 NaN\n",
       "2007-01-01 01:00:00  12.0 NaN NaN NaN  11.4   9.6 NaN  10.4 NaN\n",
       "2007-01-01 02:00:00  12.3 NaN NaN NaN   9.4  10.0 NaN   9.9 NaN\n",
       "2007-01-01 03:00:00  11.5 NaN NaN NaN   9.3   9.7 NaN   9.5 NaN\n",
       "2007-01-01 04:00:00   9.6 NaN NaN NaN   8.6  10.2 NaN   8.9 NaN"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"9\" halign=\"left\">TT_TU</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>STATIONS_ID</th>\n",
       "      <th>3</th>\n",
       "      <th>44</th>\n",
       "      <th>71</th>\n",
       "      <th>73</th>\n",
       "      <th>78</th>\n",
       "      <th>91</th>\n",
       "      <th>96</th>\n",
       "      <th>102</th>\n",
       "      <th>125</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>37224.000000</td>\n",
       "      <td>111003.000000</td>\n",
       "      <td>88391.000000</td>\n",
       "      <td>111471.000000</td>\n",
       "      <td>113950.000000</td>\n",
       "      <td>113950.000000</td>\n",
       "      <td>6399.000000</td>\n",
       "      <td>106379.000000</td>\n",
       "      <td>82589.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>10.103922</td>\n",
       "      <td>9.933213</td>\n",
       "      <td>8.399764</td>\n",
       "      <td>7.501486</td>\n",
       "      <td>9.872268</td>\n",
       "      <td>9.199869</td>\n",
       "      <td>12.730255</td>\n",
       "      <td>10.149991</td>\n",
       "      <td>1.045942</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>7.200001</td>\n",
       "      <td>14.445973</td>\n",
       "      <td>8.779766</td>\n",
       "      <td>47.537112</td>\n",
       "      <td>7.281215</td>\n",
       "      <td>8.400713</td>\n",
       "      <td>23.189555</td>\n",
       "      <td>10.728030</td>\n",
       "      <td>86.520406</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-13.600000</td>\n",
       "      <td>-999.000000</td>\n",
       "      <td>-999.000000</td>\n",
       "      <td>-999.000000</td>\n",
       "      <td>-16.200000</td>\n",
       "      <td>-999.000000</td>\n",
       "      <td>-999.000000</td>\n",
       "      <td>-999.000000</td>\n",
       "      <td>-999.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>5.000000</td>\n",
       "      <td>4.900000</td>\n",
       "      <td>2.200000</td>\n",
       "      <td>2.800000</td>\n",
       "      <td>4.700000</td>\n",
       "      <td>3.400000</td>\n",
       "      <td>7.250000</td>\n",
       "      <td>5.700000</td>\n",
       "      <td>1.800000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>9.900000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>8.300000</td>\n",
       "      <td>9.300000</td>\n",
       "      <td>9.700000</td>\n",
       "      <td>8.900000</td>\n",
       "      <td>13.200000</td>\n",
       "      <td>10.200000</td>\n",
       "      <td>8.200000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>15.300000</td>\n",
       "      <td>15.200000</td>\n",
       "      <td>14.200000</td>\n",
       "      <td>15.800000</td>\n",
       "      <td>15.000000</td>\n",
       "      <td>14.700000</td>\n",
       "      <td>18.500000</td>\n",
       "      <td>15.200000</td>\n",
       "      <td>14.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>36.200000</td>\n",
       "      <td>37.000000</td>\n",
       "      <td>33.700000</td>\n",
       "      <td>36.700000</td>\n",
       "      <td>39.000000</td>\n",
       "      <td>36.900000</td>\n",
       "      <td>37.900000</td>\n",
       "      <td>33.400000</td>\n",
       "      <td>33.700000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    TT_TU                                              \\\n",
       "STATIONS_ID           3              44            71             73    \n",
       "count        37224.000000  111003.000000  88391.000000  111471.000000   \n",
       "mean            10.103922       9.933213      8.399764       7.501486   \n",
       "std              7.200001      14.445973      8.779766      47.537112   \n",
       "min            -13.600000    -999.000000   -999.000000    -999.000000   \n",
       "25%              5.000000       4.900000      2.200000       2.800000   \n",
       "50%              9.900000      10.000000      8.300000       9.300000   \n",
       "75%             15.300000      15.200000     14.200000      15.800000   \n",
       "max             36.200000      37.000000     33.700000      36.700000   \n",
       "\n",
       "                                                                       \\\n",
       "STATIONS_ID            78             91           96             102   \n",
       "count        113950.000000  113950.000000  6399.000000  106379.000000   \n",
       "mean              9.872268       9.199869    12.730255      10.149991   \n",
       "std               7.281215       8.400713    23.189555      10.728030   \n",
       "min             -16.200000    -999.000000  -999.000000    -999.000000   \n",
       "25%               4.700000       3.400000     7.250000       5.700000   \n",
       "50%               9.700000       8.900000    13.200000      10.200000   \n",
       "75%              15.000000      14.700000    18.500000      15.200000   \n",
       "max              39.000000      36.900000    37.900000      33.400000   \n",
       "\n",
       "                           \n",
       "STATIONS_ID           125  \n",
       "count        82589.000000  \n",
       "mean             1.045942  \n",
       "std             86.520406  \n",
       "min           -999.000000  \n",
       "25%              1.800000  \n",
       "50%              8.200000  \n",
       "75%             14.500000  \n",
       "max             33.700000  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from IPython.display import clear_output\n",
    "\n",
    "from pathlib import Path\n",
    "import glob\n",
    "\n",
    "\n",
    "import_files = glob.glob('import/*')\n",
    "out_file = Path.cwd() / \"export_uncleaned\" / \"to_clean\"\n",
    "#msum_file=  Path.cwd() / \"export\" / \"monatssumme.csv\"\n",
    "\n",
    "obsolete_columns = [\n",
    "    'QN_9',\n",
    "    'RF_TU',\n",
    "    'eor'\n",
    "]\n",
    "\n",
    "main_df = pd.DataFrame()\n",
    "i = 1\n",
    "\n",
    "for file in import_files:\n",
    "\n",
    "    # Read in the next file\n",
    "    df = pd.read_csv(file, delimiter=\";\")\n",
    "    # Prepare the df befor merging (Drop obsolete, convert to datetime, filter to date, set index)\n",
    "    df.drop(columns=obsolete_columns, inplace=True)\n",
    "    df[\"MESS_DATUM\"] = pd.to_datetime(df[\"MESS_DATUM\"], format=\"%Y%m%d%H\")\n",
    "    df = df[df['MESS_DATUM']>= \"2007-01-01\"]\n",
    "    df.set_index(['MESS_DATUM', 'STATIONS_ID'], inplace=True)\n",
    "    \n",
    "    # Merge to the main_df\n",
    "    main_df = pd.concat([main_df, df])\n",
    "    \n",
    "    # Display some status messages\n",
    "    clear_output(wait=True)\n",
    "    display('Finished file: {}'.format(file), 'This is file {}'.format(i))\n",
    "    display('Shape of the main_df is: {}'.format(main_df.shape))\n",
    "    i+=1\n",
    "\n",
    "# Check if all types are correct\n",
    "display(main_df['TT_TU'].apply(lambda x: type(x).__name__).value_counts())\n",
    "    \n",
    "# Make sure that to files or observations a duplicates, eg. scan the index for duplicate entries.\n",
    "# The ~ is a bitwise operation, meaning it flips all bits. \n",
    "main_df = main_df[~main_df.index.duplicated(keep='last')]\n",
    "\n",
    "\n",
    "# Unstack the main_df\n",
    "main_df = main_df.unstack('STATIONS_ID')\n",
    "display('Shape of the main_df is: {}'.format(main_df.shape))\n",
    "\n",
    "# Save main_df to a .csv file and a pickle to continue working in the next cell. \n",
    "main_df.to_pickle(Path(out_file).with_suffix('.pkl'))\n",
    "main_df.to_csv(Path(out_file).with_suffix('.csv'), sep=\";\")\n",
    "\n",
    "display(main_df.head())\n",
    "display(main_df.describe())\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}