{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Import temperature data from the DWD and process it\n", "\n", "This notebook pulls historical temperature data from the DWD server and formats it for future use in other projects. The data is delivered in a hourly frequencs in a .zip file for each of the available weather stations. To use the data, we need everythin in a single .csv-file, all stations side-by-side. Also, we need the daily average.\n", "\n", "To reduce computing time, we also crop all data earlier than 2007. \n", "\n", "Files should be executed in the following pipeline:\n", "* 1-dwd_konverter_download\n", "* 2-dwd_konverter_extract\n", "* 3-dwd_konverter_build_df\n", "* 4-dwd_konverter_final_processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.) Import the .csv files into pandas and concat into a single df\n", "Now we need to import everything that we have extracted. This operation is going to take some time (aprox 20 mins). If you want to save time, you can just delete a few of the .csv-files in the 'import' folder. The script works as well with only a few files. \n", "\n", "### Process individual files\n", "The files are imported into a single df, stripped of unnecessary columns and filtered by date. Then we set a DateTimeIndex and concatenate them into the main_df. Because the loop takes a long time, we output some status messages, to ensure the process is still running. \n", "### Process the concatenated main_df\n", "Then we display some infos of the main_df so we can ensure that there are no errors, mainly to ensure all data-types are recognized correctly. Also, we drop duplicate entries, in case some of the .csv files were copied.\n", "### Unstack and export\n", "For the final step, we unstack the main_df and save it to a .csv and a .pkl file for the next step. Also, we display some output to get a grasp of what is going on. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "'Finished file: import/produkt_tu_stunde_20041101_20191231_00078.txt'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'This is file 10'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Shape of the main_df is: (771356, 1)'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "float 771356\n", "Name: TT_TU, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Shape of the main_df is: (113952, 9)'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
{ "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": false }, "outputs": [
{ "data": { "text/plain": [ "'Finished file: import/produkt_tu_stunde_20041101_20191231_00078.txt'" ] }, "metadata": {}, "output_type": "display_data" },
{ "data": { "text/plain": [ "'This is file 10'" ] }, "metadata": {}, "output_type": "display_data" },
{ "data": { "text/plain": [ "'Shape of the main_df is: (771356, 1)'" ] }, "metadata": {}, "output_type": "display_data" },
{ "data": { "text/plain": [ "float    771356\n", "Name: TT_TU, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" },
{ "data": { "text/plain": [ "'Shape of the main_df is: (113952, 9)'" ] }, "metadata": {}, "output_type": "display_data" },
{ "data": { "text/plain": [ "                    TT_TU                                            \n", "STATIONS_ID           3    44   71   73    78    91   96   102  125\n", "MESS_DATUM                                                          \n", "2007-01-01 00:00:00  11.4  NaN  NaN  NaN  11.0   9.4  NaN   9.7  NaN\n", "2007-01-01 01:00:00  12.0  NaN  NaN  NaN  11.4   9.6  NaN  10.4  NaN\n", "2007-01-01 02:00:00  12.3  NaN  NaN  NaN   9.4  10.0  NaN   9.9  NaN\n", "2007-01-01 03:00:00  11.5  NaN  NaN  NaN   9.3   9.7  NaN   9.5  NaN\n", "2007-01-01 04:00:00   9.6  NaN  NaN  NaN   8.6  10.2  NaN   8.9  NaN" ] }, "metadata": {}, "output_type": "display_data" },
{ "data": { "text/plain": [
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TT_TU
STATIONS_ID3447173789196102125
count37224.000000111003.00000088391.000000111471.000000113950.000000113950.0000006399.000000106379.00000082589.000000
mean10.1039229.9332138.3997647.5014869.8722689.19986912.73025510.1499911.045942
std7.20000114.4459738.77976647.5371127.2812158.40071323.18955510.72803086.520406
min-13.600000-999.000000-999.000000-999.000000-16.200000-999.000000-999.000000-999.000000-999.000000
25%5.0000004.9000002.2000002.8000004.7000003.4000007.2500005.7000001.800000
50%9.90000010.0000008.3000009.3000009.7000008.90000013.20000010.2000008.200000
75%15.30000015.20000014.20000015.80000015.00000014.70000018.50000015.20000014.500000
max36.20000037.00000033.70000036.70000039.00000036.90000037.90000033.40000033.700000
\n", "
" ], "text/plain": [ " TT_TU \\\n", "STATIONS_ID 3 44 71 73 \n", "count 37224.000000 111003.000000 88391.000000 111471.000000 \n", "mean 10.103922 9.933213 8.399764 7.501486 \n", "std 7.200001 14.445973 8.779766 47.537112 \n", "min -13.600000 -999.000000 -999.000000 -999.000000 \n", "25% 5.000000 4.900000 2.200000 2.800000 \n", "50% 9.900000 10.000000 8.300000 9.300000 \n", "75% 15.300000 15.200000 14.200000 15.800000 \n", "max 36.200000 37.000000 33.700000 36.700000 \n", "\n", " \\\n", "STATIONS_ID 78 91 96 102 \n", "count 113950.000000 113950.000000 6399.000000 106379.000000 \n", "mean 9.872268 9.199869 12.730255 10.149991 \n", "std 7.281215 8.400713 23.189555 10.728030 \n", "min -16.200000 -999.000000 -999.000000 -999.000000 \n", "25% 4.700000 3.400000 7.250000 5.700000 \n", "50% 9.700000 8.900000 13.200000 10.200000 \n", "75% 15.000000 14.700000 18.500000 15.200000 \n", "max 39.000000 36.900000 37.900000 33.400000 \n", "\n", " \n", "STATIONS_ID 125 \n", "count 82589.000000 \n", "mean 1.045942 \n", "std 86.520406 \n", "min -999.000000 \n", "25% 1.800000 \n", "50% 8.200000 \n", "75% 14.500000 \n", "max 33.700000 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from IPython.display import clear_output\n", "\n", "from pathlib import Path\n", "import glob\n", "\n", "\n", "import_files = glob.glob('import/*')\n", "out_file = Path.cwd() / \"export_uncleaned\" / \"to_clean\"\n", "#msum_file= Path.cwd() / \"export\" / \"monatssumme.csv\"\n", "\n", "obsolete_columns = [\n", " 'QN_9',\n", " 'RF_TU',\n", " 'eor'\n", "]\n", "\n", "main_df = pd.DataFrame()\n", "i = 1\n", "\n", "for file in import_files:\n", "\n", " # Read in the next file\n", " df = pd.read_csv(file, delimiter=\";\")\n", " # Prepare the df befor merging (Drop obsolete, convert to datetime, filter to date, set index)\n", " df.drop(columns=obsolete_columns, inplace=True)\n", " df[\"MESS_DATUM\"] = pd.to_datetime(df[\"MESS_DATUM\"], format=\"%Y%m%d%H\")\n", " df = df[df['MESS_DATUM']>= \"2007-01-01\"]\n", " df.set_index(['MESS_DATUM', 'STATIONS_ID'], inplace=True)\n", " \n", " # Merge to the main_df\n", " main_df = pd.concat([main_df, df])\n", " \n", " # Display some status messages\n", " clear_output(wait=True)\n", " display('Finished file: {}'.format(file), 'This is file {}'.format(i))\n", " display('Shape of the main_df is: {}'.format(main_df.shape))\n", " i+=1\n", "\n", "# Check if all types are correct\n", "display(main_df['TT_TU'].apply(lambda x: type(x).__name__).value_counts())\n", " \n", "# Make sure that to files or observations a duplicates, eg. scan the index for duplicate entries.\n", "# The ~ is a bitwise operation, meaning it flips all bits. \n", "main_df = main_df[~main_df.index.duplicated(keep='last')]\n", "\n", "\n", "# Unstack the main_df\n", "main_df = main_df.unstack('STATIONS_ID')\n", "display('Shape of the main_df is: {}'.format(main_df.shape))\n", "\n", "# Save main_df to a .csv file and a pickle to continue working in the next cell. 
\n", "main_df.to_pickle(Path(out_file).with_suffix('.pkl'))\n", "main_df.to_csv(Path(out_file).with_suffix('.csv'), sep=\";\")\n", "\n", "display(main_df.head())\n", "display(main_df.describe())\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }