{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis gasoline prices\n", "\n", "Based on data from https://www.e-control.at/ for district Mödling. \n", "\n", "Gatheres via https://api.e-control.at/sprit/1.0/doc/index.html?url=https://api.e-control.at/sprit/1.0/api-docs%3Fgroup%3Dpublic-api#/" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import altair as alt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Setting parameters" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "URL = 'https://raw.githubusercontent.com/Datenspieler/notebooks_for_blog/master/2020-treibstoff/treibstoff.csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting data from Github" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(URL)\n", "df.date = pd.to_datetime(df.date)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateidpricetype
72882020-07-21 00:20:022187120.999SUP
72892020-07-21 00:20:022245191.003SUP
72902020-07-21 00:20:0213547611.004SUP
72912020-07-21 00:20:026656141.013SUP
72922020-07-21 00:20:0213547991.013SUP
\n", "
" ], "text/plain": [ " date id price type\n", "7288 2020-07-21 00:20:02 218712 0.999 SUP\n", "7289 2020-07-21 00:20:02 224519 1.003 SUP\n", "7290 2020-07-21 00:20:02 1354761 1.004 SUP\n", "7291 2020-07-21 00:20:02 665614 1.013 SUP\n", "7292 2020-07-21 00:20:02 1354799 1.013 SUP" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 7293 entries, 0 to 7292\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 date 7293 non-null datetime64[ns]\n", " 1 id 7293 non-null int64 \n", " 2 price 7293 non-null float64 \n", " 3 type 7293 non-null object \n", "dtypes: datetime64[ns](1), float64(1), int64(1), object(1)\n", "memory usage: 228.0+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2020-06-23 22:20:02\n", "2020-07-21 00:20:02\n" ] } ], "source": [ "print(df.date.min())\n", "print(df.date.max())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Preparing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add price (min and mean) over time " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "df['min'] = df.groupby(['type', 'date'])[['price']].transform('min')\n", "df['mean'] = df.groupby(['type', 'date'])[['price']].transform('mean')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dropping duplicate data\n", "\n", "Since for this analysis only interested in minimum/mean price, not what gas station is offering that price, can drop a lot of data." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "df = df[['date', 'type', 'min', 'mean']].drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datetypeminmean
02020-06-23 22:20:02DIE0.9490.9546
52020-06-23 22:20:02SUP0.9991.0046
102020-06-23 23:20:01DIE0.9490.9546
\n", "
" ], "text/plain": [ " date type min mean\n", "0 2020-06-23 22:20:02 DIE 0.949 0.9546\n", "5 2020-06-23 22:20:02 SUP 0.999 1.0046\n", "10 2020-06-23 23:20:01 DIE 0.949 0.9546" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# First look at the data " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.RepeatChart(...)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chart_zero = alt.Chart(df).mark_line().encode(\n", " x='date',\n", " y=alt.Y(alt.repeat(\"column\"), type='quantitative'),\n", " color='type',\n", " strokeDash='type',\n", ").repeat(\n", " column=['min', 'mean']\n", ")\n", "\n", "chart_zero" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some movement, escepcially if we look at the mean price. We can make it look more spectacular, if we don't let the price start at 0 - great trick and bad visualization practive at the same time 😉" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.RepeatChart(...)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chart_not_zero = alt.Chart(df).mark_line().encode(\n", " x='date',\n", " y=alt.Y(alt.repeat(\"column\"), type='quantitative', scale=alt.Scale(zero=False)),\n", " color='type',\n", " strokeDash='type',\n", ").repeat(\n", " column=['min', 'mean']\n", ")\n", "\n", "chart_not_zero" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis on day of week and hour of day" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add rolling average " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To 'correct' for long term trend, let's take the 1 week (24 x 7 measurements) average." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DIE_minSUP_minDIE_meanSUP_mean
date
2020-06-23 22:20:020.9490.9990.9546001.004600
2020-06-23 23:20:010.9490.9990.9546001.004600
2020-06-24 00:20:020.9490.9990.9588571.008857
2020-06-24 01:20:020.9490.9990.9588571.008857
2020-06-24 02:20:010.9490.9990.9588571.008857
\n", "
" ], "text/plain": [ " DIE_min SUP_min DIE_mean SUP_mean\n", "date \n", "2020-06-23 22:20:02 0.949 0.999 0.954600 1.004600\n", "2020-06-23 23:20:01 0.949 0.999 0.954600 1.004600\n", "2020-06-24 00:20:02 0.949 0.999 0.958857 1.008857\n", "2020-06-24 01:20:02 0.949 0.999 0.958857 1.008857\n", "2020-06-24 02:20:01 0.949 0.999 0.958857 1.008857" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = df.set_index(['type', 'date']).unstack('type')\n", "df1.columns = ['_'.join(col[::-1]).strip() for col in df1.columns.values]\n", "df1.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Look at difference from average, switch to cent values " ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DIE_minSUP_minDIE_meanSUP_mean
date
2020-06-23 22:20:02NaNNaNNaNNaN
2020-06-23 23:20:01NaNNaNNaNNaN
2020-06-24 00:20:02NaNNaNNaNNaN
2020-06-24 01:20:02NaNNaNNaNNaN
2020-06-24 02:20:01NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " DIE_min SUP_min DIE_mean SUP_mean\n", "date \n", "2020-06-23 22:20:02 NaN NaN NaN NaN\n", "2020-06-23 23:20:01 NaN NaN NaN NaN\n", "2020-06-24 00:20:02 NaN NaN NaN NaN\n", "2020-06-24 01:20:02 NaN NaN NaN NaN\n", "2020-06-24 02:20:01 NaN NaN NaN NaN" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_delta = df1 - df1.rolling(7*24, center = True).mean()\n", "df_delta = df_delta * 100 #Let's switch to Cent values\n", "df_delta.head() # of course the first entries are NaN" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DIE_minSUP_minDIE_meanSUP_mean
date
2020-07-20 20:20:01NaNNaNNaNNaN
2020-07-20 21:20:02NaNNaNNaNNaN
2020-07-20 22:20:01NaNNaNNaNNaN
2020-07-20 23:20:02NaNNaNNaNNaN
2020-07-21 00:20:02NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " DIE_min SUP_min DIE_mean SUP_mean\n", "date \n", "2020-07-20 20:20:01 NaN NaN NaN NaN\n", "2020-07-20 21:20:02 NaN NaN NaN NaN\n", "2020-07-20 22:20:01 NaN NaN NaN NaN\n", "2020-07-20 23:20:02 NaN NaN NaN NaN\n", "2020-07-21 00:20:02 NaN NaN NaN NaN" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_delta.tail()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DIE_minSUP_minDIE_meanSUP_mean
count484.000000484.000000484.000000484.000000
mean-0.006018-0.055542-0.034468-0.071316
std0.5239490.6312280.8243001.033422
min-1.725000-1.467262-2.210951-2.151137
25%-0.088690-0.322917-0.418081-0.688466
50%0.0809520.155357-0.061518-0.036823
75%0.2000000.3562500.2802300.390915
max0.9892860.9113102.8858193.697436
\n", "
" ], "text/plain": [ " DIE_min SUP_min DIE_mean SUP_mean\n", "count 484.000000 484.000000 484.000000 484.000000\n", "mean -0.006018 -0.055542 -0.034468 -0.071316\n", "std 0.523949 0.631228 0.824300 1.033422\n", "min -1.725000 -1.467262 -2.210951 -2.151137\n", "25% -0.088690 -0.322917 -0.418081 -0.688466\n", "50% 0.080952 0.155357 -0.061518 -0.036823\n", "75% 0.200000 0.356250 0.280230 0.390915\n", "max 0.989286 0.911310 2.885819 3.697436" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_delta.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare data for plots" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "df_delta = df_delta.reset_index()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DIE_minSUP_minDIE_meanSUP_mean
Day of weekHour
Friday00.0061510.3750000.0635810.399897
1-0.0011900.3730160.0597260.398581
2-0.0085320.3710320.0535420.397589
3-0.0158730.3690480.0469160.396549
4-0.0232140.3670630.0396930.392836
..................
Wednesday190.2019840.295040-0.018503-0.056196
200.1958330.2926590.023958-0.058444
210.1896830.2902780.0308620.063117
220.1835320.287698-0.0377880.366133
230.1773810.285119-0.0442040.363513
\n", "

168 rows × 4 columns

\n", "
" ], "text/plain": [ " DIE_min SUP_min DIE_mean SUP_mean\n", "Day of week Hour \n", "Friday 0 0.006151 0.375000 0.063581 0.399897\n", " 1 -0.001190 0.373016 0.059726 0.398581\n", " 2 -0.008532 0.371032 0.053542 0.397589\n", " 3 -0.015873 0.369048 0.046916 0.396549\n", " 4 -0.023214 0.367063 0.039693 0.392836\n", "... ... ... ... ...\n", "Wednesday 19 0.201984 0.295040 -0.018503 -0.056196\n", " 20 0.195833 0.292659 0.023958 -0.058444\n", " 21 0.189683 0.290278 0.030862 0.063117\n", " 22 0.183532 0.287698 -0.037788 0.366133\n", " 23 0.177381 0.285119 -0.044204 0.363513\n", "\n", "[168 rows x 4 columns]" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_chart = df_delta.groupby([df_delta.date.dt.day_name(), df_delta.date.dt.hour]).mean()#.stack()\n", "df_chart.index.names = ['Day of week', 'Hour']\n", "df_chart" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
HourDIE_minSUP_minDIE_meanSUP_mean
count168.000000168.000000168.000000168.000000168.000000
mean11.5000000.014533-0.035066-0.018058-0.054979
std6.9428810.4627450.5425030.7473980.945292
min0.000000-1.290079-1.226190-1.605849-1.867506
25%5.750000-0.003026-0.367857-0.307451-0.750507
50%11.5000000.1296630.173810-0.0712180.063262
75%17.2500000.2890870.3596230.3113330.393182
max23.0000000.8666670.5071432.3266242.936946
\n", "
" ], "text/plain": [ " Hour DIE_min SUP_min DIE_mean SUP_mean\n", "count 168.000000 168.000000 168.000000 168.000000 168.000000\n", "mean 11.500000 0.014533 -0.035066 -0.018058 -0.054979\n", "std 6.942881 0.462745 0.542503 0.747398 0.945292\n", "min 0.000000 -1.290079 -1.226190 -1.605849 -1.867506\n", "25% 5.750000 -0.003026 -0.367857 -0.307451 -0.750507\n", "50% 11.500000 0.129663 0.173810 -0.071218 0.063262\n", "75% 17.250000 0.289087 0.359623 0.311333 0.393182\n", "max 23.000000 0.866667 0.507143 2.326624 2.936946" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_chart = df_chart.reset_index()\n", "df_chart.describe()" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hour 23.000000\n", "DIE_min 2.156746\n", "SUP_min 1.733333\n", "DIE_mean 3.932473\n", "SUP_mean 4.804451\n", "dtype: float64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_chart.select_dtypes('number').max() - df_chart.select_dtypes('number').min()" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for (col, title_part) in [('DIE_min','diesel, cheapest gas station') , \n", " ('DIE_mean', 'diesel, mean of 5 cheapest gas stations'),\n", " ('SUP_min','premium gas, cheapest gas station') , \n", " ('SUP_mean', 'premium gas, mean of 5 cheapest gas stations')]:\n", " chart = alt.Chart(df_chart).mark_rect().encode(\n", " x='Hour:O',\n", " y=alt.Y('Day of week:O', sort=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']),\n", " color=alt.Color(col, type='quantitative', title='Cent', \n", " scale=alt.Scale(scheme='redyellowgreen', domainMid = 0, domain = [-2, 2], reverse=True))\n", " ).properties(\n", " title='Difference in gasoline price - ' + title_part\n", " )\n", " chart.save(col + '.svg')\n", " display(chart)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }