{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 8\n", "\n", "## Video 34: Hypothesis Testing I\n", "**Python for the Energy Industry**\n", "\n", "If we have a hypothesis about how two variables are related, we may wish to test it statistically using data from the SDK. In this lesson and the next, we will see how to do this. This first lesson focuses on the correlation between time series.\n", "\n", "[Here is a good example of these concepts applied.](https://github.com/VorTECHsa/python-sdk/blob/master/docs/examples/Crude_Floating_Storage.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# initial imports\n", "import pandas as pd\n", "import numpy as np\n", "from datetime import datetime\n", "from dateutil.relativedelta import relativedelta\n", "import vortexasdk as v\n", "\n", "# The cargo unit for the time series (barrels)\n", "TS_UNIT = 'b'\n", "\n", "# The granularity of the time series\n", "TS_FREQ = 'day'\n", "\n", "# datetimes to access last 7 weeks of data\n", "now = datetime.utcnow()\n", "seven_weeks_ago = now - relativedelta(weeks=7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will consider a fairly trivial example here to demonstrate the process: the correlation between crude exports out of Southeast Asia destined for China and crude imports into China that originated in Southeast Asia. These are clearly strongly correlated, with a time lag that depends on travel time.
Let's start by accessing these exports and imports datasets:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# look up the ID of the 'Crude' product and check the match is unique\n", "crude = [p.id for p in v.Products().search('crude').to_list() if p.name=='Crude']\n", "assert len(crude) == 1\n", "\n", "# look up the geography IDs used as origin and destination\n", "china = v.Geographies().search('China',exact_term_match=True)[0]['id']\n", "SEA = v.Geographies().search('Southeast Asia',exact_term_match=True)[0]['id']\n", "\n", "SEA_exports = v.CargoTimeSeries().search(\n", " timeseries_frequency=TS_FREQ,\n", " timeseries_unit=TS_UNIT,\n", " filter_time_min=seven_weeks_ago,\n", " filter_time_max=now,\n", " filter_activity=\"loading_end\",\n", " filter_products=crude,\n", " filter_origins=SEA,\n", " filter_destinations=china,\n", ").to_df()\n", "\n", "SEA_exports = SEA_exports.rename(columns={'key':'date','value':'SEA_exp'})[['date','SEA_exp']]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "china_imports = v.CargoTimeSeries().search(\n", " timeseries_frequency=TS_FREQ,\n", " timeseries_unit=TS_UNIT,\n", " filter_time_min=seven_weeks_ago,\n", " filter_time_max=now,\n", " filter_activity=\"unloading_start\",\n", " filter_products=crude,\n", " filter_origins=SEA,\n", " filter_destinations=china,\n", ").to_df()\n", "\n", "china_imports = china_imports.rename(columns={'key':'date','value':'china_imp'})[['date','china_imp']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combine the exports and imports data into one DataFrame:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
 | date | SEA_exp | china_imp |
---|---|---|---|
0 | 2020-10-23 00:00:00+00:00 | 1322832 | 2670087 |
1 | 2020-10-24 00:00:00+00:00 | 3562655 | 2940471 |
2 | 2020-10-25 00:00:00+00:00 | 4314260 | 2373007 |
3 | 2020-10-26 00:00:00+00:00 | 4277880 | 1440071 |
4 | 2020-10-27 00:00:00+00:00 | 3618404 | 1614060 |