{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Downloading and loading benchmarking datasets\n",
    "\n",
    "It is common to use standard collections of data to compare different estimators for\n",
    "classification, clustering, regression and forecasting. Some of these datasets are\n",
    "shipped with aeon in the datasets/data directory. However, the files are far too\n",
    "big to include them all. aeon provides tools to download these data to use in benchmarking experiments.\n",
    "Classification and regression data are stored in .ts format. Forecasting\n",
    "data are stored in the equivalent .tsf format. See the [data loading notebook](data_loading.ipynb) for more info.\n",
    "\n",
    "Classification and regression are loaded into 3D numpy arrays of shape `(n_cases, n_channels, n_timepoints)` if equal length\n",
    "or a list of `[n_cases]` of 2D numpy if `n_timepoints` is different for different\n",
    "cases. Forecasting data are loaded into pd.DataFrame. For more information on\n",
    "aeon data types see the [data structures notebook](data_structures.ipynb).\n",
    "\n",
    "Note that this notebook is dependent on external websites, so will not function if\n",
    "you are not online or the associated website is down. We use the following three\n",
    "functions"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "outputs": [],
   "source": [
    "from aeon.datasets import load_classification, load_forecasting, load_regression"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Time Series Classification Archive\n",
    "\n",
    "[UCR/TSML Time Series Classification Archive](https://timeseriesclassification.com)\n",
    "hosts the UCR univariate TSC archive [1], also available from [UCR](https://www.cs.ucr.edu/~eamonn/time_series_data_2018/) and\n",
    "the multivariate archive [2] (previously called the UEA archive, soon to change). We\n",
    "provide seven of these in the datasets/data directort: ACSF1, ArrowHead, BasicMotions,\n",
    "GunPoint, ItalyPowerDemand, JapaneseVowels and PLAID. The archive is much bigger. The\n",
    " last batch release was for 128 univariate [1] and 33 multivariate [2]. If you just\n",
    " want to download them all, please go to the [website](https://timeseriesclassification.com)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Univariate length =  128\n",
      "Multivariate length =  30\n"
     ]
    }
   ],
   "source": [
    "from aeon.datasets.tsc_data_lists import multivariate, univariate\n",
    "\n",
    "# This file also contains sub lists by type, e.g. unequal length\n",
    "print(\"Univariate length = \", len(univariate))\n",
    "print(\"Multivariate length = \", len(multivariate))"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "A default train and test split is provided for this data. The file structure for a\n",
    "problem such as Chinatown is\n",
    "\n",
    "        <extract_path>/Chinatown/Chinatown_TRAIN.ts\n",
    "        <extract_path>/Chinatown/Chinatown_TEST.ts\n",
    "\n",
    "You can load these problems directly from TSC.com and load them into memory. These\n",
    "functions can return associated metadata in addition to the data. This usage combines\n",
    " the train and test splits and loads them into one `X` and one `y` array."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape of X =  (363, 1, 24)\n",
      "First case =  [ 573.  375.  301.  212.   55.   34.   25.   33.  113.  143.  303.  615.\n",
      " 1226. 1281. 1221. 1081.  866. 1096. 1039.  975.  746.  581.  409.  182.]  has label =  1\n",
      "\n",
      "Meta data =  {'problemname': 'chinatown', 'timestamps': False, 'missing': False, 'univariate': True, 'equallength': True, 'classlabel': True, 'targetlabel': False, 'class_values': ['1', '2']}\n"
     ]
    }
   ],
   "source": [
    "X, y, meta = load_classification(\"Chinatown\", return_metadata=True)\n",
    "print(\"Shape of X = \", X.shape)\n",
    "print(\"First case = \", X[0][0], \" has label = \", y[0])\n",
    "print(\"\\nMeta data = \", meta)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "If you look in aeon/datasets you should see a directory called `local_data`\n",
    "containing the Chinatown datasets. All of the zips have `.ts` files. Some also have\n",
    "`.arff` and `.txt` files. File structure looks something like this:\n",
    "\n",
    "<img src=\"img/download2.png\" width=\"300\" alt=\"time series classification\">\n",
    "\n",
    "Within each folder are the data in text files formatted as .ts files (see the [data\n",
    "loading notebook](data_loading.ipynb) for file format description). They may also be\n",
    "available in .arff format and .txt format.\n",
    "\n",
    "<img src=\"img/download1.png\" width=\"600\" alt=\"time series classification\">\n",
    "\n",
    "If you load again with the same extract path it will not download again if the file is\n",
    "already there. If you want to store data somewhere else, you can specify a file path.\n",
    " Also, you can load the train and test separately. This code will download the data\n",
    " to Temp once, and load into separate train/test splits. The split argument is not\n",
    " case sensitive. Once downloaded, `load_classification` is a equivalent to a call to\n",
    " `load_from_tsfile`"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train shape =  (20, 1, 512)\n",
      "Test shape =  (20, 1, 512)\n",
      "Loaded directly shape =  (20, 1, 512)\n"
     ]
    },
    {
     "data": {
      "text/plain": "array([1.7400873, 1.7331051, 1.7091917, 1.6333304, 1.5405759])"
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train, y_train = load_classification(\n",
    "    \"BeetleFly\", extract_path=\"./Temp/\", split=\"TRAIN\"\n",
    ")\n",
    "X_test, y_test = load_classification(\"BeetleFly\", extract_path=\"./Temp/\", split=\"test\")\n",
    "print(\"Train shape = \", X_train.shape)\n",
    "print(\"Test shape = \", X_test.shape)\n",
    "from aeon.datasets import load_from_tsfile\n",
    "\n",
    "X_train, y_train = load_from_tsfile(\n",
    "    full_file_path_and_name=\"./Temp/BeetleFly/BeetleFLY_TRAIN\"\n",
    ")\n",
    "print(\"Loaded directly shape = \", X_train.shape)\n",
    "\n",
    "X_test[0][0][:5]"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Time Series (Extrinsic) Regression\n",
    "\n",
    "[The Monash Time Series Extrinsic Regression Archive]() [3] repo (called extrinsic to\n",
    " diffentiate if from sliding window based regression) currently contains 19\n",
    " regression problems in .ts format. One of these, Covid3Month, is in `datasets\\data`.\n",
    "  We have recently expanded this repo to include 63 problems in .ts format.\n",
    "  The usage of `load_regression` is identical to `load_classification`\n"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "from aeon.datasets.dataset_collections import get_available_tser_datasets\n",
    "\n",
    "get_available_tser_datasets()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "is_executing": true
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape of X =  (673, 1, 266)  meta data =  {'problemname': 'floodmodeling1', 'timestamps': False, 'missing': False, 'univariate': True, 'equallength': True, 'classlabel': False, 'targetlabel': True, 'class_values': []}\n"
     ]
    }
   ],
   "source": [
    "X, y, meta = load_regression(\"FloodModeling1\", return_metadata=True)\n",
    "print(\"Shape of X = \", X.shape, \" meta data = \", meta)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Time Series Forecasting\n",
    "\n",
    "The [Monash time series forecasting](https://forecastingdata.org/) repo contains a\n",
    "large number of forecasting data, including competition data such as M1, M3 and M4.\n",
    "Usage is the same as the other problems, although there is no provided train/test\n",
    "splits.\n"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "from aeon.datasets.dataset_collections import get_available_tsf_datasets\n",
    "\n",
    "get_available_tsf_datasets()"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(23000, 3)\n",
      "{'frequency': 'yearly', 'forecast_horizon': 6, 'contain_missing_values': False, 'contain_equal_length': False}\n",
      "  series_name     start_timestamp  \\\n",
      "0          T1 1979-01-01 12:00:00   \n",
      "1          T2 1979-01-01 12:00:00   \n",
      "2          T3 1979-01-01 12:00:00   \n",
      "3          T4 1979-01-01 12:00:00   \n",
      "4          T5 1979-01-01 12:00:00   \n",
      "\n",
      "                                        series_value  \n",
      "0  [5172.1, 5133.5, 5186.9, 5084.6, 5182.0, 5414....  \n",
      "1  [2070.0, 2104.0, 2394.0, 1651.0, 1492.0, 1348....  \n",
      "2  [2760.0, 2980.0, 3200.0, 3450.0, 3670.0, 3850....  \n",
      "3  [3380.0, 3670.0, 3960.0, 4190.0, 4440.0, 4700....  \n",
      "4  [1980.0, 2030.0, 2220.0, 2530.0, 2610.0, 2720....  \n"
     ]
    }
   ],
   "source": [
    "X, metadata = load_forecasting(\"m4_yearly_dataset\", return_metadata=True)\n",
    "print(X.shape)\n",
    "print(metadata)\n",
    "data = X.head()\n",
    "print(data)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## References\n",
    "[1] Dau et. al, The UCR time series archive, IEEE/CAA Journal of Automatica Sinica, 2019\n",
    "[2] Ruiz et. al, The great multivariate time series classification bake off: a review\n",
    " and experimental evaluation of recent algorithmic advances, Data Mining and\n",
    " Knowledge Discovery 35(2), 2021\n",
    "[3] Tan et. al, Time Series Extrinsic Regression, Data Mining and Knowledge\n",
    "Discovery, 2021\n",
    "[4] Godahewa et. al, Monash Time Series Forecasting Archive,Neural Information\n",
    "Processing Systems Track on Datasets and Benchmarks, 2021"
   ],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}