{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "allWIe5kwPcS" }, "source": [ "# Time Series Datasets\n", "\n", "This notebook shows how to create a time series dataset from some csv file in order to then share it on the [🤗 hub](https://huggingface.co/docs/datasets/index). We will use the GluonTS library to read the csv into the appropriate format. We start by installing the libraries" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "id": "4XnNcdWbwrNo" }, "outputs": [], "source": [ "! pip install -q datasets gluonts orjson" ] }, { "cell_type": "markdown", "metadata": { "id": "dI1yo_vHw5CV" }, "source": [ "GluonTS comes with a pandas DataFrame based dataset so our strategy will be to read the csv file, and process it as a `PandasDataset`. We will then iterate over it and convert it to a 🤗 dataset with the appropriate schema for time series. So lets get started!\n", "\n", "## `PandasDataset`\n", "\n", "Suppose we are given multiple (10) time series stacked on top of each other in a dataframe with an `item_id` column that distinguishes different series:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "e9FaT_VpwuI2", "outputId": "8a10c908-41e1-4ca7-b420-01c0810c5c4b" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
targetitem_id
2021-01-01 00:00:00-1.3378A
2021-01-01 01:00:00-1.6111A
2021-01-01 02:00:00-1.9259A
2021-01-01 03:00:00-1.9184A
2021-01-01 04:00:00-1.9168A
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " target item_id\n", "2021-01-01 00:00:00 -1.3378 A\n", "2021-01-01 01:00:00 -1.6111 A\n", "2021-01-01 02:00:00 -1.9259 A\n", "2021-01-01 03:00:00 -1.9184 A\n", "2021-01-01 04:00:00 -1.9168 A" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "url = (\n", " \"https://gist.githubusercontent.com/rsnirwan/a8b424085c9f44ef2598da74ce43e7a3\"\n", " \"/raw/b6fdef21fe1f654787fa0493846c546b7f9c4df2/ts_long.csv\"\n", ")\n", "df = pd.read_csv(url, index_col=0, parse_dates=True)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "s4WCjvBqxi3B" }, "source": [ "After converting it into a `pd.Dataframe` we can then convert it into GluonTS's `PandasDataset`:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "id": "su1i8ZdDxf7c" }, "outputs": [], "source": [ "from gluonts.dataset.pandas import PandasDataset\n", "\n", "ds = PandasDataset.from_long_dataframe(df, target=\"target\", item_id=\"item_id\")" ] }, { "cell_type": "markdown", "metadata": { "id": "cYnHkLdex_n3" }, "source": [ "\n", "## 🤗 Datasets\n", "\n", "From here we have to map the pandas dataset's `start` field into a time stamp instead of a `pd.Period`. We do this by defining the following class:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "id": "gv5Ytpwrx4FQ" }, "outputs": [], "source": [ "class ProcessStartField():\n", " ts_id = 0\n", " \n", " def __call__(self, data):\n", " data[\"start\"] = data[\"start\"].to_timestamp()\n", " data[\"feat_static_cat\"] = [self.ts_id]\n", " self.ts_id += 1\n", " \n", " return data" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "id": "DrhbT1QIyPMA" }, "outputs": [], "source": [ "from gluonts.itertools import Map\n", "\n", "process_start = ProcessStartField()\n", "\n", "list_ds = list(Map(process_start, ds))" ] }, { "cell_type": "markdown", "metadata": { "id": "Ug2kLNUPyeyJ" }, "source": [ "Next we need to define our schema features and create our dataset from this list via the `from_list` function:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "id": "r1rQUtvGycaF" }, "outputs": [], "source": [ "from datasets import Dataset, Features, Value, Sequence\n", "\n", "features = Features(\n", " { \n", " \"start\": Value(\"timestamp[s]\"),\n", " \"target\": Sequence(Value(\"float32\")),\n", " \"feat_static_cat\": Sequence(Value(\"uint64\")),\n", " # \"feat_static_real\": Sequence(Value(\"float32\")),\n", " # \"feat_dynamic_real\": Sequence(Sequence(Value(\"uint64\"))),\n", " # \"feat_dynamic_cat\": Sequence(Sequence(Value(\"uint64\"))),\n", " \"item_id\": Value(\"string\"),\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "RrpP2oAxywC8" }, "outputs": [], "source": [ "dataset = Dataset.from_list(list_ds, features=features)" ] }, { "cell_type": "markdown", "metadata": { "id": "GrbGlDtYzyIf" }, "source": [ "We can thus use this strategy to [share](https://huggingface.co/docs/datasets/share) the dataset to the hub." ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 1 }