{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Convert NASA data to Orion format\n", "\n", "In this notebook we download the data from the Telemanom S3 bucket and reformat it\n", "as the Orion pipelines expect." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "import urllib.request\n", "import zipfile\n", "\n", "DATA_URL = 'https://s3-us-west-2.amazonaws.com/telemanom/data.zip'\n", "\n", "# Download and extract the archive only if the data folder is not there yet\n", "if not os.path.exists('data'):\n", "    response = urllib.request.urlopen(DATA_URL)\n", "    bytes_io = io.BytesIO(response.read())\n", "\n", "    with zipfile.ZipFile(bytes_io) as zf:\n", "        zf.extractall()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Sort the listings, since os.listdir returns entries in arbitrary order\n", "train_signals = sorted(os.listdir('data/train'))\n", "test_signals = sorted(os.listdir('data/test'))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_signals == test_signals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert the NPY matrices to CSVs\n", "\n", "We convert the NPY matrices to CSV files with two columns: `timestamp` and `value`.\n", "\n", "To do this, we load both the train and test matrices for each signal\n", "and concatenate them to generate a single matrix per signal.\n", "\n", "Afterwards, we add a timestamp column, using the value 1222819200 (2008-10-01T00:00:00 UTC)\n", "for the first row and increasing the timestamp by 21600 seconds (6h) for each subsequent row." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "NASA_DIR = os.path.join('data', '{}', '{}')\n", "\n", "def build_df(data, start=0):\n", "    # Row i gets the timestamp 1222819200 + 21600 * (start + i)\n", "    index = np.arange(start, start + len(data))\n", "    timestamp = index * 21600 + 1222819200\n", "\n", "    # Keep only the first column of the matrix as the signal value\n", "    return pd.DataFrame({'timestamp': timestamp, 'value': data[:, 0]})\n", "\n", "data = build_df(np.load(NASA_DIR.format('train', 'S-1.npy')))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | timestamp | value |\n",
"|---|---|---|\n",
"| 0 | 1222819200 | -0.366359 |\n",
"| 1 | 1222840800 | -0.394108 |\n",
"| 2 | 1222862400 | 0.403625 |\n",
"| 3 | 1222884000 | -0.362759 |\n",
"| 4 | 1222905600 | -0.370746 |\n", "