{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "64ae6820", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from orion.data import load_signal" ] }, { "cell_type": "markdown", "id": "5b49b132", "metadata": {}, "source": [ "# 1. Data" ] }, { "cell_type": "code", "execution_count": 2, "id": "5ce48fb3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>timestamp</th><th>value</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>1222819200</td><td>-0.366359</td></tr>\n", " <tr><th>1</th><td>1222840800</td><td>-0.394108</td></tr>\n", " <tr><th>2</th><td>1222862400</td><td>0.403625</td></tr>\n", " <tr><th>3</th><td>1222884000</td><td>-0.362759</td></tr>\n", " <tr><th>4</th><td>1222905600</td><td>-0.370746</td></tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " timestamp value\n", "0 1222819200 -0.366359\n", "1 1222840800 -0.394108\n", "2 1222862400 0.403625\n", "3 1222884000 -0.362759\n", "4 1222905600 -0.370746" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "signal_name = 'S-1'\n", "\n", "data = load_signal(signal_name)\n", "\n", "data.head()" ] }, { "cell_type": "markdown", "id": "8ee4b439", "metadata": {}, "source": [ "# 2. Pipeline" ] }, { "cell_type": "code", "execution_count": 3, "id": "8321a6e1", "metadata": {}, "outputs": [], "source": [ "from mlblocks import MLPipeline\n", "\n", "pipeline_name = 'matrixprofile'\n", "\n", "pipeline = MLPipeline(pipeline_name)" ] }, { "cell_type": "markdown", "id": "6d072edb", "metadata": {}, "source": [ "## step by step execution\n", "\n", "MLPipelines are composed of a sequence of primitives. These primitives apply transformation and calculation operations to the data and update the variables within the pipeline. To view the primitives used by the pipeline, we access its `primitives` attribute. \n", "\n", "The `matrixprofile` pipeline contains 8 primitives. We will observe how the `context` (the variables held within the pipeline) is updated after the execution of each primitive."
] }, { "cell_type": "code", "execution_count": 4, "id": "90ee9a9b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['mlstars.custom.timeseries_preprocessing.time_segments_aggregate',\n", " 'sklearn.impute.SimpleImputer',\n", " 'sklearn.preprocessing.MinMaxScaler',\n", " 'numpy.reshape',\n", " 'stumpy.stump',\n", " 'orion.primitives.timeseries_preprocessing.slice_array_by_dims',\n", " 'numpy.reshape',\n", " 'orion.primitives.timeseries_anomalies.find_anomalies']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipeline.primitives" ] }, { "cell_type": "markdown", "id": "24253007", "metadata": {}, "source": [ "### time segments aggregate\n", "this primitive creates an equi-spaced time series by aggregating values over a fixed, specified interval.\n", "\n", "* **input**: `X` which is an n-dimensional sequence of values.\n", "* **output**:\n", " - `X` sequence of aggregated values, one column for each aggregation method.\n", " - `index` sequence of index values (first index of each aggregated segment)."
] }, { "cell_type": "code", "execution_count": 5, "id": "2065a476", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['X', 'index'])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "context = pipeline.fit(data, output_=0)\n", "context.keys()" ] }, { "cell_type": "code", "execution_count": 6, "id": "0795d470", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "entry at 1222819200 has value [-0.36635895]\n", "entry at 1222840800 has value [-0.39410778]\n", "entry at 1222862400 has value [0.4036246]\n", "entry at 1222884000 has value [-0.36275906]\n", "entry at 1222905600 has value [-0.37074649]\n" ] } ], "source": [ "for i, x in list(zip(context['index'], context['X']))[:5]:\n", " print(\"entry at {} has value {}\".format(i, x))" ] }, { "cell_type": "markdown", "id": "3c9fa853", "metadata": {}, "source": [ "### SimpleImputer\n", "this primitive is an imputation transformer for filling missing values.\n", "* **input**: `X` which is an n-dimensional sequence of values.\n", "* **output**: `X` which is a transformed version of X." ] }, { "cell_type": "code", "execution_count": 7, "id": "dfc91e13", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['index', 'X'])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "step = 1\n", "\n", "context = pipeline.fit(**context, output_=step, start_=step)\n", "context.keys()" ] }, { "cell_type": "markdown", "id": "f5be9e33", "metadata": {}, "source": [ "### MinMaxScaler\n", "this primitive transforms features by scaling each feature to a given range.\n", "* **input**: `X` the data used to compute the per-feature minimum and maximum used for later scaling along the features axis.\n", "* **output**: `X` which is a transformed version of X." 
] }, { "cell_type": "code", "execution_count": 8, "id": "72d00ac4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['index', 'X'])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "step = 2\n", "\n", "context = pipeline.fit(**context, output_=step, start_=step)\n", "context.keys()" ] }, { "cell_type": "code", "execution_count": 9, "id": "bced090c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "entry at 1222819200 has value [0.31682053]\n", "entry at 1222840800 has value [0.30294611]\n", "entry at 1222862400 has value [0.7018123]\n", "entry at 1222884000 has value [0.31862047]\n", "entry at 1222905600 has value [0.31462675]\n" ] } ], "source": [ "# after scaling the data between [0, 1]\n", "# the values differ from the aggregated ones printed earlier\n", "# (e.g. -0.36635895 becomes 0.31682053)\n", "\n", "for i, x in list(zip(context['index'], context['X']))[:5]:\n", " print(\"entry at {} has value {}\".format(i, x))" ] }, { "cell_type": "markdown", "id": "21bddd28", "metadata": {}, "source": [ "### reshape\n", "\n", "this primitive flattens the array.\n", "* **input**: `X` n-dimensional values.\n", "* **output**: `X` which is a flat version of X."
] }, { "cell_type": "code", "execution_count": 10, "id": "a01e3ac7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['index', 'X'])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "step = 3\n", "\n", "context = pipeline.fit(**context, output_=step, start_=step)\n", "context.keys()" ] }, { "cell_type": "code", "execution_count": 11, "id": "c6d69156", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(10149,)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "context['X'].shape" ] }, { "cell_type": "markdown", "id": "49eb9017", "metadata": {}, "source": [ "### stump\n", "\n", "this primitive computes the matrix profile of `X`.\n", "* **input**: `X` the flattened, one-dimensional sequence of values.\n", "* **output**: `y` which is the matrix profile of X." ] }, { "cell_type": "code", "execution_count": 12, "id": "8a094ead", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['index', 'X', 'y'])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "step = 4\n", "\n", "context = pipeline.fit(**context, output_=step, start_=step)\n", "context.keys()" ] }, { "cell_type": "code", "execution_count": 13, "id": "ba14e3f6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(10050, 4)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "context['y'].shape" ] }, { "cell_type": "markdown", "id": "bddfcb51", "metadata": {}, "source": [ "### slice array by dims\n", "\n", "this primitive extracts the distance to the nearest neighbor from the matrix profile.\n", "* **input**: `y` the matrix profile array.\n", "* **output**: `y` which is the column of nearest-neighbor distances extracted from the matrix profile."
] }, { "cell_type": "code", "execution_count": 14, "id": "142bd39e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['index', 'X', 'y'])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "step = 5\n", "\n", "context = pipeline.fit(**context, output_=step, start_=step)\n", "context.keys()" ] }, { "cell_type": "code", "execution_count": 15, "id": "d0e45482", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(10050, 1)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "context['y'].shape" ] }, { "cell_type": "markdown", "id": "efba757f", "metadata": {}, "source": [ "### reshape\n", "\n", "this primitive flattens the array.\n", "* **input**: `y` n-dimensional values.\n", "* **output**: `errors` which is a flat version of y." ] }, { "cell_type": "code", "execution_count": 16, "id": "4878d223", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['index', 'X', 'y', 'errors'])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "step = 6\n", "\n", "context = pipeline.fit(**context, output_=step, start_=step)\n", "context.keys()" ] }, { "cell_type": "markdown", "id": "6b312320", "metadata": {}, "source": [ "### find anomalies\n", "\n", "this primitive extracts anomalies from sequences of errors following the approach explained in the [related paper](https://arxiv.org/pdf/1802.04431.pdf).\n", "\n", "* **input**: \n", " - `errors` array of errors.\n", " - `index` array of indices of errors.\n", "* **output**: `y` array containing start-index, end-index, score for each anomalous sequence that was found." 
] }, { "cell_type": "code", "execution_count": 17, "id": "16466f32", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/sarah/opt/anaconda3/envs/stumpy/lib/python3.7/site-packages/scipy/optimize/optimize.py:761: RuntimeWarning: invalid value encountered in subtract\n", " np.max(np.abs(fsim[0] - fsim[1:])) <= fatol):\n" ] }, { "data": { "text/plain": [ "dict_keys(['index', 'errors', 'X', 'y'])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "step = 7\n", "\n", "context = pipeline.fit(**context, output_=step, start_=step)\n", "context.keys()" ] }, { "cell_type": "code", "execution_count": 18, "id": "cf066de9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>start</th><th>end</th><th>severity</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>1.310386e+09</td><td>1.312826e+09</td><td>0.198253</td></tr>\n", " <tr><th>1</th><td>1.398125e+09</td><td>1.401408e+09</td><td>1.728175</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " start end severity\n", "0 1.310386e+09 1.312826e+09 0.198253\n", "1 1.398125e+09 1.401408e+09 1.728175" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(context['y'], columns=['start', 'end', 'severity'])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.16" } }, "nbformat": 4, "nbformat_minor": 5 }