{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# TSML (Time Series Machine Learning)\n",
"- **Speaker: Paulito Palmes**\n",
"- **IBM Dublin Research Lab**\n",
"- July 23, 2019"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Motivations\n",
"- innovations across industry sectors have brought automation\n",
"- automation requires the installation of sensor networks\n",
"- main challenges:\n",
"  - collect large volumes of data, detect anomalies, monitor status\n",
"  - discover patterns to reduce downtime and manufacturing errors\n",
"  - reduce energy usage\n",
"  - predict faults/failures\n",
"  - create effective maintenance schedules\n",
"\n",
"_TSML leverages AI and ML libraries from ScikitLearn, Caret, and Julia as building blocks for processing huge amounts of industrial time series data._"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Typical TSML Workflow"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"## First, let's create artificial data with missing values"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>Date</th><th>Value</th></tr><tr><th></th><th>DateTime</th><th>Float64⍰</th></tr></thead><tbody><p>10 rows × 2 columns</p><tr><th>1</th><td>2014-01-01T00:00:00</td><td>0.768448</td></tr><tr><th>2</th><td>2014-01-01T00:15:00</td><td>0.940515</td></tr><tr><th>3</th><td>2014-01-01T00:30:00</td><td>0.673959</td></tr><tr><th>4</th><td>2014-01-01T00:45:00</td><td>0.395453</td></tr><tr><th>5</th><td>2014-01-01T01:00:00</td><td>missing</td></tr><tr><th>6</th><td>2014-01-01T01:15:00</td><td>0.662555</td></tr><tr><th>7</th><td>2014-01-01T01:30:00</td><td>0.586022</td></tr><tr><th>8</th><td>2014-01-01T01:45:00</td><td>missing</td></tr><tr><th>9</th><td>2014-01-01T02:00:00</td><td>0.26864</td></tr><tr><th>10</th><td>2014-01-01T02:15:00</td><td>missing</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& Date & Value\\\\\n",
"\t\\hline\n",
"\t& DateTime & Float64⍰\\\\\n",
"\t\\hline\n",
"\t1 & 2014-01-01T00:00:00 & 0.768448 \\\\\n",
"\t2 & 2014-01-01T00:15:00 & 0.940515 \\\\\n",
"\t3 & 2014-01-01T00:30:00 & 0.673959 \\\\\n",
"\t4 & 2014-01-01T00:45:00 & 0.395453 \\\\\n",
"\t5 & 2014-01-01T01:00:00 & \\\\\n",
"\t6 & 2014-01-01T01:15:00 & 0.662555 \\\\\n",
"\t7 & 2014-01-01T01:30:00 & 0.586022 \\\\\n",
"\t8 & 2014-01-01T01:45:00 & \\\\\n",
"\t9 & 2014-01-01T02:00:00 & 0.26864 \\\\\n",
"\t10 & 2014-01-01T02:15:00 & \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"10×2 DataFrame\n",
"│ Row │ Date │ Value │\n",
"│ │ \u001b[90mDateTime\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │\n",
"├─────┼─────────────────────┼──────────┤\n",
"│ 1 │ 2014-01-01T00:00:00 │ 0.768448 │\n",
"│ 2 │ 2014-01-01T00:15:00 │ 0.940515 │\n",
"│ 3 │ 2014-01-01T00:30:00 │ 0.673959 │\n",
"│ 4 │ 2014-01-01T00:45:00 │ 0.395453 │\n",
"│ 5 │ 2014-01-01T01:00:00 │ \u001b[90mmissing\u001b[39m │\n",
"│ 6 │ 2014-01-01T01:15:00 │ 0.662555 │\n",
"│ 7 │ 2014-01-01T01:30:00 │ 0.586022 │\n",
"│ 8 │ 2014-01-01T01:45:00 │ \u001b[90mmissing\u001b[39m │\n",
"│ 9 │ 2014-01-01T02:00:00 │ 0.26864 │\n",
"│ 10 │ 2014-01-01T02:15:00 │ \u001b[90mmissing\u001b[39m │"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"using DataFrames\n",
"using Dates\n",
"using Random\n",
"ENV[\"COLUMNS\"]=1000 # for dataframe column size\n",
"\n",
"function generateXY()\n",
" Random.seed!(123)\n",
" gdate = DateTime(2014,1,1):Dates.Minute(15):DateTime(2014,1,5)\n",
" gval = Array{Union{Missing,Float64}}(rand(length(gdate)))\n",
" gmissing = floor(0.30*length(gdate)) |> Integer\n",
" gndxmissing = Random.shuffle(1:length(gdate))[1:gmissing]\n",
" X = DataFrame(Date=gdate,Value=gval)\n",
" X.Value[gndxmissing] .= missing\n",
" Y = rand(length(gdate))\n",
" (X,Y)\n",
"end;\n",
"(df,outY)=generateXY(); first(df,10)"
]
},
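{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"As a quick sanity check on the generated gaps (an illustrative extra step, plain DataFrames code, not part of the original demo):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Illustrative check, not part of the original demo:\n",
"# count the missing values injected by generateXY.\n",
"nmiss = sum(ismissing.(df.Value))\n",
"(nmiss, nrow(df), round(nmiss / nrow(df); digits=2)) # ≈ 30% missing by construction"
]
},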
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's load the TSML module and its filters to process the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"using TSML"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's use a Pipeline with the Plotter filter to plot the artificial data"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pltr=Plotter(Dict(:interactive => true))\n",
"\n",
"mypipeline = Pipeline(Dict(\n",
" :transformers => [pltr]\n",
" )\n",
")\n",
"\n",
"fit!(mypipeline, df)\n",
"transform!(mypipeline, df) "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's get statistics on data quality, including blocks of missing data"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>tstart</th><th>tend</th><th>sfreq</th><th>count</th><th>max</th><th>min</th><th>median</th><th>mean</th><th>q1</th><th>q2</th><th>q25</th><th>q75</th><th>q8</th><th>q9</th><th>kurtosis</th><th>skewness</th><th>variation</th><th>entropy</th><th>autocor</th><th>pacf</th></tr><tr><th></th><th>DateTime</th><th>DateTime</th><th>Float64</th><th>Int64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>1 rows × 20 columns</p><tr><th>1</th><td>2014-01-01T00:00:00</td><td>2014-01-05T00:00:00</td><td>0.249351</td><td>270</td><td>0.995414</td><td>0.000412399</td><td>0.521184</td><td>0.505873</td><td>0.121582</td><td>0.213152</td><td>0.279623</td><td>0.745784</td><td>0.781425</td><td>0.870951</td><td>-1.14079</td><td>-0.065312</td><td>0.546211</td><td>69.5203</td><td>0.320605</td><td>0.312706</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cccccccccccccccccccc}\n",
"\t& tstart & tend & sfreq & count & max & min & median & mean & q1 & q2 & q25 & q75 & q8 & q9 & kurtosis & skewness & variation & entropy & autocor & pacf\\\\\n",
"\t\\hline\n",
"\t& DateTime & DateTime & Float64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & 2014-01-01T00:00:00 & 2014-01-05T00:00:00 & 0.249351 & 270 & 0.995414 & 0.000412399 & 0.521184 & 0.505873 & 0.121582 & 0.213152 & 0.279623 & 0.745784 & 0.781425 & 0.870951 & -1.14079 & -0.065312 & 0.546211 & 69.5203 & 0.320605 & 0.312706 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"1×20 DataFrame\n",
"│ Row │ tstart │ tend │ sfreq │ count │ max │ min │ median │ mean │ q1 │ q2 │ q25 │ q75 │ q8 │ q9 │ kurtosis │ skewness │ variation │ entropy │ autocor │ pacf │\n",
"│ │ \u001b[90mDateTime\u001b[39m │ \u001b[90mDateTime\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼─────────────────────┼─────────────────────┼──────────┼───────┼──────────┼─────────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼───────────┼───────────┼─────────┼──────────┼──────────┤\n",
"│ 1 │ 2014-01-01T00:00:00 │ 2014-01-05T00:00:00 │ 0.249351 │ 270 │ 0.995414 │ 0.000412399 │ 0.521184 │ 0.505873 │ 0.121582 │ 0.213152 │ 0.279623 │ 0.745784 │ 0.781425 │ 0.870951 │ -1.14079 │ -0.065312 │ 0.546211 │ 69.5203 │ 0.320605 │ 0.312706 │"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"statfier = Statifier(Dict(:processmissing=>false))\n",
"\n",
"mypipeline = Pipeline(Dict(\n",
" :transformers => [statfier]\n",
" )\n",
")\n",
"\n",
"fit!(mypipeline, df)\n",
"res = transform!(mypipeline, df)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's extend the Pipeline workflow with aggregation before plotting"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"valgator = DateValgator(Dict(:dateinterval=>Dates.Hour(1)))\n",
"\n",
"mypipeline = Pipeline(Dict(\n",
" :transformers => [valgator,pltr]\n",
" )\n",
")\n",
"\n",
"fit!(mypipeline, df)\n",
"transform!(mypipeline, df)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's now try real data"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"fname = joinpath(dirname(pathof(TSML)),\"../data/testdata.csv\")\n",
"csvreader = CSVDateValReader(Dict(:filename=>fname,:dateformat=>\"dd/mm/yyyy HH:MM\"))\n",
"\n",
"outputname = joinpath(tempdir(),\"testdata_output.csv\") # write output outside the package tree\n",
"csvwriter = CSVDateValWriter(Dict(:filename=>outputname))\n",
"\n",
"valgator = DateValgator(Dict(:dateinterval=>Dates.Hour(1)))\n",
"valputer = DateValNNer(Dict(:dateinterval=>Dates.Hour(1)))\n",
"stfier = Statifier(Dict(:processmissing=>true))\n",
"outliernicer = Outliernicer(Dict(:dateinterval=>Dates.Hour(1)));"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's plot the real data and check for missing values"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mpipeline1 = Pipeline(Dict(\n",
" :transformers => [csvreader,valgator,pltr]\n",
" )\n",
")\n",
"\n",
"fit!(mpipeline1)\n",
"transform!(mpipeline1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's get the statistics to assess data quality"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>tstart</th><th>tend</th><th>sfreq</th><th>count</th><th>max</th><th>min</th><th>median</th><th>mean</th><th>q1</th><th>q2</th><th>q25</th><th>q75</th><th>q8</th><th>q9</th><th>kurtosis</th><th>skewness</th><th>variation</th><th>entropy</th><th>autocor</th><th>pacf</th><th>bmedian</th><th>bmean</th><th>bq25</th><th>bq75</th><th>bmin</th><th>bmax</th></tr><tr><th></th><th>DateTime</th><th>DateTime</th><th>Float64</th><th>Int64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>1 rows × 26 columns</p><tr><th>1</th><td>2014-01-01T00:00:00</td><td>2015-01-01T00:00:00</td><td>0.999886</td><td>3830</td><td>18.8</td><td>8.5</td><td>10.35</td><td>11.557</td><td>9.9</td><td>10.0</td><td>10.0</td><td>12.3</td><td>13.0</td><td>16.0</td><td>0.730635</td><td>1.41283</td><td>0.200055</td><td>-1.09145e5</td><td>4.39315</td><td>1.04644</td><td>5.0</td><td>10.5589</td><td>3.0</td><td>6.0</td><td>1.0</td><td>2380.0</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cccccccccccccccccccccccccc}\n",
"\t& tstart & tend & sfreq & count & max & min & median & mean & q1 & q2 & q25 & q75 & q8 & q9 & kurtosis & skewness & variation & entropy & autocor & pacf & bmedian & bmean & bq25 & bq75 & bmin & bmax\\\\\n",
"\t\\hline\n",
"\t& DateTime & DateTime & Float64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & 2014-01-01T00:00:00 & 2015-01-01T00:00:00 & 0.999886 & 3830 & 18.8 & 8.5 & 10.35 & 11.557 & 9.9 & 10.0 & 10.0 & 12.3 & 13.0 & 16.0 & 0.730635 & 1.41283 & 0.200055 & -1.09145e5 & 4.39315 & 1.04644 & 5.0 & 10.5589 & 3.0 & 6.0 & 1.0 & 2380.0 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"1×26 DataFrame\n",
"│ Row │ tstart │ tend │ sfreq │ count │ max │ min │ median │ mean │ q1 │ q2 │ q25 │ q75 │ q8 │ q9 │ kurtosis │ skewness │ variation │ entropy │ autocor │ pacf │ bmedian │ bmean │ bq25 │ bq75 │ bmin │ bmax │\n",
"│ │ \u001b[90mDateTime\u001b[39m │ \u001b[90mDateTime\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼─────────────────────┼─────────────────────┼──────────┼───────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────────┼──────────┼───────────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤\n",
"│ 1 │ 2014-01-01T00:00:00 │ 2015-01-01T00:00:00 │ 0.999886 │ 3830 │ 18.8 │ 8.5 │ 10.35 │ 11.557 │ 9.9 │ 10.0 │ 10.0 │ 12.3 │ 13.0 │ 16.0 │ 0.730635 │ 1.41283 │ 0.200055 │ -1.09145e5 │ 4.39315 │ 1.04644 │ 5.0 │ 10.5589 │ 3.0 │ 6.0 │ 1.0 │ 2380.0 │"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mpipeline1 = Pipeline(Dict(\n",
" :transformers => [csvreader,valgator,stfier]\n",
" )\n",
")\n",
"\n",
"fit!(mpipeline1)\n",
"respipe1 = transform!(mpipeline1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's impute the data and verify the statistical features"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>tstart</th><th>tend</th><th>sfreq</th><th>count</th><th>max</th><th>min</th><th>median</th><th>mean</th><th>q1</th><th>q2</th><th>q25</th><th>q75</th><th>q8</th><th>q9</th><th>kurtosis</th><th>skewness</th><th>variation</th><th>entropy</th><th>autocor</th><th>pacf</th></tr><tr><th></th><th>DateTime</th><th>DateTime</th><th>Float64</th><th>Int64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>1 rows × 20 columns</p><tr><th>1</th><td>2014-01-01T00:00:00</td><td>2015-01-01T00:00:00</td><td>0.999886</td><td>8761</td><td>18.8</td><td>8.5</td><td>10.0</td><td>11.1362</td><td>9.95</td><td>10.0</td><td>10.0</td><td>11.5</td><td>12.0</td><td>14.95</td><td>2.37274</td><td>1.87452</td><td>0.187997</td><td>-2.36714e5</td><td>4.47886</td><td>1.06917</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cccccccccccccccccccc}\n",
"\t& tstart & tend & sfreq & count & max & min & median & mean & q1 & q2 & q25 & q75 & q8 & q9 & kurtosis & skewness & variation & entropy & autocor & pacf\\\\\n",
"\t\\hline\n",
"\t& DateTime & DateTime & Float64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & 2014-01-01T00:00:00 & 2015-01-01T00:00:00 & 0.999886 & 8761 & 18.8 & 8.5 & 10.0 & 11.1362 & 9.95 & 10.0 & 10.0 & 11.5 & 12.0 & 14.95 & 2.37274 & 1.87452 & 0.187997 & -2.36714e5 & 4.47886 & 1.06917 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"1×20 DataFrame\n",
"│ Row │ tstart │ tend │ sfreq │ count │ max │ min │ median │ mean │ q1 │ q2 │ q25 │ q75 │ q8 │ q9 │ kurtosis │ skewness │ variation │ entropy │ autocor │ pacf │\n",
"│ │ \u001b[90mDateTime\u001b[39m │ \u001b[90mDateTime\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼─────────────────────┼─────────────────────┼──────────┼───────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────────┼──────────┼───────────┼────────────┼─────────┼─────────┤\n",
"│ 1 │ 2014-01-01T00:00:00 │ 2015-01-01T00:00:00 │ 0.999886 │ 8761 │ 18.8 │ 8.5 │ 10.0 │ 11.1362 │ 9.95 │ 10.0 │ 10.0 │ 11.5 │ 12.0 │ 14.95 │ 2.37274 │ 1.87452 │ 0.187997 │ -2.36714e5 │ 4.47886 │ 1.06917 │"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mpipeline2 = Pipeline(Dict(\n",
" :transformers => [csvreader,valgator,valputer,statfier]\n",
" )\n",
")\n",
"\n",
"fit!(mpipeline2)\n",
"respipe2 = transform!(mpipeline2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's visualize the imputed data"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mpipeline2 = Pipeline(Dict(\n",
" :transformers => [csvreader,valgator,valputer,pltr]\n",
" )\n",
")\n",
"\n",
"fit!(mpipeline2)\n",
"transform!(mpipeline2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's look at examples of monotonic data"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"regularfile = joinpath(dirname(pathof(TSML)),\"../data/typedetection/regular.csv\")\n",
"monofile = joinpath(dirname(pathof(TSML)),\"../data/typedetection/monotonic.csv\")\n",
"dailymonofile = joinpath(dirname(pathof(TSML)),\"../data/typedetection/dailymonotonic.csv\")\n",
"\n",
"regularfilecsv = CSVDateValReader(Dict(:filename=>regularfile,:dateformat=>\"dd/mm/yyyy HH:MM\"))\n",
"monofilecsv = CSVDateValReader(Dict(:filename=>monofile,:dateformat=>\"dd/mm/yyyy HH:MM\"))\n",
"dailymonofilecsv = CSVDateValReader(Dict(:filename=>dailymonofile,:dateformat=>\"dd/mm/yyyy HH:MM\"))\n",
"\n",
"valgator = DateValgator(Dict(:dateinterval=>Dates.Hour(1)))\n",
"valputer = DateValNNer(Dict(:dateinterval=>Dates.Hour(1)))\n",
"stfier = Statifier(Dict(:processmissing=>true))\n",
"mononicer = Monotonicer(Dict())\n",
"outliernicer = Outliernicer(Dict(:dateinterval=>Dates.Hour(1)));"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's plot an example of monotonic data"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"monopipeline = Pipeline(Dict(\n",
" :transformers => [monofilecsv,valgator,valputer,pltr]\n",
" )\n",
")\n",
"\n",
"fit!(monopipeline)\n",
"transform!(monopipeline)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's plot after normalizing the monotonic data"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"monopipeline = Pipeline(Dict(\n",
" :transformers => [monofilecsv,valgator,valputer,mononicer, pltr]\n",
" )\n",
")\n",
"\n",
"fit!(monopipeline)\n",
"transform!(monopipeline)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's remove outliers and plot the result"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"monopipeline = Pipeline(Dict(\n",
" :transformers => [monofilecsv,valgator,valputer,mononicer,outliernicer,pltr]\n",
" )\n",
")\n",
"\n",
"fit!(monopipeline)\n",
"transform!(monopipeline)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's plot an example of daily monotonic data"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dailymonopipeline = Pipeline(Dict(\n",
" :transformers => [dailymonofilecsv,valgator,valputer,pltr]\n",
" )\n",
")\n",
"\n",
"fit!(dailymonopipeline)\n",
"transform!(dailymonopipeline)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's normalize and plot"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dailymonopipeline = Pipeline(Dict(\n",
" :transformers => [dailymonofilecsv,valgator,valputer,mononicer,pltr]\n",
" )\n",
")\n",
"fit!(dailymonopipeline)\n",
"transform!(dailymonopipeline)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's add the Outliernicer filter and plot"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dailymonopipeline = Pipeline(Dict(\n",
" :transformers => [dailymonofilecsv,valgator,valputer,mononicer,outliernicer,pltr]\n",
" )\n",
")\n",
"fit!(dailymonopipeline)\n",
"transform!(dailymonopipeline)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's use what we have learned so far to perform automatic data type classification"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"getting stats of AirOffTemp1.csv\n",
"getting stats of AirOffTemp2.csv\n",
"getting stats of AirOffTemp3.csv\n",
"getting stats of Energy1.csv\n",
"getting stats of Energy10.csv\n",
"getting stats of Energy2.csv\n",
"getting stats of Energy3.csv\n",
"getting stats of Energy4.csv\n",
"getting stats of Energy6.csv\n",
"getting stats of Energy7.csv\n",
"getting stats of Energy8.csv\n",
"getting stats of Energy9.csv\n",
"getting stats of Pressure1.csv\n",
"getting stats of Pressure3.csv\n",
"getting stats of Pressure4.csv\n",
"getting stats of Pressure6.csv\n",
"getting stats of RetTemp11.csv\n",
"getting stats of RetTemp21.csv\n",
"getting stats of RetTemp41.csv\n",
"getting stats of RetTemp51.csv\n",
"getting stats of AirOffTemp4.csv\n",
"getting stats of AirOffTemp5.csv\n",
"getting stats of Energy5.csv\n",
"getting stats of Pressure5.csv\n",
"getting stats of RetTemp31.csv\n",
"loading model from file: /Users/ppalmes/.julia/packages/TSML/lqjQn/src/../data/realdatatsclassification/model/juliarfmodel.serialized\n"
]
},
{
"data": {
"text/plain": [
"80.0"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"using TSML: TSClassifier\n",
"Random.seed!(12)\n",
"\n",
"trdirname = joinpath(dirname(pathof(TSML)),\"../data/realdatatsclassification/training\")\n",
"tstdirname = joinpath(dirname(pathof(TSML)),\"../data/realdatatsclassification/testing\")\n",
"modeldirname = joinpath(dirname(pathof(TSML)),\"../data/realdatatsclassification/model\")\n",
"\n",
"tscl = TSClassifier(Dict(:trdirectory=>trdirname,\n",
" :tstdirectory=>tstdirname,\n",
" :modeldirectory=>modeldirname,\n",
" :feature_range => 6:20,\n",
" :num_trees=>50)\n",
")\n",
"\n",
"fit!(tscl)\n",
"dfresults = transform!(tscl);\n",
"apredict = dfresults.predtype\n",
"fnames = dfresults.fname\n",
"myregex = r\"(?<dtype>[A-Z _ - a-z]+)(?<num>\\d*).(?<ext>\\w+)\"\n",
"mtypes=map(fnames) do fname\n",
" mymatch=match(myregex,fname)\n",
" mymatch[:dtype]\n",
"end\n",
"\n",
"sum(mtypes .== apredict)/length(mtypes) * 100 |> x-> round(x,digits=2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## TSML features\n",
"- TS data type clustering/classification for automatic data discovery\n",
"- TS aggregation based on date/time interval\n",
"- TS imputation based on symmetric Nearest Neighbors\n",
"- TS statistical metrics for data quality assessment\n",
"- TS ML wrappers for more than 100 libraries from Caret, ScikitLearn, and Julia\n",
"- TS date/value matrix conversion of 1-D TS using sliding windows for ML input"
]
},
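{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"As a rough illustration of the last feature, here is a minimal sliding-window sketch in plain Julia (`slidingwindows` is a hypothetical helper, not the TSML API):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Hypothetical sketch, not the TSML API: convert a 1-D series into a\n",
"# matrix of overlapping lagged windows suitable as ML input.\n",
"function slidingwindows(v::AbstractVector{<:Real}, w::Int)\n",
"    n = length(v) - w + 1\n",
"    [v[i+j-1] for i in 1:n, j in 1:w]  # n × w matrix, one window per row\n",
"end\n",
"slidingwindows(collect(1.0:6.0), 3)  # 4×3 matrix of width-3 windows"
]
},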
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## More TSML features\n",
"- Common API wrappers for ML libs from JuliaML, PyCall, and RCall\n",
"- Pipeline API allows high-level description of the processing workflow\n",
"- Specific cleaning/normalization workflow based on data type\n",
"- Automatic selection of optimised ML model\n",
"- Automatic segmentation of time-series data into matrix form for ML training and prediction\n",
"- Easily extensible architecture by using just two main interfaces: fit and transform\n",
"- Meta-ensembles for robust prediction\n",
"- Support for distributed computation for scalability and speed"
]
},
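{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The \"two main interfaces\" pattern can be sketched standalone (hypothetical code mirroring the fit/transform convention; `MeanScaler`, `myfit!`, and `mytransform!` are invented here and not wired into TSML's type hierarchy):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Hypothetical standalone sketch of the fit/transform pattern TSML filters follow;\n",
"# not tied to TSML's actual Transformer type.\n",
"mutable struct MeanScaler\n",
"    mu::Float64\n",
"    MeanScaler() = new(1.0)\n",
"end\n",
"myfit!(sc::MeanScaler, v) = (sc.mu = sum(v) / length(v); sc)  # learn the mean\n",
"mytransform!(sc::MeanScaler, v) = v ./ sc.mu                  # scale by learned mean\n",
"\n",
"sc = MeanScaler()\n",
"myfit!(sc, [2.0, 4.0, 6.0])\n",
"mytransform!(sc, [2.0, 4.0, 6.0])  # => [0.5, 1.0, 1.5]"
]
},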
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"@webio": {
"lastCommId": "530372faf78d4151b4a95cfcbdce67ad",
"lastKernelId": "d0a22632-562b-4706-97fe-42a3e4c24931"
},
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Julia 1.2.0",
"language": "julia",
"name": "julia-1.2"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.2.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}