{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Introduction to daru (Data Analysis in RUby)\n", "\n", "## Sameer Deshmukh\n", "\n", "### Deccan Ruby Conf 2015, Pune, India." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "application/javascript": [ "if(window['d3'] === undefined ||\n", " window['Nyaplot'] === undefined){\n", " var path = {\"d3\":\"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min\",\"downloadable\":\"https://cdn.rawgit.com/domitry/d3-downloadable/master/d3-downloadable\"};\n", "\n", "\n", "\n", " var shim = {\"d3\":{\"exports\":\"d3\"},\"downloadable\":{\"exports\":\"downloadable\"}};\n", "\n", " require.config({paths: path, shim:shim});\n", "\n", "\n", "require(['d3'], function(d3){window['d3']=d3;console.log('finished loading d3');require(['downloadable'], function(downloadable){window['downloadable']=downloadable;console.log('finished loading downloadable');\n", "\n", "\tvar script = d3.select(\"head\")\n", "\t .append(\"script\")\n", "\t .attr(\"src\", \"https://cdn.rawgit.com/domitry/Nyaplotjs/master/release/nyaplot.js\")\n", "\t .attr(\"async\", true);\n", "\n", "\tscript[0][0].onload = script[0][0].onreadystatechange = function(){\n", "\n", "\n", "\t var event = document.createEvent(\"HTMLEvents\");\n", "\t event.initEvent(\"load_nyaplot\",false,false);\n", "\t window.dispatchEvent(event);\n", "\t console.log('Finished loading Nyaplotjs');\n", "\n", "\t};\n", "\n", "\n", "});});\n", "}\n" ], "text/plain": [ "\"if(window['d3'] === undefined ||\\n window['Nyaplot'] === undefined){\\n var path = {\\\"d3\\\":\\\"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min\\\",\\\"downloadable\\\":\\\"https://cdn.rawgit.com/domitry/d3-downloadable/master/d3-downloadable\\\"};\\n\\n\\n\\n var shim = {\\\"d3\\\":{\\\"exports\\\":\\\"d3\\\"},\\\"downloadable\\\":{\\\"exports\\\":\\\"downloadable\\\"}};\\n\\n 
require.config({paths: path, shim:shim});\\n\\n\\nrequire(['d3'], function(d3){window['d3']=d3;console.log('finished loading d3');require(['downloadable'], function(downloadable){window['downloadable']=downloadable;console.log('finished loading downloadable');\\n\\n\\tvar script = d3.select(\\\"head\\\")\\n\\t .append(\\\"script\\\")\\n\\t .attr(\\\"src\\\", \\\"https://cdn.rawgit.com/domitry/Nyaplotjs/master/release/nyaplot.js\\\")\\n\\t .attr(\\\"async\\\", true);\\n\\n\\tscript[0][0].onload = script[0][0].onreadystatechange = function(){\\n\\n\\n\\t var event = document.createEvent(\\\"HTMLEvents\\\");\\n\\t event.initEvent(\\\"load_nyaplot\\\",false,false);\\n\\t window.dispatchEvent(event);\\n\\t console.log('Finished loading Nyaplotjs');\\n\\n\\t};\\n\\n\\n});});\\n}\\n\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "true" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'daru'\n", "require 'distribution'\n", "require 'gnuplotrb'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a Daru::Vector\n", "\n", "**Vectors are indexed by passing data using the `index` option, and named with `name`**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:15070620 size: 6
Prices of stuff.
cherry20
apple40
barley25
wheat50
rice45
sugar12
" ], "text/plain": [ "\n", "#\n", " Prices of stuff.\n", " cherry 20\n", " apple 40\n", " barley 25\n", " wheat 50\n", " rice 45\n", " sugar 12\n" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = Daru::Vector.new(\n", " [20,40,25,50,45,12], index: ['cherry', 'apple', 'barley', 'wheat', 'rice', 'sugar'], \n", " name: \"Prices of stuff.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retreive a single value\n", "\n", "**Specify the index you want to retrieve in the `#[]` operator**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "45" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector['rice']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retreive multiple values\n", "\n", "**Multiple values can be retreived at the same time as another Daru::Vector by separating them with commas.**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:14387920 size: 3
Prices of stuff.
rice45
wheat50
sugar12
" ], "text/plain": [ "\n", "#\n", " Prices of stuff.\n", " rice 45\n", " wheat 50\n", " sugar 12\n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector['rice', 'wheat', 'sugar']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retreive a slice with a Range\n", "\n", "**Specifying a range of indexes will retrieve a slice of the Daru::Vector**" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:14063700 size: 4
Prices of stuff.
barley25
wheat50
rice45
sugar12
" ], "text/plain": [ "\n", "#\n", " Prices of stuff.\n", " barley 25\n", " wheat 50\n", " rice 45\n", " sugar 12\n" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector['barley'..'sugar']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assign a value\n", "\n", "**Assign a value by specifying the index directly to the #[]= operator**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:15070620 size: 6
Prices of stuff.
cherry20
apple40
barley1500
wheat50
rice45
sugar12
" ], "text/plain": [ "\n", "#\n", " Prices of stuff.\n", " cherry 20\n", " apple 40\n", " barley 1500\n", " wheat 50\n", " rice 45\n", " sugar 12\n" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector['barley'] = 1500\n", "vector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a Daru::DataFrame\n", "\n", "**The `:index` option is used for specifying the row index of the DataFrame and the `:order` option determines the order in which they will be stored.**\n", "\n", "**Note that this is only one way of creating a DataFrame. There are around 8 different ways you can do so, depending on your use case.**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:13337740 rows: 6 cols: 3
col0col1col2
one111a
two222b
three333c
four444d
five555e
six666f
" ], "text/plain": [ "\n", "#\n", " col0 col1 col2 \n", " one 1 11 a \n", " two 2 22 b \n", " three 3 33 c \n", " four 4 44 d \n", " five 5 55 e \n", " six 6 66 f \n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = Daru::DataFrame.new({\n", " 'col0' => [1,2,3,4,5,6],\n", " 'col2' => ['a','b','c','d','e','f'],\n", " 'col1' => [11,22,33,44,55,66]\n", " }, \n", " index: ['one', 'two', 'three', 'four', 'five', 'six'], \n", " order: ['col0', 'col1', 'col2']\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing a Column\n", "\n", "**A DataFrame column can be accessed using the DataFrame#[] operator.**\n", "\n", "**Note that it returns a Daru::Vector**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:13292960 size: 6
col1
one11
two22
three33
four44
five55
six66
" ], "text/plain": [ "\n", "#\n", " col1\n", " one 11\n", " two 22\n", "three 33\n", " four 44\n", " five 55\n", " six 66\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['col1']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing multiple Columns\n", "\n", "**Multiple columns can be accessed by separating them with a comma. The result is another DataFrame.**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:12423020 rows: 6 cols: 2
col2col0
onea1
twob2
threec3
fourd4
fivee5
sixf6
" ], "text/plain": [ "\n", "#\n", " col2 col0 \n", " one a 1 \n", " two b 2 \n", " three c 3 \n", " four d 4 \n", " five e 5 \n", " six f 6 \n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['col2', 'col0']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing a Range of Columns\n", "\n", "**A slice of the DataFrame by columns can be obtained by specifying a Range in #[]**" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:12007160 rows: 6 cols: 2
col1col2
one11a
two22b
three33c
four44d
five55e
six66f
" ], "text/plain": [ "\n", "#\n", " col1 col2 \n", " one 11 a \n", " two 22 b \n", " three 33 c \n", " four 44 d \n", " five 55 e \n", " six 66 f \n" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['col1'..'col2']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assigning a Column\n", "\n", "**You can assign a Daru::Vector to a column and the indexes of the Vector will be automatically matched to that of the DataFrame.**" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:13337740 rows: 6 cols: 3
col0col1col2
one1thisa
two2someb
three3isc
four4datad
five5heree
six6newf
" ], "text/plain": [ "\n", "#\n", " col0 col1 col2 \n", " one 1 this a \n", " two 2 some b \n", " three 3 is c \n", " four 4 data d \n", " five 5 here e \n", " six 6 new f \n" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['col1'] = Daru::Vector.new(['this', 'is', 'some','new','data','here'], \n", " index: ['one', 'three','two','six','four', 'five'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing a Row\n", "\n", "**A single row can be accessed using the `#row[]` function.**" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:11115780 size: 3
four
col04
col1data
col2d
" ], "text/plain": [ "\n", "#\n", " four\n", "col0 4\n", "col1 data\n", "col2 d\n" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.row['four']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing a Range of Rows\n", "\n", "**Specifying a Range of Row indexes in `#row[]` will select a DataFrame with those rows**" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:9135240 rows: 3 cols: 3
col0col1col2
three3isc
four4datad
five5heree
" ], "text/plain": [ "\n", "#\n", " col0 col1 col2 \n", " three 3 is c \n", " four 4 data d \n", " five 5 here e \n" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.row['three'..'five']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assigning a Row\n", "\n", "**You can also assign a Row with Daru::Vector. Notice that indexes are mathced according to the order of the DataFrame.**" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[666, 555, 333]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.row['five'] = [666,555,333]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistics on Vector with missing data\n", "\n", "**A host of static and rolling statistics methods are provided on Daru::Vector.**\n", "\n", "**Note that missing data (very common in most real world scenarios) is gracefully handled**" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "12.8" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = Daru::Vector.new([1,3,5,nil,2,53,nil])\n", "vector.mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistics on DataFrame\n", "\n", "**DataFrame statistics will basically apply the concerned method on all numerical columns of the DataFrame.**" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:8060380 size: 1
mean
col0113.66666666666667
" ], "text/plain": [ "\n", "#\n", " mean\n", " col0 113.66666666666667\n" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Useful statistics about the vectors in a DataFrame can be observed with `#describe`**" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:7470980 rows: 5 cols: 1
col0
count6
mean113.66666666666667
std270.5924364550249
min1
max666
" ], "text/plain": [ "\n", "#\n", " col0 \n", " count 6 \n", " mean 113.666666 \n", " std 270.592436 \n", " min 1 \n", " max 666 \n" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Time Series Support\n", "\n", "**Daru offers a robust time series manipulation API for indexing data based on timestamps. This makes daru a viable tool for analyzing financial data (or any data that changes with time)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The DateTimeIndex\n", "\n", "**The DateTimeIndex is a special index for indexing data based on timestamps.**\n", "\n", "**A date index range can be created using the DateTimeIndex.date_range function. The `:freq` option decides the time frequency between each timestamp in the date index.**" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "#" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = Daru::DateTimeIndex.date_range(:start => '2012', :periods => 1000, :freq => '3D')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**A Daru::Vector can be created by simply passing the newly created index object into the `:index` argument.**" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:5628020 size: 1000
nil
2012-01-01T00:00:00+00:000.692831672574459
2012-01-04T00:00:00+00:000.6971783281963972
2012-01-07T00:00:00+00:000.34687766698487965
2012-01-10T00:00:00+00:000.5509404993547384
2012-01-13T00:00:00+00:000.10166975999865946
2012-01-16T00:00:00+00:000.34183413903843207
2012-01-19T00:00:00+00:000.018428168123970967
2012-01-22T00:00:00+00:000.7792652522504137
2012-01-25T00:00:00+00:000.24793667731961144
2012-01-28T00:00:00+00:000.7200752551979407
2012-01-31T00:00:00+00:000.770756064084555
2012-02-03T00:00:00+00:000.6475396341969668
2012-02-06T00:00:00+00:000.00034544180080875453
2012-02-09T00:00:00+00:000.9881939271758362
2012-02-12T00:00:00+00:000.042428559674003274
2012-02-15T00:00:00+00:000.6604582692043693
2012-02-18T00:00:00+00:000.6446959879056338
2012-02-21T00:00:00+00:000.11606340772777746
2012-02-24T00:00:00+00:000.5238981665473298
2012-02-27T00:00:00+00:000.25979569124671453
2012-03-01T00:00:00+00:000.1808967702663009
2012-03-04T00:00:00+00:000.04614156947957693
2012-03-07T00:00:00+00:000.8935716437439504
2012-03-10T00:00:00+00:000.7197074871013468
2012-03-13T00:00:00+00:000.20741375904156445
2012-03-16T00:00:00+00:000.501647901862296
2012-03-19T00:00:00+00:000.9470421480253584
2012-03-22T00:00:00+00:000.2954430257659184
2012-03-25T00:00:00+00:000.18422816661946229
2012-03-28T00:00:00+00:000.48737285121462925
2012-03-31T00:00:00+00:000.7549290269495055
2012-04-03T00:00:00+00:000.8216050188191338
......
2020-03-16T00:00:00+00:000.8324422863437039
" ], "text/plain": [ "\n", "#\n", " nil\n", "2012-01-01T00:00:00+ 0.692831672574459\n", "2012-01-04T00:00:00+ 0.6971783281963972\n", "2012-01-07T00:00:00+ 0.34687766698487965\n", "2012-01-10T00:00:00+ 0.5509404993547384\n", "2012-01-13T00:00:00+ 0.10166975999865946\n", "2012-01-16T00:00:00+ 0.34183413903843207\n", "2012-01-19T00:00:00+ 0.018428168123970967\n", "2012-01-22T00:00:00+ 0.7792652522504137\n", "2012-01-25T00:00:00+ 0.24793667731961144\n", "2012-01-28T00:00:00+ 0.7200752551979407\n", "2012-01-31T00:00:00+ 0.770756064084555\n", "2012-02-03T00:00:00+ 0.6475396341969668\n", "2012-02-06T00:00:00+ 0.000345441800808754\n", "2012-02-09T00:00:00+ 0.9881939271758362\n", "2012-02-12T00:00:00+ 0.042428559674003274\n", "2012-02-15T00:00:00+ 0.6604582692043693\n", "2012-02-18T00:00:00+ 0.6446959879056338\n", " ... ...\n" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "timeseries = Daru::Vector.new(1000.times.map {rand}, index: index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing data by partial timestamps\n", "\n", "**When a Vector or DataFrame is indexed by a DateTimeIndex, it allows you to partially specify the date to retreive all the data that belongs to that date.**\n", "\n", "**For example, to access all the data belonging to the year 2012.**" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:15406520 size: 122
nil
2012-01-01T00:00:00+00:000.692831672574459
2012-01-04T00:00:00+00:000.6971783281963972
2012-01-07T00:00:00+00:000.34687766698487965
2012-01-10T00:00:00+00:000.5509404993547384
2012-01-13T00:00:00+00:000.10166975999865946
2012-01-16T00:00:00+00:000.34183413903843207
2012-01-19T00:00:00+00:000.018428168123970967
2012-01-22T00:00:00+00:000.7792652522504137
2012-01-25T00:00:00+00:000.24793667731961144
2012-01-28T00:00:00+00:000.7200752551979407
2012-01-31T00:00:00+00:000.770756064084555
2012-02-03T00:00:00+00:000.6475396341969668
2012-02-06T00:00:00+00:000.00034544180080875453
2012-02-09T00:00:00+00:000.9881939271758362
2012-02-12T00:00:00+00:000.042428559674003274
2012-02-15T00:00:00+00:000.6604582692043693
2012-02-18T00:00:00+00:000.6446959879056338
2012-02-21T00:00:00+00:000.11606340772777746
2012-02-24T00:00:00+00:000.5238981665473298
2012-02-27T00:00:00+00:000.25979569124671453
2012-03-01T00:00:00+00:000.1808967702663009
2012-03-04T00:00:00+00:000.04614156947957693
2012-03-07T00:00:00+00:000.8935716437439504
2012-03-10T00:00:00+00:000.7197074871013468
2012-03-13T00:00:00+00:000.20741375904156445
2012-03-16T00:00:00+00:000.501647901862296
2012-03-19T00:00:00+00:000.9470421480253584
2012-03-22T00:00:00+00:000.2954430257659184
2012-03-25T00:00:00+00:000.18422816661946229
2012-03-28T00:00:00+00:000.48737285121462925
2012-03-31T00:00:00+00:000.7549290269495055
2012-04-03T00:00:00+00:000.8216050188191338
......
2012-12-29T00:00:00+00:000.26155523165437944
" ], "text/plain": [ "\n", "#\n", " nil\n", "2012-01-01T00:00:00+ 0.692831672574459\n", "2012-01-04T00:00:00+ 0.6971783281963972\n", "2012-01-07T00:00:00+ 0.34687766698487965\n", "2012-01-10T00:00:00+ 0.5509404993547384\n", "2012-01-13T00:00:00+ 0.10166975999865946\n", "2012-01-16T00:00:00+ 0.34183413903843207\n", "2012-01-19T00:00:00+ 0.018428168123970967\n", "2012-01-22T00:00:00+ 0.7792652522504137\n", "2012-01-25T00:00:00+ 0.24793667731961144\n", "2012-01-28T00:00:00+ 0.7200752551979407\n", "2012-01-31T00:00:00+ 0.770756064084555\n", "2012-02-03T00:00:00+ 0.6475396341969668\n", "2012-02-06T00:00:00+ 0.000345441800808754\n", "2012-02-09T00:00:00+ 0.9881939271758362\n", "2012-02-12T00:00:00+ 0.042428559674003274\n", "2012-02-15T00:00:00+ 0.6604582692043693\n", "2012-02-18T00:00:00+ 0.6446959879056338\n", " ... ...\n" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "timeseries['2012']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Or to access data whose time stamp is March 2012...**" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:14832480 size: 11
nil
2012-03-01T00:00:00+00:000.1808967702663009
2012-03-04T00:00:00+00:000.04614156947957693
2012-03-07T00:00:00+00:000.8935716437439504
2012-03-10T00:00:00+00:000.7197074871013468
2012-03-13T00:00:00+00:000.20741375904156445
2012-03-16T00:00:00+00:000.501647901862296
2012-03-19T00:00:00+00:000.9470421480253584
2012-03-22T00:00:00+00:000.2954430257659184
2012-03-25T00:00:00+00:000.18422816661946229
2012-03-28T00:00:00+00:000.48737285121462925
2012-03-31T00:00:00+00:000.7549290269495055
" ], "text/plain": [ "\n", "#\n", " nil\n", "2012-03-01T00:00:00+ 0.1808967702663009\n", "2012-03-04T00:00:00+ 0.04614156947957693\n", "2012-03-07T00:00:00+ 0.8935716437439504\n", "2012-03-10T00:00:00+ 0.7197074871013468\n", "2012-03-13T00:00:00+ 0.20741375904156445\n", "2012-03-16T00:00:00+ 0.501647901862296\n", "2012-03-19T00:00:00+ 0.9470421480253584\n", "2012-03-22T00:00:00+ 0.2954430257659184\n", "2012-03-25T00:00:00+ 0.18422816661946229\n", "2012-03-28T00:00:00+ 0.48737285121462925\n", "2012-03-31T00:00:00+ 0.7549290269495055\n" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "timeseries['2012-3']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Specifying the date precisely will return the exact data point (You can also pass a ruby DateTime object for precisely obtaining data).**" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.7197074871013468" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "timeseries['2012-3-10']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Say you have per second data about the price of a commodity and want to access the prices for the minute on 23rd of March 2012 at 12:42 pm**" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:28416340 size: 60
nil
2012-03-23T12:42:00+00:004
2012-03-23T12:42:01+00:0032
2012-03-23T12:42:02+00:0035
2012-03-23T12:42:03+00:0035
2012-03-23T12:42:04+00:0014
2012-03-23T12:42:05+00:001
2012-03-23T12:42:06+00:0043
2012-03-23T12:42:07+00:0039
2012-03-23T12:42:08+00:0020
2012-03-23T12:42:09+00:0016
2012-03-23T12:42:10+00:0043
2012-03-23T12:42:11+00:000
2012-03-23T12:42:12+00:0027
2012-03-23T12:42:13+00:0043
2012-03-23T12:42:14+00:0043
2012-03-23T12:42:15+00:0018
2012-03-23T12:42:16+00:0035
2012-03-23T12:42:17+00:0039
2012-03-23T12:42:18+00:0035
2012-03-23T12:42:19+00:0023
2012-03-23T12:42:20+00:0025
2012-03-23T12:42:21+00:0013
2012-03-23T12:42:22+00:005
2012-03-23T12:42:23+00:0043
2012-03-23T12:42:24+00:0013
2012-03-23T12:42:25+00:0028
2012-03-23T12:42:26+00:002
2012-03-23T12:42:27+00:0042
2012-03-23T12:42:28+00:0029
2012-03-23T12:42:29+00:0036
2012-03-23T12:42:30+00:0044
2012-03-23T12:42:31+00:0036
......
2012-03-23T12:42:59+00:008
" ], "text/plain": [ "\n", "#\n", " nil\n", "2012-03-23T12:42:00+ 4\n", "2012-03-23T12:42:01+ 32\n", "2012-03-23T12:42:02+ 35\n", "2012-03-23T12:42:03+ 35\n", "2012-03-23T12:42:04+ 14\n", "2012-03-23T12:42:05+ 1\n", "2012-03-23T12:42:06+ 43\n", "2012-03-23T12:42:07+ 39\n", "2012-03-23T12:42:08+ 20\n", "2012-03-23T12:42:09+ 16\n", "2012-03-23T12:42:10+ 43\n", "2012-03-23T12:42:11+ 0\n", "2012-03-23T12:42:12+ 27\n", "2012-03-23T12:42:13+ 43\n", "2012-03-23T12:42:14+ 43\n", "2012-03-23T12:42:15+ 18\n", "2012-03-23T12:42:16+ 35\n", " ... ...\n" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = Daru::DateTimeIndex.date_range(\n", " :start => '2012-3-23 11:00', :periods => 20000, :freq => 'S')\n", "\n", "seconds_ts = Daru::Vector.new(20000.times.map { rand(50) }, index: index)\n", "seconds_ts['2012-3-23 12:42']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization\n", "\n", "### Simple Visualization with interactive graphs\n", "\n", "**Plotting a simple scatter plot from a DataFrame. Nyaplot integration provides interactivity.**\n", "\n", "**DataFrame denoting Ice Cream sales of a particular food chain in a city according to the maximum recorded temperature in that city. It also lists the staff strength present in each city.**" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:4800060 rows: 10 cols: 4
citysalesstafftemperature
0Pune3501530.4
1Delhi1502023.5
2Pune5001544.5
3Delhi2002020.3
4Pune4801534
5Delhi2502024
6Pune3301531.45
7Delhi4002028.34
8Pune4201537
9Delhi5602024
" ], "text/plain": [ "\n", "#\n", " city sales staff temperatur \n", " 0 Pune 350 15 30.4 \n", " 1 Delhi 150 20 23.5 \n", " 2 Pune 500 15 44.5 \n", " 3 Delhi 200 20 20.3 \n", " 4 Pune 480 15 34 \n", " 5 Delhi 250 20 24 \n", " 6 Pune 330 15 31.45 \n", " 7 Delhi 400 20 28.34 \n", " 8 Pune 420 15 37 \n", " 9 Delhi 560 20 24 \n" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = Daru::DataFrame.new({\n", " :temperature => [30.4, 23.5, 44.5, 20.3, 34, 24, 31.45, 28.34, 37, 24],\n", " :sales => [350, 150, 500, 200, 480, 250, 330, 400, 420, 560],\n", " :city => ['Pune', 'Delhi']*5,\n", " :staff => [15,20]*5\n", "})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The plot below is between Temperature in the city and the sales of ice cream.**" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "#[#[#:scatter, :options=>{:x=>:temperature, :y=>:sales, :tooltip_contents=>[:city, :staff], :color=>#, :fill_by=>:city, :shape_by=>:city}, :data=>\"1a49603a-2bea-4353-b752-4ede957a4440\"}, @xrange=[20.3, 44.5], @yrange=[150, 560]>], :options=>{:x_label=>\"Temperature\", :y_label=>\"Sales\", :yrange=>[100, 600], :xrange=>[15, 50], :zoom=>true, :width=>700}}>], :data=>{\"1a49603a-2bea-4353-b752-4ede957a4440\"=>#\"Pune\", :sales=>350, :staff=>15, :temperature=>30.4}, {:city=>\"Delhi\", :sales=>150, :staff=>20, :temperature=>23.5}, {:city=>\"Pune\", :sales=>500, :staff=>15, :temperature=>44.5}, {:city=>\"Delhi\", :sales=>200, :staff=>20, :temperature=>20.3}, {:city=>\"Pune\", :sales=>480, :staff=>15, :temperature=>34}, {:city=>\"Delhi\", :sales=>250, :staff=>20, :temperature=>24}, {:city=>\"Pune\", :sales=>330, :staff=>15, :temperature=>31.45}, {:city=>\"Delhi\", :sales=>400, :staff=>20, :temperature=>28.34}, {:city=>\"Pune\", :sales=>420, :staff=>15, :temperature=>37}, {:city=>\"Delhi\", :sales=>560, :staff=>20, :temperature=>24}]>}, :extension=>[]}>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.plot(type: :scatter, x: :temperature, y: :sales) do |plot, diagram|\n", " plot.x_label \"Temperature\"\n", " plot.y_label \"Sales\"\n", " plot.yrange [100, 600]\n", " plot.xrange [15, 50]\n", " diagram.tooltip_contents([:city, :staff])\n", " # Set the color scheme for this diagram.\n", " diagram.color(Nyaplot::Colors.qual) \n", " # Color each point according to the city that it belongs to.\n", " diagram.fill_by(:city)\n", " # Shape each point according to the city that it belongs to.\n", " diagram.shape_by(:city) \n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use with gnuplot\n", "\n", "#### Plotting a time series with its rolling mean\n", "\n", "**Initialize a random number generator that draws from a normal distribution**" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { 
"collapsed": false }, "outputs": [ { "data": { "text/plain": [ "#" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = Distribution::Normal.rng" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/svg+xml": [ "Gnuplot\n", "Produced by GNUPLOT 5.0 patchlevel 3\n", "[Line plot of 'Vector' and 'Rolling Mean'; x axis: Time (Apr 2012 to Jan 2015); y axis: Value (-15 to 25)]\n" ], "text/plain": [ "# \"time\", :format_x => \"%d\\\\n%b\\\\n%Y\", :timefmt => \"%Y-%m-%dT%H:%M:%S\", :ylabel => \"Value\", :xlabel => \"Time\"], @datasets=Hamster::Vector[#, @options=Hamster::Hash[:using => \"1:2\", :with => \"lines\", :title => \"Vector\"]>, #, @options=Hamster::Hash[:using => \"1:2\", :with => \"lines\", :title => \"Rolling Mean\"]>], @cmd=\"plot \">" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = Daru::DateTimeIndex.date_range(:start => '2012-4-2', :periods => 1000)\n", "vector = Daru::Vector.new(1000.times.map {rng.call}, index: index)\n", "vector = vector.cumsum\n", "rolling_mean = vector.rolling_mean 60\n", "\n", "GnuplotRB::Plot.new(\n", " [vector , with: 'lines', title: 'Vector'], \n", " [rolling_mean, with: 'lines', title: 'Rolling Mean'],\n", " xlabel: 'Time', ylabel: 'Value'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Arel-like syntax\n", "\n", "**Web devs will feel right at home!**\n", "\n", "**Fast and intuitive syntax for retrieving data with boolean indexing.**\n", "\n", "### The 'where' clause" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:5195920 rows: 600 cols: 3
abc
1021a11
1772b22
3543c33
1634d44
2305e55
3326f66
1711a11
1232b22
4703c33
4714d44
3095e55
236f66
151a11
262b22
3123c33
4844d44
3865e55
726f66
5061a11
962b22
1833c33
904d44
4515e55
2786f66
5291a11
872b22
2563c33
4154d44
4215e55
4856f66
1391a11
4822b22
............
5136f66
" ], "text/plain": [ "\n", "#\n", " a b c \n", " 102 1 a 11 \n", " 177 2 b 22 \n", " 354 3 c 33 \n", " 163 4 d 44 \n", " 230 5 e 55 \n", " 332 6 f 66 \n", " 171 1 a 11 \n", " 123 2 b 22 \n", " 470 3 c 33 \n", " 471 4 d 44 \n", " 309 5 e 55 \n", " 23 6 f 66 \n", " 15 1 a 11 \n", " 26 2 b 22 \n", " 312 3 c 33 \n", " ... ... ... ... \n" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = Daru::DataFrame.new({\n", " a: [1,2,3,4,5,6]*100,\n", " b: ['a','b','c','d','e','f']*100,\n", " c: [11,22,33,44,55,66]*100\n", "}, index: (1..600).to_a.shuffle)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Compares with a bunch of scalar quantities and returns a DataFrame wherever they return *true***" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:14856680 rows: 200 cols: 3
abc
1772b22
2305e55
1232b22
3095e55
262b22
3865e55
962b22
4515e55
872b22
4215e55
4822b22
2545e55
522b22
2825e55
2672b22
3045e55
362b22
4245e55
3032b22
3535e55
3762b22
1155e55
552b22
75e55
4782b22
2395e55
3562b22
5305e55
992b22
815e55
5952b22
4365e55
............
5325e55
" ], "text/plain": [ "\n", "#\n", " a b c \n", " 177 2 b 22 \n", " 230 5 e 55 \n", " 123 2 b 22 \n", " 309 5 e 55 \n", " 26 2 b 22 \n", " 386 5 e 55 \n", " 96 2 b 22 \n", " 451 5 e 55 \n", " 87 2 b 22 \n", " 421 5 e 55 \n", " 482 2 b 22 \n", " 254 5 e 55 \n", " 52 2 b 22 \n", " 282 5 e 55 \n", " 267 2 b 22 \n", " ... ... ... ... \n" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.where(df[:a].eq(2).or(df[:c].eq(55)))" ] } ], "metadata": { "kernelspec": { "display_name": "Ruby 2.2.1", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": ".rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "2.2.1" } }, "nbformat": 4, "nbformat_minor": 0 }