{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Searching and Combining data in daru\n", "\n", "## Arel-like query syntax\n", "\n", "[Arel](https://github.com/rails/arel) is a very popular ruby gem that is one of the major components of the most popular ruby frameworks, [Rails](https://github.com/rails/rails). It is an ORM-helper of sorts that exposes a beatiful and intuitive syntax for creating SQL strings by chaining Ruby methods.\n", "\n", "In daru, we have successfully adopted this syntax and the result is a very intuitive and readable syntax for obtaining any sort of data from a DataFrame or Vector.\n", "\n", "Lets see how this syntax works with some examples:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "application/javascript": [ "if(window['d3'] === undefined ||\n", " window['Nyaplot'] === undefined){\n", " var path = {\"d3\":\"http://d3js.org/d3.v3.min\",\"downloadable\":\"http://cdn.rawgit.com/domitry/d3-downloadable/master/d3-downloadable\"};\n", "\n", "\n", "\n", " var shim = {\"d3\":{\"exports\":\"d3\"},\"downloadable\":{\"exports\":\"downloadable\"}};\n", "\n", " require.config({paths: path, shim:shim});\n", "\n", "\n", "require(['d3'], function(d3){window['d3']=d3;console.log('finished loading d3');require(['downloadable'], function(downloadable){window['downloadable']=downloadable;console.log('finished loading downloadable');\n", "\n", "\tvar script = d3.select(\"head\")\n", "\t .append(\"script\")\n", "\t .attr(\"src\", \"http://cdn.rawgit.com/domitry/Nyaplotjs/master/release/nyaplot.js\")\n", "\t .attr(\"async\", true);\n", "\n", "\tscript[0][0].onload = script[0][0].onreadystatechange = function(){\n", "\n", "\n", "\t var event = document.createEvent(\"HTMLEvents\");\n", "\t event.initEvent(\"load_nyaplot\",false,false);\n", "\t window.dispatchEvent(event);\n", "\t console.log('Finished loading Nyaplotjs');\n", "\n", "\t};\n", "\n", "\n", "});});\n", "}\n" ], "text/plain": [ "\"if(window['d3'] === undefined ||\\n window['Nyaplot'] === undefined){\\n var path = {\\\"d3\\\":\\\"http://d3js.org/d3.v3.min\\\",\\\"downloadable\\\":\\\"http://cdn.rawgit.com/domitry/d3-downloadable/master/d3-downloadable\\\"};\\n\\n\\n\\n var shim = {\\\"d3\\\":{\\\"exports\\\":\\\"d3\\\"},\\\"downloadable\\\":{\\\"exports\\\":\\\"downloadable\\\"}};\\n\\n require.config({paths: path, shim:shim});\\n\\n\\nrequire(['d3'], function(d3){window['d3']=d3;console.log('finished loading d3');require(['downloadable'], function(downloadable){window['downloadable']=downloadable;console.log('finished loading downloadable');\\n\\n\\tvar script = d3.select(\\\"head\\\")\\n\\t .append(\\\"script\\\")\\n\\t .attr(\\\"src\\\", \\\"http://cdn.rawgit.com/domitry/Nyaplotjs/master/release/nyaplot.js\\\")\\n\\t .attr(\\\"async\\\", true);\\n\\n\\tscript[0][0].onload = script[0][0].onreadystatechange = function(){\\n\\n\\n\\t var event = document.createEvent(\\\"HTMLEvents\\\");\\n\\t event.initEvent(\\\"load_nyaplot\\\",false,false);\\n\\t window.dispatchEvent(event);\\n\\t console.log('Finished loading Nyaplotjs');\\n\\n\\t};\\n\\n\\n});});\\n}\\n\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "true" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'daru'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use this syntax we call any of the comparator methods defined on `Daru::Vector` and pass the results from these to the `#where` method available for both `DataFrame` and `Vector`.\n", "\n", "To demonstrate with a quick example:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:75332930 size: 5
nil
25
45
75
115
135
" ], "text/plain": [ "\n", "#\n", " nil\n", " 2 5\n", " 4 5\n", " 7 5\n", " 11 5\n", " 13 5\n" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = Daru::Vector.new([2,4,5,51,5,16,2,5,3,2,1,5,2,5,2,1,56,234,6,21])\n", "vector.where((vector.eq(5) | vector.eq(1)) & vector.mt(2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Note that we use union OR (`|`) and union AND (`&`) and not logical OR (`||`) or logical AND (`&&`) **\n", "\n", "Thus the results returned by the `Vector#eq` method are evaluated by `Vector#where` and the generated Vector is returned. The index is also preserved.\n", "\n", "The `where` clause can also be used with DataFrame, with similar results." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:81901580 rows: 200 cols: 3
abc
332b22
4655e55
5272b22
1145e55
482b22
5935e55
4162b22
4155e55
3342b22
1655e55
5332b22
495e55
3602b22
1645e55
2402b22
3695e55
4142b22
5655e55
5252b22
765e55
3902b22
3225e55
1872b22
4005e55
3412b22
2885e55
3482b22
625e55
4212b22
3365e55
4742b22
2295e55
............
1965e55
" ], "text/plain": [ "\n", "#\n", " a b c \n", " 33 2 b 22 \n", " 465 5 e 55 \n", " 527 2 b 22 \n", " 114 5 e 55 \n", " 48 2 b 22 \n", " 593 5 e 55 \n", " 416 2 b 22 \n", " 415 5 e 55 \n", " 334 2 b 22 \n", " 165 5 e 55 \n", " 533 2 b 22 \n", " 49 5 e 55 \n", " 360 2 b 22 \n", " 164 5 e 55 \n", " 240 2 b 22 \n", " ... ... ... ... \n" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = Daru::DataFrame.new({\n", " a: [1,2,3,4,5,6]*100,\n", " b: ['a','b','c','d','e','f']*100,\n", " c: [11,22,33,44,55,66]*100\n", "}, index: (1..600).to_a.shuffle)\n", "\n", "df.where(df[:a].eq(2) | df[:c].eq(55))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The comparator methods on Vector return an object of type `Daru::Core::Query::BoolArray` which lets us perform **OR** and **AND** operations on it. See [this blog post]() for more information on BoolArray and other comparator methods.\n", "\n", "As a convenience for readability, you can also you the `#and` or `#or` methods instead of `#&` and `#|`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:82473630 size: 5
nil
25
45
75
115
135
" ], "text/plain": [ "\n", "#\n", " nil\n", " 2 5\n", " 4 5\n", " 7 5\n", " 11 5\n", " 13 5\n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.where(vector.eq(2).or(vector.eq(5)).and(vector.mt(2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the major advantages of using the `where` clause over other more robust methods like `DataFrame#filter` or `Vector#keep_if` is that it is much faster (though not destructive). [These benchmarks](https://github.com/v0dro/daru/blob/master/benchmarks/where_vs_filter.rb) prove my point." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performing joins between DataFrames\n", "\n", "\n", "Daru::DataFrame offers the `#join` method for performing SQL style joins between two DataFrames. Currently #join supports inner, left outer, right outer and full outer joins between DataFrames.\n", "\n", "To demonstrate:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "left = Daru::DataFrame.new({\n", " :id => [1,2,3,4],\n", " :name => ['Pirate', 'Monkey', 'Ninja', 'Spaghetti']\n", "})\n", "right = Daru::DataFrame.new({\n", " :id => [1,2,3,4],\n", " :name => ['Rutabaga', 'Pirate', 'Darth Vader', 'Ninja']\n", "})\n", "nil" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform an inner join on the `:name` column:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:82190730 rows: 2 cols: 3
id_1nameid_2
01Pirate2
13Ninja4
" ], "text/plain": [ "\n", "#\n", " id_1 name id_2 \n", " 0 1 Pirate 2 \n", " 1 3 Ninja 4 \n" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left.join(right, on: [:name], how: :inner)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An outer left join can be done with:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:82015360 rows: 4 cols: 3
id_1nameid_2
01Pirate2
12Monkey
23Ninja4
34Spaghetti
" ], "text/plain": [ "\n", "#\n", " id_1 name id_2 \n", " 0 1 Pirate 2 \n", " 1 2 Monkey nil \n", " 2 3 Ninja 4 \n", " 3 4 Spaghetti nil \n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left.join(right, on: [:name], how: :left)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An outer right join can be done like so:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:81791190 rows: 4 cols: 3
id_1nameid_2
0Rutabaga1
11Pirate2
2Darth Vader3
33Ninja4
" ], "text/plain": [ "\n", "#\n", " id_1 name id_2 \n", " 0 nil Rutabaga 1 \n", " 1 1 Pirate 2 \n", " 2 nil Darth Vade 3 \n", " 3 3 Ninja 4 \n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left.join(right, on: [:name], how: :right)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally, a full outer join:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:81527400 rows: 6 cols: 3
id_1nameid_2
01Pirate2
12Monkey
23Ninja4
34Spaghetti
4Rutabaga1
5Darth Vader3
" ], "text/plain": [ "\n", "#\n", " id_1 name id_2 \n", " 0 1 Pirate 2 \n", " 1 2 Monkey nil \n", " 2 3 Ninja 4 \n", " 3 4 Spaghetti nil \n", " 4 nil Rutabaga 1 \n", " 5 nil Darth Vade 3 \n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left.join(right, on: [:name], how: :outer)" ] } ], "metadata": { "kernelspec": { "display_name": "Ruby 2.2.1", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": "rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "2.2.1" } }, "nbformat": 4, "nbformat_minor": 0 }