{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Searching and Combining data in daru\n", "\n", "## Arel-like query syntax\n", "\n", "[Arel](https://github.com/rails/arel) is a very popular ruby gem that is one of the major components of the most popular ruby frameworks, [Rails](https://github.com/rails/rails). It is an ORM-helper of sorts that exposes a beautiful and intuitive syntax for creating SQL strings by chaining Ruby methods.\n", "\n", "In daru, we have successfully adopted this syntax and the result is a very intuitive and readable syntax for obtaining any sort of data from a DataFrame or Vector.\n", "\n", "Lets see how this syntax works with some examples:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "application/javascript": [ "if(window['d3'] === undefined ||\n", " window['Nyaplot'] === undefined){\n", " var path = {\"d3\":\"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min\",\"downloadable\":\"https://cdn.rawgit.com/domitry/d3-downloadable/master/d3-downloadable\"};\n", "\n", "\n", "\n", " var shim = {\"d3\":{\"exports\":\"d3\"},\"downloadable\":{\"exports\":\"downloadable\"}};\n", "\n", " require.config({paths: path, shim:shim});\n", "\n", "\n", "require(['d3'], function(d3){window['d3']=d3;console.log('finished loading d3');require(['downloadable'], function(downloadable){window['downloadable']=downloadable;console.log('finished loading downloadable');\n", "\n", "\tvar script = d3.select(\"head\")\n", "\t .append(\"script\")\n", "\t .attr(\"src\", \"https://cdn.rawgit.com/domitry/Nyaplotjs/master/release/nyaplot.js\")\n", "\t .attr(\"async\", true);\n", "\n", "\tscript[0][0].onload = script[0][0].onreadystatechange = function(){\n", "\n", "\n", "\t var event = document.createEvent(\"HTMLEvents\");\n", "\t event.initEvent(\"load_nyaplot\",false,false);\n", "\t window.dispatchEvent(event);\n", "\t console.log('Finished loading Nyaplotjs');\n", "\n", "\t};\n", "\n", "\n", "});});\n", "}\n" ], "text/plain": [ "\"if(window['d3'] === undefined ||\\n window['Nyaplot'] === undefined){\\n var path = {\\\"d3\\\":\\\"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min\\\",\\\"downloadable\\\":\\\"https://cdn.rawgit.com/domitry/d3-downloadable/master/d3-downloadable\\\"};\\n\\n\\n\\n var shim = {\\\"d3\\\":{\\\"exports\\\":\\\"d3\\\"},\\\"downloadable\\\":{\\\"exports\\\":\\\"downloadable\\\"}};\\n\\n require.config({paths: path, shim:shim});\\n\\n\\nrequire(['d3'], function(d3){window['d3']=d3;console.log('finished loading d3');require(['downloadable'], function(downloadable){window['downloadable']=downloadable;console.log('finished loading downloadable');\\n\\n\\tvar script = d3.select(\\\"head\\\")\\n\\t .append(\\\"script\\\")\\n\\t .attr(\\\"src\\\", \\\"https://cdn.rawgit.com/domitry/Nyaplotjs/master/release/nyaplot.js\\\")\\n\\t .attr(\\\"async\\\", true);\\n\\n\\tscript[0][0].onload = script[0][0].onreadystatechange = function(){\\n\\n\\n\\t var event = document.createEvent(\\\"HTMLEvents\\\");\\n\\t event.initEvent(\\\"load_nyaplot\\\",false,false);\\n\\t window.dispatchEvent(event);\\n\\t console.log('Finished loading Nyaplotjs');\\n\\n\\t};\\n\\n\\n});});\\n}\\n\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "true" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'daru'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use this syntax we call any of the comparator methods defined on `Daru::Vector` and pass the results from these to the `#where` method available for both `DataFrame` and `Vector`.\n", "\n", "To demonstrate with a quick example:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:21491980 size: 5
nil
25
45
75
115
135
" ], "text/plain": [ "\n", "#\n", " nil\n", " 2 5\n", " 4 5\n", " 7 5\n", " 11 5\n", " 13 5\n" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = Daru::Vector.new([2,4,5,51,5,16,2,5,3,2,1,5,2,5,2,1,56,234,6,21])\n", "vector.where((vector.eq(5) | vector.eq(1)) & vector.mt(2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Note that we use union OR (`|`) and union AND (`&`) and not logical OR (`||`) or logical AND (`&&`) **\n", "\n", "Thus the results returned by the `Vector#eq` method are evaluated by `Vector#where` and the generated Vector is returned. The index is also preserved.\n", "\n", "The `where` clause can also be used with DataFrame, with similar results." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:19261300 rows: 200 cols: 3
abc
2632b22
2485e55
3392b22
1605e55
5052b22
3115e55
1992b22
2325e55
4102b22
1255e55
3502b22
5725e55
292b22
1665e55
682b22
2705e55
2732b22
4165e55
112b22
4925e55
852b22
1985e55
3062b22
3645e55
5892b22
2205e55
4672b22
3415e55
2692b22
1905e55
1582b22
3545e55
............
5145e55
" ], "text/plain": [ "\n", "#\n", " a b c \n", " 263 2 b 22 \n", " 248 5 e 55 \n", " 339 2 b 22 \n", " 160 5 e 55 \n", " 505 2 b 22 \n", " 311 5 e 55 \n", " 199 2 b 22 \n", " 232 5 e 55 \n", " 410 2 b 22 \n", " 125 5 e 55 \n", " 350 2 b 22 \n", " 572 5 e 55 \n", " 29 2 b 22 \n", " 166 5 e 55 \n", " 68 2 b 22 \n", " ... ... ... ... \n" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = Daru::DataFrame.new({\n", " a: [1,2,3,4,5,6]*100,\n", " b: ['a','b','c','d','e','f']*100,\n", " c: [11,22,33,44,55,66]*100\n", "}, index: (1..600).to_a.shuffle)\n", "\n", "df.where(df[:a].eq(2) | df[:c].eq(55))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The comparator methods on Vector return an object of type `Daru::Core::Query::BoolArray` which lets us perform **OR** and **AND** operations on it. See [this blog post]() for more information on BoolArray and other comparator methods.\n", "\n", "As a convenience for readability, you can also you the `#and` or `#or` methods instead of `#&` and `#|`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::Vector:18369620 size: 5
nil
25
45
75
115
135
" ], "text/plain": [ "\n", "#\n", " nil\n", " 2 5\n", " 4 5\n", " 7 5\n", " 11 5\n", " 13 5\n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.where(vector.eq(2).or(vector.eq(5)).and(vector.mt(2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the major advantages of using the `where` clause over other more robust methods like `DataFrame#filter` or `Vector#keep_if` is that it is much faster (though not destructive). [These benchmarks](https://github.com/v0dro/daru/blob/master/benchmarks/where_vs_filter.rb) prove my point." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performing joins between DataFrames\n", "\n", "\n", "Daru::DataFrame offers the `#join` method for performing SQL style joins between two DataFrames. Currently #join supports inner, left outer, right outer and full outer joins between DataFrames.\n", "\n", "To demonstrate:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "left = Daru::DataFrame.new({\n", " :id => [1,2,3,4],\n", " :name => ['Pirate', 'Monkey', 'Ninja', 'Spaghetti']\n", "})\n", "right = Daru::DataFrame.new({\n", " :id => [1,2,3,4],\n", " :name => ['Rutabaga', 'Pirate', 'Darth Vader', 'Ninja']\n", "})\n", "nil" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform an inner join on the `:name` column:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:12529040 rows: 2 cols: 3
id_1nameid_2
01Pirate2
13Ninja4
" ], "text/plain": [ "\n", "#\n", " id_1 name id_2 \n", " 0 1 Pirate 2 \n", " 1 3 Ninja 4 \n" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left.join(right, on: [:name], how: :inner)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An outer left join can be done with:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:12260160 rows: 4 cols: 3
id_1nameid_2
01Pirate2
12Monkey
23Ninja4
34Spaghetti
" ], "text/plain": [ "\n", "#\n", " id_1 name id_2 \n", " 0 1 Pirate 2 \n", " 1 2 Monkey nil \n", " 2 3 Ninja 4 \n", " 3 4 Spaghetti nil \n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left.join(right, on: [:name], how: :left)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An outer right join can be done like so:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:11720640 rows: 4 cols: 3
id_1nameid_2
0Rutabaga1
11Pirate2
2Darth Vader3
33Ninja4
" ], "text/plain": [ "\n", "#\n", " id_1 name id_2 \n", " 0 nil Rutabaga 1 \n", " 1 1 Pirate 2 \n", " 2 nil Darth Vade 3 \n", " 3 3 Ninja 4 \n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left.join(right, on: [:name], how: :right)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally, a full outer join:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:10555680 rows: 6 cols: 3
id_1nameid_2
01Pirate2
12Monkey
23Ninja4
34Spaghetti
4Rutabaga1
5Darth Vader3
" ], "text/plain": [ "\n", "#\n", " id_1 name id_2 \n", " 0 1 Pirate 2 \n", " 1 2 Monkey nil \n", " 2 3 Ninja 4 \n", " 3 4 Spaghetti nil \n", " 4 nil Rutabaga 1 \n", " 5 nil Darth Vade 3 \n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left.join(right, on: [:name], how: :outer)" ] } ], "metadata": { "kernelspec": { "display_name": "Ruby 2.2.1", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": ".rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "2.2.1" } }, "nbformat": 4, "nbformat_minor": 0 }