{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Discretizers\n", "\n", "This package supports discretization methods and mapping functions.\n", "\n", "## Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Pkg.add(\"Discretizers\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the installation is complete, you can load the package in any session by running" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "using Discretizers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discretization\n", "\n", "### Categorical Labels\n", "\n", "You can construct an object for mapping labels to integer indices" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = [:cat, :dog, :dog, :cat, :cat, :elephant]\n", "catdisc = CategoricalDiscretizer(data);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting object can be used to encode your source labels as integer indices" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ":cat becomes: 1\n", ":dog becomes: 2\n", "data becomes: [1,2,2,1,1,3]\n" ] } ], "source": [ "println(\":cat becomes: \", encode(catdisc, :cat))\n", "println(\":dog becomes: \", encode(catdisc, :dog))\n", "println(\"data becomes: \", encode(catdisc, data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also decode integer indices back to the original labels" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 becomes: cat\n", "2 becomes: dog\n", "[1,2,3] becomes: [:cat,:dog,:elephant]\n" ] } ], "source": [ "println(\"1 becomes: \", decode(catdisc, 1))\n", "println(\"2 becomes: \", 
decode(catdisc, 2))\n", "println(\"[1,2,3] becomes: \", decode(catdisc, [1,2,3]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The CategoricalDiscretizer works with any object type" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "CategoricalDiscretizer([\"A\", \"B\", \"C\"])\n", "CategoricalDiscretizer([5000, 1200, 100])\n", "CategoricalDiscretizer([:dog, \"hello world\", NaN]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear Discretization\n", "\n", "Linear discretization into a series of bins is supported as well\n", "\n", "Here we construct a linear discretizer that maps $[0,0.5) \\rightarrow 1$ and $[0.5,1] \\rightarrow 2$" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bin_edges = [0.0,0.5,1.0]\n", "lindisc = LinearDiscretizer(bin_edges);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encoding works the same way" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.2 becomes: 1\n", "0.7 becomes: 2\n", "0.5 becomes: 2\n", "it works on arrays: [1,2,1]\n" ] } ], "source": [ "println(\"0.2 becomes: \", encode(lindisc, 0.2))\n", "println(\"0.7 becomes: \", encode(lindisc, 0.7))\n", "println(\"0.5 becomes: \", encode(lindisc, 0.5))\n", "println(\"it works on arrays: \", encode(lindisc, [0.0,0.8,0.2]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Decoding is a bit different. 
Here we obtain the bin and sample from it uniformly" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 becomes: 0.1887493587068908\n", "2 becomes: 0.5460050358882282\n", "it works on arrays: [0.8104340883570698,0.454060686695556,0.9457654175718269]\n" ] } ], "source": [ "println(\"1 becomes: \", decode(lindisc, 1))\n", "println(\"2 becomes: \", decode(lindisc, 2))\n", "println(\"it works on arrays: \", decode(lindisc, [2,1,2]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few other helper functions are supported" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of labels: 3 2\n", "bin centers: [0.25,0.75]\n", "extrema of a bin: (0.5,1.0)\n" ] } ], "source": [ "println(\"number of labels: \", nlabels(catdisc), \" \", nlabels(lindisc))\n", "println(\"bin centers: \", bincenters(lindisc))\n", "println(\"extrema of a bin: \", extrema(lindisc, 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both discretizers can be constructed to map to other integer types" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0x01" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "catdisc = CategoricalDiscretizer(data, Int32)\n", "lindisc = LinearDiscretizer(bin_edges, UInt8)\n", "encode(lindisc, 0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discretization Algorithms\n", "\n", "In many cases one would like to determine the bin edges for a `LinearDiscretizer` automatically from data. This package supports several algorithms to do just that." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Uniform Width\n", "\n", "    `DiscretizeUniformWidth(nbins)` - divides the domain into `nbins` bins of equal width\n", "\n", "* Uniform Count\n", "\n", "    `DiscretizeUniformCount(nbins)` - divides the domain into `nbins` bins that each contain approximately the same number of samples\n", "\n", "* Bayesian Blocks\n", "\n", "    `DiscretizeBayesianBlocks()` - determines an appropriate number of bins by maximizing a Bayesian fitness function.\n", "    See [this website](http://www.astroml.org/examples/algorithms/plot_bayesian_blocks.html) for an overview." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4-element Array{Float64,1}:\n", " -2.85658 \n", " -0.598978\n", " 1.65863 \n", " 3.91623 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nbins = 3\n", "data = randn(1000)\n", "edges = binedges(DiscretizeUniformWidth(nbins), data)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "PGFPlots.GroupPlot([PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x17 Array{Real,2}:\n", " -14.8231 -14.8231 -12.8364 -10.8497 … 11.004 12.9907 14.9774\n", " 0.0 11.577 16.1071 33.2209 9.06025 7.04686 0.0 
,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"Uniform Width\",\"x\",nothing,\"pdf(x)\",nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x17 Array{Real,2}:\n", " -14.8231 -14.8231 -6.60423 -5.19549 … 3.0474 4.67614 14.9774\n", " 0.0 35.0412 204.439 476.45 176.824 27.8607 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"Uniform Count\",\"x\",nothing,\"pdf(x)\",nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x21 Array{Real,2}:\n", " -14.8231 -14.8231 -9.90599 … 4.53939 5.85829 8.8699 14.9774\n", " 0.0 15.6596 41.1833 99.325 38.1856 9.3328 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"Bayesian Blocks\",\"x\",nothing,\"pdf(x)\",nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing)],(3,1),nothing,\"horizontal sep = 1.75cm\")" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using PGFPlots\n", "using Distributions\n", "\n", "# draw a set of variables and\n", "# filter values to a reasonable range\n", "srand(0)\n", "data = [rand(Cauchy(-5, 1.8), 500);\n", " rand(Cauchy(-4, 0.8), 2000);\n", " rand(Cauchy(-1, 0.3), 500);\n", " rand(Cauchy( 2, 0.8), 1000);\n", " rand(Cauchy( 4, 1.5), 500)]\n", "data = filter!(x->-15.0 <= x <= 15.0, data)\n", "\n", "g = GroupPlot(3, 1, groupStyle = \"horizontal sep = 1.75cm\")\n", "\n", "discalgs = [(\"Uniform Width\", DiscretizeUniformWidth(15)),\n", " (\"Uniform Count\", DiscretizeUniformCount(15)),\n", " (\"Bayesian Blocks\", 
DiscretizeBayesianBlocks())]\n", "\n", "for (name, discalg) in discalgs\n", " disc = LinearDiscretizer(binedges(discalg, data))\n", " counts = get_discretization_counts(disc, data)\n", " arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))\n", " push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style=\"const plot, mark=none, fill=blue!60\"),\n", " ymin=0, xlabel=\"x\", ylabel=\"pdf(x)\", title=name))\n", "end\n", "\n", "g" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automatically Determine Number of Uniform-Width Bins\n", "\n", "Several algorithms exist for determining the number of uniform-width bins.\n", "Simply pass the name of the algorithm as a symbol to `DiscretizeUniformWidth`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "PGFPlots.GroupPlot([PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x68 Array{Real,2}:\n", " -14.8231 -14.8231 -14.3716 -13.9201 … 14.0743 14.5259 14.9774\n", " 0.0 6.64418 17.7178 11.0736 4.42945 6.64418 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"sqrt\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x16 Array{Real,2}:\n", " -14.8231 -14.8231 -12.6945 -10.5659 … 10.7202 12.8488 14.9774\n", " 0.0 12.6843 16.9125 36.6437 9.39581 7.04686 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"sturges\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x35 Array{Real,2}:\n", " -14.8231 -14.8231 -13.9201 -13.017 … 13.1713 14.0743 14.9774\n", " 0.0 12.181 13.2884 12.181 9.96627 5.53682 0.0 ,nothing,nothing,\"const plot, mark=none, 
fill=blue!60\",nothing,nothing)],\"rice\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x25 Array{Real,2}:\n", " -14.8231 -14.8231 -13.5274 -12.2318 … 12.386 13.6817 14.9774\n", " 0.0 12.3488 12.3488 15.436 9.26159 7.71799 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"doane\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x36 Array{Real,2}:\n", " -14.8231 -14.8231 -13.9466 … 12.3479 13.2244 14.1009 14.9774\n", " 0.0 11.4092 13.691 10.2683 9.12736 5.7046 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"scott\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x43 Array{Real,2}:\n", " -14.8231 -14.8231 -14.0963 -13.3694 … 13.5237 14.2505 14.9774\n", " 0.0 11.0065 13.7582 13.7582 8.25489 6.87908 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"fd\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x43 Array{Real,2}:\n", " -14.8231 -14.8231 -14.0963 -13.3694 … 13.5237 14.2505 14.9774\n", " 0.0 11.0065 13.7582 13.7582 8.25489 6.87908 0.0 ,nothing,nothing,\"const plot, mark=none, 
fill=blue!60\",nothing,nothing)],\"auto\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing)],(3,3),nothing,\"horizontal sep = 1.75cm, vertical sep = 1.5cm\")" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g = GroupPlot(3, 3, groupStyle = \"horizontal sep = 1.75cm, vertical sep = 1.5cm\")\n", "\n", "discalgs = [:sqrt, # used by Excel and others for its simplicity and speed\n", " :sturges, # R's default method, only good for near-Gaussian data\n", " :rice, # commonly overestimates the number of bins required\n", " :doane, # improves on Sturges' rule for non-normal datasets\n", " :scott, # less robust estimator that takes into account data variability and data size\n", " :fd, # Freedman-Diaconis estimator, robust\n", " :auto, # max between :fd and :sturges; good all-round performance\n", " ]\n", "\n", "for discalg in discalgs\n", " disc = LinearDiscretizer(binedges(DiscretizeUniformWidth(discalg), data))\n", " counts = get_discretization_counts(disc, data)\n", " arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))\n", " push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style=\"const plot, mark=none, fill=blue!60\"),\n", " ymin=0, title=string(discalg)))\n", "end\n", "\n", "g" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An additional algorithm, MODL, is implemented to find optimal bins given both a continuous data set and a labelled discrete data set. 
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4-element Array{AbstractFloat,1}:\n", " -2.58175 \n", " -0.229589\n", " 1.87765 \n", " 2.6983 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = [randn(100); randn(100)+1.0]\n", "labels = [fill(:cat, 100); fill(:dog, 100)]\n", "integer_labels = encode(CategoricalDiscretizer([:cat, :dog]), labels)\n", "edges = binedges(DiscretizeMODL_Optimal(), data, integer_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More information on MODL can be found [here](http://nbviewer.ipython.org/github/sisl/Discretizers.jl/blob/master/doc/MODL/DiscretizationMODL.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.4.5", "language": "julia", "name": "julia-0.4" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.4.6" } }, "nbformat": 4, "nbformat_minor": 0 }