{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Discretizers\n", "\n", "This package supports discretization methods and mapping functions.\n", "\n", "## Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "using Pkg\n", "Pkg.add(\"Discretizers\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the installation is complete, you can load the package in any session by running" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "using Discretizers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discretization\n", "\n", "### Categorical Labels\n", "\n", "You can construct an object for mapping labels to integer indices" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "data = [:cat, :dog, :dog, :cat, :cat, :elephant]\n", "catdisc = CategoricalDiscretizer(data);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting object can be used to encode your source labels as integer indices" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ":cat becomes: 1\n", ":dog becomes: 2\n", "data becomes: [1, 2, 2, 1, 1, 3]\n" ] } ], "source": [ "println(\":cat becomes: \", encode(catdisc, :cat))\n", "println(\":dog becomes: \", encode(catdisc, :dog))\n", "println(\"data becomes: \", encode(catdisc, data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also decode the indices back to the original labels" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 becomes: cat\n", "2 becomes: dog\n", "[1,2,3] becomes: [:cat, :dog, :elephant]\n" ] } ], "source": [ "println(\"1 becomes: \", decode(catdisc, 1))\n", "println(\"2 becomes: \", decode(catdisc, 2))\n", "println(\"[1,2,3] becomes: \", decode(catdisc, [1,2,3]))" ] }, { "cell_type": 
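"markdown", "metadata": {}, "source": [ "As a quick sanity check, encoding and then decoding should reproduce the original labels. The snippet below is a minimal sketch that uses only the calls shown above; the variable names `labels`, `disc` and `codes` are illustrative.\n", "\n", "```julia\n", "using Discretizers\n", "\n", "# illustrative variable names, not part of the package API\n", "labels = [:cat, :dog, :dog, :cat, :cat, :elephant]\n", "disc = CategoricalDiscretizer(labels)\n", "\n", "# map every label to its integer index\n", "codes = encode(disc, labels)\n", "\n", "# decoding the indices should give back the labels we started with\n", "@assert decode(disc, codes) == labels\n", "```\n", "\n", "If the assertion passes silently, the mapping is invertible for this data." ] }, { "cell_type": 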
"markdown", "metadata": {}, "source": [ "\n", "The CategoricalDiscretizer works with any object type" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "CategoricalDiscretizer([\"A\", \"B\", \"C\"])\n", "CategoricalDiscretizer([5000, 1200, 100])\n", "CategoricalDiscretizer([:dog, \"hello world\", NaN]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear Discretization\n", "\n", "Linear discretization into a series of bins is supported as well\n", "\n", "Here we construct a linear discretizer that maps $[0,0.5) \\rightarrow 1$ and $[0.5,1] \\rightarrow 2$" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "bin_edges = [0.0,0.5,1.0]\n", "lindisc = LinearDiscretizer(bin_edges);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encoding works the same way" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.2 becomes: 1\n", "0.7 becomes: 2\n", "0.5 becomes: 2\n", "it works on arrays: [1, 2, 1]\n" ] } ], "source": [ "println(\"0.2 becomes: \", encode(lindisc, 0.2))\n", "println(\"0.7 becomes: \", encode(lindisc, 0.7))\n", "println(\"0.5 becomes: \", encode(lindisc, 0.5))\n", "println(\"it works on arrays: \", encode(lindisc, [0.0,0.8,0.2]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Decoding is a bit different. 
Here we obtain the bin and sample from it uniformly" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 becomes: 0.34261737476959025\n", "2 becomes: 0.8347257253483246\n", "it works on arrays: [0.7957856786261945, 0.21404469243416652, 0.8682774319312638]\n" ] } ], "source": [ "println(\"1 becomes: \", decode(lindisc, 1))\n", "println(\"2 becomes: \", decode(lindisc, 2))\n", "println(\"it works on arrays: \", decode(lindisc, [2,1,2]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few other utility functions are supported" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of labels: 3 2\n", "bin centers: [0.25, 0.75]\n", "extrema of a bin: (0.5, 1.0)\n" ] } ], "source": [ "println(\"number of labels: \", nlabels(catdisc), \" \", nlabels(lindisc))\n", "println(\"bin centers: \", bincenters(lindisc))\n", "println(\"extrema of a bin: \", extrema(lindisc, 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both discretizers can be constructed to map to other integer types" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0x01" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "catdisc = CategoricalDiscretizer(data, Int32)\n", "lindisc = LinearDiscretizer(bin_edges, UInt8)\n", "encode(lindisc, 0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discretization Algorithms\n", "\n", "In many cases one would like to determine the bin edges for a `LinearDiscretizer` automatically from data. This package supports several algorithms to do just that." 
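, "\n", "\n", "Whichever algorithm is chosen, the workflow is the same: compute the bin edges from the data, build a discretizer from those edges, then encode. The snippet below is a minimal sketch of that workflow using `DiscretizeUniformWidth`, one of the algorithms listed next; the names `samples`, `disc` and `codes` are illustrative and not part of the package.\n", "\n", "```julia\n", "using Discretizers\n", "\n", "# illustrative data and bin count, chosen only for this sketch\n", "samples = randn(1000)\n", "nbins = 4\n", "\n", "# 1. let the algorithm pick the bin edges from the data\n", "edges = binedges(DiscretizeUniformWidth(nbins), samples)\n", "\n", "# 2. build a discretizer from those edges\n", "disc = LinearDiscretizer(edges)\n", "\n", "# 3. encode: every sample should land in one of the bins\n", "codes = encode(disc, samples)\n", "@assert all(1 .<= codes .<= nlabels(disc))\n", "```\n", "\n", "Swapping in a different algorithm only changes the first argument to `binedges`." 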
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Uniform Width\n", " \n", " `DiscretizeUniformWidth(nbins)` - divide the domain evenly into `nbins` bins of equal width\n", " \n", "* Uniform Count\n", "\n", " `DiscretizeUniformCount(nbins)` - divide the domain into `nbins` bins, each containing approximately the same number of samples\n", " \n", "* Quantile\n", "\n", " `DiscretizeQuantile(nbins)` - similar to `DiscretizeUniformCount`, but it leverages the `quantile` method from `Statistics.jl`, and automatically drops duplicate edges. This method mimics [the KBinsDiscretizer from scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09bcc2eaeba98f7e737aac2ac782f0e5f1/sklearn/preprocessing/_discretization.py#L21).\n", " \n", "* Bayesian Blocks\n", "\n", " `DiscretizeBayesianBlocks()` - determines an appropriate number of bins by maximizing a Bayesian prior.\n", " See [this website](http://www.astroml.org/examples/algorithms/plot_bayesian_blocks.html) for an overview." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4-element Vector{Float64}:\n", " -2.9097578179044077\n", " -0.9195885729377203\n", " 1.070580672028967\n", " 3.0607499169956545" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nbins = 3\n", "data = randn(1000)\n", "edges = binedges(DiscretizeUniformWidth(nbins), data)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GroupPlot(Axis[Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(Real[-14.663038561927014 -14.663038561927014 … 12.932839834052759 14.90397400519417; 0.0 11.668409150799484 … 7.1025099178779465 0.0], nothing, nothing, \"const plot, mark=none, fill=blue!60\", nothing, nothing, nothing, nothing, false)], \"Uniform Width\", \"x\", nothing, \"pdf(x)\", nothing, nothing, nothing, nothing, nothing, 0, nothing, nothing, nothing, nothing, nothing, nothing, \"6cm\", nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, \"axis\"), Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(Real[-14.663038561927014 -14.663038561927014 … 4.581940852292759 14.90397400519417; 0.0 35.45029724240893 … 27.61083943233505 0.0], nothing, nothing, \"const plot, mark=none, fill=blue!60\", nothing, nothing, nothing, nothing, false)], \"Uniform Count\", \"x\", nothing, \"pdf(x)\", nothing, nothing, nothing, nothing, nothing, 0, nothing, nothing, nothing, nothing, nothing, nothing, \"6cm\", nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, \"axis\"), Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(Real[-14.663038561927014 -14.663038561927014 … 4.567337556730759 14.90397400519417; 0.0 35.44472204113863 … 27.668574920472818 0.0], nothing, nothing, \"const plot, mark=none, fill=blue!60\", nothing, nothing, nothing, nothing, false)], \"Quantile\", \"x\", nothing, \"pdf(x)\", nothing, nothing, nothing, nothing, nothing, 0, nothing, nothing, nothing, 
nothing, nothing, nothing, \"6cm\", nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, \"axis\"), Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(Real[-14.663038561927014 -14.663038561927014 … 7.387048028205961 14.90397400519417; 0.0 12.405107435022224 … 12.239045626050112 0.0], nothing, nothing, \"const plot, mark=none, fill=blue!60\", nothing, nothing, nothing, nothing, false)], \"Bayesian Blocks\", \"x\", nothing, \"pdf(x)\", nothing, nothing, nothing, nothing, nothing, 0, nothing, nothing, nothing, nothing, nothing, nothing, \"6cm\", nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, \"axis\")], (4, 1), nothing, \"horizontal sep = 1.75cm\")" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using PGFPlots\n", "using Distributions\n", "using Random\n", "\n", "# draw samples from a mixture of Cauchy distributions and\n", "# keep only the values that fall in a reasonable range\n", "Random.seed!(0)\n", "data = [rand(Cauchy(-5, 1.8), 500);\n", "        rand(Cauchy(-4, 0.8), 2000);\n", "        rand(Cauchy(-1, 0.3), 500);\n", "        rand(Cauchy( 2, 0.8), 1000);\n", "        rand(Cauchy( 4, 1.5), 500)]\n", "data = filter!(x->-15.0 <= x <= 15.0, data)\n", "\n", "# one panel per algorithm, laid out in a single row\n", "g = GroupPlot(4, 1, groupStyle = \"horizontal sep = 1.75cm\")\n", "\n", "discalgs = [(\"Uniform Width\", DiscretizeUniformWidth(15)),\n", "            (\"Uniform Count\", DiscretizeUniformCount(15)),\n", "            (\"Quantile\", DiscretizeQuantile(15)),\n", "            (\"Bayesian Blocks\", DiscretizeBayesianBlocks())]\n", "\n", "for (name, discalg) in discalgs\n", "    disc = LinearDiscretizer(binedges(discalg, data))\n", "    counts = get_discretization_counts(disc, data)\n", "    # divide counts by bin width so that bar heights are comparable across bins\n", "    arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))\n", "    push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y),\n", "                  style=\"const plot, mark=none, fill=blue!60\"),\n", "                  ymin=0, xlabel=\"x\", ylabel=\"pdf(x)\", title=name, width=\"6cm\"))\n", "end\n", "\n", "g" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automatically Determine Number of Uniform-Width Bins\n", "\n", "Several algorithms exist for determining the number of uniform-width bins.\n", "Simply pass the algorithm name as a symbol to `DiscretizeUniformWidth`." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GroupPlot(Axis[Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(Real[-14.663038561927014 -14.663038561927014 … 14.455988966298396 14.90397400519417; 0.0 24.55439143037808 … 8.928869611046574 0.0], nothing, nothing, \"const plot, mark=none, fill=blue!60\", nothing, nothing, nothing, nothing, false)], \"sqrt\", nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, 0, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, \"axis\"), Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(Real[-14.663038561927014 -14.663038561927014 … 12.792044536114085 14.90397400519417; 0.0 11.83751652979657 … 7.102509917877942 0.0], nothing, nothing, \"const plot, mark=none, fill=blue!60\", nothing, nothing, nothing, nothing, false)], \"sturges\", nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, 0, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, \"axis\"), Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(Real[-14.663038561927014 -14.663038561927014 … 14.00800392740262 14.90397400519417; 0.0 17.857739222093112 … 4.464434805523278 0.0], nothing, nothing, \"const plot, mark=none, fill=blue!60\", nothing, nothing, nothing, nothing, false)], \"rice\", nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, 0, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, \"axis\"), 
Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(Real[-14.663038561927014 -14.663038561927014 … 13.618451719667162 14.90397400519417; 0.0 14.002090980959368 … 7.001045490479684 0.0], nothing, nothing, \"const plot, mark=none, fill=blue!60\", nothing, nothing, nothing, nothing, false)], \"doane\", nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, 0, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, \"axis\"), Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(Real[-14.663038561927014 -14.663038561927014 … 14.034355988514136 14.90397400519417; 0.0 18.39888283488383 … 4.599720708720957 0.0], nothing, nothing, \"const plot, mark=none, fill=blue!60\", nothing, nothing, nothing, nothing, false)], \"scott\", nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, 0, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, \"axis\"), Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(Real[-14.663038561927014 -14.663038561927014 … 14.182827357215604 14.90397400519417; 0.0 19.413527108866365 … 5.546722031104675 0.0], nothing, nothing, \"const plot, mark=none, fill=blue!60\", nothing, nothing, nothing, nothing, false)], \"fd\", nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, 0, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, \"axis\"), Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(Real[-14.663038561927014 -14.663038561927014 … 14.182827357215604 14.90397400519417; 0.0 19.413527108866365 … 5.546722031104675 0.0], nothing, nothing, \"const plot, mark=none, fill=blue!60\", nothing, nothing, nothing, nothing, false)], \"auto\", nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, 0, nothing, nothing, nothing, 
nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, \"axis\")], (3, 3), nothing, \"horizontal sep = 1.75cm, vertical sep = 1.5cm\")" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g = GroupPlot(3, 3, groupStyle = \"horizontal sep = 1.75cm, vertical sep = 1.5cm\")\n", "\n", "discalgs = [:sqrt,    # used by Excel and others for its simplicity and speed\n", "            :sturges, # R's default method, only good for near-Gaussian data\n", "            :rice,    # commonly overestimates the number of bins required\n", "            :doane,   # improves on Sturges’ rule for non-normal datasets\n", "            :scott,   # less robust estimator that takes into account data variability and data size\n", "            :fd,      # Freedman-Diaconis estimator, robust\n", "            :auto,    # max between :fd and :sturges, good all-round performance\n", "           ]\n", "\n", "for discalg in discalgs\n", "    disc = LinearDiscretizer(binedges(DiscretizeUniformWidth(discalg), data))\n", "    counts = get_discretization_counts(disc, data)\n", "    arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))\n", "    push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style=\"const plot, mark=none, fill=blue!60\"),\n", "                  ymin=0, title=string(discalg)))\n", "end\n", "\n", "g" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another algorithm, MODL, finds optimal bins given both a continuous data set and a labelled discrete data set. 
" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3-element Vector{AbstractFloat}:\n", " -3.118600577762223\n", " 0.4973214821864169\n", " 3.4098333701115866" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = [randn(100); randn(100).+1.0]\n", "labels = [fill(:cat, 100); fill(:dog, 100)]\n", "integer_labels = encode(CategoricalDiscretizer([:cat, :dog]), labels)\n", "edges = binedges(DiscretizeMODL_Optimal(), data, integer_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More information on MODL can be found [here](http://nbviewer.ipython.org/github/sisl/Discretizers.jl/blob/master/doc/MODL/DiscretizationMODL.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.7.0", "language": "julia", "name": "julia-1.7" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.7.0" } }, "nbformat": 4, "nbformat_minor": 1 }