{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Discretizers\n",
"\n",
"This package supports discretization methods and mapping functions.\n",
"\n",
"## Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"Pkg.add(\"Discretizers\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the installation is complete you can use it anywhere by running"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"using Discretizers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Discretization\n",
"\n",
"### Categorical Labels\n",
"\n",
"You can construct an object for mapping labels to integer indeces"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data = [:cat, :dog, :dog, :cat, :cat, :elephant]\n",
"catdisc = CategoricalDiscretizer(data);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The resulting object can be used to encode your source labels to their categorical labels"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
":cat becomes: 1\n",
":dog becomes: 2\n",
"data becomes: [1,2,2,1,1,3]\n"
]
}
],
"source": [
"println(\":cat becomes: \", encode(catdisc, :cat))\n",
"println(\":dog becomes: \", encode(catdisc, :dog))\n",
"println(\"data becomes: \", encode(catdisc, data))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also transform back"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 becomes: cat\n",
"2 becomes: dog\n",
"[1,2,3] becomes: [:cat,:dog,:elephant]\n"
]
}
],
"source": [
"println(\"1 becomes: \", decode(catdisc, 1))\n",
"println(\"2 becomes: \", decode(catdisc, 2))\n",
"println(\"[1,2,3] becomes: \", decode(catdisc, [1,2,3]))"
]
},
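{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conceptually, a categorical discretizer is a pair of lookup tables, one for each direction. The cell below is a minimal sketch of that idea in plain Julia, not the package's implementation. The names `sketch_build_maps`, `sketch_encode`, and `sketch_decode` are hypothetical (chosen so they do not clash with the exported functions), and assigning indices in order of first appearance is an assumption of the sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Build forward and backward lookup tables from a list of labels.\n",
"# Indices are assigned in order of first appearance (an assumption of this sketch).\n",
"function sketch_build_maps(labels_in)\n",
"    forward = Dict{Any,Int}()   # label -> index\n",
"    backward = Dict{Int,Any}()  # index -> label\n",
"    for label in labels_in\n",
"        if !haskey(forward, label)\n",
"            idx = length(forward) + 1\n",
"            forward[label] = idx\n",
"            backward[idx] = label\n",
"        end\n",
"    end\n",
"    return forward, backward\n",
"end\n",
"\n",
"sketch_fwd, sketch_bwd = sketch_build_maps([:cat, :dog, :dog, :cat, :cat, :elephant])\n",
"sketch_encode(label) = sketch_fwd[label]  # hypothetical helper: label -> index\n",
"sketch_decode(idx) = sketch_bwd[idx]      # hypothetical helper: index -> label\n",
"\n",
"println(\":dog becomes: \", sketch_encode(:dog))\n",
"println(\"3 becomes: \", sketch_decode(3))"
]
},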
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"The CategoricalDiscretizer works with any object type"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"CategoricalDiscretizer([\"A\", \"B\", \"C\"])\n",
"CategoricalDiscretizer([5000, 1200, 100])\n",
"CategoricalDiscretizer([:dog, \"hello world\", NaN]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Linear Discretization\n",
"\n",
"Linear discretization into a series of bins is supported as well\n",
"\n",
"Here we construct a linear discretizer that maps $[0,0.5) \\rightarrow 1$ and $[0.5,1] \\rightarrow 2$"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"bin_edges = [0.0,0.5,1.0]\n",
"lindisc = LinearDiscretizer(bin_edges);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Encoding works the same way"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.2 becomes: 1\n",
"0.7 becomes: 2\n",
"0.5 becomes: 2\n",
"it works on arrays: [1,2,1]\n"
]
}
],
"source": [
"println(\"0.2 becomes: \", encode(lindisc, 0.2))\n",
"println(\"0.7 becomes: \", encode(lindisc, 0.7))\n",
"println(\"0.5 becomes: \", encode(lindisc, 0.5))\n",
"println(\"it works on arrays: \", encode(lindisc, [0.0,0.8,0.2]))"
]
},
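{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bin lookup shown above amounts to a search over the sorted edges. The following is a rough sketch in plain Julia of how such a lookup could work, with every bin half-open except the last; `sketch_bin_index` is a hypothetical name, and rejecting out-of-range values is a choice made for this sketch that may differ from what the package does."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Return the 1-based bin index of x for sorted bin edges.\n",
"# Every bin is half-open [lo, hi) except the last, which also includes its upper edge.\n",
"function sketch_bin_index(sorted_edges, x)\n",
"    first_edge, last_edge = sorted_edges[1], sorted_edges[end]\n",
"    # out-of-range handling is an assumption of this sketch\n",
"    (first_edge <= x <= last_edge) || error(\"value lies outside the bin edges\")\n",
"    x == last_edge && return length(sorted_edges) - 1\n",
"    # searchsortedlast gives the index of the last edge that is <= x\n",
"    return searchsortedlast(sorted_edges, x)\n",
"end\n",
"\n",
"for x in [0.2, 0.7, 0.5, 1.0]\n",
"    println(x, \" falls in bin \", sketch_bin_index([0.0, 0.5, 1.0], x))\n",
"end"
]
},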
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Decoding is a bit different. Here we obtain the bin and sample from it uniformally"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 becomes: 0.1887493587068908\n",
"2 becomes: 0.5460050358882282\n",
"it works on arrays: [0.8104340883570698,0.454060686695556,0.9457654175718269]\n"
]
}
],
"source": [
"println(\"1 becomes: \", decode(lindisc, 1))\n",
"println(\"2 becomes: \", decode(lindisc, 2))\n",
"println(\"it works on arrays: \", decode(lindisc, [2,1,2]))"
]
},
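{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because decoding samples from the bin, repeated calls return different values. A minimal sketch of uniform sampling within one bin, in plain Julia, is given below; `sketch_sample_bin` is a hypothetical helper and is not taken from the package source."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Draw a value uniformly at random from bin `idx` of the given sorted edges.\n",
"function sketch_sample_bin(sorted_edges, idx)\n",
"    lo, hi = sorted_edges[idx], sorted_edges[idx + 1]\n",
"    # rand() is uniform on [0, 1), so the result lies in [lo, hi)\n",
"    return lo + rand() * (hi - lo)\n",
"end\n",
"\n",
"sketch_value = sketch_sample_bin([0.0, 0.5, 1.0], 2)\n",
"println(\"sample from bin 2 lies in [0.5, 1.0]: \", 0.5 <= sketch_value <= 1.0)"
]
},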
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some other functions are supported"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of labels: 3 2\n",
"bin centers: [0.25,0.75]\n",
"extrama of a bin: (0.5,1.0)\n"
]
}
],
"source": [
"println(\"number of labels: \", nlabels(catdisc), \" \", nlabels(lindisc))\n",
"println(\"bin centers: \", bincenters(lindisc))\n",
"println(\"extrama of a bin: \", extrema(lindisc, 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both discretizers can be constructed to map to other integer types"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0x01"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"catdisc = CategoricalDiscretizer(data, Int32)\n",
"lindisc = LinearDiscretizer(bin_edges, UInt8)\n",
"encode(lindisc, 0.2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Discretization Algorithms\n",
"\n",
"In many cases one would like to determine the bin edges for a Linear Discretizer automatically from data. This package supports several algorithms to do just that."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Uniform Width\n",
" \n",
" `DiscretizeUniformWidth(nbins)` - divide the domain evenly into `nbins`\n",
" \n",
"* Uniform Count\n",
"\n",
" `DiscretizeUniformCount(nbins)` - divide the domain into `nbins` where each bin has approximately equal count\n",
" \n",
"* Bayesian Blocks\n",
"\n",
" `DiscretizeBayesianBlocks()` - determines an appropriate number of bins by maximizing a Bayesian prior.\n",
" See [this website](http://www.astroml.org/examples/algorithms/plot_bayesian_blocks.html) for an overview."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"4-element Array{Float64,1}:\n",
" -2.85658 \n",
" -0.598978\n",
" 1.65863 \n",
" 3.91623 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nbins = 3\n",
"data = randn(1000)\n",
"edges = binedges(DiscretizeUniformWidth(nbins), data)"
]
},
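{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the difference between the first two algorithms concrete, here is a rough sketch of how uniform-width and uniform-count edges could be computed in plain Julia. It is an illustration under stated assumptions rather than the package's code: `sketch_uniform_width_edges` and `sketch_uniform_count_edges` are hypothetical names, the interior uniform-count edges are placed halfway between neighbouring sorted samples, and the number of bins is assumed not to exceed the number of samples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Uniform width: split [min, max] of the samples into equal-width intervals.\n",
"function sketch_uniform_width_edges(samples, nbins_wanted)\n",
"    lo, hi = minimum(samples), maximum(samples)\n",
"    return [lo + i * (hi - lo) / nbins_wanted for i in 0:nbins_wanted]\n",
"end\n",
"\n",
"# Uniform count: place interior edges halfway between neighbouring sorted samples\n",
"# so that each bin holds roughly length(samples) / nbins_wanted points.\n",
"# Assumes nbins_wanted <= length(samples).\n",
"function sketch_uniform_count_edges(samples, nbins_wanted)\n",
"    sorted = sort(samples)\n",
"    n = length(sorted)\n",
"    edges_out = [sorted[1]]\n",
"    for k in 1:(nbins_wanted - 1)\n",
"        idx = div(k * n, nbins_wanted)\n",
"        push!(edges_out, (sorted[idx] + sorted[idx + 1]) / 2)\n",
"    end\n",
"    push!(edges_out, sorted[n])\n",
"    return edges_out\n",
"end\n",
"\n",
"sketch_samples = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]\n",
"println(\"uniform width: \", sketch_uniform_width_edges(sketch_samples, 3))\n",
"println(\"uniform count: \", sketch_uniform_count_edges(sketch_samples, 3))"
]
},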
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
"PGFPlots.GroupPlot([PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x17 Array{Real,2}:\n",
" -14.8231 -14.8231 -12.8364 -10.8497 … 11.004 12.9907 14.9774\n",
" 0.0 11.577 16.1071 33.2209 9.06025 7.04686 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"Uniform Width\",\"x\",nothing,\"pdf(x)\",nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x17 Array{Real,2}:\n",
" -14.8231 -14.8231 -6.60423 -5.19549 … 3.0474 4.67614 14.9774\n",
" 0.0 35.0412 204.439 476.45 176.824 27.8607 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"Uniform Count\",\"x\",nothing,\"pdf(x)\",nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x21 Array{Real,2}:\n",
" -14.8231 -14.8231 -9.90599 … 4.53939 5.85829 8.8699 14.9774\n",
" 0.0 15.6596 41.1833 99.325 38.1856 9.3328 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"Bayesian Blocks\",\"x\",nothing,\"pdf(x)\",nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing)],(3,1),nothing,\"horizontal sep = 1.75cm\")"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"using PGFPlots\n",
"using Distributions\n",
"\n",
"# draw a set of variables and\n",
"# filter values to a reasonable range\n",
"srand(0)\n",
"data = [rand(Cauchy(-5, 1.8), 500);\n",
" rand(Cauchy(-4, 0.8), 2000);\n",
" rand(Cauchy(-1, 0.3), 500);\n",
" rand(Cauchy( 2, 0.8), 1000);\n",
" rand(Cauchy( 4, 1.5), 500)]\n",
"data = filter!(x->-15.0 <= x <= 15.0, data)\n",
"\n",
"g = GroupPlot(3, 1, groupStyle = \"horizontal sep = 1.75cm\")\n",
"\n",
"discalgs = [(\"Uniform Width\", DiscretizeUniformWidth(15)),\n",
" (\"Uniform Count\", DiscretizeUniformCount(15)),\n",
" (\"Bayesian Blocks\", DiscretizeBayesianBlocks())]\n",
"\n",
"for (name, discalg) in discalgs\n",
" disc = LinearDiscretizer(binedges(discalg, data))\n",
" counts = get_discretization_counts(disc, data) \n",
" arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))\n",
" push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style=\"const plot, mark=none, fill=blue!60\"), \n",
" ymin=0, xlabel=\"x\", ylabel=\"pdf(x)\", title=name))\n",
"end\n",
"\n",
"g"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Automatically Determine Number of Uniform-Width Bins\n",
"\n",
"Several algorithms exist for deterimining the number of uniform-width bins.\n",
"Simply enter the algorithm as a symbol into `DiscretizeUniformWidth`."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
"PGFPlots.GroupPlot([PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x68 Array{Real,2}:\n",
" -14.8231 -14.8231 -14.3716 -13.9201 … 14.0743 14.5259 14.9774\n",
" 0.0 6.64418 17.7178 11.0736 4.42945 6.64418 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"sqrt\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x16 Array{Real,2}:\n",
" -14.8231 -14.8231 -12.6945 -10.5659 … 10.7202 12.8488 14.9774\n",
" 0.0 12.6843 16.9125 36.6437 9.39581 7.04686 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"sturges\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x35 Array{Real,2}:\n",
" -14.8231 -14.8231 -13.9201 -13.017 … 13.1713 14.0743 14.9774\n",
" 0.0 12.181 13.2884 12.181 9.96627 5.53682 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"rice\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x25 Array{Real,2}:\n",
" -14.8231 -14.8231 -13.5274 -12.2318 … 12.386 13.6817 14.9774\n",
" 0.0 12.3488 12.3488 15.436 9.26159 7.71799 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"doane\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x36 Array{Real,2}:\n",
" -14.8231 -14.8231 -13.9466 … 12.3479 13.2244 14.1009 14.9774\n",
" 0.0 11.4092 13.691 10.2683 9.12736 5.7046 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"scott\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x43 Array{Real,2}:\n",
" -14.8231 -14.8231 -14.0963 -13.3694 … 13.5237 14.2505 14.9774\n",
" 0.0 11.0065 13.7582 13.7582 8.25489 6.87908 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"fd\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing),PGFPlots.Axis(PGFPlots.Plots.Plot[PGFPlots.Plots.Linear(2x43 Array{Real,2}:\n",
" -14.8231 -14.8231 -14.0963 -13.3694 … 13.5237 14.2505 14.9774\n",
" 0.0 11.0065 13.7582 13.7582 8.25489 6.87908 0.0 ,nothing,nothing,\"const plot, mark=none, fill=blue!60\",nothing,nothing)],\"auto\",nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,0,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing,nothing)],(3,3),nothing,\"horizontal sep = 1.75cm, vertical sep = 1.5cm\")"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"g = GroupPlot(3, 3, groupStyle = \"horizontal sep = 1.75cm, vertical sep = 1.5cm\")\n",
"\n",
"discalgs = [:sqrt, # used by Excel and others for its simplicity and speed\n",
" :sturges, # R's default method, only good for near-Gaussian data\n",
" :rice, # commonly overestimates the number of bins required\n",
" :doane, # improves Sturges’ for non-normal datasets.\n",
" :scott, # less robust estimator that that takes into account data variability and data size.\n",
" :fd, # Freedman Diaconis Estimator, robust\n",
" :auto, # max between :fd and :sturges. Good all-round performance\n",
" ]\n",
"\n",
"for discalg in discalgs\n",
" disc = LinearDiscretizer(binedges(DiscretizeUniformWidth(discalg), data))\n",
" counts = get_discretization_counts(disc, data) \n",
" arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))\n",
" push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style=\"const plot, mark=none, fill=blue!60\"), \n",
" ymin=0, title=string(discalg)))\n",
"end\n",
"\n",
"g"
]
},
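{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition about how these rules scale with the sample size, the cell below sketches the textbook formulas for three of them (square root, Sturges, and Rice) in plain Julia. The function names are hypothetical, and the package's own implementation may round or clamp differently; the remaining rules also depend on the spread or skewness of the data and are not sketched here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Textbook rules of thumb for the number of uniform-width bins given n samples.\n",
"sketch_nbins_sqrt(n) = ceil(Int, sqrt(n))          # square-root rule\n",
"sketch_nbins_sturges(n) = ceil(Int, log2(n)) + 1   # Sturges' rule\n",
"sketch_nbins_rice(n) = ceil(Int, 2 * cbrt(n))      # Rice rule\n",
"\n",
"for n in [100, 1000, 10000]\n",
"    println(n, \" samples: sqrt=\", sketch_nbins_sqrt(n),\n",
"            \" sturges=\", sketch_nbins_sturges(n),\n",
"            \" rice=\", sketch_nbins_rice(n))\n",
"end"
]
},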
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A third algorithm, MODL, was implemented to find optimal bins given both a continuous data set and a labelled discrete data set. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"4-element Array{AbstractFloat,1}:\n",
" -2.58175 \n",
" -0.229589\n",
" 1.87765 \n",
" 2.6983 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = [randn(100); randn(100)+1.0]\n",
"labels = [fill(:cat, 100); fill(:dog, 100)]\n",
"integer_labels = encode(CategoricalDiscretizer([:cat, :dog]), labels)\n",
"edges = binedges(DiscretizeMODL_Optimal(), data, integer_labels)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More information on MODL can be found [here](http://nbviewer.ipython.org/github/sisl/Discretizers.jl/blob/master/doc/MODL/DiscretizationMODL.ipynb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 0.4.5",
"language": "julia",
"name": "julia-0.4"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "0.4.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}