{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Discretizers\n",
"\n",
"This package supports discretization methods and mapping functions.\n",
"\n",
"## Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using Pkg\n",
"Pkg.add(\"Discretizers\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the installation is complete, you can load the package in any session by running:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"using Discretizers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Discretization\n",
"\n",
"### Categorical Labels\n",
"\n",
"You can construct an object that maps labels to integer indices:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data = [:cat, :dog, :dog, :cat, :cat, :elephant]\n",
"catdisc = CategoricalDiscretizer(data);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The resulting object encodes your source labels as their integer indices:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
":cat becomes: 1\n",
":dog becomes: 2\n",
"data becomes: [1, 2, 2, 1, 1, 3]\n"
]
}
],
"source": [
"println(\":cat becomes: \", encode(catdisc, :cat))\n",
"println(\":dog becomes: \", encode(catdisc, :dog))\n",
"println(\"data becomes: \", encode(catdisc, data))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also decode integer indices back to the original labels:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 becomes: cat\n",
"2 becomes: dog\n",
"[1,2,3] becomes: [:cat, :dog, :elephant]\n"
]
}
],
"source": [
"println(\"1 becomes: \", decode(catdisc, 1))\n",
"println(\"2 becomes: \", decode(catdisc, 2))\n",
"println(\"[1,2,3] becomes: \", decode(catdisc, [1,2,3]))"
]
},
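{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conceptually, a categorical discretizer is a pair of lookup tables, one from label to index and one back. The sketch below is a minimal illustration in plain Julia, not the package's implementation; the names `label_to_index` and `index_to_label` are hypothetical, and indices are assumed to be assigned in order of first appearance.\n",
"\n",
"```julia\n",
"labels = [:cat, :dog, :dog, :cat, :cat, :elephant]\n",
"\n",
"# Assign the next free index to each label the first time it is seen.\n",
"label_to_index = Dict{Symbol,Int}()\n",
"for label in labels\n",
"    get!(label_to_index, label, length(label_to_index) + 1)\n",
"end\n",
"\n",
"# Invert the table so that indices can be decoded back to labels.\n",
"index_to_label = Dict(i => label for (label, i) in label_to_index)\n",
"\n",
"encoded = [label_to_index[label] for label in labels]\n",
"decoded = [index_to_label[i] for i in encoded]\n",
"```\n",
"\n",
"Here `encoded` is `[1, 2, 2, 1, 1, 3]`, and `decoded` equals `labels` again."
]
},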
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `CategoricalDiscretizer` works with any object type:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"CategoricalDiscretizer([\"A\", \"B\", \"C\"])\n",
"CategoricalDiscretizer([5000, 1200, 100])\n",
"CategoricalDiscretizer([:dog, \"hello world\", NaN]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Linear Discretization\n",
"\n",
"Linear discretization into a series of bins is supported as well.\n",
"\n",
"Here we construct a linear discretizer that maps $[0,0.5) \\rightarrow 1$ and $[0.5,1] \\rightarrow 2$:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"bin_edges = [0.0,0.5,1.0]\n",
"lindisc = LinearDiscretizer(bin_edges);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Encoding works the same way as for the categorical discretizer:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.2 becomes: 1\n",
"0.7 becomes: 2\n",
"0.5 becomes: 2\n",
"it works on arrays: [1, 2, 1]\n"
]
}
],
"source": [
"println(\"0.2 becomes: \", encode(lindisc, 0.2))\n",
"println(\"0.7 becomes: \", encode(lindisc, 0.7))\n",
"println(\"0.5 becomes: \", encode(lindisc, 0.5))\n",
"println(\"it works on arrays: \", encode(lindisc, [0.0,0.8,0.2]))"
]
},
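{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bin lookup shown above can be sketched in plain Julia with a sorted search over the edges. This is only an illustration under stated assumptions, not the package's code: `bin_index` is a hypothetical helper, and clamping also folds out-of-range values into the first or last bin, which may differ from what `encode` does.\n",
"\n",
"```julia\n",
"bin_edges = [0.0, 0.5, 1.0]\n",
"\n",
"# Hypothetical helper: index of the half-open bin [edges[i], edges[i+1]) containing x.\n",
"function bin_index(edges, x)\n",
"    nbins = length(edges) - 1\n",
"    i = searchsortedlast(edges, x)  # index of the last edge that is <= x\n",
"    return clamp(i, 1, nbins)       # the closed upper edge belongs to the last bin\n",
"end\n",
"\n",
"indices = [bin_index(bin_edges, x) for x in [0.0, 0.8, 0.2]]\n",
"```\n",
"\n",
"With these edges, `bin_index(bin_edges, 0.5)` returns `2` and `indices` is `[1, 2, 1]`, matching the half-open intervals described earlier."
]
},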
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Decoding is a bit different. A bin covers a range of values, so we obtain the bin and sample from it uniformly:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 becomes: 0.34261737476959025\n",
"2 becomes: 0.8347257253483246\n",
"it works on arrays: [0.7957856786261945, 0.21404469243416652, 0.8682774319312638]\n"
]
}
],
"source": [
"println(\"1 becomes: \", decode(lindisc, 1))\n",
"println(\"2 becomes: \", decode(lindisc, 2))\n",
"println(\"it works on arrays: \", decode(lindisc, [2,1,2]))"
]
},
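{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sampling uniformly from a bin amounts to scaling a random number in $[0,1)$ to the bin's extent. A minimal sketch, assuming the same `bin_edges` as above; `sample_from_bin` is a hypothetical name and not part of the package.\n",
"\n",
"```julia\n",
"bin_edges = [0.0, 0.5, 1.0]\n",
"\n",
"# Hypothetical helper: draw a value uniformly from bin i.\n",
"function sample_from_bin(edges, i)\n",
"    lo, hi = edges[i], edges[i + 1]\n",
"    return lo + rand() * (hi - lo)  # rand() is uniform on [0, 1)\n",
"end\n",
"\n",
"samples = [sample_from_bin(bin_edges, i) for i in [2, 1, 2]]\n",
"```\n",
"\n",
"Because the result is random, repeated calls return different values, but each one lies inside the requested bin."
]
},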
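{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several helper functions are also supported: `nlabels` returns the number of labels, `bincenters` the center of each bin, and `extrema` the lower and upper edge of a given bin."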
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of labels: 3 2\n",
"bin centers: [0.25, 0.75]\n",
"extrema of a bin: (0.5, 1.0)\n"
]
}
],
"source": [
"println(\"number of labels: \", nlabels(catdisc), \" \", nlabels(lindisc))\n",
"println(\"bin centers: \", bincenters(lindisc))\n",
"println(\"extrema of a bin: \", extrema(lindisc, 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both discretizers can be constructed to map to other integer types"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0x01"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"catdisc = CategoricalDiscretizer(data, Int32)\n",
"lindisc = LinearDiscretizer(bin_edges, UInt8)\n",
"encode(lindisc, 0.2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Discretization Algorithms\n",
"\n",
"In many cases one would like to determine the bin edges for a `LinearDiscretizer` automatically from data. This package supports several algorithms to do just that."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Uniform Width\n",
"\n",
"  `DiscretizeUniformWidth(nbins)` - divide the domain evenly into `nbins` bins of equal width\n",
"\n",
"* Uniform Count\n",
"\n",
"  `DiscretizeUniformCount(nbins)` - divide the domain into `nbins` bins, each holding approximately the same number of samples\n",
"\n",
"* Quantile\n",
"\n",
"  `DiscretizeQuantile(nbins)` - similar to `DiscretizeUniformCount`, but it leverages the `quantile` method from `Statistics.jl` and automatically drops duplicate edges. This method mimics [the KBinsDiscretizer from scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09bcc2eaeba98f7e737aac2ac782f0e5f1/sklearn/preprocessing/_discretization.py#L21).\n",
"\n",
"* Bayesian Blocks\n",
"\n",
"  `DiscretizeBayesianBlocks()` - determines an appropriate number of bins by maximizing a Bayesian prior.\n",
"  See [this website](http://www.astroml.org/examples/algorithms/plot_bayesian_blocks.html) for an overview."
]
},
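{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the difference between width-based and count-based binning concrete, the sketch below computes both kinds of edges by hand using only Julia's standard library. It illustrates the idea and is not the package's algorithm; the sample values and the names `width_edges` and `quantile_edges` are made up for this example.\n",
"\n",
"```julia\n",
"using Statistics\n",
"\n",
"xs = [1.0, 2.0, 2.5, 3.0, 10.0, 11.0, 12.0, 50.0]\n",
"nbins = 4\n",
"\n",
"# Uniform width: evenly spaced edges between the minimum and the maximum.\n",
"width_edges = collect(range(minimum(xs), maximum(xs), length = nbins + 1))\n",
"\n",
"# Quantile-based: edges at evenly spaced quantiles, with duplicate edges dropped.\n",
"quantile_edges = unique(quantile(xs, range(0, 1, length = nbins + 1)))\n",
"```\n",
"\n",
"Both sets of edges start at the minimum and end at the maximum of `xs`. The single outlier at `50.0` stretches every uniform-width bin, whereas the quantile-based edges stay concentrated where most of the samples lie."
]
},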
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4-element Vector{Float64}:\n",
" -2.9097578179044077\n",
" -0.9195885729377203\n",
" 1.070580672028967\n",
" 3.0607499169956545"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nbins = 3\n",
"data = randn(1000)\n",
"edges = binedges(DiscretizeUniformWidth(nbins), data)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
"GroupPlot with four panels (Uniform Width, Uniform Count, Quantile, Bayesian Blocks); each axis plots pdf(x) against x"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"using PGFPlots\n",
"using Distributions\n",
"using Random\n",
"\n",
"# draw samples from a mixture of Cauchy distributions and\n",
"# keep only values within a reasonable range\n",
"Random.seed!(0)\n",
"data = [rand(Cauchy(-5, 1.8), 500);\n",
"        rand(Cauchy(-4, 0.8), 2000);\n",
"        rand(Cauchy(-1, 0.3), 500);\n",
"        rand(Cauchy( 2, 0.8), 1000);\n",
"        rand(Cauchy( 4, 1.5), 500)]\n",
"data = filter!(x->-15.0 <= x <= 15.0, data)\n",
"\n",
"g = GroupPlot(4, 1, groupStyle = \"horizontal sep = 1.75cm\")\n",
"\n",
"discalgs = [(\"Uniform Width\", DiscretizeUniformWidth(15)),\n",
" (\"Uniform Count\", DiscretizeUniformCount(15)),\n",
" (\"Quantile\", DiscretizeQuantile(15)),\n",
" (\"Bayesian Blocks\", DiscretizeBayesianBlocks())]\n",
"\n",
"for (name, discalg) in discalgs\n",
" disc = LinearDiscretizer(binedges(discalg, data))\n",
" counts = get_discretization_counts(disc, data) \n",
" arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))\n",
" push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y),\n",
" style=\"const plot, mark=none, fill=blue!60\"), \n",
" ymin=0, xlabel=\"x\", ylabel=\"pdf(x)\", title=name, width=\"6cm\"))\n",
"end\n",
"\n",
"g"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Automatically Determine Number of Uniform-Width Bins\n",
"\n",
"Several algorithms exist for determining the number of uniform-width bins.\n",
"Simply enter the algorithm as a symbol into `DiscretizeUniformWidth`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
"GroupPlot with seven panels (sqrt, sturges, rice, doane, scott, fd, auto)"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"g = GroupPlot(3, 3, groupStyle = \"horizontal sep = 1.75cm, vertical sep = 1.5cm\")\n",
"\n",
"discalgs = [:sqrt,    # used by Excel and others for its simplicity and speed\n",
"            :sturges, # R's default method, only good for near-Gaussian data\n",
"            :rice,    # commonly overestimates the number of bins required\n",
"            :doane,   # improves Sturges’ for non-normal datasets\n",
"            :scott,   # less robust estimator that takes into account data variability and data size\n",
"            :fd,      # Freedman-Diaconis estimator, robust\n",
"            :auto,    # maximum of :fd and :sturges, good all-round performance\n",
"           ]\n",
"\n",
"for discalg in discalgs\n",
" disc = LinearDiscretizer(binedges(DiscretizeUniformWidth(discalg), data))\n",
" counts = get_discretization_counts(disc, data) \n",
" arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))\n",
" push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style=\"const plot, mark=none, fill=blue!60\"), \n",
" ymin=0, title=string(discalg)))\n",
"end\n",
"\n",
"g"
]
},
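{
"cell_type": "markdown",
"metadata": {},
"source": [
"The simplest of these rules depend only on the sample size $n$. The sketch below evaluates the textbook formulas for the square-root, Sturges, and Rice rules; it is an illustration only, and the package may round or clamp the results differently. The variable names are hypothetical.\n",
"\n",
"```julia\n",
"n = 500  # example sample size\n",
"\n",
"nbins_sqrt    = ceil(Int, sqrt(n))      # square-root rule: sqrt(n)\n",
"nbins_sturges = ceil(Int, log2(n)) + 1  # Sturges' rule: log2(n) + 1\n",
"nbins_rice    = ceil(Int, 2 * cbrt(n))  # Rice rule: 2 * n^(1/3)\n",
"```\n",
"\n",
"For `n = 500` these give 23, 10, and 16 bins respectively. The remaining rules (`:doane`, `:scott`, `:fd`) also look at the spread or shape of the data, so they cannot be computed from `n` alone."
]
},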
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another algorithm, MODL, finds optimal bins given both a continuous data set and a labelled discrete data set."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3-element Vector{AbstractFloat}:\n",
" -3.118600577762223\n",
" 0.4973214821864169\n",
" 3.4098333701115866"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = [randn(100); randn(100).+1.0]\n",
"labels = [fill(:cat, 100); fill(:dog, 100)]\n",
"integer_labels = encode(CategoricalDiscretizer([:cat, :dog]), labels)\n",
"edges = binedges(DiscretizeMODL_Optimal(), data, integer_labels)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More information on MODL can be found [here](http://nbviewer.ipython.org/github/sisl/Discretizers.jl/blob/master/doc/MODL/DiscretizationMODL.ipynb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.7.0",
"language": "julia",
"name": "julia-1.7"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 1
}