{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading required package: daltoolbox\n", "\n", "Registered S3 method overwritten by 'quantmod':\n", " method from\n", " as.zoo.data.frame zoo \n", "\n", "\n", "Attaching package: ‘daltoolbox’\n", "\n", "\n", "The following object is masked from ‘package:base’:\n", "\n", " transform\n", "\n", "\n" ] } ], "source": [ "# DAL ToolBox\n", "# version 1.0.777\n", "\n", "source(\"https://raw.githubusercontent.com/cefet-rj-dal/daltoolbox/main/jupyter.R\")\n", "\n", "# loading DAL\n", "load_library(\"daltoolbox\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discretization & smoothing\n", "Discretization is the process of converting continuous functions, models, variables, and equations into discrete counterparts. \n", "\n", "Smoothing is a technique that builds an approximating function to capture the important patterns in the data while leaving out noise and other fine-scale structures or rapid phenomena.\n", "\n", "An important step in discretization and smoothing is defining the bins used for the approximation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## General function to evaluate different smoothing techniques" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
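<table><caption>A data.frame: 6 × 5</caption>
<thead><tr><th></th><th>Sepal.Length</th><th>Sepal.Width</th><th>Petal.Length</th><th>Petal.Width</th><th>Species</th></tr>
<tr><th></th><th>&lt;dbl&gt;</th><th>&lt;dbl&gt;</th><th>&lt;dbl&gt;</th><th>&lt;dbl&gt;</th><th>&lt;fct&gt;</th></tr></thead>
<tbody><tr><th>1</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>
<tr><th>2</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>
<tr><th>3</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>setosa</td></tr>
<tr><th>4</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>setosa</td></tr>
<tr><th>5</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>
<tr><th>6</th><td>5.4</td><td>3.9</td><td>1.7</td><td>0.4</td><td>setosa</td></tr></tbody></table>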
\n" ], "text/latex": [ "A data.frame: 6 × 5\n", "\\begin{tabular}{r|lllll}\n", " & Sepal.Length & Sepal.Width & Petal.Length & Petal.Width & Species\\\\\n", " & & & & & \\\\\n", "\\hline\n", "\t1 & 5.1 & 3.5 & 1.4 & 0.2 & setosa\\\\\n", "\t2 & 4.9 & 3.0 & 1.4 & 0.2 & setosa\\\\\n", "\t3 & 4.7 & 3.2 & 1.3 & 0.2 & setosa\\\\\n", "\t4 & 4.6 & 3.1 & 1.5 & 0.2 & setosa\\\\\n", "\t5 & 5.0 & 3.6 & 1.4 & 0.2 & setosa\\\\\n", "\t6 & 5.4 & 3.9 & 1.7 & 0.4 & setosa\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 5\n", "\n", "| | Sepal.Length <dbl> | Sepal.Width <dbl> | Petal.Length <dbl> | Petal.Width <dbl> | Species <fct> |\n", "|---|---|---|---|---|---|\n", "| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |\n", "| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |\n", "| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |\n", "| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |\n", "| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |\n", "| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |\n", "\n" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "1 5.1 3.5 1.4 0.2 setosa \n", "2 4.9 3.0 1.4 0.2 setosa \n", "3 4.7 3.2 1.3 0.2 setosa \n", "4 4.6 3.1 1.5 0.2 setosa \n", "5 5.0 3.6 1.4 0.2 setosa \n", "6 5.4 3.9 1.7 0.4 setosa " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "iris <- datasets::iris\n", "head(iris)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sl.bi\n", "5.22409638554217 6.61044776119403 \n", " 83 67 \n" ] }, { "data": { "text/html": [ "\n", "
  1. 4.3
  2. 5.9172720733681
  3. 7.9
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 4.3\n", "\\item 5.9172720733681\n", "\\item 7.9\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 4.3\n", "2. 5.9172720733681\n", "3. 7.9\n", "\n", "\n" ], "text/plain": [ "[1] 4.300000 5.917272 7.900000" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# smoothing using clustering\n", "obj <- smoothing_cluster(n = 2) \n", "obj <- fit(obj, iris$Sepal.Length)\n", "sl.bi <- transform(obj, iris$Sepal.Length)\n", "print(table(sl.bi))\n", "obj$interval" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 1.12088\n" ] } ], "source": [ "# evaluate the bins against the class labels and print the entropy\n", "entro <- evaluate(obj, as.factor(names(sl.bi)), iris$Species)\n", "print(entro$entropy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimizing the number of bins" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "8" ], "text/latex": [ "8" ], "text/markdown": [ "8" ], "text/plain": [ "[1] 8" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# try 1 to 20 bins; fit() selects the number of bins, available in obj$n\n", "opt_obj <- smoothing_cluster(n = 1:20)\n", "obj <- fit(opt_obj, iris$Sepal.Length)\n", "obj$n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sl.bi\n", "4.52727272727273 5.00294117647059 5.49 5.88333333333333 \n", " 11 34 20 30 \n", " 6.352 6.75294117647059 7.2 7.71666666666667 \n", " 25 17 7 6 \n" ] } ], "source": [ "# refit with the selected number of bins and transform the data\n", "obj <- fit(obj, iris$Sepal.Length)\n", "sl.bi <- transform(obj, iris$Sepal.Length)\n", "print(table(sl.bi))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.3.3" } 
}, "nbformat": 4, "nbformat_minor": 4 }