{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NA and Outlier analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading required package: daltoolbox\n",
      "\n",
      "Registered S3 method overwritten by 'quantmod':\n",
      "  method            from\n",
      "  as.zoo.data.frame zoo \n",
      "\n",
      "\n",
      "Attaching package: ‘daltoolbox’\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:base’:\n",
      "\n",
      "    transform\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# DAL ToolBox\n",
    "# version 1.01.727\n",
    "\n",
    "source(\"https://raw.githubusercontent.com/cefet-rj-dal/daltoolbox/main/jupyter.R\")\n",
    "\n",
    "#loading DAL\n",
    "load_library(\"daltoolbox\") "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Outlier removal\n",
    "The following class uses box-plot definition for outliers.\n",
    "\n",
    "An outlier is a value that is below than $Q_1 - 1.5 \\cdot IQR$ or higher than $Q_3 + 1.5 \\cdot IQR$.\n",
    "\n",
    "The class remove outliers for numeric attributes. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### removing outliers of a data frame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"dataframe\">\n",
       "<caption>A data.frame: 6 × 5</caption>\n",
       "<thead>\n",
       "\t<tr><th></th><th scope=col>Sepal.Length</th><th scope=col>Sepal.Width</th><th scope=col>Petal.Length</th><th scope=col>Petal.Width</th><th scope=col>Species</th></tr>\n",
       "\t<tr><th></th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;fct&gt;</th></tr>\n",
       "</thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>1</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>\n",
       "\t<tr><th scope=row>2</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>\n",
       "\t<tr><th scope=row>3</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>setosa</td></tr>\n",
       "\t<tr><th scope=row>4</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>setosa</td></tr>\n",
       "\t<tr><th scope=row>5</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>\n",
       "\t<tr><th scope=row>6</th><td>5.4</td><td>3.9</td><td>1.7</td><td>0.4</td><td>setosa</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "A data.frame: 6 × 5\n",
       "\\begin{tabular}{r|lllll}\n",
       "  & Sepal.Length & Sepal.Width & Petal.Length & Petal.Width & Species\\\\\n",
       "  & <dbl> & <dbl> & <dbl> & <dbl> & <fct>\\\\\n",
       "\\hline\n",
       "\t1 & 5.1 & 3.5 & 1.4 & 0.2 & setosa\\\\\n",
       "\t2 & 4.9 & 3.0 & 1.4 & 0.2 & setosa\\\\\n",
       "\t3 & 4.7 & 3.2 & 1.3 & 0.2 & setosa\\\\\n",
       "\t4 & 4.6 & 3.1 & 1.5 & 0.2 & setosa\\\\\n",
       "\t5 & 5.0 & 3.6 & 1.4 & 0.2 & setosa\\\\\n",
       "\t6 & 5.4 & 3.9 & 1.7 & 0.4 & setosa\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/markdown": [
       "\n",
       "A data.frame: 6 × 5\n",
       "\n",
       "| <!--/--> | Sepal.Length &lt;dbl&gt; | Sepal.Width &lt;dbl&gt; | Petal.Length &lt;dbl&gt; | Petal.Width &lt;dbl&gt; | Species &lt;fct&gt; |\n",
       "|---|---|---|---|---|---|\n",
       "| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |\n",
       "| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |\n",
       "| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |\n",
       "| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |\n",
       "| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |\n",
       "| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |\n",
       "\n"
      ],
      "text/plain": [
       "  Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n",
       "1 5.1          3.5         1.4          0.2         setosa \n",
       "2 4.9          3.0         1.4          0.2         setosa \n",
       "3 4.7          3.2         1.3          0.2         setosa \n",
       "4 4.6          3.1         1.5          0.2         setosa \n",
       "5 5.0          3.6         1.4          0.2         setosa \n",
       "6 5.4          3.9         1.7          0.4         setosa "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "146"
      ],
      "text/latex": [
       "146"
      ],
      "text/markdown": [
       "146"
      ],
      "text/plain": [
       "[1] 146"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# code for outlier removal\n",
    "out_obj <- outliers() # class for outlier analysis\n",
    "out_obj <- fit(out_obj, iris) # computing boundaries\n",
    "iris.clean <- transform(out_obj, iris) # returning cleaned dataset\n",
    "\n",
    "# inspection of cleaned dataset\n",
    "head(iris.clean)\n",
    "nrow(iris.clean)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualizing the actual outliers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "idx\n",
      "FALSE  TRUE \n",
      "  146     4 \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table class=\"dataframe\">\n",
       "<caption>A data.frame: 4 × 5</caption>\n",
       "<thead>\n",
       "\t<tr><th></th><th scope=col>Sepal.Length</th><th scope=col>Sepal.Width</th><th scope=col>Petal.Length</th><th scope=col>Petal.Width</th><th scope=col>Species</th></tr>\n",
       "\t<tr><th></th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;fct&gt;</th></tr>\n",
       "</thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>16</th><td>5.7</td><td>4.4</td><td>1.5</td><td>0.4</td><td>setosa    </td></tr>\n",
       "\t<tr><th scope=row>33</th><td>5.2</td><td>4.1</td><td>1.5</td><td>0.1</td><td>setosa    </td></tr>\n",
       "\t<tr><th scope=row>34</th><td>5.5</td><td>4.2</td><td>1.4</td><td>0.2</td><td>setosa    </td></tr>\n",
       "\t<tr><th scope=row>61</th><td>5.0</td><td>2.0</td><td>3.5</td><td>1.0</td><td>versicolor</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "A data.frame: 4 × 5\n",
       "\\begin{tabular}{r|lllll}\n",
       "  & Sepal.Length & Sepal.Width & Petal.Length & Petal.Width & Species\\\\\n",
       "  & <dbl> & <dbl> & <dbl> & <dbl> & <fct>\\\\\n",
       "\\hline\n",
       "\t16 & 5.7 & 4.4 & 1.5 & 0.4 & setosa    \\\\\n",
       "\t33 & 5.2 & 4.1 & 1.5 & 0.1 & setosa    \\\\\n",
       "\t34 & 5.5 & 4.2 & 1.4 & 0.2 & setosa    \\\\\n",
       "\t61 & 5.0 & 2.0 & 3.5 & 1.0 & versicolor\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/markdown": [
       "\n",
       "A data.frame: 4 × 5\n",
       "\n",
       "| <!--/--> | Sepal.Length &lt;dbl&gt; | Sepal.Width &lt;dbl&gt; | Petal.Length &lt;dbl&gt; | Petal.Width &lt;dbl&gt; | Species &lt;fct&gt; |\n",
       "|---|---|---|---|---|---|\n",
       "| 16 | 5.7 | 4.4 | 1.5 | 0.4 | setosa     |\n",
       "| 33 | 5.2 | 4.1 | 1.5 | 0.1 | setosa     |\n",
       "| 34 | 5.5 | 4.2 | 1.4 | 0.2 | setosa     |\n",
       "| 61 | 5.0 | 2.0 | 3.5 | 1.0 | versicolor |\n",
       "\n"
      ],
      "text/plain": [
       "   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   \n",
       "16 5.7          4.4         1.5          0.4         setosa    \n",
       "33 5.2          4.1         1.5          0.1         setosa    \n",
       "34 5.5          4.2         1.4          0.2         setosa    \n",
       "61 5.0          2.0         3.5          1.0         versicolor"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "idx <- attr(iris.clean, \"idx\")\n",
    "print(table(idx))\n",
    "iris.outliers <- iris[idx,]\n",
    "head(iris.outliers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "4.3.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}