{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Explainable Outlier Detection in the Titanic dataset\n", "\n", "This short notebook illustrates basic usage of the [OutlierTree](https://github.com/david-cortes/outliertree) library for explainable outlier detection, using the Titanic dataset. For more details, see the package's documentation [here](http://outliertree.readthedocs.io/en/latest/).\n", "\n", "The dataset is widely used and can be downloaded from many sources, such as Kaggle or university webpages. This notebook took it from the following link: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv\n", "** *\n", "\n", "### Loading the raw data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pclass survived name sex \\\n", "0 1 1 Allen, Miss. Elisabeth Walton female \n", "1 1 1 Allison, Master. Hudson Trevor male \n", "2 1 0 Allison, Miss. Helen Loraine female \n", "3 1 0 Allison, Mr. Hudson Joshua Creighton male \n", "4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female \n", "\n", " age sibsp parch ticket fare cabin embarked boat body \\\n", "0 29.00 0 0 24160 211.3375 B5 S 2 NaN \n", "1 0.92 1 2 113781 151.5500 C22 C26 S 11 NaN \n", "2 2.00 1 2 113781 151.5500 C22 C26 S NaN NaN \n", "3 30.00 1 2 113781 151.5500 C22 C26 S NaN 135.0 \n", "4 25.00 1 2 113781 151.5500 C22 C26 S NaN NaN \n", "\n", " home.dest \n", "0 St Louis, MO \n", "1 Montreal, PQ / Chesterville, ON \n", "2 Montreal, PQ / Chesterville, ON \n", "3 Montreal, PQ / Chesterville, ON \n", "4 Montreal, PQ / Chesterville, ON " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np, pandas as pd\n", "from outliertree import OutlierTree\n", "\n", "## Read the raw data, downloaded from here:\n", "## https://github.com/jbryer/CompStats/raw/master/Data/titanic3.csv\n", "titanic = pd.read_csv(\"titanic3.csv\")\n", "titanic.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pre-processing the data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Pclass Survived Sex Age SibSp Parch Fare Cabin Embarked Boat \\\n", "0 1 Yes Female 29.00 0 0 211.3375 B5 S 2 \n", "1 1 Yes Male 0.92 1 2 151.5500 C22 C26 S 11 \n", "2 1 No Female 2.00 1 2 151.5500 C22 C26 S NaN \n", "3 1 No Male 30.00 1 2 151.5500 C22 C26 S NaN \n", "4 1 No Female 25.00 1 2 151.5500 C22 C26 S NaN \n", "\n", " Body \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 135.0 \n", "4 NaN " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Capitalize column names and some values for easier reading\n", "titanic.columns = titanic.columns.str.capitalize()\n", "titanic = titanic.rename(columns = {\"Sibsp\" : \"SibSp\"})\n", "titanic[\"Sex\"] = titanic[\"Sex\"].str.capitalize()\n", "\n", "## Convert 'survived' to yes/no for easier reading\n", "titanic[\"Survived\"] = titanic[\"Survived\"].astype(\"category\").replace({1:\"Yes\", 0:\"No\"})\n", "\n", "## Some columns are not useful, such as name (an ID), ticket number (another ID),\n", "## or destination (too many values, many non-repeated)\n", "cols_drop = [\"Name\", \"Ticket\", \"Home.dest\"]\n", "titanic = titanic.drop(cols_drop, axis=1)\n", "\n", "## Ordinal columns need to be passed as ordered categoricals\n", "cols_ord = [\"Pclass\", \"Parch\", \"SibSp\"]\n", "for col in cols_ord:\n", " titanic[col] = pd.Categorical(titanic[col], ordered=True)\n", "\n", "titanic.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fitting a model" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reporting top 9 outliers [out of 9 found]\n", "\n", "\n", "row [170] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 98.571% >= 25.74 - [mean: 55.22] - [sd: 27.56] - [norm. 
obs: 69]\n", "\tgiven:\n", "\t\t[Pclass] = [1]\n", "\t\t[Boat] in [9, B, 5, 7, C, 5 9, 1, 15, 5 7, 8 10, 12, 16, 13 15 B, C D, 15 16, 13 15] (value: C)\n", "\n", "\n", "row [18] - suspicious column: [Age] - suspicious value: [32.00]\n", "\tdistribution: 96.000% >= 43.00 - [mean: 48.35] - [sd: 3.16] - [norm. obs: 24]\n", "\tgiven:\n", "\t\t[Cabin] in [E12, D15, B10, E31, E58, C86, A16, A20, E63, C92, B82 B84, D33, B52 B54 B56, C124, D17, C110, C116, C126, D46] (value: D15)\n", "\n", "\n", "row [896] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]\n", "\tgiven:\n", "\t\t[Pclass] = [3]\n", "\t\t[SibSp] = [0]\n", "\n", "\n", "row [898] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]\n", "\tgiven:\n", "\t\t[Pclass] = [3]\n", "\t\t[SibSp] = [0]\n", "\n", "\n", "row [963] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]\n", "\tgiven:\n", "\t\t[Pclass] = [3]\n", "\t\t[SibSp] = [0]\n", "\n", "\n", "row [1254] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]\n", "\tgiven:\n", "\t\t[Pclass] = [3]\n", "\t\t[SibSp] = [0]\n", "\n", "\n", "row [1044] - suspicious column: [Fare] - suspicious value: [15.50]\n", "\tdistribution: 96.774% <= 8.52 - [mean: 7.73] - [sd: 0.28] - [norm. obs: 30]\n", "\tgiven:\n", "\t\t[Pclass] = [3]\n", "\t\t[SibSp] = [0]\n", "\t\t[Boat] in [3, 10, 4, 9, 6, B, 8, A, 5, 7, 5 9, 1, 5 7, 8 10, 16, 13 15 B, 15 16, 13 15] (value: 16)\n", "\n", "\n", "row [1146] - suspicious column: [Fare] - suspicious value: [29.12]\n", "\tdistribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. 
obs: 91]\n", "\tgiven:\n", "\t\t[Pclass] = [3]\n", "\t\t[SibSp] = [0]\n", "\t\t[Embarked] = [Q]\n", "\n", "\n", "row [1163] - suspicious column: [Fare] - suspicious value: [24.15]\n", "\tdistribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]\n", "\tgiven:\n", "\t\t[Pclass] = [3]\n", "\t\t[SibSp] = [0]\n", "\t\t[Embarked] = [Q]\n", "\n", "\n" ] }, { "data": { "text/plain": [ "OutlierTree model\n", "\tNumeric variables: 3\n", "\tCategorical variables: 5\n", "\tOrdinal variables: 3\n", "\n", "Consists of 221 clusters, spread across 18 tree branches" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Fit model with default hyperparameters\n", "otree = OutlierTree()\n", "otree.fit(titanic)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examining the results more closely" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Pclass Survived Sex Age SibSp Parch Fare Cabin Embarked Boat \\\n", "1146 3 No Female 39.0 0 5 29.125 NaN Q NaN \n", "1163 3 No Male NaN 0 0 24.150 NaN Q NaN \n", "\n", " Body \n", "1146 327.0 \n", "1163 NaN " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Double-check the data (last 2 outliers)\n", "titanic.loc[[1146, 1163]]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "## Distribution of the group from which those two outliers were flagged\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "titanic.loc[\n", " (titanic.Pclass == 3) &\n", " (titanic.SibSp == 0) &\n", " (titanic.Embarked == \"Q\")\n", "] .Fare.hist(bins=50, color=\"navy\", edgecolor='black', linewidth=1.2)\n", "plt.xlabel(\"Fare\", fontsize=15)\n", "plt.ylabel(\"Frequency\", fontsize=15)\n", "plt.title(\"Distribution of Fare within cluster\", fontsize=20)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " suspicious_value \\\n", "1146 {'column': 'Fare', 'value': 29.125, 'decimals'... \n", "1163 {'column': 'Fare', 'value': 24.15, 'decimals': 0} \n", "\n", " group_statistics \\\n", "1146 {'upper_thr': 15.5, 'pct_below': 0.97849462365... \n", "1163 {'upper_thr': 15.5, 'pct_below': 0.97849462365... \n", "\n", " conditions tree_depth \\\n", "1146 [{'column': 'Embarked', 'comparison': '=', 'va... 4.0 \n", "1163 [{'column': 'Embarked', 'comparison': '=', 'va... 4.0 \n", "\n", " uses_NA_branch outlier_score \n", "1146 False 0.003805 \n", "1163 False 0.005227 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Get the outliers in a manipulable format\n", "otree.predict(titanic).loc[[1146, 1163]]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " suspicious_value \\\n", "18 {'column': 'Age', 'value': 32.0, 'decimals': 0} \n", "170 {'column': 'Fare', 'value': 0.0, 'decimals': 0} \n", "896 {'column': 'Fare', 'value': 0.0, 'decimals': 0} \n", "898 {'column': 'Fare', 'value': 0.0, 'decimals': 0} \n", "963 {'column': 'Fare', 'value': 0.0, 'decimals': 0} \n", "1044 {'column': 'Fare', 'value': 15.5, 'decimals': 0} \n", "1146 {'column': 'Fare', 'value': 29.125, 'decimals'... \n", "1163 {'column': 'Fare', 'value': 24.15, 'decimals': 0} \n", "1254 {'column': 'Fare', 'value': 0.0, 'decimals': 0} \n", "\n", " group_statistics \\\n", "18 {'lower_thr': 43.0, 'pct_above': 0.96, 'mean':... \n", "170 {'lower_thr': 25.7417, 'pct_above': 0.98571428... \n", "896 {'lower_thr': 3.1708, 'pct_above': 0.992156862... \n", "898 {'lower_thr': 3.1708, 'pct_above': 0.992156862... \n", "963 {'lower_thr': 3.1708, 'pct_above': 0.992156862... \n", "1044 {'upper_thr': 8.5167, 'pct_below': 0.967741935... \n", "1146 {'upper_thr': 15.5, 'pct_below': 0.97849462365... \n", "1163 {'upper_thr': 15.5, 'pct_below': 0.97849462365... \n", "1254 {'lower_thr': 3.1708, 'pct_above': 0.992156862... \n", "\n", " conditions tree_depth \\\n", "18 [{'column': 'Cabin', 'comparison': 'in', 'valu... 3.0 \n", "170 [{'column': 'Boat', 'comparison': 'in', 'value... 2.0 \n", "896 [{'column': 'Pclass', 'comparison': '=', 'valu... 3.0 \n", "898 [{'column': 'Pclass', 'comparison': '=', 'valu... 3.0 \n", "963 [{'column': 'Pclass', 'comparison': '=', 'valu... 3.0 \n", "1044 [{'column': 'Boat', 'comparison': 'in', 'value... 4.0 \n", "1146 [{'column': 'Embarked', 'comparison': '=', 'va... 4.0 \n", "1163 [{'column': 'Embarked', 'comparison': '=', 'va... 4.0 \n", "1254 [{'column': 'Pclass', 'comparison': '=', 'valu... 
3.0 \n", "\n", " uses_NA_branch outlier_score \n", "18 False 0.007545 \n", "170 False 0.015339 \n", "896 False 0.011148 \n", "898 False 0.011148 \n", "963 False 0.011148 \n", "1044 False 0.002018 \n", "1146 False 0.003805 \n", "1163 False 0.005227 \n", "1254 False 0.011148 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## To programmatically get all the outliers that were flagged\n", "pred = otree.predict(titanic)\n", "pred.loc[~pred.outlier_score.isnull()]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reporting top 1 outliers [out of 1 found]\n", "\n", "\n", "row [1146] - suspicious column: [Fare] - suspicious value: [29.12]\n", "\tdistribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]\n", "\tgiven:\n", "\t\t[Pclass] = [3]\n", "\t\t[SibSp] = [0]\n", "\t\t[Embarked] = [Q]\n", "\n", "\n" ] } ], "source": [ "## To print selected rows only\n", "otree.print_outliers(pred.loc[[1146]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Trying different hyperparameters" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reporting top 5 outliers [out of 20 found]\n", "\n", "\n", "row [363] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]\n", "\tgiven:\n", "\t\t[Pclass] in [2, 3] (value: 2)\n", "\t\t[SibSp] = [0]\n", "\n", "\n", "row [384] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]\n", "\tgiven:\n", "\t\t[Pclass] in [2, 3] (value: 2)\n", "\t\t[SibSp] = [0]\n", "\n", "\n", "row [410] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. 
obs: 682]\n", "\tgiven:\n", "\t\t[Pclass] in [2, 3] (value: 2)\n", "\t\t[SibSp] = [0]\n", "\n", "\n", "row [473] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]\n", "\tgiven:\n", "\t\t[Pclass] in [2, 3] (value: 2)\n", "\t\t[SibSp] = [0]\n", "\n", "\n", "row [528] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]\n", "\tgiven:\n", "\t\t[Pclass] in [2, 3] (value: 2)\n", "\t\t[SibSp] = [0]\n", "\n", "\n" ] }, { "data": { "text/plain": [ "OutlierTree model\n", "\tNumeric variables: 3\n", "\tCategorical variables: 5\n", "\tOrdinal variables: 3\n", "\n", "Consists of 217 clusters, spread across 18 tree branches" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## In order to flag more outliers, one can also experiment\n", "## with lowering the threshold hyperparameters\n", "OutlierTree(z_outlier=6.).fit(titanic, outliers_print=5)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reporting top 5 outliers [out of 27 found]\n", "\n", "\n", "row [545] - suspicious column: [SibSp] - suspicious value: [3]\n", "\tdistribution: 99.701% in [0, 1, 2, 5, 8]\n", "\t( [norm. obs: 999] - [prior_prob: 1.528%] - [next smallest: 2.595%] )\n", "\tgiven:\n", "\t\t[Parch] = [0]\n", "\n", "\n", "row [656] - suspicious column: [SibSp] - suspicious value: [3]\n", "\tdistribution: 99.701% in [0, 1, 2, 5, 8]\n", "\t( [norm. obs: 999] - [prior_prob: 1.528%] - [next smallest: 2.595%] )\n", "\tgiven:\n", "\t\t[Parch] = [0]\n", "\n", "\n", "row [1274] - suspicious column: [SibSp] - suspicious value: [3]\n", "\tdistribution: 99.701% in [0, 1, 2, 5, 8]\n", "\t( [norm. 
obs: 999] - [prior_prob: 1.528%] - [next smallest: 2.595%] )\n", "\tgiven:\n", "\t\t[Parch] = [0]\n", "\n", "\n", "row [363] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]\n", "\tgiven:\n", "\t\t[Pclass] in [2, 3] (value: 2)\n", "\t\t[SibSp] = [0]\n", "\n", "\n", "row [384] - suspicious column: [Fare] - suspicious value: [0.00]\n", "\tdistribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]\n", "\tgiven:\n", "\t\t[Pclass] in [2, 3] (value: 2)\n", "\t\t[SibSp] = [0]\n", "\n", "\n" ] }, { "data": { "text/plain": [ "OutlierTree model\n", "\tNumeric variables: 3\n", "\tCategorical variables: 5\n", "\tOrdinal variables: 3\n", "\n", "Consists of 283 clusters, spread across 23 tree branches" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## One can also lower the gain threshold, but this tends\n", "## to result in more spurious outliers which come from\n", "## not-so-good splits (not recommended)\n", "OutlierTree(z_outlier=6, min_gain=1e-6).fit(titanic, outliers_print=5)" ] } ], "metadata": { "kernelspec": { "display_name": "Python (OpenBLAS)", "language": "python", "name": "py3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }