{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Phi_K advanced tutorial\n",
    "\n",
    "This notebook guides you through the more advanced functionality of the phik package. This notebook will not cover all the underlying theory, but will just attempt to give an overview of all the options that are available. For a theoretical description the user is referred to our paper.\n",
    "\n",
    "The package offers functionality on three related topics:\n",
    "\n",
    "1. Phik correlation matrix\n",
    "2. Significance matrix\n",
    "3. Outlier significance matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "# install phik (if not installed yet)\n",
    "import sys\n",
    "\n",
    "!\"{sys.executable}\" -m pip install phik"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import standard packages\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import itertools\n",
    "\n",
    "import phik\n",
    "\n",
    "from phik import resources\n",
    "from phik.binning import bin_data\n",
    "from phik.decorators import *\n",
    "from phik.report import plot_correlation_matrix\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# if one changes something in the phik-package one can automatically reload the package or module\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load data\n",
    "\n",
    "A simulated dataset is part of the phik-package. The dataset concerns car insurance data. Load the dataset here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv( resources.fixture('fake_insurance_data.csv.gz') )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>car_color</th>\n",
       "      <th>driver_age</th>\n",
       "      <th>area</th>\n",
       "      <th>mileage</th>\n",
       "      <th>car_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>black</td>\n",
       "      <td>26.377219</td>\n",
       "      <td>suburbs</td>\n",
       "      <td>156806.288398</td>\n",
       "      <td>XXL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>black</td>\n",
       "      <td>58.976840</td>\n",
       "      <td>suburbs</td>\n",
       "      <td>74400.323559</td>\n",
       "      <td>XL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>multicolor</td>\n",
       "      <td>55.744988</td>\n",
       "      <td>downtown</td>\n",
       "      <td>267856.748015</td>\n",
       "      <td>XXL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>metalic</td>\n",
       "      <td>57.629139</td>\n",
       "      <td>downtown</td>\n",
       "      <td>259028.249060</td>\n",
       "      <td>XXL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>green</td>\n",
       "      <td>21.490637</td>\n",
       "      <td>downtown</td>\n",
       "      <td>110712.216080</td>\n",
       "      <td>XL</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    car_color  driver_age      area        mileage car_size\n",
       "0       black   26.377219   suburbs  156806.288398      XXL\n",
       "1       black   58.976840   suburbs   74400.323559       XL\n",
       "2  multicolor   55.744988  downtown  267856.748015      XXL\n",
       "3     metalic   57.629139  downtown  259028.249060      XXL\n",
       "4       green   21.490637  downtown  110712.216080       XL"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Specify bin types\n",
    "\n",
    "The phik-package offers a way to calculate correlations between variables of mixed types. Variable types can be inferred automatically although we recommend to variable types to be specified by the user. \n",
    "\n",
    "Because interval type variables need to be binned in order to calculate phik and the significance, a list of interval variables is created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['driver_age', 'mileage']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_types = {'severity': 'interval',\n",
    "             'driver_age':'interval',\n",
    "             'satisfaction':'ordinal',\n",
    "             'mileage':'interval',\n",
    "             'car_size':'ordinal',\n",
    "             'car_use':'ordinal',\n",
    "             'car_color':'categorical',\n",
    "             'area':'categorical'}\n",
    "\n",
    "interval_cols = [col for col, v in data_types.items() if v=='interval' and col in data.columns]\n",
    "interval_cols\n",
    "# interval_cols is used below"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Phik correlation matrix\n",
    "\n",
    "Now let's start calculating the correlation phik between pairs of variables. \n",
    "\n",
    "Note that the original dataset is used as input, the binning of interval variables is done automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>car_color</th>\n",
       "      <th>driver_age</th>\n",
       "      <th>area</th>\n",
       "      <th>mileage</th>\n",
       "      <th>car_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>car_color</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.389671</td>\n",
       "      <td>0.590456</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>driver_age</th>\n",
       "      <td>0.389671</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.105506</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>area</th>\n",
       "      <td>0.590456</td>\n",
       "      <td>0.105506</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mileage</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.768589</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>car_size</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.768589</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            car_color  driver_age      area   mileage  car_size\n",
       "car_color    1.000000    0.389671  0.590456  0.000000  0.000000\n",
       "driver_age   0.389671    1.000000  0.105506  0.000000  0.000000\n",
       "area         0.590456    0.105506  1.000000  0.000000  0.000000\n",
       "mileage      0.000000    0.000000  0.000000  1.000000  0.768589\n",
       "car_size     0.000000    0.000000  0.000000  0.768589  1.000000"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "phik_overview = data.phik_matrix(interval_cols=interval_cols)\n",
    "phik_overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specify binning per interval variable\n",
    "\n",
    "Binning can be set per interval variable individually. One can set the number of bins, or specify a list of bin edges. Note that the measured phik correlation is dependent on the chosen binning. \n",
    "The default binning is uniform between the min and max values of the interval variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>car_color</th>\n",
       "      <th>driver_age</th>\n",
       "      <th>area</th>\n",
       "      <th>mileage</th>\n",
       "      <th>car_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>car_color</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.388350</td>\n",
       "      <td>0.590456</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>driver_age</th>\n",
       "      <td>0.388350</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.071189</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>area</th>\n",
       "      <td>0.590456</td>\n",
       "      <td>0.071189</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mileage</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.665845</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>car_size</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.665845</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            car_color  driver_age      area   mileage  car_size\n",
       "car_color    1.000000    0.388350  0.590456  0.000000  0.000000\n",
       "driver_age   0.388350    1.000000  0.071189  0.000000  0.000000\n",
       "area         0.590456    0.071189  1.000000  0.000000  0.000000\n",
       "mileage      0.000000    0.000000  0.000000  1.000000  0.665845\n",
       "car_size     0.000000    0.000000  0.000000  0.665845  1.000000"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bins = {'mileage':5, 'driver_age':[18,25,35,45,55,65,125]}\n",
    "phik_overview = data.phik_matrix(interval_cols=interval_cols, bins=bins)\n",
    "phik_overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Do not apply noise correction\n",
    "\n",
    "For low statistics samples often a correlation larger than zero is measured when no correlation is actually present in the true underlying distribution. This is not only the case for phik, but also for the pearson correlation and Cramer's phi (see figure 4 in <font color='red'> XX </font>). In the phik calculation a noise correction is applied by default, to take into account erroneous correlation values as a result of low statistics. To switch off this noise cancellation (not recommended), do:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>car_color</th>\n",
       "      <th>driver_age</th>\n",
       "      <th>area</th>\n",
       "      <th>mileage</th>\n",
       "      <th>car_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>car_color</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.407860</td>\n",
       "      <td>0.594172</td>\n",
       "      <td>0.136267</td>\n",
       "      <td>0.096629</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>driver_age</th>\n",
       "      <td>0.407860</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.190390</td>\n",
       "      <td>0.199606</td>\n",
       "      <td>0.121585</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>area</th>\n",
       "      <td>0.594172</td>\n",
       "      <td>0.190390</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.149679</td>\n",
       "      <td>0.067452</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mileage</th>\n",
       "      <td>0.136267</td>\n",
       "      <td>0.199606</td>\n",
       "      <td>0.149679</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.770836</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>car_size</th>\n",
       "      <td>0.096629</td>\n",
       "      <td>0.121585</td>\n",
       "      <td>0.067452</td>\n",
       "      <td>0.770836</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            car_color  driver_age      area   mileage  car_size\n",
       "car_color    1.000000    0.407860  0.594172  0.136267  0.096629\n",
       "driver_age   0.407860    1.000000  0.190390  0.199606  0.121585\n",
       "area         0.594172    0.190390  1.000000  0.149679  0.067452\n",
       "mileage      0.136267    0.199606  0.149679  1.000000  0.770836\n",
       "car_size     0.096629    0.121585  0.067452  0.770836  1.000000"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "phik_overview = data.phik_matrix(interval_cols=interval_cols, noise_correction=False)\n",
    "phik_overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using a different expectation histogram\n",
    "\n",
    "By default phik compares the 2d distribution of two (binned) variables with the distribution that assumes no dependency between them. One can also change the expected distribution though. Phi_K is calculated in the same way, but using the other expectation distribution. "
   ]
  },
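  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of what the default expectation looks like (the notation here is ours and is only meant as an illustration): for an observed contingency table $O_{ij}$ with $N$ records in total, the expectation under the assumption of no dependency is the product of the marginals,\n",
    "\n",
    "$$E_{ij} = \\frac{\\left(\\sum_{k} O_{ik}\\right)\\left(\\sum_{l} O_{lj}\\right)}{N}\\ ;\\quad N = \\sum_{i,j} O_{ij}$$\n",
    "\n",
    "Observed and expected tables are then compared bin by bin through a chi-square statistic, for example in the Pearson form\n",
    "\n",
    "$$\\chi^2 = \\sum_{i,j} \\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$\n",
    "\n",
    "Supplying a different expectation amounts to replacing $E_{ij}$ in this comparison."
   ]
  },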
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phik.binning import auto_bin_data\n",
    "from phik.phik import phik_observed_vs_expected_from_rebinned_df, phik_from_hist2d\n",
    "from phik.statistics import get_dependent_frequency_estimates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get observed 2d histogram of two variables\n",
    "cols = [\"mileage\", \"car_size\"]\n",
    "icols = [\"mileage\"]\n",
    "observed = data[cols].hist2d(interval_cols=icols).values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.768588829489185\n"
     ]
    }
   ],
   "source": [
    "# default phik evaluation from observed distribution\n",
    "phik_value = phik_from_hist2d(observed)\n",
    "print (phik_value)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.768588829489185\n"
     ]
    }
   ],
   "source": [
    "# phik evaluation from an observed and expected distribution\n",
    "expected = get_dependent_frequency_estimates(observed)\n",
    "phik_value = phik_from_hist2d(observed=observed, expected=expected)\n",
    "print (phik_value)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# one can also compare two datasets against each other, and get a full phik matrix that way.\n",
    "# this needs binned datasets though. \n",
    "# (the user needs to make sure the binnings of both datasets are identical.) \n",
    "data_binned, _ = auto_bin_data(data, interval_cols=interval_cols)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# here we are comparing data_binned against itself\n",
    "phik_matrix = phik_observed_vs_expected_from_rebinned_df(data_binned, data_binned)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>car_color</th>\n",
       "      <th>driver_age</th>\n",
       "      <th>area</th>\n",
       "      <th>mileage</th>\n",
       "      <th>car_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>car_color</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>driver_age</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>area</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mileage</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>car_size</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            car_color  driver_age  area  mileage  car_size\n",
       "car_color         1.0         0.0   0.0      0.0       0.0\n",
       "driver_age        0.0         1.0   0.0      0.0       0.0\n",
       "area              0.0         0.0   1.0      0.0       0.0\n",
       "mileage           0.0         0.0   0.0      1.0       0.0\n",
       "car_size          0.0         0.0   0.0      0.0       1.0"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# all off-diagonal entries are zero, meaning the all 2d distributions of both datasets are identical.\n",
    "# (by construction the diagonal is one.)\n",
    "phik_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Statistical significance of the correlation\n",
    "\n",
    "When assessing correlations it is good practise to evaluate both the correlation and the significance of the correlation: a large correlation may be statistically insignificant, and vice versa a small correlation may be very significant. For instance, scipy.stats.pearsonr returns both the pearson correlation and the p-value. Similarly, the phik package offers functionality the calculate a significance matrix. Significance is defined as:\n",
    "\n",
    "$$Z = \\Phi^{-1}(1-p)\\ ;\\quad \\Phi(z)=\\frac{1}{\\sqrt{2\\pi}} \\int_{-\\infty}^{z} e^{-t^{2}/2}\\,{\\rm d}t $$\n",
    "\n",
    "Several corrections to the 'standard' p-value calculation are taken into account, making the method more robust for low statistics and sparse data cases. The user is referred to our paper for more details.\n",
    "\n",
    "Due to the corrections, the significance calculation can take a few seconds."
   ]
  },
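  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To give a feeling for the scale of $Z$, a few worked values follow directly from the definition above, read as a one-sided conversion from p-value to standard-normal quantile:\n",
    "\n",
    "$$p = 0.5 \\Rightarrow Z = 0\\ ;\\quad p = 0.05 \\Rightarrow Z \\approx 1.64\\ ;\\quad p \\approx 1.35\\times10^{-3} \\Rightarrow Z \\approx 3$$\n",
    "\n",
    "A p-value above 0.5 therefore maps to a negative $Z$. Small negative entries in a significance matrix can be read this way: they indicate no evidence for a correlation."
   ]
  },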
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>car_color</th>\n",
       "      <th>driver_age</th>\n",
       "      <th>area</th>\n",
       "      <th>mileage</th>\n",
       "      <th>car_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>car_color</th>\n",
       "      <td>85.498655</td>\n",
       "      <td>19.836720</td>\n",
       "      <td>37.623764</td>\n",
       "      <td>-0.559532</td>\n",
       "      <td>-0.483387</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>driver_age</th>\n",
       "      <td>19.836720</td>\n",
       "      <td>84.370542</td>\n",
       "      <td>1.852524</td>\n",
       "      <td>-0.572284</td>\n",
       "      <td>-0.459980</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>area</th>\n",
       "      <td>37.623764</td>\n",
       "      <td>1.852524</td>\n",
       "      <td>72.415600</td>\n",
       "      <td>-0.560672</td>\n",
       "      <td>-0.273138</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mileage</th>\n",
       "      <td>-0.559532</td>\n",
       "      <td>-0.572284</td>\n",
       "      <td>-0.560672</td>\n",
       "      <td>91.262677</td>\n",
       "      <td>49.285368</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>car_size</th>\n",
       "      <td>-0.483387</td>\n",
       "      <td>-0.459980</td>\n",
       "      <td>-0.273138</td>\n",
       "      <td>49.285368</td>\n",
       "      <td>69.064056</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            car_color  driver_age       area    mileage   car_size\n",
       "car_color   85.498655   19.836720  37.623764  -0.559532  -0.483387\n",
       "driver_age  19.836720   84.370542   1.852524  -0.572284  -0.459980\n",
       "area        37.623764    1.852524  72.415600  -0.560672  -0.273138\n",
       "mileage     -0.559532   -0.572284  -0.560672  91.262677  49.285368\n",
       "car_size    -0.483387   -0.459980  -0.273138  49.285368  69.064056"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "significance_overview = data.significance_matrix(interval_cols=interval_cols)\n",
    "significance_overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specify binning per interval variable\n",
    "Binning can be set per interval variable individually. One can set the number of bins, or specify a list of bin edges. Note that the measure phik correlation is dependent on the chosen binning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>car_color</th>\n",
       "      <th>driver_age</th>\n",
       "      <th>area</th>\n",
       "      <th>mileage</th>\n",
       "      <th>car_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>car_color</th>\n",
       "      <td>85.480870</td>\n",
       "      <td>20.544400</td>\n",
       "      <td>37.613135</td>\n",
       "      <td>-0.214896</td>\n",
       "      <td>-0.447747</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>driver_age</th>\n",
       "      <td>20.544400</td>\n",
       "      <td>83.344168</td>\n",
       "      <td>2.478032</td>\n",
       "      <td>-0.563892</td>\n",
       "      <td>-0.534263</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>area</th>\n",
       "      <td>37.613135</td>\n",
       "      <td>2.478032</td>\n",
       "      <td>72.428355</td>\n",
       "      <td>-0.309349</td>\n",
       "      <td>-0.260994</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mileage</th>\n",
       "      <td>-0.214896</td>\n",
       "      <td>-0.563892</td>\n",
       "      <td>-0.309349</td>\n",
       "      <td>77.784086</td>\n",
       "      <td>47.010736</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>car_size</th>\n",
       "      <td>-0.447747</td>\n",
       "      <td>-0.534263</td>\n",
       "      <td>-0.260994</td>\n",
       "      <td>47.010736</td>\n",
       "      <td>69.081712</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            car_color  driver_age       area    mileage   car_size\n",
       "car_color   85.480870   20.544400  37.613135  -0.214896  -0.447747\n",
       "driver_age  20.544400   83.344168   2.478032  -0.563892  -0.534263\n",
       "area        37.613135    2.478032  72.428355  -0.309349  -0.260994\n",
       "mileage     -0.214896   -0.563892  -0.309349  77.784086  47.010736\n",
       "car_size    -0.447747   -0.534263  -0.260994  47.010736  69.081712"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bins = {'mileage':5, 'driver_age':[18,25,35,45,55,65,125]}\n",
    "significance_overview = data.significance_matrix(interval_cols=interval_cols, bins=bins)\n",
    "significance_overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specify significance method\n",
    "\n",
    "The recommended method to calculate the significance of the correlation is a hybrid approach, which uses the G-test statistic. The number of degrees of freedom and an analytical, empirical description of the $\\chi^2$ distribution are sed, based on Monte Carlo simulations. This method works well for both high as low statistics samples.\n",
    "\n",
    "Other approaches to calculate the significance are implemented:\n",
    "- asymptotic: fast, but over-estimates the number of degrees of freedom for low statistics samples, leading to erroneous values of the significance\n",
    "- MC: Many simulated samples are needed to accurately measure significances larger than 3, making this method computationally expensive.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>car_color</th>\n",
       "      <th>driver_age</th>\n",
       "      <th>area</th>\n",
       "      <th>mileage</th>\n",
       "      <th>car_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>car_color</th>\n",
       "      <td>85.526574</td>\n",
       "      <td>19.681564</td>\n",
       "      <td>37.661844</td>\n",
       "      <td>-0.385023</td>\n",
       "      <td>-0.333340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>driver_age</th>\n",
       "      <td>19.681564</td>\n",
       "      <td>84.014654</td>\n",
       "      <td>1.742050</td>\n",
       "      <td>-0.947153</td>\n",
       "      <td>-0.793434</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>area</th>\n",
       "      <td>37.661844</td>\n",
       "      <td>1.742050</td>\n",
       "      <td>72.440209</td>\n",
       "      <td>-0.465002</td>\n",
       "      <td>-0.123678</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mileage</th>\n",
       "      <td>-0.385023</td>\n",
       "      <td>-0.947153</td>\n",
       "      <td>-0.465002</td>\n",
       "      <td>91.301129</td>\n",
       "      <td>49.332305</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>car_size</th>\n",
       "      <td>-0.333340</td>\n",
       "      <td>-0.793434</td>\n",
       "      <td>-0.123678</td>\n",
       "      <td>49.332305</td>\n",
       "      <td>69.107448</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            car_color  driver_age       area    mileage   car_size\n",
       "car_color   85.526574   19.681564  37.661844  -0.385023  -0.333340\n",
       "driver_age  19.681564   84.014654   1.742050  -0.947153  -0.793434\n",
       "area        37.661844    1.742050  72.440209  -0.465002  -0.123678\n",
       "mileage     -0.385023   -0.947153  -0.465002  91.301129  49.332305\n",
       "car_size    -0.333340   -0.793434  -0.123678  49.332305  69.107448"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "significance_overview = data.significance_matrix(interval_cols=interval_cols, significance_method='asymptotic')\n",
    "significance_overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simulation method\n",
    "\n",
    "The chi2 of a contingency table is measured using a comparison of the expected frequencies with the true frequencies in a contingency table. The expected frequencies can be simulated in a variety of ways. The following methods are implemented:\n",
    "\n",
    " - multinominal: Only the total number of records is fixed. (default)\n",
    " - row_product_multinominal: The row totals fixed in the sampling.\n",
    " - col_product_multinominal: The column totals fixed in the sampling.\n",
    " - hypergeometric: Both the row or column totals are fixed in the sampling. (Note that this type of sampling is only available when row and column totals are integers, which is usually the case.)"
   ]
  },
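  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the comparison described above, the standard Pearson form of the statistic is shown below. The symbols are notation assumed here for illustration: $O_{ij}$ is the observed frequency in cell $(i,j)$, $E_{ij}$ the expected frequency, $R_i$ and $C_j$ the row and column totals, and $N$ the total number of records. The independence estimate for $E_{ij}$ is the usual textbook choice; the package internals may differ in detail.\n",
    "\n",
    "$$\\chi^2 = \\sum_{i,j} \\frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\, ,\\quad\\quad E_{ij} = \\frac{R_i \\, C_j}{N}$$\n",
    "\n",
    "In these terms, the simulation methods listed above differ in which of $N$, $R_i$ and $C_j$ are held fixed when new tables are sampled."
   ]
  },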
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# --- Warning, can be slow\n",
    "#     turned off here by default for unit testing purposes\n",
    "\n",
    "#significance_overview = data.significance_matrix(interval_cols=interval_cols, simulation_method='hypergeometric')\n",
    "#significance_overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Expected frequencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phik.simulation import sim_2d_data_patefield, sim_2d_product_multinominal, sim_2d_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>area</th>\n",
       "      <th>country_side</th>\n",
       "      <th>downtown</th>\n",
       "      <th>hills</th>\n",
       "      <th>suburbs</th>\n",
       "      <th>unpaved_roads</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>driver_age</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>11.0</td>\n",
       "      <td>86.0</td>\n",
       "      <td>123.0</td>\n",
       "      <td>147.0</td>\n",
       "      <td>21.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>9.0</td>\n",
       "      <td>77.0</td>\n",
       "      <td>137.0</td>\n",
       "      <td>125.0</td>\n",
       "      <td>31.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>7.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>131.0</td>\n",
       "      <td>130.0</td>\n",
       "      <td>18.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>17.0</td>\n",
       "      <td>83.0</td>\n",
       "      <td>130.0</td>\n",
       "      <td>95.0</td>\n",
       "      <td>14.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>13.0</td>\n",
       "      <td>68.0</td>\n",
       "      <td>120.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>7.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>51.0</td>\n",
       "      <td>47.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>14.0</td>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "area        country_side  downtown  hills  suburbs  unpaved_roads\n",
       "driver_age                                                       \n",
       "1                   11.0      86.0  123.0    147.0           21.0\n",
       "2                    9.0      77.0  137.0    125.0           31.0\n",
       "3                    7.0     102.0  131.0    130.0           18.0\n",
       "4                   17.0      83.0  130.0     95.0           14.0\n",
       "5                   13.0      68.0  120.0     72.0            8.0\n",
       "6                    7.0      30.0   51.0     47.0            9.0\n",
       "7                    1.0      11.0   23.0     14.0            7.0\n",
       "8                    0.0       4.0    7.0      8.0            2.0\n",
       "9                    0.0       0.0    1.0      1.0            0.0\n",
       "10                   0.0       1.0    1.0      0.0            0.0"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inputdata = data[['driver_age', 'area']].hist2d(interval_cols=['driver_age'])\n",
    "inputdata"
   ]
  },
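  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a worked example of an expected frequency, take the cell (driver_age bin 1, country_side) of the table above. Assuming the usual independence estimate (bin total times category total, divided by the overall total; an assumption made here for illustration), the numbers in the table give:\n",
    "\n",
    "$$E = \\frac{388 \\times 65}{2000} \\approx 12.6 \\, ,\\quad\\quad O = 11 \\, ,\\quad\\quad \\frac{(O - E)^2}{E} \\approx 0.21$$\n",
    "\n",
    "Here 388 is the total of driver_age bin 1, 65 is the country_side total, and 2000 is the total number of records. A small deficit like this one contributes little to the overall statistic."
   ]
  },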
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Multinominal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data total: 2000.0\n",
      "sim  total: 2000\n",
      "data row totals: [ 65. 462. 724. 639. 110.]\n",
      "sim  row totals: [ 75 468 748 586 123]\n",
      "data column totals: [388. 379. 388. 339. 281. 144.  56.  21.   2.   2.]\n",
      "sim  column totals: [378 380 375 335 281 164  59  25   1   2]\n"
     ]
    }
   ],
   "source": [
    "simdata = sim_2d_data(inputdata.values)\n",
    "print('data total:', inputdata.sum().sum())\n",
    "print('sim  total:', simdata.sum().sum())\n",
    "print('data row totals:', inputdata.sum(axis=0).values)\n",
    "print('sim  row totals:', simdata.sum(axis=0))\n",
    "print('data column totals:', inputdata.sum(axis=1).values)\n",
    "print('sim  column totals:', simdata.sum(axis=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### product multinominal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data total: 2000.0\n",
      "sim  total: 2000\n",
      "data row totals: [ 65 462 724 639 110]\n",
      "sim  row totals: [ 65 462 724 639 110]\n",
      "data column totals: [388 379 388 339 281 144  56  21   2   2]\n",
      "sim  column totals: [399 353 415 349 272 139  45  22   4   2]\n"
     ]
    }
   ],
   "source": [
    "simdata = sim_2d_product_multinominal(inputdata.values, axis=0)\n",
    "print('data total:', inputdata.sum().sum())\n",
    "print('sim  total:', simdata.sum().sum())\n",
    "print('data row totals:', inputdata.sum(axis=0).astype(int).values)\n",
    "print('sim  row totals:', simdata.sum(axis=0).astype(int))\n",
    "print('data column totals:', inputdata.sum(axis=1).astype(int).values)\n",
    "print('sim  column totals:', simdata.sum(axis=1).astype(int))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### hypergeometric (\"patefield\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data total: 2000.0\n",
      "sim  total: 2000\n",
      "data row totals: [ 65 462 724 639 110]\n",
      "sim  row totals: [ 65 462 724 639 110]\n",
      "data column totals: [388 379 388 339 281 144  56  21   2   2]\n",
      "sim  column totals: [388 379 388 339 281 144  56  21   2   2]\n"
     ]
    }
   ],
   "source": [
    "# patefield simulation needs compiled c++ code.\n",
    "# only run this if the python binding to the (compiled) patefiled simulation function is found.\n",
    "try:\n",
    "    from phik.simcore import _sim_2d_data_patefield\n",
    "    CPP_SUPPORT = True\n",
    "except ImportError:\n",
    "    CPP_SUPPORT = False\n",
    "\n",
    "if CPP_SUPPORT:\n",
    "    simdata = sim_2d_data_patefield(inputdata.values)\n",
    "    print('data total:', inputdata.sum().sum())\n",
    "    print('sim  total:', simdata.sum().sum())\n",
    "    print('data row totals:', inputdata.sum(axis=0).astype(int).values)\n",
    "    print('sim  row totals:', simdata.sum(axis=0))\n",
    "    print('data column totals:', inputdata.sum(axis=1).astype(int).values)\n",
    "    print('sim  column totals:', simdata.sum(axis=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Outlier significance\n",
    "\n",
    "The normal pearson correlation between two interval variables is easy to interpret. However, the phik correlation between two variables of mixed type is not always easy to interpret, especially when it concerns categorical variables. Therefore, functionality is provided to detect \"outliers\": excesses and deficits over the expected frequencies  in the contingency table of two variables. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example 1: mileage versus car_size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the categorical variable pair mileage - car_size we measured:\n",
    "\n",
    "$$\\phi_k = 0.77 \\, ,\\quad\\quad \\mathrm{significance} = 46.3$$\n",
    "\n",
    "Let's use the outlier significance functionality to gain a better understanding of this significance correlation between mileage and car size.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "c0 = 'mileage'\n",
    "c1 = 'car_size'\n",
    "\n",
    "tmp_interval_cols = ['mileage']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>car_size</th>\n",
       "      <th>L</th>\n",
       "      <th>M</th>\n",
       "      <th>S</th>\n",
       "      <th>XL</th>\n",
       "      <th>XS</th>\n",
       "      <th>XXL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>53.5_30047.0</th>\n",
       "      <td>6.882155</td>\n",
       "      <td>21.483476</td>\n",
       "      <td>18.076204</td>\n",
       "      <td>-8.209536</td>\n",
       "      <td>10.820863</td>\n",
       "      <td>-22.423985</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30047.0_60040.5</th>\n",
       "      <td>20.034528</td>\n",
       "      <td>-0.251737</td>\n",
       "      <td>-3.408409</td>\n",
       "      <td>2.534277</td>\n",
       "      <td>-1.973628</td>\n",
       "      <td>-8.209536</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>60040.5_90033.9</th>\n",
       "      <td>1.627610</td>\n",
       "      <td>-3.043497</td>\n",
       "      <td>-2.265809</td>\n",
       "      <td>10.215936</td>\n",
       "      <td>-1.246784</td>\n",
       "      <td>-8.209536</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>90033.9_120027.4</th>\n",
       "      <td>-3.711579</td>\n",
       "      <td>-3.827278</td>\n",
       "      <td>-2.885475</td>\n",
       "      <td>12.999048</td>\n",
       "      <td>-1.638288</td>\n",
       "      <td>-7.185622</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>120027.4_150020.9</th>\n",
       "      <td>-7.665861</td>\n",
       "      <td>-6.173001</td>\n",
       "      <td>-4.746762</td>\n",
       "      <td>9.629145</td>\n",
       "      <td>-2.841508</td>\n",
       "      <td>-0.504521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>150020.9_180014.4</th>\n",
       "      <td>-7.533189</td>\n",
       "      <td>-6.063786</td>\n",
       "      <td>-4.660049</td>\n",
       "      <td>1.559370</td>\n",
       "      <td>-2.785049</td>\n",
       "      <td>6.765549</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>180014.4_210007.8</th>\n",
       "      <td>-5.541940</td>\n",
       "      <td>-4.425929</td>\n",
       "      <td>-3.360023</td>\n",
       "      <td>-4.802787</td>\n",
       "      <td>-1.942469</td>\n",
       "      <td>10.520540</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>210007.8_240001.3</th>\n",
       "      <td>-3.496905</td>\n",
       "      <td>-2.745103</td>\n",
       "      <td>-2.030802</td>\n",
       "      <td>-5.850529</td>\n",
       "      <td>-1.100873</td>\n",
       "      <td>8.723925</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>240001.3_269994.8</th>\n",
       "      <td>-5.275976</td>\n",
       "      <td>-4.207164</td>\n",
       "      <td>-3.186534</td>\n",
       "      <td>-8.616464</td>\n",
       "      <td>-1.830944</td>\n",
       "      <td>13.303101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>269994.8_299988.2</th>\n",
       "      <td>-8.014016</td>\n",
       "      <td>-6.458253</td>\n",
       "      <td>-4.973240</td>\n",
       "      <td>-12.868389</td>\n",
       "      <td>-2.989055</td>\n",
       "      <td>20.992824</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "car_size                   L          M          S         XL         XS  \\\n",
       "53.5_30047.0        6.882155  21.483476  18.076204  -8.209536  10.820863   \n",
       "30047.0_60040.5    20.034528  -0.251737  -3.408409   2.534277  -1.973628   \n",
       "60040.5_90033.9     1.627610  -3.043497  -2.265809  10.215936  -1.246784   \n",
       "90033.9_120027.4   -3.711579  -3.827278  -2.885475  12.999048  -1.638288   \n",
       "120027.4_150020.9  -7.665861  -6.173001  -4.746762   9.629145  -2.841508   \n",
       "150020.9_180014.4  -7.533189  -6.063786  -4.660049   1.559370  -2.785049   \n",
       "180014.4_210007.8  -5.541940  -4.425929  -3.360023  -4.802787  -1.942469   \n",
       "210007.8_240001.3  -3.496905  -2.745103  -2.030802  -5.850529  -1.100873   \n",
       "240001.3_269994.8  -5.275976  -4.207164  -3.186534  -8.616464  -1.830944   \n",
       "269994.8_299988.2  -8.014016  -6.458253  -4.973240 -12.868389  -2.989055   \n",
       "\n",
       "car_size                 XXL  \n",
       "53.5_30047.0      -22.423985  \n",
       "30047.0_60040.5    -8.209536  \n",
       "60040.5_90033.9    -8.209536  \n",
       "90033.9_120027.4   -7.185622  \n",
       "120027.4_150020.9  -0.504521  \n",
       "150020.9_180014.4   6.765549  \n",
       "180014.4_210007.8  10.520540  \n",
       "210007.8_240001.3   8.723925  \n",
       "240001.3_269994.8  13.303101  \n",
       "269994.8_299988.2  20.992824  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outlier_signifs, binning_dict = data[[c0,c1]].outlier_significance_matrix(interval_cols=tmp_interval_cols, \n",
    "                                                                          retbins=True)\n",
    "outlier_signifs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specify binning per interval variable\n",
    "Binning can be set per interval variable individually. One can set the number of bins, or specify a list of bin edges. \n",
    "\n",
    "Note: in case a bin is created without any records this bin will be automatically dropped in the phik and (outlier) significance calculations. However, in the outlier significance calculation this will currently lead to an error as the number of provided bin edges does not match the number of bins anymore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>car_size</th>\n",
       "      <th>L</th>\n",
       "      <th>M</th>\n",
       "      <th>S</th>\n",
       "      <th>XL</th>\n",
       "      <th>XS</th>\n",
       "      <th>XXL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0.0_100.0</th>\n",
       "      <td>-0.223635</td>\n",
       "      <td>-0.153005</td>\n",
       "      <td>-0.096640</td>\n",
       "      <td>-0.504167</td>\n",
       "      <td>2.150837</td>\n",
       "      <td>-1.337308</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100.0_1000.0</th>\n",
       "      <td>-0.742899</td>\n",
       "      <td>-0.533211</td>\n",
       "      <td>2.164954</td>\n",
       "      <td>-1.469996</td>\n",
       "      <td>5.704340</td>\n",
       "      <td>-3.272689</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1000.0_10000.0</th>\n",
       "      <td>-3.489668</td>\n",
       "      <td>3.499856</td>\n",
       "      <td>18.061724</td>\n",
       "      <td>-6.831062</td>\n",
       "      <td>11.617394</td>\n",
       "      <td>-13.063085</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10000.0_100000.0</th>\n",
       "      <td>25.086723</td>\n",
       "      <td>15.956527</td>\n",
       "      <td>-0.251877</td>\n",
       "      <td>5.162309</td>\n",
       "      <td>-3.896807</td>\n",
       "      <td>-8.209536</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100000.0_1000000.0</th>\n",
       "      <td>-8.209536</td>\n",
       "      <td>-17.223164</td>\n",
       "      <td>-13.626621</td>\n",
       "      <td>-2.140870</td>\n",
       "      <td>-8.688844</td>\n",
       "      <td>44.933133</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "car_size                    L          M          S        XL         XS  \\\n",
       "0.0_100.0           -0.223635  -0.153005  -0.096640 -0.504167   2.150837   \n",
       "100.0_1000.0        -0.742899  -0.533211   2.164954 -1.469996   5.704340   \n",
       "1000.0_10000.0      -3.489668   3.499856  18.061724 -6.831062  11.617394   \n",
       "10000.0_100000.0    25.086723  15.956527  -0.251877  5.162309  -3.896807   \n",
       "100000.0_1000000.0  -8.209536 -17.223164 -13.626621 -2.140870  -8.688844   \n",
       "\n",
       "car_size                  XXL  \n",
       "0.0_100.0           -1.337308  \n",
       "100.0_1000.0        -3.272689  \n",
       "1000.0_10000.0     -13.063085  \n",
       "10000.0_100000.0    -8.209536  \n",
       "100000.0_1000000.0  44.933133  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bins = [0,1E2, 1E3, 1E4, 1E5, 1E6]\n",
    "outlier_signifs, binning_dict = data[[c0,c1]].outlier_significance_matrix(interval_cols=tmp_interval_cols, \n",
    "                                                                          bins=bins, retbins=True)\n",
    "outlier_signifs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specify binning per interval variable -- dealing with underflow and overflow\n",
    "\n",
    "When specifying custom bins as situation can occur when the minimal (maximum) value in the data is smaller (larger) than the minimum (maximum) bin edge. Data points outside the specified range will be collected in the underflow (UF) and overflow (OF) bins. One can choose how to deal with these under/overflow bins, by setting the drop_underflow and drop_overflow variables.\n",
    "\n",
    "Note that the drop_underflow and drop_overflow options are also available for the calculation of the phik matrix and the significance matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>car_size</th>\n",
       "      <th>L</th>\n",
       "      <th>M</th>\n",
       "      <th>S</th>\n",
       "      <th>XL</th>\n",
       "      <th>XS</th>\n",
       "      <th>XXL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>100.0_1000.0</th>\n",
       "      <td>-0.742899</td>\n",
       "      <td>-0.533211</td>\n",
       "      <td>2.164954</td>\n",
       "      <td>-1.469996</td>\n",
       "      <td>5.704340</td>\n",
       "      <td>-3.272689</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1000.0_10000.0</th>\n",
       "      <td>-3.489668</td>\n",
       "      <td>3.499856</td>\n",
       "      <td>18.061724</td>\n",
       "      <td>-6.831062</td>\n",
       "      <td>11.617394</td>\n",
       "      <td>-13.063085</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10000.0_100000.0</th>\n",
       "      <td>25.086723</td>\n",
       "      <td>15.956527</td>\n",
       "      <td>-0.251877</td>\n",
       "      <td>5.162309</td>\n",
       "      <td>-3.896807</td>\n",
       "      <td>-8.209536</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>OF</th>\n",
       "      <td>-8.209536</td>\n",
       "      <td>-17.223164</td>\n",
       "      <td>-13.626621</td>\n",
       "      <td>-2.140870</td>\n",
       "      <td>-8.688844</td>\n",
       "      <td>44.933133</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>UF</th>\n",
       "      <td>-0.223635</td>\n",
       "      <td>-0.153005</td>\n",
       "      <td>-0.096640</td>\n",
       "      <td>-0.504167</td>\n",
       "      <td>2.150837</td>\n",
       "      <td>-1.337308</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "car_size                  L          M          S        XL         XS  \\\n",
       "100.0_1000.0      -0.742899  -0.533211   2.164954 -1.469996   5.704340   \n",
       "1000.0_10000.0    -3.489668   3.499856  18.061724 -6.831062  11.617394   \n",
       "10000.0_100000.0  25.086723  15.956527  -0.251877  5.162309  -3.896807   \n",
       "OF                -8.209536 -17.223164 -13.626621 -2.140870  -8.688844   \n",
       "UF                -0.223635  -0.153005  -0.096640 -0.504167   2.150837   \n",
       "\n",
       "car_size                XXL  \n",
       "100.0_1000.0      -3.272689  \n",
       "1000.0_10000.0   -13.063085  \n",
       "10000.0_100000.0  -8.209536  \n",
       "OF                44.933133  \n",
       "UF                -1.337308  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bins = [1E2, 1E3, 1E4, 1E5]\n",
    "outlier_signifs, binning_dict = data[[c0,c1]].outlier_significance_matrix(interval_cols=tmp_interval_cols, \n",
    "                                                                          bins=bins, retbins=True, \n",
    "                                                                          drop_underflow=False,\n",
    "                                                                          drop_overflow=False)\n",
    "outlier_signifs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dealing with NaN's in the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's add some missing values to our data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.loc[np.random.choice(range(len(data)), size=10), 'car_size'] = np.nan\n",
    "data.loc[np.random.choice(range(len(data)), size=10), 'mileage'] = np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes there can be information in the missing values and in which case you might want to consider the NaN values as a separate category. This can be achieved by setting the dropna argument to False."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>car_size</th>\n",
       "      <th>L</th>\n",
       "      <th>M</th>\n",
       "      <th>NaN</th>\n",
       "      <th>S</th>\n",
       "      <th>XL</th>\n",
       "      <th>XS</th>\n",
       "      <th>XXL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>100.0_1000.0</th>\n",
       "      <td>-0.742899</td>\n",
       "      <td>-0.533211</td>\n",
       "      <td>-0.053620</td>\n",
       "      <td>2.185319</td>\n",
       "      <td>-1.467322</td>\n",
       "      <td>5.704340</td>\n",
       "      <td>-3.254118</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1000.0_10000.0</th>\n",
       "      <td>-3.489668</td>\n",
       "      <td>3.499856</td>\n",
       "      <td>1.632438</td>\n",
       "      <td>17.591610</td>\n",
       "      <td>-6.821511</td>\n",
       "      <td>11.617394</td>\n",
       "      <td>-13.000691</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10000.0_100000.0</th>\n",
       "      <td>24.909164</td>\n",
       "      <td>15.798682</td>\n",
       "      <td>-1.078812</td>\n",
       "      <td>-0.081242</td>\n",
       "      <td>4.943028</td>\n",
       "      <td>-3.875525</td>\n",
       "      <td>-8.209536</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NaN</th>\n",
       "      <td>0.132649</td>\n",
       "      <td>0.488424</td>\n",
       "      <td>-0.073439</td>\n",
       "      <td>-0.455333</td>\n",
       "      <td>-0.132365</td>\n",
       "      <td>-0.211155</td>\n",
       "      <td>-0.012896</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>OF</th>\n",
       "      <td>-8.209536</td>\n",
       "      <td>-17.158980</td>\n",
       "      <td>-0.283391</td>\n",
       "      <td>-13.396642</td>\n",
       "      <td>-1.909226</td>\n",
       "      <td>-8.651800</td>\n",
       "      <td>43.560131</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>UF</th>\n",
       "      <td>-0.223635</td>\n",
       "      <td>-0.153005</td>\n",
       "      <td>-0.013130</td>\n",
       "      <td>-0.094218</td>\n",
       "      <td>-0.503051</td>\n",
       "      <td>2.150837</td>\n",
       "      <td>-1.328194</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "car_size                  L          M       NaN          S        XL  \\\n",
       "100.0_1000.0      -0.742899  -0.533211 -0.053620   2.185319 -1.467322   \n",
       "1000.0_10000.0    -3.489668   3.499856  1.632438  17.591610 -6.821511   \n",
       "10000.0_100000.0  24.909164  15.798682 -1.078812  -0.081242  4.943028   \n",
       "NaN                0.132649   0.488424 -0.073439  -0.455333 -0.132365   \n",
       "OF                -8.209536 -17.158980 -0.283391 -13.396642 -1.909226   \n",
       "UF                -0.223635  -0.153005 -0.013130  -0.094218 -0.503051   \n",
       "\n",
       "car_size                 XS        XXL  \n",
       "100.0_1000.0       5.704340  -3.254118  \n",
       "1000.0_10000.0    11.617394 -13.000691  \n",
       "10000.0_100000.0  -3.875525  -8.209536  \n",
       "NaN               -0.211155  -0.012896  \n",
       "OF                -8.651800  43.560131  \n",
       "UF                 2.150837  -1.328194  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
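    "# custom bin edges for the interval variable; keep the underflow (UF), overflow (OF) and NaN bins\n",
    "bins = [1E2, 1E3, 1E4, 1E5]\n",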
    "outlier_signifs, binning_dict = data[[c0,c1]].outlier_significance_matrix(interval_cols=tmp_interval_cols, \n",
    "                                                                          bins=bins, retbins=True, \n",
    "                                                                          drop_underflow=False,\n",
    "                                                                          drop_overflow=False,\n",
    "                                                                          dropna=False)\n",
    "outlier_signifs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here OF and UF are the overflow and underflow bins, respectively, of the binned interval variable shown on the rows.\n",
    "\n",
    "To ignore records with missing values, set dropna to True (the default)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>car_size</th>\n",
       "      <th>L</th>\n",
       "      <th>M</th>\n",
       "      <th>S</th>\n",
       "      <th>XL</th>\n",
       "      <th>XS</th>\n",
       "      <th>XXL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>100.0_1000.0</th>\n",
       "      <td>-0.745805</td>\n",
       "      <td>-0.534179</td>\n",
       "      <td>2.177522</td>\n",
       "      <td>-1.473602</td>\n",
       "      <td>5.695755</td>\n",
       "      <td>-3.268662</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1000.0_10000.0</th>\n",
       "      <td>-3.451793</td>\n",
       "      <td>3.559705</td>\n",
       "      <td>17.674546</td>\n",
       "      <td>-6.770807</td>\n",
       "      <td>11.651568</td>\n",
       "      <td>-12.916946</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10000.0_100000.0</th>\n",
       "      <td>25.035896</td>\n",
       "      <td>15.868135</td>\n",
       "      <td>-0.121191</td>\n",
       "      <td>4.904070</td>\n",
       "      <td>-3.896177</td>\n",
       "      <td>-8.209536</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>OF</th>\n",
       "      <td>-8.209536</td>\n",
       "      <td>-17.164792</td>\n",
       "      <td>-13.459625</td>\n",
       "      <td>-1.934622</td>\n",
       "      <td>-8.695547</td>\n",
       "      <td>44.449479</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>UF</th>\n",
       "      <td>-0.224643</td>\n",
       "      <td>-0.153312</td>\n",
       "      <td>-0.095154</td>\n",
       "      <td>-0.505661</td>\n",
       "      <td>2.146765</td>\n",
       "      <td>-1.335316</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "car_size                  L          M          S        XL         XS  \\\n",
       "100.0_1000.0      -0.745805  -0.534179   2.177522 -1.473602   5.695755   \n",
       "1000.0_10000.0    -3.451793   3.559705  17.674546 -6.770807  11.651568   \n",
       "10000.0_100000.0  25.035896  15.868135  -0.121191  4.904070  -3.896177   \n",
       "OF                -8.209536 -17.164792 -13.459625 -1.934622  -8.695547   \n",
       "UF                -0.224643  -0.153312  -0.095154 -0.505661   2.146765   \n",
       "\n",
       "car_size                XXL  \n",
       "100.0_1000.0      -3.268662  \n",
       "1000.0_10000.0   -12.916946  \n",
       "10000.0_100000.0  -8.209536  \n",
       "OF                44.449479  \n",
       "UF                -1.335316  "
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# same call as above, but records with missing values are dropped (dropna=True)\n",
    "bins = [1E2, 1E3, 1E4, 1E5]\n",
    "outlier_signifs, binning_dict = data[[c0,c1]].outlier_significance_matrix(interval_cols=tmp_interval_cols, \n",
    "                                                                          bins=bins, retbins=True, \n",
    "                                                                          drop_underflow=False,\n",
    "                                                                          drop_overflow=False,\n",
    "                                                                          dropna=True)\n",
    "outlier_signifs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the dropna option is also available when calculating the phik matrix and the significance matrix."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}