{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "Python 3.8.5 64-bit ('streamlit': conda)",
   "display_name": "Python 3.8.5 64-bit ('streamlit': conda)",
   "metadata": {
    "interpreter": {
     "hash": "d5b47a323bfc67f94a37e7cb8fd52db1ad06cc55d227a5b8c375f2f5af6e26dc"
    }
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "source": [
    "# Inverse normal cumulative distribution - finding $z$ values from probabilities with Python"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "\n",
    "Sometimes we would like to find the $z$ values belonging to a particular probability. For this we use an inverse cumulative distribution function.\n",
    "\n",
    "Because the cumulative distribution function is strictly increasing and continuous, we can find the $z$ by narrowing down with binary bisecting the distance between high and low $z$ values.\n",
    "\n",
    "\n",
    "The code is based upon the respective example from [Data Science from Scratch](https://www.oreilly.com/library/view/data-science-from/9781492041122/)."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## Libraries, helper functions"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math as m\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calc_normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:\n",
    "    return (1 + m.erf((x - mu) / m.sqrt(2) / sigma)) / 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def to_string(number: float, column_width: int = 20) -> str:\n",
    "    # return str(number).ljust(column_width)\n",
    "    return f\"{str(number): <{column_width}}\""
   ]
  },
  {
   "source": [
    "## Inverse normal CDF"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "The inverse normal cumulative distribution function.\n",
    "\n",
    "Takes a probability and optionaly a mean and standard deviation of the distribution, and a tolerance value.\n",
    "\n",
    "1. If the distribution is not standard normal, it calculates $z$ for the standard normal one and then rescales the result\n",
    "\n",
    "2. Take a very low and a very high $z$ value\n",
    "\n",
    "3. While the distance between the high and low $z$ values are greater than our tolerance\n",
    "\n",
    "    a. Take their midpoint  \n",
    "    b. Calculate the probability belonging to that midpoint  \n",
    "    c. Based on the position of the probability belonging of the midpoint in respect to the probabilty for which we try to find out the $z$ value, assign the midpoint to the low or high $z$ value  "
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calc_inverse_normal_cdf(p: float, mu:float = 0, sigma: float = 1, tolerance: float = 1E-5, show_steps=False) -> float:\n",
    "\n",
    "    # In case it is not a standard normal distribution, calculate the standard normal first and then rescale\n",
    "    if mu != 0 or sigma != 1:\n",
    "        return mu + sigma * calc_inverse_normal_cdf(p, tolerance=tolerance)\n",
    "\n",
    "    low_z = -10\n",
    "    hi_z = 10\n",
    "\n",
    "    while hi_z - low_z > tolerance:\n",
    "        mid_z = (low_z + hi_z) / 2\n",
    "        mid_p = calc_normal_cdf(mid_z)\n",
    "\n",
    "        if mid_p < p:\n",
    "            low_z = mid_z\n",
    "        else:\n",
    "            hi_z = mid_z\n",
    "\n",
    "        if show_steps: print(\"\\t\".join(map(to_string, [low_z, mid_z, hi_z])))\n",
    "\n",
    "    return mid_z"
   ]
  },
  {
   "source": [
    "We can examine the steps as the $z$ closes down to its final value."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "-10                 \t0.0                 \t0.0                 \n-5.0                \t-5.0                \t0.0                 \n-2.5                \t-2.5                \t0.0                 \n-1.25               \t-1.25               \t0.0                 \n-0.625              \t-0.625              \t0.0                 \n-0.625              \t-0.3125             \t-0.3125             \n-0.625              \t-0.46875            \t-0.46875            \n-0.546875           \t-0.546875           \t-0.46875            \n-0.546875           \t-0.5078125          \t-0.5078125          \n-0.52734375         \t-0.52734375         \t-0.5078125          \n-0.52734375         \t-0.517578125        \t-0.517578125        \n-0.52734375         \t-0.5224609375       \t-0.5224609375       \n-0.52490234375      \t-0.52490234375      \t-0.5224609375       \n-0.52490234375      \t-0.523681640625     \t-0.523681640625     \n-0.52490234375      \t-0.5242919921875    \t-0.5242919921875    \n-0.52459716796875   \t-0.52459716796875   \t-0.5242919921875    \n-0.524444580078125  \t-0.524444580078125  \t-0.5242919921875    \n-0.524444580078125  \t-0.5243682861328125 \t-0.5243682861328125 \n-0.5244064331054688 \t-0.5244064331054688 \t-0.5243682861328125 \n-0.5244064331054688 \t-0.5243873596191406 \t-0.5243873596191406 \n-0.5244064331054688 \t-0.5243968963623047 \t-0.5243968963623047 \n"
     ]
    },
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "-0.5243968963623047"
      ]
     },
     "metadata": {},
     "execution_count": 5
    }
   ],
   "source": [
    "calc_inverse_normal_cdf(p=.3, show_steps=True)"
   ]
  },
  {
   "source": [
    "Finally, we get a number of reference values for some different probabilities"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "0.0:   -9.999990463256836\n0.1:  -1.2815570831298828\n0.2:  -0.8416271209716797\n0.3:  -0.5243968963623047\n0.4:  -0.2533435821533203\n0.5: -9.5367431640625e-06\n0.6:   0.2533435821533203\n0.7:   0.5243968963623047\n0.8:   0.8416271209716797\n0.9:   1.2815570831298828\n1.0:    8.244009017944336\n"
     ]
    }
   ],
   "source": [
    "for i in np.arange(0, 1.1, .1):\n",
    "    print(f\"{i:.1f}: {calc_inverse_normal_cdf(i) : >20}\")"
   ]
  }
 ]
}