{ "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5-final" }, "orig_nbformat": 2, "kernelspec": { "name": "Python 3.8.5 64-bit ('streamlit': conda)", "display_name": "Python 3.8.5 64-bit ('streamlit': conda)", "metadata": { "interpreter": { "hash": "d5b47a323bfc67f94a37e7cb8fd52db1ad06cc55d227a5b8c375f2f5af6e26dc" } } } }, "nbformat": 4, "nbformat_minor": 2, "cells": [ { "source": [ "# Inverse normal cumulative distribution - finding $z$ values from probabilities with Python" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "\n", "Sometimes we would like to find the $z$ values belonging to a particular probability. For this we use an inverse cumulative distribution function.\n", "\n", "Because the cumulative distribution function is strictly increasing and continuous, we can find the $z$ by narrowing down with binary bisecting the distance between high and low $z$ values.\n", "\n", "\n", "The code is based upon the respective example from [Data Science from Scratch](https://www.oreilly.com/library/view/data-science-from/9781492041122/)." ], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Libraries, helper functions" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import math as m\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def calc_normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:\n", " return (1 + m.erf((x - mu) / m.sqrt(2) / sigma)) / 2" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def to_string(number: float, column_width: int = 20) -> str:\n", " # return str(number).ljust(column_width)\n", " return f\"{str(number): <{column_width}}\"" ] }, { "source": [ "## Inverse normal CDF" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "The inverse normal cumulative distribution function.\n", "\n", "Takes a probability and optionaly a mean and standard deviation of the distribution, and a tolerance value.\n", "\n", "1. If the distribution is not standard normal, it calculates $z$ for the standard normal one and then rescales the result\n", "\n", "2. Take a very low and a very high $z$ value\n", "\n", "3. While the distance between the high and low $z$ values are greater than our tolerance\n", "\n", " a. Take their midpoint \n", " b. Calculate the probability belonging to that midpoint \n", " c. Based on the position of the probability belonging of the midpoint in respect to the probabilty for which we try to find out the $z$ value, assign the midpoint to the low or high $z$ value " ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def calc_inverse_normal_cdf(p: float, mu:float = 0, sigma: float = 1, tolerance: float = 1E-5, show_steps=False) -> float:\n", "\n", " # In case it is not a standard normal distribution, calculate the standard normal first and then rescale\n", " if mu != 0 or sigma != 1:\n", " return mu + sigma * calc_inverse_normal_cdf(p, tolerance=tolerance)\n", "\n", " low_z = -10\n", " hi_z = 10\n", "\n", " while hi_z - low_z > tolerance:\n", " mid_z = (low_z + hi_z) / 2\n", " mid_p = calc_normal_cdf(mid_z)\n", "\n", " if mid_p < p:\n", " low_z = mid_z\n", " else:\n", " hi_z = mid_z\n", "\n", " if show_steps: print(\"\\t\".join(map(to_string, [low_z, mid_z, hi_z])))\n", "\n", " return mid_z" ] }, { "source": [ "We can examine the steps as the $z$ closes down to its final value." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "-10 \t0.0 \t0.0 \n-5.0 \t-5.0 \t0.0 \n-2.5 \t-2.5 \t0.0 \n-1.25 \t-1.25 \t0.0 \n-0.625 \t-0.625 \t0.0 \n-0.625 \t-0.3125 \t-0.3125 \n-0.625 \t-0.46875 \t-0.46875 \n-0.546875 \t-0.546875 \t-0.46875 \n-0.546875 \t-0.5078125 \t-0.5078125 \n-0.52734375 \t-0.52734375 \t-0.5078125 \n-0.52734375 \t-0.517578125 \t-0.517578125 \n-0.52734375 \t-0.5224609375 \t-0.5224609375 \n-0.52490234375 \t-0.52490234375 \t-0.5224609375 \n-0.52490234375 \t-0.523681640625 \t-0.523681640625 \n-0.52490234375 \t-0.5242919921875 \t-0.5242919921875 \n-0.52459716796875 \t-0.52459716796875 \t-0.5242919921875 \n-0.524444580078125 \t-0.524444580078125 \t-0.5242919921875 \n-0.524444580078125 \t-0.5243682861328125 \t-0.5243682861328125 \n-0.5244064331054688 \t-0.5244064331054688 \t-0.5243682861328125 \n-0.5244064331054688 \t-0.5243873596191406 \t-0.5243873596191406 \n-0.5244064331054688 \t-0.5243968963623047 \t-0.5243968963623047 \n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "-0.5243968963623047" ] }, "metadata": {}, "execution_count": 5 } ], "source": [ "calc_inverse_normal_cdf(p=.3, show_steps=True)" ] }, { "source": [ "Finally, we get a number of reference values for some different probabilities" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "0.0: -9.999990463256836\n0.1: -1.2815570831298828\n0.2: -0.8416271209716797\n0.3: -0.5243968963623047\n0.4: -0.2533435821533203\n0.5: -9.5367431640625e-06\n0.6: 0.2533435821533203\n0.7: 0.5243968963623047\n0.8: 0.8416271209716797\n0.9: 1.2815570831298828\n1.0: 8.244009017944336\n" ] } ], "source": [ "for i in np.arange(0, 1.1, .1):\n", " print(f\"{i:.1f}: {calc_inverse_normal_cdf(i) : >20}\")" ] } ] }