{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[MIT License](https://github.com/cdslaborg/paramonte#license) \n",
"[ParaMonte: plain powerful parallel Monte Carlo library](https://github.com/cdslaborg/paramonte). \n",
"Copyright (C) 2012-present, [The Computational Data Science Lab](https://www.cdslab.org/#about) \n",
"https://github.com/cdslaborg/paramonte \n",
"[References](https://www.cdslab.org/paramonte/notes/overview/preface/#how-to-acknowledge-the-use-of-the-paramonte-library-in-your-work) "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"warnings.simplefilter('ignore')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"paramonte 2.3.0\n",
"numpy 1.19.2\n",
"scipy 1.5.2\n",
"pandas 1.1.3\n",
"seaborn 0.11.0\n",
"matplotlib 3.3.2\n"
]
}
],
"source": [
"%matplotlib notebook\n",
"import paramonte as pm\n",
"import numpy as np\n",
"import scipy as sp\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib as mpl\n",
"import matplotlib.pyplot as plt\n",
"print('\\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))\n",
"sns.set()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling the distribution of 1-dimensional data subject to sample incompleteness \n",
"\n",
"Supposed we have observed a dataset comprised of $15027$ events with one attribute variable in this [data.csv](https://github.com/cdslaborg/paramontex/raw/main/Python/Jupyter/regression_censored_lognormal_data_paradram/data.csv) file. Plotting these points would yield a red-colored histogram like the following figure, \n",
" \n",
"
\n",
" \n",
"The pale black shaded area represents the missing points from our observational dataset. These are points that we could NOT observe (or detect) because of some instrumental bias and sample incompleteness as represented by the black line.\n",
"\n",
"**Statement of problem** \n",
"\n",
"Now, our goal is to form a hypothesis about this dataset, that is, a hypothesis about the distribution of the events in the above plot. To make a correct assessment, we will have to also carefully consider the effects of the detection threshold (the black line) in our inference.\n",
"\n",
"The first thing we can do is to obtain a better visualization of this dataset. The data looks highly skewed to the right. Therefore, it makes sense to take the logarithm of this dataset in the hope that it can help us better understand the possible underlying generating distribution of this data, \n",
" \n",
"
\n",
"\n",
"Just by looking at the observed (red) distribution, we can form a relatively good hypothesis about the distribution of the data: If the detection threshold did not exist, the complete dataset (including the points in the black shaded area) would likely very well resemble a lognormal distribution (or a normal distribution on the logarithmic axes). \n",
"\n",
"However, this dataset is affected by the detection threshold and we need to also take a model of the detection threshold into account. The logarithmic transformation makes it crystal-clear to us that the detection threshold is likely best modeled by as single value represented by the vertical line. \n",
"\n",
"Therefore, we form the following hypothesis, \n",
"\n",
"**Hypothesis**: Our data comes from a Lognormal distribution subject to censorship by sharp hard cutoff threshold. The proposition that data is lognormally distributed is mathematically equivalent to the proposition that `log(data)` is normally distributed. \n",
"\n",
"**Goal**: To **estimate the unknown parameters** of the Normal distribution, the **mean** and the **standard deviation**: ($\\mu, \\sigma$), as well as the **location of the sharp censorship threshold** in our data set ($\\eta$). \n",
"\n",
"**Methodology**: We will use the maximum likelihood approach and the **ParaDRAM** (**Delayed-Rejection Adaptive Metropolis-Hastings Markov Chain Monte Carlo**) sampler of the `paramonte` library to estimate the 3 unknown parameters of our hypothesis. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Read the data and log-transform it \n",
"\n",
"First read the data using Pandas library, then log-transform data to make it look like a Normal distribution."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | 0 | \n", "
---|---|
count | \n", "15027.000000 | \n", "
mean | \n", "0.824782 | \n", "
std | \n", "0.498326 | \n", "
min | \n", "0.019044 | \n", "
25% | \n", "0.413535 | \n", "
50% | \n", "0.725890 | \n", "
75% | \n", "1.177000 | \n", "
max | \n", "1.999700 | \n", "