{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[MIT License](https://github.com/cdslaborg/paramonte#license) \n",
"[ParaMonte: plain powerful parallel Monte Carlo library](https://github.com/cdslaborg/paramonte). \n",
"Copyright (C) 2012-present, [The Computational Data Science Lab](https://www.cdslab.org/#about) \n",
"https://github.com/cdslaborg/paramonte \n",
"[References](https://www.cdslab.org/paramonte/notes/overview/preface/#how-to-acknowledge-the-use-of-the-paramonte-library-in-your-work) "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"warnings.simplefilter('ignore')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"paramonte 2.4.0\n",
"numpy 1.19.4\n",
"scipy 1.5.4\n",
"pandas 1.1.5\n",
"seaborn 0.11.0\n",
"matplotlib 3.3.3\n"
]
}
],
"source": [
"%matplotlib notebook\n",
"import paramonte as pm\n",
"import numpy as np\n",
"import scipy as sp\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib as mpl\n",
"import matplotlib.pyplot as plt\n",
"print('\\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))\n",
"sns.set()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling the bivariate distribution of a data set subject to sample incompleteness \n",
"\n",
"Supposed we have observed a dataset comprised of $1000$ events with two attributes `variable1` and `variable2` in this [data.csv](https://github.com/cdslaborg/paramontex/raw/main/Python/Jupyter/regression_censored_bivariate_lognormal_data_paradram/data.csv) file. Plotting these points would yield a red-colored bivariate distribution similar to the following figure, \n",
" \n",
"
\n",
" \n",
"The pale black points represent the missing data from our observational dataset. These are points that we could NOT observe (or detect) because of some instrumental bias and sample incompleteness as represented by the black line.\n",
"\n",
"**Statement of problem** \n",
"\n",
"Now, our goal is to form a hypothesis about this dataset, that is, a hypothesis about the distribution of the red points in the above plot. To make a correct assessment, we will have to also carefully consider the effects of the detection threshold (the black line) in our inference. \n",
"\n",
"The first thing we can do is to obtain a better visualization of this dataset. **A Good visualization of data is essential to form a good hypothesis (model) for our problem**. The bivariate data in the above plot looks highly skewed to the right. Therefore, it makes sense to take the logarithm of this dataset in the hope that it can help us better understand the possible underlying generating distribution of this data, \n",
" \n",
"
\n",
"\n",
"Just by looking at the observed (red) data points, we can form a relatively good hypothesis about the distribution of the data: **If the detection threshold did not exist, the complete dataset (including the points in the black points) would have likely very well resembled a bivariate lognormal distribution (or a normal distribution on the logarithmic axes)**. \n",
"\n",
"However, this dataset is affected by the detection threshold and we need to take a model of the detection threshold also into account. The logarithmic transformation makes it crystal-clear to us that the detection threshold is likely best modeled as a line. \n",
"\n",
"Therefore, we form the following hypothesis, \n",
"\n",
"**Hypothesis**: Our data comes from a bivariate Lognormal distribution subject to censorship by a sharp hard cutoff threshold in the form of a line. The proposition that the data is jointly lognormally distributed is mathematically equivalent to the proposition stating that `log(data)` is normally distributed. \n",
"\n",
"**Goal**: To **estimate the unknown parameters** of the bivariate Normal distribution which is hypothesized to well fit the bivariate distribution of the *log-transformed* data. These parameters include the **mean vector** of length two and the **covariance matrix** of the full bivariate Normal distribution ($\\mu, \\Sigma$), as well as the intercept and the slope ($c_0,c_1$) of the **sharp linear detection threshold** on our data set ($\\eta$). A two-dimensional covariance matrix has 3 free parameters because of its symmetry. Therefore, our censored model has overall 7 unknown parameters which must be constrained. \n",
"\n",
"**Methodology**: We will use the maximum likelihood approach and the **ParaDRAM** (**Delayed-Rejection Adaptive Metropolis-Hastings Markov Chain Monte Carlo**) sampler of the `paramonte` library to estimate the 7 unknown parameters of our hypothesis. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Read the data and log-transform it \n",
"\n",
"First read the data using Pandas library, then log-transform data to make it look like a bivariate Normal distribution."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | variable1 | \n", "variable2 | \n", "
---|---|---|
0 | \n", "0.87479 | \n", "1.46020 | \n", "
1 | \n", "0.12309 | \n", "0.19028 | \n", "
2 | \n", "3.32940 | \n", "6.21360 | \n", "
3 | \n", "0.62906 | \n", "1.66630 | \n", "
4 | \n", "0.38734 | \n", "0.52181 | \n", "