{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <span style=\"color:#ffa500\">9 | EXPLORATORY DATA ANALYSIS</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dct=\"http://purl.org/dc/terms/\"><span property=\"dct:title\">This chapter of an Introduction to Health Data Science</span> by <span property=\"cc:attributionName\">Dr JH Klopper</span> is licensed under <a href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/?ref=chooser-v1\" target=\"_blank\" rel=\"license noopener noreferrer\" style=\"display:inline-block;\">Attribution-NonCommercial-NoDerivatives 4.0 International<img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/nc.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/nd.svg?ref=chooser-v1\"></a></p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Introduction</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This chapter serves as an introduction to the fundamental concepts and techniques of exploratory data analysis (EDA), a critical step in any data analysis process. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "EDA is an approach and a set of techniques that allows the initial understanding and apprecaition of the information in complex datasets. It is an essential component of data science, machine learning, and statistical modeling. It is the first step in the data analysis process, exploring the data to understand its main characteristics before making any assumptions, conducting statistical test, or building predictive models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The primary goal of EDA is to maximize the data scientist's insight into a dataset and into the underlying structure of the data and to provide all of the specific items that a data scientist would want to extract from a data set. These include the following.\n",
    "\n",
    "- A good-fitting, parsimonious model\n",
    "- Estimates for parameters\n",
    "- Uncertainties for those estimates\n",
    "- A ranked list of important factors\n",
    "- Conclusions as to whether individual factors are statistically significant\n",
    "- Optimal settings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "EDA is not merely a set of techniques, but a mindset. It encourages the data scientist to remain open-minded, to explore the data from multiple angles, to question assumptions, and to be ready to modify initial hypotheses based on the insights gained from the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This data explores the summary of data, termed __descriptive statistics__."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Packages used in this notebook</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from scipy import stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Import data</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data for this chapter is in the `heart_failure.csv` file. It is imported below and metadata about the data is returned using the attributes and methods used in chapter 8."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the heart_failure.csv file and assign it to the vaiable df\n",
    "df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/heart_failure.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>anaemia</th>\n",
       "      <th>creatinine_phosphokinase</th>\n",
       "      <th>diabetes</th>\n",
       "      <th>ejection_fraction</th>\n",
       "      <th>hypertension</th>\n",
       "      <th>platelets</th>\n",
       "      <th>serum_creatinine</th>\n",
       "      <th>serum_sodium</th>\n",
       "      <th>sex</th>\n",
       "      <th>smoking</th>\n",
       "      <th>time</th>\n",
       "      <th>death</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>75.0</td>\n",
       "      <td>No</td>\n",
       "      <td>582</td>\n",
       "      <td>No</td>\n",
       "      <td>20</td>\n",
       "      <td>Yes</td>\n",
       "      <td>265000.00</td>\n",
       "      <td>1.9</td>\n",
       "      <td>130</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>4</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>55.0</td>\n",
       "      <td>No</td>\n",
       "      <td>7861</td>\n",
       "      <td>No</td>\n",
       "      <td>38</td>\n",
       "      <td>No</td>\n",
       "      <td>263358.03</td>\n",
       "      <td>1.1</td>\n",
       "      <td>136</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>6</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>65.0</td>\n",
       "      <td>No</td>\n",
       "      <td>146</td>\n",
       "      <td>No</td>\n",
       "      <td>20</td>\n",
       "      <td>No</td>\n",
       "      <td>162000.00</td>\n",
       "      <td>1.3</td>\n",
       "      <td>129</td>\n",
       "      <td>Male</td>\n",
       "      <td>Yes</td>\n",
       "      <td>7</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>50.0</td>\n",
       "      <td>Yes</td>\n",
       "      <td>111</td>\n",
       "      <td>No</td>\n",
       "      <td>20</td>\n",
       "      <td>No</td>\n",
       "      <td>210000.00</td>\n",
       "      <td>1.9</td>\n",
       "      <td>137</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>7</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>65.0</td>\n",
       "      <td>Yes</td>\n",
       "      <td>160</td>\n",
       "      <td>Yes</td>\n",
       "      <td>20</td>\n",
       "      <td>No</td>\n",
       "      <td>327000.00</td>\n",
       "      <td>2.7</td>\n",
       "      <td>116</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "      <td>8</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    age anaemia  creatinine_phosphokinase diabetes  ejection_fraction  \\\n",
       "0  75.0      No                       582       No                 20   \n",
       "1  55.0      No                      7861       No                 38   \n",
       "2  65.0      No                       146       No                 20   \n",
       "3  50.0     Yes                       111       No                 20   \n",
       "4  65.0     Yes                       160      Yes                 20   \n",
       "\n",
       "  hypertension  platelets  serum_creatinine  serum_sodium     sex smoking  \\\n",
       "0          Yes  265000.00               1.9           130    Male      No   \n",
       "1           No  263358.03               1.1           136    Male      No   \n",
       "2           No  162000.00               1.3           129    Male     Yes   \n",
       "3           No  210000.00               1.9           137    Male      No   \n",
       "4           No  327000.00               2.7           116  Female      No   \n",
       "\n",
       "   time death  \n",
       "0     4   Yes  \n",
       "1     6   Yes  \n",
       "2     7   Yes  \n",
       "3     7   Yes  \n",
       "4     8   Yes  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# View the first 5 rows of the dataframe\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(299, 13)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Return the shape of the dataframe\n",
    "df.shape # Returns a tuple with number of rows and columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 299 entries, 0 to 298\n",
      "Data columns (total 13 columns):\n",
      " #   Column                    Non-Null Count  Dtype  \n",
      "---  ------                    --------------  -----  \n",
      " 0   age                       299 non-null    float64\n",
      " 1   anaemia                   299 non-null    object \n",
      " 2   creatinine_phosphokinase  299 non-null    int64  \n",
      " 3   diabetes                  299 non-null    object \n",
      " 4   ejection_fraction         299 non-null    int64  \n",
      " 5   hypertension              299 non-null    object \n",
      " 6   platelets                 299 non-null    float64\n",
      " 7   serum_creatinine          299 non-null    float64\n",
      " 8   serum_sodium              299 non-null    int64  \n",
      " 9   sex                       299 non-null    object \n",
      " 10  smoking                   299 non-null    object \n",
      " 11  time                      299 non-null    int64  \n",
      " 12  death                     299 non-null    object \n",
      "dtypes: float64(3), int64(4), object(6)\n",
      "memory usage: 30.5+ KB\n"
     ]
    }
   ],
   "source": [
    "# Return info about the dataframe\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Exploring categorical variables</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `info` method used above shows various column names (statistical variables) that are object types. This indicates categorical variables that are were not encoded by numbers when the data files was generated."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exploratory data analysis of categorical variables starts by determining all the possible values that the variable can take, called the __sample space__ of the variable. The different elements in the sample space are called __classes__ or __levels__. The number or frequency of occurrence of each class can be determined. Finally, the relative frequency or proportion of each class can also be calculated by dividing the frequency of the class by the total number of observations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `anaemia` column indicates whether a suject has anemia. The sample space (all the values or elements or classes) that the `anaemia` variable can takes can be determined using the `unique` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['No', 'Yes'], dtype=object)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Determine the unique classes for the anaemia column\n",
    "df['anaemia'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sample space elements are `No` (no anemia present) and `Yes` (anemia present). The `value_counts` method returns the frequency of each of the classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "anaemia\n",
       "No     170\n",
       "Yes    129\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Determine the value counts of the classes in the anaemia column\n",
    "df.anaemia.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A total of $170$ subjects did not have anemia and $129$ did."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The passing the value of `True` to the `normalize` argument returns the relative frequency (or proportion) of each class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "anaemia\n",
       "No     0.568562\n",
       "Yes    0.431438\n",
       "Name: proportion, dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Determine the relative frequency of the classes in the anaemia column\n",
    "df.anaemia.value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "About $56.9\\%$ of the subjects did not have anemia and about $43.1\\%$ of subjects did. The __mode__ of this variable is therefor `No`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>\n",
    "\n",
    "Determine the sample space elements, the frequency, and the relevant frequencies of the `diabetes` variable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Solution</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['No', 'Yes'], dtype=object)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.diabetes.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "diabetes\n",
       "No     174\n",
       "Yes    125\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.diabetes.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "diabetes\n",
       "No     0.58194\n",
       "Yes    0.41806\n",
       "Name: proportion, dtype: float64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.diabetes.value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Comparative descriptive statistics__ for categorical variables expressed the frequency (or proportion) of the classes in the data, after grouping the vlaues by the sample space elements of another categorical variable. Below, the frequency of the `anaemia` variable is determined for each of the classes in the `diabetes` variable, by making use of the `groupby` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "diabetes  anaemia\n",
       "No        No         98\n",
       "          Yes        76\n",
       "Yes       No         72\n",
       "          Yes        53\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Determine the frequency of the classes in the anaemia column after grouping by the classses in the diabetes column\n",
    "df.groupby('diabetes').anaemia.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `value_counts` method can be added to show the proportion of each. By multiplying by $100$, the percentages are obtained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "diabetes  anaemia\n",
       "No        No         56.321839\n",
       "          Yes        43.678161\n",
       "Yes       No         57.600000\n",
       "          Yes        42.400000\n",
       "Name: proportion, dtype: float64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Express the proportions as percentages\n",
    "df.groupby('diabetes').anaemia.value_counts(normalize=True) * 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pandas `crosstab` function can create a contingency table of the comparitive frequencies. The variable that is passed first, populates the rowss of the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>anaemia</th>\n",
       "      <th>No</th>\n",
       "      <th>Yes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>diabetes</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>No</th>\n",
       "      <td>98</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Yes</th>\n",
       "      <td>72</td>\n",
       "      <td>53</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "anaemia   No  Yes\n",
       "diabetes         \n",
       "No        98   76\n",
       "Yes       72   53"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a cross tabulation of the anaemia and diabetes columns\n",
    "pd.crosstab(df.diabetes, df.anaemia)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The joint frequencies from the table above is expressed as a percentage below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>anaemia</th>\n",
       "      <th>No</th>\n",
       "      <th>Yes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>diabetes</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>No</th>\n",
       "      <td>32.775920</td>\n",
       "      <td>25.418060</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Yes</th>\n",
       "      <td>24.080268</td>\n",
       "      <td>17.725753</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "anaemia          No        Yes\n",
       "diabetes                      \n",
       "No        32.775920  25.418060\n",
       "Yes       24.080268  17.725753"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Express the cross tabulation as percentages\n",
    "pd.crosstab(df.diabetes, df.anaemia, normalize=True) * 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Generate a contingency table, indicating the joint percentages of the `anaemia` variable for each of the classes in the `death` column."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Solution</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>anaemia</th>\n",
       "      <th>No</th>\n",
       "      <th>Yes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>death</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>No</th>\n",
       "      <td>40.133779</td>\n",
       "      <td>27.759197</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Yes</th>\n",
       "      <td>16.722408</td>\n",
       "      <td>15.384615</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "anaemia         No        Yes\n",
       "death                        \n",
       "No       40.133779  27.759197\n",
       "Yes      16.722408  15.384615"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(df.death, df.anaemia, normalize=True) * 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Exploring numerical variables</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Various statistics can be calculated for continuous numerical variables. The include measures of central tendency and measures of dispersion."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Measures of central tendency</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Measures of central tendency are statistical indicators that represent the center point or typical value of a dataset. These measures indicate where most values in a dataset fall and are also referred to as the central location of a distribution. The three main measures of central tendency are as follows.\n",
    "\n",
    "1. **Mean**: The mean, often called the average, is calculated by adding all data points in the dataset and then dividing by the number of data points. The mean is sensitive to outliers, meaning that extremely high or low values can skew the mean.\n",
    "\n",
    "2. **Median**: The median is the middle value in a dataset when the data points are arranged in ascending or descending order. If the dataset has an even number of observations, the median is the average of the two middle numbers. The median is not affected by outliers or skewed data.\n",
    "\n",
    "3. **Mode**: The mode is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode, or no mode at all. The mode was used to explore categorical variables in the previous section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These measures help provide a summary of the dataset and can give a general sense of the _typical_ value that might be expected."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mean for a numerical variable can be calculated using the `mean` method for a pandas series object. The mean of the `age` variable is calculated below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "60.83389297658862"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate the mean of the age variable\n",
    "df.age.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mean for a variable can be calculated for different groups, created from the classes of a categorical variable. The mean of the `age` variable is calculated for each of the classes in the `death` variable, using the `groupby` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "death\n",
       "No     58.761906\n",
       "Yes    65.215281\n",
       "Name: age, dtype: float64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate the mean age by the groups in the death column\n",
    "df.groupby('death').age.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `median` method calculates the median of a pandas series object that contains numerical data. The median of the `age` variable is calculated below for each class in the `death` variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "death\n",
       "No     60.0\n",
       "Yes    65.0\n",
       "Name: age, dtype: float64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate the median age by the groups in the death column\n",
    "df.groupby('death').age.median()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>\n",
    "\n",
    "Determine the median `ejection_fraction` for each of the classes in the `anaemia` variable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Solution</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "anaemia\n",
       "No     38.0\n",
       "Yes    38.0\n",
       "Name: ejection_fraction, dtype: float64"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('anaemia')['ejection_fraction'].median()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Measures of dispersion</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Measures of dispersion, also known as measures of variability, provide insights into the spread or variability of a dataset. They indicate how spread out the values in a dataset are around the center (mean or median). The main measures of dispersion are as follows.\n",
    "\n",
    "1. **Range**: The range is the simplest measure of dispersion and is calculated as the difference between the highest and the lowest value in the dataset.\n",
    "\n",
    "2. **Variance**: Variance measures how far each number in the set is from the mean (average) and thus from every other number in the set. It's often used in statistical and probability theory.\n",
    "\n",
    "3. **Standard Deviation**: The standard deviation is the square root of the variance and provides a measure of dispersion in the same units as the original data. It's the most commonly used measure of spread.\n",
    "\n",
    "4. **Interquartile Range (IQR)**: The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. It is used to measure statistical dispersion and data variability by dividing a data set into quartiles."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These measures help to understand the variability within a dataset, the reliability of statistical estimations, and the level of uncertainty. They are crucial in many statistical analyses as they provide context to the measures of central tendency."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The minimum and maximum values of a numerical variable (expressed as a pandas series object) are calculatd using the `min` and `max` methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "40.0"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Determine the minimum value of the age column\n",
    "df.age.min()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "95.0"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Determine the maximum value of the age column\n",
    "df.age.max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "55.0"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Determine the range of the age column\n",
    "df.age.max() - df.age.min()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sample variance of a numerical variable (expressed as a pandas series object) is determined using the `var` method. The `ddof` argument is set to $1$ to calculate the sample variance. The `ddof` argument is the _delta degrees of freedom_ and is used to calculate the sample variance. The default value is $0$ and is used to calculate the population variance. The difference between these two equations are shown in (1) for the sake of interest. In (1) $s^{2}$ is the sample variance, $\\sigma^{2}$ is the population variance, $x_{i}$ is the values for the numerical variable in each of the $n$ subjects in the sample, $N$ is the population size, and $\\bar{x}$ is the mean of the variable, and $n$ is the sample size."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\begin{align*}\n",
    "&s^{2} = \\frac{\\sum_{i=1}^{n} \\left( x_{i} - \\bar{x} \\right)^{2}}{n-1} \\\\ \\\\\n",
    "&\\sigma^{2} = \\frac{\\sum_{i=1}^{n} \\left( x_{i} - \\bar{x} \\right)^{2}}{N} \\\\ \\\\\n",
    "\\end{align*}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "141.48648290797084"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Determine the sample variance of the age column\n",
    "df.age.var(ddof=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `std` methhod determines the standard deviation. The sample standard deviation is calculated below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "11.894809074044478"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Determine the sample standard deviation of the age column\n",
    "df.age.std(ddof=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `stats` module of the scipy package contains many functions for statistical analysis. The `iqr` function in this module determines the interquartile range. A quartile values takes a fraction (of $1$) as argument. If the fraction is $0.25$ then the value that is returned is the first quartile (the $25^{th}$ percentile). This is a values from the data such that a quarter of the values are less than it. If the fraction is $0.75$ then the value that is returned is the third quartile (the $75^{th}$ percentile). This is the value in the data set for which three-quarter of values are less than it. The interquartile range of the `age` values is calculated below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "19.0"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Determine the interquartile range of the age column\n",
    "stats.iqr(df.age)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>\n",
    "\n",
    "Determine the variance in `ejection_fraction` for each of the classes in the `anaemia` variable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Solution</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "anaemia\n",
       "No     141.763348\n",
       "Yes    139.675064\n",
       "Name: age, dtype: float64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('anaemia').age.var(ddof=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Correlation between numerical variables</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The correlation between two continuous numerical variable indicates how the values in one variable changes as the value of the other changes. The correlation requires a value for each variable from each subject in a sample. The correlation between two variables is a number between $-1$ and $1$. A value of $-1$ indicates a perfect negative correlation, a value of $0$ indicates no correlation, and a value of $1$ indicates a perfect positive correlation. A positive value inidcates a positive correlation (as the values of one variable increases, so does the other.) A negative values indicates a negative correlation (as the values of one variable increases, the other decreases.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rule of thumb the following values are used to indicate the strength of the correlation.\n",
    "\n",
    "- $0.00$ to $0.19$ _very weak_\n",
    "- $0.20$ to $0.39$ _weak_\n",
    "- $0.40$ to $0.59$ _moderate_\n",
    "- $0.60$ to $0.79$ _strong_\n",
    "- $0.80$ to $1.00$ _very strong_\n",
    "\n",
    "The same rule of thumb is used for negative values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The correlation between two variables can be calculated using the `corr` method. The correlation between the `age` and `ejection_fraction` variables is calculated below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.060098363232912864"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate the correlation between the age and ejection-fraction variables\n",
    "df.age.corr(df['ejection_fraction'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a weak positive correlation between the two variables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Using pandas built-in descriptive statistics</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `describe` method calculate various statistics for a pandas series object that contains numerical data. The method is used for the `age` variable below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    299.000000\n",
       "mean      60.833893\n",
       "std       11.894809\n",
       "min       40.000000\n",
       "25%       51.000000\n",
       "50%       60.000000\n",
       "75%       70.000000\n",
       "max       95.000000\n",
       "Name: age, dtype: float64"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Describe the age column\n",
    "df.age.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the value for $50\\%$ is the fiftieth percentile value. Half of the ages are less than this value, i.e. it is the median."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <span style=\"color:#0096FF\">Quiz questions</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Questions</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. How do you import the pandas package in Python?\n",
    "\n",
    "2. What function do you use to read a CSV file in pandas?\n",
    "\n",
    "3. How do you display the first 5 rows of a DataFrame in pandas?\n",
    "\n",
    "4. How do you calculate the mean of a column named `Age` in a DataFrame named `df`?\n",
    "\n",
    "5. How do you calculate the median of a column named `Salary` in a DataFrame named `df`?\n",
    "\n",
    "6. How do you calculate the standard deviation of a column named `Score` in a DataFrame named `df`?\n",
    "\n",
    "7. How do you find the number of missing values in each column of a DataFrame named `df`?\n",
    "\n",
    "8. How do you calculate the correlation between two columns, `Age` and `Salary`, in a DataFrame named `df`?\n",
    "\n",
    "9. How do you select a subset of a DataFrame `df` where the column `Age` is greater than $30$?\n",
    "\n",
    "10. How do you calculate the range (maximum - minimum) of a column named `Score` in a DataFrame named `df`?\n",
    "\n",
    "11. How do you group a DataFrame `df` by a column named `Department` and calculate the mean of `Salary` within each group?\n",
    "\n",
    "12. How do you group a DataFrame `df` by two columns, `Department` and `Job Title`, and count the number of rows within each group?\n",
    "\n",
    "13. How do you use the groupby method to find the maximum `Age` in each `Department` in a DataFrame `df`?\n",
    "\n",
    "14. How do you create a cross-tabulation table that shows the frequency count of `Department` (rows) and `Job Title` (columns) in a DataFrame `df`?\n",
    "\n",
    "15. How do you create a cross-tabulation table that shows the mean `Salary` for each combination of `Department` (rows) and `Job Title` (columns) in a DataFrame `df`?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "datascience",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}