# <span style="color:#ffa500">9 | EXPLORATORY DATA ANALYSIS</span>

<p xmlns:cc="http://creativecommons.org/ns#" xmlns:dct="http://purl.org/dc/terms/"><span property="dct:title">This chapter of an Introduction to Health Data Science</span> by <span property="cc:attributionName">Dr JH Klopper</span> is licensed under <a href="http://creativecommons.org/licenses/by-nc-nd/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">Attribution-NonCommercial-NoDerivatives 4.0 International<img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/nc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/nd.svg?ref=chooser-v1"></a></p>

 ## <span style="color:#0096FF">Introduction</span>

This chapter serves as an introduction to the fundamental concepts and techniques of exploratory data analysis (EDA), a critical step in any data analysis process. 

EDA is an approach and a set of techniques that allows an initial understanding and appreciation of the information in complex datasets. It is an essential component of data science, machine learning, and statistical modeling. It is the first step in the data analysis process: the data are explored to understand their main characteristics before making any assumptions, conducting statistical tests, or building predictive models.

The primary goal of EDA is to maximize the data scientist's insight into a dataset and into the underlying structure of the data and to provide all of the specific items that a data scientist would want to extract from a data set. These include the following.

- A good-fitting, parsimonious model
- Estimates for parameters
- Uncertainties for those estimates
- A ranked list of important factors
- Conclusions as to whether individual factors are statistically significant
- Optimal settings

EDA is not merely a set of techniques, but a mindset. It encourages the data scientist to remain open-minded, to explore the data from multiple angles, to question assumptions, and to be ready to modify initial hypotheses based on the insights gained from the data.

This chapter explores the summary of data, termed __descriptive statistics__.

 ## <span style="color:#0096FF">Packages used in this notebook</span>

In [1]:
import pandas as pd
from scipy import stats

 ## <span style="color:#0096FF">Import data</span>

The data for this chapter is in the `heart_failure.csv` file. It is imported below and metadata about the data is returned using the attributes and methods used in chapter 8.

In [2]:
# Import the heart_failure.csv file and assign it to the variable df
df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/heart_failure.csv')

In [3]:
# View the first 5 rows of the dataframe
df.head()

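|   | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | hypertension | platelets | serum_creatinine | serum_sodium | sex | smoking | time | death |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | No | 582 | No | 20 | Yes | 265000.0 | 1.9 | 130 | Male | No | 4 | Yes |
| 1 | 55.0 | No | 7861 | No | 38 | No | 263358.03 | 1.1 | 136 | Male | No | 6 | Yes |
| 2 | 65.0 | No | 146 | No | 20 | No | 162000.0 | 1.3 | 129 | Male | Yes | 7 | Yes |
| 3 | 50.0 | Yes | 111 | No | 20 | No | 210000.0 | 1.9 | 137 | Male | No | 7 | Yes |
| 4 | 65.0 | Yes | 160 | Yes | 20 | No | 327000.0 | 2.7 | 116 | Female | No | 8 | Yes |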


In [4]:
# Return the shape of the dataframe
df.shape # Returns a tuple with number of rows and columns

(299, 13)

In [5]:
# Return info about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    object 
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    object 
 4   ejection_fraction         299 non-null    int64  
 5   hypertension              299 non-null    object 
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    object 
 10  smoking                   299 non-null    object 
 11  time                      299 non-null    int64  
 12  death                     299 non-null    object 
dtypes: float64(3), int64(4), object(6)
memory usage: 30.5+ KB


 ## <span style="color:#0096FF">Exploring categorical variables</span>

The `info` method used above shows several columns (statistical variables) with the object data type. This indicates categorical variables that were not encoded as numbers when the data file was generated.

Exploratory data analysis of categorical variables starts by determining all the possible values that the variable can take, called the __sample space__ of the variable. The different elements in the sample space are called __classes__ or __levels__. The number or frequency of occurrence of each class can be determined. Finally, the relative frequency or proportion of each class can also be calculated by dividing the frequency of the class by the total number of observations.

The `anaemia` column indicates whether a subject has anemia. The sample space (all the values or elements or classes) that the `anaemia` variable can take can be determined using the `unique` method.

In [6]:
# Determine the unique classes for the anaemia column
df['anaemia'].unique()

array(['No', 'Yes'], dtype=object)

The sample space elements are `No` (no anemia present) and `Yes` (anemia present). The `value_counts` method returns the frequency of each of the classes.

In [7]:
# Determine the value counts of the classes in the anaemia column
df.anaemia.value_counts()

anaemia
No     170
Yes    129
Name: count, dtype: int64

A total of $170$ subjects did not have anemia and $129$ did.

Passing the value `True` to the `normalize` argument returns the relative frequency (or proportion) of each class.

In [8]:
# Determine the relative frequency of the classes in the anaemia column
df.anaemia.value_counts(normalize=True)

anaemia
No     0.568562
Yes    0.431438
Name: proportion, dtype: float64

About $56.9\%$ of the subjects did not have anemia and about $43.1\%$ of subjects did. The __mode__ of this variable is therefore `No`.
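As a minimal sketch of the definitions above, the relative frequencies can be reproduced by dividing each frequency by the total number of observations, and the mode can be read off with the `mode` method. The series below is rebuilt from the frequencies reported above ($170$ `No` and $129$ `Yes`) rather than loaded from the file, and the variable names are only illustrative.

```python
import pandas as pd

# Rebuild the anaemia column from the frequencies reported above
anaemia = pd.Series(['No'] * 170 + ['Yes'] * 129, name='anaemia')

# Frequency of each class
counts = anaemia.value_counts()

# Relative frequency: class frequency divided by the number of observations
proportions = counts / len(anaemia)

# The mode is the most frequent class (mode returns a series, since ties are possible)
most_common = anaemia.mode()[0]

print(proportions.round(3))
print(most_common)
```

The `mode` method returns a series rather than a single value because more than one class can share the highest frequency.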

<span style="color:#00FF00">Task</span>

Determine the sample space elements, the frequencies, and the relative frequencies of the `diabetes` variable.

<span style="color:#00FF00">Solution</span>

In [9]:
df.diabetes.unique()

array(['No', 'Yes'], dtype=object)

In [10]:
df.diabetes.value_counts()

diabetes
No     174
Yes    125
Name: count, dtype: int64

In [11]:
df.diabetes.value_counts(normalize=True)

diabetes
No     0.58194
Yes    0.41806
Name: proportion, dtype: float64

__Comparative descriptive statistics__ for categorical variables express the frequency (or proportion) of the classes in the data, after grouping the values by the sample space elements of another categorical variable. Below, the frequency of the `anaemia` variable is determined for each of the classes in the `diabetes` variable, by making use of the `groupby` method.

In [12]:
# Determine the frequency of the classes in the anaemia column after grouping by the classes in the diabetes column
df.groupby('diabetes').anaemia.value_counts()

diabetes  anaemia
No        No         98
          Yes        76
Yes       No         72
          Yes        53
Name: count, dtype: int64

The `normalize` argument of the `value_counts` method can be set to `True` to show the proportion of each class within each group. By multiplying by $100$, the percentages are obtained.

In [13]:
# Express the proportions as percentages
df.groupby('diabetes').anaemia.value_counts(normalize=True) * 100

diabetes  anaemia
No        No         56.321839
          Yes        43.678161
Yes       No         57.600000
          Yes        42.400000
Name: proportion, dtype: float64

The pandas `crosstab` function can create a contingency table of the comparative frequencies. The variable that is passed first populates the rows of the table.

In [14]:
# Create a cross tabulation of the anaemia and diabetes columns
pd.crosstab(df.diabetes, df.anaemia)

| diabetes (rows) / anaemia (columns) | No | Yes |
|---|---|---|
| No | 98 | 76 |
| Yes | 72 | 53 |


The joint frequencies from the table above are expressed as percentages below.

In [15]:
# Express the cross tabulation as percentages
pd.crosstab(df.diabetes, df.anaemia, normalize=True) * 100

| diabetes (rows) / anaemia (columns) | No | Yes |
|---|---|---|
| No | 32.77592 | 25.41806 |
| Yes | 24.080268 | 17.725753 |
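
Joint percentages divide every cell by the total number of subjects, so the whole table sums to $100$. As a hedged sketch of an alternative view, the `normalize` argument of `crosstab` also accepts `'index'`, which divides each cell by its row total, and the `margins` argument adds row and column totals. The small dataframe below (named `demo` here for illustration) is rebuilt from the joint frequencies reported above rather than loaded from the file.

```python
import pandas as pd

# Rebuild the two columns from the joint frequencies reported above
rows = (
    [('No', 'No')] * 98 + [('No', 'Yes')] * 76
    + [('Yes', 'No')] * 72 + [('Yes', 'Yes')] * 53
)
demo = pd.DataFrame(rows, columns=['diabetes', 'anaemia'])

# Joint frequencies with row and column totals added
with_totals = pd.crosstab(demo.diabetes, demo.anaemia, margins=True)

# Row-wise percentages: each diabetes class sums to 100
row_pct = pd.crosstab(demo.diabetes, demo.anaemia, normalize='index') * 100

print(with_totals)
print(row_pct.round(2))
```

The row-wise percentages should agree with the grouped proportions obtained with `groupby` and `value_counts(normalize=True)`.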



<span style="color:#00FF00">Task</span>

Generate a contingency table, indicating the joint percentages of the `anaemia` variable for each of the classes in the `death` column.

<span style="color:#00FF00">Solution</span>

In [16]:
pd.crosstab(df.death, df.anaemia, normalize=True) * 100

| death (rows) / anaemia (columns) | No | Yes |
|---|---|---|
| No | 40.133779 | 27.759197 |
| Yes | 16.722408 | 15.384615 |


 ## <span style="color:#0096FF">Exploring numerical variables</span>

Various statistics can be calculated for continuous numerical variables. These include measures of central tendency and measures of dispersion.

### <span style="color:#FFD700">Measures of central tendency</span>

Measures of central tendency are statistical indicators that represent the center point or typical value of a dataset. These measures indicate where most values in a dataset fall and are also referred to as the central location of a distribution. The three main measures of central tendency are as follows.

1. **Mean**: The mean, often called the average, is calculated by adding all data points in the dataset and then dividing by the number of data points. The mean is sensitive to outliers, meaning that extremely high or low values can skew the mean.

2. **Median**: The median is the middle value in a dataset when the data points are arranged in ascending or descending order. If the dataset has an even number of observations, the median is the average of the two middle numbers. The median is robust to outliers and skewed data, since it depends only on the order of the values and not on how extreme they are.

3. **Mode**: The mode is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode, or no mode at all. The mode was used to explore categorical variables in the previous section.

These measures help provide a summary of the dataset and can give a general sense of the _typical_ value that might be expected.
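
The different sensitivity of the mean and the median to outliers can be illustrated with a short sketch. The values below are hypothetical and were chosen only to show the effect of adding a single extreme value.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the effect of an outlier
values = pd.Series([50, 55, 60, 65, 70])
with_outlier = pd.Series([50, 55, 60, 65, 70, 500])

# Without the outlier the mean and median agree
print(values.mean(), values.median())              # 60.0 60.0

# The outlier pulls the mean far upwards, while the median barely moves
print(with_outlier.mean(), with_outlier.median())  # about 133.3 and 62.5
```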

The mean for a numerical variable can be calculated using the `mean` method for a pandas series object. The mean of the `age` variable is calculated below.

In [17]:
# Calculate the mean of the age variable
df.age.mean()

60.83389297658862

The mean for a variable can be calculated for different groups, created from the classes of a categorical variable. The mean of the `age` variable is calculated for each of the classes in the `death` variable, using the `groupby` method.

In [18]:
# Calculate the mean age by the groups in the death column
df.groupby('death').age.mean()

death
No     58.761906
Yes    65.215281
Name: age, dtype: float64

The `median` method calculates the median of a pandas series object that contains numerical data. The median of the `age` variable is calculated below for each class in the `death` variable.

In [19]:
# Calculate the median age by the groups in the death column
df.groupby('death').age.median()

death
No     60.0
Yes    65.0
Name: age, dtype: float64
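
As a hedged sketch, the mean and the median per group can also be returned together by passing a list of statistic names to the `agg` method of a grouped series. The dataframe `demo` and its values are hypothetical and serve only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical data, used only to keep the example self-contained
demo = pd.DataFrame({
    'age': [40, 50, 75, 60, 80, 85],
    'death': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
})

# One row per class in the death column, one column per statistic
summary = demo.groupby('death').age.agg(['mean', 'median'])

print(summary)
```

Comparing the two columns side by side gives a quick indication of skewness within each group, since the mean is pulled towards extreme values and the median is not.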

<span style="color:#00FF00">Task</span>

Determine the median `ejection_fraction` for each of the classes in the `anaemia` variable.

<span style="color:#00FF00">Solution</span>

In [20]:
df.groupby('anaemia')['ejection_fraction'].median()

anaemia
No     38.0
Yes    38.0
Name: ejection_fraction, dtype: float64

### <span style="color:#FFD700">Measures of dispersion</span>

Measures of dispersion, also known as measures of variability, provide insights into the spread or variability of a dataset. They indicate how spread out the values in a dataset are around the center (mean or median). The main measures of dispersion are as follows.

1. **Range**: The range is the simplest measure of dispersion and is calculated as the difference between the highest and the lowest value in the dataset.

2. **Variance**: The variance is the average of the squared differences between each value and the mean. Because the differences are squared, the variance is expressed in squared units of the original data. It is widely used in statistics and probability theory.

3. **Standard Deviation**: The standard deviation is the square root of the variance and provides a measure of dispersion in the same units as the original data. It's the most commonly used measure of spread.

4. **Interquartile Range (IQR)**: The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. It is used to measure statistical dispersion and data variability by dividing a data set into quartiles.

These measures help to understand the variability within a dataset, the reliability of statistical estimations, and the level of uncertainty. They are crucial in many statistical analyses as they provide context to the measures of central tendency.

The minimum and maximum values of a numerical variable (expressed as a pandas series object) are calculated using the `min` and `max` methods.

In [21]:
# Determine the minimum value of the age column
df.age.min()

40.0

In [22]:
# Determine the maximum value of the age column
df.age.max()

95.0

In [23]:
# Determine the range of the age column
df.age.max() - df.age.min()

55.0

The sample variance of a numerical variable (expressed as a pandas series object) is determined using the `var` method. The `ddof` argument is the _delta degrees of freedom_: the divisor used in the calculation is the number of values minus `ddof`. In pandas the default value is $1$, which returns the sample variance, and it is set explicitly below for clarity. Setting `ddof` to $0$ returns the population variance. The difference between the two equations is shown in (1) for the sake of interest. In (1), $s^{2}$ is the sample variance, $\sigma^{2}$ is the population variance, $x_{i}$ is the value of the numerical variable for subject $i$, $n$ is the sample size, $\bar{x}$ is the sample mean, $N$ is the population size, and $\mu$ is the population mean.

$$
\begin{align*}
&s^{2} = \frac{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right)^{2}}{n-1} \\ \\
&\sigma^{2} = \frac{\sum_{i=1}^{N} \left( x_{i} - \mu \right)^{2}}{N}
\end{align*}
$$

In [24]:
# Determine the sample variance of the age column
df.age.var(ddof=1)

141.48648290797084
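
The effect of the divisor can be checked by hand. The sketch below uses a small set of hypothetical values and compares a manual calculation with the `var` method, with `ddof` set explicitly in both cases.

```python
import pandas as pd

# Hypothetical sample values
x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

n = len(x)
squared_deviations = (x - x.mean()) ** 2

# Dividing by n - 1 gives the sample variance
sample_var = squared_deviations.sum() / (n - 1)

# Dividing by n treats the values as a complete population
population_var = squared_deviations.sum() / n

print(sample_var, x.var(ddof=1))      # both about 4.571
print(population_var, x.var(ddof=0))  # both 4.0
```

The sample variance is always the larger of the two, and the difference shrinks as the number of values grows.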

The `std` method determines the standard deviation. The sample standard deviation is calculated below.

In [25]:
# Determine the sample standard deviation of the age column
df.age.std(ddof=1)

11.894809074044478

The `stats` module of the scipy package contains many functions for statistical analysis. The `iqr` function in this module determines the interquartile range. A quartile is specified as a fraction (of $1$). For a fraction of $0.25$, the value is the first quartile (the $25^{th}$ percentile): a value such that a quarter of the values in the data are less than it. For a fraction of $0.75$, the value is the third quartile (the $75^{th}$ percentile): a value such that three-quarters of the values are less than it. The interquartile range is the difference between the third and the first quartile. The interquartile range of the `age` values is calculated below.

In [26]:
# Determine the interquartile range of the age column
stats.iqr(df.age)

19.0
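
As a sketch of what the `iqr` function computes, the two quartiles can be obtained with the `quantile` method of a pandas series and subtracted. The values below are hypothetical, and the comparison assumes the default settings of both `quantile` and `iqr`.

```python
import pandas as pd
from scipy import stats

# Hypothetical values
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First and third quartiles, specified as fractions of 1
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)

# The interquartile range is the difference between the quartiles
iqr = q3 - q1

# The same value from the scipy function
iqr_scipy = stats.iqr(values)

print(q1, q3, iqr, iqr_scipy)
```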

<span style="color:#00FF00">Task</span>

Determine the sample variance in `age` for each of the classes in the `anaemia` variable.

<span style="color:#00FF00">Solution</span>

In [27]:
df.groupby('anaemia').age.var(ddof=1)

anaemia
No     141.763348
Yes    139.675064
Name: age, dtype: float64

### <span style="color:#FFD700">Correlation between numerical variables</span>

The correlation between two continuous numerical variables indicates how the values of one variable change as the values of the other change. The correlation requires a value for each variable from each subject in a sample. The correlation between two variables is a number between $-1$ and $1$. A value of $-1$ indicates a perfect negative correlation, a value of $0$ indicates no linear correlation, and a value of $1$ indicates a perfect positive correlation. A positive value indicates a positive correlation (as the values of one variable increase, so do the values of the other). A negative value indicates a negative correlation (as the values of one variable increase, the values of the other decrease).

As a rule of thumb the following values are used to indicate the strength of the correlation.

- $0.00$ to $0.19$ _very weak_
- $0.20$ to $0.39$ _weak_
- $0.40$ to $0.59$ _moderate_
- $0.60$ to $0.79$ _strong_
- $0.80$ to $1.00$ _very strong_

The same rule of thumb is used for negative values.

The correlation between two variables can be calculated using the `corr` method. The correlation between the `age` and `ejection_fraction` variables is calculated below.

In [28]:
# Calculate the correlation between the age and ejection-fraction variables
df.age.corr(df['ejection_fraction'])

0.060098363232912864

According to the rule of thumb above, a value of about $0.06$ indicates a very weak positive correlation between the two variables.
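
The sketch below shows how the correlation coefficient can be reproduced by hand and how the rule of thumb might be applied in code. By default the `corr` method returns the Pearson correlation coefficient, which is the covariance of the two variables divided by the product of their standard deviations. The function `correlation_strength` is a hypothetical helper, not part of pandas, the data values are invented for illustration, and the cut-off points are an assumption that treats each band as running up to (but not including) the start of the next.

```python
import pandas as pd


def correlation_strength(r):
    """Label a correlation coefficient using the rule of thumb (hypothetical helper)."""
    size = abs(r)  # The same bands are used for negative values
    if size < 0.2:
        return 'very weak'
    if size < 0.4:
        return 'weak'
    if size < 0.6:
        return 'moderate'
    if size < 0.8:
        return 'strong'
    return 'very strong'


# Hypothetical paired values for five subjects
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([1, 2, 3, 5, 4])

# Pearson correlation: covariance divided by the product of the standard deviations
r_manual = x.cov(y) / (x.std() * y.std())
r_pandas = x.corr(y)

print(r_manual, r_pandas)
print(correlation_strength(r_pandas))
print(correlation_strength(0.060098))  # The value for age and ejection fraction
```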

### <span style="color:#FFD700">Using pandas built-in descriptive statistics</span>

The `describe` method calculates various statistics for a pandas series object that contains numerical data. The method is used for the `age` variable below.

In [29]:
# Describe the age column
df.age.describe()

count    299.000000
mean      60.833893
std       11.894809
min       40.000000
25%       51.000000
50%       60.000000
75%       70.000000
max       95.000000
Name: age, dtype: float64

Note that the value for $50\%$ is the fiftieth percentile value. Half of the ages are less than this value, i.e. it is the median.
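
The `describe` method can also be applied to a series with the object data type, in which case it returns the number of values, the number of unique classes, the most frequent class, and its frequency. The sketch below rebuilds the `anaemia` column from the frequencies reported earlier in this chapter rather than loading the file.

```python
import pandas as pd

# Rebuild the anaemia column from the frequencies reported earlier
anaemia = pd.Series(['No'] * 170 + ['Yes'] * 129, name='anaemia')

# For object data, describe returns count, unique, top (the mode), and freq
summary = anaemia.describe()

print(summary)
```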

## <span style="color:#0096FF">Quiz questions</span>

### <span style="color:#FFD700">Questions</span>

1. How do you import the pandas package in Python?

2. What function do you use to read a CSV file in pandas?

3. How do you display the first 5 rows of a DataFrame in pandas?

4. How do you calculate the mean of a column named `Age` in a DataFrame named `df`?

5. How do you calculate the median of a column named `Salary` in a DataFrame named `df`?

6. How do you calculate the standard deviation of a column named `Score` in a DataFrame named `df`?

7. How do you find the number of missing values in each column of a DataFrame named `df`?

8. How do you calculate the correlation between two columns, `Age` and `Salary`, in a DataFrame named `df`?

9. How do you select a subset of a DataFrame `df` where the column `Age` is greater than $30$?

10. How do you calculate the range (maximum - minimum) of a column named `Score` in a DataFrame named `df`?

11. How do you group a DataFrame `df` by a column named `Department` and calculate the mean of `Salary` within each group?

12. How do you group a DataFrame `df` by two columns, `Department` and `Job Title`, and count the number of rows within each group?

13. How do you use the groupby method to find the maximum `Age` in each `Department` in a DataFrame `df`?

14. How do you create a cross-tabulation table that shows the frequency count of `Department` (rows) and `Job Title` (columns) in a DataFrame `df`?

15. How do you create a cross-tabulation table that shows the mean `Salary` for each combination of `Department` (rows) and `Job Title` (columns) in a DataFrame `df`?