{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 9 | EXPLORATORY DATA ANALYSIS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

This chapter of an Introduction to Health Data Science by Dr JH Klopper is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This chapter serves as an introduction to the fundamental concepts and techniques of exploratory data analysis (EDA), a critical step in any data analysis process. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "EDA is an approach and a set of techniques that allow an initial understanding and appreciation of the information in complex datasets. It is an essential component of data science, machine learning, and statistical modeling. It is the first step in the data analysis process, exploring the data to understand its main characteristics before making any assumptions, conducting statistical tests, or building predictive models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The primary goal of EDA is to maximize the data scientist's insight into a dataset and into the underlying structure of the data and to provide all of the specific items that a data scientist would want to extract from a dataset. These include the following.\n", "\n", "- A good-fitting, parsimonious model\n", "- Estimates for parameters\n", "- Uncertainties for those estimates\n", "- A ranked list of important factors\n", "- Conclusions as to whether individual factors are statistically significant\n", "- Optimal settings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "EDA is not merely a set of techniques, but a mindset. It encourages the data scientist to remain open-minded, to explore the data from multiple angles, to question assumptions, and to be ready to modify initial hypotheses based on the insights gained from the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This chapter explores the summarization of data, termed __descriptive statistics__." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Packages used in this notebook" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from scipy import stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Import data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data for this chapter is in the `heart_failure.csv` file. It is imported below and metadata about the data is returned using the attributes and methods used in chapter 8." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import the heart_failure.csv file and assign it to the variable df\n", "df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/heart_failure.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>age</th><th>anaemia</th><th>creatinine_phosphokinase</th><th>diabetes</th><th>ejection_fraction</th><th>hypertension</th><th>platelets</th><th>serum_creatinine</th><th>serum_sodium</th><th>sex</th><th>smoking</th><th>time</th><th>death</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>75.0</td><td>No</td><td>582</td><td>No</td><td>20</td><td>Yes</td><td>265000.00</td><td>1.9</td><td>130</td><td>Male</td><td>No</td><td>4</td><td>Yes</td></tr>\n", "<tr><th>1</th><td>55.0</td><td>No</td><td>7861</td><td>No</td><td>38</td><td>No</td><td>263358.03</td><td>1.1</td><td>136</td><td>Male</td><td>No</td><td>6</td><td>Yes</td></tr>\n", "<tr><th>2</th><td>65.0</td><td>No</td><td>146</td><td>No</td><td>20</td><td>No</td><td>162000.00</td><td>1.3</td><td>129</td><td>Male</td><td>Yes</td><td>7</td><td>Yes</td></tr>\n", "<tr><th>3</th><td>50.0</td><td>Yes</td><td>111</td><td>No</td><td>20</td><td>No</td><td>210000.00</td><td>1.9</td><td>137</td><td>Male</td><td>No</td><td>7</td><td>Yes</td></tr>\n", "<tr><th>4</th><td>65.0</td><td>Yes</td><td>160</td><td>Yes</td><td>20</td><td>No</td><td>327000.00</td><td>2.7</td><td>116</td><td>Female</td><td>No</td><td>8</td><td>Yes</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " age anaemia creatinine_phosphokinase diabetes ejection_fraction \\\n", "0 75.0 No 582 No 20 \n", "1 55.0 No 7861 No 38 \n", "2 65.0 No 146 No 20 \n", "3 50.0 Yes 111 No 20 \n", "4 65.0 Yes 160 Yes 20 \n", "\n", " hypertension platelets serum_creatinine serum_sodium sex smoking \\\n", "0 Yes 265000.00 1.9 130 Male No \n", "1 No 263358.03 1.1 136 Male No \n", "2 No 162000.00 1.3 129 Male Yes \n", "3 No 210000.00 1.9 137 Male No \n", "4 No 327000.00 2.7 116 Female No \n", "\n", " time death \n", "0 4 Yes \n", "1 6 Yes \n", "2 7 Yes \n", "3 7 Yes \n", "4 8 Yes " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View the first 5 rows of the dataframe\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(299, 13)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return the shape of the dataframe\n", "df.shape # Returns a tuple with number of rows and columns" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 299 entries, 0 to 298\n", "Data columns (total 13 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 age 299 non-null float64\n", " 1 anaemia 299 non-null object \n", " 2 creatinine_phosphokinase 299 non-null int64 \n", " 3 diabetes 299 non-null object \n", " 4 ejection_fraction 299 non-null int64 \n", " 5 hypertension 299 non-null object \n", " 6 platelets 299 non-null float64\n", " 7 serum_creatinine 299 non-null float64\n", " 8 serum_sodium 299 non-null int64 \n", " 9 sex 299 non-null object \n", " 10 smoking 299 non-null object \n", " 11 time 299 non-null int64 \n", " 12 death 299 non-null object \n", "dtypes: float64(3), int64(4), object(6)\n", "memory usage: 30.5+ KB\n" ] } ], "source": [ "# Return info about the dataframe\n", 
"df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Exploring categorical variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `info` method used above shows various column names (statistical variables) that are object types. These are categorical variables whose classes were not encoded as numbers when the data file was generated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exploratory data analysis of categorical variables starts by determining all the possible values that the variable can take, called the __sample space__ of the variable. The different elements in the sample space are called __classes__ or __levels__. The number or frequency of occurrence of each class can be determined. Finally, the relative frequency or proportion of each class can also be calculated by dividing the frequency of the class by the total number of observations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `anaemia` column indicates whether a subject has anemia. The sample space (all the values or elements or classes) that the `anaemia` variable can take can be determined using the `unique` method." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['No', 'Yes'], dtype=object)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Determine the unique classes for the anaemia column\n", "df['anaemia'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sample space elements are `No` (no anemia present) and `Yes` (anemia present). The `value_counts` method returns the frequency of each of the classes." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "anaemia\n", "No 170\n", "Yes 129\n", "Name: count, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Determine the value counts of the classes in the anaemia column\n", "df.anaemia.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A total of $170$ subjects did not have anemia and $129$ did." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passing `True` to the `normalize` argument of the `value_counts` method returns the relative frequency (or proportion) of each class." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "anaemia\n", "No 0.568562\n", "Yes 0.431438\n", "Name: proportion, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Determine the relative frequency of the classes in the anaemia column\n", "df.anaemia.value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "About $56.9\\%$ of the subjects did not have anemia and about $43.1\\%$ of subjects did. The __mode__ of this variable is therefore `No`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task\n", "\n", "Determine the sample space elements, the frequencies, and the relative frequencies of the `diabetes` variable." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solution" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['No', 'Yes'], dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.diabetes.unique()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "diabetes\n", "No 174\n", "Yes 125\n", "Name: count, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.diabetes.value_counts()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "diabetes\n", "No 0.58194\n", "Yes 0.41806\n", "Name: proportion, dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.diabetes.value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Comparative descriptive statistics__ for categorical variables express the frequency (or proportion) of the classes in the data, after grouping the values by the sample space elements of another categorical variable. Below, the frequency of the `anaemia` variable is determined for each of the classes in the `diabetes` variable, by making use of the `groupby` method." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "diabetes anaemia\n", "No No 98\n", " Yes 76\n", "Yes No 72\n", " Yes 53\n", "Name: count, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Determine the frequency of the classes in the anaemia column after grouping by the classes in the diabetes column\n", "df.groupby('diabetes').anaemia.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `normalize=True` argument can be added to the `value_counts` method to show the proportion of each `anaemia` class within each `diabetes` group. 
By multiplying by $100$, the percentages are obtained." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "diabetes anaemia\n", "No No 56.321839\n", " Yes 43.678161\n", "Yes No 57.600000\n", " Yes 42.400000\n", "Name: proportion, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Express the proportions as percentages\n", "df.groupby('diabetes').anaemia.value_counts(normalize=True) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pandas `crosstab` function can create a contingency table of the comparative frequencies. The variable that is passed first populates the rows of the table." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th>anaemia</th><th>No</th><th>Yes</th></tr>\n", "<tr><th>diabetes</th><th></th><th></th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>No</th><td>98</td><td>76</td></tr>\n", "<tr><th>Yes</th><td>72</td><td>53</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ "anaemia No Yes\n", "diabetes \n", "No 98 76\n", "Yes 72 53" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a cross tabulation of the anaemia and diabetes columns\n", "pd.crosstab(df.diabetes, df.anaemia)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The joint frequencies from the table above are expressed as percentages of all observations below." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th>anaemia</th><th>No</th><th>Yes</th></tr>\n", "<tr><th>diabetes</th><th></th><th></th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>No</th><td>32.775920</td><td>25.418060</td></tr>\n", "<tr><th>Yes</th><td>24.080268</td><td>17.725753</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ "anaemia No Yes\n", "diabetes \n", "No 32.775920 25.418060\n", "Yes 24.080268 17.725753" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Express the cross tabulation as percentages\n", "pd.crosstab(df.diabetes, df.anaemia, normalize=True) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task\n", "\n", "Generate a contingency table of the joint percentages of the `death` variable (rows) and the `anaemia` variable (columns)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solution" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th>anaemia</th><th>No</th><th>Yes</th></tr>\n", "<tr><th>death</th><th></th><th></th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>No</th><td>40.133779</td><td>27.759197</td></tr>\n", "<tr><th>Yes</th><td>16.722408</td><td>15.384615</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ "anaemia No Yes\n", "death \n", "No 40.133779 27.759197\n", "Yes 16.722408 15.384615" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(df.death, df.anaemia, normalize=True) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Exploring numerical variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Various statistics can be calculated for continuous numerical variables. These include measures of central tendency and measures of dispersion." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Measures of central tendency" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Measures of central tendency are statistical indicators that represent the center point or typical value of a dataset. These measures indicate where most values in a dataset fall and are also referred to as the central location of a distribution. The three main measures of central tendency are as follows.\n", "\n", "1. **Mean**: The mean, often called the average, is calculated by adding all data points in the dataset and then dividing by the number of data points. The mean is sensitive to outliers, meaning that extremely high or low values can skew the mean.\n", "\n", "2. **Median**: The median is the middle value in a dataset when the data points are arranged in ascending or descending order. If the dataset has an even number of observations, the median is the average of the two middle numbers. The median is robust to outliers and skewed data, being far less affected by extreme values than the mean.\n", "\n", "3. **Mode**: The mode is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode, or no mode at all. The mode was used to explore categorical variables in the previous section." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These measures help provide a summary of the dataset and can give a general sense of the _typical_ value that might be expected." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean for a numerical variable can be calculated using the `mean` method for a pandas series object. The mean of the `age` variable is calculated below." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "60.83389297658862" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the mean of the age variable\n", "df.age.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean for a variable can be calculated for different groups, created from the classes of a categorical variable. The mean of the `age` variable is calculated for each of the classes in the `death` variable, using the `groupby` method." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "death\n", "No 58.761906\n", "Yes 65.215281\n", "Name: age, dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the mean age by the groups in the death column\n", "df.groupby('death').age.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `median` method calculates the median of a pandas series object that contains numerical data. The median of the `age` variable is calculated below for each class in the `death` variable." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "death\n", "No 60.0\n", "Yes 65.0\n", "Name: age, dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the median age by the groups in the death column\n", "df.groupby('death').age.median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task\n", "\n", "Determine the median `ejection_fraction` for each of the classes in the `anaemia` variable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solution" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "anaemia\n", "No 38.0\n", "Yes 38.0\n", "Name: ejection_fraction, dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('anaemia')['ejection_fraction'].median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Measures of dispersion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Measures of dispersion, also known as measures of variability, provide insights into the spread or variability of a dataset. They indicate how spread out the values in a dataset are around the center (mean or median). The main measures of dispersion are as follows.\n", "\n", "1. **Range**: The range is the simplest measure of dispersion and is calculated as the difference between the highest and the lowest value in the dataset.\n", "\n", "2. **Variance**: Variance measures how far each number in the set is from the mean (average) and thus from every other number in the set. It's often used in statistical and probability theory.\n", "\n", "3. **Standard Deviation**: The standard deviation is the square root of the variance and provides a measure of dispersion in the same units as the original data. It's the most commonly used measure of spread.\n", "\n", "4. 
**Interquartile Range (IQR)**: The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. It is used to measure statistical dispersion and data variability by dividing a data set into quartiles." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These measures help to understand the variability within a dataset, the reliability of statistical estimations, and the level of uncertainty. They are crucial in many statistical analyses as they provide context to the measures of central tendency." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The minimum and maximum values of a numerical variable (expressed as a pandas series object) are calculated using the `min` and `max` methods." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "40.0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Determine the minimum value of the age column\n", "df.age.min()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "95.0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Determine the maximum value of the age column\n", "df.age.max()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "55.0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Determine the range of the age column\n", "df.age.max() - df.age.min()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sample variance of a numerical variable (expressed as a pandas series object) is determined using the `var` method. The `ddof` argument is the _delta degrees of freedom_, the value that is subtracted from the number of observations in the denominator. Setting `ddof` to $1$ calculates the sample variance. 
In pandas, $1$ is also the default value of `ddof`, and setting it to $0$ calculates the population variance instead. The difference between the two equations is shown in (1) for the sake of interest. In (1), $s^{2}$ is the sample variance, $\\sigma^{2}$ is the population variance, $x_{i}$ is the value of the numerical variable for subject $i$, $\\bar{x}$ is the sample mean, $n$ is the sample size, $\\mu$ is the population mean, and $N$ is the population size." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\begin{align*}\n", "&s^{2} = \\frac{\\sum_{i=1}^{n} \\left( x_{i} - \\bar{x} \\right)^{2}}{n-1} \\\\ \\\\\n", "&\\sigma^{2} = \\frac{\\sum_{i=1}^{N} \\left( x_{i} - \\mu \\right)^{2}}{N} \\tag{1}\n", "\\end{align*}\n", "$$" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "141.48648290797084" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Determine the sample variance of the age column\n", "df.age.var(ddof=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `std` method determines the standard deviation. The sample standard deviation is calculated below." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "11.894809074044478" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Determine the sample standard deviation of the age column\n", "df.age.std(ddof=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `stats` module of the scipy package contains many functions for statistical analysis. The `iqr` function in this module determines the interquartile range, which is the difference between the third and the first quartile. A quartile is specified by a fraction (between $0$ and $1$). If the fraction is $0.25$ then the value that is returned is the first quartile (the $25^{th}$ percentile). This is the value below which a quarter of the data values fall. 
If the fraction is $0.75$ then the value that is returned is the third quartile (the $75^{th}$ percentile). This is the value below which three-quarters of the data values fall. The interquartile range of the `age` values is calculated below." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "19.0" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Determine the interquartile range of the age column\n", "stats.iqr(df.age)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task\n", "\n", "Determine the sample variance in `age` for each of the classes in the `anaemia` variable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solution" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "anaemia\n", "No 141.763348\n", "Yes 139.675064\n", "Name: age, dtype: float64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('anaemia').age.var(ddof=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlation between numerical variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The correlation between two continuous numerical variables indicates how the values of one variable change as the values of the other change. The correlation requires a value for each variable from each subject in a sample. The correlation between two variables is a number between $-1$ and $1$. A value of $-1$ indicates a perfect negative correlation, a value of $0$ indicates no correlation, and a value of $1$ indicates a perfect positive correlation. A positive value indicates a positive correlation (as the values of one variable increase, so do the values of the other). 
A negative value indicates a negative correlation (as the values of one variable increase, the values of the other decrease)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rule of thumb the following values are used to indicate the strength of the correlation.\n", "\n", "- $0.00$ to $0.19$ _very weak_\n", "- $0.20$ to $0.39$ _weak_\n", "- $0.40$ to $0.59$ _moderate_\n", "- $0.60$ to $0.79$ _strong_\n", "- $0.80$ to $1.00$ _very strong_\n", "\n", "The same rule of thumb is used for the absolute value of negative correlations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The correlation between two variables can be calculated using the `corr` method, which returns the Pearson correlation coefficient by default. The correlation between the `age` and `ejection_fraction` variables is calculated below." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.060098363232912864" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the correlation between the age and ejection-fraction variables\n", "df.age.corr(df['ejection_fraction'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A value of about $0.06$ falls in the $0.00$ to $0.19$ band, so there is a very weak positive correlation between the two variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using pandas built-in descriptive statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `describe` method calculates various statistics for a pandas series object that contains numerical data. The method is used for the `age` variable below." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 299.000000\n", "mean 60.833893\n", "std 11.894809\n", "min 40.000000\n", "25% 51.000000\n", "50% 60.000000\n", "75% 70.000000\n", "max 95.000000\n", "Name: age, dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Describe the age column\n", "df.age.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the value for $50\\%$ is the fiftieth percentile value. Half of the ages are less than this value, i.e. it is the median." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Quiz questions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Questions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. How do you import the pandas package in Python?\n", "\n", "2. What function do you use to read a CSV file in pandas?\n", "\n", "3. How do you display the first 5 rows of a DataFrame in pandas?\n", "\n", "4. How do you calculate the mean of a column named `Age` in a DataFrame named `df`?\n", "\n", "5. How do you calculate the median of a column named `Salary` in a DataFrame named `df`?\n", "\n", "6. How do you calculate the standard deviation of a column named `Score` in a DataFrame named `df`?\n", "\n", "7. How do you find the number of missing values in each column of a DataFrame named `df`?\n", "\n", "8. How do you calculate the correlation between two columns, `Age` and `Salary`, in a DataFrame named `df`?\n", "\n", "9. How do you select a subset of a DataFrame `df` where the column `Age` is greater than $30$?\n", "\n", "10. How do you calculate the range (maximum - minimum) of a column named `Score` in a DataFrame named `df`?\n", "\n", "11. How do you group a DataFrame `df` by a column named `Department` and calculate the mean of `Salary` within each group?\n", "\n", "12. 
How do you group a DataFrame `df` by two columns, `Department` and `Job Title`, and count the number of rows within each group?\n", "\n", "13. How do you use the groupby method to find the maximum `Age` in each `Department` in a DataFrame `df`?\n", "\n", "14. How do you create a cross-tabulation table that shows the frequency count of `Department` (rows) and `Job Title` (columns) in a DataFrame `df`?\n", "\n", "15. How do you create a cross-tabulation table that shows the mean `Salary` for each combination of `Department` (rows) and `Job Title` (columns) in a DataFrame `df`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "datascience", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }