{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course\n", " \n", "Authors: [Ilya Baryshnikov](https://www.linkedin.com/in/baryshnikov-ilya/), [Maxim Uvarov](https://www.linkedin.com/in/maxis42/), and [Yury Kashnitsky](https://www.linkedin.com/in/festline/). Translated and edited by [Inga Kaydanova](https://www.linkedin.com/in/inga-kaidanova-a92398b1/), [Egor Polusmak](https://www.linkedin.com/in/egor-polusmak/), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/). All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "#
Assignment #2 (demo)\n", "##
Analyzing cardiovascular disease data \n", " \n", " \n", "**Same assignment as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/a2-demo-analyzing-cardiovascular-data) + [solution](https://www.kaggle.com/kashnitsky/a2-demo-analyzing-cardiovascular-data-solution).**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this assignment, you will answer questions about a dataset on cardiovascular disease. You do not need to download the data: it is already in the repository. There are some Tasks that will require you to write code. Complete them and then answer the questions in the [form](https://docs.google.com/forms/d/13cE_tSIb6hsScQvvWUJeu1MEHE5L6vnxQUbDYpXsf24).\n", "\n", "#### Problem\n", "\n", "Predict the presence or absence of cardiovascular disease (CVD) using the patient examination results.\n", "\n", "#### Data description\n", "\n", "There are 3 types of input features:\n", "\n", "- *Objective*: factual information;\n", "- *Examination*: results of medical examination;\n", "- *Subjective*: information given by the patient.\n", "\n", "| Feature | Variable Type | Variable | Value Type |\n", "|---------|--------------|---------------|------------|\n", "| Age | Objective Feature | age | int (days) |\n", "| Height | Objective Feature | height | int (cm) |\n", "| Weight | Objective Feature | weight | float (kg) |\n", "| Gender | Objective Feature | gender | categorical code |\n", "| Systolic blood pressure | Examination Feature | ap_hi | int |\n", "| Diastolic blood pressure | Examination Feature | ap_lo | int |\n", "| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |\n", "| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |\n", "| Smoking | Subjective Feature | smoke | binary |\n", "| Alcohol intake | Subjective Feature | alco | binary |\n", "| Physical activity | Subjective Feature | active | binary |\n", "| Presence or absence of cardiovascular disease | 
Target Variable | cardio | binary |\n", "\n", "All of the dataset values were collected at the moment of medical examination." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get to know our data by performing a preliminary data analysis.\n", "\n", "# Part 1. Preliminary data analysis\n", "\n", "First, we will initialize the environment:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import all required modules\n", "# Disable warnings\n", "import warnings\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "# Import plotting modules and set up\n", "import seaborn as sns\n", "\n", "sns.set()\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import matplotlib.ticker\n", "\n", "%matplotlib inline\n", "%config InlineBackend.figure_format = 'retina'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will use the `seaborn` library for visual analysis, so let's set that up too:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Tune the visual settings for figures in `seaborn`\n", "sns.set_context(\n", " \"notebook\", font_scale=1.5, rc={\"figure.figsize\": (11, 8), \"axes.titlesize\": 18}\n", ")\n", "\n", "from matplotlib import rcParams\n", "\n", "rcParams[\"figure.figsize\"] = 11, 8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make it simple, we will work only with the training part of the dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"../../data/mlbootcamp5_train.csv\", sep=\";\")\n", "print(\"Dataset size: \", df.shape)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It would be instructive to peek into the values of our variables.\n", " \n", "Let's convert the data into *long* format and depict the value counts of the categorical 
features using [`catplot()`](https://seaborn.pydata.org/generated/seaborn.catplot.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_uniques = pd.melt(\n", "    frame=df,\n", "    value_vars=[\"gender\", \"cholesterol\", \"gluc\", \"smoke\", \"alco\", \"active\", \"cardio\"],\n", ")\n", "df_uniques = (\n", "    pd.DataFrame(df_uniques.groupby([\"variable\", \"value\"])[\"value\"].count())\n", "    .sort_index(level=[0, 1])\n", "    .rename(columns={\"value\": \"count\"})\n", "    .reset_index()\n", ")\n", "\n", "# `catplot()` replaces the deprecated `factorplot()`; `height` replaces `size`\n", "sns.catplot(\n", "    x=\"variable\", y=\"count\", hue=\"value\", data=df_uniques, kind=\"bar\", height=12\n", ");" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "We can see that the target classes are balanced. That's great!\n", "\n", "Let's split the dataset by target values. Can you already spot the most significant feature by just looking at the plot?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_uniques = pd.melt(\n", "    frame=df,\n", "    value_vars=[\"gender\", \"cholesterol\", \"gluc\", \"smoke\", \"alco\", \"active\"],\n", "    id_vars=[\"cardio\"],\n", ")\n", "df_uniques = (\n", "    pd.DataFrame(df_uniques.groupby([\"variable\", \"value\", \"cardio\"])[\"value\"].count())\n", "    .sort_index(level=[0, 1])\n", "    .rename(columns={\"value\": \"count\"})\n", "    .reset_index()\n", ")\n", "\n", "# One panel per target class (`col=\"cardio\"`)\n", "sns.catplot(\n", "    x=\"variable\",\n", "    y=\"count\",\n", "    hue=\"value\",\n", "    col=\"cardio\",\n", "    data=df_uniques,\n", "    kind=\"bar\",\n", "    height=9,\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that the distribution of cholesterol and glucose levels differs greatly depending on the value of the target variable. 
Is this a coincidence?\n", "\n", "Now, let's calculate some statistics for the unique values of each feature:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the number of unique values per column;\n", "# for columns with at most 3 unique values, also print the value counts\n", "for c in df.columns:\n", "    n = df[c].nunique()\n", "    print(c)\n", "    if n <= 3:\n", "        print(n, sorted(df[c].value_counts().to_dict().items()))\n", "    else:\n", "        print(n)\n", "    print(10 * \"-\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the end, we have:\n", "- 5 numerical features (excluding *id*);\n", "- 7 categorical features;\n", "- 70000 records in total." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.1. Basic observations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1.1. (1 point). How many men and women are present in this dataset? Values of the `gender` feature were not given (whether \"1\" stands for women or for men) – figure this out by analyzing height, assuming that men are taller on average.**\n", "1. 45530 women and 24470 men\n", "2. 45530 men and 24470 women\n", "3. 45470 women and 24530 men\n", "4. 45470 men and 24530 women" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1.2. (1 point). Which gender more often reports consuming alcohol - men or women?**\n", "1. women\n", "2. men" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1.3. (1 point). What is the difference between the percentages of smokers among men and women (rounded)?**\n", "1. 4\n", "2. 16\n", "3. 20\n", "4. 24" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1.4. (1 point). What is the difference between the median ages of smokers and non-smokers (in months, rounded)? You'll need to figure out the units of the `age` feature in this dataset.**\n", "\n", "1. 5\n", "2. 10\n", "3. 15\n", "4. 20" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2. 
Risk maps\n", "### Task:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the website for the European Society of Cardiology, a [SCORE scale](https://www.escardio.org/Education/Practice-Tools/CVD-prevention-toolbox/SCORE-Risk-Charts) is provided. It is used for calculating the risk of death from a cardiovascular disease in the next 10 years. Here it is:\n", "\n", "\n", "Let's take a look at the upper-right rectangle, which shows a subset of smoking men aged from 60 to 65. (It's not obvious, but the values in the figure represent the upper bound.)\n", "\n", "We see the value 9 in the lower-left corner of the rectangle and 47 in the upper-right. This means that, for people in this gender-age group whose systolic pressure is less than 120, the risk of CVD is estimated to be about 5 times lower than for those with pressure in the interval [160,180).\n", "\n", "Let's calculate that same ratio using our data.\n", "\n", "Clarifications:\n", "- Calculate the ``age_years`` feature – round age to the nearest number of years. For this task, select only the people aged 60 to 64, inclusive.\n", "- Cholesterol level categories differ between the figure and our dataset. The conversion for the ``cholesterol`` feature is as follows: 4 mmol/l $\\rightarrow$ 1, 5-7 mmol/l $\\rightarrow$ 2, 8 mmol/l $\\rightarrow$ 3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1.5. (2 points). Calculate the fraction of the people with CVD for the two segments described above. What is the ratio of these two fractions?**\n", "\n", "1. 1\n", "2. 2\n", "3. 3\n", "4. 4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.3. Analyzing BMI\n", "### Task:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a new feature – BMI ([Body Mass Index](https://en.wikipedia.org/wiki/Body_mass_index)). 
To do this, divide the weight in kilograms by the square of the height in meters (the `height` column is in centimeters, so convert it first). Normal BMI values are said to be from 18.5 to 25." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1.6. (2 points). Choose the correct statements:**\n", "\n", "1. Median BMI in the sample is within the range of normal BMI values.\n", "2. The BMI for women is on average higher than for men.\n", "3. Healthy people have, on average, a higher BMI than the people with CVD.\n", "4. For healthy, non-drinking men, BMI is closer to the norm than for healthy, non-drinking women." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.4. Cleaning data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task:\n", "We can see that the data is not perfect. It contains \"dirt\" and inaccuracies. We'll see this better as we visualize the data.\n", "\n", "Filter out the following patient segments (we consider these to be erroneous data):\n", "\n", "- diastolic pressure is higher than systolic pressure\n", "- height is strictly less than the 2.5th percentile (Use `pd.Series.quantile` to compute this value. If you are not familiar with the function, please read the docs.)\n", "- height is strictly more than the 97.5th percentile\n", "- weight is strictly less than the 2.5th percentile\n", "- weight is strictly more than the 97.5th percentile\n", "\n", "This is not everything that we can do to clean this data, but it is sufficient for now." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1.7. (2 points). What percent of the original data (rounded) did we throw away?**\n", "\n", "1. 8\n", "2. 9\n", "3. 10\n", "4. 11" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2. Visual data analysis\n", "\n", "## 2.1. 
Correlation matrix visualization\n", "\n", "To understand the features better, you can create a matrix of the correlation coefficients between the features. Use the initial dataset (non-filtered).\n", "\n", "### Task:\n", "\n", "Plot a correlation matrix using [`heatmap()`](http://seaborn.pydata.org/generated/seaborn.heatmap.html). You can create the matrix with `DataFrame.corr()` from `pandas`, using its default parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 2.1. (1 point).** Which pair of features has the strongest Pearson correlation with the *gender* feature?\n", "\n", "1. Cardio, Cholesterol\n", "2. Height, Smoke\n", "3. Smoke, Alco\n", "4. Height, Weight" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.2. Height distribution of men and women\n", "\n", "From our exploration of the unique values earlier, we know that gender is encoded by the values *1* and *2*. Although you do not know the mapping of these values to gender, you can figure that out graphically by looking at the mean values of height and weight for each value of the *gender* feature.\n", "\n", "### Task:\n", "\n", "Create a violin plot for the height and gender using [`violinplot()`](https://seaborn.pydata.org/generated/seaborn.violinplot.html). Use the parameters:\n", "- `hue` to split by gender;\n", "- `scale` to reflect the number of records for each gender (in seaborn 0.13 and later, this parameter is named `density_norm`).\n", "\n", "In order for the plot to render correctly, you need to convert your `DataFrame` to *long* format using the `melt()` function from `pandas`. Here is [an example](https://stackoverflow.com/a/41575149/3338479) of this for your reference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 2.2. 
(1 point).** Which pair of features has the strongest Spearman correlation?\n", "\n", "1. Height, Weight\n", "2. Age, Weight\n", "3. Cholesterol, Gluc\n", "4. Cardio, Cholesterol\n", "5. Ap_hi, Ap_lo\n", "6. Smoke, Alco" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3. Rank correlation\n", "\n", "In most cases, *the Pearson coefficient of linear correlation* is more than enough to discover patterns in data. \n", "But let's go a little further and calculate a [rank correlation](https://en.wikipedia.org/wiki/Rank_correlation). It will help us identify feature pairs in which a lower rank in the ordered values of one feature consistently corresponds to a lower rank in the other (and the opposite in the case of negative correlation).\n", "\n", "### Task:\n", "\n", "Calculate and plot a correlation matrix using [Spearman's rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 2.3. (1 point).** Why do these features have a strong rank correlation?\n", "\n", "1. Inaccuracies in the data (data acquisition errors).\n", "2. The relation is wrong; these features should not be related.\n", "3. Nature of the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.4. Age\n", "\n", "Previously, we calculated the age of the respondents in years at the moment of examination." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task:\n", "\n", "Create a *count plot* using [`countplot()`](http://seaborn.pydata.org/generated/seaborn.countplot.html) with the age on the *X* axis and the number of people on the *Y* axis. Your resulting plot should have two bars for each age, corresponding to the number of people in each *cardio* class at that age." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 2.4. (1 point).** What is the smallest age at which people with CVD outnumber people without CVD?\n", "\n", "1. 44\n", "2. 55\n", "3. 64\n", "4. 70" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }