{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Small diabetes study\n", "\n", "In this assignment, we will work with a small dataset of diabetes patients taken from [here](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html).\n", "\n", "\n", "## Introduction to probability and statistics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib\n", "import pytest\n", "import ipytest\n", "import unittest\n", "\n", "ipytest.autoconfig()\n", "\n", "df = pd.read_csv(\"../../assets/data/diabetes.tsv\",sep='\\t')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In this dataset, columns as the following:\n", "* Age and sex are self-explanatory\n", "* BMI is body mass index\n", "* BP is average blood pressure\n", "* S1 through S6 are different blood measurements\n", "* Y is the qualitative measure of disease progression over one year\n", "\n", "Let's study this dataset using methods of probability and statistics.\n", "\n", "### Task 1: Compute mean values and variance for all values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_df_mean(df):\n", " if df is None:\n", " raise Exception('df cannot be None.')\n", " return df____\n", "\n", "def get_df_std(df):\n", " if df is None:\n", " raise Exception('df cannot be None.')\n", " return df____\n", "\n", "df_mean = get_df_mean(df)\n", "df_std = get_df_std(df)\n", "\n", "print(df_mean, df_std)" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "
Check result by executing below... 📝
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "%%ipytest -qq\n", "\n", "def create_test_df():\n", " return pd.DataFrame(\n", " {\n", " 'c1': [1, 2, 3, 4, 5], \n", " 'c2': [6, 7, 8, 9, 10]\n", " }\n", " )\n", "\n", "class TestGetDFMean(unittest.TestCase):\n", "\n", " def test_get_df_mean_happy_case(self):\n", " # assign\n", " test_df = create_test_df()\n", " \n", " # act\n", " actual_result = get_df_mean(test_df)\n", "\n", " # assert\n", " assert actual_result['c1'] == 3\n", " assert actual_result['c2'] == 8\n", "\n", " def test_get_df_mean_with_none_df(self):\n", " # act & assert\n", " with pytest.raises(Exception):\n", " get_df_mean(None)\n", "\n", " def test_get_df_mean_with_empty_df(self):\n", " # act\n", " actual_result = get_df_mean(pd.DataFrame())\n", "\n", " # assert\n", " assert actual_result.equals(pd.Series(dtype=\"float64\"))\n", "\n", "class TestGetDFStd(unittest.TestCase):\n", "\n", " def test_get_df_std_happy_case(self):\n", " # assign\n", " test_df = create_test_df()\n", " \n", " # act\n", " actual_result = get_df_std(test_df)\n", "\n", " # assert\n", " assert actual_result['c1'] == 1.58113883008418981\n", " assert actual_result['c2'] == 1.5811388300841898\n", "\n", " def test_get_df_std_with_none_df(self):\n", " # act & assert\n", " with pytest.raises(Exception):\n", " get_std(None)\n", "\n", " def test_get_df_std_with_empty_df(self):\n", " # act\n", " actual_result = get_df_std(pd.DataFrame())\n", "\n", " # assert\n", " assert actual_result.equals(pd.Series(dtype=\"float64\"))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "hide-input" ] }, "source": [ "
\n", " \n", "
👩‍💻 Hint\n", "\n", "You can consider to use pandas.DataFrame.mean and pandas.DataFrame.std.\n", "\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Task 2: Plot boxplots for BMI, BP and Y depending on gender" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Part 1: get the data ready" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "columns_to_plot = ['BMI', 'BP', 'Y']\n", "\n", "def filter_by(df, column_name, column_value):\n", " return df[df[____] == ____]\n", "\n", "# filter the df by 'SEX' == 1\n", "df_sex_1 = filter_by(____, ____, ____)\n", "\n", "# filter the df by 'SEX' == 2\n", "df_sex_2 = filter_by(____, ____, ____)" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "
Check result by executing below... 📝
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "%%ipytest -qq\n", "\n", "def create_test_df():\n", " return pd.DataFrame(\n", " {\n", " 'name': [\"Lucy\", \"Tom\", \"Lily\", \"Anny\", \"Mike\"], \n", " 'gender': [\"f\", \"m\", \"f\", \"f\", \"m\"],\n", " }\n", " )\n", "\n", "class TestFilterBy(unittest.TestCase):\n", "\n", " def test_filter_by_happy_case(self):\n", " # assign\n", " test_df = create_test_df()\n", " \n", " # act\n", " actual_result = filter_by(test_df, 'gender', \"f\")\n", "\n", " # assert\n", " assert actual_result.equals(\n", " pd.DataFrame(\n", " {\n", " 'name': [\"Lucy\", \"Lily\", \"Anny\"], \n", " 'gender': [\"f\", \"f\", \"f\"],\n", " },\n", " index=pd.Index([0, 2, 3]),\n", " )\n", " )\n", "\n", " def test_filter_by_with_none_df(self):\n", " # act & assert\n", " with pytest.raises(Exception):\n", " filter_by(None, 'gender', \"f\")\n", "\n", " def test_filter_by_with_empty_df(self):\n", " # act & assert\n", " with pytest.raises(Exception):\n", " filter_by(pd.DataFrame(), 'gender', \"f\")\n", " \n", " def test_filter_by_invalid_column_name(self):\n", " # act & assert\n", " with pytest.raises(Exception):\n", " filter_by(test_df, 'invalid_column_name', \"f\")\n", " \n", " \n", " def test_filter_by_invalid_column_value(self):\n", " # assign\n", " test_df = create_test_df()\n", " \n", " # act\n", " actual_result = filter_by(test_df, 'gender', \"invalid_column_value\")\n", " \n", " # assert\n", " assert actual_result.equals(\n", " pd.DataFrame(\n", " columns=['name', 'gender']\n", " )\n", " )" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "hide-input" ] }, "source": [ "
\n", " \n", "
👩‍💻 Hint\n", " \n", "Refer to indexin and selecting data on pandas.DataFrame.\n", "\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Part 2: plot the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def df_boxplot(df, column):\n", " if df is not None and not df.empty:\n", " df.____(column=____)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_boxplot(df_sex_1, columns_to_plot)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_boxplot(df_sex_2, columns_to_plot)" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "
Check result by executing below... 📝
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "%%ipytest -qq\n", "\n", "from unittest.mock import Mock, patch\n", "\n", "class TestDFBoxPlot(unittest.TestCase):\n", " \n", " def test_df_boxplot_happy_case(self):\n", " # assign\n", " test_df = Mock(return_value=pd.DataFrame(\n", " {\n", " 'c1': [1, 2, 3, 4, 5], \n", " }\n", " ))\n", " test_df.empty = False\n", " \n", " with patch.object(test_df, 'boxplot') as mock_df_boxplot:\n", " # act\n", " actual_result = df_boxplot(test_df)\n", "\n", " # assert\n", " mock_df_boxplot.assert_called_once()\n", "\n", " def test_df_boxplot_with_empty_df(self):\n", " # assign\n", " test_df = Mock(return_value=pd.DataFrame())\n", " \n", " with patch.object(test_df, 'boxplot') as mock_df_boxplot:\n", " # act\n", " actual_result = df_boxplot(test_df)\n", "\n", " # assert\n", " mock_df_boxplot.assert_not_called()\n", " \n", " def test_df_boxplot_with_none_df(self):\n", " # assign\n", " test_df = Mock(return_value=None)\n", " \n", " with patch.object(test_df, 'boxplot') as mock_df_boxplot:\n", " # act\n", " actual_result = df_boxplot(test_df)\n", "\n", " # assert\n", " mock_df_boxplot.assert_not_called()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "hide-input" ] }, "source": [ "
\n", " \n", "
👩‍💻 Hint\n", "\n", "You can consider to use pandas.DataFrame.boxplot.\n", "\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3: What is the distribution of Age, Sex, BMI and Y variables?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def df_plot(df, column):\n", " if df is not None and not df.empty:\n", " df____(column=____)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "age_distribution = df_plot(df['AGE'], columns_to_plot)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sex_distribution = df_plot(df['SEX'], columns_to_plot)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bmi_distribution = df_plot(df['BMI'], columns_to_plot)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_distribution = df_plot(df['Y'], columns_to_plot)" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "
Check result by executing below... 📝
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "%%ipytest -qq\n", "\n", "from unittest.mock import Mock, patch\n", "\n", "class TestDFPlot(unittest.TestCase):\n", "\n", " def test_df_plot_happy_case(self):\n", " # assign\n", " test_df = Mock(return_value=pd.DataFrame(\n", " {\n", " 'c1': [1, 2, 3, 4, 5], \n", " }\n", " ))\n", " test_df.empty = False\n", " \n", " with patch.object(test_df, 'plot') as mock_df_plot:\n", " # act\n", " actual_result = df_plot(test_df)\n", "\n", " # assert\n", " mock_df_plot.assert_called_once()\n", "\n", " def test_df_plot_with_empty_df(self):\n", " # assign\n", " test_df = Mock(return_value=pd.DataFrame())\n", " \n", " with patch.object(test_df, 'plot') as mock_df_plot:\n", " # act\n", " actual_result = df_plot(test_df)\n", "\n", " # assert\n", " mock_df_plot.assert_not_called()\n", " \n", " def test_df_plot_with_none_df(self):\n", " # assign\n", " test_df = Mock(return_value=None)\n", " \n", " with patch.object(test_df, 'plot') as mock_df_plot:\n", " # act\n", " actual_result = df_plot(test_df)\n", "\n", " # assert\n", " mock_df_plot.assert_not_called()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "hide-input" ] }, "source": [ "
\n", " \n", "
👩‍💻 Hint\n", " \n", "You can consider to use pandas.DataFrame.plot.\n", "\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4: Test the correlation between different variables and disease progression (Y)\n", "\n", "> **Hint** Correlation matrix would give you the most useful information on which values are dependent." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_df_corr_with(df, with_column):\n", " if df is not None and not df.empty:\n", " return df____()[____]\n", "\n", "df_corr_y = get_df_corr_with(____, ____)\n", "df_corr_y" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "
Check result by executing below... 📝
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "%%ipytest -qq\n", "\n", "def create_test_df():\n", " return pd.DataFrame(\n", " {\n", " 'c1': [1, 2, 3, 4, 5], \n", " 'c2': [6, 7, 8, 9, 10]\n", " }\n", " )\n", "\n", "class TestGetDfCorrWith(unittest.TestCase):\n", "\n", " def test_get_df_corr_with_happy_case(self):\n", " # assign\n", " test_df = create_test_df()\n", " \n", " # act\n", " actual_result = get_df_corr_with(test_df, 'c2')\n", "\n", " # assert\n", " assert actual_result.equals(pd.Series(\n", " [1.0, 1.0], index=['c1', 'c2']\n", " ))\n", "\n", " def test_get_df_corr_with_with_none_df(self):\n", " # act \n", " actual_result = get_df_corr_with(None, 'any_column')\n", " \n", " # assert\n", " assert not actual_result\n", "\n", " def test_get_df_corr_with_with_empty_df(self):\n", " # act\n", " actual_result = get_df_corr_with(pd.DataFrame(), 'any_column')\n", "\n", " # assert\n", " assert not actual_result\n", " \n", " def test_get_df_corr_with_with_invalid_column_name(self):\n", " # act & assert\n", " with pytest.raises(Exception):\n", " get_df_corr_with(create_test_df(), 'invalid_column')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "hide-input" ] }, "source": [ "
\n", " \n", "
👩‍💻 Hint\n", " \n", "You can consider to use pandas.DataFrame.corr.\n", "\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 5: Test the hypothesis that the degree of diabetes progression is different between men and women" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the correlation between 'SEX' and 'Y'\n", "df_corr_sex_with_y = get_df_corr_with(____, ____)[____]\n", "df_corr_sex_with_y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot the scatterplot between 'SEX' and 'Y'\n", "\n", "def df_scatterplot(df, c1, c2):\n", " if df is not None and not df.empty:\n", " df.____(____, _____)\n", "\n", "df_scatterplot(df, 'SEX', 'Y')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fill in True or False\n", "diabetes_progression_correlated_with_sex = ____\n", "print(f\"The degree of diabetes progression is {'different' if diabetes_progression_correlated_with_sex else 'not different'} between men and women.\")" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "
Check result by executing below... 📝
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "%%ipytest -qq\n", "\n", "from unittest.mock import Mock, patch\n", "\n", "class TestDFScatterPlot(unittest.TestCase):\n", "\n", " def test_df_scatterplot_happy_case(self):\n", " # assign\n", " test_df = Mock(\n", " plot=Mock(scatter=Mock()),\n", " empty=False,\n", " return_value=pd.DataFrame(\n", " {\n", " 'c1': [1, 2, 3, 4, 5], \n", " 'c2': [6, 7, 8, 9, 10], \n", " }\n", " )\n", " )\n", " \n", " # act\n", " actual_result = df_scatterplot(test_df, 'c1', 'c2')\n", "\n", " # assert\n", " test_df.plot.scatter.assert_called_once_with('c1', 'c2')\n", "\n", " def test_df_scatterplot_with_empty_df(self):\n", " # assign\n", " test_df = Mock(plot=Mock(scatter=Mock()), empty=True, return_value=pd.DataFrame())\n", " \n", " # act\n", " actual_result = df_scatterplot(test_df, 'c1', 'c2')\n", "\n", " # assert\n", " test_df.plot.scatter.assert_not_called()\n", " \n", " def test_df_scatterplot_with_none_df(self):\n", " # assign\n", " test_df = Mock(plot=Mock(scatter=Mock()), empty=True, return_value=None)\n", " \n", " # act\n", " actual_result = df_scatterplot(test_df, 'c1', 'c2')\n", "\n", " # assert\n", " test_df.plot.scatter.assert_not_called()\n", "\n", "assert not diabetes_progression_correlated_with_sex" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "hide-input" ] }, "source": [ "
\n", " \n", "
👩‍💻 Hint\n", "\n", "You can consider to use pandas.DataFrame.corrwith to get the correlation, and pandas.DataFrame.plot.scatter to plot the scatterplots.\n", "\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rubric\n", "\n", "Exemplary | Adequate | Needs Improvement\n", "--- | --- | -- |\n", "All required tasks are complete, graphically illustrated and explained | Most of the tasks are complete, explanations or takeaways from graphs and/or obtained values are missing | Only basic tasks such as computation of mean/variance and basic plots are complete, no conclusions are made from the data\n", "\n", "## Acknowledgments\n", "\n", "Thanks to Microsoft for creating the open-source course [Data Science for Beginners](https://github.com/microsoft/Data-Science-For-Beginners). It inspires the majority of the content in this chapter.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "vscode": { "interpreter": { "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" } } }, "nbformat": 4, "nbformat_minor": 4 }