{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <span style=\"color:#ffa500\">1 | INTRODUCTION TO HEALTH DATA SCIENCE</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dct=\"http://purl.org/dc/terms/\"><span property=\"dct:title\">This chapter of an Introduction to Health Data Science</span> by <span property=\"cc:attributionName\">Dr JH Klopper</span> is licensed under <a href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/?ref=chooser-v1\" target=\"_blank\" rel=\"license noopener noreferrer\" style=\"display:inline-block;\">Attribution-NonCommercial-NoDerivatives 4.0 International<img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/nc.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/nd.svg?ref=chooser-v1\"></a></p>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note**: This is a Jupyter Notebook. If you are not familiar with Jupyter Notebooks, please read the [Jupyter Notebook Quick Start Guide](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style=\"color:#0096FF\">Python packages used in this notebook</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This course uses the Python computer language and Jupyter notebooks for the generation of the course content. At the start of each notebook, we will list the Python packages used in that notebook. The following Python packages are imported for use in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import read_csv\n",
    "import pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from plotly import express, io"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "io.templates.default = \"gridon\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style=\"color:#0096FF\">Introduction</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This chapter serves as a first introduction to health data science. Health Data Science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured health data. It combines aspects of data science, statistics, machine learning, and health informatics to generate insights that can be used to improve health outcomes, enhance patient care, and inform health policy."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The image below visualizes the intersection of the major disciplines that form the foundation of health data science. The intersection of these disciplines is where health data science resides."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Venn diagram of health data science](Venn_Diagram_Data_Science.png)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While data science pertains to all fields of the management and analysis of data, health data science is specifically concerned with the management and analysis of health data. Health data science is a broad field that encompasses many different disciplines, including biostatistics, epidemiology, health informatics, and machine learning."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style=\"color:#0096FF\">The growth in health data science</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Health data science has seen significant growth over the past few years. This growth has been driven by several factors.\n",
    "\n",
    "1. **Increased Data Availability**: The proliferation of electronic health records (EHRs), wearable technology, and other digital health tools has led to an explosion in the amount of avalibale health-related data. Health-related data can be analyzed to gain insights into patient health, disease trends, treatment effectiveness, and more.\n",
    "\n",
    "2. **Technological Advancements**: Advances in technologies such as machine learning, artificial intelligence, and cloud computing have made it possible to analyze large, complex health datasets. These technologies can identify patterns and trends in the data that would be difficult, if not impossible, to detect manually or through standard statistical analysis.\n",
    "\n",
    "3. **Demand for Personalized Medicine**: There iss an ever-growing demand for personalized medicine, which tailors treatment to the individual patient based on their unique genetic makeup, lifestyle, and environment. Health data science plays a crucial role in personalized medicine by analyzing patient data to identify individual risk factors, predict disease progression, and determine the most effective treatments.\n",
    "\n",
    "4. **Public Health Needs**: The COVID-19 pandemic has underscored the importance of health data science in public health. Health data scientists have played a key role in tracking the spread of the virus, identifying risk factors for severe disease, and evaluating the effectiveness of various interventions.\n",
    "\n",
    "5. **Policy and Investment**: Governments and private sector companies around the world are investing heavily in health data science. They recognize its potential to improve patient care, reduce healthcare costs, and drive innovation in the healthcare industry."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given that this is a course on health data science, it serves as a good example to use data to show the current growth in health data science. The PubMed repository is a good source of data for this purpose. PubMed is a free search engine that provides access to over 32 million citations and abstracts from biomedical and life sciences journals. It is maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The PubMed repository was accessed in mid-2023 and the search term _data science_ was used. At the top-left of the first results page is a downloadable link to a spreadsheet file (in comma-separated values file format). This file has two columns: _Year_ and _Count_. The _Year_ column contains the year and the _Count_ column contains the number of results for that year. The file was downloaded and saved as `PubMed_Data_Science.csv` in the same folder as this notebook. The pandas `read_csv` function is used below to import the file as a pandas dataframe object, assigned to the computer variable `df`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the PubMed_Data_Science.csv file and assign the dataframe object to the variable df\n",
    "df = read_csv(\"PubMed_Data_Science.csv\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `head` method is used to display the first five rows of the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Year</th>\n",
       "      <th>Count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2022</td>\n",
       "      <td>8837</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2021</td>\n",
       "      <td>7104</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2020</td>\n",
       "      <td>4511</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2019</td>\n",
       "      <td>2686</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2018</td>\n",
       "      <td>1565</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Year  Count\n",
       "0  2022   8837\n",
       "1  2021   7104\n",
       "2  2020   4511\n",
       "3  2019   2686\n",
       "4  2018   1565"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the first 5 rows of the dataframe\n",
    "df.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plotly data visualization package is used to create a bar plot of the number of PubMed results for the search term _data science_ over time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "config": {
        "plotlyServerURL": "https://plot.ly"
       },
       "data": [
        {
         "alignmentgroup": "True",
         "hovertemplate": "Year=%{x}<br>Count=%{y}<extra></extra>",
         "legendgroup": "",
         "marker": {
          "color": "#1F77B4",
          "pattern": {
           "shape": ""
          }
         },
         "name": "",
         "offsetgroup": "",
         "orientation": "v",
         "showlegend": false,
         "textposition": "auto",
         "type": "bar",
         "x": [
          2022,
          2021,
          2020,
          2019,
          2018,
          2017,
          2016,
          2015,
          2014,
          2013,
          2012,
          2011,
          2010,
          2009,
          2008,
          2007,
          2006,
          2004,
          2000,
          1998,
          1997,
          1992
         ],
         "xaxis": "x",
         "y": [
          8837,
          7104,
          4511,
          2686,
          1565,
          836,
          451,
          172,
          77,
          29,
          10,
          6,
          3,
          1,
          1,
          4,
          4,
          1,
          3,
          1,
          1,
          1
         ],
         "yaxis": "y"
        }
       ],
       "layout": {
        "barmode": "relative",
        "legend": {
         "tracegroupgap": 0
        },
        "template": {
         "data": {
          "pie": [
           {
            "automargin": true,
            "type": "pie"
           }
          ]
         },
         "layout": {
          "xaxis": {
           "showgrid": true,
           "title": {
            "standoff": 15
           }
          },
          "yaxis": {
           "showgrid": true,
           "title": {
            "standoff": 15
           }
          }
         }
        },
        "title": {
         "text": "Count of Articles using the search term Data Science in PubMed by Year"
        },
        "xaxis": {
         "anchor": "y",
         "domain": [
          0,
          1
         ],
         "title": {
          "text": "Year"
         }
        },
        "yaxis": {
         "anchor": "x",
         "domain": [
          0,
          1
         ],
         "title": {
          "text": "Count"
         }
        }
       }
      }
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Create a plotly bar plot using the Year as the x-axis and the Count of articles as the y-axis\n",
    "fig = express.bar(df, x=\"Year\", y=\"Count\", title=\"Count of Articles using the search term Data Science in PubMed by Year\")\n",
    "fig.show()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From this data, it is clear that, at least in the biomedical and life sciences fields, the number of publications on data science has increased dramatically over the past few years."
   ]
  },
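  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough check of this growth, the sketch below computes the ratio of each year's count to the count of the preceding year. To keep the cell self-contained, it uses only the five most recent counts displayed by the `head` method above, rather than the full file. The variable name `recent` and the column name `Growth` are illustrative choices and not part of the original analysis. A ratio greater than $1$ indicates more results than in the preceding year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "# Counts copied from the five rows displayed by df.head() above\n",
    "recent = DataFrame({\"Year\": [2022, 2021, 2020, 2019, 2018], \"Count\": [8837, 7104, 4511, 2686, 1565]})\n",
    "\n",
    "# Sort chronologically so that each row follows the preceding year\n",
    "recent = recent.sort_values(\"Year\").reset_index(drop=True)\n",
    "\n",
    "# Ratio of each year's count to the preceding year's count (missing for the first row)\n",
    "recent[\"Growth\"] = recent[\"Count\"] / recent[\"Count\"].shift(1)\n",
    "recent"
   ]
  },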
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style=\"color:#0096FF\">Careers in health data science</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The salaries for jobs in health data science is very high. The [2020 Burtch Works Study](https://www.burtchworks.com/industry-insights/data-science-analytics-salaries-are-on-the-rise) on Data Science Tools found that the median base salary for data scientists in the United States was $\\$130,000$. The median base salary for data scientists in the United States with a graduate degree was $\\$150,000$. The median base salary for data scientists in the United States with a graduate degree and $10$ or more years of experience was $\\$250,000$."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the image below is a reproduction of a bar plot from ZipRecruiter, showing more specific detail on health data science-related salaries."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![HDS_Salary](ZipRecruiterSalary.png)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style=\"color:#0096FF\">Health data</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Health data refers to any data that is related to a person's health status, their health care, or their health determinants. It encompasses a broad range of information types, including but not limited to the following.\n",
    "\n",
    "1. **Clinical Data**: This includes medical histories, laboratory test results, radiology images, and other data types that are typically found in electronic health records (EHRs). Clinical data is usually generated during the process of patient care within a healthcare setting such as a hospital, clinic, or doctor's office.\n",
    "\n",
    "2. **Genomic Data**: This is data derived from an individual's genetic material (DNA). With the advent of technologies like high-throughput sequencing, it's now possible to generate detailed data on an individual's entire genome. Genomic data can provide insights into an individual's risk of developing certain diseases and their likely response to different types of treatment.\n",
    "\n",
    "3. **Patient-Generated Data**: This is health-related data that is created, recorded, or gathered by individuals, family members, or caregivers to help address a health concern. This can include data from wearable devices (like heart rate or step count), self-reported symptoms, or health diary entries.\n",
    "\n",
    "4. **Social Determinants of Health Data**: This includes data about the conditions in which people are born, grow, live, work, and age. Factors such as socioeconomic status, education, neighborhood and physical environment, employment, and social support networks, as well as access to health care can influence a wide range of health outcomes.\n",
    "\n",
    "5. **Claims and Cost Data**: This includes data from health insurance claims, which can provide information on patient diagnoses, procedures, medications, and the cost of care.\n",
    "\n",
    "6. **Pharmaceutical and Research Data**: This includes data from clinical trials, drug research, and other pharmaceutical research. This data is crucial for the development of new treatments and therapies."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Health data can be used for a variety of purposes, such as improving patient care, conducting medical research, informing public health initiatives, and guiding health policy decisions. However, it's important to handle health data responsibly due to privacy and security concerns."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style=\"color:#0096FF\">Planning a health data science project</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Health data science is a rapidly evolving field that leverages the power of data to improve healthcare outcomes. The planning of a health data science project involves several critical steps, including, among other steps and not necessarli in any particular order, defining a research question or questions, developing a research protocol, securing funding, obtaining ethical approval if required, identifying data sources, capturing and wrangling data (that is cleaning up the data, tarnsofrming it into a relevent format, and verifying the integrity of the data), analyzing data, reporting results, and disseminating or communicating the findings."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Research Questions</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first step in planning a health data science project is defining the research question. This question should be specific, measurable, achievable, relevant, and time-bound. This spells the well-know acronymn SMART. A resaerch question should address a gap in the current knowledge or offer a novel approach to a known problem. The research question guides the entire project and influences the choice of data sources, the design of data collection tools, and the selection of data analysis methods."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The topic is the first lecture of the course Research Methods Foundation at the Milken Institute School of Public Health."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Research Protocol</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the research question is defined, the next step is to develop a research protocol. This document outlines the project's objectives, methodology, and timeline. It includes details about the study design, the data to be collected, the data analysis plan, and the expected outcomes. The research protocol serves as a roadmap for the project, guiding its implementation and helping to ensure that the research is conducted systematically and ethically."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A template for such a protocol is included in this course. In general, the following serves as a guideline for inclusion in a research protocol.\n",
    "\n",
    "1. **Title**: A clear and concise title that accurately reflects the nature of the study.\n",
    "\n",
    "2. **Background and Rationale**: This section should provide a brief overview of what is known about the research topic, identify gaps in the current knowledge, and explain why the research is needed.\n",
    "\n",
    "3. **Objectives**: Clearly defined primary and secondary objectives of the study. These should be specific, measurable, achievable, relevant, and time-bound (SMART).\n",
    "\n",
    "4. **Study Design**: A description of the study design (e.g., randomized controlled trial, cohort study, case-control study, cross-sectional study, etc.), including the rationale for choosing this design.\n",
    "\n",
    "5. **Study Population and Sampling**: Detailed information about the study population, including inclusion and exclusion criteria, and the method of participant recruitment and selection.\n",
    "\n",
    "6. **Data Collection Methods**: A description of how data will be collected, including the type of data (e.g., demographic data, clinical data, survey responses), the data collection instruments (e.g., questionnaires, medical records), and the procedures for data collection.\n",
    "\n",
    "7. **Data Analysis Plan**: A detailed plan for how the data will be analyzed, including the statistical methods that will be used.\n",
    "\n",
    "8. **Ethical Considerations**: A discussion of the ethical issues related to the study, including how participant consent will be obtained, how participant confidentiality will be protected, and how potential risks and benefits will be balanced.\n",
    "\n",
    "9. **Timeline**: An estimated timeline for the different stages of the research project, from participant recruitment to data analysis and reporting.\n",
    "\n",
    "10. **Budget**: An estimated budget for the research project, including the costs of personnel, data collection, data analysis, and dissemination of results.\n",
    "\n",
    "11. **Dissemination Plan**: A plan for how the results of the research will be disseminated, including potential journals for publication and conferences for presentation.\n",
    "\n",
    "12. **References**: A list of references for all sources cited in the protocol."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A well-written research protocol is crucial for the success of a healthcare research project. It provides a roadmap for the project, ensures that the research is conducted systematically and ethically, and facilitates communication about the project with stakeholders, funding bodies, and ethical review boards."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Funding</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Securing funding is a crucial step in the planning process. Funding can come from various sources, including government agencies, non-profit organizations, and private companies. The funding proposal should clearly articulate the project's significance, objectives, methodology, and potential impact. It should also include a detailed budget that outlines the project's expected costs."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are numerous organizations in the United States that provide funding for healthcare research projects. Some of the most prominent organizations are listed below.\n",
    "\n",
    "1. **National Institutes of Health (NIH)**: The NIH is the largest public funder of biomedical research in the world. It provides funding for research that aims to enhance health, lengthen life, and reduce illness and disability.\n",
    "\n",
    "2. **Centers for Disease Control and Prevention (CDC)**: The CDC provides funding for a wide range of health research, particularly in areas related to public health and disease prevention.\n",
    "\n",
    "3. **Patient-Centered Outcomes Research Institute (PCORI)**: PCORI funds research that can help patients and those who care for them make better-informed decisions about the healthcare choices.\n",
    "\n",
    "4. **Agency for Healthcare Research and Quality (AHRQ)**: AHRQ provides funding for research that aims to improve the quality, safety, efficiency, and effectiveness of healthcare.\n",
    "\n",
    "5. **Robert Wood Johnson Foundation (RWJF)**: RWJF is the nation's largest philanthropy dedicated solely to health. It provides funding for research and initiatives to help everyone in America have an equal opportunity to live the healthiest life possible.\n",
    "\n",
    "6. **Bill & Melinda Gates Foundation**: While much of its focus is on global health, the Gates Foundation also funds research in the U.S. that addresses health inequities and improves access to healthcare services.\n",
    "\n",
    "7. **American Cancer Society (ACS)**: ACS provides funding for a wide range of research projects aimed at understanding and treating cancer.\n",
    "\n",
    "8. **American Heart Association (AHA)**: AHA funds research related to cardiovascular disease and stroke.\n",
    "\n",
    "9. **Susan G. Komen Foundation**: This foundation provides funding for research focused on breast cancer.\n",
    "\n",
    "10. **Pharmaceutical Companies**: Many pharmaceutical companies have grant programs that fund research related to their therapeutic areas of interest."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are just a few examples. The specific organization that researchers might approach for funding would depend on the nature of their project and the alignment with the funding organization's priorities and interests."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Ethical Review Boards</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before most project can begin, they must be reviewed and approved by an ethical review board. This board ensures that the project complies with ethical guidelines and that the rights and welfare of the participants are protected. The review process involves submitting an application that details the project's objectives, methodology, potential risks and benefits, and measures to protect participant confidentiality and privacy."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Institutional Review Boards (IRBs), also known as ethical review boards, play a pivotal role in ensuring the ethical conduct of research involving human subjects. Their primary responsibility is to protect the rights, welfare, and well-being of research participants."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "IRBs originated in response to historical abuses in human subjects research, such as the infamous Tuskegee Syphilis Study. Today, they serve as an essential checkpoint in the research process, particularly in biomedical and behavioral research."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An IRB is typically composed of a diverse group of individuals, including scientists, non-scientists, and community members. This diversity ensures a comprehensive review of research protocols from various perspectives. Scientists contribute their technical expertise, non-scientists provide a lay perspective, and community members represent the interests and values of the community from which research participants may be drawn."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before a research study involving human subjects can commence, the study protocol must be reviewed and approved by an IRB. The IRB reviews the protocol to ensure that the study is designed to minimize potential harm to participants, that risks are outweighed by potential benefits, and that participants will be selected in a fair manner. They also ensure that participants will give informed consent, meaning they will be adequately informed about the study's purpose, procedures, risks, benefits, alternatives, and their rights as participants."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Informed consent is a cornerstone of ethical research. It respects individual autonomy by ensuring that participants voluntarily agree to participate in research, fully understanding what participation entails. The IRB reviews the informed consent documents and procedures to ensure they are appropriate and comprehensible to potential participants."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "IRBs also conduct ongoing reviews of approved studies to ensure they continue to meet ethical standards. They can require modifications to study protocols, suspend studies that are not being conducted in accordance with the approved protocol, or terminate studies that have been associated with unexpected serious harm to participants."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "IRBs play a critical role in upholding ethical standards in research involving human subjects. They protect the rights and welfare of research participants, ensure informed consent, and promote ethically sound research practices, thereby fostering public trust in research."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Sources of Data</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Identifying appropriate sources of data is a key step in the planning process. Data can come from various sources, including electronic health records, health insurance claims, patient surveys, wearable devices, and public health databases. The choice of data sources depends on the research question and the available resources."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Open data is data that has been deindentified and made freely available to the public. It can be used for a variety of purposes, including research, education, and innovation. Open data is often used in health data science projects because it is readily available and can be used to answer a wide range of research questions."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several sources of open healthcare data available in the United States. The list below provides a brief overview of some of the most prominent sources.\n",
    "\n",
    "1. **HealthData.gov**: This is the U.S. government's principal health data repository. It provides access to over a thousand datasets on a wide range of health topics, including healthcare quality, health outcomes, medical devices, and more.\n",
    "\n",
    "2. **Centers for Medicare & Medicaid Services (CMS)**: CMS provides access to a variety of datasets related to Medicare and Medicaid services, including data on utilization, payment, and quality of care.\n",
    "\n",
    "3. **National Center for Health Statistics (NCHS)**: NCHS is a part of the Centers for Disease Control and Prevention (CDC) and provides statistical information that guides actions and policies to improve the health of the American people.\n",
    "\n",
    "4. **Behavioral Risk Factor Surveillance System (BRFSS)**: The BRFSS is a system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services.\n",
    "\n",
    "5. **ClinicalTrials.gov**: This is a database of privately and publicly funded clinical studies conducted around the world. It provides information about the purpose of each trial, who may participate, locations, and phone numbers for more details.\n",
    "\n",
    "6. **The National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) Program**: The SEER Program provides information on cancer statistics in an effort to reduce the cancer burden among the U.S. population.\n",
    "\n",
    "7. **The National Health and Nutrition Examination Survey (NHANES)**: NHANES is a program of studies designed to assess the health and nutritional status of adults and children in the United States.\n",
    "\n",
    "8. **FDA's OpenFDA**: OpenFDA provides APIs and datasets for drug, device, and food-related data."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These sources provide a wealth of information that can be used for a variety of purposes, from academic research to policy development to public health initiatives. However, it's important to note that while these datasets are publicly available, they must still be used responsibly and in accordance with any applicable terms of use."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Tools for Data Capture</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Depending on the data sources, different tools may be used for data capture. These can range from electronic data capture systems for collecting data from electronic health records, to survey software for administering patient surveys, to application programming interfaces or APIs for accessing data from wearable devices or public health databases."
   ]
  },
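  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of API-based data capture, the code below builds a query URL for the openFDA drug adverse event endpoint using only the Python standard library. The helper name `build_openfda_url`, the example search term, and the record limit are illustrative assumptions and not part of the course materials. The openFDA documentation should be consulted for the current endpoints, searchable fields, and rate limits. No request is sent here. The commented lines show one way the URL could be fetched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from urllib.parse import urlencode\n",
    "\n",
    "# Base URL of the openFDA drug adverse event endpoint\n",
    "OPENFDA_DRUG_EVENT_URL = 'https://api.fda.gov/drug/event.json'\n",
    "\n",
    "\n",
    "def build_openfda_url(search, limit=5):\n",
    "    # Hypothetical helper: encode the query parameters into a full URL\n",
    "    query = urlencode({'search': search, 'limit': limit})\n",
    "    return f'{OPENFDA_DRUG_EVENT_URL}?{query}'\n",
    "\n",
    "\n",
    "# Example search term (an assumption chosen for illustration)\n",
    "url = build_openfda_url('patient.drug.medicinalproduct:aspirin', limit=5)\n",
    "print(url)\n",
    "\n",
    "# To download the records, uncomment the lines below (needs internet access)\n",
    "# import json\n",
    "# from urllib.request import urlopen\n",
    "# with urlopen(url) as response:\n",
    "#     records = json.load(response)['results']"
   ]
  },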
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Still the two most common tools for the manual capturing of data are spreadsheets and databases. Spreadsheets are a popular choice because they are easy to use and can be used for a wide range of tasks, from data entry to data analysis. Databases are another common choice because they offer more advanced features, such as data validation and data integrity checks."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A spreadsheet is a file made of rows and columns that help sort, organize, and arrange data efficiently. It's a type of software application that enables users to store, manipulate, and analyze data in tabular form."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each cell within the grid of a spreadsheet can contain a number, text, or a formula. A formula is a command inserted into a cell that carries out calculations using the data in other cells. This feature makes spreadsheets particularly useful for performing complex calculations and data analysis."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Spreadsheets are widely used in various fields such as business, finance, accounting, and science for tasks like financial analysis, budgeting, project management, data analysis, and record keeping. The most commonly used spreadsheet program is Microsoft Excel, but there are many other programs available, such as Google Sheets and Apple's Numbers."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A database is a structured set of data. It is an organized collection of information that can easily be accessed, managed, and updated. Databases can store data about people, products, orders, or anything else of interest to an individual or an organization."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Databases are typically managed by a Database Management System (DBMS), which provides users with the tools to add, edit, and delete data, generate reports, and perform other operations. There are different types of databases, such as relational databases, object-oriented databases, hierarchical databases, and network databases, each with their own structure and type of DBMS."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Examples of database software, or DBMS, include the following.\n",
    "\n",
    "1. **Oracle Database**: A popular relational DBMS with a wide range of capabilities for small to large enterprises.\n",
    "\n",
    "2. **MySQL**: An open-source relational DBMS that is widely used for web databases.\n",
    "\n",
    "3. **Microsoft SQL Server**: A relational DBMS with a variety of editions for different needs, from small applications to large enterprise solutions.\n",
    "\n",
    "4. **PostgreSQL**: An open-source object-relational DBMS that supports a wide variety of data types and has strong compliance with the SQL standard.\n",
    "\n",
    "5. **MongoDB**: A leading NoSQL database, which is document-oriented, meaning it stores data in a semi-structured format known as BSON (a binary representation of JSON-like documents).\n",
    "\n",
    "6. **SQLite**: A self-contained, serverless, and zero-configuration database engine used in embedded systems and small to medium web and desktop applications.\n",
    "\n",
    "7. **IBM Db2**: A family of hybrid data management solutions designed to provide a robust set of capabilities to handle data and analytics."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are just a few examples of the many database software options available. The choice of database software depends on the specific needs and resources of the user or organization."
   ]
  },
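  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea of data validation and integrity checks concrete, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. The table name `patients`, its columns, and the allowed ranges are hypothetical choices for illustration only. The `CHECK` constraints let the database reject an implausible record at the moment of capture, which a plain spreadsheet would accept silently unless validation rules are set up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "\n",
    "# In-memory database, so nothing is written to disk\n",
    "conn = sqlite3.connect(':memory:')\n",
    "\n",
    "# Hypothetical table: the constraints encode simple validation rules\n",
    "conn.execute('''\n",
    "    CREATE TABLE patients (\n",
    "        patient_id INTEGER PRIMARY KEY,\n",
    "        age INTEGER NOT NULL CHECK (age BETWEEN 0 AND 120),\n",
    "        systolic_bp INTEGER NOT NULL CHECK (systolic_bp > 0)\n",
    "    )\n",
    "''')\n",
    "\n",
    "# A valid record is accepted\n",
    "conn.execute('INSERT INTO patients (age, systolic_bp) VALUES (?, ?)', (54, 128))\n",
    "\n",
    "# An implausible age violates the CHECK constraint and is rejected\n",
    "try:\n",
    "    conn.execute('INSERT INTO patients (age, systolic_bp) VALUES (?, ?)', (430, 120))\n",
    "except sqlite3.IntegrityError as error:\n",
    "    print('Rejected:', error)\n",
    "\n",
    "# Only the valid record should have been stored\n",
    "n_rows = conn.execute('SELECT COUNT(*) FROM patients').fetchone()[0]\n",
    "print('Rows stored:', n_rows)\n",
    "conn.close()"
   ]
  },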
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Data Wrangling</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the data is captured, it often needs to be cleaned and transformed in a process known as data wrangling. This involves dealing with missing or inconsistent data, removing outliers, and converting data into a format suitable for analysis. Data wrangling is a critical step that can significantly impact the quality of the analysis and the validity of the results. Data wrangling is often the most time-consuming part of a health data science project."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the era of big data, organizations and institutions across various sectors are inundated with vast amounts of data. However, this data often comes in a raw, unstructured, or semi-structured format that is not immediately suitable for analysis."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of the important steps in data wrangling are listed below.\n",
    "\n",
    "1. **Data Discovery**: The first step in data wrangling involves understanding the data you have. This includes identifying the data types, the structure of the data, and any potential issues that might affect the quality of the data.\n",
    "\n",
    "2. **Data Structuring**: Raw data often comes in a format that is not suitable for analysis. Structuring involves transforming the data into a format that is easier to work with. This could involve reshaping the data, combining multiple datasets, or converting the data into a different format.\n",
    "\n",
    "3. **Data Cleaning**: This step involves identifying and correcting errors in the data, such as missing values, inconsistent entries, or outliers. Data cleaning is crucial for ensuring the accuracy of the subsequent analysis.\n",
    "\n",
    "4. **Data Enriching**: Enrichment involves adding new data or variables to the existing dataset to enhance the analysis. This could involve adding demographic data, calculating new variables, or integrating data from different sources.\n",
    "\n",
    "5. **Data Validating**: The final step in data wrangling is validating the dataset. This involves checking the data for consistency and accuracy, and ensuring that it meets the requirements of the subsequent analysis."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are various tools available for data wrangling, including programming languages like Python and R, which have packages specifically designed for data wrangling. These tools can help automate many of the tasks involved in data wrangling, making the process more efficient."
   ]
  },
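  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The wrangling steps listed above can be sketched with the `pandas` package. The data below is invented for illustration, and the column names, the median imputation, and the validation rules are assumptions rather than recommendations. The sketch touches on discovery, cleaning, enriching, and validating. Restructuring is not needed for a table this small."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical raw data with a duplicate row, inconsistent labels, and a missing value\n",
    "raw = pd.DataFrame({\n",
    "    'patient_id': [1, 2, 2, 3, 4],\n",
    "    'smoker': ['yes', 'Yes', 'Yes', 'NO', 'no'],\n",
    "    'weight_kg': [70.0, 82.5, 82.5, None, 95.0],\n",
    "    'height_cm': [165, 180, 180, 158, 172],\n",
    "})\n",
    "\n",
    "# Discovery: inspect the data types and count the missing values\n",
    "print(raw.dtypes)\n",
    "print(raw.isna().sum())\n",
    "\n",
    "# Cleaning: drop exact duplicates and harmonize the category labels\n",
    "clean = raw.drop_duplicates().copy()\n",
    "clean['smoker'] = clean['smoker'].str.lower()\n",
    "\n",
    "# Cleaning: impute the missing weight with the median (one of several options)\n",
    "clean['weight_kg'] = clean['weight_kg'].fillna(clean['weight_kg'].median())\n",
    "\n",
    "# Enriching: derive the body mass index from weight and height\n",
    "clean['bmi'] = clean['weight_kg'] / (clean['height_cm'] / 100) ** 2\n",
    "\n",
    "# Validating: confirm the properties that the analysis will rely on\n",
    "assert clean['patient_id'].is_unique\n",
    "assert clean['smoker'].isin(['yes', 'no']).all()\n",
    "assert clean['weight_kg'].notna().all()\n",
    "print(clean)"
   ]
  },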
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Data Analysis</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another crucuial step is to analyze the data. This involves selecting appropriate statistical or machine learning methods to answer the research question. The choice of methods depends on the nature of the data and the research question. The analysis may involve descriptive statistics, inferential statistics, predictive modeling, or other techniques."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Crucial to any data analysis is exploratory data analysis (EDA). This is a first exploration of the information contained within the data. EDA is indeed a critical step in the data analysis pipeline. It is a philosophy or an approach towards understanding data and involves a variety of techniques to summarize, visualize, and interpret data. The primary goal of EDA is to explore the data to uncover underlying structures, extract important variables, detect anomalies and outliers, test underlying assumptions, and develop simple models."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "EDA comprises several steps. Some of these include the following.\n",
    "\n",
    "1. __Summary Statistics__: EDA involves generating summary statistics for the data. This includes measures of central tendency like mean, median, and mode, measures of dispersion like range, variance, and standard deviation, and measures of shape like skewness and kurtosis.\n",
    "\n",
    "2. __Visualization__: One of the most powerful tools in EDA is data visualization. Graphical representations of data can reveal patterns, trends, and relationships that are not apparent from summary statistics alone. Common types of visualizations used in EDA include histograms, box plots, scatter plots, and correlation matrices.\n",
    "\n",
    "3. __Identifying Relationships__: EDA involves exploring the relationships between variables. This can be done using correlation coefficients for numerical variables and cross-tabulations or chi-square tests for categorical variables.\n",
    "\n",
    "4. __Checking Assumptions__: Many statistical tests and models rely on certain assumptions about the data. EDA can help check whether these assumptions are met. For example, a Q-Q plot can be used to check if data is normally distributed."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are various tools and software available for conducting EDA. Programming languages like Python and R have extensive libraries for data manipulation, statistical analysis, and data visualization, making them popular choices for EDA. Spreadsheet software like Microsoft Excel and Google Sheets also offer basic tools for EDA."
   ]
  },
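  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A few of the EDA steps described above can be sketched with `pandas`. The data set is invented purely for illustration, and the column names are assumptions. The sketch covers summary statistics, a correlation matrix for the numerical variables, and a cross-tabulation of the categorical variables. Visualization and formal tests are left out to keep the example short."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical data set, invented purely for illustration\n",
    "df = pd.DataFrame({\n",
    "    'age': [34, 45, 52, 61, 29, 70, 48, 55],\n",
    "    'systolic_bp': [118, 125, 134, 142, 115, 150, 130, 138],\n",
    "    'smoker': ['no', 'yes', 'no', 'yes', 'no', 'yes', 'no', 'yes'],\n",
    "    'hypertension': ['no', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'yes'],\n",
    "})\n",
    "\n",
    "# Summary statistics: central tendency, dispersion, and shape\n",
    "numeric = df[['age', 'systolic_bp']]\n",
    "print(numeric.describe())\n",
    "print(numeric.skew())\n",
    "\n",
    "# Relationship between numerical variables: Pearson correlation matrix\n",
    "correlations = numeric.corr()\n",
    "print(correlations)\n",
    "\n",
    "# Relationship between categorical variables: cross-tabulation of counts\n",
    "table = pd.crosstab(df['smoker'], df['hypertension'])\n",
    "print(table)"
   ]
  },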
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The various methods of data analysis will be discussed throughout this course."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Data Reporting</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reporting research results is a crucial part of the scientific process. It involves summarizing the findings of a research study, interpreting the results in the context of the research question, and discussing the implications of the findings. The goal of reporting research results is to communicate the findings to others, allowing them to understand, evaluate, and build upon the research."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Effective reporting of research results typically involves several key components, listed below.\n",
    "\n",
    "1. __Introduction__: This section provides the context for the research, including the research question and the significance of the study.\n",
    "\n",
    "2. __Methods__: This section describes the methodology used in the study, including the study design, data collection methods, and data analysis techniques. This allows others to evaluate the appropriateness of the methods and to replicate the study.\n",
    "\n",
    "3. __Results__: This section presents the findings of the study. This typically involves a combination of text, tables, and figures to summarize the data and highlight the key findings.\n",
    "\n",
    "4. __Discussion__: This section interprets the results in the context of the research question and the existing literature. It discusses the implications of the findings, the limitations of the study, and potential directions for future research.\n",
    "\n",
    "5. __Conclusion__: This section summarizes the key findings and their implications. It provides a clear answer to the research question and highlights the contribution of the study to the field."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To ensure the quality and transparency of research reporting, several guidelines have been developed. These include the following.\n",
    "\n",
    "1. The __CONSORT guidelines for randomized trials__. More can be found at [Guidelines for reporting outcomes in trial reprots](http://doi.org/10.1001/jama.2022.21022).\n",
    "\n",
    "2. The __STROBE guidelines for observational studies__. More can be found at the [STROBE](https://www.strobe-statement.org) website.\n",
    "\n",
    "3. The __PRISMA guidelines for systematic reviews and meta-analyses__. More can be found at the [PRISMA](http://www.prisma-statement.org/?AspxAutoDetectCookieSupport=1) website."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These guidelines provide a checklist of items that should be included in the research report to ensure comprehensive and transparent reporting."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Dissemination and Communication of Results</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dissemination and communication of the results can involve publishing the findings in a peer-reviewed journal, presenting the results at a conference or sharing the findings with relevant stakeholders. The goal is to ensure that the knowledge gained from the project is shared widely and can be used to inform decision-making in healthcare."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Effective communication is crucial at this stage. The results should be presented in a way that is accessible to the intended audience, whether they are healthcare professionals, policymakers, patients, or the general public. This may involve using visualizations to illustrate the findings, translating technical terms into plain language, and highlighting the key takeaways."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In conclusion, planning a health data science project involves a series of interconnected steps, each of which plays a crucial role in the project's success. By carefully defining the research question, developing a detailed research protocol, securing funding, obtaining ethical approval, identifying appropriate data sources, capturing and wrangling data, analyzing data, reporting results, and effectively disseminating findings, researchers can leverage the power of data to generate insights that improve healthcare outcomes."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style=\"color:#0096FF\">Conclusion</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In conclusion, healthcare data science stands at the intersection of multiple disciplines, leveraging the power of data, statistical methods, and advanced computational tools to generate insights that can improve patient care, enhance health outcomes, and inform policy decisions. The field has seen significant growth in recent years, driven by the proliferation of digital health data, advancements in technology, and a growing recognition of the potential of data-driven approaches in healthcare."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From predicting disease outbreaks to personalizing treatment plans, healthcare data science has the potential to revolutionize the way we understand and approach health and disease. However, it also presents new challenges, particularly in terms of data privacy, security, and ethical use of data. As the field continues to evolve, it will be crucial to address these challenges and ensure that healthcare data science is used responsibly and effectively for the benefit of all."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Understanding the basics of healthcare data science can provide valuable insights into this exciting and rapidly evolving field. As we continue to generate and collect more health-related data than ever before, the role of data science in healthcare will only become more important. The future of healthcare is data-driven, and healthcare data science will be at the forefront of this transformation."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "datascience",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}