{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <span style=\"color:#ffa500\">4 | SPREADSHEETS AND DATABASES</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dct=\"http://purl.org/dc/terms/\"><span property=\"dct:title\">This chapter of an Introduction to Health Data Science</span> by <span property=\"cc:attributionName\">Dr JH Klopper</span> is licensed under <a href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/?ref=chooser-v1\" target=\"_blank\" rel=\"license noopener noreferrer\" style=\"display:inline-block;\">Attribution-NonCommercial-NoDerivatives 4.0 International<img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/nc.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/nd.svg?ref=chooser-v1\"></a></p>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Introduction</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In healthcare, the management and analysis of health data are critical for improving patient care, conducting research, and informing policy decisions. Two types of software in particular have become indispensable in this regard: spreadsheet software and database software. Both offer capabilities for storing, organizing, and analyzing health data, but they have distinct characteristics that make them suitable for different purposes."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Spreadsheet software, such as Microsoft Excel or Google Sheets, provides a user-friendly interface for data entry and manipulation. It allows users to organize data in rows and columns, perform calculations, create charts, and apply various data analysis tools. For small datasets and simple analyses, spreadsheet software can be a quick and intuitive solution. However, it may not be ideal for managing large or complex datasets due to limitations in data capacity, performance, and data integrity controls."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On the other hand, database software, such as MySQL, Oracle, or Microsoft SQL Server, is designed to handle larger and more complex datasets. Databases can store vast amounts of data, maintain data integrity, and provide powerful tools for data querying and reporting. They allow for the creation of relationships between different data elements, which can be crucial in healthcare settings where patient data may be spread across multiple tables or databases."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both spreadsheet and database software play vital roles in health data science. The choice between them depends on the size and complexity of the dataset, the required tasks, and the user's technical expertise. Understanding their capabilities and limitations is key to leveraging these tools effectively in the ever-evolving landscape of health data."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Spreadsheets</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This section provides a short introduction to the use of spreadsheet software for health data management."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### <span style=\"color:#FFD700\">Spreadsheet Basics</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A spreadsheet is a computer program that allows users to organize data in rows and columns. It is often used for storing, manipulating, and analyzing data in tabular form. The most popular spreadsheet software is Microsoft Excel, but there are many other options available, such as Google Sheets, LibreOffice Calc, and Apple Numbers."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A spreadsheet consists of a grid of cells arranged in rows and columns. Each cell can contain a value, a formula, or a function. Values can be numbers, text, or dates. Formulas are mathematical expressions that perform calculations on values in other cells. Functions are predefined formulas that perform specific tasks, such as calculating the sum of a range of cells or finding the average of a set of values."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following figure shows an example of a spreadsheet in Microsoft Excel. It contains the age, diastolic blood pressure, and systolic blood pressure of participants across three interventions.\n",
    "\n",
    "![Spreadsheet](Spreadsheet.png)\n",
    "\n",
    "*Figure 4.1: A spreadsheet in Microsoft Excel*"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### <span style=\"color:#FFD700\">Data Entry</span> "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the main uses of spreadsheets is data entry. They provide a user-friendly interface for entering data into a table format. Users can enter data manually or import it from other sources, such as text files or databases. Spreadsheets also allow users to edit and format data in various ways, such as changing the font size or color, adding borders around cells, or applying conditional formatting rules."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following figure shows an example of data entry in Microsoft Excel. The data is entered manually into a table format. The entries for the systolic blood pressure values that are in excess of $120$ are shaded red.\n",
    "\n",
    "![Data Entry](RedFormat.png)\n",
    "\n",
    "*Figure 4.2: Data entry in Microsoft Excel*"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### <span style=\"color:#FFD700\">Data Manipulation</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another common use of spreadsheets is data manipulation. They provide tools for sorting, filtering, and summarizing data in various ways. Users can sort data by one or more columns, filter data based on specific criteria, or summarize data using formulas such as SUM, AVERAGE, or COUNT."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following figure shows an example of data manipulation in Microsoft Excel. The data is sorted by the age of the participants in ascending order and summarized using the AVERAGE function to calculate the mean of each variable.\n",
    "\n",
    "![Data Manipulation](Average.png)\n",
    "\n",
    "*Figure 4.3: Data manipulation in Microsoft Excel*"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### <span style=\"color:#FFD700\">Data Analysis</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Spreadsheets also provide tools for analyzing data in various ways. Users can perform calculations on data using formulas or functions, create charts to visualize data, or apply various data analysis tools such as pivot tables or conditional formatting rules."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following figure shows an example of data analysis in Microsoft Excel. The data is analyzed using a pivot table to summarize the numerical variable by the classes of the Group (categorical) variable.\n",
    "\n",
    "![Data Analysis](Pivot.png)\n",
    "\n",
    "*Figure 4.4: Data analysis in Microsoft Excel*"
   ]
  },
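  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The grouping step behind a pivot table can be sketched in a few lines of Python. The records below are made up for illustration and are not the values shown in the figure; only the idea of summarizing a numerical variable per class of a categorical variable carries over.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "from statistics import mean\n",
    "\n",
    "# Hypothetical records: (group, systolic blood pressure).\n",
    "records = [('A', 128), ('B', 117), ('A', 135), ('C', 122), ('B', 121)]\n",
    "\n",
    "# Collect the values of the numerical variable per class of the categorical variable.\n",
    "by_group = defaultdict(list)\n",
    "for group, systolic in records:\n",
    "    by_group[group].append(systolic)\n",
    "\n",
    "# One summary value per group, similar to the rows of a pivot table.\n",
    "group_means = {group: mean(values) for group, values in sorted(by_group.items())}\n",
    "print(group_means)\n",
    "```\n",
    "\n",
    "Each key of `group_means` plays the role of a row label in the pivot table, and each value is the summary for that group."
   ]
  },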
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### <span style=\"color:#FFD700\">Data Visualization</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, spreadsheets provide tools for visualizing data in various ways. Users can create charts to visualize data in a graphical format, such as bar charts or pie charts. They can also apply conditional formatting rules to highlight specific data points or trends in the data."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following figure shows an example of data visualization in Microsoft Excel. A chart is created to visualize the correlation between the age and diastolic blood pressure values.\n",
    "\n",
    "![Data Visualization](Scatter.png)\n",
    "\n",
    "*Figure 4.5: Data visualization in Microsoft Excel*"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While data entry and manipulation can be fairly straightforward using spreadsheet software, consideration must be given to the _layout of the data_. The key concept here is that of tidy data."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Tidy data</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data is at the heart of any analytical task, be it simple data exploration or building complex predictive models. However, raw data is often messy and not directly suitable for analysis. A frequently cited estimate is that data scientists spend as much as 80% of their time cleaning and preprocessing data, which makes it a crucial step in the data analysis process. One key aspect of data preprocessing is transforming the data into a 'tidy' format."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The concept of tidy data was introduced by statistician Hadley Wickham in the paper [Tidy Data](https://www.jstatsoft.org/article/view/v059i10), with the aim of providing a standard way to organize data values within a dataset. According to Wickham, tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is tidy when each variable forms a column, each observation forms a row, and each type of observational unit forms a table. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tidy data framework offers several benefits. For one, it makes data cleaning and preprocessing more systematic and less error-prone. Moreover, tidy data structure is optimized for vectorized programming languages like R or Python, allowing for efficient code that is easier to write, read, and debug. Lastly, because it is a standard, tidy data allows for the development of reusable tools that can work with many different datasets."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tidy data principles are widely applied in statistical modeling and visualization. Many R packages, for instance, assume data is in a tidy format. Packages like `ggplot2` for visualization, `dplyr` for data manipulation, and `modelr` for modeling are designed to work with tidy data. Therefore, ensuring your data is tidy can make your analysis more efficient and your code more reusable."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tidying data often involves several steps. First, the data needs to be reshaped so that each variable is a column, each observation is a row, and each type of observational unit is a table. This can involve merging multiple datasets, splitting a single dataset into multiple ones, or transposing the dataset."
   ]
  },
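  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the reshaping step, the Python code below converts a small 'wide' table (one column per visit) into a tidy 'long' table (one row per participant and visit). The participants, column names, and values are hypothetical and serve only as an illustration.\n",
    "\n",
    "```python\n",
    "# Hypothetical wide table: one row per participant, one column per visit.\n",
    "wide_rows = [\n",
    "    {'participant': 'P1', 'visit1': 120, 'visit2': 118},\n",
    "    {'participant': 'P2', 'visit1': 135, 'visit2': 130},\n",
    "]\n",
    "\n",
    "def to_tidy(rows, id_column, value_name):\n",
    "    '''Return one row per (participant, visit) pair: a wide-to-long reshape.'''\n",
    "    tidy = []\n",
    "    for row in rows:\n",
    "        for column, value in row.items():\n",
    "            if column == id_column:\n",
    "                continue\n",
    "            tidy.append({id_column: row[id_column], 'visit': column, value_name: value})\n",
    "    return tidy\n",
    "\n",
    "# Each measurement now sits in its own row, with the visit stored as a variable.\n",
    "tidy_rows = to_tidy(wide_rows, 'participant', 'systolic')\n",
    "for row in tidy_rows:\n",
    "    print(row)\n",
    "```\n",
    "\n",
    "Data analysis libraries usually offer this reshape directly, for example `pandas.melt` in Python, so a hand-written loop like this is rarely needed in practice."
   ]
  },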
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, missing values and outliers need to be considered, as these can affect the quality of the analysis. How they are handled depends on the specific context and the nature of the data."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, the variables need to be transformed to make them easier to work with. This can involve scaling numeric variables, converting categorical variables into dummy variables, or creating new variables by combining existing ones."
   ]
  },
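  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two of these transformations, scaling a numeric variable and creating dummy variables, can be sketched as follows. The ages and group labels are hypothetical, and standardization (subtracting the mean and dividing by the standard deviation) is only one of several possible ways to scale a variable.\n",
    "\n",
    "```python\n",
    "from statistics import mean, pstdev\n",
    "\n",
    "# Hypothetical variables for four participants.\n",
    "ages = [54, 61, 47, 58]\n",
    "groups = ['A', 'B', 'A', 'C']\n",
    "\n",
    "# Scaling: subtract the mean and divide by the (population) standard deviation.\n",
    "age_mean, age_sd = mean(ages), pstdev(ages)\n",
    "scaled_ages = [(age - age_mean) / age_sd for age in ages]\n",
    "\n",
    "# Dummy coding: one 0/1 indicator variable per class of the categorical variable.\n",
    "classes = sorted(set(groups))\n",
    "dummies = [{f'group_{c}': int(g == c) for c in classes} for g in groups]\n",
    "\n",
    "print(scaled_ages)\n",
    "print(dummies)\n",
    "```\n",
    "\n",
    "After scaling, the values have a mean of zero, and each participant has exactly one dummy variable set to 1."
   ]
  },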
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tidy data is a powerful concept that simplifies the initial stages of data analysis. By providing a standard way to organize data, it makes data cleaning and preprocessing more efficient and less error-prone. The principles of tidy data are now deeply integrated into many tools for data analysis and provide a solid foundation for any data analytical task. Despite the time and effort required to tidy data, the benefits are worth it, yielding more robust and reliable analytical results. It is best to spend the time when designing a spreadsheet to ensure that the data is tidy, rather than trying to tidy it later."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### <span style=\"color:#FFD700\">File formats</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Spreadsheet software packages save files in their own native file formats. These formats store more than the raw data, such as formatting, formulas, and macros, and these extra features are not always supported by other software. For example, Microsoft Excel saves files in the .xlsx format and LibreOffice Calc in the .ods format, while Google Sheets stores its files online (a .gsheet file on a synced drive is only a link to the online document). The additional information stored in these file formats can cause compatibility issues when importing files into data analysis tools such as R or Python."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most spreadsheet programs can export a spreadsheet as a comma-separated values (CSV) file. Such a file contains only the cell values of a single sheet. Formatting, charts, and macros are dropped, and formulas are replaced by their calculated results. CSV files can be imported into data analysis tools such as R or Python with few compatibility issues."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A CSV file is a simple file format. It has several advantages, particularly when compared to other formats such as XLSX or ODS. Four such advantages are listed below.\n",
    "\n",
    "1. **Universality and Compatibility**: CSV is a simple, plain text format that can be read and written by many programs, including most spreadsheet software (like Excel, Google Sheets, or LibreOffice Calc), many database management systems, and programming languages. This makes it ideal for data exchange between applications, even ones that run on different platforms.\n",
    "\n",
    "2. **Simplicity**: Because it is a text-based file format, you can open, read, and edit CSV files using a plain text editor if needed. Each line of the file corresponds to a row in the table, and commas separate the values (or fields) within each row.\n",
    "\n",
    "3. **Size**: CSV files contain no formatting, formulas, macros, or other extra features, so they carry little overhead beyond the data itself. This makes them lightweight to store and transfer, although compressed formats such as XLSX can still be smaller on disk for large tables.\n",
    "\n",
    "4. **Import and Export**: CSV is often used to import and export data from web and mobile applications, databases, and data analysis tools. It is one of the most commonly supported formats for data import in various software tools.\n",
    "\n",
    "However, keep in mind that while CSV files have these advantages, they do not support features such as cell formatting, formulas, charts, or images that other file formats like XLSX support. Therefore, the choice of format should depend on your specific needs. Also note that other so-called delimited file formats are available such as tab-delimited files (TSV) or pipe-delimited files (PSV). These formats are similar to CSV but use different delimiters (e.g., tabs or pipes) instead of commas to separate values within each row."
   ]
  },
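  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading a CSV file needs nothing more than a standard library in most languages. The sketch below uses Python's built-in `csv` module on a small, hypothetical table held in a string; in practice the text would come from a file exported from a spreadsheet.\n",
    "\n",
    "```python\n",
    "import csv\n",
    "import io\n",
    "\n",
    "# Hypothetical CSV content: a header line followed by one line per row.\n",
    "csv_text = '''Age,Group,Systolic\n",
    "54,A,128\n",
    "61,B,117\n",
    "47,A,135\n",
    "'''\n",
    "\n",
    "# DictReader uses the first line as column names; every value is read as text.\n",
    "reader = csv.DictReader(io.StringIO(csv_text))\n",
    "records = [row for row in reader]\n",
    "\n",
    "# Convert the numeric column explicitly before calculating with it.\n",
    "systolic = [int(row['Systolic']) for row in records]\n",
    "mean_systolic = sum(systolic) / len(systolic)\n",
    "print(len(records), mean_systolic)\n",
    "```\n",
    "\n",
    "Because a CSV file stores no data types, the conversion from text to numbers is the responsibility of the program that reads the file."
   ]
  },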
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Databases</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A health database is a structured collection of health-related data, often managed and stored in an organized manner for easy retrieval and analysis. It can include various types of health information, such as electronic health records (EHRs), insurance claims, pharmaceutical research data, patient demographics, clinical trials data, laboratory results, imaging data, and public health statistics. This data can be both structured (e.g., numerical values, categorical variables) and unstructured (e.g., clinical notes, radiology reports)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Health databases are crucial for a wide range of purposes, including clinical decision support, research, public health reporting, patient care management, and health policy development. They can facilitate interoperability, enhance the quality and safety of healthcare delivery, and provide the foundation for predictive analytics and precision medicine. Databases can help healthcare organizations improve patient care, reduce costs, and increase efficiency. They can also provide valuable insights into the health of populations and inform policy decisions. Moreover, databases are crucial tools for the collection of data for research purposes."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The creation of a database takes some planning and effort, but it can be a worthwhile investment in the long run. The key components of a database are tables and the relationships between them. Tables are used to store data in a structured format, and relationships are used to link data across tables."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Table and Relationships](TablesAndRelationships.svg)\n",
    "\n",
    "*Figure 4.6: Tables and relationships*"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another key concept in database design is normalization."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Database normalization</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Database normalization is a systematic process used in relational database design to organize data to reduce redundancy and improve data integrity. The primary objective of normalization is to eliminate anomalies (insertion, update, and deletion anomalies) that can occur when data is added, modified, or deleted, which can lead to loss of data consistency."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The process of normalization involves dividing larger tables into smaller, less redundant tables and defining relationships between them. The process is guided by a set of rules known as __normal forms__. Six normal forms are commonly described, each with progressively stricter requirements: First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF, also known as Project-Join Normal Form (PJNF))."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While database normalization helps to keep the database design clean, flexible, and efficient, it's important to note that it's not always necessary to achieve the highest level of normalization. Sometimes, denormalization (intentionally adding redundancy to data) can be beneficial for performance in read-heavy applications. It's a matter of balance between performance, complexity, and data integrity."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first three normal forms are discussed below. The remaining three normal forms are beyond the scope of this course."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First Normal Form (1NF) is the initial step towards normalization in database design. A table is said to be in 1NF if it satisfies the following three rules.\n",
    "\n",
    "1. Each table should have a primary key, that is, a unique identifier for each record in the table.\n",
    "\n",
    "2. Each column in the table should contain atomic (indivisible) values, meaning each cell should contain a single value. There should be no repeating groups or arrays.\n",
    "\n",
    "3. Values stored in a column should be of the same domain, meaning they should be of the same type (integer, string, date, etc.).\n",
    "\n",
    "For example, the table below is not in 1NF.\n",
    "\n",
    "| PatientID |     Name     |    Complaints                         |\n",
    "|:---------:|:------------:|:-------------------------------------:|\n",
    "|     1     |   John Doe   |   Headache, runny nose                |\n",
    "|     2     |   Jane Doe   |   Swollen ankles, Shortness of breath |\n",
    "|     3     |   Mary Jane  |   Fever, Runny nose                   |\n",
    "\n",
    "This table is not in 1NF because the \"Complaints\" column contains multiple values in a single cell.\n",
    "\n",
    "To bring this table into first normal form (1NF), it is modified so that each cell in the table contains only atomic (single) values.\n",
    "\n",
    "| PatientID |     Name     |    Complaints            |\n",
    "|:---------:|:------------:|:------------------------:|\n",
    "|     1     |   John Doe   |      Headache            |\n",
    "|     1     |   John Doe   |      Runny nose          |\n",
    "|     2     |   Jane Doe   |      Swollen ankles      |\n",
    "|     2     |   Jane Doe   |      Shortness of breath |\n",
    "|     3     |   Mary Jane  |      Fever               |\n",
    "|     3     |   Mary Jane  |      Runny nose          |\n",
    "\n",
    "In the normalized version, each row records a single complaint for a single patient and each cell contains a single value, making the table compliant with 1NF. Because PatientID now repeats, it no longer identifies a row on its own; the combination of PatientID and Complaints serves as the primary key."
   ]
  },
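  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The conversion to 1NF shown in the tables above can be sketched in Python by splitting each multi-valued cell into separate rows. The variable names are illustrative only, and the sketch assumes that a comma always separates two complaints.\n",
    "\n",
    "```python\n",
    "# The table that is not in 1NF: the complaints cell holds several values.\n",
    "patients = [\n",
    "    (1, 'John Doe', 'Headache, runny nose'),\n",
    "    (2, 'Jane Doe', 'Swollen ankles, Shortness of breath'),\n",
    "    (3, 'Mary Jane', 'Fever, Runny nose'),\n",
    "]\n",
    "\n",
    "# One output row per single complaint, so that every cell is atomic.\n",
    "# strip() removes the space after the comma; capitalize() stores\n",
    "# 'runny nose' and 'Runny nose' as the same value.\n",
    "first_normal_form = [\n",
    "    (patient_id, name, complaint.strip().capitalize())\n",
    "    for patient_id, name, complaints in patients\n",
    "    for complaint in complaints.split(',')\n",
    "]\n",
    "\n",
    "for row in first_normal_form:\n",
    "    print(row)\n",
    "```\n",
    "\n",
    "The six resulting rows match the second table, with one complaint per row."
   ]
  },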
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Second Normal Form (2NF) is a level of database normalization that extends the First Normal Form (1NF) by ensuring that all non-prime attributes (attributes not part of the primary key) in the table are fully functionally dependent on the primary key.\n",
    "\n",
    "A table is said to be in 2NF if it adheres to the following two rules.\n",
    "\n",
    "1. The table is in 1NF.\n",
    "2. All non-prime attributes in the table must depend on the whole of a candidate key, not just part of it.\n",
    "\n",
    "For example, consider the following table:\n",
    "\n",
    "| StudentID | Subject   | Lecturer      |\n",
    "|:---------:|:---------:|:-------------:|\n",
    "| 1         | Math      | Prof. Johnson |\n",
    "| 1         | Science   | Prof. Smith   |\n",
    "| 2         | Math      | Prof. Johnson |\n",
    "| 3         | Science   | Prof. Smith   |\n",
    "\n",
    "It is assumed that the combination of StudentID and Subject forms a composite primary key. The Lecturer column depends on the Subject column, but not on the whole primary key (it doesn't depend on StudentID). This violates the rules of 2NF.\n",
    "\n",
    "To bring this table into 2NF, it is split into two tables.\n",
    "\n",
    "**Student_Subject Table:**\n",
    "\n",
    "| StudentID | Subject |\n",
    "|:---------:|:-------:|\n",
    "| 1         | Math    |\n",
    "| 1         | Science |\n",
    "| 2         | Math    |\n",
    "| 3         | Science |\n",
    "\n",
    "**Subject_Lecturer Table:**\n",
    "\n",
    "| Subject   | Lecturer      |\n",
    "|:---------:|:-------------:|\n",
    "| Math      | Prof. Johnson |\n",
    "| Science   | Prof. Smith   |\n",
    "\n",
    "Now, in the Student_Subject table, each subject chosen by the student forms a unique record. In the Subject_Lecturer table, each subject has a unique lecturer assigned. Both tables are now in 2NF because every non-prime attribute (in this case, only Lecturer) is fully functionally dependent on the primary key."
   ]
  },
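  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the two 2NF tables, using the SQLite database engine that ships with Python, is shown below. The lower-case table and column names are this sketch's own choice. The join at the end shows that no information is lost by the split, since the original three-column table can be rebuilt on demand.\n",
    "\n",
    "```python\n",
    "import sqlite3\n",
    "\n",
    "# An in-memory database that disappears when the connection is closed.\n",
    "conn = sqlite3.connect(':memory:')\n",
    "conn.executescript('''\n",
    "CREATE TABLE subject_lecturer (\n",
    "    subject  TEXT PRIMARY KEY,\n",
    "    lecturer TEXT NOT NULL\n",
    ");\n",
    "CREATE TABLE student_subject (\n",
    "    student_id INTEGER,\n",
    "    subject    TEXT REFERENCES subject_lecturer(subject),\n",
    "    PRIMARY KEY (student_id, subject)\n",
    ");\n",
    "INSERT INTO subject_lecturer VALUES ('Math', 'Prof. Johnson'), ('Science', 'Prof. Smith');\n",
    "INSERT INTO student_subject VALUES (1, 'Math'), (1, 'Science'), (2, 'Math'), (3, 'Science');\n",
    "''')\n",
    "\n",
    "# A join rebuilds the original three-column table when it is needed.\n",
    "rows = conn.execute('''\n",
    "    SELECT ss.student_id, ss.subject, sl.lecturer\n",
    "    FROM student_subject AS ss\n",
    "    JOIN subject_lecturer AS sl ON sl.subject = ss.subject\n",
    "    ORDER BY ss.student_id, ss.subject\n",
    "''').fetchall()\n",
    "print(rows)\n",
    "conn.close()\n",
    "```\n",
    "\n",
    "Each lecturer is now stored once per subject instead of once per enrolled student."
   ]
  },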
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Third Normal Form (3NF) is a level of database normalization that extends the Second Normal Form (2NF) by ensuring that all non-prime attributes in the table are not transitively dependent on the primary key. In other words, no non-prime attribute should depend on another non-prime attribute.\n",
    "\n",
    "A table is in 3NF if it adheres to the following two rules.\n",
    "\n",
    "1. The table is in 2NF.\n",
    "2. Every non-prime attribute is non-transitively dependent on every candidate key in the table.\n",
    "\n",
    "An example is shown in the table below.\n",
    "\n",
    "**Student Table:**\n",
    "\n",
    "| StudentID | CourseID | CourseName  |\n",
    "|:---------:|:--------:|:-----------:|\n",
    "| 1         | C1       | Math        |\n",
    "| 2         | C2       | Science     |\n",
    "| 3         | C3       | History     |\n",
    "| 4         | C1       | Math        |\n",
    "\n",
    "Here, the primary key is StudentID, since each student appears only once. CourseName depends on CourseID, which is itself a non-prime attribute, so CourseName is transitively dependent on the primary key through CourseID (StudentID determines CourseID, and CourseID determines CourseName). This means the table is not in 3NF.\n",
    "\n",
    "To bring this table into 3NF, it is split into two tables.\n",
    "\n",
    "**Student_Course Table:**\n",
    "\n",
    "| StudentID | CourseID |\n",
    "|:---------:|:--------:|\n",
    "| 1         | C1       |\n",
    "| 2         | C2       |\n",
    "| 3         | C3       |\n",
    "| 4         | C1       |\n",
    "\n",
    "**Course Table:**\n",
    "\n",
    "| CourseID | CourseName  |\n",
    "|:--------:|:-----------:|\n",
    "| C1       | Math        |\n",
    "| C2       | Science     |\n",
    "| C3       | History     |\n",
    "\n",
    "In these tables, every non-prime attribute (CourseID in the Student_Course table and CourseName in the Course table) depends directly on the primary key of its own table. Hence, both tables are now in 3NF."
   ]
  },
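  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The practical benefit of 3NF can be sketched with SQLite, which ships with Python. The sketch assumes that each student takes a single course, so that the student identifier alone can serve as the primary key, and it uses lower-case table and column names of its own. Because the course name is stored only once, renaming a course is a single-row update, which avoids the update anomaly described earlier.\n",
    "\n",
    "```python\n",
    "import sqlite3\n",
    "\n",
    "conn = sqlite3.connect(':memory:')\n",
    "conn.executescript('''\n",
    "CREATE TABLE course (\n",
    "    course_id   TEXT PRIMARY KEY,\n",
    "    course_name TEXT NOT NULL\n",
    ");\n",
    "CREATE TABLE student_course (\n",
    "    student_id INTEGER PRIMARY KEY,\n",
    "    course_id  TEXT REFERENCES course(course_id)\n",
    ");\n",
    "INSERT INTO course VALUES ('C1', 'Math'), ('C2', 'Science'), ('C3', 'History');\n",
    "INSERT INTO student_course VALUES (1, 'C1'), (2, 'C2'), (3, 'C3'), (4, 'C1');\n",
    "''')\n",
    "\n",
    "# Renaming a course touches one row, however many students take the course.\n",
    "conn.execute('UPDATE course SET course_name = ? WHERE course_id = ?', ('Mathematics', 'C1'))\n",
    "\n",
    "# Every student enrolled in C1 sees the new name through the join.\n",
    "names = conn.execute('''\n",
    "    SELECT sc.student_id, c.course_name\n",
    "    FROM student_course AS sc\n",
    "    JOIN course AS c ON c.course_id = sc.course_id\n",
    "    ORDER BY sc.student_id\n",
    "''').fetchall()\n",
    "print(names)\n",
    "conn.close()\n",
    "```\n",
    "\n",
    "In the original single table, the same change would have required editing every row for course C1, with the risk of missing one."
   ]
  },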
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">REDCap Database</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many institutions, including The George Washington University, provide access to REDCap (Research Electronic Data Capture), a web-based application for building and managing databases for data collection and academic research."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>: Read the [REDCap](https://redcap.smhs.gwu.edu) website of The George Washington University."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>: Watch the three [REDCap videos](https://redcap.smhs.gwu.edu/redcap-videos) on the website."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style=\"color:#0096FF\">Differences between spreadsheets and databases</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a table that summarizes the differences between the two major types of software used for health data management: spreadsheets and databases."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Key                        | Spreadsheet                         | Database                                                     |\n",
    "|:---------------------------|:------------------------------------|:-------------------------------------------------------------|\n",
    "| Primary Use                | For calculations and small datasets | To manage and manipulate large datasets                      |\n",
    "| Data Structure             | Flat or tabular structure           | More complex structures (tables, relations)                  |\n",
    "| Data Volume                | Ideal for small volumes of data     | Ideal for large volumes of data                              |\n",
    "| Scalability                | Limited                             | High, can manage large amounts of data                       |\n",
    "| Data Relations             | Limited                             | Relationships between data are integral                      |\n",
    "| Data Integrity             | Limited, no built-in measures       | Built-in measures to ensure data integrity                   |\n",
    "| Multi-user Accessibility   | Limited                             | Multiple users can access and manipulate data simultaneously |\n",
    "| Security                   | Basic security measures             | Advanced security measures, user-level access                |\n",
    "| Complexity                 | Easy to learn and use               | Requires knowledge of database design and SQL                |\n",
    "| Data Analysis Capabilities | Basic analysis and visualizations   | Sophisticated querying, reporting, and analysis              |"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please note that both spreadsheets and databases have their specific strengths and are well-suited to different types of tasks. The choice between the two should be based on specific needs and requirements."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Links</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below are links to commonly used spreadsheet and database software.\n",
    "\n",
    "1.  [Spreadsheet software](https://en.wikipedia.org/wiki/Spreadsheet_software)\n",
    "2.  [Database software](https://en.wikipedia.org/wiki/Database_software)\n",
    "3.  [Microsoft Excel](https://www.microsoft.com/en-us/microsoft-365/excel)\n",
    "4.  [Google Sheets](https://www.google.com/sheets/about/)\n",
    "5.  [LibreOffice Calc](https://www.libreoffice.org/discover/calc/)\n",
    "6.  [Apple Numbers](https://www.apple.com/numbers/)\n",
    "7.  [MySQL](https://www.mysql.com/)\n",
    "8.  [Oracle](https://www.oracle.com/database/)\n",
    "9.  [Microsoft SQL Server](https://www.microsoft.com/en-us/sql-server/sql-server-2019)\n",
    "10. [PostgreSQL](https://www.postgresql.org/)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style=\"color:#0096FF\">Quiz questions</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Questions</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.  What is the difference between a spreadsheet and a database?\n",
    "2.  What are the advantages and disadvantages of using a spreadsheet for health data management?\n",
    "3.  What are the advantages and disadvantages of using a database for health data management?\n",
    "4.  What are some examples of tasks that can be performed using a spreadsheet?\n",
    "5.  What are some examples of tasks that can be performed using a database?\n",
    "6.  What are some examples of tasks that can be performed using both a spreadsheet and a database?\n",
    "7.  What are some examples of tasks that can be performed using neither a spreadsheet nor a database?\n",
    "8.  What are some examples of tasks that can be performed using a spreadsheet but not a database?\n",
    "9.  What are some examples of tasks that can be performed using a database but not a spreadsheet?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}