{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e4239c7c336c20bf1aafb33f507ce27c", "grade": false, "grade_id": "cell-9fc931a1b92d396c", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "# Worksheet 2: Introduction to Reading Data\n", "\n", "### Lecture and Tutorial Learning Goals:\n", "\n", "After completing this week's lecture and tutorial work, you will be able to:\n", "\n", "* define the following:\n", " - absolute file path\n", " - relative file path\n", " - url\n", "* read data into R using a relative path and a url\n", "* compare and contrast the following functions:\n", " - `read_csv` \n", " - `read_tsv`\n", " - `read_csv2`\n", " - `read_delim`\n", " - `read_excel`\n", "* match the following `tidyverse` `read_*` function arguments to their descriptions:\n", " - `file` \n", " - `delim`\n", " - `col_names`\n", " - `skip`\n", "* choose the appropriate `tidyverse` `read_*` function and function arguments to load a given plain text tabular data set into R\n", "* use `readxl` library's `read_excel` function and arguments to load a sheet from an excel file into R\n", "* connect to a database using the `DBI` library's `dbConnect` function\n", "* list the tables in a database using the `DBI` library's `dbListTables` function\n", "* create a reference to a database table that is queriable using the `tbl` from the `dbplyr` library\n", "* retrieve data from a database query and bring it into R using the `collect` function from the `dbplyr` library\n", "* use `write_csv` to save a data frame to a csv file\n", "\n", "* *optional*: scrape data from the web\n", " * read/scrape data from an internet URL using the rvest `html_nodes` and `html_text` functions\n", " * compare downloading tabular data from a plain text file (e.g. *.csv) from the web versus scraping data from a .html file\n", "\n", "\n", "This worksheet covers parts of [the Reading chapter](https://datasciencebook.ca/reading.html) of the online textbook. You should read this chapter before attempting the worksheet." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "69f19e7bf35b269b0e7a242769d8111a", "grade": false, "grade_id": "cell-4c6ba62a17e1fc56", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "### Run this cell before continuing. \n", "library(tidyverse)\n", "library(repr)\n", "library(readxl)\n", "source(\"tests.R\")\n", "source(\"cleanup.R\")\n", "options(repr.matrix.max.rows = 6)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "3cf267c6038f34f3613aced045f3f418", "grade": false, "grade_id": "cell-3b8465487dd766f5", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## 1. Comparing Absolute Paths, Relative Paths, and URLs" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "669fcdc7925e7fd301869b4e2aa6cb77", "grade": false, "grade_id": "cell-33341d6ba389a6a5", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 1.1** Multiple Choice:\n", "
{points: 1}\n", "\n", "If you needed to read a file using an absolute path, what would be the first symbol in your argument (...) when using the `read_csv` function?\n", "\n", "A. `read_csv(\">...\")`\n", "\n", "B. `read_csv(\";...\")`\n", "\n", "C. `read_csv(\"...\")`\n", "\n", "D. `read_csv(\"/...\")`\n", "\n", "*Assign your answer to an object called `answer1.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `\"F\"`).*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "fcf288b0c2a750bfc3d5d16cea2f0456", "grade": false, "grade_id": "cell-6bdc2f57a189abcf", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "95a8391a2ccb7db143ea5c1bb86d6476", "grade": true, "grade_id": "cell-62f0d28a4d321e00", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_1.1()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1f1b85e5f81c6f0bf7adc7c4d40198bc", "grade": false, "grade_id": "cell-b91542cf20075a6d", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 1.2** True or False: \n", "
{points: 1}\n", "\n", "An absolute path can *never* look the same as a relative path.\n", "\n", "*Assign your answer to an object called `answer1.2`. Make sure your answer is written in lowercase and is surrounded by quotation marks (e.g. `\"true\"` or `\"false\"`).* " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "67df28e702c357988b3ffe66dbb754aa", "grade": false, "grade_id": "cell-419826637ad2ec8b", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "f118e4132ee94403fc4a25fea9ec1314", "grade": true, "grade_id": "cell-decda128ce0061b5", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_1.2()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "8ef00bce0f1e8ecc980c4a7493216dfb", "grade": false, "grade_id": "cell-b3a3c0bfe11c30aa", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 1.3** \n", "Match the following paths with the correct path type that they represent:\n", "
{points: 1}\n", "\n", "*Example Path*\n", "\n", "A. `/Users/my_user/Desktop/UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx`\n", "\n", "B. `https://www.ubc.ca`\n", "\n", "C. `file_1.csv`\n", "\n", "D. `/Users/name/Documents/Course_A/homework/my_first_homework.docx`\n", "\n", "E. `homework/my_second_homework.docx`\n", "\n", "F. `https://www.random_website.com`\n", "\n", "\n", "*Path Type*\n", "\n", "1. absolute\n", "2. relative\n", "3. URL\n", "\n", "For every argument, create an object using the letter associated with the example path and assign it the corresponding number from the list of path types. For example: `B <- 1`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "6ddb42725f1a5c450b915b83c41b409b", "grade": false, "grade_id": "cell-09da6197edcf6859", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "8f651f67ecd352eeb1b5a3b49e53e2c0", "grade": true, "grade_id": "cell-bcb67971fe82e848", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_1.3()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1ae149390f96a7fdcbbab7af62963cfe", "grade": false, "grade_id": "cell-4589f01853a62fe7", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 1.4** Multiple Choice:\n", "
{points: 1}\n", "\n", "If the absolute path to a data file looks like this: `/Users/my_user/Desktop/UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx`\n", "\n", "What would the relative path look like if the working directory (i.e., where the Jupyter notebook is where you are running your R code from) is now located in the `UBC` folder?\n", "\n", "A. `sn_trial_1.xlsx`\n", "\n", "B. `/SciaticNerveLab/sn_trial_1.xlsx`\n", "\n", "C. `BIOL363/SciaticNerveLab/sn_trial_1.xlsx`\n", "\n", "D. `UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx`\n", "\n", "E. `/BIOL363/SciaticNerveLab/sn_trial_1.xlsx`\n", "\n", "*Assign your answer to an object called `answer1.4`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `\"F\"`).* " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "e60f9373a8db17a71523857d9d3cf423", "grade": false, "grade_id": "cell-c7993ed9e8e63184", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "98f2e299b7809abf2488626649729d33", "grade": true, "grade_id": "cell-0a781125473dc3d8", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_1.4()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "0b71d3cd6e2fcc9f36d42ab36c6a3e04", "grade": false, "grade_id": "cell-798a7cbf057308b4", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 1.5**\n", "
{points: 1}\n", "\n", "Match the following paths with the most likely kind of data format they contain. \n", "\n", "*Paths:*\n", "\n", "1. `https://www.ubc.ca/datasets/data.db`\n", "2. `/home/user/downloads/data.xlsx`\n", "3. `data.tsv`\n", "4. `examples/data/data.csv`\n", "5. `https://en.wikipedia.org/wiki/Normal_distribution`\n", "\n", "*Dataset Types:*\n", "\n", "A. Excel Spreadsheet\n", "\n", "B. Database\n", "\n", "C. HTML file\n", "\n", "D. Comma-separated values file\n", "\n", "E. Tab-separated values file\n", "\n", "For every dataset type, create an object using the letter associated with the example and assign it the corresponding number from the list of paths. For example: `F <- 5`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "12d6b68b3169ecf3f7f41ea302ce1d7e", "grade": false, "grade_id": "cell-050dd93f81d2fc95", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "6aff1dc46def341826fce97f1c22263b", "grade": true, "grade_id": "cell-f4e891dadfecbbf6", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_1.5()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "49022afa8ea44b4ad8918cdf1e442d1d", "grade": false, "grade_id": "cell-5596109d0e6be6f8", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## 2. Argument Modifications to Read Data\n", "Reading files is one of the first steps to wrangling data and consequently `read_csv` is a crucial function. However, despite how effortlessly it has worked so far, it has its limitations. `read_csv` works with particular files and does not accept differing formats. \n", "\n", "Not all data sets come as perfectly organized like the ones you worked with last week. Time and effort were put into ensuring that the files were arranged with headers, columns were separated by commas, and the beginning excluded metadata. \n", "\n", "Now that you understand how to read files located outside (or inside) of your working directory, you can begin to learn the tips and tricks necessary to overcoming the setbacks of `read_csv`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Run this cell to learn more about the arguments used in read_csv\n", "### Reading over the help file will assist with the next question. \n", "\n", "?read_csv" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "7bf6b68d831dea31eabb9098afa4f627", "grade": false, "grade_id": "cell-d827ca1d36fb14e0", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 2.1** \n", "
{points: 1}\n", "\n", "Match the following descriptions with the corresponding arguments used in `read_csv`:\n", "\n", "*Descriptions*\n", "\n", "G. Character that separates columns in your file. \n", "\n", "H. Specifies whether or not the first row of data in your file are column labels. Also allows you to create a vector that can be used to label columns. \n", "\n", "I. This is the file name, path to a file, or URL. \n", "\n", "J. Specifies the number of lines which must be ignored because they contain metadata. \n", "\n", "\n", "*Arguments*\n", "\n", "1. `file`\n", "2. `delim`\n", "3. `col_names`\n", "4. `skip`\n", "\n", "For every description, create an object using the letter associated with the description and assign it the corresponding number from the list of functions. For example: `G <- 1`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "0005bca8091135afd37cf473f65cc4c3", "grade": false, "grade_id": "cell-bbc885297c410cb2", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "465f30f8342e492ece3171a09d6dd092", "grade": true, "grade_id": "cell-50ce2f054a8c009a", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_2.1()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "f642bc2821bea60c1dfbebc29088b8c8", "grade": false, "grade_id": "cell-5b3f94746c785534", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 2.2** True or False:\n", "
{points: 1}\n", "\n", "`read_csv2` and `read_delim` can both be used for reading files that have columns separated by `;`. \n", "\n", "*Assign your answer to an object called `answer2.2`. Make sure your answer is in lowercase and is surrounded by quotation marks (e.g. `\"true\"` or `\"false\"`).*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "af005ea250f7b2271456f40533eb4c3d", "grade": false, "grade_id": "cell-a35b96936f6bb761", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "94e62ce1c32cca27d1684d83c22df692", "grade": true, "grade_id": "cell-7e29230345864f5d", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_2.2()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "a1c854d2531ec903a94f68dcc70b79e4", "grade": false, "grade_id": "cell-4750064840f67738", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 2.3** Multiple Choice: \n", "
{points: 1}\n", "\n", "`read_tsv` can be used for files that have columns separated by which of the following:\n", "\n", "A. letters\n", "\n", "B. tabs\n", "\n", "C. numbers\n", "\n", "D. commas \n", "\n", "*Assign your answer to an object called `answer2.3`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `\"F\"`).* " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "f514c5c65c7b2187fc76a6b73796d2c5", "grade": false, "grade_id": "cell-518590cd5df1f1b7", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "1b47fe7daf16aac99151abbdbd7bb2c8", "grade": true, "grade_id": "cell-8dfe160165696251", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_2.3()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "d60049ec90167aa2aa5de878c1963d84", "grade": false, "grade_id": "cell-ba92ca46de3933fe", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "## 3. Happiness Report (2017)\n", "This data was taken from [Kaggle](https://www.kaggle.com/unsdsn/world-happiness) and ranks countries on happiness based on rationalized factors like economic growth, social support, etc. The data was released by the United Nations at an event celebrating International Day of Happiness. According to the website, the file contains the following information:\n", "\n", "* Country = Name of the country.\n", "* Region = Region the country belongs to.\n", "* Happiness Rank = Rank of the country based on the Happiness Score.\n", "* Happiness Score = A metric measured by asking the sampled people the question: \"How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest?\"\n", "* Standard Error = The standard error of the happiness score.\n", "* Economy (GDP per Capita) = The extent to which GDP contributes to the calculation of the Happiness Score.\n", "* Family = The extent to which Family contributes to the calculation of the Happiness Score.\n", "* Health (Life Expectancy) = The extent to which Life expectancy contributed to the calculation of the Happiness Score.\n", "* Freedom = The extent to which Freedom contributed to the calculation of the Happiness Score.\n", "* Trust (Government Corruption) = The extent to which Perception of Corruption contributes to Happiness Score.\n", "* Generosity = The extent to which Generosity contributed to the calculation of the Happiness Score.\n", "* Dystopia Residual = The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.\n", "\n", "To clean up the file and make it easier to read, we only kept the country name, happiness score, economy (GDP per capita), life expectancy, and freedom. The happiness scores and rankings use data from the Gallup World Poll, which surveys citizens in countries from around the world.\n", "\n", "Kaggle stores this information but it is compiled by the *Sustainable Development Solutions Network*. They survey these factors nearly every year (since 2012) and allow global comparisons to optimize political decision making. These landmark surveys are highly recognized and allow countries to learn and grow from one another. One day, they will provide a historical insight on the nature of our time. " ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "9c34a06becfee66483f1d433859a15f0", "grade": false, "grade_id": "cell-4f9336c659cb50f6", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 3.1** Fill in the Blank: \n", "
{points: 1}\n", "\n", "Trust is the extent to which \\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_ contributes to Happiness Score. \n", "\n", "A. Corruption \n", "\n", "B. Government Intervention \n", "\n", "C. Perception of Corruption \n", "\n", "D. Tax Money Designation \n", "\n", "*Assign your answer to an object called `answer3.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `\"F\"`).* " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "c7d5008f613469e887506c804a4ff0ed", "grade": false, "grade_id": "cell-b4e3e33488ef32b7", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "84a45b95a0dc81aec2bc004316316e4d", "grade": true, "grade_id": "cell-56c422da7266129b", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_3.1()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "de4d00af95cf70767d71b53b8ab10c7c", "grade": false, "grade_id": "cell-783457361cf6c5df", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 3.2** Multiple Choice: \n", "
{points: 1}\n", "\n", "What is the happiness report?\n", "\n", "A. Study conducted by the governments of multiple countries. \n", "\n", "B. Independent survey of citizens from multiple countries.\n", "\n", "C. Study conducted by the UN. \n", "\n", "D. Survey given to international students by UBC's psychology department. \n", "\n", "*Assign your answer to an object called `answer3.2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `\"F\"`).* " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "2b77ae034a4bb74dc48b8da85540f977", "grade": false, "grade_id": "cell-f11d6a9fb2cb9712", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "7e8a6a943873caafc16b59b24d2757e3", "grade": true, "grade_id": "cell-7566d4f3f306ca1c", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_3.2()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "41924492692fd7b8d246e9161c0e40f5", "grade": false, "grade_id": "cell-9f8a4bdfe511f79b", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 3.3** Fill in the Blanks (of the Table):\n", "
{points: 1}\n", "\n", "It is often a good idea to try to \"inspect\" your data to see what it looks like before trying to load it into R. This will help you figure out the right function to call and what arguments to use. When your data are stored as plain text, you can do this easily with Jupyter (or any text editor). \n", "\n", "Open all the files named `happiness_report...` in the `data` folder with the plain text editor in your working directory (the `worksheet_02` directory) using Jupyter (**Right click the file -> Open With -> Editor**). This will allow you to visualize the files and the organization of your data. Based on your findings, fill in the missing items in the table below. This table will be very useful to refer back to in the coming weeks. \n", "\n", "*You'll notice that trying to open one of the files gives you an error (`File Load Error ... is not UTF-8 encoded`). This means that this data is not stored as human-readable plain text. For this special file, just fill in the* `read_*` *function entry, the other columns will be left blank.*" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "19badd3b0b5359a6a78472e00127f0c8", "grade": false, "grade_id": "cell-1935595960d193a5", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "|File Name | delim | Header | Metadata | skip | read_* |\n", "|--------------------------------|------------|--------|----------|--------------------|----------|\n", "|`_.csv` |`\";\"`, `\",\"`, `\"\\\"`, or `\"tab\"`|`\"yes\"`or `\"no\"`|`\"yes\"`or `\"no\"`|`NA` or # of lines|`read_*`|\n", "|`happiness_report.csv` |, |**A** |no |`NA` |`read_csv` |\n", "|`happiness_report_semicolon.csv`|; |yes |no |`NA` |**B** |\n", "|`happiness_report.tsv` |**C** |yes |no |`NA` |`read_tsv` |\n", "|`happiness_report_metadata.csv` |, |yes |**D** |2 |`read_csv` |\n", "|`happiness_report_no_header.csv`|, |**E** |no |`NA` |`read_csv` |\n", "|`happiness_report.xlsx` | | | | |**F**|\n", "\n", "For the missing items (labelled A to F) in the table above, create an object using the letter and assign it the corresponding missing value.\n", "\n", "For example: `A <- \"yes\"`. The possible options for each column are given in the first row of the table. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "3a3b6ef10d4a96a74b7d9626b1c94e88", "grade": false, "grade_id": "cell-3b30ff40fa20980c", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "332fa02c35128f4c4d8619de6b02dc53", "grade": true, "grade_id": "cell-7078427e58de2889", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_3.3()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "28d9642b617c8063133d9ef5c7b3d442", "grade": false, "grade_id": "cell-d034da5aced69138", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 3.4** \n", "
{points: 1}\n", "\n", "Read the file `happiness_report.csv` in the `data` folder using the shortest relative path. **Hint:** preview the data using Jupyter (as discussed above) so you know which `read_*` function and arguments to use.\n", "\n", "*Assign the relative path (the string) to an object named* `happiness_report_path`, *and assign the output of the correct* `read_*` *function you call to an object named* `happiness_report`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "5535290e89b2ebd21b157f8e17706ec0", "grade": false, "grade_id": "cell-9637860c62acbf70", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# happiness_report_path <- \"...\"\n", "# ... <- ...(file = happiness_report_path)\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "print(happiness_report, n = 10) # the n = 10 argument tells R to print 10 lines " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "d32bbb42a4f5b79e536712bdbf0d1550", "grade": true, "grade_id": "cell-3488c6479abcf3d4", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_3.4()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "57467cbc4b566f8c8a121915df112427", "grade": false, "grade_id": "cell-0e8db9c50093cc95", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 3.5** Multiple Choice:\n", "
{points: 1}\n", "\n", "If Norway is in \"first place\" based on the happiness score, at what position is Canada?\n", "\n", "A. 3rd\n", "\n", "B. 15th\n", "\n", "C. 7th\n", "\n", "D. 28th\n", "\n", "*Hint: create a new cell and run `happiness_report`.* \n", "\n", "*Assign your answer to an object called `answer3.5`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `\"F\"`).* " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "e3c793eda1325d9ea1cd290a11997fef", "grade": false, "grade_id": "cell-9528400a7684ff70", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "32109ca7d04aac17648a218d34a2b88b", "grade": true, "grade_id": "cell-3876f08146bbc9ff", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_3.5()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "5071fe81b3947759d4613b4cd7b1a788", "grade": false, "grade_id": "cell-ef3eb28b4bd11d8e", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 3.6.1**\n", "
{points: 1}\n", "\n", "For each question in the ranges 3.6.1 to 3.6.5 and 3.7.1 to 3.7.2, fill in the `...` in the code given. Replace `fail()` with your finished answer. Refer to your table above and don't be afraid to ask for help. Remember you can use `?` help operator to access documentation for a function (e.g. `?read_csv`).\n", "\n", "Read in the file `happiness_report_semicolon.csv` using `read_delim` and name it `happy_semi_df`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "8f2fe8e70b0164014fe71581998e9add", "grade": false, "grade_id": "cell-1a59d2fec3747ff0", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "#... <- read_delim(file = \"data/...\", delim = \"...\")\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "happy_semi_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "2acf3cc56ad5d7e2f82b978bd64299ca", "grade": true, "grade_id": "cell-ae36542297ad995b", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_3.6.1()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "5f5819f520c0b53a92588de6716fb713", "grade": false, "grade_id": "cell-f396ee421ad33cb4", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Take a look at the **data type** in the `GDP_per_capita`, `life_expectancy`, and `freedom` columns. It says ``; that stands for \"character\" or text data -- not numeric as we would hope! The `happiness_score` column has `` (stands for \"double-precision floating point number\", a numeric type), which is correct. We'd like the other columns to have this type as well... what happened?\n", "\n", "If we look closer, we'll see that the decimal point in this data was a *comma* `,` rather than a period (common in some European countries).\n", "\n", "Instead of `read_delim`, for this data we'll need another function that can handle commas as decimal points." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "9d6a3eba122585ae49b339dbb5fa43f4", "grade": false, "grade_id": "cell-2d6d41a3d7f0e440", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 3.6.2**\n", "
{points: 1}\n", "\n", "Read in the file `happiness_report_semicolon.csv` again, but this time use a different `read_*` function than `read_delim` to ensure that the column types are correct. Remember you can use `?` help operator to access documentation for a function (e.g. `?read_csv`). **Hint:** take a look at the list of `read_*` functions at the top of this worksheet under the learning goals section. Name the data frame `happy_semi_df2`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "09932b3a20178f1dbdbbf37f6bc562d7", "grade": false, "grade_id": "cell-dfa890bab0fa0ec8", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "#... <- ...(file = \"...\")\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "happy_semi_df2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "6d56ff40177debd8397843f6855a9593", "grade": true, "grade_id": "cell-e99b9418831c851b", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_3.6.2()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "5d4c7243657fe1d15afd2278a8b075be", "grade": false, "grade_id": "cell-f6fd13c112fece7a", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 3.6.3**\n", "
{points: 1}\n", "\n", "Read in the file `happiness_report.tsv` using the appropriate `read_*` function and name it `happy_tsv`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "4fce2cfc44379b77ca945ac0ea67d37b", "grade": false, "grade_id": "cell-e25b1333a4928479", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "#... <- ...(file = \"...\")\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "happy_tsv" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "2cc7e91a81e3dc5408e99920b52cfad0", "grade": true, "grade_id": "cell-4943aa347e83b909", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_3.6.3()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "fb264f7b4650e7275fe9e368219d019f", "grade": false, "grade_id": "cell-4bdb6b7ea045900b", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 3.6.4**\n", "
{points: 1}\n", "\n", "Read in the file `happiness_report_metadata.csv` using the appropriate `read_*` function and name it `happy_metadata`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "e94c8e220e0927cee900c86082753e77", "grade": false, "grade_id": "cell-87cd2b408dab554c", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "#... <- ...(file = \"data/happiness_report_metadata.csv\", skip = ...)\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "happy_metadata" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "2a8fd1056231973670c48f18b5f03e69", "grade": true, "grade_id": "cell-e1e499a580fd7227", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_3.6.4()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "b1955b3aa1e61a609afe4b4454762a71", "grade": false, "grade_id": "cell-444f7f3a1d7a98d7", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 3.6.5**\n", "
{points: 1}\n", "\n", "Read in the file `happiness_report_no_header.csv` using the appropriate `read_*` function and name it `happy_header`. \n", "Note: If the argument `col_names` is a character vector, the values will be used as the names of the columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "b122db1f544051073220709f916ca5ba", "grade": false, "grade_id": "cell-68d0893af7818c87", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "#... <- ...(file = \"...\", col_names = c(\"country\", \"happiness_score\", \"GDP_per_capita\", \"life_expectancy\", \"freedom\"))\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "happy_header" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "fad84075341b2b04690d50f41fa86f55", "grade": true, "grade_id": "cell-abb25e9b6b619bf1", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_3.6.5()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "eb301db029a7e29c041a102ce2d752c3", "grade": false, "grade_id": "cell-47ac997d068d7480", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 3.7**\n", "
{points: 1}\n", "\n", "Earlier when you tried to open `happiness_report.xlsx` in Jupyter, you received an error message (`File Load Error ... is not UTF-8 encoded`). This happens because Excel spreadsheet files are not stored in plain text, and so Jupyter can't open them with its default text viewing program. This makes them a bit harder to inspect before trying to open in `R`.\n", "\n", "To inspect the data, we will just try to load `happiness_report.xlsx` using the most basic form of the appropriate `read_*` function, passing only the filename as an argument. Assign the output to a variable called `happy_xlsx`.\n", "\n", "*Note: you can also try to examine `.xlsx` files with Microsoft Excel or Google Sheets before loading into R.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "d32b10e0785d4e8991d24e64c4f45aeb", "grade": false, "grade_id": "cell-6c241a49bbfc7b35", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#... <- ...(path = \"...\")\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "happy_xlsx" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "5517d30e45f08b55a9d4c6a73e1da72d", "grade": true, "grade_id": "cell-3b0b5ff5faaddf50", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_3.7()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "0aa4f1c57858baa75836d1ae1ead15cc", "grade": false, "grade_id": "cell-9ef61e2af378b40d", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 3.8** \n", "
{points: 1}\n", "\n", "Opening the data on a text editor showed some clear differences. Do all the data sets look the same once reading them on your R notebook (`\"yes\"` or `\"no\"`)? \n", " \n", "*Assign your answer to an object called `answer3.8`. Make sure your answer is in lowercase and is surrounded by quotation marks (e.g. `\"yes\"` or `\"no\"`).* " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "c5ce4071ed1e759636d37e8309d7a9b8", "grade": false, "grade_id": "cell-ca10e96ab5e42077", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "bfb9ca3c4cb44af1a9dca9305ca310ef", "grade": true, "grade_id": "cell-b10f0e14c91d5ebf", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_3.8()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "591e1b8c2a8e1272dbd278a4df887824", "grade": false, "grade_id": "cell-e8cd9683896b730d", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 3.9** \n", "
{points: 1}\n", "\n", "Using the `happy_header` data set that you read earlier, plot `life_expectancy` vs. `GDP_per_capita`. Note that the statement \"plot A vs. B\" usually means to plot A on the y-axis, and B on the x-axis. Be sure to use `xlab` and `ylab` to give your axes human-readable labels.\n", "\n", "*Assign your answer to an object called `header_plot`.* " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "d64ab1ea1d72b54f8c24231db0480b6d", "grade": false, "grade_id": "cell-1341c1e60d1df98d", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "options(repr.plot.width = 8, repr.plot.height = 7)\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "header_plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "1f470bf62b29e4d4a670996bc63e3005", "grade": true, "grade_id": "cell-06afbc4c37639e7d", "locked": true, "points": 1, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_3.9()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "240c293982ba4f3a13ded74a5d18d92e", "grade": false, "grade_id": "cell-c000b7bf1fb43b1c", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## 4. Reading Data from a Database\n", "\n", "### Investigating the reliability of flights into and out of Boston Logan International Airport\n", "\n", "Delays and cancellations seem to be an unavoidable risk of air travel. A missed connection, or hours spent waiting at the departure gate, might make you wonder though: how reliable is air travel, *really*?\n", "\n", "The US Bureau of Transportation Statistics keeps a continually-updated [Airline On-Time Performance Dataset](https://transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data) that has tracked the scheduled and actual departure / arrival time of flights in the United States from 1987 to the present day. In this section we'll do some exploration of this data to try to answer some of the above questions. The actual data we'll be using was from only the year 2015, and was compiled into the [2015 Kaggle Flight Delays Dataset](https://www.kaggle.com/usdot/flight-delays) from the raw Bureau data. But even that dataset is too large to handle in this course (5.8 million flights in just one year!), so the data have been filtered down to flights that either depart or arrive at Logan International Airport (`BOS`), resulting in around 209,000 flight records. \n", "\n", "Our data has the following variables (columns):\n", "\n", "- year\n", "- month\n", "- day\n", "- day of the week (from 1 - 7.999..., with fractional days based on departure time)\n", "- origin airport code\n", "- destination airport code\n", "- flight distance (miles)\n", "- scheduled departure time (local)\n", "- departure delay (minutes)\n", "- scheduled arrival time (local)\n", "- arrival delay (minutes)\n", "- diverted? (True/False)\n", "- cancelled? (True/False)\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "b1bfc9349ce1579642b409d7daf8c2a5", "grade": false, "grade_id": "cell-68b27df331c7af9f", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 4.1** True or False:\n", "
{points: 1}\n", "\n", "We can use our dataset to figure out which airline company was the least likely to experience a flight delay in 2015.\n", "\n", "*Assign your answer to an object called `answer4.1`. Make sure your answer is in lowercase and is surrounded by quotation marks (e.g. `\"true\"` or `\"false\"`).*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "adbe89f8a103aade1718e8c787fa9fd0", "grade": false, "grade_id": "cell-1eb5f7e0f07ed793", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "1641668438d28461b579ee13cb38e9f8", "grade": true, "grade_id": "cell-847ba82ee38201de", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_4.1()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "2e51ba603b7a05923cebd2012e4fedbd", "grade": false, "grade_id": "cell-7b7a78cb0fab16e5", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 4.2** Multiple Choice\n", "
{points: 1}\n", "\n", "If we're mostly concerned with getting to our destination on time, which variable in our dataset should we use as the y-axis of a plot?\n", "\n", "A. flight distance\n", "\n", "B. departure delay\n", "\n", "C. origin airport code\n", "\n", "D. arrival delay\n", "\n", "*Assign your answer as a single character to an object called `answer4.2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `\"F\"`).*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "0cd6c9281afdfef39258b89545b6213b", "grade": false, "grade_id": "cell-72b0739f5d3a6017", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "3ee32eadc18e4416b3cad8c08379e796", "grade": true, "grade_id": "cell-990bcb46001bf6bc", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_4.2()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "db83d0f611b01d86593a7c6e21c45fe1", "grade": false, "grade_id": "cell-7b877620e95a911b", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Let's start exploring our data. The file is stored in `data/flights_filtered.db` in your working directory (still the `worksheet_02` folder). If you try to open the file in Jupyter to inspect its contents, you'll again run into the `File Load Error ... is not UTF-8 encoded` message you got earlier when trying to open an Excel spreadsheet in Jupyter. This is because the file is a *database* (often denoted by the `.db` extension), which are usually not stored in plain text. \n", "\n", "We'll need more R packages to help us handle this kind of data: \n", "\n", "- the [database interface (`DBI`) package](https://www.rdocumentation.org/packages/DBI/versions/0.5-1) for opening, connecting to, and interfacing with databases\n", "- the [R SQLite (`RSQLite`) package](https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html) so that DBI can talk to SQLite databases \n", " - there are many kinds of databases; the `flights_filtered.db` database is an SQLite database\n", "- the [dbplyr package](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html) for manipulating tables in the database using functions in R\n", " - without this, in order to retrieve data from the database, we would have to know a whole separate language, Structured Query Language (SQL)\n", "\n", "Let's load those now.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "ba9446effcd2a77475332790f148e6da", "grade": false, "grade_id": "cell-b03325c723b1c817", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Run this cell before continuing.\n", "library(DBI)\n", "library(RSQLite)\n", "library(dbplyr)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "dbf740d92c2f022cddb5e26781db3b19", "grade": false, "grade_id": "cell-aafba0a844ff4070", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "In order to open a database in R, you need to take the following steps:\n", "\n", "1. Connect to the database using the `dbConnect` function.\n", "2. Check what tables (similar to R dataframes, Excel spreadsheets) are in the database using the `dbListTables` function\n", "3. Once you've picked a table, create an R object for it using the `tbl` function\n", "\n", "*Note: the `tbl` function returns a **reference** to a database table, not the actual data itself. This allows R to talk to the database / get subsets of data without loading the entire thing into R!*\n", "\n", "The next few questions will walk you through this process.\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "c6aa3478ac3ef615da4becb4812631d6", "grade": false, "grade_id": "cell-9b0fb86c56d041a0", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 4.3.1** \n", "
{points: 1}\n", "\n", "Use the `dbConnect` function to open and connect to the `flights_filtered.db` database in the `data` folder.\n", "\n", "*Note: we have provided the first argument, `RSQLite::SQLite()`, to `dbConnect` for you below. This just tells the dbConnect function that we will be using an SQLite database.*\n", "\n", "*Assign the output to a variable named `conn`*." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "b9cebeb84e0782d316a3c52723333974", "grade": false, "grade_id": "cell-1448c7dbce0b45ab", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#... <- dbConnect(RSQLite::SQLite(), '...') #replace ... with the database relative path\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "b96bdbdbaca93dffa57cb5d6c515a735", "grade": true, "grade_id": "cell-450ac4401cbdd5df", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_4.3.1()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "ac05568de3ade4553cce7fd4d02793b4", "grade": false, "grade_id": "cell-45d198de7245ad8f", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 4.3.2**\n", "
{points: 1}\n", "\n", "Use the `dbListTables` function to inspect the database to see what tables it contains.\n", "\n", "*Make a new variable named `flights_table_name` that stores the name of the table with our data in it*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use this cell to figure out how to answer the question\n", "# Call the dbListTables function in this cell and take a look at the output\n", "# If you don't know what argument to give dbListTables, use ?dbListTables to find out!\n", "\n", "#dbListTables(...) #replace ... with the right argument\n", "\n", "#once you've called this and seen the output, insert the output string in the cell below as denoted" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "ffcd8d612d689d41f40b288dcaadd461", "grade": false, "grade_id": "cell-dd28dbea4ff01626", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#... <- '...'\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "7fb8b7c56e01f46a6b58c802aeba8257", "grade": true, "grade_id": "cell-37f36762e19a02f2", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_4.3.2()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "39da0ec2f52b8ba069815e3724fabe20", "grade": false, "grade_id": "cell-6097516dbac498b6", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 4.3.3**\n", "
{points: 1}\n", "\n", "Use the `tbl` function to create an R reference to the table so that you can manipulate it with `dbplyr` functions.\n", "\n", "*Make a new variable named `flight_data` based on the output of `tbl`*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "7d193bd89995c3fe435c2994ef3ae193", "grade": false, "grade_id": "cell-ae916513622496e1", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#flight_data <- ...\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "f0c057bad53ed5a7b99a55b6e0e5d584", "grade": true, "grade_id": "cell-25d8325b46b0321d", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_4.3.3()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "5df3b0988528da606cb548d0e51deefb", "grade": false, "grade_id": "cell-f5b6931ac3257fe2", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Now that we've connected to the database and created an R table object, we'll take a look at the first few rows and columns of the flight on-time performance data. Even though `flight_data` isn't a regular R dataframe---it's a database table connection, or specifically a `tbl_SQLiteConnection`---the functions from the `dbplyr` package let us treat it like an R dataframe!\n", " \n", "So let's try using the `head` function (which allows us to see the first few rows of a dataset) and see what happens:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "9a8e6136ae30c48d1963d896bf9a0bed", "grade": false, "grade_id": "cell-4f7812a97fb3b366", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Run this cell before continuing.\n", "head(flight_data)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "6a74273e9983700c82bc4e1a1ce9edfb", "grade": false, "grade_id": "cell-925839d2057b8935", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "It works! And---as luck would have it---it also works to use the `select` and `filter` functions you've learned about previously.\n", "\n", "*Note: not all functions that you're familiar with work on database table `tbl` reference objects. For example, if you try to run `nrow` (to count the rows) or `tail` (to get the last rows of the table), you won't get the result you expect*.\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "8b241794075b5e7dc2d72c0f8d5ec29b", "grade": false, "grade_id": "cell-85a8745db2165d6b", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 4.4**\n", "
{points: 1}\n", "\n", "Use the `select` and `filter` functions to extract the **arrival and departure delay** columns for rows where **the origin airport is BOS.**\n", "\n", "*Store your answer in a variable called* `delay_data`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "2c18675ed495f1cdbc1fe711c53b17d6", "grade": false, "grade_id": "cell-c510f5a9427d66c4", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "389605081c7d6c0e05d257e2ba90e765", "grade": true, "grade_id": "cell-1ebff82c6cecfdd2", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_4.4()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "1026a61ada058863ab4b6ae72d38e16e", "grade": false, "grade_id": "cell-eeda91a1819b7875", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Take a look at `delay_data` to make sure it has the two columns we expect.\n", "# Run this cell before continuing.\n", "head(delay_data)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1ba6bc6e332107a24dd16f2ec22f12d5", "grade": false, "grade_id": "cell-fe64925771cd864e", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "You'll notice in the `Source:` line that the dimension of the table is listed as `[?? x 2]`. This is because databases do things in the *laziest* way possible. Since we only asked the database for its `head` (the first few rows), it didn't bother going through all the rows to figure out how many there are. This sort of laziness can help make things run a lot faster when dealing with large datasets.\n", "\n", "Our next task is to visualize our data to see whether there is a difference in delays for arrivals at and departures from `BOS`. But before we do that, let's figure out just how much data we're working with using the `count` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "d4dedbb622ded8b231b151aaac05f6d6", "grade": false, "grade_id": "cell-07a5e924ff63c820", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Run this cell before continuing.\n", "count(delay_data)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "b95050ef99880d8a6447011e9cd602b1", "grade": false, "grade_id": "cell-ae414c879686e90f", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Yikes---that's a lot of data! If we tried to do a scatter plot of these, we probably wouldn't be able to see anything useful; all the points would be mushed together. Let's try using a *histogram* instead. A histogram helps us visualize how a particular variable is distributed in a dataset. It does this by separating the data into *bins*, and then plotting vertical bars showing how many data points fell in each bin.\n", "\n", "For example, we could use a histogram to visualize the distribution of waiting times between eruptions of the Old Faithful geyser in Yellowstone National Park, Wyoming with the `geom_histogram` layer. The `bins` argument specifies the number of bins to use in the histogram.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "7d45e53b4a4ad449c936f2efc7565293", "grade": false, "grade_id": "cell-b99d6333eff0b0fb", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Run this cell before continuing.\n", "ggplot(faithful, aes(x = waiting)) + \n", " geom_histogram(bins = 40) + \n", " xlab(\"Waiting Time (mins)\") + \n", " ylab(\"Count\") + \n", " theme(text = element_text(size=20))" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "5ded1bd40c64e538a9e387c8bdb2bddc", "grade": false, "grade_id": "cell-a64059c2fa597366", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "We'll use histograms to visualize the departure delay times and arrival delay times separately." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "8e4c0193461784638df7dd71bc5f6ef5", "grade": false, "grade_id": "cell-b0e0f2ed0c5dddfd", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 4.5**\n", "
{points: 1}\n", "\n", "Plot the **arrival** delay time data as a histogram. You will plot the delay (in hours) separated into 15-minute-wide bins on the x axis. The y axis will show the percentage of flights departing BOS that had that amount of delay during 2015.\n", "\n", "You'll do this by finishing the code segment provided below. There are 4 places where `...` appears in the provided code below. Replace each instance of `...` with the correct item from the following list:\n", "\n", "- `ARRIVAL_DELAY/60`\n", "- `'steelblue'`\n", "- `\"Delay (hours)\"`\n", "- `geom_histogram`\n", "\n", "*Assign the output of ggplot to an object called* `arrival_delay_plot`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "341efa5afc92ceb72c5328c1c4c14752", "grade": false, "grade_id": "cell-79abf1e26b1b8f0c", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Replace each ... with the correct item in the list above. \n", "\n", "# ... <- ggplot(delay_data, aes(x = ...)) +\n", "# ...(aes(y = 100 * stat(count) / sum(stat(count))),\n", "# binwidth = .25,\n", "# fill = \"lightblue\",\n", "# color = ...) +\n", "# scale_x_continuous(limits = c(-2, 5)) +\n", "# ylab(\"% of Flights\") +\n", "# xlab(...) +\n", "# theme(text = element_text(size=20))\n", "# ...\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "eea0c175e108a37501f4681d3c0e80b8", "grade": true, "grade_id": "cell-f6883fdfdfd2696a", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_4.5()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "b61d8fe10b4547180ab97cd120e28e16", "grade": false, "grade_id": "cell-a6abb9eda899782f", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 4.6**\n", "
{points: 1}\n", "\n", "Plot the **departure** delay time data as a histogram with the same format as the previous plot. **Hint:** copy and paste your code from the previous block! The only thing that will change is column from `delay_data` that you use for the x-axis.\n", "\n", "*Assign the output of ggplot to an object called* `departure_delay_plot`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "90e39ee9851bb151df74df2ff47736f8", "grade": false, "grade_id": "cell-ccb6e468d200a27c", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "32c24bfad6885c519de826b59ab81d38", "grade": true, "grade_id": "cell-72a3083183536c26", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_4.6()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "82f709a9c61549512c19cf956d26c67e", "grade": false, "grade_id": "cell-1c7eef04ed3ca25e", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 4.7**\n", "
{points: 1}\n", "\n", "Look at the two plots you generated. Are departures from or arrivals to `BOS` more likely to be on time (at most 15 minutes ahead/behind schedule)?\n", "\n", "_Assign your answer (either `\"departures\"` or `\"arrivals\"`) to an object called `answer4.7`._" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "a58ecfec87d4ddd23b7ebd8fa06278cf", "grade": false, "grade_id": "cell-2dbbff8a5c76c43d", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "c6ac5ab14aeb4d446478261b07346933", "grade": true, "grade_id": "cell-92730484cc1f18bc", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_4.7()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "fcfe800de597f3d2a4de5d18b22b8e8a", "grade": false, "grade_id": "cell-f19dab8a7027fedf", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "So far, we've done everything using the `delay_data` database reference object constructed using functions from the `dbplyr` library. Remember: this isn't the data itself! If we want to save the small data subset that we've constructed to our local machines (perhaps to share it on the web or with collaborators), we'll need to take one last step. \n", "\n", "**Question 4.8.1**\n", "
{points: 1}\n", "\n", "We want to download the arrival / departure times data where the origin airport is BOS from the database. We will use the `collect` function to do this, which of the following should you use? \n", "\n", "A. `collect(delay_data)`\n", "\n", "B. `collect(flights_table_name)`\n", "\n", "C. `collect(conn)`\n", "\n", "D. `collect(flight_data)`\n", "\n", "*Assign your answer to an object called `answer4.8.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `\"F\"`).*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "045fd90d242d4d0dade7dc835613b3e7", "grade": false, "grade_id": "cell-ef26b547a5c2721d", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "603e21d105166942280356f369375fec", "grade": true, "grade_id": "cell-266f864a8ceefdd6", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_4.8.1()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "ce81626b89475aa63b074c038f5d3431", "grade": false, "grade_id": "cell-ca24c491ffce682a", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Question 4.8.2**\n", "
{points: 1}\n", "\n", "**If you input the wrong argument in the `collect()` function below your worksheet will time out. Please double check you have the correct answer to question 4.8.1 above and input the correct argument in the `collect()` function below!** \n", "\n", "Use the `collect` function to download the arrival / departure times data where the origin airport is BOS from the database and store it in a dataframe object called `delay_dataframe`. Then, use the `write_csv` function to write the dataframe to a file called `delay_data.csv`. Save the file in the `data/` folder.\n", "\n", "*Note: there are many possible ways to use `write_csv` to customize the output. Just use the defaults here!*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#If you don't know how to call collect or write_csv, use this cell to \n", "#check the documentation by calling ?collect or ?write_csv\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "1c90281b21afdaffd9276d3bf386d4f2", "grade": false, "grade_id": "cell-d6099a57242aca0b", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# If you input the wrong tbl in the collect() function below your worksheet will time out\n", "# Please make sure you check your answer to question 4.8.1 and input the correct tbl in the collect() function below \n", "#delay_dataframe <- collect(...)\n", "#write_csv(..., ...)\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "2226f92f40d1dd5e0d82aefcd2a22976", "grade": true, "grade_id": "cell-9cd92c82958ba9fe", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_4.8.2()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "caa58e5c5e67196c2aa4b8e50319c2a0", "grade": false, "grade_id": "cell-00117e573b347f82", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "source(\"cleanup.R\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "28a050226cb8791cdf4e1a8aa210a557", "grade": false, "grade_id": "cell-286b4e70afde82f3", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## 5 (Optional). Reading Data from the Internet\n", "\n", "### How has the World Gross Domestic product changed throughout history?\n", "\n", "\n", "As defined on Wikipedia, the \"Gross world product (GWP) is the combined gross national product of all the countries in the world.\" Living in our modern age with our roaring (sometimes up and sometimes down) economies, one might wonder how the world economy has changed over history. To answer this question we will scrape data from the [Wikipedia Gross world product page](https://en.wikipedia.org/wiki/Gross_world_product).\n", "\n", "Your data set will include the following columns: \n", "* `year`\n", "* `gwp_value`\n", "\n", "Specifically we will scrape the 2 columns named \"Year\" and \"Real GWP\" in the table under the header \"Historical and prehistorical estimates\". **The end goal of this exercise is to create a line plot with year on the x-axis and GWP value on the y-axis.**" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "85cdf20f17d413b086ef1cab67ea5bd1", "grade": false, "grade_id": "cell-2cfe0d273d96fdf7", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 5.1.0** Multiple Choice: \n", "
{points: 0}\n", "\n", "Under which of the following headers in the table will we scrape from on the [Wikipedia Gross world product page](https://en.wikipedia.org/wiki/Gross_world_product)?\n", "\n", "A. Gross world product\n", "\n", "B. Recent growth\n", "\n", "C. Historical and prehistorical estimates\n", "\n", "D. See also\n", "\n", "*Assign your answer to an object called `answer5.1.0`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `\"F\"`).* " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "d6a63003c79ef9b5b816298a855f6907", "grade": false, "grade_id": "cell-a9b7dd149eeb7782", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "394bb7ca6d6ae46ed20b3a3ce65cce33", "grade": true, "grade_id": "cell-631492053b2d5ec4", "locked": true, "points": 0, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_5.1.0()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "fca585857777a2519f3ac3cb705a8efb", "grade": false, "grade_id": "cell-cc8eaf2a5c4886c3", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 5.1.1** Multiple Choice: \n", "
{points: 0}\n", "\n", "What is going to be the x-axis of the scatter plot we create?\n", "\n", "A. compound annual growth rate\n", "\n", "B. the value of the gross world product\n", "\n", "C. year\n", "\n", "*Assign your answer to an object called `answer5.1.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `\"F\"`).* " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "3812069a45cc597fec98a5cb34a53d94", "grade": false, "grade_id": "cell-6a1ec0727ab3608a", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "265340062c687dd3ab8b15e84110f0da", "grade": true, "grade_id": "cell-29f38784d22f2c8b", "locked": true, "points": 0, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_5.1.1()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "544d4470d7c988611a3092229e7134ba", "grade": false, "grade_id": "cell-63c50471d354863a", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "We need to now load the `rvest` package to begin our web scraping!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "dcb3639930d0a746a7895981c5d82469", "grade": false, "grade_id": "cell-66f474de0fdcb58d", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Run this cell \n", "library(rvest)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "a1b165e2c68bc5dfbdde13cee9d7d281", "grade": false, "grade_id": "cell-5815e9c54d669efe", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 5.2**\n", "
{points: 0}\n", "\n", "Use `read_html` to download information from the URL given in the cell below. Instead of copying the entire URL, you can simply use the object (`url`) after `read_html()`.\n", "\n", "*Assign your answer to an object called `gwp`.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "0be3249741897d9b19ed4ba9ee9d496a", "grade": false, "grade_id": "cell-eb85da73cb3eedb1", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "url <- 'https://en.wikipedia.org/wiki/Gross_world_product'\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "print(gwp)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "79e117db5738721bb28f32418e8d6c15", "grade": true, "grade_id": "cell-0c1e5c213d4da058", "locked": true, "points": 0, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_5.2()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "beb46df7915d62d37e509535cbf96e36", "grade": false, "grade_id": "cell-f3a7fe22bb441577", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 5.3**\n", "\n", "Run the cell below to create the first column of your data set (the year from the table under the \"Historical and prehistorical estimates\" header). The node was obtained using `SelectorGadget`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "8c5b4b22c8b021a70b8cdf90b5c68ff6", "grade": false, "grade_id": "cell-4856375846dd1d86", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Run this cell to create the first column for your data set. \n", "year <- html_text(html_nodes(gwp, \".wikitable tbody:nth-child(1) td:nth-child(1)\"))\n", "head(year)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "4a5735be51b5809ce081f2d22037a84d", "grade": false, "grade_id": "cell-497d59ada8bb5194", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "We can see that although we want numbers for the year, the data we scraped includes the characters `CE` and `\\n` (a newline character). We will have to do some string manipulation and then convert the years from characters to numbers. \n", "\n", "First we use the `str_replace_all` function to match the string `\" CE\\n\"` and replace it with nothing `\"\"`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run this cell.\n", "# Use stringr library.\n", "library(stringr)\n", "# Replace \" CE\\n\" with nothing.\n", "year <- str_replace_all(string = year, pattern = \" CE\\n\", replacement = \"\")\n", "print(year)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "7be0e99f1c55621563509665bd0764fa", "grade": false, "grade_id": "cell-f90a8f5c3c41d414", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "When we print year, we can see we were able to remove `\" CE\\n\"`, but we missed that there is also `\" BCE\\n\"` on the earliest years! There are also commas (`\",\"`) in the large BCE years that we will have to remove. We also need to put a `-` sign in front of the BCE numbers so we don't confuse them with the CE numbers after we convert everything to numbers. To do this we will need to use a similar strategy to clean this all up! \n", "\n", "This week we will provide you the code to do this cleaning, next week you will learn to do these kinds of things yourself. After we do all the string/text manipulation then we use the `as.numeric` function to convert the text to numbers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run this cell to clean up the year data and convert it to a number.\n", "# Use grep to select the lines containing \" BC\\n\" and put a - at the beginning of them.\n", "year[grepl(pattern = \" BCE\\n\", x = year)] <- str_replace_all(string = year[grepl(pattern = \" BCE\\n\", x = year)], pattern = \"^\", replacement = \"-\")\n", "\n", "# Replace all commas with nothing.\n", "year <- str_replace_all(string = year, pattern = \",\", replacement = \"\")\n", "# Extract the minus symbol and the numbers.\n", "year <- as.numeric(str_extract(string = year, pattern = \"-?[0-9]+\"))\n", "print(year)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "ed83d42aaaad35e641da953fd5e8dabd", "grade": false, "grade_id": "cell-ee11d70af0ae1f21", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 5.4**\n", "
{points: 0}\n", "\n", "Create a new column for the gross world product (GWP) from the table we are scraping. Don't forget to use `SelectorGadget` to obtain the CSS selector needed to scrape the GWP values from the table we are scraping. Assign your answer to an object called `gwp_value`. \n", "\n", "Fill in the `...` in the cell below. Copy and paste your finished answer into the `fail()`. \n", "\n", "Refer to **Question 5.3** and don't be afraid to ask for help. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "0dda70904680d98163b5608e65e2d8fd", "grade": false, "grade_id": "cell-fc7e31721b1040ff", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#... <- ...(html_nodes(gwp, ...))\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "head(gwp_value)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "372c936377897f9217de2a887ad18dd9", "grade": true, "grade_id": "cell-15126d99f838d9b6", "locked": true, "points": 0, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_5.4()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "7924a07e289d03e2323c7e38c0082600", "grade": false, "grade_id": "cell-b146bdcc907149d7", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Again, looking at the output of `head(gwp_value)` we see we have some cleaning and type conversions to do. We need to remove the commas, the extraneous trailing information in the first 3 columns, and the `\"\\n\"` character again. We provide the code to do this below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run this cell to clean up the year data and convert it to a number.\n", "\n", "# Replace all commas with nothing.\n", "gwp_value <- str_replace_all(string = gwp_value, pattern = \",\", replacement = \"\")\n", "\n", "# Extract the numbers and decimals.\n", "gwp_value <- as.numeric(str_extract(string = gwp_value, pattern = \"[0-9.]+\")) \n", "head(gwp_value)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "7b875f4e42e8f36bbf2ff82e266ac1b1", "grade": false, "grade_id": "cell-e7cd941c74d291b3", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 5.5**\n", "
{points: 0}\n", "\n", "Use the `tidyverse` `tibble` function to create a data frame named `gwp` with `year` and `gwp_value` as columns. The general form for the creating data frames from vectors/lists using the `tibble` function is as follows:\n", "\n", "```tibble(COLUMN1_NAME, COLUMN2_NAME, COLUMN3_NAME, ...)```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "de44a69b1db7bdf66bc4679a3cbd1bfe", "grade": false, "grade_id": "cell-d1371cd4b958c5bc", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "#... <- tibble(..., ...)\n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "gwp" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "2a5f2644d25277811405416e1aa1fce5", "grade": true, "grade_id": "cell-2e58f17d6d715a58", "locked": true, "points": 0, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_5.5()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "d1de75d1889dc1c1c4d758863d5076ce", "grade": false, "grade_id": "cell-4ec224570b0a6b37", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "One last piece of data transformation/wrangling we will do before we get to data visualization is to create another column called `sqrt_year` which scales the year values so that they will be more informative when we plot them (if you look at our year data we have a lot of years in the recent past, and fewer and fewer as we go back in time). Often times you can just transform the scale within `ggplot` (for example see what we do with the `gwp_value` later on), but the year value is tricky for scaling because it contains negative values. So we need to first make everything positive, then take the square root, and then re-transform the values that should be negative to negative again! We provide the code to do this below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gwp <- mutate(gwp, sqrt_year = sqrt(abs(year)))\n", "gwp <- mutate(gwp, sqrt_year = if_else(year < 0, sqrt_year * -1, sqrt_year))\n", "gwp" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "71e19ecfe3b381e589986cc456351052", "grade": false, "grade_id": "cell-ee318d84dfeeaa45", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 5.6**\n", "
{points: 0}\n", "\n", "Create a line plot using the `gwp` data frame where `sqrt_year` is on the x-axis and `gwp_value` is on the y-axis. *We provide the plot code to relabel the x-axis with the human understandable years instead of the tranformed ones we plot.* Name your plot object `gwp_historical`. To make a line plot instead of a scatter plot you should use the `geom_line()` function instead of the `geom_point()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "c5d9e0b6476530aa0df8d1336d2ea47c", "grade": false, "grade_id": "cell-bbe05ddba3e2c8e7", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "#... <- ggplot(gwp, aes(x = ..., y = ...)) +\n", " #geom_line() +\n", " #scale_y_continuous(trans='log10') +\n", " #scale_x_continuous(breaks = c(-1000, -750, -500, -250, -77.7, 0, 38.7), \n", " # labels = c(\"-1000000\", \"-562500\", \"-250000\", \"-62500\", \"-5000\", \"0\", \"1500\")) +\n", " #ylab(\"...\") +\n", " #xlab(\"Year\") +\n", " #theme(text = element_text(size=20))\n", "\n", "\n", "options(repr.plot.width=15, repr.plot.height=7)\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer\n", "gwp_historical" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "1be620321fc45c45f4cc240470ed886b", "grade": true, "grade_id": "cell-bb0696a343e1255c", "locked": true, "points": 0, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "test_5.6()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "bdc6572dc2f768aab9f037782e070f89", "grade": false, "grade_id": "cell-fac12005e571ba5b", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Question 5.7** \n", "
{points: 0}\n", "\n", "Looking at the line plot, when does the Gross World Domestic Product first start to more rapidly increase (i.e., when does the slope of the line first change)? \n", "\n", "A. roughly around year -1,000,000\n", "\n", "B. roughly around year -250,000\n", "\n", "C. roughly around year -5000\n", "\n", "D. roughly around year 1500\n", "\n", "\n", "*Assign your answer to an object called `answer5.7`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `\"F\"`).*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "9f6cab021183ac8c50e5fb303a33b0b0", "grade": false, "grade_id": "cell-dd7f4c508c47aec3", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# Replace the fail() with your answer. \n", "\n", "# your code here\n", "fail() # No Answer - remove if you provide an answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "b4961df5619abd23faea277ef4d0e69d", "grade": true, "grade_id": "cell-8375f0716c6f8541", "locked": true, "points": 0, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "test_5.7()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "4071a662a12539ed237aa1e137192965", "grade": false, "grade_id": "cell-953eb170c47470b8", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "source(\"cleanup.R\")" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.3.0" } }, "nbformat": 4, "nbformat_minor": 4 }