--- title: "Homework Unit 2: EDA Cleaning" author: "Your name here" date: "`r lubridate::today()`" format: html: embed-resources: true toc: true toc_depth: 4 editor_options: chunk_output_type: console --- ## Introduction to Application Assignments Welcome to your first homework assignment! Please enter your first and last name at the top of this document where it says "Your name here". Everything you need to complete this assignment can be downloaded from the course web book. This includes: - **Two markdown files (hw_unit_2_cleaning.qmd and hw_unit_2_modeling.qmd):** To complete this homework (and the other homeworks) type all responses directly into these Quarto markdown files. To learn more about Quarto and working in markdown files, see the Quarto chapter of R for Data Science 2E and [help on markdown]9https://quarto.org/docs/authoring/markdown-basics.html) on the Quarto website - **ames_raw_class.csv**: The `.csv` file containing the data set to be used in this assignment. - **ames_data_dictionary.pdf:** The data dictionary describing variables contained in the data set. ## Some Notes on Formating in Markdown In markdown, when you are talking about variables, R functions, etc., you can display that code distinctly (with a grey background) when the script is rendered by surrounding the text with `. We will use this in the web book and our homework scripts Note that other mathematical elements in markdown are by convention surrounded by `$` when specifying them inline. This renders the enclosed element in a distinct typeset when rendered. You'll learn more about this notation later, but it is especially useful when composing mathematical elements (like equations) in your notebooks. For example: 2^2 + 4 = 8$ ----- ## Some Notes on Projects and File Management As we describe in the book, you should always use RStudio projects for file and path management. We recommend making a folder called iaml on your computer and setting the project root to that folder. Inside that folder, create a sub-folder called homework and within it, sub-folders for each unit (e.g., `unit_02`). You can save all files (data, scripts, etc) in that associated unit folders. You will then be able to access data files and save output using only relative paths using the `here::here()` function as shown in the book. ## Packages, Source, Etc In the application assignment shells, some code chunks, like the `conflicts` and `packages` chunks below, will be pre-filled with code for you to run. Other code chunks will be empty with text instructions above them describing code for you to fill in. The shells will be more and more empty as the semseter advances! This code chunk will set up conflict policies to reduce errors associated with function conflicts ```{r conflicts} options(conflicts.policy = "depends.ok") devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_ml.R?raw=true") tidymodels_conflictRules() ``` This code chunk loads all packages you will need for this assignment. Note that we didn't load the `janitor` package here. You will use its `clean_names()` function but you can use that without loading the package by just pre-pending the namespace when you call it (i.e., `janitor::clean_names()`) ```{r packages} #| message: false #| warning: false library(kableExtra, exclude = "group_rows") # you may or may not use this one! library(tidyverse) # for general data wrangling library(tidymodels) # for modeling ``` You may also use the EDA and plotting functions that John shares on his `lab_support` repo. You can source the scripts that contain those functions directly from Github with the code below (note that you may need to install the `devtools` package if you haven't done this previously). ```{r source} #| message: false #| warning: false devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true") devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_plots.R?raw=true") ``` Set up some other environment settings ```{r} options(tibble.width = Inf, tibble.print_max = Inf) theme_set(theme_classic()) # maybe you prefer this to classic theme? ``` ## Read and Setup Dataframe ### Read Data In the chunk below, set the variable `path_data` to the location of your data files. Make sure you have your iaml project open in RStudio. When you call `here::here()` it will set your root path to be inside of the iaml folder. Assuming you have a subfolder called homework and a folder within that folder called `unit_02`, `path_data` will work as set. If you have some other organization, you will need to modify `path_data` to reflect that folder structure. ```{r path} path_data <- "homework/unit_02" ``` This assignment will use the Ames Housing Prices Dataset (also seen in Unit 2 of the course text). Read in the `ames_raw_class.csv` data below. ```{r load-data} ``` ### Select variables We will explore a different set of variables than those demoed in the course text. `select()` from the dataset for the variables below and convert all variable names to "snake_case": `SalePrice`, `Garage Area`, `Neighborhood`, `MS SubClass`, `Total Bsmt SF`, `Bsmt Qual`, `Central Air`, `TotRms`, `AbvGrd`, `Fireplaces`, and `Fireplace Qu`. ```{r select-data} ``` ### Review the data dictionary Familiarize yourself with the variables we will be using by looking each one up in the data dictionary downloaded with your homework files. Reference the codebook frequently as you perform cleaning checks below. ## Exploring the data for cleaning This script should only contain EDA steps necessary for cleaning the full dataset (i.e., not subsets of it that will be allocated for train, vaidation, or test). Be mindful of which aspects of the data set you explore at this stage to prevent information leakage between later training and validation sets (to be saved out as the final step of cleaning in this document). Provide observations at each stage of the process, even if you do not make any changes. Remember that you can use `glimpse()`, `skim_some()`, `print()`, `head()`, and/or kable tables to explore and display your data. Everything that you need to do has been demonstrated in the web book. ### Variable classes Confirm that all variables are read in as the expected class. Remember that we class nominal and ordinal variables as factors and interval and ratio variables as numeric. Below the code chunk, type some observations you have about the observed variable classes compared to descriptions in the data dictionary. Make any appropriate adjustments to variable class below using `mutate()`, `factor()`, or `as.numeric()`. ```{r clean-class} ``` **Variable class notes:** *Enter your notes here* Type your text observation between the two single asterisks (`*`) above. Asterisks are a way to format text in markdown files that is displayed once the completed file is "rendered" to a html format. Surrounding text with **two asterisks per side** creates **bold** text, while surrounding text with *one asterisk per side* creates *italicized* text. Italicizing makes your text responses a little easier for us to find while grading homework, and we are grateful to you for using it. ### Missing data Use an appropriate technique to review missing data at this stage. Clearly document missing data ("missingness") across each variable in the dataset. For variables with high missingness, write code that allows you to visually inspect all observations of missing data (for each such variable). Using information from related variables and the data dictionary to guide you, speculate (in prose) why data may be missing. Clean variables with high missingness using `mutate()`, `replace_na()`, and `fct_relevel()` if you believe any of the NAs are not really missing but instead problems with how the data were coded (e.g., see the example of this in the book.) ```{r clean-missing} ``` **Missing data notes:** *Enter your notes here* ### Numeric variables Explore `min` and `max` values for numeric variables, recording notes on any observations that look suspicious or potentially invalid. Use the data dictionary and associated variables to help you decide whether suspicious observations may represent (in)valid responses. ```{r clean-numeric} ``` **Numeric variable notes:** *Enter your notes here* ### Categorical variables Use `print()` and `levels()` print the levels of each categorical variable. You might consider using `walk()` to do all the categorical variables at once! You can use `tidy_responses()` (a function from John) to convert all responses to snake_case. Check to make sure all levels converted properly. If needed, correct any response levels with conversion errors using `mutate()` and `fct_recode()`. Document your observations for categorical levels. ```{r clean-categorical} ``` **Categorical variable notes:** *Enter your notes here* ## Train test split Now that we have completed our data cleaning, we will split our data into train and test sets and save out the cleaned files. Since John held out a separate test set from the data we were given, your split will actually create our training and validation sets. We will use his holdout set as the test data. ### Generate a train/test split Assign 25% of the data to be our validation set. Stratify this split on the `sale_price` variable. ```{r split-train} ``` ### Save cleaned files Save out your cleaned train and validation sets as csv files and name them `hw_unit_2_train.csv` and `hw_unit_2_val.csv`. ```{r save-files} ``` ## Save and render ### Save Save this `.qmd` file with your last name at the end (e.g. `hw_unit_2_marji.qmd`) ### Render Click the "render" button at the top of your screen to render the file to html. Upload this html file to Canvas to submit the first part of the assignment. Take care not to rest on your laurels: you have a 2nd part to do (but we are proud of your for getting this far!).