############################################################## ### Title: Bootcamp example code ### Author: Magdalena Bennett ### Date Created: 08/23/2023 ### Last edit: [08/23/2023] - Created code ############################################################## #Clear memory rm(list = ls()) #Clear the console cat("\014") #Turn off scientific notation (turn back on with 0) options(scipen = 999) # Load packages library(tidyverse) #includes dplyr and ggplot2! # If there is a package you don't have installed, you can use install.packages("tidyverse") # Only run once! (no need to install packages every time you run your code) # Load data (this is loading data directly from Github) sales = read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/bootcamp/data/US_Regional_Sales_Data.csv") ## Inspecting your data # Exercise 1: Let's explore the data. How many variables and observations do we have? What type of variables do we have? # Exercise 2: Install the package vtable, load it, and run the code sumtable(sales). What do you get? Use the ?sumtable to see the options for this function. ## Data wrangling # Exercise 1: Unit cost and unit price should be numeric. Let's change this! (hint: you can use the function gsub() to replace "," for "", and as.numeric() to transform a variable!). ## Keep the same names for the variables and the dataset. # Exercise 2: What are the different values for the sales channel in this dataset? Use the function table() to see! ## Create a new dataset for in-store and online sales. Call it "sales_min". How many variables do we have? # Exercise 3: Use the original dataset "sales", and create a new variable called "minority", ## which takes the value of 1 if the sales channel is in-store or online, and 0 in another case. # Exercise 4: What is the average price for sales made through a minority channel vs a non-minority channel? ## Plotting data! # Exercise 1: Create a scatter plot between unit cost (x axis) and unit price (y axis) # Exercise 2: Now, let's make that plot pretty. Use theme_minimal() to get rid of the grey background. Color the points with the color "deepskyblue3", ## and change the axis titles to something more informative (e.g. Unit price ($)). This can be done with xlab() and ylab(). # Exercise 3: Using the same code as before, now we want to color observations from the minority sales channel in one color, and the non-minority in another color. ## Write some code that does that (e.g. you will need to change your aesthetics!) # Exercise 4: Finally, using the same code as in exercise 2, include a regression line in this plot using geom_smooth(). ## Regressions # Let's load a new dataset: The Gapminder gapminder = read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/bootcamp/data/gapminder.csv") # Exercise 1: What type of data do we have? # Exercise 2: Transform population into millions (divide pop by 10^6), and then regress life expectancy on gdp per capita and population. What do you obtain? # Exercise 3: Include now continent in the previous regression. Do your results change? How does it look when you include a factor variable in a regression? ## Bringing everything together # Exercise 1: Create a new variable called gdpPercap_log, which is the logarithm of the GDP per capita. Now plot life expectancy against the log(GDP per capita), ## and describe the relationship. # Exercise 2: Using the same plot as before, now color the points by continent and make the size proportional by population (in millions). # Exercise 3: Do the same thing as before (exercise 2), but only for Europe! # Exercise 4: Finally, run a regression that helps you estimate the association between life expectancy and GDP per capita, conditional on population, ## for the year 2007 and then, another regression for the year 1982.