---
title: 'Lab: Getting Started in R and EPA Fuel Efficiency Data'
author: "ENS-215"
date: "Winter 2025"
output:
html_document:
df_print: paged
theme: journal
toc: yes
toc_float: yes
---
## Overview
Today's lab will consist of two parts. First we will make sure that you are properly set up with all of the software and accounts that you need for this class. Next you will work with some fuel economy data from the Environmental Protection Agency (EPA) to learn more about fuel efficiency in automobiles -- which is an important from an environmental/climate perspective given that vehicles are responsible for 51% of CO~2~ emissions for a typical U.S. household[^1].
As a reminder I want you to talk to your neighbors and discuss the material and your thoughts. You'll learn from each other and have much more fun this way too.
## Step 0: Lab guidelines and formatting
I will go over the guidelines for lab report expectations including the content, structure, and formatting of your reports. Remember that you can find additional details on lab [guidelines](https://stahlm.github.io/ENS_215/Labs/Lab_Guidelines.html) and [goals](https://stahlm.github.io/ENS_215/Winter_2025/Syllabus.html#Labs) on our class website.
## Step 1: Analysis of fuel efficiency in automobiles
Automobiles are a significant contributor to both CO~2~ emissions and air pollution. Understanding fuel efficiency in vehicles and the key factors that affect the efficiency are important from both an environmental and economic perspective. Today you will use EPA vehicle fuel economy data to learn more about this issue and explore some interesting questions.
### Background reading
The following resources will provide some helpful background related to today's lab. This resources will also be helpful as you are writing up your lab report.
+ [EPA Fuel Economy Page: _Reduce Climate Change_](https://www.fueleconomy.gov/feg/climate.shtml)
+ [EPA Fuel Economy Page: _Reduce Oil Dependence Costs_](https://www.fueleconomy.gov/feg/oildep.shtml)
+ [EPA Fuel Economy Page: _Increase Energy Sustainability_](https://www.fueleconomy.gov/feg/consres.shtml)
+ [US Energy Information Administration](https://www.eia.gov/energyexplained/use-of-energy/transportation.php)
+ [Union of Concerned Scientists: _Half the Oil_](https://www.ucsusa.org/clean-vehicles#.XC0AWlxKiUl)
+ [I also recommend checking out this interactive feature from the UCS](https://www.ucsusa.org/clean-vehicles/oil-is-changing)
### Fuel Economy Analysis
#### Set up for analysis
We are going to load in the `tidyverse` package. If you haven't installed it yet, you will first need to do that (it only needs to be installed once and once installed can be loaded in anytime you want). To install `tidyverse` go to the `Packages` tab and click the `install` button. In the window that pops up, type `tidyverse` and click `Install`.
```{r message = FALSE, warning = F}
library(tidyverse)
```
The data is conveniently part of one of the packages included with tidyverse, so we can easily load in the data, which is called `mpg`.
We'll assign the `mpg` data to a new object that we'll call `fuel_data`. However, before you run the code block below, you should first learn a bit more about the `mpg` dataset.
To do that type `?mpg` in your **Console** (we'll use the console, because we only want to get this info once, and not every single time we run our Notebook).
```{r}
fuel_data <- mpg # assigning mpg to a new object
```
#### Use the `View()` function to see the dataset in a spreadsheet-style viewer.
This can be really helpful when you are trying to get familiar with a dataset that you've loaded in.
Let's type `View(fuel_data)` in the **Console** (not in our Notebook) since we just want to do this once (and not every single time we run our Notebook).
#### Use the `str()` function to examine the structure of the `fuel_data` dataset
```{r}
# Your code here
```
+ How many variables are there?
+ How many observations?
+ What does `chr`, `int`, and `num` mean in the output above?
#### Look at your **Base R Cheatsheet** to find a function that allows you to examine the first six rows of the `fuel_data` dataset. Apply this function to your `fuel_data` object and it will print these results to your notebook.
```{r}
# Your code here
```
+ What do the `disp`, `cyl` and `cty` columns represent? (Hint: type `?mpg` into your console to get more info on this dataset)
+ What website was the data obtained from?
#### How many rows and columns are there in the `fuel_data` dataset? There is a function that will give you the dimensions of an object (_i.e._ number of rows and columns). Look at your **Base R Cheatsheet** to find this function.
```{r}
# Your code here
```
+ Number of rows?
+ Number of columns?
+ Use the approach that you employed to get the dimensions and assign the value for the number of rows to a variable `n_rows` and number of columns to `n_cols` (Hint: you can find functions for just the number of rows and just number of columns respectively on your **Base R Cheatsheet**)
#### Use the `summary()` function to get summary statistics on the `fuel_data` dataset
```{r}
# Your code here
```
+ For highway driving how many miles per gallon does
+ The most fuel efficient vehicle get?
+ The least fuel efficient vehicle get?
+ What is the average miles per gallon for city driving?
#### Create a new variable to store the ratio of the `hwy` to `cty` fuel economy
To create a new variable you can use the `mutate()` function (which is included in the `dplyr` package that loads in with `tidyverse`). You can get more info on `mutate()` by typing `?mutate()` to your console or by Googling it. FYI, we are going to learn a lot more about `dplyr` later in the term`
You can also create a new variable by using the `$` notation to assign it. The `$` allows us to access (or create) a new variable in a data frame. For instance `fuel_data$new_variable_name <- expression that defines the variable`
I've shown you how to do it both ways in the code below. You can pick one (they do the same thing) and uncomment it so that it will run.
```{r}
## Use mutate() to create the new variable (hwy2cty) and add it to the fuel_data object
# fuel_data <- mutate(fuel_data, hwy2cty = hwy/cty)
## Use the $ notation to create the new variable (hwy2cty) and add it to the fuel_data object
# fuel_data$hwy2cty <- fuel_data$hwy/fuel_data$cty
```
+ Examine this new variable
+ Is there any vehicle that gets better gas milage in the city as compared to the highway?
+ Is there any vehicle that has twice as good gas milage on the highway relative to the city?
+ What is the largest ratio between highway and city gas milage?
+ Look at the code above that we used to create the new variable `hwy2cty` and spend a few minutes discussing with your neighbor(s) what the code is doing.
#### Examine factors influencing fuel economy: Displacement
Let's see how the size of a vehicle's engine (displacement) influences the fuel economy
Use `ggplot` to create a scatter plot of `hwy` vs `displ`. Remove the `#` and replace the `...` with the appropriate values.
```{r}
# ggplot(data = ...) + geom_point(aes(x = ... , y = ...))
```
+ How does displacement influence the fuel economy?
+ Do any points appear to be outliers from the observed relationship? Any thoughts on what might explain these outliers?
#### Examine factors influencing fuel economy: Displacement and vehicle class
Let's add another variable to our analysis to try and further explain what influences fuel economy. We'll now color the points by their vehicle class.
```{r}
# ggplot(data = ...) + geom_point(aes(x = ... , y = ... , color = ... ))
```
+ What new information did this provide us with?
+ Can we say more about factors affecting fuel economy?
+ Can we explain some of the outliers?
#### Add a new variable to `fuel_data`
Below I've added a new variable `region` to `fuel_data` using the `mutate()` function. This step is relatively advanced, so don't worry if it looks well beyond what we've covered so far. However, I want you to take a look at the code and try to decipher what I did here. I tried to use variable names that are logical and if you see a function that you don't understand, try looking at the help file and/or Googling it.
```{r}
us_makes <- c("chevrolet","dodge","ford","jeep",
"lincoln","mercury","pontiac") # list of U.S. manufacturers
fuel_data <- mutate(fuel_data, region = if_else(manufacturer %in% us_makes,"US","Foreign") )
```
+ What is the variable `region` telling us?
+ Can you briefly explain the logic and steps taken to create the variable `region`?
You may find `region` to be another interesting variable to examine in the next section.
### Additional analysis of your choosing
With any remaining time you should perform further exploratory analysis of the `fuel_data` dataset. You should discuss ideas with your classmates and you can also ask me for guidance (but try to brainstorm some ideas before asking). I will be moving about the classroom and checking in with everyone, providing guidance and suggestions, and hearing what you've learned about the dataset being analyzed.
**Note that this section of the lab is important** and should not be given only a cursory work through. An important learning goal for this term is for you to develop your own independent research skill. Thus, all of our labs this term will have a large, independent component where you are expected to apply what you've learned to ask novel and interesting questions and furthermore to try and go beyond what you've learned in class.
When trying to get started with something new, remember the _copy/paste/tweak_ approach. This approach can help give you a good starting point for your work.
## Step 2 (optional): Work with an expanded EPA fuel efficiency dataset
If you are interested in further exploring EPA fuel efficiency data you can [download the full EPA dataset here](https://www.fueleconomy.gov/feg/epadata/vehicles.csv). The description of each of the variables in the dataset is [available here](https://www.fueleconomy.gov/feg/ws/index.shtml#vehicle).
This dataset has information on thousands of vehicles and their fuel/energy efficiency (along with dozens of other related variables) for models from 1984-2022. Notably this dataset contains information on many electric and hybrid vehicles.
Unlike the `mpg` dataset you used in **Part 1**, the full EPA dataset is quite a bit "messier" (e.g. variables have missing data in some rows) and quite a bit more complex (upwards of 80 variables are reported for a given vehicle). However, the larger number of observations (i.e., vehicles) and variables (i.e., vehicle attributes) allows for a much richer dataset.
If you decide to work on this analysis let me know and I can help guide you on how to load in and get started with the data.
```{r echo = F, eval = F}
df_epa_expanded <- read.csv("https://www.fueleconomy.gov/feg/epadata/vehicles.csv")
```
## Step 3: Submit the lab
Your lab is due prior to the start of next week's lab. Once you are finished and satisfied with your work you should `knit` it to an `html` file and submit both your `html` and `Rmd` file to Nexus.
To `knit` a file you can go to the menu bar at the top of your notebook and click the dropdown that currently says `preview` and select the `knit to html` option. This will `knit` your document, which runs all of your code and generates a nice report in html format (the file is saved in your current working directory).
An even easier way to `knit` your file is to go to the header at the top of your document and change `html_notebook` to `html_document` and then save your file. You will then see that the `Preview` option in the menu bar will have changed to `Knit`. Click `Knit` and your report will be `knit`.
**Before you leave lab today make sure you know how to knit your document**
**Make sure your file is properly named BEFORE you submit it** The correct naming structure is `LabName_YourLastName`
**Make sure to replace the `author` and `date` in the header with your info.**
[^1]: source: