Exploratory data analysis

MACS 30500, University of Chicago

Exploratory data analysis

  1. Generate questions about your data
  2. Search for answers by visualising, transforming, and modeling your data
  3. Use what you learn to refine your questions and/or generate new questions
  4. Rinse and repeat until you publish a paper
  • Variation
  • Covariation

Characteristics of EDA

library(tidyverse)  # loads ggplot2, which provides the diamonds and mpg datasets

ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  geom_smooth()

Characteristics of confirmatory data analysis (CDA)

ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = .01) +
  geom_smooth(se = FALSE) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Exponential relationship between carat size and price",
       subtitle = "Sample of 54,000 diamonds",
       x = "Carat size",
       y = "Price") +
  theme_minimal()

mpg

mpg
## # A tibble: 234 x 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
##  1         audi         a4   1.8  1999     4   auto(l5)     f    18    29
##  2         audi         a4   1.8  1999     4 manual(m5)     f    21    29
##  3         audi         a4   2.0  2008     4 manual(m6)     f    20    31
##  4         audi         a4   2.0  2008     4   auto(av)     f    21    30
##  5         audi         a4   2.8  1999     6   auto(l5)     f    16    26
##  6         audi         a4   2.8  1999     6 manual(m5)     f    18    26
##  7         audi         a4   3.1  2008     6   auto(av)     f    18    27
##  8         audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
##  9         audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>

Histogram

ggplot(mpg, aes(hwy)) +
  geom_histogram()

geom_rug()

ggplot(mpg, aes(hwy)) +
  geom_histogram() +
  geom_rug()

Binning histograms

ggplot(mpg, aes(hwy)) +
  geom_histogram(bins = 50) +
  geom_rug()

ggplot(mpg, aes(hwy)) +
  geom_histogram(bins = 10) +
  geom_rug()
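
The `bins` argument fixes how many bins are drawn. An alternative is to fix the width of each bin with `binwidth`, which is expressed in the units of the x variable. A minimal sketch, assuming a width of 2 miles per gallon is a sensible choice for `hwy` (the value and the name `p_binwidth` are illustrative, not taken from the slides):

```r
library(ggplot2)

# binwidth is in the units of hwy (highway miles per gallon);
# the value 2 is an assumed, illustrative choice
p_binwidth <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2) +
  geom_rug()

p_binwidth
```

Trying several widths, as with several values of `bins`, is usually worthwhile because different widths can reveal different patterns.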

Bar chart

ggplot(mpg, aes(class)) +
  geom_bar()
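
`geom_bar()` counts the rows in each category before drawing the bars. The same counts can be computed directly as a table; a minimal sketch, assuming dplyr is installed (the name `class_counts` is illustrative):

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Number of cars in each class, sorted from most to least common
class_counts <- count(mpg, class, sort = TRUE)

class_counts
```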

Covariation

  1. Two-dimensional graphs
  2. Multiple window plots
  3. Utilizing additional channels

Box plot

ggplot(mpg, aes(class, hwy)) +
  geom_boxplot()
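
By default the classes appear in alphabetical order, which can make it harder to compare them. One option, sketched here as an illustration rather than as part of the original slides, is to order the classes by their median `hwy` using base R's `reorder()` (the name `p_ordered` is illustrative):

```r
library(ggplot2)

# reorder() sorts the levels of class by the median of hwy within each class
p_ordered <- ggplot(mpg, aes(reorder(class, hwy, FUN = median), hwy)) +
  geom_boxplot() +
  labs(x = "class")

p_ordered
```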

Scatterplot

ggplot(mpg, aes(displ, hwy)) +
  geom_point()

Multiple window plots

ggplot(mpg, aes(hwy)) +
  geom_histogram() +
  facet_wrap(~ drv)

Multiple window plots

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~ drv)

Utilizing additional channels

ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()

Utilizing additional channels

ggplot(mpg, aes(displ, hwy, color = class, size = cyl)) +
  geom_point()

Utilizing additional channels

ggplot(mpg, aes(displ, hwy, shape = class)) +
  geom_point()
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
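
The warning above suggests specifying shapes manually. The default palette supplies only 6 shapes, so the rows in the one class left without a shape are dropped from the plot. A minimal sketch of the manual approach using `scale_shape_manual()`; the shape codes 0 to 6 are an assumed choice, and any 7 distinct codes would work:

```r
library(ggplot2)

# One shape code per class; mpg has 7 classes.
# The codes 0 to 6 are an assumed, illustrative choice.
p_shapes <- ggplot(mpg, aes(displ, hwy, shape = class)) +
  geom_point() +
  scale_shape_manual(values = 0:6)

p_shapes
```

With seven shapes supplied, every row is drawn, although telling that many shapes apart remains difficult, as the warning notes.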