Working with relational data

Relational data is multiple tables of data that when combined together answer research questions. Relations define the important element, not just the individual tables. Relations are defined between a pair of tables, or potentially complex structures can be built up with more than 2 tables. In many situations, data is stored in a relational format because to do otherwise would introduce redundancy and use unnecessary storage space.

This data structure requires relational verbs to combine data across tables. Mutating joins add new variables to one data frame from matching observations in another, whereas filtering joins filter observations from one data frame based on whether or not they match an observation in the other table.

`superheroes` and `publishers`

Let’s review how these different types of joining operations work with relational data on comic books. Load the rcfss library. There are two data frames which contain data on comic books.

library(tidyverse)
library(rcfss)

superheroes

## # A tibble: 7 x 4
##       name alignment gender         publisher
##      <chr>     <chr>  <chr>             <chr>
## 1  Magneto       bad   male            Marvel
## 2    Storm      good female            Marvel
## 3 Mystique       bad female            Marvel
## 4   Batman      good   male                DC
## 5    Joker       bad   male                DC
## 6 Catwoman       bad female                DC
## 7  Hellboy      good   male Dark Horse Comics

publishers

## # A tibble: 3 x 2
##   publisher yr_founded
##       <chr>      <int>
## 1        DC       1934
## 2    Marvel       1939
## 3     Image       1992

Would it make sense to store these two data frames in the same tibble? No! This is because each data frame contains substantively different information:

superheroes contains data on superheroes
publishers contains data on publishers

The units of analysis are completely different. Just as it made sense to split Minard’s data into two separate data frames, it also makes sense to store them separately here. That said, depending on the type of analysis you seek to perform, it makes sense to join the data frames together temporarily. How should we join them? Well it depends on how you plan to ask your question. Let’s look at the result of several different join operations.

Mutating joins

Inner join

inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.

(ijsp <- inner_join(superheroes, publishers))

## Joining, by = "publisher"

## # A tibble: 6 x 5
##       name alignment gender publisher yr_founded
##      <chr>     <chr>  <chr>     <chr>      <int>
## 1  Magneto       bad   male    Marvel       1939
## 2    Storm      good female    Marvel       1939
## 3 Mystique       bad female    Marvel       1939
## 4   Batman      good   male        DC       1934
## 5    Joker       bad   male        DC       1934
## 6 Catwoman       bad female        DC       1934

We lose Hellboy in the join because, although he appears in x = superheroes, his publisher Dark Horse Comics does not appear in y = publishers. The join result has all variables from x = superheroes plus yr_founded, from y.

superheroes

name	alignment	gender	publisher
Magneto	bad	male	Marvel
Storm	good	female	Marvel
Mystique	bad	female	Marvel
Batman	good	male	DC
Joker	bad	male	DC
Catwoman	bad	female	DC
Hellboy	good	male	Dark Horse Comics

publishers

publisher	yr_founded
DC	1934
Marvel	1939
Image	1992

inner_join(x = superheroes, y = publishers)

name	alignment	gender	publisher	yr_founded
Magneto	bad	male	Marvel	1939
Storm	good	female	Marvel	1939
Mystique	bad	female	Marvel	1939
Batman	good	male	DC	1934
Joker	bad	male	DC	1934
Catwoman	bad	female	DC	1934

Left join

left_join(x, y): Return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.

(ljsp <- left_join(superheroes, publishers))

## Joining, by = "publisher"

## # A tibble: 7 x 5
##       name alignment gender         publisher yr_founded
##      <chr>     <chr>  <chr>             <chr>      <int>
## 1  Magneto       bad   male            Marvel       1939
## 2    Storm      good female            Marvel       1939
## 3 Mystique       bad female            Marvel       1939
## 4   Batman      good   male                DC       1934
## 5    Joker       bad   male                DC       1934
## 6 Catwoman       bad female                DC       1934
## 7  Hellboy      good   male Dark Horse Comics         NA

We basically get x = superheroes back, but with the addition of variable yr_founded, which is unique to y = publishers. Hellboy, whose publisher does not appear in y = publishers, has an NA for yr_founded.

superheroes

name	alignment	gender	publisher
Magneto	bad	male	Marvel
Storm	good	female	Marvel
Mystique	bad	female	Marvel
Batman	good	male	DC
Joker	bad	male	DC
Catwoman	bad	female	DC
Hellboy	good	male	Dark Horse Comics

publishers

publisher	yr_founded
DC	1934
Marvel	1939
Image	1992

left_join(x = superheroes, y = publishers)

name	alignment	gender	publisher	yr_founded
Magneto	bad	male	Marvel	1939
Storm	good	female	Marvel	1939
Mystique	bad	female	Marvel	1939
Batman	good	male	DC	1934
Joker	bad	male	DC	1934
Catwoman	bad	female	DC	1934
Hellboy	good	male	Dark Horse Comics	NA

Right join

right_join(x, y): Return all rows from y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.

(rjsp <- right_join(superheroes, publishers))

## Joining, by = "publisher"

## # A tibble: 7 x 5
##       name alignment gender publisher yr_founded
##      <chr>     <chr>  <chr>     <chr>      <int>
## 1   Batman      good   male        DC       1934
## 2    Joker       bad   male        DC       1934
## 3 Catwoman       bad female        DC       1934
## 4  Magneto       bad   male    Marvel       1939
## 5    Storm      good female    Marvel       1939
## 6 Mystique       bad female    Marvel       1939
## 7     <NA>      <NA>   <NA>     Image       1992

We basically get y = publishers back, but with the addition of variables name, alignment, and gender, which is unique to x = superheroes. Image, who did not publish any of the characters in superheroes, has an NA for the new variables.

We could also accomplish virtually the same thing using left_join() by reversing the order of the data frames in the function:

left_join(publishers, superheroes)

## Joining, by = "publisher"

## # A tibble: 7 x 5
##   publisher yr_founded     name alignment gender
##       <chr>      <int>    <chr>     <chr>  <chr>
## 1        DC       1934   Batman      good   male
## 2        DC       1934    Joker       bad   male
## 3        DC       1934 Catwoman       bad female
## 4    Marvel       1939  Magneto       bad   male
## 5    Marvel       1939    Storm      good female
## 6    Marvel       1939 Mystique       bad female
## 7     Image       1992     <NA>      <NA>   <NA>

Doing so returns the same basic data frame, with the column orders reversed. right_join() is not used as commonly as left_join(), but works well in a piped operation when you perform several functions on x but then want to join it with y and only keep rows that appear in y.

superheroes

name	alignment	gender	publisher
Magneto	bad	male	Marvel
Storm	good	female	Marvel
Mystique	bad	female	Marvel
Batman	good	male	DC
Joker	bad	male	DC
Catwoman	bad	female	DC
Hellboy	good	male	Dark Horse Comics

publishers

publisher	yr_founded
DC	1934
Marvel	1939
Image	1992

right_join(x = superheroes, y = publishers)

name	alignment	gender	publisher	yr_founded
Batman	good	male	DC	1934
Joker	bad	male	DC	1934
Catwoman	bad	female	DC	1934
Magneto	bad	male	Marvel	1939
Storm	good	female	Marvel	1939
Mystique	bad	female	Marvel	1939
NA	NA	NA	Image	1992

Full join

full_join(x, y): Return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing. This is a mutating join.

(fjsp <- full_join(superheroes, publishers))

## Joining, by = "publisher"

## # A tibble: 8 x 5
##       name alignment gender         publisher yr_founded
##      <chr>     <chr>  <chr>             <chr>      <int>
## 1  Magneto       bad   male            Marvel       1939
## 2    Storm      good female            Marvel       1939
## 3 Mystique       bad female            Marvel       1939
## 4   Batman      good   male                DC       1934
## 5    Joker       bad   male                DC       1934
## 6 Catwoman       bad female                DC       1934
## 7  Hellboy      good   male Dark Horse Comics         NA
## 8     <NA>      <NA>   <NA>             Image       1992

We get all rows of x = superheroes plus a new row from y = publishers, containing the publisher “Image”. We get all variables from x = superheroes AND all variables from y = publishers. Any row that derives solely from one table or the other carries NAs in the variables found only in the other table.

superheroes

name	alignment	gender	publisher
Magneto	bad	male	Marvel
Storm	good	female	Marvel
Mystique	bad	female	Marvel
Batman	good	male	DC
Joker	bad	male	DC
Catwoman	bad	female	DC
Hellboy	good	male	Dark Horse Comics

publishers

publisher	yr_founded
DC	1934
Marvel	1939
Image	1992

full_join(x = superheroes, y = publishers)

name	alignment	gender	publisher	yr_founded
Magneto	bad	male	Marvel	1939
Storm	good	female	Marvel	1939
Mystique	bad	female	Marvel	1939
Batman	good	male	DC	1934
Joker	bad	male	DC	1934
Catwoman	bad	female	DC	1934
Hellboy	good	male	Dark Horse Comics	NA
NA	NA	NA	Image	1992

Filtering joins

Semi join

semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y (potentially duplicating rows in x), whereas a semi join will never duplicate rows of x. This is a filtering join.

(sjsp <- semi_join(superheroes, publishers))

## Joining, by = "publisher"

## # A tibble: 6 x 4
##       name alignment gender publisher
##      <chr>     <chr>  <chr>     <chr>
## 1   Batman      good   male        DC
## 2    Joker       bad   male        DC
## 3 Catwoman       bad female        DC
## 4  Magneto       bad   male    Marvel
## 5    Storm      good female    Marvel
## 6 Mystique       bad female    Marvel

We get a similar result as with inner_join() but the join result contains only the variables originally found in x = superheroes. But note the row order has changed.

superheroes

name	alignment	gender	publisher
Magneto	bad	male	Marvel
Storm	good	female	Marvel
Mystique	bad	female	Marvel
Batman	good	male	DC
Joker	bad	male	DC
Catwoman	bad	female	DC
Hellboy	good	male	Dark Horse Comics

publishers

publisher	yr_founded
DC	1934
Marvel	1939
Image	1992

semi_join(x = superheroes, y = publishers)

name	alignment	gender	publisher
Batman	good	male	DC
Joker	bad	male	DC
Catwoman	bad	female	DC
Magneto	bad	male	Marvel
Storm	good	female	Marvel
Mystique	bad	female	Marvel

Anti join

anti_join(x, y): Return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join.

(ajsp <- anti_join(superheroes, publishers))

## Joining, by = "publisher"

## # A tibble: 1 x 4
##      name alignment gender         publisher
##     <chr>     <chr>  <chr>             <chr>
## 1 Hellboy      good   male Dark Horse Comics

We keep only Hellboy now (and do not get yr_founded).

superheroes

name	alignment	gender	publisher
Magneto	bad	male	Marvel
Storm	good	female	Marvel
Mystique	bad	female	Marvel
Batman	good	male	DC
Joker	bad	male	DC
Catwoman	bad	female	DC
Hellboy	good	male	Dark Horse Comics

publishers

publisher	yr_founded
DC	1934
Marvel	1939
Image	1992

anti_join(x = superheroes, y = publishers)

name	alignment	gender	publisher
Hellboy	good	male	Dark Horse Comics

Acknowledgments

This page is derived in part from “UBC STAT 545A and 547M”, licensed under the CC BY-NC 3.0 Creative Commons License.

Session Info

devtools::session_info()

## Session info -------------------------------------------------------------

##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2018-03-19

## Packages -----------------------------------------------------------------

##  package   * version date       source        
##  backports   1.1.2   2017-12-13 CRAN (R 3.4.3)
##  base      * 3.4.3   2017-12-07 local         
##  compiler    3.4.3   2017-12-07 local         
##  datasets  * 3.4.3   2017-12-07 local         
##  devtools    1.13.5  2018-02-18 CRAN (R 3.4.3)
##  digest      0.6.15  2018-01-28 CRAN (R 3.4.3)
##  evaluate    0.10.1  2017-06-24 CRAN (R 3.4.1)
##  graphics  * 3.4.3   2017-12-07 local         
##  grDevices * 3.4.3   2017-12-07 local         
##  htmltools   0.3.6   2017-04-28 CRAN (R 3.4.0)
##  knitr       1.20    2018-02-20 CRAN (R 3.4.3)
##  magrittr    1.5     2014-11-22 CRAN (R 3.4.0)
##  memoise     1.1.0   2017-04-21 CRAN (R 3.4.0)
##  methods   * 3.4.3   2017-12-07 local         
##  Rcpp        0.12.15 2018-01-20 CRAN (R 3.4.3)
##  rmarkdown   1.9     2018-03-01 CRAN (R 3.4.3)
##  rprojroot   1.3-2   2018-01-03 CRAN (R 3.4.3)
##  stats     * 3.4.3   2017-12-07 local         
##  stringi     1.1.7   2018-03-12 CRAN (R 3.4.3)
##  stringr     1.3.0   2018-02-19 CRAN (R 3.4.3)
##  tools       3.4.3   2017-12-07 local         
##  utils     * 3.4.3   2017-12-07 local         
##  withr       2.1.1   2017-12-19 CRAN (R 3.4.3)
##  yaml        2.1.18  2018-03-08 CRAN (R 3.4.4)

This work is licensed under the CC BY-NC 4.0 Creative Commons License.

Relational data: a quick review

Working with relational data

superheroes and publishers

Mutating joins

Inner join

Left join

Right join

Full join

Filtering joins

Semi join

Anti join

Acknowledgments

Session Info

`superheroes` and `publishers`